Towards Learning Generalization
by
Iordanis Fostiropoulos
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2023
Copyright 2023 Iordanis Fostiropoulos
Acknowledgements
I would like to thank my early advisor Prof. Barry Boehm who, although no longer with
us, instilled in me an appreciation for Systems and Software Engineering. I would also like
to thank my later advisor Prof. Laurent Itti with whom we tackled difficult problems that
apply biologically inspired methods to Artificial Neural Networks. I would like to thank
my committee members Prof. Mohammad Soleymani, Prof. Stefanos Nikolaidis and Prof.
Nicolas Schweighofer for their valuable feedback and guidance throughout the process of my
dissertation.
I would like to thank all of my collaborators with whom we have published a paper
together.
I thank the numerous M.Sc. students at USC who have contributed in any capacity to
our work. I would specifically like to thank the research group DeepUSC for humbling me
and giving me a perspective on what it means to be an effective mentor, teacher, and leader.
I would like to thank all the alumni, staff and members at the Center for Systems and
Software Engineering (CSSE) and iLab at University of Southern California for their support
throughout my academic career.
Table of Contents
Acknowledgements
Abstract
Chapter 1: Introduction
    1.1 Representational Generalization
    1.2 Learning Generalization
    1.3 Algorithmic Generalization
    1.4 Future Directions
Chapter 2: Background: Representation Learning
    2.1 Quantization
    2.2 Vector Quantized Auto-Encoder
    2.3 Transformers
Chapter 3: Representational Generalization
    3.1 Implicit Feature Decoupling
        3.1.1 Depthwise Quantization
        3.1.2 Channel Capacity
        3.1.3 Implicit Decoupling
    3.2 Depthwise Quantization Experiments
        3.2.1 Implicit Decoupling
        3.2.2 Likelihood Estimation
    3.3 Two-Stage Conditional Transformer
        3.3.1 First-Stage
        3.3.2 Hierarchical Conditional Quantization
        3.3.3 Second-Stage
    3.4 Two-Stage Fusion Experiments
        3.4.1 First-stage Ablation
        3.4.2 Second-stage Ablation
    3.5 End-to-End MultiModal Transformer
    3.6 MultiModal Fusion Experiments
    3.7 Related Work
        3.7.1 First-Stage
        3.7.2 Second-Stage
    3.8 Multimodal learning
    3.9 Conclusions
Chapter 4: Background: Model Comparison
    4.1 Meta-Model
    4.2 Model Evaluation
Chapter 5: Model Generalization
    5.1 Meta-Model Comparison
        5.1.1 Risk Assessment
        5.1.2 Assessing Generalization
    5.2 Meta-Model Experiments
        5.2.1 Ablation Study
        5.2.2 Reproducibility
        5.2.3 Generalization
    5.3 Stateful Experiment Design
        5.3.1 Initial State
        5.3.2 Intermediate States
        5.3.3 Final State
        5.3.4 ABLATOR
    5.4 ABLATOR Experiments
        5.4.1 Experiment Definition
        5.4.2 Ablation Study
    5.5 Related Work
        5.5.1 Model Selection
        5.5.2 Surrogate Modeling
        5.5.3 Hypothesis Analysis and Reporting
        5.5.4 AutoML
Chapter 6: Background: Continual Learning
    6.1 Continual Learning
    6.2 Catastrophic Forgetting
Chapter 7: Algorithmic Generalization
    7.1 Model Consolidation
        7.1.1 Buffer-Memory
        7.1.2 Stability Loss
        7.1.3 Consolidation Phase
    7.2 Model Consolidation Experiments
        7.2.1 Results
    7.3 Surprise
        7.3.1 Surprise Sampling
        7.3.2 Meta-Surprise
        7.3.3 Adaptive Meta-Surprise
        7.3.4 Stream
    7.4 Surprise Detection Experiments
        7.4.1 Surprise Sampling
        7.4.2 Meta-Surprise
        7.4.3 Stream Benchmark
    7.5 Related Works
        7.5.1 Batch Model Consolidation
        7.5.2 aMeta
Chapter 8: Discussion and Conclusions
References
List of Tables
3.1 Baselines: ¹Sparse Transformer [63], ²Image Transformer [222], ³VD-VAE [62]. "Ours" is a 2-hierarchy DQ-AE with K set to 256 and 128 for the top and bottom codebooks, respectively.

3.2 Comparison of the Pareto front on the informativeness of the latent space via probing between VQ-VAE-2 [247], Jukebox [75] and DQ-AE (Ours). DQ-AE outperforms all other methods on reconstruction error and information content of the latent codes. The difference is amplified for lower rates (K). Additionally, I(z_top; y) for DQ-AE-2 is significantly higher. This leads us to conclude that DQ-AE is Pareto optimal compared to other methods.

3.3 Quantitative evaluation of second-stage ablation results, where GPT_CCA (our method) has improved FID and IS metrics, used to evaluate the quality of generated images. We use * to denote that we adapt each method to the novel experimental setting. For example, we use Jukebox and VQ-VAE with a CCA, whereas GPT_CCA utilizes DQ-AE from Section 3.3.1. Similarly, Latent-ImageGPT is a GPT without CCA applied on the latent space. Our ablation study demonstrates that GPT_CCA is the only method that performs across all metrics. On the contrary, other methods have close to no discriminative power in their generations, i.e. an ROC of 0.50 (as good as random).

3.4 Ablation study on SP-Transformer with the Aligned (A) and Un-Aligned (UA) CMU-MOSEI dataset, using different model structures ("Serial", "Concat.") and parameter-sharing strategies ("Cross NS", "Layer NS", "Modal S", "All Share"). Unimodal ("U") are models trained with only text features. [1] Multimodal Performer ("MulP") [66]; [2] Multimodal Transformer ("MulT") [284].

3.5 Comparison between our model, MulT and Performer ("MulP") in training time (seconds per epoch) for the maximum batch size and for different compression ratios S with r fixed to 8.

5.1 Assessing Generalization by evaluating the choice of learning-rate of SGD for Implementations (Imp.), Dataset (Vision, Text), on different Model, Random Initialization (Init.). A Risk value smaller than 1 would be significant evidence that the method is invariant to the type of noise introduced, but the counter-argument can not be determined by a single test.

5.2 Evaluation of R_B (Section 5.1.1) when manipulating the comparison bounds (top row) and for different thresholds β. Setting a large β = 0.5 leads to inconclusive evidence, with a single B, while a small β = 0.01 leads to misleading evidence by outliers. Manipulating B results in a large risk R_B, while the careful selection of β can lead to an unbiased evaluation of R_B. Finally, too large a β leads to inconclusive results of a single method performing.

5.3 We compare the best trial for each dataset between the benchmark results of FT-Transformer (FT-T) and Tablator (Section 5.4.1). The FT-T used to obtain the benchmark results is in the subspace of configurations of Tablator; however, the discovered best configuration differs between the two. Thus, for some datasets Tablator outperforms, while for others it underperforms, a result that is a consequence of stochastic dynamics during training.

7.1 Caption for Results Table

7.2 Comparison of the mean performance of Stochastic Sampling compared to Surprise Sampling (SS) augmented methods [16, 10, 309, 338] and finally compared to αMetaSup + SS augmented methods [250, 45]. The first three rows are native implementations of the methods. SS-augmented methods are used to determine surprise and apply ER. We report the mean AUC score across all tasks at the end of training. SS-augmented methods perform better, while αMetaSup + SS methods further improve the performance of the baselines.

7.3 MetaSup and αMetaSup performance difference across datasets from SM Loss τ_L, Fisher Information τ_FI, Energy τ_E, Feature τ_F, Uncertainty τ_U and Gradient τ_G, with Z-norm S_z, Contrastive score S_c and meta-classifier S_α. T is the best threshold with the variance among different random Stream sequences. ·̄ denotes the mean and ||·||_2 the L2 norm. S_z is reduced by mean¹ or absolute-max². We find that the reduction method influences the performance of the metric; an artifact of the linear separability of the high-dimensional metric to task-switch. αMetaSup outperforms a linear threshold. Additionally, the threshold is highly variable between different runs.

7.4 Generalization of the task-switch detection score for MetaSup and αMetaSup when trained on the auxiliary S-Numbers_V (ResMLP) compared to when trained on S-Numbers (ResNet-18). αMetaSup outperforms MetaSup, while f_attn performs consistently between backbones and Surprise-Metrics (Table 7.3) while only second to f_mlp on generalization score.
List of Figures
1.1 Illustrative example of generalization of a model M, inspired by [306]. Three models are evaluated on their performance on Natural Images, and we group them into four datasets from the literature: ImageNet [73], CIFAR [168], SVHN [210], MNIST [175]. M1 is more complex and can perform for a larger number of datasets (larger support on the x-axis) but performs worse for some datasets compared to simpler models (lower p(D|M) for MNIST compared to M3). On the contrary, M3 is less complex and performs well for simple domains, such as the MNIST dataset (BW images of numbers), but does not generalize as there is no density on SVHN (RGB images of numbers). A larger mode-coverage can imply generalization, but is not indicative of the performance on every task. An ideal model is a balance of the two, e.g. M2. There is no single model that generalizes well among all datasets.

3.1 The original images (left) are reconstructed by DQ (middle) and VQ (right) with identical models and training setup. The perceptual quality of DQ outperforms VQ.

3.2 DQ (left) applies C1 on the first slice of the sub-tensor and for all sub-vectors and concatenates the quantized vectors. VQ [215] (middle) and PQ (right) quantize the same vector with different codebooks and combine the two sub-vectors by addition or concatenation.

3.3 Effects of implicit marginalization on the learned quantized features. The mean entropy ("informativeness") of the quantization vectors for each pixel for DQ (first) is higher when compared to VQ (second); higher is better. The mean MI ("redundancy") scores between each quantization codebook for DQ (third) are lower when compared to VQ (fourth); lower is better. The diagonal represents the entropy for each quantization vector; the lower half of the diagonal is empty.

3.4 Top: The training procedure of a second-stage model starts by first quantizing a training sample to progressively fewer codes. The two sequences from the top and bottom hierarchies are of shorter and longer length, respectively. Our method conditions on the preceding hierarchy, where the next-token prediction objective is informed and controlled by less granular top-level codes. Bottom: During inference, we provide a conditioning vector that controls generation. When multiple hierarchies are present, the same procedure can be repeated iteratively in a process called ancestral sampling.

3.5 The top row displays the IB curves and reports the negative log error, denoted as I(z; y), for a given rate 'R'. Lower values on the curve are better. The bottom row displays the RD-curves for the same R as above. VDQ has improved reconstructions and more informative latent codes for any given rate.

3.6 Ablation of the first-stage model of our method DQ-AE when compared to VQ-VAE-2 [246]. The results demonstrate the effectiveness of our proposed loss L_cq, Eq. (3.6), which is the main difference between the methods.

3.7 Demonstration of code utilization for different hierarchies on a DQ-AE model. We ablate the informativeness of the hierarchical codes with respect to the input image. The results agree with our observation in Fig. 3.6a that less granular codes are not informative for reconstructing the original image.

3.8 Conditional generative sampling on the digit label (0-9) from left to right for a combination of first-stage and second-stage model variants. First row: Image-GPT (iGPT) [57] is trained in the original input space, where training and inference are performed on a sequence of length 784, which is slow and computationally expensive. For the second to last row, we use GPT_CCA to train on the latent representations of the first-stage model to reduce the input space of ImageGPT to 256, a 67% reduction, where the CCA block from Section 3.3.3 is used in the latent representation. Second row: DQ-AE (Ours) representations resemble the original ones, but there still exist artificial artifacts, i.e. digit 6. Our method generates images that are difficult to distinguish from the ground-truth images or iGPT. Third row: Jukebox [75] images resemble the original ones, but they ignore the conditioning label where, for example, the first image on the row was conditioned to be 0. Fourth row: VQ-VAE-2 [247] is trained without the intermediate loss L_CQ and the generated digits do not resemble the conditioning class while collapsing to digit 0. Our method reduces the computational requirement of generating in the original input space while outperforming previous work in diversity and realism.

3.9 A trimodal SP-Transformer for audio A, video V, and text T, with N layers to update hidden states h_m. SP-Block is indicated with grey rectangles.

3.10 Multimodal Transformer ("MulT") from [284], Performer ("MulP"), and SPT ("Ours") with variable compression factor S. From left to right: (1) Test on CPU inference time. (2) Test on memory use.

3.11 Sample efficiency test on the unaligned CMU-MOSEI dataset in comparison with MulT. We gradually increase the size of the training set and use the same training set for both models for consistency.
5.1 The hypothesis surface S = {S_H0, S_H1, S_H2} contour plot for three models {H0, H1, H2} from left to right; Section 4.2. Models are synthetic Gaussians and the x-axis and y-axis are 'artificial' hyper-parameters. Decision boundaries are denoted by the black line, while the model most likely under the boundary, M̂_B, is annotated on the boundary area.

5.2 Evaluation of Bayesian Probability (Ours), ANOVA and Prob. Out-Performing [38] in predicting the out-performing model on a hierarchical mixture model. Our method outperforms when evaluated for ROC-AUC in determining the best performing model (left), has a lower detection error score (middle), and provides improved calibrated probabilities (right).

5.4 Left is the iterative process of conducting ML experiments with ABLATOR, where the only user inputs are the method implementation ('Method.py') and the ablation configuration file ('Config.yaml'). ABLATOR automatically generates analysis artifacts sufficient to update the hypothesis and allow rapid prototyping. ABLATOR handles the horizontal scaling of experiments and is fault tolerant. Experiment Persistence allows the experiment to be reproduced independently of the original environment/cluster. Right is the process without ABLATOR, where the user must manually select configurations, manage execution, consolidation, and analysis of artifacts, which is error-prone and cumbersome.

5.5 ABLATOR provides a configuration system specific to ML experiments, where it has to encompass multiple trials under one definition. On the left is an illustration of the configuration for distributed execution (distributed.yaml) and method prototyping (base config.yaml). On the right, the configuration is type-checked by the ABLATOR library. In Section 5.4.1 we evaluate the effect of sampling bias from the hyper-parameter selection strategy TPE [32]. The configuration is compact and unambiguous at initialization, supporting a stateful experiment design paradigm.

5.6 ABLATOR illustration of the implementation used for our experiments. On the left is the code specific to the ablation experiment, where we use a ProtoTrainer with built-in training mechanisms, while the ParallelTrainer is used to scale the model to a large cluster of nodes. On the right is the PyTorch model from the official implementation of FT-T [105] that uses the configuration from Fig. 5.5. It required minimal changes to the Transformer to evaluate our hypothesis.

5.7 Evaluation of the effect of a larger model for a regression dataset, where (RMSE) ↓ is normalized for the relative difficulty of each dataset. The larger model performs better but with higher variance.

5.8 Automatically generated analysis artifacts from ABLATOR. On the left is the normalized accuracy for all datasets, in the middle for 'CO' [36], and on the right for AL [102]. It can be observed that the performance metric is heteroskedastic, where ANOVA tests can be inapplicable. Additionally, what works best for each dataset differs significantly between datasets and for all datasets.

7.1 A single incremental step of BMC. On the right figure, the updating of a base model with Multi-Expert Training: after receiving the data of the new tasks D_i, ..., D_{i+k}, a batch of experts θ_i, ..., θ_{i+k} are trained separately on their corresponding tasks with a stability loss applied from the base model. The newly trained experts then sample a subset of their training data and combine it with the memory to perform batched distillation on the base model. On the left figure, the regularization helps the batched distillation to update the model closer to the regularization boundary and towards the jointly low-error zone of the old tasks and two new tasks.

7.2 The loss contours of sequential training compared with batch task training [113] (shaded areas as low-error zones for each task). Intuition: similar to mini-batch training, batched task training can reduce the local minima and improve the convexity of the loss landscape.

7.3 BMC outperforms all other methods on the average accuracy for the Stream dataset and under a CIL evaluation setting.

7.4 Left: MetaSup finds a fixed threshold by maximizing the F1 score of task-switch detection. Middle: αMetaSup trains f_attn on SM τ_1, ..., τ_n and task-switch labels {0, 1} collected from auxiliary streams as time-series classification. Right: model f_θ learning incrementally with Surprise Sampling. A new data batch is first stored in the short-term buffer B. The pre-trained f_attn infers a task-switch from the attention window. On surprise, B is sub-sampled and stored into M. A fixed threshold failed to generalize on Stream Benchmark task-gaps.

7.5 Ablation study on the efficiency of the buffer utilization for 100 trials. Left: Surprise Sampling augmented baselines when compared to Reservoir [255] and trained on S-Numbers_V. Surprise Sampling performance scales with the Memory Size at an improved rate and as such is more efficient. Right: a larger Sample Size |B| degrades performance as a larger part of |M| is replaced with recent data; we discuss this further in Section 7.4.1.

7.6 Mean AUC up to the number of tasks learned thus far. The variance in Mean AUC is reflected by the changing difficulty of tasks over time, such as an easy task followed by more difficult ones. All methods are trained on an identical sequence of S-Modal tasks and for 252 random initializations. αMetaSup methods are the only ones that do not have their performance degraded close to Random.
Abstract
Current practices in Machine Learning (ML) require a model to be iteratively trained on novel examples, different modalities, and tasks. The same model generalizes poorly on previously learned data, where we empirically observe 'Catastrophic Forgetting'. Traditionally, Generalization refers to the performance of an ML model on an out-of-distribution dataset. In this work, we use a broad definition of Generalization to study the performance of a learner in an 'out-of-distribution' learning setting. First, we present our work that analyzes the generalization of the learned representations of an ML model to downstream tasks, Transfer Learning. Next, we present our work that examines the generalization of the model architecture to different learning configurations, Model Comparison. Last, we present our work that analyzes the generalization of the learning algorithm to mitigate forgetting, Continual Learning. Our work explores these three avenues of generalization of an ML model to find open issues in learning and evaluating a model. First, when model comparison is performed between learning settings (such as between Optimizers), the performance of the model can exhibit heteroscedasticity that leads to improper analysis. Next, Continual Learning algorithms fail to perform in more complex settings where simpler methods perform robustly. Last, we find that the representations of a learned model do not generalize across modalities and domain gaps. We present our contributions and analysis on the issue of Generalization. Based on our empirical observations, we discuss several future directions where improvements in algorithms, tools, and methods are required to improve generalization.
Chapter 1
Introduction
Generalization in machine learning (ML) commonly refers to the ability of an ML model to perform well beyond the dataset that was used to train the model. Outside of ML, generalization refers to the ability to learn from a specific case and perform inference in the 'general' case. In this work, we use the broad definition of generalization to study the generalization of a learning algorithm, which can include the evaluation of a trained model on a test-set (the narrow definition). As such, we also consider a broad definition of 'model' in our work to refer to a function with several free variables that can include the choice of learning algorithm with its configurations (training hyper-parameters) as well as the learning outcome (the trained model). For example, we could study the generalization of a learning procedure called 'Experience Replay' [250] with a 'Memory Size' learning hyper-parameter of 100 examples on a learner that is a Transformer model [295]. We could thus evaluate whether our learning setting can generalize to different datasets, for example, whether the above configuration can work for both the Wikipedia corpus as well as the CommonCrawl corpus of all websites. This is in contrast to the traditional and narrower definition of generalization in ML that would only consider in its evaluation the final outcome of the training, which is the learned model. Under the narrow definition of generalization, we would evaluate how well the trained Transformer model performs on an unseen dataset without retraining.

Model comparison evaluates the fitness of a model, a hypothesis, on some data. A model is a parameterized distribution, where the objective of fitting a model can be posed as a density estimation problem, as shown in Fig. 1.1.
Figure 1.1: Illustrative example of generalization of a model M, inspired by [306]. Three models are evaluated on their performance on Natural Images, and we group them into four datasets from the literature: ImageNet [73], CIFAR [168], SVHN [210], MNIST [175]. M1 is more complex and can perform for a larger number of datasets (larger support on the x-axis) but performs worse for some datasets compared to simpler models (lower p(D|M) for MNIST compared to M3). On the contrary, M3 is less complex and performs well for simple domains, such as the MNIST dataset (BW images of numbers), but does not generalize as there is no density on SVHN (RGB images of numbers). A larger mode-coverage can imply generalization, but is not indicative of the performance on every task. An ideal model is a balance of the two, e.g. M2. There is no single model that generalizes well among all datasets.
Under simplified conditions, the density estimation is performed in a uni-modal setting, where a single mode is sufficient to represent the task distribution and the loss landscape of all tasks is convex. In practice this is not true [306], but the problem relaxation suffices to advance ML research. The problem of overfitting and underfitting in this setting can intuitively be understood as a trade-off between mode-coverage ('x-axis') and specialization ('y-axis'). For example, a model with a small mode and a large likelihood for a small set of images would be an overfit model. An ideal model in the simplified example would be a balance between the two.

A model is composed of multiple parameters, the free variables θ*, for which there exists an ideal assignment. For example, in the simplest case where our model is a Gaussian distribution, the free variables would be the mean and variance. The mathematical definition of a 'model' can be extended to encompass a neural network, where the free variables are the parameters of the network. Extending the definition further, a model can also include free variables specific to the optimization method, such as the architecture of the neural network or the choice of an Optimizer or Learning Rate; a meta-model ϕ, as we call it in this work. Finally, the goal of a learning algorithm is to find the value assignment for ϕ that maximizes p(ϕ|D). To this end, we identify the three scenarios under which we study generalization: Representational Generalization, Learning Generalization, and Algorithmic Generalization, each of which studies generalization for a different setting of θ*.
We first present the background relevant to the contributions of our work in Chapters 2, 4 and 6, and then present our contributions in Chapters 3, 5 and 7. Our contributions can be summarized as our work that studies and improves Representational Generalization (Chapter 3), where we evaluate the usefulness of the learned representations of a DNN; our work that evaluates and improves Learning Generalization (Chapter 5), where we evaluate the generalization of a meta-model that encompasses both the parameters of a DNN, θ, and meta-learning parameters such as the Optimizer, Learning Rate, and more, which we refer to as ϕ; and last, our work on Algorithmic Generalization (Chapter 7), where we identify several shortcomings inherent to Gradient-Based Optimization and propose improvements to replay-based Continual Learning methods for training a model with improved generalization on multiple tasks. The contributions included in this dissertation can be summarized by our 7 peer-reviewed publications [60, 89, 95, 101, 92, 91, 94], 3 pre-prints [93, 89, 100], 2 benchmarks [87, 88], a dataset contribution [87] and an open-source tool [94]. We provide an overview of the Generalization scenarios we study in the remainder of this section and the problems we discuss within each scenario.
1.1 Representational Generalization
Representations of a DNN are defined as the intermediate activations that are often a dimensionality reduction of the original input space, where the objective of Representation Learning is to be nuisance-invariant [29]. For example, we would like to learn a model that produces similar representations for pictures of the same animal species, invariant to the setting the photo was taken in.

There are shortcomings in learning representations that are invariant to noise; that is, representations that generalize. The problem is of specific importance to the ML community, where model re-purposing is challenging and a new model has to be trained specific to the task. When a model is trained end to end, there exist methods for fusing representations at the input level (early-fusion) or at the final learned features (late-fusion) [332], but there are challenges in learning a single model for representations of multiple modalities [333]. We study the effect of multimodal learning and end-to-end fusion at the intermediate level as opposed to early or late fusion of multiple modalities.

Approaches to repurposing models involve using a pre-trained model to extract the representations of a dataset to be used in subsequent tasks. For example, CLIP [238] combines the representations of two pre-trained models, a ViT [78] and a GPT [239], trained on the vision and text modalities, respectively. Similar approaches fuse representations in a two-stage training process, where at the second stage the extracted representations, which are of reduced size, are used to efficiently train a computationally intensive generative model. With our work, we evaluate the shortcomings of training a Vector Quantization (VQ) first-stage model with a Transformer model at the second stage, while with our work in [60] we evaluate fusing the representations of several modalities in an end-to-end manner.
We find that when evaluating a first-stage model's representations using metrics such as the validation loss, the ability of the representations to generalize to subsequent tasks can be misjudged. The reason is that the optimized metric, the loss, is at odds with auxiliary objectives such as the generalization of the representations. For example, our work [93] and previous work [329] identify a trade-off between the adversarial robustness of the representations (their generalization to adversarial perturbations) and the accuracy on a downstream task. In our subsequent work (Chapter 3) we evaluated Auto-Encoder [155] architectures such as VQ-VAE [215] to learn generalizable representations. Contrary to the loss, we evaluate the informativeness of the representations using probing [58]. We use a probe that is trained on a simple classification task and is auxiliary to the main objective, which is the reconstruction of the input signal for an AutoEncoder architecture. Through our study, we find that the root cause of the poor generalization of the intermediate representations is mode collapse (Section 3.4.1), where the representations are highly similar between different inputs. As such, we observe that the network has memorized how to reconstruct the data, as opposed to learning their latent factors. As before, we find that the mode-coverage and the reconstruction objective are at odds, where there is a trade-off between the two. As such, with our work we argue that the loss on its own is a poor indicator of generalization. We systematically evaluate the issue and propose the use of auxiliary training objectives that partially mitigate the issue of mode-collapse for VQ-VAE architectures. We present our work in Chapter 3.
1.2 Learning Generalization
Understanding the generalization ability of a learning method (a meta-model) is important, with applications to the safety, explainability, and deployment of ML systems. For example, we may want to evaluate whether Adam or SGD is "the one and only superior optimization method for all ML problems" (sarcasm), an absolute claim that would be easy to refute either way [27]. There are several shortcomings in correctly evaluating the ability of a method, i.e. whether SGD can generalize to all ML problems. The shortcomings involve human error [258], ML community standards [39], as well as a lack of mathematical frameworks [38] for comparison between the two methods. Human error can involve the use of an improper experimental setup by a study, while the lack of community standards means there is no single agreed method of evaluating SGD and Adam, and contradictory answers can be obtained based on how "better" is defined [27]. For example, one can compare by the best performing result, which can be a statistical outlier, or by the mean performance of a method [228]. Last, current mathematical frameworks, such as ANOVA or Bayesian Model comparison, make assumptions that are often inapplicable in the ML context due to the complexity of the model being evaluated. For example, ANOVA does not hold when the performance of a method is heteroskedastic (uneven variance) for different learning configurations, and Bayesian Model comparison fails when a poorly defined prior is used. For example, one could expect to observe different variance in the performance of a model when re-training with different learning rates, while a poor prior would be that all configurations are expected to perform the same.
In Chapter 5 we study the generalization of a 'meta-model'. We first discuss the fundamental theoretical issues of improper model comparison and present our theoretical findings, which suggest that the performance of a model over different training configurations is heteroskedastic. As such, an aggregate comparison is not valid. We introduce a framework for assessing the conditions under which a model generalizes, where we find for which configurations the 'meta-model' generalizes over an alternative hypothesis. Our framework provides a holistic analysis of a method, contrary to the binary answer (better or not) of ANOVA or Bayesian Model comparison. As such, our framework answers 'Where does a meta-model perform the best and by how much?'. Next, we study the practical problems of performing model comparison, which requires evaluating several configurations. Efficient horizontal scaling is both error-prone and unique to ML experiments, where experimental trials require specialized hardware such as Graphical Processing Units (GPUs). In practice, such systems can have non-homogeneous resources with respect to the cluster of computing nodes on which the experiment is required to run. Existing frameworks do not address the custom requirements of multi-node systems where the experimental trials have to be scheduled to utilize resources efficiently, be fault-tolerant, and combine their artifacts into a centralized location. As such, we introduce ABLATOR [94], a framework that addresses the above shortcomings by providing robust horizontal scaling of deep learning experiments. We present our work in Chapter 5.
1.3 Algorithmic Generalization
We evaluate Continual Learning methods as algorithmic improvements for training a learner on a diverse set of tasks. There are several methods proposed in the literature that can be seen as ad-hoc and evaluated in very narrow settings [52]. Throughout our work, we found that simpler methods, such as GDumb [233] and Experience Replay, outperform more complex methods and works that extend the original work [45]. More complex methods fail to generalize to varying tasks. The result can be seen as a justification for Occam's razor, and our observation is in agreement with several previous works that criticize model complexity in Deep Learning [228, 268, 229].

With our work Batch-Model-Consolidation [95] we find that the most effective approaches in Continual Learning utilize a memory in the form of a buffer from which they draw samples for replay during training. Additionally, we find that for long task sequences where the tasks have variable difficulty and domain-gaps, the buffer can be poorly utilized.

Inspired by the fast-slow learning system in mammals, we propose learning multiple fast learners and consolidating their knowledge into a slow learner incrementally and in a distributed setting, similar to mini-batch training, with improvements in both speed and reduced forgetting. With our subsequent work, we improve the buffer utilization and answer the question of 'when to store' samples, where a meta-model assesses novelty similar to the meta-cognitive mechanism of surprise. We present our work in Chapter 7.
1.4 Future Directions
On the basis of our current findings, we see several future directions and extensions to our current work. We provide an overview in this section and discuss them in detail in Chapter 8. We find that it is not possible to have a single learner generalize across all domains, as there is a trade-off between performance on a single task and generalization to many. As such, we argue that the search for a single Generalizable learner, architecture or algorithm is an idealistic pursuit. Instead, we motivate the search for a learner, algorithm, or model that is specific to the category of problems, datasets, and tasks it is required to perform.

Specific to Representation Learning, there are few metrics that evaluate the ability of the learned representations to generalize. We hypothesize that additional work in evaluating and improving the mode coverage of the representations could lead to improved generalization.

Specific to Learning Generalization, there are practical problems in the framework used to understand the configurations for which a method has improved generalization. Improvements in frameworks that assess the generalization of a learning method can lead to novel insights and model architectures that can be useful both to the research community and to industry.

Specific to Algorithmic Generalization, current learning methods applied to a model to learn a sequence of tasks and reduce Catastrophic Forgetting are complex and are evaluated under very limited settings. We find that under more general conditions, such as training a method on a large sequence of diverse tasks, such methods fail. We hypothesize that the use of a short-term memory (such as a buffer) whose information can be distilled, such as by Dataset Distillation [303], and then efficiently replayed [250], such as by Knowledge Distillation [126], is a promising direction.
Chapter 2
Background: Representation Learning
In this chapter we present the background necessary for understanding our work in Chapter 3. First, in Section 2.1 we present Quantization methods. Next, in Section 2.2 we extend Quantization methods to discuss how they are applied in the latent space of an AutoEncoder. Last, in Section 2.3 we discuss how Transformers and their key components work.
2.1 Quantization
Scalar Quantizer (SQ) with vocabulary of size K is an encoding function for an element X_i from a sequence X ∈ R^N of length N such that f(X) = {1, . . . , K}^N. SQ quantizes every element of the sequence memory-less with the same encoding function. SQ can not place assumptions on cross-correlation between different sequence elements. SQ performs optimally when the probability density function (pdf) of all data is known in advance. A trivial example of an encoding function that follows a uniform distribution (i.e. floor function) will perform optimally when all X_{U,i} ∈ X_U are also uniformly distributed and bounded such that X_{U,i} ∈ [a, b]. When X_{U,i} has an unknown pdf, SQ will assign probability mass on unlikely regions in [a, b].
Vector Quantization (VQ) is a lossy data compression method. Given a set of vectors X = {x_1, . . . , x_m}, x_i ∈ R^n, a codebook C of K code vectors, C = {c_1, . . . , c_K}, c_j ∈ R^n, and a distance function d, the VQ function uses the closest c_j to x_i based on the distance function d and returns a reconstruction x̂_i for each vector x_i, such that x̂_i = VQ(x_i) = c_{j_min} where j_min = argmin{d(x_i, c) : c ∈ C}. The objective function, min_{c ∈ C} d(x_i, x̂_i), minimizes the error of the closest codebook vector c_j to the vector x_i.
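The nearest-code lookup described above can be written in a few lines. The following is a minimal sketch under the Euclidean distance, with a randomly initialized codebook standing in for a learned one; the function name and toy shapes are illustrative assumptions, not tied to any particular VQ implementation.

```python
import numpy as np

def vector_quantize(x, codebook):
    """Map each row of x to its nearest codebook vector (Euclidean distance d)."""
    # x: (m, n) vectors to quantize; codebook: (K, n) code vectors c_j.
    dists = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)  # (m, K)
    j_min = dists.argmin(axis=1)          # index of the closest code per vector
    x_hat = codebook[j_min]               # reconstruction x_hat_i = c_{j_min}
    return x_hat, j_min

# Toy usage: 100 random 8-dimensional vectors, a codebook of K = 16 codes.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))
codebook = rng.normal(size=(16, 8))
x_hat, codes = vector_quantize(x, codebook)
```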
Product Quantization (PQ) decomposes a one-dimensional vector X ∈ R^N into slices of sub-vectors {X_j : j = 1, . . . , M} and optimizes a unique pair of a quantizer and a sub-vector. For M different codebooks C_j, j ∈ 1, . . . , M, there is a one-to-one mapping with each X_j. The PQ decoding function is the concatenation or addition of all VQ decodings VQ_j = X̂_j for codebook C_j such that PQ(X) = ∥_{∀ j ∈ M} VQ_j(X_j). We adopt the feature decomposition from PQ and extend it to high-dimensional feature vectors to reduce the statistical dependence among latent features.
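Continuing the hypothetical example above, PQ applies an independent codebook to each slice of the input and concatenates the per-slice reconstructions. The helper below reuses the vector_quantize sketch and assumes N is divisible by the number of slices M; it is an illustration, not the decomposition used in our experiments.

```python
def product_quantize(X, codebooks):
    """PQ(X) = concatenation over j of VQ_j(X_j), one codebook per slice."""
    # X: (N,) vector; codebooks: list of M arrays, each of shape (K, N // M).
    M = len(codebooks)
    slices = np.split(X, M)                                    # sub-vectors X_j
    decoded = [vector_quantize(s[None, :], cb)[0][0]           # VQ_j(X_j)
               for s, cb in zip(slices, codebooks)]
    return np.concatenate(decoded)                             # reconstruction of X

# Usage: an N = 8 vector split into M = 2 slices with K = 16 codes each.
codebooks = [rng.normal(size=(16, 4)) for _ in range(2)]
X_hat = product_quantize(rng.normal(size=8), codebooks)
```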
Cost of a quantizer is the number of codebook vectors, s.t. C_cost = K × M for PQ. Representation Capacity (C_R) defines an upper bound on the sample space from the number of discrete latent factors that can be represented by the quantizer for independent random variables X_i, such that S = K^M for K codes and M decomposed sub-vectors. For redundant X_i the sample space is reduced such that S′ = (K − 1)^M, and thus the capacity C_R is bounded by the sample space, where

C_R = −H(X)    (2.1)

Note that for PQ, C_cost grows linearly while C_R grows exponentially, in contrast to VQ with linear growth for both; thus VQ has an exponential cost for a capacity identical to PQ.
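As a numerical illustration (the values of K and M here are hypothetical and not taken from our experiments), with K = 256 codes per codebook and M = 4 sub-vectors, PQ stores C_cost = K × M = 256 × 4 = 1024 codebook vectors while spanning a sample space of S = K^M = 256^4 = 2^32 ≈ 4.3 × 10^9 discrete states; a single VQ codebook would need 2^32 code vectors to represent the same sample space.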
Prior Distribution can have an effect on the decoding performance of the quantizer. For example, VQ for X_U from before can achieve identical decoding error as SQ, but at a significant cost of K^N as compared to K for a memory-less SQ. The distribution of the prior can determine the trade-off between the cost and the representation capacity of a quantizer.
The difference between PQ and VQ is based on the assumption of co-variance among
features. In contrast to VQ, PQ takes advantage of the low co-variance among feature
sub-vectors.
2.2 Vector Quantized Auto-Encoder
Auto-Encoder (AE) is an unsupervised class of DNN architectures that learns compressed feature representations from high-dimensional data. Work by Kingma et al. [156] extends AE to Deep Latent Variable Models with variants such as the Variational Auto-Encoder (VAE). For some input x and a latent space z, a VAE is composed of a decoder p(x|z), a prior p(z), and an encoder q(z|x). VAE is a probabilistic model that implicitly learns the underlying variables used to generate the data and their latent factors by minimizing the divergence between the encoded representation q(z|x) and the true data manifold p(z). To evaluate an AE, we can use Mutual Information (MI), which is a statistical dependence metric between two variables s.t. I(X; Y) = H(X) − H(X|Y), where H(X) is the information entropy of X. The optimization objective of VAE [125, 43] is an upper bound to

max [ I(z; p(x|z)) − β I(x; z) ]    (2.2)

that maximizes the mutual information between the latent representation and the decoded data and discards information from x that is not informative for decoding p(x|z). As such, maximizing Eq. (2.2) also maximizes the entropy of z or "informativeness" [133].
Vector Quantized Auto-Encoders We use the interpretation by Richardson et al. [249] and MacKay et al. [194] and present quantization under the principles of Information Theory. In the context of Deep Neural Networks, VQ can be applied as an information bottleneck in an AutoEncoder framework to learn a discrete latent representation. The VQ-VAE framework is composed of an encoder, a decoder, and a VQ. For input data x, VQ-VAE will first encode x to obtain the feature map z_e(x), then quantize it to indices e = q(z|x) using the closest codes in the codebook C. The quantized feature map is a continuous vector from codebook C and can be represented as z_q(x). Finally, the decoder reconstructs the original data p(x|z_q). The overall training objective is composed of three loss terms; the first term is the reconstruction loss, which we refer to as l_DNN, and the following two terms are the codebook loss and the commitment loss, which together we refer to as l_VQ and optimize under a joint objective using Eq. (2.3). We use sg to denote the stop-gradient operation that prevents the parameters of the operand from updating during the training phase.

L_VQVAE = log p(x|z_q(x)) + ||sg[z_e(x)] − z_q(x)||_2^2 + β ||z_e(x) − sg[z_q(x)]||_2^2    (2.3)

where the three terms are the reconstruction loss, the codebook loss, and the commitment loss, respectively.
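A minimal PyTorch sketch of Eq. (2.3) is shown below, assuming a flattened (B, D) encoder output, a (K, D) codebook tensor, and a decoder callable; an MSE reconstruction term stands in for the log-likelihood, and the straight-through estimator copies gradients from the quantized vectors back to the encoder, as is common for VQ-VAE-style training. The function and argument names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def vq_vae_loss(z_e, codebook, x, decoder, beta=0.25):
    # Nearest-neighbour quantization of each encoder vector.
    dists = torch.cdist(z_e, codebook)              # (B, K) distances to codes
    z_q = codebook[dists.argmin(dim=1)]             # closest code per vector
    codebook_loss = F.mse_loss(z_q, z_e.detach())   # ||sg[z_e(x)] - z_q(x)||^2
    commitment_loss = F.mse_loss(z_e, z_q.detach()) # ||z_e(x) - sg[z_q(x)]||^2
    z_q_st = z_e + (z_q - z_e).detach()             # straight-through gradients
    recon_loss = F.mse_loss(decoder(z_q_st), x)     # stands in for -log p(x|z_q)
    return recon_loss + codebook_loss + beta * commitment_loss
```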
2.3 Transformers
The Transformer [295] model is an encoder-decoder architecture that uses an attention layer during training and learns to selectively attend to a sequence [20]. Attention is computed as the similarity between a query and a key vector, which is used to weight the corresponding value vectors.

Attention-Layer Given three sequences x_Q, x_K, x_V ∈ R^{L×d_x} of length L with feature dimension d_x and parameter matrices W_Q, W_K, W_V ∈ R^{d_x×d_k}, we define Q = x_Q W_Q, K = x_K W_K, V = x_V W_V, where d_k is a flexible hyper-parameter that controls the capacity of the layer.

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (2.4)
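The following is a minimal PyTorch sketch of Eq. (2.4) for already-projected Q, K and V matrices; the optional additive mask argument anticipates the masked variant of Eq. (2.5). This helper is for illustration only (recent PyTorch versions also ship a built-in scaled dot-product attention).

```python
import math
import torch

def attention(q, k, v, mask=None):
    # q, k, v: (L, d_k) projected sequences; mask: optional (L, L) additive mask.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # QK^T / sqrt(d_k)
    if mask is not None:
        scores = scores + mask            # masked entries hold -inf, others 0
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                    # attention-weighted sum of values
```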
Self-attention [295] uses the same sequence for all query, key and value vectors such that x = x_Q = x_V = x_K. Cross-attention [295] uses two different sequences, one as a Query and another as Key-Value, such that x = x_Q and x′ = x_K = x_V.
Masked-attention [295] is defined by the addition of a mask matrix M ∈ R^{L×L} to the Query-Key computation. The mask matrix has binary values of 0 or −∞. A value of 0 for indices i, j corresponds to x_i attending to x′_j. In contrast, non-attention in this context corresponds to −∞ values. A value of −∞ for entry i, j will cause the corresponding softmax value to approach 0 and thus results in 0 gradients when updating the parameters of the model with respect to the attention computation between elements x_i and x′_j.

MaskedAttention(Q, K, V) = softmax((QK^T + M) / √d_k) V    (2.5)
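As a usage example of the sketch above, a causal (autoregressive) mask of the kind used by decoder-only models can be built by filling the strictly upper triangle with −∞; the shapes below are arbitrary toy values.

```python
# Causal mask: position i may attend only to positions j <= i.
L, d_k = 4, 8
mask = torch.full((L, L), float("-inf")).triu(diagonal=1)  # -inf above the diagonal, 0 elsewhere
x = torch.randn(L, d_k)                   # stands in for projected Q = K = V (self-attention)
out = attention(x, x, x, mask=mask)       # (L, d_k)
```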
Chapter 3
Representational Generalization
This chapter includes our work published in CVPR 2022 [90] and EMNLP 2021 [60], as well as an unpublished follow-up to our work. The work for EMNLP 2021 involves that of my collaborator Junyan Cheng and the supervision of Prof. Barry Boehm and Prof. Mohammad Soleymani. The works are summarized in this chapter in the manner in which they relate to Representational Generalization. With our first work we address issues of Mode-Collapse, and with our ongoing work we efficiently utilize the representations learned with a 'first-stage' method during the 'second-stage' training. Our work for EMNLP focuses on intermediate fusion of the representations in an end-to-end manner as opposed to a two-stage training. The background from Chapter 2 on quantization methods, Auto-Encoder and Transformer models is required to understand the methods in this chapter.
Figure 3.1: The original images (left) are reconstructed by DQ (middle) and VQ (right)
with identical models and training setup. The perceptual quality of DQ outperforms VQ.
3.1 Implicit Feature Decoupling
In this section we discuss theoretical contributions and analysis on the improvements to quantization with Depthwise Quantization (DQ), where the decomposition of the feature vector leads to implicit decoupling of the quantized feature space when trained end-to-end with an Auto-Encoder. The decoupling allows for more efficient compression of the representations, which provides an improved training objective for the second-stage model. Additionally, we find that the use of soft-EMA [259] can improve mode-coverage and thus the informativeness of the representations.
Figure 3.2: DQ (left) applies C1 on the first slice of the sub-tensor and for all sub-vectors and concatenates the quantized vectors. VQ [215] (middle) and PQ (right) quantize the same vector with different codebooks and combine the two sub-vectors by addition or concatenation.
3.1.1 Depthwise Quantization
Given an output feature tensor X of rank r from the encoder, Depthwise Quantization (DQ) applies M quantizers VQ_i pair-wise on decomposed tensor slices X_i = X_i^α along an axis α with quantization dimension D = |X_i^α|.

DQ(X) = {VQ_i(X_i) : i = 1, . . . , M}    (3.1)

Each VQ_i optimizes Codebook C_i and uses l_VQ, Eq. (2.3), to define the error between X_i and the closest quantization vector X̂_i = Q_i(X_i). The optimization objective is the joint optimization over each codebook such that

min_{C_1, . . . , C_M} [ L_DQ = Σ_{∀ X̂_i, X_i} l_VQ(X_i, X̂_i) ]    (3.2)

We use the L2 norm as a similarity metric for l_VQ between each X_i and the local quantization vector X̂_i. The DQ loss is then added to the reconstruction loss of the DNN and the gradients are copied from the quantized vector X̂_i to X using auto-grad [223]. The loss function of DQ becomes

L = L_DNN + L_DQ(sg(X), X̂) + β L_DQ(X, sg(X̂))    (3.3)

where sg stands for the stop-gradient operator that stops the operand from updating during the training phase.

The first loss term is used to lower the reconstruction error, the second term adjusts the codebook corresponding to the encoder output, and the third term is used to prevent the encoder output from growing arbitrarily. For 2-D Convolutional Neural Networks (CNN), X is a feature tensor of rank 3. Fig. 3.2 provides an illustration of the DQ process.
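A minimal PyTorch sketch of the forward pass and quantization loss of Eqs. (3.1)-(3.3) is given below. It assumes the encoder output has already been reshaped into M slices of width D (e.g., along the depth axis of a CNN feature map) and keeps one codebook per slice; the reconstruction term L_DNN is computed by the surrounding auto-encoder and added to the returned loss. The class and argument names are illustrative assumptions, not the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

class DepthwiseQuantizer(torch.nn.Module):
    """One codebook per depthwise slice of the feature tensor (a sketch of DQ)."""

    def __init__(self, num_slices, codes_per_book, slice_dim, beta=0.25):
        super().__init__()
        self.beta = beta
        # M codebooks C_1..C_M, each with K codes of dimension D.
        self.codebooks = torch.nn.Parameter(
            torch.randn(num_slices, codes_per_book, slice_dim))

    def forward(self, x):
        # x: (B, M, D) encoder features decomposed into M slices X_i.
        quantized, loss = [], x.new_zeros(())
        for i in range(x.shape[1]):
            dists = torch.cdist(x[:, i], self.codebooks[i])    # (B, K)
            z_q = self.codebooks[i][dists.argmin(dim=1)]       # closest codes
            loss = loss + F.mse_loss(z_q, x[:, i].detach())               # L_DQ(sg(X), X_hat)
            loss = loss + self.beta * F.mse_loss(x[:, i], z_q.detach())   # beta * L_DQ(X, sg(X_hat))
            quantized.append(z_q)
        x_q = torch.stack(quantized, dim=1)                    # concatenated slices
        x_q = x + (x_q - x).detach()                           # copy gradients to x
        return x_q, loss
```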
3.1.2 Channel Capacity
3.1.3 Implicit Decoupling
Decoupled refers to the statistical independence between features while Coupled refers to
their statistical dependence. We present analysis using Information Theory and explain DQ
as an information bottleneck on a signal.
Eq. (2.2) provides the basis of the VAE optimization objective, which can be formulated as a lower bound to the channel capacity as L ≥ E_{q(z|x)}[log p(x|z)] − β D_KL(q(z|x) || p(z)) [125, 43], where β is the Lagrange multiplier. β-VAE assumes a Gaussian prior p(z) ∼ N(0, I), and DQ assumes a uniform prior. Thus the KL-Divergence between the uniform prior and the encoder is the capacity of the quantizer, D_KL(q(z|x) || p(z)) = C_R.

max [ E_{q(z|x)}[log p(x|z)] − C_R ]    (3.4)
Reducing the capacity of the information bottleneck in VAE encourages disentangled representations in β-VAE. Similarly, reducing C_R encourages disentangled representations for each codebook, with the upper bound controlled by K and M. By doing so, significantly compressed representations can be learned for an improved downstream training objective.
Mode collapse can be observed when the quantizer utilizes only a limited number of codes. In contrast to the traditional β-VAE, where a single mode is learned over all data, DQ relies on the assumption of a uniform distribution of quantized vectors p(z) that covers multiple modes. However, the assumption of a uniform prior for VQ is weak. Exponential Moving Average (EMA) and random re-initialization of codes with low usage can improve mode coverage. Previous works [147, 48] discuss the equivalence of quantization with EMA as an approximation to the Variational Information Bottleneck (VIB) when trained with soft Expectation Maximization (EM). The E-step of the update rule of DQ is approximated with EMA over mini-batches of data [48, 56]. This is in contrast to hard-EM, where quantization is deterministic. Soft-EM provides a probabilistic discrete information bottleneck [260, 310]. Our work improves codebook utilization and reduces mode-collapse; however, a definitive solution remains an open problem.
Entropy Estimation evaluate the information density of the quantization vectors and
thus their utilization; mode coverage. Estimation on continuous distributions is intractable
and thus, in our work we compute entropy on the quantized discrete distribution from
Eq. (3.1).
I(X; Y ) = X
y∈Y
X
x∈X
p(X,Y )(x, y) log
p(X,Y )(x, y)
pX(x) pY (y)
(3.5)
The entropy of the quantization regions can provide an estimate of the mutual information of the continuous feature space. When DQ is learned end-to-end, entropy can be calculated directly from the frequency count of each code vector over a sample set. Our approach to approximating MI is similar to previous work that uses parametric and non-parametric approaches [162, 161, 97, 220, 205], but it is applied in the context of an implicitly decoupled feature space from a DNN.
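A small NumPy sketch of how the entropy and pairwise MI values reported in Section 3.2.1 could be estimated from discrete code assignments; the count-based (plug-in) estimator and the function signatures are our illustrative choices, not the exact implementation.

```python
import numpy as np

def code_entropy(codes, K):
    """Entropy (nats) of a discrete code distribution estimated by counting.

    codes : (N,) integer code indices collected over a sample set
    K     : codebook size
    """
    p = np.bincount(codes, minlength=K) / len(codes)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def pairwise_mi(codes_a, codes_b, K):
    """Plug-in estimate of I(A; B) from Eq. (3.5) for two codebooks."""
    joint = np.zeros((K, K))
    np.add.at(joint, (codes_a, codes_b), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])).sum())
```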
3.2 Depthwise Quantization Experiments
3.2.1 Implicit Decoupling
Figure 3.3: Effects of implicit marginalization on the learned quantized features. The mean entropy (“informativeness”) of the quantization vectors for each pixel for DQ (first) is higher when compared to VQ (second); higher is better. The mean MI (“redundancy”) scores between each pair of quantization codebooks for DQ (third) are lower when compared to VQ (fourth); lower is better. The diagonal represents the entropy of each quantization vector; the lower half of the matrix is empty.
We train a DQ-AE and a VQ-VAE [215] for M = 10 and K = 512 with an identical network
configuration, methodology and hyper-parameters.
The difference between DQ and VQ architectures is highlighted in Fig. 3.2. We measure
the likelihood estimation of the two approaches on CIFAR-10 [169] and quantize each image to 8 × 8 × 10 codes. The VQ-VAE NLL is 4.36 bits/dim as compared to 3.14 bits/dim for a single-hierarchy DQ-AE, a 28% decrease.
High Entropy We show that the learned features of DQ have high entropy, which indicates low statistical dependence among them. In contrast, VQ appears to have a few very informative features and many uninformative ones. The mean entropy of the prior is H(z) = 6.03 nats/pixel for DQ as compared to H(z) = 5.86 nats/pixel for VQ. The entropy distribution among spatial features of the prior can be found in the left two sub-figures of Fig. 3.3.
Low MI We estimate the pairwise MI of the quantization vectors along the depth of the feature tensor, with mean scores of 1.93 and 2.36 nats/vector for DQ and VQ respectively. A comparison matrix can be found in the right two sub-figures of Fig. 3.3. For DQ, the MI between
quantization vectors is significantly lower as visualized by the mostly empty upper triangular
matrix. In contrast, for VQ there seems to be higher redundancy among quantization vectors.
The diagonal of the matrix represents the entropy of each quantization vector. The MI
estimate on the quantization vector shows that the redundancies are significantly higher in
the learned representation for VQ.
We compare the speed at which each model converges on CIFAR-10. We find that the VQ validation loss plateaus in 50K iterations as opposed to 200K iterations for DQ, for a significantly improved reconstruction loss. Moreover, DQ converges significantly faster and reaches the best validation loss of VQ in 20K iterations.
3.2.2 Likelihood Estimation
                        CIFAR-10            ImageNet-32          ImageNet-64
Model                   Param.   bits/dim   Param.   bits/dim    Param.   bits/dim
S-Tr.¹ / Img-Tr.²        59M      2.80       -        3.77        152M     3.44
VD-VAE³                  39M      2.80       119M     3.80        125M     3.52
DQ-AE (Ours)             22M      2.52       22M      3.12        22M      2.89

Table 3.1: Baselines: ¹Sparse Transformer [63], ²Image Transformer [222], ³VD-VAE [62]. The first row reports the Sparse Transformer for CIFAR-10 and ImageNet-64 and the Image Transformer for ImageNet-32. “Ours” is a 2-hierarchy DQ-AE with K set to 256 and 128 for the top and bottom codebooks, respectively.
For likelihood estimation, we compare DQ-AE with other likelihood estimator models and
report the numbers from their work. We use Very Deep VAE (“VD-VAE”) [62] as a continuous AutoEncoder baseline and Sparse Transformer (“S-Tr”) [63] as an Auto-Regressive
baseline.
For experiments on ImageNet, we append the image resolution at which we train the model to the dataset name. For our model, we use an identical architecture and number of hierarchies for all resolutions of the dataset. The detailed results are in Table 3.1. We outperform all previous state-of-the-art models when measuring the loss in bits/dim; we also report C_R separately, with an estimate of C_R ≈ 0.2 nats. Visual inspection of both top and bottom hierarchies confirms that they encode different granularities of features and are utilized (Fig. 3.7), and perceptual quality is improved (Fig. 3.1).
When compared to the hierarchical model by Razavi et al. [246], DQ-AE also outperforms in L2 reconstruction error on ImageNet-256. On CIFAR-10, the DQ-AE loss is 0.019 compared to 0.044 for VQ-VAE. For ImageNet-256, the DQ-AE loss is 0.0032 compared to 0.005 for VQ-VAE-2.
3.3 Two-Stage Conditional Transformer
3.3.1 First-Stage
Depthwise Quantized Auto-Encoder (DQ-AE) applies DQ (Section 3.1.1) at different hierarchical feature representations. Hierarchical feature representations can provide different granularities of the representation, from coarser to finer, where, similar to Curriculum Training [30], the progressive difficulty improves training of the second-stage model. Specific to hierarchical variants is mode collapse of individual hierarchies, which, similar to previous work [246, 76], we observe to be common. Through experiments in Section 3.4.1, we find that a lower-capacity bottom-level hierarchy enforces the utilization of top-level hierarchies and that the under-utilization of top- or bottom-level hierarchies can be a consequence of over-fitting. During the early stages of training, both hierarchies are used equivalently, but at later stages the top-level prior collapses.
We avoid collapse of coarser hierarchies by decoding subsequent levels conditioned only on the preceding hierarchy. Our architecture leads to informative top- and bottom-level hierarchies, as can be seen in Fig. 3.7. We perform this operation top-to-bottom and use Eq. (3.1) as the optimization objective of each DQ. We provide our analysis and theoretical justification in the following Section 3.3.2.
3.3.2 Hierarchical Conditional Quantization
For hierarchical Auto-Encoders, we produce discrete feature maps of images of decreasing resolution recursively from bottom z_bot to top z_top. Similarly, we decode from top d_top to bottom d_bot. Previous work directly uses the corresponding hierarchy to decode at each level. We observe that this leads to mode collapse of the top level. Intuitively, it can be understood as the network following the path of least resistance to optimize the parameters. It is much more challenging for the network to decode the top level, as it is a smaller bottleneck, and thus the model over-fits to decoding only with the bottom level. As such, we propose to perform the decoding conditioned on the previous hierarchy, such that we use the combined pooled map z' from ẑ_i and d_{i-1} and decode to d_i, where z' = pool(ẑ_i, d_{i-1}) and d_i = D_i(z').
At each hierarchy we apply the Conditional-Quantization Loss, L_cq, calculated between the previous hierarchy's decoded reconstruction d_{i-1} and the current hierarchy's quantized representation z_{q_i}:

\mathcal{L}_{cq} = \sum_{i \in \{\text{top} \dots \text{bottom}-1\}} \| z_{q_i} - d_{i-1} \|_2 \qquad (3.6)

Lastly, x̂ is reconstructed using only the bottom-level feature map, such that x̂ = D(d_bot), as opposed to using decoded representations from all hierarchies [247].
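A schematic PyTorch sketch of the conditional decoding pass and L_cq; the `upsample` helper, the use of channel concatenation in place of the pool operation, and the squared-error form of the distance are simplifying assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def hierarchical_decode(z_hat, decoders, upsample):
    """Conditional decoding with the loss of Eq. (3.6), sketched.

    z_hat    : list of quantized maps ordered [top, ..., bottom]
    decoders : list of per-hierarchy decoders D_i, in the same order
    upsample : helper that matches d_{i-1} to the resolution of z_hat[i]
    """
    l_cq, d_prev = 0.0, None
    for z, D in zip(z_hat, decoders):
        if d_prev is None:
            d = D(z)                            # top level decodes on its own
        else:
            d_up = upsample(d_prev, z)
            l_cq = l_cq + F.mse_loss(z, d_up)   # pull z_{q_i} toward d_{i-1}
            d = D(torch.cat([z, d_up], dim=1))  # pooled map z' (here: concatenation)
        d_prev = d
    x_hat = d_prev                              # reconstruct only from the bottom level
    return x_hat, l_cq
```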
Hierarchical Mode Collapse We provide a theoretical justification and analysis of the mode collapse issue. Previous work [81, 281, 320, 282] analyzes DNNs from an information-theoretic perspective, and we extend it to hierarchical auto-encoders. The encoder at each hierarchy in a hierarchical auto-encoder receives as input the previous layer's activation; together they form a Markov chain [282]. Similarly, a Markov chain is defined for the decoder during the decoding process. As a consequence of the Data Processing Inequality (DPI) [281], the information I(X; z_i) between the intermediate layer and the original input is progressively lost at each hierarchy, such that

I(X; z_{bot}) \geq I(X; z_i) \geq I(X; z_{top}) \qquad (3.7)
Because of the symmetric nature of the architecture, we can express a lower bound on the information that the reconstruction contains about the original data [320]:

I(X; \hat{X}) \geq I(\hat{z}; d) \implies I(X; \hat{X}) \geq I(X; z_i) \geq 0 \qquad (3.8)

where ẑ and d refer to the top-level representations of the encoder and decoder, respectively. In summary, the objective is to maximize I(X; X̂) (or, similarly, minimize the reconstruction error). There are two observations to make: from the set of Markov chains in Eq. (3.7), I(X; z_bot) is a tighter bound on I(X; X̂), and every I(X; z_i) is bounded below by 0. As a consequence, the DNN has sufficient capacity to maximize I(X; X̂) by maximizing I(X; z_bot). Optimizing for I(X; z_i) can be disregarded and the inequality will still hold true. The outcome is a top-level prior that is uninformative, I(X; z_top) ≈ 0. In the experimental set-up, we observe vanishing gradients for top-level hierarchies.
It is clear that L_cq maximizes I(z_{q_i}; d_{i-1}) ≫ 0 and provides a tighter lower bound to the inequality in Eq. (3.8), which requires the utilization of all hierarchies. The alternative would be to directly maximize I(X; z_top), a method we also refer to as Independent Decoding in this work and one used by work such as Jukebox [76]. The downside of this approach is that it breaks the symmetry of the Markov chain, and Eq. (3.8) no longer holds true. In practice, this results in little mutual information between hierarchical encoded representations, such that I(z_{q_i}; d_{i-1}) ≈ 0. In contrast, we want to maximize I(z_{q_i}; d_{i-1}) for generation conditioned on the previous hierarchy. As such, L_cq preserves the symmetry and improves lower-hierarchy code utilization by providing a tighter lower bound on the intermediate representations.
3.3.3 Second-Stage
Figure 3.4: Top: The training procedure of a second-stage model starts by first quantizing a training sample to progressively fewer codes. The two sequences from the top and bottom hierarchies are of shorter and longer length, respectively. Our method conditions on the preceding hierarchy, where the next-token prediction objective is now informed and controlled by the less granular top-level codes. Bottom: During inference, we now provide a conditioning vector that controls generation. When multiple hierarchies are present, the same procedure can be repeated iteratively in a process called ancestral sampling.
The Conditional-Cross-Attention (CCA) block uses the output of the self-attention layer q'_i to compute a query matrix Q and a representation z to compute the key-value matrices in the attention [295] mechanism, such that

Q_{q'} = q' W_Q; \quad K_z = z W_K; \quad V_z = z W_V \qquad (3.9)

where the parameter matrices W ∈ R^{d_model × d_k}, and d_model and d_k are hyper-parameters of the model that control the size of the matrices. The conditional bias of our method is induced by the content z when computing the similarity between the hidden representation q' and the key-value representation of the conditional content z. In summary, for a given hierarchy i, the output of the CCA block is given by Eq. (3.10):

\mathrm{CCA}(q'_i, z) = \mathrm{softmax}\!\left(\frac{Q_{q'_i} K_z^{T}}{\sqrt{d_k}}\right) V_z \qquad (3.10)
The main distinction from previous work [295] is that we use CCA in a hierarchical ensemble, in a model we call GPT_CCA. Starting from the top CCA, we use the label of a sample as the conditioning content during training and inference, such that z = y. For every succeeding hierarchy i, we condition on the previous hierarchy's decoded and up-sampled representation z = sg(d_{i-1}), where d_i is reconstructed from the first-stage VDQ such that d_i = D_i(z_{q_i}). During the first stage of our method we minimized the dissimilarity between q_i and z = d_{i-1}. Thus, with Eq. (3.10) we are able to maximize the similarity between an informative z and q_i. Sampling takes place from top to bottom in a process called ancestral sampling, as shown in Fig. 3.4.
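A single-head PyTorch sketch of the CCA block of Eqs. (3.9)-(3.10); omitting the output projection and assuming that z shares the model dimension are our simplifications.

```python
import torch
import torch.nn as nn

class ConditionalCrossAttention(nn.Module):
    """Queries come from the self-attention output q'; keys and values
    come from the conditioning content z."""

    def __init__(self, d_model, d_k):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)
        self.w_k = nn.Linear(d_model, d_k, bias=False)
        self.w_v = nn.Linear(d_model, d_k, bias=False)
        self.scale = d_k ** -0.5

    def forward(self, q_prime, z):
        # q_prime: (B, L_q, d_model), z: (B, L_z, d_model)
        q = self.w_q(q_prime)
        k = self.w_k(z)
        v = self.w_v(z)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```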
3.4 Two-Stage Fusion Experiments
3.4.1 First-stage Ablation
We use the number of codes (K) in each codebook to adjust the rate of the Information
Bottleneck (IB) and evaluate our method using Rate-Distortion [8] and IB [103] curves. In
Fig. 3.5 we perform experiments for 5 different rates to establish an approximate upper
bound on the channel rate as R ≈ log_2(K). A first-stage model trained at a lower rate can
lead to an easier training objective for a second-stage model at the cost of distortion. Our
experiments explore the Pareto front of a first-stage model.
For our experiments, we train our method, DQ-AE from Section 3.3.1, Jukebox [76]
and VQ-VAE-2 [247] for 5 different rates K = 8 . . . 128 and evaluate each model using the
information content of the latent codes to predict an attribute y through probing. We use
the dataset label for CIFAR10 and MNIST. For CelebA we use a sample’s attributes for
Gender (Male or not), Smiling, and Age (Young or not) from the original dataset.
Table 3.2 reports the metrics on informativeness using the Negative-Log-Likelihood of
a probe Iy, where lower values are better. We find that DQ-AE outperforms Jukebox in
all regards with the result amplified at lower rates, such as for K = 8. On the contrary,
VQ-VAE-2 is a similar approach that utilizes both hierarchies, but with poor conditioning
between hierarchies; the result is amplified for experiments with larger codes, i.e. K = 128.
The results agree with our analysis and imply that we can train a model at lower rates for
an improved reconstruction loss, while for larger rates our model is more efficient.
Additionally, we observe code collapse for the deeper model, as the auxiliary loss of Jukebox can be sensitive to the latent codebook size. The effect of code collapse for a VQ-VAE-2 in which we remove the loss function proposed by our method, Eq. (3.6), can be observed in Fig. 3.6. On the contrary, using Eq. (3.6) leads to informative utilization of all codes (Fig. 3.7).
                         MNIST                                       CIFAR-10
K    Model        D (10⁻⁴) ↓  −I_all ↓  −I_top ↓  −I_bot ↓    D ↓    −I_all ↓  −I_top ↓  −I_bot ↓
8    VQ-VAE-2     5.37        1.00      1.42      0.72        0.28   2.23      2.29      2.23
     Jukebox      6.94        1.00      1.47      0.66        0.25   2.29      2.29      2.28
     DQ-AE        5.21        0.93      1.38      0.62        0.24   2.16      2.19      2.18
16   VQ-VAE-2     3.60        1.18      2.44      1.13        0.18   2.40      2.30      2.39
     iDQ          3.72        1.15      1.60      1.09        0.19   2.32      2.27      2.33
     DQ-AE        3.08        0.93      1.38      0.92        0.16   2.20      2.27      2.32
32   VQ-VAE-2     2.36        1.49      2.31      1.49        0.14   2.32      2.31      2.29
     Jukebox      2.60        1.38      2.19      1.31        0.14   2.38      2.30      2.33
     DQ-AE        2.11        1.18      1.79      1.15        0.13   2.25      2.28      2.25
64   VQ-VAE-2     2.61        1.70      2.31      1.70        0.13   2.39      2.31      2.40
     Jukebox      2.42        1.53      2.29      1.55        0.12   2.42      2.32      2.41
     DQ-AE        2.02        1.25      2.16      1.23        0.11   2.33      2.31      2.32
128  VQ-VAE-2     2.67        1.79      2.31      1.78        0.12   2.42      2.33      2.41
     Jukebox      1.96        1.57      2.14      1.60        0.10   2.47      2.34      2.36
     DQ-AE        1.89        1.32      2.11      1.29        0.09   2.38      2.33      2.35
Table 3.2: Comparison of the Pareto front on the informativeness of the latent space via probing between VQ-VAE-2 [247], Jukebox [75] and DQ-AE (Ours). DQ-AE outperforms all other methods in reconstruction error and in the information content of the latent codes. The difference is amplified for lower rates (K). Additionally, I(z_top; y) for DQ-AE is significantly higher. This leads us to conclude that DQ-AE is Pareto-optimal compared to the other methods.
For the second stage of our method we use the first-stage model trained with 16 codes, which has the optimal marginal rate of substitution, where the slope ∂F/∂R ≈ ∂F/∂D. The trade-off between distortion and rate is optimal for the given dataset and experimental setup. As such, we utilize fewer codes for the least distortion.
Figure 3.5: The top row displays the IB curves and reports the negative log error, denoted as I(z; y), for a given rate R. Lower values on the curve are better. The bottom row displays the RD curves for the same R as above. VDQ has improved reconstructions and more informative latent codes for any given rate.
(a) Ablation on DQ-AE (Ours). We describe each row in the sub-figure from top to bottom. First, decoding an image by excluding the top-level codes. Second, decoding an image by excluding the bottom-level codes. Third, decoding using only the top-level codes. Fourth, we use all codes to reconstruct the original image. Last, the original image. All hierarchies are utilized efficiently to store sufficient information to reconstruct different granularity levels of features; for example, structural information can be observed in all rows.
(b) Identical to Fig. 3.6a, we ablate the informativeness of all hierarchies in reconstructing the original image by excluding the codes of a hierarchy for a VQ-VAE-2 model [246]. The bottom-level codes are sufficient to reconstruct most of the original image, while the remaining hierarchies are uninformative (black reconstructions).
Figure 3.6: Ablation of the first-stage model of our method, DQ-AE, when compared to VQ-VAE-2 [246]. The results demonstrate the effectiveness of our proposed loss L_cq, Eq. (3.6), which is the main difference between the methods.
(a) Visualization of the least granular codes from the top-level codebooks for the same images as Fig. 3.6a. Top, our method DQ-AE, compared to (bottom) VQ-VAE-2. Our method contains codes that have higher entropy, corresponding to being more informative.
(b) Quantization codes from the bottom-level codebooks for the same images as Fig. 3.6a. The images are presented similarly to Fig. 3.7a. Bottom-level codes are more granular, and their structure resembles the original image. There is high similarity between the two methods. However, for DQ-AE (top), the codes contain higher-level features and as such are not over-fit to the structure of the image.
Figure 3.7: Demonstration of code utilization for different hierarchies of a DQ-AE model. We ablate the informativeness of the hierarchical codes with respect to the input image. The results agree with our observation in Fig. 3.6a that less granular codes are not informative for reconstructing the original image.
3.4.2 Second-stage Ablation
Figure 3.8: Conditional generative sampling on the digit label (0-9), from left to right, for a combination of first-stage and second-stage model variants. First row, Image-GPT (iGPT) [57] is trained in the original input space, where training and inference are performed on a sequence of length 784, which is slow and computationally expensive. Second to last row, we use GPT_CCA to train on the latent representations of the first-stage model to reduce the input space of Image-GPT to 256, a 67% reduction, where the CCA block from Section 3.3.3 is used on the latent representation. Second row, DQ-AE (Ours) representations resemble the original ones, but there still exist artifacts, e.g., for digit 6. Our method generates images that are difficult to distinguish from the ground-truth images or iGPT. Third row, Jukebox [75] images resemble the original ones, but they ignore the conditioning label; for example, the first image of the row was conditioned to be 0. Fourth row, VQ-VAE-2 [247] is trained without the intermediate loss L_CQ, and the generated digits do not resemble the conditioning class and collapse to digit 0. Our method reduces the computational requirements of generating in the original input space while outperforming previous work in diversity and realism.
We perform an ablation to test the usefulness of the CCA block where we compare it with
Image-GPT [57]. Additionally, we evaluate our claims from Section 3.4.1 where we train a
GPT_CCA on the latent space of a DQ-AE, Jukebox, and VQ-VAE-2. We present qualitative
results in Fig. 3.8 and quantitative results in Table 3.3.
We evaluate the increase in memory footprint of the CCA block from Section 3.3.3
compared to a GPT on the same input sequence length. We find that CCA increases the
memory footprint by 5% for the top-level hierarchy and 18% for the bottom-level hierarchy,
and has an improved NLL of 5% and 16% respectively.
We compare all models on the Negative-Log-Likelihood (‘NLL’) on the latent space and
the AUC-ROC (‘ROC’) score of a pretrained classifier on correctly predicting the conditioning class. Previous work [247] has noted that the FID and IS scores using the pre-trained
model are sensitive to the blurriness of the reconstructed images. For this reason we use a
feature extractor to calculate the reconstructed ‘FID*’ and ‘IS*’. Similarly to previous work
[247], we use discriminator sampling with a threshold of 0.9 for MNIST to calculate FID* and IS*, but we do not use discriminator sampling (as it would be biased) when reporting the
ROC score. Our method outperforms all other variants for every metric with the results
presented in Table 3.3.
Table 3.3: Quantitative evaluation of the second-stage ablation results, where GPT_CCA (our method) has improved FID and IS metrics, which are used to evaluate the quality of generated images. We use * to denote that we adapt each method to the novel experimental setting. For example, we use Jukebox and VQ-VAE with a CCA, whereas GPT_CCA utilizes the DQ-AE from Section 3.3.1. Similarly, Latent-ImageGPT is a GPT without CCA applied on the latent space. Our ablation study demonstrates that GPT_CCA is the only method that performs well across all metrics. On the contrary, other methods have close to no discriminative power in their generations, i.e., a ROC of 0.50 (as good as random).
Model              NLL ↓   FID* ↓   IS* ↑   ROC ↑
Jukebox*           0.89    12.42    8.18    0.76
VQ-VAE-2*          0.99    27.19    6.09    0.50
Latent-ImageGPT    0.99    58.85    2.94    0.50
GPT_CCA (Ours)     0.84    10.48    9.56    0.99
3.5 End-to-End MultiModal Transformer
In our work we evaluate Conditional-Cross-Attention when the conditioning content z is from a different modality and trained end-to-end. In contrast to our work in Section 3.3, a Multimodal SP-Transformer contains a stack of attention layers that use sparse phased attention (SP-Transformer layers). The sparsification of the attention mechanism introduced by our work [60] is not relevant to explaining the methods of this section and, as such, is omitted. We combine three attention layers: Input Attention, Cross Attention, and Self Attention. An SP-Transformer layer can process multiple modalities m ∈ M, where each modality can be a sequence X_m. A set of hidden states h^λ_m is maintained for each modality at layer λ, and the layer output updates the hidden states h^{λ+1}_m. The first layer uses a learnable embedding h^0_m.
At every layer, Input Attention attends to the original signal X_m with the hidden states from the previous layer h^λ_m to compute updated hidden states for modality m, ĥ^{λ+1}_m.
Figure 3.9: A trimodal SP-Transformer for audio A, video V, and text T, with N layers to update hidden states h_m. SP-Blocks are indicated with grey rectangles.
Cross Attention is applied on the output of the two Input Attention blocks of different modalities. For every modality, Cross Attention attends to the hidden states between m → m′, ∀m′ ∈ M \ {m}. We extend Cross Attention to both directions of a modality pair, m′ → m and m → m′. Finally, we sum the Cross Attention hidden states for modality m → ∀m′ and apply a Self Attention mechanism on the final vector that represents the hidden states learned for modality m, defined as h^{λ+1}_m. The output of this layer can be fed to another layer or used in a downstream task. The architecture is illustrated in Figure 3.9.
Input Attention compresses each unimodal input sequence into hidden states:

\hat{h}^{\lambda+1}_m = \text{SP-Block}(h^{\lambda}_m, X_m)

Cross Attention models the cross-modal interactions:

\bar{h}^{\lambda+1}_m = \sum_{m' \in M} \text{SP-Block}(\hat{h}^{\lambda+1}_m, \hat{h}^{\lambda+1}_{m'})

Self Attention fuses the cross-modal information for each modality:

h^{\lambda+1}_m = \text{SP-Block}(\bar{h}^{\lambda+1}_m, \bar{h}^{\lambda+1}_m)
Although previous work has proposed multimodal approaches specific to Transformers [284, 241, 117], the use of hidden states in the cross-attention mechanism is the novel component of our method. Furthermore, our architecture allows for intermediate interaction of different modalities [241, 284, 130, 219], as opposed to interaction only at the first layer or recurrently [70, 240]. In our experiments, we constrain the length of a hidden-state sequence to L_{h_m} = L_{X_m} / S, where S is a hyperparameter that controls the compression ratio.
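A condensed PyTorch sketch of one multimodal layer, with a standard multi-head attention standing in for SP-Block (the sparsification is omitted, as noted above); the dictionary interface, head count, and the sharing of one cross-attention block per target modality are our assumptions.

```python
import torch
import torch.nn as nn

class SPTransformerLayer(nn.Module):
    """One layer of Input, Cross, and Self Attention over a set of modalities."""

    def __init__(self, modalities, d_model, n_heads=4):
        super().__init__()
        self.modalities = modalities
        mk = lambda: nn.ModuleDict(
            {m: nn.MultiheadAttention(d_model, n_heads, batch_first=True) for m in modalities})
        self.input_attn, self.cross_attn, self.self_attn = mk(), mk(), mk()

    def forward(self, h, x):
        # h[m]: (B, L_h, d) hidden states, x[m]: (B, L_x, d) input sequence per modality
        h_hat = {m: self.input_attn[m](h[m], x[m], x[m])[0] for m in self.modalities}
        h_bar = {}
        for m in self.modalities:
            others = [self.cross_attn[m](h_hat[m], h_hat[o], h_hat[o])[0]
                      for o in self.modalities if o != m]
            h_bar[m] = torch.stack(others).sum(0)  # sum of cross-modal hidden states
        return {m: self.self_attn[m](h_bar[m], h_bar[m], h_bar[m])[0]
                for m in self.modalities}
```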
3.6 MultiModal Fusion Experiments
Table 3.4: Ablation study on SP-Transformer with the Aligned (A) and Un-Aligned (UA) CMU-MOSEI dataset, using different model structures (“Serial”, “Concat.”) and parameter-sharing strategies (“Cross NS”, “Layer NS”, “Modal S”, “All Share”). Unimodal (“U”) are models trained with only text features. [1] Multimodal Performer (“MulP”) [66]. [2] Multimodal Transformer (“MulT”) [284].
               MOSEI-A          MOSEI-UA
               Acc     F1       Acc     F1       Size
Ours           82.6    82.8     82.4    82.7     154K
MulP[1]        82.0    81.8     81.3    81.2     1.56M
Full attn.     82.5    82.4     82.2    82.4     154K
Serial         82.3    82.3     81.9    82.2     154K
Concat.        82.6    82.9     82.4    82.6     322K
Cross NS       82.6    82.8     82.4    82.6     168K
Layer NS       82.6    82.7     82.3    82.5     545K
Modal S        81.4    81.3     81.5    81.9     70.4K
All Share      81.4    81.8     81.0    81.5     44.8K
MulT[2] (U)    -       -        77.4    78.2     430K
MulP (U)       -       -        77.2    78.1     430K
SPT (U)        -       -        80.7    80.9     45.5K
MulT-SP (U)    -       -        77.8    78.4     430K
We perform an ablation study on SP-Transformer and quantitatively evaluate the memory use, inference time, and training time of our model, as well as the benefit of the multimodal cross-attention block. Results of the ablation study are available in Table 3.4.
Ablation experiment on network structure We modify and experiment with the structure of SP-Transformer described in Section 3.5. We experiment on multimodal interactions with a Serial model and two variations of the Concurrent model, with summation (“Ours”) and concatenation on the output of the Cross Attention sub-layer. Summation uses half the parameters compared with concatenation, with nearly identical accuracy. The concurrent structure improves accuracy compared with the serial structure, which could be due to the richer multimodal interactions at every layer.
Parameter sharing We analyze the influence of parameter-sharing strategies on model performance. We perform experiments on a model that does not share parameters across layers (“Layer NS”) and within the cross-attention sub-layer (“Cross NS”). Our results indicate that parameter sharing can decrease the model size by 71% with negligible impact on model accuracy. Layer-wise parameter sharing improves performance; this could be due to the fact that sharing reduces the risk of over-fitting. This is in accordance with the results reported in [137].
We test two additional sharing strategies: sharing parameters between identical block types for the same modality (“Modal S”) and for all SP-Blocks (“All Share”), across all layers and sub-layers. Due to the difference in the dimensionality of the sequence between each modality, we use a linear projection to map the audio, video, and text inputs to d_model. Results show that further sharing reduces the size of the model by 70% compared with our model, with a 1.5% relative reduction in model accuracy. The trade-off between accuracy and model efficiency can be adjusted depending on the use case. The results demonstrate that, in our approach, parameter sharing has a small effect on model accuracy.
Unimodal experiment We train SPT on the CMU-MOSEI unaligned dataset on a
single modality of text. Results for the unimodal setting use a suffix “U” and can be found
in Table 3.4. We compare SPT with the result reported by MulT (“U”) for a Unimodal
setting. We also train a model that replaces the Transformer block with SP-Block in MulT
(“MulT-SP”). SPT (“U”) uses an Input Attention block followed by a Self-Attention block
in each layer. Results show a substantial difference in the performance for SP-Block. The
difference between MulT-SP and the unimodal SPT is the downsampling by Conv1D as opposed to the compression by the Input Attention block. The advantages of SP-Block lead to a 3.3% increase in performance and an 89% reduction in parameters.
Comparison with Performer To compare our method with state-of-the-art efficient Transformer-based architectures, we evaluate Performer [66] using the same architecture as MulT with Performer layers, in both the multimodal (“MulP”) and unimodal (“MulP (U)”) settings. Results indicate that our method can improve efficiency without a loss of accuracy, unlike Performer.
Figure 3.10: Multimodal Transformer (“MulT”) from [284]. Performer (“MulP”). SPT
(“Ours”) with variable compression factor S. From left to right: (1) Test on CPU inference
time. (2) Test on memory use.
Efficiency test is performed on the inference time and memory use of our model for different input sequence lengths, and we compare to MulT and a Multimodal Performer (“MulP”). Other state-of-the-art methods, MAG [241] and MISA [117], use a Transformer and share the same quadratic complexity as MulT. We keep the same d_model = 32 and λ = 4 layers for all models. Detailed results on inference time and memory use are found in Figure 3.10.

Table 3.5: Comparison between our model, MulT and Performer (“MulP”) in training time (seconds per epoch) for the maximum batch size and for different compression ratios S, with r fixed to 8.

          MOSI-UA    MOSEI-UA    UR-FUNNY
S = 2     9.8s       137.6s      119.9s
S = 4     4.9s       70.2s       72.6s
S = 8     2.5s       37.5s       48.2s
MulT      14.5s      192.7s      171.1s
MulP      6.0s       65.4s       70.7s
Results show that our model achieves linear complexity O(rL/S) in both memory use and inference time with respect to the sequence length L, with a slope determined by the compression ratio S and the segment length r. The improvement is a result of the downsampling from the Input Attention, the sparse attention from the sampling function, and a simplified model structure.
We test training time on the unaligned CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets. All experiments use the largest batch size that can be executed on a single NVIDIA Tesla V100 with 16GB vRAM. Results are listed in Table 3.5. With a compression factor S = 8, our method reduces the training time per epoch by 83%, 81%, and 72% on MOSI, MOSEI, and UR-FUNNY respectively compared to MulT, and by 58%, 43%, and 32% compared to MulP. The improvement in training time is a result of the auto-encoding compressive properties of SP-Block and the ratio S.
Figure 3.11: Sample efficiency test on unaligned CMU-MOSEI dataset in comparison with
MulT. We gradually increase the size of the training set and use the same training set for
both models for consistency.
3.7 Related Work
3.7.1 First-Stage
Our method focuses on improving two areas of the first-stage model: the quantization bottleneck and the uninformative top-level prior. First, we discuss how our work is closely related to previous studies on feature decomposition and quantization optimization in visual tasks using DNNs. Second, we discuss improvements in the informativeness of top-level codes in the context of VQ-DNNs.
Feature decomposition approaches include Separable Convolutions (SP) [198], which factorize a convolutional kernel along the spatial dimensions. SP reduces the number of computations required to calculate the filter output. Inception factorizes a feature representation implicitly with a “Network-in-Network” (NiN) [183] branch of convolutions. Thus, Inception learns spatial cross-correlations and feature cross-correlations independently. Depth-wise Separable Convolution (DSC) [65, 278] uses a single “spatial” convolution followed by multiple vanilla convolutions on a decomposed “segment”. Xception, based on the “Inception Hypothesis” of a decoupled space, applies DSC to an eXtreme. Our work with Depthwise Quantization follows a similar hypothesis to DSC and considers modeling cross-channel correlations and spatial correlations independently.
There are analyses of the implicit decoupling of the feature space in the context of DSC as well as in NiN architectures. Blueprint Separable Convolutions (BSConv) [112] have been proposed as an alternative to DSC based on the observation of intra-kernel correlations. They propose a point-wise (1x1) convolution followed by a depth-wise convolution, as opposed to the reverse order used by DSC. In contrast, DSC enforces cross-kernel correlations implicitly. Analysis of the variance of a convolution kernel shows that DNNs can perform better when cross-kernel redundancies decrease.
Other work explicitly factorizes a convolutional filter. There are methods that use low-rank approximation [136] or closed-form decomposition [108] on pretrained networks to speed up the computation. Work on speeding up networks [275] has used product quantization to quantize convolutional filters and take advantage of the redundancies. Previous analysis on the redundancy and cross-correlation of the feature space in DNNs is complementary to our work.
Improvements in quantization learning approaches include Optimized Product Quantization [99], which decomposes the feature vector in a parametric manner. Additive Quantization [18] improves on the computational efficiency of PQ for high-dimensional vector search and decomposes the vectors into a sum instead of a concatenation of their sub-vectors. In contrast, DQ can be applied to feature tensors, and the quantizer is trained end-to-end with a DNN.
Kobayashi et al. [163] train a quantizer end-to-end with a DNN. They use multiple codebooks and train each codebook independently for a different supervised task. In contrast to
our method, the codebooks are decoupled in a supervised manner. Moreover, the quantized
representations are used by different networks for different downstream tasks as opposed to
interacting for a single downstream task. Lastly, vector decomposition is applied to feature
vectors as opposed to feature sub-tensors in our work.
PQ-VAE [311] also decomposes the latent representations into sub-vectors and uses a different quantizer for each sub-vector. Kaiser et al. [147] introduce “sliced quantization”, which is identical to PQ-VAE but uses the discrete representation post hoc with a latent variable model. By contrast, DQ decomposes the feature space into sub-tensors as opposed to sub-vectors and thus better models the statistical dependence within each sub-tensor.
There is work that improves the informativeness of the quantized discrete representations. VQ-VAE-2 [246] applies quantization to the feature representation of each hierarchy of a VAE. VQ-VAE-2 can suffer from an uninformative top prior, and a subsequent model, “Jukebox” [76], mitigates the issue by modeling each hierarchy with an independent encoder-decoder architecture. In contrast to VQ-VAE-2, we use a CTD model for the second stage and apply an intermediate very-deep-quantization loss term l_VDQ at preceding levels to avoid the issue of codebook collapse. The loss is calculated between the de-quantized representation and the decoded representation from a preceding hierarchy. Similar to our method, Jukebox addresses the issue of codebook collapse using EMA and random restarts for training the quantization bottleneck. However, we have found the training objective of Jukebox to be unstable and obtain an improvement when applying an intermediate very-deep-quantization loss between hierarchies, l_VDQ.
3.7.2 Second-Stage
Previous work proposed a two-stage training process of a quantizer followed by an auto-regressive model trained on the quantized prior distribution. PixelCNN [214] is a conditional auto-regressive prior that is paired with the quantized latent space produced by VQ-VAE [216]. Similar to the causal masked attention of our method, PixelCNN applies a Gated Convolutional Layer and is conditioned on a vector that is embedded and added to the activations of the convolutional filter. During sampling, PixelCNN predicts one code at a time. VQ-VAE-2 [247] uses a different PixelCNN to train on the latent space of each hierarchy. However, since VQ-VAE-2 suffers from uninformative top-level features, ancestral sampling is challenging, as the conditioning content is noisy.
In contrast to VQ-VAE-2, we use a CTD model for the second stage and take advantage of the improved conditioning between different hierarchies that use l_VDQ. Work such as Jukebox trains an independent Scalable Transformer that uses sparse attention to attend to longer audio-signal latent representations at the second stage and does not consider the interactions between different hierarchies.
Previous works such as CogView [77], DALL-E [243] and Latent Video Transformer (LVT)
[242] perform fusion of conditional information by concatenating auxiliary content in a single
sequence. In contrast, our method conditions on the content using an ensemble model that applies Conditional-Cross-Attention between the original content’s hidden representation and
the previous hierarchy’s embedded representation. VideoGPT [313] is the most similar to
the second-stage of our method as it is conditioned via cross-attention on the previous frame
embedded with a ResNet. In our method, we use the lower granularity latent representation
from an ensemble of CTD models to condition the higher granularity sequence, as opposed
to an end-to-end learned embedding on the original input space.
Work by Yao et al. [318] conditions their generative model on a specific attribute of the training data by adding an auxiliary training objective and uses the classification loss for each specific attribute. However, they train a different Transformer model for each attribute. In
contrast, our second-stage model can use feature tensors as conditioning content and can
scale beyond a single attribute.
3.8 Multimodal learning
Recent work on multimodal sentiment analysis has focused on the application of Transformer architectures. [284] introduces pairwise cross-modal attention on Transformers for
multimodal sentiment analysis on audio, video, and text. Our model’s architecture follows a similar design that first encodes unimodal inputs, then models cross-modal interactions, and finally fuses the multimodal information. The main difference is that our model enables a concurrent way to implement those steps. [117] project multimodal input into modality-invariant and modality-specific spaces, and use a Transformer encoder on the concatenated
projected representations. [241] use pre-trained Transformers like BERT [74], XLNET [317]
on a large corpus and perform transfer-learning for multimodal sentiment analysis.
Multimodal learning Previous work uses recurrent neural networks (RNN) or convolutional neural networks (CNN) on each modality and performs model-based fusion, e.g., kernel-based fusion with graphical models and neural networks, and model-agnostic fusion, e.g., early, late, or hybrid [22]. [304] fuse multimodal representations with a Gated Modality-mixing Network, which models the fine-grained structure of non-verbal subword sequences, and a Multimodal Shifting mechanism, which dynamically shifts word representations based on non-verbal cues. [226] trains a sequence-to-sequence RNN to jointly perform inter-modality translation and sentiment analysis; the encoder output is a joint multimodal representation that is used for sentiment detection. Tensor Fusion Networks (TFN) use the outer product of the representations of each modality, concatenated with a constant value of “1”, to generate a joint representation [321]. [188] propose to decompose the weights of the fusion layers into low-rank factors to reduce the large number of parameters in TFN. [322] use a system of LSTMs to learn modality-specific interactions, learn the cross-modal interactions with an attention mechanism, and finally apply a Multi-view Gated Memory that fuses the multimodal representations through time. [7] project multimodal features to fine-granularity and coarse-granularity “spaces” through Multi-Layer-Perceptron (MLP) networks. Audio and video are aligned with text features in the coarse-grained space, while audio and video features are aligned in the fine-grained space. Unlike the previous methods that directly apply transformations to the multimodal inputs, we use a small sequence of hidden states to capture the features of the multimodal inputs, thus preserving raw input information while improving efficiency.
In the most similar work to ours, [137] distill information from an input sequence into a fixed-length set of hidden states with an Autoregressive Transformer.
Efficient Transformers One of the drawbacks of the Transformer architecture is the
computational cost and the memory footprint of the self-attention mechanism. A number
of efforts have been made to make Transformers more efficient [64, 28, 159, 336, 323]. Such work uses sparse attention to selectively attend to pairs of elements and leads to a reduction in the memory use and computational complexity of the attention mechanism. A notable example, Performer [66], improves the efficiency of the attention computation through an “unbiased” and low-rank approximation of the attention matrix.
There are other attempts at reducing the sequence length to improve computational efficiency. [70] introduce “segment-level recurrence” that recurrently uses the previous and current segments. [240] extend this idea and compress multiple previous segments into memory vectors. The hidden states in our method are based on a similar idea: caching the information from the input sequence in a shorter sequence can improve efficiency. We also apply the idea of sparse attention to achieve further improvements.
3.9 Conclusions
We propose a two-stage method that improves generative modeling. For the first-stage of our
method we introduce a novel architecture that addresses the issue of codebook collapse and
an uninformative discrete prior independently. We improve downstream generative modeling
by exploiting the information overlap between different granularity latent representations.
We use Information Theory to provide a detailed analysis of how our method produces informative latent representations. For the second-stage of our method we introduce an ensemble
of generative models that use a Conditional-Cross-Attention block to train on progressively
longer sequences and induce conditional bias from fixed lower-granularity counterparts. We evaluate the effectiveness of our method on MNIST, CIFAR10, and CelebA and conclude that l_VDQ, an intermediate loss between hierarchies, leads to informative discrete latent codes.
Our improved quantization method DQ mitigates codebook collapse and reduces the complexity of the latent space. Lastly, ancestral sampling using CCA blocks leads to improved
generative modeling performance as evaluated experimentally.
Additionally, a multimodal SP-Transformer that uses a sequence of hidden states to
sample from longer multimodal input sequences outperforms previous work on sentiment
analysis. Compared to the previous Transformer-based models, our model has a reduced
computational complexity through sparse attention. The concurrent structure also enables
more effective capturing of the multimodal interaction, resulting in higher performance. The
optimization through parameter sharing patterns leads to a significantly lighter model, with
a lower number of parameters and improved sample efficiency. As a result, the proposed
model’s performance is superior or comparable to the existing methods at a lower computational and memory cost. Our experiments show that our method has a good balance between
accuracy and efficiency and has the potential to be deployed in real-world multimodal applications.
Chapter 4
Background: Model Comparison
This chapter contains the background necessary for understanding the contributions of our
work in Model-Comparison. In Section 4.1 we explain how a Meta-Model can be constructed
from the learning configurations. Next, in Section 4.2, we discuss how a model can be
evaluated on whether it outperforms another.
4.1 Meta-Model
Configuration Space Let X ∈ R^D define the configuration space for the hyper-parameter variables H = {H_1 . . . H_D}, and let ϕ be a black-box training function with a corresponding performance metric y_i = ϕ(x_i), x_i ∈ X. Each hyperparameter H_i is a source of variance in model performance, such as the random seed, the optimizer, and the learning rate. Let the Performance Surface be Y = ϕ(X), where ϕ, in the context of ML research, would correspond to a training procedure, and Y could correspond to the loss surface or accuracy surface. ϕ can be described as a stochastic process that produces a performance metric. The stochasticity can be observed when performance differs significantly under identical initial conditions, such as re-running the same training process but obtaining different results.
Meta-model can be constructed for a subset of configurable attributes M_A ⊂ H and for multiple models, an encompassing model {M_A, . . . , M_B} = M, where ⋂ M = ∅. For example, model comparison can be performed on the best optimizer H_Optim, where H_Optim ∈ {‘Adam’, ‘SGD’}, for a learning-rate range H_LR ∈ [10^{-5}, 10^{-3}], H_Dataset = CIFAR10 and H_Model = VGG11:

M_Adam = {‘Adam’, H_LR, H_Dataset, H_Model}
M_SGD = {‘SGD’, H_LR, H_Dataset, H_Model}

In the case that H_i is continuous, techniques such as binning can be used to discretize the range into competing models.
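A hypothetical sketch of how competing meta-models could be represented as constraints over the configuration space; the class and field names are our own, and the example values mirror the M_Adam / M_SGD example above.

```python
from dataclasses import dataclass, field

@dataclass
class MetaModel:
    name: str
    constraints: dict = field(default_factory=dict)  # fixed hyper-parameters
    ranges: dict = field(default_factory=dict)        # free hyper-parameter ranges

    def contains(self, config):
        """True if a sampled configuration x_i belongs to this meta-model."""
        fixed_ok = all(config.get(k) == v for k, v in self.constraints.items())
        range_ok = all(lo <= config.get(k, lo) <= hi for k, (lo, hi) in self.ranges.items())
        return fixed_ok and range_ok

m_adam = MetaModel("M_Adam", {"optimizer": "Adam", "dataset": "CIFAR10", "model": "VGG11"},
                   {"lr": (1e-5, 1e-3)})
m_sgd = MetaModel("M_SGD", {"optimizer": "SGD", "dataset": "CIFAR10", "model": "VGG11"},
                  {"lr": (1e-5, 1e-3)})
```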
4.2 Model Evaluation
An ablation study can be used to identify the causal effects of a training hyperparameter through the use of randomized trials on the perturbed hyper-parameters. An ML ablation study is thus a hypothesis, defined as an experiment on the improvement that an architectural component, such as residual connections, or a meta-learning method, such as the optimizer, provides to the performance of the model. Multiple experimental trials are required to improve the statistical power of a test [115], which requires randomly sampling from the configuration space, where a trial is identical to the stochastic process ϕ from Chapter 4.
Reproducibility of an experiment can be defined as the ability to obtain similar experimental results given identical initial conditions. The evaluation of the experiment is the aggregate of multiple trials (stochastic processes) over the hyper-parameter search space. Thus, to define a trial, we have to maintain two states that describe the system at any given point: the initial conditions and the current state.
Evaluation is performed on the final state, on the comparison of the performance surfaces between Y_Adam = ϕ(X_Adam) and Y_SGD = ϕ(X_SGD), where X_Adam = X ∈ M_Adam and X_SGD = X ∈ M_SGD. In the literature, different frameworks for testing hypotheses are used to quantify the risk of accepting the hypothesis, i.e. Adam ≫ SGD.
Average Difference evaluates two competing models on the difference between the means of their performance surfaces, Y_Adam − Y_SGD. For example, an ANOVA test can be used to quantify the risk of accepting that the two means and their variances are statistically different. The test makes assumptions on both the distribution of Y and its variance.
Bayes Factor has been motivated [172, 154, 96, 148, 106] as a metric for model-comparison testing that can be used as an auxiliary to, or in direct replacement of, ANOVA. The Bayes factor is calculated as the ratio of posterior and prior odds and is a similarity metric between two models, BF_{AB} = P(X | M_A) / P(X | M_B). BF is not a hypothesis test on its own, as it can be interpreted as ‘strength of evidence’. [106, 257] suggest different methods for quantifying the risk and provide a hypothesis-testing framework which can be extended beyond the method used to compute BF.
Heteroscedasticity can be empirically observed when, for multiple trials of two configurations x_i, x_j with different learning rates, H_{LR,i} ≪ H_{LR,j} ⟹ Var(y_i) ≪ Var(y_j). We observe that unstable dynamics can occur for a sub-range of hyperparameters and can be explained by the synergistic or antagonistic effects between specific value ranges of the hyperparameters, for example, learning rate and batch size [272], such that in aggregate the performance surface displays non-Gaussian properties (Figs. 5.3a and 5.3b).
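A quick NumPy check for such heteroscedasticity: bin the trials by learning rate and compare the per-bin variances of the metric (the log-spaced binning and the bin count are illustrative assumptions).

```python
import numpy as np

def variance_by_lr_bin(lr_values, metrics, n_bins=10):
    """Per-bin variance of a performance metric across learning-rate bins;
    strongly unequal variances indicate heteroscedasticity."""
    edges = np.logspace(np.log10(lr_values.min()), np.log10(lr_values.max()), n_bins + 1)
    bins = np.clip(np.digitize(lr_values, edges) - 1, 0, n_bins - 1)
    return {int(b): float(metrics[bins == b].var()) for b in np.unique(bins)}
```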
Surrogate Modeling is often used on the performance surface Y for tasks such as
Neural Architecture Search (NAS) or Hyperparameter Optimization (HPO) where the goal
is to predict the performance given a set of hyper-parameters or architectures X . Efficiently
exploring Y in an online manner has been the focus of previous work, where the objective
can be summarized as an approximation to ϕ∗, a meta-model. By extension, post-hoc
analysis using surrogate modeling can be used to analyze the hyper-parameter sensitivity of
ϕ [96]; where model comparison takes place within Y.
Chapter 5
Model Generalization
In this chapter we discuss our ongoing work in model evaluation. We study the conditions under which a learning setting can generalize beyond the dataset and configuration. Our work on model comparison focuses on a theoretical framework for assessing the generalization of the training process, or, as we also call it, a meta-model. In contrast, our work with ABLATOR is an AutoML¹ framework that provides an unambiguous implementation of the training process and helps perform ablation studies on the generalization of a method. The background of Section 4.1 will be necessary to understand the methods in this chapter. The reader should be familiar with our definition of training as a stochastic process, ϕ. In addition, we perform sensitivity analysis of the performance surface Y of ϕ.
1https://www.automl.org/automl/
5.1 Meta-Model Comparison
Figure 5.1: Contour plots of the hypothesis surface S = {S_{H_0}, S_{H_1}, S_{H_2}} for three models {H_0, H_1, H_2}, from left to right; Section 4.2. Models are synthetic Gaussians, and the x-axis and y-axis are ‘artificial’ hyper-parameters. Decision boundaries are denoted by the black line, while the model most likely under the boundary, M̂_B, is annotated on the boundary area.
A Bayesian Surface [96] is constructed by evaluating the Bayes factor from Section 4.2 between two models over the experiment surface; for example, for configurations x_i ∈ X, S_{BF_{AB}} = BF_{AB}(x_i). We extend the concept to a hypothesis surface S for a test T, S_{AB}(x_i) = T_{AB}(x_i). S_{AB}(x_i) can be evaluated by empirical sampling and repeated measurements y_i ∼ ϕ̂(x_i), or approximated by density estimation on X. For our work we use Bayesian Probability for T and surrogate modeling for density estimation (Sections 5.1 and 5.1.1).
We find that the Bayes Factor is sensitive to the definition of the prior, and the power of the test is influenced by the bias in the choice of the prior [160]. We use the Savage-Dickey Density Ratio [305] to compare nested models:

BF_A = \frac{P(M_A \mid \mathcal{X})\, P(\bar{M}_A)}{P(\bar{M}_A \mid \mathcal{X})\, P(M_A)} \qquad (5.1)

where P(M_A | X) is the probability of model A given the hyperparameter configurations X and M̄_A is the set of complementary models.
We define Bayesian Probability (BP) as the posterior, i.e., the probability that a method out-performs within an interval at configuration x_i:

P(A \mid x_i) = P(y_{i,A} > y_{i,\bar{A}} + \delta) \qquad (5.2)

where δ can be used to specify the significance of the improvement of M_A for configuration x_i. We specify a non-informative prior P(A) = |A|^{-1}. For our experiments, we set δ = 0, where, similar to previous work [38], we motivate that a threshold on the improvement of a method should be based on community standards.
5.1.1 Risk Assessment
A hypothesis space S = {S_A . . . S_B} is composed of the hypothesis surfaces calculated between the models M = {M_A . . . M_B} and an encompassing model M using BP. A pseudo-label can be used that corresponds to the hypothesis most likely at x_i, such that m_i = argmax S(x_i). The surrogate-model objective is to maximize the information gain IG(X, H_i) = E[I(X)] − E[I(X | H_i)], where H_i ∈ H is a configurable hyperparameter.
Decision Boundary (B) defines the hyperparameter range in M, B = {max IG(X, H_i) : H_i ∈ M}, with M̂_B = ϕ(B) s.t.

|X ∈ B| < β, to account for Type II errors caused by statistical outliers.

P(M̂_B ∈ B) > α, to account for Type I errors on a confidence interval.

Setting a large α or β will lead to inconclusive results, while setting a small value will lead to misleading evidence. The two can be better understood from Fig. 5.1 and Section 5.2.3, where α controls the number of bounds, while β controls their size. For our experiments, we use α = 0.9 and present results under different β, where, similarly, a threshold should be set at the community level.
Bayesian Risk quantifies the risk of accepting the out-performing model in boundary B, M̂_B, based on the evidence in B for the analysis in S, and is computed as

R_B = \left| \ln \frac{\text{odds of } P(\hat{M}_B; B)}{\text{odds of } P(B)} \right| \qquad (5.3)

where P(M̂_B; B) is the empirical evidence that M̂_B out-performs in B, while the prior P(B) = |B| / |M| is the size of B with respect to the comparison bounds.
For a biased prior, such as selecting comparison bounds for which M̂_B performs well, R_B → ∞. Similarly, for a weak posterior, where there is insufficient evidence that M̂_B is an improvement, R_B → ∞. The sign of the log-odds ratio (before taking the absolute value) can be studied to determine a weak posterior or a biased prior. It is worth noting that R_B does not account for misleading evidence, such as statistical outliers with a large posterior and a small prior; instead, that risk is controlled by β and α.
We use a Decision Tree [224] as the surrogate model for the implementation of B and R_B in our experiments, due to its sample efficiency in modeling the hypothesis space, the interpretability it provides with linear decision boundaries, and its performance relative to alternatives.
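A rough scikit-learn sketch of the surrogate step: fit a decision tree on the hypothesis pseudo-labels and score a risk for each induced boundary. Turning leaf membership into posterior and prior odds in this way is our simplification, not the exact implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def risk_of_boundary(configs, pseudo_labels, best_model, max_depth=3):
    """Fit a decision-tree surrogate on the pseudo-labels m_i and score an
    R_B-style risk, Eq. (5.3), for every leaf (candidate boundary).

    configs       : (N, D) numeric hyper-parameter configurations x_i
    pseudo_labels : (N,) argmax of the hypothesis surface at each x_i
    best_model    : label of the candidate out-performing model M_B
    """
    def _odds(p, eps=1e-12):
        p = min(max(p, eps), 1 - eps)
        return p / (1 - p)

    tree = DecisionTreeClassifier(max_depth=max_depth).fit(configs, pseudo_labels)
    leaves = tree.apply(configs)
    risks = {}
    for leaf in np.unique(leaves):
        in_leaf = leaves == leaf
        posterior = (pseudo_labels[in_leaf] == best_model).mean()  # evidence in B
        prior = in_leaf.mean()                                     # |B| / |M| analogue
        risks[int(leaf)] = abs(np.log(_odds(posterior) / _odds(prior)))
    return risks
```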
5.1.2 Assessing Generalization
We use the definitions by [230] to define reproducibility in the context of criteria that lead to generalization of conditions. Our assessment can be summarized as a robustness test of ϕ, where additional sources of variance should not influence the results, while additionally providing the set of conditions under which a hypothesis is expected to be true, invariant to extraneous factors ξ of the experiment, e.g. ϕ(x_i + ξ) ≈ ϕ(x_i). We categorize sources of variance inherent to the optimization process as H_O, to the training protocol, such as the choice of optimizer, as H_T, to the implementation of the method, such as between studies, as H_I, and to the results of the analysis between studies as H_A. We use ξ = {H_O, H_T, H_I, H_A} as pseudo-hyper-parameters to study their effects via model comparison on ξ, i.e. meta-model comparison:
• Reproducibility Test: F_R = M ∪ H_R
• Replicability Test: F_T = F_R ∪ H_T
• Robustness Test: F_I = F_T ∪ H_I
• Generalization Test: F_A = F_I ∪ H_A
For a method to be Generalizable it would have to be Robust, Replicable and Reproducible.
We perform model comparison to test for low hypothesis strength: for example, comparison
for different random seeds with low hypothesis strength implies that a method is reproducible
(since the hypothesis that the seed will influence the performance is rejected).
Finally, Generalization is assessed by a qualitative comparison between the analyses of studies, since the comparison framework used to produce the analysis can be a source of variance itself. A motivating example of the effect of variance from the analysis of a method is the Jeffreys-Lindley Paradox, where contradictory conclusions would be reached by a Bayesian point-null-hypothesis analysis and a frequentist analysis. To this end, we motivate the use of diversified methods of model comparison, where multiple approaches can provide complementary analyses.
5.2 Meta-Model Experiments
We evaluate Bayesian Probability (Section 5.1) and Risk Assessment (Section 5.1.1) in Section 5.2.1. We evaluate Generalization (Section 5.1.2) in Section 5.2.2 and answer the Adam and SGD dilemma in Section 5.2.3. We evaluate our method on the experimental trial results from [38] and [26]. The concatenated multiverse is composed of 44278 experimental trials across 4 models, ResNet [120], VGG11-VGG16 [269], AlexNet [171], and BERT [74], and 4 datasets, CIFAR10-CIFAR100 [168], RTE [301], and SST2 [273]. Finally, we perform a cross-study comparison between [307, 26] and corroborate it with our own analysis. For reasons of brevity, we present in this section the results most significant in evaluating our method.
5.2.1 Ablation Study
Figure 5.2: Evaluation of Bayesian Probability (Ours), ANOVA and Prob. Out-Performing [38] in predicting the out-performing model on a hierarchical mixture model. Our method outperforms when evaluated with ROC-AUC in determining the best-performing model (left), has a lower detection error score (middle), and provides better-calibrated probabilities (right).
We compare our method, Bayesian Probability, with Average Difference and Probability
of Out-Performing [38]. We use a synthetic dataset for which we have an analytical solution
for the performance surface and the best performing model.
Our method, Bayesian Prob., outperforms Prob. Out-Perf. by 6.9% in ROC-AUC and 68% in Calibration Error. Similarly, we outperform ANOVA by 5.7% in ROC-AUC and by a 50× difference in Calibration Error. The lower calibration error signifies less sensitivity to a threshold, such as the ‘p-value’ or γ [38] (Fig. 5.2).
We evaluate R_B in predicting the risk of hypothesis hacking on a synthetic dataset where we purposely manipulate the comparison bounds to favor one method over another. Identifying h-hacking with R_B achieves an AUC-ROC score of 0.99, which means that we would reject the result based on the risk of manipulation.
For a practical problem for which we have an approximation of the performance surface, we use 100 experimental trials where a VGG16 is trained on CIFAR10. We use the complementary 13136 trials as an approximation of the true model performance. Due to the empirical approximation, the Calibration Error is not meaningful for evaluation. Additionally, we perform model comparison on H_LR for 10 uniformly discretized learning-rate intervals used to train SGD, where the hypothesis would answer which learning-rate interval is best to use. The AUC-ROC performance is 0.9 for Bayesian Prob., 0.65 for Prob. Out-Perf., and 0.59 for ANOVA.
5.2.2 Reproducibility
We assess the generalization conditions on the choice of learning rate for SGD. We assess
reproducibility under different weight initializations using 8977 trials of a BERT model
trained on SST2 with different random weight initializations: F = {HLR, ‘SGD’, ‘BERT’, ‘SST2’}
and ξ = {Hweight init}. We evaluate replicability under different models, AlexNet, ResNet
and VGG11, and datasets, CIFAR10 and CIFAR100: F = {HLR, ‘SGD’} and ξ = {Hdataset, Hmodel}.
We evaluate robustness for different implementations: F = {HLR, ‘SGD’, ‘CIFAR10’}
and ξ = {‘Bouthillier et al.’, ‘Bell et al.’}.
Table 5.1: Assessing Generalization by evaluating the choice of learning rate of SGD for
different Implementations (Imp.), Datasets (Vision, Text), Models, and Random Initializations
(Init.). A Risk value smaller than 1 would be significant evidence that the method is
invariant to the type of noise introduced, but the counter-argument cannot be determined
by a single test.
Reproducibility Test              Variance - ξ                        RB < 1
Robust (Imp.)        HI           Bell et al., Bouthillier et al.     1.4948
Replicable (Text)    HData        SST2, RTE                           1.6085
Replicable (Vision)  HModel       AlexNet, ResNet, VGG16              0.2443
                     HData        CIFAR10, CIFAR100                   0.1142
Reproducible (Init.) Hweight init 0, 1, 2                             0.0028
Our results presented in Table 5.1 indicate that SGD is not Robust against the implementation. Indeed, there are differences between the implementations of [26] and [38], where, among
other differences, VGG11 and VGG16 are used by each work, respectively. The objective of the test is to
identify differences between implementations by evaluating when a method satisfies the test,
and, as such, the intersection of the formulæ would yield an invalid analysis. Similarly, for
SST2 and RTE, the optimal learning rate for the two datasets is different. Indeed, BERT is
sensitive to the learning rate and hyperparameters for different datasets [276].
5.2.3 Generalization
Figure 5.3: (a) Experiment on batch-size and learning-rate generalization [272] for a VGG16 trained on CIFAR100. Large learning rates lead to unstable
dynamics for smaller batch sizes (bottom-right corner), but stability improves for
larger batch sizes at larger learning rates. Top-performing trials are in the interaction between
the two, for learning rate 10^-2 and batch size 2^6.
(b) Comparison of the performance between
SGD and Adam for different learning rates. The
learning rate is discretized into 10 intervals. For
the experiments in Section 5.2.2, we perform a
model comparison between learning-rate intervals. For the experiments in Section 5.2.3 we
perform a model comparison between Adam and
SGD. It can be observed that the variance of
performance differs across learning rates
and between methods (heteroscedastic), while
the performance appears to be non-Gaussian.
From the figure it can be difficult to determine
the best performing method, as each method
performs better over different intervals.
We test the ability of the results to generalize beyond the analysis of a single study and
compare Adam and SGD for different learning rates: F = {Hoptim, CIFAR10, HLR} where
ξ = {‘Formulaic’, ‘Bell et al.’, ‘Wilson et al.’}. Furthermore, we evaluate RB where we
purposely manipulate the comparison bounds to bias SGD and present the results in Table 5.2.
A Shapiro test with p-value ∼ 0 suggests non-Gaussian residuals and distributions, and
Fig. 5.3b confirms the test qualitatively. Ignoring the Shapiro test and performing ANOVA
would suggest a significant difference with a p-value ∼ 0 and would have us conclude that
SGD outperforms Adam by 18.5%. Additionally, the figure motivates the question ‘which
one is better?’, where the answer depends on the learning rate. We examine two approaches
to answer the question.
1. Assume we randomly sample a learning rate from HLR; evaluate the probability that
Adam outperforms SGD.
2. Given the best-performing intervals BAdam ⊂ HLR s.t. Adam outperforms SGD and BSGD ⊂
HLR s.t. SGD outperforms Adam, is YAdam ≫ YSGD for YAdam ∈ BAdam and YSGD ∈ BSGD?
The case for Item 1 is that for many problems, we have a limited tuning budget. The case
for Item 2 is that we have a good guess of reasonable defaults or a large tuning budget.
Evaluating under Item 1, we find that SGD outperforms Adam with P(Adam) = 0.15 and
P(SGD) = 0.53 at β = 0.2. Using β = 0.01 we find misleading evidence, where for a single
configuration, Adam performs close to 0.92.
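As a minimal sketch of the evaluation under Item 1 (not the implementation used in this work), the snippet below assumes that experimental trials are available as (method, log learning rate, accuracy) tuples; it discretizes the learning rate into intervals and estimates the probability that a uniformly sampled interval places a method within a fraction β of the best observed performance. The function name, the toy trials, and the interpretation of β as a relative margin are illustrative assumptions.

import numpy as np

def prob_within_beta(trials, beta=0.2, n_bins=10, n_samples=1000, seed=0):
    # trials: list of (method, log_lr, accuracy) tuples.
    rng = np.random.default_rng(seed)
    methods = sorted({m for m, _, _ in trials})
    log_lrs = np.array([lr for _, lr, _ in trials])
    edges = np.linspace(log_lrs.min(), log_lrs.max(), n_bins + 1)
    best = max(acc for _, _, acc in trials)
    probs = {}
    for m in methods:
        hits = 0
        for _ in range(n_samples):
            b = rng.integers(n_bins)  # uniformly sample a learning-rate interval
            in_bin = [acc for meth, lr, acc in trials
                      if meth == m and edges[b] <= lr <= edges[b + 1]]
            if in_bin and max(in_bin) >= (1 - beta) * best:
                hits += 1
        probs[m] = hits / n_samples
    return probs

# Toy usage with synthetic trials (hypothetical numbers for illustration only).
toy = [("SGD", -2.0, 0.86), ("SGD", -1.0, 0.70),
       ("Adam", -3.0, 0.83), ("Adam", -1.0, 0.60)]
print(prob_within_beta(toy, beta=0.2, n_bins=4))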
Table 5.2: Evaluation of RB (Section 5.1.1) when manipulating the comparison bounds (top row)
and for different thresholds β. Setting a large β = 0.5 leads to inconclusive evidence, with a
single B, while a small β = 0.01 leads to misleading evidence from outliers. Manipulating B
results in a large risk RB, while a careful selection of β can lead to an unbiased evaluation
of RB. Finally, too large a β leads to inconclusive results with a single method performing best.
B (β)    M̂_B    RB      YH (%)       P(B)    Result
h-hack   SGD     ∞       87 ± 0.14    1.00    Biased
0.01     Adam    0.11    66 ± 1.4     0.23    Mislead
         Adam    0.01    92 ± 0.0
         Adam    0.01    88 ± 0.8
         SGD     0.05    85 ± 0.4     0.55
         SGD     0.12    89 ± 0.9
0.2      Adam    0.05    83 ± 1.5     0.15    Unbiased
         SGD     0.17    86 ± 0.9     0.53
0.5      SGD     0.24    87 ± 1.1     0.56    Inconc.
[307] suggest that SGD outperforms adaptive optimizers based on a comparison of the
best-performing experimental trials. Using the same criteria as [307], our analysis yields
contradictory results. The evaluation based on Items 1 and 2 agrees with [307, 26] that SGD
performs better than Adam, by ∼ 3% at β = 0.2.
5.3 Stateful Experiment Design
Figure 5.4: On the left is the iterative process of conducting ML experiments with ABLATOR,
where the only user inputs are the method implementation (‘Method.py’) and the ablation
configuration file (‘Config.yaml’). ABLATOR automatically generates analysis artifacts sufficient to update the hypothesis and allow rapid prototyping. ABLATOR handles the horizontal
scaling of experiments and is fault tolerant. Experiment Persistence allows the experiment
to be reproduced independently of the original environment/cluster. On the right is the process
without ABLATOR, where the user must manually select configurations and manage the execution,
consolidation, and analysis of artifacts, which is error-prone and cumbersome.
An experimental trial can be defined as ϕ (Section 4.1), a stochastic process. To implement
an experiment design under an ML system, it is necessary to formalize this definition so as to
avoid errors stemming from ambiguity. In our previous work [91] we find that such ambiguity
can lead to errors that result in unreproducible results.
To this end, we define an Experiment as a stateful process. The initial state of the
experiment is defined by the configuration and the implementation (methodology) of the
experiment, the intermediate state by the checkpointing mechanism of the experiment, and the final state of the experiment by the analysis artifacts. This paradigm has
the advantage of producing analysis artifacts sufficient to aid in an informed decision on an
experimental hypothesis and to improve prototyping speed.
distributed.yaml
total_trials: 10000
experiment_type: random / tpe
metrics: [[val_loss, min]]
tune:
  train_config.optimizer_config.name: ["adam", ....
  train_config.dataset: ["year", "yahoo", "helena", ...
  model_config.mask_type: ["mix", "global", "full", "random"]
  model_config.residual: [True, False]
  model_config.random_mask_alpha: [0.5, 1]
base_config.yaml
train_config:
  dataset: adult
  optimizer_config:
    name: adam
model_config:
  mask_type: random
@configclass
class ModelConfig(ModelConfigBase):
    residual: bool = True
    d_out: Annotated[ty.Optional[int], Derived] = None
    mask_type: MaskType = MaskType("random")

@configclass
class RunConfig(MPConfig):
    model_config: ModelConfig
    train_config: TrainConfig
    tune: Dict[str, List[Any]]
    experiment_type: Annotated[ExperimentType, Stateless] = ExperimentType.tpe
Figure 5.5: ABLATOR provides a configuration system specific to ML experiments, where
it has to encompass multiple trials under one definition. On the left is an illustration of
the configuration for distributed execution (distributed.yaml) and method prototyping
(base_config.yaml). On the right, the configuration is type-checked by the ABLATOR library. In Section 5.4.1 we evaluate the effect of sampling bias from the hyperparameter
selection strategy TPE [32]. The configuration is compact and unambiguous at initialization, supporting a stateful experiment design paradigm.
5.3.1 Initial State
Configuration describes the hyperparameter search-space from which the hyperparameters are sampled. We define two variable types, Stateless and Derived, to describe attributes to which the experiment state is agnostic, while Stateful attributes are the experimental
control variables. Stateful attributes require an assignment during the initialization stage.
Stateless is a type that can be used as a proxy for variables that can take different value
assignments between trials or experiments. For example, the learning rate can be set as
an independent variable and must be defined as stateless. Additionally, there are variables
that take different values between experiments and trials to which the state is agnostic;
for example, a random seed or a directory path that differs between execution environments can be
annotated as stateless.
Derived attributes are undecided at the start of the experiment and do not require a
value assignment. Instead, the value is determined by internal experiment processes that can
depend on other experimental attributes, such as the dataset. However, given the same initial
state, the attribute is expected to result in the same value and is therefore deterministic.
For example, the input size used in a model’s architecture, which depends on the dataset, would
be annotated as Derived during the experiment design phase.
The type system is specific to ML systems where a configuration may have to describe
a search-space that encompasses multiple trials, as opposed to taking on a specific value
assignment at initialization. Additionally, an ML experiment can have attributes that are
difficult to model at initialization but can be inferred during execution. For a stateful design
paradigm, the configuration should be unambiguous at the initialization state.
Implementation describes the methodology of the hypothesis. Invariance of the implementation w.r.t. the method evaluated produces a single code artifact that encapsulates
all methods, i.e. a single code base for both using and not using residual connections. The implementation computes one or more evaluation metrics. Lastly, the implementation should
provide a deterministic value assignment to the variables we defined as Derived.
Implementation invariance provides a compact representation and is robust to errors. A
compact representation provides ease of use as a consequence of a shared implementation
among the ablated components, where the differences are specified through the configuration
and applied by conditional if statements. The advantage of this approach is that the
performance variance caused by implementation differences is minimized, where even the
order of matrix multiplication can have significant effects on the method performance [339].
5.3.2 Intermediate States
Experiment state can be Running or Complete, as the aggregate of the states of all experimental trials. Each trial can be in three additional states: Pending, Failed or Pruned.
Pending trials are defined by their initial conditions alone, i.e. the sampled hyperparameters. A Running trial extends the definition to include a checkpoint. Complete trials
extend the definition to include one or more metrics, such as the validation loss. Pruned
and Failed trials are the result of irrecoverable errors during initialization or execution. A
fault-tolerant strategy reschedules trials with recoverable errors as Pending and attempts
to resume them from the checkpoint. A long-running experiment can be interrupted (i.e. resource
maintenance) while errored trials do not interfere with the results (i.e. failed trials due to
recoverable errors).
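A minimal sketch of the trial-state bookkeeping described above, assuming hypothetical class and function names (this is not the ABLATOR API):

from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional

class TrialState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    PRUNED = "pruned"
    FAILED = "failed"

@dataclass
class Trial:
    hyperparameters: Dict             # initial conditions: the sampled configuration
    state: TrialState = TrialState.PENDING
    checkpoint: Optional[str] = None  # path to the latest checkpoint, if any
    metrics: Dict = field(default_factory=dict)

def handle_error(trial: Trial, recoverable: bool) -> Trial:
    # Fault tolerance: recoverable errors reschedule the trial as PENDING so it
    # can resume from its checkpoint; irrecoverable errors mark it FAILED.
    trial.state = TrialState.PENDING if recoverable else TrialState.FAILED
    return trial

def experiment_state(trials: List[Trial]) -> str:
    # The experiment state is the aggregate of the states of all trials.
    done = {TrialState.COMPLETE, TrialState.PRUNED, TrialState.FAILED}
    return "Complete" if all(t.state in done for t in trials) else "Running"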
Checkpoint describes the optimization state of a trial and contains sufficient information
to resume execution. ABLATOR stores the model weights, optimizer, scheduler, and training
meta-data, such as the current training iteration, using a compact representation. The checkpoint
mechanism in ABLATOR can be extended to support custom use cases (i.e. RL) important
in resuming the method. Lastly, maintaining the state of the experiment requires keeping
track of the checkpoints and results. Multiple checkpoints are stored locally on each node
and can be synchronized with cloud storage. As a result, the experiment is agnostic to the execution
environment; we refer to this as experiment persistence.
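A minimal sketch of such a checkpoint in PyTorch, assuming illustrative function names and stored keys (not ABLATOR's internal format):

import torch

def save_checkpoint(path, model, optimizer, scheduler, iteration):
    # Store the optimization state needed to resume a trial.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict() if scheduler is not None else None,
            "iteration": iteration,
        },
        path,
    )

def load_checkpoint(path, model, optimizer, scheduler):
    # Restore the trial to the stored training iteration.
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    if scheduler is not None and state["scheduler"] is not None:
        scheduler.load_state_dict(state["scheduler"])
    return state["iteration"]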
5.3.3 Final State
Analysis that is actionable is the result of automation that provides sufficient artifacts to
support decision making. The artifacts should facilitate a quick and informed decision
on the likelihood of the hypothesis. The experiment state is used to infer the hypothesis,
i.e. ‘what are we ablating?’, and the conclusiveness of the analysis, i.e. ‘did the trial fail?’. The
analyses ABLATOR provides infer the search-space, such as the control and independent variables,
from the configuration and the variable types to produce the corresponding artifacts. The
artifacts produced address common problems in evaluating ML methods (Section 5.4.2). For
each attribute, the goal is to encapsulate the best, average, variance and distribution of the
performance metric under a single figure; i.e. Section 5.4.2 and Fig. 5.7.
5.3.4 ABLATOR
We implement our paradigm in a tool we open-source, ABLATOR. ABLATOR is designed in
Python with support for PyTorch models, while the distributed execution system uses
Ray Core [207]; Figure 5.4. We describe the features of ABLATOR important in addressing
a stateful experiment paradigm. ABLATOR can be extended or customized for a specific
use-case following an object-oriented design, with access to method overriding without loss
of automation. The features of ABLATOR provide ease of use, where it only requires defining
an experiment through an implementation and a configuration. Automation is supported by
providing an abstraction layer over distributed execution with fault tolerance, artifact consolidation, and analysis. Our framework is agnostic to the execution environment and can run
on a laptop or a cluster of nodes.
class MyModelWrapper(ModelWrapper):

    def config_parser(self, config: RunConfig):
        config.model_config.d_out = self.dataset.d_out
        return config

config = RunConfig.load("config.yaml")

model = MyModelWrapper(model_class=Transformer)
# Prototyping
trainer = ProtoTrainer(model=model, run_config=config)
# Distributed Execution
trainer = ParallelTrainer(model=model, run_config=config)
trainer.launch()
class Transformer(nn.Module):
    def __init__(self, config: ModelConfig):
        super().__init__()
        self.residual = config.residual

    def forward(self, x: Tensor) -> Tensor:
        for layer in self.layers:
            x_prime = layer(x)
            if self.residual:
                x_prime = x_prime + x
            x = x_prime
        return x
Figure 5.6: Illustration of the ABLATOR implementation used for our experiments. On the
left is the code specific to the ablation experiment, where we use a ProtoTrainer with built-in training mechanisms, while the ParallelTrainer is used to scale the model to a large
cluster of nodes. On the right is the PyTorch model from the official implementation of FT-T
[105] that uses the configuration from Fig. 5.5. Minimal changes to Transformer were required
to evaluate our hypothesis.
Implementation A Trainer class manages the physical resources of the experiment.
There are three options according to the use case: ProtoTrainer for prototyping in a local
environment, DistributedTrainer for vertical scaling, and ParallelTrainer for horizontal
scaling of a single experiment. ParallelTrainer is unique to ABLATOR, where multiple trials
are managed and executed in parallel. Moving from prototyping to experiment deployment requires a
single change (Fig. 5.6).
Artifact Persistence For every resource node, trials are executed in parallel, and
a failure in a single trial does not interrupt the experiment. We use the master
node to maintain the experiment state (Section 5.3.2) and synchronize the artifacts of all
nodes with a central database. Cloud compute nodes are often ephemeral, and restarting
the experiment only requires that the files be synchronized between the central storage
and all nodes. Furthermore, the files stored in the central storage are sufficient to perform
an analysis or recover from errors.
Analysis Artifacts are specific to numerical and categorical attributes. The
attribute type is inferred from the configuration. Figures are artifacts that summarize the
mean, best, and distribution of a performance metric. For numerical attributes, we use
scatter plots with optional interpolation curves, while for categorical attributes we use violin plots.
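A minimal sketch of how such figure artifacts could be produced, assuming the consolidated trial results are available as a pandas DataFrame with one row per trial; the function and column names are illustrative and not part of ABLATOR:

import matplotlib.pyplot as plt
import pandas as pd

def plot_attribute(df: pd.DataFrame, attr: str, metric: str = "val_acc"):
    # One figure per attribute: violin plot for categorical attributes,
    # scatter plot for numerical ones.
    fig, ax = plt.subplots()
    if df[attr].dtype == object or df[attr].dtype.name == "category":
        groups = [g[metric].values for _, g in df.groupby(attr)]
        ax.violinplot(groups, showmeans=True)
        ax.set_xticks(range(1, len(groups) + 1))
        ax.set_xticklabels([str(k) for k, _ in df.groupby(attr)])
    else:
        ax.scatter(df[attr], df[metric], alpha=0.5)
    ax.set_xlabel(attr)
    ax.set_ylabel(metric)
    return fig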
5.4 ABLATOR Experiments
5.4.1 Experiment Definition
ABLATOR requires the configuration and the implementation. We extend the implementation of
FT-Transformer (FT-T)² [105] with minimal changes to the original code. We implement a
model we call ‘Tablator’ and evaluate all the design components of FT-T as well as the effect
of Residual Connections [119] and Attention Masks inspired by BigBird [324]. We evaluate
‘Full’, ‘Mixed’, ‘Global’, and ‘Random’ attention mechanisms.
We evaluate 14 model hyperparameters and components in total, and evaluate the effect that model capacity, dropout hyperparameters, prenormalization, weight initialization, and
activation function have on model performance. Additionally, we evaluate 7 dataset
preprocessing techniques and training configurations: feature encoding methods, missing-value
imputation, feature normalization, training time, and optimization.
² https://github.com/Yura52/tabular-dl-revisiting-models
CA ↓ AD ↑ HE ↑ JA ↑ HI ↑ AL ↑ EP ↑ YE ↓ CO ↑ YA ↓ MI ↓
FT-T 0.459 0.859 0.391 0.732 0.729 0.960 0.898 8.855 0.970 0.756 0.746
Tablator 0.535 0.856 0.368 0.718 0.723 0.921 0.896 8.778 0.930 0.780 0.749
Table 5.3: We compare the best trial for each dataset between the benchmark results of
FT-Transformer (FT-T) and Tablator (Section 5.4.1). The FT-T used to obtain the benchmark
results lies in the subspace of configurations of Tablator; however, the discovered best configuration differs between the two. Thus, for some datasets Tablator outperforms, while for
others it underperforms, a result that is a consequence of stochastic dynamics during training.
The differences between ‘Tablator’ and FT-T are an additional module for Attention
masks, which requires 9 additional lines of code, as well as 2 inserted lines of code for residual
connections. Additionally, we replace the initialization function with a configuration class
argument as opposed to multiple keyword arguments. The majority of the development
effort was directed towards making the original dataset performant and converting it to a
PyTorch Dataset as opposed to a Python dataclass. We define the tunable configurable
hyperparameters as shown in Fig. 5.5.
We first verified our implementation with a ProtoTrainer in this section and then were
able to scale using a ParallelTrainer to thousands of trials for our results in Section 5.4.2.
For this experiment, it took significantly more time to write the current section of this paper
than it took to write the code and start the execution of the experiments.
5.4.2 Ablation Study
We use ‘Tablator’, the ablation model from Section 5.4.1, to evaluate possible improvements
in data processing, the Transformer model architecture, and the effect of training hyperparameters on 2337 trials, where the previously largest ablation on tabular datasets is 2,000 trials
[341]. Our results are summarized in Section 5.4.2 and Fig. 5.7. In Section 5.4.2 we report
the Accuracy, where higher is better ↑, and the root mean-squared error (‘RMSE’), where lower
is better ↓, on 11 datasets: CA [218], AD [164], HE [111], JA [111], HI [21], AL [102], EP
[105], YE [33], CO [36], YA [50], MI [235], identical to the benchmark of FT-T [105]. We
find Tablator performs similarly in most datasets and outperforms in others. The goal of
the benchmark comparison is to verify our implementation, while the goal of our study is to
evaluate general methods that work best across datasets, not a benchmark improvement.
Similarly to FT-T [105], we conclude that the simplest methods work best in most general
cases, i.e. SGD [253] with momentum has the best mean performance on 9 of 11 datasets.
For more complex methods, there is a large variance in the performance of the method
between datasets.
For example, we find that RAdam [186] ranks on average 2.71 for classification datasets but
3.75 for regression datasets when evaluated by the mean performance.
Figure 5.7: Evaluation of the effect of a larger model for the regression
datasets, where RMSE ↓ is normalized for the relative difficulty of each
dataset. Larger models perform better but with higher variance.
Additionally, more complex methods may result in
the best performing trial but perform worse on average: RAdam ranks on average 2.25 when
evaluated on the best-performing trial for regression
datasets (compared to 3.75). Our results indicate that
using a complex method may require a large tuning budget to return good results. Additionally, we
conclude that larger models only perform moderately
better (Fig. 5.7).
The high performance variance between different
components on different datasets leads us to conclude that evaluations should be done with multiple
datasets. Additionally, tuning would be required specific to the dataset and component, but simple design
choices, such as SGD and moderate model capacity, can provide a good starting point, while
more complex training configurations can provide trade-offs specific to the use case.
From the median and mean performance observed in our results, we did not find that any
of the preprocessing methods had a consistent, significant effect on model performance.
Our analysis is based on the actionable results produced by ABLATOR. However, additional
experiments and analysis might be required for more conclusive results.
Figure 5.8: Automatically generated analysis artifacts from ABLATOR. On the left is the
normalized accuracy for all datasets, in the middle for ‘CO’ [36], and on the right for AL [102].
It can be observed that the performance metric is heteroscedastic, where ANOVA tests can
be inapplicable. Additionally, what works best for each dataset differs significantly from
what works best across all datasets.
5.5 Related Work
5.5.1 Model Selection
Work by Bouthillier et al. [38] evaluates the effect of variance from experimental conditions
on an ML method and finds that accounting for multiple sources of variance improves the
analysis. They propose an alternative metric to the average difference, the Probability of
Out-Performing. Our results corroborate the observations of [38], and we propose
an improved metric, Bayesian Probability, where we perform model comparison between
multiple models as opposed to two, on an interval-based posterior and using an unbiased
prior (Section 5.1).
Bouthillier et al. [40] show that the analysis is biased by the model comparison method,
where contradictory conclusions can be reached. They propose to evaluate the ‘ranking stability’ and ‘performance distribution’ over different seeds. Similarly, we also observe that
the current comparison between methods is sensitive to the experimental setup, h-hacking. Auxiliary to their work, we propose a method of evaluation using model comparison between
methods across multiple sources of experimental variance, including random seeds, and discuss this further in Section 5.1.2.
[154, 123] extend p-values to Bayesian analysis; [299] propose e-values as an alternative to
p-values; and [38] propose an informed threshold γ, based on a survey of published results, at which
a method is considered an improvement. Work on hypothesis testing is orthogonal to our method and
Bayesian Probability, as it can be applied in extension. Bayesian Probability is a metric for
model comparison and does not provide a test of the significance of the result. Similar
to [38], we motivate that the threshold should be set at the community level. For our
experiments, we analyze Bayesian Probability on the strength of evidence as well as on the
calibration performance, which is invariant to a threshold (Section 5.2.1).
[148] define a mixture of models as a posterior for model comparison. [237] approximate
the posterior by model classification using a neural network trained on simulated data with
predefined priors and approximate the uncertainty. Similarly, we formulate model selection
as a classification problem, but instead use it to evaluate the risk of h-hacking (Section 5.1.1).
In contrast, our posterior is computed on the interval of model performance and on empirical
evidence from the hypothesis surface with an unbiased prior (Section 5.1).
5.5.2 Surrogate Modeling
[254, 26, 96] have used surrogate modeling for model comparison. [26] propose modeling
the performance surface to argue against the evaluation of a single-point comparison. [96]
perform Bayes surface analysis to study aggregate effects as misleading evidence. [85, 176]
and others have proposed to use a Gaussian Process to quantify uncertainty bounds, with
applications to safety control. Similarly, we also find that aggregate effects of a hypothesis
test, including the Bayes factor, can be misleading evidence. In extension, we propose a
quantitative, as opposed to qualitative, analysis of the hypothesis surface, where we evaluate
the strength of a model hypothesis over a range of hyperparameters with the corresponding
risk. Lastly, we use a maximum information gain criterion and formulate the problem of model
selection as a classification problem (Section 5.1.1), surpassing a Gaussian Process in performance.
5.5.3 Hypothesis Analysis and Reporting
[337] assess generalization through a theoretical analysis specific to the method. [141] empirically assess generalization to identify the data, model and optimization properties responsible. They evaluate generalization using the correlation between ‘complexity measures’ of
experimental conditions and the validation performance. [140] assess generalization by
analyzing the disagreement between the predictions of two identically trained models. Similarly, we also identify the sources of variance responsible for the generalizability of a method
and provide a detailed definition in Section 5.1.2. In contrast, we evaluate the effect of experimental conditions on the results of the analysis, as opposed to the validation-set performance,
and test generalization via model comparison between different experimental conditions, as
opposed to a correlation metric with validation performance.
[340, 6, 252, 55] study the effect of random variance from initialization, implementation, or
tooling and propose specific tools to control variance. Similarly, we also study the effect on the
performance of a method from variance in the experimental conditions and corroborate their
analysis. Our work is orthogonal to theirs, as we do not provide tooling for controlling
the experimental conditions; instead, we provide a method to quantify and test the effect
that experimental conditions have on the analysis. Our method can be applied post-hoc to
any experimentation tool, and model comparison can be performed on any source of variance,
such as the random seed.
5.5.4 AutoML
We identify four categories of work that are most similar to ours. Work that focuses on errors
introduced by tools and incorrect analysis, on horizontal scaling of experiments, works that
aid in ablation studies, and tools for automated HPO.
Previous works [41, 38, 39, 185, 231, 2, 339, 79] identify the source of erroneous analysis
as poor experiment design practices resulting from improper use of statistical evaluation
methods, HPO budgets, HPO strategies, and tooling, and provide recommendations. We
extend their work and investigate errors during the horizontal scaling of experiments that lead
to erroneous analysis. We identify errors from the sampling strategy, non-random execution
errors, and implementation errors. We provide general recommendations and address the
errors with ABLATOR.
Several tools have been proposed [82, 86, 129, 308, 182] that support distributed experiment execution. However, they require manual effort in integrating with other libraries
for resource allocation, scheduling of experiments, resuming faulty trials, result aggregation,
configuration sampling, and analysis. In contrast, ABLATOR combines all of the above in an automated fashion, where only the implementation and configuration of the method are used
to produce the analysis artifacts.
Ablation frameworks introduce methods and tools specific to constructing ablation analysis artifacts. Such methods can have limited use cases [114, 34, 234] or lack automation
[294]. In contrast, ABLATOR provides analysis artifacts that give a holistic view of a
method’s performance and can be extended to support the automation and specific use-cases
addressed by the works above.
AutoML methods [84, 341, 35] are designed for HPO and can be extended to ablation
experiments with support for automated analysis. Unlike ABLATOR, such tools are
designed for simple use cases, such as statistical models, and require additional effort to scale
experiments horizontally. AutoAblation [266] extends Maggy [202] to Deep Learning
models. However, manual effort is required for the allocation of GPU resources and for
experiment persistence. Lastly, the declarative design paradigm has limited use cases, as
opposed to the object-oriented design of ABLATOR.
5.6 Conclusions
Motivated by the inapplicable analyses used in ML research as well as the difficulty of comparing
between methods, we propose a model comparison framework. Our method aids in the comparison between studies and methods through the principled concatenation of results using
formal operators. Our method is applicable to cross-study analysis via model comparison
using Bayesian Probability. We suggest a formalization of the conditions for the generalization
of a method and assess whether Adam outperforms SGD in a cross-study comparison. We
find that the generalization of a method is sensitive to the comparison bounds, which in our
work we view as a source of bias and evaluate using RB.
Additionally, we identify several sources of error common in the horizontal scaling of multiple
experimental trials. We provide general recommendations and address the errors with a stateful
experiment design paradigm. ABLATOR implements the paradigm to automate the scaling of
ablation experiments across multiple resources and produces analysis artifacts in an automated fashion for rapid iterative prototyping. We evaluate ABLATOR with a Transformer
model for tabular datasets, ‘Tablator’, where we study the effect of several architectural
components and hyperparameters in the largest ablation study on tabular datasets to date.
We conclude that simpler methods work best, while more complex methods can improve a
benchmark but have high variance. ABLATOR can be used effectively to conduct large-scale
ablation studies with ease.
Chapter 6
Background: Continual Learning
6.1 Continual Learning
In the Continual Learning setting, we seek to learn from a sequential stream of multiple tasks incrementally. Lange et al. [173] and Ven et al. [292] define three scenarios of
Continual Learning: Task-Incremental, Class-Incremental, and Domain-Incremental Learning.
Task/Class-Incremental approaches utilize task identity to separate the logit space of different tasks, either through masking [45] or by using different classification heads [199]. General
Continual Learning (GCL) [45, 302] learns without task identity during both training
and inference. It is a more general case of CL where the model learns domain-incrementally,
remapping task classes to overlapping concepts [293], similar to Reinforcement Learning
[256], which maps a non-stationary stream of inputs to a shared logit/action space. In GCL,
a task-switch occurs with a probability during training, where the new task has a different
distribution with respect to all previous tasks. The final GCL objective is to minimize the
average classification error on all tasks. Continual Learning aims to learn a new task from
a stream of sequential tasks, without access to previous task datasets, while maintaining
performance on all previous tasks. We consider a set of tasks T = {T1, . . . , Tn} with clear task
boundaries. For our work on Batch Model Consolidation (Chapter 7) we train and evaluate in a Class-Incremental Learning (CIL) [173, 199] setting, in which task identity is
provided during training but not at test time, while for our work on αMeta we train and
evaluate under a General Continual Learning (GCL) scenario. The Continual Learning
objective function for a model with parameters θ can be summarized as maximizing the
average classification accuracy after learning a sequence of T tasks:
A = max_θ (1/T) Σ_{i=1}^{T} Acc(θ(Ti), Tiy) (6.1)
6.2 Catastrophic Forgetting
Experience Replay (ER) [19] as well as several extensions [11, 15, 23, 46] store a small
subset of the training data using a heuristic and rehearse the past data by applying a penalty
auxiliary to the current task training objective. ER selects the data to store at random
from all training steps (stochastic sampling [255]) and stores them in a fixed-size replay
memory. At each training step, a mini-batch of the memory data is chosen at random
and appended to the current training batch. Improvements to stochastic sampling can use a
heuristic to determine the timestep at which to store samples to memory [277] (‘when to sample’),
while a different heuristic can be used for which samples to store from the given timestep
[11, 46] (‘what to sample’).
Knowledge Distillation (KD) [126], in the CL setting, can be used to transfer knowledge between models trained on different tasks [45, 248]. KD penalizes the student model
using a loss function between the representations of teacher and student models. Representations used in KD can be the output logits Lkd [49, 45, 330, 248] or the intermediate feature
vectors Lbd [139, 124]. Given the hidden representation vectors at depth i ∈ |θ| from the student
and teacher models, ϕ_i^s, ϕ_i^t ∈ R^d, we compute Lbd as the sum of the distances between all
pair-wise hidden representations:
Lbd(θt(x), θs(x)) = Σ_{i=1}^{|θ|} ‖sg(ϕ_i^t) − ϕ_i^s‖_2 (6.2)
sg is the stop gradient operator that prevents the parameters of the teacher model from
being updated.
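A minimal PyTorch sketch of Lbd, assuming the teacher and student hidden representations are available as lists of feature tensors; averaging the per-sample distance over the batch is an assumption of this sketch rather than a detail stated in the text:

import torch

def behavior_distillation_loss(teacher_feats, student_feats):
    # L_bd: sum over depths of the L2 distance between teacher and student
    # hidden representations; the teacher is detached (the sg operator).
    loss = 0.0
    for phi_t, phi_s in zip(teacher_feats, student_feats):
        loss = loss + torch.norm(phi_t.detach() - phi_s, p=2, dim=-1).mean()
    return loss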
Training Statistics are used to perform novelty detection in the context of Out-of-Distribution [187, 221, 279, 131, 181] and CL [10, 309], and fall into 6 categories of statistics, listed in
Table 6.1. Dimensionality Reduction is applied on the training statistics as an intermediate
step to a downstream objective such as novelty classification. For example, in a model of two
feedforward layers the gradients are a list τG = {τ1, τ2} (τ1 ∈ R^{b×d1×d2}, τ2 ∈ R^{b×d2×d3}),
where di is the input/output dimension of the features in the i-th layer and b the batch size. In
this work, and for an equivalent evaluation between SMs, we perform an identical reduction
and evaluate different reduction methods; for example, by a mean reduction of the norm,
τ′G = {τ′1, τ′2} (τ′1 ∈ R^b, τ′2 ∈ R^b). We experimentally observe that the performance of the
downstream objective is sensitive to the reduction method of the statistics (Table 7.3).
Training Statistics τ    Definition                    Base
Loss                     L(fθ(x), y)                   Loss
Gradient                 ∇θ L(fθ(x), y)                Gradient
Fisher Info.             θ + [∇θ L(fθ(x), ŷ)]²         Gradient
Features                 ϕ(x)                          Feature
Uncertainty              H(fθ(x))                      Logit
Energy                   − log Σ exp(fθ(x))            Logit
Table 6.1: Summary of Training Statistics commonly used in the CL and OOD literature
for novelty detection, where fθ is the model parameterized by θ, L the loss function, ŷ
the pseudo-label, ϕ the set of feature vectors, and H the entropy function. ‘Base’ refers to the
training statistic used to derive τ.
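A hedged sketch of how a few of the statistics in Table 6.1 could be computed for a classification model in PyTorch; the helper name and the per-layer gradient-norm reduction are illustrative choices, not the exact reduction used in this work:

import torch
import torch.nn.functional as F

def training_statistics(model, x, y):
    # Loss, per-layer gradient norms, predictive entropy, and energy for one batch.
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    grads = torch.autograd.grad(
        loss, [p for p in model.parameters() if p.requires_grad]
    )
    grad_norms = torch.stack([g.norm() for g in grads])
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # uncertainty H(f(x))
    energy = -torch.logsumexp(logits, dim=-1)                      # -log sum exp f(x)
    return {"loss": loss.detach(), "grad_norm": grad_norms,
            "entropy": entropy.detach(), "energy": energy.detach()}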
Chapter 7
Algorithmic Generalization
In this chapter we discuss our work published at CVPR 2023 on Batch Model Consolidation
[95] and our in-progress work on a multi-modal Continual Learning dataset, the Stream.
Both works were done with my collaborator Jiaye Zhu under the supervision of Prof. Laurent Itti.
The Continual Learning setting and the issue of Catastrophic Forgetting, introduced in Chapter 6,
are required background for understanding the methods in this chapter.
7.1 Model Consolidation
Figure 7.1: A single incremental step of BMC. On the right, the updating of a base
model with Multi-Expert Training: after receiving the data of the new tasks D_i, . . . , D_{i+k}, a
batch of experts θ_i, . . . , θ_{i+k} are trained separately on their corresponding tasks with a stability
loss applied from the base model. The newly trained experts then sample a subset of their
training data and combine them with the memory to perform batched distillation on the
base model. On the left, the regularization helps the batched distillation to update
the model closer to the regularization boundary and towards the jointly low-error zone of
the old tasks and the two new tasks.
Batch Model Consolidation (BMC), combines a rehearsal-based learning system and a
two-step training process, Fig. 7.1 (right). We first introduce the main design components
of BMC for a single task sequence, a task stream. Next, we formalize the constraints under
which we evaluate Multi-Expert Training, where multiple experts are trained in parallel
on distinct task streams composed of sequences with distinct tasks. In short, our method
is composed of multiple training incremental steps until all tasks are learned. Each step
trains multiple expert models in parallel. Each expert is trained on a specific task, different
from all other tasks. The training of each expert is composed of a regularization phase
that reduces the deviation of the parameters for the current task from the base model.
Figure 7.2: The loss contours of sequential training compared with batched task training [113] (shaded areas are low-error zones for
each task). Intuition: similar to mini-batch
training, batched task training can reduce the
local minima and improve the convexity of the
loss landscape.
At the end of the training of all expert
models for the current incremental step,
a consolidation phase distills the expert
knowledge back to the base model simultaneously using batched distillation loss.
Our method performs better as compared
to single-model distillation, and is a better
approximation to the ‘multi-task gradient’;
see Fig. 7.1 (left) and Fig. 7.2.
7.1.1 Buffer-Memory
BMC uses a short-term buffer storage and a
long-term memory bank to store real samples for rehearsal. Memory is a fixed-size storage
that holds training exemplars from multiple previous tasks and is only accessible by the base
model. Buffer is a temporary storage of limited size for a subset of the expert’s training
data. For each incremental training step of k experts, at the end of the regularization phase,
the central memory bank is combined with the buffers B1, . . . , Bk from the experts. At the end of the
consolidation phase, the memory is subsampled to maintain a constant size.
Sampling methods are applied for both memory and buffer data selection to meet the
size constraints of each storage solution. The goal of a sampling method is to improve the
informativeness of the buffer data for the current task and the memory data for all previously
learned tasks. We experiment with multiple sampling methods, including gradient-based
sampling [11] and random selection in an ablation study to find that random sampling
performs the best.
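A minimal sketch of the buffer/memory bookkeeping with random subsampling, using hypothetical class and method names:

import random

class Memory:
    # Fixed-size long-term memory; random subsampling keeps a constant size.
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = []

    def consolidate(self, *buffers):
        # Merge the experts' buffers into memory, then subsample at random.
        for buf in buffers:
            self.data.extend(buf)
        if len(self.data) > self.capacity:
            self.data = random.sample(self.data, self.capacity)

    def sample(self, batch_size: int):
        return random.sample(self.data, min(batch_size, len(self.data)))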
7.1.2 Stability Loss
During the Regularization phase, we train an expert model θexp that is initialized from a
base model θbase. The stability loss is applied during the regularization phase and poses a constrained optimization problem during the training of the expert on the new task. The goal of
the expert is to learn the new task while maintaining feature similarity to the previously learned
tasks represented by θbase, implicitly and without access to previous task data. The idea
follows a direct comparison to previous Continual Learning regularization-based approaches
[264, 9, 180, 325]. The intuition of the stability loss is to make the model less prone to
task-recency bias [199], which can be viewed as the root cause of forgetting. Additionally, our
ablation studies support the view that the stability loss improves consolidation, as each expert’s weights are confined within the regularization boundary (Fig. 7.1). The optimization
objective of each expert can be summarized as:
Lexp = LT (θexp(x), y) + λLbd(θbase(x), θexp(x)) (7.1)
where LT is the task loss, Lbd is the distillation loss applied with the base model as the
‘teacher’ and the ‘expert’ model as the student, and λ is the stability coefficient. EWC, or
KD on the logits, can be applied as a direct replacement for Lbd. We find experimentally that
they under-perform compared to Lbd.
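A minimal sketch of one regularization-phase update (Eq. 7.1); the features(...) accessor returning the list of hidden representations, the batch-mean of the distance, and the initialization of the expert from the base model are illustrative assumptions:

import torch
import torch.nn.functional as F

def expert_step(expert, base, batch, optimizer, lam=1.0):
    # expert is assumed to be initialized from the base model (e.g. a deep copy).
    # Task loss on the new task plus the stability loss with the frozen base as teacher.
    x, y = batch
    task_loss = F.cross_entropy(expert(x), y)
    stability = sum(
        torch.norm(phi_b.detach() - phi_e, p=2, dim=-1).mean()  # sg() via detach
        for phi_b, phi_e in zip(base.features(x), expert.features(x))
    )
    loss = task_loss + lam * stability
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()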
7.1.3 Consolidation Phase
At the end of training and for all k experts, BMC consolidates the learned knowledge of
all experts in a single training step using a batched distillation loss. Batched distillation
loss is applied only to the most recent task buffer data B and to memory M such that
D = {M, B1, . . . , Bk}.
Instead of performing Knowledge Distillation on a single model or task at a time, batched
distillation loss is applied with randomly sampled buffer data and expert representations from
D. As such, each training batch for the base model can contain randomly sampled tasks from
multiple domains. We hypothesize that batched distillation loss, Lbmc, improves performance
by improving the convexity of the loss landscape for all tasks, Fig. 7.2. Lbmc penalizes
the difference in feature representations from θ′base to all experts E = {θ1, . . . , θk}.
Lbmc = Σ_{θi ∈ E} E_{x, ϕ(x;θi) ∼ D}[Lbd(θ′base(x), θi(x))] (7.2)
We experiment with alternatives to Lbd when computing Lbmc. We find that Lbd performs
the best. The final optimization objective of the base model is the joint optimization of
Lbmc and the experience replay task loss. The training objective of the base model can be
summarized as:
Lbase = αLT(θ′base(x), y) + βLbmc(θ′base, D) (7.3)
where (x, y) ∈ D, α is the experience replay task loss coefficient and β the consolidation
coefficient.
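A minimal sketch of one consolidation-phase update (Eqs. 7.2 and 7.3), assuming the experts act as detached teachers and the base model as the student; the features(...) accessor and the function name are illustrative assumptions rather than the implementation of BMC:

import torch
import torch.nn.functional as F

def consolidation_step(base, experts, memory_batch, optimizer, alpha=1.0, beta=1.0):
    # Experience-replay task loss plus batched distillation from all experts to the base.
    x, y = memory_batch
    task_loss = F.cross_entropy(base(x), y)
    l_bmc = sum(
        torch.norm(phi_t.detach() - phi_s, p=2, dim=-1).mean()
        for expert in experts
        for phi_t, phi_s in zip(expert.features(x), base.features(x))
    )
    loss = alpha * task_loss + beta * l_bmc
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()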
Gradient Noise Reduction In many nonconvex optimization problems, the loss manifold is filled with local minima and saddle points, where Stochastic Gradient Descent (SGD)
optimization can underperform [138]. Noise in the training data leads to noise in the gradients and high variance for SGD [200]. Gradient approximation methods, such as minibatch
training, accumulate gradients over a batch of data to estimate the true gradient of the entire
training set. Keskar et al. [153] observed that as the batch becomes smaller, the parameters
are updated further away from their initial point as opposed to large batch training. This
observation is in agreement with [200, 127] that a small batch introduces more ‘randomness’,
as it is a coarser approximation to the true gradient of the entire training set and causes
instability in training.
Similarly, consider a continual learning environment where there is a set of expert models
trained on a disjoint set of tasks, with the goal of consolidating them sequentially into a single
base model. We argue that the consolidation training process is similar to batch training, but
in the multi-task setting, where we reduce the variance by consolidating the experiences from
multiple experts. Previous model consolidation methods [49, 330] combine a single expert
at a time. In contrast, for our method, we observe that batched consolidation improves
accuracy as well as enables data parallelism that can speed up training.
7.2 Model Consolidation Experiments
We evaluate our method on three Continual Learning benchmarks, Tiny-ImageNet [174]
split into 10 tasks, CIFAR-100 [170] split into 10 tasks and 20 tasks. Next, we evaluate our
method on a long sequence of diverse tasks to demonstrate BMC’s advantage. We use the
Stream dataset composed of 71 image classification tasks for rigorous evaluation on average
accuracy, Cost Accuracy, and relative training time. Lastly, we evaluate the efficacy of each
design component for our method through ablation experiments on Permuted-MNIST, where
we train for 128 tasks and a total of 1280 classes. We open-source and provide the extracted
feature vectors from the Stream dataset, the code to run the benchmark on the baselines
and the code for our method as a Distributed Continual Learning library¹.
Stream Dataset. Common benchmarks for evaluating Continual Learning methods are
built by splitting classes from datasets such as MNIST, CIFAR-10/100 and Tiny-ImageNet,
which have subtasks in similar domains, of similar size and number of classes. We aim to
evaluate BMC in a more general setting where there are larger domain-shifts, for significantly
more tasks that range in difficulty and problem size. Lastly, synthetically generated datasets
such as permuted-MNIST can be poor references to performance in applicable scenarios [173].
To this end, we use Stream which is composed of 71 publicly available image classification
datasets [196, 13, 14, 280, 209, 68, 167, 143, 315, 149, 217, 189, 300, 213, 104, 67, 1, 69, 122,
179, 44, 3, 262, 150, 286, 211, 37, 144, 178, 203, 274, 285, 25, 245, 298, 236, 24, 312, 152,
289, 316, 118, 342, 165, 291, 297, 191, 227, 296, 335, 42, 270, 206, 331, 212, 59, 166, 128,
290, 116, 287, 263, 17, 80, 288, 210, 132, 5, 232, 98, 204] from the computer vision literature
and Kaggle [146]. We concatenate the datasets into a stream of tasks. There are a total
of 6,770,722 train images and 743,977 validation images with 2866 classes in Stream, with
different numbers of classes for each task. To speed up the experiments, we extract feature
vectors from a pre-trained CLIP model [238] and use them as input to the model. For
both our method and the baselines we use identical training hyper-parameters. We use the
¹ https://github.com/fostiropoulos/stream_benchmark
hyper-parameters as reported in the original paper for each method, where applicable.
All experiments use an MLP with Residual connections [120] on the extracted CLIP feature
vectors.
Baselines. We follow the methodology and compare with the methods reported in [173,
45]. When running a method on the Stream dataset we use implementation by Mammoth
[45] and FACIL [199]. We report results from each respective paper when they are available,
or otherwise from [173, 45]. We compare our method with ER [250], GSS [11], A-GEM [53], iCaRL
[248], GDumb [233], DER++ [45], Online EWC [264], SI [325], MAS [9] and DMC [330]
with details of each method in Section 7.5.1. We train a naive baseline (SGD) without any
Continual Learning strategy as a lower bound. We compute the theoretical upper bound on
performance as the Multi-Task accuracy, where we use the mean accuracy of SGD on each task.
7.2.1 Results
Figure 7.3: BMC outperforms all other methods on the average accuracy for the Stream
dataset and under a CIL evaluation setting.
We use 10 experts for our experiments. Table 7.1 reports the average accuracy A for the
main baselines. We report Cost Accuracy Ac, Total Cost Tc and relative training time to
SGD. For the Continual Learning benchmarks, we show that BMC works well on the short
sequence of similar tasks. For the Stream benchmark, our method significantly outperforms
all baselines. In detail, BMC outperforms the second highest (ER) by 70%, and achieves 79%
of the theoretical upper bound provided by Multi-Task training. Additionally, BMC has a
constant time complexity w.r.t. the number of previously seen tasks and is 22% faster when
compared to training with SGD on a single device. Other baselines degrade in relative time
performance because they require a second backward propagation [45] or have an intractable
training time as the number of learned tasks increases [45, 11, 52].
Our experiments and benchmark conclude that most of the recent approaches fail to
mitigate forgetting and are outperformed in all regards by simpler alternatives such as ER
[250]. Replay methods that use heuristics in sampling [45, 53, 248] are unable to address the
drastic domain-shift of a long stream of tasks and result in sudden performance degradation
for large domain-shifts as shown in Fig. 7.3. Likewise, Regularization methods [264, 9]
perform similarly to a naive baseline, SGD. Parameter-isolation methods [261, 309] fail to
train due to memory requirements.
We analyze the Total Cost (Tc) as the space complexity of the buffer, memory and any
auxiliary information specific to the method. We denote by |x| the input dimension, |θ| the
number of model parameters, |ϕ| the total size of the intermediate representations (|ϕ−2|
the penultimate feature size, and |ϕ−1| the logit size). iCaRL uses a herding buffer strategy
that computes artifacts for the entire dataset to subsample, which results in higher than
theoretical peak memory usage. For our method, we use an efficient implementation to
calculate Lbmc that does not transmit ϕ. As such, BMC maintains a memory footprint per
task O(|x|) + kθ.
We compute Tc in megabytes (MB) for a given buffer size, and thus Ac represents the
improvement in accuracy per MB. Our best performing model variant achieves an
accuracy of 71.87% with a Buffer and Memory size of 15k each. For our method, we
find the Pareto-optimal configuration with regard to Ac and use a Memory and Buffer size of
15k and 10k respectively (Table 7.1). When comparing between methods, there are limitations
in using Ac for a single configuration: the comparison depends on the hyper-parameter range that
each method was trained on, which might be optimal for A but not Pareto optimal in terms
of Ac. To this end, we motivate that the evaluation of Ac is done for multiple configurations.
Methods     S-CIFAR (10)  S-CIFAR (20)  Tiny (10)  Stream (71)  Ac ↑   Tc                      Time ↓
SGD         8.5           3.7           7.9        2.1          -      O(1)                    100%
Multi-Task  75.79         75.79         68.89      89.3         -      O(1)                    100%
ER          12.43         14.42         27.41      41.4         5.40   O(|x|)                  184%
DER++       27.02         -             39.01      19.4         0.53   O(|x| + |ϕ−1|)          205%
A-GEM       6.52          3.62          8.01       6.6          0.67   O(|x|) + 2|θ|           231%
iCaRL       25.53         19.22         14.11      23.4         0.59   O(|x| + |ϕ−2:|) + |θ|   141%
GDumb       36.02         22.12         -          33.0         4.29   O(|x|)                  129%
GSS         17.42         11.32         -          -            -      O(|x|)                  1203%
DMC         36.2          -             -          1.0          0.02   O(1) + |θ|              140%
EWCon       13.1          3.7           7.6        2.1          0.99   O(1) + 2|θ|             172%
Ours        66.54         67.44         49.44      70.4         6.27   O(|x|) + kθ             78%
Table 7.1: CIL performance on split CIFAR-100 (S-CIFAR) for 10 and 20 tasks, split Tiny-ImageNet for 10 tasks, and the Stream Dataset for 71 tasks. Baseline results for S-CIFAR and
Tiny-ImageNet are from [45, 195, 109, 47, 199, 334], with the buffer size annotated next to
the reported result as 5120¹, 5000², 1000³ or 2000⁴. We use ‘-’ to denote results that are not
available in the literature, not applicable, or not feasible to obtain.
7.3 Surprise
Figure 7.4: Left: MetaSup finds a fixed threshold by maximizing the F1 score of task-switch detection. Middle: αMetaSup trains fattn on SMs τ1, . . . , τn and task-switch labels
{0, 1} collected from auxiliary streams, as time-series classification. Right: the model fθ learning
incrementally with Surprise Sampling. A new data batch is first stored in the short-term buffer B.
The pre-trained fattn infers a task-switch from the attention window. On surprise, B is sub-sampled
and stored into M. A fixed threshold failed to generalize on the Stream Benchmark task-gaps.
7.3.1 Surprise Sampling
Task-agnostic replay-based CL methods (e.g. DER++ [45]) require sampling from every
batch. Recent work [277] has also identified that the timing of sampling is important and that
oversampling from a given task can lead to degraded performance. Surprise Sampling is
inspired by the biological mechanism of surprise, where it has been observed that experiences
causing arousal of the adrenal system lead to recent experiences being moved to long-term
memory [201]. As such, we propose to select a fixed number of recent samples B from the
short-term memory, B ⊂ B, to store in the long-term memory M. B are the samples prior to
the surprise, not including the current training data.
Surprise Sampling can work with any method that can classify surprise from the current
Surprise-Metric. We evaluate Surprise Sampling in Section 7.4.1. Our experiments indicate
that Surprise Sampling is efficient in utilizing M as well as robust to outliers and false
positives (Fig. 7.5). We present the algorithm in Algorithm 1.
Algorithm 1 Surprise Sampling
Given: data D, pre-trained fmeta, attention window k.
Initialize: model fθ, fixed-size FIFO B, fixed-size memory M, warm-up steps w.
for (xi, yi) in D do
    Compute SM τi = Fτ(xi, yi; fθ).
    if fmeta(τi−k, . . . , τi) and w = 0 then
        Store B ⊂ B into M; reset B, w.
    Task loss ℓ = L(fθ(xi), yi).
    CL loss ℓCL = LCL(M).
    fθ ← ∇(ℓ + ℓCL)
    if w = 0 then append (xi, yi) to B
    else w ← w − 1
end for
7.3.2 Meta-Surprise
Previous works [10, 16, 309, 338] use a fixed hyper-parameter threshold for the binary classification of novelty on the training statistics τ from Section 6.1, which in this work we refer to as
the Surprise Metric (SM). The fixed threshold fails to adapt to varying task-gaps. To evaluate
SMs as well as the fixed hyper-parameter threshold, we propose the use of a threshold on a
Z-norm (Sz), a batch-aggregate statistic of the Surprise-Metric updated in an online manner,
as well as a Contrastive score (Sc) that computes the similarity of the Surprise-Score between
recent training data and the current data. Contrary to previous work, we provide a method to
learn the fixed threshold for Sz and Sc on an auxiliary dataset.
Z-norm (Sz), similar to Batch Normalization [134], uses the moving average and moving
standard deviation, updated through momentum, as the learnable parameters. Given a model
fθ, SM function Fτ, moving average µ̄τ, moving standard deviation σ̄τ and decay factor γ,
Sz and the updated statistics are calculated from a batch (xi, yi) as in Eq. (7.4). Sz has the
same dimension as τi = Fτ(xi, yi; fθ), where we use the mean, absolute-min, and absolute-max
to reduce to a scalar if necessary. Finally, µ̄τ is updated by µ̄τ + γ(τi − µ̄τ) and σ̄τ by
(1 − γ)(σ̄τ + γ(τi − µ̄τ)²).
Sz(τi; σ̄τ, µ̄τ) = (τi − µ̄τ) / √(σ̄τ + ϵ) (7.4)
Contrastive score (Sc) is the similarity between the current SM and a set of reference
SMs, and is used to evaluate cosine similarity in identifying novelty [11, 279]. For a reference
window W = {τk−n, . . . , τk−1} of the recent n-step SMs at training step k, we compute Sc as the
mean similarity to all references, Eq. (7.5).
Sc(τi; W) = (1/|W|) Σ_{w ∈ W} (τi · w) / (‖τi‖ ‖w‖) (7.5)
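A minimal NumPy sketch of the two scores, assuming the Surprise-Metric τ has been reduced to a 1-D vector per batch; the class name, the initialization of the running statistics, and the reference-window handling are illustrative assumptions:

import numpy as np

class SurpriseScores:
    # Online Z-norm (S_z) and contrastive (S_c) scores over a surprise metric.
    def __init__(self, gamma=0.01, window=10, eps=1e-8):
        self.gamma, self.eps, self.window = gamma, eps, window
        self.mu, self.var = None, None
        self.refs = []

    def z_score(self, tau):
        if self.mu is None:
            self.mu, self.var = tau.copy(), np.ones_like(tau)
        s_z = (tau - self.mu) / np.sqrt(self.var + self.eps)       # Eq. 7.4
        delta = tau - self.mu
        self.mu = self.mu + self.gamma * delta                     # EMA mean
        self.var = (1 - self.gamma) * (self.var + self.gamma * delta ** 2)  # EMA variance
        return s_z

    def contrastive_score(self, tau):
        if not self.refs:
            self.refs.append(tau.copy())
            return 1.0
        sims = [float(tau @ r / (np.linalg.norm(tau) * np.linalg.norm(r) + self.eps))
                for r in self.refs]
        self.refs = (self.refs + [tau.copy()])[-self.window:]      # keep recent references
        return sum(sims) / len(sims)                               # Eq. 7.5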
Meta-Surprise (MetaSup) detects a task-switch if the current score is above a fixed
threshold. We find the threshold by collecting SMs on auxiliary streams with known task
identity and no CL mechanism, Fig. 7.4 (left). We apply the threshold in GCL, where a new
backbone model fθ is trained from scratch. Although MetaSup may be effective for task-switch
detection, it does not solve the problem, which also lies in previous works, that a fixed
threshold does not generalize well between datasets. We use MetaSup to systematically evaluate SMs from the current literature and use our insights to propose αMetaSup (Section 7.3.3).
7.3.3 Adaptive Meta-Surprise
Adaptive Meta-Surprise (αMetaSup) trains an auxiliary neural network on the collected SMs
by time-series classification [83]; Fig. 7.4 (middle). Due to the low positive rate of task-switches, the probabilities of the trained meta-classifier are miscalibrated. We calibrate the
meta-classifier after training using isotonic regression [107] on the same auxiliary dataset.
The meta-classifier is used in an online manner during GCL to infer the task boundaries. We
evaluate a single-layer attention network (fattn), a Residual MLP (fmlp) and an LSTM (flstm)
and find that fattn generalizes the best among backbones, datasets and metrics (Section 7.4.2).
Naturally, we need to construct a good auxiliary dataset stream from which we can collect
SMs for both MetaSup and αMetaSup. Next, we introduce the method for generating the
data stream with variant-size task-gaps from any underlying dataset, and provide analysis
in Section 7.4.3.
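A minimal sketch of the post-hoc calibration step with scikit-learn's isotonic regression, assuming raw meta-classifier scores and binary task-switch labels collected from the auxiliary stream; the function name is illustrative:

import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_meta_classifier(scores, labels):
    # Fit a monotonic mapping from raw scores to calibrated task-switch probabilities.
    iso = IsotonicRegression(out_of_bounds="clip", y_min=0.0, y_max=1.0)
    iso.fit(np.asarray(scores, dtype=float), np.asarray(labels, dtype=float))
    return iso  # iso.predict(new_scores) returns calibrated probabilities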
7.3.4 Stream
Stream is a sequence of tasks generated by projecting multimodal datasets on the same
feature space and augmenting them within the feature space to synthetically generate an
arbitrarily long sequence of tasks. As such we are able to control the difficulty of the stream
by varying the number and sequence of task-gaps. Given a list of datasets of different modalities from which to compose the stream, Dbase = {D1, . . . , Dn}, we project them onto a feature space of identical dimension, SD = {fGPT(D1), . . . , fViT(Dn)} s.t. |fGPT(D1)| = |fViT(Dn)|, where D1 can be a text dataset and Dn a vision dataset. We use the same feature extractor for all datasets of a given modality, i.e. GPT-2 for text and ViT for vision. We use a transformation F to synthetically augment the task-sequence with a new dataset F(Si) → S′i, where S′i is out-of-distribution w.r.t. Si and is considered a new task. We set F = {FPerm, FRot}, where FPerm is a random permutation of the feature space and FRot is a rotation of the 2-D projection of the features by a fixed degree ρ chosen from {30, 60, . . . , 360}; a minimal sketch of FPerm and FRot follows the list of task-gaps below. We define four types of task-gaps based on their difficulty w.r.t. the backward transfer [51] between Si and S′i.
• Distribution: S′i ↔ S′j, where the two datasets are from the same modality and use an identical transformation F.
• In-domain: FRot1 ↔ FRot2, where the two datasets are from the same modality and use a different rotation transformation F.
• Data-gap: FPerm1 ↔ FPerm2, defined for any permutation within the same modality.
• Modal-gap: CV′i ↔ Text′j, where the input modalities are different.
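A minimal sketch of the two feature-space transformations, assuming the projected features are flat vectors with an even number of dimensions; treating consecutive dimension pairs as the 2-D projection that FRot rotates is an assumption, not the thesis implementation.

```python
import numpy as np


def f_perm(features: np.ndarray, seed: int) -> np.ndarray:
    """F_Perm: a fixed, seed-dependent permutation of the feature dimensions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(features.shape[-1])
    return features[..., perm]


def f_rot(features: np.ndarray, degrees: float) -> np.ndarray:
    """F_Rot: rotate a 2-D projection of the features by a fixed angle.
    Consecutive dimension pairs are treated as 2-D points (illustrative assumption)."""
    theta = np.deg2rad(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    pairs = features.reshape(*features.shape[:-1], -1, 2)
    return (pairs @ rot.T).reshape(features.shape)


# Example: derive a synthetic out-of-distribution task from existing features s_i.
# s_prime = f_rot(f_perm(s_i, seed=0), degrees=30)
```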
The final dataset sequence is sampled from F × D. Stream is able to produce infinitely
long task-sequences by varying the random seeds in F. We construct four Stream datasets.
1. S-Numbers (S-N), Dbase = {MNIST [175], SVHN [210]};
2. S-NumbersV (S-NV), the vectorized-input version of S-Numbers;
3. S-ImageNetT (S-INT), Dbase = {CIFAR-10 [168], CINIC-10 [71]};
4. S-Modal, Dbase = {Real [225], Sketch [225], IMDB [193], Amazon [12]}, which we use for our Stream Benchmark.
Stream addresses a realistic evaluation setting in which a neural network can be exposed
to a series of tasks from non-overlapping domains, such as a network trained online on
a remote device that experiences unexpected shifts in its training data caused by weather patterns,
abrupt changes to the environment, and changes to the tasks it is required to perform.
We evaluate the AUC score of model predictions instead of the accuracy score, as the label
distribution in Stream can be imbalanced (e.g. Real and Sketch have 10 classes while Amazon has
5).
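Since the exact evaluation code is not shown here, the following is a minimal sketch of a macro-averaged one-vs-rest ROC AUC computation with scikit-learn, which is robust to the imbalanced label distributions described above; treating the Stream evaluation as multi-class one-vs-rest AUC is an assumption.

```python
from sklearn.metrics import roc_auc_score


def stream_auc(y_true, y_prob):
    """Macro-averaged one-vs-rest ROC AUC over class-probability predictions."""
    return roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
```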
7.4 Surprise Detection Experiments
In this section, we show that our method Adaptive Meta-Surprise can robustly identify the
task boundaries in General Continual Learning, and that Surprise Sampling improves over
stochastic sampling [255]. In the ablation study, we justify our choice of Surprise-Metrics
by evaluating their efficacy with Meta-Surprise on four Stream datasets and two backbone
models, a ResNet-18 [120] and a Residual MLP [283]. Last, we show that the attention-based meta-classifier has improved generalization, and discuss the influence of the task-gaps
on General Continual Learning. We present and discuss the main results in the remainder
of this section.
Method S-INT S-N S-NV S-Modal
DER++ 0.735 0.894 0.816 0.565
CLS-ER 0.741 0.891 0.821 0.555
ER 0.724 0.908 0.810 0.555
Baselines 0.734 0.898 0.816 0.558
S-Energy 0.778 0.908 0.798 0.564
S-Loss-Plateau 0.806 0.901 0.794 0.562
S-One-Shot 0.770 0.915 0.811 0.564
S-Loss-EWMA 0.783 0.901 0.796 0.562
SS 0.784 0.907 0.800 0.563
αS-DER++ 0.755 0.923 0.855 0.603
αS-ER 0.763 0.923 0.847 0.603
αMetaSup + SS 0.759 0.923 0.851 0.603
Table 7.2: Mean performance of stochastic-sampling baselines, Surprise Sampling (SS) augmented methods [16, 10, 309, 338], and αMetaSup + SS augmented methods [250, 45]. The first three rows are native implementations of the baseline methods. SS-augmented methods use SS to determine surprise and apply ER. We report the mean AUC score across all tasks at the end of training. SS-augmented methods perform better, while αMetaSup + SS methods further improve the performance of the baselines.
Figure 7.5: Ablation study on the efficiency of buffer utilization over 100 trials. Left: Surprise Sampling augmented baselines compared to Reservoir [255], trained on S-NumbersV. Surprise Sampling performance scales with the Memory Size at an improved rate and is therefore more efficient. Right: a larger Sample Size |B| degrades performance, as a larger part of |M| is replaced with recent data; we discuss this further in Section 7.4.1.
7.4.1 Surprise Sampling
Figure 7.6: Mean AUC up to the number of tasks learned thus far. The variance in
mean AUC reflects the changing difficulty of tasks over time, such as an easy task
followed by more difficult ones. All methods are trained on an identical sequence of S-Modal
tasks and for 252 random initializations. αMetaSup methods are the only ones whose
performance does not degrade to near Random.
We examine the efficiency of our method in constructing the memory M when compared to
Reservoir Sampling [255] (stochastic). We train 100 models on S-NumbersV and report the
test-set mean AUC at the end of 20 tasks. The improvement over stochastic sampling
scales more favorably with the buffer size, Fig. 7.5. We verify our hypothesis from Section 7.3.1
that sampling prior to a task-switch is an improvement over stochastic sampling. Additionally,
we identify degradation as the number of points sampled from recent memory increases,
Fig. 7.5. A large sampling size biases the buffer toward storing data from recent tasks over
keeping old memories, and increases the imbalance in the buffer, which harms the replay CL
loss. As the memory buffer is of fixed size, older data is more likely to be replaced.
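For reference, a minimal sketch of the Reservoir Sampling baseline [255] used as the stochastic comparison above; a surprise-triggered buffer differs mainly in when add() is called (only around a detected surprise rather than for every incoming example).

```python
import random


class ReservoirBuffer:
    """Fixed-size memory M filled with classic reservoir sampling [255]."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0

    def add(self, example) -> None:
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                # Replace a stored example so every example seen so far has an
                # equal probability of remaining in the buffer.
                self.data[j] = example
```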
Our results agree with previous work [16], which found that accurate inference of the task-switch can improve the performance of a CL method under general conditions. However,
we also observe the contrary, where an fmeta with a high false-positive rate used with Surprise
Sampling and a small |B| can outperform a low false-positive fmeta used with SS and an
equivalent |B|, but not when we increase |B|. This is an artifact of the robustness of the CL
method to false-positives.
7.4.2 Meta-Surprise
(α)MetaSup | S-NumbersV T ± STD, F | S-ImageNetT T ± STD, F | S-Numbers T ± STD, F | S-Modal T ± STD, F | F̄
Sz(τL) | 31.2 ± 23.0, 0.944 | 25.2 ± 16.3, 0.914 | 107.7 ± 80.7, 1.000 | 26.6 ± 17.2, 0.987 | 0.961
Sz(τG)¹ | 4.1 ± 3.4, 0.919 | 1.3 ± 1.0, 0.774 | 3.3 ± 2.2, 0.944 | 2.7 ± 1.9, 0.907 | 0.886
Sz(τFI)¹ | 2.6 ± 2.1, 0.813 | 0.9 ± 0.8, 0.537 | 3.4 ± 2.5, 0.947 | 2.8 ± 2.5, 0.904 | 0.800
Sz(τE) | 6.8 ± 4.1, 0.516 | 3.6 ± 2.0, 0.444 | 8.9 ± 4.3, 0.919 | 2.6 ± 3.6, 0.463 | 0.586
Sz(||τF||2)² | 13.1 ± 9.2, 0.595 | 9.5 ± 3.2, 0.350 | 20.1 ± 14.3, 0.481 | 112.8 ± 167.4, 0.840 | 0.566
Sz(τF)² | 7.4 ± 6.4, 0.316 | 7.1 ± 3.6, 0.431 | 10.2 ± 5.4, 0.438 | 29.2 ± 39.2, 0.675 | 0.465
Sz(τU)¹ | 5.0 ± 3.4, 0.444 | 2.5 ± 1.5, 0.324 | 13.4 ± 8.6, 0.821 | 1.6 ± 2.0, 0.234 | 0.456
Sc(||τF||2) | (0.9 ± 1)e−2, 0.286 | (5.9 ± 5)e−4, 0.222 | (1.9 ± 2)e−3, 0.444 | (2.4 ± 3)e−2, 0.553 | 0.376
Sα(||τG||2) | fattn, 0.958 | fattn, 0.964 | fattn, 0.996 | fattn, 0.988 | 0.977
Sα(||τG||2) | fmlp, 0.962 | fmlp, 0.979 | fmlp, 0.996 | fmlp, 0.993 | 0.983
Sα(||τG||2) | flstm, 0.958 | flstm, 0.984 | flstm, 0.996 | flstm, 0.990 | 0.982
Table 7.3: MetaSup and αMetaSup performance across datasets for the SMs Loss τL, Fisher Information τFI, Energy τE, Feature τF, Uncertainty τU and Gradient τG, with Z-norm Sz, Contrastive score Sc and meta-classifier Sα. T is the best threshold, with the variance among different random Stream sequences. ¯· denotes the mean and ||·||2 the L2 norm. Sz is reduced by mean (¹) or absolute-max (²). We find that the reduction method influences the performance of the metric, an artifact of the linear separability of the high-dimensional metric w.r.t. the task-switch. αMetaSup outperforms a linear threshold. Additionally, the threshold is highly variable between different runs.
We evaluate αMetaSup on task-switch detection using the F1 score on four datasets and for
three meta-models, and present results in the lower half of Table 7.3. αMetaSup outperforms
MetaSup in task boundary detection across all datasets. The evaluation setting of
Table 7.3 does not convey the generalization ability in the absence of information about the
GCL stream. As such, we evaluate the generalization ability of each MetaSup and αMetaSup
method when trained on Surprise-Metrics collected from a backbone on an auxiliary dataset
stream, and compare the task-switch detection performance on the main GCL stream,
with results in Table 7.4.
The fmlp and flstm meta-classifiers degrade in performance on ResNet-18 compared to
ResMLP, while for fattn there is an increase. On aggregate, fattn performs consistently but is
still short of fmlp. We hypothesize that improvements to the architecture of the attention classifier
as well as to the meta-training procedure can further reduce the gap to fmlp.
(α)MetaSup | Eval | F | Backbone | F̄
Sz(τL) | S-Modal | 0.956 | ResMLP | 0.785
Sz(τL) | S-INT | 0.614 | ResNet-18 |
Sz(τG) | S-Modal | 0.890 | ResMLP | 0.809
Sz(τG) | S-INT | 0.727 | ResNet-18 |
flstm(||τG||2) | S-Modal | 0.920 | ResMLP | 0.863
flstm(||τG||2) | S-INT | 0.805 | ResNet-18 |
fattn(||τG||2) | S-Modal | 0.891 | ResMLP | 0.902
fattn(||τG||2) | S-INT | 0.913 | ResNet-18 |
fmlp(||τG||2) | S-Modal | 0.981 | ResMLP | 0.953
fmlp(||τG||2) | S-INT | 0.926 | ResNet-18 |
Table 7.4: Generalization of the task-switch detection score for MetaSup and αMetaSup when trained on the auxiliary S-NumbersV stream (ResMLP) compared to when trained on S-Numbers (ResNet-18). αMetaSup outperforms MetaSup, while fattn performs consistently across backbones and Surprise-Metrics (Table 7.3) and is second only to fmlp on the generalization score.
7.4.3 Stream Benchmark
We study the task difficulty and magnitude of task-gaps in S-Modal (Section 7.3.4) to justify
our task-gap transformations used in the Stream dataset. We evaluate task difficulty by the
in-task AUC without a CL method applied. Text datasets are the most difficult (AUC =
0.682) compared to vision datasets (AUC = 0.951), which can be attributed to the
quality of the features extracted by GPT-2. We use Backward Transfer (BWT) [51] to evaluate the
task-gap as the AUC ratio between tasks of different gaps. The BWT-AUC for Cross-modal,
Data-gap, Distribution, and In-domain gaps is 0.523, 0.642, 0.957, and 0.997 respectively. A
high BWT-AUC score indicates high similarity between two tasks and a small task-gap.
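One plausible reading of the BWT-AUC ratio used above, written out explicitly; the exact bookkeeping of which checkpoints are compared is an assumption.

```latex
\mathrm{BWT\text{-}AUC}(i, j) \;=\;
\frac{\mathrm{AUC}_{i}\,(\text{after subsequently training on task } j)}
     {\mathrm{AUC}_{i}\,(\text{in-task, right after training on task } i)}
```

Under this reading, a ratio near 1 (e.g. the In-domain gap at 0.997) means the second task barely disturbs the first, while a low ratio (the Cross-modal gap at 0.523) indicates a large task-gap.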
The range of task-gaps justifies the need for an adaptive threshold rather than a fixed
threshold. The Cross-modal gap has the lowest BWT and provides the most distinct feature distributions when compared to all other gaps.
We evaluate the performance of our methods αMetaSup and MetaSup on a new, challenging benchmark, Stream. For our experiments we use S-NumbersV to train αMetaSup and to
find a fixed threshold for MetaSup, whose samples are not present in the Stream benchmark.
We evaluate on S-Modal, composed of 40 tasks. We compare to the existing CL baselines ER
[250], DER++ [45] and CLS-ER [15]; we augment Loss-Plateau [10], Loss-EWMA [338],
One-Shot [309] and Energy [16] with Surprise Sampling, where we use the baseline ER [250] as
the CL component of each method for a direct comparison. Lastly, we integrate
αMetaSup + SS into the GCL methods αS-ER and αS-DER++. Results in Table 7.2 show that
αMetaSup is the only method that performs better than random. αMetaSup generalizes better and is simple yet flexible enough to be used in conjunction with other CL methods for improved
performance.
7.5 Related Works
7.5.1 Batch Model Consolidation
Following the taxonomy by De Lange et al. [72], we summarize methods that mitigate forgetting in three categories: Replay, Regularization, and Parameter-isolation methods.
Replay methods identify a limited number of exemplars to store in an auxiliary dataset
(buffer) that is used to retain performance on previously seen tasks through rehearsal (ER
[250], GEM [190], A-GEM [53], GSS [11]). An auxiliary loss can be applied as a regularization term to the main training task, such as with Knowledge Distillation (DER++ [45],
iCaRL [248], FDR [31], DMC [330], ExModel [49]) or by restricting the gradient magnitude
(GEM [190], A-GEM [53]). Exemplars can be randomly selected from the original dataset
[45] or synthetically generated [49]. Similarly, we perform distillation on previously stored
exemplars in a memory bank to consolidate knowledge from previous tasks. In contrast to
previous works, we perform a two-step process: a regularization-phase, where we maintain
proximity of the newly trained task-specific (expert) model to the old (base) model with a
stability loss, as opposed to 'knowledge transfer'; and a second consolidation-phase, where we
apply a batched distillation loss on the pair-wise intermediate representations between multiple expert models and an older model, using real exemplars from a buffer. We identify that
combining multiple 'teacher' models in a single step is better than a single 'teacher' model
over multiple steps (DMC [330]), and we provide a theoretical justification of this result in
Section 7.1.3.
Other methods store exemplars in their buffer using prototypes [248], by increasing exemplar
representability [11], or use gradient projection [190, 53] to restrict drastic gradient updates.
Such methods are orthogonal to ours, since BMC is extendable to different types of
buffer sampling methods and to different regularization losses as the stability loss.
Parameter-isolation approaches keep important weights fixed to reduce forgetting.
SupSup [309], HAT [265], PSP [61], PNN [261], and BLIP [267] identify and assign task-specific parameters in the model via supermasks or by appending new weights to the model
[261]. Model Zoo [244] infers and trains a group of similar tasks in one model as a 'weak
classifier' to utilize shared domain knowledge, and uses an ensemble of models during inference. In such approaches the number of parameters grows with the number of
tasks. Methods such as PackNet [197] and RMN [151] overwrite unimportant parameters
to provide larger model capacity for new tasks and do not grow indefinitely. Similarly, we
assign an expert model to each task to isolate the parameters. However, we keep the
number of experts at each incremental step fixed so that the cost of our method remains
constant as the number of tasks grows. Finally, we perform inference using a single base
model as opposed to an ensemble of models.
Regularization methods such as EWC [158] and, similarly, MAS [9] and SI [325] use an
auxiliary loss term to constrain optimization w.r.t. a metric of importance for each parameter of a given task. LwF [180] distills knowledge from the previous model using current
task data, and LFL [145] freezes a portion of the network while penalizing intermediate representations using the Euclidean distance. These approaches are orthogonal to our method
and are candidates for the stability loss. We find that they underperform compared to our
method.
Distributed Continual Learning combines Continual Learning and Federated Learning by incrementally learning a model stored on a central server using distributed devices.
Previous works learn the same task on multiple distributed devices that are allocated different subsets of data (CFeD [192]), or train the same sequence of tasks on each device
with inter-client communication of model parameters (FedWeIT [319]). Our method also
combines the knowledge of multiple models trained on distributed devices. In contrast to
previous work, we train a unique task on each device to learn an expert model. We consolidate multiple expert models with batched distillation on real data as opposed to Knowledge
Distillation on an auxiliary dataset [192] or simple aggregation [319]. Lastly, in contrast to
the Federated Learning setting, our method addresses performance constraints as opposed to privacy constraints, and prohibits inter-client communication. The remote devices communicate
once per incremental step with the central server.
In summary, our approach is a combination of a Regularization method with the use
of the stability loss, Replay method with the use of batched distillation loss, and finally
Parameter-isolation where we train multiple experts on disjoint tasks and in a distributed
fashion.
7.5.2 αMetaSup
OOD detection classifies 'novel' test-time samples when compared to examples from the
training-set [314]. Offline OOD detection methods are applied post-hoc to training, where
Gradients [131], the Jacobian [221], Features [279] and Logits [181] can be used to infer novelty.
In GCL, the task-ID inference problem is often handled as online OOD detection [208]
and, by extension, Temporal Anomaly detection [110]. The goal is to classify novel training
samples w.r.t. recent training data. A neural network such as an LSTM can be used on the
time series of training statistics [271, 184] to classify novelty, or a hyper-parameter chosen
prior to training can be used as a 'Fixed Threshold' [4]. We systematically evaluate the previously
mentioned OOD statistics using a Fixed Threshold (Section 7.3.2) and find that an Attention-based classifier has stronger generalization than other Temporal Anomaly detection models
(Section 7.4.2).
Task-Aware CL Methods use the task-ID as an integral part to compute parameter
importance (EWC [158], MAS [9]), perform parameter isolation (PNN [261]), extract class
prototypes (iCaRL [248]), store exemplars (A-GEM [53]), or distill knowledge (LwF [180]).
These approaches all require the task-ID and are not directly applicable to General Continual
Learning. Our approach identifies the task boundary from training statistics and can work
jointly with task-dependent methods on General Continual Learning scenarios where the
task-boundary is identified implicitly via Meta-Surprise.
Task-Switch Detection Methods use a task-aware CL approach with the task-ID inferred from recent training statistics.
Aljundi et al. [10] use the loss surface to detect a task-switch ('Loss-Plateau') and update
the MAS parameter importance [9]. TAME [338] infers a task-switch from the exponentially
weighted moving average of the loss ('Loss-EWMA'). SupSup [309] uses the 'One-Shot' algorithm,
where the entropy of the logits is used to infer 'Uncertainty'. VariGrow [16] infers the task
boundary using 'Free Energy' [187] on the model outputs. incDFM [251] infers a task-switch
through the feature reconstruction error between previously novel samples and newly arrived
samples. We evaluate the above training statistics and propose Meta-Surprise, where we
train an adaptive classifier on the metrics on an auxiliary task-sequence and use it in an online
fashion, as opposed to using a fixed threshold determined on the task-sequence itself. Our
method outperforms other methods on varying task-gaps (Table 7.3).
Previous works have investigated probabilistic models for novelty detection. For instance, CN-DPM [177] trains generative-discriminative experts to choose which novel samples to store
in a short-term buffer. HCL [157] trains a generative structure in an online manner on the
training data and infers a task-switch from the similarity between the current training data and the generative structure. Bayesian Gradient Descent [326] based methods [327, 328, 121] use online
variational Bayes to approximate the posterior distribution of input samples for implicit task-switch detection. Contrarily, our work does not train a model in an online manner, which
we find to be computationally expensive and to lead to unstable training dynamics [16, 251].
Our task-identification approach is based on a simple classifier that is trained on meta-training statistics from an auxiliary
task-sequence and can adapt to the novel training dynamics of a new
dataset; as such it is more robust and requires fewer resources.
Rehearsal Learning methods [19, 256] store samples in a buffer of fixed size. A sampling
strategy is composed of two decision steps: 'when to sample' (at which time step during
training) and 'what to sample' (such as sampling the most informative samples). Task-agnostic rehearsal methods ER [250], DER++ [45] and CLS-ER [15] tackle the lack of a task-ID
through 'when to sample', assuming the data stream distribution to be uniform and sampling from
every batch with a probability. InfoRS [277] samples only when the Memorable Information
Criterion of a Bayesian model is greater than a threshold. On the contrary, GSS [11],
GMED [142], LARS [46], Rainbow Memory [23], Selective ER [135] and Ring buffer [54]
identify the most important samples to keep in the buffer. 'What to sample' from a given
batch is orthogonal to our method, as such methods can be applied in conjunction with
'when to sample' to improve performance. Surprise Sampling is task-agnostic and addresses
'when to sample'. Our method outperforms all baselines for which there is an open-source
implementation [250, 45, 15, 187, 338, 10, 309]. We make our code, dataset, and benchmark
available for anyone to evaluate their method.
7.6 Conclusions
We propose Batch Model Consolidation, a Continual Learning framework that reduces catastrophic forgetting when training on a long sequence of tasks from diverse domains and of varying difficulty. Our method combines Regularization, with the use of the stability loss on a
previously trained base model; Replay, with the use of batched distillation loss on memory
data; and Parameter-Isolation where multiple expert models are trained on a sequence of
disjoint tasks. Lastly, we extend our framework to work in a distributed setting where each
expert can reside on a different device and specialize in a given task. We experimentally
demonstrate that BMC is the only method that maintains performance for our long sequence
of 71 tasks. Lastly, we make the code of this work publicly available so that it can serve as
a benchmark for future work in Distributed Continual Learning.
Additionally, we perform a systematic review of statistics that can be used to identify
forgetting, which we term Surprise-Metrics. We propose Surprise Sampling, a biologically inspired method that
stores recent training data to memory in the event of a surprise. Lastly, we propose MetaSup,
and its adaptive variant αMetaSup, where we train a model to identify surprise on an
auxiliary task-sequence separate from the main task-sequence. We exhaustively evaluate all components
of our method in ablation studies and conclude that αMetaSup is the only method that
performs well on our benchmark, Stream. Our method is simple yet effective and
can be used with any other CL method.
Chapter 8
Discussion and Conclusions
There are several open problems in ML research that we find relevant to the principled evaluation of a method's ability to generalize. ML systems can be stochastic by nature and, as such, hard to
predict given similar initial conditions. With the broad application of Neural Networks, generalization evaluated only on a held-out subset of the training dataset is no longer sufficient. In our work
we study generalization under three evaluation settings: Representational Generalization
(Chapter 3), Model Generalization (Chapter 5) and Algorithmic Generalization (Chapter 7).
In Chapter 3 we evaluate the generalization of the representations of a model for two-stage training as well as end-to-end fusion in a single stage. From our analysis, we conclude
that representations with increased mode-coverage lead to improved generalization. Our
work proposes Depthwise Quantization (Section 3.1.1), where the decomposition of the representations leads to improved efficiency through the utilization of fewer codes and improved
mode-coverage. With our work on Conditional-Cross-Attention (CCA) we demonstrate
that the improved mode coverage of hierarchical quantization methods provides a progressively
more difficult training objective for Transformer models. Our ensemble of Conditional-Transformer-Decoder models leads to improved reconstructions at a reduced computational
cost.
In Chapter 5 we evaluate the generalization of the training process, which we also refer to as a meta-model. We conclude that existing frameworks that evaluate the performance of a method can
be misleading. The training process dynamics are unique to ML and exhibit heteroskedasticity and non-normality, so existing methods such as ANOVA are not applicable. We propose
an evaluation framework for a meta-model via model comparison. Our framework evaluates
whether meta-model comparisons generalize across different training conditions, such as random
seeds, Optimizers and Schedulers. On the basis of both previous work and our own, we
identify errors during the training process that can be addressed with improvements in tooling. We propose and open-source a tool, ABLATOR, an AutoML framework
that is robust to errors and improves the execution and analysis of experiments.
In Chapter 7 we evaluate the generalization of the learning algorithm. We conclude that
simple methods such as Experience Replay using a short-term memory buffer outperform
more complex methods in realistic scenarios. With the intuition derived from mini-batch
training, we propose to train multiple expert models and consolidate them in a single step,
Batch-Model-Consolidation. Our method provides an improvement in training speed as well
as improved performance on our benchmark composed of a long sequence of tasks, Stream.
Next, we evaluate 'when to sample' with our work on αMetaSup. We use the biologically
inspired mechanism of surprise to identify a context switch. During a context-switch, our
method appends samples observed prior to the surprise into memory. αMetaSup outperforms all previous
methods.
Our work identified several obstacles to training generalizable models, which include a lack
of community standards on experimental methodology in ML, experimental tooling, evaluation frameworks, and learning frameworks.
Bibliography
[1] Sunim Acharya. Electronic components and devices: Dataset containing major electrical and electronic components and devices. https://www.kaggle.com/datasets/
aryaminus/electronic-components. Accessed: 2022-11-10.
[2] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc
Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34:29304–29320, 2021.
[3] Param Aggarwal. Fashion product images dataset: 44k products with multiple category labels, descriptions and high-res images. https://www.kaggle.com/datasets/
paramaggarwal/fashion-product-images-dataset. Accessed: 2022-11-10.
[4] Subutai Ahmad, Alexander Lavin, Scott Purdy, and Zuha Agha. Unsupervised realtime anomaly detection for streaming data. Neurocomputing, 262:134–147, 2017. Online Real-Time Learning Strategies for Data Streams.
[5] M Israk Ahmed, Shahriyar Mahmud Mamun, and Asif Uz Zaman Asif. Dcnn-based
vegetable image classification using transfer learning: A comparative study. In 2021
5th International Conference on Computer, Communication and Signal Processing (ICCCSP), pages 235–243. IEEE, 2021.
[6] Saeed S. Alahmari, Dmitry B. Goldgof, Peter R. Mouton, and Lawrence O. Hall.
Challenges for the repeatability of deep learning models. IEEE Access, 8:211860–
211868, 2020.
Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason
Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-Supervised MultiModal Versatile Networks. In NeurIPS, 2020.
[8] Alex Alemi, Ben Poole, Ian Fischer, Josh Dillon, Rif A Saurus, and Kevin Murphy.
An information-theoretic analysis of deep latent-variable models. 2018.
[9] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and
Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. CoRR,
abs/1711.09601, 2017.
[10] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 11254–11263, 2019.
[11] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Online continual
learning with no task boundaries. CoRR, abs/1903.08671, 2019.
[12] Amazon. Amazon customer reviews dataset. https://s3.amazonaws.com/
amazon-reviews-pds/readme.html. Accessed: 2023-02-20.
[13] Aleksandr Antonov. Apparel images dataset. https://www.kaggle.com/datasets/
trolukovich/apparel-images-dataset. Accessed: 2022-10-30.
[14] Asia Pacific Tele-Ophthalmology Society (APTOS). APTOS 2019 blindness detection. https://www.kaggle.com/competitions/aptos2019-blindness-detection/
overview. Accessed: 2022-10-30.
[15] Elahe Arani, Fahad Sarfraz, and Bahram Zonooz. Learning fast, learning slow: A
general continual learning method based on complementary learning system. CoRR,
abs/2201.12604, 2022.
[16] Randy Ardywibowo, Zepeng Huo, Zhangyang Wang, Bobak J Mortazavi, Shuai Huang,
and Xiaoning Qian. Varigrow: Variational architecture growing for task-agnostic continual learning based on bayesian novelty. In International Conference on Machine
Learning, pages 865–877. PMLR, 2022.
[17] Alexandre Attia. Simpson recognition. https://github.com/alexattia/
SimpsonRecognition. Accessed: 2022-11-10.
[18] Artem Babenko and Victor Lempitsky. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 931–938, 2014.
[19] Benedikt Bagus and Alexander Gepperth. An investigation of replay-based approaches
for continual learning. CoRR, abs/2108.06758, 2021.
[20] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation
by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[21] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in
high-energy physics with deep learning. Nature communications, 5(1):4308, 2014.
[22] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and
machine intelligence, 41(2):423–443, 2018.
[23] Jihwan Bang, Heesu Kim, Young Joon Yoo, Jung-Woo Ha, and Jonghyun Choi.
Rainbow memory: Continual learning with a memory of diverse samples. CoRR,
abs/2103.17230, 2021.
[24] Puneet Bansal. Intel image classification: Image scene classification of multiclass.
https://www.kaggle.com/datasets/puneet6060/intel-image-classification.
Accessed: 2022-11-10.
[25] Olga Belitskaya. Classification of handwritten letters: Images of russian letters. https://www.kaggle.com/datasets/olgabelitskaya/
classification-of-handwritten-letters. Accessed: 2022-11-10.
[26] Samuel Bell, Onno Kampman, Jesse Dodge, and Neil D Lawrence. Modeling the
machine learning multiverse. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and
Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
[27] Samuel J Bell, Onno Kampman, Jesse Dodge, and Neil Lawrence. Modeling the
machine learning multiverse. Advances in Neural Information Processing Systems,
35:18416–18429, 2022.
[28] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document
transformer. arXiv:2004.05150, 2020.
[29] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A
review and new perspectives. IEEE transactions on pattern analysis and machine
intelligence, 35(8):1798–1828, 2013.
[30] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine
learning, pages 41–48, 2009.
[31] Ari S Benjamin, David Rolnick, and Konrad Kording. Measuring and regularizing
networks in function space. arXiv preprint arXiv:1805.08289, 2018.
[32] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for
hyper-parameter optimization. Advances in neural information processing systems, 24,
2011.
[33] Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The
million song dataset. 2011.
[34] André Biedenkapp, Marius Lindauer, Katharina Eggensperger, Frank Hutter, Chris
Fawcett, and Holger Hoos. Efficient parameter importance analysis via ablation with
surrogates. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[35] André Biedenkapp, Joshua Marben, Marius Lindauer, and Frank Hutter. Cave: Configuration assessment, visualization and evaluation. In Roberto Battiti, Mauro Brunato,
Ilias Kotsireas, and Panos M. Pardalos, editors, Learning and Intelligent Optimization,
pages 115–130, Cham, 2019. Springer International Publishing.
[36] Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks
and discriminant analysis in predicting forest cover types from cartographic variables.
Computers and electronics in agriculture, 24(3):131–151, 1999.
[37] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision,
2014.
[38] Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, Edward Raff, Kanika Madan,
Vikram Voleti, et al. Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems, 3:747–769, 2021.
[39] Xavier Bouthillier, César Laurent, and Pascal Vincent. Unreproducible research is reproducible. In International Conference on Machine Learning, pages 725–734. PMLR,
2019.
[40] Xavier Bouthillier, César Laurent, and Pascal Vincent. Unreproducible research is
reproducible. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings
of the 36th International Conference on Machine Learning, volume 97 of Proceedings
of Machine Learning Research, pages 725–734. PMLR, 09–15 Jun 2019.
[41] Xavier Bouthillier and Gaël Varoquaux. Survey of machine-learning experimental
methods at NeurIPS2019 and ICLR2020. PhD thesis, Inria Saclay Ile de France, 2020.
[42] Emirhan BULUT. Planets and moons dataset - ai in space: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/emirhanai/Planets-and-Moons-Dataset-AI-in-Space and https://www.kaggle.com/datasets/emirhanai/planets-and-moons-dataset-ai-in-space, 2022.
[43] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-vae.
arXiv preprint arXiv:1804.03599, 2018.
[44] Shiekh Burhan. Face mask dataset: Covid-19 dataset for training face mask classifier.
https://www.kaggle.com/datasets/shiekhburhan/face-mask-dataset. Accessed:
2022-11-10.
[45] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and SIMONE
CALDERARA. Dark experience for general continual learning: a strong, simple baseline. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,
Advances in Neural Information Processing Systems, volume 33, pages 15920–15930.
Curran Associates, Inc., 2020.
[46] Pietro Buzzega, Matteo Boschini, Angelo Porrello, and Simone Calderara. Rethinking
experience replay: a bag of tricks for continual learning. CoRR, abs/2010.05595, 2020.
[47] Pietro Buzzega, Matteo Boschini, Angelo Porrello, and Simone Calderara. Rethinking
experience replay: a bag of tricks for continual learning. In 2020 25th International
Conference on Pattern Recognition (ICPR), pages 2180–2187. IEEE, 2021.
[48] Olivier Cappé and Eric Moulines. On-line expectation-maximization algorithm for
latent data models. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 71(3):593–613, Jun 2009.
[49] Antonio Carta, Andrea Cossu, Vincenzo Lomonaco, and Davide Bacciu. Ex-model:
Continual learning from a stream of trained models. CoRR, abs/2112.06511, 2021.
[50] Olivier Chapelle and Yi Chang. Yahoo! learning to rank challenge overview. In
Proceedings of the learning to rank challenge, pages 1–24. PMLR, 2011.
[51] Arslan Chaudhry, Puneet Kumar Dokania, Thalaiyasingam Ajanthan, and Philip H. S.
Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. CoRR, abs/1801.10112, 2018.
[52] Arslan Chaudhry, Albert Gordo, Puneet Dokania, Philip Torr, and David Lopez-Paz.
Using hindsight to anchor past knowledge in continual learning. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 35, pages 6993–7001, 2021.
[53] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny.
Efficient lifelong learning with A-GEM. CoRR, abs/1812.00420, 2018.
[54] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan,
Puneet Kumar Dokania, Philip H. S. Torr, and Marc’Aurelio Ranzato. Continual
learning with tiny episodic memories. CoRR, abs/1902.10486, 2019.
[55] Boyuan Chen, Mingzhi Wen, Yong Shi, Dayi Lin, Gopi Krishnan Rajbahadur, and
Zhen Ming Jiang. Towards training reproducible deep learning models. CoRR,
abs/2202.02326, 2022.
[56] Jianfei Chen, Jun Zhu, Yee Whye Teh, and Tong Zhang. Stochastic expectation maximization with variance reduction. In NeurIPS, pages 7978–7988, 2018.
[57] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and
Ilya Sutskever. Generative pretraining from pixels. In International Conference on
Machine Learning, pages 1691–1703. PMLR, 2020.
[58] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.
[59] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883,
Oct 2017.
[60] Junyan Cheng, Iordanis Fostiropoulos, Barry Boehm, and Mohammad Soleymani.
Multimodal phased transformer for sentiment analysis. In Proceedings of the 2021
Conference on Empirical Methods in Natural Language Processing, pages 2447–2458,
Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
[61] Brian Cheung, Alexander Terekhov, Yubei Chen, Pulkit Agrawal, and Bruno Olshausen. Superposition of many models into one. In H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural
Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[62] Rewon Child. Very deep {vae}s generalize autoregressive models and can outperform
them on images. In International Conference on Learning Representations, 2021.
[63] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences
with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[64] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences
with sparse transformers, 2019.
[65] François Chollet. Xception: Deep learning with depthwise separable convolutions. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages
1251–1258, 2017.
[66] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea
Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser,
et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
[67] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern
Recognition (CVPR), 2014.
[68] Pierre-Alexandre Clorichel. Boat types recognition: About 1,500 pictures of
boats classified in 9 categories. https://www.kaggle.com/datasets/clorichel/
boat-types-recognition. Accessed: 2022-11-10.
[69] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: an
extension of MNIST to handwritten letters. CoRR, abs/1702.05373, 2017.
[70] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan
Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, pages 2978–2988, Florence, Italy, July 2019. Association for Computational
Linguistics.
[71] Luke Nicholas Darlow, Elliot J. Crowley, Antreas Antoniou, and Amos J. Storkey.
CINIC-10 is not imagenet or CIFAR-10. CoRR, abs/1810.03505, 2018.
[72] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš
Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey:
Defying forgetting in classification tasks. IEEE transactions on pattern analysis and
machine intelligence, 44(7):3366–3385, 2021.
[73] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A
large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision
and Pattern Recognition, pages 248–255, 2009.
[74] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
[75] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford,
and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint
arXiv:2005.00341, 2020.
[76] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford,
and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint
arXiv:2005.00341, 2020.
[77] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang
Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image
generation via transformers. arXiv preprint arXiv:2105.13290, 2021.
[78] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold,
Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition
at scale. arXiv preprint arXiv:2010.11929, 2020.
[79] Katharina Eggensperger, Marius Lindauer, and Frank Hutter. Pitfalls and best practices in algorithm configuration. Journal of Artificial Intelligence Research, 64:861–893,
2019.
[80] Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects? ACM
Trans. Graph. (Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.
[81] William HR Equitz and Thomas M Cover. Successive refinement of information. IEEE
Transactions on Information Theory, 37(2):269–275, 1991.
[82] William Falcon et al. Pytorch lightning. GitHub repository, 3, 2019.
[83] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and
Pierre-Alain Muller. Deep learning for time series classification: a review. CoRR,
abs/1809.04356, 2018.
[84] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank
Hutter. Auto-sklearn 2.0: The next generation. CoRR, abs/2007.04074, 2020.
[85] Christian Fiedler, Carsten W Scherer, and Sebastian Trimpe. Practical and rigorous uncertainty bounds for gaussian process regression. In Proceedings of the AAAI
conference on artificial intelligence, volume 35, pages 7439–7447, 2021.
[86] V. Fomin, J. Anmol, S. Desroziers, J. Kriss, and A. Tejani. High-level library to help
with training neural networks in pytorch. https://github.com/pytorch/ignite,
2020.
[87] Iordanios Fostiropoulos. Stream benchmark. https://github.com/fostiropoulos/
stream_benchmark, 2023.
[88] Iordanios Fostiropoulos. Stream benchmark. https://github.com/fostiropoulos/
stream, 2023.
[89] Iordanis Fostiropoulos. Depthwise discrete representation learning. arXiv preprint
arXiv:2004.05462, 2020.
[90] Iordanis Fostiropoulos and Barry Boehm. Implicit feature decoupling with depthwise
quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 396–405, June 2022.
[91] Iordanis Fostiropoulos, Bowman Brown, and Laurent Itti. Reproducibility requires
consolidated artifacts. arXiv preprint arXiv:2305.12571, 2023.
[92] Iordanis Fostiropoulos, Bowman Noah Brown, and Laurent Itti. Trustworthy model
evaluation on a budget. In ICLR 2023 Workshop on Trustworthy and Reliable LargeScale Machine Learning Models, 2023.
[93] Iordanis Fostiropoulos and Laurent Itti. Supervised contrastive prototype learning:
Augmentation free robust neural network, 2022.
[94] Iordanis Fostiropoulos and Laurent Itti. Ablator: Robust horizontal-scaling of machine
learning ablation experiments. In AutoML Conference 2023 (ABCD Track), 2023.
[95] Iordanis Fostiropoulos, Jiaye Zhu, and Laurent Itti. Batch model consolidation: A
multi-task model consolidation framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3664–3676, June
2023.
[96] Christopher T Franck and Robert B Gramacy. Assessing bayes factor surfaces using
interactive visualization and computer surrogate modeling. The American Statistician,
74(4):359–369, 2020.
[97] Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Estimating
mutual information for discrete-continuous mixtures. arXiv preprint arXiv:1709.06212,
2017.
[98] A Gbeminiyi. Multi-class weather dataset for image classification. Mendeley Data,
2018.
[99] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization.
IEEE transactions on pattern analysis and machine intelligence, 36(4):744–755, 2013.
[100] Yunhao Ge, Yuecheng Li, Di Wu, Ao Xu, Adam M Jones, Amanda Sofie Rios, Iordanis Fostiropoulos, Po-Hsuan Huang, Zachary William Murdock, Kiran Lekkala, et al.
Shared knowledge lifelong learning. 2022.
[101] Yunhao Ge, Yuecheng Li, Di Wu, Ao Xu, Adam M. Jones, Amanda Sofie Rios, Iordanis Fostiropoulos, Shixian Wen, Po-Hsuan Huang, Zachary William Murdock, Gozde
Sahin, Shuo Ni, Kiran Lekkala, Sumedh Anand Sontakke, and Laurent Itti. Lightweight
learner for shared knowledge lifelong learning, 2023.
[102] Jan-Mark Geusebroek, Gertjan J Burghouts, and Arnold WM Smeulders. The amsterdam library of object images. International Journal of Computer Vision, 61:103–112,
2005.
[103] Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. An information theoretic tradeoff between complexity and accuracy. In Learning Theory and Kernel Machines, pages
595–609. Springer, 2003.
[104] Shubham Goel and Bill Hall. Dermnet: Image data for 23 categories of skin diseases.
https://www.kaggle.com/datasets/shubhamgoel27/dermnet. Accessed: 2022-11-
10.
[105] Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting
deep learning models for tabular data. CoRR, abs/2106.11959, 2021.
[106] Quentin F Gronau, Alexander Ly, and Eric-Jan Wagenmakers. Informed bayesian
t-tests. The American Statistician, 2019.
[107] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern
neural networks. In International conference on machine learning, pages 1321–1330.
PMLR, 2017.
[108] Jianbo Guo, Yuxi Li, Weiyao Lin, Yurong Chen, and Jianguo Li. Network decoupling:
From regular to depthwise separable convolutions. arXiv preprint arXiv:1808.05517,
2018.
[109] Yiduo Guo, Bing Liu, and Dongyan Zhao. Online continual learning through mutual
information maximization. In International Conference on Machine Learning, pages
8109–8126. PMLR, 2022.
[110] Eyal Gutflaish, Aryeh Kontorovich, Sivan Sabato, Ofer Biller, and Oded Sofer. Temporal anomaly detection: calibrating the surprise. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 33, pages 3755–3762, 2019.
[111] Isabelle Guyon, Lisheng Sun-Hosoya, Marc Boullé, Hugo Jair Escalante, Sergio Escalera, Zhengying Liu, Damir Jajetic, Bisakha Ray, Mehreen Saeed, Michèle Sebag,
et al. Analysis of the automl challenge series. Automated Machine Learning, 177, 2019.
[112] Daniel Haase and Manuel Amthor. Rethinking depthwise separable convolutions: How
intra-kernel correlations lead to improved mobilenets. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 14600–14609, 2020.
[113] Raia Hadsell, Dushyant Rao, Andrei A. Rusu, and Razvan Pascanu. Embracing change:
Continual learning in deep neural networks. Trends in Cognitive Sciences, 24(12):1028–
1040, 2020.
[114] Isha Hameed, Samuel Sharpe, Daniel Barcklow, Justin Au-Yeung, Sahil Verma, Jocelyn
Huang, Brian Barr, and C Bayan Bruss. Based-xai: Breaking ablation studies down
for explainable artificial intelligence. arXiv preprint arXiv:2207.05566, 2022.
[115] Eduardo Hariton and Joseph J Locascio. Randomised controlled trials—the gold standard for effectiveness research. BJOG: an international journal of obstetrics and gynaecology, 125(13):1716, 2018.
[116] Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. In International Conference on Document Analysis and Recognition (ICDAR), 2015.
[117] Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. Misa: Modalityinvariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, page
1122–1131, New York, NY, USA, 2020. Association for Computing Machinery.
[118] Joost Hazelzet. Images of lego bricks: 40,000 images of 50 different lego
bricks. https://www.kaggle.com/datasets/joosthazelzet/lego-brick-images.
Accessed: 2022-11-10.
[119] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. CoRR, abs/1512.03385, 2015.
[120] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 770–778, 2016.
[121] Xu He, Jakub Sygnowski, Alexandre Galashov, Andrei A Rusu, Yee Whye Teh, and
Razvan Pascanu. Task agnostic continual learning via meta learning. arXiv preprint
arXiv:1906.05201, 2019.
[122] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A
novel dataset and deep learning benchmark for land use and land cover classification.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing,
2019.
[123] Leonhard Held and Manuela Ott. On p-values and bayes factors. Annual Review of
Statistics and Its Application, 5(1):593–419, 2018.
[124] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young
Choi. A comprehensive overhaul of feature distillation. In Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[125] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew
Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual
concepts with a constrained variational framework. 2016.
[126] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural
network. arXiv preprint arXiv:1503.02531, 2(7), 2015.
[127] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing
the generalization gap in large batch training of neural networks. Advances in neural
information processing systems, 30, 2017.
[128] Shahriar Hossain, Jahir Uddin, and Rakibul Alam Nahin. Rock classification dataset:
Multi class classification using different types of images of rocks. https://www.
kaggle.com/datasets/salmaneunus/rock-classification. Accessed: 2022-11-10.
[129] Jeremy Howard and Sylvain Gugger. fastai: A layered API for deep learning. CoRR,
abs/2002.04688, 2020.
[130] Jian Huang, Jianhua Tao, Bin Liu, Zheng Lian, and Mingyue Niu. Multimodal transformer fusion for continuous emotion recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
3507–3511. IEEE, 2020.
[131] Rui Huang, Andrew Geng, and Yixuan Li. On the importance of gradients for detecting
distributional shifts in the wild. CoRR, abs/2110.00218, 2021.
[132] Yibin Huang. Textures classification dataset. https://github.com/abin24/
Textures-Dataset. Accessed: 2022-11-10.
[133] Marcus Hutter. Distribution of mutual information. Advances in neural information
processing systems, 1:399–406, 2002.
[134] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[135] David Isele and Akansel Cosgun. Selective experience replay for lifelong learning.
CoRR, abs/1802.10269, 2018.
[136] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional
neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[137] Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and
Joao Carreira. Perceiver: General perception with iterative attention. arXiv preprint
arXiv:2103.03206, 2021.
[138] Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer,
Yoshua Bengio, and Amos J. Storkey. Three factors influencing minima in SGD. CoRR,
abs/1711.04623, 2017.
[139] Mingi Ji, Byeongho Heo, and Sungrae Park. Show, attend and distill: Knowledge
distillation via attention-based feature matching. CoRR, abs/2102.02973, 2021.
[140] Yiding Jiang, Vaishnavh Nagarajan, Christina Baek, and J Zico Kolter. Assessing
generalization of sgd via disagreement. arXiv preprint arXiv:2106.13799, 2021.
[141] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. arXiv preprint
arXiv:1912.02178, 2019.
[142] Xisen Jin, Arka Sadhu, Junyi Du, and Xiang Ren. Gradient-based editing of memory examples for online task-free continual learning. Advances in Neural Information
Processing Systems, 34:29193–29205, 2021.
[143] jr2ngb (username). Cataract dataset. https://www.kaggle.com/datasets/jr2ngb/
cataractdataset. Accessed: 2022-11-10.
[144] Philipp Jund, Nichola Abdo, Andreas Eitel, and Wolfram Burgard. The freiburg
groceries dataset. CoRR, abs/1611.05799, 2016.
[145] Heechul Jung, Jeongwoo Ju, Minju Jung, and Junmo Kim. Less-forgetting learning in
deep neural networks. CoRR, abs/1607.00122, 2016.
[146] Kaggle. Kaggle: Your Home for Data Science. https://www.kaggle.com. Accessed:
2022-11-10.
[147] Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning, pages 2390–2399. PMLR,
2018.
[148] Kaniav Kamary, Kerrie Mengersen, Christian P Robert, and Judith Rousseau. Testing
hypotheses via a mixture estimation model. arXiv preprint arXiv:1412.2044, 2014.
[149] Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers,
Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class
texture analysis in colorectal cancer histology. Scientific reports, 6(1):1–11, 2016.
[150] Parneet Kaur, Karan Sikka, Weijun Wang, Serge Belongie, and Ajay Divakaran. Foodx-251: A dataset for fine-grained food classification. arXiv preprint
arXiv:1907.06167, 2019.
[151] Prakhar Kaushik, Alex Gain, Adam Kortylewski, and Alan L. Yuille. Understanding
catastrophic forgetting and remembering in continual learning with optimal relevance
mapping. CoRR, abs/2102.11343, 2021.
[152] Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiying
Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al.
Identifying medical diagnoses and treatable diseases by image-based deep learning.
Cell, 172(5):1122–1131, 2018.
[153] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and
Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap
and sharp minima. CoRR, abs/1609.04836, 2016.
[154] Christian Keysers, Valeria Gazzola, and Eric-Jan Wagenmakers. Using bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nature neuroscience,
23(7):788–799, 2020.
[155] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
[156] Diederik P Kingma and Max Welling. An introduction to variational autoencoders.
arXiv preprint arXiv:1906.02691, 2019.
[157] Polina Kirichenko, Mehrdad Farajtabar, Dushyant Rao, Balaji Lakshminarayanan, Nir
Levine, Ang Li, Huiyi Hu, Andrew Gordon Wilson, and Razvan Pascanu. Task-agnostic
continual learning with hybrid probabilistic models. CoRR, abs/2106.12772, 2021.
[158] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka
Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia
Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the
National Academy of Sciences, 114(13):3521–3526, mar 2017.
[159] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020.
[160] Irene Klugkist, Bernet Kato, and Herbert Hoijtink. Bayesian model selection using
encompassing priors. Statistica Neerlandica, 59(1):57–69, 2005.
[161] Zeger F Knops, JB Antoine Maintz, Max A Viergever, and Josien PW Pluim. Normalized mutual information-based registration using k-means clustering-based histogram
binning. In Medical Imaging 2003: Image Processing, volume 5032, pages 1072–1080.
International Society for Optics and Photonics, 2003.
[162] Zeger F Knops, JB Antoine Maintz, Max A Viergever, and Josien PW Pluim. Normalized mutual information based registration using k-means clustering and shading
correction. Medical image analysis, 10(3):432–439, 2006.
[163] Kazuma Kobayashi, Ryuichiro Hataya, Yusuke Kurose, Mototaka Miyake, Masamichi
Takahashi, Akiko Nakagawa, Tatsuya Harada, and Ryuji Hamamoto. Decomposing
normal and abnormal features of medical images into discrete latent codes for contentbased image retrieval. arXiv preprint arXiv:2103.12328, 2021.
[164] Ron Kohavi et al. Scaling up the accuracy of naive-bayes classifiers: A decision-tree
hybrid. In Kdd, volume 96, pages 202–207, 1996.
[165] Mert Koklu. Manga facial expressions: Facial expressions of manga
(japanese comic) character faces. https://www.kaggle.com/datasets/mertkkl/
manga-facial-expressions. Accessed: 2022-11-10.
[166] Murat Koklu, Ilkay Cinar, and Yavuz Selim Taspinar. Classification of rice varieties
with deep learning methods. Computers and electronics in agriculture, 187:106285,
2021.
[167] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for
fine-grained categorization. In 4th International IEEE Workshop on 3D Representation
and Recognition (3dRR-13), Sydney, Australia, 2013.
[168] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[169] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny
images. 2009.
[170] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny
images. 2009.
113
[171] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[172] John K Kruschke and Torrin M Liddell. The bayesian new statistics: Hypothesis
testing, estimation, meta-analysis, and power analysis from a bayesian perspective.
Psychonomic bulletin & review, 25(1):178–206, 2018.
[173] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales
Leonardis, Gregory G. Slabaugh, and Tinne Tuytelaars. Continual learning: A comparative study on how to defy forgetting in classification tasks. CoRR, abs/1909.08383,
2019.
[174] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3,
2015.
[175] Yann LeCun. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998.
[176] Armin Lederer, Jonas Umlauft, and Sandra Hirche. Uniform error bounds for gaussian
process regression with application to safe control. Advances in Neural Information
Processing Systems, 32, 2019.
[177] Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim. A neural dirichlet process
mixture model for task-free continual learning. arXiv preprint arXiv:2001.00689, 2020.
[178] Henry W Leung and Jo Bovy. Deep learning of multi-element abundances from highresolution spectroscopic data. Monthly Notices of the Royal Astronomical Society, nov
2018.
[179] Li-Jia Li and Li Fei-Fei. What, where and who? classifying events by scene and object
recognition. In 2007 IEEE 11th international conference on computer vision, pages
1–8. IEEE, 2007.
[180] Zhizhong Li and Derek Hoiem. Learning without forgetting. CoRR, abs/1606.09282,
2016.
[181] Shiyu Liang, Yixuan Li, and R. Srikant. Principled detection of out-of-distribution
examples in neural networks. CoRR, abs/1706.02690, 2017.
[182] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez, and
Ion Stoica. Tune: A research platform for distributed model selection and training.
arXiv preprint arXiv:1807.05118, 2018.
[183] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint
arXiv:1312.4400, 2013.
[184] Benjamin Lindemann, Benjamin Maschler, Nada Sahlab, and Michael Weyrich. A
survey on anomaly detection for technical systems using lstm networks. Computers in
Industry, 131:103498, 2021.
114
[185] Chao Liu, Cuiyun Gao, Xin Xia, David Lo, John Grundy, and Xiaohu Yang. On
the reproducibility and replicability of deep learning in software engineering. ACM
Transactions on Software Engineering and Methodology (TOSEM), 31(1):1–46, 2021.
[186] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng
Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv
preprint arXiv:1908.03265, 2019.
[187] Weitang Liu, Xiaoyun Wang, John D. Owens, and Yixuan Li. Energy-based out-ofdistribution detection. CoRR, abs/2010.03759, 2020.
[188] Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli
Bagher Zadeh, and Louis-Philippe Morency. Efficient low-rank multimodal fusion with
modality-specific factors. In Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, July
2018. Association for Computational Linguistics.
[189] Vincenzo Lomonaco and Davide Maltoni. Core50: a new dataset and benchmark for
continuous object recognition. In Conference on Robot Learning, pages 17–26. PMLR,
2017.
[190] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continuum
learning. CoRR, abs/1706.08840, 2017.
[191] Daniel Ma, Gerald Friedland, and Mario Michael Krell. Origamiset1. 0: Two
new datasets for origami classification and difficulty estimation. arXiv preprint
arXiv:2101.05470, 2021.
[192] Yuhang Ma, Zhongle Xie, Jue Wang, Ke Chen, and Lidan Shou. Continual federated
learning based on knowledge distillation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, July 2022.
[193] Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and
Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the
49th annual meeting of the association for computational linguistics: Human language
technologies, pages 142–150, 2011.
[194] David JC MacKay and David JC Mac Kay. Information theory, inference and learning
algorithms. Cambridge university press, 2003.
[195] Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner.
Online continual learning in image classification: An empirical survey. Neurocomputing,
469:28–51, 2022.
[196] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew B. Blaschko, and Andrea
Vedaldi. Fine-grained visual classification of aircraft. CoRR, abs/1306.5151, 2013.
[197] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single
network by iterative pruning. CoRR, abs/1711.05769, 2017.
115
[198] Franck Mamalet and Christophe Garcia. Simplifying convnets for fast learning. In
International Conference on Artificial Neural Networks, pages 58–65. Springer, 2012.
[199] Marc Masana, Xialei Liu, Bartlomiej Twardowski, Mikel Menta, Andrew D. Bagdanov,
and Joost van de Weijer. Class-incremental learning: survey and performance evaluation. CoRR, abs/2010.15277, 2020.
[200] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical
model of large-batch training. CoRR, abs/1812.06162, 2018.
[201] James L McGaugh. Making lasting memories: Remembering the significant. Proceedings of the National Academy of Sciences, 110(supplement 2):10402–10407, 2013.
[202] Moritz Meister, Sina Sheikholeslami, Amir H Payberah, Vladimir Vlassov, and Jim
Dowling. Maggy: Scalable asynchronous parallel hyperparameter search. In Proceedings of the 1st Workshop on Distributed Machine Learning, pages 28–33, 2020.
[203] Mostafa Mohamed. Garbage classification (12 classes): Images dataset for
classifying household garbage. https://www.kaggle.com/datasets/mostafaabla/
garbage-classification. Accessed: 2022-11-10.
[204] Dominic Monn. Clothing & models: A collection of clothing pieces, scraped from
zalando.com. https://www.kaggle.com/datasets/dqmonn/zalando-store-crawl.
Accessed: 2022-11-10.
[205] Young-Il Moon, Balaji Rajagopalan, and Upmanu Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318, 1995.
[206] Paul Mooney. Chest x-ray images (pneumonia). https://www.kaggle.com/datasets/
paultimothymooney/chest-xray-pneumonia. Accessed: 2022-11-10.
[207] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw,
Eric Liang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed
framework for emerging AI applications. CoRR, abs/1712.05889, 2017.
[208] Martin Mundt, Yong Won Hong, Iuliia Pliushch, and Visvanathan Ramesh. A wholistic
view of continual learning with deep neural networks: Forgotten lessons and the bridge
to active and open world learning. 2020.
[209] Akash Nagaraj. ASL Alphabet: Image data set for alphabets in the American Sign Language. https://www.kaggle.com/datasets/grassknoted/asl-alphabet. Accessed:
2022-10-30.
[210] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y.
Ng. Reading digits in natural images with unsupervised feature learning. In NIPS
Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
[211] M.-E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In
2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’06), volume 2, pages 1447–1454, 2006.
116
[212] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning
via lifted structured feature embedding. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4004–4012, 2016.
[213] Alex Olsen, Dmitry A Konovalov, Bronson Philippa, Peter Ridd, Jake C Wood, Jamie
Johns, Wesley Banks, Benjamin Girgenti, Owen Kenny, James Whinney, et al. Deepweeds: A multiclass weed species image dataset for deep learning. Scientific reports,
9(1):1–12, 2019.
[214] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves,
and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. arXiv
preprint arXiv:1606.05328, 2016.
[215] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017.
[216] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017.
[217] C¸ . F. Ozgenel. Concrete crack images for classification. ¨ Mendeley Data, V2, 2019.
[218] R Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.
[219] Zexu Pan, Zhaojie Luo, Jichen Yang, and Haizhou Li. Multi-Modal Attention for
Speech Emotion Recognition. In Proc. Interspeech 2020, pages 364–368, 2020.
[220] Liam Paninski. Estimation of entropy and mutual information. Neural computation,
15(6):1191–1253, 2003.
[221] Jaewoo Park, Hojin Park, Eunju Jeong, and Andrew Beng Jin Teoh. Understanding open-set recognition by jacobian norm of representation. arXiv preprint
arXiv:2209.11436, 2022.
[222] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine
Learning, pages 4055–4064. PMLR, 2018.
[223] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary
DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic
differentiation package - torch.autograd, 2017.
[224] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
[225] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang.
Moment matching for multi-source domain adaptation. In Proceedings of the IEEE
International Conference on Computer Vision, pages 1406–1415, 2019.
117
[226] Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnab´as
P´oczos. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6892–6899, 2019.
[227] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In 2007 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2007.
[228] David Picard. Torch. manual seed (3407) is all you need: On the influence of
random seeds in deep learning architectures for computer vision. arXiv preprint
arXiv:2109.08203, 2021.
[229] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivi`ere, Alina
Beygelzimer, Florence d’Alch´e Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility
program). J. Mach. Learn. Res., 22(1), jul 2021.
[230] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivi`ere, Alina
Beygelzimer, Florence d’Alch´e Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility
program). J. Mach. Learn. Res., 22(1), jul 2021.
[231] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivi`ere, Alina
Beygelzimer, Florence d’Alch´e Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility
program). The Journal of Machine Learning Research, 22(1):7459–7478, 2021.
[232] Felice Pollano. Watermark dataset builder. https://github.com/FelicePollano/
WatermarkDataSetBuilder. Accessed: 2022-11-10.
[233] Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. Gdumb: A simple approach
that questions our progress in continual learning. In European conference on computer
vision, pages 524–540. Springer, 2020.
[234] Philipp Probst, Anne-Laure Boulesteix, and Bernd Bischl. Tunability: Importance of
hyperparameters of machine learning algorithms. The Journal of Machine Learning
Research, 20(1):1934–1965, 2019.
[235] Tao Qin and Tie-Yan Liu. Introducing letor 4.0 datasets. arXiv preprint
arXiv:1306.2597, 2013.
[236] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In 2009 IEEE
conference on computer vision and pattern recognition, pages 413–420. IEEE, 2009.
[237] Stefan T Radev, Marco D’Alessandro, Ulf K Mertens, Andreas Voss, Ullrich K¨othe,
and Paul-Christian B¨urkner. Amortized bayesian model comparison with evidential
deep learning. IEEE Transactions on Neural Networks and Learning Systems, 2021.
118
[238] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020, 2021.
[239] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever,
et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[240] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P.
Lillicrap. Compressive transformers for long-range sequence modelling. In International
Conference on Learning Representations, 2020.
[241] Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng
Mao, Louis-Philippe Morency, and Ehsan Hoque. Integrating multimodal information
in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, Online, July 2020. Association for Computational Linguistics.
[242] Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin, and Evgeny Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020.
[243] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford,
Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint
arXiv:2102.12092, 2021.
[244] Rahul Ramesh and Pratik Chaudhari. Boosting a model zoo for multi-task and continual learning. CoRR, abs/2106.03027, 2021.
[245] R´emi Ratajczak, Carlos F Crispim-Junior, Elodie Faure, B´eatrice Fervers, and Laure ´
Tougne. Automatic Land Cover Reconstruction From Historical Aerial Images: An
Evaluation of Features Extraction and Classification Algorithms. IEEE Transactions
on Image Processing, January 2019.
[246] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity
images with vq-vae-2. arXiv preprint arXiv:1906.00446, 2019.
[247] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity
images with vq-vae-2. In Advances in neural information processing systems, pages
14866–14876, 2019.
[248] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and Christoph H. Lampert. icarl:
Incremental classifier and representation learning. CoRR, abs/1611.07725, 2016.
[249] Tom Richardson and Ruediger Urbanke. Modern coding theory. Cambridge university
press, 2008.
[250] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu,
and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and
minimizing interference. CoRR, abs/1810.11910, 2018.
119
[251] Amanda Rios, Nilesh Ahuja, Ibrahima Ndiour, Utku Genc, Laurent Itti, and Omesh
Tickoo. incdfm: Incremental deep feature modeling for continual novelty detection. In
Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October
23–27, 2022, Proceedings, Part XXV, pages 588–604. Springer, 2022.
[252] Emilio Rivera-Landos, Foutse Khomh, and Amin Nikanjam. The challenge of reproducible ml: An empirical study on the impact of bugs. In 2021 IEEE 21st International
Conference on Software Quality, Reliability and Security (QRS), pages 1079–1088,
2021.
[253] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals
of mathematical statistics, pages 400–407, 1951.
[254] Luke E Rogerson, Zhijian Zhao, Katrin Franke, Thomas Euler, and Philipp Berens.
Bayesian hypothesis testing and experimental design for two-photon imaging data.
PLoS computational biology, 15(8):e1007205, 2019.
[255] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory
Wayne. Experience replay for continual learning. Advances in Neural Information
Processing Systems, 32, 2019.
[256] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne.
Experience replay for continual learning. CoRR, abs/1811.11682, 2018.
[257] Jeffrey N Rouder, Paul L Speckman, Dongchu Sun, Richard D Morey, and Geoffrey
Iverson. Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic
bulletin & review, 16(2):225–237, 2009.
[258] Nicolas P. Rougier, Konrad Hinsen, Fr´ed´eric Alexandre, Thomas Arildsen, Lorena A.
Barba, Fabien C. Y. Benureau, C. Titus Brown, Pierre de Buyl, Ozan Caglayan, Andrew P. Davison, Marc-Andr´e Delsuc, Georgios Detorakis, Alexandra K. Diem, Damien
Drix, Pierre Enel, Benoˆıt Girard, Olivia Guest, Matt G. Hall, Rafael Neto Henriques,
Xavier Hinaut, Kamil S. Jaron, Mehdi Khamassi, Almar Klein, Tiina Manninen,
Pietro Marchesi, Dan McGlinn, Christoph Metzner, Owen L. Petchey, Hans Ekkehard Plesser, Timoth´ee Poisot, Karthik Ram, Yoav Ram, Etienne B. Roesch, Cyrille
Rossant, Vahid Rostami, Aaron Shifman, Joseph Stachelek, Marcel Stimberg, Frank
Stollmeier, Federico Vaggi, Guillaume Viejo, Julien Vitay, Anya E. Vostinar, Roman
Yurchak, and Tiziano Zito. Sustainable computational science: the rescience initiative.
CoRR, abs/1707.04393, 2017.
[259] Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063, 2018.
[260] Aurko Roy, Ashish Vaswani, Niki Parmar, and Arvind Neelakantan. Towards a better
understanding of vector quantized autoencoders. 2018.
[261] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James
Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive
neural networks. CoRR, abs/1606.04671, 2016.
120
[262] Manas Sambare. Fer-2013: Learn facial expressions from an image. https://www.
kaggle.com/datasets/msambare/fer2013. Accessed: 2022-11-10.
[263] san bt (username). Satellite images to predict poverty: Images taken over the region
of africa for research purpose. https://www.kaggle.com/datasets/sandeshbhat/
satellite-images-to-predict-povertyafrica. Accessed: 2022-11-10.
[264] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka GrabskaBarwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress:
A scalable framework for continual learning. In International Conference on Machine
Learning, pages 4528–4537. PMLR, 2018.
[265] Joan Serr`a, D´ıdac Sur´ıs, Marius Miron, and Alexandros Karatzoglou. Overcoming
catastrophic forgetting with hard attention to the task. CoRR, abs/1801.01423, 2018.
[266] Sina Sheikholeslami, Moritz Meister, Tianze Wang, Amir H Payberah, Vladimir
Vlassov, and Jim Dowling. Autoablation: Automated parallel ablation studies for
deep learning. In Proceedings of the 1st Workshop on Machine Learning and Systems,
pages 55–61, 2021.
[267] Yujun Shi, Li Yuan, Yunpeng Chen, and Jiashi Feng. Continual learning via bit-level
information preserving. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 16674–16683, 2021.
[268] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you
need. Information Fusion, 81:84–90, 2022.
[269] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[270] Davinder Singh, Naman Jain, Pranjali Jain, Pratik Kayal, Sudhakar Kumawat, and
Nipun Batra. Plantdoc: A dataset for visual plant disease detection. In Proceedings
of the 7th ACM IKDD CoDS and 25th COMAD, CoDS COMAD 2020, page 249–253,
New York, NY, USA, 2020. Association for Computing Machinery.
[271] John Sipple. Interpretable, multidimensional, multimodal anomaly detection with
negative sampling for detection of device failure. In Hal Daum´e III and Aarti Singh,
editors, Proceedings of the 37th International Conference on Machine Learning, volume
119 of Proceedings of Machine Learning Research, pages 9016–9025. PMLR, 13–18 Jul
2020.
[272] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay
the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
[273] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning,
Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA,
October 2013. Association for Computational Linguistics.
121
[274] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking
machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332,
2012. Selected Papers from IJCNN 2011.
[275] Pierre Stock, Armand Joulin, R´emi Gribonval, Benjamin Graham, and Herv´e J´egou.
And the bit goes down: Revisiting the quantization of neural networks. arXiv preprint
arXiv:1907.05686, 2019.
[276] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune bert for text
classification? In China national conference on Chinese computational linguistics,
pages 194–206. Springer, 2019.
[277] Shengyang Sun, Daniele Calandriello, Huiyi Hu, Ang Li, and Michalis Titsias.
Information-theoretic online memory selection for continual learning. In International
Conference on Learning Representations, 2022.
[278] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna.
Rethinking the inception architecture for computer vision. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[279] Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. Csi: Novelty detection
via contrastive learning on distributionally shifted instances. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information
Processing Systems, volume 33, pages 11839–11852. Curran Associates, Inc., 2020.
[280] Wei Ren Tan, Chee Seng Chan, Hernan Aguirre, and Kiyoshi Tanaka. Improved artgan
for conditional synthesis of natural image and artwork. IEEE Transactions on Image
Processing, 28(1):394–409, 2019.
[281] MTCAJ Thomas and A Thomas Joy. Elements of information theory. WileyInterscience, 2006.
[282] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck
principle. In 2015 ieee information theory workshop (itw), pages 1–5. IEEE, 2015.
[283] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin ElNouby, Edouard Grave, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, and Herv´e
J´egou. Resmlp: Feedforward networks for image classification with data-efficient training. CoRR, abs/2105.03404, 2021.
[284] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe
Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy, 7 2019.
Association for Computational Linguistics.
[285] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large
collection of multi-source dermatoscopic images of common pigmented skin lesions.
Scientific data, 5(1):1–9, 2018.
122
[286] Oguzhan Ulucan, Diclehan Karakaya, and Mehmet Turkan. A large-scale dataset for
fish segmentation and classification. In 2020 Innovations in Intelligent Systems and
Applications Conference (ASYU), pages 1–5. IEEE, 2020.
[287] Deep Contractor (username). IS THAT SANTA? (image classification):
Santa Claus classification. https://www.kaggle.com/datasets/deepcontractor/
is-that-santa-image-classification. Accessed: 2022-11-10.
[288] Gerry (username). 100 sports image classification. https://www.kaggle.com/
datasets/gpiosenka/sports-classification. Accessed: 2022-11-10.
[289] Manish KC (username). The kvasir-capsule dataset. https://www.kaggle.com/
datasets/manishkc06/the-kvasircapsule-dataset. Accessed: 2022-11-10.
[290] RobinReni (username). House rooms image dataset. https://www.kaggle.com/
datasets/robinreni/house-rooms-image-dataset. Accessed: 2022-11-10.
[291] YoucefATTALLAH97 (username). Minerals identification & classification: Minet v2. https://www.kaggle.com/datasets/youcefattallah97/
minerals-identification-classification. Accessed: 2022-11-10.
[292] Gido M. van de Ven and Andreas S. Tolias. Three scenarios for continual learning.
CoRR, abs/1904.07734, 2019.
[293] Gido M. van de Ven, Tinne Tuytelaars, and Andreas S. Tolias. Three types of incremental learning. Nature Machine Intelligence, 4(12):1185–1197, Dec 2022.
[294] Jan N Van Rijn and Frank Hutter. Hyperparameter importance across datasets. In
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2367–2376, 2018.
[295] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
neural information processing systems, 30, 2017.
[296] Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling.
Rotation equivariant CNNs for digital pathology. In Alejandro F. Frangi, Julia A.
Schnabel, Christos Davatzikos, Carlos Alberola-L´opez, and Gabor Fichtinger, editors,
Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, pages
210–218, Cham, 2018. Springer International Publishing.
[297] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
5018–5027, 2017.
[298] Visipedia. inaturalist 2021 competition: Fgvc8 workshop at cvpr. https://github.
com/visipedia/inat_comp/tree/master/2021. Accessed: 2022-10-30.
123
[299] Vladimir Vovk and Ruodu Wang. E-values: Calibration, combination and applications.
The Annals of Statistics, 49(3):1736–1754, 2021.
[300] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD
Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of
Technology, 2011.
[301] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R.
Bowman. GLUE: A multi-task benchmark and analysis platform for natural language
understanding. 2019. In the Proceedings of ICLR.
[302] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of
continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487,
2023.
[303] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
[304] Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe
Morency. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. Proceedings of the AAAI Conference on Artificial Intelligence,
33(01):7216–7223, Jul. 2019.
[305] Ruud Wetzels, Raoul PPP Grasman, and Eric-Jan Wagenmakers. An encompassing
prior generalization of the savage–dickey density ratio. Computational Statistics &
Data Analysis, 54(9):2094–2102, 2010.
[306] Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems, 33:4697–
4708, 2020.
[307] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht.
The marginal value of adaptive gradient methods in machine learning. Advances in
neural information processing systems, 30, 2017.
[308] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,
Anthony Moi, Pierric Cistac, Tim Rault, R´emi Louf, Morgan Funtowicz, Joe Davison,
Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,
Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M.
Rush. Transformers: State-of-the-art natural language processing. In Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing: System
Demonstrations, pages 38–45, Online, October 2020. Association for Computational
Linguistics.
[309] Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, and Ali Farhadi. Supermasks in superposition. CoRR,
abs/2006.14769, 2020.
124
[310] Hanwei Wu and Markus Flierl. Variational information bottleneck on vector quantized
autoencoders. arXiv preprint arXiv:1808.01048, 2018.
[311] Hanwei Wu and Markus Flierl. Learning product codebooks using vector-quantized
autoencoders for image retrieval. In 2019 IEEE Global Conference on Signal and
Information Processing (GlobalSIP), pages 1–5. IEEE, 2019.
[312] Xiaoping Wu, Chi Zhan, Yu-Kun Lai, Ming-Ming Cheng, and Jufeng Yang. Ip102:
A large-scale benchmark dataset for insect pest recognition. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June
2019.
[313] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video
generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
[314] Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-ofdistribution detection: A survey. CoRR, abs/2110.11334, 2021.
[315] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. From facial parts responses
to face detection: A deep learning approach. CoRR, abs/1509.06451, 2015.
[316] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use
classification. In Proceedings of the 18th SIGSPATIAL international conference on
advances in geographic information systems, pages 270–279, 2010.
[317] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and
Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding.
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran
Associates, Inc., 2019.
[318] Xu Yao, Alasdair Newson, Yann Gousseau, and Pierre Hellier. A latent transformer
for disentangled face editing in images and videos. In Proceedings of the IEEE/CVF
international conference on computer vision, pages 13789–13798, 2021.
[319] Jaehong Yoon, Wonyong Jeong, Giwoong Lee, Eunho Yang, and Sung Ju Hwang.
Federated continual learning with adaptive parameter communication. CoRR,
abs/2003.03196, 2020.
[320] Shujian Yu and Jose C Principe. Understanding autoencoders with information theoretic concepts. Neural Networks, 117:104–123, 2019.
[321] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe
Morency. Tensor fusion network for multimodal sentiment analysis. In Proceedings
of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
[322] Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and
Louis-Philippe Morency. Memory fusion network for multi-view sequential learning.
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
125
[323] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and
Amr Ahmed. Big bird: Transformers for longer sequences. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information
Processing Systems, volume 33, pages 17283–17297. Curran Associates, Inc., 2020.
[324] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al.
Big bird: Transformers for longer sequences. In NeurIPS, 2020.
[325] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995.
PMLR, 2017.
[326] Chen Zeno, Itay Golan, Elad Hoffer, and Daniel Soudry. Bayesian gradient descent:
Online variational bayes learning with increased robustness to catastrophic forgetting
and weight pruning. arXiv preprint arXiv:1803.10123, 2018.
[327] Chen Zeno, Itay Golan, Elad Hoffer, and Daniel Soudry. Task agnostic continual
learning using online variational bayes. arXiv preprint arXiv:1803.10123, 2018.
[328] Chen Zeno, Itay Golan, Elad Hoffer, and Daniel Soudry. Task-Agnostic Continual
Learning Using Online Variational Bayes With Fixed-Point Updates. Neural Computation, 33(11):3139–3177, 10 2021.
[329] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and
Michael Jordan. Theoretically principled trade-off between robustness and accuracy.
In International conference on machine learning, pages 7472–7482. PMLR, 2019.
[330] Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry Heck,
Heming Zhang, and C-C Jay Kuo. Class-incremental learning via deep model consolidation. In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, pages 1131–1140, 2020.
[331] Lance Zhang. 7000 hand-cropped and labeled Pokemon images for classification.
https://www.kaggle.com/datasets/lantian773030/pokemonclassification. Accessed: 2022-11-10.
[332] Yifei Zhang, D´esir´e Sidib´e, Olivier Morel, and Fabrice M´eriaudeau. Deep multimodal
fusion for semantic image segmentation: A survey. Image and Vision Computing,
105:104042, 2021.
[333] Yu-Dong Zhang, Zhengchao Dong, Shui-Hua Wang, Xiang Yu, Xujing Yao, Qinghua
Zhou, Hua Hu, Min Li, Carmen Jim´enez-Mesa, Javier Ramirez, et al. Advances in
multimodal data fusion in neuroimaging: overview, challenges, and novel orientation.
Information Fusion, 64:149–187, 2020.
126
Abstract
Current practices in Machine Learning (ML) require a model to be iteratively trained on novel examples, different modalities, and tasks. The same model generalizes poorly on previously learned data, where we empirically observe 'Catastrophic Forgetting'. Traditionally, Generalization refers to the performance of an ML model on an out-of-distribution dataset. In this work, we use a broader definition of Generalization to study the performance of a learner in an 'out-of-distribution' learning setting. First, we present our work that analyzes how the learned representations of an ML model generalize to downstream tasks, Transfer Learning. Next, we present our work that examines how a model architecture generalizes across different learning configurations, Model Comparison. Last, we present our work that analyzes how a learning algorithm generalizes to mitigate forgetting, Continual Learning. Our work explores these three avenues of generalization to identify open issues in learning and evaluating a model. First, when model comparison is performed between learning settings (such as between Optimizers), the performance of the model can exhibit heteroscedasticity that leads to improper analysis. Next, Continual Learning algorithms fail in more complex settings where simpler methods remain robust. Last, we find that the representations of a learned model do not generalize across modalities and domain gaps. We present our contributions and analysis on the issue of Generalization. Based on our empirical observations, we discuss several future directions where improvements in algorithms, tools, and methods are required to improve generalization.