Learning at the Local Level
by
Neal Gregory Lawton
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2024
Copyright 2025 Neal Gregory Lawton
Dedicated to my family,
my Mom, Olivia Melgoza,
my Dad, Chris Lawton,
my brother, Thomas Ryan Lawton, and
my sister, Dana Kathleen Cummins.
Acknowledgements
I would like to thank first and foremost my wonderful advisors, Aram Galstyan and Greg Ver
Steeg, for their guidance and mentorship over the many years and for creating such an open and
collaborative research environment for me and my labmates. Aram has always provided an invaluable steady hand to keep me grounded and on track whenever I found myself lost in the maze that
academic research often can be. Greg has an incredible ability to listen intently to my ramblings of
research ideas and make the connections for which I only had a vague idea but could not express
in words. I am deeply grateful for both of your persistent dedication, patience, and support, which
has made me the researcher I am today.
I would like to thank all my labmates at the Information Sciences Institute, especially Kyle
Reing, David Kale, Daniel Moyer, Shuyang Gao, Sahil Garg, Rob Brekelmans, Hrayr Harutyunyan, Serban Stan, Umang Gupta, Shushan Arakelyan, Palash Goyal, Myrl Marmarelis, Sami
Abu-El-Haija, Ninareh Mehrabi, Sarik Ghazarian, Mehrnoosh Sadat Mirtaheri Feijani, and Elan
Markowitz, for all the amazing and insightful discussions, presentations, and whiteboard sessions
across the years. I would also like to thank all the research leadership at ISI, especially Emilio
Ferrara and Kristina Lerman, for helping to create such an open and collaborative research environment at ISI for me and my peers. I would like to express my special appreciation for all the
administrative staff at ISI, especially Kary Lau and Peter Zamar, as well as the supportive staff at
USC, especially Lizsl De Leon, for all the behind-the-scenes work they do to support me and my
peers. I would also like to thank all the wonderfully friendly people I have met at ISI, especially
Gleb Satyukov, Fred Morstatter, Daniel Benjamin, Goran Muric, Luca Luceri, Daniel Garijo, and
Stephen Rawls, for making ISI such a friendly work environment.
I would like to thank all my collaborators for their insights and mentorship over the years,
especially Ke-Thia Yao and Federico Spedalieri at ISI; Govind Thattai, Jack FitzGerald, Judith
Gaspers, and Aishwarya Padmakumar at Amazon; and Anoop Kumar, Alfy Samuel, Daben Liu,
Xujun Peng, Chenyang Zhu, and Aditya Shrivastava at Capital One.
I would like to thank all the professors over the years that I have worked with as a teaching
assistant, especially David Kempe, Aaron Cote, Mark Redekopp, Michael Shindler, Aleksandra
Korolova, Sandra Batista, and Shahriar Shamsian, for helping to make teaching such a rewarding
part of my time at USC.
Lastly, I would like to thank my Mom, my Dad, my brother Ryan, and my sister Dana, for their
endless love and support.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Overview and Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Neural Architecture Search . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.4 Retrieval Augmented Generation . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: A Forest Mixture Bound for Block-Free Parallel Inference . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Forest Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 The Forest Mixture Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Connection with FMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.1 Auxiliary Parameter Updates . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.2 Variational Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.9 Appendix A: Auxiliary Parameter Updates . . . . . . . . . . . . . . . . . . . . . . 23
2.10 Appendix B: Variational Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.11 Appendix C: Extension to Deep Models . . . . . . . . . . . . . . . . . . . . . . . 27
Chapter 3: Deep Residual Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Notation and Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Residual Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1 The Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Affine layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Activation layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.4 Other layer types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Choosing partitioning variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.9 Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.9.1 Bounding ℓ = L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.9.2 Bounding ℓ = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.9.3 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.9.4 Batch Normalization Layers . . . . . . . . . . . . . . . . . . . . . . . . . 50
Chapter 4: Learning Morphisms with Gauss-Newton Approximation for Growing Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Morphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 Gauss-Newton Approximation . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6 Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6.1 Algorithm Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6.2 Gauss-Newton Approximation Accuracy . . . . . . . . . . . . . . . . . . 67
4.6.3 Learned Morphism Quality . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6.4 Gauss-Newton Approximation . . . . . . . . . . . . . . . . . . . . . . . . 71
Chapter 5: Neural Architecture Search for Parameter-Efficient Fine-tuning of Large
Pre-trained Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.1 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.2 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.1 Comparing to Full Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4.2 Very Small PETs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4.3 Interpreting Learned Architectures . . . . . . . . . . . . . . . . . . . . . . 82
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.6.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.6.2 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 83
Chapter 6: QuAILoRA: Quantization-Aware Initialization for LoRA . . . . . . . . . . . 85
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3.1 Background and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3.2 Calibrated quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.3.3 Uncalibrated quantization . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.3.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.1 Choice of calibration set . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4.2 Perplexity after fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4.3 Performance on downstream tasks . . . . . . . . . . . . . . . . . . . . . . 95
6.4.4 Effect of LoRA rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4.5 Convergence of fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.6 Appendix A: Full Results Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 7: RAG Joint Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.3.1 Fine-tuning the Embedding Model . . . . . . . . . . . . . . . . . . . . . . 104
7.3.2 Fine-tuning the Generator Model . . . . . . . . . . . . . . . . . . . . . . . 105
7.3.3 Joint Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Chapter 8: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Abstract
In this dissertation, I present a perspective of machine learning that views feature learning as the
fundamental strategy by which deep machine learning models learn to solve complex problems:
when trained to perform one specific task, deep machine learning models tend to learn generalizable features that are useful for solving many different tasks. In this way, deep machine learning
models learn at a local level by automatically breaking down complex problems into simple relevant subproblems.
I then present a diverse collection of works that put this perspective into action to design better machine learning algorithms. These works include efficient optimization algorithms, including an algorithm for block-free parallel inference in exponential families (Chapter 2) and a novel
second-order algorithm for training neural networks (Chapter 3); algorithms for efficient neural architecture search (NAS), including a morphism-based NAS algorithm for growing neural networks
(Chapter 4) and a pruning-based NAS algorithm for finding more parameter-efficient PEFT architectures (Chapter 5); and algorithms for efficient fine-tuning of large language models, including
an algorithm for increasing the performance of fine-tuning quantized models (Chapter 6) and a
joint fine-tuning algorithm for retrieval augmented generation (RAG) pipelines (Chapter 7).
Chapter 1
Introduction
A human asks a robot to buy a pint of milk from the corner market. The robot heads out the
door and, after some time, returns with the milk in hand and says, “I did what you asked: first,
I coordinated the motors and actuators that control my limbs to shift my balance onto one leg,
propelled myself forward with my back leg, swung my arms to maintain my balance, then landed
with my front foot forward on the ground. I did this repeatedly while I planned a quick and safe
route to the store, distinguished static objects from pedestrians and vehicles that might cross my
path, and identified landmarks to reorient my position in case I got lost. When I arrived at the
market, I read and followed the signs in the store to find the refrigerated aisle, picked out your
desired brand, and waited patiently in line at the checkout before making payment and heading
home.” The human replies, “That’s great, but I only asked you to buy the milk!”
This story illustrates how modern machine learning works in practice. We train a deep model
to perform a high-level task, like fetch a pint of milk, or classify an image, or predict the next word
in a sentence, and when the model has finished training, we find that along the way, the model
has automatically learned how to perform all sorts of subtasks without us ever telling it to. We
can investigate what subtasks a trained model has learned to perform by looking at its learned
features, which are determined by the outputs of intermediate model layers. When we visualize
the features learned by a convolutional neural net trained to classify images of objects, we find
that the model has learned to recognize object parts, like dog noses and bicycle wheels; textures
and patterns, like polka-dots and squiggly lines; and edge detectors, with various orientations
and across various color gradients [Olah et al., 2017]. When we visualize the features learned
by a transformer trained to perform language modeling, we find that the model has learned to
extract syntactic and semantic information at various levels of abstraction. This includes topic-level
information, such as whether a paragraph describes an historical event, a person’s biographical
information, or pop culture news; grammatical information, such as whether a sentence contains
verbs in the past-tense or uses apostrophes to convey possession; and word-level information, such
as disambiguations of the word “light”, or synonyms of the word “vegetation” [Yun et al., 2021].
To a human it makes sense when faced with a new complex problem to break it down into
easier, simpler subproblems. But why do deep models that are trained to perform one specific
task tend to learn generalizable features that are also useful for performing many different tasks?
The information bottleneck method [Tishby et al., 2000, Shwartz-Ziv and Tishby, 2017] provides
some insight into the mathematical dynamics of how this happens. Learning appears to happen
in two phases. The first phase is a memorization phase, in which the deep layers of the model
learn to retain information from the input that is useful for the training task. This is followed by
a compression phase, in which the model learns to ignore information from the input that is not
useful for the training task. In this way, deep models automatically learn that a good strategy for
solving a supervised learning problem, where the task is to produce a target output for a given input,
is to first solve a related unsupervised learning problem, where the task is to find a compressed,
generalizable representation of the input data.
It has also been observed that the features learned by a model tend to be sparse. In particular, the
first-layer features learned by a convolutional neural net trained for image classification strongly
resemble those learned by sparse coding [He et al., Goodfellow et al., 2012, Coates and Ng, 2012].
Feature sparsity is often further encouraged explicitly by training with Dropout [Wager et al.,
2013, Wan et al., 2013]. When learned features are sparse in this way, they form a redundant
representation of the input data that is robust to noise. These sparse representations of data also
naturally describe a decomposition of the concepts learned by the model into subconcepts, e.g.,
image textures can be decomposed as patterns of edges, curves, and color gradients.
1.1 Perspective
With this understanding of what features a neural network learns and how those features are learned
during training, we propose a “big picture” perspective of modern machine learning that serves as
a useful guide when developing new methods for machine learning: learning happens at the local
level, and features are fundamental units of learning. By “learning happens at the local level”
we mean that deep models solve complex learning problems via a feature learning strategy that
automatically breaks the problem down into a hierarchy of increasingly simple learning
problems, each of which is solved in turn by finding patterns in simpler learned features. For
example, an image classification model learns to represent a bicycle as a pattern of wheels, spokes,
pedals, and handlebars, each of which the model learns to represent as a pattern of lines and
curves. By “features are fundamental units of learning” we mean that feature learning is observed
empirically to be the fundamental strategy by which deep machine learning models solve complex
problems, and so when we are thinking of developing new machine learning methods, an important
question to ask ourselves is, “How does this affect the model’s feature learning strategy?”
1.2 Overview and Organization
In this thesis, we present a series of works inspired by this perspective that propose new machine learning methods or improve existing ones. Here, we provide an overview of the problem domains
addressed in those chapters while drawing a connection back to the larger “big picture” perspective
presented in this introduction.
1.2.1 Optimization
Modern neural networks are trained by minimizing a training loss function with an adaptive learning rate variant of stochastic gradient descent like Adam. These algorithms adapt the learning rate
of each model parameter using exponential moving averages of the gradients and squared gradients
to account for the curvature of the loss landscape.
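To make the adaptive-learning-rate idea concrete, the following is a minimal sketch of an Adam-style update in NumPy; it is an illustration of the general mechanism described above (moving averages of gradients and squared gradients), not the exact update rule used anywhere in this thesis, and the function and parameter names are chosen only for this example.

```python
import numpy as np

def adaptive_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step: scale the update by moving averages of g and g**2."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # moving average of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # moving average of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # take smaller steps along directions where v_hat (a curvature proxy) is large
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# usage: state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}
```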
In Chapter 2, we introduce a method for inference in a simpler but related family of models
compared to neural networks: exponential families. This method bounds the variational inference
objective for an exponential family model by the variational inference objective for a simpler family of models: forest mixture models. The method introduces “auxiliary parameters” that explicitly
attribute the value of each observed variable x to a hidden parent variable y. In this way, the method
decomposes the “feature” y into observed “sub-features” x for each data point. We then use this
decomposition to design a parallel, block-free algorithm for accelerating inference in the original
exponential family model.
In Chapter 3, we extend the concepts presented in Chapter 2 to deep neural networks. This
generalization introduces “residual partitioning parameters” to explicitly attribute the loss observed
for a given training sample to the features in each layer of the neural network. We then use these
residual partitioning parameters as part of an optimization algorithm that adapts the learning rate
for the parameters of each feature according to how much the training loss can be attributed to that
feature. We demonstrate that models trained with this algorithm can achieve greater generalization
performance than those trained with Adam.
Connection to Big Picture Perspective: Deep models learn by breaking down complex problems into simpler subproblems. We can imagine that when a model fails to correctly perform a
high-level task, it is because the model failed to correctly solve one of the related subproblems.
For example, if a robot fails to bring back milk from the market, it may be because it tripped on
a curb and so needs to learn more about walking, or got lost in a dead-end alley and so needs to
learn more about route planning. The idea behind the methods presented in these chapters is that
we can use the model gradients to attribute the loss observed for the high-level problem to each
feature-level subproblem. This explicitly breaks the high-level learning objective into many independent feature-level learning objectives that can be solved in parallel. Mathematically, we bound
the high-level learning objective by a sum of independent learning objectives, one for each feature.
Although this bound may not be tight, it can accelerate optimization by better conditioning the
optimization problem.
1.2.2 Neural Architecture Search
In Chapter 4, we present a method for growing the width of neural networks via network morphisms. This method learns and evaluates network morphisms with a Gauss-Newton approximation of the training loss as a function of the feature’s output. We use this method as part of a
feature-centric neural architecture search algorithm that automatically adapts the width of each
layer in a neural network to optimize the trade-off between model performance and model size.
In Chapter 5, we present a neural architecture search algorithm for parameter-efficient fine-tuning (PEFT). We use a feature pruning criterion similar to the one used in Chapter 4 to prune
large, complex PEFT architectures with the goal of automatically discovering more parameter-efficient allocations of PEFT parameters to the different modules and layers of a transformer. We
find that the middle layers of the transformer are most parameter-efficient to fine-tune, and that we
can achieve close to the performance of full fine-tuning by fine-tuning only 20% of the model’s
features’ bias parameters, equivalent to 0.01% of all model parameters.
Connection to Big Picture Perspective: Human-designed neural network architectures contain a manually assigned number of neurons in each neural network layer. Since each neuron in a
deep model corresponds to a single learned feature, this means that the number of concepts that the
network can learn at a particular layer of abstraction is predetermined. However, it is not easy for a
human to determine how many features are needed to effectively express the range of concepts that
might exist at any particular layer of abstraction. Therefore, an automated approach seems more
appropriate. An automated search that balances the tradeoff between model size and performance
can also give us insight into how pattern complexity changes at different layer depths (layers that
need lots of neurons are probably using them to learn complex patterns). The neural architecture
search algorithms that we design in Chapters 4 and 5 work by directly manipulating the neural
network’s learned features with network morphisms.
1.2.3 Quantization
In Chapter 6, we present a method for improving the performance of QLoRA, a memory-efficient
method for fine-tuning models by learning high-precision LoRA updates to a low-precision quantized base model. Our improvement finds a good initialization for the LoRA matrices that reduces
quantization error at initialization, resulting in an increase in model performance after fine-tuning.
We find such a LoRA initialization by optimizing a calibrated quantization objective, which aims
to reduce the difference between the quantized and full-precision models’ activations, rather than
only their weights. We demonstrate that initializing this way yields about 86% of the increase in
average downstream task performance obtained by doubling the quantization precision, but without increasing the memory cost of fine-tuning.
Connection to Big Picture Perspective: The method presented in this chapter is based on
calibrated quantization. Rather than quantizing the model weights with simple nearest-neighbor
rounding, calibrated quantization optimizes a calibrated quantization objective to find a quantization of the model whose activations are close to those of the full-precision model. More simply,
calibrated quantization aims to preserve the model’s activations, rather than its weights. Mathematically, calibrated quantization tries to better distribute the parameter quantization errors in such
a way that they cancel out on average due to correlations between the model features. From a high
level, the effectiveness of calibrated quantization demonstrates the utility of taking a feature-centric
approach to the problem of model compression.
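To illustrate the distinction between the two objectives described above, here is a minimal sketch for a single linear layer; the function names, the Frobenius-norm formulation, and the single-layer setup are illustrative assumptions and are not the exact objective used in Chapter 6.

```python
import numpy as np

def calibration_error(W, Q, X):
    """Calibrated objective: compare layer activations on calibration inputs X."""
    return np.linalg.norm(X @ W.T - X @ Q.T) ** 2   # activation mismatch (Frobenius norm squared)

def weight_error(W, Q):
    """Naive objective: nearest-neighbor rounding effectively minimizes this instead."""
    return np.linalg.norm(W - Q) ** 2
```

Because the activation error depends on the empirical statistics of the calibration inputs, quantization errors on correlated input features can partially cancel in the layer outputs, which is the effect described in the paragraph above.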
1.2.4 Retrieval Augmented Generation
In Chapter 7, we present a method for jointly fine-tuning the embedding and generator models of
a retrieval augmented generation (RAG) pipeline. Joint fine-tuning is challenging due to the non-differentiability of the concatenation operation that joins the context documents retrieved by the
embedding model before passing them to the generator model. To address this issue, we explicitly
construct a differentiable objective for the retrieval task that rewards the embedding model for
retrieving documents that actually improve the end-to-end performance of the RAG pipeline. We
empirically evaluate our joint fine-tuning method and find it can provide a significant performance
boost over independently fine-tuning the embedding and generator models.
Connection to Big Picture Perspective: Deep neural networks are simple to train with SGD because they are differentiable end-to-end.
However, machine learning pipelines like RAG that build a solution to a complex problem by solving a series of discrete subproblems are often not differentiable end-to-end. In such cases, we can
still try to break down the high-level problem to write an explicit objective for each subproblem
so that optimizing the objective for each subproblem improves the end-to-end performance of the
larger pipeline. We do this for RAG by writing an explicit objective for the retrieval system that
rewards the embedding model for retrieving context documents that actually improve the performance of the larger RAG pipeline.
Chapter 2
A Forest Mixture Bound for Block-Free Parallel Inference
Coordinate ascent variational inference is an important algorithm for inference in probabilistic
models, but it is slow because it updates only a single variable at a time. Block coordinate methods perform inference faster by updating blocks of variables in parallel. However, the speed and
convergence of these algorithms depends on how the variables are partitioned into blocks. In this
chapter, we give a convergent parallel algorithm for inference in deep exponential families that
does not require the variables to be partitioned into blocks. We achieve this by lower bounding the
ELBO by a new objective we call the forest mixture bound (FM bound) that separates the inference problem for variables within a hidden layer. We apply this to the simple case when all random
variables are Gaussian and show empirically that the algorithm converges faster for models that are
inherently more forest-like.
2.1 Introduction
Inference in directed models like deep exponential families (DEF's) [Ranganath et al., 2015] is complicated by the "explaining away effect": for a directed model with observed variables $x \in \mathbb{R}^n$ and latent variables $y \in \mathbb{R}^m$, independent "causes" $y_j$ become dependent given an observed "effect" $x_i$. To handle this, the coordinate ascent variational inference (CAVI) algorithm iteratively updates the variational distribution for a single latent variable $y_j$ while holding the variational distribution for all other latent variables fixed [Blei et al., 2017].
Though the $y_j$'s are not conditionally independent given $x$ except in exceedingly simple models, in many cases the $y_j$'s are nearly conditionally independent. Is there a way to perform parallel inference in such models, or do we have to resort to the serial coordinate algorithm?
Block methods provide one avenue for parallel inference. These algorithms work by first partitioning the latent variables into a collection of blocks, and then iteratively updating a variable
from each block in parallel. However, the speed (as in MCMC methods [Terenin et al., 2015]) or
convergence (as in Hogwild methods [Recht et al., 2011]) of the resulting algorithm will depend
on how the variables are blocked, and finding a good choice of blocking for an arbitrary model can
be difficult.
The main contribution of this chapter is a novel lower bound on log-likelihood we call the forest
mixture bound (FM bound) that separates the problem of inference for each variable in a hidden
layer. This allows all the variables in a layer to be updated in parallel, without the use of blocks.
We call the resulting parallel inference algorithm the forest mixture algorithm (FM algorithm).
We study in detail the case when all the random variables in the DEF are Gaussian. We then
demonstrate on both synthetic and real-world data that the proposed method achieves faster convergence compared to existing methods.
2.2 Related Work
Hogwild Block Methods There are two types of block methods for inference. The first is
Hogwild-type algorithms [Recht et al., 2011, Sa et al., 2016, Wang and Banerjee, 2014, Zhao
et al., 2014]. After partitioning the variables into blocks, these algorithms iteratively choose a
single variable from each block and update as in CAVI, but in parallel [Sa et al., 2016]. These algorithms are guaranteed to converge only in certain cases, e.g., when the blocks are conditionally
independent [Johnson et al., 2013].
Convergent Block Methods Instead of making CAVI updates in parallel, block algorithms may
achieve convergence by making small parallel updates [Sontag and Jaakkola, 2009]. For example,
“exact” asynchronous Gibbs sampling randomly rejects each block update according to an MCMC
rejection ratio [Terenin et al., 2015]. If the blocks are chosen poorly, the rejection rate will increase
and the rate of convergence will decrease [Singh et al., 2017].
In either type of block method, the performance of the algorithm depends on how the variables
are blocked. In a distributed computation setting, blocking is necessary since each worker can only
store a fraction of all variables in local memory. In this case, the FM bound provides a method for
updating variables within a block or worker in parallel, instead of updating only a single variable
in each block at a time.
Amortized Inference Instead of treating inference as an inverse problem that has to be solved
for each observation, VAE’s train an inference network (encoder) so the cost of inference is amortized over many observations [Kingma and Welling, 2013]. Once the encoder is trained, inference
for any observation can be performed quickly with a single pass through the inference network.
Encoder-free methods like ours may still be useful in the case when we have a trained generative
model (decoder) but no trained encoder and want to perform inference for only a few samples or,
more likely, for when we want to improve the solution produced by the encoder at test time.
Undirected Models Besides directed models, there is a wide literature for fast inference in undirected models [Baqué et al., 2016, Singh et al., 2010]. Note that inference in undirected models
like Deep Restricted Boltzmann Machines [Salakhutdinov and Hinton, 2009] can already be parallelized: non-consecutive layers can be updated in parallel in red-black fashion. In fact, the same
degree of parallelization can be achieved in a directed model using our technique. While there is
also a wide literature on bounding the log-partition function of an undirected model [Wainwright
et al., 2005], we derive the FM bound by lower bounding the log-partition function of a directed
model. The technique we use may be applicable to undirected models, but that is not explored in
this chapter.
Structure Learning The FM bound we derive is closely related to an interesting family of models called forest mixture models. These models may be applicable to the problem of structure learning, where the task is to infer the graphical structure of the underlying model from data [Chow and
Liu, 1968]. However, in this chapter we narrowly focus on the problem of inference in a given
generative model, not on training a new one.
2.3 Preliminaries
2.3.1 Notation
Vector-valued variables are written in bold. The component-wise product of two vectors u and v is
denoted u⊙v. Unless stated otherwise, all expectations, including the variance Var[·], standard deviation Std[·], and conditional entropy H(y|x), are taken with respect to the variational distribution
q(y|x), though we sometimes write this explicitly for emphasis.
An exponential family of distributions is a family of distributions of the form
$$p(x) = \exp\{g(x) + t(x)\cdot\eta - a(\eta)\} \qquad (2.1)$$
where $g$ is the log-base measure, $t$ are the sufficient statistics, $\eta$ are the natural parameters, and $a$ is the log-partition function. When $\eta$ is a function of another random variable $y$, e.g., $\eta = b + w \cdot y$, we will sometimes write $\eta = \eta(y)$ for emphasis.
We denote the Gaussian probability density function with mean $\mu$ and variance $\sigma^2$ as $\mathcal{N}(\mu, \sigma^2)$. When we write $\log p(x) \propto f(x)$, we mean $\log p(x) = f(x) + \text{constant}$.
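As a concrete worked example (added here for reference), the Gaussian conditional with known variance $\sigma_x^2$ used later in this chapter can be written in this form:
$$\mathcal{N}(\eta, \sigma_x^2) = \exp\Big\{\underbrace{-\tfrac{x^2}{2\sigma_x^2} - \tfrac{1}{2}\log(2\pi\sigma_x^2)}_{g(x)} + \underbrace{\tfrac{x}{\sigma_x^2}}_{t(x)}\,\eta - \underbrace{\tfrac{\eta^2}{2\sigma_x^2}}_{a(\eta)}\Big\},$$
so that $a(\eta) = \eta^2/(2\sigma_x^2)$, which is the expression used in Section 2.5.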
[Figure 2.1: Visualization of sampling from a forest mixture model. (a) In a forest mixture model, the edges between $x$ and $y$ are unknown random variables. (b) To sample from the model, first the parent of each $x_i$ is chosen independently at random according to $p(e_i)$; in this visualization, each $p(e_i)$ is uniform over the latent variables. (c) After sampling a forest structure from $p(e)$, $x$ and $y$ are sampled according to the resulting forest model.]
2.3.2 Forest Mixture Models
Consider a general directed model with a single layer of observed variables $x \in \mathbb{R}^n$ and latent variables $y \in \mathbb{R}^m$. The joint distribution $p(x, y)$ takes the form
$$p(x, y) = \left[\prod_{j=1}^{m} p(y_j)\right]\left[\prod_{i=1}^{n} p(x_i \mid y)\right] \qquad (2.2)$$
A directed model is a forest model if each $x_i$ has exactly one parent in the model's directed dependency graph; they are so-named because the resulting graphical model is a forest with one tree per latent variable $y_j$. These models are particularly simple because the $y_j$'s are conditionally independent given $x$. Let $e_i \in \mathbb{I}^m$ be the one-hot vector indicating the parent of $x_i$, so $e_{ij} = 1$ if and only if $y_j$ is the parent of $x_i$. Then we can write
$$p(x_i \mid y) = \prod_{j=1}^{m} p(x_i \mid y_j)^{e_{ij}} \qquad (2.3)$$
Suppose we want to fit a forest model to data, but we do not know which $x_i$'s should be the children of which $y_j$'s. One way to handle this uncertainty is to treat the $e_i$'s as independent latent random variables that have to be inferred, just like $y$. To do this, we must first define a prior $p(e_i)$ for each $i$. Given such a prior, the joint distribution over $x$, $y$, and $e \equiv \{e_i\}_{i=1}^{n}$ is
$$p(x, y, e) = \left[\prod_{i=1}^{n} p(e_i)\right]\left[\prod_{j=1}^{m} p(y_j)\right]\left[\prod_{i=1}^{n} p(x_i \mid y, e_i)\right] \qquad (2.4)$$
The resulting model is a forest mixture model (FMM): to sample from this model, we first draw a random forest structure by sampling from the prior $p(e)$; then, $x$ and $y$ are sampled from the selected forest model.
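To make this sampling procedure concrete (mirroring Figure 2.1), here is a minimal NumPy sketch for the Gaussian case studied later in this chapter; the uniform prior over parents and the affine conditional mean $b_i + w_{ij}y_j$ are assumptions made only for this illustration.

```python
import numpy as np

def sample_fmm(W, b, sigma_y=1.0, sigma_x=1.0, seed=0):
    """Draw (x, y, parents) from a forest mixture model with Gaussian variables."""
    rng = np.random.default_rng(seed)
    n, m = W.shape
    # 1. sample a forest structure: each x_i picks one parent y_j (uniform prior p(e_i))
    parents = rng.integers(0, m, size=n)
    # 2. sample the latent variables y_j ~ N(0, sigma_y^2)
    y = rng.normal(0.0, sigma_y, size=m)
    # 3. sample each x_i from its chosen parent only: x_i ~ N(b_i + w_ij * y_j, sigma_x^2)
    x = rng.normal(b + W[np.arange(n), parents] * y[parents], sigma_x)
    return x, y, parents
```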
Though the $y_j$'s are no longer conditionally independent given $x$, they are independent given $x$ and $e$. Similarly, the $e_i$'s are conditionally independent given $x$ and $y$. To see this, define $\hat{p}(x_i \mid y_j) \equiv p(x_i \mid y_j, e_{ij} = 1)$. Then the joint distribution can be written
$$p(x, y, e) = \left[\prod_{i=1}^{n} p(e_i)\right]\left[\prod_{j=1}^{m} p(y_j)\right]\prod_{i=1}^{n}\prod_{j=1}^{m} \hat{p}(x_i \mid y_j)^{e_{ij}} \qquad (2.5)$$
In the next section, we will use the mean-field variational ELBO for this model, which for a given variational distribution $q(y, e \mid x)$ is
$$\log p(x) \ge \mathbb{E}[\log p(x \mid y, e)] - D_{KL}(q(y, e \mid x)\,\|\,p(y, e))$$
$$= \sum_{i=1}^{n}\sum_{j=1}^{m} \mathbb{E}[e_{ij}]\,\mathbb{E}[\log \hat{p}(x_i \mid y_j)] - \sum_{j=1}^{m} D_{KL}(q(y_j \mid x)\,\|\,p(y_j)) - \sum_{i=1}^{n} D_{KL}(q(e_i \mid x)\,\|\,p(e_i)) \qquad (2.6)$$
2.4 The Forest Mixture Bound
For simplicity, we only consider shallow models in this section. The extension to deep models is
straightforward (see Appendix C).
A single-layer deep exponential family (DEF) model is a directed model with a single layer of observed variables $x \in \mathbb{R}^n$ and hidden variables $y \in \mathbb{R}^m$, where the conditional distribution is in an exponential family. The joint distribution $p(x, y)$ takes the form
$$p(x, y) = \left[\prod_{j=1}^{m} p(y_j)\right]\left[\prod_{i=1}^{n} p(x_i \mid y)\right] \qquad (2.7)$$
$$p(x_i \mid y) = \exp\{g(x_i) + t(x_i)\,\eta_i(y) - a(\eta_i(y))\} \qquad (2.8)$$
Suppose we are given an observation $x$ and want to approximately infer the posterior $p(y \mid x)$ by maximizing the variational ELBO, and suppose the $y_j$'s are conditionally independent given $x$, so $p(x, y) = p(x)\prod_{j=1}^{m} p(y_j \mid x)$. Then the mean-field variational ELBO is
$$\log p(x) \ge \max_{q(y|x)}\; \mathbb{E}[\log p(x, y)] + H(y \mid x) \qquad (2.9)$$
$$\equiv \max_{q(y|x)}\; \sum_{j=1}^{m} \mathbb{E}[\log p(y_j \mid x)] + H(y_j \mid x) \qquad (2.10)$$
$$= \sum_{j=1}^{m} \max_{q(y_j|x)}\; \mathbb{E}[\log p(y_j \mid x)] + H(y_j \mid x) \qquad (2.11)$$
In the second line, $\log p(x)$ is constant with respect to $q(y \mid x)$ and can be removed without changing the optimization problem. In this case, the ELBO separates into a sum of terms, each of which involves only a single $y_j$. This allows us to optimize the ELBO by updating each $q(y_j \mid x)$ independently and in parallel.
In a general DEF, the $y_j$'s are not conditionally independent and the objective does not separate. However, without much manipulation, much of the ELBO does separate: for a single-layer DEF, the ELBO can be written
$$\log p(x) \ge \mathbb{E}[\log p(x, y)] + H(y \mid x) \qquad (2.12)$$
$$= \sum_{i=1}^{n} \mathbb{E}[\log p(x_i \mid y)] + \sum_{j=1}^{m} \mathbb{E}[\log p(y_j)] + H(y_j \mid x) \qquad (2.13)$$
So only the $\mathbb{E}[\log p(x_i \mid y)]$ terms are not separable. However, if $\eta_i$ is an affine function of $y$, so $\eta_i \equiv b_i + w_i \cdot y$ for some $b_i \in \mathbb{R}$ and $w_i \in \mathbb{R}^m$, then each $\mathbb{E}[\log p(x_i \mid y)]$ term can be expanded
$$\mathbb{E}[\log p(x_i \mid y)] = g(x_i) + t(x_i)\,\mathbb{E}[\eta_i] - \mathbb{E}[a(\eta_i)] \qquad (2.14)$$
$$= g(x_i) + t(x_i)\,(b_i + w_i \cdot \mathbb{E}[y]) - \mathbb{E}[a(b_i + w_i \cdot y)] \qquad (2.15)$$
From this we can see the only term left preventing the entire ELBO from separating is $\mathbb{E}_{q(y|x)}[-a(\eta_i(y))]$, a high-dimensional expectation of the non-linear log-partition function. The one thing we know about the log-partition function in exponential families is that it is convex. This suggests we use Jensen's inequality to bound $\mathbb{E}[-a(\eta_i)]$. Note that using Jensen's to bring the expectation over $q$ inside $a$ gives an inequality in the wrong direction because $-a(\eta_i)$ is concave; to get a lower bound, we need to pull an expectation out from the inside of $a$. The derivation of the ELBO gives a hint on how to do this: recall
$$\log p(x) = \log \int p(x, y)\,dy \qquad (2.16)$$
$$= \log \int \frac{q(y \mid x)}{q(y \mid x)}\, p(x, y)\,dy \qquad (2.17)$$
$$= \log \mathbb{E}_{q(y|x)}\left[\frac{p(x, y)}{q(y \mid x)}\right] \qquad (2.18)$$
$$\ge \mathbb{E}_{q(y|x)}\left[\log \frac{p(x, y)}{q(y \mid x)}\right] \qquad (2.19)$$
In the same way, we will introduce a variational or auxiliary distribution inside the concave function $-a(\eta)$, then use Jensen's to pull it out. For each $i$, introduce an auxiliary discrete distribution over $m$ categories $\varepsilon_i \in \Delta^{m-1}$, so
$$\sum_{j=1}^{m} \varepsilon_{ij} = 1, \qquad \varepsilon_{ij} \ge 0 \;\; \forall j \in [m] \qquad (2.20)$$
Injecting this inside the log-partition function gives
$$\mathbb{E}[-a(b_i + w_i \cdot y)] = \mathbb{E}\left[-a\!\left(b_i + \sum_{j=1}^{m} \varepsilon_{ij}\,\frac{w_{ij} y_j}{\varepsilon_{ij}}\right)\right] \qquad (2.21)$$
To use Jensen's inequality, we first need to bring $b_i$ inside the sum, which we can do using $b_i = \sum_{j=1}^{m} \varepsilon_{ij} b_i$. This partitions the bias $b_i$ into $m$ parts according to $\varepsilon_i$. However, to get a sufficiently tight bound, we'll need to consider more general splittings: introduce another set of auxiliary parameters $\hat{b}_i \in \mathbb{R}^m$ with the constraint $b_i = \sum_{j=1}^{m} \varepsilon_{ij}\hat{b}_{ij}$. Then
$$\mathbb{E}[-a(b_i + w_i \cdot y)] = \mathbb{E}\left[-a\!\left(\sum_{j=1}^{m} \varepsilon_{ij}\left(\hat{b}_{ij} + \frac{w_{ij} y_j}{\varepsilon_{ij}}\right)\right)\right] \ge \sum_{j=1}^{m} \varepsilon_{ij}\,\mathbb{E}\left[-a\!\left(\hat{b}_{ij} + \frac{w_{ij} y_j}{\varepsilon_{ij}}\right)\right] \qquad (2.22)$$
Bounding this term for each $i$ separates the entire ELBO into a sum of terms, each of which involves only a single $y_j$. Plugging this in directly to get a final bound on log-likelihood results in an unwieldy expression, so first we will introduce new notation to simplify the bound.
2.4.1 Connection with FMM
To demonstrate the relation of the above bound and forest mixture models, let us define
$$\hat{\eta}_{ij} \equiv \hat{b}_{ij} + \frac{w_{ij} y_j}{\varepsilon_{ij}} \qquad (2.23)$$
$$\hat{p}(x_i \mid y_j) \equiv \exp\{g(x_i) + t(x_i)\,\hat{\eta}_{ij} - a(\hat{\eta}_{ij})\} \qquad (2.24)$$
Then $\eta_i = \sum_{j=1}^{m} \varepsilon_{ij}\hat{\eta}_{ij}$ and the bound can be rewritten as follows:
$$\mathbb{E}[-a(\eta_i)] \ge \sum_{j=1}^{m} \varepsilon_{ij}\,\mathbb{E}[-a(\hat{\eta}_{ij})] \qquad (2.25)$$
This expression can be used to impose bounds on each $\mathbb{E}[\log p(x_i \mid y)]$:
$$\mathbb{E}[\log p(x_i \mid y)] = g(x_i) + t(x_i)\,\mathbb{E}[\eta_i] - \mathbb{E}[a(\eta_i)] \qquad (2.26)$$
$$\ge g(x_i) + t(x_i)\,\mathbb{E}[\eta_i] - \sum_{j=1}^{m} \varepsilon_{ij}\,\mathbb{E}[a(\hat{\eta}_{ij})] \qquad (2.27)$$
$$= \sum_{j=1}^{m} \varepsilon_{ij}\left(g(x_i) + t(x_i)\,\mathbb{E}[\hat{\eta}_{ij}] - \mathbb{E}[a(\hat{\eta}_{ij})]\right) \qquad (2.28)$$
$$= \sum_{j=1}^{m} \varepsilon_{ij}\,\mathbb{E}[\log \hat{p}(x_i \mid y_j)] \qquad (2.29)$$
Finally, plugging the above expression into the ELBO gives
$$\log p(x) \ge \mathbb{E}[\log p(x, y)] + H(y \mid x) \ge \sum_{i=1}^{n}\sum_{j=1}^{m} \varepsilon_{ij}\,\mathbb{E}[\log \hat{p}(x_i \mid y_j)] - \sum_{j=1}^{m} D_{KL}(q(y_j \mid x)\,\|\,p(y_j)) \qquad (2.30)$$
Comparing (2.30) with (2.6) confirms that this bound is identical to the ELBO of a forest mixture model with the same $\hat{p}(x_i \mid y_j)$ and $q(y_j \mid x)$, with $q(e_{ij} = 1 \mid x) = \varepsilon_{ij}$ (so that $\mathbb{E}[e_{ij}] = \varepsilon_{ij}$) and $p(e_i) = q(e_i \mid x)$ (so that the second KL term of the FMM ELBO is zero and disappears entirely). For this reason, we call this bound the forest mixture bound (FM bound). Note this bounds the DEF ELBO by the ELBO of each FMM in a large family of FMM's parameterized by $\varepsilon \equiv \{\varepsilon_i\}_{i=1}^{n}$ and $\hat{b} \equiv \{\hat{b}_i\}_{i=1}^{n}$.
2.5 Algorithm
To optimize the FM bound, we propose an alternating maximization algorithm: in the first step, update all $q(y_j \mid x)$ in parallel while holding all $\varepsilon_{ij}$ and $\hat{b}_{ij}$ fixed; in the second step, update all $\varepsilon_{ij}$ and $\hat{b}_{ij}$ in parallel while holding all $q(y_j \mid x)$ fixed. In this section, we will derive the optimal updates for $q(y_j \mid x)$, $\varepsilon_{ij}$, and $\hat{b}_{ij}$ in the case when each $x_i$ and $y_j$ are Gaussian with known variance:
$$p(y_j) = \mathcal{N}(0, \sigma_y^2), \qquad p(x_i \mid y) = \mathcal{N}(\eta_i(y), \sigma_x^2) \qquad (2.31)$$
We will derive the updates for the auxiliary parameters first since this will help simplify the update for the variational distribution later.
2.5.1 Auxiliary Parameter Updates
Maximizing the FM bound over $\varepsilon$ and $\hat{b}$ is equivalent to maximizing $L_i \equiv \sum_{j=1}^{m} \varepsilon_{ij}\,\mathbb{E}[-a(\hat{\eta}_{ij})]$ over $\varepsilon_i$ and $\hat{b}_i$ for each $i$, since these are the only terms in the FM bound that depend on $\varepsilon$ and $\hat{b}$. In the Gaussian case, $-a(\hat{\eta}_{ij}) = -\frac{1}{2\sigma_x^2}\hat{\eta}_{ij}^2$ and
$$L_i = \sum_{j=1}^{m} \varepsilon_{ij}\,\mathbb{E}\left[-\frac{1}{2\sigma_x^2}\hat{\eta}_{ij}^2\right] \qquad (2.32)$$
$$= -\frac{1}{2\sigma_x^2}\sum_{j=1}^{m} \varepsilon_{ij}\left(\mathrm{Var}[\hat{\eta}_{ij}] + \mathbb{E}[\hat{\eta}_{ij}]^2\right) \qquad (2.33)$$
$$= -\frac{1}{2\sigma_x^2}\sum_{j=1}^{m}\left(\frac{w_{ij}^2\,\mathrm{Var}[y_j]}{\varepsilon_{ij}} + \varepsilon_{ij}\left(\hat{b}_{ij} + \frac{w_{ij}\,\mathbb{E}[y_j]}{\varepsilon_{ij}}\right)^{\!2}\right) \qquad (2.34)$$
Theorem 1. Holding $q(y_j \mid x)$ constant, the choice of $\hat{b}_i$ and $\varepsilon_i$ that maximizes $L_i$ is $\hat{b}_i = \hat{b}^*_i$ and $\varepsilon_i = \varepsilon^*_i$, where
$$\hat{b}^*_{ij} = \mathbb{E}[\eta_i] - \frac{w_{ij}\,\mathbb{E}[y_j]}{\varepsilon^*_{ij}}, \qquad \varepsilon^*_{ij} = \frac{|w_{ij}|\,\mathrm{Std}[y_j]}{\sum_{j'=1}^{m} |w_{ij'}|\,\mathrm{Std}[y_{j'}]} \qquad (2.35)$$
For a proof, see Appendix A. Note that these computations can be parallelized across $i$ and $j$.
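As a quick sanity check (not part of the original derivation), these optimal values satisfy the constraints they were introduced with: $\varepsilon^*_i$ lies on the simplex by construction, and the bias-splitting constraint holds since
$$\sum_{j=1}^{m} \varepsilon^*_{ij}\,\hat{b}^*_{ij} = \sum_{j=1}^{m} \varepsilon^*_{ij}\,\mathbb{E}[\eta_i] - \sum_{j=1}^{m} w_{ij}\,\mathbb{E}[y_j] = \mathbb{E}[\eta_i] - w_i \cdot \mathbb{E}[y] = b_i.$$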
Input: an observation $x \in \mathbb{R}^n$ and model parameters $W \in \mathbb{R}^{n\times m}$, $b \in \mathbb{R}^n$, $\sigma_y^2 \in \mathbb{R}$, and $\sigma_x^2 \in \mathbb{R}$.
Output: the mean-field variational distribution $q(y \mid x) \equiv \prod_{j=1}^{m} q(y_j \mid x)$.

initialize $(\mu_0)_j$ and $(\sigma_0)^2_j$ for each $j \in [m]$
for $t = 0$ to $T-1$ do
    for $i = 1$ to $n$ do
        for $j = 1$ to $m$ do
            $(\varepsilon_t)_{ij} = \dfrac{|w_{ij}|\,(\sigma_t)_j}{\sum_{j'=1}^{m} |w_{ij'}|\,(\sigma_t)_{j'}}$
            $(\hat{b}_t)_{ij} = \left(b_i + \sum_{j'=1}^{m} w_{ij'}(\mu_t)_{j'}\right) - \dfrac{w_{ij}(\mu_t)_j}{(\varepsilon_t)_{ij}}$
    for $j = 1$ to $m$ do
        $(\mu_{t+1})_j = \dfrac{\sum_{i=1}^{n} w_{ij}\left(x_i - (\hat{b}_t)_{ij}\right)}{\sigma_x^2/\sigma_y^2 + \sum_{i=1}^{n} w_{ij}^2/(\varepsilon_t)_{ij}}$
        $(\sigma_{t+1})^2_j = \dfrac{1}{1/\sigma_y^2 + (1/\sigma_x^2)\sum_{i=1}^{n} w_{ij}^2/(\varepsilon_t)_{ij}}$
return $q(y_j \mid x) = \mathcal{N}((\mu_T)_j, (\sigma_T)^2_j)$ for $j \in [m]$

Algorithm 1: The FM algorithm in the Gaussian case.
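For readers who prefer code, the following is a minimal NumPy sketch of Algorithm 1 (an illustrative implementation, not the code used for the experiments); it assumes every row of $W$ has at least one non-zero entry and initializes the variational distribution at the prior.

```python
import numpy as np

def fm_algorithm(x, W, b, sigma_y2, sigma_x2, T=200):
    """Block-free parallel mean-field inference for the Gaussian single-layer model."""
    n, m = W.shape
    mu = np.zeros(m)                         # (mu_t)_j
    var = np.full(m, sigma_y2)               # (sigma_t)^2_j, start from the prior variance
    for _ in range(T):
        std = np.sqrt(var)
        # auxiliary parameter updates (Theorem 1), computed for all i, j at once
        eps = np.abs(W) * std                                            # |w_ij| * Std[y_j]
        eps = eps / np.maximum(eps.sum(axis=1, keepdims=True), 1e-12)    # rows on the simplex
        eps = np.maximum(eps, 1e-12)                                     # guard entries where w_ij = 0
        E_eta = b + W @ mu                                               # E[eta_i] = b_i + w_i . E[y]
        b_hat = E_eta[:, None] - W * mu[None, :] / eps
        # variational updates (Theorem 2), parallel over j
        s = (W ** 2 / eps).sum(axis=0)                                   # sum_i w_ij^2 / eps_ij
        mu = (W * (x[:, None] - b_hat)).sum(axis=0) / (sigma_x2 / sigma_y2 + s)
        var = 1.0 / (1.0 / sigma_y2 + s / sigma_x2)
    return mu, var                                                       # q(y_j|x) = N(mu_j, var_j)
```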
2.5.2 Variational Updates
Holding the auxiliary parameters fixed, each variational distribution $q(y_j \mid x)$ can be updated in parallel:
Theorem 2. For a fixed $\varepsilon$ and $\hat{b}$, the choice for the next variational distribution $q_{t+1}(y_j \mid x)$ that maximizes the FM bound is $q_{t+1}(y_j \mid x) = \mathcal{N}((\mu^*_{t+1})_j, (\sigma^*_{t+1})^2_j)$, where
$$(\mu^*_{t+1})_j \equiv \frac{(x - \mathbb{E}_{q_t}[\eta])\cdot w_j + \mathbb{E}_{q_t}[y_j]\sum_{i=1}^{n} w_{ij}^2/\varepsilon_{ij}}{\sigma_x^2/\sigma_y^2 + \sum_{i=1}^{n} w_{ij}^2/\varepsilon_{ij}} \qquad (2.36)$$
$$(\sigma^*_{t+1})^2_j \equiv \frac{1}{1/\sigma_y^2 + (1/\sigma_x^2)\sum_{i=1}^{n} w_{ij}^2/\varepsilon_{ij}} \qquad (2.37)$$
For a proof, see Appendix B.
2.6 Discussion
Tightness  We derived the FM bound by using Jensen's inequality to lower bound the ELBO. For a given variational distribution $q$, the gap between the two bounds is
$$\mathrm{GAP} \equiv \sum_{i=1}^{n} \mathbb{E}[-a(\eta_i)] - \sum_{i=1}^{n}\sum_{j=1}^{m} \varepsilon_{ij}\,\mathbb{E}[-a(\hat{\eta}_{ij})] \qquad (2.38)$$
In the Gaussian case, for an optimal choice of auxiliary parameters (see Appendix A),
$$\sum_{j=1}^{m} \varepsilon_{ij}\,\mathbb{E}[-a(\hat{\eta}_{ij})] = -\frac{1}{2\sigma_x^2}\,\|w_i \odot \mathrm{Std}[y]\|_1^2 - \frac{1}{2\sigma_x^2}\,\mathbb{E}[\eta_i]^2 \qquad (2.39)$$
$$\mathbb{E}[-a(\eta_i)] = -\frac{1}{2\sigma_x^2}\,\|w_i \odot \mathrm{Std}[y]\|_2^2 - \frac{1}{2\sigma_x^2}\,\mathbb{E}[\eta_i]^2 \qquad (2.40)$$
$$\mathrm{GAP} = \frac{1}{2\sigma_x^2}\sum_{i=1}^{n}\left(\|w_i \odot \mathrm{Std}[y]\|_1^2 - \|w_i \odot \mathrm{Std}[y]\|_2^2\right) \qquad (2.41)$$
Since $\sum_{i=1}^{n}\|w_i\|_1^2 \ge \sum_{i=1}^{n}\|w_i\|_2^2$, the FM bound imposes a stronger regularization on the variance of the variational distribution compared to the variational ELBO. For this reason, the variational distribution $q$ that maximizes the FM bound generally has a smaller variance compared to the variational distribution that maximizes the ELBO.
The FM bound tightly bounds the ELBO when $p$ is a forest model, so that $w_i$ has exactly one non-zero element, in the component $j(i)$ corresponding to the parent of $x_i$. In this case,
$$\|w_i \odot \mathrm{Std}[y]\|_1^2 = w_{i\,j(i)}^2\,\mathrm{Var}[y_{j(i)}] = \|w_i \odot \mathrm{Std}[y]\|_2^2 \qquad (2.42)$$
The bound is also tight when $\mathrm{Var}[y] = 0$, but in this case both the ELBO and the FM bound yield $-\infty$ because of the conditional entropy term $H(y \mid x)$.
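To get a feel for the size of this gap, consider a single observed variable with two equally weighted parents, say $w_i \odot \mathrm{Std}[y] = (1, 1)$ (an illustrative choice, not from the original text):
$$\|w_i \odot \mathrm{Std}[y]\|_1^2 = (1+1)^2 = 4, \qquad \|w_i \odot \mathrm{Std}[y]\|_2^2 = 1^2 + 1^2 = 2, \qquad \mathrm{GAP} = \frac{4-2}{2\sigma_x^2} = \frac{1}{\sigma_x^2},$$
so the gap grows with the number of comparably weighted parents a variable has and vanishes when only one entry of $w_i$ is non-zero.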
Speed of Convergence  Let us examine the role of $\varepsilon$ in the update for $q(y_j \mid x)$. If $\sum_{i=1}^{n} w_{ij}^2/\varepsilon_{ij}$ is large, then $\mathbb{E}_{q_{t+1}}[y_j] \approx \mathbb{E}_{q_t}[y_j]$, and so the FM algorithm makes a small update for $y_j$. If $\sum_{i=1}^{n} w_{ij}^2/\varepsilon_{ij}$ is small, then $\mathbb{E}_{q_{t+1}}[y_j]$ makes a large step in the direction of the residual $x - \mathbb{E}[\eta]$. In fact, if for some $j$, $\varepsilon_{ij} = 1$ for all $i$ where $w_{ij}$ is non-zero, then the FM algorithm updates $q(y_j \mid x)$ exactly as CAVI would. In this sense, $\varepsilon$ acts like an attention parameter that selects which $q(y_j \mid x)$ to change and by how much.
If $p$ is a forest model, then the FM algorithm chooses $\varepsilon_i$ to be the one-hot vector indicating the parent of $x_i$. In this case, the FM algorithm makes coordinate updates for all $j$ in parallel and converges in one iteration. If $p$ is forest-like, i.e., $|w_j| \cdot |w_{j'}|$ is small for $j \ne j'$, then $\varepsilon_i$ is close to one-hot and the FM algorithm makes damped, nearly-CAVI updates in parallel. In this sense, the speed at which the FM algorithm converges depends on how inherently forest-like the model $p$ is.

[Figure 2.2: The ridge regression objective over 200 iterations. (a) FM algorithm on synthetic windows of different sizes ($s = 15, 7, 3$). (b) FM algorithm on CNN kernels of different sizes ($k = 3, 5, 7$). (c) CAVI, block, and FM comparison.]
2.7 Experiments
Recall that we derived the FM bound by lower bounding the ELBO. Algorithms that optimize the ELBO like CAVI will generally provide a superior lower bound on log-likelihood compared to the FM algorithm. For a fairer comparison, we can instead measure how quickly these algorithms converge to the optimal mean. In the Gaussian case, optimizing the mean of the mean-field variational distribution is equivalent to minimizing a ridge regression objective:
$$\frac{1}{2\sigma_x^2}\sum_{i=1}^{n}\left(x_i - (b_i + w_i \cdot \mathbb{E}[y])\right)^2 + \frac{1}{2\sigma_y^2}\sum_{j=1}^{m} \mathbb{E}[y_j]^2 \qquad (2.43)$$
To evaluate each algorithm on the ridge regression problem, we must first choose $x$, $b$, and a set of $w_{ij}$. All the algorithms we consider in this section are guaranteed to converge to the optimal solution, so we are only interested in comparing how quickly each algorithm converges to that optimal solution. This is measured by recording the objective value achieved by the mean of the variational distribution $\mathbb{E}_{q_t}[y]$ in the ridge regression problem across 200 iterations.
In the first experiment, we choose $x$ to be a vectorized sample from the MNIST dataset, with pixel values scaled to lie in the interval $[-1, 1]$; we choose $b$ to be the average of 1000 randomly chosen MNIST samples; and we construct a synthetic $w_{ij}$ as follows: given an integer window side length $s$, we construct all possible square $s \times s$ windows of pixels. For windows that overlap the border of the $28\times 28$ MNIST image region, we clip the window so that it lies entirely inside the image region, resulting in a rectangular window. For each window, we add a latent variable $y_j$ to the model and a corresponding $w_j$, where $w_{ij} = 1$ if pixel $i$ lies in window $j$, and $w_{ij} = 0$ otherwise. The resulting model is more forest-like for smaller choices of $s$: if $s = 1$, the windows are disjoint and the graphical model is exactly a forest. Figure 2.2a demonstrates the rate of convergence of the FM algorithm for various choices of $s$. As we expect, the FM algorithm converges faster for more forest-like models, i.e., smaller $s$. Note that the objective value achieved by the optimal solution to the ridge regression problem changes as $w_{ij}$ changes.
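The window-based weight construction is easy to reproduce; the following is a minimal sketch of one way to build such a $W$ (the exact enumeration of window positions in the original experiments may differ from this illustration).

```python
import numpy as np

def window_weights(side=28, s=7):
    """One latent variable per s x s window; windows overlapping the border are clipped."""
    cols = []
    for r in range(side):                     # candidate window top-left corners
        for c in range(side):
            mask = np.zeros((side, side))
            mask[r:min(side, r + s), c:min(side, c + s)] = 1.0   # clip to the image region
            cols.append(mask.ravel())
    return np.stack(cols, axis=1)             # W with shape (side*side, number of windows)

# e.g. W = window_weights(s=1) gives disjoint 1x1 windows, i.e., an exactly forest-structured model
```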
The second experiment is similar to the first, except it uses $x$ from the CIFAR-10 dataset, $b = 0$, and, instead of uniform windows, uses the first-layer kernels from a convolutional neural net trained several times, changing only the width of the first-layer kernels. Figure 2.2b demonstrates the FM algorithm converges faster for more forest-like models even using real-world data.
Our last experiment compares the convergence of the FM algorithm with CAVI and block
coordinate ascent. Here we choose x and b the same as in the first experiment, but we choose wi j
differently to make blocking the latent variables easy: first we partition the 28×28 MNIST image
region into 16 regions, each of size 7 × 7. Then, we construct all possible 7 × 7 windows (as in
the first experiment with s = 7), then clip them to fit in the first region. This is repeated for each
region. If we block the latent variables according to which region the corresponding windows were
clipped to, then the blocks will be conditionally independent, since windows clipped to different
regions must be disjoint. Blocking in this way guarantees that the block coordinate algorithm will
converge to the optimal solution. Figure 2.2c compares the rate of convergence for CAVI, block
coordinate ascent, and the FM algorithm. The figure shows our block-free method can outperform
the block coordinate method, even when the blocking is quite good.
2.8 Conclusion
In this chapter we derived a forest mixture bound on the log-likelihood of deep exponential families. This bound gets around the “explaining away effect” by using a set of auxiliary parameters to
separate the problem of inference for each latent variable in the same layer, allowing us to make
parallel updates. We then made a deep dive into the simple case where all variables are Gaussian:
we derived the exact variable updates, then tested the algorithm on both synthetic and real-world
data. Our promising results show that fast, parallel inference in deep exponential families is possible without the use of blocks.
2.9 Appendix A: Auxiliary Parameter Updates
Proof of Theorem 1: First, we will find the optimal choice of $\hat{b}_i$ for any given $\varepsilon_i$. Since $\hat{b}_i$ is constrained by $\sum_{j=1}^{m} \varepsilon_{ij}\hat{b}_{ij} = b_i$, let us first parameterize $\hat{b}_i$ by a set of unconstrained parameters: let $\gamma_i \in \mathbb{R}^m$ and write
$$\hat{b}_{ij} = b_i - \gamma_{ij} + \varepsilon_i \cdot \gamma_i \qquad (2.44)$$
So for any choice of $\gamma_i$, the constraint $b_i = \sum_{j=1}^{m} \varepsilon_{ij}\hat{b}_{ij}$ is satisfied. Now we can differentiate the bound with respect to $\gamma_{ij}$, set to zero and solve. We will need the following partial derivatives:
$$\frac{\partial \hat{b}_{ij}}{\partial \gamma_{ij}} = -1 + \varepsilon_{ij}, \qquad \frac{\partial \hat{b}_{ij'}}{\partial \gamma_{ij}} = \varepsilon_{ij} \;\; \forall j' \ne j \qquad (2.45)$$
Now setting the partial derivative of $L_i$ with respect to $\gamma_{ij}$ to zero,
$$0 = \frac{\partial}{\partial \gamma_{ij}} L_i = -\frac{1}{\sigma_x^2}\sum_{j'=1}^{m} \varepsilon_{ij'}\,\mathbb{E}[\hat{\eta}_{ij'}]\,\frac{\partial \hat{b}_{ij'}}{\partial \gamma_{ij}} \qquad (2.46)$$
$$= \frac{\varepsilon_{ij}}{\sigma_x^2}\left(\mathbb{E}[\hat{\eta}_{ij}] - \sum_{j'=1}^{m} \varepsilon_{ij'}\,\mathbb{E}[\hat{\eta}_{ij'}]\right) \qquad (2.47)$$
The derivative is zero for all $j$ in particular when the choice of $\hat{b}_{ij}$ makes $\mathbb{E}[\hat{\eta}_{ij}]$ constant across $j$. We can verify this is satisfied by the choice $\gamma_{ij} = \frac{w_{ij}}{\varepsilon_{ij}}\mathbb{E}[y_j]$, which makes $\hat{b}_{ij} = \hat{b}^*_{ij}$:
$$\mathbb{E}[\hat{\eta}_{ij}] = \mathbb{E}\left[\hat{b}_{ij} + \frac{w_{ij}}{\varepsilon_{ij}} y_j\right] \qquad (2.48)$$
$$= \mathbb{E}\left[\mathbb{E}[\eta_i] - \frac{w_{ij}}{\varepsilon_{ij}}\mathbb{E}[y_j] + \frac{w_{ij}}{\varepsilon_{ij}} y_j\right] \qquad (2.49)$$
$$= \mathbb{E}[\eta_i] \qquad (2.50)$$
Plugging this choice into $L_i$ yields
$$L_i = -\frac{1}{2\sigma_x^2}\sum_{j=1}^{m}\left(\frac{w_{ij}^2\,\mathrm{Var}[y_j]}{\varepsilon_{ij}} + \varepsilon_{ij}\,\mathbb{E}[\eta_i]^2\right) \qquad (2.51)$$
$$= -\frac{1}{2\sigma_x^2}\left(\sum_{j=1}^{m}\frac{w_{ij}^2\,\mathrm{Var}[y_j]}{\varepsilon_{ij}}\right) - \frac{1}{2\sigma_x^2}\,\mathbb{E}[\eta_i]^2 \qquad (2.52)$$
Now let us try to find the optimal choice of $\varepsilon_{ij}$. Since $\varepsilon_{ij}$ is constrained by $\varepsilon_i \in \Delta^{m-1}$, we'll also parameterize $\varepsilon_{ij}$ by a set of unconstrained parameters $\tau_i \in \mathbb{R}^m$:
$$\varepsilon_{ij} = \exp\{\tau_{ij}\}\Big/\sum_{j'=1}^{m}\exp\{\tau_{ij'}\} \qquad (2.53)$$
We will need the following partial derivatives:
$$\frac{\partial \varepsilon_{ij}}{\partial \tau_{ij}} = \varepsilon_{ij}(1 - \varepsilon_{ij}), \qquad \frac{\partial \varepsilon_{ij'}}{\partial \tau_{ij}} = -\varepsilon_{ij}\varepsilon_{ij'} \;\; \forall j' \ne j \qquad (2.54)$$
Now setting the partial derivative of $L_i$ with respect to $\tau_{ij}$ to zero,
$$0 = \frac{\partial}{\partial \tau_{ij}} L_i = \frac{1}{2\sigma_x^2}\sum_{j'=1}^{m}\frac{w_{ij'}^2\,\mathrm{Var}[y_{j'}]}{\varepsilon_{ij'}^2}\,\frac{\partial \varepsilon_{ij'}}{\partial \tau_{ij}} \qquad (2.55)$$
$$= \frac{1}{2\sigma_x^2}\sum_{j'=1}^{m}\mathrm{Var}[\hat{\eta}_{ij'}]\,\frac{\partial \varepsilon_{ij'}}{\partial \tau_{ij}} \qquad (2.56)$$
$$= \frac{\varepsilon_{ij}}{2\sigma_x^2}\left(\mathrm{Var}[\hat{\eta}_{ij}] - \sum_{j'=1}^{m}\varepsilon_{ij'}\,\mathrm{Var}[\hat{\eta}_{ij'}]\right) \qquad (2.57)$$
The derivative is zero for all $j$ in particular when the choice of $\varepsilon_{ij}$ makes $\mathrm{Var}[\hat{\eta}_{ij}]$ constant across $j$. We can verify this is satisfied by the choice $\tau_{ij} = \log|w_{ij}\,\mathrm{Std}[y_j]|$, which makes $\varepsilon_{ij} = \varepsilon^*_{ij}$:
$$\mathrm{Var}[\hat{\eta}_{ij}] = \frac{w_{ij}^2\,\mathrm{Var}[y_j]}{\varepsilon_{ij}^2} \qquad (2.58)$$
$$= \frac{w_{ij}^2\,\mathrm{Var}[y_j]}{w_{ij}^2\,\mathrm{Var}[y_j]\Big/\left(\sum_{j'=1}^{m}|w_{ij'}|\,\mathrm{Std}[y_{j'}]\right)^{\!2}} \qquad (2.59)$$
$$= \|w_i \odot \mathrm{Std}[y]\|_1^2 \qquad (2.60)$$
Plugging this choice into $L_i$ yields
$$L_i = -\frac{1}{2\sigma_x^2}\sum_{j=1}^{m}|w_{ij}|\,\mathrm{Std}[y_j]\left(\sum_{j'=1}^{m}|w_{ij'}|\,\mathrm{Std}[y_{j'}]\right) - \frac{1}{2\sigma_x^2}\,\mathbb{E}[\eta_i]^2 \qquad (2.61)$$
$$= -\frac{1}{2\sigma_x^2}\left(\sum_{j=1}^{m}|w_{ij}|\,\mathrm{Std}[y_j]\right)^{\!2} - \frac{1}{2\sigma_x^2}\,\mathbb{E}[\eta_i]^2 \qquad (2.62)$$
$$= -\frac{1}{2\sigma_x^2}\,\|w_i \odot \mathrm{Std}[y]\|_1^2 - \frac{1}{2\sigma_x^2}\,\mathbb{E}[\eta_i]^2 \qquad (2.63)$$
2.10 Appendix B: Variational Updates
Proof of Theorem 2: First, note that for any DEF, the optimal update equation is as follows:
$$\log q_{t+1}(y_j \mid x) \propto \log p(y_j) + \sum_{i=1}^{n} \varepsilon_{ij}\log \hat{p}_t(x_i \mid y_j) \qquad (2.64)$$
In the Gaussian case, we have
$$\log p(y_j) \propto -\frac{1}{2\sigma_y^2}\,y_j^2 \qquad (2.65)$$
$$\log \hat{p}(x_i \mid y_j) \propto -\frac{1}{2\sigma_x^2}\left(x_i - \hat{\eta}_{ij}\right)^2 \qquad (2.66)$$
$$\propto -\frac{1}{2\sigma_x^2}\left(x_i - \hat{b}_{ij} - \frac{w_{ij}}{\varepsilon_{ij}} y_j\right)^{\!2} \qquad (2.67)$$
$$\propto \frac{1}{\sigma_x^2}\,\frac{(x_i - \hat{b}_{ij})\,w_{ij}}{\varepsilon_{ij}}\, y_j - \frac{1}{2\sigma_x^2}\,\frac{w_{ij}^2}{\varepsilon_{ij}^2}\, y_j^2 \qquad (2.68)$$
Plugging this in yields
$$\log q(y_j \mid x) \propto \frac{1}{\sigma_x^2}\,(x - \hat{b}_j)\cdot w_j\; y_j - \frac{1}{2}\, y_j^2\left(\frac{1}{\sigma_y^2} + \frac{1}{\sigma_x^2}\sum_{i=1}^{n}\frac{w_{ij}^2}{\varepsilon_{ij}}\right) \qquad (2.69)$$
$$\propto \frac{1}{\sigma_x^2}\,(x - \hat{b}_j)\cdot w_j\; y_j - \frac{1}{2(\sigma^*_{t+1})^2_j}\, y_j^2 \qquad (2.70)$$
$$\propto -\frac{1}{2(\sigma^*_{t+1})^2_j}\left(y_j - \frac{\frac{1}{\sigma_x^2}(x - \hat{b}_j)\cdot w_j}{\frac{1}{\sigma_y^2} + \frac{1}{\sigma_x^2}\sum_{i=1}^{n} w_{ij}^2/\varepsilon_{ij}}\right)^{\!2} \qquad (2.71)$$
$$\propto -\frac{1}{2(\sigma^*_{t+1})^2_j}\left(y_j - \frac{(x - \hat{b}_j)\cdot w_j}{\sigma_x^2/\sigma_y^2 + \sum_{i=1}^{n} w_{ij}^2/\varepsilon_{ij}}\right)^{\!2} \qquad (2.72)$$
After substituting $\hat{b}_{ij} = \mathbb{E}[\eta_i] - \frac{w_{ij}}{\varepsilon_{ij}}\mathbb{E}_{q_t}[y_j]$ and rearranging, we get $q_{t+1}(y_j \mid x) = \mathcal{N}((\mu^*_{t+1})_j, (\sigma^*_{t+1})^2_j)$.
2.11 Appendix C: Extension to Deep Models
A DEF model with observed variables $y^{(0)} \in \mathbb{R}^{m_0}$ and $L$ layers of latent variables $\{y^{(\ell)}\}_{\ell=1}^{L}$ with $y^{(\ell)} \in \mathbb{R}^{m_\ell}$ has joint distribution
$$p(\{y^{(\ell)}\}_{\ell=0}^{L}) = \left[\prod_{\ell=0}^{L-1}\prod_{i=1}^{m_\ell} p\big(y^{(\ell)}_i \mid y^{(\ell+1)}\big)\right]\left[\prod_{i=1}^{m_L} p\big(y^{(L)}_i\big)\right] \qquad (2.73)$$
$$p\big(y^{(\ell)}_i \mid y^{(\ell+1)}\big) = \exp\big\{g(y^{(\ell)}_i) + t(y^{(\ell)}_i)\,\eta^{(\ell)}_i - a(\eta^{(\ell)}_i)\big\} \qquad (2.74)$$
$$\eta^{(\ell)}_i \equiv b^{(\ell)}_i + w^{(\ell)}_i \cdot y^{(\ell+1)} \qquad (2.75)$$
The ELBO for this model is
$$\log p(y^{(0)}) \ge \sum_{i=1}^{m_0} \mathbb{E}\big[\log p\big(y^{(0)}_i \mid y^{(1)}\big)\big] \qquad (2.76)$$
$$+ \sum_{\ell=1}^{L-1}\sum_{i=1}^{m_\ell} \mathbb{E}\big[\log p\big(y^{(\ell)}_i \mid y^{(\ell+1)}\big)\big] + H_q\big(y^{(\ell)}_i \mid y^{(0)}\big) \qquad (2.77)$$
$$+ \sum_{i=1}^{m_L} \mathbb{E}\big[\log p\big(y^{(L)}_i\big)\big] + H_q\big(y^{(L)}_i \mid y^{(0)}\big) \qquad (2.78)$$
For each $\ell \in \{0, \ldots, L-1\}$, introduce the auxiliary parameters $\{\varepsilon^{(\ell)}_i\}_{i=1}^{m_\ell}$ and $\{\hat{b}^{(\ell)}_i\}_{i=1}^{m_\ell}$, with $\varepsilon^{(\ell)}_i \in \Delta^{m_{\ell+1}-1}$ and $\hat{b}^{(\ell)}_i \in \mathbb{R}^{m_{\ell+1}}$ constrained by $b^{(\ell)}_i = \sum_{j=1}^{m_{\ell+1}} \varepsilon^{(\ell)}_{ij}\hat{b}^{(\ell)}_{ij}$. For all $\ell \in \{0, \ldots, L-1\}$, $i \in [m_\ell]$, and $j \in [m_{\ell+1}]$, define
$$\hat{\eta}^{(\ell)}_{ij} \equiv \hat{b}^{(\ell)}_{ij} + \frac{w^{(\ell)}_{ij}}{\varepsilon^{(\ell)}_{ij}}\, y^{(\ell+1)}_j \qquad (2.79)$$
$$\hat{p}\big(y^{(\ell)}_i \mid y^{(\ell+1)}_j\big) \equiv \exp\big\{g(y^{(\ell)}_i) + t(y^{(\ell)}_i)\,\hat{\eta}^{(\ell)}_{ij} - a(\hat{\eta}^{(\ell)}_{ij})\big\} \qquad (2.80)$$
Then by (2.25),
$$\mathbb{E}\big[\log p\big(y^{(\ell)}_i \mid y^{(\ell+1)}\big)\big] \ge \sum_{j=1}^{m_{\ell+1}} \varepsilon^{(\ell)}_{ij}\,\mathbb{E}\big[\log \hat{p}\big(y^{(\ell)}_i \mid y^{(\ell+1)}_j\big)\big] \qquad (2.81)$$
Plugging this into the ELBO yields
$$\log p(y^{(0)}) \ge \sum_{i=1}^{m_0}\sum_{j=1}^{m_1} \varepsilon^{(0)}_{ij}\,\mathbb{E}\big[\log \hat{p}\big(y^{(0)}_i \mid y^{(1)}_j\big)\big] \qquad (2.82)$$
$$+ \sum_{\ell=1}^{L-1}\sum_{i=1}^{m_\ell}\sum_{j=1}^{m_{\ell+1}} \varepsilon^{(\ell)}_{ij}\,\mathbb{E}\big[\log \hat{p}\big(y^{(\ell)}_i \mid y^{(\ell+1)}_j\big)\big] + H_q\big(y^{(\ell)}_i \mid y^{(0)}\big) \qquad (2.83)$$
$$+ \sum_{i=1}^{m_L} \mathbb{E}\big[\log p\big(y^{(L)}_i\big)\big] + H_q\big(y^{(L)}_i \mid y^{(0)}\big) \qquad (2.84)$$
28
This objective separates as a sum of terms, each of which involves no more than one latent
variable in the same layer. This allows any group of variables forming an independent set in the
model graph to be updated in parallel, the same as for undirected models.
29
Chapter 3
Deep Residual Partitioning
Deep neural networks are trained by solving a hard optimization problem. This optimization problem can be effectively solved using general optimization methods like Adam, but we can exploit
the special structure of neural networks to derive even better training algorithms. We introduce
residual partitioning, a novel second-order optimization method for training neural nets. In each
training iteration, residual partitioning uses Jensen’s inequality to bound the objective; this bound
has a diagonal Hessian and so is simple to optimize. In our experiments, we compare residual
partitioning with several standard optimization algorithms on several machine learning tasks and
conclude residual partitioning converges to a competitive or better solution.
3.1 Introduction
Training deep neural networks is a difficult optimization problem. This problem can be solved
using gradient descent, which incrementally moves the model parameters in the direction of the negative gradient and uses a learning rate to determine how far. Since the curvature of the objective function varies greatly along different directions, using a constant learning rate causes the process to ping-pong along directions of high curvature, leading to slow convergence. Instead, fast optimization
methods estimate the curvature of the objective around the current solution and adjust the learning
rate accordingly, decreasing the learning rate along dimensions with high curvature and increasing
the learning rate along dimensions with low curvature.
Adaptive learning rate methods such as Adam [Kingma and Ba, 2014], AdaGrad [Duchi et al.,
2011], and AdaDelta [Zeiler, 2012] estimate the local curvature from an exponential moving average of past gradients using different heuristics. Since implementing these methods only requires
efficient computation of the gradient, they can be used to solve a wide variety of optimization
problems.
Second-order methods estimate the curvature using the Hessian matrix of second derivatives
[Pearlmutter, 1994]. However, the Hessian is typically dense and exhibits negative curvature, so
modern second-order methods instead use various structured, positive-definite approximations of
the Hessian. In practice, modern second-order methods are still too slow, too memory-intensive,
and too complicated to compete against adaptive gradient methods.
In this chapter we introduce residual partitioning, a novel second-order optimization algorithm
for training neural networks. In each training iteration, residual partitioning uses Jensen’s inequality to construct an upper bound on the local second-order approximation of the loss function. This
is a strong theoretical claim, since it guarantees that the decrease in loss achieved by minimizing
the bound will also be achieved for the local second-order approximation. The bound we derive
has a diagonal Hessian matrix and is therefore easy to optimize and memory efficient. This also
makes it exceedingly simple to correct for negative curvature introduced by the nonlinear activation functions without computing the eigendecomposition of any dense matrix. We can efficiently
construct and minimize our bound with a single forward and backward pass through the network
and in the same time complexity as an Adam or SGD step.
In our experiments, we compare residual partitioning with Adam and SGD by training an
autoencoder and classifier network on MNIST, Fashion-MNIST [Xiao et al., 2017], and CIFAR-10
[Krizhevsky et al., 2009]. We observe that residual partitioning converges to a competitive or better
solution compared to Adam and SGD.
3.2 Related Work
Adaptive gradient methods such as Adam [Kingma and Ba, 2014], AdaGrad [Duchi et al., 2011],
and AdaDelta [Zeiler, 2012] estimate the local curvature from an exponential moving average of
past gradients using different heuristics. Since the only input to these algorithms is the gradient
of the objective function and some hyperparameters like the learning rate, these methods can be
easily applied to a wide range of optimization problems. This is particularly useful for problems
where second-order information is expensive or complicated to compute. In contrast, residual
partitioning, like other second-order methods for training neural nets, relies on the special structure
of the objective function to efficiently compute useful curvature information from the local second-order approximation of the objective. Residual partitioning is more memory efficient compared to
adaptive methods since residual partitioning does not need to maintain an exponential moving
average of past gradients.
Second-order methods compute the local curvature of the objective function using the Hessian
matrix of second derivatives. Hessian-free learning uses the special structure of neural networks to
efficiently compute Hessian-vector products [Martens, 2010, Martens and Sutskever, 2011, Pearlmutter, 1994, Schraudolph, 2002, Vinyals and Povey, 2012]. These methods may converge in
relatively few training steps, but each step requires several conjugate gradient iterations, resulting
in a significantly longer wall time.
Many other second-order methods use a structured approximation of the Hessian matrix, e.g.
low-rank [Broyden, 1967, Liu and Nocedal, 1989, Sohl-Dickstein et al., 2014], diagonal [Becker
et al., 1988], or block-diagonal [Dangel et al., 2019], to decrease wall time per training step or
memory utilization. In contrast, rather than approximating the Hessian matrix, residual partitioning constructs a bound on the local second-order approximation, and that bound has a diagonal
Hessian. This provides a strong theoretical guarantee not available to a mere approximation.
Typically the Hessian matrix has some negative eigenvalues, indicating saddle points in the
objective landscape [Mizutani and Dreyfus, 2010]. Second-order methods are attracted to these
critical points and may actually increase the loss after a training step. This can be remedied by
approximating the Hessian matrix by a positive-definite matrix, such as the Gauss-Newton approximation [Botev et al., 2017], or by identifying and bounding the negative eigenvalues of the
Hessian [Chen et al., 2019, Dauphin et al., 2014]. Doing this is simple in the residual partitioning
context, since the Hessian of our bound is diagonal. In this way, residual partitioning constructs a
convex upper bound on the loss, even though the objective function is non-convex.
A closely related and well-studied optimization method is natural gradient descent. Instead of
transforming the gradient using the Hessian matrix, natural gradient descent transforms the gradient using the Fisher information matrix, or an approximation [Amari, 1998, Grosse and Salakhutdinov, 2015, Martens, 2014, Martens and Grosse, 2015, Pascanu and Bengio, 2013, Roux et al.,
2008]. Like the Hessian matrix, the Fisher information matrix is difficult to compute and invert. If
the Fisher information matrix has the same special structure as the Hessian, it may be possible to
apply residual partitioning to natural gradient descent.
3.3 Notation and Setup
We use dot notation to denote derivatives. Vectors are written in bold. We use subscript $\ell$ or superscript $(\ell)$ to index layers, subscript $s$ to index dataset samples, and subscript $i$ or $j$ to index neurons in a layer. The $\ell$-th layer of a neural net with $L$ layers has $m_\ell$ neurons with activations $\{y_{sj}^{(\ell)}\}_{j=1}^{m_\ell}$. The input to the net is the vector of first-layer activations $y_s^{(1)}$, and the output is the vector of last-layer activations $y_s^{(L)}$. Each affine layer has weights $\{w_{ij}^{(\ell)}\}_{i=1,\,j=1}^{m_{\ell+1},\,m_\ell}$ and biases $\{b_i^{(\ell)}\}_{i=1}^{m_{\ell+1}}$, though we write each bias $b_i^{(\ell)}$ as a weight $w_{i0}^{(\ell)}$ with incoming activation $y_{s0}^{(\ell)} \equiv 1$. Each activation layer has an activation function $\sigma_\ell(\cdot): \mathbb{R} \to \mathbb{R}$.

Denote $P \subseteq [L]$ as the set of indices corresponding to layers with trainable parameters. In each training step, for each $\ell \in P$, we update each weight $w_{ij}^{(\ell)}$ by adding on $\Delta w_{ij}^{(\ell)}$. The new weight is $\hat{w}_{ij}^{(\ell)} \equiv w_{ij}^{(\ell)} + \Delta w_{ij}^{(\ell)}$. Updating the weights causes the activation $y_{sj}^{(\ell)}$ to change; denote this change $\Delta y_{sj}^{(\ell)}$, so the activation after updating weights is $\hat{y}_{sj}^{(\ell)} \equiv y_{sj}^{(\ell)} + \Delta y_{sj}^{(\ell)}$. Residual partitioning is a method for solving optimization problems of a special form:
\begin{equation}
\min_{\{\Delta w_{ij}^{(\ell)}\}} \ \mathcal{L} \equiv \sum_{s=1}^{S} f\big(\hat{y}_s^{(L)} \mid x_s\big) \tag{3.1}
\end{equation}
where $\{x_s\}_{s=1}^{S}$ is a dataset of observations, $\hat{y}_s^{(L)}$ is the output of a neural net consisting of some combination of activation layers and affine layers, and $f(\hat{y}_s^{(L)} \mid x_s): \mathbb{R}^{m_L} \to \mathbb{R}$ is a convex loss function of $\hat{y}_s^{(L)}$, such as the $L^2$ loss or cross-entropy error.
3.4 Residual Partitioning
In this section, we will bound the second-order approximation to the objective function, $\tilde{\mathcal{L}} \approx \mathcal{L}$. To do this, we will peel back the neural network layer by layer, working backwards. Formally, we will prove the following by induction:

Theorem 3. For each $\ell \in [L]$, there exists an upper bound $\tilde{\mathcal{L}}_\ell$ on $\tilde{\mathcal{L}}$ of the following form, where $Y_{sj}^{(\ell)}$ and $W_{ij}^{(\ell)}$ are univariate convex quadratic functions of $\Delta y_{sj}^{(\ell)}$ and $\Delta w_{ij}^{(\ell)}$, respectively:
\begin{gather}
\tilde{\mathcal{L}} \leq \tilde{\mathcal{L}}_\ell \equiv \sum_{s=1}^{S} \sum_{j=1}^{m_\ell} Y_{sj}^{(\ell)} + \sum_{\ell' \in P,\ \ell' \geq \ell}\ \sum_{i=1}^{m_{\ell'+1}} \sum_{j=1}^{m_{\ell'}} W_{ij}^{(\ell')} \tag{3.2}\\
Y_{sj}^{(\ell)} \equiv \dot{Y}_{sj}^{(\ell)} \Delta y_{sj}^{(\ell)} + \tfrac{1}{2} \ddot{Y}_{sj}^{(\ell)} \big(\Delta y_{sj}^{(\ell)}\big)^2 \qquad W_{ij}^{(\ell)} \equiv \dot{W}_{ij}^{(\ell)} \Delta w_{ij}^{(\ell)} + \tfrac{1}{2} \ddot{W}_{ij}^{(\ell)} \big(\Delta w_{ij}^{(\ell)}\big)^2 \tag{3.3}\\
\ddot{Y}_{sj}^{(\ell)} \geq 0 \qquad \ddot{W}_{ij}^{(\ell)} \geq 0 \tag{3.4}
\end{gather}
In the course of constructing these bounds, we will partition each residual $f(\hat{y}_s^{(L)} \mid x_s) - f(y_s^{(L)} \mid x_s)$ amongst the trainable parameters of the network. However, $\tilde{\mathcal{L}}_\ell$ is only an intermediary bound, with $W_{ij}^{(\ell)}$ terms for parameters that have already been assigned part of the residual and $Y_{sj}^{(\ell)}$ terms for the remaining pieces of the residual that will continue to backpropagate up to parameters in higher layers. The bound we are ultimately interested in is $\tilde{\mathcal{L}}_1$, where the quadratic function $Y_{sj}^{(1)}$ is a constant (since $\Delta y_{sj}^{(1)} = 0$) and the bound can be written
\begin{equation}
\tilde{\mathcal{L}} \leq \tilde{\mathcal{L}}_1 \equiv \sum_{\ell \in P} \sum_{i=1}^{m_{\ell+1}} \sum_{j=1}^{m_\ell} W_{ij}^{(\ell)}. \tag{3.5}
\end{equation}
We call $\tilde{\mathcal{L}}_1$ the residual partitioning bound. We are interested in this bound because its Hessian with respect to $\{\Delta w_{ij}^{(\ell)}\}$ is diagonal, and so it is easily minimized by $\Delta w_{ij}^{(\ell)} = \Delta w_{ij}^{*(\ell)}$:
\begin{equation}
\Delta w_{ij}^{*(\ell)} = -\frac{\dot{W}_{ij}^{(\ell)}}{\ddot{W}_{ij}^{(\ell)}} \tag{3.6}
\end{equation}
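Because the Hessian of the bound is diagonal, the minimizer decouples into one scalar Newton step per weight. The following is a minimal NumPy sketch of that step; the arrays of first- and second-order coefficients are placeholders standing in for $\dot{W}$ and $\ddot{W}$, which the rest of the chapter shows how to compute.

import numpy as np

# Placeholder coefficients of the bound for one layer: W_dot[i, j] and W_ddot[i, j] > 0.
W_dot = np.array([[0.3, -1.2], [0.5, 0.1]])
W_ddot = np.array([[2.0, 4.0], [1.0, 8.0]])

# Minimizing each univariate quadratic W_dot*dw + 0.5*W_ddot*dw^2 gives the update (3.6).
delta_w = -W_dot / W_ddot
print(delta_w)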
We will begin our proof of Theorem 3 by proving the base case. We do this by bounding the loss function $f$ to construct $\tilde{\mathcal{L}}_L$.
3.4.1 The Loss Function
If the Hessian of $f$ with respect to $\Delta y_s^{(L)}$ is diagonal, then $\tilde{\mathcal{L}}_L$ is simply the second-order approximation of $\mathcal{L}$ with respect to $\Delta y_s^{(L)}$. Otherwise, construct $\tilde{\mathcal{L}}_L$ by choosing for each $s \in [S]$ a set of numbers $\{\phi_{si}^{(L)}\}_{i=1}^{m_L} \in \Delta^{m_L - 1}$ that are non-negative and add to one. We call these partitioning variables: the component $\phi_{si}^{(L)}$ can be interpreted as the fraction of the residual that we choose to backpropagate through $y_{si}^{(L)}$. To construct $\tilde{\mathcal{L}}_L$, multiply by one and bound with Jensen's inequality. This is a well-known mathematical technique, used perhaps most famously in the derivation of the variational ELBO [Kingma and Welling, 2013]. Applying this to $f$ yields the following, where $\odot$ is componentwise multiplication, $e_i$ is the $i$-th standard basis vector, and $Y_{si}^{(L)}$ is a univariate convex quadratic function of $\Delta y_{si}^{(L)}$ as in (3.3) with the following coefficients:
\begin{gather}
\mathcal{L} \leq \sum_{s=1}^{S} \sum_{i=1}^{m_L} \phi_{si}^{(L)} f\!\left( y_s^{(L)} + \frac{1}{\phi_{si}^{(L)}}\, \Delta y_s^{(L)} \odot e_i \,\Big|\, x_s \right) \approx \sum_{s=1}^{S} \sum_{i=1}^{m_L} Y_{si}^{(L)} \equiv \tilde{\mathcal{L}}_L \tag{3.7}\\
\dot{Y}_{si}^{(L)} \equiv \frac{\partial f(y_s^{(L)} \mid x_s)}{\partial y_{si}^{(L)}} \qquad \ddot{Y}_{si}^{(L)} \equiv \frac{1}{\phi_{si}^{(L)}}\, \frac{\partial^2 f(y_s^{(L)} \mid x_s)}{\partial \big(y_{si}^{(L)}\big)^2} \tag{3.8}
\end{gather}
For a more detailed derivation of this step, see section 3.9.1 in the appendix. Note that the convexity of $f$ guarantees $\ddot{Y}_{si}^{(L)} \geq 0$. Comparing with (3.2), this establishes the base case. Now we will prove the inductive step by constructing $\tilde{\mathcal{L}}_\ell$ from $\tilde{\mathcal{L}}_{\ell+1}$. To do this, we handle the cases where layer $\ell$ is an affine layer or an activation layer separately.
3.4.2 Affine layers
Affine layers include fully connected layers, convolutional layers, and residual connections. Here we only consider fully connected layers, but the work shown here generalizes to other affine layer types. For an affine layer, the $i$-th component of the output and change in output are
\begin{equation}
y_{si}^{(\ell+1)} \equiv \sum_{j=0}^{m_\ell} w_{ij}^{(\ell)} y_{sj}^{(\ell)} \qquad \Delta y_{si}^{(\ell+1)} \approx \sum_{j=0}^{m_\ell} \Delta w_{ij}^{(\ell)} y_{sj}^{(\ell)} + \sum_{j=1}^{m_\ell} w_{ij}^{(\ell)} \Delta y_{sj}^{(\ell)}. \tag{3.9}
\end{equation}
Note that we have omitted the term $\Delta w_{ij}^{(\ell)} \Delta y_{sj}^{(\ell)}$ from $\Delta y_{si}^{(\ell+1)}$. Doing so is not necessary, but it simplifies notation and the resulting algorithm. If $\ell \neq 1$, for each $s \in [S]$ and $i \in [m_{\ell+1}]$ choose partitioning variables $\{\varepsilon_{sij}^{(\ell)}\}_{j=0}^{m_\ell}$ and $\{\phi_{sij}^{(\ell)}\}_{j=1}^{m_\ell}$ that are non-negative and together add to one:
\begin{equation}
\sum_{j=0}^{m_\ell} \varepsilon_{sij}^{(\ell)} + \sum_{j=1}^{m_\ell} \phi_{sij}^{(\ell)} = 1 \quad \forall\, s, i \qquad \varepsilon_{sij}^{(\ell)} \geq 0 \quad \phi_{sij}^{(\ell)} \geq 0 \quad \forall\, s, i, j \tag{3.10}
\end{equation}
We can interpret $\varepsilon_{sij}^{(\ell)}$ as the fraction of the residual backpropagated through $y_{si}^{(\ell+1)}$ that will be used to update $w_{ij}^{(\ell)}$, and $\phi_{sij}^{(\ell)}$ as the fraction that will continue to backpropagate to higher layers through $y_{sj}^{(\ell)}$. We will construct $\tilde{\mathcal{L}}_\ell$ from $\tilde{\mathcal{L}}_{\ell+1}$ by bounding the $\big(\Delta y_{si}^{(\ell+1)}\big)^2$ term in $Y_{si}^{(\ell+1)}$. To do this, multiply by one and apply Jensen's inequality:
\begin{align}
\big(\Delta y_{si}^{(\ell+1)}\big)^2 &= \left( \sum_{j=0}^{m_\ell} \varepsilon_{sij}^{(\ell)}\, \frac{\Delta w_{ij}^{(\ell)} y_{sj}^{(\ell)}}{\varepsilon_{sij}^{(\ell)}} + \sum_{j=1}^{m_\ell} \phi_{sij}^{(\ell)}\, \frac{w_{ij}^{(\ell)} \Delta y_{sj}^{(\ell)}}{\phi_{sij}^{(\ell)}} \right)^2 \tag{3.11}\\
&\leq \sum_{j=0}^{m_\ell} \frac{\big(\Delta w_{ij}^{(\ell)} y_{sj}^{(\ell)}\big)^2}{\varepsilon_{sij}^{(\ell)}} + \sum_{j=1}^{m_\ell} \frac{\big(w_{ij}^{(\ell)} \Delta y_{sj}^{(\ell)}\big)^2}{\phi_{sij}^{(\ell)}} \tag{3.12}
\end{align}
Applying this to each $Y_{si}^{(\ell+1)}$ term of $\tilde{\mathcal{L}}_{\ell+1}$ yields the following bound, where $Y_{sj}^{(\ell)}$ and $W_{ij}^{(\ell)}$ are univariate convex quadratic functions of $\Delta y_{sj}^{(\ell)}$ and $\Delta w_{ij}^{(\ell)}$ respectively, as in (3.3), with the following coefficients:
\begin{gather}
\sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} Y_{si}^{(\ell+1)} \leq \sum_{i=1}^{m_{\ell+1}} \sum_{j=0}^{m_\ell} W_{ij}^{(\ell)} + \sum_{s=1}^{S} \sum_{j=1}^{m_\ell} Y_{sj}^{(\ell)} \tag{3.13}\\
\dot{W}_{ij}^{(\ell)} \equiv \sum_{s=1}^{S} \dot{Y}_{si}^{(\ell+1)} y_{sj}^{(\ell)} \qquad \dot{Y}_{sj}^{(\ell)} \equiv \sum_{i=1}^{m_{\ell+1}} \dot{Y}_{si}^{(\ell+1)} w_{ij}^{(\ell)} \nonumber\\
\ddot{W}_{ij}^{(\ell)} \equiv \sum_{s=1}^{S} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\big(y_{sj}^{(\ell)}\big)^2}{\varepsilon_{sij}^{(\ell)}} \qquad \ddot{Y}_{sj}^{(\ell)} \equiv \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\big(w_{ij}^{(\ell)}\big)^2}{\phi_{sij}^{(\ell)}} \tag{3.14}
\end{gather}
Note that $\ddot{Y}_{si}^{(\ell+1)} \geq 0$ guarantees $\ddot{W}_{ij}^{(\ell)} \geq 0$ and $\ddot{Y}_{sj}^{(\ell)} \geq 0$, so the bound remains in the correct direction. Finally, plugging this bound (3.13) into $\tilde{\mathcal{L}}_{\ell+1}$ and collecting like terms yields $\tilde{\mathcal{L}}_\ell$.

Since residual partitioning only bounds the second-order terms of the loss function, $\dot{W}_{ij}^{(\ell)}$ is simply the gradient of the loss with respect to $\Delta w_{ij}^{(\ell)}$ normally calculated by backpropagation. Looking at (3.6), we say residual partitioning updates $w_{ij}^{(\ell)}$ with learning rate $1/\ddot{W}_{ij}^{(\ell)}$, and from (3.14) we see that this learning rate is an increasing function of $\varepsilon_{sij}^{(\ell)}$. We interpret this to mean that if we assign a large fraction of the residual to $w_{ij}^{(\ell)}$, then we will update $w_{ij}^{(\ell)}$ using a large learning rate.

If $\ell = 1$, then $\Delta y_{sj}^{(\ell)} = 0$ and we have reached the end of the backpropagation process, so we can bound using only $\varepsilon_{sij}^{(\ell)}$. For a detailed derivation of this step, see section 3.9.2 in the appendix.
3.4.3 Activation layers
Activation layers include typical activations like Tanh and ReLU, but also dropout and max pooling layers. In activation layers, each component $i$ of the output is affected by only one component $j \equiv j(i)$ of the input up to second order, i.e., the Hessian of the layer output with respect to the layer input is diagonal. Typically the layer output and input have the same size and $j(i) = i$; the exception is max pooling layers, where this more general notation becomes necessary.

For an activation layer, the output and change in output are
\begin{equation}
y_{si}^{(\ell+1)} \equiv \sigma_\ell\big(y_{sj}^{(\ell)}\big) \qquad \Delta y_{si}^{(\ell+1)} \equiv \sigma_\ell\big(\hat{y}_{sj}^{(\ell)}\big) - \sigma_\ell\big(y_{sj}^{(\ell)}\big) \tag{3.15}
\end{equation}
Plugging this into (3.2) and making a second-order approximation yields the following, where $Y_{sj}^{(\ell)}$ is a univariate quadratic function of $\Delta y_{sj}^{(\ell)}$ as in (3.3):
\begin{gather}
Y_{si}^{(\ell+1)} \approx Y_{sj}^{(\ell)} \tag{3.16}\\
\dot{Y}_{sj}^{(\ell)} \equiv \dot{Y}_{si}^{(\ell+1)} \dot{\sigma}_\ell\big(y_{sj}^{(\ell)}\big) \qquad \ddot{Y}_{sj}^{(\ell)} \equiv \ddot{Y}_{si}^{(\ell+1)} \dot{\sigma}_\ell\big(y_{sj}^{(\ell)}\big)^2 + \dot{Y}_{si}^{(\ell+1)} \ddot{\sigma}_\ell\big(y_{sj}^{(\ell)}\big) \tag{3.17}
\end{gather}
Note that $\ddot{Y}_{sj}^{(\ell)}$ is not guaranteed to be non-negative. This is because the nonlinear activation function $\sigma_\ell$ introduces negative curvature to the objective function. However, this is easy to remedy in our setting compared to other methods, since residual partitioning breaks the high-dimensional loss function into many one-dimensional parts: simply replacing $\ddot{Y}_{sj}^{(\ell)}$ with a non-negative number bounds the quadratic function $Y_{sj}^{(\ell)}$ in the correct direction. There are many ways to do this, e.g., with a user-specified hyperparameter $\mu > 0$ to clamp $\ddot{Y}_{sj}^{(\ell)} \leftarrow \max(\ddot{Y}_{sj}^{(\ell)}, \mu)$ or scale $\ddot{Y}_{sj}^{(\ell)} \leftarrow \mu\,|\ddot{Y}_{sj}^{(\ell)}|$. We prefer to bound using $\ddot{Y}_{sj}^{(\ell)} \leftarrow |\ddot{Y}_{sj}^{(\ell)}|$, since this does not require an additional hyperparameter. After bounding $\ddot{Y}_{sj}^{(\ell)}$ and plugging into $\tilde{\mathcal{L}}_{\ell+1}$, we are done constructing $\tilde{\mathcal{L}}_\ell$.
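As a small illustration, this curvature correction for an activation layer takes only a few lines of NumPy; the coefficient arrays are placeholders for the per-sample quantities in (3.17), and Tanh is just one concrete choice of $\sigma_\ell$.

import numpy as np

# Placeholder incoming coefficients from layer l+1 and pre-activation values (one mini-batch).
Y_dot_next = np.array([0.4, -0.7, 0.2])
Y_ddot_next = np.array([1.5, 0.8, 2.0])
y = np.array([0.1, -1.3, 0.6])               # y^{(l)}_{sj}, inputs to a Tanh activation

sigma_dot = 1.0 - np.tanh(y)**2              # Tanh first derivative
sigma_ddot = -2.0 * np.tanh(y) * sigma_dot   # Tanh second derivative

Y_dot = Y_dot_next * sigma_dot                                   # eq. (3.17), first-order
Y_ddot = Y_ddot_next * sigma_dot**2 + Y_dot_next * sigma_ddot    # eq. (3.17), second-order
Y_ddot = np.abs(Y_ddot)   # correct any negative curvature introduced by the nonlinearity
print(Y_dot, Y_ddot)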
3.4.4 Other layer types
In practice we would also like to train networks with other types of layers that are not affine and do
not have diagonal Hessians, e.g., batch normalization layers. Residual partitioning cannot bound
through these layers, but can bound through an affine approximation. This is the same as making a
Gauss-Newton approximation to the layer’s Hessian. See section 3.9.4 in the appendix for details
on how to do this for batch normalization layers.
3.5 Choosing partitioning variables
Now that we have shown how to construct the residual partitioning bound $\tilde{\mathcal{L}}_1$, we need to choose values for the partitioning variables $\varepsilon_{sij}^{(\ell)}$ and $\phi_{sij}^{(\ell)}$ that satisfy the constraint (3.10). There are many ways to choose the partitioning variables: we could choose them so the residual is partitioned amongst all parameters equally, or amongst only a small subset of parameters, among other options. Here, we choose the partitioning variables so that the residual partitioning bound $\tilde{\mathcal{L}}_1$ is minimized when all parameter updates are equal size, i.e., $\big(\Delta w_{ij}^{(\ell)}\big)^2$ is constant.

Theorem 4. If $\big(\Delta w_{ij}^{(\ell)}\big)^2$ is constant, the choice of partitioning variables that minimizes $\tilde{\mathcal{L}}_1$ is $\varepsilon_{sij}^{(\ell)} = \varepsilon_{sij}^{*(\ell)}$ and $\phi_{sij}^{(\ell)} = \phi_{sij}^{*(\ell)}$:
\begin{equation}
\begin{aligned}
&\varepsilon_{sij}^{*(\ell)} = \frac{|y_{sj}^{(\ell)}|}{u_{si}^{(\ell+1)}} && \text{for } \ell = 1\\
&\varepsilon_{sij}^{*(\ell)} = \frac{|y_{sj}^{(\ell)}|}{u_{si}^{(\ell+1)}}, \quad \phi_{sij}^{*(\ell)} = \frac{|w_{ij}^{(\ell)}|\, u_{sj}^{(\ell)}}{u_{si}^{(\ell+1)}} && \text{for } 1 < \ell < L,\ \ell \in P\\
&\phi_{si}^{*(L)} \propto |u_{si}^{(L)}| \sqrt{\frac{\partial^2 f(y_s^{(L)} \mid x_s)}{\partial \big(y_{si}^{(L)}\big)^2}} && \text{for } \ell = L
\end{aligned}
\tag{3.18}
\end{equation}
where $u_{si}^{(\ell)}$ is calculated by a forward pass through the network:
\begin{equation}
u_{si}^{(1)} \equiv 0 \qquad
u_{si}^{(\ell+1)} \equiv
\begin{cases}
\big|\dot{\sigma}_\ell\big(y_{sj}^{(\ell)}\big)\big|\, u_{sj}^{(\ell)} & \text{for activation layers } \ell\\[4pt]
\sum_{j=0}^{m_\ell} \big|y_{sj}^{(\ell)}\big| + \sum_{j=1}^{m_\ell} \big|w_{ij}^{(\ell)}\big|\, u_{sj}^{(\ell)} & \text{for affine layers } \ell
\end{cases}
\tag{3.19}
\end{equation}
A proof is provided in section 3.9.3 in the appendix. We call $u_{si}^{(\ell)}$ the total gradient coming into $y_{si}^{(\ell)}$. This is because $u_{si}^{(\ell)}$ sums the absolute value of the gradient of $y_{si}^{(\ell)}$ with respect to a parameter $w$ along a path $p$, over all parameters $w$ and all paths $p$ from $w$ to $y_{si}^{(\ell)}$. In effect, $u_{si}^{(\ell)}$ quantifies how much $y_{si}^{(\ell)}$ can change after updating all parameters.
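The quantity $u$ is cheap to obtain: it is computed alongside the ordinary forward pass. Below is a minimal NumPy sketch of (3.19) for a two-layer Tanh network; the layer sizes and weights are illustrative placeholders rather than any trained model, and biases are omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)
S, m1, m2, m3 = 8, 5, 4, 3
y1 = rng.normal(size=(S, m1))                 # first-layer activations (network inputs)
W1 = rng.normal(size=(m2, m1)) * 0.5          # affine layer weights
W2 = rng.normal(size=(m3, m2)) * 0.5

u1 = np.zeros((S, m1))                        # u^{(1)} = 0: the inputs never change

# Affine layer: u^{(l+1)}_si = sum_j |y^{(l)}_sj| + sum_j |w_ij| u^{(l)}_sj   (eq. 3.19)
z2 = y1 @ W1.T
u2 = np.abs(y1).sum(axis=1, keepdims=True) + u1 @ np.abs(W1).T

# Activation layer: u^{(l+1)}_si = |sigma'(y^{(l)}_si)| u^{(l)}_si
a2 = np.tanh(z2)
u2a = np.abs(1.0 - np.tanh(z2)**2) * u2

# Second affine layer
z3 = a2 @ W2.T
u3 = np.abs(a2).sum(axis=1, keepdims=True) + u2a @ np.abs(W2).T
print(u3.shape)   # (S, m3): one "total gradient" per sample and output neuron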
3.6 Practical considerations
Plugging our choice of partitioning variables (3.18) into (3.14) yields the following updates for $\ddot{W}_{ij}^{(\ell)}$ and $\ddot{Y}_{sj}^{(\ell)}$:
\begin{equation}
\ddot{W}_{ij}^{(\ell)} = \sum_{s=1}^{S} \ddot{Y}_{si}^{(\ell+1)} u_{si}^{(\ell+1)} \big|y_{sj}^{(\ell)}\big|
\qquad
\ddot{Y}_{sj}^{(\ell)} = \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)}\, \frac{u_{si}^{(\ell+1)}}{u_{sj}^{(\ell)}}\, \big|w_{ij}^{(\ell)}\big| \tag{3.20}
\end{equation}
When we do this, we never have to compute the partitioning variables $\{\varepsilon_{sij}^{(\ell)}\}$ and $\{\phi_{sij}^{(\ell)}\}$ explicitly, which would incur an unacceptably large memory cost. Instead, we only need to compute $\{u_{si}^{(\ell)}\}$, which is much smaller, the same size as $\{y_{si}^{(\ell)}\}$.

Expanding $\ddot{Y}_{si}^{(\ell+1)}$ in the formula for $\ddot{Y}_{sj}^{(\ell)}$ reveals a telescoping product involving $u_{si}^{(\ell)}$. To avoid any computational expense or numerical problems arising from dividing by $u_{sj}^{(\ell)}$, we can compute and store $\ddot{Y}_{sj}^{(\ell)} u_{sj}^{(\ell)}$ instead of $\ddot{Y}_{sj}^{(\ell)}$ as we backpropagate. See Algorithm 2 in the appendix for details.
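To show what that backward pass looks like in practice, here is a minimal NumPy sketch of the affine-layer case of (3.20), stored in the stabilized form $\ddot{Y}u$ suggested above; the inputs are placeholder arrays with illustrative shapes, not outputs of any particular network.

import numpy as np

rng = np.random.default_rng(1)
S, m_in, m_out = 8, 4, 3
y = np.abs(rng.normal(size=(S, m_in)))        # activations y^{(l)} feeding the affine layer
W = rng.normal(size=(m_out, m_in))
Y_dot_next = rng.normal(size=(S, m_out))      # first-order coefficients from layer l+1
Ydd_u_next = np.abs(rng.normal(size=(S, m_out)))   # stores Y_ddot^{(l+1)} * u^{(l+1)}

# Per-weight coefficients of the bound (eq. 3.14 with the optimal partition, i.e. eq. 3.20)
W_dot = Y_dot_next.T @ y                      # ordinary backprop gradient
W_ddot = Ydd_u_next.T @ np.abs(y)             # curvature: sum_s (Y_ddot*u)^{(l+1)} * |y^{(l)}|

# Coefficients passed further up the network, kept as Y_ddot^{(l)} * u^{(l)}
Y_dot = Y_dot_next @ W
Ydd_u = Ydd_u_next @ np.abs(W)

# Diagonal Newton step (eq. 3.6) with a global learning rate lambda
lam = 1.0
delta_W = -lam * W_dot / np.maximum(W_ddot, 1e-12)
print(delta_W.shape)   # (m_out, m_in)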
Since we are minimizing an upper bound on the objective, the learning rates that residual partitioning yields will be conservative. To correct this, we scale all local learning rates $\{1/\ddot{W}_{ij}^{(\ell)}\}$ by a global learning rate $\lambda \geq 1$. Mathematically, we achieve this by lower bounding our upper bound:
\begin{equation}
W_{ij}^{(\ell)} \geq \dot{W}_{ij}^{(\ell)} \Delta w_{ij}^{(\ell)} + \frac{1}{2\lambda}\, \ddot{W}_{ij}^{(\ell)} \big(\Delta w_{ij}^{(\ell)}\big)^2
\end{equation}
If our original upper bound was loose on the objective, we might still have a valid upper bound on the objective after introducing $\lambda$, in which case we now have a tighter bound.
3.7 Experiments
In all of our experiments, we estimate $\dot{W}_{ij}^{(\ell)}$, $\ddot{W}_{ij}^{(\ell)}$, $\dot{Y}_{si}^{(\ell)}$, and $\ddot{Y}_{si}^{(\ell)}$ from a random mini-batch of 100 samples. We preprocessed all three datasets by scaling and shifting the values of both train and test data to lie in the interval $[-1,1]$. We initialized the weights using PyTorch's default parameter initialization, which is $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = 1/\mathrm{in\_features}$. We use $L^2$ weight regularization with strength $10^{-2}$ for all parameters. We repeated each experiment 10 times, using random seeds 1 through 10; the results we report are the average over those 10 trials. For each experiment we plot the test and train loss every 100 epochs.

Figure 3.1: $L^2$ loss achieved by training an autoencoder with each optimization algorithm at the specified learning rate, on (a) MNIST (rp 8, adam 4e-4, sgd 2e-4), (b) Fashion-MNIST (rp 8, adam 8e-5, sgd 2e-4), and (c) CIFAR-10 (rp 15, adam 4e-5, sgd 2e-4). Train loss is marked with circles, validation loss with exes. Note residual partitioning achieves the lowest validation loss in all three experiments.
Note that one iteration of residual partitioning has the same computational complexity as SGD and Adam. Adam and SGD only require that we compute $\{y_{si}^{(\ell)}\}$, $\{\dot{Y}_{si}^{(\ell)}\}$, and $\{\dot{W}_{ij}^{(\ell)}\}$, while residual partitioning requires that we additionally compute $\{u_{si}^{(\ell)}\}$, $\{\ddot{Y}_{si}^{(\ell)}\}$, and $\{\ddot{W}_{ij}^{(\ell)}\}$. These are about equally expensive, so we say that residual partitioning is about twice as expensive per iteration as SGD and Adam. In practice, we observed that when running on a CPU with 16 cores, one training step of residual partitioning took about 36% more time than one training step of Adam, and 66% more time than one training step of SGD (we used PyTorch's implementations of Adam and SGD).
In our first experiment, we train an autoencoder with residual partitioning, Adam, and SGD.
The autoencoder has a D−64−2−64−D architecture, where D is the size of the input. All layers
are fully connected and use Tanh activations. The results are in Figure 3.1. We observe that although residual partitioning does not always achieve the best training loss, it generally overfits less compared to SGD and Adam and achieves the best validation loss on all three datasets. We speculate that residual partitioning overfits less because it computes higher-quality curvature information from the local second-order approximation than Adam does from an exponential moving
average of past gradients.
In our second experiment, we train a classifier network with residual partitioning, Adam, and
SGD. The classifier has a D−400−100−36−10 architecture. All layers are fully connected and
use softplus activations. The results are in Figure 3.2. Residual partitioning achieves noticeably
worse cross-entropy error on all three datasets, and overfits much worse on the CIFAR-10 dataset.
However, residual partitioning achieves similar test accuracy on all three datasets.
We speculate that the difference in performance of residual partitioning on the autoencoder
task versus the classification task is due to either the difference in loss functions f or the difference
in network architecture. In the classification task, the loss function f is the cross-entropy error
and must be bounded as in (3.7); in contrast, in the autoencoder task, the loss function is the $L^2$ reconstruction error, and does not need to be bounded since its Hessian with respect to the network
outputs is already diagonal. This might be remedied by finding a tighter diagonalizing bound on
the cross-entropy error than (3.7).
The difference in network architecture may also play a role: if residual partitioning splits the
residual amongst many parameters, Jensen's inequality may yield a bound too loose to be effectively tightened by a global learning rate $\lambda \geq 1$. This might be remedied by introducing layerwise learning rates, rather than a single global learning rate.
3.8 Conclusion
Adaptive gradient methods have found widespread use in the research community while interest
in second-order methods has languished. In this chapter, we developed a second-order method that is computationally efficient, memory efficient, remedies the problem of saddle points efficiently, and
provides strong theoretical guarantees. Deep residual partitioning accomplishes this goal by using
Jensen’s inequality to construct an upper bound on the objective function with a diagonal Hessian.
Figure 3.2: Cross-entropy error (top row) and accuracy (bottom row) achieved by training a classifier network with each optimization algorithm at the specified learning rate, on (a, d) MNIST (rp 2, adam 1e-4, sgd 1e-3), (b, e) Fashion-MNIST (rp 3, adam 1e-4, sgd 1e-3), and (c, f) CIFAR-10 (rp 5e2, adam 1e-4, sgd 3e-3). Train statistics are marked with circles, validation statistics with exes.

We demonstrate a clear advantage on autoencoder experiments, achieving better-quality solutions and a reduction in overfitting. While results on classification tasks are mixed, we believe that deep residual partitioning presents a promising path to match the speed and efficiency of adaptive gradient methods but with the theoretical advantages of second-order methods.
3.9 Appendix A
3.9.1 Bounding ℓ = L
In the main body of the chapter, we briefly show how to construct $\tilde{\mathcal{L}}_L$ using $\{\phi_{si}^{(L)}\}$ when the loss function is convex and its Hessian with respect to the network outputs is not diagonal. Here is the same work, but without skipping steps. Here, $\odot$ is componentwise multiplication and $e_i$ is the $i$-th standard basis vector.
\begin{align}
\mathcal{L} &= \sum_{s=1}^{S} f\big(\hat{y}_s^{(L)} \mid x_s\big) = \sum_{s=1}^{S} f\big(y_s^{(L)} + \Delta y_s^{(L)} \mid x_s\big) \tag{3.21}\\
&= \sum_{s=1}^{S} f\!\left( y_s^{(L)} + \sum_{i=1}^{m_L} \Delta y_s^{(L)} \odot e_i \,\Big|\, x_s \right) \tag{3.22}\\
&= \sum_{s=1}^{S} f\!\left( \sum_{i=1}^{m_L} \phi_{si}^{(L)} \left( y_s^{(L)} + \frac{1}{\phi_{si}^{(L)}}\, \Delta y_s^{(L)} \odot e_i \right) \Big|\, x_s \right) \tag{3.23}\\
&\leq \sum_{s=1}^{S} \sum_{i=1}^{m_L} \phi_{si}^{(L)} f\!\left( y_s^{(L)} + \frac{1}{\phi_{si}^{(L)}}\, \Delta y_s^{(L)} \odot e_i \,\Big|\, x_s \right) \tag{3.24}
\end{align}
3.9.2 Bounding ℓ = 1
We can bound through an affine layer at the top of the network using only $\{\varepsilon_{sij}^{(\ell)}\}$, i.e., without $\{\phi_{sij}^{(\ell)}\}$, since $\Delta y_{sj}^{(\ell)} = 0$. First, bound $\big(\Delta y_{si}^{(\ell+1)}\big)^2$:
\begin{equation}
\big(\Delta y_{si}^{(\ell+1)}\big)^2 = \left( \sum_{j=0}^{m_\ell} \varepsilon_{sij}^{(\ell)}\, \frac{\Delta w_{ij}^{(\ell)} y_{sj}^{(\ell)}}{\varepsilon_{sij}^{(\ell)}} \right)^2 \leq \sum_{j=0}^{m_\ell} \frac{\big(\Delta w_{ij}^{(\ell)} y_{sj}^{(\ell)}\big)^2}{\varepsilon_{sij}^{(\ell)}} \tag{3.25}
\end{equation}
Applying this to each $Y_{si}^{(\ell+1)}$ term of $\tilde{\mathcal{L}}_{\ell+1}$ yields
\begin{gather}
\sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} Y_{si}^{(\ell+1)} \leq \sum_{i=1}^{m_{\ell+1}} \sum_{j=0}^{m_\ell} W_{ij}^{(\ell)} \tag{3.26}\\
\dot{W}_{ij}^{(\ell)} \equiv \sum_{s=1}^{S} \dot{Y}_{si}^{(\ell+1)} y_{sj}^{(\ell)} \qquad \ddot{W}_{ij}^{(\ell)} \equiv \sum_{s=1}^{S} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\big(y_{sj}^{(\ell)}\big)^2}{\varepsilon_{sij}^{(\ell)}} \tag{3.27}
\end{gather}
3.9.3 Proof of Theorem 4
Before we prove Theorem 4, let us prove a much simpler lemma that we will use extensively:

Lemma 1. Let $\{x_i\}$ be a set of $n$ real numbers, and let $\{p_i\}$ be a set of $n$ non-negative real numbers that sum to one. Then the solution to the following constrained optimization problem is $p_i = p_i^*$:
\begin{equation}
\min_{\{p_i\}} \sum_{i=1}^{n} \frac{x_i^2}{p_i} \qquad p_i^* \equiv \frac{|x_i|}{\sum_{i'=1}^{n} |x_{i'}|} \tag{3.28}
\end{equation}
Proof. First, let us reparameterize $p_i$ by a set of unconstrained parameters $\eta_i$. Let $\{\eta_i\}$ be a set of $n$ real numbers, and let
\begin{equation}
p_i \equiv \frac{\exp\{\eta_i\}}{\sum_{i'=1}^{n} \exp\{\eta_{i'}\}} \tag{3.29}
\end{equation}
We can solve the optimization problem by setting the gradient with respect to $\{\eta_i\}$ to zero. We will need the following partial derivatives:
\begin{equation}
\frac{\partial p_i}{\partial \eta_i} = p_i(1 - p_i) \qquad \frac{\partial p_{i'}}{\partial \eta_i} = -p_i p_{i'} \ \text{ if } i' \neq i \tag{3.30}
\end{equation}
Now the gradient of the objective is
\begin{equation}
\frac{\partial}{\partial \eta_i} \sum_{i'=1}^{n} \frac{x_{i'}^2}{p_{i'}} = \sum_{i'=1}^{n} -\frac{x_{i'}^2}{p_{i'}^2}\, \frac{\partial p_{i'}}{\partial \eta_i} = p_i \left( \sum_{i'=1}^{n} p_{i'}\, \frac{x_{i'}^2}{p_{i'}^2} - \frac{x_i^2}{p_i^2} \right) \tag{3.31}
\end{equation}
The gradient is zero in particular when $x_i^2/p_i^2$ is constant across $i$: if $x_i^2/p_i^2 = c$, then
\begin{equation}
p_i \left( \sum_{i'=1}^{n} p_{i'}\, \frac{x_{i'}^2}{p_{i'}^2} - \frac{x_i^2}{p_i^2} \right) = p_i \left( \sum_{i'=1}^{n} p_{i'} c - c \right) = p_i (c - c) = 0 \tag{3.32}
\end{equation}
Note $x_i^2/p_i^2$ is constant across $i$ in particular when $\eta_i = \log|x_i|$, so $p_i = |x_i| / \sum_{i'=1}^{n} |x_{i'}|$ and
\begin{equation}
\frac{x_i^2}{p_i^2} = \frac{x_i^2}{\big( |x_i| / \sum_{i'=1}^{n} |x_{i'}| \big)^2} = \left( \sum_{i'=1}^{n} |x_{i'}| \right)^2 = c. \tag{3.33}
\end{equation}
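As a quick numerical check of Lemma 1 (a minimal NumPy sketch with arbitrary illustrative numbers), the closed-form $p^*_i = |x_i| / \sum_{i'} |x_{i'}|$ should achieve a value of $\sum_i x_i^2/p_i$ no larger than that of any other point in the simplex, and that value should equal $\big(\sum_i |x_i|\big)^2$.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5, 3.0])

def objective(p):
    return np.sum(x**2 / p)

p_star = np.abs(x) / np.abs(x).sum()      # closed-form minimizer from Lemma 1
print(objective(p_star))                   # equals (sum_i |x_i|)^2
print(np.abs(x).sum()**2)

# Compare against random points in the simplex: none should do better.
for _ in range(5):
    p = rng.dirichlet(np.ones(len(x)))
    assert objective(p) >= objective(p_star) - 1e-9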
Now let us use the lemma to prove the theorem:
Theorem 4. If $\big(\Delta w_{ij}^{(\ell)}\big)^2$ is constant, the choice of partitioning variables that minimizes $\tilde{\mathcal{L}}_1$ is $\varepsilon_{sij}^{(\ell)} = \varepsilon_{sij}^{*(\ell)}$ and $\phi_{sij}^{(\ell)} = \phi_{sij}^{*(\ell)}$:
\begin{equation}
\begin{aligned}
&\varepsilon_{sij}^{*(\ell)} = \frac{|y_{sj}^{(\ell)}|}{u_{si}^{(\ell+1)}} && \text{for } \ell = 1\\
&\varepsilon_{sij}^{*(\ell)} = \frac{|y_{sj}^{(\ell)}|}{u_{si}^{(\ell+1)}}, \quad \phi_{sij}^{*(\ell)} = \frac{|w_{ij}^{(\ell)}|\, u_{sj}^{(\ell)}}{u_{si}^{(\ell+1)}} && \text{for } 1 < \ell < L,\ \ell \in P\\
&\phi_{si}^{*(L)} \propto |u_{si}^{(L)}| \sqrt{\frac{\partial^2 f(y_s^{(L)} \mid x_s)}{\partial \big(y_{si}^{(L)}\big)^2}} && \text{for } \ell = L
\end{aligned}
\tag{3.18}
\end{equation}
where $u_{si}^{(\ell)}$ is calculated by a forward pass through the network:
\begin{equation}
u_{si}^{(1)} \equiv 0 \qquad
u_{si}^{(\ell+1)} \equiv
\begin{cases}
\big|\dot{\sigma}_\ell\big(y_{sj}^{(\ell)}\big)\big|\, u_{sj}^{(\ell)} & \text{for activation layers } \ell\\[4pt]
\sum_{j=0}^{m_\ell} \big|y_{sj}^{(\ell)}\big| + \sum_{j=1}^{m_\ell} \big|w_{ij}^{(\ell)}\big|\, u_{sj}^{(\ell)} & \text{for affine layers } \ell
\end{cases}
\tag{3.19}
\end{equation}
Proof. We will optimize over the partitioning variables one layer at a time, working forwards. More formally, we will prove the theorem by induction. To do this, we need to strengthen the inductive hypothesis by also proving the following holds for all $\ell \in \{0\} \cup [L-1]$, where $\text{const}$ is a constant with respect to all partitioning variables:
\begin{equation}
\sum_{\ell' \in P,\ \ell' \leq \ell}\ \sum_{i=1}^{m_{\ell'+1}} \sum_{j=0}^{m_{\ell'}} \ddot{W}_{ij}^{(\ell')} = \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \big(u_{si}^{(\ell+1)}\big)^2 + \text{const} \tag{3.34}
\end{equation}
The base case $\ell = 0$ is simple: the left hand side of (3.34) is zero since there are no terms to sum over, and the right hand side is zero since $u_{si}^{(1)} \equiv 0$. Now assume the inductive hypothesis (3.34) is true for $\ell - 1$, and let us prove it holds for $\ell$.

Suppose layer $\ell$ is an affine layer. Then the left hand side of (3.34) has one additional term for $\ell$ compared to $\ell - 1$, which we can pull out before applying the inductive hypothesis (3.34):
\begin{align}
\sum_{\ell' \in P,\ \ell' \leq \ell}\ \sum_{i=1}^{m_{\ell'+1}} \sum_{j=0}^{m_{\ell'}} \ddot{W}_{ij}^{(\ell')} &= \sum_{\ell' \in P,\ \ell' \leq \ell-1}\ \sum_{i=1}^{m_{\ell'+1}} \sum_{j=0}^{m_{\ell'}} \ddot{W}_{ij}^{(\ell')} + \sum_{i=1}^{m_{\ell+1}} \sum_{j=0}^{m_\ell} \ddot{W}_{ij}^{(\ell)} \tag{3.35}\\
&= \sum_{s=1}^{S} \sum_{j=1}^{m_\ell} \ddot{Y}_{sj}^{(\ell)} \big(u_{sj}^{(\ell)}\big)^2 + \sum_{i=1}^{m_{\ell+1}} \sum_{j=0}^{m_\ell} \ddot{W}_{ij}^{(\ell)} + \text{const} \tag{3.36}
\end{align}
If we are at the top of the network, i.e., $\ell = 1$, then $u_{sj}^{(\ell)} \equiv 0$ and plugging in the formula for $\ddot{W}_{ij}^{(\ell)}$ (3.27) yields
\begin{align}
& \sum_{s=1}^{S} \sum_{j=1}^{m_\ell} \ddot{Y}_{sj}^{(\ell)} \big(u_{sj}^{(\ell)}\big)^2 + \sum_{i=1}^{m_{\ell+1}} \sum_{j=0}^{m_\ell} \ddot{W}_{ij}^{(\ell)} \tag{3.37}\\
&= \sum_{i=1}^{m_{\ell+1}} \sum_{j=0}^{m_\ell} \ddot{W}_{ij}^{(\ell)} \tag{3.38}\\
&= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \sum_{j=0}^{m_\ell} \frac{\big(y_{sj}^{(\ell)}\big)^2}{\varepsilon_{sij}^{(\ell)}} \tag{3.39}
\end{align}
Since for each $s \in [S]$ and $i \in [m_{\ell+1}]$ the $\{\varepsilon_{sij}^{(\ell)}\}$ are non-negative and sum to one, they fulfill the role of $\{p_i\}$ in our lemma. Applying the lemma, the optimal choice of $\{\varepsilon_{sij}^{(\ell)}\}$ is
\begin{equation}
\varepsilon_{sij}^{*(\ell)} \equiv \frac{|y_{sj}^{(\ell)}|}{\sum_{j'=0}^{m_\ell} |y_{sj'}^{(\ell)}|} \tag{3.40}
\end{equation}
Using the formula for $u_{si}^{(\ell+1)}$ (3.19) and $u_{sj}^{(\ell)} \equiv 0$ (since $\ell = 1$), we can write this as
\begin{equation}
\varepsilon_{sij}^{*(\ell)} \equiv \frac{|y_{sj}^{(\ell)}|}{u_{si}^{(\ell+1)}} \tag{3.41}
\end{equation}
Plugging this back into (3.39) yields
\begin{equation}
\sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \sum_{j=0}^{m_\ell} \frac{\big(y_{sj}^{(\ell)}\big)^2}{\varepsilon_{sij}^{(\ell)}} = \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \big(u_{si}^{(\ell+1)}\big)^2 \tag{3.42}
\end{equation}
and the inductive hypothesis holds. If layer $\ell$ is affine but we are not at the top of the network, i.e., $\ell \neq 1$, plugging the formulae for $\ddot{Y}_{sj}^{(\ell)}$ and $\ddot{W}_{ij}^{(\ell)}$ (3.14) into (3.36) yields
\begin{align}
& \sum_{s=1}^{S} \sum_{j=1}^{m_\ell} \ddot{Y}_{sj}^{(\ell)} \big(u_{sj}^{(\ell)}\big)^2 + \sum_{i=1}^{m_{\ell+1}} \sum_{j=0}^{m_\ell} \ddot{W}_{ij}^{(\ell)} \tag{3.43}\\
&= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \sum_{j=1}^{m_\ell} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\big(w_{ij}^{(\ell)}\big)^2}{\phi_{sij}^{(\ell)}}\, \big(u_{sj}^{(\ell)}\big)^2 + \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \sum_{j=0}^{m_\ell} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\big(y_{sj}^{(\ell)}\big)^2}{\varepsilon_{sij}^{(\ell)}} \tag{3.44}\\
&= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \left( \sum_{j=1}^{m_\ell} \frac{\big(w_{ij}^{(\ell)} u_{sj}^{(\ell)}\big)^2}{\phi_{sij}^{(\ell)}} + \sum_{j=0}^{m_\ell} \frac{\big(y_{sj}^{(\ell)}\big)^2}{\varepsilon_{sij}^{(\ell)}} \right) \tag{3.45}
\end{align}
Since for each $s \in [S]$ and $i \in [m_{\ell+1}]$ the $\{\varepsilon_{sij}^{(\ell)}\}$ and $\{\phi_{sij}^{(\ell)}\}$ are non-negative and together sum to one, they fulfill the role of $\{p_i\}$ in our lemma. Applying the lemma, the optimal choice of $\{\varepsilon_{sij}^{(\ell)}\}$ and $\{\phi_{sij}^{(\ell)}\}$ is
\begin{equation}
\varepsilon_{sij}^{*(\ell)} \equiv \frac{|y_{sj}^{(\ell)}|}{\sum_{j'=0}^{m_\ell} |y_{sj'}^{(\ell)}| + \sum_{j'=1}^{m_\ell} |w_{ij'}^{(\ell)}|\, u_{sj'}^{(\ell)}}
\qquad
\phi_{sij}^{*(\ell)} \equiv \frac{|w_{ij}^{(\ell)}|\, u_{sj}^{(\ell)}}{\sum_{j'=0}^{m_\ell} |y_{sj'}^{(\ell)}| + \sum_{j'=1}^{m_\ell} |w_{ij'}^{(\ell)}|\, u_{sj'}^{(\ell)}} \tag{3.46}
\end{equation}
Using the formula for $u_{si}^{(\ell+1)}$ (3.19) we can write this as
\begin{equation}
\varepsilon_{sij}^{*(\ell)} \equiv \frac{|y_{sj}^{(\ell)}|}{u_{si}^{(\ell+1)}}
\qquad
\phi_{sij}^{*(\ell)} \equiv \frac{|w_{ij}^{(\ell)}|\, u_{sj}^{(\ell)}}{u_{si}^{(\ell+1)}} \tag{3.47}
\end{equation}
Plugging this back into (3.45) yields
\begin{equation}
\sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \left( \sum_{j=1}^{m_\ell} \frac{\big(w_{ij}^{(\ell)} u_{sj}^{(\ell)}\big)^2}{\phi_{sij}^{(\ell)}} + \sum_{j=0}^{m_\ell} \frac{\big(y_{sj}^{(\ell)}\big)^2}{\varepsilon_{sij}^{(\ell)}} \right) = \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \big(u_{si}^{(\ell+1)}\big)^2 \tag{3.48}
\end{equation}
and the inductive hypothesis (3.34) holds.

If layer $\ell$ is an activation layer, then there are no additional terms on the left hand side for $\ell$ compared to $\ell - 1$, so
\begin{equation}
\sum_{\ell' \in P,\ \ell' \leq \ell}\ \sum_{i=1}^{m_{\ell'+1}} \sum_{j=0}^{m_{\ell'}} \ddot{W}_{ij}^{(\ell')} = \sum_{\ell' \in P,\ \ell' \leq \ell-1}\ \sum_{i=1}^{m_{\ell'+1}} \sum_{j=0}^{m_{\ell'}} \ddot{W}_{ij}^{(\ell')} \tag{3.49}
\end{equation}
For the right hand side, we plug in the formula for $\ddot{Y}_{sj}^{(\ell)}$ (3.17) and the formula for $u_{si}^{(\ell+1)}$ (3.19) to get
\begin{align}
\sum_{s=1}^{S} \sum_{j=1}^{m_\ell} \ddot{Y}_{sj}^{(\ell)} \big(u_{sj}^{(\ell)}\big)^2 &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \Big( \ddot{Y}_{si}^{(\ell+1)} \dot{\sigma}_\ell\big(y_{sj}^{(\ell)}\big)^2 + \dot{Y}_{si}^{(\ell+1)} \ddot{\sigma}_\ell\big(y_{sj}^{(\ell)}\big) \Big) \big(u_{sj}^{(\ell)}\big)^2 \tag{3.50}\\
&= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \dot{\sigma}_\ell\big(y_{sj}^{(\ell)}\big)^2 \big(u_{sj}^{(\ell)}\big)^2 + \text{const} \tag{3.51}\\
&= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \big(u_{si}^{(\ell+1)}\big)^2 + \text{const} \tag{3.52}
\end{align}
Putting (3.49) and (3.52) together gives the inductive hypothesis (3.34). There are no partitioning variables to optimize over in an activation layer, so we are done with this case.

Finally, at the output layer $\ell = L$, the inductive hypothesis tells us that, up to terms that do not depend on the partitioning variables,
\begin{equation}
\sum_{\ell \in P} \sum_{i=1}^{m_{\ell+1}} \sum_{j=0}^{m_\ell} \ddot{W}_{ij}^{(\ell)} = \sum_{s=1}^{S} \sum_{i=1}^{m_L} \ddot{Y}_{si}^{(L)} \big(u_{si}^{(L)}\big)^2 \tag{3.53}
\end{equation}
Plugging in the formula for $\ddot{Y}_{si}^{(L)}$ (3.8) yields
\begin{equation}
\sum_{s=1}^{S} \sum_{i=1}^{m_L} \ddot{Y}_{si}^{(L)} \big(u_{si}^{(L)}\big)^2 = \sum_{s=1}^{S} \sum_{i=1}^{m_L} \frac{1}{\phi_{si}^{(L)}}\, \frac{\partial^2 f(y_s^{(L)} \mid x_s)}{\partial \big(y_{si}^{(L)}\big)^2}\, \big(u_{si}^{(L)}\big)^2 \tag{3.54}
\end{equation}
Since for each $s \in [S]$ the $\{\phi_{si}^{(L)}\}$ are non-negative and sum to one, they fulfill the role of $\{p_i\}$ in our lemma. Applying the lemma, the optimal choice of $\{\phi_{si}^{(L)}\}$ is
\begin{equation}
\phi_{si}^{*(L)} \propto |u_{si}^{(L)}| \sqrt{\frac{\partial^2 f(y_s^{(L)} \mid x_s)}{\partial \big(y_{si}^{(L)}\big)^2}} \tag{3.55}
\end{equation}
3.9.4 Batch Normalization Layers
A batch normalization layer consists of three parts: a mean-normalization, a scale-normalization, and a rescale-reshift operation. The mean-normalization and rescale-reshift operations are affine, and their Hessians can be bounded with residual partitioning in the same way as for other affine layers. The scale-normalization is not affine, and its Hessian cannot be bounded with residual partitioning. However, we can bound through an affine approximation of the scale-normalization. Here, we describe in detail how to bound through each part of batch normalization.

First, let us consider the mean-normalization operation. For each $s \in [S]$ and $i \in [m_{\ell+1}]$, introduce partitioning variables $\{\varepsilon_{sitj}^{(\ell)}\}$ that are non-negative and sum to one over $t$ and $j$. Then:
\begin{align}
y_{si}^{(\ell+1)} &= y_{si}^{(\ell)} - \frac{1}{S \cdot m_\ell} \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} y_{tj}^{(\ell)} \tag{3.56}\\
\Delta y_{si}^{(\ell+1)} &= \Delta y_{si}^{(\ell)} - \frac{1}{S \cdot m_\ell} \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \Delta y_{tj}^{(\ell)} \tag{3.57}\\
&= \left(1 - \frac{1}{S \cdot m_\ell}\right) \Delta y_{si}^{(\ell)} - \frac{1}{S \cdot m_\ell} \sum_{(t,j) \neq (s,i)} \Delta y_{tj}^{(\ell)} \tag{3.58}\\
\big(\Delta y_{si}^{(\ell+1)}\big)^2 &\leq \frac{\big(1 - \frac{1}{S \cdot m_\ell}\big)^2}{\varepsilon_{sisi}^{(\ell)}} \big(\Delta y_{si}^{(\ell)}\big)^2 + \sum_{(t,j) \neq (s,i)} \frac{\big(\frac{1}{S \cdot m_\ell}\big)^2}{\varepsilon_{sitj}^{(\ell)}} \big(\Delta y_{tj}^{(\ell)}\big)^2 \tag{3.59}
\end{align}
Plugging this into $Y_{si}^{(\ell+1)}$ yields
\begin{align}
\sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} Y_{si}^{(\ell+1)} &\leq \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} Y_{tj}^{(\ell)} \tag{3.60}\\
\dot{Y}_{tj}^{(\ell)} &= \dot{Y}_{tj}^{(\ell+1)} - \frac{1}{S \cdot m_\ell} \sum_{s=1}^{S} \sum_{i=1}^{m_\ell} \dot{Y}_{si}^{(\ell+1)} \tag{3.61}\\
\ddot{Y}_{tj}^{(\ell)} &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \frac{\big(\delta(s,t)\delta(i,j) - \frac{1}{S \cdot m_\ell}\big)^2}{\varepsilon_{sitj}^{(\ell)}}\, \ddot{Y}_{si}^{(\ell+1)} \tag{3.62}
\end{align}
Plugging this into the inductive hypothesis yields
\begin{align}
\sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \ddot{Y}_{tj}^{(\ell)} \big(u_{tj}^{(\ell)}\big)^2 &= \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \left( \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \frac{\big(\delta(s,t)\delta(i,j) - \frac{1}{S \cdot m_\ell}\big)^2}{\varepsilon_{sitj}^{(\ell)}}\, \ddot{Y}_{si}^{(\ell+1)} \right) \big(u_{tj}^{(\ell)}\big)^2 \tag{3.63}\\
&= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \frac{\big(\delta(s,t)\delta(i,j) - \frac{1}{S \cdot m_\ell}\big)^2}{\varepsilon_{sitj}^{(\ell)}}\, \big(u_{tj}^{(\ell)}\big)^2 \tag{3.64}
\end{align}
The optimal choice of partitioning variables is
\begin{align}
\varepsilon_{sitj}^{*(\ell)} &= \frac{\big|\delta(s,t)\delta(i,j) - \frac{1}{S \cdot m_\ell}\big|\, u_{tj}^{(\ell)}}{u_{si}^{(\ell+1)}} \tag{3.65}\\
u_{si}^{(\ell+1)} &\equiv \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \Big| \delta(s,t)\delta(i,j) - \frac{1}{S \cdot m_\ell} \Big|\, u_{tj}^{(\ell)} \tag{3.66}\\
&= \left(1 - \frac{2}{S \cdot m_\ell}\right) u_{si}^{(\ell)} + \frac{1}{S \cdot m_\ell} \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} u_{tj}^{(\ell)} \tag{3.67}
\end{align}
Plugging this back into $\ddot{Y}_{tj}^{(\ell)}$ gives the following backprop update:
\begin{align}
\ddot{Y}_{tj}^{(\ell)} &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} u_{si}^{(\ell+1)} \Big| \delta(s,t)\delta(i,j) - \frac{1}{S \cdot m_\ell} \Big|\, \frac{1}{u_{tj}^{(\ell)}} \tag{3.68}\\
\ddot{Y}_{tj}^{(\ell)} u_{tj}^{(\ell)} &= \left(1 - \frac{2}{S \cdot m_\ell}\right) \ddot{Y}_{tj}^{(\ell+1)} u_{tj}^{(\ell+1)} + \frac{1}{S \cdot m_\ell} \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} u_{si}^{(\ell+1)} \tag{3.69}
\end{align}
Now for the scale-normalization operation. This operation is not affine, and the Hessian of its output with respect to its inputs is not diagonal. Instead, we will bound a Gauss-Newton approximation to the Hessian. To do this, we first need to construct the Jacobian:
\begin{align}
\sigma_B^2 &= \frac{1}{S \cdot m_\ell} \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \big(y_{tj}^{(\ell)}\big)^2 \tag{3.70}\\
y_{si}^{(\ell+1)} &= \frac{y_{si}^{(\ell)}}{\sqrt{\sigma_B^2 + \epsilon}} \tag{3.71}\\
\frac{\partial \sigma_B^2}{\partial y_{tj}^{(\ell)}} &= \frac{2}{S \cdot m_\ell}\, y_{tj}^{(\ell)} \tag{3.72}\\
\frac{\partial y_{si}^{(\ell+1)}}{\partial y_{tj}^{(\ell)}} &= \frac{\delta(s,t)\delta(i,j)}{\sqrt{\sigma_B^2 + \epsilon}} - \frac{1}{2}\, y_{si}^{(\ell)} \big(\sigma_B^2 + \epsilon\big)^{-3/2}\, \frac{\partial \sigma_B^2}{\partial y_{tj}^{(\ell)}} \tag{3.73}\\
&= \frac{\delta(s,t)\delta(i,j)}{\sqrt{\sigma_B^2 + \epsilon}} - \frac{1}{S \cdot m_\ell}\, y_{si}^{(\ell)} \big(\sigma_B^2 + \epsilon\big)^{-3/2}\, y_{tj}^{(\ell)} \tag{3.74}\\
&= \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \left( \delta(s,t)\delta(i,j) - \frac{1}{S \cdot m_\ell}\, y_{si}^{(\ell+1)} y_{tj}^{(\ell+1)} \right) \tag{3.75}
\end{align}
For each $s \in [S]$ and $i \in [m_{\ell+1}]$, introduce partitioning variables $\{\varepsilon_{sitj}^{(\ell)}\}$ that are non-negative and sum to one over $t$ and $j$. Now we approximate:
\begin{align}
\Delta y_{si}^{(\ell+1)} &\approx \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \frac{\partial y_{si}^{(\ell+1)}}{\partial y_{tj}^{(\ell)}}\, \Delta y_{tj}^{(\ell)} \tag{3.76}\\
\big(\Delta y_{si}^{(\ell+1)}\big)^2 &\leq \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \frac{\big(\partial y_{si}^{(\ell+1)} / \partial y_{tj}^{(\ell)}\big)^2}{\varepsilon_{sitj}^{(\ell)}}\, \big(\Delta y_{tj}^{(\ell)}\big)^2 \tag{3.77}
\end{align}
Plugging this into $Y_{si}^{(\ell+1)}$ yields
\begin{align}
\sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} Y_{si}^{(\ell+1)} &\leq \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} Y_{tj}^{(\ell)} \tag{3.78}\\
\dot{Y}_{tj}^{(\ell)} &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \dot{Y}_{si}^{(\ell+1)}\, \frac{\partial y_{si}^{(\ell+1)}}{\partial y_{tj}^{(\ell)}} \tag{3.79}\\
\ddot{Y}_{tj}^{(\ell)} &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\big(\partial y_{si}^{(\ell+1)} / \partial y_{tj}^{(\ell)}\big)^2}{\varepsilon_{sitj}^{(\ell)}} \tag{3.80}
\end{align}
Plugging this into the inductive hypothesis yields
\begin{align}
\sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \ddot{Y}_{tj}^{(\ell)} \big(u_{tj}^{(\ell)}\big)^2 &= \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \left( \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\big(\partial y_{si}^{(\ell+1)} / \partial y_{tj}^{(\ell)}\big)^2}{\varepsilon_{sitj}^{(\ell)}} \right) \big(u_{tj}^{(\ell)}\big)^2 \tag{3.81}\\
&= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \frac{\big(\partial y_{si}^{(\ell+1)} / \partial y_{tj}^{(\ell)}\big)^2}{\varepsilon_{sitj}^{(\ell)}}\, \big(u_{tj}^{(\ell)}\big)^2 \tag{3.82}
\end{align}
The optimal choice of partitioning variables is
\begin{align}
\varepsilon_{sitj}^{*(\ell)} &= \frac{\big|\partial y_{si}^{(\ell+1)} / \partial y_{tj}^{(\ell)}\big|\, u_{tj}^{(\ell)}}{u_{si}^{(\ell+1)}} \tag{3.83}\\
u_{si}^{(\ell+1)} &\equiv \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \Big| \frac{\partial y_{si}^{(\ell+1)}}{\partial y_{tj}^{(\ell)}} \Big|\, u_{tj}^{(\ell)} \tag{3.84}\\
&= \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \left( \left(1 - \frac{\big(y_{si}^{(\ell+1)}\big)^2}{S \cdot m_\ell}\right) u_{si}^{(\ell)} - \frac{\big(y_{si}^{(\ell+1)}\big)^2}{S \cdot m_\ell}\, u_{si}^{(\ell)} + \frac{\big|y_{si}^{(\ell+1)}\big|}{S \cdot m_\ell} \sum_{t=1}^{S} \sum_{j=1}^{m_\ell} \big|y_{tj}^{(\ell+1)}\big|\, u_{tj}^{(\ell)} \right) \tag{3.85}
\end{align}
Plugging this back into $\ddot{Y}_{tj}^{(\ell)}$ gives the following backprop update:
\begin{align}
\ddot{Y}_{tj}^{(\ell)} u_{tj}^{(\ell)} &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} u_{si}^{(\ell+1)} \Big| \frac{\partial y_{si}^{(\ell+1)}}{\partial y_{tj}^{(\ell)}} \Big| \tag{3.86}\\
&= \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \Bigg( \left(1 - \frac{\big(y_{tj}^{(\ell+1)}\big)^2}{S \cdot m_\ell}\right) \ddot{Y}_{tj}^{(\ell+1)} u_{tj}^{(\ell+1)} - \frac{\big(y_{tj}^{(\ell+1)}\big)^2}{S \cdot m_\ell}\, \ddot{Y}_{tj}^{(\ell+1)} u_{tj}^{(\ell+1)} \tag{3.87}\\
&\qquad\qquad + \big|y_{tj}^{(\ell+1)}\big|\, \frac{1}{S \cdot m_\ell} \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \big|y_{si}^{(\ell+1)}\big|\, \ddot{Y}_{si}^{(\ell+1)} u_{si}^{(\ell+1)} \Bigg) \tag{3.88}
\end{align}
Lastly, we have the rescale-reshift operation:
\begin{align}
y_{si}^{(\ell+1)} &= y_{si}^{(\ell)} \gamma + \beta \tag{3.89}\\
\Delta y_{si}^{(\ell+1)} &= \Delta y_{si}^{(\ell)} \gamma + y_{si}^{(\ell)} \Delta\gamma + \Delta\beta \tag{3.90}\\
\big(\Delta y_{si}^{(\ell+1)}\big)^2 &\leq \frac{\gamma^2}{\phi_{si}^{(\ell)}} \big(\Delta y_{si}^{(\ell)}\big)^2 + \frac{\big(y_{si}^{(\ell)}\big)^2}{\varepsilon_{si1}^{(\ell)}} (\Delta\gamma)^2 + \frac{1}{\varepsilon_{si0}^{(\ell)}} (\Delta\beta)^2 \tag{3.91}
\end{align}
Plugging this into $Y_{si}^{(\ell+1)}$ yields
\begin{align}
\sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} Y_{si}^{(\ell+1)} &\leq \sum_{s=1}^{S} \sum_{j=1}^{m_\ell} Y_{sj}^{(\ell)} + W_\gamma^{(\ell)} + W_\beta^{(\ell)} \tag{3.92}\\
\dot{Y}_{sj}^{(\ell)} &= \dot{Y}_{si}^{(\ell+1)} \gamma \qquad \ddot{Y}_{sj}^{(\ell)} = \ddot{Y}_{si}^{(\ell+1)}\, \frac{\gamma^2}{\phi_{si}^{(\ell)}} \tag{3.93}\\
\dot{W}_\gamma^{(\ell)} &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \dot{Y}_{si}^{(\ell+1)} y_{si}^{(\ell)} \qquad \ddot{W}_\gamma^{(\ell)} = \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\big(y_{si}^{(\ell)}\big)^2}{\varepsilon_{si1}^{(\ell)}} \tag{3.94}\\
\dot{W}_\beta^{(\ell)} &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \dot{Y}_{si}^{(\ell+1)} \qquad \ddot{W}_\beta^{(\ell)} = \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)}\, \frac{1}{\varepsilon_{si0}^{(\ell)}} \tag{3.95}
\end{align}
Plugging this into the inductive hypothesis yields
\begin{align}
& \sum_{s=1}^{S} \sum_{j=1}^{m_\ell} \ddot{Y}_{sj}^{(\ell)} \big(u_{sj}^{(\ell)}\big)^2 + \ddot{W}_\gamma^{(\ell)} + \ddot{W}_\beta^{(\ell)} \tag{3.96}\\
&= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\gamma^2}{\phi_{si}^{(\ell)}}\, \big(u_{si}^{(\ell)}\big)^2 + \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)}\, \frac{\big(y_{si}^{(\ell)}\big)^2}{\varepsilon_{si1}^{(\ell)}} + \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)}\, \frac{1}{\varepsilon_{si0}^{(\ell)}} \tag{3.97}\\
&= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} \left( \frac{\gamma^2 \big(u_{si}^{(\ell)}\big)^2}{\phi_{si}^{(\ell)}} + \frac{\big(y_{si}^{(\ell)}\big)^2}{\varepsilon_{si1}^{(\ell)}} + \frac{1}{\varepsilon_{si0}^{(\ell)}} \right) \tag{3.98}
\end{align}
The optimal choice of partitioning variables is
\begin{gather}
\phi_{si}^{(\ell)} \equiv \frac{|\gamma|\, u_{si}^{(\ell)}}{u_{si}^{(\ell+1)}} \qquad
\varepsilon_{si1}^{(\ell)} \equiv \frac{|y_{si}^{(\ell)}|}{u_{si}^{(\ell+1)}} \qquad
\varepsilon_{si0}^{(\ell)} \equiv \frac{1}{u_{si}^{(\ell+1)}} \tag{3.100}\\
u_{si}^{(\ell+1)} \equiv |\gamma|\, u_{si}^{(\ell)} + |y_{si}^{(\ell)}| + 1 \tag{3.101}
\end{gather}
Plugging this back in gives the following backprop updates:
\begin{align}
\ddot{Y}_{sj}^{(\ell)} u_{sj}^{(\ell)} &= \ddot{Y}_{si}^{(\ell+1)} u_{si}^{(\ell+1)} |\gamma| \tag{3.102}\\
\ddot{W}_\gamma^{(\ell)} &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} u_{si}^{(\ell+1)} \big|y_{si}^{(\ell)}\big| \tag{3.103}\\
\ddot{W}_\beta^{(\ell)} &= \sum_{s=1}^{S} \sum_{i=1}^{m_{\ell+1}} \ddot{Y}_{si}^{(\ell+1)} u_{si}^{(\ell+1)} \tag{3.104}
\end{align}
Input: Inputs $\{y^{(1)}_s\}$, targets $\{x_s\}$, weights $\{w^{(\ell)}_{ij}\}$, global learning rate $\lambda > 0$.
Output: Weight updates $\{\Delta w^{(\ell)}_{ij}\}$.

foreach $s \in [S]$, $i \in [m_1]$ do    /* Initialize Forward Pass */
    $u^{(1)}_{si} \leftarrow 0$;
for $\ell = 1$ to $L-1$ do    /* Forward Pass */
    foreach $s \in [S]$, $i \in [m_{\ell+1}]$ do
        if layer $\ell$ is an affine layer then
            $u^{(\ell+1)}_{si} \leftarrow \sum_{j=0}^{m_\ell} |y^{(\ell)}_{sj}| + \sum_{j=1}^{m_\ell} |w^{(\ell)}_{ij}|\, u^{(\ell)}_{sj}$;
            $y^{(\ell+1)}_{si} \leftarrow \sum_{j=0}^{m_\ell} w^{(\ell)}_{ij} y^{(\ell)}_{sj}$;
        else if layer $\ell$ is an activation layer then
            $u^{(\ell+1)}_{si} \leftarrow |\dot{\sigma}_\ell(y^{(\ell)}_{sj})|\, u^{(\ell)}_{sj}$;
            $y^{(\ell+1)}_{si} \leftarrow \sigma_\ell(y^{(\ell)}_{sj})$;
foreach $s \in [S]$, $i \in [m_L]$ do    /* Initialize Backward Pass */
    $\dot{Y}^{(L)}_{si} \leftarrow \partial f(y^{(L)}_s \mid x_s) / \partial y^{(L)}_{si}$;
    $\ddot{Y}^{(L)}_{si} u^{(L)}_{si} \leftarrow \sqrt{\partial^2 f(y^{(L)}_s \mid x_s) / \partial (y^{(L)}_{si})^2}\ \cdot\ \sum_{i'=1}^{m_L} |u^{(L)}_{si'}| \sqrt{\partial^2 f(y^{(L)}_s \mid x_s) / \partial (y^{(L)}_{si'})^2}$;
for $\ell = L-1$ down to $1$ do    /* Backward Pass */
    if layer $\ell$ is an affine layer then
        foreach $i \in [m_{\ell+1}]$, $j \in [m_\ell]$ do
            $\dot{W}^{(\ell)}_{ij} \leftarrow \frac{1}{S} \sum_{s=1}^{S} \dot{Y}^{(\ell+1)}_{si} y^{(\ell)}_{sj}$;
            $\ddot{W}^{(\ell)}_{ij} \leftarrow \frac{1}{S} \sum_{s=1}^{S} \ddot{Y}^{(\ell+1)}_{si} u^{(\ell+1)}_{si} |y^{(\ell)}_{sj}|$;
        foreach $s \in [S]$, $j \in [m_\ell]$ do
            $\dot{Y}^{(\ell)}_{sj} \leftarrow \sum_{i=1}^{m_{\ell+1}} \dot{Y}^{(\ell+1)}_{si} w^{(\ell)}_{ij}$;
            $\ddot{Y}^{(\ell)}_{sj} \leftarrow \sum_{i=1}^{m_{\ell+1}} \ddot{Y}^{(\ell+1)}_{si} u^{(\ell+1)}_{si} |w^{(\ell)}_{ij}|$;
    else if layer $\ell$ is an activation layer then
        foreach $s \in [S]$, $j \in [m_\ell]$ do
            $\dot{Y}^{(\ell)}_{sj} \leftarrow \dot{Y}^{(\ell+1)}_{si} \dot{\sigma}_\ell(y^{(\ell)}_{sj})$;
            $\ddot{Y}^{(\ell)}_{sj} \leftarrow \big| \ddot{Y}^{(\ell+1)}_{si} \dot{\sigma}_\ell(y^{(\ell)}_{sj})^2 + \dot{Y}^{(\ell+1)}_{si} \ddot{\sigma}_\ell(y^{(\ell)}_{sj}) \big|$;
foreach $\ell \in P$, $i \in [m_{\ell+1}]$, $j \in \{0\} \cup [m_\ell]$ do    /* Return */
    $\Delta w^{(\ell)}_{ij} \leftarrow -\lambda \cdot \dot{W}^{(\ell)}_{ij} / \ddot{W}^{(\ell)}_{ij}$;
return $\{\Delta w^{(\ell)}_{ij}\}$

Algorithm 2: One training step of residual partitioning.
Chapter 4
Learning Morphisms with Gauss-Newton Approximation for
Growing Networks
An appealing method for Neural Architecture Search (NAS) is based on growing networks via
small local changes to the network’s architecture called network morphisms. These methods start
with a small seed network and progressively grow the network by adding new neurons in an automated way. However, efficiently determining the best way to grow the network remains a challenge. Here we propose a NAS method for growing a network which uses a Gauss-Newton approximation of the loss function to efficiently learn and evaluate candidate network morphisms. We
then optimize this approximate loss function to efficiently learn morphism parameters. We compare our method with similar NAS methods for CIFAR-10 and CIFAR-100 classification tasks, and
conclude our method learns similar quality or better architectures at a smaller computational cost.
4.1 Introduction
Neural Architecture Search (NAS), which seeks to automate the architectural design of neural
networks, has become a central problem in machine learning research [Elsken et al., 2019]. Researchers often advance the state of the art by carefully designing novel network architectures for specific problems, e.g., ResNets [He et al., 2016] for image classification and transformers [Vaswani
et al., 2017] for natural language processing.
There are many different optimization methods for performing NAS, such as evolutionary
methods [Elsken et al., 2017, Real et al., 2019, Nekrasov et al., 2019], reinforcement learning
methods [Zoph et al., 2018, Gong et al., 2019, Zhong et al., 2018], and pruning methods [Frankle
and Carbin, 2018, Liu et al., 2017, Han et al., 2015]. Another unique category of methods for NAS
is growing methods [Liu et al., 2018, Gordon et al., 2018, Liu et al., 2019, Wu et al., 2020, Chavan
et al., 2024, Tong, 2022, King and Mortazavi, 2021]. Growing methods begin with a small seed
architecture, then progressively grow a larger, more complex architecture by repeatedly applying
small parameterizable local changes to the network’s architecture called network morphisms. To
grow a network, we must choose which morphisms to apply as well as the parameters for those
morphisms. However, this optimization problem is challenging to solve at scale, when there are
many possible morphisms to consider.
In this chapter, we propose a method for learning and evaluating morphism parameters quickly
and efficiently. Our method utilizes a Gauss-Newton approximation of the loss function to estimate
the decrease in loss resulting from applying each morphism. We then optimize this loss to learn
and evaluate morphism parameters. We use this method to design a NAS algorithm that iteratively
applies network morphisms to progressively grow a network architecture.
We compare our method with other NAS methods on CIFAR-10 and CIFAR-100 classification tasks [Krizhevsky et al., 2009]. We present promising experiments that demonstrate that our
method grows networks with similar or better parameter-accuracy tradeoff compared to similar
methods.
4.2 Related Work
Pruning methods [Frankle and Carbin, 2018, Liu et al., 2018, 2017, Han et al., 2015] have become
popular for shrinking large, high-performing networks down to much smaller networks without
sacrificing test accuracy. One-shot methods [Pham et al., 2018] are similar: they simplify NAS by
constraining the search space to subgraphs of a large trained network. These methods are much
less computationally expensive than reinforcement learning and evolutionary methods, but still
require training a large network.
The computational inexpensiveness of growing progressively larger networks has been exploited for NAS [Liu et al., 2018, Rusu et al., 2016, Gordon et al., 2018] and for training fixed
networks [Karras et al., 2017]. Growing networks via network morphisms has previously been
used in combination with reinforcement learning [Cai et al., 2018] and evolutionary NAS methods
[Elsken et al., 2018]. In contrast, we use network morphisms to view NAS as a continuous optimization problem, similar to other differentiable architecture search methods [Luo et al., 2018,
Shin et al., 2018, Liu et al., 2018]. Unlike Net2Net [Chen et al., 2015], which applies network morphisms with random parameters, we build upon a recent line of work [Liu et al., 2019, Wang et al.,
2019, Wu et al., 2020] that has made progress in efficiently learning and evaluating morphisms.
We use a Gauss-Newton approximation to estimate the decrease in the loss achieved by applying a network morphism. Bayesian optimization NAS methods [Jin et al., 2018, Liu et al.,
2018, Klein et al., 2016, Negrinho and Gordon, 2017] also try to estimate the performance of new
networks without training them. However, these methods use the performance of previously seen
networks to predict the performance of future unseen networks, while our predictions are made
independently of previously seen networks. In fact, our technique is more similar to [LeCun et al.,
1990] in which the authors use a diagonal approximation of the Hessian to estimate the change in
the loss when pruning neurons.
4.3 Method
4.3.1 Morphisms
A network morphism is a small change in a neural net’s architecture parameterized by θ so that
when θ = 0, the morphism is function-preserving, i.e., the input-output mapping of the neural
network is unchanged. In this chapter, we consider several morphisms.
Figure 4.1: Network morphisms: (a) the channel splitting morphism, (b) the channel pruning morphism. Square nodes represent convolutional layers, circular nodes represent convolutional channels.
The first morphism we consider is a channel-splitting morphism that grows a network wider,
depicted in Figure 4.1a. For a convolutional channel y with input from a layer x with incoming
kernel parameters win and output to a layer z with outgoing kernel parameters wout, applying the
channel-splitting morphism replaces the channel y with two channels y1 and y2, with incoming
kernel parameters win + θ and win − θ respectively, and each with outgoing kernel parameters
wout/2. If θ = 0, then this morphism duplicates the channel y without changing the input-output
mapping of the neural network; if θ ̸= 0, then this morphism replaces the feature detected by y
with two new feature detectors with parameters win +θ and win −θ. For example, if y is an edge
detector, then y1 and y2 may detect two similar edges with slightly different angles.
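To make the parameterization concrete, here is a minimal PyTorch-style sketch of the channel-splitting morphism applied to the weights of two adjacent convolutional layers; the tensor names and shapes are illustrative assumptions rather than code from the experiments. At θ = 0 the expanded network computes exactly the same function as the original.

import torch

out_ch, in_ch, k = 8, 3, 3
w_in = torch.randn(out_ch, in_ch, k, k)     # incoming kernels of the layer producing y
w_out = torch.randn(16, out_ch, k, k)       # outgoing kernels of the next layer z
split_idx = 2                                # channel y to split
theta = torch.zeros(in_ch, k, k)             # morphism parameters (zero = function-preserving)

# Split channel y into y1 (w_in + theta) and y2 (w_in - theta), each feeding z with w_out / 2.
w_in_new = torch.cat([w_in, (w_in[split_idx] - theta).unsqueeze(0)], dim=0)
w_in_new[split_idx] = w_in[split_idx] + theta
w_out_new = torch.cat([w_out, (w_out[:, split_idx] / 2).unsqueeze(1)], dim=1)
w_out_new[:, split_idx] = w_out[:, split_idx] / 2

# With theta = 0, y1 == y2 == y, and their halved outgoing kernels sum to the original
# contribution, so the input-output mapping of the network is unchanged.
print(w_in_new.shape, w_out_new.shape)       # (9, 3, 3, 3), (16, 9, 3, 3)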
The second morphism we consider is a channel-pruning morphism, depicted in Figure 4.1b.
For a channel y with incoming kernel parameters win, applying the channel-pruning morphism
subtracts θ from win. If θ = 0, then the network is unchanged, but if θ = win, then the incoming
kernel parameters of y are zero, and the channel y can be pruned. The parameters of this morphism
are not learned, but instead are always chosen to be θ = win.
To apply a particular morphism, we must first choose values for the morphism’s parameters θ.
The best choice for the morphism’s parameters would maximally decrease the loss of the network
when the morphism is applied. However, these parameters are computationally prohibitive to
calculate exactly at scale.
4.3.2 Gauss-Newton Approximation
Instead, we can approximate the decrease in the loss function for each morphism. Each morphism we consider is local, so that there exists a collection of network activations $z$ such that the mapping between the network input and any activation higher than $z$ in the computational DAG is unchanged for any choice of $\theta$. Consider the expanded networks depicted in Figure 4.1. Denote by $\Delta\mathcal{L}(\theta)$ the change in the loss function after applying the morphism with parameters $\theta$; by $\Delta z(\theta)$ the change in $z$ after applying the morphism with parameters $\theta$; by $g$ the gradient of the loss function with respect to $z$ at $\theta = 0$; and by $H$ the Hessian of the loss function with respect to $z$ at $\theta = 0$. Consider the second-order approximation of the change in the loss function with respect to $z$ centered at $\theta = 0$:
\begin{equation}
\Delta\mathcal{L}(\theta) \approx \Delta z(\theta) \cdot g + \frac{1}{2}\, \Delta z(\theta)^\top H\, \Delta z(\theta)
\end{equation}
Computing the Hessian matrix of second derivatives in this approximation is computationally expensive. Instead, we can make a Gauss-Newton approximation of the Hessian matrix, where $\hat{\mathcal{L}}$ is the current training loss:
\begin{equation}
H \approx \frac{1}{2\hat{\mathcal{L}}}\, g g^\top
\end{equation}
Plugging in this approximation yields:
\begin{equation}
\Delta\mathcal{L}(\theta) \approx \Delta z(\theta) \cdot g + \frac{1}{4\hat{\mathcal{L}}}\, \big( \Delta z(\theta) \cdot g \big)^2
\end{equation}
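As a sketch of how this estimate can be computed from quantities already available during training, consider the following PyTorch-style fragment; the function and variable names are our own illustrative choices rather than the experiment code, with z_old and z_new standing for the activations z of the original and expanded networks on one mini-batch.

import torch

def estimate_delta_loss(z_old, z_new, grad_z, train_loss):
    """Gauss-Newton estimate of the change in loss from applying a morphism.

    z_old, z_new: activations z before/after applying the morphism (same shape)
    grad_z:       gradient g of the mini-batch loss with respect to z at theta = 0
    train_loss:   current mini-batch training loss L_hat (a scalar)
    """
    delta_z = z_new - z_old
    inner = torch.sum(delta_z * grad_z)          # Delta z(theta) . g
    return inner + inner**2 / (4.0 * train_loss)

# Example with placeholder tensors:
z_old = torch.randn(64, 16)
z_new = z_old + 0.01 * torch.randn(64, 16)
grad_z = torch.randn(64, 16)
print(estimate_delta_loss(z_old, z_new, grad_z, torch.tensor(2.3)))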
Recent work [Wu et al., 2020, Wang et al., 2019] also uses a second-order approximation of the
loss function to learn morphism parameters. In that work, the authors make a second-order approximation of the loss function with respect to θ. In contrast, we make a second-order approximation
of the loss function with respect to ∆z(θ). Critically, because ∆z(θ) is a nonlinear function of θ,
our Gauss-Newton approximation is still a high-order approximation of ∆L (θ) with respect to θ.
4.3.3 Algorithm
Since $\Delta\mathcal{L}(\theta)$ is computed independently for each training mini-batch, we record an exponential moving average of $\Delta\mathcal{L}(\theta)$ across mini-batches using a momentum hyperparameter to get a lower-variance estimate of the decrease in the loss function. We can then weigh the tradeoff for each morphism between the estimated change in loss $\Delta\mathcal{L}(\theta)$ and the change in the number of parameters introduced when applying the morphism. To quantify this tradeoff, we introduce a regularization hyperparameter indicating the desired tradeoff between training loss and model size. Then we say a morphism has a positive loss-resource tradeoff if
\begin{equation}
-\Delta\mathcal{L}(\theta) > \lambda_p \Delta R_p
\end{equation}
where $\lambda_p$ is a regularization hyperparameter on the model size and $\Delta R_p$ is the change in the number of parameters resulting from applying the morphism. Given the exponential moving average estimate of $\Delta\mathcal{L}(\theta)$, checking whether a morphism has positive loss-resource tradeoff takes constant time.
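The bookkeeping this requires is tiny. Below is a minimal Python sketch, with hypothetical function and variable names, of the exponential moving average and the constant-time tradeoff check described above.

# Hypothetical per-morphism bookkeeping; `momentum` and `lambda_p` are hyperparameters.
def update_ema(ema, delta_loss, momentum):
    """Exponential moving average of the estimated change in loss for one morphism."""
    return (1.0 - momentum) * ema + momentum * delta_loss

def has_positive_tradeoff(ema_delta_loss, delta_params, lambda_p):
    """Constant-time check: apply the morphism only if -Delta L > lambda_p * Delta R_p."""
    return -ema_delta_loss > lambda_p * delta_params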
We use our Gauss-Newton approximation to design a NAS algorithm for growing networks,
summarized in Algorithm 3 in the appendix. Our algorithm alternates between a training phase
and a growing phase. In all our experiments, each phase lasts 20 epochs. In the training phase,
the model architecture is frozen while the model parameters are optimized to minimize the training loss. In the growing phase, the model parameters are frozen while morphism parameters are
optimized to minimize our Gauss-Newton approximation of the loss. After morphism parameters
are learned, we compute each morphism’s loss-resource tradeoff. Then for each layer, we apply
the top 30% of morphisms local to that layer with positive loss-resource tradeoff.
Figure 4.2: Network grown from a VGG-19 seed network by our algorithm for classifying CIFAR-100. Here the network is at the end of its 15th growth phase. Channels colored red will be split
with their learned channel-splitting morphism parameters in the next epoch; splitting the reddest
channels is estimated to give the highest loss-resource tradeoff. Channels colored blue will be
pruned in the next epoch; pruning the bluest channels is estimated to give the highest loss-resource
tradeoff.
4.4 Experiments
In all our experiments, we train with a batch size of 64 and use a simple data augmentation scheme
for CIFAR-10 and CIFAR-100: random horizontal flips and random crops with padding 4. In
the appendix, we present additional experiments evaluating the accuracy of our Gauss-Newton
approximation and the quality of our learned morphism parameters.
Here we compare our NAS algorithm end-to-end with other methods for learning architectures
for classifying CIFAR-10 and CIFAR-100. We experiment with different choices of the loss-resource tradeoff hyperparameter to grow networks of many different sizes. We grow networks from one of two seed networks. The first is a VGG-19 network with 16 channels in each convolutional layer. The second is a MobileNetV1 network with 32 channels in each convolutional layer. In each experiment, we run our algorithm for a total of 30 training and growing phases. We optimized model parameters using SGD with Nesterov momentum 0.9, weight decay $10^{-4}$, and a learning rate that begins at 0.1 and decreases by a factor of 10 at epochs 300 and 450. We optimized morphism parameters with Adam and a learning rate of $10^{-2}$. After our algorithm terminates, we reinitialize the network's model parameters and retrain the model from scratch to more accurately determine the best test accuracy achievable for the learned architecture.
A visualization of a network grown by our algorithm from a VGG-19 seed network for classifying CIFAR-100 is in Figure 4.2. It is worthwhile to point out that, growing from a uniform-width seed network, our algorithm naturally discovers that a unique, bottleneck-shaped architecture provides the best loss-parameter tradeoff.

Table 4.1: Classification performance of various architectures on CIFAR-10

Type         Method                              Error (%)   Params (M)   Reachable   GPU time (days)
SOTA         AmoebaNet-A [121]                   3.3         3.2                      3150
SOTA         NASNET-A [172]                      3.4         3.3                      2000
SOTA         Large-scale Evolution [120]         5.4         5.4                      2600
Morphisms    NASH [34]                           5.2         19.7        ✓           1.0
Morphisms    Slimming [94]                       6.2         2.3         ✓           -
Morphisms    Firefly [159]                       6.2         1.9         ✓           -
Morphisms    Net2Net [19]                        6.5         3.9         ✓           2.1
Human        DenseNet [61]                       3.5         25.6                     N/A
Human        VGG-19 Baseline [136]               6.3         20.0        ✓           N/A
Human        ResNet [60]                         6.4         1.7                      N/A
Human        MobileNetV1 Baseline [58]           6.6         3.2         ✓           N/A
Ours         Seed VGG-19, λp = 3×10^-7           5.6         1.2         ✓           0.7
Ours         Seed VGG-19, λp = 1×10^-6           6.5         0.6         ✓           0.5
Ours         Seed MobileNetV1, λp = 3×10^-8      5.8         0.8         ✓           1.0
Ours         Seed MobileNetV1, λp = 3×10^-7      6.0         0.5         ✓           1.0
Ours         Seed MobileNetV1, λp = 1×10^-6      6.2         0.4         ✓           0.7
Next, we report the results for CIFAR-10 and CIFAR-100 classification tasks in Tables 4.1 and
4.2, respectively. We compare with other NAS methods as well as human-designed baselines. We
observe that our method produces networks with similar or better parameter-accuracy tradeoff at a
smaller computational cost. For example, a network we grew from a VGG-19 seed network using
$\lambda_p = 3 \times 10^{-7}$ achieved 5.6% test error on CIFAR-10 using only 1.2 million parameters, achieving lower test error with fewer parameters than Liu et al. [2017], which pruned a VGG-19 model down to 2.3 million parameters and achieved 6.2% test error. Similarly, a network we grew from a MobileNetV1 seed network using $\lambda_p = 3 \times 10^{-7}$ achieved 25.9% test error on CIFAR-100 using only 1.4 million parameters, achieving lower test error with fewer parameters than He et al. [2016], a ResNet that achieves 27.2% test error with 1.7 million
parameters.
Table 4.2: Classification performance of various architectures on CIFAR-100

Type         Method                              Error (%)   Params (M)   Reachable   GPU time (days)
SOTA         Large-scale Evolution [120]         23.0        40.4                     -
SOTA         SMASH [11]                          22.1        4.6                      -
Morphisms    NASH [34]                           23.4        22.3        ✓           1.0
Morphisms    Slimming [94]                       26.5        5.0         ✓           -
Human        DenseNet [61]                       17.2        25.6                     N/A
Human        ResNet [53]                         27.2        1.7                      N/A
Human        VGG-19 Baseline [136]               27.6        20.1        ✓           N/A
Human        MobileNetV1 Baseline [58]           28.7        3.3         ✓           N/A
Ours         Seed VGG-19, λp = 3×10^-7           27.2        2.2         ✓           0.7
Ours         Seed VGG-19, λp = 6×10^-7           28.0        1.6         ✓           0.6
Ours         Seed MobileNetV1, λp = 1×10^-6      27.2        0.8         ✓           0.8
Ours         Seed MobileNetV1, λp = 6×10^-7      26.9        1.3         ✓           1.0
Ours         Seed MobileNetV1, λp = 3×10^-7      25.9        1.4         ✓           1.0
Note that our method of growing from simple VGG-19 and MobileNetV1 seed networks with simple channel-splitting and pruning morphisms is not enough to outperform complex architectures like those produced by NASNET. Architecture elements necessary for high performance, like residual connections and squeeze-excite modules, make growing complicated because they force several layers to have the same number of channels, preventing us from splitting channels in different layers independently. It may be possible to grow from these types of seed networks using more complex morphisms that split channels in multiple layers jointly, but we leave this for future work.
4.5 Conclusion
In this chapter, we presented a neural architecture search method for growing a network with network morphisms while training. We used a Gauss-Newton approximation of the loss to learn morphism parameters and to estimate the change in the loss resulting from applying those morphisms.
We used the estimated change in loss to compute a loss-resource tradeoff for each morphism using
hyperparameters that regularized the number of parameters of the grown network. We compared
our method with state-of-the-art NAS methods for classifying CIFAR-10 and CIFAR-100 and concluded that our algorithm finds similar or better architectures at a smaller computational cost.
4.6 Appendix A
4.6.1 Algorithm Pseudocode
Data: Dataset D, model M, phase length n_phase
for t = 1, ..., n_phase do
    foreach mini-batch {d_s} ∈ D do
        Compute mini-batch loss L = M({d_s});
        Compute all ∇_w L with backprop;
        SGD step on model parameters w;
for t = 1, ..., n_phase do
    foreach mini-batch {d_s} ∈ D do
        Compute all ∆L(θ) and ∇_θ L with backprop;
        Update exponential moving average of ∆L(θ);
        SGD step on morphism parameters θ;
foreach morphism in the top 30% with positive tradeoff do
    Apply the morphism;
Algorithm 3: Growing networks with Gauss-Newton
4.6.2 Gauss-Newton Approximation Accuracy
Here we evaluate the accuracy of our Gauss-Newton approximation of the loss. We begin by
constructing a VGG-19 model for CIFAR-10 and equip it with channel-splitting morphisms, one
for each channel in each convolutional layer in the network. We trained the VGG-19 model with
SGD with learning rate 0.1 for 20 epochs while holding morphism parameters constant. Then we updated morphism parameters with Adam with learning rate 10⁻² for 20 epochs while updating the exponential moving average estimate of ∆L using momentum hyperparameter (64/50000) × (1/2), so that our estimate of ∆L is approximately an average over the last 2 epochs. We then computed
the true change in loss achieved by each morphism with its current parameters by applying each
morphism to construct an independent expanded network and evaluating that expanded network
on the test dataset. We then compared our exponential moving average estimate of ∆L with the
true value.
The results are depicted in Figure 4.3. Each circle represents a single channel-splitting morphism in the specified layer. There are 64 channels in the first layer of the VGG-19 model, and
512 channels each in the 9-th and last layers. The figure plots our exponential moving average
estimate of ∆L against the true ∆L computed via brute force. If our method were 100% accurate,
all circles would lie on the grey dashed lines.
The figure shows that the Gauss-Newton approximation used by our algorithm is quite accurate.
This result by itself is significant. Other methods expend enormous computational resources trying
to estimate how the loss of a network changes when channels are added or removed from the
network. This result shows that the change in loss can be approximated to a high degree of accuracy
using only statistics of the network, namely ∆z(θ)· g.
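To emphasize how little information this requires, the estimate can be written as a one-line helper (an illustrative sketch based on the Gauss-Newton formulas reviewed in Section 4.6.4; the function name and arguments are ours, with dz_dot_g denoting the accumulated statistic ∆z(θ)·g and loss the current training loss):

def estimate_delta_loss(dz_dot_g, loss):
    # Gauss-Newton estimate of the change in loss from applying a morphism:
    # Delta L ≈ dz·g + (dz·g)^2 / (4 L), which follows from H ≈ g g^T / (2 L).
    return dz_dot_g + dz_dot_g ** 2 / (4.0 * loss)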
We also observe that the Gauss-Newton approximation seems to be most accurate for the layer
closest to the network output, and least accurate for the layer closest to the network input, though
the reason for this behavior is unclear.
We conclude that the Gauss-Newton approximation used by our algorithm estimates ∆L for
each morphism to a high degree of accuracy.
4.6.3 Learned Morphism Quality
Here we compare the quality of our learned morphisms to those learned via other methods. Another method for learning morphism parameters is to apply the morphism to construct an expanded
network, then optimize the loss of the expanded network with respect to the morphism parameters.
[Figure 4.3: Estimated versus actual decrease in loss for morphisms learned while holding model parameters constant. Panels: (a) First Layer, (b) Layer 9, (c) Last Layer. Each panel plots the exponential-moving-average estimate of ∆L on the horizontal axis against the true ∆L on the vertical axis.]
This allows us to learn morphism parameters that minimize the loss rather than an approximation of the loss, but this approach is computationally expensive to scale when there are many morphisms under consideration.
Another method for choosing morphism parameters is to use the steepest descent direction as
in [Wu et al., 2020, Wang et al., 2019, Wu et al., 2020]. However, the steepest descent direction
does not indicate the optimal scale for θ. To approximately compute the optimal scale, we perform
a line search along the steepest descent direction, though this is computationally expensive.
We compare the true decrease in loss achieved by the morphism parameters learned by our
algorithm with the true decrease in loss achieved by the morphism parameters produced by the two
baselines described above. We do this for each of the possible 64 channel-splitting morphisms in
the first layer of the VGG-19 network trained in the previous experiment. The result is in Figure
4.4. Each 3-bar cluster plots the true decrease in loss achieved by the morphism parameters learned
by each method for the corresponding channel-splitting morphism. For ease of viewing, we have
sorted the channels with respect to the true decrease in loss achieved by the first baseline method.
Note that for some channels, none of the methods are able to find good morphism parameters.
After inspecting these features, we observe that at this point in training (epoch 20), those channels
have already “died” due to L2 weight regularization, so it is likely not possible to split such bad
feature detectors into two good feature detectors.
[Figure 4.4: Comparison of different morphism learning strategies. Each 3-bar cluster plots the true decrease in loss achieved by the channel-splitting morphism learned by each method (Optimal, Gauss-Newton (Ours), Steepest Descent + LS) for one of the 64 channels in the first layer of VGG-19; the horizontal axis is the channel index and the vertical axis is the true decrease in loss.]
From the figure, we observe that the morphisms learned by our method most often achieve a
greater decrease in loss than those learned by the steepest descent with line search baseline method.
We also observe that the true decrease in loss achieved by our learned morphisms most often comes
within a constant factor of the decrease achieved by the expensive network expansion baseline. We
observe this most often among the morphisms with the highest potential decrease in loss; this is
important, since these are the morphisms that will be selected by our algorithm to be applied to
grow the network. We conclude that our algorithm learns high quality morphisms, on par with the
expensive network expansion baseline method.
4.6.4 Gauss-Newton Approximation
In this section we review the justification for the Gauss-Newton approximation. We begin by assuming
that the loss function is well-approximated by a least-squares problem in z, i.e., for some matrix A
and vector b,
L(z) ≈ (1/2)∥Az − b∥₂² = (1/2)b⊤b − z⊤A⊤b + (1/2)z⊤A⊤Az.
Denote the residual r = Az − b, and note that L = (1/2)r⊤r. Denote the gradient and Hessian of the loss:

g = A⊤r,    H = A⊤A.
Consider the change in the loss function when adding a quantity ∆z to z. Denote the change in loss:

∆L(∆z) ≡ L(z + ∆z) − L(z) = ∆z · g + (1/2)∆z⊤A⊤A∆z = ∆z · g + (1/2)∆z⊤H∆z.
We write the Gauss-Newton approximation as

H ≈ (1/(2L)) gg⊤.
Theorem 5 (General Gauss-Newton Approximation). If ∆z = λ∆z∗ for some λ ∈ R and some ∆z∗ satisfying A(z + ∆z∗) = b, then

(1/2)∆z⊤H∆z = (1/2)∆z⊤ (gg⊤/(2L)) ∆z.
Proof. If ∆z = λ∆z∗ and A(z + ∆z∗) = b, then A∆z = −λr. So

(1/2)∆z⊤ (gg⊤/(2L)) ∆z = (1/2)∆z⊤ (A⊤rr⊤A/(r⊤r)) ∆z = (1/2)λ²r⊤r.

Similarly,

(1/2)∆z⊤H∆z = (1/2)∆z⊤A⊤A∆z = (1/2)λ²r⊤r.
Therefore, we say the Gauss-Newton approximation is exact in the space spanned by the solutions ∆z∗ to the linear system A(z + ∆z∗) = b.
Theorem 6 (Rank-1 Gauss-Newton Approximation). If H is rank-1 and there exists a solution z∗ to the linear system Az∗ = b, then for all ∆z,

(1/2)∆z⊤H∆z = (1/2)∆z⊤ (gg⊤/(2L)) ∆z.

Proof. If H is rank-1, then A consists of a single row u⊤ and b ∈ R is a scalar.
Let ∆z be arbitrary. Then there exists a solution ∆z∗ to the linear system A(z + ∆z∗) = b and a λ ∈ R such that ∆z = λ∆z∗, namely

∆z∗ = ((b − u⊤z)/(u⊤∆z)) ∆z,    λ = (u⊤∆z)/(b − u⊤z),

since then

A(z + ∆z∗) = u⊤z + u⊤∆z∗ = u⊤z + u⊤ ((b − u⊤z)/(u⊤∆z)) ∆z = u⊤z + (b − u⊤z) = b.

Applying the previous theorem yields the result.
From this it is clear that we can expect the Gauss-Newton approximation to be quite accurate
if the true Hessian matrix H is low-rank.
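To make the rank-1 statement concrete, here is a small numerical check (our own illustrative script, not part of the thesis experiments): it builds a rank-1 least-squares loss and confirms that the Gauss-Newton surrogate gg⊤/(2L) reproduces the exact quadratic term.

import numpy as np

# Rank-1 least-squares problem L(z) = 0.5 * (u^T z - b)^2, so H = u u^T.
rng = np.random.default_rng(0)
n = 5
u = rng.normal(size=n)
b = rng.normal()
z = rng.normal(size=n)
dz = rng.normal(size=n)          # arbitrary perturbation of z

r = u @ z - b                    # residual
L = 0.5 * r ** 2                 # loss
g = u * r                        # gradient A^T r
H = np.outer(u, u)               # Hessian A^T A (rank-1)

exact = 0.5 * dz @ H @ dz
gauss_newton = 0.5 * dz @ (np.outer(g, g) / (2 * L)) @ dz
assert np.isclose(exact, gauss_newton)   # the two quadratic forms agree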
Chapter 5
Neural Architecture Search for Parameter-Efficient Fine-tuning
of Large Pre-trained Language Models
Parameter-efficient tuning (PET) methods fit pre-trained language models (PLMs) to downstream
tasks by either computing a small compressed update for a subset of model parameters, or appending and fine-tuning a small number of new model parameters to the pre-trained network. Hand-designed PET architectures from the literature perform well in practice, but have the potential to be
improved via automated neural architecture search (NAS). We propose an efficient NAS method
for learning PET architectures via structured and unstructured pruning. We present experiments on
GLUE demonstrating the effectiveness of our algorithm and discuss how PET architectural design
choices affect performance in practice.
5.1 Introduction
Fine-tuning a large pre-trained language model is a popular method for solving many downstream
natural language processing (NLP) tasks. Full fine-tuning involves fine-tuning all parameters of
the base PLM, resulting in a fine-tuned copy of the model. However, full fine-tuning becomes
cumbersome when fine-tuning on multiple downstream tasks due to the massive size of state-of-the-art language models, which range from the millions [Devlin et al., 2018, Liu et al., 2019] to
billions [Brown et al., 2020] and now trillions [Fedus et al., 2022] of parameters. Full fine-tuning
also carries a risk of catastrophic forgetting [Jang et al., 2021, Chen et al., 2022], wherein the
PLM’s learned useful representation of natural language data is forgotten during fine-tuning.
To address those problems, recent research has focused on parameter-efficient tuning (PET).
Rather than fine-tuning all parameters of the base PLM, PET methods choose a small subset of
parameters to fine-tune [Zaken et al., 2021, Guo et al., 2020], or compute compressed parameter
updates [Hu et al., 2021, Mahabadi et al., 2021], or append and fine-tune a small subset of new parameters [Houlsby et al., 2019, Li and Liang, 2021, Hambardzumyan et al., 2021, He et al., 2021].
Each of these methods has its own advantages and disadvantages, but one question relevant to
all these methods is which parts of the network are most efficient to fine-tune, and what is the most
parameter-efficient way to fine-tune them?
Here we answer this question by designing and applying a fine-grained NAS method for learning PET architectures. Our method uses a first-order approximation of the loss function and is computationally efficient. We compare our approach with several hand-designed PET methods and
find that the architectures learned by our method generally achieve comparable or higher development set performance on GLUE tasks [Wang et al., 2018] for the same number of parameters.
We conclude by examining the PET architectures learned by our method and discuss the effect of
architecture design choices on parameter efficiency.
5.2 Related work
Many different PET methods exist in the literature. Adapter networks insert and fine-tune small
adapter modules to a base PLM. Rebuffi et al. [2017] introduced adapter networks to the visual
domain, and Houlsby et al. [2019] introduced adapters to transformers. Adapters have been applied to text generation [Lin et al., 2020], translation [Bapna et al., 2019], and multi-task learning
[Pfeiffer et al., 2020]. Peters et al. [2019] compare adaptation with full fine-tuning. AdapterHub
[Pfeiffer et al., 2020] enables easy sharing of adapter models. Additionally, Mosbach et al. [2020]
propose best practices for producing strong full fine-tuning baselines.
Prompt-tuning methods fine-tune a PLM by inserting prompt tokens into the input sequence.
Continuous prompts [Li and Liang, 2021, Lester et al., 2021, Hambardzumyan et al., 2021] or
discrete prompts [Shin et al., 2020] can be learned or engineered [Brown et al., 2020]. Gu et al.
[2021] demonstrate the effectiveness of pre-training prompts for low resource tasks.
Some methods fine-tune a subset of parameters [Zaken et al., 2021, Guo et al., 2020], or compute compressed parameter updates [Hu et al., 2021, Mahabadi et al., 2021]. These methods fine-tune the PLM without increasing test-time inference latency. He et al. [2021] and Mao et al. [2021]
combine multiple PET methods.
Beyond parameter-efficient tuning, NAS has previously been used to discover more parameter-efficient base language models. Cheong and Daniel [2019] use magnitude pruning to reduce the
number of parameters in BERT. Many efforts at pruning BERT have focused on pruning attention
heads from the multi-head attention (MHA) modules [Michel et al., 2019, Voita et al., 2019, Li
et al., 2021]. Sajjad et al. [2020] evaluate different ad-hoc strategies for shrinking the depth of a
BERT encoder. So et al. [2019] use an evolutionary NAS method to learn an improved transformer
cell. In contrast to NAS, distillation can be used to compress language models [Sanh et al., 2019,
Jiao et al., 2019, Sun et al., 2020].
In our experiments section, we examine the architectures learned by our algorithm and consider
what they say about which parts of the network are most parameter-efficient to fine-tune. Merchant
et al. [2020] explore a similar question, probing the network activations to understand how the
network’s representation of natural language data changes during full fine-tuning.
5.3 Method
The architecture search space we choose for our NAS method is based on BitFit [Zaken et al.,
2021] and LoRA [Hu et al., 2021], two of the most popular methods for parameter-efficient fine-tuning in the literature. We consider both structured and unstructured variants of each of these,
Method Param. MNLI SST-2 MRPC CoLA QNLI QQP RTE STSB Avg.
FFT 355M 90.6 96.0 89.2 66.8 94.6 91.6 85.2 91.5 88.2
BitFit 273k 89.2 95.6 88.2 65.0 93.9 88.1 81.9 91.4 86.7
Adapters† 3.0M 90.2 96.1 90.2 68.3 94.8 91.9 83.8 92.1 88.4
LoRA 3.4M 90.7 95.3 89.7 65.1 93.8 90.3 84.8 91.7 87.7
MaM 3.4M 90.6 95.3 89.7 65.1 93.8 90.3 84.8 91.7 87.7
S-MaM 3.4M 90.6 95.9 90.4 66.3 94.5 90.6 85.2 91.6 88.1
U-MaM 3.4M 90.3 95.8 90.7 66.8 94.1 90.8 85.9 91.8 88.3
WARP† 25k 88.2 96.0 90.8 60.6 93.5 84.5 75.8 88.6 84.8
S-BitFit 25k 84.1 94.2 70.6 40.2 88.9 83.8 56.0 76.8 74.3
U-BitFit 25k 88.8 95.5 85.3 62.1 93.5 87.7 74.0 90.3 84.6
Table 5.1: GLUE development set score for learned and hand-crafted PET architectures. We report the result for WARP† from Hambardzumyan et al. [2021] and for Adapters† from Hu et al. [2021].
where the non-zero pattern of the learned PET parameters is restricted or unrestricted, respectively.
Specifically, our search space consists of the following:
1. Learning an update ∆b for each vector of bias parameters b. In structured bias-tuning, for
each PLM module, the NAS algorithm must choose whether ∆b = 0 or not. In unstructured
bias-tuning, for each PLM module, the NAS algorithm must choose which components of
∆b should be zero or non-zero.
2. Learning a low-rank (LoRA Hu et al., 2021) update ∆W = UV ⊤ for each user-specified
parameter matrix W. The maximum possible rank for the update is also user-specified. In
structured LoRA, for each parameter matrix W, the NAS algorithm must decide what the
rank of the update UV ⊤ should be. In unstructured LoRA, the NAS algorithm must decide
which components of U and V should be non-zero.
The collection of updates ∆b and ∆W are the PET parameters. In this search space, any number of the above PET modules can be applied to a base PLM without increasing the latency of
inference, just like BitFit [Zaken et al., 2021] and LoRA [Hu et al., 2021].
5.3.1 Pruning
We perform NAS via pruning. Our NAS method begins by training a PET architecture of a maximum user-specified size: for each bias tuning module, we fine-tune all bias parameters, and for
each LoRA update module, we learn a dense low-rank update with a user-specified rank (in all
our experiments, we use rank-16 initial LoRA updates). After training the initial PET architecture,
our method decides which PET parameters to prune and which to keep. Then we re-initialize and
re-train the pruned architecture before evaluating on the validation set.
The criterion that we use to decide which PET parameters to prune is based on a first-order approximation of the change in training loss that results from pruning a PET parameter θ:

−θ · ∂L/∂θ.
Note that this is a common pruning criterion, e.g., see Molchanov et al. [2016]. This criterion is
straightforward to use when deciding whether to prune a single PET parameter, as in unstructured
bias-tuning and unstructured LoRA. For structured bias-tuning, we sum this criterion over the
entire bias update ∆b, and for structured LoRA, when considering what column of U and V to
prune, we sum the criterion over each column of U.
Pruning via evaluating the criterion at the end of training does not yield better-than-random
architectures. We observe that the value of the pruning criterion may change drastically from one
stochastic gradient descent (SGD) step to the next. To maximally smooth the noise introduced by
SGD, we instead average the pruning criterion over all training SGD steps. This yields the most
consistent indication of which PET parameters are efficient to prune.
Our NAS algorithm takes as input a parameter budget specifying the desired maximum number
of parameters in the learned PET architecture. After training the initial PET architecture and
evaluating each pruning criterion, we apply each pruning operation in increasing criterion order
until the number of parameters in the PET architecture falls below the parameter budget. This way,
pruning operations that are estimated to increase the training loss the least are applied first.
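A minimal sketch of this procedure for the unstructured case is given below (PyTorch-style code of our own; pet_params, running_crit, and budget are illustrative names rather than identifiers from our implementation):

import torch

# During training: maintain, for every PET tensor, the average over SGD steps
# of the first-order criterion  -theta * dL/dtheta  (the estimated change in
# training loss caused by pruning that parameter).
def accumulate_criterion(pet_params, running_crit, step):
    for name, p in pet_params.items():
        crit = -(p.detach() * p.grad.detach())
        if name not in running_crit:
            running_crit[name] = torch.zeros_like(crit)
        running_crit[name] += (crit - running_crit[name]) / (step + 1)

# After training: prune parameters in increasing criterion order until the PET
# size falls below the budget, i.e. keep the `budget` parameters whose removal
# is estimated to increase the training loss the most.
def prune_to_budget(pet_params, running_crit, budget):
    flat = torch.cat([running_crit[name].flatten() for name in pet_params])
    keep = torch.zeros_like(flat, dtype=torch.bool)
    keep[torch.topk(flat, k=budget).indices] = True
    masks, offset = {}, 0
    for name, p in pet_params.items():
        masks[name] = keep[offset:offset + p.numel()].view_as(p)
        offset += p.numel()
    return masks   # True = keep the PET parameter, False = prune it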
5.3.2 Initialization
Correct initialization is important for successfully applying this algorithm. After pruning, we reinitialize and re-train the learned PET architecture before evaluating on the validation set. We
find that it is important to use the same initialization after pruning as before. We believe this is a
consequence of the lottery ticket hypothesis [Frankle and Carbin, 2018].
We always initialize bias parameter updates as zero, as do other works, and find this works
well. However, we find that the initialization for LoRA given in the original paper [Hu et al.,
2021], which initializes the matrix U with zeros and V with a Gaussian distribution, is not amenable to unstructured LoRA pruning. Because the parameters in the matrix U are initialized to zero, the magnitudes of those parameters are likely to remain small throughout training relative to the magnitudes of the parameters in V⊤. Consequently, the pruning criterion for unstructured
LoRA updates is likely to favor pruning parameters from U over V, leading to an unbalanced,
parameter-inefficient LoRA update. Instead, following the same reasoning given for Kaiming initialization [He et al., 2015], we recommend the following initialization:
U ∼ N(0, 1/√m),    V ∼ N(0, 1/√n),    (5.1)

where m is the first dimension of the matrix U (i.e., the "fan-in"), and n is the second dimension of the matrix V⊤ (i.e., the "fan-out"). With this initialization, the expected square gradients for the parameters of U and V are equal.
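A short sketch of this initialization is given below (our own illustration; we read N(0, 1/√m) as a zero-mean normal with standard deviation 1/√m, and likewise for V):

import torch

def init_lora_factors(m, n, r):
    # Balanced LoRA initialization of Eq. (5.1): m = fan-in of U,
    # n = fan-out of V^T, r = LoRA rank. The LoRA update is U @ V.T.
    U = torch.randn(m, r) / m ** 0.5
    V = torch.randn(n, r) / n ** 0.5
    return U, V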
5.4 Experiments
Details of our experimental setup, including hyperparameter choices, are available in section 5.6.1
in the appendix. In all experiments we report median validation score at the end of training over 5
random initializations using the GLUE development set for validation.
[Figure 5.1: Average learned architecture for (a) structured bias-tuning and (b) unstructured bias-tuning, for RoBERTa-large with a 50k parameter budget. Each panel plots, for layers 0–23, the percentage of bias parameters fine-tuned in each module type (attention.self.query, attention.self.key, attention.self.value, attention.output.dense, attention.output.LayerNorm, intermediate.dense, output.dense, output.LayerNorm).]
5.4.1 Comparing to Full Fine-tuning
Here we present results for training larger PET architectures with the aim of achieving performance
similar to full fine-tuning, but with fewer parameters. In addition to structured or unstructured
bias-tuning, our learned PET architectures add structured or unstructured LoRA updates to the
MHA query modules, key modules, and the dense feed forward network (FFN) modules. In Table
5.1, our learned structured PET architecture is labeled S-MaM, and our learned unstructured PET
architecture is labeled U-MaM. We compare our method with a LoRA baseline [Hu et al., 2021]
and a baseline similar to Mix-and-Match (MaM) [He et al., 2021]. Our LoRA baseline fine-tunes
all bias parameters and adds rank-8 updates to all MHA query and key modules. Our MaM-like
baseline fine-tunes all bias parameters and adds rank-8 updates to all MHA query and key modules
and all FFN modules.
Results for this experiment with parameter budget 3.4M are in Table 5.1. In our S-MaM and
U-MaM experiments, we prune from an initial architecture with 6.8M parameters. We observe that
our S-MaM architecture achieves a slightly higher average GLUE [Wang et al., 2018] validation score than our MaM-like baseline, and our U-MaM architecture achieves a slightly higher average GLUE validation score than our S-MaM architecture. We conclude that structured architecture search provides a
small positive benefit over the uniform-rank baseline architecture, and that unstructured architecture search provides a small positive benefit over structured architecture search. We also observe
our U-MaM architecture achieves average GLUE validation score on par with full fine-tuning while
fine-tuning approximately 100 times fewer parameters.
5.4.2 Very Small PETs
Here we examine our learned PET architectures with parameter budget less than the total number
of bias parameters in the base PLM. For roberta-large, this is about 273k.
We use our method to learn structured and unstructured bias-tuning architectures. We compare
our method with WARP [Hambardzumyan et al., 2021] using parameter budget 25k in Table 5.1,
and report results for our method for other parameter budgets in section 5.6.2 in the appendix. Our
learned structured and unstructured bias-tuning architectures are labeled S-BitFit and U-BitFit,
respectively. In our S-BitFit and U-BitFit experiments, we prune from a PET architecture
with 273k parameters that fine-tunes all bias parameters, the same as BitFit. We observe that the
unstructured bias-tuning architecture achieves significantly higher validation performance than the
structured bias-tuning architecture with the same parameter budget. We conclude that the subset of
bias parameters that are “good” to fine-tune are not concentrated in a few modules, but rather are
distributed throughout the network. Our learned unstructured bias-tuning architecture with < 50k
parameters fine-tunes only 18% of all bias parameters while achieving validation GLUE score
only slightly less than fine-tuning all bias parameters (86.5 versus 86.7). We conclude that a vast
majority of bias parameters do not need to be fine-tuned to achieve performance comparable to
fine-tuning all bias parameters. With a parameter budget of 25k, unstructured bias tuning achieves
similar performance compared to WARP, beating or tying WARP on a majority of GLUE tasks but
achieving slightly worse average performance. We conclude that both methods are about equally
effective.
5.4.3 Interpreting Learned Architectures
Here we examine the architectures learned by our algorithm and consider what they say about
which parts of the network are most parameter-efficient to fine-tune. Each illustration discussed in
this section averages the architectures learned by our method over all GLUE tasks and five random
initializations per task. Figure 5.1a illustrates the architecture learned by our method for structured
bias-tuning with parameter budget 50k. We observe a clear preference by our algorithm for fine-tuning the biases of the intermediate.dense modules in the middle of the network. Figure
5.1b illustrates the architecture learned by our method for unstructured bias tuning with parameter
budget 50k. We observe a weak preference for fine-tuning the bias parameters of modules in the
middle of the network, but not for any particular module type within each transformer block. We
conclude that the biases that are most parameter-efficient to fine-tune are in the middle layers of
the network.
5.5 Conclusion
In this chapter, we considered the question: which parts of the network are most efficient to fine-tune, and what is the most parameter-efficient way to fine-tune them? To answer that question,
we developed a NAS algorithm based on structured and unstructured pruning. We presented experimental results on RoBERTa Large demonstrating the effectiveness of our algorithm, achieving
GLUE validation performance similar to WARP at 25k parameters (9% of all biases), similar to
BitFit at 50k parameters (18% of all biases), and similar to full fine-tuning at 3.4M parameters
(10% of all parameters). From our learned architectures we observed that the bias parameters in
the middle layers of the network are most efficient to fine-tune. We conclude that it is important to
consider where to fine-tune as well as how.
5.6 Appendix A
5.6.1 Experiment Setup
In all experiments we use the Adam optimizer [Kingma and Ba, 2014] and a linear learning rate scheduler with 6% warm-up steps. We observe that training with a higher peak learning rate works best when fine-tuning a small number of parameters. We use different peak learning rates for different experiments depending on the maximum number of parameters being fine-tuned, ranging from 10⁻⁵ for full fine-tuning to 3×10⁻⁴ for training our smallest PETs. We also train for a
different number of epochs for each GLUE task. We train for 20 epochs on MRPC, RTE, CoLA,
and STSB; 5 epochs on SST-2 and QNLI; and 2 epochs for MNLI and QQP. We observe that
extending the number of training epochs beyond these limits does not substantially affect validation
performance. In all experiments, we use batch size 16 and maximum sequence length 128.
5.6.2 Additional Experimental Results
Method #params MNLI SST-2 MRPC CoLA QNLI QQP RTE STS-B Avg.
WARP† 11k 87.6 93.0 83.8 72.9 95.4 85.6 57.4 81.0 82.1
WARP† 25k 88.2 96.0 90.8 60.6 93.5 84.5 75.8 88.6 84.8
S-BitFit 10k 70.1 92.1 70.6 0.0 73.1 73.3 52.7 22.2 56.8
S-BitFit 25k 84.1 94.2 70.6 40.2 88.9 83.8 56.0 76.8 74.3
S-BitFit 50k 87.1 94.3 72.1 51.5 91.4 86.2 59.6 86.9 78.6
S-BitFit 100k 88.2 95.0 87.7 58.8 92.4 87.4 78.7 90.4 84.8
S-BitFit 200k 89.1 95.6 88.2 63.1 93.8 87.9 81.9 91.4 86.4
U-BitFit 10k 87.4 95.1 71.1 58.8 92.2 86.3 59.6 88.3 79.8
U-BitFit 25k 88.8 95.5 85.3 62.1 93.5 87.7 74.0 90.3 84.6
U-BitFit 50k 89.1 95.8 88.5 64.8 93.8 88.0 80.9 91.1 86.5
U-BitFit 100k 89.3 95.8 88.5 63.6 93.9 87.7 81.9 91.3 86.5
U-BitFit 200k 89.4 95.6 88.5 64.8 93.9 86.5 81.9 91.4 86.5
Table 5.2: GLUE development set score for structured (S-BitFit) and unstructured (U-BitFit) bias-tuning architectures learned by our method for different parameter budgets. The results for WARP† are reported from Hambardzumyan et al. [2021].
We report results for our learned structured and unstructured bias-tuning architecture with
parameter budgets 10k, 25k, 50k, 100k, and 200k in Table 5.2. We observe that unstructured
bias-tuning holds an advantage over structured bias-tuning across all parameter budgets. We also
observe that the performance of unstructured bias-tuning begins to fall off after decreasing the
parameter budget below 50k. WARP with a parameter budget of 11k significantly outperforms
our U-BitFit method with a parameter budget of 10k on the MRPC and CoLA tasks. This difference might be explained by the difference in experimental setup (e.g., Hambardzumyan et al.
[2021] reports peak validation score whereas we report end-of-training validation score), or the
small difference in parameter budget. We believe that our method can be improved in the very
small parameter budget regime using iterative, rather than one-shot, pruning.
Chapter 6
QuAILoRA: Quantization-Aware Initialization for LoRA
QLoRA reduces the memory-cost of fine-tuning a large language model (LLM) with LoRA by
quantizing the base LLM. However, quantization introduces quantization errors that negatively impact model performance after fine-tuning. In this chapter we introduce QuAILoRA, a quantization-aware initialization for LoRA that mitigates this negative impact by decreasing quantization errors
at initialization. Our method spends a small amount of computational overhead to compute this
quantization-aware initialization, without increasing the memory-cost of fine-tuning. We evaluate our method on several causal language modeling and downstream evaluation tasks using
several different model sizes and families. We observe that almost all LLMs fine-tuned with
QuAILoRA achieve better validation perplexity. When evaluated on downstream tasks, we find
that QuAILoRA yields improvements proportional to the negative effect of quantization error. On
average, applying QuAILoRA to 4-bit QLoRA models yields 75% of the validation perplexity decrease and 86% of the downstream task accuracy increase achieved by doubling the quantization precision
to 8-bit, without increasing GPU memory utilization during fine-tuning.
6.1 Introduction
Fine-tuning state-of-the-art large language models (LLMs) requires a large amount of computational resources due to their increasingly large size. QLoRA [Dettmers et al., 2023], a quantized
version of LoRA [Hu et al., 2021], reduces the memory-cost of fine-tuning sufficiently to fine-tune
LLMs on the order of 70B parameters on a single GPU, making fine-tuning much more convenient
and accessible. Although quantization greatly reduces memory costs, it also introduces quantization errors that negatively impact the task performance of the model after fine-tuning. In this
chapter we propose Quantization-Aware Initialization for LoRA (QuAILoRA), a method for initializing the LoRA matrices of a QLoRA model to reduce quantization errors. When fine-tuning
a model with QLoRA, each parameter matrix of the fine-tuned model takes the form Q + AB⊤,
where Q is the quantization of the base parameter matrix W and AB⊤ is the low-rank LoRA update. Typically the matrix A is initialized random normal and the matrix B is initialized zero so
that the input-output mapping of the QLoRA model is the same as the quantized base model at
initialization. Instead, we propose spending a small amount of computational overhead to find an
initialization for the LoRA matrices so that the input-output mapping of the QLoRA model is more
similar to the full-precision base model at initialization.
We conduct an extensive set of experiments and establish that QuAILoRA (1) is robust to
the choice of calibration set; (2) yields better validation perplexity than the baseline initialization
across many model families and sizes on several causal LM tasks; and (3) yields consistently
positive results on downstream task evaluations for smaller, lower-precision quantized LLaMA
models. Additionally, we establish that our method is increasingly effective with larger LoRA
ranks and does not appear to affect the rate of convergence of fine-tuning compared to the baseline
initialization. QuAILoRA provides the largest benefit when the negative effect of quantization
error is significant.
6.2 Related Work
Our method improves the performance of QLoRA, a parameter-efficient fine-tuning (PEFT) technique, but many other PEFT methods exist in the literature, including adapter-based strategies
[Houlsby et al., 2019], BitFit [Zaken et al., 2021], diff-pruning [Guo et al., 2020], NAS for PEFT
[Lawton et al., 2023], and AdaLoRA [Zhang et al., 2023]. These methods and others can be combined to form PEFT design spaces [He et al., 2021].
We propose a method for computing a quantization-aware initialization of the trainable parameters of a QLoRA model [Hu et al., 2021, Dettmers et al., 2023]. Our method exploits the special
low-rank structure of the updates to efficiently compute such an initialization. However, it is possible our method could be extended to other reparameterization-based PEFT strategies that exist in the literature, such as Kronecker-product fine-tuning updates [He et al., 2022].
We seek a quantization-aware initialization of the trainable LoRA parameters of a QLoRA
model that reduces quantization errors between the QLoRA model and the full-precision model.
In order to do so, we build off techniques from the literature on post-training quantization, such as
GPT-Q [Frantar et al., 2022], OPS [Frantar and Alistarh, 2022], Bit-stitching [Wang et al., 2020],
and QuIP [Chee et al., 2023]. Like GPT-Q, we optimize a calibrated quantization objective to
decrease quantization errors on a target calibration dataset. Our method is similar in strategy to
other recent methods in the literature such as LQ-LoRA [Guo et al., 2023], LoftQ [Li et al., 2023],
and ApiQ [Liao and Monz, 2024].
6.3 Method
6.3.1 Background and notation
LoRA [Hu et al., 2021] is a parameter-efficient fine-tuning method that fine-tunes a model by
learning a low-rank update AB⊤ for each parameter matrix W, where W is an m×n matrix, A is an
m×r matrix, and B is an n×r matrix, where r is the user-specified LoRA rank. After fine-tuning,
the parameter matrix in the fine-tuned LoRA model is W + AB⊤. In contrast, QLoRA [Dettmers
et al., 2023] fine-tunes a quantized version of the model with LoRA so that the parameter matrix
in the fine-tuned QLoRA model is Q+AB⊤, where Q is the quantization of W.
Typically, A is initialized random normal and B is initialized zero. We refer to this as the
“baseline initialization”.
6.3.2 Calibrated quantization
To find a quantization-aware initialization for the LoRA matrices A and B, we minimize a calibrated quantization objective that aims to keep the activations of the QLoRA model close to those
of the full-precision base model on a given calibration dataset at initialization. For each parameter matrix W, we propose initializing the LoRA matrices A and B by minimizing a calibrated
quantization objective, defined
min_{A,B} (1/2) ∥(W − (Q + AB⊤))X∥_F² ,    (6.1)
where Q is the quantization of W, A and B are the real (stored in high-precision) LoRA matrices,
∥ · ∥F is the Frobenius norm, and X is an n×s real (stored in high-precision) matrix consisting of
the input activations of the full-precision base model to the parameter matrix W on a calibration
dataset of s examples. We minimize this objective with respect to A and B independently for each
parameter matrix W in the model before proceeding to fine-tune as usual with QLoRA.
The samples used for the calibration dataset may come from the training data of the task that
we plan to fine-tune on or from another source.
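As an illustration of how the activation statistics for this objective can be gathered, the sketch below accumulates H = XX⊤ for each target linear layer with forward hooks on the full-precision model (our own example; model, calib_loader, and target_names are illustrative names, not part of our released code):

import torch

def collect_activation_correlations(model, calib_loader, target_names):
    # Accumulate H = X X^T for each named torch.nn.Linear layer.
    H, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # Flatten batch and sequence dimensions: rows are per-token inputs.
            x = inputs[0].detach().reshape(-1, inputs[0].shape[-1])
            if name not in H:
                H[name] = torch.zeros(x.shape[-1], x.shape[-1], device=x.device)
            H[name] += x.T @ x
        return hook

    for name, module in model.named_modules():
        if name in target_names and isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        for batch in calib_loader:
            model(**batch)        # forward passes of the full-precision model

    for h in hooks:
        h.remove()
    return H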
6.3.3 Uncalibrated quantization
A related objective that we will make repeated reference to is the uncalibrated quantization objective, which does not use a calibration dataset or matrix of input activations X and which instead
aims to keep the weights, rather than the activations, of the QLoRA model close to those of the
full-precision base model at initialization. The uncalibrated quantization objective is defined
min_{A,B} (1/2) ∥W − (Q + AB⊤)∥_F² .    (6.2)
Note that if the input activations for the parameter matrix W are uncorrelated, so that XX⊤ = c·I for some scalar c, then optimizing the calibrated and uncalibrated quantization objectives is equivalent. However, we find that initializing A and B to minimize this uncalibrated quantization objective is ineffective for improving the performance of QLoRA over the baseline initialization.
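To see why the two objectives coincide in that case: writing E = W − (Q + AB⊤), we have ∥EX∥_F² = tr(E XX⊤ E⊤); when XX⊤ = c·I this equals c∥E∥_F², so the two objectives differ only by the constant factor c and share the same minimizers.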
6.3.4 Optimization
To minimize the calibrated quantization objective with respect to A and B, we propose a simple
alternating optimization algorithm. To begin the optimization, we initialize A and B by minimizing
the uncalibrated quantization objective. This optimization problem is solved by computing the
SVD of the parameter quantization error W −Q:
W − Q = UΣV⊤.    (6.3)
Let Σr be the diagonal matrix consisting of the r largest singular values of W −Q, and let Ur
and Vr be matrices consisting of the corresponding top left and right singular vectors, respectively.
Then we initialize
A = Ur√Σr,    B = Vr√Σr.    (6.4)
After initializing A and B, our algorithm proceeds to optimize the calibrated quantization objective alternately over A and B. Define the activation correlation matrix H := XX⊤. Then the
updates for A and B each involve solving an r ×r linear system:
A := (W − Q)HB(B⊤HB)⁻¹    (6.5)
B := (W − Q)⊤A(A⊤A)⁻¹    (6.6)
Since r is small, typically on the order of 64, these linear systems are computationally inexpensive to solve. Note that to compute these updates, we only need to compute and store the n×n
activation correlation matrix H rather than the n×s matrix of input activations X, which is typically
much larger.
In all our experiments with our method we use s = 2000 calibration samples (similar to Frantar
et al., 2022) and execute 20 steps of alternating optimization, so that A and B are each updated 20
times.
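Putting the pieces together, the initialization (Eqs. 6.3–6.6) can be sketched in a few lines of PyTorch. This is an illustrative re-implementation under the notation above (W and Q are m×n, H = XX⊤ is n×n, A is m×r, B is n×r), not the exact code used for our experiments:

import torch

def quail_init(W, Q, H, r, steps=20):
    # Quantization-aware LoRA initialization via alternating least squares.
    E = W - Q                                     # quantization error to absorb
    # Eqs. (6.3)-(6.4): start from the best rank-r approximation of E.
    U, S, Vh = torch.linalg.svd(E, full_matrices=False)
    A = U[:, :r] * S[:r].sqrt()                   # m x r
    B = Vh[:r, :].T * S[:r].sqrt()                # n x r
    for _ in range(steps):
        # Eq. (6.5): A := E H B (B^T H B)^{-1}
        HB = H @ B
        A = E @ HB @ torch.linalg.inv(B.T @ HB)
        # Eq. (6.6): B := E^T A (A^T A)^{-1}
        B = E.T @ A @ torch.linalg.inv(A.T @ A)
    return A, B                                   # initialize the update as Q + A B^T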
6.4 Experiments
We compare our method against a QLoRA baseline, quantized to either 4-bit or 8-bit precision, that
uses the baseline initialization: A random normal and B zero. Whenever we fine-tune a QLoRA
model, we use learning rate 2×10⁻⁴ and fine-tune for one epoch using a total batch size of 16. We perform quantization using the BitsAndBytes library [Dettmers et al., 2022], using double quantization (quantization of the affine quantization scaling constants) and the NormalFloat4 (NF4) data
type for 4-bit quantization. This quantization configuration makes quantization errors small, even
before applying our method. In all experiments, we use LoRA α = 16, gradient accumulation,
warm-up ratio 0.03, and optimize using the AdamW optimizer during fine-tuning.
We evaluate our method on publicly available LLMs across four different LLM families:
LLaMA [Touvron et al., 2023], OPT [Zhang et al., 2022], BLOOM [Scao et al., 2022], and Pythia
[Biderman et al., 2023]. We use publicly available causal language modeling datasets for calibration, training, and evaluation: Alpaca [Taori et al., 2023], Unified-Chip2 [LAION, 2023], Self-Instruct [Wang et al., 2022], HH-RLHF [Bai et al., 2022], and SlimOrca [Lian et al., 2023]. Unless
stated explicitly otherwise, we use LoRA rank r = 64 and fine-tune for 1000 steps, except for
Pythia-70m, which we fine-tune for 10k steps.
Calibrated ↓ / Fine-tuned →    alpaca    chip2    s-i    rlhf
alpaca 6.94 6.04 3.58 8.36
chip2 6.95 6.04 3.56 8.35
s-i 6.94 6.04 3.58 8.36
rlhf 6.95 6.05 3.58 8.35
QLoRA 4-bit 7.02 6.13 3.65 8.43
QLoRA 8-bit 6.87 6.01 3.57 8.35
Table 6.1: Effect of calibration dataset choice on validation perplexity after fine-tuning for 4-bit
models, averaged across six LLMs. We observe that the choice of calibration dataset does not
significantly affect validation perplexity after fine-tuning.
6.4.1 Choice of calibration set
First we evaluate the effect of the choice of the calibration dataset on performance after fine-tuning. In each experiment, we calibrate and/or fine-tune on Alpaca, Unified-Chip2, Self-Instruct,
or HH-RLHF. For each choice of calibration dataset and fine-tuning dataset, we report the validation perplexity after fine-tuning, averaged over six LLMs: Pythia-12b, Pythia-410m, Pythia-70m,
BLOOM-3b, BLOOM-560m, and LLaMa-13b.
The results are in Table 6.1. For comparison, for each fine-tuning dataset we include the average validation perplexity after fine-tuning for the 4-bit and 8-bit baseline QLoRA models. We
observe that the choice of calibration dataset does not significantly affect task performance after
fine-tuning: the difference in performance between our method and the 4-bit and 8-bit QLoRA
baselines for any calibration dataset choice is much larger than the difference between the performance of our method with different calibration dataset choices. We conclude that our method is
robust to the choice of calibration dataset.
6.4.2 Perplexity after fine-tuning
Here we compare the validation perplexity of 4-bit and 8-bit QLoRA models initialized with our
method versus the baseline initialization after fine-tuning. Each QuAILoRA model in this section
is calibrated on Alpaca.
Table 6.2: Validation perplexity results
Model Bits Method Avg.
LLaMA-7b 4 QLoRA 3.51
LLaMA-7b 4 Ours 3.49
LLaMA-7b 8 QLoRA 3.49
LLaMA-7b 8 Ours 3.48
LLaMA-13b 4 QLoRA 3.33
LLaMA-13b 4 Ours 3.32
LLaMA-13b 8 QLoRA 3.32
LLaMA-13b 8 Ours 3.31
LLaMA-30b 4 QLoRA 3.30
LLaMA-30b 4 Ours 3.30
LLaMA-30b 8 QLoRA 3.31
LLaMA-30b 8 Ours 3.29
OPT-13b 4 QLoRA 3.77
OPT-13b 4 Ours 3.71
OPT-30b 4 QLoRA 3.66
OPT-30b 4 Ours 3.60
BLOOM-560m 4 QLoRA 6.84
BLOOM-560m 4 Ours 6.73
BLOOM-560m 8 QLoRA 6.73
BLOOM-560m 8 Ours 6.76
BLOOM-3b 4 QLoRA 4.82
BLOOM-3b 4 Ours 4.75
BLOOM-3b 8 QLoRA 4.78
BLOOM-3b 8 Ours 4.76
Pythia-70m 4 QLoRA 10.98
Pythia-70m 4 Ours 10.80
Pythia-70m 8 QLoRA 10.72
Pythia-70m 8 Ours 10.69
Pythia-410m 4 QLoRA 6.73
Pythia-410m 4 Ours 6.67
Pythia-410m 8 QLoRA 6.57
Pythia-410m 8 Ours 6.54
Pythia-12b 4 QLoRA 5.14
Pythia-12b 4 Ours 5.11
Pythia-12b 8 QLoRA 5.09
Pythia-12b 8 Ours 5.08
(a) Validation perplexity after fine-tuning various LLMs on four
causal language modeling tasks, with and without QuAILoRA.
Our method provides a moderate improvement over the baseline
in the vast majority of cases.
Model Gap Closed
LLaMA-7b 100%
LLaMA-13b 61%
LLaMA-30b N/A
OPT-13b N/A
OPT-30b N/A
BLOOM-560m 96%
BLOOM-3b 100%
Pythia-70m 69%
Pythia-410m 37%
Pythia-12b 64%
Avg. 75%
(b) The gap in validation perplexity after fine-tuning between QLoRA 4-bit and QLoRA 8-bit quantization closed by QuAILoRA. The result for LLaMA-30b is omitted because the 8-bit model underperforms the 4-bit model. The results for OPT-13b and OPT-30b are omitted because we were not able to generate results for 8-bit quantization.
The average validation perplexity after fine-tuning on Alpaca, Unified-Chip2, Self-Instruct, or
HH-RLHF is in Table 6.2a, and a breakdown of this average by task is in Appendix Table 6.4.
Results for the 8-bit OPT models are omitted due to errors encountered while fine-tuning these
models using the BitsAndBytes and peft libraries. We observe that in the vast majority of cases,
fine-tuning a 4-bit or 8-bit QLoRA model from our initialization achieves lower validation perplexity than fine-tuning from the baseline initialization. In a small number of cases, the model
initialized with our method achieves worse validation perplexity compared to the baseline initialization, which we present as failure cases.
In most cases, the 4-bit QuAILoRA model outperforms the 4-bit QLoRA baseline and underperforms the 8-bit QLoRA baseline. We can use the results in Table 6.2a to compare the decrease
in validation perplexity yielded by applying our method to a 4-bit QLoRA model versus doubling
the quantization precision of the 4-bit QLoRA model to 8-bit. For each model, we can compute the
gap in average validation perplexity closed by our method as the difference in average validation
perplexity between the 4-bit QLoRA and QuAILoRA models, divided by the difference in average
validation perplexity between the 4-bit and 8-bit QLoRA models, capping the computed ratio at 1.
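In symbols, the quantity reported for each model is

gap closed = min(1, (PPL(4-bit QLoRA) − PPL(4-bit QuAILoRA)) / (PPL(4-bit QLoRA) − PPL(8-bit QLoRA))),

computed from the average validation perplexities in Table 6.2a.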
The results of this analysis are in Table 6.2b. We exclude LLaMA-30b from this analysis, since we
observed the 8-bit baseline model underperform the 4-bit baseline on average for this model. We
also exclude the OPT-13b and OPT-30b LLMs since we could not generate results for the 8-bit versions of these models. Averaging across the other models, the average gap in perplexity between
4-bit and 8-bit quantization closed by applying our method to the 4-bit QLoRA models is 75%. We
conclude that applying our method to 4-bit quantized QLoRA models yields approximately 75%
of the decrease in validation perplexity achieved by doubling the quantization precision, without
increasing GPU memory utilization during fine-tuning.
We observe that in the LLaMA family of models, quantization error does not appear to significantly negatively affect validation perplexity after fine-tuning: the difference in validation perplexity after fine-tuning between the 4-bit and 8-bit baseline QLoRA models is small.
Table 6.3: Downstream Task Results
Model Bits Method Avg.
LLaMA-7b 4 QLoRA 62.1
LLaMA-7b 4 Ours 62.8
LLaMA-7b 8 QLoRA 63.0
LLaMA-7b 8 Ours 63.1
LLaMA-13b 4 QLoRA 65.4
LLaMA-13b 4 Ours 65.8
LLaMA-13b 8 QLoRA 65.8
LLaMA-13b 8 Ours 65.7
(a) Downstream task accuracy averaged across 7
downstream tasks for LLaMA models fine-tuned
on Alpaca.
Model Gap Closed
LLaMA-7b 74%
LLaMA-13b 100%
(b) The gap in average accuracy between 4-bit
and 8-bit QLoRA models closed by QuAILoRA
for models fine-tuned on Alpaca.
Model Bits Method Avg.
LLaMA-7b 4 QLoRA 63.2
LLaMA-7b 4 Ours 63.9
LLaMA-7b 8 QLoRA 63.8
LLaMA-7b 8 Ours 63.9
LLaMA-13b 4 QLoRA 66.7
LLaMA-13b 4 Ours 67.0
LLaMA-13b 8 QLoRA 67.2
LLaMA-13b 8 Ours 67.1
(c) Downstream task accuracy averaged across 7
downstream tasks for LLaMA models fine-tuned
on SlimOrca.
Model Gap Closed
LLaMA-7b 89%
LLaMA-13b 84%
(d) The gap in average accuracy between 4-bit
and 8-bit QLoRA models closed by QuAILoRA
for models fine-tuned on SlimOrca.
Since our method improves performance by reducing quantization error, we observe that the gain in performance provided by our method over the baselines for the LLaMA models is proportionately
small. For the other families, the difference in performance between the 4-bit and 8-bit baselines
is significant enough for our method to provide a larger advantage over the baselines.
From these results, we conclude that QuAILoRA reduces validation perplexity after fine-tuning
on average, proportional to the negative effect of quantization error.
6.4.3 Performance on downstream tasks
Here we compare how LLaMA models fine-tuned with QuAILoRA versus QLoRA on Alpaca or
SlimOrca perform on seven downstream tasks: Arc-Challenge (Arc-C) [Clark et al., 2018], Arc-Easy (Arc-E), BoolQ [Clark et al., 2019], HellaSwag (HS) [Zellers et al., 2019], OpenBookQA
(OBQA) [Mihaylov et al., 2018], PIQA [Bisk et al., 2020], and WinoGrande (WinoG) [Keisuke
et al., 2019]. We use the EleutherAI LM Evaluation Harness [Gao et al., 2021] for evaluation. For
Alpaca experiments, we calibrate on Alpaca and fine-tune for one epoch. For SlimOrca experiments, we calibrate on SlimOrca and fine-tune on a random size-10000 subset of SlimOrca.
The average accuracy achieved across the evaluation tasks is in Tables 6.3a and 6.3c, and
a breakdown of this average by task is in Appendix Table 6.5. The gap in accuracy between
4-bit and 8-bit quantization closed by applying our method to each 4-bit model is computed in
Tables 6.3b and 6.3d. We compute the gap closed for the downstream task experiments in the
same way as for the perplexity experiments in the previous subsection. Averaged over all the
downstream task experiments, the average gap closed by our method is approximately 86%. We
conclude that our method improves the downstream task performance of QLoRA on average, and
that applying our method to 4-bit quantized QLoRA models yields approximately 86% of the
increase in downstream task performance achieved by doubling the quantization precision, without
increasing GPU memory utilization during fine-tuning.
6.4.4 Effect of LoRA rank
Here we examine the effect of the choice of the LoRA rank hyperparameter r on performance
after fine-tuning. We expect that using larger r will allow our initialization of A and B to reduce
a greater part of the quantization error and result in better performance after fine-tuning. A plot
illustrating the effect of changing the LoRA rank when using our initialization versus the baseline
initialization, averaged across six 4-bit LLMs (excluding Pythia-70m, LLaMa-30b and OPT-30b)
and 4 causal language modeling tasks, is in Figure 6.1. We observe that the performance of QLoRA
generally increases as we increase the LoRA rank r, albeit with diminishing returns. In contrast,
[Figure 6.1: Effect of LoRA rank (8, 16, 32, 64, 128) on average validation perplexity after fine-tuning 4-bit models, averaged across six 4-bit LLMs and 4 causal language modeling tasks; a second panel shows Pythia-70m separately. Increasing the LoRA rank results in a continual decrease in average validation perplexity when initializing with our method, albeit with diminishing returns. In contrast, increasing the LoRA rank does not significantly affect performance when using the baseline initialization. We plot results for Pythia-70m separately (not included in the six-model average) as this was the only baseline to show a strong decrease in validation perplexity with increasing LoRA rank.]
the choice of r does not appear to significantly affect the performance of QLoRA when using the
baseline initialization.
6.4.5 Convergence of fine-tuning
Here we examine how our method affects the speed of convergence of fine-tuning. In Figure 6.2,
we plot the convergence curve for each of the fine-tuning experiments used to generate the validation perplexity results in Table 6.2a. We plot fine-tuning steps on the horizontal axis and the
average validation perplexity across the four causal language modeling tasks, measured every 100
fine-tuning steps, on the vertical axis. We observe that the 4-bit QLoRA models, when initialized
with our method, achieve lower average validation perplexity at all stages of fine-tuning compared
to the 4-bit QLoRA baselines. The difference in performance between the 8-bit QLoRA models
fine-tuned with and without our method is on average much smaller compared to the 4-bit models,
likely because the quantization error is already quite small for the 8-bit models and there are diminishing returns for reducing quantization error. From these plots, it appears that fine-tuning does
not converge more quickly or slowly when initialized with our method compared to the baseline
[Figure 6.2: Fine-tuning convergence (average validation perplexity versus fine-tuning steps) for each of our 9 base models — LLaMA-13b, LLaMA-30b, OPT-13b, OPT-30b, BLOOM-560m, BLOOM-3b, Pythia-70m, Pythia-410m, and Pythia-12b — comparing QLoRA 4-bit, QLoRA 8-bit, QuAILoRA 4-bit, and QuAILoRA 8-bit. We also include a six-model average convergence curve that excludes OPT-13b and OPT-30b (due to the perplexity spikes in the middle of fine-tuning) as well as Pythia-70m (which we fine-tune for 10 times as many steps as the other models).]
initialization. Rather, the benefit of decreased validation perplexity after fine-tuning observed for
our method is due to decreased validation perplexity at initialization from decreased quantization
error.
6.5 Conclusion
In this chapter we introduced QuAILoRA, a method for increasing the performance of QLoRA
without additional memory cost during fine-tuning by initializing the LoRA matrices to decrease
quantization error. To find such an initialization, we optimized a calibrated quantization objective
using alternating optimization, solving a small r × r linear system in each step. In our experiments, we demonstrated that our LoRA initialization can moderately improve the performance
of QLoRA when the impact of quantization errors is significant. Furthermore, we found that our
results are robust to the choice of the calibration dataset.
6.6 Appendix A: Full Results Tables
Method Model Arc-C Arc-E BoolQ HS OBQA PIQA WinoG avg.
QLoRA 7b 4-bit 41.8±1.5 71.3±1.0 73.9±0.5 55.7±0.2 31.5±0.7 77.6±0.3 82.8±0.2 62.1
Ours 7b 4-bit 42.2±1.5 73.1±0.9 75.7±0.4 55.8±0.2 31.9±0.7 77.9±0.3 82.9±0.2 62.8
QLoRA 7b 8-bit 43.3±1.5 72.3±0.9 76.8±0.4 55.9±0.2 32.1±0.7 77.8±0.3 83.0±0.2 63.0
Ours 7b 8-bit 42.8±1.5 73.3±0.9 76.2±0.4 56.4±0.2 32.5±0.7 77.8±0.3 83.0±0.2 63.1
QLoRA 13b 4-bit 44.6±1.5 74.3±0.9 82.6±0.4 59.2±0.2 33.0±0.7 79.4±0.3 84.9±0.2 65.4
Ours 13b 4-bit 46.6±1.5 74.5±0.9 83.0±0.4 58.5±0.2 33.2±0.7 79.5±0.3 85.2±0.2 65.8
QLoRA 13b 8-bit 45.8±1.5 74.7±0.9 82.6±0.4 58.9±0.2 33.5±0.7 79.5±0.3 85.4±0.2 65.8
Ours 13b 8-bit 45.6±1.5 74.5±0.9 82.8±0.4 58.8±0.2 33.3±0.7 79.5±0.3 85.6±0.2 65.7
(a) Downstream task performance for LLaMA models fine-tuned on Alpaca.
Method Model Arc-C Arc-E BoolQ HS OBQA PIQA WinoG avg.
QLoRA 7b 4-bit 41.6±1.5 71.8±0.9 80.2±0.4 57.2±0.2 31.5±0.7 77.1±0.3 83.2±0.2 63.2
Ours 7b 4-bit 42.4±1.5 73.3±0.9 80.9±0.4 57.6±0.2 31.9±0.7 77.4±0.3 83.7±0.2 63.9
QLoRA 7b 8-bit 41.8±1.5 72.8±0.9 81.2±0.4 57.9±0.2 32.0±0.7 77.6±0.3 83.5±0.2 63.8
Ours 7b 8-bit 41.6±1.5 73.0±0.9 81.5±0.4 58.1±0.2 31.9±0.7 77.5±0.3 83.7±0.2 63.9
QLoRA 13b 4-bit 48.0±1.5 77.3±0.9 84.1±0.4 59.8±0.2 33.4±0.7 79.4±0.3 85.2±0.2 66.7
Ours 13b 4-bit 48.3±1.5 77.8±0.9 83.9±0.4 59.9±0.2 34.0±0.7 79.6±0.3 85.6±0.2 67.0
QLoRA 13b 8-bit 48.8±1.5 78.1±0.9 84.6±0.4 60.1±0.2 34.0±0.7 79.5±0.3 85.5±0.2 67.2
Ours 13b 8-bit 48.8±1.5 78.3±0.9 83.5±0.4 60.2±0.2 34.0±0.7 79.5±0.3 85.5±0.2 67.1
(b) Downstream task performance for LLaMA models fine-tuned on SlimOrca.
Table 6.5: Downstream task accuracy of QLoRA and QuAILoRA models. Error bars reported are
one standard error.
Table 6.4: Validation perplexity of QLoRA and QuAILoRA models.
Model Bits Method Alpaca Chip2 Self-Instruct HH-RLHF Avg.
LLaMA-7b 4 QLoRA 3.69 3.44 2.08 4.84 3.51
LLaMA-7b 4 Ours 3.65 3.44 2.09 4.79 3.49
LLaMA-7b 8 QLoRA 3.65 3.43 2.08 4.82 3.49
LLaMA-7b 8 Ours 3.65 3.41 2.06 4.81 3.48
LLaMA-13b 4 QLoRA 3.45 3.24 2.11 4.52 3.33
LLaMA-13b 4 Ours 3.44 3.24 2.11 4.50 3.32
LLaMA-13b 8 QLoRA 3.44 3.23 2.09 4.52 3.32
LLaMA-13b 8 Ours 3.43 3.23 2.09 4.51 3.31
LLaMA-30b 4 QLoRA 3.42 3.22 2.10 4.48 3.30
LLaMA-30b 4 Ours 3.42 3.22 2.10 4.46 3.30
LLaMA-30b 8 QLoRA 3.44 3.21 2.10 4.48 3.31
LLaMA-30b 8 Ours 3.42 3.20 2.08 4.47 3.29
OPT-13b 4 QLoRA 3.89 3.68 2.25 5.27 3.77
OPT-13b 4 Ours 3.82 3.59 2.21 5.24 3.71
OPT-30b 4 QLoRA 3.78 3.57 2.14 5.15 3.66
OPT-30b 4 Ours 3.68 3.48 2.13 5.11 3.60
BLOOM-560m 4 QLoRA 6.85 6.40 3.67 10.46 6.84
BLOOM-560m 4 Ours 6.73 6.27 3.60 10.34 6.73
BLOOM-560m 8 QLoRA 6.70 6.31 3.60 10.31 6.73
BLOOM-560m 8 Ours 6.71 6.36 3.63 10.33 6.76
BLOOM-3b 4 QLoRA 4.71 4.44 2.63 7.51 4.82
BLOOM-3b 4 Ours 4.64 4.35 2.59 7.45 4.75
BLOOM-3b 8 QLoRA 4.66 4.39 2.60 7.45 4.78
BLOOM-3b 8 Ours 4.63 4.36 2.59 7.44 4.76
Pythia-70m 4 QLoRA 13.55 11.08 6.26 13.03 10.98
Pythia-70m 4 Ours 13.39 10.86 6.08 12.87 10.80
Pythia-70m 8 QLoRA 13.18 10.73 5.97 13.00 10.72
Pythia-70m 8 Ours 13.13 10.74 6.00 12.90 10.69
Pythia-410m 4 QLoRA 7.61 6.57 4.18 8.55 6.73
Pythia-410m 4 Ours 7.57 6.52 4.09 8.50 6.67
Pythia-410m 8 QLoRA 7.42 6.40 4.12 8.36 6.57
Pythia-410m 8 Ours 7.35 6.38 4.06 8.36 6.54
Pythia-12b 4 QLoRA 5.93 5.06 3.08 6.50 5.14
Pythia-12b 4 Ours 5.90 5.00 3.03 6.50 5.11
Pythia-12b 8 QLoRA 5.83 5.00 3.04 6.48 5.09
Pythia-12b 8 Ours 5.81 4.99 3.03 6.47 5.08
Chapter 7
RAG Joint Fine-tuning
Retrieval augmented generation (RAG) is a popular pipeline for question answering that is powered by two LLMs: an embedding model that retrieves context documents that are relevant to the
question from a database, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned independently of
one another to increase the performance of the RAG pipeline on a new task, but joint fine-tuning may
further increase performance. However, the problem of efficient joint fine-tuning remains an open
challenge. In this chapter, we propose an efficient method for jointly fine-tuning the embedding
and generator models of a RAG pipeline that works by rewarding the embedding model for retrieving context documents that actually improve the generator model’s prediction for the answer.
In our experiments, we empirically evaluate our joint fine-tuning method and find it can yield a
significant positive performance boost over independent fine-tuning, as much as +5.3% EM on
HotPotQA.
7.1 Introduction
Retrieval augmented generation (RAG) is a popular pipeline for performing NLP tasks like question answering. RAG is powered by two LLMs: an embedding model, and a generator model. The
embedding model computes a low-dimensional vector representation of a given input text, such
as a question, which is then used to perform an approximate nearest-neighbors search in a vector
database to retrieve context documents that contain relevant information for answering the question. The generator model then takes the concatenated question and retrieved context documents
as input and generates a prediction for the answer to the question.
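To make the pipeline concrete, the following is a minimal sketch of a single RAG query over a FAISS index of precomputed document embeddings; the embed and generate callables stand in for the embedding and generator models, and the prompt format is an illustrative assumption rather than the exact setup used in our experiments.

import numpy as np
import faiss  # nearest-neighbor search over the document embeddings

def rag_answer(question, index, documents, embed, generate, k=5):
    # embed(text) is assumed to return a 1-D numpy array; generate(prompt) returns a string.
    q = embed(question).astype(np.float32).reshape(1, -1)
    faiss.normalize_L2(q)                    # cosine similarity via inner product
    _, ids = index.search(q, k)              # ids: (1, k) row indices into documents
    context = "\n\n".join(documents[i] for i in ids[0])
    # Concatenate the question and retrieved context and generate an answer.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)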
Both the embedding model and generator model can be fine-tuned to improve the performance
of the RAG pipeline. Given a dataset of (question, context) pairs, the embedding model can be
fine-tuned to retrieve more relevant context documents by minimizing the distance between the
embedding vectors of the question and context, according to some metric (usually cosine similarity or L2 distance). Note that when we fine-tune the embedding model of a RAG pipeline,
the embedding vectors of the context documents stored in the vector database remain unchanged.
The generator model can be fine-tuned given a dataset of (question, context, answer) triplets by
minimizing the negative log-likelihood loss of the generator’s prediction for the answer given the
question and context.
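For concreteness, the two independent objectives described above can be sketched as follows; this is an illustrative PyTorch sketch, with a cosine-similarity loss standing in for whichever embedding metric is used.

import torch.nn.functional as F

def embedding_finetune_loss(q_emb, ctx_emb):
    # Pull each question embedding toward its labeled context embedding
    # (an L2 variant would instead minimize a squared distance).
    return (1.0 - F.cosine_similarity(q_emb, ctx_emb, dim=-1)).mean()

def generator_finetune_loss(answer_logits, answer_token_ids):
    # Negative log-likelihood of the gold answer tokens given (question, context);
    # answer_logits: (num_tokens, vocab_size), answer_token_ids: (num_tokens,)
    return F.cross_entropy(answer_logits, answer_token_ids)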
Although the embedding and generator models can be fine-tuned independently of one another,
there may be an opportunity to achieve better performance by fine-tuning them jointly. Lewis et al.
[2020] describe two methods, RAG-Token and RAG-Sequence, for training the embedding and
generator models jointly. These methods pass one context document at a time as input to the generator model to make a prediction for the answer for each context document, then combine the
predictions in a weighted average. The methods differ in how this weighted average is computed.
These methods yield an objective that is differentiable with respect to both the embedding and
generator models’ weights, but does not accurately reflect how RAG operates at test time, when all
context documents are concatenated and passed as input to the generator model to predict the answer. This discrepancy becomes especially problematic when the information required to correctly
answer a question is spread across multiple context documents. We also note that Contextual AI
Team [2024] claims to perform end-to-end fine-tuning, but does not explain how.
In this chapter, we propose an efficient method for jointly fine-tuning the embedding and generator models of a RAG pipeline. Our method fine-tunes the embedding model to retrieve documents
that actually improve the generator model’s prediction of the answer. Our method works by comparing the generator’s loss function when one of the retrieved context documents is swapped out
for another random document, then minimizing the distance between the embedding vector of the
question and the embedding vector of the document that achieved lower generator loss when used
as part of the context. Our method also fine-tunes the generator using the full retrieved context to
allow it to learn to synthesize information spread across multiple context documents to accurately
answer the question.
We evaluate our method on PubMedQA [Jin et al., 2019] and HotPotQA [Yang et al., 2018] using an MPNet sentence transformer [Reimers and Gurevych, 2019] embedding model and LLaMA3-
8b [AI@Meta, 2024] generator model. Our results show that our method can provide a significant
performance boost over independent fine-tuning.
7.2 Method
Designing a RAG joint fine-tuning algorithm is challenging because the operation that concatenates
the retrieved context documents before passing them as input to the generator is not differentiable
with respect to the embedding model’s parameters. We propose a solution based on randomly
swapping out one of the retrieved context documents for another document. Here we describe the
procedure for executing one step of our joint fine-tuning algorithm.
If a RAG pipeline would retrieve k context documents, we first instead retrieve 2k documents.
Then we ask the generator model to generate a prediction for the answer using the top k most
relevant documents as context, as we usually would. Denote the generator's loss function when using the top k relevant documents as L∗.
Then we construct a new collection of context documents by randomly choosing one of the top k relevant documents and swapping it out for one of the top k+1, ..., 2k most relevant documents (we call these the “holdout” documents), chosen uniformly at random. We then ask the generator model to generate a prediction for the answer using this collection of documents as context. Denote the generator's loss when using this set of documents as context as L′. Denote the document that was swapped out d∗, and the holdout document that was swapped in d′.
If L∗ ≤ L′, then the generator achieves a lower loss when using d∗ as context versus d′, and so we update the parameters of the embedding model by minimizing the distance between the embedding vector of the question and the embedding vector of d∗ with one step of SGD. If L′ < L∗, then the generator achieves a lower loss using the holdout document d′ versus the original context document d∗, so we update the parameters of the embedding model by minimizing the distance between the embedding vector of the question and the embedding vector of d′ with one step of SGD. We also update the parameters of the generator model by minimizing the generator loss using the original top k relevant documents as context with one step of SGD.
By considering only the top k+1, ..., 2k most relevant documents to swap into the context rather than considering all documents indexed in the vector database, we increase the likelihood that we swap in a document that actually decreases the generator loss. Compared to RAG-Token and RAG-Sequence, this swapping method only makes two calls to the generator per optimization step instead of k, and so our method is comparatively less compute-expensive and less memory-expensive per step.
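The sketch below shows one step of this procedure schematically; the retrieve, embed_model, and generator.loss interfaces, and the use of a cosine-similarity embedding loss, are illustrative assumptions rather than the exact implementation.

import random
import torch
import torch.nn.functional as F

def joint_step(question, answer, retrieve, embed_model, generator,
               embed_opt, gen_opt, k=5):
    docs = retrieve(question, top_k=2 * k)       # 2k candidates, most relevant first
    top_k, holdout = docs[:k], docs[k:]

    # L*: generator loss with the usual top-k context (kept differentiable
    # so it can also drive the generator update below).
    loss_star = generator.loss(question, top_k, answer)

    # L': swap one random top-k document for a random holdout document.
    i = random.randrange(k)
    d_star, d_prime = top_k[i], random.choice(holdout)
    swapped = top_k[:i] + [d_prime] + top_k[i + 1:]
    with torch.no_grad():
        loss_prime = generator.loss(question, swapped, answer)

    # Reward whichever document gave the lower generator loss.
    winner = d_star if loss_star.item() <= loss_prime.item() else d_prime
    q_emb, d_emb = embed_model(question), embed_model(winner)
    emb_loss = (1.0 - F.cosine_similarity(q_emb, d_emb, dim=-1)).mean()
    embed_opt.zero_grad()
    emb_loss.backward()
    embed_opt.step()

    # Fine-tune the generator on the original top-k context.
    gen_opt.zero_grad()
    loss_star.backward()
    gen_opt.step()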
7.3 Experiments
We evaluate the effectiveness of the various fine-tuning methods discussed in this section using
a RAG pipeline consisting of an MPNet sentence transformer [Reimers and Gurevych, 2019] embedding model and a LLaMA3-8b [AI@Meta, 2024] generator model. We use the PubMedQA [Jin
et al., 2019] and HotPotQA [Yang et al., 2018] datasets for fine-tuning and evaluation. Note that
for PubMedQA, we use the 211.3k artificial QA instances split for fine-tuning. Our retrieval system uses the embedding model to retrieve the top k = 5 most relevant documents from PubMed (https://pubmed.ncbi.nlm.nih.gov/) for PubMedQA, and from Wikipedia (https://huggingface.co/datasets/legacy-datasets/wikipedia) for HotPotQA. We use the same chunking of PubMed and Wikipedia as Xiong et al. [2024]. Each corpus contains over 20M chunks. We construct a vector database from each corpus using a FAISS index [Johnson et al., 2019].
Method PubMedQA (EM) HotPotQA (EM)
Baseline 59.7% 24.2%
Fine-tuned 58.7% 24.3%
Table 7.1: Fine-tuning the embedding model.
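For concreteness, a minimal sketch of building such an index from precomputed chunk embeddings is given below; a flat inner-product index is used here for simplicity (for corpora of over 20M chunks an approximate FAISS index would be the more typical choice), and the function name is illustrative.

import numpy as np
import faiss

def build_faiss_index(chunk_embeddings):
    # chunk_embeddings: (num_chunks, dim) array of document-chunk embeddings.
    xb = np.ascontiguousarray(chunk_embeddings, dtype=np.float32)
    faiss.normalize_L2(xb)                   # normalized vectors => inner product = cosine
    index = faiss.IndexFlatIP(xb.shape[1])   # exact inner-product search
    index.add(xb)
    return index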
To improve our understanding of how fine-tuning affects the end-to-end performance of RAG,
we begin by evaluating the effectiveness of fine-tuning just the embedding model or just the generator model.
7.3.1 Fine-tuning the Embedding Model
Here we evaluate how fine-tuning just the embedding model impacts the end-to-end performance of
the RAG pipeline. Fine-tuning the embedding model uses a dataset of labeled (query, context) pairs
rather than retrieving context documents from the vector database. In each of these experiments,
we fine-tune for 1 epoch with a multiple negatives ranking loss [Henderson et al., 2017]. We then
plug the fine-tuned embedding model into the RAG pipeline described at the beginning of this
section and evaluate on each dataset’s validation split. The exact match (EM) metric achieved by
the RAG pipeline after fine-tuning for each evaluation dataset is in Table 7.1.
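For reference, a typical implementation of the multiple negatives ranking loss with in-batch negatives is sketched below; the scale factor and function name are assumptions of the sketch.

import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(q_emb, ctx_emb, scale=20.0):
    # q_emb, ctx_emb: (batch, dim). Row i of ctx_emb is the positive context for
    # question i; all other contexts in the batch serve as negatives.
    q = F.normalize(q_emb, dim=-1)
    c = F.normalize(ctx_emb, dim=-1)
    scores = scale * (q @ c.T)                           # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)    # positives on the diagonal
    return F.cross_entropy(scores, labels)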
We observe that the end-to-end performance of the RAG pipeline with the fine-tuned embedding model decreases by 1.0% on PubMedQA and increases by only 0.1% on HotPotQA. This is
surprising, since we expect fine-tuning to increase performance. The likely explanation for this
result is that fine-tuning the embedding model uses the labeled (query, context) training datasets
of PubMedQA or HotPotQA rather than the full collection of PubMed or Wikipedia documents
that the RAG pipeline actually retrieves from. In this way, the embedding model is limited in the
context documents it sees and cannot reliably learn how to retrieve documents that will actually
improve the performance of the larger RAG pipeline compared to the baseline embedding model.
Method PubMedQA (EM) HotPotQA (EM)
Baseline 59.7% 24.2%
Fine-tuned 71.8% 25.6%
Table 7.2: Fine-tuning the generator model.
Since the PubMed and Wikipedia corpora used by the RAG pipeline contain only context documents and do not have their own labeled datasets of (question, context) pairs, they cannot be used
to fine-tune the embedding model.
7.3.2 Fine-tuning the Generator Model
Here we evaluate how fine-tuning just the generator model impacts the end-to-end performance
of the RAG pipeline. Fine-tuning the generator model uses a dataset of labeled (query, context,
answer) triplets. Since each example from the training dataset includes labeled context, fine-tuning
the generator does not require retrieving context documents and so it is totally independent of the
embedding model. In each of these experiments, we fine-tune for 1 epoch with a standard cross-entropy loss. We then plug the fine-tuned generator model into the RAG pipeline described at the
beginning of this section and evaluate on each dataset’s validation split. The exact match (EM)
metric achieved by the RAG pipeline after fine-tuning for each evaluation dataset is in Table 7.2.
We observe a large gain in performance when fine-tuning the generator model for PubMedQA
(+12.1%), and a much smaller gain when fine-tuning on HotPotQA (+1.4%). This is likely because the baseline generator model was pretrained on plain text collected from across the Web,
probably similar to text contained in Wikipedia [Dubey et al., 2024]. Since we retrieve context
documents from Wikipedia for HotPotQA, the generator model is already familiar with the style
and vocabulary of the context documents that we pass to it as input, and so fine-tuning the generator
model only yields marginal benefits. On the other hand, we retrieve from PubMed for PubMedQA.
Documents from this corpus contain domain-specific jargon and highly technical language, much
different from what the pretrained generator model is used to. Therefore, we expect allowing the generator to familiarize itself with the language used in this new technical domain via fine-tuning to yield a significant boost in performance.
Method PubMedQA (EM) HotPotQA (EM)
Baseline 59.7% 24.2%
Independent Fine-tuning 68.5% 25.6%
Joint Fine-tuning (Ours) 69.0% 30.9%
Table 7.3: RAG Fine-tuning.
7.3.3 Joint Fine-tuning
Here we evaluate how jointly fine-tuning the embedding and generator models with our method
impacts the end-to-end performance of the RAG pipeline compared to fine-tuning them independently. The results of fine-tuning for 1 epoch using each method are in Table 7.3.
We observe a small increase in end-to-end performance for our method over independent fine-tuning for PubMedQA (+0.5%) but a more significant increase for HotPotQA (+5.3%). PubMedQA is a relatively less challenging problem for RAG, since for each question there is exactly
one document in the retrieval database that contains the information needed to correctly answer the
question. In this case, we expect independent fine-tuning to do about as well as joint fine-tuning.
In contrast, for each question in HotPotQA, the information needed to correctly answer the
question is often spread across multiple documents. In this case, joint fine-tuning trains the embedding
model to retrieve a good collection of context documents that together improve the answer prediction generated by the generator model, and trains the generator model to synthesize information
spread across the context documents to correctly answer the question.
7.4 Conclusion
In this chapter, we introduced a method for jointly fine-tuning the embedding and generator models
of a RAG pipeline. We demonstrated the effectiveness of our method by evaluating on PubMedQA
and HotPotQA, observing a small to significant benefit for joint fine-tuning. We also used our
evaluation to draw insights into how we expect fine-tuning to affect the end-to-end performance of
RAG.
Chapter 8
Conclusion
In this dissertation, we presented a perspective of machine learning that views feature learning as
the fundamental strategy by which modern deep machine learning models solve complex problems.
We presented a diverse collection of works that demonstrate how this perspective can serve as a
guide for designing better machine learning algorithms for optimization, neural architecture search,
and efficient fine-tuning.
In Chapters 2 and 3, we proposed novel optimization algorithms built upon the strategy of
partitioning and attributing the training loss to the model features. This yields a novel avenue for
efficient second-order learning. These methods could be further developed to make second-order
learning more competitive with adaptive learning rate methods. In fact, the second-order updates
that we derive in these chapters appear to be similar to those of Adam; further exploration could
make the connections between our method and Adam clearer.
In Chapters 4 and 5, we proposed efficient morphism-based neural architecture search methods
that grew and pruned neural networks. While these methods focused on optimizing the widths of
neural network modules, further work could extend these methods to more complex neural network
modules, such as growing and pruning network blocks or layers. This would expand the neural
architecture search space and allow us to discover even more efficient architectures by optimizing
over the model depth.
In Chapters 6 and 7, we applied our perspective of machine learning to improve methods for
efficient LLM fine-tuning and inference. The field of LLM research is developing at a rapid pace,
with growing interest in long-context models and mitigating LLM hallucination. As NLP tasks
grow more complex, breaking down these complex problems explicitly into simpler interpretable
subproblems will continue to be useful for future research.
We hope that our local feature learning perspective presented in this thesis can be helpful for
designing more principled and effective machine learning algorithms.
Bibliography
[1] AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/
llama3/blob/main/MODEL_CARD.md.
[2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10
(2):251–276, 1998.
[3] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma,
Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful
and harmless assistant with reinforcement learning from human feedback. arXiv preprint
arXiv:2204.05862, 2022.
[4] Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Simple, scalable adaptation for neural
machine translation. arXiv preprint arXiv:1909.08478, 2019.
[5] P. Baqué, T. Bagautdinov, F. Fleuret, and P. Fua. Principled parallel mean-field inference for discrete random fields. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5848–5857, June 2016. doi: 10.1109/CVPR.2016.630.
[6] Sue Becker, Yann Le Cun, et al. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 connectionist models summer
school, pages 29–37, 1988.
[7] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle
O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai
Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across
training and scaling. In International Conference on Machine Learning, pages 2397–2430.
PMLR, 2023.
[8] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on
Artificial Intelligence, 2020.
[9] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for
statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017. doi:
10.1080/01621459.2017.1285773. URL https://doi.org/10.1080/01621459.2017.
1285773.
[10] Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 557–565. JMLR. org, 2017.
[11] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot model
architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
[12] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems,
33:1877–1901, 2020.
[13] Charles G Broyden. Quasi-Newton methods and their application to function minimisation.
Mathematics of Computation, 21(99):368–381, 1967.
[14] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture
search by network transformation. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 32, 2018.
[15] Arnav Chavan, Deepak Gupta, et al. Beyond uniform scaling: Exploring depth heterogeneity in neural architectures. arXiv preprint arXiv:2402.12418, 2024.
[16] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantization of large language models with guarantees. arXiv preprint arXiv:2307.13304, 2023.
[17] Guanzheng Chen, Fangyu Liu, Zaiqiao Meng, and Shangsong Liang. Revisiting parameterefficient tuning: Are we really there yet? arXiv preprint arXiv:2202.07962, 2022.
[18] Sheng-Wei Chen, Chun-Nan Chou, and Edward Y Chang. EA-CG: An Approximate
Second-Order Method for Training Fully-Connected Neural Networks. In Proceedings of
the AAAI Conference on Artificial Intelligence, volume 33, pages 3337–3346, 2019.
[19] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via
knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
[20] Robin Cheong and Robel Daniel. transformers.zip: Compressing transformers with pruning
and quantization. Technical report, tech. rep., Stanford University, Stanford, California,
2019.
[21] C. Chow and C. Liu. Approximating discrete probability distributions with dependence
trees. IEEE Transactions on Information Theory, 14(3):462–467, May 1968. ISSN 0018-
9448. doi: 10.1109/TIT.1968.1054142.
[22] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and
Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions.
In NAACL, 2019.
[23] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa
Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2
reasoning challenge. arXiv:1803.05457v1, 2018.
[24] Adam Coates and Andrew Y Ng. Learning feature representations with k-means. In Neural
Networks: Tricks of the Trade: Second Edition, pages 561–580. Springer, 2012.
[25] Contextual AI Team. Introducing rag 2.0. 2024. URL https://contextual.ai/
introducing-rag2/.
[26] Felix Dangel, Philipp Hennig, and Stefan Harmeling. Modular block-diagonal curvature
approximations for feedforward architectures. arXiv preprint arXiv:1902.01813, 2019.
[27] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and
Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional
non-convex optimization. In Advances in neural information processing systems, pages
2933–2941, 2014.
[28] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
[29] Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via blockwise quantization. 9th International Conference on Learning Representations, ICLR, 2022.
[30] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient
finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
[31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
[32] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3
herd of models. arXiv preprint arXiv:2407.21783, 2024.
[33] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–
2159, 2011.
[34] Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture
search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.
[35] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient multi-objective neural
architecture search via lamarckian evolution. arXiv preprint arXiv:1804.09081, 2018.
[36] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A
survey. Journal of Machine Learning Research, 20(55):1–21, 2019. URL http://jmlr.
org/papers/v20/18-598.html.
[37] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion
parameter models with simple and efficient sparsity. The Journal of Machine Learning
Research, 23(1):5232–5270, 2022.
[38] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[39] Elias Frantar and Dan Alistarh. Optimal brain compression: A framework for accurate posttraining quantization and pruning. Advances in Neural Information Processing Systems, 35:
4475–4488, 2022.
[40] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint
arXiv:2210.17323, 2022.
[41] Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster,
Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria
Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework
for few-shot language model evaluation, September 2021. URL https://doi.org/10.
5281/zenodo.5371628.
[42] Xinyu Gong, Shiyu Chang, Yifan Jiang, and Zhangyang Wang. Autogan: Neural architecture search for generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3224–3234, 2019.
[43] Ian Goodfellow, Aaron Courville, and Yoshua Bengio. Large-scale feature learning with
spike-and-slab sparse coding. arXiv preprint arXiv:1206.6407, 2012.
[44] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward
Choi. Morphnet: Fast & simple resource-constrained structure learning of deep networks.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
1586–1595, 2018.
[45] Roger Grosse and Ruslan Salakhudinov. Scaling up natural gradient by sparsely factorizing
the inverse fisher matrix. In International Conference on Machine Learning, pages 2304–
2313, 2015.
[46] Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. PPT: Pre-trained prompt tuning for
few-shot learning. arXiv preprint arXiv:2109.04332, 2021.
[47] Demi Guo, Alexander M Rush, and Yoon Kim. Parameter-efficient transfer learning with
diff pruning. arXiv preprint arXiv:2012.07463, 2020.
[48] Han Guo, Philip Greengard, Eric P Xing, and Yoon Kim. Lq-lora: Low-rank plus
quantized matrix decomposition for efficient language model finetuning. arXiv preprint
arXiv:2311.12023, 2023.
[49] Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. WARP: Word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121, 2021.
[50] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep
neural networks with pruning, trained quantization and huffman coding. arXiv preprint
arXiv:1510.00149, 2015.
[51] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint
arXiv:2110.04366, 2021.
[52] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers:
Surpassing human-level performance on ImageNet classification. In Proceedings of the
IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[53] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016.
[54] Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, and Xin Eric Wang. Parameterefficient model adaptation for vision transformers. arXiv preprint arXiv:2203.16329, 2022.
[55] Yunlong He, Koray Kavukcuoglu, Yun Wang, Arthur Szlam, and Yanjun Qi. Unsupervised
Feature Learning by Deep Sparse Coding, pages 902–910. doi: 10.1137/1.9781611973440.
103. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611973440.103.
[56] Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652, 2017.
[57] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer
learning for NLP. In International Conference on Machine Learning, pages 2790–2799.
PMLR, 2019.
[58] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias
Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural
networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[59] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang,
Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv
preprint arXiv:2106.09685, 2021.
[60] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European conference on computer vision, pages 646–661.
Springer, 2016.
[61] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely
connected convolutional networks. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 4700–4708, 2017.
[62] Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim,
Stanley Jungkyu Choi, and Minjoon Seo. Towards continual knowledge learning of language models. arXiv preprint arXiv:2110.03215, 2021.
[63] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and
Qun Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint
arXiv:1909.10351, 2019.
[64] Haifeng Jin, Qingquan Song, and Xia Hu. Auto-keras: Efficient neural architecture search
with network morphism. arXiv preprint arXiv:1806.10282, 5, 2018.
[65] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA:
A dataset for biomedical research question answering. In Kentaro Inui, Jing Jiang, Vincent
Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), pages 2567–2577, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1259. URL
https://aclanthology.org/D19-1259.
[66] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
[67] Matthew Johnson, James Saunderson, and Alan Willsky. Analyzing hogwild parallel
Gaussian Gibbs sampling. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani,
and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26,
pages 2715–2723. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/
5043-analyzing-hogwild-parallel-gaussian-gibbs-sampling.pdf.
[68] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans
for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[69] Sakaguchi Keisuke, Le Bras Ronan, Bhagavatula Chandra, and Choi Yejin. Winogrande:
An adversarial winograd schema challenge at scale. 2019.
[70] Ryan King and Bobak Mortazavi. Growing representation learning. arXiv preprint
arXiv:2110.08857, 2021.
[71] D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December
2013.
[72] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[73] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[74] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint
arXiv:1312.6114, 2013.
[75] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve
prediction with bayesian neural networks. 2016.
[76] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny
images. 2009.
[77] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny
images. 2009.
[78] LAION. Open-instruction-generalist dataset. https://github.com/LAION-AI/, 2023.
[79] Neal Lawton, Anoop Kumar, Govind Thattai, Aram Galstyan, and Greg Ver Steeg. Neural
architecture search for parameter-efficient fine-tuning of large pre-trained language models.
arXiv preprint arXiv:2305.16597, 2023.
[80] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in
neural information processing systems, pages 598–605, 1990.
[81] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient
prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[82] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[83] Jiaoda Li, Ryan Cotterell, and Mrinmaya Sachan. Differentiable subset pruning of transformer heads. Transactions of the Association for Computational Linguistics, 9:1442–1459,
2021.
[84] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[85] Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and
Tuo Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv
preprint arXiv:2310.08659, 2023.
[86] Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong,
and “Teknium”. Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with
verification, 2023. URL https://https://huggingface.co/Open-Orca/SlimOrca.
[87] Baohao Liao and Christof Monz. Apiq: Finetuning of 2-bit quantized large language model.
arXiv preprint arXiv:2402.05147, 2024.
[88] Zhaojiang Lin, Andrea Madotto, and Pascale Fung. Exploring versatile generative language
model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829, 2020.
[89] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei,
Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In
Proceedings of the European conference on computer vision (ECCV), pages 19–34, 2018.
[90] Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale
optimization. Mathematical programming, 45(1-3):503–528, 1989.
[91] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search.
arXiv preprint arXiv:1806.09055, 2018.
[92] Qiang Liu, Lemeng Wu, and Dilin Wang. Splitting steepest descent for growing neural
architectures. arXiv preprint arXiv:1910.02366, 2019.
[93] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized
BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[94] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang.
Learning efficient convolutional networks through network slimming. In Proceedings of the
IEEE international conference on computer vision, pages 2736–2744, 2017.
[95] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the
value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
[96] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. arXiv preprint arXiv:1808.07233, 2018.
[97] Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient
low-rank hypercomplex adapter layers. arXiv preprint arXiv:2106.04647, 2021.
[98] Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Wen-tau
Yih, and Madian Khabsa. UniPELT: A unified framework for parameter-efficient language
model tuning. arXiv preprint arXiv:2110.07577, 2021.
[99] James Martens. Deep learning via Hessian-free optimization. In ICML, volume 27, pages
735–742, 2010.
[100] James Martens. New insights and perspectives on the natural gradient method. arXiv
preprint arXiv:1412.1193, 2014.
[101] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored
approximate curvature. In International conference on machine learning, pages 2408–2417,
2015.
[102] James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free
optimization. In Proceedings of the 28th International Conference on Machine Learning
(ICML-11), pages 1033–1040. Citeseer, 2011.
[103] Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, and Ian Tenney. What happens to BERT
embeddings during fine-tuning? arXiv preprint arXiv:2004.14448, 2020.
[104] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one?
Advances in neural information processing systems, 32, 2019.
[105] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor
conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.
[106] Eiji Mizutani and Stuart Dreyfus. An analysis on negative curvature induced by singularity in multi-layer neural-network learning. In Advances in Neural Information Processing
Systems, pages 1669–1677, 2010.
[107] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440,
2016.
[108] Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. On the stability of
fine-tuning BERT: Misconceptions, explanations, and strong baselines. arXiv preprint
arXiv:2006.04884, 2020.
[109] Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training
deep architectures. arXiv preprint arXiv:1704.08792, 2017.
[110] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture
search of compact semantic segmentation models via auxiliary cells. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9126–9135,
2019.
[111] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill,
2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.
[112] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv
preprint arXiv:1301.3584, 2013.
[113] Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural computation, 6(1):
147–160, 1994.
[114] Matthew E Peters, Sebastian Ruder, and Noah A Smith. To tune or not to tune? Adapting
pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, 2019.
[115] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. AdapterFusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247, 2020.
[116] Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. AdapterHub: A framework for adapting transformers. arXiv preprint arXiv:2007.07779, 2020.
[117] Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. arXiv preprint arXiv:2005.00052, 2020.
[118] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning,
pages 4095–4104. PMLR, 2018.
[119] Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. Deep exponential families. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth
International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings
of Machine Learning Research, pages 762–771, San Diego, California, USA, 09–12 May
2015. PMLR. URL http://proceedings.mlr.press/v38/ranganath15.html.
[120] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie
Tan, Quoc V Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In International Conference on Machine Learning, pages 2902–2911. PMLR, 2017.
[121] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution
for image classifier architecture search. In Proceedings of the aaai conference on artificial
intelligence, volume 33, pages 4780–4789, 2019.
[122] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. Advances in neural information processing systems, 30, 2017.
[123] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free
approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel,
P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information
Processing Systems 24, pages 693–701. Curran Associates, Inc., 2011.
[124] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese
bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing. Association for Computational Linguistics, 11 2019. URL https:
//arxiv.org/abs/1908.10084.
[125] Nicolas L Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in neural information processing systems, pages 849–
856, 2008.
[126] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[127] Christopher De Sa, Chris Re, and Kunle Olukotun. Ensuring rapid mixing and low bias for
asynchronous Gibbs sampling. In Maria Florina Balcan and Kilian Q. Weinberger, editors,
Proceedings of The 33rd International Conference on Machine Learning, volume 48 of
Proceedings of Machine Learning Research, pages 1567–1576, New York, New York, USA,
20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/sa16.html.
[128] Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. Poor man’s BERT: Smaller
and faster transformer models. arXiv preprint arXiv:2004.03844, 2020.
[129] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In David van Dyk
and Max Welling, editors, Proceedings of the Twelth International Conference on Artificial
Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages
448–455, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr
2009. PMLR. URL http://proceedings.mlr.press/v5/salakhutdinov09a.html.
[130] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled
version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,
2019.
[131] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
[132] Nicol N Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural computation, 14(7):1723–1738, 2002.
[133] Richard Shin, Charles Packer, and Dawn Song. Differentiable neural network architecture
search. 2018.
[134] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts.
arXiv preprint arXiv:2010.15980, 2020.
[135] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via
information. arXiv preprint arXiv:1703.00810, 2017.
[136] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[137] S S Singh, F Lindsten, and E Moulines. Blocking strategies and stability of particle Gibbs
samplers. Biometrika, 104(4):953–969, 2017. doi: 10.1093/biomet/asx051. URL http://dx.doi.org/10.1093/biomet/asx051.
[138] Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. Distributed MAP inference for undirected graphical models. In Neural Information Processing
Systems (NIPS) Workshop on Learning on Cores, Clusters, and Clouds (LCCC), 2010.
[139] David So, Quoc Le, and Chen Liang. The evolved transformer. In International Conference
on Machine Learning, pages 5877–5886. PMLR, 2019.
[140] Jascha Sohl-Dickstein, Ben Poole, and Surya Ganguli. Fast large-scale optimization by
unifying stochastic gradient and quasi-Newton methods. In International Conference on
Machine Learning, pages 604–612, 2014.
[141] David Sontag and Tommi Jaakkola. Tree block coordinate descent for MAP in graphical
models. In David van Dyk and Max Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of
Machine Learning Research, pages 544–551, Hilton Clearwater Beach Resort, Clearwater
Beach, Florida USA, 16–18 Apr 2009. PMLR.
[142] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
MobileBERT: a compact task-agnostic bert for resource-limited devices. arXiv preprint
arXiv:2004.02984, 2020.
[143] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin,
Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama
model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[144] A. Terenin, D. Simpson, and D. Draper. Asynchronous Gibbs sampling. ArXiv e-prints,
September 2015.
[145] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck
method. arXiv preprint physics/0004057, 2000.
[146] Ruilin Tong. Growing neural network with shared parameter. arXiv preprint
arXiv:2201.06500, 2022.
[147] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[148] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[149] Oriol Vinyals and Daniel Povey. Krylov subspace descent for deep learning. In Artificial
Intelligence and Statistics, pages 1261–1268, 2012.
[150] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multihead self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv
preprint arXiv:1905.09418, 2019.
[151] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization.
Advances in neural information processing systems, 26, 2013.
[152] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log
partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, July 2005.
ISSN 0018-9448. doi: 10.1109/TIT.2005.850091.
[153] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of
neural networks using dropconnect. In International conference on machine learning, pages
1058–1066. PMLR, 2013.
[154] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[155] Dilin Wang, Meng Li, Lemeng Wu, Vikas Chandra, and Qiang Liu. Energy-aware
neural architecture optimization with fast splitting steepest descent. arXiv preprint
arXiv:1910.03103, 2019.
[156] Huahua Wang and Arindam Banerjee. Randomized block coordinate descent for online and
stochastic optimization. CoRR, abs/1407.0107, 2014.
[157] Peisong Wang, Qiang Chen, Xiangyu He, and Jian Cheng. Towards accurate post-training
network quantization via bit-split and stitching. In International Conference on Machine
Learning, pages 9847–9856. PMLR, 2020.
[158] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel
Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
[159] Lemeng Wu, Bo Liu, Peter Stone, and Qiang Liu. Firefly neural architecture descent: a
general approach for growing neural networks. Advances in Neural Information Processing
Systems, 33, 2020.
[160] Lemeng Wu, Mao Ye, Qi Lei, Jason D Lee, and Qiang Liu. Steepest descent neural architecture optimization: Escaping local optimum with signed neural splitting. arXiv preprint
arXiv:2003.10392, 2020.
[161] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for
Benchmarking Machine Learning Algorithms, 2017.
[162] Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrievalaugmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024.
[163] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop
question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1259. URL https:
//aclanthology.org/D18-1259.
[164] Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization
via dictionary learning: contextualized embedding as a linear superposition of transformer
factors. arXiv preprint arXiv:2103.15949, 2021.
[165] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple parameterefficient fine-tuning for transformer-based masked language-models. arXiv preprint
arXiv:2106.10199, 2021.
[166] Matthew D Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
[167] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can
a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, 2019.
[168] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu
Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. arXiv
preprint arXiv:2303.10512, 2023.
[169] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen,
Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained
transformer language models. arXiv preprint arXiv:2205.01068, 2022.
[170] Tuo Zhao, Mo Yu, Yiming Wang, Raman Arora, and Han Liu. Accelerated mini-batch
randomized block coordinate descent method. In Z. Ghahramani, M. Welling, C. Cortes,
N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing
Systems 27, pages 3329–3337. Curran Associates, Inc., 2014.
[171] Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-wise
neural network architecture generation. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 2423–2432, 2018.
[172] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable
architectures for scalable image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 8697–8710, 2018.
Abstract
In this dissertation, I present a perspective of machine learning that views feature learning as the fundamental strategy by which deep machine learning models learn to solve complex problems: when trained to perform one specific task, deep machine learning models tend to learn generalizable features that are useful for solving many different tasks. In this way, deep machine learning models learn at a local level by automatically breaking down complex problems into simple relevant subproblems.
I then present a diverse collection of works that put this perspective into action to design better machine learning algorithms. These works include efficient optimization algorithms, including an algorithm for block-free parallel inference in exponential families (Chapter 2) and a novel second-order algorithm for training neural networks (Chapter 3); algorithms for efficient neural architecture search (NAS), including a morphism-based NAS algorithm for growing neural networks (Chapter 4) and a pruning-based NAS algorithm for finding more parameter-efficient PEFT architectures (Chapter 5); and algorithms for efficient fine-tuning of large language models, including an algorithm for increasing the performance of fine-tuning quantized models (Chapter 6) and a joint fine-tuning algorithm for retrieval augmented generation (RAG) pipelines (Chapter 7).