Computational Intelligence: Prediction, Control and Memory in Artificial and Biological Agents

by

Tommaso Furlanello

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Neuroscience)

May 2019

Copyright 2019 Tommaso Furlanello
Dedication

In memory of Bosco S. Tjan for his friendship and scientific mentorship.
Acknowledgements

I wish to acknowledge all the co-authors, friends and mentors that helped shape my thoughts. A special thanks to my advisor Laurent Itti for letting me cultivate my passions and collaborations with complete intellectual and economic support and freedom.

Chapter 3 is inspired by ongoing writings in collaboration with Stefano Anzellotti and Laurent Itti. Chapters 4 and 5 are the preliminary outputs of an ongoing collaboration with Amy Zhang, Kamyar Azizzadenesheli, Anima Anandkumar, Joelle Pineau, Zachary C. Lipton and Laurent Itti. Chapter 8 is based on a paper in collaboration with Zachary C. Lipton, Michael Tschannen, Laurent Itti and Anima Anandkumar. Chapter 9 is based on a paper in collaboration with Andrew Saxe, Laurent Itti and Bosco S. Tjan.

The author acknowledges the partial financial support of the National Science Foundation (grant number CCF-1317433), C-BRIC (one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA), and the Intel Corporation. The authors affirm that the views expressed herein are solely their own, and do not represent the views of the United States government or any agency thereof.
Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Chapter 1: Introduction
Chapter 2: Information Theoretic Representation of Stochastic Processes
2.0.1 Information Theoretic Measures
2.0.2 Stochastic Processes
2.0.3 Partially Observable Decision Processes
Chapter 3: Basic Intelligence
3.1 The Environment
3.1.1 Global Causality
3.1.2 Local Causality
3.1.3 Computing Devices
3.2 The Agent
3.2.1 Sensory Transduction
3.2.2 Sensory Processing
3.2.3 Temporal Sensory Dynamics
3.3 Planning with Causal States
Chapter 4: Learning Minimal Sufficient Representations of Dynamical Systems
4.1 Abstract
4.2 Introduction
4.3 Optimal Representation of Discrete Stochastic Processes
4.4 Causal States of Latent Stochastic Processes
4.4.1 Recovering Causal States from Sufficient Statistics
4.5 Experiment Details
4.6 Results
Chapter 5: Learning Minimal Sufficient Representations of Partially Observable Decision Processes
5.1 Adapting Causal States to Real Valued Measurements
5.2 Learning Causal States from Rollout Data
5.3 Environment Details and Results
Chapter 6: Meta Intelligence
6.0.1 Multi-task Learning
6.0.2 Knowledge Distillation
6.0.3 Catastrophic Forgetting and Long Term Memory
6.0.4 Second order modelling of the internal model - Attention and Hyper-networks
Chapter 7: Bottom Up Attention
7.1 Introduction
7.2 Methods
7.3 Results
Chapter 8: Born Again Neural Networks
8.1 Abstract
8.2 Introduction
8.3 Related Literature
8.3.1 Knowledge Distillation
8.3.2 Residual and Densely Connected Neural Networks
8.4 Born-Again Networks
8.4.1 Sequence of Teaching Selves Born-Again Networks Ensemble
8.4.2 Dark Knowledge Under the Light
8.4.3 BANs Stability to Depth and Width Variations
8.4.4 DenseNets Born-Again as ResNets
8.5 Experiments
8.5.1 CIFAR-10/100
8.5.2 Penn Tree Bank
8.6 Results
8.6.1 CIFAR-10
8.6.2 CIFAR-100
8.6.3 Penn Tree Bank
8.7 Discussion
Chapter 9: Active Long Term Memory Networks
9.1 Abstract
9.2 Introduction
9.3 Related Work
9.4 The A-LTM Model
9.5 Experiments
9.5.1 Catastrophic Interference in Deep Linear Networks
9.5.2 Sequential and Multi-task Learning of Orthogonal Factors on iLab20M
9.5.3 Sequential, Multi-Task and A-LTM Domain Adaptation over Imagenet
9.6 Discussion
Chapter 10: Conclusion and ongoing work
Reference List
List Of Tables

4.1 Decoding accuracy of a neural net trained to recover the true process' output variable from our model's hidden states in the latent-MNIST case. $|\mathbf{Y}|$ denotes the size of the process' alphabet. First number: accuracy relative to the oracle when trained on the discrete states $s$; second: accuracy relative to the oracle trained on the continuous representations $s$. Additional results (oracle accuracies and absolute accuracies) for the same processes can be found in the appendices.

4.2 Decoding accuracy of a neural network trained to recover the true output variables from our model's latent states in the revealed, latent-Gaussian and latent-MNIST cases. $|\mathbf{Y}|$ denotes the number of hidden states. First number: accuracy when trained on the discrete states $s$; second: accuracy trained on the continuous representations $s$. Parentheses indicate the best achievable performance (oracle).

5.1 Results for the Active Setting. Reward obtained with tabular and function approximation versions of value iteration. Numbers in each cell correspond to the action spaces next id, switch source, switch mem. Models trained on 500 episodes and evaluated on 100.

7.1 Test set accuracy on CIFAR-100 for Wide Residual Networks with layer widths 1-2-5-10, without any attention mechanism, with the local attention mechanism Squeeze and Excitation, and with our proposed LSTM-bottom-up.

8.1 Test error on CIFAR-10 for Wide-ResNets of different depth and width and DenseNets of different depth and growth factor.

8.2 Test error on CIFAR-100. Left side: DenseNets of different depth and growth factor and their respective BAN students. BAN models are trained only with the teacher loss, BAN+L with both label and teacher loss. CWTM are trained with sample-importance-weighted labels, where the importance of a sample is determined by the max of the teacher's output. DKPP are trained only from teacher outputs with all the dimensions but the argmax permuted. Right side: test error on CIFAR-100 for a sequence of BAN-DenseNets, and the BAN-ensembles resulting from the sequence. Each BAN in the sequence is trained from cross-entropy with respect to the model at its left. BAN and BAN-1 models are trained from the teacher but have different random seeds. We include the teacher as a member of the ensemble for Ens*3 for 80-120 since we did not train a BAN-3 for this configuration.

8.3 Test error on CIFAR-100 for Wide-ResNet students trained from identical Wide-ResNet teachers and for DenseNet-90-60 students trained from Wide-ResNet teachers.

8.4 Test error on CIFAR-100 for modified DenseNets: a DenseNet-90-60 is used as teacher with students that share the same size of hidden states after each spatial transition but differ in depth and compression rate.

8.5 DenseNet to ResNet: CIFAR-100 test error for BAN-ResNets trained from a DenseNet-90-60 teacher with different numbers of blocks and compression factors. In all the BAN architectures, the number of units per block is indicated first, followed by the ratio of input and output channels with respect to a DenseNet-90-60 block. All BAN architectures share the first (conv1) and last (fc-output) layers with the teacher, which are frozen. Every dense block is effectively substituted by residual blocks.

8.6 Validation/test perplexity on PTB (lower is better) for BAN-LSTM language models of different complexity.

9.1 Test set accuracy of domain adaptation over Imagenet and memory of the viewpoint task for iLab for single task (ST), multi-task/domain (MD) and Active Long Term Memory networks with and without replay.
List Of Figures

3.1 Graphical representation of a computing device approximating a channel without memory

3.2 Graphical representation of a neural network approximating a stochastic process with memory

3.3 Graphical representation of a neural network approximating a channel with memory

3.4 Graphical representation of the perceptual cycle of the agent ignoring the effect of actions

3.5 Graphical representation of the action-perception cycle of the agent

4.1 Graphical representation of a discrete latent process with alphabet between 0 and 9 and the observed measurements sampled from the corresponding image category in the MNIST dataset

4.2 Relative accuracy (vs. RNN) for next symbol prediction as a function of the number of states for different binary processes. Graphical illustrations depict processes from [85], with edges labeled by the associated output emission. RNN accuracy is matched at the correct number of states, except for process n5ki2.., where the accuracy is matched with 3 (vs 5) causal states.

5.1 Graphical representation of the neural network model used to estimate the causal states of a latent decision process

6.1 Graphical representation of the multi-task training procedure for a memory-less channel

6.2 Graphical representation of the knowledge distillation training procedure for a memory-less channel

7.1 Representation of the three types of residual blocks used. Residual blocks are the basic component of Deep Residual Networks, which are constructed by stacking multiple blocks consecutively. On the left, the vanilla unit, which simply applies convolutional functions to the input of the block $X$ and sums the result over the original input. In the center, the Squeeze and Excitation unit, which extracts the gist vector $\hat{f}(x)$ from the spatial representation $f(x)$ and uses it to produce the attention function $g(\hat{f}(x))$. On the right, our proposed LSTM-bottom-up unit, which uses a recurrent neural network to create a hidden state $h$ able to accumulate information from the gist vectors $\hat{f}(x)$ of all the previous layers.

8.1 Graphical representation of the BAN training procedure: during the first step the teacher model T is trained from the labels Y. Then, at each consecutive step, a new identical model is initialized from a different random seed and trained under the supervision of the earlier generation. At the end of the procedure, additional gains can be achieved with an ensemble of multiple student generations.

9.1 Catastrophic interference experiments with Deep Linear Networks: Task A is equivalent to Task B

9.2 a): example of the turntable setup and of multiple camera viewpoints, lighting directions and objects from iLab20M. b): results of sequential and multi-task learning of categories and viewpoints on iLab20M

9.3 Test set accuracy on iLab20M of A-LTM: categories in blue and viewpoints in red. A-LTM with replay is indicated by circles and A-LTM without replay by plus signs; the dashed horizontal lines represent the accuracies at initialization. Both models suffer from an initial drop in accuracy during the first epochs, but the A-LTM network that has access to samples of the original dataset is able to fully recover.
Chapter 1

Introduction

All non-linear dynamical systems produce computations by storing and transforming input information. A computing device is a physical system that can be reliably controlled into transducing specific computations. Intelligent agents, computing devices themselves, manipulate and exploit their physical environment to achieve their goals. Goals are future states of the world with a specific information content that is desired by the agent. In this setting, the agent's behavior can be interpreted as the tentative manipulation of a computing device (the environment) into the configuration that will output the desired information. The interaction with the physical context allows agents to drastically surpass their own computational capacity by exploiting the unique dynamics of their environment. As an example, think of a human desiring to keep accurate track of time: internal neural dynamics give rise to a noisy perception of time passing, yet a simple device like a sundial is able to exploit the stationary orbital dynamics to output a precise computation.

We call computational intelligence the ability to control the physical environment into specific informational states. We define prediction and control as basic intelligence, and call meta-intelligence the collection of computational tools necessary to exhibit basic intelligence in a non-stationary world using a finite physical computing device.

In this dissertation I develop a framework to model the computational intelligence exhibited by biological and artificial agents as they control their context to achieve their goals. We build on the original levels-of-understanding manifesto from Marr and Poggio [59] and use the same levels (computational, algorithmic, and implementational) to develop a theory of intelligence that is agnostic about the physical substrate used to compute. Differently from the purely descriptive intentions of the original formulation, we take a normative approach. Starting from assumptions on the causal architecture of the world and the perceptual tools of the agents, we define the computational requirements to satisfy for basic intelligence. With a few additional assumptions on the relationship between computations and the hardware used to implement them, we derive multiple meta-intelligence requirements for agents that live in non-stationary and diverse environments.

This approach draws ideas both from information theory, whose principles govern the optimal representations of non-linear dynamical systems, and machine learning, which offers theory and techniques for approximating systems from empirical realizations. The framework is grounded in the computational mechanics [18] principle that systems that are optimally predicted can be intelligently controlled [16]. Intelligence, therefore, requires the agent to own internal models able to predict the dynamics of their environment, as well as the tools to acquire new models when novel, unforeseen dynamics are encountered.

We use the human brain as an example of a computational device able to holistically express the required computations for basic and meta intelligence. In contrast, we use neural networks as a practical example of algorithms able to learn arbitrary computations from data. That is, we use humans as an example of a computing device optimized by evolution to exhibit basic intelligence, and neural networks to model and study specific aspects of the association between computations and their algorithmic implementation. In the first part of the thesis I focus on the information theoretic requirements for an agent with an imperfect perceptual system to exhibit basic intelligence in a spatio-temporal environment. As a demonstration of the approach we develop a novel neural network algorithm for learning minimal sufficient representations of both dynamical systems and partially observable decision processes that can be used to carry out long-term predictions and control plans. We then move our attention to agents that interact with environments that are spatially diverse and non-stationary while possessing limited computational capacity. The sequential utilization and learning of predictive models, under the computational constraints of a finite emulation device, becomes a stochastic process itself, whose understanding and compression give rise to meta-intelligence requirements for virtualization, attention, and long-term memory. Virtualization is the ability to achieve the same computational goal through different algorithms which might rely on different specialized hardware; attention is the ability to express different algorithms as a combination of the same computing blocks which share specialized hardware; long-term memory is the ability to sequentially learn new computations without forgetting the previous ones using the same hardware. We investigate the virtualization of computations by studying knowledge distillation between neural networks with identical and different architectures in what we call Born-Again Networks. We study a simple bottom-up attention algorithm that integrates with convolutional neural networks by conveying global information from earlier layers to the spatially localized processing of the later stages. Finally, I conclude with an end-to-end architecture, the Active Long Term Memory Networks, that is able to exhibit long-term memory in an environment with non-stationary tasks by using a combination of knowledge distillation and generative modelling.
Chapter 2

Information Theoretic Representation of Stochastic Processes

We now proceed to a formal description of the framework used to define basic intelligence, using a notation inspired by computational mechanics [18]. We will make use of stationary stochastic processes and partially observable decision processes, which we study as information channels [81, 5]. We use the information-theoretic notions of entropy, mutual information and conditional entropy.
We denote random variables with capital letters $X$ and their realizations with $x$; we use bold for the support set $\mathbf{X}$ and describe their probability distribution with $P(X)$. When a random variable is discrete we also call the support the alphabet and the realizations symbols. We use the subscript $X_t$ to indicate a random variable at a specific time $t$ and the superscript $X^k$ to indicate a random variable at a specific spatial location. For simplicity we use discretized time and space, and we interpret space as the location on a spatial grid, voxels of a 3d space, or nodes of an arbitrary graph.
2.0.1 Information Theoretic Measures

We quantify the randomness of a discrete random variable $X$ through the entropy function:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \qquad (2.1)$$

We quantify the randomness of a joint random variable $(X, Y)$ with the joint entropy:

$$H(X, Y) = -\sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log_2 P(x_i, y_j) \qquad (2.2)$$

We quantify the dependence between two variables through the mutual information:

$$I[X; Y] = \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log_2 \frac{P(x_i, y_j)}{P(x_i) P(y_j)} \qquad (2.3)$$

We quantify the randomness left over in $X$ after knowing the value of $Y$ with the conditional entropy:

$$H(X \mid Y) = H(X, Y) - H(Y) = \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log_2 \frac{P(y_j)}{P(x_i, y_j)} \qquad (2.4)$$

and respectively the randomness left over in $Y$ after knowing the value of $X$:

$$H(Y \mid X) = H(X, Y) - H(X) = \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log_2 \frac{P(x_i)}{P(x_i, y_j)} \qquad (2.5)$$
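These definitions translate directly into code. The following sketch is our own illustration, not part of the thesis: it computes the quantities above for a finite joint distribution represented as a matrix, with function names and the toy distribution chosen by us.

```python
import numpy as np

def entropy(p):
    """Entropy H(X) in bits of a probability vector p (Eq. 2.1)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def joint_entropy(pxy):
    """Joint entropy H(X, Y) of a joint probability matrix (Eq. 2.2)."""
    p = pxy[pxy > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """Mutual information I[X; Y] from the joint distribution (Eq. 2.3)."""
    px = pxy.sum(axis=1)  # marginal P(X)
    py = pxy.sum(axis=0)  # marginal P(Y)
    return entropy(px) + entropy(py) - joint_entropy(pxy)

def conditional_entropy(pxy):
    """Conditional entropy H(X | Y) = H(X, Y) - H(Y) (Eq. 2.4)."""
    return joint_entropy(pxy) - entropy(pxy.sum(axis=0))

# A toy joint distribution over binary X and Y.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
print(mutual_information(pxy))   # ~0.278 bits
print(conditional_entropy(pxy))  # ~0.722 bits
```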
2.0.2 Stochastic Processes

We model time by considering the stochastic process given by the bi-infinite one-dimensional chain $\overleftrightarrow{Y}$ of discrete random variables $Y_t$ that take values $y_t$ from the countable alphabet of symbols $\mathbf{Y}$. We indicate its non-inclusive past with $\overleftarrow{Y} = \ldots Y_{-3}, Y_{-2}, Y_{-1}$ and its inclusive future with $\overrightarrow{Y} = Y_0, Y_1, Y_2 \ldots$. The set of all bi-infinite sequences $\overleftrightarrow{Y}$ with alphabet $\mathbf{Y}$ is indicated by $\overleftrightarrow{\mathbf{Y}}$. We define finite blocks of variables and their realizations with the subscript $i{:}j$, respectively $Y_{i:j}$ and $y_{i:j}$.

The distribution of sub-sequences $P(Y_{t:t+L})$ for $L \in \mathbb{Z}^+$ is called the word distribution and defines the probability of every finite sequence of length $L$ that the stochastic process can output. We focus our analysis on ergodic stationary processes, that is, processes whose word distributions are time-invariant and can be reliably estimated from empirical data. More formally, stationarity implies $P(Y_{t:t+L}) = P(Y_{0:L})$ for all $t \in \mathbb{Z}$ and all $L \in \mathbb{Z}^+$. Ergodicity implies that the empirical estimates $\hat{P}(Y_{0:L})$ from the finite realization $y_{0:M}$ converge to the process's true word probability $P(Y_{0:L})$ for $M \to \infty$.
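For intuition, word distributions of this kind can be estimated with a simple sliding-window count. The sketch below is our own illustration (function name and toy sequence are ours, not the thesis code):

```python
from collections import Counter

def empirical_word_distribution(y, L):
    """Estimate the word distribution P(Y_{0:L}) from a single long
    realization y by sliding a window of length L and counting word
    frequencies. For stationary ergodic processes these estimates
    converge to the true word probabilities as len(y) grows."""
    counts = Counter(tuple(y[t:t + L]) for t in range(len(y) - L + 1))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

# Example: length-2 word distribution of an alternating binary sequence.
y = [0, 1] * 500
print(empirical_word_distribution(y, 2))  # {(0, 1): ~0.5, (1, 0): ~0.5}
```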
The amount of memory present in the stochastic process is defined by how well its future can be predicted from its past. The excess entropy $\mathbf{E}$ quantifies memory through the mutual information between the past and the future, $I[\overleftarrow{Y}; \overrightarrow{Y}]$. A process is said to have finite memory of Markov order $L$ if the finite-past excess entropy $\mathbf{E}_L := I[Y_{-L:0}; \overrightarrow{Y}]$ is equivalent to the infinite-past excess entropy, such that:

$$I[Y_{-L:0}; \overrightarrow{Y}] = I[\overleftarrow{Y}; \overrightarrow{Y}] \qquad (2.6)$$

for $L \in \mathbb{Z}$. The process is said to be memoryless if the equivalence holds for $L = 0$, as it implies that the past of the process does not contain information that is predictive of its future.

We can represent a stochastic process by a map $\eta: \overleftarrow{\mathbf{Y}} \mapsto \mathcal{S}$ from the set of semi-infinite pasts to the set of statistics $\mathcal{S}$. A representation is optimal if it produces sufficient statistics of the future, i.e., when the mutual information between the representation of the past and the future is equal to the excess entropy:

$$I[\eta; \overrightarrow{Y}] = I[\overleftarrow{Y}; \overrightarrow{Y}] \qquad (2.7)$$

A sufficient statistic $\eta_j$ is said to be minimal if it can be calculated from any other, i.e., if there exists a map $\phi: \eta_i \mapsto \eta_j$ for all $\eta_i$ that satisfy the equality of Eq. 2.7. A statistic is said to be next-step sufficient if its mutual information with the next symbol, $I[\eta; Y]$, is equal to the mutual information between the complete past and the next symbol:

$$I[\eta; Y] = I[\overleftarrow{Y}; Y] \qquad (2.8)$$
Next-step sufficient statistics are optimal only for 1-step-ahead predictions, but if they can be computed recursively with a function $\phi(\eta_t, y_t) = \eta_{t+1}$ then they are globally sufficient and can be used to compute optimal predictions for the whole future. For stationary processes, a minimal sufficient recursive representation is given by the $\epsilon$-machine [crutchfield1989inferring, shalizi2001computational], i.e., the unique, optimal, and minimal unifilar hidden Markov model able to generate the process. Sometimes, $\epsilon$-machines are referred to as causal state models. The $\epsilon$-machine of a stochastic process is constructed by finding groups of past histories that induce the same distribution of the infinite future. This clustering procedure is defined by the causal equivalence $\sim_\epsilon$:

$$\overleftarrow{y} \sim_\epsilon \overleftarrow{y}' \iff P(\overrightarrow{Y} \mid \overleftarrow{Y} = \overleftarrow{y}) = P(\overrightarrow{Y} \mid \overleftarrow{Y} = \overleftarrow{y}') \qquad (2.9)$$
The partitions induced by the causal equivalence $\sim_\epsilon$ over the space of feasible pasts $\overleftarrow{\mathbf{Y}}$ are the causal states of the process, which take values from the alphabet $\mathbf{S}$ with elements $\sigma_i$. The $\epsilon$-map is the function $\epsilon: \overleftarrow{\mathbf{Y}} \mapsto \mathbf{S}$ that associates histories with their respective causal states. Sampling of new symbols in the sequence induces the creation of new histories and consequently new causal states. The stochastic process induced by the output of the $\epsilon$-map after each new observation is called the causal state process and is defined over the bi-infinite sequences of causal states $\overleftrightarrow{S}$.

The recurrent dynamics of the causal state model are fully specified by the state-conditional symbol emission probability $P(Y_t \mid S_t)$ and the symbol-conditional causal state emission probability $P(S_{t+1} \mid Y_t, S_t)$. Together these quantities form the indexed set $\mathbf{T}$ of symbol-labeled transition matrices $T^y$ with elements

$$T^y_{ij} = P(S_{t+1} = \sigma_j, y_t = y \mid S_t = \sigma_i). \qquad (2.10)$$

The probability of the current output symbol depends only on the current causal state, and therefore the probability of the next causal state is a function uniquely of the current state, making the sequence of states a Markov chain. Furthermore, causal state models are unifilar: the transitions between states are deterministic given the output symbol,

$$H[S_{t+1} \mid Y_t, S_t] = 0. \qquad (2.11)$$

Unifilarity is a useful property that derives from the determinism of the $\epsilon$-map and guarantees that an observer with no uncertainty about the causal state of the process does not de-synchronize after a new output emission. We will exploit the Markovianity of the causal state process in later sections to show that in stationary sequential decision processes, the best Markovian policies defined over causal states are as good as the policies defined over the infinite history of observables.

To summarize, the $\epsilon$-machine of a stationary stochastic process is the minimal optimal hidden Markov model with deterministic transitions and is fully specified by the tuple $\langle \mathbf{Y}, \mathbf{S}, \mathbf{T} \rangle$ containing the output alphabet, the set of causal states and the set of indexed transition matrices.
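To make the tuple $\langle \mathbf{Y}, \mathbf{S}, \mathbf{T} \rangle$ concrete, the sketch below (our own illustration, not thesis code) encodes a standard small example from the computational mechanics literature, the Even Process, as labeled transition matrices and samples from it.

```python
import numpy as np

# The Even Process: a two-state unifilar HMM. State A emits 0 or 1 with
# equal probability; state B must emit 1, so blocks of 1s between 0s
# always have even length.
alphabet = [0, 1]
states = ["A", "B"]
# T[y][i][j] = P(S_{t+1}=j, y_t=y | S_t=i)  (Eq. 2.10)
T = {0: np.array([[0.5, 0.0],
                  [0.0, 0.0]]),
     1: np.array([[0.0, 0.5],
                  [1.0, 0.0]])}

def sample(T, n, rng=np.random.default_rng(0)):
    """Generate n symbols; unifilarity makes the next state a deterministic
    function of the current state and the emitted symbol."""
    i, out = 0, []
    for _ in range(n):
        # joint probability over (symbol, next state) given current state i
        probs = np.array([T[y][i] for y in alphabet]).ravel()
        flat = rng.choice(len(probs), p=probs / probs.sum())
        y, i = alphabet[flat // len(states)], flat % len(states)
        out.append(y)
    return out

print(sample(T, 20))
```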
2.0.3 Partially Observable Decision Processes

The previous framework can be extended to Input-Output (I/O) processes, the memory-full counterpart of information channels [81]. We use information channels to extend the observer model to the actor case, and use them as an abstraction method for partially observable decision processes.

We maintain the notation from [5] but adapt the notation of inputs to the letter $A$ instead of $X$, in analogy to the reinforcement learning notation for actions. We indicate the input random variables as $A$ and their realizations with $a \in \mathbf{A}$.

A channel $\overrightarrow{Y} \mid \overrightarrow{A}$ is a collection of stochastic processes defined over the countable joint alphabet $\mathbf{A} \times \mathbf{Y}$, where each process $\overrightarrow{Y} \mid \overrightarrow{a}$ is defined by its input sequence $\overrightarrow{a}$. That is, a fixed input realization $\overrightarrow{a}$ maps into a stochastic process defined over the output alphabet. Consequently, an input stochastic process $\overrightarrow{A}$ maps into a joint process $(\overrightarrow{A, Y})$ that can be marginalized to obtain the output process $\overrightarrow{Y}$. A channel is characterized by its conditional word probabilities:

$$P(Y_{t:t+L} = y_{t:t+L} \mid \overrightarrow{A} = \overrightarrow{a}) \qquad (2.12)$$

As with stochastic processes, we limit our attention to stationary and ergodic channels, settings that are amenable to empirical study. The previous definitions extend to the channel's conditional word probabilities. An interesting consequence of having a stationary map from input to output process is the preservation of stationarity for the output process. When this assumption holds in the agent-environment setting, it implies that the joint stochastic process of actions and observations is ergodic and stationary for stationary policies.

The last assumption that we require is for the channel to have finite anticipation, that is, to depend only on the semi-infinite past of inputs and outputs and on a finite future of inputs. If the anticipation is finite then the input process can be shifted back and the channel can be considered causal, i.e., the conditional probability over the future is uniquely defined by the distribution of the past. The definition of Markov order extends to channels but is separated into a feedforward order and a feedback order, which respectively define the memory of past inputs and past outputs.
The optimal representation of a stationary channel with stationary inputs can be derived from the $\epsilon$-machine of its joint process $(\overrightarrow{A, Y})$ and is called the $\epsilon$-transducer. The causal states of the joint process $\epsilon$-machine are minimal sufficient statistics of the joint future $(\overrightarrow{A, Y}) = \overrightarrow{Y} \mid \overrightarrow{A}$. The conditional future outputs $\overrightarrow{Y}$ can be obtained by marginalizing the future inputs $\overrightarrow{A}$, obtaining the $\epsilon$-transducer.

The only difference with respect to the previous formulation is that the causal states are defined over joint input and output histories and the state transition matrices $T^{y|a} \in \mathbf{T}$ are input-conditional, with elements:

$$T^{y|a}_{ij} = P(S_{t+1} = \sigma_j, y_t = y \mid S_t = \sigma_i, A_t = a) \qquad (2.13)$$

Since the causal states are defined over joint symbols, the causal state model is co-unifilar:

$$H[S_{t+1} \mid A_t, Y_t, S_t] = 0 \qquad (2.14)$$

In analogy to unifilarity in the observer model, the co-unifilarity property allows an actor to develop synchronized forward plans without ambiguity about the state transitions beyond the uncertainty in output symbol emission.
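A minimal illustration of these two properties follows (our own toy transducer; all numbers are hypothetical). The labeled matrices of Eq. 2.13 give both the action-conditional emission probabilities and, via co-unifilarity, a deterministic state update:

```python
import numpy as np

# A hypothetical epsilon-transducer with two states, binary actions and
# binary symbols. T[(y, a)][i][j] = P(S_{t+1}=j, y_t=y | S_t=i, A_t=a).
T = {(0, 0): np.array([[0.7, 0.0], [0.2, 0.0]]),
     (1, 0): np.array([[0.0, 0.3], [0.0, 0.8]]),
     (0, 1): np.array([[0.0, 0.4], [0.5, 0.0]]),
     (1, 1): np.array([[0.6, 0.0], [0.0, 0.5]])}

def step(i, a, y):
    """Co-unifilarity (Eq. 2.14): given the current state, the taken action
    and the emitted symbol, the next state is deterministic (each row has a
    single nonzero entry)."""
    return int(np.argmax(T[(y, a)][i]))

def emission_prob(i, a, y):
    """P(y_t = y | S_t = i, A_t = a), marginalizing over next states."""
    return T[(y, a)][i].sum()

print(step(0, 0, 1), emission_prob(0, 0, 1))  # 1 0.3
```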
When the $\epsilon$-transducer has finitely many causal states it is possible to represent it as a labeled directed graph $G = (\mathbf{S}, T^{(y|a)}_{ij})$ with the states $\mathbf{S}$ as nodes and the transition probabilities $T^{(y|a)}_{ij}$ as edges. Every node of the graph represents a different causal state $S_i$, and the edge $T^{(y|a)}_{ij}$ represents the conditional probability of emitting symbol $y$ and consequently transitioning from $S_i$ to $S_j$ after taking action $a$. With $G$, we can now plan by mapping out maximum probability trajectories using graph-walking algorithms like Dijkstra's to reach a desired goal state.
Chapter 3

Basic Intelligence

We discuss here the challenge that agents with imperfect perceptual systems face while trying to predict and control a spatio-temporal environment. We make minimal assumptions on the nature of the environment and the agent: the environment has multiple spatial locations, each with a local physical state; the future local states depend on the complete history of states at each location (the global state); the agent and its actions are part of the environment state and can have causal influence over its future; the agent is at a single spatial location at each time step; it can observe the physical state of the environment only by taking measurements from the spatial location it occupies; and the measurements are taken by imperfect perceptual systems that do not fully capture the information contained in the physical state of the environment. The objective of the agent is predicting (controlling) the global state of the environment.

We propose a taxonomy of the computational systems required to exhibit basic intelligence under the constraints defined above: 1) physical transduction mechanisms, 2) approximately stationary general-purpose perceptual systems, 3) general-purpose generative dynamic models. We separate systems by their spatio-temporal memory requirement, their degree of specialization, and consequently their generalizability across space and time.

Transduction mechanisms transform external physical phenomena (such as light and air vibrations) into electrical signals. Approximately stationary perceptual systems are used as a mechanism to compute stable features that act as sufficient statistics of the inputs. General-purpose dynamic models can learn complex temporal structure for arbitrary perceptual inputs. Dynamic models that integrate information across time are learned not on the sensory inputs, but on the outputs of perceptual systems.
3.1 The Environment

Let the environment be a spatio-temporal stochastic process of physical states $Z_t = (Z^1_t, \ldots, Z^K_t) \in \mathbf{Z}^K$, where $K$ is the dimensionality of the environment. We assume the agent has constrained perceptual dimensions and can perceive a single point of space $Z^k_t$ at each time-step. We use the abstraction of space in analogy to the concept of task in machine learning. In later chapters on meta-intelligence we will exploit this association between space and tasks to reason about how transitions in space create a process in terms of locally optimal predictive representations, and about the implications of this for attention, multitask, and continual learning.
3.1.1 Global Causality

The dynamics of the environment are determined by the global causal channel $P(\overrightarrow{Z} \mid \overleftarrow{Z})$ that fully specifies the future global dynamics $\overrightarrow{Z}$ of the environment as a function of the complete global history of physical states $\overleftarrow{Z}$. Generally we consider the agent and its actions $A_t$, with values $a_t \in \mathbf{A}$, as part of the state of the environment. We indicate with $\tilde{Z}_t$ the state of the environment without considering the action $A_t$ of the agent, such that $Z_t = \tilde{Z}_t \cup A_t$ and $P(\overrightarrow{Z} \mid \overleftarrow{Z}, \overleftarrow{A}, \overrightarrow{A})$. When we are concerned with describing the next state of the environment $Z_t \mid \overleftarrow{Z}$ from the perspective of the agent, we abuse the notation and write it conditioning on the next action, $P(Z_t \mid \overleftarrow{Z}, A_t)$, without any regard for the underlying notation, as the agent has full control over the realizations of the action $A_t$.
3.1.2 Local Causality

The causal dynamics at each location $k$ of the local state $Z^k_t$ depend on the history of local states at every other location through the global causal channel. We decompose the mutual information between a local state and the global history, $I[Z^k_t; \overleftarrow{Z}]$, into three components: the information contained in the history of all the other locations, $I^{-k} = I[Z^k_t; \overleftarrow{Z}^{-k}]$; the information contained in the history at location $k$, $I^k = I[Z^k_t; \overleftarrow{Z}^k]$; and the information that appears only when all the locations are considered simultaneously, $I^{k,-k} = I - (I^k + I^{-k})$. We call the purely local component $P(Z^k_t \mid \overleftarrow{Z}^k, A_t)$ the local causal channel, with excess entropy $I^k$ and local conditional entropy $H^k(Z^k_t \mid \overleftarrow{Z}^k, A_t)$ lower bounded by the global conditional entropy $H(Z^k_t \mid \overleftarrow{Z}, A_t)$ plus the non-redundant parts of the missing information given by considering the local history in isolation, that is, the extra entropy given by ignoring $I^{-k}$ and $I^{k,-k}$. Throughout this thesis we focus on agents that exploit the local causal channel to predict and control their environment. We assume the global causal channel, and respectively all its local components, to be stationary ergodic (decision) processes.

While it is natural to think of extensions of this framework where agents attempt to reduce the uncertainty of local states by constructing approximate representations of the global state through spatial integration of information, this is not our current intention. We currently use the relationship between global and local causality exclusively as a way to attribute the source of uncertainty in local predictions.
3.1.3 Computing Devices

A computing device $f_{\theta_i}$ is a subset of the local environment that implements an input-output channel $\overrightarrow{Y} \mid \overrightarrow{X}$, where the dynamics of the channel's map $f(\theta_i): \overleftarrow{X} \times \overleftarrow{Y} \mapsto \overrightarrow{Y}$ are determined by the configuration parameters $\theta_i$. A computing device is therefore an abstraction for a physical model able to transduce the past input information $\overleftarrow{X} \times \overleftarrow{Y}$ into the future input-conditional output $\overrightarrow{Y} \mid \overrightarrow{X}$, where the nature of the map can be controlled through manipulation of the configuration parameters $\theta_i \in \Theta$. A computing device is characterized by the input-conditional word probabilities $P_{\theta_k}$ induced by each possible configuration $\theta_k \in \Theta$:

$$P_{\theta_k}(Y_{t:t+L} = y_{t:t+L} \mid \overrightarrow{X} = \overrightarrow{x}, \Theta = \theta_k) \qquad (3.1)$$

Figure 3.1: Graphical representation of a computing device approximating a channel without memory

Figure 3.2: Graphical representation of a neural network approximating a stochastic process with memory

Figure 3.3: Graphical representation of a neural network approximating a channel with memory

A collection of computing devices $(f_0, \ldots, f_n)$ and respective parameters $(\theta_0, \ldots, \theta_n)$ can be used to form other, more sophisticated, composite computing devices. A computational graph is a directed acyclic graph with $n$ computing devices as nodes and an adjacency matrix defining the ordering of the operations. When the operations happen in discrete time, computational graphs correspond to artificial neural networks, where the concept of a network's layers corresponds to computational devices that are activated in parallel, and depth corresponds to the number of time-steps the computation is carried on.
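As a small sketch of this correspondence (ours, with hypothetical devices), a computational graph can be evaluated by visiting devices in a topological order and feeding each one the outputs of its parent devices:

```python
import math

# Computing devices as nodes of a directed acyclic graph: each entry maps a
# device name to its function and the names of its parent devices.
devices = {"f0": (lambda x: 2.0 * x, ["in"]),
           "f1": (lambda x: math.tanh(x), ["f0"]),
           "f2": (lambda a, b: a + b, ["f0", "f1"])}

def run_graph(devices, order, value):
    """Evaluate the graph in a given topological order, so every device
    sees the already-computed outputs of its parents."""
    values = {"in": value}
    for name in order:
        fn, parents = devices[name]
        values[name] = fn(*(values[p] for p in parents))
    return values

print(run_graph(devices, ["f0", "f1", "f2"], 0.5))
# {'in': 0.5, 'f0': 1.0, 'f1': 0.761..., 'f2': 1.761...}
```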
3.2 The Agent

The agent is a collection of computing devices $f = (f_0, \ldots, f_n)$ with $f_i \in \mathbf{F}$ that collectively exhibit basic intelligence; that is, they actively accumulate and transform information in order to generate a dynamical system $f: \overleftarrow{Z}^k \mapsto S$ that has the same information content as the local causal channel. The objective of basic intelligence is to maximize the amount of information contained in the agent that is useful to predict the local future for each location:

$$\min_{\theta^k} \sum_{k \in K} H(\overrightarrow{Z}^k \mid S_t = f(z^k_{0:t}; \theta^k), A_t) \qquad (3.2)$$

which has a minimum in the local conditional entropy $H^k(\overrightarrow{Z}^k \mid \overleftarrow{Z}^k, A_t)$. Equivalently, basic intelligence can be thought of as searching, for each location $k$, for a configuration of the internal computing device $\theta^k$ that uses the empirical states $z^k_{0:t}$ to output a representation that minimizes the predictive information loss with respect to the theoretical local causal channel. The agent interacts with $\overrightarrow{Z}^k$ through two channels, which are respectively dedicated to the causal effects of the computations produced by the computing devices $f$ on the external environment and on the agent itself. The first is called the policy channel $\pi^k = P(A_t \mid \overleftarrow{Z}^k)$, which maps the past physical states and actions into the next action. The second is called the learning channel $l = P(\theta^k_t \mid \overleftarrow{Z}^k)$, which maps the past physical states and actions into a new configuration $\theta^k_t$ for the computing device that is approximating the $k$th local causal channel.

In a basic intelligent agent the policy channel outputs actions $\overrightarrow{a}$ that create the sequence of physical states $\overrightarrow{z} \mid a$ which, once given as input to the learning channel, would give the optimal configuration $\theta^{k*}$, a minimizer of the optimization problem 3.2. In other words, a basic intelligent agent interacts with the environment with a policy that reveals the data sequence that would induce the configuration of each internal device that is maximally predictive of the future. For simplification we start by considering a setting where both the policy $\pi$ and the learning rule $l$ are stationary and given. It is natural to think about extensions where an information-seeking policy is learned interactively [83, 84] or where the learning rule itself is learned through a meta-optimization process [39, 3, 93].
We now define a functional categorization scheme for the agent's devices involved in constructing an internal dynamical system that exhibits information analogous to that in the local causal channel. The first category is composed of sensory transduction devices, whose main role is transforming the external physical state $z^k_t$ into a sufficient internal representation $s^k_t$ which the agent can further manipulate through the other devices. The second category of sensory processing devices has the role of removing artefactual inter-temporal dynamics from the internal channel $\overrightarrow{S}$, i.e., dynamics that are due to extra information generated by the transduction process. This corresponds to internal representations that have transition dynamics independent of their own past given the past of the external state, $P(S_t \mid \overleftarrow{Z}^k_t) = P(S^k_t \mid \overleftarrow{Z}^k, \overleftarrow{S}^k)$. This latter condition allows us to study the dynamics of the local causal channel $P(\overrightarrow{Z}^k \mid \overleftarrow{Z}^k)$ using the internal sensory channel $P(\overrightarrow{S}^k \mid \overleftarrow{S}^k)$, through a group of devices dedicated to the imitation of temporal sensory dynamics.
3.2.1 Sensory Transduction

The agent, a part of the environment itself, does not have direct access to the complete observations $z^k_t$ but must rely on a collection of devices that compose the sensory system $\phi: \mathbf{Z} \mapsto \Sigma$, which maps the physical state at the agent's location into an internal representation $\sigma \in \Sigma$. We assume the system is multi-modal: there are $Q$ sensory systems that output representations $\sigma^i$ through the lossy transducers $\phi^i = P(\sigma^i_t \mid Z^k_t)$ with $i \in 0, 1, \ldots, Q$. Each sensory transducer is able to encode a partial subset of the information contained in $Z^k_t$; the information loss due to sensory transduction of each system is given by the bivariate cross-entropy $H[Z^k_t \mid \sigma^i_t]$, while the total information loss is given by the multivariate cross-entropy

$$H_t[Z^k_t \mid \sigma^0_t, \ldots, \sigma^Q_t] \qquad (3.3)$$

which lower bounds each bivariate cross-entropy. The objective of this first computational system is that of constructing a collection of statistics $\sigma^0_t, \ldots, \sigma^Q_t$ that are collectively a sufficient representation of the physical state, i.e., that minimize equation 3.3.

The transducers are local in the sense that they only depend on information at the same spatial location as the agent, $I[\sigma^i_t; Z^k_t] = I[\sigma^i_t; Z_t]$, and are either memoryless, that is, they depend only on the instantaneous stimulus, $I[\sigma^i_t; Z^k_t] = I[\sigma^i_t; \overleftarrow{Z}^k]$, or possess a simple short-term adaptation property, $I[\sigma^i_t; Z^k_{t:t-l}] = I[\sigma^i_t; \overleftarrow{Z}^k]$. The parameters $\theta^1, \ldots, \theta^Q$ of the devices responsible for sensory transduction are typically learned over the evolutionary scale and develop innately through development, i.e., they are not learnable, and their modification during the lifespan is typically associated only with degeneration. The repeated construction of a sufficient representation $\sigma^0_t, \ldots, \sigma^Q_t$ from each time step of the local process $\overrightarrow{Z}^k$ gives rise to a collection of internal processes $\overrightarrow{\sigma} = \overrightarrow{\sigma}^1, \ldots, \overrightarrow{\sigma}^Q$. If we assume that the instantaneous cross-entropy of equation 3.3 is equal to 0, and therefore $\sigma$ is a sufficient statistic of $Z^k_t$, it follows that the past $\overleftarrow{\sigma}$ is a sufficient statistic of the local causal channel:

$$I[\overrightarrow{Z}^k; \overleftarrow{\sigma}] = I[\overrightarrow{Z}^k; \overleftarrow{Z}^k] \qquad (3.4)$$
3.2.2 Sensory Processing

Through the sensory transduction system the internal representation becomes a sufficient statistic of the external environment, but the opposite is not necessarily true; that is, predicting the future of the internal dynamics might require more information than is contained in the history of the external environment, $I[\overrightarrow{\sigma}; \overleftarrow{Z}^k] \leq I[\overrightarrow{\sigma}; \overleftarrow{\sigma}]$. If that is the case, then the sensory transduction channel generates extra artefactual information that is useful for predicting the future $\overrightarrow{\sigma}$. Therefore, the past-future mutual information $I[\overrightarrow{\sigma}; \overleftarrow{\sigma}]$ can be decomposed into two terms: $I[\overrightarrow{\sigma}; \overleftarrow{Z}^k]$, which measures the information shared between the future representation $\overrightarrow{\sigma}$ and the history of physical states $\overleftarrow{Z}^k$, and the residual artefactual information $I[\overrightarrow{\sigma}; \overleftarrow{\sigma}] - I[\overrightarrow{\sigma}; \overleftarrow{Z}^k]$, which is generated spontaneously by the sensory transduction process and has no relationship with the future $\overrightarrow{Z}^k$.

The role of the sensory processing devices $\psi: \Sigma \mapsto \tilde{\Sigma}$ is the transformation of the input process $\overrightarrow{\sigma}$ into an output process $\overrightarrow{\tilde{\sigma}}$ whose temporal dynamics are unaffected by perceptual artifacts, i.e., such that $I[\overrightarrow{\tilde{\sigma}} \mid \overleftarrow{\tilde{\sigma}}] \approx I[\overrightarrow{\tilde{\sigma}} \mid \overleftarrow{Z}^k]$. That is, the role of the $k$th early perceptual system $\psi^k(\sigma^k)$ is the construction of statistics $\tilde{\sigma}_t$ that are invariant to temporal transformations of $\sigma^k$ that do not depend on any underlying variation of the physical state $Z_t$. When the pasts $\overleftarrow{Z}^k$ and $\overleftarrow{\tilde{\sigma}}$ are sufficient statistics of, respectively, the futures $\overrightarrow{\tilde{\sigma}}$ and $\overrightarrow{Z}^k$, the internal sensory process transfers to the future the same information content as the local causal channel. This implies that the minimal sufficient representations of the sensory channel are also minimal sufficient representations of the local causal channel.
Figure 3.4: Graphical representation of the perceptual cycle of the agent ignoring the effect of actions
3.2.3 Temporal Sensory Dynamics

The last collection of devices are those dedicated to the modelling of the internal temporal sensory dynamics. This model of the internal dynamics $\overrightarrow{S} \mid \overrightarrow{\tilde{\sigma}}$ can produce the expected future trajectories of its own sensory dynamics, $P(\overrightarrow{\tilde{\sigma}} \mid S_t, \overrightarrow{A})$. Because of the informational equivalence between the pasts $\overleftarrow{\tilde{\sigma}}$ and $\overleftarrow{Z}^k$, their compression into minimal sufficient statistics is equivalent. That is, the $\epsilon$-machines of either the physical state process $\overrightarrow{Z}^k$ or the sensory process $\overrightarrow{\tilde{\sigma}}$ have isomorphic causal states. The main idea here is that the role of the sensory transduction and sensory processing systems is creating a time-series of internal representations, the sensory process, that can be used to study the causal structure of the external physical reality. This happens when the causal equivalence $\sim_{\epsilon_{\tilde{\sigma}}}$ creates clusters of histories with the same time-index as the causal equivalence $\sim_{\epsilon_Z}$.
Figure 3.5: Graphical representation of the action-perception cycle of the agent
3.3 Planning with Causal States

The graph $G = (\mathbf{S}, T^{(o|a)}_{ij})$ can be used to easily derive the optimal one-step state-to-state policy $\pi_{ij} = a^*$ that maximizes the probability $\pi^*_{ij}$ of transitioning between the neighboring nodes $S_i$ and $S_j$ by solving:

$$\max_{a \in \mathbf{A}} P(S_{t+1} = S_j \mid S_t = S_i, A_t = a) = \max_{a \in \mathbf{A}} \sum_{o \in \mathbf{O}} T^{(o|a)}_{ij} \qquad (3.5)$$

The optimal single-step policy $\pi^*_{ij}$ from (3.5) can be used to create the optimal transition graph $C = (\mathbf{S}, \pi^*)$ with edges $C_{ij} = \pi^*_{ij}$. Following [101], we can use the optimal transition graph to derive the multi-step plan $\pi^k(s_i, s_j)$ between any couple of reachable causal states $S_i$ and $S_j$ as a shortest path problem. The optimal path, i.e., the path that maximizes the probability of a successful transition between the initial and target state, can be found using Dijkstra's algorithm with the distance metric $-\log(\pi^*_{ij})$.
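A minimal sketch of this planning step follows (our own illustration over a hypothetical graph): running Dijkstra's algorithm with edge weights $-\log \pi^*_{ij}$ returns the plan whose product of transition probabilities is maximal.

```python
import heapq
import math

def max_probability_plan(edges, source, target):
    """Dijkstra's algorithm over weights -log(p), so the shortest path
    maximizes the product of transition probabilities.
    edges: {state: [(next_state, action, probability), ...]}"""
    best = {source: 0.0}
    queue = [(0.0, source, [])]
    while queue:
        cost, state, plan = heapq.heappop(queue)
        if state == target:
            return math.exp(-cost), plan  # success probability, action list
        if cost > best.get(state, math.inf):
            continue
        for nxt, action, p in edges.get(state, []):
            c = cost - math.log(p)
            if c < best.get(nxt, math.inf):
                best[nxt] = c
                heapq.heappush(queue, (c, nxt, plan + [action]))
    return 0.0, None

# Hypothetical optimal transition graph C with edges pi*_ij.
edges = {"s0": [("s1", "a0", 0.9), ("s2", "a1", 0.5)],
         "s1": [("s2", "a1", 0.8)],
         "s2": []}
print(max_probability_plan(edges, "s0", "s2"))  # (0.72, ['a0', 'a1'])
```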
When the objective of our control policy $\pi^k(s_i, o^*)$ is not to manipulate the environment into a target causal state but to force it to emit a specific observable $o^*$, the graph $C$ is not sufficient, as the restriction of a single action per state transition is not enough to induce a deterministic mapping between transitions and observables. Additionally, finding the max-likelihood transition might discard actions with a high likelihood of revealing the target observable. We extend the $C$ graph with a set of target nodes $S^{+o^*}$ by introducing a copy of each node reachable through an edge of the graph $G$ with label $T^{(o^*|a)}_{ij} > 0$ for each $i$ in the non-zero set for at least one action $a \in \mathbf{A}$, and call $S^{-o^*}$ the non-original set.

The optimal symbol emission policy $\pi^{o^*}_{ij}$ for transitioning to target nodes is given by the action that maximizes the probability of emission, $\pi^{o^*}_{ij} = \max_{a \in \mathbf{A}} T^{(o = o^*|a)}_{ij}$. The new graph $C^{o^*} = (\mathbf{S} + S^{+o^*}, \pi^{+o^*})$ has elements $c^{o^*}_{ij}$ such that:

$$c^{o^*}_{ij} = \begin{cases} \pi_{ij}, & \text{if } j \in S^{-o^*} \\ \pi^{o^*}_{ij}, & \text{if } j \in S^{+o^*} \\ 0, & \text{if } i \in S^{+o^*} \end{cases} \qquad (3.6)$$

We do not maintain the $i$th duplicate of the $j$th target node in the original set $S^{-o^*}$ if $\pi_{ij} = \pi^{o^*}_{ij}$, since the highest probability transition between the two states can be achieved while emitting the target symbol. The max probability plan to emit a desired symbol can be found using Dijkstra's algorithm between the source state and all target nodes over the graph $C^{o^*}$, choosing the path whose optimal multi-step plan has the highest likelihood of reaching the target node.
Chapter 4

Learning Minimal Sufficient Representations of Dynamical Systems

4.1 Abstract

Intelligent agents cope with the rich sensory world by learning succinct latent models. We propose a new algorithm for extracting minimal sufficient unifilar hidden Markov models of discrete stochastic processes from neural networks trained to approximate the process. This approach draws ideas both from information theory, whose principles govern the optimal representations of non-linear dynamical systems, and machine learning, which offers techniques for approximating systems from empirical realizations. We apply our algorithm to reconstruct the underlying causal structure of both observable and latent stochastic processes: (i) we study the ability of our model to retrieve the correct number of hidden states from known stochastic processes, and (ii) we study non-i.i.d. versions of traditional Gaussian mixture modelling and image classification benchmarks, demonstrating that our approach accurately reconstructs the latent past-conditional sampling scheme.
4.2 Introduction

When observations are complex but the underlying dynamics of the environment are simple, an agent can exploit this structure by inferring a latent model underlying its perceptual observations. We study this problem from the perspective of an observer that wishes to predict the future of a stationary stochastic process based on historical observations. For stochastic processes that unfold in discrete time steps, information theory [14] offers an elegant framework for defining optimal representations through the idea of causal states [crutchfield1989inferring]. These representations are minimal sufficient statistics of the stochastic process, constructed by grouping sequences of past observations as corresponding to the same causal state if they predict the same future sequence. The model describing the transition probabilities between causal states has the form of a unifilar hidden Markov model [6] and is called the $\epsilon$-machine of the process. In computational mechanics [15], $\epsilon$-machines have been extensively studied and are known to have many useful properties. For instance, they allow the closed-form expression of multiple relevant information-theoretic quantities [21] (e.g., the Markov order of the process), and can predict future trajectories as well as any other (non-oracle) model [18, 78].

Recently, neural networks have become popular tools used not only for conventional discriminative learning tasks, but also for approximating dynamical systems with highly structured outputs (e.g., video or speech generation [33]) and long-range sequential relationships (e.g., language modeling [86]). In this paper, we observe that when neural networks learn to predict a stationary stochastic process as well as is theoretically possible, their hidden states correspond to sufficient representations of the past of the process, and hence can be partitioned into clusters that correspond to the optimal minimal sufficient statistics of the $\epsilon$-machine, i.e., its (discrete) causal states.

We exploit this observation to develop a new algorithm to reconstruct the causal states, and the corresponding $\epsilon$-machine, of a stationary stochastic process from a neural network trained to predict the next observation of the process. Because we derive the causal states (and their transition kernel) from a (non-minimal) representation (a hidden representation in the NN), we are able to approximate $\epsilon$-machines for a broader class of processes than was previously possible. After demonstrating through multiple examples that our algorithm is able to reconstruct the causal states of stochastic processes, we apply it to the case where the discrete outputs of the process are masked by a stochastic mapping into a high-dimensional structured space (in our applications, a space of images).
4.3 Optimal Representation of Discrete Stochastic Processes
We rst consider a stochastic process given by a bi-innite one-dimensional chain
!
Y of discrete
random variablesY
t
that take valuesy
t
from the nite alphabet of symbols Y. For a xed random
variable Y
t
, we indicate its non-inclusive past with
Y =:::Y
3
;Y
2
;Y
1
and its inclusive future
with
!
Y =Y
0
;Y
1
;Y
2
:::, dropping the subscriptt from the notation for convenience when the context
is clear. The set of all bi-innite sequences
!
Y with alphabet Y is indicated as
!
Y. We focus on
analysis of ergodic stationary processes, i.e., processes for which the probability of every sequence
of nite length L2Z
+
is time-invariant, which can be reliably estimated from empirical data.
We represent the history by :
Y7! S a map from the set of semi-innite pasts to the set of
statistics S. A representation is sucient if it produces sucient statistics of the future, i.e.,
when the mutual information between representations of the past and the future observables is
maximized. This occurs when the representation retains all useful information about the future
from the complete past.
A sucient statistic is minimal if it can be calculated from any other statistics, i.e., there
exists a map
:
0
7! for all
0
that are maximally predictive of the future. For station-
ary processes, a minimal sucient recursive representation is given by the -machine crutch-
eld1989inferring,shalizi2001computational, i.e., the unique miminal sucient minimal unilar
24
hidden Markov model able to generate the process. The -machine of a stochastic process is
constructed by nding groups of past histories that induce the same distribution of the future.
This clustering procedure is dened by the causal equivalence
:
y
y
0
() P(
!
Yj
Y =
y ) =P(
!
Yj
Y =
y
0
) () P(
!
YjS
t
=
i
): (4.1)
for any sucient representaion . We call minimal sucient representations optimal.
The partitions induced by the causal equivalence $\sim_\epsilon$ over the space of feasible pasts $\overleftarrow{\mathcal{Y}}$ are the causal states of the process, which take values from an alphabet $\mathcal{S}$ with elements $\sigma_i$. The $\epsilon$-map is a function $\epsilon : \overleftarrow{Y} \mapsto \mathcal{S}$ that associates histories with their respective causal states. Sampling of new symbols in the sequence induces the creation of new histories and consequently new causal states. The stochastic process induced by the output of the $\epsilon$-map after each new observation is called the causal state process and is defined over the bi-infinite sequences of causal states $\overleftrightarrow{S}$. The recurrent dynamics of the causal state model are fully specified by the state-conditional symbol emission probability $P(Y_t \mid S_t)$ and the symbol-conditional causal state emission probability $P(S_{t+1} \mid Y_t, S_t)$. Together these quantities form an indexed set $\mathcal{T}$ of symbol-labeled transition matrices $T^{(y)}$ with elements

$T^{(y)}_{ij} = P(S_{t+1} = \sigma_j, Y_t = y \mid S_t = \sigma_i). \quad (4.2)$

The probability of the current output symbol is conditionally independent of the past given the causal state, and the probability of the next causal state is a function uniquely of the current state and the output symbol, making the sequence of states a Markov chain. Furthermore, causal state models are unifilar: the transitions between states are deterministic given the output symbol, $H[S_{t+1} \mid Y_t, S_t] = 0$, where $H$ denotes the entropy function.

To summarize, the ε-machine of a stationary stochastic process is the minimal sufficient hidden Markov model with deterministic transitions, and is fully specified by the tuple $\langle \mathcal{Y}, \mathcal{S}, \mathcal{T} \rangle$ containing the output alphabet, the set of causal states, and the set of indexed transition matrices.
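To make these objects concrete, the following minimal sketch (illustrative names, not code from our experiments) encodes the ε-machine of the well-known binary "even process" as its symbol-labeled transition matrices and samples from it; the unifilarity of the transitions is visible in the deterministic next-state lookup.

import numpy as np

# Epsilon-machine of the binary "even process" as the tuple <Y, S, T>.
# T[y][i, j] = P(S_{t+1} = sigma_j, Y_t = y | S_t = sigma_i); unifilarity
# means each (state, symbol) pair has at most one successor state.
Y = [0, 1]
T = {0: np.array([[0.5, 0.0],    # sigma_0 emits 0 w.p. 1/2 and stays
                  [0.0, 0.0]]),  # sigma_1 never emits 0
     1: np.array([[0.0, 0.5],    # sigma_0 emits 1 w.p. 1/2, moves to sigma_1
                  [1.0, 0.0]])}  # sigma_1 emits 1 w.p. 1, returns to sigma_0

def sample(T, s=0, length=20, rng=np.random.default_rng(0)):
    out = []
    for _ in range(length):
        probs = np.array([T[y][s].sum() for y in Y])  # P(Y_t | S_t = s)
        y = int(rng.choice(Y, p=probs))
        s = int(np.argmax(T[y][s]))   # deterministic (unifilar) transition
        out.append(y)
    return out

print(sample(T))  # ones occur only in even-length blocks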
4.4 Causal States of Latent Stochastic Processes
Let $\overleftrightarrow{Y}$ be an ergodic stationary process with discrete realizations $y_t$ in the finite alphabet $\mathcal{Y}$, and let $\overleftrightarrow{X}$ be an ergodic stationary process of high-dimensional measurements $x_t \in \mathcal{X}$. We denote by $G$ the rendering function $G : \overleftrightarrow{Y} \mapsto \overleftrightarrow{X}$, which maps the discrete outputs $y_t$ into the high-dimensional measurements $x_t$. We restrict our attention to memoryless rendering functions, such that the discrete realizations $y_t$ are minimal sufficient statistics of the measurements $x_t$, i.e.:

$P(X_t \mid Y_t) = P(X_t \mid Y_t, \overleftarrow{Y}) = P(X_t \mid \overleftarrow{X}) \quad (4.3)$

When $G$ is the identity function, the process $\overleftrightarrow{X} = \overleftrightarrow{Y}$ and we call the stochastic process revealed. For the general case where $\overleftrightarrow{X} \neq \overleftrightarrow{Y}$ we say that the process is latent.¹
Let $\epsilon$ denote the unique minimal unifilar Markov model of the stochastic process $\overleftrightarrow{Y}$ and $s_i \in \mathcal{S}$ denote its (finite) causal states. Together, $\epsilon$ and $G$ fully specify both the latent and measured sequences of observations, and since the $s_t$ are the minimal sufficient statistics of $y_{t+1}$, they are in turn minimal sufficient statistics of $x_{t+1} = G(y_{t+1})$ when $G$ is memoryless:

$P(Y_t \mid S_t) = P(Y_t \mid \overleftarrow{Y}) \wedge P(X_t \mid Y_t) = P(X_t \mid \overleftarrow{X}) \implies P(X_t \mid S_t) = P(X_t \mid \overleftarrow{Y}) = P(X_t \mid \overleftarrow{X}). \quad (4.4)$
4.4.1 Recovering Causal States from Sufficient Statistics
In the previous section we defined ε-machines as the optimal representations of discrete stochastic processes. We now introduce a new algorithm that accurately reconstructs ε-machines from empirical data. Existing methods either directly partition past sequences of length $L$ into a finite number of causal states via conditional hypothesis tests [Shalizi & Klinkner, 2004] or use Bayesian inference over a set of existing candidate ε-machines [Strelioff & Crutchfield, 2014]. Our method departs from these previous approaches by first estimating continuous sufficient statistics of the process with recurrent neural networks, and only subsequently mining the learned representation space to approximate the minimal discrete causal states.

¹ This categorization is different from the classical treatment of hidden and revealed stochastic processes, as we model the latent stochastic process as a hidden Markov process itself.

Figure 4.1: Graphical representation of a discrete latent process with alphabet between 0 and 9 and the observed measurements sampled from the corresponding image category in the MNIST dataset.

When the process is latent, traditional methods do not work because the output alphabet and the mapping from high-dimensional measurements to latent observables are unknown. Our two-step procedure directly uses neural networks to learn sufficient statistics of the high-dimensional measurements. When the latter are generated from memoryless stochastic mappings, they correspond to sufficient statistics of the latents, and can consequently be refined into the causal states.
We now introduce an alternative representation of the measured process $\overleftrightarrow{X}$ learned via an encoder $f : \mathcal{X} \mapsto \hat{\mathcal{Y}}$, a decoder $g : \hat{\mathcal{Y}} \mapsto \mathcal{X}$, a recurrent neural network $h : \overleftarrow{\hat{Y}} \mapsto \hat{\mathcal{S}}$, and a prediction network $\pi : \hat{\mathcal{S}} \mapsto \hat{\mathcal{Y}}$. The encoder maps the measurements $x_t$ to the corresponding sufficient statistics $\hat{y}_t \in \mathbb{R}^p$, and is optimized via the reconstruction loss of the decoder. The recurrent neural network $h$ maps the history of encoder representations $\overleftarrow{\hat{y}}$ into the hidden states $\hat{s}_t \in \mathbb{R}^k$. Finally, the prediction network maps the RNN hidden state into the next sufficient statistic $\hat{y}_{t+1}$. We note that when $\pi(\hat{s}_t)$ is maximally predictive of the subsequent encoder embeddings (the future $\overrightarrow{\hat{Y}}$), $\hat{s}_t$ constitutes a sufficient statistic of both the future measurements $\overrightarrow{X}$ and the latent observables $\overrightarrow{Y}$. Together with the unifilar and Markovian nature of transitions in RNNs, the sufficiency of $\hat{s}_t$ implies that there exists a function $D_s : \mathbb{R}^k \mapsto \mathcal{S}$ that allows us to describe the causal states $s$ as a refinement of the RNN hidden state $\hat{s}$ [Crutchfield et al., 2010], and a function $D_y : \mathbb{R}^p \mapsto \mathcal{Y}$ that allows us to describe the process output $y_t$ as a refinement of the auto-encoder hidden state $\hat{y}_t$.
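A minimal PyTorch sketch of the four maps follows, assuming vector-valued measurements; the layer sizes and architectures are illustrative placeholders, not the exact configuration used in our experiments.

import torch
import torch.nn as nn

class CausalStateModel(nn.Module):
    # f: encoder, g: decoder, h: RNN over encoder outputs, pi: predictor.
    def __init__(self, x_dim, p=16, k=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, p))
        self.g = nn.Sequential(nn.Linear(p, 64), nn.ReLU(), nn.Linear(64, x_dim))
        self.h = nn.LSTM(input_size=p, hidden_size=k, batch_first=True)
        self.pi = nn.Linear(k, p)

    def forward(self, x):                # x: (batch, T, x_dim)
        y_hat = self.f(x)                # sufficient statistics of each x_t
        x_rec = self.g(y_hat)            # auto-encoding branch (loss L_e)
        s_hat, _ = self.h(y_hat)         # continuous states from the history
        x_next = self.g(self.pi(s_hat))  # predicted next measurement (loss L_r)
        return x_rec, x_next

model = CausalStateModel(x_dim=20)
x = torch.randn(8, 50, 20)
x_rec, x_next = model(x)
loss = ((x_rec - x) ** 2).mean() + ((x_next[:, :-1] - x[:, 1:]) ** 2).mean()
loss.backward()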
In practice we estimate the causal states $\mathcal{S}$ by first using the empirical realizations $\overleftarrow{x}$ to learn a neural network that approximates end-to-end the maps $f, g, h, \pi$, minimizing:

$\min_{w_g, w_\pi, w_h, w_f} \sum_t^T \mathcal{L}_r\left( P(X_t \mid \overleftarrow{X} = \overleftarrow{x}),\ (g_{w_g} \circ \pi_{w_\pi} \circ h_{w_h} \circ f_{w_f})(\overleftarrow{x}) \right) + \sum_t^T \mathcal{L}_e\left( P(X_t \mid X_t = x_t),\ (g_{w_g} \circ f_{w_f})(x_t) \right) \quad (4.5)$
After solving Eq. 4.5 we can use the neural network parameterized by the optimal parameters $w^*_g, w^*_\pi, w^*_h, w^*_f$ to estimate the causal states. We set up a second optimization problem using the trained neural network and the empirical realizations of the process $\overleftarrow{x}$ to estimate the discretization network $d_s : \hat{\mathcal{S}} \mapsto \bar{\mathcal{S}}$ with $|\bar{\mathcal{S}}| = |\mathcal{S}|$, and a new prediction network $\bar{\pi} : \bar{\mathcal{S}} \mapsto \hat{\mathcal{Y}}$ that maps causal states into the next sufficient statistic of the measurements. We match the predictive behavior of the neural network between the original RNN and prediction function and the discretized states $\bar{s}$ and the corresponding prediction function by minimizing the knowledge distillation [35] loss:

$\min_{w_{\bar{\pi}}, w_d} \sum_t^T \mathcal{L}_d\left( (g_{w^*_g} \circ \pi_{w^*_\pi} \circ h_{w^*_h} \circ f_{w^*_f})(\overleftarrow{x}),\ (g_{w^*_g} \circ \bar{\pi}_{w_{\bar{\pi}}} \circ d_{s,w_d} \circ h_{w^*_h} \circ f_{w^*_f})(\overleftarrow{x}) \right) \quad (4.6)$
Minimizing Eq. 4.6 guarantees a sufficient discrete representation, and by adding the following regularizing terms we enforce the resulting states to be minimal and their transitions unifilar:

$\min_{|\bar{\mathcal{S}}|, w_u} |\bar{\mathcal{S}}| + \sum_t^T \mathcal{L}_u\left( P(\bar{S}_{t+1} \mid \bar{S}_t, \hat{y}_t),\ u_{w_u}(\bar{S}_t, \hat{y}_t) \right) \quad (4.7)$

where the variables $w_u$ parametrize the map $u : \bar{\mathcal{S}} \times \hat{\mathcal{Y}} \mapsto \bar{\mathcal{S}}$ from the state $\bar{s}_t$ and sufficient statistic $\hat{y}_t$ to the next causal state $\bar{s}_{t+1}$. When the loss function $\mathcal{L}_u$ is equal to 0, the transition can be fully predicted from the previous state and output, implying that the transitions are deterministic.
Figure 4.2: Relative accuracy (vs. RNN) for next-symbol prediction as a function of the number of states for different binary processes. Graphical illustrations depict processes from [85], with edges labeled by the associated output emission. RNN accuracy is matched at the correct number of states, except for process n5k2id22970, where the accuracy is matched with 3 (vs. 5) causal states.
Memory \ |Y|     2            4            8            10
2                0.99, 0.99   0.98, 0.98   0.96, 0.96   0.83, 0.87
4                0.99, 0.98   0.97, 0.96   0.94, 0.91   0.70, 0.76
8                0.96, 0.93   0.54, 0.56   0.67, 0.65   0.35, 0.38
16               0.65, 0.67   0.45, 0.47   0.17, 0.178  0.11, 0.11

Table 4.1: Decoding accuracy of a neural net trained to recover the true process' output variable from our model's hidden states in the latent MNIST case. |Y| denotes the size of the process' alphabet. First number: accuracy relative to the oracle when trained on the discrete states $\bar{s}$; second: accuracy relative to the oracle when trained on the continuous representations $\hat{s}$. Additional results (oracle accuracies and absolute accuracies) for the same processes can be found in the appendices.
The minimization of the cardinality of the causal state alphabet $|\bar{\mathcal{S}}|$ pushes the learned representation toward minimality. To summarize, we first minimize $\mathcal{L}_r + \mathcal{L}_e$ to obtain a neural model able to generate sufficient statistics of the future observables of the process, and subsequently minimize $\mathcal{L}_d + \mathcal{L}_u + |\bar{\mathcal{S}}|$ to obtain a minimal sufficient representation of the dynamical system.
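The second phase can be sketched as follows, with a simple nearest-centroid quantizer standing in for the vector-quantized embedding of [89]; all names and sizes are illustrative.

import torch

def quantize(s_hat, codebook):
    # s_hat: (batch, T, k) continuous RNN states; codebook: (n_states, k).
    flat = s_hat.reshape(-1, s_hat.shape[-1])
    idx = torch.cdist(flat, codebook).argmin(dim=1)   # discrete state index
    s_bar = codebook[idx].reshape_as(s_hat)
    s_bar = s_hat + (s_bar - s_hat).detach()          # straight-through gradient
    return s_bar, idx.reshape(s_hat.shape[:2])

codebook = torch.nn.Parameter(torch.randn(5, 32))     # |S_bar| = 5 candidate states
s_hat = torch.randn(8, 50, 32)
s_bar, idx = quantize(s_hat, codebook)
# L_d then matches predictions computed from s_bar against those from s_hat,
# and |S_bar| is pushed down by shrinking the codebook.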
4.5 Experiment Details
For both the revealed and latent cases we approximate $h$ with a long short-term memory network [38] and $d_s$ with a vector-quantized embedding from [89]. For the revealed case, we use an MLP to implement $f$, $g$, and $\pi$, while for the latent case we use the continuous encoder-decoder architecture of [89] with ResNet [29] blocks, and train the auto-encoding loss with variational regularization [46]. We minimize all losses using the RMSProp algorithm [34], with the cross-entropy loss and reconstruction loss. A proper divergence should also be used for the high-dimensional measurements, but we found the reconstruction loss to work well in practice.

In both the revealed and latent scenarios, we find it important to estimate the RNN and the VAE simultaneously and end-to-end.
In order to arbitrarily control the memory length $k$ and alphabet size $|\mathcal{Y}|$ of the problem, we develop the artificial stochastic processes used for the experiments reported in Tables 4.2a, 4.2b, and 4.2c. We construct the process such that the probability of the next output $y_t$ depends only on the value of $y_{t-k}$, by sampling from the multinomial distribution $P(Y_t = y' \mid Y_{t-k} = y') = p$ and $P(Y_t = y' \mid Y_{t-k} \neq y') = (1-p)/|\mathcal{Y}|$ for all $y' \in \mathcal{Y}$.
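A sketch of a generator for this family of processes follows, under one normalization-consistent reading of the sampling rule (copy the symbol from $k$ steps back with probability $p$, otherwise resample uniformly); all names are illustrative.

import numpy as np

def sample_process(T, alphabet_size=4, k=2, p=0.8, seed=0):
    rng = np.random.default_rng(seed)
    y = list(rng.integers(alphabet_size, size=k))       # random initial history
    for t in range(k, T):
        if rng.random() < p:
            y.append(y[t - k])                          # repeat symbol from k steps back
        else:
            y.append(int(rng.integers(alphabet_size)))  # uniform replacement
    return np.array(y)

print(sample_process(30))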
Tables 4.2b and 4.2c present the results of projecting the output of latent processes into high-dimensional measurements. The multivariate Gaussian measurements are constructed by taking a vector composed of $k$ blocks of size $|\mathcal{Y}|$; the $i$-th block has mean $\mu_i = 4$ if $Y_t = y_i$ or $\mu_i = 0$ if $Y_t \neq y_i$. For MNIST we use a process alphabet of up to size $|\mathcal{Y}| = 10$ and associate each symbol with an image category; the measurements $x_t$ are generated by first sampling the realization $y_t$ from the process and then sampling a random image from the corresponding MNIST category.
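The Gaussian rendering function $G$ can be sketched as follows (unit variance is an assumption made for illustration):

import numpy as np

def render_gaussian(y, alphabet_size, k=1, seed=0):
    # Memoryless rendering G: a k*|Y| vector per step whose active block
    # entry has mean 4, all other entries mean 0, unit variance throughout.
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(len(y), k * alphabet_size))
    for t, yt in enumerate(y):
        for block in range(k):
            x[t, block * alphabet_size + yt] += 4.0
    return x

x = render_gaussian(np.array([0, 2, 1, 2]), alphabet_size=3, k=2)
# For MNIST, G(y_t) instead draws a random image from category y_t.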
We evaluate revealed processes by comparing the next-step forecasting accuracy for the observation $Y$ against forecasts made with the true distribution. For the latent case, we estimate via supervised learning a separate classifier $\mathcal{X} \mapsto \mathcal{Y}$ and compare its accuracy in predicting the correct $y_{t+1}$ from the generated next-step observations $\hat{x}_{t+1}$.
4.6 Results
First, in Figure 4.2, we evaluate our algorithm on several binary revealed stochastic processes generated by ε-machines with a known number of states. Specifically, we use the following processes
Memory \ |Y|   2                      4                      8                      16
2              74.08, 74.08 (74.08)   75.50, 75.54 (75.54)   73.92, 73.93 (73.93)   74.58, 74.84 (74.75)
4              74.75, 74.47 (74.47)   75.45, 75.49 (75.49)   75.08, 75.11 (75.11)   74.58, 74.63 (74.63)
8              74.37, 74.37 (74.37)   74.76, 74.79 (74.79)   73.11, 74.10 (75.47)   73.94, 74.66 (75.61)
16             73.59, 73.92 (74.46)   75.26, 75.32 (75.33)   73.60, 74.58 (75.09)   70.69, 70.98 (74.95)
(a) Discrete channel

Memory \ |Y|   2                      4                      8                      16
2              74.89, 74.89 (74.89)   74.80, 74.85 (74.90)   74.63, 74.74 (74.58)   73.67, 74.21 (74.82)
4              73.82, 73.72 (74.17)   74.13, 74.35 (74.66)   74.87, 75.02 (74.65)   73.74, 74.05 (74.29)
8              75.46, 75.79 (76.41)   75.79, 75.91 (75.97)   73.07, 73.46 (73.91)   72.26, 72.69 (75.31)
16             73.33, 73.64 (73.87)   74.48, 75.54 (75.72)   74.20, 74.41 (75.19)   58.67, 58.96 (74.84)
(b) Multivariate Gaussian rendering

Memory \ |Y|   2                      4                      8                      10
2              74.77, 74.70 (74.85)   74.26, 73.95 (75.25)   72.72, 72.48 (75.22)   62.60, 65.68 (75.32)
4              75.56, 75.49 (75.68)   73.97, 73.55 (75.95)   70.51, 68.55 (74.78)   52.69, 57.15 (74.94)
8              72.01, 69.77 (74.69)   41.39, 43.13 (75.99)   50.71, 49.14 (75.17)   26.53, 28.76 (74.99)
16             49.24, 50.96 (75.06)   33.92, 35.39 (74.93)   13.08, 13.08 (74.88)   8.80, 8.80 (74.01)
(c) MNIST

Table 4.2: Decoding accuracy of a neural network trained to recover the true output variables from our model's latent states in the revealed, latent-Gaussian, and latent-MNIST cases. |Y| denotes the size of the process' alphabet. First number: accuracy when trained on the discrete states $\bar{s}$; second: accuracy when trained on the continuous representations $\hat{s}$. Parentheses indicate the best achievable performance (oracle).
with increasing numbers of states (n1k2id3, n2k2id12, n4k2id3334, and n5k2id22970 from [85]) generated by the topological enumeration algorithm of [43]. Additionally, in Tables 4.1 and 4.2 we explore a broad class of artificial stochastic processes, varying the memory length and output alphabet size. We use the stochastic processes as an unobserved sampling scheme for non-i.i.d. versions of a Gaussian mixture problem (in the appendix) and for unsupervised MNIST training. For the Gaussian mixture case, we associate one realization of the mixture with each element of the alphabet $\mathcal{Y}$, while for non-i.i.d. MNIST we associate each element of the process' alphabet with one image category and sample a random image from that class. Our results show that the estimated ε-machines predict the future of the stochastic process consistently as well as the original RNN, in both the revealed and the latent case.
Chapter 5
Learning Minimal Sufficient Representations of Partially Observable Decision Processes
Decision making and control often require that an agent interact with an environment whose causal mechanisms are unknown or unobserved. Further, we must consider the agent's partial ability to observe the environment due to the limitations of its perceptual system. In such cases, one might hope to construct latent representations of action/observation processes. At present, this practice is dominated by two points of view: (i) partially observable Markov decision processes (POMDPs), where one starts from a generative model of the latent transition and emission dynamics, using observations to infer beliefs over the unobserved states [Åström, 1965; Cassandra et al., 1994]; and (ii) predictive state representations (PSRs), where one starts from the history of the process and constructs states through modeling of the trajectories of observations [Littman et al., 2002; Singh et al., 2004]. Both directions have drawbacks: the generative approach requires access to a model that is equivalent to the real generator, while PSRs are constrained by the information content of the observation process.

We propose a principled approach for learning state representations, inspired by information theory [81, 14], that generalizes PSRs to non-linear dynamics and allows a formal comparison between generator- and history-based state abstractions. We exploit the idea of causal states [18, 78, 80, 15], i.e., the coarsest partition of histories into classes that are maximally predictive of the future. By learning this mapping from histories to clusters, we construct the minimal sufficient unifilar conditional hidden Markov model of the observation process.
5.1 Adapting Causal States to Real Valued Measurements
We defined a framework for optimal representations of decision processes with discrete time, discrete observations, and discrete actions. We now extend it to the case where the agent can only observe the real-valued stochastic process of high-dimensional measurements $\overleftrightarrow{X}$ generated by the rendering function $G : \overleftrightarrow{O} \mapsto \overleftrightarrow{X}$, which maps the discrete observation $O_t$ into the high-dimensional measurement $X_t$ with realizations $x_t \in \mathcal{X}$. We focus on the case of memoryless rendering functions that produce the observed measurement $x_t$ using exclusively information from $o_t$, making the current observable a sufficient statistic of the current measurement, i.e.:

$P(X_t \mid O_t) = P(X_t \mid A_t, \overleftarrow{O}) = P(X_t \mid A_t, \overleftarrow{X}) \quad (5.1)$
Since the current observable $O_t$ shields the future measurements $\overrightarrow{X}$ from their past $\overleftarrow{X}$, we can search for representations of the measurement process via the optimal representations of the process $\overleftrightarrow{O}$. Let $\epsilon$ denote the unique minimal unifilar Markov model of the joint stochastic process $\overleftrightarrow{O,A}$ and $s_i \in \mathcal{S}$ denote its (finite) causal states. Together, $\epsilon$ and $G$ fully specify both the latent and measured sequences of observations, and since the $s_t$ are the minimal sufficient statistics of $O_{t+1}$, they are in turn minimal sufficient statistics of $x_{t+1} = G(O_{t+1})$ when $G$ is memoryless:

$P(O_t \mid S_t, A_t) = P(O_t \mid \overleftarrow{O}, A_t) \wedge P(X_t \mid O_t) = P(X_t \mid A_t, \overleftarrow{X}) \implies P(X_t \mid S_t, A_t) = P(X_t \mid \overleftarrow{O}, A_t) = P(X_t \mid A_t, \overleftarrow{X}).$

That is, the causal states of the latent discrete joint process $\overleftrightarrow{O,A}$ are (minimal) sufficient statistics of the measurement process $\overleftrightarrow{X}$. The same approach could be taken to adapt our framework to continuous actions by assuming the existence of a mapping between latent discrete actions and observed continuous ones.
5.2 Learning Causal States from Rollout Data
We now introduce a new algorithm that accurately reconstructs ε-machines of the joint action-observation process from empirical data. Existing methods either directly partition past sequences of length $L$ into a finite number of causal states via conditional hypothesis tests [Shalizi & Klinkner, 2004] or use Bayesian inference over a set of existing candidate ε-machines [Strelioff & Crutchfield, 2014]. Both methods can be adapted to model a joint process, and consequently obtain the next-step conditional output by marginalizing out the action $A_t$, but they do not extend to the real-valued measurement case described above.
We obtain the minimal sufficient representations of the underlying process $\overleftrightarrow{O,A}$ from a sufficient model of the observable process $\overleftrightarrow{X,A}$. This alternative representation is learned via an encoder $f : \mathcal{X} \mapsto \hat{\mathcal{O}}$, a decoder $g : \hat{\mathcal{O}} \mapsto \mathcal{X}$, a recurrent neural network $h : \overleftarrow{\hat{O}}, \overleftarrow{A}, A \mapsto \hat{\mathcal{S}}$, and a prediction network $\pi : \hat{\mathcal{S}} \mapsto \hat{\mathcal{O}}$. The encoder maps the measurements $x_t$ to the corresponding sufficient statistics $\hat{o}_t \in \mathbb{R}^p$, and is optimized via the reconstruction loss of the decoder. The recurrent neural network $h$ maps the joint history of encoder representations and actions $\overleftarrow{\hat{o}}, \overleftarrow{a}$ and the current action $a_t$ into the hidden states $\hat{s}_t \in \mathbb{R}^k$. Finally, the prediction network maps the RNN hidden state into the next sufficient statistic $\hat{o}_{t+1}$. We note that when $\pi(\hat{s}_t)$ is maximally predictive of the subsequent encoder embeddings (the future $\overrightarrow{\hat{O}}$), $\hat{s}_t$ constitutes a sufficient statistic of both the future measurements $\overrightarrow{X}$ and the latent observables $\overrightarrow{O}$. Together with the unifilar and Markovian nature of transitions in RNNs, the sufficiency of $\hat{s}_t$ implies that there exists a function $D_s : \mathbb{R}^k \mapsto \mathcal{S}$ that allows us to describe the causal states $s$ as a refinement of the RNN hidden state $\hat{s}$ [Crutchfield et al., 2010], and a function $D_o : \mathbb{R}^p \mapsto \mathcal{O}$ that allows us to describe the process output $o_t$ as a refinement of the auto-encoder hidden state $\hat{o}_t$.
In practice we estimate the causal states $\mathcal{S}$ by first using the empirical realizations $\overleftarrow{x}, \overleftarrow{a}$ to learn a neural network that approximates end-to-end the maps $f, g, h, \pi$, minimizing:

$\min_{w_g, w_\pi, w_h, w_f} \sum_t^T \mathcal{L}_r\left( P(X_t \mid \overleftarrow{X} = \overleftarrow{x}),\ (g_{w_g} \circ \pi_{w_\pi} \circ h_{w_h} \circ f_{w_f})(\overleftarrow{x}, \overleftarrow{a}, a_t) \right) + \sum_t^T \mathcal{L}_e\left( P(X_t \mid X_t = x_t),\ (g_{w_g} \circ f_{w_f})(x_t) \right) \quad (5.2)$
After solving Eq. 5.2 we can use the neural network parameterized by the optimal parameters $w^*_g, w^*_\pi, w^*_h, w^*_f$ to estimate the causal states. We set up a second optimization problem using the trained neural network and the empirical realizations of the process $\overleftarrow{x}$ to estimate the discretization network $d_s : \hat{\mathcal{S}} \mapsto \bar{\mathcal{S}}$ with $|\bar{\mathcal{S}}| = |\mathcal{S}|$, and a new prediction network $\bar{\pi} : \bar{\mathcal{S}} \mapsto \hat{\mathcal{O}}$ that maps causal states into the next sufficient statistic of the measurements. We match the predictive behavior of the neural network between the original RNN and prediction function and the discretized states $\bar{s}$ and the corresponding prediction function by minimizing the knowledge distillation [35] loss:

$\min_{w_{\bar{\pi}}, w_d} \sum_t^T \mathcal{L}_d\left( (g_{w^*_g} \circ \pi_{w^*_\pi} \circ h_{w^*_h} \circ f_{w^*_f})(\overleftarrow{x}, \overleftarrow{a}, a_t),\ (g_{w^*_g} \circ \bar{\pi}_{w_{\bar{\pi}}} \circ d_{s,w_d} \circ h_{w^*_h} \circ f_{w^*_f})(\overleftarrow{x}, \overleftarrow{a}, a_t) \right) \quad (5.3)$
Minimizing Eq. 5.3 guarantees a sufficient discrete representation, and by adding the following regularizing terms we enforce the resulting states to be minimal and their transitions unifilar:

$\min_{|\bar{\mathcal{S}}|, w_u} |\bar{\mathcal{S}}| + \sum_t^T \mathcal{L}_u\left( P(\bar{S}_{t+1} \mid \bar{S}_t, \hat{o}_t, a_t),\ u_{w_u}(\bar{S}_t, \hat{o}_t, a_t) \right), \quad (5.4)$

where the variables $w_u$ parametrize the map $u : \bar{\mathcal{S}} \times \hat{\mathcal{O}} \times \mathcal{A} \mapsto \bar{\mathcal{S}}$ from the state $\bar{s}_t$, sufficient statistic $\hat{o}_t$, and action $a_t$ to the next causal state $\bar{s}_{t+1}$. When the loss function $\mathcal{L}_u$ is equal to 0, the transition can be fully predicted from the previous state and output, implying that the transitions are deterministic. The minimization of the cardinality of the causal state alphabet $|\bar{\mathcal{S}}|$ pushes the learned representation toward minimality. To summarize, we first minimize $\mathcal{L}_r + \mathcal{L}_e$ to obtain a neural model able to generate sufficient statistics of the future observables of the process, and
subsequently minimize $\mathcal{L}_d + \mathcal{L}_u + |\bar{\mathcal{S}}|$ to obtain a minimal sufficient representation of the dynamical system.

Figure 5.1: Graphical representation of the neural network model used to estimate the causal states of a latent decision process.
5.3 Environment Details and Results
For both the revealed and latent cases we approximate $h$ with a long short-term memory network [38] and $d_s$ with the vector quantization algorithm from [89]. We use MLPs of varying capacity to implement $f$, $g$, and $\pi$ for the revealed and latent cases. We minimize all losses using Adam [45], with the cross-entropy loss and reconstruction loss. A proper divergence should also be used for the high-dimensional measurements, but we found the reconstruction loss to work well in practice. In both the revealed and latent scenarios, we find it important to estimate the RNN and the MLP simultaneously and end-to-end.
In order to arbitrarily control the memory length $k$ and alphabet size $|\mathcal{O}|$ of the problem, we develop the artificial stochastic processes used for the experiments reported in Table 5.1. We construct the process such that the probability of the next output $o_t$ depends only on the value of $o_{t-k}$, by sampling from the multinomial distribution $P(O_t = o' \mid O_{t-k} = o') = p$ and $P(O_t = o' \mid O_{t-k} \neq o') = (1-p)/|\mathcal{O}|$ for all $o' \in \mathcal{O}$.
The central and right-most columns of Table 5.1 present the results of projecting the output of latent processes into high-dimensional measurements. The multivariate Gaussian measurements are constructed by taking a vector composed of $k$ blocks of size $|\mathcal{O}|$; the $i$-th block has mean $\mu_i = 4$ if $O_t = o_i$ or $\mu_i = 0$ if $O_t \neq o_i$. For MNIST we use a process alphabet of up to size $|\mathcal{O}| = 10$ and associate each symbol with an image category; the measurements $x_t$ are generated by first sampling the realization $o_t$ from the process and then sampling a random image from the corresponding MNIST category.
We introduce a binary action space $\mathcal{A} = \{0, 1\}$. The default action $A_t = 0$ gives

$p(O_t = i \mid a_t = 0) = \begin{cases} p & \text{if } o_{t-k} = i, \\ \frac{1-p}{|\mathcal{O}|} & \text{otherwise,} \end{cases}$

while for $A_t = 1$ we use the transition distribution of the neighboring class,

$p(O_t = i \mid A_t = 1) = \begin{cases} p & \text{if } o_{t-k} = i - 1, \\ \frac{1-p}{|\mathcal{O}|} & \text{otherwise.} \end{cases}$
The goal is to maximize the occurrence of $o_t = 1$ for $0 \leq t < \infty$. The environment returns a reward of 1 at time step $t$ if $o_t = 1$, and 0 otherwise. Each episode lasts 100 time steps. We again train on sequence data $\{o_0, a_0, o_1, a_1, \ldots, o_T, a_T\}$ generated with a random policy. We pass the actions into the RNN through a linear layer and concatenate the result with the observation embedding. A sketch of this environment appears below.
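The following sketch illustrates the environment under a random policy; the class name and gym-style interface are illustrative assumptions, not the exact code used in our experiments.

import numpy as np

class MemoryEnv:
    # Action 0 keeps the memory-k copy dynamics; action 1 copies the
    # neighboring class; reward 1 whenever the emitted observation is 1.
    def __init__(self, n_obs=4, k=2, p=0.8, horizon=100, seed=0):
        self.n, self.k, self.p, self.horizon = n_obs, k, p, horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.hist = list(self.rng.integers(self.n, size=self.k))
        return self.hist[-1]

    def step(self, a):
        target = self.hist[-self.k]
        if a == 1:
            target = (target + 1) % self.n
        o = target if self.rng.random() < self.p else int(self.rng.integers(self.n))
        self.hist.append(o)
        self.t += 1
        return o, float(o == 1), self.t >= self.horizon

env, total, done = MemoryEnv(), 0.0, False
o = env.reset()
while not done:                      # random-policy rollout for training data
    o, r, done = env.step(int(np.random.randint(2)))
    total += r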
For the case of rendering high-dimensional measurements with images, we use the full image dataset for sampling $X_t$, with the train and validation sets used for training and the test set for evaluation, demonstrating generalization in image classification through this method with no explicit training on class labels.
We compare representations with an end-to-end DQN [Mnih et al., 2013] trained on single observations $Y_t$, sequences $\overleftarrow{Y}$, continuous states $\hat{S}$, and discrete states $\bar{S}$. The difficulty of these environments can be increased by raising the class size $|\mathcal{Y}|$ and the memory $k$. In the future, we can also investigate learning a tabular policy on $\bar{S}_t$, which cannot be done under any continuous representation.
                 Discrete              Gaussian              MNIST
Method           |Y|,k=2   |Y|,k=4     |Y|,k=2   |Y|,k=4     |Y|,k=2   |Y|,k=4
DQN on Y         49.5      24.79       49.86     25.01       50.56     25.11
DQN on ←Y        72.72     54.98       73.61     54.59       66.02     36.27
DQN on S         73.0      57.38       75.40     56.43       74.21     55.69
DQN on S̄         71.84     57.37       71.13     54.44       74.72     51.18

Table 5.1: Results for the active setting: reward obtained with tabular and function-approximation versions of value iteration. Numbers in each cell correspond to the action spaces next id, switch source, switch mem. Models are trained on 500 episodes and evaluated on 100.
Chapter 6
Meta Intelligence
As introduced in Chapter 3, agents that are basic intelligent use their learning channel to configure their internal devices such that synchronization with the environment's local causal channel becomes feasible, producing in the time limit internal representations that completely compress the history of the environment at a specific location.

When the agent, either due to endogenous actions or exogenous forces, changes spatial locations (tasks) over time, the learning channel needs to find a configuration $\theta_k$ for its internal devices at each spatial location $k \in K$. Assuming a one-to-one correspondence between learned tasks (locations) and computing devices implies complete specialization of the agent's computational capabilities; on the contrary, assuming a single computational device able to synchronize to any location with a single configuration implies complete virtualization. Reasonable assumptions on the agent's computational capacity (e.g., bounded space/energy to store/activate devices) make the complete specialization assumption unfeasible. We focus therefore on the (partial) virtualization of computational devices, specifically on two properties:

1. Different information channels can be approximated by the same computing device.

2. The same information channel can be approximated by different computing devices.

The first property is at the foundation of the well-known algorithmic concept of multi-task learning [12, 13], where the same neural network is used to produce sufficient statistics for different tasks.
Figure 6.1: Graphical representation of the multi-task training procedure for a memoryless channel.
The second property is instead associated with the concept of knowledge distillation [9, 11, 35], where it is shown that pre-trained neural networks can be used as teachers for other neural networks with different architectures, in place of the ground-truth label.
6.0.1 Multi-task Learning
In multi-task learning the agent tries to be basic intelligent in every spatial location $k \in K$ using the same configuration (up to the output layer), $\theta_i = \theta_j\ \forall i, j \in K$:

$\min_{\theta_k} \sum_{k \in K} H\left(\overrightarrow{Z}^k \mid S_t = f(z^k_{0:t}; \theta_k), A_t\right) \quad (6.1)$

$\text{s.t.}\quad \theta_i = \theta_j\ \forall i, j \in K \quad (6.2)$
Remember that the value of $\theta_k$ is set by the learning channel $l = P(\theta^k_t \mid \overleftarrow{Z}^k)$, which uses the history of observations at location $k$. Since in multi-task learning the same device is used for all the tasks $k \in K$, the history $\overleftarrow{Z}^k$ of each task needs to be taken into consideration. Specifically, the learning channel needs to take as input the new history $\overleftarrow{z}_{iid}$, which is constructed by i.i.d. sampling of the local sub-histories $\overleftarrow{z}^k$ at each location. Since each local process $\overleftarrow{z}^k$ is stationary, this new process $\overleftarrow{z}_{iid}$, constructed by sampling each sub-process through a stationary distribution (uniformly at random), is also stationary and will therefore induce a stationary distribution over parameters $P(\theta_{iid} \mid \overleftarrow{Z}_{iid})$. A code sketch of this scheme follows.
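The following sketch illustrates the constraint of Eq. (6.2) with a shared trunk and per-task output heads, trained on i.i.d. task samples; all sizes and names are illustrative assumptions.

import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim, n_tasks, n_classes):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())  # theta_i = theta_j
        self.heads = nn.ModuleList(nn.Linear(64, n_classes) for _ in range(n_tasks))

    def forward(self, x, task_id):
        return self.heads[task_id](self.trunk(x))

net = MultiTaskNet(in_dim=10, n_tasks=3, n_classes=4)
x, y = torch.randn(32, 10), torch.randint(4, (32,))
task = int(torch.randint(3, (1,)))      # uniform i.i.d. task sampling (z_iid)
loss = nn.functional.cross_entropy(net(x, task), y)
loss.backward()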
6.0.2 Knowledge Distillation
Figure 6.2: Graphical representation of the knowledge distillation training procedure for a memoryless channel.

Knowledge Distillation (KD) is the procedure through which an internal device reaches a configuration $\theta_s$ by imitating the output of a second device with configuration $\theta_k$; that is, a student device synchronizes to a teacher device. This procedure is analogous to learning from the local causality channel but depends only on the internal dynamics. For practical purposes, a key difference between learning from the dynamics of ground-truth physical states and learning from another device is that empirical realizations only give information about what happened, while internal models carry additional information about what could happen.

In practice, knowledge distillation in neural networks is achieved by feeding both networks the same input and forcing the student network to match the teacher network's output with an adequate loss function (e.g., cross-entropy):
$\min_{\theta_s} \sum_{k \in K} H\left( g(z^k_{0:t}; \theta^k_t) \mid f(z^k_{0:t}; \theta^k_s), A_t \right) \quad (6.3)$
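In code, this can be sketched as follows, with placeholder architectures; the loss is the cross-entropy between the teacher's and student's output distributions.

import torch
import torch.nn.functional as F

teacher = torch.nn.Linear(10, 4)                      # pre-trained device (theta_t)
student = torch.nn.Sequential(torch.nn.Linear(10, 8), torch.nn.ReLU(),
                              torch.nn.Linear(8, 4))  # device being configured (theta_s)
x = torch.randn(32, 10)
with torch.no_grad():
    p_teacher = F.softmax(teacher(x), dim=1)          # teacher's soft targets
log_q = F.log_softmax(student(x), dim=1)
loss = -(p_teacher * log_q).sum(dim=1).mean()         # cross-entropy to the teacher
loss.backward()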
6.0.3 Catastrophic Forgetting and Long Term Memory
In practice, the limitation that the agent perceives a single spatial location $Z^k$ at a time makes the creation of the history $\overleftarrow{Z}_{iid}$ a daunting task:

1. Each location has to be observed for a time long enough to reveal interesting dynamics, that is, for a time longer than the Markov order of the channel at that location.

2. Not all tasks might be neighbors in the space geometry, making the generation of a process visiting each spatial location uniformly at random potentially impossible.

3. Non-uniformity and memory-full dynamics of the task-sampling process increase the amount of time necessary for a history of realizations $\overleftarrow{z}_{iid}$ to exhibit a stationary distribution across sub-processes' realizations $\overleftarrow{z}^k$.

When the memory of the learning channel is shorter than the mixing time of the sampling process, the resulting device configuration will be biased by the subset of transitions that the learning channel takes into consideration.

The interaction between the time necessary to observe the local dynamics of each location exhaustively and stationarily and the memory of the learning process intrinsically bounds the speed at which learning can happen in the system. A fast learning algorithm must have short memory and leads to configurations that are overly tuned to the most recent task, while a slow learning algorithm can have long memory and converge to parameters optimal across a broader set of tasks.
The above phenomenon has often been observed under the name of catastrophic forgetting [61] when training neural networks with stochastic gradient descent in the continual learning setting, that is, when tasks are learned one at a time, sequentially. Catastrophic forgetting is the consequence either of learning channels having finite memory or of the distribution across tasks not mixing at all. An example of the first case is when the task id changes after the agent has reached a certain score in the previous task; an example of the second case is given when task samples occur on a directed chain graph, as in most continual learning evaluation tasks. A direct consequence of catastrophic forgetting is that the configuration for old tasks is constantly forgotten, requiring re-learning every time an old task is faced again.

The complementary learning systems theory [60] suggests that mammals solve this problem through the use of two separate learning systems, the hippocampus and the cortex, with different learning rates and roles. The hippocampus quickly learns the local dynamics observed during the day, while the knowledge extracted is selectively consolidated in the cortex during the night, inducing a long-term learning process where local knowledge is accumulated over time.
6.0.4 Second order modelling of the internal model - Attention and
Hyper-networks
In the previous sections we assumed it is possible (and reasonable) to express all the required computing devices with the same set of parameters. As an example, consider two locations with identical histories $\overleftarrow{z}^i = \overleftarrow{z}^j$ but different conditional futures as a consequence of different local causal channels, $P(\overrightarrow{z}^i \mid \cdot) \neq P(\overrightarrow{z}^j \mid \cdot)$: they would be ambiguously modelled by a single set of parameters without additional information to disambiguate whether we are interested in task $i$ or $j$.

A possible solution to this limitation of multi-task learning is the concept of attention, an auxiliary process that dynamically modifies the information channel responsible for modelling the internal perceptual stream. This modification is generally implemented in neural networks via a second network that censors the inputs or outputs of each layer [40, 91], through dynamically generated noise [1], or by directly learning to output an input- or memory-based configuration [76, 26].
Chapter 7
Bottom Up Attention
7.1 Introduction
Neural networks with large output spaces exploit a common hidden representation that is fully shared across all categorization tasks. Humans instead rely on a densely connected substrate that is never fully activated: information is routed through specific pathways via inhibitory currents that focus processing on features relevant to the task at hand. This process allows structures like the visual cortex to play a role in multitudes of behaviors by attending each time to different information. This flexibility can either be induced from data-dependent properties (bottom-up attention) or from prior or contextual information (top-down attention). We leverage this analogy with human attention to build a neural network architecture that explicitly decomposes into structural parameters defining feature extractors and task-dependent functional parameters that define an attention policy over the shared feature maps. We enrich a Residual Network [29] with a bottom-up attention pathway and exploit the intuition that residual networks implicitly define a functional space that can be parametrized by an attention policy able to modulate the Residual Network's internal activations. In our results we see increased performance over state-of-the-art models for object classification on the benchmark dataset CIFAR-100 [47] with our attention mechanism. We build upon the Squeeze-and-Excitation block [Hu et al., 2018], and develop
Figure 7.1: Representation of the three types of residual blocks used. Residual blocks are the basic component of Deep Residual Networks, which are constructed by stacking multiple blocks consecutively. Left: the vanilla unit, which simply applies convolutional functions to the block input $x$ and sums the result with the original input. Center: the Squeeze-and-Excitation unit, which extracts the gist vector $\bar{f}(x)$ from the spatial representation $f(x)$ and uses it to produce the attention function $g(\bar{f}(x))$. Right: our proposed LSTM-bottom-up unit, which uses a recurrent neural network to create a hidden state $h$ able to accumulate information from the gist vectors $\bar{f}(x)$ of all previous layers.
a recurrent network [38] that follows the feedforward visual process layer by layer, where it: 1) integrates spatial activity into a single gist vector, 2) updates its internal state based on the incoming gist vector, and 3) uses the updated internal state to generate inhibitory and excitatory currents over the visual feature space generated at each block of the Residual Network. This process is repeated sequentially at each block of the feedforward network, allowing the attention network to accumulate information through the whole hierarchy.
7.2 Methods
Residual networks are composed of a sequence of residual units:

$x_i = x_{i-1} + f_i(x_{i-1}; \theta_i) \quad (7.1)$
At each residual unit, the input $x_{i-1}$ is combined with the residual function $f_i(x_{i-1}; \theta_i)$, defined by a sequence of two conv-batchnorm-relu blocks. The sequential application of this type of transformation gives rise to an architecture where each unit iteratively refines the original representation through the addition of new information. In the case of images, the $x_i$ are 3-d tensors (height × width × channels), which indicates that at the $i$-th step of processing the network stores a mid-level representation of the input over a spatial grid with feature vectors of length channels. The resolution of the spatial grid is reduced by a factor of two by averaging neighboring pixels at 1/3 and 2/3 of the feedforward process, gradually increasing the receptive field of each feature vector. The first change that we introduce is an improvement over the bottom-up attention mechanism called Squeeze-and-Excitation [Hu et al., 2018]. Squeeze-and-Excitation adds an attention function $g(\bar{x}) : \text{channels} \mapsto \text{channels}$ that maps a spatially averaged visual representation $\bar{x}$ to an attention policy over features. The attention policy is composed of a gating vector $g_i(\bar{x})$ that is multiplied at each location with the original representation tensor $x$. This procedure allows global information at the current layer to re-normalize local activity, a procedure akin to divisive normalization in the visual cortex. This results in the following residual block:

$x_i = x_{i-1} + f_i(x_{i-1}; \theta^c_i) \odot g_i(\bar{f}_i(x_i); \theta^a_i) \quad (7.2)$
To increase the representational capability of our attention policy, we implement $g_i(\cdot)$ as a recurrent function that is applied at each layer. The result is a modified residual block whose attention function is influenced by the global activity at every previous layer $\bar{f}_1(x_1), \ldots, \bar{f}_i(x_i)$, implementing the bottom-up attention mechanism:

$x_i = x_{i-1} + f_i(x_{i-1}; \theta^c_i) \odot g(\bar{f}_1(x_1), \ldots, \bar{f}_i(x_i); \theta^a_i) \quad (7.3)$
We implement the recurrent bottom-up network using a Long Short-Term Memory cell [38] because of its ability to handle dependencies across long sequences, such as the potential influence of early visual processing on attention policies at the decision-making stage. A graphical depiction of the three types of residual block is presented in Figure 7.1. A code sketch of the modified block follows.
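The following sketch shows one way to realize the block of Eq. (7.3) in PyTorch; the channel counts, LSTM size, and sigmoid gating are illustrative assumptions rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn

class LSTMAttnBlock(nn.Module):
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.f = nn.Sequential(                        # residual function f_i
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(), nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
        self.cell = nn.LSTMCell(channels, hidden)      # shared across blocks
        self.gate = nn.Linear(hidden, channels)

    def forward(self, x, state):
        r = self.f(x)
        gist = r.mean(dim=(2, 3))                      # squeeze: spatial average
        h, c = self.cell(gist, state)                  # accumulate earlier gists
        g = torch.sigmoid(self.gate(h))                # per-channel gating policy
        return x + r * g[:, :, None, None], (h, c)

block = LSTMAttnBlock(16)
x = torch.randn(2, 16, 32, 32)
state = (torch.zeros(2, 64), torch.zeros(2, 64))
x, state = block(x, state)   # the carried state informs the next block's gate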
7.3 Results
As can be seen in Table 7.1, the performance of the three types of models is tested on CIFAR-100 using Wide ResNets [99] of different widths as the basic architecture. The results follow our expectations, with a systematic improvement of SE over the baseline, and of our LSTM attention model over SE. We find that the parameter controlling model complexity (width) and our attention mechanism are independent in their contributions to model performance. We reach a test error on CIFAR-100 of 18.4% without the use of dropout, which is 0.45 points lower than the originally reported [99] dropout result of 18.85%.
CIFAR-100           Vanilla   Squeeze and Excitation   LSTM-Bottom Up
Wide ResNet 28-1    70.04%    70.40%                   70.72%
Wide ResNet 28-2    74.71%    75.42%                   76.43%
Wide ResNet 28-5    79.12%    80.17%                   80.25%
Wide ResNet 28-10   80.75%    81.12%                   81.60%

Table 7.1: Test set accuracy on CIFAR-100 for Wide Residual Networks with layer widths 1, 2, 5, and 10: without any attention mechanism, with the local attention mechanism Squeeze-and-Excitation, and with our proposed LSTM-bottom-up mechanism.
Chapter 8
Born Again Neural Networks
8.1 Abstract
Knowledge Distillation (KD) consists of transferring "knowledge" from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness without sacrificing too much performance. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes.
8.2 Introduction
In a 2001 paper on statistical modeling [8], Leo Breiman noted that different stochastic algorithmic procedures [Hansen & Salamon, 1990; Liaw & Wiener, 2002; Chen & Guestrin, 2016] can lead to diverse models with similar validation performances. Moreover, he noted that we can often compose these models into an ensemble that achieves predictive power superior to each of the constituent models. Interestingly, given such a powerful ensemble, one can often find a simpler model (no more complex than one of the ensemble's constituents) that mimics the ensemble and achieves its performance. Previously, Breiman and Shang (1996) pioneered this idea with Born-Again Trees, learning single trees that match the performance of multiple-tree predictors. These born-again trees approximate the ensemble decision but offer some desired properties of individual decision trees, such as their purported amenability to interpretation. A number of subsequent papers have proposed variations on the idea of born-again models. In the neural network community, similar ideas emerged in papers on model compression by Bucilua et al. (2006) and related work on knowledge distillation (KD) by Hinton et al. (2015). In both cases, the idea is typically to transfer the knowledge of a high-capacity teacher with desired high performance to a more compact student [4, 88, 73]. Although the student cannot match the teacher when trained directly on the data, the distillation process brings the student closer to matching the predictive power of the teacher.

We propose to revisit KD with the objective of disentangling the benefits of this training technique from its use in model compression. In experiments transferring knowledge from teachers to students of identical capacity, we make the surprising discovery that the students become the masters, outperforming their teachers by significant margins. In a manner reminiscent of Minsky's Sequence of Teaching Selves [65], we develop a simple re-training procedure: after the teacher model converges, we initialize a new student and train it with the dual goals of predicting the correct labels and matching the output distribution of the teacher. We call these students Born-Again Networks (BANs) and show that, applied to DenseNets, ResNets and LSTM-based sequence models,
Figure 8.1: Graphical representation of the BAN training procedure: during the first step, the teacher model T is trained from the labels Y. Then, at each consecutive step, a new identical model is initialized from a different random seed and trained under the supervision of the earlier generation. At the end of the procedure, additional gains can be achieved with an ensemble of multiple student generations.
BANs consistently have lower validation errors than their teachers. For DenseNets, we show that this procedure can be applied for multiple steps, albeit with diminishing returns.

We observe that the gradient induced by KD can be decomposed into two terms: a dark knowledge term, containing the information on the wrong outputs, and a ground-truth component which corresponds to a simple rescaling of the original gradient that would be obtained using the real labels. We interpret the second term as training from the real labels using importance weights for each sample based on the teacher's confidence in its maximum value. Experiments investigating the importance of each term are aimed at quantifying the contribution of dark knowledge to the success of KD.

Furthermore, we explore whether the objective function induced by the DenseNet teacher can be used to improve a simpler architecture like ResNet, bringing it close to state-of-the-art accuracy. We construct Wide ResNets [99] and Bottleneck ResNets [30] of comparable complexity to their teacher and show that these BANs-as-ResNets surpass their DenseNet teachers. Analogously, we train DenseNet students from Wide ResNet teachers, which drastically outperform standard ResNets. Thus, we demonstrate that weak masters can still improve the performance of students, and that KD need not be used with strong masters.
8.3 Related Literature
We briefly review the related literature on knowledge distillation and the models used in our experiments.
8.3.1 Knowledge Distillation
A long line of papers has sought to transfer knowledge between one model and another for various purposes. Sometimes the goal is compression: to produce a compact model that retains the accuracy of a larger model that takes up more space and/or requires more computation to make predictions [Bucilua et al., 2006; Hinton et al., 2015]. Breiman and Shang (1996) proposed compressing neural networks and multiple-tree predictors by approximating them with a single tree. More recently, others have proposed to transfer knowledge from neural networks by approximating them with simpler models like decision trees [Chandra et al., 2007] and generalized additive models [Tan et al., 2018] for the purpose of increasing transparency or interpretability. Further, Frosst and Hinton (2017) proposed distilling deep networks into decision trees for the purpose of explaining decisions. We note that in each of these cases, what precisely is meant by interpretability or transparency is often undeclared, and the topic remains fraught with ambiguity [Lipton, 2016].

Among papers seeking to compress models, the goal of knowledge transfer is simple: produce a student model that achieves better accuracy, by virtue of knowledge transfer from the teacher model, than it would if trained directly. This research is often motivated by the resource constraints of underpowered devices like cellphones and internet-of-things devices. In pioneering work, Bucilua et al. (2006) compress the information in an ensemble of neural networks into a single neural network. Subsequently, with modern deep learning tools, Ba and Caruana (2014) demonstrated a method to increase the accuracy of shallow neural networks by training them to mimic deep neural networks, penalizing the L2 norm of the difference between the student's and teacher's logits. In another recent work, Romero et al. (2014) aim to compress models by approximating the mappings
between teacher and student hidden layers, using linear projection layers to train relatively narrower students.

Interest in KD increased following Hinton et al. (2015), who demonstrated a method called dark knowledge, in which a student model trains with the objective of matching the full softmax distribution of the teacher model. One paper applying ML to Higgs Boson and supersymmetry detection made the (perhaps inevitable) leap of applying dark knowledge to the search for dark matter [Sadowski et al., 2015]. Urban et al. (2016) train a super teacher consisting of an ensemble of 16 convolutional neural networks and compress the learned function into shallow multilayer perceptrons containing 1, 2, 3, 4, and 5 layers. In a different approach, Zagoruyko and Komodakis (2016) force the student to match the attention map of the teacher (the norm across the channel dimension in each spatial location) at the end of each residual stage. Czarnecki et al. (2017) try to minimize the difference between teacher and student derivatives of the loss with respect to the input, in addition to minimizing the divergence from teacher predictions.

Interest in KD has also spread beyond supervised learning. In the deep reinforcement learning community, for example, Rusu et al. (2015) distill multiple DQN models into a single one. A number of recent papers [Furlanello et al., 2016; Li & Hoiem, 2016; Shin et al., 2017] employ KD for the purpose of minimizing forgetting in continual learning. [67] incorporate KD into an adversarial training scheme. Recently, Lopez-Paz et al. (2015) pointed out some connections between KD and the theory of learning with privileged information [Pechyony & Vapnik, 2010].

In superficially similar work to our own, Yim et al. (2017) propose applying KD from a DNN to another DNN of identical architecture, and report that the student model trains faster and achieves greater accuracy than the teacher. They employ a loss calculated as follows: for a number of pairs of layers $\{(i,j)\}$ of the same dimensionality, they (i) calculate a number of inner products $G_{i,j}(x)$ between the activation tensors at layers $i$ and $j$, and (ii) construct a loss that requires the student to match the statistics of these inner products to the corresponding statistics calculated on the teacher (for the same example), by minimizing $\|G^T_{i,j}(x) - G^S_{i,j}(x)\|^2_2$.
The authors exploit a statistic used by Gatys et al. (2015) to capture style similarity between images (given the same network).

Key differences. Our work differs from Yim et al. (2017) in several key ways. First, their novel loss function, while technically imaginative, is not demonstrated to outperform more standard KD techniques. Our work is the first, to our knowledge, to demonstrate that dark knowledge, applied for self-distillation, results in significant boosts in performance even without softening the logits. Indeed, when distilling to a model of identical architecture we achieve the current second-best performance on the CIFAR-100 dataset. Moreover, this paper offers empirical rigor, providing several experiments aimed at understanding the efficacy of self-distillation, and demonstrating that the technique is successful in domains other than images.
8.3.2 Residual and Densely Connected Neural Networks
First described in [29], deep residual networks employ design principles that are rapidly becoming ubiquitous among modern computer vision models. The ResNet passes representations through a sequence of consecutive residual blocks, each of which applies several sub-modules (denoted residual units), each of which consists of convolutions and skip-connections, interspersed with spatial down-sampling. Multiple extensions [30, 99, 95, 27] have been proposed, progressively increasing their accuracy on CIFAR-100 [47] and ImageNet [72]. Densely connected networks (DenseNets) [42] are a recently proposed variation where the summation operation at the end of each unit is substituted by a concatenation of the input and output of the unit.
8.4 Born-Again Networks
Consider the classical image classification setting where we have a training dataset consisting of tuples of images and labels $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and we are interested in fitting a function $f(x) : \mathcal{X} \mapsto \mathcal{Y}$ able to generalize to unseen data. Commonly, the mapping $f(x)$ is parametrized by a neural network $f(x; \theta_1)$ with parameters $\theta_1$ in some space $\Theta_1$. We learn the parameters via Empirical Risk Minimization (ERM), producing a resulting model $\theta^*_1$ that minimizes some loss function:

$\theta^*_1 = \arg\min_{\theta_1} \mathcal{L}(y, f(x; \theta_1)), \quad (8.1)$
typically optimized by some variant of Stochastic Gradient Descent (SGD).

Born-Again Networks (BANs) are based on the empirical finding, demonstrated in the knowledge distillation / model compression literature, that generalization error can be reduced by modifying the loss function. This should not be surprising: the most common such modifications are the classical regularization penalties which limit the complexity of the learned model. BANs instead exploit the idea, demonstrated in KD, that the information contained in a teacher model's output distribution $f(x; \theta^*_1)$ can provide a rich source of training signal, leading to a second solution $f(x; \theta^*_2)$, $\theta^*_2 \in \Theta_2$, with better generalization ability. We explore techniques to modify, substitute, or regularize the original loss function with a KD term based on the cross-entropy between the new model's outputs and the outputs of the original model:

$\mathcal{L}\left( f(x; \arg\min_{\theta_1} \mathcal{L}(y, f(x; \theta_1))),\ f(x; \theta_2) \right). \quad (8.2)$

Unlike the original works on KD, we address the case when the teacher and student networks have identical architectures. Additionally, we present experiments addressing the case when the teacher and student networks have similar capacity but different architectures. For example, we perform knowledge transfer from a DenseNet teacher to a ResNet student with a similar number of parameters.
8.4.1 Sequence of Teaching Selves Born-Again Networks Ensemble
Inspired by the impressive recent results of SGDR Wide-ResNet [58] and Coupled-DenseNet [20] ensembles on CIFAR-100, we apply BANs sequentially, with multiple generations of knowledge transfer. In each case, the $k$-th model is trained with knowledge transferred from the $(k-1)$-th student:

$\mathcal{L}\left( f(x; \arg\min_{\theta_{k-1}} \mathcal{L}(y, f(x; \theta_{k-1}))),\ f(x; \theta_k) \right). \quad (8.3)$

Finally, similarly to ensembling multiple snapshots [41] of SGD with restarts [58], we produce Born-Again Network Ensembles (BANE) by averaging the predictions of multiple generations of BANs:

$\hat{f}^k(x) = \frac{1}{k} \sum_{i=1}^{k} f(x; \theta^*_i). \quad (8.4)$
We find the improvements of the sequence to saturate, but we are able to produce significant gains through ensembling. A minimal sketch of the procedure follows.
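The sketch below illustrates the sequential BAN loop of Eq. (8.3) and the ensemble of Eq. (8.4), with a linear model standing in for the DenseNet/ResNet students; data, model, and hyper-parameters are illustrative placeholders.

import copy
import torch
import torch.nn.functional as F

x, y = torch.randn(64, 10), torch.randint(4, (64,))   # placeholder data
generations, teacher = [], None
for gen in range(3):
    student = torch.nn.Linear(10, 4)                  # fresh identical model
    opt = torch.optim.SGD(student.parameters(), lr=0.1)
    for _ in range(200):
        logits = student(x)
        loss = F.cross_entropy(logits, y)             # ground-truth term
        if teacher is not None:                       # KD term, Eq. (8.3)
            with torch.no_grad():
                p_t = F.softmax(teacher(x), dim=1)
            loss = loss - (p_t * F.log_softmax(logits, dim=1)).sum(1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    teacher = copy.deepcopy(student)
    generations.append(teacher)

# BANE, Eq. (8.4): average the generations' predicted distributions.
ensemble = torch.stack([F.softmax(m(x), 1) for m in generations]).mean(0)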
8.4.2 Dark Knowledge Under the Light
The authors of [35] suggest that the success of KD depends on the dark knowledge hidden in the distribution of logits of the wrong responses, which carries information on the similarity between output categories. Another plausible explanation might be found by comparing the gradients flowing through the output node corresponding to the correct class during distillation vs. normal supervised training. Note that, restricting attention to this gradient, knowledge distillation resembles importance weighting, where the weight corresponds to the teacher's confidence in the correct prediction.

The single-sample gradient of the cross-entropy between student logits $z_j$ and teacher logits $t_j$ with respect to the $i$-th output is given by:

$\frac{\partial \mathcal{L}_i}{\partial z_i} = q_i - p_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} - \frac{e^{t_i}}{\sum_{j=1}^{n} e^{t_j}}. \quad (8.5)$
When the target probability distribution corresponds to the ground-truth one-hot label $p_* = y_* = 1$, this reduces to:

$\frac{\partial \mathcal{L}_*}{\partial z_*} = q_* - y_* = \frac{e^{z_*}}{\sum_{j=1}^{n} e^{z_j}} - 1. \quad (8.6)$

When the loss is computed with respect to the complete teacher output, the student back-propagates the mean of the gradients with respect to correct and incorrect outputs across all the $b$ samples $s$ of the mini-batch (assuming without loss of generality that the $n$-th label is the ground-truth label $*$):

$\sum_{s=1}^{b} \sum_{i=1}^{n} \frac{\partial \mathcal{L}_{i,s}}{\partial z_{i,s}} = \sum_{s=1}^{b} (q_{*,s} - p_{*,s}) + \sum_{s=1}^{b} \sum_{i=1}^{n-1} (q_{i,s} - p_{i,s}), \quad (8.7)$

up to a rescaling factor $1/b$. The second term corresponds to the information incoming from all the wrong outputs, via dark knowledge. The first term corresponds to the gradient from the correct choice and can be rewritten as

$\frac{1}{b} \sum_{s=1}^{b} (q_{*,s} - p_{*,s}\, y_{*,s}), \quad (8.8)$

which allows the interpretation of the teacher's output $p_*$ as a weighting factor of the original ground-truth label $y_*$.

When the teacher is correct and confident in its output, i.e. $p_{*,s} \approx 1$, Eq. (8.8) reduces to the ground-truth gradient in Eq. (8.6), while samples with lower confidence have their gradients rescaled by a factor $p_{*,s}$ and have a reduced contribution to the overall training signal.
We notice that this form has a relationship with importance weighting of samples, where the gradient of each sample in a mini-batch is balanced based on its importance weight $w_s$. When the importance weights correspond to the output of the teacher for the correct dimension, we have:

$\sum_{s=1}^{b} \frac{w_s}{\sum_{u=1}^{b} w_u} (q_{*,s} - y_{*,s}) = \sum_{s=1}^{b} \frac{p_{*,s}}{\sum_{u=1}^{b} p_{*,u}} (q_{*,s} - y_{*,s}). \quad (8.9)$
So we ask the following question: does the success of dark knowledge owe to the information contained in the non-argmax outputs of the teacher? Or is dark knowledge simply performing a kind of importance weighting? To explore these questions, we develop two treatments. In the first treatment, Confidence Weighted by Teacher Max (CWTM), we weight each example in the student's loss function (standard cross-entropy with ground-truth labels) by the confidence of the teacher model on that example (even if the teacher is wrong). We train BAN models using an approximation of Eq. (8.9), where we substitute the correct answer $p_{*,s}$ with the max output of the teacher $\max p_{\cdot,s}$:

$\sum_{s=1}^{b} \frac{\max p_{\cdot,s}}{\sum_{u=1}^{b} \max p_{\cdot,u}} (q_{*,s} - y_{*,s}). \quad (8.10)$

In the second treatment, Dark Knowledge with Permuted Predictions (DKPP), we permute the non-argmax outputs of the teacher's predicted distribution. We use the original formulation of Eq. (8.7), substituting the $*$ operator with max and permuting the teacher dimensions of the dark knowledge term, leading to:

$\sum_{s=1}^{b} \sum_{i=1}^{n} \frac{\partial \mathcal{L}_{i,s}}{\partial z_{i,s}} = \sum_{s=1}^{b} (q_{*,s} - \max p_{\cdot,s}) + \sum_{s=1}^{b} \sum_{i=1}^{n-1} \left( q_{i,s} - \phi(p_{j,s}) \right), \quad (8.11)$

where $\phi(p_{j,s})$ are the permuted outputs of the teacher. In DKPP we scramble the correct attribution of dark knowledge to each non-argmax output dimension, destroying the pairwise similarities of the original output covariance matrix.
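Both treatments can be sketched as follows; the batch normalization of the weights follows Eq. (8.10), and the permutation scrambles only the non-argmax dimensions, as in Eq. (8.11). All tensors and names are illustrative placeholders.

import torch
import torch.nn.functional as F

def cwtm_loss(student_logits, teacher_probs, y):
    w = teacher_probs.max(dim=1).values               # teacher max confidence
    w = w / w.sum()                                   # normalize over the batch
    per_sample = F.cross_entropy(student_logits, y, reduction='none')
    return (w * per_sample).sum()

def dkpp_targets(teacher_probs, seed=0):
    g = torch.Generator().manual_seed(seed)
    p = teacher_probs.clone()
    for s in range(p.shape[0]):
        rest = torch.tensor([i for i in range(p.shape[1]) if i != int(p[s].argmax())])
        p[s, rest] = p[s, rest][torch.randperm(len(rest), generator=g)]
    return p

logits = torch.randn(8, 5, requires_grad=True)        # student logits
teacher_probs = F.softmax(torch.randn(8, 5), dim=1)   # teacher outputs
y = torch.randint(5, (8,))
loss_cwtm = cwtm_loss(logits, teacher_probs, y)                               # Eq. (8.10)
loss_dkpp = -(dkpp_targets(teacher_probs) * F.log_softmax(logits, 1)).sum(1).mean()
loss_cwtm.backward()                                  # one treatment at a time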
8.4.3 BANs Stability to Depth and Width Variations
DenseNet architectures are parametrized by depth, growth, and compression factors. Depth corresponds to the number of dense blocks. The growth factor defines how many new features are concatenated at each new dense block, while the compression factor controls by how much features are reduced at the end of each stage.

Variations in these hyper-parameters induce a tradeoff between the number of parameters, memory use, and the number of sequential operations per pass. We test the possibility of expressing the same function as the DenseNet teacher with different architectural hyper-parameters. In order to construct a fair comparison, we construct DenseNets whose output dimensionality at each spatial transition matches that of the DenseNet-90-60 teacher. Keeping the size of the hidden states constant, we modulate the growth factor indirectly via the choice of the number of blocks. Additionally, we can drastically reduce the growth factor by reducing the compression factor before or after each spatial transition.
8.4.4 DenseNets Born-Again as ResNets
Since BAN-DenseNets perform at the same level as plain DenseNets with multiples of their parameters, we test whether the BAN procedure can be used to improve ResNets as well. Instead of a weaker ResNet teacher, we employ a DenseNet-90-60 as the teacher and construct comparable ResNet students by switching Dense Blocks with Wide Residual Blocks and Bottleneck Residual Blocks.
8.5 Experiments
All experiments performed on CIFAR-100 use the same preprocessing and training settings as Wide-ResNet [99], except for mean-std normalization. The only forms of regularization used other than the KD loss are weight decay and, in the case of Wide-ResNet, dropout.
8.5.1 CIFAR-10/100
Baselines To get a strong teacher baseline without the prohibitive memory usage of the original
architectures, we explore multiple heights and growth factors for DenseNets. We nd a good
conguration in relatively shallower architectures with increased growth factor and comparable
number of parameters to the largest conguration of the original paper. Classical ResNet baselines
are trained following [99]. Finally, we construct Wide-ResNet and bottleneck-ResNet networks
that match the output shape of DenseNet-90-60 at each block, as baselines for our BAN-ResNet
with DenseNet teacher experiment.
BAN-DenseNet and ResNet We perform BAN re-training after convergence, using the same
training schedule originally used to train the teacher networks. We employ DenseNet-(116-33,
90-60, 80-80, 80-120) and train a sequence of BANs for each configuration. We test the ensemble performance for sequences of 2 and 3 BANs. We explored other forms of knowledge transfer for training BANs. Specifically, we tried progressively constraining the BANs to be more similar to their teachers, sharing the first and last layers between student and teacher, or adding losses
that penalize the L2 distance between student and teacher activations. However, we found these
variations to systematically perform slightly worse than the simple KD via cross entropy. For
BAN-ResNet experiments with a ResNet teacher we use Wide-ResNet-(28-1, 28-2, 28-5, 28-10).
BAN without Dark Knowledge In the first treatment, CWTM, we fully exclude the effect of all the teacher's output except for the argmax dimension. To do so, we train the students with
the normal label loss where samples are weighted by their importance. We interpret the max of
the teacher's output for each sample as the importance weight and use it to rescale each sample of
the student's loss.
In the second treatment, DKPP, we maintain the overall high-order moments of the teacher's output, but randomly permute each output dimension except the argmax one. We maintain the
rest of the training scheme and the architecture unchanged.
Both methods alter the covariance between outputs, such that any improvement cannot be
fully attributed to the classical dark knowledge interpretation.
Variations in Depth, Width and Compression Rate We also train variations of DenseNet-90-60, with an increased or decreased number of units in each block and different numbers of channels determined through a ratio of the original activation sizes.
BAN-ResNet with DenseNet teacher In all the BAN-ResNet with DenseNet teacher experiments, the student shares the first and last layers of the teacher. We modulate the complexity of the ResNet by changing the number of units, starting from the depth of the successful Wide-ResNet-28 [99] and reducing until only a single residual unit per block remains. Since the number of channels in each block is the same for every residual unit, we match it with a proportion of the corresponding dense block output after the 1×1 convolution, before the spatial down-sampling. We explore mostly architectures with a ratio of 1, but we also show the effect of halving the width of the network.
BAN-DenseNet with ResNet teacher With this experiment we test whether a weaker ResNet teacher is able to successfully train DenseNet-90-60 students. We use multiple configurations of Wide-ResNet teachers and train the BAN-DenseNet student with the same hyperparameters as the other DenseNet experiments.
8.5.2 Penn Tree Bank
To validate our method beyond computer vision applications, we also apply the BAN framework
to language models and evaluate it on the Penn Tree Bank (PTB) dataset [?] using the standard
train/test/validation split by [?]. We consider two BAN language models: a single-layer LSTM [38] with 1500 units [?] and a smaller model from [?] combining convolutional layers, highway layers, and a 2-layer LSTM (referred to as CNN-LSTM).
For the LSTM model we use weight tying [?], 65% dropout, and train for 40 epochs using SGD with a mini-batch size of 32. An adaptive learning rate schedule is used, with an initial learning rate of 1 that is multiplied by a factor of 0.25 if the validation perplexity does not decrease after an epoch.
The CNN-LSTM is trained with SGD for the same number of epochs with a mini-batch size
of 20. The initial learning rate is set to 2 and is multiplied by a factor of 0.5 if the validation
perplexity does not decrease by at least 0.5 after an epoch (this schedule slightly differs from [?],
but worked better for the teacher model in our experiments).
Both models are unrolled for 35 steps and the KD loss is simply applied between the softmax outputs of the unrolled teacher and student.
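As an illustration, both schedules can be expressed with the same validation-driven rule; this is a hedged sketch in plain Python (not our training scripts), up to the tie-breaking behavior at exactly the threshold.

def should_decay(prev_val_ppl, val_ppl, min_improvement=0.0):
    """True when validation perplexity failed to improve by more than
    `min_improvement`, triggering a learning-rate decay."""
    return not (val_ppl < prev_val_ppl - min_improvement)

# LSTM:     lr *= 0.25 if should_decay(prev, cur) else 1.0
# CNN-LSTM: lr *= 0.5  if should_decay(prev, cur, min_improvement=0.5) else 1.0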
8.6 Results
We report the surprising finding that by performing KD across models of similar architecture, BAN student models tend to improve over their teachers across all configurations.
8.6.1 CIFAR-10
As can be observed in Table 8.1, the CIFAR-10 test error is systematically lower or equal for both Wide-ResNet and DenseNet students trained from an identical teacher. It is worth noting how for BAN-DenseNet the gap between architectures of different complexity is quickly reduced, leading to implicit gains in the parameters-to-error-rate ratio.
Table 8.1: Test error on CIFAR-10 for Wide-ResNet with different depth and width and DenseNet of different depth and growth factor.
Network Parameters Teacher BAN
Wide-ResNet-28-1 0.38 M 6.69 6.64
Wide-ResNet-28-2 1.48 M 5.06 4.86
Wide-ResNet-28-5 9.16 M 4.13 4.03
Wide-ResNet-28-10 36 M 3.77 3.86
DenseNet-112-33 6.3 M 3.84 3.61
DenseNet-90-60 16.1 M 3.81 3.5
DenseNet-80-80 22.4 M 3.48 3.49
DenseNet-80-120 50.4 M 3.37 3.54
8.6.2 CIFAR-100
For CIFAR-100 we find stronger improvements for all BAN-DenseNet models. We therefore focus most of our experiments on exploring and understanding the born-again phenomenon on this dataset.
BAN-DenseNet and BAN-ResNet In Table 8.2 we report test error rates using both labels and teacher outputs (BAN+L) or only the latter (BAN). The improvement from fully removing the label supervision is systematic across configurations. It is worth noting that the smallest student, BAN-DenseNet-112-33, reaches an error of 16.95% with only 6.5 M parameters, comparable to the 16.87% error of the DenseNet-80-120 teacher with almost eight times more parameters.
In Table 8.3 all but one Wide-ResNet student improve over their identical teacher.
Sequence of Teaching Selves Training BANs for multiple generations leads to inconsistent but positive improvements that saturate after a few generations. The third generation of BAN-3-DenseNet-80-80 produces our single best model with 22M parameters, achieving 15.5% error on CIFAR-100 (Table 8.2). To our knowledge, this is currently the SOTA non-ensemble model without shake-shake regularization. It is only beaten by the ShakeDrop approach, whose authors use a pyramidal ResNet trained for 1800 epochs with a combination of shake-shake [23], pyramid-drop [96] and cut-out regularization [19].
Table 8.2: Test error on CIFAR-100. Left side: DenseNet of different depth and growth factor and respective BAN students. BAN models are trained only with the teacher loss, BAN+L with both label and teacher loss. CWTM are trained with sample-importance-weighted labels, where the importance of a sample is determined by the max of the teacher's output. DKPP are trained only from teacher outputs with all the dimensions but the argmax permuted. Right side: test error on CIFAR-100 for sequences of BAN-DenseNets, and the BAN-ensembles resulting from each sequence. Each BAN in the sequence is trained from cross-entropy with respect to the model at its left. BAN and BAN-1 models are trained from the teacher but have different random seeds. We include the teacher as a member of the ensemble for Ens*3 for 80-120 since we did not train a BAN-3 for this configuration.
Network Teacher BAN BAN+L CWTM DKPP BAN-1 BAN-2 BAN-3 Ens*2 Ens*3
DenseNet-112-33 18.25 16.95 17.68 17.84 17.84 17.61 17.22 16.59 15.77 15.68
DenseNet-90-60 17.69 16.69 16.93 17.42 17.43 16.62 16.44 16.72 15.39 15.74
DenseNet-80-80 17.16 16.36 16.5 17.16 16.84 16.26 16.30 15.5 15.46 15.14
DenseNet-80-120 16.87 16.00 16.41 17.12 16.34 16.13 16.13 / 15.13 14.9
Table 8.3: Test error on CIFAR-100 for Wide-ResNet students trained from identical Wide-
ResNet teachers and for DenseNet-90-60 students trained from Wide-ResNet teachers
Network Teacher BAN Dense-90-60
Wide-ResNet-28-1 30.05 29.43 24.93
Wide-ResNet-28-2 25.32 24.38 18.49
Wide-ResNet-28-5 20.88 20.93 17.52
Wide-ResNet-28-10 19.08 18.25 16.79
BAN-Ensemble Similarly, our largest ensemble, BAN-3-DenseNet-BC-80-120, with 150M parameters and an error of 14.9%, is the lowest reported ensemble result in the same setting.
BAN-3-DenseNet-112-33 is based on the building block of the best coupled-ensemble of [20] and reaches a single-model error of 16.59% with only 6.3M parameters; furthermore, the ensembles of two or three consecutive generations reach errors of 15.77% and 15.68%, comparable with the baseline error of 15.68% reported in [20], where four models were used.
Effect of non-argmax Logits As can be observed in the two rightmost columns of the left side of Table 8.2, we find that removing part of the dark knowledge still generally brings improvements to the training procedure with respect to the baseline. Importance weighting (CWTM) leads to weak
Table 8.4: Test error on CIFAR-100 for modified DenseNets: a DenseNet-90-60 is used as teacher, with students that share the same size of hidden states after each spatial transition but differ in depth and compression rate.

DenseNet-90-60   Teacher   0.5*Depth   2*Depth   3*Depth   4*Depth   0.5*Compr   0.75*Compr   1.5*Compr
Error            17.69     16.95       16.43     16.64     16.64     19.83       17.3         18.89
Parameters       22.4 M    21.2 M      13.7 M    12.9 M    12.6 M    5.1 M       10.1 M       80.5 M
improvements over the teacher in all models but the largest DenseNet. Instead, in DKPP we find a comparable but systematic improvement effect from permuting all but the argmax dimensions.
These results demonstrate that KD does not simply contribute information on each specific
non-correct output. DKPP demonstrates that the higher order moments of the output distribution
that are invariant to the permutation procedure still systematically contribute to improved
generalization. Furthermore, the complete removal of wrong logit information in the CWTM
treatment still brings improvements for three models out of four, suggesting that the information
contained in pre-trained models can be used to rebalance the training set, by giving less weight to
training samples for which the teacher's output distribution is not concentrated on the max.
DenseNet to modified DenseNet students It can be seen in Table 8.4 that DenseNet students are particularly robust to variations in the number of layers. The shallowest model, DenseNet-7-1-2, with only half the number of its teacher's layers, still improves over the DenseNet-90-60 teacher with an error rate of 16.95%. Deeper variations are competitive or even better than the original student. The best modified-student result is 16.43% error, with twice the number of layers (half the growth factor) of its DenseNet-90-60 teacher.
The biggest instabilities, as well as the largest parameter savings, are obtained by modifying the compression rate of the network, indirectly reducing the dimensionality of each hidden layer. Halving the number of filters after each spatial dimension reduction in DenseNet-14-0.5-1 gives an error of 19.83%, the worst across all trained DenseNets. Smaller reductions lead to larger parameter savings with lower accuracy losses, but directly choosing a smaller network retrained with the BAN procedure, like DenseNet-106-33, seems to lead to higher parameter efficiency.
Table 8.5: DenseNet to ResNet: CIFAR-100 test error for BAN-ResNets trained from a DenseNet-90-60 teacher with different numbers of blocks and compression factors. In all the BAN architectures, the number of units per block is indicated first, followed by the ratio of input and output channels with respect to a DenseNet-90-60 block. All BAN architectures share the first (conv1) and last (fc-output) layer with the teacher, which are frozen. Every dense block is effectively substituted by residual blocks.
DenseNet 90-60 Parameters Baseline BAN
Pre-activation ResNet-1001 10.2 M 22.71 /
BAN-Pre-ResNet-14-0.5 7.3 M 20.28 18.8
BAN-Pre-ResNet-14-1 17.7 M 18.84 17.39
BAN-Wide-ResNet-1-1 20.9 M 20.4 19.12
BAN-Match-Wide-ResNet-2-1 43.1 M 18.83 17.42
BAN-Wide-ResNet-4-0.5 24.3 M 19.63 17.13
BAN-Wide-ResNet-4-1 87.3 M 18.77 17.18
Table 8.6: Validation/Test perplexity on PTB (lower is better) for BAN-LSTM language models of different complexity.
Network Parameters Teacher Val BAN+L Val Teacher Test BAN+L Test
ConvLSTM 19M 83.69 80.27 80.05 76.97
LSTM 52M 75.11 71.19 71.87 68.56
DenseNet Teacher to ResNet Student Surprisingly, we find (Table 8.5) that our Wide-ResNet and Pre-ResNet students that match the output shapes at each stage of their DenseNet teachers tend to outperform classical ResNets, their teachers, and their baselines.
Both BAN-Pre-ResNet with 14 blocks per stage and BAN-Wide-ResNet with 4 blocks per
stage and 50% compression factor reach respectively a test error of 17.39% and 17.13% using a
parameter budget that is comparable with their teachers'. We find that for BAN-Wide-ResNets,
only limiting the number of blocks to 1 per stage leads to inferior performance compared to the
teacher.
Similar to how adapting the depth of the models offers a tradeoff between memory consumption and the number of sequential operations, exchanging dense and residual blocks allows choosing between concatenations and additions. By using additions, ResNets overwrite old memory banks, saving RAM, at the cost of heavier models that do not share layers, offering another technical tradeoff to choose from.
ResNet Teacher to DenseNet Students The converse experiment, training a DenseNet-90-60 student from a ResNet teacher, confirms the trend of students surpassing their teachers. The improvement from ResNet to DenseNet (Table 8.3, right-most column) over simple label supervision is significant, as indicated by the 16.79% error of the DenseNet-90-60 student trained from the Wide-ResNet-28-10.
8.6.3 Penn Tree Bank
Although we did not use the state-of-the-art bag of tricks [62] for training LSTMs, nor the
recently proposed improvements on KD for sequence models [44], we found significant decreases in
perplexity on both validation and testing set for our benchmark language models. The smaller
BAN-LSTM-CNN model decreases test perplexity from 80.05 to 76.97, while the bigger BAN-LSTM
model improves from 71.87 to 68.56. Unlike the CNNs trained for CIFAR classication, we nd
that LSTM models work only when trained with a combination of teacher outputs and label
loss (BAN+L). One potential explanation for this finding might be that teachers generally reach
100% accuracy on the CIFAR training sets while the PTB training perplexity is far from being
minimized.
8.7 Discussion
In Marvin Minsky's Society of Mind [65], the analysis of human development led to the idea of a
sequence of teaching selves. Minsky suggested that sudden spurts in intelligence during childhood
may be due to longer and hidden training of new "student" models under the guidance of the
older self. Minsky concluded that our perception of a long-term self is constructed by an ensemble
of multiple generations of internal models, which we can use for guidance when the most current
model falls short. Our results show several instances where such transfer was successful in artificial neural networks.
Chapter 9
Active Long Term Memory Networks
9.1 Abstract
Continual Learning in artificial neural networks suffers from interference and forgetting when different tasks are learned sequentially. This chapter introduces the Active Long Term Memory Networks (A-LTM), a model of sequential multi-task deep learning that is able to maintain previously learned associations between sensory input and behavioral output while acquiring new knowledge. A-LTM exploits the non-convex nature of deep neural networks and actively maintains knowledge of previously learned, inactive tasks using a distillation loss [35]. Distortions of the learned input-output map are penalized, but hidden layers are free to traverse towards new local optima that are more favorable for the multi-task objective. We re-frame McClelland's seminal hippocampal theory [60] with respect to the Catastrophic Interference (CI) behavior exhibited by modern deep architectures trained with back-propagation and inhomogeneous sampling of latent factors across epochs. We present empirical results of non-trivial CI during continual learning in Deep Linear Networks trained on the same task, in Convolutional Neural Networks when the task shifts from predicting semantic to graphical factors, and during domain adaptation from simple to complex environments. We present results on the A-LTM model's ability to maintain viewpoint recognition learned in the highly controlled iLab-20M [2] dataset, with 10 object categories and 88 camera viewpoints, while adapting to the unstructured domain of Imagenet [72] with 1,000 object categories.
9.2 Introduction
The recent interest in bridging human and machine computations [66, 54, 51, 49] obliges us to consider a learning framework that is continual, sequential in nature, and potentially lifelong [63, 87]. Such a learning framework is therefore prone to interference and forgetting. On the positive side, intrinsic correlations between multiple tasks and datasets allow training deep learning architectures that can make use of multiple data and supervision sources to achieve better generalization [98]. The favorable effect of multi-task learning [12, 13] depends on the shared parametrization of the individual functions and the simultaneous estimation and averaging of multiple losses. When trained simultaneously, shared layers are obliged to learn a common representation, effectively cross-regularizing each task. Generally, in the sequential estimation case, the most recent task benefits from an inductive bias [70, 98] while older tasks become distorted [61, 22, 25] by unconstrained back-propagation [32, 52] of errors in shared parameters, a problem identified as Catastrophic Interference (CI) between tasks [61].
In humans the dyadic interaction between Hippocampus and Neocortex [60] is thought to
mitigate the problem of CI by carefully balancing the sensitivity-stability [31] trade-off such that
new experiences can be integrated without exposing the system to risk of abrupt phase transitions.
Think of the classical example of a child exploring new objects through structured play [69], who samples from multiple points of view and lighting directions and generates movement of the object in space. This exploration creates inputs for the perceptual systems that span homogeneously all the underlying variation in viewing parameters, and constructs general-purpose graphical, categorical, or semantic-level representations of its perception. Such representations can be used in the future to actively regularize experience in environments where exploration is constrained or costly.
We conjecture that CI arises because during the learning lifespan of a system the distribution
of locally observed states [17, 79] in the environment is non-stationary and potentially chaotic,
while the neural representation of the environment with respect to its mode of variation as
latent graphical [50, 7] and semantic [28, 97] factors must be stable and slowly evolving to store
non-transient knowledge.
Early experiments on neural networks' ability to learn the meso-scale structure [36] of their training environment required interleaved exposure to the different semantic variations. This heuristic is respected in modern architectures for object recognition [48], trained with stochastic mini-batches with uniform, and possibly alternated, rich sampling of categories [53]. Analogously, data augmentation regularizes the distribution of graphical factors. We hypothesize that the
inability of CNNs to learn object categorization with strongly correlated batches is caused by
interferences across the vast number of categories and latent graphical factors that must be
memorized.
Similarly, the outstanding success of Deep-Q-Networks (DQN) [66, 82] can partially be found
in their intrinsic replay system [55] that augments learning batches with state-action transitions
that have not been observed in a long time. This procedure allows the creation of representations that DQNs traverse quasi-hierarchically [100] during play. It is difficult to imagine how these
For the simplified case of Deep Linear Networks, it is possible to obtain an exact analytical treatment of the network's learning time as a function of input-output statistics and network depth [75, 74]. This suggests that phase transitions typical of catastrophic interference might appear when the mixing time of the data-generating process is longer than the learning time of the system, obliging the neural network to track the local dependencies between factors of variation instead of learning to represent the data-generating process in the completeness of its ergodic state.
In this chapter, we develop the Active Long Term Memory networks, inspired by the Hippocampus-Neocortex duality and based on the knowledge distillation [35] framework. Our model is composed of a stable component containing the long-term task memory and a flexible module that is initialized from the stable component and faces the new environment. We capitalize on the human infant metaphor and show that it is possible to maintain the ability to predict the viewpoint of an object while adapting to a new domain with more images, object categories, and viewing conditions than the original training domain. Moreover, we discuss the necessity of storing and replaying input from the old domain to fully maintain the original task.
9.3 Related Work
With a non-convex system storing knowledge in a changing environment, it is important to understand what knowledge is synthesized in a neural network. The general intuition is that for hierarchical models with thousands of intermediate representations and millions of parameters it is difficult to identify the contained knowledge with respect to its parameter values across all layers. Without any guarantee of being in a unique global optimum, multiple configurations of weight parameters could sustain the same input-output map, making the association between parameters and knowledge vague.
The Knowledge Distillation (KD) framework [?, 35] introduces the concept of model compression
to transfer the knowledge of a computationally expensive ensemble into a single, easy to deploy,
model using the prediction of the complex model as supervision for the compressed one. In the
born-again-tree paradigm, Leo Breiman [9] proposed to use one tree model to predict outputs of
random forest ensembles for better interpretability. In the KD framework, knowledge in neural
networks is therefore identified with the input-output map, without regard to its parametrization. Transferring knowledge between two neural architectures ("from teacher to student") therefore corresponds to supervised training of the student network using the logits of the original teacher network, or to matching the soft probabilities that they induce.
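As a concrete illustration of these two variants, the following PyTorch sketch (our own shorthand, not code from [35]) implements logit regression and soft-probability matching at temperature T.

import torch
import torch.nn.functional as F

def logit_matching_loss(student_logits, teacher_logits):
    """Regress the student's logits onto the teacher's."""
    return F.mse_loss(student_logits, teacher_logits)

def soft_probability_loss(student_logits, teacher_logits, T=1.0):
    """Match the softened probability distributions at temperature T."""
    log_q = F.log_softmax(student_logits / T, dim=1)
    p = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_q, p, reduction='batchmean') * (T ** 2)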
This framework has received a lot of recent interest and has been extended to what is called Generalized Distillation [57], incorporating some theoretical results of Vapnik's privileged information algorithm [90]. Recently, large-scale experiments [4, 24, 88] have been carried out on the ability to compress fully connected networks into shallow models, or convolutional models into Long Short Term Memory [38] networks and vice-versa.
The introduction of double-stream architectures inspired by the classical Siamese Network [10] for metric learning, and the consequent generalization of the multi-task framework to domain adaptation [56, 71], offers a natural extension of KD, where distillation happens between streams made to handle different data or supervision sources, but with a shared parametrization.
The case for a strong effect of CI during sequential learning in deep neural networks has been shown for semantic [25] and graphical [2] factors, respectively. In [50] the authors are able to estimate an encoder-decoder model with correlated mini-batches using interleaved learning with carefully selected factor ratios and ad-hoc clamping of the neurons' learning rates. In [73, 68] multiple Atari games are learned with interleaved distillation across games; the correct ratio between batch size and interleaving was again carefully curated and crucial for the algorithm's success. The authors of [73] also present a novel self-distillation framework reminiscent of the Minskian Sequence of Teaching Selves [64].
9.4 The A-LTM Model
We approach the problem of learning with a sequence of input-outputs that exhibits transitions in its latent factors using a dual system. Our model is inspired by the seminal theory of McClelland on the dyadic role of hippocampus and neocortex in preserving long-term memory and avoiding catastrophic interference in mammals [60].
The first A-LTM component is a mature and stable network, Neocortex (N), which is trained during a development phase in a homogeneous environment rich in supervision sources. To prevent the interference of new experience with previously stored functions, the N network's learning rate is drastically reduced during post-developmental phases, in imitation of the visuo-cortical critical period of plasticity [94].
The second component in the A-LTM is a flexible network, Hippocampus (H), which is subject to a general unstructured environment. H's weights are initialized from N, and its learning dynamics are actively regularized by N's output activity.
This dual mechanism allows maintaining stability in N without ignoring new inputs. H can quickly adapt to new information in a non-stationary environment without generating a risk for the integrity of knowledge already stored in N. Furthermore, long-term information in N is actively distilled into H, with the effect of constraining the ongoing gradient descent dynamics in H towards a better local minimum able to sustain both new and old knowledge. Operationally:
1. During development: N is trained in a controlled environment where multiple examples of the same object and of its potential graphical transformations are present. We train N with a multi-task objective to predict both semantic (category) and graphical (camera-viewpoint) labels of the object. After convergence, the learning rate of N is set to 0.
2. During maturity: H is initialized from N and faces a novel environment, where objects are typically available from a single perspective and the number of categories is increased by two orders of magnitude (from 10 to 1,000 classes). H is trained with a multi-task objective to predict both the new higher-dimensional semantic task and the output of N with respect to the developmental tasks, in this case N's ability to differentiate between different points of view of the same object.
A-LTM networks rely on the core idea that all the tasks to memorize need simultaneous experience to find a multi-task optimum that is a critical point for all the objectives. Therefore, if the environment has unstable input-output statistics because of missing labels, H has to rely on predictions from N. In the complementary case, where instabilities are generated by changes in the distribution of inputs, an auxiliary replay mechanism is also necessary.
A Sequential Environment
We study the situation where an agent interacts with a sequential environment, defined by the joint distribution $P(y, x)$ of visual stimuli $x \in X$ and their binary latent factors $y \in Y$. The agent receives information through a perceptual mechanism $\sigma(x): X \mapsto S$ and makes decisions based on a hierarchical representation of its percept $\phi(s): S \mapsto \phi_d$. The agent's goal is to name the underlying latent factors for each stimulus. Actions are chosen by the agent with an n-way soft-max that transforms $\phi_d(s)$, the last layer of the hierarchical representation of sensory inputs, into a probability distribution over actions.
The environment responds to actions with a supervised signal $y \in Y$ informing the agent of the correct action given a particular stimulus. The hierarchical representation is updated in order to minimize the cross-entropy loss $\mathcal{L}(\phi_d(s), y)$. This task is computationally tedious because of the vast range of possible transformations of the sensory inputs $\tau: S \mapsto S$ that do not have any meaningful consequence on the semantic nature of the stimulus. We call these transformations latent graphical factors. Perceptual inputs to our system $s$ therefore have two modes of variation:
1. Variations in semantic factors that alter the category of stimulus $x$ and its percept $s$, parametrized by the subcomponent $y_s \in y$.
2. Variations in graphical factors that are invariant with respect to the category of $x$ but alter its percept $s$, parametrized by the subcomponent $y_g \in y$.
Catastrophic interference happens when the distribution of supervised signals $P(y, x)$ is not homogeneous. While modeling the environment's non-stationarity in the language of stochastic processes could lead to interesting insights, we limit ourselves to the simple regime with a single discrete transition from $P_1(y_1, x_1)$ to $P_2(y_2, x_2)$.
Bridging Sequential and Multi-Task Learning via Knowledge Distillation
Let the multi-task function representing the input-output maps of the network be $f(w_0, w_1, w_2, x): X \mapsto Y$, where $w_0$ are shared parameters and $w_1, w_2$ the task-specific parameters defining a map from the common representation to the individual tasks $y_1, y_2$. Sequential learning corresponds to solving, in this sequence, the following two optimization problems. First, the minimization of the cross-entropy loss $\mathcal{L}(f(w_0, w_1, x_1), y_1)$ between the environment data-generating process $P_1(y_1 | x_1)$ and the softmax probability distribution induced over the task-1 predictions $f(w_0, w_1, x_1)$, with Gaussian initializations $w_0^0$ and $w_1^0$:
\[
\min_{w_0, w_1} \mathcal{L}(f(w_0, w_1, x_1), y_1) \quad \text{s.t.} \quad w_0^0 \sim \mathcal{N}(0, \sigma), \; w_1^0 \sim \mathcal{N}(0, \sigma) \tag{9.1}
\]
with solutions $w_0^*$, $w_1^*$. This is followed by the analogous problem for the second environment, with starting condition equal to the solution of the previous problem for the shared parameters $w_0^*$:
\[
\min_{w_0, w_2} \mathcal{L}(f(w_0, w_2, x_2), y_2) \quad \text{s.t.} \quad w_0^0 = w_0^*, \; w_2^0 \sim \mathcal{N}(0, \sigma) \tag{9.2}
\]
with solutions $w_0^{**}$, $w_2^*$, making $f(w_0^{**}, w_1^*, x_1)$ an unlikely critical point of problem 9.1. Multi-task learning mitigates the problem of CI by averaging weight updates across the different objectives and corresponds to solving 9.1 and 9.2 simultaneously, with an omitted scaling factor between the two objectives:
\[
\min_{w_0, w_1, w_2} \mathcal{L}(f(w_0, w_1, x_1), y_1) + \mathcal{L}(f(w_0, w_2, x_2), y_2) \quad \text{s.t.} \quad w_0^0 \sim \mathcal{N}(0, \sigma), \; w_1^0 \sim \mathcal{N}(0, \sigma), \; w_2^0 \sim \mathcal{N}(0, \sigma) \tag{9.3}
\]
The drawback of this approach is that $(y_1, x_1, y_2, x_2)$ must be available to the network during the whole training phase.
In the absence of labels $y_1$ for the old task, Knowledge Distillation can be used as a surrogate.
In A-LTM, the stable module N trained on problem 9.1 can be used to hallucinate the missing labels. This way, the new learning phase of H can be recast in the multi-task framework even in the absence of external supervision, solving the following:
\[
\min_{w_0, w_1, w_2} \mathcal{L}(f(w_0, w_1, x_1), f(w_0^*, w_1^*, x_1)) + \mathcal{L}(f(w_0, w_2, x_2), y_2) \quad \text{s.t.} \quad w_0^0 = w_0^*, \; w_1^0 = w_1^*, \; w_2^0 \sim \mathcal{N}(0, \sigma) \tag{9.4}
\]
With respect to the availability of inputs, if they belong to the same modality and share the same distribution of graphical factors of variation, the active stream of perception can be used to train both active and inactive tasks in the flexible module. Otherwise, either inputs have to be stored, or the network must rely on some generative mechanism to produce imaginary samples for the non-ongoing task. In Eq. 9.4 we assume the presence of a replay mechanism allowing the A-LTM network to have access to both $x_1$ and $x_2$ during training. We relax this assumption in the experiments and also present results for the problem with objective function $\mathcal{L}(f(w_0, w_1, x_2), f(w_0^*, w_1^*, x_2)) + \mathcal{L}(f(w_0, w_2, x_2), y_2)$.
Beyond Active Maintenance, Memory Consolidation
While in this chapter we focus on the early phase of memory maintenance, a successive phase called memory consolidation, where knowledge is distilled from H back into N, is necessary for non-active long-term memory.
We interpret the phase of learning that we model as the step right before consolidation. Irreparable damage to H would create a memory loss equivalent to temporally graded retrograde amnesia. In case of bilateral lesions [77] to the hippocampus, previously known information stored in N would be safe, but recent adaptations to newly known environments, stored only in H, would be completely lost.
9.5 Experiments
As a first illustration of the Active Long Term Memory framework, we conduct a sequence of experiments on sequential, multi-task and A-LTM learning. We employ three datasets in experiments of increasing complexity. We first show that interference emerges in the trivial case of sequential learning of the same function in deep linear networks. We then analyze the situation where complete forgetting happens during sequential learning of semantic and graphical factors of variation. Finally, we demonstrate the ability of A-LTM to preserve memory during domain adaptation with and without access to replays of the previous environment. The datasets used are:
Synthetic Hierarchical Features [74]: we employ a procedure based on independent sampling from a branching diffusion process, previously used to study Deep Linear Network (DLN) learning dynamics, to generate a synthetic dataset with a hierarchical structure similar to the taxonomy of categories typical of the domain of living things.
iLab20M [2]: in this highly controlled dataset, images of toy vehicles are collected on a turntable with multiple cameras and different lighting directions. iLab totals 704 objects from 15 categories, 8 rotation angles, 11 cameras, 5 lighting conditions, and multiple focus and background settings, for a complete dataset of over 22M images. Each object has 1,320 images per background. Because of the highly controlled nature of the dataset, we reach high accuracy with a random subset of 850k images from 10 categories and 11*8 camera positions: thus, we do not exploit variations in all the other factors, for a matter of simplicity. iLab20M is freely available.
Imagenet 2010 [72]: an image database with 1,000 categories and over 1.3M images in the training set. Imagenet is the most used dataset for large-scale object recognition; since Alexnet [48] won the 2012 edition, only deep learning models have been on the leaderboard of the annual ILSVRC competition.
Except for the DLN experiments, we train an Alexnet-like architecture with drop-out using stochastic gradient descent, without any form of data augmentation, on a central 256-by-256 crop of the images. We train until convergence, 1 to 3 epochs, on iLab20M and for 10 epochs on Imagenet. During multi-domain and A-LTM training, the loss function associated with iLab is scaled by a factor of 0.1, as larger factors tend to get the Imagenet training stuck in local minima. Knowledge distillation is implemented using an $\ell_2$ loss between the decision layers of the N and H networks for both iLab categories and viewpoints; we adopted the $\ell_2$ loss instead of cross-entropy to avoid tuning an additional parameter for the teacher soft-max temperature. All of the experiments are implemented in Matconvnet [92] and run on three Nvidia GPUs, two Titan X and one Tesla K-40.
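For concreteness, the following PyTorch sketch (an illustration, not the Matconvnet implementation) shows the resulting A-LTM objective: cross-entropy on the new Imagenet task plus the $\ell_2$ distillation terms tying H's iLab decision layers to N's outputs, with the 0.1 scaling mentioned above; the head names are hypothetical.

import torch
import torch.nn.functional as F

def altm_loss(h_imagenet_logits, imagenet_labels,
              h_category_logits, h_viewpoint_logits,
              n_category_logits, n_viewpoint_logits,
              distill_scale=0.1):
    # New-task supervision from the environment.
    new_task = F.cross_entropy(h_imagenet_logits, imagenet_labels)
    # Old-task supervision hallucinated by the frozen stable network N.
    old_tasks = (F.mse_loss(h_category_logits, n_category_logits.detach()) +
                 F.mse_loss(h_viewpoint_logits, n_viewpoint_logits.detach()))
    return new_task + distill_scale * old_tasks

In the naive variant, N's targets are computed on the current Imagenet stream; in the replay variant, both networks are also fed stored iLab images.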
9.5.1 Catastrophic Interference in Deep Linear Networks
We first study the aforementioned discrete transition in Deep Linear Networks [75] for the trivial case of sequentially learning the same function. DLNs exhibit interesting non-linear dynamics typical of their non-linear counterparts, but are amenable to interpretation. Their optimization problem is non-convex and, since their deep structure can be transformed into a shallow linear map by simple factorization of layers, each deep local minimum can be compared to the shallow global optimum.
In our experiments, the network architecture has an input layer, a single hidden layer, and two
output layers, one for each task. We examine the trivial situation where Task A is equivalent to
Task B. In this situation, a multi-task solution obviously corresponds to the network re-learning the hidden-to-output layer without any modification of the input-to-hidden layer. Alternatively,
multiple solutions that keep the input-output relationship constant but modify the weights of the input-hidden layer, therefore requiring an adjustment of the hidden-output weights, are possible.

Figure 9.1: Catastrophic Interference experiments with Deep Linear Networks, where Task A is equivalent to Task B. (a) Sequential Learning: Task A → B. (b) Multitask Learning: Task A → A+B. (c) Interleaved Learning: Task A → B → A → B. (d) Zoom of (c): Catastrophic Interference.
In Fig. 9.1a, we first train the network on Task A until convergence and then begin training on the identical Task B; we find that back-propagation systematically affects the input-hidden layer, generating interference with the original network. In Fig. 9.1b, multi-task learning does not have the same problem. In Figs. 9.1c-9.1d we instead alternate tasks every 50 epochs, generating reciprocal interference of decreasing intensity that delays convergence by more than 100 epochs with respect to the multi-task case.
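An illustrative toy version of this experiment (not the code used for Figure 9.1) can be written in a few lines of PyTorch: one shared linear layer and two heads for the identical tasks, trained one after the other.

import torch

torch.manual_seed(0)
X = torch.randn(256, 20)
Y = X @ torch.randn(20, 5)          # Task A == Task B

hidden = torch.nn.Linear(20, 10, bias=False)   # shared layer
head_a = torch.nn.Linear(10, 5, bias=False)
head_b = torch.nn.Linear(10, 5, bias=False)

def fit(head, steps=2000, lr=1e-2):
    opt = torch.optim.SGD(list(hidden.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((head(hidden(X)) - Y) ** 2).mean()
        loss.backward()
        opt.step()

fit(head_a)
w0 = hidden.weight.detach().clone()
err_a_before = ((head_a(hidden(X)) - Y) ** 2).mean()
fit(head_b)                          # training B moves the shared layer...
drift = (hidden.weight - w0).norm()
err_a_after = ((head_a(hidden(X)) - Y) ** 2).mean()
print(drift.item(), err_a_before.item(), err_a_after.item())
# ...so Task A's error typically increases even though B is identical to A.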
Figure 9.2: (a) Example of the turntable setup and of the multiple camera viewpoints, lighting directions and objects from iLab20M, figure from [2]. (b) Test set accuracy for the experiment of Section 9.5.2: ST for single-task models, → indicates sequential learning, MT for multi-task learning. The effects of Catastrophic Interference are in bold.

                 Categories   Viewpoints
ST - Category    0.81         /
ST - Viewpoint   /            0.94
Cat → View       0.12         0.93
View → Cat       0.84         0.02
MT - Cat/View    0.85         0.91
9.5.2 Sequential and Multi-task Learning of Orthogonal Factors on
iLab20M
Because of its parametrized nature, iLab20M is a good environment for the early training phase of
models that can be taught to incorporate the ability to predict multiple semantic (like categories
and identities) and graphical factors (like luminance or viewpoint).
We train single-task architectures to classify either between 10 categories (ST-Category) or 88 viewpoints (ST-Viewpoint) and use their weights at convergence to warm up a network for the complementary task. We compare the inductive bias and Catastrophic Interference effects to the single and multi-task solutions.
The results in table (b) of Figure 9.2 confirm the inductive bias of viewpoints over categories and the dramatic interference of sequential training between the two tasks. Finally, as expected, multi-task learning can easily learn both tasks, incorporating the inductive bias from viewpoints to categories.
9.5.3 Sequential, Multi-Task and A-LTM Domain Adaptation over Imagenet
Imagenet is a popular benchmark for state-of-the-art visual recognition architectures because of its 1,000 categories and over 1M images. We compare the adaptation performance to Imagenet of multiple architectures: a single-task network with Gaussian initialization (ST-Gauss), sequential transfer from the iLab20M multi-task network (ST-iLab), and multi-domain networks trained simultaneously over the iLab and Imagenet tasks, either initialized randomly (MD-Gauss) or from the iLab20M multi-task network (MD-iLab).
As seen in Table 9.1, we confirm the general intuition developed in the previous experiment. Initializing from iLab has a positive effect on the transferred performance; moreover, unconstrained adaptation completely wipes out the ability of the network to perform viewpoint identification. Multi-Task/Domain learning is able to reach the same performance on Imagenet while maintaining the viewpoint detection task almost completely.
A-LTM: Domain Adaptation from iLab20M to Imagenet without extrinsic supervision
While multi-task learning seems to be a good strategy for finding local minima able to maintain multiple tasks, it requires two expensive sources of supervision: images and labels from the original domain.
The A-LTM architecture is able to exploit the absence of supervision by using its stable component to hallucinate the missing labels and convert the otherwise sequential learning into a multi-task scenario. If the distribution of inputs is homogeneous across datasets, i.e., $P(y_1 | x_1) = P(y_1 | x_2)$, the input-output map of the stable network can be expressed, and therefore distilled, using only images from the new domain as input, avoiding completely the limitations of multi-task learning (A-LTM naive). In the contrary case, a replay system based either on re-generation or on storage of past inputs is necessary (A-LTM replay).
Table 9.1: Test set accuracy of domain adaptation over Imagenet and memory of the viewpoint task for iLab, for single-task (ST), multi-task/domain (MD) and Active Long Term Memory networks with and without replay.
Imagenet Viewpoints
ST - Gauss Init 0.44
ST - iLab Init 0.46 0.03
MD - Gauss Init 0.44 0.81
MD - iLab Init 0.45 0.84
A-LTM - naive 0.40 0.57
A-LTM - replay 0.41 0.90
The multi-task architecture of Section 9.5.2 is used as the stable component N for both experiments, i.e., we use it for the weight initialization of H and as a source of supervision for KD in both A-LTM models.
The results on Imagenet in Table 9.1 show that the A-LTM architectures are able to maintain the long-term memory of the viewpoint task at the cost of a slower adaptation. The reduced memory performance in Table 9.1 of A-LTM-naive (without replay), and especially the strong initial drop for both viewpoints and categories illustrated in Figure 9.3, are indicative of the strong shift in underlying factors between the two datasets and of the importance of generative mechanisms to re-balance these differences with replay.
9.6 Discussion
In this work we introduced a model of long-term memory inspired by neuroscience and based on the knowledge distillation framework, bridging sequential learning with multi-task learning. We show empirically that the ability to recognize different viewpoints of an object can be maintained even after exposure to millions of new examples, without extrinsic supervision, using knowledge distillation and a replay mechanism. Furthermore, we report encouraging results on the ability of A-LTM to maintain knowledge relying only on the current perceptual stream. The theoretical analysis of DLNs linking the convergence time of stochastic gradient descent to input-output statistics and network depth, and the plethora of tricks developed to successfully train deep networks, suggest a potential relationship with the vanishing gradient problem [38, 37]. In order to learn in complex environments whose data-generating process takes a long time to mix, it is necessary to use deeper architectures that have a longer convergence time, which in turn are plagued by the problem of propagating the gradient across multiple layers.

Figure 9.3: Test set accuracy on iLab20M of A-LTM: categories in blue and viewpoints in red. A-LTM with replay is indicated by circles and A-LTM without replay by plus signs; the dashed horizontal lines represent the accuracies at initialization. Both models suffer from an initial drop in accuracy during the first epochs, but the A-LTM network that has access to samples of the original dataset is able to fully recover.
While it seems easy to confuse the two memory problems, in supervised classification with Long Short Term Memory networks a careful balance of positive and negative examples in the mini-batches is crucial to model performance, similarly to the interleaved scheme for sampling categories in CNNs.
Chapter 10
Conclusion and ongoing work
This thesis constructs a philosophical framework to reason about intelligent computations by
bridging Computational Mechanics and modern Machine Learning. We define the conditions for an agent to study the causality of its physical environment by studying its own internal representation of the latter.
The presented results convey early implementations of algorithms able to learn self-supervised sufficient and minimal representations of:
1. Memory-less channels
2. Stochastic processes
3. Partially observable decision processes
As well as new meta-algorithms, applied to the memory-less channel case, for:
1. Self-improvement through retraining with the Born Again procedure.
2. Dynamic global aggregation of local information with Bottom Attention Networks
3. Long term memory in the continual learning setting through the Active Long Term Memory
networks.
In future work we will expand the meta-algorithms to handle more sophisticated environments with temporal dependency and memory, using as base architectures those introduced in the first half of the manuscript.
Reference List
[1] Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[2] Ali Borji, Wei Liu, Saeed Izadi, and Laurent Itti. iLab-20M: A large-scale controlled object dataset to investigate deep learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
[4] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.
[5] Nix Barnett and James P Crutchfield. Computational mechanics of input–output processes: Structured transformations and the ε-transducer. Journal of Statistical Physics, 161(2):404–451, 2015.
[6] Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[7] Yoshua Bengio, Aaron Courville, and Pierre Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828, 2013.
[8] Leo Breiman et al. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231, 2001.
[9] Leo Breiman and Nong Shang. Born again trees. Available online at: ftp://ftp.stat.berkeley.edu/pub/users/breiman/BAtrees.ps, 1996.
[10] Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.
[11] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541. ACM, 2006.
[12] Rich Caruana. Learning many related tasks at the same time with backpropagation. Advances in Neural Information Processing Systems, pages 657–664, 1995.
[13] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[14] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[15] James P Crutchfield. The origins of computational mechanics: A brief intellectual history and several clarifications. arXiv preprint arXiv:1710.06832, 2017.
[16] James P Crutchfield, Christopher J Ellison, Ryan G James, and John R Mahoney. Synchronization and control in intrinsic and designed computation: An information-theoretic analysis of competing models of stochastic computation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(3):037105, 2010.
[17] James P Crutchfield and Cosma Rohilla Shalizi. Thermodynamic depth of causal states: Objective complexity via minimal representations. Physical Review E, 59(1):275, 1999.
[18] James P Crutchfield and Karl Young. Inferring statistical complexity. Physical Review Letters, 63(2):105, 1989.
[19] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[20] A. Dutt, D. Pellerin, and G. Quenot. Coupled ensembles of neural networks. ArXiv e-prints, September 2017.
[21] David P Feldman, Carl S McTague, and James P Crutchfield. The organization of intrinsic computation: Complexity-entropy diagrams and the diversity of natural information processing. Chaos: An Interdisciplinary Journal of Nonlinear Science, 18(4):043106, 2008.
[22] Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[23] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
[24] Krzysztof J Geras, Abdel-rahman Mohamed, Rich Caruana, Gregor Urban, Shengjie Wang, Ozlem Aslan, Matthai Philipose, Matthew Richardson, and Charles Sutton. Compressing LSTMs into CNNs. arXiv preprint arXiv:1511.06433, 2015.
[25] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[26] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
[27] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. arXiv preprint arXiv:1610.02915, 2016.
[28] James V Haxby, M Ida Gobbini, Maura L Furey, Alumit Ishai, Jennifer L Schouten, and Pietro Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539):2425–2430, 2001.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[31] Donald Olding Hebb. The Organization of Behavior: A Neuropsychological Theory. Psychology Press, 2005.
[32] Robert Hecht-Nielsen. Theory of the backpropagation neural network. In Neural Networks, 1989. IJCNN., International Joint Conference on, pages 593–605. IEEE, 1989.
[33] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[34] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.
[35] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[36] Geoffrey E Hinton. Learning distributed representations of concepts.
[37] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[38] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[39] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
[40] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[41] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109, 2017.
[42] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[43] Benjamin D Johnson, James P Crutchfield, Christopher J Ellison, and Carl S McTague. Enumerating finitary processes. arXiv preprint arXiv:1011.0036, 2010.
[44] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.
[45] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[46] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[47] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
[48] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[49] Tejas D Kulkarni, Karthik R Narasimhan, Ardavan Saeedi, and Joshua B Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. arXiv preprint arXiv:1604.06057, 2016.
[50] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2530–2538, 2015.
[51] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. arXiv preprint arXiv:1604.00289, 2016.
[52] B Boser Le Cun, John S Denker, D Henderson, Richard E Howard, W Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems. Citeseer, 1990.
[53] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.
[54] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.
[55] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[56] Mingsheng Long and Jianmin Wang. Learning multiple tasks with deep relationship networks. arXiv preprint arXiv:1506.02117, 2015.
[57] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
[58] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.
[59] David Marr and Tomaso Poggio. From understanding computation to understanding neural circuitry. 1976.
[60] James L McClelland, Bruce L McNaughton, and Randall C O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995.
[61] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
[62] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
[63] Tomas Mikolov, Armand Joulin, and Marco Baroni. A roadmap towards machine intelligence. arXiv preprint arXiv:1511.08130, 2015.
[64] Marvin Minsky. Society of Mind. Simon and Schuster, 1988.
89
[65] Marvin Minsky. Society of mind: a response to four reviews. Articial Intelligence, 48(3):371{
396, 1991.
[66] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G
Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[67] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation
as a defense to adversarial perturbations against deep neural networks. In Security and
Privacy (SP), 2016 IEEE Symposium on, pages 582{597. IEEE, 2016.
[68] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask
and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
[69] Jean Piaget and Barbel Inhelder. The psychology of the child. Basic Books, 2008.
[70] Lorien Y Pratt, Jack Mostow, Candace A Kamm, and Ace A Kamm. Direct transfer of
learned information among neural networks. In AAAI, volume 91, pages 584{589, 1991.
[71] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep
domain adaptation. arXiv preprint arXiv:1603.06432, 2016.
[72] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[73] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James
Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell.
Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
[74] Andrew M Saxe, James L McClelland, and Surya Ganguli. Learning hierarchical categories
in deep neural networks.
[75] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear
dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[76] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[77] William Beecher Scoville and Brenda Milner. Loss of recent memory after bilateral hippocampal lesions. Journal of neurology, neurosurgery, and psychiatry, 20(1):11, 1957.
[78] Cosma Rohilla Shalizi and James P Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. Journal of statistical physics, 104(3-4):817–879, 2001.
[79] Cosma Rohilla Shalizi. Causal architecture, complexity and self-organization in the time series and cellular automata. PhD thesis, University of Wisconsin–Madison, 2001.
[80] Cosma Rohilla Shalizi and Kristina Lisa Shalizi. Blind construction of optimal nonlinear recursive predictors for discrete sequences. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 504–511. AUAI Press, 2004.
[81] Claude Elwood Shannon. A mathematical theory of communication. ACM SIGMOBILE mobile computing and communications review, 5(1):3–55, 2001.
[82] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[83] Susanne Still. Information-theoretic approach to interactive learning. EPL (Europhysics
Letters), 85(2):28005, 2009.
[84] Susanne Still and Doina Precup. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.
[85] Christopher C Strelioff and James P Crutchfield. Bayesian structural inference for hidden processes. Physical Review E, 89(4):042119, 2014.
[86] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
[87] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. A deep
hierarchical approach to lifelong learning in minecraft. arXiv preprint arXiv:1604.07255,
2016.
[88] Gregor Urban, Krzysztof J Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang,
Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, and Matt Richardson. Do deep
convolutional nets really need to be deep and convolutional? arXiv preprint arXiv:1603.05691,
2016.
[89] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315, 2017.
[90] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: Similarity control and knowledge transfer. Journal of Machine Learning Research, 16:2023–2049, 2015.
[91] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[92] Andrea Vedaldi and Karel Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, pages 689–692. ACM, 2015.
[93] Jane X Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Dhruva Tirumala, Hubert Soyer, Joel Z
Leibo, Demis Hassabis, and Matthew Botvinick. Prefrontal cortex as a meta-reinforcement
learning system. Nature neuroscience, 21(6):860, 2018.
[94] Torsten N Wiesel and David H Hubel. Effects of visual deprivation on morphology and physiology of cells in the cat's lateral geniculate body. Journal of Neurophysiology, 26(6):978–993, 1963.
[95] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[96] Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Deep pyramidal residual networks
with separated stochastic depth. arXiv preprint arXiv:1612.01230, 2016.
[97] Daniel L Yamins, Ha Hong, Charles Cadieu, and James J DiCarlo. Hierarchical modular optimization of convolutional networks achieves representations similar to macaque IT and human ventral stream. In Advances in Neural Information Processing Systems, pages 3093–3101, 2013.
[98] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
[99] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.
[100] Tom Zahavy, Nir Ben Zrihem, and Shie Mannor. Graying the black box: Understanding DQNs. arXiv preprint arXiv:1602.02658, 2016.
[101] Amy Zhang, Adam Lerer, Sainbayar Sukhbaatar, Rob Fergus, and Arthur Szlam. Composable planning with attributes. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80, pages 5837–5846. JMLR.org, 2018.
Abstract
All non-linear dynamical systems produce computations by storing and transforming input information. A computing device is a physical system that can be reliably controlled into transducing specific computations. Intelligent agents, themselves computing devices, manipulate and exploit their physical environment to achieve their goals. Goals are future states of the world whose information content the agent desires. In this setting, the agent's behavior can be interpreted as the tentative manipulation of a computing device (the environment) into the configuration that will output the desired information. We call computational intelligence the ability to control the physical environment into specific informational states. We define prediction and control as basic intelligence, and we call meta-intelligence the collection of computational tools necessary to exhibit basic intelligence in a non-stationary world using a finite physical computing device. In this dissertation we develop a framework to model the computational intelligence exhibited by biological and artificial agents as they control their context to achieve their goals. We build on the original levels-of-understanding manifesto of Marr and Poggio and use the same three levels (computational, algorithmic, and implementational) to develop a theory of intelligence that is agnostic about the physical substrate used to compute. Unlike the original formulation, whose intent was purely descriptive, we take a normative approach. Starting from assumptions about the causal architecture of the world and the perceptual tools of the agents, we define the computational requirements an agent must satisfy to exhibit basic intelligence. With a few additional assumptions about the relationship between computations and the hardware used to implement them, we derive multiple meta-intelligence requirements for agents that live in non-stationary and diverse environments.
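The abstract's central claim, that an agent's behavior amounts to driving a computing device (the environment) into a desired informational state, can be made concrete with a minimal perception-prediction-control loop. The sketch below is illustrative only and does not appear in the dissertation: the BitFlipEnvironment, the one-step forward model, and the greedy goal-seeking policy are hypothetical stand-ins for, respectively, the environment, prediction, and control.

    import random

    class BitFlipEnvironment:
        """Hypothetical toy 'computing device': a bit vector the agent can flip."""
        def __init__(self, n_bits=8, seed=0):
            rng = random.Random(seed)
            self.state = [rng.randint(0, 1) for _ in range(n_bits)]

        def step(self, action):
            self.state[action] ^= 1   # control: action i flips bit i
            return list(self.state)   # observation (fully observed in this toy case)

    def predict_next_state(state, action):
        """Prediction: a one-step forward model (here the true dynamics are known)."""
        nxt = list(state)
        nxt[action] ^= 1
        return nxt

    def greedy_policy(state, goal):
        """Control: pick the action whose predicted outcome is closest to the goal."""
        def mismatch(s):
            return sum(a != b for a, b in zip(s, goal))
        return min(range(len(state)),
                   key=lambda a: mismatch(predict_next_state(state, a)))

    env = BitFlipEnvironment(n_bits=8)
    goal = [1] * 8                    # a desired informational state of the device
    obs = list(env.state)
    for t in range(20):
        if obs == goal:
            print("goal reached at step", t)
            break
        obs = env.step(greedy_policy(obs, goal))

In the abstract's terms, predict_next_state plays the role of prediction and greedy_policy the role of control; basic intelligence is exhibited when the two together steer the device into the goal configuration.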