NEURAL SPIKETRAIN DECODER
FORMULATION AND PERFORMANCE ANALYSIS
by
Arvind Iyer
____________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOMEDICAL ENGINEERING)
August 2015
Copyright 2015 Arvind Iyer
Dedication
To Bhuvana Venkataraman and U. Venkataraman, my mother and father,
for making it possible, and making it joyful,
to live, love, learn and love learning
Acknowledgments
To my advisor, Norberto M. Grzywacz, my most profuse thanks are due, for harmoniously and joyously
embodying the twin research virtues of discipline and curiosity, for doing justice both to knowledge and to
imagination, and for being to his students not just someone whom we work with, but someone whom we
can wonder with. The evolution of the research presented here, and my own evolution as a researcher, are
owed in large measure to the times of work and wonder spent together with members and alumni of the
Visual Processing Laboratory: Dr. Grace Lee, Dr. Joaquin Rapela, Dr. Xiwu Cao, Dr. Junkwan Lee, Dr.
Yerina Ji, Wan-Qing Yu and Nadav Ivzan. In them, I have seen living examples of the principled
convictions, committed work-ethic and sense of belonging that bring inspiration to any endeavor, scientific
or social.
I would like to gratefully acknowledge the contribution of Dr. Xiwu Cao, alumnus of the Visual Processing
Laboratory, who supplied the electrophysiological data for analyses in the research presented here. The
research summarized in this thesis benefited greatly from the productive and insightful direction provided
by the members of my guidance committee: Dr. Vasilis Marmarelis, Dr. Bosco Tjan, Dr. Dong Song and
Dr. James Weiland. From the outset and at every stage in my graduate research, I have relied on training
received from inspirational educators at the Viterbi School of Engineering: Bartlett Mel, Norberto Grzywacz
and Michael Arbib, whose courses initiated me into the discipline of Computational Neuroscience, and
Professors Stefan Schaal, Robert Scholtz and Bart Kosko, who through their courses provided a rich
understanding of both the principles and the possibilities of their fields. Among my educators at USC, I
proudly count the participants of the Vision Journal Club (especially members of the Laboratory of Neural
Computation, T-Lab and Hirsch Lab), who provided healthy peer-pressure, helpful peer-mentoring and a
cherished sense of community.
Table of Contents
Dedication
Acknowledgments
List of Figures
Abstract
1. Introduction
1.1. Neural decoding with linear decoders
1.2. A Bayesian approach to benchmark neural decoders
1.3. Research paradigm to study neural decoding
1.4. Neural decoding as vertex-labeling on a graph
1.5. Rationale for choice of representational space
1.6. Linear separability of a vertex-labeling on an N-cube
2. Theoretical results
2.1. Terminology and definitions
2.2. List of theorems
2.3. Connectivity and linear separability
2.4. Linear and polynomial separability
2.5. Temporal independence and linear separability
2.6. Discussion
3. Computational estimates of the probability of optimal linear decoding
3.1. Software implementation of the test of linear separability of an N-cube vertex-labeling
3.2. Monte Carlo sampling of vertex-labelings on N-cubes
3.3. Monte Carlo estimates of the proportion $P_L$ of vertex-labelings allowing linear separability
3.4. Discussion
4. Optimal decoding of experimental neural spiketrain recordings
4.1. Optimal decoders and decoder performance assessment
4.2. Ideal Observer Analysis for a labeled set of spiketrains
4.2.1. Estimation of an Ideal Observer Decoder
4.2.1.1. ‘Local’ Ideal Observer decoding
4.2.1.2. ‘Global’ Ideal Observer decoding
4.2.2. Estimation of a Linear Ideal Observer Decoder
4.3. Decoding of Retinal Ganglion Cell spiketrains using IOD and LIOD estimation
4.3.1. Data pre-processing and partitioning
4.3.2. Ground-truth labeling for multiple questions
4.3.3. Comparison of IOD and LIOD performance for questions posed to RGCs
4.4. Discussion
5. Linear decoder performance for simulated naturalistic neural responses
5.1. Motivation to study the effect of spiking statistics on decoder performance
5.2. Unit-level simulation study of the effect of naturalistic response statistics
5.2.1. Unit-level response model
5.2.2. Unit-level simulation results
5.3. Population-level simulation study of the effect of naturalistic response statistics
5.3.1. Formulations of population decoders
5.3.2. Effect of population size on non-hierarchical decoder performance
5.3.3. Analysis methods for performance assessment of hierarchical population decoders
5.4. Discussion
6. Discussion
6.1. Performance of linear decoders for single neurons
6.2. Performance of linear decoders for neural populations
6.3. Methodological extensions and applications
References
List of Figures
Fig. 1: Schematic representation of a linear decoder
Fig. 2: Neural decoding in the context of psychophysics and sensory physiology
Fig. 3: Psychophysics-inspired paradigm
Fig. 4: Spiketrain decoding as vertex-labeling on an N-cube
Fig. 5: Connectivity of node labels and linear separability
Fig. 6: The opposite-direction-motion criterion for linear separability
Fig. 7: Monte Carlo sampling of vertex-labelings with and without the constraint of connectivity
Fig. 8: Algorithm for applying the opposite-direction-motion criterion for linear separability
Fig. 9: Results of Monte Carlo simulations: Proportion of vertex-labelings allowing linear separability
Fig. 10: Schematic illustration of ‘Local’ Ideal Observer Decoder estimation procedure
Fig. 11: Algorithm for Ideal Observer Decoder estimation
Fig. 12: Schematic illustration of ‘Global’ Ideal Observer Decoder estimation procedure
Fig. 13: Algorithm for Linear Ideal Observer Decoder estimation
Fig. 14: Schematic illustration of Linear Ideal Observer Decoder estimation procedure
Fig. 15: Rationale for selection of early visual tasks for LIOD demonstrations
Fig. 16: Results of Linear Ideal Observer Analyses applied to naturalistic RGC recordings
Fig. 17: Distribution of response spike-counts of an RGC to natural images
Fig. 18: ‘Zebra-labelings’: A family of non-linearly-separable IOD labelings
Fig. 19: Effect of spiketrain response distributions on Linear Ideal Observer Decoder performance
Fig. 20: Architectures for population decoding
Fig. 21: Simulation results showing effect of population size on non-hierarchical decoder performance
Abstract
A computational understanding of how animals are equipped by their brains to behave in the natural world
requires an account both of how animals are able to represent information in their sensory environment
and of how they are able to process this information to arrive at perceptual decisions. Neural decoder is the
term we use for an operation that takes a neural spiketrain (time-series of all-or-none action-potentials) as
input and yields a perceptual decision, here posed in the form of a Yes/No question. Neural decoders which
yield a decision by a simple threshold operation on a weighted sum of spikes in the input spiketrain are
called linear decoders. While linear decoders are mathematically convenient and have a long and varied
history of use, they are not guaranteed to yield optimal classification performance in a Yes/No task. In this
study, we obtained theoretical results and developed tools to determine how closely the performance of the
best possible linear decoder can approach Ideal Observer performance (i.e. Bayes-optimal performance).
The contributions of this work are new mathematical results and computational tools addressing the broad
question of “How powerful are linear decoders?”
We formulated the decoding of neural spiketrains for answering Yes/No questions as a vertex-labeling
(with Yes and No labels) on a graph (specifically, an N-cube) whose vertices represent the possible
spiketrains (Chapter 1). Vertex-labelings corresponding to situations allowing optimal linear decoding
obey graph-theoretic conditions that we discovered (Chapter 2) and exploited to build Monte Carlo
estimators of the probability of optimal linear decoding for an arbitrary labeling (Chapter 3), which
revealed that linear decoders would perform poorly for an overwhelming majority of Yes/No questions. To
evaluate and benchmark the actual performance of linear decoders for physiological recordings, we devised
algorithms to estimate, from labeled experimental recordings, two types of optimal decoders, which we call
the Ideal Observer Decoder and the Linear Ideal Observer Decoder, and which respectively yield Bayes-optimal
performance and the best performance attainable by a linear decoder (Chapter 4). We applied these
techniques to extracellular single-cell recordings of retinal ganglion cells (RGCs) in order to answer
psychophysically motivated questions about the natural image stimuli driving the cells. We found that linear
decoders for a broad class of early visual tasks yielded Ideal-Observer-like performance, and we demonstrate
via simulations a plausible explanation for this based on the naturalistic firing statistics of single retinal
neurons (Chapter 5). Extensions of these simulations from single neurons to neural populations yield
preliminary insights about the limitations to which linear decoders (which yield near-optimal performance in
single-cell decoding of naturalistic responses) are subject when faced with population decoding.
1. Introduction
1.1 Neural decoding with linear decoders
To survive in the wild, an animal must be able to detect and respond in a timely fashion to changes in its
surroundings [1]. Events in an animal’s sensory environment evoke associated neural activity, from which the
animal must be able to extract the information it needs to make crucial decisions in activities like
prey-detection [2] or predator-avoidance [3]. The stimulus information of interest may be represented as a
quantitative variable in an estimation problem [4] or a categorical variable in a discrimination (or
classification) task [5]. A computational account of neural decoding [6], that is, a model-based account of
how neural mechanisms in behaving organisms can accomplish such estimation or classification tasks, has
classically been formulated as successive operations on a time-series representation of neural activity. The
first operation performed upon the time-series of neural activity is typically a linear integration or
weighted averaging [4,5].
Previous experimental investigations of decoding quantitative or categorical information about evoking
stimuli from evoked neural activity recorded in the form of action-potentials [4,5], EEG [7] or BOLD
signals [8] have typically assumed that the output of a neural decoder depends only on a linearly filtered
summary of the afferent neural activity. A major motivation presented for such linear filtering is that it
yields a low-dimensional representation of the input that is sufficient for acceptable decoding
performance [7,10], when the raw input data are often intractably high-dimensional. As for the decoders of
neural spike-trains in the literature, the simplest models most commonly in use treat the count of spikes in
a given interval as a property that is sufficiently informative about the stimuli [11]. The assumption that
it is the spike-rate which is stimulus-property-dependent is motivated by experimental findings such as
those in the classic experiments of Hubel and Wiesel [12], that neurons tend to spike at a higher rate when
exposed to preferred stimuli than when shown other stimuli.
Most decoders of neural spike-trains feature an assumption of linearity, often explicit, as in temporal
kernel methods [4], or implicit, as in Bayesian decoders demonstrated for retinal population firings [8].
Previous studies that found such linear decoders of neural spiketrains to perform comparably with other,
more complex decoders against which they were pitted [13], lend support to their continued status as the
decoders of first choice in studies of neural decoding. The form of a linear decoder taking a neural
spiketrain as input is that of a classic McCulloch–Pitts neuron consisting of a weighted summer followed by
a nonlinear operation, typically thresholding. Historically important studies have noted the limited
approximation properties of a McCulloch–Pitts neuron (or single-layer perceptron) and its inability to
function as a universal classifier [14], vis-à-vis later generalizations with universal approximation
properties [15]. It follows that these shortcomings also apply to linear decoders when posed with arbitrary
estimation or classification tasks. Also, the most commonly used linear decoders extract the mean firing
rate during the linear stage, which is suitable when the estimated variable is indeed signaled via a classic
rate code but not in general guaranteed to be optimal when the signaling is via other encoding
schemes [13,14]. This thesis introduces and demonstrates a test for when a linear decoder is optimal for
performing a classification task given neural spike-trains, with no further assumptions about the
spike-generation process.
Fig. 1 Schematic representation of a linear decoder. A linear decoder for a Yes/No question consists of a
set of filter coefficients $a_i$ and a threshold $\theta$, which yield a decision on a binary input
spiketrain as shown.
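To make the schematic concrete, a minimal sketch (assumed Python; the coefficient and threshold values are illustrative assumptions, not taken from the thesis) of the decision rule in Fig. 1:

import numpy as np

def linear_decode(spiketrain, a, theta):
    """Linear decoder of Fig. 1: answer Yes iff the weighted
    spike count exceeds the threshold theta."""
    return "Y" if np.dot(a, spiketrain) > theta else "N"

# Hypothetical 3-bin example: weights and threshold are illustrative.
a = np.array([1.0, 0.5, -0.5])   # filter coefficients a_i
theta = 0.75                     # decision threshold
print(linear_decode(np.array([1, 0, 1]), a, theta))  # -> "N" (0.5 <= 0.75)
print(linear_decode(np.array([1, 1, 0]), a, theta))  # -> "Y" (1.5 > 0.75)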
1.2 A Bayesian approach to benchmark neural decoders
Neural decoders which assume the spike rate (or a temporally weighted sum of spikes) as the
stimulus-dependent spiketrain property, and which yield a decision by a simple threshold operation on this
weighted sum, are called linear decoders (as shown in Fig. 1). A probabilistic estimate of how well such
linear decoders recover sensory information from spiketrains, and how often they answer questions correctly
using spiketrains, requires models of how the stimulus properties in the world are distributed (the prior
distribution) and of how likely a given response pattern is given a value of the stimulus property (the
likelihood function) [6]. Using these, a maximum a posteriori (MAP) estimate of the stimulus property value
given a response pattern can be obtained [16]. The performance of the MAP estimator is treated as the upper
bound. In this thesis, a linear decoder will be considered optimal only if it yields performance equivalent
to that of a MAP estimator, and methods are developed to identify both a MAP estimator and the linear
decoder that most closely approaches its performance.
Building the MAP estimator requires density estimation to obtain the prior and likelihood functions, which
in experimental settings is very susceptible to the curse of dimensionality. Moreover, convenient
mathematical assumptions often do not hold for spike-responses to naturalistic stimuli. Responses of
sensory neurons to naturalistic stimuli have been observed to be sparse, which means that very dense
response patterns very rarely occur, regardless of the stimuli presented [17]. If parametric forms of the
distributions are not assumed beforehand, a larger amount of data will be needed for density estimation.
The response distribution has a bearing on the performance attainable by linear decoders and is the topic
of investigation in Chapter 5 of this thesis.
1.3 Research paradigm to study neural decoding
A theoretical description of how an organism performs neurally mediated sensorimotor interactions with
its environment involves a specification of the interaction between three classes of variables: (i) stimulus
properties; (ii) (stimulus-dependent) neural activity, i.e., spiketrains; (iii) (activity-dependent and hence
stimulus-dependent) behavioral responses/perceptual decisions (e.g., Yes or No).
Fig. 2 Neural decoding in the context of psychophysics and sensory physiology. The circles are variables and
arrows show dependencies/mappings. Traditionally, study of the ‘Question’ mapping has been the subject
matter of Psychophysics. An important subdiscipline of Sensory Physiology is the characterization of the
‘Neural Encoding’ mapping. This study concerns itself with a third mapping, namely, from neural responses
to perceptual decisions.
A framework for building a computational account of neural decoding therefore includes a characterization
of the following mappings:
(i) Question: Mapping from stimulus properties to a perceptual decision (requiring estimation of stimulus
properties)
Traditional and typical demonstrations of neural decoding attempt complete reconstruction of the sensory
environment, such as the reconstruction of entire visual scenes as pixel-wise intensity maps from
spike-trains of retinal ganglion cell populations [18]. The approach adopted here is instead to
probabilistically infer scene properties of interest [19] from the available spiketrains, rather than attempt
scene reconstruction in toto. An organism’s brain is subject to biological constraints which preclude
representing every detail in the sensory environment. In evolutionary terms, it is reasonable to assume that
what the neural apparatus extracts is the sort of information that confers survival advantage. In behavioral
terms, we can infer from the observed behaviors of an organism that the information needed for these
behaviors is available to the organism in terms of neural representations and readouts. Therefore, what is
meant by the brain’s ‘inner representation’ of the world is not a complete description but a description
relevant to the organism in a behavioral sense.
Examples of perceptual decisions to be made from stimuli are “What speed of running will be adequate to
dodge this predator?” or “Is this fruit ripe?” The latter question belongs to a broad class of questions that
can be answered with a Yes or a No. Other such questions in sensorimotor contexts are “Did a shadow fall
here?” and “Is this muscle flexing?” Behavioral performance on such questions is treated as a
two-alternative forced-choice (2AFC) task in Psychophysics. We adapt the psychophysical paradigm used to
study behavioral performance in 2AFC tasks [20] to studying decoder performance in answering Yes/No
questions from neural response patterns. Chapter 2.3.2 shows examples of such questions posed to RGCs.
(ii) Encoding: Mapping from stimulus properties (continuous or categorical variable) to spiketrains
The encoding mapping is specified by a lookup table consisting of paired entries of response patterns
(spiketrains) and the stimulus variable associated with them. For Yes/No questions about the stimulus, the
encoding is specified by assigning a Y or N label to each of the spiketrains (which number $2^N$ when there
are N bins or bits per spiketrain). A probabilistic specification of the encoding would in addition also
include the probabilities of a given spiketrain accompanying a Y or N label. Characterization of encoding,
such as mapping of receptive fields, i.e., estimating preferred or spike-triggering stimuli of visual
neurons, is a longtime mainstay of sensory neuroscience. Such characterization is often facilitated by
employing an optimal stimulus representational space [21] (e.g., Gabor functions for V1 receptive fields).
(iii) Decoding: Mapping from spiketrain to perceptual decision (class-labels/Yes-No answers)
Decoding a response pattern, for purposes of this study, is defined as the task of estimating the label (Y or
N) associated with the response pattern, or more generally, estimating the probabilities of a given response
pattern being associated with a Y or N label. This can be cast as supplying missing labels in a lookup table
of response patterns. We define a ‘decoder’ as an operation on a response pattern that yields an estimate of
the unknown label. Some decoders of particular interest are: (i) a threshold operation on a weighted
summing operation on the spiketrain (as discussed in Chapter 1.1); (ii) a maximum a posteriori (MAP)
estimator using likelihoods available from experimental recordings. The emphasis in this study is on the
performance of decoders with respect to the benchmark set by the Bayes-optimal MAP estimator, which we
will call the Ideal Observer Decoder [12] (see Chapter 4.2.1).
Fig. 3 Psychophysics-inspired paradigm. The research described in detail in Chapter __ incorporates
available experimental recordings, on which are performed Ideal Observer Analyses [22] analogous to those
performed for 2-AFC psychophysical tasks [20].
The parametrizing of decoders can be facilitated by using a suitable response-representational space that is
amenable to visualization and interpretation. The rationale behind the choice of the
response-representational space [13] that is adopted for the rest of the study is outlined in the next
section.
[Fig. 3 diagram: a natural image is presented to an RGC (experimental recording); the question is answered
by observing the stimuli (ground truth) and is also posed, given a test spike response, to the Ideal
Observer Decoder (IOD) and the Linear Ideal Observer Decoder (LIOD) (observer analysis); IOD and LIOD
predictions for the test set are compared against ground truth for error assessment.]
1.4 Neural decoding as vertex-labeling on a graph
The optimal inference of ‘scene properties’ or ‘decision variables’ from spiketrains can be formulated as a
problem of classification [24] or regression [42], employing the terminology and techniques of Machine
Learning. The ‘training examples’ in these problems are time-series segments of neural activity in the form
of binary spiketrains. These training examples can be represented on a graph (the unit N-cube) whose
vertices represent possible spiketrains of N bits or bins. Regression in this space (referred to as a
Hamming Space of dimensionality N) is the assignment of a real-valued estimate to every spike-train
instance, while classification is the assignment of a categorical label. A particularly interesting case of
classification is one in which there are two classes, as in a 2-alternative-forced-choice task or a Yes/No
question. In a discrete space like the Hamming Space, there is a finite number of possible Yes/No labelings
(one of which is shown in Fig. 4), unlike in a Euclidean space, where the possible scatterplots cannot be
thus enumerated. We can thus estimate the proportion of all possible configurations in which the Yes and No
classes can be correctly linearly separated (i.e., by a single hyperplane), and also enumerate the patterns
that allow such separation (for more, see Chapter 3.3).
Fig. 4 Spiketrain decoding as vertex-labeling on an N-cube. Consider a hypothetical neuron whose responses
are recorded as spiketrains which are 3 bits (or ‘bins’) in length. The table lists the bit sequences which
are possible responses of this neuron, along with Yes (Y) or No (N) labels which are the output of a
hypothetical decoder of these neural responses for a particular Yes/No question. The spiketrain N-tuples
are represented as points in a space whose co-ordinates are the spiketrain bits (e.g., the spiketrain 011
is represented by the point (0,1,1)). The graph which these responses constitute can be recognized as the
unit N-cube (with N = 3 here) or Hamming Cube. Decoding can be viewed as a problem of coloring or
partitioning this graph. The decoding listed explicitly in the table is shown as a vertex-labeling on the
corresponding Hamming Cube. For ease of visualization, most illustrations in this thesis are shown for such
a 3-dimensional space. However, the results and methods apply generally to higher-dimensional spaces.
Spiketrain bit-string    Label
0 0 0                    Y
0 0 1                    Y
0 1 0                    Y
0 1 1                    N
1 0 0                    N
1 0 1                    N
1 1 0                    N
1 1 1                    N

[Fig. 4 diagram: the same labeling drawn on the unit 3-cube with vertices (0,0,0) through (1,1,1).]
A Yes/No labeling on a Hamming Cube can in theory be used to represent an encoding scheme, in which all
response spike-trains (nodes) that can be elicited by a stimulus with a given property are identified. A
decoding problem is one in which an unlabeled spiketrain is presented, for which a label must be provided.
A ‘decoder’ may likewise be represented as an exhaustive labeling of the Hamming Cube, consisting of the
estimated labels for each possible spike-train. This can be viewed as a ‘lookup table’ of predictions for
every possible observed spiketrain. Such a lookup table can be built using standard methods such as
least-squares, maximum-likelihood or maximum a posteriori (MAP) estimation. Using an MLE or MAP estimator
offers the advantages of not assuming linearity and of yielding probabilistic rather than point
estimates [16]. The classification performance yielded by these decoders can be compared with that yielded
by linear approximations of the same decoders, to examine whether a linear decoder can be optimal.
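As an illustration of this lookup-table view, a minimal sketch (assumed Python, with hypothetical labeled recordings) that fills in labels by MAP estimation under the empirical distribution; spiketrains never observed remain as missing labels:

from collections import Counter

def map_lookup_table(labeled_trains):
    """Decoder as a lookup table: for each observed spiketrain
    (a bit-string), store the empirically most probable label,
    i.e. the MAP estimate under the empirical joint distribution.
    Spiketrains never observed simply remain absent (missing labels)."""
    counts = {}
    for bits, label in labeled_trains:
        counts.setdefault(bits, Counter())[label] += 1
    return {bits: c.most_common(1)[0][0] for bits, c in counts.items()}

# Hypothetical labeled recordings (N = 3 bins per spiketrain).
data = [("011", "N"), ("011", "N"), ("011", "Y"),
        ("100", "Y"), ("000", "N")]
print(map_lookup_table(data))  # -> {'011': 'N', '100': 'Y', '000': 'N'}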
1.5 Rationale for choice of representational space
The representation we choose to specify the Y/N labels estimated by a decoder is a coloring of a unit
N-cube in which each node is a binary response spiketrain with N bits (as shown in Fig. 4). Such a geometric
representation [25] offers the following advantages over other more conventional representations:
(i) Many models of decoding in the literature assume the availability of response averages (in the form of
peristimulus time histograms or PSTHs), whereas behaving organisms must arrive at perceptual decisions with
only single-trial records (i.e., the current spiketrain) available. Representing the neural responses as
binary strings is a higher-resolution and more assumption-free representation than response averages. Using
a representation with the theoretical maximum resolution also permits clearly distinguishing ‘coarse codes’
like rate codes (or spike-count codes) from codes requiring finer temporal precision [26], like
instantaneous rate codes (rate codes with temporal weighting) or inter-spike-interval (ISI)-based codes.
(ii) The unit N-cube, or Hamming Space, that is naturally induced by response patterns in the form of
binary strings satisfies the requirements of a metric space. The Hamming distance between two nodes or
vertices is a ready way of quantifying the similarity between two spiketrains [27]. For low-dimensional
spiketrains (say N = 3), such distances and distributions of spiketrains signaling the same class (i.e.,
having the same label, Yes or No) can be readily visualized, yielding insights that are useful while
studying higher-dimensional responses as well. Such a representation of the spike-train input has the
advantage of not amounting to an a priori assumption about which spike-train metric [28] is most relevant
for classification. In particular, this representation of the input is free of the assumptions of
mean-firing-rate or instantaneous-firing-rate coding which are implicit in the most commonly used linear
decoders. Further, the binary string representation of the input can be generalized for population codes by
concatenating the bit-strings corresponding to the individual responses of multiple neurons (a brief
illustration follows).
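A brief sketch (assumed Python) of the two conveniences just noted, Hamming distance as a spiketrain similarity measure and concatenation of bit-strings for population codes:

def hamming(u, v):
    """Hamming distance between two equal-length bit-strings:
    the number of bins in which the two spiketrains differ."""
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v))

# Distance between two hypothetical 3-bin spiketrains.
print(hamming("011", "001"))           # -> 1 (one differing bin)

# Population code: concatenating per-neuron bit-strings yields one
# longer bit-string, i.e. a vertex of a higher-dimensional cube.
neuron_a, neuron_b = "011", "100"
population = neuron_a + neuron_b       # "011100", a 6-cube vertex
print(hamming(population, "011000"))   # -> 1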
Representing a decoder can thus be cast as a problem of vertex-labeling performed on the unit N-cube with
two colors representing estimated Yes and No labels respectively. We will specify a condition on such
vertex-labelings that corresponds to the case where a linear decoder can supply the correct answers/labels
for every node/spiketrain. The coefficients of a linear decoder define an N-dimensional hyperplane in an
N-dimensional Hamming Space. Linear separability is said to hold when there exists such a hyperplane which
can separate without error the nodes bearing Yes labels from those bearing No labels. In the following
section we will briefly describe the geometric intuition underlying the formal conditions for linear
separability that will be proven in Chapter 2.
1.6 Linear separability of a vertex-labeling on an N-cube
Neural decoding in the context of Yes/No questions is the problem of assigning to N-bit spiketrains (nodes
of a unit N-cube) one of two labels, Yes or No. There are $2^{2^N}$ possible ground-truth labelings that can
result. Linear decoders, which generate output labels as shown in Figure 1, cannot in general exactly
reproduce all of these possible labelings, but reproduce only those that satisfy the graph-theoretic
condition stated in this section. Knowledge of this condition allows us to estimate the proportion $P_L$ of
all possible labelings which can be correctly separated by linear decoders.
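As a worked instance of these counts (the separable-labeling figure for $N = 3$ is quoted from the classical enumeration of threshold Boolean functions, as background rather than from this thesis):
$$N = 3:\qquad 2^N = 8 \text{ vertices}, \qquad 2^{2^N} = 2^8 = 256 \text{ possible labelings},$$
of which the classical enumeration yields $104$ linearly separable ones, i.e., only $104/256 \approx 0.41$ of all labelings admit an error-free linear decoder even at $N = 3$, and this fraction shrinks rapidly as $N$ grows.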
Viewed geometrically, a binary decision or a classification problem is the partitioning of a feature-space.
For our problem of assigning N-bit spiketrains to one of two classes (Yes or No), the feature space is the
unit N-cube. A linear decoder can reproduce the correct class labels only if the classes can be separated
without error by an N-dimensional hyperplane defined by the decoder coefficients. If members of each class
are closely spaced, but more distantly spaced from members of the opposite class, then there is a better
chance of accurate partitioning by a single hyperplane [24]. In a graph such as the unit N-cube, the closest
spacings are attainable when members of a given class (i.e., nodes sharing a Yes or No label) exhibit high
mutual connectivity. In a graph, a set of nodes with a given label is connected when a path exists between
any two nodes sharing the label. A situation that is analogous to a favorable situation in a Euclidean space
(where there is minimum within-class dispersion) occurs when members of both classes exhibit full
connectivity. We can show that a single hyperplane cannot separate the classes without error unless such a
condition holds. In Chapter 2.3 (Theorem 3), connectivity of nodes belonging to a class will be proven to be
a necessary condition for linear separability. However, this is not a sufficient condition for linear
separability, as illustrated by the counter-example in Fig. 5(b).
Fig. 5 Connectivity of node labels and linear separability. Three distinct labelings are shown for N = 3, with 4 Yes
nodes and 4 No nodes in each case. In (a), connectivity does not hold. For example, there is no path from the Yes node
at (1,1,0) to the other Yes nodes without passing through the No nodes. In both (b) and (c), connectivity holds.
However, linear separability holds only for (c) and the separating plane is shown with parallel hatching. The sufficient
condition for existence of such a linear decoder that can separate the classes without error is illustrated in Fig. 6.
A necessary and sufficient condition for the non-existence of a single separating hyperplane for a given
labeling (i.e., the non-existence of a linear decoder that yields Yes/No decisions without error) is the
occurrence of a characteristic motif in the arrangement of class labels (which we will call an
opposite-direction-motion). Detection of this motif in any pair of edges in an N-dimensional Hamming Space
is sufficient to conclude that the Yes/No labeling being examined does not permit linear decoding without
errors. Fig. 6 presents illustrative examples of the opposite-direction-motion criterion.
Fig. 6 The opposite-direction-motion criterion for linear separability. If it is possible to move from a Yes
node to a No node along the same axis in opposite directions (on two different edges), then a motif that we
call an ‘opposite-direction-motion’ exists in the labeling being examined. The labelings shown are the same
as those in Fig. 5. In (a) and (b), linear separability is abolished by the occurrence of
opposite-direction-motions, shown by the color-paired arrow-marks on the edges. Such a motif does not occur
in (c), where a separating hyperplane exists. See Chapter 2.4 (Theorem 4) for a proof of this result, and
Chapter 3 (Fig. 8) for an algorithmic implementation of this test.
This result was motivated by an earlier result on the existence of solutions for linear inequalities [29],
which can be applied to deciding the existence of an N-dimensional hyperplane that can correctly separate
the Yes and No nodes of an N-cube. A proof of this key result is presented in Chapter 2.4 (Theorem 4).
An additional observation here, which will become useful later on, is that ‘removal’ or ‘deletion’ of the No
label at node (1,1,1) in Fig. 6(b) will render the remainder of the labeling linearly separable. The
spiketrain 111 corresponding to this node has the highest spike-count (i.e., 3) possible for this
hypothetical neuron. For real neurons in the early visual system, the theoretical highest spike-count is
very rarely attained, if at all [17], thus amounting effectively to the ‘deletion’ of a node from the
corresponding unit N-cube.
2. Theoretical Results
This chapter presents a body of theoretical results, with rigorous proofs, that serves as the foundation of
the graph-theoretic approach to neural decoding that was intuitively sketched in Chapter 1.6, and that will
underpin the decoder estimation methods demonstrated in Chapter 4.2. As discussed in Chapter 1.4, labeled
graphs offer a powerful representation scheme for the mappings of neural encoding or decoding (as
conceptualized in Fig. 2). The results in this chapter are arranged in sections devoted to properties of
labeled graphs, defined below in Chapter 2.1, which relate in useful ways to properties of the neural
mappings they model.
2.1 Terminology and definitions
Graph: A graph is a collection of points (called ‘nodes’ or ‘vertices’) and lines (called ‘links’ or
‘edges’) connecting a subset (possibly empty) of these points. An edge can be defined as a pair of vertices
(which it connects).
Connected graph: A graph is said to be connected if there exists a path (i.e., a series of edges, each of
which shares a vertex with the preceding edge) between any pair of vertices in the graph.
Partition: A partition of a graph, for purposes of this text, is a collection of sets of nodes whose union
is the set of all nodes in the graph. Each such set is called a ‘part’.
N-cube: An $N$-cube is an undirected graph (i) whose nodes/vertices are all the points in an
$N$-dimensional Euclidean space having co-ordinates $(s_1, \ldots, s_j, \ldots, s_N)$, where
$N \in \mathbb{N}$, $1 \le j \le N$, $s_j \in \{0,1\}$, and (ii) in which an edge/link exists between each
pair of nodes $(u_1, \ldots, u_j, \ldots, u_N)$ and $(v_1, \ldots, v_j, \ldots, v_N)$ for which
$\exists j$ such that $u_j = 1 - v_j$ and $u_k = v_k\ \forall k \neq j$. This graph is a hypercube of
dimensionality $N$. Such an $N$-cube is also referred to as a ‘unit cube’ or ‘Hamming Cube of dimensionality
N’ in this text.
Vertex-labeling: In this text, we use the term ‘vertex-labeling’ for a many-to-one mapping whose domain is
the set of all vertices in a graph and whose range is the set $\{c_1, c_2, \ldots, c_k\}$. The $c$’s
represent ‘colors’ indicating ‘classes’ to which the vertices are assigned. Such a labeling creates a
$k$-way partition. Unless otherwise stated, we will use ‘vertex-labeling’ in this text to refer to the
special case where $k = 2$, i.e., a 2-way partition or bisection. We interpret $\{c_1, c_2\}$, denoted
according to convenience as $\{1, 0\}$ or $\{1, -1\}$, as representing the Yes (Y) and No (N) answers to a
question.
Linear separability: A vertex-labeling on an $N$-cube is linearly separable if and only if a set of
coefficients $\{a_j\}$, $1 \le j \le N$, and a threshold $\theta$ exist such that
$\sum_{j=1}^{N} a_j X_{1,i,j} > \theta\ \forall i$ and $\sum_{j=1}^{N} a_j X_{-1,i,j} < \theta\ \forall i$,
where $X_{1,i,j}$ and $X_{-1,i,j}$ are the $j$th co-ordinates of the $i$th vertex belonging respectively to
the Yes and No parts of the 2-way partition made by the vertex-labeling.
We interpret $\theta$ as a threshold. The existence of the coefficients $\{a_j\}, \theta$ indicates the
existence of an $N$-d linear surface or hyperplane in the Euclidean space where the $N$-cube resides, that
can correctly separate the vertices bearing Yes and No color-labels.
Polynomial separability: A vertex-labeling on an $N$-cube shows polynomial separability if and only if there
exists a polynomial
$$Y = \sum_{j_1=1}^{N} a_{j_1} X_{j_1} + \sum_{j_1=1}^{N-1} \sum_{j_2 > j_1}^{N} a_{j_1 j_2} X_{j_1} X_{j_2} + \cdots + \sum_{j_1=1}^{N-k+1} \cdots \sum_{j_k > j_{k-1}}^{N} a_{j_1 j_2 \ldots j_k} X_{j_1} X_{j_2} \cdots X_{j_k} + \cdots + a_{j_1 j_2 \ldots j_N} X_{j_1} X_{j_2} \cdots X_{j_N},$$
such that $Y(X_j) = C_j$, where $C_j \in \{0,1\}$ is the label/‘color’ of the $j$th vertex of the $N$-cube,
denoted by $X_j$, whose $i$th co-ordinate is $X_{ji}$.
Temporal independence: Consider an $N$-cube whose vertices’ co-ordinates are given by the binary time-series
$\mathbf{S} = \{s_1, s_2, \ldots, s_i, \ldots, s_N\}$, where $s_i \in \{0,1\}$. Consider a vertex-labeling on
this $N$-cube assigning labels Yes ($Y$) or No ($N$) to the vertices. Temporal independence holds if
$$P(Y \mid \mathbf{S}) \propto \frac{\prod_{i}^{N} P(Y \mid s_i)}{P(Y)} \quad \text{and} \quad P(N \mid \mathbf{S}) \propto \frac{\prod_{i}^{N} P(N \mid s_i)}{P(N)}.$$
We note here that we will use the term ‘temporal independence’ for the above mathematical condition even
when the index $i$ does not strictly have the physical interpretation of the time-bin of a single neuron’s
response (e.g., when the series $\mathbf{S}$ contains a concatenation of spiketrains from different
neurons).
2.2 List of Theorems
A consolidated list of graph-theoretical results proven and invoked in this thesis is presented below. The
proofs and demonstrations appear in the following sections.
1. Theorem 1: An N-cube is a connected graph.
2. Theorem 2: For a partition induced on an N-cube by a linear surface, the vertices belonging to a
part form a connected graph.
3. Theorem 3: A vertex-labeling on an N-cube is linearly separable only if vertices sharing a label
form a connected graph. (See Fig. 5 in Chapter 1.6)
4. Theorem 4: A vertex-labeling on an N-cube allows linear separability if and only if no opposite-
direction motion occurs in the labeling. (See Fig. 6 in Chapter 1.6)
5. Theorem 5: Any vertex-labeling on an N-cube allows polynomial separability using a polynomial
of degree N.
6. Theorem 6: A vertex-labeling of an N-cube obeying temporal independence is also linearly
separable.
7. Theorem 7: A naïve Bayes decoder of binary sequences can be represented as a linear decoder (so
long as no decoder outcome is absolutely certain for any bit occurrence).
2.3 Connectivity and linear separability
An optimal linear decoder (i.e., one that makes no errors) of neural spiketrains for a Yes/No question is
possible, by definition, only if the corresponding vertex-labeling of the N-cube is linearly separable. The
theorems in this section establish the connectivity of vertices sharing a label as a necessary condition
for the existence of an optimal linear decoder.
***
Theorem 1: An N-cube is a connected graph.
Proof:
Consider any two nodes in the N-cube, $\mathbf{u}$ and $\mathbf{v}$, with co-ordinates
$(u_1, \ldots, u_j, \ldots, u_N)$ and $(v_1, \ldots, v_j, \ldots, v_N)$, where $u_j, v_j \in \{0,1\}\ \forall j$.
The co-ordinate sequences of $\mathbf{u}$ and $\mathbf{v}$ differ in at most $N$ co-ordinates.
Consider the operation of flipping the $j$th co-ordinate of the node $\mathbf{u}$, i.e.,
$u_j \leftarrow (1 - u_j)$. This operation yields the co-ordinates of a node that is connected by an edge to
$\mathbf{u}$ (from the definition of the N-cube).
At most $N$ such flipping operations are required to transform the co-ordinates of $\mathbf{u}$ into those
of $\mathbf{v}$. An edge exists between the node at each step and its predecessor in this series of flips,
thus proving the existence of a path between any two nodes in the N-cube. QED.
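The constructive step of this proof translates directly into a short routine; a minimal sketch (assumed Python) that builds the flip-path between two vertices:

def flip_path(u, v):
    """Construct the path used in the proof of Theorem 1: flip, one at
    a time, each co-ordinate where u and v differ. Every consecutive
    pair in the result is an edge of the N-cube."""
    path = [tuple(u)]
    cur = list(u)
    for j in range(len(u)):
        if cur[j] != v[j]:
            cur[j] = 1 - cur[j]        # flip the j-th co-ordinate
            path.append(tuple(cur))
    return path

# Path from (0,0,0) to (1,0,1): at most N = 3 flips.
print(flip_path((0, 0, 0), (1, 0, 1)))
# -> [(0, 0, 0), (1, 0, 0), (1, 0, 1)]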
***
***
Theorem 2: For a partition induced on an N-cube by a linear surface, the vertices belonging to a part
form a connected graph.
Proof:
We will proceed by induction, treating $N = 2$ as our base case. All partitions of the 2-cube induced by a
linear surface (in this case, a line) can be treated as equivalent to one of the following cases:
[Diagram: three equivalence cases (a), (b), (c) of a line partitioning the 2-cube with vertices (0,0),
(1,0), (0,1), (1,1); vertices belonging to different parts are shown in different shades of blue.]
For the 2-cube partitioned by a linear surface, the vertices belonging to a part always form a connected
graph. This proves the base case.
As the induction hypothesis, we will assume that the result holds for an $(N-1)$-cube.
We will now show that the result holds for an $N$-cube, $N > 2$. Any $N$-cube can be treated as a ‘stacking’
of two $(N-1)$-cubes, with an additional co-ordinate added to the vertices of each, so that one of the
$(N-1)$-cubes has this co-ordinate as ‘1’ and the other as ‘0’. These $(N-1)$-cubes are connected to each
other. Any linear surface (i.e., an $N$-dimensional hyperplane) partitioning the $N$-cube will belong to one
of the cases below:
Case 1: The hyperplane does not intersect the $(N-1)$-cubes.
In this case, the set of nodes in a part in the partition created by the hyperplane is an $(N-1)$-cube,
which, according to Theorem 1, is connected.
Case 2: The hyperplane intersects at least one of the $(N-1)$-cubes.
The intersection of an $N$-d hyperplane with an $(N-1)$-cube is an $(N-1)$-d hyperplane. By the induction
hypothesis, the parts of an $(N-1)$-cube partitioned by an $(N-1)$-d hyperplane are connected.
The connected parts in both of the constituent $(N-1)$-cubes which lie in the same part of the partition
induced by the $N$-d hyperplane are also connected to each other (since the nodes in the union of both those
sets were connected to begin with, and the $N$-d hyperplane does not abolish this connectivity). QED.
***
Theorem 3: A vertex-labeling on an N-cube is linearly separable only if vertices sharing a label form a
connected graph.
Proof:
Consider a vertex-labeling on an N-cube where the two possible label-assignments for a vertex are denoted
by Y and N, for Yes and No.
Without loss of generality, assume that the set of vertices labeled Y does not form a connected graph.
For linear separability to hold, all vertices bearing label Y must lie in the same part of a partition
created by an N-d hyperplane.
According to Theorem 2, a partition by an N-d hyperplane can only result in parts whose nodes form a
connected graph.
Therefore an N-d hyperplane which can create a partition yielding parts that consist exclusively of Y and N
vertices respectively does not exist when the vertices sharing a label are not connected.
QED
***
2.4 Linear and polynomial separability
A linear decoder obtains a Yes/No decision by thresholding a weighted sum of the input spikes. A
‘quadratic’ decoder instead thresholds a weighted sum both of the individual spikes and of products of
spikes taken pairwise. Linear and quadratic decoders represent the simplest instances of a class of
spiketrain decoders that arrive at a Yes/No decision by thresholding a function that is polynomial in the
spike-train bits. Polynomial decoders can thus be considered a natural generalization of linear decoders.
The existence conditions for optimal linear and polynomial decoders are presented in this section. While a
necessary condition for the existence of an optimal linear decoder was presented in Chapter 2.3, the
condition presented here (Theorem 4) is both necessary and sufficient. It will also be shown (via Theorem
5) that polynomial decoders enjoy a ‘universal approximation property’ in the sense that they can correctly
yield any vertex-labeling. Linear decoders do not have such a property, and the question of how often they
can correctly yield a vertex-labeling will be addressed via simulations in Chapter 3, using Theorem 4 below.
***
Theorem 4: A vertex-labeling on an N-cube allows linear separability if and only if no opposite-direction
motion occurs in the labeling.
Before proceeding to prove the theorem, we introduce the following notation and the definition of an
opposite-direction-motion.
Consider a unit $N$-cube labeled/‘colored’ with Yes and No answers. It has $2^N$ vertices.
Let $M$ be the number of vertices with Yes answers, denoted by “1”. The number of vertices with No answers
(denoted by “$-1$”) is $2^N - M$.
Let the set of vertices with Yes answers be $\{Y_{1,i}\}$, $1 \le i \le M$,
$Y_{1,i} = (X_{1,i,1}, \ldots, X_{1,i,N})$, $X_{1,i,j} \in \{-1, 1\}$.
Let the set of vertices with No answers be similarly denoted by $\{Y_{-1,i}\}$.
A vertex-labeling on an $N$-cube has opposite-direction motions if and only if two pairs of vertices
$\{Y_{1,k}, Y_{1,l}\}$ and $\{Y_{-1,m}, Y_{-1,n}\}$ exist such that
$$Y_{1,k} - Y_{-1,m} = (0, \ldots, 0,\ X_{1,k,\rho} - X_{-1,m,\rho} = 2,\ 0, \ldots, 0)$$
$$Y_{1,l} - Y_{-1,n} = (0, \ldots, 0,\ X_{1,l,\rho} - X_{-1,n,\rho} = -2,\ 0, \ldots, 0)$$
Proof:
Part 1: (Proof of sufficiency)
Assume that the labeling on the unit $N$-cube has an opposite-direction motion. Without loss of generality,
let this opposite-direction motion be given as in Definition 2. Hence, for all sets of coefficients as in
Definition 1,
$$\sum_{j=1}^{N} a_j X_{1,k,j} - \sum_{j=1}^{N} a_j X_{-1,m,j} = a_\rho (X_{1,k,\rho} - X_{-1,m,\rho}) = 2 a_\rho > 0$$
$$\sum_{j=1}^{N} a_j X_{1,l,j} - \sum_{j=1}^{N} a_j X_{-1,n,j} = a_\rho (X_{1,l,\rho} - X_{-1,n,\rho}) = -2 a_\rho > 0$$
…which is impossible. Therefore, no set of coefficients meets the criteria in Definition 1, making an LIOD
impossible.
Part 2: (Proof of necessity)
Assume that linear separability is impossible. The given labeling on a unit $N$-cube would be linearly
separable if and only if we can find coefficients such that for every $l$ and $n$,
$$\sum_{j=1}^{N} a_j X_{1,l,j} > \sum_{j=1}^{N} a_j X_{-1,n,j}$$
(as in this way we can always set $\theta$ in between the terms of the inequality), or
$$\sum_{j=1}^{N} a_j (X_{1,l,j} - X_{-1,n,j}) > 0 \quad \ldots (1)$$
This defines a system of $M(2^N - M) + 1$ linear inequalities. According to Carver’s Theorem 3, this system
will be inconsistent if and only if a set of non-negative constants $k_{1,1}, \ldots, k_{M(2^N - M)+1}$
exists such that at least one $k > 0$ and
$$k_{M(2^N - M)+1} + \sum_{l=1}^{M} \sum_{n=1}^{2^N - M} \sum_{j=1}^{N} k_{l,n}\, a_j (X_{1,l,j} - X_{-1,n,j}) = 0 \quad \forall a_j \quad \ldots (2)$$
Linear separability is impossible if and only if no coefficients can be found satisfying (1). Carver’s
Theorem 3 specifies the conditions under which such coefficients cannot be found [29].
In our case, $k_{M(2^N - M)+1} = 0$, because otherwise Equation (2) would not be obeyed when $a_j = 0$.
Thus, we can rewrite Equation (2) as:
$$\sum_{j=1}^{N} a_j \sum_{l=1}^{M} \sum_{n=1}^{2^N - M} (X_{1,l,j} - X_{-1,n,j})\, k_{l,n} = 0 \quad \ldots (3)$$
Because the $a_j$ are free to vary, this condition is equivalent to finding $k_{l,n}$ such that
$$\sum_{l=1}^{M} \sum_{n=1}^{2^N - M} (X_{1,l,j} - X_{-1,n,j})\, k_{l,n} = 0 \quad \forall j \quad \ldots (4)$$
We need to prove that the labeling of the unit $N$-cube exhibits opposite-direction motions. Suppose on the
contrary that no such motions exist.
We proceed by induction, treating $N = 2$ as our base case.
[Diagram: the three labelings of the 2-cube (vertices $(\pm 1, \pm 1)$) for which no opposite-direction
motion exists, shown as Case 1, Case 2 and Case 3.]
For $N = 2$ it is sufficient to consider only the cases illustrated above, i.e., the three cases for which
no opposite-direction motion exists. Other variants (e.g., 4 No vertices) can be shown to be equivalent.
For Case 1, linear separability is possible by definition, since $\theta$ can be set to any value.
For Case 2,
$$X_{1,1,1} = -1;\ X_{1,1,2} = -1;\ X_{1,2,1} = 1;\ X_{1,2,2} = -1;\ X_{1,3,1} = -1;\ X_{1,3,2} = 1;\ X_{-1,1,1} = 1;\ X_{-1,1,2} = 1;$$
and Equation (4) becomes
$$2(-k_{1,1} + 0 k_{2,1} - k_{3,1}) = 0 \quad \ldots (5a)$$
$$2(-k_{1,1} - k_{2,1} + 0 k_{3,1}) = 0 \quad \ldots (5b)$$
(5a) and (5b) cannot be simultaneously true if all the $k$s are non-negative and at least one is positive.
Thus, linear separability is again possible.
For Case 3, Equation (4) becomes
$$2(0 k_{1,1} - k_{1,2} - k_{2,1} + 0 k_{2,2}) = 0 \quad \ldots (6a)$$
$$2(-k_{1,1} - k_{1,2} - k_{2,1} - k_{2,2}) = 0 \quad \ldots (6b)$$
…which is again impossible. This proves the base case for $N = 2$.
As our inductive hypothesis, we assume that the system of equations in (4) has no solutions for an
$(N-1)$-cube. We now proceed to show that this is the case also for the unit $N$-cube.
The unit $N$-cube can be treated as equivalent to a ‘stacking’ of two unit $(N-1)$-cubes, with an additional
co-ordinate added to the vertices of each, so that one of the $(N-1)$-cubes has this co-ordinate as ‘1’ and
the other as ‘$-1$’. We assume that this unit $N$-cube has no opposite-direction motions, which means that
neither of the component $(N-1)$-cubes has an opposite-direction-motion.
The inductive hypothesis is that Equation (4) has no solutions for these $(N-1)$-cubes. Using super-indices
1 and 2 to denote these $(N-1)$-cubes, we can rewrite Equation (4) as follows:
$$\sum_{l=1}^{M_1} \sum_{n=1}^{2^{N-1} - M_1} (X^1_{1,l,j} - X^1_{-1,n,j})\, k_{l,n} = 0, \quad 1 \le j \le N-1 \quad \ldots (7)$$
$$\sum_{l=M_1+1}^{M_1+M_2} \sum_{n=2^{N-1}-M_1+1}^{2^{N-1}-M_1+2^{N-1}-M_2} (X^2_{1,l,j} - X^2_{-1,n,j})\, k_{l,n} = 0, \quad 1 \le j \le N-1 \quad \ldots (8)$$
For $j = N$, we have:
$$-2 \sum_{l=1}^{M_1} \sum_{n=2^{N-1}-M_1+1}^{2^{N-1}-M_1+2^{N-1}-M_2} k_{l,n} = 0 \quad \ldots (9)$$
$$2 \sum_{l=M_1+1}^{M_1+M_2} \sum_{n=1}^{2^{N-1}-M_1} k_{l,n} = 0 \quad \ldots (10)$$
The $k$s that would solve equations (7) to (10) are all different. A solution to this system exists if $k$s
can be found that solve at least one of these equations, separately from the others (in which case the $k$s
for the remaining equations can be set to 0). However, according to the inductive hypothesis, Equations (7)
and (8) have no solution. Equations (9) and (10) also have no solution under the condition that the $k$s be
non-negative with at least one $k$ positive.
QED
Corollary to Theorem 4: Linear separability of a vertex-labeling on an N-cube is invariant under
permutation of axes (co-ordinates of vertices).
***
***
Theorem 5: Any vertex-labeling (2-way partition) on an N-cube allows polynomial separability using a
polynomial of degree N.
We introduce the following notation. Let the $2^N$ vertices of an N-cube be denoted by
$A_1, A_2, \ldots, A_{2^N}$. Vertex $j$ is the tuple $(A_{j1}, A_{j2}, \ldots, A_{jN})$, where
$A_{jk} \in \{0,1\}$. Without loss of generality, $A_1 = (0, \ldots, 0)$. Let the label of vertex $j$ be
$C_j \in \{0,1\}$. Without loss of generality, $C_1 = 0$.
Proof:
Consider the polynomial
$$Y = \sum_{j_1=1}^{N} a_{j_1} X_{j_1} + \sum_{j_1=1}^{N-1} \sum_{j_2 > j_1}^{N} a_{j_1 j_2} X_{j_1} X_{j_2} + \cdots + \sum_{j_1=1}^{N-k+1} \cdots \sum_{j_k > j_{k-1}}^{N} a_{j_1 j_2 \ldots j_k} X_{j_1} X_{j_2} \cdots X_{j_k} + \cdots + a_{j_1 j_2 \ldots j_N} X_{j_1} X_{j_2} \cdots X_{j_N}$$
$Y(A_1) = 0 = C_1$. Hence, if we can choose $a_{j_1}, a_{j_1 j_2}, \ldots, a_{j_1 j_2 \ldots j_N}$ such that
$Y(A_j) = C_j$, $j > 1$, then the theorem would follow by choosing $\theta = \frac{1}{2}$. This polynomial
has
$$\frac{N!}{1!(N-1)!} + \frac{N!}{2!(N-2)!} + \cdots + \frac{N!}{0!\,N!} = 2^N - 1$$
coefficients, and the number of label-assignments to match is $2^N - 1$ too. We will show that substituting
the co-ordinates $A_{jk}$ for the variables $X_{jk}$ gives rise to an independent linear system of equations
for $a_{j_1}, a_{j_1 j_2}, \ldots, a_{j_1 j_2 \ldots j_N}$.
We will prove this by induction. To prove our base case, we begin with the $N$ vertices that satisfy
$\sum_{l=1}^{N} A_{jl} = 1$. For these vertices, only one of the $X_{jl} \neq 0$. Hence, for these vertices,
$a_{jl} = C_j$.
As the induction hypothesis, we assume next that we have set the values of all coefficients
$a_{j_1}, a_{j_1 j_2}, \ldots, a_{j_1 j_2 \ldots j_k}$ by using vertices such that
$\sum_{l=1}^{N} A_{jl} = k$.
For the induction step, we consider vertices for which $\sum_{l=1}^{N} A_{jl} = k + 1$. When we substitute
any of these vertices in the polynomial, only one of the $(k+1)$-degree terms (i.e., terms with $(k+1)$
$X$’s) is nonzero (this happens when the $(k+1)$ co-ordinates that are chosen happen to be the ones with
value 1), besides terms of degree less than $(k+1)$. Consequently, we get
$a_{j_1 j_2 \ldots j_{k+1}} = B_{j_1 j_2 \ldots j_{k+1}}$, where $B$ results from the values obtained from
the lower-degree terms (terms with fewer $X$’s). Thus we can set the values for all
$a_{j_1 j_2 \ldots j_{k+1}}$. QED.
2.5 Temporal independence and linear separability
Temporal independence of spiking, i.e., mutual independence of bins in the spiketrain, is an assumption that
is commonplace in neuroscience, especially in models of Poisson spiking. The simplest form of Bayesian
decoding of spiketrains, called naïve Bayes decoding, incorporates the simplifying assumption of temporal
independence of spiking. In this section, a link between this important spiking property of temporal
independence and linear decoding will be established. It will be shown that naïve Bayes decoders (optimal
or not) are linear in the spiketrains, and that when the naïve Bayes assumption of temporal independence is
indeed valid, an optimal linear decoder is guaranteed to exist. Temporal independence is thus an important
special case where a linear decoder is Bayes-optimal. The question of how closely a linear decoder can
approach Bayes-optimal performance for an arbitrary labeling (under no assumption of temporal
independence) will be addressed in Chapter 4.2.2, through techniques developed for the purpose.
***
Theorem 6: A vertex-labeling of an N-cube obeying temporal independence is also linearly separable.
Proof:
Consider an $N$-cube whose vertices’ co-ordinates are binary time-series
$\mathbf{S} = \{s_1, s_2, \ldots, s_i, \ldots, s_N\}$, where $s_i \in \{0,1\}$.
Consider all pairs of vertices $\mathbf{U}$ and $\mathbf{V}$ such that $u_k = 0$, $v_k = 1$ and
$u_j = v_j\ \forall j \neq k$, for any $1 \le k \le N$. The set of edges connecting each such pair is the
set of parallel edges along the $k$th axis.
Consider a vertex-labeling on this $N$-cube assigning labels Yes ($Y$) or No ($N$) to the vertices.
Temporal independence holds if
$$P(Y \mid \mathbf{S}) \propto \frac{\prod_{i}^{N} P(Y \mid s_i)}{P(Y)}.$$
Assume that temporal independence holds. Then we can write
$$\frac{P(Y \mid \mathbf{V})}{P(Y \mid \mathbf{U})} = \frac{P(Y \mid s_k = 1)}{P(Y \mid s_k = 0)}$$
(assume that $P(Y \mid s_k = 0), P(Y \mid s_k = 1) \neq 0$).
The values of $P(Y \mid s_k = 1)$ and $P(Y \mid s_k = 0)$ are fixed. Without loss of generality, let us
assume $P(Y \mid s_k = 0) > P(Y \mid s_k = 1)$. In this case, the probability of a Y label always decreases
when we move from any node $\mathbf{U}$ to its corresponding node $\mathbf{V}$.
The occurrence of an opposite-direction motion (i.e., the existence of a pair of edges, in one of which
$\mathbf{U}$ and $\mathbf{V}$ have labels Y and N respectively, and in the other of which $\mathbf{U}$ and
$\mathbf{V}$ have labels N and Y respectively) would require the probability of a Y label to increase in
some instances when we move from a node $\mathbf{U}$ to its corresponding node $\mathbf{V}$. This is not
possible when temporal independence holds. From Theorem 4, we can conclude that linear separability holds.
QED.
***
Theorem 7: A naïve Bayes decoder of binary sequences can be represented as a linear decoder (so long
as no decoder outcome is absolutely certain for any bit occurrence).
Proof:
Consider a spiketrain 𝑺 = {𝑠 1
, 𝑠 2
, … , 𝑠 𝑖 , … , 𝑠 𝐾 } 𝑤 ℎ𝑒𝑟𝑒 𝑠 𝑖 ∈ {0,1}
The decoder’s possible outputs are Yes and No which we denote by Y and N respectively. Without loss of
generality, let us consider the condition for the decoder’s output to be Y. This occurs when the conditional
probabilities below obey 𝑃 ( 𝑌 |𝑺 ) > 𝑃 ( 𝑁 |𝑺 )
Bayes Rule gives 𝑃 ( 𝑺 |𝑌 ) 𝑃 ( 𝑌 ) > 𝑃 ( 𝑺 |𝑁 ) 𝑃 ( 𝑁 )
Applying the Naïve Bayes assumption i.e. that the s i’s are independent:
𝑃 ( 𝑌 )∏ 𝑃 ( 𝑠 𝑖 |𝑌 )
𝑖 =𝐾 𝑖 =1
> 𝑃 ( 𝑁 )∏ 𝑃 ( 𝑠 𝑖 |𝑁 )
𝑖 =𝐾 𝑖 =1
∴
𝑃 ( 𝑌 )
𝑃 ( 𝑁 )
∏
𝑃 (𝑠 𝑖 |𝑌 )
𝑃 (𝑠 𝑖 |𝑁 )
𝑖 =𝐾 𝑖 =1
> 1 … ( 1)
For convenience, we introduce the following notation:
𝑦 𝑖 ( 𝑠 𝑖 ) = 𝑃 ( 𝑠 𝑖 |𝑌 )
𝑛 𝑖 ( 𝑠 𝑖 ) = 𝑃 ( 𝑠 𝑖 |𝑁 )
Since the above variables are probabilities,
0 < 𝑦 𝑖 ( 𝑠 𝑖 ) < 1
0 < 𝑛 𝑖 ( 𝑠 𝑖 ) < 1
Since s i is a Bernoulli variable,
24
𝑦 𝑖 ( 𝑠 𝑖 )+ 𝑦 𝑖 ( 1 − 𝑠 𝑖 ) = 1
𝑛 𝑖 ( 𝑠 𝑖 )+ 𝑛 𝑖 ( 1 − 𝑠 𝑖 ) = 1
We make an additional reasonable assumption that neither Y nor N is absolutely certain for any bit
occurrence. Under this assumption,
𝑦 𝑖 ( 𝑠 𝑖 ) ≠ 0
𝑛 𝑖 ( 𝑠 𝑖 ) ≠ 0
Under the above assumption, the following ratios exist:
ℎ
𝑖 1
=
𝑦 𝑖 ( 𝑠 𝑖 =1)
𝑛 𝑖 ( 𝑠 𝑖 =1)
ℎ
𝑖 0
=
𝑦 𝑖 ( 𝑠 𝑖 =0)
𝑛 𝑖 ( 𝑠 𝑖 =0)
Hence we can write
𝑦 𝑖 ( 𝑠 𝑖 )
𝑛 𝑖 ( 𝑠 𝑖 )
= ℎ
𝑖 0
(
ℎ
𝑖 1
ℎ
𝑖 0
)
𝑠 𝑖 … ( 2)
With this notation, (1) becomes
$$\frac{P(Y)}{P(N)}\prod_{i=1}^{K} \frac{y_i(s_i)}{n_i(s_i)} > 1$$
Taking logarithms gives
$$\log\left(\frac{P(Y)}{P(N)}\right) + \sum_{i=1}^{K} \log\left(\frac{y_i(s_i)}{n_i(s_i)}\right) > 0$$
From (2), we get
$$\log\left(\frac{P(Y)}{P(N)}\right) + \sum_{i=1}^{K} \log\left(h_{i0}\left(\frac{h_{i1}}{h_{i0}}\right)^{s_i}\right) > 0$$
$$\therefore \log\left(\frac{P(Y)}{P(N)}\right) + \sum_{i=1}^{K} \log(h_{i0}) + \sum_{i=1}^{K} s_i \log\left(\frac{h_{i1}}{h_{i0}}\right) > 0$$
Simplifying,
$$\sum_{i=1}^{K} \log\left(\frac{y_i(1)}{n_i(1)} \cdot \frac{n_i(0)}{y_i(0)}\right) s_i > -\sum_{i=1}^{K} \log\left(\frac{y_i(0)}{n_i(0)}\right) - \log\left(\frac{P(Y)}{P(N)}\right)$$
The above is a linear decoder of the form described in Figure 1 or Theorem 4. The linear filter coefficients and threshold are expressed in terms of the class-conditional probabilities and the prior probabilities.
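To make the correspondence concrete, the minimal MATLAB sketch below converts naive Bayes class-conditional probabilities into the weights and threshold of the equivalent linear decoder, and cross-checks the two decision rules on an arbitrary spiketrain. All probability values here are hypothetical, chosen only for illustration.

% Sketch: rewriting a naive Bayes decoder of a K-bit spiketrain as a linear
% decoder (Theorem 7). All probability values below are hypothetical.
K  = 9;
pY = 0.4;  pN = 1 - pY;              % prior probabilities P(Y) and P(N)
y1 = 0.7*ones(1,K);                  % y_i(1) = P(s_i = 1 | Y), assumed nonzero
n1 = 0.2*ones(1,K);                  % n_i(1) = P(s_i = 1 | N), assumed nonzero
y0 = 1 - y1;  n0 = 1 - n1;           % Bernoulli complements y_i(0), n_i(0)
w     = log((y1.*n0)./(n1.*y0));     % linear filter coefficients
theta = -sum(log(y0./n0)) - log(pY/pN);   % decision threshold
s = double(rand(1,K) > 0.5);         % an arbitrary test spiketrain
labelLinear = w*s.' > theta;         % linear decoder: weighted sum vs threshold
% Cross-check against the direct comparison P(Y|S) > P(N|S)
logPostY = log(pY) + sum(s.*log(y1) + (1-s).*log(y0));
logPostN = log(pN) + sum(s.*log(n1) + (1-s).*log(n0));
assert(labelLinear == (logPostY > logPostN))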
***
2.6 Discussion
The results in this chapter supply the existence conditions for linear and polynomial decoders of binary
spiketrains that yield all-correct answers for Yes/No questions. These existence conditions are stated in
terms of the geometries of the corresponding N-cube vertex-labelings. For a linear decoder yielding all-
correct answers for a given Yes/No question to exist, the geometric condition on the corresponding N-cube
vertex-labeling is that it must not exhibit opposite-direction-motions. Using this result, an important special
case of spiketrain encodings, namely, those that exhibit temporal independence, can be shown to always
admit the existence of an optimal linear decoder. For conditions where such a linear decoder is not possible,
a polynomial decoder yielding the correct labels is always possible, since it has been proven to exist for an arbitrary vertex-labeling on an N-cube. A polynomial decoder for spiketrains of length N is at most of degree N, which implies that such polynomial-based decoding remains tractable in higher dimensions and, importantly, is not stymied by the 'curse of dimensionality'.
3. Computational estimates of the probability of optimal linear decoding
3.1 Software implementation of the test of linear separability of an N-cube vertex-labeling
An algorithm implementing the criterion for linear separability of a decoder labeling (stated in Chapter 1.6 and proven in Chapter 2.4) is shown in pseudocode form in Fig. 8. This algorithm takes as input a vector of binary labels for each of the $2^N$ nodes in a unit N-cube, some of which may be missing, and returns an indicator of whether this labeling satisfies the condition in Fig. 6. A simple version of the algorithm, with an emphasis on the test for linear separability, is shown here; additional functionality, such as building a record of all nodes whose labels participate in an opposite-direction-motion, can be added. This algorithm is implemented as a function in MATLAB® and is called both for the Monte Carlo simulations described below and for applying the check of linear decoding to decoders estimated from RGC data.
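A minimal MATLAB sketch of such a function is given below. It follows the logic of Fig. 8, but it is an illustrative re-implementation, not the thesis code itself; the variable names and the convention that bit k of the decimal node value corresponds to the kth spiketrain bin are assumptions.

function L = isLinearlySeparable(b)
% Sketch: test of linear separability of a vertex-labeling (cf. Fig. 8).
% b : vector of 2^N labels (1 = Yes, 0 = No, NaN = unavailable), where b(j)
%     labels the spiketrain whose bit-sequence has decimal equivalent j-1.
b = b(:);
N = round(log2(numel(b)));
L = 1;
vals = (0:2^N - 1)';
for k = 1:N                               % loop over the N axes of the cube
    v0 = vals(bitget(vals, k) == 0);      % nodes with s_k = 0
    v1 = v0 + 2^(k-1);                    % their neighbors across axis k
    top = b(v1 + 1);  bot = b(v0 + 1);    % labels at the two edge endpoints
    keep = ~isnan(top) & ~isnan(bot) & (top ~= bot);   % informative edges only
    top = top(keep);
    if ~isempty(top) && any(top ~= top(1))  % edges disagree in direction:
        L = 0;  return                      % opposite-direction motion found
    end
end
end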
3.2 Monte Carlo sampling of vertex-labelings on N-cubes
Monte Carlo methods are used to generate a large set of simulated decoder outputs, i.e. unique Yes-No labelings in a Hamming Space, for different values of the parameters listed below:
(i) Spiketrain length (N): A longer spiketrain length (N) means an exponentially larger number of possible spiketrains. If a fixed number m of Yes nodes has to be randomly assigned positions in a Hamming Space, then we can expect that fewer connected configurations (see Fig. 5 for examples) will result in the higher-dimensional spaces, thus reducing the probability of finding linear decoders.
(ii) Class population size (m): The chance of 'overlap' between classes (and thus, violations of linear separability) is greatest when there is an equal population size of nodes of both Yes and No classes, and less when either class predominates. We adopt, without loss of generality, a convention of treating the number of Yes nodes, which we will call m, as the variable for studying the effect of class population size on $P_L$. The examples in Fig. 5 and Fig. 6 are for N = 3 and m = 4.
The labelings for a given N and m are sampled under the following conditions of interest: (i) without the constraint of connectivity (the m Yes labels are allowed to occur at any previously unlabeled node in the N-cube); (ii) with the constraint of connectivity (the m Yes labels are chosen to satisfy the constraint of connectivity shown in, say, Fig. 5b or Fig. 5c). Fig. 7 shows schematically how sampling is performed under both these conditions for examples with N = 3 and m = 3.
In addition, for some values of N and m, $P_L(N,m)$ was evaluated both for labelings of the full N-cube, as well as for reduced labelings where spiketrains with a spike-count higher than a certain value, and the corresponding nodes, were eliminated. This comparison is motivated by the observation made at the end of Chapter 1.
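A minimal MATLAB sketch of the two sampling schemes of Fig. 7 is given below; the function name and its interface are illustrative assumptions. It returns the decimal node values of the m sampled Yes nodes, which can then be passed through the linear separability test above to estimate $P_L$ as the fraction of sampled labelings that pass.

function yesNodes = sampleLabeling(N, m, connected)
% Sketch: sample m Yes nodes on a unit N-cube, with or without connectivity.
yesNodes = randi(2^N) - 1;                 % first Yes node, chosen uniformly
while numel(yesNodes) < m
    if connected                           % pick a labeled node, flip one bit
        seed = yesNodes(randi(numel(yesNodes)));
        cand = bitxor(seed, 2^(randi(N) - 1));   % an immediate neighbor
    else
        cand = randi(2^N) - 1;             % any node, uniformly at random
    end
    if ~ismember(cand, yesNodes)           % nodes are not visited twice
        yesNodes(end+1) = cand;            %#ok<AGROW>
    end
end
end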
Fig. 7 Monte Carlo sampling of labelings without and with the constraint of connectivity. Suppose that labelings with N = 3 and m = 3 need to be sampled. Here the Yes nodes are numbered as Y1, Y2 and Y3 in the order they were labeled. Suppose (0,0,0) was sampled as the first of the three Yes nodes required. The nodes shown encircled in dark double-dashes indicate the candidates for the second Yes node once the first is fixed, and those encircled in light triple-dashes indicate the candidates for the third Yes node once the first two are fixed. The cube in (a) shows one such labeling generated without a constraint of connectivity, with the only constraint being that nodes are not visited more than once. In practice, a candidate node is found by simply generating a random binary string of length N. The cube in (b) additionally incorporates the constraint of connectivity, so that each new node added to the labeling must share an edge with at least one of the nodes already in the labeling. In practice, a candidate node for such a labeling is found by randomly picking a previously selected node and flipping a bit in its bit-sequence, thus yielding an immediate neighbor (i.e. a node at a Hamming distance of 1 from the selected node).
ALGORITHM TO CHECK LINEAR SEPARABILITY OF A VERTEX-LABELING
Input: A labeling {b_j} on a unit N-cube, 1 <= j <= 2^N, b_j ∈ {0, 1, NaN}.
1 and 0 denote Yes and No answers respectively. NaN labels those nodes for which a Yes/No answer is unavailable. b_j is the label corresponding to the spiketrain whose N-bit sequence has decimal equivalent j - 1.
Output: L, which is 1 if {b_j} admits a LIOD and 0 otherwise.
Temporary variables: Let the j-th spiketrain be denoted by s_j = {s_{j,k}}, where 1 <= k <= N. We treat · as the string concatenation operator, so we can write s_j = {s_{j,1:k-1}} · s_{j,k} · {s_{j,k+1:N}}. Let l_{j,k} denote {s_{j,1:k-1}} and r_{j,k} denote {s_{j,k+1:N}}; therefore we can write s_j = l_{j,k} · s_{j,k} · r_{j,k}.
Let A_k denote a 2-d array with binary entries containing all columns of the form (l_{j,k} · 1 · r_{j,k} ; l_{j,k} · 0 · r_{j,k}). The columns in this array represent all edges in the same direction parallel to the k-th axis. Suppose there are m columns in A_k.
Let E_k be a matrix containing the labels corresponding to the entries of A_k, which are binary strings of length N. If the decimal equivalent of an entry in A_k is d, then the corresponding entry in E_k is b_{d+1}. E_k has 2 rows and m columns; its entries are ε_{pq}, where 1 <= p <= 2 and 1 <= q <= m.
Algorithm:
Initialize L ← 1
for k = 1:N   % Loop through each axis of the unit N-cube
    Construct A_k
    Construct E_k
    % Nodes with a NaN label do not participate in an opposite-direction motion
    Delete the r-th column of E_k for all r such that ε_{pr} = NaN for some 1 <= p <= 2
    % Let the reduced matrix be E_k' with n columns, whose entries are ε'_{pq}, 1 <= p <= 2, 1 <= q <= n
    % Edges with identical labels do not participate in an opposite-direction motion. Let ⊕ denote the logical XOR operation
    Delete the s-th column of E_k' for all s such that ε'_{1s} ⊕ ε'_{2s} = 0
    % Let the reduced matrix be E_k'' with t columns. Let the top row of this matrix be {e''_u}, 1 <= u <= t
    if t ≠ 0
        if e''_v ≠ e''_1 for some v, 2 <= v <= t   % i.e. an opposite-direction motion exists along the k-th axis
            L ← 0
        end
    end
end
return L
Fig. 8 Algorithm for applying the criterion for linear separability. This is a pseudocode description of an algorithm that applies the condition illustrated in Fig. 6 and proven in Theorem 4 to an arbitrary vertex-labeling, to check for linear separability.
3.3 Monte Carlo estimates of the proportion $P_L$ of vertex-labelings allowing linear separability
Fig. 9(A) shows the Monte Carlo estimates of $P_L(N,m)$ as described in the previous section, for small values of N and m. The Monte Carlo estimator of $P_L$ has the favorable properties of unbiasedness (confirmed by the estimator matching actual values of $P_L$ when available, such as in Inset (a) of Fig. 9) and consistency (confirmed by the negligible standard errors).
$P_L$ as a function of N, as can be visualized via the family of curves in Fig. 9(A) or via Inset (b), falls very rapidly with N for all m > 1. $P_L$ as a function of m shows a symmetric characteristic, which can be readily understood on observing that the situations with m = 0 (all No nodes) and with m = $2^N$ (all Yes nodes) are equivalent for purposes of linear separability, and both yield $P_L$ = 1.
Noting the symmetry of $P_L$ as a function of m, further plots show only one of the symmetric halves of this curve for N = 6. Fig. 9(B) superimposes a plot of $P_L$ as a function of m for N = 6 with a plot of the fraction of connected labelings with the same N and m that are linearly separable. Fig. 9(C) shows that the chances of finding a linearly separable labeling increase greatly for diminished spiketrain repertoires where high-spike-count responses are absent.
Fig. 9 Results of Monte Carlo simulations: proportion of vertex-labelings allowing linear separability. Panel (A) shows the proportion ($P_L$) of all possible decoders for a given N and m that are linear, estimated using 100,000 Monte Carlo samples for each parameter setting. For a given N, $P_L$ varied in a U-shaped manner with m, symmetric about m = $2^{N-1}$, as can be seen for N = 3 and N = 4 in Panel (A). $P_L$ falls rapidly with N for m > 1, as shown in Inset (b). The Monte Carlo estimates are found to be unbiased by comparing against the values of $P_L$ obtained by an exhaustive search of all possible decoder labelings where this is tractable, as shown in Inset (a). In Panel (B), the black curve reproduces one symmetric half of the $P_L$ curve for N = 6, using 10,000 random Monte Carlo samples. The green curve shows $P_L$ estimated using the same number of samples of decoder labelings in which each of the m Yes nodes is constrained to lie on the same connected subgraph (the sampling is done as illustrated in Fig. 7). Decoder labelings satisfying the connectivity constraint show $P_L$ falling less precipitously with m. The green curve in Panel (C) is the same as that in Panel (B). The red curve in Panel (C) shows $P_L$ estimated for the same decoder labelings used for the green curve, but discarding from each labeling all those nodes in which more than three '1's or spikes occur. The red curve falls even more slowly with m.
3.4 Discussion
The aim of the computational study in this chapter was to estimate the probability that linear decoding is
optimal for an arbitrary Yes/No question answered with spiketrains of N bins. The estimation of this
probability was done by Monte Carlo sampling of vertex-labelings on N-cubes, and application of the test
of linear separability (Fig. 1.6) on the sampled labelings. The results of these simulations summarized in
Fig. 9. The vanishingly small values of P L for increasing values of N and intermediate values of m,
suggested an overwhelming preponderance of cases for which linear decoders are not optimal for spiketrain
sizes of practical interest. These results led to the belief that for realistic sensory decoding of spiketrains
where N is expected to be higher, there would be plenty of questions for which linear decoding of
spiketrains would yield poor results (like Q2 in Chapter 4.3.2).
However the properties of labelings shown in Figs. 9 (B) and 9 (C) which favor linear separability, are
properties which are likely to be exhibited by real sensory neurons. Spiketrains that are similar (i.e. closer
in Hamming distance, or particularly, sharing an edge in the corresponding unit N-cube) can be expected
to be likelier to share labels, thus exhibiting connectivity. Also, the ‘peaky and sparse’ firing of sensory
neurons to naturalistic stimuli would mean that spiketrains with very high spike counts would occur rarely
if ever
17
.
4. Optimal decoding of experimental neural spiketrain recordings
4.1 Optimal decoders and decoder performance assessment
The principal performance metric for the decoders considered in this thesis is the classification rate, i.e. the fraction (or more typically, percentage) of a decoder's estimated labels that match the 'ground truth' labels for a Yes/No question. We will use the following terms to disambiguate different types of decoders. Decoder refers to any operation on a response pattern that yields an estimate of the unknown class-label. A linear decoder is one whose estimates are obtained as a result of a weighted summing and threshold operation on the input spiketrain. The Ideal Observer Decoder (IOD) is a decoder that yields the maximum-a-posteriori estimate of the class labels, i.e. yields Bayes-optimal performance, which we will treat as the upper-bound^16. A Linear Ideal Observer Decoder (LIOD) is the linear decoder whose classification rate is closest to that of the IOD.
The key objective in this study, of benchmarking the performance of linear decoders of actual spiketrains, will be addressed primarily through the modus operandi of comparing the classification-rates achieved by estimated IODs and the corresponding LIODs for a variety of questions posed to retinal ganglion cells (RGCs).
4.2 Ideal Observer Analysis for a labeled set of spiketrains
In the Monte Carlo simulations in Chapter 3, complete labelings of a unit N-cube were given, i.e. the decoder output was assumed available for every possible spiketrain. In this section, we describe principles and procedures for generating decoder outputs for arbitrary spiketrains, when a set of labeled spiketrains is available from an experiment. Let us suppose that this set of recordings is partitioned into a training set and a test set, for identifying a decoder and evaluating its classification rate respectively.
In the training set from such an experiment, some spiketrain response patterns from the corresponding unit N-cube may never occur, whereas some response patterns may occur multiple times with different labels. This section describes how scarce training data on the distribution of labels among response patterns (like in Fig. 10) can be used to decode test spiketrains Bayes-optimally, i.e. by Ideal Observer Analysis. In addition, a procedure is described to identify a linear decoder whose classification performance best approaches Bayes-optimality. The terminology for decoders used below is as stated in Chapter 4.1 above.
4.2.1 Estimation of an Ideal Observer Decoder (IOD) given a set of labeled spiketrains
Two variants of Ideal Observer decoding are described: a 'local' version, which yields decoder outputs only for spiketrain sequences in the test set that already appeared in the training set, and a 'global' version, which incorporates assumptions of similarity between the labels of neighboring spiketrains to also yield decoder outputs for sequences that did not appear in the training set.
4.2.1.1 ‘Local’ Ideal Observer Decoding
We term as the 'local' Ideal Observer Decoder (IOD) the decoder which assigns labels to every spiketrain in the test set as follows: the label which most often accompanied the occurrences of the same spiketrain in the training set is assigned as the decoder output for that spiketrain. This yields a maximum a posteriori (strictly, maximum likelihood) estimate which is 'local' because it is based only on label occurrences associated with the given node in question. This procedure can be thought of as a raw 'histogram-based' estimator. Such decoding is shown schematically in Fig. 10.
The classification-rate (i.e. percentage of times the answer is correct) attained for a given node is simply
the percentage of times the label assigned to the node was associated with it in the training set. The
classification-rate for the ‘local’ IOD is the sum of these node-wise classification-rates, weighted by the
occurrence probability of each node.
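A minimal MATLAB sketch of this histogram-based procedure is given below; the variable names (trainNodes, trainLabels) are hypothetical, and ties are broken in favor of No, an arbitrary convention not specified in the text.

% Sketch: 'local' IOD estimation. trainNodes holds the decimal value of each
% training spiketrain, trainLabels the corresponding label (1 = Yes, 0 = No).
N = 9;                                % spiketrain length (as in Chapter 4.3.1)
b = nan(2^N, 1);                      % decoder output per node; NaN if unseen
for d = 0:2^N - 1
    lab = trainLabels(trainNodes == d);
    if ~isempty(lab)
        b(d + 1) = mean(lab) > 0.5;   % winner-take-all (majority) label
    end
end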
Fig. 10 Schematic illustration of the 'Local' Ideal Observer Decoder estimation procedure. Data from a hypothetical experiment are summarized in histogram form on the cube on the left. The green and red bars show the relative proportion of occurrences of a given response (node) in association with a Yes and a No answer respectively. The output of the decoder, arrived at as a Winner-Take-All labeling, is shown on the right cube.
4.2.1.2 ‘Global’ Ideal Observer Decoding
We develop a 'global' variant of Ideal Observer decoding in order to account for an eventuality which the 'local' procedure above is not equipped to handle, namely, the occurrence in the test set of a spiketrain that did not appear in the training set. This procedure can generate decoder outputs for test spiketrains for which Yes and No label probabilities are not available in histogram form, through a procedure of kernel density estimation^30. Fig. 11 outlines the algorithm, which functions as a regularization tool to obtain the IOD predictions for every possible spiketrain response.
The Ideal Observer Decoder (IOD) output for every possible test spiketrain of a neuron, for a given question, is specified as a Y/N labeling of the corresponding Hamming cube, where each label represents the Maximum A Posteriori (MAP) estimate of the label given the spike-response. The IOD is 'trained' using the labeled spike-trains in the training set, through kernel density estimation^30 of the probabilities $P_Y(\{S_n\})$ and $P_N(\{S_n\})$ that the label at a given node $\{S_n\}$ in the Hamming Cube is Y and N respectively. An update procedure accumulates the contribution of each training example in succession to $P_Y(\{S_n\})$ (or $P_N(\{S_n\})$), depending on whether its training label is Y (or N). The contribution is greatest at the node corresponding to the spiketrain in the training example, and falls off in a Gaussian fashion with Hamming distance. Once these updates are complete, a winner-take-all (WTA) labeling of every node in the Hamming cube is performed, assigning the label Y if $P_Y(\{S_n\})$ exceeds $P_N(\{S_n\})$ at that node, and the label N otherwise. The procedure is schematically illustrated in Fig. 12.
The classification-rate achieved by the 'global' IOD for the test set is the percentage of times the ground-truth labels at the test-set nodes match the labels arrived at using WTA above, at the corresponding nodes.
ALGORITHM FOR IDEAL OBSERVER DECODER ESTIMATION
Inputs:
R_{k×N} is the matrix of spiketrain recordings in the training set, where the k-th row r_k = [r_{k1}, r_{k2}, ..., r_{ki}, ..., r_{kN}], r_{ki} ∈ {0,1} for all i, is the k-th recording in the set.
s_{k×1}, where the k-th entry s_k is the ground-truth label for the k-th recording in the training set. s_k ∈ {0,1} for all k, where 1 and 0 denote Yes and No labels respectively.
σ is the width of the kernel (in Hamming distance units) used in kernel density estimation.
Outputs:
H_{2^N×N} is a matrix with all binary strings of length N in its rows. Let h_k be the k-th row in H_{2^N×N}.
p^1_{2^N×1} and p^0_{2^N×1} are vectors with the (non-normalized) probability scores of Yes and No respectively, for each entry in H_{2^N×N}. p_k^1 and p_k^0 are the Yes and No scores for h_k.
b_{2^N×1}, where the k-th entry b_k ∈ {0,1} for all k, is the vector of the Ideal Observer Decoder labeling.
The function HDist(h_i, h_j) returns the Hamming Distance between the nodes corresponding to the spiketrains h_i and h_j.
The kernel function KernFun(h_i, h_μ, σ) is here chosen as a Gaussian kernel, so that
$$KernFun(\mathbf{h}_i, \mathbf{h}_\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\left(HDist(\mathbf{h}_i, \mathbf{h}_\mu)\right)^2}$$
Pseudocode:
Initialize p^1 and p^0 to 0_{2^N×1}.
for i = 1:k
    for j = 1:2^N
        p_j^{s_i} ← p_j^{s_i} + KernFun(h_j, r_i, σ)
    end
end
for j = 1:2^N
    b_j ← argmax_s p_j^s
end
return b, p^1, p^0
Fig. 11 Algorithm for Ideal Observer Decoder estimation. The pseudocode above outlines the procedure for obtaining 'global' IOD predictions from a set of labeled experimental spiketrains, using a procedure of kernel density estimation.
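A minimal MATLAB sketch of this procedure is given below; it follows Fig. 11 directly, with R, s and sigma as defined there (the vectorization over nodes is an implementation convenience, not part of the original pseudocode).

% Sketch: 'global' IOD estimation by Gaussian kernel density estimation.
N = size(R, 2);
H = dec2bin(0:2^N - 1, N) - '0';          % all 2^N binary strings, row-wise
p1 = zeros(2^N, 1);  p0 = zeros(2^N, 1);  % non-normalized Yes and No scores
for i = 1:size(R, 1)
    d = sum(H ~= R(i, :), 2);             % Hamming distance of each node to r_i
    kern = exp(-d.^2 / (2*sigma^2)) / (sigma*sqrt(2*pi));   % Gaussian kernel
    if s(i) == 1
        p1 = p1 + kern;                   % a Y training label updates the Y score
    else
        p0 = p0 + kern;                   % an N training label updates the N score
    end
end
b = p1 > p0;                              % winner-take-all labeling of the cube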
Fig. 12 Schematic illustration of the 'Global' Ideal Observer Decoder estimation procedure. As in Fig. 10, data from a hypothetical experiment are shown summarized in histogram form in the left cube. The histogram measurements are used to perform kernel density estimation, which yields estimates of the Yes and No probability distributions over the whole unit N-cube, visualized as green and red clouds above (not to scale). The output of the decoder, arrived at as a Winner-Take-All labeling based on the estimated densities, is shown on the right cube.
4.2.2 Estimation of a Linear Ideal Observer Decoder (LIOD) given a set of labeled spiketrains
The Linear Ideal Observer Decoder (LIOD) for a given question is a Y/N labeling of the corresponding Hamming Cube that yields the best classification performance among all those labelings that satisfy the condition of linear separability in Chapter 1.6. The greedy shallowest-descent procedure used to estimate the LIOD is shown in pseudocode form in Fig. 13.
The LIOD labeling is initialized to the previously obtained IOD labeling for the same question. At the start of the iterations, the nodes having the minority label (say Y, if there are fewer Yes nodes than No nodes) are identified. The minority-label nodes which occur as part of an opposite-direction-motion are retained for analysis. These nodes are sorted in descending order of $P_N(\{S_n\})$, i.e. the probability of the node having the opposite label. The label of the node topmost in the list is flipped, from Y to N in this case. Such a flip abolishes an opposite-direction-motion, thus rendering the labeling closer to a linearly separable one. This procedure is iterated until the labeling satisfies the test of linear separability in Fig. 6. The flip thus performed at each step results in the least possible increase in classification error vis-à-vis the IOD. The labels obtained on convergence are treated as the LIOD predictions for the given question and neuron. The procedure is summarized schematically in Fig. 14.
ALGORITHM FOR LINEAR IDEAL OBSERVER DECODER ESTIMATION
Inputs:
b_{2^N×1}, where the k-th entry b_k ∈ {0,1} for all k, is the vector of an Ideal Observer Decoder labeling.
p_k^1 and p_k^0 are the Yes and No scores of the Hamming Space node (or spike sequence) h_k bearing the label b_k.
Let S_b^0 be the set of all k : b_k = 0. Let S_b^1 be the set of all k : b_k = 1.
IsLinear(b) is a function implementing the algorithm for checking linear separability (Fig. 8). IsLinear(b) = 1 if b is a linearly separable labeling, and 0 otherwise.
IsViol(b_k) is a function that returns 1 if the node bearing label b_k in the N-cube participates in an opposite-direction motion. Such a function can be readily implemented (as an extension of IsLinear(b)) by noting that a node participates in an opposite-direction motion only if it has a neighbor with the opposite label, along an axis in which opposite-direction motions occur.
Outputs:
l_{2^N×1}, where the k-th entry l_k ∈ {0,1} for all k, is the vector of the corresponding Linear Ideal Observer Decoder labeling.
Pseudocode:
MinLabel ← argmin_k |S_b^k|
l ← b
while IsLinear(l) == 0
    FlipList ← S_l^{MinLabel}
    FlipList ← FlipList − {k : k ∈ S_l^{MinLabel}, IsViol(l_k) == 0}
    k_flip ← argmax_{k ∈ FlipList} p_k^{(1 − MinLabel)}
    l_{k_flip} ← 1 − l_{k_flip}
end
return l
Fig. 13 Algorithm for Linear Ideal Observer Decoder estimation. The pseudocode above outlines the procedure for obtaining the predictions of a Linear Ideal Observer Decoder (LIOD), i.e. the linear decoder whose performance is closest to that of the corresponding Ideal Observer Decoder (IOD).
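A minimal MATLAB sketch of the greedy loop is given below. It assumes the helper isLinearlySeparable from Chapter 3.1, and a hypothetical helper isViol that returns, for a labeling, a logical vector flagging the nodes participating in an opposite-direction motion; both names are illustrative.

% Sketch: greedy LIOD estimation from an IOD labeling b with scores p0, p1.
l = b(:);
minLabel = double(sum(l == 1) < sum(l == 0));   % the minority label (0 or 1)
p = [p0(:), p1(:)];                       % column 1: No scores, column 2: Yes
while ~isLinearlySeparable(l)
    flipList = find(l == minLabel & isViol(l));   % minority nodes in violations
    % flip the least-supported node, i.e. the one with the highest score for
    % the opposite label (column index: label 0 -> 1, label 1 -> 2)
    [~, j] = max(p(flipList, (1 - minLabel) + 1));
    l(flipList(j)) = 1 - minLabel;
end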
Fig. 14 Schematic illustration of the Linear Ideal Observer Decoder estimation procedure. The Ideal Observer decoding shown on the left is the same as in Fig. 12. This IOD does not allow linear decoding without errors, due to the opposite-direction-motion shown (ref. Fig. 6). In this example, suppose that the Yes nodes are treated as 'minority nodes' to be flipped. Among the Yes nodes, it is those at (0,0,1) and (1,1,0) that participate in an opposite-direction-motion. Of these two nodes, the Yes node at (1,1,0) is the least supported, i.e. has the highest No probability. Flipping this node yields linear separability, as shown in the right cube, with a possible separating plane shown in blue. In general, achieving linear separability can require many such flips. The labeling shown on the right gives the output predictions of a Linear Ideal Observer Decoder, which is the linear decoder whose performance is closest to the upper-bound defined by the IOD performance.
4.3 Decoding of Retinal Ganglion Cell spiketrains using IOD and LIOD estimation
In order to demonstrate the utility of the techniques developed above for the analysis of stimulus-response data in sensory neuroscience, we applied them to retinal ganglion cell (RGC) recordings obtained under natural image stimulation, available courtesy of Cao et al^31,32. In these recordings, each cell was stimulated by 1000 natural images lasting a second each, alternating with full-field gray images of the same duration. The use of natural stimuli allows asking a wider range of questions (in the sense of Chapter 1.3) than artificial stimuli customized for a single question. The decoders built and tested here are unit-level decoders of the natural-image responses.
4.3.1 Data pre-processing and partitioning
Spike responses of retinal ganglion cells (RGCs) to a succession of alternating natural images and full-field gray backgrounds lasting a second each, recorded previously by Cao et al at the Visual Processing Laboratory^31,32, are represented as binary strings of tractable length, to render them amenable to the analyses in Chapter 4.2. The peristimulus time histograms (PSTHs) are inspected for the peak of activity, and non-responding portions are discarded. In this way, the interval from 60 ms to 250 ms following stimulus onset, found to be the portion of interest in the 1-second-long recording, is retained for further analyses. The retained portion is then coarsely binned in bins of 20 ms, yielding binary strings of length 9 bits. This coarse binning results in some instances of multiple spikes being treated as a single spike, and our further analyses are applied only to those neurons where under 10% of the total recorded spikes are 'lost' by binning. The data for each neuron, comprising 1000 distinct natural image presentations, are partitioned into a training and a test set of 900 and 100 labeled spiketrains respectively.
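A minimal MATLAB sketch of this preprocessing step is given below. The variable spikeTimes (spike times in ms relative to stimulus onset, for one trial) is hypothetical, and the exact bin edges are an assumption, since nine 20-ms bins span 180 ms of the quoted 60-250 ms window.

% Sketch: convert one trial's spike times into a 9-bit binary string.
edges  = 60 + 20*(0:9);                  % nine 20-ms bins starting at 60 ms
counts = histcounts(spikeTimes, edges);  % spike count in each bin
bits   = double(counts > 0);             % multiple spikes per bin collapse to 1
lost   = sum(counts) - sum(bits);        % spikes 'lost' to the coarse binning

The lost-spike count, accumulated over all of a neuron's trials, provides the under-10% inclusion criterion described above.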
4.3.2 Ground-truth labeling for multiple questions
Ground-truth labels of the recorded spike-trains for a given Yes/No question are obtained by examining the stimulus images corresponding to each spike-train. Some psychophysically motivated Yes/No questions (ref. Chapter 1.3) relevant to RGC function which are considered in this study are:
Q1: Did the center-mean-contrast of the current image increase by at least 20% with respect to the previous image?
Q2: Did the center-mean-contrast of the current image change by at least 20% with respect to the previous image?
Center-mean contrast (C) is defined as follows^32:
$$C = (M_{Center} - M_{Gray}) / M_{Gray},$$
where $M_{Center}$ is the mean intensity of the natural image in the RF center and $M_{Gray}$ is the mean intensity of the full-field gray background.
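The minimal sketch below illustrates one plausible reading of this ground-truth labeling; the variable names are hypothetical, and treating the 20% criterion as a threshold on the change in C is our assumption about the parametrization.

% Sketch: ground-truth labels for Q1 and Q2 from the image sequence.
% mCenter is a vector of mean intensities in the RF center, one per image;
% mGray is the mean intensity of the full-field gray background.
C  = (mCenter(:) - mGray) / mGray;   % center-mean contrast of each image
dC = [NaN; diff(C)];                 % change with respect to the previous image
q1 = dC >= 0.20;                     % Q1: contrast increased by at least 20%
q2 = abs(dC) >= 0.20;                % Q2: contrast changed by at least 20%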
Visual change-detection is a canonical psychophysical question^33, and Q2 may be considered to be that question asked at the level of a single RGC. An additional rationale for choosing the pair of questions above for benchmarking linear decoder performance in this study, is that we expect these questions to differ markedly in the ease of linear decoding (see Fig. 15). We expect from known RGC properties that Q1 would be a question where a linear decoder would yield optimal performance, whereas Q2 would not. This is because an ON (or an OFF) cell can detect contrast changes in one direction, e.g. an increase, as in Q1, but not in both directions, as in Q2, which intuitively seems to require linear filters implementing two distinct stimulus-preferences rather than just one. Both Q1 and Q2 are parametric on a percentage criterion (here, 20%). The percentage criterion in Q1 and Q2 is chosen by inspecting the natural-image statistics and picking values for which both Yes and No answers are adequately represented in the ground-truth labeling for these questions.
Fig. 15 Rationale for the selection of early visual tasks for LIOD demonstrations. Consider a hypothetical ON-type RGC whose responses can be represented in the unit N-cube for N = 3 as shown. Consider two decoders whose respective predictions can be represented as in (a) and (b). For the question Q1 (Did the center-mean contrast increase by at least 20%?), decoder (a) would match the 'ground truth' often, because the RGC would respond with few or no spikes when the ground truth answer is No, and respond vigorously with more spikes when the ground truth answer is Yes. Therefore, the linear decoder in (a) would attain a high classification rate for Q1. For the question Q2 (Did the center-mean contrast change by at least 20%?), however, a null response by the RGC can sometimes correspond to a ground truth answer of Yes (the ON RGC is insensitive to a temporal contrast decrease of 20%, for which the correct ground-truth answer to Q2 is Yes). Using prior knowledge that large contrast decreases are likelier in natural stimuli, a decoder prediction of Y for (0,0,0), as in (b), would have a lower error rate than in (a). However, the decoder in (b) is not a linear decoder (ref. Fig. 6). Following this intuition, the Ideal Observer Decoder (IOD) for questions like Q2 for an ON cell would not be a linear decoder, and the corresponding LIOD can be expected to yield poorer performance than the IOD (ref. Fig. 14). A similar discussion holds for OFF cells as well.
4.3.3 Comparison of IOD and LIOD performance for questions posed to RGCs
The training and performance assessment of the Ideal Observer Decoder (IOD) for each of the questions are performed as described in Chapter 4.2.1. An open parameter in 'global' Ideal Observer Decoding is the width of the Gaussian kernel used for density estimation. This parameter was fixed at the setting yielding the highest classification performance over a range of explored values. The Linear Ideal Observer Decoder (LIOD) is trained and tested as described in Chapter 4.2.2.
The classification percentages achieved by the IOD and LIOD are compared for the two questions in Chapter 4.3.2, for an ON and an OFF cell, in Fig. 16(g). The panels in Fig. 16 also provide visualization of relevant stimulus statistics (center-mean-contrast or temporal-contrast distributions of the natural image stimuli) in panels (a) and (b), and of response statistics in panels (c) through (f).
Fig. 16 Results of Linear Ideal Observer Decoding analyses applied to naturalistic RGC recordings. The classification percentages (along with bootstrap standard errors) attained by Linear Ideal Observer Decoders, along with their corresponding Ideal Observer Decoders, are summarized in the table in Panel (g) for an ON and an OFF RGC example, for the questions Q1 and Q2 detailed in Chapter 4.3.2. The RGCs analyzed^31,32 were stimulated with 1000 natural images alternating with a gray full-field, and the relevant stimulus statistics are visualized in the top row. Panel (a) shows the histogram of temporal contrast (change in center-mean contrast with respect to the previous image) experienced by the example ON RGC, and Panel (b) shows the same for the OFF RGC. Panels (c) through (f) assist visualization of class-conditional response distributions for the two questions considered. For example, in Panel (c) the blue trace shows the raw average of the response of the ON cell preprocessed into strings of 9 bins (i.e. N = 9), and the green and red traces show the class-conditional response averages for responses elicited by stimuli with Yes and No answers respectively for Question Q1. Panel (d) shows how, for the same ON RGC with a different question, Q2, different class-conditional response distributions (and hence different labelings of the corresponding 9-cube) result.
Both the ON and OFF cell in Fig. 16 attained a better classification percentage for Q1 than Q2, along the
lines expected in Chapter 4.3.2. However, the most salient feature of the IOD and LIOD performance
comparison is how the LIOD matched IOD performance i.e. reached the upper-bound of performance in all
the cases considered.
4.4 Discussion
In this chapter, we developed computational tools to obtain the performance of Ideal Observer and Linear Ideal Observer decoders for Yes/No questions, given stimulus-response data from actual experiments. These tools were applied to the benchmarking of the performance of linear decoders vis-à-vis an Ideal Observer upper-bound, in the decoding of available retinal spiketrains^31,32. For the broadly representative selection of early visual tasks considered, linear decoders of retinal spike trains were nearly optimal. This was surprising considering the findings from the Monte Carlo simulations in Chapter 3.3, which had suggested that for a situation with N = 9, as in the analyzed recordings, it would be extremely unlikely to find an optimal linear decoder. The simulations in Chapter 5 were motivated in large part by the need to account for this better-than-expected performance of linear decoders.
5. Linear decoder performance for simulated naturalistic neural responses
5.1 Motivation to study the effect of spiking statistics on decoder performance
Linear decoders were found sufficient to attain optimal classification performance for decoding the
responses of single retinal ganglion cells that were examined in Chapter 4.3. Moreover, linear decoders
could yield optimal performance even for a question where we expected Ideal-Observer vertex-labelings to
not be linearly separable (Chapter 4.3.2). Some possible explanations we considered for the surprisingly
good performance of linear decoders are summarized below:
| S. No. | Property of estimation procedure | Implications for estimated linear decoding performance |
|---|---|---|
| 1 | Overly coarse sampling of the spiketrain responses | Response spaces are N-cubes of low dimensionality, where Ideal Observer labelings are likelier to obey linear separability than in higher-dimensional N-cubes. |
| 2 | Kernel-based density estimation in the global Ideal Observer Decoder estimation | Kernel smoothing can lead to large uniformly colored neighborhoods, thus biasing the estimation towards linearly separable colorings. |
| 3 | Small number of training examples | (i) For 'local' Ideal Observer Decoder estimation, a small number of colored nodes biases the estimation towards linearly separable colorings. (ii) For 'global' Ideal Observer Decoder estimation, too few training examples can exacerbate the tendency of kernel density estimation to produce uniformly colored neighborhoods. |
| 4 | Weak firing of retinal ganglion cells in response to naturalistic stimulation | Naturalistic responses occur close to the all-zero node in the Hamming Cube, rendering colorings that obey linear separability likelier than for more widespread response distributions. |
The first three of the proposed explanations above pertain to how the choices of some open parameters in the Ideal Observer Decoder estimation algorithms may result in the estimated IOD vertex-labelings being 'more linear' than they truly are, and thus over-estimate the performance of the corresponding Linear Ideal Observer Decoder. However, irrespective of the settings of the open parameters of the estimation algorithm, the estimated IOD (as well as the true IOD) would be very likely to obey linear separability if the spiketrain statistics are as in #4 in the preceding table. The simulations in the next section study the effect of firing statistics mimicking those observed in the recorded RGCs.
5.2 Unit-level simulation study of the effect of naturalistic response statistics
5.2.1 Unit-level simulation setup
Unit response distributions for decoding analyses are simulated according to the following parameters, which are motivated by the observed response statistics of RGCs to natural images:
(i) λ (length constant of spike-count decay): It is observed from the available retinal ganglion cell responses to natural images that the probability of occurrence of a spiketrain falls off in a qualitatively exponential fashion with increasing Hamming distance from the all-zero node (null response).
Fig. 17 Distribution of response spike-counts of an RGC to natural images. In order to visualize how the responses of an RGC are distributed in the corresponding Hamming Cube (with N = 9 in this case), the distribution of Hamming distances of the response nodes from the null-response node (0,0,…,0,0) is plotted. We can see that most responses are crowded close to the null-response node, with the rest of the Hamming Cube nearly 'empty'. The red curve shows an exponential fit with a low length constant (λ = 1.22).
Simulated response statistics are modeled by setting the response probabilities to decay exponentially with Hamming distance, according to a length-constant parameter λ. Distributions with high values of λ (i.e. those for which the spike-count decays slowly and whose response probabilities are close to uniformly distributed) resemble RGC responses to artificial images, whereas those with low values of λ (with response probabilities peaked near the null response)^17 resemble responses to naturalistic stimuli like those in Chapter 4.3.1^31,32.
In addition to this key parameter, there are two other key experimental control parameters used in the simulations.
(ii) 'Split' (measure of noise): The classification-rate attained by a 'local' IOD (or in general, the performance of any decoder) is limited by 'encoding noise', as modeled by the (Bernoulli) probability distribution of the occurrence of Yes and No labels at a node. A variable called 'split' is defined as the percentage of times the decoder-generated label occurs at a node, as a way of studying the effect of such encoding noise or uncertainty. Referring to Chapter 4.2.1.1, this 'split' can also be thought of as the local classification-rate at each node. In this simulation, the 'split' is assumed to be identical at all nodes in the Hamming Cube, i.e. each node is assumed to be equally reliable.
(iii) c: Hamming band cycle length (a parameter identifying the nonlinearity): In order to meaningfully compare IOD and LIOD performance, it is important to choose examples in which the IOD labeling does not at the outset admit linear separability. For these simulations, a family of such IOD labelings, consisting of alternating Yes and No zones periodic in Hamming distance (with different spatial periods/cycle-lengths), which we will call 'zebras' for convenience, is chosen, as shown in Fig. 18. From the figures, it is obvious that for small values of c, a single hyperplane cannot separate the Yes and No classes for such labelings, and they therefore do not meet the condition of LIODs to start with.
Fig. 18 'Zebra labelings': a family of non-linearly-separable IOD labelings. The left cube shows a zebra labeling for c = 1, for N = 3. This labeling is not linearly separable, as can be seen from the multiple occurrences of opposite-direction-motions. The right cube is also a zebra-labeling, with c = 2 for N = 3. The labeling for c = 2 and N = 3 is linearly separable. In general, nonlinear labelings will result so long as c < N/2. Therefore, for higher values of N, it is possible to generate a sufficiently large family of nonlinear zebra labelings. We note here that the labeling shown in Fig. 15 closely resembles a zebra-labeling.
In a manner of speaking, zebra-labelings with smaller values of c are 'more nonlinear', as they have more opposite-direction-motions and hence more nodes to flip to attain linear separability (see Fig. 6). When the LIOD approximation procedure is applied to IOD labelings such as the above, we can expect a decrease in classification-rate resulting from the flipping of labels (ref. Fig. 14). The 'zebra' labeling with c = 1 is of particular interest, as it is the labeling that has the largest possible number of opposite-direction motions, and in this regard represents the 'most nonlinear' labeling.
The IOD and LIOD classification rates are obtained as a function of the above variables, in order to study
to what degree each of them accounts for the closeness of LIOD and IOD performance.
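The minimal MATLAB sketch below generates the two simulation ingredients just described: a zebra IOD labeling with cycle length c, and node occurrence probabilities decaying exponentially with Hamming weight. The banded construction via floor(w/c) is one plausible formalization of the 'zebra' zones, and the parameter values are illustrative.

% Sketch: zebra labeling and exponentially decaying response distribution.
N = 10;  c = 1;  lambda = 1.22;               % illustrative parameter values
w = sum(dec2bin(0:2^N - 1, N) - '0', 2);      % Hamming weight of every node
zebra = mod(floor(w / c), 2);                 % alternating Y/N bands of width c
pNode = exp(-w / lambda);                     % exponential decay with distance
pNode = pNode / sum(pNode);                   % normalize to a distribution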
5.2.2 Unit-level simulation results
The performance of Linear Ideal Observer Decoders is found to approach that of Ideal Observer Decoders
even in situations where the Ideal Observer Decoder labeling is very nonlinear, when the distribution of
spiketrain responses is concentrated near the all-zero node in the Hamming Cube (i.e. when the length
constant λ is lower). Fig. 19 summarizes the effect of λ for a family of non-linear zebra-labelings on a
Hamming Cube with N = 10, for different values of split.
Fig. 19 Effect of spiketrain response distributions on Linear Ideal Observer Decoder performance. The family of curves in Panel (a) reveals how the classification percentage achieved by Linear Ideal Observer Decoders (obtained by finding the best-performing linearly separable approximants of a family of 'zebra' IOD labelings) varies with λ. Panel (a) shows results for a single value of split (see Chapter 5.2.1). The three subpanels in the bottom row in (b) show the classification rates as a function of split for some representative values of λ. The curve labeled 'UB' (for upper-bound) is the classification percentage achieved by the Ideal Observer Decoder, and the curves labeled with different values of c are the classification percentages attained by the Linear Ideal Observer Decoder approximations of zebra-labelings with those values of c. As the length constant λ increases (when the firing being modeled resembles RGC responses to artificial images), the LIOD performance falls significantly short of the upper-bound. However, for sufficiently low length constants (which qualitatively resemble naturalistic responses), the LIOD performance begins to closely approach the upper-bound.
The finding that spiketrain distributions resembling naturalistic RGC firings render such firings more amenable to optimal linear decoding offers an explanation for the better-than-expected performance of linear decoders for RGC responses to natural images observed in Chapter 4.3.3. The ease of linear separability for naturalistic firing distributions can be understood to arise for reasons similar to why spike-count-limiting favors linear separability (see Fig. 9(C)), since nodes occurring very rarely (like in distributions with a low length constant λ) are effectively 'eliminated'. This provides a resolution to the apparent conflict between the theoretical prediction of optimal linear decoders being rare (Chapter 3.3) and the empirical observation of linear decoders being optimal for a broad class of visual tasks (Chapter 4.3). While this explanation applies to linear decoders for unit-level firings, an investigation with simulated population firings in the next section can help address how often linear decoders can be expected to be optimal for population firings^42.
5.3 Population-level simulation study of the effect of naturalistic response statistics
5.3.1 Formulations of population decoders
Of the many possibilities for representing the population response of n neurons responding with spiketrains of length k bits, the two broad classes considered in Fig. 20 are of particular interest because:
(i) The 'non-hierarchical' model, preserving the spike timing information of all neurons, is the most information-preserving representation. In a manner of speaking, this may be called the 'raw' population response.
(ii) The 'hierarchical' model, using a very coarse summary of the information from each neuron's spiketrain, is one of the least information-preserving representations.
The performance of decoders based on any other proposed representational scheme of the population response can therefore be reasonably expected to be intermediate to the performances yielded by these two schemes in Fig. 20.
Fig. 20 Architectures for population decoding. The population response of n neurons, each responding with spiketrains of length k, can be represented in an 'unprocessed' way by treating the combined population spiketrain as we would treat a single spiketrain (as done in the 'non-hierarchical' architecture shown on the left). Alternatively, a processed or filtered version of the high-dimensional population response may be used for decoding. A simple architecture for such filtering, involving a bank of linear filters yielding unit-wise decoder outputs, is shown on the right. The two population response representations considered lie at two extremes in terms of information-content about the input spiketrains, with the 'non-hierarchical' architecture retaining the most information and the 'hierarchical' architecture retaining the least.
The techniques developed in Chapter 4.2 can be applied to each of the response spaces that result, thus yielding for each such population a non-hierarchical IOD and LIOD, as well as a hierarchical IOD and LIOD. The hierarchical setup considered in Fig. 20 above incorporated linear summing of spikes from one neuron at a time. This assumption of linear combination of unit responses was found adequate for some population decoding applications in the visual system^34. However, there may be applications that necessitate relaxing the assumption of linearity, and one way of doing this is by taking into account the interactions between spiketrains of different neurons. In other words, a processed population response can be obtained by applying a suitable kernel, such as a Volterra kernel, to the input spiketrains, instead of only applying a linear filterbank. Such representations promise to be amenable to polynomial decoders of the sort introduced in Chapter 2.4. The relevance of such polynomial decoders for population decoding is discussed further in the Discussion in Chapter 6. The rest of this chapter examines conditions limiting the performance of LIODs in the non-hierarchical and hierarchical (with linear pooling) architectures considered above.
5.3.2 Effect of population size on non-hierarchical decoder performance
We performed a population counterpart of the unit-level simulation described in Chapters 5.2.1-5.2.2. For this, we considered independently firing populations of n identical neurons, whose responses are of length k bins. We examined how closely a non-hierarchical LIOD (i.e. the best linear decoder of the 'raw' population response) can approach the upper bound of performance (when the non-hierarchical IOD was set as the worst-case zebra labeling on the nk-cube, with c = 1), as we lower the length-constant parameter λ determining the firing statistics of the identical neurons. Further, we repeated these analyses for different pairs of n and k yielding the same nk (i.e. yielding non-hierarchical population response cubes of the same dimensionality). The results are summarized in Fig. 21.
λ       n = 1, k = 12   n = 2, k = 6   n = 3, k = 4   n = 4, k = 3   n = 6, k = 2   n = 12, k = 1
100     51.7243         50.458         50.1796        50.1659        50.0308        50.0104
1       66.042          61.7256        58.2551        55.9645        53.1127        50.9303
0.1     70.000          70.0000        70.0000        70.0000        70.0000        70.0000
Fig. 21 Simulation results showing the effect of population size on non-hierarchical decoder performance. The entries in the table are classification percentages attained by an LIOD (corresponding to an IOD that is a zebra-labeling with c = 1) on an nk-cube, for different values of n and k such that nk is always 12. The split assumed for the IOD labelings was 70%, which is the upper bound of performance in all cases. The first column, with n = 1 and k = 12, shows the results for a single-unit situation, where, just like the results in Fig. 19, a sufficiently low length constant λ allows the LIOD to reach the upper bound (here, 70). For all other values of n as well, there is an improvement in LIOD classification percentage as λ decreases (as can be seen from the values increasing down each column). An intermediate value, λ = 1, corresponds to naturalistic firings. In this row, the classification percentage falls with increasing n. This illustrates that linear decoders for naturalistic firings are less likely to be optimal for larger populations than for smaller ones, though the non-hierarchical population response may be of the same dimensionality.
Two salient features from the simulation above are:
(i) For independent identical populations of neurons, weak firing confers an advantage on linear decoders, qualitatively as it did in the case of a single neuron. In the extreme case, corresponding almost to the absence of spiking (λ = 0.1), linear decoder performance is identical to the upper bound.
(ii) When we examine the case most resembling naturalistic firing (λ = 1), we see that an increase in the number of neurons in the population confers a disadvantage on linear decoders, even for the same firing statistics and combined population response size.
The latter result can be explained by noting that different population response distributions (over the nk-cube) arise for different values of n and k, with the mode of this distribution occurring farther away from the all-zero node in the nk-cube as n increases (and k decreases). This leads to increasing losses in classification rate while flipping node labels to obtain the LIOD from the IOD, for the cases with a larger number of units.
5.3.3 Analysis methods for performance-assessment of hierarchical decoders
Of the two population decoder formulations considered in Fig. 20, we benchmarked linear decoder performance for non-hierarchical decoders via simulations in the preceding section. In this section, we develop the necessary tools to perform a similar analysis for hierarchical decoders as well. The equations derived here yield qualitative insights on the relative performance of some classes of decoders, and serve as a basis for the simulations in future work mentioned in Chapter 6.2.
We consider a population of n identical neurons, whose responses are of length k bins. The response space of an individual neuron in the population is a k-cube. We introduce the following notation:
Notation:
Let $P(B)$ be the probability of occurrence of a spike sequence $B$ corresponding to a node in the nk-cube representing the concatenation of the n simultaneous spike-sequences of the neurons in the population. This is the cube on which non-hierarchical decoder outputs are defined.
Let $P(A)$ be the probability of occurrence of a spiketrain $A$ corresponding to a node in the k-cube of the responses of one of the n identical neurons.
Let $P(C)$ be the probability of occurrence of a sequence $C$ corresponding to a node in the n-cube resulting from the sort of hierarchical processing shown on the right of Fig. 20. Suppose that a Linear Ideal Observer Decoder (LIOD) is defined for each of the n neurons in the population, yielding a Yes or a No answer (represented by 1 and 0 respectively). Each possible response of the population thus yields a binary sequence of LIOD outputs, which corresponds to a node $C$ of this n-cube.
Setting simulation parameters:
In these simulations, we will consider a ground-truth graph-coloring on the nk-cube representing 'unprocessed' population responses (as we did, in effect, in Chapter 5.3.2), i.e. we will assume $P(B)$ and $P(Y \mid B)$ known. For instance, $P(B)$ can be obtained by assuming spiking that obeys the same exponential decay in spike-count in each of the identical units, and also obeys independence across units. $P(Y \mid B)$ is set by assigning a zebra-labeling with c = 1 to the nodes $B$, and assuming a uniform split as in the unit-level simulations.
Consider an arbitrary vertex labeling on this graph (yielded by a decoder) which partitions the graph into two sets of nodes, $B_Y$ and $B_N$, assigned Yes and No labels respectively by the decoder. The performance of this decoder (correct classification fraction) is given by
$$\sum_{B_Y} P(B_Y)\,P(Y \mid B_Y) + \sum_{B_N} P(B_N)\left(1 - P(Y \mid B_N)\right).$$
For likewise obtaining the performance of the other decoders, we will also need to calculate the quantities $P(A)$, $P(Y \mid A)$, $P(C)$ and $P(Y \mid C)$, which we do below.
Calculation of quantities needed for hierarchical decoder performance:
We note that
$$P(Y) = \sum_{B} P(Y \mid B)\,P(B) \quad \ldots (1)$$
The relevant quantities for each of the individual units can be obtained as follows:
$$P(A) = \sum_{B} P(A \mid B)\,P(B) \quad \ldots (2)$$
$$P(Y \mid A) = \sum_{B} P(Y \mid B)\,P(B \mid A) = \sum_{B} \frac{P(Y \mid B)\,P(A \mid B)\,P(B)}{P(A)} \quad \ldots (3)$$
The relevant quantities for the n-cube of the 'processed' population responses that have passed through a stage of unit-wise optimal linear decoding can be calculated as follows.
Let $C = \{v_1, v_2, \ldots, v_i, \ldots, v_n\}$ where $v_i \in \{0,1\}$.
Assuming the units to be statistically independent, we have
$$P(C) = \prod_{i} P(v_i) \quad \ldots (4)$$
Let us denote the operation of the LIOD at the level of the individual neuron by the binary function $L(A) \in \{0,1\}$, which specifies whether the LIOD yields No or Yes for node $A$.
$$P(v_i = 1) = \sum_{A} P(v_i = 1 \mid A)\,P(A) = \sum_{A} L(A)\,P(A) \quad \ldots (5)$$
$$P(v_i = 0) = \sum_{A} P(v_i = 0 \mid A)\,P(A) = \sum_{A} (1 - L(A))\,P(A) \quad \ldots (6)$$
Equations 4-6 give $P(C)$. We can calculate $P(Y \mid C)$ as follows. Bayes' Theorem gives
$$P(Y \mid C) = \frac{P(C \mid Y)\,P(Y)}{P(C)} \quad \ldots (7)$$
Eqn. 1 gives $P(Y)$. We can obtain $P(C \mid Y)$ as follows. Since the units are assumed independent, we have
$$P(C \mid Y) = \prod_{i} P(v_i \mid Y) \quad \ldots (8)$$
$$P(v_i = 1 \mid Y) = \sum_{A} P(v_i = 1 \mid A)\,P(A \mid Y) = \sum_{A} L(A)\,P(A \mid Y) \quad \ldots (9)$$
$$P(v_i = 0 \mid Y) = \sum_{A} P(v_i = 0 \mid A)\,P(A \mid Y) = \sum_{A} (1 - L(A))\,P(A \mid Y) \quad \ldots (10)$$
The encoding model, i.e. $P(A \mid Y)$, required in the above equations is obtained by Bayes' Theorem, noting that $P(Y \mid A)$ is already available from Eqn. 3:
$$P(A \mid Y) = \frac{P(Y \mid A)\,P(A)}{P(Y)} \quad \ldots (11)$$
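The minimal MATLAB sketch below strings these equations together for one unit; the variable names are illustrative. It assumes pB and pYgB are given over the nk-cube, L is the unit-level LIOD output over the k-cube, and pAgB is the 0/1 indicator matrix encoding which nk-node restricts to which k-node for this unit; constructing pAgB itself is omitted here.

% Sketch: quantities needed for hierarchical decoder performance (Eqns. 1-11).
pY   = sum(pYgB .* pB);                  % Eqn. 1: P(Y)
pA   = pAgB' * pB;                       % Eqn. 2: P(A), marginalized over B
pYgA = (pAgB' * (pYgB .* pB)) ./ pA;     % Eqn. 3: P(Y|A)
pAgY = pYgA .* pA / pY;                  % Eqn. 11: P(A|Y) by Bayes' Theorem
pv1  = L(:)' * pA;                       % Eqn. 5: P(v_i = 1)
pv1Y = L(:)' * pAgY;                     % Eqn. 9: P(v_i = 1 | Y)
% Eqns. 4, 6-8, 10: with independent identical units, P(C) and P(C|Y) are
% Bernoulli products over the n unit-level outputs, with success
% probabilities pv1 and pv1Y respectively.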
The above exercise yields the following insight about the performance of hierarchical decoders. Consider a hierarchical IOD and a hierarchical LIOD (both defined over an n-cube) for the population considered above. When the units in the population are statistically independent, the hierarchical LIOD will perform as well as the hierarchical IOD. This can be inferred by invoking Theorem 6. Alternatively, it can be inferred by noting that the hierarchical LIOD is such that it assigns a Y label to a vertex $C$ if the sequence corresponding to that vertex has more 1's than 0's, and an N label otherwise. Such an LIOD is in effect a vote counter, whose results will be exact when n is odd (avoiding ties).
The above exercise also supplies the framework for performing a study of the effect of naturalistic statistics for hierarchical decoders, as was done for single units and non-hierarchical decoders.
5.4 Discussion
In this chapter, we studied the effect of weak firing in neurons on the performance achievable by linear decoders. Weak firing is a characteristic of neural responses to naturalistic stimuli (see Fig. 17), such as those analyzed in Chapter 4.3. In the simulations performed, the strength of firing was modeled in terms of a parameter λ, the length constant of an exponential decay function of spike counts. Weak firing of neurons was found to favor optimal linear decoding of individual neural responses. An investigation of the performance of linear decoders for populations of neurons was also commenced in this chapter. For an important special case of population decoding with identical, independent units, linear decoders (of the raw population response) were shown to perform more poorly as the size of the population increased (see Fig. 21). Analysis methods were developed for future simulations to assess the performance of hierarchical population decoders, as was done for the non-hierarchical population decoders of the raw population response.
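As a pointer to how such simulations can be set up, the following minimal sketch (an assumed stand-in, not the thesis's simulation code) implements a spike-count distribution governed by a length constant λ, with smaller λ producing weaker firing:

```python
import numpy as np

def spike_count_pmf(N: int, lam: float) -> np.ndarray:
    """P(k) proportional to exp(-k / lam) over spike counts k = 0..N.
    The normalization and the exponential-decay form are assumptions made
    for illustration of a lambda-parameterized firing-strength model."""
    k = np.arange(N + 1)
    p = np.exp(-k / lam)
    return p / p.sum()

for lam in (0.5, 2.0, 8.0):
    print(f"lambda = {lam}:", np.round(spike_count_pmf(6, lam), 3))
```

Small values of λ concentrate the probability mass at zero or one spike, which is the weak-firing regime whose effect on linear decodability the simulations in this chapter explored.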
6. Discussion
6.1 Performance of linear decoders for single neurons
Our analyses of actual retinal responses to naturalistic stimulation showed that for all tasks in the broad class of early visual tasks considered, linear decoders of retinal spike trains are nearly optimal. This observation was of tremendous interest in light of the following:
(i) Earlier investigators had noted that, in general, a spiketrain carries more information than linear decoders can extract. Such 'lossy' (in terms of information) decoders were thus believed unlikely to deliver optimal performance.
(ii) Results of Monte Carlo simulations that sampled the space of all possible Yes-No graph-colorings for a given dimensionality led us to expect an overwhelming preponderance of questions that do not allow optimal linear decoding; that is, Yes-No questions allowing optimal linear decoding appeared to be extremely rare.
This motivated further investigation into why linear decoders perform this well, and whether they may fail for some perceptually relevant visual tasks. The better-than-expected performance of linear decoders was accounted for by considering the characteristic firing statistics of retinal ganglion cells in response to natural images. Studies with simulated RGCs whose spiking mimicked observed naturalistic spiking helped resolve the apparent conflicts above as follows:
(i) The firing statistics of retinal ganglion cells responding to natural stimuli (as in the data analyzed in this thesis) are markedly different from those evoked by artificial stimuli^35. The conclusions of most earlier investigations of neural encoding and decoding were based on spike responses to artificial stimuli.
(ii) Weak firing of RGCs results in a spiketrain repertoire that is a small subset of the entire set of spiketrains represented by the nodes of the corresponding N-cube. In this effectively reduced response space, linear separability is likelier than in the arbitrary, unconstrained response spaces examined in the Monte Carlo simulations.
In sum, our efforts to address the question "How well can linear decoders perform for responses of single neurons?" led to the broad realization that the answer depends profoundly on the firing statistics, and in particular to the finding that naturalistic firing statistics tend to favor linear decoding.
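Observation (ii) above can be illustrated directly. The sketch below (a toy reconstruction for this discussion, not the algorithm of Fig. 8) uses Monte Carlo sampling to estimate the fraction of random Yes-No labelings that are linearly separable, on the full 4-cube versus a hypothetical weak-firing repertoire restricted to spiketrains with at most one spike; separability is tested with a feasibility linear program via SciPy's `linprog`.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Feasibility LP: does (w, b) exist with (2*y_i - 1)(w.x_i + b) >= 1 for
    every point? For a finite point set this is equivalent to strict linear
    separability of the two label classes."""
    signs = 2.0 * y - 1.0
    A_ub = -signs[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1), method="highs")
    return res.status == 0  # 0: feasible optimum found; 2: infeasible

rng = np.random.default_rng(1)
N = 4
cube = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
weak = cube[cube.sum(axis=1) <= 1]  # weak-firing repertoire: at most one spike

def separable_fraction(X, trials=2000):
    return np.mean([is_linearly_separable(X, rng.integers(0, 2, len(X)))
                    for _ in range(trials)])

print("separable fraction, full 4-cube:    ", separable_fraction(cube))
print("separable fraction, weak repertoire:", separable_fraction(weak))
```

On the full 4-cube only a few percent of labelings are separable (1,882 of the 65,536 labelings are threshold functions), whereas every labeling of the five-point weak-firing repertoire is separable, since those points are in general position.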
6.2 Performance of linear decoders for neural populations
In the unit-level simulation studies, we had found that, all other things remaining equal, a linear decoder is less likely to be optimal when the response dimensionality is higher (Chapter 5.3.2). In the preliminary population-level studies, we found that the chances of a linear decoder being optimal differ with the size of the population, even for the same population response dimensionality: a linear decoder is likelier to do worse for a population with more neurons each responding with shorter spiketrains than for one with fewer neurons responding with longer spiketrains. This population effect manifested even under the condition of independent spiking among the units of the population, which was assumed during the preliminary simulations. All other things remaining equal, an increase in the number of neurons in the population lessens the chances of a linear decoder being optimal.
The advantage that naturalistic firing statistics confer on linear decoders of single neurons was found to also apply to non-hierarchical linear decoders of population responses, for the populations of identical neurons considered in our preliminary simulations. In upcoming work, we will examine the effect of naturalistic firing statistics on the performance of hierarchical linear decoders for population responses, using the tools developed for this purpose in Chapter 5.3.3. A promising line of future work is to relax the assumptions of identical neurons and of independent spiking, to see how diversity of neurons in the population and correlations between them affect the performance of linear decoders.
Models involving non-linear (particularly, quadratic) processing stages have been proposed for population decoding of properties of visual scenes such as local image velocity^36 and binocular disparity^37. Such models may be treated as instances of polynomial classifiers, which are promising as a general framework of decoding for Yes-No questions. Unlike for linear decoders, the existence of an optimal polynomial classifier (whose degree is at most the dimensionality of the population response) is guaranteed in general (Theorem 5 in Chapter 2.4). The degree of the polynomial rises at most linearly with the dimensionality of the population response, thus keeping parameter estimation of a polynomial decoder tractable. Such polynomial-based decoding can be accommodated in the canonical framework of discrete Volterra models^38, which have been found useful in expressing a variety of transformations of neural spiketrains^39 in terms of spike interactions of different orders (which correspond to terms of different degrees in a polynomial discriminant function).
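As a toy illustration of the guarantee invoked above (a hypothetical demonstration, not the construction used in Theorem 5), the sketch below attacks the parity labeling of the 3-cube, which is not linearly separable, with least-squares polynomial discriminants of increasing degree built from monomials of the binary response:

```python
import itertools
import numpy as np

N = 3
vertices = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
labels = vertices.sum(axis=1) % 2   # parity labeling: not linearly separable

def monomial_features(X, max_degree):
    """Design matrix of all monomials prod_{i in S} x_i with |S| <= max_degree,
    including the empty product (a constant bias column)."""
    cols = [np.ones(len(X))]
    for d in range(1, max_degree + 1):
        for S in itertools.combinations(range(X.shape[1]), d):
            cols.append(np.prod(X[:, S], axis=1))
    return np.column_stack(cols)

targets = 2.0 * labels - 1.0        # code the labels as -1/+1
for degree in range(1, N + 1):
    Phi = monomial_features(vertices, degree)
    w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    accuracy = np.mean((Phi @ w > 0) == (targets > 0))
    print(f"degree {degree}: fraction of vertices correct = {accuracy:.2f}")
```

Degrees 1 and 2 remain at or near chance for parity, while degree $N = 3$ classifies every vertex exactly: the monomials over all subsets of coordinates form a complete basis for functions on $\{0,1\}^N$, so a polynomial discriminant of degree at most $N$ always exists, in line with the guarantee of Theorem 5.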
6.3 Methodological extensions and applications
1. Application to other neuroscience datasets: The tests and tools introduced in this thesis and demonstrated for decoding retinal ganglion cell spiketrains can be adapted to other stimulus-response datasets from sensory neurons^40, as well as to datasets from neurons in other brain regions involved in perceptual or motor decision-making studied in 2-AFC paradigms^41.
2. Beyond Yes-No questions: In this thesis, decoding has been posed as a problem of binary classification in a space of binary features. Binary classification can be viewed as the estimation of a Bernoulli random variable, and hence as a counterpart of the estimation of Gaussian random variables that is commonplace in the neural decoding literature. More typically, neural decoding is treated as function approximation^42 (or regression), which can be cast as multiclass classification, itself often resolved into a series of binary classification problems^43. Our formulation of binary classification (dealing with Yes-No questions) is therefore a useful and eventually generalizable starting point for addressing more general classes of perceptual questions.
3. Improved decoder estimation procedures: Some ways in which the estimation of Ideal Observer Decoders (Chapter 4.2) can be improved are:
(i) Incorporation of informative priors^44 during density estimation of label probabilities, to obtain potentially better classification performance than the current simple approach, which effectively performs Maximum Likelihood Estimation
(ii) Use of optimized custom (possibly non-radial) kernel functions^45 for density estimation of label probabilities, relaxing assumptions inherent in the symmetric Gaussian kernels used so far
4. More general representational spaces for neural spiketrains: In this thesis, neural responses were
represented as binary strings or equivalently as nodes in an N-cube. Possible extensions are:
(i) n-ary strings: We can allow elements of the neural response time series (of length N) to assume
integer values from 0 to n, rather than using a binary time-series representation. The opposite-
direction motion criterion for determining linear separability of a vertex-labeling (Chapter 1.6)
applies to the discrete space of n-ary strings too (noting that collinear edges can be treated as
parallel edges).
(ii) “Warped” N-cubes: A convenient measure of similarity between spiketrains is the Hamming distance between the corresponding nodes in the N-cube. In an N-cube, all edges are assumed to be identical in all respects, but graphs in which the edges are 'weighted'^27 (thus 'warping' the space of responses) could be a useful generalization that can help better represent some observed neural response properties. Refractoriness, for example, would mean that in a 4-cube the spiketrain 0100 is 'more distant' from 0110 than from 0101, though both are at the same Hamming distance from it (see the sketch at the end of this section). When density estimation is performed in an unweighted Hamming space, the occurrence probability of a spiketrain (or of a label assignment to a spiketrain) is implicitly assumed to vary as a function of spike count. A generalized weighted-graph representation would help relax such assumptions.
5. Tests for conditions of optimal decoding by multiple linear decoders: Vertex-labelings for some Yes/No questions may allow optimal classification using multiple linear hyperplanes, rather than just one (the case of linear separability, which can be tested using the algorithm in Fig. 8). Determining the number of linear surfaces needed to correctly partition a vertex-labeling is a problem of considerable theoretical interest, and an algorithm for it would be a useful extension of the current work. We conjecture that the number of hyperplanes required to correctly partition a labeling is the same as the degree of the lowest-degree polynomial surface that can separate the labels correctly.
6. Improving efficiency and throughput using parallel computing: The algorithms introduced in this thesis for determining linear separability of an N-cube vertex-labeling (Chapter 3.1) and for estimating Linear Ideal Observers (Chapter 4.2) lend themselves to parallelization, which can facilitate analyses in the higher-dimensional response spaces that arise, for example, from the combined responses of large neural populations.
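Finally, as promised in item 4(ii) above, here is a minimal sketch of a "warped" Hamming distance under an assumed refractoriness-sensitive edge weighting (both the weighting rule and the penalty value are hypothetical, chosen only to illustrate the idea):

```python
def warped_distance(a: str, b: str, refractory_weight: float = 2.0) -> float:
    """Edge-weighted Hamming distance between binary spiketrain strings.
    A bit flip that lands a spike next to an existing spike in the target
    string (a refractoriness violation) costs refractory_weight; any other
    flip costs 1. The weighting is an assumption made for illustration."""
    assert len(a) == len(b)
    d = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            continue
        lands_on_spike = (y == "1")
        neighbor_spikes = (i > 0 and b[i - 1] == "1") or \
                          (i + 1 < len(b) and b[i + 1] == "1")
        d += refractory_weight if (lands_on_spike and neighbor_spikes) else 1.0
    return d

print(warped_distance("0100", "0110"))  # 2.0: new spike adjacent to an old one
print(warped_distance("0100", "0101"))  # 1.0: new spike is isolated
```

Both comparison spiketrains lie at Hamming distance 1 from 0100, but the warped distance separates them in the way the refractoriness example above requires.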
References
1. Gollisch, Tim. "Throwing a glance at the neural code: rapid information transmission in the
visual system." HFSP journal 3.1 (2009): 36-46.
2. Gollisch, Tim, and Markus Meister. "Rapid neural coding in the retina with relative spike
latencies." Science 319.5866 (2008): 1108-1111.
3. Zhang, Yifeng, et al. "The most numerous ganglion cell type of the mouse retina is a selective
feature detector." Proceedings of the National Academy of Sciences 109.36 (2012): E2391-
E2398.
4. Bialek, William, et al. "Reading a neural code." Science 252.5014 (1991): 1854-1857.
5. Shadlen, Michael N., et al. "A computational analysis of the relationship between neuronal and
behavioral responses to visual motion." The Journal of neuroscience 16.4 (1996): 1486-1510.
6. Dayan, P., & Abbott, L. F. (2003). Theoretical neuroscience: computational and mathematical
modeling of neural systems, pp 87-89
7. Parra, Lucas C., et al. "Spatiotemporal linear decoding of brain state." Signal Processing
Magazine, IEEE 25.1 (2008): 107-115.
8. Kamitani, Yukiyasu, and Frank Tong. "Decoding the visual and subjective contents of the
human brain." Nature neuroscience 8.5 (2005): 679-685.
9. Jacobs, Adam L., et al. "Ruling out and ruling in neural codes." Proceedings of the National
Academy of Sciences 106.14 (2009): 5936-5941.
10. Nirenberg, Sheila H., and Jonathan D. Victor. "Analyzing the activity of large populations of
neurons: how tractable is the problem?." Current Opinion in Neurobiology 17.4 (2007): 397-
400.
11. Van Rullen, R., & Thorpe, S. J. (2001). Rate coding versus temporal order coding: what the
retinal ganglion cells tell the visual cortex. Neural computation, 13(6), 1255-1283.
12. Hubel, David H., and Torsten N. Wiesel. "Receptive fields of single neurones in the cat's striate
cortex." The Journal of Physiology 148.3 (1959): 574.
13. Warland, David K., Pamela Reinagel, and Markus Meister. "Decoding visual information from a
population of retinal ganglion cells." Journal of Neurophysiology 78.5 (1997): 2336-2350.
14. Minsky, Marvin, and Seymour A. Papert. "Perceptrons, Expanded Edition: An Introduction to Computational Geometry."
15. Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are
universal approximators." Neural networks 2.5 (1989): 359-366.
16. Bishop, C. M. (2006). Pattern recognition and machine learning (Vol. 1). New York: Springer.
pp 38-40
17. Reinagel, P. (2001). How do visual neurons respond in the real world?. Current opinion in
Neurobiology, 11(4), 437-442.
18. Granot-Atedgi, E., Tkačik, G., Segev, R., & Schneidman, E. (2013). Stimulus-dependent
maximum entropy models of neural population codes. PLoS computational biology, 9(3),
e1002922.
19. Geisler, W. S., Najemnik, J., & Ing, A. D. (2009). Optimal stimulus encoders for natural tasks.
Journal of vision, 9(13), 17.
20. Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual
decision. Nature.
21. Sharpee, T. O. (2013). Computational identification of receptive fields. Annual review of
neuroscience, 36, 103-120.
22. Geisler, W. S. (2011). Contributions of ideal observer theory to vision research. Vision research,
51(7), 771-781.
23. Kriegeskorte, Nikolaus, and Rogier A. Kievit. "Representational geometry: integrating
cognition, computation, and the brain." Trends in cognitive sciences 17.8 (2013): 401-412.
24. Bishop, C. M. (2006). Pattern recognition and machine learning (Vol. 1). New York: Springer.
pp 179-220
25. Victor, Jonathan D., and Keith P. Purpura. "Metric-space analysis of spike trains: theory,
algorithms and application." Network: computation in neural systems 8.2 (1997): 127-164.
26. Reich, D. S., Mechler, F., & Victor, J. D. (2001). Temporal coding of contrast in primary visual
cortex: when, what, and why. Journal of Neurophysiology, 85(3), 1039-1050.
27. Victor, J. D. (2014). "Spike Train Distance". Encyclopedia of Computational Neuroscience, pp 1-8.
28. Victor, Jonathan D. "Spike train metrics." Current opinion in neurobiology 15.5 (2005): 585-592.
29. Carver, W. B. (1922). Systems of linear inequalities. Annals of mathematics, 212-220.
30. Bishop, C. M. (2006). Pattern recognition and machine learning (Vol. 1). New York: Springer.
pp 120-125
31. Cao, Xiwu, David K. Merwine, and Norberto M. Grzywacz. "Weakness of surround inhibition
with natural-image stimulation." Journal of Vision 6.13 (2006): 43-43.
32. Cao, Xiwu, David K. Merwine, and Norberto M. Grzywacz. "Dependence of the retinal
Ganglion cell's responses on local textures of natural scenes." Journal of vision 11.6 (2011): 11.
33. Pashler, Harold. "Familiarity and visual change detection." Perception & psychophysics 44.4
(1988): 369-378.
34. Graf, Arnulf BA, et al. "Decoding the activity of neuronal populations in macaque primary
visual cortex." Nature neuroscience 14.2 (2011): 239-245.
35. Grzywacz, N. M., et al. "Comparison of responses of retinal ganglion cells to natural and
artificial images." Proc. Eur. Conf. on Visual Perception (ECVP 2007). 2007.
36. Grzywacz, Norberto M., and A. L. Yuille. "A model for the estimate of local image velocity by
cells in the visual cortex." Proceedings of the Royal Society of London. B. Biological Sciences
239.1295 (1990): 129-161.
37. Tsai, Jeffrey J., and Jonathan D. Victor. "Reading a population code: a multi-scale neural model
for representing binocular disparity." Vision research 43.4 (2003): 445-466.
38. Marmarelis, Vasilis Z., and Xiao Zhao. "Volterra models and three-layer perceptrons." Neural
Networks, IEEE Transactions on 8.6 (1997): 1421-1433.
39. Zanos, Theodoros P., et al. "Nonlinear modeling of causal interrelationships in neuronal
ensembles." Neural Systems and Rehabilitation Engineering, IEEE Transactions on 16.4 (2008):
336-352.
40. Teeters, Jeffrey L., et al. "Data sharing for computational neuroscience." Neuroinformatics 6.1
(2008): 47-55.
41. Song, Dong, et al. "Extraction and restoration of hippocampal spatial memories with non-linear
dynamical modeling." Frontiers in Systems Neuroscience 8 (2014).
42. Pouget, Alexandre, Peter Dayan, and Richard S. Zemel. "Inference and computation with
population codes." Annual review of neuroscience 26.1 (2003): 381-410.
43. Bishop, C. M. (2006). Pattern recognition and machine learning (Vol. 1). New York: Springer.
pp 184-186
44. Mukherjee, Sach, and Terence P. Speed. "Network inference using informative priors."
Proceedings of the National Academy of Sciences 105.38 (2008): 14313-14318.
45. Hwang, Jenq-Neng, S-R. Lay, and Alan Lippman. "Nonparametric multivariate density
estimation: a comparative study." Signal Processing, IEEE Transactions on 42.10 (1994): 2795-
2810.