THE IMPORTANCE OF NOT BEING MEAN:
DFM – A NORM-REFERENCED DATA MODEL FOR FACE PATTERN
RECOGNITION
by
Lawrence Marc Kite
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2009
Copyright 2009 Lawrence Marc Kite
Dedication
To my wife Lisa, who, at long last, will no longer be married to a graduate student; your
love, your patience, your generosity, and your unstinting support have made this work
possible. Perhaps I should mention your patience again. And again.
To Olivia and Samuel, our beautiful children, you are my reason for getting out of bed in
the morning and my last thought before drifting off to sleep.
You fill my world with indescribable joy; I cannot imagine a life without you.
Acknowledgments
I would like to thank all of the people who have been instrumental to making this work
possible and without whom this document would not exist, most particularly Dr.
Christoph von der Malsburg, who gave me the gift of getting to do what I love. It has
been the uttermost privilege to have been his student. I have been enriched beyond
measure by my well-nigh decade-long participation as a member of his lab, the
Laboratory for Computational and Biological Vision at USC. His support, in every sense
of the word, has been beyond value – his generosity, epic.
I would also like to thank my lab mates: Kazunori Okada, Junmei Zhu, Xiangyu Tang,
Shuang Wu, Douglas Garrett and Viral Shah for the many enjoyable and spirited
conversations; I knew that I could always trust your judgment and intellectual integrity
and that we could be diametrically opposed on controversial topics of the day without
contention or incivility. Thanks to my family for your unwavering love and support.
Lastly, I wish to thank the professors and students who have helped me along the way
and from whom I learned so much about what it means to learn and practice science –
CvdM, Stefan Schaal, Laurent Itti, Bartlett Mel, George Bekey, Michael Arbib, Irving
Biederman, Jan Peters, Nathan Mundhenk, Alex Frasier, Aaron D’Souza, Vidhya
Navalpakkam, and countless others whom I have had the pleasure to know along the way.
To all, I extend heartfelt gratitude and thanks.
Table of Contents
Dedication ii
Acknowledgments iii
List of Tables vi
List of Figures ix
Abstract xvii
Chapter 1: Introduction 1
Chapter 2: Foundations 7
2.1 Data Representation 11
2.1.1 Significant Deviations from the Mean 13
2.1.2 Measures of Pattern Similarity 15
2.2 Maximizing Available Information 17
2.2.1 Information Entropy 17
2.2.2 Face Recognition as a Binary Classification Problem 20
2.3 Recognition Performance 27
2.3.1 Analysis of Faces 29
2.3.2 Relevance 31
2.3.2.1 Relevance Criterion I: Analysis of Variance 32
2.3.2.2 Relevance Criterion II: Leaving Out Groups of Features 35
2.3.2.3 Relevance Criterion III: Binary Channel Capacity 39
2.3.3 Computational Complexity 42
2.3.3.1 Time Complexity 43
2.3.3.2 Space Complexity 45
2.4 Literature Review 45
2.4.1 Norm-based Encoding 45
2.4.2 Related Face Recognition Methods 51
2.4.2.1 Gabor Transform Methods 51
2.4.2.2 PCA Methods 53
Chapter 3: The DFM Model 56
3.1 Formal Description of the Model 59
3.1.1 The Image and Model Layers 62
3.1.2 The Representation Layer 65
3.1.2.1 The Mean (Mu) Layer 66
3.1.2.2 The Competition (Chi) Layer 70
3.1.2.3 The Representation (Rho) Layer 75
3.2 DFM Similarity Measures 76
3.2.1 Hamming Distance on binary patterns 76
3.2.2 Maximize Mutual Information 77
3.3 Image Reconstructions from Binary Magnitudes and Phases 79
Chapter 4: Experiments 87
4.1 The Data Set 87
4.2 Experiments and Results 89
4.2.1 Design of Experiments 95
4.2.2 Performance Comparison – Floating Point Representations 96
4.2.3 b-string Construction – By Landmark or By Model Graph 100
4.2.4 Performance Comparison – All Representations 103
4.2.5 Comparison of Relevance Criteria 112
4.2.6 Recognition Experiments with All 48 Landmarks 114
4.2.7 Recognition Experiments with 24 Highest BCC Landmarks 116
Chapter 5: Discussion and Conclusion 118
Bibliography 120
Appendix 124
List of Tables
Table 1 Statistical summary of 1,920 Gabor features. The minimum and maximum
of the mean and standard deviation of the range of Gabor features. In total
there are 1,920 coefficients in a model graph. The table reflects the statistics
of more than three thousand model graphs. 9
Table 2 Summary of Data Representations. For floating point representations, the
last column denotes the source of the floating point coefficients and/or the
method by which the coefficients are derived from the original Gabor
Amplitudes (AM). For binary representations, the last column denotes the
set of floating point coefficients that inform the construction of the b-string. 12
Table 3 Summary of b-string Representations 14
Table 4 Six b-string Representations with equal pattern entropy. Representations
with a parameter setting of k=20 have active bits set “by landmark”.
Representations with a parameter setting of m=960 have active bits set “by
model graph”. 19
Table 5 Counts of image pairs in the fa and fb FERET sets belonging to each class
in the binary categorization problem. 22
Table 6 General Confusion Matrix for Binary Classifier 24
Table 7 DFM Confusion Matrix 25
Table 8 EBGM Confusion Matrix 26
Table 9 Normalized Z-Score Confusion Matrix 26
Table 10 True Positive Rates of various representations for a range of decision
threshold choices, which correspond to rates of false positives. For the same
false positive rate, both ZN and DFM offer significantly higher numbers of
true positives. 27
Table 11 A summary of differences in face and non-face object processing in the
brain. 46
Table 12 The data components of a model graph and the alternative representations. 82
Table 13 Feature Combinations for Image Reconstructions. An image can be
reconstructed from a model graph from the Gabor features, amplitudes and
phases, and the positions of the graph nodes, or fiducial points. 83
Table 14 Sub-categories of the FERET Database. Several groupings of the data
within the FERET fa and fb sets and corresponding frequency statistics. 88
Table 15 Data Representations and Similarity Criteria. A summary of the various
data representations, the similarity criteria employed to evaluate the
representation, and data from which they were derived. At the highest level,
all representations derive from the complex Gabor coefficients; we use only
the amplitudes, which we have denoted GA above. Similarity measures:
HD: fractional Hamming distance; MI: mutual information between
vectors; DP: dot product between vectors. In the last column, the codes
have these meanings: GA-μ: mean-subtracted Gabor amplitudes; Z(GA): z-
transformed Gabor amplitudes; Z(GA)>0.5: z-transformed Gabor
amplitudes in which transformed coefficients with z-scores greater than 0.5
are set to "active"; Z(GA), unit vectors: same as Z(GA) with jets of z-scores
normalized to unit length; the final code: same as Z(GA), unit vectors, but
where the absolute values of the z-scores were taken. 90
Table 16 Summary of floating point data representations 97
Table 17 Recognition rate percentages of measures that do not depend on the
choice of k (number of active bits per landmark) or m (the number of active
bits over the model graph.) These figures establish baseline measures
against which the other criteria should be judged. The last row in this table
establishes the baseline of baselines. It is the standard cosine measure used
in Elastic Bunch Graph Matching. Note that five of seven of the new
criteria beat the current standard similarity measurement in terms of
recognition performance. RPI = Recognition Performance Increase. 114
Table 18 Recognition Performance Summary of DFM by Landmark. The first
column shows the number of active features per landmark, i.e., the number
of bits among the forty for each landmark that will have a value of “1”. The
various similarity criteria are presented in Table 15. 114
Table 19 Recognition Performance Summary of DFM by Model graph. As above
the first column shows the number of bits in the binary-encoded string that
take a value of “1”. The active bits are selected from all 1,920 features of
the model graph without reference to the landmark with which they are
associated. 115
Table 20 Same information as in Table 18, above, recast as a reduction in error rate.
See caption of next table for more precise formulation. 115
Table 21 The same table as Table 19 recast in terms of percent increase in
recognition performance. 115
Table 22 Recognition Performance Summary of DFM by Landmark. The first
column shows the number of active features per landmark, i.e., the number
of bits among the forty for each landmark that will have a value of “1”. The
various similarity criteria are presented in Table 15. 116
Table 23 Recognition Performance Summary of DFM by Model graph. As above
the first column shows the number of bits in the binary-encoded string that
take a value of “1”. The active bits are selected from all 1,920 features of
the model graph without reference to the landmark with which they are
associated. 116
Table 24 Same information as in Table 22 above recast as a reduction in error rate.
Alternatively the data in the table can be interpreted as a percentage
increase in recognition performance. See caption of next table for more
precise formulation. 116
Table 25 The same table as Table 23 recast in terms of percent increase in
recognition performance. More precisely this is a measure of reduction in
recognition error rate: 117
List of Figures
Figure 1 Recognition experiment on a large data set (700 image pairs) using binary
strings to represent faces. The different curves correspond to a number of
criteria for selecting which bits are “active” (set to “1”). For one encoding, a
binary face representation in which only 152 out of 1,920 bits are active,
less than 10% of the bits, matches the performance of Elastic Bunch Graph
Matching. For a wide range of choices of the parameter, m, which controls
the number of active bits in a string composed by model graph (see section
3.1.2.2.2 Composition By Model Graph), recognition performance is
superior to EBGM. The choice of m offers great flexibility, further
supporting the notion that patterns of deviation are more important than the
deviations themselves. The minimum of the best encoding occurs at a point
(m = 964) close to where the greatest number of different binary strings are
possible (m = 960). At m = 960, the number of possible b-strings, C(1920, 960),
is at its maximum. The curves with the lowest error rates correspond to
encodings of z-scores and mean-subtracted coefficient values. 6
Figure 2 Pixel intensity representation of the means of all 1,920 Gabor coefficients
over all fa and fb images in the FERET database. The coefficients are
displayed such that each row of pixels represents a single Gabor jet at one
landmark. In each row of forty pixels, successive groups of eight pixels
comprise the eight orientations at successively decreasing spatial frequency
– the first eight pixels are means of all orientations at the highest frequency,
etc. In a similar manner, each column of pixels comprises the mean
coefficient values from a single kernel at each of forty-eight fiducial points. 10
Figure 3 After subtracting the mean, or taking a z-score, the top deviations from the
mean are identified; corresponding bits in a b-string are set to “1,” or
“activated.” 15
Figure 4 Binary Class Histograms for Elastic Bunch Graph Matching. The
histograms reflect the distributions of similarity scores for all “different”
images (blue) and all “same” images (green) in the fa and fb FERET sets.
A decision threshold was chosen where the histograms cross. The ROC
Curve in Figure 7 shows that this is close to optimal. 23
Figure 5 Class histograms of similarities for DFM. Histograms reflect the
distribution of Hamming distances for all “same” image pairs and all
“different” image pairs in the fa and fb FERET sets. Note that the DFM
representation increases the inter-class distance for the binary classification
problem. 23
Figure 6 Binary class histograms of dot product similarities between vectors of
normalized z-scores (ZN), in which the jets of z-scores are normalized to
unit length. Absolute values were not taken during normalization. Note the
average of the “different” histogram near zero. On average, “different”
images have offsetting positive and negative coefficients. “Same” images,
with correlated coefficients, yield a positive dot product. 24
Figure 7 ROC curve for the binary face classification problem. Over nearly the
entire range of threshold choice, DFM, a binary representation, and NZ
(sometimes denoted Zn or ZN), a floating point representation, offer
improved discrimination over EBGM. 25
Figure 8 ANOVA F Matrices. Lighter values indicate higher f-score, indicating the
power of a coefficient to discriminate between inter- and intra-class
exemplars. Left: Gabor Magnitudes. Middle: Normalized z-scores of Gabor
magnitudes. Right: Binary DFM strings constructed by model graph
(m=960). The qualitative similarity between the three images supports the
claim that information is preserved during the transformation from a
floating point to a binary representation. In each intensity image, each row
represents a single jet; each column represents a single Gabor kernel. Each
jet comprises five frequency level groups, each of which comprises eight
orientations. The highest frequency group is at the far left; the lowest group
is at the far right. Thus, columns one through eight are the same spatial
frequency (the highest). Column 1 is a vertically oriented kernel; successive
columns, through column eight, correspond to incremental rotations of the
kernel. The next group begins in column 9, comprising eight orientations at
the next lowest frequency level, etc. 33
Figure 9 Relevance of Landmarks – Analysis of Variance (ANOVA) on three
representations. ANOVA was calculated for each feature, resulting in 1,920
univariate analyses. F-scores for each feature were calculated (see Figure 8)
and the sum over the f-scores at each landmark was taken. The 50 % most
relevant landmarks are represented by green circles; the least relevant
landmarks are shown in magenta. The size of the circle is indicative of
relative relevance and is not proportional to the f-score in the strictest sense.
Left: ANOVA on Gabor Magnitudes as in (Kalocsai, Neven et al. 1998;
Kalocsai, von der Malsburg et al. 2000; Kalocsai 2004). Middle: ANOVA
on normalized z-scores; Right: ANOVA on Binary DFM string constructed
by model graph (m=960). Again, the qualitative similarity between the three
images supports our claim that information is preserved during the
transformation from floating point to binary. 34
Figure 10 Showing the effects of leaving out coefficients corresponding to kernels
of each spatial frequency. As in the figures below, 200 recognition
experiments on data sets of 400 image pairs were conducted for each of the
five spatial frequencies. The graph shows clearly that lower frequencies are
more relevant for recognition, all else remaining equal. Leaving out the
highest frequency has a small, yet still significant effect on recognition
performance; the light blue line is the performance of EBGM; the dark blue
curve is the mean performance obtained by leaving out the spatial frequency
denoted in the abscissa. 36
Figure 11 Showing the effects of leaving out coefficients corresponding to kernels
of each spatial orientation, rotated in increments of 22.5° (π/8) starting with the
vertical. This figure shows that the vertical (orientation 1) and horizontal
orientations (orientation 5) are most relevant to face recognition, with the
vertical orientation being most relevant. None of the orientations, however,
are so irrelevant to recognition that one would want to exclude their
coefficients entirely from the similarity calculation. The x-axis denotes the
orientation number; the y-axis denotes the mean recognition performance
obtained by leaving out all coefficients of the corresponding spatial
orientation. 37
Figure 12 Showing the effects on recognition rates of leaving out all features from
each landmark. The middle plot represents the mean recognition rate of 200
individual recognition experiments for each landmark; the straight line
represents the recognition rate of EBGM algorithm on the same data sets.
Sets of 400 randomly selected image pairs were used for each individual
experiment. The data sets were pre-selected so that the experiments for
leaving out each landmark were performed using the same 200 data sets.
The curves above and below the central curves are analogous to error bars;
they represent one standard deviation above and below the central curve.
Even though it seems that the signal is lost in the noise, the ranking of the
landmarks, when sorted by mean recognition rate, is qualitatively similar to
the other methods used to estimate relevance. 38
Figure 13 Graphical depiction of relative importance (relevance) of landmarks
according to “leave one out” empirical experiments. Top 50% most relevant
landmarks are in green; bottom 50% in yellow. Circle size is indicative of
relative importance only and is not strictly proportional to any particular
measurement. 39
Figure 14 Schematic diagram of a binary symmetric channel. The capacity of a
binary symmetric channel with crossover probability p is
C = 1 + p log2(p) + (1 − p) log2(1 − p). 40
Figure 15 Image of feature channel capacities represented as intensity values,
corresponding to bits/transmission. 41
Figure 16 TOP: the total channel capacity of the “landmark channels”. Compare
this plot to the BOTTOM, a similar plot of the total F-scores for each
landmark from an ANOVA on the individual features (coefficients) of the
Gabor transform. 42
Figure 17 Showing the relevance of landmarks, considering each landmark as a set
of binary channels. There is one channel for each feature, 40 for each
landmark. The total channel capacity for a landmark is calculated by
summing over the set of channels corresponding to the features attributable
to it. In other words, there is a binary channel corresponding to each Gabor
kernel in a jet; the landmark capacity is the sum of the capacities of these
channels. The landmark sums are sorted and the smallest circle radius is
assigned to the landmark with least channel capacity. The size of the circles
is only indicative of relative importance and is not proportional to the actual
channel capacity of the landmark. Green circles indicate that the landmark
is among the most relevant half of landmarks. Consequently, the magenta
circles indicate that the landmark is among the least relevant half. 43
Figure 18 The first one hundred Eigenfaces from the fa and fb sets of the FERET
database. Image alignment was done automatically using the landmarks
located by the EBGM algorithm. Note that this places EBGM and
Eigenfaces on an equal footing with regard to registration of image features.
Experiments show that EBGM, and consequently DFM, are much more
robust to incidental misalignments than the Eigenface algorithm. 54
Figure 19 RMS errors from image reconstructions using the top n Eigenfaces. 55
Figure 20 Processing of a face image starts at lower left. During the EGBM phase,
a model graph is extracted from a Gabor-transformed image. Feature
vectors (Jets) are extracted from selected fiducial points. Jets become node
labels in a model graph structure. Mean coefficient values (μ layer) are
subtracted from the model graph coefficients in the μ layer; in the χ layer a
competition mechanism, which acts on coefficients grouped by landmark or
model graph, identifies the coefficients with the most significant deviations
from the mean. In the ρ layer, units corresponding to the most significant
deviations are activated; the pattern of activations can be stored in, for
example, a b-string. 57
Figure 21 Schematic showing flow of processing through layers of the DFM model. 60
Figure 22 After a model graph is extracted from an image using EBGM each
landmark contains a “jet” of forty complex coefficients. We consider only
the magnitude of the complex coefficient, represented in polar form. 61
Figure 23 The coefficients are marshaled. They will be compared to the mean and
sorted by landmark or by model graph. 61
Figure 24 From floating point coefficients to b-string bits. After coefficients are
marshaled, the mean is subtracted, yielding a vector of deviations from the
mean. Next, the most significant deviations are identified. In the last step,
the bits of the most significant deviations are “activated” (set to “1”). In this
illustration, the selection process could be either “by landmark” or “by
model graph”; the circles, which denote coefficients, are illustrative of the
process only. 62
Figure 25 Images of Face Representations. At left, colors stand in for Gabor
magnitudes. Cooler colors are lower values; warmer colors are higher
values. At right, a binary representation of the same “image”. White squares
indicate “active” bits (1); dark squares denote “inactive” bits (0). 62
Figure 26 Average Faces. A new set of complex coefficients for each average face
was calculated from the images in the dataset. New magnitudes were taken
from the arithmetic mean of magnitudes from a subset of images in the
database; similarly, circular statistics were used to calculate the average
phase. From left: the average male, the overall average, and the average
female. 69
Figure 27 More Average Faces – from left: average Caucasian, average Asian,
average African. 70
Figure 28 Image reconstructions from model graphs. In this figure one should read
"Mean Features" as "Mean Magnitudes". Similarly, "Subject Features"
becomes "Subject Magnitudes." The phase information is treated on its own
axis. The image is a projection, or squashing, of the three-dimensional binary
parameter space onto two dimensions. On the left side are reconstructions using
all features or their alternate representations. On the right side are
reconstructions in which half of the magnitudes and phases, or their
alternates, have been masked by a DFM b-string. Reconstructions are
performed for each of the eight possible combinations of veridical /
alternative features. The reconstruction that uses only the alternate
representations, i.e., no veridical data (third row, right side in each half of
the figure), is still readily recognizable as the same person. This is the case
even when half of the alternate data is removed (third row, far right). 84
Figure 29 Results of recognition experiments using only the phase components of
the complex Gabor coefficients. 86
Figure 30 Recognition experiment on a large data set (700 image pairs) using
binary strings to represent faces. 94
Figure 31 An alternative coding strategy based on z-scores. In this representation,
bits are activated if the corresponding coefficients have z-scores greater
than some value, z. Experiments show that this representation is poor; only
a setting of z=0.5 yields recognition performance that equals EBGM. 95
Figure 32 Mean Error rates for five floating point data representations at
cardinalities ranging from 100 to 900. Shorter bars indicate lower
recognition error rates and better recognition performance. 98
Figure 33 Standard deviations of error rates for five floating point data
representations at cardinalities ranging from 100 to 900. Shorter bars
indicate lower variance in the error rates of individual recognition tests. 99
Figure 34 Top: Recognition Performance of DFM with 48 landmarks. Active Bits
selected “By Landmark”. Landmarks removed were determined to have the
least channel capacity, where each of 1,920 binary features was modeled as
a binary symmetric channel. Shorter bars are better, indicating higher
recognition rates and lower recognition errors. Horizontal dotted lines
depict the baseline performance of floating point representations. They are
(from top): 1) EBGM results, 2) EBGM with landmarks removed, 3)
Normalized z-scores, all landmarks; and 4) Normalized z-scores, landmarks
removed. Note that the blue and red lines in this figure represent the same
values as the blue and green lines in Figure 34. With only the most relevant
landmarks retained, only the voting method can match representation Zn.
All of the binary representations best the performance of Zn using all
landmarks, for some parameter settings. 102
Figure 35 Top: Recognition Performance of DFM with 24 of 48 landmarks
removed. Active Bits selected “By Landmark”. Landmarks removed were
determined to have the least channel capacity, where each of 1,920 binary
features was modeled as a binary symmetric channel. Shorter bars are
better, indicating higher recognition rates / lower recognition errors.
Horizontal dotted lines depict the baseline performance of floating point
representations. They are (from top): 1) EBGM results, 2) EBGM with
landmarks removed, 3) Normalized z-scores, all landmarks; and 4)
Normalized z-scores, landmarks removed. Note that the blue and red lines
in this figure represent the same values as the blue and green lines in Figure
34. 103
Figure 36 A comparison of recognition performance of floating point and binary
representations. The binary representations, except as noted below, are
constructed “by landmark” using all forty-eight landmarks. 108
Figure 37 A comparison of recognition performance of floating point and binary
representations. The binary representations, except as noted below, are
constructed “by model graph” using all forty-eight landmarks. 109
Figure 38 A comparison of recognition performance of floating point and binary
representations using the most relevant landmarks. The binary
representations, except as noted below, are constructed “by landmark” using
the twenty-four landmarks with the highest binary channel capacity. The
floating point representations (cols. 8-13) and the binary representations
ZGT (cols. 4 and 7), indicated in the table directly above this caption with
lighter shading, are not dependent upon any choice of k and/or m. The
binary representations are indicated with darker shading. The four
horizontal dotted lines represent the recognition performance of (from top
to bottom): EBGM using all 48 landmarks, EBGM using the best 24
landmarks as gauged by the binary channel capacity relevance measure,
Normalized z-scores using all landmarks, and normalized z-scores using the
best 24 landmarks as above. The blue and red lines in this figure represent
the same values as the blue and green lines in Figure 36 and Figure 37.
Tables summarizing these results will follow. 110
Figure 39 A comparison of recognition performance of floating point and binary
representations. The binary representations, except as noted below, are
constructed “by model graph” using the twenty-four most relevant
landmarks as determined by the binary channel capacity relevance criterion.
The floating point representations (cols. 8-13) and the binary
representations ZGT (cols. 4 and 7), indicated in the table below the figure
with lighter shading, are not dependent upon any choice of k and/or m. The
binary representations are indicated with darker shading. The four
horizontal dotted lines represent the recognition performance of (from top
to bottom): EBGM using all 48 landmarks, EBGM using the best 24
landmarks as gauged by the binary channel capacity relevance measure,
Normalized z-scores using all landmarks, and normalized z-scores using the
best 24 landmarks as above. The blue (top) and red (third from top) lines in
this figure represent the same values as the blue (top) and green (bottom)
lines in Figure 36 and Figure 37. Tables summarizing these results will
follow. 111
Figure 40 Comparison of Relevance Criteria for DFM b-strings constructed by
Model Graph (number of active bits = 960) 112
Figure 41 Comparison of Relevance Criteria for DFM b-strings constructed by
Landmark (number of active bits per landmark = 20; over the whole graph
per landmark = 960 bits). 113
Abstract
A successful, mature system for face recognition, Elastic Bunch Graph Matching,
represents a human face as a graph in which nodes are labeled with double precision
floating-point vectors called “jets”. Each jet in a model graph comprises the responses at
one fiducial point, or face landmark, of a convolution of the image with a set of self-
similar Gabor wavelets of various orientations and spatial scales. Gabor wavelets are
scientifically reasonable models for the receptive field profiles of simple cells in early
visual cortex. Heretofore, the recognition process simply searched for the stored model
graph with the greatest total jet-similarity to a presented image graph. The most widely
used measure of jet similarity is the sum over the graph of the dot-products of jets
normalized to unit length. We improve significantly upon this system, with orders of
magnitude improvements in time and space complexity and marked reductions in
recognition error rates. We accomplish these improvements by recasting the
concatenated vector of model-graph jets as a binary string, or b-string, comprising bits
with one-to-one correspondence to the floating-point coefficients in the model graph. The
b-string roughly models a pattern of correlated firing among a population of idealized
neurons. The “on” bits of the b-string correspond to the identities of the coefficients that
deviate the greatest amount from the corresponding mean coefficient values. We show
that this simple recoding consistently reduces recognition error rates by margins
exceeding thirty percent. Our investigations support the hypothesis that the b-string
representation for faces is extremely efficient and, ultimately, information preserving.
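The baseline jet-similarity measure described above, the sum over the graph of the dot products of unit-length-normalized jets, can be sketched as follows. This is a minimal illustration, not the original implementation; the function name and the 48-landmark-by-40-coefficient array layout are assumptions drawn from the figures in this thesis:

```python
import numpy as np

def ebgm_similarity(graph_a, graph_b):
    """Sum, over all landmarks, of the dot products of jets
    normalized to unit length (the standard cosine measure).

    graph_a, graph_b: arrays of shape (48, 40) -- one 40-coefficient
    jet of Gabor amplitudes per fiducial point.
    """
    total = 0.0
    for jet_a, jet_b in zip(graph_a, graph_b):
        norm = np.linalg.norm(jet_a) * np.linalg.norm(jet_b)
        if norm > 0:
            total += float(np.dot(jet_a, jet_b)) / norm
    return total
```

Because each jet is normalized to unit length before the dot product, the measure is insensitive to overall contrast differences between the two images.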
Chapter 1: Introduction
In this thesis, we present a model for the representation of human faces from two-
dimensional images for face recognition. Underpinning our model is the notion of the
centrality of the mean values of a set of feature instances to the representation and
recognition of human faces. Recent neurophysiologic studies provide evidence that the
brain represents the human face not in absolute terms, but relative to an “average face”
(Leopold, O'Toole et al. 2001), which is learned with experience. To a rough
approximation, moving along an axis in face space passing from the average, or “mean”,
face toward an observed face correlates with an increase in perceived “identity strength.”
Faces that deviate from the mean tend to be more distinguishable than those closer to the
mean. We hypothesize that a face representation for computer vision may be derived
solely from the pattern of deviations of local features from accumulated mean feature
values. We call this the “DFM model”, “the Deviations from the Mean model”, or
simply “DFM”.
The key components of the DFM model comprise a data representation and a similarity
function for comparing data instances. We will demonstrate that the model, founded upon
the centrality of patterns of deviation from the mean, is extremely efficient and flexible;
we will have occasion to consider representations comprising either floating point
features or binary features, and we will examine a number of similarity measures
pertaining to each data format. In all cases, the representations that derive from
deviations from the mean achieve significantly lower recognition error rates than those
that do not. Furthermore, the binary data representations achieve reductions in both time
and space complexity on a scale significantly greater than one order of magnitude. We
will show that a DFM binary string (a b-string) of length 1,920 achieves recognition
parity with the Elastic Bunch Graph Matching algorithm when only 150 or so of the
1,920 bits are set to a value of one (“active bits”), all other bits retaining the uninitialized
value of zero. In this thesis, when we speak of the DFM model we refer to the b-string
representation, unless otherwise noted.
DFM represents a face as a pattern of “activations” over a population of idealized
neurons. Neuronal activations in DFM are binary; a neuron is either active or it is not.
The active bits in DFM constitute a set of “firing” neurons, itself a subset of the full
population of face coding neurons. We hypothesize that two different face images of a
single individual will produce patterns of activation that are closer (in the sense of
Hamming distance or an alternative suitable similarity measure) to each other than to
patterns produced by images of other individuals.[1] This is, of course, the goal of any
pattern categorization problem: to find a model that tends to maximize inter-class
distances while minimizing intra-class distances. DFM achieves this goal in a simple,
elegant manner. We will show that the determination of “match” between two DFM
patterns reduces to the magnitude of failure of a test of statistical independence
(Daugman 2003).

[1] In fact, we do not calculate the actual Hamming distance, which is the number of 1’s in the XOR operation on two binary strings. Instead, we compute the number of 1’s in the AND operation, i.e., the extent of the overlap between two patterns of neuronal activation, primarily because the similarity of all pairs drawn from two sets of binary strings reduces to a single matrix multiplication.

The DFM representation is analogous to the iris representation employed by Daugman in his system for biometric authentication based on the uniqueness of human iris patterns, with some important distinctions. In short, both
representations begin with vectors of complex-valued coefficients of a Gabor transformed
image, which are subsequently demodulated to form a binary string. Each complex
coefficient, or its magnitude and/or phase component, becomes one or two[2] bits in a
binary representation string, which we call a “b-string”. Interpreting a b-string as a
sequence of Bernoulli trials, and assuming for the moment that any bit is assigned a one
or zero with equal probability, then strings derived from images of different individuals
should differ in roughly half of their bits. Conversely, strings from images of the same
person may differ in perhaps thirty percent of their bits. This statistical dependence
provides compelling evidence for a match.
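As a concrete illustration of this decision rule, the following Python sketch (illustrative only; it treats b-strings as arbitrary-precision integers, which is not how our system stores them, and the noise rate is chosen arbitrarily) compares the fractional Hamming distance of an unrelated pair of strings with that of a noisy copy:

```python
import random

def fractional_hamming(a: int, b: int, n_bits: int) -> float:
    """Fraction of the n_bits positions in which strings a and b differ."""
    return bin(a ^ b).count("1") / n_bits

random.seed(0)
n = 1920
x = random.getrandbits(n)    # one "individual"
y = random.getrandbits(n)    # an unrelated "individual"
# Flip each bit of x with probability 1/8 to simulate a second image of
# the same person (the 1/8 rate is illustrative only).
noise = random.getrandbits(n) & random.getrandbits(n) & random.getrandbits(n)
x2 = x ^ noise

assert abs(fractional_hamming(x, y, n) - 0.5) < 0.07   # ~half the bits differ
assert fractional_hamming(x, x2, n) < 0.2              # far fewer bits differ
```

An observed fractional distance well below one half is thus a failure of the test of statistical independence, and evidence for a match.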
The DFM b-string face representation succeeds ultimately because it preserves the
information necessary to perform statistically meaningful comparisons between human
faces. Further, the mechanism that ensures that only neurons with coefficients that
deviate significantly from the mean are activated reduces distracting information that
often contributes to recognition errors yet retains the information that allows faces to be distinguished.

[2] Daugman uses a two-bit demodulation code for the phase component of each Gabor coefficient. Gabor coefficients are densely sampled at every pixel location of an iris image. Each phase component is assigned two gray-coded bits, which indicate the quadrant of the complex plane in which the phase component resides. In contrast, DFM uses only the Gabor magnitudes from a sparsely sampled Gabor-transformed face image. The sample points coincide with fiducial points, which are automatically located in the face image, and the vectors of Gabor coefficients are encapsulated in a graph structure called a Model Graph. Bits in a DFM string, which are in one-to-one correspondence with Gabor coefficients in the model graph, are assigned a value based on whether the observed Gabor magnitude is among the most significant deviations from the mean coefficient value. DFM bits/neurons are in competition with each other. Unlike the Daugman representation, the value of a DFM bit cannot be determined by considering a single Gabor magnitude. The semantics of an active DFM neuron connote that the corresponding magnitude is in some sense distinctive.

At the heart of DFM is an implicit dimensionality reduction tailored to
retain information required to distinguish between subordinate-level instances. This
stands in contrast to a dimensionality reduction algorithm like Principal Components
Analysis (PCA), the cornerstone of the Eigenfaces representation, which focuses on
capturing the information that accounts for most of the variance among all base-level
class instances.
In this thesis we address the following questions:
1. How can information present in an image of a human face be used to greatest
advantage?
a. How can a data representation maximize available information?
b. How can this information be quantified, either in absolute or relative
terms?
c. What information is “relevant” or helpful to the face recognition task?
d. What information serves as a hindrance to accurate recognition?
2. Given a data representation, the product of a process acting as a consumer of
information, what measure or measures of pattern similarity maximally
leverage the aggregated information?
3. How does the choice of a data representation and a similarity measure affect
recognition performance?
In the pages that follow we will investigate these and other related questions. We will
demonstrate that there are several representations and several measures of similarity, all
of which rely on deviations of coefficients from mean feature values. These “norm-
referenced” representations improve greatly upon the already successful and mature
Elastic Bunch Graph Matching Algorithm. Most importantly, we will show that in the
context of such norm-referenced representations, a transformation of a floating point
vector to a binary vector of the same length is information preserving. We demonstrate
the validity of this claim with statistical / information theoretic analyses of the underlying
patterns and through empirical face recognition experiments.
From investigations guided by the foregoing, we conclude, among other things, that:
1. The existing Elastic Bunch Graph Algorithm can be improved by performing
a z-transform on coefficients. The mean feature values can be computed from
stored data or can be accumulated as new data arrives. (Knuth 1997)
2. The fact that a binary pattern preserves the information present in a floating
point vector of Gabor magnitudes comprising thirty-two or sixty-four times as
much data supports the notion that a face can be represented by a binary
pattern which, we argue, is analogous to correlated firing/activation of a
subpopulation of face-selective neurons.
3. Recognition performance can be improved dramatically by retaining
information from merely half of the landmarks currently in use by the Elastic
Bunch Graph algorithm.
4. For the purposes of face recognition, a b-string of length 1,920 bits with only
approximately 150 “active” bits yields the same average performance as a
model graph comprising 1,920 floating point coefficients. See Figure 1.
5. A pattern of significant deviations from a stored mean is more diagnostic of
face identity than a model graph of Gabor coefficients.
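Conclusion 1 notes that mean feature values can be accumulated as new data arrives. The standard incremental update (the recurrence given by Knuth, here sketched in Python; the class name is ours) maintains a running mean and variance for one feature channel without storing the data:

```python
class RunningStats:
    """Incrementally accumulate the mean and variance of one feature channel
    (Welford's recurrence, as described in Knuth, TAOCP vol. 2)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [0.1, 0.4, 0.25, 0.3]:
    stats.update(x)

assert abs(stats.mean - 0.2625) < 1e-12       # mean of the four samples
assert abs(stats.variance() - 0.015625) < 1e-9
```

One such accumulator per coefficient suffices to supply the stored means (and standard deviations, for z-scores) as the gallery grows.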
Figure 1 Recognition experiment on a large data set (700 image pairs) using binary strings to represent faces. The different curves correspond to a number of criteria for selecting which bits are “active” (set to “1”). For one encoding, a binary face representation in which only 152 out of 1,920 bits are active, less than 10% of the bits, matches the performance of Elastic Bunch Graph Matching. For a wide range of choices of the parameter m, which controls the number of active bits in a string composed by model graph (see section 3.1.2.2.2 Composition By Model Graph), recognition performance is superior to EBGM. The choice of m offers great flexibility, further supporting the notion that patterns of deviation are more important than the deviations themselves. The minimum of the best encoding occurs at a point (m = 964) close to where the greatest number of different binary strings are possible (m = 960). At m = 960, the number of possible b-strings, the binomial coefficient C(1920, 960), is at its maximum. The curves with the lowest error rates correspond to encodings of z-scores and mean-subtracted coefficient values.
Chapter 2: Foundations
This research builds upon the foundation erected by Wiskott et al. in (Wiskott, Fellous et
al. 1997), in which the authors introduced the Elastic Bunch Graph Matching (EBGM)
algorithm for face detection, representation, and recognition. The EBGM algorithm was
motivated by the Dynamic Link Matching (DLM) algorithm (Lades, Vorbruggen et al.
1993; Wiskott and von der Malsburg 1996), which is an instance of the Dynamic Link
Architecture (von der Malsburg 1985). DLM is a biologically inspired algorithm for
recognizing objects by finding correspondences between a Gabor-transformed image and
a collection of stored models. DLM was successful but converged too slowly to be of
use in practical applications. EBGM, as applied to the object class of human faces,
replaced DLM’s slowly converging correspondence-finding process with a biphasic
procedure. In the first phase, a face is detected in the image by scanning over the image a
so-called Bunch Graph – a graph structure comprising an aggregation of knowledge
about human faces – and, subject to geometrical constraints imposed by the graph
structure, locating the positions at which the fiducial points of the bunch graph register
the greatest similarity to the subject image. Having located the most likely positions of
the fiducial points, a model graph is extracted from the image. The model graph consists
of nodes, each of which is labeled with a vector of Gabor features called a Jet. An edge in
the graph encodes the relative position of the node to its neighbors. A Gabor jet
comprises the complex-valued coefficients of the Gabor transform – a convolution of the
image with a family of wavelets at a fixed number of orientations and spatial scales. The
jet can be interpreted as an analogue to the responses of a macro-column of simple cells
in primary visual cortex. To facilitate face recognition, each jet is normalized to unit
length; the phase components are generally ignored for the face recognition procedure.
The reasons for this are discussed in some detail in (Shams and von der Malsburg 2002).
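As a rough sketch of the jet extraction just described (illustrative only: the kernel size, the Gaussian width, and the omission of the DC-compensation term are simplifying assumptions, not the parameters of the actual system; the half-octave frequency spacing and eight orientations follow the EBGM convention):

```python
import cmath
import math
import random

def gabor_kernel(k: float, theta: float, size: int = 16, sigma: float = 2 * math.pi):
    """Complex Gabor wavelet sampled on a size x size patch
    (DC-compensation term omitted for brevity)."""
    kx, ky = k * math.cos(theta), k * math.sin(theta)
    half = size // 2
    kern = {}
    for y in range(-half, half):
        for x in range(-half, half):
            r2 = x * x + y * y
            envelope = (k * k / sigma ** 2) * math.exp(-k * k * r2 / (2 * sigma ** 2))
            kern[(x, y)] = envelope * cmath.exp(1j * (kx * x + ky * y))
    return kern

def jet(image, cx, cy, n_orient=8, n_scale=5):
    """Jet of Gabor magnitudes at fiducial point (cx, cy): n_scale * n_orient
    coefficients, normalized to unit length."""
    coeffs = []
    for s in range(n_scale):
        k = math.pi * 2 ** (-(s + 2) / 2)          # half-octave spacing
        for o in range(n_orient):
            theta = o * math.pi / n_orient
            kern = gabor_kernel(k, theta)
            c = sum(v * image.get((cx + x, cy + y), 0.0)
                    for (x, y), v in kern.items())  # convolution at one point
            coeffs.append(abs(c))                   # keep magnitude, drop phase
    norm = math.sqrt(sum(a * a for a in coeffs)) or 1.0
    return [a / norm for a in coeffs]

random.seed(1)
img = {(x, y): random.random() for x in range(32) for y in range(32)}
j = jet(img, 16, 16)
assert len(j) == 40
assert abs(sum(a * a for a in j) - 1.0) < 1e-9
```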
In the second phase, the EBGM algorithm seeks a match between the model graph
extracted from the image, the “probe” graph, and each graph in a collection of stored
model graphs (the “database”). The matching procedure is extremely simple: a similarity
function is applied to model graph pairs; the probe graph is compared to each stored
model graph. The stored graph with the highest similarity value is deemed the best match.
The canonical similarity function is the dot-product of two graphs, i.e., concatenations of
the unit vectors of amplitudes of each graph. This choice of function has great intuitive
appeal as it reflects the cosine of the angle between the vectors: identical vectors register
a similarity of one, the cosine of an angle of zero degrees; and orthogonal vectors register
similarity of zero. In practice, however, two jets are never orthogonal; all of the
coefficients are positive numbers, and similarity values, even for vastly different faces,
are rarely, if ever, less than 0.5.
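A minimal sketch of this matching procedure (function names are ours; the real implementation operates on model graph structures rather than flat coefficient lists):

```python
import math

def cosine_similarity(probe, model):
    """Normalized dot product of two concatenations of jet amplitudes."""
    dot = sum(p * m for p, m in zip(probe, model))
    norm_p = math.sqrt(sum(p * p for p in probe))
    norm_m = math.sqrt(sum(m * m for m in model))
    return dot / (norm_p * norm_m)

def best_match(probe, database):
    """Index of the stored graph with the highest similarity to the probe."""
    return max(range(len(database)),
               key=lambda i: cosine_similarity(probe, database[i]))

probe = [0.2, 0.5, 0.1, 0.4]
db = [[0.21, 0.48, 0.12, 0.41],   # near-duplicate of the probe
      [0.90, 0.05, 0.60, 0.01]]   # a very different vector
assert best_match(probe, db) == 0
assert abs(cosine_similarity(probe, probe) - 1.0) < 1e-12
```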
The normalized dot product, or “Cosine”, similarity function has proved to be successful
in practice; it is simple to implement and is relatively computationally efficient,
particularly on processors optimized for vector processing. It pays a price, however, for
this simplicity. Its primary weakness is its failure to make use of all of the information
intrinsic to the Gabor feature representation. Consider Table 1, below. In Table 1 we see
summarized the variability of the means and standard deviations of the 1,920 Gabor
features constituting a model graph. In the column under the “Mean” heading, we see the
minimum and maximum of all of the mean feature values over the complete data set.
Similarly, in the “Standard Deviation” column we see the minimum and maximum
standard deviation of all of the individual feature standard deviations. Clearly, a system
built on Gabors must contend with features possessing many degrees of statistical
variability. In the context of a calculation of pattern similarity, the cosine similarity
function seems to be less than ideally suited to cope with the statistical variability of the
Gabor features.[3] The most obvious weakness is that the Cosine similarity function is a
simple dot product; all features contribute equally to the similarity calculation even
though each feature does not contribute equally to a determination of identity.

Gabor Magnitudes      Mean        Standard Deviation
Min                   0.016509    0.016891
Max                   0.43974     0.18808

Table 1 Statistical summary of 1,920 Gabor features. The minimum and maximum of the mean and standard
deviation of the range of Gabor features. In total there are 1,920 coefficients in a model graph. The table reflects the
statistics of more than three thousand model graphs.

[3] One may ask why the cosine similarity function performs as well as it does, given the variation in the underlying statistics of the features. Though we will not discuss the matter at length, we will note that there is a rough correlation between the relative magnitude of a feature’s mean value over a dataset and its relevance to face recognition, as determined by algorithms that we present in a later section. We improve on this representation and similarity measure, in part, by teasing apart and making explicit what is already implicit in the EBGM algorithm.

In the pages that follow we explore a number of alternatives to both the data
representation and the similarity measure of the Elastic Bunch Graph Matching algorithm
as applied to the problem of face recognition. We present a new representation and
similarity measure for face recognition, DFM. DFM encodes a face as a binary pattern of
significant deviations from mean feature values. We systematically show that DFM is an
improvement over EBGM for the representation and recognition of faces according to all
of the criteria we tested.
Figure 2 Pixel intensity representation of the means of all 1,920 Gabor
coefficients over all fa and fb images in the FERET database. The
coefficients are displayed such that each row of pixels represents a single
Gabor jet at one landmark. In each row of forty pixels, successive groups of
eight pixels comprise the eight orientations at successively decreasing spatial
frequency – the first eight pixels are means of all orientations at the highest
frequency, etc. In a similar manner, each column of pixels comprises the mean
coefficient values from a single kernel at each of forty-eight fiducial points.
2.1 Data Representation
The EBGM algorithm represents a single complex Gabor feature as a floating point
number, the amplitude of the complex coefficient. Each such number is a member of a jet
of Gabor amplitudes that has been normalized to unit length. As previously noted, this
representation does not account for the variability in the statistical properties of the Gabor
features and, at best, only incidentally and implicitly contains information about a
feature’s relevance to face recognition. Given that there is mounting evidence for a norm-
based face representation in the brain, in which faces are encoded relative to an average
face, in some sense, what alternative representations might we consider? We have created
a number of new representations to systematically compare to EBGM. Some of these
representations are, like EBGM’s, floating point, and some are binary. These
representations are summarized in Table 2 below.
The floating point representations are largely self descriptive. AM or GA denotes the
default representation, unit length vectors of Gabor magnitudes extracted from EBGM
model graphs. It is the representation used by EBGM for face finding and matching.
MSAM denotes the floating point vector of Gabor Coefficients (AM) after the mean value
of each coefficient has been subtracted; ZAM denotes the Gabor magnitudes of AM
following the application of a z-transform, which, in short, converts a Gabor coefficient
into a measure of the coefficient’s distance from the mean in units of standard deviation.
ZAMn denotes the values of ZAM after each jet of z-scores has been normalized to unit
length. ZAMna refers to the absolute values of ZAMn.
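The derivations just described can be sketched, for a single jet, as follows (illustrative Python; the per-coefficient means and standard deviations are assumed to have been accumulated beforehand, and the function name is ours):

```python
import math

def derive_representations(am, mean, std):
    """Derive the alternative floating point representations from a jet of
    Gabor amplitudes (AM), given per-coefficient means and standard deviations."""
    msam = [a - mu for a, mu in zip(am, mean)]                # MSAM: mean-subtracted
    zam = [(a - mu) / s for a, mu, s in zip(am, mean, std)]   # ZAM: z-scores
    norm = math.sqrt(sum(z * z for z in zam)) or 1.0
    zamn = [z / norm for z in zam]                            # ZAMn: unit length
    zamna = [abs(z) for z in zamn]                            # ZAMna: absolute values
    return msam, zam, zamn, zamna

msam, zam, zamn, zamna = derive_representations(
    am=[1.0, 2.0], mean=[0.5, 1.0], std=[0.5, 0.5])
assert msam == [0.5, 1.0]
assert zam == [1.0, 2.0]
assert abs(sum(z * z for z in zamn) - 1.0) < 1e-12  # unit length
```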
Name               Format          Representation Derivation
AM or GA (EBGM)    Floating Point  Gabor Amplitudes (GA)
MSAM               Floating Point  GA with mean coefficient values subtracted
ZAM                Floating Point  Z-Scores of GA
ZAMn               Floating Point  Z-Scores of GA (unit length)
ZAMna              Floating Point  Absolute values of ZAMn
BAM                Binary          AM
BMSAM              Binary          MSAM
BZAM               Binary          ZAM
BZGT               Binary          ZAM

Table 2 Summary of Data Representations For floating point representations, the last column denotes the source of
the floating point coefficients and/or the method by which the coefficients are derived from the original Gabor
Amplitudes (AM). For binary representations, the last column denotes the set of floating point coefficients that inform
the construction of the b-string.
The binary representations share some common characteristics. First, each binary string
has a number of bits equal to the number of coefficients in the floating point
representations. For example, if a model graph has a total of 1,920 coefficients, forty
coefficients at each of forty-eight fiducial points, then each floating point representation
described in Table 2 contains 1,920 floating point numbers and each binary
representation has 1,920 bits. It is easy to see that the binary representations have a space
complexity that is thirty-two or sixty-four times less than the corresponding floating point
representation.
Second, a bit in a b-string denotes whether the corresponding floating point feature is
“significant” in some sense. For representations BMSAM and BZAM, a bit taking the
value “1”, an active bit, denotes that the underlying coefficient was among the most
significant deviations from the mean in either coefficient value or z-score, respectively.
Alternatively, in representation BZGT, an active bit denotes that the z-score of the
underlying floating point data was greater than some value, z. Representation BAM is a
control case in which an active bit denotes that the Gabor magnitude was a member of a
subgroup comprising the magnitudes with the highest values.
2.1.1 Significant Deviations from the Mean
The pattern of active bits in a b-string, which denotes the coefficients with the most
significant deviations from the mean, can be derived from a vector of floating point
model graph coefficients (Gabor magnitudes or z-scores) by considering groups of
coefficients. The coefficients can be grouped by Landmark or, alternatively, by Model
Graph, in which case they would be considered as one large group. In the former case,
bits are activated by taking the top k deviations at each landmark. In the latter case,
active bits correspond to the top m deviations over the complete model graph, or to
coefficients whose z-score exceeds a threshold, z. Accordingly, b-strings formed by
landmark will have the same number of active bits per landmark as all other b-strings
formed “by Landmark” for the same value of k. Similarly, strings formed “by Model
Graph” will have the same number of active bits over the complete b-string for the same
value of m. This is not true for the last method, BZGT. Strings formed by method BZGT
for the same value of z will generally not share the same number of active bits. The
following table summarizes the
methods for building b-strings. Figure 3 is a caricature of this process.
b-string           Grouping         Floating Point         Parameter   Range of
Representation                      Representation                     Parameter Values
BMSAM              By Landmark      Mean-subtracted GA     k           [1..40]
BMSAM_MG           By Model Graph   Mean-subtracted GA     m           [1..1920]
BZAM               By Landmark      Gabor z-scores         k           [1..40]
BZAM_MG            By Model Graph   Gabor z-scores         m           [1..1920]
BZGT               By Model Graph   Gabor z-scores         z           0.5[4]

Table 3 Summary of b-string Representations

[4] For an explanation of this choice of threshold, see section 4.2 beginning on page 93.
Figure 3 After subtracting the mean, or taking a z-score, the top deviations from the mean are identified;
corresponding bits in a b-string are set to “1,” or “activated.”
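The construction methods described above can be sketched as follows (illustrative Python; ties among equal deviations and the encapsulating Model Graph structure are ignored, and the function names are ours):

```python
def bstring_by_landmark(coeffs, jet_size, k):
    """Activate the top-k absolute deviations within each landmark's jet."""
    bits = [0] * len(coeffs)
    for start in range(0, len(coeffs), jet_size):
        jet = range(start, start + jet_size)
        for i in sorted(jet, key=lambda i: abs(coeffs[i]), reverse=True)[:k]:
            bits[i] = 1
    return bits

def bstring_by_model_graph(coeffs, m):
    """Activate the top-m absolute deviations over the whole model graph."""
    bits = [0] * len(coeffs)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    for i in order[:m]:
        bits[i] = 1
    return bits

def bstring_by_threshold(zscores, z=0.5):
    """Activate bits whose z-score exceeds the threshold z (BZGT style)."""
    return [1 if s > z else 0 for s in zscores]

# Two "landmarks" of three deviations each:
coeffs = [0.1, -0.9, 0.5, 0.2, 0.8, -0.05]
assert bstring_by_landmark(coeffs, jet_size=3, k=1) == [0, 1, 0, 0, 1, 0]
assert sum(bstring_by_model_graph(coeffs, m=3)) == 3
assert bstring_by_threshold(coeffs) == [0, 0, 0, 0, 1, 0]
```

Note that the threshold method, unlike the other two, does not fix the number of active bits in advance.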
2.1.2 Measures of Pattern Similarity
In addition to the normalized dot product, or Cosine, similarity function, we developed a
number of alternative similarity measures.[5] Some of these measures apply only to binary
representations; others can be applied to both binary and to floating point representations.
Examples of the former include Hamming distance, the bit sum of an XOR operation on
two binary strings, and the AND distance, the bit sum of an AND operation on two binary
strings. Interestingly, if two binary strings have the same number of active bits, these two
measures are functionally equivalent: if we know the value of one measure we can
calculate the other. This is significant to the extent that it is often easier to calculate an
AND operation on all pairs of strings than to calculate Hamming distance. An AND can
be calculated for all pairs with a single binary matrix multiplication.

[5] We do not address whether any of these similarity measures are in fact metrics.
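The all-pairs AND computation can be sketched as a matrix product of 0/1 row vectors (pure-Python illustration; in practice one would use an optimized matrix library):

```python
def and_overlap_matrix(A, B):
    """All pairwise AND bit-sums between two sets of b-strings.
    Entry [i][j] = number of positions where A[i] and B[j] are both 1,
    which is exactly the matrix product of A with the transpose of B
    when the strings are written as 0/1 row vectors."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

A = [[1, 0, 1, 1],
     [0, 1, 1, 0]]
B = [[1, 1, 1, 0],
     [0, 0, 1, 1]]
assert and_overlap_matrix(A, B) == [[2, 2], [2, 1]]
```

When all strings share the same number s of active bits, the Hamming distance follows immediately as 2(s - overlap), since the XOR bit count equals the sum of the two active-bit counts minus twice the AND bit count.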
Examples of similarity measures that one can apply to either floating point or to binary
representations are the inner product, of which the Cosine similarity function and the
AND distance are special cases, and mutual information, an information-theoretic
measure of how much information one random variable contains about another random
variable (Cover and Thomas 2006). In information-theoretic terms, the mutual
information is the Kullback–Leibler distance between the joint probability distribution of
two random variables, p(x, y), and the product of the marginal distributions, p(x)p(y),
and is, in this sense, a measure of the dependence between two random variables (Ibid.).
We have stated previously that the quality of a match between two b-strings correlates
with the magnitude of the failure of a test of statistical independence on the strings. This
similarity measure makes that criterion explicit. We will see that in cases where we are
called upon to measure the similarity between strings containing different numbers of
active bits – any comparison between strings of representation BZGT and comparisons
between strings of representation BZAM_MG when some of the landmarks are removed
from consideration – the mutual information similarity measure is the only measure to
utilize adequately the information content of the strings.
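As an illustrative sketch of this measure applied to a pair of b-strings (treating corresponding bit positions as paired samples from a joint distribution; this is one plausible estimator, not necessarily the exact form used in our experiments):

```python
import math

def mutual_information(a, b):
    """Mutual information (in bits) between two equal-length binary strings,
    estimated from the joint histogram of paired bit values:
    sum over x, y of p(x, y) * log2( p(x, y) / (p(x) * p(y)) )."""
    n = len(a)
    joint = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for x, y in zip(a, b):
        joint[(x, y)] += 1
    mi = 0.0
    for (x, y), c in joint.items():
        if c == 0:
            continue
        pxy = c / n
        px = sum(v for (i, _), v in joint.items() if i == x) / n
        py = sum(v for (_, j), v in joint.items() if j == y) / n
        mi += pxy * math.log2(pxy / (px * py))
    return mi

a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 1, 0, 0, 1, 0, 0, 1]
# A string is maximally dependent on itself; a dissimilar string carries less
# information about it.
assert abs(mutual_information(a, a) - 1.0) < 1e-12
assert mutual_information(a, b) < mutual_information(a, a)
```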
2.2 Maximizing Available Information
2.2.1 Information Entropy
What do we mean when we say that a representation or similarity criterion maximizes the
available information? We will consider first the information content of a data
representation. How can we measure the amount of information in a random variable?
The information theoretic measure of information is the entropy of a random variable.
The Entropy, H, of a random variable, X, is a function solely of the statistical distribution
of the random variable. It is a measure of the average uncertainty in the random variable
(Cover and Thomas 2006). It is also the average number of bits required to describe the
random variable (Ibid.). As an illustration, consider the information content of a flip of a
fair coin. In this case:

p(heads) = p(tails) = 1/2

and

H(X) = -Σ p(x) log2 p(x) = -( (1/2) log2(1/2) + (1/2) log2(1/2) ) = 1 bit.

Therefore, to communicate the result of a fair coin flip, the minimum average number of
bits required is one. Alternatively, consider the information content of a two-headed coin.
In this case,

p(heads) = 1, p(tails) = 0, and H(X) = -(1) log2(1) = 0 bits.

Intuitively, a flip of the completely unfair coin contains no information; the result of the
toss is completely certain.
How are we to determine the entropy of a floating point random variable? To obtain a
meaningful measure that can be used to compare the information content of different
floating point representations, each variable must be quantized to one of 2^b discrete
values, where b is the number of bits in the quantized representation. Consider our
floating point representations: AM, MSAM, and ZAM. If we normalize the coefficients
of these representations so that each takes a value in the range [0,255] then each
quantized coefficient is eight bits in length. An examination of the entropies of the
individual feature channels reveals that for any given channel, the entropy is the same
under each of the representations. For each representation, the average coefficient
entropy is 6.9631 with a standard deviation of 0.48. We will see later that even though
the information content is the same, the choice of representation has a pronounced effect
on recognition performance. It is clear that entropy is not a sufficient criterion alone to
determine whether information is being efficiently utilized.
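The quantization and per-channel entropy computation described above can be sketched as follows (illustrative Python; the function names are ours, and min-max scaling into [0, 255] is the normalization assumed in the text):

```python
import math
from collections import Counter

def quantize(values, bits=8):
    """Map floats to integers in [0, 2**bits - 1] by min-max normalization."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    levels = 2 ** bits - 1
    return [round((v - lo) / span * levels) for v in values]

def entropy(samples):
    """Shannon entropy (in bits) of a discrete sample set."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

q = quantize([0.016, 0.1, 0.44])
assert q[0] == 0 and q[2] == 255

# A channel taking each of four values equally often carries log2(4) = 2 bits;
# a constant channel carries none.
assert abs(entropy([0, 1, 2, 3] * 10) - 2.0) < 1e-12
assert entropy([7] * 40) == 0.0
```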
Next, consider the floating point representation, ZAMn, in which the Jets of z-scores of
the Gabor coefficients are normalized to unit length. In this case, the mean feature
entropy is 7.2739 with a standard deviation of 0.162. In fact, of all of the floating point
representations, ZAMn delivers the best recognition performance.
For binary patterns, comparisons of entropies are more straightforward as no
renormalization or quantization need occur. For comparisons to be meaningful between
competing b-string representations, however, each string should have the same number of
active bits. Let us take as an example the following sets of b-strings:
Representation   Parameter Value   Pattern Entropy   Mean Feature Entropy   Feature Entropy
BAM              k=20              1.0               0.68727                0.26819
BAM_MG           m=960             1.0               0.68713                0.2656
BMSAM            k=20              1.0               0.99539                0.00926
BMSAM_MG         m=960             1.0               0.99635                0.00848
BZAM             k=20              1.0               0.99768                0.00438
BZAM_MG          m=960             1.0               0.99568                0.00674

Table 4 Six b-string Representations with equal pattern entropy Representations with a parameter setting of k=20
have active bits set “by landmark”. Representations with a parameter setting of m=960 have active bits set “by model
graph”.
Each of the patterns in each representation has a pattern entropy of 1.0. This indicates
that each pattern contains an equivalent number of ones and zeros, i.e., p(1)=p(0)=0.5.
We attempt to quantify the amount of information in each feature channel for each
representation by calculating the feature entropy. The feature entropy is the information
entropy of the set of samples for a single feature (Gabor kernel) over the complete data
set. We consider each feature channel as a random variable and calculate the entropy of
the random variable over the entire data set. Ideally, the entropy of each channel would
equal 1.0; this would indicate that each channel contains the maximum amount of
information. This would also indicate that for a given binary representation, each channel
is active in half of the patterns in the data set. In other words, for any given pattern each
bit (feature channel) has a fifty percent chance of being active. This is important because
it further supports an interpretation of a b-string as the outcome of a sequence of
Bernoulli trials. Note that each pattern already has fifty percent active bits. However, to
claim that each bit has equal probability of being zero or one, we would want to know
that over the whole data set each channel is activated as often as any other.
2.2.2 Face Recognition as a Binary Classification Problem
One criterion for the “goodness” of a pattern classifier is that it should maximize the
distance between classes and should minimize the distance between patterns of the same
class. To further support our contention that representations based on deviations from the
mean are more efficient consumers of information than the Gabor jet representation, we
show that in the context of a binary classification problem (“same” versus “different”),
the former representations are quantifiably better. We note at the outset that the face
recognition problem is not a binary classification problem; in general, there are as many
classes as individuals in the data gallery. However, framing face recognition as a binary
classification problem allows us to simplify the problem in order to compare some
statistical properties of competing representations in an intuitive manner.
First, we will show that the choice of representation affects the separation between
classes. We present histograms of the distributions of intra- and inter-class similarity
comparisons using the same criterion for selecting a suitable decision threshold. The
Receiver Operating Characteristic (ROC) curve will show that for any reasonable choice
of threshold the Gabor jet representation, EBGM, yields significantly fewer true positives
than either the DFM b-string or normalized z-score representation, for a desired false
positive rate.
In (Daugman 2003), Daugman examined the statistical properties of a binary
representation for iris recognition. There are many differences between Daugman’s
“phase demodulated” code for irises and DFM’s norm-based code; still, the analysis is
applicable here. In both cases, the question is this: are two binary-coded strings
exemplars derived from the same or from different subjects? Daugman’s concern seems
to have been primarily binary classification. This makes sense given that a primary
motivator for iris recognition is biometric validation: is this person who he purports to
be? If validation is one’s primary goal one would seek a representation that minimizes the
overlap between the distribution of similarities of pairs of images of the same person and
the distribution of similarities of pairs of images of different persons. Further, it would be
a priority to avoid false positives entirely. For our purposes, however, we simply wish to
show that a representation based on deviations from average feature values represents a
significant improvement over EBGM in terms of intra- and inter-class distances.
In the following paragraphs we address the following questions: in a binary classification
problem, given a binary or a floating point representation, and a measure of similarity,
1. How does the choice of threshold affect the rate of false positives and false
negatives?
2. To what extent does the choice of representation satisfy the goal of increasing
inter-class distances?
First, we will examine the distributions of similarities for each class of image pairs.
Following Daugman, we will call the first class of images “same” images. The “same”
class comprises all pairs in the dataset of images of the same person, excluding identical
images. We call image pairs belonging to the second class “different” images. The
“different” class comprises all pairs of images in the dataset picturing different
individuals. The following table shows the number of pairs in each class.
Class Number of Image Pairs
Same 5,921
Different 5,371,639
Table 5 Counts of image pairs in the fa and fb FERET sets belonging to each class in the binary categorization
problem.
Figure 4, Figure 5, and Figure 6 show the histograms of similarities of each class for three
combinations of data representation and similarity function: Elastic Bunch Graph
Matching (EBGM) with the Cosine similarity function, DFM with the fractional
Hamming distance similarity function, and normalized z-score with the dot product
similarity function, respectively. The histograms have been normalized so they can be
viewed in the same figure. The blue histograms denote the distribution of “different”
images; the green histograms denote “same” images. The reader will note that the
histograms in the DFM figure appear to be reversed. This is because a lower fractional
Hamming distance is indicative of higher similarity. The red vertical line indicates the
decision threshold. The threshold was chosen at the point where the histograms cross,
reflecting a somewhat arbitrary decision to seek a balance between false positives and
false negatives. We make no assumptions about the undesirability of false positives or
false negatives. Still, we will address the issue of tradeoffs between false- and true-
positive rates by looking at the ROC curves of the three representations.
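The two similarity measures compared above can be sketched in a few lines of code. This is an illustrative sketch only; the function names and list-based inputs are ours, not the API of the system described here. Lower fractional Hamming distance means greater similarity; higher cosine similarity means greater similarity.

```python
def fractional_hamming(a, b):
    """Fraction of positions at which two equal-length b-strings differ.
    Lower values indicate greater similarity (used with DFM)."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cosine_similarity(u, v):
    """Normalized dot product between two feature vectors (used with EBGM jets)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return dot / (norm_u * norm_v)
```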
Figure 4 Binary Class Histograms for Elastic Bunch Graph Matching. The histograms reflect the distributions of
similarity scores for all “different” images (blue) and all “same” images (green) in the fa and fb FERET sets. A
decision threshold was chosen where the histograms cross. The ROC Curve in Figure 7 shows that this is close to
optimal.
Figure 5 Class histograms of similarities for DFM. Histograms reflect the distribution of Hamming distances for all
“same” image pairs and all “different” image pairs in the fa and fb FERET sets. Note that the DFM representation
increases the inter-class distance for the binary classification problem.
Figure 6 Binary class histograms of dot product similarities between vectors of normalized z-scores (ZN), in
which the jets of z-scores are normalized to unit length. Absolute values were not taken during normalization. Note
that the mean of the “different” histogram lies near zero. On average, “different” images have offsetting positive and negative
coefficients. “Same” images, with correlated coefficients, yield a positive dot product.
Table 6 below outlines a “confusion matrix” for a binary classification problem. A
confusion matrix is a summary of how often the classifier makes a correct decision and
where it gets confused.
                        Decision
                        “Same”               “Different”
Ground Truth  Same      True Positive (TP)   False Negative (FN)
              Different False Positive (FP)  True Negative (TN)
Table 6 General Confusion Matrix for Binary Classifier
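The row-normalized rates in a confusion matrix of this kind can be computed from similarity scores and ground-truth labels. The following is a minimal sketch with a helper of our own devising (deciding “same” when the score meets a threshold), not part of the DFM system:

```python
def confusion_rates(scores, labels, threshold):
    """Row-normalized confusion matrix for a similarity classifier.
    labels[i] is True for a "same" pair, False for a "different" pair;
    the classifier decides "same" when score >= threshold.
    Returns (TP, FN, FP, TN) as rates within each ground-truth row."""
    pos = sum(labels)              # number of "same" pairs
    neg = len(labels) - pos        # number of "different" pairs
    tp = sum(1 for s, l in zip(scores, labels) if l and s >= threshold)
    fp = sum(1 for s, l in zip(scores, labels) if not l and s >= threshold)
    return (tp / pos, (pos - tp) / pos, fp / neg, (neg - fp) / neg)
```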
The confusion matrix is, however, only a glimpse into the performance of the classifier, a
summary of outcomes for one choice of decision threshold. Another measure of classifier
performance is the ROC curve. The ROC curve shows how the classifier performs over a
range of decision thresholds, plotting the false positive rate on the abscissa and the true
positive rate on the ordinate. Points along the ROC curve denote the true positive rates
for all possible choices of threshold, i.e., for any rate of false positives. The analysis of
ROC curves is discussed at length in (Fawcett 2006).
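An ROC curve of this kind can be traced by sweeping the decision threshold over the observed scores. The sketch below is our own illustration (toy data, not the FERET similarities); in the experiments described here, the scores would come from EBGM, DFM, or ZN comparisons:

```python
def roc_points(scores, labels):
    """Sweep the decision threshold over all observed scores and return
    (false_positive_rate, true_positive_rate) pairs. labels[i] is True for
    a "same" pair and False for a "different" pair; a higher score means
    the pair is judged more similar."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        points.append((fp / neg, tp / pos))
    return points
```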
                        Decision
                        “Same”     “Different”
Ground Truth  Same      0.91876    0.081236
              Different 0.053343   0.94666
Table 7 DFM Confusion Matrix
Figure 7 ROC curve for the binary face classification problem. Over nearly the entire range of threshold
choices, DFM, a binary representation, and NZ (sometimes denoted Zn or ZN), a floating point representation, offer
improved discrimination over EBGM.
                        Decision
                        “Same”     “Different”
Ground Truth  Same      0.82725    0.17275
              Different 0.040017   0.95998
Table 8 EBGM Confusion Matrix
                        Decision
                        “Same”     “Different”
Ground Truth  Same      0.90944    0.090556
              Different 0.040457   0.95954
Table 9 Normalized Z-Score Confusion Matrix
Figure 7 shows ROC curves for the three binary classifiers we tested. In general, a
classifier with a ROC curve that hugs the left and top of the plot is better than one with a
curve that is closer to the diagonal. A diagonal line running from bottom left to top right
corresponds to a classifier that randomly selects a class. For example, if the classifier
selected “same” sixty percent of the time, it would get sixty percent of the “same” pairs
right. It would also get a false positive rate of sixty percent on pairs of “different” images.
To move off the diagonal the classifier must exploit some information in the data
(Fawcett 2006).
Table 10 is a numerical view of the result shown in the ROC curve. It shows a number of
choices of threshold, indicated by the choice of false positive rate, and the corresponding
true positive rates for each of the three classifiers. The last column indicates the
difference in true positive rates between the DFM classifier and the EBGM classifier. The
performance of the DFM and ZN classifiers are comparable; both classifiers are superior
to the EBGM classifier over a range of choices of false positive rate.
FP Rate EBGM TPR ZN TPR DFM TPR DFM-EBGM
0.30% 69.68% 77.78% 79.26% 9.58%
0.90% 75.86% 83.85% 84.12% 8.26%
1.50% 77.89% 86.57% 86.17% 8.28%
2.10% 79.89% 87.97% 88.14% 8.26%
2.70% 80.95% 89.09% 88.99% 8.04%
3.30% 81.83% 89.98% 89.97% 8.14%
3.90% 82.72% 90.61% 90.71% 7.99%
4.50% 83.64% 91.57% 91.25% 7.61%
5.10% 83.64% 91.96% 91.88% 8.24%
5.70% 84.38% 92.16% 92.28% 7.90%
Table 10 True Positive Rates of various representations for a range of decision threshold choices, which
correspond to rates of false positives. For the same false positive rate, both ZN and DFM offer significantly higher
numbers of true positives.
We conclude from the foregoing that:
1. Compared to the Gabor jet representation of the Elastic Bunch Graph
Matching algorithm, both binary (DFM) and floating point (normalized z-
scores (ZN)) representations based on deviations from mean feature values
yield improved binary classification performance over a range of decision
thresholds, and
2. In combination with suitable similarity functions, this improvement can be
attributed to an increase in the distance between category classes, as
evidenced by the higher rates of true positives for any given false positive rate.
2.3 Recognition Performance
In the field of computer vision, face recognition is a difficult problem. The many
extrinsic factors that contribute to the problem’s intractability have been well delineated,
e.g., (Hallinan 1999). Variations in illumination, contrast, head pose, and identity are but
a few of these factors. Indeed, of all these factors, difference in identity accounts for the
least amount of change. (Ibid.) Another source of identity-obscuring information, which
makes an already difficult problem more difficult, is intrinsic to the representation of the
face. After sources of variation have been neutralized, or the invariant properties have
been isolated and a representation of a face is extracted from an image, information can
still remain that hinders correct recognition. Among the goals of this research is to
identify the information intrinsic to a face representation that contributes to successful
recognition. The trivial corollary is that there must be other information whose
exclusion would yield improved results. We introduce the concept of relevance as
a measure of how information in a representation contributes to successful recognition.
Relevance is by its nature a relative measure. We present several measures of feature
relevance and show that judicious selection of only the most relevant features or groups
of features significantly improves recognition performance. Although the binary
representation undeniably improves recognition performance, equally important is our
demonstration that relevance can be measured either with a statistical analysis of the
floating point Gabor jet representation, as was done in (Kalocsai, Neven et al. 1998;
Kalocsai, von der Malsburg et al. 2000; Kalocsai 2004), or with an information-theoretic
analysis of the binary DFM representation. In fact, on recognition tests using the same set
of image pairs, features chosen according to relevance values determined from an
analysis of DFM b-strings yield better recognition performance than features chosen
according to the ANOVA analysis on Gabor jets outlined in the previously cited works.
We believe this further supports our contention that a change of representation from
floating point Gabor jets to DFM binary strings is at least information preserving.
In addition to investigating feature relevance as a criterion for data inclusion and
exclusion, and showing that relevance may be judged effectively either from the
information contained in a set of floating point coefficients or from the information
encapsulated in a binary representation of significant deviations from mean feature
values, we also show in empirical face recognition experiments that mean-based
representations significantly improve recognition performance, reducing the recognition
error rate by as much as 40%.
2.3.1 Analysis of Faces
At various points in the discussion we will have occasion to divide the coefficients into
groups according to shared characteristics. We may group the coefficients by fiducial
point, by orientation of the Gabor wavelet, by spatial scale of the wavelet, by ordinal jet
feature or by model graph. Given the model graph structure we have chosen, a choice that
is somewhat arbitrary but has proven itself in practice, a face model graph comprises
1,920 coefficients, which can be viewed as:
1. A set of forty-eight landmarks each of which contains a jet comprising forty
coefficients;
2. A set of forty Jet features, each of which comprises forty-eight coefficients,
one coefficient from each landmark;
3. A set of eight orientation features, each of which consists of 240 coefficients,
five coefficients (one per spatial scale) at each landmark; and
4. A set of five scale features, each of which consists of 384 coefficients, eight
coefficients (one per orientation) at each landmark.
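These groupings can be made concrete with index arithmetic over the 1,920 coefficients. The storage order below (landmark-major, then scale, then orientation) matches the jet layout described for the figures in this chapter but is otherwise an assumption for illustration:

```python
N_LANDMARKS, N_SCALES, N_ORIENTS = 48, 5, 8  # 48 * 5 * 8 = 1,920 coefficients

def coeff_index(landmark, scale, orient):
    """Flat index of one coefficient, assuming landmark-major ordering,
    then scale, then orientation (an assumption for illustration)."""
    return (landmark * N_SCALES + scale) * N_ORIENTS + orient

def jet(landmark):
    """All 40 coefficient indices at one landmark (one jet)."""
    return [coeff_index(landmark, s, o)
            for s in range(N_SCALES) for o in range(N_ORIENTS)]

def orientation_group(orient):
    """All 240 coefficient indices sharing one orientation (5 scales x 48 landmarks)."""
    return [coeff_index(l, s, orient)
            for l in range(N_LANDMARKS) for s in range(N_SCALES)]

def scale_group(scale):
    """All 384 coefficient indices sharing one scale (8 orientations x 48 landmarks)."""
    return [coeff_index(l, scale, o)
            for l in range(N_LANDMARKS) for o in range(N_ORIENTS)]
```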
As we analyze the relative importance, or relevance, of coefficients and groups of
coefficients to particular cognitive tasks, it will be useful to consider such groupings of
features.
It has been shown, for example in (Kalocsai, von der Malsburg et al. 2000), that some
Gabor features are more diagnostic of recognition than others. Kalocsai et al. analyzed
the statistics of Gabor coefficients, with the goal of determining a weight for each
coefficient based on its ability to predict the similarity of faces. They performed a
between-individual one-way Analysis of Variance (ANOVA) for each of the 1,920
kernels, and formed a weight matrix from the ANOVA F-values. Higher F-values
indicate greater ability to discriminate between individuals. In other words, a high F-
value indicates that between-individual variance was significantly higher than the within-
individual variance. Kalocsai et al. reported that performing a recognition test using only
the 150 kernels with the highest F-scores resulted in only a three percent decline in
recognition rate (96% vs. 93%). Even the 150 kernels with the lowest F-scores produced
recognition results only 13% lower than the test performed with all 1,920 kernels (96%
vs. 83%). They also reported recognition rates of 90% and 73% for tests using the top 40
kernels and 10 kernels respectively. Using the “worst” kernels, they reported recognition
rates of 74% and 32% respectively. These tests appear to have been performed on a set
comprising 325 pairs of images of Caucasian faces.
The F-matrices reported in the above work were calculated from 81 images of six
Caucasian males under different combinations of conditions (orientation, illumination,
facial expression and background). The reported results were apparently part of a larger
effort to identify the information relating to identity among all of the sources of image
variation in their dataset. Our goal, on the other hand, is to determine which kernels, i.e.,
Gabor magnitudes, are most diagnostic of recognition among a larger data set with fewer
images per individual subject.
2.3.2 Relevance
It is clear that not all information available in a model graph is useful for every cognitive
task, in silico, to which it might be applied. Indeed, we demonstrate that some
information may “confuse” or “misdirect” the recognizer, actively hampering recognition
and categorization performance. The landmarks around the mouth, for example,
adversely affect face recognition performance. The mouth, and the face around the
mouth, are sources of too much intra-subject variation to be a reliable indicator of
identity. In contrast, for an expression recognition task, the landmarks around the mouth
would be among the most relevant. What features or groups of features are the most
reliable indicators of individual identity? To bolster our claim that the transformation
from a floating point to a binary representation is at least information preserving we have
investigated three criteria for evaluating the “relevance” of features and feature groups to
the task of categorizing human faces. We performed three sets of experiments to
discover the features most relevant to face recognition.
In the first set of experiments, we followed the example of (Kalocsai, von der Malsburg
et al. 2000), and performed separate one-way ANOVAs on each model graph coefficient.
Although we present here results of this experiment for the face recognition task, in
unreported experiments we calculated the results for other, related categorization tasks,
for example, gender identification, race identification, and eyewear detection. For any
single task, the 1,920 analyses of variance used the same grouping variables. This
yielded, for each task, an F matrix in which each matrix entry, an F-score, is an indication
of the discriminative power of the coefficient for the given task.
2.3.2.1 Relevance Criterion I: Analysis of Variance
For each categorization problem we performed a univariate analysis of variance
(ANOVA) on each of the 1,920 coefficients (48 landmarks, 8 orientations, 5 spatial
frequencies) over 3,280 model graphs. The resulting F-scores are an indication of a
coefficient’s ability to discriminate between intra- and inter-class exemplars.
F = [ Σ_g n_g (x̄_g − x̄)² / (k − 1) ] / [ Σ_g Σ_{i ∈ g} (x_{g,i} − x̄_g)² / (N − k) ]

where x_{g,i} is the value of a coefficient; x̄ and x̄_g are the coefficient mean and the mean of
the coefficients of group g, respectively; k is the number of groups in the dataset; n_g
is the number of images in group g; and N = Σ_g n_g is the total number of images.
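The per-coefficient F-statistic can be computed as in the following sketch (pure Python, our own helper; `groups` holds one list of the coefficient's values per individual, i.e., per group):

```python
def f_score(groups):
    """One-way ANOVA F-statistic for a single coefficient. `groups` is a
    list of lists: the coefficient's observed values for each individual."""
    values = [x for g in groups for x in g]
    n_total = len(values)
    k = len(groups)
    grand_mean = sum(values) / n_total
    # Between-group sum of squares, with k - 1 degrees of freedom.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares, with N - k degrees of freedom.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

A high F-score indicates that between-individual variance dominates within-individual variance, i.e., the coefficient discriminates well between individuals.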
Displayed below are graphical depictions of the F-matrices for Face Recognition using
several different representations:
Figure 8 ANOVA F Matrices. Lighter values indicate higher f-score, indicating the power of a coefficient to
discriminate between inter- and intra-class exemplars. Left: Gabor Magnitudes. Middle: Normalized z-scores of Gabor
magnitudes. Right: Binary DFM strings constructed by model graph (m=960). The qualitative similarity between the
three images supports the claim that information is preserved during the transformation from a floating point to a binary
representation. In each intensity image, each row represents a single jet; each column represents a single Gabor kernel.
Each jet comprises five frequency level groups, each of which comprises eight orientations. The highest frequency
group is at the far left; the lowest group is at the far right. Thus, columns one through eight are the same spatial
frequency (the highest). Column 1 is a vertically oriented kernel; successive columns, through column eight,
correspond to incremental rotations of the kernel. The next group begins in column 9, comprising eight orientations at
the next lowest frequency level, etc.
Summing over the rows and/or the columns of an F-matrix gives us the relevance of each
landmark and/or feature group respectively. A feature group can be any combination of
features. We examined groupings of coefficients of kernels with the same orientation, the
same spatial frequency, and the same landmark. The following figures show the
relevance of landmarks. The radius of the circle over a landmark is not strictly
proportional to its total F-score but is an indication of its relevance in relation to other
landmarks. The highest total F-score gets the biggest circle; the landmark with the lowest
F-score is covered with the smallest visible circle. The color of a circle merely denotes
whether the landmark is among the most relevant (green) or the least relevant (magenta)
fifty percent of landmarks.
Figure 9 Relevance of Landmarks – Analysis of Variance (ANOVA) on three representations. ANOVA was
calculated for each feature, resulting in 1,920 univariate analyses. F-scores for each feature were calculated (see Figure
8) and the sum over the f-scores at each landmark was taken. The 50% most relevant landmarks are represented by
green circles; the least relevant landmarks are shown in magenta. The size of the circle is indicative of relative
relevance and is not proportional to the f-score in the strictest sense. Left: ANOVA on Gabor Magnitudes as in
(Kalocsai, Neven et al. 1998; Kalocsai, von der Malsburg et al. 2000; Kalocsai 2004). Middle: ANOVA on normalized
z-scores; Right: ANOVA on Binary DFM string constructed by model graph (m=960). Again, the qualitative similarity
between the three images supports our claim that information is preserved during the transformation from floating point
to binary.
We also performed a number of unreported experiments that show that other cognitive
tasks could be aided by selecting the features best suited to correctly distinguish between
individual members of a sub-population or to determine membership in a set whose
members share a common quality. Examples of the latter are: gender identification, race
identification, eyewear detection and detection of beards, moustaches, goatees, et cetera.
We were also able to demonstrate a small, but statistically significant, “other race” effect.
Experiments show that recognizing members of a subset of images containing faces of
just one “race” -- by which we mean apparent skin tone (a quality that can be affected by
many sources of image variation such as illumination intensity, direction and color) or
possession of physical characteristics common to people with a shared geography -- is
slightly hampered when b-strings are constructed using the mean from an image set
comprising faces of another race and landmarks relevant to recognizing faces of the other
race are used. Conversely, in an experiment on sets of Caucasian and Asian faces,
recognition was more accurate for same-race faces (binary strings constructed using the
“correct” mean) using landmarks relevant to recognition of the “same” race.
2.3.2.2 Relevance Criterion II: Leaving Out Groups of Features
For this set of experiments, we attempted to determine the relevance of groups of features
by measuring mean recognition performance on data sets in which coefficients from
those groups had been removed. For example, to measure the relevance of a landmark,
relative to other landmarks, we would run a number of recognition trials, on randomly
generated datasets (pairs of images of k members of the FERET database) but leaving out
all features belonging to the chosen landmark. We performed several hundred trials per
landmark using a consistent set of randomly selected data sets, leaving out each landmark
in turn. We calculated the mean recognition rate over all trials for each landmark
excluded. We judged the relevance of a landmark by the extent of the impact on
recognition performance of leaving it out of the recognition calculation. If performance
improved, then we surmised that the landmark was not helpful to recognition; if it
worsened, then we hypothesized that the landmark was helpful to recognition. Having
obtained the mean recognition rate for trials that excluded each landmark, we sorted the
means, thus obtaining a ranking of relevance; the lowest mean rate corresponded to the
most relevant landmark and the highest to the least relevant. We ran the aforementioned
series of tests for the following groups of features:
Landmarks: 48 sets of 40 features (40 at each landmark)
Features (i.e., kernels): 40 sets of 48 features (1 at each landmark × 48 landmarks)
Orientations: 8 sets of 240 features (5 at each landmark × 48 landmarks)
Frequencies: 5 sets of 384 features (8 at each landmark × 48 landmarks)
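The leave-one-out procedure can be sketched as follows. This is a schematic only: `recognition_rate` stands in for a full recognition trial over a data set and is entirely hypothetical.

```python
def rank_groups_by_relevance(groups, recognition_rate):
    """Rank feature groups by the damage done when each group is excluded.
    `groups` maps a group name (e.g., a landmark) to the set of feature
    indices it contains; `recognition_rate(excluded)` runs a recognition
    trial with the given features left out and returns a rate in [0, 1].
    The group whose exclusion hurts performance most is ranked most relevant."""
    rates = {name: recognition_rate(indices) for name, indices in groups.items()}
    return sorted(rates, key=rates.get)  # lowest mean rate first = most relevant
```

In the experiments described above, `recognition_rate` would itself average several hundred trials on pre-selected random data sets, so that every group is judged against the same data.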
The following figures illustrate the effects on recognition performance of leaving out
such groups of features.
Figure 10 Showing the effects of leaving out coefficients corresponding to kernels of
each spatial frequency. As in the figures below, 200 recognition experiments on data
sets of 400 image pairs were conducted for each of the five spatial frequencies. The graph
shows clearly that lower frequencies are more relevant for recognition, all else remaining
equal. Leaving out the highest frequency has a small, yet still significant effect on
recognition performance; the light blue line is the performance of EBGM; the dark blue
curve is the mean performance obtained by leaving out the spatial frequency denoted in
the abscissa.
Figure 11 Showing the effects of leaving out coefficients corresponding to kernels of
each spatial orientation, rotated in increments of π/8 starting with the vertical. This
figure shows that the vertical (orientation 1) and horizontal orientations (orientation 5)
are most relevant to face recognition, with the vertical orientation being most relevant.
None of the orientations, however, are so irrelevant to recognition that one would want to
exclude their coefficients entirely from the similarity calculation. The x-axis denotes the
orientation number; the y-axis denotes the mean recognition performance obtained by
leaving out all coefficients of the corresponding spatial orientation.
Figure 12 Showing the effects on recognition rates of leaving out all features from
each landmark. The middle plot represents the mean recognition rate of 200 individual
recognition experiments for each landmark; the straight line represents the recognition
rate of EBGM algorithm on the same data sets. Sets of 400 randomly selected image pairs
were used for each individual experiment. The data sets were pre-selected so that the
experiments for leaving out each landmark were performed using the same 200 data sets.
The curves above and below the central curves are analogous to error bars; they represent
one standard deviation above and below the central curve. Even though it seems that the
signal is lost in the noise, the ranking of the landmarks, when sorted by mean recognition
rate, is qualitatively similar to the other methods used to estimate relevance.
2.3.2.3 Relevance Criterion III: Binary Channel Capacity
The DFM model operates on long binary strings. Accordingly, we model the recognition
procedure on DFM b-strings as a communication over a set of binary symmetric
communication channels. While in transit, the bit corresponding to coefficient i may be
flipped with probability p_i. The image pair, one stored image and one “probe” image,
constitute the input and the output of the channel. In other words the probe image is
considered to be a noise-corrupted version of the communicated image; recognition is the
task of inferring from a probe image which stored image was actually sent.
Figure 13 Graphical depiction of relative importance (relevance) of landmarks according to “leave one out”
empirical experiments. The top 50% most relevant landmarks are in green; the bottom 50% in yellow. Circle size is
indicative of relative importance only and is not strictly proportional to any particular measurement.

The channel capacity of a binary symmetric channel, C, is

C = 1 − H(p) = 1 + p log2(p) + (1 − p) log2(1 − p),

where p is the probability that the bit received was not the bit sent. We estimated the
capacity of each feature channel by examining the distribution of the individual
transmissions for all pairs of images of the same person, excluding pairs of identical
images. Figure 15 shows the channel capacities of all 1,920 feature channels as an
intensity image. Higher values are depicted as lighter pixels. Note the similarity between
this image and the ANOVA F-matrices in Figure 8 above. This qualitative similarity
further suggests that the transformation from floating point to binary is information
preserving.
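The capacity estimate for one feature channel can be sketched as follows (the helper names are ours; `bit_pairs` stands for the (stored, probe) bit pairs collected over all “same”-person image pairs):

```python
from math import log2

def channel_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p:
    C = 1 - H(p), in bits per transmission."""
    if p in (0.0, 1.0):
        return 1.0  # no uncertainty: a deterministic channel carries one full bit
    return 1.0 + p * log2(p) + (1.0 - p) * log2(1.0 - p)

def estimate_capacity(bit_pairs):
    """Estimate one feature channel's capacity from observed (sent, received)
    bit pairs; the empirical flip rate serves as the crossover probability."""
    flips = sum(1 for sent, received in bit_pairs if sent != received)
    return channel_capacity(flips / len(bit_pairs))
```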
Figure 16 shows the total channel capacity of the channels corresponding to each
landmark represented as a bar chart. Compare this plot to the one below, a similar plot of
the total F-scores for each landmark from an ANOVA on the individual feature channels
of the Gabor transform. Note the nearly identical pattern of relevant landmarks. Note,
further, that the disparity between the most relevant and the least relevant landmarks is
more pronounced in the top plot, which represents the total channel capacity of each
landmark. As we will see, although the 50% most relevant landmarks under each
technique are the same with two exceptions, the exceptions give the binary channel
method of determining relevance a slight edge in recognition performance.

Figure 14 Schematic diagram of a binary symmetric channel. A bit is sent over a
communication channel. There is a probability 1 − p that the bit received is the same as
the bit sent, and a probability p that the bit is flipped during transmission. The channel is
symmetric because the probability of a one becoming a zero is the same as the probability
of a zero becoming a one. Simply, the bit received will not equal the bit sent with
probability p. The capacity of a binary symmetric channel is
C = 1 − H(p) = 1 + p log2(p) + (1 − p) log2(1 − p).
Figure 15 Image of feature channel capacities represented as intensity values, corresponding to
bits/transmission; the mapping from channel capacity, in units of bits per transmission, to intensity values is indicated
by the gradient bar at right. The figure is qualitatively similar to the intensity images produced by ANOVA (F-scores)
and “Leave One Out” empirical experiments. As in the other images, the areas where the channel capacity is greatest,
where the image intensity values tend to be high, are the forehead and the outline of the head. The mouth and the
fiducial points proximal to the mouth are least relevant, indicated by the low capacity, particularly in the higher
frequencies.
Figure 16 TOP: the total channel capacity of the “landmark channels”. Compare this plot to the BOTTOM, a
similar plot of the total F-scores for each landmark from an ANOVA on the individual features (coefficients) of the
Gabor transform.
2.3.3 Computational Complexity
Let us consider for a moment the time and space complexity of DFM, compared to the
Elastic Bunch Graph Matching algorithm presented in (Wiskott, Fellous et al. 1997). As
processing is identical in both algorithms up through the extraction of the model graph,
we will begin our analysis at the point at which the model graph coefficients are
available.
Figure 17 Showing the relevance of landmarks, considering each landmark as a set of binary channels.
There is one channel for each feature, 40 for each landmark. The total channel capacity for a landmark is
calculated by summing over the set of channels corresponding to the features attributable to it. In other
words, there is a binary channel corresponding to each Gabor kernel in a jet; the landmark capacity is the
sum of the capacities of these channels. The landmark sums are sorted and the smallest circle radius is
assigned to the landmark with least channel capacity. The size of the circles is only indicative of relative
importance and is not proportional to the actual channel capacity of the landmark. Green circles indicate that
the landmark is among the most relevant half of landmarks. Consequently, the magenta circles indicate that
the landmark is among the least relevant half.
2.3.3.1 Time Complexity
The matching phase of the EBGM algorithm compares a probe model graph with each
stored model graph to determine which stored graph is most similar. The standard
measure of model graph similarity is the normalized dot product, equal to the cosine of
the angle between the feature vectors corresponding to the observed and the stored
graphs. Thus, for each comparison, one must perform:
1,920 floating point multiplications
1,919 floating point additions
In the DFM model, for each comparison of one model graph against another:
Recoding Phase:
o 1,920 floating point subtractions (to calculate the deviations from the
mean);
o 48 n log n sorts (n=40) (to find the coefficients with the k highest
deviations);
Comparison Phase:
o 1,920 binary AND operations
o 1,920 binary additions
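The recoding phase above can be sketched as follows. This is a simplified reading of the scheme: within each 40-coefficient jet, the k coefficients deviating most from the mean face set their bits. Whether “highest deviations” means signed or absolute deviations, and the choice of k, are our assumptions for illustration.

```python
def recode_jet(jet, mean_jet, k):
    """Binarize one jet: subtract the mean-face jet, then set the bits of
    the k coefficients with the largest deviations (signed, by assumption)."""
    deviations = [c - m for c, m in zip(jet, mean_jet)]
    # Sort indices by deviation, largest first (the n log n sort in the text).
    top_k = sorted(range(len(jet)), key=lambda i: deviations[i], reverse=True)[:k]
    bits = [0] * len(jet)
    for i in top_k:
        bits[i] = 1
    return bits

def recode_graph(coefficients, mean_face, k, jet_size=40):
    """Recode a full 1,920-coefficient model graph into a b-string,
    one jet at a time."""
    bits = []
    for start in range(0, len(coefficients), jet_size):
        bits.extend(recode_jet(coefficients[start:start + jet_size],
                               mean_face[start:start + jet_size], k))
    return bits
```

Note that this work is done once per image at enrollment; the comparison phase touches only the resulting bits.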
Note that in EBGM all of the floating point operations enumerated above must be
performed for each stored model graph, whereas for DFM, the floating point operations
and the sorts need only occur once, when a new image is initially processed. Only the
comparison phase operations, which have significantly lower latencies (the latency of the
binary AND and ADD operations can be reduced at least 32-fold by packing the logical
values, i.e., bits, into 32-bit machine words for processing in parallel), must be performed for each stored
image. Although the two algorithms are both O(n), on a database of 1,000 images, for
example, the total ALU/FPU latency for EBGM is about two orders of magnitude greater
than for DFM.
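The bit-packing argument can be illustrated with Python integers standing in for machine words (illustrative only; a production system would use fixed-width words and a hardware population count):

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into a single arbitrary-precision integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def packed_and_count(a, b):
    """Count positions where both packed b-strings have a 1 bit: one bitwise
    AND plus a population count replaces per-bit AND and ADD operations."""
    return bin(a & b).count("1")

def packed_hamming(a, b, n_bits):
    """Fractional Hamming distance via one XOR and a population count."""
    return bin(a ^ b).count("1") / n_bits
```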
7 Assume that the latency for floating point operations is 3 clock cycles and the latency for integer additions
and binary comparisons is 1 clock cycle. This is obviously an oversimplification. However, it should be
noted that the binary AND operations take much less than one clock per comparison, as the processor’s
ALU can compare at least 32 bits per cycle. On a database of 1,000 images, the total latency for each
method is:
EBGM: 1,000 * (1,920 * 3 + 1,919 * 3) ≈ 11,520,000 cycles
DFM: (1,920 * 3 + 48 * 40 log 40) + 1,000 * (1,920 / 32 + 1,920 / 32) ≈ 136,000 cycles
By this calculation, DFM is slightly more than 80 times faster than EBGM on a database of 1,000 images.
Further, the rate of growth in time complexity as a function of database size is, for EBGM, nearly two
orders of magnitude greater than for DFM. In the limit, DFM is on the order of 96 times faster than EBGM.

2.3.3.2 Space Complexity
For each stored face model graph, EBGM must store 1,920 floating point coefficients. In
comparison, DFM must store, for the entire system, one set of 1,920 floating point
coefficients, which represents the average face, and one set of 1,920 bits for each stored
face. In the limit, assuming that floating point variables are four or eight bytes, depending
on precision, the storage requirements of DFM are either 32 or 64 times less than those of
EBGM.
The foregoing complexity analysis demonstrates the suitability of the DFM
representation for use in computer systems with limited processing and storage resources.
It is therefore an ideal choice for embedded devices, such as may be used in ultra-portable
security systems.

2.4 Literature Review
2.4.1 Norm-based Encoding
There is growing evidence that the brain uses a norm-based representation for faces;
faces are stored in the brain in some fashion as a deviation or set of deviations from an
accumulated average face or its constituent parts or lower-level features. fMRI
adaptation studies lend support to the idea of a “face space” in the brain with the mean,
or average, face at the center. Common to all face spaces is the representation of faces as
points in a high-dimensional vector space. Faces further from the mean are perceived as
more distinguishable than those close to the mean; away from the mean, faces are
considered to have greater “identity strength”. The physical neuronal representation of
faces in the brain is still largely unknown.
There is growing support for the idea that the brain processes faces in a manner different
from other objects. A number of these differences were discussed in (Biederman and
Kalocsai 1997). The following table presents a comparison of how faces are believed to
be processed as compared to other objects (Bruce, Green et al. 2003).
Faces                                            Non-face Objects
Processed “holistically” or “configurally”       Processed by breaking the object down into
                                                 geometric primitives (geons)
Repetition priming sensitive to image format     Repetition priming largely insensitive to
                                                 format
Representation taken from a low level of         Representation taken from a late stage of
processing, perhaps V1 simple cells              processing, perhaps the lateral occipital
                                                 area (LO)
Recognition suffers when the image is inverted   Recognition largely unaffected by inversion
to a photographic negative
Recognition hampered when faces are physically   Recognition largely unaffected by rotating
inverted (upside-down)                           the image 180 degrees
Table 11 A summary of differences in face and non-face object processing in the brain.
47
We take as our point of departure the idea that faces can be encoded in relation to a
stored average face, and we present a representation for faces that explicitly encodes
patterns of deviation from an average face. We will present and resolve an apparent
paradox: that reducing the size of a feature vector 32- or 64-fold, by converting a
high-dimensional vector of double precision floating point numbers to a sequence of
binary digits of the same length, improves recognition performance, reducing the error
rate by as much as forty percent.
Our representation for faces is based on the idea of converting each floating point feature
coefficient to a single binary digit. A similar scheme was presented by Daugman in the
context of iris recognition; see, e.g., (Daugman 1993; Daugman 1997; Daugman 2001;
Daugman 2001; Daugman 2003; Daugman 2004; Daugman 2004). To recognize an
individual from a scan of the iris, the iris is localized and the image is convolved with a
bank of Gabor filters at a number of fixed orientations and scales. The Gabor transform
yields one complex coefficient for each orientation/scale combination. The complex
coefficients are converted to polar form, and each Gabor phase is subsequently encoded as
a two-digit binary number denoting the quadrant in which the phase angle falls. The
phases are converted to binary using a two-bit Gray code in which the codes for adjacent
quadrants differ by only a single bit; for example, a phase of 30°, which falls in the first
quadrant, might be encoded as “11”. Unlike DFM, the iris
representation presented by Daugman does not utilize the Gabor magnitudes.
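A minimal sketch of this quadrant Gray coding, in Python; the specific bit assignment per quadrant is illustrative, and Daugman’s actual assignment may differ:

```python
import math

# One two-bit Gray code over the four quadrants: the codes of adjacent
# quadrants differ in exactly one bit (bit assignment illustrative).
GRAY = {0: (1, 1), 1: (0, 1), 2: (0, 0), 3: (1, 0)}

def quadrant_bits(phase):
    """Map a Gabor phase angle (radians) to its two-bit quadrant code."""
    quadrant = int((phase % (2 * math.pi)) // (math.pi / 2))  # 0..3
    return GRAY[quadrant]
```

For a phase of 30° (first quadrant) this yields (1, 1); moving the phase into an adjacent quadrant flips exactly one bit, so small phase perturbations cost at most one bit error.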
To compute the similarity of iris images, Daugman computes the fractional Hamming
distance between the corresponding binary iris codes. As any digit of the encoded binary
string has an equal chance of being “1” or “0”, the comparison of each bit can be treated
as a single Bernoulli trial; the comparison of two strings can thus be seen as a sequence of
Bernoulli trials in which matching bits are “successes” and non-matching bits are “failures”.
Indeed, over nine million comparisons of different iris images, the resulting fractional
Hamming distances are almost precisely binomially distributed. The comparison is
implemented by XORing the two strings and counting the set bits, which yields the number
of mismatching bits directly. A pair of irises is considered a match when the pair fails a
test of statistical independence. Statistically independent binary strings are ones in which
the probability that two corresponding bits are equal is no better than chance; such strings
are deemed to correspond to irises of different individuals. Daugman reports that in over
nine million comparisons there was not a single case of a false match.
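The distance computation can be sketched as follows (a minimal illustration; codes are modeled here as Python integers rather than packed bit arrays):

```python
def fractional_hamming(a, b, nbits):
    """Fraction of disagreeing bits between two nbits-long binary codes.
    XOR marks mismatches; counting its set bits gives the Hamming distance."""
    mismatches = bin((a ^ b) & ((1 << nbits) - 1)).count("1")
    return mismatches / nbits
```

Two codes of the same individual yield a small fractional distance, while statistically independent codes cluster near 0.5, the mean of the binomial distribution of chance agreement.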
Our encoding method, which was developed independently, differs from Daugman’s iris
code in several significant ways. First, Daugman encodes irises using a two-bit phase
code determined solely by the quadrant of the phase angle; our method of encoding faces
uses a single bit per coefficient to encode the pattern of coefficients whose Gabor
magnitudes differ “significantly” from those of an average face. Second, Daugman’s
encoding is independent of other iris images; ours depends on the mean values of the
coefficients taken over a gallery of faces. Our encoding for faces yields patterns about
which, we argue, it can be said that the binary-encoded coefficients taking a value of “1”
indicate which coefficients are most significant for this individual, and that a particular
constellation of “active” bits is useful in identifying other images of the same individual.
In previous sections we showed that the distribution of fractional Hamming distances
over all pairs of different face images is also nearly binomial. In contrast to the iris code,
however, there is more overlap between the distributions for “same” and “different”
image pairs than in Daugman’s iris data. This yields a much lower value of d-prime and,
consequently, a higher probability of false matches. As we do not argue that DFM is a
good representation for biometric validation, this is not a concern.
There have been a number of successful algorithms for encoding wavelet-transformed
images. These algorithms were motivated by the need to compress “Super High-
Definition” images, images whose resolution exceeds 2000 by 2000 pixels, without
visible deterioration in picture quality (Dasilva 1996). Several of these algorithms are
Embedded Zerotree Wavelet encoding (Shapiro 1993), Successive Approximation
Wavelet Vector Quantization (Silva, Sampson et al. 1996), and Set Partitioning into
Hierarchical Trees (SPIHT) (Said 1996). The goal of all of these algorithms is to leverage
the frequency-domain characteristics of the wavelet transform to derive lossy and
lossless compressed images.
Common to these algorithms are the following observations:
1. On average, the energy of the low-frequency components of a wavelet-transformed
natural image is greater than the energy in the high frequencies. The high frequencies
carry much of the fine detail in the image, but account for a lower proportion of the
total sub-band energy than the low-frequency components (Valens 1999; Dasilva 1996).
2. Large wavelet coefficients are more important than small wavelet coefficients (ibid.).
This follows from 1.
3. Wavelet coefficients of kernels of the same orientation tend to be correlated across
frequency levels. Although there is little correlation between frequency bands,
coefficients of the same orientation tend to be similar. Hence, if a coefficient is nearly
zero at a particular orientation, the coefficients at the same position over all frequency
levels will tend to be nearly zero. This statement is not always true, but it holds with
high probability.
These observations lead naturally to compression encodings with hierarchical structure.
Because of the sub-sampling that occurs in a wavelet transform, the entire transform can
be represented as a tree, in which coefficients at high frequencies have ancestors at lower
frequencies. It is assumed that sub-trees whose roots are below a certain threshold also
have descendants below the threshold. This yields an efficient, progressive encoding
that can be tailored to any desired bit rate.
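The zerotree assumption can be sketched as a recursive predicate over such a coefficient tree (a deliberately minimal tree type; the published algorithms are considerably more elaborate):

```python
class Coef:
    """A wavelet coefficient with children in the next-higher frequency band."""
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def is_zerotree(node, threshold):
    """True if this coefficient and all of its descendants fall below the
    threshold, so the entire subtree can be coded with a single symbol."""
    if abs(node.value) >= threshold:
        return False
    return all(is_zerotree(child, threshold) for child in node.children)
```

A progressive encoder applies this test at successively smaller thresholds, emitting one symbol per pruned subtree.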
We will not go into the details of these algorithms, whose purpose is compression, not
recognition; we mention them here to acknowledge a widely used class of wavelet
encoding algorithms. We note, therefore, that although our representation for faces
naturally reduces space complexity by nearly two orders of magnitude and could be
viewed as a compression scheme, this effect is merely a welcome side benefit; image
compression is not our goal. Our goal is to extract what is truly informative about each
Gabor-wavelet-transformed face image to yield a “face-print” that is diagnostic of
individual identity.
2.4.2 Related Face Recognition Methods
There are too many extant face recognition algorithms to do justice to them all in this
review. Fortunately, though the implementations may differ, many of them use one of a
small set of image pre-processing methods. The two most prevalent pre-processing
methods are the Gabor transform and Principal Components Analysis (PCA) (Turk and
Pentland 1991). There are other methods in wide use, for example Independent
Components Analysis, Linear Discriminant Analysis, and Fisher faces, to name but a
few, which we shall not discuss. We discuss the Gabor transform and PCA for another
reason: they are often employed in the neuroscience, cognitive neuroscience, and
psychophysics literature as models for early visual processing and as models for
“face spaces”.
2.4.2.1 Gabor Transform Methods
The Gabor transform is often used as a mechanism for modeling simple cells in V1
(Biederman and Kalocsai 1997). This is so because Gabor kernels, functions obtained by
translating, rotating, and dilating a “mother” wavelet, closely mirror the response profiles
of V1 simple cells (Jones and Palmer 1987). Jones and Palmer found that Gabor
functions fit the receptive field profiles of 97% of the simple cells tested in cat visual
cortex (ibid.). In a Gabor transform, a family of Gabor kernels comprising various
orientations and spatial frequencies, or spatial scales, is convolved with an image at
each pixel, at each node of a grid superimposed on the image, or at pixel locations
corresponding to image landmarks. The coefficients, or filter responses, of the Gabor-
transformed image at a single pixel location are grouped together into a vector called a
“jet” or “Gabor jet”. For example, an often-used set of kernels comprises eight different
orientations, rotations of the mother wavelet by increments of π/8, and five spatial
frequencies. The combination of orientations and scales is chosen to provide maximum
coverage in the phase plane (Potzsch, Kruger et al. 1996). One can interpret the filter
responses of a single Gabor jet as analogous to a macrocolumn in V1, in which cells in
the column are tuned to respond maximally to bars of a certain orientation and spatial
scale.
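A sketch of such a kernel family in Python, using the parameterization common in the EBGM literature (k_ν = 2^(−(ν+2)/2)·π, θ_μ = μπ/8, σ = 2π); the DC-correction term present in some formulations is omitted here for brevity:

```python
import cmath
import math

SIGMA = 2 * math.pi  # envelope width parameter (a common choice)

def gabor(x, y, k, theta):
    """Value at (x, y) of a Gabor kernel: a plane wave with wave vector
    (k cos(theta), k sin(theta)) under a Gaussian envelope."""
    r2 = x * x + y * y
    envelope = (k * k / SIGMA ** 2) * math.exp(-k * k * r2 / (2 * SIGMA ** 2))
    wave = cmath.exp(1j * k * (x * math.cos(theta) + y * math.sin(theta)))
    return envelope * wave

# Eight orientations and five scales yield the forty kernels of a jet.
FAMILY = [(2 ** (-(nu + 2) / 2) * math.pi, mu * math.pi / 8)
          for nu in range(5) for mu in range(8)]
```

Convolving an image with all forty kernels and sampling the responses at one pixel yields the forty complex coefficients of a jet.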
We have noted previously that Gabor wavelet coefficients correlate strongly with the
response profiles of simple cells in V1. It has also been shown, again by Daugman, that
Gabor wavelets optimally trade off the uncertainty between determining the precise
orientation of an object, such as an oriented bar, and determining its location in space
(Daugman and Downing 1998). To paraphrase Daugman, if we want to know both
“what” and “where” in relation to an object, then we can do no better than to construct a
spatial code from 2D Gabor wavelets (ibid.).
Algorithms founded upon a Gabor transform include, inter alia, Dynamic Link Matching
(Konen, Maurer et al. 1994; Wiskott and von der Malsburg 1996; Aonishi and Kurata
2000; Zhu and von der Malsburg 2002; Pichevar, Rouat et al. 2006) and Elastic Bunch
Graph Matching (Wiskott, Fellous et al. 1997; Arca, Campadelli et al. 2006; Yu, Teng et
al. 2006).
This research describes a model that takes as input a model graph of 48 jets, one from
each of 48 fiducial coordinates, and reduces it to its essence. We will prove that this
reduction is, ultimately, information-preserving and that it yields a representation for
faces that is more diagnostic of individual identity than the one currently used by the
Elastic Bunch Graph Matching algorithm.
2.4.2.2 PCA Methods
Among the most widely known face recognition algorithms is the Eigenfaces algorithm
(Turk and Pentland 1991; Moghaddam and Pentland 1997; Turk 2001). Eigenfaces are
simply the principal components of a set of intensity images, registered so that facial
features such as the eyes, nose, and mouth are located at the same pixel locations in each
image. In Figure 18 we show the first one hundred eigenfaces of frontal images from the
FERET dataset (Phillips, Moon et al. 1997; Phillips, Wechsler et al. 1998; Phillips,
Moon et al. 2000).
Figure 18 The first one hundred Eigenfaces from the fa and fb sets of the FERET database. Image alignment
was done automatically using the landmarks located by the EBGM algorithm. Note that this places EBGM and
Eigenfaces on an equal footing with regard to registration of image features. Experiments show that EBGM, and
consequently DFM, are much more robust to incidental misalignments than the Eigenface algorithm.
In the next figure we show the RMS reconstruction error, measured in pixels: the
average per-pixel RMS error over a reconstruction set of 1,000 face images. These two
figures reveal a number of flaws in the algorithm. First, to calculate the principal
components of the pixel values of intensity images, the images must be aligned, either
manually or by some automated procedure. To calculate the Eigenfaces shown in Figure
18, we obtained a registration of features using the landmarks located by the Elastic
Bunch Graph algorithm. The results of the research presented herein were all based on
features extracted from these landmark locations, so any incidental misalignments would
affect both recognition methods equally. In unreported experiments, recognition rates
using these principal components were extremely poor.
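Computationally, an eigenface is a leading principal component of the mean-centered, vectorized images. A minimal pure-Python sketch via power iteration, for illustration only (real implementations use optimized linear algebra over thousands of pixel dimensions):

```python
import random

def top_principal_component(rows, iters=200, seed=0):
    """Leading eigenvector of the (unnormalized) covariance of the
    mean-centered data, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]      # random start vector
    for _ in range(iters):
        proj = [sum(xi[j] * v[j] for j in range(d)) for xi in x]          # X v
        w = [sum(proj[i] * x[i][j] for i in range(n)) for j in range(d)]  # X^T X v
        norm = sum(c * c for c in w) ** 0.5 or 1.0
        v = [c / norm for c in w]                    # renormalize each step
    return v
```

Repeating the procedure on data deflated by each found component yields the subsequent eigenfaces.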
Figure 19 RMS errors from image reconstructions using the top n
Eigenfaces.
Chapter 3: The DFM Model
We begin with a frontally-oriented image of a human figure, resized and cropped to 128
by 128 pixels so that the face is isolated. Following previous work (Lades, Vorbruggen et
al. 1993; Wiskott and von der Malsburg 1996; Wiskott 1997; Wiskott, Fellous et al.
1997) the image is convolved with a family of forty Gabor kernels comprising eight
orientations and five spatial scales. A single Gabor wavelet kernel is a rotated and scaled
“mother wavelet”, a plane wave bounded by a Gaussian envelope. After a face-finding
process, samples of the Gabor wavelet transform image are taken at forty-eight fiducial
points. Each sample consists of forty Gabor coefficients, which are referenced
collectively as a Gabor Jet, or simply a Jet. The ensemble of forty-eight Jets constitute a
Model Graph, a data structure in which each node is labeled with the Jet extracted at the
fiducial point associated with the node, and each edge is labeled with the position of the
node in relation to the nodes to which it is connected. We will consider primarily the jets
of the complete model graph, comprising 1,920 Gabor coefficients, without reference to
the node positions, yet maintaining correspondences between coefficients / jets drawn
from different model graph instances. Figure 22 and Figure 23 show two equivalent
graphical representations of this structure. In the former, the graph structure and the
constituent landmarks are made explicit; in the latter, the coefficients have been
marshaled into a grid. For computation, the coefficients are placed into a single vector. In
a previous section we investigated which coefficients, orientations, spatial frequencies,
and fiducial points are most diagnostic for face recognition. As elsewhere in this
document, we will variously refer to the individual Jet components as “coefficients,”
“features,” or “kernels”; we take these terms to be synonymous and will use them
interchangeably.
The idealized neurons in our model correspond in a one-to-one fashion with the model
graph coefficients. Although we have stated that the neurons are either active or
quiescent, yielding a binary pattern of activations, we must, as a preliminary step to
calculating the activations, consider the mean value of each coefficient over some set of
previously observed images and compare this mean value to the observed value of a
model graph coefficient.
Figure 20 Processing of a face image starts at lower left. During the EBGM phase, a model graph is extracted
from a Gabor-transformed image. Feature vectors (Jets) are extracted from selected fiducial points. Jets become
node labels in a model graph structure. Mean coefficient values (μ layer) are subtracted from the model graph
coefficients in the Δ layer; in the χ layer a competition mechanism, which acts on coefficients grouped by
landmark or model graph, identifies the coefficients with the most significant deviations from the mean. In the
ρ layer, units corresponding to the most significant deviations are activated; the pattern of activations can be stored
in, for example, a b-string.
To mark a clear path from an image to a DFM b-string, we divide the model into six
discrete layers, which we designate:
1. Image (I), representing the Gabor-transformed image;
2. Model (M), corresponding to the model graph, which comprises jets extracted
from a set of fiducial points;
3. Mu (μ), corresponding to the stored representation of the mean face, or mean
coefficient values;
4. Delta (Δ), corresponding to the layer in which the deviations of individual
coefficients from the mean are calculated;
5. Chi (χ), a competitive layer in which only the coefficients that deviate
significantly from the mean, in some sense, trigger activation in the representation
layer; and
6. Rho (ρ), the representation layer, consisting of binary neurons, where each bit
corresponds to a neuron that is either active (“1”) or quiescent (“0”).
The Image (I) and Model (M) layers encapsulate the process of Elastic Bunch Graph
Matching (EBGM) (Wiskott, Fellous et al. 1997). In this research, we treat these layers as
a “black box”; for our purposes their workings are fixed. Layers three through six
constitute the DFM model. Each of these layers admits a number of variations, or
changes to the algorithms by which the layer outputs are calculated; this, in turn, affects
the character of the final binary representation. A diagram showing these layers in the
context of the complete model is presented in Figure 20. Assuming we have computed
and stored a face model, represented by the mean values of previously observed features,
we can conceive of the flow of information through the layers of the model in the
following way. Given an image we wish to store for future recognition, we perform a
Gabor transform of the image in the Image layer. Next, we extract the model graph, in
our case comprising 1,920 Gabor coefficients, in the Model layer. In the Mu layer, the
coefficients from the Model layer are compared with the stored mean coefficient values;
this yields, for each Model layer coefficient, a value signifying its deviation from the
mean. In the Chi layer, the neurons corresponding to the deviations from the mean
compete, such that only the k most significant deviations will produce an activation in
Rho, the representation layer. The pattern of activations in the Rho layer is stored for
subsequent processing.
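The flow just described can be sketched end-to-end as follows, assuming the model graph has already been reduced to a flat vector of 1,920 Gabor magnitudes; the by-landmark selection rule and the value of k shown here are one of several variants the text discusses:

```python
LANDMARKS, COEFFS = 48, 40  # 48 fiducial points x 40 Gabor coefficients

def dfm_bstring(graph, mean_graph, k=8):
    """Binary DFM representation: per landmark, activate the k coefficients
    whose magnitudes deviate most (in absolute value) from the stored mean."""
    bits = [0] * (LANDMARKS * COEFFS)
    for l in range(LANDMARKS):
        base = l * COEFFS
        dev = [abs(graph[base + f] - mean_graph[base + f])
               for f in range(COEFFS)]
        # Chi-layer competition: the k largest deviations win.
        for f in sorted(range(COEFFS), key=dev.__getitem__, reverse=True)[:k]:
            bits[base + f] = 1
    return bits
```

Note that exactly k bits are active per landmark, so every b-string carries the same number of active bits regardless of the face.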
The flow of processing is exactly the same for any observed face image. To perform
comparisons for face recognition, or for other cognitive tasks such as gender
identification, race identification, or expression recognition, it is necessary only to
classify patterns based on activations in the Rho layer. The coefficient values from the
model graphs can be discarded, as can the specific deviations from the mean calculated
in Mu. In fact, the only non-binary values that must be stored in the DFM framework are
the mean feature values encoded in the Mu layer for the purposes of calculating
deviations from the norm.
3.1 Formal Description of the Model
Figure 21, identical to Figure 20 and reproduced here for convenience, shows a schematic
of the flow of data through the various layers. Recapitulating, the process begins in the
lower left of the diagram, in the Image layer, which comprises the Gabor-transformed
image. Next, the Model layer can be considered to carry out a process of dynamic link
matching, in which correspondences between the Image layer and a stored face model
are found. Algorithmically, it encapsulates the process of Elastic Bunch Graph
Matching, a computational caricature of DLM. The units of the Model layer are
compared to the stored values in the Mu layer by subtracting the mean from the observed
value. The deviations are forwarded to the Chi layer, which sorts the values at each
landmark and identifies the k strongest deviations. The identities of the units with the
strongest deviations are passed to the Rho layer, which carries the binary encoding of the
observed face. This process is illustrated in Figure 22 to Figure 25. We will now describe
in detail the implementation of the various layers of the DFM model.
Figure 21 Schematic showing flow of processing through layers of the DFM model.
Figure 22 After a model graph is extracted from an image using EBGM each landmark contains a “jet” of forty
complex coefficients. We consider only the magnitude of the complex coefficient, represented in polar form.
Figure 23 The coefficients are marshaled. They will be compared to the mean and sorted by landmark or by model
graph.
Figure 24 From floating point coefficients to b-string bits. After the coefficients are marshaled, the mean is subtracted,
yielding a vector of deviations from the mean. Next, the most significant deviations are identified. In the last step, the
bits of the most significant deviations are “activated” (set to “1”). In this illustration, the selection process could be
either “by landmark” or “by model graph”; the circles, which denote coefficients, are illustrative of the process only.
Figure 25 Images of Face Representations. At left, colors stand in for Gabor magnitudes. Cooler colors are lower
values; warmer colors are higher values. At right, a binary representation of the same “image”. White squares indicate
“active” bits (1); dark squares denote “inactive” bits (0).
3.1.1 The Image and Model Layers
The Image Layer consists of the intensity image and its Gabor transform, from which
features will be extracted. The I layer takes an intensity image I and convolves it with a
family of Gabor wavelets. For all of the algorithms described in this work, facial
information is extracted from the image through a process of convolution with a family
of Gabor kernels

ψ_j(x) = (k_j²/σ²) exp(−k_j²x²/2σ²) [exp(i k_j·x) − exp(−σ²/2)],

where each ψ_j is a sinusoidal plane wave with wave vector k_j multiplied by a
Gaussian, which has the effect of restricting the plane wave to a localized area under the
Gaussian envelope. Gabor kernels constitute a family of self-similar wavelets differing
only in orientation and scale, i.e., they comprise rotations and dilations of the mother
wavelet. We employ a Gabor transform with kernels of eight orientations (θ_μ = μπ/8, for
μ = 0, …, 7) and five scales (k_ν = 2^(−(ν+2)/2) π, for ν = 0, …, 4). A sample of
the Gabor-transformed image at a pixel constitutes a forty-dimensional vector of complex
coefficients called a jet or a Gabor jet (Wiskott et al. 1999).
The Gabor coefficients are returned as quadrature-phase pairs, which are subsequently
converted to polar form. The phase component is discarded, leaving a 40-vector of
double precision floating point numbers.⁸ The justification for this can be found in
(Wundrich, von der Malsburg et al. 2002) and other precedent research. For the purposes
of this thesis, a reference to a jet will refer to the real-valued, unit-length vector of Gabor
amplitudes. If we reference the complex-valued vector, we will state this explicitly.
Accordingly, a jet is a sample of the Gabor-transformed image taken at an arbitrary pixel
location. The feature representation, a vector of real-valued Gabor magnitudes, models
the responses to visual input of a hyper-column of orientation-tuned complex cells in
visual cortex. The kernel parameters denoting scale, i.e., the widths of the Gaussian
envelope functions, are chosen in such a way as to support the detection of spatial
frequencies over the whole of the relevant frequency domain.
It has been noted, for example in (Shams and von der Malsburg 2002), that a Gabor
kernel, constituted of a plane wave with wave vector k_j multiplied by a Gaussian
envelope of variance σ², is a very good approximation to the receptive field of a simple
cell in early visual cortex. A jet, then, can be thought of as analogous to a visual-cortical
hyper-column. The DFM model itself can be thought of as a constellation of neuronal
complexes situated further along the ventral visual pathway.
⁸ For a detailed discussion and a justification for discarding Gabor phases and using only magnitudes, see
Shams, L. and C. von der Malsburg (2002). "The role of complex cells in object recognition." Vision
Research 42(22): 2547-2554.
Although a jet can be sampled at any image location, we use a process of Elastic
Bunch Graph Matching (EBGM) (Wiskott, Fellous et al. 1997) to locate forty-eight
selected facial landmarks, or “fiducial points”, in frontally oriented face images (Phillips,
Moon et al. 2000), and then sample a jet at each of these fiducial points.
Though no semantics need attach to the Gabor magnitudes, jet coefficients have
been interpreted as the mean firing rate of a columnar cell, over and above the cell’s
resting rate, or as the probability of observing a spike in a given columnar cell during a
time-slice at least as long as the cell’s refractory period (Hertz, Krogh et al. 1991). We
find it sufficient to note that the Gabor wavelet is optimal in the sense of balancing
precision in spatial localization with sensitivity to orientation (Daugman 1988). Thus, the
Gabor is an optimal choice of filter from which to derive, for example, an estimate of the
orientation and position of contrast features.
The Model Layer consists of a feature graph extracted from the Image Layer. The
coefficients of the model graph are marshaled and presented as inputs to the Mu
sub-layer of the Representation Layer.
3.1.2 The Representation Layer
The Representation Layer comprises three sub-layers, designated Mu (μ), Chi (χ),
and Rho (ρ): the Mean, Competition, and Representation sub-layers, respectively.
3.1.2.1 The Mean (Mu) Layer
The Mu layer, Μ, comprises a population of neurons in which each neuron represents a
single coefficient, μ_{l,f}, of the mean model graph, where the indices l and f indicate the
landmark and the Gabor feature, respectively. The layer simply encodes the average face,
in the form of the mean over all previously observed model graphs. The mean coefficient
values will be compared in the next layer to the currently observed coefficients. The
means represent the expected values of the probability distributions of the coefficients;
significant deviations from a mean are indications that the coefficient is salient at that
landmark.
3.1.2.1.1 Calculating the Average Face
Given a set of model graphs, it is a simple matter to calculate the average face. A model
graph comprises 1,920 complex coefficients, forty at each landmark. The forty
coefficients constitute a jet, which comprises filter responses of Gabor kernels of eight
orientations and five scales, or spatial frequencies. Viewed in polar coordinates, each
complex coefficient consists of a magnitude component and a phase component. Each
component must be handled separately when computing the mean. It should be noted that
for most recognition experiments, the phase may be disregarded. This is so for a number
of reasons, which are treated at some length in (Shams and von der Malsburg 2002)
among others. In short, the magnitudes of kernels of the same orientation and phase
sampled at small pixel intervals will vary smoothly over short distances; phases, on the
67
contrary tend to oscillate with a frequency proportional to the frequency of the plane-
wave that partly defines the kernel. This makes the magnitudes extremely robust to small
displacements of the sampled location to the veridical landmark position. This is not the
case with phases. To use phases for recognition one must estimate the displacement
vectors to adjust the phase prior to comparison. (Wiskott 1997) has shown that the
displacement vectors can be estimated and that the ability to include phase in the
calculation can improve recognition performance, but the procedure is more
computationally intensive and less robust in general than using the magnitudes alone.
Although we use only the magnitudes for recognition and other identification tasks, we
still have a use for a mean value of the phase coefficients over a number of model graphs.
We will see how magnitudes and phases can be averaged in the next sections.
3.1.2.1.2 Computing the Mean Magnitudes
Given model graphs G₁, …, G_K, each model graph G_k has N = 1,920 complex
coefficients c_{k,j}, j = 1, …, N. Each c_{k,j} has a magnitude part and a phase part,
denoted a_{k,j} and φ_{k,j}, respectively.

The mean magnitude of a kernel over a set of model graphs is the arithmetical mean of
the individual kernel magnitudes over the set:

ā_j = (1/K) Σ_{k=1..K} a_{k,j}.

For example, the mean of kernel 1 (of 1,920) in a set comprising K model graphs is the
arithmetical mean of the K values of kernel 1 in the set.
3.1.2.1.3 Computing the Mean Phases
To compute the mean phases over a set of model graphs, one can no longer use the
arithmetical mean because of the circular nature of the phase coefficient. A commonly
used example is the following: if you wanted to compute the mean of two angles
measuring 1° and 359°, you could try taking the arithmetic mean, (1°+359°)/2 = 180°.
Unfortunately, this result is literally opposite the desired answer: .
Obviously the two angles are two degrees apart so their mean should be exactly between
them, at 0 degrees. One can reconcile this difference using circular statistics. In short, the
mean of a set of angles is the arctangent of the mean of the sines divided by the means of
the cosines of the angles.
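A direct transcription of this rule, assuming angles in radians:

```python
import math

def circular_mean(angles):
    """Circular mean: arctangent of the mean sine over the mean cosine,
    mapped back into [0, 2*pi)."""
    mean_sin = sum(math.sin(a) for a in angles) / len(angles)
    mean_cos = sum(math.cos(a) for a in angles) / len(angles)
    return math.atan2(mean_sin, mean_cos) % (2 * math.pi)
```

For the 1° and 359° example this returns an angle numerically indistinguishable from 0°, rather than 180°.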
Using this method, we were able to calculate the means of the phases of the individual
kernels over sets of model graphs. We used the mean phases primarily for the purpose of
creating image “reconstructions” from the mean model graphs of sets of images. In
Figure 26 and Figure 27 below we show reconstructions of average faces. In the first
triptych, the middle image is the mean over all 3,280 images in the fa and fb sets of
FERET. On the left and right, respectively, are the mean male and mean female faces.
The number of images averaged to obtain the model graph whose reconstructions are
shown below are found in Table 14.
Figure 26 Average Faces. A new set of complex coefficients for each average face was calculated from the images
in the dataset. New magnitudes were taken from the arithmetic mean of magnitudes from a subset of images in the
database; similarly, circular statistics were used to calculate the average phase. From left: the average male, the overall
average, and the average female.
For exploring the other-race effect, it was also necessary to calculate the mean
Caucasian, Asian, and African faces/model graphs. These images can be seen in Figure 27
below; from left to right are the mean Caucasian, Asian, and African faces, respectively.
In all, we computed sixteen separate means, calculated from subsets of images in the
FERET database. Note that there is significant overlap between groups. The complete list
of means used for various unreported experiments can be found in Table 14 below.
Figure 27 More Average Faces – from left: average Caucasian, average Asian, average African.
Computing the deviations from the mean is relatively straightforward: given a model
graph, subtract from its coefficients those of a given mean model graph. The next
question, which is less straightforward, is how we are to identify the greatest deviations
from the mean. A first attempt might be to take the absolute values of the deviations over
the whole model graph, sort the results, and take the top k. Another possible method is to
take the absolute values and select the top k deviations at each landmark. As we will see,
this is a step in the right direction, but not at all satisfactory. We explored both of the
foregoing options, obtaining the worst results with the former. The latter was somewhat
better, but still failed to attain the results we achieved with the method we ultimately
adopted.
3.1.2.2 The Competition (Chi) Layer
The Chi Layer constitutes a mechanism by which only the most significant deviations of
units in the model layer from the mean encoded in the Mu layer can cause a unit in the
next layer to be active. We start by calculating the difference between the observed
coefficient and its respective mean.
71
In the foregoing, L is the set of landmarks, F is the set of features in a jet, and l and
f are, respectively, the landmark index and feature index; c(l, f) and μ(l, f) are,
respectively, the observed coefficient and the mean coefficient for feature f at
landmark l. At each landmark, the units compete, and of all the coefficients at each
landmark, only the k highest deviations will correspond to active bits in the Rho layer.
The DFM model as presented herein forces consistency across landmarks: the same number
of units, k, from each landmark will be allowed to be active in the next layer,
designated Rho.
This states that the Chi layer, at landmark l, comprises the identities of the observed
coefficients constituting the top k deviations from the mean. This formulation is very
general, permitting a number of alternative rules for selecting "active" bits.
Among the strengths of the DFM representation for faces is its great versatility. There are
many ways to design an encoder to convert the double precision floating point
representation of the bunch graph to the binary representation of the DFM string.
Additionally, DFM admits the use of a number of similarity criteria to judge the quality
of a match. In the paragraphs that follow, we discuss a number of different encodings.
DFM strings may be encoded individually for each landmark and concatenated. They
may also be encoded taking the jets of the Model Graph as a single vector. The floating
point values from which deviations from the mean are calculated may be simply the
veridical model graph coefficients or the z-scores of the coefficients. The encoded bits
taking the value of “1” may be chosen from the top deviations (by landmark or by model
graph) or they may be chosen by thresholding the z-scores and selecting the bits whose z-
score is above or below a certain value. In the paragraphs that follow, we discuss the
various encoding options available, which will motivate the presentation of experiments
and results in a later chapter.
3.1.2.2.1 Composition By Landmark
Consider again our hypothesis: that the important information for face recognition
is not the actual coefficient values (firing rates), but the subpopulation of jet coefficients
(columnar neurons) that represent the most significant deviations above the mean
(quiescent, or resting rate). To this end, we enforced a one-size-fits-all subpopulation
size, k, per landmark (macro-column). There are forty coefficients per landmark and
forty-eight landmarks per model graph. At each landmark, we identified the k
coefficients that were the greatest distance from the mean. To encode a face model graph
using the “by landmark” method we start with the jets of the model graph, one for each
landmark. For every jet coefficient there is a corresponding bit in the binary string. We
start with a binary string initialized to 0, a vector of 1,920 zeros. For each landmark in
turn, consider a single jet and its corresponding binary substring. Bits in the substring are
set to "1" if the corresponding jet coefficient is among the top k deviations or z-scores.
After doing this for every landmark, we have a complete encoded binary string with
48·k "active" bits.
Here is the algorithm:
1. For each jet, J, in model graph G, and its corresponding binary
substring in the DFM binary string B:
2. Sort the coefficients (Gabor amplitudes, deviations, z-scores, etc.) of J in
descending order; this will yield an ordering of the coefficients;
3. For the bits in the substring with ordinals corresponding to those of the first k
coefficients in the ordering on J, set those bits to "1". Set all other bits in the
substring, corresponding to the ordinals not chosen, to "0".
4. End For
5. Store B for later processing.
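The by-landmark procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments; the function name and the representation of jets as plain lists of coefficient values are assumptions:

```python
def encode_by_landmark(jets, k):
    """Encode a model graph as a DFM b-string, landmark by landmark.

    jets: one list of coefficient values (deviations, z-scores, etc.)
    per landmark. For each jet, the k bits corresponding to the k
    largest values are set to 1; all other bits are set to 0.
    """
    b_string = []
    for jet in jets:
        # Ordinals of the k largest coefficients in this jet.
        top = set(sorted(range(len(jet)), key=lambda i: jet[i], reverse=True)[:k])
        b_string.extend(1 if i in top else 0 for i in range(len(jet)))
    return b_string
```

With forty coefficients per landmark and forty-eight landmarks, the resulting string has 1,920 bits, of which exactly 48·k are active.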
We experimented with different values of k, where 1 ≤ k ≤ 40. It should be clear
that the choice of k affects many aspects of the model: the information content of a
binary pattern; the information content of each individual feature over all model
graphs in the database; and the combinatorial complexity of binary patterns, i.e., the
number of distinct patterns that can be generated by the model. These three aspects are
interrelated, as we shall see in a later section.
3.1.2.2.2 Composition By Model Graph
A face model graph can also be encoded to a binary string by considering all
coefficients of the model graph without reference to the landmark from which they were
sampled. In effect, the deviations from the mean of all model graph coefficients, 1,920
of them per image, are sorted from highest to lowest, and the bits corresponding to the
indices of the top m will be "active" in the encoding. This yields the following simple
algorithm:
1. Sort the z-scores or deviations from the mean of the coefficients of the model
graph, G;
2. Set each bit of the binary vector, B, whose index corresponds
to the index into the model graph of a single coefficient, to "1" if the coefficient
is among the top m z-scores or deviations from the mean in the sorted ordering of
all coefficients in G;
3. Store the binary string B for further processing.
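A sketch of the by-model-graph variant, under the same assumptions as before (hypothetical names; the coefficients are given as one flat list):

```python
def encode_by_model_graph(coeffs, m):
    """Set to 1 the bits of the m coefficients with the largest values,
    considered over the whole model graph without reference to landmarks."""
    top = set(sorted(range(len(coeffs)), key=lambda i: coeffs[i], reverse=True)[:m])
    return [1 if i in top else 0 for i in range(len(coeffs))]
```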
3.1.2.2.3 On Deviations from the Mean
We also experimented with different definitions of "deviation from the mean", which
could be interpreted in the following ways:
1. consider only positive deviations from the mean;
2. consider only negative deviations from the mean; and
3. consider the absolute values of the deviations from the mean.
For representations that will be encoded from z-scores, we also experimented with an
encoding that “activated” bits corresponding to z-scores above a threshold.
For a given value of k, we record the identities of the coefficients with the greatest
deviations, taking the top k deviations, starting with the greatest deviation according
to one of the three definitions, without regard to whether all of the deviations
otherwise fit the definition. For example, if we used definition one and considered only
the positive deviations, we would sort the deviations in descending order, starting from
the highest positive deviation, and take the first k coefficients in the list whether or
not they were all positive. For some values of k, for some images, the smallest of the
"significant" deviations in a jet or model graph could therefore be close to zero or
even negative. For simplicity, however, we opted to favor the one-size-fits-all k bits
per landmark criterion. As we will see, the first definition in the list above was better
than the two alternatives by a long margin.
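The three definitions can be folded into the sort key used before top-k selection. A minimal sketch, assuming coefficients and means are given as parallel lists (the mode names are ours, not the dissertation's):

```python
def deviations(coeffs, means, mode="positive"):
    """Deviation of each coefficient from its mean, under one of the
    three definitions; sorting the result in descending order ranks
    the coefficients for top-k selection."""
    d = [c - mu for c, mu in zip(coeffs, means)]
    if mode == "positive":      # definition 1: greatest positive deviations first
        return d
    if mode == "negative":      # definition 2: most negative deviations first
        return [-x for x in d]
    if mode == "absolute":      # definition 3: largest absolute deviations first
        return [abs(x) for x in d]
    raise ValueError(mode)
```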
3.1.2.3 The Representation (Rho) Layer
The Rho layer comprises a population of neurons with a one-to-one correspondence to
units in the Mu and Chi layers. The Chi layer has performed the calculation of the
deviations from the mean, and sorted the resulting deviations to determine the most
significant deviations. The Rho layer units corresponding to the strongest deviations from
the Chi layer at each landmark will be considered active. All other units are deemed
inactive. An active unit carries a value of one; an inactive unit carries an activation of
zero.
Rho comprises a binary vector in which an element carries a value of 1 if and only if
the feature the element represents was among the k coefficients with the highest
deviations at its landmark. We tried other criteria for constituting the binary vector,
which will be
discussed in a later section.
3.2 DFM Similarity Measures
In addition to the dot-product, both normalized and non-normalized, including the dot
product of binary vectors, we use two other similarity measures that merit discussion:
3.2.1 Hamming Distance on binary patterns
The fractional Hamming Distance between two patterns is the percentage of bits in two
binary strings that take different values. For example, the fractional Hamming Distance
(fHD) between "1 0 1 0 1" and "1 1 1 1 1" is 2/5, or 0.4. Intuitively, the fHD may be
calculated as the sum over an XOR operation on two strings divided by the length of the
strings.
Note that, as a consequence of the manner in which we have created the binary
representations of bunch graphs, it is equivalent to calculate the sum of an AND
operation (SoA). The only difference is that with the fHD, we are looking for strings
with a low fHD to a probe string; with AND, we are looking for strings with a high SoA.
The HD can be obtained from the SoA by the following formula:

HD = 2(m − SoA),

where m is the number of active bits in a binary pattern. We can see that this is so
because an encoded bit that does not match the same bit in another pattern contributes
two to the HD. This is because all strings have the same number of active bits; hence, a
mismatch in one ordinal position automatically results in a mismatch at another ordinal
position. In the case of SoA, a mismatch in one position will only take away one from
the total of the AND operation. Taking the simplest example, if you had two bit
positions that were "0 1" in one string and "0 1" in the other, the SoA would be 1 and
the HD would be 0. By our formula: 2(1 − 1) = 0. Changing the strings to "1 0" and
"0 1" results in a score of 0 for the SoA operation and a score of 2 for the HD
operation. By our formula above: 2(1 − 0) = 2.
We raise this point only because the sum of AND, which requires a single matrix
multiplication, is much more efficient to implement in matrix form than the equivalent
sum of XOR, which requires manual iteration over each pair of patterns.
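The relationship between the two measures can be checked directly. A small sketch (helper names are ours) for binary patterns with equal numbers of active bits:

```python
def fhd(a, b):
    """Fractional Hamming distance: fraction of positions whose bits differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def soa(a, b):
    """Sum over AND: number of positions where both patterns carry a 1."""
    return sum(x & y for x, y in zip(a, b))

# When both patterns have the same number m of active bits,
# the raw Hamming distance equals 2 * (m - SoA).
a = [1, 0, 1, 0, 1]
b = [1, 1, 1, 0, 0]                      # both have m = 3 active bits
hd = sum(x != y for x, y in zip(a, b))   # raw (unnormalized) Hamming distance
assert hd == 2 * (sum(a) - soa(a, b))
```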
3.2.2 Maximize Mutual Information
There is another way to think about comparisons between two patterns. One can compare
the amount of information they have in common. Restated, how much is our uncertainty
reduced about one pattern upon presentation of another pattern? These questions can be
formalized and quantified using concepts from Information Theory.
The Entropy of a discrete random variable X is defined as

H(X) = − Σ_x p(x) log2 p(x).

Entropy can be interpreted in several ways: as the lower limit on the compressibility of
a random variable – a strict lower bound on the minimum average encoding length for its
transmission – and as the amount of information content in a random variable. It has
also been interpreted as the degree of surprise on receiving the value of the random
variable (Bishop 2006).
If Entropy is a measure of uncertainty, then Mutual Information is a measure of how
much uncertainty is reduced in one random variable given another random variable.
Intuitively, it is the uncertainty of one random variable less the uncertainty of the
first random variable given the second:

I(X; Y) = H(X) − H(X|Y).

It is a measure of dependence between two random variables. It is symmetric in X and Y
and is always non-negative (Cover & Thomas 1991). It is, in fact, the Kullback–Leibler
distance – or relative entropy – between the joint distribution of two random variables
and their product distribution.
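These quantities are straightforward to estimate for binary patterns by treating paired bits as samples from a joint distribution. A minimal sketch using plug-in estimates from counts (function names are ours):

```python
from collections import Counter
from math import log2

def entropy(samples):
    """H(X) = -sum_x p(x) log2 p(x), with p estimated by counting."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))
```

Identical patterns give I(X; Y) = H(X); independent patterns give a value near zero.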
We hypothesize that two patterns attributable to the same individual will, in a manner
similar to the iris patterns of Daugman, e.g., (Daugman 1993), fail a test of statistical
independence. The degree of this failure can be quantified, for example, by the amount of
mutual information, the Kullback–Leibler distance, the Hamming distance, and the AND
similarity. The results given by each of these measures are exactly or highly correlated
with those produced by the others, except in a few important instances.
3.3 Image Reconstructions from Binary Magnitudes and Phases
An algorithm for the reconstruction of intensity images from model graph coefficients
was presented in (Wundrich, von der Malsburg et al. 2002). The reconstruction of an
image from the sparse sampling of jets available in a model graph takes place in
piecewise fashion. The intensity values of an image patch within a Voronoi region
surrounding a fiducial point are estimated from the jet coefficients. An example of a
reconstruction of an image from our dataset using the jets of the model graph extracted
from the image is seen in Figure 28.
In the DFM face representation, the veridical Gabor magnitudes from a face model graph
are encoded to a binary string; the magnitude of the Gabor coefficient – the amplitude
of the complex-valued coefficient in polar form – is then discarded. We have asserted,
and will show, that recognition performance can be improved dramatically with such an
encoding.
But how can this be? To help develop an intuition for what is happening here, we
performed image reconstructions with the method cited above, substituting an alternative
representation for the actual model graph data. The substitute representation was derived
from either the mean features, in the case of magnitudes or fiducial point locations, or a
binary representation of the actual Gabor features, in the case of Gabor phases. It is
well known that the phase information in a Fourier transform cannot be altered prior to
performing an inverse Fourier transform without losing the transformed image. The same
is not true of the magnitudes; an inverse performed on substitute or random magnitudes
along with veridical phases will retain the character of the original image. A
reconstruction using "average" phases, as we used average magnitudes in Figure 28,
would yield images that all looked nearly identical. Thus, we had to retain some of the
phase
information from the original image. Though it seems paradoxical to use Gabor
magnitudes for recognition yet assert that phase information is essential to reconstruct the
image so that individuals can be distinguished, this paradox was resolved in (Shams and
von der Malsburg 2002). If the image had been densely sampled, the image could be
reconstructed, up to a photographic negative, from the magnitudes alone (Shams and von
der Malsburg 2002).
It is also possible to reconstruct the image from only partial information; the mean
magnitudes and phases can be masked by the DFM b-string -- only the coefficients
corresponding to the active bits in the string are used. How much information is
necessary to reconstruct a face image from a model graph, which is by its very nature a
sparse sampling of the Gabor transform of the original image, in such a way that the
person depicted in the image is easily recognizable? We point out that reconstructing
images is not the focus of this thesis. We created the reconstructions in the figures below
to get an intuition for how much information is retained in the DFM binary encoding.
We tried several techniques, all of which yielded satisfying results.
A model graph comprises a set of nodes labeled with Gabor jets and the image
coordinates of the fiducial points. Each jet contains forty complex Gabor coefficients,
represented in polar form as magnitude-phase pairs. Given a set of node positions, a set
of magnitudes and a set of phases, the reconstruction algorithm can generate a
reconstructed image.
The original model graph output by the EBGM algorithm comprises a set of nodes labeled
with jets, together with a set of edges. From the set of edges, we derive the node
positions of the fiducial points. The jets supply the Gabor amplitudes and phases at
each landmark, i, and the node positions supply the image coordinates of the fiducial
points. The reconstruction algorithm takes a model graph as input. We have, therefore,
three opportunities to substitute our own "average" data for the unaltered model graph
data.
We can:
1. Use the true Gabor magnitudes or the mean magnitudes masked by the DFM
b-string;
2. Use the true Gabor phases, or the “demodulated” phases as in (Daugman
2003) wherein each Gabor phase is replaced by the angle bisector of the
quadrant of the actual phase; phase information can be stored with a two bit
Gray code, which denotes the quadrant of the phase angle. In this way, phase
information can be retained and the floating point phase component can be
discarded.
3. Use the true fiducial point positions, or the average fiducial point positions
calculated by taking the arithmetic mean of the fiducial point locations over
the whole data set.
The veridical data and their alternate representations are summarized in Table 12.
Table 12 The data components of a model graph and the alternative representations.

Model Graph Data      Alternate Representation
Gabor Magnitudes      Mean Gabor Magnitudes
Gabor Phases          Demodulated Gabor Phases
Node Positions        Mean Node Positions

The data from each class can come from one of two sources: the veridical model graph
data and the alternate representation. Thus there are eight (2^3) possible combinations
of data for reconstructions. In all eight cases, the binary encoding is used as a mask
of the data vector. For example, if the binary encoding in position i is "1", the value
from one of the representations is used. If it is "0", then a value of zero is used.
Feature Combination    Amplitude    Phase      Node Positions
1  ASPSNS              Subject      Subject    Subject
2  ASPSNM              Subject      Subject    Mean
3  AMPSNS              Mean         Subject    Subject
4  AMPSNM              Mean         Subject    Mean
5  AMPGNS              Mean         Generic    Subject
6  AMPGNM              Mean         Generic    Mean
7  ASPGNS              Subject      Generic    Subject
8  ASPGNM              Subject      Generic    Mean

Table 13 Feature Combinations for Image Reconstructions. An image can be reconstructed
from a model graph from the Gabor features, amplitudes and phases, and the positions of
the graph nodes, or fiducial points. For each of Gabor amplitudes, Gabor phases, and
model graph node positions there is an alternative source of data. For Gabor amplitudes,
the mean amplitudes are used; the mean vector is masked by the DFM b-string to determine
which amplitudes will be used. For node positions, the mean node positions are used. For
phases, each phase coefficient is demodulated, as in (Daugman 2003), to a two-bit Gray
code corresponding to the quadrant of the phase angle; the generic substitutions
comprise the angle bisectors of the quadrants of the coordinate axes, i.e., angles of
π/4, 3π/4, 5π/4, and 7π/4.
Figure 28 Image reconstructions from model graphs. In this figure one should read "Mean
Features" as "Mean Magnitudes"; similarly, "Subject Features" becomes "Subject
Magnitudes." The phase information is treated on its own axis. The figure is a
projection, or squashing, of the three-dimensional binary parameter space onto two
dimensions. On the left side are reconstructions using all features or their alternate
representations; on the right side, reconstructions in which half of the magnitudes and
phases, or their alternates, have been masked by a DFM b-string. Reconstructions are
performed for each of the eight possible combinations of veridical/alternative features.
The reconstruction that uses only the alternate representations, i.e., no veridical data
(third row, right side in each half of the figure), is still readily recognizable as the
same person. This is the case even when half of the alternate data is removed (third
row, far right).
We conclude this chapter with a summary of an experiment otherwise undocumented.
We were interested to see how much information about subject identity is carried in the
phase information. EBGM discards the phase information from the complex Gabor
coefficients for the simple reason that small spatial displacements of a convolution
kernel have little effect on the magnitude component but affect the phase component in
less predictable ways. A small displacement will affect high-frequency kernels much more
than low-frequency kernels. This renders phase information less reliable – small
displacements affect each kernel differently. Yet, following the lead of
Daugman, we performed a group of experiments using only binary vectors of
demodulated phases. Phases were demodulated, or recoded, in three ways:
1. Using one bit per phase component, where the bit encoded whether the phase
angle was greater or less than π radians.
2. Using two bits per phase component, in which each phase component was
represented by two bits, denoting the quadrant of the phase angle. A binary
code of “00” was assigned to the first quadrant (top-right). Proceeding
counter-clockwise, quadrants were assigned, respectively, “01”, “10”, and
“11”.
3. Using two bits per phase component but the quadrants were assigned an
encoding based on a Gray Code, in which the code for a quadrant differed in
only one bit from the encoding of an adjacent quadrant.
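The second and third demodulation schemes differ only in how two-bit codes are assigned to quadrants. A minimal sketch; the particular bit assignments below are illustrative, not necessarily those used in the experiments:

```python
from math import pi

# Quadrant codes, counter-clockwise from the first (top-right) quadrant.
PLAIN = ["00", "01", "10", "11"]  # scheme 2: plain binary assignment
GRAY = ["00", "01", "11", "10"]   # scheme 3: adjacent quadrants differ in one bit

def demodulate(phase, codes=GRAY):
    """Map a phase angle (radians) to the two-bit code of its quadrant."""
    quadrant = int((phase % (2 * pi)) // (pi / 2))
    return codes[quadrant]
```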
The results of this group of experiments can be seen in Figure 29. Interestingly,
recognition rates in excess of eighty percent for small data sets, and in excess of
seventy percent for large data sets, can be achieved from demodulated phases alone.
Figure 29 Results of recognition experiments using only the phase components of the complex Gabor
coefficients. Each data point is the average of three hundred recognition runs on sets of image pairs of the cardinality
denoted in the abscissa. The error bars represent one standard deviation of the distribution of recognition results at each
cardinality. One sees that two-bit Gray-coded phases can achieve recognition rates in excess of eighty percent, often
higher.
Chapter 4: Experiments
4.1 The Data Set
For all of the experiments herein, we have used images from the FERET database
(Phillips, Moon et al. 2000). Specifically, we begin with an image domain that includes
all images from the fa and fb sets of images, both of which sets comprise images of
frontally oriented people, i.e., oriented perpendicularly to the image plane, from just
below the shoulders to the top of the head. The primary difference between an image
from the fa set and an image of the same individual from the fb set is a change of facial
expression. In many cases there are multiple images of the same individual in the fa set,
in the fb set, or in both sets. Illumination conditions vary between individuals and/or
between images of the same individual, though such variations occur far more frequently
in the former case.
In total, there were 3,280 images in the full gallery comprising the fa and fb sets
of images. The data were labeled with class identifiers appropriate to the task. For
example, for gender identification, samples were labeled “Male” and “Female”. For the
recognition tasks, samples were labeled with the subject identifier unique to each
individual. In all, there are 3,280 frontal images in the database of 1,011 unique
individuals. According to our classifications, the various subcategories can be broken
down as follows:
Subcategory    Number of Images    % of Images    Number of Subjects    % of Subjects
Male           2,112               64.39          603                   59.64
Female         1,168               35.61          408                   40.36
Glasses        438                 13.35          167                   16.52
No Glasses     2,842               86.65          940 [9]               92.98
Caucasian      1,622               49.45          515                   50.94
Asian          503                 15.34          147                   14.54
African        262                 7.99           83                    8.21
Other          893 [10]            27.22          266                   26.31
Table 14 Sub-categories of the FERET Database. Several groupings of the data within the
FERET fa and fb sets and corresponding frequency statistics.
The FERET database, which we used for all experiments reported herein, does not
contain ground truth for any of the above categories, except the presence of eyewear.
Accordingly, we created a browser in Matlab, which allowed us to relatively quickly
categorize the images in the database around any criterion for which there were two
possible choices. This was a simple matter in the cases of gender and eyewear. For the
manual classifications based on race, we performed several two-choice categorizations
and then combined the results. For example, in one pass through the database, we
classified images as “Asian” and “Not Asian”, or as “African” and not “African”. Other
racial categories were Caucasian and Other Race. There was certainly a great deal of
subjective judgment in some of the individual decisions. Others categorizing the same
data set would most likely have made different judgments on some of the images. In
performing the task we relied on our best judgment. In any event, we do not present the
results of any of the experiments that used this data.

[9] For a number of subjects there are images taken while the subject was wearing
glasses and other images taken without. That is why the number of subjects does not add
to 1,011. The percentages in the fourth column indicate the percentage of subjects who
have at least one picture taken while wearing glasses.

[10] The "Other Race" category includes, among others, individuals identified as "South
Asian". The "Asian" category includes people who appeared to have characteristically
East Asian features.
4.2 Experiments and Results
Table 15 shows a summary of the combinations of representations and similarity
measures used in the following experiments. We will address the following questions:
1. Of the several new floating point representations presented herein, are there
any that offer clear performance improvements?
2. How does the choice of criterion for the selection of active bits in DFM b-strings
– by landmark or by model graph – affect recognition performance?
3. Which combinations of data representation and similarity function perform
the best on recognition tests?
4. How does eliding the coefficients of one or more landmarks affect recognition
performance of DFM b-strings when, in the case of strings composed “by
model graph”, the resulting b-strings may have differing numbers of active
bits?
5. Which of the relevance criteria discussed on pages 31 to 42 -- ANOVA,
Binary Channel Capacity, and “Leave One Out” – when used to choose the
most relevant (upper fifty percentile) landmarks for recognition yield a set of
landmarks that offer the best recognition performance?
Abbreviation    Similarity Measure    Data Format    Data Basis
1   Voting      ---                   ---            Plurality vote of all similarity criteria
2   DFM BM      HD                    B              GA-μ
3   DFM BZ      HD                    B              Z(GA)
4   DFM ZGT     HD                    B              Z(GA)>0.5
5   MI BM       MI                    B              GA-μ
6   MI BZ       MI                    B              Z(GA)
7   MI ZGT      MI                    B              Z(GA)>0.5
8   Cos AM      DP                    FP             GA
9   Cos Z       DP                    FP             Z(GA)
10  Cos Zn      DP                    FP             Z(GA), unit vectors
11  Cos Zna     DP                    FP             |Z(GA)|, unit vectors
12  MI AM       MI                    FP             GA
13  MI Z        MI                    FP             Z(GA)

Table 15 Data Representations and Similarity Criteria. A summary of the various data
representations, the similarity criteria employed to evaluate each representation, and
the data from which they were derived. At the highest level, all representations derive
from the complex Gabor coefficients; we use only the amplitudes, which we have denoted
GA above. Similarity measures – HD: fractional Hamming distance; MI: mutual information
between vectors; DP: dot product between vectors. In the last column, the codes have
these meanings – GA-μ: mean-subtracted Gabor amplitudes; Z(GA): z-transformed Gabor
amplitudes; Z(GA)>0.5: z-transformed Gabor amplitudes in which coefficients with
z-scores greater than 0.5 are set to "active"; Z(GA), unit vectors: same as Z(GA) with
jets of z-scores normalized to unit length; |Z(GA)|, unit vectors: same as Z(GA), unit
vectors, but where the absolute values of the z-scores were taken.
Before we address these questions, we would like to present a result that illustrates the
essence of this research. Figure 30 shows the results of recognition experiments on a
single data set of seven hundred image pairs. For each natural number, m, of active bits
in the complete b-string, we compose a b-string for each image in the 700 pairs. These
images comprise two sets of images of equal size. Each image in the first set has a pair
image in the second. We will call these sets C and R – R is the "recollection" set, the
testing set; logically then, C is the "collection" set, the "training" set, the
database. We will use the C set to "train" the system, the R set to test it. As usual, m
is the parameter denoting the number of active bits – an "active" bit is a bit taking
the value "1" – in the b-string. The algorithm is simple: for each image in the R set,
find the image in the C set with the greatest similarity to it. Next, tabulate the
result for this value of m: the percentage of R images that were matched to the correct
image in C. The complete algorithm:
1. For each value of m:
2. For each image, r, in the R set:
3. For each image, c, in the C set:
4. Compute the similarity between the b-strings of r and c;
5. If this is the greatest similarity seen so far for r, record c as the best match;
6. End For
7. Record the subject identifier of the best match for r;
8. End For; tabulate the recognition percentage for this value of m.
9. End For
The tabulation function simply takes as input the subject identifiers estimated by the
algorithm, compares the estimates to the ground-truth identifiers, and returns a
recognition percentage based on these comparisons.
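The loop above amounts to nearest-neighbor classification of R against C. A minimal sketch for a single value of m, assuming each set is a list of (subject id, b-string) pairs and a similarity function such as the sum over AND (all names hypothetical):

```python
def soa(a, b):
    """Sum over AND of two binary patterns."""
    return sum(x & y for x, y in zip(a, b))

def recognition_rate(C, R, similarity=soa):
    """Fraction of probes in R whose most similar entry in C shares its subject id."""
    correct = 0
    for r_id, r_bits in R:
        best_id, _ = max(C, key=lambda entry: similarity(entry[1], r_bits))
        correct += (best_id == r_id)
    return correct / len(R)
```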
In each case, active bits were selected "by model graph." For example, for m = 960, we
set the bit corresponding to a coefficient to 1 if its amplitude, z-score, etc., was
among the top half of the sorted values; similarly, for m = 1, only the bit
corresponding to the greatest amplitude, z-score, raw deviation, etc., was set to 1; all
other bits were set to 0.
One can see in this figure several prominent features:
1. For a large range of values of m, the number of active bits in the b-string, from
approximately 150 to 1,600 active bits, recognition performance of DFM
surpasses EBGM. This attests to the robustness of the b-string representation.
2. For as small a value as m = 152, recognition performance equals that of EBGM.
3. At its peak value, when the number of active bits is approximately half of the total
number of Gabor coefficients, the error rate of EBGM is nearly halved. In terms
of performance, RPI (Recognition Performance Improvement), a measure we
define in a subsequent section, is 43.88%. In brief, this means that DFM narrowed
the gap between EBGM and perfect recognition performance by nearly half. This
attests to the suitability of DFM as a replacement for the data representation used
by EBGM in recognition experiments.
4. When considering which deviations are most significant, we start with deviations
rightmost in the distribution of deviations. This means that deviations falling in
the left tail of the distribution, which correspond to coefficients having values
significantly less than the mean, are not as important as those corresponding to
coefficients with significant positive activations. Compare the two blue curves,
corresponding to deviations in the form of mean-subtracted Gabor amplitudes and
of Gabor amplitude z-scores, to the curves for data representations derived from
the absolute values of the same deviations, the light green and magenta curves.
Clearly, in addition to deviating significantly from the mean, underlying
coefficients must not be minimally active (close to zero). In other words, the
deviation, to be significant, must also be positive or among the top deviations
starting with the greatest positive deviation.
5. In Figure 31 we see further that a strategy of activating bits whose z-score is
greater than some value, z, is not a winning strategy. Only at z = 0.5 does
recognition performance equal that of EBGM. Interestingly, as we will see in the
following sections, a similarity measure based on the information theoretic
concept of mutual information can be used to obtain performance that is much
improved. We see the same phenomenon in the case of b-strings composed “by
model graph” when less than all of the landmarks are used. In these cases, the
number of active bits in individual b-strings encoded in this fashion will not be
equal.
Figure 30 Recognition experiment on a large data set (700 image pairs) using binary strings to represent faces.
The different curves correspond to a number of criteria for selecting which bits are “active” (set to “1”). For one
encoding, a binary face representation in which only 152 out of 1,920 bits are active, less than 10% of the bits, matches
the performance of Elastic Bunch Graph Matching. For a wide range of choices of the
parameter, m, which controls the number of active bits in the string, recognition
performance is superior to EBGM. The choice of m offers great flexibility, further
supporting the notion that patterns of deviation are more important than the deviations
themselves.
The minimum error of the best encoding occurs at a point (m=964) close to where the
greatest number of different binary strings is possible (m=960). At m=960, the number of
possible b-strings, C(1920, 960), is on the order of 10^576. The curves with the lowest
error rates correspond to encodings of z-scores and mean-subtracted coefficient values.
Figure 31 An alternative coding strategy based on z-scores. In this representation, bits are
activated if the corresponding coefficients have z-scores greater than some value, z. Experiments
show that this representation is poor; only a setting of z=0.5 yields recognition performance that
equals EBGM.
4.2.1 Design of Experiments
For all of the experiments described below we randomly selected pairs of images, as
required for the particulars of the experiment, from the fa and fb sets of the FERET 2001
face recognition database, to which we shall refer herein as the Gallery. The fa and fb sets
depict frontal orientations. Sources of intra-subject variation include: manner of dress,
facial expression, eyewear, hair style, background, contrast, and illumination. Image pairs
were chosen without regard to designation of images as fa or fb; all such images were
commingled. There was a small subset of the 3,280 images in our Gallery, fewer than one
hundred images, with which EBGM had difficulty locating the face, extracting a model
graph from a seemingly random location in the image. We developed an algorithm for the
automatic detection of these graphs, which can be found in Appendix A. In the end,
however, we left these “bad” graphs in the Gallery, allowing the uncorrelated data to
average out. In most of the experiments below we used a randomly selected set of 700
image pairs. In experiments that used sets of lower cardinality, the results represent
the average of at least one hundred individual experiments using different randomly
chosen sets of the same cardinality. Based on the results of unreported experiments we
feel confident that even if all fiducial points were placed by hand for all images in the
Gallery, there would be little appreciable difference in the results.
4.2.2 Performance Comparison – Floating Point Representations
The EBGM model graph represents features as vectors of complex coefficients called
Gabor jets, often called, simply, jets. Each jet encapsulates information about features
around a point in the image. A coefficient, or feature response, will have significant
amplitude if the image near the sample point contains a feature with an orientation and
spatial scale coinciding with the corresponding Gabor kernel’s preferred orientation and
scale. We have argued that it is not the feature values per se that are important for
recognition, but the pattern of deviations from the mean that are most salient. We will
show that this is indeed true. More generally, however, we show in this experiment that a
floating point representation in which coefficients represent distances from the mean
coefficient values represents a significant improvement over the unit-length Gabor jets
employed by EBGM.
Name     Format          Representation
AM       Floating Point  Gabor Amplitudes (unit length)
MSGM     Floating Point  Mean-Subtracted Gabor Magnitudes
MSGMn    Floating Point  Mean-Subtracted Gabor Magnitudes (unit length)
ZGM      Floating Point  Z-Scores of Gabor Amplitudes
ZGMn     Floating Point  Z-Scores of Gabor Amplitudes (unit length)
Table 16 Summary of floating point data representations
In Table 16 we reproduce part of Table 2, which summarizes the floating point
representations used in this experiment. All but the first, the unaltered Gabor magnitudes,
are novel representations. For this experiment we performed recognition experiments on
sets of image pairs of various cardinalities. For each cardinality we created 300 randomly
selected sets of image pairs. For each such testing set, we performed successive
recognition experiments using the five floating point representations. For each
cardinality, for each representation, we calculated the arithmetic mean recognition rate.
The results are presented in Figure 32. We also calculated and compared the standard
deviation of the 300 trial results for each cardinality and representation. These results are
presented in Figure 33.
We draw the following conclusions about floating point representations from the
foregoing results:
1. Merely subtracting the mean from the coefficient magnitudes and comparing
vectors of raw deviations (MSGM) increases errors in recognition;
2. Normalizing the jets of MSGM to unit length (MSGMn) eliminates the errors
introduced by MSGM , but produces only a very modest improvement over
GM;
3. Incorporating the variance present in the distribution of coefficient values
(ZGM) represents a marked improvement over GM;
4. Performance can be further improved by normalizing the jets of ZGM to unit
length (ZGMn);
5. In addition to its performance advantages, ZGMn produces more
consistent results than the other representations, as shown in Figure 33.
The figure shows that in a series of recognition tests, particularly on testing
sets of smaller cardinalities, ZGMn produced recognition rates that were less
varied than those of other representations.
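The five encodings compared here may be sketched as follows. This is a minimal sketch reflecting our reading of the definitions above; the exact derivations may differ in detail.

```python
import math

def _unit(v):
    # normalize a jet to unit Euclidean length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else v

def representations(jet, means, stds):
    """Derive the five floating point encodings of one Gabor jet from
    the per-coefficient means and standard deviations of the norm."""
    msgm = [a - mu for a, mu in zip(jet, means)]   # raw deviations
    zgm = [d / sd for d, sd in zip(msgm, stds)]    # z-scores
    return {
        "AM": _unit(jet),      # unit-length Gabor amplitudes, as in EBGM
        "MSGM": msgm,          # mean-subtracted magnitudes
        "MSGMn": _unit(msgm),  # mean-subtracted, normalized per jet
        "ZGM": zgm,            # variance-normalized deviations
        "ZGMn": _unit(zgm),    # z-scores, normalized per jet
    }
```

The per-jet normalization in MSGMn and ZGMn is what ensures that the vector of deviations from one landmark carries the same weight as those from other landmarks.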
Figure 32 Mean Error rates for five floating point data representations at
cardinalities ranging from 100 to 900. Shorter bars indicate lower recognition error rates
and better recognition performance.
These results demonstrate a clear advantage to representations that retain information
derived from deviations from the mean. The best representations yield coefficients that:
1. Reflect the distance from the mean,
2. Normalize this distance by accounting for the variance of the coefficient values, and
3. Further normalize the deviations so that the vector of deviations from one landmark
is the same length as those from other landmarks.
Figure 33 Standard deviations of error rates for five floating point data
representations at cardinalities ranging from 100 to 900. Shorter bars indicate lower
variance in the error rates of individual recognition tests.
As we will see, these qualities are also applicable to the DFM binary representations,
with one significant difference: DFM appears to be more robust to the variations in the
distributions of Gabor coefficients. In other words, as we will see below, the performance
difference between MSGM and ZGM floating point representations is much less than that
of the corresponding binary representations. This further supports the hypothesis that
faces can be represented in toto by a pattern of significant deviations from the mean.
Only the residue of the calculation, the identities of the units whose “firing rates” are
significantly above the mean, need be retained. The identities of the most active units are
memorialized in sixty words on a 32-bit system, or thirty words on a 64-bit system, two
hundred forty bytes in total. This modest representation greatly surpasses the
performance of EBGM and is equal to the performance of the best floating point
representations, which themselves outpace EBGM.
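The storage and comparison of such b-strings may be illustrated as follows. This sketch packs the bits into a single arbitrary-precision integer for exposition; a production implementation would use an array of sixty 32-bit machine words and a hardware population count.

```python
def pack_bits(bits):
    """Pack a b-string (list of 0/1) into one integer, bit i at position i."""
    word = 0
    for i, b in enumerate(bits):
        word |= b << i
    return word

def hamming(a, b):
    """XOR leaves a 1 exactly where the two packed strings disagree;
    counting those ones gives the Hamming distance."""
    return bin(a ^ b).count("1")
```

At 1,920 bits per face, the packed representation occupies 1920 / 8 = 240 bytes, in agreement with the figure quoted above.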
4.2.3 b-string Construction – By Landmark or By Model Graph
The next four figures compare, in terms of recognition performance, the difference
between choosing active bits by considering the top deviations at each landmark against
the alternative, considering the top deviations over the complete model graph. In brief,
we conclude the following:
1. The Hamming distance similarity measure is not a suitable choice when less
than all of the coefficients are being used and the construction method is “by
model graph”. This is so because the redacted strings will not all have the
same number of active bits. The Mutual Information measure is still a viable
option. Further, the joint distributions required to use mutual information as a
criterion can be estimated with the same operations used to calculate Hamming
distance.
2. When using the most relevant top half of the landmarks, only the voting
method, when half of the bits per landmark are active, can beat the best
floating point representation, Zn. No binary representation can beat Zn when
selecting bits by model graph.
3. Using all of the landmarks, choosing active bits by model graph is a slightly
better choice than the alternative, but the advantage is marginal.
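The mutual information measure may be sketched as follows. This is our reading of the claim above, treating each bit position as one sample of a pair of binary variables; the dissertation's exact estimator may differ.

```python
import math

def mutual_information(a, b):
    """Estimate MI between two equal-length b-strings from the four
    joint counts over bit positions.  The counts come from the same
    bitwise comparisons that yield Hamming distance (n10 + n01 is
    exactly the Hamming distance)."""
    n = len(a)
    na, nb = sum(a), sum(b)
    n11 = sum(x & y for x, y in zip(a, b))
    n10, n01 = na - n11, nb - n11
    n00 = n - n11 - n10 - n01
    mi = 0.0
    for nxy, nx, ny in ((n11, na, nb), (n10, na, n - nb),
                        (n01, n - na, nb), (n00, n - na, n - nb)):
        if nxy > 0:
            # p(x,y) * log2( p(x,y) / (p(x) p(y)) )
            mi += (nxy / n) * math.log2(nxy * n / (nx * ny))
    return mi
```

Because the estimate is built from marginal as well as joint counts, it remains meaningful when the two strings have different numbers of active bits, which is exactly the situation that defeats Hamming distance.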
Figure 34 Top: Recognition Performance of DFM with 48 landmarks. Active Bits selected “By Landmark”.
Landmarks removed were determined to have the least channel capacity, where each of 1,920 binary features was
modeled as a binary symmetric channel. Shorter bars are better, indicating higher recognition rates and lower
recognition errors. Horizontal dotted lines depict the baseline performance of floating point representations. They are
(from top): 1) EBGM results, 2) EBGM with landmarks removed, 3) Normalized z-scores, all landmarks; and 4)
Normalized z-scores, landmarks removed. Note that the blue and red lines in this figure represent the same values as
the blue and green lines in Figure 34. With only the most relevant landmarks retained, only the voting method can
match representation Zn. All of the binary representations best the performance of Zn using all landmarks, for some
parameter settings.
Bottom: Removing landmarks leaves b-strings with unequal numbers of active bits. Here, active bits are selected
“by Model Graph”. With less than all landmarks, strings are likely to have different numbers of active bits. This makes
Hamming distance an unreliable indicator. However, the mutual information similarity function, which measures the
Kullback-Leibler distance between the joint probability distribution P(x, y) and the product distribution P(x)P(y),
accounts for the disparity in active neurons between two strings.
4.2.4 Performance Comparison – All Representations
Figure 35 Top: Recognition Performance of DFM with 24 of 48 landmarks removed. Active Bits selected “By
Landmark”. Landmarks removed were determined to have the least channel capacity, where each of 1,920 binary
features was modeled as a binary symmetric channel. Shorter bars are better, indicating higher recognition rates / lower
recognition errors. Horizontal dotted lines depict the baseline performance of floating point representations. They are
(from top): 1) EBGM results, 2) EBGM with landmarks removed, 3) Normalized z-scores, all landmarks; and 4)
Normalized z-scores, landmarks removed. Note that the blue and red lines in this figure represent the same values as
the blue and green lines in Figure 34.
Bottom: Removing landmarks leaves b-strings with unequal numbers of active bits. Here, active bits are selected
“by Model Graph”. With less than all landmarks, strings are likely to have different numbers of active bits. This makes
Hamming distance an unreliable indicator. However, the mutual information similarity function, which measures the
Kullback-Leibler distance between the joint probability distribution P(x, y) and the product distribution P(x)P(y),
accounts for the disparity in active neurons between two strings.
In these experiments we address questions 2, 3 and 4, posed on page 89. In Table 15 on
page 90 we enumerated and described the combinations of data representation / similarity
functions we will consider here. For simplicity we will refer to the combination of a data
representation and a similarity function as a representation. If we need to distinguish
between the combination and the data alone, we will state this explicitly. The floating
point representations and the binary representations DFM ZGT and MI ZGT do not
depend on the number of active bits per landmark or per model graph. In some of the
figures below, we have omitted these values to avoid redundancy and to make some
points clearer.
The figures below reflect the results of recognition experiments performed using a single
data set, selected at random from the full 3,280 images in the Gallery.
For each of seven hundred (700) randomly chosen subjects, selected from a pool of 1,014
potential subjects, a pair of images was selected at random from the full set of the
subject’s images. The number of images per subject in the database ranged from two to
twenty-five, the average being 3.25. Three hundred eighty-five subjects had more than
two images; 624 subjects had exactly two images.
To reiterate, the training and the testing set comprised seven hundred images each. The
recognition procedure was the same for each representation. First, the training set was
marshaled from the appropriate underlying data. Next, for each testing set image, the
similarity function was applied to it and each training image in turn. For similarity
functions based on the normalized dot product, or cosine measure, and on the mutual
information measure, a maximum of the similarity values was found. For the Hamming
distance measure, a minimum was sought. The subject corresponding to this maximum or
minimum was deemed to be the best match. After all testing images were processed, the
total number of correct matches was divided by the cardinality of the set to obtain a
recognition percentage.
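The recognition procedure just described may be sketched as follows. This is a schematic nearest-neighbor loop, not our experimental harness; the data layout is assumed for exposition.

```python
def recognize(test_set, train_set, similarity, larger_is_better=True):
    """For each probe image, scan the whole training set and keep the
    best-scoring subject; return the fraction matched correctly.
    Cosine and mutual information seek a maximum; Hamming distance
    seeks a minimum (larger_is_better=False)."""
    correct = 0
    for subject, probe in test_set:
        scores = [(similarity(probe, gallery), who)
                  for who, gallery in train_set]
        best = max(scores)[1] if larger_is_better else min(scores)[1]
        if best == subject:
            correct += 1
    return correct / len(test_set)
```

The same loop serves every representation; only the similarity function and the sense of the optimum change.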
We also wanted to gauge the performance of the representations against a common frame
of reference. As we are claiming to have improved upon the EBGM representation, we
used the EBGM recognition percentage as a baseline. The improvement in recognition
performance of a representation over EBGM was calculated by measuring the reduction
in the rate of recognition errors. Using this measure, we were able to compare the
performance of any representation on any test. Performance improvement was evaluated
with the following formula:

RPI(D, S) = (R(D, S) − R_EBGM) / (1 − R_EBGM)
This formula says that for a given data representation, D, and a similarity measure, S, the
recognition performance improvement is the ratio of the difference between the new
representation and EBGM to the difference between EBGM and perfect recognition.
Intuitively, RPI measures how much of the distance from EBGM toward the goal of
perfect recognition the new representation travelled. If the new representation achieved
perfect recognition, RPI would equal one; if it matched EBGM, RPI would equal zero.
For a more concrete example, in a test using all available landmarks, the data
representation comprising z-scores, where jets of z-scores were normalized to unit length,
attained a recognition rate of 90.43%. On the same data set, using the same similarity
measure (Cosine) EBGM achieved 86% recognition. Accordingly, the RPI on normalized
z-scores, using the cosine similarity measure, was:

RPI(Zn, Cos) = (0.9043 − 0.8600) / (1 − 0.8600) ≈ 0.316
On a test using the same data set, with the least relevant 24 landmarks removed from the
similarity calculation, as measured by the Binary Channel Capacity relevance criterion,
we have:

RPI(Zn, Cos) = (0.9329 − 0.8871) / (1 − 0.8871) ≈ 0.405
If one were to compare the recognition percentage gains of the two experiments, it would
appear that the Zn representation did not benefit from the removal of the least relevant
landmarks. Using the RPI calculation, however, it is clear that Zn benefits significantly
from the removal. This is not only useful to judge the performance of Zn, but also to
measure the advantages of removing landmarks that have a deleterious effect on
recognition. Figure 36 and Figure 37 show the recognition performance of all
representations on the set of 700 image pairs, using all of the landmarks. The former
shows the results for the case in which binary representations were constructed using the
“by landmark” criterion. Similarly, in the latter, the active bits were chosen “by model
graph”. These figures show both the effect on recognition performance of changing the
number of active bits per landmark for the binary representations and the effect of the
choice of construction method. The floating point representations do not depend on the
choice of either parameter and are therefore the same for all values of k and m.
A cursory glance at the figures reveals a number of points of interest:
1. Column 8 corresponds to the Cosine similarity measure applied to Gabor
magnitudes, as in EBGM. For values of k from 10 to 25, in the case of the
MSGM representations, and from 5 to 30, in the case of the ZGM
representations, the binary representations outperform EBGM.
2. Column 10 shows the recognition rate of representation Zn on the recognition
testing set. For the “by model graph” method (bottom of figure) all of the
binary representations outperform Zn, the best floating point representation,
over a wide range of the parameter, m. For the “by landmark” method, all but
a narrow mid-range of settings of k fall short of Zn.
3. The voting method of determining the best match tabulates the output of eight
to twelve algorithms. The subject matched with the greatest frequency by the
individual representations is chosen to be the best match. The voting method
outperforms all of the individual representations. This holds true, but to an
even greater extent, for binary strings constructed by model graph.
4. The voting method is surprisingly robust over the whole range of the
parameter, k. This is also true, to an even greater extent, over the range of the
parameter, m, when choosing bits by model graph.
Recognition Methods
1 Voting     2 DFM BM    3 DFM BZ    4 DFM ZGT    5 MI BM     6 MI BZ
7 MI ZGT     8 Cos GM    9 Cos Z     10 Cos Zn    11 Cos Zna  12 MI AM
13 MI Z
Figure 36 A comparison of recognition performance of floating point and binary representations. The binary
representations, except as noted below, are constructed “by landmark” using all forty-eight landmarks. The floating
point representations (cols. 8-13) and the binary representations ZGT (cols. 4 and 7) are indicated in the table above
with lighter shading; the other binary representations are indicated with darker shading in the table directly above
this caption. The lightly shaded representations are not dependent upon any choice of k and/or m. The blue dotted
line indicates the performance of EBGM; the green dotted line denotes the recognition performance of floating
point representation Zn. Tables summarizing these results will follow.
5. Active bits selected by considering deviations over the model graph yield
slightly better performance than those selected by landmark, given that the
number of active bits is the same under both representations.
Recognition Methods
1 Voting     2 DFM BM    3 DFM BZ    4 DFM ZGT    5 MI BM     6 MI BZ
7 MI ZGT     8 Cos GM    9 Cos Z     10 Cos Zn    11 Cos Zna  12 MI AM
13 MI Z
Figure 37 A comparison of recognition performance of floating point and binary representations. The binary
representations, except as noted below, are constructed “by model graph” using all forty-eight landmarks. The
floating point representations (cols. 8-13) and the binary representations ZGT (cols. 4 and 7), indicated in the table
below the figure with lighter shading, are not dependent on the criteria that are used to determine the pattern of
active bits in a DFM string, whether the DFM strings are constructed by landmark or by model graph. The binary
representations are indicated with darker shading. As in Figure 36, the blue dotted line indicates the performance
of EBGM; the green dotted line denotes the recognition performance of floating point representation Zn. Tables
summarizing these results will follow.
6. BZ (binary z-scores) offers a slight performance edge over BM (binarized
mean-subtracted Gabor magnitudes). This difference is much smaller than that
between the floating point representations upon which the binary strings are based.
7. The mutual information similarity measure offers similar performance to the
Hamming distance measure. This becomes more important when some
landmarks are left out, which will be discussed below.
8. Considering the absolute value of the deviation when selecting active bits is
not a winning strategy. A low coefficient is a low coefficient even if its
deviation is significant.
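The voting method described in point 3 can be sketched as follows. This is a minimal illustration; the constituent algorithms and tie-breaking rule of our experiments are not reproduced here.

```python
from collections import Counter

def vote(ballots):
    """Each constituent algorithm names its best-match subject; the
    subject named most often wins.  Ties fall to the subject first
    encountered, an arbitrary choice made for this sketch."""
    return Counter(ballots).most_common(1)[0][0]
```

Because the constituents err on different probes, the majority winner is more stable than any single representation over the whole range of k or m.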
Recognition Methods
1 Voting     2 DFM BM    3 DFM BZ    4 DFM ZGT    5 MI BM     6 MI BZ
7 MI ZGT     8 Cos GM    9 Cos Z     10 Cos Zn    11 Cos Zna  12 MI AM
13 MI Z
Figure 38 A comparison of recognition performance of floating point and binary representations using the
most relevant landmarks. The binary representations, except as noted below, are constructed “by landmark”
using the twenty-four landmarks with the highest binary channel capacity. The floating point representations (cols.
8-13) and the binary representations ZGT (cols. 4 and 7), indicated in the table directly above this caption with
lighter shading, are not dependent upon any choice of k and/or m. The binary representations are indicated with
darker shading. The four horizontal dotted lines represent the recognition performance of (from top to bottom):
EBGM using all 48 landmarks, EBGM using the best 24 landmarks as gauged by the binary channel capacity
relevance measure, Normalized z-scores using all landmarks, and normalized z-scores using the best 24 landmarks
as above. The blue and red lines in this figure represent the same values as the blue and green lines in Figure 36
and Figure 37. Tables summarizing these results will follow.
Recognition Methods
1 Voting     2 DFM BM    3 DFM BZ    4 DFM ZGT    5 MI BM     6 MI BZ
7 MI ZGT     8 Cos GM    9 Cos Z     10 Cos Zn    11 Cos Zna  12 MI AM
13 MI Z
Figure 39 A comparison of recognition performance of floating point and binary representations. The binary
representations, except as noted below, are constructed “by model graph” using the twenty-four most relevant
landmarks as determined by the binary channel capacity relevance criterion. The floating point representations
(cols. 8-13) and the binary representations ZGT (cols. 4 and 7), indicated in the table below the figure with lighter
shading, are not dependent upon any choice of k and/or m. The binary representations are indicated with darker
shading. The four horizontal dotted lines represent the recognition performance of (from top to bottom): EBGM
using all 48 landmarks, EBGM using the best 24 landmarks as gauged by the binary channel capacity relevance
measure, Normalized z-scores using all landmarks, and normalized z-scores using the best 24 landmarks as above.
The blue (top) and red (third from top) lines in this figure represent the same values as the blue (top) and green
(bottom) lines in Figure 36 and Figure 37. Tables summarizing these results will follow.
4.2.5 Comparison of Relevance Criteria
Abscissa Labels
1 Voting
2 DFM BM
3 DFM BZ
4 DFM ZGT
5 MI BM
6 MI BZ
7 MI ZGT
8 Cos AM
9 Cos Z
10 Cos Zn
11 Cos Zna
12 MI Z
Figure 40 Comparison of Relevance Criteria for DFM b-strings constructed by Model Graph (number of active
bits = 960). The number labels in the abscissa denote the combination of representation and similarity function in the
corresponding line of Table 15. Column 12 in the figure corresponds to representation 13 in the table. The error rate for
mutual information on Gabor Magnitudes is greater than fifty percent. The values in the ordinate are recognition error
rates, represented in percent. The recognition error rate equals one minus the correct recognition rate. Thus, lower
values are better. Non-DFM representations use their floating point coefficients. Note the difference in performance
between DFM similarity comparisons using Hamming distance (groups 2-4) and mutual information (groups 5-7).
In brief:
1. Binary strings derived from a pattern of significant deviations from a shared
mean value are information preserving.
2. They can be used for recognition, without reference to any original data.
3. Bits corresponding to jet coefficients, indeed the coefficients themselves if
working with floating point data, can be removed from the similarity
calculation, halving the size of the representation, and improving recognition
performance.
Figure 41 Comparison of Relevance Criteria for DFM b-strings constructed by Landmark (number of
active bits per landmark = 20; over the whole graph = 960 bits). The number labels in the
abscissa again denote the combination of representation and similarity function in the corresponding line of
Table 15. As in Figure 40, Column 12 in this figure corresponds to representation 13 in the table. The error
rate for mutual information on Gabor Magnitudes is greater than fifty percent. The values in the ordinate are
recognition error rates, represented in percent. The recognition error rate equals one minus the correct
recognition rate. Thus, lower values are better. Non-DFM representations retain their floating point
coefficients.
Recognition Performance of Independent Representations (on n=700 image pairs)

Similarity Metric /     ---- All Landmarks ----    Top 24 Landmarks by Channel Capacity
Data Representation     Rate         RPI           Rate         RPI
Cos Z                   90.14%       29.60%        90.70%       17.70%
Cos Zn                  90.43%       31.60%        93.30%       40.50%
Cos Zna                 85.00%       -7.10%        85.00%       -32.90%
ZGT                     86.57%       4.10%         84.70%       -35.00%
MI AM                   40.14%       -328.00%      31.40%       -508.00%
MI Zn                   89.71%       26.50%        92.30%       31.70%
MI ZGT                  89.71%       26.50%        90.90%       19.00%
Cos AM                  86.00%       ---           88.70%       ---
Table 17 Recognition rate percentages of measures that do not depend on the choice of k (number of active bits
per landmark) or m (the number of active bits over the model graph.) These figures establish baseline measures
against which the other criteria should be judged. The last row in this table establishes the baseline of baselines. It is the
standard cosine measure used in Elastic Bunch Graph Matching. Note that five of seven of the new criteria beat the
current standard similarity measurement in terms of recognition performance. RPI = Recognition Performance
Increase.
4.2.6 Recognition Experiments with All 48 Landmarks
Active feats   Voting     DFM BM     DFM BZ     MI BZ      MI BM      Cos Zn     Cos AM
3 0.90286 0.81429 0.85143 0.85143 0.81429 0.90429 0.86000
4 0.90143 0.83571 0.87000 0.87000 0.83571 0.90429 0.86000
10 0.90857 0.89000 0.90000 0.89857 0.89000 0.90429 0.86000
15 0.90714 0.90286 0.90714 0.90429 0.90143 0.90429 0.86000
19 0.91286 0.90571 0.91571 0.91286 0.90571 0.90429 0.86000
20 0.91286 0.90286 0.91143 0.91143 0.90143 0.90429 0.86000
25 0.90714 0.89000 0.90286 0.90286 0.89000 0.90429 0.86000
30 0.90429 0.85429 0.88143 0.88000 0.85429 0.90429 0.86000
Table 18 Recognition Performance Summary of DFM by Landmark. The first column shows the number of active
features per landmark, i.e., the number of bits among the forty for each landmark that will have a value of “1”. The
various similarity criteria are presented in Table 15.
Active Feats   Voting     DFM BM     DFM BZ     MI BZ      MI BM      Cos Zn     Cos AM
152 0.89714 0.82143 0.86143 0.86000 0.82143 0.90429 0.86000
240 0.90143 0.85286 0.88000 0.88000 0.85286 0.90429 0.86000
480 0.90571 0.89571 0.89857 0.89571 0.89571 0.90429 0.86000
720 0.90857 0.90857 0.90571 0.90571 0.90857 0.90429 0.86000
864 0.90857 0.91286 0.90714 0.90714 0.91286 0.90429 0.86000
912 0.91143 0.91000 0.91714 0.91286 0.90857 0.90429 0.86000
960 0.91286 0.91000 0.91714 0.91571 0.90857 0.90429 0.86000
1008 0.91857 0.90857 0.91857 0.91429 0.90857 0.90429 0.86000
1056 0.91857 0.90571 0.91857 0.91714 0.90571 0.90429 0.86000
1200 0.90857 0.90000 0.90143 0.90000 0.90000 0.90429 0.86000
1440 0.90714 0.86286 0.88429 0.88286 0.86286 0.90429 0.86000
1680 0.90571 0.81571 0.82286 0.82286 0.81571 0.90429 0.86000
Table 19 Recognition Performance Summary of DFM by Model graph. As above the first column shows the
number of bits in the binary-encoded string that take a value of “1”. The active bits are selected from all 1,920 features
of the model graph without reference to the landmark with which they are associated.
Active feats   Voting     DFM BM     DFM BZ     MI BZ      MI BM      Cos Zn     Cos AM
3 30.61% -32.65% -6.12% -6.12% -32.65% 31.63% 0.00%
4 29.59% -17.35% 7.14% 7.14% -17.35% 31.63% 0.00%
10 34.69% 21.43% 28.57% 27.55% 21.43% 31.63% 0.00%
15 33.67% 30.61% 33.67% 31.63% 29.59% 31.63% 0.00%
19 37.76% 32.65% 39.80% 37.76% 32.65% 31.63% 0.00%
20 37.76% 30.61% 36.73% 36.73% 29.59% 31.63% 0.00%
25 33.67% 21.43% 30.61% 30.61% 21.43% 31.63% 0.00%
30 31.63% -4.08% 15.31% 14.29% -4.08% 31.63% 0.00%
Table 20 Same information as in Table 18, above, recast as a reduction in error rate. See caption of next table for
more precise formulation.
Active Feats   Voting     DFM BM     DFM BZ     MI BZ      MI BM      Cos Zn     Cos AM
152 26.53% -27.55% 1.02% 0.00% -27.55% 31.63% 0.00%
240 29.59% -5.10% 14.29% 14.29% -5.10% 31.63% 0.00%
480 32.65% 25.51% 27.55% 25.51% 25.51% 31.63% 0.00%
720 34.69% 34.69% 32.65% 32.65% 34.69% 31.63% 0.00%
864 34.69% 37.76% 33.67% 33.67% 37.76% 31.63% 0.00%
912 36.73% 35.71% 40.82% 37.76% 34.69% 31.63% 0.00%
960 37.76% 35.71% 40.82% 39.80% 34.69% 31.63% 0.00%
1008 41.84% 34.69% 41.84% 38.78% 34.69% 31.63% 0.00%
1056 41.84% 32.65% 41.84% 40.82% 32.65% 31.63% 0.00%
1200 34.69% 28.57% 29.59% 28.57% 28.57% 31.63% 0.00%
1440 33.67% 2.04% 17.35% 16.33% 2.04% 31.63% 0.00%
1680 32.65% -31.63% -26.53% -26.53% -31.63% 31.63% 0.00%
Table 21 The same table as Table 19 recast in terms of percent increase in recognition performance. More
precisely, this is a measure of reduction in recognition error rate: RPI = (R − R_EBGM) / (1 − R_EBGM), where
R_EBGM, the EBGM recognition rate (here 86.00%), is a constant scalar.
4.2.7 Recognition Experiments with 24 Highest BCC Landmarks
Active feats   Voting     DFM BM     DFM BZ     MI BZ      MI BM      Cos Zn     Cos AM
3 0.91571 0.78857 0.83286 0.83286 0.78857 0.93286 0.88714
4 0.91714 0.83143 0.85286 0.85286 0.83143 0.93286 0.88714
10 0.91857 0.88714 0.89714 0.89571 0.88714 0.93286 0.88714
15 0.92714 0.90714 0.91714 0.91571 0.90571 0.93286 0.88714
19 0.93571 0.91429 0.92286 0.92143 0.91286 0.93286 0.88714
20 0.93143 0.91571 0.91714 0.91429 0.91286 0.93286 0.88714
25 0.93143 0.89714 0.90571 0.90143 0.89714 0.93286 0.88714
30 0.92429 0.86571 0.88286 0.88143 0.86571 0.93286 0.88714
Table 22 Recognition Performance Summary of DFM by Landmark. The first column shows the number of active
features per landmark, i.e., the number of bits among the forty for each landmark that will have a value of “1”. The
various similarity criteria are presented in Table 15.
Active Feats   Voting     DFM BM     DFM BZ     MI BZ      MI BM      Cos Zn     Cos AM
152 0.92000 0.79857 0.82000 0.84286 0.82000 0.93286 0.88714
240 0.91429 0.84000 0.85286 0.87286 0.85143 0.93286 0.88714
480 0.92571 0.88857 0.88571 0.90000 0.89857 0.93286 0.88714
720 0.92429 0.87857 0.86571 0.91429 0.91429 0.93286 0.88714
864 0.92429 0.85571 0.85714 0.91000 0.91286 0.93286 0.88714
912 0.92429 0.85286 0.85000 0.91429 0.91571 0.93286 0.88714
960 0.92571 0.84286 0.85000 0.91714 0.91429 0.93286 0.88714
1008 0.92429 0.83000 0.83857 0.91143 0.91000 0.93286 0.88714
1056 0.92571 0.82429 0.83143 0.91714 0.91000 0.93286 0.88714
1200 0.92857 0.77000 0.79429 0.91571 0.90143 0.93286 0.88714
1440 0.92714 0.71571 0.56714 0.87857 0.87571 0.93286 0.88714
1680 0.92429 0.60286 0.22571 0.80857 0.81714 0.93286 0.88714
Table 23 Recognition Performance Summary of DFM by Model graph. As above the first column shows the
number of bits in the binary-encoded string that take a value of “1”. The active bits are selected from all 1,920 features
of the model graph without reference to the landmark with which they are associated.
Active feats   Voting     DFM BM     DFM BZ     MI BZ      MI BM      Cos Zn     Cos AM
3 25.32% -87.34% -48.10% -48.10% -87.34% 40.51% 0.00%
4 26.58% -49.37% -30.38% -30.38% -49.37% 40.51% 0.00%
10 27.85% 0.00% 8.86% 7.59% 0.00% 40.51% 0.00%
15 35.44% 17.72% 26.58% 25.32% 16.46% 40.51% 0.00%
19 43.04% 24.05% 31.65% 30.38% 22.78% 40.51% 0.00%
20 39.24% 25.32% 26.58% 24.05% 22.78% 40.51% 0.00%
25 39.24% 8.86% 16.46% 12.66% 8.86% 40.51% 0.00%
30 32.91% -18.99% -3.80% -5.06% -18.99% 40.51% 0.00%
Table 24 Same information as in Table 22 above recast as a reduction in error rate. Alternatively the data in the
table can be interpreted as a percentage increase in recognition performance. See caption of next table for more precise
formulation.
Active Feats   Voting     DFM BM     DFM BZ     MI BZ      MI BM      Cos Zn     Cos AM
152 29.11% -78.48% -59.49% -39.24% -59.49% 40.51% 0.00%
240 24.05% -41.77% -30.38% -12.66% -31.65% 40.51% 0.00%
480 34.18% 1.27% -1.27% 11.39% 10.13% 40.51% 0.00%
720 32.91% -7.59% -18.99% 24.05% 24.05% 40.51% 0.00%
864 32.91% -27.85% -26.58% 20.25% 22.78% 40.51% 0.00%
912 32.91% -30.38% -32.91% 24.05% 25.32% 40.51% 0.00%
960 34.18% -39.24% -32.91% 26.58% 24.05% 40.51% 0.00%
1008 32.91% -50.63% -43.04% 21.52% 20.25% 40.51% 0.00%
1056 34.18% -55.70% -49.37% 26.58% 20.25% 40.51% 0.00%
1200 36.71% -103.80% -82.28% 25.32% 12.66% 40.51% 0.00%
1440 35.44% -151.90% -283.54% -7.59% -10.13% 40.51% 0.00%
1680 32.91% -251.90% -586.08% -69.62% -62.03% 40.51% 0.00%
Table 25 The same table as Table 23 recast in terms of percent increase in recognition performance. More
precisely, this is a measure of reduction in recognition error rate: RPI = (R − R_EBGM) / (1 − R_EBGM), where
R_EBGM is the EBGM recognition rate on the same 24 landmarks (here 88.71%).
Chapter 5: Discussion and Conclusion
Lest these graphs and tables at the conclusion of the previous chapter stand without
comment, we present a brief précis:
1. Recognition performance can be immediately improved by performing a z-
transform on all available data and normalizing the jets of z-scores to unit
length. As a floating point representation, it is superior. Our experience with
this system suggests that the mean is a fairly robust object: it can
tolerate a fair amount of noise and still produce acceptable results. The mean
cannot, however, be entirely random, for predictable reasons.
2. Recognition performance using the standard 48 landmarks of EBGM can be
immediately improved by excluding half of the available landmarks from the
similarity calculation. A reasonable set of landmarks to exclude can be
determined either through an examination of binary channel capacity (BCC)
or with statistical measures such as ANOVA; the two produce highly similar
results, though they rest on different assumptions.
3. Information parity can be maintained while reducing the amount of data by
97%. The binary indicators of the constellation of features that registered
significant deviations from the mean retain all of the information of the
floating-point magnitudes, at least as that information is used by EBGM.
4. The foregoing results demonstrate that it is the pattern of deviations from
the mean among a set of Gabor features, in this case magnitudes, that is
diagnostic of face identity. The Gabor-transform coefficients themselves need
not be stored at all, and the representation still retains all of the
information used by EBGM.
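The recoding summarized above can be sketched as follows. The jet length, the use of a fixed top-k criterion for "significant deviation", and the function names are simplifying assumptions for illustration, not the exact procedure of the preceding chapters:

```python
import numpy as np

def to_bstring(jets, mean, std, top_k=100):
    """Recode a concatenated jet vector as a binary string (b-string).

    `jets` is the concatenated Gabor-magnitude vector of one model graph;
    `mean` and `std` are per-coefficient statistics over the gallery.
    The "on" bits mark the `top_k` coefficients that deviate most, in
    z-score magnitude, from the gallery mean.
    """
    z = (jets - mean) / std                 # z-transform each coefficient
    idx = np.argsort(np.abs(z))[-top_k:]    # largest absolute deviations
    b = np.zeros(jets.shape, dtype=np.uint8)
    b[idx] = 1
    return b

def bstring_similarity(a, b):
    """Count of shared 'on' bits; larger means more similar."""
    return int(np.sum(a & b))
```

Once every gallery graph has been recoded this way, comparing a probe to the gallery reduces from floating-point dot products to bitwise operations, which is the source of the time and space savings claimed above.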
It remains to be investigated to what extent this describes a general phenomenon
wherever one has a pool of features, each with its own distribution. What classes
of objects lend themselves to such a binary characterization? We hope to continue
this work and discover how the principles investigated herein may be used to
advantage in portable security devices, cell phones, and digital cameras, or in
larger applications such as rapid identification of face images in search-engine
image indexing.
Bibliography
Aonishi, T. and K. Kurata (2000). "Extension of dynamic link matching by introducing
local linear maps." IEEE Transactions on Neural Networks 11(3): 817-822.
Arca, S., P. Campadelli, et al. (2006). "A face recognition system based on automatically
determined facial fiducial points." Pattern Recognition 39(3): 432-443.
Biederman, I. and P. Kalocsai (1997). "Neurocomputational bases of object and face
recognition." Philosophical Transactions of the Royal Society of London Series B-
Biological Sciences 352(1358): 1203-1219.
Bruce, V., P. Green, et al. (2003). Visual Perception: Physiology, Psychology and
Ecology, Psychology Press.
Cover, T. M. and J. A. Thomas (2006). Elements of Information Theory, 2nd Edition,
John Wiley and Sons.
Daugman, J. (1997). "Neural image processing strategies applied in real-time pattern
recognition." Real-Time Imaging 3(3): 157-171.
Daugman, J. (2001). "Iris recognition - The colored part of the eye contains delicate
patterns that vary randomly from person to person, offering a powerful means of
identification." American Scientist 89(4): 326-333.
Daugman, J. (2001). "Statistical richness of visual phase information: Update on
recognizing persons by iris patterns." International Journal of Computer Vision
45(1): 25-38.
Daugman, J. (2003). "The importance of being random: statistical principles of iris
recognition." Pattern Recognition 36(2): 279-291.
Daugman, J. (2004). "How iris recognition works." IEEE Transactions on Circuits and
Systems for Video Technology 14(1): 21-30.
Daugman, J. (2004). "Recognising persons by their iris patterns." Advances in Biometric
Person Authentication, Proceedings 3338: 5-25.
Daugman, J. and C. Downing (1998). Gabor wavelets for statistical pattern recognition.
The handbook of brain theory and neural networks, MIT Press: 414-420.
Daugman, J. G. (1988). "Complete Discrete 2-D Gabor Transforms by Neural Networks
for Image-Analysis and Compression." IEEE Transactions on Acoustics, Speech, and
Signal Processing 36(7): 1169-1179.
Daugman, J. G. (1993). "High Confidence Visual Recognition of Persons by a Test of
Statistical Independence." IEEE Transactions on Pattern Analysis and Machine
Intelligence 15(11): 1148-1161.
Fawcett, T. (2006). "An introduction to ROC analysis." Pattern Recognition Letters
27(8): 861-874.
Hallinan, P. L., Gordon, G. G., Yuille, A. L., Giblin, P., and Mumford, D. (1999). Two
and Three-Dimensional Patterns of the Face, A K Peters Ltd.
Hertz, J., A. Krogh, et al. (1991). Introduction to the Theory of neural computation.
Redwood City, CA, Addison-Wesley.
Jones, J. and L. Palmer (1987). "An evaluation of the two-dimensional Gabor filter
model of simple receptive fields in cat striate cortex." Journal of
Neurophysiology 58(6): 1233-1258.
Kalocsai, P. (2004). Separating Useful From Useless Image Variation For Face
Recognition. International Conference on Image Processing (ICIP), IEEE.
Kalocsai, P., H. Neven, et al. (1998). Statistical Analysis of Gabor-filter Representation.
Third IEEE International Conference on Automatic Face and Gesture
Recognition. Nara, Japan. Proceedings of the Third IEEE International
Conference on Automatic Face and Gesture Recognition: 360-365.
Kalocsai, P., C. von der Malsburg, et al. (2000). "Face recognition by statistical analysis
of feature detectors." Image and Vision Computing 18(4): 273-278.
Knuth, D. E. (1997). The Art of Computer Programming, Addison-Wesley.
Konen, W. K., T. Maurer, et al. (1994). "A Fast Dynamic Link Matching Algorithm for
Invariant Pattern-Recognition." Neural Networks 7(6-7): 1019-1030.
Lades, M., J. Vorbruggen, et al. (1993). "Distortion Invariant Object Recognition in the
Dynamic Link Architecture." IEEE Transactions on Computers 42(3): 300-311.
Leopold, D. A., A. J. O'Toole, et al. (2001). "Prototype-referenced shape encoding
revealed by high-level after effects." Nature Neuroscience 4(1): 89-94.
Moghaddam, B. and A. Pentland (1997). "Probabilistic visual learning for object
representation." IEEE Transactions on Pattern Analysis and Machine Intelligence
19(7): 696-710.
Phillips, P. J., H. Moon, et al. (1997). "The FERET September 1996 database and
evaluation procedure." Audio- and Video-Based Biometric Person Authentication
1206: 395-402.
Phillips, P. J., H. Moon, et al. (2000). "The FERET evaluation methodology for
face-recognition algorithms." IEEE Transactions on Pattern Analysis and Machine
Intelligence 22(10): 1090-1104.
Phillips, P. J., H. Wechsler, et al. (1998). "The FERET database and evaluation procedure
for face-recognition algorithms." Image and Vision Computing 16(5): 295-306.
Pichevar, R., J. Rouat, et al. (2006). "The oscillatory dynamic link matcher for spiking-
neuron-based pattern recognition." Neurocomputing 69(16-18): 1837-1849.
Potzsch, M., N. Kruger, et al. (1996). "Improving object recognition by transforming
Gabor filter responses." Network 7(2): 341-7.
Said, A. and W. A. Pearlman (1996). "A new, fast, and efficient image codec based
on set partitioning in hierarchical trees." IEEE Transactions on Circuits and
Systems for Video Technology 6(3): 243-249.
Shams, L. and C. von der Malsburg (2002). "The role of complex cells in object
recognition." Vision Research 42(22): 2547-2554.
Shapiro, J. M. (1993). "Embedded Image Coding Using Zerotrees of Wavelet
Coefficients." IEEE Transactions on Signal Processing 41(12): 3445-3462.
Silva, E. A. B. d., D. G. Sampson, et al. (1996). "A Successive Approximation Vector
Quantizer for Wavelet Transform Image Coding." IEEE Transactions on Image
Processing 5: 299-310.
Turk, M. (2001). "A random walk through Eigenspace." IEICE Transactions on
Information and Systems E84-D(12): 1586-1595.
Turk, M. and A. Pentland (1991). "Eigenfaces for recognition." Journal of Cognitive
Neuroscience 3(1): 71-86.
von der Malsburg, C. (1985). "Nervous Structures with Dynamical Links." Berichte Der
Bunsen-Gesellschaft-Physical Chemistry Chemical Physics 89(6): 703-710.
Wiskott, L. (1997). "Phantom faces for face analysis." Pattern Recognition 30(6):
837-846.
Wiskott, L., J. M. Fellous, et al. (1997). "Face recognition by elastic bunch graph
matching." IEEE Transactions on Pattern Analysis and Machine Intelligence
19(7): 775-779.
Wiskott, L. and C. von der Malsburg (1996). "Recognizing faces by dynamic link
matching." Neuroimage 4(3): S14-S18.
Wundrich, I. J., C. von der Malsburg, et al. (2002). "Image reconstruction from Gabor
magnitudes." Biologically Motivated Computer Vision, Proceedings 2525: 117-
126.
Yu, W. W., X. L. Teng, et al. (2006). "Face recognition fusing global and local features."
Journal of Electronic Imaging 15(1).
Zhu, J. M. and C. von der Malsburg (2002). "Synapto-synaptic interactions speed up
dynamic link matching." Neurocomputing 44: 721-728.
Appendix
This set includes a number of images whose graphs were not correctly placed during the
Elastic Bunch Graph Matching procedure. The fifty-eight images that were removed
suffered from one or more of the following defects: the image contained only a partial
face, the image had poor overall contrast, the image was not recognizable as a human
face, or the orientation of the head was not fully frontal or upright.
The process of selecting images to remove was primarily automated; selecting the images
solely by visual inspection would have left too much to individual discretion.
Accordingly, we programmatically produced two sets of images, together
constituting about ten percent of the total number of images, from which those
with particularly badly placed graphs were removed for some of the experiments.
The sets were constituted by the following procedure:
For each model graph in the full gallery, concatenate the jets into one long
vector: forty-eight landmarks, each with forty Gabor magnitudes, yield a
1,920-element row vector.
Place all such vectors into one matrix, M, such that each row corresponds to an image,
and each column represents one jet coefficient.
Calculate the matrix of correlation coefficients of M,

ρ(i, j) = C(i, j) / sqrt(C(i, i) · C(j, j)),

where C(i, j) = E[(X_i − μ_i)(X_j − μ_j)] is the covariance between random
variables X_i and X_j (the columns of M), indexed by i and j; E[·] is the
mathematical expectation, and μ_i = E[X_i] is in this case the arithmetic mean
of X_i. The diagonal of C contains the variances of the individual random
variables that constitute the random vectors.
1. In this case we are not interested in the correlations among the jet
coefficients, which would be the natural random variables, but rather in the
correlations between the images themselves. The random variables for which
correlation coefficients were calculated were therefore the image indices: a
random vector comprises the observations over all images of a single jet
coefficient, rather than all jet coefficients for a single image. Accordingly,
each entry in the correlation matrix was interpreted to indicate the amount of
overall correlation between the model graphs of two images.
2. Create a binary matrix, B_lo, with a "1" in matrix cell (i, j) if the
correlation coefficient ρ(i, j) is below 0.3, and a "0" in all other cells.
3. Create a binary matrix, B_hi, with a "1" in matrix cell (i, j) if ρ(i, j) is
at least 0.7, and a "0" in all other cells.
4. Sum each matrix over its rows. For matrix B_lo this sum indicates, for each
image, the number of other images with which it has a correlation coefficient
below 0.3. For matrix B_hi this sum indicates the number of other images with
which it has a correlation coefficient of at least 0.7.
5. Compile two sets of images: one constituting the 150 images with the highest
sum over matrix B_lo, and the other constituting the 150 images with the lowest
sum over matrix B_hi. We called the first set HiLo, denoting that it contained
the images with the highest number of "low correlates", and the second set LoHi,
denoting that it contained the images with the lowest number of "high
correlates".
6. Calculate the intersection of the two sets.
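Assuming a NumPy-style implementation (the function and variable names here are ours, not the dissertation's), the steps above might be sketched as:

```python
import numpy as np

def select_outlier_sets(M, lo=0.3, hi=0.7, n=150):
    """Sketch of the appendix procedure on matrix M (one row per image,
    one column per jet coefficient; 48 landmarks x 40 magnitudes gives
    1,920 columns).

    Returns (HiLo, LoHi, intersection) as sets of image row indices.
    """
    # Correlate images (rows), not coefficients (columns); np.corrcoef
    # treats rows as the random variables by default.
    rho = np.corrcoef(M)                        # (n_images, n_images)
    # Binary indicator matrices for low / high correlation.  The
    # diagonal (self-correlation = 1) never satisfies rho < lo.
    B_lo = (rho < lo).astype(int)
    B_hi = (rho >= hi).astype(int)
    # Per-image counts of low / high correlates; subtract 1 from the
    # high counts to exclude each image's correlation with itself.
    lo_counts = B_lo.sum(axis=1)
    hi_counts = B_hi.sum(axis=1) - 1
    # HiLo: images with the most "low correlates";
    # LoHi: images with the fewest "high correlates".
    HiLo = set(np.argsort(lo_counts)[-n:].tolist())
    LoHi = set(np.argsort(hi_counts)[:n].tolist())
    return HiLo, LoHi, HiLo & LoHi
```

Images landing in the intersection are correlated weakly with many images and strongly with few, which is the profile one would expect of a badly placed graph.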
Abstract
A successful, mature system for face recognition, Elastic Bunch Graph Matching, represents a human face as a graph in which nodes are labeled with double precision floating-point vectors called "jets". Each jet in a model graph comprises the responses at one fiducial point, or face landmark, of a convolution of the image with a set of self-similar Gabor wavelets of various orientations and spatial scales. Gabor wavelets are scientifically reasonable models for the receptive field profiles of simple cells in early visual cortex. Heretofore, the recognition process simply searched for the stored model graph with the greatest total jet-similarity to a presented image graph. The most widely used measure of jet similarity is the sum over the graph of the dot-products of jets normalized to unit length. We improve significantly upon this system, with orders of magnitude improvements in time and space complexity and marked reductions in recognition error rates. We accomplish these improvements by recasting the concatenated vector of model-graph jets as a binary string, or b-string, comprising bits with one-to-one correspondence to the floating-point coefficients in the model graph. The b-string roughly models a pattern of correlated firing among a population of idealized neurons. The "on" bits of the b-string correspond to the identities of the coefficients that deviate the greatest amount from the corresponding mean coefficient values. We show that this simple recoding consistently reduces recognition error rates by margins exceeding thirty percent. Our investigations support the hypothesis that the b-string representation for faces is extremely efficient and, ultimately, information preserving.