FIRST STEPS TOWARDS EXTRACTING OBJECT MODELS FROM NATURAL
SCENES
Copyright 2005
by
Viral H. Shah
A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(ELECTRICAL ENGINEERING)
May 2005
Viral H. Shah
Dedication

To my dearest wife, Payal
Acknowledgements
I would like to sincerely thank my advisor, Prof. Christoph von der Malsburg, because
without his inspiration this work would have been impossible. In addition, special thanks
go to Douglas Garrett and Xiangyu Tang for their guidance and advice in completing this
work. I am grateful to Shuang Wu for his quick resolution of technical issues and
Alexander Heinrichs for his help in dealing with the sponsors of this project. I would also
like to thank Larry Kite for his help with the data acquisition.
All of this work uses FLAVOR, a software library developed at the Institut für
Neuroinformatik, Ruhr-Universität Bochum, Germany, and at our Laboratory for
Computation and Biological Vision at USC; I would like to thank all of its developers.
I am grateful to my friends for their support. Lastly, I would like to thank my family for
their encouragement. I am especially indebted to my wife, who has sacrificed much for my
benefit.
Table of Contents
Dedication
Acknowledgements
List of Figures and Tables
Abstract
1 Introduction
1.1 The Problem
1.2 Objective
1.3 Approach
2 Looking at Humans
2.1 Full Body Human Recognition
2.2 Previous Work in this Field
2.3 Learning to Look at Humans
3 Overview of the System
4 Pre-processing Module
4.1 Segmenter Block
4.2 Scanner Block
5 Instantaneous Representation Module
5.1 Node Finder Block
5.1.1 Method 1
5.1.2 Method 2
5.2 Feature Extraction Block
6 Learner Sub-system
6.1 Representation Generalizer Block
6.1.1 Method 1
6.1.2 Method 2
7 Recognizer Sub-system
7.1 Similarity Calculator Block
7.1.1 Method 1
7.1.2 Method 2
8 Results and Conclusion
8.1 Results
8.2 Discussion
8.3 Future Work
List of References
List of Figures and Tables
Figure 3.1: Overview of the System
Figure 4.1: Input and Output Characteristics of the Segmenter Block
Figure 4.2: Input and Output Characteristics of the Scanner Block
Figure 4.3: Inner Workings of the Scanner Block
Figure 5.1: Input and Output Characteristics of the Node Finder Block
Figure 5.2: Inner Workings of the Node Finder Block for Method 1
Figure 5.3: Output of the Node Finder Block for Method 2
Figure 5.4: Scaling Factor
Figure 5.5: Color Region Levels
Figure 5.6: Features Extracted at Each Node
Figure 6.1: Inner Workings of the Representation Generalizer for Method 1
Figure 7.1: Inner Workings of the Similarity Calculator for Method 1
Table 8.1: Results
Abstract
This thesis presents the first steps towards a general method of
extracting features from images to fulfill the two-fold task of Object
Classification and Recognition. We consider looking at humans as a test
case and perform recognition tasks. Two methods were developed for this
purpose. The first method was provided with minimal a priori information
about the structure of humans, in the form of fixed node positions within
bounding boxes around the major body parts. This method gave rise to
several problems, and the system failed in cases of partial occlusion.
Therefore a second method was developed, which extracted information
from a vast sampling of points placed on the segmented part of the image.
This method showed much better results in cases of partial occlusion.
Chapter 1
Introduction
Interpreting visual input with a computer at a level comparable to human performance is a
challenging task. After decades of research in the field we are still nowhere near a
solution. Several systems produce remarkable results, but only under
restricted circumstances.
1.1 The Problem
One of the vexing problems is the inability of vision systems to generalize visual data
using a few exemplars. Humans, on the other hand, can readily learn an entire class of
objects using only a handful of examples of that class. The question that needs to be
answered is: How best to extract visual information from an example image of an object
in order to achieve a two-fold objective: a) to be able to condense the examples into a
generic model that can be used to reliably detect objects of the same category, and b) to
build specific models to help distinguish between objects of the same category by
recognizing nuances of each object instance. Most current systems meet one of these
objectives, such as object detection [7] [4] [12] or object recognition [8] [16], but rarely both.
Those which succeed in both of these objectives are geared towards specific object
categories, e.g. faces [18]. They achieve this by embedding information about the objects
into the learning and recognition procedures.
1.2 Objective
Our ultimate goal is to develop a system which is able to work given minimal
information about the object it is looking at. For object detection in a scene, the system
would self-organize the individual object instance models into a generalized model, by
statistical extraction of features that are common to all class members and aid in
distinguishing the class from others. Eventually it would evolve into a subsystem
integrator, where each subsystem would specialize in detecting a part of the object and
together would vote on the presence of an instance of the object class in the scene. The
individual object instances would be assimilated into specific models of the instances by
highlighting those features that are common to images of that individual object instance
and which accentuate differences to other class instances. Such a system would be
automatic and general enough to be easily adapted to work on different object categories.
1.3 Approach
For this purpose, we advocate placing a grid graph of nodes on the object and extracting
features at each node location. Thus for each object instance, we take a rich sampling of
the object using this graph. The intuition is that the richer the sampling,
the more likely it is to find, by statistical analysis, relevant features for
creating generic object class models and more specific models of object instances. But, as
we will show in the following chapters, even in the absence of statistical processing, this
rich sample model of the object instance is sufficient to identify it.
Chapter 2
Looking at Humans
Given the lofty goals outlined in Chapter 1, it was necessary to reduce the scope of our
experiment, in the interest of time and limited resources. Consequently, we chose to focus
only on the task of recognizing object instances within a class by creating object-instance-
specific models, rather than generating generic models of object classes. This enabled us to
concentrate on instances of a single class of objects rather than spread our efforts over
multiple classes.
We also restricted ourselves to objects moving autonomously against a static background
to facilitate easy segmentation using background subtraction. Our goal was to develop a
learning and recognition system and therefore the segmentation task was simplified as far
as possible. We chose the human body as an object class to be the center of our attention.
2.1 Full Body Human Recognition
There were several reasons for us to choose the human body for our experiment. Firstly,
our interest was piqued by the fact that the human brain has a distinct region which responds
selectively to images of the human body [3]. The human body, being a highly articulated
object with up to 20 degrees of freedom [14], is a worthy test case for our theory. In
addition, we can draw on a wealth of literature already present in the field. Moreover our
lab has been involved with recognizing human faces and the human body seems a natural
extension of our domain, though the human face is not articulated and deforms only
elastically. Among human subjects, face recognition or familiarity seems to play a role in
full-body human recognition even when the face is not clearly discernible [11]. Lastly,
there is a direct application of the system in the field of automated surveillance.
2.2 Previous Work in this Field:
There has been a lot of work done, particularly in the case of human detection. There are
several approaches to the problem, but all of them have some structural knowledge about
human shapes built into them. Some systems explicitly model the kinematics of human
motion [2] [11] whereas others model the typical human motion patterns [14] in order to
detect the human body. Still others learn the shape of a human silhouette [6] [9] or the
collection of moving parts which compose a human body [13].
Human Recognition, on the other hand, is the task of distinguishing between individual
humans. New research focuses on human gait for recognizing humans [1] [17]. However
these methods approach the problem with a toolbox of hand-crafted human gait models.
In addition human gait is a notoriously unreliable trait with a number of factors, ranging
from footwear to mood, affecting the way one walks. [10] combines torso color cues and
a face recognition system to boost recognition rates. [8] and [5] adopt a more holistic
approach, in which they extract color and texture features from the segmented human
body. But the choice of classifiers in these systems seems ad-hoc and leaves little room
for future integration of the person specific models into a generic model for detecting
humans.
2.3 Learning to Look at Humans
We will describe two attempts at solving the problem called Method 1 and Method 2. In
Method 1, we jump start the procedure of learning human models by providing a very
minimal structure to the system in the form of bounding boxes around three major body
parts. We then place nodes at fixed locations within each bounding box and extract
features at these positions. One might argue that providing the system with such
bounding boxes defeats the very purpose of our experiment, but in comparison to several
systems designed for human recognition these bounding-box assumptions are very
simple. Also, this is used only as an initial step; we thereafter remove even these crutches
from our system when we discuss Method 2. For Method 2 we place nodes at regular
intervals on the segmented image and extract features at these nodes.
Chapter 3
Overview of the System
The system consists of two overlapping parts, as shown in figure 3.1: the Recognition
sub-system (Recognizer) and the Learning Sub-system (Learner). The Recognizer and
Learner sub-systems share common modules such as the Pre-processing Module,
Instantaneous Representation Module and the Storage Module. The two methods,
described in 2.3, essentially have a similar structure. Therefore we shall describe the parts
that are common and divide the discussion into the different methods only when
necessary. I would like to mention here that Method 1 was developed with significant
collaboration from Douglas Garrett and Xiangyu Tang, members of our lab, who
developed parts of the structure that you see in the system.
Each chapter explains in detail the inner workings of a module. Chapter 4 describes the
Pre-processing Module, which consists of the Segmentation Block (Segmenter) and the
Scanning Block (Scanner). Chapter 5 explains the Instantaneous Representation Module,
which includes the Node Finder Block, and the Feature Extractor Block. Chapters 6 and 7
deal with the Learning Sub-system (Learner) and the Recognition Sub-system
(Recognizer), respectively. Chapter 8 presents the results of the system and discusses the
pros and cons of the approach we have adopted and suggests future work to be done on
the system.
Figure 3.1: Overview of the System. Input sequences are processed by the Segmenter and Scanner (Pre-processing Module) and then by the Node Finder and Feature Extractor (Instantaneous Representation Module); the Learner's Representation Generalizer writes person representations to the Storage Module (Database), which the Recognizer's Similarity Calculator uses to output the recognition result.
Chapter 4
Pre-processing Module
The Pre-processing Module performs preliminary image processing tasks on the input
frames. It segments the image (Segmenter Block) and then encloses each individual
segmented person in a separate bounding rectangle (Scanner Block). It forms the base of
the system, since all the succeeding blocks depend on it.
4.1 Segmenter Block:
Figure 4.1: Input and output characteristics of the Segmenter block. a) Input image, b) Segmentation mask.
The Segmenter Block segments the input image sequence as shown in figure 4.1. This
sequence could be either a learning sequence or a test sequence; the segmentation block
still stays the same. Earlier we used a simple background subtraction technique for
segmentation. The segmentation results were ‘coarse’ and prone to noise and other errors.
Therefore we have now incorporated a new segmentation technique [15] developed by
Xiangyu Tang, a PhD student in our lab. It uses a cue integration approach to
segmentation.
The new system tackles the difficulties of image segmentation by dynamically integrating
multiple cues, which achieves a synergy through the sharing of information from
independent sources. It is built upon a Bayesian cue integration framework, which
combines color, texture and contrast cues to robustly and accurately segment coherent
moving objects from image sequences. Each pixel in a frame can decide its own layer
assignment (background or foreground) by deriving the posterior probabilities from the
cues’ likelihood models. The individual cues provide rough but complementary
segmentation results. And their likelihood models are independent observations that are
subsequently trained by self-adaptation toward the segmentation consensus.
The amount of the contribution from an individual cue under different situations is
adjusted by measuring that cue’s quality based on its similarity with the overall
segmentation agreement. Allowing cooperation and competition among the cues at the
same time, the system maintains and improves its segmentation results in a self-organized
manner, which cannot be achieved by focusing on any one cue alone. Moreover, the
system does not require particular parameter tuning for different sequences. The
integration scheme and the individual cues’ likelihood models are designed to eliminate
subjective human intervention.
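To make the cue-integration idea concrete, the following is a minimal sketch of how per-pixel foreground/background posteriors could be combined from independent cue likelihoods, with each cue weighted by a quality score. This is only an illustration under stated assumptions, not the actual implementation of [15] or of FLAVOR; all function and variable names are hypothetical.

```python
import numpy as np

def combine_cues(cue_likelihoods_fg, cue_likelihoods_bg, cue_weights, prior_fg=0.5):
    """Combine independent cue likelihood maps into a per-pixel foreground posterior.

    cue_likelihoods_fg / _bg: lists of HxW arrays, p(observation_c | fg) and p(observation_c | bg).
    cue_weights: per-cue quality weights in [0, 1] (higher = closer to the segmentation consensus).
    Returns an HxW foreground posterior map.
    """
    log_fg = np.log(prior_fg)
    log_bg = np.log(1.0 - prior_fg)
    for lf, lb, w in zip(cue_likelihoods_fg, cue_likelihoods_bg, cue_weights):
        # Weighted log-likelihoods: a low-quality cue contributes less evidence.
        log_fg = log_fg + w * np.log(lf + 1e-12)
        log_bg = log_bg + w * np.log(lb + 1e-12)
    # Per-pixel posterior p(fg | all cues) via Bayes' rule.
    return np.exp(log_fg) / (np.exp(log_fg) + np.exp(log_bg))

# Example use: threshold the posterior at 0.5 to obtain a segmentation mask.
# mask = combine_cues([color_fg, texture_fg], [color_bg, texture_bg], [0.8, 0.6]) > 0.5
```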
4.2 Scanner Block:
Figure 4.2: Input and output characteristics of the Scanner block. a) Segmented image, b) Image with bounding rectangles around each person.
The Scanner Block is provided with the segmented image of the scene. It then encases
each ‘blob’ in a bounding rectangle and treats it as a person, as seen in figure 4.2. This
helps track the blobs (persons) as they move around in the scene. It provides a bounding
rectangle of the blob to the next block in sequence. This block can handle multiple
persons in the scene.
In order to ‘encase’ each person into a bounding rectangle, we adopt a very simple
technique. We project the segmented image down to the X-axis, i.e. for each column, we
count the pixels which are segmented as foreground. This gives a histogram along the X
direction as shown in Figure 4.3. From this plot we find the ‘valley points’ or points
which lie at the edges of the ‘bumps’, which yield the approximate boundaries of the
person in the X direction (dotted vertical lines).
Figure 4.3: a) Segmented image (rectangles that are too small are discarded), b) Plot of the number of foreground pixels along the X direction, with the valley points marked, c) Plot of the number of foreground pixels along the Y direction.
The boundaries in the Y direction are obtained by repeating the procedure in the Y
direction. We restrict ourselves to counting the foreground pixels of those parts of the
image, which are between the X direction boundaries that we found earlier. This gives us
the boundaries in the Y-direction (dotted horizontal lines).
We then check the combinations of X and Y boundaries and place a bounding rectangle around
the biggest foreground object enclosed within each combination. Note that for
some combinations of the X and Y boundaries there might be no foreground object
within them, since there might be overlap between the X and Y boundaries in areas which
are completely devoid of foreground pixels.
Finally, we check whether each rectangle encloses an area greater than 1/100th of the area of the
whole image. We use this measure to discard those rectangles which are too small to
contain humans and usually just contain some noise formed during segmentation. The
underlying assumption that we make in this block is that the persons are walking upright
and are standing away from each other.
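The projection-and-valley procedure can be summarized by the sketch below. It is a simplified illustration of the steps described above rather than the exact implementation; all names and the small-gap handling are assumptions.

```python
import numpy as np

def _runs(binary):
    """Return (start, end) index pairs of contiguous True runs in a 1-D boolean array."""
    padded = np.concatenate(([False], binary, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[0::2], edges[1::2]))

def bounding_rectangles(mask, min_area_fraction=0.01):
    """Find person bounding rectangles from a binary segmentation mask (HxW of 0/1).

    Projects the mask onto the X axis, splits it at 'valley' columns that contain
    no foreground pixels, then repeats along Y within each X interval. Rectangles
    smaller than min_area_fraction of the image area are discarded as noise.
    """
    h, w = mask.shape
    rects = []
    col_counts = mask.sum(axis=0)                # foreground pixels per column
    for x0, x1 in _runs(col_counts > 0):         # contiguous 'bumps' along X
        row_counts = mask[:, x0:x1].sum(axis=1)  # per-row counts within the X interval
        for y0, y1 in _runs(row_counts > 0):
            if (x1 - x0) * (y1 - y0) >= min_area_fraction * h * w:
                rects.append((x0, y0, x1, y1))
    return rects
```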
Chapter 5
Instantaneous Representation Module
This module is chiefly involved with generating an instantaneous (one per frame) model
of a person. For this purpose we place nodes on the person and extract features at those
node positions. We make the assumption that the features at the node positions would be
sufficient to describe a person uniquely.
For the generation of the instantaneous representations, learning and recognition, we have
developed two different methods (Method 1 and 2). Both the methods have a similar
structure but are based on different approaches to the problem. They have the same
blocks, but the inner workings of each block differ to a varying extent. Therefore, from
here on each block is divided into two parts, one for each method.
5.1 Node Finder Block:
The Node Finder Block places nodes on the human body. It uses the segmentation mask,
provided by the Segmenter, and the bounding rectangle around each person, provided by
the Scanner, and other information, if provided, to determine the locations at which to place the
nodes.
5.1.1 Method 1
The Node Finder block is given the bounding rectangle of the segmented body from the
Scanner block and the segmentation mask from the Segmenter block. It positions the
nodes on the segmented body of the person using knowledge about the structure of a
human body as shown in figure 5.1. Thus some a priori information is introduced into the
system.
We tell the system that the body consists of 3 parts (head, torso and legs) and that each
part has a fixed ratio relative to the total height of the body. For example, the height of the head
is fixed to be 16% of the total height, and the width of the head is 80% of its height.
Similarly, we have fixed ratios for the height and width of the torso and for the height of the
legs (the legs do not have a fixed width ratio, since they can move about). We
have arrived at these ratios empirically, and the system is not very sensitive to minor
changes in these values.
Figure 5.1: Input and output characteristics of the Node Finder block. a) Segmented image, b) Image with the nodes positioned on the body.
We have the height and width of the head and torso and only the height of the legs. Now
we enclose each body part in a separate box. For this purpose we use a ‘template’
matching approach. For the head, we place a rectangle of the chosen height and width in
the upper 16% of the bounding box for the entire body. We then move the rectangle
around horizontally (within the bounding box) and find the position at which maximum
foreground pixels lie inside the box. We repeat the procedure with a torso rectangle to
enclose the torso. This method of template matching ensures that the torso rectangle
encloses the bulk of the torso and is not led astray by the arms which keep moving about
around the torso. For the legs we enclose the entire lower half of the body.
After enclosing each body part within its own rectangle, we place nodes on each body
part. For the head and torso we place nodes at fixed positions within the rectangle: a total
of 7 nodes for the head and 13 nodes for the torso. The nodes are placed
well within the enclosing rectangle with margins on all sides. They are placed in a
slightly ‘staggered’ fashion as shown in figure 5.2.
For the legs, we follow a slightly different approach. We fix the height at which we can
place the nodes within the legs rectangle. But, we allow the nodes to have some
horizontal leeway, so that they can position themselves on the center of each leg. We do
this by choosing the centers of groups of foreground pixels. We place nodes at 4 different
heights, and one node for each leg, for a total of 8 nodes. Therefore we have a total of 28
nodes on the human body.
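As an illustration of the template-matching step, the sketch below slides a fixed-size rectangle horizontally within the body's bounding box and keeps the position covering the most foreground pixels. The ratios are those quoted above, but the code itself is only a schematic reconstruction with hypothetical names, not the original implementation.

```python
import numpy as np

def best_horizontal_fit(mask, y0, box_h, box_w):
    """Slide a box_h x box_w rectangle horizontally inside mask[y0:y0+box_h, :] and
    return the x offset that maximizes the number of foreground pixels covered."""
    strip = mask[y0:y0 + box_h, :]
    best_x, best_count = 0, -1
    for x in range(strip.shape[1] - box_w + 1):
        count = strip[:, x:x + box_w].sum()
        if count > best_count:
            best_x, best_count = x, count
    return best_x

def head_box(mask):
    """Place the head rectangle: height 16% of body height, width 80% of head height.

    mask: binary segmentation mask cropped to the person's bounding rectangle.
    Returns (x0, y0, x1, y1) within that bounding rectangle.
    """
    body_h, body_w = mask.shape
    head_h = int(0.16 * body_h)
    head_w = int(0.8 * head_h)
    x = best_horizontal_fit(mask, 0, head_h, head_w)   # search the upper 16% of the body
    return (x, 0, x + head_w, head_h)
```

The torso rectangle is positioned the same way with its own fixed ratios, which is what keeps it centered on the bulk of the torso rather than on the moving arms.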
Figure 5.2: a) Segmented image with the bounding rectangle and the bounding box size of
each body part, b) Image with bounding boxes around each person, c) Bounding
boxes of each body part with node positions within them, d) Image with nodes
placed on the body.
It is also important to note that we have a 2-dimensional scaling factor (Sx and Sy)
associated with each node. This factor is the distance between the 2 adjacent nodes, in the
respective directions, centered on the node. The scaling factor ensures that the nodes
scale to include more area under their ‘influence’ for feature extraction as the person
moves towards and away from the camera in the video sequences. This is discussed in
more detail in the next block.
We also have a validity factor associated with each node. Since our node placement
algorithm is rigid, we have to ensure that all the nodes lie on the foreground. Therefore
we set the validity factor to 1 if the node lies on the foreground (as seen in the segmented
image) or to 0 if it happens to be placed on the background. Thus you will see some
nodes as green (valid) or red (invalid).
5.1.2 Method 2
In this method we just use the Segmentation mask and the bounding rectangle around
each person to determine the positions of the nodes. No a priori information is provided
to the system. It adopts a more generic approach to placing the Nodes: It forms a grid of
nodes on the segmented object.
Figure 5.3: Image with grid-like node structure on the Segmented Human Body.
It places nodes at regular intervals in the X and Y direction. The distance between two
adjacent nodes scales with the size of the bounding rectangle and the width of the
segmented portion within the bounding rectangle. This allows the nodes to be spread out
in a very dense pattern on the segmented human body as shown in figure 5.3. We thus
capture a rich sampling of the person. This block of the system makes no assumption
about the object but gathers as much information as possible from the placement of a
dense grid.
All nodes are treated equally during the matching process. The advantage of taking such
a large sample of nodes is that it gives more variability to the data and allows that data to
define the object rather than imposing a predefined structure on the data. Since we have
many sampling points which lie on the object, the representation that we create is richer.
We also do not need to use a node ‘validity’ factor here since the grid is determined by
the segmentation mask and all nodes are valid.
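A minimal sketch of the grid placement follows; the spacing rule is simplified and the names and grid density are illustrative assumptions, not the exact values used in the system.

```python
import numpy as np

def grid_nodes(mask, n_cols=10, n_rows=20):
    """Place a dense grid of nodes on the segmented body.

    mask: binary segmentation mask cropped to the person's bounding rectangle.
    The spacing scales with the rectangle size, so the grid adapts as the
    person moves towards or away from the camera.
    """
    h, w = mask.shape
    step_x = max(1, w // n_cols)          # spacing scales with rectangle width
    step_y = max(1, h // n_rows)          # spacing scales with rectangle height
    nodes = []
    for y in range(step_y // 2, h, step_y):
        for x in range(step_x // 2, w, step_x):
            if mask[y, x]:                # keep only nodes on the foreground
                nodes.append((x, y, step_x, step_y))   # position plus scaling factors
    return nodes
```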
5.2 Feature Extraction block:
Once the nodes have been placed, we extract features at each node, if the node is a ‘valid’
node. Both the methods have a similar feature extraction block.
Important to note here is the influence of the 2-D Scaling factor (Sx and Sy). This factor
decides the area around the node over which the features are extracted. This factor is
exactly the distance between 2 adjacent nodes (in each direction) centered on the node
position as shown in figure 5.4(a), for Method 1. This ensures that the nodes do not
‘infringe’ on each other’s space. This is important for Method 1, wherein nodes are
separated from each other by a sufficient distance and a strict neighborhood is enforced
by the node placement algorithm. But for Method 2 the nodes are spaced very close to
each other, so here we consider a scaling factor that is four times the distance between two
adjacent nodes. This also enforces some neighborhood relationship among the features of
adjacent nodes. We extract features at those nodes which are valid, i.e. nodes that are on
the foreground.
Figure 5.4: The scaling factor is determined using the distance between adjacent nodes and defines the area of influence around the node. a) For Method 1, b) For Method 2.
In addition to the scale, for Method 2, we extract and store an additional feature which we
call relative position. This feature stores the position of the node within the bounding
box, relative to the top left corner of the bounding box around the entire human. This
relative position is normalized with the height and width of the box to get values between
0 and 1. This feature localizes the color and Gabor features to a region within the
bounding box.
First, we extract color features (average color in 3 channels, Red, Green and Blue). For
color, we have developed a pyramid-like structure, with multiple levels. We start from
the base, which encapsulates the entire area of influence (as defined by Sx and Sy), and
then uniformly reduce the size of the region as we move up a level. We then calculate the
average color at each level. This gives us the added flexibility of choosing the average
color of an area, of any size, centered on the node.
Figure 5.5: Different Color Region
Levels Centered on the Node
More explicitly,

\[ C_{ir} = \frac{1}{N} \sum_{u=-S_{xr}/2}^{S_{xr}/2} \; \sum_{v=-S_{yr}/2}^{S_{yr}/2} c(x_i - u,\, y_i - v) \tag{1} \]

where
C_ir = average color vector for node index i and color region level r,
S_xr and S_yr = scaling factors in the X and Y directions for level r,
c(x, y) = color vector for the pixel at position (x, y),
(x_i, y_i) = co-ordinates of node i,
N = number of pixels within the area of influence (S_xr × S_yr).
Note: S_xr = Sx when r = 0 (base level), and S_xr = 0 when r is the highest level.
We employ only the base level (entire area of influence) in our current implementation.
The concept is akin to the Gabors of different levels. Therefore we have 3 color features
(Red, Green and Blue) for each node, the average color within the area of influence of the
node.
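Equation (1) amounts to averaging the pixel colors in a window of size S_xr × S_yr centered on the node. A minimal sketch, for the base level only (as in the current implementation) and with illustrative names:

```python
import numpy as np

def average_color(image, node_x, node_y, sx, sy):
    """Average RGB color (Eq. 1) over an sx-by-sy window centered on the node.

    image: HxWx3 array; (node_x, node_y): node coordinates; sx, sy: scaling factors.
    Returns the 3 color features (R, G, B) for this node.
    """
    h, w = image.shape[:2]
    x0, x1 = max(0, node_x - sx // 2), min(w, node_x + sx // 2 + 1)
    y0, y1 = max(0, node_y - sy // 2), min(h, node_y + sy // 2 + 1)
    region = image[y0:y1, x0:x1, :].reshape(-1, 3)
    return region.mean(axis=0)
```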
Next, we extract the Gabor Features. We extract Gabor features at 3 levels and 4
directions, for a total of 12 Gabor Features at each node. We perform the transform on the
grayscale image of the video frame. We adjust the size of the Gabors to fit within the area
of influence. We know that the radius of the spatial kernel is given by:

\[ \mathrm{radius}_l = \frac{s\,\sigma}{k_l} \tag{2} \]

where
s = factor controlling the approximation of the kernel in the space domain,
σ = the σ of the Gaussian envelope,
k_l = frequency of the level-l kernel.

But

\[ k_l = \frac{k_{\max}}{(k_{fac})^{\,l}} \tag{3} \]

where
k_max = maximum frequency of the Gabor kernels,
k_fac = factor determining the distance between the kernels in the frequency domain, k_fac = √2.

Note here that the radius is largest when the frequency of the kernel is low (higher level). Since we have a total of 3 levels (level 0, level 1 and level 2), for determining the maximum radius of the Gabor spatial kernels we consider the frequency of the highest-level kernel (level 2). We then restrict this radius to lie within the area of influence determined by the scaling factor. For this purpose, we adjust the maximum frequency k_max of the Gabor kernels, which ensures that the size (or accuracy) of the spatial kernels stays within the area of influence:

\[ k_{\max} = \frac{s\,\sigma\,(k_{fac})^{2}}{S/2} \tag{4} \]

where
S = (Sx + Sy) / 2.
All the other spatial kernels will be smaller in size and therefore will be well within the
area of influence.
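The relationship between kernel frequency and spatial radius can be sketched as follows. This assumes the reconstruction of Eq. (4) given above (level-2 radius set to half the area of influence); the default values of s and σ and all names are illustrative assumptions.

```python
import numpy as np

def gabor_frequencies(sx, sy, s=2.0, sigma=2.0 * np.pi, n_levels=3, k_fac=np.sqrt(2.0)):
    """Choose Gabor kernel frequencies so the largest kernel fits the node's area of influence.

    Assumes Eq. (4) as reconstructed above: the level-2 (lowest-frequency) kernel
    radius s*sigma/k_2 is set to half the average scaling factor S = (sx + sy) / 2.
    """
    S = (sx + sy) / 2.0
    k_max = s * sigma * k_fac ** (n_levels - 1) / (S / 2.0)    # Eq. (4), reconstructed
    k = [k_max / k_fac ** level for level in range(n_levels)]  # Eq. (3)
    radii = [s * sigma / kl for kl in k]                       # Eq. (2)
    return k, radii   # radii[-1] equals S / 2 by construction
```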
In total we have 15 features for each node: 3 color features and 12 Gabor features. The
system is scalable to include more features if required. In Method 1, for the entire human
body, we compute these 15 features at each of the 28 nodes and therefore we compute
420 features for each frame of the video sequence as shown in figure 5.6.
Figure 5.6: Features extracted at each
node.
Chapter 6
Learner Sub-system
From this point on the Recognizer and Learner Sub-systems diverge. The Learner Sub-
system stores the data about a person extracted from a sequence, whereas the Recognizer
Sub-system is geared towards using the stored data (representations) to recognize the person
in the scene. The two sub-systems share the blocks described above. In addition, the Learner
Sub-system contains the Representation Generalizer block. It generates specific models
for each individual; this is not to be confused with the goal of developing a generic model
of a human.
6.1 Representation Generalizer block:
We have the data (as described in the Feature Extraction block) from a single frame. This
data gathered over several frames of the learning video sequence should be combined
into a ‘generalized’ representation of the person being learned. This function is
performed by the Representation Generalizer block. Each method has its own learning
scheme and therefore we shall discuss them separately.
6.1.1 Method 1
For Method 1, since each node position is rigidly determined, we assume that each node will
‘see’ nearly the same area of the person. Therefore we generate histograms of the features
of each node for each person, as shown in figure 6.1. A histogram generated over the entire sequence
would tend to peak around a value which would be characteristic of that ‘region’ of the
person. For example, a node placed in the middle of the torso region would have a color
histogram which would peak around the color of the shirt of the person over the entire
video sequence.
We create the histograms by quantizing each feature of each node into coarse ‘bins’. We
choose 30 bins for each feature for each node (This number is adjustable). From the
incoming feature response at node i for feature j (f_ij), we determine the bin index b
using the formula:

\[ b(f_{ij}) = \mathrm{round}\!\left( \frac{f_{ij} - f_{ij}^{\min}}{f_{ij}^{\max} - f_{ij}^{\min}} \times N_b \right) \tag{5} \]

where
f_ij = value of the response of feature j at node i,
b(f_ij) = bin index as a function of the response of feature j at node i,
f_ij^min = the minimum possible response for feature j at node i,
f_ij^max = the maximum possible response for feature j at node i,
N_b = total number of bins allocated for feature j at node i,
round() = rounds the value to the nearest integer.
For example, say we have a feature value of 156 for the red color channel. Then
f_ij = 156, f_ij^min = 0, f_ij^max = 255 (for color features the range is 0 - 255), and N_b = 30, so

\[ b(f_{ij}) = \mathrm{round}\!\left( \frac{156}{255} \times 30 \right) = 18. \]
Once we have the bin index, we increment that bin's count for node i and feature j.
After we have processed the entire learning sequence, we therefore have a count of
the number of times each feature value range (represented by a bin) has occurred for
feature j of node i. We then divide this count (for each bin index, for each feature of each
node) by the number of frames in the whole learning sequence for the person under
consideration, giving us the probability of seeing that value range at that node for a
particular feature, given that we are looking at that person.
Figure 6.1: Features at each node are converted into bin indices, and a probabilistic histogram is generated over the entire learning sequence.
So now we have a probabilistic histogram of each person, for each feature (quantized into
coarse bins) of each node, given the condition that we are looking at that particular
person. In total we have 420 such histograms for each person. They capture the variance
of the feature values at a particular node position for that person. Typically they appear to
be smoothed-out, Gaussian-like distributions, and they present a very compact and powerful
way to summarize the data obtained over the entire learning sequence of a person.
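The bin-index formula (Eq. 5) and the per-node, per-feature histogram accumulation could look like the sketch below, assuming 30 bins and a shared feature range for simplicity; all names are illustrative.

```python
import numpy as np

def bin_index(value, f_min, f_max, n_bins=30):
    """Eq. (5): quantize a feature response into one of n_bins coarse bins."""
    b = int(round((value - f_min) / (f_max - f_min) * n_bins))
    return min(max(b, 0), n_bins)          # clip to the valid bin range

def learn_histograms(sequence_features, f_min, f_max, n_bins=30):
    """Accumulate probabilistic histograms over a learning sequence.

    sequence_features: array of shape (n_frames, n_nodes, n_features).
    Returns histograms of shape (n_nodes, n_features, n_bins + 1), normalized by
    the number of frames so each entry is a conditional probability given the person.
    """
    n_frames, n_nodes, n_features = sequence_features.shape
    hist = np.zeros((n_nodes, n_features, n_bins + 1))
    for frame in sequence_features:
        for i in range(n_nodes):
            for j in range(n_features):
                hist[i, j, bin_index(frame[i, j], f_min, f_max, n_bins)] += 1
    return hist / n_frames
```

With the worked example above, bin_index(156, 0, 255) indeed returns 18.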
6.1.2 Method 2
This Sub-system is much simpler than that for Method 1. The Node Finder places a grid
of nodes on the segmented body of the person and the Feature Extractor provides the
feature data at each node. This is the instantaneous representation of the person as
generated from the current frame. For this system we store all the nodes to form the
representation of the person and that for each frame of the learning sequence. Thus we do
not reduce the information at this stage and instead store all the data that is available to
us.
Chapter 7
Recognizer Sub-system
Once the system has finished learning all the people and has written all the data to disk,
we can now perform recognition. For recognition we consider one input frame at a time
and perform the basic pre-processing tasks. Therefore the Recognizer Sub-system is
provided with a graph structure with the features extracted at every node. The essential
component of this sub-system is the Similarity Calculator Block.
7.1 Similarity Calculator Block
We turn our attention to finding the best match for the graph that we have gleaned from
the current frame. We consider each person in the database and compare them to the
graph (or instantaneous representation) obtained from the current frame. We compute a
‘similarity’ measure between the representation under consideration and the graph to
determine which learnt model best matches the data from the current frame. For each
method we compute similarity differently, since the generalized representation for each
method is different.
7.1.1 Method 1
To find the similarity measure, we obtain the conditional probability of finding the
feature response (gleaned from the current frame) by considering each feature of each
node at a time and looking up the corresponding bin index (of that feature at that node) in
the histograms that we have learned of the different people.
During the histogram look-up process, we use a density estimation technique called k-
nearest neighbor. This helps smooth the histograms and obtain a good estimate from
them. It also ensures that we obtain a non-zero value when we look up the
histogram. When we look up the histogram for a particular bin index, we place a window
centered on the bin index and we report the value of the histogram at that location if it is
non-zero. If, however, it is zero, then we expand the window (by an integer amount in
each direction) till it contains at least one non-zero bin value as shown in figure 7.1. We
then divide the sum of all non-zero bin entries within the current window by the window
size. This gives us a much better estimate of the density function.
Figure 7.1: Histogram look-up using the k-nearest-neighbor technique (probability versus bins). The window around the queried bin is expanded until it contains a non-zero entry.
Thus for each feature j of each node i of the graph, we obtain a similarity estimate
S_ij^(m) of the feature value at that node for the person m under consideration:

\[ S_{ij}^{(m)}\big(b(f_{ij})\big) = \frac{\sum_{k=-w}^{+w} P_{ij}\big((b(f_{ij})+k)\,/\,m\big)}{2w+1} \tag{6} \]

where
P_ij(b/m) = value of the histogram (conditional probability) of person m, for feature j at node i, at bin index b,
b(f_ij) = bin index as a function of the response of feature j at node i (refer to Eqn. 5),
S_ij^(m)(b(f_ij)) = similarity between feature j of node i of person m and the feature response f_ij,
w = number of times the window has been expanded.

Note: S_ij^(m)(b(f_ij)) = P_ij(b(f_ij)/m) when P_ij(b(f_ij)/m) is non-zero; otherwise we expand the
window.
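A sketch of the window-expansion look-up of Eq. (6), operating on one of the normalized per-node, per-feature histograms produced during learning; names are illustrative.

```python
import numpy as np

def histogram_lookup(hist_ij, b):
    """Eq. (6): k-nearest-neighbor style look-up of bin b in a 1-D histogram.

    The window around b is expanded until it contains a non-zero entry; the
    similarity is the mean histogram value within the final window.
    """
    n_bins = len(hist_ij)
    w = 0
    while True:
        lo, hi = max(0, b - w), min(n_bins, b + w + 1)
        window = hist_ij[lo:hi]
        if window.sum() > 0 or (lo == 0 and hi == n_bins):
            return window.sum() / (2 * w + 1)
        w += 1          # expand the window by one bin in each direction
```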
Now, to compute the similarity at the node level for person m (S_i^(m)), we simply multiply
all the feature similarities within the node:

\[ S_i^{(m)} = \prod_j S_{ij}^{(m)}\big(b(f_{ij})\big) \tag{7} \]

Now we have the node-wise similarity for each person m that has been learnt. It is
important to note that this node-level similarity is calculated only for those nodes which
are ‘valid’. Therefore, when we calculate the similarity at the body part level we need to
take this into account. We have

\[ S_{BP}^{(m)} = \frac{1}{N_{BP}} \sum_{i \in BP} S_i^{(m)} \tag{8} \]

where
BP = body part (legs, torso or head),
i = indices of the valid nodes contained within the body part (BP) under consideration,
N_BP = number of valid nodes for the given body part (BP).
Now we combine the similarities at the body part level to obtain the total similarity of the
person with the graph obtained from the current frame. For this purpose we again
multiply the similarities obtained at the body part level. This ensures that the person
should match as a whole. If the torso (shirt) of a person matches very well with the graph,
but the legs (pants) do not, then the system should be able to tell the difference.

\[ S^{(m)} = \prod_{BP} S_{BP}^{(m)} \tag{9} \]
Now we have the total similarity of the person formed by combining, hierarchically, the
similarities at the feature, node and body part level. For each person in the database we
obtain a total similarity value as described above. The person with the highest similarity
is reported as the person recognized.
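Combining the similarities hierarchically (Eqs. 7-9) then reduces to a few lines. The sketch below assumes the per-feature similarities come from the look-up above; the data structures and names are illustrative assumptions.

```python
import numpy as np

def person_similarity(feature_sims, node_validity, node_to_body_part):
    """Eqs. (7)-(9): combine feature -> node -> body part -> total similarity.

    feature_sims: dict {node_index: list of per-feature similarities S_ij}.
    node_validity: dict {node_index: bool}, True if the node lies on the foreground.
    node_to_body_part: dict {node_index: 'head' | 'torso' | 'legs'}.
    """
    node_sim = {i: np.prod(sims) for i, sims in feature_sims.items()}   # Eq. (7)
    total = 1.0
    for bp in ('head', 'torso', 'legs'):
        valid = [node_sim[i] for i in node_sim
                 if node_to_body_part[i] == bp and node_validity[i]]
        if valid:
            total *= sum(valid) / len(valid)    # Eq. (8): mean over valid nodes,
    return total                                # Eq. (9): product over body parts
```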
7.1.2 Method 2
As discussed above, the representation (for Method 2) consists simply of the nodes (and
the corresponding data) that have been collected from the learning sequences. Again for
Method 2 we adopt a very simple process of similarity computation.
This process involves doing a node-by-node matching of the current frame nodes
(instantaneous nodes) with the nodes in the stored representation (representation nodes)
of a person (m) under consideration. For computing the similarities at the node level we
perform the normalized dot product for each set of features. We consider color features as
one set and Gabor Features as another set. The second term in the equation is a
localization term. This term biases the node to be centered on its relative position within
the learned models. This provides a greater chance that neighboring nodes will be
matched to neighboring nodes in the model, or at least nodes in the same region.
Therefore we have

\[ S_{ik}^{(m)} = \left( \alpha\,\frac{c_i \cdot c_k}{\lVert c_i \rVert\,\lVert c_k \rVert} + \beta\,\frac{g_i \cdot g_k}{\lVert g_i \rVert\,\lVert g_k \rVert} \right) \times e^{-\sigma_x (R_{xi}-R_{xk})^2 \, - \, \sigma_y (R_{yi}-R_{yk})^2} \tag{10} \]

where
S_ik^(m) = similarity between instantaneous node i and representation node k of person m,
c_i, c_k = color vectors of instantaneous node i and representation node k, respectively,
g_i, g_k = Gabor vectors of instantaneous node i and representation node k, respectively,
α = weight associated with the color similarities (α = 0.7),
β = weight associated with the Gabor similarities (β = 0.3),
R_xi, R_xk = relative x co-ordinates, within the bounding box, of instantaneous node i and representation node k, respectively,
R_yi, R_yk = relative y co-ordinates, within the bounding box, of instantaneous node i and representation node k, respectively,
σ_x, σ_y = factors controlling the localization in the x and y directions, respectively.
Thus for each instantaneous node, we find the best match representation node of the
person m. Therefore to obtain the total similarity of a person, we sum over all the best
matches of the instantaneous nodes.
\[ S^{(m)} = \sum_i \max_k \, S_{ik}^{(m)} \tag{11} \]

where
S_ik^(m) = similarity between instantaneous node i and representation node k of person m,
S^(m) = total similarity for person m,
max_k = maximization over the argument k.
This gives us the total similarity of a person in the database with the graph obtained from
the current frame. The person with the highest similarity is reported as the recognized
person.
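A sketch of the Method 2 matching (Eqs. 10 and 11), with α = 0.7 and β = 0.3 as above; the values of σ_x and σ_y, the node representation as (color, Gabor, relative-position) tuples, and all names are illustrative assumptions.

```python
import numpy as np

def node_similarity(ci, gi, ri, ck, gk, rk, alpha=0.7, beta=0.3, sig_x=10.0, sig_y=10.0):
    """Eq. (10): similarity between an instantaneous node and a representation node."""
    color = np.dot(ci, ck) / (np.linalg.norm(ci) * np.linalg.norm(ck))
    gabor = np.dot(gi, gk) / (np.linalg.norm(gi) * np.linalg.norm(gk))
    local = np.exp(-sig_x * (ri[0] - rk[0]) ** 2 - sig_y * (ri[1] - rk[1]) ** 2)
    return (alpha * color + beta * gabor) * local

def total_similarity(inst_nodes, rep_nodes):
    """Eq. (11): sum, over instantaneous nodes, of the best-matching representation node.

    inst_nodes, rep_nodes: iterables of (color_vector, gabor_vector, relative_position) tuples.
    """
    total = 0.0
    for ci, gi, ri in inst_nodes:
        total += max(node_similarity(ci, gi, ri, ck, gk, rk)
                     for ck, gk, rk in rep_nodes)
    return total
```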
Chapter 8
Results and Conclusion
8.1 Results
Method Used                               Gallery Size   Number of Test Sequences   Correct Recognition Rate
Method 1                                  4              7                          88.7%
Method 2 (without localization term)      6              10                         72.1%
Method 2 (without localization term)      15             19                         69.3%
Method 2 (with localization term)         15             19                         82.2%

Table 8.1: Results

Note: The Correct Recognition Rate is computed over the total number of frames in a
video sequence, i.e. the fraction of frames in which the person is correctly recognized.
Method 1 shows a good recognition rate, i.e. it recognizes people correctly in most frames. But the gallery
size used was particularly small. We avoided introducing the system to occluded image
sequences, since the system was not designed to deal with such situations.
Method 2 has been tested extensively with different gallery sizes. It is found that the
localization term greatly improves results. Even as the gallery size increases,
Method 2 performs better with the localization term than without it.
8.2 Discussion
The results for Method 1 are quite impressive given its simplicity. However, it relies more
on color information than on Gabor features. The reason for this is that the Gabors are not
consistent enough (since they are not fixed to a ‘landmark’). Therefore this method fails
on occasions when the color is influenced by lighting changes and the like. Also this
method is not able to deal with partial occlusion in a satisfactory manner. This method
proposes a very simple solution to the task at hand.
Method 2 is an attempt to overcome the problems with Method 1. It tries to avoid the
problem of trying to match incoming test data which might not be consistent with a
sparse sample of points collected during learning. Therefore we try to capture the
variability and use it to our advantage by sampling a large array of points. It is essentially
a ‘brute force’ algorithm. It does a node by node comparison, but we ensure some
localization of the node by using a Gaussian function to bind the node to its relative
position within the human bounding box. It performs well in these cases, but the errors
occur when the input image shows a human who is largely occluded and there is no
such occluded image in the learning set.
8.3 Future Work
These are the first steps towards developing a better system for object recognition.
Method 2 could be improved by using an explicit neighborhood similarity term to
compute the similarity, rather than the current localization term. Also, if a generic model of the
human body is obtained then it could be used to hypothesize the part of the human which
is occluded and therefore we could match only those parts of the model, rather than the
entire model as we do now.
The next big logical step is to perform statistical analysis on the specific object models to
arrive at a common thread linking them. This would lay the basis for finding a generic
model for detection of humans. Due to the generality of this method it could be extended
to include other object classes as well.
List of References
[1] C. Abdelkader, R. Cutler. "Motion-based Recognition of People in EigenGait Space". Conference on Automatic Face and Gesture Recognition, 2002.
[2] C. Bregler and J. Malik. "Tracking People with Twists and Exponential Maps". Proceedings CVPR '98.
[3] P. Downing, Y. Jiang, M. Shuman and N. Kanwisher. "A Cortical Area Selective for Visual Processing of the Human Body". Science, Pages 2470-2473, 2001.
[4] R. Fergus, P. Perona and A. Zisserman. "Object Class Recognition by Unsupervised Scale-Invariant Learning". Proceedings CVPR '03.
[5] M. Hahnel, D. Klunder and K. Kraiss. "Color and Texture Features for Person Recognition". IJCNN '04.
[6] J. Lim, D. Kriegman. "Tracking Humans using Prior and Learned Representations of Shape and Appearance". Conference on Automatic Face and Gesture Recognition, 2004.
[7] H. Loos and C. von der Malsburg. "1-Click Learning of Object Models for Recognition". Proceedings BMCV '02, Pages 377-386.
[8] C. Nakajima, M. Pontil, B. Heisele and T. Poggio. "Full-body person recognition system". Pattern Recognition, 36, 2003.
[9] C. Nicolaou, A. Egbert Jr., R. Lacher, S. Bassett. "Human Shape Recognition Using the Method of Moments and Artificial Neural Networks". IJCNN '99.
[10] K. Okada, L. Kite and C. von der Malsburg. "An Adaptive Person Recognition System". Proceedings of the International Workshop on Robot-Human Interactive Communication, 2001.
[11] D. Roark, A. O'Toole and H. Abdi. "Human Recognition of Familiar and Unfamiliar People in Naturalistic Video". Proceedings of AMFG '03.
[12] H. Schneiderman and T. Kanade. "A Statistical Method for 3D Object Detection Applied to Faces and Cars". Proceedings CVPR '00.
[13] Y. Song, L. Goncalves, P. Perona. "Unsupervised Learning of Human Motion". Transactions on Pattern Analysis and Machine Intelligence, Pages 814-827, 2003.
[14] Y. Song, X. Feng and P. Perona. "Towards Detection of Human Motion". Proceedings CVPR '00.
[15] X. Tang, S. Wu, C. von der Malsburg. "Self-Organized Figure-Ground Segmentation by Multiple-Cue Integration". Submitted to IASTED SIP, 2005.
[16] M. Turk and A. Pentland. "Eigenfaces for Face Recognition". Conference on CVPR '94.
[17] L. Wang, W. Hu and T. Tan. "A New Attempt to Gait-based Human Identification". Conference on Pattern Recognition, Pages 115-118, 2002.
[18] L. Wiskott, J.-M. Fellous, N. Krueger and C. von der Malsburg. "Face Recognition by Elastic Bunch Graph Matching". Transactions on Pattern Analysis and Machine Intelligence, Pages 775-779, 1997.
[19] T. Zhao, R. Nevatia, F. Lv. "Segmentation and Tracking of Multiple Humans in Complex Situations". Proceedings CVPR '01.