MULTIPLE PEDESTRIANS TRACKING BY DISCRIMINATIVE MODELS
by
Cheng-Hao Kuo
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2011
Copyright 2012 Cheng-Hao Kuo
Acknowledgements
First of all, I offer my profound gratitude to my advisor, Prof. Ram Nevatia, for his immense academic and financial support. It was his vast knowledge in the field of computer vision and artificial intelligence that opened my eyes to new and exciting possibilities. As an advisor, his willingness to share his keen observations and uncanny intuitions guided me on a journey of compelling research. Without his mentoring, this thesis would not have been possible.
I thank Prof. Gerard Medioni, Prof. Richard Leahy, Prof. C.-C. Jay Kuo, and Prof. Fei Sha for acting as committee members during my qualifying exam and thesis defense. Their input helped me refine the thesis. A special thanks to Dr. Chang Huang, who was a postdoctoral researcher in our lab for nearly three years. His work laid the foundation upon which my thesis was built. I benefited greatly from our discussions, his inspiring ideas, and his fund of knowledge in computer vision.
I also thank the current and former USC IRIS members with whom I share some of my most memorable experiences in the PHE building. Special thanks to my senior lab
mates: Wei-Kai Liao, Yu-Ping Lin, Chi-Wei Chu, Tae Eun Choe, Fengjun Lv, Chang
Yuan, Qian Yu, Bo Wu, Pradeep Natarajan, Vivek Kumar Singh, Li Zhang, Eunyoung
Kim, Thang Ba Dihn, and Yuan Li. You all are brilliant researchers and my good friends.
I also thank my junior lab mates: Bo Yang, Dian Gong, Xuemei Zhao, Huei-Hung Liao,
Jan Prokaj, Anustup Kumar Choudhury, Prithviraj Banerjee, Pramod Sharma, Furqan
Khan, Amit Pal, and Younghoon Lee. We will always share the camaraderie of fighting to meet conference deadlines and then the reward of some R&R at the unforgettable conference venues.
I received support from my advisor's grants: the U.S. Army Research Laboratory and the U.S. Army Research Office under contract number W911NF-08-C-0068, and the Office of Naval Research under grant number N00014-10-1-0517. The grants were invaluable to the development of my dissertation.
It was also my pleasure and privilege to be a teaching assistant at the Department of
Electrical Engineering at USC.
I also thank my spiritual brothers and sisters at the Chinese Baptist Church at West Los Angeles (CBCWLA) for their friendship. We shared an amazing five years of growing together and walking with the Lord. Thank you, Pastor Wang, for your tireless spiritual nurturing. And a big shout out to Alight and Jessica, Mickey and Amanda's family, Michael and Mimi's family, Kun-Chih and Chun-Yi's family, Alex and Shuya's family, Gordon and Jade, Julia, Hsing-Hau, Peng and Ning. Although not directly related to my thesis, my spiritual journey with them helped me build my work ethic at the lab while keeping a well-rounded life beyond academia. They are like angels sent from God.
Big thanks to my dear parents in Taiwan. I regret not being able to spend more
time with them since I came to the United States for graduate school in July 2004. No
matter where life takes me, I know they will always be there when I need them the most.
Everything I like about myself and every positive quality I possess is a direct reflection of their love and guidance and a testament to what wonderful people they truly are. They are the best parents anyone could ask for, and they deserve far more credit than I can ever put into words. Thanks to my younger brother, Cheng-Wei, for staying in Taiwan to look after Mom and Dad. His family, including his wife Yun-Pyn and my nephew Timothy, brought me lots of joy and happiness.
A huge thank-you to my wife, Jennifer. Our love for each other carried me through
the difficult days of pursuing my Ph.D. As long as we are at each other's side, I am not alone in facing any challenge in our lives. As my soul mate, you give me strength whenever I doubt myself and you make me smile whenever I am down. Your love, devotion, and sacrifice allowed me to focus on my research and thesis without the distractions of daily affairs. Without you it would definitely have taken more months or years to finish my Ph.D.
Finally, I want to give thanks to the Holy and Almighty God. He is my Lord and
my Savior, my Refuge and my Shield. Under His boundless grace I enjoy His abundant
provision. I truly understand that apart from Him I can do nothing. All glory be to God
in the highest.
Abstract
We present our work on multiple pedestrians tracking in a single camera and across multiple non-overlapping cameras. We propose an approach for online learning of discriminative appearance models for robust multi-target tracking in a crowded scene. Although much progress has been made in developing methods for optimal data association, there has been comparatively less work on the appearance model, which is the key element for good performance. Many previous methods either use simple features such as color histograms, or focus on the discriminability between a target and the background, which does not resolve ambiguities between different targets. We propose an algorithm for learning discriminative appearance models for different targets. Training samples are collected online from tracklets within a time sliding window based on some spatial-temporal constraints; this allows the models to adapt to target instances. Learning uses an AdaBoost algorithm that combines effective image descriptors and their corresponding similarity measurements. We term the learned models OLDAMs. Our evaluations indicate that OLDAMs have significantly higher discrimination between different targets than conventional holistic color histograms, and when integrated into a hierarchical association framework, they help improve the tracking accuracy, particularly reducing the false alarms and identity switches.
Furthermore, we extend our approach to multiple non-overlapping cameras. Given the multi-target tracking results in each camera, we propose a framework to associate those tracks. Collecting reliable training samples is a major challenge in on-line learning since supervised correspondence is not available at runtime. To alleviate the inevitable ambiguities in these samples, Multiple Instance Learning (MIL) is applied to learn an appearance affinity model which effectively combines three complementary image descriptors and their corresponding similarity measurements. Based on the spatial-temporal information and the proposed appearance affinity model, we present an improved inter-camera track association framework to solve the "target handover" problem across cameras. Our evaluations indicate that our method has higher discrimination between different targets than previous methods.
Table of Contents
Acknowledgements
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Problem statement
1.2 Challenges
1.2.1 Complex environments
1.2.2 Occlusions between targets
1.2.3 Appearance modelling
1.3 Overview of related work
1.4 Overview of our approach
1.5 Summary of results and contributions
1.6 Outline
2 Related Work
2.1 Multiple target tracking in a single camera
2.2 Multiple target tracking across multiple cameras
3 Hierarchical Association-Based Multi-Target Tracking
3.1 Introduction
3.2 System Overview
3.3 Hierarchical tracklet association
3.3.1 Low-level Association
3.3.2 Mid-level Association
4 On-line Learned Discriminative Appearance Models
4.1 Introduction
4.2 Related Work
4.3 Overview of our approach
4.4 Online Learned Discriminative Appearance Models (OLDAMs)
4.4.1 Reliable Tracklets
4.4.2 Collecting training samples
4.4.3 Representation of appearance model
4.4.4 Similarity of appearance descriptors
4.4.5 Learning Algorithm
4.5 Experimental results
4.5.1 Discrimination Comparison
4.5.2 Evaluation metrics
4.5.3 Results for the CAVIAR dataset
4.5.4 Results for the TRECVID08 dataset
4.5.5 Computational Speed
4.6 Conclusion
5 Person Identity Recognition based Multi-Person Tracking
5.1 Introduction
5.2 Related Work
5.3 Appearance-based affinity models
5.3.1 Local image descriptors and similarity measurements
5.3.2 Model definition and descriptor selection
5.4 Tracklet association framework
5.4.1 Reliable Tracklets
5.4.2 Tracklet classification
5.4.3 On-line appearance-based affinity models
5.4.4 Tracklet association
5.5 Experimental results
5.5.1 CAVIAR dataset
5.5.2 TRECVID 2008 dataset
5.5.3 ETH mobile platform dataset
5.5.4 Speed
5.6 Conclusion
6 Combination of On-Line and Off-Line Learning
6.1 Introduction
6.2 Related Work
6.3 System Overview
6.4 Hierarchical tracklet association
6.4.1 Reliable tracklet generation
6.4.2 MAP formulation for tracklet association
6.5 Offline Learning of Tracklet Affinity Models
6.5.1 The HybridBoost algorithm
6.5.2 Feature pool
6.5.3 Weak learner
6.5.4 Training process
6.6 Experimental Results
6.6.1 Implementation details
6.6.2 Appearance feature analysis
6.6.3 Tracking performance
6.7 Conclusion
7 Inter-Camera Tracking for Multiple Targets
7.1 Introduction
7.2 Related work
7.3 Overview of our approach
7.4 Track association between cameras
7.5 Discriminative appearance affinity models with Multiple Instance Learning
7.5.1 Collecting training samples
7.5.2 Representation of appearance model and similarity measurement
7.5.3 Multiple instance learning
7.6 Experimental results
7.6.1 Comparison of discriminative power
7.6.2 Evaluation metrics
7.6.3 Tracking results
7.7 Conclusion
8 Future work
8.1 Robust segmentations of targets
8.2 Dynamic appearance descriptors
8.3 Recognition using attributes
References
List of Tables
3.1 Models and methods used in different levels of the hierarchical framework in [36].
4.1 Tracking results on the CAVIAR dataset. *The numbers of Frag and IDS in [86] [92] [88] are obtained by different metrics from what we adopt [55], which is more strict.
4.2 Tracking results on the TRECVID08 dataset.
5.1 Comparison between multi-person tracking and person identity recognition.
5.2 Comparison of tracking results between the state-of-the-art methods and PIRMPT on the CAVIAR dataset. *The numbers of Frag and IDS in [86] [92] [88] are obtained by looser evaluation metrics. The human detection results we use are the same as [36, 55, 48].
5.3 Comparison of tracking results between the state-of-the-art methods and PIRMPT on the TRECVID 2008 dataset. The human detection results we use are the same as [36, 55, 48].
5.4 Tracking results from PIRMPT on sequences "BAHNHOF" and "SUNNY DAY" from the ETH dataset.
6.1 The list of features for the tracklet affinity model.
6.2 Tracking results on the TRECVID08 dataset.
6.3 Tracking results on the CAVIAR dataset. *The numbers of Frag and IDS in [86, 92, 88] are obtained by different metrics from what we adopt [55], which is more strict.
7.1 A short summary of the elements in each sub-matrix in H, which models all possible situations between the tracks of two non-overlapping cameras. The optimal assignment is solved by the Hungarian algorithm.
7.2 The comparison of the Equal Error Rate using different appearance affinity models. It shows that the on-line learning method has the most discriminative power.
7.3 Tracking results using different appearance models with our proposed metrics. Lower numbers indicate better performance. It shows that our on-line learned appearance affinity models achieve the best results.
List of Figures
1.1 A sample tracking result of our approach applied in a crowded scene.
1.2 An example of results for background subtraction.
1.3 Applying the object detector on a single image. Green boxes indicate correct detections; red boxes indicate missed detections; yellow boxes indicate inaccurate detections. Although recent improvements in detection algorithms have been achieved, detection errors such as missed detections, false alarms, and inaccurate detections still occur often.
1.4 Four snapshots in a video sequence depict a scene in which a lady, indicated by a green arrow, walks behind several people. She is hardly visible in snapshots 2 and 3, which increases the difficulty of correct tracking.
1.5 The difficulties of appearance modelling in multi-target tracking. (a) Different targets may have similar appearances. (b) A target may change its pose and view-point. (c) The appearance of a target may change dramatically in a multi-camera tracking scenario.
1.6 The scheme of considering past and current frames for tracking. (a-g) When the current frame is observed, the corresponding detection responses are obtained by an object detector, and then the tracker associates the responses with the existing tracks. (h) Final tracking results.
1.7 The scheme of considering past and "future" frames in a time sliding window. (a) The object detector is applied in all frames in a time sliding window. (b,c) Global optimization of multiple trajectories. (d) Final tracking results.
3.1 The block diagram of our multi-target tracking framework based on the hierarchical association method.
3.2 The overview of hierarchical tracklet association. (a) Detection responses in each frame in a time sliding window. (b) Reliable tracklets by low-level association. (c) Refined tracklets by Kalman filtering. (d) Round 1 of mid-level association. (e) Refined tracklets by Kalman filtering. (f) Interpolation of missing responses and removal of false detections. (g) Round 2 of mid-level association. (h) Refined tracklets by Kalman filtering; interpolation of missing responses and removal of false detections.
4.1 Sample detection results in column (a) and sample tracking results in column (b).
4.2 The block diagram of our multi-object tracking system with on-line learned discriminative appearance models.
4.3 The overview of the process of obtaining on-line training samples. (a) The raw detection responses. (b) The resulting reliable tracklets. (c) Positive training samples. (d) Negative training samples.
4.4 The sample distribution based on correlation coefficients of color histograms (left) and OLDAMs (right). Blue represents positive samples and red represents negative ones. The figure gives an example showing that OLDAMs are more discriminative than color histograms.
4.5 A graphical example of the evaluation metrics proposed in [55].
4.6 Sample tracking result on the CAVIAR dataset. Top row: two targets indicated by arrows switch their IDs when a color histogram serves as the appearance model. Bottom row: these two targets are tracked successfully when the OLDAMs are used.
4.7 Sample tracking result on the TRECVID08 dataset. The top row shows the result of [55], in which a man receives a new ID and his old ID is transferred to the lady behind him. The bottom row shows that they are consistently tracked by our method.
5.1 Some snapshots of videos from our multi-person tracking results. The goal of this work is to locate the targets and maintain their identities in a real and complex scene.
5.2 The block diagram of our proposed method.
5.3 The off-line training samples for appearance models. Images in each column indicate the same person.
5.4 Some sample features selected by the AdaBoost algorithm. The local descriptors of color histograms, HOG, and covariance matrices are indicated by red, green, and yellow respectively.
5.5 Sample tracking results on the (a) CAVIAR, (b,c) TRECVID 2008, and (d,e) ETH datasets.
6.1 The block diagram of our proposed system which combines on-line (red) and off-line (blue) learning methods.
6.2 The detailed block diagram of our proposed methods embedded in the hierarchical tracklet association framework.
6.3 Sample tracking result on the CAVIAR dataset. Top row: two targets indicated by arrows switch their IDs when a color histogram serves as the appearance model. Bottom row: these two targets are tracked successfully when the OLDAMs are used.
7.1 Illustration of inter-camera association between two non-overlapping cameras. Given tracked targets in each camera, our goal is to find the optimal correspondence between them, such that the associated pairs belong to the same object. A target may walk across the two cameras, return to the original one, or exit in the blind area. Also, a target entering Camera 2 from the blind area is not necessarily from Camera 1, but may be from somewhere else. Such open blind areas significantly increase the difficulty of the inter-camera track association problem.
7.2 The block diagram of our system for associating multiple tracked targets from multiple non-overlapping cameras.
7.3 Sample tracking results on our dataset. Some tracked people travelling through the cameras are linked by dotted lines. For example, the targets with IDs 74, 75, and 76 leave Camera 2 around the same time, and our method finds the correct association when they enter Camera 1. This figure is best viewed in color.
List of Algorithms
1 Learning the on-line discriminative appearance affinity model
2 HybridBoost algorithm
3 Multiple Instance Learning Boosting
Chapter 1
Introduction
Tracking multiple targets in a real scene is one of the important topics in the field of computer vision, since it has many applications such as surveillance systems, robotics, and human-computer interaction environments. In particular, tracking multiple pedestrians is of intense interest as it has been widely used in security scenarios; it can also serve as the basis for higher-level tasks such as event recognition and activity reasoning. As video becomes more and more pervasive in the world, it becomes increasingly critical to have robust, automatic analysis of object behavior and interactions to effectively process the large amount of data.
The goal of multi-target tracking is to infer the targets, retrieve their trajectories, and maintain their identities through video sequences. This is a relatively easy task if the targets are isolated and easily distinguished from the background. However, it becomes a highly challenging problem in complex and crowded environments where occlusions of targets are frequent. Moreover, similar appearances and complicated interactions between different targets often result in incorrect tracking results such as track fragmentation and identity switches. For example, Figure 1.1 shows a busy scenario which is a difficult case
for multi-target tracking. In this work, we aim to effectively and efficiently track multiple pedestrians in such challenging conditions.
Figure 1.1: A sample tracking result of our approach applied in a crowded scene.
1.1 Problem statement
The objective of this work is to develop a robust system to track multiple pedestrians within a single camera and across multiple cameras. Traditional feature-based tracking methods, such as those based on color histograms, salient points, or motion blobs, do not have a discriminative model that distinguishes the object category of interest from others. Using object detectors as discriminative models helps overcome this limitation. It also enables automatic initialization and termination of trajectories. However, although there has been significant progress in object detection performance, the accuracy of the state-of-the-art is still far from perfect. Missed detections, false alarms, and inaccurate responses are common. In this work, given the imperfect detection results of each frame
in video sequences, we propose an association-based tracking framework to gradually link them into longer tracks and form the final tracking results.
Figure 1.2: An example of results for background subtraction.
1.2 Challenges
The problem of tracking multiple targets in videos is quite difficult due to complex environments, occlusions between targets, and appearance modeling. We briefly describe some of the challenges here.
1.2.1 Complex environments
When the targets are in simple environments where they are easily distinguished from the background, a simple method such as background subtraction is able to locate the targets well. However, dynamic environments and cluttered backgrounds are very common in the real world, which means the localization of targets cannot rely on simple methods. For example, given a snapshot in a video and its background model, we could use background subtraction to find the targets, as shown in Figure 1.2. However, not all white pixels are targets; they might be shadows or other moving background regions.
Figure 1.3: Applying the object detector on a single image. Green boxes indicate correct detections; red boxes indicate missed detections; yellow boxes indicate inaccurate detections. Although recent improvements in detection algorithms have been achieved, detection errors such as missed detections, false alarms, and inaccurate detections still occur often.
Another solution is to use an object detector. Even though great improvements in state-of-the-art object detectors have been achieved, detection mistakes still occur often, as in Figure 1.3. Recovering missed detections, removing false alarms, and refining inaccurate detections make the tracking problem difficult.
1.2.2 Occlusions between targets
Occlusion is always a huge challenge in the tasks of object detection and object tracking. It is very difficult to track a target successfully if we are not able to locate it. Current object detectors work if the object is unoccluded or only partially occluded (<30%). However, in a crowded environment, partial or heavy occlusions between targets are frequent. If a target is occluded for a long time, tracking it correctly is a hard task, as in Figure 1.4. How to bridge the gap between detected objects and maintain their identities is still an unsolved problem.
Figure 1.4: Four snapshots in a video sequence depict a scene in which a lady, indicated by a green arrow, walks behind several people. She is hardly visible in snapshots 2 and 3, which increases the difficulty of correct tracking.
1.2.3 Appearance modelling
Establishing suitable appearance models for targets is the key element for multi-target tracking. In a complex scenario, successful appearance models are hard to obtain since different targets may have similar appearances, e.g., similar colors or clothes, as in Figure 1.5(a). Besides, the appearance of a certain target may change through a video sequence due to changes of pose or view-point, as in Figure 1.5(b). Moreover, if we consider the case of tracking across cameras, the appearance of a target may not be consistent due to different sensor characteristics, lighting conditions, and view-points, as in Figure 1.5(c). Previous methods often use simple features, e.g. a static color histogram, as the appearance models, which cannot effectively distinguish different targets. In this work, we focus on how to build an on-line learned discriminative appearance model and show its capability on the problem of multi-target tracking.
Figure 1.5: The difficulties of appearance modelling in multi-target tracking. (a) Different targets may have similar appearances. (b) A target may change its pose and view-point. (c) The appearance of a target may change dramatically in a multi-camera tracking scenario.
1.3 Overview of related work
Multi-target tracking has been an active topic in computer vision. Among a vast amount of work in the literature, Multiple Hypothesis Tracking (MHT) [66] and Joint Probabilistic Data Association Filters (JPDAF) [25] are two early representatives, and several variations were developed later. In general, their goal is to infer the association between measurements and existing tracks over multiple targets simultaneously. However, the search space of these methods grows exponentially, which limits them to considering only a few time steps in practice.
There are many previous works [94, 9, 44, 41, 91] for multi-target tracking which rely on background subtraction or modeling from single or multiple cameras. The targets are detected as "blobs" and these methods try to track those blobs throughout the video sequence. However, background subtraction-based methods suffer from the difficulty that background modeling is not robust. The assumption that all blobs are targets is
not always true; such methods also require some pre-processing and post-processing techniques. Moreover, they easily fail if blobs merge or split, and they also easily fail if the camera moves.
In recent years, object detection techniques [79, 80, 19, 52, 96, 84, 78, 24, 35] have achieved significant progress and brought a new trend of tracking approach: detection-based tracking. Compared to the methods which locate targets via background subtraction or modeling, detection-based tracking is essentially more robust in complex environments. There are two main streams in detection-based tracking methods: one considers only past and current frames to make association decisions; the other also takes information from future frames. The former [60, 14, 90, 11] usually adopts a particle filtering framework and uses detection responses to guide the tracker. Therefore, these methods are suitable for time-critical applications since no clues from future frames are required, as in Figure 1.6. However, they may be prone to yield identity switches and trajectory fragments due to noisy observations and long occlusions of targets.
In contrast to those methods which only consider past information, several approaches have been proposed to optimize multiple trajectories globally, i.e. by considering some "future" frames throughout the video sequence or in a time sliding window. A graphical example is shown in Figure 1.7. Some representative works include [51, 3, 92, 36, 55]. The underlying philosophy is that observing more frames before making association decisions should generally help overcome the ambiguities caused by long-term occlusions and false or missed detections. However, the appearance models in these methods are aimed at making the targets distinguishable from their neighborhood in the background, rather than from each other.
Figure 1.6: The scheme of considering past and current frames for tracking. (a-g) When the current frame is observed, the corresponding detection responses are obtained by an object detector, and then the tracker associates the responses with the existing tracks. (h) Final tracking results.
Figure 1.7: The scheme of considering past and "future" frames in a time sliding window. (a) The object detector is applied in all frames in a time sliding window. (b,c) Global optimization of multiple trajectories. (d) Final tracking results.
1.4 Overview of our approach
We present a detection-based tracking approach with a hierarchical tracklet association framework [36] and on-line learned discriminative appearance models (OLDAMs) [48]. To process a video sequence, a time sliding window method is applied to segment it into smaller sections. In each time sliding window, a state-of-the-art object detector is applied in every frame. Given the detection responses, a dual-threshold strategy is used to generate short but reliable tracklets. We define a pair of tracklets as a training sample. To collect on-line positive and negative training samples, spatio-temporal constraints are applied to those reliable tracklets. Several image descriptors at multiple locations and the corresponding similarity measurements are computed as features, which are combined into OLDAMs by an AdaBoost algorithm. We integrate OLDAMs and other cues (e.g. motion and time) into a hierarchical association framework to link short tracklets into longer and longer tracks and finally the desired target trajectories.
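As an illustration of this learning step (a sketch only, not our exact implementation), each tracklet pair can be represented by a vector of similarity scores, and an off-the-shelf AdaBoost classifier over decision stumps can play the role of combining them into a single appearance affinity; the feature dimensionality and training data below are placeholders.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Placeholder data: each row holds similarity scores between one tracklet
# pair (e.g., color-histogram, HOG, and covariance similarities at several
# local regions); label 1 = same target, 0 = different targets.
rng = np.random.default_rng(0)
X_train = rng.random((200, 12))
y_train = rng.integers(0, 2, 200)

# AdaBoost over decision stumps (the scikit-learn default weak learner):
# each stump thresholds one similarity measurement, mirroring the
# feature-selection role of boosting described above.
model = AdaBoostClassifier(n_estimators=50)
model.fit(X_train, y_train)

# Appearance affinity of a new tracklet pair = probability of "same target".
affinity = model.predict_proba(rng.random((1, 12)))[0, 1]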
To improve our multi-target tracking performance, we make efforts in two different directions: a) exploring a more advanced method to combine different cues, including OLDAMs, in our tracklet-association framework; b) boosting the OLDAMs with the help of appearance-based person identity recognition. The former builds a general affinity model which incorporates multiple types of features in an off-line learning framework as in [55]; the proposed HybridBoost algorithm combines the merits of the RankBoost and AdaBoost algorithms. We adopt this method to establish a more robust affinity model to measure the similarity of any two tracklets. The latter is inspired by the topic of person identity recognition. In our design, the tracklets are classified into gallery tracklets and query tracklets, and different target-specific appearance models are learned for them. We term this Person Identity Recognition based Multi-Person Tracking method PIRMPT.
We further extend our approach to multiple non-overlapping cameras [47]. Given the multi-target tracking results in each camera, we propose a framework to associate those tracks by OLDAMs. A training sample is defined as a pair of tracks from two cameras respectively. However, it is difficult to collect on-line training samples based on spatio-temporal constraints since exact positive samples are ambiguous. To solve this problem, we form several potentially linked pairs of tracks into one positive "bag", which allows some flexibility in the labelling process. Once the training samples are collected, a Multiple Instance Learning (MIL) algorithm is applied to select discriminative features for OLDAMs. Combined with spatio-temporal cues, an association framework is applied for linking multiple tracks from multiple non-overlapping cameras to form the final target trajectories across cameras.
1.5 Summary of results and contributions
Pedestrians are used as the class of interest to demonstrate the robustness of our proposed approach. We evaluate our multi-pedestrian tracking system on several public datasets with surveillance scenarios, either in the setting of a single camera or of multiple non-overlapping cameras. Frequent and heavy occlusions are present and the background scene can be highly cluttered. The camera can be stationary or potentially moving. The environment can be indoors or outdoors with possible illumination changes. The
experimental results show that our methods outperform the state-of-the-art algorithms
for multiple target tracking.
The major contributions of this thesis work include:
On-line Learned Discriminative Appearance Models (OLDAMs). OLDAMs consist of strategies for collecting on-line training samples, image descriptors, appearance affinity models, and a learning framework. They are updated in an on-line fashion to adapt to the current targets at runtime. This work was originally published in [48].
A Person Identity Recognition based Multi-Person Tracking method (PIRMPT). In the tracklet association stage, we divide all short but reliable tracklets into two groups: "gallery" tracklets and "query" tracklets. Each group has its own strategy for learning the appearance models. This work was originally published in [50].
Combination of on-line learning and off-line learning for tracklet association. We present a unified framework combining two approaches, which learn the appearance cue in an on-line phase and other cues in an off-line phase, to boost the tracking accuracy. This work was originally published in [55, 48].
A framework for inter-camera association of multiple target tracks. An association matrix between two cameras is proposed, and the Multiple Instance Learning (MIL) algorithm is applied to solve the ambiguity of collecting on-line training samples and to learn effective appearance models. This work was originally published in [47].
1.6 Outline
This thesis is organized as follows: we begin with a review of related work in Chapter 2. The basic tracklet linking method is introduced in Chapter 3. Our framework for multi-target tracking in a single camera using on-line learned discriminative appearance models (OLDAMs) is presented in Chapter 4. The Person Identity Recognition based Multi-Person Tracking (PIRMPT) method is described in Chapter 5. The off-line learning approach which better combines different cues for tracklet association is discussed in Chapter 6. A system to associate multiple tracks across multiple non-overlapping cameras is shown in Chapter 7. Finally, the future directions of our research are given in Chapter 8.
Chapter 2
Related Work
Over the past decade, a vast amount of work has been published on multi-target tracking. Here we review previous approaches to multi-target tracking in a single camera and across multiple cameras. Those that are more relevant to the proposed system are discussed in Sections 4.2, 5.2, 6.2, and 7.2 respectively.
2.1 Multiple target tracking in a single camera
Early methods, e.g. [38, 94, 95, 72, 67], usually rely on background subtraction from a static camera to locate the targets. Zhao et al. [94] track motion blobs and assume that each individual blob corresponds to one human. This type of method usually does not consider multiple objects jointly and tends to fail when blobs merge or split. The works in [38, 95, 72, 67] try to fit multiple object hypotheses to explain the foreground or motion blobs. All of these methods have shown experiments with a stationary camera only, where background subtraction provides relatively robust object motion blobs. However, the foreground-blob-based methods are not discriminative; all moving pixels are assumed to be humans, which is not true in more general situations.
Due to the impressive advances in object detection [19, 84, 34, 70, 78, 24, 49, 35], detection-based tracking methods have gained more and more attention since they are essentially more robust in complex environments. They are also insensitive to situations where the camera is moving. Given the detection responses generated by the detector, the tracking method needs to retrieve the real objects among those responses and maintain identities for each of them. The main challenge is that the detector output is unreliable and sparse. To deal with this data association problem, particle filtering frameworks are adopted to represent the tracking uncertainty in a Markovian manner by only considering information from past frames. Okuma et al. [60] and Cai et al. [14] combine tracking-by-detection with particle filtering by using final detections to initialize color-based tracker samples. Breitenstein et al. [11, 12] follow a similar framework with the continuous confidence of pedestrian detectors and online trained classifiers as a graded observation model. Although this type of method is suitable for time-critical applications since no information from the future is required, it may be prone to errors due to noisy observations and long occlusions of targets.
In contrast to those methods which only consider past information, more and more approaches have been proposed to optimize multiple trajectories globally, i.e. by considering future frames. Leibe et al. [51] use Quadratic Boolean Programming to couple the detection and estimation of trajectory hypotheses. Andriluka et al. [3] apply the Viterbi algorithm to obtain optimal object sequences. Zhang et al. [92] use a cost-flow network to model the MAP data association problem. Xing et al. [88] combine local tracklet filtering and global tracklet association. Huang et al. [36] propose a hierarchical association
framework to link shorter tracklets into longer tracks. Li et al. [55] adopt a similar structure to [36] and present a HybridBoost algorithm to learn the affinity models between two tracklets. The underlying philosophy is that observing more frames before making association decisions should generally help overcome the ambiguities caused by long-term occlusions and false or missed detections.
Recently, there have been several interesting works in different directions. Yang et al. [89] argue that the independence assumption of tracklets is not valid, and adopt a learning-based Conditional Random Field (CRF) model to consider both tracklet affinities and dependencies among them. Andriyenko et al. [4] extend Integer Linear Programming (ILP) by discretizing the space of target locations into a hexagonal lattice. Their later work [5] formulates multi-target tracking as the minimization of a continuous non-convex energy function and uses jump moves to find the solution. Brendel et al. [83] transform the tracklet data-association problem into the problem of finding the Maximum-Weight Independent Set (MWIS) of a graph. Pirsiavash et al. [64] closely follow the cost-flow network [92] and apply a greedy, successive shortest-path algorithm to reduce the execution time. Wu et al. [87] construct a "track graph" by a two-stage network-flow process and perform occlusion reasoning on multiple objects in single or multiple views using set-cover techniques. Benfold et al. [8] use a multi-threaded approach and combine asynchronous HOG detections with simultaneous KLT tracking and Markov-Chain Monte-Carlo Data Association (MCMCDA) to achieve real-time tracking in high-definition video.
Tracking targets and maintaining their identities requires strong appearance models to distinguish each target. However, there has been relatively little attention given to the development of discriminative appearance models among different targets. [36, 55, 86, 88, 92] use only a color histogram as their appearance model with different affinity measures such as the $\chi^2$ distance, Bhattacharyya coefficient, and correlation coefficient. To enhance the appearance model for tracking, several methods [6, 18, 31] obtain dynamic feature selection or the target observation model by online learning techniques. However, the appearance models in these methods are aimed at making the targets distinguishable from their neighborhood in the background, rather than from each other.
2.2 Multiple target tracking across multiple cameras
There is a large amount of work, e.g. [13, 17, 45], on multi-camera tracking with overlapping fields of view. These methods usually require camera calibration and environmental models to track targets. However, the assumption that cameras have overlapping fields of view is not always practical due to the large number of cameras required and the physical constraints upon their placement.
In the literature, [37, 62, 43] represent some early work on multi-camera tracking with non-overlapping fields of view. To establish correspondence between objects in different cameras, spatio-temporal information and appearance relationships are two important cues. For the spatio-temporal cue, Javed et al. [39] propose a method to learn the camera topology and path probabilities of objects using Parzen windows. Dick and Brooks [20] use a stochastic transition matrix to describe people's observed patterns of motion both within and between fields of view. Makris et al. [57] investigate the unsupervised learning of a model of activity from a large set of observations without hand-labeled correspondence.
Gilbert and Bowden [30] propose an incremental learning method to model posterior
probability distributions of spatio-temporal links between cameras.
For the appearance cue, Porikli [65] derives a non-parametric function to model color distortion for pair-wise camera combinations using correlation matrix analysis and dynamic programming. Javed et al. [40] show that the brightness transfer functions (BTFs) from a given camera to another camera lie in a low-dimensional subspace and demonstrate that this subspace can be used to compute appearance similarity. Gilbert and Bowden [30] learn the BTFs incrementally based on the Consensus-Color Conversion of Munsell color space [76]. Chen et al. [15] propose adaptive learning of BTFs using a Markov Chain Monte Carlo (MCMC) sampling algorithm.
The methods discussed above usually combine cues such as temporal information and appearance relationships in their systems. Some recent works focus on the appearance cue only, without any temporal constraints, and formulate the target correspondence problem across different cameras as a re-identification problem. Gheissari et al. [29] propose two human identification methods, based on graph-based spatiotemporal segmentation and decomposable triangulated graphs. Gray et al. [33] present an approach for viewpoint-invariant pedestrian recognition using an efficiently and intelligently designed object representation. Farenzena et al. [23] design a feature extraction and matching strategy based on the localization of perceptually relevant human parts.
Chapter 3
Hierarchical Association-Based Multi-Target Tracking
Hierarchical association-based multi-target tracking was originally published by Huang et al. [36]. This thesis work follows a similar formulation but has many novelties and improvements in the appearance modelling of targets. To be self-contained, a brief review of [36] is given here, along with some implementation details.
3.1 Introduction
Tracking multiple targets is important for many computer vision applications. It is a challenging task, especially in complex and crowded environments where the background is cluttered and the targets may have similar appearances, complicated interactions, and heavy occlusions with each other.
To tackle this difficult problem, several earlier feature-based tracking methods were based on color, salient points, or motion blobs. However, those methods do not have a discriminative model that distinguishes the target of interest from others. Due to the recent improvements in the performance of object detection algorithms, applying an object detector helps the tracker overcome this limitation. It also provides strong
evidence for automatic initialization and termination of target trajectories. Nevertheless, the accuracy of the state-of-the-art object detectors is still far from perfect. Missed detections, false alarms, and inaccurate responses still happen frequently. Given the detection responses, how to associate true detections, remove false alarms, and produce smooth trajectories becomes the pivotal issue in detection-based multi-target tracking.
There are two main streams in detection-based tracking methods: 1) making the tracking decision at frame $t$ by only considering the observations from past frames $0 \ldots t-1$ and the current frame $t$; 2) making the tracking decision at frame $t$ using past frames, the current frame, and some "future" frames $t \ldots t+N$. The former is suitable for time-critical applications since the decision is made immediately when the current frame arrives. However, it is sensitive to false alarms and inaccurate detections; it also creates broken tracking trajectories if long occlusions exist. The latter method usually takes a certain number of frames as input, i.e., some latency is needed, and then finds the global optimum over this time sliding window. The underlying philosophy is that observing more frames before making association decisions should generally help solve the issue of occlusions and produce complete tracking trajectories. Besides, it is also useful for preventing wrong associations, removing false alarms, and recovering missed detections.
A common approach for 2) is to use a two-stage tracking approach which generates "tracklets", i.e. partial target tracks, in the first stage, and then globally merges the generated tracklets to form the final tracking trajectories in the second stage. This method is referred to as tracklet stitching or tracklet linking. Stauffer [75] first obtains tracklets by performing a conservative frame-to-frame correspondence, and then associates these tracklets by the Hungarian algorithm with an extended transition matrix that considers
initialization and termination of each tracklet. He proposes a parametric source/sink model for tracklet initialization and termination, and solves the coupled problem (estimating the source/sink model and associating tracklets) in an EM framework. Perera et al. [63] modify this extended transition matrix to maintain object identities even if their trajectories are merged or split. Singh et al. [71] use a Multiple Hypothesis Tracker to grow tracklets before associating them by the Hungarian algorithm. In these methods, the transition matrix for the Hungarian algorithm is computed only once and fixed for association. Hence, errors in affinity computation caused by inaccurate detections are hard to alleviate during the association process and are likely to be propagated to the higher-level analysis.
Following the tracklet linking methodology, Huang et al. [36] proposed a three-level hierarchical framework to generate target trajectories by progressively associating detection responses. This thesis work takes [36] as a basis and develops several novel ideas to achieve robust performance.
3.2 System Overview
The tracking system takes a video sequence as input, and outputs the number of tracked targets as well as the trajectory of each. For an input video sequence, we extract a certain number of beginning frames as a buffer, and apply a state-of-the-art object detector to each of these cached frames. After the hierarchical association of detection responses is done in the current cached frames, we move the time sliding window to the next
Level          Motion    Appearance  Association  Scene model  Coordinates
Low level      N/A       Raw         Direct link  N/A          Image
Middle level   Dynamic   Refined     Hungarian    General      Image
High level     Dynamic   Refined     Hungarian    Specific     Ground plane
Table 3.1: Models and methods used in different levels of the hierarchical framework in [36].
position, which overlaps the previous time sliding window. This process is executed iteratively until the end of the video.
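The buffered, overlapping sliding-window loop can be sketched as follows; the window length, overlap, and the detect/associate callables are illustrative assumptions, not the actual parameters of our system.

# Sketch of the time-sliding-window processing loop (illustrative parameters).
WINDOW = 300   # number of cached frames per window (assumption)
OVERLAP = 50   # frames shared with the previous window (assumption)

def track_video(frames, detect, associate):
    """detect(frame) -> detection responses; associate(responses, tracks)
    -> tracks extended by hierarchical association within one window."""
    tracks = []
    start = 0
    while start < len(frames):
        window = frames[start:start + WINDOW]
        responses = [detect(f) for f in window]   # detector on every frame
        # Associate within the window; the overlap lets tracks from the
        # previous window be stitched to tracklets in the current one.
        tracks = associate(responses, tracks)
        start += WINDOW - OVERLAP                 # slide with overlap
    return tracks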
How to associate those detection responses and form the tracking trajectories is a difficult problem. An effective three-level hierarchical framework was proposed in [36], which is summarized in Table 3.1.
First, at the low level, reliable tracklets are generated by linking detection responses in any two consecutive frames in a time sliding window. A conservative two-threshold strategy is used to prevent "unsafe" associations until more evidence is collected to reduce the ambiguity at higher levels. Second, at the middle level, the short tracklets obtained at the low level are iteratively associated into longer and longer tracklets. This is formulated as a MAP problem that considers not only initialization, termination, and transition of tracklets but also hypotheses of tracklets being false alarms. In each round, positions and velocities of each input tracklet are estimated. This information helps refine the appearance model, and additionally provides a motion model to characterize the target. A modified transition matrix is computed and sent to the Hungarian algorithm to obtain the optimal association. Finally, at the high level, a scene structure model, including three maps for entries, exits, and scene occluders, is estimated based on the tracklets provided by the middle level. Afterward, the long-range trajectory association is performed with
the help of scene-knowledge-based reasoning to reduce trajectory fragmentation and prevent possible identity switches.
Figure 3.1: The block diagram of our multi-target tracking framework based on the hierarchical association method.
In this thesis work, the high-level association part, which involves the inference of scene occluders, is excluded. We only use the low-level (first-stage) and mid-level (second-stage) association in our framework. The block diagram of the system is given in Figure 3.1.
3.3 Hierarchical tracklet association
In this section, we describe the two stages of association proposed in [36]. In the first stage, the detection responses in any two neighboring frames are linked by a dual-threshold strategy, which is biased toward linking the "safe" pairs of detection responses to form reliable tracklets. In the second stage, the tracklets obtained in the first stage are iteratively linked into longer and longer trajectories. The MAP formulation is used to simultaneously find the optimal association of tracklets and determine track initialization, termination, missed detection recovery, and false alarm removal using the Hungarian algorithm. The overview of hierarchical tracklet association is presented in Figure 3.2. More details are presented in later sections.
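To make the second stage concrete, the sketch below casts one round of tracklet association as a bipartite assignment solved by the Hungarian algorithm (SciPy's linear_sum_assignment). The extended cost matrix with initialization and termination entries follows the spirit of the transition-matrix formulations cited earlier; link_cost and the termination penalty are hypothetical stand-ins for the negative log affinities defined later in this chapter.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_round(tracklets, link_cost, term_cost=5.0):
    """One round of tracklet association (sketch).

    link_cost(a, b): assumed -log affinity of linking the tail of tracklet
    a to the head of tracklet b; term_cost is an assumed penalty for a
    tracklet starting or terminating a trajectory.
    """
    n = len(tracklets)
    BIG = 1e9  # effectively forbids an assignment
    # Extended (2n x 2n) matrix: the upper-left block holds link costs;
    # the off-diagonal blocks let each tracklet instead start or terminate.
    C = np.full((2 * n, 2 * n), BIG)
    for i in range(n):
        for j in range(n):
            if i != j:
                C[i, j] = link_cost(tracklets[i], tracklets[j])
        C[i, n + i] = term_cost   # tracklet i terminates its trajectory
        C[n + i, i] = term_cost   # tracklet i starts a new trajectory
    C[n:, n:] = 0.0               # dummy-to-dummy pairings cost nothing
    rows, cols = linear_sum_assignment(C)
    # Keep only real-to-real assignments: link tail of i to head of j.
    return [(i, j) for i, j in zip(rows, cols) if i < n and j < n]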
Figure 3.2: The overview of hierarchical tracklet association. (a) Detection responses in each frame in a time sliding window. (b) Reliable tracklets by low-level association. (c) Refined tracklets by Kalman filtering. (d) Round 1 of mid-level association. (e) Refined tracklets by Kalman filtering. (f) Interpolation of missing responses and removal of false detections. (g) Round 2 of mid-level association. (h) Refined tracklets by Kalman filtering; interpolation of missing responses and removal of false detections.
3.3.1 Low-level Association
We denote by $\mathcal{R} = \{r_i^t\}$ the set of all detection responses in a time sliding window, where $i$ is the detection index and $t$ is the frame index. Each response $r = \{x, y, s, \mathbf{c}\}$ consists of the center position $(x, y)$ in the image plane, the size $s$, and the holistic color histogram $\mathbf{c}$. Given $\mathcal{R}$ as the input, the goal is to link the responses frame by frame using a simple and efficient method. Similar to [86, 36], the link probability between any two responses in two neighboring frames is defined as the product of three affinities:
\[
P_{\text{link}}(r_i^t, r_j^{t+1}) = A_{\text{pos}}(r_i^t, r_j^{t+1})\, A_{\text{size}}(r_i^t, r_j^{t+1})\, A_{\text{appr}}(r_i^t, r_j^{t+1}) \tag{3.1}
\]
where $A_{\text{pos}}$, $A_{\text{size}}$, and $A_{\text{appr}}$ are affinity measurements based on position, size, and appearance respectively. Their definitions are:
\[
\begin{aligned}
A_{\text{pos}}(r_i^t, r_j^{t+1}) &= \gamma_{\text{pos}} \exp\left[-\frac{(x_i^t - x_j^{t+1})^2}{\sigma_x^2}\right] \exp\left[-\frac{(y_i^t - y_j^{t+1})^2}{\sigma_y^2}\right] \\
A_{\text{size}}(r_i^t, r_j^{t+1}) &= \gamma_{\text{size}} \exp\left[-\frac{(s_i^t - s_j^{t+1})^2}{\sigma_s^2}\right] \\
A_{\text{appr}}(r_i^t, r_j^{t+1}) &= BC(\mathbf{c}_i^t, \mathbf{c}_j^{t+1})
\end{aligned} \tag{3.2}
\]
where $\gamma_{\text{pos}}$ and $\gamma_{\text{size}}$ are normalization factors, and $BC(\cdot)$ is the Bhattacharyya distance between two color histograms.
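The affinities in (3.1) and (3.2) translate directly into code. In the sketch below the sigma and gamma values are illustrative assumptions, and the appearance term uses the Bhattacharyya coefficient of normalized histograms (the similarity-valued counterpart of the distance mentioned above) so that all three factors grow with similarity.

import numpy as np

def bhattacharyya_coeff(c1, c2):
    """Bhattacharyya coefficient of two normalized color histograms."""
    return float(np.sum(np.sqrt(c1 * c2)))

def link_probability(r1, r2, sx=10.0, sy=10.0, ss=5.0,
                     g_pos=1.0, g_size=1.0):
    """P_link of Eqs. (3.1)-(3.2); sigma/gamma values are assumptions.

    r1, r2: dicts with center 'x', 'y', size 's', and histogram 'c'.
    """
    a_pos = g_pos * np.exp(-(r1['x'] - r2['x']) ** 2 / sx ** 2) \
                  * np.exp(-(r1['y'] - r2['y']) ** 2 / sy ** 2)
    a_size = g_size * np.exp(-(r1['s'] - r2['s']) ** 2 / ss ** 2)
    a_appr = bhattacharyya_coeff(r1['c'], r2['c'])
    return a_pos * a_size * a_appr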
A dual-threshold strategy is applied to associate pairs of detections in two neighboring frames in a conservative manner. It links a pair of responses $r_i^t$ and $r_j^{t+1}$ if the following conditions are met: 1) their affinity is high enough; 2) their affinity is significantly higher than that of any other pair involving either $r_i^t$ or $r_j^{t+1}$. This method prevents "unsafe" associations until more evidence is collected to solve the ambiguity at later stages.
In our implementation, a linking score matrix S
t
is computed given all detection
responses in frame t and t + 1. Each element S
t
(i;j) is determined by (3.1). Two
responses r
t
i
and r
t+1
j
are linked if S
t
(i;j) is higher than a threshold
1
; and S
t
(i;j)
exceeds any other elements in thei-th row andj-th column of S by another threshold
2
.
This procedure is executed from t = 0 to t = T 1, where T is the length of the time
sliding window.
Based on this simple dual-threshold strategy, we can efficiently generate a set of reliable tracklets. Since this method is biased to link confident pairs of detection responses, in general the output has very few ID switches (no incorrect association is made) but many fragmentations (correct associations are not made). We further stitch those "broken" tracks in the hierarchical tracklet association framework.
3.3.2 Mid-level Association
Given a set of short but reliable tracklets as described above, data association is applied iteratively to link the tracklets. Each round takes the tracklets generated in the previous round as input and does further association. At round $k$, given the tracklet set $\mathcal{T}^{k-1} = \{T_i^{k-1}\}$ from round $k-1$, the tracker tries to link the tracklets into longer ones that form a new tracklet set $\mathcal{T}^k = \{T_j^k\}$, in which $T_j^k = \{T_{i_0}^{k-1}, T_{i_1}^{k-1}, \ldots, T_{i_{l_j}}^{k-1}\}$, and $l_j$ is the number of tracklets in $T_j^k$. To obtain an optimal association result, this process can be formulated as a MAP problem:
$$\mathcal{T}^{k*} = \arg\max_{\mathcal{T}^k} P(\mathcal{T}^k \mid \mathcal{T}^{k-1}) = \arg\max_{\mathcal{T}^k} P(\mathcal{T}^{k-1} \mid \mathcal{T}^k)\, P(\mathcal{T}^k) = \arg\max_{\mathcal{T}^k} \prod_{T_i^{k-1} \in \mathcal{T}^{k-1}} P(T_i^{k-1} \mid \mathcal{T}^k) \prod_{T_j^k \in \mathcal{T}^k} P(T_j^k) \quad (3.3)$$
assuming that the likelihoods of input tracklets are conditionally independent given $\mathcal{T}^k$, and that the tracklet associations $\{T_j^k\}$ are independent of each other.
A Bernoulli distribution is used to model the probability of a detection response being a true detection or a false alarm. Let $\beta$ be the precision of the detector; the likelihood of an input tracklet is defined as:
$$P(T_i^{k-1} \mid \mathcal{T}^k) = \begin{cases} P_+(T_i^{k-1}) = \beta^{|T_i^{k-1}|}, & \exists\, T_j^k \in \mathcal{T}^k,\ T_i^{k-1} \in T_j^k \\ P_-(T_i^{k-1}) = (1-\beta)^{|T_i^{k-1}|}, & \forall\, T_j^k \in \mathcal{T}^k,\ T_i^{k-1} \notin T_j^k \end{cases} \quad (3.4)$$
where $|T_i^{k-1}|$ is the number of detection responses in $T_i^{k-1}$, and $P_+(T_i^{k-1})$ and $P_-(T_i^{k-1})$ are the likelihoods of $T_i^{k-1}$ being a true detection and a false alarm respectively.
The tracklet association priors in (3.3) are modeled as Markov chains:
$$P(T_j^k) = P_{init}(T_{i_0}^{k-1})\, P_+(T_{i_0}^{k-1})\, P_{link}(T_{i_1}^{k-1} \mid T_{i_0}^{k-1}) \cdots P_{link}(T_{i_{l_k}}^{k-1} \mid T_{i_{l_k-1}}^{k-1})\, P_+(T_{i_{l_k}}^{k-1})\, P_{term}(T_{i_{l_k}}^{k-1}) \quad (3.5)$$
which is composed of an initialization term $P_{init}(T_{i_0}^{k-1})$, a termination term $P_{term}(T_{i_{l_k}}^{k-1})$, and a series of transition terms $P_{link}(T_{i_{x+1}}^{k-1} \mid T_{i_x}^{k-1})$. Definitions of these terms will be given later.
Constrained by the non-overlap assumption that $T_i^{k-1}$ cannot belong to more than one $T_j^k$, we have
$$\mathcal{T}^{k*} = \arg\max_{\mathcal{T}^k} \prod_{T_i^{k-1}:\, \forall T_j^k,\ T_i^{k-1} \notin T_j^k} P_-(T_i^{k-1}) \prod_{T_j^k \in \mathcal{T}^k} \Big[ P_{init}(T_{i_0}^{k-1})\, P_+(T_{i_0}^{k-1})\, P_{link}(T_{i_1}^{k-1} \mid T_{i_0}^{k-1}) \cdots P_{link}(T_{i_{l_k}}^{k-1} \mid T_{i_{l_k-1}}^{k-1})\, P_+(T_{i_{l_k}}^{k-1})\, P_{term}(T_{i_{l_k}}^{k-1}) \Big] \quad (3.6)$$
This MAP formulation has a distinct property compared to other work such as [75, 63]: it allows $\mathcal{T}^k$ to exclude some input tracklets, rejecting them as false alarms instead of receiving their initialization/termination penalties or transition terms by linking them. To solve this MAP problem, both the cost-flow method [92] and the Hungarian method [36] have shown their efficiency. Here we describe the Hungarian method.
Assuming that there are $n$ input tracklets in the time sliding window, the MAP problem in (3.6) can be transformed into a standard assignment problem by defining a transition matrix of size $2n \times 2n$:
$$C = \begin{bmatrix}
C_{11} & C_{12} & \cdots & C_{1n} & C_{1(n+1)} & -\infty & \cdots & -\infty \\
C_{21} & C_{22} & \cdots & C_{2n} & -\infty & C_{2(n+2)} & \cdots & -\infty \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
C_{n1} & C_{n2} & \cdots & C_{nn} & -\infty & -\infty & \cdots & C_{n(2n)} \\
C_{(n+1)1} & -\infty & \cdots & -\infty & 0 & 0 & \cdots & 0 \\
-\infty & C_{(n+2)2} & \cdots & -\infty & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
-\infty & -\infty & \cdots & C_{(2n)n} & 0 & 0 & \cdots & 0
\end{bmatrix} \quad (3.7)$$
whose components are defined as
$$C_{ij} = \begin{cases}
\ln P_-(T_i), & \text{if } i = j \le n \\
\ln P_{link}(T_j \mid T_i) + 0.5\,[\ln P_+(T_i) + \ln P_+(T_j)], & \text{if } i, j \le n \text{ and } i \ne j \\
\ln P_{init}(T_j) + 0.5 \ln P_+(T_j), & \text{if } i = j + n \\
\ln P_{term}(T_i) + 0.5 \ln P_+(T_i), & \text{if } j = i + n \\
0, & \text{if } i > n \text{ and } j > n \\
-\infty, & \text{otherwise}
\end{cases} \quad (3.8)$$
Note that the superscript $k-1$ has been omitted for simplicity, and the indices $i$ and $j$ are row and column indexes in the matrix.
As stated before, the MAP formulation in this framework takes false alarm hypotheses into account. In particular, this is represented by the diagonal components of the transition matrix: each one is set to the logarithmic likelihood of the tracklet being a false alarm, and the self-association of a tracklet is equivalent to rejecting it as a false alarm, since it can then not be associated with any other tracklet, initialization, or termination.
Denoting by $\Gamma^* = [\gamma^*_{ij}]_{2n \times 2n}$ the optimal assignment matrix obtained by applying the Hungarian algorithm to the transition matrix $C$, for each $\gamma^*_{ij} = 1$:
(1) if $i = j \le n$, $T_i$ is considered a false alarm;
(2) if $i, j \le n$ and $i \ne j$, the tail of $T_i$ is linked to the head of $T_j$;
(3) if $i = j + n$, $T_j$ is initialized as the head of the generated trajectory;
(4) if $j = i + n$, $T_i$ is terminated as the tail of the generated trajectory.
In this way, we can compute $\mathcal{T}^{k*}$ and its corresponding tracklet set.
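As an illustration of this assignment step, the following sketch builds the $2n \times 2n$ matrix of (3.7)-(3.8) and solves it with scipy's linear_sum_assignment, a Hungarian-type solver. Since scipy rejects non-finite costs, a large negative constant stands in for $-\infty$; the arrays of log-probabilities are assumed to be precomputed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

NEG = -1e9  # stands in for -infinity; scipy rejects non-finite entries

def associate_tracklets(log_link, log_plus, log_minus, log_init, log_term):
    """Solve the tracklet association MAP of Eq. (3.6) as an assignment
    problem.  log_link[i, j] = ln P_link(T_j | T_i); the other arguments
    are length-n vectors of ln P_+, ln P_-, ln P_init, ln P_term."""
    n = len(log_plus)
    C = np.full((2 * n, 2 * n), NEG)
    for i in range(n):
        for j in range(n):
            C[i, j] = (log_minus[i] if i == j else
                       log_link[i, j] + 0.5 * (log_plus[i] + log_plus[j]))
        C[i, n + i] = log_term[i] + 0.5 * log_plus[i]   # termination score
        C[n + i, i] = log_init[i] + 0.5 * log_plus[i]   # initialization score
    C[n:, n:] = 0.0
    rows, cols = linear_sum_assignment(C, maximize=True)
    links, false_alarms, heads, tails = [], [], [], []
    for i, j in zip(rows, cols):
        if i < n and j < n:
            # (i, j): tail of T_i linked to head of T_j (case 2); i == j
            # means self-association, i.e. T_i rejected as false alarm
            (false_alarms if i == j else links).append((i, j))
        elif i >= n and j < n:
            heads.append(j)   # T_j starts a trajectory (case 3)
        elif i < n and j >= n:
            tails.append(i)   # T_i ends a trajectory (case 4)
    return links, false_alarms, heads, tails
```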
The crucial term in this framework is $P_{link}(T_j \mid T_i)$, the linking probability between two tracklets. It is designed as the product of three important components (appearance, motion, and time):
$$P_{link}(T_j \mid T_i) = A_{appr}(T_j \mid T_i)\, A_m(T_j \mid T_i)\, A_t(T_j \mid T_i) \quad (3.9)$$
To alleviate the noise from inaccurate detections, for each input tracklet a Kalman filter is used to refine the positions and sizes of its detection responses and to estimate their velocities. Color histograms of the detection responses are also recomputed and integrated into a refined color histogram $\mathbf{a}_i$ for the tracklet by a RANSAC method. The appearance affinity is defined by a Gaussian distribution:
$$A_{appr}(T_j \mid T_i) = G\big(corr(\mathbf{a}_i, \mathbf{a}_j);\, 0,\, \sigma_c\big) \quad (3.10)$$
where $corr(\cdot)$ calculates the correlation between $\mathbf{a}_i$ and $\mathbf{a}_j$.
The motion affinity is defined as
$$A_m(T_j \mid T_i) = G\big(x_i^{tail} + v_i^{tail}\,\Delta t - x_j^{head};\ \Sigma_j\big)\, G\big(x_j^{head} - v_j^{head}\,\Delta t - x_i^{tail};\ \Sigma_i\big) \quad (3.11)$$
where $\Delta t$ is the frame gap between the tail (the last detection response) of $T_i$ and the head (the first detection response) of $T_j$; $x_j^{head}$ (or $x_i^{tail}$) and $v_j^{head}$ (or $v_i^{tail}$) are the refined position and estimated velocity of $T_j$ (or $T_i$) at the head (or tail); see Figure 3.2 for an illustration. The difference between the predicted position and the observed position is assumed to obey a zero-mean Gaussian distribution.
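A minimal sketch of (3.11) follows, with a single scalar sigma standing in for the covariances $\Sigma_i$, $\Sigma_j$ (an isotropic assumption for illustration); positions and velocities are 2-D vectors from the Kalman-refined tracklets.

```python
import numpy as np

def motion_affinity(x_tail_i, v_tail_i, x_head_j, v_head_j, dt, sigma=5.0):
    """Motion affinity of Eq. (3.11): the tail of T_i is extrapolated forward
    and the head of T_j backward over the frame gap dt; each prediction error
    is scored under an (unnormalized) zero-mean isotropic Gaussian."""
    def gauss(err):
        return np.exp(-float(np.dot(err, err)) / (2.0 * sigma ** 2))
    x_tail_i, v_tail_i = np.asarray(x_tail_i), np.asarray(v_tail_i)
    x_head_j, v_head_j = np.asarray(x_head_j), np.asarray(v_head_j)
    fwd = x_tail_i + v_tail_i * dt - x_head_j   # forward prediction error
    bwd = x_head_j - v_head_j * dt - x_tail_i   # backward prediction error
    return gauss(fwd) * gauss(bwd)
```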
The temporal affinity limits the maximum frame gap between two associated tracklets, and measures the probability of missed detections within the gap:
$$A_t(T_j \mid T_i) = \begin{cases} Z\,\alpha^{\Delta t - 1 - \omega}, & \text{if } \Delta t \in [1, \xi] \\ 0, & \text{otherwise} \end{cases} \quad (3.12)$$
where $\alpha$ is the missed detection rate of the detector, $\xi$ is an upper bound of the frame gap, and $Z$ is a normalization factor. Within the frame gap, $\omega$ is the number of frames in which the tracked object is occluded by other objects, and $\Delta t - 1 - \omega$ is the number of frames in which the tracked object is visible but missed by the detector. In practice, to compute $\omega$, we interpolate detection responses within the frame gap and check whether they are occluded by other objects by applying the occupancy map based occlusion reasoning method in [86] to $\mathcal{T}^{k-1}$.
Initialization and termination probabilities of each tracklet are empirically set to be
$$P_{init}(T_i) = P_{term}(T_j) = Z\,\alpha^{\frac{1}{2}\xi} \quad (3.13)$$
So far we have elaborated the first round at the middle level. In the following rounds, tracklets with longer frame gaps are associated by progressively increasing $\xi$. In our implementation, we use a four-stage middle-level tracklet association; the maximum allowed frame gap in each stage is set to 16, 32, 64, and 128 respectively. As the shorter tracklets are gradually associated into longer tracks, the tracker gradually reduces trajectory fragmentation. However, this can increase the number of ID switches at the same time. Making the correct associations while preventing the wrong ones is the key part of this tracklet linking method. To achieve this goal, having an advanced appearance model for the targets is important. In [36], only a holistic color histogram is used, which is not sufficient for robust performance. In later chapters, we propose the on-line learned discriminative appearance models (OLDAMs) to address this problem.
Chapter 4
On-line Learned Discriminative Appearance Models
An advanced appearance model is key to the problem of multi-target tracking. To this end, we present novel on-line learned discriminative appearance models. This work was originally published in [48].
4.1 Introduction
Multi-target tracking is important for many applications such as surveillance and human-computer interaction systems. Its aim is to locate the targets, retrieve their trajectories, and maintain their identities through a video sequence; this is a highly challenging problem in crowded environments where occlusions of targets are frequent. In particular, similar appearance and complicated interactions between different targets often result in incorrect tracking results such as track fragmentation and identity switches. Figure 4.1 features a busy airport terminal [2], which is a challenging case for multi-target tracking.

Figure 4.1: Sample detection results in column (a) and sample tracking results in column (b).

Detection-based tracking methods have become popular due to recent improvements in object detection performance. These methods integrate several cues such as appearance, motion, size, and location into an affinity model to measure the similarity between detection responses or between tracklets in an association optimization framework. While many algorithms have been proposed for the association framework, there has been relatively less effort addressed to developing improved appearance models. Many previous methods simply compute the distance between two holistic color histograms for the consistency measurement.

A recent paper, [36], shows impressive tracking results on difficult datasets. It uses a hierarchical association framework to progressively link tracklets into longer ones to form the final tracking result. At each stage, given the tracklet set provided by the previous stage, the tracklet association is formulated as a MAP problem, which is solved by the Hungarian algorithm with the link probabilities between tracklets from the previous stage. In [36], this
link probability is defined by an affinity model comprising three affinity terms for motion, time, and appearance respectively:
$$P_{link}(T_j \mid T_i) = A_m(T_j \mid T_i)\, A_t(T_j \mid T_i)\, A_a(T_j \mid T_i) \quad (4.1)$$
each of which measures the likelihood of tracklets $T_i$ and $T_j$ belonging to the same target according to its own features. We follow this formulation, but with the goal of replacing the appearance probability with a significantly more discriminative model.
We propose an algorithm for online learning of discriminative appearance models; the resulting models are called OLDAMs. They are designed to give a high affinity score to tracklets which belong to the same target and a low score to tracklets which belong to different targets. On-line learning is more effective than off-line learning for this task, as it is specifically tuned to the targets present in the current scene. Given short but reliable tracklets in a time sliding window, spatial-temporal constraints are applied to collect positive and negative samples. Several image descriptors at multiple locations and the corresponding similarity measurements are computed as features, which are combined into OLDAMs by the AdaBoost algorithm. We compare the discriminative ability of OLDAMs to color histograms. We also integrate them into a hierarchical association framework, similar to that in [36]. The block diagram of our proposed system is shown in Figure 4.2. The usage of OLDAMs shows considerable improvements in tracking performance on the CAVIAR and TRECVID08 data sets, particularly in the metrics of false alarms and identity switches.
Notice that although the learning approach proposed by [55] and OLDAMs are both designed to improve the tracklet affinity model, the two methods focus on different aspects: [55] aims at off-line learning of a general affinity model that combines multiple types of cues such as tracklet length, motion, and time while using conventional appearance descriptors such as color histograms; OLDAMs are designed to discriminate specifically among the targets in the current time window according to their appearance, and to update when the targets change. Thus, the two approaches are complementary; OLDAMs can be incorporated into the approach of [55] in future work.
4.2 Related Work
Tracking multiple objects has been an active research topic in computer vision. There are two main streams in detection-based tracking methods: one considers only past and current frames to make association decisions [11, 14, 60, 86, 90]; the other takes information from future frames as well [3, 9, 36, 51, 55, 63, 88, 92]. The former usually adopts a particle filtering framework and uses detection responses to guide the tracker. It is suitable for time-critical applications since no clues from the future are required; however, it may be prone to identity switches and trajectory fragments due to noisy observations and long occlusions of targets. The latter considers both past and future frames, and then performs a global optimization which is more likely to give improved results.

There has been relatively little attention given to the development of discriminative appearance models among different targets. [36, 55, 86, 88, 92] use only a color histogram
[Figure 4.2 here: pipeline blocks for Video Sequence, Human Detector, Detection Responses, Frame-by-Frame Association, Reliable Tracklets, Spatial-Temporal Constraints, Training Samples, AdaBoost Learning, Appearance Model, Tracklets Association, and Tracking Results, all inside a time sliding window.]
Figure 4.2: The block diagram of our multi-object tracking system with on-line learned discriminative appearance models.
as their appearance model, with different affinity measures such as the $\chi^2$ distance, Bhattacharyya coefficient, and correlation coefficient. To enhance the appearance model for tracking, several methods [6, 11, 18, 31] obtain dynamic feature selection or the target observation model by online learning techniques. Several features, e.g., Haar-like features [79], SIFT-like features [28, 56], orientation histograms [19, 53], and spatiograms [10], are integrated in this framework. However, the appearance models in these methods are aimed at making the targets distinguishable from their neighborhood in the background, rather than from each other.
4.3 Overview of our approach
There are two main components in our approach: one is the strategy for online sample collection; the other is the appearance model learning method.
The sample collection strategy is based on examining spatial-temporal relations between tracklets in a time window. We rely on a dual-threshold association method [36] that is conservative and generally provides tracklets that correctly correspond to a single object. Positive samples are collected by extracting pairs of different detection responses within the same tracklet; negative samples are collected by extracting pairs of detection responses from tracklets that cannot belong to the same target based on their spatial-temporal relationships.
The model learning problem is transformed into a binary classification problem: determine whether two tracklets belong to the same target or not according to their appearance descriptors. For each detection response, appearance descriptors consisting of a color histogram, a covariance matrix, and a HOG feature are computed at multiple locations. Similarity measurements among the training samples establish the feature pool. The AdaBoost algorithm is adopted to select discriminative features from this pool and combine them into a strong classifier; the prediction confidence output by this classifier is transformed into a probability, which cooperates with other cues (e.g., motion and time) to compute the link probability between tracklets for their association. In this way, the OLDAMs are capable of effectively distinguishing different targets within the current sliding window, and of automatically adapting to new targets and varying environments in the coming sliding windows.
4.4 Online Learned Discriminative Appearance Models (OLDAMs)
The learning of OLDAMs involves four parts: sample collection, descriptor extraction, similarity measurement, and the learning algorithm. Before elaborating on them, we briefly describe the dual-threshold method used to generate the reliable tracklets.
[Figure 4.3 here: tracklets $T_1, \ldots, T_9$ over frames $t = 1, \ldots, 10$, with positive (+1) and negative (-1) pairs marked.]
Figure 4.3: The overview of the process of obtaining on-line training samples. (a) The raw detection responses. (b) The resulting reliable tracklets. (c) Positive training samples. (d) Negative training samples.
4.4.1 Reliable Tracklets
Given the detection responses, a dual-threshold strategy is used to generate short but reliable tracklets as in [36]. Note that the association here is only between two consecutive frames. The affinity of two responses is defined as the product of three measurements based on their position, size, and color histogram. Given all detection responses in two neighboring frames, a matching score matrix $S$ can be formed. Two responses $r_i$ and $r_j$ are linked if their affinity score $S(i,j)$ is higher than a threshold $\theta_1$ and exceeds any other element in the $i$-th row and $j$-th column of $S$ by another threshold $\theta_2$. This strategy is conservative and biased to link only reliable associations.
4.4.2 Collecting training samples
We propose a method to collect positive and negative training samples using spatial-temporal constraints. Based on the tracklets generated by the dual-threshold strategy, we make two assumptions: 1) responses in one tracklet describe the same target; 2) any responses in two different tracklets which overlap in time represent different targets. The first results from observing that the tracklets are reliable; the second is based on the observation that one object cannot belong to two different trajectories at the same time. Additionally, we denote certain tracklets as not being associable based on their spatial relations: if the frame gap between two tracklets is small but they are spatially far apart, we consider them to belong to different targets, based on the observation that tracked objects have limited velocities.
Figure 4.3 illustrates the process of collecting training samples. Some sample detection results in a window are given in Figure 4.3(a). Associations generated by the dual-threshold strategy are made between consecutive frames as in Figure 4.3(b). From those associated tracklets, we can use the spatial-temporal constraints to collect training samples. For example, $T_3$ is an associated tracklet, so the links between any two different responses in $T_3$, e.g. $r_1$ and $r_3$, are labeled as positive samples. On the other hand, $T_1$ and $T_2$ overlap in time, so the link between any response in $T_1$ and any response in $T_2$, e.g. $r_1$ and $r_8$, is labeled as a negative sample. Besides, $T_1$ and $T_6$ are too far apart in the spatial domain, so the links between any responses in $T_1$ and $T_6$, e.g. $r_4$ and $r_{12}$, are labeled as negative samples as well. Based on these training samples, the learned appearance model, which has discriminative power between $T_8$ and $T_9$, is able to prevent the wrong link between $T_3$ and $T_5$; this happens when the target of $T_3$ is occluded by the target of $T_5$ for a while.
In our implementation, a discriminative set is formed by the negative constraints. For a certain tracklet $T_j$, each element in the discriminative set $\mathcal{D}_j$ indicates a different target from $T_j$ by spatial-temporal information; for example, $\mathcal{D}_1 = \{T_2, T_3, T_5, T_6\}$. Therefore, we can extract any two different responses from one tracklet as a positive training sample, and two responses from two tracklets which belong to different targets as a negative training sample. We define the instance space to be $\mathcal{X} = \mathcal{R} \times \mathcal{R}$, where $\mathcal{R}$ is the set of detection responses in tracklets. The sample set $\mathcal{B} = \mathcal{B}^+ \cup \mathcal{B}^-$ can be denoted by
$$\mathcal{B}^+ = \{x_i: (r_m, r_n),\ y_i: +1 \mid r_m, r_n \in T_j\}$$
$$\mathcal{B}^- = \{x_i: (r_m, r_n),\ y_i: -1 \mid r_m \in T_j,\ r_n \in T_k,\ T_j \in \mathcal{D}_k \text{ or } T_k \in \mathcal{D}_j\} \quad (4.2)$$
where $x \in \mathcal{X}$, $r_m, r_n \in \mathcal{R}$, and $m \ne n$.
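The following sketch turns these constraints into code; tracklets are represented as dicts mapping frame index to image position, and max_gap and max_speed are illustrative parameters (not values from this thesis) encoding the limited-velocity assumption.

```python
import itertools
import math

def in_conflict(Ti, Tj, max_gap=25, max_speed=10.0):
    """True if Ti and Tj must be different targets: they overlap in time,
    or a small frame gap separates positions farther apart than a limited
    velocity could cover."""
    if set(Ti) & set(Tj):                       # temporal overlap
        return True
    if max(Ti) > max(Tj):                       # order Ti before Tj
        Ti, Tj = Tj, Ti
    gap = min(Tj) - max(Ti)
    if 0 < gap <= max_gap:
        (x1, y1), (x2, y2) = Ti[max(Ti)], Tj[min(Tj)]
        return math.hypot(x2 - x1, y2 - y1) > max_speed * gap
    return False

def collect_samples(tracklets):
    """Positive pairs from within each tracklet, negative pairs across
    conflicting tracklets, in the spirit of Eq. (4.2)."""
    pos = [((i, f1), (i, f2)) for i, T in enumerate(tracklets)
           for f1, f2 in itertools.combinations(sorted(T), 2)]
    neg = [((i, f1), (j, f2))
           for i, j in itertools.combinations(range(len(tracklets)), 2)
           if in_conflict(tracklets[i], tracklets[j])
           for f1 in tracklets[i] for f2 in tracklets[j]]
    return pos, neg
```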
4.4.3 Representation of appearance model
To build a strong appearance model, we begin by computing several local features to describe a tracked target. In our implementation, color histograms, covariance matrices, and histograms of gradients (HOG) constitute the feature pool. Given a detection response $r$, each feature is evaluated at different locations and different scales to increase the descriptive ability.

We use standard color histograms to represent the color appearance of a local image patch. Histograms have the advantage of being easy to implement and having well-studied similarity measures. We adopt the RGB color space for simplicity, but any other suitable color space can be used. In our implementation, we use 8 bins for each channel, and the single-channel histograms are concatenated to form a 24-element vector $\mathbf{f}^{RGB}_i$.
To describe the image texture, we use a descriptor based on covariance matrices of image features, proposed in [77]. It has been shown to give good performance for texture classification and object categorization. The texture descriptor $\mathbf{C}_R$ corresponds to the covariance matrix:
$$\mathbf{C}_R = \frac{1}{n-1} \sum_{k=1}^{n} (\mathbf{z}_k - \boldsymbol{\mu})(\mathbf{z}_k - \boldsymbol{\mu})^T \quad (4.3)$$
where
$$\mathbf{z}_k = \left[ \frac{\partial I}{\partial x}\ \ \frac{\partial I}{\partial y}\ \ \frac{\partial^2 I}{\partial x^2}\ \ \frac{\partial^2 I}{\partial y^2}\ \ \frac{\partial^2 I}{\partial x \partial y} \right]^T \quad (4.4)$$
is the vector containing the first and second derivatives of the image at the $k$-th pixel in the region $R$, $\boldsymbol{\mu}$ is the mean vector over $R$, and $n$ is the number of pixels.
To capture shape information, we choose the Histogram of Gradients (HOG) feature proposed in [19]. In our design, a 32D HOG feature $\mathbf{f}^{HOG}_i$ is extracted over the region $R$; it is formed by concatenating 8 orientation bins in $2 \times 2$ cells over $R$.
In summary, the appearance descriptor of a tracklet $T_i$ can be written as:
$$\mathcal{A}_i = \big(\{\mathbf{f}^{l\,RGB}_i\}, \{\mathbf{C}^l_i\}, \{\mathbf{f}^{l\,HOG}_i\}\big) \quad (4.5)$$
where $\mathbf{f}^{l\,RGB}_i$ is the color histogram feature vector, $\mathbf{C}^l_i$ is the covariance matrix, and $\mathbf{f}^{l\,HOG}_i$ is the 32D HOG feature vector. The superscript $l$ means that the features are evaluated over region $R_l$. In our design, we choose the number of regions to be 15, so the feature pool contains 45 cues in total.
4.4.4 Similarity of appearance descriptors
Given the appearance descriptors explained above, we can compute the similarity between two patches. The color histogram and HOG feature are histogram-based features, so standard measurements such as the $\chi^2$ distance, Bhattacharyya distance, and correlation coefficient can be used; in our implementation, the correlation coefficient is chosen for simplicity. We denote the similarities of these two descriptors as $\rho(\mathbf{f}^{RGB}_i, \mathbf{f}^{RGB}_j)$ and $\rho(\mathbf{f}^{HOG}_i, \mathbf{f}^{HOG}_j)$. The distance measurement between covariance matrices is described in [77]:
$$\rho(\mathbf{C}_i, \mathbf{C}_j) = \sqrt{\sum_{k=1}^{5} \ln^2 \lambda_k(\mathbf{C}_i, \mathbf{C}_j)} \quad (4.6)$$
where $\{\lambda_k(\mathbf{C}_i, \mathbf{C}_j)\}$ are the generalized eigenvalues of $\mathbf{C}_i$ and $\mathbf{C}_j$, computed from
$$\lambda_k \mathbf{C}_i \mathbf{x}_k - \mathbf{C}_j \mathbf{x}_k = 0, \quad k = 1 \ldots 5 \quad (4.7)$$
and $\mathbf{x}_k \ne 0$ are the generalized eigenvectors.
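A sketch of the covariance descriptor of (4.3)-(4.4) and the distance of (4.6)-(4.7) follows, using numpy gradients for the derivatives and scipy's generalized symmetric eigensolver; axis naming follows numpy's (row, column) convention, and both matrices are assumed positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def covariance_descriptor(patch):
    """Region covariance of Eq. (4.3) over the 5-D feature of Eq. (4.4):
    first and second image derivatives at every pixel of the region."""
    Iy, Ix = np.gradient(patch.astype(float))   # d/drow, d/dcol
    Iyy, Iyx = np.gradient(Iy)                  # second derivatives
    _, Ixx = np.gradient(Ix)
    Z = np.stack([Ix, Iy, Ixx, Iyy, Iyx]).reshape(5, -1)
    return np.cov(Z)   # averages (z_k - mu)(z_k - mu)^T with 1/(n - 1)

def covariance_distance(Ci, Cj):
    """Eq. (4.6): square root of the summed squared log generalized
    eigenvalues of the pair (Ci, Cj), cf. Eq. (4.7)."""
    lam = eigh(Ci, Cj, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```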
After computing the appearance model and the similarity between appearance descriptors at different regions, we form a feature vector
$$\mathbf{h}(\mathcal{A}_i, \mathcal{A}_j) = \Big[ \rho(\mathbf{f}^{1\,RGB}_i, \mathbf{f}^{1\,RGB}_j), \ldots, \rho(\mathbf{f}^{L\,RGB}_i, \mathbf{f}^{L\,RGB}_j),\ \rho(\mathbf{C}^1_i, \mathbf{C}^1_j), \ldots, \rho(\mathbf{C}^L_i, \mathbf{C}^L_j),\ \rho(\mathbf{f}^{1\,HOG}_i, \mathbf{f}^{1\,HOG}_j), \ldots, \rho(\mathbf{f}^{L\,HOG}_i, \mathbf{f}^{L\,HOG}_j) \Big] \quad (4.8)$$
by concatenating the similarity measurements of the different appearance descriptors at multiple locations. This feature vector gives us a feature pool from which the AdaBoost algorithm can combine cues into a strong classifier.
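As a sketch, the feature vector of (4.8) can be assembled from per-region descriptors as below; the correlation coefficient serves as $\rho$ for the histogram features, and the covariance distance is negated here so that larger values always mean more similar (a sign-consistency assumption made for illustration).

```python
import numpy as np

def pairwise_features(desc_i, desc_j):
    """Similarity feature vector h(A_i, A_j) of Eq. (4.8).  desc_* are dicts
    holding per-region lists 'rgb', 'cov', 'hog'; covariance_distance is
    the metric of Eq. (4.6) defined in the previous sketch."""
    corr = lambda a, b: float(np.corrcoef(a, b)[0, 1])
    h  = [corr(a, b) for a, b in zip(desc_i['rgb'], desc_j['rgb'])]
    h += [-covariance_distance(a, b)            # negate: larger = closer
          for a, b in zip(desc_i['cov'], desc_j['cov'])]
    h += [corr(a, b) for a, b in zip(desc_i['hog'], desc_j['hog'])]
    return np.asarray(h)
```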
4.4.5 Learning Algorithm
Our goal is to design a strong model which determines the appearance affinity score between two instances. It takes a pair of instances as input and returns a real value to distinguish positive pairs from negative pairs: the larger $H(\mathcal{A}_i, \mathcal{A}_j)$ is, the more likely it is that $\mathcal{A}_i$ and $\mathcal{A}_j$ represent the same target. We adopt the learning framework of binary classification and transfer the confidence score into the probability space. The affinity model is designed to be a linear combination of the similarity measurements computed in (4.8). We choose the AdaBoost algorithm [27, 68] to learn the coefficients. The strong classifier takes the following form:
$$H(\mathcal{A}_i, \mathcal{A}_j) = \sum_{t=1}^{T} \alpha_t h_t(\mathcal{A}_i, \mathcal{A}_j) \quad (4.9)$$
In our framework, the weak hypotheses come from the feature pool obtained by (4.8). We adjust the sign and normalize $h(x)$ to lie in the restricted range $[-1, +1]$. The sign of $h(x)$ is interpreted as the predicted label and the magnitude $|h(x)|$ as the confidence in this prediction.
The loss function for the AdaBoost algorithm is defined as:
$$Z = \sum_i w^0_i \exp\big(-y_i H(x_i)\big) \quad (4.10)$$
where $w^0_i$ is the initial weight of each training sample, which is updated during boosting. Our goal is to find the $H(x)$ which minimizes $Z$, where $H(x)$ is obtained by sequentially adding new weak classifiers.
Algorithm 1 Learning the on-line discriminative appearance affinity model
Input:
$\mathcal{B}^+ = \{(x_i, +1)\}$: positive samples
$\mathcal{B}^- = \{(x_i, -1)\}$: negative samples
$\mathcal{F} = \{h(x_i)\}$: feature pool
1: Set $w_i = \frac{1}{2|\mathcal{B}^+|}$ if $x_i \in \mathcal{B}^+$, $w_i = \frac{1}{2|\mathcal{B}^-|}$ if $x_i \in \mathcal{B}^-$
2: for $t = 1$ to $T$ do
3:   for $k = 1$ to $K$ do
4:     $r = \sum_i w_i y_i h_k(x_i)$
5:     $\alpha_k = \frac{1}{2} \ln\left(\frac{1+r}{1-r}\right)$
6:   end for
7:   Choose $k^* = \arg\min_k \sum_i w_i \exp[-\alpha_k y_i h_k(x_i)]$
8:   Set $\alpha_t = \alpha_{k^*}$ and $h_t = h_{k^*}$
9:   Update $w_i \leftarrow w_i \exp[-\alpha_t y_i h_t(x_i)]$
10:  Normalize $w_i$
11: end for
Output: $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
In the $t$-th round, we aim at learning the optimal $(h_t, \alpha_t)$ to minimize the loss
$$Z_t = \sum_i w^t_i \exp\big(-\alpha_t y_i h_t(x_i)\big) \quad (4.11)$$
The algorithm proposed in [68] is adopted to find $\alpha_t$ in analytical form. We then update the sample weights according to $h_t$ and $\alpha_t$ to focus on the misclassified samples. The learning procedure is summarized in Algorithm 1.
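A direct transcription of Algorithm 1 is sketched below; the feature matrix holds the similarity scores of (4.8) for every training pair, assumed already rescaled to $[-1, +1]$, and the edge $r$ is clipped away from $\pm 1$ to keep the weight formula finite.

```python
import numpy as np

def learn_oldam(H_feat, y, T=100):
    """Confidence-rated AdaBoost of Algorithm 1.  H_feat: (num_samples,
    num_features) array of weak outputs in [-1, 1]; y: labels in {-1, +1}.
    Returns the selected feature indices and their weights alpha_t."""
    w = np.where(y > 0, 0.5 / np.sum(y > 0), 0.5 / np.sum(y < 0))  # line 1
    chosen, alphas = [], []
    for _ in range(T):
        r = (w[:, None] * y[:, None] * H_feat).sum(axis=0)         # line 4
        r = np.clip(r, -0.999, 0.999)
        alpha = 0.5 * np.log((1 + r) / (1 - r))                    # line 5
        loss = [np.sum(w * np.exp(-alpha[k] * y * H_feat[:, k]))
                for k in range(H_feat.shape[1])]
        k_star = int(np.argmin(loss))                              # line 7
        chosen.append(k_star)
        alphas.append(alpha[k_star])                               # line 8
        w = w * np.exp(-alpha[k_star] * y * H_feat[:, k_star])     # line 9
        w /= w.sum()                                               # line 10
    return chosen, np.array(alphas)

def strong_score(h_pair, chosen, alphas):
    """H(A_i, A_j) = sum_t alpha_t h_t, Eq. (4.9), for one feature vector."""
    return float(np.dot(alphas, np.asarray(h_pair)[chosen]))
```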
4.5 Experimental results
We evaluate the effectiveness of OLDAMs incorporated in a hierarchical tracking algorithm applied to two public surveillance datasets: the CAVIAR dataset [1] and the TRECVID08 dataset [2]. The performance of our method is compared with several state-of-the-art methods and with other appearance models using our own implementations; we also provide some graphical examples.
4.5.1 Discrimination Comparison
We first evaluate the discriminative power of OLDAMs, independent of the tracking system that they may be embedded in. For each tracklet pair in a given temporal sliding window, the affinity scores based on OLDAMs and on the correlation coefficient of color histograms are computed. We manually label which tracklet pairs should be associated to form the ground truth. A distribution of scores is displayed in Figure 4.4; it is generated by 192 positive pairs and 2,176 negative pairs of detection responses extracted from a window of 200 frames. The horizontal axis represents the affinity scores and the vertical axis denotes the density distribution. The Bhattacharyya distance between negative samples and positive samples is 0.284 for the correlation coefficient of color histograms and 0.689 for OLDAMs. The equal error rate is 0.265 for the correlation coefficient of color histograms and 0.125 for OLDAMs. It can be seen that the OLDAMs are significantly more discriminative than color histograms, as the positive and negative samples are separated better.
[Figure 4.4 here: two density plots of affinity scores in $[0, 1]$.]
Figure 4.4: The sample distributions based on correlation coefficients of color histograms (left) and OLDAMs (right). Blue represents positive samples and red represents negative ones. The figure gives an example to show that OLDAMs are more discriminative than color histograms.
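For reference, the two separability numbers quoted above can be computed from raw affinity scores roughly as follows; the bin count and threshold grid are illustrative choices, not the thesis's exact evaluation code.

```python
import numpy as np

def separability(pos, neg, bins=50):
    """Bhattacharyya distance between the score histograms of positive and
    negative pairs, and the equal error rate of a sliding threshold."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    lo, hi = min(pos.min(), neg.min()), max(pos.max(), neg.max())
    hp, _ = np.histogram(pos, bins=bins, range=(lo, hi))
    hn, _ = np.histogram(neg, bins=bins, range=(lo, hi))
    bc = np.sum(np.sqrt((hp / hp.sum()) * (hn / hn.sum())))
    b_dist = -np.log(max(bc, 1e-12))      # Bhattacharyya distance
    # EER: threshold where false accept rate equals false reject rate
    ts = np.linspace(lo, hi, 1000)
    far = np.array([(neg >= t).mean() for t in ts])
    frr = np.array([(pos < t).mean() for t in ts])
    k = int(np.argmin(np.abs(far - frr)))
    return float(b_dist), float((far[k] + frr[k]) / 2)
```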
4.5.2 Evaluation metrics
We adopt the commonly used metrics [86], which include the evaluation of performance both in tracking and in detection. Note that for track fragments and ID switches, we follow the definition proposed in [55]; it is more strict but better defined than the definition in [86]. A program is written to compute the metrics automatically. The key part of the evaluation program is the matching between ground truth and tracking results, which is non-trivial itself. We implemented this part with the Hungarian algorithm, based on the VACE evaluation software [42].

Here we explain the details of two important metrics: fragments and ID switches. Our definitions of track fragments and ID switches are slightly different from those of [85] and [92] (referred to as the traditional metrics), as shown in Figure 4.5. The traditional ID switch
Figure 4.5: A graphical example of the evaluation metrics proposed in [55].
is defined as "two tracks exchanging their ids". However, the case in Figure 4.5(b) is not well defined: TRK 1's identity changed but was not "exchanged" with others. We hereby define an ID switch as a tracked trajectory changing its matched GT ID; e.g., in Figure 4.5(a) there are two ID switches by our metric. A similar modification is made to the definition of fragments. This definition is easier to implement but more strict, and gives higher numbers of fragments and ID switches.
In summary, the metrics in tracking evaluation are:
Ground Truth (GT): the number of trajectories in the ground truth.
Mostly tracked trajectories (MT): the percentage of trajectories that are successfully tracked for more than 80% of their length, divided by GT.
Partially tracked trajectories (PT): the percentage of trajectories that are tracked between 20% and 80% of their length, divided by GT.
Mostly lost trajectories (ML): the percentage of trajectories that are tracked for less than 20% of their length, divided by GT.
Fragments (Frag): the total number of times that a trajectory in the ground truth is interrupted in the tracking result.
ID switches (IDS): the total number of times that a tracked trajectory changes its matched GT identity.
Since multi-object tracking can be viewed as a method which recovers missed detections and removes false alarms from the raw detection responses, we also provide the metrics for detection evaluation:
Recall: the number of correctly matched detections divided by the total number of detections in the ground truth.
Precision: the number of correctly matched detections divided by the number of output detections.
False Alarms per Frame (FAF): the number of false alarms per frame.
Higher values indicate better performance for MT, recall, and precision; lower values indicate better performance for ML, Frag, IDS, and FAF.
4.5.3 Results for the CAVIAR dataset
The CAVIAR dataset contains 26 video sequences of a corridor in a shopping center taken by a single camera with a frame size of 384×288 and a frame rate of 25 fps. We use the method of [35] as our pedestrian detector. For comparison with the state of the art, we conduct our experiment on the 20 videos selected by [92]. Tracking evaluation results are presented in Table 4.1.
Method Recall Precision FAF GT MT PT ML Frag IDS
Wu et al. [86] 75.2% - 0.281 140 75.7% 17.9% 6.4% 35* 17*
Zhang et al. [92] 76.4% - 0.105 140 85.7% 10.7% 3.6% 20* 15*
Xing et al. [88] 81.8% - 0.136 140 84.3% 12.1% 3.6% 24* 14*
Huang et al. [36] 86.3% - 0.186 143 78.3% 14.7% 7.0% 54 12
Li et al. [55] 89.0% - 0.157 143 84.6% 14.0% 1.4% 17 11
OLDAM 89.4% 96.9% 0.085 143 84.6% 14.7% 0.7% 18 11
Table 4.1: Tracking results on the CAVIAR dataset. *The numbers of Frag and IDS in [86], [92], [88] are obtained with different metrics from the one we adopt [55], which is more strict.
Method Recall Precision FAF GT MT PT ML Frag IDS
Huang et al. [36] 71.6% 80.8% - 919 57.0% 28.1% 14.9% 487 278
Li et al. [55] 80.0% 83.5% - 919 77.5% 17.6% 4.9% 310 288
Ours (a) 77.0% 85.3% 1.015 919 71.6% 23.1% 5.4% 496 303
Ours (b) 80.5% 83.9% 1.181 919 77.8% 18.7% 3.4% 475 286
OLDAM 80.4% 86.1% 0.992 919 76.1% 19.3% 4.6% 322 224
Table 4.2: Tracking results on TRECVID08 dataset.
OLDAM's results have the best recall rate, the smallest number of False Alarms per Frame, and the second-best rate of mostly tracked trajectories. The numbers of fragments and ID switches in our approach are lower than in most previous methods and competitive with [55]. Some tracking results are shown in Figure 4.6: the two targets are consistently tracked by our method, while they exchange their IDs in other methods.
4.5.4 Results on the TRECVID08 dataset
The CAVIAR dataset is relatively easy, and several methods achieve good results on it. To test the effectiveness of the OLDAM approach, we further evaluate our method on the much more challenging TRECVID 2008 event detection dataset [2], which consists of videos captured in a major airport.
As in [55], the experiment is conducted on 9 videos with a frame size of 720×576 chosen from the TRECVID08 set, which contains three different scenes, each being 5000 frames in length. This dataset has a high crowd density, and inter-object occlusions and interactions occur often; thus, stronger appearance models should be of greater value. Table 4.2 shows a comparison between the OLDAM approach and [36, 55] on this dataset. Our result achieves the best recall rate and precision rate. With mostly tracked trajectories and fragments similar to [55], we have a significant reduction in ID switches. This shows that the on-line learned discriminative appearance model prevents wrong associations between different targets which a simple appearance model fails to prevent. Notice that [55] and our approach focus on different aspects, as mentioned in the introduction; thus they are complementary works, and OLDAMs can be incorporated into the approach of [55] in future work. Some tracking results are shown in Figure 4.7.
We also compare our approach with different appearance models using our own implementation. The results are also shown in Table 4.2. Ours (a) represents the result when only a color histogram is used as the appearance model. In the result of ours (b), our proposed appearance model is used but learned in an off-line setting, which means the coefficients $\alpha_t$ are fixed. Our proposed method outperforms these two appearance models. This comparison confirms that our stronger appearance model with on-line learning improves the tracking performance.
4.5.5 Computational Speed
We measure the execution time using OLDAMs on the 20 evaluation videos from the CAVIAR dataset, which typically have 2 to 8 pedestrians to be tracked in each frame.
[Figure 4.6 frames: 399, 424, 439, 519; Figure 4.7 frames: 2170, 2250, 2300, 2360.]
Figure 4.6: Sample tracking results on the CAVIAR dataset. Top row: two targets indicated by arrows switch their IDs when a color histogram serves as the appearance model. Bottom row: the two targets are tracked successfully when OLDAMs are used.
Figure 4.7: Sample tracking results on the TRECVID08 dataset. The top row shows the result of [55], where a man gets a new ID and his old ID is transferred to the lady behind him. The bottom row shows that they are consistently tracked by our method.
The tracking speed of our system is about 4 fps on a 3.0 GHz PC, with the program coded in Matlab; this does not count the processing time of the human detection step. In fact, most of the processing time is spent in extracting appearance descriptors from the videos, which are shared by the online learning and the tracklet association. We also tested our implementation using the off-line learned appearance model on the same videos. By removing the online learning, the execution time is reduced by 17%. This indicates that the online learning does not significantly increase the computational load of the tracking system.
4.6 Conclusion
We present an approach for online learning of a discriminative appearance model for robust multi-target tracking. Unlike previous methods, our model is designed to distinguish different targets from each other, rather than from the background. Spatial-temporal constraints are used to select training samples automatically at runtime. Experiments on challenging datasets show clear improvements by our proposed OLDAMs.
Chapter 5
Person Identity Recognition based Multi-Person Tracking
To further enhance pedestrian appearance models, we are inspired by an interesting topic: appearance-based person identity recognition. In fact, multi-person tracking and person recognition can be viewed as highly connected tasks. In this chapter, we present a novel person identity recognition based multi-person tracking approach. This work was originally published in [50].
5.1 Introduction
Tracking multiple people in a real scene is an important topic in the field of computer vision, since it has many applications such as surveillance systems, robotics, and human-computer interaction environments. It is a highly challenging problem, especially in complex and crowded environments with frequent occlusions and interactions of targets. For example, Figure 5.1 shows several busy scenes which are difficult cases for multi-person tracking.
In the recent literature, human detection techniques have achieved impressive progress [19, 24, 35, 78, 82, 84], which enables a popular tracking scheme: tracking by tracklet association [36, 55, 88, 92].

Figure 5.1: Some snapshots of videos from our multi-person tracking results. The goal of this work is to locate the targets and maintain their identities in a real and complex scene.

The main idea is to link detection responses or short tracklets gradually into
longer tracks by optimizing linking scores or probabilities between tracklets globally. By considering information from future frames, some detection errors such as missed detections and false alarms can be corrected; this also effectively handles frequent or long occlusions between targets. A key element in tracklet association is the affinity measurement between two tracklets, which decides whether they belong to the same target or not. Relying on spatio-temporal restrictions in video sequences, these types of methods usually fuse several features such as motion, time, position, size, and appearance into the affinity measurement. However, due to computational constraints, previous work often uses only simple features for its appearance models, which limits the accuracy.
To enhance the human appearance model, we explore another interesting topic: appearance-based person identity recognition [23, 29, 33, 61, 69]. Given an image of a person, the recognition system finds the best match in a gallery set. In fact, multi-person tracking and person recognition can be viewed as highly connected tasks, since solving the problem of whether two tracklets are from the same person is essentially a recognition problem. Nevertheless, there exist some differences between the approaches to these two problems. Compared to multi-person tracking, where several cues can be applied, person recognition methods typically use appearance as the only cue for the association between a query and a gallery of people. Besides, to deal with larger view-point and illumination changes, the appearance model used for person recognition is, in general, more complex than that in multi-person tracking. Some comparisons between these two problems are listed in Table 5.1.
                      Multi-person Tracking   Person Recognition
Appearance models     simple                  complex
View-point change     small                   large
Illumination change   small                   large
Gallery size          small                   large
Other cues            yes                     no
Occlusion handling    yes                     no
Table 5.1: Comparison between multi-person tracking and person identity recognition.
In this chapter, we propose a framework to incorporate the merits of person identity recognition to improve multi-person tracking performance. Compared to the normal application of person recognition, for multi-person tracking we have a small, though dynamic, gallery. The proposed system is named Person Identity Recognition based Multi-Person Tracking (PIRMPT). A recent work [48] used the tracking by tracklet association method and proposed a strategy to collect training samples for learning on-line discriminative appearance models (OLDAMs). PIRMPT adopts a similar framework but makes significant improvements with several new ideas. Unlike [48], which uses pre-defined local image descriptors, in PIRMPT a set of most discriminative features is selected by automatic learning from a large number of local image descriptors. This set serves as the feature pool for on-line learning of appearance-based affinity models. In the test phase, tracklets formed by frame-to-frame association are classified as query tracklets or gallery tracklets. For each gallery tracklet, a target-specific appearance-based affinity model is learned from the on-line training samples collected by spatio-temporal constraints, instead of learning a single global OLDAM for all tracklets as in [48]. Both gallery tracklets and query tracklets are then fed into a hierarchical association framework to obtain the final tracking
results. The block diagram of the PIRMPT system is shown in Figure 5.2. We evaluate our proposed method on several public datasets, CAVIAR, TRECVID 2008, and ETH, and show significant improvements in terms of tracking evaluation metrics.
5.2 Related Work
Due to the impressive advances in object detection [19, 24, 35, 78, 82, 84], detection-based tracking methods have gained increasing attention, since they are essentially more robust in complex environments, even when the camera is moving. One typical solution is to adopt a particle filtering framework, representing the tracking uncertainty in a Markovian manner by only considering detection responses from past frames. Okuma et al. [60] formed the proposal distribution of the particle filter from a mixture of the Adaboost detections and the dynamic model. Cai et al. [14] worked on the same dataset as [60] and made improvements by a rectification technique and a mean-shift embedded particle filter. Breitenstein et al. [11] used the continuous confidence of pedestrian detectors and online trained classifiers as a graded observation model. These methods are suitable for online applications since the results are based on the past and current frames; no information from future frames is considered. However, they may fail when long occlusions of targets exist, and they are sensitive to imperfect detection responses.

In contrast to those methods which only consider past information, several approaches have been proposed to optimize multiple trajectories globally, i.e., by considering the future frames in a time sliding window. Leibe et al. [51] used Quadratic Boolean Programming to couple the detection and estimation of trajectory hypotheses.

Figure 5.2: The block diagram of our proposed method.
Andriluka et al. [3] applied the Viterbi algorithm to obtain optimal object sequences. Zhang et al. [92] used a cost-flow network to model the MAP data association problem. Xing et al. [88] combined local tracklet filtering and global tracklet association. Huang et al. [36] proposed a hierarchical association framework to link shorter tracklets into longer tracks. Li et al. [55] adopted a similar structure to [36] and presented a HybridBoost algorithm to learn the affinity models between two tracklets. The underlying philosophy is that observing more frames before making association decisions should generally help overcome ambiguities caused by long-term occlusions and false or missed detections.
On the other hand, person identity recognition is a less addressed problem. Unlike tracking, where motion and position are used to help maintain the identities of targets, appearance is the only cue in person identity recognition. Gheissari et al. [29] developed two person re-identification approaches which use interest operators and model fitting for establishing spatial correspondences between individuals. Gray et al. [33] presented a method of performing viewpoint invariant pedestrian recognition using an ensemble of localized features. Farenzena et al. [23] found the asymmetry/symmetry axes and extracted the symmetry-driven accumulation of local features. Schwartz et al. [69] established a high-dimensional signature which is then projected into a low-dimensional discriminant latent space by Partial Least Squares reduction. Oreifej et al. [61] proposed a system to recognize humans across aerial images by Weighted Region Matching (WRM).
5.3 Appearance-based affinity models
Appearance models play an important role in both person recognition and multi-person tracking. Previous methods for person recognition usually propose complex features or a large number of image descriptors, which require extensive computational power. On the other hand, multi-person tracking often uses simple features such as color histograms for speed. We aim to construct appearance models which have strong discriminative power, yet are efficient enough to fulfill the speed requirements of multi-person tracking.
5.3.1 Local image descriptors and similarity measurements
To establish a strong appearance model, we extract a rich set of local descriptors to describe a person's image. A local descriptor $d$ consists of a feature channel $\eta$ and a support region $r$, where $\eta \in (Color, Shape, Texture)$ and $r = (x, y, w, h)$. Given an image sample $I$, a single descriptor $d_{i,j}$ extracted over $r_j$ via $\eta_i$ is denoted as
$$d_{i,j} = I(\eta_i, r_j) \quad (5.1)$$
where $i$ and $j$ are the indices of the feature channel and the support region respectively.

In our design, the support regions $\{r\}$ are sampled from a large set of all possible rectangles within $I$, with the constraint that the width to height ratio is fixed to 1:1, 1:2, or 2:1, which gives us 654 rectangles. For the feature channel, RGB color histograms are used with 8 bins for each channel and concatenated into a 24-element vector. To capture shape information, we adopt the Histogram of Gradients (HOG) feature [19], concatenating 8 orientation bins in $2 \times 2$ cells over $r$ to form a 32-element vector. To describe the image texture, we use a descriptor based on covariance matrices of image features, proposed in [77].

Given the appearance descriptors, we can compute the similarity between two person image patches. The color histogram and HOG feature are histogram-based features, so standard measurements such as the $\chi^2$ distance, Bhattacharyya distance, and correlation coefficient can be used; we choose the correlation coefficient for simplicity. The distance between two covariance matrices is determined by solving a generalized eigenvalue problem, as described in [77]. The similarity score $s$ between two image patches based on a certain local descriptor can be written as:
$$s_{i,j} = \rho_i\big(I_1(\eta_i, r_j),\, I_2(\eta_i, r_j)\big) \quad (5.2)$$
where $\rho_i$ is the corresponding similarity measurement function for feature channel $\eta_i$.
5.3.2 Model definition and descriptor selection
We define the appearance-based affinity model as an ensemble of local descriptors and their corresponding similarity measurements. It takes any two images of persons as input and computes an affinity score as the output. The desired model has the goal of discriminating between the correct and the wrong pairs: the larger the affinity score is, the more likely it is that the two images belong to the same person. We design the appearance-based affinity model to be a linear combination of all similarity measurements from (5.2). It takes the following form:
$$H(I_1, I_2) = \sum_{i,j} \alpha_{i,j}\, s_{i,j} \quad (5.3)$$
where the coefficients $\{\alpha_{i,j}\}$ represent the importance of the local descriptors.

Figure 5.3: The off-line training samples for appearance models. Images in each column indicate the same person.

The training data are obtained from the tracking ground truth of the TRECVID 2008 dataset [2] provided by [55]. For each individual, $M$ images of that person are extracted randomly along its trajectory. Some examples are provided in Figure 5.3. A training sample for the learning algorithm is defined as a pair of images: a positive sample is collected as a pair of images from the same person; a negative one is collected as a pair
of images from any two different persons. The similarity scores of the training samples are used as features in a learning framework.

Figure 5.4: Some sample features selected by the Adaboost algorithm. The local descriptors of color histograms, HOG, and covariance matrices are indicated by red, green, and yellow respectively.
We may use a large set of local descriptors via different channels and support regions. However, if we included all of them in on-line multi-person tracking, the computational cost would be too high. Hence, we apply the standard Adaboost algorithm [68] to sequentially learn the features which are effective in comparing two images; this may be regarded as a feature selection process. The selected smaller set becomes the feature pool for use in multi-person tracking, shown as the diamond block in Figure 5.2. In general, color histograms are selected most often and covariance matrices least often. Color histograms tend to have smaller regions; HOG features tend to have larger regions. Some sample features are shown in Figure 5.4.
5.4 Tracklet association framework
As shown in Figure 5.2, the PIRMPT system involves four parts: tracklet generation, tracklet classification, on-line learning of appearance-based affinity models, and tracklet association. We describe each important component in this section.
5.4.1 Reliable Tracklets
In a given time sliding window, we apply a state-of-the-art human detector such as [19, 24, 35] on each frame. A dual-threshold association strategy is applied to detection responses in consecutive frames to generate short but reliable tracklets, as in [36]. Based on the assumption that targets have small displacements over neighboring frames in a video sequence, we form an affinity score matrix $S$, where each element of $S$ is defined as the product of three similarity scores based on position, size, and appearance between $r_m$ and $r_n$, as in [86]:
$$S(m, n) = A_{pos}(r_m, r_n)\, A_{size}(r_m, r_n)\, A_{appr}(r_m, r_n) \quad (5.4)$$
Two responses $r_m$ and $r_n$ in two neighboring frames are linked if their affinity score $S(m,n)$ is higher than a threshold $\theta_1$ and exceeds any other element in the $m$-th row and $n$-th column of $S$ by another threshold $\theta_2$. This strategy is conservative and biased to link only reliable associations between any two consecutive frames.
Since the detection responses in a generated tracklet may be noisy and not well aligned to the target, tracklet refinement is needed to extract correct descriptors of motion and appearance. Let $\mathbf{x}_k$ indicate the position and size of an observation $r_k$ in a tracklet $T = \{r_k\}$, where $k$ is the index of the time frame. We define the probability of a certain state $\{\hat{\mathbf{x}}_k\}$ given the observations $\{\mathbf{x}_k\}$ as:
$$P(\{\hat{\mathbf{x}}_k\} \mid \{\mathbf{x}_k\}) = \prod_k G(\mathbf{x}_k - \hat{\mathbf{x}}_k;\, \Sigma_p) \prod_k G(\mathbf{v}_k;\, \Sigma_v) \prod_k G(\mathbf{a}_k;\, \Sigma_a) \quad (5.5)$$
where $\mathbf{v}_k = \frac{\hat{\mathbf{x}}_{k+1} - \hat{\mathbf{x}}_k}{t_{k+1} - t_k}$ and $\mathbf{a}_k = \frac{\mathbf{v}_k - \mathbf{v}_{k-1}}{0.5\,(t_{k+1} - t_{k-1})}$ are the velocity and the acceleration at frame $t_k$, and $G(\cdot\,;\Sigma)$ is the zero-mean Gaussian distribution. For each tracklet, the estimate of the true states $\hat{\mathbf{x}}_k$ can be computed as:
$$\{\hat{\mathbf{x}}_k\}^* = \arg\max_{\{\hat{\mathbf{x}}_k\}} P(\{\hat{\mathbf{x}}_k\} \mid \{\mathbf{x}_k\}) \quad (5.6)$$
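For uniformly spaced frames, maximizing (5.5) is a quadratic problem in the latent states and reduces to one linear solve per coordinate. A minimal sketch follows, with isotropic scalar sigmas standing in for $\Sigma_p$, $\Sigma_v$, $\Sigma_a$ (an assumption for illustration).

```python
import numpy as np

def refine_tracklet(x, sigma_p=1.0, sigma_v=2.0, sigma_a=1.0):
    """MAP refinement of Eq. (5.5)-(5.6) assuming unit frame spacing.
    x: (T, d) array of observed positions/sizes; returns smoothed states
    that trade data fidelity against small velocities and accelerations."""
    T = len(x)
    D1 = (np.eye(T, k=1) - np.eye(T))[:-1]                       # velocities
    D2 = (np.eye(T, k=2) - 2 * np.eye(T, k=1) + np.eye(T))[:-2]  # accelerations
    # stationarity of the negative log-posterior gives a linear system
    A = (np.eye(T) / sigma_p ** 2
         + D1.T @ D1 / sigma_v ** 2
         + D2.T @ D2 / sigma_a ** 2)
    return np.linalg.solve(A, np.asarray(x, float) / sigma_p ** 2)
```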
5.4.2 Tracklets classification
How to learn effective appearance-based affinity models is a key problem for robust performance in multi-person tracking. [48] proposed an approach to learn a global appearance-based affinity model which is shared by all tracklets in a given time sliding window, i.e., all tracklets use the same model. In contrast to this work, we plan to learn target-specific models.

Inspired by work on person identity recognition, we introduce the concepts of gallery and query, with some necessary modifications so that they can be incorporated in multi-person tracking. We divide all tracklets into two groups: "gallery" tracklets and "query" tracklets. The two groups of tracklets have different strategies to learn their own appearance-based affinity models. In our design, a tracklet which is considered a gallery tracklet needs to fulfill two requirements: 1) it is longer than a certain threshold;
2) it is not totally or heavily occluded by any other tracklet. The rst requirement is
based on the observation that a longer tracklet is more likely to be a true tracklet of
a person; a shorter tracklet tends to be false alarm so that it is not appropriate to be
registered as a gallery tracklet. The reason for second requirement is that the heavily
occluded tracklets are not suitable to extract the local descriptors. Given the optimal
states of each tracklet in (5.6), an occupancy map is established in every frame and
the visibility ratio is computed for each x
k
of each tracklet. The second requirement is
satised if the number of frames with the visibility ratio less than a certain threshold
v
,
is larger than M(
v
= 0:7, M = 8 in our implementation). If a tracklet is not classied
as a gallery tracklet, it will be considered as a query tracklet.
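A minimal sketch of this gallery/query decision, assuming per-frame visibility ratios are
already available from the occupancy map; the data layout (a list of floats per tracklet)
and the length threshold `min_len` are illustrative assumptions.

```python
def classify_tracklet(visibility, min_len=20, theta_v=0.7, M=8):
    """Return "gallery" or "query" for one tracklet.

    visibility: per-frame visibility ratios of the tracklet in [0, 1].
    Requirement 1: the tracklet is longer than min_len frames.
    Requirement 2: more than M frames are sufficiently visible.
    """
    long_enough = len(visibility) > min_len
    visible_frames = sum(1 for v in visibility if v > theta_v)
    return "gallery" if long_enough and visible_frames > M else "query"

# Toy usage: a long, mostly visible tracklet vs. two failure cases.
print(classify_tracklet([0.9] * 30))   # gallery
print(classify_tracklet([0.9] * 10))   # query (too short)
print(classify_tracklet([0.3] * 30))   # query (heavily occluded)
```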
5.4.3 On-line appearance-based affinity models

The construction of on-line appearance-based affinity models contains three parts: local
descriptor extraction, on-line training sample collection, and the learning framework. A
small set of discriminative local descriptors is learned off-line as described in Section 5.3.
For each tracklet, we extract these local descriptors from the refined detection responses
in the head part (the first M frames) and the tail part (the last M frames). Based on the
occupancy map, we do not extract local descriptors for detection responses whose
visibility ratio is less than \theta_v.
On-line training sample collection is an important issue in the learning of a target-specific
appearance model. We adopt assumptions similar to those in [48]: 1) detection responses
in one tracklet are from the same person; 2) any detection responses in two different
tracklets which overlap in time are from different persons. The first assumption is based
on the observation that the dual-threshold strategy generates reliable tracklets; the second
one is based on the fact that one person cannot appear at two different locations at the
same time. For a certain tracklet T_i, a conflict set C_i is established such that no element
of C_i can be the same person as T_i. Next, we extract any two different detection responses
from T_i as positive training samples, and two responses from T_i and T_j \in C_i respectively
as negative training samples. The training set for tracklet T_i is denoted by B_i; a sketch
of this collection step follows.
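The following is a minimal sketch of this collection step, assuming each tracklet carries
its frame span and a list of per-frame descriptors; the structure names (`Tracklet`,
`frames`, `feats`) are illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Tracklet:
    tid: int
    frames: range          # frame indices covered by the tracklet
    feats: list            # one appearance descriptor per frame

def overlap(a, b):
    """Two tracklets overlapping in time cannot be the same person."""
    return a.frames.start < b.frames.stop and b.frames.start < a.frames.stop

def collect_samples(target, tracklets):
    """Build the training set B_i for one tracklet T_i."""
    conflict = [t for t in tracklets if t.tid != target.tid and overlap(target, t)]
    positives = [(f1, f2, +1) for f1, f2 in combinations(target.feats, 2)]
    negatives = [(f1, f2, -1) for t in conflict
                 for f1 in target.feats for f2 in t.feats]
    return positives + negatives

# Toy usage with three tracklets; T2 overlaps T1 in time, T3 does not.
t1 = Tracklet(1, range(0, 4), ["a0", "a1", "a2", "a3"])
t2 = Tracklet(2, range(2, 5), ["b0", "b1", "b2"])
t3 = Tracklet(3, range(10, 12), ["c0", "c1"])
B1 = collect_samples(t1, [t1, t2, t3])
print(len([s for s in B1 if s[2] > 0]), "positives,",
      len([s for s in B1 if s[2] < 0]), "negatives")   # 6 positives, 12 negatives
```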
For a gallery tracklet T_i^{gallery}, the training data B_i is used to learn a target-specific
appearance-based affinity model. For a query tracklet T_i^{query}, the training data contain
the union of all B_i in our design, since a query tracklet does not provide reliable enough
data to learn a meaningful target-specific model.

Once on-line training sample collection is finished, we compute the similarity scores
between local appearance descriptors via all feature channels over all support regions.
These similarity measurements serve as weak learners which are used in a standard
boosting algorithm, as in [68], to learn the weight coefficients \{\alpha\}.
5.4.4 Tracklets association

We adopt the hierarchical tracklet association framework of [36]. The linking probability
between two tracklets is defined by three cues: motion, time, and appearance:

P_{link}(T_i, T_j) = A_m(T_i, T_j) \, A_t(T_i, T_j) \, A_{appr}(T_i, T_j)    (5.7)
The motion model is defined by

A_m(T_i, T_j) = G(x_i^{tail} + v_i^{tail} \Delta t - x_j^{head}; \Sigma_j) \, G(x_j^{head} - v_j^{head} \Delta t - x_i^{tail}; \Sigma_i)    (5.8)

where \Delta t is the time gap between the tail of T_i and the head of T_j, and x_i and v_i
are the refined positions and velocities of the head part or tail part of T_i.
The time model is simply a step function

A_t(T_i, T_j) = \begin{cases} 1, & \text{if } \Delta t > 0 \\ 0, & \text{otherwise} \end{cases}    (5.9)

which makes the link between T_i and T_j possible only if the tail of T_i appears earlier
than the head of T_j. A sketch combining the motion and time terms is given below.
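As a concrete reading of (5.8) and (5.9), here is a minimal 1-D sketch; the Gaussian
variance and the tracklet summaries (`x_tail`, `v_tail`, `x_head`, `v_head`, end/start
frames) are illustrative assumptions, not values from the thesis.

```python
import math

def gaussian(d, sigma):
    """Zero-mean Gaussian density of residual d."""
    return math.exp(-d * d / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)

def motion_time_affinity(ti, tj, sigma=5.0):
    """A_m * A_t for 1-D tracklets, per equations (5.8) and (5.9)."""
    dt = tj["t_head"] - ti["t_tail"]
    if dt <= 0:                      # time model (5.9): tail must precede head
        return 0.0
    fwd = gaussian(ti["x_tail"] + ti["v_tail"] * dt - tj["x_head"], sigma)
    bwd = gaussian(tj["x_head"] - tj["v_head"] * dt - ti["x_tail"], sigma)
    return fwd * bwd                 # motion model (5.8)

# Toy usage: T_j starts where a constant-velocity T_i would be 10 frames later.
ti = {"x_tail": 100.0, "v_tail": 2.0, "t_tail": 50}
tj = {"x_head": 121.0, "v_head": 2.0, "t_head": 60}
print(motion_time_affinity(ti, tj))
```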
The appearance-based affinity model is a linear combination of several similarity
measurements on a set of local descriptors, as mentioned in Section 5.3. Every gallery
tracklet has its own appearance-based affinity model, i.e., target-specific weighted
coefficients \{\alpha\}_i^{gallery}. All query tracklets share the same global weighted
coefficients \{\alpha\}^{query}. The appearance-based affinity model used in A_{appr}(T_i, T_j)
depends on whether T_i or T_j is a gallery tracklet: \{\alpha\}_i^{gallery} is used if T_i is a
gallery tracklet; \{\alpha\}_j^{gallery} is used if T_j is a gallery tracklet but T_i is not; and
\{\alpha\}^{query} is used if T_i and T_j are both query tracklets.
Method Recall Precision FAF GT MT PT ML Frag IDS
Wu et al. [86] 75.2% - 0.281 140 75.7% 17.9% 6.4% 35* 17*
Zhang et al. [92] 76.4% - 0.105 140 85.7% 10.7% 3.6% 20* 15*
Xing et al. [88] 81.8% - 0.136 140 84.3% 12.1% 3.6% 24* 14*
Huang et al. [36] 86.3% - 0.186 143 78.3% 14.7% 7.0% 54 12
Li et al. [55] 89.0% - 0.157 143 84.6% 14.0% 1.4% 17 11
Kuo et al. [48] 89.4% 96.9% 0.085 143 84.6% 14.7% 0.7% 18 11
PIRMPT 88.1% 96.6% 0.082 143 86.0% 13.3% 0.7% 17 4
Table 5.2: Comparison of tracking results between state-of-the-art methods and PIRMPT
on the CAVIAR dataset. *The numbers of Frag and IDS in [86, 92, 88] are obtained with
looser evaluation metrics. The human detection results we use are the same as in [36, 55, 48].
Method Recall Precision FAF GT MT PT ML Frag IDS
Huang et al. [36] 71.6% 80.8% - 919 57.0% 28.1% 14.9% 487 278
Li et al. [55] 80.0% 83.5% - 919 77.5% 17.6% 4.9% 310 288
Kuo et al. [48] 80.4% 86.1% 0.992 919 76.1% 19.3% 4.6% 322 224
PIRMPT 79.2% 86.8% 0.920 919 77.0% 17.7% 5.2% 283 171
Table 5.3: Comparison of tracking results between state-of-the-art methods and PIRMPT
on the TRECVID 2008 dataset. The human detection results we use are the same as in
[36, 55, 48].
5.5 Experimental results

To evaluate the performance of PIRMPT, experiments are conducted on three public
datasets: the CAVIAR Test Case Scenarios [1], the TRECVID 2008 dataset [2], and the
ETH Mobile Platform dataset [22]. A comparison between PIRMPT and several
state-of-the-art methods is given based on the commonly used evaluation metrics of [55].
Computation speed is also provided.
5.5.1 CAVIAR dataset

The CAVIAR Test Case Scenarios dataset was captured in a shopping center corridor by
two fixed cameras from two different viewpoints. We use the video clips from the corridor
view only. The frame rate is 25 fps and the image size is 384×288 pixels for all videos.
For a fair comparison with other state-of-the-art methods, the experiments are conducted
on the 20 clips¹ selected by [92]. There are in total 143 people across 29,283 frames and
77,270 detection annotations in the ground truth of the 20 videos. The tracking
evaluation results are shown in Table 5.2. Note that ID switches are reduced by 64% by
the PIRMPT approach; it also achieves lower fragmentation and fewer false alarms per
frame compared to other methods, while keeping competitive numbers on recall rate,
precision rate, and mostly tracked trajectories. Some sample frames with tracking results
are presented in Figure 5.5(a).
5.5.2 TRECVID 2008 dataset

The TRECVID 2008 event detection dataset contains videos from 5 fixed cameras
covering different fields of view in an airport. The authors of [55] extracted 9 videos from
three cameras (Cam 1, Cam 3, and Cam 5) and annotated the tracking ground truth for
their evaluation of multi-target tracking. Each video has a frame rate of 25 fps, an image
size of 720×576 pixels, and a length of 5000 frames. In the ground truth, there are in
total 919 people across 45,000 frames and 342,814 detection annotations from the 9 videos.
¹Originally there are 26 videos in the CAVIAR dataset. The selected 20 videos are
EnterExitCrossingPaths1, EnterExitCrossingPaths2, OneLeaveShop1, OneLeaveShopReenter2,
OneShopOneWait1, OneShopOneWait2, OneStopEnter1, OneStopEnter2, OneStopMoveEnter1,
OneStopMoveEnter2, OneStopMoveNoEnter1, OneStopNoEnter1, OneStopNoEnter2, ShopAssistant1,
ShopAssistant2, ThreePastShop1, TwoEnterShop1, TwoEnterShop3, TwoLeaveShop2, and WalkByShop1.
Sequence Recall Precision FAF GT MT PT ML Frag IDS
BAHNHOF 76.5% 86.6% 0.976 95 51 37 7 21 10
SUNNY DAY 77.9% 86.7% 0.653 30 22 5 3 2 1
Table 5.4: Tracking results from PIRMPT on sequences "BAHNHOF" and "SUNNY
DAY" from the ETH dataset.
The TRECVID 2008 dataset is much more difficult than the CAVIAR dataset, due to its
high crowd density, heavy occlusions, and frequent interactions between different targets.
Table 5.3 presents a comparison with results from [36, 55, 48], which have shown
impressive results on this challenging dataset. Our proposed method decreases the
fragments and ID switches significantly. Compared to [48], fragmentation is reduced by
12% and ID switches are reduced by 24%. This shows that person identity recognition
helps the tracking system make correct associations between tracklets and improves the
tracking results. Some sample frames with tracking results are presented in Figure 5.5(b,c).
5.5.3 ETH mobile platform dataset

The ETH dataset [22] was captured by a stereo pair of forward-looking cameras mounted
on a moving children's stroller in a busy street scene. Due to the low position of the
cameras, total occlusions happen often in these videos, which increases the difficulty of
this dataset. The frame rate is 13-14 fps and the image size is 640×480 pixels for all
videos. The ground truth annotations provided on the website² of [22] are only for
pedestrian detection, not for multi-person tracking. Several previous papers [22, 16, 59]
reported their tracking results in terms of mostly tracked trajectories, fragments, and ID
switches counted manually, which is time-consuming and error-prone. To avoid this, we
created our own tracking ground truth for automatic evaluation; two sequences,
"BAHNHOF" and "SUNNY DAY", from the left camera only are used for our experiments.
In our annotation, the "BAHNHOF" sequence contains 95 individuals over 999 frames and
the "SUNNY DAY" sequence contains 30 individuals over 354 frames. No stereo depth
maps, structure-from-motion localization, or ground plane estimation are utilized in our
method.

The tracking results are presented in Table 5.4. Note that Mitzel et al. [59] proposed a
segmentation-based tracker and recently reported their numbers on the same two
sequences. However, direct comparison is difficult, as the evaluation in [59] is based on
manual counting. The number of targets in our ground truth is much larger than theirs
because we include smaller targets as well as short tracks to have a complete annotation.
Besides, people who undergo long and total occlusion and then appear again are still
considered the same persons in our case. In this type of situation, the appearance models
enhanced by person identity recognition are more important, since the motion information
is not reliable. Some sample frames with tracking results are presented in Figure 5.5(d,e).

²http://www.vision.ee.ethz.ch/aess/dataset/
5.5.4 Speed

The computation speed of tracking depends on the density of observations from the
detector, i.e., the number of detection responses present in each frame of the videos. The
execution times are measured on 20 videos from the CAVIAR dataset and 9 videos from
the TRECVID 2008 dataset. On average, the runtime speed is 48 fps for CAVIAR and
7 fps for TRECVID 2008, not including the processing time of the human detector. Our
implementation is coded in C++ on an Intel Xeon 3.0 GHz PC.
Figure 5.5: Sample tracking result on (a) CAVIAR, (b,c) TRECVID 2008, and (d,e) ETH
dataset.
5.6 Conclusion

We present a system, PIRMPT, which brings the merits of person identity recognition to
multi-person tracking. By off-line learning, a small number of local image descriptors is
selected for use in the tracking framework to maintain both effectiveness and efficiency.
Given reliable tracklets, we classify them into query tracklets and gallery tracklets. For
each gallery tracklet, a target-specific appearance-based affinity model is learned from
on-line training samples collected by spatio-temporal constraints. Experiments on
challenging datasets show significant improvements by our proposed PIRMPT.
Chapter 6

Combination of On-Line and Off-Line Learning

In previous chapters, an advanced on-line learned appearance model, OLDAMs [48], was
presented. Next, we describe an off-line learning-based method to incorporate multiple
cues for better tracklet affinity measurement. This off-line learning framework was
originally published in [55]. Here we introduce [55] briefly and describe how to combine
these two complementary works.
6.1 Introduction

With surveillance applications attracting more and more attention from both academia
and industry, multi-target tracking has become an important problem in computer
vision. It aims at inferring trajectories for each target as precisely as possible without
mixing identities among different targets. This problem becomes very difficult in crowded
scenes for several reasons, including similar appearance, inter- and intra-object occlusions,
interactions between targets, low resolution, etc.

In recent years, detection techniques have achieved significant progress and brought a
new trend of tracking approach: association-based tracking. Such approaches tend to link
detection responses or track fragments (i.e., tracklets) gradually into long tracks
[63, 9, 92, 36]. Unlike most single-target tracking approaches, linking decisions are often
considered globally, and solved efficiently by the Hungarian algorithm [63] or the
cost-flow method [92].

Figure 6.1: The block diagram of our proposed system, which combines on-line (red) and
off-line (blue) learning methods.
One crucial issue for association-based methods is the computation of affinities between
tracklets or detection responses. Pre-defined models and parameters may lack generality
across different cases. To alleviate this problem, Li et al. [55] propose an off-line learning
approach to better combine different cues in affinity decisions; however, all cues are
computed off-line and are fixed during association. On the other hand, considering the
importance of target appearance models, Kuo et al. [48] propose an on-line learning
method to find a more discriminative appearance model called OLDAMs, whereas the
way of combining different cues is fixed and defined by prior knowledge.
Considering the complementary properties of the two methods, we propose a framework
that combines off-line and on-line learning. Our framework is shown in Figure 6.1. We
adopt the hierarchical framework used in [36, 55, 48]: short tracklets are associated
gradually into longer ones over several levels. The maximum tolerable frame gap between
tracklets increases with the level. In each level, the tracklets produced by association at
the previous level are used as input. To measure affinities between tracklets, following
[55], we adopt several static cues measuring motion smoothness, frame gaps, occlusions,
etc. However, for the appearance model, unlike [55], which uses simple color histograms,
we adopt the Online Learned Discriminative Appearance Models (OLDAMs) of [48],
which are computed in a time sliding window to better distinguish different targets. Then
we use the OLDAMs together with the other static cues to learn, in an off-line process, a
better way of combining them. The learned affinity model is used to associate tracklets
at the current level into longer ones, which are in turn used as input for the next level.
In the testing process, tracklets in a video sequence are gradually associated into longer
ones by the affinity models of the different levels. After passing all levels, we obtain the
final tracking results. A more detailed illustration is given in Figure 6.2.
We evaluate our approach on the CAVIAR dataset [1] and the TRECVID08 dataset [2],
which include crowds of pedestrians with diverse moving styles and appearances.
Experiments show that our approach outperforms both the off-line learning and on-line
learning methods, especially in reducing fragments and ID switches in tracks.
Figure 6.2: The detailed block diagram of our proposed method embedded in the
hierarchical tracklet association framework.
6.2 Related Work

According to the information used for tracking, previous approaches can be classified as
using past information only or using global information. Methods of the former predict
targets' positions and identities continually based on information up to the current frame.
Such approaches often adopt a particle filtering framework to decide target positions
online [32, 54, 11, 93, 46]; many of them can be optimized for real-time tracking.
However, owing to the lack of global information, such methods are prone to getting
stuck in local minima, and prediction errors may accumulate continuously. On the other
hand, tracking methods using global information consider both past and future detection
responses, and are often formulated as a tracklet association problem [63, 88, 36, 55, 48].
This kind of approach is more powerful for dealing with noisy observations and target
occlusions, and is more likely to find the global optimal solution by the Hungarian
method [36] or the cost-flow algorithm [92].

One key issue in association-based tracking is the measurement of affinities between
tracklets, which is often set heuristically by prior knowledge in most previous work
[63, 88, 36]. Li et al. [55] first introduced an off-line learning framework to incorporate
multiple cues automatically and produce a tracker with better generality. Among all
kinds of cues, appearance often plays a rather important role in the affinity scores.
Therefore, Kuo et al. introduced the on-line learned discriminative appearance models
(OLDAMs) [48], which integrate multiple appearance cues to produce more discriminative
models between different objects rather than simply using color histograms. However, in
[48], the weights for the different cues are still set manually. Combining both methods
should perform better than either alone.
6.3 System Overview

Our tracking system takes a video sequence as input, and outputs the number of tracked
targets as well as the trajectory of each. For an input video sequence, we first detect
pedestrians in each frame. We train a cascade on efficient JRoG features [35], which have
shown great power on object detection problems. Then the detection results are
associated into confident "low-level" tracklets.

A "low-level" tracklet consists of detection responses in consecutive frames. Two
neighboring responses r_i, r_j are connected if and only if their similarity is higher than a
threshold and the link is confident, meaning that the similarity between r_i and any other
response in the second frame is less than a threshold (with similar constraints for r_j).
The similarity is measured by position, size, and appearance. This dual-threshold method
assures correctness within each single tracklet.
Then our association module takes tracklets as input and produces longer tracklets at
each level, generating the final tracks gradually. In the training process of each level, we
first compute different kinds of features between each pair of tracklets, and then train a
strong classifier that computes affinities for each pair via a boosting process. In the
testing process, we apply the classifier to all possible pairs, and then use the Hungarian
algorithm [63] to find the best configuration. The features we adopt consider lengths of
tracklets, gaps between pairs, entry or exit points, motion smoothness, and appearance.
The training process automatically selects the most discriminative features. All features
are computed statically, i.e., their values do not change with time, except for the
appearance similarity. In our system, the appearance models are built separately in each
time sliding window, so that the models are more powerful in distinguishing different
identities. This approach integrates off-line and on-line learning to produce a more
powerful tracker with better generality.
6.4 Hierarchical tracklet association

In this section, we describe the two stages of association proposed by Huang et al. [36].
In the first stage, the detection responses in any two neighboring frames are linked by a
dual-threshold strategy, which is biased to link the "safe" pairs of detection responses to
form reliable tracklets. In the second stage, the tracklets obtained in the first stage are
iteratively linked into longer and longer trajectories. An MAP formulation is used to
simultaneously find the optimal association of tracklets and determine track
initialization, termination, missed detection recovery, and false alarm removal using the
cost-flow method [92].
6.4.1 Reliable tracklet generation

We denote by R = \{r_i^t\} the set of all detection responses in a time sliding window,
where i is the detection index and t is the frame index. Each r = \{x, y, s, c\} consists of
the center position (x, y) in the image plane, the size s, and the holistic color histogram
c. Given R as the input, the goal is to link the responses frame by frame using a simple
and efficient method. Similar to [86, 36], the link probability between any two responses
in two neighboring frames is defined as the product of three affinities:

P_{link}(r_i^t, r_j^{t+1}) = A_{pos}(r_i^t, r_j^{t+1}) \, A_{size}(r_i^t, r_j^{t+1}) \, A_{appr}(r_i^t, r_j^{t+1})    (6.1)
where A_{pos}, A_{size}, and A_{appr} are affinity measurements based on position, size, and
appearance respectively. Their definitions are:

A_{pos}(r_i^t, r_j^{t+1}) = \gamma_{pos} \exp\left[-\frac{(x_i^t - x_j^{t+1})^2}{\sigma_x^2}\right] \exp\left[-\frac{(y_i^t - y_j^{t+1})^2}{\sigma_y^2}\right]

A_{size}(r_i^t, r_j^{t+1}) = \gamma_{size} \exp\left[-\frac{(s_i^t - s_j^{t+1})^2}{\sigma_s^2}\right]

A_{appr}(r_i^t, r_j^{t+1}) = BC(c_i^t, c_j^{t+1})    (6.2)

where \gamma_{pos} and \gamma_{size} are normalization factors, and BC(\cdot) is the Bhattacharyya
distance between two color histograms. A sketch of these affinity terms is given below.
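To make (6.2) concrete, here is a minimal sketch; the variances, the unit normalization
factors, and the use of the Bhattacharyya coefficient (the similarity form, where higher
means more alike) are illustrative assumptions rather than the thesis settings.

```python
import numpy as np

def a_pos(r1, r2, sx=10.0, sy=10.0):
    """Position affinity: Gaussian falloff in x and y, as in (6.2)."""
    return (np.exp(-(r1["x"] - r2["x"]) ** 2 / sx ** 2)
            * np.exp(-(r1["y"] - r2["y"]) ** 2 / sy ** 2))

def a_size(r1, r2, ss=5.0):
    """Size affinity: Gaussian falloff in detection size."""
    return np.exp(-(r1["s"] - r2["s"]) ** 2 / ss ** 2)

def a_appr(r1, r2):
    """Appearance affinity via the Bhattacharyya coefficient of the
    two normalized color histograms: 1 means identical."""
    return np.sum(np.sqrt(r1["c"] * r2["c"]))

def link_probability(r1, r2):
    """Equation (6.1): product of the three affinities."""
    return a_pos(r1, r2) * a_size(r1, r2) * a_appr(r1, r2)

# Toy usage: two nearby detections with similar histograms.
h1 = np.array([0.5, 0.3, 0.2]); h2 = np.array([0.45, 0.35, 0.2])
r1 = {"x": 100.0, "y": 50.0, "s": 40.0, "c": h1}
r2 = {"x": 103.0, "y": 51.0, "s": 42.0, "c": h2}
print(link_probability(r1, r2))
```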
A dual-threshold strategy is applied to associate pairs of detections in two neighboring
frames in a conservative manner. It links a pair of responses r_i^t and r_j^{t+1} if the
following conditions are met: 1) their affinity is high enough; 2) their affinity is
significantly higher than that of any other pair which involves either r_i^t or r_j^{t+1}.
This method prevents "unsafe" associations until more evidence is collected to resolve
the ambiguity at later stages.

In our implementation, a linking score matrix S^t is computed given all detection
responses in frames t and t+1. Each element S^t(i,j) is determined by (6.1). Two
responses r_i^t and r_j^{t+1} are linked if S^t(i,j) is higher than a threshold \theta_1 and
exceeds every other element in the i-th row and j-th column of S^t by another threshold
\theta_2. This procedure is executed from t = 0 to t = T - 1, where T is the length of the
time sliding window.
Based on this simple dual-threshold strategy, we can efficiently generate a set of reliable
tracklets. Since this method is biased to link only the confident pairs of detection
responses, in general the output will have very few ID switches (no incorrect association
is made) and many fragmentations (correct associations are not made). We further stitch
those "broken" tracks in the hierarchical tracklet association framework.
6.4.2 MAP formulation for tracklet association

Given a set of short but reliable tracklets as described above, data association is applied
iteratively to grow the tracklets. Each round takes the tracklets generated in the previous
round as input and does further association. At round k, given the tracklet set
T^{k-1} = \{T_i^{k-1}\} from the (k-1)-th round, the tracker tries to link the tracklets into
longer ones that form a new tracklet set T^k = \{T_j^k\}, in which
T_j^k = \{T_{i_0}^{k-1}, T_{i_1}^{k-1}, \ldots, T_{i_{l_j}}^{k-1}\}. To obtain an optimal association
result, following [36], we formulate this process as an MAP problem:

T^{k*} = \arg\max_{T^k} P(T^k \mid T^{k-1})
       = \arg\max_{T^k} P(T^{k-1} \mid T^k) \, P(T^k)
       = \arg\max_{T^k} \prod_{T_i^{k-1} \in T^{k-1}} P(T_i^{k-1} \mid T^k) \prod_{T_j^k \in T^k} P(T_j^k)    (6.3)
and model P(T_j^k) as a Markov chain. Thus we have

T^{k*} = \arg\max_{T^k} \prod_{T_i^{k-1}: \forall T_j^k, \, T_i^{k-1} \notin T_j^k} P^-(T_i^{k-1})
         \prod_{T_j^k \in T^k} \left[ P_{init}(T_{i_0}^{k-1}) \, P^+(T_{i_0}^{k-1}) \, P_{link}(T_{i_1}^{k-1} \mid T_{i_0}^{k-1}) \cdots P_{link}(T_{i_{l_k}}^{k-1} \mid T_{i_{l_k-1}}^{k-1}) \, P^+(T_{i_{l_k}}^{k-1}) \, P_{term}(T_{i_{l_k}}^{k-1}) \right]    (6.4)

where P^+(T_i^{k-1}) is the likelihood that T_i^{k-1} is a true tracklet and P^-(T_i^{k-1}) is
the likelihood that it is a false tracklet. Both can be modeled by a Bernoulli distribution
given the detection precision and the tracklet length. P_{init}(T) and P_{term}(T) are the
initialization and termination probabilities, while P_{link}(T \mid T') is the transition
probability. Notice that some tracklets given by the previous stage may be excluded from
the optimal tracklet set T^{k*} as being false tracks.
By defining an inner cost and a transition cost as

L_I(T_i^{k-1}) = \ln \frac{P_{init}(T_i^{k-1}) \, P^+(T_i^{k-1}) \, P_{term}(T_i^{k-1})}{P^-(T_i^{k-1})}    (6.5)

L_T(T_j^{k-1} \mid T_i^{k-1}) = \ln \frac{P_{link}(T_j^{k-1} \mid T_i^{k-1})}{P_{term}(T_i^{k-1}) \, P_{init}(T_j^{k-1})}    (6.6)
we can rewrite (6.3) as:

T^{k*} = \arg\max_{T^k} \prod_{T_i^{k-1} \in T^{k-1}} P^-(T_i^{k-1})
         \prod_{T_j^k \in T^k} \left[ e^{L_I(T_{i_0}^{k-1})} \, e^{L_T(T_{i_1}^{k-1} \mid T_{i_0}^{k-1})} \, e^{L_I(T_{i_1}^{k-1})} \cdots e^{L_I(T_{i_{l_k}}^{k-1})} \right]
       = \arg\max_{T^k} \sum_{T_j^k \in T^k} \left[ L_I(T_{i_0}^{k-1}) + L_T(T_{i_1}^{k-1} \mid T_{i_0}^{k-1}) + L_I(T_{i_1}^{k-1}) + \cdots + L_I(T_{i_{l_k}}^{k-1}) \right]    (6.7)
The cost-flow method [92] and the Hungarian method [36] have shown their efficiency in
solving this MAP problem. However, how to compute those probabilities (or costs)
remains arguable. In particular, L_T(T_j^{k-1} \mid T_i^{k-1}), as the affinity measurement
between two tracklets, plays the most important role in the tracklet association. In the
coming section, the HybridBoost algorithm is proposed to learn strong ranking classifiers
for the computation of this crucial transition cost.
6.5 Off-line Learning of Tracklet Affinity Models

One pivotal element of association-based tracking methods is the affinity model between
two tracklets. In [36, 88], the tracklet affinity models are formed by simple multiplication
of several important cues. However, the pre-defined model parameters and the relative
weights of the different cues often lack generalization to other situations; they also
require unaffordable manual tuning. We follow the work proposed in [55], in which the
model parameters and their corresponding weights are automatically learned from
off-line training data. In this section we show that the association affinity model has the
properties of both a ranking function and a binary classifier, and we propose a hybrid
boosting algorithm to solve the ranking and classification problems simultaneously.
Following that, we give our design of the feature pool, the weak learner, and the training
process.
6.5.1 The HybridBoost algorithm

A classic ranking problem involves an instance space X and a ranker H which defines a
linear ordering of the instances in X. H typically takes the form H: X \to R. Training
data is given as a set of instance pairs R = \{(x_i, x_j) \mid x_i, x_j \in X\}, meaning that x_j
should be ranked higher than x_i, i.e., H(x_j) > H(x_i). The goal is to find an H which
describes the desired preference or ranking over X indicated by the data. RankBoost is
an algorithm developed for this purpose, which has been applied to web search, face
recognition, and other tasks.
The tracklet association problem can be mapped to the ranking problem as follows.
Define the instance space to be X = T \times T, where T is the set of tracklets to perform
association on. For tracklets T_1, T_2, T_3 \in T, if tracklet T_1 should be linked to T_2 to
form a correct trajectory, and T_1 and T_3 should not be linked (e.g., they belong to
different targets), then there is a ranking preference H(T_1, T_2) > H(T_1, T_3). Therefore
H can be used to compute the transition cost.
Moreover, in the tracking problem, the affinity model is not limited to keeping the
relative preference over any two tracklet pairs, but also needs to output a low value for
any tracklet pair that should not be associated. To see why this is necessary, let T be
the terminating tracklet of a target trajectory; the affinity model should prevent
associating T with any other tracklet T'. In this case, no relative ranking preference is
present, but it is desirable to keep H(T, T') < \tau, \forall T' \in T, where \tau is a certain
rejection threshold. In this sense, this is no longer a simple ranking problem but a
combination of ranking and binary classification.
To solve the two problems simultaneously, we combine RankBoost [26] with AdaBoost
[68] to form a HybridBoost algorithm. The training set includes both a ranking sample
set R and a binary sample set B. The ranking sample set is denoted by

R = \{(x_{i,0}, x_{i,1}) \mid x_{i,0} \in X, x_{i,1} \in X\}    (6.8)

where x_{i,0} and x_{i,1} each represent a pair of tracklets as a candidate for association,
and (x_{i,0}, x_{i,1}) \in R indicates that the association of x_{i,1} should be ranked higher
than x_{i,0}. The binary sample set is denoted by

B = \{(x_j, y_j) \mid x_j \in X, y_j \in \{-1, 1\}\}    (6.9)

where y_j = -1 means that the corresponding x_j should not be associated at any time,
while y_j = 1 means the contrary. In Section 6.5.4, we will discuss how to generate the
training sets.
A new loss function for boosting is defined as a linear combination of the ranking loss
and the binary classification loss:

Z = \beta \sum_i w_0(x_{i,0}, x_{i,1}) \exp(H(x_{i,0}) - H(x_{i,1}))
    + (1 - \beta) \sum_j w_0(x_j, y_j) \exp(-y_j H(x_j))    (6.10)
where \beta is a coefficient to adjust the emphasis on either part, and w_0 is the initial
weight of each sample, which will be updated during boosting. The goal is to find an
H(x) that minimizes Z. As in traditional boosting, H is obtained by sequentially adding
new weak ranking classifiers. In the t-th round, we try to find an optimal weak ranking
classifier h_t: X \to R and its weight \alpha_t that minimize

Z_t = \beta \sum_i w_t(x_{i,0}, x_{i,1}) \exp(\alpha_t (h_t(x_{i,0}) - h_t(x_{i,1})))
      + (1 - \beta) \sum_j w_t(x_j, y_j) \exp(-\alpha_t y_j h_t(x_j))    (6.11)
and we update the sample weights according to h_t and \alpha_t to emphasize difficult
ranking and binary samples. The final strong ranking classifier is the weighted
combination of the selected weak ranking classifiers: H(x) = \sum_{t=1}^{n} \alpha_t h_t(x),
where n is the number of boosting rounds. The HybridBoost procedure is shown in
Algorithm 2.
Because of the hybrid form of our loss function, H(x) has the merits of both a ranker
and a classifier. When used in the affinity model, for any pair of tracklets x = (T_1, T_2),
we define the transition cost as

L_T(T_2 \mid T_1) = \begin{cases} H(x), & \text{if } H(x) > \tau \\ -\infty, & \text{otherwise} \end{cases}    (6.12)

where \tau is a threshold which conveniently controls the trade-off between trajectory
fragmentation and the risk of wrong associations (ID switches). The default \tau learned
by boosting is 0, which we use for our experiments.
Algorithm 2 HybridBoost algorithm

Input:
    ranking sample set R = \{(x_{i,0}, x_{i,1}) \mid x_{i,0} \in X, x_{i,1} \in X\}
    binary sample set B = \{(x_j, y_j) \mid x_j \in X, y_j \in \{-1, 1\}\}
1: Set w_0(x_{i,0}, x_{i,1}) = \beta / |R| and w_0(x_j, y_j) = (1 - \beta) / |B|
2: for t = 1 to n do
3:     On the current sample distribution, find the optimal weak ranking classifier
       h_t: X \to R and its weight \alpha_t by the weak learner
4:     Update the sample weights:
       For each ranking sample (x_{i,0}, x_{i,1}),
           w_t(x_{i,0}, x_{i,1}) = w_{t-1}(x_{i,0}, x_{i,1}) \exp[\alpha_t (h_t(x_{i,0}) - h_t(x_{i,1}))]
       For each binary sample (x_j, y_j),
           w_t(x_j, y_j) = w_{t-1}(x_j, y_j) \exp[-\alpha_t y_j h_t(x_j)]
5:     Normalize w_t
6: end for
Output: H(x) = \sum_{t=1}^{n} \alpha_t h_t(x)
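To make Algorithm 2 concrete, the following is a minimal sketch (not the thesis
implementation) of the HybridBoost loop with stump weak learners; the toy data, the
exhaustive threshold search, and the grid search over candidate \alpha values in place of
an analytic line search are all illustrative assumptions, and \beta is folded into the
initial weights as in step 1.

```python
import numpy as np

def stump(X, j, thr):
    """Weak ranking classifier (6.14): +1 if feature j exceeds thr, else -1."""
    return np.where(X[:, j] > thr, 1.0, -1.0)

def hybrid_boost(Xr0, Xr1, Xb, yb, beta=0.75, n_rounds=10,
                 alphas=np.linspace(0.05, 1.0, 20)):
    """Minimal HybridBoost: joint ranking + binary-classification boosting.

    Xr0, Xr1: features of the lower- / higher-ranked members of each
              ranking pair; Xb, yb: binary samples with +/-1 labels.
    Returns the strong ranker as a list of (feature, threshold, alpha).
    """
    wr = np.full(len(Xr0), beta / len(Xr0))        # ranking sample weights
    wb = np.full(len(Xb), (1 - beta) / len(Xb))    # binary sample weights
    model = []
    for _ in range(n_rounds):
        best = None
        for j in range(Xb.shape[1]):
            for thr in np.unique(Xb[:, j]):
                dr = stump(Xr1, j, thr) - stump(Xr0, j, thr)
                hb = stump(Xb, j, thr)
                for a in alphas:
                    # Hybrid loss (6.11) for this candidate (h, alpha).
                    z = ((wr * np.exp(-a * dr)).sum()
                         + (wb * np.exp(-a * yb * hb)).sum())
                    if best is None or z < best[0]:
                        best = (z, j, thr, a)
        _, j, thr, a = best
        # Step 4: re-weight to emphasize hard ranking and binary samples.
        wr *= np.exp(-a * (stump(Xr1, j, thr) - stump(Xr0, j, thr)))
        wb *= np.exp(-a * yb * stump(Xb, j, thr))
        total = wr.sum() + wb.sum()
        wr, wb = wr / total, wb / total            # step 5: normalize
        model.append((j, thr, a))
    return model

def H(model, X):
    """Strong ranking classifier H(x) = sum_t alpha_t h_t(x)."""
    return sum(a * stump(X, j, thr) for j, thr, a in model)

# Toy usage: feature 0 carries the signal, feature 1 is noise.
rng = np.random.default_rng(0)
Xr1 = np.c_[rng.normal(1.0, 1.0, 50), rng.normal(size=50)]
Xr0 = np.c_[rng.normal(-1.0, 1.0, 50), rng.normal(size=50)]
yb = rng.choice([-1.0, 1.0], 40)
Xb = np.c_[yb + rng.normal(0, 0.5, 40), rng.normal(size=40)]
print(hybrid_boost(Xr0, Xr1, Xb, yb, n_rounds=5))
```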
6.5.2 Feature pool

The boosting algorithm works with a feature pool and a weak learner. Each feature is a
function f: X \to R; namely, it takes a pair of tracklets x = (T_1, T_2) as input and
outputs a real value. The criterion for constructing the feature pool is to include any cue
that can provide evidence for whether to associate two tracklets or not. In our
implementation, five categories comprising 19 types of features are used, as listed in
Table 6.1; more can be added easily if necessary.
Most features are easily understood from their descriptions. We add some more
explanation of the appearance-related features (14 and 15) and the motion-related
features (16, 17, 18, and 19).
Length:
  1  Length of T_i
  2  Length of T_j
  3  Number of detection responses in T_i
  4  Number of detection responses in T_j
  5  Number of detection responses in T_i divided by the length of T_i
  6  Number of detection responses in T_j divided by the length of T_j
Gap:
  7  Frame gap between T_i's tail and T_j's head
  8  Number of miss-detected frames in the gap between T_i and T_j
  9  Number of frames occluded by other tracklets in the gap between T_i and T_j
  10 Number of miss-detected frames in the gap divided by the gap length
  11 Number of occluded frames in the gap divided by the gap length
Entry/Exit:
  12 Estimated time from T_i's head to the nearest entry point
  13 Estimated time from T_j's tail to the nearest exit point
Appearance:
  14 Appearance affinity of T_i's tail part and T_j's head part by OLDAMs
  15 Appearance consistency in the gap between T_i and T_j
Motion:
  16 Position smoothness if connecting T_i and T_j
  17 Velocity smoothness if connecting T_i and T_j
  18 Acceleration smoothness if connecting T_i and T_j
  19 Total motion smoothness if connecting T_i and T_j

Table 6.1: The list of features for the tracklet affinity model.
OLDAMs: In [55], simple static color histograms are used as appearance models. In this
work, we replace them with on-line learned discriminative appearance models to measure
the similarity of two tracklets. The details are given in Chapter 4.
The appearance consistency in the gap: To compute the appearance consistency between
two tracklets, we interpolate the detection responses in the gap and then extract the
color histogram for each of them. The averaged color histogram of the interpolated
responses is denoted as H_{ij}^{gap}. Supposing that the color histograms of T_i's tail and
T_j's head are H_i^{tail} and H_j^{head} respectively, the measurement of the appearance
consistency can be expressed as:

f(T_i, T_j) = \max\left( BC(H_i^{tail}, H_{ij}^{gap}), \, BC(H_j^{head}, H_{ij}^{gap}) \right)    (6.13)

where BC(\cdot, \cdot) indicates the Bhattacharyya distance.
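A minimal sketch of feature 15, assuming the detection responses have already been
interpolated across the gap and that each response yields a normalized color histogram;
the Bhattacharyya distance form used here (negative log of the coefficient) is one common
convention and is an assumption.

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two normalized histograms."""
    return -np.log(np.sum(np.sqrt(h1 * h2)) + 1e-12)

def gap_consistency(hist_tail_i, hist_head_j, gap_hists):
    """Feature 15 / equation (6.13): worst-case distance between the
    averaged gap histogram and the two tracklet ends."""
    h_gap = np.mean(gap_hists, axis=0)
    h_gap /= h_gap.sum()
    return max(bhattacharyya_distance(hist_tail_i, h_gap),
               bhattacharyya_distance(hist_head_j, h_gap))

# Toy usage: three interpolated responses inside the gap.
tail = np.array([0.5, 0.3, 0.2])
head = np.array([0.45, 0.35, 0.2])
gap = [np.array([0.48, 0.32, 0.2]), np.array([0.5, 0.3, 0.2]),
       np.array([0.46, 0.34, 0.2])]
print(gap_consistency(tail, head, gap))
```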
6.5.3 Weak learner

The weak learner aims at finding the optimal (h, \alpha) that minimizes the training loss Z
in each boosting round. We use a stump function on a single feature as the weak ranking
classifier:

h(x) = \begin{cases} 1, & \text{if } f(x) > \eta \\ -1, & \text{otherwise} \end{cases}    (6.14)

h(x) monotonically increases with f(x) because we use prior knowledge to pre-adjust the
sign of each feature; e.g., for the appearance feature, the smaller the distance between
two tracklets' color histograms, the more likely it is that T_1 and T_2 should be associated.
To learn the optimal (h, \alpha), we enumerate all the possible features f and a number of
promising candidate thresholds \eta. Since h is fixed given f and \eta, we then compute

\hat{\alpha} = \arg\min_{\alpha} Z(\alpha)    (6.15)

The algorithm of the weak learner is given in Table. To make the search for \eta and the
computation of Z more efficient, we build an adaptive histogram of the current training
sample distribution for each feature f.
6.5.4 Training process

As mentioned above, we use multiple stages of data association, and for each stage an
affinity model is learned. We hereby describe training sample generation for a single
stage.

Let the tracklet set from the previous stage be T and the ground-truth track set be G.
First we compute a mapping

\varphi: T \to G \cup \{G_\emptyset\}    (6.16)

so that \varphi(T) = G \in G indicates tracklet T is a correct tracklet matched with the
ground-truth track G, while \varphi(T) = G_\emptyset indicates T is a false tracklet.
For each T_i \in T, if \varphi(T_i) = G \neq G_\emptyset, we consider two possible associations
involving T_i: 1) connecting T_i's tail to the head of some other tracklet after T_i which is
also matched to G; 2) connecting T_i's head to the tail of some other tracklet before T_i
which is also matched to G. For the first case, the possible correct association pairs and
the incorrect ones can be represented respectively by

X_{i,1}^{tail} = \{(T_i, T_j) \mid \varphi(T_j) = G, \, T_i \text{ is linkable to } T_j\}    (6.17)

X_{i,0}^{tail} = \{(T_i, T_j) \mid \varphi(T_j) \neq G, \, T_i \text{ is linkable to } T_j\}    (6.18)

Here T_i is linkable to T_j if and only if T_i occurs before T_j and the frame gap between
them does not exceed the maximum allowed frame gap in the current association stage.
According to the definition of the ranking problem, X_{i,0}^{tail} \times X_{i,1}^{tail} are
valid ranking training samples. Similarly, for connecting T_i's head with another
tracklet's tail, we have

X_{i,1}^{head} = \{(T_j, T_i) \mid \varphi(T_j) = G, \, T_j \text{ is linkable to } T_i\}    (6.19)

X_{i,0}^{head} = \{(T_j, T_i) \mid \varphi(T_j) \neq G, \, T_j \text{ is linkable to } T_i\}    (6.20)

and X_{i,0}^{head} \times X_{i,1}^{head} are ranking samples as well.

Therefore the ranking training sample set R is

R = \bigcup_{T_i \in T} (X_{i,0}^{tail} \times X_{i,1}^{tail}) \cup (X_{i,0}^{head} \times X_{i,1}^{head})    (6.21)
As for the binary classification samples, the positive samples include all the correct
association pairs:

B_1 = \bigcup_{T_i \in T} (X_{i,1}^{tail} \cup X_{i,1}^{head}) \times \{1\}    (6.22)

The negative samples are collected from those cases where all association choices are
incorrect, to force the classification function to reject all of them:

B_{-1} = \left( \bigcup_{T_i \in T, \, X_{i,1}^{tail} = \emptyset} X_{i,0}^{tail} \times \{-1\} \right) \cup \left( \bigcup_{T_i \in T, \, X_{i,1}^{head} = \emptyset} X_{i,0}^{head} \times \{-1\} \right)    (6.23)

Finally, the binary classification sample set is the union of the two:

B = B_1 \cup B_{-1}    (6.24)
Just as in the actual tracking process, training is also performed in a hierarchical way.
As shown in Figure 6.2, for stage k, training consists of three steps: 1) use the ground
truth G and the tracklet set T^{k-1} obtained from stage k-1 to generate ranking and
binary classification samples as described above; 2) learn a strong ranking classifier H_k
by the HybridBoost algorithm; 3) apply the tracker using H_k as the affinity model to
perform association on T^{k-1} and generate T^k, which is the input to the next stage.
This cycle is performed K times to build a K-stage tracker.
6.6 Experimental Results

We evaluate our proposed multi-target tracking system on two challenging datasets: the
CAVIAR dataset and the TRECVID08 dataset. The performance of our method is
compared with several state-of-the-art methods. The computational speed is provided,
and some graphical examples are also given.
6.6.1 Implementation details

Given the detection results, we use the dual-threshold strategy to generate short but
reliable tracklets on a frame-to-frame basis as done in [36]. After that, four stages of
association are used. The maximum allowed frame gap for tracklet association in each
stage is 16, 32, 64, and 128 respectively. For each stage, an affinity model (a strong
ranking classifier H with 100 weak ranking classifiers) is trained to compute the
transition cost L_T(T \mid T'), and the inner cost L_I(T) in (6.5) is calculated in the way
proposed by [36]. The combination coefficient \beta of the hybrid loss function in (6.10)
is set to 0.75 for evaluation. The threshold \tau of each strong ranking classifier controls
the trade-off between fragmentation and ID switches. It can be either selected
automatically based on training data or specified by the user. We simply use \tau = 0
for all the strong ranking classifiers. There is no other parameter that needs human
intervention.
6.6.2 Appearance feature analysis

In our method, the appearance model we use is OLDAMs instead of the static color
histogram used in [55]. To compare these two features, we add both to the feature pool
simultaneously and observe how the tracklet affinity model selects them. Note that one
feature typically contributes multiple stump weak classifiers with different weights and
thresholds, so the output of one feature has a piecewise form. Figure 6.3 shows the
outputs of these two features as learned in each stage. In terms of the weighted
coefficients and the number of times a feature is selected, it is clear that OLDAMs
outperform the simple color histogram: the OLDAMs feature is selected more often.

Figure 6.3: Sample tracking result on the CAVIAR dataset. Top row: two targets
indicated by arrows switch their IDs when the color histogram serves as the appearance
model. Bottom row: the same two targets are tracked successfully when the OLDAMs
are used.
6.6.3 Tracking performance

We report the tracking performance on the TRECVID08 dataset and the CAVIAR
dataset. For the TRECVID08 dataset, the experiment is executed on 9 videos with a
frame size of 720×576 and a frame rate of 25 fps chosen from the TRECVID08 set,
which contains three different scenes in a major airport, each video being 5000 frames in
length. We take another 9 videos as the off-line training data. This dataset is considered
an extremely challenging task due to its high crowd density and frequent inter-occlusions.
The comparison between our proposed method and the state-of-the-art methods is shown
in Table 6.2.
Method Recall Precision FAF GT MT PT ML Frag IDS
Huang et al. [36] 71.6% 80.8% - 919 57.0% 28.1% 14.9% 487 278
Li et al. [55] 80.0% 83.5% - 919 77.5% 17.6% 4.9% 310 288
Kuo et al. [48] 80.4% 86.1% 0.992 919 76.1% 19.3% 4.6% 322 224
Ours 80.3% 86.1% 0.985 919 75.4% 19.2% 5.4% 253 182
Table 6.2: Tracking results on the TRECVID08 dataset.
Method Recall Precision FAF GT MT PT ML Frag IDS
Wu et al. [86] 75.2% - 0.281 140 75.7% 17.9% 6.4% 35* 17*
Zhang et al. [92] 76.4% - 0.105 140 85.7% 10.7% 3.6% 20* 15*
Xing et al. [88] 81.8% - 0.136 140 84.3% 12.1% 3.6% 24* 14*
Huang et al. [36] 86.3% - 0.186 143 78.3% 14.7% 7.0% 54 12
Li et al. [55] 89.0% - 0.157 143 84.6% 14.0% 1.4% 17 11
Kuo et al. [48] 89.4% 96.9% 0.085 143 84.6% 14.7% 0.7% 18 11
Ours 89.4% 97.4% 0.063 143 85.3% 13.3% 1.4% 18 10
Table 6.3: Tracking results on the CAVIAR dataset. *The numbers of Frag and IDS in
[86, 92, 88] are obtained with different metrics from the stricter ones we adopt [55].
Our result achieves similar recall, precision, and mostly tracked trajectories compared to
[55, 48], while having a significant reduction in both fragments and ID switches. This
shows that combining the on-line learned discriminative appearance model with the
off-line learned tracklet affinity model outperforms the previous methods.

To show the generalization ability of our system, we use the model parameters and
weighted coefficients learned on TRECVID08 and apply them to the CAVIAR dataset.
This dataset contains 26 video sequences of a corridor in a shopping center taken by a
single camera with a frame size of 384×288 and a frame rate of 25 fps. To have a fair
comparison with the state-of-the-art methods, we conduct our experiment on the 20
videos selected by [92]. Tracking evaluation results are presented in Table 6.3. Even
though our system is trained on a different dataset, the result is still competitive with
the previous methods.
6.7 Conclusion

We propose an approach with off-line and on-line learning of tracklet affinity models.
Off-line learning aims at combining different cues such as motion, appearance, and gap.
On-line learning focuses on the discriminative appearance models, which are adaptive to
the current targets. This enhanced affinity model is integrated into a hierarchical data
association framework for the multi-target tracking task in crowded scenarios.
Experiments on two challenging datasets show impressive progress by our method.
Chapter 7

Inter-Camera Tracking for Multiple Targets

Based on our system of multi-target tracking in a single camera, we would like to extend
it to a more general and realistic scenario: multi-target tracking across multiple
non-overlapping cameras. This work was originally published in [47].
7.1 Introduction

Multi-target tracking is an important problem in computer vision, especially for
applications such as visual surveillance systems. In many scenarios, multiple cameras are
required to monitor a large area. The goal is to locate targets, track their trajectories,
and maintain their identities as they travel within or across cameras. Such a system
consists of two main parts: 1) intra-camera tracking, i.e., tracking multiple targets within
a camera; 2) inter-camera association, i.e., the "handover" of tracked targets from one
camera to another. Although there have been significant improvements in intra-camera
tracking, inter-camera track association when cameras have non-overlapping fields of
view (FOVs) remains a less explored topic, and it is the problem we focus on in this
chapter.
Figure 7.1: Illustration of inter-camera association between two non-overlapping cameras.
Given tracked targets in each camera, our goal is to find the optimal correspondence
between them, such that the associated pairs belong to the same object. A target may
walk across the two cameras, return to the original one, or exit in the blind area. Also,
a target entering Camera 2 from the blind area is not necessarily from Camera 1, but
may be from somewhere else. Such open blind areas significantly increase the difficulty
of the inter-camera track association problem.
An illustration of inter-camera association of multiple tracks is shown in Figure 7.1.
Compared to intra-camera tracking, inter-camera association is more challenging because
1) the appearance of a target in different cameras may not be consistent due to different
sensor characteristics, lighting conditions, and viewpoints; and 2) the spatio-temporal
information of tracked objects between cameras is much less reliable. Besides, the open
blind area significantly increases the complexity of the inter-camera track association
problem.
Associating multiple tracks in different cameras can be formulated as a correspondence
problem. Given the observations of tracked targets, the goal is to find the associated
pairs of tracks which maximize a joint linking probability, in which the key component
is the affinity between tracks. For the affinity score, there are generally two main cues
to be considered: the spatio-temporal information and the appearance relationships
between two non-overlapping cameras. Compared to spatio-temporal information, the
appearance cues are more reliable for distinguishing different targets, especially in cases
where the FOVs are disjoint. However, such cues are also more challenging to design,
since the appearances of targets are complex and dynamic in general. A robust
appearance model should be adaptive to the current targets and environments.
A desired appearance model should incorporate discriminative properties between correct
matches and wrong ones. Given a set of tracks from two non-overlapping cameras, the
aim of the affinity model is to distinguish the tracks which belong to the same target
from those which belong to different targets. Previous methods [65, 40, 15] mostly
focused on learning appearance models or mapping functions based on the correct
matches, with no negative information considered in their learning procedures. To the
best of our knowledge, on-line learning of a discriminative appearance affinity model
across cameras has not been utilized before.
Collecting positive and negative training samples on-line is difficult since no
hand-labelled correspondence is available at runtime. Hence, traditional learning
algorithms may not apply. However, by observing the spatio-temporal constraints of
tracks between two cameras, some potentially associated pairs of tracks and some
impossible pairs can be formed as "weakly labelled samples". We propose to adopt
Multiple Instance Learning (MIL) [21, 81, 7] to accommodate the ambiguity of labelling
during the model learning process. The learned discriminative appearance affinity model
is then combined with spatio-temporal information to compute the crucial affinities in
the track association framework, achieving a robust inter-camera association system. It
can be incorporated with any intra-camera tracking method to solve the problem of
multi-object tracking across non-overlapping cameras.
7.2 Related work

There is a large amount of work, e.g., [13, 17, 45], on multi-camera tracking with
overlapping fields of view. These methods usually require camera calibration and
environmental models to track targets. However, the assumption that cameras have
overlapping fields of view is not always practical, due to the large number of cameras
required and the physical constraints upon their placement.

In the literature, [37, 62, 43] represent some early work on multi-camera tracking with
non-overlapping fields of view. To establish correspondence between objects in different
cameras, spatio-temporal information and appearance relationships are two important
cues. For the spatio-temporal cue, several works [39, 20, 57, 30] propose approaches
mainly for learning the probability distributions of the spatial and temporal relationships
between cameras. On the other hand, several works [65, 40, 30, 76, 15] discuss the
appearance cue; they usually use color information and learn the brightness transfer
functions (BTFs) between cameras.
Besides, there is some work addressing the optimization framework for multi-target
correspondence. Kettnaker and Zabih [43] used a Bayesian formulation to reconstruct the
paths of targets across multiple cameras. Javed et al. [39] dealt with this problem by
maximizing the a posteriori probability using a graph-theoretic framework. Song and
Roy-Chowdhury [73] proposed a multi-objective optimization framework combining
short-term feature correspondences across the cameras with long-term feature
dependency models.

Figure 7.2: The block diagram of our system for associating multiple tracked targets
from multiple non-overlapping cameras.
Learning a discriminative appearance affinity model across non-overlapping cameras at
runtime makes our approach different from the existing ones. Most previous methods did
not incorporate any discriminative information to distinguish different targets, which is
important for inter-camera track association, especially when the scene contains multiple
similar targets.
7.3 Overview of our approach

Our system contains three main components: the method for collecting on-line training
samples, the discriminative appearance affinity model, and the track association
framework. We use a time sliding window to process the video sequences, and the
learned appearance affinity models are updated in each time window. The system block
diagram of our method is shown in Figure 7.2.
The collection of on-line training samples is performed by observing the spatio-temporal
constraints in a time sliding window. Assuming that multi-object tracking has been
completed in each camera, a training sample is defined as a pair of tracks from the two
cameras respectively. Negative samples are collected by extracting pairs of tracks in the
two cameras which overlap in time, based on the assumption that one object cannot
appear in two non-overlapping cameras at the same time. Positive samples could be
collected from similar spatio-temporal information. However, it is difficult to label
positive training samples in an on-line manner, since doing so amounts to the very
correspondence problem that we want to solve. Instead of labelling each sample, several
potentially linked pairs of tracks constitute one positive "bag", which is suitable for the
Multiple Instance Learning (MIL) paradigm.
The learning of the appearance affinity model aims to determine whether two tracks
from different cameras belong to the same target or not, according to their appearance
descriptors and similarity measurements. Instead of using only color information as in
previous work, appearance descriptors consisting of color histograms, covariance
matrices, and HOG features are computed at multiple locations to increase the
descriptive power. Similarity measurements based on those features among the training
samples establish the feature pool. Once the training samples are collected in a time
sliding window, a MIL boosting algorithm is applied to select discriminative features
from this pool along with their corresponding weighted coefficients, and combines them
into a strong classifier in the same time sliding window, so that the learned models are
adapted to the current scenario. The prediction confidence output by this classifier is
transformed into a probability space, where it cooperates with other cues (e.g., spatial
correspondence and time interval) to compute the affinity between tracks for association.
The association of tracks in two cameras is formulated as a standard assignment problem.
A correspondence matrix is defined whose pairwise association probabilities are computed
from spatio-temporal cues and appearance information. This matrix is designed to
consider all possible scenarios in two non-overlapping cameras. The Hungarian algorithm
is applied to solve this problem efficiently.
7.4 Track association between cameras

To perform track association across multiple cameras, we first focus on the track
association between two cameras and then extend it to the case of multiple cameras.
Previous methods often model it as an MAP problem and find the optimal solution via
Bayes' theorem [43, 15], a graph-theoretic approach [39], or expected weighted similarity
[74]. We present an efficient yet effective approach which maximizes the joint linking
probability. Assume that the task of single-camera tracking has already been solved;
there are m tracks in camera C_a, denoted by T^a = \{T_1^a, \ldots, T_m^a\}, and n tracks
in camera C_b, denoted by T^b = \{T_1^b, \ldots, T_n^b\}. We could simply create an m by
n matrix and find the optimal correspondence between T^a and T^b. However, in the
case of non-overlapping cameras, there exist "blind" areas where objects are invisible.
For example, an object which leaves C_a does not necessarily enter C_b, as it may either
go to an exit in the blind area or return to C_a. We define an extended correspondence
matrix of size (2m + 2n) \times (2m + 2n) as follows:
H = \begin{bmatrix}
      A_{m \times m} & B_{m \times n} & F_{m \times m} & -\infty_{m \times n} \\
      D_{n \times m} & E_{n \times n} & -\infty_{n \times m} & G_{n \times n} \\
      J_{m \times m} & -\infty_{m \times n} & 0_{m \times m} & 0_{m \times n} \\
      -\infty_{n \times m} & K_{n \times n} & 0_{n \times m} & 0_{n \times n}
    \end{bmatrix}    (7.1)
This formulation is inspired by [36], but we make the necessary modifications to
accommodate all situations which can happen between the tracks of two non-overlapping
cameras. The components of each sub-matrix are defined as follows.
B_{ij} = \log P_{link}(T_i^a \to T_j^b) is the linking score for the tail of T_i^a linking to
the head of T_j^b; it models the situation that a target leaves C_a and then enters C_b. A
similar description applies to D_{ij} = \log P_{link}(T_i^b \to T_j^a).
A_{ij} = \log P_{link}(T_i^a \to T_j^a) for i \neq j is the linking score for the tail of T_i^a
linking to the head of T_j^a; it models the situation that a target leaves C_a and then
re-enters C_a without travelling to C_b. A similar description also applies to
E_{ij} = \log P_{link}(T_i^b \to T_j^b) for i \neq j. F_{ij} (or G_{ij}) with i = j is the score
for T_i^a (or T_j^b) being terminated; it models the situation that the tail of a track
cannot be linked to the head of any other track. J_{ij} and K_{ij} with i = j are the
scores for T_i^a or T_j^b being initialized; they model the situation that the head of a
track cannot be linked to the tail of any other track. By applying the Hungarian
algorithm to H, the optimal assignment of associations is obtained efficiently. A
summary of each sub-matrix of H is given in Table 7.1, and a sketch of the construction
follows Table 7.1.
The linking probability, i.e., the affinity between two tracks T_i and T_j, is defined as the
product of three important cues (appearance, space, and time):

P_{link}(T_i \to T_j) = P_a(T_i, T_j) \, P_s(e(T_i), e(T_j)) \, P_t(T_i \to T_j \mid e(T_i), e(T_j))    (7.2)
matrix  description                             element
A       the target leaves and returns to C_a    A_{ij} = -\infty if i = j
B       the target leaves C_a and enters C_b    B_{ij} is a full matrix
D       the target leaves C_b and enters C_a    D_{ij} is a full matrix
E       the target leaves and returns to C_b    E_{ij} = -\infty if i = j
F       the target terminates in C_a            F_{ij} = -\infty if i \neq j
G       the target terminates in C_b            G_{ij} = -\infty if i \neq j
J       the target is initialized in C_a        J_{ij} = -\infty if i \neq j
K       the target is initialized in C_b        K_{ij} = -\infty if i \neq j

Table 7.1: A short summary of the elements of each sub-matrix of H, which models all
possible situations between the tracks of two non-overlapping cameras. The optimal
assignment is solved by the Hungarian algorithm.
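A minimal sketch of building H and solving the assignment, assuming the
log-probability blocks A, B, D, E and the termination/initialization score vectors have
already been computed; `scipy.optimize.linear_sum_assignment` (a minimizer, so the
scores are negated) stands in for a hand-written Hungarian solver, and a large negative
constant stands in for -\infty.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

NEG = -1e9  # stands in for -infinity (a forbidden assignment)

def build_H(A, B, D, E, f, g, j, k):
    """Extended (2m+2n) x (2m+2n) score matrix of equation (7.1).

    A, B, D, E: blocks of log linking scores within/between cameras.
    f, g: termination scores for the m tracks in C_a / n tracks in C_b.
    j, k: initialization scores for the tracks in C_a / C_b.
    """
    m, n = B.shape
    A = np.where(np.eye(m, dtype=bool), NEG, A)   # no self-links (Table 7.1)
    E = np.where(np.eye(n, dtype=bool), NEG, E)
    F = np.where(np.eye(m, dtype=bool), f, NEG)   # diagonal termination block
    G = np.where(np.eye(n, dtype=bool), g, NEG)
    J = np.where(np.eye(m, dtype=bool), j, NEG)   # diagonal initialization block
    K = np.where(np.eye(n, dtype=bool), k, NEG)
    top = np.hstack([A, B, F, np.full((m, n), NEG)])
    mid = np.hstack([D, E, np.full((n, m), NEG), G])
    bot1 = np.hstack([J, np.full((m, n), NEG), np.zeros((m, m + n))])
    bot2 = np.hstack([np.full((n, m), NEG), K, np.zeros((n, m + n))])
    return np.vstack([top, mid, bot1, bot2])

def associate(H, m, n):
    """Maximize the total score with the Hungarian method."""
    rows, cols = linear_sum_assignment(-H)        # scipy minimizes
    return [(r, c - m) for r, c in zip(rows, cols)
            if r < m and m <= c < m + n]          # C_a tail -> C_b head links

# Toy usage: 2 tracks in C_a, 2 in C_b; track 0 in C_a matches track 1 in C_b.
m, n = 2, 2
B = np.log([[0.05, 0.80], [0.10, 0.05]])
A = np.full((m, m), NEG); E = np.full((n, n), NEG); D = np.full((n, m), NEG)
f = g = j = k = np.log([0.2, 0.2])                # weak init/term priors
print(associate(build_H(A, B, D, E, f, g, j, k), m, n))   # -> [(0, 1)]
```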
where e(T_i) denotes the exit/entry region of T_i. Each of the three components measures the likelihood of T_i and T_j being the same object. The latter two terms, P_s and P_t, capture spatio-temporal information which can be learned automatically by the methods proposed in [57, 15]. We focus on the first term, P_a, and propose a novel framework for online learning of a discriminative appearance affinity model.
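For illustration, the combination in Eq. (7.2) could be computed as below. This is a minimal sketch, assuming that the spatio-temporal terms P_s and P_t are supplied as callables learned per [57, 15], that the appearance term is the real-valued output of the learned affinity model squashed by a logistic function, and that the hypothetical track objects expose their exit/entry regions.

import math

def p_link(Ti, Tj, appearance_score, P_s, P_t):
    """Linking probability of Eq. (7.2): product of appearance, space, and time cues."""
    P_a = 1.0 / (1.0 + math.exp(-appearance_score))  # assumed logistic mapping to (0, 1)
    e_i, e_j = Ti.exit_region, Tj.entry_region       # hypothetical track attributes
    return P_a * P_s(e_i, e_j) * P_t(Ti, Tj, e_i, e_j)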
7.5 Discriminative appearance affinity models with Multiple Instance Learning
Our goal is to learn a discriminative appearance affinity model across the cameras at runtime. However, choosing positive and negative training samples is a major challenge, since exact hand-labelled correspondence is not available while learning online. Based on the spatio-temporal constraints, we can only exclude some impossible links and retain several possible ones; the retained pairs are called "weakly labelled training examples".
Recent work [81, 7] presents promising results on face detection and visual tracking, respectively, using Multiple Instance Learning (MIL). Compared to traditional discriminative learning, in MIL the samples are presented in "bags", and the labels are provided for the bags instead of for individual samples. A positive bag contains at least one positive sample; a negative bag contains only negative samples. Since some flexibility is allowed in the labelling process, we may use the "weakly labelled training examples" obtained from spatio-temporal constraints and apply a MIL boosting algorithm to learn the discriminative appearance affinity model.
7.5.1 Collecting training samples
We propose a method to collect weakly labelled training samples using spatio-temporal constraints. To learn an appearance affinity model between cameras, a training sample is defined as a pair of tracks, one from each camera. Based on the tracks generated by a robust single-camera multi-target tracker, we make a conservative assumption: any two tracks from two non-overlapping cameras which overlap in time represent different targets. It is based on the observation that one target cannot appear at different locations at the same time. Positive samples are more difficult to obtain since there is no supervised information to indicate which two tracks among two cameras represent the same object. In other words, the label "+1" cannot be assigned to individual training samples. To deal with this challenging on-line labelling problem, we collect possible pairs of tracks by examining spatio-temporal constraints and put them into a "bag" which is labelled "+1". MIL boosting is then applied to learn the desired discriminative appearance affinity model.
In our implementation, two sets are formed for each track: a set of "similar" tracks and a set of "discriminative" tracks. For a given track T^a_j in camera C_a, each element of its "discriminative" set D^b_j indicates a target T^b_k in camera C_b which cannot be the same target as T^a_j; each element of the "similar" set S^b_j represents a possible target T^b_k in C_b which might be the same target as T^a_j. These cases are described as:

\begin{aligned}
T^b_k \in \mathcal{S}^b_j &\quad \text{if}\quad P_s(T^a_j \to T^b_k)\, P_t\big(e(T^a_j), e(T^b_k)\big) > \theta \\
T^b_k \in \mathcal{D}^b_j &\quad \text{if}\quad P_s(T^a_j \to T^b_k)\, P_t\big(e(T^a_j), e(T^b_k)\big) = 0
\end{aligned} \qquad (7.3)
The threshold θ is adaptively chosen to maintain a moderate number of instances in each positive bag. The training sample set B = B^+ ∪ B^- can be denoted by

\begin{aligned}
\mathcal{B}^{+} &= \Big\{\, x_i : \{T^a_j, T^b_k\},\ \forall\, T^b_k \in \mathcal{S}^b_j;\ y_i : +1 \Big\} \\
\mathcal{B}^{-} &= \Big\{\, x_i : (T^a_j, T^b_k),\ \text{if } T^b_k \in \mathcal{D}^b_j;\ y_i : -1 \Big\}
\end{aligned} \qquad (7.4)

where each training sample x_i may contain multiple pairs of tracks, which represents a bag. A label is given to each bag.
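As a concrete reading of Eqs. (7.3) and (7.4), the sketch below collects, for each track in C_a, its "similar" and "discriminative" sets and emits one positive bag plus individually labelled negative samples. The track objects and the spatio-temporal scoring function st_score are hypothetical placeholders for the quantities defined above.

def collect_bags(tracks_a, tracks_b, st_score, theta):
    """Build weakly labelled training data from spatio-temporal constraints.

    st_score(Ta, Tb) should return P_s(Ta -> Tb) * P_t(e(Ta), e(Tb));
    theta is the adaptive threshold of Eq. (7.3).
    """
    positive_bags, negative_samples = [], []
    for Ta in tracks_a:
        similar = [Tb for Tb in tracks_b if st_score(Ta, Tb) > theta]
        impossible = [Tb for Tb in tracks_b if st_score(Ta, Tb) == 0]
        if similar:
            # one positive bag: at least one of these pairs is assumed to match
            positive_bags.append([(Ta, Tb) for Tb in similar])
        # every impossible pair is an individually labelled negative sample
        negative_samples.extend((Ta, Tb) for Tb in impossible)
    return positive_bags, negative_samples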
7.5.2 Representation of appearance model and similarity measurement
To build a strong appearance model, we begin by computing several local features to describe a tracked target. In our design, three complementary features, namely color histograms, covariance matrices, and histograms of oriented gradients (HOG), constitute the feature pool. Given a tracked target, features are extracted at different locations and different scales from the head and tail parts of the track to increase the descriptive ability.
We use RGB color histograms to represent the color appearance of a local image patch. Histograms have the advantage of being easy to implement and having well-studied similarity measures. Single-channel histograms are concatenated to form a vector f^{RGB}_i; any other suitable color space could also be used. In our implementation, we use 8 bins for each channel to form a 24-element vector. To describe the image texture, we use a descriptor based on covariance matrices of image features, proposed in [77]; it has been shown to give good performance for texture classification and object categorization. To capture shape information, we choose the Histogram of Gradients (HOG) feature proposed in [19]. In our design, a 32D HOG feature f^{HOG}_i is extracted over a region R; it is formed by concatenating 8 orientation bins in 2 × 2 cells over R.
In summary, the appearance descriptor of a track T_i can be written as:

\mathcal{A}_i = \big(\{f^{RGB,l}_i\}, \{C^l_i\}, \{f^{HOG,l}_i\}\big) \qquad (7.5)

where f^{RGB,l}_i is the color histogram feature vector, C^l_i is the covariance matrix, and f^{HOG,l}_i is the 32D HOG feature vector. The superscript l means that the features are evaluated over region R_l.
Given the appearance descriptors, we can compute the similarity between two patches. The color histogram and the HOG feature are histogram-based, so standard measures such as the χ² distance, the Bhattacharyya distance, and the correlation coefficient can be used. In our implementation, the correlation coefficient is chosen for simplicity. The distance between covariance matrices is computed by solving a generalized eigenvalue problem, as described in [77].
After computing the appearance model and the similarity between appearance descriptors at different regions, we form a feature vector by concatenating the similarity measurements of the different appearance descriptors at multiple locations. This feature vector gives us a feature pool from which an appropriate boosting algorithm can construct a strong classifier.
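One plausible realization of this feature pool is sketched below: the correlation coefficient for the two histogram-based descriptors and the generalized-eigenvalue distance of [77] for the covariance descriptors, concatenated over regions. The region layout and descriptor extraction are abstracted away; negating the covariance distance, so that larger values mean more similar, is a choice of this sketch.

import numpy as np
from scipy.linalg import eigh

def correlation_coefficient(u, v):
    """Similarity of two histogram-based descriptors (color histogram or HOG)."""
    u = u - u.mean()
    v = v - v.mean()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def covariance_distance(C1, C2):
    """Distance of [77]: sqrt(sum_i ln^2 lambda_i), where the lambda_i are the
    generalized eigenvalues of the pair of covariance matrices (C1, C2)."""
    lam = eigh(C1, C2, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def similarity_features(desc_i, desc_j):
    """Concatenate per-region similarity measurements into the feature vector
    from which boosting draws its weak hypotheses. Each descriptor is a tuple
    (list of color histograms, list of covariances, list of HOG vectors)."""
    rgb_i, cov_i, hog_i = desc_i
    rgb_j, cov_j, hog_j = desc_j
    feats = [correlation_coefficient(a, b) for a, b in zip(rgb_i, rgb_j)]
    feats += [-covariance_distance(a, b) for a, b in zip(cov_i, cov_j)]
    feats += [correlation_coefficient(a, b) for a, b in zip(hog_i, hog_j)]
    return np.array(feats)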
7.5.3 Multiple instance learning
Our goal is to design a discriminative appearance model which determines the affinity score of appearance between two objects in two different cameras. Again, a sample is defined as a pair of targets, one from each camera. The affinity model takes a pair of objects as input and returns a real-valued score from a linear combination of weak classifiers. The larger the affinity score, the more likely that the two objects in one sample represent the same target. We adopt the MIL boosting framework proposed in [81] to select the weak classifiers and their corresponding weighted coefficients. Compared to conventional discriminative boosting, training samples are not labelled individually in MIL; they form "bags" and the label is given to each bag, not to each sample. Each sample is denoted by x_ij, where i is the index of the bag and j is the index of the sample within the bag. The label of each bag is represented by y_i, where y_i ∈ {0, 1}. Although the known labels are given to bags instead of samples, the goal is to learn an instance classifier of the following form:
H(x_{ij}) = \sum_{t=1}^{T} \alpha_t\, h_t(x_{ij}) \qquad (7.6)
In our framework, the weak hypotheses come from the feature pool obtained in Section 7.5.2. We adjust the sign and normalize h(x) to lie in the range [−1, +1]. The sign of h(x) is interpreted as the predicted label and the magnitude |h(x)| as the confidence in this prediction.
The probability of a sample x_ij being positive is defined by the standard logistic function,

p_{ij} = \sigma(y_{ij}) = \frac{1}{1 + \exp(-y_{ij})} \qquad (7.7)
where y_{ij} = H(x_{ij}). The probability of a bag being positive is defined by the "noisy OR" model:

p_i = 1 - \prod_j (1 - p_{ij}) \qquad (7.8)
If one of the samples in a bag has a high probability p_ij, the bag probability p_i will be high as well. This property appropriately models the rule that a bag is labelled positive if there is at least one positive sample in it. MIL boosting uses the gradient boosting framework to train a boosting classifier that maximizes the log-likelihood of the bags:

\log L(H) = \sum_i \Big( y_i \log p_i + (1 - y_i) \log(1 - p_i) \Big) \qquad (7.9)
The weight of each sample is given by the derivative of the loss function \log L(H) with respect to the score of that sample, y_ij:

w_{ij} = \frac{\partial \log L(H)}{\partial y_{ij}} = \frac{y_i - p_i}{p_i}\, p_{ij} \qquad (7.10)
Our goal is to find the H(x) which maximizes (7.9), where H(x) is obtained by sequentially adding new weak classifiers. In the t-th boosting round, we aim to learn the optimal weak classifier h_t and its weighted coefficient α_t that optimize the loss function:

(\alpha_t, h_t) = \arg\max_{h,\, \alpha} \log L(H_{t-1} + \alpha h) \qquad (7.11)

To find the optimal (α_t, h_t), we follow the framework used in [58, 81], which views boosting as a gradient descent process: each round searches for the weak classifier h_t that maximizes the gradient of the loss function, and the weighted coefficient α_t is then determined by a line search to maximize \log L(H + \alpha_t h_t). The learning procedure is summarized in Algorithm 3.
7.6 Experimental results
The experiments are conducted on a three-camera setting with disjoint fields of view. First, we evaluate the effectiveness of our proposed on-line learned discriminative appearance affinity model by formulating the correspondence problem as a binary classification problem. Second, for a real scenario of multiple non-overlapping cameras, an evaluation metric is defined and the tracking results using our proposed system are presented.
Algorithm 3 Multiple Instance Learning Boosting
Input:
    B^+ = { ({x_i1, x_i2, ...}, +1) }: positive bags
    B^- = { ({x_i1, x_i2, ...}, -1) }: negative bags
    F = { h^k(x_ij) }: feature pool
 1: Initialize H = 0
 2: for t = 1 to T do
 3:   for k = 1 to K do
 4:     p^k_ij = σ(H + h^k(x_ij))
 5:     p^k_i = 1 − ∏_j (1 − p^k_ij)
 6:     w^k_ij = ((y_i − p^k_i) / p^k_i) · p^k_ij
 7:   end for
 8:   Choose k* = arg max_k Σ_ij w^k_ij h^k(x_ij)
 9:   Set h_t = h^{k*}
10:   Find α* = arg max_α log L(H + α h_t) by line search
11:   Set α_t = α*
12:   Update H ← H + α_t h_t
13: end for
Output: H(x) = Σ^T_{t=1} α_t h_t(x)
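A direct transcription of Algorithm 3 might look like the sketch below. Two simplifying assumptions are made here that the chapter leaves open: weak hypotheses are passed in as arbitrary callables over instance feature vectors, and the line search over the coefficient α is done on a fixed grid rather than exactly.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def mil_boost(bags, labels, weak_learners, T=50, alphas=np.linspace(0.05, 2.0, 40)):
    """Sketch of Algorithm 3 (MIL boosting).

    bags: list of 2-D arrays, one row per instance feature vector x_ij;
    labels: bag labels y_i in {0, 1};
    weak_learners: callables h^k mapping a feature matrix to responses in [-1, +1].
    Returns the (alpha_t, h_t) pairs defining H(x) = sum_t alpha_t * h_t(x).
    """
    labels = np.asarray(labels, dtype=float)
    H = [np.zeros(len(b)) for b in bags]  # running instance scores y_ij

    def bag_log_likelihood(scores):
        ll = 0.0
        for s, y in zip(scores, labels):
            p_i = 1.0 - np.prod(1.0 - sigmoid(s))  # noisy OR, Eq. (7.8)
            p_i = np.clip(p_i, 1e-12, 1.0 - 1e-12)
            ll += y * np.log(p_i) + (1.0 - y) * np.log(1.0 - p_i)
        return ll

    model = []
    for _ in range(T):
        best_k, best_grad = None, -np.inf
        for k, h in enumerate(weak_learners):
            grad = 0.0
            for b, s, y in zip(bags, H, labels):
                resp = h(b)
                p_ij = sigmoid(s + resp)                    # line 4
                p_i = 1.0 - np.prod(1.0 - p_ij)             # line 5
                w_ij = (y - p_i) / max(p_i, 1e-12) * p_ij   # line 6, Eq. (7.10)
                grad += float(w_ij @ resp)                  # criterion of line 8
            if grad > best_grad:
                best_k, best_grad = k, grad
        h_t = weak_learners[best_k]
        # line 10: coarse grid search for the alpha maximizing the log-likelihood
        alpha_t = max(alphas, key=lambda a: bag_log_likelihood(
            [s + a * h_t(b) for b, s in zip(bags, H)]))
        H = [s + alpha_t * h_t(b) for b, s in zip(bags, H)]  # line 12
        model.append((alpha_t, h_t))
    return model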
It is shown that our method achieves good performance in a crowded scene. Some graphical examples are also provided.
7.6.1 Comparison of discriminative power
We first evaluate the discriminative ability of our appearance affinity model, independently of the tracking framework in which it will be embedded. Given the tracks in each camera, we manually label the pairs of tracks that should be associated to form the ground truth. Affinity scores are computed among every possible pair in a time sliding window by four methods: (1) the correlation coefficient of two color histograms; (2) the model proposed
Camera pair   color only   no learning   off-line learning   on-line learning
C_1, C_2      0.231        0.156         0.137               0.094
C_2, C_3      0.381        0.222         0.217               0.159

Table 7.2: Comparison of the Equal Error Rate using different appearance affinity models. It shows that the on-line learning method has the most discriminative power.
in Section 7.5 but without MIL learning, i.e. with equal coefficients α_t; (3) off-line MIL learning, i.e. learning done on a different time sliding window; (4) MIL learning on the same time window. In the three-camera setting, the experiments are conducted on two camera pairs, (C_1, C_2) and (C_2, C_3); the equal error rate on the two tasks is the metric used to evaluate performance. For (C_1, C_2), the number of evaluated pairs is 434 and the number of positive pairs is 35. For (C_2, C_3), the number of evaluated pairs is 148 and the number of positive pairs is 18. The length of the time sliding window is 5000 frames. The experimental results are shown in Table 7.2. For each camera pair, the model using online MIL learning achieves the lowest equal error rate among the four methods.
7.6.2 Evaluation metrics
In previous work, quantitative evaluation of multi-target tracking across multiple cameras is barely mentioned, or a single number, e.g. tracking accuracy, is used; it is defined as the ratio of the number of objects tracked correctly to the total number of objects that passed through the scene in [40, 15]. However, this may not be a suitable metric to measure the performance of a system fairly, especially in a crowded scene where targets have complicated interactions. For example, two tracked targets exchanging their identities twice while travelling across a series of three cameras should count as worse than exchanging them only once; nevertheless, these two situations are both counted as incorrectly tracked objects under the "tracking accuracy" metric. We need a more complete metric to evaluate the performance of inter-camera track association.
In the case of tracking within a single camera, fragments and ID switches are two commonly used metrics. We adopt the definitions used in [48] and apply them to the case of tracking across cameras. Assuming that multi-target tracking within each camera is given, we focus only on the fragments and ID switches which are not defined within cameras. Given the tracks in two cameras C_a and C_b, T^a = {T^a_1, ..., T^a_m} and T^b = {T^b_1, ..., T^b_n}, the evaluation metrics are:
Crossing Fragments (X-Frag): the total number of times that there is a link between T^a_i and T^b_j in the ground truth, but it is missing in the tracking result.

Crossing ID Switches (X-IDS): the total number of times that there is no link between T^a_i and T^b_j in the ground truth, but one exists in the tracking result.

Returning Fragments (R-Frag): the total number of times that there is a link between T^a_i and T^a_j, representing a target leaving and returning to C_a, in the ground truth, but it is missing in the tracking result.

Returning ID Switches (R-IDS): the total number of times that there is no link between T^a_i and T^a_j, meaning that they represent different targets in the ground truth, but one exists in the tracking result.
For example, suppose there are T^a_1 and T^a_2 in C_a, and T^b_1 and T^b_2 in C_b. In the ground truth, (T^a_1, T^b_1) and (T^a_2, T^b_2) are the linked pairs. If they switch their identities in the tracking result, i.e. (T^a_1, T^b_2) and (T^a_2, T^b_1) become the linked pairs, that is counted as 2 X-Frag and 2 X-IDS. This metric is stricter, but better defined, than the traditional notion of fragments and ID switches. Similar descriptions apply to R-Frag and R-IDS. The lower these four metrics, the better the tracking performance.
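Given ground-truth and tracker links represented as sets of track-ID pairs, these four counts reduce to set differences. A minimal sketch under that assumed representation:

def crossing_metrics(gt_links, result_links):
    """X-Frag: ground-truth cross-camera links missing from the tracking result;
    X-IDS: links present in the tracking result but absent from the ground truth.
    Both arguments are sets of (track_id_in_Ca, track_id_in_Cb) pairs."""
    x_frag = len(gt_links - result_links)
    x_ids = len(result_links - gt_links)
    return x_frag, x_ids

# R-Frag and R-IDS are computed the same way over the links that leave and
# return to a single camera, e.g. crossing_metrics(gt_return_a, result_return_a).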
7.6.3 Tracking results
The videos used in our evaluation are captured by three cameras in a campus environment, with a frame size of 852 × 480 and a length of 25 minutes. This dataset is more challenging than those used in previous works in the literature since it features a more crowded scene (2 to 10 people per frame in each camera). There are many inter-object occlusions and interactions, and people often walk across cameras. The multi-target tracker we use within a camera is based on [48], a detection-based tracking algorithm with hierarchical association.

We compare our approach with different appearance models; the results are shown in Table 7.3. Row (a) represents the input, i.e. no linking between any tracks in each camera. Row (b) uses only a color histogram as the appearance model. In row (c), our proposed appearance model is used but learned off-line, which means the coefficients α_t are fixed. Row (d) uses our proposed on-line learned appearance models. It shows that our proposed on-line learning method outperforms the other appearance models, which confirms that our stronger appearance model with on-line learning improves the tracking performance. Some association results are shown in Figure 7.3; our method finds the correct association among multiple targets in a complex scene, e.g. the people with IDs 74, 75, and 76 when they travel from Camera 2 to Camera 1.
Method                  X-Frag   X-IDS   R-Frag   R-IDS
(a) input tracks        206      0       15       0
(b) color only          9        18      12       8
(c) off-line learning   6        15      11       7
(d) on-line learning    4        12      10       6

Table 7.3: Tracking results using different appearance models with our proposed metrics. The lower the numbers, the better the performance. Our on-line learned appearance affinity models achieve the best results.
7.7 Conclusion
We describe a novel system for associating multi-target tracks across multiple non-overlapping cameras. The contribution of this work centers on learning a discriminative appearance affinity model at runtime. To solve the ambiguous labelling problem, we adopt a Multiple Instance Learning boosting algorithm to learn the desired discriminative appearance models. An effective multi-object correspondence optimization framework for the inter-camera track association problem is also presented. Experimental results on a challenging dataset show the robust performance of our proposed system.
[Figure 7.3: paired snapshots from (a) Camera 1 and (b) Camera 2 at Frames 777, 1044, 1259, and 1964.]

Figure 7.3: Sample tracking results on our dataset. Some tracked people travelling through the cameras are linked by dotted lines. For example, the targets with IDs 74, 75, and 76 leave Camera 2 at around the same time, and our method finds the correct association when they enter Camera 1. This figure is best viewed in color.
Chapter 8
Future work
For intra-camera tracking, we have implemented a system to track multiple targets in a crowded scene using on-line learned discriminative appearance models (OLDAMs). We extended this framework to solve the inter-camera association of multiple target tracks by Multiple Instance Learning. Our evaluations on several public datasets demonstrate that our methods have higher discriminative power between different targets than previous methods.

In the future, we plan to improve our tracking system in several aspects, including robust segmentation of targets, dynamic appearance descriptors, and recognition using attributes.
8.1 Robust segmentation of targets
One disadvantage of the detection-based tracking method is that the bounding box is not always perfectly aligned to the target, which results in imprecise extraction of appearance descriptors. To overcome this issue, segmentation of the targets could be an appropriate solution. Instead of computing the appearance descriptors over sub-rectangles of detection responses, a reasonable segmenter can distinguish background from foreground and thus provide a reliable cue for extracting accurate appearance models. Intuitively, improving the appearance affinity models will in turn boost the tracking performance.
8.2 Dynamic appearance descriptors
In our system, the appearance affinity model takes two images of targets as input and computes the similarity between them. However, this can be viewed as a still-image-based matching scheme; the temporal continuity of a tracklet is not yet taken into consideration. A well-designed dynamic appearance model should help the overall performance.
8.3 Recognition using attributes
How do humans recognize targets? One possible method is through the attributes which the targets possess. In other words, a human-specified high-level description can be learned to represent the targets. For example, a tracked target can be a man who is wearing a black suit and dragging red luggage. Unlike low-level descriptors, which are not invariant to different viewpoints, such high-level attributes should be of great value in helping the system track the targets, even when they travel across different cameras.
References
[1] CAVIAR dataset. http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.
[2] National Institute of Standards and Technology: TRECVID 2008 evaluation for surveillance event detection. http://www.nist.gov/speech/tests/trecvid/2008/.
[3] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. People-tracking-by-detection and people-detection-by-tracking. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2008.
[4] Anton Andriyenko and Konrad Schindler. Globally optimal multi-target tracking on a hexagonal lattice. In Proceedings of European Conference on Computer Vision (ECCV), 2010.
[5] Anton Andriyenko and Konrad Schindler. Multi-target tracking by continuous energy minimization. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[6] Shai Avidan. Ensemble tracking. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[7] B. Babenko, Ming-Hsuan Yang, and S. Belongie. Visual tracking with online multiple instance learning. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2009.
[8] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[9] J. Berclaz, F. Fleuret, and P. Fua. Robust people tracking with global trajectory optimization. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2006.
[10] Stanley T. Birchfield and Sriram Rangarajan. Spatiograms versus histograms for region-based tracking. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[11] Michael D. Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and Luc Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In Proceedings of International Conference on Computer Vision (ICCV), 2009.
[12] Michael D. Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and Luc Van Gool. Online multi-person tracking-by-detection from a single, uncalibrated camera. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33:1820–1833, 2011.
[13] Q. Cai and J. K. Aggarwal. Tracking human motion in structured environments using a distributed-camera system. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 21:1241–1247, November 1999.
[14] Yizheng Cai, Nando de Freitas, and James J. Little. Robust visual tracking for multiple targets. In Proceedings of European Conference on Computer Vision (ECCV), 2006.
[15] Kuan-Wen Chen, Chih-Chuan Lai, Yi-Ping Hung, and Chu-Song Chen. An adaptive learning method for target tracking across multiple cameras. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2008.
[16] Wongun Choi and Silvio Savarese. Multiple target tracking in world coordinate with single, minimally calibrated camera. In Proceedings of European Conference on Computer Vision (ECCV), 2010.
[17] Robert Collins, Alan Lipton, Hironobu Fujiyoshi, and Takeo Kanade. Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE, 89(1):1456–1477, October 2001.
[18] Robert T. Collins and Yanxi Liu. On-line selection of discriminative tracking features. In Proceedings of International Conference on Computer Vision (ICCV), 2003.
[19] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[20] Anthony R. Dick and Michael J. Brooks. A stochastic approach to tracking objects across multiple cameras. In Australian Conference on Artificial Intelligence, 2004.
[21] Thomas G. Dietterich, Richard H. Lathrop, and Tomas Lozano-Perez. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.
[22] A. Ess, B. Leibe, K. Schindler, and L. van Gool. A mobile vision system for robust multi-person tracking. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2008.
[23] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[24] Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2008.
[25] T. Fortmann, Y. Bar-Shalom, and M. Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. IEEE Journal of Oceanic Engineering, 8:173–184, 1983.
[26] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In Journal of Machine Learning Research, 2003.
[27] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2), 2000.
[28] Brian Fulkerson, Andrea Vedaldi, and Stefano Soatto. Localizing objects with smart dictionaries. In Proceedings of European Conference on Computer Vision (ECCV), 2008.
[29] Niloofar Gheissari, Thomas B. Sebastian, Peter H. Tu, and Jens Rittscher. Person reidentification using spatiotemporal appearance. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2006.
[30] Andrew Gilbert and Richard Bowden. Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In Proceedings of European Conference on Computer Vision (ECCV), 2006.
[31] H. Grabner and H. Bischof. On-line boosting and vision. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2006.
[32] Helmut Grabner, Jiri Matas, Luc Van Gool, and Philippe Cattin. Tracking the invisible: Learning where the object might be. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[33] Douglas Gray and Hai Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In Proceedings of European Conference on Computer Vision (ECCV), 2008.
[34] Chang Huang, Haizhou Ai, Yuan Li, and Shihong Lao. High-performance rotation invariant multiview face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29:671–686, 2007.
[35] Chang Huang and Ram Nevatia. High performance object detection by collaborative learning of joint ranking of granule features. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[36] Chang Huang, Bo Wu, and Ramakant Nevatia. Robust object tracking by hierarchical association of detection responses. In Proceedings of European Conference on Computer Vision (ECCV), 2008.
[37] Timothy Huang and Stuart Russell. Object identification in a Bayesian context. In International Joint Conference on Artificial Intelligence (IJCAI), 1997.
[38] M. Isard and J. MacCormick. BraMBLe: A Bayesian multiple-blob tracker. In Proceedings of International Conference on Computer Vision (ICCV), 2001.
[39] Omar Javed, Zeeshan Rasheed, Khurram Shafique, and Mubarak Shah. Tracking across multiple cameras with disjoint views. In Proceedings of International Conference on Computer Vision (ICCV), 2003.
[40] Omar Javed, Khurram Shafique, and Mubarak Shah. Appearance modeling for tracking in multiple non-overlapping cameras. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[41] Hao Jiang, Sidney Fels, and James J. Little. A linear programming approach for multiple object tracking. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2007.
[42] R. Kasturi, D. Goldgof, V. Manohar, M. Boonstra, and V. Korzhova. Performance evaluation protocol for face, person and vehicle detection and tracking in video analysis and content extraction. In Workshop on Classification of Events, Activities and Relationships, 2006.
[43] Vera Kettnaker and Ramin Zabih. Bayesian multi-camera surveillance. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 1999.
[44] Saad M. Khan and Mubarak Shah. A multiview approach to tracking people in crowded scenes using a planar homography constraint. In Proceedings of European Conference on Computer Vision (ECCV), 2006.
[45] Sohaib Khan and Mubarak Shah. Consistent labeling of tracked objects in multiple cameras with overlapping fields of view. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25:1355–1360, 2003.
[46] Louis Kratz and Ko Nishino. Tracking with local spatio-temporal motion patterns in extremely crowded scenes. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[47] Cheng-Hao Kuo, Chang Huang, and Ram Nevatia. Inter-camera association of multi-target tracks by on-line learned appearance affinity models. In Proceedings of European Conference on Computer Vision (ECCV), 2010.
[48] Cheng-Hao Kuo, Chang Huang, and Ram Nevatia. Multi-target tracking by on-line learned discriminative appearance models. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[49] Cheng-Hao Kuo and Ram Nevatia. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In Proceedings of IEEE Workshop on Applications of Computer Vision (WACV), 2009.
[50] Cheng-Hao Kuo and Ram Nevatia. How does person identity recognition help multi-person tracking? In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[51] Bastian Leibe, Konrad Schindler, and Luc Van Gool. Coupled detection and trajectory estimation for multi-object tracking. In Proceedings of International Conference on Computer Vision (ICCV), 2007.
[52] Bastian Leibe, Edgar Seemann, and Bernt Schiele. Pedestrian detection in crowded scenes. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[53] Kobi Levi and Yair Weiss. Learning object detection from a small number of examples: the importance of good features. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2004.
[54] Yuan Li, Haizhou Ai, Takayoshi Yamashita, Shihong Lao, and Masato Kawade. Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2007.
[55] Yuan Li, Chang Huang, and Ram Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2009.
[56] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, November 2004.
[57] Dimitrios Makris, Tim Ellis, and James Black. Bridging the gaps between cameras. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2004.
[58] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms as gradient descent in function space, 1999.
[59] Dennis Mitzel, Esther Horbert, Andreas Ess, and Bastian Leibe. Multi-person tracking with sparse detection and continuous segmentation. In Proceedings of European Conference on Computer Vision (ECCV), 2010.
[60] Kenji Okuma, Ali Taleghani, Nando de Freitas, James J. Little, and David G. Lowe. A boosted particle filter: Multitarget detection and tracking. In Proceedings of European Conference on Computer Vision (ECCV), 2004.
[61] Omar Oreifej, Ramin Mehran, and Mubarak Shah. Human identity recognition in aerial images. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[62] Hanna Pasula, Stuart Russell, Michael Ostland, and Ya'acov Ritov. Tracking many objects with many sensors. In International Joint Conference on Artificial Intelligence (IJCAI), 1999.
[63] A. G. Amitha Perera, Chukka Srinivas, Anthony Hoogs, Glen Brooksby, and Wensheng Hu. Multi-object tracking through simultaneous long occlusions and split-merge conditions. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2006.
[64] Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[65] Fatih Porikli. Inter-camera color calibration by correlation model function. In Proceedings of International Conference on Image Processing (ICIP), 2003.
[66] Donald B. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 24:843–854, 1979.
[67] Jens Rittscher, Peter H. Tu, and Nils Krahnstoever. Simultaneous estimation of segmentation and shape. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[68] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Machine Learning, pages 80–91, 1999.
[69] William Robson Schwartz and Larry S. Davis. Learning discriminative appearance-based models using partial least squares. In SIBGRAPI, 2009.
[70] Vinay D. Shet, Jan Neumann, Visvanathan Ramesh, and Larry S. Davis. Bilattice-based logical reasoning for human detection. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2007.
[71] Vivek Kumar Singh, Bo Wu, and Ram Nevatia. Pedestrian tracking by associating tracklets using detection residuals. In IEEE Workshop on Motion and Video Computing (WMVC), 2008.
[72] Kevin Smith, Daniel Gatica-Perez, and Jean-Marc Odobez. Using particles to track varying numbers of interacting people. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[73] Bi Song and A. K. Roy-Chowdhury. Robust tracking in a camera network: A multi-objective optimization framework. IEEE Journal of Selected Topics in Signal Processing, 2(4):582–596, August 2008.
[74] Bi Song and Amit K. Roy-Chowdhury. Stochastic adaptive tracking in a camera network. In Proceedings of International Conference on Computer Vision (ICCV), 2007.
[75] Chris Stauffer. Estimating tracking sources and sinks. In Proceedings of Computer Vision and Pattern Recognition Workshop, 2003.
[76] J. Sturges and T. Whitfield. Locating basic colour in the Munsell space. Color Research and Application, 20:364–376, 1995.
[77] Oncel Tuzel, Fatih Porikli, and Peter Meer. Region covariance: A fast descriptor for detection and classification. In Proceedings of European Conference on Computer Vision (ECCV), 2006.
[78] Oncel Tuzel, Fatih Porikli, and Peter Meer. Human detection via classification on Riemannian manifolds. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2007.
[79] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2001.
[80] Paul Viola, Michael Jones, and Daniel Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of International Conference on Computer Vision (ICCV), 2003.
[81] Paul Viola, John C. Platt, and Cha Zhang. Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems (NIPS), 2005.
[82] Xiaoyu Wang, Tony X. Han, and Shuicheng Yan. An HOG-LBP human detector with partial occlusion handling. In Proceedings of International Conference on Computer Vision (ICCV), 2009.
[83] William Brendel, Mohamed Amer, and Sinisa Todorovic. Multiobject tracking as maximum weight independent set. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[84] Bo Wu and Ram Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[85] Bo Wu and Ram Nevatia. Tracking of multiple, partially occluded humans based on static body part detection. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2006.
[86] Bo Wu and Ram Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. International Journal of Computer Vision (IJCV), 75(2):247–266, November 2007.
[87] Zheng Wu, Thomas H. Kunz, and Margrit Betke. Efficient track linking methods for track graphs using network-flow and set-cover techniques. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[88] Junliang Xing, Haizhou Ai, and Shihong Lao. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2009.
[89] Bo Yang, Chang Huang, and Ram Nevatia. Learning affinities and dependencies for multi-target tracking using a CRF model. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[90] Ming Yang, Fengjun Lv, Wei Xu, and Yihong Gong. Detection driven adaptive multi-cue integration for multiple human tracking. In Proceedings of International Conference on Computer Vision (ICCV), 2009.
[91] Qian Yu, G. Medioni, and I. Cohen. Multiple target tracking using spatio-temporal Markov chain Monte Carlo data association. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2007.
[92] Li Zhang, Yuan Li, and Ram Nevatia. Global data association for multi-object tracking using network flows. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2008.
[93] Li Zhang, Bo Wu, and Ram Nevatia. Detection and tracking of multiple humans with extensive pose articulation. In Proceedings of International Conference on Computer Vision (ICCV), 2007.
[94] Tao Zhao and Ram Nevatia. Tracking multiple humans in complex situations. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(9):1208–1221, 2004.
[95] Tao Zhao and Ram Nevatia. Tracking multiple humans in crowded environment. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2004.
[96] Qiang Zhu, Shai Avidan, Mei-Chen Yeh, and Kwang-Ting Cheng. Fast human detection using a cascade of histograms of oriented gradients. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2006.
Abstract
We present our work on multiple pedestrians tracking in a single camera and across multiple non-overlapping cameras. We propose an approach for online learning of discriminative appearance models for robust multi-target tracking in a crowded scene. Although much progress has been made in developing methods for optimal data association, there has been comparatively less work on the appearance model, which is the key element for good performance. Many previous methods either use simple features such as color histograms, or focus on the discriminability between a target and the background, which does not resolve ambiguities between the different targets. We propose an algorithm for learning discriminative appearance models for different targets. Training samples are collected online from tracklets within a time sliding window based on some spatial-temporal constraints; this allows the models to adapt to target instances. Learning uses an AdaBoost algorithm that combines effective image descriptors and their corresponding similarity measurements. We term the learned models OLDAMs. Our evaluations indicate that OLDAMs have significantly higher discrimination between different targets than conventional holistic color histograms, and when integrated into a hierarchical association framework, they help improve the tracking accuracy, particularly reducing the false alarms and identity switches.

Furthermore, we extend our approach to multiple non-overlapping cameras. Given the multi-target tracking results in each camera, we propose a framework to associate those tracks. Collecting reliable training samples is a major challenge in on-line learning since supervised correspondence is not available at runtime. To alleviate the inevitable ambiguities in these samples, Multiple Instance Learning (MIL) is applied to learn an appearance affinity model which effectively combines three complementary image descriptors and their corresponding similarity measurements. Based on the spatial-temporal information and the proposed appearance affinity model, we present an improved inter-camera track association framework to solve the "target handover" problem across cameras. Our evaluations indicate that our method has higher discrimination between different targets than previous methods.