MULTIPLE HUMANS TRACKING BY LEARNING APPEARANCE AND MOTION PATTERNS

by Bo Yang

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE). August 2012. Copyright 2012 Bo Yang.

Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1  Introduction
  1.1 Challenges
    1.1.1 Detection errors
    1.1.2 Object occlusion
    1.1.3 Non-linear motion
    1.1.4 Appearance modeling
  1.2 Related work
    1.2.1 Category free tracking
    1.2.2 Association based tracking
    1.2.3 Using scene knowledge for tracking
  1.3 Overview of our approach
  1.4 Thesis organization
Chapter 2  Online Learned Knowledge from Static Cameras for Multi-Target Tracking
  2.1 Introduction
  2.2 Related work
  2.3 Motion map learning for motion affinities
    2.3.1 Non-linear motion map learning
    2.3.2 Estimation of motion affinities
  2.4 MIL using the entry/exit map
    2.4.1 Incremental learning of the entry/exit map
    2.4.2 Learning for appearance models
  2.5 Track completion
  2.6 Experiments
    2.6.1 Evaluation metric
    2.6.2 Entry/exit map estimation
    2.6.3 Performance on less crowded data sets
    2.6.4 Performance on Trecvid 2008 data set
    2.6.5 Computation speed
  2.7 Conclusion
Chapter 3  Tracking by an Offline Learned CRF Model
  3.1 Introduction
  3.2 Related work
  3.3 CRF formulation for tracklet association
  3.4 Learning a CRF model for association
    3.4.1 Creating the association graph
    3.4.2 Learning energy terms
    3.4.3 Learning the weak energy function
    3.4.4 Implementation details
  3.5 Experiments
    3.5.1 Analysis of the training process
    3.5.2 Tracking performance
  3.6 Conclusion
Chapter 4  An Online Learned CRF Model for Multi-Target Tracking from Both Static and Moving Cameras
  4.1 Introduction
  4.2 Related work
  4.3 CRF formulation for tracking
  4.4 Online learning of CRF models
    4.4.1 CRF graph creation for tracklet association
    4.4.2 Learning of unary terms
    4.4.3 Learning of pairwise terms
    4.4.4 Energy minimization
  4.5 Experiments
    4.5.1 Results on static camera videos
    4.5.2 Results on moving camera videos
    4.5.3 Speed
  4.6 Conclusion
Chapter 5  Integration of Association Based Tracking and Category Free Tracking
  5.1 Introduction
  5.2 Related work
  5.3 Building part-based feature sets for tracklets
  5.4 Online learning DPAMs for conservative category free tracking
    5.4.1 Learning of discriminative part-based appearance models
    5.4.2 Conservative category free tracking
  5.5 Experiments
    5.5.1 Results on PETS 2009 data set
    5.5.2 Results on ETH data set
    5.5.3 Results on Trecvid 2008 data set
    5.5.4 Computational speed
  5.6 Conclusion
Chapter 6  Future work
Reference List

List of Tables

2.1 Comparison of results on the PETS 2009 dataset.
2.2 Comparison of results on the CAVIAR dataset.
2.3 Comparison of results on the TRECVID 2008 dataset.
3.1 Feature pool for learning unary terms.
3.2 Feature pool for learning pairwise terms.
3.3 Comparison of results on the TRECVID 2008 dataset.
4.1 Comparison of results on the TUD dataset.
4.2 Comparison of tracking results on the Trecvid 2008 dataset.
4.3 Comparison of tracking results on the ETH dataset.
5.1 Comparison of association based tracking with category free tracking.
5.2 Comparison of results on the PETS 2009 dataset.
5.3 Comparison of tracking results on the ETH dataset.
5.4 Comparison of tracking results on the Trecvid 2008 dataset.

List of Figures

1.1 Sample tracking results of our approach.
1.2 Examples of challenges in the tracking problem.
1.3 Examples of category free tracking results from [15, 28, 19].
1.4 Examples of association based tracking results from [45, 51].
1.5 The basic association based tracking framework used in [21, 31, 26, 27] and our approaches.
2.1 Examples of useful information for tracking.
2.2 Tracking framework of our approach.
2.3 Estimation of motion affinity using linear assumptions.
2.4 Estimation of motion affinity using the motion map.
2.5 A non-linear association example in a real case.
2.6 Estimation of entry/exit points.
2.7 Illustration of the training sample collection used in multiple instance learning.
2.8 A moving group example.
2.9 Entry/exit map estimation results.
2.10 Examples of tracking results of our approach on the PETS 2009, CAVIAR, and Trecvid 2008 data sets.
3.1 Examples of dependencies between tracklet associations.
3.2 Framework of our tracking system.
3.3 Four types of occlusion dependencies.
3.4 Comparison of negative sample numbers under different feature numbers.
3.5 Sample tracking results of our approach on different scenes in the Trecvid08 data set [2].
4.1 Examples of relative positions and linear estimations.
4.2 Tracking framework of our approach.
4.3 Examples of head-close and tail-close tracklet pairs.
4.4 Global motion models for unary terms in the CRF.
4.5 Pairwise motion models for pairwise terms in the CRF.
4.6 Tracking examples on the TUD [4], Trecvid 2008 [2], and ETH [16] data sets.
5.1 Limitations of previous association based tracking methods.
5.2 Tracking framework of our approach.
5.3 Illustration of human parts and automatic occlusion inference.
5.4 Online learning of appearance models for category free tracking.
5.5 Illustration of the category free tracking process.
5.6 Comparison of tracking results with and without category free tracking.
5.7 Tracking results of our approach on the ETH and Trecvid data sets.

List of Algorithms

1 The algorithm for learning the motion map.
2 Learning algorithm for the entry/exit map.
3 Learning algorithm for appearance models.
4 Learning algorithm for the CRF energy function.
5 Learning algorithm for a weak energy function.
6 Finding labels with low energy cost.
7 The algorithm for building the head feature set for one tracklet.
8 The learning algorithm for DPAMs.
9 The CFT algorithm for one tracklet from its tail.

Abstract

Tracking multiple humans in real scenes is an important problem in computer vision due to its importance for many applications, such as surveillance, robotics, and human-computer interaction. Association based tracking often achieves better performance than other approaches in crowded scenes. Building on this framework, I propose offline and online learning algorithms that automatically find potentially useful appearance and motion patterns, and utilize them to deal with difficulties in the association framework and to produce much better tracking results.

In the association based framework, an offline learned detector is first applied in each video frame to produce detection responses, which are further associated into tracklets, i.e., track fragments, in multiple steps. Measuring affinities between tracklets is the key issue that determines performance. In the first part of my thesis, I propose an online learning algorithm that automatically finds three important cues from a static scene to improve tracking performance: non-linear motion patterns, potential entry/exit points, and co-moving groups.

Association based tracking methods are often based on the assumption that affinities between tracklet pairs are independent of each other. However, this is not always true in real cases. In order to relax the independence assumption, we introduce an offline learned Conditional Random Field (CRF) model to estimate both affinities between tracklets and dependencies among them. Finding the best associations between tracklets is transformed into an energy minimization problem, and the energies of unary and pairwise terms in the CRF model are learned offline from pre-labeled ground truth data by a RankBoost algorithm. I then extended the approach into an online version. Positive and negative pairs are collected online according to temporal constraints; the learned appearance models better distinguish close but visually similar targets, and the learned motion models consider relative distances between targets to alleviate the effects of camera motion and non-linear paths.

As detection performance limits the performance of traditional association based tracking approaches, I further propose online learned discriminative part-based appearance models that incorporate category free tracking techniques into association based tracking. In this work, occlusions among targets are explicitly considered to produce more robust appearance models. A category free tracking method is adopted to track a target without detection responses while distinguishing different targets from each other and from the background.

I designed comprehensive experiments to evaluate all my algorithms and important modules.
The results show the effectiveness of my approaches on different data sets with different human densities, illumination conditions, and camera motions.

Chapter 1  Introduction

As video cameras have become more and more pervasive in recent years, a huge amount of data is captured every day. It is impossible for humans to look through all videos, so automatic and robust analysis of such data becomes more and more critical. Tracking multiple targets is often used as the basis for higher level tasks, such as event recognition and activity reasoning. Beyond surveillance systems, tracking also has applications in robotics, human-computer interaction, etc. Therefore, it attracts a lot of attention from researchers.

Tracking multiple targets in real scenes is one of the most important problems in computer vision. It aims at inferring trajectories of targets in a video sequence while keeping their identities. In particular, humans are often the targets of most concern, and tracking multiple humans from static cameras is of intense interest due to the increased attention on security scenarios. Appearance and motion cues are often the most important cues in the tracking problem; in addition, they often form patterns, e.g., motion path patterns, co-moving patterns, and appearance patterns. Therefore, in this thesis, I focus on automatically finding useful appearance and motion patterns and utilizing them to improve tracking performance. Figure 1.1 shows two sample tracking results of our approach.

Figure 1.1: Sample tracking results of our approach (frames 440 to 620 and frames 2400 to 2580).

1.1 Challenges

Tracking multiple humans can be quite an easy task if all humans are separate from each other and can be easily extracted from the background. However, in real crowded surveillance scenes, tracking multiple humans is a quite difficult problem due to detection errors, similar appearances, inter-object and intra-object occlusions, low resolution, etc. I briefly describe some of the main challenges in this section.

1.1.1 Detection errors

As targets may be moving or static in real cases, and objects may not always be easily distinguished from the background, background subtraction approaches tend to miss many objects in real surveillance scenes. In recent years, object detection techniques have achieved great improvements; even in cluttered and dynamic backgrounds, specific kinds of objects can still be detected. Despite the great improvements of state-of-the-art object detectors, detection mistakes still occur often. The need to recover missed detections, remove false alarms, and refine inaccurate detections makes the tracking problem difficult. Figure 1.2(a) shows a sample detection result by the up-to-date approach in [20]; some missed detections are labeled in yellow and some false alarms are labeled in red.

Figure 1.2: Examples of challenges in the tracking problem: (a) detection errors and occlusion examples, including scene occluders and human occluders; (b) a non-linear motion example; (c) examples of object appearances [27], where each column shows the same person in different frames.

1.1.2 Object occlusion

Occlusion of an object means that some or all parts of the object are not visible in an image; depending on whether the object is partly visible or totally invisible, occlusions are classified as partial occlusions and full occlusions. Under partial occlusion, detectors may fail to find objects, and appearance descriptors for the same object may change a lot.
Under full occlusion, we can only infer the trajectory from previous or future motion patterns, and have no appearance information to help. In a crowded scene, objects are frequently occluded by each other or by scene occluders. For short occlusions, e.g., 2 or 3 frames, we can estimate the trajectory from the most recent motion; but for long occlusions, it is still a challenging problem to fill the gaps while maintaining object identities. Figure 1.2(a) shows some occlusion examples: humans may be occluded by scene occluders or by each other.

1.1.3 Non-linear motion

Motion is an important cue in tracking problems. Previous motion patterns can help to estimate future positions of a target. This cue provides strong priors when targets move linearly. However, as shown in Figure 1.2(b), the motion can sometimes be non-linear, which makes position estimation much more difficult.

1.1.4 Appearance modeling

To track multiple objects simultaneously, we need to find appropriate appearance descriptors so that the same object has similar appearance over time while different objects have different appearances. In real surveillance cases, the appearance of the same person may change over time due to illumination variation, pose change, or occlusion, as shown in Figure 1.2(c). On the other hand, different targets may have similar appearance due to similar clothes. These variations make maintaining person identities a difficult problem.

1.2 Related work

1.2.1 Category free tracking

Tracking problems can be classified into category free tracking (CFT) and association based tracking (ABT). Category free tracking focuses on tracking pre-specified regions, usually one specific target, and finds trajectories with consistent visual appearance. Figure 1.3 shows some examples of category free tracking. Early visual appearance tracking approaches often adopt mean-shift [14, 13, 55] or particle filtering [22, 35, 30] style methods to adjust target appearance models online, and use the updated models to continuously track targets. Multiple cues are considered to estimate the most probable state or the lowest energy state [30, 56, 11, 19, 9].

Figure 1.3: Examples of category free tracking results from [15, 28, 19].

These approaches are able to locate targets efficiently based only on previous information, but once a tracking error occurs, it tends to grow over time and the estimated location may drift to other objects. In recent years, context information has been used more and more for category free tracking [12, 19, 15]. These approaches find contextual information, usually co-moving regions or targets, to facilitate CFT. Such approaches are less likely to drift and may also achieve real-time tracking after some optimization.

Most category free tracking approaches do not consider the initialization of tracked objects and often rely on manual initialization. Some approaches use background subtraction to find foreground blobs and track each blob separately [59, 44, 40]. Though multiple human hypotheses may be considered during tracking, these methods depend heavily on robust background subtraction and can only deal with short-time merges of multiple people. They may fail in crowded scenes or cluttered backgrounds, which often occur in real surveillance settings. On the other hand, most category free tracking approaches focus on tracking a single object.
Though multiple targets are tracked in some previous work, each target is still often tracked separately without global optimization.

1.2.2 Association based tracking

Different from category free tracking, association based tracking aims at tracking specific kinds of objects, usually multiple people, and finding their trajectories without confusing their identities. As the category of tracked objects is known in advance, pre-trained detectors are often used for initialization. In addition, as manual initialization is not necessary, most association based tracking work focuses on tracking multiple objects simultaneously, and global optimization is often adopted. Figure 1.4 shows some association based tracking examples.

Figure 1.4: Examples of association based tracking results from [45, 51].

Among the vast amount of work on multiple target tracking, Multiple Hypothesis Tracking (MHT) and Joint Probabilistic Data Association Filters (JPDAF) are two early representatives, and several variations were developed later. In general, their goal is to infer the association between measurements and existing tracks over multiple targets simultaneously. However, the search space of these methods grows exponentially, which limits them to considering only a few time steps in practice.

To reduce the search space, and thanks to great improvements in object detection, Data Association based Tracking (DAT) has attracted more and more attention in recent years. Association based methods often consider both past and future frames, and associate tracklets gradually into longer ones by dynamic programming [8], MCMC [45, 57], or global optimization methods [38, 29, 58, 51]. Association based methods use different techniques to improve global associations, such as estimation of merges and splits [38], occlusion reasoning [58, 51], and multi-cue offline learning [31]. Great effort has also been put into producing better local affinity scores between tracklets, i.e., producing high association confidence for tracklets belonging to the same target and low confidence for those belonging to different targets, for example by dynamic feature selection [9, 5] and online learning [26, 18].

Association based approaches often perform much better on multiple target tracking tasks than category free tracking approaches, because of the help from detectors, information from future frames, and global optimization. Such approaches are more robust and less likely to drift; they may recover correct tracks even after partial tracking errors. However, detectors and global optimization require more computation, so most association based methods are not as fast as visual appearance tracking methods.

1.2.3 Using scene knowledge for tracking

Mining knowledge from a static scene has attracted a lot of attention from researchers. Some previous work focuses on understanding scene knowledge by analyzing tracking results [23, 50, 49]. These approaches aim at understanding semantic scene knowledge or detecting abnormal events, rather than using scene knowledge to improve tracking.

To improve tracking, motion patterns are among the most frequently used kinds of knowledge from static scenes, especially in very crowded ones. Most previous work learns global motion patterns from optical flow, motion blobs, or partially tracked results, and assumes that almost all targets follow the pattern [25, 46, 41]. A strong prior is then provided to estimate the motion of each target. Such approaches work well for high density crowds but may fail in mid-level density scenes.
The reason is that in an extremely crowded scene it is difficult for an individual to move against the crowd's moving trend, while in a mid-level density scene people may follow many different motion patterns, not necessarily the main trend.

Another useful kind of scene knowledge is co-moving clusters. Many researchers have proposed different ways to automatically find these clusters and use the information to facilitate tracking [42, 37, 36, 53]. Different from high density motion patterns, this work focuses on mid-level or low-level density scenes, and multiple moving clusters may be found instead of one global pattern. The motion of each target is then guided by the cluster motion pattern to estimate trajectories. In addition, collision avoidance is often a key consideration in this kind of previous work, so that no two targets occupy the same location on the ground plane.

1.3 Overview of our approach

In this thesis, we aim at tracking multiple humans in a static scene. The tracking framework is shown in Figure 1.5. A human detector is first applied to each video frame, and we then associate the corresponding detection responses gradually into final tracks through multiple levels of association. In each level, the output tracklets, i.e., not fully associated tracks or detection responses, from the previous level are taken as input; for each tracklet pair, a motion affinity score and an appearance affinity score are computed. Based on these scores, a global optimization approach is applied to find a good association.

Figure 1.5: The basic association based tracking framework used in [21, 31, 26, 27] and our approaches: detection responses produced by a human detector are linked by confident low-level association, and the resulting tracklets repeatedly pass through motion affinity computation, appearance affinity computation, and tracklet association at each level until the final tracking results are produced.

The tracking problem can thus be decomposed into four sub-tasks: 1) human detection; 2) motion affinity computation; 3) appearance affinity computation; 4) finding the global association solution. Our work improves some of these modules for better results. Among the above sub-tasks, we focus on improving the performance of the second and third modules, and solve the tracking problem by learning appearance and motion patterns (LAMPs).

We first utilize online learned scene knowledge to better estimate association affinities between tracklets. In a static scene, human motions often follow patterns, such as moving from an entry to an exit. Based on existing confident tracklets, we automatically find non-linear motion patterns and use them to refine motion affinities between other tracklets. In addition, with known entry and exit points, we are able to tell whether a tracklet is an entering tracklet, an exiting one, or neither. Based on this information, we propose an online multiple instance learning algorithm to automatically refine tracklet appearance models. Moreover, we propose a moving cluster finding algorithm to partly handle heavy occlusion cases. The globally optimal association is then found by the Hungarian algorithm. Therefore, we improve both the motion and the appearance affinity computations in Figure 1.5.

To relax the independence assumption on tracklet affinities, we propose a learning-based Conditional Random Field (CRF) model for tracking multiple targets by progressively associating detection responses into long tracks.
The tracking task is transformed into a data association problem; most previous approaches developed heuristic parametric models or learning approaches for evaluating independent affinities between track fragments (tracklets). We argue that the independence assumption is not valid in many cases, and adopt a CRF model to consider both tracklet affinities and dependencies among them, represented by unary term costs and pairwise term costs respectively. Unlike previous methods, we learn the best global associations instead of the best local affinities between tracklets, and transform the task of finding the best association into an energy minimization problem. A RankBoost algorithm is proposed to select effective features for estimating term costs in the CRF model, so that better associations have lower costs. In this part of the work, we incorporate the motion and appearance affinity computations of Figure 1.5 into one unified framework, and propose a new method to find global solutions.

As detection performance strictly constrains tracking performance, category free tracking techniques are incorporated into the association based tracking framework. Explicit part-based models are introduced to find unoccluded parts for conservative category free tracking, so that gaps between tracklets are shortened and tracks may extend to positions where no detection responses exist.

Our approaches are evaluated on challenging pedestrian data sets and compared with state-of-the-art methods. Experiments show the effectiveness of our algorithms as well as improvements in tracking performance.

1.4 Thesis organization

The thesis is organized as follows. Our online learned knowledge from a static scene for multi-target tracking is presented in Chapter 2. The offline learning approach, which relaxes the affinity independence assumption and better combines different cues for tracklet association, is discussed in Chapter 3. Its online extension for both static and moving cameras is presented in Chapter 4. The integration of category free tracking and association based tracking is detailed in Chapter 5. Finally, the future directions of our research are given in Chapter 6.

Chapter 2  Online Learned Knowledge from Static Cameras for Multi-Target Tracking

In this chapter, we introduce an online learning process that improves both motion and appearance affinities between tracklets, as well as a method for extending some tracklets based on moving groups.

2.1 Introduction

As discussed in Section 1.3, affinities between tracklets, i.e., their linking probabilities, are often evaluated as

P(T_i, T_j) = A_m(T_i, T_j) A_a(T_i, T_j) A_t(T_i, T_j)    (2.1)

where A_m(.), A_a(.), and A_t(.) indicate the motion, appearance, and temporal affinities between tracklets T_i and T_j. The Hungarian algorithm is often used to find the global optimum [38, 27]. Though much progress has been made, motion affinity estimation and appearance modeling remain key issues that limit performance.

Figure 2.1: Examples of useful information for tracking: (a) motion patterns, (b) entry/exit points, (c) moving groups.

A linear motion model is commonly assumed for each target [51, 45, 7, 27]. However, as shown in Figure 2.1(a), there are often several non-linear motion patterns in a scene. Appearance models are often pre-defined [51, 45] or learned online from a few neighboring frames [27, 43]; tracklets separated by long gaps are difficult to associate due to appearance changes. Fortunately, there is often useful knowledge in the scene, such as motion patterns, entry/exit points, and moving groups, as shown in Figure 2.1, for solving the above problems.
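To make Equ. 2.1 and the Hungarian association step concrete, the following is a minimal sketch of how per-pair affinities can be combined and a one-to-one tracklet association solved with the Hungarian algorithm. It is illustrative only, not the thesis' C++ implementation; the affinity callables, the 0.1 linking threshold, and the data layout are assumptions.

```python
# Hedged sketch of Equ. 2.1 plus Hungarian-algorithm association.
# The affinity functions and the 0.1 threshold are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_probability(motion_aff, appearance_aff, temporal_aff):
    # Equ. 2.1: the overall affinity is the product of the three cues.
    return motion_aff * appearance_aff * temporal_aff

def associate(tracklets, affinity, min_prob=0.1):
    """Match each earlier tracklet to at most one later tracklet by
    minimizing the total negative log affinity over all accepted links."""
    n = len(tracklets)
    cost = np.full((n, n), 1e6)                    # large cost = forbidden link
    for i in range(n):
        for j in range(n):
            if i != j:
                p = affinity(tracklets[i], tracklets[j])   # applies Equ. 2.1
                if p > min_prob:
                    cost[i, j] = -np.log(p)
    rows, cols = linear_sum_assignment(cost)       # Hungarian algorithm
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]
```

The pairs returned by associate() would correspond to the tracklet links accepted at one association level before the next level is run.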
We describe an online learning method that automatically finds dynamic non-linear motion patterns and robust appearance models to improve tracking performance. The framework of our approach is shown in Figure 2.2. Similar to [26], detection responses are first linked into tracklets by a low level association approach. Based on confident tracklets (the definition is given in Section 2.3.1), we learn online a non-linear motion map, which is a set of non-linear motion patterns, e.g., the orange tracklet in Figure 2.2, used for explaining non-linear gaps between other tracklets, e.g., the gap between the two blue tracklets in Figure 2.2. For efficiency, our tracking is done in sliding windows one by one; the non-linear motion map is updated by re-learning for each sliding window, as motion patterns may change with time.

Figure 2.2: Tracking framework of our approach. Colors of detection responses indicate their tracklets. The non-linear motion map, entry/exit map, appearance models, and moving groups are all learned online without supervision. Best viewed in color.

Meanwhile, an online learning process is adopted to automatically find entry or exit points in a scene, shown as green masks in Figure 2.2. We limit our approach to static cameras, where entry/exit points do not change between sliding windows and can be learned incrementally. As shown in Figure 2.1(b), entry/exit points constrain the starting and ending of trajectories; a tracklet ending before reaching an exit point should be associated with other tracklets. We collect detection responses from different tracklets to form potential positive and negative bags, which are used in a multiple instance learning approach to build robust appearance models, so that tracklets tend to be associated until they reach entry/exit points, even across long gaps. The Hungarian algorithm is utilized to find the global optimum; for the temporal affinity in Equ. 2.1, we use the same function as in [27]. For tracklets not reaching entry/exit points, we try to find whether they belong to any moving group, as shown in Figure 2.1(c). The trajectories of observed targets in the same group, e.g., the green tracklet in Figure 2.2, are used to complete those of occluded, unobserved ones, e.g., the pink tracklet in Figure 2.2, so that the latter can reach the entry/exit points.

The contributions of this chapter include:

- An online learned non-linear motion map for explaining reasonable non-linear motions, as well as a new motion affinity estimation method.
- An incremental learning approach for an entry/exit map, and a multiple instance learning (MIL) algorithm for finding robust appearance models.
- An algorithm for automatically finding moving groups, which are further used for track completion.

The rest of this chapter is organized as follows: related work is discussed in Section 2.2; learning of the motion map for motion affinities is given in Section 2.3; Section 2.4 describes the entry/exit map estimation and the MIL algorithm for appearance models; learning moving groups for tracklet completion is presented in Section 2.5; experiments are shown in Section 2.6, followed by the conclusion in Section 2.7.

2.2 Related work

Multi-target tracking has been a popular topic for several years. Visual tracking approaches often track each object separately [30, 56], but have difficulties dealing with a large number of objects in crowded scenes.
Association based approaches often optimize over multiple targets globally and simultaneously [38, 45, 7, 54], and are therefore more robust to long occlusions and crowded scenes.

Motion patterns have attracted researchers' attention in recent years, and are often learned as priors for estimating each target's trajectory [25, 46, 41, 60]. It is often assumed that targets follow a similar motion pattern; trajectories not following the pattern are penalized. This assumption works well for extremely crowded scenes, where it is difficult for a single human to move against the main trend. However, in our scenario, persons may move freely. There may be many motion patterns, and a particular individual may follow any one of them or none of them. Our motion patterns are used to explain non-linear trajectories, but do not produce extra penalties for individuals not following them.

Robust appearance models are also key for multi-target tracking, and many methods collect training samples online from tracklets to learn appearance models [6, 26, 27, 33, 43]. However, these approaches collect positive samples from the same tracklets within a few neighboring frames; therefore, the positive samples often lack diversity. Once there are illumination or pose changes, tracklets belonging to the same target but separated by long gaps may have different appearance models, as positive samples always come from neighboring frames. In contrast, our method utilizes entry/exit points to find potential correct associations, and positive samples may come from tracklets with long gaps and thus have higher diversity. To the best of our knowledge, there has been no explicit use of entry/exit points to improve appearance models in multi-target tracking tasks.

Human interactions have also been considered in tracking. Some work focuses on using social behaviors to avoid collisions for a few persons [42, 37], or on finding moving groups of people to estimate the motion of each individual [36, 53]. However, in crowded scenes, heavy occlusions may cause very close persons to merge. In addition, previous moving groups are used as motion constraints when each member is clearly visible [36, 53], whereas we use moving groups to complete tracklets when targets are not observable due to detector failures.

2.3 Motion map learning for motion affinities

In this section, we introduce an online learning approach to find reasonable non-linear motion patterns for each sliding window, and use them to produce more precise motion affinities, i.e., A_m(.) in Equ. 2.1.

2.3.1 Non-linear motion map learning

As modern detectors are quite robust, we safely assume that there are no long continuous missed detections for a target if there are no occlusions. For unoccluded targets, it is easy to associate tracklets based on linear motion assumptions, as the gaps are often quite small. However, over a long time span, these targets may follow non-linear patterns, which may provide guidance for the association of other tracklets. Similar to [27], we adopt a multi-level association approach, i.e., we gradually associate tracklets in multiple steps instead of a one-time association. Non-linear motion patterns learned from tracklets in the previous level are used for the current level of association. For the current sliding window, a motion map M = {T_1, T_2, ..., T_m} is defined as a set of tracklets that contain confident non-linear motion patterns.
A tracklet T_i = {d_i^{t_s}, ..., d_i^{t_e}} is a set of detection responses, or interpolated responses, in consecutive frames, where t_s and t_e denote the starting and ending frame numbers, and d_i^t = {p_i^t, s_i^t, v_i^t} denotes the response at frame t, including its position p_i^t, size s_i^t, and velocity vector v_i^t.

Due to possible false alarms or false associations, we build the motion map only on confident tracklets, each of which satisfies two constraints: 1) it is long enough, e.g., longer than 50 frames, as false tracklets are mostly short ones; 2) it is not occluded, or only lightly occluded, by other tracklets, e.g., at most 10% of its frames have visibility ratios below 70%, as most association errors happen under heavy occlusion. Otherwise, a tracklet is classified as unconfident.

For each confident tracklet, we remove the linear motion parts at the head and at the tail. If the remaining part still forms a non-linear motion pattern, we put it into the motion map. The motion map learning algorithm is shown in Algorithm 1, where <u, v> denotes the angle between vectors u and v, and (p_a, p_b) denotes the vector from position p_a to position p_b. The threshold θ is set to 10 degrees in our experiments. As shown in Figure 2.2, the learned motion map is a union of existing non-linearly moving tracklets, which are used to explain non-linear gaps between other tracklets in the next level of association.

Algorithm 1: The algorithm for learning the motion map.
  Input: tracklets {T_1, T_2, ..., T_n} from the previous level.
  Initialize the motion map: M = ∅.
  For i = 1, ..., n do:
    - If T_i is not a confident tracklet, continue to the next iteration.
    - Initialize the non-linear motion start frame t_1 = t_s and end frame t_2 = t_e.
    - For t = t_s + 1, ..., t_e do:
        if <v_i^{t_s}, (p_i^{t_s}, p_i^t)> > θ, then t_1 = t - 1, break.
    - For t = t_e - 1, ..., t_s do:
        if <v_i^{t_e}, (p_i^t, p_i^{t_e})> > θ, then t_2 = t + 1, break.
    - If t_1 < t_2 and <v_i^{t_1}, (p_i^{t_1}, p_i^{t_2})> > 2θ and <v_i^{t_2}, (p_i^{t_1}, p_i^{t_2})> > 2θ, then M = M ∪ {T_i*}, where T_i* = {d_i^{t_1}, ..., d_i^{t_2}}.
  Output: the motion map M.
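The trimming logic of Algorithm 1 is easier to see in code. Below is a hedged Python sketch of the same idea, not the author's C++ implementation; the tracklet representation, the angle helper, and the is_confident test are simplified assumptions.

```python
# Illustrative sketch of Algorithm 1 (motion-map learning).
import numpy as np

THETA = np.deg2rad(10)          # angle threshold from Section 2.3.1

def angle(u, v):
    """Angle between two 2-D vectors, guarding against zero vectors."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu < 1e-9 or nv < 1e-9:
        return 0.0
    return np.arccos(np.clip(np.dot(u, v) / (nu * nv), -1.0, 1.0))

def learn_motion_map(tracklets, is_confident):
    """tracklets: list of dicts with 'p' (Nx2 positions) and 'v' (Nx2 velocities)."""
    motion_map = []
    for trk in tracklets:
        if not is_confident(trk):
            continue
        p, v = trk['p'], trk['v']
        n = len(p)
        t1, t2 = 0, n - 1
        # Trim the linear head: stop where the displacement from the start
        # deviates from the starting velocity by more than THETA.
        for t in range(1, n):
            if angle(v[0], p[t] - p[0]) > THETA:
                t1 = t - 1
                break
        # Trim the linear tail symmetrically.
        for t in range(n - 2, -1, -1):
            if angle(v[-1], p[-1] - p[t]) > THETA:
                t2 = t + 1
                break
        # Keep the remaining segment only if it is clearly non-linear overall.
        chord = p[t2] - p[t1]
        if t1 < t2 and angle(v[t1], chord) > 2 * THETA and angle(v[t2], chord) > 2 * THETA:
            motion_map.append({'p': p[t1:t2 + 1], 'v': v[t1:t2 + 1]})
    return motion_map
```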
If multiple patterns exist, the one that produces highest affinity score is used. Note that we do not use motion patterns as priors to penalize tracklets not following the patterns like previous work [46, 41, 60]. Targets are not necessarily assumed to follow any non-linear patterns, but once an individual does, we reduce the penalty for that non-linear motion. Figure 2.5 shows a non-linear motion example in a real case. From the 2D map, we see that there is a direction change between 18 and 1 though they are the same person. The linear motion assumption would produce a very low score for associating these two. However, our approach indicates that a confident tracklet 16 well explains the direction change from 18 to 1 , and therefore gives a high motion affinity between 18 and 1 . 21 2D map Frame 1100 Frame 1160 T 18 T 1 T 16 Figure 2.5: A nonlinear association example in real case. 2.4 MIL using the entry/exit map Appearance models play important roles in tracking. Most previous online learning ap- proaches [6, 26, 27, 33] collect positive training samples, i.e., responses belonging to the same target, from the same tracklet within a few frames. However, these responses are likely quite similar and lack diversity. We further collect potential positive pairs from re- sponsesindifferenttrackletswithlongergapsaccordingtotheestimatedentry/exitmap, so that the diversity is higher; then a multiple instance learning approach is proposed to get robust appearance models. 2.4.1 Incremental learning of the entry/exit map An entry/exit map is a binary image with 0 denoting entry/exit points and 1 denoting others. The entry/exit points are positions where a target enters or exits in video frames. We do not constraint order; an entry point is also an exit point and vice versa. We limit our approach to static cameras, so that the entry/exit map does not change with time and can be learned incrementally. 22 Figure 2.6: Estimation of entry/exit points. Each circle indicates a estimated entry/exit point; red polygons indicate the convex hulls of all estimated points; blue circles indicate later removed points; yellow circles indicate new added ones. We continuously add starting or ending positions of confident tracklets into , and the neighboring regions of these positions are treated as entry/exit points. We assume that all entry/exit points form a convex hull, i.e., all possible points are at borders of a real 3D scene (not necessarily borders on 2D frames), and a target cannot enter or exit in the middle of a scene. Based on this assumption, we continuously update the entry/exit points by removing those inside the convex hull and adding those outside it. Figure 2.6 shows an example of the update process. The incremental learning algorithm of the entry/exit map is shown in Algorithm 2. Algorithm 2 Learning algorithm for the extry/exit map. Input: confident tracklets obtained from previous association level in current sliding window { 1 , 2 ,..., } If this is the first sliding window, initialize entry/exit point set =, and its convex hull = For =1,..., do: ∙ If is outside , =∪{ }; if is outside , =∪{ }. ∙ Update convex hull using current . ∙ ∀ ∈, if is inside , =−{ }. Output: the binary entry/exit map , where () = 0 if ∃ ∈ so that ∣∣− ∣∣ < , and ()=1 otherwise. 23 2.4.2 Learning for appearance models With the learned entry/exit map, we identify each tracklet as an entry tracklet, an exit one, both, or neither. 
2.4.2 Learning for appearance models

With the learned entry/exit map, we identify each tracklet as an entry tracklet, an exit tracklet, both, or neither.

Definition 1. An entry tracklet starts at an entry/exit point or at the beginning of the current sliding window; otherwise, it is a non-entry tracklet. The definition of an exit tracklet is similar.

Tracklets that are both entry and exit tracklets are called completed tracklets; otherwise, they are uncompleted tracklets. Ideally, any real track should be a completed tracklet. However, due to false alarms, appearance changes, and occlusions, there are usually many uncompleted tracklets. An uncompleted but confident tracklet, e.g., a non-exit tracklet, probably needs to be associated with other tracklets so that the linked tracklet becomes completed.

Figure 2.7 illustrates our collection of potential training pairs. For a non-exit confident tracklet T_1, we collect all tracklets whose motion affinities with T_1 are higher than a threshold (0.2 in our experiments) as potential correct associations, e.g., T_3, T_4, and T_5. A positive bag is formed as B = {(d_1, d_3), (d_1, d_4), (d_1, d_5)}, where d_1 ∈ T_1, d_3 ∈ T_3, d_4 ∈ T_4, and d_5 ∈ T_5. At least one pair in a positive bag is a correct pair, so the bags are suitable for MIL. The positive bags provide more positive pairs from long-gap tracklets, and make T_1 more likely to be associated with one of T_3, T_4, and T_5. Note that an unconfident tracklet such as T_0 (T_5 is similar) does not need to be associated with one of T_3, T_4, or T_5, as unconfident tracklets may be false alarms, but it may still appear in the positive bags of other confident tracklets, e.g., T_3.

Figure 2.7: Illustration of the training sample collection used in multiple instance learning. Colors of detection responses indicate their tracklets. See text for details.

For negative training samples, we adopt the approach used in [27]: responses from tracklets that overlap temporally are used to form negative pairs. The feature pool of [27] is used; it is based on color, shape, and texture, and features are extracted from pre-defined regions of human responses, e.g., the color histogram of the upper body. We adopt the multiple instance learning framework of [34]; the algorithm is shown in Algorithm 3, where |B_i| denotes the number of elements in bag B_i and L is the log-likelihood of the bags to be maximized:

L(B) = Σ_i ( y_i log p_i + (1 - y_i) log(1 - p_i) )    (2.4)

Algorithm 3: Learning algorithm for appearance models.
  Input: training bags B = {(B_i = {x_i1, x_i2, ...}, y_i)}, where each instance x_ij = {d_a, d_b} is a pair of responses and y_i ∈ {1, 0}; a feature pool F = {h_1, h_2, ..., h_M}. Let the number of selected features be K (K < M). Initialize the classifier function H = 0.
  For k = 1, ..., K do:
    - For each bag B_i and each instance x_ij do:
        p_ij = 1 / (1 + exp(-(H(x_ij) + h(x_ij))))
        p_i = 1 - Π_j (1 - p_ij)^{1/|B_i|}
        w_ij = (y_i - p_i) p_ij / (|B_i| p_i)
    - Find h* = argmax_{h ∈ F} Σ_{i,j} w_ij h(x_ij); set h_k = h*.
    - Find α* = argmax_α L(H + α h_k) by linear search; set α_k = α*.
    - H = H + α_k h_k.
  Output: H = Σ_{k=1}^{K} α_k h_k.

The learned classifier H is used as the appearance model, and the appearance affinity scores are computed as in [27].
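Algorithm 3 is a boosting-style MIL procedure: instance scores are pooled into a bag probability, and features are selected using the gradient of the bag log-likelihood in Equ. 2.4. The sketch below restates that loop in Python under simplifying assumptions (pre-computed weak scoring functions, a brute-force line search); it is illustrative, not the thesis implementation.

```python
# Illustrative sketch of Algorithm 3 (MIL boosting for appearance models).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_prob(instance_probs):
    # Pooling used in Algorithm 3: p_i = 1 - prod_j (1 - p_ij)^(1/|B_i|)
    n = len(instance_probs)
    return 1.0 - np.prod((1.0 - instance_probs) ** (1.0 / n))

def mil_boost(bags, labels, weak_learners, n_rounds=10,
              alphas=np.linspace(0.05, 1.0, 20)):
    """bags: list of lists of instances; labels: 0/1 per bag;
    weak_learners: candidate scoring functions h(x) with values in [-1, 1]."""
    selected = []                                   # (alpha, h) pairs forming H
    H = lambda x: sum(a * h(x) for a, h in selected)

    def log_likelihood(score_fn):
        ll = 0.0
        for B, y in zip(bags, labels):
            p = bag_prob(np.array([sigmoid(score_fn(x)) for x in B]))
            p = np.clip(p, 1e-6, 1 - 1e-6)
            ll += y * np.log(p) + (1 - y) * np.log(1 - p)    # Equ. 2.4
        return ll

    for _ in range(n_rounds):
        # Instance weights from the gradient of the bag log-likelihood.
        weights = []
        for B, y in zip(bags, labels):
            p_ij = np.array([sigmoid(H(x)) for x in B])
            p_i = np.clip(bag_prob(p_ij), 1e-6, 1 - 1e-6)
            weights.append([(y - p_i) * pij / (len(B) * p_i) for pij in p_ij])
        # Select the weak learner best aligned with the gradient.
        def score(h):
            return sum(w * h(x)
                       for B, ws in zip(bags, weights)
                       for x, w in zip(B, ws))
        h_best = max(weak_learners, key=score)
        # Line search for the step size that maximizes the likelihood.
        a_best = max(alphas,
                     key=lambda a: log_likelihood(lambda x: H(x) + a * h_best(x)))
        selected.append((a_best, h_best))
    return lambda x: sum(a * h(x) for a, h in selected)
```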
2.5 Track completion

After tracklet association, there are likely several uncompleted but confident tracklets. These tracklets are usually caused by occlusion from targets in other tracklets, or by cluttered background where the detector fails. Therefore, it is difficult to observe them from detection responses. Fortunately, people often move in groups, and the group motion can provide priors for the motion of each member in the group.

Definition 2. A moving group is a group of people who move at similar speeds and in similar directions and keep close to each other.

Two tracklets T_i and T_j belong to the same moving group if they satisfy the following constraints (assuming T_j is equal to or longer than T_i):

t_e^j ≥ t_e^i  and  t_s^j ≤ t_s^i  and  t_e^i - t_s^i > τ_f
∀ t ∈ [t_s^i, t_e^i]:  ||p_i^t - p_j^t|| < τ_d  and  σ({p_i^t - p_j^t}) < τ_σ    (2.5)

where τ_f is a threshold for the minimum number of co-moving frames, τ_d is a threshold for the maximum distance, σ is the standard deviation function, and τ_σ is a threshold for the maximum standard deviation. In our experiments, we set τ_f = 50, τ_d = 0.5 min(s_i, s_j), and τ_σ = 0.2 min(s_i, s_j). Equ. 2.5 assures that the two tracklets are close enough for a certain time, indicating a possible moving group.

If T_i disappears before reaching the exit points but T_j is still visible, we can complete T_i, i.e., make it reach the entry/exit points, from the future positions of T_j at frames t > t_e^i as

p_i^t = p_j^t + Δp_ij    (2.6)

where Δp_ij denotes the average difference vector between p_i and p_j over the last 20 co-moving frames. There may be multiple tracklets in a moving group, and the position is taken as the average of the estimates from all of them. Note that there is a maximum completion frame number, set to 20 in our experiments. T_i has to reach the entry/exit points after the completion process; otherwise, the completion is not applied, as moving groups may change over a long time.

Though moving groups are also used in [36, 53], group motions there serve as constraints for each target while all members are clearly visible. In contrast, we use them to complete tracklets when no reliable observations are available.

Figure 2.8 shows an example of our tracklet completion approach. Persons 30 and 33 co-move for a long time. After frame 2475, the detector fails to find person 30 due to occlusion. However, based on the trajectory of person 33, we can estimate the missing trajectory of person 30, so that it reaches the exit point after completion.

Figure 2.8: A moving group example (frames 2400, 2475, and 2495). The occluded target is completed even though the detector fails after frame 2475.

2.6 Experiments

We applied our approach to three public data sets that have been commonly used in previous multi-target tracking work: PETS 2009, CAVIAR, and Trecvid 2008. For fair evaluation, we use the same offline learned human detector responses as the compared approaches. Low level association is performed as in [26]: responses in neighboring frames with high similarity of position, size, and appearance are associated if they have very low similarity with all other responses.

2.6.1 Evaluation metric

As it is difficult to use a single score to evaluate tracking performance, we adopt the evaluation metrics defined in [31] (a small sketch of the trajectory-level ratios follows this list):

- Recall (↑): correctly matched detections / total detections in ground truth.
- Precision (↑): correctly matched detections / total detections in the tracking result.
- FAF (↓): average false alarms per frame.
- GT: the number of trajectories in ground truth.
- MT (↑): the ratio of mostly tracked trajectories, i.e., those successfully tracked for more than 80%.
- ML (↓): the ratio of mostly lost trajectories, i.e., those successfully tracked for less than 20%.
- PT: the ratio of partially tracked trajectories, i.e., 1 - MT - ML.
- Frag (↓): fragments, the number of times a ground truth trajectory is interrupted.
- IDS (↓): id switches, the number of times a tracked trajectory changes its matched id.

For items marked ↑, higher scores indicate better results; for those marked ↓, lower scores indicate better results. For fragments and id switches, we adopt the definitions in [31], which are stricter but more clearly defined than previous ones. All data used in our experiments are publicly available (http://iris.usc.edu/people/yangbo/downloads.html).
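As a small aid to the metric definitions above, the helper below turns per-trajectory tracked fractions into the GT, MT, PT, and ML values; the data layout is an assumption for illustration and this is not the evaluation code used in the thesis.

```python
# Illustrative computation of the trajectory-level metrics (GT, MT, PT, ML).
def trajectory_metrics(tracked_fraction):
    """tracked_fraction: dict mapping each ground-truth trajectory id to the
    fraction of its frames that were correctly tracked."""
    gt = len(tracked_fraction)
    mt = sum(1 for f in tracked_fraction.values() if f > 0.8) / gt
    ml = sum(1 for f in tracked_fraction.values() if f < 0.2) / gt
    pt = 1.0 - mt - ml          # partially tracked, as defined above
    return {"GT": gt, "MT": mt, "PT": pt, "ML": ml}

# Example: three trajectories tracked for 95%, 50%, and 10% of their frames.
print(trajectory_metrics({"person_1": 0.95, "person_2": 0.50, "person_3": 0.10}))
```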
2.6.2 Entry/exit map estimation

The estimated entry/exit maps for all five scenes used in our experiments are shown in Figure 2.9. We can see that, over time, our approach produces more and more precise estimations. Note that the maps are used for improving tracking, not for scene understanding; entry/exit points where no targets appear or disappear have no influence on tracking. Therefore, precision is more important than recall. We can see that nearly all of our estimated points are at correct positions. Some imprecise estimations occur in the top parts of the images in the second row, because humans are very small in the top parts of that scene and the detector fails there; tracklets therefore often end before reaching the top parts. However, as there are almost no detection responses there, such estimations have little negative influence on tracking performance.

Figure 2.9: Entry/exit map estimation results at increasing frame counts for the five scenes; the third column shows ground truths.

2.6.3 Performance on less crowded data sets

PETS 2009 and CAVIAR are two commonly used data sets for multi-target tracking. The scenes are not crowded, and people are sometimes occluded by other humans or by other objects. The PETS 2009 data are the same as used in [4], but we modify the ground truth annotations so that people who are fully occluded for many frames but visible later keep the same id. For CAVIAR, we test our approach on the same 20 video clips used in [26, 27].

The comparison results are shown in Tables 2.1 and 2.2.

Table 2.1: Comparison of results on the PETS 2009 dataset. The PRIMPT results are provided by courtesy of the authors of [27]. Our ground truth is stricter than that in [4].

Method                  | Recall | Precision | FAF   | GT | MT    | PT    | ML   | Frag | IDS
Energy Minimization [4] | -      | -         | -     | 23 | 82.6% | 17.4% | 0.0% | 21   | 15
PRIMPT [27]             | 89.5%  | 99.6%     | 0.020 | 19 | 78.9% | 21.1% | 0.0% | 23   | 1
Our approach            | 91.8%  | 99.0%     | 0.053 | 19 | 89.5% | 10.5% | 0.0% | 9    | 0

Table 2.2: Comparison of results on the CAVIAR dataset. The human detection results are the same as used in [26, 27], and are provided by courtesy of the authors of [27].

Method       | Recall | Precision | FAF   | GT  | MT    | PT    | ML   | Frag | IDS
OLDAMs [26]  | 89.4%  | 96.9%     | 0.085 | 143 | 84.6% | 14.7% | 0.7% | 18   | 11
PRIMPT [27]  | 88.1%  | 96.6%     | 0.082 | 143 | 86.0% | 13.3% | 0.7% | 17   | 4
Our approach | 90.2%  | 96.1%     | 0.095 | 143 | 89.1% | 10.2% | 0.7% | 11   | 5

We can see that our approach produces obvious improvements; fragments are greatly reduced on the two data sets, by over 50% and 35% respectively, while the other scores stay competitive or improve. Some visual results are shown in Figure 2.10(a).
We can see that for almost totally overlapped persons, our tracker does not confuse their identities and finds the correct associations. 2.6.4 Performance on Trecvid 2008 data set AsthePETS2009andCAVIARdatasetsarerelativelyeasy,weshowmoreresultsonthe difficult Trecvid 2008 data set. There are frequently heavy occlusions, many non-linear motions, and interactions between people. There are three different scenes in the test videos with three 5000 frames video clips for each scene. The quantitative comparison results are shown in Table 2.3. To show effectiveness of each component of our approach, except for final performance, we also report three additional results where only one component of our approach is activated in each. In 32 addition, to see the effectiveness of our estimation of entry/exit maps, we also report the performance using manually assigned maps. From Table 2.3, we see that by using only any of the single components of our ap- proach, the overall performance is better than the up-to-date approaches. By using all components, our approach reduces around 14.5% fragments and 10.5% id switches compared with the most up-to-date performance reported in [27], while keeping other evaluation scores similar or improved. Using the manually assigned maps does not pro- vide large extra improvements, indicating the effectiveness of our learning method of entry/exit maps; the few improvements mostly happen in early frames, when the esti- mated maps have not been well learned due to few confident tracklets. Though there are three video clips for each scene, the entry/exit maps are not shared and are learned separately for fair comparison. Figure 2.10(b)-(d) show visual results of our approach. In Figure 2.10(b), a woman has non-linear motions and is almost fully occluded by person 170 around frame 4262. Traditional linear motion assumptions cannot connect tracklets before and after the oc- clusion; however,withthehelpofanonlinelearnedmotionmap,wesuccessfullyassociate these tracklets into one. In Figure 2.10(c), person 41 is fragmented from frame 2075 to frame 2105 because of heavy occlusions, and his appearance changes from almost black to dark blue due to illumination change. However, our MIL algorithm is able to produce high appearance similarity between the two tracklets, and links them successfully. In Figure 2.10(d), our approach finds that person 50,51,and 54 form a moving group. After frame 1265, only person 51 is visible; however, we complete trajectories of person 50 and 54 according to the visible tracklet 51. 33 Frame 180 Frame 265 Frame 325 Frame 360 Frame225 Frame 675 Frame 683 Frame 703 Frame 725 Frame 730 (a) Do not confuse identities when targets are quite close. Frame 4140 Frame 4262 Frame 4290 Frame 4360 Frame 4230 (b) Track non-linear moving targets under heavy occlusions. Frame 2045 Frame 2075 Frame 2105 Frame 2135 Frame 2090 (c) Track targets when appearances change a lot. Frame 1115 Frame 1225 Frame 1265 Frame 1290 Frame 1170 (d) Complete tracklets to exit points by moving groups. Figure 2.10: Examples of tracking results of our approach on PETS 2009, CAVIAR and Trecvid 2008 data sets. 34 2.6.5 Computation speed The speed is highly related with the number of targets in a video. Our approach is im- plemented using C++ on a PC with 3.0GHz CPU and 8GB memory. The average speeds are 48 fps, 16 fps, and 6 fps for CAVIAR, PETS 2009, and Trecvid 2008, respectively. 
Comparing 48 fps and 7 fps for CAVIAR and Trecvid 2008 reported in [27], our approach does not provide much extra computational burden 3 . In our experiments, most of the computation is spent on feature extraction, followed by low-level association. 2.7 Conclusion We described an online learning approach for multi-target tracking. We learn a non- linear motion map, describe a multiple instance learning algorithm for better appearance models based on estimated entry/exit points, as well as complete tracklets based on moving groups. Our approach improves the performance a lot compared with up-to-date approaches, while adding little extra computation cost. 3 Detection time is not included in both our speed and that in [27] 35 Chapter 3 Tracking by an Offline Learned CRF Model In this chapter, we introduce an offline learning approach that relaxes the independent assumption on affinities between tracklet pairs. 3.1 Introduction In this chapter, we also focus on multiple target tracking, i.e., inferring trajectories for each target from a video sequence. With significant improvements in object detection, manyassociationbasedtrackingmethodshavebeenproposedtodealwithcrowdedscenes by considering more frames and making global inference. Such approaches tend to as- sociate detection responses or track fragments (i.e., tracklets) gradually into long tracks [38, 8, 58, 51, 26]. Affinities between tracklets, indicating linking probabilities, are often estimated by pre-defined parametric models [38, 58] or an offline learning algorithm [31]; the best global association 1 is achieved by Hungarian algorithm [38] or cost flow method [58]. 1 Wecallan association betweena trackletpair local association, butassociationsfor alltracklets global association. 36 T 1 T 3 T 2 (a) Motion dependency. T 1 T 2 T 3 T 4 (b) Occlusion dependency. Figure 3.1: Examples of dependencies between tracklet associations. Alltheseapproachessharetheassumptionthatthelocalassociationsareindependent ofeachother, i.e., theaffinityfortwotrackletsdoesnotchangewithassociationsofother tracklets; however, this assumption is not always valid in real cases. Motion smoothness is an important cue for affinity computation. In Figure 3.1(a), associating 1 and 3 is smooth,asonlypositionsofafewdetectionsin 3 needtoberefined;similarly,associating 2 and 3 isalsosmooth. However,associatingthethreetrackletstogetherwouldproduce a sharp change in the motion direction. We call this motion dependency. Occlusions in the gap of two tracklets are often another important cue for tracklets affinities. In Figure 3.1(b), if 3 and 4 areassociated, thegapbetween 1 and 2 isoccluded; otherwise, itis notoccluded. Affinitiesfortwocaseswouldbedifferent;wecallthisocclusion dependency. We propose a framework for considering both the affinities between the tracklets and thedependenciesamonglocalassociationsbyaConditionalRandomField(CRF)model. Each node in the CRF denotes a tracklet pair, and the label for the node indicates the associationresultofthepair. InaCRFmodel,affinitiesanddependenciesarerepresented by unary and pairwise energy terms in the CRF respectively. The framework of our approach is shown in Figure 3.2. In the training part, we aim at finding proper energy functions for the CRF model. For any two global associations 37 T= 1 2 3 4 5 6 (Association 1, Association 2) (Association 1, Association 3) (Association 2, Association 3) Ranking Pairs ... ... 
Training Data Generation Ground Truth RankBoost CRF Energy Function T=1 2 3 4 5 6 Test data Energy Minimization Tracking Output Training Testing Detection response Association ground truth Association result Training Association Set T=1 2 3 4 5 6 T=1 2 3 4 5 6 T=1 2 3 4 5 6 G CRF Association Graph Creation CRF Model T=1 2 3 4 5 6 G CRF Model Tracklet pair observation Association 1 Association 3 Association 2 Figure3.2: Frameworkofour trackingsystem. Connections betweenCRF nodesindicate localassociationdependencies,anddetectionresponsesinreddenotefalsealarms. ACRF associationgraphisbuilttomodelalllocalassociationsanddependencies. ACRFenergy function is learned from global association training samples by a RankBoost algorithm, and is used for estimation of energies in testing. in the training set 2 , the preferred association should have lower energy than the other. This is achieved by a RankBoost algorithm [17], which continually selects weak rankers from a ranker pool to form a strong ranker. Each weak ranker is based on either a unary feature for tracklet affinities or a pairwise feature for dependencies. During the test part, a CRF graph is created from the input tracklets (or detection responses), and energies of all nodes and edges are computed from the learned strong ranker. By minimizing the energyfortheCRFgraph, thefinaltrackingresultisproduced. Thecontributionsofthis chapter include: ∙ Alearning-basedCRFframeworkthatmodelsboththeaffinitiesbetweenthetrack- lets and the dependencies among local associations. 2 Creation of the set will be described in Section 3.4 38 ∙ Novel unary features for modeling affinities and pairwise features for modeling de- pendencies. ∙ Instead of learning the best local affinities like most previous work, we learn the best global association directly by a RankBoost algorithm, so that the learning target is consistent with the problem solution, whereas better local affinities do not necessarily assure better global associations. The rest of this chapter is organized as follows: related work is discussed Section 2; problem formulation is given in Section 3; Section 4 describes learning approaches for a CRF model; experimental results are shown in Section 5, followed by conclusion in Section 6. 3.2 Related Work The association-based tracking framework associates tracklets gradually into longer ones by Dynamic Programming [8], MCMC [45, 57], or global optimization methods [38, 29, 58, 51]. Association based methods use different approaches to better global associations, such as estimation of merge and split [38], occlusion reasoning [58, 51], multi-cue offline learning [31]. Though great progress has been achieved by association-based approaches, dependencies between local associations are ignored. Great effort has also been put on producing better local affinity scores between tracklets, i.e., producing high association confidencefortrackletsbelongingtothesametargetandlowconfidenceforthosebelong- ing to different targets, for example, dynamic feature selection [9, 5] and online learning 39 [26, 18]. However, better local affinities does not necessarily assure better global associ- ation results. To the best of our knowledge, there has been no explicit use of association dependencies or direct optimization for global tracking results. Note that [58] and [45] both create graphs for tracking, but nodes in those graphs are either detection responses or tracklets, and relationships between nodes denote affinities between tracklets, not association dependencies. 
However, in our approach, each node in the graph denotes an association of two tracklets, and links between nodes indicate association dependencies. 3.3 CRF Formulation for Tracklet Association Given an input video, we first detect pedestrians in each frame. Then we associate de- tection responses into confident low level tracklets S = { 0 , 1 ,..., } as input; in this association process, we make conservative associations: two responses are only associ- ated when they are in consecutive frames and are close enough in space and similar enough in appearance. A graph = (,) is created for representing affinities be- tween tracklets in S and dependencies among tracklet association pairs, and are defined as ={ 0 , 1 ,..., }, and ={( , )}, where , ∈ and and has motion or occlusion dependency. Each node is defined as a tracklet pair as =( 1 , 2 ), where 1 = 2 or 1 is linkable to 2 which is defined as 0< 2 .− 1 .< (3.1) 40 where 1 . indicates the end time of 1 , 2 . indicates the start time of 2 , and is a maximum allowed frame gap between any linkable tracklet pairs. A tracking result becomes a label map on as = { 0 , 1 ,..., }, where ∈ {0,1} denotes the label for node = ( 1 , 2 ). When 1 ∕= 2 , label 1 indicates that 1 is associated with 2 , and 0 indicates the opposite; when 1 = 2 , label 1 indicates that 1 is a false alarm, and 0 indicates that 1 is a true tracklet. We assume that one tracklet cannot belong to more than one track, i.e., ∑ ∈ ℎ 1 ≤1 ∑ ∈ 2 ≤1 (3.2) ℎ 1 ={( 1 , )∈∣∀ ∈S} (3.3) 2 ={( , 2 )∈∣∀ ∈S} (3.4) where ℎ 1 and 2 indicate the set of nodes which have 1 as the first tracklet or 2 as the second tracklet respectively. Viewing the graph as a CRF, we define an energy function for any tracking result as ()= ∑ ∈ ( ∣ )+ ∑ ( , )∈ ( , ∣( , )) (3.5) where ( ) defines the label preference cost of node , and ( , ) defines the label dependency cost of node and . With ground truth training data, we are able to evaluate any by an evaluation function which produces lower values for preferrable results; the definition of is discussed in Section 3.4.4. Our training task is to find the 41 bestenergyfunction, i.e.,(⋅)and(⋅),sothatpreferrableglobalassociationshavelower energy costs, i.e., ∀ , ( )<( ) if ( )<( ) (3.6) where and are two different global association results. For a test video, we find the global association with lowest energy cost as the result. 3.4 Learning a CRF Model for Association In this section, we will show the methods of creation of a tracking graph from input tracklets, and introduce features for modeling local association affinities and dependen- cies. We formulate the training process as a ranking problem, and energy functions for a CRF model is learned by a RankBoost algorithm. 3.4.1 Creating Association Graph Given a set of input tracklets S = { 0 , 1 ,..., }, we create a node set including all linkable tracklet pairs and self association pairs according to Equ. 3.1. For creation of the edge set , we need to find dependencies between nodes. AsshowninFigure3.1(a),motiondependenciesexistbetweentwonodes =( 1 , 2 ) and =( 1 , 2 ) when they are both linkable pairs and satisfy 2 = 1 and ( 2 )< (3.7) 42 T i1 T i2 T j1 T j2 T i1 T i2 T j1 T i1 T i2 T j1 T j1 T i1 (a) (b) (c) (d) Figure 3.3: Four types of occlusion dependencies. where ( 2 ) denote the length of 2 , and is a threshold, which is set to 60 in our experiment. 
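To make this construction concrete, a minimal sketch of the graph creation is given below; it covers only the linkable-pair nodes of Equ. 3.1 and the motion-dependency edges of Equ. 3.7 (occlusion-dependency edges and self-association nodes for false-alarm labeling are omitted, and the data structures are simplified rather than taken from the actual implementation).

// Sketch: build CRF nodes (linkable tracklet pairs) and add edges for the
// motion dependency of Equ. 3.7 (shared middle tracklet shorter than
// lenThresh, which the text sets to 60).
#include <utility>
#include <vector>

struct Tracklet { int startFrame; int endFrame; };

// Equ. 3.1: T1 is linkable to T2 if 0 < T2.start - T1.end < maxGap.
static bool linkable(const Tracklet& t1, const Tracklet& t2, int maxGap) {
  int gap = t2.startFrame - t1.endFrame;
  return gap > 0 && gap < maxGap;
}

struct Node { int first, second; };  // indices of the ordered pair (T_first -> T_second)

void buildAssociationGraph(const std::vector<Tracklet>& S, int maxGap, int lenThresh,
                           std::vector<Node>& nodes,
                           std::vector<std::pair<int, int>>& edges) {
  for (int i = 0; i < (int)S.size(); ++i)
    for (int j = 0; j < (int)S.size(); ++j)
      if (i != j && linkable(S[i], S[j], maxGap)) nodes.push_back({i, j});

  for (int a = 0; a < (int)nodes.size(); ++a)
    for (int b = 0; b < (int)nodes.size(); ++b) {
      if (a == b) continue;
      int shared = nodes[a].second;                  // Equ. 3.7: node a's second tracklet ...
      if (shared != nodes[b].first) continue;        // ... is node b's first tracklet
      int len = S[shared].endFrame - S[shared].startFrame + 1;
      if (len < lenThresh) edges.push_back({a, b});  // short middle tracklet => dependency
    }
}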
Thismeans that when twotrackletsareconnected via a shorttrackletin the middle, we need to consider the whole motion smoothness; however, if 2 is long enough, there is no dependency, as the target has enough time to change the motion pattern. As as shown in Figure 3.3, occlusion dependency has more types defined as: ∙ Dependency between = ( 1 , 2 ) and = ( 1 , 2 ) as shown in Figure 3.3(a), when the gap between 1 and 2 is occluded by the gap between 1 and 2 , and vice versa. ∙ Dependencybetween =( 1 , 2 )and =( 1 , 1 )asshowninFigure3.3(b)(c), when the gap between 1 and 2 is occluded by 1 , and vice versa. ∙ Dependency between = ( 1 , 1 ) and = ( 1 , 1 ) as shown in Figure 3.3(d), when 1 is occluded by 1 , and vice versa. The definition is based on the idea that whenever the occluded information of one node is influenced by association of another tracklet pair or judging another tracklet as a false alarm, they have an occlusion dependency. 43 By checking motion and occlusion dependencies between any node pair, the edge set of the graph is defined as = {( , )}, where , ∈ and there is a dependency between them. 3.4.2 Learning Energy Terms In order to find the best energy function in Equ. 3.5, we continually add non-parametric weak energy functions to form the final function by a RankBoost algorithm. Each weak energy function is either a unary function defined on the label of a node or a pairwise function defined on the labels of a node pair. Therefore, ( ∣ ) = ∑ ( ∣ ) and ( , ∣( , ))= ∑ ( , ∣( , )). The training set is composed of a set of ranking association pairs = {( 1 , 2 )}, where ∀ ( 1 )<( 2 ) and each 1 or 2 is a global association result, not local labelsforsomenodes. Inordertoproduceaninitialtrainingset,wefirstgenerateaglobal association set = { 1 , 2 ,..., } by randomly performing the following operations: 1) breaking an existing link; 2) connecting two linkable tracklets; 3) changing status of a tracklet, i.e., false alarm to true tracklet and vice versa. Then we are able to compare the correctness of any two associations in by the evaluation function , and form the initial training set. For any global association result , a weak function or would output the cost for all labels in as ()= ∑ ∈ ( ∣ ) ()= ∑ ( , )∈ ( , ∣( , )) (3.8) 44 The best energy function is learned by a RankBoost algorithm. We aim at finding a function to satisfy as many ranking pairs as possible. Therefore, the loss function for boosting is defined as = ∑ 0 exp(( 1 )−( 2 )) (3.9) where 0 is the initial weight for each training sample, and the weight is updated during boosting process. In the -th round, we aim at finding a best weak energy function ℎ () and a weight factor which minimizes = ∑ exp((ℎ ( 1 )−ℎ ( 2 )) (3.10) where ℎ is either a unary or a pairwise energy function defined in Equ. 3.8. In the training process, we update the training association set by continuously adding new samples, which is the association result generated byminimizing currentenergy function. Then we could have more training pairs, and these pairs are probably more difficult samples, just like samples from a bootstrap process. The learning algorithm is shown in Algorithm 4. 3.4.3 Learning the Weak Energy Function A weak energy function ℎ is either a unary function or a pairwise function, which takes a labeloralabelpairasinputandoutputsarealvalueforenergycost. 
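Before specifying the form of the weak functions, a small sketch of how a candidate weak energy function and its weight are scored against the ranking pairs in Equ. 3.10 is given below; the coarse 1-D search for the weight (the text uses Newton's method) and the container layout are simplifications for illustration only.

// Sketch: score a candidate weak energy function on the ranking pairs.
// d[i] = h(A_i1) - h(A_i2) is the weak-score difference on the i-th ranking
// pair, where A_i1 is the preferred association, and w[i] is the current
// boosting weight of that pair (Equ. 3.10).
#include <cmath>
#include <vector>

double rankingLoss(const std::vector<double>& d, const std::vector<double>& w,
                   double alpha) {
  double z = 0.0;
  for (size_t i = 0; i < d.size(); ++i) z += w[i] * std::exp(alpha * d[i]);
  return z;  // small when the preferred association gets the lower energy
}

double bestAlpha(const std::vector<double>& d, const std::vector<double>& w) {
  double best = 0.0, bestLoss = rankingLoss(d, w, 0.0);
  for (double a = -2.0; a <= 2.0; a += 0.01) {       // coarse 1-D search over the weight
    double z = rankingLoss(d, w, a);
    if (z < bestLoss) { bestLoss = z; best = a; }
  }
  return best;
}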
Bytreatingdifferent labels asdifferentfunctions, e.g., isdivided into 0 and 1 whichestimate energy costs for label 0 and 1 respectively, we take a node or a node pair as the input instead of the labels. In the following, we use ℎ and to represent a weak energy function and its 45 Algorithm 4 Learning algorithm for CRF energy function. Input: global association set ={ 1 , 2 ,..., }, ranking sample pair set ={( 1 , 2 )}. Initialize the weight for each sample ( 1 , 2 ), = 1 ∣∣ . Initialize unary and pairwise energy functions: (⋅)=(⋅)=0 For =1,..., do: ∙ Find best weak energy function ℎ() and its weight . ∙ Update sample weight as = ∗exp((ℎ( 1 )−ℎ( 2 ))). ∙ Normalize weight so that ∑ =1. ∙ If ℎ is a unary function, = +ℎ; otherwise, =+ℎ. ∙ Findbestassociationresult ∗ undercurrentenergyfunction. ∀ ∈, =∪( ∗ , ) if ( ∗ )<( ); otherwise, =∪( , ∗ ). Output: CRF energy function ()=()+(). input respectively; is a node indicating a tracklet pair, when ℎ is a unary function; is a node pair, when ℎ is a pairwise function. The input is represented by a series of featureswhichincludeasmostcuesaspossibletomodeltrackletaffinitiesandassociation dependencies. Eachfeatureisafunctionthatmapsanodeoranodepairintoarealvalue. A weak energy function ℎ is defined based on one single feature as ℎ()= ⎧ ⎨ ⎩ 1 if ()> −1 otherwise (3.11) where is a threshold. The output of ℎ on an association is defined as ℎ() = ∑ ∈ ℎ(). The optimum for ℎ could be found by Newton’s method. The learning algorithm of the weak energy function is given in Algorithm 5. The feature pool used in our learning process is defined in Table 3.1 and Table 3.2. More features can be added into the feature pool if necessary; the purpose is to include any possible cues for modeling affinities and dependencies. In [31], a feature pool for 46 Algorithm 5 Learning algorithm for a weak energy function. Input: ranking sample pair set ={( 1 , 2 )} and current weight for each sample. Initialize training loss ∗ =+∞ For each feature f do: For each threshold do: ∙ ℎ()= { 1 if ()> −1 otherwise . ∙ Compute loss function on current sample weight distribution as () = ∑ exp((ℎ( 1 )−ℎ( 2 )) ∙ Find ˜ =argmin () ∙ If (˜ )< ∗ , let ∗ =(˜ ), ℎ ∗ =ℎ, ∗ = ˜ Output: weak energy function ℎ ∗ () and ∗ . learning local associations is proposed; that pool can be viewed as a subset of the unary feature pool, as it not only models affinities between tracklet pairs but also models costs of one tracklet being false alarm or true tracklet. In addition, pairwise feature pool does not exist in [31] owing to their independence assumption on local associations. 3.4.4 Implementation Details In this section, we will introduce the definition of evaluation function , energy mini- mization method, and some other details. Evaluation Function. As there is no commonly used single score for evaluating tracking results, we adopt multiple scores together, including: recall rate and precision, number of id switches, number of fragments. An association is better than , when the former has better scores in all four evaluations than the latter. Energy Minimization. With an energy function and a CRF association graph, we need to find a label map with minimum energy as the final association result. However 3 an occlusion is assumed when two detection responses have an overlap ratio larger than 0.5. 
47 Id Feature description Length 1 Length of or 2 Number of detection responses in or 3 Number of detection responses in or di- vided by length of or Appearance 4 Similarity of color histogram between ’s tail part and ’s head part 5 Self appearance smoothness of 6 Appearance smoothness in the gap of and Gap 7 Frame gap between ’s tail and ’s head 8 Number of miss detected frames in or in the gap between and 9 Number of frames occluded 3 by other tracklets in or in the gap between and 10 Number of miss detected frames divided by the tracklet length or the gap length 11 Numberofframesoccludeddividedbythetrack- let length or the gap length End 12 Estimated time from ’s head to the nearest entry point 13 Estimatedtimefrom ’stailtothenearestexit point Motion 14 Motion smoothness in image plane or ground plane if connecting and 15 Self motion smoothness of in image plane or ground plane Table 3.1: Feature pool for learning unary terms. Id Feature description Motion 1 Motion smoothness in image plane or ground plane if connecting , , and together se- quentially. 2 Motion smoothness divided by the sum of de- tection responses in , , and . Occlusion 3 Number of frames in the gap of 1 and 2 oc- cluded by the gap of 1 and 2 . 4 Number of frames in the gap of 1 and 2 oc- cluded by . 5 Number of frames in occluded by the gap of 1 and 2 . 6 Number of frames in occluded by . Table 3.2: Feature pool for learning pairwise terms. 48 the energy function is not sub-modular. For example, the constraint in Equ. 3.2 is achieved by defining an infinite cost for assigning two nodes in ℎ 1 or 2 , i.e., ( = 1, = 1) = +∞ if , ∈ ℎ 1 or , ∈ 2 ; yet a sub-modular energy function should satisfy (0,0)+(1,1)<(0,1)+(1,0). Therefore, we cannot find a globally optimal solution easily. We first use Hungarian algorithm [38] to find an initial optimum solutionusingonly theunaryenergy costs. Thena simulatedannealing algorithmisused toiterativelyfindasolutionwithlowerenergies; inearlyiterations, theglobalassociation result can change easily to avoid sticking to a local minimum, and it gradually goes to a stable state with a low energy cost. Speed Optimization. As each node in a CRF graph denotes two linkable tracklets, whereasedgesaredefinedbetweenpotentialnodes,thenumbersofnodesandedgescould be ( 2 ) and ( 4 ) respectively. To limit size of the graph, we adopt a sliding window approach and build a separate CRF graph for each window; there are overlaps between neighboring windows so that trajectories can grow across different windows. Similar to [31], we make a multi-level association to progressively link tracklets so that we can limit the maximum frame gap and maximum speed between linkable tracklets at early levels, and loosen the constraint at later levels. In early levels, there are more tracklets and the constraint significantly reduce the graph size; in later levels, the number of tracklets is much less, and the constraint is loosened to avoid missing possible connections. Note that all components introduced in this section are modular and can be easily replaced by others. For example, a better energy minimization algorithm or a better evaluation function would help improve the performance of the whole system. 49 3.5 Experiments We applied the CRF based tracking approach to human tracking, and evaluated the performance on the public Trecvid08 data set [2], which captures several indoor scenes of a busy airport and contains crowds of persons and frequent occlusions. 
This data set is much more complex than the commonly used CAVIAR data set [1]; as the the state-of- art approaches produce almost perfect performance on the latter, we do not use it in our experiments. 3.5.1 Analysis of the Training Process Inthetrainingprocess,thefirstfivefeaturesselectedare 2 , 7 , 5 , 8 , 1 ,where or denotethe-thfeatureintheunaryfeaturepool(Table3.1)orinthepairwisefeaturepool (Table 3.2). We see that motion dependency and occlusion dependency are as important as motion smoothness or number of frames in the gap. Note that all features in [31] are used for modeling linkable tracklet pairs; however, the 1 selected here is for modeling falsealarmtracks. 2 , 5 , and 1 areallfeaturesnotcontainedinthefeaturepoolin[31]. This indicates the effectiveness of the new features as well as the importance of modeling dependencies. Figure 3.4 shows the trend of negative sample numbers in the first 20 rounds of training. We see that using both unary and pairwise features is more discriminative than using only unary features. This indicates that association dependencies do exist frequently in real cases, and the proposed pairwise features are effective for modeling these dependencies. 50 0 5 10 15 20 2 3 4 5 6 7 8 9 10 x 10 4 Feature No. Negative Sample No. Unary Features Both Features Figure 3.4: Comparison of negative sample numbers under different feature numbers. The blue curve shows results using only unary features; the red one shows results using both unary and pairwise features. 51 Method Recall Precision GT MT PT ML Frag IDS Huang et al. [21] 71.6% 80.8% 919 57.0% 28.1% 14.9% 487 278 Li et al. [31] 80.0% 83.5% 919 77.5% 17.6% 4.9% 310 288 CRF Unary Only 80.1% 84.3% 919 78.0% 17.0% 5.0% 309 278 CRF Tracking 79.2% 85.8% 919 78.2% 16.9% 4.9% 319 253 Table 3.3: Comparison of results on TRECVID 2008 dataset. 3.5.2 Tracking Performance We evaluated our approach on the Trecvid08 data set, which includes three different scenes with more than 10 persons per frame on average. People are frequently occluded or interact with each other, making the tracking problem very challenging. We select six video clips for each scene with a length of 5,000 frames each. We use a total of nine videos selected evenly from each scene as the training set, so that it has enough variety; the other nine videos are used for testing. As only a few have reported performance on this challenging data set, we compare our results with [21] and [31]. Quantitative results are shown in Table 4.2. “CRF unary only” row in Table 4.2 shows the performance of our approach without using pairwise features, while the last row shows the results using all listed features. We see that even using unary features only, our approach achieves better performance than of [31] and [21]. By using both unary and pairwise features, our approach achieves the best precision and mostly tracked rate; the recall rate is comparable to [31]. With 9 more fragments,wereduce35idswitchescomparedwith[31]. ThisdemonstratesthattheCRF model is able to consider one step further than using only affinities which may be heavily affectedbydependencieswithotherlocalassociations. 
Notethatin[26],aloweridswitch is reported; however, [26] focuses on producing a more discriminative appearance model, 52 Frame 414 Frame 421 Frame 447 Frame 469 Frame 294 Frame 314 Frame 335 Frame 358 Frame 1640 Frame 1688 Frame 1722 Frame 1842 (a) Keep identities when people are quite close (b) Keep tracking people successfully under occlusions (c) Do not drift under long-time overlap with others Figure 3.5: Sample tracking results of our approach on different scenes in Trecvid08 data set [2]. while we aim at building a more powerful framework for integrating multiple cues, and the model in [26] can be integrated into our system as a more powerful feature. Our framework does not have limitations on models for affinities and dependencies, and can easily absorb any new models. Figure 3.5 shows some sample tracking results. In the first row, three persons (No. 19, No. 25, No. 26) come very close to each other; our tracker is able to differentiate them with each other and keep their identities. In frame 447 and 469, there is a similar case;allthreeinvolvedpersonsaretrackedcorrectlyandsuccessfully. Figure3.5(b)shows 53 an occlusion case, where person 12 and 13 are occluded by person 32 and 35. We see that the occlusion lasts for more than fifty frames, and sometimes these persons are fully overlapped. However, our tracker is able to provide correct associations by considering association affinities and dependencies. In Figure 3.5(c), a man is bending in front of the door for more than 200 frames, and multiple persons pass through the door and have overlaps with this man. The tracklet for the man does not drift to other persons and all persons are tracked correctly. ThetrainingfortheCRFenergyfunctiontakesabout12hours. Testingtimeistightly related with the number of tracklets. As the nodes in the CRF graph builds on tracklet pairs and edges builds on node pairs, the time cost is polynomial to number of tracklets, which is often proportional to the number of persons. We implement our system using C on a PC with 3.0G Hz CPU and 8GB memory. The average testing time is around ten minutes for a video clip with 5000 frames and around 10 persons per frame. 3.6 Conclusion WeproposeaCRFmodeltotransformtheproblemofmulti-targettrackingintoanenergy minimizationproblem. TheproposedRankBoostalgorithmisabletolearncostfunctions for modeling both affinities between tracklets and local association dependencies. Exper- imental results have demonstrate effectiveness of our approach in crowded scenes as well as the importance of local association dependencies. Comparisons with up-to-date meth- ods are given. Future work would be focused on integrating online information into the framework. 54 Chapter 4 An Online Learned CRF Model for Multi-Target Tracking from Both Static and Moving Cameras In this chapter, we extend the offline learned CRF model tracking approach into an online method, so that the CRF model is adaptive to different videos to produce better performances. 4.1 Introduction Association based approaches are powerful at dealing with extended occlusions between targetsandthecomplexityispolynomialinthenumberoftargets. However,howtobetter distinguishdifferenttargetsremainsakeyissuethatlimitstheperformanceofassociation based tracking. It is difficult to find descriptors to distinguish targets in crowded scenes with frequent occlusions and similar appearances. 
In this chapter, we propose an online learned condition random field (CRF) model to better discriminating different targets, especially difficult pairs, which are spatially near targets with similar appearance. 55 To identify each target, motion and appearance information are often adopted to produce discriminative descriptors. Motion descriptors are often based on speeds and distances between tracklet pairs, while appearance descriptors are often based on global or part based color histograms to distinguish different targets. In most previous association based tracking work, appearance models are pre-defined [51, 39] or online learned to discriminate all targets [26, 43] or to discriminate one target with all others [10, 27]. Though such learned appearance models are able to distinguish most targets, they are not necessarily capable of differentiating difficult pairs, i.e., close targets with similar appearances. The discriminative features between a difficult pair are possibly quite different with those for distinguishing with all other targets. Linear motion models are widely used in previous tracking work [47, 54, 10]; linking probabilities between tracklets are often based on how well a pair of tracklets satisfies a linear motion assumption. However, as shown in the first row of Figure 4.1, if the view angle changes due to camera motion, the motion smoothness would be impaired; it could be compensated by frame matching techniques, but this is a challenging task by itself. Relative positions between targets are less dependent on view angles, and are often more stable than linear motion models for dealing with camera motions. For static cameras, relative positions are still helpful, as shown in the second row of Figure 4.1; some targets may not follow a linear motion model, and relative positions between neighboring targets are useful for recovering errors in a linear motion model. Based on above observations, we propose an online learning approach, which formu- lates the multi-target tracking problem as inference in a conditional random field (CRF) 56 Frame 505 Frame 518 Enlarged 2D map for person 41&48 Frame 3692 Frame 3757 Enlarged 2D map for person 145&144 Figure 4.1: Examples of relative positions and linear estimations. In the 2D map, filled circles indicate positions of person at the earlier frames; dashed circles and solid circles indicateestimatedpositionsbylinearmotionmodelsandrealpositionsatthelaterframes respectively. 57 Object Detector Energy Minimization Pairwise Models Pairwise motion models ∆s 2 Global motion models for any tracklet pairs G +1 -1 ... ... Training samples for global appearance models +1 -1 ... ... Training samples for pairwise appearance models +1 -1 ... ... Training samples for pairwise appearance models ∆s ∆s’ ∆s ∆s’ ∆s 1 Pairwise motion models CRF Model Pairwise Models ... Global Models Building CRF Graph Online learning global models Online learning pairwise models Tracking Output Detection response Tracklet Tracklet pair observation Legend Low Level Tracklet Input Video Low Level Association Unary Energy Functions Pairwise Energy Functions Figure4.2: Trackingframeworkofourapproach. IntheCRFmodel, eachnodedenotesa possible link between a tracklet pair and has a unary energy cost based on global appear- ance and motion models; each edge denotes a correlation between two nodes and has a pairwise energy cost based on discriminative appearance and motion models specifically for the two nodes. 
Colors of detection responses indicate their belonging tracklets. Best viewed in color. model as shown in Figure 4.2. Our CRF framework incorporates global models for dis- tinguishing all targets and pairwise models for differentiating difficult pairs of targets. All linkable tracklet pairs form the nodes in this CRF model, and labels of each node (1 or 0) indicate whether two tracklets can be linked or not. The energy cost for each node is estimated based on global appearance and motion models similar to [26]. The energy cost for an edge is based on discriminative pairwise models, i.e., appearance and motion descriptors, that are online learned for distinguishing tracklets in the connected two CRF nodes. Global models and pairwise models are used to produce unary and pairwise energy functions respectively, and the tracking problem is transformed into an energy minimization task. 58 The contributions of this chapter include: ∙ A CRF framework for modeling both global tracklets affinity models and pairwise discriminative models. ∙ An online learning approach for producing unary and pairwise energy functions in a CRF model. ∙ An approximation algorithm for efficiently finding good tracking solutions with low energy costs. The rest of the chapter is organized as follows: related work is discussed in Section 4.2; problem formulation is given in Section 4.3; Section 4.4 describes the online learning approach for a CRF model; experiments are shown in Section 4.5, followed by conclusion in Section 5.6. 4.2 Related Work Multi-target tracking has been an important topic in computer vision for several years. One key issue in tracking is that how to distinguish targets with backgrounds and with each other. Most visual tracking methods focus on tracking single object or multiple objects sep- arately [30, 48]; they usually try to find proper appearance models that distinguish one objectwithallothertargetsorbackgrounds,andadoptmeanshift[14]orparticlefiltering [22] like approach to online adjust target appearance models, and use updated models to continuously track targets. 59 Ontheotherhand,mostassociationbasedmethodsfocusontrackingmultipleobjects of a pre-known class simultaneously [38, 45, 4, 52]. They usually associate detection responses produced by a pre-trained detector into long tracks, and find a global optimal solutionforalltargets. Appearancemodelsareoftenpre-defined[51,39]oronlinelearned todistinguishmultipletargetsglobally[43,26];inaddition,linearmotionmodelsbetween tracklet pairs [47, 10] are often adopted to constrain motion smoothness. Though such approaches may obtain global optimized appearance and motion models, they are not necessarily able to differentiate difficult pairs of targets, i.e., close ones with similar appearances, as appearance models for distinguishing a specific pair of targets may be quite different with those used for distinguishing all targets, and previous motion models are not stable for non-static cameras. However, our online CRF models consider both global and pairwise discriminative appearance and motion models. Note that both the method in Chapter 3 and this approach relax the assumption that associations between tracklet pairs are independent of each other. However, the offline CRF method focused on modeling association dependencies, while this approach aims at better distinction between difficult pairs of targets and therefore the meanings of edges in CRF are different. 
In addition, the method in Chapter 3 is an offline approach that integrates multiple cues on pre-labeled ground truth data, but the approach in this chapter is an online learning method that finds discriminative models automatically without pre-labeled data. 60 4.3 CRF Formulation for Tracking Given a video input, we first detect targets in each frame by a pre-trained detector. Sim- ilar to [26], we adopt a low level association process to connect detection responses in neighboring frames into reliable tracklets, and then associate the tracklets progressively in multiple levels. A tracklet = { ,..., } is defined as a set of detection or inter- polated responses in consecutive frames, where and denote the start and end frames of and = { , , } denote the response at frame , including position , size , and velocity vector . Ateachlevel,theinputisthesetoftrackletsproducedinpreviouslevel ={ 1 , 2 ,..., }. For each possible association pair of tracklet ( 1 → 2 ), we introduce a label , where =1 indicates 1 is linked to 2 , and =0 indicates the opposite. We aim to find the best set of associations with the highest probability. We formulate the tracking problem as finding the best given ∗ =argmax (∣)=argmax 1 exp(−Ψ(∣)) (4.1) where is a normalization factor, and Ψ is a cost function. Assuming that the joint distributions for more than two associations do not make contributions to (∣), we have ∗ = argmin Ψ(∣) = argmin ∑ ( ∣)+ ∑ ( , ∣) (4.2) 61 where ( ∣)=−ln( ∣) and ( , ∣)=−ln( , ∣) denote the unary and pair- wise energy functions respectively. In Equ. 4.2, the first part defines the linking probabilities between any two track- lets based on global appearance and motion models, while the second part defines the correlations between tracklet pairs based on discriminative models especially learned for corresponding pairs of tracklets. We model the tracking problem by a Conditional Random Field (CRF) model. As shown in Figure 4.2, a graph = (,) is created for each association level, where = { 1 ,..., } denotes the set of nodes, and = { 1 ,..., } denotes the set of edges. Each node = ( 1 → 2 ) denotes a possible association between tracklets 1 and 2 ; each edge = {( 1 , 2 )} denotes a correlation between two nodes. A label ={ 1 ,..., } on denotes an association result for current level. We assume that one tracklet cannot be associated with more than one tracklet, and therefore any valid label set should satisfy ∑ ∈ 1 ≤1 & ∑ ∈ 2 ≤1 (4.3) 1 ={( 1 → )∈} ∀ ∈ 2 ={( → 2 )∈} ∀ ∈ where the first constraint limits any tracklet 1 link to at most one other tracklet, and the second constraint limits that at most one tracklet may be link to any tracklet 2 . 62 For efficiency, we track in sliding windows one by one instead of processing the whole video at one time. The CRF models are learned individually in each sliding window. 4.4 Online Learning of CRF Models Inthissection, weintroduce ourtrackingapproachinseveralsteps, includingCRFgraph creation,onlinelearningofunaryandpairwiseterms,aswellashowtofindanassociation label set with low energy. 4.4.1 CRF Graph Creation for Tracklets Association Given a set of tracklets = { 1 , 2 ,..., } as input, we want to create a CRF graph for modeling all possible associations between tracklets and their correlations. Tracklet is linkable to if the gap between the end of and the beginning of satisfies 0< − < (4.4) where is a threshold for maximum gap between any linkable pair of tracklets, and and denotes the start and end frames of and respectively. 
We create the set of nodes in CRF to modeling all linkable tracklets as ={ =( 1 → 2 )} .. 1 is linkable to 2 (4.5) Instead of modeling association dependencies as in [54], edges in our CRF provide corresponding pairwise models between spatially close targets, and are defined between any nodes that have tail close or head close tracklet pairs. 63 T i T j s i t i p s i t j p e m t m p T k T m e m t k p Figure 4.3: Examples of head close and tail close tracklet pairs. As shown in Figure 4.3, two tracklets and are a head close pair if they satisfy (suppose ≥ ) < & ∣∣ − ∣∣<min{ , } (4.6) where is a distance control factor, set to 3 in our experiments. This definition indicates that the head part of is close to at ’s beginning frame. The definition of tail close is similar. Then we define the set of edges as ={( , )} ∀ , ∈ (4.7) .. 1 and 1 are tail close, or 2 and 2 are head close. Such definition constraints the edges on difficult pairs where wrong associations are mostlikelytohappen,sothatedgesproduceproperpairwiseenergiestodistinguishthem. 64 p tail p tail +v tail ∆t p head -v head ∆t p head ∆p 1 ∆p 2 T 1 T 2 Figure 4.4: Global motion models for unary terms in CRF. 4.4.2 Learning of Unary Terms Unary terms in Equ. 4.2 define the energy cost for associating pairs of tracklets. As defined in section 4.3, ( ∣) = −ln( ∣). We further divide the probability into motion based probability (⋅) and appearance based probability (⋅) as ( =1∣)=−ln( ( 1 → 2 ∣) ( 1 → 2 ∣)) (4.8) is defined as in [26, 54, 27], which is based on the distance between estimations of positions based on linear motion models and the real positions. As shown in Figure 4.4, the motion probability between tracklets 1 and 2 are defined based on Δ 1 = ℎ − ℎ Δ− and Δ 2 = + Δ− ℎ as ( 1 → 2 ∣)=(Δ 1 ,Σ )(Δ 2 ,Σ ) (4.9) where(⋅,Σ)isthezero-meanGaussian function, andΔistheframedifferencebetween and ℎ . 65 For (⋅), we adopt the online learned discriminative appearance models (OLDAMs) definedin[26],whichfocusonlearningappearancemodelswithgoodglobaldiscriminative abilities between targets. 4.4.3 Learning of Pairwise Terms Similar to unary terms, pairwise terms are also decomposed into motion based and ap- pearance based parts. Motion based probabilities are defined based on relative distance between tracklet pairs. Take two nodes ( 1 , 3 ) and ( 2 , 4 ) as an example. Suppose 1 and 2 are a tail close pair; therefore, there is an edge between the two nodes. Let =min{ 1 , 2 }, and =max{ 3 , 4 }. As 1 and 2 are tail close, we estimate positions of both at frame , as shown in dash circles in Figure 4.5. Then we can get the estimated relative distance between 1 and 2 at frame as Δ 1 =( 1 1 + 1 ( − 1 ))−( 2 2 + 2 ( − 2 )) (4.10) where 1 and 1 are the tail velocity and end frames of 1 ; 2 and 2 are similar. We comparetheestimatedrelativedistancewiththerealoneΔ 2 ,andusethesameGaussian function in Equ. 4.9 to compute the pairwise motion probability as (Δ 1 −Δ 2 ,Σ ). As shown in Figure 4.5, the difference between Δ 1 and Δ 2 is small. This indicates that if 1 is associated to 3 , there is a high probability that 2 is associated to 4 and vise versa. Note that if 3 and 4 in Figure 4.5 are head close, we also do a similar 66 T 2 T 1 ∆p 1 2 x t p 1 x t p T 3 3 y t p 4 y t p 1 1 e t p 1 1 1 1 ( ) e tail e y t V t t p + - + - + - + - 2 2 2 2 ( ) e tail e y t V t t p + - + - + - + - ∆p 2 T 4 Figure 4.5: Pairwise motion models for pairwise terms in CRF. computation as above; the final motion probability would be taken as the average of both. 
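A compact sketch of the two motion affinities just defined is given below, assuming 2D positions, a single hand-set sigma, and unnormalized Gaussians; the actual system uses the covariances above, so this is an illustration rather than the implementation.

// Sketch: the unary linear-motion affinity of Equ. 4.9 and the pairwise
// relative-position affinity of Equ. 4.10.
#include <cmath>

struct Vec2 { double x, y; };
static Vec2 operator+(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }
static Vec2 operator-(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
static Vec2 operator*(Vec2 a, double s) { return {a.x * s, a.y * s}; }
static double gauss(Vec2 d, double sigma) {           // zero-mean Gaussian, unnormalized
  return std::exp(-0.5 * (d.x * d.x + d.y * d.y) / (sigma * sigma));
}

// Equ. 4.9: extrapolate T1's tail forward and T2's head backward over the
// frame gap dt, and score both prediction errors.
double unaryMotionAffinity(Vec2 pTail, Vec2 vTail, Vec2 pHead, Vec2 vHead,
                           double dt, double sigma) {
  Vec2 dp1 = (pTail + vTail * dt) - pHead;   // forward prediction error
  Vec2 dp2 = (pHead - vHead * dt) - pTail;   // backward prediction error
  return gauss(dp1, sigma) * gauss(dp2, sigma);
}

// Equ. 4.10: for a tail-close pair (T1, T2) with candidate links T1->T3 and
// T2->T4, the relative offset between the two targets, extrapolated to a
// frame t where both T3 and T4 are observed, should match the observed one.
double pairwiseMotionAffinity(Vec2 p1End, Vec2 v1, int e1,
                              Vec2 p2End, Vec2 v2, int e2,
                              Vec2 p3AtT, Vec2 p4AtT, int t, double sigma) {
  Vec2 est1 = p1End + v1 * double(t - e1);
  Vec2 est2 = p2End + v2 * double(t - e2);
  Vec2 predictedOffset = est1 - est2;        // estimated relative distance
  Vec2 observedOffset  = p3AtT - p4AtT;      // observed relative distance
  return gauss(predictedOffset - observedOffset, sigma);
}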
Pairwise appearance models are designed for differentiating specific close pairs. For example, in Figure 4.5, 1 and 2 are a tail close pair; we want to produce an appear- ance model that best distinguishes the two targets without considering other targets or backgrounds. Therefore, we online collect positive and negative samples only from the concerned two tracklets so that the learned appearance models are most discriminative for these two. Positive samples are selected from responses in the same tracklet; any pair of these responses should have high appearance similarity. For a tail close tracklet pair 1 and 2 , the positive sample set + is defined as + ={( 1 , 2 )} ∈{1,2} (4.11) ∀ 1 , 2 ∈[max{ , −}, ] 67 where is a threshold for the number of frames used for computing appearance models (set to 10 in our experiments). The introduction of is because a target may change appearance a lot after some time due to illumination, view angle, or pose changes. Negative samples are selected from responses in different tracklets, and they should have as much differences as possible in appearance. The negative sample set − between 1 and 2 is defined as − = {( 1 1 , 2 2 )} (4.12) ∀ 1 ∈[max{ 1 , 1 −}, 1 ], ∀ 2 ∈[max{ 2 , 2 −}, 2 ] Sample collection for head close pairs are similar, but detection responses are from the first frames of each tracklet. With the positive and negative sample sets, we adopt the standard Real Boost al- gorithm to produce appearance models for best distinguishing 1 and 2 ; we adopt the featuresdefinedin[26],includingcolor,texture,andshapefordifferentregionsoftargets. Based on the pairwise model, we get new appearance based probabilities for ( 1 , 3 ) and ( 2 , 4 ) shown in Figure 4.5. If 3 and 4 are a head close pair, we adopt a similar learning approach to get appearance probabilities based on discriminative models for 3 and 4 , and use the average of both scores as the final pairwise appearance probabilities. Note that the discriminative appearance models between 1 and 2 are only learned once for all edges like{( 1 , ),( 2 , )} ∀ , ∈. Therefore, the complexity is much less than the number of edges and becomes ( 2 ), where is the number of tracklets. 68 Moreover, as only a few tracklet pairs are likely to be spatially close, the actual times of learning is often much smaller than 2 . 4.4.4 Energy Minimization For CRF models with submodular energy functions, where (0,0)+(1,1)<(1,0)+ (1,1), aglobaloptimalsolutioncanbefoundbythegraphcutalgorithm. However, due totheconstraintsinEqu. 4.3, theenergyfunctioninourformulationisnotsub-modular. Therefore, it is difficult to find the global optimal solution in polynomial time. Instead, we introduce a heuristic algorithm to find a good solution in polynomial time. The unary terms in our CRF model have been shown to be effective for non-difficult pairs by previous work [26]. Considering this issue, we first use the Hungarian algorithm [38] to find a global optimal solution by only considering the unary terms and satisfying theconstraintsinEuq. 4.3. Thenwesorttheselectedassociations, i.e., nodeswithlabels of1, accordingtotheirunarytermenergiesfromleasttomostas={ =( 1 → 2 )}. Then for each selected node, we try to switch labels of it and each neighboring node, i.e., a node that is connected with current node by an edge in the CRF model; if the energy is lower, we keep the change. The energy minimization algorithm is shown in Algorithm 1. 
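A minimal sketch of the evaluate-and-switch step in this heuristic, on top of the Hungarian initialization, is shown below; the containers are simplified, the cost of label 0 is folded to zero for brevity, and the one-to-one constraints of Equ. 4.3 still need to be re-checked after each switch.

// Sketch: total energy of a labeling and the label-switch test used by the
// heuristic described above.
#include <utility>
#include <vector>

struct CrfNode { int t1, t2; double unaryCost; };   // energy cost if labeled 1
struct CrfEdge { int a, b; double pairCost; };      // energy cost if both endpoints labeled 1

double totalEnergy(const std::vector<CrfNode>& nodes, const std::vector<CrfEdge>& edges,
                   const std::vector<int>& labels) {
  double e = 0.0;
  for (size_t i = 0; i < nodes.size(); ++i)
    if (labels[i] == 1) e += nodes[i].unaryCost;
  for (const CrfEdge& ed : edges)
    if (labels[ed.a] == 1 && labels[ed.b] == 1) e += ed.pairCost;
  return e;
}

// Try switching the labels of a selected node and one of its neighbors; the
// switch is kept only if it lowers the total energy, otherwise it is reverted.
bool trySwitch(const std::vector<CrfNode>& nodes, const std::vector<CrfEdge>& edges,
               std::vector<int>& labels, int selected, int neighbor) {
  double before = totalEnergy(nodes, edges, labels);
  std::swap(labels[selected], labels[neighbor]);
  if (totalEnergy(nodes, edges, labels) < before) return true;   // keep the change
  std::swap(labels[selected], labels[neighbor]);                 // revert
  return false;
}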
Note that the Hungarian algorithm has a complexity of ( 3 ), while our heuristic search process has a complexity of (∣∣) = ( 4 ). Therefore, the overall complexity is still polynomial. In addition, as nodes are only defined between tracklets with a proper time gap and edges are only defined between nodes with head or tail close tracklet pair, 69 Algorithm 6 Finding labels with low energy cost. Input: Tracklets from previous level ={ 1 , 2 ,..., }; CRF graph =(,). Find the label set with the lowest unary energy cost by Hungarian algorithm, and evaluate its overall energy Ψ by Equ. 4.2. Sortnodeswithlabelsof1accordingtotheirunaryenergycostsfromleasttomostas{ 1 ,..., }. For =1,..., do: ∙ Set updated energy Ψ ′ =+∞ ∙ For =1,..., that ( , )∈ do: – Switch labels of and under constraints in Equ. 4.3, and evaluate new energy Ω. – If Ω<Ψ ′ , Ψ ′ =Ω ∙ If Ψ ′ <Ψ, update with the switch, and set Ψ=Ψ ′ . Output: The set of labels on the CRF graph. the actual number of edges is typically much smaller than 4 . In our experiments, the run time is almost linear in the number of targets. 4.5 Experiments We evaluate our approach on three public pedestrian data sets: the TUD data set [4], Trecvid 2008 [2], and ETH mobile pedestrian [16] data set. We show quantitative com- parisons with state-of-art methods, as well as visualized results of our approach. Though frame rates, resolutions and densities are different in these data sets, we use the same parameter setting, and performance improves compared to previous methods for all of them. This indicates that our approach has low sensitivity on parameters. All data used in our experiments are publicly available 1 . 1 http://iris.usc.edu/people/yangbo/downloads.html 70 Method Recall Precision FAF GT MT PT ML Frag IDS Energy Minimization [4] - - - 9 60.0% 30.0% 0.0% 4 7 PRIMPT [27] 81.0% 99.5% 0.028 10 60.0% 30.0% 10.0% 0 1 Online CRF Tracking 87.0% 96.7% 0.184 10 70.0% 30.0% 0.0% 1 0 Table 4.1: Comparison of results on TUD dataset. The PRIMPT results are provided by courtesy of authors of [27]. Our ground truth includes all persons appearing in the video, and has one more person than that in [4]. Method Recall Precision FAF GT MT PT ML Frag IDS Offline CRF Tracking [54] 79.2% 85.8% 0.996 919 78.2% 16.9% 4.9% 319 253 OLDAMs [26] 80.4% 86.1% 0.992 919 76.1% 19.3% 4.6% 322 224 PRIMPT [27] 79.2% 86.8% 0.920 919 77.0% 17.7% 5.2% 283 171 Online CRF Tracking 79.8% 87.8% 0.857 919 75.5% 18.7% 5.8% 240 147 Table 4.2: Comparison of tracking results on Trecvid 2008 dataset. The human detection results are the same as used in [54, 26, 27], and are provided by courtesy of authors of [27]. 4.5.1 Results on Static Camera Videos Wefirsttestourresultsondatasetscapturedbystaticcameras, i.e.,TUD[4]andTrecvid 2008 [2]. For fair comparison, we use the same TUD-Stadtmitte data set as used in [4]. It is captured on a street at a very low camera angle, and there are frequent full occlusions among pedestrians. But the video is quite short, and contains only 179 frames. ThequantitativeresultsareshowninTable4.1. Wecanseethatourresultsaremuch better than that in [4], but the improvement is not so obvious compared with [27]: we have higher MT and recall and lower id switches, but PRIMPT has higher precision and lower fragments. This is because the online CRF model focuses on better differentiating difficult pairs of targets, but there are not many people in the TUD data set. 
Some 71 visualresultsareshowninthefirstrowofFigure4.6; ourapproachisabletokeepcorrect identities while targets are quite close, such as person 0, person 1, and person 2. To see the effectiveness of our approach, we further evaluate our approach on the difficult Trecvid 2008 data set. There are 9 video clips in the data set, each of which has 5000 frames; these videos are captured in a busy airport, and have high density of people with frequent occlusions. There are lots of close track interactions in this data set, indicating huge number of edges in the CRF graph. The comparison results are showninTable4.2. Comparedwithup-to-dateapproaches, ouronlineCRFachievesbest performance on precision, FAF, fragments, and id switches, while keeping recall and MT competitive. Comparedwith[27],ourapproachreducesthefragmentsandtheidswitches by about 15% and 14% respectively. Row 2 and 3 in Figure 4.6 show some tracking examples by our approach. We can see that when targets with similar appearances get close, the online CRF can still find discriminative features to distinguish these difficult pairs. However, global appearance and motion models are not effective enough in such cases, such as person 106 and 109 in the third row of Figure 4.6, who are both in white, move in similar directions, and are quite close. The second row in Figure 4.1 shows an example where the approach in [26] produces a fragmentation due to non-linear motions while our approach has no fragments by considering pairwise terms in the CRF model. 4.5.2 Results on Moving Camera Videos We further evaluate our approach on the ETH data set [16], which is captured by a pair of cameras on a moving stroller in busy street scenes. The stroller is mostly moving 72 Method Recall Precision FAF GT MT PT ML Frag IDS PRIMPT [27] 76.8% 86.6% 0.891 125 58.4% 33.6% 8.0% 23 11 Online CRF Tracking 79.0% 90.4% 0.637 125 68.0% 24.8% 7.2% 19 11 Table 4.3: Comparison of tracking results on ETH dataset. The human detection results are the same as used in [27], and are provided by courtesy of authors of [27]. forward, but sometimes has panning motions, which makes the motion affinity between two tracklets less reliable. Forfaircomparison,wechoosethe“BAHNHOF”and“SUNNYDAY”sequencesused in [27] for evaluation. They have 999 and 354 frames respectively, and people are under frequent occlusions due to the low view angles of cameras. For fair comparison with [27], we also use the sequence from the left camera; no depth and ground plane information are used. ThequantitativeresultsareshowninTable4.3. Wecanseethatourapproachachieves better or the same performances on all evaluation scores. The mostly tracked score is improvedbyabout10%; fragmentsarereducedby17%; recallandprecisionareimproved by about 2% and 4% respectively. The obvious improvement in MT and fragment scores indicate that our approach can better track targets under moving cameras, where the traditional motion models are less reliable. The last two rows in Figure 4.6 show some visual tracking results by our online CRF approach. Both examples show obvious panning movements of cameras. Traditional motion models, i.e., unary motion models in our online CRF, would produce low affinity scoresfortrackletsbelongingtothesametargets. However,byconsideringpairwiseterms, relative positions are helpful for connecting correct tracklets into one. This explains the obvious improvements on MT and fragments. 
The first row in Figure 4.1 shows an 73 example that our approach successfully tracks persons 41 and 48 under abrupt camera motions, while the method in [26] fails to find the correct associations. 4.5.3 Speed AsdiscussedinSection4.4.4,thecomplexityofouralgorithmispolynomialinthenumber of tracklets. Our experiments are performed on a Intel 3.0GHz PC with 8G memory, and the codes are implemented in C++. For the less crowded TUD and ETH data sets, the speed are both about 10 fps; for crowded Trecvid 2008 data set, the speed is about 6 fps. Compared with the speed of 7 fps for Trecvid 2008 reported in [27], the online CRF does not add much to the computation cost (detection time costs are not included in either measurement). 4.6 Conclusion We described an online CRF framework for multi-target tracking. This CRF considers both global descriptors for distinguishing different targets as well as pairwise descriptors for differentiating difficult pairs. Unlike global descriptors, pairwise motion and appear- ance models are learned from corresponding difficult pairs, and are further represented by pairwise terms in the CRF energy function. An effective algorithm is introduced to efficiently find associations with low energy, and the experiments show significantly im- proved results compared with up-to-date methods. Future improvement can be achieved by adding camera motion inference into pairwise motion models. 74 Frame 47 Frame 85 Frame 125 Frame 60 Frame 620 Frame 650 Frame 685 Frame 700 Frame 170 Frame 190 Frame 230 Frame 245 Frame 1210 Frame 1220 Frame 1252 Frame 1263 Frame 3300 Frame 3330 Frame 3360 Frame 3410 Figure 4.6: Tracking examples on TUD [4], Trecvid 2008 [2], and ETH [16] data sets. 75 Chapter 5 Integration of Association Based Tracking and Category Free Tracking In this chapter, we introduce a framework that incorporate category free tracking into the association based tracking framework, so that the tracking results can go beyond the limitation of detectors’ performances. 5.1 Introduction As mentioned in Chapter 1, most previous approaches can be classified into Association Based Tracking (ABT) or Category Free Tracking (CFT); ABT is usually a fully auto- matic process to associate detection responses into tracks, while CFT usually tracks a manual labeled region without requirements of pre-trained detectors. This chapter aims at incorporating merits of both ABT and CFT in a unified framework. A key aspect of our approach is online learning discriminative part-based appearance models for robust multi-human tracking. 76 Association based tracking methods focus on specific kinds of objects, e.g., humans, faces, or vehicles [43, 54, 7, 39, 45, 37]; they use a pre-trained detector for the concerned kind of objects to produce detection responses, then associate them into tracklets, i.e., trackfragments,andproducefinaltracksbylinkingthetrackletsinoneormultiplesteps. The whole process is typically fully automatic. On the contrary, category free tracking methods,sometimescalled“visualtracking”inpreviouswork,continuouslytrackaregion based on manual labels in the first frame without requirements of pre-trained detectors [48, 32, 19]. We focus on automatically tracking multiple humans in real scenes. As humans may enterorexitscenesfrequently,manualinitializationisimpractical. Therefore,association based tracking is frequently adopted in previous work for this problem [43, 54, 7, 39]. 
In the ABT framework, linking probabilities between tracklet pairs are often defined as

P_link(T_i → T_j) = P_a(T_i → T_j) P_m(T_i → T_j) P_t(T_i → T_j)    (5.1)

where P_a(·), P_m(·), and P_t(·) denote appearance, motion, and temporal linking probabilities. P_t(·) is often a binary function that prevents temporally overlapping tracklets from being associated, and the design of P_a(·) and P_m(·) plays an important role in performance. A globally optimal linking solution for all tracklets is often found by the Hungarian algorithm [38, 51, 27] or network flow methods [43, 39].

However, in most previous ABT work, occlusions are not explicitly considered in the appearance models for calculating P_a(·), leading to a high possibility that regions of one person are used to model the appearance of another. In addition, the performance of ABT is tightly restricted by detectors. On one hand, missed detections at the beginning or end of a tracklet cannot be found by associations, which only fill gaps between tracklets but do not extend them. One example is shown in Figure 5.1; the man in blue is missed in frame 664 and is not tracked until frame 685 due to failure of the detector. On the other hand, to compute P_m(·) in Eq. 5.1, it is often assumed that humans move linearly with a stable speed; this assumption is valid for most short gaps, as humans do not change directions within a short interval, e.g., 4 or 5 frames, but would be problematic for long gaps. For example, in Figure 5.1, person 29 in frame 664 and person 28 in frame 706 are actually the same person, but his track is fragmented due to both the failure of the detector and the direction change of the person during the missed detection period; the linear motion model fails to successfully fill the gap.

Figure 5.1: Limitations of previous association based tracking methods. See text for details.

Table 5.1: Comparison of association based tracking with category free tracking.
 | initialization | track solution | motion cue
association based tracking | auto & imperfect | global | available
category free tracking | manual & perfect | individual | unavailable

One possible solution to overcoming the detection limitation is to use category free tracking methods. However, it is difficult to directly use CFT in association based tracking due to differences in many aspects, as shown in Table 5.1. First, CFT starts from a perfect manually labeled region; however, in our problem, automatic initialization is provided by a pre-learned detector, and some detections may be imprecise or false alarms. Second, CFT methods often find the best solutions for one or a few targets individually while ignoring others; however, in multi-target tracking, a global solution for all targets is more important.
Finally, CFT often ignores motion cues in order to deal with abrupt motion, while most multi-human tracking problems focus on surveillance videos, where people are unlikely to change motion directions and speeds much in a short period, e.g., 4 or 5 frames.

We propose a unified framework that efficiently finds global tracking solutions while incorporating the merits of category free tracking with little extra computational cost. We introduce online learned Discriminative Part-based Appearance Models (DPAMs) to explicitly deal with occlusion problems and detection limitations. The system framework is shown in Figure 5.2. Similar to previous ABT work [27, 54, 39], a human detector is applied to each video frame, and detection responses in neighboring frames are conservatively associated into tracklets. Then, based on occlusion reasoning for human parts, we select a set of tracklets that are reliable for CFT. DPAMs for each tracklet are learned online to differentiate the tracklet from the background and from other possibly close-by tracklets. A conservative CFT method is introduced to safely track reliable targets without detections, so that missing head or tail parts of tracklets are partially recovered and gaps between tracklets are reduced, making linear motion estimations more precise. Finally, a global appearance model is learned, and tracklets are associated according to appearance and motion cues.

Figure 5.2: Tracking framework of our approach. Colors of detection responses come from their tracklets' colors, and black circles denote samples extracted from backgrounds. Best viewed in color.

We emphasize that the category free tracking module in our framework is not proposed to find specific targets from "entry to exit" as most existing CFT methods do, but is used for extrapolating the tracklets and enabling associations to be made more robustly. The contributions of this chapter include:
∙ A unified framework to utilize merits of both ABT and CFT.
∙ Part-based appearance models to explicitly deal with human inter-occlusions.
∙ A conservative category free tracking method based on DPAMs.
The rest of the chapter is organized as follows: Section 2 discusses related work; building part-based feature sets is described in Section 3; Section 4 introduces the online learning process of DPAMs and the conservative CFT method; experiments are shown in Section 5, followed by the conclusion in Section 6.

5.2 Related Work

Tracking multiple humans has attracted much attention from researchers in the computer vision area in recent years. To deal with large numbers of humans, many association based tracking approaches have been proposed to find a globally optimal solution [38, 37, 45, 54]. These approaches detect humans in each frame and gradually associate them into tracks based on motion and appearance cues. Appearance models are often pre-defined [38, 51, 39] or learned online to distinguish different targets [43, 47, 26]. Occlusions are often ignored [26, 54] or modeled as potential nodes in an association graph [39, 43], but have not been used explicitly for appearance modeling, indicating a high possibility that parts used for modeling the appearance of a person belong to other individuals. Moreover, the performance of association based tracking is tightly limited by detection results; missed detections in heads or tails of a track cannot be recovered, and long gaps between tracklets are difficult to associate according to a linear motion model, which is commonly used in most previous ABT work [51, 46, 27, 54].

Different from association based tracking, category free tracking methods do not depend on pre-learned detectors, and appearance models are often based on parts to deal with partial occlusions [48, 32, 19].
However, CFT focuses on single or a few objects individually without finding a global solution for all targets, and it is difficult for CFT to deal with a large number of targets in crowded scenes due to its high complexity. In addition, CFT approaches are often sensitive to initialization; once the tracking drifts, it is difficult to recover. If the initialization is imprecise or even a false alarm, the produced tracks would be problematic.

Note that some previous work also adopts CFT techniques in multi-target tracking problems [51, 7, 10]. However, CFT is often used there to generate initial tracklets, which may include many false tracklets or miss some tracklets without carefully tuning the initialization thresholds for each video. In contrast, we use conservative CFT methods to reduce association difficulties and missed detections in an association based framework, but we do not rely on it for finding whole tracks. Also, unlike previous approaches, we learn discriminative part-based appearance models online to better distinguish tracklets from each other and from the background.

5.3 Building Part-Based Feature Sets for Tracklets

In order to improve efficiency, we track in sliding windows one by one instead of processing the whole video at one time. In the following, we will use the term current sliding window to refer to the sliding window being processed. Given the detection responses in the current sliding window, we adopt the low level association approach as in [26] to generate reliable tracklets from consecutive frames. These tracklets are then associated in multiple steps.

Appearance models play important roles in tracklet associations. An appearance model for a tracklet includes two parts: a set of features F = {f_1, f_2, ..., f_n} and a set of weights W = {w_1, w_2, ..., w_n}. The features could be color or shape histograms extracted from some regions of responses in the tracklet, and there is a unique w_i for each f_i. The set of weights measures the importance of features and is often shared between multiple tracklets. As features always exist (though they may not be computed if their weights are zero) while weights control which features to use and their importance, finding robust appearance models equals finding the best set of weights W. We will use the term appearance models to represent the set of weights W for clarity. Given W, the appearance similarity between two feature sets F_i and F_j is defined as Σ_u w_u h_u(f_u^i, f_u^j), where h_u(·) is an evaluation function for two features, e.g., the Euclidean distance or the correlation coefficient between two vectors.

A tracklet T_k is defined as detected or interpolated responses in consecutive frames from time t^1_k to t^e_k, T_k = {d^k_1, ..., d^k_{n_k}}, where n_k is the length of T_k. We produce two feature sets F^head_k and F^tail_k, including features modeling the appearances of T_k's head and T_k's tail respectively. The appearance linking probability from T_i to T_j is evaluated on F^tail_i and F^head_j using a set of appropriate weights W. In order to explicitly consider occlusions, features are extracted from parts instead of from whole human responses.
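To make the weighted similarity concrete, the following is a minimal C++ sketch of the score Σ_u w_u h_u(·,·) over a part-based feature set. The type name FeatureSet, the correlation-based choice of h_u, and the per-feature validity flags (which anticipate the occlusion handling described next) are illustrative assumptions for this sketch, not the thesis implementation.

#include <cmath>
#include <cstddef>
#include <vector>

// One descriptor per feature f_u; valid[u] is false when the part that f_u comes from was occluded.
// Sizes of features, valid, and the weight vector are assumed to match.
struct FeatureSet {
    std::vector<std::vector<float>> features;
    std::vector<bool> valid;
};

// h_u(.,.): a simple correlation coefficient between two equal-length vectors, in [-1, 1].
static float evalFeature(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.empty() || a.size() != b.size()) return 0.f;
    float ma = 0.f, mb = 0.f;
    for (std::size_t i = 0; i < a.size(); ++i) { ma += a[i]; mb += b[i]; }
    ma /= a.size(); mb /= b.size();
    float num = 0.f, da = 0.f, db = 0.f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        num += (a[i] - ma) * (b[i] - mb);
        da  += (a[i] - ma) * (a[i] - ma);
        db  += (b[i] - mb) * (b[i] - mb);
    }
    return (da > 0.f && db > 0.f) ? num / std::sqrt(da * db) : 0.f;
}

// Appearance similarity sum_u w_u * h_u(f_u, g_u); features whose parts are occluded
// in either set are skipped, so they contribute nothing to the score.
float appearanceSimilarity(const FeatureSet& A, const FeatureSet& B,
                           const std::vector<float>& weights) {
    float score = 0.f;
    for (std::size_t u = 0; u < weights.size(); ++u) {
        if (!A.valid[u] || !B.valid[u]) continue;
        score += weights[u] * evalFeature(A.features[u], B.features[u]);
    }
    return score;
}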
As shown in Figure 5.3(a), each response is normalized into 24×58 and is composed of 15 parts defined as P = {P(1), P(2), ..., P(15)}. Each P(u) is a set of features extracted from part u. If part u is occluded, all features in P(u) are invalid, as they may not come from the concerned human. In this section, we aim at building F^head_k and F^tail_k for each tracklet T_k by explicitly considering human inter-occlusions. Scene occluders, e.g., screens or pillars, are not considered due to the difficulty of inferring them.

Figure 5.3: Illustrations for human parts: (a) parts definition for a normalized 24×58 human detection response; 15 squares are used to build part-based appearance models, and their sizes are shown in blue numbers; (b)(c) show automatic occlusion inference results in real scenes; visible parts for each person are labeled with the same color, and occluded parts are not shown and not used for appearance modeling. Best viewed in color.

We assume that all persons stand on the same ground plane, and that the camera looks down towards the ground, which is valid for typical surveillance videos. Two scene examples are shown in Figure 5.3(b)(c). Such assumptions indicate that a smaller y-coordinate¹ in a 2D frame corresponds to a larger depth from the camera in 3D. Therefore, given the detection responses in one frame, we sort them according to their lowest y-coordinates, indicating their ground plane positions, from largest to smallest. Then we do occlusion reasoning for each person. If more than 30% of a part is already occupied by parts of persons who have larger y-coordinates, it is labeled as occluded; otherwise, it is unoccluded. Some examples of automatically detected unoccluded parts are shown in Figure 5.3(b)(c) in colored squares.

¹ See Figure 5.3(b)(c) for the definition of y-coordinates.

To produce F^head_k for tracklet T_k, we decompose the set into 15 subsets as

F^head_k = {F^head_k(1), F^head_k(2), ..., F^head_k(15)}    (5.2)

Each F^head_k(u) is a subset that contains features only extracted from part u. To suppress noise, each feature is taken as the average of valid features from the first n_0 responses in T_k, where n_0 is a control parameter and is set to 8 in our experiments. For each part u we introduce a set S^head_k(u), which contains multiple P_i(u)s from different responses, so that each feature in F^head_k(u) is taken as the average value of the corresponding features in S^head_k(u). Meanwhile, we also maintain a set R^head_k, which contains all responses used in building F^head_k; this set is used in the learning process for appearance models.

Algorithm 7 shows the process of building F^head_k, where |S^head_k(u)| denotes the number of elements in S^head_k(u), and S^head_k = ∪_u S^head_k(u); the process of building F^tail_k is similar.

Algorithm 7 The algorithm of building the head feature set for one tracklet.
Input: tracklet T_k.
Initialize R^head_k = ∅ and S^head_k(u) = ∅ ∀ u ∈ {1,...,15}.
For i = 1, ..., n_k do:
∙ For u = 1, ..., 15 do:
  – If |S^head_k(u)| < n_0 and P_i(u) is unoccluded, S^head_k(u) = S^head_k(u) ∪ {P_i(u)};
∙ If at least one P_i(u) is unoccluded, R^head_k = R^head_k ∪ {d^k_i};
∙ If ∀ u ∈ {1,...,15}, |S^head_k(u)| = n_0, break;
Compute each feature in F^head_k by averaging the corresponding features in S^head_k.
Output: F^head_k.

Our algorithm tends to find the earliest unoccluded regions for each part. If a part is occluded in early frames, we use later unoccluded ones to represent its appearance, assuming that the occluded parts have appearances similar to those of the temporally nearest unoccluded ones; this is a widely used assumption in CFT work [24, 19]. If a part is occluded in the whole tracklet, F^head_k(u) would be an empty set, as all features in it are invalid.

5.4 Online Learning DPAMs for Conservative Category Free Tracking

In this section, we introduce conservative category free tracking methods, so that tracklets are extended from their heads or tails to partially overcome the missed detection problem, and long gaps between tracklets are shortened to make linear motion assumptions more reliable.

5.4.1 Learning of Discriminative Part-Based Appearance Models

As category free tracking in our framework is a conservative process and we do not heavily rely on it to find whole trajectories, we care more about precision than recall.
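As a rough illustration of the part occlusion reasoning of Section 5.3, the following C++ sketch orders detections by their bottom y-coordinate and marks a part as occluded once more than 30% of its area is covered by parts of closer persons. The Box and occlusionReasoning names, and the simple area test (which can double count overlapping occluders; a pixel occupancy mask would avoid this), are simplifying assumptions for this sketch, not the thesis implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Box { float x, y, w, h; };        // detection or part rectangle; y grows downward in the image

static float intersectionArea(const Box& a, const Box& b) {
    float ix = std::max(0.f, std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x));
    float iy = std::max(0.f, std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y));
    return ix * iy;
}

// partBoxes[i] holds the part rectangles of detection i, mapped to image coordinates.
// Returns occluded[i][u] = true if part u of detection i is labeled occluded.
std::vector<std::vector<bool>> occlusionReasoning(
        const std::vector<Box>& detections,
        const std::vector<std::vector<Box>>& partBoxes) {
    const std::size_t n = detections.size();
    std::vector<std::size_t> order(n);
    for (std::size_t i = 0; i < n; ++i) order[i] = i;
    // Larger bottom y means smaller depth, i.e., closer to the camera; process closer persons first.
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return detections[a].y + detections[a].h > detections[b].y + detections[b].h;
    });
    std::vector<std::vector<bool>> occluded(n);
    std::vector<Box> occupied;            // parts of already processed (closer) persons
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t i = order[k];
        occluded[i].assign(partBoxes[i].size(), false);
        for (std::size_t u = 0; u < partBoxes[i].size(); ++u) {
            const Box& p = partBoxes[i][u];
            float covered = 0.f;
            for (const Box& q : occupied) covered += intersectionArea(p, q);
            // More than 30% of the part already occupied by closer persons -> occluded.
            if (covered > 0.3f * p.w * p.h) occluded[i][u] = true;
        }
        for (const Box& p : partBoxes[i]) occupied.push_back(p);
    }
    return occluded;
}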
Given this conservative stance, we only do CFT for reliable tracklets, which satisfy three constraints: 1) the tracklet is longer than a threshold λ_1, as short tracklets are more likely to be false alarms; 2) the number of non-empty feature sets defined in Eq. 5.2 is larger than a threshold λ_2, because if there are not enough unoccluded parts, the appearance models may not be reliable enough to represent the human; 3) the tracklet does not reach the boundary of the current sliding window. We set λ_1 = 10 and λ_2 = 6 in our experiments.

If T_k qualifies for CFT from its tail, we learn discriminative part-based appearance models (DPAMs) online, so that the learned models distinguish T_k well from other close-by tracklets as well as from the background. Far away tracklets are not worth considering, as it is difficult to confuse T_k with them.

A linear motion model is often used in association based tracking work. As shown in Figure 5.4(a), if the tail of tracklet T_k is located at position p^tail_k, after time Δt the human is predicted to be at position p^tail_k + v^tail_k Δt, where v^tail_k is the velocity of T_k's tail part. Therefore, we may use the linear motion model to estimate which tracklets are possibly close to T_k in the category free tracking process; we call these tracklets distracters.

Figure 5.4: Online learning of appearance models for category free tracking: (a) estimation of potential distracters for T_k; (b) online collected samples for learning DPAMs.

As the linear motion model is probably invalid over a long period, we do category free tracking only in one-second-long intervals. If the CFT does not meet the termination conditions (detailed later), we do the CFT again for the next one second based on re-learned distracters and appearance models. We expect that the CFT results do not locate too far from the linearly estimated positions within one second. The estimated distance between T_k and another tracklet T_l at frame t (t > t^e_k) is defined as

dist^t_{kl} =
  ‖(p^tail_k + v^tail_k (t − t^e_k)) − (p^head_l − v^head_l (t^1_l − t))‖   if t < t^1_l
  ‖(p^tail_k + v^tail_k (t − t^e_k)) − p^t_l‖                               if t ∈ [t^1_l, t^e_l]
  ‖(p^tail_k + v^tail_k (t − t^e_k)) − (p^tail_l + v^tail_l (t − t^e_l))‖   if t > t^e_l    (5.3)

where ‖·‖ denotes the Euclidean distance, t^1_l and t^e_l denote the first and last frames of T_l, and p^t_l denotes the response position of tracklet T_l at frame t. A tracklet T_l is a distracter for T_k if it satisfies

∃ t ∈ [t^1_k, t^e_k] s.t. t^1_l ≤ t ≤ t^e_l   &   ∃ t ∈ (t^e_k, t^e_k + f] s.t. dist^t_{kl} < γ h^tail_k    (5.4)

where f denotes the frame rate per second, γ is a weight factor, set to 2 in our experiments, and h^tail_k denotes the height of T_k's tail response. Eq. 5.4 indicates that a distracter is a tracklet that has temporal overlap with T_k, so that it belongs to a different human than T_k, and may be close to the future path of T_k, such as T_1 and T_2 in Figure 5.4(a).

Responses in all distracters for T_k's tail form a set named Ψ_k; these responses should have low appearance similarities with responses in T_k. In addition, to better distinguish T_k from the background, we also collect a set of responses from the background in frame t^e_k, defined as

B_k = { b_i }   ∀ i ∈ [1, f]  &  b_i is in frame t^e_k  &  p_{b_i} = p^tail_k + v^tail_k · i    (5.5)

The positions of these responses are selected from possible future positions of T_k in the video, but the responses are extracted at the last frame of T_k. Therefore, as long as v^tail_k is not zero, these responses do not belong to T_k and should have low appearance similarities with responses in T_k. Then we build positive and negative training sets S^+ and S^- for learning DPAMs for T_k's tail, defined as

S^+ = { x_i = (r_1, r_2), y_i = +1 }   ∀ r_1, r_2 ∈ R^tail_k    (5.6)
S^- = { x_i = (r_1, r_2), y_i = −1 }   ∀ r_1 ∈ R^tail_k  &  ∀ r_2 ∈ B_k ∪ Ψ_k    (5.7)

where R^tail_k is the set of responses used in building F^tail_k as shown in Algorithm 7. Some visualized examples are shown in Figure 5.4(b).
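The following C++ sketch illustrates the distracter selection of Eqs. 5.3 and 5.4: T_k's tail is extrapolated linearly over the next second, and any temporally overlapping tracklet whose estimated position comes within γ·h^tail_k of that path is kept as a distracter. The Tracklet structure and helper names are assumptions made for this sketch, not the thesis API.

#include <cmath>
#include <cstddef>
#include <vector>

struct Point { float x, y; };

struct Tracklet {
    int tStart, tEnd;                 // first and last frame of the tracklet
    std::vector<Point> pos;           // response position per frame, pos[t - tStart]
    Point headVel, tailVel;           // estimated velocities at head and tail
    float tailHeight;                 // height of the tail response
};

static float dist(Point a, Point b) { return std::hypot(a.x - b.x, a.y - b.y); }

// Position of T at frame t, extrapolated linearly outside [tStart, tEnd] (Eq. 5.3 cases).
static Point estimate(const Tracklet& T, int t) {
    if (t < T.tStart) {
        Point p = T.pos.front();
        return { p.x - T.headVel.x * (T.tStart - t), p.y - T.headVel.y * (T.tStart - t) };
    }
    if (t > T.tEnd) {
        Point p = T.pos.back();
        return { p.x + T.tailVel.x * (t - T.tEnd), p.y + T.tailVel.y * (t - T.tEnd) };
    }
    return T.pos[t - T.tStart];
}

// Distracters of all[k]'s tail: temporally overlapping tracklets predicted to come within
// gamma * tail height of the extrapolated tail path during the next fps frames.
std::vector<std::size_t> findDistracters(const std::vector<Tracklet>& all, std::size_t k,
                                         int fps, float gamma = 2.f) {
    const Tracklet& Tk = all[k];
    std::vector<std::size_t> out;
    for (std::size_t l = 0; l < all.size(); ++l) {
        if (l == k) continue;
        const Tracklet& Tl = all[l];
        // Eq. 5.4, first condition: temporal overlap with Tk, so Tl belongs to a different person.
        if (Tl.tEnd < Tk.tStart || Tl.tStart > Tk.tEnd) continue;
        // Eq. 5.4, second condition: Tl comes close to Tk's predicted path within one second.
        for (int t = Tk.tEnd + 1; t <= Tk.tEnd + fps; ++t) {
            Point pk { Tk.pos.back().x + Tk.tailVel.x * (t - Tk.tEnd),
                       Tk.pos.back().y + Tk.tailVel.y * (t - Tk.tEnd) };
            if (dist(pk, estimate(Tl, t)) < gamma * Tk.tailHeight) { out.push_back(l); break; }
        }
    }
    return out;
}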
For a training sample x_i, a control function c_j(x_i) is defined to measure the validity of feature f_j as

c_j(x_i) = 1 if the part Φ(f_j) is unoccluded in both responses in x_i; 0 otherwise    (5.8)

where Φ(f_j) denotes the part (1 to 15) that a feature f_j belongs to. We adopt the standard RealBoost algorithm to learn discriminative appearance models based on valid features only, as shown in Algorithm 8, where h_j(·) denotes the weak classifier, i.e., an evaluation function for two features f_j, e.g., the Euclidean distance or the correlation coefficient between two vectors, normalized to [−1, 1]. We adopt the features defined in [26], including color, texture, and shape information, and each feature is defined on a specific part, e.g., the color histogram of part 1.

Algorithm 8 The learning algorithm for DPAMs.
Input: training sets S^+ and S^-.
Initialize the sample weights: w_i = 1/(2|S^+|) if x_i ∈ S^+; w_i = 1/(2|S^-|) if x_i ∈ S^-.
For t = 1, ..., T do:
∙ For j = 1, ..., n do:
  – r_j = Σ_i w_i y_i h_j(x_i) c_j(x_i)
  – α_j = (1/2) ln((1 + r_j)/(1 − r_j))
∙ Select j* = argmin_j Σ_i w_i exp(−α_j y_i h_j(x_i) c_j(x_i))
∙ Set f_t = f_{j*}, h_t = h_{j*}, α_t = α_{j*}
∙ Update the sample weights w_i = w_i exp(−α_t y_i h_t(x_i) c_t(x_i)), and normalize the weights.
Output: H(x) = Σ_{t=1}^{T} α_t h_t(x).

5.4.2 Conservative Category Free Tracking

With the online learned DPAMs, we adopt a conservative category free tracking method on each reliable tracklet. The CFT is done frame by frame. In each frame, a set of samples is generated around the response predicted by the linear motion model, as shown in Figure 5.5. Sizes and positions of the samples are randomly disturbed around the values of the predicted response. A feature set for each sample is extracted to evaluate its similarity to F^tail_k. The sample with the highest score, denoted d*, is chosen as a potential extension.

Figure 5.5: Illustrations for the category free tracking process. The green dashed rectangle is predicted by the linear motion model; the orange dashed ones are samples generated around the predicted position.

However, the potential extension is not taken if it meets any of the following termination conditions:
∙ d* goes beyond the frame boundary or the sliding window boundary.
∙ The similarity between d* and F^tail_k is smaller than a threshold θ_1.
∙ The similarity between d* and any F^head_l is larger than a threshold θ_2, where T_l starts at the frame where d* is located.

In our experiments, we set θ_1 = 0.8 and θ_2 = 0.2. These constraints assure a high similarity between d* and F^tail_k and a low similarity between d* and any distracters, so that the category free tracking is less likely to drift or to go beyond a true association. For example, in Figure 5.5, T_k and T_2 are actually the same person; if the CFT goes beyond T_2's head, T_k and T_2 cannot be associated, and additional temporally overlapping tracks would be produced for the same person. But if the CFT stops early, we may still successfully associate them.

If d* is accepted as an extension, we update p^tail_k and v^tail_k for T_k. However, we do not update F^tail_k, to avoid drift. This is different from most existing category free tracking approaches, where appearance models are updated online for continuous tracking. In our framework, category free tracking is only used to partially overcome detection limitations and to shorten gaps between tracklets so that linear motion estimation becomes more precise. If the CFT stops early, it is still probable that missing gaps will be filled by the global association. But errors in CFT may cause failures of associations, as discussed above. The whole CFT method is shown in Algorithm 9.
Algorithm 9 The CFT algorithm for one tracklet from its tail.
Input: a reliable tracklet T_k for CFT, and the frame rate per second f.
Initialize the max frame number for one second of tracking: t_max = t^e_k + f.
Initialize the appearance feature set F^tail_k for T_k's tail using Algorithm 7.
While T_k does not meet the boundary of the current sliding window do:
∙ Find potential distracters in the predicted tracking path of T_k from t^e_k to t_max.
∙ Collect training samples from T_k, its distracters, and the background in the predicted path, and learn DPAMs for the next one second using Algorithm 8.
∙ For t = t^e_k + 1, ..., t_max do:
  – Generate samples around the predicted position p^tail_k + v^tail_k · 1.
  – Extract the feature set for each sample and evaluate its similarity to F^tail_k using the learned DPAMs.
  – Check whether the termination conditions are met. If yes, stop the CFT; otherwise, add the best response d* into T_k's tail, and update p^tail_k and v^tail_k.
∙ Update t^e_k, and set t_max = t^e_k + f.
Output: updated tracklet T_k.

After conservative category free tracking, we learn a global appearance model similar to [26] using only unoccluded parts from all tracklets, and then use the Hungarian algorithm to find globally optimal association results.

Figure 5.6 shows some comparative tracking results obtained by using or disabling the category free tracking module. We can see that person 13 in Figure 5.6(b) would be fragmented into two tracklets without the CFT, due to long-time missed detections and non-linear motions. However, the conservative CFT shortens the gap between the two tracklets so that they become easy to associate. In addition, person 10 in Figure 5.6(b) is tracked in frame 330 without having a detection response at that time; persons 8 and 15 in Figure 5.6(d) are similar cases.

5.5 Experiments

We evaluate our approach on three public data sets: PETS 2009 [3], ETH [16], and Trecvid 2008 [54], which have been commonly used in previous multi-target tracking work. We show quantitative and visualized comparisons with state-of-the-art methods.

The three data sets have different resolutions, densities, and average motion speeds. However, we use the same parameter settings on all of them, and performance improves on all compared with state-of-the-art methods. This indicates the low sensitivity of our approach to parameters. As detection performance would influence tracking results, for fair comparison we use the same detection results as in [26, 27, 54] on the three data sets, which are provided by the authors of [54, 27]. The detection results are obtained by an offline learned detector proposed in [20]. No scene occluders are manually assigned in any of our experiments.

5.5.1 Results on PETS 2009 data set

We use the same PETS 2009 video clip as used in [4] for fair comparison. There are 795 frames in total, and the density is not high. However, people are frequently occluded by each other or by scene occluders, and may change motion directions frequently. The quantitative comparison results are shown in Table 5.2. We modified the ground truth annotations from [4], so that fully occluded people who appear later are still labeled as the same person.

Table 5.2: Comparison of results on the PETS 2009 dataset. The PRIMPT results are provided by the authors of [27]. Our ground truth is more strict than that in [4].
Method | Recall | Precision | FAF | GT | MT | PT | ML | Frag | IDS
Energy Minimization [4] | - | - | - | 23 | 82.6% | 17.4% | 0.0% | 21 | 15
PRIMPT [27] | 89.5% | 99.6% | 0.020 | 19 | 78.9% | 21.1% | 0.0% | 23 | 1
Part Model Only | 92.8% | 95.4% | 0.259 | 19 | 89.5% | 10.5% | 0.0% | 13 | 0
Part Model + CFT | 97.8% | 94.8% | 0.306 | 19 | 94.7% | 5.3% | 0.0% | 2 | 0
We can see that by only using part-based appearance models, MT is improved by more than 10% and fragments are reduced by 43% compared with the up-to-date results in [27]. By using category free tracking, we further improve MT by 5% and reduce fragments by 85%. This indicates that our part-based models are effective for modeling humans' appearances, and that the conservative CFT method shortens gaps between tracklets so that fragmented tracks can be associated based on the linear motion model. Some visualized examples are shown in Figure 5.6(b).

5.5.2 Results on ETH data set

The ETH data set [16] is captured by a pair of cameras on a moving stroller in busy street scenes. The heights of humans may change significantly, from 40 pixels to more than 400 pixels. The cameras shift frequently, making the linear motion model less reliable, and there are frequent full inter-occlusions due to the low camera angle. We use the same ETH sequences as in [27], including two video clips with 999 and 354 frames respectively. Only data captured by the left camera are used in our experiment.

The quantitative results are shown in Table 5.3. We can see that using the part models only, MT is improved by about 8%; using category free tracking improves MT significantly, by an additional 12%. In the ETH data, partial and full occlusions are frequent; therefore, the part-based models help build more precise appearance models. The category free tracking method recovers many missed detections, especially for humans who appear for only short periods, e.g., less than 40 frames, and therefore improves MT significantly.

Table 5.3: Comparison of tracking results on the ETH dataset. The human detection results are the same as used in [27], and are provided by courtesy of the authors of [27].
Method | Recall | Precision | FAF | GT | MT | PT | ML | Frag | IDS
PRIMPT [27] | 76.8% | 86.6% | 0.891 | 124 | 58.4% | 33.6% | 8.0% | 23 | 11
Part Model Only | 77.5% | 90.9% | 0.595 | 124 | 66.1% | 25.0% | 8.9% | 21 | 12
Part Model + CFT | 81.0% | 87.8% | 0.861 | 124 | 78.2% | 12.9% | 8.9% | 19 | 11

Some tracking examples are shown in Figure 5.6(d) and Figure 5.7(a). We can see that person 10 in Figure 5.7(a) is first detected in frame 151 and fails to be found after frame 158, but we track her from frame 140 until frame 162. After frame 162, our conservative CFT stops due to large appearance changes caused by shadows.

5.5.3 Results on Trecvid 2008 data set

Trecvid 2008 is a very difficult data set, which contains 9 video clips with 5000 frames each. The videos are captured in a busy airport; the density is very high, and people occlude each other frequently. Quantitative comparison results are shown in Table 5.4. Compared with [27], using only part-based models reduces fragments and id switches by about 11% and 13% respectively; using category free tracking additionally reduces fragments and id switches by about 2% and 3%.

We can see that in the Trecvid 2008 data set, most improvements come from the part-based models and less from the category free tracking method, which is quite different from the results on PETS 2009 and ETH. This is because Trecvid 2008 is a very crowded data set; in the CFT process, there are often many distracters for each tracklet. Therefore the CFT often stops early, because the probability is quite high that at least one of the many distracters has a similar appearance to the concerned person. Figure 5.7(b)(c) show some tracking examples.
We see that person 28 & 29 in Figure 5.7(b) and person 121 in Figure 5.7(c) are under heavy occlusions due to high densi- ties of humans; however, our approach is able to find correct discriminative part-based appearance models, and therefore tracks these persons successfully. 95 Method Recall Precision FAF GT MT PT ML Frag IDS Offline CRF Tracking [54] 79.2% 85.8% 0.996 919 78.2% 16.9% 4.9% 319 253 OLDAMs [26] 80.4% 86.1% 0.992 919 76.1% 19.3% 4.6% 322 224 PRIMPT [27] 79.2% 86.8% 0.920 919 77.0% 17.7% 5.2% 283 171 Part Model Only 78.7% 88.2% 0.807 919 73.0% 20.9% 6.1% 253 149 Part Model + CFT 79.2% 87.2% 0.895 919 75.5% 18.6% 5.9% 247 145 Table 5.4: Comparison of tracking results on Trecvid 2008 dataset. The human detection results are the same as used in [54, 26, 27], and are provided by authors of [54]. 5.5.4 Computational Speed The processing speed is highly related with the number of humans in videos. We imple- mentourapproachusingC++onaPCwith3.0GHzCPUand8GBmemory. Theaverage speeds are 22fps, 10fps, and 6fps on PETS 2009, ETH, and Trecvid 2008 respectively. Compared with [27], which reported 7fps on Trecvid 2008, our approach does not impose much extra computational cost 2 . The major extra cost is from the feature extractions for all samples in the CFT module, which could be constrained by setting the number of evaluated samples. 5.6 Conclusion We introduced online learned discriminative part-based appearance models for multi- human tracking by incorporating merits of association based tracking and category free tracking. Part-basedmodelsareabletoexcludeoccludedregionsinappearancemodeling, and a conservative category free tracking can partially overcome limitations of detection 2 Detection time is not included in both our reported speed and that in [27] 96 performanceaswellasreducinggapsbetweentrackletsintheassociationprocess. Exper- iments on three public data sets show that our approach improves tracking performance significantly compared with state-of-art methods with little extra computational cost. 97 Frame 306 Frame 318 Frame 330 (a) (b) Frame 60 Frame 70 Frame 80 (c) (d) Figure5.6: Comparisonsoftrackingresultswithorwithoutcategoryfreetracking: (a)(c) show results without CFT, (b)(d) show results on the same sequences with CFT. 98 Frame 140 Frame 151 Frame 162 Frame 174 Frame 545 Frame 560 Frame 575 Frame 590 Frame 3185 Frame 3220 Frame 3270 Frame 3310 (a) (b) (c) Figure 5.7: Tracking results of our approach on ETH and Trecvid data sets. 99 Chapter 6 Future work Though our approaches have made obvious progress on multiple humans tracking, there are still several more work to be done in the future. We will detail the future work in this chapter. Althoughtrackingperformancesforpedestrianshaveachievedgreatprogress,tracking articulatedhumansremainaquitedifficultproblemandtheperformanceisfarfromusable in real applications. There are several limitations of current approaches: 1. Detection precision and recall rate for articulated humans are not as good as those for pedestrians. 2. Posechangesmaycausethecentroidofahumanchangeswithinashortperiod, but such changes are not modeled in any of current frameworks. 3. Appearances of articulated humans may change much more than pedestrians due to different poses. As the general multiple articulated persons tracking is quite difficult, here we assume that an articulated person would at least in a standing pose for a short period, e.g., one 100 second, during the whole video sequence. 
Such an assumption is often valid in surveillance videos, as persons need to enter a scene by walking before they change poses. The assumption enables us to find articulated person tracks from confident pedestrian tracks, and we can solve the problem in multiple steps. Robust pedestrian tracks can be obtained by existing pedestrian tracking approaches and are further used as the initialization of articulated persons. In almost all cases, a person's standing position can only change when the person is in a walking pose, i.e., a pedestrian. We can then consider pose transitions between spatially close pedestrian tracks to check whether they belong to the same person. Then, we can adopt the visual tracking process to extend the tracks by applying the articulated human detector only to their heads or tails, which reduces the possibility that an articulated detector produces many false alarms that further form into false tracks.

Reference List

[1] Caviar dataset. http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.
[2] National Institute of Standards and Technology: Trecvid 2008 evaluation for surveillance event detection. http://www.nist.gov/speech/tests/trecvid/2008/.
[3] PETS 2009 dataset. http://www.cvg.rdg.ac.uk/PETS2009.
[4] Anton Andriyenko and Konrad Schindler. Multi-target tracking by continuous energy minimization. In CVPR, 2011.
[5] Shai Avidan. Ensemble tracking. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[6] B. Babenko, Ming-Hsuan Yang, and S. Belongie. Visual tracking with online multiple instance learning. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2009.
[7] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.
[8] J. Berclaz, F. Fleuret, and P. Fua. Robust people tracking with global trajectory optimization. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2006.
[9] Michael D. Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and Luc Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In Proceedings of International Conference on Computer Vision (ICCV), 2009.
[10] Michael D. Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and Luc Van Gool. Online multi-person tracking-by-detection from a single, uncalibrated camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1820–1833, 2011.
[11] Kevin Cannons, Jacob Gryn, and Richard Wildes. Visual tracking using a pixelwise spatiotemporal oriented energy representation. In Proceedings of European Conference on Computer Vision (ECCV), 2010.
[12] Lukáš Cerman, Jiří Matas, and Václav Hlaváč. Sputnik tracker: Having a companion improves robustness of the tracker. In Proceedings of Scandinavian Conference on Image Analysis (SCIA), 2009.
[13] Robert T. Collins. Mean-shift blob tracking through scale space. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2003.
[14] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Real-time tracking of non-rigid objects using mean shift. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2000.
[15] Thang Ba Dinh, Nam Vo, and Gérard Medioni. Context tracker: Exploring supporters and distracters in unconstrained environments. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[16] Andreas Ess, Bastian Leibe, Konrad Schindler, and Luc van Gool. Robust multiperson tracking from a mobile platform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1831–1846, 2009.
[17] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 2003.
[18] H. Grabner and H. Bischof. On-line boosting and vision. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2006.
[19] Helmut Grabner, Jiri Matas, Luc Van Gool, and Philippe Cattin. Tracking the invisible: Learning where the object might be. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[20] Chang Huang and Ram Nevatia. High performance object detection by collaborative learning of joint ranking of granule features. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[21] Chang Huang, Bo Wu, and Ramakant Nevatia. Robust object tracking by hierarchical association of detection responses. In Proceedings of European Conference on Computer Vision (ECCV), 2008.
[22] Michael Isard and Andrew Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision (IJCV), 29(1):5–28, 1998.
[23] Imran N. Junejo and Hassan Foroosh. Trajectory rectification and path modeling for video surveillance. In Proceedings of International Conference on Computer Vision (ICCV), 2007.
[24] Zdenek Kalal, Jiri Matas, and Krystian Mikolajczyk. P-N learning: Bootstrapping binary classifiers by structural constraints. In CVPR, 2010.
[25] Louis Kratz and Ko Nishino. Tracking with local spatio-temporal motion patterns in extremely crowded scenes. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[26] Cheng-Hao Kuo, Chang Huang, and Ram Nevatia. Multi-target tracking by on-line learned discriminative appearance models. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[27] Cheng-Hao Kuo and Ram Nevatia. How does person identity recognition help multi-person tracking? In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[28] Junseok Kwon and Kyoung Mu Lee. Efficient extraction of human motion volumes by tracking. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[29] Bastian Leibe, Konrad Schindler, and Luc Van Gool. Coupled detection and trajectory estimation for multi-object tracking. In Proceedings of International Conference on Computer Vision (ICCV), 2007.
[30] Yuan Li, Haizhou Ai, Takayoshi Yamashita, Shihong Lao, and Masato Kawade. Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2007.
[31] Yuan Li, Chang Huang, and Ram Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2009.
[32] Baiyang Liu, Junzhou Huang, Lin Yang, and Casimir Kulikowski. Robust tracking using local sparse appearance model and K-selection. In CVPR, 2011.
[33] Wei-Lwun Lu, Jo-Anne Ting, Kevin P. Murphy, and James J. Little. Identifying players in broadcast sports videos using conditional random fields. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[34] Zefeng Ni, Santhoshkumar Sunderrajan, Amir Rahimi, and B. S. Manjunath. Particle filter tracking with online multiple instance learning. In Proceedings of International Conference on Pattern Recognition (ICPR), 2010.
[35] Kenji Okuma, Ali Taleghani, O. De Freitas, James J. Little, and David G. Lowe. A boosted particle filter: Multitarget detection and tracking.
In Proceedings of European Conference on Computer Vision (ECCV), 2004.
[36] Stefano Pellegrini, Andreas Ess, and Luc Van Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In Proceedings of European Conference on Computer Vision (ECCV), 2010.
[37] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc J. Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In Proceedings of International Conference on Computer Vision (ICCV), 2009.
[38] A. G. Amitha Perera, Chukka Srinivas, Anthony Hoogs, Glen Brooksby, and Wensheng Hu. Multi-object tracking through simultaneous long occlusions and split-merge conditions. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2006.
[39] Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
[40] Jens Rittscher, Peter H. Tu, and Nils Krahnstoever. Simultaneous estimation of segmentation and shape. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[41] Mikel Rodriguez, Ivan Laptev, Josef Sivic, and Jean-Yves Audibert. Density-aware person detection and tracking in crowds. In Proceedings of International Conference on Computer Vision (ICCV), 2011.
[42] Paul Scovanner and Marshall F. Tappen. Learning pedestrian dynamics from the real world. In Proceedings of International Conference on Computer Vision (ICCV), 2009.
[43] Horesh Ben Shitrit, Jerome Berclaz, Francois Fleuret, and Pascal Fua. Tracking multiple people under global appearance constraints. In ICCV, 2011.
[44] Kevin Smith, Daniel Gatica-Perez, and Jean-Marc Odobez. Using particles to track varying numbers of interacting people. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[45] Bi Song, Ting-Yueh Jeng, Elliot Staudt, and Amit K. Roy-Chowdhury. A stochastic graph evolution framework for robust multi-target tracking. In Proceedings of European Conference on Computer Vision (ECCV), 2010.
[46] Xuan Song, Xiaowei Shao, Huijing Zhao, Jinshi Cui, Ryosuke Shibasaki, and Hongbin Zha. An online approach: Learning-semantic-scene-by-tracking and tracking-by-learning-semantic-scene. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010.
[47] Severin Stalder, Helmut Grabner, and Luc Van Gool. Cascaded confidence filtering for improved tracking-by-detection. In ECCV, 2010.
[48] Shu Wang, Huchuan Lu, Fan Yang, and Ming-Hsuan Yang. Superpixel tracking. In ICCV, 2011.
[49] Xiaogang Wang, Keng Teck Ma, Gee-Wah Ng, and W. Eric L. Grimson. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2008.
[50] Xiaogang Wang, Kinh Tieu, and Eric Grimson. Learning semantic scene models by trajectory analysis. In Proceedings of European Conference on Computer Vision (ECCV), 2006.
[51] Junliang Xing, Haizhou Ai, and Shihong Lao. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2009.
[52] Junliang Xing, Haizhou Ai, Liwei Liu, and Shihong Lao. Multiple players tracking in sports video: a dual-mode two-way Bayesian inference approach with progressive observation modeling. IEEE Transactions on Image Processing, 20(6):1652–1667, 2011.
[53] Kota Yamaguchi, Alexander C. Berg, Luis E. Ortiz, and Tamara L. Berg.
Who are you with and where are you going? In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[54] Bo Yang, Chang Huang, and Ram Nevatia. Learning affinities and dependencies for multi-target tracking using a CRF model. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
[55] Changjiang Yang, Ramani Duraiswami, and Larry Davis. Efficient mean-shift tracking via a new similarity measure. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2005.
[56] Ming Yang, Fengjun Lv, Wei Xu, and Yihong Gong. Detection driven adaptive multi-cue integration for multiple human tracking. In Proceedings of International Conference on Computer Vision (ICCV), 2009.
[57] Qian Yu, G. Medioni, and I. Cohen. Multiple target tracking using spatio-temporal Markov chain Monte Carlo data association. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2007.
[58] Li Zhang, Yuan Li, and Ram Nevatia. Global data association for multi-object tracking using network flows. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2008.
[59] Tao Zhao and Ram Nevatia. Tracking multiple humans in crowded environment. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2004.
[60] Xuemei Zhao and Gérard Medioni. Robust unsupervised motion pattern inference from video and applications. In Proceedings of International Conference on Computer Vision (ICCV), 2011.
Abstract
Tracking multiple humans in real scenes is an important problem in computer vision due to its importance for many applications, such as surveillance, robotics, and human-computer interactions. Association based tracking often achieves better performance than other approaches in crowded scenes. Based on this framework, I propose offline and online learning algorithms to automatically find potentially useful appearance and motion patterns, and utilize them to deal with difficulties in the association framework and to produce much better tracking results.

In the association based framework, an offline learned detector is first applied in each video frame to produce detection responses, which are further associated into tracklets, i.e., track fragments, in multiple steps. Measurement of affinities between tracklets is the key issue that determines the performance. In the first part of my thesis, I propose an online learning algorithm which automatically finds three important cues from a static scene to improve tracking performance: non-linear motion patterns, potential entry/exit points, and co-moving groups.

Association based tracking methods are often based on the assumption that affinities between tracklet pairs are independent of each other. However, this is not always true in real cases. In order to relax the independence assumption, we introduce an offline learned Conditional Random Field (CRF) model to estimate both affinities between tracklets and dependencies among them. Finding the best associations between tracklets is transformed into an energy minimization problem, and energies of unary and pairwise terms in the CRF model are learned offline from pre-labeled ground truth data by a RankBoost algorithm. Then I further extended the approach into an online version. Positive and negative pairs are online collected according to temporal constraints