Tracking Multiple Articulating Humans from a Single Camera

by Weijun Wang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2015

Copyright 2015 Weijun Wang

Acknowledgment

First of all, I would like to express my profound gratitude to my advisor, Prof. Ram Nevatia, for his immense academic and financial support. It was his vast knowledge, rich experience and keen observations in the fields of computer vision and machine learning that guided me to an exciting research area and through the journey of pursuing a PhD in computer science. Without his mentoring, this dissertation would not have been possible. His positive attitude, sense of humor and thoughtfulness will always encourage me to learn from him and to strive for success.

I would also like to thank Prof. Gerard Medioni, who is on my dissertation committee and with whom I worked as a teaching assistant during my first semester at USC. His great experience, research motivation and consideration for students set a role model for all IRIS members. I would like to thank Prof. C.C. Kuo, Prof. Yan Liu and Prof. Suya You for serving on my dissertation committee. Without their consistent support and valuable feedback on my work, I would not have been able to finish this dissertation.

Over the years at USC, I have been fortunate to work with and learn from many good researchers and friends: Chang Huang, Cheng-Hao Kuo, Dian Gong, Xuemei Zhao, Vivek Kumar Singh, Prithviraj Banerjee, Pramod Sharma, Yinghao Cai, Zhuoliang Kang, Sung-chun Lee, Chen Sun, Cao Song, Kan Chen, Bor-Jeng Chen, Eunyoung Kim, Matthias Hernandez, Thang Dinh, Jongmoo Choi, Bo Yang, Furqan Khan, Sikai Zhu, Younghoo Lee, Arnav Agharwal. I will always remember the joyful time with them, the great help I have received and the valuable things I have learned from them. Fight On!

Last but not least, I would like to thank my parents and family for their unconditional love and enduring support. During these years far away from home, their love and encouragement have always been the strongest force for me to overcome difficulties and move forward.

Table of Contents

Acknowledgment
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Challenges
1.1.1 Detection errors
1.1.2 Occlusions
1.1.3 Non-linear motion
1.1.4 Appearance observation noises
1.2 Related work
1.2.1 Category detection based tracking
1.2.2 Instance specific tracking
1.2.3 Social context for tracking
1.2.4 Articulated human tracking
1.3 Overview of our approach
1.4 Thesis Outline
2 Tracking Single Deformable Object Using Superpixel Constellation Model
2.1 Motivation
2.2 Related Work
2.3 Graphical Representation
2.4 Constellation Appearance Model
2.4.1 Spatial Constraints
2.4.2 Parts and Appearance Feature Extraction
2.4.3 Normalized Likelihood Measure for Key State
2.4.4 Appearance Descriptor for Leaf Parts
2.4.5 Likelihood Measures for Leaf States
2.5 Inference
2.6 Motion Model and Online Update
2.7 Experimental Evaluation
2.7.1 Implementation details
2.7.2 Results and Discussion
2.8 Conclusion
3 Group Context and Appearance Saliency for Multi-target Tracking
3.1 Motivation
3.2 Related Work
3.3 Overview of our Approach
3.4 Group Context
3.4.1 Online inferred moving groups
3.4.2 Refine linking matrix M with Group Context
3.5 Appearance Saliency for tracking
3.5.1 Region based Appearance Affinity Estimator
3.5.2 Online sample collection
3.5.3 Appearance Salience Weights learning
3.5.4 Estimate appearance affinity
3.6 Experimental Results
3.6.1 Results on ETH data set
3.6.2 Results on Trecvid 2008 data set
3.6.3 Results on PETS 2009 data set
3.6.4 Computational Speed
3.7 Conclusion
4 A Binary Quadratic Programming Formulation with Social Factors
4.1 Motivation
4.2 Related Work
4.3 Tracking objective
4.3.1 Trajectory independence assumption
4.3.2 Simplified linear association model
4.4 Binary quadratic programming formulation
4.5 Semidefinite programming solution
4.5.1 Rounding scheme to feasible solutions
4.6 Online inferring trajectory dependency
4.7 Implementation details
4.7.1 Region based appearance affinity estimator
4.8 Experimental Results
4.8.1 Results on the ETH dataset
4.8.2 Results on the Trecvid 2008 dataset
4.8.3 Results on the PETS 2009 dataset
4.8.4 Results on the TownCentre dataset
4.8.5 Computational Speed
4.9 Conclusion
5 A Hybrid Approach of Tracking Multiple Articulating Humans
5.1 Motivation
5.2 Related Work
5.3 Patch-based Instance-Specific Detector
5.3.1 Discriminative color filter
5.3.2 Filter with local binary features
5.3.3 Centroid displacement distribution
5.4 Online Learning of Patch Classifiers for ISD
5.4.1 Online Sample Collection
5.4.2 Learning discriminative color filter
5.4.3 Learning local binary filter and centroid displacement distribution
5.5 Instance Detection based Tracking
5.6 Tracklet Extrapolation and Association
5.7 Experimental results
5.8 Conclusion
6 Future work
6.1 Deep neural networks for appearance affinity learning
6.2 Person identity matching with salient attribute
6.3 Motion model compensation with motion features
Reference List

List of Tables

2.1 Average center errors (in pixels) comparison
2.2 Successful Frame Rate (%) Comparison
3.1 Results comparison on ETH data set. Two best results for each column are shown in bold.
3.2 Results comparison on Trecvid 2008 data set. Two best results for each column are shown in bold.
3.3 Results comparison on PETS 2009 data set. Two best results for each column are shown in bold.
3.4 Average computational speed on three data sets
4.1 Results comparison on the ETH dataset with 124 ground truth (GT) tracks.
4.2 Results comparison on the Trecvid 2008 dataset with 919 GT tracks.
4.3 Results comparison on the PETS 2009 dataset.
4.4 Results comparison on the TownCentre dataset.
4.5 Average computational speed (fps) on three datasets
5.1 Results comparison on PETS 2009 dataset.
5.2 Results comparison on iLIDS medium sequence with 46 ground truth (GT) trajectories.
5.3 Comparison of results on CMU action dataset with 55 ground truth trajectories and ELASTIC dataset with 91 ground truth trajectories. HYBRID0 is implemented with the color prior disabled.

List of Figures

1.1 Examples of tracking results by our approach
1.2 Detection result with false alarms and miss detections
1.3 Examples of occlusions and non-linear motion.
1.4 Examples of appearance observations over time. Each column shows the appearance of the same person in different frames.
1.5 The overview of our proposed approach
2.1 DBN over 3 time slices. At $t_1$ and $t_2$, 3 sub-states are shown for the state and 3 sub-observations are shown for the observation.
2.2 Star model: a key node $X^0$ and 3 leaf part nodes. Each state node is connected with its own observation node; observation node $O^3$ shows multiple discriminative features.
2.3 Superpixels, parts and codebook with SP clusters
2.4 Appearance descriptors for leaf parts
2.5 Tracking result comparison with SPT on dataset yard. Yellow outputs are from SPT, while blue outputs are given by our DBNT.
2.6 Tracking result comparison with SPT on PETS09 dataset. Yellow outputs are from SPT, while blue outputs are given by our DBNT.
2.7 Tracking comparison with SPT on 2 datasets with occlusion and deformation
2.8 Tracking stability comparison with SPT, which shows our tracker is more stable in dealing with partial occlusion
3.1 Examples of our tracking results on ETH and Trecvid 2008 data sets
3.2 Example of tracking results by GCAST. Colored arrows are used to mark interesting points. The GCAST tracker can track targets in the distance even though frequent occlusion happens and response sizes are small. Best viewed in color.
3.3 Progressive tracklets association
3.4 Extract moving groups for tracklets. The end group $G^e_2$ of tracklet 2 contains $\{2, 1, 3, 9\}$, shown in the orange dashed circle; the start group $G^s_6$ of tracklet 6 contains $\{6, 1, 7, 10\}$, shown in the blue dashed circle.
3.5 Refine (a) the original linking matrix $M^0$ with group context to get (b) the adjusted linking matrix $M^1$. We highlight one example of refining $m^0_{26}$ to $m^1_{26}$ (the linking likelihood from tracklet 2 to tracklet 6) by referring to the end group context $G^e_2 = \{2, 1, 3, 9\}$ and the start group context $G^s_6 = \{6, 1, 7, 10\}$. Bridge tracklet 1 (on the diagonal) and element $m^0_{37}$ are picked to refine $m^0_{26} \to m^1_{26}$. Contender elements and support elements of $m^0_{26}$ are marked.
3.6 Examples of sampled regions for learning the appearance salience model
3.7 Tracking results of GCAST on the ETH "BAHNHOF" sequence. Colored arrows are used to mark interesting points, and arrows with the same ID indicate the same target across frames. Best viewed in color.
3.8 Example of tracking results by GCAST on the Trecvid dataset by CAM5. Best viewed in color.
3.9 Example of tracking results by GCAST on the Trecvid dataset by CAM3. Best viewed in color.
3.10 Tracking results of GCAST on PETS 2009. Best viewed in color.
4.1 An example of the cost-flow network with 9 input tracklets. The 2 possible linking edges shown in red are highlighted to demonstrate the linking dependency among moving companions. Intuitively, the red pair of linking edges, showing potential group motion, should be favored by our new optimization objective.
4.2 Tracking results of BQPT_B on the PETS09-S2L2 sequence of the MOT Challenge 2015. Best viewed in color.
5.1 Tracking results on iLIDS (a) and CMU action (b).
5.2 The overview of our HYBRID tracker. We first get pedestrian detection responses in a sliding temporal window by applying an offline trained pedestrian detector [40]. Then, a pedestrian tracker [41] is used to link pedestrian detections into a pool of reliable tracklets (trajectory fragments). To recover articulated humans missed by offline trained pedestrian detectors, instance-specific human detectors (ISD) are learned online from online collected training samples. By applying ISD, initial pedestrian tracklets are extended to track large pose changes. The online learning procedure of ISD iterates with the ISD detection process to adapt to appearance changes (shown in the middle loop). By instance detection based tracking (IDT), the affinity between tracklet pairs is refined for further association to produce the final tracking results. Best viewed in color.
5.3 Illustration of online sample collection for a target tracklet $Trk_1$. Conflict tracks of the red $Trk_1$ contain the orange track and the green one.
5.4 Tracking with instance-specific detector
5.5 Example of tracking results (in ellipses) on the ELASTIC dataset by HYBRID. Tracking results from visual trackers are shown in boxes. Note that some visual trackers drifted. The legend is given on the right border.

Abstract

Multi-target tracking aims at locating multiple targets in frames, maintaining their identities and estimating their motion trajectories over time; it has been an active research topic in computer vision over the past few decades. With a large amount of video data generated at every moment, there is a strong demand for artificial visual intelligence that can automatically analyze videos and understand scenes without human involvement.
Multi-target tracking has tremendous applications in areas such as surveillance, autonomous vehicles and human-computer interaction. Among multi-target tracking problems, monocular multi-target tracking using video data from a single camera view is critical and is fundamental to settings with multiple camera views. In particular, humans are often the targets of greatest interest, as daily activities and events in real scenes usually involve human participants. In this thesis, we target improving monocular multi-target tracking of not only pedestrians but also articulating humans.

Even though fairly significant advances have been made in pedestrian tracking in recent years, the problem of tracking multiple humans with large articulated poses or in crowded scenes is still far from solved. Unlike well-studied pedestrian detection, generic articulated human detection in a wild scene remains a challenging problem. As multi-target tracking typically requires detection input from offline-trained detectors, large miss-detection and false-alarm rates on articulated humans make existing tracking approaches less effective on articulating humans than on pedestrians. In common crowded scenes, where human bodies are partially visible and frequent occlusions exist, appearance and motion cues are weakened; however, social context cues become more informative and can be exploited to help tracking, as humans may move in groups, follow others' paths or avoid obstacles.

In order to track a single target with large pose variations, we first introduce a part-based appearance model with superpixel-based features and propose an instance-specific tracker. For common crowded scenes, we propose to learn discriminative appearance features of each target online and to consider social context to improve tracking performance.

Departing from the linear simplification in classic tracking optimization formulations, which ignore motion dependency among targets, we propose a new optimization framework that can consider various pairwise social dependencies to improve tracking. To ensure tracking efficiency, we introduce an approach that converts the new binary quadratic programming formulation into a semidefinite programming problem under convex relaxation for efficient solution. In addition, we propose to infer simple motion dependency factors online efficiently. In scenarios where no trajectory dependency can be exploited, our solution reduces to, and is as efficient as, classic linear optimization formulations. In general, the new formulation provides a way to consider various kinds of high-order context to improve multi-target tracking.

To address the problem of tracking multiple articulating humans, we propose a hybrid approach. Our method combines offline learned category-level detectors with online learned instance-specific detectors in a hybrid system. To deal with humans in large pose articulation, which cannot be reliably detected by offline trained detectors, we propose an online learned instance-specific patch-based detector consisting of layered patch classifiers. With tracklets extrapolated by the online learned detectors, we use discriminative color filters learned online to compute appearance affinity scores for further global association.

Experimental evaluation on both standard pedestrian datasets and articulated human datasets shows significant improvement compared to state-of-the-art multi-human tracking methods.
Chapter 1

Introduction

As video cameras become ever more pervasive, huge amounts of video data containing valuable information are captured every day. Automatic analysis of such video data and mining of visual content without human involvement is increasingly needed. Among these growing needs, tracking multiple targets is one of the critical problems in computer vision, as it can serve as the basis for higher-level activity reasoning and event recognition. It has huge potential applications in surveillance, autonomous vehicles, human-machine interaction, etc. Over the past decades, it has attracted a great deal of attention from researchers.

The goal of tracking multiple targets in real scenes is to infer the trajectories of targets in a video while keeping their identities consistent throughout the sequence. In particular, humans are often the targets of greatest interest, as daily activities and events in real scenes usually involve human participants. Even though fairly significant advances have been made in pedestrian tracking in recent years, the problem of tracking multiple humans toward higher-level reasoning is still far from solved. For example, humans might move in groups in real scenes, yet important group context features are not effectively explored under the usual simplifying assumption that targets' trajectories are independent. Most importantly, unlike well-studied pedestrian detection, articulated human detection remains a challenging task, which makes existing pedestrian tracking approaches less effective on videos with multiple articulated humans. In this work, we focus on exploring important online learned appearance and motion context cues to improve tracking performance on pedestrians as well as articulated humans. Figure 1.1 shows examples of tracking results by our approach on four different real scenes.

Figure 1.1: Examples of tracking results by our approach

1.1 Challenges

Multiple target tracking could be greatly simplified if all targets were separate from each other and could be easily extracted from the background scene. However, in real scenarios, tracking multiple humans is quite difficult due to detection errors and ambiguous appearance observations caused by occlusions, pose changes, illumination changes, etc. We briefly describe the major challenges in this section.

Figure 1.2: Detection result with false alarms and miss detections

1.1.1 Detection errors

With static cameras, background subtraction approaches can extract moving foreground targets. However, in real applications, moving cameras, dynamic backgrounds or static targets cause background subtraction approaches to miss many targets.

With the great improvements in object detection techniques in recent years, category detectors learned offline from large amounts of training data are capable of detecting objects of a known category, especially pedestrians, even against cluttered and dynamic backgrounds. Despite these improvements, detection errors, including miss detections, false alarms and inaccurate locations, still occur often. In particular, unlike pedestrian detection, training reliable detectors offline for articulated humans remains a challenging problem due to much larger intra-class variations, e.g. various non-upright poses.
Without reliable generic human detectors, many human targets in non-upright poses are highly likely to be missed. The need to recover miss detections, remove false alarms, and refine inaccurate detection observations makes the tracking problem difficult. Figure 1.2 shows a detection result from a state-of-the-art approach [40].

1.1.2 Occlusions

From the view of a single camera, some parts or the entire body of a human target might not be visible, especially in a crowded scene, due to scene occluders or occlusions by other targets. For partial occlusions, detectors may still find the targets, although the bounding boxes can be inaccurate and include observation noise from other objects. This observation noise challenges the appearance descriptors used for tracking, as only parts of those observations belong to the real targets. For full occlusions, there is a temporal gap during which no appearance observation is available. In particular, for long occlusions, filling the gaps while maintaining target identities is still a challenging problem. Figure 1.3(a) shows some occlusion examples.

Besides, articulated humans are not always in upright poses like pedestrians, where body parts are coarsely aligned. When humans are in non-upright poses, like bending or sitting, their body parts become misaligned and can even be invisible due to self-occlusion, which poses great challenges for appearance descriptors aimed at identity matching between different poses.

Figure 1.3: Examples of occlusions and non-linear motion. (a) Occlusions: scene occluder and inter-target occlusion; (b) non-linear motion.

1.1.3 Non-linear motion

Humans often follow certain motion patterns, which is an important cue for tracking them. Motion patterns inferred from existing trajectories can help estimate the positions of targets backwards or forwards in time, especially when targets move in linear patterns. However, as shown in Figure 1.3(b), humans can follow non-linear motion patterns in real scenes. Therefore, how to use motion cues for position estimation in real scenes remains a difficult problem.

1.1.4 Appearance observation noises

Appearance observation is an important cue for multiple human tracking; discriminative descriptors extracted from it can be used for identity matching. Effective appearance descriptors should be able to distinguish different targets while keeping the identity of the same target consistent. However, in real scenarios, the observed appearance of the same person may change over time due to illumination variation, pose changes, occlusions as mentioned in Section 1.1.2, etc. On the other hand, different targets may show similar appearances due to limited observation from a single camera. Some examples of appearance observation noise are shown in Figure 1.4. These observation noises make matching target identities using appearance cues a difficult problem.

Figure 1.4: Examples of appearance observations over time. Each column shows the appearance of the same person in different frames.

1.2 Related work

1.2.1 Category detection based tracking

In recent years, category detection based tracking (CDT) has achieved high performance for multi-target tracking, especially for pedestrian tracking. With the great improvement of object detection techniques, offline trained object detectors are able to find objects of a known category with high precision and recall. Based on the detection results for a particular category, CDT aims at
linking detection responses for all targets, estimating their trajectories and keeping their identities consistent over time. As detection results are provided as input, CDT methods usually focus on the identity matching and association problem for multiple targets. Given the reduced solution space, from raw video observations to noisy detection responses, various kinds of optimization methods have been adopted as CDT frameworks. The key issue of CDT is how to obtain correct associations robustly from noisy detection results.

Among early CDT methods, Multi-Hypothesis Tracking (MHT) [71], Joint Probabilistic Data Association Filters (JPDAFs) [30], and particle filter based trackers [62, 56] attempted to solve this problem by maintaining multiple hypotheses until enough evidence could be collected to resolve the ambiguity. To get better association through global analysis, detection and tracking are coupled in [54] as a Quadratic Boolean Program solved by an EM-style algorithm. However, these approaches are not suitable for long-term association due to the combinatorial growth of the hypothesis search space. Sampling methods such as MCMC [104, 84] have also been employed to find approximate solutions in the high-dimensional hypothesis space.

To reduce the search space, a latency is created to cache detection responses in a temporal sliding window consisting of both past and future frames. Responses are then gradually associated into longer and longer tracklets (track fragments). Over the cached window, global data association methods, e.g. min-cost flow [106, 68], dynamic programming [32] or the Hungarian algorithm [85, 67, 41], can infer optimal solutions to link responses into trajectories. Based on the global data association framework, different attempts have been made to improve tracking performance, such as estimation of merges and splits [67], occlusion reasoning [106, 93], and offline learning of multiple cues for better association [57]. To get better affinity estimation between tracklets, dynamic feature selection [9, 17] and online learning methods [36, 48] have been proposed.

1.2.2 Instance specific tracking

Another type of tracking problem, called instance-specific tracking (IST), has also attracted a lot of research attention. IST, which is also called visual tracking, focuses on tracking pre-specified targets, usually one specific target. IST requires manual selection at the beginning, but no offline trained object detector of a certain category is needed. Therefore, IST can target any specific instance without category information. Early IST approaches often adopt mean-shift [24, 23, 20] or particle filtering [42, 62, 56] as the framework to estimate online the most probable state of the target in an incoming frame; on top of this, multiple cues have been considered to boost IST performance [56, 100, 39, 17]. To continuously track the target, its appearance model and motion model have to be updated online to adapt to all kinds of variations, such as illumination change, occlusion, scale variation and shape deformation. During the online update process of IST, accumulated tracking errors can cause the tracker to drift. Once an IST tracker drifts, a tracking failure is likely to occur, since it is difficult for the tracker to recover from its drifted position. In recent years, several state-of-the-art IST trackers have been proposed to improve tracking precision and robustness.
An incremental visual tracker (IVT) [74] with an online subspace algorithm has been proposed to model target appearance adaptively. To handle partial occlusion, a generative fragments-based representation is used in FragT [5]. In visual tracking decomposition (VTD) [51], a conventional particle filter framework is extended with multiple motion and observation models to account for appearance variations. Several discriminative approaches have also been proposed, treating IST as a binary classification between a target region and the background [10]. In [11], multiple instance learning (MIL) is applied to the IST problem to handle noisy data collected online for updates and to reduce drift through better updates of the classifier. In TLD [44], PN-learning is proposed to select positive and negative samples for updates according to underlying constraints. Context information, such as using co-moving targets as supporters, has also been considered in IST [39, 27] to improve tracking robustness.

1.2.3 Social context for tracking

Besides appearance and motion, social context also provides important hints for tracking. In extremely crowded scenes, objects are highly likely to follow similar motion patterns, which can be learned as priors for trajectory estimation [47, 73, 108]. In such scenarios, motion patterns can be learned from optical flow features, motion blobs or partially tracked results. Using motion patterns as motion prediction priors may work well for high-density crowds, but may be less reliable for mid-density scenes where objects may move freely alone or in groups. Different from the obvious moving trends in extremely crowded scenes, in normal semi-crowded scenes people may move in different directions, among which group motions are hidden.

To track multiple targets in a normal crowd, motion independence among targets is usually assumed to simplify the underlying optimization problem [22]. However, motion independence among targets does not always hold, because people may move in groups, where each group member's individual motion is highly related to the group motion. Therefore, the hidden group motion context may contain useful cues for tracking. Existing methods exploring trajectory dependence [95, 65, 66] usually use group information to refine individual object motion models, which is not suitable for an efficient global association framework for tracking.

To compensate for the approximate linearity assumption, motion patterns have been used to explain non-linear trajectories [97]. In sports tracking [58], team context features learned offline are considered in motion models. Given manually annotated training data for offline parameter learning, group states are considered in a motion model to predict where a pedestrian will go in the next frame [95, 66]. In an association framework for semi-crowded scenes, object trajectories and groups are jointly modeled [65]. However, these models [95, 65, 66] are more suitable for a specific scene, since the motion model parameters are learned offline, which limits the generalization of the model to other real scenes. Pairwise context is used in a maximum weight independent set optimization, assuming that all pairs of objects in the scene have correlated motions [90].

1.2.4 Articulated human tracking

The greatest challenge in multiple articulated human tracking is the lack of a reliable generic human detector for various poses.
Unlike pedestrian detection, training reliable detectors offline for articulated humans remains a hard problem due to much larger intra-class variations, e.g. various non-upright poses. Therefore, tracking multiple articulated humans in real scenes from a single camera is greatly different from pedestrian tracking and has rarely been studied.

Part-based models have achieved great success in object detection (including humans in different poses) by finding parts and their correlations [29, 64, 102]; however, they may only work well on certain predefined poses and are not directly applicable to tracking multiple articulated humans. Assuming detection is given, human pose estimation and tracking techniques aim to locate human body parts precisely [70, 53, 101, 77, 34]. Pose tracking may work well for single or well-separated humans in controlled scenes; however, it cannot deal with multiple humans in real scenes. Different from precisely locating body parts in pose tracking, multiple human tracking focuses on finding human locations with correct identities. Based on body motion during pose changes, some approaches [60, 72] are able to locate and track body boundaries; however, they are only suitable for separate persons in controlled environments and only work when targets are moving.

A cascade SVM detector is proposed in [107] to classify humans into different pose clusters in a semi-supervised manner and train a detector for each cluster; a particle filtering process is then used to track each human target through pose changes. Focusing on pedestrians, predefined human parts are used to model the appearance of humans, and a classic visual tracker is incorporated conservatively into the association framework [99] to deal with certain human articulations. However, neither predefined parts nor classic visual tracking methods (e.g. the mean-shift tracker) work well for humans in non-upright poses.

Targeting deformable objects in instance-specific tracking (IST), which requires manual initialization, there has been recent work using local patches [35, 19, 50]. However, IST trackers do not utilize scene information from a global association viewpoint; when a target gets close to similar ones (e.g. in a human crowd) or moves against a cluttered background, degraded pixel precision causes such trackers to drift easily.

Figure 1.5: The overview of our proposed approach

1.3 Overview of our approach

In this work, we aim at tracking multiple articulated humans in real scenes from a single camera. The framework of our approach is a hybrid system, shown in Figure 1.5, that incorporates both offline-learned category-level detectors, e.g. pedestrian detectors or detectors for any other predefined pose, and online-learned instance-specific detectors (ISD).

The overall method is a tracklet (track fragment) association based tracking, which produces the final articulated human tracks by gradually linking or growing shorter tracklets into longer ones in an association hierarchy. Tracklets can be generated by associating responses from an offline-learned pedestrian detector, by extrapolating pedestrian tracklets with online learned instance-specific detectors, or by conservatively linking non-pedestrian human detections of predefined poses. We formulate tracking as a global data association problem by introducing a certain latency to cache detection responses in a temporal sliding window and performing tracklet association hierarchically over the sliding window, as shown in the top-left part of Figure 1.5.

As a human detector for a predefined non-pedestrian pose is not as reliable as pedestrian detectors, we treat category detection responses from pedestrian detectors differently from the rest. In particular, we view pedestrian tracking, which requires only pedestrian detections, as the foundation of the overall system for articulated humans.

We decompose the tracking framework into four major parts: 1) tracking pedestrians; 2) instance-specific tracking to deal with large pose changes from non-upright poses; 3) extrapolating pedestrian tracklets by instance detection based tracking (IDT); 4) learning association rules offline to link the same target across different poses.

As pedestrian tracking is the foundation of the proposed approach, we also propose to improve its performance by considering social context and appearance saliency, which are not well studied, especially for semi-crowded scenes. The motivation is that in common scenes people's motion can be affected by various social factors, such as moving in groups, following paths or repulsion from obstacles. Exploring such social dependency as context can be beneficial, especially when appearance cues are weakened by scene noise, e.g. occlusions. To use such group motion context efficiently in the association hierarchy, motion groups are inferred online from tracklets' motion at a lower level and are naturally incorporated as context to refine a tracklet's linking likelihood with other tracklets in the higher-level association. On the other hand, to distinguish a target from its close companions moving in a group, we explore dynamic salient appearance features, which are learned online for each specific tracklet.

To address the issue of lacking reliable detections for non-pedestrian poses, we resort to instance detection based tracking (IDT) with an online learned instance-specific detector (ISD). As tracking is an online scenario, each ISD, with its instance-specific features, is learned from a limited number of online samples from existing tracklets. To deal with occlusions and appearance deformation, the proposed ISD models each target as a collection of small patches. By recovering those salient patches collectively, the ISD extrapolates the target's trajectory. The IDT process alternates between detection and online learning. For online learning, knowledge about the existing tracklet pool is utilized to select useful positive and negative samples. The online learned ISD consists of layered patch classifiers.
The first layer is a color saliency filter, which aims at extracting patches that are salient in color; the second layer uses a random ferns classifier with a larger set of local binary features, which is efficient for online learning. Responses extrapolated online by the ISD can adapt to instance-specific appearance changes and shorten the frame gaps to possible tracklet matches, which are hard cases for offline trained detectors. Besides, extrapolated tracklets can also recover some non-linear motion fragments of human tracks, which cause problems for the linear motion assumptions in tracklet association methods.

Although offline trained human detectors for non-upright poses may produce many detection errors, e.g. miss detections and false alarms (noise), they can still be used as an optional module to improve the overall tracking. Assuming that we predefine a few common poses in real scenes (e.g. sitting, bending), a detector for each predefined pose can be learned offline using state-of-the-art methods (e.g. the deformable part based model). To suppress false alarms, simple and conservative frame-to-frame linking can be employed with strict rules to generate candidate human tracklets in known poses, which can be highly fragmented. So far, our approach can generate extrapolated tracklets from pedestrian tracklets and fragmented human tracklets in predefined poses, as shown in the center yellow block of Figure 1.5.

To associate tracklets between different poses, how to measure the matching affinity is a key issue. As human body parts are likely to be misaligned between different poses, the appearance affinity estimation used for pedestrian tracking cannot work here. However, with labeled ground truth tracks for training, part correspondences, pose transition constraints and motion features between pose transitions can be learned offline to help the association.

1.4 Thesis Outline

The thesis is outlined as follows: in Chapter 2, we propose an instance-specific tracking method with a part-based appearance model to track a single unknown object with large pose variations. In Chapter 3, we propose a pedestrian tracking method that considers motion context and appearance saliency to improve tracking performance. In Chapter 4, we propose a binary quadratic programming formulation that considers possible social factors to improve tracking; we also propose a way to reformulate the combinatorial problem as a semidefinite programming problem and provide an efficient solution to it. In Chapter 5, we describe our hybrid method for articulated human tracking, which incorporates both offline-learned category-level detectors and online-learned instance-specific detectors. Finally, in Chapter 6, we summarize our work and propose possible directions for future research in this field.

Chapter 2

Tracking Single Deformable Object Using Superpixel Constellation Model

In this chapter, we target the problem of tracking a single unknown target with large pose variations. Tracking unknown objects in a video without object category information, which we call category free tracking (CFT) in this chapter, is an important but challenging problem; solving it can help solve the bigger problem of tracking multiple articulating targets.

2.1 Motivation

While significant advances have been achieved, category free tracking methods are less effective in dealing with occlusion and object deformation (e.g. human pose changes).
Detection based tracking has been well studied in the literature [41], but it relies on good detection results, which are usually produced by well-trained offline detectors for a specific object category (e.g. pedestrians) using a large amount of training samples. For online tracking, category-specific appearance and motion cues may not be at our disposal. In CFT, we focus on individual features rather than category similarity to track a target.

To deal with challenging appearance variations caused by changes in scale and illumination, rotation, shape deformation and occlusion, a flexible yet discriminative appearance representation is critical. In this chapter, we introduce the part-based appearance model studied in object detection and recognition [91, 28] into the setting of online tracking and model a target as a constellation of deformable parts. In contrast to earlier part-based models in detection and tracking methods [91], the parts in our tracker have no specific semantics, e.g. they do not correspond to the head, torso and legs of a human object. Parts in our tracker are automatically extracted from the target, in contrast to the manually selected parts in [109], and are trained to describe the local appearance of unknown objects and updated to adapt to appearance changes during the tracking process.

During partial occlusion, tracking a target's local features is a reasonable choice [79]; however, local point extraction and feature point trackers [87] are not effective because of motion blur and a lack of discriminative power. Superpixels, which are over-segmented regions of pixels and have been successfully employed in image segmentation, provide valuable color and structure cues of a region. In our work, we treat superpixels as region elements and extract appearance features for parts based on superpixels.

The main contributions of this chapter are threefold. First, we propose to track unknown objects with multiple related parts and model tracking as a Dynamic Bayesian Network (DBN); the tracker is therefore named DBNT. Second, we propose a part constellation appearance model and novel appearance features based on superpixels, which are efficient and effective for object representation. Third, we design an efficient particle-based inference algorithm over the DBN for tracking.

2.2 Related Work

The mean-shift algorithm has been naturally applied to object tracking [25]. However, lacking an adaptive appearance model, the mean-shift tracker is not suited to handling large appearance variations. To track objects whose appearance changes due to various influencing factors, a flexible and adaptive appearance model is important. An incremental visual tracker (IVT) [74] with an online subspace algorithm has been proposed to model target appearance adaptively. Although it has been shown to handle certain illumination changes and articulated deformations, it is less effective in dealing with occlusion or distortion due to its holistic appearance model. To handle partial occlusion, a generative fragments-based representation is used in FragT [5]; the tracking task is performed by matching local patches using a histogram template. But how to update the template to handle scale variation and shape deformation remains an issue.
In visual tracking decomposition (VTD) [51], a conventional particle filter framework is extended with multiple motion and observation models to account for appearance variations, but background noise within the bounding rectangle around the target makes the appearance update unreliable.

Several discriminative approaches have also been proposed, treating object tracking as a binary classification between a target region and the background [10]. In [11], multiple instance learning (MIL) is extended to the online tracking setting. By handling roughly labeled data collected online, it is able to reduce visual drift through better updates of the classifier. In TLD [44], PN-learning is proposed to select positive and negative samples for updates according to underlying constraints. Whereas it is shown to track rigid regions within a target well, the use of a bounding box limits its power to handle deforming objects. In general, formulating object tracking as a classification problem is challenging due to the lack of sufficient good training samples of a changing target in a complex scene.

An alternative idea is to seek visual cues from target segmentation and obtain an effective object representation to assist the tracking task. Mid-level visual cues from over-segmentation have shown some flexibility compared with low-level features and high-level appearance models. Specifically, an effective part representation can be obtained from superpixels with boundary evidence. In [72], the tracking task is regarded as a figure/ground segmentation problem; however, as it seeks a complete segmentation of each frame individually, the computational complexity is high.

There is a large literature on video segmentation which explores local motion and appearance cues to segment video sequences into objects and background [37, 88]. Recently, long-range motion cues (e.g. dense point trajectories [86]) have been employed to segment and track deforming agents in a detection-free fashion [55, 31], but how to get dense and reliable point trajectories from motion analysis remains challenging. In contrast, tracking superpixels (regions) instead of tracking feature points seems more feasible. A recent object tracking method, SPT [89], which uses superpixels to build a discriminative appearance model, shows good results in dealing with appearance variations. Nonetheless, with a single bounding box observer, it does not utilize important boundary evidence from superpixels, which contains boundary shape features of the target.

In the following, we introduce our graphical representation for tracking in Section 2.3. A superpixel-based appearance model is proposed to automatically extract features for the key part and supporting parts in Section 2.4. The inference algorithm is described in Section 2.5, followed by the motion model and online update scheme in Section 2.6. Experimental comparison and discussion are given in Section 2.7.

2.3 Graphical Representation

Target objects may be only partially observable when occluded by other objects or by the background, or when undergoing certain articulations. In order to deal with these situations, we represent each target object as a group of spatially related parts (the whole target can also be viewed as the key part). The spatial relations among parts are formulated probabilistically, giving flexibility to model articulated appearance change. These parts are extracted automatically, given an initialized target, as sketched below.
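The concrete extraction scheme used in this chapter divides the initialized region into K = 4 equal sub-regions (Section 2.4.2). As a minimal, illustrative sketch of that initialization, assuming boxes are given as (x, y, w, h) tuples and using (center, size, orientation) fields that mirror the part state defined formally in the next paragraph, the constellation could be set up as follows; the function name and dictionary layout are ours, not the thesis implementation.

```python
def init_constellation(box, grid=(2, 2)):
    """Split an initialized target box into a key part (the whole box) and
    leaf parts.  Boxes are (x, y, w, h) with (x, y) the top-left corner.
    The 2x2 equal division follows the scheme described in Section 2.4.2;
    other sub-region layouts could be substituted."""
    x, y, w, h = box
    key_part = {"center": (x + w / 2.0, y + h / 2.0), "size": (w, h), "angle": 0.0}
    leaf_parts = []
    pw, ph = w / grid[0], h / grid[1]
    for gy in range(grid[1]):
        for gx in range(grid[0]):
            leaf_parts.append({
                "center": (x + (gx + 0.5) * pw, y + (gy + 0.5) * ph),
                "size": (pw, ph),
                "angle": 0.0,
            })
    return key_part, leaf_parts

# Example: an 80x200 target initialized at (100, 50) yields 4 leaf parts.
key, leaves = init_constellation((100, 50, 80, 200))
print(len(leaves), leaves[0]["center"])   # -> 4 (120.0, 100.0)
```

During tracking, only the relative offsets of the leaf parts with respect to the key part matter; they parameterize the spatial constraints introduced in Section 2.4.1.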
The tracking system is constructed as a Dynamic Bayesian Network (DBN), shown in Figure 2.1, where each state $S_t$ contains multiple sub-states, $S_t = (X^0_t, \ldots, X^K_t)$, and $K$ denotes the total number of parts. The sub-state $X^k_t$ is the state of $Part_k$ at time $t$, and each $Part_k$ has its own observations $O^k_{1:t}$ up to time $t$. The part state is defined as $X^k_t = (C^k_t, Z^k_t, R^k_t)$, where $C^k_t$ represents the center coordinates of the part, $Z^k_t$ denotes its bounding size and $R^k_t$ is its orientation at time $t$. Specifically, we model the whole target as a key part surrounded by other leaf parts. At each time $t$, there is direct influence only between the key part and a leaf part. When the whole target is not fully observable, some supporting leaf part might still be well observed and help improve the estimate of the key state based on multiple observations. Since this graphical representation models flexible constraints among parts, articulated states can be proposed, from which good ones with confident evidence are selected and propagated.

The aim of the tracking system is to estimate $p(S_t \mid O_{1:t})$, the probability distribution of a target state given all observations ($O_{1:t} = O^{0:K}_{1:t}$) up to time $t$. Specifically, we want the MAP estimate $\hat{X}^0_t$ of the key $Part_0$ state:

$$\hat{X}^0_t = \arg\max_{X^0_t} p(X^0_t \mid O^{0:K}_{1:t}) \quad (2.1)$$

Figure 2.1: DBN over 3 time slices. At $t_1$ and $t_2$, 3 sub-states are shown for the state and 3 sub-observations are shown for the observation.

Since the MAP key state $\hat{X}^0_t$ represents the most likely state estimate of the whole target, it is used as the system output at each time $t$.

As a state-observation model, a two-step recursion is employed. We model the motion dynamics only for the key state, $p(X^0_t \mid X^0_{t-1})$, which represents the state transition of the whole target, and keep a separate observation model $p(O^k_t \mid X^k_t)$ for each part. The motion prediction step is

$$p(X^0_t \mid O^{0:K}_{1:t-1}) = \sum_{X^0_{t-1}} p(X^0_t \mid X^0_{t-1})\, p(X^0_{t-1} \mid O^{0:K}_{1:t-1}) \quad (2.2)$$

We can view equation (2.2) as a dynamic prior for $X^0_t$, denoted $\tilde{p}(X^0_t)$:

$$\tilde{p}(X^0_t) := p(X^0_t \mid O^{0:K}_{1:t-1}) \propto p(O^{0:K}_{1:t-1} \mid X^0_t)\, p(X^0_t) \quad (2.3)$$

For the update with observations $O^{0:K}_t$, we have:

$$\begin{aligned}
p(X^0_t \mid O^{0:K}_{1:t}) &= \alpha\, p(X^0_t)\, p(O^{0:K}_{1:t-1}, O^0_t, O^{1:K}_t \mid X^0_t) \\
&= \alpha\, p(X^0_t)\, p(O^{0:K}_{1:t-1} \mid X^0_t)\, p(O^0_t \mid X^0_t)\, p(O^{1:K}_t \mid X^0_t) \\
&= \beta\, \tilde{p}(X^0_t)\, p(O^0_t \mid X^0_t) \sum_{X^{1:K}_t} p(O^{1:K}_t \mid X^{1:K}_t, X^0_t)\, p(X^{1:K}_t \mid X^0_t) \\
&= \beta\, \tilde{p}(X^0_t)\, p(O^0_t \mid X^0_t) \sum_{X^{1:K}_t} \prod_{k=1}^{K} p(O^k_t \mid X^k_t)\, p(X^k_t \mid X^0_t)
\end{aligned} \quad (2.4)$$

where $\alpha$ and $\beta$ are normalization terms. Exact MAP inference for equation (2.4) is infeasible due to the large state space. However, with strong observation likelihoods from a discriminative appearance model (described in Section 2.4), sampling based approximate inference can be effective. Therefore, to make inference over this temporal model, we perform sequential importance sampling in the temporal (HMM) setting and then forward sampling in the Bayesian network at time $t$; details are given in Section 2.5.

2.4 Constellation Appearance Model

To model the object appearance, we propose a generative model with a constellation of parts, in which each part has its own local discriminative appearance model. Spatial constraints between parts are captured as a probabilistic graphical model with directed edges.
In order to apply efficient probabilistic inference algorithms, we make the simplifying assumption that the constellation graph $G_c$ is a Gaussian tree model rooted at the key part $X^0$. Specifically, we model $G_c$ as a star model (shown in Figure 2.2), allowing directed edges only between the root $X^0$ and the leaf parts, which further simplifies our exposition.

Figure 2.2: Star model: a key node $X^0$ and 3 leaf part nodes. Each state node is connected with its own observation node; observation node $O^3$ shows multiple discriminative features.

2.4.1 Spatial Constraints

We model the spatial constraints between parts as a configuration distribution over all parts, $p(X^{0:K}_t)$. Since the constellation graph $G_c$ is a star graph, the configuration distribution factors into a product of local terms:

$$p(X^{0:K}_t) = p(X^0_t) \prod_{k=1}^{K} p(X^k_t \mid X^0_t) \quad (2.5)$$

$p(X^0_t)$ is a prior over the key state. In the setting of tracking, a dynamic prior (2.3), updated with motion prediction from previous tracks, is employed. The second term specifies the relative spatial constraints in $G_c$:

$$\begin{aligned}
p(X^k_t \mid X^0_t) &= p(C^k_t \mid C^0_t)\, p(Z^k_t \mid Z^0_t)\, p(R^k_t \mid R^0_t) \\
&= N(C^k_t - C^0_t;\, \mu^{ck}_t, \Sigma^{ck}_t)\, N(Z^k_t / Z^0_t;\, \mu^{zk}_t, \Sigma^{zk}_t)\, N(R^k_t - R^0_t;\, \mu^{rk}_t, \sigma^{rk}_t)
\end{aligned} \quad (2.6)$$

$$\text{where}\quad \Sigma^{ck}_t = \begin{bmatrix} \sigma_{ck,x} & 0 \\ 0 & \sigma_{ck,y} \end{bmatrix}, \qquad \Sigma^{zk}_t = \begin{bmatrix} \sigma_{zk,x} & 0 \\ 0 & \sigma_{zk,y} \end{bmatrix}$$

Here $\mu^{ck}_t, \mu^{zk}_t, \mu^{rk}_t$ give the ideal spatial configuration of the part state $X^k_t$ with respect to the key state $X^0_t$, and $\Sigma^{ck}_t, \Sigma^{zk}_t, \sigma^{rk}_t$ represent the spatial deformation variance. Given training samples from initialization or online collection, we learn the Gaussian parameters with a Maximum Likelihood Estimator [28].

2.4.2 Parts and Appearance Feature Extraction

The key part represents the whole target, whose state is initialized in the first frame and tracked onward. Given the image region covered by the key state $X^0$ (the whole target), leaf parts are automatically extracted as sub-regions of it. Therefore, each leaf part represents the local appearance of some block of the whole object image. How to extract regular sub-regions can vary with the application scenario. A simple scheme of equally dividing the whole region into $K = 4$ parts is employed in our implementation, as shown in Figure 2.3(a). The state parameters of a leaf part $X^k = (C^k, Z^k, R^k)$ can be estimated easily from the key state $X^0$ during the extraction.

Superpixels (SP) from the over-segmentation of an image provide useful mid-level visual cues with good boundary fidelity. We extract appearance features based on superpixels for both the key part and the leaf parts. Instead of training a holistic discriminative appearance model of the whole object (key part) against the background, we estimate the likelihood of a key state by considering the likelihoods of all the SPs it covers. For each leaf part, SP-based color patterns and edge features are extracted and updated during the tracking process; details are given in Section 2.4.4. For the initialization frame, we segment the surrounding region of the initialized target into superpixels, which are used as initial training samples, and extract appearance features (e.g. a color histogram) for each SP. In order to map each SP's features to its likelihood of belonging to the object, we first build a codebook consisting of SP clusters.
Unsupervised clustering methods (mean shift clustering or hierarchical K-means) can be employed to group training SPs intoM clusters, each of which has 23 (a) Target with superpixels and 4 leaf parts h(2)=0.4 h(2)=0.4 h(1)=0.8 h(1)=0.8 h(3)= -0.2 h(3)= -0.2 h(4)= -0.9 h(4)= -0.9 (b) Codebook with SP clusters Figure 2.3: Superpixels, parts and codebook with SP clusters its center and radius. Intuitively, the larger target-background ratio a cluster covers , the higher likelihood its member SPs have (as eq. (2.7)). Without ambiguity, we omit the superscriptt in the following 3 sections. We define the likelihoodh(m) of a SP clusterm being in target as: h(m)/ A t (m) A t (m) +A bg (m) (2.7) where A t (m) denotes the target area cluster m covers, A bg (m) denotes the background area clusterm covers. Based on the cluster likelihood (2.7), we define the feature likelihood of SPn being in target as: g(n)/h(m) exp( jf(n)f c (m)j 2 r(m) ) (2.8) wherem indicates the index of the cluster, to which SPn is assigned by nearest-neighbor classi- fication with featuref(n),f c (m) andr(m) denote clusterm’s center feature and radius respec- tively. 24 2.4.3 Normalized Likelihood Measure for Key State To measure the likelihood of a key state, superpixel likelihoodg(n) defined in eq. (2.8) is used. All pixelsI(x;y) within the same superpixeln are assigned the same likelihood valueg(x;y) = g(n). Given the key state X 0 (i) of a sample i, we resize its image regions to a standard size and get a normalized image regionI 0 (i) =fI( x; y)g. Here we define a normalized likelihood measure as: l 0 (i) = X ( x; y)2I 0 (i) g(x;y) Z 0 (i) Z st (2.9) whereZ 0 (i) represents the region size of the key state of samplei andZ st represents the standard size. The normalization term makes the observation model adaptive to scale change by accepting more positive regions. Thus, the observation model for key stateX 0 can be given as: p(O 0 jX 0 (i))/l 0 (i) (2.10) 2.4.4 Appearance Descriptor for Leaf Parts As described in section 2.4.2, leaf parts are extracted as sub-regions of the target image. To represent the color appearance of a sub-image, color histogram is commonly used as a global descriptor. However, global color histogram lacks locally discriminative power, which can be improved with SPs. After the SP-segmentation process, each leafPart k consists of a group of local SPs within the sub-imagefSP n 2Part k g. Based on SPs, we propose efficient descriptors of the color pattern and boundary features. Color pattern can be represented as a compact descriptor for each leafPart k by feature vec- tor quantization, using the same codebook as for SP clustering for the key part in section 2.4.2. This quantization step maps SPs in a sub-image of leafPart k to a single vectorV k = (v 1 ;v 2 ;:::;v Mp ), 25 (a) Color pattern vector (b) Boundary map for 4 leaf parts Figure 2.4: Appearance descriptors for leaf parts each elementv m of which counts the weighted votes for a positive (h(m) > 0:5) SP clusterm in the codebook, an example is shown in figure 2.4(a). M p is the number of positive clusters (h(m) > 0:5) in codebook. The quantization proceeds with allfSP n 2 Part k g by nearest- neighbor classification. 
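The same nearest-neighbor lookup also drives the likelihood terms in eqs. (2.7)-(2.8). A minimal sketch, assuming each codebook cluster stores its center feature, radius, and the target/background areas it covers (all names are illustrative, not the thesis implementation):

import numpy as np

def cluster_likelihood(area_target, area_background):
    # h(m): fraction of the cluster's covered area that lies on the target (eq. 2.7)
    return area_target / (area_target + area_background + 1e-12)

def superpixel_likelihood(f_n, centers, radii, h):
    # g(n): assign superpixel feature f(n) to its nearest cluster m, then weight
    # that cluster's likelihood by the distance to the cluster center (eq. 2.8)
    dists = np.linalg.norm(centers - f_n, axis=1)   # distance to every cluster center
    m = int(np.argmin(dists))                       # nearest-neighbor assignment
    return h[m] * np.exp(-dists[m] ** 2 / radii[m])
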
Each SP only votes for its closest SP cluster in codebook with a weight, as given in the definition of vector elementv m : v m = X SPn2Part k 1(vote(n) =m) exp( jf(n)f c (m)j 2 r(m) ) (2.11) where the weight term measures the distance between SP featuref(n) and cluster center feature f c (m), giving higher weights to features closer to the center. We denote b V k , whose elements sum to 1, as a normalized vector ofV k . Boundary features can also be extracted from over-segmented image of each leaf Part k . First, we get an estimated part mask by suppressing the background and setting SPs in SP clusterfm;h(m) < 0:5g to 0. Then, we set those SPs which have no background neighbors to 0 and others 1, getting a binary boundary (consisting boundary SPs) mapb k of the object part, as shown in figure 2.4(b). 26 2.4.5 Likelihood Measures for Leaf States Given the leaf stateX k (j) of a samplej, we extract color and boundary features as described in section 2.4.4, getting b V k (j) andb k (j). By initial training or online updating, the tracker maintains both color and boundary templates ( b V k ,b k ) for leaf parts. Since b V k is normalized, we only need to resizeb k (j) to template size getting b b k (j). The observation model for a leaf stateX k can be computed as: p(O k jX k (j))/l c (j) +l b (j) where :l c (j) = b V k b V k (j); l b (j) =b k b b k (j) (2.12) Color likelihoodl c computes the inner product of a candidate’s color descriptor with a template, while boundary likelihood l b is obtained by convolving the part boundary template with can- didate’s boundary map. denotes the importance factor of boundary features relative to color pattern. 2.5 Inference By representing the tracking system as a DBN, we can view the tracking problem as a filtering process over the temporal model. Specificially, we make MAP inference of b X t 0 in eq. (2.1). We do particle-based approximate inference over the graph, the proposed algorithm is given in Algorithm 1. 27 Algorithm 1 Particle-based Inference Algorithm for DBN Tracking 1: Input: DBN: Graphical representation for tracking,I: Number of samples, 2: O 1:t 0:K : Observation sequence for all parts 3: Initialization:X 0 0 4: fori = 1 toI do 5: Assign samplex 0 0 [i] X 0 0 6: w 0 [i] 1=I 7: Dist 0 0 f(x 0 0 [i];w 0 [i]) :Ig 8: end for 9: Inference: 10: fort = 1; 2;::: do 11: fori = 1;:::I do 12: Samplex t1 0 [i] fromDist t1 0 . 13: Samplex t 0 [i] from motion dynamicsp(X t 0 jx t1 0 [i]) 14: w t [i] w t [i]p(O t 0 jx t 0 [i]) //update weights by key state likelihood 15: fork = 1;:::K do 16: Samplex t k [i] from spatial constraintsp(X t k jx t 0 [i]). 17: w t [i] w t [i]p(O t k jx t k [i]) //update weights by leaf part state likelihood 18: end for 19: Dist t 0 f(x t 0 [i];w t [i]) :Ig 20: end for 21: Output: b X t 0 =arg max x 0 t [i] p(x 0 t [i]jO 0:K 1:t );i = 1;:::;I 22: end for 2.6 Motion Model and Online Update The motion model is set as a Gaussian random walk, which is reasonable when no strong dynam- ics prior can be assumed. This is given as: p(X 0 t jX 0 t1 ) =N(X 0 t ;X 0 t1 ; ) (2.13) An online update procedure is designed to collect training samples and adapt the constellation appearance model to target’s appearance variations. Updating proceeds iteratively with tracking by keeping a queue of collected training samples, in which a sequence of tracked target and its 28 surrounding region is cached. 
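Taken together, one frame of tracking amounts to running the particle loop of Algorithm 1 and then pushing the MAP result into this queue. A minimal sketch of the particle step is given below, assuming the sampling and likelihood routines implementing eqs. (2.6), (2.10), (2.12) and (2.13) are supplied; all names are illustrative:

import numpy as np

def track_frame(particles, weights, motion_model, key_likelihood,
                leaf_sampler, leaf_likelihood, K):
    # One step of the particle-based inference of Algorithm 1 (sketch).
    # particles: key-part states X_{t-1}^0[i]; weights: their importance weights.
    weights = np.asarray(weights, dtype=float)
    idx = np.random.choice(len(particles), size=len(particles),
                           p=weights / weights.sum())      # resample from Dist_{t-1}
    new_particles, new_weights = [], []
    for i in idx:
        x0 = motion_model(particles[i])                    # sample X_t^0 ~ p(X_t^0 | X_{t-1}^0)
        w = key_likelihood(x0)                             # p(O_t^0 | X_t^0)
        for k in range(1, K + 1):
            xk = leaf_sampler(x0, k)                       # sample X_t^k ~ p(X_t^k | X_t^0)
            w *= leaf_likelihood(xk, k)                    # p(O_t^k | X_t^k)
        new_particles.append(x0)
        new_weights.append(w)
    new_weights = np.asarray(new_weights)
    best = new_particles[int(np.argmax(new_weights))]      # MAP key-state estimate
    return new_particles, new_weights, best
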
An average MAP tracking score is maintained, defined as:

\bar{p} = \frac{1}{Q} \sum_{t=T-Q}^{T} \max_{X_t^0} p(X_t^0 \mid O_{1:t}^{0:K})    (2.14)

New tracked frames are added into the queue while older frames are discarded, unless the new tracking MAP estimate is significantly lower than the queue average \bar{p}. In that case, occlusion is predicted and poorly tracked frames are discarded. After collecting online tracked samples, the same training process is employed as at the initialization stage.

2.7 Experimental Evaluation

In this section, we describe the evaluation of our tracking method (DBNT) and compare its performance with other leading trackers.

2.7.1 Implementation details

To get superpixels with similar sizes, which show good boundary fidelity across frames, the SLIC algorithm [4] is used. The average size of superpixels is set to a ratio (e.g. 0.005) of the initial target size and the spatial proximity weight is set to 10. A superpixel is represented by its color histogram. 500 particles are used during sequential importance sampling for inference. The update queue Q is set to 4-8 frames. We evaluate our tracking results on 6 public sequences with PASCAL ground truth and 2 challenging real-life sequences selected from the Mind's Eye Dataset, which contain more severe deformation and occlusion. Two criteria are employed for quantitative evaluation: tracking success rate and average center error in pixels. The criterion for a successfully tracked frame in PASCAL VOC is frameScore > 0.5, where frameScore = (T ∩ G)/(T ∪ G), T denotes the tracked region, and G denotes the ground truth. To compare with previous algorithms, we either take the reported best results or carefully select the parameters with the provided source code.

Figure 2.5: Tracking result comparison with SPT on dataset yard. Yellow outputs are from SPT, while blue outputs are given by our DBNT.

2.7.2 Results and Discussion

We tested our tracking method and compared it with other well-known trackers on 6 public videos (board, woman, liquor, singer, basketball and transformer) and 2 more challenging videos (run, yard) selected by us. Average pixel errors are summarized in Table 2.1 (DBNT is the label for our method). We discuss the performance of our tracker in comparison with other tracking methods.

Dataset       IVT   FragT  MILT  VTD   MS [23]  PF [61]  SPT   DBNT
board          14      84    14    98      236      184     7      6
woman         133     112   120   109       32       79     9      8
liquor        296      31   165   155      137       28     9      9
singer          5      21    20     3      116       25     4      5
basketball    120      14   104    11      203       21     6      5
transformer   131      47    33    43       46       49    13     11
run           147     163   129   154      235      109    69     23
yard           43      52    47    50       67       53    46     10

Table 2.1: Average center errors (in pixels) comparison

Comparison with state-of-the-art tracking methods:

Figure 2.6: Tracking result comparison with SPT on PETS09 dataset. Yellow outputs are from SPT, while blue outputs are given by our DBNT.

The IVT method is designed to account for appearance variations due to affine motion, scale change and illumination, so it performs well in tracking targets without large deformation, as shown in Table 2.1 (board) and (singer). The use of a Gaussian random walk motion model and a particle filter enables it to re-acquire the target after full occlusion in some situations. However, being based on a holistic appearance model, it does not deal well with partial occlusion, as shown in Table 2.1 (woman). Because all tracking results are used for update, tracking errors are likely to accumulate in the IVT tracker.
When large pose changes are present, our part-based representation models the appearance more flexibly and is therefore more likely to match, as shown in Table 2.1 (basketball).

The MILT method uses a multiple instance learning algorithm as its appearance update scheme, which shows good performance in tracking rigid targets without large changes in scale and pose (see board and transformer in Table 2.1). To some extent, the use of MIL improves the ability of a holistic appearance model to deal with deformation. However, its tracking results are not as accurate as ours (see board and transformer in Table 2.1). This demonstrates that our constellation appearance model with SP-based feature extraction is helpful during shape deformation.

The VTD method uses multiple motion and observation models to account for appearance variations. It outperforms our method on the singer video, where there is a significant lighting change, but the representation is still holistic and less effective in dealing with deformation and partial occlusion. Although it shows good performance on the basketball video, it has a larger center error than DBNT.

Dataset       IVT    FragT  MILT  VTD   MS    PF    SPT   DBNT
board         78.3    50.7  82.7  35.3  12.8  31.9  96.6  97.8
woman         14.7    13.2  11.4   8.1  10.5   9.3  93.1  95.5
liquor        21.8    79    20.3  27.6  23.7  69    97.7  98.1
singer        94.6    24.8  25.6  99.7  18.2  27.3  84.6  79.8
basketball    11      70    28.1  82.9  10.8  62.8  97.5  91
transformer   20.4    30.6  24.2  37.9  22.6  25.8  100   100
run           65.3    33.4  56.6  43.5  11.2  34.7  36.5  97.3
yard          35      36    35    37    30    35    38    71

Table 2.2: Successful Frame Rate (%) Comparison

The SPT method [89] is also a superpixel-based tracking method, which shows results comparable to ours on the common datasets. This tracker learns a discriminative appearance model based on superpixels from more than one initialized (user-tagged) frame and updates this discriminative appearance model online. It shows results competitive with ours on average. However, a more detailed analysis shows that DBNT is more stable, with less fluctuation, in dealing with appearance variation during partial occlusion and deformation; see Figure 2.8(a) around frame number 350 and Figure 2.8(b) around frame number 39. Moreover, our tracker's ability to handle deformation and occlusion is well demonstrated on two challenging videos (run, yard). Experimental result comparisons are given in Fig. 2.7 and Tables 2.1, 2.2. Qualitative results are shown in Figs. 2.5 and 2.6.

Figure 2.7: Tracking comparison with SPT on 2 datasets with occlusion and deformation

Figure 2.8: Tracking stability comparison with SPT, which shows our tracker is more stable in dealing with partial occlusion

In Fig. 2.5, the SPT tracker fails after encountering heavy target occlusion while DBNT overcomes this situation, because multiple part observers in our model take effect when the key observer gets into trouble. In Fig. 2.6, SPT loses track while DBNT continues to track the deformed target. This demonstrates that collaborative multiple parts with spatial constraints outperform a single holistic model, especially when the background color patterns are close to the object's. The major computational cost of both SPT and DBNT lies in the over-segmentation step, which makes DBNT run at around 1 fps for high-resolution (1280x720) video on a PC with a 3.0GHz CPU in order to obtain good tracking accuracy with fine-scale superpixels, especially during deformation. Moreover, the SPT method relies on more than one tagged frame for initialization, which can be limiting in practice.
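For completeness, the per-frame success criterion behind Table 2.2 is the PASCAL overlap score described in section 2.7.1. A minimal sketch for axis-aligned boxes given as (x, y, w, h):

def frame_score(tracked, truth):
    # PASCAL VOC overlap: intersection over union of the tracked box T and ground truth G.
    # A frame counts as successfully tracked if the score exceeds 0.5.
    tx, ty, tw, th = tracked
    gx, gy, gw, gh = truth
    ix = max(0.0, min(tx + tw, gx + gw) - max(tx, gx))   # horizontal overlap
    iy = max(0.0, min(ty + th, gy + gh) - max(ty, gy))   # vertical overlap
    inter = ix * iy
    union = tw * th + gw * gh - inter
    return inter / union if union > 0 else 0.0
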
2.8 Conclusion

In this chapter, we introduced a single-target tracker (named DBNT) using a superpixel-based constellation model. The constellation appearance model uses multiple parts to deal with partial occlusion and deformation, where discriminative parts can provide strong evidence for the target. In contrast, other tracking methods use a single bounding-box observer, which ignores partial evidence. In terms of feature extraction with superpixels, instead of using only the superpixels' own color features as SPT does, we exploit the color patterns as well as the boundary structures of superpixels' neighborhoods, which make the appearance model more discriminative. Experimental results show that our tracker (DBNT) outperforms existing single-target trackers, especially under deformation and occlusion.

Chapter 4 is preceded by Chapter 3 below.

Chapter 3

Group Context and Appearance Saliency for Multi-target Tracking

Multi-target tracking becomes more challenging in crowded scenes. Strictly speaking, we target common semi-crowded scenes where frequent occlusions exist and human bodies are at least partially observable in some frames; the analysis of extremely crowded scenes with very high people density, where only heads are visible, is a different topic from ours. In this chapter, we explore group context and discriminative appearance features of each target to help tracking.

3.1 Motivation

With improvements in object detection, association based tracking (ABT) [17, 62, 67, 54, 6, 93] has recently become an important framework for multi-target tracking. In a global association framework, a temporal sliding window is used to cache detection responses, and global data association methods, e.g. min-cost flow [106] or the Hungarian algorithm [67, 54], can infer optimal solutions over the cached window to link responses into trajectories efficiently, in contrast to methods requiring only past and current frames [103, 17, 62, 100].

To simplify the underlying optimization problem, motion independence among targets is usually assumed in the ABT framework [22, 106, 41]. However, motion independence among trajectories does not always hold, because people may move in groups, in which case each group member's individual motion is highly related to the group motion. In such cases, the motion independence assumption may overlook important motion cues that can be helpful for association, especially when appearance cues are weakened by occlusions or other scene noise. For example, the appearance of target 13 in figure 3.1 is frequently affected by illumination changes. Existing methods [95, 65, 66] exploring trajectory dependence usually use group information to refine the object motion model, which differs from our method as described below.

We employ an efficient hierarchical association framework [41] to progressively link input responses into longer and longer tracklets (trajectory fragments). In our association hierarchy, flexible groups, which are efficiently inferred online from tracklets' motion at a lower level, are naturally incorporated as object context to refine a tracklet's linking likelihood with other tracklets in a higher-level association.

Targets moving with their companions in close groups also pose challenges for tracklet association, since motion cues may become less reliable for distinguishing members of a close group. To distinguish a target from its companions, appearance is a critical cue.
Unlike existing methods [81, 48, 57] which learn globally discriminative appearance model for all targets online or offline, we argue that each target has its own salient appearance features, which are discriminative in telling a target apart from its companions and invariant to cross- tracklet noises in keeping identity for the same target over a certain time gap. Naturally, salient appearance features are specific to a certain case and can vary across different targets, therefore they should be learned online as new cases come. Besides, the effectiveness of salient features for the same target may also change over time as viewpoints, targets’ poses and companions change. 36 Figure 3.1: Examples of our tracking results on ETH and Trecvid 2008 data sets Features learned at an earlier stage may even become invisible at a later stage after large part misalignment happens with pose changes. In our approach, after lowest level association we adaptively learn salient appearance features for each tracklet for the next higher association level. To deal with part misalignment, appearance features are extracted from coarsely sampled overlapped regions within responses. In the learned salience model, each feature from a local region gets a normalized salience weight. The overview of our approach is given in section 3.3 and some tracking results are shown in figure 3.1. We term our group context and appearance saliency based tracker to be GCAST. Our contributions are summarized as: 1.Group context, which are efficiently inferred online from tracklets’ motion, is incorporated in an efficient association framework to improve tracking (details in sec.3.4). 2.Efficient matrix operations are proposed to consider group context to refine the linking matrix for association by Hungarian algorithm (details in sec.3.4.2). 3.In the association hierarchy, adaptive appearance salience features are online learned for affinity estimation between tracklets (details in sec.3.5), which outperform global discriminative model as shown in experiment sec.3.6. 37 3.2 Related Work Motion provides important hints for tracking. For extremely crowded scenes, objects are highly likely to follow similar motion patterns which are learned as priors for trajectory estimation[47, 73, 108]. However, objects may move freely alone or in groups in normal scenes and strong mo- tion prior is less reliable. To compensate approximate linearity assumption, motion patterns are used to explain non-linear trajectories[97]. In sports tracking[58], team context features learned offline are considered in motion models. Given manually annotated training data for offline pa- rameter learning, to predict where to go in next frame in pedestrian tracking, group states are considered in a motion model[95, 66]. In an association framework for semi-crowded scene, object trajectories and groups are jointly modeled [65]. However, these models[95, 65, 66] are more suitable for a specific scene, since motion model parameters are learned offline and limit the generalization of the model. Pairwise context is used assuming all pairs of objects in the scene have correlated motions[90] in a maximum weight independent set optimization. Frame 2676 Frame 2869 Frame 3407 Frame 2549 Figure 3.2: Example of tracking results by GCAST. Colored arrows are used to mark interesting points. GCAST tracker can track targets in the distance even though frequent occlusion happens and response sizes are small. Best viewed in color. 
Different from general image salience detection methods[16], our appearance salience is on- line learned for tracklet matching. For person re-identification, appearance salience based on unsupervised learning is used to deal with cross-view matching[75], while in our method it is online learned from reliable tracklets produced by a lower-level association. 38 3.3 Overview of our Approach Given the input observation responsesX =fr i g, the data association is cast into a MAP estima- tion form as: T = arg max T P (TjX ) (3.1) = arg max T P (XjT )P (T ) where the most probable set of trajectory hypothesesT =fT 1 ;T 2 ;:::;T N g is estimated and a single trajectory hypothesis is a temporal sequence of responsesT n =fr n1 ; r n2 ;:::g. The likelihood term P (XjT ) measures how well the given observation responses match a set of trajectories; while the priorP (T ) estimates the linking probabilityP (r nt ! r n t+1 ), the initialization probability P enter (n) and the termination probability P exit (n) for all trajectories. In previous methods on multi-target tracking, the motion of each object is often assumed to be independent and the prior is simplified to consider each trajectory in isolation as: P (T ) = Y n P (T n ) (3.2) = Y n;t P (r nt ! r n t+1 )P enter (n)P exit (n) (3.3) We employ the Hungarian algorithm to form the MAP inference in eqn.4.1 as a linear assign- ment(association) problem from tracklet ends to tracklet beginnings, which works on a linking matrixM as its key input. Suppose we haveR tracklets for association, therefore the linking ma- trixM hasR rows for tracklet ends andR columns for tracklet beginnings. Importantly, motion dependence among tracklets moving in groups is translated into dependence among elements in matrixM, detailed in section 3.4. 39 1 4 5 2 3 8 9 10 11 12 15 16 17 18 19 20 21 22 23 24 25 26 7 32 13 14 ΔT 1 ΔT 2 t γ 1 γ 2 γ 3 γ 4 γ 5 γ 6 γ 7 γ 8 γ 9 γ 10 γ 11 27 28 γ 11 29 30 γ 12 6 Figure 3.3: Progressive tracklets association Specifically, we employ a hierarchical association framework which is efficient by reducing the solution space progressively as association goes to higher levels. Given the detection re- sponsesD =fd i g as initial inputX =D, a simple frame-to-frame association method[41] with dual-threshold is used to generate conservative tracklets, each of which can be confidently treated as a reliable fragment consisting of responses for the same target. The goal of higher association stages is to link these conservative tracklets to longer and complete trajectories. After the lowest level associaiton, we treat the tracklet set =f i g output from a lower level as the input observation responses,X in eqn.4.1, for the next level. Matrix elementm ij inM measures the likelihood that j is the immediate successor of i , which is composed of terms in appearance, motion and time gap: m ij =L a (i;j)L m (i;j)L tg (i;j) /P ( i ! j ) (3.4) 40 The appearance likelihood termL a (i;j) is estimated by using the salient appearance features online learned for current tracklet set (details in section 3.5). The motion likelihoodL m is approximated by linear motion affinity from the end positions i of i to the starts j of j as: L m (i;j)/exp(d 2 ( i ! j ))exp(d 2 ( i j )) (3.5) d 2 ( i ! j ) =js i +v i t ij s j j 2 d 2 ( i j ) =js j v j t ij s i j 2 where v i denotes estimated velocity from the end part of i and v j denotes estimated velocity from the start part of j . 
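A minimal sketch of this linear motion affinity of eq. (3.5), assuming the tail/head positions and velocities of each tracklet have already been estimated; dt is the frame gap between the two tracklets and sigma is an assumed variance parameter:

import numpy as np

def motion_affinity(tail_i, vel_i, head_j, vel_j, dt, sigma=1.0):
    # Linear motion affinity between tracklets tau_i and tau_j (eq. 3.5, sketch):
    # forward prediction from the end of tau_i and backward prediction from the
    # start of tau_j, each penalized by its squared error.
    err_fwd = np.asarray(tail_i) + np.asarray(vel_i) * dt - np.asarray(head_j)
    err_bwd = np.asarray(head_j) - np.asarray(vel_j) * dt - np.asarray(tail_i)
    return np.exp(-(err_fwd @ err_fwd) / sigma) * np.exp(-(err_bwd @ err_bwd) / sigma)
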
t ij is the time gap between the end of i and the start of j , which is also used in the termL tg (i;j): L tg (i;j) = 8 > > < > > : 1 if 0< t ij < T level 0 otherwise (3.6) where T level is a time gap threshold, which is increased progressively as association level goes higher. Note that so far the appearance likelihoodL a (i;j) are estimated as if trajectories are indepen- dent. After considering possible motion dependence within a moving group described in section 3.4, we can refine the initial linking matrixM 0 and get a new linking matrixM 1 , which is used as input to Hungarian algorithm for association. In the association hierarchy, confident and easy associations done at lower levels can produce reliable tracklets for online salient feature learning to better resolve linking ambiguity at higher levels. As in figure 3.3 , over interval T 1 , a moving group is more likely to persist (f 1 ; 2 g to 41 t 2 1 3 4 5 6 7 8 9 10 Figure 3.4: Extract moving groups for tracklets. The end groupG e 2 of tracklet 2 contains f2; 1; 3; 9g shown in orange dashed circle; the start groupG s 6 of tracklet 6 containsf6; 1; 7; 10g in blue dashed circle. f 4 ; 5 g) and group context can be better utilized to improve association; over a larger interval T 2 , fewer moving groups can be found and salient appearance features are more reliable. 3.4 Group Context Motion independence among trajectoriesfT n g does not always hold as targets may move in a group with similar motion, which brings both benefits and challenges for correct association. Challenges are that motion cues are less reliable to distinguish targets in a group; benefits are that considering the group context around the targets, high association confidence for one tracklet pair may boost some other tracklet pairs’ linking possibilities. For example in figure 3.3, tracklets 1 and 2 are recognized as a group by their end states, tracklets 4 and 5 are recognized as a group by their start states. If 1 can be confidently linked to 4 , then it is highly likely that 2 should be linked to 5 . If the linking likelihoodm 25 from 42 eqn.3.4 is too low in MatrixM without considering group context, then its value should be refined using group context for a better linking chance. 3.4.1 Online inferred moving groups For each reliable tracklet, spatial and motion information is used to extract its moving groups. A tracklet i potentially has a start groupG s i and an end groupG e i .G s i andG e i are not necessarily the same since some group member may become occluded or leave the group in the duration of tracklet i . It is also possible that a groupG s i orG e i only has i itself as a degenerate group, which indicates tracklet i is moving alone. For example in figure 3.4, we extract an end groupG e 2 for 2 and a start groupG s 6 for 6 . A tracklet j in the end groupG e i is defined as one that coexists and moves together with the end part of tracklet i for a long enough time, during which it remains spatially close to i G e i :f j jjt co (i e ;j)j>; (3.7) t v (i;j)< v ; t s (i;j)< s ;8t2t co (i e ;j)g where t co (i e ;j) denotes the coexisting time interval of j and the end part of i , t v (i;j) and t s (i;j) are velocity and spatial difference between trackletsi andj respectively,; s ; v are threshold parameters for co-moving time length, spatial difference and speed difference. 
Similarly,G s i is defined as the set of tracklets that coexist and move together with the start part of tracklet i for a long enough time, during which it remains spatially close to i . Typically, group members are within s = 0:5 meters in ground position and as many as 5 members can be extracted for a tracklet in a crowd scene. For a tracklet which has group 43 members, we consider them in its group context to refine its linking possibility with potential matches as described in section 3.4.2. 3.4.2 Refine linking matrixM with Group Context We denoteM 0 as the linking matrix computed without group context. WithM 0 as input for association by Hungarian algorithm, we consider each tracklet independently for linking. With salient appearance features, described in section 3.5, it is highly unlikely that the appearance likelihood termL a (i;j) in eqn.3.4 between unsuited pairs is overestimated above a confident threshold. However, target’s appearance may change over time as its pose varies and image observation is exposed to noise such as occlusion and illumination changes, due to whichL a (i;j) can be underestimated and elements inM 0 , denoted asm 0 ij , can be significantly lower than its real value. Considering each tracklet in a group context, we aim to refineM 0 by adjusting its underestimated element values. Consider the tracklets example shown in figure 3.4, we evaluate each tracklet pair indepen- dently by eqn.3.4 and get an 10 10 matrixM 0 shown in figure3.5(a). To refineM 0 , we start by applying Hungarian algorithm onM 0 and get the best matching solution from tracklet ends(rows) to tracklet starts(columns). However, we only accept those matching pairs with confident linking likelihoodm 0 ij > m 0 accept . In our implementation, we set m 0 accept = max(0:7;m 0 top20% ), andm 0 top20% denotes a value as high as that only top 20% of the matching pairs can be accepted. We denote those confident elements inM 0 for confidently matching pairs asE accept . For each element inE accept , e.g.m 0 37 with red arrow in figure 3.5(a), other elements from the same 44 row or the same column(row 3 or column7 in figure3.5(a)) inM 0 are classified as mute ele- ments, denoted asE mute , which will remain constant and will not affect other elements during refinement. We refine an element inM 0 if it is neither in the accepted setE accept nor the mute setE mute , besides elements with zero values indicate unlinkable tracklet pairs and remain 0. To refine an elementm 0 ij , we consider it within a group context mask as highlighted in figure 3.5, including contender elements(), support elements() and bridge tracklet indexes(red arrows in diagonal). Specifically, only bridge tracklet indexes and confidently accepted support elements, as indicated by red arrows in figure3.5(b), can help increasem 0 ij value. For an elementm 0 ij , we define its contender elements, which will not help increasem 0 ij value as: E con ij :fm 0 ik jk2G s j ;k6=jg[fm 0 kj jk2G e i ;k6=ig (3.8) as marked in green () in figure3.5(a) form 0 26 . The support elements of elementm 0 ij is defined as: E spt ij :fm 0 pq jp2G e i ;p6=i;q2G s j ;q6=jg (3.9) as marked in green () in figure3.5(a) for m 0 26 . Besides, we define bridge tracklet indexes for elementm 0 ij as: B brg ij :fkj k 2G e i ; k 2G s j g (3.10) e.g. tracklet 1 form 0 26 in figure 3.4 and 3.5. Algorithm2 shows the refinement procedure with elements inM 0 using group context. 
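The core per-entry update of that procedure can be sketched as follows; the mixing weight alpha below is an assumed parameter, and Algorithm 2 gives the exact update rules:

def refine_entry(m0, i, j, accepted, support_ij, bridges_ij, alpha=0.5):
    # Sketch of the per-entry refinement used in Algorithm 2 (alpha is assumed).
    m1 = m0[i][j]
    for (p, q) in support_ij:              # support elements E^spt_ij
        if (p, q) in accepted:             # only confidently accepted matches contribute
            m1 += (1.0 - alpha) * m0[p][q]
    for _k in bridges_ij:                  # each bridge tracklet adds a fixed bonus
        m1 += (1.0 - alpha) * 0.9
    return m1
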
45 (a) (b) M 0 1 2 3 4 5 6 7 8 9 10 1 2 0 m 0 26 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 M 1 1 2 3 4 5 6 7 8 9 10 1 2 0 m 1 26 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 Figure 3.5: Refine (a) original linking matrixM 0 with group context to get (b) adjusted linking matrixM 1 . We highlight one example of refining m 0 26 to m 1 26 (the linking likelihood from 2 ! 6 ) by referring end groupG e 2 contextf2; 1; 3; 9g and start groupG s 6 contextf6; 1; 7; 10g. Bridge tracklet 1 (in diagonal) and element m 0 37 are picked to refine m 0 26 ! m 1 26 . Contender elements ofm 0 26 are marked in. Support elements ofm 0 26 are marked in. Algorithm 2 Refine linking matrixM 1: Input:M 0 , with normalized elements 0m 0 ij 1 2: Do Hungarian algorithm onM 0 3: Accept confident elementsE accept for confidently matching pairs inM 0 ; Get mute elements E mute 4: fori;j = 1; ;R;i6=j do 5: ifm 0 ij = 2E accept [E mute andm 0 ij 6= 0 then 6: Extract moving groupsG e i andG s j 7: Get support elementsE spt ij and bridge trackletsB brg ij for elementm 0 ij 8: m 1 ij m 0 ij 9: form 0 pg 2E spt ij \E accept do 10: m 1 ij m 1 ij + (1)m 0 pg 11: end for 12: fork2B brg ij do 13: m 1 ij m 1 ij + (1) 0:9 14: end for 15: end if 16: end for 17: Output: the new linking matrixM 1 46 3.5 Appearance Saliency for tracking Appearance features are critical cues to associate tracklets correctly. We argue that each target has its unique appearance features which make it stand out from the crowd, but these also depend on the current crowd appearance from which we aim to distinguish. As the association level and the pool of tracklets changes, the salient appearance features for a target also change. In our work, appearance saliency measures the discriminative power from others as well as the consistency over time in its own appearance. We learn a salience weighted appearance affinity estimator for each tracklet online at each specific association level, as shown in figure3.6(a). 3.5.1 Region based Appearance Affinity Estimator Local salient regions seem effective for humans to measure affinity between targets, especially during frequent occlusions and targets are partially visible. Therefore, we represent target appear- ance by local features from local regions. To deal with certain part misalignment, we coarsely sample overlapped regions within responses’ range, examples of which are shown in figure 3.6. We extract a RGB color histogram and a HOG feature 1 for each sampled region and use his- togram correlation to measure feature similarity. For example in figure 3.6, we sample 21 regions and get 42 local features to measure appearance affinity. We denote by h q (x) a local affinity estimator, which takes a pair of responses as inputx : (r ;r ) and use the local region to extract color histogram or HOG feature, compute feature correlation and output a normalized affinity value in [1; 1]. Local features have different levels of salience, which are translated into normalized salience weights q in a combined appearance affinity estimator H(x) = P q q h q (x). As the set of 1 We use 8 bins for each channel and concatenate them to a 24-bin RGB color histogram. HOG feature is formed by concatenating 8 orientation bins in22 cells over the region. 47 0 1 2 3 4 9 10 11 12 5 6 7 8 13 14 15 16 17 18 19 20 t Association level Appearance salience weights Appearance salience weights (a) (b) Figure 3.6: Examples of sampled regions for learning appearance salience model . 
tracklets changes in the association hierarchy, salience weightsf q g are learned online described in section 3.5.3 with sample collection described in sec.3.5.2. 3.5.2 Online sample collection Suppose we have a tracklet set =f i gat an association level above the bottom level(frame-to- frame linking). The positive samples for tracklet i with labels (y k : +1) are formed by pairs of responses from different frames of tracklet i : S + i :fx k : (r ;r );y k : +1jr 2 i ;r 2 i ;6=g (3.11) To collect negative samples for i , we argue that any tracklet coexisting with i belongs to the crowd we want to distinguish from. Especially, tracklets in co-moving groupsG e i andG s i , which pose challenges for discriminative power by motion as they are spatially close and move together with i , are given more emphasis as described below. 48 We divide the negative samples for i with labels (y k :1) into two sets. The first setS i[a] is formed by pairing one response from i and one response from a different tracklet in co-moving groupsG e i orG s i : S i[a] :fx k : (r ;r );y k :1jr 2 i ;r 2 j ; (3.12) j 2G e i [G s i ; j 6= i g The second setS i[b] is formed by pairing one response from i and one response from a tracklet in the co-existing set except forG e i andG s i : S i[b] :fx k : (r ;r );y k :1jr 2 i ;r 2 j ; (3.13) j 2 i ; j = 2G e i [G s i g where i is the set of coexisting tracklets which have at least 1 frame overlap with i in time. At the learning stage, samples fromS i[a] are emphasized with higher initial weights than those from S i[b] . 3.5.3 Appearance Salience Weights learning Local region-based affinity estimators are combined into a stronger appearance affinity estimator H(x) = P q q h q (x) using normalized salience weightsf q g. The salience weights are learned in a boosting framework[78], given in Algorithm 3. Note that each tracklets get its unique salience weightsf q g to estimate its affinity with others. To deal with partial occlusion, we use an occupancy map to estimate the occlusion mask v q (x) for each local estimator, assuming responses with smaller Y coordinates can be occluded 49 Algorithm 3 Salience weights learning for i 1: Input:S + i ,S i[a] ,S i[b] 2: Set initial weight w k = 1 2jS + i j if x k 2 S + i , w k = 1 2 S i[a] + S i[b] if x k 2 S i[a] , w k = 1 4 S i[a] +2 S i[b] ifx k 2S i[b] 3: fort = 1 toT do 4: forq = 1 toQ do 5: r = P k w k y k h q (x k )v q (x k ) 6: q = 1 2 ln 1+r 1r 7: end for 8: Selectq = arg min q P k w k exp[ q y k h q (x k )v q (x k )] 9: Set t = q ;h t =h q ;v t =v q 10: Update weightw k w k exp[ t y k h t (x k )v t (x k )] 11: Normalizew k 12: end for 13: Output:H(x) = P T 1 t h t (x) by responses with larger Y(closer to camera) in the same frame. If over 30% of a region in either response in an input pair is occluded,v q (x) = 0; otherwise,v q (x) = 1. Method Recall Precision FAF GT MT PT ML Frag IDS PRIMPT[49] 76.8% 86.6% 0.891 124 58.4% 33.6% 8.0% 23 11 DPAM [99] 77.5% 90.9% 0.595 124 66.1% 25.0% 8.9% 21 12 DP[68] 67.4% 91.4% - 124 50.2% 39.9% 9.9% 143 4 DCCRF[59] 77.3% 87.2% - 124 66.4% 25.4% 8.2% 69 57 AST(no context) 77.3% 91.6% 0.560 124 65.8% 25.6% 8.6% 20 10 GCAST(ours) 77.9% 91.4% 0.572 124 69.7% 21.5% 8.8% 18 7 Table 3.1: Results comparison on ETH data set. Two best results for each column are shown in bold. 
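A vectorized sketch of the boosting loop in Algorithm 3, assuming the local estimator outputs h_q(x_k) and occlusion masks v_q(x_k) have been precomputed into arrays; array names are illustrative:

import numpy as np

def learn_salience_weights(H_local, V_mask, y, w, T):
    # H_local[q, k]: output of local estimator h_q on training pair x_k (in [-1, 1])
    # V_mask[q, k]:  occlusion mask v_q(x_k) in {0, 1};  y[k]: pair label in {-1, +1}
    # w[k]: initial sample weights;  returns the selected (alpha_t, q_t) rounds.
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    rounds = []
    for _ in range(T):
        r = (w * y * H_local * V_mask).sum(axis=1)            # per-estimator correlation
        r = np.clip(r, -0.999, 0.999)
        alpha = 0.5 * np.log((1.0 + r) / (1.0 - r))
        loss = (w * np.exp(-alpha[:, None] * y * H_local * V_mask)).sum(axis=1)
        q = int(np.argmin(loss))                              # pick the best local estimator
        rounds.append((alpha[q], q))
        w = w * np.exp(-alpha[q] * y * H_local[q] * V_mask[q])
        w = w / w.sum()                                       # re-normalize sample weights
    return rounds   # H(x) = sum_t alpha_t * h_{q_t}(x)
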
3.5.4 Estimate appearance affinity Suppose we have learned an appearance affinity estimator for each tracklet inf i g, we estimate the appearance likelihood termL a (i;j) in eqn.3.4 using both salience weightsf q g i off i g and salience weightsf q g j off j g. The estimation procedure is given in algorithm 4. 50 Algorithm 4 Estimate appearance affinity between i and j 1: Input: i , j ,f q g i ,f q g j 2:fr g collect responses from the end part of i 3:fr g collect responses from the start part of j 4: Testing samples:fx k g make pairs by one response fromfr g and one response from fr g 5: forq = 1 toQ do 6: fork = 1 do 7: local affinity value setfh q g add valueh q (x k )v q (x k ) to set 8: end for 9: h avg q calculate the average value of the largest 4 values in setfh q g 10: end for 11: H ij P ( qi + qi )h avg q 12: Output:L a (i;j)/H ij 3.6 Experimental Results The performance of our approach, named as GCAST, is evaluated on three public pedestrian data sets: PETS 2009[1], ETH mobile[17], and Trecvid 2008[2]. Quantitative comparisons with several state-of-the-art methods and visualized results are shown; some video results are provided in the supplementary material. As input raw detection would influence tracking performance, for fair comparisons, the same detection results, provided by authors of[49], are used for all compared methods. No additional scene knowledge is provided. Quantitative evaluation is based on the commonly used evaluation metrics[57], including recall and precision of detection performance after tracking; false alarms per frame(FAF); num- ber of ground truth trajectories(GT); percentage of mostly tracked(MT) trajectories with tracked part 80% trajectory length, partially tracked(PT) trajectories with tracked part between 20% and 80% trajectory length, and mostly lost(ML) trajectories with tracked part 20% trajec- tory length; fragments(Frag), the total number of times ground truth tracks are interrupted; id switches(IDS), the total number of times that generated tracks change their matched ground truth tracks. 51 The three data sets have different camera angles, crowd densities and motion patterns. Our approach has low sensitivity to parameters and the same setting of parameters is used on all three data sets. Performance is improved on all three data sets compared with earlier state-of-the-art methods. Method Recall Precision FAF GT MT PT ML Frag IDS CRF[96] 79.2% 85.8% 0.996 919 78.2% 16.9% 4.9% 319 253 OLDAM[48] 80.4% 86.1% 0.992 919 76.1% 19.3% 4.6% 322 224 PRIMPT[49] 79.2% 86.8% 0.920 919 77.0% 17.7% 5.2% 283 171 DPAM [99] 78.7% 88.2% 0.807 919 73.0% 20.9% 6.1% 253 149 AST(no context) 82.0% 90.6% 0.785 919 77.4% 18.6% 4.0% 267 147 GCAST(ours) 83.4% 89.3% 0.792 919 77.6% 18.2% 4.2% 245 143 Table 3.2: Results comparison on Trecvid 2008 data set. Two best results for each column are shown in bold. Method Recall Precision FAF GT MT PT ML Frag IDS CEM[7] - - - 23 82.6% 17.4% 0.0% 21 15 PRIMPT[49] 89.5% 99.6% 0.020 19 78.9% 21.1% 0.0% 23 1 DPAM [99] 92.8% 95.4% 0.259 19 89.5% 10.5% 0.0% 13 0 GCAST(ours) 96.9% 97.4% 0.172 19 94.8% 5.2% 0.0% 2 0 Table 3.3: Results comparison on PETS 2009 data set. Two best results for each column are shown in bold. 3.6.1 Results on ETH data set The ETH data set is captured by a stereo pair of cameras mounted on a moving stroller in busy city street scenes. 
Only the data captured by the left camera for the sequences "BAHNHOF" and "SUNNY DAY", as in [49, 99], are used in our experiment, to demonstrate tracking performance from a single camera. The challenges are significant target size changes, from 40 pixels in height in the distance to 400 pixels in height when close to the camera, frequent fast camera shifts, and frequent full occlusions due to the low camera angle.

Quantitative comparison is shown in Table 3.1. Compared to methods using global discriminative appearance models [99, 49], our tracker (AST), using only salient appearance features, reduces Frag and IDS and improves tracking responses' precision, while giving comparable results on the rest of the metrics. This shows the advantage of online adaptively learned salient appearance features. Using both group context and salient appearance features, Frag and IDS are further reduced and the percentage of mostly tracked trajectories (MT) is improved by over 3% compared to the best of the other methods. Note that frequent occlusions happen in this data set and many missed detections exist in the raw detection input, which makes it challenging to track a target completely and improve MT. Examples of tracking results are shown in figure 3.7. (Readers are encouraged to compare figure 3.7 with the results of [59] on this sequence.)

Figure 3.7: Tracking results of GCAST on the ETH "BAHNHOF" sequence (frames 387, 445, 489, 591, 627). Colored arrows are used to mark interesting points, and arrows with the same ID indicate the same target across frames. Best viewed in color.

3.6.2 Results on Trecvid 2008 data set

Trecvid 2008 is a very challenging data set captured indoors at a busy airport with high crowd density and frequent occlusions. Each video clip has 5000 frames; we use 9 videos from Cam1, Cam3 and Cam5 for evaluation, the same as in [57, 49, 99]. Table 3.2 shows quantitative comparison results. Our tracker reduces as many as 10 fragments and gets the lowest IDS, while achieving the highest precision and recall rates. Note that group context is effective even in such a crowded scene, as it helps reduce 22 fragments and 4 IDS while improving the recall rate by 2%.

Figure 3.8: Example of tracking results by GCAST on the Trecvid dataset from CAM5 (frames 311, 343, 377, 420, 1101, 1141, 1326, 1357). Best viewed in color.

Tracking results on this data set are shown in figures 3.2, 3.8 and 3.9. The GCAST tracker is able to track small targets in the distance even though they are more vulnerable to frequent occlusion.

3.6.3 Results on PETS 2009 data set

The same PETS 2009 video clip is used as in [7]. The challenge in this data set is due to the abrupt motion direction changes that happen when targets are occluded by other targets or by scene occluders. The quantitative comparison results are shown in Table 3.3 (results of earlier methods are reported by their published work). A stricter ground truth annotation than [7], where fully occluded targets are labeled with the same ID when they reappear, is used for evaluation. Compared to the second best performance in Table 3.3, the number of fragments is greatly reduced from 13 to 2; MT is improved by more than 5%; the precision rate is improved by 5%.

Figure 3.9: Example of tracking results by GCAST on the Trecvid dataset from CAM3 (frames 400-1043). Best viewed in color.
Tracking results on PETS 2009 are shown in figure 3.10.

Method         Trecvid 2008   ETH    PETS 2009
PRIMPT [49]    7              -      -
DPAM [99]      6              10     22
GCAST (ours)   6.29           12.3   18.2

Table 3.4: Average computational speed (frames per second) on three data sets

Figure 3.10: Tracking results of GCAST on PETS 2009 (frames 153, 205, 288, 371). Best viewed in color.

3.6.4 Computational Speed

Our method is implemented in C++ on a PC with a 3.0GHz CPU and 4GB memory. The average running speeds (frames per second) of the GCAST tracker on the 3 data sets are summarized in Table 3.4 (earlier methods' speeds are reported by their published work), which shows that this tracker is comparable in efficiency to methods with globally learned discriminative appearance models. Although global models need to be learned less frequently, our adaptive salient appearance learning for a single target requires much less training time with fewer online training samples. Besides, inferring group context online and the matrix refinements are also efficient.

3.7 Conclusion

In this chapter, we proposed to efficiently infer group context online from a crowded scene and incorporate it in a hierarchical association framework to refine linking likelihoods; we also introduced adaptive appearance salience features which are learned online to better estimate appearance affinity between tracklets. Experimental results demonstrate that the use of these two can help tracking in semi-crowded scenes.

Chapter 4

A Binary Quadratic Programming Formulation with Social Factors

To simplify the underlying optimization problem, trajectory independence among objects is usually assumed in existing association frameworks. However, trajectory independence among objects does not always hold, due to various social factors. For example, people do not always walk alone [66] but may move in groups, where one person's trajectory can be highly related to his/her companions'; even without a companion, a person's path choice can be highly influenced by the previous paths of others. In such cases, the trajectory independence assumption may overlook important social factors which are helpful for association, especially when appearance cues are weakened by occlusions or other scene noise.

4.1 Motivation

For time-sensitive scenarios, there are methods requiring only past and current frames [103, 17, 62, 100], but they have difficulty in dealing with occlusions and are vulnerable to detection errors. With improvements in object detection, association based tracking using detection results as input has recently become an important framework for multi-target tracking [17, 62, 67, 54, 6, 93, 22]. In a global association framework, a temporal sliding window is used to cache detection observations, and global data association methods, e.g. Linear Programming [43], Min-cost flow [106],
Especially, such pair-wise independence can be efficiently inferred online from premature trajectories generated from a lower association level, as we employ a hier- archical association framework[41] to progressively link input observations into longer trajectory. There are existing tracking methods[95, 65, 66] exploring trajectory dependency, but they usually use such dependency for better prediction of behavior or destination. In contrast, we resort to incorporate such dependency into a global optimization formulation for multi-target tracking to get higher accuracy. To this end, we propose a binary quadratic programming formulation of the tracking problem considering the underlying pairwise trajectory dependency among targets, and explore a way using convex relaxation to convert the new tracking formulation to a semidefinite programming problem, which can be solved efficiently by off-the-shelf methods. The contributions of this work are summarized as: 1. We propose a general optimization framework with a binary quadratic programming for- mulation to incorporate social factor into the multi-target tracking problem. 2. We propose a way using convex relaxation to convert the new combinatorial formulation with a large amount of equality constraints to a semidefinite programming problem and solve the tracking problem efficiently 59 3. By exploring simple pairwise trajectory dependency which can be efficiently inferred on- line, we improve the tracking performance from the state-of-the-art. The rest of the chapter is organized as follows: after a discussion of related work in section 4.2, we review the original tracking objective in section 4.3 and propose our new binary quadratic programming formulation in section 4.4 in comparison to those simplified classic linear formu- lation in section 4.3.2. Then, we show how to convert the new formulation to a semidefinite programming problem for efficient solution in section 4.5. A set of simple pairwise trajectory dependency terms, which can be efficiently inferred online, are introduced in section 4.6. Exper- imental results are shown and discussed in section 5.7. 4.2 Related Work Social factor provides important hints for tracking. [76] uses social structure to improve tracking in crowded scenarios. For extremely crowded scenes, objects are highly likely to follow regular motion patterns which can be learned as priors for trajectory estimation[47, 73, 108]. However, objects may move freely alone or in groups in normal scenes and a strong motion prior is less reliable. To compensate approximate linearity assumption, motion patterns are learned to ex- plain non-linear trajectories[97]. To track sport players, team context features learned offline are considered in motion models[58]. Given manually annotated training data for offline parameter learning, to predict where to go in next frame in pedestrian tracking, group states are considered in a motion model[95, 66]. In an association framework for semi-crowded scene, object trajec- tories and groups are jointly modeled for higher trajectory prediction accuracy [65]. However, these models[95, 65, 66] are more suitable for a specific scene, since motion model parameters learned offline may limit the generalization of the model. Pairwise context is used assuming 60 all pairs of objects in the scene have correlated motions in a maximum weight independent set optimization[90]. 
In recent years, social group behavior has been explored to improve tracking[69] and elemen- tary groups are online learned and integrated into the basic affinity model for tracking[21]. In [96, 98], CRF models are used to model the dependence among associations of tracklet pairs in pairwise terms, however only poor local optimal can be reached by local search method follow- ing a degenerated linear assignment solution by Hungarian algorithm without the pairwise de- pendence. In contrast, we provide a better and more efficient optimization solution for the global quadratic formulation. [80] also proposes a global frame data association method to model inter- action between trajectory and solve it using tensor power iteration, but our formulation is more general and can also apply to tracklet association, which is more efficient in practice. 4.3 Tracking objective We treat the multi-target tracking problem as the data association of the input object observations X = fo i g in a temporal sliding window. The input observation setX can be object track- lets(trajectory fragments) or primitive detection responses. The objective of data association based tracking is to get the association hypothesisT which maximizes the posterior probability (MAP) given the observationX : T =argmax T P (TjX ) =argmax T P (XjT )P (T ) (4.1) =argmax T Y i P (o i jT )P (T ) where the association hypothesisT is defined as a set of non-overlap single trajectory hypothesis, i.e. T =fT k jT k \T l =;;8k6= lg, and assuming that the observation likelihood terms are 61 conditionally independent given the hypothesisT . Then a single trajectory can be generated from the the ordered list of object observations in hypothesisT k =fo k 1 ;:::;o kn g. 4.3.1 Trajectory independence assumption It is hard to solve the MAP problem in eqn.4.1 directly, since the hypothesis space ofT is ex- tremely large. To get efficient solutions, existing methods usually simplify the original MAP problem of eqn.4.1 by further assuming the trajectory independence among different targets, which leads to the decomposition of the prior term in eqn.4.1 as: P (T ) = Y T k 2T P (T k ) (4.2) Based on such trajectory independence simplification in eqn.4.2, there are successful methods that convert the MAP problem to linear optimization problems which have polynomial time so- lution, such as the min-cost flow problem[106] and linear assignment problem [48] with efficient solution by Hungarian algorithm. However, the trajectory independence assumption in eqn.4.2 does not always hold consider- ing various social factors. In such cases, important cues from trajectory dependency for better association solution may be overlooked. In following sections, we propose a new formulation without oversimplifying the original MAP problem in eqn.4.1, which also has efficient solution. 4.3.2 Simplified linear association model Based on the trajectory independence assumption, there are a few classic linear association for- mulations for tracking, e.g. linear assignment problem by Hungarian algorithm [48] and cost-flow 62 u 1 u 9 u 4 u 3 u 5 u 7 u 6 u 8 u 2 v 1 v 3 v 2 v 4 v 5 v 7 v 8 v 9 v 6 S t (u i ->v i ) Observation edges (input tracklets) (V j ->U i ) or Possible linking edges (hypothesis) (S->u i ) or (V j ->t) Enter/Exit edges (hypothesis) u i v i Tracklet head Tracklet tail Source Sink Figure 4.1: An example of the cost-flow network with 9 input tracklets. 
Note that those 2 possible linking edges shown in red are highlighted just to demonstrate the linking dependency among moving companions. Intuitively, the red pair of linking edges, showing potential group motion, should be favored by our new optimization objective. network model [106, 68]. The cost-flow network model is a famous one because of its intuitive representation of multiple target tracking problem in a graph structure. To help illustrate our new formulation in later section 4.4, we borrow the variable definitions from the famous cost-flow graph. As shown in Figure 4.1, each flow path can be viewed as a single trajectory, and the amount of the flow sent from the source to the sink is equal to the number of total trajectories, and the total cost of the flow on the network correspond to the negative of the log-likelihood of the association hypothesis. Besides, the flow conservation constraint naturally guarantees that no flow paths share a common edge, and therefore no trajectories overlap. 63 In the graph shown in Figure 4.1, for every input observationo i 2X , there is an observation edge (u i ;v i ) with cost C i and flow f i , and an entrance edge (s;u i ) with cost C en!i and flow f en!i , and an exit edge (v i ;t) with cost C i!ex and flow f i!ex . Besides, for every possible linkingo i !o j , there is a linking edge (v i ;u j ) with costC i!j and flowf i!j . All flow variables are 0-1 binary variables, which are defined as: f i = 8 > > < > > : 1 9T k 2T;o i 2T k 0 otherwise false observation (4.3) f en!i = 8 > > < > > : 1 9T k 2T ,T k starts fromo i 0 otherwise f i!ex = 8 > > < > > : 1 9T k 2T ,T k ends ato i 0 otherwise f i!j = 8 > > < > > : 1 9T k 2T ,o j is right aftero i inT k 0 otherwise Beside, the validity of an association hypothesis for having non-overlap among trajectories are guaranteed by the flow constraints: f en!i + X j f j!i =f i =f i!ex + X j f i!j ; 8i (4.4) The likelihood function P (o i jT ) in eqn.4.1 ofo i being a true observation is modeled as a Bernoulli distribution as follows: P (o i jT ) = 8 > > < > > : 1 9T k 2T;o i 2T k 0 otherwise (4.5) 64 A single trajectory hypothesisP (T k ) in eqn.4.2 is modeled as a Markov chain as: P (T k ) =P (o k 1 ;:::;o kn ) (4.6) =P entr (o k 1 )P (o k 1 !o k 2 ):::P (o k n1 !o kn )P exit (o k 1 ) which includes entrance probabilityP entr , exit probabilityP exit and linking probabilitiesP (o k i ! o k i+1 ). With all probability terms and flow variables defined as above, the objective of eqn.4.1 can be formed into the minimization of a negative logarithm function: T =argmin T X i logP (o i jT ) X T k 2T logP (T k ) (4.7) =argmin T X i C i f i + X i C en!i f en!i + X i;j C i!j f i!j + X i C i!ex f i!ex subject to constraints in eqn.4.4, where C en!i = logP entr (o i ); C i!ex = logP exit (o i ) (4.8) C i!j = logP (o i !o j ); C i = log log(1) Although the formulation in eqn.4.7 can be efficiently solved by a min-cost flow algorithm. The possible trajectory dependence among targets due to social factors is ignored. Note that we use this graph and variable definition only for illustration of our formulation, and no known graph-based solution exists for our new formulation. In contrast, graph-based algorithm, e.g. min-cost flow method, provides efficient solution for the classic linear association 65 formulation [106, 68]. In other words, our new formulation can be derived directly from the original tracking objective eqn.4.1. 
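For reference, the linear baseline of section 4.3.2 is small in code. The sketch below turns the probabilities of eq. (4.8) into negative-log costs and solves the resulting tracklet assignment with the Hungarian method via SciPy; entry and exit are handled with dummy rows and columns, the min-cost-flow variant is omitted, and the probability estimators are assumed to be given.

import numpy as np
from scipy.optimize import linear_sum_assignment

def link_tracklets(link_prob, exit_prob, enter_prob):
    # Linear-assignment baseline (sketch): costs are -log probabilities as in eq. (4.8).
    # link_prob[i, j] = P(o_i -> o_j); dummy rows/columns model enter/exit so that
    # every tracklet end and every tracklet beginning receives exactly one match.
    n = link_prob.shape[0]
    big = 1e9
    cost = np.full((2 * n, 2 * n), big)
    cost[:n, :n] = -np.log(np.clip(link_prob, 1e-12, 1.0))          # C_{i->j}
    np.fill_diagonal(cost[:n, :n], big)                             # a tracklet cannot link to itself
    cost[:n, n:] = np.diag(-np.log(np.clip(exit_prob, 1e-12, 1.0))) + big * (1 - np.eye(n))
    cost[n:, :n] = np.diag(-np.log(np.clip(enter_prob, 1e-12, 1.0))) + big * (1 - np.eye(n))
    cost[n:, n:] = 0.0                                               # dummy-to-dummy matches are free
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if i < n and j < n]   # accepted o_i -> o_j links
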
4.4 Binary quadratic programming formulation Based on the variable definitions shown in last section 4.3.2, we introduce a binary quadratic formulation for tracking: T =argmax T X i C 0 i f i + X i C 0 en!i f en!i (4.9) + X i;j C 0 i!j f i!j + X i C 0 i!ex f i!ex + X (i;j)(m;n) C 0 i!j;m!n f i!j f m!n + X (i;j);k C 0 i!j;k f i!j f k subject to constraints in eqn.4.4 and 0-1 value constraint in eqn.4.3. To be consistent with the original MAP tracking objective in eqn.4.1 and simplify later for- mulation, our objective is chosen as a maximization of a logarithm function in contrary to the minimization in eqn.4.7. Therefore, we have C 0 i = log(1) log; C 0 en!i = logP entr (o i ) (4.10) C 0 i!j = logP (o i !o j ); C 0 i!ex = logP exit (o i ) Note that the last two quadratic terms of eqn.4.9 are newly introduced to account for the potential linking dependence based on possible trajectory dependency. The termC 0 i!j;m!n is introduced to reward a pair of possible linkingfo i ! o j ;o m ! o n g which produces a pair 66 of dependent trajectories. The other reward termC 0 i!j;k is introduced to facilitate a possible linkingfo i !o j g if both ends of the linking have the same dependent trajectory in their context. With the newly introduced quadratic terms, all kinds of social factors can be considered and incorporated to the tracking objective to look beyond linear association for better tracking solu- tion. In section 4.6, we will explore some simple pairwise trajectory dependency terms and show improvement in experiments. Our new formulation is a binary quadratic programming problem, which is an NP-hard com- binatorial problem in nature. However, we can convert it to a semidefinite programming problem and use convex relaxation to solve it efficiently. 4.5 Semidefinite programming solution Assume we get an observation setX = fo i g, we want to consider trajectory dependency to associate them and assume such trajectory dependency is also available by considering social factor in some way (e.g. sec.4.6). Let the number of tracklets observations ben =jXj and the number of candidate linking fo i !o j g bem. In practice, we findm<<n 2 by applying the simple non-overlap rule: m = X ij I ij ; I ij = 8 > > < > > : 1 0< ij < threshold 0 otherwise where ij denotes the time gap between the end ofo i and the start ofo j . As a combinatorial optimization problem, the problem formulation in eqn.4.9 has 3n+m ’0- 1’ flow variables and 2n linear equality constraints as in eqn.4.4 and 3n+m ’0-1’ constraints in 67 eqn.4.3. We now reform this NP-hard binary quadratic optimization problem to a semidefinite programming problem. To simplify the formulation, we first divide the flow variables into two vectorsf s andf q , where 2n-dimensionalf s consists of all those flow variables only appearing in linear terms of eqn.4.9 and (m+n)-dimensional f q is formed by the rest flow variables which may show in quadratic terms in eqn.4.9. Thus we havef s = (f en!i ;:::f i!ex ) T andf q = (f i ;:::f i!j ) T . The objective of eqn.4.9 can be written as: T =argmax T fs T f s +q T f q +f T q Qf q g (4.11) =argmax T (q T s T ) f q f s + (f q T f s T ) 2 6 6 4 Q 0 0 0 3 7 7 5 f q f s where the vector (q T s T ) records the parametersC 0 in the linear terms and the symmetric matrix Q records parameters from the quadratic terms of eqn.4.9. 
By denotingb T = (q T s T ),f T = (f T q ;f T s ) andB = Q 0 0 0 , the objective can be written as: argmax T b T f +f T Bf (4.12) to convert to a homogeneous form by defining x T = (f T 1) = (x 1 ;x 2 ; ;x 3n+m+1 = 1) ; C = 2 6 6 4 B 1 2 b 1 2 b T 0 3 7 7 5 (4.13) 68 andX =xx T , the objective can be written in a homogeneous quadratic form as: argmax T x T Cx (4.14) =argmax T tr(CX); X 0 subject tox i 2f0; 1g,x 3n+m+1 = 1 and the flow constraints in eqn.4.4. HereX 0 means that X is positive semidefinite. For the constraints ofx i 2f0; 1g, we can model them as quadratic constraints: x i 2 x i = 0 (4.15) and their equivalent matrix forms are: x T A i x =tr(A i X) =e (4.16) wheree is a simple constant number and only non-zero elements ofA i are shown: A i = 0 B B B B B B B B B @ . . . . . . . . . a i;i = 1 a i;3n+m+1 =0:5 . . . . . . . . . a 3n+m+1;i =0:5 e 1 C C C C C C C C C A Similarly, the linear flow constraints in eqn.4.4 andx 3n+m+1 = 1 can also be converted to the form oftr(A l X) =e. Note that the original binary quadratic problem implies a non-convex rank constraintrank(X) = 1, asX =xx T . By dropping this non-convex constraint, we obtain the relaxed convex problem 69 and can finally write the tracking problem in eqn.4.9 as a semidefinite programming problem in the primal form: argmax T tr(CX) (4.17) subject to: A(X) =a ; X 0 where A(X) = 2 6 6 6 6 6 6 6 4 tr(A 1 X) tr(A L X) 3 7 7 7 7 7 7 7 5 (4.18) The dual form can be written as: argmin T a T y (4.19) subject to: A T (y)C =Z ; Z 0 where A T (y) = L X i=1 A i (4.20) Now we have obtained the primal-dual pair of the tracking problem after the convex relaxation. The relaxed semidefinite programming problem can be efficiently solved by off-the-shelf method, such as the primal-dual barrier method. In our implementation, we use the CSDP package[13] 70 which can produce optimal solution to the relaxed convex problem efficiently(in less than 30 iterations). 4.5.1 Rounding scheme to feasible solutions The solution matrixX from the semidefinite programming problem is the optimal solution for the relaxed convex problem. Ifrank(X) = 1, then the solution corresponds to the exact solution of the original problem in eqn.4.9, which is the case when all quadratic terms are zero because no available trajectory dependency is inferred online. And the formulation falls back to linear objective for min-cost flow problem or Hungarian algorithm. So our semidefinite programming provides an alternative efficient solution to the classic linear optimization formulations which do not consider trajectory dependency. Ifrank(X)6= 1, we need to find a good approximate solution for the original problem. First, we do eigenvalue decomposition on X to get the eigenvector v 0 corresponding to the largest eigenvalue 0 of matrixX. Then, we get an approximate solution ~ x = p 0 v 0 . To meet the 0-1 value constraint, we first do deterministic rounding to round each element value of~ x to its nearest integer of 0 or 1. In most cases, such deterministic rounding can produce feasible solutions which meet constraints in eqn.4.4. In cases that deterministic rounding may produce infeasible solutions, we do a randomized rounding for each element value ~ x i by use its value as the probability of rounding to 1. We do a few iterations in such randomized rounding procedure and pick the feasible solution with the largest objective value. Then we can map the rounded 0-1 valued vector to the final tracking result e T . 
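As a sketch of the rounding scheme just described: assuming the relaxed SDP has already been solved (by CSDP in our implementation, or any off-the-shelf SDP solver) and its solution matrix X is available, the snippet below (Python/NumPy) extracts the rank-1 approximation and applies deterministic rounding, falling back to randomized rounding when constraints are violated. The callables is_feasible and objective are placeholders for the flow constraints of eqn. 4.4 and the objective of eqn. 4.9.

# Sketch of Sec. 4.5.1: recover a feasible 0-1 flow vector from the relaxed
# SDP solution X (symmetric PSD, dimension 3n+m+1).
import numpy as np

def round_sdp_solution(X, is_feasible, objective, n_random=50, seed=0):
    # rank-1 approximation: x_tilde = sqrt(lambda_0) * v_0
    eigvals, eigvecs = np.linalg.eigh(X)       # eigenvalues in ascending order
    v0 = eigvecs[:, -1]
    x_tilde = np.sqrt(max(eigvals[-1], 0.0)) * v0
    if x_tilde[-1] < 0:                        # fix the sign so the homogenizing
        x_tilde = -x_tilde                     # entry x_{3n+m+1} tends toward +1
    probs = np.clip(x_tilde, 0.0, 1.0)

    # deterministic rounding to the nearest 0/1 value
    x = (probs >= 0.5).astype(int)
    x[-1] = 1
    if is_feasible(x):
        return x

    # randomized rounding: each entry is used as the probability of rounding
    # to 1; keep the feasible sample with the largest objective value
    rng = np.random.default_rng(seed)
    best, best_val = None, -np.inf
    for _ in range(n_random):
        cand = (rng.random(probs.shape) < probs).astype(int)
        cand[-1] = 1
        if is_feasible(cand):
            val = objective(cand)
            if val > best_val:
                best, best_val = cand, val
    return best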
71 4.6 Online inferring trajectory dependency In order to automatically infer trajectory dependency online reliably, we only consider confident observations. We employ a hierarchical association framework [41] to gradually generate longer and more confident tracklet observationsfo i g. The initial tracklets set are produced from the lowest-level association by a simple dual-threshold method [41]. Conservative lower level as- sociations can be viewed as filtering stages to reduce false alarms while not introducing many wrong associations. Note that observations in lower levels can be highly fragmented and confi- dent trajectory dependency can hardly be inferred for further association. Assume we have a set of confident tracklet observationsfo i g after several levels of asso- ciation. For each o p , another observation o q is considered as its dependent trajectory o q 2 DT p if they meet the following criteria: 1) o p and o q have sufficient l 10 frames tempo- ral overlap (co-existence); 2)their average spatial distance during the temporal overlap is less than the larger target height; 3)their overall moving directions during the overlap are similar, e.g.arccos ~ vp~ vq (k~ vpkk~ vqk) 4 , where~ v p is the spatial movement vector ofo p during the overlap . Using the pairwise trajectory dependency terms defined above, the two quadratic terms in eqn.4.9 can be given as: C 0 i!j;m!n = (4.21) 8 > > > > > > < > > > > > > : log(2 + ~ v i ~ vm k~ v i kk~ vmk ~ u j ~ un k~ u j kk~ unk ) m2DT i ;n2DT j K i =m;j6=n or i6=m;j =n 0 otherwise 72 C 0 i!j;k = (4.22) 8 > > < > > : log(2 + ~ v i ~ v k k~ v i kk~ v k k ~ u j ~ u k k~ u j kk~ u k k ) k2DT i ;k2DT j 0 otherwise where is a weight factor to set the relative importance of the quadratic term. K > 100 is a punish factor to additionally prevent hypothesis violating non-overlap constraints. ~ v i and ~ u i stand for the spatial motion vector at the tail and head part of the coexistence respectively. The inner product term measures the motion similarity between a pair of trajectories. 4.7 Implementation details We use a sliding-temporal window to cache frames for association which produces near-online result but a constant latency. Besides, we use no offline data for training which is suitable for a near-online scenario. In each sliding window, we employ a hierarchical association framework for tracking and the initial tracklet set are produced from the lowest-level association by a simple dual-threshold method [41]. After the lowest-level, we adopt the binary quadratic programming formulation introduced in section 4.4 to gradually associate tracklets to more complete trajecto- ries. To estimate the quadratic terms in eqn.4.9, the online inferring process in section 4.6 for trajectory dependency is conducted. We use a motion affinity term which measures linear smoothness by connecting two tracklets as: P motion (o i !o j ) =N(c tail i +v tail i tc head j ; j ) N(c head j v head j tc tail i ; i ) (4.23) 73 whereN(x; ) denotes normal distribution, t is the frame gap between the tail ofo i and the head ofo j ;v i denotes the velocity which can be calculated from positionsc t i of the head or tail part ofo i . 4.7.1 Region based appearance affinity estimator Appearance are critical cues to associate observations correctly. Local regions seem effective for humans to measure affinity between objects, especially during frequent occlusions and targets are partially visible [81, 48, 57]. 
Therefore, same as [48, 97, 99] we represent an observation appearance by local features and use a discriminative weight learning method. To deal with certain part misalignment, we coarsely sample overlapped regions within re- sponses’ range. We extract a RGB color histogram and a HOG feature 1 for each sampled region and use histogram correlation to measure feature similarity. For example, we sample 21 regions and get 42 local features to measure appearance affinity. We denote by h q (x) a local affinity estimator, which takes a pair of responses as inputx : (r ;r ) and use the local region to extract color histogram or HOG feature, compute feature correlation and output a normalized affinity value in [1; 1]. Local features are assigned different discriminative weights q by an online discriminative learning process [48], which are then normalized and combined to a total affinity estimator H(x) = P q q h q (x). We use the learned appearance affinity estimator to estimate the appearance term ofP (o i ! o j ) =P appe (o i !o j )P motion (o i !o j ) for eqn.4.10. 1 We use 8 bins for each channel and concatenate them to a 24-bin RGB color histogram. HOG feature is formed by concatenating 8 orientation bins in22 cells over the region. 74 4.8 Experimental Results The performance of our tracking method, named as BQPT, is evaluated on four public pedestrian datasets: PETS 2009[1], ETH mobile[17], Trecvid 2008[2] and TownCentre [14]. Quantitative comparisons with several state-of-the-art methods and visualized results are shown; some video results are provided in the supplementary material. As input raw detection would influence track- ing performance, for fair comparisons, the same detection results, provided by authors of[49] and [69], are used for compared methods. Quantitative evaluation is based on the commonly used evaluation metrics 2 [57]. Note that methods that model social factors are marked in red in tables. Especially, regarding those parameters for inferring trajectory dependency in section 4.6, we conducted a systematic analysis of their effects on tracking performance on Trecvid and Town- Centre. We show 3 different sets of criteria setting for the trajectory dependency inference in section 4.6, denoted as model A, B, C. Model A sets the temporal frame overlapl > 20 and the moving direction difference 6 , which is a strict criteria; model B is a moderate criteria, which setsl> 10 and the direction difference 5 ; model C setsl> 5 and the direction difference 4 , which is the loosest one. As the classic cost-flow method [106] used an outdated experiment setting and did not report tracking results using discriminative appearance models, we implemented the cost-flow method, denoted as cost-flow*, which uses the same hierarchical association and appearance models as ours for fair comparison. The only difference between cost-flow* and ours is that ours use the new binary quadratic formulation by considering trajectory dependency. 
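As a concrete illustration of the region-based affinity estimator of Sec. 4.7.1, which both cost-flow* and BQPT share, a minimal sketch is given below (Python/NumPy). Only the color-histogram part is shown (the HOG regions are analogous); the 21-region grid layout and the uniform default weights are stand-ins, since in the actual system the weights are set by the online discriminative learning of [48], and the two responses are assumed resized to a common size.

# Sketch of the region-based appearance affinity H(x) = sum_q alpha_q h_q(x).
import numpy as np

def rgb_hist(patch, bins=8):
    """24-bin color descriptor: an 8-bin histogram per RGB channel."""
    chans = [np.histogram(patch[..., c], bins=bins, range=(0, 256),
                          density=True)[0] for c in range(3)]
    return np.concatenate(chans)

def correlation(h1, h2):
    """Histogram correlation in [-1, 1]."""
    a, b = h1 - h1.mean(), h2 - h2.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def sample_regions(h, w, grid=(7, 3)):
    """Hypothetical layout: 21 overlapping regions on a 7x3 grid."""
    rs, cs = grid
    rh, cw = h // (rs + 1), w // (cs + 1)
    return [(r * rh, c * cw, 2 * rh, 2 * cw)
            for r in range(rs) for c in range(cs)]

def appearance_affinity(resp_a, resp_b, weights=None):
    """resp_a / resp_b are HxWx3 crops of two responses, resized to one size."""
    scores = []
    for (y, x, rh, rw) in sample_regions(*resp_a.shape[:2]):
        scores.append(correlation(rgb_hist(resp_a[y:y+rh, x:x+rw]),
                                  rgb_hist(resp_b[y:y+rh, x:x+rw])))
    scores = np.array(scores)
    w = np.full(len(scores), 1.0 / len(scores)) if weights is None else weights
    return float(w @ scores)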
2 Recall" and Precision" of detection performance after tracking; false alarms per frame(FAF)#; number of ground truth trajectories(GT); percentage of mostly tracked(MT)" trajectories with tracked part 80% trajectory length, partially tracked(PT) trajectories with tracked part between 20% and 80% trajectory length, and mostly lost(ML)# trajectories with tracked part 20% trajectory length; fragments(Frag)#, the total number of times ground truth tracks are interrupted; id switches(IDS)#, the total number of times that generated tracks change their matched ground truth tracks. 75 4.8.1 Results on the ETH dataset As in [49, 99], only the data captured by the left camera for sequences ”BAHNHOF” AND ”SUNNY DAY”, are used in our experiment to demonstrate monocular tracking performance. Quantitative comparison is shown in table4.1. There are a large number of dependent trajectories in this dataset. However, due to the low camera angle and the frequent fast shift of camera, not all trajectory dependency can be reliably captured and fully utilized. However, with the moderate parameter setting in model B, our tracker successfully recovered some miss detections due to scene noise and linked quite a few fragmented tracks. Our tracker achieved the highest mostly tracked trajectories MT, lowest Frag. 4.8.2 Results on the Trecvid 2008 dataset Trecvid 2008 is a very challenging dataset captured indoor at a busy airport with high crowd density and frequent occlusions. Each video clip has 5000 frames; we use 9 videos from Cam1, Cam3, Cam5 for evaluation, same as in [57, 49, 99]. This dataset contains a large number of inter-trajectory dependency. Method Recall" Precision" FAF# MT" PT ML# Frag# IDS# PRIMPT[49] 76.8% 86.6% 0.891 58.4% 33.6% 8.0% 23 11 CRF [98] 79.0% 90.4% 0.637 68.0% 24.8% 7.2% 19 11 DP[68] 67.4% 91.4% - 50.2% 39.9% 9.9% 143 4 DCCRF[59] 77.3% 87.2% - 66.4% 25.4% 8.2% 69 57 Cost-flow* 75.2% 87.8% 0.807 61.6% 29.4% 9.0% 35 16 BQPT-B 78.2% 91.2% 0.572 69.8% 21.4% 8.8% 18 7 Table 4.1: Results comparison on the ETH dataset with 124 ground truth(GT) tracks. 76 Method Recall" Precision" FAF# MT" PT ML# Frag# IDS# OLDAM[48] 80.4% 86.1% 0.992 76.1% 19.3% 4.6% 322 224 DPAM [99] 78.7% 88.2% 0.807 73.0% 20.9% 6.1% 253 149 CRF[98] 79.8% 87.8% 0.857 75.5% 18.7% 5.8% 240 147 Cost-flow* 79.3% 86.7% 0.925 77.1% 17.9% 5.0% 261 162 BQPT-B 84.5% 89.3% 0.785 81.6% 14.5% 3.9% 221 136 BQPT-A 80.6% 86.4% 0.932 77.3% 18.0% 4.7% 251 151 BQPT-C 79.1% 85.9% 0.994 74.5% 19.2% 6.3% 281 243 Table 4.2: Results comparison on the Trecvid 2008 dataset with 919 GT tracks. Method Recall" Precision" FAF# GT MT" PT Frag# IDS# CEM[7] - - - 23 82.6% 17.4% 21 15 DPAM [99] 92.8% 95.4% 0.259 19 89.5% 10.5% 13 0 Cost-flow* 96.6% 97.8% 0.171 19 91.2% 8.8% 10 7 TensorPower [80] 97.7% 98.9% - 19 94.7% 5.3% 6 4 BQPT-B 98.6% 98.2% 0.165 19 95.8% 4.2% 3 0 Table 4.3: Results comparison on the PETS 2009 dataset. Table4.2 gives quantitative comparison results. Our BQPT with parameter setting B gives the best result over every metric, which demonstrates the effectiveness of modeling trajectory depen- dency for tracking. An interesting point is that with a too strict model A, our BQPT performance drops and the result is comparable to the linear formulation with cost-flow model. This is the exact the case when little trajectory dependency is utilized for tracking, which falls back to the linear formulation. For model C, too much trajectory dependency noise is captured by a loose criteria which hurts the tracking performance. 
77 Method MT" ML# Frag# IDS# SGB[69] 83.2% 5.9% 28 39 EGM [21] 85.5% 5.9% 26 36 BQPT-A 88.3% 4.5% 22 28 BQPT-B 91.2% 3.5% 17 23 BQPT-C 87.3% 4.2% 23 31 Table 4.4: Results comparison on the Towncentre dataset. Method Trecvid 2008 ETH PETS 2009 DPAM[99] 6 10 22 Cost-flow* 6.31 11.5 19.4 BQPT 6.52 12.6 20.2 Table 4.5: Average computational Speed" (fps) on three datasets 4.8.3 Results on the PETS 2009 dataset The same PETS 2009 video clip is used as in [7]. The quantitative comparison results are shown in table 3 5.1. Note that our method outperforms a recent method [80] which also models the inter- action between trajectories. Some examples of tracking results on a crowded sequence PETS09- S2L2 in MOT challenge 2015 is shown in figure 4.2. 4.8.4 Results on the TownCentre dataset To compare our method with two recent methods [69, 21] which also explore social factors for tracking, we use the same first 3 minutes of the TownCentre video and the same input tracklets generation method[21] as them. The quantitative results are shown in Table 4.4. Our tracker outperforms the other two on every metric. 3 Results of earlier methods are reported by their published work. 78 Frame 52 Frame 71 Frame 90 Frame 109 Figure 4.2: Tracking results of BQPT B on PETS09-S2L2 sequence on MOT challenge 2015. Best viewed in color. 4.8.5 Computational Speed Our method is implemented using C++ on a PC with 3.0GHz CPU and 4GB memory. The CSDP package for semidefinite programming is written in C. The average running speeds (frames per second) of the BQPT tracker for 3 datasets are summerized in table 4 4.5. Note that the other two methods use linear association formulation. It shows that considering the social factor in our binary quadratic formulation with solutions using semidefinite programming is as efficient as linear formulation. 4 Earlier methods’ speeds are reported by their published work. 79 4.9 Conclusion In this chapter, we proposed a binary quadratic programming formulation to consider possible social factor for multi-target tracking. Our experimental results with simple pairwise trajectory dependency estimated online show that we can improve tracking performance by considering tra- jectory dependency and get efficient solutions for it with the semidefinite programming approach. Overall, our formulation provides a useful tool to consider various social factors to further im- prove tracking accuracy. 80 Chapter 5 A Hybrid Approach of Tracking Multiple Articulating Humans Multi-target tracking has been an important computer vision topic [66, 14, 94, 8, 48], with focus on tracking pedestrians. However, humans are not always in a pedestrian pose while perform- ing various activities such as sitting and bending. In this chapter, we aim at tracking multiple articulating humans automatically from a single camera; this is less studied and different from pedestrian tracking, but has broader areas of application. Different from human pose estimation and pose tracking methods [102, 83, 101, 77, 70], which aim at locating detailed body parts of isolated humans imaged with high resolution or in stylized poses, our approach aims at locating multiple human targets in moderate resolution under various unknown pose changes. Besides, this automatic tracker for multiple human targets differs from visual tracking for a single target which requires manual initialization [35, 19, 50]. Some examples of our tracking results are shown in Figure 5.1. 
5.1 Motivation
In recent years, category detection based tracking (CDT) has achieved high performance for pedestrian tracking [84, 14, 82]; such methods link category-level detection responses into tracks and rely on offline-trained pedestrian detectors with high precision and recall rates.
[Figure 5.1 shows example frames: (a) iLIDS, frames 177 and 194; (b) CMU action, frames 1504 and 1662.]
Figure 5.1: Tracking results on iLIDS (a) and CMU action (b).
Unlike pedestrian detection, training reliable detectors offline for articulated humans remains a challenging problem due to much larger intra-class variations, e.g. various non-upright poses. To train a human detector offline for a predefined pose (e.g. the upright pose), some intra-class invariant patterns can be learned from large amounts of annotated training samples using HOG or JRoG features (e.g. [40, 26]). To deal with intra-class variations due to deformation, deformable part based models [29] allow object parts to have flexible relations. However, considering the range of human pose articulations from upright to non-upright, learning a universal human detector becomes difficult. Instead, since tracking is an online scenario, it is more practical to learn human detectors with instance-specific discriminative features online for each target from a limited number of online samples. As heavy self-occlusion can cause semantic body parts to be only partially visible, our online learned detector models each instance as a collection of small patches. By recovering those salient patches, the instance-specific detector retrieves each tracked target, as shown on the top-right of Fig. 5.2.
We propose a tracklet (track fragment) association based tracking framework, termed the HYBRID tracker, which combines an offline-learned category-level detector with online-learned instance-specific detectors (ISD) to track articulating humans. An overview of our approach is illustrated in Figure 5.2. Given cached pedestrian detections in a sliding temporal window [54], a global association method, e.g. the Hungarian or min-cost network flow algorithm, is employed to produce initial reliable pedestrian tracklets. To recover articulated humans missed by offline-trained pedestrian detectors, instance-specific human detectors are learned online by collecting training samples from the existing tracklet pool. By applying these online-learned instance-specific detectors (ISD), initial reliable pedestrian tracklets are extended to recover track fragments missed during large pose changes. In this way, the affinity between tracklet pairs is refined for further tracklet association.
The contributions of this chapter are:
• A hybrid tracking framework associating both category-level detection responses and instance-specific detection responses for multiple articulated humans.
• A patch-based object detection method for tracking a specific instance with large deformations.
• An appearance affinity evaluation model for associating detection responses in different poses with misaligned parts.
The rest of this chapter is organized as follows; each technical section illustrates some part of the overview figure 5.2. After a discussion of related work in Section 5.2, we describe the structure and features of the instance-specific detector (ISD) in Section 5.3. The online learning method of the ISD is presented in Section 5.4. How the ISD is applied for instance detection based tracking is given in Section 5.5. The association method for extrapolated tracklets is described in Section 5.6.
Experimental results are discussed in Section 5.7, followed by conclusion in Section 5.8. 5.2 Related Work For multi-human tracking, one of the most successful approaches is association based track- ing, which associates detection responses into tracks according to appearance and motion cues [18, 69, 92, 105]. Focusing on pedestrians, predefined human parts are used to model the ap- pearance of humans and a classic visual tracker is incorporated conservatively in the association framework [99]. However, neither predefined parts nor classic visual tracking methods(e.g.mean- shift tracker) works well for articulating human with large and rapid unknown pose changes. Targeted at deformable objects in visual tracking, which requires manual initialization, there has been recent work using local patches [35, 19, 50]. However, visual trackers do not utilize information from a global association view point; when a target gets close to similar ones(e.g.in human crowd) or in a cluttered background, degraded pixel precision can cause trackers to drift easily. 84 1 Video Fragment 0 In temporal Sliding Window 0 Pedestrian Tracker ... +1 -1 ... ... Learning Patch- based ISD ... ... Instance detection based Tracking(IDT) by Online Learned Instance-specific Detector(ISD) Association of Extrapolated Tracklets Tracklet Legend Response Pedestrian tracklets Final Articulated Human tracking results Sample Collection instance i instance 0 instance 0 instance i time Detections by offline learned Pedestrian detector Category detection based Tracking(CDT) Apply ISD in IDT to extrapolate tracklets Iterative process between detection with ISD and Online learning of ISD Figure 5.2: The overview of our HYBRID tracker. We first get pedestrian detection responses in a sliding temporal window by applying offline trained pedestrian detector [40]. Then, a pedestrian tracker [41] is used to link pedestrian detections into reliable tracklets(trajectory fragments) pool. To recover articulated humans missed by offline trained pedestrian detectors, instance-specific human detectors(ISD) are learned online from online collected training samples. By applying ISD, initial pedestrian tracklets are extended to track large pose changes. The online learning procedure of ISD iterates with the ISD detection process to adapt to appearance changes(shown in the middle loop). By instance detection based tracking (IDT), the affinity between tracklet pairs is refined for further association to produce final tracking results. Best viewed in color. 85 Part based models have achieved great success on object detection (including humans in dif- ferent poses) by finding parts and their correlations[29, 64, 102]; however, they requires a large amount of training samples on predefined poses and high resolution, which are often not applica- ble to tracking multiple articulated humans. Assuming detection is given, human pose estimation techniques target at finding human body parts precisely [101, 77, 83], but such an assumption does not hold for human tracking; also such methods are computationally expensive. Hough forests have shown good results in detecting objects of a specific category [33, 12]. The forest structure is offline learned by optimizing binary tests with large amount of training samples, which is not suitable for online detection considering efficiency and limited training data. 5.3 Patch-based Instance-Specific Detector We introduce the structure and features of our ISD in this section. 
Unlike popular part-based models for category-level detection, our online learned instance-specific detector uses smaller patches instead of semantic parts to deal with large appearance variations during pose changes. Each object can be modeled as a mosaic of patches with a changing object centroidc t i . Each patch z k has a size (h;w) set as (11; 11), its own center position atpos c k , appearance featuresf k and a hidden label of belonging to the object or not (l k = +1 or1). Each positive patch also records its centroid displacement from the object centroidv k =c t i pos c k . P k , denoting the probability of a patch belonging to the target P k = P (l k = +1), can be estimated by an online learned Bayesian patch classifier, which consists of two layers of patch filters. 86 Online collected positive training samples, described in Section 5.4.1, inevitably contain background noise from the imprecise bounding box representation. To suppress such noise, the first layer is a color discriminative filter, designed to differentiate patches with different levels of color distinctiveness to the background. After applying the first filter, patches get initial proba- bility of belonging to the objectP 0 k , which is then used as a prior for the second layer. The second layer uses Random Ferns classifier with a large set of local binary features, which can be learned online efficiently, to estimate the finalP k and centroid displacement distributions d(v k ). 5.3.1 Discriminative color filter The color filter use superpixels[4] as basic elements, which show boundary respect and useful visual cues. The color filterF is an online learned codebook consisting of clusters of superpixels fC m g, clusters’ center color featuresff C m g and clusters’ color discriminative scoresfds m g. To use the color filter, a local search window is over-segmented to generate a set of superpixel fsp j g. By referring to the codebookF, we classify each new superpixelsp j to the nearest cluster C m and estimate the query superpixel’s color discriminative score as ds(sp j ) =ds m expf f j f C m avg q2Cm kf q f C m k g (5.1) whereavg is the average operator,f j andf q denote color-histogram features of superpixels. After applying the color filter, all pixels in a superpixel will get the same color discriminative score: ds(pixel pos ) =ds(sp j );8pixel pos 2sp j (5.2) 87 whereds(sp j ) is estimated by using the color filter as in Eqn.5.1. Then, the initial probability P 0 k of patchz k belonging to the object is approximated by its center pixel’s color discriminative scoreds(pixel pos c k ). 5.3.2 Filter with local binary features To make the patch classifier more discriminative without hurting its learning efficiency too much, we additionally consider local binary features and adopt random ferns classifier [63] as the second layer. We generateN binary featuresF =ff 1 ;:::;f N g by randomly selecting a pair of locations (pos a ;pos b ) in patchz k , a feature channelfea (intensity, RGB-colors, gradient features, 9-bin HOG features) and a threshold, as in eqn.5.3. f i = 8 > > < > > : 1 ifI(fea;pos a )I(fea;pos b )> 0 otherwise (5.3) Random ferns classifier dividesN binary features intoM groupsfF 1 ;:::;F M g, and assumes con- ditional independence among feature groups given the class label. Each fern is a weak classifier using a group of featuresF m . 
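A minimal sketch of how the random binary tests of eqn. 5.3 could be generated and grouped into ferns is shown below (Python/NumPy). The patch size (11x11) and the number of ferns (10, as in Sec. 5.4.3) follow the text; the total number of tests per fern, the channel count and the threshold range are illustrative assumptions.

# Sketch of eqn. 5.3: random binary tests on an 11x11 patch, grouped into ferns.
import numpy as np

PATCH_H, PATCH_W = 11, 11          # patch size from Sec. 5.3
N_FERNS, TESTS_PER_FERN = 10, 13   # 10 ferns as in Sec. 5.4.3; S = 13 is illustrative

def generate_ferns(n_channels=14, seed=0):
    """Each test is (pos_a, pos_b, channel, threshold), as in eqn. 5.3.
    n_channels is an assumed count for intensity + RGB + gradient + 9 HOG bins."""
    rng = np.random.default_rng(seed)
    ferns = []
    for _ in range(N_FERNS):
        fern = []
        for _ in range(TESTS_PER_FERN):
            pa = (int(rng.integers(PATCH_H)), int(rng.integers(PATCH_W)))
            pb = (int(rng.integers(PATCH_H)), int(rng.integers(PATCH_W)))
            ch = int(rng.integers(n_channels))
            theta = float(rng.uniform(-16, 16))   # illustrative threshold range
            fern.append((pa, pb, ch, theta))
        ferns.append(fern)
    return ferns

def fern_leaf(patch, fern):
    """Map a patch (H x W x n_channels array) to a leaf index in [0, 2^S)."""
    idx = 0
    for (pa, pb, ch, theta) in fern:
        bit = patch[pa[0], pa[1], ch] - patch[pb[0], pb[1], ch] > theta
        idx = (idx << 1) | int(bit)
    return idx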
WithS =N=M binary features in groupm, a patchz k is mapped into one (j th of 2 S )leaf node in them th fern and gets the likelihood of belonging to the object (l k = +1) as: P (F m jl k = +1) = v m;j + P 2 S b=1 (v m;b +) (5.4) 88 wherev m;j denotes the value of thej th bin, where patchz k is mapped into, of the multinomial distribution for positive class (l k = +1) in them th fern. The overall likelihood 1 forz k to have positive labell k = +1 is given as: P (Fjl k = +1) = M Y m=1 P (F m jl k = +1) (5.5) The final probabilityP k of patchz k belonging to the object is estimated as a posterior: P k =P (l k = +1jF ) = P (Fjl k = +1)P 0 (l k = +1) P c=f+1;1g P (Fjl k =c)P 0 (l k =c) (5.6) where the prior term is from the initial probability P 0 k output by the color filter and likelihood term is from eqn.5.5 by binary features. 5.3.3 Centroid displacement distribution To estimate object centroid from patches’ centers, centroid displacement is recorded in random ferns classifier for positive patches with labels l k = +1. For efficiency, our detector does not further differentiate positive patches within a fern node by centroid displacement. Instead, at each fern leaf node, we keep a centroid displacement distributiond m;j (v), represented by a 2D histogram with unit cells of equal length and width. For each fern node (e.g. j th node ofm th fern), we buildd m;j (v) by accumulating centroid displacement vectors of positive patches to its object centroid. The center cell of the 2D histogram d m;j (v) denotes zero displacement. To tolerate imprecise votes, the voting distribution uses unit cells of 4 4 pixels. All unit cells are uniformly initialized with value 1. 1 In contrast, binary features are used, in [35, 44] for visual tracking, by simply averaging weak hypotheses from weak fern classifiers, which tends to misclassify patches and cause tracking drift in a cluttered background. 89 Time Y X t1 t2 t3 Tracklet end In Track Response Existing Track Legend Conflict Tracks for T i Target Tracklet T i Background of T i [-] Set [+] Set Confidently Tracked Responses in T i [-]Training Samples [+]Training Samples Figure 5.3: Illustration of online sample collection for a target trackletTrk 1 . Conflict tracks of the redTrk 1 contain the orange track and the green one. 5.4 Online Learning of Patch Classifiers for ISD We describe online sample collection and learning method for the patch-based ISD in this sec- tion. To adapt to appearance changes, the following learning process iterates with the instance detection procedure in section 5.5 to update the ISD, as shown in the loop of figure 5.2. 5.4.1 Online Sample Collection For an online learning approach, how to effectively collect training samples online is an important issue. For a target tracklet trk i , online samples are collected from current tracklets pool T . As shown in Figure 5.3, positive samples are defined as confident responses from the same tracklet as S + =fr t i jr t i 2trk i ^conf(r t i )>g (5.7) whereconf(r t i ) is detection confidence by the pedestrian detector, and is set to 0.7. 90 We define the conflict tracks i oftrk i as spatially close tracks which co-exist withtrk i in timet2 [head i ;tail i ] for at least one frame. The conflict criteria argues that one target cannot belong to two tracks at the same time. Besides, we only consider close-by tracks, as only local discriminative power rather than a global one is needed for ISD. 
The negative sample set is to differentiate a target with others in conflict tracks i as well as its background B i , which is defined as S =fr t k jr t k 2B i g[fr t k jr t k 2trk j ^trk j 2 i g (5.8) At learning timet learn , an outdated factorf t learn t ; = 0:9g is assigned as the weight of samples with time stampt to lower the importance of outdated samples. 5.4.2 Learning discriminative color filter To learn the color discriminative filter, online collected positive samples are first over-segmented into positive superpixel set asfsp + n g with featuresf(a + n ;f + n )g; while negative samples are turned into negative superpixel setfsp n g with featuresf(a n ;f n )g, wherea andf denote the pixel area and color-histogram feature of a superpixel respectively. Since the positive superpixel set may contain background superpixels imported from bound- ing box and less distinctive foreground superpixels to the background are less informative for patch-based object detection, we want to learn a codebookF which can be referred to differenti- ate a query superpixel by color discriminative score. As shown in Algorithm 5, we first do clustering on the set of collected positive and negative superpixels by mean-shift clustering algorithm, which can output a number! of clustersfC m g. Then, we estimate the color discriminative score of a clusterC m as ds m =A + m =(A + m +A m ) (5.9) 91 whereA + m = P sp + n 2Cm a + n calculates positive pixel area, and similarly we can get the negative oneA m . A cluster with discriminative scoreds m > 0:5 has more positive pixels asA + m =A m > 1, which means superpixels in this cluster are more likely from the object. Algorithm 5 Learning Discriminative Color Filter. Input: training setS + andS . Segment samples into superpixelsS + :fsp + n g withf(a + n ;f + n )g andS :fsp n g withf(a n ;f n )g Do mean-shift clustering onS + [S and get! clustersfC m g withff C m )g. Form = 1;:::;! do: • Get color saliency scoreds m for clusterC m by Eq.5.9. Output:F : fC m g withf(ds m ;f C m )g The discriminative color filter is learned repeatedly in every 10 frames to adapt to the color changes of the target object as well as the background. For update learning, the same learning method with online collected samples from current tracklets pool is used as initial learning. 5.4.3 Learning local binary filter and centroid displacement distribution For the second layer of the classifier, as introduced in section 5.3.2, we learn a random ferns classifier as an ensemble of 10 different ferns. For each fern, at the initial learning stage, we build its structure by randomly selecting binary tests and thresholds in eqn.5.3 as binary features. The rest of the learning task is to learn the multinomial distributions of both positive and negative classes for each fern. Each bin node in a multinomial distribution here corresponds to a leaf node in a fern. All labeled patches are mapped using selected binary features into leaf nodes to update the distribution. For online learning, we keep the initial ferns’ structure fixed and update the multinomial distributions incrementally. Besides, for each positively labeled patch, we map its centroid displacement into the unit cell of the 2D histogram in the displacement distributiond m;j (v) as introduced in section 5.3.3. 
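The incremental update just described can be sketched as follows (Python/NumPy): each fern keeps per-leaf class counts for eqn. 5.4 plus a per-leaf 2D displacement histogram with 4x4-pixel cells as in Sec. 5.3.3. The histogram extent, the field names and the weight argument (standing in for the outdated factor of Sec. 5.4.1) are our own illustrative choices.

# Sketch of Sec. 5.4.3: per-leaf statistics of one fern and their online update.
import numpy as np

class FernLeafStats:
    def __init__(self, n_leaves, disp_cells=15, cell_size=4):
        self.pos = np.ones(n_leaves)     # counts v_{m,j} for label +1
        self.neg = np.ones(n_leaves)     # counts for label -1
        # one 2D displacement histogram per leaf, uniformly initialized to 1
        self.disp = np.ones((n_leaves, disp_cells, disp_cells))
        self.cell_size = cell_size
        self.center = disp_cells // 2    # cell for zero displacement

    def update(self, leaf, label, displacement=None, weight=1.0):
        """weight stands in for the outdated factor lambda^(t_learn - t)."""
        if label == +1:
            self.pos[leaf] += weight
            if displacement is not None:          # v_k = c_i - pos_k
                dx = int(round(displacement[0] / self.cell_size)) + self.center
                dy = int(round(displacement[1] / self.cell_size)) + self.center
                if 0 <= dx < self.disp.shape[1] and 0 <= dy < self.disp.shape[2]:
                    self.disp[leaf, dx, dy] += weight
        else:
            self.neg[leaf] += weight

    def likelihood_pos(self, leaf, eps=1.0):
        """P(F_m | l = +1) of eqn. 5.4, with additive smoothing eps."""
        return (self.pos[leaf] + eps) / (self.pos.sum() + eps * len(self.pos))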
92 Since the incremental learning of ferns classifier and the centroid displacement distribution are efficient, we keep updating the multinomial distribution of ferns classifier as well as the displacement distributiond m;j (v) at each frame, iterating with the instance detection process. 5.5 Instance Detection based Tracking In this section, we describe how to apply an online learned patch-based ISD for instance detection based tracking (IDT) as shown in Fig.5.4. A target tracklet trk i is selected for extrapolation forwards from its tail or backwards from its head if conf(r t i ) > ;t2ftail i ;head i g, where conf(r t i ) is detection confidence by pedestrian detector and is set to 0:7. To detect an object at timet, we start by estimating the object centroidc t i . We densely sample patchesfz k g in a local search window and use them to vote for the object centroid in a Hough- voting scheme. Patches, confidently classified as belonging to the target, cast votes for the target centroid from patch centerspos c k by centroid-displacement vectorsv k = c t i pos c k ; while other patches do not cast votes. After the voting process, we get a voting map where the object centroid can be localized at the peak position as ^ c t i = arg max c t i X k f[pos c k +v k c t i ]w k g (5.10) where :(x) = 8 > > < > > : 1 ifjxj< 0 otherwise ;w k = 8 > > < > > : P k ifP k > 0:7 0 otherwise In our implementation, we pick confident patches withP k > 0:7 to vote and set the voting range as = 2 pixels. Since each patch is mapped to a leaf node in a fern classifier and get a displacement distribution d m;j (v k ) from there, centroid displacement distributions across ferns 93 Training Samples Collected online Applying Patch Classifier during detection with ISD (shaded area) Patch Classifier Learning Superpixels Classifier Output: Probability P k for each pixel location Sampling Patches Voting patches V Voting map of object centroid GrabCut mask Detection with foreground mask 1 st layer Output: Probability P’ k for each pixel location Input: a local search window Binary feature based Patch classifer Discriminative Color Filter Figure 5.4: Tracking with instance-specific detector are multiplied to getd(v k ). We then use 9 most likely displacement vectors in the distribution d(v k ) to vote. With non-maxima suppression on the voting map, we get the estimated object centroid ^ c t i . As the object shape is deformable, we need to estimate the object shape mask besides the centroid. Again, using the centroid displacement vectorv k , we can retrieve the voting patchesV for the estimated object centroid ^ c t i : V =fz k jjpos c k +v k ^ c t i j<g (5.11) By applying the standard GrabCut segmentation method initialized withV as seeds, we extract the object foreground mask as shown in figure 5.4. To deal with occlusion, an occupancy mapMap t is estimated by sorting all existing responses according to their foot positionsfoot t i =c t i + 0:5h t i in frame, assuming responses with smaller foot t i can be occluded by close responses in front of them. The instance detection based tracking 94 for a target is stopped if it tracks to occupied zones on the occupancy mapMap t or the number of voting patchesV in eqn.5.11 is less than 20. 5.6 Tracklet Extrapolation and Association Online extrapolated responses by IDT in section 5.5 can adapt to specific appearance changes, which are hard cases for offline trained detectors, and shorten frame gaps to possible tracklet matches. 
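For reference, the voting and retrieval steps of eqns. 5.10 and 5.11 in the preceding section can be summarized with the sketch below (Python/NumPy). The patch records (center position, probability P_k, a few most likely displacement vectors from d(v_k)) are assumed to come from the classifier described above; the probability threshold 0.7 and voting radius of 2 pixels follow the text, while the dictionary field names are placeholders.

# Sketch of Sec. 5.5: confident patches vote for the object centroid; the
# peak of the vote map gives the centroid estimate, and the voting patches
# retrieved near the peak seed the GrabCut foreground segmentation.
import numpy as np

def estimate_centroid(patches, map_shape, p_thresh=0.7, radius=2):
    """patches: list of dicts with 'center' (x, y), 'prob' (P_k) and
    'displacements', a few most likely (dx, dy) vectors from d(v_k)."""
    votes = np.zeros(map_shape)
    for z in patches:
        if z['prob'] <= p_thresh:                 # only confident patches vote
            continue
        for (dx, dy) in z['displacements']:
            cx, cy = int(z['center'][0] + dx), int(z['center'][1] + dy)
            x0, x1 = max(0, cx - radius), min(map_shape[0], cx + radius + 1)
            y0, y1 = max(0, cy - radius), min(map_shape[1], cy + radius + 1)
            votes[x0:x1, y0:y1] += z['prob']      # vote with weight w_k = P_k
    c_hat = np.unravel_index(np.argmax(votes), votes.shape)

    # eqn. 5.11: retrieve the patches whose votes land near the peak
    supporters = [z for z in patches if z['prob'] > p_thresh and any(
        abs(z['center'][0] + dx - c_hat[0]) < radius and
        abs(z['center'][1] + dy - c_hat[1]) < radius
        for (dx, dy) in z['displacements'])]
    return c_hat, supporters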
Besides, extrapolated tracklets can also recover some non-linear motion fragments of human tracks, which cause problems for linear motion assumptions in tracklets association methods. After IDT, we do further tracklets association by solving a linear assignment problem using Hungarian algorithm, which requires an affinity matrix with elements P link (trk i ;trk j ) as its input. The linking likelihood between two tracklets is determined by their affinities in appearance, motion and time: P link (trk i ;trk j ) =A appr (i;j)A m (i;j)A t (i;j) (5.12) The motion affinity measures linear smoothness by connecting two tracklets as: A m (i;j) =N(c tail i +v tail i tc head j ; j ) N(c head j v head j tc tail i ; i ) whereN(x; ) denotes normal distribution, t is the frame gap between the tail oftrk i and the head oftrk j ;v i denotes the velocity which can be calculated from positionsc t i of the head or tail 95 part oftrk i . The time model makes the link betweentrk i andtrk j possible iftail i is earlier than head j as A t (i;j) = 8 > > < > > : 1 if t> 0 0 otherwise (5.13) The appearance affinity is estimated as the similarity between two discriminative color filters (details in Sec.5.3.1) learned from tail responses oftrk i and head responses oftrk j respectively. Intuitively, it is hard to directly compare two detection responses with different poses and mis- aligned parts. Instead, we pick positive clusters in the color filters as indirect representation of object appearance under large pose deformations. We assume that the closer two positive super- pixel cluster sets are in the color feature space, the more likely that two tracklets belong to the same target. By bidirectional comparison between two sets of positive clusters using Nearest- Neighbor matching, we estimate the appearance affinity of two tracklets. 5.7 Experimental results Method Recall" Precision" FAF# GT MT" PT Frag# IDS# CEM[7] - - - 23 82.6% 17.4% 21 15 PIRMPT[49] 89.5% 99.6% 0.020 19 78.9% 21.1% 23 1 DPAM[99] 97.8% 94.8% 0.306 19 94.7% 5.3% 2 0 HYBRID(ours) 99.5% 96.2% 0.231 19 100.0% 0.0% 1 0 Table 5.1: Results comparison on PETS 2009 dataset. There are no standard tracking datasets to evaluate the performance of multiple articulated hu- man tracking. However, there are public activity datasets containing multiple articulated humans, e.g.CMU Action[46]. Therefore, we manually annotated this dataset the ground truth trajectories 96 for evaluation. Two public multi-target tracking datasets iLIDS[3] and PETS09[1] are also used for evaluation. Besides, we have created a new dataset, called ELASTIC, for tracking multiple articulated humans (detail at below). We first employ the commonly used tracking metrics 2 [57] for quantitative evaluation. As the performance of offline trained pedestrian detector influences final tracking results, for fair comparisons, the same pedestrian detection results by [40] are used as input for all methods. No camera model or ground plane estimation is used. For others’ methods, we use the best results provided by the authors. Some video results are provided in supplemental material. We also compare our tracking results with state-of-the-art visual tracking methods, which require manual initialization unlike our automatic HYBRID tracker. Since visual tracking has different evaluation metrics with multi-target tracking, the comparison is conducted qualitatively on ELASTIC dataset and visualized in figure 5.5. 
[Results on PETS 2009 dataset:] PETS09 is used to demonstrate that HYBRID also improves tracking performance on a pedestrian dataset, although it is designed to track beyond pedestrians. The quantitative comparison results are given in Table5.1. HYBRID outperforms other state-of- the-art methods by achieving 100% (MT), with only 1 fragment. [Results on iLIDS dataset:] iLIDS is commonly used for evaluation of pedestrian tracking, which contains frequent occlusions due to high human density at a subway station. We pick the ”medium” sequence (4582 frames,720 576 resolution), which contains frequent human pose changes between standing and sitting. We manually altered the ground truth trajectories to include sitting poses, which significantly increase the difficulty of this sequence. 2 The metrics includes recall and precision of detection performance after tracking; false alarms per frame(FAF)#; number of ground truth trajectories(GT); percentage of mostly tracked(MT)" trajectories with tracked part 80% trajectory length, partially tracked(PT) trajectories with tracked part between 20% and 80% trajectory length, and mostly lost(ML)# trajectories with tracked part 20% trajectory length; fragments(Frag)#, the total number of times ground truth tracks are interrupted; id switches(IDS)#, the total number of times that generated tracks change their matched ground truth tracks. 97 Some result examples are shown in figure 5.1. The quantitative comparison results are given in table 5.2. We compared HYBRID with two state-of-the-art multi-target tracking methods, the results of which are provided by authors of [49] and [99]. Note that method of [99] incorporates a conservative instance tracker with a discriminative appearance model with 15 fixed regions within a pedestrian response. This type of instance tracker is able to extend pedestrian tracklets from their ends for a certain length. But, if large pose change happens, a rigid appearance model with fixed parts may become unreliable. In contrast, our HYBRID tracker is based on a collection of smaller patches, some of which may remain stable for a certain time during the pose changes and are helpful for object localization. Moreover, association of extrapolated tracklets have a better linking chance with the color filter based affinity estimation even if the tracklet pair are in different poses. Compared to [99], HYBRID improved the recall rate and MT by 10% and reduced Frag from 38 to 22. [Results on CMU action dataset:] CMU action dataset was Method Recall" Precision" FAF# MT" PT ML# Frag# IDS# PIRMPT[49] 64.8% 95.5% 0.149 58.7% 28.3% 13.0% 27 14 DPAM[99] 68.6% 93.7% 0.228 67.4% 26.1% 6.5% 38 9 HYBRID 74.2% 95.2% 0.153 73.5% 20.3% 6.2% 22 7 Table 5.2: Results comparison on iLIDS medium sequence wtth 46 ground truth(GT) trajectories. captured using hand held cameras. Six video clips (0229,0233c,0213,0208,0154c,0154b) in low resolution(160 120) are selected with pose changes between standing and bending down in a crowd scene, as shown in fig.5.1(b). Ground truth trajectories are manually annotated by us to include all poses. Quantitative comparison results are given in table 5.3. [Results on ELASTIC dataset:] ELASTIC is a new dataset, created by us for tracking mul- tiple articulated humans with various pose changes. This dataset contains 10 video sequences, collected from the Internet capturing daily sport activities, e.g. 
skate boarding, ice-dancing, with 98 CMU action dataset Method MT" PT ML# Frag# IDS# Recall" Precision" FAF# Raw detection - - - - - 45.3% 95.8% 0.062 DPAM[99] 49.1% 38.2% 12.7% 11 14 71.2% 93.9% 0.103 HYBRID0 81.8% 12.7% 5.5% 13 22 88.2% 65.9% 0.235 HYBRID 83.6% 10.9% 5.5% 8 9 93.1% 87.3% 0.125 ELASTIC dataset Raw detection - - - - - 28.3% 99.2% 0.01 DPAM[99] 28.2% 39.5% 32.3% 43 2 59.1% 86.1% 0.27 HYBRID0 57.6% 18.7% 23.7% 35 3 73.2% 74.0% 0.34 HYBRID 72.6% 19.7% 7.7% 29 2 83.7% 85.5% 0.29 Table 5.3: Comparison of results on CMU action dataset with 55 ground truth trajectories and ELASTIC dataset with 91 ground truth trajectories. HYBRID0 is implemented with color prior disabled. non-stationary hand-held cameras. Each sequence contains 400 to 1000 frames in 640 480 resolution. This dataset is challenging due to large pose changes and fast motion of humans in sport activity. Some visualized tracking examples are shown in Fig.5.5. Quantitative comparison results are given in Table 5.3. Note that color prior improved the performance more on ELASTIC dataset, which contains fast human motion, than on CMU Action. This is due to that fewer stable patches by binary features can be found for humans in fast motion, while color features are more robust to motion blurs. To demonstrate the robustness of our online learned instance-specific patch-based detec- tor(ISD), we compare HYBRID tracking results on ELASTIC dataset with several state-of-the-art visual tracking methods qualitatively, including context tracker[27], struck tracker[38] and Hough tracker[35], results of which are shown in the first two rows of figure 5.5. Specifically, visual trackers are sensitive to initialization, we tried context tracker twice(green and yellow) for each target with slightly different initialization but get very different results. In contrast, HYBRID is a fully automatic tracker which has to deal with imprecise pedestrian detection responses but still 99 achieved good pixel precision shown in figure 5.5. [Computational Speed] The average running Frame 20 Frame 35 Frame 44 Frame 69 Frame 123 Frame 130 Frame 147 Frame 156 Frame 155 Frame 158 Frame 174 Frame 180 All visual tracker drifted Context tracker 1 st try Context tracker 2nd try Struck tracker Hough tracker Figure 5.5: Example of tracking results (in elliptical) on ELASTIC dataset by HYBRID. Tracking results from visual trackers are shown in boxes. Note some visual trackers drifted. Legend is given on the right border. speed of HYBRID (implemented in C++ on PC with a 4GHz cpu and 4G RAM) is about 3 frames per second, about 2.5 times slower than the pedestrian trackers we compared with. This is due to the frequent online update learning process of ISD to adapt to the pose changes of a target, although the learning algorithm of patch classifier is efficient. 5.8 Conclusion Learning offline trained generic detectors for articulated humans remains a challenging problem. Instead, we combined offline learned pedestrian detector and online learn instance-specific detec- tors in a hybrid system to deal with articulated human tracking and showed significant improve- ment compared to state-of-the-art methods on both pedestrian datasets and articulated human dataset. 100 Chapter 6 Future work Multi-target tracking has been actively studied in past decade and great progress has been made. However, tracking problem is still far from being solved. On one hand, recently reported im- provement on standard datasets has more or less shown saturation signs. 
On the other hand, the closer we get to real applications, the bigger the challenges become in unconstrained settings. For example, in a recently released benchmark for multiple object tracking (MOT Challenge 2015) [52], the best reported results still leave over 26% of targets mostly lost and produce many ID switches and fragments. Some reasons for the low performance are abrupt camera motion, occlusions that become more severe as scenes get more crowded, and large pose articulations that appear more frequently. To address these challenges, we have investigated part-based dynamic appearance models, online learning of discriminative appearance features of each specific target for better affinity estimation, online learning of instance-specific detectors with local features (local patches or superpixels), online tracking by detection and segmentation, as well as optimization methods that consider social factors. Looking forward, there are several promising directions for improving multiple human tracking, including offline affinity learning with deep neural networks, person identity matching with salient attributes, and motion compensation using low-level motion features.
6.1 Deep neural networks for appearance affinity learning
In our past work, discriminative appearance models are learned online efficiently with a boosting strategy, which produces fairly simple models for affinity estimation and does not require large numbers of training samples. In contrast, deep neural networks trained with large amounts of offline samples have brought impressive improvements in many fields of computer vision, including object recognition, detection and segmentation. The merits of convolutional neural networks for feature learning and the end-to-end learning strategy have attracted many researchers' attention. Compared to the hand-crafted features used in our current tracking system, e.g. HOG or color histograms, it would be interesting to see what kinds of features could be learned directly from offline data. As neural networks are very flexible models, a sufficient amount of training data is required to avoid over-fitting. Therefore, a major task is to design a good data augmentation method, since the training data available for tracking are usually small.
6.2 Person identity matching with salient attributes
In our current tracking methods, low-level feature descriptors are used to discriminate between targets. A global model learned with supervision has become the standard for affinity estimation; such a model is shared among all targets, although we have also proposed learning a specific discriminative model for each target. However, these low-level features are susceptible to observation noise, especially under occlusion and large pose articulation. In contrast, higher-level attributes, e.g. wearing a hat or carrying a backpack, are more invariant to such appearance changes. Therefore, it is worth investigating attribute representations of targets and both supervised and unsupervised ways of discovering valuable attributes.
6.3 Motion model compensation with motion features
In most standard tracking methods, the best parameters of the motion model are found by tuning on each video. On one hand, this is not practical across all kinds of scenes; on the other hand, abrupt camera motion causes large tracker errors. Low-level motion features, e.g.
Reference List
[1] http://www.cvg.rdg.ac.uk/PETS2009.
[2] http://www.nist.gov/speech/tests/trecvid/2008.
[3] http://www.eecs.qmul.ac.uk/~andrea/avss2007_d.html, 2004.
[4] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurélien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels. Technical report, 2010.
[5] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragments-based tracking using the integral histogram. In CVPR, 2006.
[6] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. People-tracking-by-detection and people-detection-by-tracking. In CVPR, 2008.
[7] A. Andriyenko and K. Schindler. Multi-target tracking by continuous energy minimization. In CVPR, 2011.
[8] Anton Andriyenko, Konrad Schindler, and Stefan Roth. Coupling detection and data association for multiple object tracking. In CVPR, 2012.
[9] Shai Avidan. Ensemble tracking. In CVPR, pages 494–501, 2005.
[10] Shai Avidan. Ensemble tracking. PAMI, 2007.
[11] B. Babenko, Ming-Hsuan Yang, and S. Belongie. Visual tracking with online multiple instance learning. In CVPR, 2009.
[12] Olga Barinova, Victor Lempitsky, and Pushmeet Kohli. On detection of multiple object instances using Hough transforms. In CVPR, 2010.
[13] B. Borchers. CSDP 2.3 user's guide. Optimization Methods and Software, 1999.
[14] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.
[15] J. Berclaz, F. Fleuret, and E. Türetken. Multiple object tracking using k-shortest paths optimization. In CVPR, 2011.
[16] A. Borji and L. Itti. Exploiting local and global patch rarities for saliency detection. In CVPR, 2012.
[17] Michael D. Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and Luc Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In ICCV, 2009.
[18] Michael D. Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and Luc Van Gool. Online multi-person tracking-by-detection from a single, uncalibrated camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1820–1833, 2011.
[19] Luka Cehovin, Matej Kristan, and Ales Leonardis. An adaptive coupled-layer visual model for robust visual tracking. In ICCV, 2011.
[20] Changjiang Yang, Ramani Duraiswami, and Larry S. Davis. Efficient mean-shift tracking via a new similarity measure. In CVPR, 2005.
[21] Xiaojing Chen, Zhen Qin, Le An, and Bir Bhanu. An online learned elementary grouping model for multi-target tracking. In CVPR, 2013.
[22] R. T. Collins. Multitarget data association with higher-order motion models. In CVPR, 2012.
[23] Robert Collins. Mean-shift blob tracking through scale space. In CVPR, 2003.
[24] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Real-time tracking of non-rigid objects using mean shift. In CVPR, pages 142–149, 2000.
[25] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. PAMI, 2003.
[26] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[27] T. B. Dinh, N. Vo, and G. Medioni. Context tracker: Exploring supporters and distracters in unconstrained environments. In CVPR, 2011.
[28] Pedro Felzenszwalb and Daniel Huttenlocher. Pictorial structures for object recognition. IJCV, 2004.
[29] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[30] Thomas E. Fortmann, Yaakov Bar-Shalom, and Molly Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. IEEE Journal of Oceanic Engineering, 8:173–184, 1983.
[31] K. Fragkiadaki and Jianbo Shi. Detection free tracking: Exploiting motion and topology for segmenting and tracking under entanglement. In CVPR, 2011.
[32] Jérôme Berclaz, François Fleuret, and Pascal Fua. Robust people tracking with global trajectory optimization. In CVPR, pages 744–750, 2006.
[33] Juergen Gall and Victor Lempitsky. Class-specific Hough forests for object detection. In CVPR, 2009.
[34] Juergen Gall, Angela Yao, Nima Razavi, Luc Van Gool, and Victor Lempitsky. Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2188–2202, 2011.
[35] Martin Godec, Peter M. Roth, and Horst Bischof. Hough-based tracking of non-rigid objects. In ICCV, 2011.
[36] Helmut Grabner and Horst Bischof. On-line boosting and vision. In CVPR, pages 260–267, 2006.
[37] M. Grundmann, V. Kwatra, Mei Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, 2010.
[38] Sam Hare, Amir Saffari, and Philip H. S. Torr. Struck: Structured output tracking with kernels. In ICCV, 2011.
[39] Helmut Grabner, Jiri Matas, Luc Van Gool, and Philippe Cattin. Tracking the invisible: Learning where the object might be. In CVPR, 2010.
[40] Chang Huang and Ramakant Nevatia. High performance object detection by collaborative learning of joint ranking of granules features. In CVPR, 2010.
[41] Chang Huang, Bo Wu, and Ramakant Nevatia. Robust object tracking by hierarchical association of detection responses. In ECCV, 2008.
[42] Michael Isard and Andrew Blake. Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[43] H. Jiang, S. Fels, and J. J. Little. A linear programming approach for multiple object tracking. In CVPR, 2007.
[44] Z. Kalal, J. Matas, and K. Mikolajczyk. P-N learning: Bootstrapping binary classifiers by structural constraints. In CVPR, 2010.
[45] Robert Kaucic, A. G. Amitha Perera, Glen Brooksby, John P. Kaufhold, and Anthony Hoogs. A unified framework for tracking through occlusions and across sensor gaps. In CVPR, 2005.
[46] Y. Ke, R. Sukthankar, and M. Hebert. Volumetric features for video event detection. IJCV, 2010.
[47] Louis Kratz and Ko Nishino. Tracking with local spatio-temporal motion patterns in extremely crowded scenes. In CVPR, 2010.
[48] ChengHao Kuo, Chang Huang, and Ram Nevatia. Multi-target tracking by on-line learned discriminative appearance models. In CVPR, 2010.
[49] ChengHao Kuo and Ram Nevatia. How does person identity recognition help multi-person tracking? In CVPR, 2011.
[50] Junseok Kwon and Kyoung Mu Lee. Tracking of a non-rigid object via patch-based dynamic appearance modeling and adaptive basin hopping Monte Carlo sampling. In CVPR, 2009.
[51] Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition. In CVPR, 2010.
[52] Laura Leal-Taixé, Anton Milan, Ian D. Reid, Stefan Roth, and Konrad Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. CoRR, 2015.
[53] Mun Wai Lee and Ramakant Nevatia. Human pose tracking in monocular sequence using multilevel structured models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):27–38, 2009.
[54] Bastian Leibe, Konrad Schindler, and Luc Van Gool. Coupled detection and trajectory estimation for multi-object tracking. In ICCV, 2007.
[55] J. Lezama, K. Alahari, J. Sivic, and I. Laptev. Track to the future: Spatio-temporal video segmentation with long-range motion cues. In CVPR, 2011.
[56] Yuan Li, Haizhou Ai, Takayoshi Yamashita, Shihong Lao, and Masato Kawade. Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans. In CVPR, 2007.
[57] Yuan Li, Chang Huang, and Ram Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In CVPR, 2009.
[58] Jingchen Liu, Peter Carr, Robert T. Collins, and Yanxi Liu. Tracking sports players with context-conditioned motion models. In CVPR, 2013.
[59] Anton Milan, Konrad Schindler, and Stefan Roth. Detection- and trajectory-level exclusion in multiple object tracking. In CVPR, 2013.
[60] Juan Carlos Niebles, Bohyung Han, and Fei-Fei Li. Efficient extraction of human motion volumes by tracking. In CVPR, pages 655–662, 2010.
[61] Katja Nummiaro, Esther Koller-Meier, and Luc Van Gool. An adaptive color-based particle filter. Image and Vision Computing, 2003.
[62] Kenji Okuma, Ali Taleghani, Nando de Freitas, James J. Little, and David G. Lowe. A boosted particle filter: Multitarget detection and tracking. In ECCV, 2004.
[63] Mustafa Ozuysal, Michael Calonder, Vincent Lepetit, and Pascal Fua. Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3):448–461, 2010.
[64] Marco Pedersoli, Andrea Vedaldi, and Jordi Gonzalez. A coarse-to-fine approach for fast deformable object detection. In CVPR, 2011.
[65] Stefano Pellegrini, Andreas Ess, and Luc Van Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In ECCV, 2010.
[66] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[67] A. G. Amitha Perera, Chukka Srinivas, Anthony Hoogs, Glen Brooksby, and Wensheng Hu. Multi-object tracking through simultaneous long occlusions and split-merge conditions. In CVPR, 2006.
[68] Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
[69] Zhen Qin and Christian R. Shelton. Improving multi-target tracking via social grouping. In CVPR, 2012.
[70] Deva Ramanan, D. A. Forsyth, and Andrew Zisserman. Strike a pose: Tracking people by finding stylized poses. In CVPR, 2005.
[71] Donald B. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 24:843–854, 1979.
[72] Xiaofeng Ren and J. Malik. Tracking as repeated figure/ground segmentation. In CVPR, 2007.
[73] Mikel Rodriguez, Ivan Laptev, Josef Sivic, and Jean-Yves Audibert. Density-aware person detection and tracking in crowds. In ICCV, 2011.
[74] David Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. Incremental learning for robust visual tracking. IJCV, 2008.
[75] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Unsupervised salience learning for person re-identification. In CVPR, 2013.
[76] Saad Ali and Mubarak Shah. Floor fields for tracking in high density crowd scenes. In ECCV, 2008.
[77] Benjamin Sapp, Alexander Toshev, and Ben Taskar. Cascaded models for articulated pose estimation. In ECCV, 2010.
[78] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
[79] Jianbo Shi and C. Tomasi. Good features to track. In CVPR, 1994.
[80] Xinchu Shi, Haibin Ling, Weiming Hu, Chunfeng Yuan, and Junliang Xing. Multi-target tracking with motion context in tensor power iteration. In CVPR, 2014.
[81] Horesh Ben Shitrit, Jérôme Berclaz, François Fleuret, and Pascal Fua. Tracking multiple people under global appearance constraints. In ICCV, 2011.
[82] Guang Shu, Omar Oreifej, Afshin Dehghan, and Mubarak Shah. Part-based multiple-person tracking with partial occlusion handling. In CVPR, 2012.
[83] Vivek Kumar Singh, Ram Nevatia, and Chang Huang. Efficient inference with multiple heterogeneous part detectors for human pose estimation. In ECCV, 2010.
[84] Bi Song, Ting-Yueh Jeng, Elliot Staudt, and Amit K. Roy-Chowdhury. A stochastic graph evolution framework for robust multi-target tracking. In ECCV, 2010.
[85] Chris Stauffer. Estimating tracking sources and sinks, 2003.
[86] Narayanan Sundaram, Thomas Brox, and Kurt Keutzer. Dense point trajectories by GPU-accelerated large displacement optical flow. In ECCV, 2010.
[87] Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. Technical report, 1991.
[88] Amelio Vazquez-Reina, Shai Avidan, Hanspeter Pfister, and Eric Miller. Multiple hypothesis video segmentation from superpixel flows. In ECCV, 2010.
[89] Shu Wang, Huchuan Lu, Fan Yang, and Ming-Hsuan Yang. Superpixel tracking. In ICCV, 2011.
[90] William Brendel, Mohamed Amer, and Sinisa Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
[91] Bo Wu and Ram Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. IJCV, 2007.
[92] Zheng Wu, Ashwin Thangali, Stan Sclaroff, and Margrit Betke. Coupling detection and data association for multiple object tracking. In CVPR, 2012.
[93] Junliang Xing, Haizhou Ai, and Shihong Lao. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In CVPR, 2009.
[94] Junliang Xing, Haizhou Ai, Liwei Liu, and Shihong Lao. Multiple players tracking in sports video: A dual-mode two-way Bayesian inference approach with progressive observation modeling. IEEE Transactions on Image Processing, 20(6):1652–1667, 2011.
[95] Kota Yamaguchi, Alexander C. Berg, Luis E. Ortiz, and Tamara L. Berg. Who are you with and where are you going? In CVPR, 2011.
[96] Bo Yang, Chang Huang, and Ram Nevatia. Learning affinities and dependencies for multi-target tracking using a CRF model. In CVPR, 2011.
[97] Bo Yang and Ram Nevatia. Multi-target tracking by online learning of non-linear motion patterns and robust appearance models. In CVPR, 2012.
[98] Bo Yang and Ram Nevatia. An online learned CRF model for multi-target tracking. In CVPR, 2012.
[99] Bo Yang and Ram Nevatia. Online learned discriminative part-based appearance models for multi-human tracking. In ECCV, 2012.
[100] Ming Yang, Fengjun Lv, Wei Xu, and Yihong Gong. Detection driven adaptive multi-cue integration for multiple human tracking. In ICCV, 2009.
[101] Weilong Yang, Yang Wang, and Greg Mori. Recognizing human actions from still images with latent poses. In CVPR, 2010.
[102] Yi Yang and Deva Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, 2011.
[103] Yizheng Cai, Nando de Freitas, and James J. Little. Robust visual tracking for multiple targets. In ECCV, 2006.
[104] Qian Yu, Gérard Medioni, and Isaac Cohen. Multiple target tracking using spatio-temporal Markov chain Monte Carlo data association. In CVPR, 2007.
[105] Amir Roshan Zamir, Afshin Dehghan, and Mubarak Shah. GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs. In ECCV, 2012.
[106] Li Zhang, Yuan Li, and Ram Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008.
[107] Li Zhang, Bo Wu, and Ramakant Nevatia. Detection and tracking of multiple humans with extensive pose articulation. In ICCV, 2007.
[108] Xuemei Zhao and Gérard Medioni. Robust unsupervised motion pattern inference from video and applications. In ICCV, 2011.
[109] Zhimin Fan, Ming Yang, and Ying Wu. Multiple collaborative kernel tracking. PAMI, 2007.