PART BASED OBJECT DETECTION, SEGMENTATION, AND TRACKING BY BOOSTING SIMPLE SHAPE FEATURE BASED WEAK CLASSIFIERS

by

Bo Wu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2008

Copyright 2008 Bo Wu

Acknowledgements

First, I would like to express my deepest gratitude to my advisor Prof. Ram Nevatia. Without him this thesis work would have been impossible in many ways. I benefited greatly from his insightful guidance, inspiring support and invaluable knowledge. I would also like to thank Prof. Gerard Medioni, Prof. Ulrich Neumann, Prof. Karen Liu, and Prof. Bosco Tjan for taking their precious time to serve on the committees of my qualifying exam and/or thesis defense.

My four years' PhD study has been made not only a great learning experience but also enjoyable by the interactions with other current or previous USC vision group members (special thanks to Fengjun Lv, Tao Zhao, Xuefeng Song, Munwai Lee, Tae Eun Choe, Chang Yuan, Qian Yu, Shengyang Dai, Chang Huang, Li Zhang, Yuan Li, and Vivek Kumar Singh) and other friends at USC (Cong Zhang, Ling Zhuo, and Jing Li, among many others) in this period.

I am also grateful for the support from my parents in China. The pursuit of PhD study is partially a result of my parents' cultivation of my interest in science and encouragement for advanced education from childhood.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Objective and Difficulty
  1.2 Motivation of Our Approach
  1.3 Overview of Our Approach
  1.4 Summary of Results and Contributions
  1.5 Outline of Thesis

Chapter 2: Related Work
  2.1 Object Representation
  2.2 Image Features
  2.3 Off-line Supervised Learning for Object Detection
  2.4 Online Unsupervised Learning for Object Detection
  2.5 Object Segmentation
  2.6 Object Tracking

Chapter 3: Object Detection by Boosting Local Shape Features
  3.1 Boosting Edgelet based Tree Structured Classifier
    3.1.1 Motivation and Outline
    3.1.2 Tree Learning Algorithm
      3.1.2.1 Formulation of tree structured classifier
      3.1.2.2 Splitting sample space
      3.1.2.3 Retraining with sub-categorization
    3.1.3 Edgelet Features
    3.1.4 Experimental Results
      3.1.4.1 Comparison of splitting strategies
      3.1.4.2 Comparison on data without predefined sub-categorization
      3.1.4.3 Comparison on data with view-based sub-categorization
      3.1.4.4 Multi-view car detection
    3.1.5 Conclusion and Discussion
  3.2 Integrating Heterogeneous Types of Local Features
    3.2.1 Motivation and Outline
    3.2.2 General Weak Classifier
    3.2.3 Hierarchical Weak Classifier
    3.2.4 Learning Weak Classifier
    3.2.5 Implementation
      3.2.5.1 Feature dependent projection functions
      3.2.5.2 Computational costs of features
      3.2.5.3 Memory costs of features
      3.2.5.4 Selecting the best weak classifiers
    3.2.6 Experimental Results
      3.2.6.1 Performance on pedestrians
      3.2.6.2 Feature statistics
      3.2.6.3 Hierarchical vs. sequential weak classifier
      3.2.6.4 Performance on cars
    3.2.7 Conclusion and Discussion

Chapter 4: Object Segmentation by Boosting Local Shape Features
  4.1 Motivation and Outline
  4.2 Design of the Weak Classifier
    4.2.1 Weak classifier for detection
    4.2.2 Weak classifier for segmentation
  4.3 Boosting Ensemble Classifier for Segmentation and Detection
    4.3.1 Sample weight evolution
    4.3.2 Optimization of weak classifier
  4.4 Refining Segmentation by Color
  4.5 Experimental Results
    4.5.1 Experiments on pedestrians
      4.5.1.1 Results for frontal/rear viewpoint pedestrians
      4.5.1.2 Results for side viewpoint pedestrians
      4.5.1.3 Comparison of segmentation with mean mask
    4.5.2 Experiments on cars
      4.5.2.1 Results of shape based detection and segmentation
      4.5.2.2 Results of color based refinement
  4.6 Conclusion and Discussion

Chapter 5: Part Combination for Detection of Occluded Objects
  5.1 Motivation and Outline
  5.2 Hierarchical Body Part Detectors
    5.2.1 Learning part detectors
    5.2.2 Detecting body parts and object edges
  5.3 Joint Analysis for Multiple Objects
    5.3.1 Proposing object hypotheses
    5.3.2 Joint occlusion map of silhouettes
    5.3.3 Matching object edges with visible silhouettes
    5.3.4 Matching detection responses with visible parts
    5.3.5 Computing joint image likelihood
    5.3.6 Searching for the best configuration
  5.4 Experimental Results
    5.4.1 Training part detector hierarchy
    5.4.2 Evaluation on indoor occluded examples
    5.4.3 Evaluation on outdoor occluded examples
  5.5 Conclusion and Discussion

Chapter 6: Improving Object Detection by Unsupervised, Online Learning
  6.1 Motivation and Outline
  6.2 Experimental Data Set
  6.3 Learning Seed Detectors
  6.4 Oracle by Combining Part Detectors
    6.4.1 Positive sample collection
    6.4.2 Negative sample collection
  6.5 Online Boosting for Cascade Classifier
    6.5.1 Updating weak classifiers
    6.5.2 Updating sample weights
    6.5.3 Updating cascade thresholds
    6.5.4 Adaptation of cascade complexity
  6.6 Experimental Results
    6.6.1 Results on the CAVIAR set
    6.6.2 Results on the CLEAR-VACE set
  6.7 Conclusion and Discussion

Chapter 7: Object Tracking by Associating Detection Responses
  7.1 Tracking Occluded Objects based on Part Detection
    7.1.1 Motivation and Outline
    7.1.2 Part based Detection Module
    7.1.3 Part based Tracking Module
      7.1.3.1 Affinity for detection responses
      7.1.3.2 Trajectory initialization
      7.1.3.3 Trajectory growth
      7.1.3.4 Trajectory termination
      7.1.3.5 The combined tracker
    7.1.4 Experimental Results
      7.1.4.1 Track level evaluation criteria
      7.1.4.2 Results on the CAVIAR set
      7.1.4.3 Results on the skate board set
      7.1.4.4 Results on the building top set
      7.1.4.5 Tracking performance with occlusions
    7.1.5 Conclusion and Discussion
  7.2 Robust Tracking based on Soft Decision
    7.2.1 Motivation and Outline
    7.2.2 Soft Decision based Detection Module
      7.2.2.1 Edgelet based boosted classifier
      7.2.2.2 HOG based SVM classifier
      7.2.2.3 Discussion about detection module design
    7.2.3 Soft Decision based Tracking Module
      7.2.3.1 Detection response association
      7.2.3.2 Trajectory initialization and termination
    7.2.4 Experimental Results
      7.2.4.1 CLEAR evaluation metrics for tracking
      7.2.4.2 Human tracking in meeting videos
      7.2.4.3 Human tracking in surveillance videos
    7.2.5 Conclusion and Discussion

Chapter 8: Conclusion and Future Work

References

List of Tables

3.1 Detection equal-precision-recall rates on the UIUC car image set.
3.2 Evaluation frequencies of different feature types.
3.3 Evaluation frequencies of the first, second, and third features.
3.4 Detection equal-precision-recall rates on the UIUC car image set.
4.1 Summary of experimental data for simultaneous object detection and segmentation.
4.2 Segmentation accuracy of mean mask and shape based segmentor on aligned profile view pedestrian samples.
4.3 The per-pixel figure-ground segmentation accuracy of side view cars. (A number marked with * means the testing is done on a subset of the image set.)
4.4 Detection equal-precision-recall rates on the UIUC car image set.
5.1 Performance of part detectors on the USC pedestrian set B. (The performance of the left shoulder/arm/leg detectors is similar to that of their right counterparts.)
5.2 Detection rates on different degrees of occlusion (with 12 false alarms).
6.1 Standard deviations of the head and the feet positions before and after alignment.
6.2 Comparison of the generic and specialized detectors on the CAVIAR set.
6.3 Comparison of the generic and specialized detectors on the CLEAR-VACE set.
7.1 Tracking performance of the combined tracker on the CAVIAR set. (GT: ground-truth; MT: mostly tracked; ML: mostly lost; Fgmt: trajectory fragment; FAT: false alarm trajectory; IDS: ID switch)
7.2 Detection performance before and after tracking on the CAVIAR set.
7.3 Tracking performance of the combined tracker on the skate board set. (See Table 7.1 for abbreviations.)
7.4 Comparison between the single part tracker and the combined tracker on a subset of the skate board set. (See Table 7.1 for the abbreviations.)
7.5 Tracking performance of the combined tracker on the building top set. (See Table 7.1 for the abbreviations.)
7.6 Frequency and tracking performance on occlusion events. n/m: n successfully tracked among m occlusion events. (SS: short scene; LS: long scene; SO: short object; LO: long object)
7.7 Detection performance of the three detection levels on the NIST meeting videos.
7.8 Tracking performance on the NIST meeting videos.
7.9 Tracking performance on the CLEAR-VACE surveillance videos.

List of Figures

1.1 Sample frames: a) from the CAVIAR set [12], and b) from the data we have collected.
1.2 Overview diagram of our approach.
1.3 Examples of tracking results.
3.1 Illustration of tree structure.
3.2 Evolution of Z value.
3.3 Algorithm of learning tree structured classifier. In our experiments, T = 2,000, F = 10^-6, C = 8, theta_Z = 0.985 and theta_B = 75%. The setting of {P_t} is similar to the original cascade's layer acceptance rates. The cascade is divided into 20 segments, the lengths of which grow gradually. The weak classifiers at the end of the segments have a positive passing rate of 99.8%, and the other weak classifiers have a passing rate of 100.0%.
3.4 Retraining weak classification functions of ancestor nodes with intra-class sub-categorization.
3.5 Edgelet features.
3.6 Examples of pedestrian training samples.
3.7 Examples of car training samples.
3.8 Comparison of splitting strategies.
3.9 Tree structures from different splitting strategies. (The number in the box gives the percentage of samples belonging to that branch. The tree from the random splitting strategy is similar to that from the image feature based k-mean splitting strategy.)
3.10 Evaluation of the CBT method on the INRIA set.
3.11 Missed human examples by our method with a false alarm rate of 10^-4.
3.12 Structure of the view-based tree classifier for pedestrians.
3.13 Evaluation of the CBT method on the USC pedestrian set C (100 images with 232 pedestrians).
3.14 Schematic diagram of our feature integration method.
3.15 An illustration of hierarchical partition of sample space.
3.16 Evaluation of our multi-feature based detection method on the INRIA set.
3.17 The first weak classifier learned for pedestrians and its selected features. The first feature evaluated is an edgelet corresponding to the head-shoulder contour of the human body; the second feature is a covariance descriptor whose supporting region surrounds the head-shoulder part; the third feature is a HOG descriptor. The x-axis is the index of the histogram bins, i.e. the partition along the projection direction. The y-axis is the classifier's output.
3.18 Frequencies of different feature types as the first, second, and third feature in the hierarchical weak classifiers.
3.19 Comparison of different feature combination strategies.
3.20 Evaluation of our multi-feature based detection method on the problem of multi-view car detection.
3.21 Example results of pedestrian detection.
3.22 Example results of car detection.
4.1 Local shape features in [64, 84]: a) boundary fragments selected for cows [64]; b) feature responses of a face [84]; c) feature responses of a cow [64].
4.2 Effective field: a) definition of effectiveness; b) effective field bases of individual edge points; c) effective field of edgelet feature.
4.3 Algorithm of simultaneously learning detector and segmentor.
4.4 Segmentation error distribution: a) frontal/rear view pedestrians; b) left profile pedestrians; c) left profile cars. (A brighter point has a higher error rate than a darker point.)
4.5 Color based segmentation refining algorithm. (In our experiments, N = 3.)
4.6 Samples for frontal/rear view pedestrians: the first row is the image samples; the second row is the segmentation ground-truth (the grey pixels are do-not-care).
4.7 The first five features selected and their segmentors learned for frontal/rear pedestrians. (The 0-th segmentor is the prior distribution. Each edgelet based weak segmentor is implemented as a histogram. Each bin of the histogram is a real-valued matrix defined by Eq. 4.3 with the same dimension as the training samples. In our experiments, a segmentor histogram has eight bins. In this figure, we visualize the matrices of the histogram bins by normalizing them to [0, 255] gray scales. White is for higher values and black for lower values.)
4.8 Segmentation performance on the normalized samples.
4.9 Evaluation of detection performance on the USC pedestrian set A (205 images with 313 humans).
4.10 Training samples for left profile view pedestrians.
4.11 Examples of detection and segmentation results for pedestrians. For each pair, left is the detection result and right is the segmentation result. When there are multiple objects, different colors represent different objects.
4.12 Mean mask of left profile view pedestrians.
4.13 Training samples for left profile view cars.
4.14 Examples of detection and segmentation results for cars.
4.15 Examples of segmentation results with color based refinement. The first row are the source images; the second row are the shape based segmentation results; the third row are the segmentation results after color based refinement. The source images are from the TU Darmstadt image set.
5.1 Examples of part detection responses for pedestrians.
5.2 Schematic diagram of our part combination method.
5.3 Hierarchy of human body parts. (Pt_0 is full-body; Pt_1,0 head-shoulder; Pt_1,1 torso; Pt_1,2 legs; Pt_2,0 left shoulder; Pt_2,1 head; Pt_2,2 right shoulder; Pt_2,3 left arm; Pt_2,4 right arm; Pt_2,5 left leg; Pt_2,6 feet; Pt_2,7 right leg. The left and right sides here are w.r.t. the 2-D image space.)
5.4 Illustration of feature sharing in the part detector hierarchy. (The black points are the inherited features, and the gray are the newly selected features.)
5.5 Extracted object edgelet pixels.
5.6 Searching for the best multiple object configuration.
5.7 Proposal of multiple human hypotheses.
5.8 Assumption of scene structure.
5.9 Occlusion relation of multiple objects. a) 1-D silhouette based occlusion reasoning with boosted object segmentor; b) 2-D region based occlusion reasoning with constant object mask.
5.10 Matching and aligning edgelets with silhouettes.
5.11 The parts of the silhouettes that have matched edgelets (red points).
5.12 Result of matching full-body detection responses in Figure 5.1(a) with the proposed hypotheses in Figure 5.7 (yellow: matched responses; orange: response not matched with any hypothesis; red: hypothesis without matched response).
5.13 An example of searching for the best multi-object configuration. (The blue rectangles overlaid on the images are the hypotheses being examined. The red boxes are the states kept after comparing the image likelihoods with/without one hypothesis. When examining a hypothesis, one of the "with" and "without" likelihoods can be inherited from the previous round to reduce computational cost. For example, "without h_0" and "with h_1" are the same state, as h_0 is removed.)
5.14 The first two edgelet features learned for head-shoulder, torso, and legs.
5.15 Examples of part detection results on images from the Internet. (Green: successful detection; red: false alarm)
5.16 Evaluation of detection performance on the USC pedestrian test set B.
5.17 Example detection and segmentation results on the USC pedestrian set B.
5.18 Evaluation of detection performance on the Zurich mobile pedestrian sequences. (Following [21]'s evaluation, only humans higher than 60 pixels are counted. The curves of [21], courtesy of the original authors, are for their full system, i.e. with ground plane and stereo depth.)
5.19 Example detection and segmentation results on the Zurich mobile pedestrian sequences.
6.1 Schematic diagram of our unsupervised, online learning approach.
6.2 Example frames of the CAVIAR set and the CLEAR-VACE set.
6.3 Off-line learning algorithm for general seed classifiers.
6.4 Performance of the oracle for positive sample collection.
6.5 Examples of positive samples collected by the oracle.
6.6 Performance of the oracle for negative sample collection.
6.7 Features in the neighborhood of the first weak classifier of the full-body detector.
6.8 Performance of unsupervised, online learning with and without noise restraining strategies.
6.9 Sample passing rates of the full-body detector on the CAVIAR burn-in set before and after online learning.
6.10 Sample passing rates of the legs detector on the CAVIAR burn-in set before and after online learning.
6.11 Online learning algorithm for cascade structured classifiers.
6.12 Examples of detection results before and after online learning on the CAVIAR set. (Yellow for full-body; red for head-shoulder; purple for torso; blue for legs; and green for combined)
6.13 Examples of detection results on the CLEAR-VACE set before and after online learning.
7.1 Examples of appearance changes due to inter-object occlusions.
7.2 Schematic diagram of our part based tracking system.
7.3 Output of the part based detection module. a) and b) are from the full-body detector; c) is from the combined detector (green for combined; yellow for full-body; red for head-shoulder; purple for torso; blue for legs).
7.4 Probability map for the mean-shift tracker: a) is the original frame; b) is the final probability map; c), d) and e) are the probability maps for appearance, dynamics and detection respectively. (The object concerned is marked by a red ellipse.)
7.5 PCA based body part segmentation: a) training samples; b) eigenvectors, where the top-left is the mean vector; c) original human samples; d) color probability map; e) PCA reconstruction; f) thresholded segmentation map.
7.6 Forward part based combined object tracking algorithm.
7.7 Track level evaluation criteria.
7.8 Example results of our combined tracker on the CAVIAR set. (The first and the second rows are for one sequence; the third and the fourth for another.)
7.9 Example results of our combined tracker on the skate board set. (The first and the second rows are for one sequence; the third and the fourth for another.)
7.10 Example results of our combined tracker on the building top set. (The first and the second rows are for one sequence; the third and the fourth for another.)
7.11 Schematic diagram of our soft decision based tracking method.
7.12 Examples of the three levels of detection responses for the problem of meeting room human detection and tracking.
7.13 Forward soft decision based object tracking algorithm.
7.14 Example results of the soft decision based tracking on the NIST meeting videos. (The first and second rows are for one sequence; the third and fourth for another.)
7.15 Example results of the soft decision based tracking on the CLEAR-VACE surveillance videos. (The first and second rows are for one sequence; the third and fourth for another.)

Abstract

Detection, segmentation, and tracking of objects of a known class is a fundamental problem in computer vision. For this task, we need to first detect the objects of interest and segment them from the background, and then track them across different frames while maintaining the correct identities. The two principal sources of difficulty in performing this task are: a) change in appearance of the objects with viewpoint, illumination, and possible articulation, and b) partial occlusion of objects of interest by other objects.
The objective of this work is to develop a system to automatically detect, segment, and track multiple, possibly partially occluded objects of a known class from a single camera. We take pedestrians, which are important for many real-life applications, as the main class of interest to demonstrate our approach. However, some components of the method are also applied to the class of cars to show the generality of our approach.

We represent an object as a hierarchy of parts. The use of a part based model enables us to detect and track objects when some parts are not visible. We have developed a new type of shape oriented feature, called the edgelet, to capture silhouette based patterns. We have integrated the edgelet features with some other existing shape features, and learned tree structured classifiers for object parts. Part detection responses are combined jointly so that the spatial relations, including possible occlusions, between multiple objects are analyzed. For specific applications, an unsupervised, online learning algorithm is used to improve the performance of the detectors by adapting them to the particular environment. An object segmentor, whose output is a pixel-level figure-ground segmentation, is learned based on local shape features. The object detection and segmentation results provide the observations for tracking. Trajectory initialization and termination are both automatic and rely on the detection results. Two complementary techniques, data association and mean-shift, are used to track an object.

An automatic object detection and tracking system has been implemented and evaluated on a number of images and videos. The experimental results show that our method achieves state-of-the-art performance.

Chapter 1
Introduction

1.1 Objective and Difficulty

Detection, segmentation, and tracking of objects of a known class is a fundamental problem of computer vision. The class of pedestrians is particularly important for many applications, such as visual surveillance, human computer interaction, and driving assistance systems. For this task, we need to detect the objects of interest first (i.e. find the image regions corresponding to the objects), which may include two steps, i.e. a bounding box localization and a pixel level figure-ground segmentation, and then track them across different frames while maintaining the correct identities. The two principal sources of difficulty in performing this task are: a) changes in appearance of the objects with viewpoint, illumination and possible articulation, and b) partial occlusions of the target objects by other objects. For pedestrians, there are additional difficulties in tracking after initial detection. The image appearance of humans changes not only with the changing viewpoint but even more strongly with the visible parts of the body and clothing. In addition, it is difficult to maintain the identities of objects during tracking when they are close to each other.

Most of the previous efforts in pedestrian detection in videos have relied on detection by changes caused in subsequent image frames due to human motion. A model of the background is learned and the pixels departing from this model are considered the result of object motion; nearby pixels are then grouped into motion blobs. This approach is quite effective for detecting isolated moving objects when the camera is stationary, illumination is constant or varies slowly, and humans are the only moving objects; an early example is given in [102].
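As a rough illustration of this motion-blob style of detection (not part of this thesis; the running-average update rate, the difference threshold, and the minimum blob size below are arbitrary values chosen only for the sketch), such a background-subtraction detector can be written as:

```python
# Minimal sketch of a running-average background model with
# connected-component blob grouping; grayscale frames are assumed.
import numpy as np
from scipy import ndimage

def detect_motion_blobs(frames, alpha=0.05, thresh=25.0, min_area=200):
    """Yield, for each frame after the first, bounding boxes of motion blobs."""
    background = frames[0].astype(np.float64)
    for frame in frames[1:]:
        frame = frame.astype(np.float64)
        # Pixels far from the background model are marked as foreground.
        foreground = np.abs(frame - background) > thresh
        # Slowly adapt the model to gradual illumination changes.
        background = (1.0 - alpha) * background + alpha * frame
        # Group nearby foreground pixels into blobs (connected components).
        labels, _ = ndimage.label(foreground)
        boxes = []
        for sl in ndimage.find_objects(labels):
            h, w = sl[0].stop - sl[0].start, sl[1].stop - sl[1].start
            if h * w >= min_area:  # discard tiny blobs
                boxes.append((sl[1].start, sl[0].start, sl[1].stop, sl[0].stop))
        yield boxes
```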
For a moving camera, the background motion can be compensated for in some cases, but the registration may not be accurate in the presence of parallax. In any case, for more complex situations where multiple humans and other objects move in a scene, possibly occluding each other to some extent, the motion blobs do not necessarily correspond to individual humans; multiple moving objects may merge into a single blob with only some parts visible for the occluded objects, and a single human may split into multiple blobs. Figure 1.1 shows two examples where such difficulties can be expected.

Figure 1.1: Sample frames: a) from the CAVIAR set [12], and b) from the data we have collected.

A number of systems have been developed in recent years, e.g. [39, 118, 87], to segment multiple humans from motion blobs. While these systems demonstrate impressive results, they typically assume that all of a motion region belongs to one or more persons, whereas real motion blobs may contain multiple categories of objects, shadows, reflection regions, and blobs created because of illumination changes or camera motion parallax.

1.2 Motivation of Our Approach

We propose a shape based detection and tracking method with a part based representation of object classes. Our human eyes can recognize an object by its appearance alone, regardless of whether it is moving or not. The characteristic features could be color, texture, or shape. For the class of pedestrians, because of the tremendous variation in clothing, silhouette or contour is the most consistent feature. This observation is also true for many other object classes, such as cars. Compared to many existing motion based methods, the main advantages of shape based methods are:

1. Shape based methods are discriminative. They can apply to situations where there are multiple object classes that cannot be distinguished by motion. For example, for an outdoor surveillance system, pedestrians and cars are the two classes of greatest interest, and both of them are moving objects. A motion blob based detector cannot differentiate them.

2. Shape based methods can work without motion information. A situation where motion segmentation is not applicable or gives very noisy output is not a problem for shape based methods. For example, when the camera of a driving assistance system moves with the car, the motion segmentation is beyond the capability of the current techniques.

When objects are moving together or interacting with each other, they can occlude each other, such as a group of pedestrians. In addition, the objects of interest can be occluded by other scene objects, such as a human standing behind a bush. Our human eyes can recognize the object even when some parts of the object are invisible. This suggests a part based representation. Compared to holistic representations, the main advantages of part based representations are:

1. Part based representations can work with partial occlusions. When we know that some parts are not visible in the image, we need only look for evidence of the visible parts.

2. Part based representations are more robust to articulation and viewpoint changes, as the constraints on the spatial relation of the parts are flexible.

3. Part based representations result in more robust detectors and trackers, as the final decision is made based on evidence from multiple parts.

In developing an object detection and tracking system based on these two principles, several issues arise. First, how to learn the part detectors?
This includes defining the body parts, covering different viewpoints, and integrating different types of image cues. Second, how to combine the part detection results into final object hypotheses, especially when there are partial occlusions? Third, how to track objects given detection responses with classification errors? This includes building a discriminative model for a particular object, as well as automatic initialization and termination of trajectories. All these questions are open and their answers may not be unique.

1.3 Overview of Our Approach

We have implemented an automatic object detection and tracking system based on a part based representation and shape based classifiers. This system consists of three main components: a learning component, a detection and segmentation component, and a tracking component. Figure 1.2 shows an overview diagram. In the learning stage, several part detectors are trained from a number of training samples by a supervised learning algorithm. When detecting and tracking in videos, the frames are first fed into the detection module, whose output is a number of bounding boxes; then the frame by frame detection responses are sent to the tracking module, whose output is a number of object trajectories. The trajectory initialization and termination are fully automatic and rely on the detection responses.

Figure 1.2: Overview diagram of our approach.

We represent an object as a hierarchy of parts. For each part, we learn a detector by boosting local shape features. To cover multiple viewpoints and different poses, we build tree structured part detectors using a boosting approach, which is an enhanced version of the object detection framework proposed by Viola and Jones [96]. The tree structured classifier divides the object class into several sub-classes based on shape similarity by unsupervised clustering, and allows feature sharing between different sub-classes. Besides part detectors, a whole-object segmentor, whose output is a pixel-level figure-ground segmentation, is also learned based on the local shape features and by a boosting approach.

In the detection process, an image is scanned by the part detectors first, and then the pixel-level segmentation is obtained by applying the object segmentor. The responses of the part detectors are combined based on a joint image likelihood function for multiple, possibly inter-occluded objects. We formulate the multiple object detection problem as a MAP estimation problem, and search the solution space to find the best interpretation of the image observation. During the combination, inter-object occlusion reasoning is done based on an assumption about the 3-D scene structure. Performance of the combined detector is better than that of any individual part detector. For applications with limited environments, an online, unsupervised learning method is developed to adapt generic object detectors to the specific environments to achieve better accuracy. This method is based on the online boosting algorithm proposed by Oza and Russell [67].

Our tracking method is based on tracking parts of the object. The detection responses from the part detectors and the combined detector are taken as input for the tracker. We track objects by data association, i.e. matching the object hypotheses with the detection responses, whenever corresponding detection responses can be found. The output of the detection module may consist of several levels with different classification confidences. For robustness, object hypotheses are associated with detection responses in the descending order of their confidence levels.
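A minimal sketch of this confidence-ordered association step is given below; it is illustrative only, and the `affinity` function, the `level` attribute on responses, the affinity threshold, and the greedy matching rule are our assumptions rather than the exact procedure described in Chapter 7:

```python
# Greedy, confidence-ordered association of trajectories with detections.
def associate(tracks, responses, affinity, min_affinity=0.5):
    """Match each track to at most one detection, most confident level first."""
    matches = {}                       # track index -> matched response
    unmatched_tracks = set(range(len(tracks)))
    for level in sorted({r.level for r in responses}, reverse=True):
        # Score all remaining track/response pairs at this confidence level.
        candidates = [(affinity(tracks[t], r), t, r)
                      for t in unmatched_tracks
                      for r in responses if r.level == level]
        for score, t, r in sorted(candidates, key=lambda c: c[0], reverse=True):
            if score < min_affinity:
                break                  # remaining pairs are even weaker
            if t in unmatched_tracks and r not in matches.values():
                matches[t] = r         # track t is explained by response r
                unmatched_tracks.discard(t)
    # Tracks left unmatched fall back to the mean-shift tracker (next paragraph).
    return matches, unmatched_tracks
```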
If no detection responses are found for the object, a mean-shift tracker [15] is used to follow it. Most of the time objects are tracked successfully by data association; the mean-shift tracker is utilized only occasionally and for short periods. Since our method is based on part detection, it can work under both scene and inter-object occlusion conditions. As the cues for tracking are strong, we do not utilize statistical sampling techniques as in some of the previous work, e.g. [39, 118, 87]. A trajectory is initialized when evidence from new observations cannot be explained by the current hypotheses, as is also the case in many previous methods [19, 39, 118, 87, 71]. Similarly, a trajectory is terminated when it is lost by the detectors for a certain period. Figure 1.3 shows some examples of the final output of the system.

Figure 1.3: Examples of tracking results.

1.4 Summary of Results and Contributions

We use pedestrians as the main class of interest to demonstrate our approach. However, some components of the system are also applied to the class of cars. For the detection and segmentation modules, we evaluate our approach with examples of un-occluded objects as well as partially occluded objects. The experimental results show that our approach achieves state-of-the-art performance on un-occluded objects, and outperforms the other current methods on partially occluded objects. We evaluate our online learning algorithm on two public video corpora of surveillance scenarios. The experimental results show that our approach can greatly improve the performance of general detectors on particular applications. We evaluate our tracking methods on a number of video sequences. Considerable and persistent occlusions are present and the scene background can be highly cluttered. The camera can be stationary or moving/zooming. The environment can be indoors or outdoors with possibly changing illumination. The results show that our methods achieve state-of-the-art performance for tracking.

The major contributions of this thesis work include:

• A novel type of local shape feature, the edgelet. An edgelet can be considered a local version of the traditional edge template. It explicitly captures local shape information.

• An automatic tree learning algorithm, the Cluster Boosted Tree (CBT). The CBT learning method does not require the pre-definition of the tree structure based on domain knowledge. Instead, it adaptively builds tree classifiers according to the difficulty of the problem.

• A feature integration method to improve detection performance by using multiple, heterogeneous types of features. Our feature integration method optimizes both classification accuracy and computational efficiency.

• A classification based object segmentation method. We formulate segmentation as a binary classification problem, and learn a boosted segmentor from local shape features.

• A part combination method for detection and segmentation of multiple, partially occluded objects. Our combination method performs joint analysis of multiple objects. The joint likelihood is computed and optimized based on the occlusion relation of multiple objects.

• An online, unsupervised learning method to adapt generic object detectors to particular applications. Our automatic labeler for unsupervised learning is based on a set of part classifiers. Our online learning method is based on the online boosting algorithm. This approach is an extension of the co-training framework.
• A part based tracking method, in which part detection responses are combined and linked to form object trajectories. As a part based representation is used, this algorithm can track objects through partial occlusions.

• A soft decision based tracking method to improve tracking robustness with noisy detection results. Tracking is done by associating detection responses of multiple confidence levels.

1.5 Outline of Thesis

The rest of the thesis is organized as follows:

• Chapter 2 introduces related previous work.

• Chapter 3 describes two algorithms for learning object detectors. The first algorithm trains tree structured object classifiers automatically. It was originally published in [103]. The second algorithm integrates multiple, different types of features for object detection, and optimizes the tradeoff between accuracy and speed. It will be published in [107]. This chapter also describes the shape feature we invented for object detection. This was originally published in [105, 104].

• Chapter 4 describes our classification based object segmentation method, which was originally published in [108].

• Chapter 5 describes our joint analysis algorithm for part combination. The first version of this method was published in [105, 104]. The second version of this method will be published in [111]. This chapter focuses on the description of the second version, which is an enhancement of the first version.

• Chapter 6 describes our online, unsupervised learning method to specialize generic object detectors. This method was originally published in [106].

• Chapter 7 describes two detection based tracking algorithms. The first algorithm combines part tracking to cover occluded objects. It was originally published in [110, 104]. The second algorithm uses multiple levels of classification confidence to improve tracking robustness. It was originally published in [115].

• Some conclusions and discussions about future work are given in the last chapter.

Chapter 2
Related Work

The literature on object detection and segmentation in static images, and on object tracking in videos, is abundant; a thorough survey is beyond the scope of this chapter. As we take pedestrians as the major object class of interest to demonstrate our approach, we focus the introduction of related work mainly on pedestrian related methods.

2.1 Object Representation

Many methods for pedestrian detection in static images represent a human body as an integral whole, e.g. Papageorgiou et al.'s SVM detectors [68], Felzenszwalb's shape models [24], Wu et al.'s Markov Random Field based representation [114], and Gavrila et al.'s edge templates [31, 30, 29]. The boosting based object detection framework proposed by Viola and Jones [96] has proven very efficient for detecting faces. Viola et al. [97] have adapted this framework to the problem of pedestrian detection and achieved good performance. Their method is also based on a holistic representation. Holistic representation based methods do not work well with large spatial occlusions, as they need evidence for most parts of the whole body.

Some methods that represent the human body as an assembly of parts have been developed. Mohan et al. [60] divide the human body into four parts: head-shoulder, legs, left arm, and right arm. The results reported in [60] show that the part based human model is much better than the holistic model in [68] for detection. Shashua et al. [82] divide the human body into nine overlapping sub-regions. Some of the sub-regions correspond to natural human body parts, such as head, torso, arms, and legs; some do not.
Mikolajczyk et al. [59] divide the human body into seven parts: face/head for frontal view, face/head for profile view, head-shoulder for frontal and rear view, head-shoulder for profile view, and legs. Lin et al. [55] divide the human body into four parts: head, torso, legs, and feet. Shet et al. [83] divide the human body into three parts: head, torso, and legs, in addition to a full-body model. Most of the existing methods define the partition based on natural human body structure, mainly because the natural body parts have a relatively consistent appearance and are well-defined.

In the part based methods, individual classifiers are learned for the parts. Given a new image, all the part classifiers are applied. The part detection results, i.e. a set of part responses with classification confidence, are combined to form human hypotheses. Many previous methods [60, 82, 59] combine the parts independently for each human. The combination is usually based on majority voting (e.g. checking if more than half of the parts are detected) or a weighted sum of the part detection results (thresholding the weighted sum of the part detection confidences, where the weights are determined by the performance of the part detectors). This type of method does not consider the occlusion relation of multiple humans. Only recently, starting from our work [105], several methods have been developed [55, 83] to combine parts of multiple humans jointly to detect humans with inter-occlusions.

2.2 Image Features

For the problem of object detection, a large variety of image features has been developed. Some are spatially global features, e.g. the edge template by Gavrila et al. [31, 30, 29]. However, most recent methods use local features, because local features are less sensitive to occlusions and other types of missing observations. Some examples are the wavelet descriptor by Schneiderman and Kanade [79], the Haar like feature by Viola and Jones [96], the sparse rectangle feature by Huang et al. [35], the SIFT like orientation feature by Mikolajczyk et al. [59], the Histogram of Oriented Gradients (HOG) descriptor by Dalal and Triggs [16], the code-book of local appearance by Leibe et al. [50, 49], the boundary fragment by Opelt et al. [64], the biologically-motivated sparse, localized feature by Mutch and Lowe [61], the shapelet feature by Sabzmeydani and Mori [76], the covariance descriptor by Tuzel et al. [92], the motion enhanced Haar feature by Viola et al. [97], and the Internal Motion Histogram (IMH) by Dalal et al. [17]. These features have been applied successfully to the detection of several object classes, including faces [79, 96, 35, 68], pedestrians [30, 29, 59, 16, 50, 105, 76, 92, 97, 17], and cars [79, 49, 61, 103]. The local feature based methods are less sensitive to occlusions, as only some of the features are affected by occlusions. After the features or descriptors are computed, they are fed into a classifier. The classifier could be an SVM [16, 68], a boosted cascade [96], or based on a graphical model [34, 85, 64, 50].

Some previous methods use more than one type of image feature to improve classification accuracy. For the graphical models, different types of features are naturally integrated by the observation model of each node, such as in [85]. However, estimating the joint probability distribution in a high dimensional feature space is not feasible. In practice, it is usually assumed that the different feature types are independent, so that the joint probability is equal to the product of the probabilities of the individual feature types.
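In symbols (our notation, not taken from the works cited), writing y for the class label and f_1, ..., f_K for the K feature types, this independence approximation is

\[ p(f_1, f_2, \ldots, f_K \mid y) \;\approx\; \prod_{k=1}^{K} p(f_k \mid y), \]

so each feature type contributes an independently estimated likelihood factor; this keeps estimation tractable but ignores correlations between feature types.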
For SVM classifiers, concatenated feature vectors are commonly used as input, but this is feasible only when the features are homogeneous, as in the combination of two histogram features (HOG and IMH) in [17]. A linear combination of multiple non-linear kernels, each of which is based on one feature type, is a more general method to integrate heterogeneous features, e.g. [94]. However, both the vector concatenation and kernel combination methods require the evaluation of all features. For cascade classifiers, different types of features can be included in one large feature pool, from which an ensemble classifier is learned by a boosting algorithm, as in [97]. However, if there are big differences in computational complexity between the feature types, the speed of the cascade classifier learned in this way will be dominated by the most complex feature type.

2.3 Off-line Supervised Learning for Object Detection

Many existing approaches build generic object classifiers by off-line learning methods, such as SVM [17] and Artificial Neural Network (ANN) [74]. The input of an off-line learning algorithm is a set of training samples with class labels; the output is a classifier. A recent breakthrough on object detection problems is the cascade structured boosted classifier proposed by Viola and Jones [96]. This method uses the AdaBoost [25] algorithm to select a number of weak classifiers that are based on simple image features, such as Haar wavelets, and to combine them to form strong classifiers. One cascade classifier consists of a series of strong classifiers, called layers. Only when an input sample is accepted by one layer as an object is it sent to the next layer for further processing. During detection, a large number of image sub-windows are checked by the classifier. The cascade structure allows most of the non-object sub-windows to be filtered out in the first few layers so that the detection procedure is very efficient. When the intra-class variation is small, e.g. the frontal view faces in [96] and the frontal/rear view pedestrians in [105], the cascade structured detectors achieve good accuracy. However, when the object class variance is large, a divide-and-conquer strategy is necessary, e.g. the variations of tree structured detectors for multi-view faces [36, 41].

The differences between the various tree structure based classification methods are mainly twofold: first, how to split a tree node; and second, how to learn a tree node. In [41], a pose estimator is learned to classify a sample into one of several view-based sub-categories; in [36] one tree node is a multi-class multi-label classifier, where each sub-category corresponds to one view and a sample may be sent to multiple view-based sub-categories depending on the output vector. The tree nodes are learned by the binary AdaBoost algorithm [25] in [41, 90], and by the Vector Boosting algorithm in [36] (we refer to the method of [36] as Vector Boosted Tree, VBT). Among these, the method of [36] achieved the best performance.

Intra-class sub-categorization can be either top-down or bottom-up. For top-down, some domain knowledge is used to group the samples. For example, Huang et al. [36] pre-define five view sub-categories based on the left-right out-of-plane rotation angle of the faces. The top-down sub-categorization is usually defined manually, as the automatic extraction of high level knowledge could be even more difficult.
Some methods have been developed to compute the intra-class sub-categorization automatically based on low level image features, i.e. bottom-up. Seemann et al. [80] cluster pedestrians based on their silhouettes. However, the features used for clustering are different from the features used for detection, and hence the clustering is not optimized directly for the detection task. Tu [90] proposes a Probabilistic Boosting Tree (PBT) model for general object classification problems, in which the estimated posterior probability is used to divide the samples. However, his splitting method cannot achieve a balanced division for both the positive and negative sample spaces, as object detection is an asymmetric classification problem.

Some other researchers try to use exemplar-based methods to model complicated object classes. Shan et al. [81] learn a pedestrian detector by boosting exemplar-based weak classifiers. Each weak classifier is built based on one representative exemplar. However, their algorithm is sensitive to the initial candidate set of exemplars. How to select the candidate set is still an open problem. SVM based approaches, e.g. the HOG based SVM for pedestrian detection by Dalal and Triggs [16], are another type of exemplar based method. Although SVM based methods have good accuracy, they are computationally expensive.

2.4 Online Unsupervised Learning for Object Detection

In the off-line supervised learning methods, the classifiers are learned once and then fixed for all applications. In order to obtain a generic detector with good performance, tens of thousands of samples could be required [36]. Manually labeling such a large amount of data is time-consuming. In some applications, the environments considered are limited. For example, a surveillance system with a stationary camera only monitors a particular scene. In such a case, a specialized detector could be better than a general detector in terms of both accuracy and efficiency. With an off-line supervised learning algorithm, to obtain a specialized detector for a new application, we have to manually collect some new training data and rerun the whole training procedure. Online unsupervised learning, which adapts existing general detectors to a special task, is more desirable here.

The key components of unsupervised, online learning algorithms are 1) an automatic labeler, called an oracle, which segments and labels objects from raw data automatically, and 2) an online learning algorithm, which modifies the existing classifiers based on the samples collected by the oracle.

The design of the oracle is not trivial, as it is an object detector itself. The difference between an oracle and a regular detector is that the precision of the oracle should be high while the detection rate could be low. Some existing methods, e.g. [62, 73], use motion segmentation as an oracle. However, motion based object detection is not robust due to many factors, such as shadows, reflections, merging and splitting of blobs, illumination change, etc. In order to improve the precision of the motion based oracle, an appearance based model can be used for verification, e.g. the PCA based representation by Roth et al. [73]. When the oracle is relatively weak, in order to achieve high precision, the labeling must be very conservative, which results in a very low detection rate.

Instead of making a separate oracle, some methods use the learning framework called co-training [4] to improve the performance of a pair of classifiers with unlabeled data.
The inputs of co-training are two classifiers and a set of unlabeled data. The samples confidently classified by classifier A/B are used to update classifier B/A. It has been proven that when the two seed classifiers are not fully correlated, they can be improved by co-training with unlabeled data [4]. Levin et al. [53] use co-training to improve the performance of two car detectors, which are learned from regular images and foreground images respectively. The correlation of these two types of inputs is relatively high, and the performance of the final detectors is not good. Javed et al. [40] apply co-training to boosted ensemble classifiers to classify moving blobs into cars and pedestrians. The samples confidently classified by a subset of the weak classifiers in the ensemble are used to update the remaining weak classifiers. However, the weak classifier, which is based on one dimension of a learned PCA model for cars and humans, is not very discriminative, resulting in an ineffective oracle.

Given a sample collected by the oracle, some incremental learning algorithm is used to update the current classifier. Oza and Russell [67] propose an online version of the boosting algorithm to learn ensemble classifiers in an incremental way. Some variations of this algorithm have recently been developed and applied to several vision problems, including object detection [73, 40, 32]. For object detection, the boosted detectors [96] are very efficient because of their cascade structure. However, the existing online boosting algorithms are designed for standard ensemble classifiers, where the number of weak classifiers is fixed, and the decision is made only after all weak classifiers are evaluated. An online learning algorithm for cascade detectors must be able to change the complexity of the cascade structure and to refine the cascade decision strategy adaptively.

One of the main issues of unsupervised learning is the oracle's errors, which can be categorized into two types for detection tasks: alignment errors and labeling errors. When a positive sample, i.e. a sub-window cut by the oracle, does include an object, but the position or the size of the object is not accurate, this is considered an alignment error. When the predicted label of a sample is wrong, this is considered a labeling error. These errors can make the learner over-fit, and must be restrained or eliminated in the oracle part, in the learning part, or in both. However, none of the above efforts include noise restraining strategies explicitly.

2.5 Object Segmentation

The output of the object detection methods is a set of bounding boxes of the objects, which can be seen as a rough segmentation. However, an accurate pixel-level figure-ground segmentation is needed for a number of high level tasks. For example, for tracking, we need to build appearance based models for individual objects, which requires knowing which regions in the image belong to the objects.

Traditionally, one would segment an image by one of the various region segmentation methods, and then try to classify the regions as belonging to one of the desired classes. This approach works well when objects of interest have relatively homogeneous properties in some image attributes, such as intensity, color, or texture. However, for many common objects of interest, e.g. humans, the surfaces are not uniform and the texture can be arbitrarily complex. In such cases, no effective algorithms for bottom-up segmentation have been devised; existing methods tend to over-segment or under-segment an image.
If objects of interest are moving, motion-based segmentation can be more reliable, but even here, merging of motion blobs with adjacent objects and with shadows and reflections can be problematic.

The main difference between general image segmentation methods, e.g. [91], and the segmentation of objects of a known class is the use of prior knowledge, i.e. an object model, of the class concerned. In addition to guiding segmentation, the object models can also function as discriminative models for recognition and detection, e.g. [101, 88, 64, 85, 42, 100, 49, 70], or as generative models for pose estimation, e.g. [9].

Many of the recent efforts build object models from image features other than color value or gray intensity. The use of features enables us to focus on informative and discriminative image cues. As with object detection, the features used for object segmentation can be global or local; global features are relatively sensitive to partial occlusions compared to local ones. The properties that the features attempt to capture mainly include: 1) color, e.g. the mixture of Gaussians color model in [85, 84], and the kernel density estimation of the color distribution in [116]; 2) texture, e.g. the textons in [85]; and 3) shape, e.g. the part templates in [7], the boundary or contour fragments in [64, 84], and the simplified SIFT descriptor in [42].

When global features are used, the object models are sometimes equal to the features, e.g. the edge template models in [31, 116]. When local features are used, we need some way to organize the features into object models. Many existing object segmentation methods use random field approaches, e.g. the Layout Consistent Conditional Random Field in [101], the Located Hidden Random Field in [42], the texton based CRF in [85], the pose-specific MRF in [9], and the Pictorial Structure enhanced MRF in [70]. Inference in the CRF models usually requires loopy belief propagation or sequential tree-reweighted message passing, while graph cut is a widely used solution for inference in the MRF models. These techniques are computationally expensive. Some other methods use constellation type models to organize the local features, e.g. the Boundary-Fragment-Model in [64] and the Implicit Shape Model in [50, 49]. These models are star-shaped and can be inferred much more efficiently with the Hough transform. However, both the random field methods and the constellation methods usually assume a fixed object size, so the solution space is highly restricted. Some of these methods, e.g. [88, 101, 64, 42, 85, 100, 116, 70], perform simultaneous detection and segmentation.

Unlike the random field and constellation approaches, the boosting methods originally proposed for detection encode the shape of the objects by including a number of local features within the sample window; the relative positions of these local features represent the shape implicitly. Although some existing methods, e.g. [64, 85], use boosting as a feature selector for segmentation, none learn the ensemble classifier as a segmentor directly.

2.6 Object Tracking

For tracking of humans, some early methods, e.g. [117], track motion blobs and assume that each individual blob corresponds to one human. These early methods usually do not consider multiple objects jointly and tend to fail when blobs merge or split. Some of the recent methods [39, 118, 87, 71] try to fit multiple object hypotheses to explain the motion blobs.
These methods infer occlusion relations and compute a joint image likelihood for multiple objects. Because the joint hypothesis space is usually of high dimension, an efficient optimization algorithm, such as a particle filter [39], MCMC [118, 87], or EM [71], is used. All of these methods show experiments with only a stationary camera, where background subtraction provides relatively robust motion blobs. The motion blob based methods are not discriminative: they assume that all moving pixels come from the target objects. Although this is true in some environments, it is not in more general situations.

Detection based tracking methods attempt to overcome these limitations by using discriminative methods. Recently, the development of object detection techniques has produced many promising methods for detecting particular object classes. These object detectors produce reliable observations for detection based tracking algorithms: detection responses at different frames are linked together to form object trajectories. For example, Davis et al. [19] build deformable silhouette models for pedestrians and track the models based on edge features; the silhouette matching is done frame by frame. Li et al. [54] and Okuma et al. [63] use particle filter based methods to associate the detection responses of an unknown number of objects. The detection responses are used to generate new particles and to evaluate existing particles. By particle filtering, the tracker can maintain multiple hypotheses; however, increasing the number of particles requires more computation. To improve computational efficiency, Li et al. [54] use multiple detectors (observers) to form a cascade particle filter, in which the detectors are applied in order of their computational costs: the faster, the earlier. Leibe et al. [51] formulate detection and tracking as a combined Quadratic Boolean Problem (QBP). The optimization criterion is designed to find the best spatial-temporal interpretation of the detection responses. However, there is no efficient optimal solution for this complicated optimization problem; an EM-style iterative algorithm is used to find a suboptimum. Detection based tracking methods are fully automatic, need no outside initialization, and are less dependent on camera motion.

Part tracking has been used to track the pose of humans [86, 72, 47]. However, the objectives of pose tracking methods and multiple human tracking methods are different, and so are their methodologies. The existing pose tracking methods do not consider multiple humans jointly. Although they can work with temporary or slight partial occlusions, because of the use of a part representation and the enforcement of a temporal consistency constraint, they do not work well with persistent and significant occlusions, as they do not model occlusions explicitly and the part models used are not very discriminative. The automatic initialization and termination strategies in the existing pose tracking methods are not general: in [72] a human track is initialized only when a side view walking pose is detected, and no termination strategy is mentioned.

Chapter 4 discusses these issues further.

Chapter 3

Object Detection by Boosting Local Shape Features

Our approach detects objects by combining part detection responses. For individual object parts, we learn classifiers based on local shape features using boosting methods. This chapter describes the two learning algorithms used to build part classifiers.
3.1 Boosting Edgelet based Tree Structured Classifier

Our first learning algorithm builds tree structured classifiers. The image features used are a novel type of shape oriented local features.

3.1.1 Motivation and Outline

The image appearance of objects changes as a result of many factors, such as illumination, viewpoint, and pose. To make the image patterns relatively invariant, the external variations are sometimes limited. For example, the class of frontal human faces has smaller variance than the class of multi-view faces. For object classes with small intra-class variation, the cascade structured classifier [96] has proven to be an efficient solution. However, for more diverse patterns, such as multi-view faces or multi-view multi-pose human bodies, more powerful classifier models are needed, such as the tree structured classifiers in [90, 36]. The underlying philosophy is "divide-and-conquer": when the class cannot be modeled as a whole, divide it into several sub-categories and learn a model for each of them individually.

When the appearance changes are due to multiple factors, as in the case of human bodies, it is difficult to find one dominating property by which to divide the samples. Manually assigning sub-category labels to the training samples can be difficult and time consuming, and a domain knowledge-based sub-categorization may not be optimal for the classification task. We propose a learning method that automatically constructs tree structured classifiers. Instead of using a predefined intra-class sub-categorization based on domain knowledge, we divide the sample space by unsupervised clustering based on discriminative image features. We call this method the Cluster Boosted Tree (CBT).

For our classifier model, we choose the tree structure proposed by Huang et al. [36]. The VBT model in [36] is more suitable for detection tasks than Tu's PBT model in [90], as PBT has no cascade detection strategy, and the root node of PBT is incognizant of the intra-class sub-categorization information. These differences in design choices may be due to the different objectives of the two works: [36] focuses on detection, while [90] aims at a general object classification model.

Our learning method is an iterative algorithm. Initially, the tree classifier contains only a null branch, i.e. there is one sub-category containing all the samples. At each boosting round, for each branch of the tree, one simple feature based weak classifier is selected from the weak classifier pool and attached to the branch, and the discriminative power of the newly added weak classifier is estimated. If this power is too weak, we divide the branch into two smaller and thus easier problems by unsupervised clustering, i.e. the training samples are divided into two sub-categories. The clustering is performed on the responses of the selected image features, which form an informative descriptor of the samples. The new intra-class sub-categorization information is used to generate new tree branches, and is also propagated upstream to refine the classification functions of the parent nodes. The growth of each branch stops automatically when a target accuracy is reached. We apply our method to the classes of pedestrians and cars to show its accuracy and efficiency.

3.1.2 Tree Learning Algorithm

The learning method described in this section is generic for any type of features. We introduce the image feature we use later, in Section 3.1.3.
3.1.2.1 Formulation of tree structured classifier

We begin this section with a formal definition of a tree structured classifier. The input of such a classifier is an image sample x \in X, where X is the sample space, and the output is a vector. Let

H(x) = [H_1(x), \ldots, H_C(x)]   (3.1)

be the tree classifier, where C is the number of sub-categories, called channels. Each channel is an ensemble classifier:

H_k(x) = \sum_{t=1}^{T} h_{k,t}(x), \quad k = 1, \ldots, C   (3.2)

where h_{k,t} is called a weak classifier and T is the number of weak classifiers, i.e. the depth of the tree. The input of a weak classifier is an image sample x and its output is a real-valued number, whose sign is the class prediction (positive for object and negative for non-object). Each weak classifier is built on one image feature f; weak classifiers at the same level may share the same feature.

Figure 3.1 gives an illustration of the tree structure. The tree shown has three levels and three channels. At the first level, all the weak classifiers of the three channels share the same feature, indicated by the red box, though they may have different classification functions. At the second level, the third channel stops sharing features with the other two channels; this is called a splitting operation. At the third level, the first and second channels split. Feature sharing has proven to be an efficient strategy for multi-class object detection [89] and multi-view object detection [56, 36]. The underlying assumption is that although different sub-categories of the objects appear quite different, they may share some common visual features.

Figure 3.1: Illustration of tree structure.

To integrate with the cascade decision strategy, a threshold b_{k,t} is learned for each weak classifier h_{k,t}. Let H_{k,t} be the partial sum of the first t weak classifiers of channel k, i.e.

H_{k,t}(x) = \sum_{j=1}^{t} h_{k,j}(x)   (3.3)

A sample is accepted by channel k if and only if

\forall t \in \{1, \ldots, T\}: \; H_{k,t}(x) > b_{k,t}   (3.4)

A sample is classified as an object if and only if it is accepted by at least one channel.

Our learning algorithm is based on the real AdaBoost algorithm proposed by Schapire and Singer [78]. An image feature can be seen as a function from the image space to a real valued range, f : X \to [0,1]. The weak classifier based on f is a function from the image space X to a real valued object/non-object classification confidence space. Given a labeled sample set S = \{(x_i, y_i)\}, where x \in X is the image patch and y = \pm 1 is the class label of x, in real AdaBoost we first divide the sample space into several disjoint parts:

X = \bigcup_{j=1}^{n} X_j   (3.5)

where n is the size of the partition. The weak classifier h is then defined as a piecewise function on the partition:

\text{if } x \in X_j, \quad h(x) = \frac{1}{2} \ln\!\left( \frac{W_+^j + \varepsilon}{W_-^j + \varepsilon} \right), \quad j = 1, \ldots, n   (3.6)

where \varepsilon is a smoothing factor [78], and W_\pm is the probability distribution of positive/negative samples over the partition, implemented as a histogram:

W_\pm^j = P(x \in X_j, \, y = \pm 1), \quad j = 1, \ldots, n   (3.7)

In practice, following [36], the partition of the sample space is obtained by dividing the feature range into n equal sized sub-ranges, i.e.

X_j = \left\{ x \,\middle|\, f(x) \in \left[ \tfrac{j-1}{n}, \tfrac{j}{n} \right) \right\}, \quad j = 1, \ldots, n   (3.8)

In our experiments, we set n = 32.
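To make this construction concrete, the following sketch (in Python, with illustrative helper names; not the thesis implementation) builds one real AdaBoost weak classifier from a 1-D feature response by histogramming the weighted samples into n = 32 bins, as in Equations 3.6-3.8.

```python
import numpy as np

def train_weak_classifier(feat_vals, labels, weights, n_bins=32, eps=1e-4):
    """Piecewise weak classifier from a 1-D feature response in [0, 1].

    feat_vals: feature responses f(x) for all training samples
    labels:    +1 (object) / -1 (non-object)
    weights:   current boosting weights D_t(x), summing to 1
    Returns the per-bin outputs h_j = 0.5 * ln((W+_j + eps) / (W-_j + eps)).
    """
    bins = np.clip((np.asarray(feat_vals) * n_bins).astype(int), 0, n_bins - 1)
    w_pos = np.zeros(n_bins)
    w_neg = np.zeros(n_bins)
    for b, y, w in zip(bins, labels, weights):
        if y > 0:
            w_pos[b] += w          # W_+^j  (Eq. 3.7)
        else:
            w_neg[b] += w          # W_-^j  (Eq. 3.7)
    return 0.5 * np.log((w_pos + eps) / (w_neg + eps))   # Eq. 3.6

def apply_weak_classifier(h, feat_val, n_bins=32):
    """Evaluate the piecewise classifier on a new feature response."""
    b = min(max(int(feat_val * n_bins), 0), n_bins - 1)   # Eq. 3.8 partition
    return h[b]
```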
3.1.2.2 Splitting sample space

As our purpose is to construct tree classifiers without a predefined intra-class sub-categorization, we need to answer two questions: when is sub-categorization necessary, and how should the sample space be divided?

In real AdaBoost [78], the classification power of a weak classifier is measured by a Z value:

Z = 2 \sum_{j} \sqrt{ W_+^j \, W_-^j }   (3.9)

When W_\pm is normalized, Z \in [0,1] measures the overlap (the Bhattacharyya coefficient) between the distributions of the object and non-object classes. The smaller Z is, the more discriminative the weak classifier is. At each boosting round, the weak classifier with the smallest Z is selected. Due to the cascade decision strategy, the classifier gradually focuses on the difficult part of the sample space, so it becomes harder and harder to find weak classifiers with small Z values as boosting proceeds. Figure 3.2 shows the evolution of the Z values of the selected weak classifiers. When Z gets very close to 1, the contribution of the weak classifier to the whole classifier is small; in other words, the current problem is beyond its ability. This suggests a division. In practice we set a threshold θ_Z: if the Z values of three consecutive weak classifiers are all larger than θ_Z, the splitting operation is triggered.

Figure 3.2: Evolution of the Z value of the selected weak classifiers over boosting rounds.

Because our method is a discriminative approach based on image features, we use a bottom-up splitting strategy to facilitate classification, rather than a top-down one. We argue that the features selected by boosting form an informative descriptor of the object class; as they are selected for the classification task directly, clustering based on these features is a better way to divide the sample space than divisions based on domain knowledge or posterior probability. In Section 3.1.4 we show the advantage of our splitting method. Suppose that, up to the splitting point, K features have been selected; then the vector f = (f_1, \ldots, f_K) is fed into an unsupervised clustering algorithm to divide the sample space into two parts. In practice, we use the standard k-means algorithm with Euclidean distance. Although there are many other clustering algorithms, we find this one serves our purpose well.

3.1.2.3 Retraining with sub-categorization

During the training procedure, without predefined sub-categorization, one leaf node of the tree always corresponds to one single channel. Suppose the current tree has t levels and c channels. When a leaf node h_{k,t} splits, two branches with a new channel are generated, h_{k,t+1} and h_{c+1,t+1}. The new sub-categorization information is sent upstream to all the ancestors of h_{k,t+1} and h_{c+1,t+1} to refine their classification functions. For example, h_{k,t} changes to a pair {h_{k,t}, h_{c+1,t}}. The entire boosting procedure is then rerun from the beginning for the two changed channels, k and c+1. As opposed to a standard boosting algorithm, in the retraining we do not perform feature selection; we only recompute the classification functions, i.e. h_{k,t} and h_{c+1,t} share the original feature of h_{k,t}, and their classification functions are retrained on the samples of the new sub-categories. Later, in Section 3.1.4, we show that the retrained weak classifiers have better performance.

Our complete learning algorithm is given in Figure 3.3 and Figure 3.4. Note that learning of the cascade decision strategy is integrated with the boosting procedure. This enables us to represent the learning algorithm as one boosting procedure, instead of several stages as in [96]. The main new parameter, compared to previous cascade learning methods, is the splitting threshold θ_Z, which is set based on experience; we find that the resulting classification accuracy is not very sensitive to this parameter.
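Before the full listings, the split trigger and the feature-based clustering step can be sketched as follows (Python; the helper names and the use of scikit-learn's KMeans are illustrative assumptions, not the thesis implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def z_value(w_pos, w_neg):
    """Eq. 3.9: smaller Z means a more discriminative weak classifier."""
    return 2.0 * np.sum(np.sqrt(w_pos * w_neg))

def should_split(recent_z, theta_z=0.985):
    """Trigger a split when three consecutive selected weak classifiers
    all have Z values above the threshold theta_Z."""
    return len(recent_z) >= 3 and all(z > theta_z for z in recent_z[-3:])

def split_positives(selected_feature_fns, pos_samples):
    """Cluster the positive samples of a branch into two sub-categories,
    using the responses of the K features selected so far as descriptor."""
    descriptors = np.array([[f(x) for f in selected_feature_fns]
                            for x in pos_samples])
    assignment = KMeans(n_clusters=2, n_init=10).fit_predict(descriptors)
    return assignment   # 0 -> stays in channel k, 1 -> new channel c+1
```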
The output classifier of our learning algorithm has the same structure as that in [36], while our algorithm does not require a predefined intra-class sub-categorization.

• Given the initial sample set S = S_{+1} ∪ S_{−1} = {(x_i, +1)} ∪ {(x_i, −1)}, and a negative image set;
• Set the algorithm parameters: the maximum weak classifier number T, the positive passing rates {P_t}_{t=1..T}, the target false alarm rate F, the maximum channel number C, the splitting threshold θ_Z, and the bootstrapping threshold θ_B;
• Initialize the weights D_0(x) = 1/|S| for all samples, the current false alarm rate of the current channel F_{1,0} = 1, the current channel number c = 1, and t = 0;
• Construct the weak classifier pool H from the image features;
• While t < T do
  1. For all channels, k = 1, …, c, do
     (a) If F_{k,t} < F, continue;
     (b) For every weak classifier in H, learn the classification function from the sample sets S_{±k} by Equ. 3.6;
     (c) Compute the Z values of all weak classifiers by Equ. 3.9, and select the weak classifier with the smallest Z, denoted h_{k,t}, with its corresponding feature f_{k,t};
     (d) Update the sample weights by

         D_{t+1}(x) = D_t(x) \exp[-y \, h_{k,t}(x)], \quad \forall x \in S_{\pm k}   (3.10)

         and normalize D_{t+1} to a probability distribution;
     (e) Select the threshold b_{k,t} for the partial sum H_{k,t}, so that a portion P_t of the positive samples is accepted, and reject as many negative samples as possible;
     (f) Remove the rejected samples from S_{±k}. If the remaining negative samples are fewer than θ_B percent of the original, recollect S_{−k} by bootstrapping on the negative image set and update F_{k,t};
     (g) If the Z values of h_{k,t}, h_{k,t−1}, and h_{k,t−2} are all larger than θ_Z, and c < C, perform splitting and retraining:
         i. For every x ∈ S_{+k}, compute the feature descriptor f(x) = [f_{k,1}(x), …, f_{k,t}(x)];
         ii. Based on f(x), run k-means clustering on S_{+k} to divide it into two parts, and assign the labels k and c+1 to the two new sub-categories, i.e. S_{+k} → S_{+k} ∪ S_{+(c+1)};
         iii. Retrain all the previous weak classifiers of channel k with the algorithm in Figure 3.4;
         iv. c ← c + 1;
  2. t ← t + 1;
• Output {h_{k,t}, b_{k,t}} as the tree structured classifier for detection.

Figure 3.3: Algorithm for learning the tree structured classifier.

In our experiments, T = 2,000, F = 10^{-6}, C = 8, θ_Z = 0.985 and θ_B = 75%. The setting of {P_t} is similar to the layer acceptance rates of the original cascade. The cascade is divided into 20 segments, whose lengths grow gradually. The weak classifiers at the ends of the segments have a positive passing rate of 99.8%, and the other weak classifiers have a passing rate of 100.0%.

• Given the sample sets of the new channels, S_{+k} and S_{+(c+1)};
• Inherit all the parameters from the main algorithm in Figure 3.3;
• Randomly collect the initial negative sample sets for the new channels, S_{−k} and S_{−(c+1)};
• Reset the weights: for every x ∈ S_{+k} ∪ S_{+(c+1)}, D_0(x) = 1/|S|;
• For t' = 1 to t do
  1. Split the original h_{k,t'} into two, h_{k,t'} and h_{c+1,t'}, which share the same feature, i.e. f_{k,t'} = f_{c+1,t'};
  2. Retrain h_{k,t'} with the following steps:
     (a) Learn the classification function from the new sample sets S_{±k} by Equ. 3.6;
     (b) Update the sample weights by Equ. 3.10;
     (c) Select the threshold b_{k,t'} for the partial sum H_{k,t'}, so that a portion P_{t'} of the positive samples is accepted, and reject as many negative samples as possible;
     (d) Remove the rejected samples from S_{±k}. If the remaining negative samples are fewer than θ_B percent of the original, recollect S_{−k} by bootstrapping on the negative image set;
  3. Retrain h_{c+1,t'} with S_{±(c+1)} by the same steps as for h_{k,t'}.
Figure 3.4: Retraining the weak classification functions of the ancestor nodes with the new intra-class sub-categorization.

3.1.3 Edgelet Features

Based on the observation that shape is one of the most salient patterns of many object classes, e.g. humans and cars, we developed a new class of local shape features that we call edgelet features. Most experiments in our work are done with this feature. An edgelet is a short segment of a line or a curve. Denote the positions and normal vectors of the points in an edgelet E by \{u_i\}_{i=1}^{k} and \{n_i^E\}_{i=1}^{k}, where k is the length of the edgelet; see Figure 3.5 for an illustration. Given an input image I, denote by M_I(p) and n_I(p) the edge intensity and normal at position p of I. The affinity between the edgelet E and the image I at position w is calculated by

f(E, I, w) = \frac{1}{k} \sum_{i=1}^{k} M_I(u_i + w) \left| \left\langle n_I(u_i + w), \, n_i^E \right\rangle \right|   (3.11)

Note that u_i in the above equation is in the coordinate frame of the sub-window, and w is the offset of the sub-window in the image frame. The edgelet affinity function captures both intensity and shape information of the edge; it can be considered a variation of standard Chamfer matching [5]. In our experiments, the edge intensity M_I(p) and normal vector n_I(p) are calculated by 3×3 Sobel kernel convolutions applied to gray level images. We do not use color information for detection.

Since we use the edgelet features only as weak features in the boosting algorithm, we simplify them for computational efficiency. First, we quantize the orientation of the normal vector into six discrete values: the range [0, π) is divided evenly into six bins, which correspond to the integers 0 to 5 respectively, and an angle θ in the range [π, 2π) has the same quantized value as θ − π. Second, the dot product between two normal vectors is approximated by the following function:

l[x] = \begin{cases} 1 & x = 0 \\ 4/5 & x = \pm 1, \pm 5 \\ 1/2 & x = \pm 2, \pm 4 \\ 0 & x = \pm 3 \end{cases}   (3.12)

where the input x is the difference between two quantized orientations. Denote by \{V_i^E\}_{i=1}^{k} and V_I(p) the quantized edge orientations of the edgelet and of the input image I respectively. The simplified affinity function is

\tilde{f}(E, I, w) = \frac{1}{k} \sum_{i=1}^{k} M_I(u_i + w) \cdot l\left[ V_I(u_i + w) - V_i^E \right]   (3.13)

The computation of edgelet features therefore involves only short integer operations.

In our experiments, the length of a single edgelet ranges from 4 to 12 pixels. The edgelet features we use consist of single edgelets, including lines, 1/8 circles, 1/4 circles, and 1/2 circles, and their symmetric pairs. A single edgelet is defined by the locations of its two ends and its shape, and is rendered by the Bresenham algorithm [10]. A symmetric pair is the union of a single edgelet and its mirror; see Figure 3.5.

Figure 3.5: Edgelet features.
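A small sketch of the simplified edgelet affinity of Equation 3.13, assuming the edge magnitude and quantized orientation maps have already been computed (e.g. from Sobel responses); the array names and the normalization of the magnitude map to [0, 1] are illustrative assumptions, not the thesis code:

```python
import numpy as np

# Approximate dot product between quantized orientations (Eq. 3.12),
# indexed by the absolute difference of the two orientation labels (0..5).
L_TABLE = np.array([1.0, 0.8, 0.5, 0.0, 0.5, 0.8])

def edgelet_affinity(edgelet_pts, edgelet_orients, mag, quant_orient, offset):
    """Simplified edgelet affinity (Eq. 3.13).

    edgelet_pts:     (k, 2) array of (row, col) point positions u_i in the
                     sub-window coordinate frame
    edgelet_orients: (k,) quantized orientations V_i^E in {0, ..., 5}
    mag:             edge magnitude image M_I, normalized to [0, 1]
    quant_orient:    quantized edge orientation image V_I in {0, ..., 5}
    offset:          (row, col) position w of the sub-window in the image
    """
    rows = edgelet_pts[:, 0] + offset[0]
    cols = edgelet_pts[:, 1] + offset[1]
    diff = np.abs(quant_orient[rows, cols] - edgelet_orients)   # values in 0..5
    return np.mean(mag[rows, cols] * L_TABLE[diff])
```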
3.1.4 Experimental Results

The experiments in this section are designed to compare different learning methods; hence, we use only whole-object classifiers for demonstration and comparison. The definition of object parts and the comparison of their detection performance are given later, in Chapter 5. We apply our CBT method to the problems of pedestrian detection and car detection. In addition to an end-to-end comparison, evaluations of the individual modules of the method help explain its different aspects. In the experiments of this section, we use edgelets as the image features.

In Sections 3.1.4.1, 3.1.4.2, and 3.1.4.3, we evaluate our method on the pedestrian class and compare with previous methods. For pedestrian detection, several data sets are publicly available, e.g. the INRIA pedestrian set [16] (http://pascal.inrialpes.fr/data/human/). The INRIA set covers multiple viewpoints and a large variation of poses. It contains 2,478 positive samples and 1,218 negative images for training, and 1,128 positive samples and 453 negative images for testing. As the INRIA set contains segmented samples, it is appropriate for a comparison of learning algorithms. However, for multi-view pedestrians, there is no public set for evaluating the detection performance of the whole system; hence, we collected our own multi-view pedestrian test set. The pedestrian samples in our experiments have been resized to 24×58 pixels. Figure 3.6 shows some examples of the pedestrian training samples. The overall number of possible edgelet features for pedestrians is 857,604.

Figure 3.6: Examples of pedestrian training samples.

In Section 3.1.4.4, we apply our method to multi-view car detection. For cars, there are several public sets, e.g. the UIUC car set [1] (http://l2r.cs.uiuc.edu/~cogcomp/Data/Car/), which includes a training set, a single-scale test set, and a multi-scale test set. However, all the existing car image sets are for a single viewpoint, e.g. the UIUC set contains only profile view cars. Hence, we collected a training set for multi-view cars from the MIT street scene images [52] (http://cbcl.mit.edu/software-datasets/streetscenes/). This training set contains 4,000 car samples of various models and different viewpoints. The car samples are resized to 64×32 pixels. Figure 3.7 shows some examples of the car training samples.

Figure 3.7: Examples of car training samples.

For the evaluation of detection methods, precision-recall (PR) curves are more appropriate than receiver operating characteristic (ROC) curves [18]. We use PR curves to show the results whenever feasible. However, in order to compare with previous methods that report results in other forms, we may also use other measures.

3.1.4.1 Comparison of splitting strategies

We first compare three splitting strategies: Tu's posterior probability based method [90], a pure random splitting method, and our image feature based k-means. In [90], a posterior probability is computed by

q(\pm 1 \mid x) = \frac{\exp\{\pm 2 H(x)\}}{1 + \exp\{\pm 2 H(x)\}}   (3.14)

When the classification is confident, i.e. q(\pm 1 \mid x) - \tfrac{1}{2} > \varepsilon, where \varepsilon defines a gray "not sure" area, the sample is sent to the left or right branch according to the sign; samples in the gray area are sent to both branches. As the original work of Tu [90] is designed for general object classification problems, the threshold there is set to 0.5. However, for a detector with a cascade decision strategy, the probabilities of most of the positive samples are close to 1, so we use a threshold of 0.85 in our experiment. The pure random splitting strategy randomly divides the sample set into two equal sized sub-sets. In order to evaluate the splitting strategies, we deactivate the retraining module, force the tree to split at level 50 and level 100, and set the maximum number of weak classifiers to 150. This experiment is done with the training samples of the INRIA set. Note that we do not implement the whole algorithm of Tu [90]; we only use his splitting strategy in our framework. Intuitively, the better the splitting, the more powerful the subsequently selected weak classifiers should be.
The convergence speed of the learning procedure depends on the hardest sub-category, so we trace the maximum Z value of the weak classifiers at each tree level and use the evolution curve of the maximum Z value to measure the performance of the splitting strategies. Figure 3.8 shows the curves of the maximum Z value, and Figure 3.9 shows the tree structures resulting from the posterior probability based strategy and from our image feature based k-means strategy.

From Figure 3.8, we can see that the image feature based splitting method best facilitates classification. Another interesting result is that even the random splitting method is better than the posterior probability based method. This is because the probability based method generates a highly unbalanced tree, while the other two methods both result in balanced trees, see Figure 3.9. Because of the asymmetric nature of the detection problem and the cascade decision strategy, the value of q(+1|x) keeps increasing for all samples. The sample distribution along the posterior probability tends to be a single-peak Gaussian with small variance, with its peak close to one. No matter what splitting threshold is used, it is unlikely to achieve balanced divisions of both the positive and negative sample spaces. This suggests that the scalar valued posterior probability may not be a good clustering criterion for object detection tasks.

Figure 3.8: Comparison of splitting strategies (maximum Z value vs. boosting round for posterior probability thresholding, balanced random splitting, and shape feature based k-means).

3.1.4.2 Comparison on data without predefined sub-categorization

Next, we compare our method with some previous work [92, 76, 16, 119] on the INRIA set, which does not have view/pose labels. We learn the classifiers with the training samples of the INRIA set and evaluate on its test samples. For our method, three runs are performed with different settings. In the first run, we set the maximum number of channels to one (C = 1), which reduces the classifier to a cascade structure. In the second run, we set C = 4 and disable the retraining module. In the last run, we set C = 4 and enable the retraining module. The other parameters of the three runs are the same. Figure 3.10 gives the ROC curves.

Figure 3.9: Tree structures from different splitting strategies. (The number in each box gives the percentage of samples belonging to that branch. The tree from the random splitting strategy is similar to that from the image feature based k-means splitting strategy.)

From the results, it can be seen that for our method the four channel CBT classifier without retraining outperforms the cascade one, while the four channel CBT classifier with retraining outperforms both. This shows the effectiveness of the splitting and retraining modules. Our method is comparable to [92, 16, 119]. However, [92, 16] focus on developing stronger features, while our work focuses on the classifier structure and training; these are two complementary directions. Figure 3.11 shows some failure examples of our method.

Figure 3.10: Evaluation of the CBT method on the INRIA set (detection rate vs. false positives per window for the HOG SVM of Dalal & Triggs, the HOG boosting of Zhu et al., the covariance SVM of Tuzel et al., the edgelet cascade (CBT, 1 channel), and the 4-channel edgelet CBT with and without retraining).

Figure 3.11: Missed human examples by our method at a false alarm rate of 10^{-4}.
3.1.4.3 Comparison on data with view-based sub-categorization

In [36], the intra-class sub-categorization of face samples is based on viewpoint, which is a common solution for multi-view object detection. In this experiment, we compare our unsupervised sub-categorization based method with the predefined sub-categorization based method. We divide the pedestrian class into three view-based sub-categories: left profile, frontal/rear, and right profile. As the INRIA set does not have viewpoint information, we use our own collection from the Internet for this experiment. Our training set contains 1,120 positive samples for the left profile, 1,742 for frontal/rear, 1,120 for the right profile, and 7,000 negative images without humans. For the view based classifier, we use Vector Boosting [36] to train a three-channel cascade as the root. When the false alarm rate of the root reaches 10^{-4}, we stop the feature sharing and split it into three branches, each of which then reaches a false alarm rate of 10^{-6}. Figure 3.12 shows the structure of the view-based classifier. For the CBT method, we set C = 8; however, the resulting tree has only four channels. To evaluate the detection performance, we collected a test set containing 100 images with 232 multi-view pedestrians. We call this set "USC pedestrian set C" (http://iris.usc.edu/~bowu/DatasetWebpage/dataset.html). Figure 3.13 shows the precision-recall curves of the two tree classifiers on this test set.

Figure 3.12: Structure of the view-based tree classifier for pedestrians.

Figure 3.13: Evaluation of the CBT method on the USC pedestrian set C (100 images with 232 pedestrians); recall vs. 1 − precision for the view based VBT and our CBT method.

From the results, it can be seen that the CBT detector dominates the view-based VBT detector. One possible reason is that the view-based sub-categorization is not optimal for the pedestrian class, since articulation is also an important factor affecting the appearance. Unlike the domain-knowledge based sub-categorization, our clustering method is based on the discriminative features selected directly for the classification task; we find no obvious physical meaning in the learned sub-categories. Another advantage of our CBT method is that it always generates balanced trees, which are more efficient than unbalanced ones. In the view based VBT detector we learned, the branches for the profile views are much longer than the branch for the frontal/rear view, which means the classification difficulty is not equally distributed among the view based sub-categories.

3.1.4.4 Multi-view car detection

To show the generality of the proposed method, we apply it to the problem of multi-view car detection. The 4,000 car samples collected from the MIT street scene image set are taken as our positive training set; the 7,000 negative images of Section 3.1.4.3 are used as the negative set. A CBT detector with four channels is learned for multi-view cars. To compare with previous methods, we also learn a cascade detector from the training data of the UIUC car set [1], which contains only profile view cars, and we test the CBT detector for multi-view cars and the cascade detector for side view cars on the UIUC single-scale set, which has 170 images with 200 cars, as well as on the multi-scale test set, which has 108 images and 139 cars. Comparison results are given in Table 3.1. It can be seen that our multi-view CBT detector has performance comparable to the state-of-the-art single-view approaches on a particular viewpoint.
Compared to the previous methods on multi-view car detection, e.g. [81], our method covers a similar range of viewpoints, but includes additional variations of car models.

Method                      Single-scale   Multi-scale
Leibe et al. [49]           97.5%          -
Todorovic & Ahuja [88]      ~86.5%         -
Mutch & Lowe [61]           99.94%         90.6%
Single view cascade         97.5%          93.5%
Multi-view CBT              97.5%          92.8%

Table 3.1: Detection equal-precision-recall rates on the UIUC car image set.

Our program is coded in C++ using OpenCV functions without any code level optimization or parallel computing. The experimental machine used in this section is a 2.8 GHz 32-bit Pentium PC. Our CBT method requires approximately one day to learn one classifier; this is slower than the learning of SVM classifiers but much faster than that of VBT classifiers. The detection speed depends on the size of the object and the image; it usually takes a few hundred milliseconds per image.

3.1.5 Conclusion and Discussion

In this section, we described a method to automatically learn tree structured object classifiers without a predefined intra-class sub-categorization. We divided the sample space with an unsupervised clustering approach based on image features selected by boosting, and refined the classification functions of the ancestor nodes using the sub-categorization information of their children in the tree. In this work, we learned our detector based on the edgelet features [105], which are one type of shape feature. There are many other candidates, such as SIFT [57], HOG [16], the covariance descriptor [92], shapelets [76], and Haar features [96]. Our learning algorithm does not limit the type of features used; new features could easily be integrated into the framework.

3.2 Integrating Heterogeneous Types of Local Features

Several local shape features have recently been invented for object detection; among them is our edgelet feature. Many previous methods for object detection base their detectors on a single type of feature, which enables a direct comparison of the detection performance of different features. However, we should be able to improve detection performance by using multiple feature types. Intuitively, more information leads to a better decision: though some features may be statistically dominated by others, as long as they are not fully correlated, a combination of them should bring a classification improvement. This section describes our method to integrate multiple, heterogeneous types of features to improve detection performance.

3.2.1 Motivation and Outline

There are two main issues in feature integration. First, evaluating all the features before making a prediction is not efficient, because some features could be computationally expensive yet not bring a significant boost in classification power. Second, different types of features can lie in different spaces, linear or nonlinear, which may require different classification techniques. For example, some features may lie on a nonlinear manifold embedded in a linear space. Directly applying traditional classification techniques based on Euclidean distance to such a feature space is not appropriate, as two points close in the linear space may be far from each other on the manifold. Hence, a direct Cartesian product of different types of features before classification is not always feasible.

We propose a novel method for the integration of heterogeneous features for object detection. Our approach balances two criteria: accuracy and efficiency. It is more accurate than the single-feature based methods and yet maintains a relatively fast speed.
We chose the boosted classifier for our classifier model. In previous boosting based object detection methods, each weak classifier corresponds to one image feature. In our approach, each weak classifier corresponds to one sub-region of the image, and different types of features are extracted from the sub-region (see Figure 3.14 for an illustration). The classification function for each feature type is learned in its own feature space. The multi-feature weak classifier makes a prediction by examining the different types of features from the sub-region one by one: only when the prediction based on the already examined features is not of high confidence does the weak classifier examine the next feature type. The order in which the features are evaluated is determined by a measure of classification power that includes the cost of computational time. A number of such weak classifiers are selected and combined by a boosting algorithm to form a strong object classifier.

Figure 3.14: Schematic diagram of our feature integration method.

The main advantages of our approach over previous related methods are: 1) the speed normalized classification power is used as the criterion for feature selection, so the optimization goal is not only classification accuracy but also efficiency; 2) the classification functions of different types of features are learned in their own spaces, not in a Cartesian product space, so that different classification techniques can be used to achieve better accuracy for different features; 3) complex features, which may be more powerful, are evaluated only when necessary, i.e. when the decision cannot be made confidently from relatively cheap features. We use three types of features — our edgelet feature, the HOG descriptor [16], and the covariance descriptor [92] — to show the accuracy and efficiency of our approach. We apply our method to the classes of pedestrians and cars. The experimental results show that our method achieves better accuracy at a relatively fast speed compared to the state-of-the-art single-feature based methods.

3.2.2 General Weak Classifier

Assume that from one image region R we can extract m different types of local features, {f_1, ..., f_m}, each of which could be an edgelet or any other type of feature. A feature f is a mapping from the image space X to a real valued space R^d, where d is the dimension of the space. (Note that in this section we generalize the definition of a feature in Section 3.1.3 from 1-D to d-D.) Denote the weak classifier based on image region R by h_R : X → R; the sign of h_R's output indicates the predicted class, + for object and − for non-object, and the magnitude represents the classification confidence. If accuracy were the only objective, h_R should use all the features extracted from region R for classification. However, for real applications, speed is another important criterion, so we allow the weak classifier to use a variable subset of features to make the decision. Denote the power set of {f_1, ..., f_m} by F, and a weak classifier based on a subset of features F ∈ F by h_F. We formalize the weak classifier as

h_R(x) = h_{\Phi(x)}(x)   (3.15)

where \Phi : X \to \mathcal{F} is a feature type selector.

We define a computational Cost Normalized classification Margin (CNM) to measure the classification efficiency of different subsets of features. For a sample x, denote its true class label by y (= ±1). The classification margin of h on x is defined by y · h(x), assuming h has been normalized to [−1, 1].
The classification margin represents the discriminative power of the classifier: larger margins imply lower generalization error [77]. The CNM of h on x is defined by

\dot{M}(h, x) \triangleq \frac{y \, h(x)}{t(h)}   (3.16)

where t(h) is the computational cost of h. The subset of features with the highest CNM measure is considered the optimal tradeoff between classification accuracy and computational efficiency:

\Phi^\star(x) = \arg\max_{F \in \mathcal{F}} \left\{ \dot{M}(h_F, x) \right\}   (3.17)

If h_F is approximated by a linear combination of several base classifiers \sum_{f_i \in F} h_{f_i}, each of which is based on one feature type, the computations of different features are independent, and the computational cost of \Phi is negligible compared to those of the h_{f_i}, then Equation 3.17 reduces to

\Phi^\star(x) = \arg\max_{\{f_k\} \in \mathcal{F}} \left\{ \dot{M}(h_{f_k}, x) \right\}   (3.18)

The expected CNM of h_{\Phi^\star} on X is

E\!\left( \dot{M}(h_{\Phi^\star}) \right) = \sum_k \alpha_k \, E\!\left( \dot{M}(h_{f_k}) \,\middle|\, \Phi^\star = \{f_k\} \right)   (3.19)

where \alpha_k = P(\Phi^\star = \{f_k\}).

3.2.3 Hierarchical Weak Classifier

In practice, computing the optimal \Phi^\star before evaluating any features is not possible, as this would require choosing the best feature type before seeing the image. In this work, we propose a hierarchical classification function to approximate h_{\Phi^\star}. The basic idea is to evaluate the features one by one, and after each evaluation decide whether it is necessary to look at more features. Assume that we evaluate the features in the order f'_1, \ldots, f'_m. We define h_{\{f'_1..f'_m\}} recursively:

h_{\{f'_1\}}(x) = h_{f'_1}(x), \qquad
h_{\{f'_1..f'_k\}}(x) =
\begin{cases}
h_{\{f'_1..f'_{k-1}\}}(x), & \text{if } x \in V_{\{f'_1..f'_{k-1}\}}(X) \\
h_{f'_k}(x), & \text{otherwise}
\end{cases}   (3.20)

where h_{f'_k} is the single-feature weak classifier based on f'_k, and V_{\{f'_1..f'_k\}}(X) is defined by

V_{\{f'_1\}}(X) = \left\{ x : \left| h_{f'_1}(x) \right| > \theta_1 \right\}, \qquad
V_{\{f'_1..f'_k\}}(X) = V_{\{f'_1..f'_{k-1}\}}(X) \cup \left\{ x : \left| h_{f'_k}(x) \right| > \theta_k \right\}   (3.21)

where \theta_k is a confidence threshold, chosen adaptively for each feature:

\theta_k = \arg\min_\theta \left\{ P\!\left( \left| h_{f'_k}(x) \right| > \theta \right) \le \frac{\alpha_k}{1 - \sum_{i=1}^{k-1} \alpha_i} \right\}   (3.22)

V_{\{f'_1..f'_k\}} represents the part of X where the prediction of h_{\{f'_1..f'_k\}} is confident. Finally, if x \in V_{f'_k} = V_{\{f'_1..f'_k\}} - V_{\{f'_1..f'_{k-1}\}}, then \Phi = \{f'_1, \ldots, f'_k\} and h_\Phi = h_{\{f'_1..f'_k\}}. The expected CNM of our hierarchical weak classifier is

E\!\left( \dot{M}(h_\Phi) \right) = \sum_k \lambda_k \beta_k \, E\!\left( \dot{M}(h_{f'_k}) \,\middle|\, x \in V_{f'_k} \right)   (3.23)

where \lambda_k = t(h_{f'_k}) \big/ \sum_{i=1}^{k} t(h_{f'_i}) and \beta_k = P(x \in V_{f'_k}). If the feature types used are fully independent, \beta equals \alpha.

To determine the order in which the features are evaluated, we sort the features according to their expected CNMs: the feature with the higher expected CNM is evaluated earlier. This is an approximation of the optimal order; we have not found an algorithm that computes the optimal order in time polynomial in the number of feature types. To rank the heterogeneous features, the classifiers must be defined in a comparable way. In our work, we use probability ratio based classifiers (details are given in the next section). For arbitrary classifier models, some normalization technique, such as those described in [2], should be applied before ranking.
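The following sketch illustrates how such a hierarchical weak classifier might be evaluated at test time, assuming the per-feature classifiers, the confidence thresholds, and the CNM-based ordering have already been learned; the names and structure are illustrative, not the thesis implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FeatureStage:
    """One feature type inside a hierarchical weak classifier."""
    evaluate: Callable[[object], float]  # x -> real-valued confidence h_{f'_k}(x)
    threshold: float                     # confidence threshold theta_k (Eq. 3.22)
    cost: float                          # computational cost t(h_{f'_k})

def hierarchical_weak_classifier(stages: List[FeatureStage], x) -> float:
    """Evaluate feature types one by one, in decreasing expected-CNM order,
    stopping as soon as one gives a confident prediction (Eq. 3.20)."""
    prediction = 0.0
    for stage in stages:
        prediction = stage.evaluate(x)
        if abs(prediction) > stage.threshold:   # x falls in V_{f'_1..f'_k}
            break                               # costlier features are skipped
    return prediction

def order_by_expected_cnm(stages: List[FeatureStage],
                          expected_margins: List[float]) -> List[FeatureStage]:
    """Sort feature types so that higher expected margin per unit cost
    (i.e. higher expected CNM) is evaluated first."""
    return [s for _, s in sorted(zip(expected_margins, stages),
                                 key=lambda p: p[0] / p[1].cost,
                                 reverse=True)]
```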
3.2.4 Learning Weak Classifier

Similar to our CBT method in Section 3.1, we define the single-feature weak classifier h_f as a piecewise function based on a partition of the sample space X into disjoint subsets \{X_j \,|\, X_j \subset X\}_{j=1}^{n} that cover all of X. For each subset of the partition, the output of the weak classifier h_f is defined by Equ. 3.6. In this method, for each feature f_k, we first find a projection p_k that maps f_k to [0, 1). This projection separates the two classes as much as possible in the 1-D space. For different feature types, the projections can be different, either linear or non-linear. (Later, in Section 3.2.5, we give the implementation details of the projections for the features we use.) We then make a uniform partition in the 1-D projection space:

X_{k,j} = \left\{ x \in X \,\middle|\, p_k(f_k(x)) \in \left[ \tfrac{j-1}{n}, \tfrac{j}{n} \right) \right\}   (3.24)

and compute the classification function h_{f_k} by Equ. 3.6. Because the outputs of our weak classifiers are defined by probability ratios and the features are used only for the partition, their margins are directly comparable.

One approximation in learning the classification functions is that the sample distributions W_\pm used to compute h_{f'_k} are learned independently. For the hierarchical classification function in Equ. 3.20, we should ideally learn the conditional probability distributions given that the samples lie in X - V_{\{f'_1..f'_{k-1}\}}. However, this imposes an exponentially increasing demand for training data with respect to the number of feature types.

The hierarchical weak classifier corresponds to a hierarchical partition of the sample space; Figure 3.15 gives an illustration. Most of the sample space is divided only along the first dimension, while some difficult part is further divided along the second dimension, and so on. The classification power of the hierarchical weak classifier with multiple features is measured by its expected CNM, defined by Equ. 3.23. At each boosting round, we evaluate several sub-regions, for each of which we find the best feature f_k of each feature type and combine them to form h_R. The h_R with the largest expected CNM is added to the current boosted classifier.

Figure 3.15: An illustration of the hierarchical partition of the sample space.

3.2.5 Implementation

The features we use are our edgelet feature, the HOG descriptor [16], and the covariance descriptor [92]. They are all state-of-the-art shape oriented features and have successfully been applied to object detection problems. There are many other candidates; however, these three types of features are sufficient to demonstrate the different aspects of our approach.

3.2.5.1 Feature dependent projection functions

For different feature types, we design different projection functions in order to achieve the best classification result. One edgelet feature can be seen as a short edge template. The feature response is the matching score between the template and the input image, i.e. f_edge : X → [0, 1). Hence, we simply use the identity function as the projection function for edgelets, denoted p_1.

For the HOG descriptor, we do not use the dense sampling method in [16]; instead we use the variable-sized block version in [119]. Given a rectangular sub-region, it is divided into 2×2 equal-sized cells. Within each cell, the edge intensities at nine orientations are summed. The output is a 36-D histogram vector, i.e. f_HOG : X → R^36. We use Linear Discriminant Analysis (LDA) to find a linear projection p_2 that best separates the object and non-object classes:

p_2(f_{HOG}(x)) \triangleq a_2 \langle v_2, f_{HOG}(x) \rangle + b_2   (3.25)

where v_2 \in R^{36}, and a_2 and b_2 are normalizing factors learned from the training set.

Our covariance descriptor is extracted from a 6-D raw feature vector, [x, y, |I_x|, |I_y|, |I_{xx}|, |I_{yy}|], where x and y are the pixel location, and I_x, I_y, I_{xx}, I_{yy} are the first/second order intensity derivatives. The covariance matrices lie in a connected Riemannian manifold [6]. Formally, f_cov : X → M_{6×6}, where M_{6×6} is a manifold embedded in R^{6×6}. Because covariance matrices are symmetric, the real dimension is 6×(6+1)/2 = 21. As the manifold is not a linear space, it is inappropriate to apply LDA directly. Following the method in [92], we first map the covariance matrices to a linear tangent space of the manifold, and then perform LDA in the tangent space. Denote the mapping to the tangent space by \psi : M_{6\times 6} \to R^{21}, defined as

\psi(X) \triangleq \mathrm{vec}_\mu(\log_\mu(X))   (3.26)

where X and \mu are two positive definite symmetric matrices, \log_\mu is a matrix logarithm operator that maps a matrix in the manifold to a matrix in the tangent space attached to \mu, and \mathrm{vec}_\mu is a coordinate mapping operator that converts the Riemannian metric on the tangent space to the canonical metric of the vector space. More details of \psi and its learning can be found in [92]. The projection function of the covariance descriptor is defined by

p_3(f_{cov}(x)) \triangleq a_3 \langle v_3, \psi(f_{cov}(x)) \rangle + b_3   (3.27)

where v_3 \in R^{21}, and a_3 and b_3 are normalizing factors learned from the training set.
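A sketch of the tangent-space projection for the covariance descriptor, assuming symmetric positive definite inputs. The log-map here follows the common affine-invariant formulation with an orthonormal vectorization of the symmetric result; this is one standard choice and may differ in details from the learning procedure of [92].

```python
import numpy as np
from scipy.linalg import eigh

def _inv_sqrt(spd):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = eigh(spd)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def _logm_spd(spd):
    """Matrix logarithm of a symmetric positive definite matrix."""
    vals, vecs = eigh(spd)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

def _vec_sym(sym):
    """Orthonormal vectorization of a symmetric d x d matrix into d(d+1)/2
    values (off-diagonal entries scaled by sqrt(2) to preserve the norm)."""
    d = sym.shape[0]
    iu = np.triu_indices(d, k=1)
    return np.concatenate([np.diag(sym), np.sqrt(2.0) * sym[iu]])

def tangent_projection(cov, mu):
    """psi(X) = vec_mu(log_mu(X)): map a 6x6 covariance descriptor into the
    21-D tangent space attached at mu (e.g. the mean training covariance)."""
    w = _inv_sqrt(mu)
    return _vec_sym(_logm_spd(w @ cov @ w))

def p3(cov, mu, v3, a3, b3):
    """Projection of the covariance descriptor (Eq. 3.27); v3, a3, b3 would be
    obtained by LDA and normalization on the training set."""
    return a3 * np.dot(v3, tangent_projection(cov, mu)) + b3
```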
3.2.5.2 Computational costs of features

The computational costs of the three types of features are very different. Computing an edgelet response, which is basically edge template matching, requires mainly 16-bit short integer operations; computing the HOG histograms through integral images requires mainly 32-bit integer operations; and computing a covariance matrix through integral images requires 32-bit integer and 64-bit floating point operations. The projection p_2 is an inner product of two floating point vectors; the complexity of p_3 is dominated by the matrix logarithm operator in \psi, which requires a singular value decomposition (SVD). We first used the OpenCV SVD function and measured the speed of p(f(x)) for the three feature types. The ratio of their average speeds is about t_edge : t_HOG : t_cov = 1 : 10 : 30. This ordering is consistent with those reported in the original papers [105, 16, 92]. In order to improve computational efficiency, we replaced the OpenCV SVD with an implementation from the Intel IPP library [38], which resulted in a speed ratio of about 1 : 10 : 12. We have evaluated this program on several versions of Intel CPUs; the ratio is stable. Besides p(f(x)), computing the edge intensity images of different orientations and the integral images brings some overhead. However, this overhead, which is partially shared among the different feature types, is relatively small.

3.2.5.3 Memory costs of features

Given a W × H gray scale image I (8 bits per pixel), to compute the three types of features, six intensity derivative images for I_x, I_y, I_{xx}, I_{yy}, \sqrt{I_x^2 + I_y^2}, \arctan(I_x / I_y), and two coordinate images for x, y are first computed and stored as 16-bit short integers. Then nine 32-bit integer integral images for HOG, plus six 32-bit integer integral images and 21 64-bit double integral images for the covariance matrix, are computed. The overall memory cost for a W × H image is about 240 × W × H bytes. If the training sample size is 64 × 128 pixels, 2 GB of memory, which is the maximum amount of memory dynamically allocatable by one process on a 32-bit Windows system, can accommodate about 1,000 samples. However, most of the current training sets for object detection contain thousands of samples. Hence, during training, the intermediate image representation is pre-computed and kept in memory only for a portion of the training data, while for the rest it is computed on-the-fly.
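As a quick sanity check on the figures above, the per-pixel budget implied by the listed intermediate images can be tallied as follows (approximate; it ignores the original 8-bit image and any alignment padding):

```python
# Approximate per-pixel memory budget implied by the intermediate images above.
derivative_and_coord = 8 * 2      # 6 derivative + 2 coordinate images, 16-bit each
hog_integrals        = 9 * 4      # nine 32-bit integral images for HOG
cov_int_integrals    = 6 * 4      # six 32-bit integral images for the covariance
cov_double_integrals = 21 * 8     # twenty-one 64-bit integral images for the covariance
bytes_per_pixel = (derivative_and_coord + hog_integrals +
                   cov_int_integrals + cov_double_integrals)        # = 244, i.e. ~240
samples_in_2gb = (2 * 1024**3) // (bytes_per_pixel * 64 * 128)       # ~1,000 samples
print(bytes_per_pixel, samples_in_2gb)
```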
3.2.5.4 Selecting the best weak classifiers

Similar to [92, 119], we randomly sample 200 sub-regions at each boosting round, and search for the locally best edgelet, HOG, and covariance features. For each region R, the local search is done by randomly evaluating 40 edgelets, 5 HOG features, and 5 covariance features whose supporting regions R(f) have large coverage of R. For an edgelet feature, R(f) is the bounding box of the edge template; for the HOG and covariance descriptors, R(f) is the rectangular region from which the histograms are computed. We sample more edgelet features than the other two types because the edgelet feature pool is bigger than those of the other two.

During training, at each boosting round, the samples with buffered intermediate representations are used to select the good features. After the features are fixed, all the training samples are used to refine the classification functions.

3.2.6 Experimental Results

We apply this approach to two classes of objects, pedestrians and cars.

3.2.6.1 Performance on pedestrians

For pedestrians, we use the INRIA data set [16]. We train a classifier consisting of 800 weak classifiers with our CBT method, with a maximum channel number of one. Figure 3.16 shows the ROC curves of this method and some previous ones. (The ROC curve of the cascade classifier is generated by changing the number of layers used.) From Figure 3.16, it can be seen that the hybrid-feature cascade outperforms the edgelet-only, HOG-only [119], and covariance-only [92] cascades. On average, our cascade classifier searches about 24,000 sub-windows per second on a 3.0 GHz Intel CPU.

Figure 3.16: Evaluation of our multi-feature based detection method on the INRIA set (detection rate vs. false positives per window for the HOG SVM of Dalal & Triggs, the HOG boosting of Zhu et al., the covariance SVM of Tuzel et al., the edgelet-only boosted cascade, and our hybrid-feature cascade).

3.2.6.2 Feature statistics

Figure 3.17 shows the first weak classifier learned for pedestrian detection with its selected features and their classification functions. It can be seen that the covariance descriptor has the best discriminative power. However, due to its high computational cost, it is only the second feature of the weak classifier, after an edgelet but before a HOG.

Figure 3.17: The first weak classifier learned for pedestrians and its selected features. The first feature evaluated is an edgelet corresponding to the head-shoulder contour of the human body; the second feature is a covariance descriptor whose supporting region surrounds the head-shoulder part; the third feature is a HOG descriptor. The x-axis is the index of the histogram bins, i.e. the partition along the projection direction; the y-axis is the classifier's output.

For the cascade pedestrian detector learned from the INRIA set, we first count the frequencies with which the different feature types are selected as the first/second/third feature in the weak classifiers, as shown in Figure 3.18. It can be seen that although the HOG and covariance descriptors are stronger than the edgelet for classification, they are much more computationally expensive, so they are most often used as the second or third feature.

Figure 3.18: Frequencies of the different feature types as the first, second, and third feature in the hierarchical weak classifiers.

Next, we count the frequencies with which the different feature types are evaluated per sub-window.
This is a hardware-independent metric for comparing the speeds of different methods; Table 3.2 shows the results. Tuzel et al. [92] report that, on average, the HOG-only cascade requires evaluating 15.62 HOG features per sub-window and the covariance-only cascade needs 8.45 covariance descriptors per sub-window. For edgelets we perform the evaluation ourselves: the edgelet-only cascade requires about 28 edgelets per sub-window. Given the speed ratio of the three feature types, it can be seen that our hybrid-feature detector is faster than the HOG-only and covariance-only detectors, though slower than the edgelet-only detector.

Feature type                        Edgelet   HOG    Covar
Evaluation frequency per window     15.25     2.6    2.05

Table 3.2: Evaluation frequencies of the different feature types.

Finally, we count the evaluation frequencies of the first, second, and third features, as shown in Table 3.3. It can be seen that the third feature is rarely used. The first features are mostly edgelets, which encode the local silhouette explicitly but are relatively sensitive to small transformations, such as translation and rotation. The second and third features are mostly HOG and covariance descriptors, which encode the statistics of a sub-region and are robust to small transformations, but do not encode which pixels actually contribute to the histogram bins; very different shapes can produce the same histogram. Their complementarity is therefore natural.

Feature order                       First    Second   Third
Evaluation frequency per window     16.33    2.66     0.91

Table 3.3: Evaluation frequencies of the first, second, and third features.

3.2.6.3 Hierarchical vs. sequential weak classifier

We compare the performance of our hierarchical feature combination with two other combination strategies: sequential summation and sequential maximum. Sequential summation is defined by

h_{\{f'_1..f'_m\}}(x) = \sum_{k=1}^{m} h_{f'_k}(x)   (3.28)

and sequential maximum is defined by

h_{\{f'_1..f'_m\}}(x) = h_{f_M}(x)   (3.29)

where f_M = \arg\max_{f'_k} \left\{ \left| h_{f'_k}(x) \right| \right\}. Both of these strategies require evaluating all the features before making a prediction, so they can be considered an accuracy upper bound for our hierarchical strategy. We take the first 10 features of the single-threshold pedestrian classifier of Section 3.2.6.1, apply the three combination strategies, and evaluate the classification performance on the test set. Figure 3.19 shows the ROC curves. The performance of the two sequential strategies is almost the same and slightly better than that of our hierarchical strategy, but they are about five times slower than our method.

Figure 3.19: Comparison of the different feature combination strategies.

3.2.6.4 Performance on cars

For car detection, we use the same training set as in Section 3.1.4.4. As the intra-class variation of multi-view cars is large, we train a tree structured detector with four branches using our CBT method.

For testing, we first evaluate our car detector on the UIUC car test images [1]. Table 3.4 shows the equal-recall-precision rates of our method as well as some previous ones. It can be seen that our multi-view detector achieves performance comparable to the state-of-the-art single-view approaches on a particular viewpoint. The most direct comparison is with the multi-view car detector learned by the CBT algorithm using only edgelet features; our hybrid-feature based detector is slightly better than the edgelet-only CBT classifier.
However, the difference of performance on this set is only for one or two examples. We thus need some more difficult testing data. Method Single-scale Multi-scale Leibe et al.[49] 97:5% - Mutch & Lowe [61] 99:94% 90:6% edgelet-only CBT 97:5% 92:8% hybrid-feature CBT 98:0% 94:24% Table 3.4: Detection equal-precision-recall rates on the UIUC car image set. Fortesting,wecollected390carimagesfromthePASCAL2006challengedataset[22]. Thissetincludesmulti-viewcarsofdifferentmodels. Forevaluation,weonlyconsiderthe cars that are higher than 32 pixels. There are 481 counted cars in this set overall. The datasetcontainstwodifferenttypesofimages: closeshotsandmid/long-distantshots. In the close shot images, we detect cars from 250 to 500 pixels high; in the mid/long-distant shotimages, wedetect carsfrom32to250pixels high. FollowingthePASCALchallenge, buses are not included in the car class. Positive responses on buses are counted as false alarms. Figure 3.20shows the precision-recall curveof our method on thisset. The equal precision-recallrateisabout81%. Hoiemet al. [34]use150carimagesfromthePASCAL 2006 data for testing and their method achieves an equal precision-recall rate of about 61%. The highest reported results in the PASCAL 2006 and 2007 challenges have the equal precision-recall rates of about 45% and 55% respectively [69]. However, these rates are for the whole test set, which is much more difficult. 65 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.75 0.8 0.85 0.9 0.95 1 − Precision Recall Figure 3.20: Evaluation of our multi-feature based detection method on the problem of multi-view car detection. 3.2.7 Conclusion and Discussion In this section, we described a method to integrate different types of features for ob- ject detection. We learned ensemble classifiers by boosting weak classifiers. Each weak classifier is based on several different types of features, which are ranked according to their speed normalized classification margins. The weak classifier makes the prediction by examining the feature types one by one. As long as the features used are not highly correlated, a combination of them should result in greater accuracy than by using any single. However, if one feature truly dominates over another on all samples but is slower, it is possible that the accuracy of the combination is higher than the weaker one but lower than the stronger one. We demonstrated our approach on two object classes, pedestrians and cars, with three feature types, edgelets, HOG and covariance descriptors. However, our method is 66 not limited to these types of features; new features could be readily integrated into the framework. At the end of this chapter, we show some example results of pedestrian and car detection in Figure 3.21 and Figure 3.22. 67 Figure 3.21: Example results of pedestrian detection. 68 Figure 3.22: Example results of car detection. 69 Chapter 4 Object Segmentation by Boosting Local Shape Features 4.1 Motivation and Outline In recent years, methods for direct detection of objects have become popular. The best known example is perhaps that of face detection by Viola and Jones [96] where no prior segmentation is applied; rather, the image is scanned by sub-windows of various sizes and a determination as to the presence or absence of the desired object is made in each sub-window. While such methods show good performance at the detection level, object delineationisnotveryprecise. Typically, aboundingboxthatcontainstheobjectaswell as some of the background is output. 
A more accurate delineation process may then be applied inside the bounding box, as in [50]. In some existing work, discriminative local shape features, such as contour fragments in [64, 84], are used to model the appearance of the objects for detection or recognition tasks, see Figure 4.1. We observe that most of the selected features lie on the object boundary and that our human perception can delineate the object easily based on the 70 (a) (b) (c) Figure 4.1: Local shape features in [64, 84]: a) Boundary fragments selected for cows [64]; b) Feature responses of a face [84]; c) Feature responses of a cow [64]. feature responses on the input image. Based on this observation, we propose a simul- taneous object detection and segmentation method, where the detection model and the segmentation model share the same shape features. We formulate segmentation as a binary classification problem and train segmentor by an off-line learning algorithm. For training, a large feature pool is first built without any knowledgeoftheobjectclass. Foreachfeature,apairofweakclassifiersfordetectionand segmentation is built. A variation of the real AdaBoost algorithm [78] is used to select good features from the pool. The input of the segmentor is an image sample and a pixel locationwithinthesamplewindow; theoutputisthefigure-groundprediction. Wedefine an effective neighborhood for each edgelet feature. The figure-ground distribution within the neighborhood is learned, based on which a local weak segmentor is determined. The final boosted ensemble classifier with a cascade decision strategy works as a detector as well as a segmentor. 71 Given a new input image, the shape feature based boosted classifier is first applied to all possible sub-windows in the image. After one round scanning, the output contains the locations of the objects and the pixel level segmentation masks. After shape based segmentation, if color information is available, we apply an iterative method to estimate the color model of the object and use this model to refine the segmentation result. Our main contributions of this method are: 1) a design of the weak segmentors based on local shape features; and 2) a boosting algorithm for simultaneous learning of detectors and segmentors. Although texture can also be a very useful cue for segmentation, we do not include it in this work. Weevaluatethedetectionandsegmentationperformanceofoursystemquantitatively on several public data sets (the UIUC car image set [1], the Caltech101 image set [23] 1 , and the TU Darmstadt image set [49] 2 ), and some data that we have collected. The experimental results show that the detection performance of our method is comparable tothestate-of-the-artdetectionmethods,whilethesegmentationoutperformstheexisting methods in terms of accuracy and speed. 4.2 Design of the Weak Classifier Inouroff-linelearningmethod,strongensembleclassifiersfordetectionandsegmentation arebuiltfromweakclassifiers. Featuresharingisimplementedattheweakclassifierlevel. Followingourdetectionmethod,weuseedgeletfeatures. Basedononefeaturef,welearn 1 http://www.vision.caltech.edu/Image_Datasets/Caltech101/ 2 http://www.pascal-network.org/challenges/VOC/databases.html 72 a weak classifier for detection h (d) and a weak classifier for segmentation h (s) , i.e. a pair of classifiers sharing the same feature. 4.2.1 Weak classifier for detection The weak detection classifier is a function from the image space X to a real valued object/non-object classification confidence space. 
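As noted later in this section, both weak classifiers are implemented as look-up tables (LUTs) over quantized feature values. A minimal C++ sketch of such a LUT weak classifier is given below; it assumes the standard real-AdaBoost domain-partitioning form (feature values normalized to [0, 1), one real-valued confidence per bin, analogous to Equ. 4.3), and the class name, bin count, and smoothing term are illustrative rather than the exact implementation used here.

#include <cmath>
#include <vector>

// Illustrative LUT weak classifier (assumed real-AdaBoost domain-partitioning form).
// Feature values are assumed to be normalized to [0, 1).
class LutWeakClassifier {
public:
    explicit LutWeakClassifier(int numBins) : conf_(numBins, 0.0) {}

    // Fill the per-bin confidences from weighted positive/negative histograms.
    // featureValue[i] = f(x_i), label[i] = +1/-1, weight[i] = boosting weight of x_i.
    void train(const std::vector<double>& featureValue,
               const std::vector<int>& label,
               const std::vector<double>& weight, double eps = 1e-6) {
        std::vector<double> wPos(conf_.size(), 0.0), wNeg(conf_.size(), 0.0);
        for (std::size_t i = 0; i < featureValue.size(); ++i) {
            if (label[i] > 0) wPos[binOf(featureValue[i])] += weight[i];
            else              wNeg[binOf(featureValue[i])] += weight[i];
        }
        for (std::size_t j = 0; j < conf_.size(); ++j)
            conf_[j] = 0.5 * std::log((wPos[j] + eps) / (wNeg[j] + eps));
    }

    // Real-valued object/non-object confidence for a feature value.
    double predict(double featureValue) const { return conf_[binOf(featureValue)]; }

private:
    std::size_t binOf(double v) const {
        long b = static_cast<long>(v * conf_.size());
        if (b < 0) b = 0;
        if (b >= static_cast<long>(conf_.size())) b = static_cast<long>(conf_.size()) - 1;
        return static_cast<std::size_t>(b);
    }
    std::vector<double> conf_;  // one confidence value per feature-value bin
};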
The learning of the weak detection classifiers is same as that in Section 3.1.2.1. Given a labeled sample set S = f(x i ;y i )g, the weak detection classifier h (d) is defined by Equ.3.6. 4.2.2 Weak classifier for segmentation The weak segmentation classifier is a function from the space X £U to a real valued figure-ground classification confidence space, whereU is the 2-D image coordinate space, i.e. U =Z + £Z + , where Z + is the set of all nonnegative integers. Intuitively, a local featureonlycontributestotheshapearounditsneighborhood. Itisnotefficienttopredict the state of the feet from an edgelet falling on the head-top. Based on this observation, we define the effective field of the edgelet based on a saliency decay function. This idea is motivated by the tensor voting method for shape grouping [58]. As shown in Figure 4.2(a), O is a point on an edgelet feature, whose normal n and tangent v are known, P is a neighbor of O, and ( OP is the arc of the osculating circle at O that goes through P. The effect of O on P is defined by DF(s;·;¾)=exp µ ¡ s 2 +c· 2 ¾ 2 ¶ (4.1) 73 where l is the Euclidean distance between O and P, µ is the angle between n and ¡ ¡ ! OP, s = lµ 2sinµ is the length of the arc ( OP, · = 2sinµ l is the curvature, c is a constant that controls the decay with high curvature, and ¾ is the scale of analysis, which determines the size of the effective field. Note that ¾ is the only free parameter. In practice, ¾ is quantizedtofivevalues,2,4,6,8,10,accordingtothesizeofourtrainingsamples,andthe normal orientation of the edgelet point is quantized to six bins, [ ¼ 6 (i¡1); ¼ 6 i);i=1:::6. There are consequently 30 bases of the effective field, see Figure 4.2(b) for examples. For a k point edgelet, denoted by F i the effective field of the i-th point, the effective field of the whole feature is then defined by F(u)=maxfF 1 (u);:::;F k (u)g;u2U (4.2) Figure 4.2(c) shows the effective field of an edgelet feature. (a) (b) (c) Figure 4.2: Effective field: a) definition of effectiveness; b) effective field bases of individ- ual edge points; c) effective field of edgelet feature. The learning of the weak segmentation classifiers is similar to that of detection. For thepositivetrainingsamples,theirsegmentationground-trutharegivenasbinarymasks. 74 LetS + =f(x i ;y i =1;m i )gbethepositivesampleset,wheremisthesegmentationmask that has the same dimension as x (m(u) = +1 means the pixel u belongs to figure; and m(u)=¡1 means the pixel u belongs to background). Assuming that the effective field F of a feature f has been determined (how to optimize the shape of the effective field will be described later in Section 4.3.2), similar to the weak detection classifier h (d) , the weak segmentation classifier h (s) is also learned as a piecewise function: iff(x)2 · j¡1 n s ; j n s ¶ ;h (s) (x;u)= 1 2 ln à W j + (u)+² W j ¡ (u)+² ! (4.3) where W § (u) is the feature value histograms of figure/ground pixels weighted by the effective field: W j § (u)=F(u)¢P µ f(x)2 · j¡1 n s ; j n s ¶ ;m(u)=§1 ¶ ;j =1:::n s (4.4) where n s is the bin number for segmentation (in our experiments n s = 8). In practice, both h (d) and h (s) are implemented as look-up-table (LUT). The difference is that each bin of h (d) is a real valued scalar, while each bin of h (s) is a real valued matrix. 4.3 BoostingEnsembleClassifierforSegmentationandDetection Let H be the weak classifier pool that consists of the weak classifier pairs built from all possible edgelets. 
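To make Equ. 4.1 and Equ. 4.2 concrete, the following C++ sketch computes the effective field of a k-point edgelet over the sample window. The per-point arc lengths and curvatures are assumed to be precomputed from the point positions and normals exactly as defined above; the container layout and function names are illustrative.

#include <algorithm>
#include <cmath>
#include <vector>

// Decay value of Equ. 4.1 for one (edgelet point, pixel) pair, given the arc
// length s, the curvature kappa, the scale sigma, and the constant c.
inline double decayValue(double s, double kappa, double sigma, double c) {
    return std::exp(-(s * s + c * kappa * kappa) / (sigma * sigma));
}

// Effective field of a k-point edgelet over a w x h window (Equ. 4.2): the
// field at each pixel is the maximum of the per-point fields. arcLen[i][p] and
// curv[i][p] hold s and kappa for edgelet point i and pixel p, and sigma[i] is
// the (quantized) scale chosen for point i.
std::vector<double> effectiveField(const std::vector<std::vector<double> >& arcLen,
                                   const std::vector<std::vector<double> >& curv,
                                   const std::vector<double>& sigma,
                                   double c, int w, int h) {
    std::vector<double> field(static_cast<std::size_t>(w) * h, 0.0);
    for (std::size_t i = 0; i < arcLen.size(); ++i)
        for (int p = 0; p < w * h; ++p)
            field[p] = std::max(field[p],
                                decayValue(arcLen[i][p], curv[i][p], sigma[i], c));
    return field;
}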
Each element in the pool is a pair of weak detection and segmentation classifiers, i.e. (h (d) ;h (s) ). We use a variation of boosting algorithm to learn an ensemble classifier fromH as strong detector and segmentor. 75 4.3.1 Sample weight evolution One important feature of boosting algorithms is their weight evolution. For traditional detection problems, each sample is assigned a real valued weight D (d) representing its importance or difficulty. During the boosting procedure, the weights of the misclassified samples are increased while those of the correctly classified samples are decreased, so that greater attention is placed onto the difficult portions of the sample space. For segmentation, not only do the difficulties of different samples vary, but the difficulties of different positions of the same sample also vary. Intuitively, for some less articulated parts of the human body, e.g. torso, segmentation is relatively easy, and not much more than a constant mask; for some highly articulated parts, e.g. legs, additional features needtobeevaluatedbeforemakingthefinaldecision. Hence, forsegmentation, weassign a weight fieldD (s) to each positive sample. D (s) (u) represent the importance of the pixel at position u. During the boosting procedure, the weight fields for segmentation are evolved in the same way as the weights for detection. Assume that at the t-th boosting round, a pair of weak detector and segmentor (h (d) t ;h (s) t ) are selected, and the current sample weights are D (d) t for detection and D (s) t for segmentation. For all samples, the sample weights for detection of the t+1-th round are calculated by Equ.3.10. For all positivesamples, thesampleweightsforsegmentationofthe t+1-throundarecalculated by D (s) t+1 (x;u)=D (s) t (x;u)exp h ¡m(u)h (s) t (x;u) i ;8u2U (4.5) 76 4.3.2 Optimization of weak classifier At each boosting round, the best weak classifier pair is selected from H, where two components need to be optimized: the edgelet feature and the effective field. The edgelet features are enumerated in the feature pool and an effective field is defined by the shape of its edgelet and the parameter ¾. As we allow different ¾’s for different points in one edgelet, for a k point edgelet there are 5 k possible field shapes. When the sample size is 24£58 pixels, there are 857,604 possible edgelets overall. It would be very time consuming to perform brute force search in the Cartesian space. Instead, weseparate the optimization into two steps: searching first for the best edgelet with a default ¾ value, and then searching for the best ¾ value. With fixed ¾, the best edgelet is selected according to the following criterion: ³ h (d) t ;h (s) t ´ = argmin h (d) t ;h (s) t 2H 8 < : ¸2 X j q W j + W j ¡ +(1¡¸) 1 º X j X u2U q W j + (u)W j ¡ (u) 9 = ; (4.6) where º = q P j P u W j + (u) P j P u W j ¡ (u) is a normalizing factor. This criterion en- codes the discriminative power of the feature for both detection and segmentation. The coefficient ¸ represents the relative importance of the two tasks. In our experiments, ¸=0:7. The value of ¾ is optimized in a greedy way. At one time, the ¾ of one edgelet point is optimized, while the others are kept fixed. Figure 4.3 gives the full algorithm of the simultaneous learning of the detector and the segmentor. The output of this algorithm is an ensemble classifier with a cascade decision strategy for detection. As segmentation is 77 a balanced classification problem, we take the default threshold to be zero, i.e. 
the pixel u of x is classified as figure, if and only if H (s) T (x;u)= X T t=0 h (s) t (x;u)>0 (4.7) In practice, we use the prior figure-ground distribution as the first weak segmentation classifier h (s) 0 : 8x;h (s) 0 (x;u)= 1 2 ln µ W + (u)+² W ¡ (u)+² ¶ (4.8) where W § (u) is the prior probabilities of figure/ground labels of the pixel at the position u: W § (u)=P (m(u)=§1) (4.9) 4.4 Refining Segmentation by Color The boosted ensemble segmentor only utilizes shape information. Color is also a very useful cue for segmentation. We develop a color based postprocessing method to refine shape based segmentation. Figure 4.4 shows the probability distribution of shape based segmentation errors. (Details regarding training and testing are given later in Section 4.5.) From Figure 4.4 it can be seen that by shape based segmentation, most inner and outer areas of the object are already “confidently” segmented, and most errors happen around the boundary of the objects. Hence, to improve the segmentation, we only need to refine the decision for those uncertain areas. At the beginning, a color model of one object C obj and a color 78 ² Given the initial sample set S = S + [S ¡ , where S + = f(x i ;+1;m i )g and S ¡ = f(x i ;¡1)g, and a negative images set; ² Set the algorithm parameters: the maximum weak classifier number T, the positive passing rates fP t g T t=1 , the target false alarm rate F, the relative importance of detection to segmentation ¸, and the threshold for bootstrapping µ B ; ² Initialize the sample detection weights D (d) 0 (x) = 1 kSk for all samples, the sample segmentation weight fields D (s) 0 (x) = 1 kS + kkxk for all positive samples, the current false alarm rate F 0 =1, and t=0; ² Construct the weak classifier pool,H, from the edgelet features; ² while t<T and F t <F do 1. Search for the best edgelet (a) For each pair (h (d) ;h (s) ) in H, generate the effective field for segmenta- tion with a default value of ¾(=4), calculate h (d) and h (s) by Equ.3.6 and Equ.4.3 respectively. W § and W § are calculated under weight distribu- tion D (d) t and D (s) t respectively; (b) Select the best weak classifier pair from the classifier poolH according to Equ.4.6; 2. Search for the best shape of the effective field (a) For each point of the edgelet, set ¾ = 2;4;6;8;10, find the best value according to Equ.4.6. (b) With the new effective field, recompute the classification function of h (s) t by Equ.4.3. 3. Update sample weights by Equ.3.10 and Equ.4.5, and normalize D (d) t+1 and D (s) t+1 to p.d.f. 4. Selectthethresholdb t forthepartialsumH (d) t ,sothataportionofP t positive samples are accepted; and reject as many negative samples as possible; 5. RemovetherejectedsamplesfromS. Iftheremainingnegativesamplesareless than µ B percent of the original, recollect S ¡ by bootstrapping on the negative image set. ² Outputf(h (d) t ;h (s) t );b t g as the cascade classifier for detection and segmentation. Figure 4.3: Algorithm of simultaneously learning detector and segmentor. 79 model of background C bkg are built based on the shape based segmentation result. The color models are then used to compute a segmentation confidence map of the image by m (c) (u)= 1 2 log µ P(+1ju;C obj )+² P(¡1ju;C bkg )+² ¶ (4.10) In practice, we implement the color models with color histograms. So P(+1ju;C obj ) and P(¡1ju;C bkg ) are the bin values of their corresponding histograms. 
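For concreteness, the per-pixel computation of Equ. 4.10 under this histogram implementation might look like the C++ sketch below; the RGB quantization (eight levels per channel, 512 bins) and the function names are illustrative assumptions, not the exact implementation used here.

#include <cmath>
#include <vector>

// Assumed color quantization: each 8-bit channel reduced to 8 levels -> 512 bins.
inline int colorBin(unsigned char r, unsigned char g, unsigned char b) {
    return (r >> 5) * 64 + (g >> 5) * 8 + (b >> 5);
}

// Per-pixel color confidence of Equ. 4.10. objHist and bkgHist are normalized
// 512-bin color histograms estimated from the pixels currently labeled as
// figure and background respectively.
double colorConfidence(unsigned char r, unsigned char g, unsigned char b,
                       const std::vector<double>& objHist,
                       const std::vector<double>& bkgHist, double eps = 1e-6) {
    const int bin = colorBin(r, g, b);
    return 0.5 * std::log((objHist[bin] + eps) / (bkgHist[bin] + eps));
}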
The color based segmentation result is combined with the shape based segmentation by a weighted sum: ˆ m(u)=(1¡e(u))m (s) (u)+e(u)m (c) (u) (4.11) where m (s) is the shape based segmentation result that is a real valued matrix, ˆ m is the combined segmentation result, and e is the error distribution of the shape based segmentation. e(u)isthesegmentationerrorrateofpixelu. Duringatrainingprocedure ane with the same dimension as the training samples is evaluated; when applied to test images, e is re-scaled to the size of the object hypothesis. It can be seen that the weight of each pixel is determined by the error rate of the shape based segmentation, so that refinement has greater effects on the areas with higher uncertainty. This process is repeated until either the segmentation result no longer changes or a maximum number of iterations is reached. Figure 4.5 gives the details of the refining algorithm. Some experimental results are given later in Section 4.5.2.2. 80 (a) (b) (c) Figure 4.4: Segmentation error distribution: a) frontal/rear view pedestrians; b) left profile pedestrians; c) left profile cars. (Brighter point has higher error rate than darker point.) ² For an object hypothesis, compute an initial color model of the object C obj and an initial color model of the background C bkg based on its shape based segmentation result m (s) ; ² For i=1 to N do 1. Compute the color based segmentation m (c) i by Equ.4.10; 2. Combine m (s) and m (c) i to ˆ m i by Equ.4.11; 3. If ˆ m i and ˆ m i¡1 are same, break; 4. Recompute C obj and C bkg based on ˆ m i ; ² Output ˆ m i as the final object segmentation result. Figure 4.5: Color based segmentation refining algorithm. (In our experiments, N =3.) 4.5 Experimental Results Weapplyoursegmentationapproachtotwoobjectclasses: pedestriansandcars. Section 4.5.1 describes the experimental results on the pedestrian class, including a comparison with a baseline segmentation method; Section 4.5.2 describes the experimental results on the car class, including an evaluation of the color based refinement method. 81 4.5.1 Experiments on pedestrians We divide the class of pedestrians into two sub-categories according to their viewpoints, frontal/rear view pedestrians and profile view pedestrians, and separately built models for these two view categories. This view based divide-and-conquer strategy is common for multi-view object detection tasks. The segmentation method in this chapter can be easily integrated with the CBT learning method described in Chapter 3. 4.5.1.1 Results for frontal/rear viewpoint pedestrians First, we describe results of our method on pedestrians imaged from a frontal or rear viewpoint. The second row of Table 4.1 gives a summary of the data used in this ex- periment. We collect 2,000 positive samples and 6,000 negative images from the MIT pedestrian set [68] 3 and the Internet. (925 of the positive samples are from the MIT set, and the remaining samples are from the Internet.) We randomly select 600 positive samples and label their segmentation ground-truth manually. We use a polygon to de- lineate the object. However, the boundary pixels are sometimes ambiguous and can not be clearly classified. Hence we mark a two pixel wide do-not-care (DNC) boundary, see Figure 4.6 for examples. The positive samples are resized to 24£58 pixels. The DNC pixels are ignored in both the training and testing stages. This strategy is similar to that in [85]. 
To evaluate segmentation, four fifths of the 600 samples are used as training data, and the remaining one fifth is used as testing data. Our data set for frontal/rear view pedestrians thus contains 1,880 positive training samples, 480 of which 3 http://cbcl.mit.edu/software-datasets/PedestrianData.html 82 Table 4.1: Summary of experimental data for simultaneous object detection and segmen- tation. Figure 4.6: Samples for frontal/rear view pedestrians: the first row is the image samples; the second row is the segmentation ground-truth (The grey pixels are do-not-care). have segmentation ground-truth, and 120 testing samples with segmentation ground- truth. Figure 4.7 shows the first few selected features and their learned segmentors. They are evenly distributed and correspond to natural body parts. 83 Figure 4.7: The first five features selected and their segmentors learned for frontal/rear pedestrians. (The 0-th segmentor is the prior distribution. Each edgelet based weak seg- mentor is implemented as a histogram. Each bin of the histogram is a real-valued matrix defined by Equ.4.3 with the same dimension of the training samples. In our experiments, a segmentor histogram has eight bins. In this figure, we visualize the matrices of the histogram bins by normalizing them to [0, 255] gray scales. White is for higher values and black for lower values.) 84 We evaluated the segmentation performance for frontal/rear view pedestrians on the 120 test samples. A precision-recall curve is generated by changing the threshold for segmentation, see Figure 4.8. (For segmentation, precision is defined as the ratio of the number of true object pixels that are classified as figure to the number of all pixels that are classed as figure; recall is defined as the ratio of the number of true object pixels that are classified as figure to the number of all true object pixels.) As the articulation effect is not very strong from this viewpoint, we achieve the highest accuracy in this class. The equal-precision-recall rate of segmentation is about 96:8%. 0 0.1 0.2 0.3 0.4 0.5 0.5 0.6 0.7 0.8 0.9 1 1 − Precision Recall Pedestrian frontal/rear Pedestrian profile Car profile Figure 4.8: Segmentation performance on the normalized samples. To evaluate the detection performance, we collect 205 images with 313 humans of frontal/rear viewpoint. We call this set “USC pedestrian set A”. 4 For comparison, we train a classifier for detection only, i.e. set ¸ = 1 in Equ.4.6, and test it on this set. Figure 4.9 shows the detection precision-recall curves for frontal/rear view pedestrians. 4 http://iris.usc.edu/~bowu/DatasetWebpage/dataset.html 85 It can be seen that the additional segmentation capacity does not reduce the detection performance. Some example results are shown in Figure 4.11. 0 0.2 0.4 0.6 0.8 0.8 0.85 0.9 0.95 1 1 − Precision Recall Simultanous detection and segmentation Detection only Figure4.9: EvaluationofdetectionperformanceontheUSCpedestriansetA(205images with 313 humans). 4.5.1.2 Results for side viewpoint pedestrians Next, we describe the results of our method on pedestrians imaged from a left profile viewpoint. We treat the side view pedestrians and frontal/rear view pedestrians as two separate categories, as their appearances are quite different. Although there are many existing methods, e.g. [80], which report quantitative detection performance on side view pedestrians, there is no known public test set for this task, and our own data was thus collected. 
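For reference, the per-pixel precision and recall reported in this section (Figure 4.8) can be computed as in the C++ sketch below, with do-not-care pixels excluded as described in Section 4.5.1.1; the label encoding is an illustrative assumption.

#include <vector>

// Ground-truth pixel labels: figure, background, or do-not-care (DNC).
enum class GtLabel { Figure, Background, DoNotCare };

// Per-pixel precision and recall of a binary figure-ground prediction, with DNC
// pixels ignored. predictedFigure[i] is true if pixel i is classified as figure.
void segmentationPrecisionRecall(const std::vector<bool>& predictedFigure,
                                 const std::vector<GtLabel>& groundTruth,
                                 double* precision, double* recall) {
    long truePos = 0, predPos = 0, gtPos = 0;
    for (std::size_t i = 0; i < groundTruth.size(); ++i) {
        if (groundTruth[i] == GtLabel::DoNotCare) continue;  // ignored in testing
        const bool isFigure = (groundTruth[i] == GtLabel::Figure);
        if (isFigure) ++gtPos;
        if (predictedFigure[i]) {
            ++predPos;
            if (isFigure) ++truePos;
        }
    }
    *precision = (predPos > 0) ? static_cast<double>(truePos) / predPos : 1.0;
    *recall    = (gtPos   > 0) ? static_cast<double>(truePos) / gtPos   : 1.0;
}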
The third row of Table 4.1 gives a summary of the data for profile view pedestrians. For training, similar to the case of frontal/rear view pedestrians, we collect 86 2,000 positive samples of left profile view pedestrians and 6,000 negative images from the Internet. The positive samples are resized to 24£58 pixels. The segmentation ground- trutharelabeledmanuallyfor600randomlyselectedpositivesamples,fourfifthsofwhich areusedfortraining,andonefifthfortesting. Figure4.10showssomeexamplesofpositive samples. The precision-recall curve of segmentation for left profile view pedestrians is shown in Figure 4.8. The articulation effect from this viewpoint is significant. This category has the largest intra-class variation. Our method achieves an equal-precision- recallrateofabout95:1%forsegmentation. Someexampleresultsofpedestriandetection and segmentation are shown in Figure 4.11. Figure 4.10: Training samples for left profile view pedestrians. 4.5.1.3 Comparison of segmentation with mean mask On the class of profile pedestrians, we compare our method with a baseline segmentation method that uses a constant mask for segmentation. Due to the strong effect of articu- lation, the class of profile view pedestrians is more difficult than the class of frontal/rear viewpedestrians. Welearnameanmaskfromthetrainingsetofprofileviewpedestrians, 87 Figure 4.11: Examples of detection and segmentation results for pedestrians. For each pair, left is the detection result and right is the segmentation result. When there are multiple objects, different colors represent different objects. 88 seeFigure4.12, andtestitonthe120pedestriantestsampleswithsegmentationground- truth and compare with our shape based segmentor. Table 4.2 shows the comparison of segmentation accuracy. It can be seen that compared to the mean mask, the shape based segmentor reduces the segmentation error from 8:1% to 3%. Also, the shape based segmentoris muchmore stable, i.e. itproduces a smaller STD than the mean mask. The second observation is consistent with intuition. The mean mask is not a good approx- imation for the distribution with multiple modes, which is the situation for articulated objects. This can be seen from the minimum accuracy values shown in Table 4.2. The shape based segmentor has a much higher minimum accuracy than the mean mask. Figure 4.12: Mean mask of left profile view pedestrians. Accuracy (%) Ave Min Max STD Mean mask 91.9 81.0 98.4 3.87 Our shape segmentor 97.0 93.0 99.4 1.60 Table 4.2: Segmentation accuracy of mean mask and shape based segmentor on aligned profile view pedestrian samples. Although there are other standard segmentation methods whose implementations are available, such as mean-shift and k-mean, these methods are not directly comparable withourmethod,becausetheyareforgeneralimagesegmentation,ratherthanforobject segmentation. 89 4.5.2 Experiments on cars We consider the samples from a profile viewpoint for cars. There are several public data sets for side view cars, including the UIUC car set [1], the Caltech101 set [23], and the TU Darmstadt set [49]. 4.5.2.1 Results of shape based detection and segmentation In this section, we describe the results of our shape based detector and segmentor on the class of cars imaged from a left profile viewpoint. The fourth row of Table 4.1 gives a summary of the data used in this experiment. We collect 1,800 positive samples of side view cars and 6,000 negative images from the training set of the UIUC car set and the Internet. 
(550of thepositivesamplesarefromtheUIUC set, andtheremaining arefrom the Internet.) The car models included in this set are mainly small and mid-sized cars, including sedans, trucks, vans, and SUVs. Most of the intra-class variation is due to the variety of car models. The positive samples are resized to 75£30 pixels. We manually label the segmentation ground-truth for 600 randomly selected positive samples, four fifths of which are used for training, and one fifth for testing. Figure 4.13 shows some examples of car samples. The precision-recall curve of segmentation for side view cars is shown in Figure 4.8. Our method achieves an equal-precision-recall rate of about 96:0% for segmentation. All the curves in Figure 4.8 are for performance on the normalized samples, where the size and position of the objects are aligned. However, when searching in a real image, the sliding window is usually not well aligned with the object. To evaluate our method under such a situation and to compare with others, we run our system on two public test 90 Figure 4.13: Training samples for left profile view cars. sets, the side view car set of the Caltech101 image set with 123 images and 123 cars, and the side view car set of the TU Darmstadt image set with 50 images and 50 cars. Both of these two sets are provided with segmentation ground-truth. These sets are designed for object recognition tasks. Although the object positions are not aligned, the object sizes are roughly the same. This reduces the search space greatly. Our detector achieves a 100% detection accuracy, i.e. no missed objects and no false alarms, on these two sets. Another public set for car is the UIUC car set, but no segmentation ground-truth is provided. Consequently, we do not test our segmentor on it. (The positive training samples of the UIUC car set are used for detector training, but none of its training or testing samples are used for segmentation.) Table 4.3 gives the comparison of our segmentation method with some previous methods. In the original ground-truth of the TU Darmstadt set, the windows of cars are labeled as background, while in our training data the windows are labeled as figure. To test our classifier, we modify the original 91 ground-truth of the TU Darmstadt set by changing the labels of all window pixels to figure. Note that all the previous methods listed in Table 4.3 use a part of the set as trainingandtestontherest,whileourmethodtrainsontotallyindependentsamplesand uses the whole set for testing. It can be seen that our method outperforms the others. Method TU Darmstadt Caltech101 UIUC Winn & Shotton [101] - - 96:5% ? Kapoor & Winn [42] 95:0% ? - - Winn & Jojic [100] 94:0% ? - - Our method 97:6% 98:3% - Table 4.3: The per-pixel figure-ground segmentation accuracy of side view cars. (The number with a ? means the testing is done on a subset of the image set.) WeevaluatedourcardetectorontheUIUCcarimageset[1]. Thissetcontainscarsof bothleftandrightprofileviews. Wemirrorourleftprofiledetectortogetonefortheright profile view, and apply these two detectors on the images. Table 4.4 lists the comparison of the equal-precision-recall rates on the UIUC single-scale set and the multi-scale set. It can be seen that our method is comparable to the state-of-the-art methods, and less affectedbythemulti-scalesituationthanthe others. Someexamplesof cardetectionand segmentation results are shown in Figure 4.14. 
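The right-profile detector used on this set is obtained by mirroring the left-profile one. One way such mirroring can be carried out at the feature level is sketched below in C++: each edgelet point is reflected about the vertical axis of the detection window and its normal direction is flipped accordingly. The data layout is an assumption made for illustration; the thesis does not spell out the exact procedure.

#include <vector>

// Assumed minimal edgelet representation: point positions inside the detection
// window plus a normal orientation (in radians) per point.
struct EdgeletPoint {
    int x, y;
    double normalAngle;
};

const double kPi = 3.14159265358979323846;

// Reflect an edgelet about the vertical axis of a window of width w, turning a
// left-profile feature into its right-profile counterpart.
std::vector<EdgeletPoint> mirrorEdgelet(const std::vector<EdgeletPoint>& edgelet, int w) {
    std::vector<EdgeletPoint> mirrored;
    mirrored.reserve(edgelet.size());
    for (std::size_t i = 0; i < edgelet.size(); ++i) {
        EdgeletPoint q = edgelet[i];
        q.x = w - 1 - q.x;                    // reflect the x-coordinate
        q.normalAngle = kPi - q.normalAngle;  // reflect the normal direction
        mirrored.push_back(q);
    }
    return mirrored;
}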
Method Single-scale Multi-scale Agarwal et al.[1] »77% »40% Garg et al.[27] »88:5% - Leibe et al.[49] 97:5% - Shotton et al.[84] 92:1% - Fritz et al.[26] - 87:8% Kapoor & Winn [42] 94:0% - Mutch & Lowe [61] 99:94% 90:6% Our method 97:5% 93:5% Table 4.4: Detection equal-precision-recall rates on the UIUC car image set. 92 Figure 4.14: Examples of detection and segmentation results for cars. 93 4.5.2.2 Results of color based refinement We evaluated our color based segmentation refining algorithm on the TU Darmstadt car images. (The other image sets we have are all in gray scale.) Our refining algorithm improves the segmentation accuracy on this set from 97:6% to 98:0%, i.e. it reduces the error rate by 16:7%. Although improvement is not significant in total pixels, as most refinement occurs around the boundary area, perceptually the improvement is obvious, see Figure 4.15 for example results. For an “indirect” comparison, take the TextonBoost method [85] for multiple class classification and segmentation as a reference. In Texton- Boost, the boosted classifier alone achieves an accuracy of 69:6%, and after combining with color, edge, and location cues, the accuracy is increased to 72:2%, i.e. the error rate is reduced by 8:6%. However, in [85] the combination algorithm is a Conditional Random Field based approach that is much more complicated and computationally ex- pensivethanourrefinementmethod. Note,itismeaninglesstocomparethetwoaccuracy values achieved by TextonBoost and our method since they are for different tasks, but the relative improvements by combining with complementary image features show the effectiveness of our method. In this experiment, we set the maximum iteration number N in the algorithm in Figure 4.5 to 3, but only one or two rounds are usually needed to converge. The compu- tational cost of the color based refinement algorithm is very small. It takes less than 80 milliseconds to process one image of the TU Darmstadt set. Our program is coded in C++ using OpenCV functions without any code level opti- mizationorparallelcomputing. Theexperimentsaredonewitha2.8GHz32-bitPentium 94 Figure 4.15: Examples of segmentation results with color based refinement. The first row are the source images; the second row are the shape based segmentation results; the third row are the segmentation results after color based refinement. The source images are from the TU Darmstadt image set. PC. The training procedure needs about 72 hours. The most time is spent on the boot- strapping procedure for collecting new negative samples. For simultaneous detection and segmentation, it usually takes a few hundred milliseconds to process one image. Our method is much faster than the previous graphical model based methods. 4.6 Conclusion and Discussion In this chapter, we described a method to simultaneously detect and segment objects of a known category. Weak detectors and segmentors were designed based on the edgelet features. A boosting algorithm was used to construct the ensemble classifiers. Simulta- neousness was achieved in both training and testing. Experimental results show that our method is comparable to the state-of-the-art methods for detection, and it outperforms the previous methods for segmentation. 95 In this method, shape is used as the primary cue for segmentation, and color as a secondarycue. Besidesshapeandcolor,textureisanimportantcueforbothsegmentation and detection tasks. 
One straightforward method to integrate texture information in the current method is to add a texture model in the refinement module. However, a holistic representation of color or texture model is sometimes uninformative and can not capture the details of the objects. Some part based representation could be helpful. In our current method, although the detection and segmentation share the same set of image features, there is little interaction between these two modules. Intuitively, segmentation results could be used to verify the detection hypotheses. If the segmentor produces an unusual shape, it may suggest an error of the detector. There are existing methods exploring this direction, e.g. [116]. For articulated objects, some previous methods have attempted to perform pose es- timation and object segmentation simultaneously, e.g. [9]. If the parts of an articulated objectarealmostrigid,theposeandimageoccupancyoftheobjectarehighlycorrelated. Inonedirection,posecanbeusedtodirectsegmentationinagenerativeway;intheother direction, segmentation can be used to estimate pose in a discriminative way. 96 Chapter 5 Part Combination for Detection of Occluded Objects 5.1 Motivation and Outline The learning based detection methods train object classifiers from a labeled sample set. To search for objects in a new image, the classifier is applied to the sub-windows with variable sizes in all positions. For the detection of objects with partial occlusions, part based representations can be used. For each part, a detector is learned and the part detection responses are combined to form object hypotheses. The part detectors are typically applied to overlapping sub-windows, and the sub- windows are classified independently, and one local feature may thus contribute to mul- tiple overlapped responses for one object, see Figure 5.1. Some false detections may also occur, as local features may not be discriminative enough. Due to poor image cues or partial occlusions, some object parts may not be detected. To achieve a one-to-one mapping from part detection responses to object hypotheses, we need to group the re- sponses and explain inconsistencies between the observations and the hypotheses. When objects are close to each other, both the one-object-multiple-response problem and the 97 (a) Full-body (b) Head-shoulder (c) Torso (d) Legs Figure 5.1: Examples of part detection responses for pedestrians. part-object assignmentproblemrequire joint consideration of multipleobjects, instead of treating them independently. We propose a unified framework for part response group- ing, merging and assigning, and demonstrate that it outperforms the previous related methods. Figure 5.2 shows an overview diagram of our approach. We define a part hierarchy for an object class, in which each part is a sub-region of its parent. Since building part detectors independently is time consuming and building them as sub-set of the whole- object detector may not achieve a desirable accuracy, we choose a tradeoff between these 98 Figure 5.2: Schematic diagram of our part combination method. two approaches. For each part, a detector is learned by boosting local shape features. A child node in the hierarchy inherits image features from its parent node and if a target performance can not be achieved from the inherited features, more features are selectedandaddedtothechildnode. Forwhole-object, besidesthedetector, apixel-level segmentor is learned. Given a new image, the part detectors are applied. 
The image edge pixels that positively contribute to the detection responses are extracted. The part responses and the object edges form an informative intermediate representation of the original image. In our approach, we do not divide the tasks of merging responses and part combina- tion into two separate stages; instead, we try to solve them under the same framework. Fromthepartdetectionresponses,multipleobjecthypothesesareproposed. Foreachhy- pothesis, a pixel-level segmentation is obtained by applying the whole-object segmentor, 99 and the silhouette is extracted. We apply occlusion reasoning to the object silhouettes to compute a 1-D visibility score, instead of the region based 2-D visibility score in the previous methods [83, 55]. We define a joint image likelihood of multiple objects, which gives rewards for successful detection of visible parts, and penalties for missed detections and false alarms. The likelihood also includes a matching score between the visible sil- houettes and the object edges. Our joint analysis enforces the exclusiveness of low level features, i.e. one image feature can contribute to one hypothesis at most. Our approach is a unified MAP framework that solves part merging, grouping, and assigning together. Our main contributions of this method include: 1) a part hierarchy design that enables efficient learning of part detectors by feature sharing; 2) an accurate occlusion reasoning approach based on object silhouettes; and 3) a joint image likelihood based on both the detection responses and the object edges, which are assigned to object hypotheses exclusively. We demonstrate our approach through the class of pedestrians. Everymoduleinourapproachcontributestotherobustnessofthewholesystem. Though the situations to the advantage of any single module may not frequently occur, together they result in a statistically significant improvement compared to the previous methods. The first version of this part combination method is published in [105, 104], and is followed by other researchers [83, 55]. Since then, we have made several improvements based on the first version method. In this chapter, we will focus the description on our current method, the second version. We will show the major difference between the new method and the old. 100 5.2 Hierarchical Body Part Detectors We use the class of pedestrians to illustrate and validate. We define a part hierarchy for human body, which consists of three levels including a full-body node and 11 body part nodes, see Figure 5.3. (In our first version method [105, 104], only the first two levels of this hierarchy are used.) Figure 5.3: Hierarchy of human body parts. (Pt 0 is full-body; Pt 1;0 head-shoulder; Pt 1;1 torso; Pt 1;2 legs; Pt 2;0 leftshoulder; Pt 2;1 head; Pt 2;2 rightshoulder; Pt 2;3 leftarm; Pt 2;4 right arm; Pt 2;5 left leg; Pt 2;6 feet; Pt 2;7 right leg. The left and right sides here are w.r.t. the 2-D image space.) 5.2.1 Learning part detectors For each node, a detector is learned. As we define the part hierarchy such that the region ofonechildnodeisasub-regionofitsparentnode,featuresharingbetweentheparentand 101 childnodesispossible. Foreachpartnode, aboostingalgorithmisappliedtoselectgood local shape features and construct a classifier as a detector. The image features used are edgelets. Before the regular boosting procedure, the detector of one node, except for the “full-body” node, inherits all the edgelet features overlapping with its sub-region from its parent node. 
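A sketch of this inheritance step is given below in C++; as described next, the points of an inherited edgelet that fall outside the part's sub-region are dropped before the classification function is re-trained. The rectangle type, the overlap test, and the minimum point count are illustrative assumptions.

#include <vector>

// Illustrative types: an axis-aligned part sub-region and an edgelet point chain.
struct PartRect { int x0, y0, x1, y1; };  // inclusive bounds inside the window
struct ImgPoint { int x, y; };
struct EdgeletFeature { std::vector<ImgPoint> points; };

inline bool inside(const PartRect& r, const ImgPoint& p) {
    return p.x >= r.x0 && p.x <= r.x1 && p.y >= r.y0 && p.y <= r.y1;
}

// Inherit the parent's edgelets that overlap the child part's sub-region,
// keeping only the points inside that region; the classification function of
// each inherited feature is re-trained afterwards (not shown).
std::vector<EdgeletFeature> inheritFeatures(const std::vector<EdgeletFeature>& parent,
                                            const PartRect& childRegion,
                                            std::size_t minPoints = 4) {
    std::vector<EdgeletFeature> inherited;
    for (std::size_t i = 0; i < parent.size(); ++i) {
        EdgeletFeature trimmed;
        for (std::size_t j = 0; j < parent[i].points.size(); ++j)
            if (inside(childRegion, parent[i].points[j]))
                trimmed.points.push_back(parent[i].points[j]);
        if (trimmed.points.size() >= minPoints)  // overlaps the sub-region enough
            inherited.push_back(trimmed);
    }
    return inherited;
}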
For each inherited edgelet, the points that are out of the part’s sub-region are removed, and the classification function is re-trained. Usually the detector can not achieve a high accuracy from only the inherited features. The regular boosting algorithm is then applied to add more features to the classifier. Figure 5.4 gives an illustration of feature sharing. Note that there are several feature sharing concepts in our method. In the CBT method described in Chapter 3, feature sharing is between the weak classifiers for differ- ent sub-categories; in the simultaneous detection and segmentation method described in Chapter 4, feature sharing is between the weak detectors and the weak segmentors; here, feature sharing is betweendifferent overlapping parts. The learning algorithm used is the CBT method. (In our first version method [105, 104], the VBT method [36] is used to learnviewbasedpartclassifiers,andthereisnofeaturesharingbetweenparts. InSection 3.1, wehaveshownthattheCBTclassifiershavebetterperformancethantheviewbased VBT classifiers for pedestrian detection.) Additional details of the experimental setting are given later in Section 5.4. For the full-body node, we use the method described in Chapter 4 to learn a pixel- level figure-ground segmentor. Note, we do not learn segmentors for the other body parts. Because the full-body segmentor is based on local features, even when the object 102 Figure 5.4: Illustration of feature sharing in part detector hierarchy. (The black points are the inherited features, and the gray are the newly selected features.) ispartiallyoccluded, thefull-bodysegmentorcanstillsegmentthevisiblepartwellbased on the visible features. 5.2.2 Detecting body parts and object edges Given a new image, the part detectors are applied. Besides collecting part responses, we extract image edges that correspond to objects. For each edgelet feature f in the classifier, we call it a positive feature if it has a higher average matching score on positive samples than on negative samples, i.e. Eff(x)jx2X + g>Eff(x)jx2X ¡ g (5.1) whereX § is positive/negative sample space. The average matching scores are evaluated during the off-line learning stage. For one sub-window that is classified as an object, the positive features in the sub-window are ranked according to their matching scores. The positive features with the top 5% scores are retained. Asonedetectorusuallycontainsaboutonethousandpositivefeatures,alargenumber of edgelets are kept for one image. Some of these edgelets correspond to the same edge 103 pixels. We apply a clustering algorithm to prune redundant edgelets. An edgelet consists of a chain of 2-D points. Denote the positions of the points in an edgelet E, byfu i g k i=1 , where k is the length of the edgelet. Given two edgelets E 1 and E 2 with the same length, we define an affinity between them by A(E 1 ;E 2 ) Δ = 1 k k X i=1 hu 1;i ¡¯ u 1 ;u 2;i ¡¯ u 2 i¢e ¡ 1 2 k¯ u 1 ¡¯ u 2 k 2 (5.2) where ¯ u is the mean offu i g. If the two features have different numbers of points, k 1 and k 2 , they are first aligned by their center points, and then the longer feature is truncated to the length of the shorter one by removing points from the two ends. The affinity given by Equ.5.2 multiplied by a factor of minfk 1 ;k 2 g maxfk 1 ;k 2 g is taken as the affinity for these edgelets. The clustering algorithm is an iterative algorithm. First, we find the edgelet with the highest matching score, and then remove all edgelets with high affinity to it. 
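The C++ sketch below shows both the affinity of Equ. 5.2, including the center alignment, truncation, and length-ratio scaling for edgelets of different lengths, and the greedy pruning loop; the affinity threshold is an illustrative parameter.

#include <algorithm>
#include <cmath>
#include <vector>

struct Pt2 { double x, y; };

// Affinity of Equ. 5.2. Edgelets of different lengths are center-aligned and the
// longer one is truncated from its two ends; the result is scaled by min(k1,k2)/max(k1,k2).
double edgeletAffinity(std::vector<Pt2> e1, std::vector<Pt2> e2) {
    const std::size_t k1 = e1.size(), k2 = e2.size();
    const std::size_t k = std::min(k1, k2);
    if (k == 0) return 0.0;
    std::vector<Pt2>& longer = (k1 >= k2) ? e1 : e2;
    longer.erase(longer.begin(), longer.begin() + (longer.size() - k) / 2);
    longer.resize(k);
    // Mean points of the two (now equally long) chains.
    Pt2 c1 = {0.0, 0.0}, c2 = {0.0, 0.0};
    for (std::size_t i = 0; i < k; ++i) {
        c1.x += e1[i].x / k;  c1.y += e1[i].y / k;
        c2.x += e2[i].x / k;  c2.y += e2[i].y / k;
    }
    // Mean inner product of the centered chains, damped by the center distance.
    double dot = 0.0;
    for (std::size_t i = 0; i < k; ++i)
        dot += (e1[i].x - c1.x) * (e2[i].x - c2.x) + (e1[i].y - c1.y) * (e2[i].y - c2.y);
    const double d2 = (c1.x - c2.x) * (c1.x - c2.x) + (c1.y - c2.y) * (c1.y - c2.y);
    return (static_cast<double>(k) / std::max(k1, k2)) * (dot / k) * std::exp(-0.5 * d2);
}

// Greedy pruning: repeatedly keep the unexamined edgelet with the highest matching
// score and suppress the remaining ones with a high affinity to it. Returns the
// indices of the retained (representative) edgelets.
std::vector<int> pruneEdgelets(const std::vector<std::vector<Pt2> >& edgelets,
                               const std::vector<double>& matchScore,
                               double affinityThreshold) {
    std::vector<int> kept;
    std::vector<bool> done(edgelets.size(), false);
    for (;;) {
        int best = -1;
        for (std::size_t i = 0; i < edgelets.size(); ++i)
            if (!done[i] && (best < 0 || matchScore[i] > matchScore[best]))
                best = static_cast<int>(i);
        if (best < 0) break;  // all object edgelets have been examined
        kept.push_back(best);
        done[best] = true;
        for (std::size_t i = 0; i < edgelets.size(); ++i)
            if (!done[i] && edgeletAffinity(edgelets[best], edgelets[i]) > affinityThreshold)
                done[i] = true;
    }
    return kept;
}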
This procedure is repeated until all object edgelets are examined. The remaining edgelets are the observations that support the putative object hypotheses. See Figure 5.5 for an example. Compared to general edge based image segmentation methods, where all edges are extracted, our edge extraction removes edges from background clutters and focuses on object shapes. These object edges, together with the bounding boxes, are input for the joint analysis of multiple objects. 104 Figure 5.5: Extracted object edgelet pixels. 5.3 Joint Analysis for Multiple Objects Compared to the early work on part combination [60, 82, 59], the main contribution of our approach is the joint analysis of multiple objects with occlusion reasoning. Figure 5.6 lists the main steps of our part combination algorithm. 5.3.1 Proposing object hypotheses Initially, object hypotheses are proposed from the detection responses of a subset of parts. For pedestrians, we use full-body, head-shoulder, left/right shoulder, and head to propose. During detection, only the part detectors for hypothesis proposal are applied to the whole image, while others are applied to the local neighborhood around the initial hypotheses. The hypotheses with large overlap ratios, which are defined as the areas of theirintersectionovertheareasoftheirunion,aremerged. Unlikethetraditionalmerging step [75], we use a high overlap threshold to obtain a set of “under-merged” responses, in which one object may have multiple hypotheses but hypotheses of different objects are unlikely to be merged. Although this under-merging reduces the search space, it can 105 1. Propose initial object hypotheses sorted such that their y-coordinates are in de- scending order. 2. Segment object hypotheses and extract their silhouettes. 3. Examine the hypotheses one by one, from front to back (a) For one hypothesis H, compute the joint occlusion maps for silhouettes of multiple objects, with and without H; (b) Match the detection responses and object edgelets with visible silhouettes; (c) Compute the image likelihood with H, P w (H), and the likelihood without H, P w=o (H); (d) If P w (H)>P w=o (H), accept the hypothesis; otherwise reject it. 4. Output all remaining hypotheses. Figure 5.6: Searching for the best multiple object configuration. keep the responses of close objects separate for further joint analysis. We sort the object hypotheses by their vertical coordinates such that their y-coordinates are in descending order. See Figure 5.7 for an example. Figure 5.7: Proposal of multiple human hypotheses. 106 5.3.2 Joint occlusion map of silhouettes For each hypothesis, the figure-ground segmentation is computed by applying the whole- object segmentor, and the object silhouette is extracted. We assume that objects are on a ground plane and the camera looks down towards the plane. This assumption is valid for common surveillance systems. This configuration brings two observations: 1) if a human in the image is visible, then at least his/her head is visible, and 2) the farther the human is from the camera, the smaller the y-coordinate of his/her feet’s image position. (The origin of the image is at the top-left corner.) Figure 5.8 shows an illustration of this assumption. Figure 5.8: Assumption of scene structure. We render the segmentation masks of the ordered hypotheses by a z-buffer like method, and remove the invisible parts of the silhouettes that are out of image frame or occluded by other objects. See Figure 5.9(a). 
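A minimal C++ sketch of this rendering step follows: the masks are painted from back to front so that nearer hypotheses overwrite farther ones, producing a per-pixel owner map; a silhouette pixel of hypothesis i is then visible only where the owner map still equals i. The mask layout and names are illustrative assumptions.

#include <vector>

// A hypothesis with its figure-ground mask placed at (x, y) in the image.
struct ObjectHypothesis {
    int x, y, width, height;          // bounding box in the image
    std::vector<unsigned char> mask;  // width*height binary figure mask
};

// Hypotheses are assumed to be sorted front to back (descending feet y-coordinate).
// Painting them in reverse order lets nearer objects overwrite farther ones,
// which plays the role of a z-buffer. Pixels outside the frame are simply skipped.
std::vector<int> renderOcclusionMap(const std::vector<ObjectHypothesis>& frontToBack,
                                    int imgW, int imgH) {
    std::vector<int> owner(static_cast<std::size_t>(imgW) * imgH, -1);  // -1 = background
    for (int i = static_cast<int>(frontToBack.size()) - 1; i >= 0; --i) {
        const ObjectHypothesis& h = frontToBack[i];
        for (int v = 0; v < h.height; ++v) {
            for (int u = 0; u < h.width; ++u) {
                const int ix = h.x + u, iy = h.y + v;
                if (ix < 0 || iy < 0 || ix >= imgW || iy >= imgH) continue;  // out of frame
                if (h.mask[v * h.width + u] != 0)
                    owner[static_cast<std::size_t>(iy) * imgW + ix] = i;
            }
        }
    }
    return owner;
}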
For each part of an object hypothesis, a visibility score is defined as the ratio between the length of the visible silhouette and the length of the whole silhouette. For each hypothesis, we remove the parts whose visibility scores are smaller than a threshold µ v (=0.7 in our experiments). The remaining parts are considered observable to the detectors. 107 (a) (b) Figure 5.9: Occlusion relation of multiple objects. a) 1-D silhouette based occlusion reasoning with boosted object segmentor; b) 2-D region based occlusion reasoning with constant object mask. In our first version method [105, 104], the object segmentation is approximated by a constant ellipse mask, and the visibility score is computed based on 2-D object region, which is also used in [83, 55], see Figure 5.9(b). Compared to the region based visibility score, the silhouette based visibility score is more accurate and more meaningful for the shape based detectors. For example, when the region of a large object in the back is mostly occluded bya smallerobject in the front, the silhouette based occlusion reasoning can retain the back one for further analysis as long as its contour is mostly visible, while the region based occlusion reasoning can not. 5.3.3 Matching object edges with visible silhouettes After obtaining the visible silhouettes, we assign the object edgelets extracted during part detection to the hypotheses by matching them with the visible silhouettes. For each edgelet, we find the closest silhouette to it and align the edgelet with the silhouette. Figure 5.10 gives the algorithm. 108 1. Compute distance transformation for all visible silhouettes; 2. For each object edgelet (a) ComputetheChamfermatchingscorestoallthevisiblesilhouettes,andassign the edgelet to the silhouette with the largest score; (b) Find the silhouette pointc nearest to the edgelet and locally align the edgelet with the silhouette around c; (c) Mark the part of the silhouette that is covered by the edgelet as “supported”; Figure 5.10: Matching and aligning edgelets with silhouettes. Toassignedgeletstosilhouettes,wefirstcomputethedistancetransformationforeach visiblesilhouette. WethencomputetheChamfermatchingscoresbetweenalltheedgelets andallthesilhouettesthroughthedistancetransformation. Anedgeletisassignedtothe silhouette that has the highest matching score. (If an edgelet has low scores with all the silhouettes, then it is not assigned to any.) To align one edgelet with its corresponding silhouette, we first find the silhouette point c closest to the edgelet through distance transformation. We then search a small neighborhood of c along the silhouette, §5 pixels. For each position, we cut a segment from the silhouette with the same length as the edgelet and compute its shape affinity to the edgelet by Equ.5.2. The position with the highest affinity is taken as the aligned position, and the corresponding segment of the silhouette is marked as “supported” (see Figure 5.11). The ratio between the length of the supported segments and the overall length of the silhouette is called the edge coverage of the silhouette. The above algorithm guarantees that one edgelet contributes to one hypothesis at most. If one silhouette can not acquire enough supporting edgelets, the corresponding hypothesis will be removed. This naturally solves the one-object-multiple-hypotheses 109 Figure 5.11: The parts of the silhouettes that have matched edgelets (red points). problem and prune some false alarms. 
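A simplified C++ sketch of the coverage computation and the resulting pruning is given below; in the full method the edge coverage feeds the joint likelihood of Section 5.3.5 rather than a hard threshold, so the minimum-coverage parameter here is purely illustrative.

#include <vector>

// Edge coverage of one visible silhouette: the fraction of its points marked
// "supported" after the edgelets have been matched and aligned to it.
double edgeCoverage(const std::vector<bool>& supported) {
    if (supported.empty()) return 0.0;
    long covered = 0;
    for (std::size_t i = 0; i < supported.size(); ++i)
        if (supported[i]) ++covered;
    return static_cast<double>(covered) / supported.size();
}

// Keep only the hypotheses whose silhouettes acquire enough supporting edgelets.
std::vector<int> pruneUnsupportedHypotheses(
        const std::vector<std::vector<bool> >& silhouetteSupport,  // one per hypothesis
        double minCoverage) {
    std::vector<int> kept;
    for (std::size_t i = 0; i < silhouetteSupport.size(); ++i)
        if (edgeCoverage(silhouetteSupport[i]) >= minCoverage)
            kept.push_back(static_cast<int>(i));
    return kept;
}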
For example, the hypothesis 8 in Figure 5.7 is removed in this way. (Our first version method [105, 104] does not enforce feature exclusiveness.) 5.3.4 Matching detection responses with visible parts Matching part detection responses with the visible part hypotheses is a standard as- signment problem, which we solve by the Hungarian algorithm [46]. For each response- hypothesis pair, we compute their overlap ratio as the matching score. Only the pairs whose overlapratios are largerthana threshold (=0.5in our experiments) areconsidered to be a potential match. After matching, we apply under-merging to the remaining part responses to remove redundant false alarms. We then count the successful detections, false alarms, and missed detections (see Figure 5.12 for an example). 110 Figure 5.12: Result of matching full-body detection responses in Figure 5.1(a) with the proposed hypotheses in Figure 5.7 (yellow: matched responses; orange: response not matched with any hypothesis; red: hypothesis without matched response). 5.3.5 Computing joint image likelihood Denote one visible part of an object hypothesis and one part detection response byz and r respectively. Denote the set of matched response-hypothesis pairs by SD (successful detection),thesetsoffalsealarmsandmisseddetectionsaredefinedbyFA=frjr = 2SDg and FN = fzjz = 2 SDg (false negative) respectively. Denote the object edgelets from response r by E(r). The joint image likelihood of multiple objects is defined by P(OjZ)= Y fz;rg2SD P SD (r;E(r)jz) Y r2FA P FA (r) Y z2FN P FN (z) (5.3) whereO packs all observations, and Z for all hypotheses. The first term on the right side of Equ.5.3 is the reward for successful detections. It is decomposed as P SD (r;E(r)jz)=P(rjE(r);z)P(E(r)jz) (5.4) 111 To model P(rjE(r);z), we evaluate the distribution of the part detector’s true positive rate under different edge coverage of the silhouettes. The distribution is represented as a histogram. Spatial error between the response and the hypothesis or poor contract of the input image reduces the edge coverage score. Lower edge coverage usually corresponds to alowertruepositiverate. WeassumethatP(E(r)jz)isauniformdistribution,andhence it is ignored in practice. (In our first version method [105, 104], edgelet responses are not used to compute the reward for successful detection.) The second term on the right side of Equ.5.3 is the penalty for false alarms. It is computed by one minus the detector’s precision. The third term is the penalty for missed detections. It is computed by one minus the detection rate. These properties are independently evaluated for different part detectors. Althoughboththedetectionresponsesandtheedgecoveragearebasedontheedgelet features, they have some complementarity. The sub-window classification considers the featureslocally andindependently; theedgelet-silhouetteassigning considersthe features globally and jointly. 5.3.6 Searching for the best configuration Finally, we need a method to search the solution space to maximize the posterior proba- bility P(ZjI), given an input image I. According to Bayes’ rule P(ZjI)/P(IjZ)P(Z)=P(OjZ)P(Z) (5.5) 112 Assuming a uniform distribution of the prior P(Z), the above MAP estimation is equal to maximizing the joint likelihood P(OjZ). To search for the best interpretation of the image, we examine the initial object hypotheses one by one, in the descending order of their y-coordinates (see Figure 5.13 for an example). 
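The examination loop of Figure 5.6 can be summarized by the C++ sketch below, in which the joint likelihood of Equ. 5.3 (with its occlusion maps and edgelet/response matching) is abstracted as a callback; treating rejected hypotheses as removed from all later comparisons follows the reading suggested by Figure 5.13 and is an assumption of this sketch.

#include <functional>
#include <vector>

// Greedy verification of the initial hypotheses, examined front to back.
// jointLikelihood(keep) evaluates Equ. 5.3 for the configuration in which
// hypothesis h is included iff keep[h] is true.
std::vector<int> searchBestConfiguration(
        int numHypotheses,  // hypotheses 0..n-1 already sorted front to back
        const std::function<double(const std::vector<bool>&)>& jointLikelihood) {
    std::vector<bool> keep(numHypotheses, true);  // start from all initial hypotheses
    for (int h = 0; h < numHypotheses; ++h) {
        std::vector<bool> without = keep;
        without[h] = false;
        // Accept the hypothesis only if the image is better explained with it.
        if (jointLikelihood(keep) <= jointLikelihood(without))
            keep = without;
    }
    std::vector<int> accepted;
    for (int h = 0; h < numHypotheses; ++h)
        if (keep[h]) accepted.push_back(h);
    return accepted;
}

Because each comparison only adds or removes the current hypothesis, one of the two likelihoods can be reused from the previous round, as noted in the caption of Figure 5.13.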
If there are several hypotheses for one object, the algorithm will find the one that best aligns with the object edges and part responses. For example, the hypotheses h 1 ;h 3 ;h 4 ;h 5 in Figure 5.13 correspond to one human. Our algorithm chooses the best one (h 1 ) and removes the others. If there are inter-object occlusions, the algorithm will ignore the occluded parts. For example, the legs of hypothesis h 12 are not detected, but this can be explained by occlusion from h 7 . Therefore, h 12 is retained. Note that this greedy search algorithm does not guar- antee to find the global optimum. However, in most cases, it works well. We call the above Bayesian combination algorithm a combined detector, whose output is combined responses. 5.4 Experimental Results We demonstrate our approach on the problem of pedestrian detection. We evaluated our system on two image test sets. One is from our own collection, and the other consists of the “Zurich Mobile Pedestrian Sequences” [21] 1 . Unlike the other popular test sets for pedestrian detection, e.g. the INRIA set [16] and the MIT set [68] that use segmented, separatedhumansamples,thesesetscontainmultipleun-segmentedhumanswithfrequent inter-human occlusions. 1 http://www.vision.ee.ethz.ch/~aess/iccv2007/ 113 Figure 5.13: An example of searching for the best multi-object configuration. (The blue rectangles overlaid on the images are the hypotheses being examined. The red boxes are thestateskeptaftercomparingtheimagelikelihoodswith/withoutonehypothesis. When examiningahypothesis,oneofthe“with”and“without”likelihoodscanbeinheritedfrom the previous round to reduce computational cost. For example “without h 0 ” and “with h 1 ” are the same state, as h 0 is removed.) 114 5.4.1 Training part detector hierarchy Totrainthepartdetectors, weusethesametrainingsetthatisusedintheexperimentin Section 3.1.4.3. This set contains multi-view pedestrian samples. The full-body samples arenormalizedto24£58pixels. Thesizesoftheotherbodypartscanbederivedbasedon their definitions in Figure 5.3. Although feature sharing cuts training time byabout half, it requires about five days to train all the part detectors. Figure 5.14 shows the first two learned edgelet features for head-shoulder, torso, and legs. They are quite meaningful. Figure 5.15 shows some examples of successful detections and interesting false alarms, where locally the images look like the target parts. Figure 5.14: The first two edgelet features learned for head-shoulder, torso, and legs. 5.4.2 Evaluation on indoor occluded examples First, to evaluate our combined detector with occlusions, we select 54 frames with 271 humans from the CAVIAR video corpus [12]. We call this set “USC pedestrian set B”. 2 Inthisset,75humansarepartiallyoccludedbyothers,and18humansarepartiallyoutof the scene. Table 5.1 lists the performance of our individual part detectors on this set. It canbeseenthatwithocclusions,theperformanceofthepartdetectorsdropsgreatly. The 2 http://iris.usc.edu/~bowu/DatasetWebpage/dataset.html 115 (a) Full-body (b) Head-shoulder (c) Torso (d) Legs Figure 5.15: Examples of part detection results on images from the Internet. (Green: successful detection; Red: false alarm) 116 feet and legs detectors have the poorest performance, as the occlusions usually happen to the lower body. 
Part Recall Precision Full-body 0.7638 0.9367 Head-shoulder 0.7269 0.9471 Torso 0.7934 0.9110 Legs 0.5720 0.8470 Head 0.6679 0.7702 Left shoulder 0.6863 0.8857 Left arm 0.7860 0.8694 Left leg 0.5240 0.8208 Feet 0.5092 0.7624 Table5.1: Performanceofpart detectorsontheUSC pedestrian set B. (Theperformance of right shoulder/arm/leg is similar to their right counterparts.) We compare the end-to-end performance of our system with some previous multiple human detection methods, including our first version method. Figure 5.16 shows the precision-recall curves. It can be seen that our method is significantly better than the other state-of-the-art methods, and all the combined detection methods are much better than any individual part detector on occluded examples. Table5.2liststhedetectionratesondifferentdegreesofocclusion. Itcanbeseenthat the detection rate on partially occluded humans is only slightly lower than the overall detection rate and declines slowly with the degree of occlusion. Occlusion degree (%) 25»50 50»75 >75 Human number 34 31 10 Detection rate (%) 94.12 93.55 90.0 Table 5.2: Detection rates on different degrees of occlusion (with 12 false alarms). 117 0 0.05 0.1 0.15 0.2 0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 − Precision Recall Lin et al. ICCV07 Shet et al. CVPR07 Our first version method ICCV05 Our current method Figure 5.16: Evaluation of detection performance on the USC pedestrian test set B. In this experiment, we do not use any scene structure or background subtraction to facilitate detection. The test image size is 384£288 pixels. We search humans from 24 to 80 pixels wide. We use four threads to run detection of different parts simultaneously. Our experimental machine is a dual-core dual-processor Intel Xeon 3.0GHz CPU. The average speed on this set is about 3.6 second per image. Figure 5.17 shows some example detection results on this set. 5.4.3 Evaluation on outdoor occluded examples Second, we evaluate our method on the three Zurich mobile pedestrian sequences, which are captured by a stereo pair of cameras mounted on a children’s stroller. Same as [21], we only use the frames from the left camera for testing. The first test sequence contains 999 frames with 5,193 annotated humans; the second contains 450 frames with 2,359 118 Figure 5.17: Example detection and segmentation results on the USC pedestrian set B. 119 humans; the third contains 354 frames with 1,828 humans. The frame size is 640£480 pixels. ThissetismoredifficultthantheUSCset,becauseofthepoorimagingconditions and the crowded, cluttered outdoor environments. To compare with the results in [21], which combines scene analysis with object detection, we develop a simple ground plane estimation method and use the ground plane assumption to facilitate detection. We first use our full-body detector to search for humans from 24 to 200 pixel wide (corresponding to 58 to 483 pixel high). Then from the full-body responses, we do a RANSAC style algorithm to estimate a linear mapping from the 2-D image positions to the 2-D human heights: ax+by+c = h, where x;y are the image position, h is the human height, and a;b;c are the unknowns. With ground plane, the other part detectors only search the validregionsinthe position-scalespace. Thissavessome computationalcost and reduces the false alarm rate. The classifiers used in this experiment are the same as those used in Section 5.4.2. Figure 5.18 shows the precision-recall curves of our methods and those in [21]. 
It can be seen that on all three sequences our method dominates. However, the efforts of this workandthatin[21]focusondifferentaspects. Essetal. [21]havetriedtointegratescene structure analysis and object detection, while our approach tries to segment multiple, occluded objects jointly. These two complementary methods can be combined for further improvement. The average speed of our system on this set is about 2.5 second per image. Figure 5.19 shows some example results on this set. 120 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1 − Precision Recall Ess et al. ICCV07 seq01 Ess et al. ICCV07 seq02 Ess et al. ICCV07 seq03 Our method seq01 Our method seq02 Our method seq03 Figure 5.18: Evaluation of detection performance on the Zurich mobile pedestrian se- quences. (Following [21]’s evaluation, only humans higher than 60 pixels are counted. The curves of [21], the courtesy of the original authors, are for their full-system, i.e. with ground plane and stereo depth.) 5.5 Conclusion and Discussion In this chapter, we described a method to group, merge, and assign part detection re- sponses to segment multiple, possibly inter-occluded objects. Based on occlusion reason- ing, the joint likelihood of multiple objects is maximized to find the best interpretation of the input image. We demonstrated our approach on the class of pedestrians. The experimental results show that our method outperforms the previous ones. To apply our approach to other object classes, some components may need to be modified according to the class of interest. First, the design of the part hierarchy is class dependent. Different object classes may need different partitions. Second, the 121 Figure5.19: ExampledetectionandsegmentationresultsontheZurichmobilepedestrian sequences. 122 ground plane assumption is valid for some objects in some applications, such as cars and pedestrians in surveillance videos, but not for all objects in all situations. When this assumption is not true, we need to infer the objects’ relative depths by other techniques. Third, though the feature exclusiveness idea should be helpful for any feature based detection, it may require different implementations for different features. 123 Chapter 6 Improving Object Detection by Unsupervised, Online Learning 6.1 Motivation and Outline Many existing detection methods use a supervised off-line learning approach, such as our CBT method and our multi-feature integration method described in Chapter 3. In order to obtain a generic detector with good performance, tens of thousands samples could be needed [36]. Although raw data, i.e. images and video sequences, are cheap and easy to collect, manually labeling a huge amount of data is time-consuming and tedious. Forgenericobject detection problems, theintra-classvariationcan beverylarge for both object and non-object classes, as all the situations need to be considered. Even with enough training samples, it could be hard to build a good model to solve such a complicated classification problem. For applications where the environments considered are highly limited, a specialized detector could be a better solution. This chapter describes our unsupervised, online learning approach to “grow” a set of specialized human part detectors from a set of seed 124 part detectors that are for general purpose. 
The input of our online learning algorithm includes a set of general part detectors, which is learned from a small labeled training set, and a large amount of unlabeled frames captured at a particular site. The output is a set of specialized detectors for the particular site. We use the class of pedestrians to demonstrate our method. Figure6.1givesaschematicdiagramofourapproach. Tofocusontheonlinelearning algorithm, we use a simplified version of the part based pedestrian detection system described in Chapter 4. Only the top four nodes, full-body, head-shoulder, torso, and legs,intheparthierarchyareused. Foreachpart,acascadestructured“seed”classifieris trained by our off-line CBT learning algorithm with a maximum channel number of one. The image features used are edgelets. The size of the training set for the seed detectors is relatively small and the samples are for general purposes. Figure 6.1: Schematic diagram of our unsupervised, online learning approach. 125 Wedesignedtheautomaticlabeler,i.e. oracle,forunsupervisedlearningbycombining thebodypartdetectionresultswiththemethoddescribedinChapter4. Theclassification confidence of a sample is calculated from its part detection responses. Only the samples with high confidence are used for updating. To reduce the alignment errors, we learn a linear regressor to align the human samples according to the head and the feet positions. Our oracle has high precision, and unlike those in [62, 73, 40] it does not rely on motion segmentation. We extend the online boosting algorithm in [67] to the case of cascade structured detectors. Our online boosting algorithm starts from the general seed detectors learned off-line. Each weak classifier is based on one edgelet feature. A shape affinity is defined to measure the similarity between two edgelet features. For each weak classifier, a small neighborhoodofitisconstructedbasedontheshapeaffinityofedgelets. Ateachboosting round, the best weak classifier in the neighborhood is selected. The decision strategy of thecascadeisupdatedbylookingatashorthistory of thesamples collected. The sample passing rates of the weak classifiers are estimated, based on which the number of the weak classifiers is adapted. The specialized detectors are used to update the oracle. We analyzed the components of our method quantitatively on two sets of surveillance videos. The experimental results show the efficiency of our system. Our main contri- butions of this method are: 1) an oracle for unsupervised learning based on a set part detectors; 2) an online learning framework for cascade structured detectors; and 3) the integrationofnoiserestrainingstrategiesinboththeoracleandthelearningcomponents. 126 6.2 Experimental Data Set Sincewewillinterlacedescriptionsofmethodologyandquantitativeanalysisofthesystem components in the next few sections, we first describe the data sets used. We have three data sets: a general sample set, a number of sequences from the CAVIAR video corpus [12], and a number of sequences from the CLEAR-VACE video corpus [14]. Our general sample set is collected from the MIT pedestrian set [68] and the Inter- net. It contains 500 human samples of frontal viewpoint, 500 human samples for profile viewpoint, and their mirror, overall 2,000 positive samples, which are well aligned and normalized to 24£58 pixels. The general sample set also contains 1,000 negative images, without any humans. 
Both the positive samples and the negative images are for general purposes, without any bias for environment, illumination, clothing, etc. The size of the general set is relatively small and manually labeling it is feasible. We use this set to learn the seed part detectors. Weusethe26sequencesofthe“shoppingcentercorridorview”,36,292framesoverall, from the CAVIAR video corpus to form our second data set. The videos in this set are captured with a stationary camera, mounted a few meters above the ground and looking down towards an indoor corridor. The frame size is 384£288 pixels and the sampling rateis25FPS.Weusesixsequencesrandomlychosenasavalidationsettoquantitatively analyze the different components in our method. We use another ten sequences as the training set for online learning; we call this burn-in set in order to distinguish from the training set for off-line learning. The remaining ten sequences are used as a test set to evaluate the performance of the online updated detectors. 127 The third data set consists of 10 sequences, 30,250 frames overall, randomly selected fromtheCLEAR-VACEvideocorpusforsurveillance. Thevideosinthissetarecaptured with a stationary camera, mounted a few meters above the ground and looking down towards an outdoor street. The frame size is 720£480 pixels and the sampling rate is 30 FPS. We use five sequences as the burn-in set and the other five as the test set. Figure 6.2 shows some typical frames of the CAVIAR set and the CLEAR-VACE set. (a) CAVIAR set (b) CLEAR-VACE set Figure 6.2: Example frames of the CAVIAR set and the CLEAR-VACE set. 6.3 Learning Seed Detectors Inthischapter,weonlyuseasimplifiedversionofourCBTmethodtotraintheseedpart detectors. We set the maximum number of channels in the CBT method to one so that the output is a cascade structured classifier instead of a tree. This allows us to focus the formulation and description on the online learning aspect. In this simplified version, the splitting and retraining modules in the original CBT method become unnecessary and hence removed. Figure 6.3 shows the details of this simplified algorithm. 128 ² Given the initial sample set S =f(x i ;y i )g and a negative images set; ² Set the algorithm parameters: the maximum weak classifier number T, the pos- itive passing rates fP t g T t=1 , the target false alarm rate F, and the threshold for bootstrapping µ B ; ² Construct the weak classifier pool,H, from the edgelet features; ² Initialize the sample weights D 0 (i)= 1 kSk , the current false alarm rate F 0 =1, and t=0; ² while t<T and F t <F do 1. For each weak classifier h inH, compute its classification function by Equ.3.6; 2. Select h t to minimize the criterion defined by Equ.3.9; 3. Update sample weights by Equ.3.10 and normalize D t+1 to a p.d.f.; 4. Select the threshold b t for the partial sum H t , so that a portion of P t positive samples are accepted, and reject as many negative samples as possible; 5. Remove the rejected samples from the sample set. If the remaining negative samples are less than µ B percent of the original, recollect the negative samples by bootstrapping on the negative image set. ² Outputfh t ;b t g as the cascade classifier. Figure 6.3: Off-line learning algorithm for general seed classifiers. 6.4 Oracle by Combining Part Detectors Our combined detection method in Chapter 5 combines the responses from a set of part detectors to detect multiple, possibly inter-occluded humans in images. We design our oracle based on this combined detection algorithm. 
A part response is represented by a 4-tuple rp = fl;p;s;³g, where l is the part type, p is the image position, s is the size, and ³ is a classification confidence. For positive part responses, ³ is defined by ³ Δ =1¡exp µ ¡ P t h t (x) P t maxjh t j ¶ (6.1) 129 where x is the image patch of the part response, and maxjh t j is the maximum absolute value of h t . This is similar to the confidence of the regular ensemble classifiers [78]. For negative part responses, ³ is defined by ³ Δ = 1 e¡1 · 1¡exp µ 1¡ T pass T ¶¸ (6.2) where T is the overall number of weak classifiers in the cascade, and T pass is the number of weak classifiers the sample has passed. This negative confidence is designed based on the filtering property of cascade classifiers. Intuitively, the later the sample is rejected, the more similar it is to real objects. Acombinedresponseisrepresentedbythesetofitspartresponsesandtheirvisibility scores v, rc = frp i ;v i g i2Prt , where Prt is the set of part labels. For humans, Prt = fFB;HS;TS;Lg, where FB;HS;TS, and L represent full-body, head-shoulder, torso, and legs respectively. The visibility score v is obtained from the occlusion reasoning in the combined detection. 6.4.1 Positive sample collection Suppose we want to collect positive samples for part P 1 . We define the panel confidence of a part response rp P 1 in a combined response rc by ˜ ³ P 1 Δ = X i2Prt¡fP 1 g^v i >µv ³ i (6.3) where µ v is the visibility threshold in Section 5.3.2. The above confidence is called panel confidence, as it makes use of information from a set of part detectors; in contrast, we 130 call ³ the self confidence. The panel confidence of P 1 does not include the self confidence of P 1 , as we want to see the sample from different “views”. When the panel confidence ˜ ³ is larger than a threshold µ pos , we consider the sample confidently positive. We use two metrics to measure the performance of the oracle, precision and utility ratio. Note, we can not use recall rate here, as we do not have the ground-truth for all humans in the unlabeled data. Suppose there are N positive responses in total, after thresholding N u are kept for online learning, in which N c are good ones, then the precision and utility ratio are respectively defined by Pr Δ = N c N u ; Ur Δ = N u N (6.4) Figure 6.4 shows the curves of the two metrics with different µ pos for the full-body de- tection on the CAVIAR validation set. In our experiments, we set µ pos = 0:073, which results in a precision of 98% and a utility ratio of 20%. Figure 6.5 shows examples of good and bad positive samples. The positive samples with large panel confidence ˜ ³, but small self confidence ³ are more valuable to the learning procedure, because they are closer to the classification boundary. For efficiency consideration, we do not send the samples with large ³ to the learner, as they will not bring much change to the existing classifier. In practice, we set a cut-off threshold for positive samples, µ max . If ³ > µ max , we consider that this sample is already recognized well, and hence do not need it for updating. For image pattern recognition, the positive samples not only need to be labeled cor- rectly, but must also be spatially aligned. In our training set, the pedestrian samples are 131 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 10 20 30 40 50 60 70 80 90 100 Positive threshold Performance (%) Precision Utility Ratio Figure 6.4: Performance of the oracle for positive sample collection. 
(a) Good positive (b) Bad positive Figure 6.5: Examples of positive samples collected by the oracle. resized to 24£58 pixels, and the head and the feet positions are moved to the locations (12;5) and (12;54). However, the spatial accuracy of the samples cut from the combined responses is inadequate. In order to improve the spatial accuracy, we develop a simple automaticalignmentmethodbasedonlinearregression. Weobservethattheresponsesof theedgeletfeaturesinthecascadeencodestheinformationofthesilhouette. Thefirstfew features are especially meaningful (see Figure 5.14). We take the first 200 feature values 132 of the full-body detector as input and learn a linear regressor to the image positions of the head and the feet, i.e. 2 6 6 6 6 6 6 6 6 6 6 4 head x head y feet x feet y 3 7 7 7 7 7 7 7 7 7 7 5 =R 4£200 2 6 6 6 6 6 6 4 f h 1 . . . f h 200 3 7 7 7 7 7 7 5 (6.5) 500 samples from the training set are used to learn this regressor. Table 6.1 lists the standard deviations of the head and the feet positions before and after the alignment. The errors of the y-coordinate are larger than those of the x-coordinate, because the height of human samples is more than twice the width. It can be seen that our simple alignment method can reduce the error by about half. Head x Head y Feet x Feet y Before alignment 0.9 2.5 1.1 2.4 After alignment 0.4 0.8 0.5 0.9 Table 6.1: Standard deviations of the head and the feet positions before and after align- ment. 6.4.2 Negative sample collection The collection of negative samples is similar to that of positive. Suppose we have a response rp P 1 from the detector of part P 1 , which does not correspond to any combined response. (Here we use the original response of the detector without any clustering algorithm, so that the response is exactly where the classifier output positive.) With high probability rp P 1 is a false alarm. Based on the spatial relation between parts, we 133 cutthepatchesoftheotherpartsfromtheresponserp P 1 andcalculatetheirclassification confidences ³ i . For robustness, we search around a small neighborhood for each part to obtain the highest ³ i . Intuitively, if rp P 1 is a false alarm, most of the ³ i should be negative. The panel confidence of rp P 1 is then calculated by Equ.6.3. If ˜ ³ P 1 is smaller than a threshold µ neg , we consider the sample to be confidently negative. Again we use precision and utility ratio to measure the performance of negative sample collection. Figure 6.6 shows the curves. In practice we set µ neg =¡2:88, which results in a precision about 96% and a utility ratio about 18%. −2.95 −2.9 −2.85 −2.8 −2.75 −2.7 0 10 20 30 40 50 60 70 80 90 100 Negative threshold Performance (%) Precision Utility Ratio Figure 6.6: Performance of the oracle for negative sample collection. As no quantitative analysis of the oracle is given in the previous work [62, 73, 40], it is difficult to directly compare our oracle with the previous methods. However, one advantage of our method is that we do not rely on motion segmentation, which is not robust due to illumination change, shadow, reflection, occlusion, etc. This also enables 134 our system to detect stationary humans. Our oracle can be seen as an extension of the co-trainingframework. Insteadoftwoclassifiersdancingtogether, wehaveagroupdance of multiple classifiers, which is much more reliable. 6.5 Online Boosting for Cascade Classifier With video sequences as input, a series of samples are collected by the oracle and then fed into an online learning algorithm. 
We integrate the real valued classification function of real AdaBoost [78], the noise restraining techniques of ORBoost [43] and AveBoost2 [66], and thelearningofcascadedecisionstrategyintotheonlineboostingalgorithm[67]. 6.5.1 Updating weak classifiers From the off-line training, we record the weak classifier poolH, and the weight distribu- tionW § . Givenanewsample(x ? ;y ? )anditscurrentweightw(theweightcomputationis describedlaterinSection6.5.2), weupdateW § ofaweakclassifierhandthenrecompute h by Equ.3.6. As W § is a histogram, online updating it is straightforward; formally, if f(x ? )2 · j¡1 n ; j n ¶ ;W j y ? =W j y ? +w® y ? (6.6) where f is the image feature h is based on, and ® y ? is the weight updating rate. In our experiments, we set ® + =10 ¡3 and ® ¡ =10 ¡5 , as negative samples are more redundant. Toachievesomevariabilityatthefeaturelevel, weconstructasmallneighborhoodC h ofh, based on its edgelet feature f. The shape affinity between two edgelets is defined by Equ.5.2. The top 10 features with the highest shape affinity with f are selected. Their 135 correspondingweakclassifiersformC h . Figure6.7showsthefeaturesintheneighborhood of the first weak classifier of the full-body detector. It can be seen that they cover a good variety. Given a new sample, we update all the weak classifiers in C h , and select the one that minimizes Equ.3.9. This optimization strategy is similar to the feature selection method in [32]; however, our neighborhood is local and much smaller, because it is constructed not randomly, but based on the shape affinity of edgelets. Figure 6.7: Features in the neighborhood of the first weak classifier of the full-body detector. 6.5.2 Updating sample weights The online boosting algorithm [67] imitates the weight evolution procedure of the off- line boosting. The weight updating strategy makes the learning procedure focus on the difficult instances, but this also makes the boosting algorithms susceptible to labeling errors[20]; thisisinevitableforunsupervisedlearning. Weintegratethenoiserestraining strategies of AveBoost2 [66] and ORBoost [43] into our online boosting algorithm. In the original real AdaBoost [78], the weights are updated by Equ.3.10. The ex- ponential increase makes the learner over-fit on noises very quickly. Oza [66] developed a boosting algorithm, called AveBoost2, in which the weight updating is smoothed by averaging the current weight with the previous one. It has been shown that AveBoost2 136 outperformsAdaBoostwithnoisyinput[66]. Wemodifytheiroff-linesmoothingstrategy so that D t+1 = ° w D t exp[¡y ? h t (x ? )]+D t ° w +1 (6.7) where ° w is a constant smoothing factor. In our experiments, we set ° w =10. Although the smoothing technique is used, the weights of mislabeled samples tend to keep growing during boosting. Karmaker and Kwek [43] developed an off-line boosting approach,ORBoost,inwhichacut-offthresholdµ outlier ,isusedasaceilingoftheweights. A sample is considered to be an outlier, if its weight grows larger than µ outlier (set to 10 in our experiments). We integrated this technique into our online boosting algorithm. When the weight of a new sample hits the threshold, the updating is stopped, and a “rollback” action is taken. Figure 6.8 shows a comparison between online learning with and without noise restraining on the CAVIAR set. It can be seen that the tendencies of the two curves are similar, but the curve with noise restraining is more smooth and outperforms the one without noise restraining. 
6.5.3 Updating cascade thresholds It is the cascade decision strategy that makes the detector very efficient. The series of thresholds,fb t g,learnedfromthegeneraltrainingset,maynotbeoptimalforaparticular application. Ifthebackgroundconcernedisrelativelyuncluttered, thethresholdingcould be more aggressive, so that we can achieve better efficiency; if the scene is cluttered, the thresholding should be more conservative, so that we can consider more information before making a decision. Online updating of the thresholds is necessary. 137 0 1000 2000 3000 4000 5000 6000 90 91 92 93 94 95 96 97 Number of burn−in samples Accuracy on burn−in samples (%) with noise restrianing without noise restrianing Figure6.8: Performanceofunsupervised,onlinelearningwithandwithoutnoiserestrain- ing strategies. WeinherittherequirementofpositivepassingratesfP t gfromtheoff-linelearning. In order to update the thresholds, we keep a small history of the positive samples collected, S + . Initially, S + is populated with positive samples in the off-line training set. At each time a new positive sample comes, it replaces an old sample randomly chosen in S + . For the positive passing rates fP t g that are less than 100%, we sort the values fH t (x i )jx i 2 S + g, and then choose the threshold. For the fP t g that are 100%, we maintain the minimumvalueoffH t (x i )jx i 2S + g. Inpractice,forefficiency,thethresholdsareupdated every 100 samples, and the histogram indices j of all (h t ;x i ) pairs are buffered so that the feature values f(x) only need to be computed once. 138 6.5.4 Adaptation of cascade complexity In the off-line boosting algorithm, the training procedure stops when a target false alarm rate is reached. This automatically determines the complexity, i.e. the number of weak classifiers. Inthepreviousonlineboostingalgorithm[67,32]thecomplexityoftheensem- bleclassifierisfixed. However,similartothesituationofdecisionstrategy,thecomplexity needstobeadaptedtotheparticularproblem. Forrelativelyeasyproblems,wemayhave a shorter cascade; for relatively difficult problems, we may have a longer one. We use the sample passing rate to measure the discriminative power of the cascade detector. When scanning an image, suppose there are N pass;t sub-windows passing the t-th partial sum H t . The sample passing rate of H t is then defined by r t Δ = N pass;t N pass;t¡1 (6.8) This passing rate reflects the contribution of the t-th weak classifier h t . If h t is very helpful, r t will be a small number; if h t brings little classification power, r t will be close to 1. We evaluate the sample passing rates of the seed detectors on the first 50 frames of the burn-in set. Suppose at the beginning, there are T weak classifiers in total, denoted by r T (0) the original sample passing rate of the whole cascade. During online learning, we continue updating all the sample passing rates r t . If after learning with i samples, there exists a r t (i), such that r t (i)>r T (0), we consider the weak classifiers, h t+1 ;:::;h T , unnecessary, and remove them from the cascade. Figure 6.9 shows the sample passing rates of the full-body detector on the CAVIAR burn-in set. It can be seen that after 139 online learning, the full-body detector becomes more aggressive so that we can remove some weak classifiers from its tail. 
0 500 1000 1500 2000 0 0.2 0.4 0.6 0.8 1 Number of weak classifiers Sample pass rates Original After online learning Figure 6.9: Sample passing rates of the full-body detector on the CAVIAR burn-in set before and after online learning. If after learning with i samples, r T (i) < r T (0), we consider the current cascade to be relatively weak, and add more weak classifiers to its end. Besides S + , we keep a history of the negative samples collected S ¡ , whose size is same as S + . Taking S + and S ¡ as training data, a number of new weak classifiers are selected fromH by the off-line boostingalgorithm. Figure6.10showsthesamplepassingratesofthelegsdetectoronthe CAVIAR burn-in set. It can be seen that after online learning, the legs detector becomes moreconservativesothatweneedtoaddsomeweakclassifierstothetail. Inpractice,the complexityadaptationisdoneevery1;000samples, andifincreasing, 300weakclassifiers are added as one segment each time. The addition of weak classifiers is not an online 140 procedure. It is done in an off-line batch processing way. In our implementation, we suspend the online learning when adding new weak classifiers. 0 500 1000 1500 2000 2500 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Number of weak classifiers Sample pass rates Original After online learning Figure 6.10: Sample passing rates of the legs detector on the CAVIAR burn-in set before and after online learning. Wenowputallthecomponentstogether. Givenanewsamplecollectedbytheoracle, it is sent through the current cascade classifier. The weak classifiers are updated and the sample weights are modified accordingly. For efficiency, the thresholds of the cascade are updated every 100 samples and the complexity of the cascade is adjusted every 1,000 samples. Figure 6.11 gives the full online boosting algorithm for the cascade classifier. In ourexperiments,theonlinelearningprocedurestopsafteralltheframesintheburn-inset are processed. In real applications, if there is no burn-in period, we can keep the online learning procedure running forever, especially when the conditions of the environment are changing due to illumination, weather, and so on. If the environmental conditions 141 are constant, the online learning could be stopped when samples collected by the oracle becomes very rare. 6.6 Experimental Results We apply our method to the problem of pedestrian detection and evaluate our system on the CAVIAR set [12] and the CLEAR-VACE set [14]. The parameters of the oracle and the learner are determined based on the analysis on the validation set from the CAVIAR corpus. The burin-in set, the test set, and the validation set have no overlap. 6.6.1 Results on the CAVIAR set We update the four seed part detectors online with the 10 burn-in sequences of the CAVIAR set. For comparison, we manually label 4;000 samples from the burn-in set and collect800negativeimagesforindoorenvironmentsfromtheInternettoformaspecialized and“clean”trainingset,fromwhichfourspecializedpartdetectorsarelearnedbyoff-line boosting. The performance of the general seed detectors, the online updated detectors, and the specialized detectors are evaluated on the 10 test sequences of the CAVIAR set. Table 6.2 shows the comparisons on performance and complexities of the detectors. The first group of three rows is for the generic seed detectors; the second group of three rows is for the online updated detectors; the third group of three rows is for the true specialized detectors. 
It can be seen that by online learning, both the part detectors and the combined detector, which serves as the oracle, are improved greatly. The average detection rate of the individual part detectors is increased by 20:0% while the false alarm 142 ² Inherit from the off-line boosting procedure: the seed cascade detectorfh t ;b t g, the weak classifier pool H, the weight distribution W § , the neighborhood fC t g, the positive passing ratefP t g, and the training set S; ² Set the algorithm parameters: the updating rate for positive/negative samples ® § , the smoothing rate ° w , and the cut-off threshold µ outlier of weight updating ; ² Compute the sample passing ratefr t g from the first 50 frames of the burn-in set; ² Initialization: populate the sample history S + and S ¡ by S, and i=0. ² For all frames in the burn-in set do – Get a new frame, from which use the oracle to obtain a number of samples, (x ? i+1 ;y ? i+1 );:::;(x ? i+m ;y ? i+m ); – Update the sample passing ratefr t g; – For all the samples collected from this frame do ¤ i++; ¤ Initialize sample weight D 0 =1; ¤ For t=0 to T ¡1 do, where T is the size of the current cascade 1. Updatetheweightdistributionofeveryh2C t byEqu.6.6, andrecom- pute the classification function by Equ.3.6; 2. Find the weak classifier inC t that minimizes the criterion in Equ.3.9; 3. Compute sample weight D t+1 by Equ.6.7; 4. If D t+1 >µ outlier , break updating and rollback; 5. Add x ? i to S y ?. If i mod 100 = 0, update fb t g according to S + and fP t g; 6. If i mod 1000 = 0, adapt the complexity of the cascade according to fr t g, and update T. ¤ Update the oracle. ² Outputfh t ;b t g as the updated cascade classifier. Figure 6.11: Online learning algorithm for cascade structured classifiers. 143 rate is reduced by 2:0 per frame. The combined detector’s detection rate is increased by 16:0% and its false alarm rate is reduced by 0:24 per frame. The only previous online learning approach, which reports quantitative results on the CAVIAR set, is that in [73]. Our updated combined detector has a recall rate of 94:2% and a precision of 97:2%, which is much better than the 60% recall and 85% precision in [73]. In the four part detectors, only the complexity of the legs detector increases; this maybe becausetheappearancevariationof legs islarger thanthe otherparts. The other three part detectors also have better performance than the legs detector. The specialized detectors can be seen as the upper limit of the online learning algorithm. Although the accuracies of our online updated detectors are comparable to the specialized ones, the specialized detectors use many fewer weak classifiers. Figure 6.12 shows some detection results on this set before and after online learning. FB HS Ts L Combined Original Detection Rate(%) 74.9 68.7 69.7 53.5 78.2 False alarm per frame 0.7 3.9 3.9 3.7 0.3 # of weak classifiers 1800 2900 1000 1800 Updated Detection Rate(%) 91.2 85.7 87.5 80.9 94.2 False alarm per frame 0.2 0.8 1.1 1.9 0.06 # of weak classifiers 800 1400 800 2100 Specialized Detection Rate(%) 91.5 84.7 89.4 81.1 94.9 FAPF 0.4 0.3 0.5 0.4 0.03 # of weak classifiers 200 400 300 500 Table 6.2: Comparison of the generic and specialized detectors on the CAVIAR set. 6.6.2 Results on the CLEAR-VACE set For the second set from the CLEAR-VACE corpus [14], we update the seed detectors online with the 5 burn-in sequences, and evaluate on the 5 test ones. Table 6.3 shows 144 the comparison. This set is harder than the CAVIAR set, as the scene is more cluttered. 
It can be seen that although we achieve similar improvements on accuracy, more weak classifiers are needed for this set than for the CAVIAR set. Both the complexities of the legs detector and the torso detector increase after online learning. The average detection rate of the part detectors is increased by 19:7% and the false alarm rate is reduced by 2:4 per frame. The combined detector’s detection rate is increased by 14:7% and its false alarm rate is reduced by 0:22 per frame. Figure 6.13 gives some detection results on this set before and after online learning. 6.7 Conclusion and Discussion We propose an unsupervised, online learning approach to improve the performance of generic part detectors for particular applications. Our oracle, which is based on the combination of part detection responses, has very high precision and does not rely on motionsegmentation. Ouronlinelearner,whichisbasedontheonlineboostingalgorithm, is able to automatically adapt the local shape features, the weak classifiers, the cascade decisionstrategy,andthecomplexityoftheclassifier. Theexperimentalresultsshowthat our method can greatly improve the performance on a particular application of the seed detectors, which are learned from a small set of general samples, by learning on a large amount of unlabeled data. Currently, our oracle is designed based on the combined detector, which has a high precision but not a high recall rate. However, the detection based tracking methods 145 Figure 6.12: Examples of detection results before and after online learning on the CAVIAR set. (Yellow for full-body; red for head-shoulder; purple for torso; blue for legs; and green for combined) 146 FB HS Ts L Combined Original Detection Rate(%) 73.1 65.0 67.8 58.7 78.4 False alarm per frame 0.7 4.0 3.6 7.1 0.3 # of weak classifiers 1800 2900 1000 1800 Updated Detection Rate(%) 92.4 80.5 88.3 82.2 93.1 False alarm per frame 0.5 1.5 1.7 2.0 0.08 # of weak classifiers 1000 1800 1600 2400 Table6.3: ComparisonofthegenericandspecializeddetectorsontheCLEAR-VACEset. described later in Chapter 7 has both a high precision and a high recall rate. It could be a potential improvement for the oracle. Our online learning algorithm takes one sample each time. This is a special case of incrementallearning. Ifthe“online”featureisnotnecessary,wecouldcollectthesamples firstandthenupdatetheclassifierinabatchprocessingway,whichcouldbemoreefficient and robust. The proposed algorithm is easily changed to a batch processing style. In our current approach, the addition and removal of the weak classifiers only happen at the end of the cascade, based on the assumption that the earlier the weak classifier is selected, the stronger it is. But for a particular task, the order of classification powers could be different. It would be more flexible to allow the insertion or removal of weak classifiers in the middle of the cascade. The oracle noises can be reduced at the oracle part or restrained at the learner part. In our method, we chose to reduce the alignment noises by a linear regressor that is a part of the oracle. However, there are some existing learning algorithms that attempt to learn the pattern from poorly aligned samples, e.g. the MILBoost method by Viola et al. [98]. Such kinds of techniquescould behelpful toimprovethe robustness of unsupervised learning. 147 Figure 6.13: Examples of detection results on the CLEAR-VACE set before and after online learning. 
148 Chapter 7 Object Tracking by Associating Detection Responses Tracking objects in videos is one step beyond the detection problem, as the object iden- tities need to be maintained. Our object tracking methods are based on detection. First, the object detectors are applied frame by frame, then the detection responses at differ- ent frames are matched and linked together to form object trajectories. This chapter describes two trackers that focus on different aspects of this problem. 7.1 Tracking Occluded Objects based on Part Detection There are many sources of difficulty in performing the task of object tracking. One of them is occlusion. When there are multiple objects in the scene, they could inter-occlude one another; the target objects could also be occluded by other scene structures. The image appearance of the objects changes greatly due to occlusions (see Figure 7.1). It is more likely that the identities of objects are switched during tracking when they are close to one another. In this section, we describe a method that automatically tracks multiple, partially occluded objects. We take the class of pedestrians to demonstrate our approach. 149 Figure 7.1: Examples of appearance changes due to inter-object occlusions. 7.1.1 Motivation and Outline To track trough partial occlusions, we use a part based representation so that occlusions do not affect the entire description as they would in a holistic representation. Part-based representation has been used for human detection in a single image in some recent works, e.g. [59, 60], but these methods do not use the parts for tracking. In [118], a part-based representation is used for segmenting motion blobs by considering various articulations and their appearances, but parts are not explicitly tracked. Part tracking has been used totracktheposeofasinglehuman,e.g. [86,47],butnotlocationsofmultiplehumans. In ourapproach,wetracktheindividualdetectedpartsandthencombinetheirresponsesina combined tracker. Theadvantageofthisapproachcomesfromtheobservationthatunder partial occlusion conditions, some parts of the object remain visible and distinguishable, and can provide reliable cues for tracking. Of course, when the object is fully occluded, the tracks can only be inferred from the observables before and after. For an automatic multiple object tracking system, three main issues need to be ad- dressed: 1) when to initialize a trajectory; 2) how to track an object; and 3) when to terminateatrajectory. Ourapproachreliesonsingleframehumandetectionresponsesto 150 answer these questions. We do not rely on background modeling, and hence our method does not require any special preprocessing for moving and/or zooming cameras. Figure 7.2 gives a schematic diagram of our part based tracking method. The object parts are detected in each frame, treated as a single image, to avoid the necessity of computing reliable motion blobs, as well as to be able to detect both static and moving humans. The responses from the static part detectors are taken as the input for the tracker. The detection system consists of several part detectors and a combined detector. Theperformanceofthecombineddetectorisbetterthanthatofanysinglepartdetectorin termsofthefalsealarmrate. However,thecombineddetectorperformsexplicitreasoning only for inter-object occlusions, while the part detector can work in the presence of both inter-object and scene occlusions. Figure 7.2: Schematic diagram of our part based tracking system. We track humans by data association, i.e. 
matching the object hypotheses with the detected responses, whenever the corresponding detection responses can be found. We match the hypotheses with the combined detection responses first, as they are more reliable than the responses of the individual parts. If this fails, then we try to associate 151 the hypotheses with the part detection responses. If this fails again, a mean-shift tracker [15] is used to follow the object. Most often, objects are tracked successfully by data association. The mean-shift tracker is occasionally utilized for short periods. As the observations for tracking are strong in our method, we do not utilize statistical sampling techniques as in some of the previous work, e.g. [118, 87, 39]. We initialize a trajectory when evidence from new observations can not be explained by the current hypotheses, as is also the case in many previous methods [118, 87, 71, 39, 19]. Similarly, a trajectory is terminated when it is lost by the detectors for a certain period. 7.1.2 Part based Detection Module In the experiments of this section, we only use the top four nodes in the part hierarchy described in Chapter 5. For each part, a tree structured classifier is learned by boosting edgelet features. The positive training samples cover multiple viewpoints. The output of the detection system consists of three levels. The first level is a set of the original responses of the detectors. In this set, one object may have multiple corre- sponding responses (see Figure 7.3(a)). The second level is that of the merged responses, which are the result of applying an agglomerative clustering algorithm to the original responses. In the set of merged responses, one object has at most one corresponding response (see Figure 7.3(b)). The third level is that of the combined responses. One combined response has several matched part responses (see Figure 7.3(c)). The detection responses may not have a high spatial accuracy, because the training samples include some parts of the background regions in order to cover some position and size variations. 152 (a) original (b) merged (c) combined Figure 7.3: Output of the part based detection module. a) and b) are from the full-body detector; c) is from the combined detector (green for combined; yellow for full-body; red for head-shoulder; purple for torso; blue for legs). 7.1.3 Part based Tracking Module Both the original and the merged detection responses are part responses. For track- ing we add one more element to the part response representation in Section 6.4: rp = fl;p;s;v;³;cg, where the new elementc is an appearance model. The first five elements, l,p, s, v and ³, are obtained directly from the detection process. The appearance model c, is implemented as a color histogram; computation and update ofc is described later in Section7.1.3.3. Representationofacombinedresponseistheunionoftherepresentations of its parts, rc=frp i jl i 2Prtg. 7.1.3.1 Affinity for detection responses Objectpartsaredetectedframebyframe. Inordertodecidewhethertwopartresponses, rp 1 and rp 2 , of the same part type from different frames belonging to one object, an affinity measure is defined by A(rp 1 ;rp 2 )=A pos (p 1 ;p 2 )A size (s 1 ;s 2 )A appr (c 1 ;c 2 ) (7.1) 153 whereA pos , A size , andA appr areaffinitiesbased onposition, size, and appearance respec- tively. 
Their definitions are A pos (p 1 ;p 2 )=° pos exp h ¡ (x 1 ¡x 2 ) 2 ¾ 2 x i exp h ¡ (y 1 ¡y 2 ) 2 ¾ 2 y i A size (s 1 ;s 2 )=° size exp h ¡ (s 1 ¡s 2 ) 2 ¾ 2 s i A appr (c 1 ;c 2 )=B(c 1 ;c 2 ) (7.2) where B(c 1 ;c 2 ) is the Bhattachayya distance between two histograms, and ° pos and ° size are normalizing factors. The affinity between two combined responses, rc 1 and rc 2 , is the average of the affinities between their common visible parts A(rc 1 ;rc 2 )= P l i 2Prt A(rp 1;i ;rp 2;i )I(v 1;i ;v 2;i >µ v ) P l i 2Prt I(v 1;i ;v 2;i >µ v ) (7.3) whererp j;i is the response of the part i of the combined responserc j , v j;i is the visibility scoreofrp j;i ,j =1;2, andI isanindicatorfunction. Theaboveaffinityfunctionsencode the position, size, and appearance information. Giventheaffinity,wematchthedetectionresponseswiththeobjecthypothesessimilar tothatofmatchingpartresponsestoobjecthypothesesinSection5.3.4. Supposeattime t of an input video, we have n object hypotheses, and at time t+1 we have m responses. We first compute the m£n affinity matrix A of all the hypothesis-response pairs, i.e. A(i;j) is the affinity between the i-th hypothesis and the j-th response. The Hungarian algorithm [46] is then applied to A to find the best match. 154 7.1.3.2 Trajectory initialization The basic idea of the initialization strategy is to start a trajectory when enough new evidence is collected from the detection responses. Define the precision, pr, of a detector as the ratio between the number of successful detections and the number of all positive responses. Ifprisconstantforallframes,andthedetectionproceduresindifferentframes are fully independent, then during consecutive T time steps, the probability that the detector outputs T consecutive false alarms is P FA = (1¡pr) T . However, this inference is not accurate for real videos, where the inter-frame dependence is not negligible. If the detector outputs a false alarm at a certain position in one frame, the probability is high that a false alarm will appear around the same position in the next frame. We call this the persistent false alarm problem. Even here, the real P FA should be an exponentially decreasing function of T, and we model it as e ¡¸ init p T . Suppose we have found T(> 1) consecutive responses, frc 1 ;:::;rc T g, corresponding tooneobjecthypothesisH bydataassociation. Theconfidenceofinitializingatrajectory for H is then defined by InitConf(H;rc 1::T )= 1 T ¡1 T¡1 X t=1 A(b rc t+1 ;rc t+1 ) | {z } (1) ¢ ³ 1¡e ¡¸ init p T ´ | {z } (2) (7.4) where b rc t+1 is the prediction of rc t at the next frame. The first term in the left side of Equ.7.4 is the average affinity of the T responses, and the second term is based on the combined detector’s accuracy. The more accurate the combined detector is, the larger the parameter ¸ init . Our trajectory initialization strategy is: if InitConf(H) is 155 larger than a threshold, µ init , a trajectory is started from H, and H is considered to be a confident trajectory; otherwise H is considered to be a potential trajectory. In our experiments, ¸ init = 1:2, and µ init = 0:83. A trajectory is represented as a triple, ffrc t g t=1;:::;T ;D;fC i g i2Prt g, wherefrc t g is a series of responses,fC i g is the appearance model of the parts, and D is a dynamic model. In practice, C i is the average of the appearance models of all detection responses of part i, and D is modeled by a Kalman filter for constant speed motion. 
7.1.3.3 Trajectory growth After a trajectory is initialized, an object is tracked by two strategies: data association and mean-shift tracking. Given a new frame, for all existing hypotheses, we first look for theircorrespondingdetectionresponses. Ifthereisanewdetectionresponsethatmatches a hypothesis H, then H grows by data association; otherwise a mean-shift tracker is applied. The data association itself has two steps. First, all the hypotheses are matched with the combined responses by the method described in Section 7.1.3.1. Second, all the hypotheses that are not matched in the first step are associated with the remaining part responses that do not belong to any combined response. Matching part responses with hypotheses is a simplified version of the method for matching combined responses with hypotheses. InsteadoftheaffinitymeasureinEqu.7.3forcombinedresponses,theaffinity for part responses in Equ.7.1 is used. At least one part must be detected for an object to be tracked by data association. We do not associate the part responses with the tracks directly, because occlusion reasoning, completed before association from the detection 156 responses in the current frame is more robust than from the predicted hypotheses, which are not very reliable. Whenever data association fails (i.e. the detectors can not find the object or the affinity is low), a mean-shift tracker [15] is applied to track the parts individually. The results are then combined to form the final estimation. The basic idea of mean-shift is to track a probability distribution. Although the mean-shift technique is typically used to track a color based distribution, there is no constraint on the type of the distribution. In our method, we combine the appearance model C, the dynamic model D, and the detection confidence ³, to build a likelihood map that is then fed into the mean-shift tracker. A dynamic probability map, P dyn (u), whereu represents the image coordinates, iscalculatedfromthedynamicmodelD, seeFigure7.4(d). Denotetheoriginalresponses of one part detector at the frame j by frp j g, the detection probability map P det (u) is defined by P det (u)= X j:u2Reg(rp j ) ³ j +ms (7.5) where Reg(rp j ) is the rectangular image region corresponding to rp j , ³ j is a real-valued detection confidence of rp j , and ms is a constant corresponding to the missing rate (the ratio between the number of missed objects and the total number of objects). ms is cal- culated after the detectors are learned. If one pixel belongs to multiple positive detection responses, thenwesetthedetectionscoreofthispixelasthesumoftheconfidencesofall these responses; otherwise, we set the detection score as the average missing rate, which isapositivenumber. Thedetectionscorereflectstheobjectsaliencebasedonshapecues. 157 Note, the original responses are used here to avoid the effects of errors in the clustering algorithm (see Figure 7.4(e)). (a) (b) (c) (d) (e) Figure 7.4: Probability map for the mean-shift tracker: a) is the original frame; b) is the final probability map; c), d) and e) are the probability maps for appearance, dynamic and detection respectively. (The object concerned is marked by a red ellipse.) Let P appr (u) be the appearance probability map. As C is a color histogram (the dimension is 32£32£32 for r,g,b channels), P appr (u) is the bit value of C (see Figure 7.4(c)). 
To estimate C, we need the object to be segmented so that we know which pixels belong to the object; the detection response rectangle is not accurate enough for this purpose. Also, for articulated objects, it is difficult to build a constant segmentation mask. In[116],ZhaoandDavisproposeaniterativemethodforupperbodysegmentation toverifythedetectedhumanhypotheses. Here,weproposeasimplePCAbasedapproach. At the training stage, examples are collected and the object regions are labeled by hand (see Figure 7.5(a)). A PCA model is then learned from the training data (see Figure 7.5(b)). Suppose we have an initial appearance model C 0 , given a new sample (Figure 158 7.5(c)) we first calculate its color probability map from C 0 (Figure 7.5(d)), then use the PCA model as a global shape constraint by reconstructing the probability map (Figure 7.5(e)). The thresholded reconstruction map (Figure 7.5(f)) is taken as the final object segmentation, which is used to update C 0 . The mean vector, the first of Figure 7.5(b), is used to computeC 0 at the beginning. For each part, we learn a PCA model. Thought this segmentation method is far from perfect, it is very fast and adequate enough to update the appearance model. This simple segmentation method could be replaced by our boosted segmentor (described in Chapter 4) for further improvement. Combining P appr (u), P dyn (u), and P det (u), we define the image likelihood for a part at pixel u by L(u)=P appr (u)P dyn (u)P det (u) (7.6) Figure 7.4 shows an example of probability map computation. Before the mean-shift tracker is applied, inter-object occlusion reasoning is computed. Only the visible parts that are detected in the last successful data association are tracked. Only the models of the parts that are detected and not occluded are finally updated. Mean-shift tracking is not always performed and fused with association results, because the shape based detectors are much more reliable than the mean-shift tracker. 159 (a) (b) (c) (d) (e) (f) Figure 7.5: PCA based body part segmentation: a) training samples; b) eigenvectors. The top-left is the mean vector; c) original human samples; d) color probability map; e) PCA reconstruction; f) thresholded segmentation map. 7.1.3.4 Trajectory termination Thestrategyofterminatingatrajectoryissimilartothatofinitializingit. Ifnodetection responses are found for an object hypothesis H for consecutive T time steps, we compute a termination confidence of H by EndConf(H;rc 1::T )= à 1¡ 1 T ¡1 T¡1 X t=1 A(b rc t+1 ;rc t+1 ) ! ³ 1¡e ¡¸ end p T ´ (7.7) Note that the combined responses rc t are obtained from the mean-shift tracker, not from the combined detector. If EndConf(H) is larger than a threshold, µ end , hypothesis H is terminated; we call it a dead trajectory, or otherwise an alive trajectory. In our experiments, ¸ end =0:5, and µ end =0:8. 7.1.3.5 The combined tracker Wenowputtheabovethreemodules,trajectoryinitialization,tracking,andtermination, together. Figure 7.6 shows the full forward tracking algorithm (it only looks ahead). 160 Trajectory initialization has a delay; to compensate we also apply a backward tracking procedure, which is the exact reverse of the forward tracking. After a trajectory is initialized, it may grow in both forward and backward directions. Note, this is not the sameastheforward-backwardfiltering,aseachdetectionresponseisprocessedonlyonce, either in the forward or in the backward direction. 
In cases where no image observations are available, and the dynamic model itself is not strong enough to track the object, we keep the hypothesis at the last seen position until either the hypothesis is terminated or some part of it is found again. When full occlusion is of short duration, the person could be reacquired by data association. However, if full occlusion persists, the track may terminate prematurely; such broken tracks could be combined at a higher level of analysis; we have not implemented this feature. Asimplifiedversionofthecombinedtrackingmethodistotrackonlyasinglepart,e.g. the full-body. Later in Section 7.1.4.3, we show that the combined tracking outperforms the single part tracking. The combined tracking method is robust because: 1. Thecombinedtrackerusescombineddetectionresponses,whichhavehighprecision, to start trajectories. This results in a very low false alarm rate at the trajectory initialization stage. 2. The combined tracker attempts to find the corresponding part responses of an object hypothesis. The probability that at least one part detector finds the object is relatively high. 161 Let the set of hypotheses be S, initially S =Φ. For each time step t (denote by S t the set of all alive trajectories in S at time t) 1. Static detection: (a) Detect parts. Let the result set be RP t . (b) Combine part detection responses, including inter-object occlusion reasoning. Let the result set be RC t . (c) Subtract the parts used in RC t from RP t . 2. Data association: (a) Associate hypotheses in S t with combined responses in RC t . Let the set of matched hypotheses be S t1 . (b) Associate hypotheses in S t ¡S t1 with part responses in RP t . Let the set of matched hypotheses be S t2 . (c) Build a new hypothesis H from each unmatched response in RC t , and add H into S and S t . 3. Pure tracking: For each confident trajectory in S t ¡S t1 ¡S t2 , grow it by mean-shift tracking. 4. Model update: (a) For each hypothesis in S t1 +S t2 , update its appearance model and dynamic model. (b) For each potential trajectory in S t1 , update its initialization confidence. (c) For each trajectory in S t1 +S t2 , reset its termination confidence to 0. (d) For each trajectory in S t ¡S t1 ¡S t2 , update its termination confidence. Output all confident trajectories in S as the final result. Figure 7.6: Forward part based combined object tracking algorithm. 3. The combined tracker tries to follow the object by tracking its parts, either by data association or by mean-shift. This enables the tracker to work with both scene and inter-object occlusions. 4. The combined tracker takes the average of the part tracking results as the final human position. Hence, even if the tracking of one part drifts, the position of the human can still be accurately tracked. 162 7.1.4 Experimental Results We evaluated our pedestrians tracker on three video sets. The first set consists of the 26 “shopping center corridor view” sequences of the CAVIAR video corpus [12]. The second set, called the “skate board set”, is captured from a camera held by a person standing on a moving skate board. The third set, called the “building top set”, is captured from a camera held by a person standing on the top of a 4-story building looking down towards the ground. The camera motions in the skate board set include both translation and panning, while those of the building top set are mainly panning and zooming. The frame size of these two sets is 720£480 pixels and the sampling rate is 30 FPS. 
As the humans in the test videos include both frontal/rear and profile views, we use the tree structured detectors for multi-view pedestrians. We compare our results on the CAVIAR set with a previous system from our group [118]. We are unable to compare with others as we are unaware of published, quantitative results for tracking on this set by other researchers.

7.1.4.1 Track level evaluation criteria

To evaluate the performance of our system quantitatively, we define five track level criteria:

1. number of "mostly tracked" trajectories (more than 80% of the trajectory is tracked),
2. number of "mostly lost" trajectories (more than 80% of the trajectory is lost),
3. number of "fragments" of trajectories (a result trajectory that covers less than 80% of a ground-truth trajectory),
4. number of false trajectories (a result trajectory corresponding to no real object), and
5. the frequency of identity switches (identity exchanges between a pair of result trajectories).

Figure 7.7 illustrates these definitions. Although these five categories are not a complete classification, they cover most of the typical errors observed in our experiments.

Figure 7.7: Track level evaluation criteria.

7.1.4.2 Results on the CAVIAR set

The only previous tracker for which we have an implementation at hand is that of Zhao and Nevatia [118]. In this experiment, we compare our method with that of [118], which is based on background subtraction and requires a calibrated, stationary camera.

We test our method and that of [118] on the 26 CAVIAR sequences. The scene of this set is relatively uncluttered; however, the inter-object occlusion is intensive. Frequent interactions between humans, such as talking and shaking hands, make this set challenging for tracking. Our detectors require the width of humans to be larger than 24 pixels. In the CAVIAR set, there are 40 humans that are usually less than 24 pixels wide, and 6 humans that are mostly out of the scene. We mark these small humans and out-of-sight humans in the ground-truth as "do not care". Table 7.1 shows the scores. It can be seen that our method outperforms the method of [118] when the resolution is not very low. This comes from the low false alarm rate of the combined detector. Some example results on this set are shown in Figure 7.8. However, on the small humans our shape based method does not work well (the combined tracker tracks only one of the 40 small humans), while the motion based tracker gets 21 small humans mostly tracked. The better performance of the motion based tracker at low resolution is because the motion based method does not rely on a discriminative model.

  Method                    GT    MT    ML   Fgmt   FAT   IDS
  Zhao and Nevatia [118]    189   121    8     73    27    20
  This method               189   140    8     40     4    19

Table 7.1: Tracking performance of the combined tracker on the CAVIAR set. (GT: ground-truth; MT: mostly tracked; ML: mostly lost; Fgmt: trajectory fragment; FAT: false alarm trajectory; IDS: ID switch)

The tracking method also greatly improves the detection performance (without considering identity consistency). Table 7.2 gives the detection scores before and after tracking. We set the detection parameters to achieve a low false alarm rate.

                                          Detection rate (%)   False alarms per frame
  Before tracking   Full-body detector         70.32                  0.28
  Before tracking   Combined detector          57.91                  0.05
  After tracking                               94.11                  0.02

Table 7.2: Detection performance before and after tracking on the CAVIAR set.

Figure 7.8: Example results of our combined tracker on the CAVIAR set. (The first and the second rows are for one sequence; the third and the fourth for another.)
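The track level scores in Table 7.1 and the following tables rest on the 80% rule of Section 7.1.4.1. As a rough illustration of how that rule can be applied, the sketch below classifies one ground-truth trajectory by the fraction of its frames covered by matched result trajectories; representing trajectories as sets of frame indices is a simplifying assumption, and this is not the scoring code used to produce the tables.

```python
def classify_gt_trajectory(gt_frames, tracked_frames):
    """Label a ground-truth trajectory as mostly tracked / mostly lost / partial.

    gt_frames:      set of frame indices where the ground-truth object exists
    tracked_frames: subset of gt_frames covered by matched result trajectories
    """
    coverage = len(tracked_frames & gt_frames) / float(len(gt_frames))
    if coverage > 0.8:
        return "mostly tracked"
    if coverage < 0.2:        # i.e. more than 80% of the trajectory is lost
        return "mostly lost"
    return "partially tracked"

# example: an object visible in frames 0..99 and tracked in frames 10..95
print(classify_gt_trajectory(set(range(100)), set(range(10, 96))))  # mostly tracked
```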
7.1.4.3 Results on the skate board set

The main difficulties of the skate board set are small abrupt motions due to the uneven ground, and some occlusions. This set contains 29 sequences, 9,537 frames overall; only 13 of them have no occlusion at all. The combined tracking method is applied. Table 7.3 gives the tracking performance of this method. Some example results on this set are shown in Figure 7.9.

  GT   MT   ML   Fgmt   FAT   IDS
  50   39    1     16     2     3

Table 7.3: Tracking performance of the combined tracker on the skate board set. (See Table 7.1 for the abbreviations.)

For comparison, a single part (full-body) tracker, which is a simplified version of the combined tracker, is applied to the 13 videos that have no occlusions. Because the part detection does not have occlusion reasoning, it is not expected to work on the other 16 sequences. Table 7.4 shows the comparison results. It can be seen that the combined tracker gives fewer false alarms than the single part tracker, because the full-body detector has more persistent false alarms than the combined detector. The combined tracker also has more fully tracked objects, because it uses cues from all parts.

  Method              GT   MT   ML   Fgmt   FAT   IDS
  Part tracking       21   14    2      7    13     3
  Combined tracking   21   19    1      5     2     2

Table 7.4: Comparison between the single part tracker and the combined tracker on a subset of the skate board set. (See Table 7.1 for the abbreviations.)

Figure 7.9: Example results of our combined tracker on the skate board set. (The first and the second rows are for one sequence; the third and the fourth for another.)

7.1.4.4 Results on the building top set

The building top set contains 14 sequences, 6,038 frames overall. The main difficulty of this set is the frequency of occlusions, both scene and object (see Table 7.6). No single part tracker works well on this set. The combined tracker is applied to this set. Table 7.5 shows the tracking performance. It can be seen that the combined tracker produces very few false alarms and a reasonable success rate. Some example results on this set are shown in Figure 7.10.

  GT   MT   ML   Fgmt   FAT   IDS
  40   34    3      3     2     2

Table 7.5: Tracking performance of the combined tracker on the building top set. (See Table 7.1 for the abbreviations.)

7.1.4.5 Tracking performance with occlusions

We characterize the occlusion events in these sets with two criteria: if the occlusion is caused by a target object, i.e. a human, we call it an object occlusion; otherwise, a scene occlusion. If the occlusion lasts longer than 50 frames, it is considered a long term occlusion; otherwise, a short term one. Overall we have four categories: short term scene, long term scene, short term object, and long term object occlusions. Table 7.6 shows the tracking performance on occlusion events. Tracking success for an occlusion event means that no object is lost, no trajectory is broken, and no ID switch occurs during the occlusion. It can be seen that our method works reasonably well in the presence of scene and object occlusions, even those that are long term. The performance on the CAVIAR set is not as good as that on the other two sets, because 19 out of 96 occlusion events in the CAVIAR set involve full occlusion (more than 90% of the object is occluded), while the occlusions in the other two sets are all partial.

Figure 7.10: Example results of our combined tracker on the building top set. (The first and the second rows are for one sequence; the third and the fourth for another.)
  Video set                        SS     LS      SO      LO      Overall
  CAVIAR (Zhao & Nevatia [118])    0/0    0/0     40/81   6/15    46/96
  CAVIAR (this method)             0/0    0/0     47/81   10/15   57/96
  Skate board                      6/7    2/2     11/16   0/0     19/25
  Building top                     4/7    11/13   15/18   4/4     34/42

Table 7.6: Frequency and tracking performance on occlusion events; n/m means n successfully tracked among m occlusion events. (SS: short scene; LS: long scene; SO: short object; LO: long object)

For tracking, on average, about 50% of the successful tracking is due to data association with combined responses, i.e. the object is "seen" by the combined detector; about 35% is due to data association with part responses; the remaining 15% comes from the mean-shift tracker. Although the detection rate of any individual part detector is not high, the tracking level performance of the combined tracker is significantly better. The speed of the entire system is about 1 FPS. The machine used is a 2.8GHz 32-bit Pentium PC, and the program is coded in C++ using OpenCV functions. Most of the computational cost comes from the detection component. We do not tune the system parameters for different sequences; we basically have three sets of parameters, one for each video set. The parameters that differ are the search range of the 2-D human size, as the image size of humans in the CAVIAR set is much smaller than in the other two sets, and the parameters of the Kalman filter, as the image motion of humans with a moving/zooming camera is noisier than with a stationary camera.

7.1.5 Conclusion and Discussion

In this section, we proposed a part based automatic human tracking method. The responses of static combined human detection and body part detection are taken as the observations of the human hypotheses and fed into the tracker. Both trajectory initialization and termination are based on the evidence collected from the detection responses. To track the objects, data association works most of the time, while a mean-shift tracker fills in the gaps between data associations. The experimental results show that the proposed method has both a low false alarm rate and a high successful tracking rate. It works reasonably well under both partial scene and inter-object occlusions.

Our approach is compared with the method of [118] in cases where both methods work. Each of the two methods has its own advantages and disadvantages. The method of [118], which is based on a 3-D model and motion segmentation, is less viewpoint dependent and can work on low resolution videos, while our method, which is based on 2-D shape, requires a relatively high resolution and does not work well with a large camera tilt angle. However, our method, which is based on frame by frame detection, can work with moving and/or zooming cameras, while the method of [118] cannot.

Our current system does not use any cues from motion segmentation. When motion information is available, it should help improve the performance. For example, Brostow and Cipolla [11] recently proposed a method to detect independent motions in crowds; the outputs are tracklets of independently moving entities, which may facilitate object tracking. Conversely, shape based tracking can help improve motion segmentation.

7.2 Robust Tracking based on Soft Decision

In this section, we describe our second detection based tracking algorithm. The focus of the tracking method in Section 7.1 is on partial occlusions. The algorithm in this section focuses on another aspect of detection based tracking: the robustness of tracking with noisy observations from the detection module.
7.2.1 Motivation and Outline

In detection based tracking methods, the object detectors produce the observations for the trackers. However, the detectors are not perfect. Most existing detection methods output "hard decisions", i.e. binary valued predictions. Hard decisions extract high level information (for detection, the position and the size of the object) from raw data (images or videos). This greatly reduces the problem space for further processing; however, there is also a tradeoff between the detection rate and the false alarm rate. To achieve good tracking performance, we need both a low false alarm rate, to avoid false object trajectories, and a high detection rate, to track the objects most of the time. Although hard decisions at intermediate stages make the system less robust, we cannot avoid making any decisions: recovering trajectories directly from image sequences or from some real-valued probability field is not feasible.

To improve the robustness of object tracking, we propose a method that makes "soft decisions" at the detection stage by producing detection responses of different confidence levels, and that uses these confidence levels when associating the detection responses to form trajectories. We apply this approach to the problems of human detection and tracking in meeting videos and surveillance videos.

Figure 7.11 shows a schematic diagram of this method. The detection module, which outputs a set of responses with different confidence levels, consists of three classifiers: 1) an edgelet based boosted classifier with a cascade decision strategy, 2) an edgelet based boosted classifier with a single-threshold decision strategy, and 3) a HOG descriptor based SVM classifier [16]. In terms of computational cost, the single-threshold boosted classifier is the most expensive, as it requires evaluating hundreds of edgelet features before a decision is reached; the cascade boosted classifier is the fastest, as its cascade decision strategy rejects most negative samples in the very early stages of the cascade.

Figure 7.11: Schematic diagram of our soft decision based tracking method.

The order in which the classifiers are applied is determined by their computational costs: the faster a classifier, the earlier it is applied. Given a new frame, the cascade boosted classifier is first used to search for objects at all possible positions and of various sizes. The positive responses of this stage are sent to the SVM classifier for verification. When an existing trajectory cannot be associated with either a first or a second level detection response, the single-threshold boosted classifier is activated around the predicted position of the trajectory; this significantly reduces the search space of the single-threshold boosted classifier. In terms of accuracy, the outputs of the three classifiers can be seen as three points on an ROC curve: the first level has the lowest false alarm rate, but also the lowest detection rate; the third level has the highest detection rate, but also the highest false alarm rate. The three levels of detection responses together form a soft decision.

7.2.2 Soft Decision based Detection Module

We learn the edgelet based boosted classifiers and the HOG based SVM classifier [16] for an object class. We choose these two types of features because they are complementary. The edgelet features are designed to encode the local shape explicitly, but they are relatively sensitive to small transforms (translation, rotation, etc.).
The HOG descriptors encode the statistics of sub-regions and are robust to small transforms, but they do not maintain information about which pixels contribute to the histogram bins; very different shapes can result in the same histogram. As our focus in this section is not on occlusions, we only use a whole-object or a single part model.

7.2.2.1 Edgelet based boosted classifier

Edgelets are one type of local shape feature. For each edgelet in a large feature pool, one weak classifier is built to distinguish objects from background. The CBT method is then used to learn tree structured classifiers for multi-view objects.

During the training of the tree classifier, a cascade decision strategy is learned for each branch of the tree. For each node along a branch, a threshold is chosen to accept most positive samples while rejecting as many negative samples as possible. When an overall target false alarm rate is reached, the training procedure is terminated. Because of the cascade decision strategy, most sub-windows examined in the image can be discarded after computing only the first few features in the tree. Although this is an efficient strategy, it is aggressive in discarding negative samples. To obtain a decision with a high detection rate, we ignore the series of thresholds of the cascade decision strategy and learn an overall threshold for each branch of the tree. This threshold is chosen to accept all positive samples and reject as many negative samples as possible. However, as the decision is then made at the leaf nodes of the tree, all features in the classifier need to be computed to classify one sub-window. The classification results of the boosted classifier with the cascade decision strategy are the second level responses; the results of the single-threshold boosted classifier are the third level responses. Figures 7.12(b) and 7.12(c) show an example of each of these two levels for the problem of meeting room human (head-shoulder) detection.

7.2.2.2 HOG based SVM classifier

HOG descriptors, proposed by Dalal and Triggs [16], are another type of local shape feature. One HOG descriptor encodes the statistics of the edge orientation within a small neighborhood. Following [16], we learn SVM classifiers for object classes based on HOG descriptors. We use the HOG based SVM classifier as post verification for the edgelet based boosted classifier. During training, the false alarms and the successfully detected samples from the cascade boosted classifier are collected and used to learn the SVM classifier. During detection, the cascade boosted classifier is first applied, then its positive responses are sent to the SVM classifier for verification. The threshold of the SVM classifier is chosen to remove 90% of the false alarms of the cascade boosted classifier while accepting as many positive samples as possible. We use the SVM classifier as the verifier after the cascade boosted classifier because the boosted classifier is more efficient. The positive responses of the SVM classifier are the first level responses. Figure 7.12(a) shows an example of this level.

Figure 7.12: Examples of the three levels of detection responses for the problem of meeting room human detection and tracking: (a) level 1 responses; (b) level 2 minus level 1 responses; (c) level 3 minus level 2 responses.
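The sketch below summarizes how the three response levels described in this section could be produced from the two boosted decision strategies and the SVM verifier. The classifier objects, their call signatures, and the parameter names are hypothetical stand-ins for our detectors, used only to illustrate the data flow.

```python
def detect_three_levels(frame, cascade_boost, svm_verifier, single_threshold_boost,
                        predicted_boxes):
    """Produce the three confidence levels of the soft decision.

    cascade_boost(frame)               -> candidate boxes (cheap, aggressive rejection)
    svm_verifier(frame, box)           -> True/False (HOG + SVM post verification)
    single_threshold_boost(frame, roi) -> boxes found inside a region of interest
    predicted_boxes: predicted positions of trajectories not covered by levels 1
                     and 2 (level 3 is only searched around these).
    """
    level2 = cascade_boost(frame)                            # cascade decision strategy
    level1 = [b for b in level2 if svm_verifier(frame, b)]   # lowest false alarm rate
    level3 = []
    for roi in predicted_boxes:                              # restricted search space
        level3.extend(single_threshold_boost(frame, roi))    # highest detection rate
    return level1, level2, level3
```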
7.2.2.3 Discussion about detection module design

Our detection module is one implementation of the "soft decision" idea based on existing object detection techniques, though there are other choices. For example, we could remove the thresholding process from the detection stage and use some continuous probability distribution representation as the soft decision, instead of the discrete confidence levels. However, this would greatly increase the problem space of the tracking algorithm: directly searching for an unknown number of object trajectories in a real-valued spatio-temporal space is very difficult. Thresholding is also necessary for the cascade boosted classifier for efficiency.

To achieve multiple confidence levels, we could learn multiple thresholds for one classifier, instead of combining multiple heterogeneous classifiers. However, for a single classifier, the reasonable range of the threshold is likely to be small: enforcing a high detection rate may result in too high a false alarm rate, and enforcing a low false alarm rate may result in too low a detection rate.

To achieve a low false alarm rate, we could reduce the final target false alarm rate in the cascade classifier learning instead of applying an SVM classifier for post verification. However, learning a cascade boosted classifier includes a bootstrapping step for each layer of the cascade to collect new negative samples; when the current cascade is already operating at a very low false alarm rate, it is difficult to collect enough negative samples. While boosting requires a large training set to achieve good generalization, an SVM can function well with a relatively small sample set.

The third level responses are mainly for a high detection rate. This level could be replaced with the responses of some early stage of the cascade boosted classifier. However, we find that at the same detection rate, the false alarm rate of the single-threshold boosted classifier is much smaller than that of the early-stage output of a cascade classifier, because the former evaluates more features to make a decision.

The order in which the classifiers are applied is determined by their speeds. Our experiments show that the speed ratio between the cascade boosted classifier, the single-threshold boosted classifier, and the SVM classifier is about 1:10:200. We use the second level of responses, in between the first and the third levels, to reduce the search space of the third level for efficiency. The single-threshold boosted classifier is only applied around the predicted positions of the existing trajectories that cannot be tracked by association with the detection responses of the first and second levels.

7.2.3 Soft Decision based Tracking Module

Our tracking algorithm has three main components: trajectory initialization, termination, and growth. In these components, we use the detection responses of different confidence levels in different ways, based on their detection rates and false alarm rates.

7.2.3.1 Detection response association

Given a new frame from a video sequence, the boosted classifier with the cascade decision strategy is first applied, and then the positive responses are sent to the SVM classifier for verification. This process provides the first and second level responses. Similar to the combined tracker in Section 7.1, one detection response is represented by a 4-tuple, r = {p, s, c, ζ}. (As we do not use multiple parts in this section, the part label l and the visibility score v are not included here.) The classification confidence ζ is the weighted sum of all weak classifiers' outputs for the boosted classifier, or the distance to the classification boundary for the SVM classifier. The affinity between two detection responses is defined by Equ. 7.1. The Hungarian algorithm [46] is applied to the affinity matrix to find the best match between the hypotheses and the responses.
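To illustrate this association step, the sketch below builds an affinity matrix between predicted hypotheses and detection responses and solves the one-to-one assignment with the Hungarian algorithm; SciPy's linear_sum_assignment is used here for convenience, and the affinity callable and the gating threshold are hypothetical placeholders rather than the actual Equ. 7.1 implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(hypotheses, responses, affinity, min_affinity=0.5):
    """One-to-one matching of hypotheses to detection responses.

    affinity(h, r) -> value in [0, 1]; larger means a better match.
    Returns a list of (hypothesis_index, response_index) pairs.
    """
    if not hypotheses or not responses:
        return []
    A = np.array([[affinity(h, r) for r in responses] for h in hypotheses])
    # the Hungarian algorithm minimizes total cost, so negate the affinities
    rows, cols = linear_sum_assignment(-A)
    # discard assignments whose affinity is too low to be a plausible match;
    # the unmatched hypotheses and responses are handled by the later stages
    return [(i, j) for i, j in zip(rows, cols) if A[i, j] >= min_affinity]
```

The gating threshold min_affinity is an assumed device for rejecting implausible assignments before unmatched responses are considered for new trajectories.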
After association with the first level responses, all remaining hypotheses are matched with the second level responses by the same algorithm.

After association with the first and the second level responses, we apply the single-threshold boosted classifier around the predicted positions of the remaining unmatched hypotheses. This process gives the third level responses. The search space for this stage is much smaller than that for the first and the second level responses. As the third level responses may include many false alarms, we add an additional factor to the affinity function of Equ. 7.1 to reduce the matching ambiguity:

    A'(\hat{r}, r) = A(\hat{r}, r)\, A_{shape}(r)    (7.8)

where A_{shape}(r) = ζ is the classification confidence of r. All hypotheses unmatched with the first and the second level responses are matched with the third level responses. If after this stage there are still unmatched hypotheses, we use a mean-shift tracker [15] to track them. However, we find that this part is not critical for the whole system, as the detection rate of the third level responses is close to 100%; even if we leave the finally unmatched hypotheses at their original positions, there is no significant difference in performance.

7.2.3.2 Trajectory initialization and termination

The initialization and termination strategies are very similar to those of our part based combined tracker, but here we make use of the detection responses of multiple confidence levels. The unmatched first level detection responses are used to initialize new trajectories. If at one frame a first level detection response does not correspond to any existing trajectory, we start a new potential trajectory H from it. If in the succeeding T consecutive frames T first level responses are matched with H, we compute an initialization confidence of H. If the confidence is larger than a threshold, we say H becomes a confident trajectory. Because the false alarm rates of the second and the third level responses are relatively high, the unmatched second and third level responses are not used to start new trajectories.

The trajectory termination criterion is similar to that of initialization. If in T consecutive frames we cannot find a matched first or second level response for an object hypothesis, we compute a termination confidence. If the confidence is larger than a threshold, the trajectory is ended and we call it a dead trajectory, otherwise an alive trajectory.

The full detection and tracking algorithm is given in Figure 7.13. In practice, in order to compensate for the initialization delay, after a trajectory becomes confident we track the object in both forward and backward directions.

Let the set of object hypotheses be S, initially S = ∅. For each time step t (denote by S_t the set of all alive trajectories in S at time t):
1. Apply the cascade boosted classifier and the SVM classifier to frame t. Let the sets of the first and the second level responses be R_{t,1} and R_{t,2}.
2. Data association with the first and the second level responses:
   (a) Associate hypotheses in S_t with responses in R_{t,1}. Let the set of matched hypotheses be S_{t,1}.
   (b) Associate hypotheses in S_t - S_{t,1} with responses in R_{t,2}. Let the set of matched hypotheses be S_{t,2}.
   (c) Build a new hypothesis H from each unmatched response in R_{t,1}, and add H into S and S_t.
3. Apply the single-threshold boosted classifier around the predicted positions of all hypotheses in S_t - S_{t,1} - S_{t,2}. Let the set of the third level responses be R_{t,3}.
4. Data association with the third level responses:
   Associate hypotheses in S_t - S_{t,1} - S_{t,2} with responses in R_{t,3}. Let the set of matched hypotheses be S_{t,3}.
5. Pure tracking:
   For each confident trajectory in S_t - S_{t,1} - S_{t,2} - S_{t,3}, grow it by mean-shift tracking.
6. Model update:
   (a) For each hypothesis in S_{t,1} + S_{t,2}, update its appearance model and dynamic model.
   (b) For each potential trajectory in S_{t,1}, update its initialization confidence.
   (c) For each trajectory in S_{t,1} + S_{t,2}, reset its termination confidence to 0.
   (d) For each trajectory in S_t - S_{t,1} - S_{t,2}, update its termination confidence.
Output all confident trajectories in S as the final results.

Figure 7.13: Forward soft decision based object tracking algorithm.

7.2.4 Experimental Results

We apply this approach to two applications: human tracking in indoor meeting scenarios and in outdoor surveillance scenarios. We evaluate our system on two public video corpora and compare quantitatively with previous methods.

7.2.4.1 CLEAR evaluation metrics for tracking

Recently, there have been many efforts towards automatic, quantitative evaluation of tracking algorithms. To compare with previous methods, we use four metrics defined by the CLEAR 2007 evaluation protocol [14, 44] for 2-D multiple object detection and tracking tasks:

1. Multiple Object Detection Accuracy (MODA): the detection accuracy calculated from the numbers of false alarms and missed detections;
2. Multiple Object Tracking Accuracy (MOTA): the tracking accuracy calculated from the numbers of false alarms, missed detections, and identity switches;
3. Missed Detections Per Ground-Truth (MISS PGT): the number of missed objects normalized by the number of ground-truth objects, at the tracking level;
4. False Alarms Per Frame (FA PF): the number of false alarms per frame, at the tracking level.

To compute these metrics, at each frame the detected object hypotheses are matched with the ground-truth objects by the Hungarian algorithm [46]. The ground-truth objects that do not match any detection response are counted as missed detections, the number of which is denoted by m_t for the t-th frame, and the detection responses that do not match any ground-truth object are counted as false alarms, the number of which is denoted by fp_t. The MODA score is computed by

    \mathrm{MODA} = 1 - \frac{\sum_{t=1}^{N_{frames}} \left( c_m(m_t) + c_f(fp_t) \right)}{\sum_{t=1}^{N_{frames}} N_G^{(t)}}    (7.9)

where N_{frames} is the number of frames, N_G^{(t)} is the number of ground-truth objects in the t-th frame, and c_m and c_f are penalty functions for missed detections and false alarms respectively.
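As a small worked example of Eq. 7.9, the sketch below computes MODA from per-frame counts under the simplifying assumption that the penalty functions c_m and c_f are the identity (each miss and each false alarm costs 1); this is only an illustration of the formula, not the CLEAR scoring software.

```python
def moda(misses, false_alarms, num_ground_truth):
    """MODA of Eq. 7.9 with c_m and c_f taken as identity penalties.

    misses[t], false_alarms[t], num_ground_truth[t] are per-frame counts.
    """
    total_errors = sum(m + fp for m, fp in zip(misses, false_alarms))
    total_gt = sum(num_ground_truth)
    return 1.0 - float(total_errors) / total_gt

# 3 frames with 2 ground-truth objects each; one miss and one false alarm overall
print(moda([1, 0, 0], [0, 1, 0], [2, 2, 2]))  # 1 - 2/6 = 0.666...
```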
Hence, welearnahead-shoulder detector and track the head-should parts for this application. The training/testing data are from the NIST meeting corpus [28]. The training set contains about 2,500 positive samplesforfrontal/rearview,3,000positivesamplesforprofileview,and650background images of indoor scenes without humans. The positive samples are normalized to 36£24 pixels. The test set contains 50 sequences, about 204,000 frames overall, with a frame rate of 30 FPS and a frame size of 720£480 pixels. The sizes of humans vary from 120 pixel wide to 300 pixel wide. The training set and the test set are captured in the same 184 meeting room with the same camera setting, but with different sitting arrangements and different attendants. This set was used in the VACE 2005 evaluation [44]. First, we evaluate the detection rates and the false alarm rates of the three detection levels. 200 frames are randomly selected from the test videos and sent to our detection module. Note,toevaluatetheactualdetectionperformance,weapplythesingle-threshold boosted classifier to the whole image. Table 7.7 shows the scores of the three levels. It can be seen that the three levels represent different favors in the decision tradeoff. Level 1 Level 2 Level 3 Detection rate (%) 75.65 88.19 99.95 False alarm per frame 0.58 2.53 45.20 Table 7.7: Detection performance of the three detection levels on the NIST meeting videos. Second, we compare the end-to-end performance of this method with one of our previous methods described in [109], which is also a detection based tracking algorithm. However,thedetectionmodulein[109]outputsonlyonelevelofdecision,whoseaccuracy is similar to that of the second level responses in the soft decision. Table 7.8 lists the scores of the two methods. It can be seen that the current method achieves about 7.3% higher MOTA than the old method in [109], and the current system produces both fewer false alarms and fewer missed detections. Figure 7.14 shows some example results on this set. MODA MOTA MISS PGT FA PF Hard decision based tracker [109] 0.7142 0.7139 0.1680 0.2334 Soft decision based tracker 0.7815 0.7870 0.1011 0.1978 Table 7.8: Tracking performance on the NIST meeting videos. 185 Figure 7.14: Example results of the soft decision based tracking on the NIST meeting videos. (Thefirstandsecondrowsareforonesequence; thethirdandfourthforanother.) 7.2.4.3 Human tracking in surveillance videos For street surveillance scenarios, we learned full-body detectors for pedestrians. The training/testing data are from the CLEAR-VACE surveillance corpus [14] and our own collection. The training set contains about 1,700 positive samples for frontal/rear view, 186 1,120 positive samples for profile view, and 500 background images of outdoor street scene without humans. The positive samples are normalized to 24£58 pixels. The test set contains 50 sequences, overall about 121,000 frames, with a frame rate of 30 FPS and a frame size of 720£480 pixels. This set was used in the CLEAR 2006 and 2007 evaluations [14]. The sizes of humans vary from 10 pixel wide to 80 pixel wide. However, as our detectors do not work on very low resolution, we modify the original ground-truth to label the humans smaller than 20£50 pixels as “don’t care”. We compare the performance of this method with one of our previous systems de- scribed in [112], which is a detection based tracking algorithm with only one level of detection responses. Table 7.9 lists the scores of the two systems. 
It can be seen that the current method achieves about 8.8% higher MOTA than the old method in [112]. Note, the scores in Table 7.9 are computed by excluding very small humans, and hence are not directly comparable with the scores reported in the CLEAR evaluation workshops [14], which include humans of all sizes. Figure 7.15 shows some example results from this set. MODA MOTA MISS PGT FA PF Hard decision based tracker [112] 0.5265 0.5252 0.4231 0.0381 Soft decision based tracker 0.6157 0.6137 0.3147 0.0697 Table 7.9: Tracking performance on the CLEAR-VACE surveillance videos. The speed of the entire system is about 1 FPS for the meeting room human tracking task, and 0.4 FPS for the surveillance human tracking task. The speed of our three level detectionmoduleissimilartothatofthesingleleveldetectionmodule[109,112],because the expensive detectors for the first and third levels are only applied to a small search 187 space. Our experimental machine is a 64-bit Intel Xeon 3.6GHz CPU PC with 4G RAM; the program is coded in C++ using OpenCV functions without any parallel computing. Figure 7.15: Example results of the soft decision based tracking on the CLEAR-VACE surveillance videos. (The first and second rows are for one sequence; the third and fourth for another.) 188 7.2.5 Conclusion and Discussion In this section, we proposed a fully automatic object tracking system based on object detection with soft decision. Tracking is performed by associating detection responses of multiple confidence levels. Experimental results show that our method is more robust than the tracking algorithms based on detection with only a single decision strategy. The part based tracker in Section 7.1 can be considered a two level version of the soft decision based tracker: the combined detector is the first level, and the part detectors are the second level. In this section, we use even more levels explicitly. The two tracking methods are complementary and can be combined for further improvement. 189 Chapter 8 Conclusion and Future Work In this thesis, we presented a framework to detect, segment, and track multiple objects with possible, partial occlusions. We model objects as an assembly of several parts. For each part, a tree structured classifier is learned by boosting simple shape feature based weak classifiers. The part detection responses are combined to detect and track objects with partial occlusions. The proposed approach is applied to the class of pedestrians. An automatic detection and tracking system has been implemented and quantitatively evaluated on a number of challenging examples. Some modules of the system are also applied to the class of cars. The experimental results show that our method achieves the state-of-the-art performance on un-occluded objects and outperforms the existing methods on partially occluded objects. There are a few directions that are interesting to explore starting from the thesis work. Motion based feature Our current detection method is mainly based on shape cues. It does not need background substraction to provide observations. However, when 190 motion information can be extracted reliably, it should be helpful to improve de- tection performance. There are some previous methods that extend shape based features to include motion cues, e.g. [97] and [17]. Similar extensions can be ap- plied to our edgelet feature. However, one main disadvantage of the motion based methods is the loss of stationary objects. 
Global association for tracking Our current tracking method only allows detection responses at consecutive frames to be associated. This greedy association method is sensitive to temporal observation missing. For example, when a human is fully occluded by a tree even for short time, the trajectory is likely to be broken. Some global association methods, e.g. [51], can be used to improve the tracking perfor- mance. Use of multiple camera Our current system takes video sequences captured from a single camera. For some applications, e.g. meeting room environments, a multi- camera setting is possible. When the objects can be observed from multiple view- points at the same time, additional evidence is available to make the decision. The multi-camera integration can happen at different levels of the method: the feature level, the detection level, or the tracking level. Integration with scene analysis On one hand, some high level knowledge, i.e. the 3-D scene structure, can be used to improve detection and tracking performance; on the other hand, detection and tracking results can be used to infer the scene structure. These two tasks can be solved together in an iterative, interacting way. There is already existing work exploring this direction, e.g. [33] and [48] 191 References [1] S.Agarwal,A.Awan,andD.Roth.LearningtoDetectObjectsinImagesviaaSparse, Part-based Representation. IEEE Trans. on PAMI, 26(11):1475-1490, 2004. [2] H. Altincay, and M. Demirekler. Post-processing of Classifier Outputs in Multiple Classifier Systems. Lecture Notes in Computer Science, LNCS 2364, Springer Verlag, pp. 159-168, 2002. [3] S. Avidan. Ensemble Tracking. CVPR 2005. [4] M.-F. Balcan, A. Blum, and K. Yang. Co-Training and Expansion: Towards Bridging Theory and Practice. NISP 2004. [5] H. G. Barrow, J. M. Tenenbaum, R. C. Bolles, and H. C. Wolf. Parametric Corre- spondence and Chamfer Matching: Two New Techniques for Image Matching. IJCAI 1977. [6] W. M. Boothby. An Introduction to Differentiable Manifolds and Riemannian Geom- etry. Academic Press, 2002. [7] E. Borenstein, and J. Malik. Shape Guided Object Segmentation. CVPR 2006. [8] L. Bourdev, and J. Brandt. Robust Object Detection via Soft Cascade. CVPR 2005. [9] M. Bray, P. Kohli, and P. Torr. POSECUT: Simultaneous Segmentation and 3D Pose Estimation of Humans using Dynamic Graph-Cuts. ECCV 2006. [10] J.E.Bresenham.AlgorithmforComputerControlofaDigitalPlotter.IBMSystems Journal, 4(1):25-30, 1965. [11] G. J. Brostow, and R. Cipolla. Unsupervised Bayesian Detection of Independent Motion in Crowds. CVPR 2006. [12] CAVIAR video corpus, http://homepages.inf.ed.ac.uk/rbf/CAVIAR/ [13] CHILL project, http://chil.server.de/servlet/is/101/ [14] CLEAR’06EvaluationandWorkshop,http://isl.ira.uka.de/clear06/?Evaluation Tasks [15] D. Comaniciu, V. Ramesh, and P. Meer. The Variable Bandwidth Mean Shift and Data-Driven Scale Selection. ICCV 2001. 192 [16] N. Dalal, and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR 2005. [17] N. Dalal, B. Triggs, and C. Schmid. Human Detection Using Oriented Histograms of Flow and Appearance. ECCV 2006. [18] J. Davis, and M. Goadrich. The Relationship Between Precision-Recall and ROC Curves. ICML 2006. [19] L. Davis, V. Philomin, and R. Duraiswami. Tracking Humans from a Moving Plat- form. ICPR 2000. [20] T.Dietterich.AnExperimentalComparisionofThreeMethodsforConstructingEn- semblesofDecisionTrees: Bagging, BoostingandRandomization.MachineLearning, 40:139-158, 2000. [21] A. Ess, B. Leibe, and L. V. Gool. 
[22] M. Everingham, A. Zisserman, C. Williams, and L. V. Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. Technical report, 2006.
[23] L. Fei-Fei, R. Fergus, and P. Perona. One-Shot Learning of Object Categories. IEEE Trans. on PAMI, 28(4):594-611, 2006.
[24] P. Felzenszwalb. Learning Models for Object Recognition. CVPR 2001.
[25] Y. Freund, and R. E. Schapire. Experiments with a New Boosting Algorithm. ICML 1996.
[26] M. Fritz, B. Leibe, B. Caputo, and B. Schiele. Integrating Representative and Discriminative Models for Object Category Detection. ICCV 2005.
[27] A. Garg, S. Agarwal, and T. S. Huang. Fusion of Global and Local Information for Object Detection. ICPR 2002.
[28] J. Garofolo, C. Laprum, M. Michel, V. Stanford, and E. Tabassi. The NIST Meeting Room Pilot Corpus. Language Resource and Evaluation Conference, 2004.
[29] D. M. Gavrila. A Bayesian, Exemplar-based Approach to Hierarchical Shape Matching. IEEE Trans. on PAMI, 29(8):1408-1421, 2007.
[30] D. M. Gavrila. Pedestrian Detection from a Moving Vehicle. ECCV 2000.
[31] D. M. Gavrila, and V. Philomin. Real-Time Object Detection for Smart Vehicles. ICCV 1999.
[32] H. Grabner, and H. Bischof. Online Boosting and Vision. CVPR 2006.
[33] D. Hoiem, A. Efros, and M. Hebert. Putting Objects in Perspective. CVPR 2006.
[34] D. Hoiem, C. Rother, and J. Winn. 3D LayoutCRF for Multi-View Object Class Recognition and Segmentation. CVPR 2007.
[35] C. Huang, H. Ai, Y. Li, and S. Lao. Learning Sparse Features in Granular Space for Multi-View Face Detection. FG 2006.
[36] C. Huang, H. Ai, Y. Li, and S. Lao. Vector Boosting for Rotation Invariant Multi-View Face Detection. ICCV 2005.
[37] C. Huang, H. Ai, B. Wu, and S. Lao. Boosting Nested Cascade Detector for Multi-View Face Detection. ICPR 2004.
[38] Intel IPP library, http://www.intel.com/cd/software/products/asmo-na/eng/302910.htm
[39] M. Isard, and J. MacCormick. BraMBLe: A Bayesian Multiple-Blob Tracker. ICCV 2001.
[40] O. Javed, S. Ali, and M. Shah. Online Detection and Classification of Moving Objects Using Progressively Improving Detectors. CVPR 2005.
[41] M. Jones, and P. Viola. Fast Multi-View Face Detection. Technical Report, TR2003-96.
[42] A. Kapoor, and J. Winn. Located Hidden Random Fields: Learning Discriminative Parts for Object Detection. ECCV 2006.
[43] A. Karmaker, and S. Kwek. A Boosting Approach to Remove Class Label Noise. In the Fifth International Conference on Hybrid Intelligent Systems, 2005.
[44] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, M. Boonstra, and V. Korzhova. Performance Evaluation Protocol for Face, Person and Vehicle Detection & Tracking in Video Analysis and Content Extraction (VACE-II). CLEAR - Classification of Events, Activities and Relationships.
[45] H. Kruppa, M. Castrillon-Santana, and B. Schiele. Fast and Robust Face Finding via Local Context. Joint IEEE Int'l Workshop on VS-PETS, 2003.
[46] H. W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2:83-87, 1955.
[47] M. Lee, and R. Nevatia. Human Pose Tracking using Multi-level Structured Models. ECCV 2006.
[48] B. Leibe, N. Cornelis, K. Cornelis, and L. V. Gool. Dynamic 3D Scene Analysis from a Moving Vehicle. CVPR 2007.
[49] B. Leibe, A. Leonardis, and B. Schiele. Combined Object Categorization and Segmentation with an Implicit Shape Model. Workshop on Statistical Learning in Computer Vision, in conjunction with ECCV 2004.
[50] B. Leibe, E. Seemann, and B. Schiele. Pedestrian Detection in Crowded Scenes. CVPR 2005.
[51] B. Leibe, K. Schindler, and L. V. Gool. Coupled Detection and Trajectory Estimation for Multi-Object Tracking. ICCV 2007.
[52] B. Leung. Component-based Car Detection in Street Scene Images. Master's Thesis, EECS, MIT, 2004.
[53] A. Levin, P. Viola, and Y. Freund. Unsupervised Improvement of Visual Detectors using Co-training. ICCV 2003.
[54] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade. Tracking in Low Frame Rate Video: A Cascade Particle Filter with Discriminative Observers of Different Lifespans. CVPR 2007.
[55] Z. Lin, L. Davis, D. Doermann, and D. DeMenthon. Hierarchical Part-Template Matching for Human Detection and Segmentation. ICCV 2007.
[56] Y.-Y. Lin, and T.-L. Liu. Robust Face Detection with Multi-Class Boosting. CVPR 2005.
[57] D. G. Lowe. Object Recognition from Local Scale-Invariant Features. ICCV 1999.
[58] G. Medioni, M. S. Lee, and C. K. Tang. A Computational Framework for Segmentation and Grouping. Elsevier Science, 2000.
[59] C. Mikolajczyk, C. Schmid, and A. Zisserman. Human Detection based on a Probabilistic Assembly of Robust Part Detectors. ECCV 2004.
[60] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based Object Detection in Images by Components. IEEE Trans. on PAMI, 23(4):349-361, 2001.
[61] J. Mutch, and D. Lowe. Multiclass Object Recognition with Sparse, Localized Features. CVPR 2006.
[62] V. Nair, and J. Clark. An Unsupervised, Online Learning Framework for Moving Object Detection. CVPR 2004.
[63] K. Okuma, A. Taleghani, N. D. Freitas, J. J. Little, and D. G. Lowe. A Boosted Particle Filter: Multitarget Detection and Tracking. ECCV 2004.
[64] A. Opelt, A. Pinz, and A. Zisserman. A Boundary-Fragment-Model for Object Detection. ECCV 2006.
[65] A. Opelt, A. Pinz, and A. Zisserman. Incremental Learning of Object Detectors Using a Visual Shape Alphabet. CVPR 2006.
[66] N. Oza. AveBoost2: Boosting for Noisy Data. In the Fifth International Workshop on Multiple Classifier Systems, 2004.
[67] N. Oza, and S. Russell. Online Bagging and Boosting. International Workshop on Artificial Intelligence and Statistics, 2001.
[68] C. Papageorgiou, T. Evgeniou, and T. Poggio. A Trainable Pedestrian Detection System. Intelligent Vehicles, 1998.
[69] PASCAL challenge, http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[70] M. Pawan Kumar, P. Torr, and A. Zisserman. OBJ CUT. CVPR 2005.
[71] J. R. Peter, H. Tu, and N. Krahnstoever. Simultaneous Estimation of Segmentation and Shape. CVPR 2005.
[72] D. Ramanan, D. A. Forsyth, and A. Zisserman. Strike a Pose: Tracking People by Finding Stylized Poses. CVPR 2005.
[73] P. Roth, H. Grabner, D. Skocaj, H. Bischof, and A. Leonardis. Online Conservative Learning for Person Detection. VS-PETS 2005.
[74] H. Rowley. Neural Network-Based Face Detection. PhD Thesis, 1999.
[75] H. Rowley, S. Baluja, and T. Kanade. Neural Network-Based Face Detection. IEEE Trans. on PAMI, 20(1):23-38, 1998.
[76] P. Sabzmeydani, and G. Mori. Detecting Pedestrians by Learning Shapelet Features. CVPR 2007.
[77] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. The Annals of Statistics, 26(5):1651-1686, 1998.
[78] R. E. Schapire, and Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37:297-336, 1999.
[79] H. Schneiderman, and T. Kanade. A Statistical Method for 3D Object Detection Applied to Faces and Cars. CVPR 2000.
[80] E. Seemann, B. Leibe, and B. Schiele. Multi-Aspect Detection of Articulated Objects. CVPR 2006.
[81] Y. Shan, F. Han, H. Sawhney, and R. Kumar. Learning Exemplar-Based Categorization for the Detection of Multi-View Multi-Pose Objects. CVPR 2006.
[82] A. Shashua, Y. Gdalyahu, and G. Hayun. Pedestrian Detection for Driving Assistance Systems: Single-frame Classification and System Level Performance. IEEE Intelligent Vehicles Symposium 2004.
[83] V. D. Shet, J. Neumann, V. Ramesh, and L. S. Davis. Bilattice-based Logical Reasoning for Human Detection. CVPR 2007.
[84] J. Shotton, A. Blake, and R. Cipolla. Contour-Based Learning for Object Detection. ICCV 2005.
[85] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation. ECCV 2006.
[86] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking Loose-limbed People. CVPR 2004.
[87] K. Smith, D. G.-Perez, and J.-M. Odobez. Using Particles to Track Varying Numbers of Interacting People. CVPR 2005.
[88] S. Todorovic, and N. Ahuja. Extracting Subimages of an Unknown Category from a Set of Images. CVPR 2006.
[89] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing Features: Efficient Boosting Procedures for Multiclass Object Detection. CVPR 2004.
[90] Z. Tu. Probabilistic Boosting-Tree: Learning Discriminative Models for Classification, Recognition, and Clustering. ICCV 2005.
[91] Z. Tu, S.-C. Zhu, and H.-Y. Shum. Image Segmentation by Data Driven Markov Chain Monte Carlo. ICCV 2001.
[92] O. Tuzel, F. Porikli, and P. Meer. Human Detection via Classification on Riemannian Manifolds. CVPR 2007.
[93] VACE project, http://www.ic-arda.org/InfoExploit/vace/
[94] M. Varma, and D. Ray. Learning the Discriminative Power-Invariance Trade-off. ICCV 2007.
[95] J. Vermaak, A. Doucet, and P. Perez. Maintaining Multi-Modality through Mixture Tracking. ICCV 2003.
[96] P. Viola, and M. Jones. Rapid Object Detection Using a Boosted Cascade of Simple Features. CVPR 2001.
[97] P. Viola, M. Jones, and D. Snow. Detecting Pedestrians using Patterns of Motion and Appearance. ICCV 2003.
[98] P. Viola, J. Platt, and C. Zhang. Multiple Instance Boosting for Object Detection. NIPS 2005.
[99] P. Wang, and J. M. Rehg. A Modular Approach to the Analysis and Evaluation of Particle Filters for Figure Tracking. CVPR 2006.
[100] J. Winn, and N. Jojic. LOCUS: Learning Object Class with Unsupervised Segmentation. ICCV 2005.
[101] J. Winn, and J. Shotton. The Layout Consistent Random Field for Recognition and Segmentation of Partially Occluded Objects. CVPR 2006.
[102] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland. Pfinder: Real-time Tracking of Human Body. IEEE Trans. on PAMI, 19(7):780-785, 1997.
[103] B. Wu, and R. Nevatia. Cluster Boosted Tree Classifier for Multi-View, Multi-Pose Object Detection. ICCV 2007.
[104] B. Wu, and R. Nevatia. Detection and Tracking of Multiple, Partially Occluded Humans by Bayesian Combination of Edgelet based Part Detectors. International Journal of Computer Vision, 2007.
[105] B. Wu, and R. Nevatia. Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. ICCV 2005.
[106] B. Wu, and R. Nevatia. Improving Part based Object Detection by Unsupervised, Online Boosting. CVPR 2007.
[107] B. Wu, and R. Nevatia. Optimizing Discrimination-Efficiency Tradeoff in Integrating Heterogeneous Local Features for Object Detection. To appear in CVPR 2008.
[108] B. Wu, and R. Nevatia. Simultaneous Object Detection and Segmentation by Boosting Local Shape Feature based Classifier. CVPR 2007.
[109] B. Wu, and R. Nevatia. Tracking of Multiple Humans in Meetings. In V4HCI workshop, in conjunction with CVPR 2006.
[110] B. Wu, and R. Nevatia. Tracking of Multiple, Partially Occluded Humans based on Static Body Part Detection. CVPR 2006.
[111] B. Wu, R. Nevatia, and Y. Li. Segmentation of Multiple, Partially Occluded Objects by Grouping, Merging, Assigning Part Detection Responses. To appear in CVPR 2008.
[112] B. Wu, V. K. Singh, R. Nevatia, and C.-W. Chu. Speaker Tracking in Seminars by Human Body Detection. In CLEAR Evaluation Campaign and Workshop, in conjunction with FG 2006.
[113] B. Wu, X. Song, V. K. Singh, and R. Nevatia. Evaluation of USC Human Tracking System for Surveillance Videos. In CLEAR Evaluation Campaign and Workshop, in conjunction with FG 2006.
[114] Y. Wu, T. Yu, and G. Hua. A Statistical Field Model for Pedestrian Detection. CVPR 2005.
[115] B. Wu, L. Zhang, V. K. Singh, and R. Nevatia. Robust Object Tracking based on Detection with Soft Decision. In IEEE Workshop on Motion and Video Computing (WMVC) 2008.
[116] L. Zhao, and L. Davis. Closely Coupled Object Detection and Segmentation. ICCV 2005.
[117] T. Zhao, and R. Nevatia. Tracking Multiple Humans in Complex Situations. IEEE Trans. on PAMI, 26(9):1208-1221, 2004.
[118] T. Zhao, and R. Nevatia. Tracking Multiple Humans in Crowded Environment. CVPR 2004.
[119] Q. Zhu, S. Avidan, M.-C. Yeh, and K.-T. Cheng. Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. CVPR 2006.