EFFECTIVE INCREMENTAL LEARNING AND DETECTOR ADAPTATION METHODS FOR VIDEO OBJECT DETECTION

by

Pramod Sharma

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2014

Copyright 2014 Pramod Sharma

Acknowledgements

As Kahlil Gibran said, "The teacher who is indeed wise does not bid you to enter the house of his wisdom but rather leads you to the threshold of your mind." I have been very fortunate to have many such teachers at different points in my career, and I am thankful to all of them. In particular, I would like to thank my advisor, Prof. Ramakant Nevatia, for his constant guidance and encouragement. His expertise and knowledge have helped me greatly in formulating problems and exploring them from different angles. I am very thankful to Prof. Nevatia for giving me the personal freedom to explore and pursue research topics of my interest, and for supporting me in this endeavor.

I am also grateful to Prof. Gerard Medioni, Prof. C.-C. Jay Kuo, Prof. Laurent Itti and Prof. Suya You for being part of my qualifying committee, for taking the time to review my proposal and for providing valuable feedback.

I am fortunate to have had the opportunity to work with Dr. Chang Huang. He provided much-needed mentorship in my initial years at USC, and his guidance helped me immensely in learning the ropes of object detection. I am thankful to Vivek Kumar Singh and Prithviraj Banerjee for investing their precious time in long, deep discussions with me about several topics in computer vision and about graduate life in general.

I am grateful to all the members of the Computer Vision lab, especially Pradeep Natarajan, Remi Trichet, Yinghao Cai, Anustup Choudhary, Furqan Khan, Weijun Wang, Younghoon Lee, Matthias Hernandez, Sung Chun Lee and Arnav Agharwal, for meaningful discussions and for being wonderful colleagues. I gratefully acknowledge the funding sources that made my Ph.D. work possible.

I am fortunate to have many wonderful friends at USC who, even though they did not contribute directly to my research, have been an essential source of support and encouragement. They include Maheswaran Sathiamoorthy, Harshvardhan Vathsangam, Megha Gupta, Manish Jain, Kartik Audkhasi, Nitin Nair and Debotyam Maity.

Last but not least, I would like to express my gratitude to my parents and family, not just for their support, presence and encouragement but also for being a source of inspiration and strength over the years. Special thanks to my two beautiful nieces and a nephew: Sanskriti, Tanavya and Kavya.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Problem Statement
  1.2 Challenges
    1.2.1 Object appearance and complex backgrounds
    1.2.2 Variations in the object pose
    1.2.3 Noisy online samples
    1.2.4 Generalization of adaptation method
    1.2.5 Computational Efficiency
  1.3 Overview of our approach
  1.4 Thesis Outline

Chapter 2: Related Work
  2.1 Supervised Methods
  2.2 Semi Supervised Methods
  2.3 Unsupervised Methods
    2.3.1 Background subtraction based methods
    2.3.2 Information from multiple cues
    2.3.3 Generalized Adaptation Methods

Chapter 3: An Efficient Supervised Incremental Learning of Boosted Classifiers for Object Detection
  3.1 Introduction
  3.2 Incremental Learning
    3.2.1 Offline Loss Estimation
    3.2.2 Online Loss Estimation
    3.2.3 Solution for adjustment of combination coefficients
    3.2.4 Time Complexity
  3.3 Experiments
    3.3.1 Time Computation Performance
    3.3.2 Detection Performance
  3.4 Conclusion

Chapter 4: Unsupervised Incremental Learning for Improved Object Detection in a Video
  4.1 Introduction
  4.2 Related Work
  4.3 Outline of our Approach
  4.4 Supervised incremental learning of Boosted Classifiers
  4.5 Unsupervised Incremental Multiple Instance Learning for Weakly Labeled Online Data
    4.5.1 Online Sample Collection
    4.5.2 MIL Loss Function for Real Adaboost
    4.5.3 Incremental update of offline detector
  4.6 Experiments
    4.6.1 Detection Performance Evaluation
    4.6.2 Improvement in Tracking
  4.7 Conclusion

Chapter 5: Generic Detector Adaptation for Object Detection in a Video
  5.1 Introduction
  5.2 Related Work
  5.3 Overview
  5.4 Unsupervised Detector Adaptation
    5.4.1 Unsupervised Training Samples Collection
      5.4.1.1 Online Samples
      5.4.1.2 Pose Categorization
    5.4.2 Adaptive Classifier Training
  5.5 Experiments
    5.5.1 Computation Time Performance
    5.5.2 Detection Performance
      5.5.2.1 CAVIAR Dataset
      5.5.2.2 Mind's Eye Dataset
  5.6 Conclusion
Chapter 6: Generic Detector Adaptation with Multi Class Boosted Random Ferns for High Detection Performance
  6.1 Introduction
  6.2 Related Work
  6.3 Overview of our approach
  6.4 Online Sample Collection
  6.5 Adaptive Classifier Training
    6.5.1 Random fern classifier
    6.5.2 MBHBoost
    6.5.3 Multi class boosted random fern adaptive classifier
  6.6 Experiments
    6.6.1 Datasets and experimental setup
    6.6.2 Detection Results
  6.7 Conclusion

Chapter 7: Generic Detector Adaptation with Multi Instance Boosted Random Ferns for Robustness to Ambiguous Training Data
  7.1 Introduction
  7.2 Related Work
  7.3 Outline of our Approach
  7.4 Unsupervised online sample collection
    7.4.1 Positive and negative online samples
    7.4.2 Online bags
  7.5 Adaptive classifier training
    7.5.1 Random Ferns
    7.5.2 Multi Instance Random Ferns
    7.5.3 Training of Boosted Multi Instance Random Fern Adaptive Classifier
  7.6 Experiments
    7.6.1 Online Sample Collection Evaluation
    7.6.2 Adaptive Classifier Performance Evaluation
      7.6.2.1 Mind's Eye
      7.6.2.2 CAVIAR
      7.6.2.3 i-LIDS
      7.6.2.4 Skateboard
    7.6.3 Evaluation Summary
  7.7 Conclusion

Chapter 8: Conclusion and Future Work
  8.1 Generalization across different videos
  8.2 Noise tolerance with multiple instance learning techniques
  8.3 Integration with online tracking

Bibliography

List of Tables

1.1 Distribution of our proposed methods into different categories
4.1 Recall for different detectors when precision = 0.9
4.2 Tracking Results for i-LIDS Dataset
5.1 Precision improvement performance for CAVIAR dataset at recall 0.65
5.2 Best precision improvement performance on Mind's Eye Dataset. For ME1, precision values are shown at recall 0.97, whereas for ME2 recall is 0.6.
Ours-1: without sample categorization; Ours-2: with sample categorization.
6.1 Precision improvement of different approaches. BC: boosting with binary classification (positive samples not divided into different categories); MC: boosting with multi class classification. Corresponding recall values for the displayed precision values: Seq1: 0.7, Seq2: 0.98, Seq3: 0.65. The best performing method is highlighted in bold.
7.1 Average Precision performance

List of Figures

1.1 Examples of some detection results obtained using our approach.
1.2 Some detection results obtained from the offline trained detector. Red: false alarms; green: missed detections.
1.3 Overview of our proposed methods. We propose adaptation methods for a Real Adaboost based offline detector and for generic offline trained detectors. In both cases, we first propose a method which assumes that the collected online samples are correctly labeled, and then extend it to adaptive classifier training with noisy and ambiguous training data.
3.1 Overview of our approach
3.2 Comparison of the running time of our incremental learning approach with the approach described in [1]. Inc-T (T = 2, 10, 20, 50) represents a maximum of T iterations of incremental learning using the method described in [1].
3.3 ROC curves and regularization parameter
3.4 Detection results from the offline detector [2] (odd rows, red) and our approach (even rows, green). Our approach is able to recover the missed detections (marked in blue) and false alarms (marked in yellow) produced by the offline detector.
4.1 Example images from the i-LIDS [3] and TUD [4] datasets. Video sequences in these datasets have crowded environments and cluttered backgrounds.
4.2 Overview of our system
4.3 Examples of some online samples collected by the unsupervised method: missed detections (red rectangles) are collected as positive training samples; false alarms (blue rectangles) are collected as negative training samples.
4.4 Updating strategy at each iteration of incremental learning in a video sequence. D = offline detector, which is the same for all iterations; O_i = online samples for iteration i; I_i = incremental detector at iteration i.
4.5 Overfitting avoidance, illustrated on the i-LIDS AB Easy sequence. The x-axis represents the number of frames; the y-axis shows the recall rate for the corresponding number of frames. Inc1 = treating the offline detector as the base detector at each iteration of incremental learning; Inc0 = treating the incremental detector learned in the previous iteration as the base detector. Up to 1500 frames, Inc1 and Inc0 perform similarly; after 1500 frames, the performance of Inc0 keeps degrading because of over-fitting, whereas Inc1 avoids over-fitting and hence performs better than the offline detector.
4.6 ROC curves for detection results
4.7 Sample detection results on the i-LIDS and TUD Campus datasets. The first and third rows show detection results from the offline detector; the second and fourth rows show detection results from the incremental detector. Red arrows point at false alarms produced by the offline detector; blue arrows show detections of people who were missed by the offline detector but detected by the incremental detector.
5.1 Some examples from the Mind's Eye dataset [5]. This is a challenging dataset, as it has many different human pose variations.
5.2 Overview of our detector adaptation method
5.3 Examples of some of the positive (first row) and negative (second row) online samples collected in an unsupervised manner from the Mind's Eye [5] and CAVIAR [6] datasets.
5.4 Run-time performance of our approach. The x-axis represents the number of online samples used for classifier training; the y-axis is shown in log scale and represents runtime in seconds. RF-I-K: I random ferns with K binary features.
5.5 Recall-precision curves for detection results on the CAVIAR dataset
5.6 Recall-precision curves for detection results on the Mind's Eye dataset
5.7 Examples of some of the detection results when the baseline detector is applied at a low threshold (best viewed in color). Red: detection result classified as a false alarm by our method; green: detection result classified as a correct detection by our method. The identified category name is also specified for the Mind's Eye dataset (second row).
6.1 Overview of our system.
6.2 An example of a random fern weak classifier with four target classes. PSD1 = positive sample distribution for category 1; NSD1 = negative sample distribution for category 1. PSD1 holds the positive online samples belonging to category 1, and NSD1 holds all the negative online samples. The f_i are the binary features for this random fern. At a boosting iteration t, a random fern is selected and the h_t^k are computed for all target categories; h_t is the vector of the h_t^k.
6.3 A strong classifier vector is computed as an independent sum over the dimensions of the weak hypothesis vectors; for example, H^1 is computed as $H^1 = \sum_{t=1}^{T} h_t^1$. The final strong classifier vector is defined as the vector of independent strong classifiers.
6.4 Precision-recall curves for detection results on the Mind's Eye and CAVIAR datasets.
6.5 Some detection results on Seq1 and Seq2, respectively. Green rectangle: detection response obtained from the generic detector is verified as a true detection; red rectangle: detection response is verified as a false alarm. The first and third rows show detection results obtained from [7], whereas the second and fourth rows show detection results obtained from our method. Black arrows indicate instances where [7] does not verify false alarms correctly; orange arrows indicate instances where [7] verifies a true detection response as a false alarm.
7.1 Adaptive classifier training overview.
Green rectangle: positive online samples; red rectangle: negative online samples.
7.2 Example of obtained detection responses. Detection responses with high confidence are shown in green; detection responses with low confidence are shown in blue. If online sample collection focuses only on detection responses with high confidence, the object instances marked in blue may be absent from the training set.
7.3 Some of the online samples collected by our method. Green rectangle: positive online samples; red rectangle: negative online samples. Many of the collected online samples are noisy.
7.4 Recall-precision curves for the MES1 and MES2 sequences
7.5 Recall-precision curves for the MES3 and MES4 sequences
7.6 Recall-precision curve for the CAV1 sequence
7.7 Recall-precision curve for the i-LIDS sequence
7.8 Recall-precision curves for the Skateboard sequence
7.9 Some detection results. Green: detection output from the OTD validated as a true detection by our method; red: detection output from the OTD validated as a false alarm by our method.

Abstract

Object detection is a challenging problem in computer vision. With the increasing use of social media, smart phones and modern digital cameras, thousands of videos are uploaded to the Internet every day. Object detection is critical for analyzing these videos for many tasks, such as summarization, description, scene analysis, tracking and activity recognition.

Typically, an object detector is trained offline by collecting thousands of positive and negative training samples. However, due to large variations in appearance, pose, illumination, background scene and similarity to other objects, it is very difficult to train a generalized object detector that gives high performance across different test videos. We address this problem by proposing detector adaptation methods which collect online samples from a given test video and train an adaptive/incremental classifier on this data in order to achieve high performance.

First, we propose an efficient incremental learning method for a cascade of boosted classifiers, which collects training data in a supervised manner and adjusts the parameters of the offline trained cascade by combining online loss with offline loss. Then, we propose an unsupervised incremental learning approach which collects online samples automatically from a given test video using tracking information. Since online samples collected in an unsupervised manner are prone to labeling errors, instead of assigning hard labels to individual online samples we adopt a Multiple Instance Learning (MIL) approach and assign labels to bags of instances. We propose an MIL loss function for the Real Adaboost framework to train our incremental detector.

While the above approach gives good performance, it is limited to Real Adaboost based offline trained detectors. We therefore propose an efficient detector adaptation method which works with various kinds of offline trained detectors. In this approach, we first apply the offline trained detector at a high threshold to obtain confident detection responses.
These detection responses are tracked using a tracking-by-detection method, and online samples are collected from the detection responses and the tracking output. However, positive online samples can exhibit different articulations and pose variations, so they are divided into different categories using a pose classifier trained offline. We then train a multi-class random fern adaptive classifier on the collected online samples. During the testing stage, we first apply the offline trained detector at a low threshold and then apply the adaptive classifier to the obtained detection responses; the adaptive classifier either accepts a detection response as a true detection or rejects it as a false alarm. In this manner, we focus on improving the precision of the offline trained detector.

We extend this approach by proposing a multi-class boosted random fern adaptive classifier, which selects discriminative random ferns for high detection performance. We further incorporate MIL into the boosted random fern framework and propose a boosted multi-instance random fern adaptive classifier. Boosting provides discriminability to the adaptive classifier, whereas MIL provides robustness to noisy and ambiguous training samples.

We demonstrate the effectiveness of our approaches by evaluating them on several public datasets for the problem of human detection.

Chapter 1
Introduction

"It is not the strongest of the species that survives, nor the most intelligent. It is the one that is most adaptable to change."
Charles Darwin

Object detection has been a central problem in computer vision for many years, as it has several applications in surveillance, content retrieval, human-computer interaction and robotics. It also serves as a basis for many other computer vision tasks such as object tracking and activity recognition. With the increasing amount of available video data, there is wide interest in video analysis tasks such as video description, classification, activity analysis, video summarization and scene understanding. Successful detection of objects in videos is an important component of these high-level tasks.

Several offline trained object detection methods have been proposed; however, due to challenges such as appearance, viewpoint, pose, illumination and articulation changes across different videos, the performance of existing offline trained detectors is far from perfect. Therefore, adaptation of offline trained detectors to a new test video is important for high detection performance. Our objective is to improve the detection performance of an offline trained detector on a new test video via detector adaptation.

Detector adaptation has two important components: 1) collection of online samples, and 2) training of an incremental or adaptive classifier. Both components are critical and challenging. Training an adaptive classifier requires training data, i.e., positive and negative online samples. These samples can be collected either in a supervised or in an unsupervised manner. Supervised methods use manually labeled training samples, but manual labeling is not a feasible solution for every new video sequence. Hence, unsupervised methods are more practical for online sample collection.

Unsupervised training samples can be collected in several ways, such as using background subtraction, the detection output of the offline trained detector, or tracking information. Background subtraction based methods use motion information to obtain moving blobs from the test data.
However, the obtained blobs may be noisy, as blobs of objects close to each other may merge into a single blob. Moreover, only partial blob information may be obtained for an object, and sometimes objects in the background appear as part of a blob. Background subtraction based methods may also fail to collect static objects as online samples.

Another popular approach is to collect online samples based on the output of the offline trained detector. In these approaches, the offline trained detector is first applied to the test video; detection responses with high confidence are collected as positive online samples, whereas patches from the background scene are collected as negative samples. These methods address some drawbacks of background subtraction based methods. However, some instances of the objects may not be detected by the classifier, and some detection responses with high confidence may belong to the background scene.

Tracking-by-detection methods address this issue by associating detection responses across frames to obtain tracklets. By using tracking information, temporally inconsistent detection responses can be removed and, at the same time, object instances missed by the detector can be recovered.

Online samples collected in an unsupervised manner are always prone to labeling errors, so assigning hard labels to the collected training data may not give high detection performance. Multiple Instance Learning (MIL) has been proposed in the literature for training with noisy training samples.

Training of the incremental or adaptive classifier is another important aspect of adaptation/incremental learning methods. Several approaches have been proposed in the literature, such as co-training, combined learning, online boosting and incremental SVM based methods. Co-training methods use two different views of the data, on the assumption that the views are conditionally independent of each other, and train two independent classifiers using these views; one classifier is used to collect online training data, whereas the other is used for adaptation. In practice, however, the assumption that the two views are conditionally independent may not always hold. Combined learning based methods merge the training samples used for the offline trained detector with the collected online samples to train a new adaptive classifier; after combining offline data with online data, however, training the adaptive classifier may be computationally expensive. Online boosting and SVM based methods utilize online data to modify an existing offline trained detector, but most of these methods assume that the collected training samples are correctly labeled, which is a critical assumption. Some online boosting methods use MIL techniques to deal with noisy online data; however, most of them focus on detecting a single instance of the object in the video, with manual labeling of the object in the first frame. Moreover, the applicability of detector adaptation methods using these approaches is limited to a specific type of offline trained detector.

1.1 Problem Statement

Our objective is to design incremental learning and detector adaptation methods that improve the detection performance of an offline trained detector on a specific test video.
We focus on developing a detector adaptation method which is unsupervised, is applicable to various offline trained detectors, and is robust to noisy and ambiguous training data. We first propose incremental learning methods for a Real Adaboost based offline trained detector. Then we propose generic detector adaptation methods which can be applied to various kinds of offline trained detectors. For both the incremental learning and the detector adaptation methods, we first propose training of an adaptive classifier with hard-labeled online samples, and then propose training of an adaptive classifier that works effectively even in the presence of noisy and ambiguous training data. Table 1.1 shows the distribution of our proposed approaches into these categories.

We collect online samples in an unsupervised manner using tracking information, under the assumption that we can obtain at least partial tracklets for an object, if not the whole track. We design our adaptation method so that it can effectively detect objects with various poses and articulations.

Many existing adaptation methods assume that the collected online samples have correct class labels. This is not a valid assumption, as unsupervised online samples are inevitably prone to labeling error, which can lead to poor performance of the adaptive classifier. We address this issue by proposing incremental learning and adaptation methods which are robust to noisy and ambiguous training data. For this purpose, we incorporate Multiple Instance Learning (MIL) into the training algorithms of our incremental and adaptive classifiers. We show the effectiveness of our approaches on the problem of human detection; some of the detection results from our method are shown in Figure 1.1.

Figure 1.1: Examples of some detection results obtained by using our approach.

Table 1.1: Distribution of our proposed methods into different categories

Online Samples                          | Detector Specific                 | Generic
----------------------------------------|-----------------------------------|---------------------------------------------------------
Hard labeled online samples             | Supervised incremental learning   | Efficient detector adaptation, multi class random ferns
Training with noisy and ambiguous data  | Unsupervised incremental learning | Multi instance boosted random ferns

1.2 Challenges

Object detection is a challenging problem due to changes in the appearance of objects across different datasets, complex background scenes, etc. In this section we briefly discuss some of these challenges.

Figure 1.2: Some detection results obtained from an offline trained detector. Red: false alarms; green: missed detections.

1.2.1 Object appearance and complex backgrounds

Typically, an object detector is trained using many training examples, which are assumed to be representative of the objects present in unseen test data. However, the appearance of the object and the background environment may vary greatly from one dataset to another; when an offline trained detector is applied to novel test data, it often misses some of the object instances and produces false alarms as well. In Figure 1.2, we can see that when an offline trained object detector is applied to the i-LIDS dataset, it produces both missed detections and false alarms.

1.2.2 Variations in the object pose

Objects in the test data can appear in different poses; humans, for example, can have different poses and articulations such as standing, sitting and bending. Due to the large number of pose, appearance and articulation variations, it is difficult to train a single detector that can handle all the variations.
1.2.3 Noisy online samples

Online samples collected in an unsupervised manner can be noisy, i.e., they can be mislabeled by the collection mechanism. If included in the training set, such samples can lead to an ineffective adaptive classifier. Hence, an effective adaptation method must address how to handle noisy online samples during learning of the adaptive classifier.

1.2.4 Generalization of adaptation method

Over the years, various offline trained detectors have been proposed, built on different types of algorithms such as boosting [8, 2], SVMs [9, 10], random ferns [11] and random forests [12]. These variations in the training algorithms make the generalization of detector adaptation challenging.

1.2.5 Computational Efficiency

With the increasing number of video datasets, computational efficiency is becoming a critical requirement, as the size of the test data can be huge. Therefore, the online training algorithm of an adaptation method has to be computationally efficient in order to process a large number of datasets.

1.3 Overview of our approach

We present an approach for adapting an offline trained detector to a new test video, comprising both incremental learning and detector adaptation methods. An overview of our proposed approaches is shown in Figure 1.3.

Figure 1.3: Overview of our proposed methods. We propose adaptation methods for a Real Adaboost based offline detector and for generic offline trained detectors. In both cases, we first propose a method which assumes that the collected online samples are correctly labeled; we then extend it to adaptive classifier training with noisy and ambiguous training data.

We focus on designing an adaptation method with the following capabilities:

1. It collects online samples in an unsupervised manner.
2. It can be applied with various offline trained detectors.
3. Its adaptive classifier is able to handle noisy online samples.
4. It explicitly addresses how to handle different pose variations and articulations in an effective manner.

We start by proposing incremental learning methods for Real Adaboost based offline trained detectors. First, we propose a supervised incremental learning method, which adjusts the parameters of a Real Adaboost based offline trained detector based only on the collected online training samples. We adjust one parameter at a time to obtain a closed-form solution for the parameter adjustment, which makes our method computationally efficient compared to iterative optimization based approaches.

Then, we present an unsupervised incremental multiple instance learning (UIMIL) method, which collects as online samples precisely those cases where the offline detector makes mistakes, i.e., missed detections and false alarms. To collect them, we first apply the offline trained detector and track the obtained detection responses using a tracking-by-detection method. Missed detections are defined as those track responses for which there are no corresponding overlapping detection responses; false alarms are defined as those detection responses for which there are no corresponding track responses. Since online samples collected in an unsupervised manner are inevitably prone to labeling error, instead of assigning hard labels to individual online samples we create bags of instances and assign labels to the bags, not to individual instances, as sketched below.
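As an illustration, the following sketch shows how missed detections and false alarms might be mined from tracking output and grouped into bags. It is a minimal sketch, not the thesis's actual implementation: the Box type, the 0.5 overlap threshold and the bag size of 4 are assumptions made for the example.

    from dataclasses import dataclass

    @dataclass
    class Box:
        x: float
        y: float
        w: float
        h: float

    def iou(a, b):
        # Intersection-over-union of two axis-aligned boxes.
        ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
        iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
        inter = ix * iy
        union = a.w * a.h + b.w * b.h - inter
        return inter / union if union > 0 else 0.0

    def collect_online_samples(detections, track_boxes, thr=0.5):
        # Missed detections: track responses with no overlapping detection
        # response (positive online samples). False alarms: detection
        # responses with no overlapping track response (negative samples).
        positives = [t for t in track_boxes
                     if all(iou(t, d) < thr for d in detections)]
        negatives = [d for d in detections
                     if all(iou(d, t) < thr for t in track_boxes)]
        return positives, negatives

    def make_bags(samples, bag_size=4):
        # Group samples into bags; the (possibly noisy) label is attached
        # to the bag as a whole, not to the individual instances inside it.
        return [samples[i:i + bag_size]
                for i in range(0, len(samples), bag_size)]

Because only the bag carries a label, a bag of "positive" samples remains consistent even if some of its instances were mislabeled by the collection mechanism, which is exactly the robustness MIL is meant to provide.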
We propose a novel MIL loss function for the Real Adaboost [13] framework, in order to incorporate multiple instance learning into it. While this approach improves the performance of a Real Adaboost based offline trained detector, it is limited to Real Adaboost based offline trained detectors only, which restricts its applicability.

In order to make the adaptation method applicable to various offline trained detectors, we design a random fern based adaptation method. First, we apply the offline trained detector at a high threshold on a given test video and collect confident positive and negative online samples in an unsupervised manner using a tracking-by-detection method. Here, we assume that we can obtain at least partial tracklets for an object, if not the whole track; hence, this approach is not heavily dependent on high tracking performance. From these collected online samples, we train a random fern [11] based adaptive classifier. Testing is done in two stages: in the first stage, we apply the offline trained detector at a low threshold; in the second stage, we use the trained random fern adaptive classifier to validate the detection responses obtained from the first stage as correct detections or false alarms (see the sketch at the end of this section).

This approach can handle multiple pose variations of the object effectively, as we consider different poses and articulations as different categories of positive online samples and train a multi-class random fern adaptive classifier. For this purpose, we train a pose classifier offline for all the categories of positive online samples and use it to categorize the online positive samples. While this approach is applicable to various offline trained detectors, the ferns of the adaptive classifier are selected randomly, so the detection performance may not be high. We therefore propose a multi-class boosted random fern adaptive classifier, which selects discriminative random ferns for the training of the adaptive classifier to achieve high detection performance. Boosting provides discriminability to the adaptive classifier; however, this method does not address how to handle noisy and ambiguous training data.

We address this issue by proposing a boosted multiple instance random fern adaptive classifier. This approach is unsupervised, can be applied with various offline trained detectors, and can handle noisy online samples. We first apply the offline trained detector at a low threshold and track the obtained detection responses. We collect the online samples using the tracking and detection output. Since the collected online samples usually have labeling errors, we create bags of instances for positive and negative online samples. We then train a boosted multi-instance random fern classifier on the collected online samples, which handles noisy online samples effectively.
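The two-stage testing procedure shared by these adaptation methods can be summarized as follows. This is a hedged sketch under assumed interfaces: the detect and predict method names, the crop helper and the threshold value are placeholders introduced here, not prescribed by the thesis.

    def two_stage_detection(frames, offline_detector, adaptive_classifier,
                            low_threshold=0.1):
        # Stage 1: run the offline trained detector at a low threshold,
        # deliberately over-generating candidates (high recall, low
        # precision). Stage 2: the adaptive classifier accepts each
        # candidate as a true detection or rejects it as a false alarm.
        results = []
        for frame in frames:
            candidates = offline_detector.detect(frame, low_threshold)
            accepted = [box for box in candidates
                        if adaptive_classifier.predict(crop(frame, box)) == +1]
            results.append(accepted)
        return results

Because stage 1 only over-generates candidates and stage 2 only makes accept/reject decisions, the precision of the offline trained detector can be improved on the specific video without modifying the detector itself.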
1.4 Thesis Outline

The rest of the thesis is organized as follows. We begin with a review of related work in Chapter 2. In Chapter 3, we describe our supervised incremental learning method. In Chapter 4, we describe our unsupervised incremental learning approach, which explains what kind of online samples to collect in an unsupervised manner and presents our multiple instance learning approach for dealing with noisy online samples. In Chapter 5, we present our random fern based generic adaptation method, which can be applied to various offline trained detectors and can also handle multiple pose variations and articulations of the target object. In Chapter 6, we present our multi class boosted random fern adaptive classifier, which, in addition to being generic, provides discriminability to the adaptive classifier for high detection performance. In Chapter 7, we present our boosted multi instance random fern based adaptation method, in which we propose multi instance random ferns and show how to incorporate them into the boosting framework to deal with noisy and ambiguous training data. Chapter 8 concludes and discusses future work.

Chapter 2
Related Work

"The difference between literature and journalism is that journalism is unreadable and literature is not read."
Oscar Wilde

Over the past few years, many online/incremental/adaptation methods have been proposed. We can broadly categorize these methods into three groups: supervised, semi-supervised and unsupervised methods. In this section, we describe the related work in each of these categories.

2.1 Supervised Methods

Huang et al. [1] introduced an incremental learning framework for Real Adaboost. In their approach, the parameters of an offline learned classifier are adjusted by an efficient method for estimating the offline training loss during incremental learning; this offline loss is combined with the loss incurred on online samples to form a hybrid loss function, which is minimized for the optimal adjustment of the Real Adaboost parameters using the steepest descent method. They showed results for the problem of face detection; however, the steepest descent method is computationally expensive, which makes the incremental learning process time consuming.

Supervised incremental learning methods improve the performance of offline trained detectors, but they require manual labeling of each online training sample, which is infeasible when the test data to be processed is huge.

2.2 Semi Supervised Methods

Unlike supervised methods, which rely entirely on labeled training data, semi-supervised methods use information from both labeled and unlabeled data to train an adaptive classifier. Several semi-supervised methods [14, 15, 16, 17] have been proposed over the years. In [14], Mallapragada et al. proposed a semi-supervised method called SemiBoost, which iteratively improves the performance of a given supervised learning algorithm using a few labeled training examples and a large number of unlabeled ones. At each iteration, they select a number of unlabeled training samples and combine them with the labeled training examples to train a new classification model using the supervised training algorithm; the models trained at each iteration are combined into a final classification model. Grabner et al. [18] proposed an online version of SemiBoost for online adaptation. They use the labeled data as a prior, collect unlabeled samples from the current video, and combine them with the labeled data to train the adaptive classifier. They show results for the problem of visual tracking, i.e., tracking a specific target in a given video.

Other similar approaches [16, 15, 19, 17] have been proposed, which assume that the object has been detected successfully in the first frame, i.e.,
the target is annotated manually in the first frame, and the online detector is then updated automatically over all consecutive frames. These methods are not suitable for video sequences with a large number of people in the scene: such videos may contain thousands of frames with hundreds of different objects, and objects can appear in and disappear from the scene at any intermediate frame. Tagging all the objects in the video therefore requires considerable manual intervention. Moreover, as with supervised methods, this tagging process has to be repeated for each new video sequence, which makes these methods less practical.

2.3 Unsupervised Methods

Both supervised and semi-supervised methods require manually labeled training samples for classifier training, which is not feasible for a large number of test videos. Unsupervised methods collect online samples automatically, without any manual intervention, which makes these approaches more attractive. Several unsupervised methods [20, 21, 22, 23, 24, 25, 26, 27] have been proposed for detector adaptation. We broadly categorize them into three groups: background subtraction based methods; methods which collect information from several cues, such as combinations of part detections, multiple detectors or motion information; and generalized adaptation methods, which can be applied with various offline trained detectors. Here we summarize recent advances in each of these sub-categories.

2.3.1 Background subtraction based methods

Background subtraction based methods utilize motion information to separate the target object from the background. Nair and Clark [20] proposed a method for moving object detection which collects online training data in an unsupervised manner. They argue that if an object is moving, it can be separated from the background; hence, unsupervised online samples can be collected and provided to a Winnow based online learning algorithm. They show results for the problem of person detection.

In [21], Javed et al. proposed a co-training based approach which continuously labels incoming data to obtain unsupervised online samples and uses this data online to update a boosted classifier that was initially trained on a small labeled training set. They use background subtraction to collect motion blobs and compute PCA based global features for the collected blobs, and apply a co-training framework for online learning. To select the training samples for online learning, they use the base classifiers of the boosted classifier: if more than a certain number of base classifiers classify an online sample as positive or negative, it is included in the online sample set, which is then used to update the boosted classifier. For online learning, they use the online boosting method proposed by Oza and Russell [28].

In [23], Stalder et al. proposed a cascaded confidence filtering method for improving detection performance on a given video sequence. They first apply an offline trained detector to the sequence and, instead of extracting detection responses, use the confidence map of each frame. To this confidence map they apply geometric ground-plane assumptions to obtain the position and size information of the objects in the scene.
To the confidence map produced by the geometric filter they then apply background modeling over a set of frames, in order to obtain the confidence map of regions with a high probability of object movement. At the next stage, they apply a trajectory filter over a set of frames to remove spurious detection responses. In the last step, they apply particle filter based post-processing to obtain the final trajectories. By applying multiple filters in this manner, they obtain better detection results than using the offline trained detector alone. They show experiments for the problem of pedestrian detection.

Background subtraction based methods work well when the target objects are moving all the time and the extracted motion blobs are noise free. However, for datasets where target objects frequently overlap each other, or where the scene is crowded with many object instances, the extracted motion blobs are usually noisy: one motion blob can correspond to more than one object instance. If object instances are not moving in the scene, background subtraction considers them part of the background. Extracted motion blobs can also contain other types of noise, such as moving shadows of the object or parts of the background. Due to these issues, background subtraction is not an optimal solution for extracting unsupervised online samples from test video sequences.

2.3.2 Information from multiple cues

Wu and Nevatia [22] proposed an online learning approach for the Real Adaboost framework. They first train seed part detectors in a supervised manner by boosting edgelet based base classifiers on a small training set. Using these seed part detectors, they design an oracle for unsupervised online sample collection: the oracle applies different body part detectors to obtain part detection responses and then combines them. The confidence score of the combined detection responses is used to decide whether an online sample is rejected or included in the online training set; in this manner, they collect highly confident online samples for training. They propose an online version of the Real Adaboost classifier to learn from the online samples: starting with the classifier trained in a supervised manner, they update it using the online samples, adding or removing base classifiers from the offline trained detector based on the complexity of the current test dataset. They demonstrate the performance of their method on the problem of pedestrian detection.

Kembhavi et al. [26] proposed an incremental multiple kernel learning method for object recognition, training an incremental detector that adapts to a specific scene. In their approach, they first train a generalized object detector (global detector) in a supervised manner using general training data. This global detector is fixed and is never updated. Using the same general training examples, they also train a local detector, which is updated using online training samples. To collect positive online samples, they use highly confident detection results obtained by applying the global detector. To collect negative online samples, they first collect detection windows classified by the global detector as the background class.
They then compute the position entropy with respect to the current negative samples in the local detector's training set; if the entropy is high, the windows are included in the set of negative online samples. In this manner, they collect negative online samples from different parts of the image. Using these positive and negative online samples, they update the local detector. They apply their system to the problem of vehicle detection.

In [24], Wang and Wang proposed a method for adapting an offline trained pedestrian detector to a specific traffic scene. They first train a generic detector in a supervised manner using general training examples. Then, for a given traffic scene, they apply the generic detector to obtain detection responses. The obtained detection responses are filtered based on their confidence scores, and highly confident detection responses are
Moreover, [24, 25] uses both supervised training samples and unsupervised training samples for the training of the adaptive classifier, which can result in increased training set, hence training of adaptive classifier can be computationally expensive. Also, even after using conservative approach online samples collected in unsupervised manner are always prone to labeling error, hence an effective adaptive classifier should address how to work with noisy online samples. Unsupervised methods proposed by [22, 26, 24, 25] are described for a specific type of training algorithm used for the training of offline trained detector, which limits the applicability of their approach to a specific type of offline trained detector only. 2.3.3 Generalized Adaptation Methods Unsupervised generalized detector adaptation methods collect online samples in an un- supervised manner and are applicable to various offline trained detectors, as they do not depend on the specific type of features or algorithm used for the training of of- fline trained detector. One way to achieve generalization is to collect online samples in an unsupervised manner and combined them with the training samples used for the training of offline trained detector and re-train the offline trained detector. In this way, 19 re-training based generalization only addresses online sample collection aspect of adap- tation, which can be independent of the offline trained detector. Hence this type of approach can be applied to any type of offline trained detector, however offline trained detectors usually require considerably significant number of training examples and the training itself is computationally expensive. Moreover, re-training will be required every time there is a new test data. Therefore re-training based generalized detector adaptation approach is not a feasible solution. In [25], Wang et al. proposed a detector adaptation method in which they apply the offline trained detector at a high threshold and collect confident online positive and neg- ative samples automatically. Dense HOG features are extracted from collected online samples and hierarchical K-Means clustering is used to train a vocabulary tree based adaptive classifier. This adaptive classifier is used to validate detection responses ob- tained from offline trained detector as true detections or false alarms. They show the results on two different kinds of offline trained detectors. Their approach can be applied with various offline trained detectors. However they do not explicitly address how to work with noisy online samples. 20 Chapter 3 An Efficient Supervised Incremental Learning of Boosted Classifiers for Object Detection "’Efficiency is doing better what is already being done."’ Peter F . Drucker 3.1 Introduction W E propose a supervised incremental learning method for Real Adaboost based offline trained detectors. In many object detection methods an offline detector is trained by collecting thousands of positive and negative training examples on the assumption that these examples would represent the objects present in the unseen test data. Offline detector trained in this manner, may not work for some specific cases. One possible solution to handle these special cases is to repeat the offline training process by including the special cases in the training set. However re-training for every new dataset is expensive. 21 Figure 3.1: Overview of our approach We present a novel and efficient incremental learning method which addresses these issues. 
Our method finds the optimal adjustments to the parameters of the offline learned detector by optimizing a hybrid loss function, which is a combination of offline and online loss functions. We estimate the loss incurred by the offline samples without using the offline samples themselves during incremental learning, and combine this offline loss with the loss incurred by the online samples. The hybrid loss function is optimized for one parameter at a time, and an exact solution is found for each optimization problem, which makes our method computationally efficient. An overview of our approach is shown in Figure 3.1.

In recent years, many approaches have been proposed for incremental and online learning [16, 29, 28, 21, 26, 1, 30], with the main focus on classifiers based on SVMs [26, 30] and boosting [16, 29, 28]. Kembhavi et al. [26] introduced incremental multiple kernel learning for object recognition. Joshi and Porikli [30] used SVMs for incremental active learning. However, time can be a critical constraint for incremental learning, and SVM based methods are in general computationally expensive. Huang et al. [1] proposed an incremental learning method for the Real Adaboost framework. In [1], the loss function is defined as a function of all the parameters of the offline learned classifier and is optimized using the steepest descent method, which is inherently slow because of its convergence time.

3.2 Incremental Learning

Real Adaboost learns weak hypotheses $\{h_1, h_2, \ldots, h_T\}$ and combination coefficients $\alpha = \{(\alpha_{11}, \ldots, \alpha_{1m_1}), \ldots, (\alpha_{i1}, \ldots, \alpha_{im_i}), \ldots, (\alpha_{T1}, \ldots, \alpha_{Tm_T})\}$, where $m_i$ is the total number of partitions of the $i$-th weak hypothesis. The weak hypotheses are selected using thousands of training samples during offline training, which is important in order to learn discriminative weak hypotheses that can differentiate one category from the other. For incremental learning, however, only a few online examples may be available, which may not be sufficient to add or remove weak hypotheses. Hence, our incremental learning method does not modify the weak hypotheses; instead, we focus on finding the optimal addition $\Delta\alpha$ to the offline learned combination coefficients $\alpha$.

We initialize each element of $\Delta\alpha$ to zero and iterate sequentially over all the weak hypotheses of the strong classifier, updating each element of $\Delta\alpha$ after its optimization. Let $\alpha_{ij}$ be the combination coefficient of the $j$-th partition of the $i$-th weak hypothesis. We find $\Delta\alpha'_{ij}$ by minimizing the hybrid loss function $L(P(x,y);\, \alpha' + \Delta\alpha'_{ij})$, which combines the offline loss $L(P_{\mathrm{off}}(x|y);\, \alpha' + \Delta\alpha'_{ij})$ and the online loss $L(P_{\mathrm{on}}(x|y);\, \alpha' + \Delta\alpha'_{ij})$ and is defined as

$L(P(x,y);\, \alpha'') = \sum_{y \in \{-1,+1\}} \big[\, \eta_y\, P(y)\, L(P_{\mathrm{on}}(x|y);\, \alpha'') + (1 - \eta_y)\, P(y)\, L(P_{\mathrm{off}}(x|y);\, \alpha'') \,\big]$   (3.1)

where $\alpha'' = \alpha + \Delta\alpha + \Delta\alpha'_{ij}$, $P_{\mathrm{off}}(x|y)$ and $P_{\mathrm{on}}(x|y)$ are the offline and online likelihoods respectively, $P(y)$ is the prior probability of category $y \in \{-1,+1\}$, and $\eta_y$ is a regularization parameter which decides the weights given to the offline and online parts during incremental learning. After optimization, we update $\alpha_{ij}$ in $\alpha$ as

$\alpha_{ij} = \alpha_{ij} + \Delta\alpha'_{ij}$   (3.2)
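To make explicit why an exact per-parameter solution exists (this restatement, and the grouping into constants $A$ and $B$, is our notation, not taken verbatim from the derivation below): only the samples and marginal terms falling in partition $j$ of hypothesis $i$ depend on $\Delta\alpha'_{ij}$, and under the exponential loss each positive-class term carries a factor $e^{-\Delta\alpha'_{ij}}$ and each negative-class term a factor $e^{+\Delta\alpha'_{ij}}$:

$L(\Delta\alpha'_{ij}) = A\, e^{-\Delta\alpha'_{ij}} + B\, e^{+\Delta\alpha'_{ij}} + C, \qquad A, B > 0,\ C\ \text{independent of}\ \Delta\alpha'_{ij}$

$\frac{\partial L}{\partial \Delta\alpha'_{ij}} = -A\, e^{-\Delta\alpha'_{ij}} + B\, e^{+\Delta\alpha'_{ij}} = 0 \;\Longrightarrow\; \Delta\alpha'_{ij} = \tfrac{1}{2} \log \frac{A}{B}$

Sections 3.2.1 through 3.2.3 make $A$ and $B$ concrete: $A = \ell^{\mathrm{off}}_{+1} + \ell^{\mathrm{on}}_{+1}$ and $B = \ell^{\mathrm{off}}_{-1} + \ell^{\mathrm{on}}_{-1}$, yielding Eq. 3.12.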
3.2.1 Offline Loss Estimation

In [1], the estimated offline loss $L(P_{off}(x|y); \alpha + \Delta\alpha)$ is defined as:

$$L(P_{off}(x|y); \alpha + \Delta\alpha) \approx L(P_{off}(x|y); \alpha)\,\prod_i \sum_j \hat{P}_{off}(z_{i,j}|y)\exp(-y\,\Delta\alpha_{ij}) \tag{3.3}$$

where $\hat{P}_{off}(z_{i,j}|y)$ is the weighted marginal likelihood of the $j$-th partition of the $i$-th weak hypothesis. A more detailed derivation of Eqn. 3.3 can be found in [1]. Using Eqn. 3.3, we can define our offline loss function as:

$$L(P_{off}(x|y); \Delta\alpha' + \Delta\alpha'_{ij}) = L(P_{off}(x|y); \alpha)\,\prod_{t \in 1\ldots T,\ t \neq i}\ \sum_m \hat{P}_{off}(z_{t,m}|y)\exp(-y\,\Delta\alpha_{tm})\ \Bigg( \sum_{m \neq j} \hat{P}_{off}(z_{i,m}|y)\exp(-y\,\Delta\alpha_{im}) + \hat{P}_{off}(z_{i,j}|y)\exp(-y\,\Delta\alpha_{ij})\exp(-y\,\Delta\alpha'_{ij}) \Bigg) \tag{3.4}$$

3.2.2 Online Loss Estimation

For online samples, the total online loss is estimated as:

$$L(P_{on}) = \sum_{k=1}^{N^y_{on}} \frac{1}{N^y_{on}} \exp(-y\,G(F(x_k); \alpha')) \tag{3.5}$$

where $P_{on} = (P_{on}(x|y); \alpha')$, $\alpha' = \alpha + \Delta\alpha$, $N^y_{on}$ is the total number of online samples for category $y$, and

$$G(F(x_k); \alpha') = \sum_{m=1}^{T} g_m(f_m(x_k); \alpha') \tag{3.6}$$

where $T$ is the total number of weak hypotheses. While finding the optimal adjustment for $\alpha_{ij}$, we can define our online loss function as:

$$L(P_{on}(x|y); \Delta\alpha' + \Delta\alpha'_{ij}) = \sum_{k=1}^{N^y_{on}} \frac{\mathbb{1}[f_i(x_k)=j]}{N^y_{on}} \exp(-y\,G(F(x_k); \alpha'))\exp(-y\,\Delta\alpha'_{ij}) \tag{3.7}$$

where $\mathbb{1}[f_i(x_k)=j] = 1$ if $f_i(x_k) = j$, and $0$ otherwise.

3.2.3 Solution for the Adjustment of Combination Coefficients

We define $\beta^{off}_y$ and $\beta^{on}_y$ as follows:

$$\beta^{off}_y = \bar{\lambda}_y\, L(P_{off}(x|y); \alpha)\, \hat{P}_{off}(z_{i,j}|y)\exp(-y\,\Delta\alpha_{ij})\,\prod_{t \in 1\ldots T,\ t\neq i}\ \sum_m \hat{P}_{off}(z_{t,m}|y)\exp(-y\,\Delta\alpha_{tm}) \tag{3.8}$$

$$\beta^{on}_y = \lambda'_y \sum_{k=1}^{N^y_{on}} \frac{\mathbb{1}[f_i(x_k)=j]}{N^y_{on}} \exp(-y\,G(F(x_k); \alpha')) \tag{3.9}$$

where $\bar{\lambda}_y = (1-\lambda_y)P(y)$ and $\lambda'_y = \lambda_y P(y)$. Differentiating the offline loss function (Eqn. 3.4) w.r.t. $\Delta\alpha'_{ij}$ and using $\beta^{off}_y$ (Eqn. 3.8):

$$(1-\lambda_y)P(y)\,\frac{\partial L(P_{off}(x|y); \Delta\alpha' + \Delta\alpha'_{ij})}{\partial \Delta\alpha'_{ij}} = -\beta^{off}_{+1}\exp(-\Delta\alpha'_{ij}) + \beta^{off}_{-1}\exp(\Delta\alpha'_{ij}) \tag{3.10}$$

Similarly, differentiating the online loss function (Eqn. 3.7) w.r.t. $\Delta\alpha'_{ij}$ and using $\beta^{on}_y$ (Eqn. 3.9), we can write:

$$\lambda_y P(y)\,\frac{\partial L(P_{on}(x|y); \Delta\alpha' + \Delta\alpha'_{ij})}{\partial \Delta\alpha'_{ij}} = -\beta^{on}_{+1}\exp(-\Delta\alpha'_{ij}) + \beta^{on}_{-1}\exp(\Delta\alpha'_{ij}) \tag{3.11}$$

Now, solving $\frac{\partial L(P(x,y); \Delta\alpha' + \Delta\alpha'_{ij})}{\partial \Delta\alpha'_{ij}} = 0$ using Eqn. 3.10 and Eqn. 3.11, we obtain $\Delta\alpha'_{ij}$ as:

$$\Delta\alpha'_{ij} = \frac{1}{2}\log\frac{\beta^{off}_{+1} + \beta^{on}_{+1}}{\beta^{off}_{-1} + \beta^{on}_{-1}} \tag{3.12}$$

The incremental learning process for a strong classifier is described in Algorithm 1.

3.2.4 Time Complexity

Time is a critical factor for incremental learning methods. Let $N_h$ be the number of weak hypotheses in the strong classifier, $N_o$ the number of online samples, and $N_p$ the number of partitions per weak hypothesis. The time complexity of our algorithm is then $O(N_h(N_o + N_p))$. For steepest descent methods, besides the cost of computing all the gradients, which is $O(N_h(N_o + N_p))$, there is the additional overhead of dealing with a Hessian approximation at each iteration, which can be computationally expensive when the number of parameters to optimize is large. In a cascade Real Adaboost classifier, a strong classifier can have hundreds of weak classifiers and each weak classifier can have tens of partitions, so the number of parameters to optimize can be in the thousands. Therefore, for such a large parameter set, steepest descent methods can be computationally expensive.
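As a quick illustration of Eqn. 3.12, the closed-form adjustment for a single LUT bin can be computed directly from the four $\beta$ terms. The sketch below is a direct transcription of that formula; the small smoothing constant is our own assumption (not in the thesis) to guard against empty bins. Algorithm 1 below applies exactly this update in its inner loop.

```python
import numpy as np

def coefficient_update(beta_off_pos, beta_off_neg,
                       beta_on_pos, beta_on_neg, eps=1e-12):
    """Eqn. 3.12: delta = 0.5 * log((b_off_+1 + b_on_+1) / (b_off_-1 + b_on_-1)).

    eps is an assumed guard against zero numerators/denominators."""
    num = beta_off_pos + beta_on_pos + eps
    den = beta_off_neg + beta_on_neg + eps
    return 0.5 * np.log(num / den)
```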
Algorithm 1 Real Adaboost Incremental Learning for a strong classifier
Given: online samples $\{(x_1,y_1), \ldots, (x_n,y_n)\}$, strong classifier $H = \{h_1, h_2, \ldots, h_T\}$.
Init: $\beta_{+1}, \beta_{-1}, \Delta\alpha = 0$.
for $i = 1$ to $T$ do
  for each partition $j \in h_i$ do
    compute $\beta^{off}_{+1}$ and $\beta^{off}_{-1}$
    compute $\beta^{on}_{+1}$ and $\beta^{on}_{-1}$
    $\Delta\alpha'_{ij} = \frac{1}{2}\log\frac{\beta^{off}_{+1} + \beta^{on}_{+1}}{\beta^{off}_{-1} + \beta^{on}_{-1}}$
    $\Delta\alpha_{ij} = \Delta\alpha_{ij} + \Delta\alpha'_{ij}$
  end for
  Obtain $h_i$ (with updated combination coefficients)
end for
Output: strong classifier $H = (h_1, h_2, \ldots, h_T)$

3.3 Experiments

We evaluated our approach on the problem of pedestrian detection. Two different human datasets are used for evaluation: the UCR dataset [31] and the UCSD dataset [16]. First we compare the running time of our approach with the approach described in [1]. Then we present the detection performance on the UCR and UCSD datasets.

3.3.1 Time Computation Performance

For the comparison of computational time, we train an upper-body human detector offline by collecting around 20,000 training samples from the Internet. The offline detector is trained for 21 layers of cascade Real Adaboost using the method described in [2]. We run the experiments for our approach and [1] on a 3.16 GHz Xeon CPU.

Figure 3.2: Comparison of the running time of our incremental learning approach with the approach described in [1]. Inc-T (T = 2, 10, 20, 50) represents a maximum of T iterations of incremental learning using the method described in [1].

In Figure 3.2 we compare the computational time of our method with [1] for different numbers of online samples used for incremental learning. Note that our approach takes only around 1 second for 10 online samples.

3.3.2 Detection Performance

Datasets used: To evaluate the detection performance of our incremental learning method, we use three sequences: two sequences from the UCR dataset, "M1: Two people meeting on a bench without gestures" (M1) and "M2: Two people meeting on a bench with gestures" (M2), and the David Indoor (D1) sequence from the UCSD dataset. For the evaluation of the UCR videos, we randomly sample 94 frames from the latter half of each sequence and manually label the humans in these frames. For M1, 186 humans are labeled, whereas for M2, 182 humans are labeled as ground truth. For the D1 sequence, we sample a total of 80 frames and manually label 80 humans in these frames as ground truth.

Online sample collection: For each sequence, we annotate only 1 missed detection from the first half of the sequence and perturb this annotated example to generate 10 online positive samples; we also collect 10 false alarms from the background as online negative samples.

Choice of regularization parameter: The impact of changes in $\lambda_y$ on detection performance is shown in Figure 3.3(d). We use 10 online samples for this experiment, whereas the offline samples used to train the offline detector (described in Section 3.3.1) number in the tens of thousands; hence we get high performance when $\lambda_y$ is set close to the ratio of the number of online samples to the number of offline samples (equal to 0.001). If we use a very small $\lambda_y$ ($10^{-6}$) or a very high $\lambda_y$ ($0.1$), the performance deteriorates.

Experiment settings and evaluation criteria: We use the offline detector trained for the computation time experiments, as described in Section 3.3.1. The regularization parameter ($\lambda_y$) is set to 0.001 for all the experiments. We use the recall-precision criteria to evaluate detection performance and follow the 50% overlap criterion used in [2].

Detection results: We compare the detection results of our approach with the offline detector and the method used in [1]. From Figure 3.3, we can see that both our approach and the steepest descent method described in [1] work better than the offline detector.
Our approach gives better results than 5 iterations of the steepest descent method on all the sequences, while the results from our method are similar to 10 iterations of the steepest descent method. A few examples of detection results are shown in Figure 3.4.

3.4 Conclusion

We presented an efficient incremental learning method for cascade Real Adaboost classifiers, which combines the offline and online losses into a hybrid loss function, and we proposed an efficient method to optimize this hybrid loss function. Our experiments on the problem of pedestrian detection demonstrate that our method significantly improves the performance of an offline trained detector by collecting only a few online samples.

Figure 3.3: ROC curves and regularization parameter: (a) ROC-M1, (b) ROC-M2, (c) ROC-D1, (d) regularization parameter.

Figure 3.4: Detection results from the offline detector [2] (odd rows, red) and our approach (even rows, green). Our approach is able to recover the missed detections (marked in blue) and false alarms (marked in yellow) produced by the offline detector.

Chapter 4
Unsupervised Incremental Learning for Improved Object Detection in a Video

"For my part I know nothing with any certainty, but the sight of the stars makes me dream." - Vincent Van Gogh

4.1 Introduction

We address the problem of object detection with the goal of improving the overall detection performance in video sequences of crowded environments. Our objective is to design an unsupervised incremental learning method which collects online samples automatically from a given test video. Moreover, our approach addresses how to handle noisy and ambiguous online samples in the training of the incremental classifier.

Online samples for incremental learning can be collected in supervised or unsupervised manners. For supervised methods, labeled online data can be collected manually for a specific test environment. However, these methods are not practical for long video sequences, as the background and object appearances may keep changing and it is expensive to manually collect a large number of training samples.

Figure 4.1: Example images from the i-LIDS [3] and TUD [4] datasets. Video sequences in these datasets have crowded environments and cluttered backgrounds.

Unsupervised learning methods need to collect online training data automatically, without any human intervention. Two important decisions for this process are: 1) what kind of online samples should be collected from the test data, and 2) how to handle the inevitable noise in the online samples. The algorithmic choices for these two decisions can make a major difference in performance.

Some unsupervised sample collection approaches use motion segmentation methods [21, 33]. However, these approaches may not be suitable for complex backgrounds. Other approaches [22] use the offline trained detector to collect samples with high detection confidence as online samples. These approaches may not be able to adapt to the cases where the offline detector either misses the object or has low detection confidence, because such cases are not included in the training set. Moreover, online samples collected using unsupervised methods are prone to noise and may not have perfect spatial alignment with the ground truth; we call this the spatial alignment error. Multiple Instance Learning (MIL) has been used successfully [16, 34, 15] to address spatial alignment issues and to handle noisy online samples.
In MIL, training examples are provided as collections of instances (called bags) instead of single instances, and labels are given to the bags, not to the instances, under the assumption that at least one instance in a positive bag has the correct class label and all the instances in a negative bag have correct class labels. In this manner, MIL allows ambiguity in the training samples, which is usually present in the case of unsupervised online sample collection.

We propose an unsupervised incremental multiple instance learning (UIMIL) solution, which focuses on improving the performance of an offline trained detector for a given video sequence. We introduce a Multiple Instance Learning (MIL) loss function for the Real Adaboost [13] framework in order to handle noisy online samples. We also present an online sample collection method which focuses on the cases where the offline detector makes mistakes, i.e., it collects missed detections and false alarms. In [1], Huang et al. proposed a supervised method for incrementally modifying the parameters of a Real Adaboost [13] cascade detector. We adopt the same method of offline loss estimation as described in [1]. The key contributions of our UIMIL method are: 1) we introduce a MIL loss function for the Real Adaboost framework which enables the handling of noisy online samples; 2) our unsupervised sample collection approach focuses on collecting missed detections and false alarms for multiple targets simultaneously, using tracking information.

We evaluate our approach on the problem of pedestrian detection. Our approach is not specific to any particular type of features and can be integrated with any offline detector which uses a Real Adaboost framework similar to [32].

4.2 Related Work

Many incremental/online boosting methods have been proposed. In [28], Oza and Russell proposed an online version of Discrete Adaboost. Based on their work, Grabner and Bischof [29] proposed a method to add features during online boosting for object detection and visual tracking. These approaches are designed for Discrete Adaboost, which differs from Real Adaboost in how the weak hypotheses are combined, and Real Adaboost has been shown [13] to outperform the Discrete Adaboost algorithm.

MIL based online boosting approaches have also been proposed [16, 15, 19, 17]. These methods are semi-supervised: users annotate the target manually in the first frame, and then the online detector is updated automatically for all consecutive frames. These methods are not suitable for video sequences with a large number of people in the scene, such as those in Figure 4.1, because these videos may contain thousands of frames with hundreds of different persons, and people can enter and leave the scene at any intermediate frame of the sequence. Hence, tagging all the people in the video requires considerable manual intervention. Moreover, as with supervised methods, this tagging process would have to be repeated for each new video sequence, which makes these methods less practical.

For unsupervised online learning, co-training based methods have been used. Levin et al. [33] learned two classifiers simultaneously using co-training. Javed et al. [21] used principal component analysis based features and co-training for moving pedestrian and car detection. Both of these approaches use background subtraction techniques, which may not be suitable for complex backgrounds.
Figure 4.2: Overview of our system

Wu and Nevatia [22] proposed an approach for unsupervised online Real Adaboost. They improved the performance of a generalized offline trained detector by collecting samples in an unsupervised manner using an oracle, which combined the responses from different part detectors of the object, and showed its performance on the problem of pedestrian detection. However, they used only examples with strong, confident responses for online sample collection, which may not help in rectifying the special cases where the offline detector makes mistakes.

Stalder et al. [23] proposed an unsupervised cascaded confidence filtering based method to increase the robustness of a detector in order to improve tracking performance, by putting constraints on the size of the objects, the smoothness of trajectories and background information. However, their method can only work with a static camera, whereas our method does not use any background information and hence is not limited to static cameras, as long as we can track the people in a moving camera scene.

4.3 Outline of our Approach

There are two main components in our system: unsupervised sample collection and MIL based incremental learning for Real Adaboost.

The focus of our incremental learning module is to correct the mistakes made by the offline detector using noisy online samples. Incremental learning optimizes the parameters of the offline trained detector by minimizing a hybrid loss function, which is a combination of offline and online loss functions. We present a novel MIL loss function for the Real Adaboost framework, which allows the use of bags of instances for online samples in order to handle noisy online samples.

For unsupervised online sample collection, given a specific video sequence, we first apply the offline trained detector to it; the obtained detection responses are then associated frame by frame to track them. Based on the obtained tracks, we collect as positive online samples those instances which are successfully tracked but are either missed by the detector or detected with low confidence. Those detection results which do not match any of the track responses are considered false alarms and are collected as negative online samples. Bags of these online samples are used for the optimization of the hybrid loss function. An overview of our system is shown in Figure 4.2.

We evaluate the performance of our system on video sequences from two publicly available datasets: i-LIDS [3] and TUD Campus [4]. Experiments demonstrate the effectiveness of our method. The proposed MIL loss function works significantly better than the traditional Real Adaboost loss function. Our online sample collection mechanism also performs better than a confident-detections based online sample collection method and a background subtraction based method. We also provide an upper bound on the performance of the online sample collection method. Moreover, we show that the improved detection accuracy of our method can help improve the tracking performance of a state of the art tracking-by-detection method.

The rest of the chapter is divided into the following sections: in Section 4.4, we briefly summarize supervised incremental learning [1]; in Section 4.5, our unsupervised incremental learning framework is presented; experiments and results are shown in Section 4.6, followed by the conclusion.

4.4 Supervised Incremental Learning of Boosted Classifiers
Schapire et al. introduced Real AdaBoost [13] with domain-partitioning weak hypotheses. Huang et al. designed a supervised incremental learning method for this type of boosted classifier [1].

Given a training sample set $S = \{(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)\}$ where $x \in X$ and $y \in \{-1,+1\}$, a domain-partitioning based weak hypothesis divides the instance space $X$ into a set of disjoint subregions and assigns instances falling into the same subregion a constant confidence output. Such a two-step mapping can be formulated as:

$$h(x) = g \circ f(x) = g(f(x)) \tag{4.1}$$

where $f(x): X \to \mathbb{N}$ is the domain-partition function that gives the index of the subregion $x$ belongs to, and $g(\cdot): \mathbb{N} \to \mathbb{R}$ is a Look-Up Table (LUT) function which outputs $g(f(x))$ as the weak hypothesis output for all instances in region $f(x)$. In this way, the output of a boosted strong classifier with $T$ weak hypotheses can be written as:

$$H(x) = \sum_{i=1}^{T} h_i(x) = \sum_{i=1}^{T} g_i(f_i(x)) = \sum_{i=1}^{T} \alpha_{i,f_i(x)} \tag{4.2}$$

in which $\alpha_{i,j} = g_i(j)$ is the confidence output of the $j$-th bin of the $i$-th weak hypothesis. Let $\alpha = \{\alpha_{(1,1)}, \alpha_{(1,2)}, \ldots, \alpha_{(1,m_1)}, \ldots, \alpha_{(T,1)}, \ldots, \alpha_{(T,m_T)}\}$ be the concatenated vector ($m_i$ is the number of subregions in the $i$-th weak hypothesis). The loss function in Real AdaBoost is defined by

$$L_{S,H}(\alpha) = \frac{1}{n}\sum_{k=1}^{n}\exp(-y_k H(x_k)) = \frac{1}{n}\sum_{k=1}^{n}\exp\Big(-y_k \sum_{i=1}^{T} g_i(f_i(x_k))\Big) = \frac{1}{n}\sum_{k=1}^{n}\exp(-y_k\,G(F(x_k); \alpha)) \tag{4.3}$$

where $G(F(x); \alpha) = \sum_{i=1}^{T} g_i(f_i(x))$. In [1], Huang et al. proposed a supervised incremental learning algorithm to rectify an offline learned detector by adjusting the concatenated vector $\alpha$. They introduced a hybrid loss function $L(P(x,y); \alpha + \Delta\alpha)$ by combining the loss received from offline training samples and the loss from online ones:

$$L(P(x,y); \alpha') = \sum_y \lambda_y P(y)\,L(P_{on}(x|y); \alpha') + \sum_y (1-\lambda_y)P(y)\,L(P_{off}(x|y); \alpha') \tag{4.4}$$

where $P_{on}(x|y)$ and $P_{off}(x|y)$ are the conditional distributions of online and offline training samples respectively, $\alpha' = \alpha + \Delta\alpha$ is the new concatenated vector, and $\lambda_y$ is the combination coefficient that decides the importance given to the offline and online parts during incremental learning.

4.5 Unsupervised Incremental Multiple Instance Learning for Weakly Labeled Online Data

The incremental learning algorithm [1] performs well if online samples are labeled correctly, which is usually an unaffordable requirement in practice. Errors made by the detector, such as missed detections and false alarms, can be partially rectified through tracking, based on spatial-temporal smoothness constraints in a video sequence. However, these missed detections and false alarms are weakly labeled online data with non-negligible noise, which are very likely to mislead the incremental learning if directly adopted as strongly labeled online samples. In this section, we first present the strategy of collecting positive and negative online samples by tracking, and then introduce the MIL loss function and elaborate the multiple instance incremental learning algorithm that adapts the existing detector to the video being processed using this weakly labeled data.

4.5.1 Online Sample Collection

We collect as online samples those cases where the offline detector fails. Hence, we design our online sample collection mechanism to focus on collecting samples which are either missed by the detector or are false alarms. We define the following terms:

Unmerged detection responses: detection responses obtained from all the scanning windows for a given video frame.
Merged detection responses: these are obtained using hierarchical clustering over all the unmerged detection responses for a given video frame.

To collect the online samples, we first apply the offline detector to the whole video sequence to obtain the merged detection responses, and then track these detection responses to obtain the tracks $T = \{T_1, T_2, \ldots, T_m\}$. We prune those tracks which are present in the sequence for less than half a second, or for which less than 10% of the detection responses are confident, as these tracks are more likely to be false tracks.

Figure 4.3: Examples of some online samples collected by the unsupervised method: missed detections (red rectangles) are collected as positive training samples; false alarms (blue rectangles) are collected as negative training samples.

Positive samples: We collect missed detections as positive samples. We define missed detections as those track responses for which either there is no merged detection response from the offline detector or the merged detection response has low confidence. We match the merged detection responses in each frame with the tracks present in that frame, based on the overlap (30%) of the detection response with the track response. One track can match with only one merged detection response.

Negative samples: To collect negative online samples, we apply the offline detector on each frame and obtain the merged detection responses. Then we match the track responses in a frame with the merged detection responses based on the overlap (30%) between them. We identify as false alarms those merged detection responses which do not match any of the track responses. The unmerged detection responses corresponding to the merged detection responses of these false alarms are then taken as negative online samples. Some of the online samples collected by the unsupervised method are shown in Figure 4.3.

4.5.2 MIL Loss Function for Real Adaboost

We are given the weakly labeled data set $B = \{B_1, B_2, \ldots, B_m\}$, where each bag $B_i = \{x_1, x_2, \ldots, x_{n_i}; y_i\}$ consists of $n_i$ instances and a binary class label $y_i \in \{+1,-1\}$. It is inappropriate to demand that every instance be correctly classified by the incrementally learned detector, due to the unavoidable errors in unsupervised sample collection, but it is reasonable to require at least one instance in each bag to satisfy this requirement. Ideally, such a multiple-instance relaxed formalization only considers the instance that yields the minimum loss in each bag: for a positive bag, only the instance with the maximum classification output penalizes the classifier, while for a negative bag, only the instance with the minimum output matters. We can write the loss received from a positive bag as:

$$L^{+1}_{B_i} = \exp(-\max_k(H(x_k))) \tag{4.5}$$

where $x_k$ is the $k$-th instance of the bag; similarly, the loss of a negative bag is

$$L^{-1}_{B_j} = \exp(\min_k(H(x_k))) \tag{4.6}$$

However, the real max/min functions are not differentiable at intersection boundaries, which makes them difficult to optimize with gradient-descent based methods. Therefore, we adopt differentiable soft-max and soft-min functions to approximate the real ones, defined by

$$\widehat{\max} = \frac{1}{\lambda}\log\sum_{k=1}^{N^+_B}\exp(\lambda\,H(x_k)), \qquad \widehat{\min} = -\frac{1}{\lambda}\log\sum_{k=1}^{N^-_B}\exp(-\lambda\,H(x_k)) \tag{4.7}$$

$N^+_B$ and $N^-_B$ are the numbers of instances in the positive and negative bags, and $\lambda$ is a scale factor controlling the approximation: the soft-max/min approach the real ones as $\lambda \to \infty$.
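To make the relaxation of Eqns. 4.5-4.7 concrete, the sketch below computes the soft bag losses from a bag of strong-classifier outputs. It is an illustration under the $\lambda$ settings reported later in this chapter (20 for positive bags, 1 for negative bags), not the actual implementation, which optimizes these losses jointly with the offline loss via L-BFGS.

```python
import numpy as np

def soft_max(h, lam):
    # (1/lam) * log(sum_k exp(lam * h_k)); approaches max(h) as lam grows
    return np.log(np.sum(np.exp(lam * np.asarray(h)))) / lam

def soft_min(h, lam):
    # -(1/lam) * log(sum_k exp(-lam * h_k)); approaches min(h) as lam grows
    return -np.log(np.sum(np.exp(-lam * np.asarray(h)))) / lam

def bag_loss(scores, bag_label):
    """Soft MIL loss of one bag of strong-classifier outputs H(x_k)."""
    if bag_label == +1:
        # positive bag: only its best-scoring instance should matter
        return np.exp(-soft_max(scores, lam=20.0))
    # negative bag: only its lowest-scoring instance should matter
    return np.exp(soft_min(scores, lam=1.0))

# toy usage: a noisy positive bag in which one instance scores high
print(bag_loss([-0.5, 1.8, 0.1], +1))
```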
Algorithm 2 Updating an offline detector in a video
Given: an offline detector $D$, $N$ frames from a video.
Init: online sample bags $B = \emptyset$, the maximum number of bags $\theta = 100$, and the incremental detector set $I = \emptyset$.
for $i = 1$ to $N$ do
  if $|B^+| + |B^-| < \theta$ then
    - Sample collection: obtain online positive bags $B^+_i$ and online negative bags $B^-_i$ based on the detection results of $D$; each positive bag consists of 10 patches collected around a missed detection, while each negative bag includes one unmerged false alarm.
    - Sample update: $B \leftarrow B \cup B^+_i \cup B^-_i$
  else
    - Incremental learning: call Algorithm 3 with input $D$ and $B$ to obtain the updated detector $D'$; $I = I \cup \{D'\}$
    - Sample reset: $B = \emptyset$
  end if
end for

By replacing the real max/min functions with the soft ones, we have the soft loss functions for positive and negative online bags:

$$L(P(B|+1); \alpha') = \sum_{i=1}^{N^{+1}_o} \frac{1}{N^{+1}_o}\exp\big(-\widehat{\max}_k\{G(F(x_{ik}); \alpha')\}\big) = \sum_{i=1}^{N^{+1}_o} \frac{1}{N^{+1}_o}\exp\Big(-\frac{1}{\lambda}\log\sum_{k=1}^{N_{B_i}}\exp(\lambda\,G(F(x_{ik}); \alpha'))\Big) \tag{4.8}$$

$$L(P(B|-1); \alpha') = \sum_{i=1}^{N^{-1}_o} \frac{1}{N^{-1}_o}\exp\big(\widehat{\min}_k\{G(F(x_{ik}); \alpha')\}\big) = \sum_{i=1}^{N^{-1}_o} \frac{1}{N^{-1}_o}\exp\Big(-\frac{1}{\lambda}\log\sum_{k=1}^{N_{B_i}}\exp(-\lambda\,G(F(x_{ik}); \alpha'))\Big) \tag{4.9}$$

where $x_{ik}$ is the $k$-th instance of the $i$-th bag and $N_{B_i}$ is the total number of instances in bag $i$. As defined in Equation 4.4, this MIL loss function for weakly labeled online data is integrated with the offline loss function proposed by [1], so that the generalization ability of the detector learned offline can be maintained. By optimizing this hybrid loss function, an optimal adjustment $\Delta\alpha$ of the concatenated LUT vector can be obtained to improve the accuracy of the detector on the difficult cases represented by the online samples.

Algorithm 3 Incremental Learning Algorithm
Given: online sample bags $B$, and an offline learned cascade detector $D = \{H_1, H_2, \ldots, H_L\}$ in which $H_i$ is the $i$-th layer classifier.
Init: the updated detector $D' = \{\}$
for $i = 1$ to $L$ do
  - Incremental learning: minimize the hybrid loss function in Equation 4.4 and obtain the updated classifier $H'_i$.
  - Remove sample bags from $B$ that fail to pass $H'_i$.
end for
Output: $D' = \{H'_1, H'_2, \ldots, H'_L\}$

4.5.3 Incremental Update of the Offline Detector

As formalized in Algorithm 2, online training samples are collected frame by frame throughout the entire video sequence. The incremental learning module, described in Algorithm 3, is invoked when there are enough online training samples to update the offline learned detector. Note that all online training samples are released after incremental learning.

Similar to the offline learning procedure, the cascade detector is incrementally learned step by step from the first layer to the last in Algorithm 3. The optimal update of the current strong classifier $H_i$ is achieved by minimizing the hybrid loss function (Equation 4.4) with the online loss definition relaxed by the multiple instance formalization. Input online sample bags are rejected afterward if they cannot pass the updated classifier.

One crucial issue for this online updating method is avoiding drift in a long video sequence. In particular, it is likely to overfit some difficult online samples that may have been considered as outliers in offline training.

Figure 4.4: Updating strategy at each iteration of incremental learning in a video sequence. $D$ = offline detector, which is the same for all iterations; $O_i$ = online samples for iteration $i$; $I_i$ = incremental detector at iteration $i$.
Figure 4.5 shows recall as a function of video length when, at each iteration of the incremental learning, we adopt the detector updated in the previous iteration as the base detector. The accuracy of the incrementally learned detector degenerates drastically after 2000 frames, becoming even lower than that of the base detector itself. In contrast, if we keep using the offline trained detector as the base detector in each iteration of the incremental learning (see Figures 4.4 and 4.5), the incrementally learned detector consistently outperforms the offline learned one.

4.6 Experiments

We evaluated the performance of our approach on the problem of pedestrian detection. In this section, we first describe the experimental settings and datasets used for evaluation. Then we provide the evaluation details of our MIL loss function and online sample collection method. Finally, we show that the improved detection results from our incremental learning method also help improve the tracking performance of a state of the art tracking-by-detection method.

Figure 4.5: Overfitting avoidance, example on the i-LIDS AB Easy sequence. The x-axis represents the number of frames; the y-axis shows the recall rate for the corresponding number of frames. Inc1 = strategy of treating the offline detector as the base detector at each iteration of incremental learning; Inc0 = strategy of treating the incremental detector learned in the previous iteration as the base detector. Up to 1500 frames, Inc1 and Inc0 perform similarly. After 1500 frames, the performance of Inc0 keeps degrading because of over-fitting, whereas Inc1 avoids over-fitting and hence performs better than the offline detector.

Offline detector and pedestrian tracker: We used the collaborative learning of JRoG features based method [2] as our offline detector. We collected around 20,000 positive samples from the Internet for the training of the offline detector and trained 16 layers of cascade Adaboost for the full body of the human. Our method is not specific to any particular type of features and can easily be adopted for any offline detector which uses the Real Adaboost framework with domain-partitioning based weak hypotheses.

For unsupervised online sample collection, we used the three-level hierarchical association based tracking-by-detection method described in [35], and collected 10 image patches around each missed detection to form a positive bag; each false alarm is treated as a negative bag. For one iteration of incremental learning, 100 online bags are collected. The scale factor ($\lambda$) in the online loss function is set to 20 for positive bags and 1 for negative bags. These settings remain the same for all the experiments. Our incremental learning approach does not depend on any specific tracking method and can be integrated with any tracking-by-detection method which can provide missed detections and false alarms for incremental learning.

Implementation details: The combination-coefficient parameter ($\lambda_y = 0.001$) for the hybrid loss function is set empirically and remains the same for all the experiments. Ground plane assumptions, similar to the method described in [2], are used. The publicly available C implementation of the L-BFGS library [36] is used for the optimization of the hybrid loss function.

Datasets: We use two publicly available datasets: the i-LIDS AB dataset and the TUD Campus dataset. The i-LIDS dataset consists of video sequences captured at a subway station.
It is a challenging dataset because the scenes are crowded and people often occlude each other. We selected two sequences from this dataset for experiments: i-LIDS AB Easy and i-LIDS AB Medium. The i-LIDS AB Easy sequence contains 5220 frames of size 720 x 576 pixels, with 9716 humans. The ground truth annotations are available at [3]; these annotations contain all occluded people as well as partially visible people. The i-LIDS AB Medium sequence contains 4582 frames. For this sequence, in addition to the ground truth annotation available at [3], we annotated fully visible sitting people; after annotation, there were a total of 19,141 humans in the ground truth.

The TUD Campus dataset consists of a single sequence of 71 frames of size 640 x 480. In this sequence, people appear in extreme profile views and are often occluded. The ground truth annotations are the same as used in [4].

4.6.1 Detection Performance Evaluation

For the evaluation of the detection results, we use the recall-precision criteria, similar to [2]. The detection results are evaluated individually for each frame, without using any temporal information. None of the images from the test datasets are used for the training of the offline detector. We evaluate both components of our system: the MIL loss function and online sample collection.

MIL loss function evaluation: We compare our MIL loss function with the traditional Real Adaboost loss function used in [1]. For this comparison, we use our online sample collection method to collect online training samples for the incremental learning method described in [1], and use hard-labeled single training instances for the incremental learning of [1]. We can see from Figure 4.6 and Table 4.1 that our MIL loss function outperforms the traditional Real Adaboost loss function used in [1] on all the sequences, indicating that the MIL loss function enables the handling of noisy online samples and hence improves the performance of the detector. We also compare our method with two other approaches: 1) the cascade confidence filter (CCF) [23] on the i-LIDS AB Easy sequence; 2) Andriluka et al. [4] on the TUD Campus dataset. We can see from Table 4.1 that our method performs significantly better than CCF on the i-LIDS AB Easy sequence, whereas on the TUD Campus sequence our approach has 0.85 recall at precision 0.9, compared to 0.82 recall for Andriluka et al.'s method.

Table 4.1: Recall for different detectors when precision = 0.9

Sequence | UIMIL | Offline [2] | Inc-GT | Inc-CS | Inc-BS | Huang [1] | CCF [23] | Andriluka [4]
i-LIDS AB Easy | 0.75 | 0.69 | 0.80 | 0.72 | 0.70 | 0.69 | 0.54 | -
i-LIDS AB Medium | 0.64 | 0.62 | 0.65 | 0.62 | 0.59 | 0.61 | - | -
TUD Campus | 0.85 | 0.82 | 0.87 | 0.85 | 0.81 | 0.83 | - | 0.82

Online sample collection evaluation: We compare our online sample collection mechanism with the following three baseline methods:

1. Online samples collected from background subtraction (Inc-BS): The offline detector is applied, and online samples are collected based on the number of foreground pixels in the detection window. Positive and negative bags of these online samples are then provided to the MIL loss function. We use the OpenCV 1.0 implementation for background subtraction.

2. Confident detections (Inc-CS): To obtain confident detection results, the offline detector is first applied to the dataset, and tracks are obtained from these detection results using the method described in [35].
Based on these tracks, if a detection result bounding box overlaps with the bounding box of a track and the detection confidence is high, it is chosen as a confident positive sample. If the detection result does not overlap with any of the tracks and the detection confidence is low, it is considered a confident negative sample. Positive and negative bags of these confident samples are provided to the MIL loss function.

3. Online samples based on ground truth (Inc-GT): We apply the offline detector on the dataset and use the manually labeled ground truth to collect the missed detections and false alarms. This provides an upper bound on the performance of the online sample collection method.

It can be seen from Figure 4.6 and Table 4.1 that our sample collection method performs better than the Inc-CS and Inc-BS baselines. On the i-LIDS AB Easy and i-LIDS AB Medium sequences, Inc-CS performs very similarly to the offline detector, because only those samples which are detected with high confidence are collected as online samples. Inc-BS, on the other hand, drifts at times, which degrades its performance below that of the offline detector, indicating that for crowded and cluttered scenes background subtraction is not the best strategy for collecting online samples. The computational time for incremental learning with 100 online samples is around 30 seconds on a Xeon 3.16 GHz CPU. Some examples of the detection results are shown in Figure 4.7.

4.6.2 Improvement in Tracking

We investigated whether the incremental detector can also help improve the performance of tracking-by-detection methods. For this purpose, we evaluated the performance of a state of the art tracking-by-detection method [35] on the i-LIDS dataset (i-LIDS AB Easy and i-LIDS AB Medium sequences). For evaluation, we use the CLEAR [35] evaluation metric, which uses the following criteria:

1. Multiple Object Tracking Accuracy (MOTA): a combined score comprising false positive and false negative rates and ID switches. A higher score is better.

2. Fraction of Ground Truth Instances Missed (FGTIM): false negatives; a lower score is better.

3. False Alarms Per Frame (FAPF): false positives; a lower score is better.

Table 4.2: Tracking results for the i-LIDS dataset

Detector | MOTA | FGTIM | FAPF
Incremental Detector | 0.83 | 0.12 | 0.13
Offline Detector | 0.73 | 0.22 | 0.21

From Table 4.2, we can see that by using the incremental detector the MOTA score improves significantly, by 10%, and there are also 8% fewer false positives and 10% fewer false negatives, indicating that the incremental detector improves the overall tracking performance on a given video sequence. In this way, our method can be treated as a pre-processing step for improving the detection results before applying tracking-by-detection methods to the video sequence.

4.7 Conclusion

In this work, we presented a novel method for unsupervised incremental learning, including a novel MIL loss function for the Real Adaboost framework. Experiments and results show the effectiveness of our method. Our approach can be integrated with any offline detector which uses the domain-partitioning based Real Adaboost framework.

Figure 4.6: ROC curves for detection results: (a) i-LIDS AB Easy, (b) i-LIDS AB Medium, (c) TUD Campus.

Figure 4.7: Sample detection results on the i-LIDS and TUD Campus datasets. First and third rows show the detection results from the offline detector; second and fourth rows show the detection results from the incremental detector.
Red arrows point at false alarms produced by the offline detector, whereas blue arrows show detection results for people who were missed by the offline detector but detected by the incremental detector.

Chapter 5
Generic Detector Adaptation for Object Detection in a Video

"Any idea is a generalization, and generalization is a property of thinking. To generalize something means to think it." - Georg Wilhelm Friedrich Hegel

5.1 Introduction

Several incremental/online learning based detector adaptation methods [1, 22, 26, 37, 15, 38, 23, 25, 27] have been proposed. Most of these approaches are either boosting based [1, 22, 38] or SVM based [26, 27], which limits their applicability to a specific type of baseline classifier. We propose a detector adaptation method which is independent of the baseline classifier used, and hence is applicable to various baseline classifiers.

Figure 5.1: Some examples from the Mind's Eye dataset [5]. This is a challenging dataset, as it has many different variations in human pose.

With the increasing size of new test video datasets, computational efficiency is another important issue to be addressed. The methods of [27, 26, 22] use manually labeled offline training samples for adaptation, which can make the adaptation process computationally expensive, because the size of the training data can be large after combining offline and online samples. Some approaches [1, 38] address this issue, as they do not use any offline samples during the training of the adaptive classifier. However, these approaches optimize the baseline classifier using gradient descent methods, which are inherently slow.

Detector adaptation methods need online samples for training. Supervised [1] and semi-supervised [39, 15] methods require manual labeling for online sample collection, which is difficult for new test videos. Hence, unsupervised sample collection is important for adaptation methods.

Background subtraction based approaches [33, 20, 21, 37] have been used for unsupervised online sample collection. However, background subtraction may not be reliable for complex backgrounds. Tracking based methods [38, 4] have also been used; however, existing state of the art tracking methods [35, 40, 41] work well only for the pedestrian category. Therefore, these methods may not be applicable for different kinds of objects with many pose variations and articulations (see Figure 5.1).
The adap- tive classifier improves the precision of baseline classifier by validating the detection responses obtained from the baseline classifier as correct detections or false alarms Rest of this chapter is divided as follows: Related work is presented in section 2. Overview of our approach is provided in section 3. Our unsupervised detector adapta- tion approach is described in section 4. Experiments are shown in section 5, which is followed by conclusion. 5.2 Related Work In recent years, significant work has been published for detector adaptation methods. Supervised [1] and semi supervised [39, 15] approaches, which require manual labeling, have been proposed for incremental/online learning but manual labeling is not feasible for the large number of videos. Background subtraction based methods [33, 20, 21, 37] have been proposed for unsupervised online sample collection, but these methods are 57 Figure 5.2: Overview of our detector adaptation method not applicable for datasets with complex backgrounds. Many approaches [22, 4, 38, 25] have used detection output from the baseline classifier or tracking information for unsupervised online sample collection. Unsupervised detector adaptation methods can be broadly categorized into three different categories: Boosting based methods, SVM based approaches and generic adaption methods. Boosting based: Roth et al. [37] described a detector adaptation method in which they divide the image into several grids and train an adaptive classifier separately for each grid. Training several classifiers separately, could be computationally expensive. 58 Wu and Nevatia [22] proposed an online Real Adaboost [13] method. They collect online samples in an unsupervised manner by applying the combination of different part detectors. Recently, Sharma et al. [38] proposed an unsupervised incremental learning ap- proach for Real Adaboost framework by using tracking information to collect the online samples automatically and extending the Real Adaboost exponential loss function to handle multiple instances of the online samples. They collect missed detections and false alarms as online samples, therefore their method relies on tracking methods which can interpolate object instances missed by the baseline classifier. Our proposed ap- proach uses a simple position, size and appearance based tracking method in order to collect online samples. This simplistic tracking method produces short tracks without interpolating missed detections, which is sufficient for our approach. SVM based: Kembhavi et al. [26] proposed an incremental learning method for multi kernel SVM. Wang et al [27] proposed a method for adapting the detector for a specific scene. They used motion, scene structure and geometry information to collect the online samples in unsupervised manner and combine all this information in confi- dence encoded SVM. Their method uses offline training samples for adaptation, which may increase the computation time for training the adaptive classifier. Both boosting and SVM based adaptation methods are limited to a specific kind of algorithm of baseline classifier, hence are not applicable for various baseline classifiers. Generic: In [25], Wang et al. proposed a detector adaptation method in which they apply the baseline classifier at low precision and collect the online samples automati- cally. Dense features are extracted from collected online samples to train a vocabulary 59 tree based transfer classifier. 
They showed the results on two types of baseline classi- fiers for pedestrian category, whereas our proposed method show the performance with different articulations in human pose in addition to the pedestrian category. 5.3 Overview The objective of our work is to improve the performance of a baseline classifier by adapting it to a specific test video. An overview of our approach is shown in Figure 7.1. Our approach has the following advantages over the existing detector adaptation methods: 1. Generalizability: Our approach is widely applicable, as it is not limited to a specific baseline classifier or any specific features used for the training of the baseline classifiers. 2. Computationally Efficient: Training of the random fern based adaptive classifier is computationally efficient. Even with thousands of online samples, adaptive classifier training takes only couple of seconds . 3. Pose variations: It can handle different pose variations and articulations in object pose. For online sample collection, we apply baseline detector at a high precision (high threshold) setting. Obtained detection responses, are tracked by applying a simple tracking-by-detection method, which only considers the association of detection re- sponses in consecutive frames based on the size, position and appearance of the object. For each frame, overlap between bounding boxes of the output tracks and detection re- sponses is computed. Those detection responses which match with the track responses 60 and have a high detection confidence are collected as positive online samples. False alarms are collected as negative online samples. Positive online samples are further di- vided into different categories for variations in the poses for the target object and then a random fern classifier is trained as adaptive classifier. Testing is done in two stages: First we apply the baseline classifier at a high re- call setting (low threshold). In this way, baseline classifier produces many correct de- tection responses in addition to many false alarms. In the next stage, these detection responses from baseline classifier are provided to the learned random fern adaptive clas- sifier, which classifies the obtained detection responses as the correct detections or the false alarms. In this way our adaptation method improves the precision of the baseline classifier. We demonstrate the performance of our method on two datasets: CA VIAR [6] and Mind’s Eye [5]. We show the generalizability of our method by applying it on two different baseline classifiers: boosting based [2] and SVM based [42] classifier. Exper- iments also show that the method is highly computationally efficient and outperforms the baseline classifier and other state of the art adaptation methods. 5.4 Unsupervised Detector Adaptation In the following subsections, we describe the two different modules of our detector adaptation method : Online sample collection and training of the random fern based adaptive classifier. 61 5.4.1 Unsupervised Training Samples Collection To collect the online samples, we first apply the baseline classifier at high precision setting for each frame in the video and obtain the detection responsesD = fd i g. These detection responses are then tracked by using a simple low level association [35] based tracking-by-detection method. A detection response d i is represented as d i =fx i ;y i ;s i ;a i ;t i ;l i g. 
(x i ;y i ) represents the position of the detection response, s i its size, a i its appearance, t i its frame index in the video and l i the confidence of the detection response. The link probability between two detection responses d i and d j is defined as : P l (d j jd i ) =A p (d j jd i )A s (d j jd i )A a (d j jd i ) (5.1) whereA p is the position affinity,A s is size affinity andA a is the appearance affinity. If the frame difference between two detection responses is not equal to 1, the link proba- bility is zero. In other words, the link probability is only defined for detection responses in consecutive frames. d i and d j are only associated with each other, ifP l (d j jd i ) is high: P l (d j jd i )> max(P l (d j jd k );P l (d l jd i )) +;8 (k6=i; l6=j) (5.2) where is an adjustment parameter. Obtained track responsesT =fT i g are further filtered and tracks of length 1 are removed fromT . 5.4.1.1 Online Samples For each frame in the video, the overlap between the bounding boxes ofD andT is computed. A detection response d i is considered as positive online sample if: O(d i \ T k )> 1 andl i > 2 (5.3) 62 WhereO is the overlap of the bounding boxes of d i and T k . 1 and 2 are the threshold values. Also one track response can match with one detection response only. On the other hand, a detection response is considered as negative online sample if: O(d i \ T k )< 1 8k = 1;::::M; andl i < 3 (5.4) whereM is the total number of track responses in a particular frame. High confidence for detection response increases the likelihood that the obtained response is a positive sample. Similarly low confidence for detection response, would lead to high false alarm probability. Some of the collected positive and negative online samples are shown in Figure 7.3. 5.4.1.2 Pose Categorization We consider different pose variations in the target object (e.g. standing, sitting, bending for human) as different categories, as the appearance of the target object varies consider- ably with the articulation in the pose. Hence, we divide the positive online samples into different categories. For this purpose, we use the poselet [42] detector as the baseline classifier. A detection response d i obtained from the poselet detector is represented as d i =fx i ;y i ;s i ;a i ;t i ;l i ;h i g, whereh i is the distribution of the poselets. We model this distribution with 150 bin histogram, each bin depicting one of the 150 trained poselets. We train a pose classifier offline, in order to divide the positive online samples into different categories. We collect the training images for different variations in the human pose and compute the poselet histograms for these training images, by applying the poselet detector. The poselet histogram setH =f b h i g is utilized for dividing the samples into different categories. 63 For a given test video, collected positive online samples are represented as,P = fP i g, where P i =fx i ;y i ;s i ;a i ;h i ;l i ;v i g,v i is the target category, which is determined as: v i = arg min l ( B(h i ; b h l ) ) :h i 2 P i ; b h l 2H (5.5) where B is the Bhattacharya distance [43]. In this manner we divide the positive online samples into different categories. Each of these categories are considered as a separate class for adaptive classifier training. 5.4.2 Adaptive Classifier Training Ozuysal et al. proposed an efficient random fern [11] classifier, which uses binary fea- tures to classify a test sample. 
5.4.2 Adaptive Classifier Training

Ozuysal et al. proposed the efficient random fern [11] classifier, which uses binary features to classify a test sample. These binary features are defined as pairs of points chosen randomly within a given reference window size of the input training samples; the feature output is determined by comparing the intensity values of the two points in the pair. For a given test sample, let $\{C_1, C_2, \ldots, C_K\}$ be the $K$ target classes and $\{f_1, f_2, \ldots, f_N\}$ be $N$ binary features. The target category $c'_i$ is determined as:

$$c'_i = \arg\max_{c_i} P(f_1, f_2, \ldots, f_N \,|\, C = c_i) \tag{5.6}$$

In order to classify an image with binary features, many such features are needed, which makes the computation of the joint distribution of features $P(f_1, f_2, \ldots, f_N)$ infeasible. On the other hand, assuming all the features are independent completely ignores the correlation among features. Hence, the features are divided into independent groups, called ferns. If we have $M$ ferns in total, each fern has $\frac{N}{M}$ features, and the conditional probability $P(f_1, f_2, \ldots, f_N \,|\, C = c_i)$ can be written as:

$$P(f_1, f_2, \ldots, f_N \,|\, C = c_i) = \prod_{k=1}^{M} P(F_k \,|\, C = c_i) \tag{5.7}$$

where $F_k$ is the set of binary features for the $k$-th fern.

Figure 5.3: Examples of some of the positive (first row) and negative (second row) online samples collected in an unsupervised manner from the Mind's Eye [5] and CAVIAR [6] datasets.

During training, the distribution of each category is computed for each fern independently. This distribution is modeled as a histogram, where each category distribution has $L = 2^{\frac{N}{M}}$ bins, as the output of the $\frac{N}{M}$ binary features can take $2^{\frac{N}{M}}$ possible values. For a test sample, we compute $P(F_k \,|\, C = c_i)$ as:

$$P(F_k \,|\, C = c_i) = \frac{n_{k,i,j}}{\sum_{j=1}^{L} n_{k,i,j} + \beta} \tag{5.8}$$

where $n_{k,i,j}$ is the value of the $j$-th bin of the distribution of the $i$-th category for the $k$-th fern, and $\beta$ is a constant.

For the training of the adaptive classifier, we use only online samples collected in an unsupervised manner; no manually labeled offline samples are used. We train a multi-class random fern adaptive classifier by considering the different categories of positive samples as different target classes; all negative online samples are considered a single target class. For a test video, online samples are first collected from all the frames and then the random fern classifier is trained. The training procedure is performed only once for a given test video. The learning procedure of the random fern based adaptive classifier is described in Algorithm 4.

Algorithm 4 Unsupervised Detector Adaptation
Training:
Given: $D$, $T$, $H$, test video $V$ with $F$ frames in total
Init: positive online samples $P = \{\}$, negative online samples $N = \{\}$
for $i = 1$ to $F$ do
  - Match $D$ with $T$ and collect positive ($S^+$) and negative ($S^-$) samples for this frame
  - $P = P \cup S^+$, $N = N \cup S^-$
end for
for $i = 1$ to $|P|$ do
  Init: $v_i = -1$, $d_{min} = \infty$
  for $j = 1$ to $|H|$ do
    - $\delta = B(h_i, \hat{h}_j)$
    if $\delta < d_{min}$ then
      $d_{min} = \delta$, $v_i = j$
    end if
  end for
end for
- Train the random fern classifier using online samples $P$ and $N$.
Test:
for $i = 1$ to $F$ do
  - Apply the baseline classifier at a low threshold to obtain detection responses $D_f$
  for $j = 1$ to $|D_f|$ do
    - Apply the random fern classifier to validate the detection responses as true detections or false alarms
  end for
end for
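On the classifier side, the training and test loops of Algorithm 4 reduce to maintaining the per-fern count histograms of Eqn. 5.8 and summing per-fern log-likelihoods as in Eqn. 5.7. The sketch below is a minimal, self-contained rendering of that idea with intensity-pair features; the class layout, parameter names and patch size are our own illustrative choices, not the thesis code.

```python
import numpy as np

class RandomFerns:
    """Minimal multi-class random fern classifier in the spirit of
    Eqns. 5.6-5.8: random pixel-pair binary tests, per-fern class
    histograms, and a naive-Bayes-style combination across ferns."""
    def __init__(self, n_ferns, n_bits, patch_shape, n_classes, beta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        h, w = patch_shape
        # each binary feature compares the intensities of two random pixels
        self.pairs = rng.integers(0, [h, w], size=(n_ferns, n_bits, 2, 2))
        self.counts = np.zeros((n_ferns, n_classes, 2 ** n_bits))
        self.beta = beta  # smoothing constant of Eqn. 5.8

    def _bin(self, patch, k):
        # concatenate the n_bits test outcomes of fern k into a bin index
        bits = [patch[r1, c1] > patch[r2, c2]
                for (r1, c1), (r2, c2) in self.pairs[k]]
        return int("".join("1" if b else "0" for b in bits), 2)

    def train(self, patch, label):
        for k in range(len(self.pairs)):
            self.counts[k, label, self._bin(patch, k)] += 1

    def classify(self, patch):
        # Eqn. 5.7: product over ferns, computed as a sum of logs
        log_p = np.zeros(self.counts.shape[1])
        for k in range(len(self.pairs)):
            j = self._bin(patch, k)
            p = self.counts[k, :, j] / (self.counts[k].sum(axis=1) + self.beta)
            log_p += np.log(p + 1e-12)
        return int(np.argmax(log_p))  # Eqn. 5.6

# usage with the Mind's Eye settings of Section 5.5.2.2 (15 ferns, 8 features,
# 4 classes); the 64x32 patch size is assumed for illustration
ferns = RandomFerns(n_ferns=15, n_bits=8, patch_shape=(64, 32), n_classes=4)
```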
ME1 has 641 GT instances of the 66 Algorithm 4 Unsupervised Detector Adaptation Training: Given:D,T,H , Test VideoV , with totalF frames Init: Positive Online Samples,P =fg, Negative online samples,N =fg fori = 1 toF do - MatchD withT and collect positive (S + ) and negative (S ) samples for this frame -P =P[S + ,N =N[S end for fori = 1 tojPj do Init:v i =1,d min = 1 forj = 1 tojHj do - = B(h i ; b h j ) if <d min then d min = ,v i =j end if end for end for - Train Random fern classifier using online samplesP andN. Test: fori = 1 toF do - Apply baseline classifier at low threshold to obtain detection responsesD f forj = 1 tojD f j do - Apply Random fern classifier to validate the detection responses as the true detections and false alarms end for end for human, whereas ME2 has 300 GT instances of the human. ME1 has two different pose variations: standing/walking, bending, whereas ME2 has the pose variations for digging and standing/walking. Baseline classifiers: To demonstrate the generalizability of our approach, we performed experiments with two different baseline classifiers: For CA VIAR, boosting based classifier is used as described in [2]. For Mind’s Eye dataset, we used publicly available trained poselets and Matlab implementation available at [44]. 67 In the following subsections, we present computation time and detection perfor- mance experiments. 5.5.1 Computation Time Performance We evaluated computational efficiency of our approach for the training of the adaptive classifier after collection of online samples. We performed this experiment for online samples collected from CA VIAR dataset and trained the adaptive classifier for two target categories. For the adaptive classifier training, we only use the online samples collected in unsupervised manner, no offline samples are used for the training. We compare the performance of our method with [38], which also does not use any of the offline samples for incremental learning. [38] uses bags of instances, instead of single instance, hence we count all the training samples in the bag in order to count the total number of samples used for the training. In Figure 5.4, we show the run time performance for different number of ferns and number of binary features. We can see that random fern based adaptive classifier train- ing outperforms [38] in run time performance. [38] optimizes parameters of baseline detector using gradient descent method, hence training time of incremental detector is high. Whereas our random fern adaptive classifier is independent of the parameters of baseline classifier and uses simple binary features for the training, hence is computa- tionally efficient. For CA VIAR dataset, we use 30 random ferns with 10 binary features, which takes only 1.2 seconds for training of 1000 online samples, whereas the method described in [38] takes approximately 35 seconds, which makes our method approximately 30 times faster than [38]. Total training time of random fern classifier for CA VIAR1 sequence 68 Figure 5.4: Run time performance of our approach. X-axis represents the number of online samples used for the classifier training, Y-axis is shown in log scale and represents runtime in seconds. RF-I-K : I random ferns with K binary features. takes only 8 seconds for approximately 16000 online samples, whereas for CA VIAR2 it takes only 19 seconds with approximately 43000 online samples. 5.5.2 Detection Performance We evaluated the detection performance for the CA VIAR and Mind’s Eye datasets. 
5.5.2.1 CAVIAR Dataset

For this dataset, we use a Real Adaboost based baseline classifier [2], trained with 16 cascade layers for the full human body. 30 random ferns with 10 binary features each are trained for two target categories (positive and negative classes). Division of the positive samples into different categories is not required for this dataset, as all the humans in the sequences belong to the pedestrian category. The thresholds $\theta_1$, $\theta_2$ and $\theta_3$ were empirically set to 0.3, 20 and 10 respectively; these parameters remain the same for all experiments on this dataset.

We compare the performance of our method with two state-of-the-art approaches [25, 38]. From Figure 5.5, we can see that our method significantly outperforms both HOG-LBP [10] and [25]. Also, from Table 5.1, we can see that for CAVIAR1, at a recall of 0.65, Sharma et al.'s method improves the precision over the baseline by 14%, whereas our method improves it by 22%. For CAVIAR2, our method improves the precision over the baseline by 59% at a recall of 0.65, whereas Sharma et al.'s method improves it by 19%. Both our approach and Sharma et al.'s method outperform the baseline detector [2]; however, for the CAVIAR2 sequence, long tracks are not available for some of the humans, so not enough missed detections are collected by Sharma et al.'s approach, and its performance is therefore not as high. Our approach does not require long tracks, hence it performs better than [38].

Figure 5.5: Recall-precision curves for detection results on the CAVIAR dataset. (a) CAVIAR1; (b) CAVIAR2.

5.5.2.2 Mind's Eye Dataset

We use the trained poselets available at [44] for the experiments on the Mind's Eye dataset. We train 15 random ferns with 8 binary features for the adaptive classifier. The adaptive classifier is trained for four target categories (standing/walking, bending, digging and negative). $\theta_1$ is set to 0.3, whereas $\theta_2$ and $\theta_3$ are set to 40 and 20 respectively. These parameter settings remain the same for all experiments on this dataset.

During online sample collection, not many negative samples are obtained; hence we add approximately 1100 negative online samples, collected in an unsupervised manner from the CAVIAR dataset, to the online negative sample set for both the ME1 and ME2 sequences. Three training images are used to learn the pose categorization histograms; these images show the standing/walking, bending and digging poses respectively. None of these training images is from either the ME1 or the ME2 sequence.

We compare the performance of our approach with the baseline classifier (the poselet detector [42]) and show that dividing the positive samples into different categories gives better performance than not dividing them. Precision-recall curves for the ME1 and ME2 sequences are shown in Figure 5.6. For both sequences, our method performs better than the poselet detector, and the best performance is obtained when we divide the positive samples into different categories.

Figure 5.6: Recall-precision curves for detection results on the Mind's Eye dataset. (a) ME1; (b) ME2.
From Table 5.2, we can see that our method improves the precision for ME1 by 5% at a recall of 0.96 when we use the sample categorization module, whereas without sample categorization the improvement in precision is 2%. For the ME2 sequence, at a recall of 0.6, we improve the precision of the poselet detector by 12% with sample categorization, whereas without sample categorization the improvement is 7%.

Table 5.1: Precision improvement performance on the CAVIAR dataset at recall 0.65
Sequence | Sharma [38] | Baseline [2] | Ours
CAVIAR1  | 0.56        | 0.42         | 0.64
CAVIAR2  | 0.40        | 0.21         | 0.80

Table 5.2: Best precision improvement performance on the Mind's Eye dataset. For ME1, precision values are shown at recall 0.97, whereas for ME2 recall is 0.6. Ours-1: without sample categorization; Ours-2: with sample categorization
Sequence | Poselet [42] | Ours-1 | Ours-2
ME1      | 0.65         | 0.67   | 0.70
ME2      | 0.72         | 0.79   | 0.84

Some of the detection results are shown in Figure 5.7. Our approach can be utilized as an efficient pre-processing step to improve detection results before applying a tracking-by-detection method on baseline classifiers. The trained multi-category adaptive classifier can also be used for pose identification (standing, bending, digging, etc.).

5.6 Conclusion

We proposed a novel detector adaptation approach, which efficiently adapts a baseline classifier to a test video. Online samples are collected in an unsupervised manner and a random fern classifier is trained as the adaptive classifier. Our approach is general, hence it can easily be applied with various baseline classifiers, and it can also handle pose variations of the target object. Experiments demonstrate that our method is computationally efficient compared to other state-of-the-art approaches. We show the detection performance on two challenging datasets for the problem of human detection. In the future, we plan to apply our adaptation method to other categories of objects and other baseline classifiers.

Figure 5.7: Examples of some of the detection results when the baseline detector is applied at a low threshold (best viewed in color). Red: detection result classified as a false alarm by our method. Green: detection result classified as a correct detection by our method. The identified category name is also specified for the Mind's Eye dataset (second row).

Chapter 6

Generic Detector Adaptation with Multi Class Boosted Random Ferns for High Detection Performance

"Without continual growth and progress, such words as improvement, achievement, and success have no meaning." Benjamin Franklin

6.1 Introduction

We propose an unsupervised detector adaptation method with a focus on two important aspects: discriminability and efficiency. Both are critical for practical applications. In the previous chapter we proposed a generic detector adaptation method which is applicable to various offline trained detectors; however, the ferns for the training of the adaptive classifier are selected randomly, and may not be discriminative enough to give high detection performance. In this work, we focus on selecting discriminative random ferns via boosting for the training of the adaptive classifier. We collect online samples automatically from a given test video and propose a multi-class boosted random fern adaptive classifier which selects discriminative random ferns for classification.
Our multi-class boosting method focuses on feature sharing among different classes, which allows simultaneous training of multiple classes in a single boosting framework; hence the training of the adaptive classifier is efficient. We apply our method to the problem of human detection.

Our approach neither requires the training data used for the training of the generic detector, nor does it depend on the specific training algorithm used to train it. Hence our approach is widely applicable to various generic detectors, as the generic detector is only used to obtain the initial detection responses from a given test video.

Due to their combination of accuracy and efficiency, random ferns [11] have been used for both classification [12] and detection [45]. Boosted random ferns [46] have been proposed for selecting discriminative random ferns for classifier training; however, [46] focuses only on binary classification. In [47], Lin and Liu proposed a multi-class boosting method called MBHBoost for the face detection task, in which the same feature is shared across different classes and multiple classes are trained simultaneously in a single boosting framework. Due to feature sharing and simultaneous training, MBHBoost is computationally efficient in both the training and testing phases. We integrate MBHBoost into the random fern framework for the training of an efficient and discriminative adaptive classifier.

The rest of the chapter is organized as follows: related work is discussed in Section 6.2, an overview of our approach is provided in Section 6.3, online sample collection is described in Section 6.4, and adaptive classifier training is explained in Section 6.5. Experiments and results are presented in Section 6.6, followed by the conclusion.

6.2 Related Work

Detector adaptation in videos has been an active topic of interest in recent years. Oza and Russell [28] introduced online boosting. Following the success of their method, many other online boosting methods [16, 29, 1, 21, 15] have been proposed. Some of these methods require manual annotations [16, 29, 1], which is not a feasible solution for many applications. Some methods use background subtraction for online sample collection [21], which is also not an attractive solution, as the scene may contain stationary objects. Moreover, these boosting based methods are limited to a specific type of detector, which limits their applicability.

SVM based detector adaptation methods [48, 27] have also been proposed. In [27], Meng et al. proposed a supervised adaptation method which collects online samples using motion, size and location based context information; an SVM classifier is then trained using the online training samples together with the original training data that was used for the training of the generic detector. Combining the original training data with new training data for adaptive classifier training can be computationally expensive if it has to be repeated for every new video. In [48], Guang et al. proposed a detector adaptation method which collects online samples using the confidence of the detector output and trains an SVM based adaptive classifier in an iterative manner; however, multiple iterations of adaptive classifier training can be computationally expensive.

Tracking has also been used [4, 38, 7] for object detection.
Sharma et al. [38] proposed an incremental learning method in which online samples are collected using tracking and detection responses; however, their method is heavily dependent on the performance of the tracking-by-detection method, as they need tracklets which can interpolate the object instances missed by the detector. Sharma and Nevatia [7] proposed a random fern based detector adaptation method. They collect online samples using tracking and detection responses in an unsupervised manner and train a random fern based adaptive classifier. It can be applied with various generic detectors; however, the random ferns for the adaptive classifier are chosen randomly, so the trained classifier may not be robust enough to give a significant improvement over the generic detector.

Random ferns [11] have been used for classification [12] and detection [45] tasks. Villamizar et al. proposed boosted random ferns [46], focusing on binary classification. They proposed a two-stage classification problem: in the first stage, a set of potential orientation estimates is generated, and in the second stage a classifier validates the original estimates. In both stages, they use Real Adaboost [13] with HOG features. However, their method is limited to the binary classification task.

Multi-class boosting methods have also been proposed [49, 47]. In [49], Torralba et al. proposed shared-feature multi-class boosting. In their approach a feature is shared among different classes, and the suitable subset of classes that share the same feature is found by searching all possible combinations, which makes their method computationally expensive. In [47], Lin and Liu proposed a multi-class boosting method in which different classes share the same feature, but each class has its own weak learner. Sharing the same feature among different classes enables simultaneous training of the multiple classifiers in a single training framework, hence the boosting method is computationally efficient. They show its performance on the multi-view face detection task.

Figure 6.1: Overview of our system.

6.3 Overview of our approach

We present a method for adapting a generic detector to a specific video by proposing a multi-class boosted random fern adaptive classifier. Our approach can be applied with various generic detectors. An overview of our system is provided in Figure 6.1.

We first apply the generic detector at a high threshold and get the detection responses for all frames. These detection responses are tracked using a data association based tracking-by-detection method. Using both tracking and detection responses, we collect highly confident positive and negative online samples. However, the collected positive online samples can have different poses such as standing, bending, sitting, digging, etc. We divide the collected positive online samples into different categories using poselet [42] histograms. All negative samples are kept in a single category.

A multi-class boosted random fern adaptive classifier is trained using the collected online samples. For this purpose we integrate the MBHBoost [47] framework with random ferns. Boosting enables the selection of discriminative random ferns. Our boosting process focuses on sharing the same feature across different classes, and multiple classifiers are trained in a single boosting framework, hence our adaptive classifier training is computationally efficient.
For testing, we apply the generic detector at a low threshold and use the trained multi-class boosted random fern adaptive classifier to verify the detection responses obtained from the generic detector as correct detections or false alarms.

We evaluate our approach on two public datasets, CAVIAR [6] and Mind's Eye [5], for two different kinds of generic detectors. Experiments show that the multi-class boosted adaptive classifier improves the performance of a generic detector significantly as compared to other state-of-the-art methods.

6.4 Online Sample Collection

For online sample collection we use a method similar to the one described in [7]. First we apply the generic detector to the whole video sequence and obtain detection responses. We track the obtained detection responses using a low-level tracking-by-detection method as described in [35]; this simple tracking method only associates detection responses between neighboring frames based on color, size and position affinities.

After tracking the detection responses, we compare the overlap of the bounding boxes of the detection and tracking outputs and collect confident positive and negative online samples as described in [7], under the assumption that true detection responses usually have a high detection score, whereas false alarms have a low detection score. A confident positive online sample is defined as a detection response which overlaps with a tracking response beyond a certain threshold (> 30%) and whose detection confidence is also higher than a threshold $\theta_1$ (> 15). On the other hand, a confident negative online sample is defined as a detection response whose overlap with all tracking responses is below a certain threshold (< 30%) for a given frame and whose detection score is between $\theta_2$ (5) and $\theta_3$ (10).

Collected positive online samples may have different articulations, e.g. sitting, standing, running, bending, digging, etc. in the case of humans as the target object. [7] argued that these different articulations can be considered as different target categories and proposed a method based on poselet [42] histograms to categorize the positive online samples. In [7], a pose classifier is trained offline, and the target category of a positive online sample is identified based on the Bhattacharyya distance between the sample and the trained poselet histograms. We adopt the same approach for categorizing positive online samples into different categories. All negative online samples are kept in a single category.

6.5 Adaptive Classifier Training

In this section, we first briefly describe the random fern classifier [11] and MBHBoost [47], then we explain our proposed multi-class boosted random fern adaptive classifier.

6.5.1 Random fern classifier

Ozuysal et al. proposed a semi-naive Bayes based random fern classifier [11], which is built on binary features. Each binary feature is defined as a pair of pixels chosen randomly for a given reference training image size. The value of the binary feature $f_j$ is computed as:

$f_j = 1$ if $I_{d(j,1)} > I_{d(j,2)}$, and $0$ otherwise    (6.1)

where $I_{d(j,1)}$ and $I_{d(j,2)}$ are the pixel intensities at points $d(j,1)$ and $d(j,2)$ respectively. As these are very simple features, many binary features are needed for high classification accuracy.
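The binary feature of Eq. 6.1 is simple enough to sketch directly; in the Python fragment below, grayscale patches are assumed to be 2-D numpy arrays indexed as patch[row, col], and the fern_bin helper shows how a fern's N/M feature outputs are packed into a single bin index. The names and conventions are illustrative assumptions, not the thesis implementation.

import random
import numpy as np

def make_binary_feature(window_w, window_h, rng):
    # Sample the random point pair d(j,1), d(j,2) once, inside the reference window.
    p1 = (rng.randrange(window_h), rng.randrange(window_w))
    p2 = (rng.randrange(window_h), rng.randrange(window_w))
    def feature(patch):
        # Eq. 6.1: output 1 iff the intensity at the first point exceeds the second.
        return 1 if patch[p1] > patch[p2] else 0
    return feature

def fern_bin(features, patch):
    # A fern with N/M binary features maps a patch to one of 2^(N/M) bins.
    index = 0
    for f in features:
        index = (index << 1) | f(patch)
    return index

# Usage: a fern of 10 features over a 24 x 58 reference window.
rng = random.Random(0)
fern = [make_binary_feature(24, 58, rng) for _ in range(10)]
patch = np.zeros((58, 24), dtype=np.uint8)  # height x width
print(fern_bin(fern, patch))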
However, joint estimation of a large number of binary features can be computationally expensive; on the other hand, if all the binary features are considered independent, the correlation among features is ignored completely. [11] provides a tractable solution which divides the features into $M$ sub-groups, where each sub-group is called a fern and has $N/M$ binary features. For $K$ target categories, the conditional probability $P(f_1, f_2, \ldots, f_N \mid C = c_k)$ is computed as:

$P(f_1, f_2, \ldots, f_N \mid C = c_k) = \prod_{m=1}^{M} P(F_m \mid C = c_k)$    (6.2)

where $F_m$ is the set of binary features of the $m$-th fern.

Figure 6.2: An example of a random fern weak classifier with four target classes. PSD1 = positive sample distribution for category 1, NSD1 = negative sample distribution for category 1. PSD1 holds the positive online samples which belong to category 1; NSD1 holds all the negative online samples. $f_i$ are the binary features of this random fern. At a boosting iteration $t$, a random fern is selected and $h_t^k$ is computed for all target categories; $h_t$ is the vector of the $h_t^k$.

6.5.2 MBHBoost

Lin and Liu proposed MBHBoost [47] for efficient and effective multi-class boosting. For $K$ target categories of positive training samples, the training set $X$ is defined as $X = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $y_i \in \{1, \ldots, K, -1\}$ and $-1$ is the target class for negative training samples. For boosting, the training set $X$ is divided into $S$, which has $K$ subsets $\{S_1, S_2, \ldots, S_K\}$, where $S_i$ is defined as $(X_i, X_{-1})$. $X_i$ contains the positive training samples of category $i$ ($i \in \{1, 2, \ldots, K\}$) and $X_{-1}$ contains all negative training samples. For a training sample $(x_m^i, y_m^i)$ in $S_i$, $y_m^i = 1$ if the sample is a positive training sample, otherwise it is equal to $-1$.

Weights are initialized independently for each $S_i$, and a feature $\phi_t$ at iteration $t$ of the boosting is chosen as:

$\phi_t = \arg\min_{\phi \in \Theta} \frac{1}{2} \sum_{k=1}^{K} |S_k| \prod_{p=1}^{t} Z_p^k$    (6.3)

where $\Theta$ is the weak classifier feature pool and $Z_p^k$ is defined as:

$Z_p^k = \sum_{i=1}^{|S_k|} w_p^k(i), \quad w_p^k(i) = \exp(-y_i^k h_p^k(x_i^k))$    (6.4)

$h_p^k$ is the weak hypothesis for the $k$-th category. The final weak hypothesis for the $t$-th boosting iteration is defined as:

$h_t = [h_t^1 \; h_t^2 \; \ldots \; h_t^K]$    (6.5)

and a strong classifier $H$ is defined as:

$H = \sum_{t=1}^{T} h_t$    (6.6)

which is essentially a vector of $K$ strong classifiers and can be written as:

$H = [H^1 \; H^2 \; \ldots \; H^K]$    (6.7)

where $T$ is the total number of training iterations. For a detailed derivation of Eq. 6.3, please refer to [47].

Figure 6.3: A strong classifier vector is computed as an independent sum over the weak hypothesis vector dimensions; for example, $H^1 = \sum_{t=1}^{T} h_t^1$. The final strong classifier vector is defined as a vector of independent strong classifiers.
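As a sketch of how Eqs. 6.3 and 6.4 drive the selection, the fragment below evaluates each candidate weak learner against every class subset. Here candidate.predict(k, x) is a hypothetical interface returning $h^k(x)$, and the running products of earlier $Z$ values are carried in prior_products; all names are our assumptions, not MBHBoost's API.

import math

def z_factor(candidate, subset_k, weights_k, k):
    # Eq. 6.4: Z_t^k = sum_i w_t^k(i), with w_t^k(i) = exp(-y_i^k * h^k(x_i^k)).
    return sum(w * math.exp(-y * candidate.predict(k, x))
               for (x, y), w in zip(subset_k, weights_k))

def select_weak_classifier(pool, subsets, weights, prior_products):
    # Eq. 6.3: argmin over the pool of (1/2) * sum_k |S_k| * prod_{p<=t} Z_p^k.
    # prior_products[k] holds prod_{p<t} Z_p^k from earlier boosting rounds.
    best, best_score = None, float("inf")
    for cand in pool:
        score = 0.5 * sum(len(subsets[k]) * prior_products[k]
                          * z_factor(cand, subsets[k], weights[k], k)
                          for k in range(len(subsets)))
        if score < best_score:
            best, best_score = cand, score
    return best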
6.5.3 Multi-class boosted random fern adaptive classifier

Random ferns are efficient and have multi-class capability, and MBHBoost is an efficient multi-class boosting method. We combine random ferns with MBHBoost in order to train a multi-class boosted random fern adaptive classifier. A random fern maps 2D image data $x$ into $z$, where $x \in X$, $z \in R$, and $z$ has $2^{N/M}$ bins in total. For instance, a fern with 5 binary features has a 32-bin distribution for each category.

In order to integrate MBHBoost with random ferns, we divide the collected online samples into $S$, and for each $S_k \in S$ a separate distribution of positive and negative samples is initialized. Therefore, for $K$ target categories, a random fern has $2K$ distributions ($K$ for positive online samples and $K$ for negative online samples). For boosting, we consider randomly chosen ferns as the weak classifier feature pool, choosing $D$ ferns in total. At each boosting iteration, we choose a weak hypothesis $h_t$ by selecting a random fern from the feature pool using the criterion in Eq. 6.3. $h_t^k(x)$ is defined as:

$h_t^k(x) = \frac{1}{2} \log \frac{P(z_t^k \mid y_i^k = +1)}{P(z_t^k \mid y_i^k = -1)}$    (6.8)

where $z_t^k$ is the random fern observation ($z_t^k \in z^k$) for the $k$-th category. $P(z_t^k \mid y_i^k = +1)$ is computed as:

$P(z_t^k = l \mid y_i^k = +1) = \sum_{i: z_t^k = l} w_t^k(x_i^k)$    (6.9)

where $w_t^k$ is the weight of training sample $x_i^k$ ($x_i^k \in S_k$) at the $t$-th boosting iteration. The complete training method is given in Algorithm 5, and Figure 6.2 provides a graphical description of the feature and weak hypothesis vector.

Algorithm 5 Multi Class Boosted Random Fern Adaptive Classifier Training
Given: online sample set $S$; random fern feature pool $\Theta$; total number of training iterations $T$; total number of categories $K$ for positive online samples
Init: $w_t^k = \frac{1}{|S_k|}$ for $k = 1, \ldots, K$; $H = \{\}$
for $t = 1$ to $T$ do
  Select the optimal random fern with feature set $\phi_t$ using Eq. 6.3
  for $k = 1$ to $K$ do
    Compute $h_t^k(x)$ using Eq. 6.8
  end for
  $h_t = [h_t^1 \; h_t^2 \; \ldots \; h_t^K]$
  $H = H \cup h_t$
end for
Output: $H$

For testing, we apply our trained adaptive classifier to the detection responses obtained from the generic detector. The adaptive classifier validates the detection responses as true detections or false alarms. The strong classifier output for a detection response $r_i$ is computed as:

$H(r_i) = [H^1(r_i) \; H^2(r_i) \; \ldots \; H^K(r_i)]$    (6.10)

If all the $H^t(r_i)$ scores are below zero, $r_i$ is considered a false alarm; otherwise $r_i$ is considered a true detection. Algorithm 6 describes the testing process, and Figure 6.3 explains how the strong classifier output is computed from the trained adaptive classifier.

Algorithm 6 Two stage testing method
Given: a test video $P$; generic detector GD; trained adaptive classifier $H$
Init: true detection response set TR = $\{\}$; false detection response set FR = $\{\}$
Stage 1: Apply GD on $P$ to obtain detection responses $R$ for all frames
Stage 2: Validate the detection responses using the following steps:
for each $r_i$ in $R$ do
  Compute the strong classifier output vector $H = [H^1 \; H^2 \; \ldots \; H^K]$ using $H$
  if $H^t < 0$ for all $t = 1, \ldots, K$ then
    FR = FR $\cup$ $r_i$
  else
    TR = TR $\cup$ $r_i$
  end if
end for
Output: TR, FR
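The validation step of Algorithm 6 and Eq. 6.10 amounts to thresholding a vector of per-class strong scores. A minimal Python sketch follows, where each weak learner is a callable h(k, r) returning $h_t^k$ for a detection response r; this callable interface is our assumption.

def validate_detections(detections, weak_learners, num_classes):
    # Eq. 6.10 / Algorithm 6: a response is a false alarm only if every
    # per-class strong score H^k = sum_t h_t^k is below zero.
    true_responses, false_responses = [], []
    for r in detections:
        scores = [sum(h(k, r) for h in weak_learners) for k in range(num_classes)]
        if all(s < 0 for s in scores):
            false_responses.append(r)
        else:
            true_responses.append(r)
    return true_responses, false_responses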
6.6 Experiments

We evaluated our approach on the problem of human detection. In this section, we first describe the datasets and experimental setup, then we show the evaluation of the detection results.

6.6.1 Datasets and experimental setup

We show the performance on two different datasets: Mind's Eye [5] and CAVIAR [6]. Mind's Eye is a challenging dataset, as the humans in it exhibit different pose variations such as sitting, bending, digging, running, standing, etc. Sometimes the humans do not move in the video sequences, which makes it even more challenging, as background subtraction or motion extraction based methods are not applicable. We selected two sequences from this dataset, Seq1 and Seq2, and use the same ground truth as [7]. CAVIAR is also a challenging dataset, as the resolution is very low and the appearance of the humans varies greatly across frames due to illumination changes. We selected one sequence from this dataset: Seq3. Ground truth is available at [6].

We show that our approach is applicable to various generic detectors by applying it to two different kinds: a Real Adaboost [13] based generic detector [2] and the poselet detector [42]. For Mind's Eye, we used the poselet detector as the generic detector, since it can detect humans across different pose variations. Humans in the CAVIAR dataset have standing/pedestrian poses, hence the pedestrian detector [2] is used as the generic detector. Tracking parameters are the same as used in [35].

For boosting, $T$ and $D$ are set to 30 and 500 respectively. Each fern has 10 binary features. Positive online samples in the Mind's Eye dataset are divided into three different categories, as there are three different kinds of pose variations (upright, bending, digging); hence $K$ is set to 3 for this dataset. For pose classifier training, three training images are selected from the Internet in a supervised manner (one for each pose: standing, bending, digging). CAVIAR has humans in the upright pose only, hence $K$ is set to 1 and binary classification is performed.

The numbers of online samples collected from the Seq1, Seq2 and Seq3 sequences are 8565, 2773 and 4629 respectively. After the collection of online samples, the training time of our boosted multi-class random fern adaptive classifier is 917, 278 and 386 seconds for the whole Seq1, Seq2 and Seq3 sequences respectively. All our experiments were performed on a 3.16 GHz Xeon computer.

6.6.2 Detection Results

For the evaluation of detection results, we used the 50% intersection-over-union overlap criterion as in [2]. We compare our method with other state-of-the-art adaptation methods [38, 25, 7, 10] and with the generic detectors [42, 2]. Precision-recall curves for all three sequences are shown in Figure 6.4.

For the Mind's Eye dataset (Seq1 and Seq2), we compare our approach with [7]. We also compare with the case where boosting is performed without dividing the online samples into different categories. We can see from Figure 6.4 that our approach outperforms the poselet detector [42] and [7]. For Seq1, at a high recall of 0.96, our approach gives a precision of 0.76, whereas boosting without dividing the positive samples into different categories gives a precision of 0.73; [7] gives 0.7 precision, and the generic detector [42] gives 0.65. Our method improves the precision of the generic detector even when the latter is already high: when the generic detector has a precision of 0.9, our approach can still improve the precision by more than 2%.

For the Seq2 sequence, at a high recall of 0.7, the generic detector has 0.72 precision, whereas [7] gives 0.84. With two-class boosting we obtain a precision of 0.85, whereas our multi-class adaptive classifier gives a precision of 0.9. At a high recall of 0.6, the generic detector has 0.63 precision, whereas our approach has 0.83 precision.

Table 6.1: Precision improvement of different approaches. BC: boosting with binary classification (positive samples not divided into different categories); MC: boosting with multi-class classification. Corresponding recall values for the displayed precision values: Seq1: 0.7, Seq2: 0.98, Seq3: 0.65. The best-performing method is highlighted in bold.
Sequence | Generic | [38] | [7]  | Ours-BC | Ours-MC
Seq1     | 0.63    | -    | 0.73 | 0.77    | 0.83
Seq2     | 0.55    | -    | 0.59 | 0.61    | 0.69
Seq3     | 0.42    | 0.56 | 0.64 | 0.84    | -

Even when the generic detector has a high precision of 0.9, our approach gives 0.98 precision. In [7], the ferns for adaptive classifier training are chosen randomly, whereas in our approach we select discriminative ferns via boosting; hence our detection performance is significantly better than [7].
Moreover, our multi-class boosting approach performs better than boosting with all the positive online samples kept in a single category.

For Seq3, we compare our method with other state-of-the-art methods [38, 7, 25, 10] and the generic detector [2]; the comparison is shown in Figure 6.4. Our generic detector [2] performs better than [10, 25]. At a high recall of 0.65, the generic detector has a precision of 0.42, [38] gives 0.56, and [7] gives 0.64, whereas our approach gives 0.84, a significant improvement over the other state-of-the-art methods. At high precision, [38] performs slightly better than the other state-of-the-art methods.

From Table 6.1, we can see that our approach gives 20%, 14% and 42% improvement over the generic detector on Seq1, Seq2 and Seq3 respectively, which shows the significance of the discriminative boosted multi-class random fern adaptive classifier. Some of the detection results are shown in Figure 6.5.

6.7 Conclusion

We presented a method for adapting an offline trained generic detector to a specific video sequence. We proposed a multi-class boosted random fern adaptive classifier, which is discriminative and efficient. Online samples are collected in an unsupervised manner using a tracking-by-detection method. Experiments and results on the problem of human detection show that our method outperforms other state-of-the-art methods significantly. Our multi-class boosted random fern classifier can also be applied to train a generic detector, in a supervised or unsupervised manner, and can be used as an alternative to traditional random ferns. In the future, we plan to apply our approach to other categories of objects as well.

Figure 6.4: Precision-recall curves for detection results on the Mind's Eye and CAVIAR datasets. (a) Seq1; (b) Seq2; (c) Seq3.

Figure 6.5: Some detection results on Seq1 and Seq2 respectively. Green rectangle: detection response obtained from the generic detector is verified as a true detection. Red rectangle: detection response is verified as a false alarm. The first and third rows show detection results obtained from [7], whereas the second and fourth rows show detection results obtained from our method. A black arrow indicates instances where [7] does not verify false alarms correctly; an orange arrow indicates instances where [7] verifies a true detection response as a false alarm.

Chapter 7

Generic Detector Adaptation with Multi Instance Boosted Random Ferns for Robustness to Ambiguous Training Data

"The quest for certainty blocks the search for meaning. Uncertainty is the very condition to impel man to unfold his powers." Erich Fromm

7.1 Introduction

Our objective is to design a detector adaptation method which is applicable to various offline trained detectors (OTDs) and is capable of working effectively even in the presence of noisy and ambiguous online training samples.

Unsupervised detector adaptation has two important components: online sample collection and training of the adaptive classifier. Both components are critical for high detection performance, hence both should be evaluated when assessing the effectiveness of an approach. The performance of the adaptive classifier is highly dependent on the training data, so the collected online samples should incorporate as many of the appearance and pose variations present in the test data as possible.
Since positive and negative online samples are collected in an unsupervised manner, it is very likely that some of these samples are mislabeled; hence the adaptive classifier should be able to work with noisy online samples. Moreover, for practical applications an adaptation method should be applicable with various OTDs.

Based on the above observations, we propose an unsupervised detector adaptation method which collects online samples in an unsupervised manner, together with a boosted multi-instance random fern (B-MIRF) adaptive classifier. Our method addresses the following critical aspects of unsupervised detector adaptation: 1) the kind of online samples to collect from the test video; 2) the training of an adaptive classifier which is robust to noisy and ambiguous online samples; 3) the evaluation of the online sample collection mechanism in addition to the evaluation of the adaptive classifier; and 4) a generic detector adaptation method that can work with various OTDs.

Many recent detector adaptation methods [25, 27, 48, 7] only evaluate the performance of the adaptive classifier and do not provide any evaluation of the online sample collection mechanism. For an effective online sample collection mechanism, it is important to evaluate some variations of the type of online samples to be collected for training. Moreover, these methods commonly assume that the collected unsupervised online samples are correctly labeled, which is a critical assumption, as such samples are very likely to be noisy.

Multiple Instance Learning (MIL) has been used for robustness towards noisy online samples. In the MIL approach, labels are not assigned to individual training samples; multiple training samples are combined into an online bag, multiple online bags are created for each target category, and labels are assigned to the bags rather than to the individual samples in them. MIL has been used by several online object tracking methods [39, 16, 15, 50]. However, these methods are partially supervised, as they require manual labeling of the object in the first frame, which is an expensive requirement because a test video may have hundreds of different instances of the same object category. These methods focus on instance-specific adaptive classifiers, which is also an overhead in the presence of a large number of different instances. Due to these limitations, these methods are not appealing for test videos of crowded environments. Moreover, they also do not evaluate the effectiveness of their online sample collection mechanisms.

[38] proposed an incremental learning method which evaluates online sample collection and whose adaptive classifier works with noisy online samples. However, this approach is restricted to Real Adaboost [13] based OTDs, which limits its applicability to a specific type of OTD.

We propose an adaptation method which focuses on the simultaneous detection of multiple objects in a test video. We first apply an offline trained detector (OTD) at a low threshold to obtain detection responses. These detection responses are tracked using a tracking-by-detection approach, and online samples are collected using the tracking output. We collect both confident and non-confident online samples, which differs from recent adaptation methods [22, 38, 7, 50]: these methods focus on collecting only a specific type of online sample, so their training sets may miss some instances of the object, and the adaptive classifier may consequently fail to detect those instances.
To deal with possible noise in the training set, we use MIL and create bags of instances under the assumption that at least one sample in each bag has the correct label. Labels are assigned to the bags, not to individual samples. We propose a boosted multiple instance random fern classifier for the training of our adaptive classifier. The motivation for designing such an adaptive classifier is that ferns [11] have been used successfully for efficient object detection [45, 46] and classification [12]: they work with simple binary features, are efficient, have low complexity, and can easily be extended to multi-class target categories. Moreover, ferns have been used successfully for detector adaptation [7], and [46] showed that the performance of random ferns can be enhanced by using boosted random ferns instead of randomly selected ones. Motivated by these advantages, we extend traditional random ferns to incorporate MIL for robustness to noisy online samples, and apply boosting to select discriminative random ferns for high performance. Moreover, our approach is independent of the OTD, as the OTD is only used to collect the initial detection responses; hence it can easily be applied with various OTDs.

Our method makes the following contributions:

1. Our approach addresses four critical aspects collectively: the types of online samples, training with noisy online samples, evaluation of both the adaptive classifier and the online sample collection, and a generic adaptation method which can work with various OTDs.

2. We propose a B-MIRF adaptive classifier by extending random ferns to the MIL framework. Our adaptive classifier is robust to noisy online samples and focuses on selecting discriminative random ferns for classifier training.

3. An online sample collection mechanism which collects samples in such a way that the training set is a good representative of the test data.

We evaluate our approach on the problem of human detection on four public datasets. These are challenging datasets due to large pose and appearance variations, illumination changes, crowded environments and moving cameras. Our method improves Average Precision (AP) by 8%, averaged over all video sequences, as compared to the OTD, and performs better than other state-of-the-art adaptation methods.

7.2 Related Work

Several detector adaptation approaches [1, 22, 28, 21, 27, 25, 38, 7, 33, 39, 15, 29] have been proposed in recent years. Oza and Russell [28] introduced online boosting, and several other online boosting based methods [29, 39, 15, 51] followed; however, all these methods are specific to boosting based OTDs. Supervised [1] and semi-supervised [39, 15, 33] learning based methods have also been proposed, but both require manual labeling, which is not feasible for datasets with a large number of object instances.

Unsupervised detector adaptation methods [21, 22, 38, 25, 27, 7, 23] have also been proposed. Javed et al. [21] use a co-training based approach with background subtraction for online sample collection; however, for complex backgrounds and videos their method may not be applicable, because background subtraction will not be reliable. Wu and Nevatia [22] use a combination of different body parts to collect online samples in an unsupervised manner and train an online Real Adaboost classifier.
Wang and Wang [27] proposed a method for adapting an offline trained pedestrian de- tector for new environments using using information from various cues such as location, 97 size, appearance and motion. They collect confident online samples for adaptation algo- rithm. Stalder et al [23] improve a detector using cascaded confidence filtering and uti- lize the constraints on object size, background information. None of these methods ad- dress how to utilize noisy online samples to train an effective adaptive classifier. Sharma et al. [38] proposed a MIL based Real Adaboost method for incremental learning, which can deal with noisy online data. They collect online samples in an unsupervised manner and use bags of instances as input to the adaptive classifier. They introduced a MIL loss function, which enables handling of noisy online samples in a Real Adaboost frame- work. However, their method requires a high performance tracking algorithm which is usually not the case for datasets with articulated humans. Moreover, their method is applicable for Real Adaboost based OTDs only. Wang et al. [25] proposed a non-parametric detector adaptation method. Sharma and Nevatia [7] proposed a random fern based detector adaptation method. Both these approaches can work with various OTDs. However, these approaches focus on collect- ing confident detection responses as online samples and do not explicitly address how to handle noise in the online data. Moreover, these approaches focus only on the evalua- tion of adaptive classifier. Guang et al [48] proposed a method in which they apply OTD at a low threshold to collect many online samples. They divide the obtained training samples into two categories: confident and hard training samples and iteratively train an SVM classifier and add hard training examples to the confident training set. However, iterative training of SVM can be computationally expensive in nature. 7.3 Outline of our Approach We focus on improving the precision of OTD, by applying it at a low threshold and validating the obtained detection responses as true detections or false alarms via an 98 Figure 7.1: Adaptive Classifier Training Overview. Green rectangle: positive online samples,red rectangle: negative online samples. adaptive classifier. By improving precision, we can distinguish correct detections from false alarms, hence we can detect those instances of the target objects, which are missed by the OTD applied at a high threshold. Overview of our approach is shown in Figure 7.1. We first apply OTD at a low threshold on a test video and obtain detection responses for all the frames in the video. Then we apply a tracking-by-detection method to track the detection responses. Our approach does not require a very high performance tracking method, we only need few tracklets, which can track an object in some frames. Hence we use a tracking method which associates detection responses between two frames using size, color and appearance based affinities. Positive and negative samples are collected based on the overlap of detection and tracking responses. However online samples collected in an unsupervised manner are prone to labeling errors. Hence instead of assigning hard labels to online samples, we use MIL and assign labels to the bags, not to the individual instances. To create bags, we create a pool of detection responses and select detection responses randomly from this pool and include them in the bag. 
Our MIL formulation works on the assumption that at least one sample in the bag has the correct label.

We train a B-MIRF adaptive classifier using the collected online bags. Our adaptive classifier is robust to noisy online samples, and boosting provides discriminability. For testing, we apply the trained B-MIRF adaptive classifier to the detection responses obtained from the OTD, so we do not need to repeat the detection process on the video frames. This avoids the overhead of detecting objects again in the test video, as the detection process involves scanning thousands of windows in an image, which can be computationally expensive. The B-MIRF adaptive classifier is used to validate the detection responses obtained from the OTD as true detections or false alarms.

We evaluate the performance of our system on four public datasets, Mind's Eye [5], CAVIAR [6], i-LIDS [3] and Skateboard [52], for two different types of OTDs, to show that our method is applicable with different OTDs. In addition to the evaluation of the complete system, we evaluate each individual component against several baselines. We also show that our method improves the detection performance significantly compared to other state-of-the-art approaches.

7.4 Unsupervised online sample collection

Based on the detection output from the OTD, online samples can be collected in either a conservative or an aggressive manner. The conservative strategy works on the assumption that true object instances will have high detection scores, whereas false alarms will have low detection scores. Online samples collected in this manner have less labeling noise, as only confident positive and confident negative examples are added to the training set. The aggressive strategy, on the other hand, collects online samples with both high and low detection scores, and hence can represent the test data better, because a large training set captures many appearance and pose variations. However, the collected online samples can be noisier than their conservative counterparts. Hence, with the aggressive strategy it is important to train an adaptive classifier which is robust to noise, otherwise the detection performance may not be very high.

Many existing methods [27, 25] collect online samples in the conservative manner. However, online samples collected this way may not include the instances of the object which are not detected with high confidence. As Figure 7.2 shows, if we consider only the detection responses with high detection scores, the training set may not contain the instances which have low confidence scores; hence online samples collected in this manner may not be a good representative of the test data.

We collect online samples with the aggressive strategy. However, the collected online samples may have labeling noise, so we use MIL and assign labels to bags instead of individual training samples. Traditional approaches [16, 15, 38, 53, 50] create an online bag by including multiple instances of a single detection response. However, this strategy may lead to noisy online bags if the detection response itself has a labeling error. We address this issue by creating an online bag from a pool of detection responses, not just from one detection response; hence our approach is more robust to noise than the traditional practice.
In the following subsections, we describe the collection of positive and negative online samples and the creation of online bags.

7.4.1 Positive and negative online samples

We apply the OTD at a low threshold ($\theta_1$) to get the detection responses $R = (r_1, r_2, \ldots, r_m)$, where $r_i = \{l_i, s_i, g_i, n_i\}$. Here $l_i = (x_i, y_i)$ is the top-left corner of the detection bounding box, $s_i = (w_i, h_i)$ is the size of the detection response, $g_i$ is the confidence of the detection response, and $n_i$ is the frame index in the given video.

Figure 7.2: Examples of obtained detection responses. Detection responses with high confidence are shown in green; detection responses with low confidence are shown in blue. An online sample collection that only gathers detection responses with high confidence will not have the object instances marked in blue in the training set.

Using the detection responses obtained for all frames, we apply a tracking-by-detection method to obtain the track responses $T = (T_1, T_2, \ldots, T_p)$. Our approach does not rely on a high-performance tracking method; we just need some tracklets based on the association of detection responses between frames. Hence we use the method described in [54], which associates detection responses between neighboring frames using appearance, size and location based affinities. If it cannot find detection responses in a frame, a mean-shift tracker [55] is used for tracking.

To collect online samples, the overlap of the bounding boxes of $R$ and $T$ is computed for each frame. If a detection response does not overlap with any of the track responses beyond a certain threshold (> 30%) in a given frame, it is considered a false alarm and is included in the negative sample set; otherwise it is included in the positive sample set. Some of the online samples collected by our unsupervised method are shown in Figure 7.3.

Figure 7.3: Some of the online samples collected by our method. Green rectangle: positive online samples; red rectangle: negative online samples. Many of the collected online samples are noisy.

7.4.2 Online bags

Positive and negative online samples collected with the aggressive mechanism are prone to labeling error, so assigning them hard labels may lead to an ineffective adaptive classifier. Instead of assigning hard labels to the collected online samples, we create bags of instances and assign the label to the bag, not to the instances. An online sample is defined as $p_i = \{r_i, y_i\}$, where $y_i \in \{-1, +1\}$. A positive bag $B_i^+$ is defined as:

$B_i^+ = \{p_1, p_2, \ldots, p_m\}, \quad y_j = +1 \;\; \forall j = 1, \ldots, m$    (7.1)

where $p_1, p_2, \ldots, p_m$ are chosen randomly. Similarly, we create bags for negative online samples as:

$B_i^- = \{p_1, p_2, \ldots, p_m\}, \quad y_j = -1 \;\; \forall j = 1, \ldots, m$    (7.2)
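Bag construction (Eqs. 7.1 and 7.2) can be sketched in a few lines of Python; the bag size of 5 matches the experimental setting reported later, and drawing without replacement from the pooled samples is our assumption about the sampling scheme.

import random

def make_bags(sample_pool, label, bag_size=5, num_bags=None, seed=0):
    # sample_pool: online samples of one class; label: +1 (positive) or -1 (negative).
    # The label is attached to each bag, never to the individual instances.
    rng = random.Random(seed)
    if num_bags is None:
        num_bags = max(1, len(sample_pool) // bag_size)
    return [(rng.sample(sample_pool, bag_size), label) for _ in range(num_bags)]

# Usage: positive_bags = make_bags(positive_samples, +1)
#        negative_bags = make_bags(negative_samples, -1)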
7.5 Adaptive classifier training

We train a B-MIRF adaptive classifier using the collected online samples. In this section, we first recap random ferns, then describe our proposed extension of random ferns to the MIL framework, and then describe the training procedure of our B-MIRF adaptive classifier.

7.5.1 Random Ferns

Ozuysal et al. [11] proposed a random fern classifier based on binary features. If there are $N$ binary features for classifier training, these features are divided into $M$ sub-groups, each with $N/M$ features. Each sub-group is called a fern, and there are $M$ ferns in total in the random fern classifier. If there are $K$ target categories, the target category $c'$ for a given test sample is determined as:

$c' = \arg\max_{k} \prod_{i=1}^{M} P(F_i \mid C = c_k)$    (7.3)

where $F_i$ is the set of binary features of the $i$-th fern. A binary feature $f$ is a pair of points $(l_x^i, l_y^i)$ and $(l_x^j, l_y^j)$ chosen randomly for a given reference window size, where $(l_x^i, l_y^i)$ are the x and y image coordinates within that window. The output of this feature for a given sample $S_k$ is computed as:

$O(f) = 1$ if $I_{S_k}(l_x^i, l_y^i) > I_{S_k}(l_x^j, l_y^j)$, and $0$ otherwise    (7.4)

where $I(l_x^i, l_y^i)$ is the grayscale image intensity at point $(l_x^i, l_y^i)$.

7.5.2 Multi Instance Random Ferns

The traditional random ferns described above work with hard-labeled single instances, not with bags of instances. Hence we extend traditional random ferns and propose multi-instance random ferns which can work with bags of instances.

For a target category $y$, where $y \in \{-1, +1\}$, the set of bags is defined as $B^y = \{B_1^y, B_2^y, \ldots, B_n^y\}$, where $n$ is the total number of online bags for category $y$. For the given bags of instances, we assume that at least one instance in each bag has the correct label, and that the most confident sample in the bag is the one with the correct class label. For a positive bag $B_l^+$, we define the most confident sample as:

$G(B_l)^+ = \arg\max_{t} \{g_1, g_2, \ldots, g_m\}$    (7.5)

where $g_t$ is the detection confidence of the $t$-th sample in the bag. For a negative bag $B_l^-$, the most confident sample is defined as:

$G(B_l)^- = \arg\min_{t} \{g_1, g_2, \ldots, g_m\}$    (7.6)

In this way, we take the instance with the maximum detection confidence from a positive bag and the instance with the minimum detection confidence from a negative bag. The feature output for a positive bag $B_l^+$ with most confident sample $G(B_l)^+$ is computed as:

$O(f) = 1$ if $I_{G(B_l)^+}(l_x^i, l_y^i) > I_{G(B_l)^+}(l_x^j, l_y^j)$, and $0$ otherwise    (7.7)

Similarly, the feature output for a negative bag $B_l^-$ with most confident sample $G(B_l)^-$ is computed as:

$O(f) = 1$ if $I_{G(B_l)^-}(l_x^i, l_y^i) > I_{G(B_l)^-}(l_x^j, l_y^j)$, and $0$ otherwise    (7.8)
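Equations 7.5-7.8 reduce each bag to a single representative instance before any fern feature is evaluated. A small sketch follows, assuming each instance is a dict carrying its detection confidence g and its image patch; this instance layout is our assumption, not the thesis data structure.

def most_confident(bag, label):
    # Eq. 7.5 for positive bags (max confidence), Eq. 7.6 for negative bags
    # (min confidence).
    key = lambda p: p["confidence"]
    return max(bag, key=key) if label > 0 else min(bag, key=key)

def bag_feature_output(feature, bag, label):
    # Eqs. 7.7 / 7.8: the binary feature is evaluated only on the patch of
    # the bag's representative instance.
    return feature(most_confident(bag, label)["patch"])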
7.5.3 Training of the Boosted Multi Instance Random Fern Adaptive Classifier

First we obtain the bags of instances $B = (B^+, B^-)$ for the whole test video from the online sample collection mechanism; then we train the adaptive classifier using the collected bags. Only the collected unsupervised online samples are used for training, and the training process is performed only once for a given test video. Our adaptive classifier selects the most discriminative random ferns using Real Adaboost [13], for which the loss function is defined as:

$L = \exp(-y H(B_l^y))$    (7.9)

where $y \in \{-1, +1\}$, $H$ is the strong classifier output, and $B_l^y$ is the $l$-th bag in the training set $B$. Using the definition of the most confident positive and negative samples from Eqs. 7.5 and 7.6, this loss function can be written as:

$L = \exp(-y H(G(B_l)^y))$    (7.10)

Hence the classifier output is computed for the most confident sample in the bag. In Real Adaboost, an instance space $X$ is divided into $S$ different sub-spaces, and the output of a weak classifier $h_t(x)$ is computed as:

$h_t(x) = \frac{1}{2} \log \frac{P(z_t = s \mid +1)}{P(z_t = s \mid -1)}$    (7.11)

where

$P(z_t = s \mid y) = \sum_{i: z_t(G(B_i)^y) = s} D_t(i)$    (7.12)

where $s \in S$ and $D_t(i)$ is the weight of $G(B_i)^y$. During each training iteration $t$, a weak classifier $h_t(x)$ from the classifier pool $V$ is selected by minimizing the Bhattacharyya coefficient between $P(z_t = s \mid +1)$ and $P(z_t = s \mid -1)$, i.e., the Real Adaboost normalization factor. We construct a weak classifier pool comprised of multi-instance random ferns, and $S$ is chosen as $2^{N/M}$. In each iteration, one multi-instance random fern is selected as the weak classifier.
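One selection round of the B-MIRF training can then be sketched as follows, reusing the fern_bin and most_confident helpers from the earlier sketches; the candidate fern that minimizes the Bhattacharyya coefficient sum_s sqrt(P(s|+1) P(s|-1)) over the weighted bin distributions of Eq. 7.12 is kept. The helper names and data layout are our assumptions.

import math

def weighted_distributions(fern, pos_bags, neg_bags, pos_weights, neg_weights, num_bins):
    # Eq. 7.12: accumulate bag weights into the bin hit by each bag's
    # representative instance (Eqs. 7.5-7.8).
    p_pos = [0.0] * num_bins
    p_neg = [0.0] * num_bins
    for bag, w in zip(pos_bags, pos_weights):
        p_pos[fern_bin(fern, most_confident(bag, +1)["patch"])] += w
    for bag, w in zip(neg_bags, neg_weights):
        p_neg[fern_bin(fern, most_confident(bag, -1)["patch"])] += w
    return p_pos, p_neg

def select_fern(pool, pos_bags, neg_bags, pos_weights, neg_weights, num_bins):
    # Keep the candidate fern whose positive and negative bin distributions
    # are most separated, i.e. with the smallest Bhattacharyya coefficient.
    def coefficient(fern):
        p_pos, p_neg = weighted_distributions(fern, pos_bags, neg_bags,
                                              pos_weights, neg_weights, num_bins)
        return sum(math.sqrt(a * b) for a, b in zip(p_pos, p_neg))
    return min(pool, key=coefficient)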
7.6 Experiments

We evaluated our method on the problem of human detection. In this section, we first describe the experimental setup, then we present the detection performance results.

Datasets: We used four publicly available datasets: Mind's Eye [5], CAVIAR [6], i-LIDS [3] and the Skateboard dataset [52]. Mind's Eye and Skateboard contain articulated humans, whereas humans in CAVIAR and i-LIDS have pedestrian poses. The poselet detector [42] has been used for detecting articulated humans, hence we use poselets as the OTD for the Mind's Eye and Skateboard datasets. [2] has shown good performance on pedestrian detection, hence we use this detector as the OTD for the CAVIAR and i-LIDS datasets.

Experimental settings and computational efficiency: For adaptive classifier training, we train 25 random ferns, each with 12 binary features. For training we prepared a feature pool ($V$) of 1500 randomly chosen ferns. The poselet detector is applied at a low threshold of 0.5; the J-RoG detector is applied at a low threshold of -1. Each online bag has 5 training-sample instances. All collected online samples are resized to a 24 x 58 reference window for adaptive classifier training. Tracking parameters are the same as used in [54]. For the training of the adaptive classifier, we used only the unsupervised online training samples collected by our method; none of the training samples used for training the OTD are used. After the collection of online samples, it takes approximately 180 seconds to train the B-MIRF classifier with 1000 online bags. All our experiments were performed on a 3.16 GHz Xeon computer.

For detection performance evaluation, we use the 50% overlap criterion as in [48]. We evaluate both the online sample collection and the adaptive classifier performance; the following sections provide these evaluations and then summarize them.

7.6.1 Online Sample Collection Evaluation

We evaluated the performance of the online sample collection mechanism on the Mind's Eye dataset. We compare our mechanism with the case where only confident online samples are collected from the test video. For this experiment, a positive online sample is considered confident if its detection confidence score is higher than 5; a negative online sample is considered confident if its detection confidence score is lower than 1. We use the B-MIRF adaptive classifier for this experiment.

From Figures 7.4 and 7.5 (B-MIRF and B-MIRF (confident only)), we can see that for all four test video sequences our online sample collection mechanism performs better than collecting only confident online samples. When only confident online samples are collected, the object instances that are not detected with high detection confidence are not included in the training set, and therefore the adaptive classifier cannot verify these instances as correct detections.

7.6.2 Adaptive Classifier Performance Evaluation

For the evaluation of our adaptive classifier, we used both confident and non-confident online samples collected from the test video. Performance is shown on all four datasets: Mind's Eye, CAVIAR, i-LIDS and Skateboard. In addition to the OTD and state-of-the-art methods, we compare the performance of our adaptive classifier with the following baselines:

Single Instance Random Fern (SIRF): no MIL is used; hard-labeled training samples are used to train a random fern classifier without any boosting.

Multi Instance Random Fern (MIRF): instead of assigning hard labels to the training samples, we use bags of instances and train a multi-instance random fern classifier, skipping the boosting process.

Boosted Single Instance Random Fern (B-SIRF): a boosted random fern classifier is trained with hard-labeled training examples.

7.6.2.1 Mind's Eye

The Mind's Eye dataset is challenging as it contains humans in various poses such as sitting, standing/walking, bending, digging, etc. We used four sequences from this dataset: MES1, MES2, MES3 and MES4. All images are of size 1280 x 720. MES1, MES2, MES3 and MES4 have 425, 300, 150 and 310 frames respectively. We manually annotated these sequences for ground truth (GT). MES1 has 829 GT human instances, MES2 has 300, and MES3 and MES4 have 450 and 997 respectively.

Recall-precision curves for all four sequences are shown in Figures 7.4 and 7.5, and the detection Average Precision (AP) is shown in Table 7.1. B-MIRF gives the best performance on all sequences. The MIRF classifier improves performance more than the SIRF classifier, and the boosting process further improves performance for both the MIRF and SIRF classifiers. B-MIRF performs better than the OTD and also outperforms [7]. B-MIRF gives maximum precision improvements over the OTD of approximately 27%, 26%, 22% and 20% on the MES1, MES2, MES3 and MES4 sequences respectively, whereas [7] gives maximum precision improvements of only 5%, 12%, 3% and 8%. B-MIRF gives a best AP improvement of 13% and an average AP improvement of 7% on the Mind's Eye dataset. Some detection results are shown in Figure 7.9.

7.6.2.2 CAVIAR

We used the sequence OneLeaveShop2front (CAV1) from the CAVIAR dataset to compare our approach with other state-of-the-art adaptation methods [25, 38, 7]. CAV1 has 1200 frames, each of size 388 x 244. Humans in this sequence are primarily pedestrians. Ground truth (GT) for this sequence is available at [6] and contains 290 instances of humans.

Figure 7.4: Recall-precision curves for the MES1 and MES2 sequences.

We compare the performance with other state-of-the-art methods [25, 38, 10, 7]; the recall-precision curve for this sequence is shown in Figure 7.6. B-MIRF outperforms the baseline methods, the other state-of-the-art methods [10, 25, 38, 7], and the OTD [2]. B-MIRF gives a maximum precision improvement over the OTD of 42%, whereas [7] and [38] give maximum precision improvements of 22% and 14% respectively. B-MIRF obtains 0.67 AP, whereas the OTD [2] has an AP of 0.55. [10] gives an AP of 0.48, [25] gives 0.55, [38] gives 0.60, and [7] gives 0.61.

Figure 7.5: Recall-precision curves for the MES3 and MES4 sequences.

7.6.2.3 i-LIDS

We used the i-LIDS Easy (iLIDS) sequence, which has 5220 frames, each of size 720 x 576, with 9716 humans; GT annotation for this sequence is available at [3].
We compare performance with other state-of-the-art methods [23, 38]. We can see from Figure 7.7 that B-MIRF outperforms the baseline methods and the OTD [2]. B-MIRF gives a maximum precision improvement over the OTD of 16%, whereas [38] gives a maximum precision improvement of 13%. B-MIRF improves the AP of the OTD from 0.74 to 0.76, whereas [23] gives an AP of 0.57. [38] has slightly better AP performance, whereas our method performs better in terms of maximum precision improvement. However, [38] uses a sophisticated tracking method [35] which requires a camera model and entry-exit points, whereas our approach achieves comparable performance using a simple tracking method that does not require any scene-specific information.

Table 7.1: Average Precision performance

Sequence | OTD  | SIRF | MIRF  | B-SIRF | B-MIRF
MES1     | 0.94 | 0.96 | 0.965 | 0.97   | 0.98
MES2     | 0.74 | 0.74 | 0.75  | 0.79   | 0.87
MES3     | 0.72 | 0.75 | 0.76  | 0.76   | 0.79
MES4     | 0.88 | 0.88 | 0.90  | 0.90   | 0.92
CAV1     | 0.55 | 0.62 | 0.66  | 0.67   | 0.67
iLIDS    | 0.74 | 0.74 | 0.75  | 0.75   | 0.76
SKB1     | 0.61 | 0.72 | 0.72  | 0.75   | 0.77

[Figure 7.7: Recall-Precision curve for the iLIDS sequence]

7.6.2.4 Skateboard:

We used the skateboard1 sequence (SKB1) from the Skateboard dataset [48]. This is a challenging sequence, as there are severe pose changes across the sequence and the camera is in motion. SKB1 has 367 frames, each of size 1280x720. GT is the same as used in [48]; every third frame is annotated. It is hard to define precise GT boundaries for this sequence due to articulations. As we can see from Figure 7.9 (row 3), the person in the sequence has arms stretched in some frames and legs stretched in others. Hence, if we consider the whole-body bounding box as GT, a significant part of the bounding box will not belong to the human body, whereas if we consider only the upright torso as GT, we miss some parts of the human which may appear in the detection. For these reasons, it is less meaningful to evaluate this sequence with the usual 50% overlap criterion [48].

We evaluate detection performance at two different thresholds (40% and 50% overlap criteria), keeping the GT the same as used in [48]. Recall-Precision curves for this sequence are shown in Figure 7.8. For this sequence, B-MIRF gives an 8% improvement over the baseline at the 0.5 threshold; at the 0.4 threshold, B-MIRF has 0.77 AP, whereas the OTD has 0.61 AP. [48] reports 0.76 AP for this sequence, whereas [56] has 0.6 AP.

[Figure 7.8: Recall-Precision curves for the Skateboard sequence, SKB1 at the 0.5 (a) and 0.4 (b) overlap thresholds]

7.6.3 Evaluation Summary

Our approach outperforms several state-of-the-art methods on four challenging datasets. We also evaluate the individual modules (online sample collection and the adaptive classifier) in addition to the overall system, to show the effectiveness of our module design. Our method relies on two assumptions: for each object instance we can obtain some tracklets, and at least one sample in each bag has the correct label. Although parameters such as the number of weak classifiers in the feature pool, the number of strong classifiers and the bag size are set empirically, our approach is not very sensitive to these parameters. Our method has practical applications: it can be used as a pre-processing step for tracking-by-detection approaches, and it can be applied with various OTDs.
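For reference, the overlap criterion used in these evaluations is commonly implemented as the intersection-over-union of a detection box and a GT box. The following is a minimal sketch, assuming boxes are given as corner coordinates (x1, y1, x2, y2); the function names are illustrative rather than taken from the evaluation code of [48].

def overlap_ratio(det, gt):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((det[2] - det[0]) * (det[3] - det[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_correct_detection(det, gt, thresh=0.5):
    # thresh=0.4 reproduces the relaxed criterion used for SKB1
    return overlap_ratio(det, gt) >= thresh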
One limitation of our approach is that if the tracking-by-detection method completely fails to capture some specific object pose variations, the online sample collection will not contain any instances corresponding to those pose variations; hence the adaptive classifier may not verify detection responses from the OTD as correct detections in such cases.

7.7 Conclusion

We presented a method for unsupervised detector adaptation for object detection in video, which improves the detection performance of an offline-trained object detector. For a given test video, we collect unsupervised online samples and train a boosted multi-instance random fern adaptive classifier, which can validate the detection responses obtained from the offline-trained detector as true detections or false alarms. Our approach can work with various offline-trained detectors. In experiments, we show that our method gives better performance than other state-of-the-art methods.

[Figure 7.9: Some detection results. Green: detection output from the OTD validated as a true detection by our method. Red: detection output from the OTD validated as a false alarm by our method.]

Chapter 8

Conclusion and Future Work

"Adapt or perish, now as ever, is nature's inexorable imperative." H. G. Wells

We presented incremental learning and detector adaptation methods for improving the performance of offline-trained detectors for video object detection. We designed and developed novel adaptation approaches and improved on existing approaches as well. We demonstrated the effectiveness of our approach by applying it to the problem of human detection on several standard benchmarks and compared our performance with existing state-of-the-art methods.

We proposed an efficient incremental learning approach for cascades of boosted classifiers, which adjusts the parameters of an offline-trained cascade using only the collected online samples. We modify one parameter at a time to obtain a closed-form solution for parameter optimization; hence our approach is computationally efficient compared to existing methods that rely on computationally expensive iterative optimization of the parameters. Another advantage of this approach is that it does not require any offline training data for the classifier update, which further reduces its computational cost.

We extended this supervised incremental learning framework and proposed an unsupervised incremental learning method for Real Adaboost based offline-trained detectors. In this approach, online samples are collected in an unsupervised manner using tracking information. We utilize Multiple Instance Learning (MIL) to deal with ambiguous and noisy online data, and propose a MIL loss function to incorporate multiple instances into the standard Real Adaboost framework. This approach is robust to noisy online samples; however, its applicability is limited to Real Adaboost based offline-trained detectors.

In Chapter 5, we proposed an efficient random fern based adaptive classifier, which can be applied to various kinds of offline-trained detectors and hence is not tied to a specific one. Additionally, this adaptive classifier can handle various pose and articulation variations of the object. In this approach, we focus on improving the precision of the offline-trained detector by applying it at a low threshold to collect many detection responses.
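The next paragraph describes how these responses are verified. As a minimal sketch of the overall low-threshold-plus-verification pipeline: the detector and classifier interfaces here (otd.detect, adaptive_clf.predict, d.patch) are hypothetical stand-ins, not the actual OTD or adaptive-classifier APIs.

def detect_and_verify(frame, otd, adaptive_clf, low_threshold=-1.0):
    """Apply the offline-trained detector permissively, then keep only
    the responses the adaptive classifier accepts as true detections."""
    # Low threshold: collect many responses, including correct ones that
    # a conservative threshold would miss, at the cost of false alarms.
    candidates = otd.detect(frame, threshold=low_threshold)
    # The adaptive classifier removes the extra false alarms.
    return [d for d in candidates
            if adaptive_clf.predict(d.patch) == +1]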
These detection responses are then provided to the trained adaptive classifier, which either accepts a response by classifying it as a true detection or rejects it by classifying it as a false alarm. In this manner, instances that would be missed if the detector were applied at a high threshold can still be detected, while the additional false alarms are removed.

For training of the adaptive classifier, online samples are collected in an unsupervised manner and positive online samples are divided into different categories. A multi-class random fern adaptive classifier is then trained using the collected online samples. We evaluated this approach on the problem of human detection and showed that it outperforms other state-of-the-art methods.

We further improved this method by proposing a multi-class boosted random fern adaptive classifier. Boosting adds discriminability to the adaptive classifier, which improves performance. However, this approach assigns hard labels to the collected online samples, whereas online samples collected in an unsupervised manner are very likely to have labeling errors.

We addressed this issue by proposing a boosted multi-instance random fern adaptive classifier, which can be applied with various offline-trained detectors, selects discriminative random ferns during training, and is robust to noisy training data. In this approach, the collected online samples are divided into bags, and labels are assigned to the bags rather than to the individual samples. We proposed multiple instance random ferns, which can work with bags of instances in the training data, use them as the weak classifier pool, and train a boosted multiple instance random fern adaptive classifier using Real Adaboost. In experiments, we show that this approach outperforms other state-of-the-art methods on several standard benchmarks.

To summarize, we proposed supervised and unsupervised incremental learning methods for the Real Adaboost framework. Our incremental learning method works effectively in the presence of noisy and ambiguous data and detects multiple instances of the target object simultaneously. We then proposed a method for generic detector adaptation that can work with various kinds of offline-trained detectors and can also handle different articulation and pose variations. We further proposed a boosted multi-instance random fern based detector adaptation method, which is not only generic but can also work in the presence of noisy and ambiguous training data.

While we have explored different critical aspects of detector adaptation and shown promising results, our approaches can be extended in several directions. The following are some directions to consider in future work:

8.1 Generalization across different videos

In our current design, we adapt an offline-trained detector to a specific video; this is a limitation of our work, as we have to train a new adaptive classifier for each novel test video. Moreover, the trained adaptive classifier does not incorporate knowledge from previously learned adaptive classifiers.

In our unsupervised incremental learning approach, we train a new adaptive classifier for each video segment from which we collect online samples, whereas in our random fern based approaches, we perform the training process only once for the whole video sequence.
In the future, we plan to extend our adaptation method so that it generalizes across multiple video sequences, rather than being specific to a single one. For this purpose, we plan to utilize information from previously learned adaptive classifiers, i.e., the unsupervised samples collected from video sequences that have already been used for adaptation; we would combine these with the online samples collected from the current video sequence and train an adaptive classifier on the combined set.

8.2 Noise tolerance with multiple instance learning techniques

While multiple instance learning allows effective training with noisy and ambiguous data, performance can degrade quickly if the training data is heavily noisy and contains very few good online samples. Analyzing the performance of the adaptive classifier under different levels of noise would help in understanding the robustness of different classifiers to noisy data. Such an analysis could help an end user select an appropriate adaptive classifier by estimating the noise in new test data.

8.3 Integration with online tracking

In our current framework, we collect online samples from the whole video before training an adaptive classifier. This can be extended to online tracking and detection: online tracking can provide online samples for the next few frames; using these samples, the adaptive classifier can be updated, which in turn helps online tracking by generating more reliable track hypotheses. In this manner, not only can the adaptive classifier be trained in an online manner, it can also help improve the performance of online tracking.

Bibliography

[1] C. Huang, H. Ai, T. Yamashita, S. Lao, and M. Kawade, “Incremental learning of boosted face detector,” in ICCV, 2007.

[2] C. Huang and R. Nevatia, “High performance object detection by collaborative learning of joint ranking of granules features,” in CVPR, 2010.

[3] http://www.eecs.qmul.ac.uk/~andrea/avss2007d.html

[4] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in CVPR, 2008.

[5] http://www.visint.org/

[6] http://groups.inf.ed.ac.uk/vision/CAVIAR/CAVIARDATA1/

[7] P. Sharma and R. Nevatia, “Efficient detector adaptation for object detection in a video,” in CVPR, 2013.

[8] P. Viola and M. Jones, “Robust real-time object detection,” International Journal of Computer Vision, 2001.

[9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.

[10] X. Wang, T. X. Han, and S. Yan, “An HOG-LBP human detector with partial occlusion handling,” in ICCV, 2009.

[11] M. Ozuysal, M. Calonder, V. Lepetit, and P. Fua, “Fast keypoint recognition using random ferns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

[12] A. Bosch, A. Zisserman, and X. Muñoz, “Image classification using random forests and ferns,” in ICCV, 2007.

[13] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. 37, pp. 297–336, December 1999.

[14] P. Kumar Mallapragada, R. Jin, A. Jain, and Y. Liu, “SemiBoost: Boosting for semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2000–2014, Nov 2009.

[15] B. Zeisl, C. Leistner, A. Saffari, and H. Bischof, “On-line semi-supervised multiple-instance boosting,” in CVPR, 2010.

[16] B. Babenko, M.-H. Yang, and S. Belongie, “Visual tracking with online multiple instance learning,” in CVPR, 2009.
[17] T. B. Dinh, N. Vo, and G. G. Medioni, “Context tracker: Exploring supporters and distracters in unconstrained environments,” in CVPR, 2011.

[18] H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line boosting for robust tracking,” in ECCV, 2008.

[19] Z. Kalal, J. Matas, and K. Mikolajczyk, “P-N learning: Bootstrapping binary classifiers by structural constraints,” in CVPR, 2010.

[20] V. Nair and J. J. Clark, “An unsupervised, online learning framework for moving object detection,” in CVPR, 2004.

[21] O. Javed, S. Ali, and M. Shah, “Online detection and classification of moving objects using progressively improving detectors,” in CVPR, 2005.

[22] B. Wu and R. Nevatia, “Improving part based object detection by unsupervised, online boosting,” in CVPR, 2007.

[23] S. Stalder, H. Grabner, and L. Van Gool, “Cascaded confidence filtering for improved tracking-by-detection,” in ECCV, 2010.

[24] M. Wang and X. Wang, “Automatic adaptation of a generic pedestrian detector to a specific traffic scene,” in CVPR, 2011.

[25] X. Wang, G. Hua, and T. Han, “Detection by detections: Non-parametric detector adaptation for a video,” in CVPR, 2012.

[26] A. Kembhavi, B. Siddiquie, R. Miezianko, S. McCloskey, and L. Davis.

[27] M. Wang, W. Li, and X. Wang, “Transferring a generic pedestrian detector towards specific scenes,” in CVPR, 2012.

[28] N. C. Oza and S. Russell, “Online bagging and boosting,” in Artificial Intelligence and Statistics. Morgan Kaufmann, 2001.

[29] H. Grabner and H. Bischof, “On-line boosting and vision,” in CVPR, 2006.

[30] A. J. Joshi and F. Porikli, “Scene-adaptive human detection with incremental active learning,” in ICPR, pp. 2760–2763, 2010.

[31] G. Denina, B. Bhanu, H. T. Nguyen, C. Ding, A. Kamal, C. Ravishankar, A. Roy-Chowdhury, A. Ivers, and B. Varda, “VideoWeb dataset for multi-camera activities and non-verbal communication,” in Distributed Video Sensor Networks, B. Bhanu, C. V. Ravishankar, A. K. Roy-Chowdhury, H. Aghajan, and D. Terzopoulos, Eds. Springer London, 2011, pp. 335–347.

[32] C. Huang, H. Ai, B. Wu, and S. Lao, “Boosting nested cascade detector for multi-view face detection,” in ICPR, 2004.

[33] A. Levin, P. A. Viola, and Y. Freund, “Unsupervised improvement of visual detectors using co-training,” in ICCV, 2003.

[34] P. Viola, J. C. Platt, and C. Zhang, “Multiple instance boosting for object detection,” in NIPS, 2006.

[35] C. Huang, B. Wu, and R. Nevatia, “Robust object tracking by hierarchical association of detection responses,” in ECCV, 2008.

[36] http://www.chokkan.org/software/liblbfgs/

[37] P. M. Roth, S. Sternig, H. Grabner, and H. Bischof, “Classifier grids for robust adaptive object detection,” in CVPR, 2009.

[38] P. Sharma, C. Huang, and R. Nevatia, “Unsupervised incremental learning for improved object detection in a video,” in CVPR, 2012.

[39] H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line boosting for robust tracking,” in ECCV, 2008.

[40] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, “Part-based multiple-person tracking with partial occlusion handling,” in CVPR, 2012.

[41] B. Yang and R. Nevatia, “An online learned CRF model for multi-target tracking,” in CVPR, 2012.

[42] L. Bourdev and J. Malik, “Poselets: Body part detectors trained using 3D human pose annotations,” in ICCV, 2009.

[43] S. Theodoridis and K. Koutroumbas, Pattern Recognition. Elsevier Science, 2003.
[44] http://www.cs.berkeley.edu/~lbourdev/poselets/

[45] M. Villamizar, F. Moreno-Noguer, J. Andrade-Cetto, and A. Sanfeliu, “Shared random ferns for efficient detection of multiple categories,” in ICPR, 2010.

[46] M. Villamizar, F. Moreno-Noguer, J. Andrade-Cetto, and A. Sanfeliu, “Efficient rotation invariant object detection using boosted random ferns,” in CVPR, 2010.

[47] Y.-Y. Lin and T.-L. Liu, “Robust face detection with multi-class boosting,” in CVPR, 2005.

[48] G. Shu, A. Dehghan, and M. Shah, “Improving an object detector and extracting regions using superpixels,” in CVPR, 2013.

[49] A. Torralba, K. P. Murphy, and W. T. Freeman, “Sharing features: Efficient boosting procedures for multiclass object detection,” in CVPR, 2004.

[50] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[51] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.

[52] http://www.nist.gov/itl/iad/mig/med12.cfm

[53] C. Leistner, A. Saffari, and H. Bischof, “MIForests: Multiple-instance learning with randomized trees,” in ECCV, 2010.

[54] B. Wu and R. Nevatia, “Tracking of multiple, partially occluded humans based on static body part detection,” in CVPR, 2006.

[55] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.

[56] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.