SPATIO-TEMPORAL PROBABILISTIC INFERENCE FOR PERSISTENT OBJECT DETECTION AND TRACKING

by

Qian Yu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2009

Copyright 2009 Qian Yu

Acknowledgements

I would like to express my deepest gratitude to my advisor, Professor Gérard Medioni. I benefited greatly from his insightful ideas, warm support and invaluable guidance. His encouragement and enthusiasm for fundamental research influenced me deeply. Without him, this thesis work could not have been accomplished.

I sincerely thank Professor Ram Nevatia for his constructive advice and comments on my work. Special thanks to Dr. Isaac Cohen for his guidance during the first year of my PhD study. I would also like to thank Professor Boris Rozovsky and Professor Alexander Tartakovsky for the discussions and help in Nonlinear Optimal Bayesian Inference. I would also like to thank Professor Ram Nevatia, Professor James Moore, Professor Karen Liu and Professor Alexander Francois for serving on the committees of my qualifying exam and/or my defense.

My PhD study at USC was an enjoyable learning experience thanks to the help and friendship of my colleagues in the Computer Vision Group, especially Jinman Kang, Sungchun Lee, Chang Yuan, Changki Min, Tae Eun Choe, Bo Wu, Wei-Kai Liao, Yu Ping Lin, Li Zhang, Thang Ba Dinh and Yuan Li. I enjoyed every moment with them.

I am also very thankful for the understanding and unconditional support of my lovely wife, Xiaoying Ji, and my family in China. I truly dedicate this dissertation to them.

Table of Contents

Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
  1.1 Goals
  1.2 Tasks and Difficulties
  1.3 Reader's Guide
Chapter 2: Background
  2.1 Object Representation
  2.2 Various Formulations
    2.2.1 Bayesian Inference in Tracking
    2.2.2 Combinatorial Optimization in Tracking
  2.3 Related Issues in Tracking
    2.3.1 Initialization
    2.3.2 Decision Delay and Sliding Window Techniques
    2.3.3 Real-time Issues

Chapter 3: Data Association in Multiple Target Tracking
  3.1 Motivation and Introduction
  3.2 A Bayesian Formulation of Multiple Target Tracking
    3.2.1 Anatomy of the Problem
    3.2.2 Maximum a Posteriori (MAP)
      3.2.2.1 Prior Model
      3.2.2.2 Joint Motion and Appearance Likelihood
  3.3 Computing the MAP Estimate by Data-Driven MCMC
    3.3.1 An Introduction to MCMC
    3.3.2 Spatio-temporal MCMC Data Association
    3.3.3 Data-driven Markov Chain Dynamics
  3.4 A Practical Approach to Determine the MAP Model
    3.4.1 Order-preserving Constraints
    3.4.2 Implementation Using MCMC Dynamics
  3.5 Extension with Model Information
    3.5.1 Extension of MAP with Model Likelihood
    3.5.2 Using Model Information as Tracking Indicators
  3.6 Extension with Hierarchical Data Association
    3.6.1 Motivation
    3.6.2 Tracklet Association
  3.7 Implementation and Results
    3.7.1 Implementation Details
    3.7.2 Performance Evaluation and Comparison
    3.7.3 Remarks

Chapter 4: Online Appearance Modeling for Visual Tracking
  4.1 Motivation
    4.1.1 Generative vs. Discriminative
    4.1.2 Global vs. Local
    4.1.3 Online vs. Offline
    4.1.4 Overview of Our Approach
  4.2 A Bayesian Co-training Framework for Online Appearance Modeling
    4.2.1 A Hybrid Discriminative and Generative Model
    4.2.2 Co-training for Semi-supervised Learning
  4.3 Generative Tracker with Multiple Linear Subspaces
    4.3.1 Global Model with Multiple Local Linear Subspaces
    4.3.2 Online Subspace Learning
      4.3.2.1 Subspace Merging
      4.3.2.2 Subspace Distance
  4.4 Discriminative Tracker Using Online SVM
  4.5 Implementation and Results
    4.5.1 Implementation Details
    4.5.2 Performance Evaluation and Comparison

Chapter 5: Application: Detection and Tracking of Moving Vehicles from Airborne Cameras
  5.1 Introduction and Goal
  5.2 Motion Detection from a Moving Platform
    5.2.1 Image Stabilization
    5.2.2 Compensation of Illumination Changes
    5.2.3 Background Modeling
  5.3 Geo-registration and Tracking Coordinates
  5.4 Implementation and Results
    5.4.1 GPU Acceleration of Background Modeling
    5.4.2 Experimental Results
  5.5 Conclusion and Discussion

Chapter 6: Conclusion
  6.1 Summary of Contributions
  6.2 Future Directions
    6.2.1 Tradeoff between Optimality and Efficiency
    6.2.2 General Motion Pattern Analysis
    6.2.3 Combining Tracking with Segmentation
    6.2.4 Towards Events

References

List Of Tables

3.1 Comparative results on three real data sets. Method 1: JPDAF in [39]; Method 2: the proposed method.
4.1 Comparison of different methods. G1: IVT [51], G2: incremental learning of multiple subspaces, D1: online selection of discriminative color features [17], D2: online SVM, E.T.: ensemble tracking [4]. D1 uses color information, which is not available for Seq1 and Seq6.
List Of Figures

1.1 Typical scenarios
2.1 HMM model
2.2 A toy assignment example: one target can be associated with at most one observation and vice versa
3.1 Segmentation of foreground regions in space-time by the use of motion and appearance smoothness
3.2 Two ways to represent foreground regions
3.3 One possible cover of the observations, which includes two tracks (τ1, τ2). The dashed rectangles represent the covering rectangles of foreground regions. The uncovered regions correspond to false alarms.
3.4 Illustration of neighborhood and association likelihood, where τk(t3) has three neighbors.
3.5 Illustration of a diffusion move: the RGB color histogram is quantized into 16×16×16 bins; the weight image is the backprojection of the color histogram, masked by foreground regions; the blue dashed rectangle indicates the prediction from the motion model, the red arrow is the spatial-scale mean-shift vector, and the dashed red rectangle shows the proposal of a diffusion move.
3.6 Illustration of segmentation and aggregation moves, where color indicates object ID, dashed boxes indicate the estimated rectangles from the motion model, and regions with red boundaries are original foreground regions.
3.7 Average normalized error ||Ĉ − C|| / ||C|| of the estimated parameters Ĉ with different numbers of constraints.
3.8 One possible interpretation, or meaningful covering, of the foreground regions over time, which includes three tracks (τ1, τ2, τ3). τ0 is not shown for clarity. The frame rate of the overlaid foreground regions is down-sampled.
3.9 Edgelet Adaboost training
3.10 Motion and pattern detection
3.11 Appearance descriptor of tracklets
3.12 (a) the first blob of different tracklets; (b) the confusion matrix of tracklets
3.13 Simulation result with L = 200, N = 7, FA = 7 and T = 50. Colored rectangles indicate the IDs of targets. Targets may split or merge when they appear.
3.14 (a) STDA as a function of N, the maximum number of targets; (b) STDA as a function of FA, the number of false alarms
3.15 SFDA and runtime (seconds) for online/offline, different window sizes W and numbers of samplings n_mc. L = 200, FA = 0 and T = 1000
3.16 Average STDA score change with parameter variations
3.17 Comparison of bi-directional inference and single-direction inference
3.18 Experimental results of real scenarios from both stationary cameras and Unmanned Aerial Vehicle (UAV) cameras
4.1 Online co-training of a generative tracker and a discriminative tracker with different life spans (the area bounded by dashed red boxes indicates the background)
4.2 3D example of incrementally updating subspaces
4.3 Tracking and reacquisition with long occlusion and cluttered background
4.4 Tracking various types of objects in outdoor environments
4.5 Comparison of generative methods in indoor environments
5.1 Overview of our approach
5.2 Compensation of global illumination change
5.3 Computation of the statistics of each pixel
5.4 Some motion detection results. The first column shows the reference frame in a sliding window (also the center frame); the second column shows the estimated background of the reference frame; the third column shows the difference image between the reference frame and the estimated background.
5.5 Geo-mosaicing 2000 consecutive frames on top of the reference frame
5.6 Overview of the GPU-based implementation
5.7 Procedure to compute the mode
5.8 Comparison with and without geo-registration
5.9 The tracklets and tracks obtained using the local and global data association framework. The UAV image sequence is overlaid on top of the satellite image.

Abstract

Tracking is a critical component of video analysis, as it provides the description of spatio-temporal relationships between observations and moving objects required by activity recognition modules. There are two tasks that we aim to address: 1) Multiple Target Tracking (MTT) and 2) Tag-and-Track. The essential problem in MTT is to recover the data association between noisy observations and an unknown number of targets. To solve this problem, we propose a Data-Driven Markov Chain Monte Carlo method to sample the data association space for a MAP (Maximum a Posteriori) estimate that maximizes the spatio-temporal smoothness in both motion and appearance.
Tag-and-Track applies to tracking an arbitrary type of object given limited samples. The essential problem in Tag-and-Track is to establish and update an appearance model online to capture the visual signature of targets under varying circumstances, such as illumination changes, viewpoint changes and occlusions. We pose this Tag-and-Track problem as a semi-supervised learning problem, in which we aim to label a large amount of unlabeled data given very limited labeled data (user selection). We propose to use two trackers combined in a Bayesian co-training framework, which unifies the CONDENSATION algorithm and co-training seamlessly. By using co-training, our method prevents learning errors from reinforcing themselves. In this thesis, we also present the application of our methods to detection and tracking of multiple moving objects in airborne videos. In this application, we combine our core tracking algorithm with a set of motion detection and tracking techniques, including motion stabilization, geo-registration, etc., and demonstrate the robustness and efficiency of our methods.

Chapter 1

Introduction

Tracking is an important task in computer vision. It is widely applied in visual analysis, content-based retrieval, video compression, etc. In this dissertation, we address the motion and tracking problem in the context of video analysis. There are three key steps in video analysis: detection of interesting moving objects, tracking of such objects from frame to frame, and analysis of object tracks to recognize their behavior. We mainly focus on the first two steps, namely detection and tracking.

With the increasing availability of inexpensive cameras and the high demand for public security, detection and tracking (especially in visual surveillance) has been a very active research field in recent years. Numerous approaches for detection and tracking have been proposed (a brief literature review and a categorization of existing representations and formulations are provided in Chapter 2).
Detection and tracking are challenging problems due to many different issues, such as cluttered backgrounds, noisy or blurry images, complex object motion, large appearance variation caused by illumination and/or viewpoint changes, partial or complete occlusion, and non-rigid, articulated objects. In this large research area, we aim to address two fundamental topics: data association and object appearance modeling. Data association is the essential problem of tracking multiple targets¹. Environments of interest usually contain an unknown number of targets, and multiple observations of targets are reported. Data association is the problem of associating the many observations made by cameras with the hidden states of the targets being observed, or of associating together multiple observations incurred by the same object. Once the data association is determined, the hidden states of targets can be estimated by filtering techniques. Appearance modeling is another important and challenging issue in object tracking, especially when one particular object undergoes appearance changes caused by illumination and viewpoint changes. The techniques for solving these two fundamental issues can be widely applied in video analysis.

1.1 Goals

In terms of methodology, we aim to provide fundamental research to address these two key problems in motion and tracking. Thus, we focus on methodologies for tracking objects in general, instead of on trackers designed for specific objects, for example, a walking-human tracker that relies heavily on human kinematics. For this purpose, we pose data association and appearance modeling in general settings: when we define the problems, we do not assume a specific object type or specific environment. On the other hand, our method can be easily extended when specific domain knowledge is available. This allows our work to be applied to many different applications.
Typical scenarios for applying our methods include aircraft cameras and stationary or PTZ (Pan-Tilt-Zoom) ground surveillance cameras following moving objects in a cluttered environment, as shown in Figure 1.1. There exist specific tasks in detection and tracking depending on different goals; for example, we try to achieve the following: 1) track a single object with manual initialization (tag and track); 2) detect and track all moving targets in the scene, no matter what they are; 3) detect and track all moving vehicles in the scene. We also address one particular application, detecting and tracking moving vehicles from airborne cameras, where our methods can be successfully applied. By combining the key techniques with specific domain knowledge, we accomplish some interesting tasks in this application, for instance, tracking moving vehicles in geo-coordinates given a satellite image, associating tracklets for long-term tracking, and compressing foreground and background separately and rendering them offline.

¹ "Target" and "object" are used interchangeably in this dissertation. The minor difference between the two words is that "object" indicates a known type of object, such as a human or vehicle, while "target" means an unknown moving thing with no type information.

Figure 1.1: Typical scenarios. (a) aircraft scenario; (b) ground scenario.

1.2 Tasks and Difficulties

We itemize three specific tasks that are addressed in this dissertation. The first two are related to the two fundamental issues in tracking, data association and appearance modeling. The third is to apply the proposed approaches to achieve an efficient system for airborne video analysis. We discuss the objective and challenges of each task as follows.

Task 1: General multiple target tracking

The input of general multiple target tracking is a set of candidate regions in each frame, as obtained from a state-of-the-art background compensation module, and the goal is to recover the trajectories of all moving objects over time from noisy observations.
• The one-to-one assumption is violated. Due to occlusions by other targets and static objects, noisy segmentation and false alarms, one foreground region may not correspond to one target faithfully. Therefore the one-to-one assumption used in most data association algorithms is not always satisfied. Our method overcomes the one-to-one assumption by formulating the visual tracking problem in terms of finding the best spatial and temporal association of observations, which maximizes the consistency of both motion and appearance of trajectories. To avoid enumerating all possible solutions, we take a Data-Driven Markov Chain Monte Carlo (DD-MCMC) approach to sample the solution space efficiently.

• Accuracy and efficiency of detection within one frame are difficult to achieve due to the limited temporal information. We address this limitation and arrive at a unified framework that combines detection and tracking in a principled manner. Instead of generating binary detection results which are then pipelined into the tracking procedure, we propose to treat detection as a proposal procedure; the final decision is made according to the smoothness of motion, appearance and model information over time.

Task 2: Tag and track a general object on-the-fly

• Appearance changes. Tag-and-track is challenging due to appearance changes, which can be caused by varying viewpoints and illumination conditions. Appearance can also change relative to the background due to the emergence of clutter and distracters. Also, an object may leave the field of view (or be occluded) and reappear. To address these difficulties, we aim to track an arbitrary object with limited initialization (labeled data) and learn an appearance model on-the-fly, which can then be used to reacquire the object when it reappears.
Task 3: Application: tracking multiple moving targets from a moving platform over the long term

• Motion detection from moving cameras. Motion detection in video scenes observed by moving cameras is inherently difficult, since the camera motion induces a displacement for all the image pixels. For the scenarios of Unmanned Aerial Vehicle (UAV) and PTZ cameras, a 2D homography (affine) image motion model is utilized for motion compensation. Background modeling is performed within a sliding window to reduce accumulated registration errors. Both accuracy and time performance are considered in motion detection. A GPU-accelerated module is implemented to improve time performance.

• Tracking in global coordinates. For our UAV scenario, we adopt a global map as the tracking coordinate frame. A satellite image is selected as the map. Homographies between UAV frames and the map are estimated by a procedure called geo-registration. Compared with selecting the first frame as the coordinate frame, the motion model defined in the map coordinates has physical meaning. As geo-registration refines the homography between a UAV frame and the global map, the accumulated error is reduced.

• Long-term tracking. In surveillance applications, occlusion is common. Short-term occlusion, which may cause missed detections for a couple of frames, can be recovered by using the spatio-temporal smoothness in both motion and appearance. In many situations, however, objects may be occluded for a long time and reappear at another location moving in a different direction. This may cause a tracker to lose target IDs. Thus, we adopt the concept of tracklets to describe short segments of tracks. Then, we associate tracklets according to spatio-temporal consistency and appearance similarity.

1.3 Reader's Guide

The rest of this dissertation is organized as follows. Chapter 2 introduces the general background of the tracking problem.
We summarize the commonly used object representations and feature selection, then discuss two formulations for solving the tracking problem: one is Bayesian inference and the other is combinatorial optimization. Also, we discuss related issues in designing tracking systems for different applications, including initialization, sliding windows and post-processing techniques. In Chapter 3, we present our data association algorithm to solve the essential problem in multiple target tracking. We also present two extensions of the proposed method: the first uses model information to achieve an informed proposal distribution; the second extends the data association framework into a hierarchical data association to cope properly with long-term tracking. In Chapter 4, we present our online appearance modeling method for tracking a single general object. In Chapter 5, we present an important application: detection and tracking of moving vehicles from an airborne platform. A pipeline of the proposed techniques achieves robust and efficient airborne surveillance. In Chapter 6, we summarize the main achievements and limitations of our work, and discuss possible extensions.

Chapter 2

Background

Detection and tracking represent a very wide research area. In this chapter, we provide a brief literature review of detection and tracking methods over recent decades. The goal of this review is to form a coherent picture of the large body of existing work, and to point out where our work fits¹. In general, there is a strong relationship between object representation and tracking algorithms. Hence, we start from a review of object representations and discuss suitable representations for different types of objects and various applications. After choosing a proper object representation, selecting informative features that describe the attributes of objects is also an important issue. We give a summary of commonly used features for tracking.
Then, we discuss two commonly used formulations for approaching single-object and multiple-object tracking, and point out the essential issue in each formulation. At the end of this chapter, we discuss some related issues in designing a tracking system.

¹ For a more comprehensive review, readers should refer to [85].

2.1 Object Representation

When we discuss a tracking problem, the first question is likely to be: "what is the object to track?" In a general tracking scenario, an object can be defined as anything of interest that needs further analysis. People walking on a road, a boat sailing on the water, vehicles seen from airborne cameras, ants in biology experiments, and a flock of birds in the sky have all been common objects of interest in tracking papers. After identifying the object of interest, we need to find a suitable representation of it. Different types of objects may need very different representations, under the tradeoff between accuracy and efficiency. For example, a point-based representation may be proper for a flock of birds in the sky, a bounding box may be good for tracking people walking on a road, and a 3D stick figure may be of interest for tracking an articulated object. Object representation also depends on the desired output of the application. For example, for tracking humans, a pictorial representation may be appropriate when human pose is of interest, while a bounding box is enough for a surveillance application. The commonly used object representations include:

• Point: The point is the representation used at the early stage when tracking became a research topic, especially when radar came into practice. This is the most primitive representation of an object.
In visual tracking, a point representation is still often used for tracking feature points or objects that occupy a small area in an image [71, 76].

• Geometric shape: The object can be represented by geometric shapes, such as a rectangle in [81, 86] and an oriented rectangle or an ellipse [12, 18]. This abstract representation is suitable for tracking objects that occupy a moderate region in an image when a detailed representation of the object is not of interest. In this representation, the object motion is simplified to the centroid's kinematics. This is the most commonly used representation in 2D image tracking, suitable for many common categories of objects, like faces, humans and vehicles. It provides both location and size information of the object.

• Contour: A contour is used to represent an object when the object's boundary is more stable than the object's shape. This representation is especially suitable for non-rigid objects in medical applications [35, 70]. A contour is usually represented by a sequence of control points in order along the boundary of the object. This representation can provide more accurate localization of an object, however, at the expense of a more complex representation than primitive shapes.

• Segmentation: An object can be represented by an inter-connected region with arbitrary shape. This representation provides the most accurate, pixel-level representation of an object in the 2D image. A segmentation is the dual form of a closed-loop contour. Given such a representation, one can always degrade to another, simpler representation, like a point, shape or contour. Segmentation can represent a large variety of objects, both rigid and non-rigid, with simple or complex shapes. It is tolerant to any type of 2D or 3D motion, for example, tracking a cup from different viewpoints.
By nature, this representation provides a clear segmentation between foreground and background, which is important information in a tracking scenario. However, a segmentation, represented as a labeling of a set of pixels in a 2D image, has the highest complexity. Moreover, in some applications (such as tag-and-track), it may not be practical to provide such an accurate initialization for tracking purposes. More recently, with the increasing computational power that a modern computer can provide, this segmentation-based representation is becoming a promising direction in the tracking field. Several successful tracking methods based on pixel-level segmentation have been proposed [49, 7].

Also, one can use a composite of the above representations to describe more complex objects. For example, one can use a set of points, or a set of local patches, to represent an object [24]. One can also use a set of primitive shapes, such as ellipsoids [89][23], to represent articulated objects. The relationships between the parts are governed by kinematic motion models, for example, joint angles. In this dissertation, we adopt the most commonly used representation, bounding rectangles. It is an efficient object representation that can serve as a good approximation for a large variety of objects.

After selecting a suitable object representation, one needs to find good features to differentiate one object from the background or from other objects. There are a number of ways to represent the features of objects. First, one can use motion characteristics as features. For example, if objects move with constant velocity, the tuple of velocity and location at a given time is a good feature for differentiating one object from another. Kinematic features are often adopted by point-based representations, where the appearance feature is weak or not applicable. Kinematic features can also be used to reduce the ambiguity in appearance features, as in [1].
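Since bounding rectangles are the representation adopted in this dissertation, a small sketch may help make it concrete. The `Box` type and the intersection-over-union (IoU) overlap measure below are illustrative additions (IoU is a standard way to compare two boxes, not something defined in this chapter):

```python
# Illustrative sketch (not from the thesis): a bounding-rectangle object
# representation plus intersection-over-union (IoU), a standard measure
# of how well two axis-aligned boxes agree.
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned rectangles, in [0, 1]."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))  # overlap width
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))  # overlap height
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0

# Two overlapping detections of the same target:
print(iou(Box(0, 0, 10, 10), Box(5, 0, 10, 10)))  # 0.333...
```

An IoU near 1 means two rectangles describe nearly the same region; near 0 means they are disjoint, which is one simple way to score how faithfully a covering rectangle matches a foreground region.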
2.2 Various Formulations

In this section, we review two widely used formulations for solving tracking problems. One is to formulate tracking as a Bayesian inference problem of estimating the states of objects given all previous states and the current observations. The other is to formulate tracking as a combinatorial optimization problem of finding the best assignment between targets and observations, or between observations in multiple frames.

2.2.1 Bayesian Inference in Tracking

Bayesian inference estimates hidden states from observable variables. Bayesian inference is in general an NP-hard problem [54], except when the structure of the problem can be represented by an acyclic graphical model. The most common graphical model used in tracking a single target is a Hidden Markov Model (HMM), shown in Figure 2.1.

Figure 2.1: HMM model

From the viewpoint of Bayesian inference on the HMM model, two types of inference formulation have been widely used [69][55]. One is filtering, p(x_t | Z_{0,...,t}), which estimates the posterior distribution of the hidden state x_t at time t given all the observations Z_{0,...,t} = {z_0, z_1, ..., z_t} up to time t. If tracking is formulated as a filtering problem, the posterior is recursively estimated (as in Eq. 2.1) from the previous posterior at time t−1 and the observations at time t. The well-known Condensation (Conditional Density Propagation) algorithm [35] employs this recursive equation and uses particles to represent the density function; it is thus also called particle filtering.

p(x_t | Z_{0,...,t}) ∝ p(z_t | x_t) ∫ p(x_t | x_{t−1}) p(x_{t−1} | Z_{0,...,t−1}) dx_{t−1}    (2.1)

We represent the dynamics and observation model in Eq. 2.2:

x_t = f_A(x_{t−1}) + w
z_t = f_H(x_t) + v    (2.2)

where w and v are the noise terms in the dynamics equation and the observation equation respectively, and f_A is the dynamics (drift) of targets between two consecutive frames.
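To make the recursion concrete, Eq. 2.1 under the model of Eq. 2.2 can be approximated by a minimal bootstrap particle filter. The sketch below is illustrative only: the particular f_A (unit drift), f_H (identity) and Gaussian noise levels are assumptions chosen for the example, not the settings used in this thesis.

```python
# A minimal bootstrap (Condensation-style) particle filter for Eq. 2.1/2.2.
# The 1D model here (f_A = drift by +1, f_H = identity, Gaussian w and v)
# is an illustrative assumption, not the thesis's actual configuration.
import math
import random

random.seed(0)

N = 500           # number of particles
q, r = 0.5, 1.0   # std of dynamics noise w and observation noise v

def f_A(x):       # dynamics (drift) between two consecutive frames
    return x + 1.0

def f_H(x):       # observation function
    return x

def step(particles, z):
    """One filtering step: predict with p(x_t|x_{t-1}), weight by p(z_t|x_t), resample."""
    # Predict: sample each particle from the dynamics model.
    pred = [f_A(x) + random.gauss(0.0, q) for x in particles]
    # Weight: Gaussian observation likelihood p(z_t | x_t).
    w = [math.exp(-0.5 * ((z - f_H(x)) / r) ** 2) for x in pred]
    total = sum(w)
    w = [wi / total for wi in w]
    # Resample particles in proportion to their weights.
    return random.choices(pred, weights=w, k=N)

# Observations of a target moving roughly one unit per frame.
particles = [random.gauss(0.0, 1.0) for _ in range(N)]
est = 0.0
for z in [1.1, 2.0, 2.9, 4.2, 5.0]:
    particles = step(particles, z)
    est = sum(particles) / N  # posterior mean estimate of x_t
```

The particle set approximates the posterior of Eq. 2.1; its mean after the last observation lands near the true position, and the same predict-weight-resample loop applies unchanged when f_A, f_H are non-linear or the noise is non-Gaussian, which is exactly why sampling methods are attractive here.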
If $w$, $v$, $x_0$ are Gaussian and $f_A$, $f_H$ are linear functions, the posterior is also Gaussian and can be computed analytically by a Kalman filter [55]. If $w$, $v$, $x_0$ are Gaussian but $f_A$, $f_H$ are non-linear functions, the Extended Kalman filter [55], which is based on a first-order linear Taylor expansion of $f_A$, can be applied to approximate the posterior distribution by a Gaussian. For the same case, the Unscented Kalman filter [38] can achieve a third-order approximation. However, if $w$, $v$, $x_t$ are not Gaussian and $f_A$, $f_H$ are non-linear functions, Monte Carlo sampling can be applied to simulate the distribution by a set of particles and associated weights. In [44], an MCMC-based particle filter simulates the distribution of the association probability with a fixed number of targets, which allows multiple temporal associations between observations and targets. In [72, 43], sequential tracking methods use a pairwise Markov random field (MRF) based prior to model the interaction between targets at one time instant. In [87], an ad-hoc Markov network is used to model the interaction between multiple targets at each time, and a mean field Monte Carlo algorithm is applied to approximately estimate the posterior density of each target. In [89], multiple people are detected and tracked in a crowded scene using an MCMC-based method to estimate the state and the number of targets sequentially. In [56], a multi-view approach uses a particle filter based method to segment and track people against clutter. Many sequential methods employ model information to identify a specific type of target against the background, such as [87, 89, 44, 43]. Smoothing techniques in Bayesian inference, namely computing $p(X_{1,\ldots,t}|Z_{0,\ldots,t})$, estimate the path of hidden states over time given all the observations. Smoothing is more sensible for some tracking tasks, since the decision is made with full consideration of the smoothness of the path.
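For the linear-Gaussian case above, the predict/update cycle of the Kalman filter can be written in a few lines. This is a minimal 1D sketch for illustration only (constant-position model, scalar noise variances chosen arbitrarily), not the full kinematic model used later in the dissertation.

```python
def kalman_1d(observations, q=1e-2, r=0.25, x0=0.0, p0=1.0):
    """Minimal 1D Kalman filter for x_t = x_{t-1} + w, z_t = x_t + v,
    i.e. f_A and f_H are identity, so the posterior stays Gaussian."""
    x, p = x0, p0  # posterior mean and variance
    means = []
    for z in observations:
        # Predict through the dynamics: mean unchanged, variance grows.
        p = p + q
        # Update: the Kalman gain blends prediction and measurement
        # according to their relative variances.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        means.append(x)
    return means

# A stationary target at 5.0 observed with noise.
zs = [5.3, 4.8, 5.1, 4.9, 5.2, 5.0]
print(kalman_1d(zs, x0=5.0)[-1])  # hovers near 5.0
```

The same structure generalizes to vector states with matrices $A$, $H$, $Q$, $R$; the Extended Kalman filter simply replaces $A$ and $H$ by the Jacobians of $f_A$ and $f_H$ at the current estimate.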
One well-known approach for tracking a single object with noisy observations is to use the Viterbi algorithm to estimate the optimal path traversed by the object [2]. For multiple target tracking, however, the Viterbi algorithm would need to find multiple paths of hidden variables simultaneously.

2.2.2 Combinatorial Optimization in Tracking

The Viterbi algorithm is in essence a shortest path algorithm. If there are multiple objects in the scene, we need to find multiple disjoint shortest paths, as in the formulation proposed in [88]. This leads to the other way to formulate the tracking problem, using combinatorial optimization techniques. In this formulation, the optimal combination of associations between multiple targets and observations is considered. Consider the simplest case in multiple target tracking: there are n targets and n observations at one time instant, and one target can be associated to at most one observation and vice versa. A legal assignment is also called an association event or a hypothesis. The targets and observations form a bipartite graph, shown in Figure 2.2. The edge weight indicates the cost (or utility) of associating one observation to one target.

Figure 2.2: A toy assignment example: one target can be associated to at most one observation and vice versa.

  Cost        Observation 1   Observation 2   Observation 3
  Target 1          1               4               5
  Target 2          5               7               6
  Target 3          5               8               8

Now the tracking problem is to find an assignment that minimizes (or maximizes) the total cost (or utility). Given the optimal assignment, filtering techniques can be applied to estimate the state of each target. The classical algorithm that has been widely used for solving this assignment problem is the Hungarian algorithm [46], which has polynomial time complexity O(n^3). The above toy assignment problem can be extended into different variations [42, 34] to handle cases where new tracks emerge and existing tracks terminate.
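For a problem as small as the toy example in Figure 2.2, the optimal assignment can be verified by brute-force enumeration over all n! permutations; the Hungarian algorithm reaches the same optimum in O(n^3) without enumeration. A short sketch:

```python
from itertools import permutations

# Cost matrix from Figure 2.2: cost[i][j] is the cost of assigning
# target i to observation j.
cost = [[1, 4, 5],
        [5, 7, 6],
        [5, 8, 8]]

def best_assignment(cost):
    """Enumerate all one-to-one assignments (n! of them) and return
    the minimum-cost one and the corresponding permutation."""
    n = len(cost)
    best, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best is None or total < best:
            best, best_perm = total, perm
    return best, best_perm

total, perm = best_assignment(cost)
print(total, perm)  # minimum total cost is 15
```

Note that this toy instance has two optimal assignments of cost 15 (e.g. target 1 to observation 1, target 2 to observation 3, target 3 to observation 2); the brute-force search returns the first one encountered.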
These variations can also be solved by the Hungarian algorithm after expanding the association events. The assignment problem discussed above is defined in a sequential way. The assignment ambiguity may be resolved by making decisions with observations from a set of frames. However, this causes the total number of possible assignments to increase exponentially. Enumerating all possible solutions becomes impossible, which leads to the interest in keeping only the k-best assignments at each time. This is the key idea in MHT (Multiple Hypothesis Tracking [68]). Of course, determining the k-best assignments must not involve enumerating and then sorting all possible sets. Cox and Miller [20] proposed an algorithm to optimally determine the k-best assignments in polynomial time. This algorithm became the core of the efficient implementation of MHT in [19]. Although finding the k-best assignments in each frame can be solved efficiently, the total number of association events still grows exponentially with the number of frames. Thus, pruning and merging hypotheses are required for this type of method. Recently, a novel method was proposed in [88], where the tracking problem is formulated as finding k shortest disjoint paths, which is solvable in polynomial time for a particular k. It seems that this "solves" the data association problem over multiple frames in polynomial time, a problem that is essentially NP-hard. However, this method needs to determine k either with some prior knowledge or by a gradient search. The upper bound of k is still an exponential function of the number of frames considered.

2.3 Related Issues in Tracking

There are some other related issues to discuss in designing a tracking system, including initialization, decision delay and post-processing, and time performance.

2.3.1 Initialization

Any tracking system needs an initialization phase. A desirable initialization minimizes the required user input or interaction.
For tracking a specific type of object, pattern-based detection can be used for initialization. For tracking an unknown type of object, or when pattern-based detection is hard to acquire, motion-based detection is also commonly used for initialization. For tracking a single arbitrary object, a user needs to first select the object, and then tracking is initialized. For some offline applications, e.g. video annotation, to achieve the best performance, interactive user input can be used to re-initialize the tracker when tracking fails, such as in [78]. The initialization process is closely related to the object representation. Primitive shape representation requires only modest user input for initialization, e.g. it is quite straightforward to select a bounding box to initialize tracking. Contour and segmentation based representations require more user input for initialization. One new trend in tag&track is to combine tracking with segmentation: a crude selection is provided, such as a bounding box, and a segmentation of the object against the background is automatically obtained during tracking.

2.3.2 Decision delay and sliding window techniques

The ideal tracking system should minimize the delay in making decisions; namely, as soon as one frame of observations is obtained, the tracking decision should immediately be made. Many real-time applications require a minimum delay. However, as discussed, by delaying the decision and collecting more observations, a better decision may be obtained. The other extreme is purely offline tracking, which can make use of all the observations from beginning to end, such as in [88]. A tradeoff between these two extremes is to use a sliding window, where the tracking decision is made according to the observations in the window. This technique provides an observation buffer for making a better tracking decision while keeping the total delay constant.
Although online tracking can maintain track identities within the sliding window, a post-processing step that works across different sliding windows can be applied to rectify the track IDs by looking at the observations at a larger time scale.

2.3.3 Real-time issue

A robust, real-time tracker is one of the ultimate goals. As computational resources continue to increase, real-time tracking becomes practical in more applications. Many real-time or near real-time tracking algorithms and systems have been demonstrated in recent years, such as mean-shift trackers [16, 12], ensemble tracking [4], eigentrackers [9, 51], the pixel-wise posterior tracker [7], etc. More recently, with the growing popularity of the Graphics Processing Unit (GPU), its computational power can make some complicated methods run in real time, such as in [70]. In our system, we use the GPU to implement the module for background modeling from a moving platform.

Chapter 3
Data Association in Multiple Target Tracking

3.1 Motivation and Introduction

Multiple target tracking is a fundamental issue for video analysis and surveillance systems. As multiple targets and multiple observations of targets exist simultaneously in each frame, data association becomes a first-line problem in multiple target tracking. The purpose of data association is to recover the correct correspondence between observations and targets. Indeed, data association and state estimation are coupled. Once data association is established, filtering techniques can be applied to estimate the state of the targets. The way to evaluate a possible data association is to check whether the estimated states of the targets form consistent trajectories in terms of both motion and appearance. Existing multiple target tracking methods can be categorized into two types: single scan and multiple scan (or n-scan) [65]. For single scan algorithms, the data association decision is made sequentially, in either a deterministic (e.g. nearest neighbor) or probabilistic (e.g.
probabilistic data association) way at each time step. In contrast to single scan, n-scan methods defer the data association decision until n frames of observations have been collected, and thus are also called deferred logic methods. Although single scan methods are computationally more efficient than n-scan methods, their solutions are suboptimal compared to multiple scan methods. Batch processing of the observations from many frames together may incur a high computational cost for a problem of non-trivial size. A combinatorial optimization formulation is often adopted in solving the multiple target tracking problem. The task of labeling each observation with either a track ID or a false alarm is related to the set packing problem, which is NP-hard [58]. No matter how the formulation is transformed, such as into a 0-1 integer programming problem [58] or a multidimensional assignment problem [65], in practice the whole solution space has to be reduced to a feasible one by the use of heuristics. Among the large body of work in multiple target tracking, the multiple hypothesis tracker (MHT) [68] and the joint probabilistic data association filter (JPDAF) [14] are two classical methods. MHT is a statistical framework to evaluate the likelihood of each hypothesis, which represents a set of assignments between observations and targets. To find the best hypothesis over time, in practice the k-best hypotheses are maintained at each time, which can be computed in polynomial time [19]. The essential difference between JPDAF and MHT is that instead of finding the best hypothesis, JPDAF computes the expectation of the state of the targets over all hypotheses (joint association events). Also, any practical implementation of MHT and JPDAF requires pruning the set of all hypotheses to a smaller set and thus leads to a suboptimal solution. Both of these data association methods assume a one-to-one mapping between observations and targets.
Instead of explicitly reducing the size of the hypothesis set, sampling-based techniques have recently been proposed to solve the combinatorial optimization problem. Many of them adopt sequential inference to avoid an exponential explosion of the size of the solution space. An MCMC-based variant of the auxiliary variable particle filter is proposed to approximately infer the position of the targets [44]. It is worth noting that the data association in that paper considers split and merged measurements, and the MCMC sampling simulates the probability of a data association. However, that paper assumes a known number of targets, and data association is determined in a sequential way. In [72], trans-dimensional MCMC is used to sample the probability of the data association with a varying number of targets. This sequential tracking method uses a pairwise Markov random field (MRF) based prior to penalize the overlap between targets at the same time. In [89], an articulated human foreground model is adopted and multiple people are detected and tracked in crowded scenes using an MCMC based method to estimate the state and the number of targets sequentially. In [87], an ad-hoc Markov network is used to model the interaction between multiple targets at each time, and a mean field Monte Carlo algorithm is applied to approximately estimate the posterior density of each target. In [56], a multi-view approach uses a particle filter based method to segment and track people against clutter. In order to reduce the ambiguity in data association, many of these sequential methods employ model information to identify a specific type of target against the background, such as [87, 89, 44, 43]. Among the many sampling based methods, S. Oh et al. originally proposed to use MCMC to directly sample the data association in an n-scan setting [62]. This method is a general framework, which is capable of initiating and terminating a varying number of tracks and is able to incorporate any domain knowledge.
This method is appropriate for point-based observations but cannot be applied to region-based observations.

Figure 3.1: Segmentation of foreground regions in space-time by the use of motion and appearance smoothness. (a) Foreground region in one frame. (b) Motion and appearance of two targets.

Compared with this method, our approach overcomes the one-to-one assumption by introducing spatio-temporal data association. Also, we encode both motion and appearance information in the posterior distribution, which allows the method to deal with region-based observations in vision applications. Moreover, since the success of the MAP formulation relies on the definition of a posterior distribution, we avoid determining the posterior empirically, and instead introduce a practical method to estimate the parameters of the posterior offline.

3.2 A Bayesian Formulation of Multiple Target Tracking

3.2.1 Anatomy of the Problem

Suppose there are $K$ unknown targets in the scene within the time interval $[1,T]$. The input to the tracking algorithm is a set of regions after foreground segmentation. Let $y_t$ denote the set of foreground regions at time $t$, and let $Y=\cup_{t=1}^{T} y_t$ be the set of all available foreground regions within $[1,T]$. In the simplest case, a single target is perfectly segmented from the background, and tracking is straightforward. When there are multiple targets in the scene, and they never overlap nor get fragmented, the one-to-one mapping assumed by many tracking algorithms holds: any track $\tau_k$ contains at most one observation at one time instant, i.e. $|\tau_k \cap y_t| \le 1,\ \forall k \in [1,K]$, and no observation belongs to more than one track: $\tau_i \cap \tau_j = \emptyset,\ i \ne j,\ \forall i,j \in [1,K]$. If the one-to-one mapping holds, tracking can be done by associating the foreground regions directly. However, in the most general case, which is common in real data sets, one foreground region may correspond to multiple targets (one example is shown in Figure 3.1(a)), and one target may correspond to multiple foreground regions.
In this case, without using any model information, it is very difficult to segment the foreground regions in a single frame. However, if we consider this task in space-time, the smoothness in motion and appearance of targets can be used to solve the problem. One example is shown in Figure 3.1(b), where the segmentation of the foreground regions becomes much easier than in Figure 3.1(a): if we look at several observations over time, smoothness in the motion and appearance of targets helps to disambiguate them.

Figure 3.2: Two ways to represent foreground regions. (a) Pixel-level labeling of foreground regions (areas with one label vs. areas with two labels). (b) Rectangle cover of foreground regions.

There are many ways to represent foreground regions corresponding to different targets. The most detailed representation is to assign to each foreground pixel a label (or a set of labels). The label (or labels) indicates the target (or targets) that the pixel belongs to. We allow the case where one pixel is assigned multiple labels, to represent an occlusion situation, as shown in Figure 3.2(a). Note that areas with a common label may not necessarily be connected. This is different from a partition segmentation problem, where regions must be disjoint, i.e. each pixel belongs to exactly one region. Although such a representation is very accurate, labeling each pixel is expensive to implement. We adopt a more efficient alternative representation, and use rectangles to approximately represent the shapes of targets; the bounding rectangles form a cover of the foreground regions, as shown in Figure 3.2(b). The overlap between two rectangles indicates an occluded area. Given pixel labels, we can precisely derive a rectangle cover representation, and conversely pixel labels can be approximately obtained from the rectangle cover representation. The approximation is useful since it provides an efficient explanation of foreground regions with occlusion, and significantly reduces the complexity of the problem.
In such a scheme, the center and the size of a rectangle are used as the abstract representation of the motion state, and the foreground area covered by a rectangle contains the appearance of one target. Covering rectangles with labels (track IDs) over time form a cover of the foreground regions in a sequence of frames, and a track is the set of covering rectangles with the same label. Formally, a cover $\omega$ with $m$ covering rectangles of $Y$ is defined as follows:

$$\omega = \{CR_i = (r_i, t_i, l_i)\},\quad r_i \in \Pi_r,\ t_i \in [1,T],\ l_i \in [1,K] \qquad (3.1)$$

subject to

$$\forall i,j,\ i \ne j \in [1,m]:\ (t_i, l_i) \ne (t_j, l_j) \qquad (3.2)$$

where $CR_i$ is one covering rectangle, $r_i$ and $t_i$ represent the state (center position and size) and the time stamp of the rectangle, $l_i$ indicates the label assigned to the rectangle $r_i$, and $K$ is the upper bound on the number of targets. $\Pi_r$ is the set of all possible rectangles. Although the candidate space of possible rectangles is very large, i.e. $|\Pi_r|$ is a large number, it is still finite if we discretize the state of a rectangle in the 2D image space. The constraint in Eq. 3.2 means that no two covering rectangles can share both the same time stamp and the same track label. In other words, one track can have at most one covering rectangle at one time instant. Thus, the number of rectangles that one cover can contain is bounded: $m \le M = KT$. Forming a cover can be regarded as first selecting $m$ rectangles from the space $\Pi_r$ and then filling them into $KT$ sites. One site corresponds to one unique pair of time stamp and track label, i.e. $\langle t_i, l_i \rangle$. No two rectangles can fill the same site. Let $\tau_k(t)$ denote the covering rectangle of track $k$ at time $t$. If we consider $\tau_k(t)$ a virtual measurement, the data association between virtual measurements still complies with the one-to-one mapping; namely, there is at most one virtual measurement for one track at one time instant. The virtual measurement derives from foreground regions: a virtual measurement can correspond to (i.e.
cover) more than one foreground region, or a part of a foreground region. The relationship between virtual measurements and the real observations from foreground regions reveals the spatial data association between foreground regions. A similar concept of virtual measurement is also introduced in [22] for establishing correspondence in the SfM (Structure from Motion) problem. By introducing the concept of virtual measurements, we differentiate spatial data association from temporal data association. The optimal joint spatio-temporal data association leads to the final solution of this multiple target tracking problem. Let $\Pi_M^m$ denote the space of all possible combinations of $m$ locations from $M$ sites; the whole solution space ($\omega \in \Omega$) can be represented as

$$\Omega = \bigcup_{m=1}^{M} \Omega_m = \bigcup_{m=1}^{M} \Pi_M^m \times \underbrace{\Pi_r \times \cdots \times \Pi_r}_{m} \qquad (3.3)$$

The structure of the solution space is typical for vision problems. As in [75], the solution of the segmentation problem is formulated in a similar way, where the entire solution space is a union of m-partition spaces (m is the number of regions).

Figure 3.3: One possible cover of the observations, which includes two tracks ($\tau_1$, $\tau_2$). The dashed rectangles represent the covering rectangles of foreground regions. The uncovered regions correspond to false alarms.

In the case of a single target with perfect foreground segmentation, the set of MBRs (Minimum Bounding Rectangles) of the foreground regions at different times forms the best cover of the target. However, when inter-occlusion between multiple targets and noisy foreground segmentation exist, it is not trivial to find the optimal cover. Figure 3.3 shows a general case with observations in 5 frames, and illustrates the cases of split observations (in frame 2) and merged observations (in frame 3). Let $\tau_k$ denote one track in $\omega$. A cover with $K$ tracks can also be written as follows:

$$\omega = \{\tau_1, \cdots, \tau_K\} \qquad (3.4)$$

Figure 3.3 shows one possible cover $\omega = (\tau_1, \tau_2)$ with two tracks in different colors.
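The cover formalism of Eqs. 3.1, 3.2 and 3.4 can be sketched as a small data structure. The field names below are hypothetical and chosen only for illustration; the rectangle state is abbreviated to a tuple.

```python
from collections import namedtuple

# One covering rectangle CR_i = (r_i, t_i, l_i): r is the rectangle
# state (center x, center y, width, height), t the time stamp, and
# l the track label.
CR = namedtuple("CR", ["r", "t", "l"])

def is_valid_cover(cover):
    """Check the constraint of Eq. 3.2: no two covering rectangles
    may share both the same time stamp and the same track label,
    i.e. one track has at most one rectangle per time instant."""
    sites = set()
    for cr in cover:
        site = (cr.t, cr.l)  # one <t_i, l_i> site per rectangle
        if site in sites:
            return False
        sites.add(site)
    return True

def tracks(cover):
    """Group a cover into tracks (Eq. 3.4): a track is the set of
    covering rectangles sharing the same label."""
    by_label = {}
    for cr in cover:
        by_label.setdefault(cr.l, []).append(cr)
    return by_label

omega = [CR((10, 10, 4, 8), t=1, l=1), CR((12, 10, 4, 8), t=2, l=1),
         CR((30, 20, 5, 9), t=1, l=2)]
print(is_valid_cover(omega))                                  # True
print(is_valid_cover(omega + [CR((0, 0, 1, 1), t=2, l=1)]))   # False
```

Note that two rectangles with the same time stamp but different labels (as at t=1 above) are allowed; only a duplicated (time, label) site violates Eq. 3.2.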
From the implementation perspective, one cover contains a set of tracks. Each track consists of a sequence of covering rectangles, represented as the dashed rectangles in Figure 3.3. For example, $\tau_1$ and $\tau_2$ each contain five rectangles, one at each time instant. As defined in Eq. 3.1, besides the location and the size, each covering rectangle has two properties, the track ID and the time label. Temporal data association is implemented by changing the track IDs, for example, splitting $\tau_1$ into two tracks. Spatial data association involves changing the location and the size of one covering rectangle, for example, a diffusion of one track at one time. Intuitively, exploring the solution space from one cover to another is implemented by changing properties of the covering rectangles.

3.2.2 Maximize a Posterior (MAP)

The underlying constraint for tracking is that a good explanation of the foreground regions exhibits good consistency in motion and appearance over time. Formally, in a Bayesian formulation, the tracking problem is to find a cover that maximizes the posterior (MAP) over covers of the foreground regions, given the set of observations $Y$:

$$\omega^* = \arg\max_\omega\, p(\omega|Y) \qquad (3.5)$$

In the MAP problem defined in Eq. 3.5, the cover $\omega$ is denoted by a set of hidden variables. We make inference about $\omega$ from $Y$ over the solution space $\omega \in \Omega$:

$$\omega \sim p(\omega|Y) \propto p(Y|\omega)\,p(\omega), \quad \omega \in \Omega \qquad (3.6)$$

The likelihood $p(Y|\omega)$ represents how well the cover $\omega$ explains the foreground regions $Y$ in terms of spatio-temporal smoothness in both motion and appearance. The prior model regularizes the cover to avoid overfitting the smoothness. In the following sections, we discuss the prior and likelihood models used in our method.

3.2.2.1 Prior Model

To find a cover with reasonable properties, we first define a prior model which considers the following criteria: we prefer a small number of long tracks with little overlap between tracks.
Accordingly, we adopt the prior probability of a cover $\omega$ as the product of the following terms:

$$p(\omega) = p(N)\,p(L)\,p(O) \qquad (3.7)$$

1. Number of tracks. Let $K$ denote the number of tracks. We adopt an exponential model $p(N)$ to penalize the number of tracks:

$$p(N) = \frac{1}{z_0}\exp(-\lambda_0 K) \qquad (3.8)$$

2. Length of each track. We adopt an exponential model $p(L)$ of the length of each track. Let $|\tau_k|$ denote the length, i.e. the number of elements in $\tau_k$:

$$p(L) = \prod_{k=1}^{K} \frac{1}{z_1}\exp(\lambda_1 |\tau_k|) \qquad (3.9)$$

3. Spatial overlap between different tracks. We adopt the exponential model in Eq. 3.10 to penalize overlap between different tracks, where $\Gamma(t)$ denotes the average overlap ratio between tracks at time $t$:

$$p(O) = \prod_{t=1}^{T} \frac{1}{z_2}\exp(-\lambda_2 \Gamma(t)), \qquad \Gamma(t) = \frac{\sum_{\tau_i(t)\cap\tau_j(t)\ne\emptyset} \frac{|\tau_i(t)\cap\tau_j(t)|}{|\tau_i(t)\cup\tau_j(t)|}}{|\{(i,j):\tau_i(t)\cap\tau_j(t)\ne\emptyset\}|} \qquad (3.10)$$

In the solution space of Eq. 3.3, the prior model is applied to prevent the adoption of a more complex model than necessary. For example, a short track usually has better smoothness than a long track. Merely considering the smoothness defined by the likelihood would segment a long track into short tracks. In the extreme, each track contains a single observation and has perfect smoothness. The prior penalizes such an extreme condition through all three terms: the number of tracks, the length of each track, and the overlap among different tracks. Consider another extreme condition: a cover $\omega_1$ that contains two perfect tracks, $\tau_1$ and $\tau_2$, which overlap 100% with each other, versus a cover $\omega_2$ with the single track $\tau_1$. Without the prior, the decision cannot be made, since the number of targets is unknown and $\omega_1$ and $\omega_2$ have the same smoothness. The parameters of the prior model are hard to determine empirically. We will show how to determine the parameters appropriately for specific data sets.
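The overlap ratio $\Gamma(t)$ of Eq. 3.10 is the mean intersection-over-union over the pairs of covering rectangles that intersect at time t. A minimal sketch for axis-aligned rectangles (corner-coordinate representation chosen for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned rectangles,
    each given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def gamma_t(rects):
    """Average overlap ratio Gamma(t) of Eq. 3.10: mean IoU over the
    pairs of covering rectangles that intersect at time t."""
    ratios = [iou(a, b)
              for i, a in enumerate(rects)
              for b in rects[i + 1:]
              if iou(a, b) > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0

# Two overlapping tracks and one disjoint track at a single frame.
rects = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)]
print(gamma_t(rects))  # only the first pair overlaps: IoU = 1/3
```

Averaging over intersecting pairs only, as in Eq. 3.10, keeps $\Gamma(t)$ from being diluted by the many non-overlapping pairs in a crowded frame.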
3.2.2.2 Joint Motion and Appearance Likelihood

We assume the motion and appearance characteristics of targets are independent; therefore the likelihood can be written as

$$p(Y|\omega) = f_F(\omega)\prod_{k=1}^{K} f(\tau_k) \qquad (3.11)$$

where $f_F(\omega)$ represents the likelihood of the foreground area left uncovered by $\omega$, and $f(\tau_k)$ is the likelihood of each track. The area not covered by any rectangle corresponds to false alarms in the observations. We prefer to cover as much of the foreground regions as possible, unless the spatio-temporal smoothness prevents us from doing so. We adopt an exponential model of the uncovered area:

$$f_F(\omega) = \frac{1}{z_3}\exp(-\lambda_3 F) \qquad (3.12)$$

where $F$ is the foreground area (in pixels) which is not covered by any track. The appearance of the foreground regions covered by each track $\tau_k$ is supposed to be coherent, and the motion of such a rectangle sequence should be smooth. Hence, we consider a probabilistic framework incorporating two independent likelihoods, the motion likelihood $f_M$ and the appearance likelihood $f_A$:

$$f(\tau_k) = f_M(\tau_k)\,f_A(\tau_k) \qquad (3.13)$$

We represent the elements (rectangles) in track $k$ as $(\tau_k(t_1), \tau_k(t_2), \ldots, \tau_k(t_{|\tau_k|}))$, where $t_i \in [1,T]$ and $(t_{i+1} - t_i) \ge 1$. Each $\tau_k(t_i)$ can be regarded as the observation of track $\tau_k$ at time $t_i$. Since missed detections may happen, it is possible that no observation is assigned to track $\tau_k$ in the time interval $(t_i, t_{i+1})$.

Motion Likelihood

For each target, we consider a linear kinematic model:

$$x^k_{t+1} = A x^k_t + w, \qquad y^k_t = H x^k_t + v \qquad (3.14)$$

where $x^k_t$ is the hidden kinematic state vector, which includes the position $(u,v)$, size $(w,h)$ and the first order derivatives $(\dot{u}, \dot{v}, \dot{w}, \dot{h})$ in 2D image coordinates. The observation $y^k_t$ in Eq. 3.14 corresponds to the position and size of $\tau_k(t)$ in 2D image coordinates. $w \sim N(0,Q)$ and $v \sim N(0,R)$ are Gaussian process noise and observation noise.
To determine the motion likelihood $L_M$ for each track, according to Eq. 3.14, an observation $\tau_k(t_i)$ has a Gaussian probability density function $N(\cdot\,;\mu,\Sigma)$ given the predicted kinematic state $\bar{\tau}_k(t_i)$:

$$L_M[\tau_k(t_i)|\bar{\tau}_k(t_i)] \stackrel{\Delta}{=} L_M[\tau_k(t_i)] = N(\tau_k(t_i);\, H\bar{\tau}_k(t_i),\, S_k(t_i)) \qquad (3.15)$$

where $S_k(t_i) = H\bar{S}_k(t_i)H^T + R$ and $\bar{S}_k(t_i)$ is the prior estimate of the covariance matrix at time $t_i$. The motion likelihood of track $k$ can be represented as

$$f_M(\tau_k) = \prod_{i=3}^{|\tau_k|} L_M[\tau_k(t_i)] \qquad (3.16)$$

Since we consider derivatives in the kinematic states, we need two observations to initialize one track. Thus, the motion likelihood can be computed from the third observation on. The motion likelihood in Eq. 3.15 can be evaluated as follows:

$$N[\tau_k(t_i)] = |2\pi S_k(t_i)|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\, e_k(t_i)^T\, S_k(t_i)^{-1}\, e_k(t_i)\right), \qquad e_k(t_i) = \tau_k(t_i) - H\bar{\tau}_k(t_i|t_i-1) \qquad (3.17)$$

The details of updating the prior and posterior estimates in Kalman filters can be found in [55]. Note that if a missed detection happens in $\tau_k$ at time $t$, i.e. there is no observation at time $t$ for track $k$, the prior estimate is assigned to the posterior estimate.

Appearance Likelihood

In order to model the appearance of each detected region, we adopt the non-parametric histogram-based descriptor [39] to represent the appearance of the foreground area covered by $\omega$. The appearance likelihood of one track is modeled as a chain-like MRF (Markov Random Field). The likelihood between two neighbors is defined as follows:

$$L_A(\tau_k(t_i), \tau_k(t_{i-1})) \stackrel{\Delta}{=} L_A[\tau_k(t_i)] = \frac{1}{z_4}\exp\big(-\lambda_4 D(\tau_k(t_i), \tau_k(t_{i-1}))\big) \qquad (3.18)$$

where $D(\cdot)$ represents the symmetric Kullback-Leibler (KL) distance between the histogram-based descriptors of the foreground covered by $\tau_k(t_i)$ and $\tau_k(t_{i-1})$.
The entire appearance likelihood of $\tau_k$ can be factorized as

$$f_A(\tau_k) = \prod_{i=2}^{|\tau_k|} L_A[\tau_k(t_i)] \qquad (3.19)$$

Given one cover, the motion and appearance likelihood of a target is assumed to be independent of the other targets. The joint likelihood of a cover can be factorized as in Eq. 3.20:

$$p(Y|\omega) = f_F(\omega)\prod_{k=1}^{K} f_M(\tau_k)\, f_A(\tau_k) = f_F(\omega)\prod_{k=1}^{K}\ \prod_{i=3}^{|\tau_k|} L_M[\tau_k(t_i)] \prod_{i=2}^{|\tau_k|} L_A[\tau_k(t_i)] \qquad (3.20)$$

With some manipulation, we combine the prior $p(\omega)$ in Eq. 3.7 and the likelihood $p(Y|\omega)$ in Eq. 3.20 to rewrite the posterior as in Eq. 3.21:

$$p(\omega|Y) \propto \exp\{-C_0 S_{len} - C_1 K - C_2 F - C_3 S_{olp} - C_4 S_{app} - S_{mot}\}$$
$$S_{len} = -\sum_{k=1}^{K} |\tau_k|, \qquad S_{olp} = \sum_{t=1}^{T} \Gamma(t)$$
$$S_{app} = \sum_{k=1}^{K}\sum_{i=2}^{|\tau_k|} D(\tau_k(t_i), \tau_k(t_{i-1})), \qquad S_{mot} = \sum_{k=1}^{K}\sum_{i=3}^{|\tau_k|} \log(|S_k(t_i)|) + e(t_i)^T S_k(t_i)^{-1} e(t_i) \qquad (3.21)$$

where $e(t_i) = \tau_k(t_i) - H\bar{\tau}_k(t_i|t_i-1)$, and $C_0, \cdots, C_4$ are positive real constants, newly introduced parameters replacing $(\lambda_i, z_i),\ i = 0, \cdots, 4$. The parameters in the prior and likelihood functions are absorbed into the free parameters $C_0, \cdots, C_4$. Once a possible cover $\omega$ is given, the variables $S_{len}$, $K$, $F$, $S_{olp}$, $S_{app}$ and $S_{mot}$ can be computed. In Section 3.4, we show a practical way to determine these parameters. The global maximum (called the mode in statistics) of the posterior $p(\omega|Y)$ is our MAP solution. Eq. 3.21 reveals that the MAP estimation is equivalent to finding the minimum of an energy function. Determining the parameters in such a posterior is as important as maximizing the posterior. An improper parameter setting makes the optimization process meaningless. This issue is very often ignored and thus draws criticism of Bayesian MAP inference. In Section 3.4, we discuss how to determine the parameters automatically by Linear Programming.
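The appearance energy term $S_{app}$ above sums the symmetric KL distance $D(\cdot)$ between consecutive histogram descriptors along a track. A minimal sketch, with a small epsilon added only to keep the logarithms finite on empty bins (the epsilon smoothing is an implementation choice, not part of the formulation):

```python
import math

def sym_kl(p, q, eps=1e-10):
    """Symmetric Kullback-Leibler distance between two normalized
    histograms: KL(p||q) + KL(q||p)."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps))
                for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps))
                for pi, qi in zip(p, q))
    return kl_pq + kl_qp

def s_app(track_hists):
    """Appearance energy S_app of Eq. 3.21 for one track: the sum of
    symmetric KL distances D between consecutive descriptors."""
    return sum(sym_kl(track_hists[i], track_hists[i - 1])
               for i in range(1, len(track_hists)))

# Three 4-bin appearance histograms along a hypothetical track.
h1 = [0.25, 0.25, 0.25, 0.25]
h2 = [0.25, 0.25, 0.25, 0.25]   # identical appearance: zero distance
h3 = [0.70, 0.10, 0.10, 0.10]   # appearance change: positive distance
print(s_app([h1, h2]))          # 0.0 for identical histograms
print(s_app([h1, h2, h3]) > 0)  # True
```

Symmetrizing the KL divergence makes $D$ independent of the temporal order of the two rectangles, which is consistent with the chain-like MRF in Eq. 3.18.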
3.3 Compute the MAP Estimation by Data-Driven MCMC

3.3.1 An Introduction to MCMC

The Metropolis algorithm is placed among the ten algorithms that have had the greatest influence on the development and practice of science and engineering in the 20th century [5]. The Metropolis algorithm is the most famous instance of Markov Chain Monte Carlo (MCMC), a large class of sampling algorithms. These algorithms have played a significant role in statistics, econometrics, physics and computing science over the last two decades [5]. MCMC techniques are often applied to solve integration and optimization problems in large dimensional spaces. Typical examples of the use of MCMC include:

• Integration: In many cases, one needs to compute an integral, for example, in Bayesian inference, normalization $p(x|y) = \frac{p(y|x)p(x)}{\int_{x'} p(y|x')p(x')\,dx'}$, marginalization $p(x|y) = \int_z p(x,z|y)\,dz$, and expectation $E_{p(x|y)}(f(x)) = \int_x f(x)p(x|y)\,dx$; in statistical physics, computing the partition function $Z = \sum_s \exp[-\frac{E(s)}{kT}]$, where $k$ is Boltzmann's constant and $T$ denotes the temperature of the system.

• Optimization: Optimization extracts the solution that optimizes some objective function from a large set of feasible solutions. MCMC techniques are often applied when it is too computationally expensive to enumerate all possible solutions to find the optimal one.

3.3.2 Spatio-temporal MCMC Data Association

Directly optimizing the posterior by enumerating all possible solutions in the solution space defined in Eq. 3.3 is not feasible. We propose to use data-driven MCMC to estimate the best spatio-temporal cover of the foreground regions. To ensure that detailed balance is satisfied, the Markov chain is designed to be ergodic and aperiodic. It is also important to design samplers that converge quickly. Due to the ergodicity of the Markov chain, there is always a "path" from one state to another with non-zero probability.
Moreover, sufficient flexibility in the transitions of the Markov chain significantly reduces the mixing time. In the design of the transitions, we provide flexibility in two ways. First, we design ten types of transitions, the first seven being temporal moves. They contain some redundancy; for example, merge (or split) can be implemented by death moves with extension moves, and switch can be implemented by split and merge moves. Second, within a time span, the "future" and "past" information is symmetric: we can extend a track in both the positive and negative time directions.

Figure 3.4: Illustration of neighborhood and association likelihood, where τ_k(t_3) has three neighbors.

To make the sampling more efficient, we define the neighborhood in spatio-temporal space. Two covering rectangles are regarded as neighbors if their temporal distance and spatial distance are both smaller than a threshold. The neighborhood actually forms a graph, where a covering rectangle corresponds to a node and an edge between two nodes indicates that the two covering rectangles are neighbors. In the rest of the paper, we use "node" and "covering rectangle" interchangeably. A neighbor with a smaller (larger) frame number is called a parent (child) node. The neighborhood makes the algorithm more manageable, since candidates are considered only within the neighborhood system. Figure 3.4 illustrates the neighborhood. The joint motion and appearance likelihood of assigning an observation y (i.e. one foreground region) to a track τ_k after t_i is represented as

$L(y|\tau_k(t_i)) = L_M(y|\tau_k(t_i)) L_A(y, \tau_k(t_i))$    (3.22)

In our proposal distribution, the sampler contains two types of moves: temporal and spatial moves. One move here means one transition of the state of the Markov chain. Temporal moves only change the labels of rectangles in the cover.
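The spatio-temporal neighborhood graph can be built with a double loop over detections. The node fields, the thresholds and the parent/child partition below are illustrative assumptions for the sketch, not the thesis implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    frame: int                 # time index of the covering rectangle
    cx: float                  # rectangle center x
    cy: float                  # rectangle center y
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)

def build_neighborhood(nodes, max_dt=3, max_dist=50.0):
    """Link rectangles whose temporal AND spatial distances fall below a
    threshold; the earlier neighbor becomes the parent, the later the child."""
    for a in nodes:
        for b in nodes:
            if a is b or b.frame <= a.frame:
                continue
            dt = b.frame - a.frame
            dist = ((a.cx - b.cx) ** 2 + (a.cy - b.cy) ** 2) ** 0.5
            if dt <= max_dt and dist <= max_dist:
                a.children.append(b)   # b is a child of a
                b.parents.append(a)    # a is a parent of b

nodes = [Node(0, 10, 10), Node(1, 15, 12), Node(1, 200, 200), Node(2, 18, 15)]
build_neighborhood(nodes)
```

Restricting candidate associations to this graph is what keeps the proposal distributions below tractable: a birth or extension move only ever scans a node's `children` list rather than every rectangle in the sequence.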
However, since detected moving regions do not always correspond to a single target (they may represent parts of a target, or delineate multiple targets moving closely to each other), merely using temporal moves cannot probe the spatial cover of the foreground. Hence, we propose a set of spatial moves to segment, aggregate or diffuse detected regions to infer the best cover of the foreground. The spatial and temporal moves are interdependent: the result of a spatial move is evaluated within temporal moves, and the result of a temporal move guides subsequent spatial moves.

The overview of our MCMC data association algorithm is shown in Algorithm 1. The input to the algorithm is the set of original foreground regions Y, the initial cover ω_0 and the total number of samples n_mc. The initial cover ω_0 is initialized with a greedy criterion, namely using the MHT algorithm but keeping only the best hypothesis at each time. The covering rectangles in ω_0 are directly obtained from the MBRs of foreground regions.

Algorithm 1 Spatio-temporal MCMC Data Association
Input: Y, n_mc, ǫ, ω* = ω_0
Output: ω*
for n = 1 to n_mc do
    if n < ǫ·n_mc then
        Sample one temporal move.
    else
        Sample one move from all candidate moves.
    end if
    Propose ω′ according to q(ω, ω′)
    Sample U from Unif[0,1]
    if U < A(ω, ω′) then ω_n = ω′ else ω_n = ω end if
    if p(ω_n|Y) > p(ω*|Y) then ω* = ω_n end if
end for

Each move is sampled according to its own prior probability. Since the temporal information is also used by the spatial moves, we first take ǫ·n_mc (ǫ = 0.15 in our experiments) temporal moves, after which both types of moves are considered without discrimination. Note that, instead of keeping all samples, we only keep the cover with the maximum posterior, since we do not need the whole distribution but only the MAP estimate. The target distribution is the posterior distribution of ω, i.e. π(ω) = p(ω|Y), which is defined on a union of subspaces of varying dimension.
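Algorithm 1 maps directly onto a short sampling loop. In the sketch below, the move set, the log-posterior and the state representation are toy placeholders standing in for the full tracker (integer states, a posterior peaked at 7); only the control flow of Algorithm 1 is faithful: warm-up with temporal moves, Metropolis-Hastings acceptance, and keeping the best cover seen so far.

```python
import math, random

def mcmc_map_search(log_posterior, propose_temporal, propose_spatial,
                    omega0, n_mc=1000, eps=0.15):
    """MAP search as in Algorithm 1: sample moves, accept by MH,
    and retain only the highest-posterior state visited."""
    omega, best = omega0, omega0
    for n in range(n_mc):
        # Warm-up phase: temporal moves only; afterwards, any move type.
        propose = propose_temporal if n < eps * n_mc else \
            random.choice([propose_temporal, propose_spatial])
        omega_new, log_q_ratio = propose(omega)  # log [q(new->old)/q(old->new)]
        log_alpha = log_posterior(omega_new) - log_posterior(omega) + log_q_ratio
        if math.log(random.random()) < min(0.0, log_alpha):
            omega = omega_new
        if log_posterior(omega) > log_posterior(best):
            best = omega
    return best

# Toy usage: states are integers, log-posterior peaks at 7.
random.seed(1)
lp = lambda w: -abs(w - 7)
step = lambda w: (w + random.choice([-1, 1]), 0.0)  # symmetric proposal
best = mcmc_map_search(lp, step, step, omega0=0, n_mc=500)
```

As in the thesis, the whole chain is discarded and only the running argmax is kept, since the goal is the MAP cover rather than the full posterior.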
Thus we adopt a trans-dimensional MCMC algorithm [30], which deals with proposal and target distributions in spaces of varying dimension. One move from ω_m ∈ Ω_m to ω_{m′} ∈ Ω_{m′} (m ≠ m′) is a jump between two different models. Reversible-Jump MCMC [31], proposed by P. Green, connects these two models by drawing "dimension matching" variables u and u′ from proposal distributions q_m(u) and q_{m′}(u′), provided that dim(ω) + dim(u) = dim(ω′) + dim(u′), where dim(·) denotes the dimension of a vector. Then ω and ω′ can be generated from deterministic functions ω = g(ω′, u′) and ω′ = g(ω, u). The acceptance ratio is defined as follows:

$\alpha_m(\omega,\omega') = \min\left(1, \frac{\pi(\omega')}{\pi(\omega)} \frac{q_{m'}(\omega|\omega')}{q_m(\omega'|\omega)} \left|\frac{\partial(\omega',u')}{\partial(\omega,u)}\right|\right)$    (3.23)

The temporal moves of merge, split and switch do not change the number of covering rectangles, but only the labels of the rectangles. The spatial moves do not change the labels of the rectangles, but only their states. These types of moves do not change the dimension of the space. The temporal moves of birth, death, extension and reduction involve trans-dimensional dynamics. Note that both dimension-increasing and dimension-decreasing moves change only one part of the cover and do not affect the remaining part. For a pair of dimension increasing/decreasing moves, if u is a random variable, u ∼ q(u), the move is defined as ω′ = g(ω, u) = [ω, u] with dim(ω′) = dim(ω) + dim(u); then q_m(ω′|ω) = q(u). In RJ-MCMC, if u is independent of ω, it is easy to show that the Jacobian is unity [32].

In such a Markov chain transition, the computation for each MCMC move is actually low, since we only need to compute the ratio π(ω′)/π(ω) instead of the value of each posterior. Moreover, since the Markov chain dynamics change only one part of the cover and do not affect the remaining part, the ratio π(ω′)/π(ω) can be computed by considering only the change from ω to ω′.
For instance, for a split/merge move, we only need to consider the likelihood change and the prior change for the affected tracks. In subsequent sections, we show how to devise the Markov chain transitions by considering specific choices for the proposal distribution q(ω′|ω).

3.3.3 Data-driven Markov Chain Dynamics

Dynamics 1-7 are temporal moves, which involve changing the labels of rectangles. The operation of selecting candidate rectangles in the birth and extension moves only involves selecting from the covering rectangles of original foreground regions. Dynamics 8-10 are spatial moves, which change the states of covering rectangles. The priors for moves 1 to 10 are predetermined as p(1) to p(10).

Dynamics 1-2: Forward Birth and Death. For a forward birth move, we pick two neighboring nodes in different frames to form a track seed, which contains two nodes:

$\omega = (\{r_i\}_{i=1}^{m}) \to (\omega, \{r_{m+1}, r_{m+2}\}) = \omega'$    (3.24)

For the first candidate rectangle, we select uniformly at random (u.a.r.) one of the covering rectangles of original foreground regions that have not been covered, i.e. q_b(r_{m+1}) is equal to one over the number of original bounding rectangles that are not covered. Let child(r_{m+1}) be the set of child nodes of r_{m+1} that have not been covered; the probability of selecting the second candidate is

$q_b(r_{m+2}|r_{m+1}) = \frac{(-\log L_A(r_{m+2}, r_{m+1}) + 1)^{-1}}{\sum_{y \in child(r_{m+1})} (-\log L_A(y, r_{m+1}) + 1)^{-1}}$    (3.25)

When we select the second node in a track seed, we only use the appearance likelihood in Eq. 3.25 (since the computation of the motion likelihood needs at least two nodes). To avoid the probability of one candidate dominating all the others, we use the inverse of the negative log likelihood to define the probability. For the reverse move, we select u.a.r. one of the existing track seeds and remove it from the current cover, i.e. q(seed) is equal to one over the number of track seeds. By the Metropolis-Hastings method, we need two proposal probabilities, q_birth(ω, ω′) and q_death(ω′, ω).
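The inverse negative-log-likelihood weighting of Eq. 3.25 can be written as a small helper. The likelihood values below are made-up inputs for illustration; the point of the $(-\log L + 1)^{-1}$ form is visible in the output, which is flatter than directly normalizing the likelihoods themselves:

```python
import math

def birth_weights(likelihoods):
    """Turn appearance likelihoods of candidate children into proposal
    probabilities using (-log L + 1)^-1, as in Eq. 3.25. The inverse of
    the negative log keeps a single strong candidate from dominating."""
    raw = [1.0 / (-math.log(L) + 1.0) for L in likelihoods]
    total = sum(raw)
    return [w / total for w in raw]

# Three hypothetical child candidates with appearance likelihoods:
probs = birth_weights([0.9, 0.5, 0.1])
```

For comparison, directly normalizing [0.9, 0.5, 0.1] would give the weakest candidate under 0.07 of the mass; the log-based weighting leaves it a non-negligible chance of being proposed, which helps the sampler escape wrong associations.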
q_birth(ω, ω′) is the conditional probability of the Markov chain proposing to move to ω′, and q_death(ω′, ω) is the probability of coming back. The acceptance probability of a birth move is then

$A(\omega,\omega') = \min\left(1, \frac{\pi(\omega') q_{death}(\omega',\omega)}{\pi(\omega) q_{birth}(\omega,\omega')}\right)$    (3.26)

where the proposal probability of a birth move is the product of the prior of a birth move, p(1), and the probability of selecting the two candidate rectangles, i.e. q_birth(ω, ω′) = p(1) q_b(r_{m+1}) q_b(r_{m+2}|r_{m+1}). The proposal probability of a death move is the product of the prior of a death move and the probability of selecting one seed track, i.e. q_death(ω′, ω) = p(2) q(seed).

Dynamics 3-4: Forward Extension and Reduction. For a forward extension move, we select a track τ_k ∈ ω according to its length, i.e. $q_e(\tau_k) = \frac{\exp(-\lambda_e |\tau_k|)}{\sum_{\tau_k \in \omega} \exp(-\lambda_e |\tau_k|)}$. Suppose the end node of track k is at frame t_i; we select one covering rectangle of an original foreground region, r_{m+1}, from child(τ_k(t_i)) and add it to τ_k. The probability of selecting the new node, q_e(r_{m+1}), can be represented as

$q_e(r_{m+1}|\tau_k(t_i)) = \frac{(-\log L(r_{m+1}|\tau_k(t_i)) + 1)^{-1}}{\sum_{y \in child(\tau_k(t_i)) \cap \tau_0} (-\log L(y|\tau_k(t_i)) + 1)^{-1}}$    (3.27)

This probability is similar to the one in Eq. 3.25, but considers both motion and appearance likelihoods. For the reverse move, we select u.a.r. a track τ_k that contains more than two nodes and remove its end node. To allow multiple extensions or reductions, after one extension we continue to extend the same track τ_k with probability γ_e; similarly, after one reduction, we continue to reduce τ_k with probability γ_r. The proposal probability of extension is $q_{extension}(\cdot) = p(3) q_e(\tau_k) (\gamma_e)^{n-1} (1-\gamma_e) \prod_{i=1}^{n} q_e(r_{m+i})$ and the proposal probability of the reverse move is $q_{reduction}(\cdot) = p(4) q_r(\tau_k) (\gamma_r)^{n-1} (1-\gamma_r)$, where n indicates the number of extension or reduction moves that actually occur.

Dynamics 5-6: Merge and Split.
If one track's (τ_{k1}) end node is in the parent set of another track's (τ_{k2}) start node, this pair of tracks is a candidate for a merge move. We select u.a.r. a pair of tracks from the candidates and merge them into a new track τ_k = {τ_{k1}} ∪ {τ_{k2}}. The proposal probability of a merge move is q_merge(·) = p(5) q_m(τ_{k1}, τ_{k2}).

For the reverse move, we select a track τ_k according to $q_s(\tau_k) = \frac{\exp(-\lambda_s |\tau_k|^{-1})}{\sum_{|\tau_k| \geq 4} \exp(-\lambda_s |\tau_k|^{-1})}$ and then select a break point according to the probability br_k(i):

$br_k(i) = \frac{-\log L(\tau_k(t_{i+1})|\tau_k(t_i))}{\sum_{j=0}^{|\tau_k|-2} -\log L(\tau_k(t_{j+1})|\tau_k(t_j))}$    (3.28)

br_k(i) is designed to prefer breaking a track at the location where the motion and appearance likelihood is low. The nodes in the track after the break point are moved to a new track. If the break point occurs at the first or last link, the split operation has the same effect as a reduction operation. The proposal probability of a split move is q_split(·) = p(6) q_s(τ_k) br_k(i).

Dynamics 7: Switch. If there exist two locations p, q in two tracks τ_{k1}, τ_{k2} such that τ_{k1}(t_p) is in the parent set of τ_{k2}(t_{q+1}) and τ_{k2}(t_q) is in the parent set of τ_{k1}(t_{p+1}), this pair of nodes is a candidate for a switch move. We select u.a.r. a candidate and define two new tracks as:

$\tau'_{k1} = \{\tau_{k1}(t_1), \ldots, \tau_{k1}(t_p), \tau_{k2}(t_{q+1}), \ldots, \tau_{k2}(t_{|\tau_{k2}|})\}$
$\tau'_{k2} = \{\tau_{k2}(t_1), \ldots, \tau_{k2}(t_q), \tau_{k1}(t_{p+1}), \ldots, \tau_{k1}(t_{|\tau_{k1}|})\}$    (3.29)

The reverse move of a switch is symmetric, i.e. the reverse of a switch is still a switch. The proposal probabilities of a switch move and its reverse move are identical, so there is no need to compute the proposal probability. The acceptance probability of a switch move is

$A_{switch}(\omega,\omega') = \min\left(1, \frac{\pi(\omega')}{\pi(\omega)}\right)$
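The break-point distribution of Eq. 3.28 simply normalizes the negative log-likelihoods of consecutive links, so weak links are split first. The link likelihood values below are invented for the illustration:

```python
import math

def break_point_probs(link_likelihoods):
    """Probability of breaking a track at each link (Eq. 3.28):
    proportional to -log L of that link, so low-likelihood links
    are the preferred split locations."""
    neg_logs = [-math.log(L) for L in link_likelihoods]
    total = sum(neg_logs)
    return [v / total for v in neg_logs]

# Four links; the third is an unlikely association (e.g. an occlusion
# caused the motion/appearance likelihood to drop).
probs = break_point_probs([0.8, 0.7, 0.05, 0.9])
```

With these inputs, the suspicious third link receives over 80% of the split probability, which is exactly the behavior the move is designed for.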
Figure 3.5: Illustration of a diffusion move: the RGB color histogram of a track is quantized into 16×16×16 bins; the weight image is the back-projection of the color histogram, masked by the foreground regions; the blue dashed rectangle indicates the prediction from the motion model, the red arrow is the spatial-scale mean shift vector, and the dashed red rectangle shows the proposal made by a diffusion move.

Dynamics 8: Diffusion. We select one covering rectangle τ_k(t) in a track according to the probability

$q_{dif}(\tau_k(t)) = \frac{-\log L(\tau_k(t_i)|\tau_k(t_{i-1}))}{\sum_{k=1}^{K} \sum_{i=2}^{|\tau_k|} -\log L(\tau_k(t_i)|\tau_k(t_{i-1}))}$

This probability prefers selecting a covering rectangle that has a low motion and appearance likelihood with respect to its preceding neighbor. Low motion and appearance likelihoods indicate that the covering rectangle of the track in this frame may be erroneous. In order to update its state, we first obtain its estimated state τ̄(t) from the motion model, and then update its position and size according to the appearance model: generate a new covering rectangle τ′_k(t) from the probability S(τ′_k(t)|τ̄_k(t)):

$S(y'_t|y_t) \sim N\left(y_t + \alpha \frac{dE}{dx}\Big|_{x=y_t},\ u\right)$    (3.30)

where E = −log L_A(x|y_t) is the appearance energy function, α is a scalar controlling the step size and u is a Gaussian white noise that helps avoid local minima. In practice, we adopt the spatial-scale mean shift vector [16] to approximate the gradient of the negative appearance likelihood with respect to position and scale. A scale space is conceptually generated by convolving a filter bank of spatial DOG (Difference of Gaussian) filters with a weight image. Searching for the mode in such a 3D scale space is efficiently implemented in [16] by a two-stage mean-shift procedure that interleaves spatial and scale mode-seeking, rather than explicitly building a 3D scale space and then searching.
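The weight image used by the diffusion move can be sketched as a plain histogram back-projection. The tiny quantization (4 bins per channel instead of 16) and the toy one-row image below are illustrative simplifications of the thesis setup:

```python
def quantize(pixel, bins=4):
    """Map an (r, g, b) pixel in [0, 256) to a joint histogram bin index."""
    step = 256 // bins
    r, g, b = (c // step for c in pixel)
    return (r * bins + g) * bins + b

def back_project(image, hist, fg_mask, bins=4):
    """Replace each pixel with its probability under the track's color
    histogram; background pixels (mask == 0) always get weight zero."""
    total = float(sum(hist.values())) or 1.0
    return [[hist.get(quantize(px, bins), 0) / total if m else 0.0
             for px, m in zip(row, mrow)]
            for row, mrow in zip(image, fg_mask)]

# Toy 1x3 image: two red pixels, one blue; the track model is all red.
image = [[(250, 10, 10), (250, 10, 10), (10, 10, 250)]]
hist = {quantize((250, 10, 10)): 8}   # track histogram: 8 red samples
mask = [[1, 1, 0]]                    # last pixel lies outside the foreground
weights = back_project(image, hist, mask)
```

The resulting weight image is what the spatial-scale mean shift of [16] climbs; masking by the foreground is the step that forces background pixels to weight zero, as described above.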
In our experiments, we only compute the mean shift vector in scale space once; that is, we perform the spatial mean shift once, followed by the scale mean shift, without iterating. The diffusion move is illustrated in Figure 3.5. The color histogram of one track is computed in RGB space with 16×16×16 bins. Around the initial state τ̄(t), a weight image is computed using histogram back-projection, replacing each pixel with the probability associated with that RGB value in the color histogram. Note that the weight image is masked by the foreground regions: the weight of a background pixel is always zero, as shown in Figure 3.5. A new proposal is generated by drifting the initial state along the mean shift vector and adding Gaussian noise according to Eq. 3.30. The newly generated covering rectangle takes the place of τ_k(t). The diffusion move may leave partial foreground regions uncovered; these regions can be covered by new rectangles generated in birth moves if they can form a consistent track. The proposal probability of a diffusion move is q_dif(·) = p(8) q_dif(τ_k(t)) S(τ′_k(t)|τ̄_k(t)). The diffusion move is also symmetric. The acceptance ratio of a diffusion move is

$A_{dif}(\omega,\omega') = \min\left(1, \frac{\pi(\omega') S(\tau_k(t)|y_t)}{\pi(\omega) S(\tau'_k(t)|y_t)}\right)$    (3.31)

Both motion and appearance information are considered in the diffusion operation: the initial state for computing the mean shift vector is the state predicted by the Kalman filter, τ̄(t), and the diffusion vector is computed according to appearance information. The diffusion is used for generating new hypotheses, and the acceptance decision is still made according to the Metropolis-Hastings algorithm, where the posterior distribution encoding the joint motion and appearance likelihood plays an important role in accepting a good solution.

Figure 3.6: Illustrations of (a) segmentation and (b) aggregation moves, where the color indicates the object ID, dashed boxes indicate the estimated rectangles from the motion model, and regions with red boundaries are original foreground regions.
Since we do not have a precise segmentation of the foreground regions, the appearance computation may not be very accurate when occlusion happens. The motion likelihood helps in estimating a good cover when appearance is not reliable; this is why we need the joint motion and appearance model. The parameters C_0, ..., C_4 represent the tradeoff between the different factors in the posterior and are trained offline to adapt to a specific data set.

Dynamics 9: Segmentation. If the predictions τ̄_k(t) of more than one track have enough overlap with one covering rectangle y at time t, as illustrated in Figure 3.6(a), this indicates that one covering rectangle may correspond to multiple tracks. Such a rectangle is regarded as a candidate for a segmentation move, and those tracks are the related tracks of the candidate y. We randomly select such a candidate y and, for each related track τ_k, generate a new covering rectangle τ′_k(t) according to the probability S(τ′_k(t)|τ̄_k(t)). The segmentation move is achieved through diffusion moves (each related track performs one diffusion). Thus, the reverse of a segmentation move is also a segmentation move. The acceptance ratio of one segmentation move is

$A_{seg}(\omega,\omega') = \min\left(1, \frac{\pi(\omega') \prod S(\tau_k(t)|y_t)}{\pi(\omega) \prod S(\tau'_k(t)|y_t)}\right)$    (3.32)

Dynamics 10: Aggregation. If one track's prediction τ̄_k(t) has enough overlap with more than one covering rectangle at time t, as illustrated in Figure 3.6(b), this indicates that the observation of this track in this frame may be fragmented into multiple regions. This forms a candidate for an aggregation move. We randomly select such a candidate τ̄_k(t) and, for the track τ_k, generate a new covering rectangle τ′_k(t) according to the probability S(τ′_k(t)|τ̄_k(t)). The newly generated covering rectangle takes the place of τ_k(t). The aggregation move is also symmetric and its acceptance ratio is similar to the one in Eq. 3.31. Both segmentation and aggregation moves are implemented by diffusion moves.
In other words, the segmentation and aggregation moves are particular types of diffusion moves that address merged and fragmented observations, respectively.

3.4 A Practical Approach to Determine the MAP Model

Properly selecting the parameters is necessary to ensure that the Markov chain converges to the correct distribution, and determining the parameters in a principled way is not a trivial task. First, the posterior is known only up to scale, because the computation of the normalization factor over the entire space of ω is intractable. Second, the parameters, which encode a lot of domain-specific knowledge, such as false alarms, overlap, etc., are highly scenario-related. Empirical knowledge cannot help in determining the parameters of such a complex posterior, and relying on it would make the process unrepeatable. Determining a proper setting of the parameters is a first-line problem before any stochastic optimization: a casual setting of the parameters in the posterior makes all the effort in searching for a MAP solution meaningless. The global optimal solution, which is "optimal" only with respect to some oracle type of posterior, may be no more meaningful than some inferior local maximum, or even a non-maximum. This issue was noticed by S. C. Zhu et al. in [75], where the authors applied Data-Driven MCMC to image segmentation (segmentation is also intrinsically ambiguous). The authors proposed a K-Adventurer algorithm to extract K distinct solutions from the Markov chain samples. This method requires storing the Markov chain in order to select the K distinct solutions. However, because of the computational cost and the different definition of the goal, in the multiple target tracking problem it is not appropriate to keep multiple solutions from the whole chain. Here, we propose an automatic way to determine the parameters of such a probabilistic model.
3.4.1 Order-preserving Constraints

Given one ω, the log posterior density is a linear function of the parameters (note that the log posterior density is not a linear function of ω; otherwise direct optimization of the posterior could be expected). Such a linear combination in parameter space is commonly seen in posteriors that factorize into a set of independent components. As mentioned in Section 3.3.2, we only need to compute the ratio π(ω′)/π(ω) in the Markov chain transition, instead of the values of π(ω′) and π(ω). Inspired by this property, although we cannot know the exact values of π(ω′) and π(ω), we can establish a set of order-preserving constraints, i.e. π(ω)/π(ω′) ≥ 1 (or ≤ 1), whenever we know that one solution is no worse than the other. Such constraints can be transformed into a set of linear inequalities in the parameters. After collecting enough inequalities, we can apply linear programming to find a feasible solution for the parameters. Given ground truth data, whether one solution is no worse than another is easy to determine by degrading the ground truth using the spatial and temporal moves defined in Section 3.3.3.

3.4.2 Implementation Using MCMC Dynamics

In order to collect the order-preserving constraints, we start from the ground truth, which contains tracks with correct labels and locations, and then degrade the ground truth using the dynamics presented in the MCMC sampling section. The motion model, which is used to compute the motion likelihood term S_mot in Eq. 3.21, needs to be determined beforehand. We use well-segmented foreground regions as observations and, by fitting partial ground truth and observations to the motion model, we determine the parameters of the motion model, i.e. Q and R in Eq. 3.14. Then we start with the best cover ω* obtained from the ground truth and use the temporal and spatial moves to degrade the best cover to ω_i.
For each ω_i, we have a constraint

$\pi(\omega^*)/\pi(\omega_i) \geq 1$    (3.33)

Given one cover, according to Eq. 3.21, the log posterior $f(C|\omega) \triangleq \log(p(\omega|Y))$ is a linear function of the free parameters. Eq. 3.33 thus provides one linear inequality, i.e. f(C|ω*) − f(C|ω_i) ≥ 0. After collecting multiple constraints, we use linear programming to find a solution of positive parameters with maximum sum:

Maximize: $a^T C$
Subject to: $A^T C \leq b,\ C \geq 0$    (3.34)

where C = [C_0, ..., C_4], a = [1, 1, 1, 1, 1]^T, and each row of A^T C ≤ b encodes one constraint from Eq. 3.33. In our experiments, 5,000 constraints, covering most cases of the different moves across multiple sequences in one data set, are sequentially generated and added to a constraint set.

Due to the ambiguity present in ground truth, a small number of conflicting constraints may exist; any constraint that conflicts with the existing set is ignored. In fact, the objective vector a in the LP of Eq. 3.34 is a rather loose parameter as long as enough constraints are collected: any vector a containing 5 positive numbers will work. In order to determine how many constraints are enough to get an accurate estimate of the parameters, we simulate a density function with 5 parameters. For a given number of constraints (the x-axis in Figure 3.7), we independently generate multiple sets of constraints and compute the average estimation errors, which are shown in Figure 3.7.

Figure 3.7: Average normalized error ||Ĉ − C||/||C|| of the estimated parameters Ĉ for different numbers of constraints (in thousands).

Since the parameters indeed encode scenario-related knowledge, we train the parameters for different data sets. Here, a data set means a set of video sequences from a similar scenario (similar background, similar moving objects, and the same method for foreground segmentation).
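The constraint-collection step can be sketched in a few lines: each degraded cover contributes one linear inequality in the parameters C. The per-cover statistics below (S_len, K, F, S_olp, S_app, in that order) are fabricated stand-ins for real degraded-ground-truth comparisons, the unit-weight S_mot term is omitted for simplicity, and in practice the resulting rows would be fed to an LP solver as in Eq. 3.34:

```python
def log_posterior_features(cover_stats):
    """Return the vector multiplying [C0..C4] in log p(omega|Y) (Eq. 3.21),
    i.e. the negated energies (-S_len, -K, -F, -S_olp, -S_app)."""
    return [-v for v in cover_stats]

def constraint_row(gt_stats, degraded_stats):
    """Order-preserving constraint f(C|w*) - f(C|w_i) >= 0 (Eq. 3.33),
    expressed as one row 'row . C >= 0' of the LP in Eq. 3.34."""
    f_gt = log_posterior_features(gt_stats)
    f_deg = log_posterior_features(degraded_stats)
    return [a - b for a, b in zip(f_gt, f_deg)]

def satisfies(C, rows):
    """Check a candidate parameter vector against all collected rows."""
    return all(sum(c * r for c, r in zip(C, row)) >= 0 for row in rows)

# Ground-truth cover vs. two degraded covers (made-up energy statistics):
gt = [-20, 2, 1, 0.5, 3.0]
degraded = [[-15, 3, 2, 0.5, 3.5],   # shorter tracks, more false alarms
            [-20, 2, 1, 2.0, 4.0]]   # more overlap and appearance cost
rows = [constraint_row(gt, d) for d in degraded]
C_good = [1.0, 1.0, 1.0, 1.0, 1.0]
```

Each row says "the ground truth must score at least as high as this degraded cover"; collecting thousands of such rows and maximizing $a^T C$ over them is exactly the LP of Eq. 3.34.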
The trained parameters ensure a desirable Markov chain transition and a correct MAP solution.

3.5 Extension with Model Information

Tracking has become increasingly powerful in recent years, largely due to the adoption of statistical appearance models, which allow effective foreground segmentation and object representation. It has been standard, however, to treat foreground segmentation or object detection as a separate process followed by tracking. In this section, we address this limitation and propose a unified framework that combines detection and tracking in a probabilistic manner.

3.5.1 Extension of MAP with Model Likelihood

In Section 3.2, we formulate multiple target tracking as finding the cover of foreground regions with the best spatio-temporal smoothness in both motion and appearance, where the input contains merely foreground regions. When we focus on one particular type of object, such as humans or vehicles, we extend our formulation to finding a meaningful interpretation in terms of that type of object with the best spatio-temporal smoothness. One possible interpretation is shown in Figure 3.8. A track is now defined as a sequence of rectangles with a fixed length-width ratio. The full likelihood function is as follows:

$p(Y|\omega) = \prod_{k=1}^{K} L(\tau_k) = \prod_{k=1}^{K} \underbrace{\prod_{i=1}^{|\tau_k|-1} L_M \circ L_A(\tau_k(t_{i+1})|\tau_k(t_i))}_{(a)} \underbrace{\prod_{i=1}^{|\tau_k|} L_D(\tau_k(t_i))}_{(b)}$    (3.35)

L_M ∘ L_A in Eq. 3.35 denotes the motion and appearance likelihood of track τ_k. In addition to motion and appearance consistency, the proper cover of the foreground regions should comply with the model of the objects of interest. The joint likelihood function is therefore extended with a model likelihood L_D, which represents the likelihood that the box of τ_k at time t_i is a target of interest. We build this model likelihood using a boosted classifier with

Figure 3.8: One possible interpretation, or meaningful covering, of the foreground regions over time, which includes three tracks (τ_1, τ_2, τ_3).
τ_0 is not shown for clarity. The frame rate of the overlayed foreground regions is down-sampled.

the Edgelet features [80] for two types of objects, i.e. humans and vehicles. Given a set of labeled patterns {(I_i, l_i)}, the AdaBoost procedure learns a weighted combination of base weak classifiers, $H(I) = \sum_{t=0}^{T} h_t(I)$, where h_t(I) is the LUT-based weak classifier [79] chosen in round t of boosting. The Boolean decision is made by sign(H(I) − b), where b is the threshold controlling the detection and false alarm rates. For each LUT-based weak classifier, the range of the edgelet feature f_edgelet ∈ [0, 1] is divided into n bins. The weak classifier can be formulated as:

$h_t(I) = \frac{1}{2} \sum_{j=1}^{n} \ln\left(\frac{W^j_{+1} + \xi}{W^j_{-1} + \xi}\right) \delta_j(f_{edgelet}(I))$    (3.36)

where $W^j_m = P(f_{edgelet}(I) \in bin_j,\ l = m)$ and δ_j is the indicator function of f_edgelet(I) ∈ bin_j. Instead of using the Boolean detector, we use a likelihood function obtained by exponentiating the confidence output of the AdaBoost classifier:

$L_D(I) = \frac{1}{z_4} \exp\left(\lambda_4 \sum_{t=0}^{T} h_t(I)\right)$    (3.37)

where λ_4 is a positive constant. In order to reduce the false alarm rate, we use a bootstrap procedure that iteratively adds false alarms reported by the trained strong classifier and then re-trains the detectors using the old positive set and the newly extended negative set.

3.5.2 Using Model Information as Tracking Indicators

Not only can the model information be used to extend the likelihood in the posterior distribution, it can also be used in designing the data-driven proposal distribution of the MCMC dynamics. For example, in pedestrian tracking, the hypothesis of generating a new track should occur at places where detection responses exist. The model knowledge is not necessarily strong enough to serve as a detector of full objects, so we call it an indicator. Generally speaking, the indicator can be regarded as a necessary but not sufficient condition for the appearance of one type of object.
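The LUT-based weak classifier of Eq. 3.36 is straightforward to sketch from bin-wise positive/negative weights. The bin weights below are invented for illustration, and `xi` plays the role of the smoothing constant ξ:

```python
import math

def lut_weak_classifier(w_pos, w_neg, xi=1e-3):
    """Build h_t of Eq. 3.36: a lookup table over feature bins whose
    j-th entry is 0.5 * ln((W^j_{+1} + xi) / (W^j_{-1} + xi))."""
    return [0.5 * math.log((p + xi) / (n + xi)) for p, n in zip(w_pos, w_neg)]

def evaluate(lut, f_edgelet, n_bins):
    """Quantize the edgelet feature in [0, 1] into a bin and look it up
    (this is the delta_j selection in Eq. 3.36)."""
    j = min(int(f_edgelet * n_bins), n_bins - 1)
    return lut[j]

# Hypothetical bin weights: higher feature bins are more "object-like".
w_pos = [0.05, 0.15, 0.30, 0.50]   # W^j_{+1}: P(f in bin_j, l = +1)
w_neg = [0.50, 0.30, 0.15, 0.05]   # W^j_{-1}: P(f in bin_j, l = -1)
lut = lut_weak_classifier(w_pos, w_neg)
score_hi = evaluate(lut, 0.9, 4)   # strong edgelet response
score_lo = evaluate(lut, 0.1, 4)   # weak edgelet response
```

Summing such table lookups over the selected weak classifiers gives H(I), and exponentiating that sum as in Eq. 3.37 gives the model likelihood L_D.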
It can be a semantic part of an articulated object, e.g. the head-shoulder of a human, some combination of local features, e.g. a bag of features, or some coupled contextual elements. The purpose of an indicator is to guide the tracker to sample the places where the object is likely to appear. Since the indicator is used during the sampling process, it has to be computationally efficient. The reason for using such an indicator is twofold. First, in many situations it is very difficult to detect the whole body of an object, especially under heavy occlusion, so a full-body detector may have a low detection rate. However, for tracking humans from ceiling-mounted camera views, the upper body is usually visible without much occlusion; such information is good enough to be selected as an indicator to guide tracking. One successful example of using an indicator for tracking pedestrians is Zhao's work [89], where an omega shape extracted from foreground regions is used as a head-tip indicator, and this information guides the birth move during sampling. Although this head-tip indicator may produce many false positives, it is more informative than uniformly generating hypotheses of new objects. The other reason for using an indicator instead of a strong detector is efficiency. A detector may require complex features: for SVM-based detectors, a high-dimensional feature vector may be needed to detect the full body; for boosting-based detectors, a large number of weak classifiers or a long cascade may be needed.

In our method, we propose to use a short cascade of boosted weak classifiers as the indicator, which can be regarded as a detector operating at a high false-positive stage, but one that is more efficient. A short cascade meets the requirements for an indicator: few weak features are used, so it is efficient, and, more importantly, most false positives are filtered out at the very beginning of the cascade.
Given this indicator information, we change the proposals in the MCMC moves as follows. First, a candidate for one target is a rectangle in 2D image space that covers a large enough foreground region, i.e. R_cov = covered foreground area of the rectangle > Th_cov. We use two ways to propose candidates, which are then associated according to their spatio-temporal smoothness. The first is to use the indicator, a short cascade of AdaBoost classifiers, to generate candidates at different scales whose response is strong enough, i.e. $\sum_{t=0}^{T} h_t(I) > Th_{mld}$. The second way to generate candidates is based on motion. Given the best ω* formed in the previous sliding window [t, t+W], the candidates for each track τ_k ∈ ω* at t+W+1 are generated according to the prediction of τ_k with Gaussian noise w ∼ N(0, Q). The probability of taking AdaBoost candidates is α and the probability of taking motion candidates is (1 − α). Before tracking starts, α = 1, i.e. all candidates are generated by the AdaBoost classifier. Note that we still use the full cascade to assign the model likelihood in the posterior distribution; the indicator is only used to generate proposals.

For detecting and tracking multiple vehicles in airborne video, we rectify the targets' heading when we collect positive samples, and make the samples' orientation cover the other degrees of freedom apart from heading. The positive samples are shown in Figure 3.9(a). Given a possible track τ_k, we approximate the target's orientation using the direction of the trajectory. The first two features selected by this AdaBoost classifier are shown in Figure 3.9(b). After the images are stabilized, the motion segmentation is shown in Figure 3.10(a) and the clustered indicator responses are shown in Figure 3.10(b).
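The α-mixture of detector-driven and motion-driven candidates can be sketched as follows. The candidate lists and the fixed α below are illustrative placeholders; in the system, the motion list comes from Kalman predictions of the previous window's best cover, and α = 1 only before any track exists:

```python
import random

def propose_candidate(detector_cands, motion_cands, alpha, rng=random):
    """Draw one candidate: with probability alpha from the indicator
    (short AdaBoost cascade) responses, otherwise from the motion
    predictions. With no tracks yet, only detector candidates exist."""
    if not motion_cands or rng.random() < alpha:
        return rng.choice(detector_cands)
    return rng.choice(motion_cands)

rng = random.Random(0)
det = ["det_a", "det_b"]   # strong cascade responses (hypothetical)
mot = ["mot_a"]            # motion-predicted candidates (hypothetical)
# With alpha = 1 every draw comes from the detector list.
draws = [propose_candidate(det, mot, alpha=1.0, rng=rng) for _ in range(10)]
```

Keeping the full cascade out of this proposal step, and using it only in the posterior, is what makes the indicator cheap enough to call on every sample.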
Figure 3.9: Edgelet AdaBoost training: (a) positive training samples; (b) the first two trained features.

Figure 3.10: Motion and pattern detection: (a) foreground region; (b) binary detection.

3.6 Extension with Hierarchical Data Association

3.6.1 Motivation

Although the merge/split operations in the MCMC data association algorithm can deal with missed detections, they only consider observations within a short time span. Some situations, such as long occlusions, may cause the tracker to lose target identification. Increasing the size of the sliding window cannot always solve the problem, and it increases the complexity. Also, in the MCMC data association algorithm, the tracks are formed purely according to the posterior distribution, which leads to the optimal solution in terms of the posterior definition. Some prior knowledge of local association can be used to reduce the total dimension of the space, which leads to a more efficient implementation at the price of suboptimality.

3.6.2 Tracklet Association

Hierarchical techniques have been widely used in computer vision to achieve better robustness and efficiency. Here, we pose the data association problem in a hierarchical structure with local data association and global data association. The local data association still uses the MCMC data association algorithm, now forming tracklets. The tracklets are then associated in the global data association according to their spatio-temporal consistency. First we define the temporal and spatial consistency between tracklets. Given two tracklets τ_1 and τ_2, which start at times s_1, s_2 and terminate at times t_1, t_2: if the condition s_1 ≥ t_2 or s_2 ≥ t_1 holds, the two tracklets are temporally consistent. For two temporally consistent tracklets τ_1 and τ_2, say with s_2 ≥ t_1, let the terminating position and velocity of τ_1 on the global map be P_{t_1} and V_{t_1}, and the starting position and velocity of τ_2 on the global map be P_{s_2} and V_{s_2}.
If ||P_{t_1} − P_{s_2}|| ≤ v_max × (s_2 − t_1) and ||V_{t_1} − V_{s_2}|| ≤ a_max × (s_2 − t_1), the two are spatially consistent as well, where v_max and a_max represent the maximum speed and acceleration of targets on the map. In order to associate the temporally and spatially consistent tracklets, we adopt the appearance model proposed in [40]. This descriptor is invariant to 2D rotation and scale change, and tolerates small shape variations. Instead of applying this descriptor on a single image blob, we use the descriptor on a tracklet, which contains a sequence of image blobs. For each detected moving blob within a tracklet, the reference circle is defined as the smallest circle containing the blob. The reference circle is delineated as the 6 bin images in 8 directions depicted in Figure 3.11. For each bin i, a Gaussian color model is built on all the pixels located in bin i for all 8 directions and for all image blobs within the tracklet. The color model for each tracklet is thus defined as a 6D vector by summing the contribution of each bin image in all 8 directions and for all image blobs.

Figure 3.11: Appearance descriptor of tracklets: (a) appearance color model (radial bins over the R, G, B channels); (b) appearance shape model (radial bins of edge counts E).

We can similarly encode the shape properties of each blob by using a uniform distribution of the number of edge pixels within each bin, namely a normalized vector [E_1(τ), E_2(τ), ..., E_6(τ)]. The appearance likelihood between two compatible tracklets can be defined as:

p_app(τ_i, τ_j) = exp(−λ(d_color(τ_i, τ_j) + d_edge(τ_i, τ_j)))   (3.38)

where τ_i, τ_j are tracklets on which the appearance probability model is defined. The appearance distance between two compatible tracklets is computed using the Kullback-Leibler (KL) divergence. For the color descriptor, since each bin is modeled by a Gaussian model, the KL distance reduces to:

d_color(τ_i, τ_j) = (1/2N) Σ_N [ (μ_i − μ_j)² (1/σ_j² + 1/σ_i²) + σ_i²/σ_j² + σ_j²/σ_i² ]   (3.39)

where μ_i, μ_j, σ_i and σ_j are the parameters of the color Gaussian models. For the edge descriptor we use the following similarity measure:

d_edge(τ_i, τ_j) = (1/2) Σ_{r=1}^{6} (E_r(τ_i) − E_r(τ_j)) log( E_r(τ_i) / E_r(τ_j) )   (3.40)

For the global association, two compatible tracklets are assigned the same ID if the distance between the two tracklets' appearances is smaller than a threshold. Due to the existence of both target motion and camera motion, the target's orientation can be quite different in different tracklets; thus the rotation-invariant property of the descriptor is quite important for our tracklet association. In Figure 3.12, we show the confusion matrix of several tracklets. The confusion matrix shows that the rotation-invariant descriptor works very well when tracklets undergo obvious rotation. In addition, illumination may vary between tracklets acquired at different times. Since the appearance considers edge information, it can partially deal with illumination changes.

Figure 3.12: (a) the first blob of different tracklets; (b) the confusion matrix of tracklets.

3.7 Implementation and Results

3.7.1 Implementation Details

This algorithm needs foreground regions as input, which are computed offline for the whole sequence. We did not include the time for obtaining detection input when measuring time performance. Also, the implementation of this algorithm involves a large number of operations in histogram computation, e.g. in calculating foreground coverage, color histograms for the appearance likelihood, and mean shift for diffusion moves. In order to make the histogram computation efficient, we use integral images that are computed offline. We cache the nodes that have a proper size and coverage of the foreground regions, and cache appearance descriptors for such nodes. These candidate nodes are the only allowed locations where a target can occur. In the diffusion move, the newly proposed object location is rounded to fit into these pre-calculated locations.
As most of the locations are inactive in tracking, this significantly reduces the computational cost of the tracking process.

3.7.2 Performance Evaluation and Comparison

To evaluate the performance of our approach quantitatively, we adopt the metric "Sequence Tracking Detection Accuracy" (STDA) proposed in [41], which is a spatio-temporal measure penalizing fragmentation in both the temporal and the spatial domains. To compute the STDA score, one needs to compute a one-to-one match between the tracked targets and the ground truth targets. The matching strategy itself is implemented (in the evaluation software) by computing the measure over all ground truth and detected object combinations and maximizing the overall score for a sequence [41]. Given M matched tracks including tracked targets τ_k(i), k = 1,...,M, i = 1,...,T, and the corresponding ground truth tracks G_k(i), k = 1,...,M, i = 1,...,T, STDA can be computed as

STDA = Σ_{i=1}^{M} [ ( Σ_{t=1}^{T} |G_i(t) ∩ τ_i(t)| / |G_i(t) ∪ τ_i(t)| ) / N_frame(G_i ∪ τ_i ≠ ∅) ] / [ (N_G + N_T) / 2 ]   (3.41)

where the denominator for each track, N_frame(G_i ∪ τ_i ≠ ∅), indicates the number of frames in which either a ground truth or a tracked target (or both) is present. The numerator for each track measures the spatial accuracy by computing the overlap of the matched tracking results over the ground truth targets in the sequence. The normalization factor is the average of the number of tracked targets N_T and the number of ground truth targets N_G. STDA produces a real value between 0 and 1 (worst and best possible performance, respectively).

Figure 3.13: Simulation result with L = 200, N = 7, FA = 7 and T = 50, shown at the 1st, 30th, 43rd and 50th frames. Colored rectangles indicate the IDs of targets. Targets may split or merge when they appear.

To demonstrate the concept of our approach, we design simulation experiments. In an L×L square region, there are K (unknown number) moving discs.
Each disc presents an independent color appearance and an independent constant velocity and scale change in the 2D region. False alarms (non-overlapping with targets) are located uniformly at random in the scene, and the number of false alarms is uniformly distributed on [0, FA]. If the number of existing targets in the square region is less than the upper bound N, a target is added randomly. We also add several bars as occlusions in the scene. Such static occlusion causes a target to break into several foreground regions. This simulates real scenarios in which foreground regions are fragmented due to noisy background modeling. The input to our tracking algorithm contains merely the foreground regions in each frame, without any shape information. The design of this simulation experiment reflects particular considerations for evaluating a method's ability to recover spatial data association. In such a simulation, if there were no loss of spatial completeness, e.g. no occlusions between objects and no static occlusion, the decision of temporal data association would be very easy to make without any ambiguity. It is precisely the lack of spatial compactness that makes these sequences challenging for many existing data association methods. Without jointly considering the spatial and temporal data association, a tracking algorithm will never find the correct explanation of the foreground regions. Figure 3.13 gives the results of our spatio-temporal MCMC data association algorithm. Colored and black rectangles display the targets and false alarms respectively. Red links indicate that spatial segmentation happens between nodes. As the target density and false alarm rate increase, tracking becomes increasingly difficult. For each setting (i.e. number of targets and number of false alarms), we generate 20 sequences to compare the average performance. Each sequence contains T = 50 frames.
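The per-track overlap-over-frames computation of Eq. 3.41 can be sketched as follows. This is a toy sketch, not the CLEAR evaluation software: the `matched_pairs` layout (each track a dict mapping frame index to an axis-aligned box) is an assumed representation for illustration.

```python
def stda(matched_pairs, n_gt, n_tracked):
    """Sketch of Eq. 3.41. matched_pairs is a list of (gt, trk) per matched
    track; gt and trk map frame -> box (x1, y1, x2, y2), with absent frames
    simply missing from the dict."""
    def iou(a, b):
        # intersection-over-union of two axis-aligned rectangles
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0
    total = 0.0
    for gt, trk in matched_pairs:
        frames = set(gt) | set(trk)   # frames where either track is present
        overlap = sum(iou(gt[f], trk[f]) for f in set(gt) & set(trk))
        total += overlap / len(frames)
    return total / ((n_gt + n_tracked) / 2.0)
```

A perfectly tracked single target yields 1.0, and a track covering only half of the ground-truth frames yields 0.5, matching the fragmentation penalty the metric is designed to impose.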
We compare our method with other methods, including a JPDAF-based method from [39], the MHT from [19], and our own algorithm with temporal moves only.

Figure 3.14: (a) STDA as a function of the maximum number of targets N (L = 200, FA = 0, T = 50); (b) STDA as a function of the number of false alarms FA (L = 200, N = 5, T = 50), for MCMC spatio-temporal, JPDAF, MCMC temporal-only and MHT.

All methods employ the same motion and appearance likelihood. To prune hypotheses, both JPDAF and MHT adopt a minimum ratio between the likelihood of the worst hypothesis and the likelihood of the best one: any hypothesis with a likelihood lower than the product of this ratio and the likelihood of the best hypothesis is discarded. For JPDAF we use 1-scanback and keep at most 50 hypotheses at each scan. For MHT, we use 3-scanback and keep at most 300 group hypotheses. In fact, even if a larger scanback (5-scanback) is used, MHT does not show obvious improvement in the simulation. This can be explained as follows: temporal data association (when no occlusion happens) is quite straightforward in the simulation, and the ambiguities, caused by errors in the spatial relationship, cannot be solved by simply increasing the number of scanbacks. The initial cover ω_0 for MCMC sampling is initialized by a greedy criterion, namely using the MHT algorithm but keeping only the best hypothesis at each time. The MCMC sampler was run for a total of 10,000 iterations, where the first 15% of iterations consist solely of temporal moves. The average score from multiple runs of our method is reported. Figure 3.14(a) compares the performance when the number of targets increases. Figure 3.14(b) shows the tolerance of the different methods to false alarms.
Because we consider the spatial and temporal association seamlessly, the performance of our method dominates the other three methods. The other three methods work almost equally poorly, since they often fail in similar cases when split or merged observations exist.

Figure 3.15: STDA and runtime per frame (seconds) for the offline version (n_mc = 1M) and online versions with different window sizes W and numbers of samplings n_mc (W = 50, n_mc = 1000; W = 50, n_mc = 500; W = 25, n_mc = 500; W = 10, n_mc = 200), with L = 200, FA = 0 and T = 1000.

To extend our algorithm to long sequences, we implement the proposed association algorithm as an online algorithm within a sliding window of size W. The speed of the sliding window is Δ_W frames. In other words, when the sliding window moves, the new sliding window has Δ_W new frames and W − Δ_W frames overlapping with the previous sliding window. The cover of the overlapped part of the current sliding window is initialized from the best cover of the previous sliding window. The cover of the new frames is initialized by the greedy criterion. In the experiments, we use Δ_W = 1. The comparison between the online and offline versions is shown in Figure 3.15. By implementing the online version, we reduce the complexity of data association and control the output delay for long sequences.

Figure 3.16: Average STDA score change with variations of the parameters c_0,...,c_4, for α = 0.02, 0.05, 0.10.

We also use the simulation experiments to test the robustness of the posterior density function to parameter changes. Since direct comparison of the posterior shape or evaluation of the mode drift is difficult, we still use the STDA score to evaluate the effect of parameter changes on the MAP solution.
To evaluate each component ĉ_i ∈ Ĉ, i = 1,...,4, we use the estimate ĉ_i as the center and uniformly select c_i ∈ [ĉ_i(1 − α/2), ĉ_i(1 + α/2)]. We generate multiple sequences with a setting of N = 7 and FA = 7 and use a posterior with the contaminated parameters to do the MCMC sampling. The results are shown in Figure 3.16. From this figure, we can see that the average STDA score does not change significantly when there are variations in the parameters; thus the posterior is robust to parameter variations. This can be understood as follows: there exists a domain in R^5 where all the constraints used in training are satisfied. Parameters that are out of (but close to) the valid domain merely break part of the constraint pool and are unlikely to lead to a very wrong solution on one particular sequence, since the constraint pool is generated for a general setting and the broken constraints may not appear in that particular sequence. We show results and evaluations on three video sets to demonstrate the effectiveness of our method in real scenarios.

Table 3.1: Comparative results on three real data sets. Method 1: JPDAF in [39]; Method 2: the proposed method.

                  Frames    Tracks in GT    STDA (1)    STDA (2)    Complete tracks (1)    Complete tracks (2)
  CLEAR           17,580    152             0.539       0.765       96                     138
  Campus ground    6,300     54             0.406       0.852       23                      50
  VIVID           12,690     61             0.256       0.457       28                      45

The first set is a selection from CLEAR [26], which is captured with a stationary camera mounted a few meters above the ground and looking down towards a street. The targets in the scene include vehicles and pedestrians. The second set, called the "campus ground set", is captured with a stationary camera on a tripod. The foreground in the second set is clean; however, the inter-target occlusion is extensive. The third set is a selection from the VIVID-I and II data sets, which are captured from UAV cameras.
For videos captured from UAV cameras, we first compensate for background motion. This can be accomplished by an affine transformation [39], since the camera is relatively far from the scene and the background can be modeled as a plane. The main difficulty of the third data set comes from noisy foreground regions and false alarms caused by erroneous registration and parallax. The input to our tracking algorithm contains foreground regions which are extracted using a dynamic background model estimated within a sliding window [39]. Tracking is performed automatically from the detected blobs without any manual initialization. In the experiments, we use online tracking with a sliding window W = 50 and n_mc = 1000. The first sliding window is initialized with the greedy criterion. Table 3.1 gives the quantitative comparison. A complete track is defined as one where at least 80% of the trajectory is tracked and no ID changes occur. The tracking process runs at around 3 fps on a P4 3.0 GHz PC. Some foreground regions used as input, and the corresponding tracking results, are shown in Figure 3.18.

Figure 3.17: Comparison of bi-directional and single-direction inference; average spatial accuracy (STDA) over frames for (a) forward inference only, (b) backward inference only, (c) JPDAF in [39], and (d) the proposed method.

One advantage of our tracking algorithm (shown in both the simulation and the real data sets) is worth highlighting. Because the bi-directional (forward/backward) sampling is applied in a symmetric way, our approach can deal with the case where targets are merged or split when they appear.
Figure 3.17 illustrates the comparison with the algorithms using only forward or backward inference on the image sequence in the "campus ground set". The colors at the bottom of each chart correspond to the labels allocated by the algorithm for the three moving persons in the sequence, while the red bars correspond to mislabeled targets due to merged observations. The proposed bi-directional sampling allows us to estimate the trajectories and label them consistently throughout the sequence. We observe that the failure cases of this tracker often occur in the following situations. First, when the motion of a target cannot be faithfully represented by the constant velocity model, the MAP solution prefers splitting the track, although the appearance is still consistent. This issue can be addressed by using a more general motion model or by directly modeling the smoothness of a trajectory. Also, when the scene is very crowded so that targets seldom separate from each other, tracking usually fails, either merging all the targets together or regarding them as false alarms. To some extent, this issue can be solved by incorporating model information and using it to guide the spatial and temporal MCMC sampling.

3.7.3 Remarks

In this chapter, we have proposed a general data association algorithm that accepts noisy motion segmentation input and forms motion- and appearance-consistent tracks. We also proposed a practical method to determine the parameters for specific applications. To make this method widely applicable, we extended it with model information and extended it to deal with long-term tracking in a hierarchical manner. Overall, this method outperforms several widely applied data association algorithms, such as MHT and JPDAF, mainly because they do not take the spatial relationship into consideration. However, due to the consideration of the spatial relationship, the solution space of our method becomes very large, which also causes slow convergence.
Designing an efficient data-driven proposal distribution becomes the only cure for this kind of sampling-based method; in other words, it is always useful to add specific prior knowledge for a specific task. Also, when efficiency is the main concern, this method can be extended to look for a suboptimal solution: although the Markov chain's dynamics are designed in a data-driven manner (for example, a split point of one track will prefer the weak likelihood link between two nodes), there are still moves that can be eliminated by specific prior knowledge. For instance, suppose we have prior knowledge that can be used to form atomic tracklets that should not be split under any circumstances; this information can help the sampling procedure avoid split moves on these atomic tracklets.

Figure 3.18: Experiment results of real scenarios from both stationary cameras and Unmanned Aerial Vehicle (UAV) cameras.

Chapter 4

Online Appearance Modeling for Visual Tracking

4.1 Motivation

Object tracking is challenging [85] due to appearance changes, which can be caused by varying viewpoints and illumination conditions. Appearance can also change relative to the background due to the presence of clutter and distracters. Also, an object may leave the field of view (or be occluded), then reappear. To address these difficulties, we aim to track an arbitrary object with limited initialization (labeled data) and learn an appearance model on-the-fly, which can then be used to reacquire the object when it reappears.

4.1.1 Generative vs. Discriminative

The tracking problem can be formulated in two different ways: generative and discriminative. Generative tracking methods learn a model to represent the appearance of an object. Tracking is then expressed as finding the object appearance most similar to the model. Several examples of generative tracking algorithms are Eigentracking [9], WSL tracking [37] and IVT [51].
To adapt to appearance changes, the object model is often updated online, as in [51]. Because appearance variations are highly non-linear, multiple subspaces [48] and non-linear manifold learning methods [25] have been proposed. In general, generative methods are not restricted to training one class of objects. However, for tracking purposes, the second class (i.e. the negative class) contains many variations which are very hard to model in a generative framework. Thus, traditional generative tracking methods are trained on object appearance without considering background information. Instead of building a model to describe the appearance of an object, discriminative tracking methods aim to find a decision boundary that can best separate the object from the background. Recently, many discriminative trackers have been proposed [4, 17, 60]. They exhibit strong robustness to distracters in the background. Support Vector Tracking (SVT) [3] integrates an offline-trained Support Vector Machine (SVM) classifier into an optic-flow-based tracker. In order to update the decision boundary according to new samples and background, discriminative tracking methods with online learning have been proposed in [17, 4]. In [17], a confidence map is built after finding the most discriminative RGB color feature in each frame; however, the limited color feature pool restricts the discriminative power of this method. In [4], Avidan proposes to use an ensemble of online-learned weak classifiers to label a pixel as belonging to either the object or the background. To accommodate object appearance changes, at every frame new weak classifiers replace those old ones that do not perform well or have existed longer than a fixed number of frames. Both methods in [4, 17] use features at the pixel level and rely on a mode-seeking process (mean shift) to find the best estimate on a confidence map, which restricts the reacquisition ability of these methods.
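The ensemble update policy described for [4] can be sketched schematically. This is our illustrative reading of that policy, not code from [4]: the `(classifier, birth_frame, score)` tuple layout and the threshold names are assumptions.

```python
def update_ensemble(ensemble, frame_idx, new_classifier, max_age, min_score):
    """Schematic ensemble maintenance: drop weak classifiers that perform
    poorly (score below min_score) or have existed longer than max_age
    frames, then append a freshly trained classifier for the current frame."""
    kept = [(c, born, s) for (c, born, s) in ensemble
            if s >= min_score and frame_idx - born <= max_age]
    kept.append((new_classifier, frame_idx, 1.0))  # new classifier, full score
    return kept
```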
Oza and Russell [63] proposed an online boosting algorithm, which has been applied to the visual tracking problem [29, 53]. Due to the large number of features, either an offline feature selection procedure or an offline-trained seed classifier is usually required in practice. It has been shown that discriminative classifiers often outperform generative models [47] if enough training data is available. However, generative methods often have better generalization performance when the training set is small. Specifically, a simple generative classifier (naive Bayes) outperforms its discriminative counterpart (logistic regression) when the amount of labeled training data is small [59]. Recently, hybrid discriminative-generative methods have opened a promising direction, drawing from both types of methods. Several hybrid methods have been proposed in many application domains [67, 52, 84, 47]. Most of them imbue generative methods with discriminative power via "discriminative training" of a generative model. These methods train a model by optimizing a convex combination of the generative and discriminative log-likelihood functions. Due to the asymmetry in training data, "discriminative training" of a generative model requires a parameter to govern the trade-off between the generative and discriminative components. Theoretical discussions in [47] show that an improper hybrid discriminative-generative model can produce even worse performance than pure generative or discriminative methods.

4.1.2 Global vs. Local

In order to deal with appearance changes, two strategies can be adopted: 1) global: a model which is able to cover all appearance variations from the start of tracking; 2) local: a model which focuses on recent appearance variations of the target. A global model is preferred, as it provides a complete description, thus allowing abrupt motion and reacquisition. However, building a global appearance model is difficult.
From the generative point of view, a global appearance model covering different viewpoints and illumination can be represented as an appearance manifold, which is generally non-linear. Learning the mapping from the input space to an embedded space, or even evaluating whether a point in the input space lies on the manifold, is not an easy task. From the discriminative point of view, more appearance variations make the classification problem harder. For example, in SVM classification, separating an object with many appearance variations from the background requires many support vectors and may yield an overfitted classifier with poor generalization ability.

4.1.3 Online vs. Offline

One way to build a global model is to collect many appearance instances offline. However, offline collection of samples requires extra work and cannot be applied to general unknown objects. Recently, many online algorithms [51, 4, 17] have been proposed to build an appearance model on-the-fly, and then update the online-built appearance model to adapt to the current environment [82, 53]. This idea of incremental updating, which is very attractive for tracking general objects, can be applied to both generative and discriminative methods, as in [51] and [4]. Many tracking algorithms use the appearance at the best estimate of the previous time step to update the appearance model. This self-learning approach may reinforce tracking errors and thus cause the "drift" problem. As online updating is a typical semi-supervised learning problem, co-training [10] can be used to solve such semi-supervised training with limited labeled data. The basic idea behind co-training is that the consistent decision of a majority of independent trainees is used as a labeler, and thus the trainees are able to train each other. The independence of the trainees can be achieved by initializing with independent data sets [10], or by using trainees that work on independent features [74].
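The co-training idea sketched above can be written as a generic loop. This is a schematic of the Blum-Mitchell protocol, not the tracker itself: the learner objects, their `fit`/`predict` interface, and the per-learner confidence threshold are placeholder assumptions.

```python
def co_train(learner_a, learner_b, labeled, unlabeled, rounds):
    """Schematic co-training: each learner is fit on its own labeled pool,
    then examples labeled confidently by one learner augment the other's
    pool, so the two trainees teach each other from unlabeled data."""
    pool_a, pool_b = list(labeled), list(labeled)
    for _ in range(rounds):
        learner_a.fit(pool_a)
        learner_b.fit(pool_b)
        remaining = []
        for x in unlabeled:
            ya, ca = learner_a.predict(x)   # (label, confidence)
            yb, cb = learner_b.predict(x)
            if ca > learner_a.threshold:
                pool_b.append((x, ya))      # A teaches B
            elif cb > learner_b.threshold:
                pool_a.append((x, yb))      # B teaches A
            else:
                remaining.append(x)
        unlabeled = remaining
    return pool_a, pool_b
```

The key property exploited later in this chapter is that the two learners see different, ideally independent, views (features) of the same samples.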
4.1.4 Overview of Our Approach

We propose to use co-training to combine generative and discriminative models. Here, online learning of an appearance model of an arbitrary object with limited labeled data is treated as a semi-supervised problem. The co-training approach proposed by Blum and Mitchell [10] is a principled semi-supervised training method. The basic idea is to train two classifiers from two conditionally independent views of the same data (with a small number of exemplars) and then use the prediction from each classifier to enlarge the training set of the other. It has been proved that co-training can find an accurate decision boundary, starting from a small quantity of labeled data, as long as the two feature sets are independent [10]. Empirical evidence [50] also shows that co-training works well even when the independence assumption is not perfectly satisfied. In [36], the features used for classification are derived from PCA bases, which are obtained offline from training samples, and co-training is used to improve an offline-learned object detector. More recently, Tang et al. [74] propose to use co-training to online train two SVM trackers with color histogram features and HOG features. This method uses an incremental/decremental SVM solver [13] to focus on recent appearance variations without representing the global object appearance.

Figure 4.1: Online co-training of a generative tracker and a discriminative tracker with different life spans (the area bounded by dashed red boxes indicates the background).

In order to represent a global object appearance, we propose to use a generative model which contains a number of low-dimension subspaces. This generative model encodes all the appearance variations that have been observed in a compact way. An online subspace updating algorithm is proposed to modify the subspaces adaptively. The descriptive power of the generative model increases as new samples are added.
For the discriminative classifier, we use an incrementally learned SVM classifier [11] with histogram of oriented gradients (HOG) [21] features. In practice, we find that the number of support vectors grows quite fast when the appearance of the object and background changes. Moreover, the adaptation of the discriminative model to new appearance changes becomes slower and slower as samples are accumulated. To address these problems, we decrementally train the SVM to focus on recent appearance variations within a sliding window. The training data flow is shown in Figure 4.1. The image patches bounded by green boxes are samples used in the generative model; they contain all the object appearance variations from the start of tracking. The red bounding boxes indicate the positive and negative training samples within the sliding window used by the discriminative classifier. The main advantage of this method is that it collaboratively combines the generative and discriminative models with complementary views (features) of the training data, and encodes global object appearance variations, and thus can handle reacquisition. Experiments show that our method has strong reacquisition ability and is robust to distracters in background clutter.

4.2 A Bayesian Co-training Framework for Online Appearance Modeling

4.2.1 A Hybrid Discriminative and Generative Model

We formulate the visual tracking problem as a state estimation problem in a similar way as [51, 35]. Given a sequence of observed image regions O_t = (o_1,...,o_t) over time t, the goal of visual tracking is to estimate the hidden state s_t. In our case, the hidden state refers to an object's 2D position, scale and rotation. Assuming a Markovian state transition, the posterior can be formulated as a recursive equation

p(s_t | O_t) ∝ p(o_t | s_t) ∫ p(s_t | s_{t−1}) p(s_{t−1} | O_{t−1}) ds_{t−1}   (4.1)

where p(o_t | s_t) and p(s_t | s_{t−1}) are the observation model and state transition model respectively.
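One step of the recursion in Eq. 4.1 can be sketched for a toy one-dimensional state. This is a minimal particle-filter sketch under stated assumptions, not the tracker's implementation: `lik_gen` and `lik_disc` are placeholder callables standing in for the generative and discriminative likelihoods, and the transition is the Gaussian N(s_t; s_{t−1}, σ²).

```python
import random

def particle_filter_step(particles, weights, lik_gen, lik_disc, sigma):
    """One recursion of Eq. 4.1: resample according to the current weights,
    propagate through the Gaussian transition model, then reweight each
    particle by the product of the two independent observation likelihoods."""
    # Resampling: draw particles proportionally to their weights
    resampled = random.choices(particles, weights=weights, k=len(particles))
    # Propagation: sample from the transition model N(s; s_prev, sigma^2)
    propagated = [s + random.gauss(0.0, sigma) for s in resampled]
    # Weighting: joint likelihood of the two independent observers
    new_weights = [lik_gen(s) * lik_disc(s) for s in propagated]
    total = sum(new_weights) or 1.0
    return propagated, [w / total for w in new_weights]
```

The product `lik_gen(s) * lik_disc(s)` mirrors the factorization p(o_t | s_t) ∝ p_M(o_t | s_t) p_C(o_t | s_t) justified by the independence of the two observers.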
p(s_{t−1} | O_{t−1}), which is represented as a set of particles and weights, is the posterior distribution given all the observations up to time t−1. The recursive inference in Eq. 4.1 is implemented with resampling and importance sampling processes [35]. In our approach, the transition of the hidden state is assumed to be a Gaussian distribution: p(s_t | s_{t−1}) = N(s_t; s_{t−1}, Ψ_t), where Ψ_t is a time-varying diagonal covariance matrix. In this recursive inference formulation, p(o_t | s_t), the likelihood of observing o_t given one state of the object, is the crucial term in defining a good posterior distribution. Besides the 2D position, our state variables encode in-plane rotation and scale. This reduces appearance variations caused by such motion, at the expense of more particles needed to cover the solution space. Our measurement of one observation comes from two independent models. One is the generative model, which is based on online-constructed multiple subspaces. The other is the discriminative model, which is online trained with HOG features. The features used by these two models, namely intensity patterns and local gradient features, are complementary. After limited initialization, these two models are co-trained with sequential unlabeled data. In our approach, each model makes a decision based on its own knowledge, and this information is used to train the other model. The final decision is made by the combined hybrid model. Due to the independence between the two observers, our observation model p(o_t | s_t) can be expressed as a product of two likelihood functions from the generative model M and the discriminative model C: p(o_t | s_t) ∝ p_M(o_t | s_t) p_C(o_t | s_t).

4.2.2 Co-training for Semi-supervised Learning

Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy [15].
The acquisition of labeled data for a learning problem often requires a skilled human agent to manually classify training examples. The cost associated with the labeling process may thus make a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. We pose the online tracking problem as a semi-supervised learning problem, where the user input that initializes the tracker is viewed as labeled data. When tracking starts, we aim to find the label of the object from a very large set of unlabeled data. Co-training was originally proposed by Blum in 1998 [10] to handle the semi-supervised learning problem, where two or possibly more learners are each trained on a set of examples, but with each learner using a different, and ideally independent, set of features for each example. This method has quickly been applied to many different semi-supervised learning problems. It is worth noting that co-training is not a new model but a way of training existing models in a semi-supervised manner. In our method, the two independent models evaluate the samples independently, and the joint likelihood of the two models is used as the observation model in the CONDENSATION algorithm. We adopt conservative evaluation criteria in the decision making of each model. Co-training is applied to train the two independent models as follows. Given the labeled data in the first couple of frames, we initialize the two independent models. When tracking starts, all particles are simultaneously evaluated by the generative and discriminative models. Given the evaluation from the generative model, we find the best particle, remove the particles near it, and find the second best particle; if the ratio between the scores of the second best particle and the best particle is smaller than a predetermined threshold, this indicates that the second best particle is not going to distract the best one.
The best particle that satisfies this criterion is validated and is handed over to train the discriminative model. Given the evaluation from the discriminative model, if the confidence of the best particle is larger than the confidence of 80% of the historical positive responses, the best particle is regarded as validated, and is handed over to update the generative model.

4.3 Generative Tracker with Multiple Linear Subspaces

4.3.1 Global Model with Multiple Local Linear Subspaces

The global appearance of one object under different viewpoints and illumination conditions is known to lie on a low dimensional manifold. However, such a global appearance manifold is highly non-linear. In [25], a non-linear mapping from the embedding space to the input space is learned offline for tracking a specific object. Although the appearance manifold is globally non-linear, the local appearance variations can still be approximated by a linear subspace. Thus, we propose to incrementally learn a set of low dimensional linear subspaces to represent the global appearance manifold. A multi-subspace representation is used in [48], where a fixed number of subspaces are built offline and updated online with new samples.

Let M = {Ω_1, ..., Ω_L} represent the appearance manifold of one object and let Ω_l, l ∈ [1, ..., L], denote the local sub-manifolds. An appearance instance x is a d-dimensional image vector. Let Ω_l = (x̂_l, U_l, Λ_l, n_l) denote one sub-manifold, where x̂_l, U_l, Λ_l and n_l represent the mean vector, eigenvectors, eigenvalues and size (number of samples) of the subspace, respectively. For simplicity, we omit the subscript when this causes no confusion. Here, Λ = diag(λ_1, ..., λ_n) holds the sorted eigenvalues of the subspace, λ_1 ≥ λ_2 ≥ ... ≥ λ_n. An η-truncation is usually used to truncate the subspaces, namely m = argmin_i (Σ_{j=1}^{i} λ_j / tr(Λ) ≥ η). From a statistical point of view, a subspace with m eigenbases can be regarded as an m-dimensional Gaussian distribution.
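The η-truncation rule can be written down directly; a minimal pure-Python sketch (function name ours), assuming the eigenvalues are already sorted in descending order:

```python
def eta_truncate(eigenvalues, eta=0.95):
    """Return the smallest m such that the first m (descending-sorted)
    eigenvalues capture at least a fraction eta of the total variance tr(Λ)."""
    total = sum(eigenvalues)
    acc = 0.0
    for m, lam in enumerate(eigenvalues, start=1):
        acc += lam
        if acc / total >= eta:
            return m
    return len(eigenvalues)
```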
Suppose Ω is a subspace with the first m eigenvectors; the projection of x on Ω is y = (y_1, ..., y_m)^T = U^T (x − x̂). Then the likelihood of x under the whole model can be expressed [57] as p(x|M) = max_{l∈[1,L]} p(x|Ω_l), namely the largest likelihood over the subspaces. The likelihood under each subspace can be represented as

p(x|\Omega_l) = \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^{m} y_i^2/\lambda_i\right)}{(2\pi)^{m/2}\prod_{i=1}^{m}\lambda_i^{1/2}} \cdot \frac{\exp\left(-\varepsilon^2(x)/(2\rho)\right)}{(2\pi\rho)^{(d-m)/2}}    (4.2)

where ε(x) = |x − U U^T x| is the projection error, namely the L_2 distance between the sample x and its projection on the subspace. The parameter ρ = (1/(d−m)) Σ_{i=m+1}^{d} λ_i [57], or one can use (1/2) λ_{m+1} as a rough approximation. Using Eq. 4.2, we can evaluate the confidence of a sample under one subspace. As our generative model contains multiple subspaces (each subspace can be regarded as a hyper-ellipsoid), we maintain the neighborhood according to the L_2 distance between the mean vectors of the subspaces. To evaluate the confidence of one sample under such a generative model, we use the maximum confidence over the K nearest neighboring subspaces.

4.3.2 Online Subspace Learning

Given that samples arrive sequentially, we aim to learn the low dimensional linear subspaces incrementally. A new subspace is created with dimension d_0, namely d_0 + 1 sequential samples form a new subspace. Local smoothness is guaranteed by a small d_0. Each new subspace is created and added to the subspace pool. In order to represent a large number of sequential samples, we use a fixed number of subspaces: if the number of subspaces exceeds a predetermined maximum, the two most similar subspaces are merged. The outline of the online subspace learning algorithm is shown in Algorithm 2. In order to maintain the local property of the subspaces, merging only happens between neighboring subspaces. Merging two subspaces and measuring the similarity between two subspaces are the two critical steps in this algorithm.

4.3.2.1 Subspace Merging

Several methods have been proposed to incrementally update eigenspaces. Only the method proposed by Hall et al.
[33] takes into account the change of the mean of a subspace. This approach provides an exact solution to the update of an eigenspace, and does not require storing the original samples. A similar method was also used in [51, 48] to update a subspace given new samples. We summarize Hall's method [33] using scatter matrices to simplify the presentation. Suppose there are two subspaces Ω_1 = (x̂_1, U_1, Λ_1, N) and Ω_2 = (x̂_2, U_2, Λ_2, M), which we are trying to merge into a new subspace Ω = (x̂, U, Λ, M+N). If the dimensions of Ω_1 and Ω_2 are p and q, the dimension r of the merged subspace Ω satisfies max(p, q) ≤ r ≤ p + q + 1. The vector connecting the centers of the two subspaces does not necessarily belong to either subspace; this vector accounts for the additional one in the upper bound on r. It is easy to verify that the scatter matrix S of the merged subspace Ω satisfies

S = S_1 + S_2 + \frac{MN}{M+N}(\hat{x}_1 - \hat{x}_2)(\hat{x}_1 - \hat{x}_2)^T

We aim to find a sufficient orthogonal spanning of S. Let h_1(x) denote the residual of a vector x on Ω_1, h_1(x) = x − U_1 U_1^T x. Note that h_1(x) is orthogonal to U_1, i.e. h_1(x)^T U_1 = 0. Now, U' = [U_1, v] is a set of orthogonal bases spanning the merged space, where v = GS(h_1([U_2, x̂_2 − x̂_1])) and GS(·) denotes the Gram-Schmidt process.

Algorithm 2 Online subspace learning algorithm
Input: (I, M, d_0, L)
  I = {I_1, ..., I_n, ...}: a sequence of samples
  M = ∅: the appearance manifold
  d_0: the initial dimension for each subspace
  L: the maximum number of subspaces
Output: M = (Ω_1, ..., Ω_L): multiple local linear subspaces
while I ≠ ∅ do
  fetch d_0 + 1 samples and form a new subspace Ω_n ← (I_i, ..., I_{i+d_0})
  if there exists an empty subspace then
    add Ω_n to M
  else
    (p, q)* = argmax Sim(Ω_p, Ω_q), p, q ∈ [1, ..., L], p ≠ q
    Ω_m = Ω_p ∪ Ω_q and replace Ω_p and Ω_q with Ω_m
  end if
end while

Given the sufficient orthogonal bases, we can obtain the SVD decomposition of S.
U'^T S U' = \begin{bmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} G\Lambda_2 G^T & G\Lambda_2\Gamma^T \\ \Gamma\Lambda_2 G^T & \Gamma\Lambda_2\Gamma^T \end{bmatrix} + \frac{MN}{M+N}\begin{bmatrix} gg^T & g\gamma^T \\ \gamma g^T & \gamma\gamma^T \end{bmatrix} = R\Lambda R^T    (4.3)

where G = U_1^T U_2, Γ = v^T U_2, g = U_1^T (x̂_2 − x̂_1) and γ = v^T (x̂_2 − x̂_1). Now, the eigenvalues of the merged subspace are Λ in Eq. 4.3 and the eigenvectors U are simply U'R. Note that incrementally updating a subspace with one observation, as in [48], is a special case of merging two subspaces using Eq. 4.3.

4.3.2.2 Subspace Distance

The other critical step in Algorithm 2 is to determine the similarity between two subspaces. We use two factors to measure the similarity between two neighboring subspaces Ω_1, Ω_2: the canonical angles (principal angles) and the data-compactness.

Suppose the dimensions of the two subspaces are p, q with p ≥ q; then there are q canonical angles between the two subspaces. A numerically stable algorithm [8] computes the angles between all pairs of orthonormal vectors of the two subspaces as cos θ_k = σ_k(U_1^T U_2), k = 1, ..., q, where σ_k(·) is the k-th sorted singular value computed by SVD. The consistency of two neighboring subspaces can then be represented as

\mathrm{Sim}_1(\Omega_1, \Omega_2) = \prod_{k=q-d_0+1}^{q} \sigma_k(U_1^T U_2)    (4.4)

As the dimensionality of the subspaces is larger than d_0, the initial dimension, we select the d_0 largest principal angles, which approximately measure the angle between two local subspaces. In 3D space, the largest canonical angle between two 2D subspaces is equivalent to the angle between the two planes. In this case, we prefer to merge 2D patches with a small plane-to-plane angle. Note that a merge can only happen between neighboring subspaces; the neighborhood is defined according to the L_2 distance between mean vectors. Merging subspaces with a small principal angle avoids destroying the local structure of the appearance manifold.

The other factor to consider is data-compactness, which measures how much extra dimensionality is incurred by a merge operation.
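The canonical-angle term of Eq. 4.4 can be computed directly from an SVD; a NumPy sketch (function name ours), assuming the columns of U1 and U2 are orthonormal bases:

```python
import numpy as np

def sim_canonical(U1, U2, d0):
    """Eq. 4.4: the cosines of the canonical (principal) angles between two
    subspaces are the singular values of U1^T U2; the similarity is the
    product of the d0 smallest cosines, i.e. the d0 largest angles."""
    sigma = np.linalg.svd(U1.T @ U2, compute_uv=False)  # descending order
    return float(np.prod(sigma[-d0:]))
```

For two 2D planes in R^3 this reduces to the cosine of the plane-to-plane angle mentioned in the text: identical planes give 1, orthogonal planes give 0.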
Suppose the dimensions of the two subspaces Ω_1, Ω_2 are p, q with p ≥ q, and the sorted eigenvalues of the merged subspace are Λ_r = (λ_1, ..., λ_r), r = p + q + 1. The similarity based on data-compactness is defined as

\mathrm{Sim}_2(\Omega_1, \Omega_2) = \sum_{i=1}^{p} \lambda_i \Big/ \sum_{i=1}^{r} \lambda_i    (4.5)

If Sim_2 is close to one, the merge operation does not incur any increase in dimensionality; on the contrary, if Sim_2 is small, the variations in Ω_1 and Ω_2 cannot be represented by common eigenvectors. Combining the two factors in Eq. 4.4 and Eq. 4.5, the final similarity between two subspaces is defined as

\mathrm{Sim}(\Omega_1, \Omega_2) = \mathrm{Sim}_1(\Omega_1, \Omega_2) + w_d\, \mathrm{Sim}_2(\Omega_1, \Omega_2)    (4.6)

where w_d is the weight balancing the two factors.

One example of online trained local subspaces is shown in Figure 4.2. A saddle-like 3D surface (shown in Figure 4.2(a)) is generated with Gaussian noise. The 3D points are input to Algorithm 2 sequentially. The final subspaces are shown in Figure 4.2(b) with L = 15, η = 0.995. The initial dimension d_0 is one. Although the online built subspaces depend on the order of the samples, a compact representation of the samples can always be created as long as the data arrive with local smoothness.

[Figure 4.2: 3D example of incrementally updated subspaces. (a) 3D saddle surface; (b) final subspaces built.]

The subspace updating operation dominates the complexity of the generative tracker. Merging two subspaces of dimensions p, q requires a Golub-Reinsch SVD, O(r³), of an r = p + q + 1 dimensional square matrix. Since the dimension of each subspace is low, the total complexity is quite low. The low dimensionality of the local subspaces guarantees both the local property and efficient computation.

4.4 Discriminative Tracker Using Online SVM

It has often been argued that SVM has better generalization performance than other discriminative methods on small training sets.
For the discriminative model, we adopt an incremental SVM algorithm, LASVM [11]. We also modify the original version of LASVM to extend it with the ability to forget samples, which is important for our tracking method since it works on a sliding window of samples. We use decremental SVM learning to remove old samples and focus on more recent appearance changes.

SVM techniques are a set of supervised learning methods used for classification and regression. Viewing the input data as two sets of vectors, an SVM constructs a hyperplane that maximizes the margin between the two data sets. Given the training data set D = {(x_i, y_i)}, x_i ∈ R^p, y_i ∈ {−1, 1}, where x_i is a p-dimensional real vector and y_i is the label of the data, an SVM seeks the maximum-margin hyperplane dividing the points having y_i = 1 from those having y_i = −1. Any hyperplane can be written as the set of points satisfying

w \cdot x + b = 0    (4.7)

where w is the normal of the hyperplane. We want to choose w and b to maximize the margin, i.e. the distance between the parallel hyperplanes that are as far apart as possible while still separating the data. These hyperplanes can be described by the equations w · x + b = 1 and w · x + b = −1, and the margin between them is 2/||w||. If the data are linearly separable, we can select the two margin hyperplanes so that no points lie between them and then maximize their distance; namely, we aim to minimize ||w|| subject to constraints that prevent any sample from falling into the margin, i.e. w · x_i + b ≥ 1 for y_i = 1 and w · x_i + b ≤ −1 for y_i = −1. Putting this together, we write the optimization problem as

\min_{w,b} \|w\|    (4.8)

\text{subject to } (w \cdot x_i + b)\, y_i \ge 1    (4.9)

This optimization problem, with the norm in the objective, is difficult to solve directly. Fortunately, it is possible to alter the objective by substituting ||w|| with (1/2)||w||² without changing the solution (the original and the modified problems have the same minimizer).
This becomes a convex quadratic programming (QP) optimization problem, which can be solved in polynomial time and always has a global optimum:

\min_{w,b}\ \tfrac{1}{2}\, w \cdot w \quad \text{subject to}\quad (w \cdot x_j + b)\, y_j \ge 1,\ \forall j    (4.10)

Writing Eq. 4.10 in its Lagrangian form reveals that the maximum-margin hyperplane is only a function of the support vectors:

\min_{w,b} \max_{\alpha}\ \tfrac{1}{2}\, w \cdot w - \sum_j \alpha_j \big((w \cdot x_j + b) y_j - 1\big) \;\Rightarrow\; \max_{\alpha} \min_{w,b}\ \tfrac{1}{2}\, w \cdot w - \sum_j \alpha_j \big((w \cdot x_j + b) y_j - 1\big), \quad \alpha_j \ge 0,\ \forall j    (4.11)

where α = (α_1, ..., α_n). Let L(w, b, α) denote \tfrac{1}{2} w \cdot w - \sum_j \alpha_j ((w \cdot x_j + b) y_j - 1); then

\partial L / \partial w = 0 \;\Rightarrow\; w = \sum_j \alpha_j y_j x_j    (4.12)

\partial L / \partial b = 0 \;\Rightarrow\; \sum_j \alpha_j y_j = 0    (4.13)

\min_{w,b} L(w, b, \alpha) = -\tfrac{1}{2} \Big(\sum_j \alpha_j y_j x_j\Big) \cdot \Big(\sum_j \alpha_j y_j x_j\Big) + \sum_j \alpha_j    (4.14)

Thus, Eq. 4.11 can be rewritten as

\max_{\alpha}\ W(\alpha) = -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j + \sum_i \alpha_i \quad \text{subject to}\quad \sum_j \alpha_j y_j = 0,\ \alpha_j \ge 0    (4.15)

Eq. 4.15 is called the dual form for an SVM with hard margin, i.e. no samples are located within the margin. The most widely used SVM formulation is the soft margin formulation, proposed by Cortes and Vapnik (1995), which allows some examples to violate the margin condition. The soft margin method is particularly appropriate for noisy training data. For a soft margin SVM, Eq. 4.10 can be extended with slack variables ξ_j:

\min_{w,b}\ \tfrac{1}{2}\, w \cdot w + C \sum_j \xi_j \quad \text{subject to}\quad (w \cdot x_j + b)\, y_j \ge 1 - \xi_j,\ \xi_j \ge 0,\ \forall j    (4.16)

The slack variable ξ_j describes the classification error of sample x_j. By extending the objective function with this classification-error term, the optimization becomes a trade-off between a large margin and a small error penalty. Introducing multipliers α_j for the margin constraints and μ_j for the constraints ξ_j ≥ 0:

\min_{w,b,\xi} \max_{\alpha,\mu}\ \tfrac{1}{2}\, w \cdot w + C \sum_j \xi_j - \sum_j \alpha_j \big((w \cdot x_j + b) y_j - 1 + \xi_j\big) - \sum_j \mu_j \xi_j \;\Rightarrow\; \max_{\alpha,\mu} \min_{w,b,\xi}\ (\cdot), \quad \alpha_j \ge 0,\ \mu_j \ge 0,\ \forall j    (4.17)

Let L(w, b, ξ, α, μ) denote \tfrac{1}{2} w \cdot w + C \sum_j \xi_j - \sum_j (\alpha_j + \mu_j)\, \xi_j - w \cdot \sum_j \alpha_j y_j x_j - b \sum_j \alpha_j y_j + \sum_j \alpha_j. We have the following constraints.
\partial L / \partial w = 0 \;\Rightarrow\; w - \sum_j \alpha_j y_j x_j = 0

\partial L / \partial b = 0 \;\Rightarrow\; \sum_j \alpha_j y_j = 0

\partial L / \partial \xi_j = 0 \;\Rightarrow\; C - \alpha_j - \mu_j = 0

\min_{w,b,\xi} L(w, b, \xi, \alpha, \mu) = -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j + \sum_j \alpha_j    (4.18)

where μ_j is the multiplier of the constraint ξ_j ≥ 0. Combining these, the optimization problem of a soft margin SVM can be written as

\max_{\alpha}\ -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j + \sum_j \alpha_j \quad \text{subject to}\quad 0 \le \alpha_j \le C,\ \sum_j \alpha_j y_j = 0,\ \forall j    (4.19)

The constraint 0 ≤ α_j ≤ C is called the box constraint.

LASVM observes that optimization is faster when the search direction mostly contains zero coefficients, and therefore uses search directions whose coefficients are all zero except for a single +1 and a single −1. A pair of coefficients (i, j) is called a τ-violating pair if α_i < B_i, α_j > A_j, and g_i − g_j > τ, where τ is a small positive value; LASVM selects the τ-violating pair (i, j) that maximizes the directional gradient g_i − g_j.

The LASVM algorithm contains two procedures named PROCESS and REPROCESS [11]. When a new sample x_k arrives, PROCESS forms a τ-violating pair (i, j), which contains x_k and an existing support vector, and updates the weights of this pair. Following PROCESS, REPROCESS selects a τ-violating pair from the set of support vectors and updates their weights. The new sample x_k may become a new support vector through PROCESS, while another support vector may be switched out by REPROCESS. Both PROCESS and REPROCESS select the τ-violating pair with the largest gradient. The complexity of this selection process grows linearly with the number of vectors. A finishing step, which runs REPROCESS multiple times to remove as many remaining τ-violating pairs as possible, is performed after the online process. For tracking, the intermediate classifier is useful; hence we run this finishing step every 10 frames. Note that, since incremental learning does not need to examine any ignored vector, a τ-violating pair is only selected from the set of support vectors.
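The pair-selection rule can be sketched in a few lines of Python. This is illustrative, not the LASVM implementation: `A` and `B` are the per-coefficient lower and upper bounds from the text, `grad` holds the gradients g, and a pair is returned only when the directional gradient exceeds τ:

```python
def select_tau_violating_pair(alpha, grad, A, B, tau=1e-3):
    """Among indices whose coefficient can still increase (alpha_i < B_i)
    and those that can decrease (alpha_j > A_j), return the pair (i, j)
    maximizing g_i - g_j, or None if no pair is tau-violating."""
    up = [i for i in range(len(alpha)) if alpha[i] < B[i]]
    down = [j for j in range(len(alpha)) if alpha[j] > A[j]]
    if not up or not down:
        return None
    i = max(up, key=lambda k: grad[k])
    j = min(down, key=lambda k: grad[k])
    if grad[i] - grad[j] <= tau:
        return None
    return i, j
```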
Algorithm 3 Online SVM algorithm
Input: (I, C, t)
  I = {I_1, ..., I_n, ...}: a sequence of samples
  C: the seed classifier
  t: the finishing cycle
Output: C: the online trained classifier
while I ≠ ∅ do
  1. Remove old ignored vectors
  2. Remove old SVs and REPROCESS over all vectors
  3. Fetch a new sample I_i and do PROCESS(I_i)
  4. Do REPROCESS once over the SVs
  5. Do the finishing step every t frames
end while

For the online tracking problem, the many appearance variations and the limited training samples may degrade the generalization ability of the SVM. Also, in experiments we find that the number of support vectors grows quickly when the appearance of the object and background changes. Thus, we propose to decrementally train the SVM and focus on recent appearance variations within a sliding window. REPROCESS in LASVM can be used to achieve the "unlearning" of old samples. For decremental learning, removing ignored vectors (when they move out of the sliding window) does not change the decision boundary. However, the removal of a support vector does affect the decision boundary, and some ignored vectors may become support vectors. In order to remove one support vector, we first zero out its coefficient and move its coefficient into the closest vector so as to maintain the constraint Σ_j α_j y_j = 0. We then apply REPROCESS multiple times to select τ-violating pairs from the set of both ignored and support vectors and update the weights. The cost of decremental learning is that we need to store all samples within the sliding window.

4.5 Implementation and Results

4.5.1 Implementation Details

For both the generative and discriminative models, we use image vectors of size 32×32 (for faces) or 32×64 (for humans and vehicles). For the generative model, η is set to 0.95-0.99 and the maximum number of subspaces is set to 5-10. The initial subspace dimension is 4, which is very low compared to the input space. Thus, every 5 frames we form a new subspace, which is then inserted into the subspace pool.
For the discriminative model, we use LASVM with R-HOG feature vectors, which are created from 16×16 pixel blocks containing 8×8 pixel cells. The stride is 4 pixels, to allow overlapping HOG descriptors. Each cell has a 9-bin oriented histogram; hence we have a 36-bin oriented histogram per block. For a 32×64 window, the vector size is 2340. We use the linear kernel in the SVM. The number of support vectors varies between 50 and 150 for the different sequences. We use a sliding window of 30 frames. We manually label the first 10 frames as the initialization for the two trackers. The Bayesian inference framework generates 600 particles. The co-trained hybrid model is implemented in C++. The combined tracker runs at around 2 fps on a P4 2.8 GHz dual core PC. All testing sequences are 320×240 gray-level images.

During co-training, each learner labels the unlabeled data on which it makes a confident prediction based on its own knowledge. For this purpose, a threshold is needed for each learner. For the generative model, we set a threshold based on the log likelihood in Eq. 4.2. To be more conservative, we use a second criterion: we find several local optima in the posterior distribution, and if the ratio ρ between the second optimum and the global optimum is small enough (ρ ≤ 0.7), we accept the global optimum as a positive sample and all other samples that are far enough from the global optimum as negative samples. For the discriminative model, due to the very limited training data, the positive and negative training data are usually well separated. Thus, we cannot adopt the approach used in [50] to select the threshold. Instead, we select the confidence threshold so that at most 80% of the positive samples' confidences are above it. This threshold is updated every 30 frames. The positive and negative samples labeled by the generative model are not added to the discriminative model unless they are close to the decision boundary.
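As a consistency check on the implementation details above, the R-HOG descriptor length can be recomputed by counting the overlapping blocks in a detection window (function name ours; block size, stride and bins per block as given in the text):

```python
def rhog_vector_size(win_w, win_h, block=16, stride=4, bins_per_block=36):
    """Number of overlapping R-HOG blocks along each axis, times the
    36 histogram bins per block (2x2 cells of 9 orientation bins)."""
    nx = (win_w - block) // stride + 1
    ny = (win_h - block) // stride + 1
    return nx * ny * bins_per_block
```

For the 32×64 window this gives 5 × 13 × 36 = 2340, matching the stated vector size.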
To express the SVM confidence as a probability, we use the method in [64] to fit a sigmoid function, which is updated every 30 frames.

4.5.2 Performance Evaluation and Comparison

We compare our co-trained tracker with two generative methods, (G1) IVT [51] and (G2) our multiple linear subspaces algorithm, and three discriminative methods, (D1) online selection of discriminative color features [17], (D2) our online SVM method, and (E.T) ensemble tracking [4]. G1 uses a single 15D linear subspace and updates it incrementally. Note that D1 does not consider tracking with large scale changes and rotation. G1, G2, D2 and the co-trained tracker use the same parameters in the CONDENSATION algorithm, but G1, G2 and D2 use self-learning to update their models.

We compare these methods on challenging data sets, which contain image sequences of various types of objects, including faces (seq1-seq2), humans (seq3-seq5) and vehicles (seq6). The challenging conditions include significant illumination changes (seq1), abrupt camera motion and significant motion blur (seq2-seq5), viewpoint changes and/or pose variations (seq3-seq6), and occlusions (seq4-seq6). To compare robustness under these conditions, we report the number of frames for which each method can track the object before tracking failure, i.e. the frame after which a tracker cannot recover without re-initialization. Table 4.1 shows the comparison between the different methods; the total number of frames and the number of frames in which occlusion happens in each sequence are also listed. The comparison demonstrates that the co-trained tracker performs more robustly than the other methods. Note that D1 requires color information, so it cannot process some sequences, which are indicated with "n/a". Visual results are shown in Figure 4.4, where the tracked objects and part of the negative samples are bounded with green boxes and white boxes respectively. A red box indicates that no model is updated in that frame.
In experiments, we frequently find that the co-trained tracker has better self-awareness of its current tracking performance, and can safely enlarge the search range (by changing the diffusion dynamics) without being confused by distracters in the background. Also, the co-trained tracker successfully avoids the drift caused by varying viewpoints and illumination changes.

We compare with two other methods to demonstrate our method's reacquisition ability. One is ensemble tracking [4]. The other is the online SVM tracker without decremental learning.

Table 4.1: Comparison of different methods. G1: IVT [51]; G2: incremental learning of multiple subspaces; D1: online selection of discriminative color features [17]; D2: online SVM; E.T: ensemble tracking [4]. D1 uses color information, which is not available for Seq1 and Seq6 (marked "n/a").

       Frames  Occlusion  G1   G2   D1   D2   E.T  Ours
Seq1   761     0          17   261  n/a  491  94   759
Seq2   313     0          75   282  313  214  44   313
Seq3   140     0          11   15   6    89   22   140
Seq4   338     93         33   70   8    72   118  240
Seq5   184     30         50   50   50   50   53   154
Seq6   945     143        163  506  n/a  54   10   802

In the experiments, we find that ensemble tracking can only reacquire the object after a short occlusion. Also, ensemble tracking cannot deal with rotation and scale changes, which restricts its reacquisition ability. Without decremental learning, the number of support vectors quickly grows to several hundred when the background becomes cluttered and the object appearance changes, which makes the online SVM tracker very slow. Meanwhile, the performance of that tracker degrades, exhibiting drift and entrapment by distracters.

[Figure 4.3: Tracking and reacquisition with long occlusion and cluttered background. (a) the proposed method; (b) ensemble tracking.]

We also compare our generative tracker G2 with G1 and another generative method G3, in indoor environments with few distracters.
G3 uses 5 key 10D subspaces (corresponding to frontal, left profile, right profile, up and down), which are trained offline by manually labelling 150 samples into 5 categories; our generative method G2 uses at most 10 subspaces and each new subspace starts from 4 dimensions. The indoor sequence exhibits significant head pose variations, shown in Figure 4.5(b). We calculate the projection errors for each method in Figure 4.5(a) (averaged over multiple runs) within the 1000 frames before the other two trackers start to drift. The offline 5-key-subspace method shows large projection errors at frames whose poses are not covered by the offline samples. G1 is not able to adapt rapidly to new appearance variations after running on a long sequence. Our generative method G2 promptly adapts to appearance variations and shows consistently smaller projection errors, even though each of its subspaces has much lower dimensionality than in the other two methods. As we can see, the online trained subspaces approximately represent the poses that have been observed, even though no offline training on different poses is performed.

[Figure 4.4: Tracking various types of objects in outdoor environments. (a) Tracking and reacquisition with abrupt motion and blur; (b) tracking a human with cluttered background and distracters; (c) tracking and reacquisition after the target leaves the field of view for a long time.]

[Figure 4.5: Comparison of generative methods in indoor environments. (a) Comparison of projection errors (G1: updating a single subspace; G2: our generative tracker; G3: 5-key offline subspaces); (b) tracking with pose variations.]

Chapter 5
Application: Detection and Tracking of Moving Vehicles from Airborne Cameras

5.1 Introduction and Goal

In this chapter, we aim to apply the multiple target tracking algorithm to one particular task: detecting and tracking multiple moving vehicles from airborne cameras.
Our interest in this application is twofold. First, it is an important area in surveillance: successfully tracking multiple moving vehicles provides a fundamental tool for further analysis, such as event detection. Second, it is a good area in which to apply our spatio-temporal multiple target tracker: both motion and appearance information are important for this problem; in other words, using either motion or appearance alone cannot succeed.

Detecting moving vehicles from airborne cameras is very difficult. First, the objects may be very small, so a pattern based detector usually fails due to the lack of resolution and blurry images. Second, motion segmentation may be very noisy, due to erroneous registration, illumination changes and parallax. To achieve our goal, we need to use a set of techniques and make full use of both motion and appearance information. Also, this application requires real-time or near real-time performance. All these difficulties challenge the existing methods. The overview flowchart of the whole system is shown in Figure 5.1. The core module of tracking multiple targets has been covered in Chapter 3. In this chapter, we focus on motion detection, geo-registration and the real-time implementation.

[Figure 5.1: The overview framework of our approach.]

5.2 Motion Detection From a Moving Platform

5.2.1 Image Stabilization

Motion detection from a stationary camera has been studied extensively compared to the moving camera case. The most widely used procedure for motion detection is as follows: first characterize the background model, by image differencing or background learning methods, and then cluster the pixels that do not fit the corresponding background model. The critical step in the various detection algorithms is the collection of good statistics representing the background model (i.e. the fixed part of the scene).
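For a stationary camera, this statistics collection can be as simple as a per-pixel statistic such as the median, one of the candidate background models discussed later for the sliding-window setting. A toy sketch with nested lists (function names and the difference threshold are illustrative, not the thesis implementation):

```python
from statistics import median

def background_estimate(pixel_stacks):
    """Median background model: for each pixel position, the background value
    is the median of that pixel's values across the frames, which is robust
    as long as moving objects cover the pixel in a minority of frames."""
    return [[median(stack) for stack in row] for row in pixel_stacks]

def foreground_mask(frame, background, thresh=25):
    """Flag pixels whose absolute difference from the estimated background
    exceeds a threshold (illustrative value)."""
    return [[abs(p - b) > thresh for p, b in zip(fr, br)]
            for fr, br in zip(frame, background)]
```

In the moving-camera case treated next, the same statistic is collected only after every frame has been warped into a common reference frame.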
The simplest detection setting is extracting moving regions from a stationary camera, since building the background model reduces to collecting the dominant pixel value (intensity or color) at each pixel position through the given sequence. The case of a moving camera is inherently more complex, since the camera motion induces motion in all pixels of the image. Therefore, we first need to establish the correspondence for each pixel before actually collecting the statistics of the background model. An alternative solution consists of first estimating the camera motion and compensating for it, so that detecting moving objects from a moving camera becomes similar to the detection problem for a static camera, and the various stationary camera detection techniques can be applied.

Stabilization of the camera motion is thus a necessary step for motion detection from a moving camera (sometimes stabilization is also needed for fixed cameras, since jittering happens very often in outdoor environments). We adopt the well-known 2D image registration approach using a parametric transformation model. The parametric model can be affine, projective or quadratic.

If the scenes are taken by a camera rotating about its axes (pan-tilt), or through a telephoto lens from a long distance, the depth of the objects is much smaller than the distance between the objects and the camera. We can therefore approximate the scene motion by 2D parametric transformations such as an affine or projective transformation [48], [62], [85]. In particular, for the UAV surveillance application, the scenes are taken by a far-distant camera to ensure wide coverage of the scene, and frames need to be closely related to ensure a common region between frames. The affine transformation in Eq. 5.1 is a proper choice in this case:

\begin{bmatrix} u_{t+1} \\ v_{t+1} \\ 1 \end{bmatrix} = \begin{bmatrix} a & b & t_x \\ c & d & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u_t \\ v_t \\ 1 \end{bmatrix} \;\Rightarrow\; X_{t+1} = A_{(t,t+1)} X_t    (5.1)

To estimate the affine transformation, we start by extracting features in pair-wise images.
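A toy version of this estimation pipeline can be prototyped directly: each point correspondence contributes two linear constraint rows in the six affine parameters, the parameters are obtained by linear least squares, and a RANSAC loop rejects outlier correspondences. The iteration count, inlier threshold and function names below are illustrative, not the thesis settings:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine fit: each correspondence (u, v) -> (u', v')
    contributes two rows in the unknowns (a11, a12, tx, a21, a22, ty)."""
    A, b = [], []
    for (u, v), (up, vp) in zip(src, dst):
        A.append([u, v, 1, 0, 0, 0]); b.append(up)
        A.append([0, 0, 0, u, v, 1]); b.append(vp)
    p, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return p  # a11, a12, tx, a21, a22, ty

def ransac_affine(src, dst, n_iter=200, thresh=2.0, seed=0):
    """Toy RANSAC: fit on 3 random correspondences, keep the model with the
    most inliers (L1 residual below thresh), then refit on those inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = []
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=3, replace=False)
        p = fit_affine([src[i] for i in idx], [dst[i] for i in idx])
        inliers = [i for i, ((u, v), (up, vp)) in enumerate(zip(src, dst))
                   if abs(p[0]*u + p[1]*v + p[2] - up)
                   + abs(p[3]*u + p[4]*v + p[5] - vp) < thresh]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return fit_affine([src[i] for i in best_inliers],
                      [dst[i] for i in best_inliers])
```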
We use features based on the eigenvalues computed from the gradients of the intensity images, where the gradients are obtained with Sobel filters. The correspondences between the extracted features in adjacent frames are then established. Each pair of corresponding feature points (u, v) and (u', v') gives two constraints on the affine transformation, shown in Eq. 5.2:

\begin{bmatrix} u & v & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & u & v & 1 \end{bmatrix} [a_{11}, a_{12}, t_x, a_{21}, a_{22}, t_y]^T = \begin{bmatrix} u' \\ v' \end{bmatrix}    (5.2)

Given 3 or more pairs of correspondences, we can use the linear least squares method to compute the parameters. In order to remove outliers among the correspondences, we employ Random Sample Consensus (RANSAC) [27] to estimate the affine transformation robustly.

5.2.2 Compensation of Illumination Changes

The assumption that the intensity or color of background pixels does not change may not hold when the illumination changes. Thus, before we estimate the background model, we first apply the method proposed in [83] to compensate for global illumination changes using a 1D affine model on the intensity or color. A global illumination change means that the relation of intensity or color between two images is independent of pixel location. The affine model thus assumes that the relation between the intensity I_i of a frame i and the intensity I_r of the reference frame r complies with the transformation I_i = a·I_r + b, where a, b ∈ R. For color images, the three R, G and B channels are compensated independently. In practice, this simple compensation of global illumination works very well; one example is shown in Figure 5.2.

[Figure 5.2: Compensation of global illumination change. (a) Obvious global illumination changes in three frames; (b) affine compensation; (c) difference image without compensation; (d) difference image with compensation.]

5.2.3 Background Modeling

Given the affine transformation between two consecutive frames, the affine transformation A_{(i,j)} between any two frames can be represented as follows.
A_{(i,j)} = \begin{cases} \prod_{k=i}^{j-1} A_{(k,k+1)} & i < j \\ I & i = j \\ \big(A_{(j,i)}\big)^{-1} & i > j \end{cases}    (5.3)

We adopt the sliding window method [39] to estimate the background image of a reference frame. Suppose the size of the sliding window is W. The center frame of the sliding window is selected as the reference frame. Registering any frame i in the sliding window to the selected reference r is performed with the affine transformation A_{(i,r)}, which concatenates the estimated pair-wise transforms as in Eq. 5.3.

[Figure 5.3: Computation of the statistics of each pixel.]

For each current frame, we warp the images in the sliding window using the pair-wise registration, which limits the impact of an erroneous registration: an erroneous registration does not influence the quality of the detection for the whole sequence, but only within the number of frames considered in the sliding window. Once we have the transformation between every frame and the reference frame, each pixel p in the reference frame r corresponds to the pixel p·A_{(r,i)} in frame i of the sliding window, unless p·A_{(r,i)} falls outside the scope of frame i. Thus, for each pixel p in the reference frame r, we have an array of corresponding pixels, i.e. [p·A_{(r,1)}, ..., p, ..., p·A_{(r,W)}]. If we assume the motion of a target is not too slow, the intensity or color of the background will dominate within such an array. There are four candidate methods to estimate the background intensity or color: a single Gaussian background model, a mixture of multiple Gaussians (MMG) [73], a median background model, and a mode background model. Since the size of the sliding window is small (in our experiments W = 91), the MMG method is not practical with the limited sample size. One example is shown in Figure 5.4.

5.3 Geo-registration and Tracking Coordinates

When we talk about tracking, it has to be defined within some coordinate system.
For stationary cameras, it is straightforward to select the 2D image coordinates as the tracking coordinates. However, for moving cameras, the 2D image coordinates are relative, and move with every frame. Many appearance-based tracking methods [17, 16, 18] are applied in this case. A motion model can be used within some fixed coordinates where the kinematic states of objects in every frame can be defined. Accumulated errors are introduced when fixed coordinates are selected and no further alignment is performed. Usually, the first frame [39], or the ground plane in the first frame [42], is selected as the reference frame.

Figure 5.4: Some motion detection results. The first column shows the reference frame in a sliding window (also the center frame); the second column shows the estimated background of the reference frame; the third column shows the difference image between the reference frame and the estimated background. (a) Example one; (b) example two.

Reference coordinates can give us meaningful kinematic states and motion models of objects. For our UAV scenario, we adopt a global map as the tracking coordinates. A satellite image is selected as the map. Homographies between UAV frames and the map are estimated by a geo-registration procedure. Compared with selecting the first frame as the coordinates, the motion model defined in the map coordinates has physical meaning. As geo-registration refines the homography between a UAV frame and the global map, the accumulated error is reduced in contrast to concatenating the transformations from the first frame to the current frame. Besides providing a frame in which a motion model can be applied, proper coordinates are also selected to combine the information from multiple cameras for tracking. In [39], the authors propose to use the 2D image coordinates of one stationary camera as the common coordinates; other stationary cameras and moving cameras are registered with the first stationary camera's 2D image coordinates by homography transformations.
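Registering cameras into common coordinates by homography, and chaining pairwise transforms as in Eq. 5.3, both reduce to simple matrix operations. A minimal NumPy sketch (function names and the pairwise-transform layout are illustrative assumptions):

```python
import numpy as np

def chain(pairwise, i, j):
    """Compose pairwise 3x3 transforms into the transform from frame i to
    frame j, as in Eq. 5.3: identity for i == j, a left-multiplied product
    for i < j, and the inverse of the opposite chain for i > j.
    `pairwise[k]` is assumed to map frame k into frame k+1."""
    if i == j:
        return np.eye(3)
    if i < j:
        H = np.eye(3)
        for k in range(i, j):
            H = pairwise[k] @ H  # H(i,j) = H(j-1,j) ... H(i,i+1)
        return H
    return np.linalg.inv(chain(pairwise, j, i))

def to_common(H, pts):
    """Map 2D points into the common coordinates: lift to homogeneous
    coordinates, multiply by the homography, then divide by the third
    component (perspective division)."""
    pts_h = np.hstack([np.asarray(pts, float), np.ones((len(pts), 1))])
    q = pts_h @ H.T
    return q[:, :2] / q[:, 2:3]
```

The same chaining applies verbatim to the homographies of the geo-registration step; the perspective division is what distinguishes a general homography from the affine case.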
In [6], the ground plane is used as the tracking coordinates; the kinematic states of multiple objects from multiple calibrated cameras are projected into the ground plane coordinates, where tracking is performed. In our UAV environment, we assume the scene can be approximated as a plane. This assumption is reasonable when the structure on the ground plane is relatively small compared with the camera height above the ground plane. We use a map M to represent the ground plane. Under this planar assumption, the transformation between a UAV image and the map can be represented as a homography $H_{iM}$, namely $H_{iM} I_i = M$. The transformation between two UAV images $I_i$ and $I_j$ can be represented as $H_{ij}$, i.e. $H_{ij} I_i = I_j$. For the tracking task, we need to define the targets' states (position, velocity, dimension, etc.) at different times in a common reference frame, where we can introduce the motion model of targets, such as constant velocity motion. The first frame could be selected as the reference coordinates. However, accumulated errors are then introduced since $H_{i0} = H_{i(i-1)} \cdots H_{10}$ needs to be computed. Here we use the map M as the common coordinates, and each UAV image $I_i$ is registered with M by the homography $H_{iM}$. The accumulated error is reduced by registering the UAV images with the global map. In addition, it makes more sense to define the motion kinematics in this frame, since the geo-coordinates have physical meaning. To compute the homography $H_{iM}$, we propose a two-step procedure to register a UAV image sequence to the global map. In the first step, we compute $H_{i(i+1)}$ to register consecutive frames, using RANSAC to estimate the best homography aligning the feature points in each frame. Given the homography between two consecutive frames, the homography $H_{ij}$ between any two frames can be represented as follows:
$$H_{ij} = \begin{cases} \prod_{k=i}^{j-1} H_{(k,k+1)} & i < j \\ I & i = j \\ (H_{(j,i)})^{-1} & i > j \end{cases} \quad (5.4)$$

Given an initialization $H_{0M}$ and the homography $H_{i0}$ obtained from Eq. 5.4, we roughly align the UAV image with the global map. In the second step, we refine the registration by computing the homography between the roughly aligned UAV image and the map. Since UAV images are captured at different times and from different views than the satellite image, the color, illumination, and dynamic content (such as vehicles, trees, shadows and so on) can be very different. Thus, to establish the correspondences between these two images, we use the criterion of mutual information [77]. Given the correspondences between the roughly aligned UAV image and the map, we again apply RANSAC to compute a refined homography. By linking the refined homography with the initial homography from the first step, we can register the UAV image with the map without incrementally accumulating registration error. Figure 5.5 shows 2000 geo-mosaiced frames overlaid on top of the map. We can see that, even after 2000 frames, the registration is still maintained within a small error bound.

Figure 5.5: Geo-mosaicing 2000 consecutive frames on top of the reference frame.

5.4 Implementation and Results

5.4.1 GPU-acceleration of Background Modeling

After profiling the whole system, we find that the bottleneck of the system in terms of time performance is the background modeling module. Thus, we use GPGPU (General Purpose Graphics Processing Unit) techniques to implement this module. In the GPGPU framework, the fully programmable vertex and fragment processors provide powerful computational tools for general purpose computations. In order to implement an algorithm on the GPU, different computational steps are often mapped to different vertex and fragment shaders. Vertex shaders allow us to manipulate the data that describes a vertex. Fragment shaders serve to manipulate a pixel.
We present a GPU implementation of the sliding window based method and separate the process into two steps: warping images and computing the background model. Also, we want to minimize memory transfer between GPU and CPU; therefore the inputs for our implementation are the sequential image data and homography transformations, and the output is the motion mask (namely the difference image) for each frame. The overall structure of the implementation is shown in Figure 5.6. Our implementation needs to store all frames in the sliding window. The sequential frames are stored as two-dimensional textures in the GPU memory. The most recent frame is loaded as a texture and the oldest frame is moved out of the texture pool. The warping involves changing the texture coordinates and is implemented in the vertex profile. The vertex profile takes the 3x3 homography matrix as input and outputs the warped texture coordinates by simple matrix multiplication operations. The transformed texture coordinates are applied in the fragment profile. To compute the background model, different statistical functions can be applied, such as the mode, median and mean of the samples in the sliding window. For motion detection from a moving platform, the mode is usually preferred to the other two. This is because we have very limited samples, i.e. the size of the sliding window is quite small.

Figure 5.6: Overview of the GPU-based implementation

The mean and the median, which do not differentiate foreground and background pixels, lead to a biased background model; they are only proper when enough samples are available. The computation of the mode requires building a histogram and then finding the bin with the largest number of samples. (There is another way to establish a dynamic histogram by constructing a binary search tree, but it involves too many branching operations and dynamic data structures, and thus is not suitable for a GPU implementation.)
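For reference, the per-pixel histogram/mode computation can be sketched on the CPU with NumPy. This is a minimal sketch, not the GPU implementation; the stack layout, bin count, and function name are illustrative assumptions, and the frames are assumed to be already warped into the reference frame:

```python
import numpy as np

def mode_background(stack, n_bins=16):
    """Per-pixel mode background: histogram each pixel's samples across the
    (already warped) sliding-window stack, pick the fullest bin, and average
    the samples inside that bin to reduce quantization error.
    `stack` has shape (W, height, width) with intensities in [0, 256)."""
    bins = (stack.astype(np.int32) * n_bins) // 256           # bin index per sample
    # counts[b, y, x] = number of window samples of pixel (y, x) in bin b
    counts = np.stack([(bins == b).sum(axis=0) for b in range(n_bins)])
    mode = counts.argmax(axis=0)                              # fullest bin per pixel
    in_mode = bins == mode[None]                              # samples in the mode bin
    return (stack * in_mode).sum(axis=0) / in_mode.sum(axis=0)
```

Averaging only the samples that fall into the mode bin mirrors the refinement described below for the GPU version: it recovers sub-bin precision that a 16-bin quantization would otherwise lose.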
Algorithm 4 Overview of GPU implementation
input: $(I_1, \ldots, I_W)$, $(H_{1,r}, \ldots, H_{W,r})$
output: Difference image $I_{diff}$
for each frame $I_i$ in the sliding window do
  1. Generate texture coordinates according to $H_{i,r}$
  2. Build a histogram for each pixel
end for
Compute the mode by reduction and use the average of the mode bin as the background intensity.

One histogram is built for each location in the reference frame, and each bin records the number of hits over the sliding window. This is different from the usual notion of a histogram, which is built over the whole image. The overview of the GPU implementation is shown in Algorithm 4. We use a fixed number n of bins to construct a histogram. By using an RGBA texture, a histogram of n bins can be packed into a tile of $\sqrt{n/4} \times \sqrt{n/4}$ texture elements. For each pixel, we need to construct such a histogram; therefore the size of the RGBA texture is $W\sqrt{n/4} \times H\sqrt{n/4}$, where W and H are the width and the height of the original images. Suppose n = 16 (which is enough for an eight-bit dynamic range); the histogram texture memory is then 4 times as large as the original image. In an RGBA texture with n = 16, (H(0), H(4), H(8), H(12)) is stored in a float4 vector in the GPU with the same texture coordinates.

Figure 5.7: Procedure to compute the mode. (a) Construct a histogram: a 16-bin histogram is built in one RGBA texture with doubled width and height; (b) compute the mode: the mode is computed among all bins in the RGBA histogram texture.

After one frame texture is loaded, the hits in each bin are updated. A bin is indexed by an intensity value, e.g. in [0, 15] for n = 16. It is not efficient to use "if-else" type statements to determine the placement of one intensity value in the histogram. For efficiency, it is preferable to use standard Cg functions [61] rather than branching statements.
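One branch-free bin indicator uses a trigonometric impulse: $\delta_b(x) = \lfloor \cos^2(\alpha(x-b)) \rfloor$ equals 1 exactly when the bin index x coincides with b, and 0 otherwise. A NumPy sketch of this idea (assuming samples are already quantized to integer bin indices, an illustrative simplification of what runs in a shader):

```python
import numpy as np

def bin_hit(x, b, n_bins=16):
    """delta_b(x) = floor(cos^2(alpha * (x - b))) with alpha = pi / n_bins:
    exactly 1 when the integer bin index x equals b, 0 for every other
    index in [0, n_bins) -- no branching required."""
    return np.floor(np.cos(np.pi / n_bins * (x - b)) ** 2)

def histogram_branch_free(samples, n_bins=16):
    """Accumulate hits for every bin with the indicator instead of if-else:
    broadcast samples against all bin centers and sum the 0/1 hits."""
    b = np.arange(n_bins)
    return bin_hit(np.asarray(samples)[:, None], b[None, :], n_bins).sum(axis=0)
```

The floor sharpens the soft cosine bump into a hard 0/1 hit; repeated squaring achieves a similar concentration when a floor is unavailable.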
We adopt the approach in [28], which uses the function $\delta_b(x) = \cos^2(\alpha(x-b))$ to indicate whether the value x belongs to bin b, where $\alpha$ scales $(x-b)$ into $(-\pi/2, \pi/2)$. The $\cos(\cdot)$ function can be squared repeatedly or filtered by a floor operation to yield a more concentrated impulse. To find the mode of the histogram, we use the reduction operation as in [45]. The required number of reduction iterations is $\log_2(\sqrt{n/4})$. The procedure for computing the mode is illustrated in Figure 5.7, where each grid cell corresponds to one bin. Updating the histogram is implemented in the "ping-pong" way. After we find the mode, i.e. the bin with the largest number of samples, we average the samples that are located in the mode bin as the background model. This refinement is necessary to avoid the quantization error introduced in building histograms. The difference between the background model and the reference image is the output that is transferred to the CPU. For color videos, we compute the background model independently for each channel. The approach that uses the mode as the background is independent of the result at the previous time; thus it is able to avoid the spread of registration errors and outliers. However, the computation of the mode is still complicated. When the registration quality is good enough, we can use an alternative to compute the background model. Computing the mean is much easier for a GPU implementation. We first warp the background result at the previous time to the current reference frame as an estimated background model, and use the mean of the samples that are close enough to the estimated model as the new background model. If the number of samples that are close enough to the estimated background is too small, we use the average of all samples instead. This method may inherit the errors of the previous estimate.

5.4.2 Experimental Results

We show our tracking results on the following two UAV sequences.
Using the longitude and latitude information that comes with the image sequences, the map is acquired from Google Earth. The homography between the first frame and the map, $H_{0M}$, is manually computed offline. Figure 5.8 shows the tracking result on a sequence with one moving object. Considering the computational cost, the geo-registration refinement with the map is performed every 50 frames. Figure 5.8(c) and Figure 5.8(d) display the tracking result on the map. The trajectory of tracklets in Figure 5.8(c) is generated using the initial homography between the UAV image and the map without refinement. Figure 5.8(d) is generated using our geo-registration. It is clear that the trajectories of tracklets without geo-registration fall outside the road boundary. Since the target is fully occluded by the shadows of trees, the trajectory of the single target breaks into tracklets. In real scenarios, moving shadows may affect the target's appearance. We apply the deterministic non-model-based method [66], working in HSV space, to remove strong moving shadows. However, due to the noisy moving shadow removal, the target identity is not fully maintained.

Figure 5.8: Comparison with and without geo-registration. (a), (b) tracking results; (c) trajectory of tracklets without geo-registration; (d) trajectory of tracklets with geo-registration.

Figure 5.9 shows tracking results on a sequence with multiple moving targets. Again, when targets are occluded by shadows, local data association may lose track identification and thus tracklets are formed. The missing detections caused by occlusion even last longer than the sliding window of local data association (45 frames). However, in global data association, the tracklets are associated with the correct ID throughout the video.

Figure 5.9: The tracklets and tracks obtained using the local and global data association framework. The UAV image sequence is overlaid on top of the satellite image. (a) The tracking result with geo-mosaicing of the UAV images on the satellite image; (b)-(e) the beginning frames of the tracklets of the red truck.

The different tracks are listed in the Z direction in different colors. Figures 5.9(b), 5.9(c), 5.9(d) and 5.9(e) show the beginning frame of each tracklet of the red truck. Although the appearance of the white van and the white SUV in 5.9(b) is quite similar, the temporal and spatial constraints on the global map avoid associating them together.

5.5 Conclusion and Discussion

We have proposed a framework to detect and track moving objects from a moving platform in global coordinates. The geo-registration with a global map provides reference coordinates to geo-locate targets with physical meaning. In geo-coordinates, the correlation between tracklets produced by the local data association algorithm is evaluated using spatio-temporal consistency and similarity of appearance. Experiments show that local and global association can maintain a track ID across long-term occlusion.

Chapter 6 Conclusion

6.1 Summary of Contributions

In this thesis, we have presented approaches to tracking single or multiple targets using user selection or motion segmentation as input. For multiple target tracking, we mainly address the data association problem in spatio-temporal space and use MCMC as the computational framework to sample the very large solution space. In this part of the work, we make full use of the smoothness in motion and consistency in appearance in spatio-temporal space to segment and track targets over time, instead of relying on appearance-based methods to detect or segment objects in each frame. We also extend this framework into a version that can utilize model information to find a semantically meaningful interpretation of the foreground regions, and we use the model information to guide the sampling procedure. We further extend the data association into a hierarchical version that contains both local data association and global association at the tracklet level.
For tracking one single arbitrary object with user selection as input, we focus on how to model the object appearance on-the-fly. We pose this problem as a semi-supervised learning problem and apply co-training to train a hybrid generative and discriminative model, to prevent training errors from reinforcing themselves. We combine motion detection and tracking techniques in the application of detecting and tracking multiple moving vehicles from airborne videos, and demonstrate promising results. A GPU version of the background modeling module for airborne cameras is implemented, and a significant speedup is achieved. Some more recent work, which first analyzes the general motion pattern of objects and then uses the motion pattern information in return to facilitate segmentation and tracking of each object in this pattern, is not included in this draft yet. Part of this thesis was published in recent conferences and journals, shown in the following list:

• Q. Yu, G. Medioni. Multiple Target Tracking with Global Optimization using Markov Chain Monte Carlo. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (to appear).

• Q. Yu, T. Dinh and G. Medioni. Online Tracking and Reacquisition Using Co-trained Generative and Discriminative Trackers. 10th European Conference on Computer Vision (ECCV), 2008, pp. 678-691.

• Q. Yu and G. Medioni. A GPU-based Implementation of Motion Detection from a Moving Platform. IEEE Workshop on Computer Vision on GPU (CVGPU), 2008.

• Q. Yu and G. Medioni. Integrated Detection and Tracking for Multiple Moving Objects using Data-Driven MCMC Data Association. IEEE Workshop on Motion and Video Computing (WMVC), 2008.

• Q. Yu, G. Medioni and I. Cohen. Multiple Target Tracking Using Spatio-Temporal Monte Carlo Markov Chain Data Association. In CVPR 2007, pp. 1-8.

• Q. Yu, Q. Li and Z. Deng. Online Motion Capture Marker Labeling for Multiple Interacting Articulated Targets. In Proceedings of Eurographics 2007, pp. 477-483.

• Q. Yu and G.
Medioni. Map-Enhanced Detection and Tracking from a Moving Platform with Local and Global Data Association. IEEE Workshop on Motion and Video Computing (WMVC), 2007, pp. 1-8.

• Y. Lin, Q. Yu and G. Medioni. Map-Enhanced UAV Image Sequence Registration. IEEE Workshop on Applications of Computer Vision (WACV), 2007, pp. 1-8.

• Q. Yu, I. Cohen, G. Medioni and B. Wu. Boosted Markov Chain Monte Carlo Data Association for Multiple Target Detection and Tracking. In ICPR 2006, pp. 675-678.

6.2 Future Directions

There are a few directions which would be interesting to explore from this thesis work.

6.2.1 Tradeoff between optimality and efficiency

The MCMC-based approach gives very good results in complex situations; however, the computational cost involved is high. Given limited computational resources, we hope to find a tradeoff between optimality and efficiency, such as using bottom-up techniques to design more informative proposal schemes, similar to the one in Chapter 3. We could also use the concept of atomic tracks to reduce the solution space and improve efficiency, at the price of a somewhat suboptimal solution.

6.2.2 General motion pattern analysis

Many types of objects exhibit a specific motion pattern over time, such as flows of people, flows of vehicles, etc. An interesting direction is to first explore the general motion pattern of these objects and then use the motion pattern in turn to facilitate segmentation and tracking of each object. This type of method can also be extended to detect events in very crowded scenarios, where tracking each object is either impossible or not necessary.

6.2.3 Combining tracking with segmentation

Segmentation techniques have become more mature recently. There are many useful methodologies and tools to segment an object or regions from a single image. Not much attention has been paid to segmenting one object from a video sequence.
On the other hand, a segmentation representation in tracking can provide the most detailed information about the object, but we cannot afford a segmentation as initialization, such as in the tag-and-track problem. Thus, combining these two ideas may lead to some interesting applications; for example, after the user selects an object with a bounding box, a fine segmentation can be achieved while tracking.

6.2.4 Towards events

One ultimate goal of detection and tracking is to provide perfect segmentation and track identification for further event analysis. In many cases, however, especially in very crowded scenarios, segmentation and tracking of each individual is too difficult. One interesting direction is to perform event analysis without tracking each object individually.

References

[1] Saad Ali and Mubarak Shah. A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-8, 2007.

[2] James F. Arnold and Henry Pasternack. Detection and tracking of low-observable targets through dynamic programming. In Signal and Data Processing of Small Targets, pages 207-217, 1990.

[3] Shai Avidan. Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26:1064-1072, 2004.

[4] Shai Avidan. Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(2):261-271, 2007.

[5] I. Beichl and F. Sullivan. The Metropolis algorithm. In Computing in Science & Engineering, pages 65-69, 2000.

[6] Jérôme Berclaz, François Fleuret, and Pascal Fua. Robust people tracking with global trajectory optimization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 744-750, 2006.

[7] Charles Bibby and Ian Reid. Robust real-time visual tracking using pixel-wise posteriors. In Proceedings of the European Conference on Computer Vision (ECCV), 2008.

[8] Åke Björck and Gene H. Golub.
Numerical methods for computing angles between linear subspaces. Mathematics of Computation, pages 579-594, 1973.

[9] Michael J. Black and Allan D. Jepson. EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. International Journal of Computer Vision (IJCV), volume 26, pages 63-84, 1998.

[10] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In 11th Annual Conference on Computational Learning Theory (COLT), pages 92-100, 1998.

[11] Antoine Bordes, Seyda Ertekin, Jason Weston, and Leon Bottou. Fast kernel classifiers with online and active learning. JMLR, pages 1579-1619, 2005.

[12] Gary R. Bradski. Computer vision face tracking for use in a perceptual user interface. In Intel Technology Journal, 1998.

[13] Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector machine learning. In Neural Information Processing Systems (NIPS), 2001.

[14] Kuo-Chu Chang and Yaakov Bar-Shalom. Joint probabilistic data association for multitarget tracking with possibly unresolved measurements and maneuvers. IEEE Transactions on Automatic Control, 29(7):585-594, 1984.

[15] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.

[16] Robert T. Collins. Mean-shift blob tracking through scale space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 234-240, 2003.

[17] Robert T. Collins, Yanxi Liu, and Marius Leordeanu. Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), volume 27, pages 1631-1643, 2005.

[18] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25:564-577, 2003.

[19] Ingemar J. Cox and Sunita L. Hingorani. An efficient implementation of Reid's MHT algorithm and its evaluation for the purpose of visual tracking.
In International Conference on Pattern Recognition (ICPR), pages 437-443, 1994.

[20] Ingemar J. Cox and Matt L. Miller. On finding ranked assignments with application to multi-target tracking and motion correspondence. IEEE Transactions on Aerospace and Electronic Systems, pages 486-489, 1995.

[21] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 886-893, 2005.

[22] Frank Dellaert, Steven M. Seitz, Charles E. Thorpe, and Sebastian Thrun. Structure from motion without correspondence. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 238-248, 2000.

[23] Jonathan Deutscher, Andrew Blake, and Ian Reid. Articulated body motion capture by annealed particle filtering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 126-133, 2000.

[24] Nicholas Dowson and Richard Bowden. Simultaneous modeling and tracking (SMAT) of feature sets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 99-105, 2005.

[25] Ahmed Elgammal. Learning to track: Conceptual manifold map for closed-form tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 724-730, 2005.

[26] CLEAR Evaluation and Workshop. Classification of Events, Activities and Relationships. http://www.clear-evaluation.org/.

[27] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, volume 24(6), pages 381-395, 1981.

[28] James Fung, Steve Mann, and Chris Aimone. OpenVIDIA: Parallel GPU computer vision. In Proceedings of ACM Multimedia, pages 849-852, 2005.

[29] Michael Grabner, Helmut Grabner, and Horst Bischof. Learning features for tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-8, 2007.

[30] Peter J. Green.
Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, pages 711-732, 1995.

[31] Peter J. Green. Highly Structured Stochastic Systems, chapter Trans-dimensional Markov chain Monte Carlo. Oxford University Press, 2003.

[32] Peter J. Green. Trans-dimensional Markov Chain Monte Carlo. Oxford University Press, 2003.

[33] Peter Hall, David Marshall, and Ralph Martin. Merging and splitting eigenspace models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 1042-1049, 2000.

[34] Chang Huang, Bo Wu, and Ramakant Nevatia. Robust object tracking by hierarchical association of detection responses. In Proceedings of the European Conference on Computer Vision (ECCV), pages 788-801, 2008.

[35] Michael Isard and Andrew Blake. CONDENSATION - conditional density propagation for visual tracking. International Journal of Computer Vision (IJCV), pages 5-28, 1998.

[36] Omar Javed, Saad Ali, and Mubarak Shah. Online detection and classification of moving objects using progressively improving detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 696-701, 2005.

[37] Allan D. Jepson, David J. Fleet, and Thomas F. El-Maraghi. Robust online appearance models for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 415-422, 2001.

[38] Simon J. Julier and Jeffrey K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In SPIE, pages 182-193, 1997.

[39] Jinman Kang, Isaac Cohen, and Gérard Medioni. Continuous tracking within and across camera streams. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 267-272, 2003.

[40] Jinman Kang, Isaac Cohen, and Gérard Medioni. Object reacquisition using invariant appearance model. In International Conference on Pattern Recognition (ICPR), pages 759-762, 2004.

[41] R. Kasturi, D. Goldgof, P. Soundararajan, and V. Manohar.
Performance evaluation protocol for text and face detection and tracking in video analysis and content extraction (VACE-II). Technical Report, University of South Florida, 2004.

[42] Robert Kaucic, A. G. Amitha Perera, Glen Brooksby, John Kaufhold, and Anthony Hoogs. A unified framework for tracking through occlusions and across sensor gaps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 990-997, 2005.

[43] Zia Khan, Tucker Balch, and Frank Dellaert. MCMC-based particle filtering for tracking a variable number of interacting targets. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), (11):1805-1918, 2005.

[44] Zia Khan, Tucker Balch, and Frank Dellaert. Multitarget tracking with split and merged measurements. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 605-610, 2005.

[45] Jens Kruger and Rudiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In International Conference on Computer Graphics and Interactive Techniques, pages 908-916, 2003.

[46] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, pages 83-97, 1955.

[47] Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled hybrids of generative and discriminative models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 87-94, 2006.

[48] Kuang-Chih Lee and David Kriegman. Online learning of probabilistic appearance manifolds for video-based recognition and tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 852-859, 2005.

[49] Ido Leichter, Michael Lindenbaum, and Ehud Rivlin. BitTracker - a bitmap tracker for visual tracking under very general conditions. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(9):1572-1588, 2008.

[50] Anat Levin, Paul Viola, and Yoav Freund. Unsupervised improvement of visual detectors using co-training.
In International Conference on Computer Vision (ICCV), volume 1, pages 626-633, 2003.

[51] Jongwoo Lim, David Ross, Ruei Lin, and Ming Yang. Incremental learning for visual tracking. In Neural Information Processing Systems (NIPS), pages 793-800, 2004.

[52] Ruei-Sung Lin, David Ross, Jongwoo Lim, and Ming-Hsuan Yang. Adaptive discriminative generative model and its applications. In Neural Information Processing Systems (NIPS), 2004.

[53] Xiaoming Liu and Ting Yu. Gradient feature selection for online boosting. In International Conference on Computer Vision (ICCV), pages 1-8, 2007.

[54] David MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[55] Peter S. Maybeck. Stochastic Models, Estimation, and Control. Mathematics in Science and Engineering. Academic Press, Inc., 1979.

[56] Anurag Mittal and Larry S. Davis. M2Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene. International Journal of Computer Vision (IJCV), 51(3):189-203, 2003.

[57] Baback Moghaddam and Alex Pentland. Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 19:696-710, 1997.

[58] Charles Morefield. Application of 0-1 integer programming to multitarget tracking problems. IEEE Transactions on Automatic Control, 22:302-312, 1971.

[59] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Neural Information Processing Systems (NIPS), pages 841-848, 2001.

[60] Hieu T. Nguyen and Arnold W. M. Smeulders. Robust tracking using foreground-background texture discrimination. International Journal of Computer Vision (IJCV), pages 277-293, 2006.

[61] NVIDIA Corporation. Cg Toolkit. http://developer.nvidia.com/object/cg_toolkit.html.

[62] Songhwai Oh, Stuart Russell, and Shankar Sastry. Markov chain Monte Carlo data association for general multiple-target tracking problems.
In Proceedings of the 43rd IEEE Conference on Decision and Control, pages 735-742, 2004.

[63] Nikunj C. Oza and Stuart Russell. Online bagging and boosting. In International Workshop on Artificial Intelligence and Statistics, pages 105-112, 2001.

[64] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61-74, 1999.

[65] Aubrey B. Poore. Multidimensional assignment formulation of data association problems arising from multitarget and multisensor tracking. Computational Optimization and Applications, pages 27-57, 1994.

[66] Andrea Prati, Ivana Mikic, Mohan M. Trivedi, and Rita Cucchiara. Detecting moving shadows: algorithms and evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25(7):918-923, July 2003.

[67] Rajat Raina, Yirong Shen, Andrew Y. Ng, and Andrew McCallum. Classification with hybrid generative/discriminative models. In Neural Information Processing Systems (NIPS), 2003.

[68] Donald B. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 24(6):84-90, December 1979.

[69] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305-345, 1992.

[70] Thomas Schoenemann and Daniel Cremers. Globally optimal shape-based tracking in real-time. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-8, 2008.

[71] Khurram Shafique and Mubarak Shah. A non-iterative greedy algorithm for multi-frame point correspondence. In International Conference on Computer Vision (ICCV), pages 110-115, 2003.

[72] Kevin Smith, Daniel Gatica-Perez, and Jean-Marc Odobez. Using particles to track varying numbers of interacting people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 962-969, 2005.

[73] Chris Stauffer and Eric Grimson. Adaptive background mixture models for real-time tracking.
In In IEEE Conferece on Computer Vision and Pattern Recognition (CVPR), pages 246–252, 1999. [74] Feng Tang, Shane Brennan, Qi Zhao, and Hai Tao. Co-tracking using semi- supervised support vector machines. In International Conference Computer Vision (ICCV), pages 1–8, 2007. [75] Zhuowen Tu and Songchun Zhu. Image segmentation by Data Driven Markov Chain Monte Carlo. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 24(5):657–674, 2002. [76] Cor J. Veenman, Marcel J.T. Reinders, and Eric Backer. Resolving motion corre- spondence for densely moving points. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 54–72, 2001. [77] Paul Viola and William M. Wells III. Alignment by maximization of mutual infor- mation. Internation Journal on Computer Vision (IJCV), 24(2):137–154, 1997. [78] Yichen Wei, Jian Sun, Xiaoou Tang, and Heung-Yeung Shum. Interactive offline tracking for color objects. In International Conference Computer Vision (ICCV), pages 1–8, 2007. [79] Bo Wu, Haizhou Ai, Chang Huang, and Shihong Lao. Fast rotation invariant multi- view face detection based on real adaboost. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 79–85, 2004. [80] Bo Wu and Ramakant Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In International Conference Computer Vision (ICCV), pages IV: 90– 97, 2005. 125 [81] Bo Wu and Ramakant Nevatia. Tracking of multiple, partially occluded humans based on static body part detection. In In IEEE Conferece on Computer Vision and Pattern Recognition (CVPR), pages 951–958, 2006. [82] Bo Wu and Ramakant Nevatia. Improving part based object detection by unsu- pervised, online boosting. In In IEEE Conferece on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007. [83] Hulya Yalcin, Robert Collins, and Martial Hebert. Background estimation under rapid gain change in thermal imagery. 
In CVPR Workshop on Object Tracking and Classification in and Beyond the Visible Spectrum (OTCBVS), 2005. [84] Ming Yang and Ying Wu. Tracking non-stationary appearances and dynamic fea- ture selection. In In IEEE Conferece on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1059– 1066, 2005. [85] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Journal of Computing Surveys, 2006. [86] Qian Yu, G´ erard Medioni, and Isaac Cohen. Multiple target tracking using spatio- temporal markov chain monte carlo data association. In In IEEE Conferece on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007. [87] Ting Yu and Ying Wu. Collaborative tracking of multiple targets. In In IEEE Conferece on Computer Vision and Pattern Recognition (CVPR), pages 834–841, 2004. [88] Li Zhang, Ramakant Nevatia, and Bo Wu. Detection and tracking of multiple hu- manswithextensiveposearticulation. InInternational Conference Computer Vision (ICCV), 1-8, 2007. [89] Tao Zhao and Ram Nevatia. Tracking multiple humans in crowded environment. In In IEEE Conferece on Computer Vision and Pattern Recognition (CVPR), pages 406–413, 2004. 126