Motion Pattern Learning and Applications to Tracking and Detection

by

Xuemei Zhao

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2013

Copyright 2013 Xuemei Zhao

Acknowledgments

First and foremost, I would like to thank my advisor, Prof. Gérard Medioni. I have to admit I knew little about doing research when I came to USC five years ago. He taught me patiently and guided me on everything, from research direction to technical details. In the early years of my study, he even sat by my side in front of computers to find the problems in my code. As I gradually got better at research, he gave me more and more freedom to encourage my independent thinking. But when I got stuck or frustrated, he was always there to point to the right direction. I sincerely admire his persistent passion for research and for guiding students. When he got the Mellon Mentoring Award, I was not surprised at all, since he is always one of the most popular professors at school.

I would also like to thank the other members of my dissertation committee for providing valuable feedback on my work: Prof. B. Keith Jenkins, Prof. Yan Liu, Prof. Ram Nevatia and Prof. Jernej Barbic. Prof. Jenkins is always very nice and elegant. I was lucky to be his teaching assistant for one semester, and he is always very considerate to every student. Prof. Yan Liu, a young, beautiful and strongly self-motivated female professor, is always my role model. Prof. Nevatia is always very humorous and talkative in life, and very sharp when it comes to research. I learned a lot from all the conversations with him. Prof. Barbic is an outstanding researcher. Sometimes when I worked late into the night at the lab, I saw him still working energetically in the neighboring lab. If professors work that hard, is there any excuse for students not to?

I want to thank my fiancé Dian for always being supportive, in both research and life. He is a super intelligent and hardworking mathematician. In addition to doing his own research, he is always like an associate advisor to me, guiding me technically and encouraging me consistently through all the hard times. I benefited a lot from countless discussions with him.

Over the years at USC, I have been fortunate to work together with and learn from great researchers and good friends: Jan Prokaj, Yinghao Cai, Jongmoo Choi, Pramod Sharma, Prithviraj Banerjee, Bo Yang, Weijun Wang, Eunyoung Kim, Thang Dinh, Cheng-Hao Kuo, Yuan Li, Vivek Kumar Singh, Yuping Lin, Matheen Siddiqui, Qian Yu, Philippos Mordohai, Yili Zhao, Kartik Audhkhasi, Younghoo Lee, Sikai Zhu and Yi Gai.

Last but not least, I would like to thank my family for their unconditional love and support. During these years, I had to live far away from my parents, but their encouragement and love is always the driving force for me to overcome any difficulty.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Motivation and Problem Definition
    1.1.1 Motion Pattern Learning
    1.1.2 Tracking and Detection in Crowded Scenes
  1.2 Challenges
  1.3 Contributions
  1.4 Outline
2 Related Work
  2.1 Motion Pattern
    2.1.1 Features
    2.1.2 Applications
  2.2 Visual Tracking
  2.3 Multiple Target Tracking
    2.3.1 MTT in general environments
    2.3.2 MTT in Wide Area Aerial Surveillance
  2.4 Detection and Tracking in Crowded Scenes
3 Motion Pattern Learning
  3.1 Introduction
  3.2 Tensor Voting
  3.3 Robust Non-linear Manifold Grouping
    3.3.1 Multiple Kernel Similarity Graph
    3.3.2 Graph Spectral Grouping
    3.3.3 Outlier Rejection
    3.3.4 Number of Motion Patterns
  3.4 Learning Motion Patterns
  3.5 Motion Patterns Improve Multiple Target Tracking
    3.5.1 Algorithm
    3.5.2 Complexity Analysis
  3.6 Experimental Results
    3.6.1 Motion Pattern Learning on NGSIM Dataset
      3.6.1.1 Motion Blobs Association based Feature Extraction
      3.6.1.2 Results
    3.6.2 Motion Patterns Improve Tracking
      3.6.2.1 NGSIM Dataset
      3.6.2.2 Wide Area Scene Analysis: CLIF 2006 dataset
  3.7 A Generalization of RNMG to ND Space
    3.7.1 Robust Multiple Manifold Structure Learning
    3.7.2 Experimental Results
      3.7.2.1 Synthetic Data
      3.7.2.2 USPS Digits Data
      3.7.2.3 Motion Capture (MoCap) Data
      3.7.2.4 Motion Pattern Learning on Motorbike Videos
  3.8 Conclusion
4 Tracking and Detection in Very Crowded Scenes
  4.1 Introduction
  4.2 Motion Structure Tracker
    4.2.1 Exploiting motion pattern in detection
      4.2.1.1 Appearance Detection Probability
      4.2.1.2 Motion Detection Probability
    4.2.2 Exploiting motion pattern in tracking
    4.2.3 Simplified Multiple Target Tracking in Structured Crowded Scenes
  4.3 Experimental Validation
    4.3.1 Single target tracking results in temporally stationary scenes
    4.3.2 Single target tracking results in temporally non-stationary scenes
    4.3.3 Multi-target tracking results in temporally stationary scenes
    4.3.4 Multi-target tracking results in temporally non-stationary scenes
  4.4 Conclusion
5 Online Distributed Implementation
  5.1 Distributed Computation
  5.2 Online Implementation
    5.2.1 Temporally Incremental Tensor Voting
    5.2.2 Online Motion Pattern Learning
  5.3 GPU Implementation of Tensor Voting
    5.3.1 The design of the GPU Implementation
    5.3.2 Limitations of the GPU implementation
  5.4 Conclusion
6 Conclusions
  6.1 Contributions
    6.1.1 Efficient Motion Pattern Learning
    6.1.2 Motion Structure Tracker for Very Crowded Scenes
  6.2 Future Work
    6.2.1 Motion Pattern Learning
    6.2.2 Tracking and Detection in Very Crowded Scenes
Appendix A: N-D Closed-Form Voting: CPU and GPU implementation
  A.1 Introduction
  A.2 Usage
  A.3 Parameter
Appendix B: Motion Pattern Learning
  B.1 Introduction
  B.2 Usage
  B.3 Parameter
Appendix C: Motion Structure Tracker
  C.1 Introduction
  C.2 Usage
  C.3 Parameter
  C.4 Note
Reference List

List of Tables

3.1 Misclassification rates for 7 grouping algorithms on the real-world sequences. SSC1 means SSC grouping using its own outlier rejection step, and SSC2 means SSC grouping using our outlier rejection step.
3.2 Tracking evaluation results on three sequences.
3.3 Vehicle tracking performance with and without motion patterns (MP) on wide area imagery. Please see the text for metric definitions.
3.4 Rand index scores of clustering on synthetic data, USPS digits, CMU MoCap sequences and Motorbike videos.
4.1 Tracking evaluation results of a single target in temporally stationary scenes from Figure 4.5.
4.2 Tracking evaluation results of a single target in temporally stationary scenes from Figure 4.6.
4.3 Tracking evaluation of a single target in temporally non-stationary scenes.

List of Figures

1.1 Examples of motion patterns formed and followed by moving objects in different scenes: yellow arrows illustrate the directions of movement. (a)(b): sample images of video data from the NGSIM dataset. (c)(d): sample images of video data from YouTube.
1.2 An example of Wide Area Aerial Surveillance (WAAS) imagery.
1.3 Some examples of very crowded scenes of sports, political gatherings, shopping malls and crowded streets.
3.1 A cartoon illustration of Tensor Voting used to infer the local geometric structures of manifolds.
3.2 2D illustration of inferring local geometric information by Tensor Voting.
3.3 A visualization of a tensor in R^2 and tensor decomposition in 2D.
3.4 An illustration of the local geometric structure for an input point, i.e. the local structure that the point belongs to, tangent space, normal space, and saliency.
3.5 A visualization of combining the three kernels, i.e. dimensionality kernel, normal space kernel, and distance kernel, to get the new similarity matrix.
3.6 An illustration of motion pattern learning steps: (a) one frame of the input sequence; (b) (x, y, v_x) space projection of KLT tracklet embedding; (c) RNMG results projected in (x, y, v_x) space; (d) motion pattern propagation results in image space.
3.7 (a): A trajectory segment. (b): The motion direction of every point on the segment: green points indicate trajectory points and red lines indicate directions. (c): Zoom-in of a part of the tracklet. (d): Zoom-in of a part of the results after Tensor Voting.
3.8 A comparison of foreground extraction results. (a): Input image. (b): Foreground extraction results by RASL. (c): Foreground extraction results by Mixture of Gaussians.
3.9 Foreground motion blob extraction results. (a)(b): Foreground pixels. (c)(d): Motion blob correspondences on images.
3.10 Grouping results of the second sequence in Figure 3.11. Black points indicate outliers.
3.11 Motion pattern learning results by RNMG: (a) Input sequences; (b) Visualization of motion pattern learning results projected in (x, y, v_y) space for the first sequence, and (x, y, v_x) space for all the other sequences; (c) Visualization of motion pattern learning results in image space.
3.12 One frame of the imagery is captured by an array of cameras (left), while it is desirable to work with only one image per frame, as if it were captured by a virtual camera (right).
3.13 (a) Initial tracklets used to learn motion patterns. (b) Learned motion patterns.
3.14 An example of multiple smooth manifold clustering. The first one is the input data samples and the other three are possible clustering results. Only the rightmost is the result we want, because the underlying manifolds are smooth.
3.15 A demonstration of RMMSL. From left to right: noisy data points sampled from two intersecting circles with outliers, outlier detection results (black dots), and manifold clustering results after outlier detection.
3.16 Visualization of part of the clustering results in Table 3.4. The first row: one noisy sphere inside another noisy sphere in R^3. The second row: two intersecting noisy spheres in R^3. The third row: two intersecting noisy planes in R^3. For each part, from left to right: K-means, self-tuning spectral clustering, Generalized PCA and RMMSL.
3.17 Clustering results of RMMSL on two manifolds with outliers. (a): ground truth, (b): outlier detection, (c): clustering after outlier filtering, (d): clustering without outlier filtering.
3.18 Examples of clean USPS digit images.
3.19 An example of human action segmentation results on CMU MoCap. Top left: the 3D visualization of the sequence. Top right: labeled ground truth and clustering results comparison. Bottom: 9 uniformly sampled human poses.
3.20 Two examples of motion flow modeling results on motorbike videos. The first and third rows: optical flow estimation results on sample images from two video sequences. The second and fourth rows: learned motion manifolds with highlighted motion directions.
4.1 Examples of structured crowded scenes. (a)(b)(c): Marathon sequences. (d)(e): Italian motorbike display sequences.
4.2 Commonly used detection methods fail in very crowded scenes. (a) Pedestrian detection results by [30]. (b) Foreground extraction results by MoG.
4.3 An overview of the Motion Structure Tracker for single target tracking and a simplified version of multi-target tracking.
4.4 Temporally non-stationary scenes. First row: Hong Kong. Second row: Motorbike. (a) Input sequences and examples of targets. (b) Visualization of motion pattern learning results projected in (x, y, v_x) space. (c) Visualization of motion pattern learning results in image space.
4.5 Temporally stationary scenes and examples of targets. (a) Marathon-1. (b) Marathon-2. (c) Marathon-3.
4.6 Temporally stationary scenes and examples of targets from the ICCV 2011 data-driven crowd analysis dataset [65].
4.7 Examples of tracking result comparisons. First row: temporally stationary scenes. Second row: temporally non-stationary scenes.
4.8 Simplified multi-target detection and tracking results in temporally (a) stationary and (b) non-stationary scenes, respectively. The red rectangle denotes the user-labeled target. Blue rectangles denote the similar objects detected by the learned detector.
4.9 Some examples of the scenes from the ICCV 2011 data-driven crowd analysis dataset [65].
5.1 Example of WAAS imagery, full frame (left) and detail (right).
5.2 Choosing the spatial division tile size is important for the success of distributed computation. Top: if each tile is too small, then in the region of coverage no clear patterns are formed. Bottom: if each tile is too large, then computational power is wasted and no significant speedup can be obtained.
5.3 Parallel estimation of tracks and motion patterns on a cluster of computers is enabled by creating an overlapping grid of tiles.
5.4 Parallel estimation results of motion patterns.
5.5 First row: examples of temporally stationary scenes. Second row: examples of temporally non-stationary scenes.
5.6 Implementation of the parallel structure of the vote collection mode.

Abstract

With the decreasing cost of collecting data, the deluge of surveillance videos makes it necessary to carry out automatic intelligent processing to understand scenes and analyze activities. There are two classes of methods for video analysis: one is based on analyzing the trajectories of objects of interest, and the other examines motion feature vectors directly. Our work is a combination of the two.

Learning the motion patterns of moving objects, a key problem in the second category, is an important way to understand scenes, since motion patterns convey rich information about scene structures. In this dissertation, we first develop an unsupervised learning framework to infer motion patterns in videos, and in turn use them to assist tracking and detection of objects, especially in crowded scenes.

Based on motion features such as optical flow or tracklet input, we embed feature points into a motion feature space, and use a manifold learning method, Tensor Voting, to infer the local geometric structures. In this space, points exhibit intrinsic manifold structures, each of which corresponds to a motion pattern. To define each group, a novel robust manifold grouping algorithm is proposed. Tensor Voting is performed to provide multiple geometric cues, such as the local tangent/normal space and the dimensionality of the geometric structure that a point belongs to. Multiple similarity kernels based on the geometric cues are formulated between any pair of points, and a spectral clustering technique is used in this multiple kernel setting. A generalization of the grouping algorithm to ND space is proposed and tested on various datasets, including synthetic data, USPS digits, CMU Motion Capture data (MoCap) and real-world video sequences. It achieves better performance than state-of-the-art methods in these applications. Furthermore, an online distributed framework for motion pattern learning is proposed to deal with big data, such as wide area aerial surveillance imagery.

To understand scenes, tracking and detection of general objects are necessary and challenging tasks. In our work, we focus on solving the problem in structured scenes, i.e., scenes with clear motion patterns. The most salient characteristic of structured scenes is that objects do not move randomly, but instead follow some patterns. Extracted motion patterns convey rich information such as how the objects move and how they interact with each other. Using them as a prior, we significantly improve the performance of tracking and detection.

In video analysis, a large group of videos which are both interesting and challenging is high density crowded video scenes containing hundreds of similar objects, such as heavy traffic roads, busy shopping malls, political gatherings, sports events, etc. These scenes need special attention for the purpose of public safety and accident prevention.
However, tracking and detection in crowded scenes are very challenging due to the large number of similar objects, cluttered background, small target size, and occlusions caused by the constant interactions between objects. In our work, we investigate the single and multiple target tracking and detection problems in structured crowded scenes, and propose the Motion Structure Tracker (MST) based on motion pattern information. It is a combination of motion pattern learning, visual tracking and multi-target tracking. In MST, tracking and detection are performed jointly, and motion pattern information is integrated into both steps to enforce the scene structure constraint. Experiments are performed on challenging real-world sequences, and MST gives promising results. The comparison to several state-of-the-art methods proves its effectiveness.

Chapter 1
Introduction

This chapter provides an introduction to and definition of the problems addressed in this dissertation. Challenges and contributions are also explained in detail.

1.1 Motivation and Problem Definition

The increased deployment of surveillance systems in public areas has generated huge amounts of imagery, rich with interesting information, greatly increasing the need for automated video analysis algorithms. There are two categories of existing approaches to video analysis [76, 83]. In the first and traditional category, objects of interest are detected, tracked from frame to frame, and the tracks are analyzed to model activities. These approaches fail when object detection and/or tracking do not work well, especially in crowded scenes. The second approach avoids tracking by using motion feature vectors directly to analyze videos. However, these features are mixed together and difficult to separate in complex scenes, so they are often used on simple datasets to perform relatively simple tasks, such as detecting unusual activities. In our work, we combine the two kinds of approaches and propose a framework to analyze videos in complex scenes. On the one hand, motion feature vectors are utilized to learn motion patterns. On the other hand, motion patterns are used in turn to assist detection and tracking of objects, especially in crowded scenes. The marriage of the two creates an elegant and novel solution for video analysis.

1.1.1 Motion Pattern Learning

Learning the motion patterns of moving objects is an important way to analyze activities and understand scenes, and in recent years significant effort has been devoted to this topic in the visual surveillance community. Recent works in scene understanding have shown that high-level knowledge about the scene in the form of motion patterns is helpful for low-level detection and tracking [8, 11, 29, 69, 84], and for high-level anomaly detection and behavior prediction [11, 29, 69, 77].

A motion pattern or flow is a smooth and compact spatio-temporal structure that describes a set of neighboring objects undergoing coordinated movements; some examples are shown in Figure 1.1. For instance, Figure 1.1(a) shows an intersection. Vehicles moving east to west form a motion pattern, while vehicles from the south making a right turn form another motion pattern. Figure 1.1(d) shows a marching band. Two groups of performers form two intersecting and rotating circles, giving rise to two different motion patterns. Motion patterns serve as a geometric and statistical model of the scene, providing information on where and what is likely to happen. Using them as a prior, various tasks like tracking and activity analysis become easier.
Figure 1.1: Examples of motion patterns formed and followed by moving objects in different scenes: yellow arrows illustrate the directions of movement. (a)(b): sample images of video data from the NGSIM dataset. (c)(d): sample images of video data from YouTube.

1.1.2 Tracking and Detection in Crowded Scenes

Detecting and tracking moving objects is another way to perform activity analysis and scene understanding. In this dissertation, we investigate the problem of single and multiple target tracking, and we want to track objects in general, instead of specific ones such as pedestrians or vehicles only. Furthermore, although general detection and tracking are long-standing computer vision problems and two of the most investigated topics, their application to crowded scenes has only attracted researchers' attention and made progress in recent years.

All tracking methods require object detection results in every frame or when the objects first appear. We focus on the detection of specific targets of interest, so general strategies for detecting all the objects in a scene are not considered. Generally speaking, the suitability of a particular tracking and detection algorithm depends on the specific application, such as object appearances, number of objects, object and camera motions, and scene structures.

In the surveillance videos that we are interested in, single-target tracking (STT) and multiple-target tracking (MTT) are the two common tasks. The fundamental difference between STT and MTT is whether targets are tracked individually or jointly. In STT, a target is labeled in the first one or several frame(s), and matching is used for detection in the following frames. In MTT, targets are detected in each frame and associated across frames. To track general objects in structured crowded scenes, motion pattern information is used in both the detection and tracking stages to improve performance.

1.2 Challenges

Motion pattern learning and tracking and detection can be viewed as two means for scene understanding and activity analysis. The two are closely related, and improvement in one can greatly help the development of the other. We discuss their challenges respectively.

Motion pattern learning faces the following challenges.

• Input features: there are mainly two kinds of input features for learning motion patterns: low-level motion features and trajectories. Both have their pros and cons. Low-level motion features are usually easy to get, but less reliable. While trajectories are good starting points, they are difficult to obtain in the first place.

• Online learning of motion patterns: in real situations, the motion patterns of a scene change over time. Thus, a good system should update its motion pattern learning results online. However, distinguishing between false alarms and new patterns, and detecting the temporal change point of motion patterns, are difficult.

• Vague definition: a fundamental challenge of motion pattern learning is what exactly a motion pattern is. So far, no research has given a clear and precise definition. Thus, an empirical definition is used, making the learning difficult.

• Evaluation metric: what a good evaluation metric is remains an open question. Empirical comparison is often performed in the related literature. Even if an accurate definition is given, the labeling of ground truth is extremely labor intensive. Another, indirect way of evaluation is through the applications of motion patterns to different tasks, such as tracking and unusual activity detection.
• Big data: One kind of imagery that catches our special attention is wide area aerial surveillance imagery, as shown in Figure 1.2. It often covers a region of several square kilometers, which means the number of vehicles is in the thousands. The limited appearance information on every target creates much ambiguity in tracking. Therefore, motion pattern learning is crucial in wide area scenes, so that this additional information can be used to remove false alarms and reduce tracking error. This raises the challenge of learning motion patterns effectively and efficiently for big input data. In order to achieve real-time processing at this scale, the task needs to be parallelized and divided among several machines.

Tracking and detection face the following challenges.

• Challenges in general scenes [83]: loss of information caused by the projection of the 3D world onto a 2D image, noise in images, complex and abrupt object motion, the nonrigid or articulated nature of objects, scene illumination changes, and real-time processing requirements.

• Challenges specific to very crowded scenes: as shown in Figure 1.3, the size of a target, i.e. the number of pixels on a target, is usually small, making it difficult to train an object detector with sufficient accuracy. There is a large number of similar objects in the scene; therefore, the tracker is very likely to drift to other objects. Also, partial and full object occlusions take place often due to the constant interaction among individuals. This causes missed or false detections, and may further lead to wrong associations.

Figure 1.2: An example of Wide Area Aerial Surveillance (WAAS) imagery.

Figure 1.3: Some examples of very crowded scenes of sports, political gatherings, shopping malls and crowded streets.

1.3 Contributions

We propose an integrated framework for video analysis that combines the two categories of approaches: one based on analyzing the trajectories of objects of interest, and the other examining motion feature vectors directly.

The first key contribution of our work is a novel framework for learning motion patterns. In the framework, we propose a robust multiple manifold grouping algorithm that makes full use of local geometric information. Compared to other state-of-the-art candidate algorithms, which focus on linear subspace segmentation, the new method proves good at grouping motion patterns, which are nonlinear in many cases.

Moreover, we propose an online distributed implementation framework for the motion pattern learning algorithm to efficiently process large scale data, such as wide area aerial surveillance imagery. Spatially, the region is divided into overlapping tiles that are processed independently. Temporally, a temporally incremental framework for Tensor Voting is first proposed and integrated into the motion pattern learning framework to realize online processing. Then, Tensor Voting is accelerated by a Graphics Processing Unit (GPU) implementation.

The second key contribution is the Motion Structure Tracker, proposed to perform tracking and detection in very crowded scenes. MST has several advantages compared to existing methods:

- Better tag and track: to track a single object in very crowded scenes, MST combines tracking and detection, in both of which motion pattern information is used as prior knowledge, and it outperforms state-of-the-art trackers.

- Improved multi-target tracking: although it is almost impossible to track all objects in very crowded scenes, we partially address this by proposing a simplified version of MTT and solving it with MST.

- Online: the proposed algorithm can sequentially process both temporally stationary and non-stationary scenes, infer motion patterns, and use them in both single and multiple object tracking.
1.4 Outline

This dissertation is organized as follows: a literature review of related work is given in Chapter 2. The motion pattern learning framework is proposed in Chapter 3. In Chapter 4, we present the Motion Structure Tracker to perform tracking and detection in very crowded scenes. In Chapter 5, we explain how to handle large scale data in an online distributed fashion. A summary of our work is given in Chapter 6.

Chapter 2
Related Work

Crucial to video surveillance is automatic and efficient modeling and analysis of the scene. Motion patterns offer to solve this problem by processing the imagery into a compact form that is more useful and understandable to human operators. Another fundamental way to summarize imagery is to estimate the tracks of moving objects. These tracks of single or multiple targets tell us a lot about the "life" in the region of interest, and are essential cues for higher level reasoning tasks, such as activity recognition. Tracking and detection in crowded public scenes are very challenging; therefore, motion patterns are incorporated to assist these tasks. In this chapter, we give a literature review of several aspects related to our work: (1) motion patterns, (2) visual tracking, (3) multiple target tracking, and (4) tracking in crowded scenes.

2.1 Motion Pattern

We summarize the work on motion pattern learning from two perspectives, one based on features and the other based on applications.

2.1.1 Features

Pioneering works on motion pattern extraction can be roughly classified by the features they use. One major kind of method uses low-level motion features [27, 44, 68, 84] as input. For instance, [28] constructed a directed neighborhood graph to measure the closeness of flow vectors, and clustered optical flows with the graph to learn motion patterns. [68] proposed a mixture model representation of salient patterns of optical flow, and presented an algorithm for learning these patterns from dense optical flow in a hierarchical, unsupervised fashion. [84] focused on airborne videos, analyzed the essential difference in motion patterns caused by parallax and independently moving objects, and proposed a method to segment motion patterns created by moving vehicles in stabilized airborne videos. The flows are used in turn to facilitate detection and tracking to handle problems such as parallax, noisy background modeling and long term occlusions. Based on the Lagrangian framework for fluid dynamics, [44] presented a streakline representation of flow. Streaklines are used in a similar way to transport information about a scene; they are obtained by repeatedly initializing a fixed grid of particles at each frame, then moving both current and past particles using optical flow.

[38] presented a new method for modeling dynamic visual phenomena. A common underlying geometric transform process is used to capture the integral motion of constituent elements in a dynamic scene, and a Lie algebraic representation of the transform process is introduced to map the transformation group to a vector space. As an extension of [38], [39] introduced the concept of geometric flow, which describes motion simultaneously over space and time, and derived a vector space representation based on Lie algebra. It was then used to model complex motion by combining multiple flows in a geometrically consistent manner.

Another category uses trajectories obtained by accurate detection and tracking of moving objects [32, 29, 77, 71]. Generally speaking, these methods cluster trajectories to learn motion patterns.
For example, one of the earliest works [32] used vector quantization to model the probability density functions of the trajectory distribution, then used the pdf to do event recognition. [43] iteratively merged trajectories to extract frequently used pedestrian pathways. [71] used a Gaussian Mixture Model to detect foreground moving blobs, and categorized the associated trajectories according to basic cues such as sizes, positions and velocities. [77] introduced two similarity measures for comparing trajectories obtained as in [71], clustered trajectories using both similarity and comparison confidence, and segmented a scene into semantic regions to build semantic scene models.

Both categories of methods have their pros and cons. Systems built on motion flow fields have broad applications, for example in extremely crowded scenes, where other features are difficult to get. However, simple and fast optical flow methods cannot generate reliable results as a solid basis, while the speed requirements of more sophisticated ones make them infeasible [84] in real-time surveillance systems. On the other hand, in many cases multi-target tracking is a very challenging task itself, making it difficult to get reasonable trajectories as starting points. Thus, we sit between the two, and use local moving object association results (tracklets) as features instead of global association results (trajectories).

2.1.2 Applications

From the viewpoint of the applications of motion patterns, works can be roughly divided into segmentation [5, 44, 76], tracking [6, 84], anomaly detection [29, 76, 77] and person detection and tracking [63]. In segmentation, [5] used Lagrangian particle dynamics for crowd flow segmentation and flow instability detection. [44] segments every frame of a video into regions of different motions based on the similarity of neighboring streamlines, and [76] obtains motion patterns in a way that segments video sequences based on the different types of interaction occurring, detecting activities both temporally and spatially.

Pure multi-target tracking methods are object-centric and do not exploit any high-level or global knowledge that may aid tracking. This is one of the major differences between algorithms that track with flow knowledge and these approaches. In both [6, 84] and our work, high level constraints resulting from scene structure, in the form of motion patterns, are integrated into the tracking algorithm. Specifically, [84] focused on airborne videos. Stabilization is performed first, and the pixels that do not satisfy the global motion model are classified as residual pixels. The residual pixels consist of noise, independent motion, and parallax, but those caused by noise do not form a consistent motion pattern and thus can be filtered out by examining their dimensionalities. In [6], motion patterns are used to assist tracking in very crowded scenes; it will be reviewed in detail later.

Anomaly detection is an interesting and important aspect of video surveillance. Anomaly is a vague notion, and it usually refers to events that rarely happen. For instance, the movements of vehicles following traffic rules at an intersection are considered normal if they are repeated by different vehicles again and again. However, after a long period of observation, if we detect a car making a U-turn that was never observed before, it is very likely that the car is breaking a traffic rule that prohibits such a U-turn. In surveillance, the rare events are often of interest and need attention from human operators. [29] calculated the probabilities of matching between observed behaviors and the learned motion patterns, and then calculated the probability of abnormality of the observed behaviors.
[29] calculated the probabilities of the matching 13 betweenobservedbehaviorsandthelearnedmotionpatternsandthencalculatedtheprobability values of abnormality of the observed behaviors. [76] utilized hierarchical Bayesian models, under which abnormality detection has a probabilistic explanation since every video clip and motion can be assigned a marginal likelihood, rather than by comparing similarities between samples. 2.2 VisualTracking Visual tracking addresses the problem of tracking a specific object labeled in the first frame or first few frames by a user, and it faces several challenges, such as abrupt motion, clut- tered background and object leaving the field of view (FoV). In recent years, region-based tracking approaches are very popular, most of them try to search for the best match frame by frame [23, 9, 45, 34, 18]. Some methods simply assume an area where the target is ex- pected[23,9],whileothers[45]trytopredictthestateusingmethodssuchasparticlefilter. The appearancemodelplaysanessentialrolewhenthetrackingregionhasreasonablesize,especially onlineappearance modeling [66,23,9]. IVTtracker[66]istheearliestworkthatincrementally learnsarepresentation,efficientlyadaptsonlinetochangesintheappearanceofthetargetbyup- datingalowdimensionalsubspacerepresentationofthetarget. TheincrementalPCAalgorithm updates the eigenbasis and the mean when one or more training data are given. However, the potentialproblemforonlineadaptionis: eachupdateofthetrackermaybringinsomeerror,the accumulationofwhichmayleadtotrackingfailure. Tosolvetheproblem,[23]usedSemiBoost, the idea from semi-supervised learning. Labeled data is used as a prior and the data collected 14 duringtrackingasunlabeledsamples. Sincelabeledexamplesarefromthefirstframeonly,this method throws away a lot of useful information. [9] used online Multiple Instance Learning (MIL) algorithm to improve appearance model. MIL means during training, examples are pre- sented in bags, and labels are provided for the bags instead of for each individual instance. A bagislabeledpositiveifitcontainsatleastonepositiveinstance,andnegativeotherwise. Inthe context of object detection, a positive bag could contain a few possible bounding boxes around each labeled object. Thus, the learning problem becomes easier because the learner is allowed someflexibilityinfindingadecisionboundary. [33]modelthetarget’sappearanceandbuildadetectorontheflybyrandomforestclassifier. The object is tracked by an adaptive tracker KLT, and meanwhile, the appearance is modeled by continuously growing tracked examples and pruning those wrong ones (then adding in only correct examples). While these algorithms are all based on a binary classifier to distinguish the target from background in every frame, [34] shows that the performance of a binary classifier can be significantly improved by P-N learning. P-N learning evaluates the classifier on the unlabeled data, which are positive (P) and negative (N) structures, identifies examples that have been classified in contradiction with structural constraints and augments the training set with the corrected samples in an iterative process. P-N learning is used because of the unbalance between the small number of labeled examples and the large number of structured unlabeled samples in visual tracking. The learning strategy is, use the labeled examples to train an initial classifier, label the unlabeled data by the classifier and find out the examples in contradiction withthestructuralconstraints. 
However, a potential problem with online adaptation is that each update of the tracker may introduce some error, and the accumulation of these errors may lead to tracking failure. To solve this problem, [23] used SemiBoost, an idea from semi-supervised learning. Labeled data is used as a prior, and the data collected during tracking serves as unlabeled samples. Since the labeled examples come from the first frame only, this method throws away a lot of useful information. [9] used an online Multiple Instance Learning (MIL) algorithm to improve the appearance model. MIL means that during training, examples are presented in bags, and labels are provided for the bags instead of for each individual instance. A bag is labeled positive if it contains at least one positive instance, and negative otherwise. In the context of object detection, a positive bag could contain a few possible bounding boxes around each labeled object. Thus, the learning problem becomes easier because the learner is allowed some flexibility in finding a decision boundary.

[33] models the target's appearance and builds a detector on the fly with a random forest classifier. The object is tracked by an adaptive KLT tracker, and meanwhile the appearance is modeled by continuously growing the set of tracked examples and pruning the wrong ones (then adding in only correct examples). While these algorithms are all based on a binary classifier to distinguish the target from the background in every frame, [34] shows that the performance of a binary classifier can be significantly improved by P-N learning. P-N learning evaluates the classifier on the unlabeled data, which contains positive (P) and negative (N) structures, identifies examples that have been classified in contradiction with structural constraints, and augments the training set with the corrected samples in an iterative process. P-N learning is used because of the imbalance between the small number of labeled examples and the large number of structured unlabeled samples in visual tracking. The learning strategy is: use the labeled examples to train an initial classifier, label the unlabeled data with the classifier, and find the examples that contradict the structural constraints. Then correct these labels, add them to the training set and retrain the classifier. By inheriting from [33] and incorporating P-N learning, the P-N tracker [34] obtained excellent results on many datasets.

However, context information, such as other moving objects/regions interacting with the target, is usually overlooked in visual tracking algorithms. The Context Tracker [18] proposes to exploit context information by introducing distracters and supporters. Distracters are regions that have a similar appearance to the target, and supporters are local key-points around the object having motion correlation with the target over a short time span. The target and distracters are detected using shared sequential randomized ferns [55], and represented by individual evolving templates. Meanwhile, the supporters are represented as keypoints, and described using descriptors of the local region around them. Using these context elements prevents the tracker from drifting to another similar object in a cluttered background, and increases the detection of the right target when it leaves the FoV or is occluded and then reappears.

Single object tracking in a structured crowded scene is similar to visual tracking, but some differences exist. 1) Due to the high density constraint, objects in crowded scenes move smoothly, and abrupt motion is rare. 2) The size of a target in crowded scenes is small, so advanced appearance model based matching is not very helpful. 3) The scale of targets is relatively stable in crowded scenes. 4) Many objects look similar to the target, making the tracker vulnerable to switching to other similar objects, even those far apart from the target. 5) Although very crowded scenes may have cluttered backgrounds, they rarely suffer from big changes. Obviously, crowded scenes bring in many new challenges, but fortunately some of their characteristics contribute to solving the difficult problems of tracking and detection in crowded scenes. For example, due to the smooth movement constraint, a search area can be assumed as in [10, 24]. Thus, we solve single object tracking in crowded scenes based on the techniques of state-of-the-art visual trackers, with special constraints from the application.

2.3 Multiple Target Tracking

2.3.1 MTT in general environments

Multiple target tracking (MTT) is a well studied problem that has received considerable attention [7, 31, 57, 86]. Besides the same issues encountered in classic tracking (occlusion, appearance change, etc.), it deals with additional challenges, e.g., an unknown number of targets and the interactions among them. To handle these, the interaction between multiple objects (e.g., one object partially or completely occluded by other objects) and between objects and the background (e.g., a moving object occluded by a building) should be taken into consideration. Thus, joint optimization of data association is the key, and progress has been made.

Current state-of-the-art multi-target tracking methods are detection based [12, 57, 61, 31, 64, 37, 15], which means that target detections are first determined over a sliding window of frames, and these detections are then associated into tracks. The detections are determined in one of two ways: from background subtraction, or using an object detector.

[79] represented humans as an assembly of four body parts, whose detectors are run in every single frame. The responses of the body part detectors and a combined human detector provide the initial observations for tracking, and the data association decision is made in a small time window. To better handle missed or false detections, and the ambiguities caused by longer-term occlusions, [86] considered more frames. They proposed a network flow based optimization method for data association. The maximum-a-posteriori (MAP) data association problem is mapped into a cost-flow network with a non-overlap constraint on trajectories, and the optimal data association is found by a min-cost flow algorithm in the network.
Recently, state-of-the-art algorithms for MTT solve the data association problem in a hierarchical way [42, 31]. Short tracks, or tracklets, are obtained first, and then gradually linked into longer tracks. For instance, [31] generated reliable tracklets using conservative affinity constraints. These tracklets are then merged into longer ones based on more complex affinity measures, where the association is formulated as a MAP problem and solved by the Hungarian algorithm. Finally, these tracklets are used to estimate scene structures, such as entries, exits and occluders, which are used in turn to refine the final trajectories. One disadvantage here is that nearest neighbor association is heuristic, and prone to errors when objects are close together. To overcome these problems, [57] proposed an algorithm to infer tracklets. It is a MAP estimate instead of a heuristic; it neither requires an exhaustive evaluation of data association hypotheses, nor assumes a one-to-one mapping between observations and objects. They formulated the association problem as inference in a set of Bayesian networks, and used consistency of motion and appearance as the driving force.

In video surveillance, people may want to track general moving targets, such as pedestrians, vehicles, or both, instead of specific objects. Also, the target size is usually small, making it difficult to train an object detector with sufficient accuracy. For these reasons, background subtraction has been the common way of obtaining target detections [57, 61]. [57] modeled the background as the mode of a sliding window of frames, while [61] used simple median image filtering.

Despite the progress on the general MTT problem, MTT in crowded scenes has received little attention, and is seldom successful. Detecting objects in each frame is a challenge, let alone associating them. [64] addresses the problem of person detection and tracking in crowded scenes by exploring constraints imposed by crowd density to localize individual people. That is a major improvement of MTT in crowded scenes, but it is not for general objects (it targets human heads specifically), and it requires a large training dataset. In our work, we look at the problem from another viewpoint, and make an attempt to solve a simpler version of MTT: once a user labels a target in the first frame, we try to find similar objects and track all of them.

2.3.2 MTT in Wide Area Aerial Surveillance

The increased deployment of aerial sensors for wide area surveillance has generated huge amounts of imagery. This imagery often covers a geographic area of a few square kilometers. One of the fundamental ways that motion imagery is summarized is by estimating the tracks of moving objects [60]. Compared to MTT in general environments, the domain of wide area surveillance brings its own set of challenges: multiple-sensor video capture, which requires accurate registration of the different sensors; low sampling rate, which undermines the common motion smoothness assumption; and limited grayscale resolution of moving targets, which prevents the use of robust and discriminative appearance models.

Trackers for WAAS imagery have been proposed recently [61, 81]. The approach of [81] is detection based, where detections are associated using the classic Hungarian algorithm. The association cost matrix is computed by combining a target association cost matrix and a novel target-pair association cost matrix, which allows the inclusion of spatial constraints between neighboring targets. These spatial constraints formalize the idea that a pair of vehicles in the current frame is more likely to match a pair of vehicles in the next frame when the distance and speed difference between the vehicles is preserved. However, this assumption does not always hold, and applying it uniformly to all targets may lead to inappropriate associations. The appearance model in this work is template based, which may have problems with lighting changes and varying backgrounds. Tracks are initialized using three-frame subtraction, but otherwise background subtraction is used for detection.
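As a concrete illustration of this style of detection association (a toy sketch, not the cost design of [81] or [61]), the Hungarian algorithm solves the one-to-one assignment between detections in consecutive frames once a cost matrix is given; the squared-distance cost and the gating threshold below are arbitrary example values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

prev = np.array([[10.0, 12.0], [40.0, 41.0], [70.0, 9.0]])   # detections at frame t-1
curr = np.array([[11.5, 12.5], [69.0, 10.0], [41.0, 43.0]])  # detections at frame t

# Toy association cost: squared Euclidean distance between detection centers.
cost = ((prev[:, None, :] - curr[None, :, :]) ** 2).sum(axis=-1)
rows, cols = linear_sum_assignment(cost)      # optimal one-to-one assignment

for r, c in zip(rows, cols):
    if cost[r, c] < 25.0:                     # gate out implausible matches
        print(f"track {r} -> detection {c}")
```

Real systems replace the toy cost with combined appearance, motion and context terms, which is exactly where the methods above differ.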
[61] also uses the Hungarian algorithm for the association of detections, but proposes to increase its efficiency by dividing the image into cells and computing the associations within each cell. The matching cost between targets takes into account spatial proximity, velocity orientation, orientation of the road, and local context between cars. The use of local context only helps on freeways, where the relative position between targets does not change much over time. On most city roads, where cars change lanes, pass, stop, and turn, this kind of constraint is less helpful and may cause incorrect associations.

However, these trackers [61, 81] have not been shown to work in a distributed environment and to process large-scale imagery at real-time speeds. To solve the big data problem in WAAS imagery, [35] proposed a distributed architecture for real-time tracking of vehicles, and adopted a classic multiple hypothesis tracker for tracking.

2.4 Detection and Tracking in Crowded Scenes

Detection in crowded scenes has caught a lot of attention in recent years, especially the detection of pedestrians. [88] is one of the first algorithms proposed for tracking in crowded environments with significant and persistent occlusion, making use of human shape models in addition to camera models. It assumed that humans walk on a plane and acquired appearance models. This method is based on the segmentation in [87]. [87] posed the human segmentation problem as a model-based segmentation problem in which human shape models are used to interpret the foreground in a Bayesian framework. They model the human shape with four ellipsoids corresponding to the head, torso and two legs. The Bayesian framework is solved by a Markov chain Monte Carlo (MCMC) method, and domain knowledge, including head candidates computed from foreground boundaries, head candidates computed from intensity edges, and analysis of the foreground residue map, is integrated in one theoretically sound framework. However, this method does not work when whole human bodies are invisible. To solve this problem, [78] modeled an individual human as an assembly of natural body parts. Part detectors based on edgelet features are learned by a boosting method. The responses of the part detectors are combined to form a joint likelihood model, and the human detection problem is formulated as MAP estimation. This detection method can handle partially occluded humans in a single image. However, it does not work well in crowded scenes where the number of pixels on each human is very small.

Most recent related works on tracking in crowded scenes directly or indirectly use motion pattern information, and all of them deal with tracking an object specified in the first frame. The seminal work on tracking in structured high density crowded scenes [6] proposed a scene structure based force model.
In this model, the key components are the static floor field (SFF), the dynamic floor field (DFF) and the boundary floor field (BFF), which are used to determine the probability of moving from one location to another by converting the long-range forces into local ones. Specifically, the SFF specifies important regions of the scene, such as an entry and an exit. The DFF specifies the dynamic behavior of the crowd around the target being tracked. And the BFF specifies the influence of barriers in the scene, such as walls and restricted areas. Then, the probability of an object moving from one position to another is estimated by the similarity measure between the initial appearance template and the new appearance template of the target computed at the other location, together with the influences from the SFF, DFF and BFF.

[36] presented a framework that trains a collection of Hidden Markov Models to capture the spatial and temporal motion patterns describing pedestrian movement at each space-time location. To model the motion in local areas, the video is subdivided into small spatio-temporal sub-volumes. A 3D Gaussian is used to represent the object motion in each cuboid, and an HMM is trained on the motion patterns at each spatial location of the video. Using these HMMs, the motion patterns a subject will exhibit are predicted to hypothesize the subject's movement based on the crowd's motion.

There is also work [62] trying to solve tracking in unstructured crowded scenes. [62] defined unstructured scenes as scenes where the motion of a crowd appears random, with different participants moving in different directions over time. They proposed to model the various crowd behavior modalities at different locations of the scene with a Correlated Topic Model (CTM) adopted from the text processing literature. In the CTM, words correspond to motion features, i.e., quantized optical flow vectors and locations, and topics correspond to crowd behaviors. The notion of a document is created by dividing each video sequence into short clips (documents), and each clip is associated with a set of crowd behaviors. A variational expectation maximization algorithm is adopted for parameter estimation in the CTM. In the tracker, the next tracking position is estimated as a weighted mean of the next observation and the tracker prediction, incorporating the prior information from crowd behaviors.

While most existing works learn motion patterns from the specific surveillance video itself, [63] proposed the novel idea of first learning a set of crowd behavior priors off-line from a large database of crowd videos gathered from the Internet, then matching crowd patches at test time to the database to get priors. This addresses the limitation that strong priors learned from a single video sequence are not useful for modeling rare events which do not comply with the typical motion of the crowd.

[64] is the first work attempting the problem of multiple person detection and tracking in crowded video scenes. Specifically, they detect and track heads. People detection is formulated as the minimization of a joint energy function incorporating confidence scores of a person detector for each location in an image, pairwise constraints ensuring that only valid configurations of non-overlapping detections are selected, and the estimated person density. To associate person detections, a Kanade-Lucas-Tomasi tracker is used to track points on every detection, and head detections are merged gradually. However, the detector is based on part-based models, and it does not work well when the number of pixels on a head is small, which is the usual case in crowded video scenes.
Chapter 3
Motion Pattern Learning

3.1 Introduction

In this chapter, we propose a novel unsupervised motion pattern inference method for static camera scenes. There are two main kinds of input features for learning motion patterns: low-level motion features and trajectories. Both have their pros and cons. Low-level motion features are usually easy to get, but less reliable. While trajectories are good starting points, they are difficult to obtain in the first place. We sit in the middle of the two, and use tracklets as input.

We mainly deal with far-field surveillance videos, which are often of low quality, and in which the size of moving objects is relatively small. The low resolution makes it difficult, if not impossible, for appearance based detectors to work in complex scenes. Therefore, tracklets can be learned in two ways. The first choice is blob association: we learn motion blobs corresponding to foreground moving objects by background modeling, and then perform local association to obtain tracklets. The second choice is using key point tracking results. We decide which method to use depending on the videos we are dealing with.

By applying Tensor Voting [51] to these tracklets, we get a refined motion direction for each tracklet point, and then embed the 2D tracklet point position information in (x, y) space into the (x, y, v_x, v_y) motion feature space, where (v_x, v_y) represents the velocities along the x and y axes respectively. When motion patterns exist, points form intrinsic manifold structures in this space. To segment these structures, a novel robust manifold grouping algorithm is proposed. It explicitly handles outliers and uses local geometric information such as the normal/tangent space and the dimensionality of the local structure that each point belongs to. In addition, kernel density estimation is performed to propagate the grouping information from tracklet points to all pixels, thus defining dense motion patterns.

With flow information as a prior, tracking becomes easier. Each track on the image lattice can be seen as a random walk where the prior probabilities of transitions at each state are given by the motion pattern estimates. These prior probabilities help remove false associations, thus greatly decreasing false alarms. Since this is a general approach, it can be integrated into any tracking algorithm. The embedding step is sketched below.
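The following is a minimal sketch of the embedding just described, under simplifying assumptions: a tracklet is given as an array of (x, y) positions at consecutive frames, and velocities come from finite differences (whereas in our framework the motion directions are first refined by Tensor Voting). The helper name embed_tracklet is illustrative.

```python
import numpy as np

def embed_tracklet(track_xy):
    """Map a tracklet in (x, y) image space to points in (x, y, vx, vy)."""
    v = np.gradient(track_xy, axis=0)   # finite-difference velocity per point
    return np.hstack([track_xy, v])     # N x 4 points in motion feature space

tracklet = np.array([[0.0, 0.0], [1.0, 0.2], [2.1, 0.3], [3.0, 0.5]])
print(embed_tracklet(tracklet).shape)   # (4, 4): input samples for grouping
```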
T = \sum_{i=1}^{N} \lambda_i e_i e_i^T = \sum_{i=1}^{N-1} (\lambda_i - \lambda_{i+1}) \sum_{k=1}^{i} e_k e_k^T + \lambda_N \sum_{i=1}^{N} e_i e_i^T   (3.1)

where {\lambda_i} are the eigenvalues in descending order, and {e_i} are the corresponding eigenvectors. An example of a tensor in 2D space is shown in Figure 3.3. This way, local geometric information such as the dimensionality and the normal/tangent space at every point can be derived by examining the eigensystem of the corresponding tensor.

Intrinsic dimensionality. The largest gap between two consecutive eigenvalues \lambda_i - \lambda_{i+1} indicates the dimensionality d of the local structure that the point belongs to:

d = N - \arg\max_i (\lambda_i - \lambda_{i+1})   (3.2)

Local normal and tangent space. A compact representation of the local manifold structure at the point corresponding to T is a d-dimensional normal space spanned by {e_1, ..., e_d} and an (N-d)-dimensional tangent space spanned by {e_{d+1}, ..., e_N}.

Suppose we have a set of points P_i (i = 1, 2, ..., N) in ND space. Initially, we encode every data point as an identity matrix, indicating no orientation preference, since we have no knowledge of their local structure at the beginning. In the voting process, each point P_i propagates its information to its neighbors, and meanwhile collects information from them. The vote from a voter to a receiver depends on the tensor of the voter, and on the orientation and the distance between the two. After the voting process, the sum of all the votes from neighbors becomes the new tensor of a point, and by analyzing it as in Eq. 3.1, we can obtain the point's geometric properties. An illustration of the local geometric structure for an input point is shown in Figure 3.4. More details can be found in [51, 50, 80].

Figure 3.4: An illustration of the local geometric structure for an input point, i.e. the local structure that the point belongs to, tangent space, normal space, and saliency.
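To make the eigensystem analysis of Eqs. 3.1 and 3.2 concrete, the following minimal Python/NumPy sketch reads the local geometry off an already-accumulated tensor. It illustrates only the analysis step, not the voting process itself; the tensor T is assumed to have been obtained by summing votes as described above.

    import numpy as np

    def analyze_tensor(T):
        """Read intrinsic dimensionality and normal/tangent spaces
        off the eigensystem of a voted tensor (Eqs. 3.1-3.2)."""
        # eigh returns eigenvalues in ascending order; flip to descending
        lam, E = np.linalg.eigh(T)
        lam, E = lam[::-1], E[:, ::-1]
        N = T.shape[0]
        # index of the largest gap between consecutive eigenvalues
        g = int(np.argmax(lam[:-1] - lam[1:])) + 1   # 1-based, as in Eq. 3.2
        d = N - g                                    # intrinsic dimensionality
        normal = E[:, :g]       # leading eigenvectors span the normal space
        tangent = E[:, g:]      # remaining eigenvectors span the tangent space
        saliency = lam[g - 1] - lam[g]               # strength of the structure
        return d, normal, tangent, saliency

    # Example: a noisy "stick" tensor for a curve (1D structure) in 2D
    T = np.diag([1.0, 0.1])
    d, normal, tangent, sal = analyze_tensor(T)      # d = 1, saliency = 0.9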
3.3 Robust Non-linear Manifold Grouping

As an important step in motion pattern learning, automatic and robust grouping is needed. State-of-the-art subspace grouping methods like [19, 75, 82] achieve good performance on multiple linear subspace segmentation problems. However, motion patterns in reality often have non-linear shapes (like a right turn), which creates difficulties for these methods, mainly because they often assume the intrinsic manifolds have linear structures. Furthermore, the robustness issue is not explicitly considered in most of these frameworks. Driven by these factors, we propose a novel Robust Non-linear Manifold Grouping (RNMG) method, which explicitly considers outliers and can handle multiple non-linear intrinsic subspaces with different dimensionalities. It has proven effective in our applications. It is worth noting that we do not claim to have a better algorithm than [19, 75, 82], which achieve the best results on multiple linear subspace grouping and motion segmentation tasks; rather, our algorithm focuses on multiple non-linear manifold grouping in 3D space. A more generalized manifold clustering algorithm which works in ND space is proposed in [22].

One of the most popular clustering methods is spectral clustering, which has a solid theoretical foundation (graph spectral theory) and an elegant computational framework (eigen-decomposition). However, as pointed out by [91], pairwise similarity is not enough for manifold learning, i.e., higher-order relationship information between point sets is missing. Inspired by [91], we present a novel way to construct the similarity graph in a multiple kernel setting, which includes higher-order point set information.

3.3.1 Multiple Kernel Similarity Graph

After estimating the local geometric structures on manifolds, we construct multiple kernels.

Distance kernel. This is widely used in graph spectral clustering and defined as

w_{dis}(x_i, x_j) = \exp(-dis(x_i, x_j)^2 / \sigma_{dis}^2)   (3.3)

dis(x_i, x_j) can be set to the simple L2 distance in Euclidean space, or to the geodesic distance inspired by the idea of ISOMAP. One may argue that the geodesic distance is a better measure here. However, there are usually multiple manifold structures in the scene, so the geodesic distance between two points is difficult to define without knowing whether they belong to the same structure, which is exactly what we aim to determine with the grouping algorithm. Another potential problem for the geodesic distance is the shortcut issue.

Normal space kernel. The simple intuition is that if two points are on the same manifold, in particular on the same local manifold, then their normal spaces should be similar. Although two points far away from each other on the same manifold may have a large principal angle between their normal spaces, we still use normal space similarity to build our kernel. The main reason is that a motion pattern usually has relatively low curvature, i.e., it changes smoothly and slowly as spatial distance increases. Thus we have

w_{nor}(x_i, x_j) = \exp(-\sin(\theta(x_i, x_j))^2 / \sigma_{nor}^2)   (3.4)

where \theta(x_i, x_j) measures the principal angle between the normal space E_i of x_i and E_j of x_j. Specifically, for E_i = {e_1, ..., e_{d_i}} and E_j = {e'_1, ..., e'_{d_j}} (assume d_i \le d_j), we have

\theta(x_i, x_j) = \arccos\left( \frac{\sum_{k=1}^{d_i} \langle e_k, E_j \rangle}{d_i} \right)   (3.5)

\langle e_k, E_j \rangle measures the inner product between the vector e_k and the normal space E_j, which equals \sum_{l=1}^{d_j} \langle e_k, e'_l \rangle.

Intrinsic dimensionality kernel. Tensor Voting provides reliable intrinsic dimensionality estimates, in particular when a sufficient number of samples on the manifolds is available [51]. This fact is helpful for constructing the similarity graph, since a single motion pattern usually has a unique intrinsic dimensionality. Thus

w_{dim}(x_i, x_j) = \exp(-(d_i - d_j)^2 / \sigma_{dim}^2)   (3.6)

Multiple kernel association. There are multiple ways to associate these kernels, but which is best is an open problem in machine learning. In the current setting, we use the simple product of the three and find it effective:

w(x_i, x_j) = w_{dis}(x_i, x_j) \, w_{nor}(x_i, x_j) \, w_{dim}(x_i, x_j)   (3.7)

Then a similarity matrix W = [w_{ij}] = [w(x_i, x_j)] is obtained. Smaller angles between two normal spaces, smaller dimensionality differences and smaller Euclidean distances all lead to a larger weight, or larger similarity score. A visualization of combining the three kernels is shown in Figure 3.5.

Figure 3.5: A visualization of combining the three kernels, i.e. the dimensionality kernel, the normal space kernel, and the distance kernel, to get the new similarity matrix.

3.3.2 Graph Spectral Grouping

Once we have the N x N similarity graph, a standard spectral clustering technique can be applied. Specifically, the unnormalized Laplacian matrix is L = D - W, where D is a diagonal matrix whose elements equal the sums of the corresponding rows of W. Afterwards, we select the C (the number of groups) eigenvectors corresponding to the C smallest eigenvalues. Finally, the K-means algorithm is applied to those eigenvectors to get the grouping results.
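The pipeline of Eqs. 3.3-3.7 followed by unnormalized spectral clustering can be summarized in a short Python/NumPy sketch. This is a simplified illustration under the assumption that per-point normal-space bases and dimensionality estimates are already available from Tensor Voting; the default bandwidths follow the values fixed in Sec. 3.6.1.2.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def similarity(X, normals, dims, s_dis=80.0, s_nor=0.5, s_dim=0.75):
        """Combined multi-kernel similarity matrix (Eqs. 3.3-3.7).
        X: (n, D) points; normals: list of (D, d_i) orthonormal bases;
        dims: (n,) intrinsic dimensionality estimates."""
        n = len(X)
        W = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                w_dis = np.exp(-np.sum((X[i] - X[j]) ** 2) / s_dis ** 2)
                # principal angle between normal spaces (Eq. 3.5);
                # clip into arccos' domain for numerical safety
                if normals[i].shape[1] <= normals[j].shape[1]:
                    Ei, Ej = normals[i], normals[j]
                else:
                    Ei, Ej = normals[j], normals[i]
                c = np.clip(np.sum(Ei.T @ Ej) / Ei.shape[1], -1.0, 1.0)
                w_nor = np.exp(-np.sin(np.arccos(c)) ** 2 / s_nor ** 2)
                w_dim = np.exp(-(dims[i] - dims[j]) ** 2 / s_dim ** 2)
                W[i, j] = w_dis * w_nor * w_dim
        return W

    def spectral_group(W, C):
        """Unnormalized spectral clustering (Sec. 3.3.2)."""
        L = np.diag(W.sum(axis=1)) - W
        lam, U = np.linalg.eigh(L)            # ascending eigenvalues
        _, labels = kmeans2(U[:, :C], C, minit='++', seed=0)
        return labels

The double loop is O(n^2) and is written for clarity; a vectorized version would be used for the point counts reported in the experiments.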
3.3.3 Outlier Rejection

Outlier rejection is an extremely important issue for robust grouping. Previous state-of-the-art approaches can handle outliers to some degree, such as fewer than 5%. But we found from experiments that it is possible to have more than 10% outliers in many cases due to tracking errors (false alarms, wrong associations, etc.), which makes it necessary to explicitly handle outliers in our algorithm.

In our experiments, 3D Tensor Voting is thus performed to get 3 eigenvalues {\lambda_i}, i = 1, 2, 3, in descending order. A point is considered an outlier if all of its corresponding eigenvalues are small. In practice, we use \lambda_1 as the inlier degree measure. Intuitively, \lambda_1 can be viewed as the sum of the probabilities for all manifolds (a point's local geometric structures) with different intrinsic dimensionalities. By ranking all points according to their \lambda_1, outliers can be filtered out.

If the percentage p of outliers is known, we can rank all the points in descending order of their {\lambda_1}, and the bottom p% of points are rejected as outliers. Otherwise, we use the median of all the {\lambda_1} as a criterion to select outliers with a thresholding strategy. This filtering process is similar in spirit to the diffusion process [16]. Conceptually, ranking all the data points according to \lambda_1 is similar to ranking them according to their stationary probabilities, which are calculated from the random walk model built on the data graph [92]. The difference is that in the diffusion process the weights between points are calculated from a pre-defined kernel, e.g., a Gaussian kernel, whereas in our method they are calculated from the Tensor Voting process, which is more robust to outliers.
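A minimal sketch of this filtering rule, assuming the \lambda_1 value of every embedded point has already been computed by Tensor Voting. The 0.02 x median threshold used in our experiments (Sec. 3.6.1.2) appears here as a default.

    import numpy as np

    def reject_outliers(lam1, p=None, thresh_ratio=0.02):
        """Return a boolean inlier mask from per-point lambda_1 values.
        If the outlier percentage p is known, drop the bottom p%;
        otherwise threshold against a fraction of the median (Sec. 3.3.3)."""
        lam1 = np.asarray(lam1, dtype=float)
        if p is not None:
            cutoff = np.percentile(lam1, p)       # bottom p% are outliers
        else:
            cutoff = thresh_ratio * np.median(lam1)
        return lam1 > cutoff

    # usage: inliers = X[reject_outliers(lam1)]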
3.3.4 Number of Motion Patterns

Due to the intrinsic vagueness of motion patterns, it is difficult to tell exactly how many motion patterns exist, even for humans. Because of this vagueness, the number of patterns is not a sensitive parameter in our application, as long as the motion patterns help us understand the scene. Still, we have two choices for deciding the number of patterns. The first is to let a human pre-define how many patterns there are, which is what we use in our current framework; the second is to learn patterns in a hierarchical fashion, which can be explored in future work.

3.4 Learning Motion Patterns

Taking tracklets from the pre-processing step as input, 2D Tensor Voting is performed to get the local tangent direction of every tracklet point. Therefore, every tracklet point (x, y) is mapped to a point in (x, y, v_x, v_y) space. In previous works like [68, 84], both the magnitude and the direction of the velocity are used. But we found that in practice directions are sometimes more reliable than magnitudes, whether they are acquired from optical flow or from global/local object association, since we can use Tensor Voting to refine directions, but there is no good method to correct magnitudes once they are wrong. This problem can be ignored when tracklets are obtained by tracking key points, such as in [89].

When the velocity magnitude is unreliable, each 2D tracklet point is embedded as a point in (x, y, \theta) space, where 0 \le \theta < 360 represents the velocity direction. This is a projection of (x, y, v_x, v_y) obtained by discarding the velocity magnitude. The steps to infer motion patterns in (x, y, v_x, v_y) space and in (x, y, \theta) space are the same. Motion patterns are inferred by the following process; an example is shown in Fig. 3.6.

Step 1 - Tracklet Analysis. A tracklet is an ordered sequence \tau = {x_{p_i}}, where x_{p_i} is the spatial coordinate of the center of the motion blob in frame i. In tracklet analysis, we perform 2D Tensor Voting for every tracklet separately, taking {x_{p_i}} as input. After the voting process, the (two-sided) tangent direction of every tracklet point is obtained, and the ordering of the points further helps us choose one side, which indicates the motion direction of that point and of the corresponding motion blob. An example of a tracklet and the motion directions after Tensor Voting is shown in Figure 3.7. We can see that Tensor Voting gives accurate direction information in the presence of noise.

To embed into (x, y, \theta) space, taking the east direction as 0, every motion direction is projected to a degree between 0 and 360. Therefore, every tracklet point (x, y) is mapped to a point in (x, y, \theta) space, 0 \le \theta < 360. To embed into (x, y, v_x, v_y) space, (v_x, v_y) is calculated from consecutive tracklet point pairs. An example of the embedding projected into (x, y, v_x) space is shown in Fig. 3.6(b).

Figure 3.6: An illustration of the motion pattern learning steps: (a) one frame of the input sequence; (b) (x, y, v_x) space projection of the KLT tracklet embedding; (c) RNMG results projected into (x, y, v_x) space; (d) motion pattern propagation results in image space.

Figure 3.7: (a): A trajectory segment. (b): The motion direction of every point on the segment: green points indicate trajectory points and red lines indicate directions. (c): Zoom-in of a part of the tracklet. (d): Zoom-in of a part of the results after Tensor Voting.

Step 2 - Multi-Motion Pattern Grouping. In the embedded space, we find that points automatically form manifold structures; one example is shown in Figure 3.10(a). Each structure contains rich information describing the pattern of movement. For example, a specific location (x_0, y_0) may have a correspondence (x_0, y_0, v_{x0}, v_{y0}), which means an object at that location is very likely to move to (x_0 + v_{x0}, y_0 + v_{y0}). Therefore, the task is to group the points in the embedded space into segments, each of which corresponds to a motion pattern. RNMG (Sec. 3.3) is performed to first reject outliers, then group the inliers into segments. Each motion pattern is then represented by a set of points q_l = (x_l, y_l, v_{xl}, v_{yl}), l = 1, 2, ..., n. An example of the grouping results projected into (x, y, v_x) space is shown in Fig. 3.6(c).

Step 3 - Dense Motion Pattern Inference. After grouping, each segment corresponds to a motion pattern. But so far, motion pattern information exists only at the sparse tracklet points. To get a full understanding of the whole scene, we propagate the information to all pixels in the image using kernel density estimation. Assume C groups are learned from the last step, corresponding to C motion patterns \pi_i (i = 1, 2, ..., C), and the number of tracklet points belonging to \pi_i is N_i. Given a point x = (x, y, \theta) or x = (x, y, v_x, v_y), let m_x = i denote that x belongs to motion pattern i. The probability of x belonging to motion pattern i can be calculated as

P(m_x = i | x) = P(x | m_x = i) P(m_x = i) / P(x) \propto \frac{1}{Z} \sum_{a : m_a = i} \exp(-||a - x||^2 / \sigma^2) \times \frac{N_i}{\sum_i N_i}   (3.8)

Here, Z is a normalization term and \sigma is the bandwidth of the kernel. The motion pattern x belongs to is then the one with the largest probability. An example of the motion pattern propagation results is shown in Fig. 3.6(d).
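Eq. 3.8 amounts to a per-pattern kernel density estimate weighted by the pattern prior. A minimal NumPy sketch, assuming Q holds the grouped inlier points in the embedded space and labels their pattern indices from RNMG:

    import numpy as np

    def pattern_posterior(x, Q, labels, sigma=2.0):
        """P(m_x = i | x) for every pattern i (Eq. 3.8), normalized over i.
        Q: (n, D) grouped points; labels: (n,) pattern indices 0..C-1."""
        C = labels.max() + 1
        lik = np.exp(-np.sum((Q - x) ** 2, axis=1) / sigma ** 2)
        scores = np.zeros(C)
        for i in range(C):
            mask = labels == i
            # kernel support of x under pattern i, times prior N_i / n
            scores[i] = lik[mask].sum() * (mask.sum() / len(Q))
        return scores / (scores.sum() + 1e-12)

    # Dense inference: a pixel (with candidate velocity or direction) is
    # assigned to argmax_i of pattern_posterior.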
3.5 Motion Patterns Improve Multiple Target Tracking

3.5.1 Algorithm

Once motion pattern information is obtained, it serves as a prior to facilitate tracking individual objects. In the association stage of our tracking system, the key step is to calculate the probability P_b of a prediction state x_R being the next step of a current state x_C. In the original tracking framework, P_b is calculated based on appearance similarity and motion similarity. Now, prior knowledge tells us that x_R should also be on the same motion pattern as x_C. To calculate this possibility, we first define the motion pattern x_C belongs to:

m_{x_C} = i = \arg\max_j P(m_{x_C} = j | x_C)   (3.9)

The probability that x_R also belongs to i is

P(m_{x_R} = i | x_R) \propto P(x_R | m_{x_R} = i) P(m_{x_R} = i) = \frac{1}{Z} \sum_{a : m_a = i} \exp(-||a - x_R||^2 / \sigma^2) \times \frac{N_i}{\sum_i N_i}   (3.10)

Assume that in the tracking system the original probability P_b of x_R being the prediction of x_C is P(X_R = x_R | X_C = x_C); then a better measure of this possibility using motion pattern information is:

P'_b = P(X_R = x_R | X_C = x_C, m_{x_R} = m_{x_C} = i) \approx \sum_{i=1}^{C} P(m_{x_R} = i | X_R = x_R) \, P_b \, P(m_{x_C} = i | X_C = x_C)

This use of motion pattern knowledge in tracking is general, so it is not limited to our tracker, but can easily be integrated into any tracking module.

3.5.2 Complexity Analysis

Assume the size of the input images is W x H, and K tracklets with average length N are learned in the feature extraction step. The computational complexity of motion pattern learning is O(KNWH + K^3 N^3), and the additional computational cost of using this prior to improve tracking is O(KN). The Matlab implementation of the learning step runs at about 2 FPS on a 3.0 GHz PC. In practice, we find that instead of incurring extra time cost for tracking, using this prior can actually speed up the tracking procedure, since false alarms are largely reduced.
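As an illustration of the reweighting in Sec. 3.5.1, the following hypothetical sketch combines a tracker's original association score with the pattern posteriors, reusing the pattern_posterior helper from the Sec. 3.4 sketch:

    # uses pattern_posterior() from the Sec. 3.4 sketch
    def reweight_association(p_b, xC, xR, Q, labels, sigma=2.0):
        """Motion-pattern-aware association score (the P'_b expression):
        sum over patterns i of P(i | xR) * P_b * P(i | xC)."""
        pC = pattern_posterior(xC, Q, labels, sigma)
        pR = pattern_posterior(xR, Q, labels, sigma)
        return float((pR * pC).sum() * p_b)

A candidate x_R that is inconsistent with x_C's motion pattern receives a low posterior overlap, so its association score is suppressed; this is how the prior removes false associations.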
3.6 Experimental Results

3.6.1 Motion Pattern Learning on the NGSIM Dataset

3.6.1.1 Motion Blob Association based Feature Extraction

In a surveillance system, we want to track general moving targets, such as pedestrians, vehicles, or both. Moreover, far-field surveillance videos are often of low quality and the target size is usually small, making it difficult to train an object detector with sufficient accuracy. On the other hand, moving objects are often isolated and relatively easy to extract and associate. For these reasons, background subtraction is performed to obtain target detections.

Extraction of Foreground Motion Blobs. The goal of background modeling is to learn the static part of the scene in an unsupervised fashion. This is a challenging task in many settings, such as traffic surveillance videos, due to the low quality of the images and the complex background structure. Here we use Robust Alignment by Sparse and Low-rank Decomposition (RASL) [56] to learn foreground points, which are supposed to be the pixels on moving objects, by treating the static scene (background) as intrinsic images and the moving objects (foreground) as sparse errors, i.e., corruptions of the intrinsic images. Notice that RASL estimates a global transformation, which means it can handle moving camera scenes. Our current framework deals with fixed camera scenes, but this capability makes RASL the best choice for future generalization. Figure 3.8 compares the foreground extraction results of RASL and of a Mixture of Gaussians; RASL clearly gives a better result.

Figure 3.8: A comparison of foreground extraction results. (a): Input image. (b): Foreground extraction results by RASL. (c): Foreground extraction results by Mixture of Gaussians.

RASL formulates image alignment as a rank minimization problem, subject to sparse corruption, and then uses convex programming to solve it. Assume there are n linearly correlated images I^0_1, I^0_2, ..., I^0_n \in R^{w \times h}. The input images are corrupted and misaligned versions I_1 = (I^0_1 + e_1) \circ \tau_1^{-1}, ..., I_n = (I^0_n + e_n) \circ \tau_n^{-1}, where e_i is an additive error on image i corresponding to the foreground moving objects in that frame, and \tau_i is the transformation causing the misalignment. In our case, since the sequences are captured by a static camera, there is no misalignment, i.e., \tau_i = I. The convex optimization problem with convex relaxation in the unknowns A, E and \tau is:

\min_{A, E, \tau} ||A||_* + \lambda ||E||_1 \quad s.t. \quad D \circ \tau = A + E   (3.11)

where A = [vec(I_1) | ... | vec(I_n)] \in R^{m \times n} is a low-rank matrix that models the common linear structure in the batch of images, and E = [vec(e_1) | ... | vec(e_n)] \in R^{m \times n} is a matrix of large-but-sparse errors that correspond to the foreground points.

After performing RASL, foreground points e_i are obtained for every frame. Connected components are then taken directly from the foreground points and considered as motion blobs corresponding to moving objects or parts of them. Examples of foreground points and motion blobs acquired from a frame are shown in Figure 3.9.

Figure 3.9: Foreground motion blob extraction results. (a)(b): Foreground pixels. (c)(d): Motion blob correspondences on images.

3.6.1.2 Results

We apply the proposed method to several real-world video sequences to demonstrate its advantages on nonlinear manifolds. Three sequences are from the NGSIM data set [4]. They were acquired by static cameras installed on high-rise buildings, and suffer from low image quality. The average size of the moving vehicles is about 22 x 15 pixels. Two other sequences are from YouTube.

To test the effectiveness of RNMG, we compare it with five major algorithms [3]: K-means, RANSAC, Local Subspace Affinity (LSA) [82], Generalized PCA (GPCA) [75], and Sparse Subspace Clustering (SSC) [19]. \sigma_{dis} (L2 distance), \sigma_{nor}, \sigma_{dim}, \lambda and L are set and fixed at 80, 0.5, 0.75, 30 and 20 respectively for all the experiments. Since GPCA, LSA, RANSAC and K-means have no explicit outlier rejection step, to make a fair comparison we first reject outliers with the method proposed in our algorithm (threshold = 0.02 x median{\lambda_1}), and use the same results for further grouping with all the methods. Without outlier rejection, non-robust segmentation methods would give worse results. Compared to algorithms like GPCA and LSA, which focus on linear subspace segmentation, the new method is better at grouping nonlinear motion patterns.

Taking sequence 2 as an example, in (x, y, \theta) space three intrinsic manifold structures exist, corresponding to the three motion patterns in the second row of Figure 3.11. We manually labeled ground truth segments. Visualizations of the ground truth and of the grouping results for the different algorithms are shown in Figure 3.10. The performance shown in Table 3.1 is evaluated in terms of the grouping error, defined as the number of misgrouped points divided by the total number of points.

Figure 3.10: Grouping results on the second sequence in Figure 3.11. Black points indicate outliers. (a) Ground truth. (b) RANSAC. (c) LSA. (d) GPCA. (e) SSC1. (f) Ours.

K-means   RANSAC   LSA      GPCA     SSC1     SSC2     Ours
42.96%    40.34%   31.61%   21.28%   27.21%   23.03%   7.99%

Table 3.1: Misclassification rates for 7 grouping algorithms on the real-world sequences. SSC1 means SSC does grouping using its own outlier rejection step, and SSC2 means using our outlier rejection step.

Figure 3.11: Motion pattern learning results by RNMG: (a) Input sequences; (b) The visualization of the motion pattern learning results projected into (x, y, v_y) space for the first sequence, and (x, y, v_x) space for all the other sequences; (c) The visualization of the motion pattern learning results in image space.

3.6.2 Motion Patterns Improve Tracking

3.6.2.1 NGSIM Dataset

To test the usefulness of motion patterns in tracking, we select 100 frames which are not used in learning motion patterns from each of the 3 NGSIM sequences. The ground truth trajectories are generated by manually tracking each moving vehicle.
The dataset is challenging due to heavy traffic, occlusion, and low resolution.

Evaluation Metric. Four metrics [13] are adopted to quantitatively analyze the results:
(1) TRDR: Tracker Detection Rate = Total True Positives / Total Number of Ground Truth
(2) FAR: False Alarm Rate = Total False Positives / (Total True Positives + Total False Positives)
(3) MT: Mostly tracked trajectories, the percentage of trajectories that are successfully tracked for more than 50% of frames.
(4) FRAG: Average number of fragments for each tracked ground truth trajectory.

Sequence   Method      TRDR     FAR      MT       FRAG
1          [57, 58]    81.60%   15.50%   76.92%   1.1429
1          Ours        82.12%   3.75%    80.77%   1.1429
2          [57, 58]    80.10%   33.45%   77.08%   1.9000
2          Ours        81.79%   20.95%   81.25%   1.5610
3          [57, 58]    85.21%   33.75%   88.46%   1.8333
3          Ours        88.76%   21.42%   88.46%   1.4000

Table 3.2: Tracking evaluation results on three sequences.

The higher the value, the better the performance for TRDR and MT; the lower the value, the better the performance for FAR and FRAG. These four categories are by no means complete; however, they cover most of the typical errors observed in our experiments and form reasonable evaluation metrics.

For comparison, we first perform tracking with a state-of-the-art tracker [57, 58], then incorporate the motion pattern information into the tracker and track again. Tracking evaluation results are shown in Table 3.2. With a similar detection rate, our method significantly decreases false alarms. This shows that motion pattern information prevents wrong associations by providing prior knowledge of the objects' movement. But since the detection rate largely depends on the initial motion blobs found by background modeling, motion patterns do not contribute much to improving detection.

3.6.2.2 Wide Area Scene Analysis: the CLIF 2006 dataset

Wide area aerial surveillance data has recently proliferated and increased the demand for multi-object tracking algorithms. However, the limited appearance information on every target creates much ambiguity in tracking and increases the difficulty of removing false target detections. To solve this problem, we propose to learn motion patterns in wide area scenes and take advantage of this additional information in tracking to remove false alarms and reduce tracking error.

Figure 3.12: One frame of the imagery is captured by an array of cameras (left), while it is desirable to work with only one image per frame, as if it were captured by a virtual camera (right).

Wide area aerial surveillance imagery is captured by an array of sensors. Therefore, the input to our algorithm is not one video stream, but rather an array of video streams. Traditionally, there are two ways to handle this. One is to track objects independently in each video and then hand off the tracks as they cross from the field of view of one sensor to another. The second is to first mosaic the array of sensors, and then estimate tracks on the resulting single video stream. In this work we take a hybrid approach: we first mosaic the sensor array and then divide the stabilized and georeferenced imagery into a number of tiles for parallel processing. This offers the advantage of seeing the "big picture" of the area under surveillance provided by the mosaic, while retaining the ability to do parallel processing with an optimal number of tiles. We use the mosaicking algorithm proposed in [57].

We have evaluated our tracking algorithm on a sequence from a wide area imagery dataset [1]. This dataset is collected from an aerial platform flying in a circular pattern over Ohio State University. The dataset is captured at about 2 frames per second and contains significant parallax from campus buildings and trees. There are more than 4000 frames in the dataset, with each frame roughly 6500 x 7500 pixels in size.
We have stabilized and georeferenced the dataset to 0.75 meters per pixel resolution before tracking (an example of the results is shown in Figure 3.12). For quantitative evaluation we selected a 1312 x 738 region in the middle of the persistently visible area and manually determined tracking ground truth for 100 frames. The selected sequence has 205 vehicle tracks, each vehicle being about 10 x 5 pixels in size.

Moving object detection was done using background subtraction. The background is modeled as the mode of a (stabilized) sliding window of frames. A tracking window size of 12 frames, corresponding to about 6 seconds of video, was used. In the motion pattern learning stage, given an input tracklet, we first calculate the movement of every tracklet point relative to the start of the tracklet, and remove the tracklet if the median of the movement is smaller than 6 pixels. This is because tracklet points caused by parallax are often constrained to a small region, while tracklets caused by real moving objects cover a large area.

Several metrics were used to evaluate performance: object detection rate (ODR), moving object detection rate (MODR), false alarm rate (FAR), mean cumulative swaps of tracks (SWP), and mean cumulative broken tracks (BRK). To calculate these metrics we determine a 1-1 assignment of estimated detections to ground truth detections using the Hungarian algorithm, maximizing the detection overlap. A correct detection is a detection in an estimated track that overlaps with a detection in a ground truth track.

ODR is defined as the number of correct detections divided by the length of all ground truth tracks. Since our ground truth has annotations for vehicles that come to a stop after initial movement, the ODR is depressed when the input to the tracker consists only of moving object detections from background subtraction. Therefore, we also define MODR, where the ground truth detections corresponding to stationary vehicles are removed. FAR is defined as before. Finally, the BRK and SWP metrics are defined in [67] to measure multi-object tracking performance, and are similar to the track fragmentation and ID switch rates found in the literature.

Quantitative results using these metrics are shown in Table 3.3. We compared our proposed approach with motion patterns to one without, as described in [57]. Furthermore, Figure 3.13 shows the motion pattern learning process. In (a), the initial tracklets are displayed; there is a large number of false tracklets from parallax or false association. (b) shows the motion pattern learning results, which largely correspond to the road network.

Figure 3.13: (a) Initial tracklets used to learn motion patterns. (b) Learned motion patterns.

         Without MP   With MP
ODR      0.28         0.23
MODR     0.33         0.29
FAR      0.85         0.19
SWP      0.55         0.35
BRK      0.75         0.46

Table 3.3: Vehicle tracking performance with and without motion patterns (MP) on wide area imagery. Please see the text for metric definitions.

The results show that motion patterns significantly reduce the false alarm rate with a small corresponding decrease in the object detection rate. This is because most of the false alarms come from moving object detections due to parallax, and these are denoised by Tensor Voting. Furthermore, the number of ID switches and track fragmentations has also decreased with the use of motion patterns, as is evident in the decrease in the track swap rate and the broken tracks rate. This is another indication that ambiguity during tracking has been reduced. While we have not shown this quantitatively, the reduced ambiguity has had another positive effect, namely an increase in the computational efficiency of the tracker. Tracks were estimated on the evaluation sequence at about 1 frame per second.
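As a concrete illustration of the background model used in this experiment, a minimal sketch that takes the per-pixel mode over a stack of stabilized frames. It assumes 8-bit grayscale frames already aligned into a NumPy array; the foreground threshold of 25 at the end is a hypothetical value for illustration.

    import numpy as np

    def mode_background(frames):
        """Per-pixel mode over a stabilized sliding window of frames.
        frames: (T, H, W) uint8 array of aligned grayscale images."""
        T, H, W = frames.shape
        # count intensity occurrences per pixel, then take the most frequent
        counts = np.zeros((256, H, W), dtype=np.uint16)
        rows = np.arange(H)[:, None]
        cols = np.arange(W)[None, :]
        for t in range(T):
            np.add.at(counts, (frames[t], rows, cols), 1)
        return counts.argmax(axis=0).astype(np.uint8)

    # Pixels far from the mode are declared foreground (threshold assumed):
    # fg = np.abs(frame.astype(int) - bg.astype(int)) > 25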
3.7 A Generalization of RNMG to ND Space

A generalization of RNMG to ND space is proposed in this section as Robust Multiple Manifold Structure Learning (RMMSL) [22]. RMMSL is a robust multiple manifold structure learning scheme to estimate data structures under the assumption of multiple low intrinsic dimensional manifolds. In the local learning stage, RMMSL efficiently estimates the local tangent space by weighted low-rank matrix factorization. In the global learning stage, we propose a robust manifold clustering method based on the local structure learning results. The proposed clustering method is designed to obtain the flattest manifold clusters by introducing a novel curved-level similarity function.

3.7.1 Robust Multiple Manifold Structure Learning

The concept of a manifold has been used extensively in many areas of machine learning. Related methods have been applied to many real-world problems in computer vision, computer graphics, web data mining and more. Despite the success of manifold learning (here, manifold learning refers to any learning technique that explicitly assumes data have manifold structures), there are several fundamental challenges in real applications:

(1) Multiple manifolds with possible intersections: in many circumstances there is no unique (global) manifold but a number of manifolds with possible intersections. For instance, in handwritten digit images, each digit forms its own manifold in the observed feature space. For human motion, the joint positions (angles) of key points on the body skeleton form low dimensional manifolds for each specific action [74]. In these situations, modeling data as a union of (linear or non-linear) manifolds rather than a single one gives a better foundation for many tasks such as semi-supervised learning [20] and denoising [25].

(2) Noise and outliers: a critical issue in manifold learning is whether the method is robust to noise and outliers. This was pointed out in the pioneering work on nonlinear dimensionality reduction [70]. For non-linear manifolds, since it is not possible to leverage all data to estimate the local data structure, more samples from the manifolds are required as the noise level increases.

(3) High curvature and the local linearity assumption: typical manifold learning algorithms approximate manifolds by a union of (possibly overlapping) locally linear patches. These local patches are estimated by linear methods such as principal component analysis (PCA) and factor analysis (FA) [72]. However, for high curvature manifolds many smaller patches are needed, which often conflicts with the limited number of data samples.

RMMSL is proposed to investigate the problem of robustly estimating data structure under the multiple manifolds assumption. In particular, data are assumed to be sampled from multiple smooth submanifolds of (possibly different) low intrinsic dimensionalities, with noise and outliers. RMMSL is composed of two stages. In the local stage, we estimate the local manifold structure, taking noise and curvature into account. The global stage, i.e., manifold clustering and outlier detection, is performed by constructing a multiple kernel similarity graph based on the local structure learning results, introducing a novel curved-level similarity function. In contrast to previous works, the clustering stage in RMMSL can handle multiple non-linear (possibly intersecting) manifolds with different dimensionalities, and it explicitly considers outlier filtering, which is addressed as one step of the clustering process.
Compared to the standard assumptions in clustering, i.e., that the intra-class distance should be small while the inter-class distance should be large, we further argue that each cluster should be a smooth manifold component. As shown in Fig. 3.14, when two manifolds intersect, there exist multiple possible clustering solutions, while only the rightmost is the result we want. The underlying assumption we make is that a local manifold has relatively low curvature, i.e., it changes smoothly and slowly as spatial distance increases. In order to get the flattest manifold clustering result, it is natural to incorporate a flatness measure into the clustering objective function. Therefore, with RMMSL, issue (1) is explicitly addressed, and (2) and (3) are partially investigated. A demonstration of the proposed approach is given in Fig. 3.15.

Figure 3.14: An example of multiple smooth manifolds clustering. The first one shows the input data samples and the other three are possible clustering results. Only the rightmost is the result we want, because the underlying manifolds are smooth.

Figure 3.15: A demonstration of RMMSL. From left to right: noisy data points sampled from two intersecting circles with outliers, outlier detection results (black dots), and manifold clustering results after outlier detection.

3.7.2 Experimental Results

We evaluate the performance of RMMSL on synthetic data, USPS digits, CMU Motion Capture data (MoCap) [2], and motorbike videos. These data are chosen to demonstrate the general grouping capability of RMMSL as well as its advantages on nonlinear manifolds. In this task, quantitative comparisons of manifold clustering are provided. We compare RMMSL to K-means, spectral clustering (NJW) [53], self-tuning spectral clustering [85], Generalized PCA (GPCA) [75] and Sparse Subspace Clustering (SSC) [19]. These methods (except K-means) are chosen because all of them are related to manifold learning or subspace learning. The Rand Index score is used as the evaluation metric. The kernel bandwidth in spectral clustering is tuned over {1, 5, 10, 20, 50, 100, 200} for synthetic data and MoCap data, and over {100, 500, 1000, 2000, 5000} for USPS. The value of the K-th neighborhood in self-tuning spectral clustering and RMMSL is chosen from {5, 10, 15, 20, 30, 50, 100}. The parameter of RMMSL is chosen from {0.2, 0.5, 1, 1.5, 2}. The sparse regularization parameter of SSC is tuned over {0.001, 0.002, 0.005, 0.01, 0.1}. For synthetic data with random noise, the parameters for all methods are selected on 5 trials and then the average performance over another 50 trials is reported. For real data, parameters are selected by picking the best Rand Index. For all methods containing K-means, 100 replicates are performed.

3.7.2.1 Synthetic Data

Since most clustering algorithms do not consider outliers explicitly, we first perform a comparison on 3 outlier-free synthetic data sets, each of which contains 2000 noisy samples from two manifolds in R^3 (d = 2 and D = 3). Results are shown in Figure 3.16, and Rand Index scores are given in the first three rows of Table 3.4. For all methods, the number of clusters n_c is fixed at 2. The results (Table 3.4) show that RMMSL achieves comparable and often superior performance to the other candidate methods. In particular, when the two manifolds are nonlinear and intersect, such as two intersecting spheres (second row), the advantage of RMMSL is clearest.
Figure 3.16: Visualization of part of the clustering results in Table 3.4. First row: one noisy sphere inside another noisy sphere in R^3. Second row: two intersecting noisy spheres in R^3. Third row: two intersecting noisy planes in R^3. For each part, from left to right: K-means, self-tuning spectral clustering, Generalized PCA and RMMSL.

Data / Methods              K-means   NJW    Self-tuning   GPCA   SSC    RMMSL
Big-small spheres           0.50      1.00   1.00          0.51   0.56   1.00
Two intersecting spheres    0.76      0.78   0.84          0.50   0.53   0.95
Two intersecting planes     0.51      0.60   0.62          0.85   0.93   0.95
USPS-2200                   0.80      0.89   0.89          -      -      0.90
USPS-5500                   0.78      0.88   0.88          -      -      0.88
CMU MoCap                   0.69      0.81   0.81          -      -      0.89
Motorbike Video             0.72      0.84   0.85          0.85   0.87   0.96

Table 3.4: Rand Index scores of clustering on synthetic data, USPS digits, CMU MoCap sequences and motorbike videos.

To verify the robustness of the proposed approach, we further evaluate RMMSL on synthetic data with outliers. We add 100 outliers and a 2D plane (1000 samples) to the Swiss roll (2000 samples) to generate two intersecting manifolds in R^3. The results of RMMSL are shown in Figure 3.17. RMMSL can effectively filter out outliers, achieving a 0.96 Rand score (n_c = 2) and a 0.99 F-measure for outlier detection when the outlier ratio is given. The Rand score drops to 0.78 if we do the clustering without outlier filtering (n_c = 3). This suggests that the outlier detection step is helpful when outliers exist.

It is worth noting that spectral clustering methods cover broader cases than RMMSL, whose main advantage is on multiple low-dimensional manifolds embedded in a high dimensional space. For instance, in the case of two 2D Gaussian distributed clusters in R^2, RMMSL reduces to self-tuning spectral clustering (all local tangent spaces are ideally identical). Compared with multiple subspace learning methods such as GPCA [75] and SSC [19], which are the state-of-the-art for linear (affine) subspace clustering, our approach is better when the underlying manifolds are nonlinear.

Figure 3.17: Clustering results of RMMSL on two manifolds with outliers. (a): ground truth, (b): outlier detection, (c): clustering after outlier filtering, (d): clustering without outlier filtering.

Figure 3.18: Examples of clean USPS digit images.

3.7.2.2 USPS Digits Data

We further compare the different clustering methods on two subsets of the USPS handwritten digit images (examples are shown in Figure 3.18). The first contains 1100 samples for each of digits 1 and 2 (referred to as USPS-2200) and the second contains 1100 samples for each of digits 1, 2, 3, 4 and 5 (USPS-5500). D is reduced from 256 (gray scale features of 16 x 16 images) to 50 by PCA for all methods, and d is fixed at 5.
Because of the highly nonlinear image structure, results of the subspace clustering methods are not reported in Table 3.4. For USPS data, the possibly high intrinsic dimensionality versus the limited number of samples creates difficulties for data structure learning, especially for the local learning stage of RMMSL. Nevertheless, RMMSL achieves similar results to the other methods.

3.7.2.3 Motion Capture (MoCap) Data

The automatic clustering of human motion sequences into different action units is a necessary step for many tasks such as action recognition and video annotation. Usually this is referred to as temporal segmentation. In order to make a fair comparison among the different clustering methods, we focus on the non-temporal setting, i.e., the temporal index is removed and sequences are treated as collections of static human poses.

We choose 5 mixed action sequences from subject 86 in the CMU MoCap. We use the joint-position features (a 45-dimensional representation for 15 human body markers in R^3), which are centralized to remove the global motion. Thus, D is 45 and d is estimated in the experiment. The average Rand scores are reported in Table 3.4. They show that RMMSL achieves higher clustering accuracy than the other candidate methods.

One motion sequence (500 frames) and the corresponding results are visualized in Figure 3.19. The subject walks, then slightly turns around and sits down. By combining the local learning results from RMMSL and [72], the joint-position features (R^45) from this sequence are visualized in R^3. The figure supports the assumption that there are low-dimensional manifolds in the high dimensional joint-position space. In fact, this MoCap sequence can be viewed as three connected nonlinear motion manifolds, corresponding to walking, turn-around and sit-down respectively. The sequence is challenging because (1) walking is similar to turning around, (2) some frames look like pauses in the sit-down process and (3) the actions are smoothly connected. Nevertheless, RMMSL achieves promising results, as shown in Figure 3.19. Since we focus on the non-temporal setting, a comparison with temporal segmentation methods such as [93, 41] is not investigated; it would be interesting to combine these temporal clustering works [93, 41] with our approach.

Figure 3.19: An example of human action segmentation results on CMU MoCap. Top left: the 3D visualization of the sequence. Top right: labeled ground truth and clustering results comparison. Bottom: 9 uniformly sampled human poses.

3.7.2.4 Motion Pattern Learning on Motorbike Videos

Given motorbike videos (from YouTube) as shown in Figure 3.20, the global motion pattern is learned from low level motion features. In the experiment, optical flow at salient feature points is estimated by the Lucas-Kanade algorithm. Every feature point carries the 4D information (x, y, v_x, v_y). The motion direction is then calculated, and every point is embedded into (x, y, \theta) space. These points are used as the input. The first video contains n = 9266 points and the second contains n = 8684 points. We use RNMG to learn the global motion manifolds by performing manifold clustering (n_c = 2), and the best Rand scores are reported in Table 3.4. Since optical flow results are noisy, outlier filtering is performed before clustering.

The motion manifold learning results for the two motorbike videos are shown in Figure 3.20, where the motion manifolds are visualized on the images after kernel density interpolation. From the results we can see that the clustered manifolds have clear semantic meanings, since each manifold corresponds to a coordinated movement formed by a group of motorbikes.
Therefore, RNMG correctly learns the global motion and helps understand the video scenes.

3.8 Conclusion

In this chapter, we proposed a method to detect multiple semantically meaningful motion patterns in an unsupervised manner, and used them to improve multiple target tracking accuracy. A novel robust grouping algorithm making full use of local geometric information was designed. The experimental results on synthetic data, USPS digits, and complex video sequences show that the grouping algorithm outperforms state-of-the-art alternatives, and verify that using high-level knowledge about the scene in the form of motion patterns significantly improves tracking performance.

Figure 3.20: Two examples of motion flow modeling results on motorbike videos. The first and third rows: optical flow estimation results on sample images from two video sequences. The second and fourth rows: learned motion manifolds with highlighted motion directions.

The objective of motion pattern learning and/or tracking is to better understand the scene. Therefore, using motion patterns to improve tracking is only one method, not the goal itself; motion pattern information can be used to explore further applications.

First, motion patterns can be used to detect unusual activity. To save human operators from the labor intensive task of watching surveillance videos all the time, while drawing their attention immediately when unusual activities take place, we can use motion patterns to distinguish between usual and rare activities. This is possible because motion patterns convey rich information about regular movement, such as when and where movements take place and what the normal moving directions and speeds are. Rare events are not consistent with the common motion patterns, so these outliers can be picked out. Human operators can then focus only on these rare events, reducing their workload while increasing detection accuracy.

Second, motion patterns can be used to predict behavior or activities. Beyond detecting unusual activities once they happen, surveillance can do better by predicting and ideally preventing such events. For instance, at a busy intersection, most vehicles move according to the traffic rules, and this can be captured by the learned motion patterns. When a car moves unusually, for instance attempting a prohibited U-turn, an alarm can be raised before it actually breaks the traffic rule.

Chapter 4

Tracking and Detection in Very Crowded Scenes

In this chapter, we use motion pattern information to solve the challenging problems of tracking and detection in very crowded scenes, and the Motion Structure Tracker (MST) is proposed [90]. MST combines motion pattern learning, visual tracking, and multi-target tracking. Tracking in crowded scenes is very challenging due to hundreds of similar objects, cluttered background, small object size, and occlusions. However, structured crowded scenes exhibit clear motion pattern(s), which provide rich prior information. In MST, tracking and detection are performed jointly, and motion pattern information is integrated in both steps to enforce the scene structure constraint. MST is first used to track a single target, and is then extended to solve a simplified version of the multi-target tracking problem. Experiments are performed on challenging real-world sequences, and MST gives promising results, significantly outperforming several state-of-the-art methods both in terms of track ratio and accuracy.
4.1 Introduction

Object tracking has been of broad interest for decades in applications such as security and surveillance, human-computer interaction and traffic control. Tracking in crowded scenes in particular is receiving more and more attention, as it pushes the limits of traditional tracking algorithms.

Crowded scenes can be divided into two categories, structured and unstructured, depending on whether there are clear motion patterns in the scene. Our definition differs from [62], in which the authors distinguish between the two based on whether each spatial location supports only one dominant crowd behavior or several. For instance, Figure 4.1(d) shows a scene of Italian police riders putting on a motorbike display. Two groups of riders ride in opposite directions. Due to occlusions, one location may have two opposite velocities, corresponding to two groups of coordinated movements. The scene is structured because the two groups of movements form two clear motion patterns.

Figure 4.1: Examples of structured crowded scenes. (a)(b)(c): Marathon sequences. (d)(e): Italian motorbike display sequences.

In this chapter, we focus on the problem of single and multiple target tracking in structured crowded scenes (examples are shown in Figure 4.1), and we want to track objects in general, not specific categories such as pedestrians. The first issue is the motion pattern problem. In very crowded scenes, tracking is difficult for many reasons: the size of a target is usually small; there is a large number of similar objects in the scene; and objects undergo partial and full occlusions. Furthermore, the "detect and track" paradigm fails here. However, the most salient characteristic of structured scenes is that objects do not move randomly, but follow a pattern. Several efforts have been devoted to studying motion patterns, and some of them [6, 36] have been successfully used for tracking in crowded scenes.

The second issue is single versus multi-target tracking. The fundamental difference between the two is whether targets are tracked individually or jointly. In single object tracking, a target is labeled in the first one or several frame(s), and matching is used for detection in the following frames. In multi-target tracking, targets are detected and associated in each frame. Two commonly used methods for detection are appearance based detectors and background modeling based motion blob detectors, but neither works in very crowded scenes (as shown in Figure 4.2) due to the small object size, the high density, and the requirement to handle general objects. Thus, in both single and multiple object tracking, we require user labeling in the first frame as input. In multi-target tracking, supervised learning is used for detection. A supervised learning method normally requires a large number of samples of each kind of object for training. To save the user from tedious work, we only ask for one labeled example. After tracking the exemplar for a few frames to train a detector, we go back to the first frame and use the learned detector to detect similar objects, then track them.

As an extension of previous work, our approach (Figure 4.3) incorporates the motion pattern learning results in both the detection and tracking stages, and extends the motion pattern based tracker from single target tracking to a simplified version of multiple target tracking.

Figure 4.2: Commonly used detection methods fail in very crowded scenes. (a) Pedestrian detection results by [30]. (b) Foreground extraction results by MoG.

Figure 4.3: An overview of the Motion Structure Tracker, used to solve the problems of single target tracking and a simplified version of multi-target tracking.

4.2 Motion Structure Tracker

In the Motion Structure Tracker, motion pattern information is learned as explained in Chapter 3. Specifically, in very crowded scenes, we use the KLT key point tracker to get short tracklets, and
embed the tracklet points into (x, y, v_x, v_y) space. After normalizing (v_x, v_y) to the same scale as (x, y), manifold structures emerge. Two examples of the projection into (x, y, v_x) space are shown in Figure 4.4. To exploit the manifold property, Tensor Voting is performed to learn the local geometric structure. Then outliers are filtered out, again using Tensor Voting. As a result, a motion pattern is represented by a set of points q_l = (x_l, y_l, v_{xl}, v_{yl}), l = 1, 2, ..., n.

Figure 4.4: Temporally non-stationary scenes. First row: Hongkong. Second row: Motorbike. (a) Input sequences and examples of targets. (b) The visualization of the motion pattern learning results projected into (x, y, v_x) space. (c) The visualization of the motion pattern learning results in image space.

4.2.1 Exploiting motion patterns in detection

Objects in structured crowded scenes move smoothly by following certain patterns. To establish an object correspondence in the next frame, we search in a window within which the largest corresponding velocity is no higher than twice the largest velocity found in the motion pattern prior. The search space is thus greatly reduced.

The detection probability Pr_{f,m} = Pr(y = 1 | f, m) in MST is calculated as

Pr(y = 1 | f, m) \propto Pr(y = 1 | f) \times Pr(y = 1 | m)   (4.1)

where y = 1 denotes that a detection is positive, f denotes the appearance feature vector used to judge a detection, and m denotes the motion structure information in (x, y, v_x, v_y) space used to judge a detection. In particular, Pr(y = 1 | f) is the appearance based detection probability and Pr(y = 1 | m) is the motion detection probability. Eq. 4.1 holds approximately under the assumption that the two marginal probabilities with f and m are independent.

4.2.1.1 Appearance Detection Probability

Random ferns [55] were proposed to speed up random forests. Random fern classifiers have proven efficient and effective in tasks such as tracking [34], image classification [14], and action recognition [54]. Thus, we use a random fern classifier for detection and, as in [34], exploit the structure of unlabeled data, i.e., the positive and negative structures. Given a video, in the first frame, image patches close to the target are used as positive training examples, and those far from it are used as negative training examples. In the following frames, once the target is validated, the corresponding examples (positive and negative) are extracted and used to update the detector.

The detector contains a set of ferns. Once a patch is given, each fern evaluates it independently, by taking a set of measurements to form the feature vector f. At the leaf node that f points to, the posterior probability that the patch is positive (y = 1) is calculated from how many positive (s^+) and negative (s^-) samples have already been recorded at that leaf: s^+ / (s^+ + s^-). Pr(y = 1 | f) is the average of the posterior probabilities from all the ferns.
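A minimal sketch of this fern posterior, assuming binary pixel-comparison features; the 10-fern/13-feature configuration of Sec. 4.3 is used as the default, while the patch size and the feature choice are illustrative assumptions.

    import numpy as np

    class Ferns:
        """Random fern classifier: each fern hashes a patch to a leaf via
        binary pixel comparisons and stores positive/negative counts."""
        def __init__(self, n_ferns=10, n_feats=13, patch=(15, 15), rng=None):
            rng = np.random.default_rng(0) if rng is None else rng
            # random pixel pairs defining each fern's binary tests (assumed)
            self.pairs = rng.integers(0, patch[0] * patch[1],
                                      size=(n_ferns, n_feats, 2))
            self.pos = np.zeros((n_ferns, 2 ** n_feats))
            self.neg = np.zeros((n_ferns, 2 ** n_feats))

        def _leaves(self, patch):
            p = patch.ravel()
            bits = p[self.pairs[..., 0]] > p[self.pairs[..., 1]]
            return bits.dot(1 << np.arange(bits.shape[1]))  # leaf per fern

        def update(self, patch, positive):
            idx = self._leaves(patch)
            table = self.pos if positive else self.neg
            table[np.arange(len(idx)), idx] += 1

        def prob(self, patch):
            """Pr(y = 1 | f): average of s+/(s+ + s-) over all ferns."""
            idx = self._leaves(patch)
            s_pos = self.pos[np.arange(len(idx)), idx]
            s_neg = self.neg[np.arange(len(idx)), idx]
            post = s_pos / np.maximum(s_pos + s_neg, 1)
            return float(post.mean())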
4.2.1.2 Motion Detection Probability

Moreover, each candidate patch is examined to test whether it is consistent with the motion pattern prior knowledge. For a target centered at (x_i, y_i) in frame t, possible correspondences with displacements (v_{xij}, v_{yij}), j = 1, ..., m, in frame t+1 produce a set of points p_{ij} = (x_i, y_i, v_{xij}, v_{yij}). Intuitively, we want to check whether these points get support from the motion pattern prior. Thus, each point q_l = (x_l, y_l, v_{xl}, v_{yl}), l = 1, 2, ..., n, from motion pattern learning votes for p_{ij}, and the sum of all the votes is

vote_{ij} = \sum_{l=1}^{n} e^{-||q_l - p_{ij}||^2 / \sigma_1^2}   (4.2)

We normalize the votes as follows:

Pr(y = 1 | m) = \frac{vote_{ij}}{\sum_{j=1}^{m} vote_{ij}}   (4.3)

By combining Pr(y = 1 | f) and Pr(y = 1 | m), the final MST detection probability not only captures the object appearance features but also incorporates the motion structure, which is important for detecting objects in crowded scenes.

4.2.2 Exploiting motion patterns in tracking

In addition to detecting correspondences of a target, we also track the target directly by selecting key points on it and finding their correspondences in the next frame. However, crowded scenes have low resolution and cluttered background, making optical flow results unreliable. To solve this problem, we propose a motion structure based optical flow, i.e., Structure Flow (SF), via generalized Tikhonov regularization. Structure flow is a Bayesian extension of optical flow, since motion pattern information provides prior knowledge about the movement.

Formally, using the prior knowledge as a regularization term, the structure flow v = (v_x, v_y) is obtained by minimizing the following loss function:

L_{SF}(v) = L_{AF}(v) + \lambda L_{MP}(v)   (4.4)

where L_{AF}(v) is the loss based on the appearance features and L_{MP}(v) is the loss based on the motion pattern prior. \lambda is the regularization parameter controlling the impact of the priors. These two terms are described as follows.

L_{AF}(v): Recall the Lucas-Kanade method [40, 17]. To calculate the velocity (v_x, v_y) of a point q = (x, y), we consider K points {p_i = (x_i, y_i)}, i = 1, 2, ..., K, in q's neighborhood. Let I_x(p_i), I_y(p_i) and I_t(p_i) denote the partial derivatives of the image I with respect to position x, y and time t, evaluated at the point p_i. Based on the LK assumptions, (v_x, v_y) must satisfy

I_x(p_i) v_x + I_y(p_i) v_y + I_t(p_i) = 0   (4.5)

Let

A = \begin{bmatrix} I_x(p_1) & I_y(p_1) \\ I_x(p_2) & I_y(p_2) \\ \vdots & \vdots \\ I_x(p_K) & I_y(p_K) \end{bmatrix}, \quad v = \begin{bmatrix} v_x \\ v_y \end{bmatrix}, \quad b = \begin{bmatrix} -I_t(p_1) \\ -I_t(p_2) \\ \vdots \\ -I_t(p_K) \end{bmatrix}

Then the K points ideally satisfy Av = b. However, the points usually do not all move in the same way, so this does not hold in practice. Thus, we use the following loss function:

L_{AF}(v) = ||Av - b||^2   (4.6)

where Eq. 4.6 itself can also be viewed as an over-determined system if K > 2.

L_{MP}(v): We use the motion pattern information in (x, y, v_x, v_y) space as the prior. For a point x = (x, y), consider a W x W area around it. The point collects information from each point x_i = (x_i, y_i), i = 1, 2, ..., L, in the area that has a correspondence (x_i, y_i, v_{ix}, v_{iy}) in the motion pattern prior. We weight the impact as w_i = e^{-||x - x_i||^2 / \sigma_2^2}, normalized as w_i \leftarrow w_i / \sum_{j=1}^{L} w_j. The weighted sum of the velocities is then used as an estimate of the expected velocity v_0 = (v_{0x}, v_{0y})^T, and the covariance matrix \Sigma_v can also be derived. Based on these estimates of the mean and covariance matrix, we design the following loss function:

L_{MP}(v) = (v - v_0)^T \Sigma_v^{-1} (v - v_0)   (4.7)

Essentially, Eq. 4.7 can be viewed as a multivariate Gaussian probabilistic framework modeling the prior of the structure flow.

L_{SF}(v): Substituting the two proposed loss functions into Eq. 4.4, the structure flow v is estimated by minimizing

L_{SF}(v) = ||Av - b||^2 + \lambda ||v - v_0||^2_{\Sigma_v^{-1}}   (4.8)

where ||v - v_0||^2_{\Sigma_v^{-1}} stands for the Mahalanobis distance (v - v_0)^T \Sigma_v^{-1} (v - v_0). According to generalized Tikhonov regularization, the closed-form solution is

v = v_0 + (A^T A + \lambda \Sigma_v^{-1})^{-1} A^T (b - A v_0)   (4.9)

Thus, for each key point on the target, a velocity estimate v incorporating the motion structure information is generated, and the target position can be estimated.

It is worth noting that as \lambda \to 0, Eq. 4.9 degenerates to (A^T A)^{-1} A^T b, which is the standard Lucas-Kanade optical flow, because no regularization is used in L_{SF}(v). On the other hand, as \lambda \to \infty, Eq. 4.9 degenerates to v_0, which means we fully trust the motion pattern priors.
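Eq. 4.9 is a two-variable linear solve per key point. A minimal sketch, assuming the spatial/temporal image derivatives in the neighborhood and the prior mean and covariance from the motion pattern are given; the default lam follows the value fixed in Sec. 4.3.

    import numpy as np

    def structure_flow(Ix, Iy, It, v0, Sigma_v, lam=2.0):
        """Closed-form structure flow (Eq. 4.9).
        Ix, Iy, It: (K,) image derivatives at the K neighborhood points;
        v0: (2,) prior velocity; Sigma_v: (2, 2) prior covariance."""
        A = np.stack([Ix, Iy], axis=1)        # (K, 2), as in Eq. 4.5
        b = -It                               # (K,)
        M = A.T @ A + lam * np.linalg.inv(Sigma_v)
        return v0 + np.linalg.solve(M, A.T @ (b - A @ v0))

    # lam -> 0 recovers plain Lucas-Kanade; lam -> infinity returns v0.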
4.2.3 Simplified Multiple Target Tracking in Structured Crowded Scenes

In crowded scenes, tracking many similar objects is extremely challenging. One of the difficulties is how to detect multiple targets. Due to the small target size and the large intra-class variance caused by viewpoint and occlusion, an object detector does not provide satisfying results (an example of pedestrian detection results by the state-of-the-art detector [30] is given in Figure 4.2). Therefore, we step back and solve a simpler problem: once a user labels a target in the first frame, find similar objects and track all of them. This is not the classical multiple target tracking problem in the common sense, in which all moving objects are expected to be tracked.

We track the user-labeled object for a few frames and in parallel train our detector. Then we go back to the first frame and detect the top n_1 (user input) similar objects with the detector. The Motion Structure Tracker tracks multiple targets in a way similar to single target tracking. Specifically, we treat each of the n_1 + 1 objects as a single object to track. If the tracking results in frame t are good (measured by the confidence scores from the detector and the structure flow tracker), we move to frame t+1. If not, we consider a window of size L, from frame t to frame t+L-1, and locate candidates with the detector and the structure flow tracker based on frame t. Association is then formulated as inference in a set of Bayesian networks [57]. The confidence score is the driving force behind finding the MAP data association estimate. Then we move forward to frame t+1 and repeat the process.

4.3 Experimental Validation

We apply the Motion Structure Tracker (MST) in four sets of experiments. The video sequences we use can be divided into two groups, temporally stationary and temporally non-stationary scenes, depending on whether the motion patterns change with time. Figure 4.5 shows three examples of Marathon sequences (from [6, 5] and YouTube), and Figure 4.6 shows three examples of crowded scenes from the ICCV 2011 data-driven crowd analysis dataset [65], in which the motion patterns are the same from beginning to end. For such sequences, we only need to learn the motion patterns in the first few frames and then use them for the whole sequence.

Figure 4.5: Temporally stationary scenes and examples of targets. (a) Marathon-1. (b) Marathon-2. (c) Marathon-3.

On the contrary, the Hongkong sequence [5] and the Italian motorbike sequence (from YouTube) have changing motion patterns, making it necessary to update the motion pattern learning results online. The task performed in each sequence can also be divided into two groups: single target tracking and multiple target tracking. Therefore, four sets of experiments are designed for the four combinations. In all the experiments, we use 10 ferns and 13 features per fern, and we fix \sigma_1 = 10, \sigma_2 = 5, \lambda = 2. Results are robust over a certain range of these parameters. In the outlier filtering step, points whose \lambda_1 is smaller than 0.02 of the median of the \lambda_1 of all points are filtered out. The appearance similarity between the target and a candidate is calculated by normalized cross correlation (NCC) at the gray level. Our single target tracking results are compared with the IVT Tracker [66], the P-N Tracker [34] and CTM [62].
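For reference, a minimal sketch of the gray-level NCC similarity used here, for two equal-size patches:

    import numpy as np

    def ncc(a, b):
        """Normalized cross correlation between two gray-level patches,
        in [-1, 1]; higher means more similar."""
        a = a.ravel().astype(float) - a.mean()
        b = b.ravel().astype(float) - b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a.dot(b) / denom) if denom > 0 else 0.0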
4.3 Experimental Validation

We apply the Motion Structure Tracker (MST) in four sets of experiments. The video sequences we use can be divided into two groups, temporally stationary and temporally non-stationary scenes, depending on whether the motion patterns change over time. Figure 4.5 shows three examples of Marathon sequences (from [6, 5] and YouTube), and Figure 4.6 shows three examples of crowded scenes from the ICCV 2011 data-driven crowd analysis dataset [65], in which motion patterns remain the same from beginning to end. For such sequences, we only need to learn motion patterns in the first few frames, and then use them for the whole sequence. On the contrary, the HongKong sequence [5] and the Italian motorbike sequence (from YouTube) have changing motion patterns, making it necessary to update the motion pattern learning results online. Besides, the task performed in each sequence can also be divided into two groups: single target tracking and multiple target tracking. Therefore, four sets of experiments are designed for the four combinations.

In all the experiments, we use 10 ferns and 13 features per fern, and we fix \sigma_1 = 10, \sigma_2 = 5, \lambda = 2. Results are robust to a certain range of these parameters. In the outlier filtering step, the points whose \lambda_1 is smaller than 0.02 of the median \lambda_1 of all the points are filtered out. The appearance similarity between the target and a candidate is calculated by normalized cross correlation (NCC) at the gray level. Our single target tracking results are compared with the IVT Tracker [66], P-N Tracker [34] and CTM [62].

Figure 4.6: Temporally stationary scenes and examples of targets from the ICCV 2011 data-driven crowd analysis dataset [65].

4.3.1 Single target tracking results in temporally stationary scenes

The sequences in Figure 4.5 and Figure 4.6 capture athletes/pedestrians from static overhead cameras. They are challenging real-world scenes due to the presence of hundreds of similar small-size objects and frequent occlusions. The six sequences have 343, 249, 143, 291, 393 and 1384 frames respectively, with resolutions of 720x404, 1280x720, 480x360, 640x360, 640x360 and 640x360. In each experiment, we manually select a rectangular region around a target in the first frame. In each sequence, the first 50 frames are used to learn motion patterns, and 10 targets (whole body, upper body or heads) are randomly selected for testing, with average target sizes of 15x21, 21x26, 17x29, 20x24, 22x26, and 24x30 pixels respectively. In the Matlab application provided by [62], the bounding box size is fixed and only the target center is the input; since target size varies across sequences, we resize the input images of every sequence to the most appropriate size.

Ground truth is manually labeled for each target in each frame. Two criteria are used to compare the tracking results between different trackers. (1) Average Track Ratio (ATR), calculated as the number of successfully tracked frames divided by the total number of frames in which a target is in the FOV. (2) Average Center Location Error (ACLE), which measures the pixel difference between the tracked object center and the ground truth.

Method             Marathon-1      Marathon-2      Marathon-3
                   ATR    ACLE     ATR    ACLE     ATR    ACLE
IVT Tracker [66]   35.2%  62.8     33.5%  86.5     40.0%  64.1
P-N Tracker [34]   56.2%  35.1     68.6%  56.4     69.2%  33.9
CTM [62]           52.4%  38.8     65.7%  62.8     71.7%  30.5
Ours               81.4%  6.7      73.1%  28.5     91.1%  4.8

Table 4.1: Tracking evaluation results of single targets in temporally stationary scenes from Figure 4.5.

Method             Sequence-1      Sequence-2      Sequence-3
                   ATR    ACLE     ATR    ACLE     ATR    ACLE
IVT Tracker [66]   40.3%  51.8     31.4%  70.3     45.2%  52.3
P-N Tracker [34]   59.6%  32.1     56.2%  60.4     72.1%  28.5
CTM [62]           52.5%  45.3     61.7%  53.2     69.3%  31.2
Ours               77.3%  15.2     75.9%  29.5     89.2%  9.8

Table 4.2: Tracking evaluation results of single targets in temporally stationary scenes from Figure 4.6.

The results are presented in Table 4.1 and Table 4.2. Our tracker outperforms the other state-of-the-art trackers, and we get low ACLE because even when our tracker shifts to a wrong object, it is still following the motion pattern and thus stays close to the target. The best results are on the Marathon-3 sequence, since it contains relatively large targets with clear appearance. In Marathon-1, runner size is small, and there is viewpoint change as runners run through the U-shaped street. In Marathon-2, we only track the heads of runners, so the discriminative power is weak; also, trees in Marathon-2 cause occlusion, which is a major source of errors. An example of tracking results is shown in the first row of Figure 4.7.

Figure 4.7: Examples of tracking results comparison. First row: temporally stationary scenes. Second row: temporally non-stationary scenes.
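For reference, both criteria are simple to compute from the per-frame labels; a minimal sketch, where the success test for ATR uses a hypothetical center-error threshold (the exact success criterion is not pinned down above):

```python
import numpy as np

def atr_acle(centers, gt, success_thresh=15.0):
    """Average Track Ratio and Average Center Location Error for one target.

    centers, gt: (T, 2) arrays of (x, y) centers over the T frames in
    which the target is in the field of view.
    """
    err = np.linalg.norm(centers - gt, axis=1)   # per-frame pixel error
    atr = np.mean(err < success_thresh)          # fraction of tracked frames
    acle = err.mean()                            # average center error
    return atr, acle
```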
4.3.2 Single target tracking results in temporally non-stationary scenes

To capture changing motion patterns, an online motion pattern learning and tracking framework is built. Each time, we consider a fixed-length window of 40 frames, extract motion pattern information in the window, and utilize it to assist tracking. The window then shifts by 40 frames to deal with the next 40 frames (or fewer in the last window), and so on.

The Hongkong sequence (first row in Figure 4.4) and the Motorbike sequence (second row in Figure 4.4) have 248 and 100 frames respectively. In each sequence, 10 targets are randomly selected, with average sizes of 15x24 and 30x22 pixels respectively. Since each of the two sequences contains two motion patterns, we divide the motion pattern points in (x, y, v_x, v_y) space into two groups by K-means. To calculate L_{MP}(v), we first decide which motion pattern a point (or the object it comes from) belongs to by the vote from the object's past trajectory.

Method             Hongkong        Motorbike
                   ATR     ACLE    ATR     ACLE
IVT Tracker [66]   27.63%  58.9    31.56%  69.7
P-N Tracker [34]   39.58%  42.3    47.22%  55.4
CTM [62]           52.17%  35.2    42.35%  58.3
Ours               62.31%  28.5    88.75%  5.6

Table 4.3: Tracking evaluation of single targets in temporally non-stationary scenes.

The motion pattern learning results for the two whole sequences are shown in Figure 4.4. Figure 4.4 (b) shows the visualization of the projection in (x, y, v_x) space, and Figure 4.4 (c) shows the projection on images. The tracking results comparison is presented in Table 4.3, and an example is shown in the second row of Figure 4.7. Our tracker still significantly outperforms the others. In the Motorbike sequence, a large number of similar objects exist, and the motion pattern prior effectively reduces drift. The Hongkong sequence is the most challenging one, since its motion patterns are not as clear as the others'.

4.3.3 Multi-target tracking results in temporally stationary scenes

In each experiment, we manually select a rectangular region around a target in the first frame. We track the target for 10 frames to train our detector. Going back to the first frame, we use the detector to detect n_1 = 6 similar objects. The window size L is fixed at 8. If the tracking result for each target (by the detector and the structure flow tracker) has confidence (in [0, 1]) larger than 0.8, we move to the next frame; otherwise we use the L = 8 window to jointly optimize. The detection and tracking results are shown in Figure 4.8 (a). A red rectangle denotes the user-labeled target, a blue rectangle denotes a true positive detection of a similar object, and a yellow rectangle denotes a false positive detection in the first frame. As we increase n_1, more false positives are brought in.

Figure 4.8: Simplified multi-target detection and tracking results in temporally (a) stationary and (b) non-stationary scenes. The red rectangle denotes the user-labeled target. Blue rectangles denote the similar objects detected by the learned detector.

4.3.4 Multi-target tracking results in temporally non-stationary scenes

Settings are the same as before. Online motion pattern learning is performed with the same settings as in Section 5.2. Detection and tracking results are shown in Figure 4.8 (b). They show that tracking helps us remove some false alarms and correct some others. For example, the false positive on the right in Figure 4.8 (b) covers two people in the first frame, but the tracker gradually locks onto one target, so the detection is kept in the tracking results.
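As an implementation note on the pattern grouping used in the non-stationary experiments above (Section 4.3.2), splitting motion-pattern points into two groups and assigning an object to a pattern by a trajectory vote are both short routines; a sketch assuming scikit-learn is available (an assumption for illustration, not part of the released code):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_motion_patterns(points, k=2):
    """Group motion-pattern points in (x, y, vx, vy) space into k patterns."""
    return KMeans(n_clusters=k, n_init=10).fit_predict(points)

def assign_by_trajectory(traj_labels):
    """Vote over a target's past trajectory points: the pattern that the
    majority of its points fall into decides which prior is used in L_MP."""
    return int(np.argmax(np.bincount(traj_labels)))
```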
4.4 Conclusion

This chapter investigates the problem of tracking single and multiple targets in structured crowded scenes with the Motion Structure Tracker, which combines several research topics: visual tracking, motion pattern learning, and multiple target tracking. Although each topic has been intensively studied, they had not been jointly considered before. The experimental results on several challenging sequences and the comparison with state-of-the-art methods demonstrate the effectiveness of the Motion Structure Tracker. In future work, the Motion Structure Tracker can be improved from the following perspectives.

Multi-target detection and tracking in very crowded scenes is extremely challenging for the reasons mentioned before. This problem is key for analyzing group behaviors and preventing crimes in surveillance, yet it is rarely explored. Therefore, this interesting topic is worth more attention.

In the current framework, a simplified version of multi-target tracking is explored by learning from one or very few positive examples labeled by the user. The imbalance between the very small number of positive examples and the very large number of negative examples makes it very difficult to get good detection results, no matter what classifier is utilized. It can possibly be improved in two ways.

First, collecting some training data of the targets of interest is useful for training the classifier offline. The data can be selected from the large amount of surveillance video published on the internet. One good resource is the ICCV 2011 data-driven crowd analysis dataset [65]; some examples of scenes from the dataset are shown in Figure 4.9. This could help us build a much more accurate classifier, although the requirement of training data makes the detector less general than it used to be. But this is effective and meaningful since, in most cases, the crowded scenes we are interested in are surveillance videos of specific objects such as pedestrians.

Figure 4.9: Some examples of the scenes from the ICCV 2011 data-driven crowd analysis dataset [65].

Second, simple binary features are currently used in the random fern classifier. More advanced and descriptive features can be tried to extract more information about the targets. Also, advanced fast pattern matching methods such as Matching by Tone Mapping (MTM) [26] may be added as a complement to the detector, to get a better measure of similarity between possible pairs of objects when making decisions on data association.

Once we have multiple-target detection results, the data association methods used in traditional MTT can be brought in, with additional constraints for crowded scenes. Context information can be used to assist the association process. For instance, the relative positions of a group of targets are very likely to be maintained over a short time scale in crowded scenes. Therefore, although a single target is difficult to track accurately, tracking the group together may become more solvable, since the new target becomes much bigger and contains many more pixels.

Besides, unusual activity detection can be performed by using both the motion patterns and the multiple-target tracking results. Here, unusual activity is defined as behavior different from what most people do. For instance, in a Marathon, a person moving against the majority is considered unusual and needs special attention from the human operator.

Chapter 5
Online Distributed Implementation

With the decreasing cost of collecting data, one big challenge nowadays is how to efficiently process big data. One example is Wide Area Aerial Surveillance (WAAS) imagery. We already showed the effectiveness of our algorithms on WAAS data in Section 3.6.4.2; however, that problem was not solved in the most efficient way, and efficiency becomes the main focus of this chapter. WAAS imagery often covers a geographic area of a few square kilometers.
This data is characterized by its large format (60-100 megapixels per frame), multi-sensor capture, low sampling rate, and grayscale imagery (see Figure 5.1). All of these properties pose challenges for computer vision analysis, but the main challenge we are concerned with here is the sheer amount of data to process, currently reaching about 100 megapixels per second, and expected to grow to nearly 2 gigapixels ten times a second in the future. Simply storing the data is a challenge, not to mention semantic analysis, which is the focus of our work.

As with other big data problems, online efficient and scalable algorithms and distributed computation are our solutions.

Figure 5.1: Example of WAAS imagery, full frame (left) and detail (right).

5.1 Distributed Computation

In order to achieve real-time processing at WAAS scale, a natural way is to divide up the work by creating a grid of spatial tiles, each covering a certain area of the monitored region (we assume the imagery has been stabilized and georegistered), so that motion pattern learning can be performed independently on each one.

Choosing the tile size is important for the success of distributed computation. It depends on the resolution of the input data and the coverage of the map. If each tile is too small, as shown in the upper image in Figure 5.2, then no clear patterns form in the region of coverage. If each tile is too large, as shown in the lower image in Figure 5.2, then computational power is wasted and no significant speedup is obtained. Once a proper tile size is used, the number of points in the volume of each tile is both large enough to perform informative denoising and local structure inference, and small enough to achieve high efficiency. Also, the tiles should have small overlap regions to get accurate calculations and to avoid additional computation in the results integration stage.

Figure 5.2: Choosing the spatial division tile size is important for the success of distributed computation. Up: if each tile is too small, then no clear patterns form in the region of coverage. Down: if each tile is too large, then computational power is wasted and no significant speedup is obtained.

The proposed motion pattern learning method is by nature suitable for parallel processing by dividing the data space into small volumes. This is because Tensor Voting is based on local neighborhood information instead of the global structure. Each input point only receives votes from other points in its neighborhood, whose size is determined by the voting scale \sigma; all points outside the neighborhood have no effect on it. Therefore, as long as we choose a proper overlap size for the tiles, the results obtained within each tile in a distributed way are the same as the results obtained by processing the region as a whole.

As shown in Figure 5.3, we first divide the spatial domain into non-overlapping regions (in blue); then, depending on the voting scale, which determines the neighborhood size, we consider a larger area (in red) for each tile to perform motion pattern learning. After independent processing of each tile, only the results within the blue region of each tile are kept. Thus, the results are stitched back together with no need for further processing.

Figure 5.3: Parallel estimation of tracks and motion patterns on a cluster of computers is enabled by creating an overlapping grid of tiles.

Compared to processing the region as a whole, distributed computation is more efficient in two respects. First, for each point, there is a smaller candidate pool for selecting nearest neighbors, which is the most time consuming step in Tensor Voting, since only the points in its corresponding tile need to be considered. Second, the tiles are completely independent from each other, so they can be processed by different processors simultaneously, and moreover, there is no further step to stitch the results back together. Assume we divide the whole region into M x N tiles; distributed Tensor Voting is then roughly M^2 x N^2 / c^2 times faster than processing the whole region together, where c is a factor accounting for the effect of the overlap regions, and c grows as M and N increase. Intuitively, the larger M and N are, the smaller each region is, and the larger the overlap region needs to be.
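The tile bookkeeping itself is lightweight. A minimal sketch that builds the overlapping grid of Figure 5.3, assuming the overlap margin is chosen from the voting scale \sigma so every kept point still receives all votes from its neighborhood (function and parameter names are hypothetical):

```python
def make_tiles(width, height, M, N, margin):
    """Divide a width x height region into an M x N grid of tiles.

    Each tile is returned as (inner, outer): 'inner' is the non-overlapping
    region whose results are kept, and 'outer' pads it by 'margin' (set from
    the voting scale) so votes near tile borders are still complete.
    """
    tiles = []
    tw, th = width / M, height / N
    for i in range(M):
        for j in range(N):
            inner = (i * tw, j * th, (i + 1) * tw, (j + 1) * th)
            outer = (max(0.0, inner[0] - margin), max(0.0, inner[1] - margin),
                     min(width, inner[2] + margin), min(height, inner[3] + margin))
            tiles.append((inner, outer))
    return tiles

# Points inside a tile's outer box cast and receive votes; only results whose
# positions fall in the inner box are kept, so stitching needs no extra work.
```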
To test the effectiveness and efficiency, experiments are performed on the same sequence as in Section 3.6.4.2. The parallel estimation results of motion patterns are shown in Figure 5.4. The motion pattern learning algorithm processes the 1312 x 738 region as a whole at 1.89 frames per second. When the region is divided into 4 overlapping tiles, each 756 x 469 in size, the algorithm is able to process the data at 16.8 frames per second. This efficiency is determined by the longest processing time among the tiles, and there is no extra cost to combine the results from the tiles. Since the data is captured at about 2 frames per second, this shows that our algorithm is able to process in real time in a distributed way. Moreover, it is worth noting that learning motion patterns is a step independent of tracking, which means it does not slow down real-time online tracking; once it is done, the results can be integrated into the tracking module.

Figure 5.4: Parallel estimation results of motion patterns.

5.2 Online Implementation

5.2.1 Temporally Incremental Tensor Voting

In practice, video data usually arrives in a streaming way, and the motion patterns may or may not change; these cases are called temporally stationary and non-stationary scenes in the previous section. Some examples of temporally stationary scenes are shown in the first row of Figure 5.5. In these scenes, motion patterns appear to be the same as time passes, so the motion patterns learned at the beginning can be used as a prior to assist tracking and detection in frames long after. The second row of Figure 5.5 shows some examples of temporally non-stationary scenes. Take the intersection scene at the lower left as an example: we may only learn patterns from vehicles coming from left and right in the first 60 seconds, as the traffic signal only permits such movements. Then in the second 60 seconds, only vehicles coming from up and down can move, as the traffic signal changes. In this case, motion patterns change due to traffic rules. Motion patterns may also change simply because traffic conditions change as time passes. Therefore, to learn motion patterns in a unified framework, we should be able to update motion patterns online.

Figure 5.5: First row: examples of temporally stationary scenes. Lower row: examples of temporally non-stationary scenes.

Since Tensor Voting is a key step in the motion pattern learning framework, a more fundamental question is how to perform online Tensor Voting. Suppose that in the first time slot, N_1 points are given as input. Tensor Voting requires N_1^2 calculations. In the second time slot, another N_2 points are added. Considering the N_1 + N_2 points as a whole, (N_1 + N_2)^2 calculations are needed to perform Tensor Voting. But in fact, not all of these calculations are needed, since the N_1 points have already voted for each other. Therefore, we only need the N_1 and N_2 points to vote for each other, and the N_2 points to vote among themselves. Altogether, 2 x N_1 x N_2 + N_2^2 calculations need to be performed.

More generally, suppose that in the first k time slots, there are already {N_i}, i = 1, 2, 3, ..., k points. In time slot k+1, another N_{k+1} points arrive. Instead of performing (N_1 + N_2 + ... + N_{k+1})^2 calculations, only 2 x (N_1 + N_2 + ... + N_k) x N_{k+1} + N_{k+1}^2 calculations are needed. To make it simpler, assume that in each time slot the same number N of points is added. Then after time slot k+1, (2k+1) x N^2 calculations are performed instead of (k+1)^2 x N^2. And the more time passes, i.e., the larger k is, the more speedup we get from temporally incremental Tensor Voting.
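The counting argument above corresponds to a simple update rule; a sketch of the incremental vote accumulation, where vote() is a stand-in for the actual tensor vote between two points and tensors holds the pre-initialized accumulated tensor of each point (hypothetical interfaces):

```python
def incremental_votes(old_idx, new_idx, tensors, vote):
    """Update accumulated tensors when a new batch of points arrives.

    Instead of recomputing all (N1 + N2)^2 votes, only the 2*N1*N2 cross
    votes and the votes among the N2 new points are cast.
    """
    for i in old_idx:                 # cross votes: old <-> new (2*N1*N2)
        for j in new_idx:
            tensors[i] += vote(j, i)
            tensors[j] += vote(i, j)
    for i in new_idx:                 # votes among the new points (~N2^2)
        for j in new_idx:
            if i != j:
                tensors[j] += vote(i, j)
    return tensors
```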
5.2.2 Online Motion Pattern Learning

Based on the proposed temporally incremental Tensor Voting, motion patterns can be learned in the following way. For every Y frames, the first X frames are used to learn and update motion patterns. The values of Y and X can be set according to the properties of the videos. For example, for scenes with fast changing traffic conditions, the time interval can be set to 1 minute, and for scenes with slowly changing traffic conditions, the time interval can be set large, e.g., 15 minutes. Suppose that for the time intervals k = 1, 2, 3, ..., N_k points are extracted from tracklets respectively. From N_1, initial motion patterns can be learned. Thereafter, instead of combining the tracklet points from all the time slots to learn motion patterns, we perform Tensor Voting incrementally as explained in the previous section. The remaining procedure of motion pattern learning is the same as before. Therefore, motion patterns can be updated in an efficient online fashion.

5.3 GPU Implementation of Tensor Voting

With the wide use of Graphics Processing Units (GPUs), many algorithms that can be parallelized are implemented on GPUs, largely increasing efficiency. Therefore, in addition to the online and distributed processing, we also implement Tensor Voting on the GPU, since it is a key step in the motion pattern learning framework. GPUs have been used to accelerate Tensor Voting before [47], but that implementation specifically targets 2D, 3D, 4D and 5D Tensor Voting. In this section, we propose an algorithm to handle the general D-dimensional GPU implementation.

5.3.1 The Design of the GPU Implementation

The implementation is based on the closed-form version of Tensor Voting; details can be found in [21]. In Tensor Voting, the most time consuming step is to find the neighbors of every input point, where the size of the neighborhood is decided by the voting scale. These neighbors then vote for the input point, as shown in Figure 5.6. For each D-dimensional point, the distances from it to all the other points can be calculated in parallel, as can the votes from the neighbors.

Figure 5.6: Implementation of the parallel structure of vote collection mode.

We have two choices for performing the parallel computations. One is to find the neighbors of each point on the device (GPU), with all the other calculations performed on the CPU. In this case, only the distances are calculated on the GPU, and the data to be transferred back from GPU to CPU is only N x N boolean data indicating the neighborhood information.

The other choice is to also perform the votes, i.e., the tensors cast from all the neighbors, on the GPU. The idea sounds appealing, since almost all of the calculations that can be parallelized are performed on the GPU, leaving little work for the CPU. However, this second choice is impractical in many cases. First, the memory required to store the intermediate results for each thread increases greatly. Since register memory is limited for each thread, global memory is used instead, which is much slower to access than register memory; and even so, the device may not have enough memory to hold these results. Second, a more fundamental difficulty is that this method incurs a large amount of memory transfer from GPU to CPU, which is a very expensive procedure that offsets the speed gain. A detailed analysis is given in the next section.

Therefore, only the neighbor selection is performed on the GPU. Results show that the GPU implementation achieves a 50-500 times speedup over the CPU implementation for general N-D Tensor Voting.
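Logically, the neighbor-selection step offloaded to the GPU is a blocked pairwise-distance test. A CPU-side numpy sketch of the same chunked scheme (the chunking mirrors the device-memory limits discussed in the next section; names are hypothetical):

```python
import numpy as np

def neighbors_by_chunks(points, radius, chunk=1024):
    """Boolean N x N neighborhood matrix for N D-dimensional points,
    computed chunk by chunk, as the GPU kernel would: each chunk of rows
    is independent, so chunks can be processed in parallel."""
    n = len(points)
    nbr = np.zeros((n, n), dtype=bool)
    r2 = radius * radius                 # radius derived from voting scale
    for s in range(0, n, chunk):
        block = points[s:s + chunk]      # (c, D) rows handled together
        d2 = ((block[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        nbr[s:s + chunk] = d2 <= r2
    return nbr
```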
5.3.2 Limitations of the GPU Implementation

There are some limitations to the computational speedup obtainable on the GPU.

First, data transfer is a bottleneck. This is a common problem for distributed parallel processing. Although the parallel computations performed on the GPU are fast, the speed gain is sometimes offset by the data transfer. To do any manipulation on the GPU, we first need to transfer the data from host memory to device memory; then, after the operations on the GPU, we need to transfer the results from device memory back to host memory. In the implementation of Tensor Voting, assume there are N points in D-dimensional space, each coordinate stored as a float. Then at least D x N float values need to be transferred to device memory. If we want to return the neighborhood information, that is N x N boolean values to be transferred back to host memory. Assume we have 10,000 points in 4D space; then 10,000 x 4 x 4 bytes, or approximately 0.15 megabytes, must be transferred to device memory, and 10,000 x 10,000 x 1 bytes, or approximately 95.37 megabytes, must be transferred back to host memory. This is a simplified ideal-case estimate: in practice, due to the memory limitation discussed below, we need to load the data several times, which results in a larger data transfer load. As discussed in the design section, the other choice is to return, for each input point, the tensor voted from all the points, or the vectors from which the tensors are computed (to save memory). Since each vector for a D-dimensional point is D-dimensional, the data to be transferred back to host memory is then at least D x N^2 float values. Taking the same example of 10,000 4D points, that is 4 x 10,000^2 x 4 bytes, or approximately 1526 megabytes, which is a huge number. Clearly, the data size grows linearly in D and quadratically in N.

From the above analysis, we can see that the size of the data to be transferred is large, and on current devices the data transfer procedure is very time consuming. So in many situations the speed gain on the GPU device is offset by this expensive process.

Second, another limitation on the speedup is the memory available on CUDA devices. The detailed technical specifications vary by product, and the information can be found in the manuals. The limited device memory (e.g., global memory, shared memory, etc.) sometimes incurs additional computational cost. In the GPU implementation of Tensor Voting, ideally we want to load the N D-dimensional points into device memory from host memory only once to perform the N x N calculations. But when the number of points is large and/or the dimensionality is high, the device memory cannot hold all the data, so it must be processed chunk by chunk, resulting in repeated transfer of the same data. Moreover, the register memory associated with each thread is limited, so reading and writing intermediate results takes additional time. Better solutions can be carried out as the capability of GPU devices improves.

5.4 Conclusion

This chapter investigates the problem of efficiently processing large scale imagery.
For motion pattern learning, we speed up the process both spatially and temporally. Spatially, we divide the region into overlapping tiles to enable parallel independent computation, without extra cost in the integration step. Temporally, we propose an online motion pattern update framework based on temporally incremental Tensor Voting. Moreover, a GPU implementation is presented to further speed up all the calculations related to Tensor Voting.

Chapter 6
Conclusions

In this work, we design a system to understand the semantic meaning of a scene. We have proposed a method to detect multiple semantically meaningful motion patterns in an unsupervised manner, and a novel robust grouping algorithm making full use of local geometric information is designed. For large scale imagery, our system is able to process the input efficiently through distributed computation.

Once motion pattern information is extracted, it is integrated into the detection association step for multiple target tracking in general scenes, and largely improves the performance. Furthermore, the Motion Structure Tracker is proposed to investigate the problem of tracking single and multiple targets in structured crowded scenes when the "detect-then-associate" strategy fails. MST combines several research topics: motion pattern learning, visual tracking, and multiple target tracking. Although each topic has been intensively studied, they had not been jointly considered before. Extensive experiments are performed on both synthetic data and challenging real-world sequences. The comparison with state-of-the-art methods demonstrates the effectiveness of the motion pattern learning algorithm and the Motion Structure Tracker.

6.1 Contributions

6.1.1 Efficient Motion Pattern Learning

A novel framework for learning motion patterns is first presented.

- Robust Multiple Manifold Grouping algorithm: The RNMG algorithm makes full use of the local geometric information of the input points with the help of Tensor Voting. Compared to state-of-the-art candidate algorithms, which focus on linear subspace segmentation, the RNMG algorithm proves good at grouping motion patterns, which are nonlinear in many cases.

- Online Distributed Implementation: To handle the now-prevailing large scale data, we take wide area aerial surveillance video as an application and propose effective and efficient solutions. Spatially, we divide the input space into overlapping tiles to enable parallel independent processing. Temporally, we propose an online update algorithm to avoid redundant calculations and handle streaming data in an online fashion.

6.1.2 Motion Structure Tracker for Very Crowded Scenes

The Motion Structure Tracker is specifically designed for tracking and detection in structured crowded scenes. It has several advantages over existing methods:

- Better tag-and-track for a single target: In crowded scenes, the number of targets is large, and the number of pixels on each target is usually small. In this situation, the targets are difficult to distinguish from each other, so common visual trackers easily jump to other similar objects during tracking. By using motion pattern information as a prior, MST combines the two stages of tracking and detection, and gets better results than state-of-the-art visual trackers.

- Solved a simplified version of multi-target tracking: Usually, multi-target tracking means tracking all the targets in a scene, which is almost impossible in very crowded scenes. Alternatively, we proposed and solved a simplified version of MTT.
It requires the user to label some targets in the first frame; MST tracks them for a few seconds, builds the detector in the process, goes back to the first frame to detect similar targets, and then tracks them all together. This simplified version is a meaningful step toward solving the real MTT problem in very crowded scenes.

- Online motion pattern learning, detection and tracking: The proposed algorithm sequentially processes both temporally stationary and non-stationary scenes. Motion patterns are inferred and updated in an efficient online fashion, and are used to assist both single and multiple object tracking.

6.2 Future Work

There are many open problems and promising directions regarding motion pattern learning and its applications in tracking and detection. The work can be extended along the following lines.

6.2.1 Motion Pattern Learning

One can explore more applications of motion pattern learning. Using motion patterns to improve tracking is only one way to better understand the scene. Going further, motion pattern information may be used to help anomaly detection and behavior prediction.

6.2.2 Tracking and Detection in Very Crowded Scenes

In future work, one can focus on improving the Motion Structure Tracker to better handle multi-target tracking. In the current framework, the imbalance between the very small number of positive examples and the very large number of negative examples is the bottleneck of the detection step, no matter what classifier is used. It can be improved in both offline and online fashion. Offline, we will collect some training data of the targets we are interested in tracking to train the classifier. Online, more advanced features will be tried, and advanced fast pattern matching methods such as Matching by Tone Mapping (MTM) will be added as a complement to the detector.

To improve tracking, context information will be used in the data association step, such as the relative positions of a group of targets.

Appendix A
N-D Closed-Form Voting: CPU and GPU Implementation

A.1 Introduction

This package contains the CPU and GPU implementations of the closed-form solution of Tensor Voting in N-D space [21].

A.2 Usage

A sample usage of the code is contained in the library TensorVotingLib. The CPU implementation uses a brute-force search for the nearest neighbors. For the GPU implementation, refer to Section 5.3.

A.3 Parameter

There is only one parameter, which is the same as the one in standard Tensor Voting.

Appendix B
Motion Pattern Learning

B.1 Introduction

This package contains the implementation of the motion pattern learning framework. It can handle input from general scenes, crowded scenes, and wide area aerial surveillance videos. The KLT tracker is used to extract tracklets of feature points.

B.2 Usage

For scenes that do not need segmentation, only the C++ project "MMC" is needed. In the MMC project, video sequences are given as input, and users can choose whether to project 2D tracklet points to (x, y, V_x, V_y) space or (x, y, \theta) space, depending on the properties of the input. Tensor Voting is then performed to filter outliers. To handle large input such as WAAS video, distributed processing is enabled; details can be found in Section 5.1. To perform segmentation, the Matlab function "Multiple Manifold Cluster.m" is used. All the auxiliary functions are also included.

B.3 Parameter

There are several parameters to set, such as the start frame, the end frame, the number of feature points to extract in each frame, and the voting scale. Settings vary with input videos; details and sample settings are given in the code.
Appendix C
Motion Structure Tracker

C.1 Introduction

This package contains the implementation of the Motion Structure Tracker for crowded scenes proposed in [90]. It is based on the implementation provided by the authors of the P-N tracker [34]. The main function of the standard P-N tracker is run_TLD.m, and the main function of MST is run_TLD_mp.m. All the functions related to motion patterns are in a separate folder named "motionpattern".

C.2 Usage

When running the tracker, an input sequence must be given, and a target must be labeled in the first frame. Motion patterns are learned separately by the method in the previous package and given as input.

C.3 Parameter

There are several parameters to set, such as the minimal size of the object's bounding box in the scanning grid, the size of the normalized patch in the object detector, and the scan range for detection. The explanation of every parameter and sample settings can be found in the code.

C.4 Note

To measure the performance of tracking, a ground truth labeling tool [59] is also provided.

Reference List

[1] CLIF dataset. https://www.sdms.afrl.af.mil/index.php?collection=clif2006.
[2] CMU graphics lab motion capture database. Available at http://mocap.cs.cmu.edu/.
[3] Hopkins 155 Matlab implementation for subspace clustering algorithms. Available at http://www.vision.jhu.edu/code/.
[4] Next generation simulation (NGSIM) dataset. Available at http://www.ngsim.fhwa.dot.gov/.
[5] S. Ali and M. Shah. A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. CVPR, pages 1-6, 2007.
[6] S. Ali and M. Shah. Floor fields for tracking in high density crowd scenes. ECCV, pages 1-14, 2008.
[7] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. CVPR, pages 1-8, 2008.
[8] G. Antonini, S. V. Martinez, M. Bierlaire, and J. P. Thiran. Behavioral priors for detection and tracking of pedestrians in video sequences. IJCV, 69(2):159-180, 2006.
[9] B. Babenko, M.-H. Yang, and S. Belongie. Visual tracking with online multiple instance learning. CVPR, pages 983-990, 2009.
[10] B. Babenko, M.-H. Yang, and S. Belongie. Visual tracking with online multiple instance learning. CVPR, pages 983-990, 2009.
[11] J. Berclaz, F. Fleuret, and P. Fua. Multi-camera tracking and atypical motion detection with behavioral maps. ECCV, pages 112-125, 2008.
[12] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE PAMI, 33(9):1806-1819, Sept. 2011.
[13] J. Black, T. Ellis, and P. Rosin. A novel method for video tracking performance evaluation. PETS, pages 125-132, 2003.
[14] A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. ICCV, pages 1-8, 2007.
[15] M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. V. Gool. Robust tracking-by-detection using a detector confidence particle filter. ICCV, 2007.
[16] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker. Geometric diffusion as a tool for harmonic analysis and structure definition of data, part I: Diffusion maps. The National Academy of Sciences, 2005.
[17] C. Tomasi and T. Kanade. Detection and tracking of point features. IJCV, 1991.
[18] T. B. Dinh, N. Vo, and G. Medioni. Context tracker: Exploring supporters and distracters in unconstrained environments. CVPR, pages 1177-1184, 2011.
[19] E. Elhamifar and R. Vidal. Sparse subspace clustering. CVPR, pages 2790-2797, 2009.
[20] A. Goldberg, X. Zhu, A. Singh, Z. Xu, and R. Nowak. Multi-manifold semi-supervised learning. In AISTATS, 2009.
[21] Dian Gong. Structure learning for manifolds and multivariate time series.
PhD thesis, University of Southern California, 2013.
[22] Dian Gong, Xuemei Zhao, and Gérard Medioni. Robust multiple manifolds structure learning. ICML, pages 321-328, 2012.
[23] H. Grabner, C. Leistner, and H. Bischof. Semi-supervised online boosting for robust tracking. ECCV, pages 234-247, 2008.
[24] H. Grabner, C. Leistner, and H. Bischof. Semi-supervised online boosting for robust tracking. ECCV, pages 234-247, 2008.
[25] M. Hein and M. Maier. Manifold denoising. NIPS, pages 561-568, 2006.
[26] Y. Hel-Or, H. Hel-Or, and E. David. Fast template matching in non-linear tone-mapped images. ICCV, 2011.
[27] M. Hu, S. Ali, and M. Shah. Detecting global motion patterns in complex videos. ICPR, 2008.
[28] M. Hu, S. Ali, and M. Shah. Learning motion patterns in crowded scenes using motion flow field. ICPR, pages 1-5, 2008.
[29] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, and S. Maybank. A system for learning statistical motion patterns. PAMI, 28(9):1450-1464, 2006.
[30] C. Huang and R. Nevatia. High performance object detection by collaborative learning of joint ranking of granules features. CVPR, pages 41-48, 2010.
[31] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. ECCV, pages 788-801, 2008.
[32] N. Johnson and D. Hogg. Learning the distribution of object trajectories for event recognition. Image and Vision Computing, 14:609-615, 1996.
[33] Z. Kalal, J. Matas, and K. Mikolajczyk. Online learning of robust object detectors during unstable tracking. OLCV, pages 1417-1424, 2009.
[34] Z. Kalal, J. Matas, and K. Mikolajczyk. P-N learning: Bootstrapping binary classifiers by structural constraints. CVPR, pages 49-56, 2010.
[35] C. S. Mark Keck and Luis Galup. Real-time tracking of low-resolution vehicles for wide-area persistent surveillance. Workshop on Applications of Computer Vision, pages 441-448, 2013.
[36] L. Kratz and K. Nishino. Tracking with local spatio-temporal motion patterns in extremely crowded scenes. CVPR, pages 693-700, 2010.
[37] B. Leibe, K. Schindler, and L. V. Gool. Coupled detection and trajectory estimation for multi-object tracking. ICCV, 2007.
[38] D. Lin, E. Grimson, and J. Fisher. Learning visual flows: A Lie algebra approach. CVPR, 2009.
[39] D. Lin, E. Grimson, and J. Fisher. Modeling and estimating persistent motion with geometric flow. CVPR, 2010.
[40] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. IJCAI, pages 674-679, 1981.
[41] M. A. Álvarez, J. Peters, B. Schölkopf, and N. D. Lawrence. Switched latent force models for movement segmentation. In Advances in Neural Information Processing Systems 23, pages 55-63. MIT Press, Cambridge, MA, 2011.
[42] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. CVPR, pages 1-8, 2008.
[43] D. Makris and T. Ellis. Path detection in video surveillance. Image and Vision Computing, 20:895-903, 2002.
[44] R. Mehran, B. Moore, and M. Shah. A streakline representation of flow in crowded scenes. ECCV, pages 439-452, 2010.
[45] X. Mei and H. Ling. Robust visual tracking using L1 minimization. ICCV, pages 1436-1443, 2009.
[46] C. Min and G. Medioni. Inferring segmented dense motion layers using 5D tensor voting. JMIV, 30(9).
[47] Changki Min and Gérard Medioni. Tensor voting accelerated by graphics processing units (GPU). ICPR, 3:1103-1106, 2006.
[48] P. Mordohai and G. Medioni. Junction inference and classification for figure completion using tensor voting. Fourth Workshop on Perceptual Organization in Computer Vision, 2003.
[49] P. Mordohai and G. Medioni.
Unsupervised dimensionality estimation and manifold learning in high-dimensional spaces by tensor voting. IJCAI, pages 798-803, 2005.
[50] P. Mordohai and G. Medioni. Tensor voting: A perceptual organization approach to computer vision and machine learning. Morgan and Claypool Publishers, 2008.
[51] P. Mordohai and G. Medioni. Dimensionality estimation, manifold learning and function approximation using tensor voting. JMLR, 11:411-450, 2010.
[52] P. Mordohai and G. Medioni. Registration of 2D points using geometric algebra and tensor voting. JMIV, 37(3):249-266, 2010.
[53] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2002.
[54] O. Oshin, A. Gilbert, J. Illingworth, and R. Bowden. Action recognition using randomised ferns. ICCV, pages 530-537, 2009.
[55] M. Ozuysal, M. Calonder, V. Lepetit, and P. Fua. Fast keypoint recognition using random ferns. PAMI, pages 448-461, 2010.
[56] Y. Peng, A. Ganesh, J. Wright, and Y. Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. CVPR, pages 763-770, 2010.
[57] J. Prokaj and G. Medioni. Inferring tracklets for multi-object tracking. Workshop of Aerial Video Processing joint with IEEE CVPR, pages 37-44, 2011.
[58] J. Prokaj and G. Medioni. Using 3D scene structure to improve tracking. CVPR, pages 1337-1344, 2011.
[59] Jan Prokaj. Exploitation of wide area motion imagery. PhD thesis, University of Southern California, 2013.
[60] Jan Prokaj, Xuemei Zhao, and Gerard Medioni. Tracking many vehicles in wide area aerial surveillance. Workshop on Camera Networks and Wide Area Scene Analysis joint with IEEE CVPR, 2012.
[61] Vladimir Reilly, Haroon Idrees, and Mubarak Shah. Detection and tracking of large number of targets in wide area surveillance. In ECCV, volume 6313 of Lecture Notes in Computer Science, pages 186-199. Springer, 2010.
[62] M. Rodriguez, S. Ali, and T. Kanade. Tracking in unstructured crowded scenes. ICCV, pages 1389-1396, 2009.
[63] M. Rodriguez, J. Sivic, I. Laptev, and J. Audibert. Data-driven crowd analysis in videos. ICCV, pages 1235-1242, 2011.
[64] M. Rodriguez, J. Sivic, I. Laptev, and J. Audibert. Density-aware person detection and tracking in crowds. ICCV, pages 2423-2430, 2011.
[65] Mikel Rodriguez. ICCV 2011 data-driven crowd analysis dataset. http://mikelrodriguez.com/datasets-and-source-code.
[66] D. Ross, J. Lim, R. Lin, and M. Yang. Incremental learning for robust visual tracking. IJCV, pages 125-141, 2008.
[67] Ronald L. Rothrock and Oliver E. Drummond. Performance metrics for multiple-sensor multiple-target tracking. In Proceedings of SPIE, volume 4048, pages 521-531, 2000.
[68] I. Saleemi, L. Hartung, and M. Shah. Scene understanding by statistical modeling of motion patterns. CVPR, pages 2069-2076, 2010.
[69] I. Saleemi, K. Shafique, and M. Shah. Probabilistic modeling of scene dynamics for applications in visual surveillance. PAMI, 31(8):1472-1485, 2009.
[70] L. K. Saul and S. T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. JMLR, 4:119-155, 2003.
[71] C. Stauffer and E. Grimson. Learning patterns of activity using real-time tracking. PAMI, 22(8):747-757, 2000.
[72] Y. W. Teh and S. Roweis. Automatic alignment of local representations. In NIPS, pages 841-848, 2003.
[73] W. Tong, C. Tang, P. Mordohai, and G. Medioni. First order augmentation to tensor voting for boundary inference and multiscale analysis in 3D. PAMI, 26:594-611, 2004.
[74] R. Urtasun, D. J. Fleet, A. Geiger, J. Popovic, T. Darrell, and N. D. Lawrence. Topologically-constrained latent variable models. In ICML, pages 1080-1087, 2008.
[75] R. Vidal, R. Tron, and R. Hartley.
Multiframe motion segmentation with missing data using PowerFactorization and GPCA. IJCV, 79(1):85-105, 2008.
[76] X. Wang, X. Ma, and E. Grimson. Unsupervised activity perception by hierarchical Bayesian models. CVPR, pages 1-8, 2007.
[77] X. Wang, K. Tieu, and E. Grimson. Learning semantic scene models by trajectory analysis. ECCV, pages 110-123, 2006.
[78] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. ICCV, pages 90-97, 2005.
[79] B. Wu and R. Nevatia. Tracking of multiple, partially occluded humans based on static body part detection. CVPR, pages 951-958, 2006.
[80] Tai-Pang Wu, Sai-Kit Yeung, Jia-Ya Jia, Chi-Keung Tang, and Gérard Medioni. A closed-form solution to tensor voting: Theory and applications. PAMI, 34(8):1482-1495, 2012.
[81] Jiangjian Xiao, Hui Cheng, H. Sawhney, and Feng Han. Vehicle detection and tracking in wide field-of-view aerial video. In IEEE CVPR, pages 679-684, 2010.
[82] J. Yan and M. Pollefeys. A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate. ECCV, pages 94-106, 2006.
[83] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Journal of Computing Surveys, 38(4), 2006.
[84] Q. Yu and G. Medioni. Motion pattern interpretation and detection. CVPR, pages 2671-2678, 2009.
[85] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, pages 1601-1608, 2005.
[86] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. CVPR, pages 1-8, 2008.
[87] T. Zhao and R. Nevatia. Bayesian human segmentation in crowded situations. CVPR, 2003.
[88] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. CVPR, 2004.
[89] X. Zhao and G. Medioni. Robust unsupervised motion pattern inference from video and applications. ICCV, pages 715-722, 2011.
[90] Xuemei Zhao, Dian Gong, and Gérard Medioni. Tracking using motion patterns for very crowded scenes. ECCV, pages 315-328, 2012.
[91] D. Zhou, J. Huang, and B. Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. NIPS, pages 1601-1608, 2006.
[92] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf. Ranking on data manifolds. NIPS, 2004.
[93] Feng Zhou, Fernando De la Torre, and Jeffrey F. Cohn. Unsupervised discovery of facial events. In Proc. CVPR, pages 2574-2581, 2010.