Structure Learning for Manifolds and Multivariate Time Series

by Dian Gong

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2013

Copyright 2013 Dian Gong

Acknowledgements

First and foremost, I would like to thank my advisor, Professor Gérard Medioni. I was very fortunate to have Professor Medioni as my mentor. From an academic perspective, Professor Medioni consistently devoted his valuable time to mentoring my research. I benefited greatly from his high standards for research quality and his great insight into research directions. He not only guided me to publish academic papers; more importantly, he taught me to do meaningful work that solves real-world problems. When I was a junior PhD student, his guidance was detail oriented. As my research continued, he encouraged me to come up with my own ideas for research problems. In this way, I fully explored my own interests and potential in meaningful work. Furthermore, Prof. Medioni encouraged me to do internships once I became a PhD candidate. The intern experiences broadened my view of the outside world, gave me direct insight into my research, and led to more interesting research work. The prestigious Mellon Mentoring Award is the best demonstration of how great he is as a PhD advisor.

I would also like to thank Professor Fei Sha, Professor B. Keith Jenkins, Professor Ramakant Nevatia and Professor Jernej Barbic for spending their valuable time on my defense and qualifying exam. In the first year of my PhD, I attended the course Selected Topics in Machine Learning taught by Professor Fei Sha. Fei not only demonstrated the state of the art in the field of machine learning, but also guided me on how to do research on cutting-edge learning techniques. Since then, I have learned a lot from discussions with him, both formal and casual. Professor Jenkins is always very elegant and kind, and I learned a lot from his course on Mathematical Pattern Recognition and from his comments on my research. I enjoyed every conversation with Professor Ramakant Nevatia, and his experience in computer vision was very helpful to me. I want to express my great gratitude to Professor Yan Liu. It was a great experience to collaborate with her on interesting problems in applying machine learning to time series. She is an excellent young researcher in machine learning, and she taught me a lot in our conversations, from academic research to career development.

Over the years at USC, I was fortunate to work together with and learn from good friends and researchers: Kartik Audhkhasi, Prithviraj Banerjee, Yinghao Cai, Jongmoo Choi, Yann Dumortier, Wei Guan, Chang Huang, Eunyoung Kim, Cheng-Hao Kuo, Weikai Liao, Yuan Li, Yuping Lin, Philippos Mordohai, Pradeep Natarajan, Jan Prokaj, Vivek Kumar Singh, Matheen Siddiqui, Pramod Sharma, Bo Yang, Qian Yu and Yili Zhao.

I want to thank my fiancée Xuemei for always being supportive throughout my entire PhD study. There were many nights when I needed to stay in the office before paper submission deadlines, and she was always there with me. The discussions with her were a helpful and important factor in my research work.

Last but not least, I would like to thank my parents and family for their unconditional love and support. This dissertation is dedicated to them.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Problem Definition
  1.2 Issues
  1.3 Related Work
  1.4 Outline

Part I: Structure Learning for Manifolds

2 Tensor Voting
  2.1 Tensor Voting in 2D
    2.1.1 Tensor Representation and Interpretation in R²
    2.1.2 Voting Function in ℜ²
  2.2 Problems in the Standard Framework
    2.2.1 Stick vs. Ball Voting Function
    2.2.2 Inlier Noise

3 Probabilistic Tensor Voting
  3.1 Introduction
  3.2 Related Work
  3.3 Probabilistic Tensor Voting
    3.3.1 Representation
    3.3.2 Sparse Voting Procedure
      3.3.2.1 Vote with Oriented Points
      3.3.2.2 Vote with Unoriented Points
    3.3.3 Dense Vote
  3.4 Theoretical Justification
  3.5 Experiments
    3.5.1 Evaluation Methodology
    3.5.2 Results on Synthetic Data
    3.5.3 Image Contour Grouping
  3.6 Conclusion

4 Unified Tensor Voting in High Dimensional Space
  4.1 Representation in ℜ^D
  4.2 Voting Algorithm in ℜ^D
  4.3 Closed-Form Voting Function
  4.4 Discussion
  4.5 Conclusion

5 Manifolds Denoising
  5.1 Introduction
  5.2 Related Work
  5.3 Locally Linear Denoising
    5.3.1 The LLD Algorithm
    5.3.2 Analysis
  5.4 Experiments
    5.4.1 Evaluation Methodology
    5.4.2 USPS Digit Images
    5.4.3 ORL Face Images
    5.4.4 Comparison to Single-Image Denoising
  5.5 Conclusion

6 Multiple Manifolds Structure Learning
  6.1 Introduction
  6.2 Related Work
  6.3 Local Manifold Structure Estimation
  6.4 Global Manifold Structure Learning
  6.5 Experiments
  6.6 Conclusion

Part II: Structure Learning for Multivariate Time Series

7 Spatio-Temporal Alignment
  7.1 Overview
  7.2 Related Work
  7.3 Spatio-Temporal Manifold
    7.3.1 Mathematical Model
    7.3.2 Structure Learning
  7.4 Dynamic Manifold Warping
    7.4.1 Temporal Alignment
    7.4.2 Temporally Local Spatial Alignment
    7.4.3 Motion Distance Score
    7.4.4 Results of Matching
  7.5 Action Recognition from Videos
    7.5.1 Pre-processing
    7.5.2 Alignment of X_mocap and Y_video
    7.5.3 Results
  7.6 Conclusion

8 Online Temporal Segmentation
  8.1 Introduction
  8.2 Related Work
  8.3 Online Temporal Segmentation of Human Motion
    8.3.1 Problem Formulation
    8.3.2 KTC-S
    8.3.3 KTC-R
  8.4 Online Hierarchical Temporal Segmentation
    8.4.1 KTC-H
    8.4.2 Online Action Recognition from OpenNI
  8.5 Experimental Validation
    8.5.1 Online Temporal Segmentation of Human Action
    8.5.2 Joint Online Segmentation and Recognition from OpenNI
  8.6 Conclusion

9 Mining Large-Scale Time Series
  9.1 Introduction
  9.2 Related Work
  9.3 Kernelized Alignment Hashing
    9.3.1 Spatio-Temporal Hashing
    9.3.2 Hashing on Alignment Kernel
    9.3.3 Dimension Reduction
    9.3.4 Algorithm
  9.4 Analysis
  9.5 Experiments
    9.5.1 Smartphone data
    9.5.2 Social Behavior data
    9.5.3 Mocap data
  9.6 Conclusion

10 Conclusion
  10.1 Contribution
  10.2 Future Work

Appendix A: Probabilistic Tensor Voting Package
  A.1 Introduction
  A.2 Usage
  A.3 Parameters

Appendix B: ND Space Closed Form Tensor Voting Package
  B.1 Introduction
  B.2 Usage
  B.3 Parameter
  B.4 Note

Appendix C: Temporal Alignment Package
  C.1 Introduction
  C.2 Installation and Usage
  C.3 Parameter
  C.4 Note

Appendix D: Manifold Denoising Package
  D.1 Introduction
  D.2 Usage
  D.3 Parameter

Reference List

List of Tables

1.1 A brief overview of local structure learning.
1.2 A brief overview of manifold learning.
1.3 A brief overview of manifold denoising algorithms.
1.4 A brief overview of alignment algorithms.
3.1 Tangent space estimation on synthetic data. PTV results are bold.
3.2 PTV is robust to σ_n: σ_n vs. error of the estimated tangent space.
5.1 Reconstruction errors and misclassification rates (in percentage) by multiclass SVM classifiers on the USPS data.
6.1 Rand index scores of clustering on synthetic data, USPS digits, CMU MoCap sequences and motorbike videos.
7.1 Recognition performance rates.
8.1 Temporal segmentation results comparison. Precision (P), recall (R) and rand index (RI) are reported.
8.2 Online hierarchical segmentation and recognition on 2.5D depth sensor.

List of Figures

1.1 An illustration of inlier noise and outliers in ℜ².
2.1 Visualization of tensor in ℜ². Tensor decomposition in 2D.
2.2 Visualization of the stick vote in ℜ².
2.3 Integration of the stick voting to get the ball voting function.
2.4 Integration is inconsistent with linear decomposition.
2.5 Ambiguity of vote with eigen-decomposition.
2.6 Sensitivity of vote with eigen-decomposition.
2.7 Vote with spatial alignment process.
2.8 Vote is sensitive when two points are close.
3.1 Problem illustration.
3.2 Illustration and visualization of Tensor Voting.
3.3 Visualization of vote with oriented points.
3.4 Visualization of vote with unoriented points.
3.5 Comparison between STV and PTV for endpoint completion.
3.6 Tangent space estimation.
3.7 PTV performs better than PCA and MD in denoising, when data includes both outlier and inlier noise.
3.8 PTV performs better than STV for edgel grouping.
4.1 Local tangent space of a point x ∈ M.
4.2 An illustration of the spatial alignment vote.
5.1 Locally linear denoising (LLD) algorithm.
5.2 Examples of clean USPS digit images.
5.3 Denoising by PCA and manifold based algorithms on USPS data.
5.4 Denoising by LLD with different amounts of regularization.
5.5 Misclassification rates (in percentage) of LLD with different regularization and other methods on USPS data with salt-and-pepper noise.
5.6 Visualization of denoised ORL images.
5.7 Misclassification rates (in percentage) for ORL data with various denoising algorithms.
5.8 Comparison among various denoising algorithms.
6.1 A demonstration of RMMSL.
6.2 An example of multiple smooth manifolds clustering.
6.3 Visualization of part of the clustering results in Table 6.1.
6.4 Clustering results of RMMSL on two manifolds with outliers.
6.5 An example of human action segmentation results on CMU MoCap.
6.6 Two examples of motion flow modeling results on motorbike videos.
7.1 Flowchart of the proposed approach.
7.2 Examples of motion capture systems.
7.3 Spatio-Temporal Manifold model.
7.4 Learning geodesic distance.
7.5 An illustration of the non-linearity of ζ(t).
7.6 Temporal alignment results.
7.7 Action recognition results on Mocap.
7.8 Noisy tracking results.
7.9 Temporal alignment of X_mocap and Y_video.
7.10 Spatial alignment for occluded joint positions.
7.11 Examples of action recognition results on videos.
8.1 Online hierarchical temporal segmentation.
8.2 System flowchart.
8.3 An illustration of KTC-R.
8.4 Examples of online temporal segmentation.
8.5 Online action segmentation and recognition on 2.5D depth sensor.
9.1 Visualization of the normalized Delannoy kernel.
9.2 Examples of UCF-iPhone data, with precision and recall results.
9.3 Semantic accuracy on iPhone data with 32 and 128 hash bits.
9.4 Precision and recall results on social behavior data.
9.5 Examples of search results on CMU Mocap.

Abstract

This dissertation investigates a fundamental issue in machine learning and computer vision: unsupervised structure learning of high dimensional data from manifolds and multivariate time series, corrupted by noise and in the presence of outliers. Our primary goal is to accurately estimate the local data structure (e.g., the tangent space), and then infer the underlying global structure of the input data.

The local structure learning method we use is Tensor Voting (TV), a perceptual grouping approach initially proposed to infer the local geometric structure of data in 2D space. Standard Tensor Voting (STV) explicitly considers outliers, i.e., irrelevant noise which is independent of the meaningful data. The underlying assumption made by STV is that the inlier data is noiseless. However, both outlier and inlier noise commonly exist in many areas of computer vision, e.g., motion estimation, tracking, etc. By taking the uncertainty of the position into consideration, this dissertation develops Probabilistic Tensor Voting (PTV) to handle both inlier noise and outliers. PTV combines the Bayesian framework with the geometric inference algorithm of STV, and the positions of inlier data are changed from fixed vectors to random vectors. This dissertation also extends STV into a unified framework, which gives a clean and elegant interpretation of the voting algorithm in ND space. The vote, i.e., the geometric information propagation, is represented as a low-rank matrix decomposition and diffusion process with a closed-form formulation, which only involves matrix multiplication and eigen-decomposition.

After estimating the local data structure, global structure learning is performed for high dimensional data and multivariate time series data.

Two (global) manifold learning tasks are included in this dissertation. The first is non-parametric denoising on manifolds. We develop a closed-form algorithm, locally linear denoising (LLD), to denoise data sampled from (sub)manifolds, and apply this algorithm to denoise images. The second is robust multiple manifold clustering. We present a robust multiple manifold structure learning (RMMSL) scheme to robustly estimate data structures under the assumption of multiple manifolds with low intrinsic dimension. RMMSL utilizes a robust manifold clustering method based on local structure learning results. The proposed clustering method is designed to obtain the flattest manifold clusters by introducing a novel curved-level similarity function. Our approach is evaluated and compared to state-of-the-art methods on synthetic data, handwritten digit images, human motion capture data and motorbike videos.

Furthermore, this dissertation applies machine learning to structured time series, with applications to human motion analysis. First, we address the problem of learning view-invariant 3D models of human motion from motion capture data in order to recognize human actions from a monocular video sequence with arbitrary viewpoint. We propose a Spatio-Temporal Manifold (STM) model to analyze non-linear multivariate time series with latent spatial structure, and apply it to recognize actions in the joint-trajectories space. Based on STM, a novel temporal manifold alignment algorithm, Dynamic Manifold Warping (DMW), and a robust motion similarity metric are proposed for human action sequences, both in 2D and 3D. Second, we address the problem of unsupervised online segmentation of human motion sequences into different actions. Kernelized Temporal Cut (KTC) is proposed to sequentially cut the structured sequential data into different regimes. KTC extends previous works on online change-point detection by incorporating Hilbert space embedding of distributions to handle the nonparametric and high dimensionality issues. Moreover, a realtime implementation of KTC is proposed by incorporating an incremental sliding window strategy. Within a sliding window, segmentation is performed by a two-sample non-parametric hypothesis test based on the proposed spatio-temporal kernel. By combining the online temporal segmentation and spatio-temporal alignment algorithms, we can recognize the actions of an arbitrary person from an arbitrary viewpoint online, given realtime depth sensor input.
Chapter 1
Introduction

This chapter provides the introduction and definition of the problems addressed in this dissertation. An overview of related work and an outline of the dissertation are also given.

1.1 Problem Definition

In this dissertation, we address a subfield of machine learning that learns the underlying data structures from raw input data represented as points in a high dimensional space. This type of learning is termed instance-based learning, which includes popular machine learning methods like K-Nearest Neighbor (K-NN) and kernel regression. Instead of supervised approaches that predict labels for data points based on training data, we mainly focus on unsupervised learning that explores how data points are organized (without given labels). Unlike the case in supervised learning, there is no unique goal in the unsupervised scenario. Typical unsupervised learning tasks are outlier detection (anomaly detection), clustering, dimension reduction, alignment, and so on.

In particular, two types of data are investigated: high dimensional data with manifold structure, and multivariate time series data. For the first, we assume data points are sampled from a manifold, or manifolds, which can have either linear or non-linear intrinsic structure. Manifold is a term initially proposed in differential geometry and Riemannian geometry, and has been introduced to the machine learning community to model high dimensional data in a nonparametric way. In practice, data (e.g., raw input or extracted features after pre-processing) in many computer vision applications do have such manifold structure. For instance, although human motion data (3D motion capture data or trajectories from 2D videos) lie in a high dimensional space (varying from 10 to 100 dimensions or more), the natural properties of human pose suggest that human motion data have lower intrinsic degrees of freedom, which can be modeled as manifolds. For the second, we also focus on the nonparametric case, i.e., there is no parametric distribution constraint (such as Gaussian) on the multivariate time series.

Manifold learning has been an active research area in the machine learning community over the last decade. The term manifold learning usually refers to non-linear dimension reduction, an alternative to standard linear dimension reduction. Different from this direction, we mainly focus on unsupervised structure learning in the ambient space, i.e., the original input space.

This dissertation includes three parts: local data structure learning, global manifold structure learning, and machine learning for structured time series. Correctly and efficiently estimating local data structure is a crucial step for data analysis and modeling. Many (global) unsupervised learning methods such as clustering and denoising rely on accurate local structure estimation results. In particular, local structure learning in this dissertation mainly refers to estimating the local tangent space (or normal space) and the local intrinsic dimensionality of data points. We use Tensor Voting [81] to estimate such local data structures. Probabilistic Tensor Voting and a unified voting framework are proposed to improve standard Tensor Voting.

The global manifold structure learning in this dissertation includes nonparametric denoising on (image) manifolds and robust multiple manifolds clustering. It is also notable that other types of manifold learning tasks, e.g., semi-supervised learning and latent variable modeling, can also be performed based on the local structure learning results.
For time series data, instead of the popular (probabilistic) state-space model, a segmentation-then-alignment framework is proposed to analyze multivariate time series with spatio-temporal structure. Under this structured time series framework, we first propose Kernelized Temporal Cut (KTC), an extension of previous works on online change-point detection by incorporating Hilbert space embedding of distributions, to handle the nonparametric and high dimensionality issues of human motions. Second, an efficient spatio-temporal alignment algorithm, Dynamic Manifold Warping (DMW), is proposed for multivariate time series to calculate motion similarity between human action sequences (segments). Furthermore, by combining the temporal segmentation algorithm and the alignment algorithm, online human action recognition can be performed by associating a few labeled examples from motion capture data.

1.2 Issues

Manifold structure learning faces the following common challenges.

• Non-parametric and nonlinearity. It is common to assume that input data have linear structure, which offers many advantages in terms of interpretation, computing, prediction and compression. However, high-dimensional data usually have complex intrinsic structures that are difficult to interpret with a linear model. A non-parametric and non-linear manifold model provides a flexible framework to model such high dimensional data, but it also brings new challenges, such as how to effectively represent a manifold, and how to robustly and efficiently estimate manifold structures.

• Outliers and inlier noise. In this dissertation, outliers refer to data points that do not belong to the (globally) meaningful data structures, such as a human face with sunglasses among a group of normal faces. Inlier noise refers to noise (e.g., Gaussian noise) on normal data samples, such as the illumination change on human faces under standard conditions. Both outliers and inlier noise commonly occur in many areas of computer vision. For instance, low-level edge detection results on natural images may contain both step edge error (inlier noise) and irrelevant isolated points (outliers). A visualization of outliers and inlier noise is given in Fig. 1.1.

• Sparsity. As a data-driven model, the learning process under the manifold assumption requires non-sparse data points to support local or global data structures. This is particularly important for local manifold structure learning, since it is difficult to leverage all the data points (globally) to help local tasks, such as local tangent space estimation. However, in many real cases in computer vision (e.g., handwritten digits or face images), data are typically under-sampled (sparse) with possible inlier noise and outliers, making local structure estimation a challenging problem. Furthermore, sparsity is a natural geometric property as the dimensionality (especially the intrinsic dimensionality) grows.

Figure 1.1: An illustration of inlier noise and outliers in ℜ².

• Multiple manifolds. There is a lack of both theory and algorithms when data is supported on a mixture of manifolds, instead of a single manifold. Such data exist commonly in practice. For instance, each handwritten digit forms its own manifold in the feature space (10 in total), and in motion segmentation, moving objects form different trajectories which are low dimensional manifolds (1D or 2D in 4D space). These manifolds may intersect or partially overlap. Moreover, these manifolds may have different intrinsic dimensionalities and densities. Existing manifold learning related approaches are not suitable for multi-manifold data.
For instance, traditional graph-based clustering algorithms may create a weight graph that connects points on different manifolds near an intersection with a large weight (because of closeness), and thus propagate wrong information across manifolds.

Method (Nonlinear / Probabilistic / Outlier / Noise / Efficient / Out-of-Sample):
PCA: I Y Y I
FA: Y Y I I
RSL: I Y Y I
RPCA: I Y I I I
STV: I Y Y Y
Ours: I I* Y Y I Y
Table 1.1: A brief overview of local structure learning. Y indicates the method explicitly considers the issue or has the property. I indicates there is an interpretation of the method that associates with or considers the issue. I* indicates an aspect we are working on. The aspects we consider include: Nonlinear: does the method explicitly or implicitly consider nonlinear local structure? Probabilistic: is the method probabilistic? Outlier: is the method robust to outliers and does it consider outliers in the computational framework? Noise: does the method consider inlier noise, e.g., Gaussian noise? Efficient: is the method computationally efficient when repeated N times for local structure learning (N is the number of data points)? Out-of-sample: does the method provide an out-of-sample extension for new data?

1.3 Related Work

We give a brief review of related methods proposed for local and global data structure learning in general. Discussions of related work on applications such as human motion data analysis and image denoising are given in the individual chapters 5, 7 and 8.

Local Manifold Structure Learning. Most data structure learning or modeling works focus on global learning, i.e., directly fitting the model to data by a global objective function or a probabilistic framework. Nevertheless, Principal Component Analysis (PCA) [47] can be used to analyze the local data structure (e.g., tangent space, normal space and intrinsic dimensionality). Local PCA is used to find low dimensional representations of input data [153]. Similar methods, e.g., local factor analysis (FA) and mixture of factor analysis (MFA), can also be used to analyze local data structure [124]. However, most of these works do not explicitly consider outliers, and they are not robust in real computer vision applications.

On the other side, robust learning aims to explicitly handle outliers, e.g., robust subspace learning (RSL) [54] and Robust Principal Component Analysis (RPCA) [92, 146]. These works achieve good performance on global data structure learning (e.g., denoising), but they often assume linear or parametric models which are not applicable in real cases. Although it is possible to apply these methods to local structure learning (approximating a manifold with local patches), the high computational cost limits their capabilities.

In this dissertation, we investigate the problem of robustly estimating local data structures without the limitation of linear or parametric models, based on the Tensor Voting framework (chapter 2 and chapter 3). Standard Tensor Voting (STV [81]) is a non-parametric geometric computational method, and the (local) voting process is efficient and robust (it can handle a large amount of outliers). A brief overview is given in Table 1.1.

Manifold Learning. Representative works in this area include Locally Linear Embedding (LLE) [104], ISOMAP [125], Laplacian Eigenmaps (LE) [6], Hessian Eigenmaps (HLLE), Local Tangent Space Alignment (LTSA) [153], Conformal Component Analysis [106], Maximum Variance Unfolding (MVU) [141] and t-SNE [132]. These methods are explicitly designed to find proper low dimensional embeddings of the input data.
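Both the local PCA approach discussed under local manifold structure learning and embedding methods such as LTSA rely on the same primitive: estimating a tangent basis at each point from its nearest neighbors. The following is a minimal sketch of that primitive, not code from this dissertation; the function name, the choice of k, and the eigen-gap saliency score are illustrative assumptions.

```python
import numpy as np

def local_pca_tangent(X, k=10, d=1):
    """Estimate a local tangent basis at every point of X (N x D) using PCA
    on the k nearest neighbors; d is the assumed intrinsic dimension (d < D)."""
    N = X.shape[0]
    bases, saliencies = [], []
    # pairwise squared distances (fine for small N; use a KD-tree for large N)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for i in range(N):
        nbrs = np.argsort(D2[i])[:k + 1]        # neighbors, including the point itself
        P = X[nbrs] - X[nbrs].mean(axis=0)      # center the local patch
        C = P.T @ P / len(nbrs)                 # local covariance matrix
        w, V = np.linalg.eigh(C)                # ascending eigenvalues
        w, V = w[::-1], V[:, ::-1]              # reorder to descending
        bases.append(V[:, :d])                  # top-d eigenvectors = tangent basis
        saliencies.append(w[d - 1] - w[d])      # eigen-gap as a crude saliency score
    return np.array(bases), np.array(saliencies)

# toy usage: noisy samples from a circle in R^2 (a 1D manifold)
t = np.linspace(0, 2 * np.pi, 200)
X = np.c_[np.cos(t), np.sin(t)] + 0.01 * np.random.randn(200, 2)
B, s = local_pca_tangent(X, k=10, d=1)
```

As the surrounding discussion notes, this purely local estimator is sensitive to outliers and sparse sampling, which is the motivation for the voting-based alternatives developed in Part I.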
Recently, the Gaussian Process Latent Variable Model (GP-LVM) was proposed to model high dimensional data via the kernelized Probabilistic PCA framework [126]. As a generative framework, GP-LVM not only offers a low dimensional representation of the data, but can also be used in the ambient space, e.g., for missing value inference and denoising. However, outliers are not explicitly considered in the GP-LVM model. A brief overview is given in Table 1.2.

Method (Local / Global / Rec / Embedding / Multiple / Outlier / Out-of-Sample):
LLE: Y Y Y I
ISOMAP: Y Y I
LE: Y Y I
GP-LVM: Y Y Y I
STV: Y I Y Y
Ours: Y I* I* Y Y Y
Table 1.2: A brief overview of manifold learning. Notations are the same as in Table 1.1. Some new issues are: Local: does the method start from local data structure estimation? Global: does the approach have a global objective function? Rec: does the approach provide a mapping from the latent space to the ambient space? Embedding: does the approach provide low dimensional embedding results? Multiple: does the approach explicitly or implicitly consider the possibility of multiple manifolds?

Outlier Detection. There is a large class of methods called stochastic sample consensus for outlier identification (or filtering). A representative work is Random Sample Consensus, RANSAC [21]. The objective function to be maximized is the number of estimated inlier points, i.e., data points lying within a given threshold of the estimated model. Related works include MLESAC, MAPSAC and M-estimators [39]. Recently, StaRSaC was proposed to analyze the variance of the parameters (VoP); it shows that there is a stable region of the estimated parameters which can solve the unknown-uncertainty problem [11]. The main limitation of these methods is the need for a specific parametric model of the inlier data.

Another popular type of work that can handle outliers is voting based methods. In the Hough transform [74], the optimal model parameters are selected as the parameter space cell which receives the maximum number of votes. The Hough transform also requires a parametric model of the inlier data, so it cannot be applied to the non-parametric case. As reviewed in the introduction, tensor voting can handle a large amount of outliers without a parametric model constraint, but inlier noise is not taken into account in the original framework.

[82] extends the initial Tensor Voting framework to high dimensional space and reports good performance on handling outliers. However, there is still no explicit detection of outliers.

Manifolds Clustering. As an important step in global data structure learning, automatic and robust manifold clustering is needed. State-of-the-art manifold clustering methods like [18, 134, 149] achieve good performance based on the multiple linear subspace assumption. However, real data (e.g., motion flow in computer vision) often have non-linear intrinsic structures. Furthermore, the robustness issue is not explicitly considered in most of these manifold clustering works. On the other side, one of the most popular clustering methods is spectral clustering [85], which has a solid theoretical foundation (graph spectral theory [12]), an elegant computational framework (eigen-decomposition), and no linear model constraint on the data. However, it is pointed out by [158] that pairwise similarity is not enough for manifold learning, i.e., high-order relationship information between point sets is missing.

Driven by these factors, we propose a novel robust non-linear manifold clustering method in chapter 6, which explicitly considers outliers and can handle multiple non-linear intrinsic subspaces with different dimensionalities.

Manifold Denoising.
Numerous parametric model based denoising methods have been proposed for data denoising, such as linear and nonlinear regression. Here we only focus on non-parametric data denoising. Principal Component Analysis (PCA) is one of the most successful denoising methods and has been applied to many aspects of computer vision [47]. While PCA is a linear subspace method, Kernel PCA (KPCA) is an extension obtained by kernelizing the Gram matrix [79]. Robust subspace learning (RSL) is a marriage between PCA and robust estimators to handle outliers [54]; its main limitation is the linearity constraint of the inlier data model. It also calculates the covariance matrix iteratively, which leads to a large computational cost.

To handle non-parametric and nonlinear data structure, a manifold model is assumed to fit the data, which means there is a latent intrinsic structure embedded in the high dimensional space. The diffusion map method views denoising as reversing a diffusion process of which the normalized graph Laplacian matrix is the generator (DM-MD) [41]. This approach tends to overly smooth the inputs and to push data points toward the mean curvature of the manifold. Moreover, outliers are not considered in the framework, and outliers are harmful to the diffusion process. A brief overview is given in Table 1.3.

Method (Local / Global / Nonlinear / Outlier / Convex):
PCA: Y Y
KPCA: Y Y
RSL: Y Y
RPCA: Y Y I
DM-MD: Y Y Y
Ours: Y Y Y I Y
Table 1.3: A brief overview of manifold denoising algorithms. Notations are the same as in Table 1.1. Some new issues are: Local: does the method start from local data structure estimation? Global: does the approach have a global objective function? Convex: is the computational framework convex?

Spatio-Temporal Alignment. Instead of structure learning on one set of data points, alignment focuses on learning the corresponding relationship between two or multiple sets of data points (SAM [45]). In particular, when a temporal index is available, this becomes spatio-temporal alignment between two multivariate time series, which is valuable for many computer vision applications.

Method (Multi-dim / Diff-dim / Probabilistic / Global / Temp-NL / Spatial-NL / Gen):
DTW: Y I Y Y
CCA: Y Y I Y I
CPM: I Y Y Y Y
SAM: Y Y Y Y
CTW: Y Y Y Y
Ours: Y Y I* Y I
Table 1.4: A brief overview of alignment algorithms. Notations are the same as in Table 1.1. Some new issues are: Multi-dim: does the method allow alignment for multi-dimensional data? Diff-dim: can the method perform alignment for two or multiple data sets or sequences with different dimensionalities? Global: does the approach have a global objective function? Temp-NL: can the approach handle temporally nonlinear variation? Spatial-NL: does the approach consider spatially nonlinear variations? Gen: is the method generative?

In particular, given two human motion sequences, an important question is whether those two sequences represent the same motion, similar motions or distinct motions. This can be viewed as a (spatio-temporal) alignment problem, serving as a foundation for action recognition, clustering, etc. Canonical Correlation Analysis (CCA) [4] is proposed for learning the shared subspace between two high dimensional features, and has been used as a spatial matching algorithm for activity recognition from video [48] and activity correlation across cameras [67]. Video synchronization is addressed as a temporal alignment problem in [99, 114], which uses dynamic time warping (DTW) or its variants [96]. [128] uses optimization methods to maximize a similarity measure of two human action sequences, while the temporal warping is constrained by a 1D affine transformation. The same linear temporal model is also used in [91].
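Since dynamic time warping (DTW) is the basic temporal alignment tool that the methods above build on, a minimal dynamic programming sketch is given below. It is illustrative only (not the dissertation's implementation); the Euclidean frame cost and the three-step pattern are standard but assumed choices.

```python
import numpy as np

def dtw(X, Y):
    """Classic DTW between sequences X (n x d) and Y (m x d).
    Returns the accumulated alignment cost; frame cost is Euclidean distance."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            # step pattern: diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# toy usage: the same 1D motion played back at two different speeds
X = np.sin(2 * np.pi * np.linspace(0, 1, 50))[:, None]
Y = np.sin(2 * np.pi * np.linspace(0, 1, 80))[:, None]
print(dtw(X, Y))
```

The limitations discussed next (coupling of spatial and temporal transformations, purely linear warps) are what CTW and the proposed DMW address on top of this basic recursion.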
Different from the above unsupervised works, training-based methods like the profile model (CPM [63]) build a generative and probabilistic framework on multiple training sequences. The training stage is usually performed by the Expectation-Maximization (EM) algorithm.

Very recently, as an elegant extension of CCA and DTW, Canonical Time Warping (CTW) was proposed for spatio-temporal alignment of two multivariate time series and applied to align human motion sequences between two subjects [160]. CTW is formulated as an energy minimization framework and solved by an iterative gradient descent procedure. Since the spatial and temporal transformations are coupled together, the objective function becomes non-convex and the solution is not guaranteed to be globally optimal.

In this dissertation, under the manifold structure learning framework, we propose Dynamic Manifold Warping (DMW), which focuses on multivariate time series with intrinsic spatial structure and guarantees a globally optimal solution. A brief overview is given in Table 1.4.

Temporal Segmentation. There are mainly two types of work on temporal segmentation: one is change-point detection and the other is temporal clustering. Most of the work in statistics, i.e., offline or quickest (online) change-point detection (CD) [10], is often restricted to univariate series (1D) and parametric distribution assumptions, which do not hold for human motions with complex structure. [148] uses undirected sparse Gaussian graphical models and performs joint structure estimation and segmentation.

Recently, as a nonparametric extension of Bayesian online change-point detection (BOCD) [1], [103] combines BOCD and Gaussian Processes (GPs) to relax the i.i.d. assumption within a regime. Although GPs improve the ability to model complex data, they also bring a high computational cost. More relevant to us, kernel methods have been applied to non-parametric change-point detection on multivariate time series [16, 37]. In particular, [16] (KCD) utilizes the one-class SVM as an online training method, and [37] (KCpA) performs sequential segmentation based on the Kernel Fisher Discriminant Ratio. Unlike all the above works, KTC can detect not only action transitions but also cyclic motions.

Clustering is a long-standing topic in machine learning [86, 135]. Recently, as an extension of clustering, some works focus on how to correctly segment time series temporally into different clusters. As an elegant combination of Kernel K-means and spectral clustering, Aligned Cluster Analysis (ACA) was developed for temporal clustering of facial behavior, with a multi-subject correspondence algorithm for matching facial expressions [163]. To estimate the unknown number of clusters, [23] use the hierarchical Dirichlet process as a prior to improve the switching linear dynamical system (SLDS). Most of these works segment time series offline and provide cluster labels as in clustering. As a complementary approach, KTC performs online temporal segmentation, which is suitable for realtime applications.

1.4 Outline

This dissertation is organized as follows. We point out the problems of the standard Tensor Voting framework in Chapter 2. Probabilistic Tensor Voting is proposed in Chapter 3, and a unified voting framework in ND space for robust local manifold structure learning is given in Chapter 4. Chapters 5 and 6 present two manifold learning applications: one is non-parametric denoising on manifolds and the other is robust multiple manifolds structure learning.
In Chapter 7, we address the problem of spatio-temporal alignment for multivariate time series data, with applications to view invariant human action recognition. Chapter 8 presents an online non-parametric change-point detection algorithm for complex multivariate time series data. Furthermore, a learning based hashing algorithm for mining large-scale multivariate time series data is proposed in Chapter 9. Finally, a conclusion is given in Chapter 10.

Part I
Structure Learning for Manifolds

Chapter 2
Tensor Voting

2.1 Tensor Voting in 2D

Tensor Voting was originally developed in 2D for perceptual grouping [77], and later extended to 3D and ND space [81] [82]. Tensor Voting is based on the Gestalt principles of proximity and good continuation, and adheres to Marr's principle that matter is cohesive [71]. These principles also serve as important foundations in many aspects of unsupervised learning, e.g., the smoothness and proximity constraints in manifold learning, clustering and dimension reduction.

Suppose we have a set of points x_i (i = 1, 2, ..., N) in 2D space. Our objective is to infer the local geometric structure of x_i or of its spatial neighborhood points {y_j}_{j=1}^L, and use this to characterize those points. (Strictly speaking, as a perceptual organization approach, Tensor Voting was initially proposed to infer the salient structures from the input data; this chapter focuses on the use of Tensor Voting for unsupervised geometric structure inference.) The set {y_j}_{j=1}^L can be the same as or different from {x_i}_{i=1}^N. The input can be generalized to generic points, which are defined as tokens in the original Tensor Voting framework [81]. Tokens include un-oriented points and oriented points, the latter carrying prior knowledge of the local orientation. In 2D space, Tensor Voting classifies inlier points into four categories: junctions, region interior points, points on curves (1D manifolds), and endpoints of curves. Furthermore, the local normal and tangent space of points on manifolds can be estimated. It is notable that outliers can also be detected based on Tensor Voting results. Details about outlier detection in ND space are given in chapter 8.

2.1.1 Tensor Representation and Interpretation in R²

Representation: a structured tensor T and a polarity vector p are used to describe the local geometric information at a point x ∈ R². Here, T ∈ R^{2×2} is defined as a 2×2 symmetric and positive semi-definite (PSD) matrix, which is a second order tensor, and p ∈ R² is a 2×1 vector, which is a first order tensor. In the computational framework, the tensor representation is used at both the input and output stages. For an input token (a generic point), if it is a point without prior knowledge of orientation, then the identity matrix I is assumed as the input tensor; if it is an oriented point with prior direction e, then ee^T is taken as the input tensor. After voting, i.e., the tensor propagation and calculation process, the output at each point is still a tensor.

Interpretation: based on spectral decomposition, an arbitrary PSD matrix T is factorized in a special way so that the eigenvectors and eigenvalues are associated with the local geometric information:

T = \lambda_1 e_1 e_1^T + \lambda_2 e_2 e_2^T = (\lambda_1 - \lambda_2)\, e_1 e_1^T + \lambda_2 (e_1 e_1^T + e_2 e_2^T)    (2.1)

Here λ_i (i = 1, 2, with λ_1 ≥ λ_2) are the eigenvalues and e_i (i = 1, 2) are the eigenvectors of T. Basically, T ∈ R^{2×2} is decomposed into two components, e_1 e_1^T and e_1 e_1^T + e_2 e_2^T, and each one is associated with a belief coefficient β_i indicating the component saliency: β_1 = λ_1 − λ_2, β_2 = λ_2.
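The decomposition in eq. 2.1 is easy to compute in practice. The sketch below is an illustrative reading of the equation (not the reference Tensor Voting code): eigen-decomposition of a 2×2 PSD tensor yields the stick saliency β_1 = λ_1 − λ_2, the ball saliency β_2 = λ_2, and the estimated normal and tangent directions.

```python
import numpy as np

def decompose_tensor(T):
    """Split a 2x2 PSD tensor into stick/ball saliencies as in eq. (2.1)."""
    w, V = np.linalg.eigh(T)              # eigenvalues in ascending order
    lam1, lam2 = w[1], w[0]               # lam1 >= lam2
    e1, e2 = V[:, 1], V[:, 0]             # e1: estimated normal, e2: tangent (curve case)
    beta1 = lam1 - lam2                   # stick (curve) saliency
    beta2 = lam2                          # ball (junction/region) saliency
    return beta1, beta2, e1, e2

# example: a token whose normal is roughly the y-axis, plus a small ball component
T = 0.9 * np.outer([0.0, 1.0], [0.0, 1.0]) + 0.1 * np.eye(2)
print(decompose_tensor(T))
```

Comparing β_1 and β_2 reproduces the token classification described next: a dominant β_1 suggests a curve point, a dominant β_2 a junction or region point, and two small values an outlier.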
In the earlier literature of Tensor Voting [77], these two kinds of matrices are defined as the stick tensor and the ball tensor respectively. (The terminology for this matrix factorization is changed in later chapters to suit the focus on manifold learning.) The stick component e_1 e_1^T represents the degenerate case of an ellipse, which has a unique direction e_1. The ball component e_1 e_1^T + e_2 e_2^T represents a circle, which has no preferred orientation; in fact it is the identity matrix I. (It is straightforward to prove that for any two orthogonal unit vectors u, v ∈ R², uu^T + vv^T = I.)

By analyzing the belief coefficients β_i, tokens can be classified into four categories. If β_1 and β_2 are both quite small, the point is classified as an outlier; it is classified as a junction (or a point in a region) if β_2 is large and β_1 is quite small, which is equivalent to λ_1 and λ_2 being large and almost equal. In the case that β_1 ≥ β_2, i.e., the saliency of the stick component is larger than the saliency of the ball component, the point is on a curve. This analysis is similar to junction estimation in the Harris corner detector [38], while our representation covers more general geometric cases in R².

In the last case, where point x is on a curve (a 1D manifold M ∈ R²), the eigenvector e_1 corresponding to the largest eigenvalue λ_1 represents the local normal space of the manifold, formally N_x M = {αe_1 | α ∈ R}. Eigenvector e_2 represents the local tangent space of the manifold, T_x M = {αe_2 | α ∈ R}. The local manifold saliency is the belief coefficient β_1 = λ_1 − λ_2, indicating how clear the local manifold structure is.

Furthermore, as a complementary part of the second order tensor representation, the polarity vector p ∈ R² is proposed to describe endpoints [127], which have a local maximum of the polarity ||p|| along the local tangent space T_x M. For the points located at the ends of the curve, the polarity space P_x M (pointing into the curve) is jointly given by p and e_2 as:

P_x M = \{ \alpha\, \mathrm{sgn}(p^T e_2)\, e_2 \mid \alpha \in \Re^+ \}    (2.2)

Intuitively, the polarity space P_x M is one half of the tangent space T_x M; sgn(·) is the standard sign function that extracts the sign of a real number.

Figure 2.1: Visualization of tensor in ℜ². Tensor decomposition in 2D.

Voting Algorithm in ℜ². After designing the tensor representation framework to associate with geometric properties in R², the next step is to develop an algorithm that effectively estimates such tensors (T, p). It is required to be unsupervised and nonparametric, to have moderate computational cost, and to handle a large amount of outliers. The voting algorithm is designed to meet these requirements through a tensor propagation process [77].

Problem Formulation: assume the input data are oriented points {(x_n, u_n)}_{n=1}^{N_1}, where u_n ∈ R² is a unit vector representing the local normal space of the point x_n ∈ R², and unoriented points {x_n}_{n=N_1+1}^{N_1+N_2}. The input tokens are encoded as {(x_n, M_n)}_{n=1}^N, where N = N_1 + N_2, M_n = u_n u_n^T ∈ R^{2×2} for n = 1, 2, ..., N_1, and M_n = I ∈ R^{2×2} for n = N_1 + 1, ..., N. The target of the voting algorithm is to compute the tensors {T_n, p_n}_{n=1}^L for a target point set {y_n}_{n=1}^L, where y_n ∈ R² may or may not belong to the set {x_n}_{n=1}^N. (This formulation is a general description such that all cases of voting can be viewed as its special cases.)

The key idea of the voting algorithm is straightforward. Each target point y_m (m = 1, 2, ..., L) collects votes from its neighborhood points x_n (n = 1, 2, ..., N) and linearly sums them up as follows:

T_m = \sum_{n \in N(y_m)} \mathrm{Vote}_T(x_n, T_n, y_m, \sigma_v)    (2.3)

p_m = \sum_{n \in N(y_m)} \mathrm{Vote}_p(x_n, T_n, y_m, \sigma_v)    (2.4)

where Vote_T(·) and Vote_p(·) are vote functions explained in the next section, and σ_v is the only free parameter in the voting algorithm, i.e., the voting scale, which controls the decay of the voting saliency.
are vote functions that will be explained in the later section, and σ v istheonlyfreeparameterinthevotingalgorithm,i.e.,votingscale,whichcontrolsthevoting saliencydecay. Formulation I: one special case is N = L and the set{y m } L n=1 is identical to the set {x n } N n=1 , which means the tensor information is propagated within the same points set. This case is called Sparse Voting in the earlier literature [77]. Since the input and output are the same set of generic points (points with tensors), the algorithm is naturally able to process data in a hierarchical way. Voting algorithm usually includes two layers (vote twice), and it can be extendedtothreeorfourlayerstomeetspecificdomainrequirements[77][81]. Formulation II: set{y m } L n=1 includes all possible locations (points) within the neighbor- hoodofset{x n } N n=1 ,underagridstructureinR 2 . ThiscaseiscalledDenseVotingintheearlier 4 Thisformulationisageneraldescriptionsuchthatallcasesofvotingcanbeviewedasitsspecialcases. 20 literature [77]. With certain precise requirements, the geometric information for nearby points canbeestimatedafterthevoting,leadingtotheout-of-sampleextensionabilityinanaturalway. In equation 2.3, the only remaining question is how to design the proper vote function Vote T (.) and Vote p (.). The details of those functions are given in the following parts, and furtherjustificationisgiveninchapter4. 2.1.2 VotingFunctioninℜ 2 According to eq. 2.1, an arbitrary tensor can be linearly decomposed into the weighted sum of twospecialtensors,andthisrulealsoappliesforvotingfunctiondesign, Vote T (x,T,y,σ v ) =Vote T (x,{(λ 1 −λ 2 )(e 1 e 1 T )+λ 2 (e 1 e 1 T +e 2 e 2 T )},y,σ v ) = (λ 1 −λ 2 )Vote(x,e 1 e 1 T ,y,σ v )+λ 2 Vote T (x,e 1 e 1 T +e 2 e 2 T ,y,σ v ) = (λ 1 −λ 2 )Vote T (x,e 1 e 1 T ,y,σ v )+λ 2 Vote T (x,I,y,σ v ) (2.5) The notation of Vote T (x,e 1 e 1 T ,y,σ v ) is the case that the normal direction atx is known as e 1 , and the notation of Vote T (x,I,y,σ v ) is the case thatx has no preference for the normal direction. These two cases are defined as Stick Vote and Ball Vote, and are proposed separately in the earlier literature, while a unified framework for voting function is proposed in chapter 3. Inthefollowing,westillgivetheoldformulationsofthesetwovotingfunctions. StickVote: ittriestoanswerthequestionthat,ifweknowthenormalandtangentdirection ofapointxonacurve,whatisthemostlikelydirectionofthismanifoldatpointy? Thesolution is to link these two points with a constant curvature curve (arc). The normal space ofy will be thesameasthenormalspaceofthisarcaty. 21 Figure 2.2: Visualization of the stick vote inℜ 2 . There is a known tangent direction (e 1 ) at the voterx. Vote T (x,e 1 e 1 T ,y,σ v ) =e − s 2 +ck 2 2 v R(2θ)e 1 e 1 T R(2θ i ) T (2.6) ThenotationVote T (x,e 1 e 1 T ,y,σ v )indicatesthestickvotefromxtoy,andthenormalspace atxise 1 . θ i istheanglebetweenthetangentdirectione 2 andvectorx−y. l =||x−y||isthe Euclideandistancebetweeny andx. Arclengthsis θl sin(θ) andcurvaturek is 2sin(θ) l .R(2θ i )is the standard counterclockwise rotation matrix with angle 2θ i . Factorc is a constant coefficient whichisdefinedas −16log(0.1)×(σv−1) π 2 [77]. Itisnotablethat,ifθ exceeds π 4 ,thevotebecomesa zerotensor. Vote saliency is designed as the Gaussian liked kernel, which focuses on not only the arc lengths but also the curvaturek. c is used to balance the effects of these two terms. 
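The stick vote of eq. 2.6 can be sketched as follows. This is an illustrative NumPy implementation under the arc construction described above; the sign convention used for the rotation angle is an assumption of the sketch (the voted tensor is insensitive to the sign of the resulting normal).

```python
import numpy as np

def rotation_2d(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def stick_vote_2d(x, e1, y, sigma_v):
    """Second-order stick vote cast from a voter at x with unit normal e1 to a
    receiver at y, following Eq. (2.6)."""
    c = -16.0 * np.log(0.1) * (sigma_v - 1.0) / np.pi ** 2   # constant from [77]
    v = y - x
    l = np.linalg.norm(v)
    if l < 1e-12:
        return np.outer(e1, e1)
    e2 = np.array([-e1[1], e1[0]])                 # a tangent direction at the voter
    if np.dot(v, e2) < 0:
        e2 = -e2                                   # point the tangent towards the receiver
    theta = np.arctan2(np.dot(v, e1), np.dot(v, e2))   # angle between v and the tangent
    if abs(theta) > np.pi / 4:                     # votes beyond 45 degrees are suppressed
        return np.zeros((2, 2))
    s = l if abs(theta) < 1e-12 else theta * l / np.sin(theta)   # arc length
    k = 2.0 * np.sin(theta) / l                                   # curvature
    saliency = np.exp(-(s ** 2 + c * k ** 2) / sigma_v ** 2)
    R = rotation_2d(2.0 * theta)
    return saliency * (R @ np.outer(e1, e1) @ R.T)
```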
A special caseoccurswhenθ = 0,whichmeansthex i −xisorthogonaltoe 1 , andthevoteissimplified 22 Figure2.3: IntegrationofthestickvotingtogetBallVotingfunction. toe − s 2 2 v e 1 e 1 T . The first order polarity vector vote is designed based on the second order stick vote(withthesamevotingsaliency)andmoredetailscanbefoundin[127]. In summary,x gives a stick vote toy depending on the tensorT and the relative position ofx andy. The result of this voting process can be interpreted as a local and nonparametric estimationofthegeometricstructureateachsampleposition. In previous works [77], tensor voting is implemented as the formulation of voting fields, whichareprecomputedandstored. Thefundamentalfield,i.e.,Vote T ([00] T ,[10] T [10],y,σ v ) for all locationsy∈ R 2 , is visualized as Fig. 2.3. For arbitrary voter’s locationx and direction ee T ,thevotingresultscanbeeasilycalculatedasatranslationandarotationofthefundamental voting field. This process is similar to applying a spatial filter, while the fundamental field is equivalenttotheimpulsefunctionofalinearfilter. Ball Vote: ball vote starts without any information of orientation besides the position ofx (voter) andy (receiver). In the standard framework, ball voting functionVote T (x,I,y,σ v ) is designed based on the stick voting function, while the directionθ, has a uniform distribution in [02π]. 23 Vote T (x,I,y,σ v ) = ∫ 2π 0 1 2π Vote T (x,e e T ,y,σ v )dθ = ∫ 2π 0 1 2π Vote T (x,[cos(θ)sin(θ)] T [cos(θ)sin(θ)],y,σ v )dθ (2.7) Where e θ is the notation for unit direction vector with angle θ, i.e., [cos(θ) sin(θ)] T . This integration idea is visualized in Fig. 2.3. In previous tensor voting works [77], ball vote is also implemented as precomputed voting field. It is straightforward to show that, although eq. 3.9 doesn’t have a closed-form solution, the final result is a function of vectorx−y∈ R 2 (given votingscaleσ v ). BallvotingfieldisvisualizedinFig.2.3. 2.2 ProblemsintheStandardFramework Despitethesuccessoftensorvotingframeworkinmanyaspectsofcomputervision,likecontour detection, stereo vision and tracking, there are two important unsolved issues in the standard framework. First, the two basic voting functions of ball and stick vote, are inconsistent with eachother. Thisissuenotonlybringsinaproblemofinterpretingvotingalgorithmsandresults, butalsomakesthegeneralizationofhighdimensionalvotingalgorithmimpossible. Second,the standardframeworkdoesn’texplicitlyconsidernoiseontheinlierdata,whileitcommonlyexists inmanyapplicationdomainsincomputervisionandmachinelearning. 24 This dissertation is mainly motivated by these two important issues. As a consequence, a probabilistic tensor voting framework and the unified tensor voting framework in high dimen- sionalspacearedesignedinthisdissertation. Detailsaregiveninchapter2and3. Inthischapter, wegiveabriefintroductionofthesetwoissues. 2.2.1 Stickv.s. BallVotingFunction Stick and ball voting functions are proposed separately in the standard Tensor Voting frame- work [81] in 2D space. Here we describe several possible frameworks to analyze these two votingfunctions,startingfromthestandardframeworkwhichisanintegrationapproach. Integration Method. There are two fundamental voting functions in the standard Tensor Voting framework, i.e., stick and ball voting functions. Especially, the ball voting function is proposed based on an integration of the stick voting function. 
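As an illustration of this integration (eq. 2.7), the ball vote can be approximated numerically by averaging stick votes over uniformly sampled directions; the sketch below reuses the hypothetical `stick_vote_2d` helper from the previous snippet, and the number of integration steps is an arbitrary choice.

```python
def ball_vote_2d(x, y, sigma_v, n_steps=360):
    """Ball vote of Eq. (2.7), approximated by numerically averaging stick votes
    whose normal direction sweeps the unit circle. The standard framework
    precomputes this field rather than integrating at run time."""
    T = np.zeros((2, 2))
    for theta in np.linspace(0.0, 2.0 * np.pi, n_steps, endpoint=False):
        e = np.array([np.cos(theta), np.sin(theta)])
        T += stick_vote_2d(x, e, y, sigma_v)
    return T / n_steps
```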
This integration method can be applied to higher dimensional space, e.g., 3D and 4D, but it is computationally intractable generally in high dimensional space. The computational complexity increases exponentially as dimension D increases. For instance, in R D space, if the numerical integration interval in one dimensionisδ (δ > 0),thenthecomputationalcomplexity(calculatevotingfunctiononce)isin theorderofO( 1 δ D ). Moreover,ballvotingfunctionisinconsistentwiththestickvotingfunctionunderthelinear decomposition rule. As shown in Eq. 2.8, a vote from an arbitrary tensor T can be linearly decomposedintotheweightedsumoftwobasicvoteasfollows, Vote T (x,T,y,σ v ) = (λ 1 −λ 2 )Vote T (x,e 1 e 1 T ,y,σ v )+λ 2 Vote T (x,I,y,σ v ) (2.8) 25 Figure2.4: Integrationisinconsistentwithlineardecomposition where λ i and e i ∈ R 2 are eigenvalues and eigenvectors (i = 1,2) of T ∈ R 2×2 . The first term Vote T (x,e 1 e 1 T ,y,σ v ) is the stick voting function designed before. The second term Vote T (x,I,y,σ v )istheballvotingfunction,whichisdesignedbasedontheintegrationideaof thepreviousone. Basedonthedecompositionandintegrationmethod,anaturalquestionarises, cantheintegrationideabedirectlyappliedtothegeneralvotecase,withoutanintermediatestep, i.e.,lineardecomposition? Recalltheintegrationofballvote, Vote T (x,I,y,σ v ) = ∫ 2π 0 1 2π Vote T (x,e e T ,y,σ v )dθ (2.9) The fact that the direction θ has a uniform distribution in [0,2π], can be also interpreted as a point walking on a unit circle (∈ℜ 2 ) in constant speed. This walking process is naturally 26 Figure2.5: Ambiguityofvotewitheigen-decomposition. generalizedtoanellipse(∈ℜ 2 ),whichcorrespondstoaP.S.DmatrixT∈ℜ 2×2 whenλ 1 ̸=λ 2 . 5 So,thegeneralvotingfunctioncanalsobecalculatedas, Vote T (x,T,y,σ v ) = ∫ E ( 1 ; 2 ) c(λ 1 ,λ 2 )Vote T (x,e (l) e (l) T ,y,σ v )dl (2.10) E (λ 1 ,λ 2 ) is the corresponding ellipse of P.S.D matrixT with two eigenvalues λ 1 ,λ 2 . c(λ 1 ,λ 2 ) is a constant weight so that ∫ E ( 1 ; 2 ) c(λ 1 ,λ 2 )dl = 1. It is straightforward to show that eq. 2.9 is a special case of eq. 2.10, whenλ 1 = λ 2 = 1. Eq. 2.8 and eq. 2.10 provide two possibilities to calculate the general voting function Vote T (x,T,y,σ v ), while the problem is the results of these two methods are different, as shown in Fig. 2.4. Furthermore, the total energy of ball 5 Asaspecialcase, when1 = 2, ellipsebecomesa circleandthecorrespondingT becomesanidentitymatrix I (uptoaconstantscalarfactor). 27 Figure2.6: Sensitivityofvotewitheigen-decomposition. voting function calculated from eq. 2.9 is different from the total energy of stick voting defined insection1.4, ∫ ℜ 2 Vote T (0,I,y,σ v )dy̸= ∫ ℜ 2 Vote T (0,ee T ,y,σ v )dy (2.11) The total energy of two basic voting functions are defined as the total energy of the impulse functionsoftwospatialfilters,whichareVote T (0,I,y,σ v )andVote T (0,ee T ,y,σ v ). VotewithEigenvectors. Theabovesectionshowsthatthevotingframeworkinvolvingintegra- tion has the inconsistency problem, as well as the total energy issue. Naturally, another attempt istodesignavotingframeworkbasedoneigen-decompositionofP.S.DmatrixT. Thisideacan beformallyformulatedas, Vote T (x,T,y,σ v ) =Vote T (x,[λ 1 e 1 e 1 T +λ 2 e 2 e 2 T ],y,σ v ) =λ 1 Vote T (x,e 1 e 1 T ,y,σ v )+λ 2 Vote T (x,e 2 e 2 T ,y,σ v ) (2.12) 28 WhereVote T (x,e i e i T ,y,σ v ) (i = 1,2) is just the stick voting function that has been defined insection1.4. Inshort,ageneralvoteisaweightedlinearcombinationoftwostickvotes. 
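For illustration, the sketch below (again reusing the hypothetical `stick_vote_2d` helper) evaluates this eigen-decomposition vote for a ball tensor using two different, equally valid orthonormal decompositions of the identity; as the next paragraph explains, the two results differ, which exposes the ambiguity of this formulation.

```python
# Eq. (2.12)/(2.13) applied to a ball tensor, with two different orthonormal
# decompositions of the identity matrix.
x, y, sigma_v = np.zeros(2), np.array([3.0, 1.0]), 5.0

def eigen_decomposition_vote(u, v):
    return stick_vote_2d(x, u, y, sigma_v) + stick_vote_2d(x, v, y, sigma_v)

A = eigen_decomposition_vote(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
B = eigen_decomposition_vote(np.array([1.0, 1.0]) / np.sqrt(2.0),
                             np.array([1.0, -1.0]) / np.sqrt(2.0))
print(np.allclose(A, B))   # False: the result depends on the chosen decomposition
```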
Although this idea is elegant and straightforward, there are inherent problems in it. First, contrast to intuition, eigen-decomposition of a matrix is not unique. For instance, an identity matrix I ∈ ℜ 2×2 can be arbitrarily decomposed into uu T +vv T , for any orthogonal pair u,v∈ℜ 2 ,withu T u =v T v = 1. So,accordingtoeq.2.12,theballvotecanbecalculatedas, Vote T (x,I,y,σ v ) =Vote T (x,uu T ,y,σ v )+Vote T (x,vv T ,y,σ v ) (2.13) Eq. 2.13 doesn’t provide unique results, since the eigen-decomposition is not unique and the stickvotingfunctionisnotisotropic. ThisfactisvisualizedinFig.2.5. Another problem of this framework is the voting results are sensitive to the noise tensorδT (or noise vectorv) on the input tensorT. Formally, the sensitive index SI defined as follows canbearbitrarilylarge. SI =lim δT→0 F(Vote T (x,T,y,σ v )−Vote T (x,T +δT,y,σ v )) F(δT) (2.14) Forinstance,iftheinitialtensorT isanidentitymatrixandthetheanglebetweennoisevectorv andxy is π 4 (thescalar of thenoise can be arbitrarily small), then the voting result becomes zero. Thus,thesensitiveindexSI goestoinfinitewhentheenergyofthenoisegoestozero. VotewithAlignment. Detailsof the votewith spatial alignment technique are given in chapter 3. We give a brief introduction of the key idea for vote with spatial alignment inℜ 2 in this chapter. 29 Figure2.7: Votewithspatialalignmentprocess. As shown in Fig. 2.7, the ball tensor (identity matrix) is decomposed in a unique way such that one eigenvector is parallel to xy and the other is orthogonal to xy. In this case, the ball vote is transfered to a stick vote whose normal vector is orthogonal to xy. This alignmentmethodhasseveraladvantages. Itprovidesauniquesolutionforthegeneralvote,and moreover, the alignment process can be viewed as the optimal results in terms of maximizing thevotingpropagationenergy 6 . 2.2.2 InlierNoise AnotherproblemofthestandardTensorVoting(STV)isinliernoiseisnotexplicitlyconsidered. Thevotingresultsaresensitivetopositionnoise,especiallywhenvoterandreceiverarecloseto eachother. SensitivityAnalysisofClosePoints. Fig.2.8illustratesthatvotingissensitivebetweentwo close points with distanceϵ > 0 (can be arbitrarily small). At the top of Fig. 2.8, two noiseless pointslieonthex-axisandthevotingresultsindicatethenormalvectorisy-axis( π 2 ). Ifanoise vector (length is alsoϵ > 0 and the direction is π 2 ) is added to one point, then the voting results indicatethedirectionofnormalvectorischangedfrom π 2 to π 4 . Thevotingsaliencyisveryclose 6 Ifthestickvotingkernelisaconvexfunction. 30 Figure2.8: Voteissensitivewhentwopointsareclose. to1sinceϵisalmostzero(assumeTensorsareinitializedasidentitymatrices). Furthermore,we candefinethesensitiveindexSI (similartoeq.2.14)withrespecttoϵ(keepthelengthofnoise vectorasϵtoo). Inthiscase,itiseasytoshowthatSI goestoinfinitewhenϵgoestozero. Besides the special cases we mentioned above, in general, when inlier points have noise, thevotingprocess(tensorinformationpropagatesfromonepointtoothers),isover-fitthenoisy position of points. To handle these problems, we propose Probabilistic Tensor Voting, which is giveninchapter3. 31 Chapter3 ProbabilisticTensorVoting 3.1 Introduction A fundamental problem for low-level computer vision is how to process noisy data and extract meaningful features. This has applications in denoising, contour grouping, stereo matching and object detection. 
In general, there are two types of noise: outliers that are independent of the meaningful data, and inlier noise which refers to errors on the meaningful data. The left part of Fig.3.1givesanconceptualexample;bluepointsthatarefarawayfromtheunderlyingmanifold (black) can be viewed as outliers, while red points close to the manifold can be viewed as sam- ples with approximated Gaussian noise. Because (1) the underlying manifold is nonparametric (2) both inlier and outlier noise exist, the geometric structure inference becomes a challenging problem. TherightpartofFig.3.1isarealexampleofcontourdetectiononnaturalimages. The resultsofthestate-of-the-artboundarydetector[72]stillhavemanyproblemssuchasfragments, noisyfalsedetection,whichrequiresfurtherrobustgrouping. 32 Figure 3.1: Problem illustration. Left: an example of inlier and outlier noise, right: a natural imageandtheboundarydetectionresultsby[72]. Tensor voting [81] is a geometric computational framework which has been successfully applied to many problems in computer vision [80, 155, 154] and machine learning [82, 28]. In thetensorvotingframework,localinformationisrepresentedbyasecond-ordertensorandafirst orderpolarityvector. Tensorvotingisquiterobustwithonlyonefreeparameterandcanhandle large amount of outliers. However, there are two potential problems, which are not solved by the standard framework. First, the standard tensor voting algorithm does not take into account inlier noise. Given two points lying on a manifold with errors, tensor information propagation from one point to the other is actually incorrect. The problem could be more serious when the datasamplingdensityonthemanifoldislow,whichmeanstheerrorcannotbemitigatedbythe lawoflargenumbers. Second,whenvoterandreceiverpointsareveryclosetoeachother,small numericalerrorsleadtolargebias. This is illustrated in Fig. 3.1 above. Some points are sampled from an arc with small Gaus- sian noise. On the top of the arc, four points lie on a vertical line orthogonal to the local curve segment due to small errors. Standard tensor voting produces incorrect tensor values for the points on the top of the curve. The estimated normal directions are horizontal while it should 33 bevertical. Ontheotherside,Probabilistictensorvotingworkswellforthosepoints,aswellas otherinliernoisypoints. In this chapter, probabilistic tensor voting framework is proposed to handle outliers and inlier noise simultaneously. 1 The main idea is to apply the bias-variance tradeoff from machine learning to perceptual grouping by incorporating a Bayesian framework [29]. We follow the original framework, but change the voter’s position from a fixed vector to a random distributed vector. The voting results are the expectation of the random vector voting event. This process generates new deterministic voting fields. Theoretical justification is given in section 3.4. The newvotingmethodincorporatesbothgeometricalgorithm,whichisfromstandardtensorvoting, and probabilistic framework, which is used to handle inlier noise. We evaluated the quality of the proposed algorithm by visual inspection, tangent space estimation error, denoising error, and image contour grouping results. Probabilistic tensor voting outperforms other candidate approaches. Contributions: (1) By combining a Bayesian framework with the geometric structure inference algorithm, we gainthecapabilitytoreducethelearningvarianceforperceptualgrouping. 
(2)Ournovelvotingframeworkwithone2ndordertensorandtwotypesofpolarityvectorscan representandhandlebothoutlierandinliernoiseatthesametime. (3) As in standard tensor voting, our algorithm can process all types of geometric structures in a non-parametric way, including multi-manifold, manifold junctions, manifold endpoints, etc. Notethatthecomputationalcostofthenewapproachremainsthesame. 1 Thereisnoexplicitassumptionfortheinliernoisedistribution. 34 3.2 RelatedWork We start with a review of the works along three axes. They are related work to tensor voting, outlieridentificationandnon-parametricdatadenoising. Tensorvoting[81]isageneralcomputationalframeworkthatcanbeappliedtomanyareasin computer vision and machine learning. It includes figure completion [80], 3D mesh extraction, motion pattern analysis [155, 154], dimension estimation, manifold learning [82], etc. Detailed comparisons with other methods in any one of these specific areas can be found in previous tensorvotingpapers. Foroutliersidentification,thereisalargeclassofmethods,calledstochasticsampleconsen- sus. The representative work among them is the Random Sample Consensus, RANSAC [21]. The objective function to be maximized is the number of estimated inlier points, which are the datapointslyingwithinagiventhresholdoftheestimatedmodel. RelatedworksincludeMLE- SAC, MAPSAC and M-estimator [39]. Recently, StaRSac is proposed to analyze the variance of the parameters (VoP), and it shows that there is a stable region of the estimated parameters which can solve the unknown uncertainty problem [11]. The main limitation of these methods is the need for specific parametric model of inlier data. Another type of popular work is voting based methods. In Hough transform [74], the optimal model parameters are selected as the pa- rametersspacecellwhichgetsthemaximumnumber ofvotes. Houghtransformalso requiresa parametric model of inlier data, so it can’t be applied to non-parametric case. Note that tensor voting can handle large amount of outliers without parametric model constraint, but the inlier noiseisnottakenintoaccountintheoriginalframework. 35 For inlier data denoising, many parametric model based denoising methods have been pro- posed, such as linear and nonlinear regression. Here we only focus on the non-parametric data denoising. Principal Component Analysis (PCA) is one of the most successful denoising meth- ods that has been applied to many aspects of computer vision [47]. While PCA is a linear sub- spacemethod,KernelPCAisanextensionbykernelizingtheGrammatrix[79]. RobustPCAis amarriagebetweenPCAandrobustestimatortohandleoutliers[54],andthemainlimitationis the linearity constraint of the inlier data model. Also it calculates covariance matrix iteratively, whichleadstolargecomputationalcost. To handle non-parametric estimation, a manifold model is assumed to fit the data, which means there is a latent intrinsic structure embedded in the high dimensional space. Diffusion map method views denoising as reversing a diffusion process of which the normalized graph Laplacian matrix is the generator [41]. This approach tends to overly smooth the inputs and pushdatapointstothemeancurvatureofthemanifold. Moreover,outliersarenotconsideredin theframework,andoutliersaregreatlyharmfultothediffusionprocess. Ourprobabilistictensor voting framework performs non-parametric estimation in the presence of both inlier and outlier noise. 3.3 ProbabilisticTensorVoting Therearetwodistinguishedfeaturesofprobabilistictensorvoting. 
First,inthevotingprocedure, voter’spositionbecomesa random vectorwiththeautomaticlearntdistribution. Second,inthe 36 Figure3.2: IllustrationandvisualizationofTensorVoting. Top,thefunctionsofthesecondorder and two types of the first order tensors; bottom, the second order stick vote and the first order stickvotewithtwopolarityvectorsorthogonaltoeachother representation part, a new type of polarity vectorp2 is introduced. As a result, inlier noise can beexplicitlyhandledbythenewframework. 3.3.1 Representation The top part of Fig. 3.2 is an illustrative example, which shows how to use the second order tensorandtwotypesofpolarityvectorstorepresentthelocalgeometricinformationandclassify points. Secondordertensor: thefunctionofthesecondordertensorT∈ℜ 2×2 isalmostthesame asthestandardvoting,exceptthatλ 2 representstheuncertaintyofthelocalinformation 2 . When 2 Algorithmisgivenin2D,experimentsinclude3Dbyanaturalextension. 37 λ 2 is relatively large, it can be inferred either the inlier point or its tensor suffers from noise, causedbyerrorinformationpropagationfromtheneighborhood. 3 Type 1 polarity vector: it is the same as in the standard voting, which usesp1∈ℜ 2 to detectendpoints[127]. AsinFig.3.2,endpointscannotbedetectedbythesecondordertensor, but it can be detected by the local maximum of polarity||p1|| in the tangent spacee 2 ∈ℜ 2 . More generally,p1 reflects the local data sampling asymmetry on the manifold. For example, if most of the neighborhood points ofx j ∈ℜ 2 lie on one side of its normal direction, then the polarityvectorwillpointtothisside. Anextremeexampleisendpoints,whogetlargestpolarity ||p1||amongtheneighborhoodpoints. Type2polarityvector: asinFig.3.2,errorpointsaredetectedbythelocalmaximumtype2 polarity,andtheerrorvectorisgivenbythecombinationofe 1 ∈ℜ 2 andp 2 ∈ℜ 2 . Errorvector tellsthedirectionfromthepointtothemanifold. Moregenerally,polarity||p2||andeigenvalue λ 2 reflect the uncertainty of the estimated normal (tangent) space, which is related to the error amount at this point. Furthermore,p2 is helpful for junction localization. For instance, L and T-typeofjunctionshavelargepolarities||p2||andthedirectionofp2pointstothejunction.p2 isrelatedtothepreviouswork[123],whichestimatescurvature. 3.3.2 SparseVotingProcedure As in a standard setting (starting from un-oriented points), voting algorithm in PTV has a pro- gressive structure which includes three layers: sparse ball vote, sparse stick vote, and revised stickvote(itcanhaveafewmorelayersforspecifictasks). Thekeyideaisthepositionofeach 3 2 isverysmallcomparedto1,ifdataisnoiseless. 38 tokenx j ischangedfromadeterministicvectortoarandomvectorinR 2 . Votingresultsbecome the expectation values associated with distributions of voter’s positions. Under different situa- tions,withorwithoutthedirectiononpoints,theprobabilisticvotehasdifferentinterpretations, whicharegivenasfollows. 3.3.2.1 VotewithOrientedPoints The bottom part of Fig. 3.2 is an illustration of vote with the oriented point, plus the novel type 2 polarity vector. Formally, there is a tensorT j (e j e j T ) at each pointx j representing the local normal space. So, a 1D p.d.f p j (x) is enforced tox j in the normal space 4 , and the vote atx i canbecalculatedasfollows, S(x i ,σ v ,σ n ) = ∑ j̸=i [ ∫ N j 1 (2π)σ 2 n e − ∥x−x j ∥ 2 2 2 n T(x,T j ,x i ,σ v )dx] = ∑ j̸=i [ ∫ N j 1 (2π)σ 2 n e − ∥x−x j ∥ 2 2 2 n T(x,e j e j T ,x i ,σ v )dx] (3.1) HereN j (the short notation forN x j M) is the 1d normal space ofx j , spanned by vectore j 1 ∈ ℜ 2 . 
p j (x) is a 1d Gaussian distribution with meanx j and variance σ 2 n in N j . The function T(x,e j e j T ,x i ,σ v )isthestandardvotingfunction[81]withtheanalyticformulationas, T(x,e j e j T ,x i ,σ v ) =e − s 2 +ck 2 2 v R(2θ ji )T j R(2θ ji ) T (3.2) 4 Thereasonisgiveninthetheoreticaljustificationsection. 39 θ ji is the angle between the tangent directione j 2 and vectorx i x j . Arc length s = θ ji l sin(θ ji ) , curvaturek = 2sin(θ ji ) l andl =||x−x j ||. R(2θ ji ) is the standard counter-clockwise rotation matrixwithangle2θ ji . cisaconstantdefinedin[81]. Theintuitionisuncertaintycanbelocallydecomposedintotwodirections,tangentandnor- mal. Tangent direction represents the sampling uncertainty on the manifold. And normal direc- tionrepresentstheuncertaintyoutsidethemanifold,indicatingtheinliererrorinourprobabilistic voting framework. From the point of view of numerical analysis, voting results are much more sensitivetothepositionperturbationδxinthenormaldirectionN x M thaninthetangentdirec- tion T x M. More discussions about the distribution of the voter’s position are given in section 4.4. The stick voting field (vote with known direction) is visualized in Fig. 3.3. Compared with standard field, the new field decreases more slowly as the curvature increases. That is because oftheuncertaintyofthevoter’spositioninthenormalspaceN x M. Inthetangentdirection,the saliency function is monotonically decreasing with the distance in the standard field, while it is amountainshaped functioninthenewfield. Similartothesecondordertensor,thepolarityvectoristheexpectationoftherandomvote, p1(x i ,σ v ,σ n ) = ∑ j̸=i [ ∫ N j 1 (2π)σ 2 n e − ∥x−x j ∥ 2 2 2 n P1(x,T j ,x i ,σ v )dx] (3.3) p2(x i ,σ v ,σ n ) = ∑ j̸=i [ ∫ N j 1 (2π)σ 2 n e − ∥x−x j ∥ 2 2 2 n P2(x,T j ,x i ,σ v )dx] (3.4) 40 Figure3.3: Visualizationofvotewithorientedpoints. Top,saliencycomparisonofthestandard voting (left) and the new voting(right); Bottom, 3D saliency map comparison of the standard voting(left)andthenewvoting(right) p1(.) is the original polarity vector in the standard voting algorithm and p2(.) is the type 2 polarityvectororthogonaltop1(.)ineachdeterministicvote. Revised Vote withe 1 andp 2 : if we want to further reduce errors, the third pass of revised stick vote can be performed. Formally, if not only the normal directione but also the polarity e 2 is known, then the revised vote can be performed. The voting algorithm is the same as stick vote,whilethecenterofthedistributionismovedfromthevoter’spositiontotheerrordirection, whichisprovidedbyp2. S ji = ∫ N ∗ j p ∗ j (x)T(x,T j ,x i ,σ v )dx (3.5) 41 Figure 3.4: Visualization of vote with unoriented points. Top,ℜ 2 vote saliency map; bottom, STV and PTV for ball vote saliency and junction saliency, the horizontal and vertical axes rep- resentdistanceandsaliencyvaluerespectively. HereN ∗ j (shortnotationforN ∗ x j M,revisednormalspaceatx j )andp ∗ j (x)arenotonlydenoted bythenormaldirectionofx j ,butalsothepolarityvectorp2. Intheℜ 2 ,N ∗ j isthehalfrangeof 1dnormaldirectionwhichhascrossanglelessthan pi 2 withp2. N ∗ x j M ={αsgn(p2 T e 1 )e 1 |,∀α∈R + } (3.6) p ∗ j (x)isaunilateralGaussiandistributionwithmeanx j andisotropicco-varianceσ 2 n inthedata spaceN ∗ j . Thesamechangesalsotakeplaceforthepolarityvote. 
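A numerical sketch of the probabilistic stick vote in eq. 3.1 is given below for the 2D case; it approximates the expectation by a discrete Gaussian quadrature along the voter's normal direction and reuses the hypothetical `stick_vote_2d` helper from Chapter 2. The quadrature settings are illustrative choices, not part of the original formulation.

```python
def probabilistic_stick_vote_2d(x_j, e_j, x_i, sigma_v, sigma_n,
                                n_samples=41, span=3.0):
    """Expectation of the standard stick vote of Eq. (3.1) when the voter's
    position is Gaussian (std sigma_n) along its own normal direction e_j."""
    offsets = np.linspace(-span * sigma_n, span * sigma_n, n_samples)
    weights = np.exp(-offsets ** 2 / (2.0 * sigma_n ** 2))
    weights /= weights.sum()                      # discrete Gaussian weights on the normal line
    T = np.zeros((2, 2))
    for w, t in zip(weights, offsets):
        T += w * stick_vote_2d(x_j + t * e_j, e_j, x_i, sigma_v)
    return T
```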
3.3.2.2 VotewithUnorientedPoints Theprobabilisticvotewithunorientedpointscanbederivedbycombiningthealignmentprocess andtheprobabilisticstickvote, S(x i ,σ v ,σ n ) = ∑ j̸=i [ ∫ O(x i ,x j ) 1 (2π)σ 2 n e − ∥x−x j ∥ 2 2 2 n T(x,T ∗ j ,x i ,σ v )dx] = ∑ j̸=i [ ∫ O(x i ,x j ) 1 (2π)σ 2 n e − ∥x−x j ∥ 2 2 2 n T(x,u ij u ij T ,x i ,σ v )dx] (3.7) 42 T(.) is the standard stick voting function. O(x i ,x j ) is the alignment space and u ij is the alignmentnormaldirection,whicharedeterminedbypositionsofx i andx j , O(x i ,x j ) ={αu ij |,u ij T u ij = 1,u ij T (x i −x j ) = 0,∀α∈ℜ} (3.8) The vote can also be derived by combing the standard ball voting function and the proba- bilistic framework. When tokenx j votes to tokenx i , an isotropic p.d.f p(x) is assumed to the voter’s positionx j ∈ℜ 2 . Under certain assumptions, the uncertainty distribution of voter’spo- sitionisina1D space,whichreducescomputationalcostandavoidssomenumericalproblems. Thevotingresultsatx i canbecalculatedasfollows, S(x i ,σ v ,σ n ) = ∑ j̸=i [ ∫ O ji 1 (2π) 1 2 σ n e − ∥x−x j ∥ 2 2 2 n T(x,I,x i ,σ v )dx] (3.9) Here, p(x) is assumed to be a 1D Gaussian distribution with meanx j and variance σ 2 n in the orthogonal spaceO ji , which is a 1D line orthogonal to the straight linex i x j . σ v is the voting scale and σ n is the noise voting scale. The function T(x,x i ,σ v ) is standard ball vote with an analyticformulationasin[81], T(x,I,x i ,σ v ) =e − ∥x−x i ∥ 2 2 v (I− (x−x i )(x−x i ) T ∥x−x i ∥ 2 ) (3.10) 43 Botheq.3.7andeq.3.9actuallyhasastraightforwardformulationasfollows, S(x i ,σ v ,σ n ) = ∑ j̸=i [λ(∥x j −x i ∥,σ v ,σ n )(I− (x j −x i )(x j −x i ) T ∥x−x i ∥ 2 ) +µ(∥x j −x i ∥,σ v ,σ n )( (x j −x i )(x j −x i ) T ∥x j −x i ∥ 2 )] (3.11) λ(.) is the kernel function for the normal direction propagation (vote saliency), and µ(.) is the kernel function for the uncertainty propagation (junction saliency). These two kernels have analyticformsbutinvolvecomplicatedmodifiedBesselfunction. Inadegeneratecase,σ n = 0, λ(.) becomes a Gaussian kernel and µ(.) becomes a 0 constant kernel, which is the same as eq.3.10,i.e.,thestandardtensorvotingresults. The 2D new ball field is visualized in Fig. 3.4. We can see that, the voting saliency is a mountain shaped function of the voting distance s. When s becomes relatively small, the saliencyisalsosmall. Moreimportantly,thenewsaliencyfunctionissmootheverywhere,while thestandardonehasaδ jumpats = 0. Analysis. There are two parameters in the voting algorithm, the standard voting scale σ v and newlyaddednoisevotingscaleσ n ,whichindicatesthevariancesofthevoter’sdistribution. Both twoparametersarerobust,andthisissupportedbyempiricalstudyinlatersections. Itisnotable that,bothballfieldandstickfieldareidenticaltotheoriginalfieldifσ n = 0. In the following part, we analyze the ball voting algorithm briefly and compare it to related approaches. Eq. 3.10 and eq. 3.11 focus on the normal representation, and the dual formulation focusing on the tangent representation can be got easily by omitting the identity matrix. Then, we can recognize they are special cases of the local linear tensor (LLT) model with different 44 kernelfunctions(omittingthejunctionsaliencypartin probabilistic ballvote). Thereisanother type of popular LLT model, Local Principal Component Analysis (LPCA), which serves as the foundationformanynonlinearlatentvariablemodelingworks,suchas[153]. 
S(x i ,σ v ) = ∑ x j ∈N(x i ) ∥x j −x i ∥ 2 ( (x j −x i )(x j −x i ) T ∥x j −x i ∥ 2 ) (3.12) N(x i ) is the set of neighborhood points ofx i (K-NN or h-neighborhood). It is notable that, we usex i to approximatex m , which is the local mean estimation of N(x i ). Compared with the quadric kernel (LPCA), Gaussian kernel (STV) is more suitable for nonlinear manifold and robusttooutliers,whileλ(.)(eq.3.11,PTV)isanadvancedversionofGaussiankernel(similar toChi-Squarekernel)byconsideringinliernoise. 3.3.3 DenseVote Dense vote is the case of formulation II defined chapter 1. Dense vote can be performed to explore the underlying data structure if needed, such as the out-of-sample extension manifold. The related data space (such as the convex hull of the data)y is divided as a grid by precision requirement, then all grid elements collect vote information from the input datax. The vote equationsofdensevoteareslightlydifferentfromthestickvote. S(y i ,σ v ,σ n ) = ∑ j [ ∫ N ∗ j p ∗ j (x)T(x,T j ,y i ,σ v )dx] (3.13) T(.)isthestandardvotingfunction,N ∗ j isthehalfrangeof1dnormaldirectionwhichhascross anglelessthan π 2 withp2. p ∗ j (x)isaunilateralGaussiandistributionwithmeanx j andisotropic 45 co-variance σ 2 n in N ∗ j . The same changes also take place for the polarity vote. After getting geometricinformationforallpointsonthisgrid,non-maximumsuppressioncanbeusedtofind theunderlyingmanifold. Moredetailedinformationaboutdensevotecanbefoundin[77]. 3.4 TheoreticalJustification Mathematics Preliminary. Let M be a smooth sub-manifold (in differential sense) in Eu- clidean spaceℜ 2 of intrinsic dimensionality 1, then for any pointq∈M, there are 1D normal spaceandtangentspace,denotedbyN p andT p . Forinliernoisepointq,wecanwriteq =p+e wherep∈ M ande is the error vector. Then, the normal space ofq is defined by the normal spaceofp. ProbabilisticVoteandUncertaintyDecomposition. Thekeyassumptionofstickvoting(vote with known direction) is as follows: the uncertainty of the voter’s position can be modeled as a Gaussiandistributioninthevoter’snormalspace. Tubular Neighborhood Theorem: letM be a smooth 1D manifold (in differential sense) in Euclideanspaceℜ 2 . Then,thereexistsϵ> 0suchthatforanypointq closetoM withdistance at mostϵ, there is a unique expressionq =p+v, wherep∈M andv is the normal space at pointp. Theproofofthistheoremcanbefoundinclassicaldifferentialgeometrybook. Thistheorem states that for any point close to the manifoldM, there is a corresponding projection pointp at M. q can be moved top lying in its normal space N q . We already know that standard tensor voting gives precise normal space estimation for points on the manifold, and the approximate 46 normalspaceestimationforpointswithsmallinliererrors. Basedontheestimatednormalspace information, probabilistic stick vote allows points to have uncertainty in their normal space, so the new vote can get more precise tensor estimation results. This theorem supports that the uncertaintyinthenormaldirectionisthekeyissuefortheprobabilisticvote. 3.5 Experiments 3.5.1 EvaluationMethodology We compare our Probabilistic tensor voting (PTV) with other non-parametric methods, includ- ing standard tensor voting (STV) [81], PCA [47] and diffusion map based manifold denoising (MD) [41]. Parametric methods, such as RANSAC, are not considered here. We examine the quality of algorithms with endpoint completion, tangent space estimation, denoising and image contour grouping. PTV and STV are applied on all tasks while PCA and MD are designed for datadenoisingtaskonly. 
Severaldatasetsareusedinourexperiments,includingsyntheticdata, thelogoofECCV,3DStanfordBunnydata[50],andtheBSDSimagedatabase[72]. Wechose themfortheirdifferentcharacteristicsinintrinsicstructure(linearornonlinear),applicationdo- main, multi or single manifold, curvatures (small or large), number of samples (varies from a few hundred to thousands), etc. For 3D data, we naturally extend PTV algorithm to 3D in the samewayasin[81]. WealsousedLocalPCA(LPCA)[153]fortangentspaceestimation. Preliminaryresultsdid notsupportthisalgorithmasaviableoption,becauseLPCAissensitivetodatasamplingdensity onthemanifoldandoutliers. Soweomitthoseresults. Tofurtherreducethecomputationalcost, 47 vote only takes place in a neighborhood with a distance threshold s max , which is calculated as exp(−( s 2 max σ 2 v )) = 0.001. The number of voting is reduced from O(N 2 ) to O(NM), making the computational cost a linear function of the input data sizeN. HereM is the average size of neighborhooddependingonσ v andσ n . 3.5.2 ResultsonSyntheticData Endpoint completion: Fig. 3.5 shows the comparison between STV and PTV for endpoints completion. Syntheticerror(0.5shift)isaddedtotheendpointsoftwolinesegments. Thesetwo line segments should be connected by a straight line, which is missing in this figure. The target ofthecompletionistorecoverthismissingline,andsimilarworkwasdonein[80], [145],while no error is added to the endpoints. At the sparse vote stage, both STV and PTV use two pass votestogetthetensorateachpoint. Thenthetensorsattwoendpointsareusedtogeneratedense saliency map by one pass dense vote. Since STV gets incorrect tensors at those two endpoints, themissinglinecannotberecovered. ForPTV,p2pointsinthedirectionoftheendpointserror. Bymovingtheendpoints’positionsalignedtop2(unilateralGaussiandistribution),themissing lineisperfectlyrecoveredbydensevote(σ v = 30,σ n = 0.5). Tangent space estimation: correctly estimating the tangent (normal) space of manifolds under noisy conditions is an important step for many applications, including contour grouping, manifold denoising, mesh reconstruction and smoothing, etc. In this section, we intensively investigate the robustness of STV and PTV for this target by adding different levels of inlier and outlier noise. The logo of POCV, sphere data and Stanford Bunny data are used here. The 48 Figure 3.5: Comparison between STV and PTV for endpoints completion. Left: input tokens; middle: endpointscompletionresultsbySTV;right: endpointcompletionresultsbyPTV comparisonmetricweuseistheaverageerroroftheestimatedtangentspaceoninlierpointsby twopassvotes, e = [ Num I ∑ i=1 cos −1 (|N i T G i |)]/Num I (3.14) Num I is the number of inliers, Num O is the number of outliers and outlier ratio (OR) is Num O /Num I . N i is the estimated tangent (normal) space andG i is the ground-truth atx i . It is clear that PTV produces higher errors as the noise level increases. The point here is to demonstratethatPTVis consistently better thanSTVindiversifiednoiseenvironments,evenin theextremelynoisycondition. (1), the tangent space ground-truth of POCV logo can be calculated analytically. 11 levels of inlier noise (standard deviation (SD) from 0 to 2) with 11 levels of outliers (OR from 0% to 300%) are added to the data. 5 Estimated error by two pass votes is reported in Table 3.1. 6 It shows STV is quite robust as OR increases but sensitive to SD. PTV is consistently better than STV, in a large range of noise, and decreases the error by 25% on average. 
For instance, 5 Fig.3.6,someoutliersarenotshownduetotheaxislimitations. 6 WhenSD isrelativelylarge( 0:4),theperformanceisdominatedbyinliernoise. 49 Figure 3.6: Tangent space estimation. The first row, noiseless POCV logo with ground truth of normal directions and SD = 1 with 200% OR. The second row, noiseless sphere, SD = 0.5 with50%ORandSD = 1with100%OR.Thethirdrow,noiselessBunny,SD = 0.5,SD = 1. 50 for data with SD = 0.6 and OR = 300%, PTV achieves only 8 degree error, while it is even difficultforhumanstorecognizePOCVfromthenoisyenvironments. Bysettingσ v = 10forall experiments, we show that PTV algorithm is quite robust. Furthermore, we fixσ v and observe the performance via different choices ofσ n . Results show that PTV is not sensitive toσ n . For instance, for data with SD = 0.6 and OR = 300%, estimated errors with large range of σ n are reported in Table 3.2 and the performance is almost a constant to different values of σ n (σ v = 10). (2),forthespheredata,wealsoadddifferentlevelsofGaussiannoiseanduniformly distributed outliers to the data (number of points is from 1600 to 6400). 7 We found that PTV is consistently better than STV and achieves 10% error reduction on average. (3), for Stanford Bunny data (Fig. 3.6, reconstruction results), the ground-truth is calculated by using STV on noiseless points cloud. Since outliers bring troubles for reconstruction, we only added different levelsofGaussianinliernoise. Themeshreconstruction(1889vertexesand3851faces)isbased on the noiseless point could. In this case, PTV is also consistently better than STV and reduces errorby12%onaverage. 8 Datadenoising: wecompareourPTValgorithmwithPCAandMD,whicharetwopopular non-parametric denoising methods. Points are sampled from an arc curve with small Gaussian noise and uniformly distributed outliers (50%). From Fig. 3.7, we can see that, PCA finds the global principal direction of the data, which is incorrect due to the nonlinear intrinsic structure and outliers. MD can smooth inlier points after several iterations but can not recognize outliers and it attracts close outliers to diffusion process if we keep running the algorithm. In contrast, 7 Fig.3.6,inlierandoutlierpointsarecoloredblueandgreenrespectively. 8 For3Dcase,weuseonepassvote,andthedetailedcomparisonisomitted. 51 Table3.1: Tangentspaceestimationonsyntheticdata. PTVresultsarebold OR/SD 0.4 0.8 1.2 1.6 2.0 0% 0.158/0.105 0.316/0.224 0.344/0.247 0.407/0.299 0.479/0.336 150% 0.150/0.128 0.293/0.206 0.346/0.214 0.428/0.298 0.466/0.328 300% 0.168/0.130 0.258/0.207 0.340/0.287 0.432/0.371 0.473/0.391 Table3.2: PTVisrobusttoσ n . σ n v.s. erroroftheestimatedtangentspace σ n 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 Error 0.148 0.140 0.133 0.130 0.137 0.141 0.141 0.139 0.140 0.138 PTV can filter out outliers and get information from inlier data simultaneously by two pass votes, and inlier data can be further smoothed by moving points in their normal direction. This experiment may be argued by the fact that MD or PCA could be combined with some outlier detection methods, like clustering. In fact, the motivation here is to demonstrate PTV is robust to outliers for nonlinear data denoising, while MD or PCA not. Since PTV is a purely local method,itcanalsobestrengthenedbycombiningwithotherapproaches,likethelocaltoglobal alignmentmethod[124]. 
Figure 3.7: PTV performs better than PCA and MD in denoising when the data include both outlier and inlier noise. From left to right: original data, inlier data, PCA results, MD results and PTV results.

Figure 3.8: PTV performs better than STV for edgel grouping. The first row: initial edgels, STV results, PTV results and smoothed edgels. The second row: STV results, PTV results, STV results and PTV results. The third row: initial edgels, STV results and PTV results.

3.5.3 Image Contour Grouping

We applied our algorithm to the BSDS image database [72]. The problem we consider is how to analyze and group edgels in natural images, which serves as a foundation for image contour grouping. Contour (boundary) grouping results should be connected, smooth, and insensitive to image noise, illumination change and contour occlusion. This differs from object contour detection [120], which involves high-level knowledge. We start from the edge detection results in [72]. The probability-of-boundary detector provides a 1/0 edge map with 8 filter responses for each edgel. An initial tensor is generated by summing the 8 filter responses, and tensor voting is then performed on the edgels. As shown in Fig. 3.8, the initially detected edgels clearly suffer from noise, which requires robust grouping. For instance, the edge in the first figure of the second row misses a part (the part to the left of the endpoint is missing), because the boundary contrast is not high enough to infer an edge locally.

In the first row, the sloped, line-shaped edge looks like a step edge because of textures, and it induces errors for STV, especially for p1 (red vector). The first and the fourth points at each step are detected as endpoints. On the other hand, PTV produces tensors that are more likely to be orthogonal to the edge direction, with a more precise p1. Endpoints are extremely important for image contour grouping, and reducing the error of p1 is crucial for endpoint detection. In the second row, PTV provides precise T and p2 (red vector) at endpoints, which is important for further endpoint completion. On the right of the second row, there is a comparison between STV (left) and PTV (right) on a T-junction. p2 (red vector) in PTV points to the location of the L-junction, which is helpful for precise junction localization. In the last row, given a horizontal edge with a small jump, STV produces false alarms in endpoint detection, while PTV does not.

Based on the local grouping results of PTV, local-to-global contour completion is performed by connecting boundaries through the detected key points, i.e., endpoints and junctions. There are three types of completion: endpoint to boundary, endpoint to endpoint, and junction to boundary. As a necessary step, we cluster the detected keypoints into groups by local graph clustering. Details are omitted because of limited space. The last two rows of Fig. 3.8 illustrate the contour completion results. In each row, the left image shows the boundary detection results of [72] and the right image shows the completion results. After completion, the quality of the contours is improved, since disconnected boundary fragments are grouped and false detections are removed.

3.6 Conclusion

In this chapter, probabilistic Tensor Voting is proposed to process data with both outlier and inlier noise. The advantages of this framework are: it is non-parametric, robust to both inlier and outlier noise, able to handle complicated structures, and has the same computational cost as STV.
From the experimental results, it is clear that our framework outperforms the other candidate methods. When data suffer from both types of noise, the improvement of our framework is significant.

To fully realize the potential of the proposed probabilistic Tensor Voting framework, a few issues remain to be solved. First, in image contour grouping, local contour analysis needs to be combined with other local-to-global inference techniques to perform contour completion. Second, the dimensionality of many interesting data in computer vision is higher than 3, which suggests extending our approach to spaces of general dimension. Unified Tensor Voting in high-dimensional space is proposed in Chapter 4. One remaining problem is to combine the unified Tensor Voting and the probabilistic vote into an integrated framework with an elegant and efficient computational algorithm.

Chapter 4
Unified Tensor Voting in High Dimensional Space

In Chapter 2, the standard tensor voting in R^2 (STV-2) was described, which involves two voting cases, the stick vote and the ball vote. In this chapter, a unified voting framework in R^D is proposed as a natural generalization of STV-2, targeted at applications in manifold learning. This novel framework keeps the core idea of the algorithm formulation in STV-2, and gives a clean interpretation of the voting functions in all cases, including the stick and ball votes. [Footnote 1: Our approach is motivated by previous works [81] [82].]

The proposed unified voting framework is motivated by several issues. Besides the natural requirement that STV-2 be generalized to high-dimensional space for manifold learning, another important aspect is that the two basic voting functions in STV-2, i.e., stick and ball, are proposed separately, leading to inconsistency issues in the voting algorithm. The unified tensor voting framework gives a clean and elegant way to design a general voting function in R^D, which covers all possible situations, including both the stick and ball voting functions. Simply speaking, there is no ball voting field in the closed-form solution, and the only fundamental voting function is the stick voting function. The essential idea of the closed-form solution is that the voting functions in all cases are developed from only one fundamental vote plus geometric and algebraic operations in R^D.

The ND voting algorithm described in [82] involves a Gram-Schmidt process, which has ambiguity in the basis-vector decomposition and does not have an explicit formulation. The unified voting formulation gives a closed-form solution that only involves matrix multiplication and eigen-decomposition. The first contribution is the introduction of the Householder transformation, which gives a closed-form solution for the stick voting function, i.e., the only fundamental voting function. The second contribution is the introduction of the weight kernel function, so that a low-rank vote is represented as a weighted sum of stick votes. By choosing a special type of kernel function, any low-rank vote, as well as the general vote, can be formulated in closed form. Besides the feature that there is only one fundamental voting function, the proposed closed-form voting algorithm has several other advantages, such as being easy to implement and to analyze theoretically.

4.1 Representation in R^D

In this section, the relationship between the tensor (positive semi-definite matrix) and the geometric properties of a manifold (embedded in high-dimensional space) is given.
The key idea is the first order information of manifolds, i.e., tangent space and normal space, are represented by a set of orthogonal basis vectors which are eigenvectors of the positive semi-definite matrix. A natural question can be asked is why higher order information, e.g., curvatures or Hessian matrix, are not included in our framework. This is because the first order information is useful 57 Figure4.1: Localtangentspaceofapointx∈M. for many real applications, and higher order information is difficult to be estimated accurately forsparseandnoisydatainhighdimensionalspace. From Manifold to Tensor. Given a manifoldM∈ R D with intrinsic dimensionality d < D, the representation of the manifold structure at a given pointx∈ M is a second order tensor, i.e., a symmetric, positive semi-definite matrixT ∈ R D×D . A tensor represents the manifold structure of pointx by encoding the normal space of the manifold as eigenvectors of the tensor thatcorrespondtonon-zeroeigenvalues,andthetangentsaseigenvectorsthatcorrespondtozero eigenvalues. Wecan think the tensor explicitlyrepresents the normal space at a given pointx∈M, and implicitly represents the tangent space as the orthogonal (dual) space of the tangent space. By geometric analysis we can get that, the relationship between the tensor representation and the tangent/normal space is a one-to-one mapping. For example, a point belonging to a 2 dimen- sional manifold inR D is represented by two tangent vectors andD−2 normal vectors, and is further represented by a tensor with D− 2 non-zero constant eigenvalues associated with the 58 eigenvectorsthatspanthenormalspaceofthemanifold. Thematrixalsohas2zeroeigenvalues whosecorrespondingeigenvectorsspanthemanifold’snormalspace. Formally, given a pointx∈ M, the normal space atx is{e 1 ,e 2 ,...,e D−d }∈ R D×D−d , and tangent space is{e D−d+1 ,...,e D }∈ R D×d , the tensorT∈ R D×D atx can be denoted as alowrankmatrix 2 withthefollowingformulation: T = D−d ∑ k=1 e k e T k (4.1) Thisformulationmeansforpointx,thelocalmanifoldstructurehasD−dnormalvectorswith eigenvalue1anddtangentvectorswitheigenvalue0. Thusitisstraightforwardtoshowthatthe rankofT isD−dandthelocalintrinsicdimensionalityatxisd. As a special case, T is defined as ball tensor in previous works, if all eigenvalues are equal (1), which represents perfect uncertainty in orientation, or in other words, the point is un-oriented. Itiseasytoprovethat,T isjustanidentitymatrixI∈R D×D . From Tensor to Manifold. On the other side, if there is a symmetric positive semi-definite matrixT, what’s the geometric interpretation? By the following decomposition of a matrix, all kinds of dimensionality possibilities are explicitly considered. The intrinsic dimensionality and tangent(normal)spacecanbeestimatedbasedonT. 2 ItislowrankwhenDd<D. Theonlyexceptionisd = 0,andthetensorbecomesanidentitymatrix. 59 T = D ∑ k=1 λ k e k e T k = D ∑ k=1 [(λ k −λ k+1 ) k ∑ j=1 e j e T j ] = d ∗ −1 ∑ k=1 [(λ k −λ k+1 ) k ∑ j=1 e j e T j ]+(λ d ∗−λ d ∗ +1 ) d ∗ ∑ j=1 e j e T j + D ∑ k=d ∗ +1 [(λ k −λ k+1 ) k ∑ j=1 e j e T j ] (4.2) The notationλ D+1 is set to 0 to simplify the decomposition. The intrinsic dimensiond ∗ is estimated according to the largest value ofλ k −λ k+1 (1≤ k≤ D), which can also be viewed asthepeakofthedifferentialofthetensor’sspectrum. d ∗ =D−argmax k (λ k −λ k+1 ),1≤k≤D (4.3) The normal space of the manifold atx is represented by eigenvectorse 1 toe d ∗ (span vectors). The tangent space is represented by the span space of eigenvectorse d ∗ +1 toe D . 
The manifold structuresaliencyisindicatedby(λ d ∗−λ d ∗ +1 ). Thelargerthesaliencyvalueis,theclearerthe manifoldstructureis. 4.2 VotingAlgorithminℜ D Theproblemformulationofunifiedvotingalgorithmisthesameaswhatwedescribedinchapter 1,exceptthevotingspaceischangedfromℜ 2 toℜ D . Problem Formulation: assume raw data are oriented points{(x n ,u n )} N 1 n=1 , whereu n ∈ R D is a unit vector representing the local normal space for point x n ∈ R D , and unoriented 60 points{x n } N 1 +N 2 n=N 1 +1 . Inputs tokens are encoded as{(x n ,M n )} N n=1 , where N = N 1 + N 2 , M n = ∑ D j=1 λ nj e nj e T nj ∈R D×D ,fori = 1,2,...,N 1 (orientedpoints)andM n =I∈R D×D , for i = N 1 + 1,...,N (un-oriented points). The target of the voting algorithm is to compute the tensors{T n ,p n } L n=1 for target points set{y n } L n=1 . y n ∈ R D can either belong to the set {x n } N n=1 ornot. 3 Thekeyideaofthevotingalgorithmisstraightforward. Eachtargetpointy m (m = 1,2,...,L) collects votes from its neighborhood pointsx n (n = 1,2,...,N) and linearly sums them up as follows, T m = ∑ n∈N(ym) Vote T (x n ,M n ,y m ,σ v ) (4.4) Where Vote T (.) is the voting function in R D , and σ v is the only free parameter in the voting algorithm,i.e.,thevotingscalethatcontrolsthevotingsaliencydecay. InthesamewayofSTV-2, theunifiedvotingalgorithmhasseveralspecialcases, whichare studiedmoreofteninrealapplications. Formulation I: one special case is N = L and the set{y m } L n=1 is identical to the set {x n } N n=1 . Itmeansthetensorinformationispropagatedwithinthesamepointsset. Thiscaseiscalled SparseVotingintheearlierliterature[77]. Sincetheinputandoutputare boththesamesetofgenericpoints(pointswithtensors),itnaturallymakesthealgorithmbeable toprocessdatainahierarchicalway. Votingalgorithmusuallyincludestwolayers(votetwice), anditcanbeextendedtothreeorfourlayerstomeetspecificdomainrequirements[77][81]. 3 Thisformulationisageneraldescriptionsuchthatallcasesofvotingcanbeviewedasitsspecialcases. 61 Alignment Voting Process. Given a symmetric positive semi-definite matrixM at point x, the voting results fromx to another pointy is denoted asT(x,M,y,σ v ). By the Merces’s theorem,weget M = D ∑ k=1 λ k e k e T k = D−1 ∑ k=1 (λ k −λ k+1 )[ k ∑ j=1 e j e T j ]+λ D [ D ∑ k=1 e k e T k ] (4.5) Basically,ageneraltensorisdecomposedintoD−1low-rankmatrices,eachofwhichhasequal eigenvalues (except 0), and one identity matrix (ball tensor). Then the final stick vote is the linear sum of the votes of theseD low-rank matrices. The vector between voter and receiver is x−x i = ∑ D k=1 <x−x i ,e k >e k . So,thevoteT(x,M,y,σ v )fromvoterx(withtensorM)toreceivery is, T(x,M,y,σ v ) = D ∑ k=1 (λ k −λ k+1 )T(x,[ k ∑ j=1 e j e T j ],y,σ v ) (4.6) For each low-rank vote with normal space ∑ k j=1 e j e T j , we further re-decompose the normal space into ∑ k j=1 e ′ j e ′ T j (by Gramm-Schmidt process), to ensure thate ′ 1 = v ∥v∥ , whilev = ∑ k j=1 <x−y,e j >e j . Thus,wecanfurthergetthefollowingequation, T(x,[ k ∑ j=1 e j e j T ],y,σ v ) =T(x,[ k ∑ j=1 e ′ j e ′ T j ],y,σ v ) = k ∑ j=1 T(x,e ′ j e ′ T j ,y,σ v ) (4.7) 62 Thepurposeofre-decomposingthenormalspace ∑ k j=1 e ′ j e ′ T j istominimizethecomputational cost 4 . For2≤j≤k <e ′ j ,y−x>=<e ′ j , D ∑ k=1 <x−y,e k >e k > =<e ′ j , k ∑ l=1 <x−y,e l >e l > +<e ′ j , D ∑ l=k+1 <x−y,e l >e l > =<e ′ j ,v > +<e ′ j , D ∑ l=k+1 <x−y,e l >e l > =<e ′ j ,e ′ 1 ∥v∥> +<e ′ j , D ∑ l=k+1 <x−y,e l >e l > (4.8) Inthelaststepoftheequation,thefirstitemis0sincee ′ j isorthogonaltoe ′ 1 . 
Theseconditem is also 0, becausee ′ j lies in the space ∑ k l=1 e l e T l , which is orthogonal to space ∑ D l=k+1 e l e T l . So,e ′ j isorthogonaltox−y,for2≤j≤k. Based on this observation, the voteT(x,e ′ j e ′ T j ,y,σ v ) is easy to compute (reduced to the simple2Dcasewithθ = 0),exceptthefirstone(j = 1),sincethenormalvectore ′ 1 maynotbe orthogonaltox−y. AvisualillustrationofthisalignmentprocessisgiveninFig.4.2. 4.3 Closed-FormVotingFunction Inthelastsection,weshowthatthegeneralvotefrom{x,M}toy isdecomposedintoD votes and each vote is performed by the alignment method. Here we further provide the closed-form 4 Thedeepinsightofthisre-decomposingprocessisgiveninthenextsection. 63 Figure4.2: Anillustrationofthespatialalignmentvote. formulationofthegeneralvote,whichcannotonlyletthevotingimplementationbecomeeasy, butalsomakethetheoreticalanalysispossible. Recall eq. 4.6, a general vote is decomposed into the weighted sum of D low-rank matrix voteasfollows, T(x,[ D ∑ k=1 λ k e k e T k ],y,σ v ) = D ∑ k=1 (λ k −λ k+1 )T(x,[ k ∑ j=1 e j e T j ],y,σ v ) (4.9) For each low rank matrix ∑ k j=1 e j e T j , the basis vectors can be rotated by any orthogonal matrix R∈ℜ k×k . Thus, the alignment method is proposed to choose a unique set of basis vectors. Here we will show that this alignment process is indeed equivalent to maximize the totalvotingsaliencyofalllow-rankmatrixvotes. Thestickvote(rank-1),i.e.,theonlyonefundamentalvote,canberepresentedas, T(x,ee T ,y,σ v ) =w(x−y,e,σ v )(I−2rr T )(ee T )(I−2rr T ) T (4.10) 64 Herer istheunitvectorconnectingvoterxandreceivery ( x−y ∥x−y∥ ), andw(x−y,e,σ v )isthe weightkernelfunction,andthevalueofthekerneldecreasesasthedistance∥x−y∥ 2 increases and the angle (betweenr ande) decreases. The key part in the stick vote is the transformation I− 2rr T , which is called householder transformation (householder reflection or elementary reflector). Furthermore,theweightkernelisdecomposedintotwofactorsasfollows, w(x−y,e,σ v ) =w dis (x−y,σ v )w angle (<r,e>) (4.11) wherew dis (·) is the distance kernel function with respect to∥x−y∥ 2 andw dis (·) is the angle kernelfunctionrespecttotheinnerproductbetweenr ande. Thus,thelowrankvoteT(x,y, ∑ k j=1 e j e T j ,σ v )canberepresentedas, T(x,y, k ∑ j=1 e j e T j ,σ v ) =T(x,y, k ∑ j=1 e ′ j e ′ T j ,σ v ) = k ∑ j=1 w(x−y,e ′ j ,σ v )(I−2rr T )(e ′ j e ′ T j )(I−2rr T ) T = (I−2rr T )[ k ∑ j=1 w(x−y,e ′ j ,σ v )(e ′ j e ′ T j )](I−2rr T ) T =w dis (x−y,σ v )(I−2rr T )[ k ∑ j=1 w angle (<r,e ′ j >)(e ′ j e ′ T j )](I−2rr T ) T (4.12) 65 wherew dis (x−y,σ v ) in the last step is invariant to the specific basis vectorse ′ j (1≤ j≤ k). Sothetraceofthevotingresultisjustw dis (x−y,σ v ) ∑ k j=1 w angle (<r,e ′ j >). If the angle kernel function is non-increasing and convex with respect to < r,e >, then it canbeshownthat,thealignmentmethodinsection4.2(chooseaspecificsetofe ′ j )isequivalent tomaximizethethetraceofthevotingresult. 
Furthermore,ifwesettheanglekernelfunctionw angle (·)to1when<r,e>= 0and0when <r,e>= 1,thenwecangetthesimplifiedformulationoflow-rankvoteT(x,y, ∑ k j=1 e j e T j ,σ v ) asfollows, T(x,y, k ∑ j=1 e j e T j ,σ v ) =w dis (x−y,σ v )(I−2rr T )[ k ∑ j=1 w angle (<r,e ′ j >)(e ′ j e ′ T j )](I−2rr T ) T =w(I−2rr T )[ k ∑ j=2 (e ′ j e ′ T j )+w ang (<r,e ′ 1 >)e ′ 1 e ′ T 1 )](I−2rr T ) =w(I−2rr T ) k ∑ j=1 e ′ j e ′ T j (I−2rr T )−w ′ (I−2rr T )e ′ 1 e ′ T 1 (I−2rr T ) =w(I−2rr T ) k ∑ j=1 e j e T j (I−2rr T )−w ′ (I−2rr T )e ′ 1 e ′ T 1 (I−2rr T ) (4.13) where w is the short notation of w dis (x−y,σ v ) and w ′ is the short notation of w dis (x− y,σ v )(1−w ang (<r,e ′ 1 >)).e ′ 1 isdefinedas v ∥v∥ ,whichisthesameasinsection4.2. 66 Combining eq. 4.9 and eq. 4.13, we can get the closed-form solution for the general vote T(x,M,y,σ v ). T(x,M,y,σ v ) = D ∑ k=1 (λ k −λ k+1 ){wP k ∑ j=1 e j e T j P−w ′ k Pe ′ 1,k e ′ T 1,k P} (4.14) whereP is the short notation for (I−2rr T ),λ D+1 is 0,w ′ k isw ′ in eq. 4.13 fore 1 toe k and thesamefore ′ 1,k . 4.4 Discussion One of the key features of the closed-form solution is, there is no separate ball voting field, and in fact, all voting functions are derived naturally from the stick vote (eq. 4.10). Thus, the problem of the inconsistency between the stick and ball vote (chapter 2.2) is solved elegantly. Twospecialcasesofeq.4.14aregivenasfollows. (1). IfM isarank1matrixee T ,thenthegeneralvoteT(x,M,y,σ v )degeneratestothestick vote, T(x,ee T ,y,σ v ) =w dis (x−y,σ v )w angle (<r,e>)P(ee T )P (4.15) (2). IfM is a full rank identity matrixI, then the general voteT(x,M,y,σ v ) degenerates to (voteforunorientedpoint), T(x,I,y,σ v ) =w dis (x−y,σ v )(I−rr T ) (4.16) 67 It is worth noting that, an alternative closed-form solution of tensor voting (CFTV) is pro- posed in [147]. In fact, CFTV is an extension of the original tensor voting framework, which relies on the integration idea [77]. Also, [83] proposes an efficient solution for Tensor Voting withO(1) complexity in 3D space. In contrast to [147] and [83], we propose a closed-form so- lutionasanextensionofthelatesttensorvotingin[82],whichisafullyalgebraicandgeometric approach. In particular, there are several key differences exist between CFTV and our formulation. First, CFTV is propose to give a closed-form solution for the original tensor voting based on theintegrationframework[77]. Ourformulationis an extensionof [82]which is fullybasedon matrix decomposition and multiplication. Looking closely at vote from unoriented points, ours givesI−rr T and CFTV givesI−rr T + 1 2 rr T (coefficients are omitted). In fact, the 1 2 rr T part remains in CFTV because of the integration, which is consistent with [77]. Also, the main usage of CFTV in [147] is to combine with the expectation maximization (EM) algorithm to perform robust estimation tasks such as stereo matching. Our formulation is mainly proposed to a fully geometrical and algebraical computational framework which can represent all votes consistently. Similar to CFTV, the efficient tensor voting algorithm proposed in [83] is also based on the integration framework [77]. The purpose of [83] is to design a simplified version of 3D tensor voting while keeping the same perceptual meaning of the original framework [77]. Infact, [83] canbealsoviewedasamodifiedversionof[82]in3Dspace. However,thegeneralizationof[83] tovotingalgorithminNDspaceisnotstraightforward. 
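As a complement to the comparisons above, the closed-form vote of eq. 4.14 can be assembled directly from the eigen-decomposition of M. The sketch below is an illustration under assumed placeholder kernels (Gaussian distance decay and a convex angle kernel), not the exact implementation evaluated in this dissertation; the function name general_vote and the numerical tolerances are hypothetical.

import numpy as np

def general_vote(x, M, y, sigma_v,
                 w_dis=lambda d, s: np.exp(-d**2 / s**2),  # placeholder distance kernel
                 w_ang=lambda c: (1.0 - c)**2):            # placeholder angle kernel
    # Closed-form vote T(x, M, y, sigma_v) of eq. 4.14 (sketch).
    # M (D x D, symmetric PSD) is split into D nested low-rank parts weighted by the
    # eigenvalue gaps (lambda_k - lambda_{k+1}), with lambda_{D+1} = 0.  Each part is
    # reflected by the Householder matrix P = I - 2 r r^T, and the component along
    # e'_1 = v/||v|| (v = projection of x - y onto the rank-k normal space) is
    # attenuated by the angle kernel.
    D = len(x)
    d_vec = x - y
    dist = np.linalg.norm(d_vec)
    if dist < 1e-12:
        return M.copy()
    r = d_vec / dist
    P = np.eye(D) - 2.0 * np.outer(r, r)
    lam, E = np.linalg.eigh(M)                 # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]              # re-sort into decreasing order
    lam, E = lam[order], E[:, order]
    gaps = np.append(lam[:-1] - lam[1:], lam[-1])
    w = w_dis(dist, sigma_v)
    T = np.zeros((D, D))
    for k in range(1, D + 1):
        if gaps[k - 1] <= 1e-12:
            continue
        Nk = E[:, :k] @ E[:, :k].T             # rank-k normal space spanned by e_1 ... e_k
        v = Nk @ d_vec                         # projection of x - y onto the normal space
        correction = np.zeros((D, D))
        if np.linalg.norm(v) > 1e-12:
            e1 = v / np.linalg.norm(v)
            w_k = w * (1.0 - w_ang(abs(np.dot(r, e1))))
            correction = w_k * P @ np.outer(e1, e1) @ P
        T += gaps[k - 1] * (w * P @ Nk @ P - correction)
    return T

As a sanity check under these placeholder kernels, a rank-1 input M = e e^T reduces the sum to the stick vote of eq. 4.15, and M = I yields w_dis(x - y, sigma_v)(I - r r^T), matching the unoriented-point vote of eq. 4.16.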
68 Moreover, the proposed closed-form solution in this chapter is almost identical as [82] with slightdifferencesonthekernelfunctioninvoting. Furthertheoreticalanalysisforthedifferences is very challenging and left for future investigation. Numerical results show that the proposed approachachievesthesameaccuracyas[82]onlocalmanifoldstructureestimation. 4.5 Conclusion BesidesvalidatingtheproposedunifiedTensorVotingonmoredatasets,itwillbeinterestingto give theoretical proof that the voting results do asymptotically converge to the underlying local manifold structure, e.g. tangent space and intrinsic dimensionality. In particular, there are two importantcases. Case 1. Given N un-oriented points randomly distributed on manifolds, apply the voting algorithm once, do the voting results on these N points asymptotically converge when N is sufficientlylarge? Furthermore,dotheconvergedresultsprovideagoodestimation(e.g.,unbias) tothemanifoldproperties,i.e.,tangentspace,normalspace,intrinsicdimensionality,etc? Case 2. Given N oriented points (tangent spaces are given) randomly distributed on mani- folds. For a new point, apply voting algorithm for it once. Do the voting results asymptotically converge whenN is sufficiently large? And the same questions as in case 1 are also of interest tous. 69 Chapter5 ManifoldsDenoising Westudytheproblemofdatadenoisingwheredataareassumedtobesamplesfromlowdimen- sional (sub)manifolds. We propose the algorithm of locally linear denoising, which is a local to global alignment model for high dimensional data denoising. As a first attempt to model de- noisingasalocaltoglobalalignmentprocess,theproposedmethoddoesnotexplicitlyconsider multiplemanifoldsandoutliers. ButthealgorithmcanbeeasilyincorporatedwithUnifiedTen- sorVoting(UTV)(proposedinchapter4)toexplicitlyhandlemultiplemanifoldswithoutliers. In particular, the proposed denoising algorithm approximates manifolds with locally linear patches by constructing nearest neighbor graphs [31]. Each data point is then locally denoised withinitsneighborhood. Aglobaloptimaldenoisingresultisthenidentifiedbyaligningthelocal estimates. The algorithm has a closed-form solution that is efficient to compute. We evaluate and compare the algorithm to alternative methods on two image data sets. We demonstrate the effectivenessoftheproposedalgorithm,whichyieldsvisuallyappealingdenoisingresults,incurs smallerreconstruction errors and resultsin lower error rates when the denoised data are used in supervisedlearningtasks. 70 5.1 Introduction Many algorithms developed for tasks in computer vision, such as object recognition, segmenta- tionandothers,assumethattheinputimagescontainlittleornonoise. Thus,forvisionsystems accomplishing those tasks, it is important to remove excessive noise in images at processing stages as early as possible. Image denoising is an important preprocessing step for achieving that goal [17, 8, 93]. Denoising techniques are also widely used in computer graphics [22], digitalphotography[20]andotherapplications. Principal component analysis (PCA) is a popular data denoising technique. It is especially effective when images are contaminated with small amounts of Gaussian noise. Probabilistic approaches, based on learning priors for appearance, geometry and other visually salient prop- erties, have also been intensively studied [101, 49]. In these works, the prior models are learnt using image corpora, such as those of natural scenes, where images are randomly collected and are not meaningfully related to each other [73]. 
Many state-of-the-art denoising techniques are basedonstatisticalsignalprocessingandoptimalfiltering[17,8,36,15,95]. Akeyassumption in many of these work is that for a given image, many pixels’ neighborhoods (or local patches) are similar to each other. Such similarity is leveraged to estimate models of noise and clean images. While the majority of existing work has been focusing on denoising a single image, we investigate the problem of denoising collectively a collection of images. In many cases, latent intrinsic structures underpin those images. For instance, an image library of an object can be compactly described with a few parameters such as the lighting condition, the camera position, 71 etc. Weassumethattheselatentvariableslieonasmoothlowdimensionalmanifold. Identifying imagemanifoldsisanactiveresearchtopicinmanifoldlearningandlatentvariablemodels[125, 102,6,58]. Weconsidertheproblemofdenoisinginthiscontext. Specifically,weviewimages as random samples (with noise) from the manifold. A natural question arises: can the intrinsic structure be exploited for denoising? Note that the intrinsic structure is often unknown a priori, therefore needs to be inferred from the (noisy) data. How can we achieve robust denoising and inferenceatthesametime? Our work investigates these questions. We propose a simple and effective procedure for denoising data on manifolds. Our study shows clearly that exploiting the intrinsic structure of image collections is advantageous. Our iterative procedure consists of 3 steps: i) construct a nearestneighborgraphtoapproximatethemanifoldwithlocallylinearpatches;ii)denoisedata points locally within each patch; iii) align denoised data globally with regularization enforcing smoothnessonmanifolds. Eachofthethreestepsiscomputationallytractable,involvingnearest neighbor search, matrix eigendecomposition and matrix inversion. We applied our algorithm to image denoising on manifolds of handwritten digits and faces. We evaluate the quality of the denoisingbyvisualinspection,deviationfromuncorruptedimagesandclassificationerrorrates on denoised images. We compare our algorithm to alternative methods systematically under varioustypesofnoiseconditions. Ouralgorithmgenerallyoutperformsotherapproaches. The rest of the chapter is organized as follows. In section 5.2, we summarize briefly related work. Wederiveanddescribeouralgorithminsection5.3. Experimentalevaluationispresented insection5.4. Wediscussfutureresearchdirectionsinsection5.5. 72 5.2 RelatedWork Manifold learning algorithms also aim to exploit intrinsic structures in data. They are different from our effort in their primary goals of discovering and projecting data onto low dimensional structures for visualization and exploratory data analysis [125, 102, 6]. In contrast, the primary goal of denoising is to obtain denoised output in the same dimensionality as the noisy input. Notethat,formanifoldlearningalgorithms,itispossibletobuildstatisticalmodelsorfunctions thatmapdatainthelowdimensionalspacetotheoriginalinputspace[58,124,24]. Theoutputs of those models could be seen as denoised inputs. Intuitively, it is difficult to ensure the effec- tiveness and advantage of this type of denoising procedures since both phases – projection and backwardmapping–introduceerrors. Empirically,ourexperimentalresultsdidnotsupportthis procedureasarobustoptionfordenoising. Our work is more similar in spirit to the diffusion map based denoising algorithm [41]. 
TheyviewdenoisingasreversingadiffusionprocessofwhichgraphLaplacianisthegenerator. BoththeirandourapproachesusegraphLaplacianforregularization. However,ourapproachis significantlydifferentfromtheirsasouriterativeprocedureusesintermediatedenoisingoutputs differentlytorefinetheexistingsolution. Inparticular,theirapproachtendstooverlysmooththe inputs. Ourapproachislesssensitivetothatproblem. 5.3 LocallyLinearDenoising We assume that the data lies on a d-dimensional smooth submanifold embedded in an ambient space of dimensionality D > d. Let{x i ∈ℜ D ,i = 1,2,...,N} be N data points sampled 73 fromthemanifold. LetX∈ℜ D×N denotethematrixwherex i isthei-thcolumn. Letz i ∈ℜ d denotethecorrespondingcoordinatesforx i inthelowdimensionalspace. Weassumethatthere exists a smooth function f, mapping the low dimensional coordinates to the high dimensional space:x i = f(z i )+ϵ i ,whereϵ i representsnoise. We denote f(z i ) byy i , ie, the noiseless data. We define the shorthand notationY as we did forX. We are interested in denoising noisy datax i thus identifyingy i . Note that our goal is different from most manifold learning algorithms, which aim to identify{z i } and sometimes thefunctionf(·)[125,102,6,153]. Inwhatfollows,westartbydescribingandderivingouralgorithmindetails. Wethenanalyze thealgorithmbrieflyanddiscussafewpossibleextensions. 5.3.1 TheLLDAlgorithm Our approach hinges on the basic notion that a manifold can be seen as a collection of over- lapping small linear patches. Moreover, if data are sampled densely, these small patches can be approximated with nearest neighborhoods in point clouds. This is the same intuition behind many manifold learning algorithms [125, 102]. Our algorithm exploits this intuition by denois- ingineachpatchindependently. Thenwecomputeaglobalassignmentofdenoisingoutputsthat alignwithlocalresultsasmuchaspossible. Specifically,ouralgorithmconsistsofthefollowing threesteps: Constructingnearestneighborgraph. Weconstructaweightednearestneighborgraphfrom the sampled data points. Let K denote the number of nearest neighbors for each data point 74 (including itself) andN i denote the neighbors ofx i . We use w ij to denote the weight of the edge between the samples x i and x j . Many choices of w ij are possible. In this work, we have chosen the commonly used Gaussian kernelsw ij = exp(−∥x i −x j ∥ 2 /σ 2 ) ifx j ∈N i or x i ∈N j ,and0otherwise. LetW denotetheweightmatrixwheretheelementsarew ij . Denoisinglocally. Weviewx i andK pointsinitsneighborhooodN i asrandomsamplesfrom a linear subspace, approximating the manifold around the pointx i . We denoise theseK points withprincipalcomponentanalysis(PCA)(otherstrategiesarealsopossible). LetX i denotethe points inN i . The local estimate ofy i — denoisedx i — is given by the reconstruction ofx i withthed i principalcomponentsofX i . Similarly,wecomputereconstructionsforotherpoints inN i . Concretely, letU i ∈ℜ D×d i denote the d i principal components. The reconstruction Q i ∈ℜ D×K forthesepointsisgivenby Q i =U i U i T X i ( I−ee T /K ) +X i ee T /K (5.1) whereeisalength-K columnvectorwhoseelementvaluesareonesandI istheidentitymatrix. The number of the principal components d i can be estimated adaptively, mindful of the inhomogeneous distribution of the noise at different parts of the manifold. This is in sharp contrast to many manifold learning algorithms where a global dimensionality d needs to be estimated. 
When there is noise in the data, estimating a globald is challenging as the noise and the curvature of the manifold interplay. In our work, d i is estimated with simple methods such asthresholdingontheresidualvariances. 75 Aligning globally. In addition to having its own neighborhood, any point x i can be in the neighborhoods of other data points as the approximating linear subspaces overlap for a densely sampled manifold. Intuitively, the local denoising results forx i from other neighborhoods re- flect also information about the “true” location y i . To integrate this information from every neighborhood, we seek a global assignment of denoising resultY that minimizes the sum of discrepanciestoallneighborhoods. Specifically,weminimizethefollowinglossfunction L A = ∑ i ∑ j: i∈N j ∥y i −Q j (i)∥ 2 2 (5.2) whereQ j (i)standsforthelocaldenoisingresultforx i fromtheneighborhoodofx j . The loss functionL A can be expressed compactly with data selection matrices [153]. Let S = [S 1 ,S 2 ,...,S N ] be a 1×N block matrix where each blockS i ∈ℜ N×K corresponds to theneighborhoodofx i . Moreoever,S i isabinarymatrixanditselements jk = 1ifandonlyif x j is thek-th nearest neighbor ofx i . Note thatXS i =X i . The loss functionL A is expressed as L A =∥YS−Q∥ 2 F (5.3) whereQ = [Q 1 Q 2 ··· Q N ]andthesubscriptF indicatestheFrobeniousnorm. In addition to aligning the global coordinatesY with the local estimatesQ, we also seek an output that is smooth on the manifold. To this end, we also minimize the total variation ofY on the graph [41]. The total variation is computed as the squared norm of the (discrete) difference,L G = ∑ D m=1 ∥∇Y m ∥ 2 ,whereY m isthem-throwofthedenoiseddataY and∇is thediscretedifference,approximatingthegradientoncontinuousmanifold[12]. Thelosscanbe 76 writtenintermsofthegraphLaplacian,analogoustotheLaplacian-Bertramioperatoronsmooth manifolds, L G = D ∑ m=1 Y m LY m T = trace(LY T Y) (5.4) ThegraphLaplacianL =I−D −1 W isdefinedintermsoftheweightmatrixW forthegraph andthediagonalmatrixD withthediagonalelementofD ii = ∑ j w ij . ThelossfunctionL G attainsitsminimumataconstantY,whichwouldresultinalargeloss forthealignmentL A . Totradeoff,weadoptthesameregularizationframeworkproposedin[41] andcomputetheoptimalsolutiontothefollowingloss L(λ) =∥YS−Q∥ 2 F +λtrace(LY T Y) (5.5) as the denoising result:Y ∗ = argmin Y L(λ). Empiricall, the coefficientλ≥ 0 is chosen with validationsets. Notethattheoptimaldenoisingresulthasaclosed-formsolutiongivenby Y ∗ (λ) =QS T (SS T +λL) −1 (5.6) The matrix product =SS T is a diagonal matrix. Specifically, its diagonal element Λ ii is the numberofnearestneighborhoodsthatx i belongsto. The algorithm listing in Fig. 5.1 reviews the key steps of our algorithm. Note that our algo- rithm depends on the nearest neighbor graph which is estimated from the (noisy) data samples X. AftercomputingthedenoisedoutputY ∗ (λ)ineq.(5.6),thegraphcanbere-estimatedwith 77 InputX: noisy data, K: number of nearest neighbors, λ: regularization coefficient for graph Laplacian S1constructK weightednearestneighborgraph S2denoiselocallywitheq.(5.1)tocomputeQ S3aligngloballywitheq.(5.6) S4(optionally)iteratefromS1,usetheresultsinS3inplaceofX OutputY ∗ : denoisingresults Figure5.1: Locallylineardenoising(LLD)algorithm denoised outputs. The denoising can then be iteratively refined on the new graph. While a for- malproofofconvergenceofthisiterativeprocessisleftforfuturework,weobserveconvergence afterafewiterationstoastablesolutioninourexperiments. 
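The three steps of Fig. 5.1 admit a compact implementation. The following sketch is a minimal, assumption-laden illustration rather than the code used in the experiments: the local dimensionality d_i is fixed instead of estimated adaptively, the Gaussian bandwidth is set to the median pairwise distance, and the function name lld_denoise is hypothetical.

import numpy as np
from scipy.spatial.distance import cdist

def lld_denoise(X, K=10, lam=0.0, d_i=2, n_iter=1):
    # Locally linear denoising (LLD), following steps S1-S4 of Fig. 5.1.
    # X is a D x N matrix whose columns are the (noisy) data points.
    D, N = X.shape
    for _ in range(n_iter):
        # S1: weighted K-nearest-neighbor graph and Laplacian L = I - D^{-1} W
        dist = cdist(X.T, X.T)
        sigma = np.median(dist[dist > 0])
        nbrs = np.argsort(dist, axis=1)[:, :K]          # K nearest neighbors (including self)
        W = np.zeros((N, N))
        for i in range(N):
            W[i, nbrs[i]] = np.exp(-dist[i, nbrs[i]]**2 / sigma**2)
        W = np.maximum(W, W.T)
        L = np.eye(N) - W / W.sum(axis=1, keepdims=True)
        # S2: local PCA denoising (eq. 5.1) and the selection matrix S
        Q = np.zeros((D, N * K))
        S = np.zeros((N, N * K))
        for i in range(N):
            Xi = X[:, nbrs[i]]                          # D x K local data matrix
            mean = Xi.mean(axis=1, keepdims=True)
            U, _, _ = np.linalg.svd(Xi - mean, full_matrices=False)
            Ui = U[:, :d_i]                             # d_i principal components
            Q[:, i*K:(i+1)*K] = Ui @ Ui.T @ (Xi - mean) + mean
            S[nbrs[i], np.arange(i*K, (i+1)*K)] = 1.0   # block S_i of the selection matrix
        # S3: global alignment, closed-form solution of eq. 5.6
        X = Q @ S.T @ np.linalg.inv(S @ S.T + lam * L)
    return X

With lam = 0 the update reduces to averaging the local estimates, which is exactly the special case analyzed in the next subsection.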
5.3.2 Analysis In the following, we analyze the LLD algorithm briefly and contrast to a closely related ap- proach [41]. To gain intuition, we first consider the case when λ = 0. Note that, when λ approaches∞, as mentioned before, the solutionY ∗ (λ) of eq. (5.6) becomes trivially constant sothatthesolutionisinfinitelysmooth. Ontheotherextreme,however,whenλiszero,thereisnoenforcementonthesmoothnessof thedenoisedoutput. Instead,thesolutiontakesthesimpleformofY ∗ (0) =QS T −1 . Ashort calculation reveals thatY ∗ (0) is the average of all local estimates. It is interesting to note that this simple procedure works well in some denoising tasks, as evidenced by empirical study in latersections. Also,evenforλ = 0,duetotheiterativenatureoftheLLDalgorithm,averaging changes from one iteration to the other as the nearest neighbors are recomputed every iteration. Furthermore,ifQiscomputedwithnonlinearmethodsfromX (asopposedtolinearprojections withPCA),theoverallaveragingeffectishighlynonlinearandcompounded. 78 When λ≪ 1, we can approximate the matrix inverse in eq. (5.6) with (SS T +λL) −1 ≈ −1 (I−λL −1 )(byapplyingTaylorexpansion). Thesolutionisthenapproximatedwith Y ∗ (λ)≈Y ∗ (0)(I−λ −1 )+λY ∗ (0)D −1 W −1 (5.7) The first term approximately takes the form of discounted simple averages of local estimates. The second term reveals more insight. Specifically, the local estimate q j also contributes to the global assignmenty i ifx i andx j are connected in the weighted nearest neighbor graph. Moreover, for the Gaussian kernel we have taken to compute the weights, the contribution is positively proportional to the weightw ij . Intuitively, ifx i andx j are close to each other, their local estimates should be similar to each other. Hence, information from the local estimateq j couldbeusedforestimatingy i . Heinetalhasrecentlyproposedadiffusionmapbaseddenoisingalgorithm[41]. Similarto ours,theiralgorithmcomputesdirectlythehighdimensionaldenoisingoutputsandincorporates graph Lapalacian as a regularization to favor smooth solutions. Their iterative procedures take theform(inthenotationofthispaper) Y ←Y(I +λL) −1 (5.8) Whilethetwoupdatesineq.(5.6)andeq.(5.8)appearsimilar,keydifferencesexist. Ineq.(5.8), theupdateddenoisingresultY (ontheleftside)isuseddirectlyintherightsidetoberefined. In eq. (5.6), the updated denoising result affects the refined output at the next time step indirectly through the computation of local estimatesQ on the right side. Note that the computation of 79 Figure5.2: ExamplesofcleanUSPSdigitimages Q requires recomputing the nearest neighbor graph as well as recomputing projection matrix (cf. eq. (5.1)). Therefore, the two denoising algorithms are unlikely to converge to the same stationarypoint. Wegainfurtherinsightbyinspectingagainthespecialcasewhenλ = 0. Notethateq.(5.8) immediately reaches a fixed point for anyY. For the LLD algorithm eq. (5.6), this is not nec- essarily true. In the next section, we compare the LLD algorithm to their algorithm empirically anddiscoversignificantdifferencesinapplications. 5.4 Experiments 5.4.1 EvaluationMethodology Weevaluatetheperformanceofthelocallylineardenoising(LLD)algorithmontwodatasets: a subsetoftheUSPShandwrittendigitimages,whichcontains200imagesperdigitclass,andthe ORL face images with resolution reduced from 112×92 to 28×23. We chose them for their differentcharacteristicsinnumberofsamples,dimensionality,andintrinsicstructures. 
Wecom- pare the performance of our approach to that of four denoising algorithms: PCA, the diffusion 80 Figure5.3: DenoisingbyPCAandmanifoldbasedalgorithmsonUSPSdata. Toprowtobottom: different noise types (Gaussian, occlusion, motion blur and salt-and-pepper). Left column to right: nodenoising,PCA,DM-MD[41]andLLD(cf. section5.3)withλ = 0. Denoisedimages arevisuallymoreappealing. DM-MDtendstooverlysmoothwhileLLDdoesnotperformwell withocclusionnoise. ForDM-MD,thenumberofnearestneighborsareK = 80,80,30,100for eachnoisetyperespectively. ForLLD,K = 30forallnoisetypes. map based manifold denoising (DM-MD) algorithm [41], Non-Local Means (NLM) and BLS- GSM[8,95,36]. Thelatertwomethodsaredesignedtodenoiseoneimageatatime,therefore, donotrelyonintrinsiclatentstructuresinimagecollections. Wealsotriedatwo-stepdenoisingstrategywhereweprojecttoalowdimensionalspacefirst andthenusealearnedstatisticalmodeltomapbacktotheoriginalinputspace[124]. Preliminary resultsdidnotsupportthisstrategyasaviableoptionforrobustdenoising. Onepossiblereason is that both steps introduce errors and there is no “global” criteria controlling the quality of the final denoised images. We omit those results. Note that, the LLD algorithm computes denoised imageswithoutidentifyinglowdimensionalembeddings. 81 We investigate the robustness of denoising algorithms to different types of noise. We treat the original images as “clean” images and synthesize noisy images by adding noise to them: Gaussian noise imposed on the pixel intensities, random occlusion patches, motion blurring as caused by camera movements, and salt-and-pepper noise where each pixel’s intensity is ran- domly flipped at a probability of 20% to its complement alue. Denoising algorithms that are resilient to different noise types are highly desirable in practice as inferring noise types is often challenging. We examine the quality of denoising with visual inspection. We also apply two quantita- tive metrics: reconstruction errors between denoised outputs and the clean images; as well as classification errors on the denoised outputs with classifiers trained on clean data. Note that the denoising algorithms are not told which images are noisy ones. Therefore, all images are de- noised. As a consequence, the original clean images will be contaminated while noisy images are cleaned. We report results separately on them. Fig. 5.2 and 5.6 show examples of clean images from these two data sets. We report findings on the USPS data first, followed by those ontheORLfaceimages. TheLLDalgorithmdependsontheregularizercoefficientλandthenumberofnearestneigh- bors K. We chose them based on cross-validation using either of the two quantitative metrics. Parametersforotheralgorithmsaretunedsimilarly. 5.4.2 USPSDigitImages Denoising results Fig. 5.3 displays the denoising results on noisy images by 3 algorithms underthefourtypesofnoise. Weaddednoiseto100images(10perclass)andshow50ofthem 82 Table 5.1: Reconstruction errors and misclassification rates (in percentage) by multiclass SVM classifiers on the USPS data. The error rates are shown inside the parentheses. Without denois- ing, corresponding to the column heads of “none”, no reconstruction error is reported. In both measures, LLD outperforms other 2 methods in most cases. Parameters are set as the same as thoseinFig.5.3. 
Noise    Noisy images                                  Original clean images
type     none     PCA       DM-MD      LLD             none     PCA       DM-MD     LLD
G        - (20)   640 (19)  920 (11)   719 (13)        - (16)   518 (16)  918 (19)  665 (15)
O        - (18)   615 (18)  949 (15)   524 (17)        - (16)   390 (16)  918 (20)  333 (15)
B        - (20)   964 (20)  1228 (52)  1043 (22)       - (16)   420 (16)  906 (18)  274 (16)
S        - (22)   785 (14)  977 (15)   860 (14)        - (16)   518 (16)  938 (22)  666 (15)
(G: Gaussian, O: occlusion, B: motion blur, S: salt-and-pepper noise)

(results for the other 50 are similar). Overall, the manifold based denoising algorithms yield more visually appealing reconstructions. PCA introduces superpositions of images from all classes, as it adds to the reconstructed images a mean image computed over the whole data set. The diffusion map based algorithm tends to oversmooth, as also noted in [41]. This is not necessarily a disadvantage: under the occlusion noise, this algorithm is the only one that can connect broken strokes caused by occlusion, and is thus more visually appealing.

In Table 5.1, we quantify the denoising quality in terms of reconstruction errors, i.e., the sum of the squared differences in pixel intensities. As a reference, the amounts of noise added to the images are 805, 520, 968 and 1137, respectively, for each type of noise. On noisy images, PCA incurs the smallest reconstruction errors and LLD has smaller errors than DM-MD. On the original clean images, PCA contaminates less in the cases of Gaussian and salt-and-pepper noise, while LLD contaminates less for occlusion and blur noise. DM-MD has the highest errors.

Classification. An important use of denoising is to preprocess data for supervised learning tasks. In Table 5.1, we compare the misclassification rates obtained with different denoising algorithms. The misclassification rates are the numbers displayed inside the parentheses. SVM classifiers were trained on 200 clean images (20 images per class), and we tested the classifiers with 100 noisy images and 1700 originally clean images. Almost all algorithms improve over the baseline without denoising. The DM-MD algorithm performs the best on noisy images with Gaussian and occlusion noise. However, it does so at the expense of increasing the error rates on the original clean images. The LLD algorithm attains the smallest error rates in most categories. In particular, it is able to do so by reducing error rates more than the other algorithms on both noisy images and original clean images. The exception is on the motion blurring images, where the error rate increased slightly from 20% to 22%.

Figure 5.4: Denoising by LLD with different amounts of regularization. λ is increased from left to right. A larger λ leads to oversmoothing, similar to the DM-MD algorithm.

Figure 5.5: Misclassification rates (in percentage) of LLD with different regularization and other methods on USPS data with salt-and-pepper noise. Small regularization improves error rates on both noisy and original clean images (note that the clean images are "denoised" too, effectively being introduced with noise). See text for details.

Figure 5.6: Visualization of denoised ORL images. From left to right: clean images, images with Gaussian noise, and images denoised with the DM-MD and LLD algorithms respectively.

Figure 5.7: Misclassification rates (in percentage) for ORL data with various denoising algorithms (note that "denoising" clean images introduces noise in effect). See text for details.

Effects of graph regularization. In the results we have reported so far, we have set the parameter λ to 0 in the LLD algorithm. A nonzero λ trades off the errors of alignment (see eq. (5.3)) and the smoothness of the denoised images. We experimented with different settings of λ under the salt-and-pepper noise.
TheoptimalvalueforλcombinedparametersK anddischosenwith the smallest classification error rates on validation data sets. Fig. 5.4 shows that a smaller λ Figure 5.8: Comparison among various denoising algorithms. From left to right: images of digits contaminated with narrow black bands at the bottom, denoising by the LLD algorithm, denoisingbyNon-LocalMeans,denoisingbyBLS-GSM.TheLLDalgorithmisabletorecover partiallyoccludedareasbythebands. 85 retains the “grain” in the image while a larger one often oversmooths. The λ we used in this experimentare0, 0.25,1.5and9respectively. Fig.5.5reportstheclassificationperformanceof theLLDalgorithmwithtwosettingsλ = 0andλ = 0.01,aswellasotheralgorithms. Notethat while LLD(λ = 0) achieves a better overall error rate than competitive methods, its error rates on noisy images were worse than that of DM-MD. With a small amount of regularization, the error rates of LLD(λ = 0.01) on noisy images were significantly lower than both DM-MD and withoutregularization. Furthermore,theerrorratesontheoriginalcleanimageswereimproved too, though not significantly. For this experiments, we have used 10 clean images per class to build a SVM classifier. On the clean images, the error rate is 30.4%. Therefore, an interesting observationisthatbothDM-MDandthetwoLLDsareabletoimprovetheerrorratesfromthis baseline. Webelievethisisthebenefitsofsemi-supervisedlearning,asdiscussedin [41]. 5.4.3 ORLFaceImages The ORL face image data set has a total of 400 images from 40 subjects. We add noise to ran- domly chosen 200 images and retain the other 200 images as clean samples. We use 4 clean images per subject to train a 40-way classifier to distinguish subjects. Our test set for classifi- cation tasks contains 1 clean image and 5 noisy images per subject. The performance of two manifold based denoising algorithms under Gaussian noise are displayed in Fig. 5.6. We drew similarconclusionsaboutthesetwoalgorithmsasinourpreviousexperimentsonUSPSdata. In particular, we note that the DM-MD algorithm oversmooths, which would lead to inferior clas- sification results. This is confirmed in Fig. 5.7. Note that using clean images only the classifier has a classification error of 5.8%. Furthermore, note that the results are obtained at choosing 86 the number of nearest neighbors be 60 for LLD and 20 for DM-MD. Both numbers are greater than the number of images per subject in the data set. This indicates that the LLD algorithm is capable of using information from similar images, though not necessarily clustered in terms of subjects,togetridofnoise. 5.4.4 ComparisontoSingle-ImageDenoising The LLD algorithm collectively denoises all images. This is different from many existing ap- proaches, which denoises one image at a time. A drawback of those approaches is that they cannotbenefitfromintrinsicstructure,suchasimagemanifolds. To exemplify this, we compared the LLD algorithm to Non-Local Means (NLM) and BLS- GSM [8, 95, 36], both regarded as state-of-the-art denoising algorithms. Specifically, we added a narrow black band to the bottom of 100 USPS digit images, as shown in Fig. 5.8. The LLD algorithm denoises these noisy images by using the rest clean images in the data set and is able to recover partially from the occlusion. On the other hand, both NLM and BLS-GSM cannot recognizethebandsasnoises. If the intrinsic structure is contaminated by strong noise, the LLD algorithm is prone to extract information from wrong images to denoise. 
For instance, images of digit 2 with bottom bandnoises can be easily confused as images of digit 7. Therefore, LLD denoises by mixing both types of images. NLM and BLS-GSM are immune to this problem as these images are processedindependently. Thus,itisinterestingtoexplorewhetherandhowwecancombinethe advantagesofbothtypesoftechniques. 87 5.5 Conclusion We study the problem of data denoising when data lie on intrinsic low dimensional structures suchassubmanifolds. Weproposelocallylineardenoising(LLD)toexploitsuchstructures. The algorithm integrates local denoising results through a global alignment process that minimizes discrepancies in reconstruction with different local neighborhoods, balanced by graph Lapla- cian regularization to prefer smooth solutions on the manifolds. The algorithm is evaluated and compared to other state-of-the-art denoisubg methods. The results are encouraging: on both handwritten digit and face images, the proposed algorithm yields visually appealing denoising results, incurs small reconstruction errors and results in low error rates when the denoised data areusedinsupervisedlearningtasks. We view the algorithm proposed in this chapter as a general local to global alignment strat- egy for data denoising. For example, the local denoising step can be easily adapted to other approaches that are more robust, for instance, e.g., unified Tensor Voting in chapter 3 or robust PCA[44]. 88 Chapter6 MultipleManifoldsStructureLearning Wepresentarobustmultiplemanifoldsstructurelearning(RMMSL)schemetorobustlyestimate data structures under the multiple low intrinsic dimensional manifolds assumption [32]. In the locallearningstage,RMMSLefficientlyestimateslocaltangentspacebyweightedlow-rankma- trix factorization. In the global learning stage, we propose a robust manifold clustering method based on local structure learning results. The proposed clustering method is designed to get the flattestmanifoldsclustersbyintroducinganovelcurved-levelsimilarityfunction. Ourapproach is evaluated and compared to state-of-the-art methods on synthetic data, handwritten digit im- ages,humanmotioncapturedataandmotorbikevideos. Wedemonstratetheeffectivenessofthe proposedapproach,whichyieldshigherclusteringaccuracy,andproducespromisingresultsfor challengingtasksofhumanmotionsegmentationandmotionflowlearningfromvideos. 6.1 Introduction Theconceptofmanifoldhasbeenextensivelyusedinalmostallaspectsofmachinelearningsuch asnon-lineardimensionreduction[104],visualization[58,132],semi-supervisedlearning[113], 89 multi-task learning [2] and regression [121]. Related methods have been applied to many real- worldproblemsincomputervision,computergraphics,webdataminingandmore. Despitethe success of manifold learning (in this paper, manifold learning refers to any learning technique thatexplicitlyassumesdatahavemanifoldsstructures),peoplefindthereareseveralfundamental challengesinrealapplications, (1)Multiplemanifoldswithpossibleintersections: inmanycircumstancesthereisnounique (global) manifold but a number of manifolds with possible intersections. For instance, in hand- writtendigitimages,eachdigitformsitsownmanifoldintheobservedfeaturespace. Forhuman motion, joint-position (angle) of key points in body skeleton form low dimensional manifolds for each specific action [129]. In these situations, modeling data as a union of (linear or non- linear)manifoldsratherthanasingleonecangiveusabetterfoundationformanytaskssuchas semi-supervisedlearning[27]anddenoising[41]. 
(2)Noiseandoutliers: onecriticalissueofmanifoldlearningiswhetherthemethodisrobust to noise and outliers. This has been pointed out in the pointer work on nonlinear dimension reduction[104]. Fornon-linearmanifolds,sinceitisnotpossibletoleveragealldatatoestimate localdatastructure,moresamplesfrommanifoldsarerequiredasthenoiselevelincreases. (3) High curvature and local linearity assumption: typical manifold learning algorithms ap- proximate manifolds by the union of locally linear patches (possibly overlapped). These local patchesareestimatedbylinearmethodssuchasprincipalcomponentanalysis(PCA)andfactor analysis (FA) [124]. However, for high curvature manifolds, many smaller patches are needed, butthisoftenconflictswiththelimitednumberofdatasamples. 90 Figure6.1: AdemonstrationofRMMSL.Fromlefttoright,noisydatapointssampledfromtwo intersecting circles with outliers, outlier detection results, and manifolds clustering results after outlierdetection. In this paper, we investigate the problem of robustly estimating data structure under the multiple manifolds assumption. In particular, data are assumed to be sampled from multiple smooth submanifolds of (possibly different) low intrinsic dimensionalities with noise and out- liers. The proposed scheme named Robust Multiple Manifolds Structure Learning (RMMSL) is composed of two stages. In the local stage, we estimate local manifolds structure taking into accountofnoiseandcurvature. Theglobalstage,i.e.,manifoldclusteringandoutlierdetection, is performed by constructing a multiple kernel similarity graph based on local structure learn- ing results by introducing a novel curved-level similarity function. Thus, issue (1) is explicitly addressed, and(2)and(3)arepartiallyinvestigated. Ademonstrationoftheproposedapproach is given in Fig. 6.1. It is worth noting that other types of global learning tasks such as dimen- sionreduction,denoisingandsemi-supervisedlearningcanbehandledbasedoneachindividual manifoldclusterbyexistingmethods[58,132,41,115]. Our problem statement is rigorously expressed as follows. Data{x i } n 1 +n 2 i=1 ∈R D×(n 1 +n 2 ) consist of inliers{x i } (1≤ i≤ n 1 ) and outliers{x j } (n 1 +1≤ j≤ n 1 +n 2 ). Inlier points 91 {x i } n 1 i=1 ∈R D×n 1 areassumedtobesampledfrommultiplesubmanifoldsM c (1≤c≤n c )as follows, x i =f C(x i ) (τ i )+n i ,i = 1,2,..,n 1 (6.1) whereC(x i )∈{1,2,...,n c }isthemanifoldlabelfunctionforx i .n i isinliernoiseinR D . f c (·) is the smooth mapping function that maps latent variableτ i from latent spaceR dc to ambient spaceR D for manifoldM c . d c (< D) is the intrinsic dimensionality ofM c and it can vary for different manifolds. The task of local structure learning is to estimate the tangent space T x i M and the local intrinsic dimensionality d C(x i ) at eachx i . This is equivalent to estimate J(f c i (·);τ i )∈ R D×dc i , i.e., the Jacobian matrix of f c i (·) atτ i . Details are given in section 3. The tasks of global structure learning addressed in this paper are, to detect outliers{x j } (n 1 +1≤j≤n 1 +n 2 )andtoassignmanifoldsclusterlabelc i ∈{1,2,...,n c }forinliers{x i } (1≤ i≤ n 1 ) (Fig. 6.2). This is given in section 4. Experimental results are shown in section 5 andfollowedbyconclusioninsection6. 6.2 RelatedWork Recentworksonmanifold learningandlatentvariablemodelingfocuson low-dimensionalem- beddingandlabelpropagationforhighdimensionaldata,whichareusuallyassumedtobesam- pledfromasinglemanifold[104,115,58,132]. 
Toaddressthemultiplemanifoldsproblem,[27] 92 gives theoretical analysis and a practical algorithm for semi-supervised learning with the multi- manifold assumption. As a complementary approach to previous works, RMMSL mainly fo- cusesonunsupervisedlearningwiththemulti-manifoldassumptionanditcanbecombinedwith existingapproaches 1 . In the global structure learning stage, RMMSL focuses on clustering and outlier detection. Clustering is a long standing problem and we only review manifold related clustering methods. If data lie on a low-dimensional submanifold, distance on the manifold is used to replace the Euclidean distance in the clustering process. This leads to spectral clustering [86, 152, 70], one of the most popular modern clustering algorithms. We refer [135] as a survey for spectral clus- tering. However,mostoftheseworksdonotexplicitlyconsidermultipleintersectingmanifolds. State-of-the-art multiple subspace learning methods such as [18, 133, 149] acquire good performance on multiple linear (intersecting) subspace segmentation with the assumption that the intrinsic manifolds have linear structures. However, real data often have nonlinear intrinsic structures,whichbringsdifficultyforthem. Nonlinearmanifoldclusteringisinvestigatedin[26], which mainly focuses on separated manifolds. As an extension of ISOMAP, [118] proposes an EMalgorithmtoperformmultiplemanifoldsclustering,butresultsaresensitivetoinitializations andtheE-stepisheuristic. Recently,[140]proposesanelegantsolutionofmanifoldsclustering byconstructingtheaffinitymatrixbasedonestimatedlocaltangentspace. 1 Itcanalsobeusedtosupervisedlearningwithminormodifications. 93 6.3 LocalManifoldStructureEstimation Correctly and efficiently estimating local data structure is a crucial step for data analysis. As explained before, problems like noise and high curvature make local structure estimation chal- lenging. This section addresses the issue of how to model and represent local structure infor- mationonmanifolds. WestartfromlocalTaylorexpansion,andlocalmanifoldtangentspaceis representedbytheJacobianmatrixunderthelocalisometryassumption. LocalTaylorExpansion. Without additional assumptions, the model in eq. 6.1 (section 1) is not well defined. For instance, if f(·) and point set{τ i } satisfy eq. 6.1, then f(g −1 (·)) and point set{g(τ i )} satisfy it too, where g(·) is any invertible and differential mapping function fromR d toR d . Thus,thelocalisometryassumptionisenforcedatpointτ i ∈R d , ||f(τ)−f(τ i )|| 2 =||τ−τ i || 2 +o(||τ−τ i || 2 ) (6.2) whereτ isintheϵ−neighborhoodofτ i . TheaboveconditionimpliesthatJ(f(·);τ i )∈R D×d is an orthonormal matrix [116], i.e., J T (f(·);τ i )J(f(·);τ i ) = I d . From eq. 6.1, by using TaylorexpansiontoincorporatebothTaylorapproximationerrorandinliernoiseweget, x i j −x i =J(f(·);τ i )(τ i j −τ i )+e i j +n , i j e i j ∼o(||τ i j −τ i || 2 ) (6.3) 94 where x i j (1 ≤ j ≤ m i ) are m i elements in the ϵ ball of x i and noise vector n , i j denotes n i j −n i . Byusingmatrixnotation,wehave, X i −x i 1 T m i =J i (T i −τ i 1 T m i )+E i +N i (6.4) Assume there are m i points{τ i j ,x i j } m i j=1 in the ϵ−neighborhood of{τ i ,x i }, the local data matrix X i is defined as [x i 1 ,x i 2 ,...,x im i ] ∈ R D×m i . 
The corresponding latent coordinate matrixT i forX i is denoted as [τ i 1 ,τ i 2 ,...,τ im i ]∈ R d×m i , the corresponding local Taylor approximation error matrixE i is denoted as [e i 1 ,e i 2 ,...,e im i ]∈ R D×m i andN i is the local inliernoisematrix[n , i 1 ,n , i 2 ,...,n , im i ]∈R D×m i .J i istheshortnotationofJ(f(·);τ i ). Itisnaturaltoassumenoisen , i j tobei.i.dwithhomogeneousGaussiandistributionN(0,σ 2 n I D ) (1≤ j≤ m i ). Furthermore, we treat errors as independent Gaussian random vectors with dif- ferent covariance matrices, i.e.,e i j ∼ N(0,σ 2 I D α i j ). By adding noise and error together, an integratederror vectorϵ i j isdefinedas, ϵ i j =n , i j +e i j ∼N(0,σ 2 n I D +σ 2 α i j I D ) j = 1,2,...,m i (6.5) whereσ 2 n indicatesthescaleofthenoise’scovarianceandσ 2 indicatesthescaleoftheerror. α i j reflects the inhomogeneous property of different error of Taylor expansion on pointsx i j (based onx i ). Thus,noiseanderrorarejointlyconsidered. Insteadoftreatingerrore i j ashomogeneous across different points, we argue the error is proportional to the relative distance||τ i j −τ i || 2 , basedonthenaturalpropertyofTaylorexpansion. 95 Inference. Toreflecttheinhomogeneouspropertyoferrorϵ i j (combinebothTaylorapprox- imationerrorandinliernoise),weproposethefollowingobjectivefunctiontoestimateJ i , L(J i ) =||E i S|| 2 F =||{(X i −x i 1 T m i )−J i (T i −τ i 1 T m i )}S|| 2 F (6.6) whereS∈ R m i ×m i is the diagonal weight matrix with s jj = s(τ i j ,τ i ) indicating the impor- tance to minimize error ϵ i j on x i j . Intuitively, we emphasize the error minimization for x i j morethanx i k ifs jj islargerthans kk . Then we discuss how to choose α i j and determineS accordingly. First, we choose α i j = α(||τ i j −τ i || 2 ) where α(·) is a monotonically non-decreasing function in the non-negative domain. Supportedbythefactoflocal-isometry(eq.6.2),wehaveα i j ∼α(||x i j −x i || 2 ). So, ϵ i j ∼N(0,(σ 2 n +σ 2 α(||x i j −x i || 2 ))I D ) j = 1,2,...,m i (6.7) Eq.6.7immediatelysuggestsaweightfunction,s jj =s(∥τ i j −τ i ∥ 2 ) = 1/(σ 2 n +σ 2 α(||x i j − x i || 2 )). Giventhefunctionα(·),s(·)canbe determined and we get the weight matrixS. Then, eq.6.6canbeeffectivelysolvedbythefollowingoptimizationframework argmax J i ||(J T i f X i SS T f X T i J i )|| ∗ ,s.t.,J T i J i =I d (6.8) 96 Essentially, the solution of eq. 6.8 is just the largest d eigenvectors of the matrix f X i SS T f X T i ( f X = (X i −x i 1 T m i )), which is called local structure matrixT i ∈R D×D . The local intrinsic dimensionalitydcanbeestimatedbyfindingthelargestgapbetweeneigenvaluesofT i [82]. Analysis. By modeling inhomogeneous errore i j , curvature effect (Hessian) is implicitly considered without high order terms. Furthermore, for a large range ofα(·) (such as high order polynomial), weights jj is quite small when||x i j −x i || 2 is large. Thus, outliers will not affect theestimationresultsmuch. It is also interesting to compare the local learning methods with different kernel functions α(·). For standard low-rank matrix approximation by Singular Value Decomposition (SVD), α(·) is a constant function. This is because SVD assumes data lie on a linear subspace without considering curvature or outliers. SVD can be improved to Robust SVD, by introducing the robust influence function to handle outliers [54]. However, Robust SVD is still constrained by thelinearmodelandthecomputationalcostishighbecauseoftheiterativecomputation. 
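As a concrete reference point for these comparisons, the weighted estimator of eqs. 6.6-6.8 can be written in a few lines. The sketch below is illustrative only: the quadratic alpha is one reading of the "quadratic α(·)" used in the experiments of section 6.5 (any monotonically non-decreasing choice is admissible), and the function name local_tangent_space is hypothetical.

import numpy as np

def local_tangent_space(X_local, x_i, sigma_n=1.0, sigma=1.0, d=None,
                        alpha=lambda t: t**2):
    # Weighted local tangent-space estimation, eqs. 6.6-6.8 (sketch).
    # X_local is D x m (columns are the m neighbors of x_i).  Neighbor j receives weight
    # s_jj = 1 / (sigma_n^2 + sigma^2 * alpha(||x_ij - x_i||^2)), so points far from x_i
    # (large Taylor error, likely outliers) contribute little.
    Xc = X_local - x_i[:, None]                    # centered local data  X~ = X_i - x_i 1^T
    r2 = np.sum(Xc**2, axis=0)                     # squared distances to x_i
    s = 1.0 / (sigma_n**2 + sigma**2 * alpha(r2))  # diagonal of the weight matrix S
    T_i = (Xc * s**2) @ Xc.T                       # local structure matrix  X~ S S^T X~^T
    lam, V = np.linalg.eigh(T_i)
    lam, V = lam[::-1], V[:, ::-1]                 # eigenvalues in decreasing order
    if d is None:
        d = int(np.argmax(lam[:-1] - lam[1:]) + 1) # largest eigen-gap -> intrinsic dimension
    return V[:, :d], d                             # tangent basis J_i (D x d) and estimated d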
Onthe otherhand,TensorVoting(TV)usesGaussiankernel[82],whichcanbeviewedasaspecialcase ofs(·)whenσ n = 0. ThisisbecausestandardTensorVotingdoesnotconsiderinliernoise. 6.4 GlobalManifoldStructureLearning In this global stage of RMMSL, we focus on multiple smooth manifolds clustering based on localstructurelearningresults. In contrast to previous works, the clustering stage in RMMSL can handle multiple non- linear(possiblyintersecting)manifoldswithdifferentdimensionalitiesanditexplicitlyconsiders 97 Figure6.2: Anexampleof multiple smooth manifolds clustering. The first one is the inputdata samples and the other three are possible clustering results. Only the rightmost is the result we wantbecausetheunderlyingmanifoldsaresmooth. outlier filtering, which is addressed as one step in the clustering process. Compared to the standard assumptions in clustering, i.e., the intra-class distance should be small while the inter- class distance should be large, we further argue that each cluster should be a smooth manifold component. As shown in Fig. 6.2, when two manifolds intersect, there exist multiple possible clustering solutions, while only the rightmost is the result we want. The underlying assumption we make is local manifold has relatively low curvature, i.e. it changes smoothly and slowly as spatial distance increases. In order to get the flattest manifold clustering result, it is natural to incorporateaflatnessmeasureintotheclusteringobjectivefunction. Curved-Level Measure. As an approximation of (continuous) Laplacian-Bertrami opera- tor [40], (discrete) graph Laplacian can measure the smoothness of the underlying manifold. However, computing graph Laplacian is a global process and most of the theoretical results can not be easily adapted to the multi-manifold setting. On the other side, curvature is a local mea- surementtoindicatethecurveddegreeofageometricobject. Asageneralizationofthecurvature for 1D curve and 2D surface, the Ricci curvature tensor is proposed to represent the amount by which the volume element of a geodesic ball in a curved (Riemannian) manifold deviates from thatofthestandardballinEuclideanspace. 98 Inspired by the idea of curvature, the following curved-level measurement R(x) is consid- ered, R(x) = ∑ x i ∈N(x) ||θ(J i ,J)|| d(x i ,x) (6.9) θ(J i ,J) measures the principal angle between the tangent space J i ∈ ℜ D×d i (T x i M) and J ∈ℜ D×d (T x M). d(x i ,x) is the geodesic distance betweenx i andx. N(x) is the spatial neighborhood points set forx. Intuitively,R(x) is analogous (up to a constant variation) to the integration of the unsigned principal curvatures along different directions atx. The theoretical analysis on the connection between R(x) and curvature (such as mean curvature) as well as the asymptotic behavior when the number of data samples goes to infinity is left for future investigation. Based on eq. 6.9, we can measure the (approximate) total curved level on one manifoldclusterM k bysummingupR(·)onalldatasamplesx i belongingtothiscluster, R(M k ) = ∑ x i ∈M k R(x i ) = ∑ x i ∈M k ,(i,j)∈G ||θ(J i ,J j )|| d(x i ,x j ) (6.10) where G is the undirected neighborhood graph built on data samples (ϵ-neighborhood or K- nearestneighborhoodgraph). ObjectiveFunction. Eq.6.10providesanempiricalwaytomeasurethetotalcurveddegree on one particular manifold clusterM k . This can be viewed as one type of the intra-cluster 99 dissimilarityinthestandardclusteringframework. 
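Before adding the inter-cluster term below, the curved-level measure of eq. 6.9 can be made concrete. The sketch uses the largest principal angle between two tangent bases and approximates the geodesic distance by the Euclidean one, as the similarity kernel later does; the choice of norm for the principal-angle vector and the helper names are illustrative assumptions.

import numpy as np

def principal_angle(J1, J2):
    # Largest principal angle theta(J1, J2) between two tangent spaces whose columns are
    # orthonormal; the cosines of the principal angles are the singular values of J1^T J2.
    # (Using the sum of the angles instead is an equally valid choice of norm.)
    s = np.clip(np.linalg.svd(J1.T @ J2, compute_uv=False), -1.0, 1.0)
    return float(np.arccos(s).max())

def curved_level(x, J, neighbors):
    # Empirical curved-level measure R(x) of eq. 6.9.  `neighbors` is a list of
    # (x_i, J_i) pairs in the spatial neighborhood of x; d(x_i, x) is approximated
    # by the Euclidean distance.
    return sum(principal_angle(J_i, J) / np.linalg.norm(x_i - x)
               for x_i, J_i in neighbors)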
Inordertogetthebalancedclusteringresults, wealsoconsidertheinter-clusterdissimilarityfunctionasfollows, R(M k ,M l ) = ∑ x i ∈M k ,x j ∈M l ,(i,j)∈G ||θ(J i ,J j )|| d(x i ,x j ) (6.11) In practice, the value of||θ(J i ,J)||/d(x i ,x) in eq. 6.9 is unbounded and numerically un- stable. Thus, it is straightforward to compound eq. 6.9, 6.11 and the standard similarity kernel function(suchasGaussian)togetthenormalizedcurvedmeasurement. Inparticular,weusethe standard minimization framework by putting||θ(J i ,J)||/d(x i ,x) into the similarity function withanadditionaldistancesimilarityfunction, J RMMSL ({M k } nc k=1 ) = nc ∑ k=1 W(M k ,M k ) W(M k ) (6.12) where W(·) is the contrary version of the curved measure function R(·), i.e., the flatter the manifoldisthe larger valueofW(·)is.M k isthecomplementarysetofM k . Theformulation ofW(·)is, ∑ x i ∈M k ,(i,j)∈G w 1 ( ||θ(J i ,J j )|| d(x i ,x j ) )w 2 (d(x i ,x j )) (6.13) where w 1 (·) and w 2 (·) are similarity kernel functions that can be chosen as Gaussian or other standard formulations (similar forW(M k ,M l )). Due to the shrinkage effect of kernel,d(·) is further approximated by Euclidean distance. Intuitively, W(M k ) is one type of the intra-class 100 similarity function on one manifold cluster and W(M k ,M l ) is the inter-class similarity func- tionbetweentwoclusters. Theoptimaln c -classesclusteringresultsareobtainedbyminimizing eq.6.12. Algorithm. Indeed,eq.6.12canbeviewedasamulti-classnormalized-cutwithnovelsim- ilaritymeasurementprovidedbylocalcurvedsimilarityanddistancesimilarityfunctions[135]. Directly minimizing eq. 6.12 is an NP-hard problem, but it can be relaxed by graph spectral clusteringinthefollowingprocedure. Step 1. Before (global) manifold clustering, local structure learning of RMMSL in sec. 6.3 isperformed to estimatethe localtangent spaceJ i ∈ℜ D×d i at each pointx i . d i can be locally estimatedfromthelocallearningstageofRMMSLorchosenasafixedvalueinadvance. Also, theneighborhoodgraphGisbuiltonallinputdatasamples{x i } n i=1 (n =n 1 +n 2 ). Step2. ConstructingthesimilaritymatrixW = [w(x i ,x j )] i,j ∈ℜ n×n bythefollowingtwo kernels. The first kernel is the pairwise distance kernel, which is widely used in graph spectral clustering and defined as w 1 (x i ,x j ) = exp{−||x i −x j || 2 /(σ i σ j )}. In particular, we use the idea from self-tuning spectral clustering to select the local bandwidth σ i and σ j [152]. The second one is curved level kernelw 2 (x i ,x j ) = exp{−(θ(J i ,J j )) 2 /(||x i −x j || 2 (σ 2 c /σ i σ j ))}, whereσ c isusedtocontroltheeffectofthiscurvedsimilarity. Then,w(x i ,x j )issetas w ij =w 1 (x i ,x j )w 2 (x i ,x j ) = exp{−( ||x i −x j || 2 σ i σ j + θ(J i ,J j ) 2 ||x i −x j || 2 σ 2 c /σ i σ j )}. (6.14) Step e 3(optional). BasedonW,outlierdetection(filtering)canbedoneasdescribedlater. 101 Step 4. Once we have the similarity matrixW, the standard spectral clustering technique can be applied. Specifically, we compute the (unnormalized) Laplacian matrixL = D−W, whereD is a diagonal matrix whose elements equal to the sum ofW’s corresponding rows. We select the firstn c (number of clusters) eigenvectors of the generalized eigenproblemLe = λDe. Finally,K-meansalgorithmisappliedontherowsoftheseeigenvectors. Afteridentifying manifoldclusterlabels,manytaskssuchasembeddinganddenoisingcanbeperformedoneach manifoldcluster. OutlierDetection. Weformulatetheoutlierdetectionproblemasamanifoldsaliencyrank- ing problem by the random walk model proposed in [159]. 
We define a random walk graph on {x_i}, i = 1,...,N, from the following transition probability matrix P = I − L_rw = D^{-1}W. It can be shown that, if W is symmetric, then the stationary probability π of this random walk can be calculated directly, without an expensive eigen-decomposition [159],

π = 1_n^T D / ||W||_1   (6.15)

where 1_n is an n×1 column vector whose elements are all 1 and ||·||_1 is the entry-wise 1-norm of a matrix. It follows that ranking the data according to π is the same as ranking by the un-normalized distribution 1_n^T D. After ranking, the bottom points can be filtered out as outliers. The ratio of outliers can be given as a prior or estimated by performing K-means on 1_n^T D.

Analysis. The key idea of this algorithm is to construct the similarity matrix W in a multiple kernel setting (based on the local structure estimates) so as to encourage flatter clustering results. It is worth noting that if two points have different tangent spaces, the similarity (eq. 6.14) becomes smaller as the two points get closer (within a range). This gives an intuitive explanation of why RMMSL can handle multiple intersecting manifolds. From a high-level point of view, the pairwise similarity incorporates the information of two local point sets rather than of two single points only. Similar ideas were proposed in [18, 27, 136].

Table 6.1: Rand index scores of clustering on synthetic data, USPS digits, CMU MoCap sequences and Motorbike videos.

Data / Methods             K-means   NJW    STC    GPCA   SSC    RMMSL
Big-small spheres          0.50      1.00   1.00   0.51   0.56   1.00
Two-intersecting spheres   0.76      0.78   0.84   0.50   0.53   0.95
Two-intersecting planes    0.51      0.60   0.62   0.85   0.93   0.95
USPS-2200                  0.80      0.89   0.89   -      -      0.90
USPS-5500                  0.78      0.88   0.88   -      -      0.88
CMU MoCap                  0.69      0.81   0.81   -      -      0.89
Motorbike Video            0.72      0.84   0.85   0.85   0.87   0.96

6.5 Experiments

We evaluate the performance of RMMSL on synthetic data, USPS digits, CMU Motion Capture data (MoCap) and Motorbike videos. These data are chosen to demonstrate the general capability of RMMSL as well as its advantages on nonlinear manifolds. We investigate the performance of manifold clustering and further applications such as human action segmentation and motion flow modeling. We also perform quantitative comparisons of the local tangent space estimations. In particular, the weighted low-rank matrix decomposition (the local structure learning stage of RMMSL) is compared with local SVD (or local PCA [124]) and ND-TV [82]. The results show that RMMSL is more robust to curvature and outliers than the other methods. Also, if the manifold has relatively low curvature and is outlier free, then RMMSL and local SVD give similar results. Details are omitted due to space limit.

Figure 6.3: Visualization of part of the clustering results in Table 6.1. The first row: one noisy sphere inside another noisy sphere in ℜ^3. The second row: two intersecting noisy spheres in ℜ^3. The third row: two intersecting noisy planes in ℜ^3. For each part, from left to right: K-means, self-tuning spectral clustering [152], Generalized PCA [133] and RMMSL.
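Before turning to the individual experiments, the global stage of section 6.4 (multi-kernel similarity of eq. 6.14, degree-based outlier ranking of eq. 6.15, and spectral clustering) can be summarized in one sketch. It is an illustration under assumed defaults, not the evaluated implementation: the self-tuning bandwidth rule, the outlier_ratio parameter and the function names are hypothetical, and principal_angle is the helper sketched in section 6.4.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def principal_angle(J1, J2):
    # largest principal angle between two tangent bases (as in the earlier sketch)
    s = np.clip(np.linalg.svd(J1.T @ J2, compute_uv=False), -1.0, 1.0)
    return float(np.arccos(s).max())

def rmmsl_global(X, J, n_clusters=2, k_local=10, sigma_c=1.0, outlier_ratio=0.05):
    # Global stage of RMMSL (sketch): similarity of eq. 6.14, outlier ranking via the node
    # degree (eq. 6.15), then spectral clustering (Steps 1-4 of section 6.4).
    # X is N x D (rows are points); J is a list of N tangent bases (D x d_i) from the local stage.
    N = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sigma = np.sort(dist, axis=1)[:, k_local]            # self-tuning local bandwidths
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            ss = sigma[i] * sigma[j]
            d2 = dist[i, j]**2 + 1e-12
            theta = principal_angle(J[i], J[j])
            W[i, j] = W[j, i] = np.exp(-(d2 / ss + theta**2 * ss / (d2 * sigma_c**2)))
    # outlier filtering: the stationary distribution of the random walk is proportional
    # to the node degree, so the lowest-ranked points are dropped
    keep = np.sort(np.argsort(W.sum(axis=1))[int(outlier_ratio * N):])
    W_in = W[np.ix_(keep, keep)]
    D_in = np.diag(W_in.sum(axis=1))
    L_in = D_in - W_in                                   # unnormalized graph Laplacian
    _, vecs = eigh(L_in, D_in)                           # generalized problem  L e = lam D e
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vecs[:, :n_clusters])
    return keep, labels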
Multiple Manifolds Clustering. In this task, quantitative comparisons of manifold clustering are provided on three data sets. We compare RMMSL (global structure learning) to K-means, spectral clustering (the NJW algorithm [86]), self-tuning spectral clustering (STC) [152], Generalized PCA (GPCA) [133] and Sparse Subspace Clustering (SSC) [18]. These methods (except K-means) are chosen because they are related to manifold learning or subspace learning. We also performed comparisons with other spectral clustering and subspace clustering methods such as [149] and [26] and reached the same conclusion, but those results are omitted due to space limit. The Rand index score is used as the evaluation metric.

Figure 6.4: Clustering results of RMMSL on two manifolds with outliers. From left to right: ground truth, outlier detection, clustering after outlier filtering, and clustering without outlier filtering.

The kernel bandwidth σ in spectral clustering is tuned in {1, 5, 10, 20, 50, 100, 200} for synthetic data, MoCap and videos, and in {100, 500, 1000, 2000, 5000} for USPS. The value of the K-th neighborhood in self-tuning spectral clustering and RMMSL is chosen from {5, 10, 15, 20, 30, 50, 100}. σ_c in the global stage of RMMSL is chosen from {0.2, 0.5, 1, 1.5, 2}. In the local stage, a quadratic α(·) is used and σ_n is set to 1. The sparse regularization parameter of SSC is tuned in {0.001, 0.002, 0.005, 0.01, 0.1}. For synthetic data with random noise, the parameters for all methods are selected on 5 trials and the average performance on another 50 trials is reported. For real data, parameters are selected by picking the best Rand index. For all methods containing K-means, 100 replicates are performed.

Synthetic Data. Since most clustering algorithms do not consider outliers explicitly, we first perform a comparison on 3 outlier-free synthetic data sets, each of which contains 2000 noisy samples from two manifolds in ℜ^3 (d = 2 and D = 3). Results are shown in Fig. 6.3. Rand index scores are given in the first three rows of Table 6.1. For all methods, the number of clusters n_c is fixed at 2. The results (Table 6.1) show that RMMSL achieves comparable and often superior performance to the other candidate methods. In particular, when two manifolds are nonlinear and
Compared with multiple subspace learning methods such as GPCA and SSC, which are the state-of-the-art for linear manifold clustering, our approach is better when underlying manifoldsarenonlinear. USPSDigits. WechoosetwosubsetsofUSPShandwrittendigitsimages. Thefirstcontains 1100 samples for digits 1 and 2 each (USPS-2200) and the second contains 1100 samples for digits1to5each(USPS-5500). Disreducedfrom256(size16×16images)to50byPCAandd isfixedas5. Duetothehighlynonlinearimagestructure,resultsofsubspaceclusteringmethods are not reported. For USPS data, the possible high intrinsic dimensionality v.s. the limited number of samples bring difficulties for data structure learning, especially the local learning stageofRMMSL.Nevertheless,RMMSLachievescomparableresults. 106 Figure6.5: AnexampleofhumanactionsegmentationresultsonCMUMoCap. Topleft,the3D visualizationofthesequence. Topright,labeledgroundtruthandclusteringresultscomparison. Bottom,9uniformlysampledhumanposes. MoCap Data. The automatic clustering of human motion sequences into different action unitsisanecessarystepformanytaskssuchasactionrecognitionandvideoannotation. Usually this is referred as temporal segmentation. In order to make a fair comparison among different clusteringmethods,wefocusonthenon-temporalsetting,i.e.,thetemporalindexisremovedand sequencesaretreatedascollectionsofstatichumanposes. Wechoose5mixedactionsequences from subject 86 in the CMU MoCap. We use the joint-position (45-dimensional representation for 15 human body markers inℜ 3 ) features which are centralized to remove the global motion. The average Rand scores are reported in Table 6.1. It shows that RMMSL achieves higher clusteringaccuracythanothercandidatemethods. One motion sequence (500 frames) and the corresponding results are visualized in Fig. 6.5. The subject walks, then slightly turns around and sits down. By combining the local learning results from RMMSL and [124], joint-position features (ℜ 45 ) from this sequence are visualized 107 Figure 6.6: Two examples of motion flow modeling results on motorbike videos. From left to right: two images samples, optical flow results, learned motion manifolds with highlighted motiondirections. inℜ 3 . Thisfiguresupportstheassumptionthattherearelow-dimensionalmanifoldsinthejoint- positionspace. Infact,thisMoCapsequencecanbeviewedasthreeconnectednonlinearmotion manifolds,correspondingtowalking,turn-aroundandsit-downrespectively. MotionFlowModeling. Unsupervised motion flow modeling is performed on videos [61]. Thegoalistoanalyzecoordinatedmovementsformedbymultipleobjects,extractsemanticlevel information,andunderstandwhat’shappeninginthescene. Givenmotorbikevideosasshownin Fig. 6.6 (from YouTube), global motion pattern is learned from low level motion features. This is an important task for video analysis and can be served as a foundation for many applications suchasobjecttrackingandabnormaleventdetection[60,61]. Differ from the probabilistic approaches in [61], we formulate motion flow learning as a manifold clustering problem. In the experiments, optical flows on salient feature points are estimatedbyLucas-Kanadealgorithm. Everyfeaturepointhas4Dinformationof(x,y,v x ,v y ). Thenmotiondirectionθiscalculated,andeverypointisembeddedto(x,y,θ)space. Weobserve that coordinated group movements form into manifold structures in (x,y,θ) space. Therefore, 108 the points are used as the input. The first video contains n = 9266 points and the second one contains n = 8684 points. 
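Before clustering, each pair of consecutive frames is converted into such a point set in (x, y, θ) space. The sketch below shows one way to carry out this embedding with OpenCV's sparse Lucas-Kanade tracker; the specific OpenCV calls, corner-detection parameters and sampling choices are illustrative stand-ins rather than the exact pipeline used in the experiments.

```python
import numpy as np
import cv2  # OpenCV; used here only to obtain sparse Lucas-Kanade flow

def flow_points(prev_gray, next_gray, max_corners=2000):
    """Embed salient points with optical flow into (x, y, theta) space."""
    pts = cv2.goodFeaturesToTrack(prev_gray, max_corners, 0.01, 5)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    ok = status.ravel() == 1
    p0, p1 = pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)
    v = p1 - p0                                  # flow vectors (v_x, v_y)
    theta = np.arctan2(v[:, 1], v[:, 0])         # motion direction in [-pi, pi]
    return np.column_stack([p0[:, 0], p0[:, 1], theta])   # n x 3 embedding

# Stacking the embeddings over all frame pairs gives the point set that is
# fed to RMMSL for motion-manifold clustering.
```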
We use RMMSL to learn the global motion manifolds by doing manifold clustering (n c = 2) and get the best Rand scores which are reported in Table 6.1. Since optical flow results are noisy, outlier filtering is performed before clustering. The motion manifoldlearningresultsoftwomotorbikevideosareshowninFig.6.6,wheremotionmanifolds are visualized on images after kernel density interpolation. From the results we can see that the clustered manifolds have clear semantic meanings, since each manifold corresponds to a coordinated movement formed by a group of motorbikes. Therefore, RMMSL correctly learns globalmotiontohelpunderstandthevideoscenes. 6.6 Conclusion RobustMultipleManifoldsStructureLearningisproposedtoeffectivelylearnthedatastructure by considering noise, curved-level and multiple manifolds assumption. In particular, the esti- matedlocalstructureisusedtoassisttheglobalstructurelearningtasksofclusteringandoutlier detection. The algorithm is evaluated and compared to other state-of-the-art clustering meth- ods. Theresultsareencouraging: onbothsyntheticandrealdata,theproposedalgorithmyields smallerclusteringerrorsinchallengingcases, especially whenmultiple nonlinearmanifolds in- tersect. Furthermore,theresultsonactionsegmentationandmotionflowmodelingdemonstrate RMMSL’scapabilityforbroadchallengingapplicationsinrealworld. 109 PartII StructureLearningforMultivariateTimeSeries 110 Chapter7 Spatio-TemporalAlignment We address the problem of learning view-invariant 3D models of human motion from motion capture data, in order to recognize human actions from a monocular video sequence with arbi- trary viewpoint. We propose a Spatio-Temporal Manifold (STM) model to analyze non-linear multivariatetimeserieswithlatentspatialstructureandapplyittorecognizeactionsinthejoint- trajectories space. Based on STM, a novel alignment algorithm Dynamic Manifold Warping (DMW)andarobustmotionsimilaritymetricareproposedforhumanactionsequences,bothin 2D and 3D [28]. DMW extends previous works on spatio-temporal alignment by incorporating manifold learning. We evaluate and compare the approach to state-of-the-art methods on mo- tioncapturedata andrealisticvideos. Experimental results demonstrate the effectivenessofour approach,whichyieldsvisuallyappealingalignmentresults,produceshigheractionrecognition accuracy,andcanrecognizeactionsfromarbitraryviewswithpartialocclusion. 111 7.1 Overview Although significant progress has been made in action recognition [57, 88, 75, 84, 150, 89], the problem remains inherently challenging due to significant intra-class variations, viewpoint change,partialocclusionandbackgrounddynamicvariations. Akeylimitationofmanyaction- recognition approaches is that their models are learned from single 2D view video features on individual datasets and thus unable to handle arbitrary view change or scale and background variations. Also,sincetheyarenotgeneralizableacrossdifferentdatasets,retrainingisnecessary foranynewdataset. Our research is motivated by the requirement of view-invariant action recognition and the fact that the existing human motion capture (Mocap) data provides useful knowledge to under- standtheintrinsicmotionstructure(Fig.7.2). Inparticular,weaddresstheproblemofmodeling and analyzing human motion in the joint-trajectories space. 
Our proposed approach has the followingmodules: (1) Given a labeled Mocap sequence with M markers in 3D, which is a 3M-dimensional sequentialdata(3MD+t),thelowdimensionalmanifoldstructure(i.e.,tangentspace,geodesics distance,etc)islearntbyusingTensorVoting. Thisisanofflineprocess,asshowninFig.7.1. (2)Forotherunlabeledmotionsequencesin3D,aftertheintrinsicstructurelearning(1),we can calculate the motion similarity score with each labeled motion sequence by the proposed spatio-temporalalignmentapproach,todoactionrecognition. Thisisanonlineprocess. 112 Figure7.1: Flowchartoftheproposedapproach. (3) More interestingly, our system can recognize actions from videos. 2D tracking results from single view videos are often noisy and have occlusions, while the structure learning algo- rithms(1)remainthesameandouralignmentapproach(2)cannaturallyhandle2Dinput. Ourapproachhasthefollowingadvantages: Oneorveryfewexamplesarerequiredineachactioncategoryinthetrainingstage,compared to100sformanylearningapproaches. 113 Intra-person variations: a person repeating an action twice with differences, especially in motiondynamics,canbehandledbytheproposedtemporalalignmentmethod(sec.7.4.1). Inter-person variations: two people performing an action with differences in both pose style and motion dynamic, are considered by combining temporal and spatial alignment (sec. 7.4.2) together. Viewinvariance: lowdimensionalhumanmotionmanifoldmodelsarelearntfromthe3DMo- cap data, and our spatial-temporal alignment algorithms can handle 2D input (arbitrary view- point); these two features enable our system to recognize actions regardless of the viewpoint, withoutdatasetdependenttraining. Occlusion handling: in order to recognize actions from videos, key points trajectories need to betracked. InsteadofM keypoints,oftenonlyK visiblepointscanbetrackedduringthewhole action (K≤ M), such as a side view video of a walking man. Our system can handle these 2D noisytrackingtrajectories,evenwithocclusion(2KD+t). Transfer learning: when applying our approach to a video dataset, there is no training pro- cess on this dataset and people in these videos do not necessarily appear in the labeled Mocap sequences. Thus, our approach can be considered as a transfer learning framework, i.e., the knowledgefromlabeledMocapdatacanbeadapted toanyactionvideo. Our overall approach is sketched in Fig. 7.1. The joint-trajectories of M human body key points are used to represent a human motion sequence. Trajectories can be either provided by Mocap (3D) or be tracked from a single view video (2D). The modules filled in blue are novel, and represent the key contributions in our approach. We briefly introduce these modules in this section,andtechnicaldetailsaregiveninsection7.3,7.4and7.5. 114 Section 7.3 - Structure Learning. An important property we use is that, although human motion data lies in a high dimensional space, it can be well represented in a low dimensional intrinsicspaceembeddedintheobservations. Thishasbeenpreviouslypointedoutin[139,130] for actions such as walking and running. In particular, we propose Spatio-Temporal Manifold (STM) framework (sec. 7.3), incorporating both spatial and dynamic structures, to model the joint-trajectories. Given a human motion sequence, which defines a STM in 3MD (Mocap) or 2KD (2D input with possible occlusion), where M and K are the numbers of markers for 3D and 2D input respectively. Tensor Voting is used to learn the manifold structure, while voting algorithmsaremodifiedtocombinethetemporalinformation. 
Section7.4-Spatio-TemporalAlignment. BasedonthestructurelearningresultsfromSTM, a motion distance function is proposed to measure similarity in two motion sequences, after proper spatial and temporal alignment. There are three proposed algorithms in this module, temporal alignment, spatial alignment and motion similarity metric. The first two are designed asunsupervisedmachinelearningalgorithms,andcanbeappliedfornon-linearmultivariatetime series in general. To meet online computational requirements, our algorithms are efficient and easy to implement without explicit parameters tuning. Recognizing actions from Mocap can be donebyassociatingthetestsequenceswitheachlabeledsequenceandselectingthepairwiththe maximumsimilarity. PromisingpreliminaryresultsarereportedonCMUMocap,andthefuture planistoincludemoreactioncategoriesandincreasetheinter/intrasubjectvariability. Section 7.5 - Action recognition from videos. More interestingly, our system can not only recognize actions from 3D input, but also 2D input from arbitrary viewpoint, making view- invariant recognition from single view videos possible. To apply our approach for videos, we 115 Figure 7.2: Examples of motion capture systems. CMU Mocap (left), Gypsy Motion Capture System(middleandright). have a pre-processing step to extractjoint 2D trajectories from image observations by a tracker. We use the tracker proposed by David Ross [100] or any reasonable tracker. Although tracking results are often noisy (dynamic variations, lack of local contrast, etc), we can still recognize actions from those 2D tracked trajectories, even with occlusions (follow the same procedure in section4and5). Weplantovalidateourapproachonseveralchallengingdatasets,suchasBrown HumanEva [112], INRIA IXMAS [142], etc. We have already obtained promising preliminary resultsonHumanEva. 7.2 RelatedWork Inspiredbythesuccessinobjectrecognitions,low-levelfeatureslikeSpace-TimeInterestPoints (STIPs)plusHistogramofOrientedGradient(HOG)descriptorsareusedinmanyactionrecog- nition works [57, 88]. Silhouettes based features are also popular [142, 69], for which good results rely on accurate foreground extractions. Some works also use tracked key points, which 116 arequantizedasfeaturevectorsbythepre-learnedormanuallydesignedcodebook[78,122,75]. Actionrecognitionisamultifacetedfield,ourdiscussionfocusesonview-invariantmethods,and readerscanrefertoarecentreview[94]formoredetails. Inshort,althoughgreatprogresshasbeenmade,mostoftheworksonlyfocusonfewaction classes(around10)andwellcontrolledexperimentalsettings(smallintra-classvariations). This is in a sharp contrast to the complexity of the real life motion behavior and requirements of applications(i.e.,videosurveillance,videoindexing,etc). ViewInvariantRecognition. HiddenMarkovModel(HMM)isbuilton3Djoint-trajectories (Mocap) to capture the dynamic information of human motion [68]. The claimed advantage of the3DHMMmodelisthatthedependenceonviewpointandilluminationisremoved. However, HMM requires large amount of training data in relatively high dimensional space (e.g. 67) and theHMMstructuremustbeadaptivelydesignedforspecificapplicationdomains. Thesemaybe potentialfactorsthatmaketherecognitionperformanceunsatisfactory,andAdaBoostisusedto improvetheaccuracy[68]. View-independenceisalsoaddressedin[69,84]byrenderingMocap data of various actions from multiple viewpoints, which is a time and storage consuming pro- cess. 
Forinstance,10 o intervalforcameratiltangle(rangeπ/2)andpanangle(range2π)results in360renderimagesforeachpose,whichincreasesthecomputationalcostforrecognition. An- otherclassofmethodsreliesonrecovering3Dposesinformationfromsilhouettes. In[142],3D models are projected onto 2D silhouettes with respect to different view point, and [150] detects 2Dfeature first andthen back-projects themto action features based on a 3Dvisual hull. These methods require a computationally expensive search process over model parameters to find the best match between 2D features and 3D model. Very recently, in [143], a 3D HOG descriptor 117 wasproposedtohandleviewpointchange,andthisapproachrequiresthemultipleviewcamera settingsfortrainingdatatoachievetheview-invariantrecognition. Departing from these methods, our recognition process does not require 2D pose rendering or parameters search. Our trajectory features are located at body skeleton’s key locations, with explicitsemanticmeaning,allowingoursystemtobedirectlyappliedtoarbitraryscenewithout datasetsdependenttraining. Dynamic Manifold Model. Non-linear manifold learning and Latent Variable Modeling (LVM)isprominentinmachinelearningresearchinthepastdecade[125,104]. In[59],Tensor Voting[82]isusedtoanalyzethe1Dmanifoldoflandmarksequences,andthemanifoldstructure is applied to 3D face tracking and expression inference. In particular, some probabilistic latent variable frameworks, i.e., GP-LVM, GPDM and its variants [58, 139, 129], focus on motion capture data and try to capture the intrinsic structure of human motion, which is further applied to3Dmonocularpeopletracking[130]. Oneadvantageofthesemethodsistheabilitytomodel the low-dimensional latent space with associated dynamics based on a few high-dimensional training data. Furthermore, missing values are easily handled thanks to the inherent properties oftheprobabilisticframework. While our STM framework is inspired by [59, 130, 139], our goal is significantly different. GPDMmethods explicitlymodelthelatenthumanposespace,whichisdesignedforrecovering the intrinsic motion structure. By contrast, STM implicitly models the latent pose space and fo- cusesonrecoveringthelatent“completion”variable,whichismoresuitableformotionsequence alignment. 118 MotionSequenceMatching. Giventwohumanmotionsequences,animportantquestionis to consider whether those two sequences represent the same motion, similar motions or distinct motions. This canbeviewedas a (spatio-temporal) alignment problem, serving as a foundation for action recognition, clustering, etc. Canonical Component Analysis (CCA) [4], proposed for learning the shared subspace between two high dimensional features, which been used as the spatialmatchingalgorithmforactivityrecognitionfromvideo[48]andactivitycorrelationfrom cameras[67]. Videosynchronizationisaddressedasatemporalalignmentproblemin[99,114], which uses dynamic time warping (DTW) or its variants [96]. [128] uses optimization methods to maximize a similarity measure of two human action sequences, while the temporal warping isconstrainedby1Daffinetransformation. Thesamelineartemporalmodelisalsousedin[91]. Very recently, as an elegant extension of CCA and DTW, Canonical Time Warping (CTW) is proposed for spatio-temporal alignment of two multivariate time series and applied to align human motion sequences between two subjects [160]. CTW is formulated as an energy min- imization framework and solved by an iterative gradient descent procedure. 
Since spatial and temporal transformations are coupled together, the objective function becomes non-convex and the solution is not guaranteed to be global optimal. Under the STM model, we propose Dy- namic Manifold Warping (DMW), which focuses on time series with intrinsic spatial structure andguaranteesglobaloptimalsolution. Temporal Segmentation. Some works focus on how to correctly segment motion capture sequences. [5] proposed an on-line algorithm to decompose motion sequences into distinct actionunitinthenon-smoothingpointby(probabilistic)PrincipalComponent Analysis(PCA). 119 Aligned Cluster Analysis (ACA) is developed for temporal clustering of facial behavior with a multi-subjectcorrespondencealgorithmformatchingfacialexpressions[161,162]. 7.3 Spatio-TemporalManifold Supposethereisad-dimensionalsubmanifoldMembeddedinanambientspaceofdimension- ality D≫ d. We use latent variable model (LVM) to representM as a mapping between the intrinsic space and the ambient space: f :R d →R D andx = f(τ)+ϵ, wherex∈R D is the observation variable,τ∈R d is the latent variable andϵ∈R D is the noise. In computer vision applications, the mapping function f is often highly non-linear, and the ambient space is the spatial(feature)space,soMisalsocalledspatial-manifold. Despitethesuccessfulapplications ofmanymanifoldlearningmethods,theyarenotcapableforsequentialdatamodeling,sincethe dynamic properties are involved in the standard manifold learning framework. To incorporate thetemporaldimensionintothestandardLVM,weproposeanovelframeworkasfollows. 7.3.1 MathematicalModel Definition: aspatio-temporalmanifold(STM)isadirectedtraversingpathM p (withboundary orcompact)onaspatial-manifoldM,andfurtherembeddedinR D . A traversing pathM p can be intuitively thought as a point walking onM from a starting point at time t 1 (τ start ,x start ) to an ending point (τ end ,x end ) at time t 2 . A path is not just a subset ofM which ”looks like” a curve, it also includes a natural parametrization as, g ζ→ : [01]→M,s.t. g(0) =τ start andg(1) =τ end . So,anewlatentvariableζ∈ [01]isassociated with every point in this path. Furthermore, the relationship betweenζ and temporal indext can 120 be modeled as a time series p t→ζ : [t 1 t 2 ]→ [0 1], s.t. h(t 1 ) = 0 and h(t 2 ) = 1. SinceM is embeddedinR D byf(·),essentiallythetraversingpath(withnoise)canbedescribedasanon- linear multivariate time series asx(t) = f(g(p(t)))+ϵ. The left part of Fig. 3 is the graphical modelofSTM,andtherightpartisavisualizationofatraversingpathona2Dsphereembedded inR 3 . STMforHumanMotionData. GivenalengthL x humanactionsequence(e.g. stretching), the joint-trajectories can be represented as a matrixX 1:Lx = [x 1 x 2 ...x Lx ]∈ℜ D×Lx , where x t isthejoint-positionsattemporalindext. In3D(Mocap),x t = [p t 11 p t 12 p t 13 ...p t M1 p t M2 p t M3 ] T ∈ R 3M×1 andp t i = (p t i1 p t i2 p t i3 ) is the coordinate of thei th marker inR 3 . Or in 2D (tracking tra- jectories),x t = [u t 11 u t 12 ...u t K1 u t K2 ] T ∈R 2K×1 (K≤M),u t i = (u t i1 u t i2 )isthepixellocation of the i th key point. Althoughx t lies in a high dimensional space, the natural property of hu- man pose suggestsx i has lower intrinsic degree of freedom. So,X 1:Lx is just a sequence of sampled observations on a STM. Here, the ambient space is the joint-position space, manifold M is the human pose space, andM p is a specific type of human action. The newly introduced variable ζ is assigned to a semantic meaning which indicate the “completion” degree of a ac- tion. 
For a complete action sequence, i.e., an action unit, we assume the starting point of the actionhasζ = 0andtheendingpointhasζ = 1 1 . Withoutspecificnotes,anactionsequenceis correspondingtoacompleteactionunit inthisdissertation. 1 Forperiodicmotion,i.e,walking,itdefinesamotioncycle. 121 Figure7.3: Spatio-TemporalManifold Model. Graphical model comparisons (left), a visualiza- tionexample(right). 7.3.2 StructureLearning With the temporal indext, STM is a multivariate time series. Interestingly, withoutt, the set of all points in a STM is empirically found to be a 1D smooth manifold by Tensor Voting. Given {x t } L t=1 be L ordered data points sampled from a STM, the goal of learning is to estimate the tangent space and recover the latent “completion” variable ζ t from those samples. Note that our goal is different from most latent variable models, which aim to identifyτ [125, 104] and sometimesf(·)[58,139,129]. Learningd Geo (·). WeuseTensorVotingtocalculatetheminimumtraversingdistancebetween x s andx s+1 (1≤ s≤ L−1)toapproximate thegeodesicdistanced Geo (·). TensorVotingisa non-parametric framework propose to estimate the geometric information of manifolds, as well astheintrinsicdimensionality[82]. Letx s (0) =x s ,wehave d Geo (x s ,x s+1 ;M p )≈ R ∑ r=0 ∥x s (r)−x s (r+1)∥ L 2 (7.1) 122 wherex s (r+1)isupdatedfromthecurrentpointx s (r)(Fig.7.4), x s (k+1) =x s (k)+αJ ∗ (x s (r))J ∗ (x s (r)) T (x s+1 −x s (r)) (7.2) until x s (r + 1) converges to x s+1 . α is a step length, and J ∗ (x s (r)) is the tangent space estimationonx s (r)byTensorVoting.[59]usesTensorVotingtoestimatethemanifoldstructure for 3D face tracking in 126D space, while the temporal index is not explicitly considered. Our algorithmisarevisedversionof[59]undertheSTMframework. Figure 7.4: Learning geodesic distance. Left, variable-length path method in sec. 7.3; right, fixed-length2pathmethod. Learningζ t . Atwostageapproachis possible, first estimateτ (orf(·))on acollection oftime series, and then optimize{ζ 1:L }. Nevertheless, we propose a solution which performs direct estimationforindividualsequencebasedonthelearntgeodesicdistance. ζ ∗ t = ∑ t−1 s=1 d Geo (x s ,x s+1 ;M p ) ∑ L−1 s=1 d Geo (x s ,x s+1 ;M p ) (7.3) Sincethetraversingpathiscontinuousandsmooth,theglobalgeodesicdistanceisapproximately decomposedtothesumofthelocaldistance,inspiredbyISOMAP[125]. 123 In summary, STM is an extension of the manifold framework by adding the extra temporal dimension to model the sequential data. STM also extends the multivariate time series frame- workbyincorporatingthelatentvariablemodelinthespatialspace. Figure 7.5: An illustration of the non-linearity of ζ(t). Top, action “stretching”(Mocap), 6 samplesareuniformlydistributedin368frames;bottom,estimatedlatent”completion”variable. Thewholeactionisdecomposedinto5stages. Preliminary Results. We use the CMU Mocap data [13] in our experiments, andM = 15 keypointsareusedtorepresentthehumanbody,resultinginjoint3Dtrajectoriesin45D.These 15 key points are extracted from the “amc” and “asf” files by our joint-angle to joint-position transfer algorithm. Fig. 7.5 illustrates the latent completion variable ζ learning results from a “stretching” sequence in an action unit, i.e., from the action’s start to the end. We uniformly divide the sequence into 5 stages along the time index. The dynamic variations in stage 2 and 4 arelargerthantheothers,andthesetwocorrespondto“stretch”and“fold”arms. 
Stage3hasthe 124 smallest variation, because it corresponds to the “peak”state of a stretching, i.e., there is almost noarmmovement. 7.4 DynamicManifoldWarping GiventwohumanactionsequencesX 1:Lx ∈R Dx×Lx (M x p )andY 1:Ly ∈R Dy×Ly (M y p ) 2 ,we need to calculate the motion distance scoreS(X 1:Lx ,Y 1:Ly ), after proper spatial and temporal alignment. The problem is inherently challenging because of the large spatial/temporal scale difference between human actions, ambiguity between human poses, as well as the inter/intra subject variability [160]. We model the motion sequence matching as a spatial-temporal align- ment problem under the STM framework, and incorporate manifold learning, spatial alignment andtemporalalignmenttogether,resultinginDynamicManifoldWarping(DMW). 7.4.1 TemporalAlignment The first part of DMW is temporal alignment, which is called Dynamic Manifold Temporal Warping(DMTW).DMTWisthecombinationofmanifoldlearningandDynamicTimeWarping (DTW),andcanbeappliedtoanytemporaldatawithlatentspatialstructure. 2 Bothcanbe3Djoint-trajectories,oronea3Dandtheothera2Djoint-trajectoriesfromavideoclip. 125 Formulation. Given two time series X 1:Lx ∈ R Dx×Lx and Y 1:Ly ∈ R Dy×Ly , find the optimalalignmentpathQ = [q 1 ,q 2 ,...,q L ]∈R 2×L byminimizingthefollowinglossfunction (∥∥ F istheFrobeniusnormoperator), L DMTW (F x (·),F y (·),W x ,W y ) =∥F x (X 1:Lx )W T x −F y (Y 1:Ly )W T y ∥ 2 F (7.4) whereW x ={w x t,tx }∈{0,1} L×Lx ,W y ={w y t,ty }∈{0,1} L×Ly arebinaryselectionmatrices encodingthetemporalalignmentpathQ[160]. w x t,tx =w y t,ty = 1isequivalenttoq t = [t x t y ] T , which means x tx corresponds to y ty at step t in the alignment path. F(·) maps X 1:Lx and Y 1:Ly to a shared subspace with the same dimensionality. Essentially,F x (·) andF y (·) are spatialmappingfunctionsandW x andW y aretemporalwarpingmatrices. IfF(·) is an identity function, thenL DMTW reduces to∥ X 1:Lx W T x −Y 1:Ly W T y ∥ 2 F , which is equivalent to performing the standard DTW directly on X 1:Lx and Y 1:Ly . Unlike thealternativeiterativealgorithmtooptimizeL DMTW , i.e., optimizeW withfixedF andthen optimizeF with fixedW, we propose a two-step approach without the iterative computation. InsteadofoptimizingF x ,F y ineq.7.4,wedirectlyestimatethemundertheSTMframework. Step 1. Under the STM model in section 7.3, we chooseF x (X 1:Lx ) to beζ x 1:Lx ∈R 1×Lx and F y (X 1:Ly ) to be ζ y 1:Ly ∈ R 1×Ly . ζ t represent the universal structure for all STMs, making aligning two sequences with different actions possible. If the sequence is training data (i.e. Mocap),thenmethodsinsec.7.3.2canbeused. 126 Otherwise, instead of performing the variable-length path estimation, we can directly esti- matethed Geo (·)byusingthefixed-length(i.e.,1or2)traversingpath,withoutre-pefromingTen- sorVotingateachstep. Fortwoneighborhoodpointsx s ,x s+1 ∈R Dx×1 ,TensorVotingprovides the tangent space estimation results on these two points asJ ∗ (x s )∈ R Dx×1 andJ ∗ (x s+1 )∈ R Dx×1 3 . Thegeodesicdistanced Geo (x s ,x s+1 ;M px )canbeapproximatedbychoosinganop- timaltravelingtransactionpointx ∗ s =x s +βJ ∗ (x s )tominimize∥x s +βJ ∗ (x s )−x s ∥ 2 2 +∥ x s +βJ ∗ (x s )−x s+1 ∥ 2 2 . Thisisasecondorderapproximationoftheminimumtraversingpath ineq.7.1 7.2. The variable lengthK piecewisedirectedpathx 0 →x 1 ...→x K (x 0 =x s and x K =x s+1 ) is approximated as a fixed length 2 directed pathx s →x ∗ s →x s+1 , as visualized inFig.7.4. Theoptimalβ ∗ hasaclosed-formsolutionas 1 2 J ∗ (x s ) T (x s+1 −x s ). 
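To make this step concrete, the sketch below implements the fixed-length-2 approximation above and the resulting completion variable of eq. 7.3. The tangent direction is estimated here with a local PCA purely to keep the example self-contained; in the actual method it is provided by Tensor Voting.

```python
import numpy as np

def tangent_direction(X, i, k=8):
    """Unit tangent at X[i] from the top principal direction of its k nearest
    neighbors. This local-PCA estimate is a simple stand-in for Tensor Voting."""
    d = np.linalg.norm(X - X[i], axis=1)
    nbrs = X[np.argsort(d)[:k]]
    _, _, Vt = np.linalg.svd(nbrs - nbrs.mean(axis=0), full_matrices=False)
    return Vt[0]                                   # principal direction, unit norm

def completion_variable(X):
    """X: (L, D) ordered samples on a spatio-temporal manifold.
    Returns zeta in [0, 1] for every frame (eq. 7.3), using the length-2
    traversing path x_s -> x_s* -> x_{s+1} to approximate geodesic distances."""
    L = X.shape[0]
    d_geo = np.zeros(L - 1)
    for s in range(L - 1):
        J = tangent_direction(X, s)
        beta = 0.5 * J @ (X[s + 1] - X[s])         # closed-form transition point
        x_star = X[s] + beta * J
        d_geo[s] = np.linalg.norm(x_star - X[s]) + np.linalg.norm(X[s + 1] - x_star)
    zeta = np.concatenate([[0.0], np.cumsum(d_geo)]) / d_geo.sum()
    return zeta
```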
Afterlearning d Geo (·),combiningwitheq.7.3,wecanobtaintheestimatedresultsforζ x 1:Lx andζ y 1:Ly ,denoted as f ζ x ∈R 1×Lx and f ζ y ∈R 1×Ly . Combinedwitheq.7.5,itresultsinaclosed-formestimationford Geo (x s ,x s+1 ). d Geo (x s ,x s+1 )≈∥x s −x ∗ s ∥ 2 +∥x ∗ s −x s+1 ∥ 2 (7.5) Theclosed-formsolutionisextremelyfastandproducereliableresultsinourexperiments. Step 2. ReplaceF x (·) andF y (·) withζ x andζ y in eq. 7.4,L DMTW reduces to the following formulation, L DMTW (W x ,W y ) =∥ f ζ x W T x − f ζ y W T y ∥ 2 F (7.6) 3 Forself-disjointspointsonthepath,withintrinsicdimensionality1. 127 This is equivalent to performing DTW in the transform domain, i.e.,ζ x andζ y . The temporal aligningmatrixA ={a tx,ty }isdefinedasa tx,ty = ( f ζ x tx − f ζ y ty ) 2 ,whichisacompactrepresenta- tionof f ζ x and f ζ y . Optimizingeq.7.6resultsinvariablelengthpath(varyfrommax(L x ,L y )to L x +L y −1), which is not proper for similarity metric. Thus, referenced DTW is proposed to fixthepathlengthbysettingonewarpingmatrixtobeidentity, ∥ f ζ x I Lx − f ζ y W T y ∥ 2 F (7.7) where I Lx is an identity matrix. X 1:Lx is chosen as the reference sequence, and Y 1:Lx is alignedtoX 1:Lx bythewarpingmatrixW y ∈R Lx×Ly . ThepathQineq.7.7hasfixedlength L x . Since f ζ x and f ζ y are monotonically increasing sequences, dynamic programming provides anextremelyefficientsolution(O(L x +L y ))asfollows, q(2,t+1)−q(2,t) = argmin δt≥0 ∥ e ζ x q(1,t) − e ζ y q(2,t)+δt ∥ (7.8) whereQ∈R 2×Lx always hasq(1,t) = t, and satisfies boundary conditions, thatq 1 = [1 1] T and q Lx = [L x L y ] T . Q can be by recursively optimized by eq. 7.8 from the starting point [11] T . Akeyfeatureoftheframeworkisanytwomotionsequenceswithdifferentfeaturerepresen- tations can be aligned. For instance, one sequence has the 3-D joint positions representations, andtheothersequencehasthepartial2-Djointpositionsrepresentationswiththeconsideration of occlusion in 2-D view. Furthermore, this alignment algorithm can be applied to multimodal dataset, i.e., one is a feature sequence from the video stream and the other is a feature sequence 128 from the speech signal, without explicit consideration of mapping function between video and audiodomain. Preliminary Results. The proposed DMTW algorithm (eq. 7.4) is compared with other state-of-the-art algorithms. In particular, Dynamic Time Warping (DTW) [99, 96] is chosen as the baseline algorithm and Canonical Time Warping (CTW) [160] is chosen as the alternative method. Tomakethecomparisonclearer,thesequencesmayincludemorethanoneactionunits. Fig. 7.6 shows the visual comparison for two motion sequences, one is boxing (twice) and the other is side jumping (twice). DTW does not consider the spatial transformation, making it difficult to align two motion sequences by two people. CTW significantly outperforms DTW. Our DMTW gets the best results among three methods. For future work, we plan to include more motion sequences and manually label the alignment ground-truth, to provide quantitative performancecomparisonwithotherstate-of-the-artalgorithms. 7.4.2 TemporallyLocalSpatialAlignment After temporal alignment, spatial alignment is performed to leverage the subjects’ variability, i.e., body-skeleton scales variations between different people, or 2D viewpoint variations. In particular, we propose Dynamic Manifold Spatial Warping (DMSW), which has the following framework, D DMSW (X t 1 :t 2 , e Y t 1 :t 2 ) =∥V x (U(X t 1 :t 2 ))−V y (U( e Y t 1 :t 2 ))∥ 2 F (7.9) 129 Figure 7.6: Temporal Alignment Results. 
DMTW is compared with DTW and CTW. The ref- erence sequence is shownin the first row, followed by the aligned results. 2 red arrows indicate 2 key states in the reference sequence, i.e, the peaks of the first and the second boxing. The aligned sequence also has 2 red arrows, indicating the peaks of the first and the second jump. DMTW is able to align the two peak states in the jumping sequence to the peak states in the boxingsequenceverywell. X t 1 :t 2 ∈R Dx×(t 2 −t 1 +1) areconsecutiveframefeaturesx t 1 tox t 2 inthereferencesequence,and e Y t 1 :t 2 ∈R Dy×(t 2 −t 1 +1) are temporally corresponding samples in the aligned sequence e Y 1:Lx . V x (·) is the spatial alignment function (same for V y (·)) andU(·) is the pre-defined feature extraction function. Spatial alignment is restricted to temporally local (fromt 1 tot 2 ) segments, since global matching on entire sequences is often not accurate due to non-linear variations. HowtosetV(·)isexplainedinthefollowingpartandU(·)willbediscussedinsec.7.4.3. 130 Denoting the extracted features byU(·) as two zero mean feature sets, U x ∈ R d 1 ×n and U y ∈R d 2 ×n ,weconsideranunsupervisedlearningapproach,i.e.,CanonicalCorrelationAnal- ysis(CCA),inwhichapairoflinearalignmentmatricesisoptimizedinthesenseofmaximizing thecorrelationE(·)intransformedfeaturesasfollows, E(V x ,V y ) =Tr(V T x U x (V T y U y ) T ) s.t.,V T x U x U T x V x =V T y U y U T y V y =I d (7.10) whereV x ∈ R d 1 ×d andV y ∈ R d 2 ×d are two linear spatial alignment matrices forU x and U y , andI d is the identity matrix of size d×d. Tr(·) is the trace operator. Minimizing this objectivefunctionisequivalenttosolvingageneralizedeigenvalueproblem[4]. Themetriccan beinducedinthetransformdomainas, D DMSW (X t 1 :t 2 , e Y t 1 :t 2 ) =∥V ∗T x U x −V ∗T y U y ∥ 2 F (7.11) V ∗ x ∈R d 1 ×d andV ∗ y ∈R d 2 ×d arethesolutionsofeq.7.10. Eq.7.11canhandletwofeaturesets withdifferentdimensionalities,makingthealignmentbetween2Dand3Dinputpossible. 7.4.3 MotionDistanceScore BasedontheproposedDMTWfortemporalalignmentandDMSWforspatialalignment,wefur- therproposetwotypesofmotiondistancefunctionsbychoosingtwofeatureextractionfunctions U(.). 131 In particular, instead of treating x t ∈ R Dx×1 (or y t ) as a multi-dimensional vector, the implicit structure in the joint-position space is considered. In sec. 7.3,x t = [p t 1 ,...,p t M ] T ∈ R 3M×1 , the 3D Euclidean space is implicitly embedded in the joint-position spaceR 3M . Thus, wereformulatex t as, x t = p 11 ... p M1 p 12 ... p M2 p 13 ... p M3 ∈R 3×M (7.12) which turns to be M samples inR 3 (similar operation forx t ∈ R 2K to x t ∈ R 2×K ). This operationisdefinedasT 3D :R 3M →R 3×M ,orT 2D :R 2K →R 2×K . The first feature extraction function is chosen asU 1 (x t ) =T(x t ), which is the static pose feature (joint-position in the matrix formulation). The second one isU 2 (x t ,x t+1 ) =T(x t )− T(x t+1 ),whichisthemotionposefeaturebetweentwoconsecutiveframes. Thus,thefinalsimilarityscoreS 1 (X 1:Lx ,Y 1:Ly )givenbythestaticfeaturesisasfollows, S 1 = t=Lx ∑ t=1 D DMSW (T(x t ),T(e y t )) (7.13) where e y t is the temporally corresponding frame estimated by eq. 7.4. 
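The building block D_DMSW used in S_1 above (and in S_2 below) is the CCA-based metric of eqs. 7.10-7.11. A standard way to compute it is to whiten both zero-mean feature sets and take an SVD of their cross-covariance; the sketch below follows that route, with a small regularizer eps added only for numerical stability (an assumption, not part of eq. 7.10).

```python
import numpy as np

def dmsw_metric(Ux, Uy, d, eps=1e-6):
    """CCA-based spatial alignment metric (eqs. 7.10-7.11).
    Ux: (d1, n), Uy: (d2, n) zero-mean feature sets; d: shared dimensionality."""
    def inv_sqrt(C):
        w, V = np.linalg.eigh(C + eps * np.eye(C.shape[0]))
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Cxx, Cyy, Cxy = Ux @ Ux.T, Uy @ Uy.T, Ux @ Uy.T
    white_x, white_y = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(white_x @ Cxy @ white_y)   # canonical directions
    Vx = white_x @ U[:, :d]                             # d1 x d alignment matrix
    Vy = white_y @ Vt[:d].T                             # d2 x d alignment matrix
    return np.linalg.norm(Vx.T @ Ux - Vy.T @ Uy, 'fro') ** 2   # eq. 7.11
```

Because the two feature sets may have different dimensionalities d1 and d2, the same routine handles 3D/3D as well as 3D/2D alignment.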
The similarity score S_2(X_{1:L_x}, Y_{1:L_y}) given by the motion features is as follows,

    S_2 = \sum_{t=1}^{L_x} D_{\mathrm{DMSW}}\big(T(x_t) - T(x_{t+1}),\; T(\tilde{y}_t) - T(\tilde{y}_{t+1})\big) \qquad (7.14)

These two scores can be linearly combined,

    S_{\mathrm{DMW}}(X_{1:L_x}, Y_{1:L_y}) = \lambda\, S_1(X_{1:L_x}, Y_{1:L_y}) + (1-\lambda)\, S_2(X_{1:L_x}, Y_{1:L_y}) \qquad (7.15)

where λ ∈ [0, 1] can either be optimized by cross-validation in the supervised setting (i.e., recognition), or chosen manually in the unsupervised setting (i.e., clustering). Eq. 7.15 summarizes eqs. 7.3, 7.4 and 7.10 together with the two feature extraction functions. The similarity metric is not symmetric, so we set the testing sequence to be the reference sequence.

7.4.4 Results of Matching

We collected 3978 frames from CMU Mocap, capturing fifteen people performing 10 natural actions (details in Fig. 7.7). The motion distance scores between any two sequences are calculated, resulting in a 10×10 average motion distance matrix S for these 10 actions (Fig. 7.7). The diagonal area clearly has the smallest distance values, which shows the effectiveness of our similarity function (eq. 7.15).

For action recognition, we use a leave-one-out procedure for each sequence, i.e., each sequence is treated as unlabeled and associated with all the other sequences. Since each person performs a specific action only once, the recognition process cannot benefit from the fact that the same person repeating the same action yields a very large similarity. λ in eq. 7.15 is set to 0.5, and the results (Table 7.1) show that our approach misclassifies only 5% of the sequences, or 1.2% when weighted by the number of frames. To demonstrate the benefit of combining both static (eq. 7.13) and motion features (eq. 7.14), recognition rates for the individual feature types are also reported in Table 7.1: the rate drops to 85% with static features only and to 60% with motion features only.

Figure 7.7: Action recognition results on Mocap. Top, confusion matrix for recognition in 3D; middle, confusion matrix for recognition in 2D (darker indicates larger similarity and brighter indicates smaller similarity); bottom, one Mocap example for each action.

Furthermore, to demonstrate the ability to recognize actions from an arbitrary 2D view, the Mocap sequences are projected to joint 2D trajectories in 30D space using a synthetic camera (without occlusion, K = M = 15). We achieve 90% accuracy in this 2D view recognition. [68] also reports recognition rates on Mocap data, but a direct comparison is difficult, because [68] collects a large number of sequences and uses AdaBoost for training, whereas our approach only requires a few sequences for each action. Moreover, 2D view recognition is not considered in [68].

Other applications, such as unsupervised action clustering, can also be naturally derived from the proposed motion similarity metric. For future work, we plan to include more sequences from CMU Mocap or other motion capture datasets. The number of action classes will also increase, covering almost all important primitive human actions. In terms of 2D action recognition, we intend to project the 3D motion data to several 2D views (from top to bottom, left to right, front to back), and systematically measure the recognition performance in all cases.

Methods/Data         Mocap (S)   Mocap (F)
3D (Combine)         95%         99%
3D (Static)          85%         83%
3D (Motion)          60%         67%
2D (Combine)         90%         87%
Lv & Nevatia [68]    NA          92%

Table 7.1: Recognition performance rates. (S) means the rate is measured by the number of sequences and (F) means the rate is measured by the number of frames. All recognition is done at the sequence level.
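Given the distance of eq. 7.15, the leave-one-out recognition reported in Table 7.1 reduces to a nearest-sequence rule. A minimal sketch, where dmw_score is a hypothetical function implementing eq. 7.15 with the test sequence as the reference:

```python
import numpy as np

def leave_one_out_recognition(sequences, labels, dmw_score, lam=0.5):
    """sequences: list of D x L arrays; labels: list of action labels.
    Each sequence is treated as unlabeled and matched against all the
    others with the (asymmetric) DMW distance of eq. 7.15."""
    correct = 0
    for i, (X, y) in enumerate(zip(sequences, labels)):
        scores = [dmw_score(X, sequences[j], lam) if j != i else np.inf
                  for j in range(len(sequences))]
        correct += (labels[int(np.argmin(scores))] == y)
    return correct / len(sequences)
```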
7.5 Action Recognition from Videos

Besides action recognition from Mocap, the approach proposed in sections 7.3 and 7.4 can be used to recognize actions from videos, without any extra training process. A tracker is needed to extract trajectories from the videos, and the alignment process needs to be adapted to the 2D input from videos.

Figure 7.8: Noisy tracking results. Left, image sequences; right, trajectories of the right foot provided by the tracker and by manual labeling. Our approach can recognize actions from this noisy and occluded input.

7.5.1 Pre-processing

To apply our approach to videos, we have a pre-processing step that extracts joint 2D trajectories from the image observations. The problem itself is challenging and tightly connected to human pose estimation, an important subarea of computer vision. Currently, we use the Incremental learning Visual Tracker (IVT) [100], in which an online-updated appearance model captures the objects' dynamic variation; any other tracking algorithm could be used instead. For the HumanEva videos, often K = 7 or 8 key points can be estimated from the side view by the IVT, resulting in joint 2D trajectories in 14D or 16D space. Although the tracking results are often noisy (Fig. 7.8), we can still recognize actions from these tracked trajectories, even with occlusion.

7.5.2 Alignment of X_mocap and Y_video

Given K tracks in 2D, we have a 2KD spatio-temporal manifold (STM) Y_{1:L_y} ∈ ℜ^{2K×L_y}, where the t-th column y_t = [u_1, ..., u_K]^T ∈ ℜ^{2K} is the joint-coordinates vector of the K tracked points u_j (1 ≤ j ≤ K) at frame t. Assume the underlying 3D joint-trajectories of the person in the video are Z_{1:L_y} ∈ ℜ^{3M×L_y}, with z_t = [p_1, ..., p_M]^T and p_j ∈ ℜ^3 (1 ≤ t ≤ L_y, 1 ≤ j ≤ M); it can then be shown that the STM in 2KD space is the projection of the STM in 3MD space. In particular, under the linear projection model P ∈ ℜ^{2×3} (from 3D position p_j to 2D image coordinate u_j), we have

    \begin{bmatrix} u_1 \\ \vdots \\ u_K \end{bmatrix}
    =
    \begin{bmatrix} P &        & 0 \\   & \ddots &   \\ 0 &        & P \end{bmatrix}_{2K\times 3K}
    \begin{bmatrix} W_{11} & \cdots & W_{1M} \\ \vdots & W_{ij} & \vdots \\ W_{K1} & \cdots & W_{KM} \end{bmatrix}_{3K\times 3M}
    \begin{bmatrix} p_1 \\ \vdots \\ p_M \end{bmatrix}
    \qquad (7.16)

where W_{ij} is a binary selection matrix, equal to I_{3×3} if the j-th key point is available in the tracking results and equal to 0_{3×3} otherwise. This projection relationship can be compactly represented in matrix notation as

    y_{(2K\times 1)} = \tilde{P}\,\tilde{W}\, z_{(3M\times 1)} \qquad (7.17)

i.e., the 2KD manifold is simply a linear projection of the 3MD manifold. \tilde{P} ∈ ℜ^{2K×3K} is the compact projection matrix and \tilde{W} ∈ ℜ^{3K×3M} is the compact selection matrix. For perspective projection the derivation is similar, and the manifold should then be represented in homogeneous coordinates.

The alignment process between Y_{1:L_y} (video) and X_{1:L_x} ∈ ℜ^{3M×L_x} (Mocap) is performed by the following procedure:

(1) Structure Learning. Use the algorithms in section 7.3.2 and section 7.4.1.

Figure 7.9: Temporal alignment of X_mocap and Y_video. Left, action "stretching" (368 frames, Mocap); middle, action "jogging" (47 frames, HumanEva); right, a 368×47 aligning matrix (DMTW). The non-linearity of the aligning path is visualized by the dark blue region (blue indicates small error, and red indicates large error).

(2) Temporal Alignment. Use the proposed DMTW algorithm (eq. 7.4) to get the temporal correspondence \tilde{X}_{1:L_y} ∈ ℜ^{3M×L_y}, as shown in Fig. 7.9.

(3) Spatial Alignment. Select K markers from \tilde{X}_{1:L_y}, resulting in \tilde{X}^K_{1:L_y} ∈ ℜ^{3K×L_y}. Only features from \tilde{X}^K_{1:L_y} are selected to match Y_{1:L_y} using DMSW (eq. 7.11), since the information of the remaining M − K markers is missing from the video tracking results, as shown in Fig. 7.10.

(4) Motion Distance. Use eq. 7.15. Note that the operation T_3D is applied to \tilde{X}^K_{1:L_y} and T_2D is applied to Y_{1:L_y}.
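The linear relationship of eqs. 7.16-7.17 is easy to verify numerically: under an orthographic camera P and a visibility mask over the M markers, the 2K-dimensional observation is obtained from the 3M-dimensional pose by two sparse matrices. A small sketch, in which the particular camera and the dropped marker are arbitrary examples:

```python
import numpy as np

def project_pose(z, P, visible):
    """z: (3M,) stacked 3D marker positions; P: (2, 3) linear camera;
    visible: length-M boolean mask selecting the K tracked markers.
    Returns y = P_tilde @ W_tilde @ z of size (2K,), as in eq. 7.17."""
    M = z.size // 3
    vis_idx = np.flatnonzero(visible)
    K = vis_idx.size
    # Compact selection matrix W_tilde (3K x 3M): picks the visible markers.
    W_t = np.zeros((3 * K, 3 * M))
    for k, j in enumerate(vis_idx):
        W_t[3 * k:3 * k + 3, 3 * j:3 * j + 3] = np.eye(3)
    # Compact projection matrix P_tilde (2K x 3K): block-diagonal copies of P.
    P_t = np.kron(np.eye(K), P)
    return P_t @ W_t @ z

# Example: drop one of M = 15 markers and project with a side-view camera.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])          # orthographic x-z view (an assumption)
z = np.random.randn(45)
y = project_pose(z, P, visible=np.arange(15) != 7)   # y has length 2*14 = 28
```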
Basedontheproposedalignmentalgorithm,recognizingactionsfromvideoisderivednatu- rally. Assume there areN labeled motion sequences{X i mocap } N i=1 associated with action label 138 Figure7.10: Spatialalignmentforoccludedjoint-position. Left,3Djoint-positionsfromMocap; right,occludedjoint-positionsinthetemporalcorrespondingframefromHumanEva2. I i ∈I, whereI = 1,2,...,C indicatesC action classes. Given a joint trajectoriesY video from avideoclipbythetracker,theestimatedactionlabelI y isgivenby I y = arg min i∈{1,2..,N} S DMW (Y video ,{X i mocap ,I i }) (7.18) 7.5.3 Results To validate our approach on video based action recognition, we chose a number of video se- quences from Brown HumanEva dataset (1 and 2), which is a benchmark proposed for human motion analysis [112] (Fig. 7.11). The reason that standard action benchmark datasets, such as KTH [105] or Weizmann [7] are not used, is that the low resolution of videos makes key points trackingresultsunreliable. ForIVT[100],wemanuallylabelthekeypointsinthefirstframeand do not tune any parameter in the tracking process. Tracked trajectories are associated with the labeled Mocap sequences from 10 action categories in sec. 7.4.4 (Fig. 7.11). We can correctly estimatetheunknownactionlabelfromthesenoisyandoccluded2Dinput. 139 Figure7.11: Examplesofactionrecognitionresultsonvideos. Toptobottom,walking,jogging andboxing(HumanEva). Lefttoright,imagesequences,2KDjoint-trajectories,temporalalign- ing matrix with the most similar Mocap sequence, and the distance to 10 most similar Mocap sequences. Greenarrowsinimagesindicateexamplesofnoisytrackingresults. [89] reports the highest action recognition performance on HumanEva-1, and a latent pose estimator is proposed to improve the recognition process. In contrast to [89], our STM frame- work focuses on learning the motion completion degree and implicitly model the latent human pose. Departing from [89], our approach does not need any training process in HumanEva, all sequences selected from HumanEva-1 or 2 are treated as unlabeled data, and the recognition is automaticallydonewhentrackingresultsaregiven. In future work, we will select more sequences from the HumanEva dataset across differ- ent views, quantitatively measure the recognition rate, and compare with other state-of-the-art methodsonHumanEva. Otherinterestingactionrecognitiondatasetssuitableforview-invariant 140 action recognition, such as INRIA IXMAS [142] or UCF [64], will also be considered for vali- datingourapproach. Furthermore,naturalhumanmotionsequencesarecontinuousandpossibly include multiple action units, thus temporal segmentation can be viewed as a pre-processing step for video action recognition. We propose a non-parametric change-point detection algo- rithm(Chapter8)toautomaticallysegmentinputvideosequencesintopre-defined action units, andquantitativelymeasuretheperformanceaswellastheimpacttoactionrecognition. 7.6 Conclusion Inthischapter,weproposedaspatio-temporalmodelSTMtoanalyzesequentialdatawithlatent spatial structure. Furthermore, a robust and efficient alignment algorithm DMW is designed to calculate the similarity between two multivariate time series. Based on STM and DMW, we achieved view-invariant action recognition on videos by associating a few Mocap examples. In thefuture, wewillevaluateourapproachonmoredatasets, andapplyitto 3Dmotionrecovery andtemporalmotionsegmentation. 141 Chapter8 OnlineTemporalSegmentation We address the problem of unsupervised online segmenting human motion sequences into dif- ferent actions. 
Kernelized Temporal Cut (KTC), is proposed to sequentially cut the structured sequentialdataintodifferentregimes[30]. KTCextendspreviousworksononlinechange-point detectionbyincorporatingHilbertspaceembeddingofdistributionstohandlethenonparametric and high dimensionality issues. Based on KTC, a realtime online algorithm and a hierarchical extensionareproposedfordetectingbothactiontransitionsandcyclicmotionsatthesametime. Weevaluateandcomparetheapproachtostate-of-the-artmethodsonmotioncapturedata,depth sensor data and videos. Experimental results demonstrate the effectiveness of our approach, which yields realtime segmentation, and produces higher action segmentation accuracy. Fur- thermore,bycombiningwithsequencematchingalgorithms,wecanonlinerecognizeactionsof anarbitrarypersonfromanarbitraryviewpoint,givenrealtimedepthsensorinput. 142 8.1 Introduction Temporal segmentation of human motion sequences (motion capture data, 2.5D depth sensor or 2D videos), i.e., temporally cut sequences into segments with different semantic meanings, is an important step for building an intelligent framework to analyze human motion. Temporal segmentation can be applied to motion animations [5], action recognition [68, 57, 143], video understanding and activity analysis [157, 163]. In particular, temporal segmentation is crucial forhumanactionrecognition. Mostrecentworksinhumanactivityrecognitionfocusonsimple primitiveactionssuchaswalking,runningandjumping,incontrasttothefactthatdailyactivity involves complex temporal patterns (walking then sit-down and stand-up). Thus, recognizing suchcomplexactivitiesreliesonaccuratetemporalstructuredecomposition[87]. Previous work on temporal segmentation can be mainly divided into two categories. On one side, many works in statistics, i.e., either offline or online (quickest) change-point detec- tions [10], are often restricted to univariate series (1D) and the distribution is assumed to be known in advance [1]. Because of the complex structure of motion dynamics, these works are not suitable for temporal segmentation of human actions. On the other side, temporal cluster- ing has been proposed for unsupervised learning of human motions [55]. However, temporal clusteringisusuallyperformedoffline,thusnotsuitableforapplicationssuchasrealtimeaction segmentationandrecognition. Motivatedbythesedifficulties,weproposeanonlinetemporalsegmentationmethodwithno parametric distribution assumptions, which can be directly applied to detect action transitions. The proposed method, Kernelized Temporal Cut (KTC), is a temporal application of Hilbert 143 Figure 8.1: OnlineHierarchialTemporalSegmentation. A 22 secs input sequence is tempo- rallycutintotwosegments;awalkingsegment(S1)whichisfurthercutinto6actionunits,and ajumpingsegment(S2)whichisfurthercutinto4actionunits. Figure 8.2: System Flowchart. Input can be either Mocap, 2.5D depth sensor or 2D videos. Outputaretemporalcutsforbothactiontransitionsandcyclicactionunits. space embedding of distributions [43, 117] and kernelized two-sample test [35, 34] of online segmentation on structured sequential data. KTC can simultaneously detect action transitions and cyclic motion structures (action units) within a regime as shown in Fig. 8.1. Furthermore, a realtime implementation of KTC is proposed, incorporating an incremental sliding window strategy. Withinaslidingwindow,segmentationisperformedbythetwo-samplehypothesistest 144 basedontheproposedspatio-temporalkernel. Theinputandoutputofoursystemareshownin Fig.8.2. 
Insummary,ourapproachpresentsseveraladvantages: − Online: KTC can sequentially process the input and capture action transitions, which is extremelyhelpfulforrealtimeapplicationssuchascontinuousactionrecognition. − Hierarchical: KTC can simultaneously capture transition points between different actions, andcyclicactionunitswithinanactionregime,e.g.,awalkingsegmentcontainsseveralwalking cycles. This is important for realtime activity recognition, i.e., action can be recognized after onlyoneactionunit. − Nonparametric: nonparametric and high dimensionality issues are handled by the Hilbert spaceembeddingbasedonthespatio-temporalkernel,whichcanbedirectlyappliedtocomplex sequentialdatasuchashumanmotionsequences. − Online Action Recognition and Transfer Learning: KTC can be applied to a variety of inputsuchasmotioncapturedata,depthsensordataand2Dvideosfromanunknownviewpoint. Moreimportantly,KTCcanbeappliedtoonlineactionrecognitionfromOpenNIandKinectby combiningmethodssuchas[160]. Therecognitionisperformedwithoutanytrainingdatafrom depthsensor,whichistrulytransferlearning. 8.2 RelatedWork Temporalsegmentationisamultifacedareaandseveralrelatedtopicsinmachinelearning,statis- tics,computervisionandgraphicsarediscussedinthissection. 145 Change-Point Detection. Most of the work in statistics, i.e., offline or quickest (online) change-point detections (CD) [10], is often restricted to univariate series (1D) and parametric distributionassumption,whichdoesnotholdforhumanmotionswithcomplexstructure. [148] uses the undirected sparse Gaussian graphical models and performs jointly structure estimation and segmentation. Recently, as a nonparametric extension of Bayesian online change-point de- tection(BOCD) [1], [103] isproposed to combine BOCD and Gaussian Process (GPs) to relax the i.i.d assumption in a regime. Although GPs improve the ability to model complex data, it also brings in high computational cost. More relevant to us, kernel methods have been applied tonon-parametricchange-pointdetectiononmultivariatetimeseries[16,37]. Inparticular,[16] (KCD)utilizestheone-classSVMasonlinetrainingmethodand[37](KCpA)performssequen- tially segmentation based on the Kernel Fisher Discriminant Ratio. Unlike all the above works, KTCcannotonlydetectactiontransitionsbutalsocyclicmotions. TemporalClustering. Clusteringisalongstandingtopicinmachinelearning[86,135]. Re- cently,asanextensionofclustering, someworksfocusonhowtocorrectlytemporallysegment time series into different clusters. As a elegant combination of Kernel K-means and spectral clustering, Aligned Cluster Analysis (ACA) is developed for temporal clustering of facial be- haviorwithamulti-subjectcorrespondencealgorithmformatchingfacialexpressions[163]. To estimatetheunknownnumberofclusters,[23]usethehierarchicalDirichletprocessasapriorto improve the switch linear dynamical system (SLDS). Most of these works offline segment time seriesandprovideclusterlabelsasinclustering. Asacomplementaryapproach,KTCperforms onlinetemporalsegmentationwhichissuitableforrealtimeapplications. 146 Motion Analysis. In computer vision and graphics, some works focus on grouping human motions. Unusual human activity detection is addressed in [157] using the (bipartite) graph spectralclustering. [151]extractsspatio-temporalfeaturesto addresseventclustering onvideo sequences. [55] proposes a geometric-invariant temporal clustering algorithm to cluster facial expressions. Morerelevantly,[5]proposesanonlinealgorithmtodecomposemotionsequences into distinct action segments. 
Their method is an elegant temporal extension of Probabilistic Principal Component Analysis for change-point detection (PPCA-CD), which is computation- allyefficientbutrestrictedto(approximate)Gaussianassumptions. ActionRecognition. Althoughsignificantprogresshasbeenmadeinhumanactivityrecog- nition[57,143,87],theproblemremainsinherentlychallengingduetoviewpointchange,partial occlusion and spatio-temporal variations. By combining KTC and alignment approaches such as [28, 160], we can perform online action recognition for input from 2.5D depth sensor. Un- likeotherworkson supervised jointsegmentationandrecognition[42], twosignificantfeatures of our approach are, viewpoint independence and handling arbitrary person with a few labeled Mocapsequences,inthetransferlearningmodule. 8.3 OnlineTemporalSegmentationofHumanMotion This section describes the Kernelized Temporal Cut (KTC), a temporal application of Hilbert space embedding of distributions [43] and kernelized two-sample test [34, 35], to sequentially estimatetemporalcutpointsinhumanmotionsequences. 147 8.3.1 ProblemFormulation Given a stream inputX 1:Lx ={x t } t=Lx t=1 (x t ∈ℜ Dt , where D t can be fixed or change over time), the goal of temporal segmentation is to predict temporal cut points c i . For instance, if a person walks and then boxes, a temporal cut point must be detected. For depth sensor data,x t is the vector representation of tracked joints. More details ofx t are given in sec. 8.5. From a machine learning perspective, the estimated{c i } Nc i=1 can be modeled by minimizing the followingobjectivefunction, L X ({c i } Nc i=1 ,N c ) = Nc ∑ i=1 I(X c i−1 :c i −1 ,X c i :c i+1 −1 ) (8.1) where X c i :c i+1 −1 ∈ℜ D×(c i+1 −c i ) indicates the segment between two cut points c i and c i+1 (c 1 = 1, c Nc+1 = L x + 1). Here I(·) is the homogeneous function to measure the spatio- temporal consistency between two consecutive segments. It is worth noting that, both{c i } Nc i=1 and N c need to be estimated from eq. 8.1. Next, the main task is to design I(·) and to online optimizeeq.8.1. Asthecounterpart,eq.8.1couldbeofflineoptimizedbydynamicprogramming whenN c isgiven,whichisoutofthescopeofthispaper. 8.3.2 KTC-S Insteadofjointlyoptimizingeq.8.1,theproposedKernelizedTemporalCut(KTC)sequentially optimizesc i+1 basedonc i byminimizingthefollowinglossfunction, L {X c i :c i +T−1 } (c i+1 ) =I(X c i :c i+1 −1 ,X c i+1 :c i +T−1 ), i = 1,2,...,N c −1 (8.2) 148 wherec i (c 1 = 1, c Nc+1 =L x +1)isprovidedbythepreviousstepandT isafixedlength. We refertothissequentialoptimizationprocessforeq.8.2asKTC-S,whereSstandsforsequential. Sequentially optimizingL is actually a fixed-length sliding window process which is also used in [37]. However, setting T is a difficult task and how to improve this process is described in sec. 8.3.3. Essentially, Eq. 8.2 is a two-class temporal clustering problem forX c i :c i +T−1 ∈ ℜ D×T . ThecrucialfactorisconstructingI(·),whichisrelatedtotemporalversionof(dis)similar functionsinspectralclustering[86,135,55]andinformationtheoreticalclustering[19]. To handle the complex structure of human motion, unlike previous work, KTC utilizes Hilbertspaceembeddingofdistributions(HED)tomapthedistributionofX t 1 :t 2 intotheRepro- ducingKernelHilbertSpace(RKHS).[43,117]areseminalworksoncombiningkernelmethods andprobabilitydistributionanalysis. Withoutgoingintodetails,theideaofusingHEDfortem- poral segmentation is straightforward. 
The change-point is detected by using a well behaved (smooth) kernel function, whose values are large on the samples belonging to the same spatio- temporalpatternandsmallonthesamplesfromdifferentpatterns. Bydoingthis,KTCdoesnot onlyhandlenonparametricandhighdimensionalityproblemsbutalsorestsonasolidtheoretical foundation[43]. HED.Inspiredby[117],probabilisticdistributionscanbeembeddedinRKHS.Atthecenterof theHilbertspaceembeddingofdistributionsarethemeanmappingfunctions, µ(P x ) =E x (k(x,·)), µ(X) = 1 T T ∑ t=1 k(x t ,·) (8.3) 149 wherex t=T t=1 are assumed to be i.i.d sampled from the distributionP x . Under mild conditions, µ(P x )(sameforµ(X))isanelementoftheHilbertspaceasfollows, <µ(P x ),f >=E x (f(x)), <µ(X),f >= 1 T T ∑ t=1 f(x t ) Mappingsµ(P x )andµ(X)areattractivebecause, Theorem1. ifthekernelkisuniversal,thenthemeanmapµ:P x →µ(P x )isinjective.[117] This theorem states that distributions ofx∈ℜ D have a one-to-one correspondence with mappings µ(P x ). Thus, for two distributions P x and P y , we can use the function norm ||µ(P x )−µ(P y )|| to quantitatively measure the difference (denoted asD(P x ,P y )) between these two distributions. Moreover, we do not need to access the actual distributions but rather finitesamplestocalculateD(P x ,P y )because: Theorem 2. Assume that||f|| inf ≤ C for allf∈H with∥|f|| H ≤ 1, then with probability at least1−θ,||µ(P x )−µ(X)||≤ 2R T (H,P x )+C √ −T −1 log(θ)).[117] AslongastheRademacheraverageiswellbehaved,finitesamplesyielderrorthatconverges to zero, thus they empirically approximate µ(P x ). Therefore, D(P x ,P y ) can be precisely approximatedbyusingfinitesampleestimation||µ(X)−µ(Y)||. Thanks to the above facts, we use HED to construct I KTC (X 1:T 1 ,Y 1:T 2 ) to measure the consistencybetweendistributionsoftwosegmentsasfollows, 2 T 1 T 2 ∑ i,j k(x i ,y j )− 1 T 2 1 ∑ i,j k(x i ,x j )− 1 T 2 2 ∑ i,j k(y i ,y j ) (8.4) 150 Combiningeq.8.2andeq.8.4,c i+1 isestimatedbyminimizingthefollowingfunctioninmatrix formulationas: L {X c i :c i +T−1 } (c ′ i ) =−(E c ′ i T ) H K KTC c i :c i +T−1 E c ′ i T , E c ′ i T = 1 c ′ i e 1:c ′ i T − 1 d i e c ′ i :T T (8.5) wherec ′ i andd i are short notations forc i+1 −c i andc i +T−c i+1 . e t 1 :t 2 T ∈ℜ T×1 is a binary vector with 1 for positions from t 1 to t 2 and 0 for others. K KTC c i :c i +T−1 ∈ℜ T×T is the kernel matrixbasedonthekernelfunctionk KTC (·). Kernel. Thesuccessofkernelmethodslargelydependsonthechoiceofthekernelfunction[43]. Asmentionedbefore,thedifficultyofhumanmotion,isthatbothspatialandtemporalstructures areimportant. Thus,weproposeanovelspatio-temporalkernelk KTC (·)asfollows, k KTC (x i ,x j ) =k S (x i ,x j )k T (x i ,x j ) =k S (x i ,x j )k T ( e ∆(x i ), e ∆(x j )) (8.6) where k S (·) is the spatial kernel and k T (·) is the temporal kernel. e ∆(x) is the estimated local tangent space at pointx. k S (·) and k T (·) can be chosen according to the domain knowledge or universal kernels such as Gaussian. For instance, the canonical component analysis (CCA) kernel[4]isusedforjoint-positionfeaturesas, k CCA S (x i ,x j ) = exp(−λ S d CCA (x i ,x j ) 2 ) (8.7) 151 whered CCA (·)istheCCAmetricbasedonM×3matrixrepresentationofx∈ℜ 3M×11 . Orin general,wesetthemas, k S (x i ,x j ) = exp(−λ S ∥x i −x j ∥ 2 ) k T ( e ∆(x i ), e ∆(x j )) = exp(−λ T θ( e ∆(x i ), e ∆(x j )) 2 ) (8.8) whereλ S is the kernel parameter fork S (·) andλ T is the kernel parameter fork S (·). θ(·) is the notationofprincipalanglebetweentwosubspace(rangefrom0to π 2 ). 
In short, the spatio-temporal kernel k_{KTC} captures both the spatial and the temporal distribution of the data (see the visual example in Fig. 8.3), which makes it suitable for modeling structured sequential data. As special cases, k_{KTC} degenerates to a purely spatial kernel if \lambda_T \to 0 and to a purely temporal kernel if \lambda_S \to 0.

Optimization. Unlike the NP-hard optimization in spectral clustering [135], eq. 8.5 can be solved efficiently because the feasible region of c_{i+1} is [c_i + 1, c_i + T - 1], which allows searching the entire space to minimize L(c_{i+1}). For each step, minimizing eq. 8.5 requires at most O(T^2) accesses of k_{KTC}(\cdot).

8.3.3 KTC-R

Sequential optimization of eq. 8.1 is given in Sec. 8.3.2. However, this process may not be suitable for realtime applications. A key feature of human motion is temporal variation, i.e., one action can last for a long time or only a few seconds. It is therefore difficult for a fixed-length-T sliding window to capture transitions: small values of T cause over-segmentation and large values of T cause long delays (T = 300 for the depth sensor results in a 10-second delay). To overcome this problem, we combine an incremental sliding-window strategy [5] and the two-sample test [34, 35] to design a realtime algorithm for eq. 8.5, i.e., KTC-R (Fig. 8.3).

Figure 8.3: An illustration of KTC-R. Left: tracked joints from the depth sensor; right: K^{KTC}_{1:190} \in \Re^{190 \times 190} for the window X_{1:190}. The decision to make no cut between frames 1 and 110 is made before the current window, with a maximum delay of T_0 = 80 frames.

Given X_{1:L_x} = [x_1, ..., x_{L_x}] \in \Re^{D \times L_x}, KTC-R sequentially processes the varying-length window X_t = [x_{n_t}, ..., x_{n_t+T_t}] at step t. The process starts from n_1 = 1 and T_1 = 2T_0, where T_0 is the pre-defined shortest possible action length. At step t (assume the last cut is c_i), if no action transition point is detected, the following update is performed,

    n_{t+1} = n_t,    T_{t+1} = T_t + \delta T    (8.9)

else, if there is a transition point,

    c_{i+1} = n_t + T_t - T_0,    n_{t+1} = c_{i+1},    T_{t+1} = T_1    (8.10)

where \delta T is the step length for growing the window. The process ends when L_x - n_t \le T_0. As shown in eq. 8.9 and eq. 8.10, X_{1:L_x} is processed sequentially, and every cut c_i is estimated as soon as the algorithm receives the (c_i + T_0 - 1)-th frame (the same holds for non-cut frames). This fact indicates that KTC-R has a fixed time delay of T_0, as shown in Fig. 8.3.

At each step, deciding on a cut (at frame n_t + T_t - T_0) is equivalent to the following hypothesis test,

    H_0: \{x_i\}_{n_t}^{n'_t-1} and \{x_i\}_{n'_t}^{n_t+T_t-1} are the same;    H_A: not H_0    (8.11)

where n'_t is short notation for n_t + T_t - T_0. Eq. 8.11 is rewritten by combining eq. 8.5 as

    L_t = -(E^{n'_t-n_t}_{T_t})^H K^{KTC}_{n_t:n_t+T_t-1} E^{n'_t-n_t}_{T_t};    H_0: L_t \ge \delta_t, n'_t is not a cut;    H_A: L_t < \delta_t, n'_t is a cut    (8.12)

where \delta_t is the adaptive threshold for the hypothesis test in eq. 8.12. In fact, eq. 8.12 is directly inspired by [34], which proposes a kernelized two-sample test. L_t is analogous to the negative square of the empirical estimate of the Maximum Mean Discrepancy (MMD), which has the following formulation,

    MMD[F, X_{1:T_1}, Y_{1:T_2}] = \left( \frac{1}{T_1^2} \sum_{i,j=1}^{T_1} k(x_i, x_j) - \frac{2}{T_1 T_2} \sum_{i,j=1}^{T_1,T_2} k(x_i, y_j) + \frac{1}{T_2^2} \sum_{i,j=1}^{T_2} k(y_i, y_j) \right)^{1/2}    (8.13)

where F is the unit ball in a universal RKHS H, and \{x_i\}_{i=1}^{T_1} and \{y_j\}_{j=1}^{T_2} are i.i.d. samples from distributions P_x and P_y. It can be shown that

    \lim_{\lambda_T \to 0} L_t = -MMD[F, X_{n_t:n'_t-1}, X_{n'_t:n_t+T_t-1}]^2    (8.14)

if the same kernel used in the MMD is used as the spatial kernel in k_{KTC}(\cdot) (eq. 8.6), since k_T(\cdot) degenerates to 1 as \lambda_T \to 0.
Based on eq. 8.14, \delta_t is set as B_R(t) + \delta, where B_R(t) is an adaptive threshold calculated from the Rademacher bound [34], and \delta is a fixed global threshold, the only non-trivial parameter in KTC-R (used to control the coarse-to-fine level of segmentation).

Analysis. In summary, both KTC-S and KTC-R are based on eq. 8.5. The main differences are that KTC-S performs segmentation by sequential optimization in a two-class temporal clustering manner, while KTC-R performs segmentation with an incremental sliding window in a two-sample test manner. KTC-R requires more sliding windows than KTC-S, but for each one there is no optimization, and accessing k_{KTC}(\cdot) O(T_t \delta T) times is enough (linear in T_t); only when a new cut is detected are O(T_t^2) accesses required. Thus, KTC-R is extremely efficient and suitable for realtime applications.

It is notable that, even if the fixed-length sliding-window method (Sec. 8.3.2) is improved to decide whether a cut happens in X_{c_i:c_i+T-1}, a small T is still not reliable for realtime applications. The reason is that a clear temporal cut in human motion requires a large number of observations before and after the cut. Indeed, the required number of frames varies from action to action, even for manual annotation.

8.4 Online Hierarchical Temporal Segmentation

Besides estimating \{c_i\}, decomposing an action segment X_{c_i:c_{i+1}-1} into an unknown number of action units (e.g., three walking cycles) when cyclic motion exists is also needed [56]. This is not only helpful for understanding motion sequences, but also for other applications such as recognition and indexing. Thus, an online cyclic structure segmentation algorithm, the Kernelized Alignment Cut (KAC), is proposed as a generalization of kernel embedding of distributions and temporal alignment [160, 163]. By combining KAC and KTC-R, we obtain the two-layer segmentation algorithm KTC-H, where H stands for hierarchical. Action unit segmentation is difficult for non-periodic motions (e.g., jumping), which are actions that are usually performed once locally. However, people can still perform two consecutive non-periodic motions, and these two motions are not identical because of intra-person variations, which brings challenges for KAC.

8.4.1 KTC-H

KAC. As an online algorithm, KAC utilizes a sliding-window strategy. Each window X_{a_j+n_t-T_m:a_j+n_t-1} is sequentially processed, starting from n_1 = 2T_m, a_1 = c_i, where a_j is the j-th action unit cut. T_m is a parameter specifying the minimal length of one action unit; we empirically find that results are insensitive to T_m.

For each window X_{a_j+n_t-T_m:a_j+n_t-1}, the process has two branches. Either the last action unit continues, n_{t+1} = n_t + \delta T_m, or there is a new action unit, a_{j+1} = a_j + n_t - T_m, n_{t+1} = 2T_m. Here \delta T_m is the step length. The process ends when a new cut point c_{i+1} is received. Deciding whether X_{a_j+n_t-T_m:a_j+n_t-1} is the start of a new unit can be formulated as

    S_t = S_{Align}(X_{a_j:a_j+T_m-1}, X_{a_j+n_t-T_m:a_j+n_t-1});    H_0: S_t \le \epsilon_t, a_j + n_t - T_m is a new unit;    H_A: S_t > \epsilon_t, not H_0    (8.15)

where S_{Align}(\cdot) is the metric that measures the structural similarity between X_{a_j:a_j+T_m-1} and X_{a_j+n_t-T_m:a_j+n_t-1} while handling intra-person variations. \epsilon_t is an adaptive threshold (empirically set by cross-validation) and ideally should be close to zero if alignment can perfectly compensate for the variations. Similar to KTC-R, KAC has delay T_m.
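Both the KTC-R test of eq. 8.12 and the KAC test of eq. 8.15 follow the same incremental sliding-window pattern: grow a window, evaluate a statistic at a fixed lag before the newest frame, and compare it to a threshold. The sketch below illustrates this loop for KTC-R only, reusing the ktc_kernel_matrix and segment_statistic helpers from the previous sketch and a fixed threshold delta in place of the adaptive bound B_R(t) + \delta; all names are illustrative and this is not the released implementation.

```python
def ktc_r(X, T0=80, dT=30, delta=-0.05, lam_s=1e-3, lam_t=1.0):
    """Realtime change-point detection in the spirit of KTC-R (eqs. 8.9-8.12).
    X is a D x Lx stream; returns the list of detected cut frames (0-based)."""
    Lx = X.shape[1]
    cuts = []
    n, T = 0, 2 * T0                        # n_1 = 1 and T_1 = 2*T0 in the text
    while Lx - n > T0 and n + T <= Lx:
        K = ktc_kernel_matrix(X[:, n:n + T], lam_s, lam_t)
        c = T - T0                          # candidate cut at frame n_t + T_t - T0
        if segment_statistic(K, c) < delta: # H_A of eq. 8.12: transition detected
            cuts.append(n + c)
            n, T = n + c, 2 * T0            # eq. 8.10: restart right after the cut
        else:                               # H_0: no cut, keep growing the window
            T += dT                         # eq. 8.9
    return cuts
```

The KAC loop of eq. 8.15 has the same shape, with the kernel statistic replaced by an alignment score between the first action unit and the current window.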
In particular, KAC uses dynamic time warping (DTW) [160, 163] to design S_{KAC}(\cdot), by minimizing the following loss function based on the kernel from eq. 8.6,

    S_{KAC}(K^{a_j+n_t-T_m:a_j+n_t-1}_{a_j:a_j+T_m-1}; W_1, W_2)    (8.16)

where K is the cross-kernel matrix between the two segments, and W_1 and W_2 are binary temporal warping matrices encoding the temporal alignment path, as in [160]. Interested readers are referred to [160, 163] for more details about S(\cdot). Eq. 8.16 can be optimized by dynamic programming with complexity O(T_m^2), and S_{KAC}(\cdot) measures the similarity between the current action unit (or a part of it) and the current window. Importantly, alignment methods such as DTW are not suitable for eq. 8.12: alignment requires the two segments to have roughly the same starting and ending points, which does not hold in eq. 8.12.

KTC-H. By combining KTC-R and KAC, we can sequentially and simultaneously capture action transitions (cuts) and action units in the integrated algorithm KTC-H. Formally, KTC-H uses a two-layer sliding-window strategy: the outer loop (Sec. 8.3.3) estimates c_i, and the inner loop estimates a_j between c_i and the current frame of the outer loop. Since KTC-R (eq. 8.12) and KAC (eq. 8.15) both have fixed delay (T_0 and T_m), KTC-H is suitable for realtime transition and action unit segmentation.

Discussion. We compare with several related algorithms. (1) Spectral clustering [135] can be extended to temporal clustering by only allowing temporal cuts (TSC) [55]; similarly, minimizing eq. 8.5 can be viewed as an instance of TSC motivated by embedding distributions into an RKHS. (2) PPCA-CD is proposed in [5] to model motion segments with Gaussian models, where CD stands for change-point detection. Compared to [5], KTC has a higher computational cost but gains the ability to handle nonparametric distributions. (3) KTC is similar to KCpA [37], which uses the kernel Fisher discriminant ratio. Compared to [37], KTC performs change-point detection with an incremental sliding window and, more importantly, detects both change-points and cyclic structures. This is crucial for online recognition, since an action can be recognized after only one unit instead of the whole action. (4) As an elegant extension of kernel k-means and spectral clustering, ACA is proposed in [163] for offline temporal clustering. KTC can be viewed as an online complementary approach to [163].

8.4.2 Online Action Recognition from OpenNI

We use KTC-H to recognize tracked OpenNI data online, based on labeled Mocap sequences only. To the best of our knowledge, this is the first work towards continuous action recognition from OpenNI in a transfer learning framework, without the need for training data from OpenNI.

Recognition. In our system, action recognition is done by combining KTC-H and sequence alignment. Assume there are N labeled Mocap segments (action units) \{X^i_{mocap} \in \Re^{3M \times L_i}\}_{i=1}^N associated with labels I_i \in \mathcal{I}, where \mathcal{I} = \{1, 2, ..., C\} indicates C action classes. Given a segmented action unit X_{a_j:a_{j+1}-1} \in \Re^{3K \times (a_{j+1}-a_j)} from KTC-H, the estimated action label I^{Test} is given by Dynamic Manifold Warping (DMW) [28],

    I^{Test}_{a_j:a_{j+1}-1} = \arg\min_{i \in \{1,2,...,N\}} S_{DMW}(X_{a_j:a_{j+1}-1}, (X^i_{mocap}, I_i))    (8.17)

where K can be M, or less than M when the OpenNI tracker yields noisy and possibly occluded tracking trajectories from the depth sensor data. The spatial and temporal variations between X and X^i_{mocap} are handled by DMW.
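A minimal sketch of the nearest-neighbor recognition rule of eq. 8.17 is given below. DMW itself handles view-invariant spatial transformations between the depth-sensor and Mocap domains, which is beyond this snippet, so a plain kernel-based DTW cost is used only as a stand-in for S_DMW; the function names (dtw_alignment_cost, recognize_unit) and the Gaussian local cost are illustrative assumptions.

```python
import numpy as np

def dtw_alignment_cost(X, Y, sigma=1.0):
    """DTW over a frame-to-frame dissimilarity matrix (stand-in for S_DMW in eq. 8.17).
    X: D x Tx, Y: D x Ty.  Returns a length-normalized optimal alignment cost."""
    Tx, Ty = X.shape[1], Y.shape[1]
    C = np.array([[1.0 - np.exp(-np.sum((X[:, i] - Y[:, j]) ** 2) / (2 * sigma ** 2))
                   for j in range(Ty)] for i in range(Tx)])     # local cost in [0, 1]
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Tx, Ty] / (Tx + Ty)

def recognize_unit(X_unit, mocap_segments, mocap_labels, sigma=1.0):
    """Assign the label of the labeled Mocap action unit with the smallest alignment
    cost, i.e. the arg-min of eq. 8.17 with the stand-in alignment score."""
    costs = [dtw_alignment_cost(X_unit, Xi, sigma) for Xi in mocap_segments]
    return mocap_labels[int(np.argmin(costs))]
```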
8.5 Experimental Validation

We evaluate the performance of our system from two aspects: (1) temporal segmentation on depth sensor, Mocap and video data, and (2) online action recognition on depth sensor data.

Figure 8.4: Examples of online temporal segmentation. Top (depth sensor): a sequence that includes 3 segments, walking, boxing and jumping; noisy joint trajectories are tracked by OpenNI. Middle (Mocap): a sequence with 4579 frames and 7 action segments. Bottom (video): a clip that includes walking and running. In all cases, KTC-R achieves the highest accuracy.

8.5.1 Online Temporal Segmentation of Human Action

In this section, a quantitative comparison of online temporal segmentation methods is provided. KTC-R is compared with other state-of-the-art methods, i.e., PPCA-CD [5] and TSC-CD [151, 55], where TSC-CD is a change-point detection algorithm based on temporal spectral clustering, in our own implementation. PPCA-CD uses the same incremental sliding-window strategy as Sec. 8.3.3, and TSC-CD uses the fixed-length sliding window as in Sec. 8.3.2. Thresholds (e.g., \delta for KTC-R and the thresholds of the other methods) are set by cross-validation on one sequence.

Table 8.1: Temporal segmentation results comparison. Precision (P), recall (R) and rand index (RI) are reported.

    Methods   PPCA-CD (online)                 TSC-CD (online)                  KTC-R (online)
    Depth     0.73 (P) / 0.78 (R) / 0.80 (RI)  0.77 (P) / 0.81 (R) / 0.81 (RI)  0.87 (P) / 0.93 (R) / 0.88 (RI)
    Mocap     0.85 (P) / 0.90 (R) / 0.90 (RI)  0.83 (P) / 0.86 (R) / 0.88 (RI)  0.86 (P) / 0.91 (R) / 0.92 (RI)
    Video     -                                0.78 (P) / 0.85 (R) / 0.82 (RI)  0.85 (P) / 0.92 (R) / 0.88 (RI)

Methods like ACA [163] and [23] cannot be directly compared since they are offline. Results are evaluated with three metrics, precision, recall and rand index; the first two are for cut points and the last is over all frames. The ground truth for the rand index (RI) labels different segments with consecutive numbers 1, 2, 3, .... Importantly, T_0 is set to 80, 250 and 60 for depth sensor, Mocap and video respectively, giving KTC-R delays of 2.3, 2.1 and 1 seconds. KTC-S achieves accuracy similar to KTC-R but with a longer delay; details are omitted due to lack of space. Results are very robust to T_0 and \delta T; for instance, we obtained almost identical results when T_0 ranges from 60 to 120 on the OpenNI data.

Depth Sensor. To validate online temporal segmentation on the depth sensor, 10 human motion sequences are captured by the PrimeSense sensor. Each sequence is a combination of 3 to 5 actions (e.g., walking then boxing) with length around 700 frames (30Hz). For human pose tracking, we use the available OpenNI tracker to automatically track joints on the human skeleton. K \in [12, 15] key points are tracked, resulting in joint 3D positions x_t in \Re^{36} to \Re^{45}. Although the pose tracking results are often noisy (Fig. 8.4 and Fig. 8.5), we can correctly estimate action transitions from these noisy tracking results. In particular, KTC-R (\delta T = 30) significantly improves the accuracy over the other methods (Table 8.1). The main reason is that the joint positions of noisy tracked joints have complex nonparametric structure, which is handled by the kernel embedding of distributions [34, 35, 117] in KTC.

KTC-H. Besides action transitions, results on detecting both cyclic motions and transitions are reported by performing KTC-H (T_m = 50, \delta T_m = 1). Since the other methods do not have this module, we report a quantitative comparison on online hierarchical segmentation by using KTC-H versus the other methods combined with our KAC algorithm of Sec. 8.4.1. Results (Table 8.2) show that KTC-H achieves higher accuracy than the other combinations. It is notable that, because of the nature of RI, the RI metric increases when the number of cuts increases, even for low P/R, which is the case for hierarchical segmentation (which includes two types of cuts).

Mocap.
Similar to [5, 163], M = 14 joints are used to represent the human skeleton, resulting in quaternion-based joint-angle features in 42D. The online temporal segmentation methods are tested on 14 selected Mocap sequences from subject 86 of the CMU Mocap database. Each sequence is a combination of roughly 10 action segments, and in total there are around 10^5 frames (120Hz). Since our implementation of PPCA-CD differs from [5] (for example, only a forward pass is allowed in our experiments), the results are not the same as in [5]. Table 8.1 shows that the gain of KTC-R (\delta T = 50) over the other methods on Mocap is reduced compared with the depth sensor data. This is because the Gaussian assumption is more likely to hold for the quaternion representation of noiseless Mocap data, which is not the case for real data in general.

Video. Furthermore, KTC-R is run on a number of sequences from HumanEva-2, a benchmark for human motion analysis [112]. Silhouettes are extracted by background subtraction, resulting in a sequence of binary masks (60Hz). x_t \in \Re^{D_t} is the vector representation of the mask at frame t. It is notable that D_t (the mask size) may differ between frames, so PPCA-CD cannot be applied. This fact illustrates an advantage of KTC, which is applicable to complex sequential data as long as a (pseudo) kernel can be defined. In particular, we follow [163] and compute the matching distance between silhouettes to set the kernel. Results are shown in Fig. 8.4 and Table 8.1. As a reference, the state-of-the-art offline temporal clustering method ACA achieves higher accuracy than KTC-R on Mocap (96% precision). However, offline methods (1) are not suitable for real-time applications, and (2) require the number of clusters (segments) to be set in advance, which is not applicable in many cases.

Figure 8.5: Online action segmentation and recognition on the 2.5D depth sensor. From top to bottom: depth image sequences, KTC-H results, and action recognition results. For segmentation, the blue line indicates a cut and different rectangles indicate different action units. The blue circle indicates noisy tracking results. For recognition: distance to labeled Mocap sequences, and inferred 3M-D motion sequences.

8.5.2 Joint Online Segmentation and Recognition from OpenNI

We collect an additional 5109 frames (N = 30) with 10 primitive actions from CMU Mocap as the labeled (training) data for recognition. In order to associate labeled Mocap sequences with data from other domains, joint-position trajectories (M = 15) are used in eq. 8.17 [28]. The testing data are the previously collected depth sensor sequences, and online segmentation and recognition are performed simultaneously by KTC-H and eq. 8.17. A significant feature of our approach is that there is no extra training process for the depth sensor, i.e., the knowledge from Mocap can be transferred to other motion sequences, given proper features. Tracked trajectories from OpenNI within an action unit (segmented by KTC-H) are associated with labeled Mocap sequences from the 10 action categories.

Table 8.2: Online hierarchical segmentation and recognition on the 2.5D depth sensor.

    Methods   PPCA-CD + KAC                                 KTC-H
    Depth     0.72 (P) / 0.76 (R) / 0.89 (RI) / 0.71 (Acc)  0.85 (P) / 0.87 (R) / 0.94 (RI) / 0.85 (Acc)

Although the OpenNI tracking results are often noisy (highlighted by blue circles in Fig. 8.5), we achieve 85% recognition accuracy (Acc) from these noisy tracking results (Table 8.2), without any additional training on depth sensor data. This result benefits not only from DMW [28] but also from KTC-H: DMW requires its input to contain only one action unit, and KTC-H performs the critical missing step, accurate online temporal segmentation, that makes recognition possible.
As illustrated in Table 8.2, the accuracy on OpenNI is improved from 0.71 to 0.85, which strongly supports the effectiveness of KTC-H. Furthermore, complete and accurate 3M-D human motion sequences can be inferred by association with the manifolds learned from Mocap.

8.6 Conclusion

In this chapter, we propose an online temporal segmentation method, KTC, as a temporal extension of Hilbert space embedding of distributions for change-point detection, based on a novel spatio-temporal kernel. Furthermore, a realtime implementation of KTC and a hierarchical extension are designed, which detect both action transitions and action units. Based on KTC, we achieve realtime temporal segmentation on Mocap, on motion sequences from a 2.5D depth sensor, and on 2D videos. Finally, temporal segmentation is combined with alignment, resulting in realtime action recognition on depth sensor input without the need for training data from the depth sensor.

Chapter 9

Mining Large-Scale Time Series

After the work on spatio-temporal alignment and online temporal segmentation, a natural question is: can this framework handle large amounts of actions for broader applications? Nowadays there are millions or even billions of human action sequences and videos available online (e.g., CMU MoCap and YouTube), and these data are not fully explored. However, searching and mining large-scale multivariate time series data is extremely challenging due to complex dependence relations and the difficulty of defining proper similarity metrics. In this chapter, we propose Kernelized Alignment Hashing (KAH), a generalized learning algorithm that learns compact binary codes for fast mining of large-scale multivariate time series data. KAH is a temporal generalization of locality-sensitive hashing to variable-length time series data, obtained by incorporating the kernel approach. It utilizes the Global Alignment (GA) kernel, a soft-max version of dynamic time warping, to capture temporal similarity between time series while guaranteeing the positive definite kernel property. To further decrease the run-time complexity of hashing, we propose temporal feature dimension reduction to reduce the dimensionality of multivariate time series. We evaluate the proposed algorithm on a range of time series data sets, including smart phone sensor data, social behavior data and motion capture data. Compared to alternative methods, KAH not only performs fast search over millions of time series, but also yields better precisions and recalls.

9.1 Introduction

Multivariate time series are ubiquitous, appearing in many important applications such as human activity analysis, health care, speech recognition and financial prediction. Fast retrieval and mining in large-scale time series data is one of the fundamental problems in time series analysis and modeling. For instance, in order to recognize actions and retrieve similar motion sequences, an efficient algorithm is required to search large amounts of human motion capture sequences (e.g., CMU Mocap). Another example is to find the time frames with patterns similar to the current financial market in recorded historical data.

Fast (approximate) nearest neighbor search has been studied for non-time-series data for more than a decade. Tree-based methods such as the k-d tree provide efficient and exact search for low-dimensional vector data. However, the search efficiency degenerates to linear scan when the dimensionality is high. Locality-sensitive hashing (LSH) [25, 3] is the most popular approach for fast search by reducing high-dimensional data to binary codes.
In particular, recent progress on advanced learning-based hashing methods demonstrates the benefit of learning compact hash codes for retrieving large-scale data sets [144, 107, 97]. However, we are confronted with several major challenges when applying this line of work to time series data: (1) the temporal lengths of time series observations can differ, which requires generalizing the hashing method to a variable-dimensionality space; (2) it is difficult to define a similarity metric because of the complex temporal patterns of multivariate time series. Dynamic time warping (DTW) [96] has been studied for mining large-scale time series [98]; however, DTW is not a proper distance metric (it does not satisfy the triangle inequality) and its induced kernel is not positive definite. And (3) the codes for a new query, whose temporal length may differ from the others in the data set, must be computed efficiently.

In this chapter, we investigate the problem of learning compact binary codes for multivariate time series data and address these three challenges. The proposed algorithm, Kernelized Alignment Hashing (KAH), is a temporal extension of LSH obtained by incorporating the global alignment (GA) kernel on time series [14]. The GA kernel not only provides a valid induced distance metric for time series, but also allows kernelized hashing [53] on time series, which handles the variable-length problem. Furthermore, spatio-temporal dimension reduction is proposed to reduce the computational complexity of generating the binary codes for a query, which speeds up the search process (we use the term "spatio-temporal" to refer to the complex structure of multivariate time series). Overall, the online complexity of KAH is sub-linear in the number of time series in the database. To the best of the authors' knowledge, KAH is the first work to learn binary codes for hashing multivariate time series data.

The time series mining problem can be expressed rigorously as follows. There are N multivariate time series X_i \in \Re^{D \times L_i} (1 \le i \le N) in the database. Given a query Y \in \Re^{D \times L_y}, the most similar one (or the \epsilon-neighbors) should be retrieved quickly,

    X^* = \arg\min_{i \in \{1,2,...,N\}} D(X_i, Y)    (9.1)

where D(\cdot) is a distance metric between two multivariate time series (possibly of different lengths). Unlike the Euclidean case, the metric D(\cdot) is not well defined and requires further investigation. The goal of KAH is to (1) design a proper metric D(\cdot) that captures the spatio-temporal similarity of multivariate time series, and (2) learn a set of hashing functions h(\cdot): \Re^{D \times L} \to \{1, -1\} (D is fixed but L is arbitrary) to perform efficient search in the binary code space.

9.2 Related Work

In approximate nearest neighbor search, LSH-related methods usually require long codes to achieve good precision, which increases memory storage and leads to low recall. To overcome this problem, recent efforts on advanced hashing mainly focus on learning compact (short) binary codes from the data. These can be categorized into two groups. The first is unsupervised code learning; recent works include spectral hashing (SH) [144], shift-invariant kernel hashing [97], kernelized locality-sensitive hashing (KLSH) [53] and iterative quantization (ITQ) [33]. The second is supervised code learning, such as binary reconstructive embedding (BRE) [52], minimal loss hashing (MLH) [90], [137] and kernel-based supervised hashing (KSH) [65], which use label information in the training data to better retrieve semantically similar neighbors.
A central topic in time series retrieval is how to define the similarity (distance) metric, and dynamic time warping (DTW) [96] has been used extensively in time series applications [98]. As pointed out before, DTW is not a proper distance metric and the induced kernel is not positive definite. In order to bring in powerful kernel methods, the dynamic time-alignment kernel (DTAK) [109] was proposed to approximate the DTW distance, but there is no theoretical guarantee that DTAK is positive definite. In data mining, SAX and iSAX were proposed to index large-scale time series data [62, 108]. However, these works are designed for univariate time series and cannot be directly applied to multivariate time series retrieval.

9.3 Kernelized Alignment Hashing

This section describes Kernelized Alignment Hashing (KAH), a temporal generalization of LSH incorporating the kernel method and dimension reduction, to efficiently learn compact binary codes from unlabeled time series data.

9.3.1 Spatio-Temporal Hashing

Given the unlabeled time series data set X_i \in \Re^{D \times L_i} (1 \le i \le N), the purpose of hashing is to design a set of hashing functions h(\cdot): \Re^{D \times L} \to \{1, -1\}, each corresponding to one bit after hashing. It is notable that L is not fixed (and neither is the overall dimension D \times L), so the traditional hashing framework h(X) = sgn(w^T vec(X)) (assuming centralization) is not directly applicable; here sgn(\cdot) is the sign function and vec(\cdot) is the vectorization operator.

Assume there is a positive definite kernel function k_{ST}(\cdot): (X_1, X_2) \to \Re satisfying Mercer's theorem; we then design the hashing function in the Reproducing Kernel Hilbert Space (RKHS). k_{ST}(\cdot) is also required to capture the spatio-temporal similarity of D-dimensional time series well, which is addressed in Sec. 9.3.2. In particular, we design the spatio-temporal hashing function h_{ST}(\cdot) as

    h_{ST}(X) = sgn\left( \sum_{j=1}^{m} (\omega_j \Phi_{ST}(X_{(j)}))^T \Phi_{ST}(X) \right)    (9.2)

where \Phi_{ST}(\cdot) is the (possibly unknown) implicit mapping function of k_{ST}(\cdot). The X_{(j)} (1 \le j \le m) are m basis time series selected from the database, and the coefficients \omega_j are also constructed from the training data. Furthermore, to speed up the hashing process and reduce the structural complexity of X, dimension reduction is introduced into eq. 9.2, so that h_{ST}(X) can be represented as

    sgn\left( \sum_{j=1}^{m} (\omega_j \Phi_{ST}(f_{ST}(X_{(j)})))^T \Phi_{ST}(f_{ST}(X)) \right)    (9.3)

where f_{ST}(\cdot) is a spatio-temporal mapping function that reduces the dimensionality of X. In eq. 9.3, how to design f_{ST}(\cdot) and how to construct k_{ST}(\cdot) so that it captures spatio-temporal similarity both require further investigation. Instead of joint learning, which is often complicated, we study kernelized hashing (eq. 9.2) and dimension reduction (eq. 9.3) as two separate modules, for simplicity. This is similar to the framework of learning binary codes for vector data, where principal component analysis (PCA) is often used as a pre-processing step to reduce the dimensionality of the data [144, 33].

9.3.2 Hashing on the Alignment Kernel

Alignment Kernel. Designing a positive definite (p.d.) kernel is not only important for eq. 9.2 but also helpful for defining the distance metric on time series. This is because, once we have k(\cdot), the induced distance D(X,Y) = (k(X,X) + k(Y,Y) - 2k(X,Y))^{1/2} is a valid distance metric whenever k(\cdot) is positive definite. Recently, the Global Alignment (GA) kernel solved the time series kernel problem with an elegant idea: the similarity is the weighted average over all alignment paths instead of only the best-aligned one [14].
The key idea of the GA kernel is similar to the soft-max trick in optimization, which often turns a non-convex min-max problem into a convex one.

Formally, given two sequences X_{1:L_x} \in \Re^{D \times L_x} and Y_{1:L_y} \in \Re^{D \times L_y} with the same spatial dimensionality D, the GA kernel is defined as

    k_{GA}(X_{1:L_x}, Y_{1:L_y}) = \sum_{Q \in A(L_x, L_y)} e^{-\phi(X, Y, Q)} = \sum_{Q \in A(L_x, L_y)} \prod_{t=1}^{|Q|} \kappa_\sigma(x_{q_t(1)}, y_{q_t(2)})    (9.4)

where k_{GA}(\cdot) is the GA kernel for X_{1:L_x} and Y_{1:L_y}, and \kappa_\sigma(\cdot) is a pre-defined local kernel on vectors in \Re^D with bandwidth parameter \sigma. A(L_x, L_y) is the set of all possible temporal alignment paths between an L_x-length and an L_y-length time series, and Q = [q_1; q_2; ...; q_{|Q|}] \in \Re^{2 \times |Q|} represents one temporal alignment path. The alignment path Q must satisfy several principles: it starts from (1, 1), ends at (L_x, L_y), and is monotonic and continuous (refer to [14] for details). Thus, the length of an alignment path is bounded as \max\{L_x, L_y\} \le |Q| \le L_x + L_y - 1. We also introduce the notation \phi_\sigma, the induced divergence along a specific alignment path,

    \phi_\sigma(X, Y, Q) = \sum_{t=1}^{|Q|} \phi_\sigma(x_{q(t,1)}, y_{q(t,2)}) = \sum_{t=1}^{|Q|} -\log(\kappa_\sigma(x_{q(t,1)}, y_{q(t,2)}))    (9.5)

Because of the logarithmic property, i.e., e^{-\phi(\cdot)} = \kappa_\sigma(\cdot), applying \phi_\sigma(\cdot) to two multivariate time series of the same length is equivalent to summing \phi_\sigma(\cdot) over the corresponding elements.

The GA kernel can be viewed as a temporal instance of a general kernel family, the mapping kernel [110], recently proposed as an extension of the convolution kernel. [110] proves that a mapping kernel is p.d. for every local kernel \kappa(\cdot) if and only if the support set of the kernel is transitive. However, the support set of the GA kernel is not transitive. [14] proposes an elegant solution, geometric divisibility, to design a p.d. GA kernel k_{GA}(\cdot); more analysis of the p.d. property of the GA kernel is given in Sec. 9.4.

Compared to DTW, the GA kernel has two significant advantages. First, GA is a valid kernel and the distance induced by GA is a proper distance metric for multivariate time series. Second, GA allows nonlinear kernelization through the local kernel \kappa_\sigma(\cdot) (after taking geometric divisibility into account). These facts provide great potential for combining GA with state-of-the-art kernel methods in time series research; one example is using the GA kernel with an SVM for time series classification. More importantly for KAH, the GA kernel allows kernelized hashing on variable-length multivariate time series.
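The GA kernel of eq. 9.4 can be computed by dynamic programming in O(L_x L_y) without enumerating alignment paths. The sketch below assumes the geometrically divisible local kernel exp(-\phi_\sigma) discussed in Sec. 9.4 and computes the kernel directly rather than in the log domain, which is preferable for long sequences; the function names are illustrative and this is not the released implementation.

```python
import numpy as np

def ga_local_kernel(xi, yj, sigma):
    """Geometrically divisible local kernel exp(-phi_sigma), where
    phi_sigma(x, y) = ||x-y||^2/(2 sigma^2) + log(2 - exp(-||x-y||^2/(2 sigma^2)))
    (see Sec. 9.4); equivalently k/(2 - k) with k the Gaussian kernel value."""
    k = np.exp(-np.sum((xi - yj) ** 2) / (2.0 * sigma ** 2))
    return k / (2.0 - k)

def ga_kernel(X, Y, sigma=1.0):
    """Global Alignment kernel (eq. 9.4): the sum over all monotonic, continuous
    alignment paths of the product of local kernel values, via dynamic programming."""
    Lx, Ly = X.shape[1], Y.shape[1]
    M = np.zeros((Lx + 1, Ly + 1))
    M[0, 0] = 1.0
    for i in range(1, Lx + 1):
        for j in range(1, Ly + 1):
            M[i, j] = ga_local_kernel(X[:, i - 1], Y[:, j - 1], sigma) * \
                      (M[i - 1, j] + M[i, j - 1] + M[i - 1, j - 1])
    return M[Lx, Ly]

def ga_distance(X, Y, sigma=1.0):
    """Distance induced by the normalized GA kernel, D_GA(X, Y) =
    sqrt(2 - 2 k~_GA(X, Y)), used as the ground-truth metric in Sec. 9.5."""
    kxy = ga_kernel(X, Y, sigma)
    nk = kxy / np.sqrt(ga_kernel(X, X, sigma) * ga_kernel(Y, Y, sigma))
    return float(np.sqrt(max(2.0 - 2.0 * nk, 0.0)))
```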
Hashing. Recalling eq. 9.2, after designing the kernel, the next question is how to obtain the coefficients \omega_j. Unlike LSH and related methods, KAH performs hashing in the RKHS instead of a fixed-dimensional vector space. In particular, KAH utilizes the approximate-Gaussian idea in kernel space from [53]. Let the implicit mapping induced by the GA kernel be \Phi_{GA}(X); we want to construct a Gaussian-distributed direction r in the feature space and hash as h(\Phi_{GA}(X)) = sgn(r^T \Phi_{GA}(X)) (an alternative form of eq. 9.2), where r is approximated by a weighted sum of the mappings of t selected time series,

    r = \frac{1}{\sqrt{t}} \sum_{k=1}^{t} \Phi_{GA}(X_{(k)})    (9.6)

Here we assume the data have been (implicitly) centered in the RKHS. The t selected time series are chosen from the m basis time series X_{(j)} (1 \le j \le m), which are themselves selected from the whole database. By the kernel trick and the central limit theorem [53], the final solution of eq. 9.2 in KAH has an elegant form,

    h_{KAH}(\Phi_{GA}(X)) = sgn\left( \sum_{j=1}^{m} \omega_j k_{GA}(X_{(j)}, X) \right)    (9.7)

where \omega = [\omega_1, ..., \omega_m]^T is given by K_b^{-1/2} e^t_m. Here K_b = [k_{GA}(X_{(i)}, X_{(j)})] is the m \times m kernel matrix of the basis time series, and e^t_m is a length-m selection vector containing ones at t random positions, i.e., it selects t time series out of the m basis ones. Thus, in order to generate b-bit codes using eq. 9.7 (e.g., b = 32, 128, ...), we uniformly choose b selection vectors e^{t_s}_m (1 \le s \le b), and the final coding result can be written in matrix form as

    H^{b\text{-}bit}_{KAH}(X) = sgn\left( (E^t_{b,m})^T K_b^{-1/2} \vec{k}_{GA}(X) \right)    (9.8)

where \vec{k}_{GA}(\cdot): \Re^{D \times L} \to \Re^m is short notation for the vector [k_{GA}(X_{(1)}, \cdot), ..., k_{GA}(X_{(m)}, \cdot)]^T. Here E^t_{b,m} is a b \times m matrix whose rows are length-m selection vectors, and H^{b\text{-}bit}_{KAH}(\cdot): \Re^{D \times L} \to \{1, -1\}^b is the hashing function that generates b-bit codes. In eq. 9.8, the sign function sgn(\cdot) is applied to each row and the final output is a column vector.

After generating the binary codes, standard LSH methods [9] can be applied to find approximate nearest neighbors in sub-linear search time. If we assume the random vector r is truly Gaussian, the binary codes preserve the similarity in the original kernel space, i.e.,

    Pr(h(X_i) = h(X_j)) = 1 - \frac{1}{\pi} \cos^{-1}(\tilde{k}_{GA}(X_i, X_j))

where \tilde{k}_{GA}(\cdot) is the normalized GA kernel,

    \tilde{k}_{GA}(X_i, X_j) = k_{GA}(X_i, X_j) / \sqrt{k_{GA}(X_i, X_i) k_{GA}(X_j, X_j)}.

Although the collision probability Pr(\cdot) is not exactly equal to \tilde{k}_{GA}(\cdot), it is a monotonic function of this normalized kernel; in fact, Pr(h(X_i) = h(X_j)) is always within a factor of 0.878 of \tilde{k}_{GA}(\cdot).

It is notable that k_{GA}(\cdot) inside the hashing function h_{KAH}(\cdot) can naturally take time series of variable length, so KAH can be applied directly to variable-length time series data sets. Also, since k_{GA}(\cdot) is a p.d. kernel (Proposition 1), K_b is a symmetric positive definite matrix and computing K_b^{-1/2} is numerically feasible. Moreover, given a query, generating one bit of its binary code requires m evaluations of k_{GA}(\cdot). Thus, in order to obtain sub-linear online complexity, we need \lim_{N \to \infty} m/N \to 0, e.g., m = O(\log N). More discussion of the time complexity is given in Sec. 9.3.4.
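A small sketch of the code-generation pipeline of eqs. 9.6–9.8 is given below, reusing ga_kernel from the previous sketch. It is illustrative only: the feature-space centering is approximated by double-centering the basis kernel matrix (query-side centering is omitted for brevity), a small ridge term stabilizes the inverse square root, and the helper names (kah_train, kah_codes) are hypothetical. The 1/\sqrt{t} factor of eq. 9.6 is dropped because a positive scale does not change the sign.

```python
import numpy as np

def kah_train(basis, sigma=1.0, b=32, t=30, seed=0):
    """Offline stage: centered kernel matrix K_b of the m basis series, its inverse
    square root, and b random t-sparse selection vectors (rows of E in eq. 9.8)."""
    m = len(basis)
    K = np.array([[ga_kernel(Xi, Xj, sigma) for Xj in basis] for Xi in basis])
    H = np.eye(m) - np.ones((m, m)) / m            # double-centering in feature space
    K = H @ K @ H
    w, V = np.linalg.eigh(K + 1e-8 * np.eye(m))    # symmetric eigendecomposition
    K_inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-8))) @ V.T
    rng = np.random.default_rng(seed)
    E = np.zeros((b, m))
    for s in range(b):
        E[s, rng.choice(m, size=t, replace=False)] = 1.0
    return K_inv_sqrt, E

def kah_codes(X, basis, K_inv_sqrt, E, sigma=1.0):
    """Online stage (eq. 9.8): b-bit code for a query X of arbitrary temporal length."""
    k_vec = np.array([ga_kernel(Xj, X, sigma) for Xj in basis])
    return np.where(E @ K_inv_sqrt @ k_vec >= 0, 1, -1)
```

Hamming-distance search over these codes, or a standard LSH lookup table [9], then returns the approximate nearest neighbors.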
9.3.3 Dimension Reduction

The online complexity of eq. 9.7 for a new query grows linearly with m and quadratically with the temporal length L. Furthermore, the redundancy in the D \times L representation can be compressed while retaining the useful information. Because of the complex spatio-temporal structure, it is not reasonable to treat X as a vector in \Re^{DL} and apply standard techniques. Thus, we propose the spatio-temporal dimension reduction f_{ST}(\cdot): \Re^{D \times L} \to \Re^{d \times l} (d \le D, l \le L), a mapping function of the form

    f_{ST}(X) = P_{d \times D} X W^T_{l \times L}    (9.9)

where P \in \Re^{d \times D} (d \le D) is a spatial transformation matrix and W \in \{0, 1\}^{l \times L} (l \le L) is a binary temporal sampling matrix, first introduced in [160]. Each row of W has exactly one element equal to 1 and all others 0. P is shared by all the D-dimensional time series, but W is not.

The key idea is that W is determined only by X, while P is learned from the entire database. In particular, each W_i is learned from the individual X_i by manifold uniform sampling, a nonlinear extension of uniform sampling, described as follows. Inspired by [129], high-dimensional time series data often have a latent structure; for instance, the degrees of freedom of human motion are far fewer than the dimensionality of the feature representation in joint-angle (or joint-position) space. Manifold uniform sampling is based on the temporal manifold framework, which is proposed to explore the latent structure of a (smooth) time series x(t) \in \Re^D.

Definition: a temporal manifold is a directed traversing path M_p (with boundary, or compact) on a manifold M, which is further embedded in \Re^D.

A traversing path M_p can be thought of intuitively as a point moving on a manifold M (embedded in \Re^D) from a starting point at time t_1 to an ending point at time t_2. A path is not just a subset of M that "looks like" a curve; it also includes a natural parameterization g_\zeta: [0, 1] \to M, so a latent variable \zeta \in [0, 1] is associated with every point on the path. Since M is embedded in \Re^D, we can use the traversing path to describe the (smooth) continuous multivariate time series x(t) \in \Re^D as x(t) = p(\zeta(t)) + \epsilon, where p(\cdot) is a mapping from the latent variable \zeta to \Re^D. From now on we treat a (smooth) continuous multivariate time series x(t) as a temporal manifold, and X is simply a sampling of x(t), or equivalently of M_p. According to differential geometry, the parameterization of the traversing path is not unique. We are interested in a special one, the isometric embedding, in which the length of the path is (proportionally) preserved in \zeta(t).

Formally, we want |\zeta(t_1) - \zeta(t_2)| \propto d_{Len}(x(t_1), x(t_2)). The key idea of manifold uniform sampling is to sample uniformly over the latent space \zeta, i.e., sampling should be dense in highly curved (fast-changing) regions and sparse in flat regions. If we want to sample L points from x(t), the i-th sampling point is x(t_i) with \zeta(t_i) = i/L. In practice the continuous x(t) is unknown: we have a finite number of samples X = [x_1, ..., x_L] and want to further downsample X. In this case a simple approach is used to recover \zeta_i from the samples x_t (1 \le t \le L),

    \zeta_i = \frac{d_{Len}(x_1, x_i)}{d_{Len}(x_1, x_L)} \approx \frac{\sum_{k=1}^{i-1} \|x_k - x_{k+1}\|}{\sum_{k=1}^{L-1} \|x_k - x_{k+1}\|}    (9.10)

Based on the inferred \{\zeta_t\}_{t=1}^{L}, the temporal samples of X (or equivalently WX) are the l points whose latent coordinates are closest to the uniform grid \zeta_k = k/l in this 1D latent space.

Given all \{W_i\}, P is learned from the entire database by the maximum variance principle as in [33],

    \arg\max_P \sum_{i=1}^{N} var(P X_i W_i^T),    s.t.  P^T P = I    (9.11)

where var(\cdot) is the variance operator over the column space of X_i W_i^T. Without loss of generality, x (each column of X) is assumed to be zero-centered in \Re^D. The objective function is the same as that of PCA, and P can be learned by eigen-decomposition in \Re^D, which requires O(D^3) for the decomposition and O(N l D^2) for estimating the covariance matrix in \Re^D.

By combining eq. 9.8 and the spatio-temporal mapping of eq. 9.9, the final hashing function H^{b\text{-}bit}_{KAH}(X) is

    sgn\left( (E^t_{b,m})^T (K^{ST}_b)^{-1/2} \vec{k}_{GA}(P X W_X^T) \right)    (9.12)
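The temporal sampling of eq. 9.10 and the spatial projection of eq. 9.11 are both simple to implement. The sketch below is illustrative only (helper names are hypothetical); note that for very short or nearly constant sequences the nearest-grid selection can pick duplicate frames.

```python
import numpy as np

def manifold_uniform_sampling(X, l):
    """Manifold uniform sampling (eq. 9.10): latent coordinate zeta = normalized
    cumulative arc length, then keep the l frames closest to the uniform grid k/l.
    X is D x L; returns the selected column indices (the rows of W pick them)."""
    steps = np.linalg.norm(np.diff(X, axis=1), axis=0)       # ||x_k - x_{k+1}||
    zeta = np.concatenate(([0.0], np.cumsum(steps)))
    zeta /= max(zeta[-1], 1e-12)                             # zeta_1 = 0, zeta_L = 1
    grid = np.arange(1, l + 1) / float(l)
    return np.array([int(np.argmin(np.abs(zeta - u))) for u in grid])

def learn_spatial_projection(X_list, idx_list, d):
    """Maximum-variance spatial projection P of eq. 9.11: PCA over the pooled,
    temporally sampled frames (columns zero-centered in R^D)."""
    Z = np.hstack([X[:, idx] for X, idx in zip(X_list, idx_list)])
    Z = Z - Z.mean(axis=1, keepdims=True)
    C = Z @ Z.T / Z.shape[1]                                 # D x D covariance
    w, V = np.linalg.eigh(C)
    return V[:, ::-1][:, :d].T                               # top-d directions, d x D
```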
9.3.4 Algorithm

To review the key steps of KAH, the algorithm is summarized as follows.

Input: m time series \{X_{(j)}\}_{j=1}^{m} selected from \{X_i\}_{i=1}^{N} \in \Re^{D \times L_i} (e.g., m = O(\log N)); b: the number of bits; t: the parameter for random selection; \sigma: the bandwidth; d: the reduced spatial dimensionality; l: the temporal length after sampling; Y \in \Re^{D \times L_y}: a query.

• Build W_{X_i} for each X_i in the latent space via eq. 9.10 and compute P according to eq. 9.11, then transform each X_i to f_{ST}(X_i) by eq. 9.9.
• Compute the basis kernel matrix K^{ST}_b \in \Re^{m \times m} for \{X^{ST}_{(j)}\}_{j=1}^{m} using k_{GA}(\cdot) and center it.
• Generate b-bit codes for each X^{ST}_i by eq. 9.12.
• (Online) Transform Y to f_{ST}(Y), generate its b-bit code by eq. 9.12, and perform neighbor search with the algorithm in [9].

Complexity. The computational complexity of the proposed KAH algorithm has an offline part and an online part.

Offline. The learning stage of eq. 9.9 takes O(NLD) for sampling and O(N l D^2 + D^3) for the spatial transformation. Forming the basis kernel matrix K^{ST}_b takes O(m^2 l^2 d) and computing its inverse square root takes O(m^3). Furthermore, transforming the N time series into binary codes requires O(N m l^2 d). Overall, the most time-consuming part is O(N m l^2 d), which is of the same order as O(N \log N) if m = O(\log N).

Online. Compared with the offline part, the online complexity is more crucial because it directly determines the search latency. It has two components: generating the binary code and searching based on the codes. The second part depends on the LSH algorithm (e.g., [9]), which is not the focus of this work. The most time-consuming stage of generating the code is the kernel evaluation, which is O(m L^2 D), sub-linear in N. After the spatio-temporal transformation, this complexity decreases further to O(m l^2 d), depending on L/l and D/d. Furthermore, the triangular trick can be used in the GA kernel to speed up the kernel evaluation [14].

9.4 Analysis

This section presents analysis of (1) the positive definiteness of the GA kernel, and (2) the relationship between the induced geometrically divisible kernel and the Gaussian.

Proposition 1 [14]. Consider the mapping \tau(x) = x/(1+x) and any infinitely divisible kernel \kappa(\cdot) with range (0, 1). Then \tau^{-1}(\kappa(\cdot)) is geometrically divisible (g.d.) and infinitely divisible. Furthermore, using such a g.d. kernel as the local kernel in k_{GA} guarantees that k_{GA} is a p.d. kernel.

Proposition 1 gives a solid theoretical foundation for using the GA kernel family in real applications. As an example, we can start from the well-known Gaussian kernel \kappa^G_\sigma(x, y) = \exp(-\|x-y\|^2 / 2\sigma^2) and combine it with the mapping \tau(\cdot)^{-1} to obtain the g.d. kernel \exp(-\phi_\sigma(x, y)), where \phi_\sigma(x, y) is the induced divergence in \Re^D with the form

    \phi_\sigma(x, y) = \frac{1}{2\sigma^2} \|x-y\|^2 + \log\left( 2 - e^{-\frac{1}{2\sigma^2}\|x-y\|^2} \right).

It is interesting to compare \phi_\sigma(x, y) with the divergence in the Gaussian kernel. One can easily prove that \|x-y\|^2/2\sigma^2 \le \phi_\sigma(x, y) \le \min\{\|x-y\|^2/2\sigma^2 + 1, \|x-y\|^2/\sigma^2\}. Thus, it is reasonable to set the local kernel to \exp(-\phi_\sigma(x, y)), which is not only g.d. but also tightly bounded by the Gaussian.

In an extreme case, every well-behaved divergence \phi_\sigma(x, y) converges to 0 as \sigma goes to infinity. In that case we find that k_{GA}(X, Y) \to |A(L_x, L_y)|, the size of the alignment path set. This number is also known as the Delannoy number D(L_x, L_y), which is a p.d. kernel on the integers. A visualization of the normalized Delannoy kernel \tilde{D}(L_x, L_y) = D(L_x, L_y)/\sqrt{D(L_x, L_x) D(L_y, L_y)} is shown in Fig. 9.1. We observe that there is no diagonal dominance effect for \tilde{D}(\cdot); thus, one can easily avoid diagonal dominance of k_{GA}(X, Y) for any real data by setting \sigma properly.

Figure 9.1: Visualization of the normalized Delannoy kernel.

9.5 Experiments

We evaluate the performance of KAH on smart phone sensor data, social behavior data and CMU motion capture data (Mocap). These data are chosen to demonstrate the general applicability of KAH as well as its advantages on large-scale, variable-length time series data. We investigate the performance of hashing in both a non-semantic and a semantic way. The first uses ground-truth neighbors and measures the precision and recall of the hashing neighbors. Unlike the Euclidean case, ground-truth neighbors of time series are not well defined, so the neighbors given by the distance induced by the normalized GA kernel, D_{GA}(X, Y) = (\tilde{k}(X, X) + \tilde{k}(Y, Y) - 2\tilde{k}(X, Y))^{1/2}, are used in the experiments.
As in [97], the average distance to the 50th nearest neighbor is used as the threshold for a true positive. The second evaluation is the semantic accuracy, i.e., the average precision of the returned neighbors measured against the semantic class labels (when available) as ground truth.

We perform quantitative comparisons between KAH and other state-of-the-art hashing methods, including locality-sensitive hashing with zero centering (LSH-ZC) [9, 33], rotated PCA (PCA-RR) [46], spectral hashing (SH) [144], SKLSH [97], KLSH [53] (with a Gaussian kernel) and Iterative Quantization (ITQ) [33]. It is notable that none of these previously proposed hashing methods can be applied directly to variable-length time series data, so a pre-processing step is required; details are given later. Due to the randomness of the algorithms, average results over 10 trials are reported.

Since KAH is an unsupervised learning algorithm, we only compare it to methods that do not use label information in the training stage. Supervised or semi-supervised algorithms, such as [66], SSH [138], minimal loss hashing (MLH) [90] and kernel-based supervised hashing (KSH) [65], are not included in the comparison. Moreover, KAH can be combined with the state-of-the-art kernel learning framework in [51] to utilize label information, which is left for future investigation.

Figure 9.2: Left, examples of UCF-iphone data (jumping and running). Right, precision and recall results for 32 to 256 hash bits. Ground truth is given by the induced distance from the GA kernel [14].

9.5.1 Smartphone data

The UCF-iphone data set is a time series data set of inertial measurement unit (IMU) recordings on smart phones [76]. Each time series has 500 frames with 9D features from a person performing one of nine activities (e.g., walking and running). We use a sliding-window approach to cut the data into short series of 40 or 60 frames: the IMU runs at 60Hz, and roughly 40 to 60 frames contain one action cycle, which enables a meaningful similarity measurement. In total we have 7660 9D time series used in our experiments (1000 randomly selected ones serve as queries), with 367680 frames overall. Since the UCF-iphone data have semantic labels (9 activities), both non-semantic and semantic evaluation results are reported.

In the non-semantic setting, we measure the precision and recall of the returned search neighbors against the ground-truth neighbors. As pointed out before, there is no default similarity metric for time series, so the induced GA distance D_{GA}(\cdot) is used to obtain the ground-truth neighbors. Since the GA kernel is also used in KAH, it would be unfair to compare KAH to other hashing works in this setting, so no comparison is reported here. Furthermore, except for KAH, all the other hashing methods were proposed for fixed-dimensional data.
This fact supports the effectiveness of KAH, which can speed up the searchprocessontimeseriesdataandmaintaintheperformancesimultaneously. WealsoinvestigatetheperformanceofKAHwithreduction(eq.9.12)byfixingl = 20and d = 5. Moreover,thecomparisontootherhashingworksareincludedinthesemanticevaluation. Results (Fig. 9.3) show that KAH achieves superior performance than other candidate methods in terms of the semantic accuracy of returned neighbors. SKLSH [97] does not perform well in termsofthesemanticaccuracy,asobservedby[33],andtheresultsarenotincludedinFig.9.3. It isnotablethat,nonlinearalgorithmsincludingSH,KLSHandKAHoutperformothercandidates. This is because the complex structure of multivariate time series is difficult to handle by linear hashing methods. SH [144] obtains very good results for 32 hash bits but the edge over linear hash methods decreases when the code is long, e.g., 128 bits. This fact is consistent with the observationsfrompreviousworkssuchas[33]. ThegainfromKAHtoKLSHismainlybecause oftheGAkernel,whichcanwellcapturethespatio-temporalsimilaritybetweentimeseries. 184 0 10 20 30 40 50 60 70 80 90 100 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 Num of Returned Neighbors Semantic Accuracy SH KLSH PCA−RR ITQ LSH−ZC KAH 0 10 20 30 40 50 60 70 80 90 100 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 Num of Returned Neighbors Semantic Accuracy SH KLSH PCA−RR ITQ LSH−ZC KAH Figure 9.3: Semantic accuracy on iphone data with 32 and 128 hash bits. Ground truth is given bythesemanticlabel. 9.5.2 SocialBehaviordata The social behavior dataset is a synthetic dataset created by a research consortium, which con- sistsofexpertsincomputesecurity,datasimulation,counterintelligence,andsoon,asabench- mark dataset for social behavior anomaly detection. The dataset contains time series observa- tions of social behaviors, such as email communications, computer log-in/log-off records, USB access,andhttpaccess,among1000peoplefromJan. 2,2010toMay16,2011. Oneofthemain task is that given a query profile (usually time series behavior observations by the anomolous users within a time period), search similar behavior patterns in the large collections. As before, the raw data is pre-processed to 3958 9D time series with lengthL = 60 frames and 1000 ran- domlyselectedonesareusedasqueries. Sincethereisnosemanticlabel,onlythenon-semantic evaluationisperformed. Fig.9.4(left)showstheprecision-recallcurvefortheproposedKAH(m = 300,t = 30,l = L = 60). Results show that KAH obtains higher quality search results compared to the linear scan. Using128bits,theprecisionismorethan0.95whentherecallis0.4. That’salsobecause most of the queries are normal behaviors, which are quite likely to be similar to each other. 185 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Precision 16 bits 32 bits 64 bits 128 bits 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Precision 16 bits 32 bits 64 bits 128 bits Figure 9.4: Precision and recall results on social behavior data. Left is KAH with l = 60 and rightisKAHwithl = 15. GroundtruthisgivenbytheinducedGAdistance. Furthermore, Fig. 9.4 (right) shows the results for KAH with l = 15. The 4× down sampling can largely speed up the kernel operation and meanwhile maintain reasonable precisions and recalls. 9.5.3 Mocapdata KAH is applied to human motion capture data to perform similar motion (segments) retrieval, whichhasmanyimportantapplicationssuchasactivityrecognition,videogameandanimation. 
For instance, motion data retrieval can be combined with state-of-the-art human pose estimation such as [111] to perform real-time action recognition from a single depth sensor. CMU Mocap is a time series data set recording the joint positions (angles) of human motions, but its sequences are not well labeled. Each Mocap sequence is performed by a human subject, with lengths ranging on average from 2000 to 10000 frames.

In order to apply time series hashing, Mocap is pre-processed as follows. 14 joints are used to represent the human skeleton for all sequences, resulting in quaternion-based joint-angle features in 42D [163]. We again use the sliding-window approach to obtain short sequences of 60 frames, resulting in close to half a million short sequences, or 30 million frames in \Re^{42} in total. This is a large-scale time series data set, and performing retrieval by linear scan is clearly time-consuming. 1000 randomly selected motion sequences are used as queries for evaluation. One example of the visual results is provided in Fig. 9.5, where several neighbors returned by KAH for a walking sequence are shown. The semantically similar sequences are returned by KAH, and it is worth noting that the 1st and 10th nearest neighbors are more similar to the query than the 50th one, which means the ranks reflect semantic similarities well.

Figure 9.5: Examples of search results on CMU Mocap. The first row is the query (a person walking) and the following rows are the 1st, 10th and 50th returned nearest neighbors.

9.6 Conclusion

Kernelized Alignment Hashing (KAH) is proposed to learn binary codes from unlabeled, variable-length time series data. In particular, KAH combines the kernelized hashing framework with the global alignment time series kernel. The algorithm is evaluated and compared to other state-of-the-art hashing methods. The results are encouraging: KAH yields good precision-recall curves in the non-semantic evaluation and higher accuracy in the semantic evaluation. Furthermore, the results on CMU Mocap demonstrate the capability of KAH on large-scale time series data. For future work, KAH can be applied to more real data sets, including videos (e.g., HumanEva) and depth sensor data.

Chapter 10

Conclusion

This chapter gives a summary of the contributions and papers, and discusses future directions.

10.1 Contribution

The papers that serve as the basis for Chapters 2 to 9 are listed as follows.

• Dian Gong, Gerard Medioni, Sikai Zhu and Xuemei Zhao, Kernelized Temporal Cut for Online Temporal Segmentation and Recognition, Proc. of the 12th European Conference on Computer Vision (ECCV 2012), Firenze, Italy, October 2012.
• Dian Gong, Xuemei Zhao and Gerard Medioni, Robust Multiple Manifolds Structure Learning, Proc. of the 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, June 2012.
• Dian Gong and Gerard Medioni, "Probabilistic Tensor Voting for Robust Perceptual Grouping", workshop on POCV, Proc. of the IEEE 25th Conference on Computer Vision and Pattern Recognition (CVPR 2012), Rhode Island, USA, June 2012.
• Dian Gong and Gerard Medioni, "Dynamic Manifold Warping for View Invariant Action Recognition", Proc. of the IEEE 13th International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, November 2011.
• Dian Gong, Fei Sha and Gerard Medioni, "Locally Linear Denoising on Image Manifolds", Proc. of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Sardinia, Italy, May 2010. Volume 9 of the Journal of Machine Learning Research.

10.2 Future Work

There are several future directions regarding the topics of robust manifold structure learning, time series learning and human motion analysis.
Unified Tensor Voting. The mathematical model of a unified Tensor Voting framework for robust manifold structure learning requires further investigation. The hope is that this model can not only derive the voting scheme in high-dimensional space from the perspective of local structure learning, but also naturally yield a probabilistic version of Tensor Voting to handle inlier noise. As shown in the preliminary results, the first-pass vote is the optimal result of this new model under the so-called weighted low-rank matrix approximation [119].

Spatio-Temporal Alignment. The spatio-temporal alignment algorithm, Dynamic Manifold Warping, can be improved. Currently, the spatial similarity calculation is performed by CCA on temporally local human motion segments. However, the intrinsic spatial structure of human pose is nonlinear, which is beyond the scope of a linear method. It is possible to incorporate information-theoretic and graph kernel (similarity) approaches into our alignment algorithm to handle the nonlinearity of human poses.

Appendix A

Probabilistic Tensor Voting Package

A.1 Introduction

This package contains a Matlab implementation of probabilistic tensor voting [29].

A.2 Usage

Sparse_Ball_PTV_2D.m is the probabilistic voting function for the first pass of the voting procedure, and Sparse_PTV_2D.m is the probabilistic voting function after the first pass.

A.3 Parameters

There are two parameters, \sigma and \sigma_n. The first is the same as the bandwidth in standard tensor voting, and the second is the voting scale for inlier noise. There is no theoretical guidance on setting \sigma_n, and it is chosen empirically. For instance, if \sigma is chosen as 10, then a general range for \sigma_n is 0.1 to 1.

Appendix B

ND Space Closed Form Tensor Voting Package

B.1 Introduction

This package contains an implementation of the closed-form solution of tensor voting in ND space (Chapter 4). It is a Matlab package, and the corresponding C++/GPU implementation can be found in [156].

B.2 Usage

Sparse_Ball_STV_ND.m is the closed-form function for the first pass of the voting procedure, and Sparse_STV_ND.m is the closed-form function after the first pass. Sparse_STV_ND.m can also be used for the first pass, because the two votes are integrated into one framework in the closed-form solution. There are several choices for the closed-form solution, and Std6 is the recommended one.

B.3 Parameter

The only parameter is \sigma, which is the same as in standard tensor voting.

B.4 Note

As a comparison, this package also includes the standard tensor voting functionality [81], which can be downloaded from [131]. More details and explanations are included in the code and the readme file.

Appendix C

Temporal Alignment Package

C.1 Introduction

This package contains an implementation of Dynamic Manifold Warping (DMW) [28] and related temporal alignment functions.

C.2 Installation and Usage

Step 1. Unzip aca.zip to your folder.
Step 2. Run addPath.m to add the sub-directories to the Matlab path.
Step 3. Go to the folder utility/ctw_20100910/ctw and run make.m if needed (for DTW, CTW, etc.).
Step 4. Run scriptCmuMocap.m or scriptCmuMocap2.m.

C.3 Parameter

There is no parameter for DMW.

C.4 Note

The main function alignDmw.m also includes other alignment algorithms such as Dynamic Time Warping (DTW) and Canonical Time Warping (CTW), which are based on [160]. More details and explanations are included in the code and the readme file.

Appendix D

Manifold Denoising Package

D.1 Introduction

This package contains a Matlab implementation of Locally Linear Denoising (LLD) [31].

D.2 Usage

Run the demo script scriptLLD.m; this function calls LLD_Graph.m.
D.3 Parameter

There are two parameters: d (the intrinsic dimensionality) and K (the number of neighbors). d can be chosen as the true intrinsic dimensionality if prior knowledge is available; for instance, d is set to 2 for mesh denoising and 1 for curve smoothing. The optimal value of K depends on the sparsity of the data and the intrinsic dimensionality, and there is no general rule. For instance, if the input data are 10,000 point samples from a mesh in 3D, then K can be chosen from 10 to 100. More examples of d and K can be found in [31].

Reference List

[1] R. P. Adams and D. J. C. MacKay. Bayesian online changepoint detection. University of Cambridge Technical Report, 2007.
[2] A. Agarwal, H. Daume III, and S. Gerber. Learning multiple tasks using manifold regularization. In NIPS, pages 46–54, 2010.
[3] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117–122, 2008.
[4] Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2003.
[5] J. Barbic, A. Safonova, J.-Y. Pan, C. Faloutsos, J. K. Hodgins, and N. S. Pollard. Segmenting motion capture data into distinct behaviors. In Proc. Graphics Interface, pages 185–194, 2004.
[6] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, June 2003.
[7] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. In Proc. ICCV, volume 2, pages 1395–1402, 2005.
[8] A. Buades, B. Coll, and J. M. Morel. A non-local algorithm for image denoising. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 60–65, 2005.
[9] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, 2002.
[10] J. Chen and A. K. Gupta. Parametric Statistical Change-point Analysis. Birkhauser, 2000.
[11] J. Choi and G. Medioni. Starsac: Stable random sample consensus for parameter estimation. In Proc. CVPR, pages 675–682, 2009.
[12] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 2nd edition, May 1997.
[13] CMU. http://mocap.cs.cmu.edu/.
[14] M. Cuturi. Fast global alignment kernels. In Proc. ICML, 2011.
[15] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen O. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
[16] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Transactions on Signal Processing, 53:2961–2974, 2005.
[17] M. Elad and M. Aharon. Image denoising via learned dictionaries and sparse representation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 895–900, 2006.
[18] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. CVPR, 2009.
[19] L. Faivishevsky and J. Goldberger. A nonparametric information theoretic clustering algorithm. In ICML, pages 351–358, 2010.
[20] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single image. ACM Transactions on Graphics and SIGGRAPH, 25:787–794, 2006.
[21] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM, 24:381–395, 1981.
[22] S. Fleishman, I. Drori, and D. Cohen-Or. Bilateral mesh denoising. ACM Transactions on Graphics and SIGGRAPH, 22:950–953, July 2003.
[23] Emily Fox, Erik Sudderth, Michael Jordan, and Alan Willsky. Nonparametric bayesian learning of switching linear dynamical systems.
[24] Y. Gao, K. L. Chan, and W. Y. Yau. Manifold denoising with Gaussian process latent variable models. In Proceedings of Intl. Conf. on Pattern Recognition, pages 1–4, 2008.
[25] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.
[26] A. Goh and R. Vidal. Segmenting motions of different types by unsupervised manifold clustering. In Proc. CVPR, 2007.
[27] A. Goldberg, X. Zhu, A. Singh, Z. Xu, and R. Nowak. Multi-manifold semi-supervised learning. In AISTATS, 2009.
[28] D. Gong and G. Medioni. Dynamic manifold warping for view invariant action recognition. In Proc. ICCV, pages 571–578, 2011.
[29] Dian Gong and Gerard Medioni. Probabilistic tensor voting for robust perceptual grouping. In Proc. POCV, CVPR Workshop, 2012.
[30] Dian Gong, Gerard Medioni, Sikai Zhu, and Xuemei Zhao. Kernelized temporal cut for online temporal segmentation and recognition. In Proc. ECCV, 2012.
[31] Dian Gong, Fei Sha, and Gerard Medioni. Locally linear denoising on image manifolds. In Proc. AISTATS, 2010.
[32] Dian Gong, Xuemei Zhao, and Gerard Medioni. Robust local to global structure learning on multiple manifolds. In Proc. ICML, 2012.
[33] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE PAMI, 2012.
[34] A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007.
[35] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. In NIPS, volume 19, pages 673–681, 2009.
[36] Jose A. Guerrero-Colon, Eero P. Simoncelli, and Javier Portilla. Image denoising using mixtures of Gaussian scale mixtures. In ICIP, pages 565–568. IEEE, 2008.
[37] Zaid Harchaoui, Francis Bach, and Eric Moulines. Kernel change-point analysis. In Advances in Neural Information Processing Systems 21, 2009.
[38] C. Harris and M. J. Stephens. A combined corner and edge detector. In Proc. Alvey Vision Conference, pages 147–152, 1988.
[39] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2nd edition, 2004.
[40] M. Hein, J. Audibert, and U. von Luxburg. From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians. In COLT, pages 470–485, 2005.
[41] M. Hein and M. Maier. Manifold denoising. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 561–568. MIT Press, Cambridge, MA, 2007.
[42] M. Hoai, Z. Lan, and F. De la Torre. Joint segmentation and classification of human actions in video. In Proc. CVPR, 2011.
[43] T. Hofmann, B. Scholkopf, and A. J. Smola. Kernel methods in machine learning. Annals of Statistics, 36:1171–1220, 2008.
[44] P. J. Huber. Robust Statistics. Wiley-Interscience, February 1981.
[45] J. H. Ham, D. D. Lee, and L. K. Saul. Semisupervised alignment of manifolds. In Proc. AISTATS, pages 120–127, 2005.
[46] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[47] I. T. Jolliffe. Principal component analysis. Springer, 2nd edition, 2002.
[48] T. K. Kim and R. Cipolla. Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Trans. on PAMI, 31:1415–1428, 2009.
[49] J. J. Kivinen, E. B. Sudderth, and M. Jordan. Image denoising with nonparametric hidden Markov trees. In Proc. International Conference on Image Processing, volume 3, pages 121–124, 2007.
[50] Venkat Krishnamurthy and Marc Levoy. Fitting smooth surfaces to dense polygon meshes. In Proc. SIGGRAPH 1996, pages 313–324, 1996.
[51] B. Kulis, M. Sustik, and I. Dhillon. Low-rank kernel learning with Bregman matrix divergences. JMLR, 2009.
[52] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042–1050, 2009.
[53] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[54] F. De la Torre and M. J. Black. A framework for robust subspace learning. International Journal of Computer Vision, 54:117–142, 2003.
[55] F. De la Torre, J. Campoy, Z. Ambadar, and J. F. Cohn. Temporal segmentation of facial behavior. In Proc. ICCV, 2007.
[56] I. Laptev, S. J. Belongie, P. Perez, and J. Wills. Periodic motion detection and segmentation via approximate sequence alignment. In Proc. ICCV, pages 816–823. Springer-Verlag, Berlin/Heidelberg, 2005.
[57] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. CVPR, 2008.
[58] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. JMLR, 6:1783–1816, November 2005.
[59] W. Liao and G. Medioni. 3D face tracking and expression inference from a 2D sequence using manifold learning. In Proc. CVPR, volume 2, pages 416–423, 2008.
[60] D. Lin, E. Grimson, and J. Fisher. Modeling and estimating persistent motion with geometric flows. In CVPR, 2010.
[61] D. Lin, E. Grimson, and J. Fisher. Construction of dependent Dirichlet processes based on Poisson processes. In NIPS, 2011.
[62] J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In SIGMOD Workshop on Research Issues in DMKD, 2003.
[63] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili. Multiple alignment of continuous time series. In NIPS, volume 17, 2005.
[64] Jingen Liu, Jiebo Luo, and Mubarak Shah. Recognizing realistic actions from videos in the wild. In Proc. CVPR, pages 1996–2003, 2009.
[65] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, 2012.
[66] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In ICML, 2011.
[67] C. C. Loy, T. Xiang, and S. Gong. Multi-camera activity correlation analysis. In Proc. CVPR, pages 1988–1995, 2009.
[68] F. Lv and R. Nevatia. Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In Proc. ECCV, volume 3954, pages 359–372, 2006.
[69] F. Lv and R. Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In Proc. CVPR, pages 1–8, 2007.
[70] M. Maier, U. von Luxburg, and M. Hein. Influence of graph construction on graph-based clustering measures. In NIPS, pages 351–358, 2009.
[71] D. Marr. Vision. San Francisco: Freeman, 1982.
[72] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. on PAMI, 26:530–549, 2004.
[73] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
[74] J. Matas, C. Galambos, and J. Kittler. Robust detection of lines using progressive probabilistic Hough transform. CVIU, 78:119–137, 2000.
[75] P. Matikainen, M. Hebert, and R. Sukthankar. Representing pairwise spatial and temporal relations for action recognition. In Proc. ECCV, volume 6311, pages 508–521, 2010.
[76] C. McCall, K. Reddy, and M. Shah. Macro-class selection for hierarchical k-NN classification of inertial sensor data. In PECCS, 2012.
[77] G. Medioni, M. S. Lee, and C. K. Tang. A Computational Framework for Segmentation and Grouping. New York: Elsevier, 2000.
[78] R. Messing, C. Pal, and H. Kautz. Activity recognition using the velocity histories of tracked keypoints. In Proc. of ICCV, pages 104–111, 2009.
[79] S. Mika, B. Schölkopf, A. J. Smola, K. R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 536–542. MIT Press, Cambridge, MA, 1999.
[80] P. Mordohai and G. Medioni. Junction inference and classification for figure completion using tensor voting. In Fourth Workshop on Perceptual Organization in Computer Vision, 2004.
[81] P. Mordohai and G. Medioni. Tensor Voting: A Perceptual Organization Approach to Computer Vision and Machine Learning. Morgan and Claypool Publishers, 2007.
[82] P. Mordohai and G. Medioni. Dimensionality estimation, manifold learning and function approximation using tensor voting. JMLR, 11:411–450, 2010.
[83] Rodrigo Moreno, Miguel Angel Garcia, Domenec Puig, Luis Pizarro, Bernhard Burgeth, and Joachim Weickert. On improving the efficiency of tensor voting. PAMI, 33:2215–2227, 2011.
[84] P. Natarajan and R. Nevatia. View and scale invariant action recognition using multiview shape-flow models. In Proc. CVPR, 2008.
[85] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.
[86] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849–856. MIT Press, 2002.
[87] Juan Carlos Niebles, Chih-Wei Chen, and Li Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In Proceedings of the 12th European Conference on Computer Vision (ECCV), Crete, Greece, September 2010.
[88] J. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79:299–318, 2008.
[89] Huazhong Ning, Wei Xu, Yihong Gong, and Thomas Huang. Latent pose estimator for continuous action recognition. In Proc. ECCV, volume 5303, pages 419–433, 2008.
[90] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
[91] F. Padua, F. Carceroni, R. Santos, and G. Kutulakos. Linear sequence-to-sequence alignment. IEEE Trans. on PAMI, 32:304–320, 2010.
[92] Y. Peng, A. Ganesh, J. Wright, and Y. Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. In CVPR, 2010.
[93] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:629–639, July 1990.
[94] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28:976–990, 2010.
[95] Javier Portilla, Vasily Strela, Martin J. Wainwright, and Eero P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, 2003.
[96] L. R. Rabiner and B. Juang. Fundamentals of speech recognition. Prentice-Hall, Inc., 1993.
[97] Maxim Raginsky and Svetlana Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, pages 1509–1517, 2009.
[98] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In SIGKDD, 2012.
[99] C. Rao, A. Gritai, M. Shah, and T. Syeda-Mahmood. View-invariant alignment and matching of video sequences. In Proc. ICCV, pages 939–945, 2003.
[100] David A. Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. Incremental learning for robust visual tracking. IJCV, 77:125–141, 2008.
[101] S. Roth and M. J. Black. Fields of experts: a framework for learning image priors. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 860–867, 2005.
[102] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 2000.
[103] Yunus Saatci, Ryan Turner, and Carl Rasmussen. Gaussian process change point models. In Proc. ICML, 2010.
[104] L. K. Saul and S. T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.
[105] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local SVM approach. In Proc. ICPR, volume 3, pages 32–36, 2004.
[106] Fei Sha and Lawrence K. Saul. Analysis and extension of spectral methods for nonlinear dimensionality reduction. In Proc. ICML, pages 784–791, 2005.
[107] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, Alex Strehl, and Vishy Vishwanathan. Hash kernels. In AISTATS, pages 496–503, 2009.
[108] J. Shieh and E. Keogh. iSAX: indexing and mining terabyte sized time series. In SIGKDD, 2008.
[109] H. Shimodaira, K. I. Noma, M. Nakai, and S. Sagayama. Dynamic time-alignment kernel in support vector machine. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[110] K. Shin and T. Kuboyama. A generalization of Haussler's convolution kernel: mapping kernel. In Proc. ICML, pages 944–951, 2008.
[111] J. Shotton, A. Fitzgibbon, M. Cook, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.
[112] Leonid Sigal, Alexandru O. Balan, and Michael J. Black. HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 87:4–27, 2010.
[113] A. Singh, R. Nowak, and X. Zhu. Unlabeled data: Now it helps, now it doesn't. In NIPS, pages 1513–1520, 2009.
[114] M. Singh, I. Cheng, M. Mandal, and A. Basu. Optimization of symmetric transfer error for sub-frame video synchronization. In Proc. ECCV, volume 5303, pages 554–567, 2008.
[115] K. Sinha and M. Belkin. Semi-supervised learning using sparse eigenfunction bases. In NIPS, 2010.
[116] A. Smith, X. Huo, and H. Zha. Convergence and rate of convergence of a manifold-based dimension reduction algorithm. In NIPS, 2009.
[117] A. Smola, A. Gretton, L. Song, and B. Schoelkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory: 18th International Conference, pages 13–31. Springer-Verlag, Berlin/Heidelberg, 2007.
[118] R. Souvenir and R. Pless. Manifold clustering. In ICCV, 2005.
[119] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Proc. ICML, 2003.
[120] Praveen Srinivasan, Liming Wang, and Jianbo Shi. Grouping contours via a related image. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1553–1560, 2009.
[121] F. Steinke and M. Hein. Non-parametric regression between manifolds. In NIPS, pages 1561–1568, 2009.
[122] J. Sun, X. Wu, S. Yan, L. F. Cheong, T. S. Chua, and J. Li. Hierarchical spatial-temporal context modeling for action recognition. In Proc. of CVPR, pages 2004–2011, 2009.
[123] C. Tang and G. Medioni. Curvature-augmented tensor voting for shape inference from noisy 3D data. IEEE Trans. on PAMI, 24:858–864, 2002.
[124] Y. W. Teh and S. Roweis. Automatic alignment of local representations. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 15, pages 841–848. MIT Press, Cambridge, MA, 2003.
[125] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 2000.
[126] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, B, 6(3):611–622, 1999.
[127] W. Tong, C. Tang, P. Mordohai, and G. Medioni. First order augmentation to tensor voting for boundary inference and multiscale analysis in 3D. IEEE Trans. on PAMI, 26:594–611, 2004.
[128] Yaron Ukrainitz and Michal Irani. Aligning sequences and actions by maximizing space-time correlations. In Proc. ECCV, volume 3953, pages 538–550, 2006.
[129] R. Urtasun, D. J. Fleet, A. Geiger, J. Popovic, T. Darrell, and N. D. Lawrence. Topologically-constrained latent variable models. In Proc. ICML, pages 1080–1087, 2008.
[130] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. In Proc. CVPR, volume 1, pages 238–245, 2006.
[131] USC. http://iris.usc.edu/people/medioni/ntensorvoting.html.
[132] L. J. P. van der Maaten. Learning a parametric embedding by preserving local structure. In Proc. AISTATS, pages 384–391, 2009.
[133] R. Vidal, Y. Ma, and S. Sastry. Generalized Principal Component Analysis (GPCA). Springer Verlag, 2010.
[134] R. Vidal, R. Tron, and R. Hartley. Multiframe motion segmentation with missing data using power factorization and GPCA. IJCV, 79(1):85–105, 2008.
[135] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007.
[136] U. von Luxburg, A. Radl, and M. Hein. Getting lost in space: Large sample analysis of the commute distance. In NIPS, pages 2622–2630, 2011.
[137] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, 2010.
[138] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large scale search. IEEE PAMI, 2012.
[139] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Trans. on PAMI, 30:283–298, 2008.
[140] Y. Wang, Y. Jiang, Y. Wu, and Z.-H. Zhou. Spectral clustering on multiple manifolds. IEEE Transactions on Neural Networks, 22(7):1149–1161, 2011.
[141] Kilian Q. Weinberger, Fei Sha, and Lawrence K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proc. ICML, 2004.
[142] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In Proc. ICCV, 2007.
[143] Daniel Weinland, Mustafa Özuysal, and Pascal Fua. Making action recognition robust to occlusions and viewpoint changes. In ECCV, volume 6313, pages 635–648, 2010.
[144] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2009.
[145] L. R. Williams and K. K. Thornber. A comparison of measures for detecting natural shapes in cluttered backgrounds. IJCV, 34(2-3):81–96, 2000.
[146] John Wright, Yigang Peng, Yi Ma, Arvind Ganesh, and Shankar Rao. Robust principal component analysis: Exact recovery of corrupted low-rank matrices by convex optimization. In Advances in Neural Information Processing Systems, volume 22, pages 2080–2088, 2009.
[147] Tai-Pang Wu, Sai-Kit Yeung, Jiaya Jia, Chi-Keung Tang, and Gerard Medioni. A closed-form solution to tensor voting: Theory and applications. PAMI, 34:1482–1495, 2012.
[148] Xiang Xuan and Kevin Murphy. Modeling changing dependency structure in multivariate time series. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[149] J. Yan and M. Pollefeys. A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In Proc. ECCV, pages 94–106, 2006.
[150] P. Yan, S. M. Khan, and M. Shah. Learning 4D action feature models for arbitrary view action recognition. In Proc. CVPR, 2008.
[151] L. Zelnik-Manor and M. Irani. Statistical analysis of dynamic actions. IEEE Trans. on PAMI, 28:1530–1535, 2006.
[152] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, pages 1601–1608, 2005.
[153] Z. Zhang and H. Zha. Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal of Scientific Computing, 26:313–338, 2004.
[154] X. Zhao, D. Gong, and G. Medioni. Tracking using motion patterns for very crowded scenes. In Proc. ECCV, pages 315–328, 2012.
[155] X. Zhao and G. Medioni. Robust unsupervised motion pattern inference from video and applications. In Proc. ICCV, pages 715–722, 2011.
[156] Xuemei Zhao. Motion pattern learning and applications to tracking and detection. PhD Thesis, University of Southern California, 2013.
[157] Hua Zhong, Jianbo Shi, and M. Visontai. Detecting unusual activity in video. In Proc. CVPR, pages 816–823, 2004.
[158] D. Zhou, J. Huang, and B. Scholkopf. Learning with hypergraphs: Clustering, classification, and embedding. In NIPS, 2006.
[159] Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf. Ranking on data manifolds. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[160] Feng Zhou and Fernando De la Torre. Canonical time warping for alignment of human behavior. In Advances in Neural Information Processing Systems, volume 22, pages 2286–2294, 2009.
[161] Feng Zhou, Fernando De la Torre, and Jeffrey F. Cohn. Unsupervised discovery of facial events. In Proc. CVPR, pages 2574–2581, 2010.
[162] Feng Zhou, Fernando De la Torre, and Jessica K. Hodgins. Aligned cluster analysis for temporal segmentation of human motion. In Automatic Face and Gesture Recognition, pages 1–7, 2008.
[163] Feng Zhou, Fernando De la Torre, and Jessica K. Hodgins. Hierarchical aligned cluster analysis for temporal clustering of human motion. Accepted for publication at IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2012.
Abstract
This dissertation investigates a fundamental issue in machine learning and computer vision: unsupervised structure learning of high dimensional data from manifolds and multivariate time series, corrupted by noise and in the presence of outliers. Our primary goal is to accurately estimate the local data structure (e.g., the tangent space), and then infer the underlying global structure of the input data.

The local structure learning method we use is Tensor Voting (TV), a perceptual grouping approach initially proposed to infer the local geometric structure of data in 2D space. Standard Tensor Voting (STV) explicitly considers outliers, i.e., the irrelevant noise which is independent of the meaningful data. The underlying assumption made by STV is that the inlier data are noiseless. However, both outlier and inlier noise commonly exist in many areas of computer vision, e.g., motion estimation and tracking. By taking the uncertainty of position into consideration, this dissertation develops Probabilistic Tensor Voting (PTV) to handle both inlier noise and outliers. PTV combines the Bayesian framework with the geometric inference algorithm of STV, and the positions of the inlier data are changed from fixed vectors to random vectors. This dissertation also extends STV into a unified framework, which gives a clean and elegant interpretation of the voting algorithm in ND space. The vote, i.e., the propagation of geometric information, is represented as a low-rank matrix decomposition and diffusion process with a closed-form formulation, which only involves matrix multiplication and eigen-decomposition.

After estimating the local data structure, global structure learning is performed for high dimensional data and for multivariate time series.

Two (global) manifold learning tasks are included in this dissertation. The first one is nonparametric denoising on manifolds. We develop a closed-form algorithm, Locally Linear Denoising (LLD), to denoise data sampled from (sub)manifolds, and apply this algorithm to denoise images. The second one is robust multiple manifold clustering. We present a Robust Multiple Manifold Structure Learning (RMMSL) scheme to robustly estimate data structure under the assumption of multiple manifolds with low intrinsic dimensionality. RMMSL utilizes a robust manifold clustering method based on the local structure learning results. The proposed clustering method is designed to obtain the flattest manifold clusters by introducing a novel curved-level similarity function. Our approach is evaluated and compared to state-of-the-art methods on synthetic data, handwritten digit images, human motion capture data and motorbike videos.

Furthermore, this dissertation applies machine learning to structured time series, with applications to human motion analysis.

First, we address the problem of learning view-invariant 3D models of human motion from motion capture data in order to recognize human actions from a monocular video sequence with an arbitrary viewpoint. We propose a Spatio-Temporal Manifold (STM) model to analyze non-linear multivariate time series with latent spatial structure, and apply it to recognize actions in the joint-trajectories space. Based on STM, a novel temporal manifold alignment algorithm, Dynamic Manifold Warping (DMW), and a robust motion similarity metric are proposed for human action sequences, both in 2D and 3D. Second, we address the problem of unsupervised online segmentation of human motion sequences into different actions.
Kernelized Temporal Cut (KTC) is proposed to sequentially cut the structured sequential data into different regimes. KTC extends previous work on online change-point detection by incorporating Hilbert space embeddings of distributions to handle the nonparametric and high dimensionality issues. Moreover, a realtime implementation of KTC is proposed by incorporating an incremental sliding window strategy. Within a sliding window, segmentation is performed by a two-sample nonparametric hypothesis test based on the proposed spatio-temporal kernel. By combining the online temporal segmentation and spatio-temporal alignment algorithms, we can recognize, online, the actions of an arbitrary person seen from an arbitrary viewpoint, given realtime depth sensor input.
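To make the sliding-window idea concrete, the sketch below scores every candidate cut inside a synthetic sequence with a (biased) kernel maximum mean discrepancy statistic between the samples before and after the cut. This is only a generic illustration of the two-sample principle, not the KTC implementation described above: the Gaussian kernel, window length, threshold-free peak picking, and synthetic data are all assumptions for illustration, whereas KTC uses the proposed spatio-temporal kernel and an incremental update.

% Generic sliding-window two-sample sketch (illustration only, not KTC itself).
% pdist2 is from the Statistics and Machine Learning Toolbox.
T = 400; D = 5;
X = [randn(T/2, D); 2 + randn(T/2, D)];          % synthetic sequence with one regime change at T/2
w = 50;                                          % half-window length (assumed)
sigma = 1;                                       % Gaussian kernel bandwidth (assumed)
k = @(A, B) exp(-pdist2(A, B).^2 / (2 * sigma^2));
score = nan(T, 1);
for t = w + 1 : T - w
    A = X(t - w : t - 1, :);                     % samples before the candidate cut
    B = X(t : t + w - 1, :);                     % samples after the candidate cut
    KAA = k(A, A); KBB = k(B, B); KAB = k(A, B);
    score(t) = mean(KAA(:)) + mean(KBB(:)) - 2 * mean(KAB(:));   % biased MMD^2 estimate
end
[~, cut] = max(score);                           % the highest score marks the estimated change point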
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Learning the geometric structure of high dimensional data using the Tensor Voting Graph
Motion pattern learning and applications to tracking and detection
Scalable multivariate time series analysis
Kernel methods for unsupervised domain adaptation
Incorporating aggregate feature statistics in structured dynamical models for human activity recognition
Spatio-temporal probabilistic inference for persistent object detection and tracking
Noise-robust spectro-temporal acoustic signature recognition using nonlinear Hebbian learning
Tensor learning for large-scale spatiotemporal analysis
Leveraging training information for efficient and robust deep learning
Leveraging structure for learning robot control and reactive planning
Learning logical abstractions from sequential data
Transfer learning for intelligent systems in the wild
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
Modeling, learning, and leveraging similarity
Visual representation learning with structural prior
Inferring mobility behaviors from trajectory datasets
Feature learning for imaging and prior model selection
Multiple humans tracking by learning appearance and motion patterns
Learning distributed representations from network data and human navigation
Representation, classification and information fusion for robust and efficient multimodal human states recognition
Asset Metadata
Creator
Gong, Dian
(author)
Core Title
Structure learning for manifolds and multivariate time series
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
08/03/2014
Defense Date
04/25/2013
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
change point detection,clustering,human action recognition,human motion analysis,kernel methods,machine learning,manifold learning,multivariate time series,nonparametric models,OAI-PMH Harvest,probabilistic tensor voting,structure learning,temporal alignment
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Medioni, Gerard G. (
committee chair
), Jenkins, Brian Keith (
committee member
), Sha, Fei (
committee member
)
Creator Email
dian.gong@gmail.com,diangong@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-313397
Unique identifier
UC11295160
Identifier
etd-GongDian-1939.pdf (filename),usctheses-c3-313397 (legacy record id)
Legacy Identifier
etd-GongDian-1939.pdf
Dmrecord
313397
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Gong, Dian
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA