Human activity analysis with graph signal processing techniques
(USC Thesis Other)
University of Southern California
Ph.D. Dissertation

HUMAN ACTIVITY ANALYSIS WITH GRAPH SIGNAL PROCESSING TECHNIQUES

Author: Jiun-Yu Kao
Supervisor: Dr. Antonio Ortega

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy
(ELECTRICAL ENGINEERING)

December 2019

Copyright 2019 Jiun-Yu Kao

Abstract

Analyzing and understanding human motion has long been a popular yet challenging research area with a broad range of applications. Recently, the availability of reliable 2D or 3D positions of skeletal joints during actions has resulted in an increasing interest in developing automated action analysis systems utilizing skeleton-based motion data. In this Ph.D. dissertation, we explore model-based approaches to construct representations for captured skeleton-based motion data, taking into consideration prior knowledge about human skeletons. The main challenge in doing so is the irregularity in the skeletal structure and its corresponding motions. We propose to leverage graph structures to tackle this challenge, since graph structures have shown their superiority in modeling complex relationships between entities in irregular domains. In this work, we propose graph-based motion representations that start with a skeletal graph (including skeletal-temporal graphs) and then apply a graph transform such as the Graph Fourier Transform (GFT) or the Spectral Graph Wavelet Transform (SGWT) to the graph signal defined on the constructed graph, where the graph signal corresponds to motion data. We discuss the construction of skeletal and skeletal-temporal graphs and further derive the spatial and spectral properties associated with these types of graphs, including symmetric sub-graphs, GFT modes, spectrum multiplicities, fast GFT implementation and interpretations of the corresponding GFT basis.
We further discuss some properties of these graph-based representations, including their computational efficiency and ability to generalize to new datasets. As an extension, we explore the possibility of learning a set of action-dependent graphs for classification, where we propose a discriminative graph learning problem along with an iterative algorithm to solve it. A closed-form solution is further derived when graphs satisfy certain structures.

As for applications, we consider two real-world scenarios where skeleton-based motion data is captured to fulfill an automated action analysis task. The first application is to develop an automated mobility assessment system where the motions performed by patients with musculo-skeletal disorders are captured, automatically assessed, and utilized to predict their current medication states. We conduct thorough experiments with our proposed graph-based representation in order to assess its performance. Additionally, several factors for designing this general type of system are discussed, such as the environments and activity tasks on which they are deployed and the features and classifiers that can be used along with them. The second application is to develop a skeleton-based action recognition system, which is a popular research topic in the field of computer vision and machine learning. Employing the proposed representations is shown to lead to recognition performance comparable to the state-of-the-art, while at the same time providing benefits in significantly lower time complexity, robustness to noisy and missing data, and generalization to different datasets.

To my family and friends...

Acknowledgements

This dissertation and my doctoral journey would not have been possible without all the support from my teachers, collaborators, family, and friends. I would like to express my gratitude towards every one of them. First and foremost, I would like to thank my advisor, Professor Antonio Ortega, for his support and guidance.
He has guided me through every stage of the research process, all the way from critical thinking and problem formulation to idea development and presentation. All of this training from him helped me develop into a mature researcher. His considerable encouragement and insightful input have supported me through the process, especially at those tough moments when I felt lost or stuck. I could not have asked for a better advisor and mentor.

I would also like to express my gratitude to Professor C.-C. Jay Kuo and Professor Ram Nevatia for serving as my dissertation committee members, as well as Professor Justin P. Haldar and Professor Shrikanth Narayanan for being the committee members in my qualifying exam. Their generous feedback and comments have been a great help in improving this dissertation.

I would like to thank all my teachers at USC since their lectures have greatly benefited me. The efforts they put in explaining those advanced topics carefully established the basis of my research ability. I must also thank my collaborators and group members, Dr. Akshay Gadde, Dr. Aamir Anis, Dr. Yung-Hsuan (Jessie) Chao, Dr. Hilmi E. Egilmez, Dr. Eduardo Pavez, Keng-Shih Lu, and Pratyusha Das. I have enjoyed the discussions with them and their input has been immensely helpful to the development of this dissertation.

Finally, I would like to take this opportunity to express my profound gratitude to my parents and my sisters for their support and love. I would especially like to thank my fiancé, Kuan-Wen (James) Huang, for having all these insightful discussions with me as well as always being there for me.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
List of Algorithms
1 Introduction
  1.1 Related Work
  1.2 Contributions
    1.2.1 Graph-based Representations
    1.2.2 Applications in Automated Systems for Action Analysis
    1.2.3 Discriminative Graph Learning with Topology Regularization
  1.3 Organization of the Thesis
2 Background
  2.1 Graphs and Graph Signals
  2.2 Transforms on Graph
    2.2.1 Graph Fourier Transform
    2.2.2 Spectral Graph Wavelet Transform
3 Graph-based Representations for Motion Data
  3.1 Introduction
  3.2 Skeletal Graphs
    3.2.1 Skeletal Graph Construction
    3.2.2 Interpretations from GFT basis
    3.2.3 Semantic Group Sparse Eigenvectors
    3.2.4 The Role of Graph Weights
  3.3 Skeletal-temporal Graphs
  3.4 Proposed Graph-based Representation and its Properties
    3.4.1 Representation Extraction
    3.4.2 Properties of Proposed Representation
  3.5 Conclusion
4 Application: Skeleton-based Action Recognition
  4.1 Feature Design with Temporal Modeling
  4.2 Experimental Evaluation
    4.2.1 Datasets
    4.2.2 Evaluation Settings and Parameters
    4.2.3 Results on Classification Performance
    4.2.4 Robustness to Noisy Data
    4.2.5 Robustness to Missing Data
    4.2.6 Generalization to New Datasets
    4.2.7 Easy Combination with General Modeling Schemes
  4.3 Conclusion
5 Discriminative Graph Learning with Sparsity Regularization
  5.1 Introduction
  5.2 Problem Formulation
  5.3 Disc-GLasso Algorithm
  5.4 Closed-form Solution for Skeletal Graphs
  5.5 Experiments on Synthetic Data
  5.6 Experiments on Real Motion Data
  5.7 Conclusion
6 Conclusion and Future Work
Bibliography
A Case Study: Automated Mobility Assessment System
  A.1 Introduction
  A.2 Related Work
  A.3 System Design
  A.4 Feature Design and Classification
  A.5 Experiments and Evaluation
    A.5.1 Experimental Methodology
    A.5.2 Preprocessing and Methods
    A.5.3 Feature Performance
    A.5.4 Classifier Performance
    A.5.5 System Performance
    A.5.6 Impact of Task Difficulty
  A.6 Conclusion and Future Work

List of Figures

2.1 Four graph Laplacian eigenvectors of an example graph with $N = 8$ nodes. The blue and red bars represent positive and negative values at the signal components.
3.1 A 15-node skeletal graph.
3.2 Relative motions between joints $i$ and $j$. (a) Translation: $v_i - v_j = 0$. (b) With rotation: $v_i - v_j \neq 0$.
3.3 First four GFT basis vectors $u_1, \dots, u_4$. Blue square: positive value. Red dot: negative value. Green pentagram: zero.
3.4 Interpretation of projecting a movement that is opposite between upper and lower body to the 2nd GFT basis vector $u_2$ of $\mathcal{G}_1$.
3.5 Illustration of Theorems 1 and 2 with the 15-node unweighted skeletal graph. (a) The 15-node skeletal graph $\mathcal{G}$ with two subsets of vertices $S$ and $S^c$. (b) Induced sub-graph on $S$. (c)-(d) Eigenvectors and corresponding eigenvalues of the graph Laplacian of $\mathcal{G}_S$ and $\mathcal{G}$, respectively. Red, blue and green indicate positive, negative and zero values.
3.6 Length of relative motion vector in terms of rotation angle. In this example, $v_i = 0$.
3.7 Example of constructing a skeletal-temporal graph with $N_t = 2$.
3.8 (a) A GFT basis vector of $\mathcal{G}_s$. (b) Two GFT basis vectors of $\mathcal{G}_t$ with $N_t = 2$. (c)(d) Two GFT basis vectors of $\mathcal{G}_{st}$, each of which is the Kronecker product between (a) and one of (b).
3.9 Three options for constructing an unweighted skeletal graph with $N_s = 15$. (a) The graph constructed solely based on prior knowledge about skeletal structure.
    (b)(c) Two alternative graphs constructed with the knowledge that the motions of interest are walking-related actions.
3.10 Resulting first four GFT basis vectors $u_1, \dots, u_4$ of $\mathcal{G}_1$ and $\mathcal{G}_2$, respectively. Top row: $\mathcal{G}_1$. Bottom row: $\mathcal{G}_2$. Blue square: positive value. Red dot: negative value.
3.11 15 Kinect skeleton joints. Sign values of graph feature basis vectors: blue (+), red (-). Top: the proposed graph-based features. Notice that zero-crossings between neighboring vertices increase as the eigenvector corresponds to a higher eigenvalue (frequency). Bottom: PCA basis constructed with captured data on the x-axis.
3.12 Cumulative curve of ratio of explained energy versus graph spectrum. Red: with graph as $\mathcal{G}_1$. Blue: with graph as $\mathcal{G}_2$. Black: with graph as $\mathcal{G}_3$.
3.13 Energy distribution over GFT basis vectors of each captured motion sequence. Cross: walking. Circle: jumping.
4.1 Confusion matrix for using the proposed ST-SGWT+TPM feature on the MSR-Action3D dataset. Each cell represents the classification accuracy from white (0) to black (1), which is normalized by the number of instances in each category.
4.2 Confusion matrix for using the proposed ST-SGWT+TPM feature on the UTKinect-Action3D dataset. Each cell represents the classification accuracy from white (0) to black (1), which is normalized by the number of instances in each category.
4.3 Confusion matrix for using the proposed ST-SGWT+TPM feature on the Florence3D-Action dataset. Each cell represents the classification accuracy from white (0) to black (1), which is normalized by the number of instances in each category.
4.4 Examples demonstrating that the level of noise at a joint, measured by the variation in the length of the attached bone between consecutive frames, depends on the joint velocity.
4.5 Classification accuracy over PSNR on the MSR-Action3D dataset.
4.6 Classification accuracy over PSNR on the UTKinect-Action3D dataset.
4.7 Classification accuracy over PSNR on the Florence3D-Action dataset.
4.8 Classification accuracy over percentage of missing data on the MSR-Action3D dataset.
4.9 Classification accuracy over percentage of missing data on the UTKinect-Action3D dataset.
4.10 Classification accuracy over percentage of missing data on the Florence3D-Action dataset.
5.1 Visualization of the original graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ for generating synthetic graph signals of two categories.
5.2 Visualization of the learned graphs for two categories of graph signals with different graph learning methods.
5.3 Cumulative spectrum energy of test signals in each class on the learned graphs. G1 and G2 represent respectively the graph learned for each category.
5.4 The separation measure versus $r$ with Disc-GLasso compared to that of GLasso.
5.5 Classification accuracy versus $r$ with Disc-GLasso compared to that of GLasso.
5.6 Edge weights in the learned action-dependent skeletal graph for each action using the proposed closed-form solution of discriminative graphical lasso on the Florence3D dataset.
5.7 Edge weights in the learned action-dependent skeletal graph for each action using conventional graphical lasso on the Florence3D dataset.
5.8 Spectrum of motion sequences from one action class on different class-specific graphs.
A.1 Walking tasks trajectory used in the experiments. Arrows indicate direction or movement. Only segments shown in red were used to classify the medication state.

List of Tables

4.1 Comparison with the state-of-the-art results on the MSR-Action3D dataset.
4.2 Comparison with the state-of-the-art results on the MSR-Action3D dataset (following the protocol of [45]).
4.3 Comparison with the state-of-the-art results on the UTKinect-Action3D dataset.
4.4 Comparison with the state-of-the-art results on the Florence3D-Action dataset.
4.5 Average testing runtime (ms) using the Joint spatial graph kernel method and our proposed method.
4.6 Empirical PSNR at each joint for two datasets.
4.7 Average runtime and classification accuracy by using the proposed GFT-based representation and PCA-based representation when adapted to each new dataset.
4.8 Classification accuracy and runtime by using the Lie algebra based representation with DTW and FTP as proposed in [83] and by using the proposed GFT-based representation with the same DTW and FTP modeling on three datasets. Acc: classification accuracy. $T_r$: runtime required for extracting skeletal representations. $T_t$: runtime required for temporal modeling including DTW and FTP. $T_c$: runtime required for classification. $T_{total}$: overall runtime. $F_{speedup}$: speed-up factors on $T_r$.
5.1 Classification performance of utilizing the predefined unweighted skeletal graph and learned graphs with GLasso and Disc-GLasso on the Florence3D-Action dataset.
A.1 SVM performance for various features. Accuracy is reported in the format average accuracy (best accuracy/worst accuracy) across 14 subjects. A: Accuracy, P: Precision, R: Recall and F-M: F-measure. ALL: Gait, Angle, and Graph.
A.2 Performance of single classifier and multiple classifiers combination. A: Accuracy, P: Precision, R: Recall, F-M: F-measure, AP: Average of Probabilities, MV: Majority Voting, S: SVM, k: k-NN, D: Decision Tree, R: Random Forest.
A.3 Effect of normalization on subject-independent performance. A: Accuracy, P: Precision, R: Recall, F-M: F-measure.
A.4 Subject-independent performance of single classifiers. A: Accuracy, P: Precision, R: Recall, F-M: F-measure.
A.5 Subject-independent combination performance. A: Accuracy, P: Precision, R: Recall, F-M: F-measure, AP: Average of Probabilities, MV: Majority Voting, S: SVM, k: k-NN, D: Decision Tree, R: Random Forest.
A.6 Performance results for three walking tasks. Accuracy is reported in the format average accuracy (best accuracy/worst accuracy) across subjects. A, P, R and F-M denote respectively Accuracy, Precision, Recall and F-measure.

List of Algorithms

1 Disc-GLasso algorithm

Chapter 1
Introduction

With a simple glance at a scene, humans can easily observe, recognize and further analyze the activities performed by others. However, giving machines a similar ability remains challenging and has long been a popular research area, with a wide range of applications such as home monitoring systems to automatically assess the mobility of the elderly, tools for measuring the quality of movements for workers, surveillance systems and game consoles.

The first step to build such a system, where machines or computers can perceive and understand actions performed by humans, is to collect scene information with sensors. In general, multiple modalities, including 2D RGB video, depth maps, optical flows and skeleton-based data, have been utilized in such systems. A conventional choice for the sensor is an RGB camera, and analyzing human actions using 2D RGB video sequences has been extensively studied during the past few decades [1].
Although using an RGB camera has advantages in terms of lower cost and easier adaptability to various environments, it has limitations under more complex scenarios involving variations of scale, rotation, illumination and occlusion. Recently, with the advance of cost-effective depth sensors, e.g., Kinect sensors, as well as the development of powerful real-time tracking algorithms [33, 91], reliable 2D or 3D positions of skeletal joints during actions have become available, which can make human activity analysis systems more robust in those challenging scenarios. Therefore, in this work, we focus on using skeleton-based motion data, i.e., 2D or 3D coordinates associated with each skeletal joint, as the captured motion data, which we aim to represent, process, and analyze for improved automated activity analysis.

Assuming that skeleton-based motion data is available, an automated action analysis system typically requires extracting descriptive and compact information, i.e., representations or features, to characterize human actions from the captured motion data. Approaches to construct representations from data have been investigated for some time, and typically include procedures such as pre-processing, transforming and temporal modeling. Developing techniques to build suitable representations and extract features from captured motion data becomes essential because subsequent processing steps, e.g., classification, often operate on extracted features, rather than on raw motion data.

Since good representations are usually required to be compact and descriptive, energy compaction, i.e., the capability of embedding data within a subspace of smaller dimensionality, as well as discrimination between categories, are two desirable properties when it comes to designing representations. To achieve these two properties, previous work has relied heavily on data to construct representations.
For example, Principal Component Analysis (PCA) [32] based methods depend on data and its covariance to obtain the transform matrix, which is later applied to data. Statistical-based features also utilize data to select the set of informative statistics. Deep learning based methods require extensive usage of data as well in order to learn the best parameters for kernel filters. All in all, these data-driven approaches may have benefits in terms of achieving better energy compaction (as with PCA-based methods) or better discrimination between classes (as with deep learning based methods), since they are retrained for every different dataset and task. However, data-driven approaches build representations without explicitly considering the spatial dependency among body joints, which is critical for understanding human motion. Since actions are performed by human skeletons, and the physical restrictions that apply to skeletons are known a priori, our goal is to construct representations that incorporate knowledge about the skeletal structure, rather than relying solely on data.

As an alternative to data-driven techniques, in this work, we instead focus on model-based approaches to construct the representations. That is, we develop tools to transform motion data prior to recognition, where the transformation itself is not dependent on any training data and can be applied across multiple human motion analysis tasks. The main challenge in developing such representations is the irregularity in the skeletal structure and its corresponding motion, e.g., as compared to optical flow or dense motion fields on a regular grid of pixels. We will tackle this challenge by leveraging graph structure derived from the skeleton, together with graph signal processing approaches.
Recently, graph signal processing has introduced notions of frequency derived from spectral graph theory to process data in irregular domains [3, 13, 75], with successful applications in areas such as social networks [85], sensor networks [4], point clouds [2, 46, 79], etc. Therefore, in this thesis, we propose to utilize graph structure to model the skeleton and further apply graph signal processing approaches to the data, i.e., graph signals, which together serve as a model-based approach to construct the representations for captured motion data. These proposed representations have several potential advantages. Because the representation itself is independent of the data, it can provide improved robustness to noisy data. Furthermore, the representation can be used across multiple datasets, thus providing easier generalization, as well as improved computational efficiency, because the transformation is known a priori and its computation can be optimized when the system is implemented. Finally, because our proposed representations are based on the skeleton graph, they provide better interpretation and make it easier to compare results across different tasks.

In this work, we propose graph-based motion representations that start with a skeletal graph (including skeletal-temporal graphs) and then apply a graph transform such as the Graph Fourier Transform (GFT) or the Spectral Graph Wavelet Transform (SGWT) to the graph signals, i.e., motion data, defined on the constructed graph. We discuss the construction of a skeletal graph and a skeletal-temporal graph and further derive the spatial and spectral properties associated with this specific type of graph, including symmetric sub-graphs, GFT modes, spectrum multiplicities, fast GFT implementation and interpretations associated with the GFT basis. Furthermore, we discuss some desirable properties of these graph-based representations, including their computational efficiency and their ability to be generalized to new datasets.
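
The idea of a data-independent graph transform can be illustrated with a minimal sketch. It assumes the GFT basis is taken from the eigenvectors of the combinatorial graph Laplacian $L = D - W$, and it uses a hypothetical 5-joint chain with made-up joint coordinates rather than the skeletal graphs constructed later in the thesis. The function names `laplacian_eigenbasis`, `gft` and `inverse_gft` are illustrative only.

```python
import numpy as np


def laplacian_eigenbasis(W):
    """Eigen-decompose the combinatorial Laplacian L = D - W of an undirected graph.

    Returns the eigenvalues in ascending order (graph frequencies) and a matrix U
    whose columns are the matching orthonormal eigenvectors (assumed GFT basis).
    """
    degree = np.diag(W.sum(axis=1))
    laplacian = degree - W
    # eigh is used because the Laplacian of an undirected graph is symmetric.
    eigenvalues, U = np.linalg.eigh(laplacian)
    return eigenvalues, U


def gft(U, X):
    """Forward transform: project each column of X onto the basis vectors."""
    return U.T @ X


def inverse_gft(U, X_hat):
    """Inverse transform: rebuild the signal from its transform coefficients."""
    return U @ X_hat


# Hypothetical 5-joint chain (illustration only, not the 15-node skeletal graph).
num_joints = 5
W = np.zeros((num_joints, num_joints))
for i in range(num_joints - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

# Made-up 3D joint positions for one frame: one row per joint, columns x, y, z.
rng = np.random.default_rng(0)
X = rng.normal(size=(num_joints, 3))

eigenvalues, U = laplacian_eigenbasis(W)
X_hat = gft(U, X)               # transform coefficients, one column per axis
X_rec = inverse_gft(U, X_hat)   # reconstruction of the original signal
```

In this sketch the basis `U` depends only on the graph, so it can be computed once and reused for every frame and every dataset that shares the same skeleton, which is the property the text attributes to model-based representations.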

As for applications, we consider two real-world scenarios where skeleton-based motion data is captured and utilized to fulfill an automated action analysis task. The first application is to develop an automated mobility assessment system where motions performed by patients with musculo-skeletal disorders are captured, automatically assessed, and utilized to predict the current medication states of patients. We conduct thorough experiments with our proposed graph-based representation in order to evaluate it. Additionally, several general factors for designing this type of system are discussed, such as the environments and activity tasks for which they are deployed and the features and classifiers that are applied with them. The second application is to develop a skeleton-based action recognition system, which is a popular research topic in the field of computer vision and machine learning. Employing the proposed representations is shown to lead to recognition performance comparable to the state-of-the-art, while at the same time providing significantly lower time complexity, robustness to noisy and missing data, and easier generalization to different datasets.

In the last part of the dissertation, we consider a scenario where a set of activity-dependent graphs are constructed for classification. We first propose a general multi-category graph learning problem where we are interested in learning multiple graphs, each representing one category of signals (data samples), while discrimination between graphs is promoted. To address this problem, we formulate an optimization problem to encourage not only an efficient representation of signals belonging to a certain category on the graph corresponding to this category, but also the discrimination between graphs of different classes. Additionally, we provide an efficient algorithm to solve this general problem.
Moreover, since the graphs we are considering in this work are skeletal graphs, which form a special family of graphs that are tree-structured and symmetric, we can further derive a closed-form solution for our multi-category graph learning problem and greatly reduce the time complexity of solving it. We perform experiments on both synthetic and skeleton-based motion datasets to evaluate the performance of this discriminative graph learning scheme.

1.1 Related Work

Statistical Feature based Methods

State-of-the-art approaches to extract representations for human activity analysis are mostly data-driven. For example, in widely used principal component analysis (PCA) methods [71, 89], the representation, i.e., the transformation to be applied to the raw motion, is learned from data. Raptis et al. [71] employ PCA on the torso joint positions to estimate a human torso surface, and represent a human pose with the spherical angles between the limb joint positions and the torso surface. This approach also employs a Fourier transform over time to characterize the temporal structure of actions. Yang and Tian [89] compute the position difference between all pairs of joints within one frame, the joints of two consecutive frames, and the joints of one frame and the initial frame to capture the spatial and temporal configurations of human poses. PCA is then applied to the concatenated features to extract the so-called eigenjoints. With these PCA-based approaches, energy compaction can be guaranteed on the extracted representations. However, the principal components need to be recomputed when new datasets are considered, leading to difficulty in generalizing across datasets, as well as in comparing and interpreting results.

Aside from PCA-based methods, other data-driven approaches employ statistical features to represent the skeleton-based motion data. Ofli et al.
[17] find the top $N$ most informative joints according to the variance of joint angle and angular velocity, and construct feature vectors with the features of these most informative joints, where the best $N$ to be used is chosen based on data. Xia et al. [87] define a histogram of 3D joint locations, which extracts the histogram of spherical coordinates of the joint positions in a coordinate system that uses the hip joint as the origin. These approaches are better than PCA-based techniques in terms of interpretability, but they generally do not guarantee energy compaction. Moreover, they are sensitive to noise, which is a major factor in skeleton-based motion data due to errors in skeleton tracking.

Graph and Neural Network based Methods

The potential of leveraging graphs to represent the natural skeletal structure for human action modeling has attracted increasing attention from researchers. For example, in our early work [37, 39], we first proposed to construct an undirected skeletal graph and view human motion data as signals on the skeletal graph. We then applied a Graph Fourier Transform to the graph signals and interpreted the resulting transform coefficients in terms of skeletal correlation between joints. Furthermore, the development of graph convolutional networks (GCNs) [12, 26, 28, 41], which generalize convolutional neural networks (CNNs) from regular grid-structured data, such as images/videos, to irregular structured data associated with graphs, has made it possible to use graph-based data directly for classification. Yan et al. [88] and Shi et al. [48] utilize GCNs on skeleton-based data by constructing the skeletal-temporal graph and applying graph convolution in either the spatial domain or the spectral domain, achieving substantial improvements in terms of classification performance. Another work [10] also utilizes graph spectral filtering with the same skeletal graph, followed by a recurrent network (LSTM) for temporal modeling.
Although the graph neural network based approaches [10, 48, 88] utilize graphs to model prior knowledge about the human skeleton, their methods to extract representations are still data-driven, e.g., learning the best graph filtering functions [10, 48, 88] or learning the best graph structure [10, 48] from data. These data-driven approaches may have an advantage in terms of discriminating between different action categories, and show superior performance in the context of action recognition, but they cannot be easily generalized across datasets, other than by fully retraining the system for the new tasks.

1.2 Contributions

1.2.1 Graph-based Representations

To the best of our knowledge, this research is the first work that models human skeletal structure as a graph and leverages spectral graph theory and graph signal processing approaches to construct a skeletal-temporal representation for the captured motion data. We propose two graph construction methods, skeletal and skeletal-temporal graphs, and we further investigate their associated spatial and spectral properties. These properties not only justify why graph-based methods can achieve good performance in several applications, but can also make it easier to develop future graph-based schemes.

1.2.2 Applications in Automated Systems for Action Analysis

We validate our proposed graph-based representations by considering two real-world applications, where the goal is to build an automated system for action analysis. The first one is to develop automated mobility assessment systems for patients with mobility disorders. We present a thorough qualitative and quantitative examination of several general factors in designing such systems. The other application is skeleton-based action recognition, where we not only demonstrate that the proposed representations can achieve classification performance comparable to state-of-the-art systems, but also show that they possess additional desirable properties, such as robustness to noisy or missing data, computational efficiency and easy generalization.
1.2.3 Discriminative Graph Learning with Topology Regularization

We also develop an approach to learn class-specific skeletal graphs for classification. To do so, we first propose a general framework for constructing class-specific graphs for signals in multiple categories, and then derive an algorithm that maintains the interpretability of the learned graph of a given class, while also promoting discrimination between graphs of different classes. We are the first to propose a framework and algorithm for learning discriminative graphs across multiple classes of signals. Furthermore, as we are specifically interested in skeletal graphs, which have known topology and some favorable properties, we derive a closed-form solution to efficiently learn class-specific skeletal graphs from motion data. The algorithm is evaluated on both synthetic and captured motion data and we demonstrate its advantages in classification tasks.

1.3 Organization of the Thesis

The rest of the thesis is organized as follows. In Chapter 2, we briefly provide some background about graphs, graph signals and transforms defined on graphs. In Chapter 3, we start with the construction of skeletal graphs and skeletal-temporal graphs and their associated spatial and spectral properties. We then present the proposed graph-based representations for skeleton-based motion data utilizing either the GFT or the SGWT. We also discuss the interpretation of the GFT basis, and several desirable properties of the proposed representations. We then assess the proposed representations based on two real-world applications. The first application is skeleton-based action recognition, where a thorough comparison with state-of-the-art techniques for three public datasets is reported and discussed in Chapter 4. The second one is to develop an automated mobility assessment system, which is presented in Appendix A.
Finally, in Chapter 5, instead of utilizing pre-defined unity-weighted graphs to model the skeletal structure, we explore the possibility of learning action-dependent skeletal graphs, starting with a general framework for the class-specific graph learning problem together with a proposed block coordinate descent algorithm to solve it. We further consider the special family of skeletal graphs, for which a closed-form solution is derived and evaluated on captured motion data. In Chapter 6, we conclude this dissertation and discuss several possible directions for future work.

Chapter 2

Background

2.1 Graphs and Graph Signals

Graphs are generic and natural data representations for irregular domains. Examples of these domains span various types of networks, such as sensor, transportation and social networks, and numerous applications in digital images, videos and point clouds [13, 75]. In these examples, graphs, as a collection of vertices connected by edges, are utilized to structure data. Typically, the vertices represent the data entities while the edges represent the pairwise relationships between them. Under this framework, several graph-based signal processing approaches, such as transforming, filtering and sampling, can be adopted to further process and analyze the given data.

A graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ is defined in terms of a finite set of vertices $\mathcal{V}$ with $|\mathcal{V}| = N$ and a set of edges $\mathcal{E}$ with $|\mathcal{E}| = M$. An adjacency matrix $W$ can then be defined to represent the connectivity among vertices in $\mathcal{G}$. If there exists an edge $e \in \mathcal{E}$ connecting vertices $v_i$ and $v_j$, the entry $W_{ij}$ represents the weight of the edge $e = (v_i, v_j)$; otherwise, $W_{ij} = 0$. If a graph is undirected, that is, $v_i$ is connected to $v_j$ if and only if $v_j$ is connected to $v_i$, then $W$ is symmetric. If a graph is unweighted, then edges are unweighted and $W_{ij} \in \{0, 1\}, \forall i, j \in \{1, \dots, N\}$. Throughout this thesis, we will focus on undirected graphs and will consider both unweighted and weighted graphs.
Once a graph $\mathcal{G}$ is defined, a graph signal can be defined as a function $f: \mathcal{V} \to \mathbb{R}$, which associates each vertex $v \in \mathcal{V}$ in the graph with a scalar value $f(v)$. A graph signal may also be represented as a vector $f \in \mathbb{R}^N$, where $f_i$ or $f(i)$ represents the scalar value associated with vertex $v_i$. It is worth noting that, for a given graph, there can exist many different or varying graph signals. Graph signal processing techniques are developed in order to analyze and interpret these graph signals depending on the topology of the graph.

Algebraic Graph Representations

Most graph signal processing approaches are built using algebraic representations of the graphs as starting points. Aside from the adjacency matrix $W$ defined above, other popular algebraic representations for graphs include the following.

Incidence Matrix
The incidence matrix $B$ is an $N$ by $M$ matrix, where the $i$-th row represents the incidence of edges at vertex $v_i$ and each column corresponds to one of these edges. That is, if $B_{ik} \neq 0$, one end of $e_k$ terminates at vertex $v_i$. For an undirected graph, $B_{ik} = W_{e_k}$ if vertex $v_i$ and edge $e_k$ are incident and $0$ otherwise. For a directed graph, $B_{ik} = -W_{e_k}$ and $B_{jk} = W_{e_k}$ if edge $e_k$ points from vertex $v_i$ toward vertex $v_j$ with edge weight $W_{e_k}$.

Degree Matrix
The degree matrix $D$ is an $N$ by $N$ diagonal matrix whose diagonal entries represent the degree, i.e., the sum of edge weights, of the corresponding vertex. That is, $D_{ii} = \sum_{j \neq i} W_{ij}$.

Graph Laplacians
There are several types of graph Laplacian matrices, including the combinatorial graph Laplacian, the normalized graph Laplacian and the random walk graph Laplacian, as described below.

Definition 1 (Combinatorial graph Laplacian matrix) The combinatorial graph Laplacian matrix $L$ is an $N$ by $N$ matrix, defined as follows:
\[ L = D - W. \tag{2.1} \]
Since both $D$ and $W$ are symmetric, $L$ is also symmetric and, furthermore, each of its rows sums to zero, i.e., $L\mathbf{1} = \mathbf{0}$ where $\mathbf{1} = [1 \cdots 1]^\top$ and $\mathbf{0} = [0 \cdots 0]^\top$.
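As a small illustration of the definitions above, the following sketch builds $W$, $D$ and $L = D - W$ for a toy graph and checks that the rows of $L$ sum to zero. The 4-vertex weighted graph and the helper names are hypothetical, chosen only for this example, and plain Python lists are used to keep the sketch dependency-free.

```python
# Minimal sketch: adjacency, degree and combinatorial Laplacian matrices.
# The 4-vertex weighted graph below is a made-up example, not from the thesis.

def adjacency_matrix(num_vertices, weighted_edges):
    """Build the symmetric adjacency matrix W from (i, j, weight) triples."""
    W = [[0.0] * num_vertices for _ in range(num_vertices)]
    for i, j, weight in weighted_edges:
        W[i][j] = weight
        W[j][i] = weight  # undirected graph, so W is symmetric
    return W

def degree_matrix(W):
    """Diagonal matrix D with D_ii = sum over j != i of W_ij."""
    n = len(W)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = sum(W[i][j] for j in range(n) if j != i)
    return D

def combinatorial_laplacian(W):
    """L = D - W, as in (2.1)."""
    n = len(W)
    D = degree_matrix(W)
    return [[D[i][j] - W[i][j] for j in range(n)] for i in range(n)]

edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (0, 2, 0.5)]
W = adjacency_matrix(4, edges)
L = combinatorial_laplacian(W)

# Each row of L sums to zero, i.e. L applied to the all-ones vector is zero.
row_sums = [sum(row) for row in L]
print(row_sums)
```

The diagonal of `L` holds the vertex degrees and the off-diagonal entries hold the negated edge weights, which is all that $L = D - W$ says.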
Definition 2 (Normalized graph Laplacian matrix) The normalized graph Laplacian matrix $\mathcal{L}$ is an $N$ by $N$ matrix which normalizes the combinatorial graph Laplacian by the vertex degrees, defined as follows:
\[ \mathcal{L} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}, \tag{2.2} \]
\[ \mathcal{L}_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } D_{ii} \neq 0 \\ -\dfrac{W_{ij}}{\sqrt{D_{ii}}\sqrt{D_{jj}}} & \text{if } i \neq j \end{cases} \tag{2.3} \]
where the weight of each edge is normalized based on the degrees of the two vertices it connects.

Definition 3 (Random walk graph Laplacian matrix) The random walk graph Laplacian matrix $T$ is an $N$ by $N$ matrix defined as follows:
\[ T = D^{-1} W. \tag{2.4} \]
Unlike the combinatorial and normalized Laplacian matrices, the random walk Laplacian matrix $T$ is not symmetric, but each of its rows sums to 1, i.e., $T\mathbf{1} = \mathbf{1}$.

2.2 Transforms on Graph

2.2.1 Graph Fourier Transform

Following the definition of the combinatorial graph Laplacian, $L$ is regarded as a difference operator since, for any graph signal $f \in \mathbb{R}^N$, the following is always satisfied:
\[ (Lf)(i) = D_{ii} f_i - \sum_{j \neq i} W_{ij} f_j = \sum_{j \neq i} W_{ij} (f_i - f_j), \quad i = 1, \dots, N \tag{2.5} \]
where $(Lf)(i)$ represents the $i$-th component of the vector $Lf$ and $D_{ii} = \sum_{j \neq i} W_{ij}$.

Therefore, multiplication by $L$ can be viewed as a linear filter that operates within the 1-hop neighborhood of the graph. That is, given an input signal $f$, the output signal value at vertex $v_i$, $(Lf)(i)$, depends only on the input signal values at vertex $v_i$ and at the 1-hop neighboring vertices of $v_i$, i.e., $\mathcal{N}_{\mathcal{G}}(v_i) = \{v_j \in \mathcal{V} : (v_i, v_j) \in \mathcal{E}\}$.

In other words, the absolute value of $Lf$ serves as a measure of local signal variation. For example, when $v_i$ and its neighboring vertices all have the same value, i.e., there is no local signal variation, $(Lf)(i)$ attains its minimum absolute value, 0. In contrast, it will have a larger absolute value at vertex $v_i$ if there is more local variation around $v_i$.
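The difference-operator view in (2.5) can be checked on a toy graph. The sketch below evaluates $(Lf)(i) = \sum_{j \neq i} W_{ij}(f_i - f_j)$ directly from the adjacency matrix; the 4-vertex unweighted path graph and the two test signals are hypothetical examples, not data from this work.

```python
# Minimal sketch of the Laplacian as a 1-hop difference operator, as in (2.5).

def laplacian_apply(W, f):
    """Return Lf using (Lf)(i) = sum over j != i of W_ij * (f_i - f_j)."""
    n = len(W)
    return [sum(W[i][j] * (f[i] - f[j]) for j in range(n) if j != i)
            for i in range(n)]

# Unweighted path graph 0 - 1 - 2 - 3 (made-up example).
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

constant = [5.0, 5.0, 5.0, 5.0]   # no local variation anywhere
spike = [0.0, 0.0, 4.0, 0.0]      # strong local variation around vertex 2

lf_constant = laplacian_apply(W, constant)
lf_spike = laplacian_apply(W, spike)
print(lf_constant)
print(lf_spike)
```

The constant signal is mapped to zero at every vertex, while the spike produces its largest magnitude at vertex 2 and nonzero values only at that vertex and its 1-hop neighbors.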
While (2.5) measures the local variation around one vertex, we can further estimate the aggregate variation over all vertices in the graph using the Laplacian quadratic form:
\[ f^\top L f = \sum_{i=1}^{N} \sum_{v_j \in \mathcal{N}(v_i)} W_{ij} (f_i - f_j) f_i = \sum_{i, j;\, i < j} W_{ij} (f_i - f_j)^2, \tag{2.6} \]
when the graph is undirected.

When $f = \mathbf{1}$, i.e., the signal has the smallest possible aggregate variation across the graph, we have $\mathbf{1}^\top L \mathbf{1} = 0$, which demonstrates the validity of $f^\top L f$ as an estimator of overall variation. On the other hand, observing that $L\mathbf{1} = \mathbf{0} = 0 \cdot \mathbf{1}$ and according to the definitions of eigenvectors and eigenvalues, we know that $\mathbf{1}$, or $\frac{1}{\sqrt{N}}\mathbf{1}$ after normalization, is an eigenvector of the graph Laplacian matrix $L$ associated with eigenvalue 0. If the graph is undirected, $L$ is symmetric and thus has a full set of orthogonal eigenvectors. We can look for this set of eigenvectors by selecting $u_1 = \frac{1}{\sqrt{N}}\mathbf{1}$ as the first eigenvector, and then iteratively solving the following equation for the next eigenvector $u_k$:
\[ u_k = \operatorname*{arg\,min}_{\substack{f \perp u_1, \dots, u_{k-1} \\ \|f\| = 1}} f^\top L f, \quad k = 2, \dots, N \tag{2.7} \]

The above equation shows that the successive eigenvectors possess minimal aggregate signal variation while being orthogonal to the previously selected ones. In other words, $\{u_1, u_2, \dots, u_N\}$ is a set of eigenvectors associated with real, non-negative eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_N$, ordered from small to large aggregate variation across the graph. These eigenvalues thus provide a notion of frequency similar to that of the classical Fourier transform for 1D signals. In classical Fourier analysis, the eigenfunctions associated with higher frequencies have larger variation, that is, they oscillate more rapidly, while those associated with lower frequencies are smoother and have slower oscillations. The eigenvectors and eigenvalues of the graph Laplacian matrix provide a similar frequency interpretation. For example, the eigenvector $u_1$ associated with the smallest eigenvalue, which is 0, has the constant value $\frac{1}{\sqrt{N}}$ over all the vertices.
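The frequency ordering described above can be made concrete on an unweighted path graph, whose Laplacian eigenvectors have a well-known closed form (the DCT-II basis). This is a sketch under stated assumptions: the path graph with $N = 8$ vertices is a stand-in chosen because its eigenpairs are known analytically, it is not the example graph used in this chapter, and all function names are hypothetical.

```python
import math

def path_laplacian(n):
    """Combinatorial Laplacian of an unweighted path graph with n vertices."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L

def path_eigenpair(n, k):
    """k-th eigenpair (k = 0..n-1) of the path-graph Laplacian.

    Uses the known closed form: eigenvalue 2 - 2cos(pi*k/n) and the
    unit-norm eigenvector with entries cos(pi*k*(i + 0.5)/n).
    """
    lam = 2.0 - 2.0 * math.cos(math.pi * k / n)
    u = [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in u))
    return lam, [x / norm for x in u]

def quadratic_form(L, f):
    """Aggregate variation f^T L f, as in (2.6)."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

def sign_changes(u):
    """Number of sign changes between consecutive vertices of the path."""
    return sum(1 for a, b in zip(u, u[1:]) if a * b < 0)

N = 8
L = path_laplacian(N)
for k in range(N):
    lam, u = path_eigenpair(N, k)
    # eigenvalue, aggregate variation of the eigenvector, number of sign changes
    print(k, round(lam, 3), round(quadratic_form(L, u), 3), sign_changes(u))
```

For each unit-norm eigenvector the quadratic form equals its eigenvalue, so sorting by eigenvalue is the same as sorting by aggregate variation, and on this graph the number of sign changes grows with the eigenvalue, mirroring the oscillation picture from classical Fourier analysis.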
On the other hand, the eigenvectors associated with larger eigenvalues (higher frequencies) oscillate more between vertices, that is, vertices connected with heavier edges are more likely to have dissimilar values. Also, more zero crossings can be observed in these high-frequency eigenvectors. This is illustrated in Fig. 2.1, where four graph Laplacian eigenvectors of an example graph with 8 vertices are plotted.

Figure 2.1: Four graph Laplacian eigenvectors of an example graph with N = 8 nodes (panels labeled $\lambda$ = 0.00, 0.72, 3.23 and 5.54). The blue and red bars represent positive and negative values at the signal components.

With the Fourier-like frequency interpretation, this set of eigenvectors forms a transform known as the graph Fourier transform (GFT), which is denoted as $U = [u_1\ u_2\ \cdots\ u_N]$. For any graph signal $x \in \mathbb{R}^N$, its graph Fourier transform is defined as follows:
\[ \tilde{x} = U^\top x. \tag{2.8} \]
Thus, any graph signal $x$ can be represented in terms of the GFT basis:
\[ x = U \tilde{x}, \tag{2.9} \]
which is also known as the inverse graph Fourier transform of $\tilde{x}$.

2.2.2 Spectral Graph Wavelet Transform

Many classical signal processing tools have been generalized to the graph domain for processing graph signals. Examples include filtering, sampling, translation and convolution. While the GFT (defined in Section 2.2.1) serves as a generalization of the classical Fourier transform, generalizations of the classical wavelet filter banks are also possible. Wavelet transforms defined on graphs can simultaneously localize the graph signals in both the vertex domain and the graph spectral domain, which is a desirable property on many occasions. Recent works on designing graph wavelet transforms can roughly be divided into two schemes: vertex domain and graph spectral domain designs. As an example of vertex domain design, the graph wavelet transform (CKWT) proposed by Crovella and Kolaczyk [11] is defined based on the geodesic distance between vertices. On the other hand, the spectral graph wavelet transform (SGWT) proposed by Hammond et al.
[24] utilizes the dilated and translated versions of a band-pass kernel designed in the graph spectral domain, characterized by the eigenvectors and eigenvalues of the graph Laplacian matrix. In Chapter 3, besides adopting the GFT as an approach to build the representation for a motion sequence, we also propose an alternative that adopts the SGWT to build a multi-scale and localized representation.

Given a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ with $|\mathcal{V}| = N$ that is undirected and has non-negative edge weights, the SGWT can be realized by taking a kernel function $\hat{g}(\cdot) : \mathbb{R}^+ \to \mathbb{R}^+$, scaling its domain by $K$ scalars $\{t_1, \dots, t_K\} \subset \mathbb{R}^+$, and then centering each wavelet at each vertex by convolving it with an impulse graph signal $\delta_i \in \mathbb{R}^N$, which has value 1 at the $i$-th vertex and 0 everywhere else. The kernel function $\hat{g}(\cdot)$ needs to be a band-pass kernel which satisfies $\hat{g}(0) = 0$, $\lim_{\lambda \to \infty} \hat{g}(\lambda) = 0$, and an admissibility condition [24]. A spectral graph wavelet at scale $t_k$ localized around vertex $i$, denoted as $\psi_{t_k, i} \in \mathbb{R}^N$, can be written as follows:
\[ \psi_{t_k, i}(j) = \sum_{n=1}^{N} \hat{g}(t_k \lambda_n)\, u_n(i)\, u_n(j), \quad j = 1, \dots, N \tag{2.10} \]
where $\lambda_n$ and $u_n$ are the $n$-th eigenvalue and eigenvector of the graph Laplacian, respectively.

In addition to the wavelet functions, the SGWT also includes a scaling function, i.e., a low-pass filter, centered at each vertex, which represents the low-frequency content around each vertex in the graph. We denote the scaling function centered at the $i$-th vertex as $\phi_i$.

Finally, we can denote the entire transform as $\Psi : \mathbb{R}^N \to \mathbb{R}^{N(K+1)}$, which is given by
\[ \Psi = \left[ \phi_1, \dots, \phi_N, \psi_{t_1, 1}, \dots, \psi_{t_1, N}, \dots, \psi_{t_K, 1}, \dots, \psi_{t_K, N} \right] \tag{2.11} \]
as an $N$ by $N(K+1)$ transform matrix.

Given a graph signal $f \in \mathbb{R}^N$ residing on the graph, the SGWT coefficients can be represented as an $N(K+1)$-dimensional vector:
\[ c = \Psi^\top f, \quad c \in \mathbb{R}^{N(K+1)} \tag{2.12} \]

Furthermore, Hammond et al.
in [24] also propose to use truncated Chebyshev polynomials with $D$ terms to approximate the transform with a time complexity of $O(D|\mathcal{E}| + N(K+1)D)$, which is significantly lower than the $O(N^3)$ originally required when computing the transform involves obtaining the full eigen-decomposition of the graph Laplacian matrix $L \in \mathbb{R}^{N \times N}$.

Chapter 3

Graph-based Representations for Motion Data

3.1 Introduction

As introduced in Chapter 2, graphs provide a promising way to structure data and represent the relationships between entities. Graph signal processing techniques can then be used to efficiently manipulate and analyze data. Therefore, we propose to utilize a graph-based framework to establish an adequate representation for motion data.

In this work, we focus on developing representations for captured 3D human motion data. With the advances in sensor technologies, capturing reliable 3D motion has become cost effective, with multiple types of methods available for this task. For example, adopting a marker-based motion capture system, such as MoCap, can usually provide accurate keypoint locations of interest on the body but requires significant effort and cost to set up. An alternative is to use high-definition depth cameras, such as Microsoft Kinect, whose development has progressed rapidly in the past few years and which are now cost effective. Along with associated powerful skeleton tracking algorithms [33], these cameras can output estimated 3D human joint positions at each frame in real time. Either way, we can assume that the 3D coordinates of the body joints (or some pre-defined keypoints on the body) at each time stamp are available for every captured motion sequence. Therefore, all of the proposed representations will be developed based on the assumption that 3D coordinates of body joints have been estimated. Note that even though we assume that some estimates are available, these estimates may not be reliable and could be quite noisy.
Before describing the proposed representations, we first introduce the notation that will be used in the following sections, and formally define the format of our input motion data. Let $t \in \{0, \dots, T\}$ be the frame index of a frame in the motion sequence, and let $N_s$ be the number of tracked body joints or keypoints on the body. We denote the $N_s \times 1$ vectors $x_t$, $y_t$, and $z_t$ as the x-, y-, and z-coordinates of all the joint positions in frame $t$, where the $i$-th entries of the vectors, i.e., $x_t(i)$, $y_t(i)$, $z_t(i)$, represent the coordinates of joint $i$ in frame $t$. The movement of joint $i$ between two consecutive frames, $t-1$ and $t$, can be written as follows:
\[ \Delta x_t(i) = x_t(i) - x_{t-1}(i), \quad \Delta y_t(i) = y_t(i) - y_{t-1}(i), \quad \Delta z_t(i) = z_t(i) - z_{t-1}(i), \tag{3.1} \]
for $t \in \{1, \dots, T\}$ and $i \in \{1, \dots, N_s\}$. Then the motion vector associated with joint $i$ at $t$ can be written as $v_{t,i} = (\Delta x_t(i), \Delta y_t(i), \Delta z_t(i))$. We can further group the motion vectors of all the joints as $N_s \times 1$ vectors $\Delta x_t$, $\Delta y_t$, and $\Delta z_t$, which correspond to the motions in the x-, y-, and z-coordinates of all joints between frames $t-1$ and $t$. Finally, we define an $N_s \times 3$ matrix containing the motion vectors as
\[ V_t = \begin{bmatrix} v_{t,1} \\ \vdots \\ v_{t,N_s} \end{bmatrix} = \begin{bmatrix} \Delta x_t & \Delta y_t & \Delta z_t \end{bmatrix} \tag{3.2} \]

The rest of this chapter is organized as follows. In Section 3.2, we discuss the construction of a skeletal graph, together with its spatial and spectral properties. Next, we discuss the construction of a skeletal-temporal graph and its corresponding interpretations in Section 3.3. Then in Section 3.4.1 we define an approach to extract representations for given motion data, by leveraging either the GFT or the SGWT. Finally, in Section 3.4, we discuss several properties that are expected to be achieved with our proposed representations.

3.2 Skeletal Graphs

3.2.1 Skeletal Graph Construction

In order to efficiently model the human skeletal structure in motions, a special family of graphs can be constructed, which we name skeletal graphs.
A skeletal graph is constructed by representing the body joints with graph vertices while representing the connections or spatial relationships between pairs of joints with graph edges. This type of graph can be utilized to model human skeletal motion in scenarios where skeletal-based motion data is considered, regardless of whether this is 2D or 3D data or what set of joints is used. For example, if skeletal-based motion data is generated from videos with tracking algorithms such as OpenPose with 25-keypoint body keypoint estimation (e.g., the NTU RGB+D dataset [5]), then a skeletal graph with a vertex set of size 25 can be constructed. On the other hand, if the skeletal-based motion data is captured with a Kinect sensor with 15 or 20 keypoints (e.g., the Florence 3D action dataset [47] and the MSR Action3D dataset [45]), then a skeletal graph of size 15 or 20 should be constructed to model this skeletal structure. Overall, a graph constructed in the above fashion can be utilized to model the skeletal structure under general scenarios where skeletal-based motion data is available. An example of a 15-node skeletal graph is shown in Fig. 3.1.

Figure 3.1: A 15-node skeletal graph.

We model the human skeletal structure as a fixed undirected graph $\mathcal{G}_s = \{\mathcal{V}_s, \mathcal{E}_s, W\}$ with the vertex set $\mathcal{V}_s = \{v_1, v_2, \dots, v_{N_s}\}$ corresponding to the $N_s$ tracked body joints. The edge set $\mathcal{E}_s$ consists of undirected edges with non-negative weights, which are specified in $W$. $\mathcal{E}_s$ is decided based on prior knowledge about the human skeleton as follows: $v_i$ is connected to $v_j$ with a unity weight only if there exists a physical limb directly connecting the $i$-th and $j$-th body joints. In this way, the constructed skeletal graph $\mathcal{G}_s$ embeds the physical connectivity between body parts. Besides constructing graphs based on physical skeletal connectivity, some alternative graph formulations are possible when knowledge about the motions of interest is available, which will be discussed in more detail in Section 3.4.
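The construction rule above (a unity-weight edge wherever a physical limb directly connects two joints) can be sketched as follows. The joint names, their ordering and the limb list are assumptions made for illustration, in the style of a typical 15-joint depth-camera skeleton; they are not necessarily the exact vertex set or topology of Fig. 3.1.

```python
# Minimal sketch: unweighted skeletal graph from a list of physical limbs.
# JOINTS and LIMBS are hypothetical, not taken from the thesis.

JOINTS = ["head", "neck", "torso",
          "l_shoulder", "l_elbow", "l_hand",
          "r_shoulder", "r_elbow", "r_hand",
          "l_hip", "l_knee", "l_foot",
          "r_hip", "r_knee", "r_foot"]

LIMBS = [("head", "neck"), ("neck", "torso"),
         ("neck", "l_shoulder"), ("l_shoulder", "l_elbow"), ("l_elbow", "l_hand"),
         ("neck", "r_shoulder"), ("r_shoulder", "r_elbow"), ("r_elbow", "r_hand"),
         ("torso", "l_hip"), ("l_hip", "l_knee"), ("l_knee", "l_foot"),
         ("torso", "r_hip"), ("r_hip", "r_knee"), ("r_knee", "r_foot")]

def skeletal_adjacency(joints, limbs):
    """Adjacency matrix W with W_ij = 1 iff a limb directly connects i and j."""
    index = {name: k for k, name in enumerate(joints)}
    n = len(joints)
    W = [[0] * n for _ in range(n)]
    for a, b in limbs:
        i, j = index[a], index[b]
        W[i][j] = 1
        W[j][i] = 1  # undirected edge
    return W

W_s = skeletal_adjacency(JOINTS, LIMBS)
degrees = {name: sum(row) for name, row in zip(JOINTS, W_s)}
print(degrees)
```

Swapping in a different joint list and limb list (for instance a 20- or 25-keypoint layout) changes only the two constants, which reflects the point made above that the same construction applies regardless of the set of joints used.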
In this setting, each column vector in $V_t$ in (3.2), i.e., $\Delta x_t$, $\Delta y_t$ or $\Delta z_t$, can be regarded as a graph signal $f_d^{(t)} \in \mathbb{R}^{N_s}$ lying on the constructed skeletal graph $\mathcal{G}_s$, where $d \in \{x, y, z\}$ indicates the coordinate axis of choice. For example, $f_x^{(t)} = \Delta x_t$, which represents the motion along the x-axis across all the joints at frame $t$, is a graph signal defined on $\mathcal{G}_s$.

3.2.2 Interpretations from GFT basis

As mentioned previously, the motion along each dimension at a given time for all joints, i.e., $\Delta x$, $\Delta y$, and $\Delta z$, can be regarded as graph signals, where $t$ is omitted for simplicity. From (2.6), the variation of a signal, e.g., $\Delta x$, on the graph can be measured by the Laplacian quadratic form:
\[ (\Delta x)^\top L (\Delta x) = \sum_{(i,j) \in \mathcal{E}} w_{ij} \left( \Delta x(i) - \Delta x(j) \right)^2. \tag{3.3} \]
From the right-hand side of (3.3), this quantity is a weighted sum of squares of $\Delta x(i) - \Delta x(j)$, which is the x-component of $v_i - v_j$. Note that $v_i - v_j$ is the relative motion between joints $i$ and $j$, and this vector is zero if an exact translation is applied to joints $i$ and $j$. As a result, we see that a translation (as in Fig. 3.2(a)) makes no contribution to the variation (3.3) on the graph, while a motion with rotation between joints $i$ and $j$ (as in Fig. 3.2(b)) produces a positive variation.

Figure 3.2: Relative motions between joints $i$ and $j$. (a) Translation: $v_i - v_j = 0$. (b) With rotation: $v_i - v_j \neq 0$.

Recall from (2.8) that the graph Fourier transform of a graph signal $\Delta x$ is defined as $\widetilde{\Delta x} = U^\top \Delta x$, where $L = U \Lambda U^\top$, and therefore
\[ (\Delta x)^\top L (\Delta x) = (\Delta x)^\top U \Lambda U^\top (\Delta x) = \widetilde{\Delta x}^\top \Lambda\, \widetilde{\Delta x} = \sum_{i=1}^{N_s} \lambda_i\, \widetilde{\Delta x}(i)^2 \tag{3.4} \]
This means that the GFT characterizes a projection of joint motions ($\Delta x$, $\Delta y$, and $\Delta z$) onto graph Fourier modes $u_1$ to $u_{N_s}$ with increasing variations on the graph, and such a variation only captures the rotations between adjacent joints, but not the translations.

Figure 3.3: First four GFT basis vectors $u_1, \dots, u_4$ (panels: Eigenvector 1 to Eigenvector 4). Blue square: positive value. Red dot: negative value. Green pentagram: zero.
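The translation-versus-rotation argument around (3.3) can be reproduced on a toy example. The sketch below computes the motion vectors of (3.1) for a hypothetical three-joint arm and evaluates the variation of the x-axis graph signal; the joint positions, the 30 degree rotation and the function names are assumptions made only for illustration.

```python
import math

def motion_vectors(prev_frame, curr_frame):
    """Per-joint displacement between consecutive frames, as in (3.1)."""
    return [tuple(c - p for p, c in zip(prev_joint, curr_joint))
            for prev_joint, curr_joint in zip(prev_frame, curr_frame)]

def variation(edges, signal):
    """Laplacian quadratic form of (3.3): sum of w_ij * (s(i) - s(j))^2."""
    return sum(w * (signal[i] - signal[j]) ** 2 for i, j, w in edges)

# Hypothetical arm: shoulder (0) - elbow (1) - wrist (2), unit edge weights.
edges = [(0, 1, 1.0), (1, 2, 1.0)]
frame0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

# (a) Pure translation: every joint moves by the same offset.
translated = [(x + 0.5, y + 0.25, z) for x, y, z in frame0]

# (b) Rotation: the forearm rotates by 30 degrees about the elbow.
theta = math.radians(30)
rotated = [frame0[0], frame0[1],
           (1.0 + math.cos(theta), math.sin(theta), 0.0)]

variation_x = {}
for name, frame in [("translation", translated), ("rotation", rotated)]:
    V = motion_vectors(frame0, frame)
    dx = [v[0] for v in V]  # graph signal along the x-axis
    variation_x[name] = variation(edges, dx)

print(variation_x)
```

The translated frame gives a constant graph signal and hence zero variation, whereas the rotated forearm yields a strictly positive value, which is the behavior described for Fig. 3.2(a) and (b).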
Since the transform is defined on a skeleton-like graph, we can expect that the transform coefficients have an interpretation in terms of measuring the coordination between body parts. For example, Fig. 3.3 shows the first four GFT basis vectors of the 15-node skeletal graph in Fig. 3.1. While Eigenvector 1 extracts the lowest-frequency component from signals, Eigenvectors 2 and 3 extract the upper/lower body coordination and the left/right body coordination, respectively. Fig. 3.4 further illustrates, via a simplified example, how projecting graph signals, i.e., motion vectors, onto the GFT basis provides an interpretation in terms of the coordination between body parts during motions. We can see that, when the graph signal contains the joints in the upper body moving in one direction while those in the lower body move in the opposite direction, the projection of this graph signal onto the 2nd GFT basis vector will be large, i.e., the 2nd transform coefficient will have a large absolute value.

Figure 3.4: Interpretation of projecting a movement that is opposite between upper and lower body to the 2nd GFT basis vector $u_2$ of $\mathcal{G}_1$.

3.2.3 Semantic Group Sparse Eigenvectors

A sparse structure of the eigenvectors of the graph Laplacian, i.e., a graph Fourier basis with a small total number of nonzero elements, can lead to easier interpretation, and therefore we would like to derive the necessary conditions for the constructed skeletal graph to have sparse eigenvectors. These conditions can provide insights on how to construct a skeletal graph, e.g., how to assign weights to skeletal edges such that a set of sparse eigenvectors can still be obtained. The necessary and sufficient condition for the graph Laplacian of a general graph to have a 2-sparse eigenvector, i.e., an eigenvector with at most 2 nonzero elements, has been derived in [78].
Theorem 8 of [78] states that, for an undirected and connected graph with $N$ vertices, there exist nodes $i$ and $j$ such that $w_{ik} = w_{jk}, \forall k \in \{1, \dots, N\} \setminus \{i, j\}$, if and only if the graph Laplacian $L$ has a 2-sparse eigenvector. This indicates that the necessary condition for an undirected graph to have a 2-sparse eigenvector is to possess a certain symmetry property between two vertices in the graph. Note that skeletal graphs exhibit significant symmetry across groups of vertices, e.g., nodes on the left arm and those on the right arm. Since Theorem 8 of [78] only focuses on the symmetry between two vertices, rather than the symmetry between two groups of vertices that skeletal graphs exhibit, we would like to extend the work of [78] and investigate whether the symmetry between groups of vertices can have a similar relationship to a group-sparse structure in eigenvectors. Next, we derive a theorem to show that, under certain conditions, there exists a group-sparse eigenvector of the graph Laplacian of a skeletal graph.

Definition 4 (Symmetric sub-graphs) For an undirected graph $\mathcal{G}$ with vertex set $\mathcal{V}$ where $|\mathcal{V}| = N$, we say that $\mathcal{G}$ has symmetric sub-graphs if there exist two disjoint subsets of vertices, $S$ and $S^c$, such that the induced sub-graphs on $S$ and $S^c$ are the same. That is, $\mathcal{G}$ has symmetric sub-graphs if there exist $S = \{v_1, \dots, v_M\}$, $S^c = \{u_1, \dots, u_M\}$ and $S_r = \{1, \dots, N\} \setminus (S \cup S^c)$, such that:
1) $w_{v_i v_j} = w_{u_i u_j}, \forall i, j \in \{1, \dots, M\}$
2) $w_{v_i k} = w_{u_i k}, \forall i \in \{1, \dots, M\}, k \in S_r$
3) $w_{v_i u_j} = 0, \forall i, j \in \{1, \dots, M\}$.

Consequently, we can see that an unweighted skeletal graph always has symmetric sub-graphs. Weighted skeletal graphs with symmetric edge weights also have symmetric sub-graphs. Below we will derive the existence of group-sparse eigenvectors for those graphs with symmetric sub-graphs, including skeletal graphs.

Theorem 1 (2M-sparse eigenvectors of a skeletal graph) Let $W$ denote the adjacency matrix of an undirected and connected graph $\mathcal{G}$. Then, if the graph has two symmetric sub-graphs of size $M$, that is, it satisfies Definition 4, the graph Laplacian has at least one (2M)-sparse eigenvector.

Proof.
Assume that an undirected and connected graph $\mathcal{G}$ with $N$ vertices has symmetric sub-graphs. That is, there exist two disjoint subsets of vertices $S$ and $S^c$, each of size $M$, which have the same induced sub-graphs, i.e., they satisfy the conditions in Definition 4. Without loss of generality, we can assume that the first $M$ vertices belong to $S$ while the second $M$ vertices belong to $S^c$ and the remaining $N - 2M$ vertices belong to $S_r$. Let $W$ denote the adjacency matrix of the graph, then
\[ W = \begin{bmatrix} W_S & 0 & W_{S S_r} \\ 0 & W_{S^c} & W_{S^c S_r} \\ W_{S S_r}^\top & W_{S^c S_r}^\top & W_{S_r} \end{bmatrix}, \tag{3.5} \]
and the combinatorial graph Laplacian matrix is
\[ L = \begin{bmatrix} L_S & 0 & -W_{S S_r} \\ 0 & L_{S^c} & -W_{S^c S_r} \\ -W_{S S_r}^\top & -W_{S^c S_r}^\top & L_{S_r} \end{bmatrix}, \tag{3.6} \]
where $L_S \in \mathcal{M}_M$, $L_{S^c} \in \mathcal{M}_M$, $L_{S_r} \in \mathcal{M}_{N-2M}$ are the partitions of the graph Laplacian corresponding to each subset of vertices, with $\mathcal{M}_n$ denoting the set of square matrices of order $n$. Then, assume that $u \in \mathbb{R}^M$ is an eigenvector of $L_S$ with eigenvalue $\lambda$ ($u$ will also be an eigenvector of $L_{S^c}$ since, by construction, $L_S = L_{S^c}$ from Definition 4), and consider the following:
\[ L \begin{bmatrix} u \\ -u \\ 0 \end{bmatrix} = \begin{bmatrix} L_S u \\ -L_{S^c} u \\ -W_{S S_r}^\top u + W_{S^c S_r}^\top u \end{bmatrix} = \lambda \begin{bmatrix} u \\ -u \\ 0 \end{bmatrix} \tag{3.7} \]
where the last block vanishes because $W_{S S_r} = W_{S^c S_r}$ by condition 2) of Definition 4. Therefore, $\begin{bmatrix} u^\top & -u^\top & 0^\top \end{bmatrix}^\top$ is a (2M)-sparse eigenvector of the graph Laplacian $L$.

The proof above also leads to Theorem 2, which characterizes the relationships between the spectrum of the graph and the spectrum of the symmetric sub-graphs.

Theorem 2 (Spectral relationships between graph and symmetric sub-graphs) Assume that an undirected connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ has symmetric sub-graphs as in Definition 4. An induced sub-graph $\mathcal{G}_S = (S, \mathcal{E}_S)$ on $S$ can be constructed as follows:
1) $\forall v_i, v_j \in S$, if $e = (v_i, v_j) \in \mathcal{E}$, then $e \in \mathcal{E}_S$
2) $\forall v_i \in S$, include a self-loop edge at $v_i$ with weight $\sum_{v_k \sim v_i,\, v_k \in S_r} w_{v_i, v_k}$.
Then if $u$ is an eigenvector of the graph Laplacian of the induced sub-graph $\mathcal{G}_S$ with eigenvalue $\lambda$, then $\lambda$ is also an eigenvalue of the graph Laplacian of $\mathcal{G}$ with eigenvector $\begin{bmatrix} u^\top & -u^\top & 0^\top \end{bmatrix}^\top$.

Figure 3.5: Illustration of Theorems 1 and 2 with the 15-node unweighted skeletal graph. (a) The 15-node skeletal graph $\mathcal{G}$ with two subsets of vertices $S$ and $S^c$. (b) Induced sub-graph $\mathcal{G}_S$ on $S$. (c)-(d) Eigenvectors and corresponding eigenvalues of the graph Laplacian of $\mathcal{G}_S$ and $\mathcal{G}$, respectively. Red, blue and green indicate positive, negative and zero entry values.

Fig. 3.5 illustrates Theorems 1 and 2 with the 15-node unweighted skeletal graph $\mathcal{G}$. Fig. 3.5(a) shows that this graph has symmetric sub-graphs: the two subsets of vertices marked as blue circles and red squares represent a choice of sets $S$ and $S^c$ such that the conditions in Definition 4 are all satisfied. Fig. 3.5(b) shows the induced sub-graph $\mathcal{G}_S$ on $S$ constructed following the process described in Theorem 2. The eigenvectors and the corresponding eigenvalues of the graph Laplacian of $\mathcal{G}_S$ are shown in Fig. 3.5(c), where red and blue indicate positive and negative entry values. Based on Theorem 2, there exist 2M-sparse eigenvectors of the graph Laplacian of $\mathcal{G}$. Fig. 3.5(d) demonstrates that Theorem 2 holds, as three 6-sparse eigenvectors can be constructed based on the three eigenvectors of $\mathcal{G}_S$, with the same corresponding eigenvalues.

Figure 3.6: Length of relative motion vector in terms of rotation angle. In this example, $v_i = 0$.

3.2.4 The Role of Graph Weights

Based on (3.3), the variation on the graph, defined by the Laplacian quadratic form, is the weighted sum of squares of the relative motion between adjacent joints. Since $\Delta x(i)$, $\Delta y(i)$, and $\Delta z(i)$ are the three elements of $v_i$, we have
\[ (\Delta x)^\top L (\Delta x) = \sum_{(i,j) \in \mathcal{E}} w_{ij} \left( v_i(1) - v_j(1) \right)^2 \]
\[ (\Delta y)^\top L (\Delta y) = \sum_{(i,j) \in \mathcal{E}} w_{ij} \left( v_i(2) - v_j(2) \right)^2 \]
\[ (\Delta z)^\top L (\Delta z) = \sum_{(i,j) \in \mathcal{E}} w_{ij} \left( v_i(3) - v_j(3) \right)^2. \]
Note that these quantities all depend on the relative motion vector $v_i - v_j$.
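To make the last remark concrete, the sketch below checks numerically that the three per-axis quadratic forms add up to $\sum_{(i,j) \in \mathcal{E}} w_{ij} \|v_i - v_j\|^2$, so that the total variation is determined by the relative motion vectors alone. The three-joint chain, its edge weights and the motion vectors are hypothetical values chosen for illustration.

```python
# Minimal sketch: per-axis Laplacian quadratic forms versus relative motion.
# Edge list and motion vectors are made-up example values.

def axis_variation(edges, motion, axis):
    """Quadratic form of the motion signal along one axis (0 = x, 1 = y, 2 = z)."""
    return sum(w * (motion[i][axis] - motion[j][axis]) ** 2
               for i, j, w in edges)

def relative_motion_energy(edges, motion):
    """Sum over edges of w_ij * ||v_i - v_j||^2."""
    return sum(w * sum((a - b) ** 2 for a, b in zip(motion[i], motion[j]))
               for i, j, w in edges)

edges = [(0, 1, 1.0), (1, 2, 2.0)]           # (i, j, w_ij)
motion = [(0.0, 0.0, 0.0),                   # v_0: this joint does not move
          (0.5, -0.25, 0.0),                 # v_1
          (1.0, 0.5, 0.25)]                  # v_2

per_axis = [axis_variation(edges, motion, axis) for axis in range(3)]
total = relative_motion_energy(edges, motion)
print(per_axis, sum(per_axis), total)
```

Adding the same offset to every motion vector leaves all four numbers unchanged, since only the differences $v_i - v_j$ enter the sums.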
Let the length of the bone connecting joints $i$ and $j$ be $d_{ij}$, and let the rotation angle corresponding to the relative motion between joints $i$ and $j$ be $\theta$, as shown in Fig. 3.6. Then, the norm of $v_i - v_j$ can be expressed as $2 d_{ij} \sin(\theta/2)$. This means that the quantities in the summations above are proportional to both the edge weight and the squared bone length:
\[ w_{ij} \left( v_i(k) - v_j(k) \right)^2 \propto w_{ij}\, d_{ij}^2 \sin^2(\theta/2), \quad \text{for } k = 1, 2, 3. \tag{3.8} \]
By (3.8), this means that, with the same rotation angle, moving a longer bone (e.g., edge $(10, 11)$, the bone between knee and ankle) introduces a greater increase in the variation than moving a shorter bone (e.g., edge $(4, 5)$, the bone between shoulder and elbow). Based on this we can propose an alternative choice of weights, i.e., define $w_{ij} = 1/d_{ij}^2$. In this way, the variation that the Laplacian quadratic form measures will be independent of the limb length, and will depend on the rotation angles only. We will discuss alternative approaches for learning the edge weights in general graphs, as well as in skeletal graphs, in Chapter 5.

Figure 3.7: Example of constructing a skeletal-temporal graph with $N_t = 2$ (legend: skeletal edge, temporal edge).

Figure 3.8: (a) A GFT basis vector of $\mathcal{G}_s$. (b) Two GFT basis vectors of $\mathcal{G}_t$ with $N_t = 2$. (c)(d) Two GFT basis vectors of $\mathcal{G}_{st}$, each of which is the Kronecker product between (a) and one of (b).

3.3 Skeletal-temporal Graphs

The skeletal graph can be augmented in order to better capture temporal dynamics, by forming a new graph that incorporates temporal edges, using the graph Cartesian product defined as follows.

Definition 5 (Graph Cartesian product) Given two unweighted undirected graphs $\mathcal{G} = \{\mathcal{V}_{\mathcal{G}}, \mathcal{E}_{\mathcal{G}}\}$ and $\mathcal{H} = \{\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}}\}$, the graph Cartesian product of $\mathcal{G}$ and $\mathcal{H}$, denoted as $\mathcal{G} \square \mathcal{H}$, is a graph whose vertex set is $\mathcal{V}_{\mathcal{G}} \times \mathcal{V}_{\mathcal{H}}$, the Cartesian product of the vertex sets of $\mathcal{G}$ and $\mathcal{H}$. Each vertex in $\mathcal{G} \square \mathcal{H}$ is denoted as $(v_i, w_j)$ where $v_i \in \mathcal{V}_{\mathcal{G}}$ and $w_j \in \mathcal{V}_{\mathcal{H}}$.
Two vertices $(v, w)$ and $(v', w')$ are connected in the product graph if and only if $v = v', (w, w') \in \mathcal{E}_{\mathcal{H}}$ or $w = w', (v, v') \in \mathcal{E}_{\mathcal{G}}$.

The fixed undirected skeletal-temporal graph $\mathcal{G}_{st}$ with $N_t$ temporal nodes can then be obtained by taking the Cartesian product of a skeletal graph $\mathcal{G}_s$ as defined in Section 3.2 and an unweighted temporal line graph $\mathcal{G}_t$ with $N_t$ vertices, i.e., $\mathcal{G}_{st} = \mathcal{G}_s \square \mathcal{G}_t$. Fig. 3.7 shows an example of a skeletal-temporal graph with 2 temporal nodes.

Graph signals: Any spatial-temporal cube of length $N_t$ in the motion sequence can be regarded as a graph signal residing on the skeletal-temporal graph. Specifically, a graph signal $f_d^{(t)} \in \mathbb{R}^{N_s N_t}$ can be defined on this skeletal-temporal graph $\mathcal{G}_{st}$ by setting $f_d^{(t)}(i, s) = \Delta d_{t+s-1}(i)$, where $i \in \{1, \dots, N_s\}$, $s \in \{1, \dots, N_t\}$, for any $d \in \{x, y, z\}$ and $t \in \{1, \dots, T - N_t + 1\}$.

The GFT basis of a skeletal-temporal graph is closely related to the GFT bases of $\mathcal{G}_s$ and $\mathcal{G}_t$. Assume that $x$ is a GFT basis vector of $\mathcal{G}_s$ ($x$ as an eigenvector of $W_{\mathcal{G}_s}$ with eigenvalue $\lambda$) and $y$ is a GFT basis vector of $\mathcal{G}_t$ ($y$ as an eigenvector of $W_{\mathcal{G}_t}$ with eigenvalue $\mu$). Then, since
\[ W_{\mathcal{G}_{st}} (x \otimes y) = (W_{\mathcal{G}_s} \otimes I_{N_t})(x \otimes y) + (I_{N_s} \otimes W_{\mathcal{G}_t})(x \otimes y) = W_{\mathcal{G}_s} x \otimes I_{N_t} y + I_{N_s} x \otimes W_{\mathcal{G}_t} y = (\lambda + \mu)(x \otimes y) \tag{3.9} \]
where $\otimes$ denotes the Kronecker product, we can see that $x \otimes y$ is a GFT basis vector of $\mathcal{G}_{st}$. That is, the GFT basis vectors of the skeletal-temporal graph can be derived as the Kronecker products between basis vectors of the skeletal graph and basis vectors of the temporal line graph. Fig. 3.8 provides an illustrative example.

3.4 Proposed Graph-based Representation and its Properties

3.4.1 Representation Extraction

Once one of the two types of graphs discussed above, i.e., the skeletal graph or the skeletal-temporal graph, is chosen, we can use transforms associated with the graph to represent motion data. Specifically, for the GFT-based representation, we apply the GFT to the graph signals $f_d^{(t)}$: $\tilde{f}_d^{(t)} = U^\top f_d^{(t)}$ as in (2.8), and utilize $\tilde{f}_d^{(t)}$ as the representation.
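The GFT-based extraction step, together with the Kronecker structure of (3.9), can be sketched on a toy skeletal-temporal graph. Several assumptions are made here for illustration only: the "skeleton" is a three-joint chain (a path graph, so its Laplacian eigenvectors are known in closed form), $N_t = 2$, the Kronecker-sum argument of (3.9) is applied to the Laplacian rather than to $W$ (a standard identity for graph Cartesian products), and the signal values and function names are hypothetical.

```python
import math

def path_laplacian(n):
    """Combinatorial Laplacian of an unweighted path graph with n vertices."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L

def path_gft_basis(n):
    """Closed-form (DCT-II) eigenpairs of the path-graph Laplacian."""
    basis = []
    for k in range(n):
        lam = 2.0 - 2.0 * math.cos(math.pi * k / n)
        u = [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in u))
        basis.append((lam, [a / norm for a in u]))
    return basis

def kron_vec(x, y):
    """Kronecker product of two vectors; entry (i, s) is stored at i*len(y)+s."""
    return [a * b for a in x for b in y]

def kron_sum(A, B):
    """Kronecker sum A (x) I + I (x) B for square matrices A and B."""
    n, m = len(A), len(B)
    M = [[0.0] * (n * m) for _ in range(n * m)]
    for i in range(n):
        for j in range(n):
            for s in range(m):
                M[i * m + s][j * m + s] += A[i][j]
    for i in range(n):
        for s in range(m):
            for r in range(m):
                M[i * m + s][i * m + r] += B[s][r]
    return M

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

N_s, N_t = 3, 2
L_st = kron_sum(path_laplacian(N_s), path_laplacian(N_t))
basis_st = [(lam + mu, kron_vec(x, y))
            for lam, x in path_gft_basis(N_s)
            for mu, y in path_gft_basis(N_t)]

# Graph signal f(i, s) on the skeletal-temporal graph (made-up values).
f = [0.1, 0.2, 0.0, -0.1, 0.4, 0.3]
coeffs = [dot(u, f) for _, u in basis_st]              # forward GFT
reconstructed = [sum(c * u[n] for c, (_, u) in zip(coeffs, basis_st))
                 for n in range(N_s * N_t)]            # inverse GFT
print([round(c, 4) for c in coeffs])
```

Because the basis is orthonormal, the inverse transform recovers the signal exactly up to rounding, and each Kronecker-product vector is an eigenvector of the product-graph Laplacian with eigenvalue $\lambda + \mu$. For a real skeleton the closed-form basis would be replaced by a numerical eigendecomposition of the skeletal graph Laplacian.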
As for the SGWT-based representation, we follow (2.11) to apply the SGWT to the graph signals $f_d^{(t)}$ and use the resulting transform coefficients $c_d^{(t)}$ as the representation.

3.4.2 Properties of Proposed Representation

Because the proposed representation is constructed with a pre-defined skeleton-like graph, the structure of its eigenvectors can be interpreted in terms of the coordination between body parts, as we saw in Section 3.2.2. Besides, as the graph can be built independently from the data (i.e., we do not need the data to construct the skeletal or skeletal-temporal graph), several desirable properties can be achieved. We discuss them in this section.

To validate these properties, we apply the proposed representation to the CMU Graphics Lab Motion Capture Database [8], which was captured with a motion capture system (i.e., Vicon) and thus achieves better accuracy in the estimated positions of tracked joints than data captured with depth cameras (e.g., Kinect). Although the motion capture system has 41 markers taped on a black jumpsuit and tracked with 12 infrared cameras, we only use a subset of 15 joints and their associated positions as input sequences.

As seen in Section 3.2.1, we can construct an unweighted skeletal graph $\mathcal{G}_1$ solely based on prior knowledge about the human skeleton. That is, we connect two vertices with a unity edge only when their corresponding joints are connected directly by a physical limb, e.g., between shoulder and elbow or between knee and ankle. This is shown in Fig. 3.9(a). It is worth pointing out that the choice of graph, i.e., the choice of edge set, is critical to the representation, as it can change the GFT basis structure significantly. Thus, alternatively, if we have a better idea about the motion of interest, we may use this prior knowledge to choose a better edge set. Take walking motion as an example, where bilateral symmetry is regarded as common in normal human gait [22]. In this case, we can construct the skeletal graph with the edge set as shown in Fig.
3.9(b)(c), which makes vertices more connected within either the left or the right part of the body. The alternative graph formulations $\mathcal{G}_2$, $\mathcal{G}_3$ in Fig. 3.9(b)(c) lead to changes in the GFT basis structure, e.g., they will encourage more sign changes (zero crossings) symmetrically across the center.

Fig. 3.10 further illustrates this point by showing the first 4 GFT basis vectors (out of a total of 15) for the different graph choices of Figs. 3.9(a) and (b). The top row of eigenvectors is associated with the graph $\mathcal{G}_1$, constructed solely based on the natural human skeleton (as in Fig. 3.9(a)), while the bottom row is associated with the graph $\mathcal{G}_2$, constructed with the prior knowledge that walking-related motions, which exhibit bilateral symmetry, are of interest (as in Fig. 3.9(b)). We can notice that, by connecting the vertices according to the prior bilateral symmetry assumption, the structure of the eigenvectors is significantly changed. For example, the 2nd eigenvector, related to the min-cut of the graph, is changed to be symmetric between the left and right sides of the body rather than between the upper and lower body. Since the choice of edge set changes the GFT basis structure, it may lead to different performance of the resulting representations in terms of the desirable properties, such as interpretability and energy compaction, which will be discussed in the following sections.

Figure 3.9: Three options for constructing an unweighted skeletal graph with $N_s = 15$. (a) $\mathcal{G}_1$: the graph constructed solely based on prior knowledge about the skeletal structure. (b)(c) $\mathcal{G}_2$, $\mathcal{G}_3$: two alternative graphs constructed with the knowledge that the motions of interest are walking-related actions.

Figure 3.10: Resulting first four GFT basis vectors $u_1, \dots, u_4$ of $\mathcal{G}_1$ and $\mathcal{G}_2$, respectively. Top row: $\mathcal{G}_1$. Bottom row: $\mathcal{G}_2$. Blue square: positive value. Red dot: negative value.
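The dependence of the GFT basis on the edge set can be checked numerically. The following is a minimal sketch, assuming the GFT basis is taken as the eigenvectors of the combinatorial Laplacian $L = D - W$ of an unweighted graph. The 6-vertex chain and the extra edge are hypothetical toy graphs, not the 15-joint graphs of Fig. 3.9, and the helper names `gft_basis` and `zero_crossings` are introduced here for illustration only.

```python
import numpy as np


def gft_basis(num_vertices, edges):
    """Eigen-decompose the combinatorial Laplacian L = D - W of an unweighted graph."""
    W = np.zeros((num_vertices, num_vertices))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0          # unity edge weight, undirected
    L = np.diag(W.sum(axis=1)) - W
    eigvals, U = np.linalg.eigh(L)       # ascending eigenvalues, orthonormal columns
    return eigvals, U


def zero_crossings(u, edges, tol=1e-9):
    """Count edges whose two endpoints carry clearly opposite signs in u."""
    return sum(1 for i, j in edges if u[i] * u[j] < -tol)


# Hypothetical 6-vertex toy graph (a chain), not the skeleton of Fig. 3.9.
chain_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
# Same vertices with one extra edge, standing in for a different prior on the motion.
alt_edges = chain_edges + [(1, 4)]

for name, edges in [("chain", chain_edges), ("chain + (1,4)", alt_edges)]:
    eigvals, U = gft_basis(6, edges)
    signs = np.sign(np.round(U[:, 1], 9)).astype(int)
    print(name, "| 2nd eigenvalue:", round(float(eigvals[1]), 4),
          "| signs of 2nd eigenvector:", signs)
```

Eigenvectors are only defined up to a global sign, so the printed sign patterns may be flipped as a whole. The point of the sketch is that the transform matrix `U` depends on the edge set alone, so changing the edges changes the basis without touching any motion data.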
Better Interpretability

Principal component analysis (PCA), whose variants are popular for representing spatio-temporal data such as motion capture data, is one special case of our proposed graph-based feature. Considering a fully connected graph with edge weights set according to the covariance in the data, the resulting graph-based basis vectors are exactly the principal components of the PCA method. However, constructing the graph without taking the data into consideration, as proposed here, can lead to an easier interpretation of the results compared to PCA. Fig. 3.11 shows a comparison of the structure of the basis vectors obtained using our proposed graph-based features and PCA. We can observe that the component of the data captured by each PCA basis vector is more difficult to interpret than that captured by the GFT basis vectors. For example, the third eigenvector of PCA includes an isolated vertex in the left leg, and the component on the fourth eigenvector of PCA is hard to interpret as coordination between the upper and lower body. Furthermore, in the evaluation section, our proposed feature is shown to achieve performance comparable to that of PCA. Finally, our proposed basis is not data-dependent, while PCA depends strongly on the training dataset. This lack of data dependence makes it easier to compare results across different subjects, tasks, coordinate systems or datasets.

Figure 3.11: 15 Kinect skeleton joints. Sign values of graph feature basis vectors (Eigenvectors 1 to 4): blue (+), red (-). Top: the proposed graph-based features. Notice that zero-crossings between neighboring vertices increase as the eigenvector corresponds to a higher eigenvalue (frequency). Bottom: PCA basis constructed with captured data on the x-axis.

Energy Compaction

As mentioned in the introduction, energy compaction is important.
This property allows us to represent the data approximately with a low-dimensional subspace of the feature space constructed by the representation. Compaction in the representation can lead to many benefits, such as better dimensionality reduction or improved data compression.

In order to validate that this representation provides energy compaction, we use four motion sequences in the CMU MoCap Database that contain walking motion. We first construct the skeletal graph, either $\mathcal{G}_1$, $\mathcal{G}_2$, or $\mathcal{G}_3$, and the respective GFT. Within each sequence, we compute the representation based on the method in Section 3.4.1 frame by frame, i.e., taking the GFT of the graph signals. We then take the squared magnitude of these projections and measure the ability of each basis vector (graph Laplacian eigenvector) to explain the original motion sequence. Fig. 3.12 shows that almost 90% of the energy in the motion sequences can be explained by the first 6 (out of 15) low-frequency eigenvectors. This validates that the proposed representation provides energy compaction, which makes it suitable for dimensionality reduction and for compression applications [42].

Figure 3.12: Cumulative curve of ratio of explained energy versus graph spectrum. Red: with graph as $\mathcal{G}_1$. Blue: with graph as $\mathcal{G}_2$. Black: with graph as $\mathcal{G}_3$. (Horizontal axis: frequency in spectral graph domain. Vertical axis: cumulative explained energy ratio.)

Let us take a closer look at the cumulative curves of explained energy versus graph spectrum in Fig. 3.12. With the same graph signals, i.e., motion data, the representation based on $\mathcal{G}_2$ contains more high-frequency components. This makes sense because this formulation (Fig. 3.9(b)) introduces stronger connectivity within the same side of the body, between upper and lower limbs.
Since in a natural walking motion these tend to move in opposite directions, this choice of graph leads to more energy in higher frequencies.

Discrimination between Actions

Aside from the energy compaction property, when the representation is applied to multiple action categories, we would also like different actions to be separable in the feature space constructed by the proposed representation. This can be beneficial for classification applications. To achieve this, the representation needs to be capable of exploiting the critical components in the motion sequence that are best at discriminating between categories.

We validate whether the proposed representation discriminates between different motion categories with the following experiment. We select 8 motion sequences from the CMU MoCap Database, 4 containing walking motion and the other 4 containing jumping motion. As in the previous experiment, the ratio of explained energy using each basis vector is plotted in Fig. 3.13. The results show that all the walking tasks are separated quite well from all the jumping tasks. It is also worth mentioning that these two motion classes can be well separated using only the first several eigenvectors, such as the 2nd, 3rd, 4th and 5th ones. Notice that, as shown in Fig. 3.10, the 2nd eigenvector can exploit the correlation between the upper and lower body, while the 3rd and 4th eigenvectors can exploit the symmetry or bilateral symmetry embedded in human motions. Take the walking tasks for example. Work in the literature has shown that natural human walking motion possesses bilateral symmetry [22, 23, 31] while jumping motion does not.

Figure 3.13: Energy distribution over GFT basis vectors of each captured motion sequence. Cross: walking. Circle: jumping. (Horizontal axis: index of Laplacian eigenvector. Vertical axis: energy on eigenvector / total energy. Legend, sequence labels as extracted: 105 39, 105 40, 105 41, 105 42, 105 13, 105 14, 105 29, 105 31.)
This fact corresponds to the results shown in Fig. 3.13, which indicate that walking motions have a larger component when projected on the 3rd and 4th eigenvectors. The method is thus shown to preserve and reveal the coordination mechanism underlying natural human motions.

In Chapter 4, we will examine this ability to discriminate between categories more thoroughly, in the context of action recognition, with a larger dataset and with more motion categories considered simultaneously.

Computational Efficiency

Assume we are given a new dataset and motion analysis task. If we were to use PCA-based methods, such as [71, 89], we would first need to re-compute the eigen-decomposition of the covariance matrix for this new dataset to obtain the principal components, leading to a time complexity of $O(N^3)$. Using GCN-based methods (e.g., [10, 48, 88]) will also require a re-training procedure to learn a new graph and new graph filter parameters for each new dataset. On the other hand, using the proposed framework to acquire the graph-based representation only requires re-applying the previously computed transform matrix to the signals in the new dataset, which has a time complexity of $O(N^2)$ and thus can greatly reduce the training time complexity. We will further validate this experimentally in the context of action recognition in Section 4.2.6. Furthermore, in some extensions of our work [38, 49], a fast transform is possible when the transform is predefined and has certain properties that are satisfied by skeletal graphs, which can lead to a further reduction in classification complexity. Additionally, as there is no need to use data to obtain the transform matrix, this framework can work well in online applications. Using the same transform matrix regardless of the dataset or task can also result in easier generalization across datasets and tasks.
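The energy-compaction measurement used earlier in this section (frame-wise GFT, squared coefficients accumulated per basis vector) can be sketched as follows. This is only an illustration under stated assumptions: a 6-vertex path graph and synthetic low-frequency signals stand in for the skeletal graph and the CMU MoCap sequences, the combinatorial Laplacian is assumed for the GFT, and the function names are hypothetical.

```python
import numpy as np


def laplacian_gft_basis(W):
    """Eigenvectors of the combinatorial Laplacian L = D - W, sorted by eigenvalue."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals, U = np.linalg.eigh(L)
    return eigvals, U


def cumulative_energy_ratio(signals, U):
    """signals: (num_frames, N) array, one graph signal per row.

    Returns the cumulative fraction of total energy explained by the
    first k basis vectors, for k = 1..N.
    """
    coeffs = signals @ U                    # row t holds U^T f^(t)
    energy = np.sum(coeffs ** 2, axis=0)    # energy captured by each basis vector
    return np.cumsum(energy) / energy.sum()


# Toy graph: path with 6 vertices (assumption, not the 15-joint skeleton).
N = 6
W = np.zeros((N, N))
for i in range(N - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
eigvals, U = laplacian_gft_basis(W)

# Synthetic frames: mostly low-frequency content plus a little noise.
rng = np.random.default_rng(0)
low_freq = rng.normal(size=(50, 2)) @ U[:, :2].T
signals = low_freq + 0.01 * rng.normal(size=(50, N))

curve = cumulative_energy_ratio(signals, U)
print(np.round(curve, 4))
```

Because `U` is fixed by the graph, handling a new dataset in this sketch only involves the matrix product `signals @ U`, which mirrors the efficiency argument above: no eigen-decomposition has to be recomputed from the new data.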
3.5 Conclusion

To conclude, in this chapter we discussed the construction of the skeletal graph and the skeletal-temporal graph, as well as the properties of their GFT bases. We derived the relationship between the symmetric topology of these graphs and the sparse structure of the GFT basis. Given the desirable properties of the GFT basis, we proposed to extract representations for motion capture data by applying the GFT or SGWT of a skeletal or skeletal-temporal graph to it. Furthermore, based on the CMU MoCap Database, we demonstrated that the proposed graph-based representations possess several advantages in interpretability, energy compaction, discrimination between categories, and computational efficiency. In the next chapter, we will focus on validating that the representation can preserve and promote discrimination between action categories, in the context of the action recognition task, based on three public MoCap datasets.

Chapter 4
Application: Skeleton-based Action Recognition

In this chapter, we evaluate the two representations proposed in Chapter 3, GFT-based and SGWT-based, as well as their properties, such as their ability to discriminate between classes and their computational efficiency, in the context of action recognition, which is an important research topic in computer vision and machine learning. In Section 4.1, we show how to utilize the proposed representations to extract feature vectors. The datasets we use for evaluation and the associated experimental settings are discussed in Section 4.2, followed by the classification performance on three datasets, reported and compared to the state of the art in Section 4.2.3. Finally, we examine the performance of our method in terms of desirable properties, such as robustness to noisy or missing data and adaptability to new datasets.

4.1 Feature Design with Temporal Modeling

As discussed in Section 3.4.1, both the GFT-based and the SGWT-based representations are constructed frame-wise.
Therefore, given a motion sequence and its associated frame-wise representation, we need to choose a temporal model to capture the temporal dynamics. The literature on temporal modeling can be roughly divided into three types of approaches. First, some approaches treat the frame-wise representations in a sequence independently, e.g., the bag of poses [9] and the majority voting scheme [76]. These methods lack the ability to finely model the temporal dynamics. For example, the action stand up cannot be distinguished from the action sit down with an independent temporal modeling scheme. Second, some approaches model all the frame-wise representations as one sequence, using generative models (e.g., HMM [52]) or dynamic time warping (DTW) [61]. Generative models are usually sensitive to noise and prone to overfitting, especially when a highly complicated generative model is trained with a limited amount of data. Also, DTW depends heavily on a well-defined metric to measure the similarity between frames and is likely to produce large temporal misalignment when periodic actions are considered. The third type of temporal modeling method can be viewed as an intermediate approach between the first two, which only encodes certain restricted temporal structures. Examples include temporal pyramid matching (TPM) [69] and Fourier temporal pyramid (FTP) [35], which lead to models that are less complicated and more robust to temporal misalignment and noise.

In our experiments, we adopt TPM to model the dynamics in the sequence of frame-wise representations, but alternative temporal models could be used as well. Specifically, given a motion sequence with $T$ frames, we extract the frame-wise GFT-based or SGWT-based representations, $\tilde{f}_d^{(t)}$ and $c_d^{(t)}$, respectively.
Assuming GFT-based representations with a skeletal graph are used, a coefficient matrix $C \in \mathbb{R}^{T \times 3D}$ can be constructed, where the $i$-th row of $C$ is $\bigl[\tilde{f}_x^{(i)\top}, \tilde{f}_y^{(i)\top}, \tilde{f}_z^{(i)\top}\bigr]$, with $D$ the dimension of $\tilde{f}_d^{(i)}$. Then a pooling function is defined to apply column-wise pooling to a sub-block of $C$. Here we adopt a mean pooling function $p : \mathbb{R}^{r \times 3D} \to \mathbb{R}^{1 \times 3D}$, which takes the column-wise mean of a block of coefficients. Furthermore, we apply this pooling function to sub-blocks of $C$ of different sizes, which can capture the temporal order of actions spanning different durations. The maximum pyramid level, denoted as $M$, needs to be specified. For pyramid level $m \le M$, we first uniformly divide $C$ into a set of non-overlapping sub-blocks $\{B_i\}$ so that $C = \bigl[B_1^{\top}, \dots, B_{2^{m-1}}^{\top}\bigr]^{\top}$. The feature vector for this pyramid level is then computed as $z_m = \bigl[p(B_1), \dots, p(B_{2^{m-1}})\bigr]$. Finally, the feature vector for this motion sequence is given by the concatenation of the feature vectors of all the pyramid levels, i.e., $[z_1, \dots, z_M]$. The same scheme applies to SGWT-based representations. This temporal pooling scheme (TPM) will be used in the following experiments to extract the feature vector for each motion sequence.

4.2 Experimental Evaluation

4.2.1 Datasets

We evaluate the proposed representations in the context of action recognition using three public datasets.

MSR-Action3D [45] provides 3D positions of 20 skeletal joints, captured by a depth sensor similar to Kinect, for 10 different subjects performing 20 actions. Every action is performed two or three times by each subject, resulting in 557 motion sequences altogether. When a single limb was used for the action, subjects were advised to use their right arm or leg. This dataset is challenging in the sense that different classes of actions are highly similar to each other (e.g., draw x, draw tick and draw circle) and the estimated skeleton data is noisy.
UTKinect-Action3D [87] provides the 3D positions of 20 skeletal joints, captured using a single stationary Kinect sensor, and includes 10 different subjects performing 10 actions. Every action was performed twice by each subject, leading to 199 motion sequences altogether. This dataset is challenging due to the variations in viewpoint (e.g., the sequences can be captured from the right, frontal or back view), significant intra-class variations (e.g., throw or pick up with different hands), and the dramatically varying duration of the action clips.

Florence3D-Action [47] provides the 3D positions of 15 skeletal joints, captured using a single stationary Kinect sensor, and includes 10 different subjects performing 9 actions. Every action was performed two or three times by each subject, resulting in 215 motion sequences altogether. This dataset is challenging due to large intra-class variations (e.g., the same action performed using different hands) and small inter-class separation for actions such as drink from a bottle and answer phone.

4.2.2 Evaluation Settings and Parameters

For all three datasets, we adopt cross-subject evaluation, where the motion sequences from half of the subjects are used for training while those from the other half are used for testing. The unweighted skeletal-temporal graph is constructed for each dataset based on the number of tracked skeletal joints, as presented in Section 3.3. Thus, for MSR-Action3D and UTKinect-Action3D, a skeletal-temporal graph with $N_s = 20$ is constructed, while for Florence3D-Action, a skeletal-temporal graph with $N_s = 15$ is constructed. The number of temporal nodes in the skeletal-temporal graph is selected by cross validation. Once the graph is constructed, we compute the proposed GFT-based and SGWT-based representations for the frames in each sequence. The TPM scheme described in Section 4.1 is adopted to generate the final feature vector for each sequence. For the SGWT, we use the Meyer kernel as $\hat{g}(\cdot)$, and the number of scales $K$ is decided by cross validation.
For TPM, the maximum pyramid level $M$ is set to 3 in all the experiments.

4.2.3 Results on Classification Performance

Tables 4.1 and 4.2 report the classification accuracy on the MSR-Action3D dataset when using our proposed GFT-based and SGWT-based representations with temporal dynamics modeled via the TPM approach. The performance is compared to other skeleton-based features in the literature in the cross-subject setting.

Table 4.1: Comparison with the state-of-the-art results on the MSR-Action3D dataset.

Approach                              Accuracy (%)   Year
Recurrent Neural Network [54]         42.5           2011
Histograms of 3D joints [87]          78.97          2012
EigenJoints [89]                      82.30          2012
HON4D [65]                            82.15          2013
Joint angle similarities [63]         83.53          2013
SMIJ [17]                             47.10          2014
Joint spatial graph kernel [44] †     92.20          2017
ST-GFT+TPM                            85.00
ST-SGWT+TPM                           86.52
Methods that include trajectory warping:
Lie group [83]                        89.48          2014
Key-Pose [84]                         94.40          2016
Elastic functional coding [70]        85.16          2017
† kernelized methods

Table 4.2: Comparison with the state-of-the-art results on the MSR-Action3D dataset (following protocol of [45]).

Approach                        AS1     AS2     AS3     Average
Bag of 3D points [45]           72.9    71.9    79.2    74.7
Histograms of 3D joints [87]    87.98   85.48   63.46   78.97
EigenJoints [89]                74.5    76.1    96.4    82.30
Covariance descriptors [58]     88.04   89.29   94.29   90.53
Lie group [83]                  95.29   83.87   98.22   92.46
HBRNN [14]                      93.33   94.64   95.50   94.49
LSTM [27]                       92.38   90.18   92.79   91.78
Ensemble TS-LSTM [27]           95.24   96.43   100     97.22
ST-SGWT+TPM                     91.67   82.69   98.40   90.92

We can see that utilizing the SGWT-based representation performs better than utilizing the GFT-based representation, which may be because many actions in this dataset are localized in a sub-part of the body. For example, the dataset contains several actions that involve only the right hand in the movement, e.g., draw X, draw circle, draw tick, high throw, etc. Besides, we can also observe that one challenging aspect of this dataset is the small inter-class separation, since many actions involve only hand motions. This is shown in Fig.
4.1, where we can see, for example, the difficulty in classifying among groups of actions like draw X, draw tick and draw circle.

Figure 4.1: Confusion matrix for using the proposed ST-SGWT+TPM feature (1-vs-1) on the MSR-Action3D dataset. Each cell represents the classification accuracy from white (0) to black (1), which is normalized by the number of instances in each category. (Classes: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw X, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pickup and throw.)

In Table 4.1, we observe that those methods that allow trajectories to be warped, such as DTW [83, 84] (listed in the lower sub-table), generally achieve higher classification accuracy, since better temporal modeling approaches can help resolve the ambiguity in distinguishing these challenging categories, but this comes at the cost of an increase in time complexity. Furthermore, among the methods in the upper sub-table of Table 4.1, only kernelized methods (marked with †) perform better than ours. However, kernelized methods require computing the similarity between each pair of sequences, which leads to much higher time complexity, as justified in Table 4.5.
It is worth noting that our proposed representations can further incorporate either warping or kernelized techniques, which can potentially improve the performance.

Table 4.3 reports the classification accuracy on the UTKinect-Action3D dataset obtained with our proposed SGWT-based representation and the feature extraction of Section 4.1. We can see that the proposed method outperforms most of the literature in terms of recognition accuracy. Fig. 4.2 further shows the confusion matrix to provide a more detailed view of the classification performance. We can observe that the only errors of our method are pick up and carry sequences misclassified as walk. This can be due to the large intra-class variations in this dataset, especially for these two categories.

Table 4.3: Comparison with the state-of-the-art results on the UTKinect-Action3D dataset.

Approach                                      Accuracy (%)   Year
Histograms of 3D joints (LOO) [87]            90.92          2012
Random forests [94]                           87.90          2013
Subgraph-pattern graph kernel (LOO) [68] †    97.44          2016
ST-LSTM + Trust Gate [30]                     95.0           2016
Joint spatial graph kernel [44] †             98.30          2017
LSTM [27]                                     93.94          2017
Ensemble TS-LSTM [27]                         96.97          2017
ST-SGWT+TPM                                   97.98
Methods that include trajectory warping:
JP+DTW+FTP [83]                               94.68          2014
Pairwise RJP+DTW+FTP [83]                     95.58          2014
Lie group [83]                                97.08          2014
Riemannian manifold [57]                      91.50          2015
Key-Pose [84]                                 93.47          2016
Elastic functional coding [70]                94.87          2017
† kernelized methods

Figure 4.2: Confusion matrix for using the proposed ST-SGWT+TPM feature (1-vs-1) on the UTKinect-Action3D dataset. Each cell represents the classification accuracy from white (0) to black (1), which is normalized by the number of instances in each category. (Classes: walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, clap hands.)
Table 4.4: Comparison with the state-of-the-art results on the Florence3D-Action dataset.

Approach                                      Accuracy (%)   Year
Multi-part bag-of-poses (LOO) [47]            82.00          2013
Subgraph-pattern graph kernel (LOO) [68] †    91.63          2016
Joint spatial graph kernel [44] †             93.20          2017
ST-SGWT+TPM                                   86.79
Methods that include trajectory warping:
JP+DTW+FTP [83]                               85.26          2014
Pairwise RJP+DTW+FTP [83]                     85.20          2014
Lie group [83]                                90.88          2014
Riemannian manifold [57]                      87.04          2015
Elastic functional coding [70]                89.67          2017
† kernelized methods

Table 4.4 reports the classification accuracy on the Florence3D-Action dataset obtained with our proposed SGWT-based representation and the feature extraction of Section 4.1. We can see that the proposed method is comparable to those in the literature in terms of recognition accuracy. Fig. 4.3 shows the confusion matrix, where the proposed method predicts most of the categories correctly, except for those with small inter-class variations, e.g., drink and answer phone.

In addition, we demonstrate that our method has significantly lower time complexity, as shown in Table 4.5. The experiments were conducted on a Windows machine with an Intel Core i7-5600U 2.60 GHz processor.

Table 4.5: Average testing runtime (ms) using the Joint spatial graph kernel method and our proposed method.

Dataset       Joint spatial graph kernel [44]   Our approach
UTKinect      318.57                            61.68
Florence3D    206.92                            32.22

Figure 4.3: Confusion matrix for using the proposed ST-SGWT+TPM feature (1-vs-1) on the Florence3D-Action dataset. Each cell represents the classification accuracy from white (0) to black (1), which is normalized by the number of instances in each category. (Classes: wave, drink, answer phone, clap, tight lace, sit down, stand up, read watch, bow.)
4.2.4 Robustness to Noisy Data

In our experiments we add noise at various signal-to-noise ratio (SNR) levels to the three datasets and compare the classification accuracy of our proposed approach with that achieved by a PCA-based method. A naive approach to incorporating noise into the simulations would simply consist of selecting some existing model (say, additive white Gaussian noise) and adding noise with equal variance to all motion measurements. Instead, we investigate a more realistic noise model, derived from the actual data and knowledge of the human body.

We start by analyzing how measurement noise may depend on the specific joint. Because the bone length between a physically connected pair of joints should be constant, we characterize the noise associated with each bone based on the standard deviation of the bone length measurements obtained from the sequence, i.e.,

\[ \sigma_{b_k} = \operatorname{std}\bigl( \lVert p_{t,i} - p_{t,j} \rVert_2 \bigr), \tag{4.1} \]

where $b_k$ is the $k$-th bone, connecting joint $i$ and joint $j$. Then the standard deviation of the noise at each joint is computed as the average over the standard deviations of the lengths of all bones connected to that joint. As for the signal energy, based on the observation that the signal energy should be zero if there is no motion, we calculate the peak signal energy $E_s$ by considering the maximum squared norm of all motion vectors at all joints along time, i.e.,

\[ E_s = \max_{t \in \{1, \dots, T-1\},\; i \in \{1, \dots, N\}} \lVert v_{t,i} \rVert_2^2, \tag{4.2} \]

where $v_{t,i} = p_{t+1,i} - p_{t,i}$. Finally, the empirical PSNR is calculated by taking the ratio between the peak signal energy across all joints, i.e., $E_s$, and the average noise energy at each joint in the MSR-Action3D and UTKinect datasets, as reported in Table 4.6. We can observe that the empirical PSNR differs across datasets as well as across joints, which motivates us to consider a joint-dependent noise model.

To add artificial noise that is realistic for real-world systems, we consider a noise model where the noise level at each joint is proportional to the moving velocity of that joint.
In order to validate how realistic this noise model is, we plot the moving velocity of a specific joint, along with the estimated noise level at that joint, within two sequences from the MSR-Action3D dataset, in Fig. 4.4. Note that since we do not have the ground truth positions of each joint available, we approximate the noise level at each joint via the variations in the lengths of the bones attached to that joint, given that the variation in the length of each bone should be zero in a noise-free scenario. This model is shown to be reasonable in Fig. 4.4, where we can observe a close relationship between the noise level and the moving velocity at the joint. In [77], the authors utilized LED markers and manual annotations to obtain ground truth joint positions and compared them with the positions estimated by Kinect, where a similar relation between noise and motion can be observed. Therefore, although the estimation error/noise would depend on many factors such as depth, occlusion, etc., a noise model depending on the moving velocity of joints is a simplified yet reasonable model.

Table 4.6: Empirical PSNR at each joint for two datasets.

                  PSNR (dB)
Joint Name        MSR-Action3D   UTKinect
ShoulderLeft      57.60          32.20
ShoulderRight     56.34          31.11
ShoulderCenter    57.07          33.66
Spine             54.73          39.76
HipLeft           55.71          33.81
HipRight          54.92          33.55
HipCenter         57.15          41.18
ElbowLeft         53.26          32.28
ElbowRight        53.24          30.28
WristLeft         63.44          32.83
WristRight        63.44          31.30
HandLeft          73.88          33.45
HandRight         73.88          32.30
KneeLeft          53.83          28.76
KneeRight         53.04          29.08
AnkleLeft         62.22          34.64
AnkleRight        62.22          34.62
FootLeft          66.30          40.61
FootRight         66.30          40.11
Head              45.63          37.80
Average           59.21          34.17

The experiment is designed as follows. For each joint $i$ at time $t$, independent Gaussian noise with standard deviation

\[ \sigma_{i,t} = \sigma_{\mathrm{psnr}} \sqrt{\frac{\lVert v_{t,i} \rVert_2^2}{E_s}} \tag{4.3} \]

is added to the original data, leading to a noise-corrupted dataset, where $\sigma_{\mathrm{psnr}}$ is decided
by the targeted PSNR value and is the same for all the joints. The range of PSNR values for the experiment is selected based on the empirical PSNR in each dataset, as reported in Table 4.6, i.e., the relative noise levels are based on what is typical given the available datasets. Our proposed GFT representation and PCA are applied to the corrupted data to generate frame-based representations. The same temporal pooling scheme, i.e., temporal pyramid pooling with pyramid level 3, is used for both representations to get the feature vector for each sequence. Finally, a linear SVM classifier is applied to the feature vectors, and the classification accuracy is reported and plotted in Figs. 4.5, 4.6 and 4.7. The experiment is repeated 10 times for each PSNR level in order to average over the random noise realizations.

Figure 4.4: Examples demonstrating that the level of noise at a joint, measured by the variation in the length of the attached bone between consecutive frames, depends on the joint velocity. (Panels: joint velocity (m/frame) of the Right Elbow and bone length variation (m) of Right Elbow-Wrist for hand clap; joint velocity (m/frame) of the Right Knee and bone length variation (m) of Right Hip-Knee for side kick. Horizontal axis: frame index.)

Based on the experimental results, we can observe that the GFT-based method consistently outperforms the PCA-based method on all three datasets when the PSNR is greater than 0 dB, which demonstrates the robustness of the proposed graph-based representations to noisy data. Finally, as potential future work, we may consider constructing our own noise-free dataset by utilizing rendering software, given that the current public datasets already include fairly large estimation errors.
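The velocity-dependent noise model of (4.2) and (4.3) can be sketched as follows. This is a minimal illustration, not the experimental code: the mapping from a target PSNR to $\sigma_{\mathrm{psnr}}$ is not specified in the text, so `sigma_psnr` is simply passed in as a parameter. The handling of the last frame (which has no forward difference) and of a motionless sequence are assumptions, and the function names and random-walk test data are hypothetical.

```python
import numpy as np


def velocity_dependent_noise_std(positions, sigma_psnr):
    """positions: (T, N, 3) joint positions. Returns sigma_{i,t} with shape (T-1, N)."""
    v = positions[1:] - positions[:-1]        # motion vectors v_{t,i}
    sq_norm = np.sum(v ** 2, axis=2)          # ||v_{t,i}||_2^2
    peak_energy = sq_norm.max()               # E_s as in (4.2)
    if peak_energy == 0:                      # assumption: no motion means no noise
        return np.zeros_like(sq_norm)
    return sigma_psnr * np.sqrt(sq_norm / peak_energy)   # (4.3)


def add_joint_noise(positions, sigma_psnr, rng):
    """Add independent Gaussian noise whose std follows the joint velocity."""
    sigma = velocity_dependent_noise_std(positions, sigma_psnr)
    # Assumption: the last frame reuses the std of the previous frame.
    sigma = np.vstack([sigma, sigma[-1:]])
    noise = rng.normal(size=positions.shape) * sigma[:, :, None]
    return positions + noise


# Hypothetical toy sequence: 10 frames, 3 joints, joint 0 kept static.
rng = np.random.default_rng(0)
steps = rng.normal(scale=0.01, size=(10, 3, 3))
steps[:, 0, :] = 0.0
positions = np.cumsum(steps, axis=0)

noisy = add_joint_noise(positions, sigma_psnr=0.05, rng=np.random.default_rng(1))
print(np.abs(noisy - positions).max(axis=(0, 2)))   # per-joint largest perturbation
```

In this sketch the static joint receives no noise at all, while the fastest-moving joint at its fastest frame receives noise with standard deviation exactly `sigma_psnr`, which is the behavior (4.3) describes.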
Figure 4.5: Classification accuracy over PSNR on the MSR-Action3D dataset. [Plot: classification accuracy (%) versus PSNR (dB) for PCA and GFT.]

Figure 4.6: Classification accuracy over PSNR on the UTKinect-Action3D dataset. [Plot: classification accuracy (%) versus PSNR (dB) for PCA and GFT.]

Figure 4.7: Classification accuracy over PSNR on the Florence3D-Action dataset. [Plot: classification accuracy (%) versus PSNR (dB) for PCA and GFT.]

4.2.5 Robustness to Missing Data

We next evaluate the performance of our proposed schemes in cases where there is missing data (e.g., the position of a joint cannot be obtained at some point in time due to occlusions). Given a skeleton-based motion dataset, we first synthesize the corresponding corrupted dataset with missing entries. Each entry in the dataset is kept with probability $p$; otherwise, that entry is thrown away. This error is introduced independently at each joint. Here we only consider the scenario where the percentage of missing data is less than 50%. Based on this corrupted dataset, classification is performed with either the PCA-based method or the GFT-based method.

For the PCA-based method, following the conventional approach, we use the alternating least squares (ALS) algorithm to jointly learn the principal components and estimate the coefficients [19, 82]. Specifically, given the data matrix $X \in \mathbb{R}^{n \times p}$ with some missing values, we first denote the set of indices of the observed entries as $\Omega$, such that $(i,j) \in \Omega$ if $X_{ij}$ is observed. We then solve the following minimization problem for a low-rank reconstructed data matrix $M \in \mathbb{R}^{n \times p}$:

$$\min_{A \in \mathbb{R}^{n \times k},\; B \in \mathbb{R}^{p \times k}} \; \sum_{(i,j) \in \Omega} \left( X_{ij} - A_i^\top B_j \right)^2 + \lambda \left( \|A\|_F^2 + \|B\|_F^2 \right)$$

where $M = AB^\top$, $A \in \mathbb{R}^{n \times k}$, $B \in \mathbb{R}^{p \times k}$.
When $B$ is fixed, the problem decouples into $n$ independent ridge regression problems, one for each $A_i$; similarly, when $A$ is fixed, we can solve for $B$. We alternately solve for $A$ and $B$ and eventually obtain the reconstructed data matrix $M$. The reconstructed data matrix is then used to extract the representation with PCA.

For the GFT-based method, we propagate the signals on the predefined skeletal graph. Given each graph signal $f \in \mathbb{R}^n$, i.e., the motion data per dimension per frame, and the predefined skeletal graph $\mathcal{G}$ with normalized Laplacian $\mathcal{L}$, we split both $f$ and $\mathcal{L}$ into partitions associated with either observed or missing entries, i.e.,

$$f = \begin{pmatrix} f_l \\ f_u \end{pmatrix}, \qquad \mathcal{L} = \begin{pmatrix} \mathcal{L}_{ll} & \mathcal{L}_{lu} \\ \mathcal{L}_{ul} & \mathcal{L}_{uu} \end{pmatrix}$$

where $l$ represents the set of indices of the observed entries while $u$ represents the set of indices of the entries whose values are missing. We then solve for the optimal $f_u$ by favoring small signal variation on the given graph, through the following optimization problem:

$$\hat{f}_u = \arg\min_{f_u} \; f^\top \mathcal{L} f$$

This problem has the closed-form solution

$$\hat{f}_u = -\mathcal{L}_{uu}^{\dagger} \mathcal{L}_{ul} f_l$$

where $\mathcal{L}_{uu}^{\dagger}$ denotes the pseudo-inverse of $\mathcal{L}_{uu}$. Finally, the propagated signal, i.e., $\hat{f} = \begin{pmatrix} f_l \\ \hat{f}_u \end{pmatrix}$, is regarded as the reconstructed data, from which the representation is extracted with the proposed GFT-based method described in Section 3.4.1.

For each value of $p$, the experiment is repeated 10 times and the averaged classification accuracy is reported for each dataset; see Figs. 4.8, 4.9 and 4.10. Here we only consider the scenario where the percentage of missing data is less than 50%. We can see that the GFT-based method with signal propagation on the graph consistently outperforms the PCA-based method with the ALS algorithm on all three datasets.

Figure 4.8: Classification accuracy over percentage of missing data on the MSR-Action3D dataset. [Plot: classification accuracy (%) versus percentage of missing data (%) for PCA and GFT.]
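The graph-based propagation of missing entries described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the experiments: the helper names `normalized_laplacian` and `propagate_missing` are hypothetical, and the graph is assumed to be connected with at least one observed vertex.

```python
import numpy as np


def normalized_laplacian(adjacency):
    """Normalized Laplacian I - D^{-1/2} A D^{-1/2} of a graph without isolated vertices."""
    degree = adjacency.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    scaled = d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]
    return np.eye(adjacency.shape[0]) - scaled


def propagate_missing(f, observed, laplacian):
    """Fill the missing entries of f with -L_uu^+ L_ul f_l.

    f: signal of length n (values at missing positions are ignored).
    observed: boolean mask of length n, True where the entry is observed.
    """
    f = np.asarray(f, dtype=float)
    l_idx = np.flatnonzero(observed)
    u_idx = np.flatnonzero(~np.asarray(observed))
    L_uu = laplacian[np.ix_(u_idx, u_idx)]
    L_ul = laplacian[np.ix_(u_idx, l_idx)]
    f_hat = f.copy()
    f_hat[u_idx] = -np.linalg.pinv(L_uu) @ L_ul @ f[l_idx]
    return f_hat
```

As a sanity check, a signal proportional to the square root of the vertex degrees has zero variation under the normalized Laplacian, so it should be recovered exactly when some of its entries are hidden.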
Figure 4.9: Classification accuracy over percentage of missing data on the UTKinect-Action3D dataset. [Plot: classification accuracy (%) versus percentage of missing data (%) for PCA and GFT.]

Figure 4.10: Classification accuracy over percentage of missing data on the Florence3D-Action dataset. [Plot: classification accuracy (%) versus percentage of missing data (%) for PCA and GFT.]

4.2.6 Generalization to New Datasets

As discussed in Section 3.4, there is no need to re-compute the transform when a new dataset is considered when using the proposed graph-based representation. On the other hand, using PCA-based representations requires re-computing the transform through the eigendecomposition of the covariance matrix of the new dataset, which has a time complexity of $O(N^3)$. To evaluate the potential benefits of our approach, we design an experiment where we assume that the GFT basis has been constructed based on previous data, and we measure the time required to derive the final feature representations during the training process, using either the proposed GFT-based method or the PCA-based method on a new dataset. The time required for computing the feature representations, averaged over 10 repeated experiments, is reported in Table 4.7, together with the classification accuracy on that new dataset. We can observe that the accuracy of the proposed graph-based representation is consistently higher than that of the PCA-based method, and the runtime of the proposed graph-based method is consistently lower than that of the PCA-based method. Using the proposed graph-based method reduces the runtime by 66.38%, 51.91% and 77.76% for the MSR-Action3D, UTKinect and Florence3D datasets, respectively.
Table 4.7: Average runtime and classification accuracy by using the proposed GFT-based representation and PCA-based representation when adapted to each new dataset.

Dataset          Time-GFT (s)   Time-PCA (s)   Acc-GFT (%)   Acc-PCA (%)
MSR-Action3D        0.2620         0.7794        74.3590       70.3297
UTKinect            0.0904         0.1880        95.9596       91.9192
Florence3D          0.0775         0.3485        88.6792       84.9057

4.2.7 Easy Combination with General Modeling Schemes

Our proposed graph-based approach can easily be adopted to replace existing representations used in state-of-the-art classification systems, and potentially brings benefits in terms of computational efficiency because our approach is model-based and not data-driven. For example, the authors of [83] utilized a skeletal representation derived with Lie algebra, combined with dynamic time warping (DTW) and Fourier temporal pyramid (FTP) as the temporal modeling scheme, which achieved state-of-the-art performance on three datasets. However, extracting the skeletal representation using Lie algebra is computationally very complex. Therefore, we can replace this representation with our proposed GFT-based representation, combined with exactly the same temporal modeling scheme, i.e., DTW and FTP. We conduct the same action recognition experiments as before by using either the Lie group based representation proposed in [83] or our proposed GFT-based representation, followed by the same temporal modeling scheme (DTW and FTP) and classification scheme (one-vs-all linear SVM) as in [83]. The classification accuracy and the runtime of each processing step, i.e., representation extraction, temporal modeling, and classification, with either representation are reported in Table 4.8.

Table 4.8: Classification accuracy and runtime by using the Lie algebra based representation with DTW and FTP as proposed in [83] and by using the proposed GFT-based representation with the same DTW and FTP modeling on three datasets. Acc: classification accuracy. $T_r$: runtime required for extracting skeletal representations. $T_t$: runtime required for temporal modeling including DTW and FTP. $T_c$: runtime required for classification. $T_{total}$: overall runtime. $F_{speedup}$: speed-up factor on $T_r$.

Dataset             Method           Acc (%)   T_r (s)   T_t (s)   T_c (s)   T_total (s)   F_speedup
MSR-Action3D        Lie group [83]    87.55     3286.5    3900.8     5.0       7192.3
                    GFT               87.55       21.7     195.3     1.1        218.1         151
UTKinect-Action3D   Lie group [83]    95.96      972.2     640.8     0.8       1613.8
                    GFT               97.98        7.5      26.8     0.1         34.5         130
Florence3D-Action   Lie group [83]    88.68      347.0     165.9     0.7        513.6
                    GFT               89.62        5.7       9.8     0.1         15.6          61

First, replacing the Lie group based representation with the proposed GFT-based representation achieves comparable or even better classification accuracy. Furthermore, by utilizing our GFT-based representation, the overall runtime is reduced by more than a factor of 30 for all three datasets, as compared to using the Lie group based representation. The reduction is especially significant when only the runtime of extracting the representation is considered, where using the GFT-based representation achieves gains of 2 orders of magnitude on average. In general, the proposed model-based approach to extract a GFT-based representation for skeleton-based motion data can easily be utilized as an alternative to other representations and flexibly combined with existing temporal modeling and classification schemes, which can potentially improve the classification performance, while also improving the computational efficiency or the robustness to missing and noisy data.
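The speed-up factors quoted above follow directly from the runtimes in Table 4.8. The short script below only re-derives them from the tabulated values; the names `RUNTIMES`, `SPEEDUPS` and `speedups` are illustrative.

```python
# Runtimes in seconds copied from Table 4.8, in the order (T_r, T_t, T_c).
RUNTIMES = {
    "MSR-Action3D": {"lie": (3286.5, 3900.8, 5.0), "gft": (21.7, 195.3, 1.1)},
    "UTKinect-Action3D": {"lie": (972.2, 640.8, 0.8), "gft": (7.5, 26.8, 0.1)},
    "Florence3D-Action": {"lie": (347.0, 165.9, 0.7), "gft": (5.7, 9.8, 0.1)},
}


def speedups(lie, gft):
    """Return (speed-up on T_r, speed-up on the summed runtime of all three steps)."""
    return lie[0] / gft[0], sum(lie) / sum(gft)


SPEEDUPS = {name: speedups(t["lie"], t["gft"]) for name, t in RUNTIMES.items()}

for name, (rep, total) in SPEEDUPS.items():
    print(f"{name}: representation x{rep:.0f}, overall x{total:.1f}")
```

Rounded to the nearest integer, the representation-extraction speed-ups reproduce the $F_{speedup}$ column, and every overall speed-up exceeds 30.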
4.3 Conclusion

To conclude, in this chapter we validate the applicability of the representations presented in Chapter 3 in the context of 3D action recognition, as a real-world application. Utilizing the proposed representations is demonstrated to achieve performance comparable to recent work in the literature, while at the same time providing benefits in terms of significantly lower time complexity, robustness to noisy and missing data, fast application to new datasets, and easy combination with existing temporal modeling schemes.

Chapter 5
Discriminative Graph Learning with Sparsity Regularization

In Chapter 3, we introduced representations based on modeling the skeletal structure with a predefined unweighted skeletal or skeletal-temporal graph. When we utilized this scheme for action recognition in Chapter 4, we noted that different classes of motions can be significantly different in the set of body joints involved. This makes us revisit the procedure of graph formulation and investigate whether better discrimination and interpretability can be achieved if different graph structures are learned for different actions. We extend the previously proposed approach to use different graphs, each associated with one action category, when creating the representations, instead of using the same graph for all actions.

Note that graph learning methods in the literature have not taken discrimination between categories into consideration. Therefore, we propose a novel, yet general, discriminative graph learning scheme, which favors discrimination between classes while preserving representability, together with an efficient algorithm to learn multiple graphs that are suitable for general multi-class classification. We evaluate the proposed algorithm on real and synthetic data and show that our proposed method promotes discrimination between classes, which further leads to better classification.
5.1 Introduction

Although graph structures arise naturally when it comes to structured data, how to construct the optimal graph, i.e., how to select the edges and the corresponding weights, is not trivial. In some cases, there is a clear choice of graph based on prior or domain-specific knowledge, e.g., the 4-connected graph is popular in image coding, while friendship graphs are widely used to analyze social networks. However, in other applications, intuitive choices of graphs may not always reflect the real intrinsic relationships between entities. Hence, learning the right graph topology from observed data is desirable and has become an important research topic in graph signal processing.

Previous research tackles the graph learning problem from various perspectives, which are often based on promoting smoothness of the data samples on the learned graph. For example, in [86], the graph is learned by solving an optimization problem whose objective function includes two terms, representing smoothness of the noiseless version of the observed graph signals and data fitting, respectively. Such objective functions are selected in order to favor efficient signal representation, i.e., energy compaction, so that when representing signals in terms of the graph spectrum (i.e., in the graph Fourier transform domain) a small number of non-zero coefficients is sufficient (on average) to provide a good approximation. This is desirable for applications such as denoising and compression.

Another family of graph learning approaches is developed from a statistical perspective, where graph signals are analyzed as random vectors with a Gaussian Markov Random Field (GMRF) distribution and a precision matrix plays the role of the Laplacian matrix [92]. Under this framework, learning the graph topology and the associated graph Laplacian, i.e., estimating the covariance or precision matrix, can be formulated as solving a Gaussian Maximum Likelihood (ML) problem with an additional $\ell_1$ regularization term, which encourages learning a sparse graph [18, 43, 55].
In [18] and [55], a coordinate descent procedure is applied to efficiently solve the $\ell_1$-penalized Gaussian ML problem. Notice that solving the Gaussian ML problem yields a trace term, which can also be interpreted as a smoothness measure, and thereby this approach also promotes energy compaction of signals on the learned graphs. In addition, including the $\ell_1$ penalty leads to a learned graph with sparser connectivity, which can make it easier to interpret the statistical dependencies present in the data.

In this chapter, we focus on applications involving classification. A set of data samples, i.e., realizations of graph signals, of limited size is available, where each sample is associated with one known category/label. We aim at constructing an overcomplete dictionary of atoms which can represent graph signals sparsely, i.e., which is able to represent each signal as a linear combination of a few atoms in the dictionary. Furthermore, since there are multiple categories of signals in a classification task, for the signals in each given category we define a class-specific sub-dictionary, e.g., by learning from the data samples in the corresponding class. Cascading the sub-dictionaries of all classes leads to the final dictionary, which provides a sparse representation for each data sample. The resulting sparse coding coefficients can serve as the features for any conventional classifier, e.g., a kernelized SVM, to achieve the final label assignment.

Under the above-mentioned scenario, the problem of constructing class-specific sub-dictionaries is critical to the classification performance. In a recent series of works [80, 81, 93], one common graph is defined for all the categories and class-specific graph transforms, i.e., polynomials of the graph Laplacian, are learned based on the signals in each category, which then act as the sub-dictionaries.
Using one common graph for all categories means that discrimination can only be achieved by selecting different filter coefficients for different categories. One may consider applying the graph learning approaches mentioned above, such as graphical lasso, independently to each class of signals (each category), in order to learn one graph for each class. However, those methods design graphs that favor energy compaction and sparsity within each class. Thus, utilizing the resulting graph transform on each learned graph as a sub-dictionary does not guarantee that it will be effective in discriminating between classes.

To address this problem, our main motivation is to take discrimination into account, in addition to energy compaction and sparsity, when constructing the class-specific graphs from signals in multiple categories, leading to discriminative class-specific sub-dictionaries. In this way, the resulting graph transforms, e.g., the GFT basis, should also be discriminative between classes and can then be well suited for classification. To the best of our knowledge, we are the first to propose a graph estimation method that combines both a fitting term to optimize energy compaction and a term to promote discrimination [36].

The rest of this chapter is organized as follows. In Section 5.2, we formally define the multiple-category graph learning problem, along with the proposed objective function to be optimized. In Section 5.3, we develop a block coordinate descent algorithm for solving the proposed optimization problem. In Section 5.4, we further derive the closed-form solution for the discriminative graph learning problem when a skeletal graph is considered. Experimental results on synthetic data are presented in Section 5.5, while experimental results on real motion data are presented in Section 5.6. Section 5.7 concludes this chapter.
5.2 Problem Formulation

Inspired by the Fisher discrimination criterion for linear discriminant analysis (LDA) [25], which aims at minimizing the within-class scatter while maximizing the between-class scatter, we propose a Fisher discrimination graph learning algorithm where the graph representing each signal category is learned jointly from the data samples in all the categories. Based on the conventional Gaussian ML objective to estimate the graph Laplacian for each category [55], a new objective function is proposed that includes an additional term measuring the non-smoothness of data from other categories on the graph of a specific category. The motivation behind this approach is to enforce smoothness within a class and non-smoothness across classes, so as to improve the discrimination among the graphs learned for different categories of data samples.

We first define the multi-category graph learning problem, where we are interested in learning multiple graphs, each for one category of signals (data samples). Assume there are $n$-dimensional random graph signals $x^{(1)}, \ldots, x^{(S)}$, where $x^{(i)}$ is the random signal associated with the $i$-th category (label) and there are $S$ categories in total. Furthermore, for each signal category $i$, there are $N_i$ i.i.d. realizations, $x^{(i)}_1, \ldots, x^{(i)}_{N_i}$. Note that $x^{(i)}_j \in \mathbb{R}^n$.

Our goal is to learn the graph structure of each category, i.e., $\mathcal{G}_1, \ldots, \mathcal{G}_S$, from the observed realizations of the signals. Each weighted undirected graph $\mathcal{G}_i = (\mathcal{V}, \mathcal{E}_i, Q_i)$ consists of a set of vertices $\mathcal{V} = \{1, 2, \ldots, n\}$ connected by a set of edges $\mathcal{E}_i$, and a symmetric matrix representation $Q_i$, where for $a \neq b$, $(a,b) \in \mathcal{E}_i$ if and only if $Q_{i,ab} \neq 0$. In the graph signal processing literature, $Q_i$ has usually been restricted to be a graph Laplacian [13], a generalized graph Laplacian [67] or an adjacency matrix [75]. Here we consider the least restrictive constraint for $Q_i$, where it is only required to be positive semi-definite, similar to the graphical lasso method [18].
Applying the conventional graphical lasso method directly to this multi-category graph learning setting leads to solving the following $\ell_1$-penalized Gaussian ML estimation problem separately for the graph of each category $i$.

Problem 1 (Category-independent Graph Learning Problem) Given the empirical covariance matrix computed from the signals of category $i$, i.e., $K_i = \frac{1}{N_i} \sum_{k=1}^{N_i} x^{(i)}_k x^{(i)\top}_k$, find the graph representation for category $i$, i.e., $Q_i$, by solving:

$$\min_{Q_i \succeq 0} \; -\log\det(Q_i) + \mathrm{tr}(K_i Q_i) + \rho \|Q_i\|_1 \qquad (5.1)$$

as proposed in the graphical lasso method [18, 55].

By revisiting the trace term in (5.1),

$$\mathrm{tr}(K_i Q_i) = \frac{1}{N_i} \mathrm{tr}\!\left( \sum_{k=1}^{N_i} x^{(i)}_k x^{(i)\top}_k Q_i \right) = \frac{1}{N_i} \sum_{k=1}^{N_i} x^{(i)\top}_k Q_i \, x^{(i)}_k, \qquad (5.2)$$

it can be observed that minimizing $\mathrm{tr}(K_i Q_i)$ promotes smoothness of the realizations (data samples) in the $i$-th category on the estimated $i$-th graph. That is, this optimization selects $Q_i$ such that the Laplacian quadratic form $x^{(i)\top} Q_i x^{(i)}$ is small, which, as discussed in Section 2.2.1, corresponds to having a smooth signal.

In order to promote discrimination between the graphs learned from signals in different categories, it would be desirable for the signals in the $i$-th category, i.e., $x^{(i)}$, not only to be smooth on the graph learned for the $i$-th category, but also to be non-smooth on the graphs learned for the other categories. That is, we would like to design an optimization problem that selects $Q_j$ such that $x^{(i)\top} Q_j x^{(i)}$ is large $\forall j \neq i$. To achieve this goal, we propose to modify the trace term in (5.1) as follows:

$$\min_{Q_1, \ldots, Q_S} \sum_{i=1}^{S} \sum_{k=1}^{N_i} \frac{1}{N_i} \left[ x^{(i)\top}_k Q_i x^{(i)}_k - \frac{\alpha_i}{S-1} \sum_{j \neq i}^{S} x^{(i)\top}_k Q_j x^{(i)}_k \right] = \min_{Q_1, \ldots, Q_S} \sum_{i=1}^{S} \left[ \mathrm{tr}(K_i Q_i) - \frac{\alpha_i}{S-1} \sum_{j \neq i}^{S} \mathrm{tr}(K_i Q_j) \right], \qquad (5.3)$$

where $K_1, \ldots, K_S$ are the empirical covariance matrices of the signals from each category and $\alpha_i$ represents the weight of the signal non-smoothness regularizer, which can be utilized to favor either signal representability or discrimination between graphs. After reformulating (5.3), we
Afterreformulating(5.3),we proposetosolvethefollowingnewoptimizationproblem. Problem2(DiscriminativeGraphLearningProblem) Given those empirical covariance matricescomputedfromsignalsineachcategory,i.e., K 1 ;:::; K S where K i = 1 N i Í N i k=1 x ¹iº k x ¹iº k Ë , for each category i, solve for its corresponding discriminative graph representation Q i with thefollowingoptimizationproblem: min Q i 0 logdet¹Q i º+tr¹K i Q i º i S1 S Õ j,i tr¹K j Q i º+ i kQ i k 1 : (5.4) In(5.4), i representstheweightforsparsityregularizer. Although i and i arenotnecessarily required be equal across categories, for simplicity, we assume i = and i = for alli, in theremainderofthischapter. 5.3 Disc-GLassoAlgorithm We now propose a block coordinate descent algorithm, similar to what have been developed in[18]butwithadifferentobjectivefunction,forsolvingProblem2. Firstthesubgradientof 5.3. Disc-GLassoAlgorithm 55 (5.4)is Q 1 i + K i S1 S Õ j,i K j + i = 0; (5.5) where i = sign¹Q i º. Letting S1 = 1 r wecanrewrite(5.5)as: W i + K i 1 r S Õ j,i K j + i = 0 (5.6) where W i istheestimatedcovariancematrixand W i = Q 1 i . Considerapartitionfor W i and K i , W i = W i 11 w i 12 w i 12 Ë w i 22 ! ; K = K i 11 k i 12 k i 12 Ë k i 22 ! (5.7) where W i 11 ; K i 11 are¹n1º¹n1º sub-matrices, w i 12 ; k i 12 are column vectors of length n1,andusesimilarpartitionsfor i . Theupperrightblockof(5.6)leadsto w i 12 + k i 12 1 r S Õ j,i k j 12 + i 12 = 0 (5.8) Furthermore,solving(5.8)willbeequivalenttosolvingthefollowingdualproblem: = argmin 1 2 W i 11 12 W i 11 12 k i 12 + 1 r S Õ j,i W i 11 12 k j 12 2 +kk 1 ; (5.9) asonce solves(5.9), w i 12 = W i 11 cansolve(5.8). The above procedure can then be repeated through all the columns/rows partition until convergence has been reached, as shown in Algorithm 1. Compared to [18], our algorithm takes into consideration the partitioned weighted covariance matrices of other classes, in additiontothatofthetargetclass. 
This makes the initialization of our algorithm more challenging, since we need to set the parameter $r$ appropriately; the selection of this parameter is discussed next.

Criterion for choosing r:

Since $r$ denotes the inverse of $\frac{\alpha}{S-1}$, it can be regarded as a hyper-parameter that trades off discrimination between graphs against signal smoothness on graphs. As $r$ increases, the effect of the additional term promoting signal non-smoothness, i.e., $\sum_{j \neq i}^{S} \mathrm{tr}(K_j Q_i)$, becomes weaker, leading to less discrimination between the learned graphs; in the limit of large $r$, solving Problem 2 becomes equivalent to solving Problem 1 and only the smoothness of the signals on their own graphs is favored. Conversely, as $r$ decreases, discrimination is favored more strongly. To guarantee that $W^{(t)}_i \succ 0$ after each update step $t$, we show below that only the initial $W_i$ is required to be positive definite: $W^{(0)} \succ 0$. As the initialization is defined as $W^{(0)}_i := K_i - \frac{1}{r} \sum_{j \neq i}^{S} K_j + \rho I$, we need to have

$$K_i - \frac{1}{r} \sum_{j \neq i}^{S} K_j \succeq 0, \qquad \rho > 0. \qquad (5.10)$$

Choosing $r$ so that (5.10) is satisfied leads to $W^{(0)} \succ 0$. Now suppose that $W^{(t)} \succ 0$, which implies that the Schur complement is positive: $w_{22} - w_{12}^\top (W^{(t)}_{11})^{-1} w_{12} > 0$. Then, by the update rule, the corresponding Schur complement for the updated $W^{(t+1)}$ will be even greater:

$$w_{22} - w_{12}^\top (W^{(t+1)}_{11})^{-1} w_{12} > w_{22} - w_{12}^\top (W^{(t)}_{11})^{-1} w_{12} > 0.$$

Thus, once $W^{(0)} \succ 0$, all subsequent updates satisfy $W^{(t)} \succ 0, \forall t$.

Therefore, in the proposed algorithm, we first search for the minimum ratio $r$, which is the value that most favors discrimination, such that $K_i - \frac{1}{r} \sum_{j \neq i}^{S} K_j \succeq 0, \forall i$, via line search through a predefined set of possible values for $r$. Compared to GLasso [18], the only additional cost in time complexity of adding the proposed Fisher-LDA-like term is this search procedure.

5.4 Closed-form Solution for Skeletal Graphs

In the previous sections, we considered a general graph learning problem without applying any constraints to the graph topology.
However, as discussed in Chapter 3, when developing graph-based representations for skeleton-based motion data, we consider a skeletal graph whose topology, i.e., the edge set, is already decided based on the human skeleton. The graph learning problem is therefore reduced to a weight estimation problem, that is, how to assign a weight to each edge in the given edge set. To assign the optimal edge weights for actions of different categories, we formulate a multi-category graph learning problem as in the previous sections. Denote the data samples, which are the motion vectors in this case, in action category $c$ as $x^{(c)}_i$, $i = 1, \ldots, N_c$, where there are $S$ categories in total. As discussed previously, when regarding the motion vectors as graph signals, the variation of the graph signals on the skeletal graph is larger if the motion corresponds to a longer bone. That is, $\|x(s_j) - x(t_j)\|^2 \propto d(s_j, t_j)^2 \sin^2\!\left(\frac{\theta}{2}\right)$, where $d(s_j, t_j)$ is the bone length between the joint pair $(s_j, t_j)$. Therefore, in order for the learned edge weights to reflect the discriminative characteristics between action categories, rather than solely the inverse of the squared bone lengths, the raw data samples need to be preprocessed in such a way that the effect of $d(s_j, t_j)^2$ on the relative motion vectors is ruled out.

Algorithm 1 Disc-GLasso algorithm
For each $W_i$, given the empirical covariance matrices $K_1, \ldots, K_S$:
1. Search for the best (minimum) ratio $r$ such that $K_i - \frac{1}{r} \sum_{j \neq i}^{S} K_j \succeq 0, \forall i$.
2. Initialize with $W_i = K_i + \rho I - \frac{1}{r} \sum_{j \neq i}^{S} K_j$. The diagonal of $W_i$ will remain unchanged in what follows.
3. Cycle through the columns repeatedly, performing the following steps until convergence:
   (a) Rearrange the rows/columns so that the target column is last (implicitly).
   (b) Solve the lasso problem stated in (5.9).
   (c) Fill in the corresponding row and column of $W_i$ using $w^i_{12} = W^i_{11} \hat{\beta}$.
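Steps 1 and 2 of Algorithm 1 (the search for the smallest admissible ratio $r$ and the initialization of $W_i$) can be sketched as follows. This is a minimal sketch under stated assumptions, not the original implementation: the function names `min_feasible_r` and `init_W`, the candidate grid and the numerical tolerance are all hypothetical, and the sparsity weight is written as `rho`.

```python
import numpy as np


def min_feasible_r(K_list, candidates, tol=1e-10):
    """Smallest r among the candidates such that
    K_i - (1/r) * sum_{j != i} K_j is positive semi-definite for every class i.

    Feasibility is monotone in r (a larger r only shrinks the subtracted term),
    so scanning the candidates in increasing order returns the minimum.
    """
    K_sum = sum(K_list)
    for r in sorted(candidates):
        feasible = all(
            np.linalg.eigvalsh(K - (K_sum - K) / r).min() >= -tol for K in K_list
        )
        if feasible:
            return r
    return None  # no candidate satisfies the constraint


def init_W(K_list, i, r, rho):
    """Initialization W_i = K_i + rho * I - (1/r) * sum_{j != i} K_j."""
    K_i = K_list[i]
    others = sum(K_list) - K_i
    return K_i + rho * np.eye(K_i.shape[0]) - others / r
```

For example, with two classes whose covariances are $I$ and $2I$, the constraint for the first class requires $r \geq 2$, so the search over the grid `[0.5, 1.0, 2.0, 4.0]` would return 2.0.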
We perform this preprocessing by calculating the preprocessed empirical covariance matrix $K_c$ from the $N_c$ data samples $x^{(c)}_i$ for all $c = 1, \ldots, S$, with the squared bone length being normalized, i.e.,

$$K_c(s_j, t_j) = \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{\left[ x^{(c)}_i(s_j) - x^{(c)}_i(t_j) \right]^2}{d_i(s_j, t_j)^2}, \qquad \forall (s_j, t_j) \in \mathcal{E}. \qquad (5.11)$$

The goal is to learn a set of action-dependent graphs $\mathcal{G}_1, \ldots, \mathcal{G}_S$, where $\mathcal{G}_c = (\mathcal{V}, \mathcal{E}, W_c)$, $|\mathcal{V}| = n$, $|\mathcal{E}| = m$. $\mathcal{E}$ can be the edge set corresponding to the skeletal graph. $W_c$ denotes the weighted adjacency matrix learned for action category $c$, which contains the estimated weights for the edges defined in $\mathcal{E}$. We extend Problem 2 to the following problem, which solves for combinatorial graph Laplacians (CGL) under the constraint of a predetermined edge set that is tree structured, e.g., the edge set of a skeletal graph.

Problem 3 (Discriminative Graph Laplacian Learning on Skeletal Graph) Given the preprocessed empirical covariance matrices $K_c$ for all $c = 1, \ldots, S$, the goal is to solve for the set of CGLs $L_c$, $\forall c = 1, \ldots, S$, given the following graph learning objective function that promotes discrimination between categories:

$$\min_{L_c \in \mathcal{L}(\mathcal{E})} \; -\log |L_c|_{\dagger} + \mathrm{tr}(L_c K_c) - \frac{\alpha}{S-1} \sum_{s \neq c}^{S} \mathrm{tr}(L_c K_s) + \rho \|L_c\|_{1,\mathrm{off}} \qquad (5.12)$$

where $\mathcal{L}(\mathcal{E})$ is the set of CGLs with edge set $\mathcal{E}$ and $|L_c|_{\dagger}$ is the pseudo-determinant of $L_c$, which is required due to the singularity of CGL matrices.

Utilizing the facts that $|L_c|_{\dagger} = \det(L_c + \frac{1}{n} \mathbf{1}\mathbf{1}^\top)$ and $\|L_c\|_{1,\mathrm{off}} = \mathrm{tr}(L_c (I - \mathbf{1}\mathbf{1}^\top))$, we can further simplify (5.12) to the following optimization problem:

$$\min_{L_c \in \mathcal{L}(\mathcal{E})} \; -\log\det\!\left( L_c + \tfrac{1}{n} \mathbf{1}\mathbf{1}^\top \right) + \mathrm{tr}(L_c \widetilde{K}_c) \qquad (5.13)$$

where $\widetilde{K}_c = K_c - \frac{\alpha}{S-1} \sum_{s \neq c}^{S} K_s + \rho (I - \mathbf{1}\mathbf{1}^\top)$.

To represent the objective function (5.13) in terms of the edge weights we would like to solve for, we first denote the incidence matrix $B$ of a graph with edge set $\mathcal{E}$ as $B = (b_1, \ldots, b_m) \in \mathbb{R}^{n \times m}$, where $b_j = e_{s_j} - e_{t_j}$ for $(s_j, t_j) \in \mathcal{E}$. Then, the CGL can be expressed in terms of the incidence matrix and the edge weights:

$$L_c = B \, \mathrm{diag}(u^{(c)}_1, \ldots, u^{(c)}_m) \, B^\top$$

where $u^{(c)}_j$ represents the edge weight of the $j$-th edge in the $c$-th graph.
We represent the collection of all edge weights as a weight vector $u^{(c)} = (u^{(c)}_1, \ldots, u^{(c)}_m)^\top$. Then,

$$L_c + \frac{1}{n} \mathbf{1}\mathbf{1}^\top = \sum_{j=1}^{m} u^{(c)}_j b_j b_j^\top + \frac{1}{n} \mathbf{1}\mathbf{1}^\top = \begin{pmatrix} B & \mathbf{1} \end{pmatrix} \mathrm{diag}\!\left( u^{(c)}_1, \ldots, u^{(c)}_m, \tfrac{1}{n} \right) \begin{pmatrix} B & \mathbf{1} \end{pmatrix}^\top = G \, \mathrm{diag}(u^{(c)+}) \, G^\top \qquad (5.14)$$

where we let $G = \begin{pmatrix} B & \mathbf{1} \end{pmatrix}$ and $u^{(c)+} = \begin{pmatrix} u^{(c)\top} & \frac{1}{n} \end{pmatrix}^\top$. Note that since in our case either a skeletal graph or a hand graph is tree structured, i.e., $m = n - 1$, the augmented incidence matrix $G$ is square. From (5.14), we can express (5.13) as an objective function of $u^{(c)}$:

$$\min_{u^{(c)}} \; -\log\det\!\left( G \, \mathrm{diag}(u^{(c)+}) \, G^\top \right) + \mathrm{tr}\!\left( B \, \mathrm{diag}(u^{(c)}) \, B^\top \widetilde{K}_c \right) \qquad (5.15)$$

The objective function in (5.15), denoted as $J(u^{(c)})$, is convex in $u^{(c)}$, since the objective function in (5.13) is convex in $L_c$ and $L_c$ is an affine function of $u^{(c)}$ based on (5.14). The convexity of $J(u^{(c)})$ in $u^{(c)}$ means that the optimal solution $u^{(c)*}$ minimizing $J(u^{(c)})$ can be found by setting the derivative of $J(u^{(c)})$ with respect to $u^{(c)}$ to 0. However, taking the derivative of the log-determinant term in $J(u^{(c)})$ with respect to $u^{(c)}$ leads to a term including $\left( G \, \mathrm{diag}(u^{(c)+}) \, G^\top \right)^{-1}$, which cannot be expressed explicitly in $u^{(c)}$ in general. According to [50], when the graph $\mathcal{G}_c$ is known to be tree structured, which is satisfied in our case with either skeletal graphs or hand graphs, Kirchhoff's matrix tree theorem implies that the unweighted version of the graph Laplacian $L_c$, denoted as $Q_c = B \, \mathrm{diag}(\mathbf{1}) \, B^\top$, has pseudo-determinant equal to $n$. That is,

$$n = |Q_c|_{\dagger} = \det\!\left( Q_c + \tfrac{1}{n} \mathbf{1}\mathbf{1}^\top \right) = \det\!\left( G \, \mathrm{diag}(\mathbf{1}^{+}) \, G^\top \right) \qquad (5.16)$$

where $\mathbf{1}^{+} = (1, \ldots, 1, \frac{1}{n})^\top$. Given that $G$ is square and that $G$ and $\mathrm{diag}(\mathbf{1}^{+})$ are of equal size, (5.16) can be further simplified as follows:

$$n = \det(G) \det(\mathrm{diag}(\mathbf{1}^{+})) \det(G^\top) = \frac{1}{n} \det(G) \det(G^\top) \qquad (5.17)$$

Therefore, (5.17) implies that $|\det(G)| = n$.
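The factorization (5.14) and the identity $|\det(G)| = n$ are easy to check numerically on a small tree. The sketch below does so for an arbitrary 5-vertex tree; the helper name `augmented_incidence`, the edge list and the edge weights are illustrative choices, not values from the experiments.

```python
import numpy as np


def augmented_incidence(n, edges):
    """Return the incidence matrix B (n x m) and the augmented matrix G = [B, 1]."""
    B = np.zeros((n, len(edges)))
    for j, (s, t) in enumerate(edges):
        B[s, j] = 1.0   # b_j = e_s - e_t
        B[t, j] = -1.0
    G = np.hstack([B, np.ones((n, 1))])
    return B, G


n = 5
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]   # a tree: m = n - 1 edges
B, G = augmented_incidence(n, edges)

u = np.array([0.5, 2.0, 1.5, 3.0])          # arbitrary positive edge weights
u_plus = np.append(u, 1.0 / n)

L = B @ np.diag(u) @ B.T                    # combinatorial graph Laplacian
lhs = L + np.ones((n, n)) / n               # L + (1/n) 1 1^T
rhs = G @ np.diag(u_plus) @ G.T             # G diag(u^+) G^T, as in (5.14)
det_G = np.linalg.det(G)
```

Here `lhs` and `rhs` agree up to floating-point error, and `abs(det_G)` equals the number of vertices.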
Considering (5.14), we have

$$\det\!\left( G \, \mathrm{diag}(u^{(c)+}) \, G^\top \right) = n^2 \det\!\left( \mathrm{diag}(u^{(c)+}) \right) \qquad (5.18)$$

Consequently, we can simplify $J(u^{(c)})$ in (5.15) to the following:

$$J(u^{(c)}) = -2\log(n) - \log\det\!\left( \mathrm{diag}(u^{(c)+}) \right) + \mathrm{tr}\!\left( B \, \mathrm{diag}(u^{(c)}) \, B^\top \widetilde{K}_c \right) = -\log(n) - \sum_{j=1}^{m} \log(u^{(c)}_j) + \sum_{j=1}^{m} u^{(c)}_j \, b_j^\top \widetilde{K}_c \, b_j \qquad (5.19)$$

Finally, by setting the derivative of $J(u^{(c)})$ with respect to $u^{(c)}_j$ to 0, we obtain the optimal weight of the $j$-th edge in the graph for action category $c$. Based on $\widetilde{K}_c = K_c - \frac{\alpha}{S-1} \sum_{s \neq c}^{S} K_s + \rho (I - \mathbf{1}\mathbf{1}^\top)$ and the expression of the preprocessed empirical covariance matrix $K_c$ in (5.11), the optimal edge weight $u^{(c)}_j$ can be further expressed in terms of the data samples $x$:

$$u^{(c)}_j = \frac{1}{b_j^\top \widetilde{K}_c \, b_j} = \left[ \frac{1}{N_c} \sum_{i=1}^{N_c} \left( x^{(c)}_i(s_j) - x^{(c)}_i(t_j) \right)^2 - \frac{\alpha}{S-1} \sum_{s \neq c}^{S} \frac{1}{N_s} \sum_{i=1}^{N_s} \left( x^{(s)}_i(s_j) - x^{(s)}_i(t_j) \right)^2 + 2\rho \right]^{-1}, \qquad (5.20)$$

$\forall (s_j, t_j) \in \mathcal{E}$.

Equation (5.20) demonstrates that a closed-form solution exists for learning a set of discriminative skeletal graphs, i.e., there exists a closed-form solution for the optimal weights in each class-specific graph. With this result, the graphs and the graph-based representations can be obtained more efficiently. Furthermore, in some scenarios involving online learning/classification, e.g., online event detection, having a closed-form solution for the optimal graphs means that the system can obtain the optimal graphs in a timely fashion and can easily adapt to a new chunk of a sequence.

5.5 Experiments on Synthetic Data

Figure 5.1: Visualization of the original graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ for generating synthetic graph signals of two categories. [Panels: (a) $\mathcal{G}_1$; (b) $\mathcal{G}_2$; edge-weight scale from 0.1 to 0.9.]

In this section, we generate synthetic data to validate that our proposed method can construct graphs that are better at discriminating between multiple categories. First, we construct two graphs $\mathcal{G}_1$ and $\mathcal{G}_2$, each having 64 vertices following the $8 \times 8$ grid pattern.
$\mathcal{G}_1$ is a 4-connected graph with equally weighted horizontal and vertical edges ($w_{hor} = w_{ver} = 0.9$), while $\mathcal{G}_2$ is also 4-connected but with heavily weighted horizontal edges and weakly connected vertical edges ($w_{hor} = 0.9$, $w_{ver} = 0.1$), as illustrated in Fig. 5.1. The corresponding combinatorial Laplacian matrices, i.e., $L_1$, $L_2$, are then constructed for each graph, and $K_1 = (L_1 + I)^{-1}$, $K_2 = (L_2 + I)^{-1}$ are computed. Then we generate $N = 2000$ i.i.d. realizations following a multivariate Gaussian distribution with covariance matrices $K_1$ and $K_2$, respectively, for each class. These $2N$ graph signals are used as training data samples for learning the graph topology of each category. Finally, the empirical covariance matrices computed from the training samples in the two categories are used as input to our proposed Disc-GLasso algorithm, with its two parameters set to 1.0 and 0.05. For comparison, conventional graphical lasso [18] is applied on the empirical covariance matrix of training samples in each class, in order to estimate the graph of each category.

Figure 5.2: Visualization of the learned graphs for two categories of graph signals with different graph learning methods. (a) GLasso, Category 1, $|\mathcal{E}| = 115$; (b) GLasso, Category 2, $|\mathcal{E}| = 60$; (c) Disc-GLasso, Category 1, $|\mathcal{E}| = 88$; (d) Disc-GLasso, Category 2, $|\mathcal{E}| = 58$. Edge-weight color scale: 0.002161 to 0.815023.

The learned graphs for each data category and with each method are visualized in Fig. 5.2. Qualitative results show that, instead of pursuing solely smoothness/energy compaction, the graph for signals in Category 1 learned with the proposed Disc-GLasso algorithm has more strongly connected vertical edges, which provides better discrimination between these two graphical models, while the one learned with conventional graphical lasso has more equally weighted horizontal and vertical edges.

In order to quantify the advantages of Disc-GLasso, we generate testing samples, $N$ i.i.d.
Gaussian samples for each class with the same covariance matrices $K_1$ and $K_2$. Assume that graphs for signals in Category 1 and Category 2, denoted as $Q_1$ and $Q_2$, have been learned via either Disc-GLasso or GLasso. The following three measures are adopted to quantify improved discriminability.

• Cumulative spectrum of signals in each category on the learned graphs: The cumulative spectrum up to the $k$-th eigenvector of signals in Category $c$ on learned graph $\mathcal{G}_j$ is defined as $\frac{\sum_{i=1}^{N_c} \| U_{1:k}^\top x^{(c)}_i \|^2}{\sum_{i=1}^{N_c} \| x^{(c)}_i \|^2}$, where $U_{1:k}$ is a matrix containing the first $k$ eigenvectors of $Q_j$ as columns.

• We also introduce a new separation measure:

$$s = \frac{\mathrm{tr}(K_1 Q_2) + \mathrm{tr}(K_2 Q_1)}{\mathrm{tr}(K_1 Q_1) + \mathrm{tr}(K_2 Q_2)}. \tag{5.21}$$

• The accuracy of a binary classification problem that distinguishes which category each sample belongs to.

Figure 5.3: Cumulative spectrum energy of test signals in each class on the learned graphs. G1 and G2 represent respectively the graph learned for each category. (a) Test signals in category 1; (b) test signals in category 2. Horizontal axis: eigenvector index; vertical axis: cumulative energy. Curves: G1 learned with Disc-GLasso, G2 learned with Disc-GLasso, G1 learned with GLasso, G2 learned with GLasso.

The cumulative spectra of the signals are plotted in Fig. 5.3. As expected, the test signals in category 2 are shown to be smooth on the graph learned for class 2, regardless of whether that graph is learned with GLasso or Disc-GLasso. Furthermore, they are much less smooth on the graph of class 1 learned with the proposed Disc-GLasso than on that learned with GLasso, which validates the discriminative power between graphs.

Fig. 5.4 plots the separation measure of (5.21) versus the parameter $r$.
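The cumulative spectrum and the separation measure (5.21) defined above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, and, because the Disc-GLasso and GLasso solvers are not reproduced here, the ground-truth Laplacians $L_1$ and $L_2$ of the two $8 \times 8$ grid graphs stand in for the learned graphs $Q_1$ and $Q_2$. The grid weights, $K_i = (L_i + I)^{-1}$ and $N = 2000$ follow the setup described in this section.

```python
import numpy as np

def grid_laplacian(rows, cols, w_hor, w_ver):
    """Combinatorial Laplacian of a 4-connected rows x cols grid graph."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if c + 1 < cols:   # horizontal edge to the right neighbor
                W[v, v + 1] = W[v + 1, v] = w_hor
            if r + 1 < rows:   # vertical edge to the neighbor below
                W[v, v + cols] = W[v + cols, v] = w_ver
    return np.diag(W.sum(axis=1)) - W

def cumulative_spectrum(X, Q, k):
    """Energy fraction of the signals (columns of X) captured by the
    first k (lowest-frequency) eigenvectors of Q."""
    _, U = np.linalg.eigh(Q)   # eigenvalues are returned in ascending order
    return np.sum((U[:, :k].T @ X) ** 2) / np.sum(X ** 2)

def separation_measure(K1, K2, Q1, Q2):
    """Separation measure s of (5.21)."""
    num = np.trace(K1 @ Q2) + np.trace(K2 @ Q1)
    den = np.trace(K1 @ Q1) + np.trace(K2 @ Q2)
    return num / den

rng = np.random.default_rng(0)
n, N = 64, 2000

L1 = grid_laplacian(8, 8, w_hor=0.9, w_ver=0.9)
L2 = grid_laplacian(8, 8, w_hor=0.9, w_ver=0.1)
K1 = np.linalg.inv(L1 + np.eye(n))
K2 = np.linalg.inv(L2 + np.eye(n))

# Test signals of category 2, one signal per column.
X2 = rng.multivariate_normal(np.zeros(n), K2, size=N).T

# Assumption: the true Laplacians are used in place of learned graphs.
s = separation_measure(K1, K2, L1, L2)
half_energy = cumulative_spectrum(X2, L2, n // 2)
full_energy = cumulative_spectrum(X2, L2, n)

print(s, half_energy, full_energy)
```

With learned graphs available, the same functions would be called with the estimated $Q_1$ and $Q_2$ instead of `L1` and `L2`; a value of `s` above 1 indicates that each class is smoother on its own graph than on the other one.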
When discriminative graphs are learned for each class, the denominator of $s$ will be small while the numerator will be large, as the signals of one class should be smooth on the learned graph of that class and not smooth on the graph of the other class, which leads to a large $s$. Notice again that increasing $r$ makes the effect of the objective term $\sum_{j \neq i}^{S} \mathrm{tr}(K_j Q_i)$ weaker and hence favors less discrimination between the learned graphs, which is again validated via the monotonically descending trend in Fig. 5.4.

Figure 5.4: The separation measure versus $r$ with Disc-GLasso compared to that of GLasso. Horizontal axis: $r$ (10 to 100); vertical axis: separation measure $s$. Curves: Disc-GLasso, GLasso.

Finally, we examine whether improving the discrimination between learned graphs of different classes translates to improved classification. It is worth noting that our method can be utilized to preprocess the signals before a conventional classifier, such as SVM, is applied. The classification scheme we apply in this experiment directly assigns a class label to each test signal by choosing the class such that the corresponding graph transform provides a more compact representation of the input signal. Specifically, we project each test signal onto the first half of the GFT basis vectors (the low-frequency ones) of the graph for each category and calculate the projection energy. Then each signal is assigned the label of the class whose low-frequency GFT basis preserves more energy. The classification accuracy based on the 4000 testing signals between the two categories is reported in Fig. 5.5, which shows a consistent improvement versus the selected ratio $r$ when the graphs are learned with our proposed method.

5.6 Experiments on Real Motion Data

In this section, we examine the class-specific graph learning scheme in the context of real motion data. We first utilize the MSR-Action3D dataset for the qualitative experiments.
For each pair of actions, we adopt the proposed Disc-GLasso algorithm to learn one skeletal graph for each action category. Then we examine the spectrum of motion sequences in each class on each class-specific learned graph. Some of the results are shown in Fig. 5.8. We can observe that in all of the examples, the motion sequences of one action class are much smoother on their own graph learned with Disc-GLasso, while not being as smooth on the learned graph associated with the other action class.

We would also like to examine whether leveraging action-dependent graphs can benefit the recognition task in terms of classification accuracy. Based on the Florence 3D-Action dataset, we learn a set of action-dependent graphs via the closed-form solution for discriminative graph learning as derived in Section 5.4. For comparison, we also consider a set of action-dependent graphs learned from data in each class independently with graphical lasso [55], as well as the typical unweighted skeletal graph utilized in Section 4.1. Given one motion sequence, we cascade its full spectrum on each class-specific graph, followed by temporal pyramid pooling with a pyramid level of 3, which leads to a feature vector. A linear SVM classifier is adopted to classify the motion sequences with the given feature vectors under the cross-subject setting, where sequences from half of the subjects are used as training data and the remaining ones are used as testing data. The classification accuracy is reported in Table 5.1. The comparison between utilizing the typical unweighted skeletal graph and utilizing learned class-specific graphs demonstrates the benefits in discrimination and consequent classification accuracy, with class-specific graphs learned via the Disc-GLasso algorithm.

Figure 5.5: Classification accuracy versus $r$ with Disc-GLasso compared to that of GLasso. Horizontal axis: $r$ (10 to 100); vertical axis: classification accuracy. Curves: Disc-GLasso, GLasso.

Approach                                      Accuracy (%)
Predefined unweighted skeletal graph          86.79
Action-dependent graphs with GLasso           88.68
Action-dependent graphs with Disc-GLasso      90.57

Table 5.1: Classification performance of utilizing the predefined unweighted skeletal graph and learned graphs with GLasso and Disc-GLasso on the Florence 3D-Action dataset.

5.7 Conclusion

We propose a novel graph learning approach that learns a discriminative set of graphs from multiple categories of data samples. Instead of learning the graphs in terms of representability, we propose to include an additional LDA-like term to enable better discriminability between classes. We also derive a block coordinate descent algorithm to efficiently estimate the graph topology of each class. Qualitative and quantitative experiments on a synthetic dataset demonstrate that the graphs learned with our proposed method have more discrimination between classes, leading to benefits for classification. We then restrict the graph topology to a skeletal graph, which further leads to a closed-form solution for the optimal edge weights. Future work will consider applying our method to real-world data in applications such as anomaly detection, and also considering other functionals that may further improve discriminability between graphs.

Figure 5.6: Edge weights in the learned action-dependent skeletal graph for each action using the proposed closed-form solution of discriminative graphical lasso on the Florence 3D dataset. Panels: wave, drink, answer phone, clap, tight lace, sit down, stand up, read watch, bow.

Figure 5.7: Edge weights in the learned action-dependent skeletal graph for each action using conventional graphical lasso on the Florence 3D dataset. Panels: wave, drink, answer phone, clap, tight lace, sit down, stand up, read watch, bow.

Figure 5.8: Spectrum of motion sequences from one action class on different class-specific graphs. Panels show, for the action pairs (draw x, two hand wave), (draw x, jogging), (draw x, forward kick) and (forward kick, jogging), the spectrum of sequences in each class of the pair on the two class-specific graphs learned for that pair.

Chapter 6 Conclusion and Future Work

In this thesis we have presented a novel representation for human motion data using graph structures and have also developed methods for graph signal processing of human motion data. We proposed to model the human skeletal structure with an undirected skeletal graph, and extended this idea to propose an undirected skeletal-temporal graph for better accommodating temporal relationships. The tracked body joints are the vertices in the graph and the motion data are the graph signals residing on this graph. Two transforms, i.e., the graph Fourier transform and the spectral graph wavelet transform, which can analyze the signal contents in either the frequency domain or the vertex domain, were utilized to construct novel representations for captured human motion data.
We examined the properties of these proposed representations and demonstrated that they are not only easy to interpret, but also possess several desired properties, including energy compaction, preservation of discrimination between different actions, robustness to unreliable data and computational efficiency.

Furthermore, we demonstrated the applicability of our proposed representations in the context of real-world applications. We started with the application involving action recognition based on three broadly used datasets. Additionally, in the following Appendix A, we consider developing an automated mobility assessment system. In both applications, utilizing the proposed representations is shown to achieve performance comparable to that reported in the literature, while at the same time providing interpretations and other advantages such as significantly lower time complexity.

Finally, when we model the skeleton with the same fixed unweighted graph to create the representations for the application in action recognition, we recognize that different classes of motions can be significantly different, e.g., by involving different sets of body joints. Thus, it may make more sense to use a different graph structure for different actions. We propose a novel approach where a different graph is learned for each class. We propose a general discriminative graph learning scheme, an efficient algorithm to learn multiple graphs suitable for general multi-class classification, and further develop class-specific graph learning based methods specifically for motion representation. We evaluate the proposed algorithm on both synthetic and real motion datasets, showing that our approach can promote discrimination between classes.

There are several directions for ongoing and future work. For constructing a representation, first, besides constructing the skeletal-temporal graph and using TPM for modeling the frame-wise representation, the development of schemes to take into account temporal dynamics remains unexplored.
In some cases, more complex temporal modeling approaches may help increase the ability to describe the actions. In addition, the temporal edges in the skeletal-temporal graph were assigned unity weights. However, this may not be an optimal choice, considering that the temporal dynamics can be different at each joint. Graph learning methods can be extended to select the optimal edge weights for the temporal edges as well. Moreover, as the skeletal graph and the skeletal-temporal graph have a known spectrum once the graph is defined, we may be able to design better SGWT kernels based on this spectrum.

Finally, as we propose a novel and general framework and efficient algorithms for learning multiple class-specific graphs while promoting discrimination between classes, we can apply this framework and method to other interesting applications, such as anomaly detection. For example, learning graphs under a smooth signal assumption has achieved the best performance in an application with a seizure prediction dataset, where EEG signals of subjects are used to predict normal/abnormal situations. Our algorithm not only considers the smooth signal assumption but also promotes the discrimination between conditions, and is therefore promising for this type of application.

Bibliography

[1] J. K. Aggarwal and M. S. Ryoo. “Human Activity Analysis: A Review”. In: ACM Comput. Surv. 43.3 (Apr. 2011), 16:1–16:43. issn: 0360-0300.
[2] A. Anis, P. A. Chou, and A. Ortega. “Compression of dynamic 3D point clouds using subdivisional meshes and graph wavelet transforms”. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2016, pp. 6360–6364.
[3] A. Ortega, P. Frossard, J. Kovačević, J. M. F. Moura, and P. Vandergheynst. “Graph signal processing: Overview, challenges, and applications”. In: Proceedings of the IEEE 106.5 (2018), pp. 808–828.
[4] A. Sakiyama, Y. Tanaka, T. Tanaka, and A. Ortega. “Efficient sensor position selection using graph signal sampling theory”. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2016, pp. 6225–6229.
[5] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. “NTU RGB+D: A large scale dataset for 3D human activity analysis”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
[6] A. Weiss, S. Sharifi, M. Plotnik, J. P. P. van Vugt, N. Giladi, and J. M. Hausdorff. “Toward automated, at-home assessment of mobility among patients with Parkinson disease, using a body-worn accelerometer”. In: Neurorehabilitation and neural repair 25.9 (2011), pp. 810–818.
[7] B. Takač, A. Català, D. R. Martín, N. Van Der Aa, W. Chen, and M. Rauterberg. “Position and orientation tracking in a ubiquitous monitoring system for parkinson disease patients with freezing of gait symptom”. In: JMIR mHealth and uHealth 1.2 (2013).
[8] Carnegie Mellon University Motion Capture Database. mocap.cs.cmu.edu.
[9] A. A. Chaaraoui, P. C.-P., and F. Flórez-Revuelta. “Silhouette-based human action recognition using sequences of key poses”. In: Pattern Recognition Letters 34.15 (2013), pp. 1799–1807.
[10] C. Li, Z. Cui, W. Zheng, C. Xu, R. Ji, and J. Yang. “Action-Attending Graphic Neural Network”. In: IEEE Transactions on Image Processing 27 (2018), pp. 3657–3670.
[11] M. Crovella and E. D. Kolaczyk. “Graph wavelets for spatial traffic analysis”. In: Proceedings of IEEE INFOCOM. Vol. 3. 2003, pp. 1848–1857.
[12] M. Defferrard, X. Bresson, and P. Vandergheynst. “Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering”. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16. 2016, pp. 3844–3852.
[13] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains”. In: IEEE Signal Processing Magazine 30.3 (2013), pp. 83–98.
[14] Y. Du, W. Wang, and L. Wang. “Hierarchical recurrent neural network for skeleton based action recognition”. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 1110–1118.
[15] R. P. W. Duin and D. M. J. Tax. “Experiments with classifier combining rules”. In: Multiple classifier systems. Springer, 2000, pp. 16–29.
[16] F. B. Kashani, G. Medioni, K. Nguyen, L. Nocera, C. Shahabi, R. Wang, C. E. Blanco, Y.-A. Chen, Y.-C. Chung, B. Fisher, and others. “Monitoring mobility disorders at home using 3D visual sensors and mobile sensors”. In: Proceedings of the 4th Conference on Wireless Health. ACM. 2013, p. 9.
[17] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. “Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition”. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. 2012, pp. 8–13.
[18] J. Friedman, T. Hastie, and R. Tibshirani. “Sparse inverse covariance estimation with the graphical lasso”. In: Biostatistics 9.3 (2008), pp. 432–441.
[19] F. W. Young, Y. Takane, and J. de Leeuw. “The principal components of mixed measurement level multivariate data: An alternating least squares method with optimal scaling features”. In: Psychometrika 43.2 (1978), pp. 279–281.
[20] B. Galna, G. Barry, D. Jackson, D. Mhiripiri, P. Olivier, and L. Rochester. “Accuracy of the Microsoft Kinect sensor for measuring movement in people with Parkinson’s disease”. In: Gait & posture 39.4 (2014), pp. 1062–1068.
[21] B. Galna, D. Jackson, G. Schofield, R. McNaney, M. Webster, G. Barry, D. Mhiripiri, M. Balaam, P. Olivier, and L. Rochester. “Retraining function in people with Parkinson disease using the Microsoft kinect: game design and pilot testing”. In: Journal of neuroengineering and rehabilitation 11.1 (2014), p. 60.
[22] D. J. Goble. “The Influence of Horizontal Walking Velocity on the Bilateral Symmetry of Normal Ground Reaction Force Parameters”. PhD thesis. University of Windsor, 2002.
[23] D. J. Goble, G. W. Marino, and J. R. Potvin. “The influence of horizontal velocity on interlimb symmetry in normal walking”. In: Human Movement Science 22.3 (2003), pp. 271–283.
[24] D. K. Hammond, P. Vandergheynst, and R. Gribonval. “Wavelets on graphs via spectral graph theory”. In: Applied and Computational Harmonic Analysis 30.2 (2011), pp. 129–150.
[25] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference, and prediction (Second Edition). Springer, 2009.
[26] M. Henaff, J. Bruna, and Y. LeCun. “Deep Convolutional Networks on Graph-Structured Data”. In: ArXiv e-prints (June 2015). arXiv: 1506.05163 [cs.LG].
[27] I. Lee, D. Kim, S. Kang, and S. Lee. “Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks”. In: 2017 IEEE International Conference on Computer Vision (ICCV). 2017, pp. 1012–1020.
[28] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun. “Spectral networks and locally connected networks on graphs”. In: International Conference on Learning Representations (ICLR2014), CBLS, April 2014. 2014.
[29] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. “On combining classifiers”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 20.3 (1998), pp. 226–239.
[30] J. Liu, A. Shahroudy, D. Xu, and G. Wang. “Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition”. In: Computer Vision – ECCV 2016. 2016, pp. 816–833.
[31] J. M. Haddad, R. E. A. van Emmerik, S. N. Whittlesey, and J. Hamill. “Adaptations in interlimb and intralimb coordination to asymmetrical loading in human walking”. In: Gait & Posture 23.4 (2006), pp. 429–434.
[32] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, 2002.
[33] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. “Real-time human pose recognition in parts from single depth images”. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2011, pp. 1297–1304.
[34] J. Wang, Z. Liu, Y. Wu, and J. Yuan. “Mining actionlet ensemble for action recognition with depth cameras”. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. 2012, pp. 1290–1297.
[35] J. Wang, Z. Liu, Y. Wu, and J. Yuan. “Mining actionlet ensemble for action recognition with depth cameras”. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2012, pp. 1290–1297.
[36] J.-Y. Kao, D. Tian, H. Mansour, A. Ortega, and A. Vetro. “Disc-GLasso: Discriminative graph learning with sparsity regularization”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2017, pp. 2956–2960.
[37] J.-Y. Kao, M. Nguyen, L. Nocera, C. Shahabi, A. Ortega, C. Winstein, I. Sorkhoh, Y.-c. Chung, Y.-a. Chen, and H. Bacon. “Validation of Automated Mobility Assessment Using a Single 3D Sensor”. In: Computer Vision – ECCV 2016 Workshops. ECCV 2016. Lecture Notes in Computer Science 9914 (2016), pp. 162–177.
[38] J.-Y. Kao, K.-S. Lu, and A. Ortega. “Construction of Skeletal and Hand Graphs”. In: in preparation (2018).
[39] J.-Y. Kao, A. Ortega, and S. S. Narayanan. “Graph-based approach for motion capture data representation and analysis”. In: Image Processing (ICIP), 2014 IEEE International Conference on. 2014, pp. 2061–2065.
[40] T. Kerola, N. Inoue, and K. Shinoda. “Spectral graph skeletons for 3D action recognition”. In: Computer Vision – ACCV 2014. Springer, 2014, pp. 417–432.
[41] T. N. Kipf and M. Welling. “Semi-Supervised Classification with Graph Convolutional Networks”. In: ICLR 2017. 2017.
[42] C. H. Kwak and I. V. Bajić. “Online MoCap Data Coding With Bit Allocation, Rate Control, and Motion-Adaptive Post-Processing”. In: IEEE Transactions on Multimedia 19.6 (2017), pp. 1127–1141.
[43] B. M. Lake and J. B. Tenenbaum. “Discovering structure by learning sparse graphs”. In: Proceedings of the 32nd Cognitive Science Conference (2010).
[44] M. Li and H. Leung. “Graph-based approach for 3D human skeletal action recognition”. In: Pattern Recognition Letters 87 (2017). Advances in Graph-based Pattern Recognition, pp. 195–202.
[45] W. Li, Z. Zhang, and Z. Liu. “Action recognition based on a bag of 3D points”. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops. 2010, pp. 9–14.
[46] F. Lozes, A. Elmoataz, and O. Lezoray. “PDE-Based Graph Signal Processing for 3-D Color Point Clouds: Opportunities for cultural heritage”. In: IEEE Signal Process. Mag. 32.4 (2015), pp. 103–111.
[47] L. Seidenari, V. Varano, S. Berretti, A. Del Bimbo, and P. Pala. “Recognizing Actions from Depth Cameras as Weakly Aligned Multi-part Bag-of-Poses”. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013, pp. 479–485.
[48] L. Shi, Y. Zhang, J. Cheng, and H. Lu. “Adaptive Spectral Graph Convolutional Networks for Skeleton-Based Action Recognition”. In: ArXiv e-prints (May 2018). arXiv: 1805.07694 [cs.CV].
[49] K.-S. Lu and A. Ortega. “Fast graph Fourier transforms based on graph symmetry and bipartition”. In: ArXiv e-prints (2019). arXiv: 1907.07875 [eess.SP].
[50] K.-S. Lu, E. Pavez, and A. Ortega. “On Learning Laplacians of Tree Structured Graphs”. In: IEEE Data Science Workshop (2018), pp. 205–209.
[51] J. Luo, W. Wang, and H. Qi. “Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps”. In: Computer Vision (ICCV), 2013 IEEE International Conference on. 2013, pp. 1809–1816.
[52] F. Lv and R. Nevatia. “Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost”. In: Computer Vision – ECCV 2006. Ed. by Aleš Leonardis, Horst Bischof, and Axel Pinz. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 359–372.
[53] M. A. Gowayyed, M. Torki, M. E. Hussein, and M. El-Saban. “Histogram of Oriented Displacements (HOD): Describing Trajectories of Human Joints for Action Recognition”. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. 2013, pp. 1351–1357.
[54] J. Martens and I. Sutskever. “Learning Recurrent Neural Networks with Hessian-free Optimization”. In: Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML’11. 2011, pp. 1033–1040.
[55] R. Mazumder and T. Hastie. “The graphical lasso: New insights and alternatives”. In: Electronic journal of statistics 6 (2012), p. 2125.
[56] M. E. McNeely, R. P. Duncan, and G. M. Earhart. “Medication improves balance and complex gait performance in Parkinson disease”. In: Gait & Posture 36.1 (2012), pp. 144–148.
[57] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. D. Bimbo. “3-D Human Action Recognition by Shape Analysis of Motion Trajectories on Riemannian Manifold”. In: IEEE Transactions on Cybernetics 45.7 (2015), pp. 1340–1352.
[58] M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban. “Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations”. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. IJCAI’13. 2013, pp. 2466–2472.
[59] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. “The WEKA data mining software: an update”. In: ACM SIGKDD explorations newsletter 11.1 (2009), pp. 10–18.
[60] E. Mirek, M. Rudzińska, and A. Szczudlik. “The assessment of gait disorders in patients with Parkinson’s disease using the three-dimensional motion analysis system Vicon.” In: Neurologia i neurochirurgia polska 41.2 (2006), pp. 128–133.
[61] M. Müller and T. Röder. “Motion Templates for Automatic Classification and Retrieval of Motion Capture Data”. In: Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. SCA’06. 2006, pp. 137–146.
[62] M. Nguyen, L. Fan, and C. Shahabi. “Activity Recognition Using Wrist-Worn Sensors for Human Performance Evaluation”. In: The Sixth Workshop on Biological Data Mining and its Applications in Healthcare (2015).
[63] E. O.-B. and M. Trivedi. “Joint angles similarities and HOG2 for action recognition”. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2013.
[64] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. “Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition”. In: J. Vis. Comun. Image Represent. 25.1 (Jan. 2014), pp. 24–38.
[65] O. Oreifej and Z. Liu. “HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences”. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 716–723.
[66] G. Palacios-Navarro, I. García-Magariño, and P. Ramos-Lorente. “A Kinect-Based System for Lower Limb Rehabilitation in Parkinson’s Disease Patients: a Pilot Study”. In: Journal of medical systems 39.9 (2015), pp. 1–10.
[67] E. Pavez and A. Ortega. “Generalized Laplacian precision matrix estimation for graph signal processing”. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2016, pp. 6350–6354.
[68] P. Wang, C. Yuan, W. Hu, B. Li, and Y. Zhang. “Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition”. In: Computer Vision – ECCV 2016. 2016, pp. 370–385.
[69] P. Wang, Y. Cao, C. Shen, L. Liu, and H. T. Shen. “Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition”. In: IEEE Trans. Cir. and Sys. for Video Technol. 27.12 (Dec. 2017), pp. 2613–2622.
[70] R. Anirudh, P. Turaga, J. Su, and A. Srivastava. “Elastic Functional Coding of Riemannian Trajectories”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.5 (2017), pp. 922–936.
[71] M. Raptis, D. Kirovski, and H. Hoppe. “Real-time Classification of Dance Gestures from Skeleton Animation”. In: Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 2011, pp. 147–156.
[72] L. Rokach. “Ensemble-based classifiers”. In: Artificial Intelligence Review 33.1-2 (2010), pp. 1–39.
[73] D. Ruta and B. Gabrys. “Classifier selection for majority voting”. In: Information fusion 6.1 (2005), pp. 63–81.
[74] R. Wang, G. Medioni, C. J. Winstein, and C. Blanco. “Home monitoring musculo-skeletal disorders with a single 3d sensor”. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on. IEEE. 2013, pp. 521–528.
[75] A. Sandryhaila and J. M. F. Moura. “Big Data Analysis with Signal Processing on Graphs: Representation and processing of massive data sets with irregular structure”. In: IEEE Signal Processing Magazine 31.5 (2014), pp. 80–90.
[76] K. Schindler and L. van Gool. “Action snippets: How many frames does human action recognition require?” In: 2008 IEEE Conference on Computer Vision and Pattern Recognition. 2008, pp. 1–8.
[77] S. Obdrzálek, G. Kurillo, F. Ofli, R. Bajcsy, E. Y. W. Seto, H. B. Jimison, and M. Pavel. “Accuracy and robustness of Kinect pose estimation in the context of coaching of elderly population”. In: 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2012), pp. 1188–1193.
[78] O. Teke and P. P. Vaidyanathan. “Uncertainty Principles and Sparse Eigenvectors of Graphs”. In: IEEE Transactions on Signal Processing 65.20 (2017), pp. 5406–5420.
[79] D. Thanou, P. A. Chou, and P. Frossard. “Graph-Based Compression of Dynamic 3D Point Cloud Sequences”. In: IEEE Trans. Image Process. 25.4 (2016), pp. 1765–1778.
[80] D. Thanou, D. I. Shuman, and P. Frossard. “Learning Parametric Dictionaries for Signals on Graphs”. In: IEEE Transactions on Signal Processing 62.15 (2014), pp. 3849–3862.
[81] D. Thanou, D. I. Shuman, and P. Frossard. “Parametric dictionary learning for graph signals”. In: Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE. 2013, pp. 487–490.
[82] T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh. “Matrix Completion and Low-rank SVD via Fast Alternating Least Squares”. In: J. Mach. Learn. Res. 16.1 (Jan. 2015), pp. 3367–3402.
[83] R. Vemulapalli, F. Arrate, and R. Chellappa. “Human Action Recognition by Representing 3D Skeletons As Points in a Lie Group”. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. CVPR ’14. 2014, pp. 588–595.
[84] C.-y. Wang, Y. Wang, and A. L. Yuille. “Mining 3D Key-Pose-Motifs for Action Recognition”. In: CVPR. IEEE Computer Society, 2016, pp. 2639–2647.
[85] X. Dong, D. Mavroeidis, F. Calabrese, and P. Frossard. “Multiscale Event Detection in Social Media”. In: Data Min. Knowl. Discov. 29.5 (Sept. 2015), pp. 1374–1405.
[86] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst. “Laplacian matrix learning for smooth graph signal representation”. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2015, pp. 3736–3740.
[87] L. Xia, C.-C. Chen, and J. K. Aggarwal. “View invariant human action recognition using histograms of 3D joints.” In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2012, pp. 20–27.
[88] S. Yan, Y. Xiong, and D. Lin. “Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition”. In: AAAI. 2018.
[89] X. Yang and Y. Tian. “EigenJoints-based action recognition using Naïve-Bayes-Nearest-Neighbor.” In: CVPR 2012 HAU3D Workshop. IEEE, 2012, pp. 14–19.
[90] M. Zanfir, M. Leordeanu, and C. Sminchisescu. “The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection”. In: Computer Vision (ICCV), 2013 IEEE International Conference on. 2013, pp. 2752–2759.
[91] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”. In: CVPR. 2017.
[92] C. Zhang, D. Florencio, and P. A. Chou. Graph Signal Processing - A Probabilistic Framework. Tech. rep. Microsoft Research, 2015.
[93] X. Zhang, X. Dong, and P. Frossard. “Learning of Structured Graph Dictionaries”. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2012.
[94] Y. Zhu, W. Chen, and G. Guo. “Fusing Spatiotemporal Features and Joints for 3D Action Recognition”. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013, pp. 486–491.
Appendix A

Case Study: Automated Mobility Assessment System

In this appendix, we examine the ability of the proposed GFT-based representation (see Section 3.4.1) to represent motion data and exploit critical coordination between body parts, in the context of an automated mobility assessment system. Some general methodologies for designing such systems are presented, and an approach to extracting features based on the proposed GFT-based representation is discussed. This appendix is mostly based on our work in [37].

A.1 Introduction

Observation of a person’s movements while performing certain tasks is widely used in many contexts, such as early diagnosis and treatment of various diseases, sports and military applications. Example applications include estimating the risk of falls in elderly patients, adjusting medication levels for those being treated for musculo-skeletal disorders, or evaluating movement for rehabilitation. It is common practice for physicians to assess mobility, e.g., gait and balance, by direct observation of patients performing standardized tasks. In rare cases, highly specialized equipment providing kinematic measurements is used. Clearly, cost as well as personnel and equipment availability make it impossible to provide this kind of assessment more broadly. Thus, while in some situations it would be desirable to assess mobility frequently, and to do so at a patient’s home, this is not possible in practice.

Several factors motivate our work aimed at developing automated mobility assessment systems based on depth sensors. The increased availability of low cost wearable sensors and other body sensing technologies provides the opportunity to unobtrusively and continuously sense and assess human mobility. Moreover, harnessing mobility analytics can lead to the development of a broad range of applications. For example, wearable sensors that summarize activity in the field from measurements from various sensors (e.g., acceleration, gyroscope) have already been utilized in various applications [6]. 3D sensors (e.g., Microsoft Kinect) have the potential to complement wearable sensor measurements at home or in the clinic, providing detailed insights about motion characteristics while having the advantage of being unobtrusive. In this work, we validate an automated mobility assessment system and highlight design aspects that can be generalized to other applications. For this, we focus on musculo-skeletal disorders as a case study, in part because of the existence of abundant literature related to these disorders. Furthermore, there is potential for very significant impact for these automated systems, since common clinical practice already involves mobility assessment protocols carried out on simple activities.

A reliable monitoring system to evaluate musculo-skeletal disorders, e.g., Parkinson’s Disease (PD), can be beneficial to both patients and clinicians as the population of patients with these types of disorders increases. One example of such a system is introduced in [16] with the POCM 2 system, where a single 3D sensor, e.g., Microsoft Kinect, is used to enable frequent mobility assessments from the home. In this work we present a system to automatically assess mobility. For this, we show that novel preprocessing, feature extraction and classifier selection are required. We utilize the proposed graph-based representation and compare its effectiveness to the PCA-based method. We present results that demonstrate the effectiveness of the proposed system in predicting the medication state of persons with PD. Our proposed system is evaluated on a real dataset collected using an MS Kinect camera from a group of subjects with PD.

Gait in persons with PD is characterized by bradykinesia, instability, episodes of freezing and increased variability, and may be mitigated through medication.
Optimized medication and rehabilitation plans require reliable and frequent mobility assessments. In current practice, trained physicians assess the mobility of persons with PD during visits at the clinic by visually observing patients as they perform standardized tasks. In addition to requiring patients to perform these tasks in the controlled environment of the clinic, current practice is limited due to the lack of quantifiable measurements. Current clinical scales, e.g., the Unified Parkinson’s Disease Rating Scale (UPDRS), also lack resolution and rely on visual observation.

Our hypothesis is that differences in performance under different medication conditions in subjects with PD can be distinguished from the skeleton data (i.e., joint positions over time) produced by the Kinect sensor. Therefore, in general, it makes sense to process skeleton data directly. We consider the two medication states in persons with PD: the ON condition, corresponding to a medicated state, and the OFF condition, corresponding to a non-medicated state. The OFF condition effectively simulates a state in which the medication is no longer effective and mobility may be affected. We further analyze which skeleton data features support classifying the medication state using various machine learning approaches. In our experiments, we focus on subject-dependent classification, as we are interested in predicting the medication state of specific individuals. However, we provide a comparison with subject-independent classification to show how these results generalize. We further report on performance for different classifiers to determine which approach is best suited for these data.

Results presented are based on a study [16] carried out on 14 PD patients who perform a dual-task walking action, i.e., walking in a figure-of-eight pattern while counting backward, for which the Kinect sensor seems to be the most reliable. The cognitive task further challenges participants; therefore, we expect it to accentuate the difference between conditions.
Previous work [16, 74] shows that some gait parameters, e.g., stepping time and step size, are significantly different between PD and non-PD subjects. This appendix addresses the more challenging task of distinguishing between different medication levels. We show that more appropriate features (e.g., the proposed graph-based features) and system design (e.g., proper normalization and combination of classifiers) are necessary in order to capture the fine movement differences between conditions.

To our knowledge this is the first study to use Kinect skeleton data to discriminate the medication state of persons with PD. Our results show the potential for Kinect type sensors to be used to quantify mobility performance in persons with PD and possibly other mobility related disabilities. The main contributions of this work are: (1) we develop and validate a method to automatically assess mobility using a single 3D sensor that can discriminate the medication state of PD subjects, (2) we utilize a graph-based feature extraction technique, as proposed in Chapter 3, to reveal the dynamic coordination between parts of the body; compared with purely data-driven techniques, it provides comparable classification performance but is easier to interpret, and (3) we provide design insights on the proposed system and discuss how they can benefit mobility assessment in other contexts.

The remainder of the appendix is organized as follows. Section A.2 presents a brief review of related work. Section A.3 proposes the general methodology and gives insights into key factors for deploying a successful automated mobility assessment system. Section A.4 describes the features and classification algorithms implemented. Section A.5 presents the experimental setup, methods and results. Finally, Section A.6 concludes the appendix and provides future directions.

A.2 Related Work

Kinect has previously been used in applications related to PD, primarily as an intervention tool [21, 66] where game-play supports motivation and a game score is used as a measure of performance. In [21], a game was developed to train dynamic postural control, and the accuracy of Kinect in measuring on-the-spot walking, stepping and reaching was studied. A comparison of the Kinect system against a Vicon motion capture system is presented in [20].

Kinect has also been used to supplement inertial sensor data as part of a home monitoring system for detecting freezing of gait [7]. The POCM 2 system [16] used the raw Kinect skeleton data on pilot data to detect pauses and discriminate between a non-PD person and a person with PD using dimensionality reduction techniques. A similar comparison was reported by [60] using a Vicon system, showing a difference between PD patients and healthy controls of similar age both in angle changes and in spatiotemporal parameters of gait.

In this appendix, we build on previous work (in particular [16, 74]) to tackle the difficult problem of assessing fine changes in mobility induced by medication in PD patients.

A.3 System Design

In this section, we propose a general methodology for deploying an automated mobility assessment system based on cost-effective 3D sensors. We provide insights into key factors that can lead to the success or failure of the deployment of this type of system. Key aspects of this methodology will be validated in our evaluation section based on experimental results from our case study.

Hardware and environment

One of the most important factors is to understand the limitations of existing 3D sensors. Taking the Microsoft Kinect sensor as an example, its horizontal Field of View has a practical range of 1.2 to 3.5 meters while its vertical Field of View has a practical range of 0.8 to 2.5 meters.
Therefore, when choosing the environment in which to deploy the system and the tasks to be performed by the subjects, keeping subjects within these ranges is critical to system robustness. For example, an outdoor open environment may not be suitable for deploying this system. Also, requesting the subjects to perform certain tasks such as climbing up stairs or running far away may lead to system failure, as the device Field of View could be exceeded. Furthermore, as the estimation of the 3D positions of skeletal joints with the Kinect SDK has much larger errors in certain situations, e.g., walking away from the sensor at the periphery of the Field of View or taking a turn, the environment and task should limit the occurrence of these situations.

Tasks to be assessed

The tasks that the subjects are asked to perform should be chosen to satisfy the following criteria. First, activities that can fully exploit and examine the mobility of all parts of the body are preferable to those that place explicit constraints on certain parts of the body. For example, a task requiring the subject to count silently while walking can be better than a walking task where the subject is required to hold a tray, as the latter activity has less potential to exploit the upper body mobility, while in the former the walking may be more “natural” as the subject has to focus on counting. Secondly, the level of difficulty in performing activities will affect the capability of discriminating subjects’ states. An over-simplified task will make it easier for the subjects to control their mobility performance under different states/conditions, which would make it more difficult for the system to distinguish between states/conditions based on the assessed mobility. Finally, it is better to have each task performed repeatedly by each subject in order to improve the robustness of the assessment.
A.4 Feature Design and Classification

A preliminary statistical analysis on a partial dataset indicated that when participants performed the task in the OFF state, they took shorter steps and had increased hip flexion, the latter likely due to increased trunk forward lean. These results prompted us to focus on subject-dependent classification and extend the features considered to include angular gait measurements (e.g., angles extracted from the skeleton joints and angular speed) and graph-based features.

Mobility data are often normalized when studied. For example, in the context of action recognition, [40, 90] propose a normalization scheme that is applied first to the captured skeleton data by estimating the expected lengths of skeleton limbs (connections) across subjects from the training data, and then adjusting the locations of joints to achieve the same lengths of limbs, with the limb direction vectors being preserved. We compare the effects of normalization on the subject-independent classification performance in Section A.5.5.

We provide hereafter details about the features used for classification.

Gait Measurements

To measure the gait statistics, after segmenting the strides we extract a set of spatial-temporal parameters including step lengths, stride time and stride width, as suggested in [56]. Each stride consists of two steps, i.e., a left step and a right step. For consistency and across-subjects analysis, we labeled steps based on the most/least affected side (as opposed to using left/right). Consequently, the step features are named most-affected-side step length and least-affected-side step length.

Angular Statistics

We extract a set of angular parameters associated with each joint as features, inspired by the SMIJ method [64]. Angles corresponding to each joint at each time stamp are computed by evaluating the dot product of the vectors defined by the limb segments connecting that joint.
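The dot-product computation just described can be sketched as follows. This is a minimal illustration and not the implementation used in the study: the function names `joint_angle` and `angle_statistics` are hypothetical, the angle returned is only the one in [0, π] given by the normalized dot product (any joint-specific choice of angle convention is left out), the standard deviation is the population one, and "angular speed" is assumed here to be the mean absolute frame-to-frame change divided by the frame period `dt`.

```python
import math

def joint_angle(parent, joint, child):
    """Angle in radians, in [0, pi], at `joint` between the limb segments
    joint->parent and joint->child, obtained from the normalized dot product."""
    u = [p - j for p, j in zip(parent, joint)]
    v = [c - j for c, j in zip(child, joint)]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("degenerate limb segment (zero length)")
    # Clamp to [-1, 1] to guard against rounding before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)

def angle_statistics(angles, dt):
    """Five statistics of one angle over time: average, standard deviation,
    min, max and angular speed (assumed definition, see lead-in)."""
    n = len(angles)
    mean = sum(angles) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in angles) / n)
    if n > 1:
        speed = sum(abs(b - a) for a, b in zip(angles, angles[1:])) / ((n - 1) * dt)
    else:
        speed = 0.0
    return {"average": mean, "std": std, "min": min(angles),
            "max": max(angles), "angular_speed": speed}

# Usage: elbow angle from hypothetical shoulder, elbow and hand positions.
theta = joint_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Applying `angle_statistics` to each tracked angle and concatenating the five values per angle gives one feature vector per segment.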
Assuming the angle corresponding to the dot product is $\theta$, we have two possible choices for the joint angle: $\theta$ and $2\pi - \theta$. The actual value used is defined by taking into account the type of joint (e.g., elbow or knee) and the direction of motion. We consider 19 angles in total: one angle for elbow, knee, and neck joints and two angles for the hip and shoulder joints. To capture temporal variations we further consider the following five statistics: average, standard deviation, min, max, and angular speed. The resulting feature vector of angular statistics has a dimension of 19 (angles) × 5 (statistics) = 95.

Graph-based Features

In order to extract a set of features that can capture and evaluate more global properties in motions, e.g., the coordination between two body parts, we utilize the graph-based method proposed in Chapter 3. A 15-node skeletal graph is constructed as discussed in Section 3.2 and frame-wise GFT-based representations are extracted as described in Section 3.4.1.

As discussed in Chapter 3, the GFT basis can capture global motion properties. For example, as shown in Fig. 3.11, the second basis vector is able to exploit the coordination between upper and lower body, while the third basis vector can help capture and measure the degree of bilateral symmetry, which is an important characteristic of walking movement.

To cope with the different number of steps in each sequence and to represent the temporal dynamics of the frame-by-frame coefficients, we adopt a temporal pyramid pooling scheme similar to [34, 51, 53]. We define an average pooling function $F : \mathbb{R}^{m \times n} \to \mathbb{R}^{1 \times n}$ such that $z = F(B)$ provides the column-wise average of a matrix $B$. Let $K$ denote the maximum number of pyramid levels to be used. Then at level $k \le K$, we compute the pooled coefficient vector as $z_k = [F(B_1), \ldots, F(B_{2^{k-1}})]$, where $\{B_i\}$ is a set of non-overlapping block matrices uniformly dividing the matrix $C$, which contains all the transform coefficients as calculated previously. A final feature vector $d$ is obtained as a concatenation of the pooled coefficient vectors at each level, i.e., $d = [z_1, z_2, \ldots, z_K]^{\top}$, with dimension $45(2^K - 1)$.
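The temporal pyramid pooling step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the coefficient matrix C is taken to have one row per frame and one column per transform coefficient (45 columns would match the stated dimension), the function names are hypothetical, and the way blocks are cut when the number of frames is not divisible by the number of blocks (integer boundaries) is a choice made here rather than something the text specifies.

```python
def average_pool(block):
    """F: column-wise average of an m x n block (list of rows) -> list of n values."""
    m = len(block)
    n = len(block[0])
    return [sum(row[j] for row in block) / m for j in range(n)]

def split_uniform(rows, parts):
    """Cut `rows` into `parts` contiguous, non-overlapping, near-equal blocks."""
    total = len(rows)
    if total < parts:
        raise ValueError("need at least as many frames as blocks")
    bounds = [(i * total) // parts for i in range(parts + 1)]
    return [rows[bounds[i]:bounds[i + 1]] for i in range(parts)]

def temporal_pyramid_pooling(C, K):
    """Concatenate z_1, ..., z_K, where z_k pools 2**(k-1) blocks of C.
    The result has n * (2**K - 1) entries for an input with n columns."""
    d = []
    for k in range(1, K + 1):
        for block in split_uniform(C, 2 ** (k - 1)):
            d.extend(average_pool(block))
    return d

# Usage: 8 frames, 45 coefficients per frame, K = 3 pyramid levels.
C = [[float(t)] * 45 for t in range(8)]
d = temporal_pyramid_pooling(C, K=3)
print(len(d))  # 315, i.e. 45 * (2**3 - 1)
```

Level k contributes 2^(k-1) pooled vectors, so the total number of pooled vectors is 1 + 2 + ... + 2^(K-1) = 2^K - 1, which is where the factor in the feature dimension comes from.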
Classification Methods

Several classification algorithms were used to categorize the medication state. Extracted features were labeled (with ON and OFF) and used for training. We evaluate our system using Naive Bayes, SVM, k-NN, Decision Tree, and Random Forest classifiers with WEKA [59]. Combining classifiers improves performance, as reported in [29, 62, 72]. We therefore report performance measurements for different combinations of classifiers. We use two ensemble methods: Average of Probabilities [15] and Majority Voting [73]. The Average of Probabilities fusion method returns the mean value of the probabilities of multiple classifiers. Majority Voting returns the class that gets the most votes among multiple classifiers. Next, we provide details on the evaluation setup and results.

A.5 Experiments and Evaluation

A.5.1 Experimental Methodology

Fourteen adults with PD (9 men, disease duration 8.66 ± 7.48 years, Hoehn and Yahr stage I-III) took part in the pilot study (see footnote 1). They each visited a University laboratory 4 times. The first and last sessions consisted of qualitative interviews where 2-5 participants were interviewed in focus groups about their expectations and perceptions of the system. The remaining two sessions consisted of quantitative assessments (one each for ON and OFF medication state). Participants performed 7 standardized functional tasks: (1) walking, (2) walking whilst counting, (3) walking whilst carrying a tray, (4) walking around an obstacle, (5) sit-to-stand, (6) lifting a soda can and (7) lifting an object to a shelf. In this appendix we focused our analysis on tasks (1) walking, (2) walking whilst counting, and (3) walking whilst carrying a tray. We focused our attention on the walking-based tasks as we found skeleton data had a lower noise/signal ratio compared to other tasks. However, the walking whilst counting task was specifically singled out because the added cognitive task is known to create an additional challenge in persons with PD, which we expected would accentuate differences between ON and OFF conditions.

Accelerometer, camcorder video and Kinect data were recorded for all tasks. However, in this work we consider only Kinect sensor data. Fig. A.1 shows the trajectories followed for walking-based tasks, for which participants walked 5 times in a figure-of-eight pattern and repeated this at least twice.

Footnote 1: The retrospective pilot study was approved by the University Institutional Review Board.

Figure A.1: Walking tasks trajectory used in the experiments. Arrows indicate direction of movement. Only segments shown in red were used to classify the medication state.

A.5.2 Preprocessing and Methods

At each timestamp, we consider a subset of 15 joints (head, neck, torso, shoulders, elbows, hands, hips, knees and feet) in addition to the self-reported most affected side. As shown in Fig. A.1, we exclude segments of the trajectory corresponding to turns and where the sensor did not have a good viewing angle, as Kinect skeleton data corresponding to those portions of the trajectory are noticeably noisier. Note that the direction of the trajectory was chosen so that the segments used faced the sensor, to match Kinect’s skeleton reconstruction assumption.

Walking sequences were automatically segmented, similar to [74], into Skeletal Action Units (SAU), each consisting of two steps, and subsequently used to derive linear and angular kinematic measurements. We also limited the maximum extents of extracted angles to what is bio-mechanically possible (Kinect does not constrain skeleton joints).

We extracted a total of 1521 SAUs. The total number of ON-labeled SAUs is 759 while the total number of OFF-labeled SAUs is 762. These numbers translate to an average of 109 SAUs per subject and 54 SAUs for each condition, ensuring a balanced dataset.
Depending on the experiment, feature vectors for each SAU are generated based on the gait, angle and graph features extracted as described in Section A.4. For the graph-based features, the maximum number of pyramid levels is set to 3, i.e., $K = 3$. As customary, we further $\ell_2$-normalize the feature vector for robustness.

Evaluation results were obtained for Naive Bayes, SVM, k-NN, Decision Tree, and Random Forest, primarily for subject-dependent classification. Results for across-subjects classification are also presented to allow a comparison of the performance of both approaches.

A 3-fold method is used for training and testing. In the subject-dependent approach, each subject’s set of SAUs is uniformly randomly separated into 3 non-intersecting folds; then, two folds are used to train the model and the remaining fold is used to test the trained classifier. In the subject-independent approach, all SAUs of all subjects are combined and then separated into 3 folds. The 3-fold steps are then applied in the same way as in the subject-dependent approach. We also provide an evaluation of combinations of multiple classifiers. The performance metrics we report include accuracy, precision, recall and F-measure.

A.5.3 Feature Performance

The system performance is evaluated on the walking whilst counting task data, using an SVM classifier trained with separate feature vectors for gait, angle, graph, and for a combination of all of the above. We also include a PCA-based feature as a comparison to the proposed graph-based feature. The PCA-based feature is acquired from the training set and followed by exactly the same pyramid pooling scheme as the graph-based feature, which leads to the same feature vector dimension for both the graph-based method and PCA.

Table A.1 provides a summary of results. Our results show that the combination of the three proposed feature sets achieves the highest performance metrics. Table A.1 shows that the combination of sets with SVM performs the best, with 84.79% accuracy, 85.43% precision and 83.38% recall.
When only one type of feature is used, the graph-based feature set outperforms the other two choices in terms of all four metrics. In addition, we notice that the worst-case accuracy with graph-based features (69.63%) also significantly outperforms the worst-case accuracy with gait (39.53%) or angular features (53.58%), which shows that graph-based features are more robust in terms of exploiting the difference in motions between ON and OFF states. Possible reasons include: (1) their better ability to capture the characteristics of global coordination among body parts during a motion, and (2) their ability to capture the temporal dynamics/evolution of the frame-based features.

Furthermore, results of PCA-based and graph-based features are comparable. As detailed in Section A.4, the graph-based features provide a number of advantages, including interpretability of the body coordination, robustness to noise and to the selection of the dataset, and ease of comparing results between different schemes.

Table A.1: SVM performance for various features. Accuracy is reported in the format average accuracy (best accuracy/worst accuracy) across 14 subjects. A: Accuracy, P: Precision, R: Recall and F-M: F-measure. ALL: Gait, Angle, and Graph.

Feature   A(%)                  P(%)    R(%)    F-M
Gait      63.58 (88.71/39.53)   57.26   55.40   0.51
Angle     75.30 (92.22/53.58)   75.01   74.20   0.74
Graph     82.41 (95.68/69.63)   83.04   81.93   0.82
All       84.79 (93.95/71.23)   85.43   83.38   0.84
PCA       84.66 (95.32/71.99)   85.30   84.44   0.85

Table A.2: Performance of single classifiers and combinations of multiple classifiers. A: Accuracy, P: Precision, R: Recall, F-M: F-measure, AP: Average of Probabilities, MV: Majority Voting, S: SVM, k: k-NN, D: Decision Tree, R: Random Forest.
Classifier/Combination   A(%)    P(%)    R(%)    F-M
SVM                      84.79   85.43   83.38   0.84
Random Forest            83.09   83.68   83.09   0.83
k-NN                     79.24   80.16   79.24   0.79
Decision Tree            72.66   72.92   72.66   0.73
Naive Bayes              71.02   71.64   71.02   0.70
SkR (AP)                 87.41   87.61   87.40   0.87
Sk (AP)                  87.28   87.49   87.29   0.87
SkR (MV)                 85.62   85.79   85.61   0.86
DSk (AP)                 85.37   85.61   85.37   0.85
DSk (MV)                 85.29   85.51   85.30   0.85

Table A.3: Effect of normalization on subject-independent performance. A: Accuracy, P: Precision, R: Recall, F-M: F-measure.

Normalization   A(%)    P(%)    R(%)    F-M
With            72.91   72.32   72.32   0.72
Without         70.87   71.03   71.03   0.71

A.5.4 Classifier Performance

To test the performance of different classifiers, we consider the walking whilst counting dataset, a combination of Gait, Angle and Graph features (as we showed in Section A.5.3 that it provides the best performance) and the subject-dependent approach. We report performance metrics for Naive Bayes, Decision Tree, k-NN, Random Forest and SVM. Table A.2 provides the average performance metrics. According to Table A.2, SVM performs the best with 84.79% accuracy, 85.43% precision, and 83.38% recall, while Naive Bayes gives the lowest performance with 71.02% accuracy, 71.64% precision and 71.02% recall. Both Random Forest and SVM achieve high accuracy (more than 83%).

We also tested combinations of classifiers with two fusion methods: Average of Probabilities and Majority Voting. Table A.2 also presents the performance metrics of the five best combinations of classifiers. Compared to the best single classifier, i.e., SVM (84.79% accuracy), these results show that combining multiple classifiers can outperform using a single classifier. The best performance is 87.41% accuracy (2.62 percentage points better than single SVM), 87.61% precision and 87.40% recall, obtained with the combination of SVM, k-NN, and Random Forest using the Average of Probabilities fusion method. Overall, the best five combinations have more than 85% accuracy.
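The two fusion rules compared in this section can be sketched as follows. This is a minimal illustration with made-up probabilities, not the WEKA configuration used in the experiments; the function names are hypothetical, taking the class with the largest mean probability as the fused decision is an assumption, and ties in the vote are broken by first occurrence.

```python
from collections import Counter

def average_of_probabilities(prob_dicts):
    """Mean class probability over classifiers; the fused label is assumed
    to be the class with the largest mean probability."""
    labels = list(prob_dicts[0].keys())
    mean = {c: sum(p[c] for p in prob_dicts) / len(prob_dicts) for c in labels}
    return max(mean, key=mean.get), mean

def majority_vote(predictions):
    """Class predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three classifiers for one Skeletal Action Unit.
probs = [
    {"ON": 0.90, "OFF": 0.10},  # one confident classifier
    {"ON": 0.40, "OFF": 0.60},
    {"ON": 0.45, "OFF": 0.55},
]
ap_label, ap_mean = average_of_probabilities(probs)
mv_label = majority_vote([max(p, key=p.get) for p in probs])
print(ap_label, mv_label)  # ON OFF
```

In this toy case the two rules disagree: averaging lets one confident classifier outweigh two marginal ones, whereas voting counts each classifier equally.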
It can also be seen that the best five correspond to combinations of the best performing single classifiers: SVM, k-NN, Decision Tree and Random Forest.

A.5.5 System Performance

We also report the system performance using the subject-independent approach. The set-up is similar to the experiments of Section A.5.4. To assess the effects of normalization, we compare results obtained when normalization is applied as suggested in [90] and without normalization, using the SVM classifier with graph-based features. Results presented in Table A.3 show that normalization does not provide a significant improvement. This counter-intuitive result may be explained by the fact that the effects of PD are highly person-dependent and not correlated to body size.

Table A.4 shows the classification results of five classifiers in this approach: Naive Bayes, Decision Tree, SVM, Random Forest, and k-NN, using the combination of feature sets. Overall, the performance achieved with subject-independent classifiers is much worse than that achieved with subject-dependent classifiers. The accuracy ranges from 60.09% to 76.86%, while precision ranges from 60.50% to 77.10% and recall ranges from 60.10% to 76.90%. k-NN achieves the highest performance (76.86% accuracy, 77.10% precision, 76.90% recall, and 0.77 F-measure) and Naive Bayes gives the worst performance compared to the other classifiers (60.09% accuracy, 60.50% precision, 60.10% recall, and 0.60 F-measure). Both Random Forest and k-NN have more than 75% accuracy.

Possible reasons for the lower performance of subject-independent results include the fact that our model only includes information about the most/least affected side. The inclusion of other demographic factors such as age, gender, condition, how long medicated, affected limbs, etc. might improve subject-independent results.

Table A.4: Subject-independent performance of single classifiers. A: Accuracy, P: Precision, R: Recall, F-M: F-measure.
Classifier      A(%)    P(%)    R(%)    F-M
Naive Bayes     60.09   60.50   60.10   0.60
Decision Tree   62.98   63.00   63.00   0.63
SVM             67.13   67.10   67.10   0.67
Random Forest   75.67   75.90   75.70   0.76
k-NN            76.86   77.10   76.90   0.77

Table A.5: Subject-independent combination performance. A: Accuracy, P: Precision, R: Recall, F-M: F-measure, AP: Average of Probabilities, MV: Majority Voting, S: SVM, k: k-NN, D: Decision Tree, R: Random Forest.

Combination   A(%)    P(%)    R(%)    F-M
Rk (AP)       77.32   77.50   77.30   0.77
SkR (MV)      76.92   77.00   76.90   0.77
Rk (MV)       76.59   76.70   76.60   0.77

We then test our system using fusion methods, similar to Section A.5.4. The performance metrics of the best three combinations of classifiers are reported in Table A.5. Compared to the best performing single classifier (k-NN with 76.86% accuracy), the three combinations reported have comparable results. The best accuracy, 77.32%, is about 0.5 percentage points higher than that of the best single classifier and is obtained by combining Random Forest and k-NN. It can be seen that the subject-independent approach does not work well for either single classifiers or combinations of multiple classifiers, because each subject has different mobility traits (i.e., most affected side) and different changes in mobility between the two conditions.

A.5.6 Impact of Task Difficulty

To evaluate the impact of task difficulty on classification results, we compare performance on three tasks: walking whilst counting, walking whilst holding a tray, and walking only, using the same procedures, data size and features as those used in Section A.5.3. As summarized in Table A.6, we find that the average accuracy of the walking only task is 81.04%, slightly lower than for the two dual tasks. Also, the worst accuracy across all subjects is much worse (48.99%) for the walking only task, compared to 71.23% for the walking whilst counting task. This seems to corroborate the fact that dual tasks add cognitive load and increased coordination that accentuate the differences in motion between conditions. Furthermore, we can observe that the average accuracy of walking whilst holding a tray is comparable to that of walking whilst counting; however, the worst accuracy across the subjects is much worse for walking whilst holding a tray. A possible explanation might be that the system is unable to capture changes in mobility between conditions in subjects for whom the impairment of movement mostly affects the upper body.

Table A.6: Performance results for three walking tasks. Accuracy is reported in the format average accuracy (best accuracy/worst accuracy) across subjects. A, P, R and F-M denote respectively Accuracy, Precision, Recall and F-measure.

Task    A(%)                  P(%)    R(%)    F-M
Count   84.79 (93.95/71.23)   85.43   83.38   0.84
Tray    82.04 (94.44/53.63)   82.19   90.00   0.85
Walk    81.04 (96.05/48.99)   81.63   87.75   0.83

These results show that the task can significantly affect the system performance. Best performance seems to be achieved for tasks (1) that do not constrain movement, and (2) that are sufficiently challenging.

A.6 Conclusion and Future Work

Mobility assessment is critical for several applications including rehabilitation, physical therapy, optimizing treatment, or performance in sport and military applications. In this work, we propose a methodology to develop automated mobility assessment systems based on motion data captured with a single cost-effective 3D sensor (i.e., Microsoft Kinect). We propose using three types of features that we show are capable of capturing fine movement changes. In particular, the proposed graph-based features can capture dynamic coordination between the different parts of the body.

We evaluate system performance with a pilot study involving 14 adults with PD (9 men, disease duration 8.66 ± 7.48 years, Hoehn and Yahr stage I-III). Our results support the feasibility of using a Microsoft Kinect to recognize the medication state of persons with PD using a relatively small number of movements in the case of a dual task, i.e., walking whilst counting.
More specifically, we show that for a combination including gait, angle and graph-based features, it is possible to achieve subject-dependent classification performance of 87.41% accuracy, 87.61% precision and 87.40% recall with a combination of SVM, k-NN, and Random Forest using the Average of Probabilities fusion method. It appears that among the features proposed, the graph-based features are more robust in terms of exploiting the difference in motions between medication states. Results obtained for subject-independent classification appear significantly worse. We also evaluate how different features, classifiers, approaches and tasks impact the system performance and discuss insights into the key performance factors and failure modes of the proposed system.

Future work will include extending the pilot study to a larger number of subjects to prove the statistical significance of specific features in discriminating between medication states, and investigating methods that allow for a more fine-grained mobility assessment. Furthermore, extending these results to the new Kinect One system can lead to significant improvements. Complementing the Kinect sensor with data from other wearable sensors could also lead to a boost in performance. Finally, extending the system to be capable of automatically measuring the degree of mobility impairment and deciding the most suitable tasks for subjects to perform can also be beneficial to clinical work.
Abstract
Analyzing and understanding human motion has long been a popular yet challenging research area with a broad range of applications. Recently, the availability of reliable 2D or 3D positions of skeletal joints during actions has resulted in an increasing interest in developing automated action analysis systems utilizing skeleton-based motion data. In this Ph.D. dissertation, we explore model-based approaches to construct representations for captured skeleton-based motion data, taking into consideration prior knowledge about human skeletons. The main challenge in achieving this is the irregularity in the skeletal structure and its corresponding motions. We propose to leverage graph structures to tackle this challenge, since graph structures have shown their superiority in modeling complex relationships between entities in irregular domains. In this work, we propose graph-based motion representations that start with a skeletal graph (including skeletal-temporal graphs) and then apply a graph transform such as the Graph Fourier Transform (GFT) or the Spectral Graph Wavelet Transform (SGWT) to the graph signal defined on the constructed graph, where the graph signal corresponds to motion data. We discuss the construction of skeletal and skeletal-temporal graphs and further derive the spatial and spectral properties associated with these types of graphs, including symmetric sub-graphs, GFT modes, spectrum multiplicities, fast GFT implementation and interpretations of the corresponding GFT basis. We further discuss some properties of these graph-based representations, including their computational efficiency and ability to generalize to new datasets. As an extension, we explore the possibility of learning a set of action-dependent graphs for classification, where we propose a discriminative graph learning problem along with an iterative algorithm to solve it. A closed-form solution is further derived when graphs satisfy certain structures.
❧ As for applications, we consider two real-world scenarios where skeleton-based motion data is captured to fulfill an automated action analysis task. The first application is to develop an automated mobility assessment system where the motions performed by patients with musculo-skeletal disorders are captured and automatically assessed and utilized to predict their current medication states. We conduct thorough experiments with our proposed graph-based representation in order to assess its performance. Additionally, several factors for designing this general type of systems are discussed, such as the environments and activity tasks on which they are deployed and the features and classifiers that can be used along with them. The second application is to develop a skeleton-based action recognition system, which is a popular research topic in the field of computer vision and machine learning. Employing the proposed representations is shown to lead to recognition performance comparable to the state-of-the-art, while at the same time providing benefits in significantly lower time complexity, robustness to noisy and missing data, and generalization to different datasets.
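As an illustration of the graph-transform step described in the abstract, the following is a minimal sketch of a Graph Fourier Transform applied to one frame of skeletal data. It is not the dissertation's implementation: the five-joint star-shaped skeleton, the unweighted combinatorial Laplacian, the joint coordinates, and all function names are assumptions chosen for brevity.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton (NOT the joint layout used in the dissertation):
# 0 = torso, 1 = left hand, 2 = right hand, 3 = left foot, 4 = right foot.
NUM_JOINTS = 5
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4)]


def combinatorial_laplacian(num_nodes, edges):
    """Return L = D - A for an unweighted, undirected graph."""
    adjacency = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        adjacency[i, j] = adjacency[j, i] = 1.0
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def gft_basis(laplacian):
    """Eigendecompose the symmetric Laplacian.

    The eigenvalues act as graph frequencies (ascending order) and the
    eigenvector columns form an orthonormal GFT basis.
    """
    return np.linalg.eigh(laplacian)


def gft(signal, basis):
    """Forward GFT: project the graph signal onto the basis vectors."""
    return basis.T @ signal


def inverse_gft(coefficients, basis):
    """Inverse GFT: rebuild the graph signal from its coefficients."""
    return basis @ coefficients


laplacian = combinatorial_laplacian(NUM_JOINTS, EDGES)
eigenvalues, basis = gft_basis(laplacian)

# One frame of made-up 3D joint positions, shape (joints, xyz).
frame = np.array([
    [0.0, 1.0, 0.0],    # torso
    [-0.5, 1.2, 0.1],   # left hand
    [0.5, 1.2, 0.1],    # right hand
    [-0.2, 0.0, 0.0],   # left foot
    [0.2, 0.0, 0.0],    # right foot
])

coefficients = gft(frame, basis)            # one spectrum per coordinate axis
reconstructed = inverse_gft(coefficients, basis)
```

Because the basis is orthonormal, the inverse transform recovers the frame exactly up to floating-point error. This toy star graph also has a repeated eigenvalue, which gives a small-scale picture of the spectrum multiplicities that symmetric skeletal graphs can exhibit.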
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Human motion data analysis and compression using graph based techniques (PDF)
Graph-based models and transforms for signal/data processing with applications to video coding (PDF)
Compression of signal on graphs with the application to image and video coding (PDF)
Efficient transforms for graph signals with applications to video coding (PDF)
Word, sentence and knowledge graph embedding techniques: theory and performance evaluation (PDF)
Efficient graph learning: theory and performance evaluation (PDF)
Lifting transforms on graphs: theory and applications (PDF)
Estimation of graph Laplacian and covariance matrices (PDF)
Novel algorithms for large scale supervised and one class learning (PDF)
Learning and control for wireless networks via graph signal processing (PDF)
Scalable sampling and reconstruction for graph signals (PDF)
Labeling cost reduction techniques for deep learning: methodologies and applications (PDF)
Efficient graph processing with graph semantics aware intelligent storage (PDF)
Green learning for 3D point cloud data processing (PDF)
Sampling theory for graph signals with applications to semi-supervised learning (PDF)
Syntax-aware natural language processing techniques and their applications (PDF)
Neighborhood and graph constructions using non-negative kernel regression (NNK) (PDF)
Advanced knowledge graph embedding techniques: theory and applications (PDF)
Architecture design and algorithmic optimizations for accelerating graph analytics on FPGA (PDF)
Data-driven 3D hair digitization (PDF)
Asset Metadata
Creator
Kao, Jiun-Yu (author)
Core Title
Human activity analysis with graph signal processing techniques
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
12/10/2019
Defense Date
12/08/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
3D action recognition, graph-based representation, human activity analysis, motion capture data, OAI-PMH Harvest
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Ortega, Antonio (committee chair), Kuo, C.-C. Jay (committee member), Nevatia, Ram (committee member)
Creator Email
anjo8081@gmail.com, jiunyuka@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-250059
Unique identifier
UC11673104
Identifier
etd-KaoJiunYu-8037.pdf (filename), usctheses-c89-250059 (legacy record id)
Legacy Identifier
etd-KaoJiunYu-8037.pdf
Dmrecord
250059
Document Type
Dissertation
Rights
Kao, Jiun-Yu
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA