Human motion data analysis and compression using graph based techniques by Pratyusha Das A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL AND COMPUTER ENGINEERING) May 2024 Copyright 2024 Pratyusha Das To my hometown Raiganj... ii Acknowledgements I am humbled and deeply grateful for the support and encouragement I have received throughout my doctoral journey. Without the help of my teachers, collaborators, family, and friends, this dissertation would not have been possible. I extend my sincere thanks to each and every one of them. First and foremost, I owe a debt of gratitude to my advisor, Professor Antonio Ortega, whose guidance and support have been invaluable. He has been with me every step of the way, from the early stages of critical thinking and problem formulation to developing and presenting ideas. His mentorship has played a crucial role in my growth as a researcher, and I am honored to have had the privilege of working under his guidance. I would also like to thank my dissertation committee members, Professor Feifei Qian and Professor Francisco Valero-Cuevas, and my qualifying exam committee members, Professor Ram Nevatia, Professor Keith Jenkins, and Professor C.C. Jay Kuo. Their generous feedback and comments have helped me to refine and improve this dissertation, and I am grateful for their contributions. I would be remiss if I did not acknowledge the contributions of my USC teachers, whose lectures have greatly benefited me. They have provided a strong foundation for my research ability by carefully explaining advanced topics. My collaborators and group members, Dr. Joanne Kao, Dr. Eduardo Pavez, Dr. KengShih Lu, Dr. Sarath Shekizzar, Dr. Ajinkya Jaywant, Shashank Nelamangala Sridhara, iii Carlos Hurtado, and Huong Nguyen, have been an integral part of my journey. Their insights and discussions have been immensely helpful in shaping the development of this dissertation. I want to express my gratitude to my lifelong teachers and mentors: Dr. Goutam Dasgupta, Sudipta Dey, and Dr. Amit Konar. Additionally, I would like to extend my thanks to my friends Shilpi, Rumi, Irin, Hazra, Animesh, Saheli, Priyanka, Murthy, Rituraj, Nasir, and Chitraleema for their invaluable mental support during this challenging journey. Lastly, I would like to express my heartfelt gratitude to my family, especially my parents (Ma, Baba), my sister (Didi), and my brother in law, for their unwavering support and love. I also give special thanks to my husband, Dr. Biswajit Datta, for his insightful discussions and constant support. His love and encouragement have been a source of strength and inspiration throughout this journey. iv Table of Contents Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.1.1 Graphs and graph signals . . . . . . . . . . . . . . . 
. . . . . . . . . 7 1.1.2 Transforms on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.1.2.1 Graph Fourier transform . . . . . . . . . . . . . . . . . . . . 9 1.1.2.2 Spatiotemporal Graph Convolutional Networks . . . . . . . 11 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2.1 Spatiotemporal hand graph for stable activity understanding . . . . . 13 1.2.2 Symmetric Sub-graph Spatio Temporal Graph Convolution . . . . . . 13 1.2.3 Geometric understanding and visualization of Spatio Temporal Graph Convolution Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.2.4 Graph-based skeleton data compression . . . . . . . . . . . . . . . . . 15 1.3 Organization of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Chapter 2: Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1 Graphs and graph signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Transform on Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.1 Graph fourier transform . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.2 Spatio-Temporal Graph Convolutional Network . . . . . . . . . . . . 21 Chapter 3: Spatiotemporal hand graph for activity understanding . . . . . . . . . . . 23 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Spatial hand skeleton graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2.1 Graph construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2.2 Analysis of graph frequencies . . . . . . . . . . . . . . . . . . . . . . 30 3.2.3 Interpretation of graph Fourier basis . . . . . . . . . . . . . . . . . . 31 v 3.3 Spatiotemporal hand Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.1 Spatiotemporal graph Fourier transform . . . . . . . . . . . . . . . . 34 3.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.1 Real-time activity segmentation . . . . . . . . . . . . . . . . . . . . . 36 3.4.1.1 Segmentation using Bayesian Information Criterion(BIC) . . 37 3.4.1.2 Experimental setup and Dataset . . . . . . . . . . . . . . . 38 3.4.1.3 Real-time segmentation results . . . . . . . . . . . . . . . . 40 3.4.2 Offline activity segmentation . . . . . . . . . . . . . . . . . . . . . . . 44 3.4.2.1 Unsupervised Offline Activity Segmentation using GrAFX . 44 3.4.2.2 Qualitative analysis of the performance . . . . . . . . . . . . 45 3.4.2.3 Offline segmentation results . . . . . . . . . . . . . . . . . . 46 3.4.3 Activity Recognition using GrAFX . . . . . . . . . . . . . . . . . . . 47 3.4.3.1 Recognition strategy using GrAFX . . . . . . . . . . . . . . 47 3.4.3.2 Proposed Stability Metrics for Activity Recognition . . . . . 47 3.4.3.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.3.4 Recognition results . . . . . . . . . . . . . . . . . . . . . . 49 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Chapter 4: Symmetric Sub-graph spatiotemporal Graph Convolution . . . . . . . . . 52 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2 STGCN for hand skeleton data . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.3 Symmetry based graph decomposition . . . . . . . . . . . . . . . . . . . . . 57 4.3.1 Symmetric Sub-graph ST-GCN . . . . . . . . . . . . . . . . . . . . . 58 4.3.2 Network architecture . . . . . 
. . . . . . . . . . . . . . . . . . . . . . 61 4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.4.1 Experimental setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4.2.1 Performance analysis . . . . . . . . . . . . . . . . . . . . . . 64 4.4.2.2 Stability of the network . . . . . . . . . . . . . . . . . . . . 65 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Chapter 5: Geometric understanding and visualization of Spatio Temporal Graph Convolution Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.2.1 Skeleton graph and polynomial graph filters . . . . . . . . . . . . . . 74 5.2.2 Non-Negative Kernel Regression (NNK) neighborhoods . . . . . . . . 76 5.2.3 Grad-CAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.2.4 Dynamic Time Warping . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.3 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.3.1 Neighborhood analysis using dynamic DTW . . . . . . . . . . . . . . 80 5.3.2 STG-Grad-CAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.4.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 vi 5.4.3 Label smoothness computed from the features . . . . . . . . . . . . . 87 5.4.4 Visualization through STG-Grad-CAM . . . . . . . . . . . . . . . . 90 5.4.5 Effect of noise in the data . . . . . . . . . . . . . . . . . . . . . . . . 97 5.4.6 Transfer performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.4.6.1 Results on other datasets and advanced network . . . . . . . 100 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Chapter 6: Graph-based skeleton data compression . . . . . . . . . . . . . . . . . . . 105 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 6.2 Graph-based Skeleton Compression . . . . . . . . . . . . . . . . . . . . . . . 108 6.2.1 GFT-based transform coefficient extraction using skeleton graph . . . 109 6.2.2 Quantization and entropy coding . . . . . . . . . . . . . . . . . . . . 111 6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 6.3.1 Compression Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.3.2 Performance on Action recognition . . . . . . . . . . . . . . . . . . . 117 6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 Chapter 7: Conclusion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . 120 7.0.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
123 vii List of Tables 3.1 Comparison between the different graphs, hand graph, finger hand graph, left-right hand graph. It is clear from the table that the left-right hand graph performs the best in segmenting the assembling task . . . . . . . . . . . . . 44 3.2 Segmentation Accuracy for USC Dataset [20] . . . . . . . . . . . . . . . . . . 47 3.3 Recognition accuracy of FPHA dataset . . . . . . . . . . . . . . . . . . . . . 49 3.4 Recognition accuracy of FPHA in train:test=1:1 protocol . . . . . . . . . . . 50 3.5 Stability analysis of the classification models . . . . . . . . . . . . . . . . . . 50 4.1 Performance summary of the algorithms on Action recognition for different training/testing protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 Action recognition results on the FPHA dataset . . . . . . . . . . . . . . . . 65 4.3 Cross-validation stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.1 NTU-RGB60 super classes: Upper body, Lower body and Full body action category . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.2 NTU-RGB120 super classes: Upper body, Lower body and Full body action category . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.3 Accuracy achieved by STGCN for NTU-RGB-D for different amount of masking of the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.4 Fathfullness and contrastivity in xview and xsub settings . . . . . . . . . . . 96 6.1 Summary of comparison between GSC and DCT in terms of average RMSE computed over 56880 skeleton data files . . . . . . . . . . . . . . . . . . . . . 116 6.2 Performance of the compressed data in Action recognition . . . . . . . . . . 118 viii List of Figures 3.1 Proposed hand graphs: (a)Human anatomy inspired hand graph GH and (b) Finger connected hand graph GF H . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2 Left-right hand graph: Graph GLRH constructed to capture relative motion between two hands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3 Example elementary frequency basis for GH. Green dot: positive value. Red dot: negative value. Blue dot: zero. Each basis captures the motion variation between nodes. For example, λ = 0 captures the average of the hand motion, whereas the eigenvector corresponding to λ = 0.12 captures the variation in the motion between the thumb and the rest of the fingers. . . . . . . . . . . 32 3.4 Example elementary frequency basis for GF H. Green dot: positive value. Red dot: negative value. Blue dot: zero. Each basis captures the motion variation between the nodes. For example, λ = 0 captures the average of the hand motion, whereas the eigenvector corresponding to λ = 0.42 captures the variation in the motion between the thumb and little to the rest of the fingers. 32 3.5 Example elementary frequency basis for GLRH. Green dot: positive value. Red dot: negative value. Blue dot: zero. Each basis captures the motion variation between the nodes. For example, λ = 0 captures the average of the hand motion, whereas the eigenvector corresponding to λ = 0.26 captures the variation in the motion between the left hand and right hand. . . . . . . . . 33 3.6 Kronecker product between spatial hand graph and temporal line graph results in this Spatiotemporal hand graph . . . . . . . . . . . . . . . . . . . . . . . 34 3.7 Steps for the toy assembling task: Action1 – Assembling: Attach the front wheel, set the red board, and tighten the screws. 
Action2 – Combining: Attach the power cable with the red board; Connect the sonic sensor cable to the red board; Combine the green board, use the pins; Attach the side wheels. Action3 – Checking: Check whether all the parts are attached and assembled properly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.8 Segmentation outcome( ˆS p a ) using features from GLRH (λ = 1) for the proposed method with SegAcc = 84.8%. Si and Iti represent the subject ID and iteration number, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 42 ix 3.9 Segmentation outcome( ˆSb a ) using features from baseline method (λ = 1) with SegAcc = 71.6%. Si and Iti represent the subject ID and iteration number, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.10 dtwnorm for each data sequence considering s11 as expert. . . . . . . . . . . . 46 4.1 The 21-node hand graph and its symmetry-based decomposition. Red, green, and blue circles represent those nodes in VX , VZ , and VY , respectively. VX and VY contain those sets of nodes belonging to different sides of the symmetry axis. VZ contains the nodes in the symmetric axis for the next stage of decomposition. Unlabeled edges have weights 1, and Pb,i are permutation operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.2 Butterfly structure of the hand graph decomposition — Relation between x in and z : z = Bn,px in. Each small butterfly structure represents one Haar unit. Here, we see two stages of butterfly structures for the two stages of decomposition as shown in Figure 4.1. A1, A3, A3 and A4 are the GFTs corresponding to two connected components of G H 1 , G H 2 , G H 3 and G H 4 respectively. Unlabeled edges have weights 1, and Pb,i are the permutation operations. . . . . . . . . 60 4.3 Symmetric Sub-graph Spatio temporal graph neural network for complex activity recognition from hand skeleton data. . . . . . . . . . . . . . . . . . . 62 4.4 Normalized confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.1 Proposed data-driven approach to understanding the geometry of the embedding manifold in STGCNs using windowed dynamic time warping (DTW) and Non-Negative Kernel (NNK) graphs. Left: We construct Dataset NNK Graphs (DS-Graph) where each node corresponds to an action sequence, and the weights of edges connecting two nodes are derived from pairwise distances between the features representing the corresponding action sequences. In this example, we show how the two classes (corresponding to red and blue nodes on the DS-Graph) become more clearly separated in deeper layers of the network. We also observe the skeleton graph (S-Graph) node importance for each action using a layerwise STG-GradCAM (the three-time slice example corresponds to a Throw action). Right: For a set of spatiotemporal input action sequences, we observe the label smoothness on the DS-Graph constructed using the features obtained for the sequences after each STGCN layer. The observed label smoothness at each layer of the STGCN network averaged over three super-classes corresponding to actions involving the upper body, lower body, and full body. In this plot, lower variation corresponds to greater smoothness. We note that the label smoothness increases in the deeper layers, in which the different actions can be classified (see DS-Graphs at the bottom of the left plot). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
75 x 5.2 Energy graph spectrum of the human actions (NTURGB 120 [69]) of the normalized adjacency matrix of the S-Graph (A). We use the graph spectrum of the normalized adjacency matrix, as in STGCN, for easy understanding. However, a similar observation can be made using the normalized graph Laplacian [85] as the eigenvectors for both cases are the same. . . . . . . . . . . . . . 77 5.3 Smoothness of labels on the manifold induced by the STGCN layer mappings in a trained model. As the label smoothness increases, the Laplacian quadratic (y ⊤Ly) decreases. Intuitively, a lower value of y ⊤Ly corresponds to the features belonging to a particular class having neighbors from the same class. We divide the actions in NTU-RGB60 into three super-classes (Upper body (Left), Lower body (Middle), Full body (Right)) and present smoothness with respect to each action in the grouping. We emphasize that, though the smoothness is displayed per class, the NNK graph is constructed using the features corresponding to all input action data-points. We observe that the model follows a similar trend where the smoothness of labels is flat in the initial layers (indicative of no class-specific learning) and increases in value in the later layers (corresponding to discriminative learning). There exist outliers to this trend (e.g., in upper body group drop, brushing) where the smoothness decreases in intermediate layers. This implies that the representations for these actions are affected by features from other actions to accommodate for learning other classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4 Label smoothness of STGCN for different graph construction methods (blue)- NNK, (red)-k-NN. As the label smoothness increases, the Laplacian quadratic (y ⊤Ly) decreases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.5 Actions at different time points with their spatiotemporal joint importance: a bigger node size represents higher importance. . . . . . . . . . . . . . . . . 91 5.6 Joint importance map for all actions, In the Y tick labels ’L’ : ’Left’, ’R’ : ’Right’. Action names in the X-axis are given on the right of the figure. This figure is generated in x-sub setting. . . . . . . . . . . . . . . . . . . . . . . . 92 5.7 Joint importance map for all actions in xview setting.The action and joint order in X and Y axis respectively follow the order shown in Figure 5.6. . . 92 xi 5.8 L-STG-GradCAM visualization of spatiotemporal node importance for action class Kick of a trained STGCN network used in experiments. The size of the blue bubble denotes the relative importance of the node in a layer for prediction by the final softmax classifier and is scaled to have values in [0, 1] at each layer. The node importance values are normalized across layers to have a clear comparison among the layers. We observe that the localization of the action, as observed using the L-STG-GradCAM is evident only in later layers, while initial layers have no class-specific influence. The visualizations allow for transparency in an otherwise black-box model to explain any class prediction. Our approach is applicable to any STGCN model and is not affected by the model size, optimization strategy, or dataset used for training. . . . . . . . 93 5.9 L-STG-GradCAM visualization of spatiotemporal node importance for action class Throw of a trained STGCN network used in experiments. 
The size of the blue bubble denotes the relative importance of the node in a layer for prediction by the final softmax classifier and is scaled to have values in [0, 1] at each layer. We observe that the localization of the action, as observed using the L-STG-GradCAM is evident only in later layers, while initial layers have no class-specific influence. The visualizations allow for transparency in an otherwise black-box model to explain any class prediction. Our approach is applicable to any STGCN model and is not affected by the model size, optimization strategy, or dataset used for training. . . . . . . . . . . . . . . 94 5.10 L-STG-GradCAM visualization of spatiotemporal node importance for action class Sitting down of a trained STGCN network used in experiments. The size of the blue bubble denotes the relative importance of the node in a layer for prediction by the final softmax classifier and is scaled to have values in [0, 1] at each layer. We observe that the localization of the action, as observed using the L-STG-GradCAM is evident only in later layers, while initial layers have no class-specific influence. The visualizations allow for transparency in an otherwise black-box model to explain any class prediction. Our approach is applicable to any STGCN model and is not affected by the model size, optimization strategy, or dataset used for training. . . . . . . . . . . . . . . 94 5.11 Joint importance map for STGCN-7. The action and joint order in the X and Y axes, respectively, follow the order shown in fig 5.6. . . . . . . . . . . . . 96 xii 5.12 Impact of noise added to the input sequence on the label smoothness observed using the corresponding features obtained with the model. The Laplacian quadratic form (y ⊤Louty) decreases as the label smoothness increases . We show the impact of noise on one action per super-class grouping (Upper, Lower, and Full Body). We observe that in actions where the Laplacian quadratic form had a non-increasing trend, it remained mostly unaffected by adding noise to the input action sequence. However, actions where the label smoothness decreased before increasing were affected by noise. This implies that the features learned in the early layers for these actions are not robust, and adding noise allows us to see the modified manifold induced in these layers. 97 5.13 Left: Label smoothness of unseen action classes using a model trained with the NTU-RGB60. For new actions, we consider input sequences corresponding to labels 61 to 120 in NTU-RGB120. We divide these actions into three super-classes (Upper, Lower, and Full body) as before and present results averaged over each set. We see that the model is able to embed the features corresponding to new action sequences such that they are separable. Further, the Laplacian quadratic form(y ⊤Ly) follows a similar non-increasing trend as in the case of the NTU-RGB60 in a much smaller range of scale. This implies that the features learned by the model can be used for the novel classes and model transfer can be done with simple fine-tuning. Right: Classification accuracy (higher is better) on NTU-RGB120 test dataset using a 10-layer STGCN network. Here, we show the performance achieved by the model when training from scratch, along with that obtained with transfer learning by fine-tuning a model trained on NTU-RGB60. We see that the model can transfer effectively, as our label smoothness analysis predicted. . . . . . . . . 
99 5.14 Validation loss (lower is better) on NTU-RGB120 test dataset using a 10-layer STGCN network. Here, we show the validation loss achieved by the model when training from scratch along with that obtained with transfer learning by fine-tuning a model trained on NTU-RGB60 on fewer layers. We see that the model is able to transfer effectively, as predicted by our label smoothness analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.15 Smoothness of labels on the manifold induced by the ShiftGCN [16] layer mappings in a trained model. Intuitively, a lower value in y ⊤Louty corresponds to the features belonging to a particular class having neighbors from the same class. This visualization of label smoothness for the ShiftGCN network trained on NTU-RGB 60 dataset. We emphasize that, though the smoothness is displayed per class, the NNK graph is constructed using the features corresponding to all input action data-points. We observe that the model follows a gradually decreasing pattern in y ⊤Louty over the layers, unlike STGCN. . 101 xiii 5.16 This visualization of layer smoothness for the STGCN network trained on Kinetics 400 dataset [54]. The smoothness of labels on the manifold induced by the STGCN layer mappings in a trained model. Intuitively, higher smoothness corresponds to the features belonging to a particular class having neighbors from the same class. We emphasize that, though the smoothness is displayed per class, the Dataset Graph(NNK) is constructed using the features corresponding to all input action data-points. We observe that the model follows a similar trend where y ⊤Louty is flat in the initial layers (indicative of no classspecific learning) and decrease in value in the later layers (corresponding to discriminative learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.1 Block diagram of the proposed graph-based skeleton data compression (GSC) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2 Human anatomy inspired Skeleton graph Gs . . . . . . . . . . . . . . . . . . 111 6.3 Example of Graph Fourier basis for the skeleton graph. In each case, λ is the corresponding graph frequency. The upper and lower rows correspond to the lowest and highest three frequencies, respectively. Node colors represent signs of the elements of each basis, green−→positive, red−→negative, blue−→zeros. . 112 6.4 Energy for vertex-time transform (GFT and DCT), the axis is numbered with the frequency indices arranged in increasing order. . . . . . . . . . . . . . . . 113 6.5 Energy for the time-only transform (DCT), the axis are numbered with the frequency indices arranged in increasing order . . . . . . . . . . . . . . . . . 113 6.6 Heatmap of non-uniform quantization matrices (Q↓ nu) and (Q↘ nu). . . . . . . 114 6.7 Rate distortion curve for uniform quantization. MSE (in m2 ) vs bit-rate per joint in each frame is plotted. . . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.8 Reconstructed signal from the compressed data of the joint spine mid, the compression ratio for DCT and GSC with Q↓ nu was 2.4 and 4.8 respectively. . 117 6.9 Rate distortion curve for non-uniform quantization. MSE (in m2 ) vs bit-rate per joint in each frame is plotted. GSC with Q↘ nu : GSC-Zig-zag and GSC with Q↓ nu: GSC-Row major . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 xiv Abstract Analyzing and understanding human actions is a popular yet challenging field with broad applications. 
Studying complex hand movements is even more challenging due to similarities between different actions and the concentration of motion in small body areas, making them hard to differentiate. In this thesis, we address these challenges by creating representations for hand skeleton-based motion data. To tackle the irregularity in hand skeletal structure and actions, we utilize graph structures, known for modeling complex relationships among entities in irregular domains. First, our approach involves constructing different spatial and spatiotemporal hand graphs and applying graph-based tools like the Graph Fourier Transform (GFT) and Graph Neural Networks to analyze position and motion data on the graph. To the best of our knowledge, we are the first to propose using hand graphs for understanding human motion. We delve into constructing different types of hand graphs, exploring their spatial and spectral properties, and interpreting GFT basis contributions. Furthermore, we emphasize the desirable properties of our representations, including computational efficiency and generalization ease. Second, exploiting the structural properties of the proposed hand graphs, we decompose hand graphs into smaller sub-graphs and define a symmetric subgraph spatiotemporal graph neural network using separate temporal kernels for each sub-graph. This approach can be generalized to model graphs with similar properties, such as symmetry, bipartiteness, and more. This method reduces complexity while providing better performance. However, these neural network-based methods suffer from a lack of interpretability and stability. We discuss xv methods to interpret spatiotemporal graph neural network-based models, which can find the subgraph structures influencing the decision made by the model. Additionally, we design methods to analyze these models’ stability, which helps us choose methods appropriate to real-world applications. Third, these reliable, fast, portable skeleton data acquisition systems can track multiple people simultaneously, providing full-body skeletal key points and more detailed landmarks of the face, hands, and feet. This leads to a huge amount of skeleton data being transmitted or stored. This thesis introduces graph-based techniques to compress skeleton data nearly losslessly. Human activity understanding involves two stages: segmenting sub-actions from a complete activity and then recognizing these sub-actions. We assess the performance of our graph-based, application-agnostic feature extraction method in both online (real-time) and offline settings, using an industrial assembling task dataset collected in a controlled USC environment. For the supervised recognition task, we consider daily activities from realworld scenarios like kitchen, office, and social. Employing the proposed method, we can achieve better recognition performance than the state-of-the-art in a cross-person setting. It offers the added benefits of reduced complexity and increased stability. For the compression task, we use a large human activity dataset NTURGB60, where the proposed method outperforms the existing techniques like DCT with a high margin without compromising recognition performance. xvi Chapter 1 Introduction Human–Robot Interaction (HRI) has recently received considerable attention in the academic community, labs, healthcare companies, and media. Machines and robots are inseparable from human life. 
So, efficient technologies are needed to give these machines and robots the ability to understand what a human is doing. With a simple view of the scene, humans can easily observe, recognize, and further analyze the activities performed by others. However, developing machines capable of achieving similar abilities remains challenging due to the complexity of human activities. Human activity understanding [6, 1] has long been a popular research area, with a wide range of applications, such as tools for measuring the quality of movements for workers, telerehabilitation of the elderly, surveillance systems, surgical procedures, kitchen activities, and game consoles. The first step to building such a system, where robots or machines can perceive and understand human actions, is to collect scene information with sensors. Multiple data modalities have been employed in human activity analysis, including 2D RGB video [6], depth maps, and skeletal data [94]. Traditionally, RGB cameras have been the conventional choice for capturing sensor data, leading to extensive research in analyzing human actions using 2D RGB video sequences over the past few decades [1]. While RGB cameras offer advantages in terms of adaptability to various environments, they come with limitations in handling more complex scenarios, such as those characterized by scale, rotation, illumination, and occlusion variations. 1 Recent advancements in affordable depth sensors, such as Kinect sensors, combined with robust real-time tracking algorithms [14], have enabled the reliable capture of 2D or 3D positions of skeletal joints during human actions. This capability significantly improves the reliability of human activity analysis systems, especially in challenging scenarios such as those mentioned above. Recognizing human actions through skeletal data has gained popularity recently, evidenced by the growing number of publications in this field. This approach involves identifying actions based on sequences of human movements, each represented by 2D or 3D coordinates. Commercial sensors that accurately measure human skeletons, like Microsoft Kinect sensors [123] and Intel RealSense depth cameras [119], have become widely available and cost-effective. Furthermore, the accessibility of vision-based algorithms for extracting position coordinates from videos, such as OpenPose [87] and DeepLabCut[79], has simplified skeleton-based activity recognition. Additionally, the emergence of commercial cameras like Senticare Altumview, which are dedicated to recording skeletal data, indicates the increasing demand for understanding human activities using skeletal information. These cameras find extensive applications in healthcare, monitoring, and surveillance. Compared to other modalities, 2D/3D human skeleton sequences offer a range of unique advantages [43]. • First, biological research indicates that skeleton motion trajectories provide sufficient information for humans to recognize actions. Thus, an action can be succinctly represented by a sequence of 3D joint sets, requiring minimal storage and enabling fast processing. • Second, recording an RGB video consumes more storage, whereas storing a skeleton sequence for the same duration requires much less storage. • Third, skeleton data is free from environmental influence once the joint tracking is done. Thus, the action recognizer is unaffected by background clutter, clothing, and 2 lighting and recognition not as computationally challenging as if RGB video was used as an input. 
• Fourth, skeletons offer privacy protection by representing a human face as a single joint with 3D coordinates, effectively hiding critical identity information. • Finally, skeletonized human bodies discard obvious racial features, preventing a learning model from drawing unexpected correlations between races and actions. Therefore, in this work, we focus on using skeleton-based motion data, i.e., 2D or 3D coordinates associated with each skeletal and hand joint as the captured motion data, which we aim to represent, process, and analyze for improved automated activity analysis. Additionally, these systems can track multiple people simultaneously, providing full-body skeletal key points and more detailed landmarks in the face, hands, and feet. This leads to a massive amount of skeleton data being transmitted or stored. In this thesis, we also propose methods to compress the data so that we can store and transfer these data efficiently. Humans perform numerous activities in their everyday lives. Existing studies [110, 1] suggest two classes of activities based on body motion and functionality. The first is simple full-body motor activity; the second is complex functional activity. Full-body motor activities consider body motion and posture, for example, walking, sitting, or running. Each functional activity class deals with different functions the subject performs, for example, reading, working on the computer, or watching TV. Existing research defines these two classes of activities with different terminologies. While simple activity analysis is well explored, complex activity analysis still has a wide range of scope to explore and build efficient algorithms. In particular, with the recent advancement of devices like Kinect and software like OpenPose, which provide detailed hand joint data, we perform complex activity analysis offline and in real-time for tasks that only involve the hands, such as kitchen activities and 3 surgical procedures. Analysis of complex activities performed using only hands is challenging due to (i) the similarity of motions in different action units, which makes them hard to distinguish, and (ii) the localization of the motion to small areas of the body. Complex human activities such as assembly tasks [45] and food preparation [37] consist of a pre-defined sequence of action units. Analyzing complex activities typically involves two stages: dividing the activity into smaller sub-actions and then recognizing them. When working with skeleton-based motion data, an automated action analysis system must extract concise and informative details, such as representations or features, to characterize human actions. This involves procedures like pre-processing, transformation, and temporal modeling. Developing techniques to create suitable representations and extracting features from captured motion data is crucial because subsequent steps, such as classification and segmentation, often operate on these extracted features rather than the raw motion data. Effective representations are typically expected to be both compact and descriptive. This means they should be able to embed data within a lower-dimension subspace while preserving the ability to distinguish between different categories or classes. In earlier studies [96], a substantial dependence on data was evident in constructing representations aiming to embody these two characteristics. 
For example, Principal Component Analysis (PCA) [113] based methods depend on data and its covariance to obtain the transform matrix, which is later applied to data. Statistical-based features also utilize data to select the set of informative statistics. Deep learning-based methods require extensive usage of data to learn the best parameters for kernel filters, and they lack interpretability. However, most of these prior works focused on human body skeleton data, and there has been limited research on understanding complex activities involving hand movements. Complex activities [29] involve intricate hand movements, subject to physical constraints for which we have prior knowledge. Consequently, we aim to create representations incorporating this knowledge about the hand skeletal structure rather than relying solely on data. As human activity involves both hands, the relative motion of the two hands is also crucial. 4 In this thesis, we focus on modeling complex dynamic hand motions. We first develop tools to transform motion data before segmentation or recognition. Importantly, these transformations are not reliant on specific training data and can be applied to various human hand motion analysis tasks. The main challenge in developing such representations of hand motion data is that hand skeletal structures do not follow a regular grid structure like images. Therefore, we will tackle this challenge by leveraging graph structure derived from the hand skeleton and graph signal processing approaches. We explore different hand graph topologies to incorporate various hand motions while analyzing their interpretability. Recently, graph signal processing has introduced notions of frequency derived from spectral graph theory to process data in irregular domains [86] [105], with successful applications in areas such as social networks [83], sensor networks [97], point clouds [2], [75], etc. Therefore, in this thesis, we propose to utilize graph structure to model the hand skeleton and further apply graph signal processing approaches to the data, i.e., graph signals. These proposed representations have several potential advantages. Because they are independent of the data, they are application-agnostic and can be used for any motion-based application. Due to their efficient energy compaction, we can use them for compressing the markerless mocap data. Further, the proposed graph structures are bipartite and symmetric, which can be further utilized to reduce the computation cost of obtaining the corresponding representations. In Chapter 3, we propose graph-based motion representations, starting with three different spatial hand graph topologies and extending them to spatiotemporal (ST) hand graphs. Then, we apply a graph transform such as the Graph Fourier Transform (GFT) or joint Fourier transform to the graph signals, i.e., motion data, defined on the constructed graph. We discuss the construction of spatial and spatiotemporal hand graphs and further derive their spatial and spectral properties, including the existence of symmetric sub-graphs, GFT modes, spectrum multiplicities, fast GFT implementation, and interpretations associated with the GFT basis. Furthermore, we discuss some desirable properties of these graph-based 5 representations, including their computational efficiency and ability to generalize to new datasets. Neural network-based models are well-known for their performance in these tasks. 
To explore the neural network-based model incorporating the spatial structure of the hand and temporal correlation of the motion data, we first design a spatiotemporal graph neural network for the hand skeleton data in Chapter 4. Interestingly, some properties of hand graphs, such as tree structure and symmetry, as mentioned in Chapter 3, can be further used to decompose into smaller sub-graphs, reducing the complexity of the network. Filtering with the smaller sub-graph is equivalent to pre-processing the data with a Haar unit-based orthogonal matrices. This pre-processing is interpretable, leading us to use separate temporal kernels for each smaller sub-graph, designing a symmetric subgraph spatiotemporal graph neural network. This not only reduces the complexity but improves the performance as well. The interpretability and stability of these models are very important to generalizing them and adapting them to a different dataset. In Chapter 5, we focus on interpretability and the geometric understanding of these models. Unlike graph Fourier bases, the ST-graph convolution network is hard to interpret. We focused on designing a data-driven method for understanding STGCN models using windowed-DTW distance-based NNK graphs. This generic method analyzes any deep neural network (DNN) that operates on spatiotemporal data, even with varying temporal lengths. To validate our insights from label smoothness, we design a grad-cam for ST-GCN to understand the contribution of each layer to the decision made by these ST-GCN-based models. This method identifies the sub-graph structure in the hand responsible for the network’s decision. The last part of the thesis (Chapter 6) studies the compression of these large motion capture datasets using graph-based methods. To our knowledge, we are the first to propose a nearly lossless compression technique for mocap data. As these graph Fourier coefficients give excellent energy compaction, we use a separable windowed spatiotemporal graph Fourier 6 transform for compressing these large data with very low bit-rate and nearly lossless reconstruction. We perform extensive experiments to show that our compression techniques can be used without significantly impacting recognition tasks. 1.1 Background 1.1.1 Graphs and graph signals Graphs are generic and natural data representations for irregular domains, including networks such as sensors, transportation, and social networks and numerous applications in digital images, videos, and point clouds. In these examples, graphs are used to structure data as a collection of vertices connected by edges. Typically, the vertices may represent data entities, while the edges represent their pairwise relationships. Under this framework, several graph-based signal processing approaches, such as transforming, filtering, and sampling, can be adopted to process and analyze the given graph data (i.e., the graph signal) further. A graph G = {V, E} is a finite set of vertices V with |V | = N and a set of edges E with |E| = M. An adjacency matrix A can then be defined to represent the connectivity among vertices in G. If there exists an edge e ∈ E connecting vertices vi and vj , the entry Aij represents the weight of the edge e = (vi , vj ); otherwise, Aij = 0. If a graph is undirected, vi is connected to vj if and only if vj is connected to vi , then A is symmetric. If a graph is unweighted, then edges are unweighted and Aij ∈ 0, 1, ∀i, j ∈ 1, , N. 
Through this thesis, we will focus on undirected graphs and will consider both unweighted and weighted graphs.

Once a graph G is defined, a graph signal can be defined as a function f : V → R, which associates each vertex v ∈ V with a scalar value f(v). A graph signal may also be represented as a vector f ∈ R^N, where f_i or f(i) is the scalar value associated with vertex v_i. It is worth noting that, for a given graph, many different graph signals can exist. Graph signal processing techniques are developed to analyze and interpret these graph signals by taking into account the graph's topology.

Algebraic Graph Representations

Most graph signal processing (GSP) approaches are built using algebraic representations of the graphs as starting points. Aside from the adjacency matrix A defined above, other popular algebraic representations for graphs include the following.

Adjacency Matrix: An adjacency matrix A represents the connectivity among vertices in a graph G. The elements of the adjacency matrix indicate whether pairs of vertices are adjacent. If there is an edge between vertex v_i and vertex v_j, the entry (i, j) (and (j, i), since the graph is undirected) is set to 1; otherwise, it is set to 0. For a weighted undirected graph, we define a weighted adjacency matrix \hat{A}, a real symmetric N × N matrix, where a_{i,j} ≥ 0 is the weight assigned to the edge connecting nodes i and j. In this thesis, we only consider undirected graphs.

Degree Matrix: The degree matrix D is an N × N diagonal matrix whose diagonal entry is the degree of the corresponding vertex, i.e., the sum of the weights of the edges connected to it: D_{ii} = \sum_{j \neq i} A_{ij}.

Graph Laplacians: We next define several types of graph Laplacian matrices, including the combinatorial graph Laplacian, the normalized graph Laplacian, and the random walk graph Laplacian.

Definition 1 (Combinatorial graph Laplacian). The combinatorial graph Laplacian matrix L is an N × N matrix, defined as:
L = D − A.   (1.1)
Since D and A are symmetric, L is also symmetric, and each of its rows adds to zero, i.e., L1 = 0, where 1 = [1, ..., 1]^\top and 0 = [0, ..., 0]^\top.

Definition 2 (Normalized graph Laplacian). The normalized graph Laplacian matrix \mathcal{L} is an N × N matrix which normalizes the combinatorial graph Laplacian by the vertex degrees, defined as:
\mathcal{L} = D^{-1/2} L D^{-1/2} = I − D^{-1/2} A D^{-1/2},   (1.2)
so that \mathcal{L}_{ij} = 1 if i = j and D_{ii} ≠ 0, and \mathcal{L}_{ij} = −A_{ij} / (\sqrt{D_{ii}} \sqrt{D_{jj}}) if i ≠ j, where the weight of each edge is normalized by the degrees of the two vertices it connects.

1.1.2 Transforms on Graphs

1.1.2.1 Graph Fourier transform

From its definition, the combinatorial graph Laplacian L can be regarded as a difference operator since, for any graph signal f ∈ R^N, we have that:
(Lf)(i) = D_{ii} f_i − \sum_{j \neq i} A_{ij} f_j = \sum_{j \neq i} A_{ij} (f_i − f_j),   i = 1, \ldots, N,   (1.3)
where (Lf)(i) is the i-th component of the vector Lf and D_{ii} = \sum_{j \neq i} A_{ij}. Therefore, Lf can be viewed as a linear filter that operates within the 1-hop neighborhood of the graph. That is, given an input signal f, the output signal value at vertex v_i, (Lf)(i), depends only on the input signal values at vertex v_i and the 1-hop neighboring vertices of v_i, i.e., N_G(v_i) = {v_j ∈ V : (v_i, v_j) ∈ E}. In other words, the absolute value of Lf measures local signal variation. For example, when v_i and its neighboring vertices have similar values, the local signal variation is small and (Lf)(i) has an absolute value close to zero. In contrast, (Lf)(i) has a larger absolute value at vertex v_i if there is more local variation around v_i.
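To make (1.1)-(1.3) concrete, the short sketch below builds the combinatorial and normalized Laplacians of a toy five-node chain graph (for instance, the joints of a single finger) and applies L as a difference operator. The graph, edge list, and signal values are illustrative assumptions, not the hand graphs defined later in the thesis.

```python
import numpy as np

# Toy undirected, unweighted graph: a 5-node chain (hypothetical example).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
N = 5

A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0            # symmetric adjacency (undirected graph)

D = np.diag(A.sum(axis=1))             # degree matrix, D_ii = sum_j A_ij
L = D - A                              # combinatorial Laplacian, Eq. (1.1)

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = D_inv_sqrt @ L @ D_inv_sqrt   # normalized Laplacian, Eq. (1.2)

# L acts as a difference operator, Eq. (1.3): smooth signals give small |Lf|.
f_smooth = np.array([1.0, 1.1, 1.2, 1.3, 1.4])
f_rough  = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
print(L @ f_smooth)    # small entries: little local variation
print(L @ f_rough)     # large entries: strong local variation
```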
While (1.3) measures the local variation around one vertex, we can further estimate the aggregate variation over all vertices in the graph using the Laplacian quadratic form:
f^\top L f = \frac{1}{2} \sum_{i=1}^{N} \sum_{v_j \in \mathcal{N}(v_i)} A_{ij} (f_i − f_j)^2 = \sum_{i < j} A_{ij} (f_i − f_j)^2.   (1.4)
When f = 1, i.e., when the signal has the smallest possible aggregate variation across the graph, we have 1^\top L 1 = 0, which demonstrates the validity of f^\top L f as an estimator of the overall variation. On the other hand, observing that L1 = 0 = 0·1 and recalling the definitions of eigenvectors and eigenvalues, we know that 1, or (1/\sqrt{N}) 1 after normalization, is an eigenvector of L corresponding to eigenvalue 0. Since L is symmetric (because the graph is undirected), it has a full set of orthogonal eigenvectors. We can obtain this set by selecting u_1 = (1/\sqrt{N}) 1 as the first eigenvector and then iteratively solving for the next eigenvector u_k:
u_k = \arg\min_{f \perp u_1, u_2, \ldots, u_{k-1},\; \|f\| = 1} f^\top L f,   k = 2, \ldots, N.   (1.5)
(1.5) shows that the successive eigenvectors possess minimal aggregate signal variation while being orthogonal to the previously selected ones. In other words, u_1, u_2, ..., u_N is a set of eigenvectors associated with real, non-negative eigenvalues λ_1, λ_2, ..., λ_N, ordered from small to large aggregate variation across the graph. These eigenvalues thus provide a notion of frequency similar to the classical Fourier transform for 1D signals. In classical Fourier analysis, the eigenfunctions associated with higher frequencies have larger variations, that is, they oscillate more rapidly, while those associated with lower frequencies are smoother and oscillate more slowly. The eigenvectors and eigenvalues of the graph Laplacian matrix provide a similar frequency interpretation. For example, the eigenvector u_1 associated with the smallest eigenvalue, which is 0, has the constant value 1/\sqrt{N} over all vertices. On the other hand, the eigenvectors associated with larger eigenvalues (higher frequencies) oscillate more between vertices; that is, vertices connected with heavier edges are more likely to have dissimilar values. More zero crossings can also be observed in these high-frequency eigenvectors. With this Fourier-like frequency interpretation, the set of eigenvectors forms the graph Fourier transform (GFT) basis U = (u_1, u_2, ..., u_N). For any graph signal x ∈ R^N, its graph Fourier transform is defined as:
\tilde{x} = U^\top x.   (1.6)
Thus, any graph signal x can be represented in terms of the GFT basis as x = U\tilde{x}, which is also known as the inverse graph Fourier transform of \tilde{x}.
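The following sketch illustrates (1.5) and (1.6) on the same hypothetical five-node chain graph used above: it computes the Laplacian eigendecomposition, which yields the GFT basis ordered by graph frequency, and applies the forward and inverse transform to a signal. The numbers are illustrative only.

```python
import numpy as np

# Rebuild the 5-node chain graph Laplacian from the previous sketch.
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Eigendecomposition of the symmetric Laplacian. np.linalg.eigh returns
# eigenvalues in ascending order, so the columns of U are already ordered
# from low to high graph frequency, consistent with Eq. (1.5).
eigvals, U = np.linalg.eigh(L)
print(np.round(eigvals, 3))       # first eigenvalue is 0
print(np.round(U[:, 0], 3))       # first eigenvector is constant (up to sign),
                                  # with magnitude 1/sqrt(5) at every node

# Graph Fourier transform, Eq. (1.6), and its inverse.
x = np.array([0.2, 0.1, 0.0, -0.1, 0.4])
x_tilde = U.T @ x                 # GFT coefficients
x_rec = U @ x_tilde               # inverse GFT
assert np.allclose(x, x_rec)      # perfect reconstruction (orthogonal basis)
```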
1.1.2.2 Spatiotemporal Graph Convolutional Networks

Spatiotemporal Graph Convolutional Networks (STGCNs) are a general framework for processing structured time series in spatiotemporal sequence learning tasks. The spatiotemporal block combines graph convolutions, which capture the most useful spatial features, with temporal convolutions, which coherently extract the most essential temporal features. The spatiotemporal graph neural network for action recognition proposed by Yan et al. [115] represents the intra-body connections of the joints within a single frame by a graph. This graph is described by an adjacency matrix A and an identity matrix I, representing the connectivity of each node in the graph with the other nodes and with itself. As discussed in Section 1.1.2.1, the Laplacian matrix acts as a high-pass filter; it is clear from (1.1) that Ax works as a low-pass filter, averaging the values in the neighborhood of a vertex v_i over all vertices v_j connected to it:
(Ax)_i = \sum_{j \sim i} A_{ij} x_j.   (1.7)
When considering a spatiotemporal signal, the input feature is represented as a tensor of dimensions (C, N, T), where C, N, and T are the number of channels, the number of joints, and the temporal length of the activity sequence, respectively. The spatiotemporal graph convolution is then performed in two stages. First, a standard 2-D convolution is performed with a temporal kernel of size (1 × τ). Second, to capture the inter-joint variations, the resulting tensor is multiplied along the joint dimension by the normalized adjacency matrix \mathbf{A} = Λ^{-1/2} (A + I) Λ^{-1/2}, where Λ_{ii} = \sum_j (A_{ij} + I_{ij}). Let the input and output feature maps be denoted by x_{in} and x_{out}, respectively. An STGCN layer is computed as
x_{out} = \mathbf{A} x_{in} W,   (1.8)
where W represents the stacked weight vectors of the multiple output channels. The adjacency matrix is divided into several matrices A_j, where A + I = \sum_j A_j. To learn the edge weights of the graph, another learnable matrix Q is used along with the adjacency matrix, so the final ST-graph convolution is implemented as
x_{out} = \sum_j Λ_j^{-1/2} (A_j ⊗ Q) Λ_j^{-1/2} x_{in} W.   (1.9)
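A minimal PyTorch sketch of the layer in (1.8)-(1.9) is shown below. It assumes a single adjacency partition A_j and applies the learnable mask Q elementwise to the normalized adjacency; the module name, tensor layout, and toy graph are illustrative assumptions, not the exact architectures used later in the thesis.

```python
import torch
import torch.nn as nn

class STGCNLayer(nn.Module):
    """Minimal spatiotemporal graph convolution in the spirit of Eq. (1.8)-(1.9)."""
    def __init__(self, in_channels, out_channels, A_norm, t_kernel=9):
        super().__init__()
        # Normalized adjacency Lambda^{-1/2}(A + I)Lambda^{-1/2}, precomputed.
        self.register_buffer("A", A_norm)                  # (N, N)
        self.Q = nn.Parameter(torch.ones_like(A_norm))     # learnable edge weights
        # Temporal convolution (kernel t_kernel x 1): filters each joint along
        # time and mixes channels; plays the role of W in Eq. (1.8).
        self.tcn = nn.Conv2d(in_channels, out_channels,
                             kernel_size=(t_kernel, 1),
                             padding=((t_kernel - 1) // 2, 0))

    def forward(self, x):
        # x: (batch, C, T, N) -- channels, time, joints
        x = self.tcn(x)                                         # temporal stage
        x = torch.einsum("bctn,nm->bctm", x, self.A * self.Q)   # graph stage
        return x

# Usage with a hypothetical 5-joint chain graph.
A = torch.zeros(5, 5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
A_tilde = A + torch.eye(5)
d = A_tilde.sum(dim=1)
A_norm = torch.diag(d.pow(-0.5)) @ A_tilde @ torch.diag(d.pow(-0.5))

layer = STGCNLayer(in_channels=3, out_channels=16, A_norm=A_norm)
out = layer(torch.randn(2, 3, 50, 5))   # batch=2, C=3, T=50 frames, N=5 joints
print(out.shape)                        # torch.Size([2, 16, 50, 5])
```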
1.2 Contributions

1.2.1 Spatiotemporal hand graph for stable activity understanding

To the best of our knowledge, this is the first work that models the human hand skeletal structure as a graph and leverages spectral graph theory and graph signal processing to construct a hand skeletal-temporal representation for the captured motion data. We propose three hand graph topologies and two graph construction methods, hand skeletal and hand skeletal-temporal graphs, and we further investigate their associated spatial and spectral properties. These properties help explain why graph-based methods perform well in several applications and make it easier to develop future graph-based schemes. The proposed methods are unsupervised and application-agnostic. We consider action segmentation and recognition tasks to evaluate the efficacy of the proposed representation. For segmentation, we explore both real-time and offline segmentation problems. The segmentation problem involves an assembling task in an industrial setting, while the recognition problem deals with kitchen and office activities. For both tasks, we propose novel notions of stability, loss function stability (LFS) and estimation stability with cross-validation (ESCV), which are used to quantify the robustness of the achieved solutions.

1.2.2 Symmetric Sub-graph Spatio Temporal Graph Convolution

We analyze hand skeleton-based complex activities by modeling dynamic hand skeletons through a spatiotemporal graph convolutional neural network (ST-GCN). This model jointly learns and extracts spatiotemporal features for activity recognition. We analyze the properties of the hand graph and present a brief analysis of its characteristics. Exploiting the symmetric nature of hand graphs, we develop a Symmetric Sub-graph spatiotemporal graph convolutional neural network (S²-ST-GCN). This decomposes the graph into smaller sub-graphs, which allow us to build a separate temporal model for the relative motion of the fingers. This sub-graph approach can be implemented efficiently by preprocessing the input data with a Haar unit-based orthogonal matrix. Then, in addition to spatial filters, separate temporal filters can be learned for each sub-graph. We evaluate the performance of the proposed method on the First-Person Hand Action dataset. The proposed method shows advantages compared to state-of-the-art methods in the cross-person setting, where the model does not see a test subject's data during training. S²-ST-GCN also performs better than a finger-based hand graph decomposition in which no preprocessing is applied.

1.2.3 Geometric understanding and visualization of Spatio Temporal Graph Convolution Networks

In Chapter 5, we propose a data-driven approach to understanding the embedding geometry induced at different layers of the STGCN, using local neighborhood graphs constructed on the feature representation of the input data at each layer. To do so, we develop a window-based dynamic time warping (DTW) distance to compare data sequences with varying temporal lengths. We characterize the functions learned by each layer of the STGCN using the label smoothness [126] of the representation. We show that STGCN models learn representations that capture general human motion in their initial layers and can discriminate different actions only in later layers. To validate our findings, we build a layerwise Spatiotemporal Graph Gradient-weighted Class Activation Mapping (L-STG-GradCAM) for spatiotemporal data to visualize and interpret each layer of the STGCN network. We evaluate the interpretability of STGCN on a skeleton-based activity recognition task. Using this method, we can see which body joints are responsible for a particular task and how their temporal dynamics contribute to the classification output. We briefly study the interpretability of a recognition task by changing the model depth and the training and testing protocol. To assess the efficacy of STG-Grad-CAM, we define its faithfulness to the model, measured by the impact of occlusions on the graph nodes. For the explainability of STGCN, we compute the contrastivity of the model for different classes based on the outcome of STG-Grad-CAM. We provide empirical evidence on different datasets and advanced networks showing that the proposed method is a generic tool.

1.2.4 Graph-based skeleton data compression

In Chapter 6, we introduce Graph-based Skeleton Compression (GSC), an efficient graph-based method for nearly lossless compression. We use a separable spatiotemporal graph transform, non-uniform quantization, coefficient scanning, and entropy coding with run-length codes for nearly lossless compression. We evaluate the compression performance of the proposed method on the large NTU-RGB activity dataset [100]. Our method outperforms a 1D discrete cosine transform method applied along the temporal direction. We also evaluate action recognition performance with the compressed data.

1.3 Organization of the thesis

The rest of the thesis is organized as follows. In Chapter 3, we start by constructing hand spatial and spatiotemporal graphs and deriving their associated spatial and spectral properties. We then present the proposed graph-based representations for skeleton-based motion data utilizing the GFT or the Joint graph Fourier Transform (JFT).
We also discuss the interpretation of the GFT basis and several desirable properties of the proposed representations. We then assess the proposed representations based on two real-world applications. The first application is skeleton-based complex activity segmentation in real-time and offline. The second is complex activity recognition, where a thorough comparison is reported with state-of-the-art techniques for public datasets. In Chapter 4, we first explore ST-GCN for hand skeleton activity recognition. Then, exploiting the hand graph’s symmetric characteristics, we propose a novel yet generic technique named Symmetric Sub-graph spatiotemporal Graph Convolution 15 and evaluate its performance in complex activity recognition tasks. We design a geometric graph-based approach to interpret the representation learned by these spatiotemporal graph neural net-based models and later develop a GradCam-based visualization technique for STGCN. We present them in Chapter 5. The compression techniques of these large markerless mocap datasets are proposed in Chapter 6. 16 Chapter 2 Background 2.1 Graphs and graph signals Graphs are generic and natural data representations for irregular domains. Examples of these domains span various types of networks, such as sensor, transportation and social networks, and numerous applications in digital images, videos and point clouds . In these examples, graphs, as a collection of vertices connected by edges, are utilized to structure data. Typically, the vertices may represent the data entities while the edges represent the pairwise relationships between them. Under this framework, several graph-based signal processing approaches, such as transforming, filtering and sampling, can be adopted to further process and analyze the given data. A graph G = V, E is defined in terms of a finite set of vertices V with |V | = N and a set of edges E with |E| = M. An adjacency matrix A can then be defined to represent the connectivity among vertices in G. If there exists an edge e ∈ E connecting vertices vi and vj , the entry Aij represents the weight of the edge e = (vi , vj ); otherwise, Aij = 0. If a graph is undirected, that is, vi is connected to vj if and only if vj is connected to vi , then A is symmetric. If a graph is unweighted, then edges are unweighted and Aij ∈ 0, 1, ∀i, j ∈ 1, , N. Through this thesis, we will focus on undirected graphs and will consider both unweighted and weighted graphs. 17 Once a graph G is defined, a graph signal can be defined as a function f : V → R, which associates each vertex in the graph v ∈ V with a scalar value f(v). A graph signal may also be represented as a vector f ∈ R N , where fi or f(i) represents the scalar value associated to vertex vi . It is worth noting that, for a given graph, there can exist many different or varying graph signals. Graph signal processing techniques are developed in order to analyze and interpret these graph signals depending on the topology of the graph. Algebraic Graph Representations Most of the graph signal processing approaches are built using algebraic representations of the graphs as starting points. Aside from the adjacency matrix A we defined above, other popular algebraic representations for graphs include the following. Incidence Matrix : The incidence matrix B is an N by M matrix, where the i th row represents the incidence of edges at vertex vi and each column corresponds to one of these edges. That is, if Bik ̸= 0, one end of ek terminates at vertex vi . 
For an undirected graph, B_{ik} = A_{e_k} if vertex v_i and edge e_k are incident, and 0 otherwise. For a directed graph, B_{ik} = −A_{e_k} and B_{jk} = A_{e_k} if edge e_k points from vertex v_i toward vertex v_j with edge weight A_{e_k}.

Degree Matrix: The degree matrix D is an N by N diagonal matrix whose diagonal entries represent the degree, i.e., the sum of edge weights, of the corresponding vertex. That is, D_{ii} = \sum_{j \neq i} A_{ij}.

Graph Laplacians: There are several types of graph Laplacian matrices, including the combinatorial graph Laplacian, the normalized graph Laplacian, and the random walk graph Laplacian, as described below.

Definition 1 (Combinatorial graph Laplacian matrix): The combinatorial graph Laplacian matrix L is an N by N matrix defined as

L = D − A.    (2.1)

Since both D and A are symmetric, L is also symmetric; furthermore, each of its rows adds to zero, i.e., L1 = 0, where 1 = [1, ..., 1]^T and 0 = [0, ..., 0]^T.

Definition 2 (Normalized graph Laplacian matrix): The normalized graph Laplacian matrix \mathcal{L} is an N by N matrix that normalizes the combinatorial graph Laplacian by the vertex degrees:

\mathcal{L} = D^{-1/2} L D^{-1/2} = I − D^{-1/2} A D^{-1/2},    (2.2)

with entries \mathcal{L}_{ij} = 1 if i = j and D_{ii} ≠ 0, and \mathcal{L}_{ij} = −A_{ij} / (\sqrt{D_{ii}} \sqrt{D_{jj}}) if i ≠ j, so that the weight of each edge is normalized by the degrees of the two vertices it connects.

2.2 Transform on Graph

2.2.1 Graph Fourier transform

Following the definition of the combinatorial graph Laplacian, L can be regarded as a difference operator since, for any graph signal f ∈ R^N, the following is always satisfied:

(Lf)(i) = D_{ii} f_i − \sum_{j \neq i} A_{ij} f_j = \sum_{j \neq i} A_{ij} (f_i − f_j),  i = 1, ..., N,    (2.3)

where (Lf)(i) represents the i-th component of the vector Lf and D_{ii} = \sum_{j \neq i} A_{ij}. Therefore, Lf can be viewed as a linear filter that operates within the 1-hop neighborhood of the graph. That is, given an input signal f, the output signal value at vertex v_i, (Lf)(i), depends only on the input signal values at vertex v_i and at the 1-hop neighbors of v_i, i.e., N_G(v_i) = {v_j ∈ V : (v_i, v_j) ∈ E}. In other words, the absolute value of Lf serves as a measure of local signal variation. For example, when v_i and its neighboring vertices all have the same value, i.e., the local signal variation is small, (Lf)(i) has the minimum absolute value 0. In contrast, it has a larger absolute value at vertex v_i if there is more local variation around v_i. While (2.3) measures the local variation around one vertex, we can further estimate the aggregate variation over all vertices in the graph using the Laplacian quadratic form:

f^T L f = (1/2) \sum_{i=1}^{N} \sum_{v_j ∈ N(v_i)} A_{ij} (f_i − f_j)^2 = \sum_{i<j} A_{ij} (f_i − f_j)^2,    (2.4)

when the graph is undirected. When f = 1, i.e., the signal has the smallest possible aggregate variation across the graph, we have 1^T L 1 = 0, which demonstrates the validity of f^T L f as an estimator of overall variation. On the other hand, observing that L1 = 0 = 0·1 and recalling the definitions of eigenvectors and eigenvalues, we know that 1, or (1/\sqrt{N}) 1 after normalization, is an eigenvector of the graph Laplacian matrix L associated with eigenvalue 0. If the graph is undirected, L is symmetric and thus has a full set of orthogonal eigenvectors.
We can look for this set of eigenvectors by selecting u_1 = (1/\sqrt{N}) 1 as the first eigenvector and then iteratively solving the following problem for the next eigenvector u_k:

u_k = \arg\min_{f ⊥ u_1, u_2, ..., u_{k-1}, ||f|| = 1} f^T L f,  k = 2, ..., N    (2.5)

The above equation shows that each successive eigenvector has minimal aggregate signal variation while being orthogonal to the previously selected ones. In other words, u_1, u_2, ..., u_N is a set of eigenvectors associated with real, non-negative eigenvalues λ_1, λ_2, ..., λ_N, ordered from small to large aggregate variation across the graph. These eigenvalues thus provide a notion of frequency similar to that of the classical Fourier transform for 1D signals. In classical Fourier analysis, the eigenfunctions associated with higher frequencies have larger variation, that is, they oscillate more rapidly, while the eigenfunctions associated with lower frequencies are smoother and oscillate more slowly. The eigenvectors and eigenvalues of the graph Laplacian matrix provide a similar frequency interpretation to that of the 1D Fourier transform. For example, the eigenvector u_1 associated with the smallest eigenvalue, which is 0, has constant value 1/\sqrt{N} over all the vertices. On the other hand, the eigenvectors associated with larger eigenvalues (higher frequencies) oscillate more between vertices; that is, vertices connected by heavier edges are more likely to have dissimilar values. Also, more zero crossings can be observed in these high-frequency eigenvectors. With this Fourier-like frequency interpretation, this set of eigenvectors forms a transform known as the graph Fourier transform (GFT), denoted U = (u_1, u_2, ..., u_N). For any graph signal x ∈ R^N, its graph Fourier transform is defined as follows:

\tilde{x} = U^T x    (2.6)

Thus, any graph signal x can be represented in terms of the GFT basis as x = U\tilde{x}, which is also known as the inverse graph Fourier transform of \tilde{x}.

2.2.2 Spatio-Temporal Graph Convolutional Network

STGCN is a general framework for processing structured time series in spatiotemporal sequence learning tasks. The spatiotemporal block combines graph convolutions for spatial information with temporal convolutions, so that the most useful spatial features and the most essential temporal features are captured coherently. The spatiotemporal graph neural network for action recognition proposed by Yan et al. [115] represented the intra-body connections of joints within a single frame by a graph. This graph is described by an adjacency matrix A and an identity matrix I, which encode the connectivity of each node with the other nodes and with itself. In general, Ax works as a low-pass filter, averaging the neighborhood values of a vertex v_i over all vertices v_j connected to it:

(Ax)_i = \sum_{j : j \sim i} A_{ij} x_j    (2.7)

When considering a spatiotemporal signal, the input feature is represented as a tensor of dimensions (C, N, T), where C, N, and T represent the number of channels, the number of joints, and the temporal length of the activity sequence, respectively. The spatiotemporal graph convolution is then performed in two stages. First, a standard 2-D convolution is performed with a temporal kernel of size (1 × τ). Second, to capture the intra-joint variations, the resulting tensor is multiplied along the second dimension by the normalized adjacency matrix \bar{A}, defined as \bar{A} = Λ^{-1/2}(A + I)Λ^{-1/2}. Let the input and output feature maps be denoted by x_{in} and x_{out}, respectively.
An STGCN layer is then computed by

x_{out} = \bar{A} x_{in} W,    (2.8)

where W represents the stacked weight vectors of the multiple output channels and Λ_{ii} = \sum_j (A_{ij} + I_{ij}). The adjacency matrix is divided into several matrices A_j such that A + I = \sum_j A_j. To learn the edge weights of the graph, another learnable matrix Q is used along with the adjacency matrix. So, the final spatiotemporal graph convolution is implemented, as in (4.2), by

x_{out} = \sum_j Λ_j^{-1/2} (A_j ⊗ Q) Λ_j^{-1/2} x_{in} W.    (2.9)

Chapter 3
Spatiotemporal hand graph for activity understanding

Using hand skeleton data to understand complex hand actions, such as assembly tasks or kitchen activities, is an important yet challenging task. Despite recent advances in devices, such as Kinect cameras, and software, such as OpenPose, which can provide detailed hand skeleton information, there is little work in the literature on this topic. This chapter analyzes hand skeleton motion data for understanding complex activities. We introduce an unsupervised hand graph-based feature extraction method and are the first group to propose human anatomy-inspired hand graphs and their various topologies. We start with a spatial hand graph and then extend it to a spatiotemporal hand graph. Our proposed method is novel and general. We consider action segmentation and recognition tasks to evaluate the efficacy of the proposed representation. For segmentation, we explore both real-time and offline segmentation problems. The segmentation problem involves an assembling task in an industrial setting, while the recognition problem deals with kitchen and office activities. For both tasks, we propose novel notions of stability, namely loss function stability (LFS) and estimation stability with cross-validation (ESCV), to quantify the robustness of the achieved solutions. Our proposed feature extraction leads to classification performance comparable to state-of-the-art methods, achieving significantly better accuracy and stability in a cross-person setting. The proposed method also outperforms the existing methods in the segmentation task in terms of accuracy and shows robustness to changes in the input hyper-parameters. This work was published in [20] and [26].

3.1 Introduction

Chapter 2 introduced the concept of using graphs to structure data and represent entity relationships, with graph signal processing techniques facilitating efficient manipulation and analysis. As a result, we propose employing a graph-based framework to establish a suitable representation of motion data. In this chapter, we analyze various types of motion data, including full-body human skeleton data, human hand motion data, and animal motion data. We develop representations for 2D/3D motion data captured from humans and animals. Advances in sensor technology have made capturing reliable 3D motion cost-effective, with various methods available for this task. Marker-based motion capture systems like MoCap can provide accurate key-point locations on the body but require substantial effort and cost to set up. Alternatively, high-definition depth cameras such as Microsoft Kinect, whose development has progressed rapidly in recent years and which are now cost-effective, can output estimated 3D human joint positions in real time with the assistance of powerful vision-based skeleton tracking algorithms. With easy access to skeleton data, we have the 3D coordinates of body joints (or some predefined key points) at each time stamp for every captured motion sequence.
Hence, all proposed representations are developed based on the assumption that the 2D/3D coordinates of body joints have been estimated. Multiple approaches for action understanding using full-body skeleton data have been proposed in the last decade, including co-occurrence feature learning [125], spatial-temporal graph convolutional networks [115], and spectral graph skeletons [56]. Graph-based approaches for human motion analysis [115, 53, 65] using skeleton data have also gained popularity due to their simplicity and efficiency. In particular, they can provide motion features without prior knowledge of the task. While tasks involving whole-body motion have been 24 studied, complex activity understanding using only hand skeleton data has not been thoroughly explored yet. Analysis of complex activities [71, 91] performed using only hands is challenging due to the similarity of motions in different action units, which makes them hard to distinguish. Moreover, analysis of complex activities done using only hands [20], such as preparing food, is more involved because motion is localized in small areas of the body. Furthermore, commonality in posture or semantic similarity exists across multiple activities. For example, open peanut butter may be much more similar to open milk than to charge cell phone in terms of action patterns. The understanding of complex activity [10] requires a multi-level analysis, starting with extracting low-level body position information, then segmenting tasks into a series of sub-tasks, and then assigning semantic meaning to each sub-task. In this chapter, we focus on the problem of representation and processing of low-level position data so that it can be used for efficient activity segmentation and recognition. In [115, 73], a neural network-based supervised approach was developed to model dynamic hand skeleton motion and analyze complex activities. However, this model requires a significant amount of training data to achieve good performance, which may not be available in some scenarios. Moreover, such systems may be difficult to use in practice, as retraining and adaptation could be costly (e.g. if the system is moved from one location to another). The problem becomes more challenging when training data are scarce, as insufficient information is available beforehand to develop a model to comprehend the task. Additionally, most pre-existing trained models exhibit inconsistency in performance when applied to other datasets. In this chapter, we systematically study hand graphs [20] and complex hand motion in an unsupervised manner. We propose a Graph-based Application-agnostic Feature eXtraction (GrAFX) method for complex action understanding using hand Motion capture (Mocap) data. This feature extraction model is application-agnostic and unsupervised. Thus, our method can be used in any hand skeleton-based application, from hand action understanding 25 to hand gesture recognition [27]. Our proposed graph-based method extracts temporal and spatial information present in the hand activity sequence. We are particularly interested in developing stable models [112], which are always preferable since they maintain consistent performance in varying datasets and conditions. While techniques to analyze stability have been developed [11, 13], none of them provides a metric that can be used to estimate stability for arbitrary machine learning models, including state-of-the-art methods based on neural networks. 
Additionally, while a metric of estimation stability with cross-validation (ESCV) has been proposed for problems such as a Lasso-based optimization [68], similar techniques have not been proposed for classification. The robustness of GrAFX, in complex activity segmentation, is measured with a dynamic time wrapping (DTW) based metric, which measures the consistency of the output under small changes in the hyper-parameters. Additionally, to analyze the stability of GrAFX in classification, we propose two metrics that quantify the variation in the loss function and the estimated probability of each class as different training sets are used to build the model. We demonstrate that our approach achieves better stability than state-of-the-art systems, leading to improved generalization across datasets. There are three main contributions in this chapter. • First, we introduce a spatiotemporal hand graph that can extract features that generalize across various applications without requiring prior knowledge of the specific application. • Second, we propose a DTW-based measurement for unsupervised temporal action segmentation tasks, which can qualitatively analyze the performance across multiple subjects. • Finally, we analyze the performance of our proposed recognition system based on two novel notions of stability: validation loss stability (LFS) and estimation stability (ESCV). 26 In this chapter, we introduce a graph representation of hand skeleton data and propose three different topologies for hand graph construction, which can efficiently capture hand motion, coordination of the fingers, and intra-hand motion. We also analyze some interesting spectral properties of these graphs. These graphs are used to extract features from hand motion, where feature extraction is completely data-independent and unsupervised. For temporal feature extraction, we use temporal pyramidal pooling. However, this does not capture the joint spatiotemporal information. Therefore, we introduce novel spatiotemporal hand graphs and study their properties, including their computational efficiency. Then, the graph-temporal transform of [74] is applied for feature extraction to graph signals defined on the constructed graphs. We provide an interpretation for the resulting representations based on the spectrum and basis of the constructed graph, which help us justify their suitability for action recognition and segmentation tasks. We present three sets of experiments. In Experiment 1, we design a real-time complex activity segmentation system, which finds segmentation instances of the subtasks in real time without prior knowledge of the actions. We exploit the idea of Bayesian Information Criteria (BIC) [44] for online unsupervised segmentation using graph features. In Experiment 2, we explore the offline action segmentation problem. In this task, we aim to segment an activity into small sub-tasks without prior task knowledge. The stability of the complete system is also provided in terms of different choices of hyper-parameters, showing the advantages of an unsupervised system. In Experiment 3, we perform an action recognition experiment on the FPHA dataset [37] using only the hand position data. We provide a detailed stability analysis for our classification model using GrAFX and compare it to the state-of-the-art LSTM model [48]. 27 3.2 Spatial hand skeleton graphs We start with a detailed description of the graph-based representation of hands and its application to understanding the activity. 
The proposed system uses no video information and relies completely on the 2D hand key points extracted by OpenPose[87]. It is a generalized framework and can be used for any motion capture data. Frames where OpenPose fails to extract hand key points because of occlusion are ignored. 3.2.1 Graph construction OpenPose provides 2D coordinates of the hand key points, but we have a choice of how to create a graph to analyze these data. Inspired by the structure of human hands, we consider three alternative hand graphs. First, we construct the nature Hand graph GH (21 nodes, 20 edges) as shown in Figure 3.1(a). Second, to account for the relative motion of the tips of the fingers, we also propose the Finger-connected Hand graph GF H (21 nodes, 24 edges), which adds a set of new edges to GH so that the fingertips are linked, as shown in Fig. 3.1(a). Finally, we note that both hands are involved in assembling tasks, so the relative motion between hands is also an important feature for understanding the activity. Consequently, Left-Right Hand graph GLRH (42 nodes, 46 edges) is constructed as shown in Fig. 3.2 adding a new set of edges between the two hands. GLRH can capture the relative motion between two hands and the intra-hand motion. All these graphs are undirected and unweighted. Each graph is defined as G = {V, E}, where V and E denote the set of vertices and edges respectively, with respective cardinalities Nv and Ne. We use the symmetric normalized graph Laplacian from Definition 2. The graph Fourier transform (GFT) is used to analyze the frequency content of graph signals. The spectral basis of the graph are the eigenvectors of L leading to a matrix U with columns {u1, u2, ..., uNv }. The corresponding spectral frequencies are the eigenvalues of L associated with U denoted by σ(G) = λ1, λ2, ..., λNv where 0 = λ1 ≤ λ2 ≤ ... ≤ λNv . 28 Figure 3.1: Proposed hand graphs: (a)Human anatomy inspired hand graph GH and (b) Finger connected hand graph GF H Figure 3.2: Left-right hand graph: Graph GLRH constructed to capture relative motion between two hands For each graph, the graph signal, which is the motion data present in each joint of the hand graph, is defined on each vertex of the graph. This motion data C is computed using the position data extracted using the tracking device or the algorithm, where ci ∈ R Nv is the graph signal of the i th coordinate of the motion. The dimension of C is Nv, d, where Nv represents the number of vertices and d represents the number of motion vector coordinates. We use the approach proposed in [52] to compute the GFT-based features for our graph in each frame. The eigenvectors uk, k = 1, ..., Nv, form an orthogonal basis for any graph signal residing on G. That implies any graph signal can be represented as a unique linear combination of uk as: ci = X Nv k=1 αk,iuk, (3.1) 29 αk,i = c ⊤ i uk, (3.2) Here, α1,i, α2,i, ..., αNv,i can act as a unique representation of the motion for a given frame frj . We use these α’s as features for activity segmentation. 3.2.2 Analysis of graph frequencies Clearly, from Figure 3.1(a), GH is a tree-structured graph, and therefore it is also bipartite. Thus, the eigenvectors of the normalized graph Laplacian are in the interval [0, 2] with λN = 2 [82]. For the graph GH, when we arrange the eigenvalues in ascending order, the eigenvalues exhibit a repeating pattern, starting with an eigenvalue of single multiplicity, followed by an eigenvalue with a higher multiplicity, and this pattern continues alternately. 
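To make this construction concrete, the following NumPy sketch builds a 21-node hand graph, computes the spectrum of its normalized Laplacian, and extracts GFT coefficients as in Equations (3.1)–(3.2). The joint ordering (wrist as node 0, four joints per finger) and the random test signal are illustrative assumptions rather than the exact conventions used in the thesis.

```python
import numpy as np

# Hand graph G_H: node 0 is the wrist; each finger is a chain of 4 joints
# attached to the wrist (21 nodes, 20 edges). The joint ordering is an
# assumption for illustration and may differ from the thesis' convention.
N = 21
edges = []
for f in range(5):                                   # five fingers
    base = 1 + 4 * f
    edges.append((0, base))                          # wrist -> first joint of finger f
    edges += [(base + k, base + k + 1) for k in range(3)]  # chain along the finger

A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                          # undirected, unweighted

D = np.diag(A.sum(axis=1))
L = D - A                                            # combinatorial Laplacian (2.1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = D_inv_sqrt @ L @ D_inv_sqrt                 # normalized Laplacian (2.2)

lam, U = np.linalg.eigh(L_norm)                      # eigenvalues in ascending order
vals, counts = np.unique(np.round(lam, 6), return_counts=True)
print(dict(zip(vals, counts)))                       # repeated interior eigenvalues, lambda_max = 2

# GFT features of one frame (Equations 3.1-3.2): one coefficient vector per
# motion-vector coordinate. c is a random stand-in for real joint motion data.
c = np.random.randn(N, 2)                            # 2D motion vectors at the 21 joints
alpha = U.T @ c                                      # alpha_{k,i} = u_k^T c_i
```

Running such a sketch illustrates the behavior described above: the extreme eigenvalues 0 and 2 each appear once, while several interior eigenvalues repeat because of the five-fold symmetry of the fingers.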
We observe that GH is an extended version of a star graph, where each finger of the star has more than one node. A hand graph with N1 fingers and N2 nodes per finger (here, Nv = N1 × N2 + 1) has a special eigenstructure with the following properties. • Once arranged in ascending order, the eigenvalue with higher multiplicity has a multiplicity greater than N1 − 1. • The number of eigenvalues with multiplicity 1 is N2 + 1. • The minimum and maximum eigenvalues are 0 and 2, respectively, and both have multiplicity 1. Likewise, the N-star graph [8] has eigenvalue 1 with multiplicity N − 2, and the other two eigenvalues are 0 and 2 (with multiplicity 1). It is also similar to the above-mentioned properties of GH. Thus, the N-star graph is also a hand graph GH with N1 = N − 1 and N2 = 1 (here N2 = 1 as each finger has only 1 node). So, • The only eigenvalue (here, λ = 1) with higher multiplicity has multiplicity N1 − 1 = (N − 1) − 1 = N − 2. 30 • The number of eigenvalues with multiplicity 1 (λ = 0 and λ = 2) is N2 + 1 = 2. Note that GF H has a unique eigenstructure, meaning all eigenvalues have multiplicity equal to 1. The spectral properties of these graphs are important, given that we are using the GFT as our feature vector. In particular, multiple ways exist to project the input graph signal onto the corresponding subspace for eigenvalues with multiplicity greater than one. In this chapter, we do not optimize the choice eigenvectors for those eigenvalues with higher multiplicity. All three graphs have some form of symmetry, potentially leading to fast algorithms for GFT computations, as we explore in Chapter 4. 3.2.3 Interpretation of graph Fourier basis One useful aspect of these graph structures is their graph Fourier bases are interpretable. Thus, the elementary basis vectors corresponding to the spectrum of these spatial graphs can be utilized for feature extraction, with the added advantage of being interpretable. In Figures 3.3, 3.4, and 3.5, we provide examples to help interpret the elementary basis vectors associated with GH, GF H, and GLRH, respectively. In all the figures, the basis vectors corresponding to the lowest frequency (λ = 0) capture the signal’s average. In Figure 3.3, λ = 0.12 captures variations between the thumb and the other fingers, while λ = 0.53 represents variations in the motion from the tip of the fingers to the base of the hand. In Figure 3.4, the basis corresponding to λ = 0.42 captures the variation among the thumb and little to the other fingers. Here, λ = 0.53 shares a similar basis as in Figure 3.3. In Figure 3.5, the basis corresponding to λ = 0.12 captures the variation between the thumb and the other fingers, λ = 0.26 corresponds to the variation in motion between hands, and λ = 0.53 shows intra hand motion variation. 31 Figure 3.3: Example elementary frequency basis for GH. Green dot: positive value. Red dot: negative value. Blue dot: zero. Each basis captures the motion variation between nodes. For example, λ = 0 captures the average of the hand motion, whereas the eigenvector corresponding to λ = 0.12 captures the variation in the motion between the thumb and the rest of the fingers. Figure 3.4: Example elementary frequency basis for GF H. Green dot: positive value. Red dot: negative value. Blue dot: zero. Each basis captures the motion variation between the nodes. 
For example, λ = 0 captures the average of the hand motion, whereas the eigenvector corresponding to λ = 0.42 captures the variation in the motion between the thumb and little to the rest of the fingers. 32 Figure 3.5: Example elementary frequency basis for GLRH. Green dot: positive value. Red dot: negative value. Blue dot: zero. Each basis captures the motion variation between the nodes. For example, λ = 0 captures the average of the hand motion, whereas the eigenvector corresponding to λ = 0.26 captures the variation in the motion between the left hand and right hand. 3.3 Spatiotemporal hand Graph In this section, we expand upon the fixed undirected graph representation of human hands, as introduced in Section 3.2, by incorporating temporal graph connections. There are various ways to construct a spatiotemporal graph, but the simplest yet powerful choice for modeling temporal data is to employ a line graph as the temporal graph, with its cardinality determined by the length of the temporal window. Using a line graph as a temporal graph allows spatiotemporal processing to be performed with a separable transform. This versatile approach accelerates the processing of spatiotemporal data and reduces complexity. It can be applied to model various spatiotemporal datasets, where the spatial graph can take any structured form while the temporal graph remains a line graph. 33 3.3.1 Spatiotemporal graph Fourier transform A spatiotemporal hand graph with temporal sequences of hand joint positions in space is constructed. Each node in the spatiotemporal graph represents the motion of the corresponding joint and that specific time. The spatial graph is constructed based on hand anatomy as shown in Figure 4.1(a). The temporal graph is a line graph Gt , which connects each hand joint with itself in consecutive times. The spatiotemporal hand graph, as shown in Figure 3.6, can be constructed applying graph Cartesian product (⊗) between GH and Gt . Figure 3.6: Kronecker product between spatial hand graph and temporal line graph results in this Spatiotemporal hand graph In complex activity analysis, the input data is represented by a matrix X, containing observations on each of the N joints for each of the T time instants. The observation at each joint contains the three dimensions of the corresponding motion vector at that joint. Each dimension is considered separately. The spatiotemporal graph Fourier transform is defined based on a separable transform as in [74], leading to the separable graph temporal Fourier transform (SGTFT): F(X; G) = ΨGXΨT (3.3) 34 where ΨT is a normalized discrete Fourier transform (DFT) matrix of size T × T and ΨG is the N × N left eigenvector matrix of the Laplacian matrix LG of G. Using the properties of the Kronecker product (⊗), the SGTFT can be written as F(X; G) = (ΨG ⊗ ΨT )X = ΨJX where ΨJ = ΨG ⊗ ΨT . The Laplacian matrix of J can be expressed as LJ = IT ⊗ LG ⊕ LT ⊗ IN so that: LJ = (ΦT ⊗ ΦG)(ΛT ⊕ ΛG)(ΨT ⊗ ΨG) = ΦJΛJΨJ (3.4) where ΦT = Ψ−1 T and ΦG = Ψ−1 G . The columns of ΨJ, denoted by uk, form a spectral basis for graph signals residing on G, so that any xi can be written as a unique linear combination of uk: ci = N X×T k=1 αk,iuk, where αk,i = c ⊤ i uk. (3.5) As mentioned in Equation 3.1, ci here, represents the i th coordinate of the motion vector present of the graph (hand) and αk be a vector of length equal to the dimension of data at each joint (e.g., length three if motion at each joint is in 3D). 
Thus, α1,i, α2,i, ..., αN×T,i is a unique representation of the signal over a window of length T. Since these α features do not depend on a specific application, we call them graph-based application-agnostic features, and 35 the method is called graph-based application-agnostic features extraction in short, GrAFX. To summarize, the advantages of the GrAFX methods are • Representations are interpretable • Feature extraction is unsupervised • Processing can be done in real-time • The feature extracted using GrAFX is application agnostic Hand skeleton data is obtained from motion-capture devices or pose estimation algorithms from videos. While extracting hand pose estimation data from videos, there should not be any camera motion present so that the accurate motion of the hand joints can be captured. Generally, complex activity data is a sequence of frames, and each frame has a set of hand joint coordinates in 2D or 3D. In the next section, we perform three experiments using the GrAFX method: 1. unsupervised real-time action segmentation, 2. unsupervised offline action segmentation, and 3. action recognition. 3.4 Applications 3.4.1 Real-time activity segmentation Monitoring activity and parsing it into smaller segments are challenging and important tasks in computer vision. There are various activity monitoring tasks, from surveillance to workflow monitoring to quality control inspection. For example, monitoring laborers’ work and developing a qualitative analysis is vital in an industrial environment because even small mistakes can be risky. Since human actions are complex, the main challenge in activity monitoring and parsing is to segment the activity sequence according to action primitives. Fine-grained complex activities are goal-driven and follow a grammar, for example, assembling tasks [45], food preparation [106], a surgical process in medicine, etc. Additionally, 36 automatic segmentation of these tasks becomes more difficult when the system has no proper knowledge about what activity is happening. In this section, we delve into an unsupervised task that involves segmenting complex activities using skeleton data. To achieve this, we will utilize the graph-based feature extraction techniques we discussed in Section 3.3. To perform real-time segmentation on our time-series data, we must establish a way to measure the similarity between two consecutive data windows. For our segmentation task, our approach hinges on detecting changes in motion patterns between different actions, and notably, we do not rely on any prior knowledge regarding the specific activities themselves. It is important to underscore that we do not have prior knowledge about the activity in this segmentation task. Instead, the only information available is the observable shift in motion patterns from one action to the next. 3.4.1.1 Segmentation using Bayesian Information Criterion(BIC) To do the temporal activity segmentation, first, graph features are extracted using the method described in Section 3.2. Later, a temporal mean-pulling model is built to capture the temporal structure of the input features. For segmentation of the sequence data, an unsupervised approach is used, which can segment any data sequence without any prior knowledge about the data. The proposed method measures the distance between two consecutive windows by Generalized likelihood ratio (GLR) [44]. At time ti , let Wl and Wr be the feature matrix of the left and right windows. 
Each column of W is constructed from the graph features computed as described in Section 3.2. Determining whether a boundary exists at frame i depends on the relative performance of two competing models. The first model assumes that w_1, ..., w_N ∈ W_l ∪ W_r is more appropriately modeled by a single distribution, W_l ∪ W_r ∼ N(µ, Σ), where w_i ∈ R^d and d is the dimension of the feature vector space. The second model assumes that w_1, ..., w_N is more appropriately modeled by two separate distributions, where w_1, ..., w_i ∈ W_l with W_l ∼ N(µ_l, Σ_l), and w_{i+1}, ..., w_N ∈ W_r with W_r ∼ N(µ_r, Σ_r). Then, ΔBIC_i is computed using:

ΔBIC_i = \log ( |Σ_{W_l ∪ W_r}|^{N/2} / ( |Σ_{W_l}|^{N_l/2} |Σ_{W_r}|^{N_r/2} ) ) − (λ/2) ( d + d(d+1)/2 ) \log N    (3.6)

where |·| is the determinant of a matrix, (d, N_l), (d, N_r), and (d, N) are the dimensions of W_l, W_r, and W_l ∪ W_r, respectively, and N = N_l + N_r. Now, if ΔBIC_i > 0, then frame i is a good segmentation boundary; otherwise, we merge W_l and W_r and compare the next window with this merged window. The first term in (3.6) is the GLR when the model is Gaussian, and the second term, (λ/2)(d + d(d+1)/2) log N, penalizes the candidate models according to their complexities. λ controls the number of segments. A small numerical sketch of this test is given below, after the experimental setup.

3.4.1.2 Experimental setup and Dataset

In collaboration with Mitsubishi Electric Research Laboratories (MERL), we have collected a new fine-grained activity dataset in an industrial setting. A robot toy assembling task was considered, and 14 subjects performed the task. It consists of a few sequential sub-tasks whose order is fixed. A camera is used to capture the video, and later we use OpenPose to extract detailed hand motion data. For privacy reasons, the dataset contains only the 2D joint positions extracted by OpenPose, not the original video. All processing is done on the 2D position data of the hands.

Key features of OpenPose: OpenPose [14] is a real-time system that jointly detects human body, hand, and facial key points (130 key points in total) in a single image. The computational performance of its body key-point estimation is invariant to the number of detected people in the image. It can detect multiple people in a single frame, providing an 18-key-point body pose estimate per person, and it can estimate 2 × 21 hand key points and 70 face key points. The input to the system can be an image, a pre-recorded video, or live input from a video camera. The system's output is a set of body pose estimation key points, which can be saved in various formats (JSON, XML, PNG, JPG). This gives us 2D key points in [x, y, c] format, where [x, y] are the joint coordinates and c is the confidence in the range [0, 1].

Dataset: Each subject is asked to assemble a GoPiGo3 [50] robot base kit according to a specific set of instructions. Figure 3.7 provides a representation of the sequential subtasks of the toy assembling task. This task has three main sub-actions.

• Action1 – Assembling: Attach the front wheel, set the red board, and tighten the screws.
• Action2 – Combining: Attach the power cable to the red board; connect the sonic sensor cable to the red board; combine the green board using the pins; attach the side wheels.
• Action3 – Checking: Check whether all the parts are attached and assembled properly.

Before starting the task, an instructor clearly demonstrates the steps for each subject. Moreover, a pictorial representation of the sequential steps is available in front of them during the task. The parts of the toy car are kept on a table with a height of 105 cm.
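As a concrete illustration of the ΔBIC test of Section 3.4.1.1, the sketch below evaluates one candidate boundary between two feature windows. The Gaussian toy features, window sizes, and λ are placeholders chosen only to make the example self-contained; they are not the settings used in the experiments.

```python
import numpy as np

def delta_bic(W_l, W_r, lam=1.0):
    """Delta-BIC of Equation (3.6) for a candidate boundary between two
    feature windows W_l (N_l x d) and W_r (N_r x d); > 0 suggests a boundary."""
    W = np.vstack([W_l, W_r])
    N_l, N_r, N = len(W_l), len(W_r), len(W)
    d = W.shape[1]
    logdet = lambda X: np.linalg.slogdet(np.cov(X, rowvar=False))[1]
    glr = 0.5 * (N * logdet(W) - N_l * logdet(W_l) - N_r * logdet(W_r))
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(N)
    return glr - penalty

# Toy example: two windows of features drawn from different Gaussians
# (placeholders standing in for the alpha features of consecutive windows).
rng = np.random.default_rng(0)
W_l = rng.normal(0.0, 1.0, size=(150, 5))
W_r = rng.normal(2.0, 1.5, size=(150, 5))
print(delta_bic(W_l, W_r))                                   # positive -> declare a boundary
print(delta_bic(W_l, rng.normal(0.0, 1.0, size=(150, 5))))   # near or below zero -> merge
```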
Eleven subjects (9 males and 2 Females) are asked to perform the task thrice. All subjects are in their 20s and early 30s and are from an engineering background, thus accustomed to the tasks involved in robot car assembly. Each subject performs the assembling task three times, getting 33 data sequences. An HD 1080p Logitech camera is used to capture the scene. After capturing all the videos, we used OpenPose to extract each subject’s 2D hand skeleton key points. OpenPose estimates 2 × 21 - hand key points in each frame at fps 30. The key point hand dataset is available online (github. [21]). 39 Figure 3.7: Steps for the toy assembling task: Action1 – Assembling: Attach the front wheel, set the red board, and tighten the screws. Action2 – Combining: Attach the power cable with the red board; Connect the sonic sensor cable to the red board; Combine the green board, use the pins; Attach the side wheels. Action3 – Checking: Check whether all the parts are attached and assembled properly. 3.4.1.3 Real-time segmentation results We first define the evaluation metric to analyze the segmentation results following the metrics proposed by Gensler et al. [38]. In this assembling task in an industrial setting, we can tolerate a slightly early or late segmentation while performing online unsupervised segmentation. Moreover, to make an online segmentation decision, we must wait for all the information from the current window to be available for processing. As a result, we can only have a segmentation point at the start of a window. A segmentation zone (SZ) is defined around each ground truth segmentation instance to account for such a scenario. Let tg be the ground truth segmentation time, i.e., times before and after tg correspond to different tasks. Then, SZ(tg) = {tg −δt, tg +δt}, which allows some uncertainty about the exact time when activities switched1 . The definition of True positive (T P), True negative (T N), False positive (F P), and False negative (F N) uses SZ as follows: • T P: If both ground truth and algorithm produce only one segmentation time in the SZ. 1Note that in practice humans do not switch activities instantly and thus SZ can also be viewed as representing the transition from one action to another. 40 • T N: If both ground truth and algorithm have no segmentation time in the SZ. • F P: Case1. If the algorithm produces more than one segmentation instance within SZ. Case2. If a time instant is not a segmentation point according to the ground truth and does not fall inside any SZ, the algorithm detects it as a segmentation instance. • F N: If the algorithm produces no segmentation point within a SZ. Let Sg and Sa be the set of segmentation time points given by ground truth and algorithm, respectively. Sˆ a contains segmentation instances corresponding to T P, hence, Sˆ a ⊂ Sa. Let fg and ˆfa be the action labels of frames segmented by Sg and Sˆ a respectively. Then, ftp = fg ∩ ˆfa and the cardinality of ftp is Ltp. Letting the length of the sequence be L we have SegAcc = Ltp L × 100%, (3.7) where a higher value for SegAcc corresponds to better performance. To take into consideration the early and late segmentation, the distance between Sg and Sˆ a is measured using (3.8). Letting the cardinality of Sg, Sˆ a and Sa be Lg , Lˆ a and La respectively, we define Scoredist = (1 − X Lg i=1 βi |Sgi − Sˆ ai | L ) × 100 Scoreuw = 100 − δ × |Lˆ a − La| (3.8) where P Lg i=1 βi = 1 is weight vector and δ is a penalty factor. 
A higher value of Score_dist corresponds to segmentation instances closer to the ground truth. The number of unwanted segmentation instances is also counted for qualitative analysis using Score_uw (a small numerical sketch of these scores is given after Table 3.1 below). An experiment is conducted varying SZ from 5 s to 10 s. The minimum value of δt is set to 2 s, since in our online segmentation system the window length L_w is set to 5 s. Note that accuracy increases as SZ increases, but a larger SZ allows more tolerance for early and late segmentation. As a compromise, for the rest of the chapter we report results for δt = 3 s and L_w = 5 s. For comparison, we use a baseline method in which the same online segmentation experiment is performed with the motion vectors computed from the hand position data as features. The sets of TPs corresponding to the proposed method and the baseline are saved in Ŝ_a^p and Ŝ_a^b, respectively. Figure 3.8 and Figure 3.9 show the segmentation achieved by the proposed method (Ŝ_a^p) and the baseline method (Ŝ_a^b), respectively, with SegAcc for each participant. Color transitions represent action changes, and the cross marks represent the ground-truth segmentation points. Clearly, our proposed method detects segmentation points within the SZ for most of the data sequences, while the baseline method mostly fails to do so. The average evaluation metrics for the proposed method with G_H, G_FH, and G_LRH are given in Table 3.1.

Figure 3.8: Segmentation outcome (Ŝ_a^p) using features from G_LRH (λ = 1) for the proposed method, with SegAcc = 84.8%. S_i and It_i represent the subject ID and iteration number, respectively.

Table 3.1 compares the three proposed hand graphs in terms of segmentation performance for different λ. For lower values of λ, Score_uw decreases but Score_dist increases, which implies over-segmentation but segmentation instances closer to S_g; for higher λ, the opposite behavior is observed. At the same time, Precision and Recall decrease as λ increases. So, λ = 1 can be chosen for better performance. It is evident from the table that the features extracted from G_LRH outperform those from G_H and G_FH in terms of all metrics. This justifies our assumption that the relative motion information between the hands is important and is efficiently captured by G_LRH. If there is a task where only one hand is involved, one can use G_H or G_FH instead. The segmentation takes 0.0034 s to process a window of 5 s using Matlab 2017b running on an 8-core Intel Xeon processor with 64 GB RAM.

Figure 3.9: Segmentation outcome (Ŝ_a^b) using features from the baseline method (λ = 1), with SegAcc = 71.6%. S_i and It_i represent the subject ID and iteration number, respectively.

                 λ = 0.8                λ = 1                  λ = 1.5
  Metric    GH     GFH    GLRH     GH     GFH    GLRH     GH     GFH    GLRH
  M1        45.3   42.1   58.1     37.3   39.8   54.3     10.6   16.7   26.2
  M2        78.7   72.7   100      66.7   66.7   85.7     21.2   33.3   47.2
  M3        54.8   51.1   71.2     46.7   48.3   64.1     14.1   22.2   33.2
  M4        86.6   84.3   93.1     78.9   79.3   84.8     69.2   71.58  75.5
  M5        61     56.7   66.1     41.6   43.1   59.6     10.4   16.3   18.9
  M6        66.3   69.7   72.5     71.2   73.1   75.4     91.5   92.1   92.3

M1, M2, M3, M4, M5, and M6 represent Precision, Recall, F1-Score, SegAcc, Score_dist, and Score_uw, respectively (measured in %).
Table 3.1: Comparison between the different graphs (hand graph, finger-connected hand graph, left-right hand graph). It is clear from the table that the left-right hand graph performs best in segmenting the assembling task.
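A minimal sketch of the two scores in Equation (3.8) is given below; the boundary times, sequence length, uniform weights β_i, and penalty factor δ are arbitrary placeholders chosen only for illustration.

```python
import numpy as np

def score_dist(S_g, S_a_hat, L, beta=None):
    """Score_dist of Equation (3.8): closeness of the matched (true-positive)
    boundaries S_a_hat to the ground-truth boundaries S_g, for a sequence of
    length L frames. beta must sum to 1 (uniform weights by default)."""
    S_g, S_a_hat = np.asarray(S_g, float), np.asarray(S_a_hat, float)
    beta = np.full(len(S_g), 1.0 / len(S_g)) if beta is None else np.asarray(beta)
    return (1.0 - np.sum(beta * np.abs(S_g - S_a_hat)) / L) * 100.0

def score_uw(n_detected, n_matched, delta=5.0):
    """Score_uw of Equation (3.8): penalizes unwanted (unmatched) detections."""
    return 100.0 - delta * abs(n_detected - n_matched)

# Placeholder example: 3 ground-truth boundaries, detections slightly off,
# plus one spurious detection, in a 3000-frame sequence.
S_g = [800, 1700, 2500]          # ground-truth boundaries (frames)
S_a_hat = [830, 1680, 2550]      # matched detections (true positives)
print(score_dist(S_g, S_a_hat, L=3000))      # closer to 100 is better
print(score_uw(n_detected=4, n_matched=3))   # one unwanted detection penalized
```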
3.4.2 Offline activity segmentation

3.4.2.1 Unsupervised Offline Activity Segmentation using GrAFX

For offline segmentation (Experiment 2), the hand skeleton data sequence is divided into smaller windows {w_1, w_2, ..., w_M}, where each window is represented by a feature vector computed using GrAFX. We use Expectation-Maximization (EM) [114] Gaussian mixture model-based clustering for segmentation. Given the target number of actions, C, we define C clusters and assume that the features of each cluster follow a different Gaussian mixture model. EM is then used to obtain the best set of parameters for the mixture models representing the subtasks. In this case, we consider three types of actions, assembling, combining, and checking, and formulate the problem as a three-class clustering problem. The GMM parameters µ_k, Σ_k, and w_k are initialized following [5]. Here, µ_k is the mean vector of the k-th class, and Σ_k and w_k are the covariance matrix and weight vector of the k-th class, respectively. If the covariance matrix is not positive definite, an identity matrix is assigned as the covariance matrix. In each iteration, the EM algorithm [42] maximizes the likelihood and re-estimates the GMM parameters µ_k, Σ_k, and w_k until the convergence criterion is reached. After clustering, each window obtains a class label from the set {1, 2, ..., C}. When there is a class label change between two consecutive windows w_t and w_{t+1}, we consider the end of window t a good segmentation instance. Since each window is assigned to a cluster, the system is sensitive to the start and end of each window. To reduce the impact of window position, we use overlapping windows. However, this may result in two labels being assigned to a given time interval. If an interval gets two different labels, the label with the higher likelihood is chosen as the current label.

3.4.2.2 Qualitative analysis of the performance

Consider a scenario where we do not have prior knowledge about the tasks performed by the subjects, but we have an example of those tasks completed by an expert. Then, using the segmentation results (Section 3.4.2.1), we can rank how well each subject has completed the task by measuring the similarity of their work with that of the expert. To quantify the similarity between two unequal-length data sequences of different subjects, we use dynamic time warping (DTW) [78], where dtw(A_i, B_j) between sequence A of length i and sequence B of length j is defined as:

dtw(A_i, B_j) = dist(A_i, B_j) + \min( dtw(A_{i-1}, B_j), dtw(A_i, B_{j-1}), dtw(A_{i-1}, B_{j-1}) )    (3.9)

We are given data sequences from an expert, ex (a reference sequence), and two subjects, s1 and s2, with respective lengths L_ex, L_s1, L_s2, where L_ex < L_s1 < L_s2, and we compute dtw(ex, s1) and dtw(ex, s2). We use a normalized DTW so that a distance can be computed when L_A ≠ L_B:

dtw_norm(A_{L_A}, B_{L_B}) = dtw(A_{L_A}, B_{L_B}) / \max(L_A, L_B)    (3.10)

To compare dtw_norm(ex, s_i) across subjects s_i, we apply min-max normalization [90] to each sequence, which removes biases due to the range of values in the data.

3.4.2.3 Offline segmentation results

Figure 3.10 shows the similarity between each subject and the expert, where iteration 1 of subject 11 was used as the expert reference (S_ex = s11, iter1). We compute dtw_norm between each data sequence and S_ex, as illustrated in the sketch below.
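The recursion in (3.9) and the length normalization in (3.10) can be implemented directly with dynamic programming; the sketch below uses a Euclidean local distance and random sequences as stand-ins for the per-window GrAFX features.

```python
import numpy as np

def dtw(A, B):
    """Dynamic time warping cost between sequences A (LA x d) and B (LB x d),
    following the recursion in Equation (3.9) with a Euclidean local distance."""
    LA, LB = len(A), len(B)
    D = np.full((LA + 1, LB + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, LA + 1):
        for j in range(1, LB + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[LA, LB]

def dtw_norm(A, B):
    """Length-normalized DTW of Equation (3.10)."""
    return dtw(A, B) / max(len(A), len(B))

# Stand-ins for an expert sequence and two subject sequences of unequal length.
rng = np.random.default_rng(1)
expert = rng.normal(size=(90, 8))
s1 = expert[:80] + 0.1 * rng.normal(size=(80, 8))    # close to the expert
s2 = rng.normal(size=(120, 8))                       # unrelated sequence
print(dtw_norm(expert, s1), dtw_norm(expert, s2))    # s1 should score lower (more similar)
```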
Lower values of dtw_norm correspond to greater similarity to the expert, implying better performance. Note that each subject performed the task three times, and it can be seen that their performance improved with each attempt, as shown in Fig. 3.10, with 10 out of 11 subjects achieving their best results in the third attempt.

Figure 3.10: dtw_norm for each data sequence, considering s11 as the expert.

Let segmentation accuracy (seg_acc) be defined as the ratio of the number of windows clustered correctly to the total number of windows. Table 3.2 summarizes seg_acc computed using different feature extraction methods. It is clearly seen that GrAFX performs the best.

  Features    Motion Vectors    GH [20]    GrAFX
  seg_acc     62.15%            79.58%     86.12%

Table 3.2: Segmentation accuracy for the USC dataset [20].

To evaluate the robustness of our algorithm, we repeat the similarity-based ranking of subjects under different conditions, varying parameters such as the window size (chosen from 2/3/4/5/7 s) and the number of subtasks. After segmenting each data sequence, the distance from the expert is computed using (3.10), and then we select the top-ranking user, i.e., the one with the smallest distance to the expert data. Out of 33 sequences, the same subjects were in the top 3 ranking 46% of the time for varying window sizes, and 76% of the time when considering the top 15 rankings.

3.4.3 Activity Recognition using GrAFX

3.4.3.1 Recognition strategy using GrAFX

Activity recognition is performed using GrAFX to extract spatiotemporal features from the hand skeleton data. A temporal window is defined to capture the local temporal variation of the data, followed by a mean pooling step to obtain a 1D feature vector per window. A Support Vector Machine (SVM) [98] is then used for classification.

3.4.3.2 Proposed Stability Metrics for Activity Recognition

Stability analysis aims to determine the sensitivity of the system's output to input variations. In this chapter, we use three different measures of stability. First, leave-one-out cross-validation stability (LOOCS) is defined as:

ψ_LOOCS = σ_acc / |µ_acc|,    (3.11)

where µ_acc and σ_acc denote the mean and standard deviation of the accuracy over different test settings. ψ_LOOCS is computed in a cross-subject setting, i.e., we train with all subjects but one and test on the subject we left out.

We also quantify loss function stability (LFS). While training any classification model, the main goal is to minimize the loss function while tuning the parameters. Thus, it is crucial to observe the variation in the loss function for a specific validation set when the training set is changed. To quantify this, we measure the validation loss stability for each validation set and compute the average. Assume there are M validation sets {vs_1, vs_2, ..., vs_M} and, for each validation set, N training sets {ts_1^i, ts_2^i, ..., ts_N^i}, where i denotes the corresponding validation set. For each validation set, we compute N validation losses {VL_1^i, VL_2^i, ..., VL_N^i}. Let µ_{VL^i} and σ_{VL^i} denote the mean and standard deviation of the validation loss over the different training sets. We then define LFS, ψ_LFS, as:

ψ_LFS = (1/M) \sum_{i=1}^{M} σ_{VL^i} / |µ_{VL^i}|    (3.12)

While computing the validation loss (e.g., cross-entropy loss in LSTM or hinge loss in SVM), we only consider the estimated score (or probability) of the true class. However, if the model is stable, any given class probability, including the probability of an incorrect class (not the true class), should be similar for different choices of the training set and the same validation set. Inspired by the concept of estimation stability [68], we introduce estimation stability with cross-validation (ESCV). Let the predicted probability of a model for each class be denoted by p_i^c, where c and i denote the class index and the validation sample of the i-th validation set, respectively. For each class, we compute ψ^c by taking the ratio of the standard deviation to the absolute mean of the class probability. Finally, ESCV, ψ_ESCV, is the average over all classes and validation sets:

ψ_ESCV = (1/MN) \sum_{c=1}^{N} \sum_{i=1}^{M} ψ_i^c    (3.13)
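All three stability measures in Equations (3.11)–(3.13) are ratios of a standard deviation to an absolute mean, aggregated in slightly different ways. The sketch below is one possible implementation, with random accuracy, loss, and probability arrays standing in for values collected over cross-validation folds; the array shapes are assumptions made only for illustration.

```python
import numpy as np

def cv_ratio(x):
    """Coefficient-of-variation style ratio sigma / |mu| used by all three metrics."""
    x = np.asarray(x, float)
    return x.std() / abs(x.mean())

def loocs(acc_per_fold):
    """psi_LOOCS, Equation (3.11): variation of accuracy over leave-one-out folds."""
    return cv_ratio(acc_per_fold)

def lfs(val_loss):
    """psi_LFS, Equation (3.12): val_loss has shape (M validation sets, N training sets)."""
    return np.mean([cv_ratio(row) for row in np.asarray(val_loss)])

def escv(class_prob):
    """psi_ESCV, Equation (3.13): class_prob has shape (M validation sets, C classes,
    N training sets); variation of each class probability over the N training sets,
    averaged over classes and validation sets."""
    p = np.asarray(class_prob)
    return np.mean([[cv_ratio(p[i, c]) for c in range(p.shape[1])]
                    for i in range(p.shape[0])])

# Placeholder numbers: 6 leave-one-person-out folds, M = 6 validation sets,
# N = 5 training sets, C = 3 classes.
rng = np.random.default_rng(2)
print(loocs(rng.uniform(0.70, 0.80, size=6)))
print(lfs(rng.uniform(0.4, 0.6, size=(6, 5))))
print(escv(rng.uniform(0.2, 0.5, size=(6, 3, 5))))
```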
3.4.3.3 Datasets

In this work, we use the First-Person Hand Action Benchmark (FPHA) proposed by Hernando et al. [37]. FPHA encompasses 45 hand action categories involving interactions with 26 distinct objects in three different scenarios, executed by six subjects. The dataset comprises 1,175 action videos and is designed to address the challenges posed by high inter-subject and intra-subject variability in style, speed, scale, and viewpoint. As mentioned in [37], to compensate for anthropomorphic and viewpoint differences, poses are normalized to have the same distance between pairs of joints, and the wrist is defined as the center of coordinates. Our experiments use only the right-hand pose annotations obtained using the magnetic sensors and inverse kinematics. There are activities (open juice bottle, open peanut butter) where the action (open) is the same but the person is handling different objects (juice bottle, peanut butter); naturally, these actions are highly related. On the other hand, there are activities with the same object (liquid soap) but different actions (open, close, pour). It is very challenging to differentiate two activities with the same action but different objects when given only the 3D position data. In fact, this is difficult even for the human eye if no background knowledge is available.

3.4.3.4 Recognition results

To compare our results with the LSTM-based action recognition in [37], we use similar experimental settings with train:test dataset ratios of 1:3, 1:1, and 3:1 at the sequence level. The second protocol consists of a 6-fold 'leave-one-person-out' cross-validation, i.e., each fold consists of 5 subjects for training and 1 for testing. In this setting, the training and testing sets do not share any subjects, which avoids subject bias.

  Protocol        LSTM [37]    GH [20]    GrAFX
  1:3             58.75        48.63      62.39
  1:1             78.73        63.3       79.14
  3:1             84.82        72.46      84.93
  Cross-person    62.06        61.55      73.67

Table 3.3: Recognition accuracy on the FPHA dataset.

As we can clearly see from Table 3.3, GrAFX with spatiotemporal connections outperforms GH [20], which uses only spatial connections, in terms of accuracy. Compared with the LSTM, our proposed method is better by 5% in the cross-person setting, which is a more realistic scenario. GrAFX's performance is comparable to that of the state-of-the-art methods in the 1:1 train:test setting, as shown in Table 3.4. Although the feature extraction of GrAFX is unsupervised, the proposed features uniquely capture the action characteristics rather than any person-specific information.

  Algorithm            Accuracy
  JOULE-pose [49]      74.60
  Gram Matrix [122]    85.39
  HBRNN [30]           77.40
  TCN-16 [58]          76.28
  TF [36]              80.69
  TCN-16 + TTN [73]    80.14
  Lie Group [109]      82.69
  GtH                  79.14

Table 3.4: Recognition accuracy on FPHA in the train:test = 1:1 protocol.
In contrast, LSTM, a data-driven method, performs poorly in the cross-person setting. Table 3.5 summarizes the stability analysis (Section 3.4.3) of these two classification models. Although the accuracy of the proposed model and LSTM is comparable in some cases, GrAFX outperforms LSTM in the cross-person setting both in accuracy and in stability.

  Method    ψ_LOOCS    ψ_LFS     ψ_ESCV
  LSTM      0.106      0.129     1.106
  GrAFX     0.0267     0.0455    0.0372

Table 3.5: Stability analysis of the classification models.

As an additional comparison with existing state-of-the-art feature extraction techniques, we use principal component analysis [113], a data-driven approach, to extract features, followed by an SVM [98] for classification. This method achieves stability comparable to that of the proposed method but lags in terms of performance, with an average accuracy of 45%.

3.5 Conclusion

To conclude, in this chapter we discuss the construction of hand-skeletal graphs and spatiotemporal graphs and the properties of their GFT bases. Given the desirable properties of the GFT basis, we propose to extract representations for motion capture data by applying the GFT of a spatial graph or the SGTFT of a spatiotemporal graph to the data. Furthermore, based on different markerless MoCap databases, we demonstrate that the proposed graph-based representations possess several advantages in interpretability, energy compaction, discrimination between categories, and computational efficiency. To evaluate the performance of this method, we choose two complex activity understanding tasks: a real-time and offline unsupervised segmentation task and a supervised classification task. We introduce a DTW-based stability metric to measure the robustness of the segmentation algorithm, and LOOCS, LFS, and ESCV to analyze the robustness of GrAFX in classification. GrAFX achieves better stability than state-of-the-art algorithms.

Chapter 4
Symmetric Sub-graph spatiotemporal Graph Convolution

In Chapter 3, we explored different choices of spatiotemporal graph topologies for analyzing human hand activity data. We noted that hand graphs are sometimes bipartite and exhibit some symmetry. In this chapter, we leverage graph symmetry to introduce novel approaches for creating spatiotemporal graphs. To illustrate the advantages of this decomposition and graph construction technique, we apply it to the same activity recognition tasks as in Chapter 3. We analyze hand skeleton-based complex activities by modeling dynamic hand skeletons through a spatiotemporal graph convolutional neural network (ST-GCN). This model jointly learns and extracts spatiotemporal features for activity recognition. Exploiting the symmetric nature of hand graphs, we propose a Symmetric Sub-graph spatiotemporal graph convolutional neural network (S2-ST-GCN) that decomposes the graph into smaller sub-graphs, allowing us to build a separate temporal model for the relative motion of the fingers. This subgraph approach can be implemented efficiently by preprocessing the input data using a Haar unit-based orthogonal matrix. Then, in addition to spatial filters, separate temporal filters can be learned for each sub-graph. We evaluate the performance of the proposed method on the First-Person Hand Action dataset. While the proposed method shows comparable performance to the state-of-the-art methods in the train:test = 1:1 setting, it achieves this with greater stability.
Furthermore, we demonstrate significant performance improvement compared to state-of-the-art methods in the cross-person setting, where the model never sees a test subject's data during training. S2-ST-GCN also performs better than a finger-based hand graph decomposition where no preprocessing is applied. This work was published in [23].

4.1 Introduction

Chapter 3 presents a technique for extracting features from complex activities using an unsupervised hand graph-based approach. We introduced two types of hand graphs, namely the spatial hand graph and the spatiotemporal hand graph, and utilized a graph Fourier transform-based tool to derive feature representations in an unsupervised manner. These methods are highly effective in scenarios where prior task knowledge or a sizable training dataset for parameter learning is lacking. Nonetheless, in situations where ample training data is available, conventional neural network-based models can be leveraged to facilitate the understanding of human activity.

The success of deep learning methods has led to a surge of deep learning-based skeleton modeling methods in the past few years. In [125], [70], [121], the authors used recurrent neural networks, while [55], [67] explored temporal CNNs to learn action recognition models in an end-to-end fashion. Kim et al. [58] proposed a temporal convolution network for 3-D human action recognition in which convolution is performed along the temporal dimension of the joints. However, these methods use only temporal models to capture the temporal variation of the data, and thus fail to capture the relative changes in motion across the joints. As an alternative, [115] introduced a spatial-temporal graph convolutional network (ST-GCN) based dynamic skeleton model, which jointly learns spatial and temporal patterns from data for action recognition. These spatiotemporal filters are separable, and a single temporal filter is learned and applied to the data from all joints. In this chapter, we first propose a hand graph-based ST-GCN to recognize complex hand activities by using the human anatomy-inspired hand graph introduced in Chapter 3. However, a single shared temporal filter makes it difficult to capture the varying temporal nature of the relative motion of the fingers, which means that more layers and filters (and thus more training data) may be needed to achieve sufficiently good performance. In this chapter, instead, we allow different temporal filters to be used on the preprocessed data of each sub-graph. In [73], a hybrid model-based and data-driven approach is used to analyze human hand activity using hand skeleton data. This paper learns warping functions to jointly reduce intra-class variability and increase inter-class separation for activity analysis. This method, however, does not exploit the structural correlation of the hand joints. Relative motion variation among the fingers is crucial to differentiate between various hand actions. To incorporate this spatial correlation, we introduce a hand graph-based supervised system to study complex activities using hand position data. There are three main contributions in this work.

• First, we extend the spatiotemporal graph convolutional networks proposed in [115] to hand graphs and use them to recognize complex activities using hand skeleton data.

• In our second contribution, we exploit the symmetry of the hand graph, decomposing it into smaller interpretable sub-graphs and modeling the temporal information separately for each of these sub-graphs.
Our proposed approach, Symmetric Sub-graph Spatiotemporal graph convolutional networks (S2-ST-GCN), achieves better accuracy with fewer convolutional layers. Our approach is based on the observation that skeleton graphs [53], including hand graphs [20], are symmetric. Symmetry is not only useful computationally, as shown in [77], but also closely tied to human motion, so exploiting it can lead to better recognition. Indeed, our method can be generalized to any symmetric skeleton graph.

• Our third contribution is to analyze the performance of the hand graph-based ST-GCN and S2-ST-GCN models along with their stability. We show that these graph-based models have better leave-one-out cross-validation stability than state-of-the-art methods.

We work with the 21-node spatial hand graph shown in Figure 3.1. Based on the symmetry of the hand graph, it is decomposed into 4 sub-graphs, and a separate temporal kernel is used for each of the sub-graphs. Before learning a filter from the data, we preprocess the data using a predefined filter, namely, Haar units designed for the hand graph [77]. We provide a detailed interpretation of the sub-graphs and the corresponding preprocessing, which helps us justify using a separate temporal model for each sub-graph. Multiple S2-ST-GCN blocks are then applied to the preprocessed input data to extract higher-level features. Finally, a softmax is used to recognize the activity from these feature sets. We evaluate the performance of this method on the First-Person Hand Action (FPHA) [37] dataset, which includes activity data in different settings such as kitchen, office, and industry. S2-ST-GCN shows competitive performance with respect to the state-of-the-art algorithms in experiments where training and test data sets include different examples from the same subjects. Additionally, S2-ST-GCN significantly outperforms the state-of-the-art method (about 10% increase in accuracy) in a cross-person setting, where training and test data do not have any subject overlap. Note that this is a more realistic scenario since it does not require training data to be available for every user. S2-ST-GCN also shows promising performance in terms of leave-one-out cross-validation stability [13] compared to the deep learning-based long short-term memory (LSTM) model [37].

4.2 STGCN for hand skeleton data

Given these temporal sequences of hand joint positions in space, a spatiotemporal hand graph is constructed. The spatial graph is constructed based on hand anatomy, as shown in Figure 3.1. The temporal graph is a line graph G_t, which connects each hand joint with itself in consecutive time-frames. The spatiotemporal hand graph, as shown in Figure 3.6, can be constructed by applying the graph Cartesian product (⊗) between G_H and G_t. Instead of directly using the adjacency matrix A_s of the spatiotemporal hand graph, we create a new graph with self-loops and use A_s + I as the adjacency matrix. The self-loops are added so that the features associated with a node are combined with those of its neighbors. Let A_s and A_t denote the normalized adjacency matrices (Definition 2) of the spatial and temporal graphs, respectively, where for a graph with adjacency matrix A the normalization is

A = D^{-1/2} (A + I) D^{-1/2},  with  D_ii = Σ_j (A_ij + I_ij).

Thus, the spatiotemporal graph can be represented as

A_st = I_t ⊗ A_s + A_t ⊗ I_s.
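For illustration, the spatiotemporal adjacency just described can be assembled with Kronecker products; the following is a minimal NumPy sketch under the assumptions that the temporal graph is a length-T line graph and that both factor graphs are normalized as above. The helper names are illustrative, not from our implementation.

```python
import numpy as np

def normalize_adjacency(A):
    """A_norm = D^(-1/2) (A + I) D^(-1/2), with D_ii = sum_j (A_ij + I_ij)."""
    A_loop = A + np.eye(A.shape[0])
    d = A_loop.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_loop @ D_inv_sqrt

def spatiotemporal_adjacency(A_spatial, T):
    """A_st = I_t (x) A_s + A_t (x) I_s for a spatial graph repeated over T frames,
    with the temporal graph taken to be a line graph connecting consecutive frames."""
    N = A_spatial.shape[0]
    A_temporal = np.diag(np.ones(T - 1), 1) + np.diag(np.ones(T - 1), -1)
    A_s = normalize_adjacency(A_spatial)
    A_t = normalize_adjacency(A_temporal)
    return np.kron(np.eye(T), A_s) + np.kron(A_t, np.eye(N))
```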
The input data to the ST-GCN is a 2D tensor with the number of joints along one axis and time along the other. This is analogous to image-based CNNs, as images can be represented as grid graphs where the graph signal residing at each node is the pixel intensity. For a spatiotemporal signal, the input feature is represented as a tensor of (C, N, T) dimensions, where C, N, and T denote the number of channels, the number of joints, and the temporal length of the activity sequence, respectively. Let x_in and x_out denote the input and output feature maps, respectively. An ST-GCN layer is computed as

x_out = A_s x_in W,    (4.1)

where W represents the stacked weight vectors of multiple output channels. We use the spatial configuration partitioning proposed in [115]. The temporal filters, with kernel size (1 × τ), are shared by all nodes. Another matrix Q is introduced to learn the edge weights of the graph, so that the ST-graph convolution is implemented as

x_out = (A_s ⊙ Q) x_in W,    (4.2)

where ⊙ represents element-wise multiplication.

4.3 Symmetry based graph decomposition

The hand graph G_H of Figure 4.1 is bipartite with a left-right (LR) symmetry [77]. Exploiting this symmetry, we can first decompose G_H, based on the LR symmetry around the middle finger, into G_H^+ and G_H^-, as shown in Figure 4.1. Then, using the LR symmetry in G_H^+, a further decomposition can be applied, and we get G_H^{++} and G_H^{+-}. As a result, we can decompose the graph into 4 sub-graphs, G_1^H (purple graph), G_2^H (yellow graph), G_3^H (red graph) and G_4^H (blue graph), with 9, 4, 4, and 4 nodes, respectively, as shown in Figure 4.1(c). Filtering using these sub-graph Laplacians is equivalent to filtering using the Laplacian (L = D − A) of G_H [77]. After pre-processing the input x_in = {x_1^in, x_2^in, ..., x_21^in} using the orthogonal matrix B_{n,p}, we get z = B_{n,p} x_in, where z = {z_1, z_2, ..., z_21}. Here, B_{n,p} is an n × n orthogonal matrix representing a stage of p parallel Haar units (see Figure 4.2). Equations (4.3), (4.4), (4.5) and (4.6) show the relation between the different elements of x_in and z.

z_1 = x_1, z_4 = x_4, z_9 = x_9, z_14 = x_14, z_19 = x_19,
√2 z_2 = x_2 + x_3 + x_5 + x_6,  √2 z_7 = x_7 + x_8 + x_10 + x_11,
√2 z_12 = x_12 + x_13 + x_15 + x_16,  √2 z_17 = x_17 + x_18 + x_20 + x_21    (4.3)

√2 z_3 = x_2 − x_3 + x_6 − x_5,  √2 z_8 = x_7 − x_8 + x_11 − x_10,
√2 z_13 = x_12 − x_13 + x_16 − x_15,  √2 z_18 = x_17 − x_18 + x_21 − x_20    (4.4)

√2 z_5 = x_3 − x_5,  √2 z_10 = x_8 − x_10,  √2 z_15 = x_13 − x_15,  √2 z_20 = x_18 − x_20    (4.5)

√2 z_6 = x_2 − x_6,  √2 z_11 = x_7 − x_11,  √2 z_16 = x_12 − x_16,  √2 z_21 = x_17 − x_21    (4.6)

We group the elements of z into four groups according to their corresponding sub-graphs G_1^H, G_2^H, G_3^H and G_4^H. Let ϕ_1 = {z_1, z_2, z_4, z_7, z_9, z_12, z_14, z_17, z_19}, ϕ_2 = {z_3, z_8, z_13, z_18}, ϕ_3 = {z_5, z_10, z_15, z_20} and ϕ_4 = {z_6, z_11, z_16, z_21}. ϕ_1, ϕ_2, ϕ_3 and ϕ_4, shown in Figure 4.2, are the inputs to G_1^H, G_2^H, G_3^H and G_4^H, respectively. As we can see from Figure 4.2, ϕ_1 captures the average of the signal present at the different fingers, while ϕ_2 captures the difference between the (thumb, index) and (ring, little) finger pairs. ϕ_3 captures the difference between the index and ring fingers, whereas ϕ_4 captures the difference between the thumb and little fingers.
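A minimal sketch of the pre-processing relations (4.3)–(4.6) and of the grouping into ϕ_1–ϕ_4 is given below; the 1-based joint indexing (with index 0 left unused) is an assumption made to match the equations, and the coefficients are taken as written above.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_preprocess(x):
    """Map the 21 per-joint signals x[1..21] to z[1..21] following (4.3)-(4.6).
    x: array of shape (22, ...) with index 0 unused so that x[k] matches the text."""
    x = np.asarray(x, dtype=float)
    z = np.zeros_like(x)
    for k in (1, 4, 9, 14, 19):                       # nodes kept unchanged by the Haar units
        z[k] = x[k]
    # each block (a, b, c, d) is one group of four joints processed by two Haar stages
    blocks = [(2, 3, 5, 6), (7, 8, 10, 11), (12, 13, 15, 16), (17, 18, 20, 21)]
    sums   = (2, 7, 12, 17)   # averages of the block, Eq. (4.3)     -> phi_1
    diffs  = (3, 8, 13, 18)   # (a+d) - (b+c) pattern of Eq. (4.4)   -> phi_2
    inner  = (5, 10, 15, 20)  # b - c differences, Eq. (4.5)         -> phi_3
    outer  = (6, 11, 16, 21)  # a - d differences, Eq. (4.6)         -> phi_4
    for (a, b, c, d), s, f, i, o in zip(blocks, sums, diffs, inner, outer):
        z[s] = (x[a] + x[b] + x[c] + x[d]) / SQRT2
        z[f] = (x[a] - x[b] + x[d] - x[c]) / SQRT2
        z[i] = (x[b] - x[c]) / SQRT2
        z[o] = (x[a] - x[d]) / SQRT2
    return z

# grouping of the preprocessed signals into the four sub-graph inputs
PHI_1 = [1, 2, 4, 7, 9, 12, 14, 17, 19]
PHI_2 = [3, 8, 13, 18]
PHI_3 = [5, 10, 15, 20]
PHI_4 = [6, 11, 16, 21]
```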
Figure 4.1: The 21-node hand graph and its symmetry-based decomposition. Red, green, and blue circles represent the nodes in V_X, V_Z, and V_Y, respectively. V_X and V_Y contain the sets of nodes belonging to the two sides of the symmetry axis. V_Z contains the nodes on the symmetry axis used for the next stage of decomposition. Unlabeled edges have weight 1, and P_{b,i} are permutation operations.

Figure 4.2: Butterfly structure of the hand graph decomposition — relation between x_in and z: z = B_{n,p} x_in. Each small butterfly structure represents one Haar unit. Here, we see two stages of butterfly structures for the two stages of decomposition shown in Figure 4.1. A_1, A_2, A_3 and A_4 are the GFTs corresponding to the two connected components of G_1^H, G_2^H, G_3^H and G_4^H, respectively. Unlabeled edges have weight 1, and P_{b,i} are the permutation operations.

4.3.1 Symmetric Sub-graph ST-GCN

As each of the smaller sub-graphs captures different hand motions, we use a separate temporal model for each sub-graph in the ST-GCN setting and name the resulting model the Symmetric Sub-graph Spatiotemporal graph convolutional network (S2-ST-GCN). S2-ST-GCN is computed using the following equation:

x_out = [ (A_s1 ⊙ Q_1) ϕ_1 W_1 ; (A_s2 ⊙ Q_2) ϕ_2 W_2 ; (A_s3 ⊙ Q_3) ϕ_3 W_3 ; (A_s4 ⊙ Q_4) ϕ_4 W_4 ]    (4.7)

where A_si is the normalized adjacency matrix corresponding to the i-th sub-graph, Q_i is the corresponding spatial graph weight matrix, and W_i is the corresponding stacked temporal weight matrix. Note that our approach leads to a significant reduction in the number of spatial weights to be learned (4 × 4 + 4 × 4 + 4 × 4 + 9 × 9 = 129) compared to ST-GCN (21 × 21 = 441). This leads to a smaller number of multiplications and summations and lower complexity compared to ST-GCN. Modeling the average and difference of the signals present at different fingers separately, as achieved by our pre-processing (see Figure 4.2), can also have advantages for recognition. For example, for some actions (e.g., grabbing an object) there are changes in the relative motion between the fingers, whereas in others (e.g., moving an object from one place to another), the relative motion of the fingers is almost zero, but their average motion is similar to the motion of the object.

4.3.2 Network architecture

We use multiple spatiotemporal graph convolution layers on the input data to extract higher-level features. Due to the limited availability of hand skeleton data, it was crucial not to choose a very deep network, so that there would be sufficient data for training without compromising performance. Before feeding the data to the network, hand skeleton poses are normalized so that all hands have the same distance between pairs of joints, with the wrist as the center of coordinates, as proposed in [37]. This takes care of anthropomorphic and viewpoint differences. The details of the network architecture are shown in Figure 4.3. Since weights on different nodes of the S2-ST-GCN are shared within a graph, the input hand skeletons are first fed to a batch normalization layer to keep the scale of the input data consistent across joints. The proposed model is composed of 3 layers of S2-ST-GCN units (Section 4.2) with a temporal kernel size of 15. The three layers have 16, 128, and 256 output channels, respectively. We randomly drop out features with 0.5 probability after each S2-ST-GCN unit to avoid overfitting. After that, global pooling is performed on the resulting tensor to obtain a 256-dimensional feature vector for each sequence. Finally, we feed these vectors to a softmax classifier. The complete model is trained in an end-to-end manner with backpropagation. The models are learned using stochastic gradient descent with a varying learning rate, which starts at 0.01 and decays by a factor of 0.1 after every 10 epochs.
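The per-sub-graph filtering of (4.7) can be sketched as follows. This is only an illustration under stated assumptions: W_i is treated as a 1×1 channel-mixing matrix, the learned temporal filters are replaced by a simple moving average of length τ_i, and the (C, N_i, T) tensor layout and names are not from our implementation.

```python
import numpy as np

def s2_stgcn_block(phis, A_norms, Qs, Ws, taus):
    """One S2-ST-GCN unit in the spirit of Eq. (4.7).
    For each sub-graph i: spatial filtering with (A_i * Q_i), channel mixing with W_i,
    then a separate temporal filter per sub-graph (here a moving average stand-in).
    phis[i]: (C_in, N_i, T) pre-processed input of the i-th sub-graph."""
    outputs = []
    for phi, A, Q, W, tau in zip(phis, A_norms, Qs, Ws, taus):
        y = np.einsum('nm,cmt->cnt', A * Q, phi)     # spatial graph filtering (A ⊙ Q) φ
        y = np.einsum('oc,cnt->ont', W, y)           # channel mixing (C_in -> C_out)
        kernel = np.ones(tau) / tau                  # placeholder for the learned temporal filter
        y = np.apply_along_axis(lambda s: np.convolve(s, kernel, mode='same'), -1, y)
        outputs.append(y)                            # one output per sub-graph, later concatenated
    return outputs
```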
Figure 4.3: Symmetric Sub-graph spatiotemporal graph neural network for complex activity recognition from hand skeleton data. Extracted hand skeleton data are normalized and pre-processed using the hand graph decomposition; three S2-ST-GCN units followed by global pooling and a softmax produce the predicted action.

4.4 Experimental Results

4.4.1 Experimental setting

To evaluate the performance of S2-ST-GCN, we use the FPHA dataset [37] described in Section 3.4.3.3. To compare the results of S2-ST-GCN with the baseline method in [37], we use similar experimental settings. The first protocol uses different dataset partitions with no overlap for training and testing. This includes train:test ratios of 1:3, 1:1, and 3:1 at the sequence level. The second protocol, Cross-person, consists of a 6-fold 'leave-one-person-out' cross-validation, i.e., each fold consists of 5 subjects for training and one for testing. This takes care of subject bias. Thus, training and testing sets do not share data from the same subjects, which is a more accurate representation of what one would encounter in a real-world scenario. To illustrate the enhanced performance facilitated by the symmetry-based decomposition, we explore an alternative decomposition approach. In this method, we represent each finger as an independent graph, utilizing the motion vector associated with each vertex as a signal on the graph. We compare our results to this subgraph-based method, where each subgraph represents a finger, G_1^F, G_2^F, G_3^F, G_4^F, but we directly apply the input data, without any preprocessing, to these sub-graphs to extract higher-level features. We call this method finger-based ST-GCN (F-ST-GCN).

4.4.2 Results

4.4.2.1 Performance analysis

Table 4.1 summarizes the performance of the 3-layer ST-GCN (A), 7-layer ST-GCN (A), 3-layer F-ST-GCN (L), 3-layer S2-ST-GCN (L) and 3-layer S2-ST-GCN (A) networks with respect to the state-of-the-art method LSTM. Here, A and L denote adjacency and Laplacian matrices, respectively. Clearly, ST-GCN performs better than the LSTM baseline. S2-ST-GCN-3 (A) outperforms F-ST-GCN-3, ST-GCN-7 and LSTM in all the experiments with different protocols. The table shows that the accuracy of our proposed method is significantly higher in the cross-person setting. ST-GCN and S2-ST-GCN incorporate the relations among the nodes along with the temporal information, which can explain the improved performance in the activity classification task. The graph decomposition in S2-ST-GCN helps achieve better localization on the graph, leading to better performance with fewer layers. In this recognition task, filters based on the adjacency matrix perform better than Laplacian filters. Table 4.2 provides a comparison with the state-of-the-art algorithms reported in [73]. Among these methods, the Gram matrix method [122] and the Lie group approach [109] also used dynamic time warping [9] for sequence alignment, as well as non-Euclidean features, to help improve performance. These results are computed with the train:test = 1:1 protocol. Our proposed methods ST-GCN-3, ST-GCN-7, and S2-ST-GCN-3 achieve accuracy comparable to the state of the art, and S2-ST-GCN-3 outperforms all neural network-based methods. Fig. 4.4 shows the normalized confusion matrix for the action recognition results.

Figure 4.4: Normalized confusion matrix.
Some activities, such as high five, sprinkle, and pour salt, are easily recognized, while activities such as give coin, unfold glasses, and take letter from envelope are commonly misclassified, likely because the hand poses involved are more subtle.

Accuracy in %
Method             1:3     1:1     3:1     Cross-person
LSTM               58.75   78.73   84.82   62.03
ST-GCN-3 (A)       66.06   70.59   85.08   65.88
ST-GCN-7 (A)       68.69   81.53   86.41   66.68
F-ST-GCN-3 (L)     57.37   60.57   81.93   63.73
S2-ST-GCN-3 (L)    59.97   73.74   82.93   71.64
S2-ST-GCN-3 (A)    64.46   83.25   87.80   72.60
Table 4.1: Performance summary of the algorithms on action recognition for different training/testing protocols

Method                            Accuracy (%)
Moving Pose [120]                 56.34
JOULE-pose [49]                   74.60
HBRNN [30]                        77.40
TF [36]                           80.69
Lie Group [109]                   82.69
Gram Matrix [122]                 85.39
2-layer LSTM                      76.17
2-layer LSTM + TTN [73]           78.43
TCN-16 [58]                       76.28
TCN-16 + TTN [73]                 80.14
TCN-64 [58]                       79.10
TCN-64 + TTN [73]                 81.32
TCN-32 [58]                       81.74
TCN-32 + TTN [73]                 82.75
TCN-32 (affine warp)              70.43
TCN-32 + TTN (affine warp) [73]   78.26
ST-GCN-3                          70.59
ST-GCN-7                          81.53
S2-ST-GCN-3                       83.25
Table 4.2: Action recognition results on the FPHA dataset

4.4.2.2 Stability of the network

We define a notion of stability based on quantifying changes in the system's output when we change the training dataset. A learning algorithm is said to be stable if the learned model does not change much for different training datasets. Measuring these changes gives us a metric for stability. In this chapter, we measure leave-one-out cross-validation stability [13] to quantify model stability. We denote the test accuracy set by α_test, where α_test = {α_1, α_2, ..., α_N} and α_i denotes the accuracy for test subject i. Leave-one-out cross-validation stability is measured as the ratio of the standard deviation σ(α_test) of the test-set accuracies to their mean μ(α_test), as in Equation 3.11. Here, we used the cross-person setting mentioned in the previous section, where each fold consists of 5 subjects for training and one for testing. The lower the ψ_LOOCS value, the higher the stability of the model, and vice versa. Table 4.3 shows the stability measures of the different methods. S2-ST-GCN shows better stability than LSTM.

Method     LSTM    ST-GCN-3   ST-GCN-7   S2-ST-GCN-3
ψ_LOOCS    0.106   0.081      0.086      0.082
Table 4.3: Cross-validation stability

4.5 Conclusion

This chapter introduced a novel approach for complex activity recognition using hand skeleton data. Our method leverages the symmetry of the hand graph structure to decompose it into smaller sub-graphs, each processed by a separate temporal model. This is in contrast to traditional models like LSTM and TCN, which do not adequately capture the nuances of hand-joint motion. The input data undergoes pre-processing through a Haar unit-based orthogonal matrix, and the rationale behind distinct temporal modeling for each sub-graph becomes evident through the interpretation of the pre-processed data. We evaluate the performance of our method on the FPHA dataset, which consists of 45 daily activities. In a 1:1 train-test setting, where both training and testing data come from the same subjects but different iterations, our S2-ST-GCN method achieves comparable performance to state-of-the-art methods. However, our approach truly shines in the cross-person setting, simulating real-world scenarios.
S2-ST-GCN outperforms LSTM and F-ST-GCN in this setting. Additionally, S2-ST-GCN exhibits superior stability compared to LSTM, making it a more reliable choice for real-world applications.

Chapter 5
Geometric understanding and visualization of Spatiotemporal Graph Convolution Networks

In the preceding chapters, we have seen that STGCNs have emerged as a desirable model for many applications, including skeleton-based human action recognition. Despite achieving state-of-the-art performance, our limited understanding of the representations learned by these models hinders their application in critical and real-world settings. While layerwise analysis of CNN models has been studied in the literature, to the best of our knowledge, there exists no study on the layerwise explainability of the embeddings learned on spatiotemporal data using STGCN. In this chapter, we first propose a data-driven understanding of the embedding geometry induced at different layers of the STGCN using local neighborhood graphs constructed on the feature representation of the input data at each layer. To do so, we develop a window-based dynamic time warping (DTW) to compute the distance between data sequences with varying temporal lengths. We characterize the functions learned by each layer of STGCN using the label smoothness [126] of the representation. We show that STGCN models learn representations that capture general human motion in their initial layers and can discriminate different actions only in later layers. This provides justification for experimental observations showing that fine-tuning of later layers works well for transfer between related tasks. We provide experimental evidence on different datasets and advanced networks that justifies the proposed method as a generic tool. We also show that noise at the input has a limited effect on label smoothness, which can help justify the robustness of STGCNs to noise. To validate our findings, we build a layerwise Spatiotemporal Graph Gradient-weighted Class Activation Mapping (L-STG-GradCAM) for spatiotemporal data to visualize and interpret each layer of the STGCN network. Gradient-based class activation maps (Grad-CAM) [99] are popular for interpreting convolutional neural networks for grid-structured data such as images. In this chapter, we design an extension of Grad-CAM for spatiotemporal graph convolution (STG-Grad-CAM) to improve the interpretability of STGCNs. We provide results for a skeleton-based activity recognition task as proof of concept. We show which body joints are responsible for a particular task and how their temporal dynamics contribute to the classification output. We present a brief study of the interpretability of a recognition task while changing the model depth and the training and testing protocol. To evaluate the efficacy of STG-Grad-CAM, we compute two metrics, faithfulness and contrastivity, as defined in Section 5.4.4. The faithfulness of STG-Grad-CAM to the model is measured by the impact of occlusions of the graph nodes. To evaluate the explainability of STGCN, we compute the contrastivity of the model for different classes based on the outcome of STG-Grad-CAM. In the cross-person setting, we observe better contrastivity than in the cross-view setting. We also extend this method to a layerwise Spatiotemporal Graph Gradient-weighted Class Activation Mapping (L-STG-GradCAM) for a layerwise analysis of the contributions and visualization of the representations at different layers.
5.1 Introduction

STGCNs [115] can handle the temporal dynamics of graph data and are used in many applications such as traffic forecasting [118] and skeleton-based action recognition [95]. Training these STGCN models involves several design choices, like other deep learning methods, e.g., architecture, optimization routine, loss function, and dataset. Typically, the resulting trained model has properties that are coupled in highly complex ways to the choices made at training, so that the choice of a specific model is primarily justified by its performance on data selected for evaluation. While this practical perspective has led to significant advances, a better understanding of the system is needed for safe and robust deployment in the real world. Two major approaches have been used in the literature to understand deep learning systems. Function approximation methods are based on the inductive bias of the loss function [39], the ability of the optimization to achieve good minima [3], or the study of classifier margins [41]. Data-driven analysis methods consider the relative position of sample data-points in the representation domain for characterization [4, 103, 18]. Data-driven approaches can provide a unified framework for understanding models because they can abstract the specific functional components. Specifically, in a data-driven approach, functions do not need to be explicitly modeled; they can be characterized implicitly using the outputs they produce. In this chapter, we develop a data-driven approach to better understand STGCN models. Our approach is based on a layer-wise analysis, interpretation, and visualization of the embeddings produced by the STGCN. Our proposed method starts by defining a Dataset graph (DS-Graph), which captures the pairwise similarities between the sequences in the set, represented by their embeddings. This allows us to compare models obtained with very different architectures by simply comparing the DS-Graphs they produce in their respective embedded spaces. While our method is widely applicable, our experiments focus on a human activity recognition task using skeleton-based data as an illustrative task to evaluate our STGCN analysis methods. This type of data has been widely used in human action recognition (Chapter 4) due to its view-invariant representation of pose structure, robustness to sensor noise [123], and efficiency in computation and storage [64, 117]. Recently, STGCN approaches have gained popularity by demonstrating superior performance in human activity understanding [65, 72, 24, 89] and have become one of the state-of-the-art methods in the field of activity recognition. As will be shown experimentally, our proposed layer-wise analysis of STGCNs helps us to (i) understand their generalization, (ii) detect bias toward learning any particular feature, (iii) evaluate model invariance to a set of functions, and (iv) assess robustness to perturbations of the input data. For example, in STGCN for skeleton-based activity recognition [115], some layers may focus on learning the motion of specific body parts. Therefore, some models will not be suitable for new action classes where the motion is localized in other body parts.
While layer-wise analysis of CNN [12] models has been studied in the literature [60, 46], to the best of our knowledge, there exists no study on the layer-wise explainability of the embeddings learned on spatiotemporal data using STGCN. In fact, most of the work on STGCN interpretation has studied only the final layer [65, 25]. Extending layer-wise analysis to STGCNs is not straightforward because of the varying lengths of the STGCN embeddings. This variability makes it difficult to measure the similarity of data-point embeddings, such as action sequences with differing lengths, as commonly employed similarity metrics (e.g., cosine similarity or Euclidean distance) are unsuitable for sequences of varying lengths. Our first major contribution is a geometric framework to characterize the data manifolds corresponding to each STGCN layer output. Our approach analyzes these manifolds by constructing a Non-Negative Kernel (NNK) DS-Graph [101] (Section 5.2.2), where nodes represent input sequences (actions) and distances between nodes are computed using dynamic time warping (DTW) [80] (Section 5.2.4). This allows a distance to be computed between actions with different durations. We choose the NNK construction due to its robust performance in local estimation across different machine learning tasks [103]. The benefits of the NNK construction will be demonstrated through a comparison with k-NN DS-Graph constructions in Section 5.4.3. For the DS-Graph at each layer, we quantify the label smoothness as a way to track how the STGCN learns (Figure 5.1). Our approach has several important advantages: (1) the analysis is agnostic to the training procedure, architecture, or loss function used to train the model; (2) it allows for the comparison of features having different dimensions; (3) it can be applied to data that were not used for training (e.g., unseen actions or data in a transfer setting); (4) it allows us to observe how the layer-wise representations are affected by external noise added to the input. Our second major contribution is to extend our previous method, spatiotemporal graph GradCAM (STG-GradCAM) [25], to perform layer-wise visualization of the contributions of different Skeleton-Graph (S-Graph) nodes. To achieve this, we merge the class-specific gradient for a data-point at each layer with the representations learned by that layer. This enables us to interpret individual layers within an STGCN network. The resulting layer-wise STG-GradCAM (L-STG-GradCAM) allows us to visualize the importance of any node in any STGCN layer for the classification of a particular query class (action). This visualization helps confirm the results obtained through our analysis of the STGCN model using NNK-based geometric methods. It enhances the transparency of the model and deepens our comprehension of the representations learned at each layer.
More specifically, inspired by the explainability work on CNNs [99] and GCNNs [124], we introduce methods to interpret the output of STGCNs. We extend Gradient-weighted Class Activation Mapping (Grad-CAM) [99] to spatiotemporal graph convolution data and use it to interpret the predictions made by STGCNs. Our proposed method, spatiotemporal graph Grad-CAM (STG-Grad-CAM), is generic and can be used to analyze any spatiotemporal graph topology. Grad-CAM techniques have the potential to be even more helpful for spatiotemporal graphs than for images, because it may be hard for humans to intuitively determine the relative importance of different graph nodes for specific classification tasks. The spatiotemporal graph convolution we consider here is a separable transform, so both the spatial and temporal convolutions retain separate localized information, and STGCN retains localized information at any point in space and time. STG-Grad-CAM uses the gradient of any query (action) class flowing into the final convolutional layer, which is a function of the graph adjacency matrix and the previous layer's output. A spatiotemporal pooling is applied to the gradients to generate a weight vector for the features. The weighted average of these ST features produces a heatmap highlighting the most important graph nodes (e.g., activated body joints in an action classification task) along with their most influential dynamic pattern. The proposed method can perform spatiotemporal graph localization in graph data with dynamic temporal variation. Further, we use class-wise temporal pooling to generate node importance maps and find which nodes (body joints for a skeleton graph) contribute to the model's decisions for a particular task. We then extend the proposed STG-Grad-CAM by combining the class-specific gradient at each layer with that layer's representations, yielding the layer-wise L-STG-GradCAM introduced above, which can be applied to interpret all layers in an STGCN network. We use the concept of faithfulness [99], defined in the context of skeleton data in Section 5.4.4, to evaluate the explainability of STG-Grad-CAM. Additionally, we compute contrastivity [93] among classes based on the outcome of STG-Grad-CAM to assess the model's performance. To evaluate STG-Grad-CAM, we first create a joint temporal importance map for this specific task that explains which body joints contribute most to identifying a specific action. We further find the dominant structure of the spatial graph for a particular action. We present an interpretability study of this skeleton-based STGCN model in different experimental settings: 1) changing the train-test protocols to incorporate cross-person (persons in the test set are not the same as in the training set) and cross-camera view (similarly, camera angles are different in the training and testing sets) [100], and 2) varying the model depth. Our proposed data-driven label smoothness, with supporting evidence from L-STG-GradCAM, leads to the following contributions:

• We show that initial layers learn low-level features corresponding to general human motion, while specific actions are recognized only in the later layers.
• We show that, in a transfer task, the choice of layers to leave unchanged and layers to be fine-tuned can be informed by the changes in label smoothness for the target task on a network trained for the source task.

• We experimentally show that the label smoothness of an STGCN model over the layers is not affected significantly when Gaussian noise is added to the inputs. This justifies the model's observed robustness to noise.

• We develop layer-wise STG-GradCAM, a visualization tool applicable at all layers of the STGCN network, and provide insights into the working of an STGCN model by characterizing the input-output mapping induced by each successive layer of the model.

• We present a quantitative metric characterizing the 'faithfulness' of STG-Grad-CAM and explaining the model's 'contrastivity' among classes based on STG-Grad-CAM.

Figure 5.1: Proposed data-driven approach to understanding the geometry of the embedding manifold in STGCNs using windowed dynamic time warping (DTW) and Non-Negative Kernel (NNK) graphs. Left: We construct Dataset NNK Graphs (DS-Graph) where each node corresponds to an action sequence, and the weights of the edges connecting two nodes are derived from pairwise distances between the features representing the corresponding action sequences. In this example, we show how two classes (corresponding to red and blue nodes on the DS-Graph) become more clearly separated in deeper layers of the network. We also observe the skeleton graph (S-Graph) node importance for each action using a layerwise STG-GradCAM (the three-time-slice example corresponds to a Throw action). Right: For a set of spatiotemporal input action sequences, we observe the label smoothness on the DS-Graph constructed using the features obtained for the sequences after each STGCN layer. The observed label smoothness at each layer of the STGCN network is averaged over three super-classes corresponding to actions involving the upper body, lower body, and full body. In this plot, lower variation corresponds to greater smoothness. We note that the label smoothness increases in the deeper layers, in which the different actions can be classified (see the DS-Graphs at the bottom of the left plot).

5.2 Preliminaries

5.2.1 Skeleton graph and polynomial graph filters

A skeleton graph (S-Graph) is a fixed undirected graph G = {V, E, Â} composed of a vertex set V of cardinality |V| = N, an edge set E connecting vertices, and a weighted adjacency matrix Â (see also Section 4.2). Â is a real symmetric N × N matrix, where a_{i,j} ≥ 0 is the weight assigned to the edge connecting nodes i and j. An STGCN layer (Section 4.2) is a function of this adjacency matrix Â and the identity matrix I representing self-loops. Specifically, STGCN uses the normalized adjacency matrix A = D^{-1/2} (Â + I) D^{-1/2}, where D_ii = Σ_j (a_{i,j} + I_{ij}). Intuitively, the elementary graph filter A combines graph signals from adjacent nodes. Figure 5.2 shows the energy of human motion analyzed in the elementary frequencies of A, where we can see that for typical human motion the energy is concentrated in the higher frequencies (i.e., the larger eigenvalues of A, λ_17, ..., λ_25).
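The analysis in Figure 5.2 amounts to projecting each coordinate of the joint trajectories onto the eigenvectors of the normalized adjacency matrix. A minimal NumPy sketch of this computation, processing one coordinate (e.g., X) at a time, is given below; the function and variable names are illustrative.

```python
import numpy as np

def graph_frequency_energy(A_norm, X):
    """Energy of a joint-coordinate signal in the graph frequencies of A_norm.
    A_norm: (N, N) normalized adjacency of the S-Graph.
    X: (N, T) one coordinate (e.g., x) of the N joints over T frames.
    Returns the eigenvalues of A_norm and the per-frequency energy averaged over frames."""
    eigvals, eigvecs = np.linalg.eigh(A_norm)   # A_norm is symmetric
    coeffs = eigvecs.T @ X                      # graph Fourier coefficients per frame
    energy = (coeffs ** 2).mean(axis=1)         # average energy at each graph frequency
    return eigvals, energy
```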
We use filters that are polynomials of the adjacency matrix, which, in terms of neural networks, is equivalent to applying multiple layers of graph filters. Thus, an l-degree polynomial captures the data in an l-hop neighborhood. In the human activity recognition task, the skeleton graph has 25 nodes and 24 edges [100]. For this tree-structured graph, the maximum distance between two leaf nodes is 10. Thus, a 10-degree polynomial can capture information about the entire graph. This justifies using 10 layers of STGCN units in the STGCN network under consideration.

5.2.2 Non-Negative Kernel Regression (NNK) neighborhoods

Traditional methods for defining a neighborhood, such as K-nearest neighbors (KNN) and ϵ-neighborhoods, depend solely on the distance to the query point and do not consider the relative positions of the neighboring points. Further, these methods require selecting parameters, such as k or ϵ. For these reasons, we use non-negative kernel regression (NNK) [102] to define neighborhoods and graphs for our manifold analysis. Unlike KNN, which is a thresholding approximation, NNK can be viewed as a form of basis pursuit [108] and results in better neighborhood construction with improved and robust local estimation performance in various machine learning tasks [103, 104]. The key advantage of NNK is its geometric interpretation of each constructed neighborhood. While in KNN points x_j and x_k are included in the neighborhood of a data-point x_i solely based on their metric to x_i, i.e., s(x_i, x_j) and s(x_i, x_k), in NNK this decision is made by also taking into account the metric s(x_j, x_k). Consequently, x_j and x_k are included simultaneously in the NNK neighborhood only if they are not geometrically redundant. The obtained NNK neighborhoods can be described as a convex polytope approximation around the data-point, determined by the local geometry of the data. This is particularly important for data that lie on a lower-dimensional manifold in a high-dimensional vector space, a common scenario for feature embeddings in deep neural networks (DNNs). NNK uses KNN as an initial step, with only a modest additional runtime requirement [102]. The computation can be accelerated using tools [51] developed for KNN when dealing with large datasets. NNK requires kernels that take values in the interval [0, 1]. In this work, we use cosine similarity with the windowed aggregation as in (5.3). The kernel is applied to representations obtained after a ReLU and hence satisfies the NNK requirement, since all x_i have non-negative entries.

Figure 5.2: Energy graph spectrum of human actions (NTU-RGB 120 [69]) with respect to the normalized adjacency matrix A of the S-Graph, shown separately for the X, Y, and Z coordinate signals as a function of graph frequency. We use the graph spectrum of the normalized adjacency matrix, as in STGCN, for easy understanding. However, a similar observation can be made using the normalized graph Laplacian [85], as the eigenvectors are the same in both cases.

5.2.3 Grad-CAM

Gradient-weighted Class Activation Mapping (Grad-CAM) [99] is a technique to visualize why a convolutional neural network-based model has reached a certain decision. Convolutional layers naturally retain spatial information, which is lost in fully-connected layers; the last convolutional layers therefore contain high-level semantics and detailed spatial information about the image. The neurons in these layers are responsible for semantic, class-specific information in the image.
The gradient information flowing into the last convolutional layer of the CNN is used in Grad-CAM to compute the importance of each neuron for a particular decision of interest. Let the last layer produce N feature maps, where the n-th feature map is F^n ∈ R^{u×v} with elements indexed by i, j. Thus, F^n_{i,j} refers to the activation at location (i, j) of the feature map F^n. The class-discriminative localization map H^c_{Grad-CAM} for any class c is computed by performing a weighted summation of the forward activation maps F^n followed by a ReLU, as in (5.1). To calculate the weights α^c_n, we first obtain the gradient of the score y^c for class c (prior to the softmax operation) with respect to the feature map activations F^n of a convolutional layer, i.e., ∂y^c/∂F^n. The neuron weights α^c_n are computed by global average pooling of these gradients. Finally, Grad-CAM is computed as

H^c_{Grad-CAM} = ReLU( Σ_n α^c_n F^n )    (5.1)

5.2.4 Dynamic Time Warping

Dynamic time warping (DTW) [78] is a well-known technique to find an optimal alignment between two given (time-dependent) sequences. DTW allows many-to-one comparisons to create the best possible alignment, accommodating temporal distortions, unlike the Euclidean distance, which only allows one-to-one point comparisons. This is why DTW efficiently computes the similarity between two variable-length arrays or time sequences. Intuitively, the sequences are warped nonlinearly to match each other. In the action recognition task, we have action sequences of variable temporal length. Consequently, the extracted features preserve the time localization, as a property of STGCN. Thus, this work uses DTW to compute the similarity between two temporal features extracted using STGCN. DTW is computed as:

dtw(i, j) = dist(a_i, b_j) + min( dtw(i−1, j), dtw(i, j−1), dtw(i−1, j−1) )    (5.2)

where dtw(i, j) is the minimum warp distance between two time series of lengths i and j. Any element dtw(i, j) of the accumulated matrix indicates the dynamic time warping distance between the series A_{1:i} and B_{1:j}.
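A minimal sketch of the DTW recurrence (5.2), together with the windowed aggregation used later in Section 5.3.1, is shown below; the uniform window split and the Euclidean frame distance are illustrative assumptions rather than the exact choices used in our experiments.

```python
import numpy as np

def dtw_distance(A, B, dist=lambda a, b: np.linalg.norm(a - b)):
    """Dynamic time warping (Eq. 5.2) between sequences A (length I) and B (length J),
    where A[i] and B[j] are per-frame feature vectors."""
    I, J = len(A), len(B)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            D[i, j] = dist(A[i - 1], B[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[I, J]

def windowed_dtw(A, B, weights):
    """Windowed DTW (w-DTW, Eq. 5.3): split both sequences into m temporal windows,
    apply DTW per window, and combine with weights that sum to one."""
    m = len(weights)
    A_windows = np.array_split(np.asarray(A), m)
    B_windows = np.array_split(np.asarray(B), m)
    return sum(w * dtw_distance(a, b) for w, a, b in zip(weights, A_windows, B_windows))
```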
5.3 Proposed method

5.3.1 Neighborhood analysis using dynamic DTW

Once we have a fully trained STGCN network, we construct an NNK DS-Graph using the representation generated by each layer of the STGCN model and refer to this graph as the NNK Dataset Graph G_D. Note that each node of the NNK DS-Graph corresponds to a data-point, i.e., an action sequence represented by its features (learned by the STGCN). This differs from the S-Graph used in the STGCN model, which provides the original representation of an action sequence from which the features are extracted. After NNK DS-Graph construction, we observe the smoothness of the class labels with respect to the graph, as shown in Figure 5.1. Graph smoothness, or label smoothness, on a graph represents the variation of the labels of the neighboring nodes for each node in the DS-Graph. A DS-Graph has higher label smoothness when there is less variation in the labels of neighboring nodes. Our work uses label smoothness as a metric for assessing the representations of different layers within a network. The main challenge with spatiotemporal action data is that each individual activity corresponds to a data sequence with a different temporal length. To address this issue, we develop a DTW-based distance metric to find the similarity between the representations (Section 5.2.4). Computation of this window-based DTW distance metric (w-DTW) involves the following steps.

• Consider two sequences s_i and s_j divided temporally into m windows. The dimensions of s_i and s_j are N × T_i and N × T_j respectively. Here, N denotes the number of spatial joints and T_i denotes the temporal length of the i-th sequence.

• Let s_i^w denote the w-th window of the sequence; then the distance between two sequences is computed as

wDTW(s_i, s_j) = Σ_{w=1}^{m} α_w DTW(s_i^w, s_j^w)    (5.3)

where α_w is the weight of the w-th window and Σ_{w=1}^{m} α_w = 1.

• The weights are chosen so that they decrease along the temporal axis, based on the length statistics of all the sequences in the dataset, i.e., the number of samples that have non-zero padding in a particular temporal window.

While STGCNs involve complex mappings, the transformations they induce and the corresponding structure of each representation space can be studied using a graph constructed on the embedded features. Consider an STGCN model and a spatiotemporal dataset. At each layer, all sequences in the dataset can be represented using the NNK Dataset Graph. In this graph, each node corresponds to a sequence, and the action labels are treated as 'signals' or attributes associated with these nodes, as illustrated in Figure 5.1. At the output of each layer, each input sequence is mapped to new values (in some other feature space). Thus, we can associate a new NNK DS-Graph with the same set of data-points (with the same signal, i.e., label). Instead of directly working with the high-dimensional features or the model's overparameterized space, the focus is on the relative positions of the feature embeddings obtained in the STGCN layers. This allows us to characterize the geometry of the manifold spaces encoded by an STGCN and to develop a quantitative understanding of the model. We now present a theoretical result (Theorem 1) relating the respective label smoothnesses of the input and output features of a single layer in a neural network to the layer's complexity measured by the ℓ2-norm [41, 84]. The proof of the theorem is provided below.

Definition 3. Label smoothness: The Laplacian quadratic y^⊤ L_out y is referred to as label smoothness. As the value of y^⊤ L_out y increases, label smoothness decreases, and vice versa. This implies that an increase in the similarity of the labels of connected nodes increases the label smoothness and vice versa.

Theorem 1. Consider the features corresponding to the input and output of a layer in a neural network, denoted by x_out = ϕ(W x_in), where ϕ(x) is a slope-restricted nonlinearity applied along each dimension of x. Suppose that the smoothness of the labels y in the feature space is proportional to the smoothness of the data x. Then,

y^⊤ L_out y ≤ c ||W||_2^2 y^⊤ L_in y    (5.4)

where L corresponds to the graph Laplacian obtained using NNK in the corresponding feature space. Note that c > 0 depends only on constants related to data smoothness and the slope of the nonlinearity.

Proof. Let N_in^(i) and N_out^(i) be the sets of NNK neighbors of data-point x^(i) in the input and output feature spaces, respectively.

y^⊤ L_out y = Σ_i Σ_{j ∈ N_out^(i)} θ_ij |y^(i) − y^(j)|^2    (5.5)

Now, by assumption, the smoothness of the labels is proportional to the data smoothness. Thus,

y^⊤ L_out y = Σ_i Σ_{j ∈ N_out^(i)} c_out θ_ij ||x_out^(i) − x_out^(j)||^2    (5.6)

where c_out > 0 is the proportionality constant.
N_out^(i) corresponds to the optimal NNK neighbors, and therefore any other neighbor set will produce a larger value of the quadratic form (i.e., lower label smoothness), so

y^⊤ L_out y ≤ c_out Σ_i Σ_{j ∈ N_in^(i)} θ_ij ||x_out^(i) − x_out^(j)||^2
           = c_out Σ_i Σ_{j ∈ N_in^(i)} θ_ij ||ϕ(W x_in^(i)) − ϕ(W x_in^(j))||^2
           ≤ c_out Σ_i Σ_{j ∈ N_in^(i)} β θ_ij ||W x_in^(i) − W x_in^(j)||^2

where β > 0 corresponds to the upper bound on the slope of the nonlinear activation function. Using the label smoothness expression computed with the input features to the layer, and gathering all the positive constants as c = β c_out / c_in, we obtain

y^⊤ L_out y ≤ c ||W||_2^2 y^⊤ L_in y.    (5.7)

Remark 1. Theorem 1 states that the change in label smoothness between the input and output spaces of a network layer is indicative of the complexity of the mapping induced by that layer, i.e., a large change in label smoothness corresponds to a larger transformation of the feature space.

Remark 2. Theorem 1 does not make any assumption about the model architecture; it only assumes a relationship between the respective smoothness of the data and the labels. The slope restriction on the nonlinearity is satisfied by activation functions commonly used in practice. For example, the ReLU function is slope-restricted between 0 and 1 [32, 33].

The idea of characterizing intermediate representations using graphs was previously studied in [40, 63]. However, these works were limited to images and did not study spatiotemporal data. To the best of our knowledge, our work presents the first methodology applicable to structured input sequences for the analysis and understanding of STGCN networks. Our method uses NNK for analysis similarly to [12]. However, unlike other approaches, our work focuses on the geometry of the feature manifold induced by each STGCN layer, using the constructed NNK graphs.

5.3.2 STG-Grad-CAM

Like regular convolutional layers, spatiotemporal graph convolution layers retain spatially and temporally localized information, which is lost in fully-connected layers (FCL). Thus, the last convolutional layer before the FCL contains high-level spatiotemporal information. The neurons in these layers are responsible for the class-specific spatiotemporal importance of the skeleton joints in the activity sequence data. The gradient information flowing into the last spatiotemporal graph convolutional layer of the STGCN is used in Grad-CAM to compute the importance of each neuron for a particular decision of interest. Let the k-th graph convolutional feature map at layer l be defined as:

F_k^l(X, A) = σ( Ã F^{l−1}(X, A) W_k^l ).    (5.8)

Here, the k-th feature at the l-th layer is denoted by F^l_{k,n,t} for node n and time t, and Ã = ( D^{−1/2} (A + I) D^{−1/2} ) ⊙ Q. Then, the global average pooling (GAP) feature after the last STG-convolutional layer (L) is computed as

e_k = (1/(NT)) Σ_{n=1}^{N} Σ_{t=1}^{T} F^L_{k,n,t}    (5.9)

The class score is calculated as y_c = Σ_k w_k^c e_k. The class activation map (CAM) can be calculated as

H^c_{STG-CAM} = ||ReLU( Σ_k w_k^c F_k^L )||.

Using this notation, we extend the Grad-CAM method to STGCNs and name it spatiotemporal graph Grad-CAM (STG-Grad-CAM). Gradient-based heatmaps over nodes n and time frames t are:

H^c_{Gradient} = ||ReLU( ∂y_c / ∂X_{n,t} )||.    (5.10)

STG-Grad-CAM's class-specific weights for class c at layer l and feature k are calculated as

α_k^{c,l} = (1/(NT)) Σ_{n=1}^{N} Σ_{t=1}^{T} ∂y_c / ∂F^l_{k,n,t}.    (5.11)

The final heatmap calculated from layer l is

H^{c,l}_{ST} = ReLU( Σ_k α_k^{c,l} F_k^l ).    (5.12)
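A minimal sketch of how the class-specific weights (5.11) and the layer-wise heatmap (5.12) can be computed is shown below, assuming the feature maps and the class-score gradients have already been obtained from the framework's automatic differentiation; the array layout and names are illustrative.

```python
import numpy as np

def stg_grad_cam(feature_maps, gradients):
    """Layer-wise STG-Grad-CAM heatmap (Eqs. 5.11-5.12) for one sequence.
    feature_maps: (K, N, T) activations F^l_k of one layer.
    gradients:    (K, N, T) gradients d y_c / d F^l_{k,n,t} for the query class c.
    Returns an (N, T) node-time importance map."""
    alphas = gradients.mean(axis=(1, 2))                     # Eq. (5.11): spatiotemporal GAP
    heatmap = np.einsum('k,knt->nt', alphas, feature_maps)   # weighted sum of feature maps
    return np.maximum(heatmap, 0.0)                          # ReLU, Eq. (5.12)

def joint_importance(heatmap):
    """Per-joint importance for one correctly classified sequence: average over time and
    normalize to (0, 1). The class-level map of Eq. (5.13) averages these over the M_C
    correctly classified sequences of a class."""
    psi = heatmap.mean(axis=1)
    return (psi - psi.min()) / (psi.max() - psi.min() + 1e-12)
```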
STG-Grad-CAM enables us to generate class-specific heatmaps for different layers of the network. Using these heatmaps, we can understand the importance of the different body joints over time for a specific task; in other words, which joint is responsible for selecting a specific label for a specific action. To create the body-joint importance map ψ for each action class, we compute the STG-Grad-CAM for each class and average over all the correctly classified data-points. Let M_C be the number of correctly classified data-points for an action class C. The joint importance map is then defined by

ψ_n^C = (1/M_C) Σ_{m=1}^{M_C} (1/T_m^C) Σ_{t=1}^{T_m^C} H^{C,L,n,t}_{ST}    (5.13)

Here, T_m^C denotes the number of frames in the m-th correctly classified data-point of class C. Finally, we normalize the joint importance map ψ^C over all joints within an action to the range (0, 1).

5.4 Experimental Results

5.4.1 Network Architecture

First, the input skeletons are fed to a batch normalization layer to keep the scale of the input data consistent across different joints. The STGCN model used for human action recognition in [115] comprises 10 STGCN layers implemented as in Section 4.2. The first four layers have 64 output channels, the next three layers have 128 output channels, and the last three layers have 256 output channels. Afterward, a global pooling layer with a softmax is used as a classifier. The model is trained with a cross-entropy loss using batch stochastic gradient descent for 100 epochs.

5.4.2 Dataset

NTU-RGB60: In this experiment, we use the NTU-RGB+D [100] dataset, which is one of the largest datasets with 3D joint locations for the human action recognition task. It has 56,000 data-points covering 60 actions. 40 volunteers performed these actions while video was captured by three cameras placed at different angles. Each data-point contains the temporal sequence of the 3-D locations of a participant's 25 joints. Following the recommended train-test protocol, we evaluate our system in 1) cross-subject (xsub) and 2) cross-view (xview) settings. In the xsub setting, the training and test sets do not contain data from the same participants. In the xview setting, the training set contains camera views 2 and 3 (front and side), and the test set is all from camera view 1 (left and right at a 45° angle). We divided the actions in NTU-RGB60 into three super-classes corresponding to actions involving the upper body, lower body, and full body. Table 5.1 lists the actions in each super-class.

NTU-RGB120: NTU-RGB120 [69] extends NTU-RGB60 with an additional 57,367 skeleton sequences over 60 extra action classes, from 106 distinct subjects.
We divided the actions in NTU-RGB61-120 into three super-classes corresponding to actions involving the upper body, lower body, and full body. Table 5.2 lists the actions in each super-class.

Category: Actions
Upper body: 'drink water', 'eat meal or snack', 'brushing teeth', 'brushing hair', 'drop', 'pickup', 'throw', 'clapping', 'reading', 'writing', 'tear up paper', 'wear jacket', 'take off jacket', 'wear a shoe', 'take off a shoe', 'wear on glasses', 'take off glasses', 'put on a hat or cap', 'take off a hat or cap', 'cheer up', 'hand waving', 'reach into pocket', 'make a phone call answer phone', 'playing with phone tablet', 'typing on a keyboard', 'pointing to something with finger', 'taking a selfie', 'check time (from watch)', 'rub two hands together', 'nod head or bow', 'shake head', 'wipe face', 'salute', 'put the palms together', 'cross hands in front (say stop)', 'sneeze or cough', 'touch head (headache)', 'touch chest (stomachache or heart pain)', 'touch back (backache)', 'touch neck (neckache)', 'nausea or vomiting condition', 'use a fan (with hand or paper) or feeling warm', 'punching or slapping other person', 'pushing other person', 'pat on back of other person', 'point finger at the other person', 'hugging other person', 'giving something to other person', 'touch other persons pocket', 'handshaking'
Lower body: 'kicking something', 'hopping (one foot jumping)', 'jump up', 'kicking other person', 'walking towards each other', 'walking apart from each other'
Full body: 'sitting down', 'standing up (from sitting position)', 'staggering', 'falling'
Table 5.1: NTU-RGB60 super-classes: upper body, lower body, and full body action categories

5.4.3 Label smoothness computed from the features

To provide insights into the representations of the STGCN network's intermediate layers, we use our geometric analysis of the representations based on the DTW-based NNK method described in Section 5.3.1. Figure 5.1 (right) shows the label smoothness over the layers of STGCN for the sets of upper body, lower body, and full body actions (see Table 5.1). We see a sudden fall in the Laplacian quadratic after layer 8, while the slope is small before that layer. This implies that the early layers have features that are mostly not class-specific.
In contrast, the smoothness improves (corresponding to a decrease in the value of y⊤L_out y) using the representations after layer 8, consistently across all input actions. Following Theorem 1, we can state that the large change in the label smoothness from layer 8 to layer 9 corresponds to a larger transformation in the input-output mapping of this layer (i.e., the functional norm of the layer is large). Our visualization using L-STG-GradCAM (Section 5.4.4) validates this analysis visually.

Table 5.2: NTU-RGB120 super-classes: upper body, lower body, and full body action categories.
Upper body: 'put on headphone', 'take off headphone', 'shoot at the basket', 'bounce ball', 'tennis bat swing', 'juggling table tennis balls', 'hush (quite)', 'flick hair', 'thumb up', 'thumb down', 'make ok sign', 'make victory sign', 'staple book', 'counting money', 'cutting nails', 'cutting paper (using scissors)', 'snapping fingers', 'open bottle', 'sniff (smell)', 'toss a coin', 'fold paper', 'ball up paper', 'play magic cube', 'apply cream on face', 'apply cream on hand back', 'put on bag', 'take off bag', 'put something into a bag', 'take something out of a bag', 'open a box', 'shake fist', 'throw up cap or hat', 'hands up (both hands)', 'cross arms', 'arm circles', 'arm swings', 'yawn', 'blow nose', 'hit other person with something', 'wield knife towards other person', 'grab other person's stuff', 'shoot at other person with a gun', 'high-five', 'cheers and drink', 'take a photo of other person', 'whisper in other person's ear', 'exchange things with other person', 'support somebody with hand', 'finger-guessing game (playing rock-paper-scissors)'
Lower body: 'butt kicks (kick backward)', 'cross toe touch', 'side kick', 'step on foot'
Full body: 'squat down', 'move heavy objects', 'running on the spot', 'stretch oneself', 'knock over other person (hit with body)', 'carry something with other person', 'follow other person'

Figure 5.3 presents action-wise label smoothness over the layers of STGCN. This figure helps us better understand which actions are learned over the layers of STGCN. For example, in Figure 5.3 (left), the actions reading and writing are poorly learned. The value of y⊤L_out y in the label smoothness plot is not monotonically decreasing for all the actions. For example, for the action drop, the label smoothness decreases in the middle and increases again at the end. A possible reason behind this pattern is that the network tries to accommodate other actions and again learns all the actions fairly well before the last layer.

Comparison between NNK and k-NN: Figure 5.4 shows the effect on label smoothness of different graph construction methods, namely NNK and k-NN.
Higher label smoothness (a small value of y⊤L_out y) represents a better construction of the graph, reducing the prediction error at each layer. NNK clearly performs better than k-NN in choosing the right neighbors and their corresponding weights.

Figure 5.3: Smoothness of labels on the manifold induced by the STGCN layer mappings in a trained model. As the label smoothness increases, the Laplacian quadratic form (y⊤Ly) decreases. Intuitively, a lower value of y⊤Ly corresponds to the features belonging to a particular class having neighbors from the same class. We divide the actions in NTU-RGB60 into three super-classes (upper body (left), lower body (middle), full body (right)) and present the smoothness with respect to each action in the grouping. We emphasize that, though the smoothness is displayed per class, the NNK graph is constructed using the features corresponding to all input action data-points. We observe that the model follows a similar trend where the smoothness of labels is flat in the initial layers (indicative of no class-specific learning) and increases in value in the later layers (corresponding to discriminative learning). There exist outliers to this trend (e.g., drop and brushing in the upper body group) where the smoothness decreases in intermediate layers. This implies that the representations for these actions are affected by features from other actions to accommodate for learning other classes.

Figure 5.4: Label smoothness of STGCN for different graph construction methods: (blue) NNK, (red) k-NN. As the label smoothness increases, the Laplacian quadratic form (y⊤Ly) decreases.
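For reference, the quantity plotted in Figures 5.3 and 5.4 can be computed with a few lines of NumPy. The sketch below is only illustrative: it builds a plain k-NN affinity graph in place of the DTW-based NNK construction used in this chapter, and the normalization of the quadratic form is one plausible choice, so the absolute values need not match the figures.

import numpy as np

def knn_graph(features, k=10):
    # Symmetric k-NN affinity matrix with Gaussian weights; a stand-in for
    # the DTW-based NNK graph used in this chapter.
    n = features.shape[0]
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    sigma2 = np.median(d2) + 1e-12
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]        # skip the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / sigma2)
    return np.maximum(W, W.T)                    # symmetrize

def label_smoothness(W, labels, num_classes):
    # Laplacian quadratic form y^T L y of the one-hot label signals,
    # normalized by the degree-weighted label energy; lower values mean
    # same-class points are neighbors on the graph.
    d = W.sum(axis=1)
    L = np.diag(d) - W
    Y = np.eye(num_classes)[labels]
    return np.trace(Y.T @ L @ Y) / np.trace(Y.T @ np.diag(d) @ Y)

# Toy usage with placeholder features (e.g., pooled layer activations).
rng = np.random.default_rng(0)
feats, labels = rng.normal(size=(200, 64)), rng.integers(0, 4, size=200)
print(label_smoothness(knn_graph(feats), labels, num_classes=4))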
5.4.4 Visualization through STG-Grad-CAM

To evaluate the performance of STG-Grad-CAM, we consider a human action classification task using skeleton data. The 10-layer STGCN model [116] is used for classification, which achieves 82.1% accuracy in the xsub setting and 88.5% accuracy in the xview setting. To understand the STG-convolutional model, we apply STG-Grad-CAM and generate a class-specific heatmap using (Equation 5.12) for all the data-points. Fig. 5.5 shows the joint importance of three actions at different points in time. A bigger node in the skeleton graph denotes a higher importance of that body joint at that point in time. For the first action, throw, in which mostly the hands are involved, we see more activation in the hand joints, specifically in the right hand. The action kicking other person mostly involves lower body parts, and in the figure we indeed see higher importance in the leg and back regions. Therefore, we can see the dominant substructure of the graph for this action. Note that the importance changes over time and is highest in the middle of the action. The action sitting down usually requires all the joints, with more involvement of the center of the human body, which is clearly seen in Figure 5.5. Interestingly, at the end of the task, not all the joints are highly activated.

Figure 5.5: Actions at different time points with their spatiotemporal joint importance: a bigger node size represents higher importance.

To understand the importance of human body joints for a particular task, we compute the joint importance values using (Equation 5.13) and take an average over all data-points. Figure 5.6 shows the joint importance map for all the actions. Interestingly, actions starting from #49 involve the lower body more than the upper body, which is clearly seen in the figure. This indicates the influential joints behind STGCN's prediction for each class. This result is generated in the xsub setting. In the xview setting, a given action for one participant is present in both the training and test sets, captured at different camera angles. Naturally, STGCN achieves better accuracy in the xview setting. Figure 5.7 shows the joint importance map for all actions in the xview setting. We can see that there is significant importance in the lower body region even when the action predominantly involves the upper body or hands, for example, nausea, wear jacket, and writing.

Figure 5.6: Joint importance map for all actions. In the y-tick labels, 'L': left and 'R': right. The action names on the x-axis follow the numbered key below. This figure is generated in the xsub setting.
Action key: 1. drink water. 2. eat meal. 3. brushing teeth. 4. brushing hair. 5. throw. 6. clapping. 7. reading. 8. writing. 9. tear up paper. 10. wear jacket. 11. take off jacket. 12. wear on glasses. 13. take off glasses. 14. put on a hat. 15. take off a hat. 16. cheer up. 17. hand waving. 18. reach into pocket. 19. make phone call. 20. playing with phone. 21. typing on a keyboard. 22. pointing to something with finger. 23. taking a selfie. 24. check time. 25. rub two hands together. 26. nod head. 27. shake head. 28. wipe face. 29. salute. 30. put the palms together. 31. cross hands in front. 32. sneeze. 33. touch head. 34. touch chest. 35. touch back. 36. touch neck. 37. nausea. 38. use a fan. 39. punching other person. 40. pushing other person. 41. pat on back of other person. 42. point finger at the other person. 43. hugging other person. 44. giving something to other person. 45. touch other's pocket. 46. handshaking. 47. drop. 48. pickup. 49. sitting down. 50. standing up. 51. wear shoe. 52. take off shoe. 53. kicking something. 54. hopping. 55. jump up. 56. staggering. 57. falling. 58. kicking other person. 59. walking towards other. 60. walking apart from other.
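The maps in Figures 5.6 and 5.7 follow directly from (5.13). As an illustration, a minimal NumPy sketch is given below; it assumes that the per-frame STG-Grad-CAM heatmaps of the correctly classified samples are already available, and all array names and shapes are placeholders.

import numpy as np

def joint_importance_map(heatmaps_per_class):
    # Implements (5.13): for each class C, average each sample's heatmap over
    # its T_m frames, average over the M_C correctly classified samples, and
    # finally min-max normalize the joint scores to lie in (0, 1).
    # heatmaps_per_class: dict class_id -> list of arrays of shape (T_m, num_joints)
    classes = sorted(heatmaps_per_class)
    psi = []
    for c in classes:
        per_sample = [h.mean(axis=0) for h in heatmaps_per_class[c]]
        psi_c = np.mean(per_sample, axis=0)
        psi_c = (psi_c - psi_c.min()) / (psi_c.max() - psi_c.min() + 1e-12)
        psi.append(psi_c)
    return np.stack(psi)            # shape: (num_classes, num_joints)

# Toy usage: 3 classes, 25 joints, variable-length random heatmaps.
rng = np.random.default_rng(1)
hm = {c: [rng.random((rng.integers(40, 80), 25)) for _ in range(5)] for c in range(3)}
print(joint_importance_map(hm).shape)    # (3, 25)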
Figure 5.7: Joint importance map for all actions in the xview setting. The action and joint order on the x- and y-axes follow the order shown in Figure 5.6.

L-STG-GradCAM visualization: Figure 5.8 shows the layerwise variation of joint importance for the action kick at three time slices. The node size in the skeleton graph denotes the degree of importance of the body joint at that time point for the final prediction. For the action kick, which mostly involves lower body parts, we notice in the figure that the initial layers (up to layer 8) have very weak, if any, GradCAM localization corresponding to the action. In contrast, the last three STGCN layers show an explicit node-importance heatmap in which the leg and back joints are relatively more active, indicative of the action. We present additional examples in Figure 5.9 and Figure 5.10, corresponding to an upper-body action (throw) and a full-body action (sitting down). In both cases, we find a similar trend where the STGCN graph filters learned in the initial layers capture general human motion, focusing on all the nodes in the skeleton graph, and class-specific node importance appears only in a few final layers of the network.

Figure 5.8: L-STG-GradCAM visualization of spatiotemporal node importance for the action class kick of the trained STGCN network used in our experiments. The size of the blue bubble denotes the relative importance of the node in a layer for prediction by the final softmax classifier and is scaled to have values in [0, 1] at each layer. The node importance values are normalized across layers to allow a clear comparison among the layers. We observe that the localization of the action, as observed using the L-STG-GradCAM, is evident only in later layers, while initial layers have no class-specific influence. The visualizations allow for transparency in an otherwise black-box model to explain any class prediction. Our approach is applicable to any STGCN model and is not affected by the model size, optimization strategy, or dataset used for training.

Figure 5.9: L-STG-GradCAM visualization of spatiotemporal node importance for the action class throw of the trained STGCN network used in our experiments. The size of the blue bubble denotes the relative importance of the node in a layer for prediction by the final softmax classifier and is scaled to have values in [0, 1] at each layer. We observe that the localization of the action, as observed using the L-STG-GradCAM, is evident only in later layers, while initial layers have no class-specific influence. The visualizations allow for transparency in an otherwise black-box model to explain any class prediction. Our approach is applicable to any STGCN model and is not affected by the model size, optimization strategy, or dataset used for training.

Figure 5.10: L-STG-GradCAM visualization of spatiotemporal node importance for the action class sitting down of the trained STGCN network used in our experiments. The size of the blue bubble denotes the relative importance of the node in a layer for prediction by the final softmax classifier and is scaled to have values in [0, 1] at each layer.
We observe that the localization of the action, as observed using the L-STG-GradCAM, is evident only in later layers, while initial layers have no class-specific influence. The visualizations allow for transparency in an otherwise black-box model to explain any class prediction. Our approach is applicable to any STGCN model and is not affected by the model size, optimization strategy, or dataset used for training.

Faithfulness: To characterize the faithfulness of the proposed STG-Grad-CAM, we mask the data from the nodes highlighted by STG-Grad-CAM and then measure the change in accuracy, where the mask takes a value in the interval (0, 1). Table 5.3 shows the accuracy achieved by the model for varying amounts of masking of the nodes highlighted by STG-Grad-CAM. It is expected that with a higher amount of masking we get lower accuracy, and vice versa, which is evident in the table. Let the masking fraction be m ∈ [0, 1] and the corresponding accuracy be α_m, where α_0 denotes the accuracy without masking. Then, faithfulness is computed as

\beta_{faithful} = \sum_{i} m_i \frac{|\alpha_0 - \alpha_{m_i}|}{\alpha_0}.    (5.14)

Table 5.4 shows β_faithful for the xsub and xview settings. The values of β_faithful in the xsub and xview settings are consistent with their accuracy values.

Contrastivity: Inspired by the concept of contrastivity introduced in [93] for a binary classification problem, we define the contrastivity (β_contrast) of STG-Grad-CAM for our multiclass classification problem. Contrastivity captures the intuition that the class-specific features highlighted by an explanation method should differ between classes. Once we have the node importance map (ψ), we find the correlation [7] between the node importance vector ψ^C of a class C and those of all other classes. The contrastivity (β_contrast) is then computed by taking the average of these correlations and subtracting it from 1. A higher value of β_contrast represents better contrastivity. Table 5.4 shows β_contrast for the xsub and xview settings. Though we achieve better accuracy in the xview setting, xsub achieves better contrastivity, showing better separation among classes in feature space. It is possible that the xview model performs better in terms of accuracy because of the subject bias present in the data, and that it is not solely concentrating on the action-specific joint dynamics.

Table 5.3: Accuracy achieved by STGCN on NTU-RGB-D for different amounts of masking of the data.
Masking:               no masking   10%     50%     90%
xsub  - Accuracy (%):  81.5         80.3    64.52   20.59
xview - Accuracy (%):  88.3         88.1    64.31   14.82

Table 5.4: Faithfulness and contrastivity in the xview and xsub settings.
Experimental protocol:  xview   xsub
β_contrast:             0.51    0.54
β_faithful:             0.89    0.74

We also studied the same STGCN network with fewer STGCN layers. We removed the 3rd, 6th, and 9th layers, with corresponding numbers of output filters 64, 128, and 256; these layers were chosen randomly. This STGCN model with 7 layers (STGCN-7) achieved 70.2% accuracy on the NTU-RGB-D dataset in the xsub setting. Figure 5.11 shows the joint importance map for this network. Surprisingly, for all the actions, mostly the hand region influences the decision. Naturally, the misclassification is higher.

Figure 5.11: Joint importance map for STGCN-7. The action and joint order on the x and y axes, respectively, follow the order shown in Figure 5.6.
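Both metrics are straightforward to compute once the masked accuracies and the per-class joint importance vectors ψ are available. The sketch below is illustrative Python: the masking levels passed to the faithfulness function are those of Table 5.3, but since the exact set of masking levels used for Table 5.4 is not fully specified here, the resulting value need not match the reported one.

import numpy as np

def faithfulness(acc_no_mask, masked_acc):
    # (5.14): beta_faithful = sum_i m_i * |alpha_0 - alpha_{m_i}| / alpha_0,
    # where masked_acc maps a masking fraction m in (0, 1) to the accuracy alpha_m.
    return sum(m * abs(acc_no_mask - a) / acc_no_mask for m, a in masked_acc.items())

def contrastivity(psi):
    # 1 minus the mean Pearson correlation between the joint importance
    # vector of each class and those of all other classes.
    corr = np.corrcoef(psi)                          # psi: (num_classes, num_joints)
    off_diag = corr[~np.eye(len(psi), dtype=bool)]
    return 1.0 - off_diag.mean()

# Toy usage with the xsub accuracies of Table 5.3 and random psi vectors.
print(faithfulness(81.5, {0.1: 80.3, 0.5: 64.52, 0.9: 20.59}))
print(contrastivity(np.random.default_rng(2).random((60, 25))))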
5.4.5 Effect of noise in the data

We analyze the robustness of the STGCN network in the presence of noise in the data. In our experiments, we add noise at various peak signal-to-noise ratio (PSNR) levels to a set of actions and compare the label smoothness over the layers with that obtained for the original signal. A popular approach to incorporating noise into spatiotemporal data is to add white Gaussian noise to the measurements [53]. Figure 5.12 shows the label smoothness for three actions: drop, hop, and standing up (from sitting position). It is clear from these examples that the overall performance degrades slightly, while the smoothness of the labels through the successive layers of the network is better than for the original signal. Specifically, for the action drop, as discussed in Section 5.4.3, the label smoothness degraded in the middle of the network and recovered at the end; for the noisy signal, however, we notice a more stable non-increasing pattern, as in the other actions. The accuracy of the STGCN network on this partially noisy dataset is 80.2%. Hence, the network is fairly robust to this additive white Gaussian noise.

Figure 5.12: Impact of noise added to the input sequence on the label smoothness observed using the corresponding features obtained with the model. The Laplacian quadratic form (y⊤L_out y) decreases as the label smoothness increases. We show the impact of noise on one action per super-class grouping (upper, lower, and full body). We observe that in actions where the Laplacian quadratic form had a non-increasing trend, it remained mostly unaffected by adding noise to the input action sequence. However, actions where the label smoothness decreased before increasing were affected by noise. This implies that the features learned in the early layers for these actions are not robust, and adding noise allows us to see the modified manifold induced in these layers.
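The noise injection itself is simple; the following is a small NumPy sketch, with the peak in the PSNR definition taken as the maximum absolute coordinate value of the clean sequence (other peak conventions would only change the scaling).

import numpy as np

def add_awgn_at_psnr(x, psnr_db, rng=None):
    # Add white Gaussian noise to a skeleton sequence x (e.g., shape (T, 25, 3))
    # so that PSNR = 10 * log10(peak^2 / noise_variance) equals psnr_db.
    rng = np.random.default_rng() if rng is None else rng
    peak = np.max(np.abs(x))
    noise_var = peak ** 2 / (10.0 ** (psnr_db / 10.0))
    return x + rng.normal(scale=np.sqrt(noise_var), size=x.shape)

# Toy usage: a 2-second, 30 fps sequence of 25 joints in 3D at 30 dB PSNR.
clean = np.random.default_rng(3).normal(size=(60, 25, 3))
noisy = add_awgn_at_psnr(clean, psnr_db=30.0)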
5.4.6 Transfer performance

So far, we have seen that the first few layers focus on understanding general human motion, which sets the premise before learning the specific task. Therefore, the hypothesis is that, given a similar human activity dataset, the network should exhibit similar behavior in its layerwise representations. To explore adapting the pretrained STGCN model to a new dataset and analyze the transfer performance of this network, we use the 60 new actions (61–120) of the NTU-RGB120 [69] dataset. Note that, in the rest of this chapter, when we refer to NTU-RGB120, we mean the new actions 61–120. We first analyze the label smoothness of the NTU-RGB120 dataset on the STGCN model pretrained on the NTU-RGB60 dataset. Figure 5.13 (left) shows the label smoothness over the successive layers of the network. We divide actions 61–120 of NTU-RGB120 into three sets of lower body, upper body, and full body actions, depending on the involvement of the body joints (see Table 5.2). We notice a pattern similar to the one observed for NTU-RGB60: there is a big jump in the label smoothness after layer 8, while the slope changes slowly before that. Therefore, although the network is not trained on NTU-RGB120, it shows similar behavior, supporting our hypothesis. The overall accuracy of the network is 9%, which indicates the need for fine-tuning. Figure 5.13 (right) shows the validation accuracy of the STGCN network with respect to the training epochs (the validation loss is shown in Figure 5.14). In transfer learning, we can fine-tune the last few layers depending on the performance or on the availability of data. We consider three cases of fine-tuning with a varying number of layers: 1) training only the FCN layer, 2) training the last three STGCN layers together with the FCN layer, and 3) training the whole network. Interestingly, we see that in this case only fine-tuning the FCN layer (case 1) already provides considerably good performance. This suggests that STGCN captures a good general representation of these human motions. If we have a dataset whose actions are very different from those of the training dataset, we can fine-tune more layers depending on the availability of data.
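These three settings amount to freezing different portions of the pretrained model before training on the new classes. A PyTorch-style sketch is given below; the attribute names (st_gcn_networks, fcn) are assumptions based on common STGCN implementations and should be adapted to the actual model class, and the final classifier would typically be re-initialized for the new label set.

import torch

def prepare_for_finetuning(model, mode="fcn_only"):
    # Freeze everything, then unfreeze only the part selected by `mode`.
    # Assumed attributes: model.st_gcn_networks (list of STGCN blocks) and
    # model.fcn (final classifier); rename to match your implementation.
    for p in model.parameters():
        p.requires_grad = False
    if mode == "fcn_only":                                 # case 1
        trainable = [model.fcn]
    elif mode == "last3_and_fcn":                          # case 2
        trainable = list(model.st_gcn_networks[-3:]) + [model.fcn]
    elif mode == "full":                                   # case 3
        trainable = [model]
    else:
        raise ValueError(mode)
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch: pass only the unfrozen parameters to the optimizer.
# params = prepare_for_finetuning(pretrained_model, mode="fcn_only")
# optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)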
Figure 5.13: Left: Label smoothness of unseen action classes using a model trained with NTU-RGB60. For new actions, we consider input sequences corresponding to labels 61 to 120 in NTU-RGB120. We divide these actions into three super-classes (upper, lower, and full body) as before and present results averaged over each set. We see that the model is able to embed the features corresponding to new action sequences such that they are separable. Further, the Laplacian quadratic form (y⊤Ly) follows a similar non-increasing trend as in the case of NTU-RGB60, over a much smaller range of values. This implies that the features learned by the model can be used for the novel classes and that model transfer can be done with simple fine-tuning. Right: Classification accuracy (higher is better) on the NTU-RGB120 test dataset using a 10-layer STGCN network. Here, we show the performance achieved by the model when training from scratch, along with that obtained with transfer learning by fine-tuning a model trained on NTU-RGB60. We see that the model can transfer effectively, as our label smoothness analysis predicted.

Figure 5.14: Validation loss (lower is better) on the NTU-RGB120 test dataset using a 10-layer STGCN network. Here, we show the validation loss achieved by the model when training from scratch, along with that obtained with transfer learning by fine-tuning a model trained on NTU-RGB60 on fewer layers. We see that the model is able to transfer effectively, as predicted by our label smoothness analysis.

5.4.6.1 Results on other datasets and advanced networks

Results on ShiftGCN: The shift graph convolutional network (Shift-GCN) comprises shift graph operations and lightweight point-wise convolutions, where the shift graph operations provide flexible receptive fields for both spatial and temporal graphs.
There are two modes of combining the shift operation with point-wise convolution: Shift-Conv and Shift-Conv-Shift. To provide insights into the representations of the intermediate layers of the ShiftGCN network, we use our geometric analysis of the representations based on the DTW-based NNK method described in Section 5.3.1. Figure 5.15 shows the label smoothness over the layers of ShiftGCN, demonstrating that our proposed DTW-based NNK method for geometric analysis of the representations can be used for any network dealing with spatiotemporal data. ShiftGCN shows a gradual fall in the Laplacian quadratic form (y⊤L_out y) over the layers, unlike STGCN, where we see a flat region in the initial layers and a sudden fall after layer 8. A possible reason behind the gradual change in the label smoothness from the initial layers is the shift operation in the spatial and temporal domains. These shift operators allow ShiftGCN to learn global features starting from the initial layers, showing a gradually increasing pattern in the label smoothness.

Figure 5.15: Smoothness of labels on the manifold induced by the ShiftGCN [16] layer mappings in a trained model. Intuitively, a lower value of y⊤L_out y corresponds to the features belonging to a particular class having neighbors from the same class. This visualization is for the ShiftGCN network trained on the NTU-RGB60 dataset. We emphasize that, though the smoothness is displayed per class, the NNK graph is constructed using the features corresponding to all input action data-points. We observe that the model follows a gradually decreasing pattern in y⊤L_out y over the layers, unlike STGCN.

Results on the STGCN network trained on the Kinetics 400 dataset: The DeepMind Kinetics human action dataset [54] contains around 300,000 video clips retrieved from YouTube, with an average length of around 10 seconds. The videos cover as many as 400 human action classes, ranging from daily activities and sports scenes to complex actions with interactions. Skeleton locations of 18 joints in 2D coordinates (X, Y) are extracted on each frame using OpenPose [87]. The STGCN is trained using 240,000 skeleton data samples; no background information was used. Figure 5.16 shows the label smoothness over the layers of STGCN trained on the Kinetics dataset. We see a similar pattern, i.e., a sudden drop in the Laplacian quadratic form starting from layer 8. This again implies that the last few layers learn information particular to the specific actions present in the dataset. We notice a relatively small change in the label smoothness over the layers of STGCN. As there are many actions in the Kinetics dataset that involve different objects, for example, eating donuts, eating hotdog, and eating burger, it is very difficult to differentiate the action without the object information. This could be one of the reasons behind the poor performance of STGCN on the Kinetics dataset. However, our proposed NNK-based method for understanding embedding geometry can analyze an STGCN network trained on any dataset.
We emphasize that, though the smoothness is displayed per class, the NNK graph is constructed using the features corresponding to all input action data-points. We observe that the model follows a similar trend where y⊤L_out y is flat in the initial layers (indicative of no class-specific learning) and decreases in value in the later layers (corresponding to discriminative learning).

5.5 Conclusion

In conclusion, we present a data-driven approach for understanding STGCN models using windowed-DTW distance-based NNK graphs. We focused on a skeleton-based activity recognition task, but this is a generic method to analyze any DNN dealing with spatiotemporal data, even with varying temporal lengths. Analyzing the label smoothness of the successive layers on the NNK graphs, we show that the initial layers focus on general human motion, and features for individual action recognition are learned by the model only in the later layers. We also present a comparison between graph construction methods, showing the superiority of the NNK graph over k-NN. To validate our insights from label smoothness, we introduce the L-STG-GradCAM method to visualize the importance of different nodes at each layer for predicting an action. We then present our analysis of label smoothness and its impact on the transfer performance of a trained STGCN model to unseen action classes. We generate a joint-time activation heatmap to understand how the joints and their temporal dynamics contribute to the model's decision. Further, we create a joint activation map for each action. We analyze interpretability using STG-Grad-CAM in different experimental settings. We evaluate the explanations of STG-Grad-CAM using a quantitative metric, faithfulness, and observe higher faithfulness in the xview setting. Additionally, we show the model's contrastivity among classes based on the outcome of STG-Grad-CAM and find higher contrastivity in the xsub setting. We also study how a change in model depth affects the performance. We plan to use the node importance map in the future to improve the performance of such STGCN-based models. Finally, we present an analysis of the robustness of the features at each layer of an STGCN in the presence of input Gaussian noise. We show that the added noise does not affect the label smoothness of several action classes.

5.6 Limitations

The analysis is based on features learned in a trained network, and we foresee no issues in scaling up to larger models and datasets, but testing this remains open. We note that our approach can be applied to any spatiotemporal data and model, but the analysis in this work was limited to skeleton-based human action recognition. Although our work here makes some empirical observations, we note that our approach is amenable to theoretical study using spectral and graph signal concepts. We plan to investigate these ideas and ultimately develop an understanding that can lead to better design, training, and transfer of spatiotemporal models.

Chapter 6
Graph-based skeleton data compression

With the advancement of reliable, fast, and portable acquisition systems, human motion capture data is widely used in many industrial, medical, and surveillance applications. These systems can track multiple people simultaneously, providing full-body skeletal key points and more detailed landmarks on the face, hands, and feet. This leads to a huge amount of skeleton data to be transmitted or stored.
In this chapter, we introduce Graph-based Skeleton Compression (GSC), an efficient graph-based method for nearly lossless compression. We use a separable spatiotemporal graph transform, non-uniform quantization, coefficient scanning, and entropy coding with run-length codes for nearly lossless compression. We evaluate the compression performance of the proposed method on the large NTU-RGB activity dataset. Our method outperforms a method based on the 1D discrete cosine transform applied along the temporal direction. In near-lossless mode, our proposed compression does not affect action recognition performance. This work was published in [22].

6.1 Introduction

In the preceding chapters, we have seen that skeleton-based activity data is widely used in human activity recognition due to its easy acquisition and its ability to preserve the privacy of the scene. Moreover, human skeleton information extracted from video has gained significant importance in many other applications, including monitoring and surveillance, telehealth [61], sports guidance [15], autopilot, mobile robots [31], HCI (Human-Computer Interaction) [62], surveillance [92], monitoring workers in industry, etc. Each of these applications needs the data to be stored, and sometimes transmitted, efficiently at a low cost.

With the advancement of depth cameras, acquiring human skeleton data, including full-body skeletons and more detailed hand and facial key points, is becoming more efficient. Sensors such as Microsoft Kinect [123] and Intel RealSense [57], and software such as OpenPose [14] and DeepLabCut, are now widely used. In fact, advances in 2D skeleton tracking from video, e.g., by OpenPose [14] or DeepLabCut, mean that depth cameras are no longer always required. Motion capture or pose data is described by the position of a fixed number of key points associated with body joints in one or more temporal frames. Software-based state-of-the-art analysis methods, such as OpenPose [14], can detect more than one person in a frame and extract skeleton information over time, representing the dynamic characteristics of body postures. Storing skeleton data instead of the original video sequences is more efficient (much lower storage or bandwidth) and has added advantages in terms of processing complexity and privacy preservation [94].

The increased use of skeleton data motivates the need for efficient compression methods, as it has led to a very rapid increase in the amount of human skeleton data. On the other hand, with the popularity of network infrastructure and the development of cloud computing, the demand for skeleton data transmission is also increasing, especially in the Internet of Things era, where sensors are only responsible for data collection and transmission, and the cloud does the analysis. For example, in a scenario where detailed tracking is possible, for each detected person OpenPose can provide 25 body key-points, 6 foot key-points, 2×21 hand key-points, and 70 face key-points at a rate of 30 samples per second. Thus, a system monitoring 5 persons can generate 0.2 billion real values in an hour, creating significant transmission and storage capacity demands. Our main motivation is to build an efficient, fast, and nearly lossless compression technique for skeleton data.

Since skeleton data acquisition systems are relatively new, only a few papers have studied the compression problem. Li et al.
[66] proposed a motion-based joint selection (MJS) algorithm for 3D human skeleton data, where a motion score is computed for each joint, and only data for those joints whose motion score exceeds a threshold is transmitted. This motion score threshold controls the trade-off between bit rate and accuracy. Further, [66] uses the standard DEFLATE algorithm [28] along with MJS for additional compression. In this method, compression performance degrades when most body joints move. Moreover, this approach requires the complete action data sequence to compute which joints are moving, so there is a significant latency. He et al. [47] proposed a skeleton coding scheme consisting of i) spatial differential coding, ii) temporal differential coding, and iii) inter-prediction. The scheme switches between different types of skeletons in the video frames. None of these works analyzes the problem in the transform domain, i.e., from a frequency domain perspective. Moreover, our proposed scheme is not exactly lossless but can operate at a distortion level consistent with the maximum resolution of sensors and analysis tools. Thus, as compared to a lossless system [47], we achieve greater compression without affecting classification performance.

Graph-based methods for image [35], [34] and video [59], [81] compression have been proposed [17]. Skeleton data have spatial and temporal dimensions, similar to a video signal, but unlike in video, the spatial information corresponds to a set of irregularly located points. While spatial (skeleton) data lends itself to a graph representation, to our knowledge, graph-based transform methods have not yet been used to compress skeleton data.

Our proposed method, Graph-based Skeleton Compression (GSC), uses i) a spatial skeleton graph [53], inspired by human anatomy, and ii) an unweighted temporal path graph. Note that our graph definition is fixed and does not depend on the motion observed in the scene. We form the graph as the Cartesian product of the spatial skeleton and temporal path graphs, similar to the method proposed in Section 3.2. This leads to a separable graph Fourier transform (GFT). We propose to use the separable GFT of this vertex-time graph to transform the 3D pose data. Then, the output transform coefficients are quantized. Two quantization methods, uniform and non-uniform, are introduced. Non-uniform quantization is particularly advantageous, enabling nearly lossless compression with a higher compression ratio. This approach also helps reduce the effect of the estimation noise present in the higher temporal frequencies. We also propose two scanning techniques to map the vertex-time coefficients into a 1D sequence for encoding. Specifically, note that it is common for motion capture (MoCap) devices to work at 30 fps or above. Thus, given the characteristics of typical human motion, it is likely that most of the temporal high-frequency information in captured data can be considered as noise. Consequently, since most of the energy is concentrated in the lower frequencies, we perform a row-major [107] scan before encoding. We use Huffman coding with run-length and end-of-block codes. This chapter only shows results for body skeleton data compression, but, by selecting a different graph, this method can be used to compress other MoCap data, e.g., hand, foot, or facial key-point data.
GSC outperforms a DCT-based compression scheme, where the transform is applied along the temporal direction, in terms of reconstruction error (e.g., measured by mean square error), while producing no change in activity recognition performance (when operating in near-lossless mode). All our experiments are conducted on the NTU-RGB human activity dataset [100].

6.2 Graph-based Skeleton Compression

We start by introducing the details of the proposed Graph-based Skeleton Compression (GSC). Figure 6.1 shows the block diagram of the compression pipeline.

Figure 6.1: Block diagram of the proposed graph-based skeleton data compression (GSC) algorithm. The pipeline takes skeleton data (coordinates × time frames × joints, with the X, Y, Z coordinates treated independently), applies the transform (GFT on the skeleton graph spatially and GFT on a line graph temporally), then quantization (uniform or non-uniform, with rounding), scanning, and encoding (run-length, Huffman, and end-of-block codes), producing the output bit stream.

6.2.1 GFT-based transform coefficient extraction using the skeleton graph

The skeleton graph we construct in this work, Gs, is unweighted and has 25 nodes and 24 edges, as shown in Figure 6.2. Such graphs are undirected and composed of a vertex set V of cardinality |V| = N, an edge set E connecting vertices, and a weighted adjacency matrix A. A is a real symmetric N × N matrix, where a_{i,j} is the weight assigned to the edge (i, j) connecting vertices i and j. We assume non-negative weights, i.e., a_{i,j} ≥ 0. In this chapter, we use the symmetric normalized Laplacian, defined as in (2.2). The graph Fourier transform (GFT) is the matrix U, with columns {u_1, u_2, ..., u_N}, the eigenvectors of L, which act as the spectral basis. Λ is a diagonal matrix whose diagonal elements are the eigenvalues of L. As mentioned in Chapter 3, these eigenvalues act as spectral frequencies and are such that 0 ≤ λ_1 ≤ λ_2 ≤ ... ≤ λ_N. The eigen-pairs {(λ_k, u_k)} provide a Fourier interpretation for graph signals on Gs. The spectral basis u_k, k = 1, ..., N, forms a basis for any graph signal residing on Gs, so that any graph signal can be uniquely represented as

c_i = \sum_{k=1}^{N} \alpha_{k,i} u_k    (6.1)

where c_i is a graph signal, with i = 1, 2, 3 representing each of the coordinates in 3D space, and the n-th entry of c_i corresponds to the position in coordinate i of joint n. The transform coefficients in (6.1) are obtained as

\alpha_{k,i} = c_i^\top u_k    (6.2)

where α_{1,i}, α_{2,i}, ..., α_{N,i} are the graph coefficients in the frequency domain, and we denote α_k = (α_{k,1}, α_{k,2}, α_{k,3})^\top.

As we have seen in Chapter 3, we can easily interpret the graph Fourier basis for the hand graphs; here, we show examples of the skeleton graph Fourier basis and its interpretation. Figure 6.3 depicts the 3 lowest and 3 highest graph Fourier basis vectors for Gs, each shown as a graph signal. As can be seen, u_1 is a constant vector corresponding to DC, so that projection onto u_1 computes the average over all the nodes. From their respective sign patterns, u_2 can be seen to capture variation between the upper and lower body, while u_3 captures variation between the left hand and right hand. In contrast, note that the higher frequency basis vectors are more localized (more zeros) and show sign changes between consecutive nodes. Thus, it is likely that projections onto these higher frequency basis vectors mostly capture noise, since it is unlikely that connected joints will have very different motion characteristics in typical human motion.
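As a concrete illustration of (6.1)–(6.2), the spatial GFT can be computed with a few lines of NumPy from the symmetric normalized Laplacian; the sketch below uses a short placeholder edge list, which would be replaced by the 24 bone connections of the skeleton graph in Figure 6.2.

import numpy as np

# Hypothetical edge list: replace with the 24 bone connections of Figure 6.2.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
N = 5

def skeleton_gft_basis(edges, n):
    # Symmetric normalized Laplacian of the unweighted skeleton graph and its
    # eigenvectors (the GFT basis), ordered by increasing graph frequency.
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)        # assumes no isolated joints
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L)
    return lam, U

lam, U = skeleton_gft_basis(EDGES, N)

# One frame of joint positions: rows are joints, columns are the x, y, z coordinates.
c = np.random.default_rng(0).normal(size=(N, 3))
alpha = U.T @ c          # analysis, as in (6.2)
c_rec = U @ alpha        # synthesis, as in (6.1)
print(np.allclose(c_rec, c))    # True: the GFT basis is orthonormal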
Note that, for typical markerless MoCap data, position estimation tends to suffer from errors that increase with motion [53], [19], i.e., the estimation noise is nonlinear. For a given temporal sampling rate, increased motion will result in more energy in the higher frequencies of the temporal DCT. We will use this to develop a non-uniform quantization strategy.

Given a 3D skeleton data sequence with 3 × 25 × T samples, where T is the number of time frames, we first divide it into small temporal windows (with window size t = 1 s). Next, we apply the spatial GFT on these windows using the basis of Gs and then apply the temporal GFT, where the graph is a fixed, unweighted line graph with t nodes and (t − 1) edges (note that the GFT for this line graph is the DCT).

Figure 6.2: Human-anatomy-inspired skeleton graph Gs.

Figure 6.3: Examples of graph Fourier basis vectors for the skeleton graph (u_1, u_2, u_3 and u_23, u_24, u_25). In each case, λ is the corresponding graph frequency. The upper and lower rows correspond to the lowest and highest three frequencies, respectively. Node colors represent the signs of the elements of each basis vector: green → positive, red → negative, blue → zero.

Figure 6.4 and Figure 6.5 show the average energy for the vertex-time approach and for a time-only technique (where the DCT is used along the time axis for the information in each node) on the NTU-RGB dataset. Notice that the top row of Figure 6.5 corresponds to the lowest temporal frequency, which contains significant energy since each node is processed independently. In contrast, the first row in Figure 6.4 contains the transform coefficients of the lowest temporal frequency after applying the skeleton graph GFT. As can be seen, the energy is concentrated in the lowest skeleton graph frequencies, resulting in increased energy compaction.

Figure 6.4: Energy for the vertex-time transform (GFT and DCT); the axes are numbered with the frequency indices arranged in increasing order.

Figure 6.5: Energy for the time-only transform (DCT); the axes are numbered with the frequency indices arranged in increasing order.

6.2.2 Quantization and entropy coding

We first consider uniform quantization with a fixed step size for all frequencies and rounding off to the nearest integer. Denote by F the GFT coefficient matrix of dimension N_v × t, whose elements are the α_{k,i}, and by Q_u the uniform quantization matrix. The uniform quantization is defined as

\psi_k = \mathrm{round}(F \oslash Q_u),    (6.3)

where \oslash is the element-wise division, and

\hat{\alpha}_k = \psi_k \odot Q_u,    (6.4)

where \odot is the element-wise multiplication, gives us the reconstructed frequency coefficients \hat{\alpha}_k.

Uniform quantization introduces the same level of distortion across all frequencies. However, when we increase the distortion to achieve higher compression, we begin to lose a substantial amount of information. This loss is primarily due to the heavy distortion applied to the lower frequencies, which carry most of the signal energy, as depicted in Figure 6.4. Consequently, the reconstructed signal's mean square error starts to rise. We have developed a non-uniform quantization method for compressing skeleton data to address this issue and maintain reasonable compression ratios. In the context of skeleton data compression, where the primary downstream task is action understanding using the compressed skeleton data, heavy distortion of the higher frequencies, which mostly carry estimation noise, will not affect the performance of the downstream tasks.
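The quantization step in (6.3)–(6.4) amounts to an element-wise divide, round, and multiply; a minimal NumPy sketch with a uniform step-size matrix is shown below (the step size of 0.005 is only an illustrative value).

import numpy as np

def quantize(F, Q):
    # (6.3): element-wise division by the step-size matrix, then rounding.
    return np.round(F / Q)

def dequantize(psi, Q):
    # (6.4): element-wise multiplication to recover the coefficients.
    return psi * Q

# Toy usage: an N_v x t coefficient block for one coordinate with a uniform
# step size; the reconstruction error is bounded by Q/2 element-wise.
rng = np.random.default_rng(1)
F = rng.normal(scale=0.05, size=(25, 30))
Q = np.full((25, 30), 0.005)                     # uniform quantization matrix Q_u
F_hat = dequantize(quantize(F, Q), Q)
print(np.max(np.abs(F - F_hat)) <= 0.005 / 2 + 1e-12)   # True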
In [53], Kao et al. show that the estimation error increases as the motion of the different body joints increases (i.e., the noise is not independent of the signal and increases as the signal magnitude increases). Non-uniform quantization allows us to take into account the non-linear nature of the motion estimation noise [53]. Since the noise is primarily present in the high frequencies, we can afford to use coarser quantization in the high frequencies with less impact on downstream tasks such as compression and then action recognition. We choose the values of the non-uniform quantization matrix Q_nu for different frequencies based on their importance. Since higher frequencies are more affected by noise, we increase the quantization step size for higher frequencies, with values chosen by analyzing the data. Our goal is to achieve a quantization error that is comparable to or smaller than the accuracy of the position estimation. Since the position estimation error is of the order of 0.01 m for systems such as Kinect, we choose the quantization so that errors are typically below 10^-2 m. Figure 6.6 shows the two proposed non-uniform quantization matrices (Q_nu), where the distortion increases diagonally in the left one (Q↓_nu) and vertically in the right one (Q↘_nu). When using Q↓_nu, we perform row-major scanning to flatten the data; when using Q↘_nu, we perform a zig-zag scan. Using these techniques, we scan the comparatively larger values first, so that most zeros go at the end of the 1D array. For the DCT, we only perform quantization using Q↓_nu, but for GSC, we use both non-uniform quantization methods and compare them. For encoding, we use Huffman coding with run-length codes (the maximum run length is chosen to be 10) and end-of-block codes indicating the last non-zero value along a scan path. These methods are similar to those used in JPEG [111].

Figure 6.6: Heatmaps of the non-uniform quantization matrices Q↓_nu and Q↘_nu.

6.3 Experimental Results

We evaluate the compression performance of the proposed GSC algorithm on one of the largest MoCap datasets, the NTU-RGB dataset [100]. It has 60 action classes and 56,880 data samples captured concurrently by three Kinect V2 cameras. The 3D skeletal data contains the 3D coordinates of 25 body joints in each frame. Kinect V2 has an error in the range of 10^-2 m [88]. Thus, if we can achieve a reconstruction root mean square error around 10^-2 m or lower, we consider the system nearly lossless. In all experiments, we use normalized joint position data: the data sequence is normalized by subtracting the spine-base position of the first frame from all the joints of all frames.

6.3.1 Compression Results

Experiment 1: Uniform quantization. Figure 6.7 shows the rate-distortion curve for varying bit-rate. It is clear from the figure that GSC performs better than DCT-based compression, and significantly so at lower bit rates. However, uniform quantization remains lossy unless the bit rate exceeds 2.

Figure 6.7: Rate-distortion curve for uniform quantization. MSE (in m²) vs. bit-rate per joint in each frame is plotted.

Experiment 2: Non-uniform quantization. Figure 6.8 shows an example of the reconstructed signal for the spine-mid joint when using non-uniform quantization. We notice that after compression, DCT mostly retains the overall envelope of the signal, whereas GSC also retains the smaller details. Figure 6.9 shows the curve of MSE vs. bit-rate (per joint per frame). We observe that quantization with Q↓_nu shows better performance than Q↘_nu.
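The scanning and run-length stage can be sketched as follows. This is a simplified illustration: it uses a row-major scan with an illustrative non-uniform step-size matrix (a stand-in for the hand-tuned matrices of Figure 6.6), a maximum zero-run of 10, and an end-of-block marker; the zig-zag scan variant and the final Huffman coding stage are omitted.

import numpy as np

EOB = ('EOB',)      # end-of-block symbol, marking the last non-zero coefficient
MAX_RUN = 10        # maximum zero-run length, as used in this chapter

def nonuniform_q(num_spatial, num_temporal, base=0.001, growth=0.5):
    # Illustrative step-size matrix that grows with the frequency indices,
    # so higher (noisier) frequencies are quantized more coarsely.
    k = np.arange(num_spatial)[:, None]
    t = np.arange(num_temporal)[None, :]
    return base * (1.0 + growth * (k + t))

def run_length_encode(block):
    # Row-major scan followed by (zero_run, value) coding with an EOB symbol.
    scan = block.reshape(-1)                  # row-major: low frequencies first
    nonzero = np.flatnonzero(scan)
    if nonzero.size == 0:
        return [EOB]
    symbols, run = [], 0
    for v in scan[: nonzero[-1] + 1]:
        if v == 0 and run < MAX_RUN:
            run += 1
        else:
            symbols.append((run, float(v)))   # a (MAX_RUN, 0.0) pair encodes a long run
            run = 0
    symbols.append(EOB)                       # all remaining coefficients are zero
    return symbols

# Toy usage: quantize a sparse coefficient block and encode the result.
rng = np.random.default_rng(2)
F = np.zeros((25, 30)); F[:3, :4] = rng.normal(scale=0.05, size=(3, 4))
symbols = run_length_encode(np.round(F / nonuniform_q(25, 30)))
print(len(symbols), symbols[:4])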
Table 6.1 presents the average RMSE computed over the 56,880 skeleton files of the NTU-RGB dataset when compressing the data with DCT with Q↓_nu, GSC with Q↘_nu, and GSC with Q↓_nu. In this case, the compression ratio is cr = L/L_ϕ, the ratio of the number of 2D frequencies L to the number of frequencies encountered before the end-of-block L_ϕ, on average. For all of the algorithms, the compression ratio was set to 2. GSC with Q↓_nu outperforms the other two methods. The average times to compress a temporal frame are 7 and 5 milliseconds for GSC and DCT, respectively. As spatial skeleton graphs are symmetric and bipartite in nature, we can use the fast GFT methods of [76] for faster compression; this study is left as future work. Directly encoding a skeleton file using the temporal DCT leads to a bit rate of 8.29 bits per joint (bpj). On the other hand, for the same skeleton data, GSC with Q↓_nu leads to a bit rate of 1.07 bpj with an MSE of 1 cm². Allowing this loss, which is in the range of the Kinect V2 estimation error, results in an 87.05% reduction in file size. Note that the authors of [47] achieved at most a 78.4% size reduction of the skeleton files with their lossless compression method. Thus, we can achieve higher compression with a negligible amount of loss. The best operating bit-rate region for GSC, while keeping the MSE within Kinect's error range, is 0.5–1 bpj.

Table 6.1: Summary of the comparison between GSC and DCT in terms of average RMSE computed over 56,880 skeleton data files (cr ≈ 2).
Method:        DCT with Q↓_nu      GSC with Q↘_nu      GSC with Q↓_nu
Bit-rate:      ≈ 1.2               ≈ 0.3               ≈ 0.2
RMSE (in m):   3.13 × 10^-2        1.41 × 10^-2        1 × 10^-2

Figure 6.8: Reconstructed signal from the compressed data of the spine-mid joint; the compression ratios for DCT and GSC with Q↓_nu were 2.4 and 4.8, respectively.

6.3.2 Performance on action recognition

To analyze the effect of compression on action recognition, we use the graph-based action recognition algorithm proposed in [53]. First, the skeleton graph-based features are extracted. Then, to model the dynamics in the sequence of frame-wise representations, temporal pyramid matching is adopted with a mean-pool pyramid height of 3. Finally, a linear support vector machine is applied for recognition. Table 6.2 shows the recognition accuracy for the skeleton data files reconstructed from the compressed signal. Here, we use rates of r = 1.2 and r = 1.1 for DCT- and GSC-based compression, respectively, with Q↓_nu. We notice that the recognition accuracy is even slightly better when using the compressed signal. The reason could be that the noise present in the signal is reduced to a certain extent while compressing with the GSC algorithm, which leads to the improvement in recognition performance.

Figure 6.9: Rate-distortion curves for non-uniform quantization. MSE (in m²) vs. bit-rate per joint in each frame is plotted. GSC with Q↘_nu is labeled GSC-Zig-zag and GSC with Q↓_nu is labeled GSC-Row major.

Table 6.2: Performance of the compressed data in action recognition.
Algorithm:      No compression    Compression using DCT    Compression using GSC
Accuracy (%):   56                57                        57

6.4 Conclusion

This chapter introduces a graph-based approach to MoCap data compression. We use a separable spatiotemporal graph transform that achieves good energy compaction for typical data. We process the data in temporal windows and calculate graph Fourier coefficients. To address noise in high temporal frequencies, we have proposed a non-uniform quantization technique.
Additionally, we have applied row-major and zig-zag scanning methods for GSC compression, which enable us to map the 2D frequency data into a 1D representation. Through experiments conducted on the NTU-RGB activity dataset, our GSC compression technique consistently outperforms temporal Discrete Cosine Transform (DCT) compression in terms of rate-distortion performance. Notably, GSC achieves nearly lossless compression without negatively impacting activity recognition performance. Furthermore, GSC is a versatile approach applicable to compressing various kinds of MoCap data, with adaptability achieved by modifying the spatial graph. Future work may explore optimizing the edge weights of the graph to enhance compression performance.

Chapter 7
Conclusion and Future work

Human motion analysis is important in various applications, including human-robot coordination, surgery, physical therapy, and sports and military performance. In this research, we have introduced a graph-based methodology for understanding complex human hand motion using markerless motion data acquired through a single RGB camera or a cost-effective 3D sensor, such as the Microsoft Kinect. We have explored different variations of spatial and spatiotemporal hand graphs and elucidated their interpretability in the graph Fourier domain. Additionally, we have introduced graph-based feature extraction techniques and incorporated popular neural network models capable of capturing subtle changes in movement. Notably, our proposed graph-based features excel at capturing dynamic coordination between different body parts. We have also discussed methods for interpreting spatiotemporal graph neural network models and presented techniques for efficiently compressing extensive motion capture data for storage and transmission.

In terms of practical applications, we have considered both online (real-time) and offline activity segmentation scenarios. We have assessed the algorithm's performance in complex activity recognition tasks and demonstrated that our approach achieves competitive results compared to state-of-the-art methods. Importantly, it accomplishes this with significantly lower computational complexity and improved stability. In the realm of data compression, we have outperformed existing techniques, such as the Discrete Cosine Transform (DCT), by a wide margin without compromising recognition performance when using compressed data.
Moreover, these critical subgraph structures can be integrated into the network to enhance its classification performance. Along with interpretability, stability is a crucial requirement for models that are to be deployed in real-life scenarios. As an extension of the work in Chapter 3, a detailed study is needed to evaluate the stability of the existing state-of-the-art methods relative to our method.

A crucial step in designing bio-inspired robots is understanding how animals, such as dogs, adjust their way of moving when transitioning from smooth, obstacle-free paths to uneven terrain or to environments where obstacles change position. We can employ graph Grad-CAM to pinpoint key subgraph structures within the skeletal data of dogs as they switch between different movement styles. These influential subgraph structures can then guide the design of robot movements that adapt to diverse terrains.

Regarding skeleton data compression, while we have achieved nearly lossless performance for Kinect and OpenPose data, our quantization matrix was tailored to a specific dataset. Developing a generalized method for designing quantization matrices adaptable to different motion-capture devices is necessary. Additionally, to enhance compression and reconstruction speed, implementing the fast graph Fourier transform for the symmetric and bipartite skeleton graph is part of our future plans.

References

[1] Jake K Aggarwal and Michael S Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):1–43, 2011.
[2] Aamir Anis, Philip A Chou, and Antonio Ortega. Compression of dynamic 3d point clouds using subdivisional meshes and graph wavelet transforms. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6360–6364. IEEE, 2016.
[3] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
[4] Robert Baldock, Hartmut Maennel, and Behnam Neyshabur. Deep learning through the lens of example difficulty. Advances in Neural Information Processing Systems, 34:10876–10889, 2021.
[5] Jernej Barbič, Alla Safonova, Jia-Yu Pan, Christos Faloutsos, Jessica K. Hodgins, and Nancy S. Pollard. Segmenting motion capture data into distinct behaviors. In Proceedings of Graphics Interface 2004, GI '04, pages 185–194, School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 2004. Canadian Human-Computer Communications Society.
[6] Djamila Romaissa Beddiar, Brahim Nini, Mohammad Sabokrou, and Abdenour Hadid. Vision-based human activity recognition: a survey. Multimedia Tools and Applications, 79(41-42):30509–30555, 2020.
[7] Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise reduction in speech processing, pages 1–4. Springer, 2009.
[8] Gregory Berkolaiko, EB Bogomolny, and JP Keating. Star graphs and Šeba billiards. Journal of Physics A: Mathematical and General, 34(3):335, 2001.
[9] Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, pages 359–370. Seattle, WA, 1994.
[10] P. Bharti, D. De, S. Chellappan, and S. K. Das. HuMAn: Complex activity recognition with multi-modal multi-positional body sensing. IEEE Transactions on Mobile Computing, 18(4):857–870, April 2019.
[11] F Bonassi, E Terzi, M Farina, and R Scattolini. Lstm neural networks: Input to state stability and probabilistic safety verification. In Learning for Dynamics and Control, pages 85–94. PMLR, 2020.
[12] David Bonet, Antonio Ortega, Javier Ruiz-Hidalgo, and Sarath Shekkizhar. Channel redundancy and overlap in convolutional neural networks with channel-wise nnk graphs. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[13] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
[14] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
[15] Yan Long Che and Zhong Jin Lu. The key technology research of kinect application in sport training. In Advanced Materials Research, volume 945, pages 1890–1893. Trans Tech Publ, 2014.
[16] Ke Cheng, Yifan Zhang, Xiangyu He, Weihan Chen, Jian Cheng, and Hanqing Lu. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[17] Gene Cheung, Enrico Magli, Yuichi Tanaka, and Michael K Ng. Graph spectral image processing. Proceedings of the IEEE, 106(5):907–930, 2018.
[18] Romain Cosentino, Sarath Shekkizhar, Salman Avestimehr, Mahdi Soltanolkotabi, and Antonio Ortega. The geometry of self-supervised learning models and its impact on transfer learning. 2022.
[19] Pratyusha Das, Kingshuk Chakravarty, Arijit Chowdhury, Debatri Chatterjee, Aniruddha Sinha, and Arpan Pal. Improving joint position estimation of kinect using anthropometric constraint based adaptive kalman filter for rehabilitation. Biomedical Physics & Engineering Express, 4(3):035002, 2018.
[20] Pratyusha Das, Jiun-Yu Kao, Antonio Ortega, Tomoya Sawada, Hassan Mansour, Anthony Vetro, and Akira Minezawa. Hand graph representations for unsupervised segmentation of complex activities. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4075–4079. IEEE, 2019.
[21] Pratyusha Das, Jiun-Yu Kao, Tomoya Sawada, Antonio Ortega, Hassan Mansour, Anthony Vetro, and Akira Minezawa. Hand skeleton dataset for a toy assembling task extracted using openpose. https://drive.google.com/file/d/1DDmkSL-_CqZxdFgg83NOufXY-LFtDI0s/view?usp=sharing, 2018.
[22] Pratyusha Das and Antonio Ortega. Graph-based skeleton data compression. In 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), pages 1–6, 2020.
[23] Pratyusha Das and Antonio Ortega. Symmetric sub-graph spatio-temporal graph convolution and its application in complex activity recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3215–3219, 2021.
[24] Pratyusha Das and Antonio Ortega. Symmetric sub-graph spatio-temporal graph convolution and its application in complex activity recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3215–3219. IEEE, 2021.
[25] Pratyusha Das and Antonio Ortega. Gradient-weighted class activation mapping for spatio temporal graph convolutional network. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4043–4047. IEEE, 2022.
[26] Pratyusha Das, Antonio Ortega, Siheng Chen, Hassan Mansour, and Anthony Vetro. Application-agnostic spatio-temporal hand graph representations for stable activity understanding. In 2021 IEEE International Conference on Image Processing (ICIP), pages 1074–1078, 2021.
[27] Quentin De S, H Wannous, and J P Vandeborre. Heterogeneous hand gesture recognition using 3D dynamic skeletal data. Computer Vision and Image Understanding, 181:60–72, 2019.
[28] Peter Deutsch. Deflate compressed data format specification version 1.3. 1996.
[29] Aaron M Dollar. Classifying human hand use and the activities of daily living. The human hand as an inspiration for robot hand development, pages 201–216, 2014.
[30] Y Du, W Wang, and L Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1110–1118, 2015.
[31] Péter Fankhauser, Michael Bloesch, Diego Rodriguez, Ralf Kaestner, Marco Hutter, and Roland Siegwart. Kinect v2 for mobile robot navigation: Evaluation and modeling. In 2015 International Conference on Advanced Robotics (ICAR), pages 388–394. IEEE, 2015.
[32] Mahyar Fazlyab, Manfred Morari, and George J Pappas. Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. IEEE Transactions on Automatic Control, 2020.
[33] Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George Pappas. Efficient and accurate estimation of lipschitz constants for deep neural networks. Advances in Neural Information Processing Systems, 2019.
[34] G. Fracastoro, D. Thanou, and P. Frossard. Graph transform learning for image compression. In 2016 Picture Coding Symposium (PCS), pages 1–5, 2016.
[35] Giulia Fracastoro, Dorina Thanou, and Pascal Frossard. Graph transform optimization with application to image compression. IEEE Transactions on Image Processing, 29:419–432, 2019.
[36] G Garcia-Hernando and T K Kim. Transition forests: Learning discriminative temporal transitions for action recognition and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 432–440, 2017.
[37] Guillermo Garcia-Hernando, Shanxin Yuan, Seungryul Baek, and Tae-Kyun Kim. First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 409–419, 2018.
[38] André Gensler and Bernhard Sick. Novel criteria to measure performance of time series segmentation techniques. In LWA, pages 193–204, 2014.
[39] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241. PMLR, 2019.
[40] Vincent Gripon, Antonio Ortega, and Benjamin Girault. An inside look at deep neural networks using graph signal processing. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.
[41] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems, 31, 2018.
[42] Maya R. Gupta and Yihua Chen. Theory and use of the em algorithm. Found. Trends Signal Process., 4(3):223–296, March 2011.
[43] Fei Han, Brian Reily, William Hoff, and Hao Zhang. Space-time representation of people based on 3d skeletal data: A review. Computer Vision and Image Understanding, 158:85–105, 2017.
[44] Kyu J. Han, Panayiotis G. Georgiou, and Shrikanth Narayanan. The SAIL Speaker Diarization System for Analysis of Spontaneous Meetings. In Proceedings of IEEE International Workshop on Multimedia Signal Processing (MMSP), Cairns, Australia, October 2008.
[45] Tengda Han, Jue Wang, Anoop Cherian, and Stephen Gould. Human action forecasting by learning task grammars. arXiv:1709.06391, 2017.
[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[47] Xiaoyi He, Mingzhou Liu, Weiyao Lin, Xintong Han, Yanmin Zhu, Hongtao Lu, and Hongkai Xiong. A multimodal lossless coding method for skeletons in videos. In 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pages 132–137. IEEE, 2019.
[48] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[49] Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, and Jianguo Zhang. Jointly learning heterogeneous features for rgb-d activity recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5344–5352, 2015.
[50] Dexter Industries. Gopigo3 robot base kit. https://www.amazon.com/Dexter-Industries-GoPiGo3-Robot-Base/dp/B071WPZ2GF.
[51] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. IEEE Trans. on Big Data, 2019.
[52] Jiun-Yu Kao, Antonio Ortega, and Shrikanth Narayanan. Graph-based approach for motion capture data representation and analysis. In Proceedings of IEEE International Conference on Image Processing, oct 2014.
[53] Jiun-Yu Kao, Antonio Ortega, Dong Tian, Hassan Mansour, and Anthony Vetro. Graph based skeleton modeling for human activity analysis. In 2019 IEEE International Conference on Image Processing (ICIP), pages 2025–2029. IEEE, 2019.
[54] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[55] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3288–3297, 2017.
[56] T Kerola, N Inoue, and K Shinoda. Spectral graph skeletons for 3d action recognition. In Asian Conference on Computer Vision, pages 417–432. Springer, 2014.
[57] Leonid Keselman, John Iselin Woodfill, Anders Grunnet-Jepsen, and Achintya Bhowmik. Intel realsense stereoscopic depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–10, 2017.
[58] T. S. Kim and A. Reiter. Interpretable 3D human action analysis with temporal convolutional networks. In 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW), pages 1623–1631. IEEE, 2017.
[59] Woo-Shik Kim, Sunil K Narang, and Antonio Ortega. Graph based transforms for depth video coding. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 813–816. IEEE, 2012.
[60] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
[61] Chung-Liang Lai, Ya-Ling Huang, Tzu-Kuan Liao, Chien-Ming Tseng, Yung-Fu Chen, and D Erdenetsogt. A Microsoft Kinect-based virtual rehabilitation system to train balance ability for stroke patients. In 2015 International Conference on Cyberworlds (CW), pages 54–60. IEEE, 2015.
[62] Kam Lai, Janusz Konrad, and Prakash Ishwar. A gesture-driven computer interface using kinect. In 2012 IEEE Southwest Symposium on Image Analysis and Interpretation, pages 185–188. IEEE, 2012.
[63] Carlos Lassance, Vincent Gripon, and Antonio Ortega. Representing deep neural networks latent space geometries with graphs. Algorithms, 14(2):39, 2021.
[64] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055, 2018.
[65] Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3595–3603, 2019.
[66] Sheng Li, Tingting Jiang, Yonghong Tian, and Tiejun Huang. 3D human skeleton data compression for action recognition. In 2019 IEEE Visual Communications and Image Processing (VCIP), pages 1–4. IEEE, 2019.
[67] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[68] Chinghway Lim and Bin Yu. Estimation stability with cross-validation (escv). Journal of Computational and Graphical Statistics, 25(2):464–492, 2016.
[69] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. IEEE transactions on pattern analysis and machine intelligence, 42(10):2684–2701, 2019.
[70] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016.
[71] Ye Liu, Liqiang Nie, Lei Han, Luming Zhang, and David S Rosenblum. Action2activity: recognizing complex activities from sensor data. In Twenty-fourth international joint conference on artificial intelligence, 2015.
[72] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 143–152, 2020.
[73] S Lohit, Q Wang, and P Turaga. Temporal transformer networks: Joint learning of invariant and discriminative time warping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12426–12435, 2019.
[74] A Loukas and D Foucard. Frequency analysis of time-varying graph signals. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 346–350. IEEE, 2016.
[75] François Lozes, Abderrahim Elmoataz, and Olivier Lézoray. Pde-based graph signal processing for 3-d color point clouds: Opportunities for cultural heritage. IEEE Signal Processing Magazine, 32(4):103–111, 2015.
[76] K. Lu and A. Ortega. Fast graph fourier transforms based on graph symmetry and bipartition. IEEE Transactions on Signal Processing, 67(18):4855–4869, 2019.
[77] Keng-Shih Lu and Antonio Ortega. Fast graph fourier transforms based on graph symmetry and bipartition. IEEE Transactions on Signal Processing, 67(18):4855–4869, 2019.
[78] S Masood, M P Qureshi, M B Shah, Salman Ashraf, Zahid H, and G Abbas. Dynamic time wrapping based gesture recognition. In 2014 International Conference on Robotics and Emerging Allied Technologies in Engineering (iCREATE), pages 205–210. IEEE, 2014.
[79] Alexander Mathis, Pranav Mamidanna, Kevin M Cury, Taiga Abe, Venkatesh N Murthy, Mackenzie Weygandt Mathis, and Matthias Bethge. Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nature neuroscience, 21(9):1281–1289, 2018.
[80] Meinard Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007.
[81] SVN Murthy and BK Sujatha. A novel graph-based technique to enhance video compression algorithm. In Emerging Research in Computing, Information, Communication and Applications, pages 463–468. Springer, 2015.
[82] Sunil K Narang and Antonio Ortega. Perfect reconstruction two-channel wavelet filter banks for graph structured data. IEEE Transactions on Signal Processing, 60(6):2786–2799, 2012.
[83] Hoang Long Nguyen and Jai E Jung. Socioscope: A framework for understanding internet of social knowledge. Future Generation Computer Systems, 83:358–365, 2018.
[84] Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm infinite width ReLU nets: The multivariate case. In International Conference on Learning Representations, 2019.
[85] Antonio Ortega. Introduction to graph signal processing. Cambridge University Press, 2022.
[86] Antonio Ortega, Pascal Frossard, Jelena Kovačević, José MF Moura, and Pierre Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808–828, 2018.
[87] Daniil Osokin. Real-time 2d multi-person pose estimation on cpu: Lightweight openpose. arXiv preprint arXiv:1811.12004, 2018.
[88] Karen Otte, Bastian Kayser, Sebastian Mansow-Model, Julius Verrel, Friedemann Paul, Alexander U Brandt, and Tanja Schmitz-Hübsch. Accuracy and reliability of the kinect version 2 for clinical measurement of motor function. PloS one, 11(11), 2016.
[89] Chao Pan, Siheng Chen, and Antonio Ortega. Spatio-temporal graph scattering transform. arXiv preprint arXiv:2012.03363, 2020.
[90] S Patro and K K Sahu. Normalization: A preprocessing stage. arXiv preprint arXiv:1503.06462, 2015.
[91] Liangying Peng, Ling Chen, Menghan Wu, and Gencai Chen. Complex activity recognition using acceleration, vital sign, and location data. IEEE Transactions on Mobile Computing, 18(7):1488–1498, 2018.
[92] Mirela Popa, Alper Kemal Koc, Leon JM Rothkrantz, Caifeng Shan, and Pascal Wiggers. Kinect sensing of shopping related actions. In International joint conference on ambient intelligence, pages 91–100. Springer, 2011.
[93] Phillip E Pope, Soheil Kolouri, Mohammad Rostami, Charles E Martin, and Heiko Hoffmann. Explainability methods for graph convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10772–10781, 2019.
[94] Liliana Lo Presti and Marco La Cascia. 3D skeleton-based human action classification: A survey. Pattern Recognition, 53:130–147, 2016.
[95] Bin Ren, Mengyuan Liu, Runwei Ding, and Hong Liu. A survey on 3d skeleton-based action recognition using learning method. arXiv preprint arXiv:2002.05907, 2020.
[96] Alessia Saggese, Nicola Strisciuglio, Mario Vento, and Nicolai Petkov. Learning skeleton representations for human action recognition. Pattern Recognition Letters, 118:23–31, 2019.
[97] Akie Sakiyama, Yuichi Tanaka, Toshihisa Tanaka, and Antonio Ortega. Efficient sensor position selection using graph signal sampling theory. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6225–6229. IEEE, 2016.
[98] B Scholkopf and A J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
[99] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[100] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016.
[101] Sarath Shekkizhar and Antonio Ortega. Graph construction from data using non negative kernel regression (nnk graphs). arXiv preprint arXiv:1910.09383, 2019.
[102] Sarath Shekkizhar and Antonio Ortega. Graph construction from data by non-negative kernel regression. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3892–3896. IEEE, 2020.
[103] Sarath Shekkizhar and Antonio Ortega. Model selection and explainability in neural networks using a polytope interpolation framework. In 2021 55th Asilomar Conference on Signals, Systems, and Computers, pages 177–181. IEEE, 2021.
[104] Sarath Shekkizhar and Antonio Ortega. Revisiting local neighborhood methods in machine learning. In Data Science and Learning Workshop (DSLW). IEEE, 2021.
[105] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine, 30(3):83–98, 2013.
[106] S. Stein and S. J. McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2013), Zurich, Switzerland. ACM, September 2013.
[107] Jeyarajan Thiyagalingam, Olav Beckmann, and Paul HJ Kelly. An exhaustive evaluation of row-major, column-major and morton layouts for large two-dimensional arrays. In Performance Engineering: 19th Annual UK Performance Engineering Workshop, pages 340–351. University of Warwick Coventry, UK, 2003.
[108] Joel Aaron Tropp. Topics in sparse approximation. The University of Texas at Austin, 2004.
[109] R Vemulapalli, F Arrate, and R Chellappa. Human action recognition by representing 3D skeletons as points in a lie group. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 588–595, 2014.
[110] Michalis Vrigkas, Christophoros Nikou, and Ioannis A Kakadiaris. A review of human activity recognition methods. Frontiers in Robotics and AI, 2:28, 2015.
[111] Gregory K Wallace. The JPEG still picture compression standard. IEEE transactions on consumer electronics, 38(1):xviii–xxxiv, 1992.
[112] H. Wang, T. M. Khoshgoftaaar, and Q. Liang. Stability and classification performance of feature selection techniques. In 2011 10th International Conference on Machine Learning and Applications and Workshops, volume 1, pages 151–156, Dec 2011.
[113] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52, 1987.
[114] L. Xu and M. I. Jordan. On convergence properties of the em algorithm for gaussian mixtures. Neural computation, 8(1):129–151, 1996.
[115] S Yan, Y Xiong, and D Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[116] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
[117] Yichao Yan, Jingwei Xu, Bingbing Ni, Wendong Zhang, and Xiaokang Yang. Skeleton-aided articulated motion generation. In Proceedings of the 25th ACM international conference on Multimedia, pages 199–207, 2017.
[118] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.
[119] Aviad Zabatani, Vitaly Surazhsky, Erez Sperling, Sagi Ben Moshe, Ohad Menashe, David H Silver, Zachi Karni, Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Intel® realsense™ sr300 coded light depth camera. IEEE transactions on pattern analysis and machine intelligence, 42(10):2333–2345, 2019.
[120] Mihai Zanfir, Marius Leordeanu, and Cristian Sminchisescu. The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In Proceedings of the IEEE international conference on computer vision, pages 2752–2759, 2013.
[121] S Zhang, Y Yang, J Xiao, X Liu, Yi Yang, D Xie, and Y Zhuang. Fusing geometric features for skeleton-based action recognition using multilayer lstm networks. IEEE Transactions on Multimedia, 20(9):2330–2343, 2018.
[122] X Zhang, Y Wang, M Gou, M Sznaier, and O Camps. Efficient temporal sequence comparison and classification using gram matrix embeddings on a riemannian manifold. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4498–4507, 2016.
[123] Zhengyou Zhang. Microsoft kinect sensor and its effect. IEEE multimedia, 19(2):4–10, 2012.
[124] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.
[125] W Zhu, C Lan, J Xing, W Zeng, Y Li, L Shen, and X Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[126] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pages 912–919, 2003.
Abstract
Analyzing and understanding human actions is a popular yet challenging field with broad applications. Studying complex hand movements is even more challenging due to similarities between different actions and the concentration of motion in small body areas, making them hard to differentiate. In this thesis, we address these challenges by creating representations for hand skeleton-based motion data. To tackle the irregularity in hand skeletal structure and actions, we utilize graph structures, known for modeling complex relationships among entities in irregular domains.
First, our approach involves constructing different spatial and spatiotemporal hand graphs and applying graph-based tools, such as the Graph Fourier Transform (GFT) and Graph Neural Networks, to analyze position and motion data defined on the graph. To the best of our knowledge, we are the first to propose using hand graphs for understanding human motion. We delve into constructing different types of hand graphs, exploring their spatial and spectral properties, and interpreting the contributions of the GFT basis vectors. Furthermore, we emphasize the desirable properties of our representations, including computational efficiency and ease of generalization.
Second, exploiting the structural properties of the proposed hand graphs, we decompose hand graphs into smaller sub-graphs and define a symmetric subgraph spatiotemporal graph neural network using separate temporal kernels for each sub-graph. This approach can be generalized to model graphs with similar properties, such as symmetry, bipartiteness, and more. This method reduces complexity while providing better performance. However, these neural network-based methods suffer from a lack of interpretability and stability. We discuss methods to interpret spatiotemporal graph neural network-based models, which can find the subgraph structures influencing the decision made by the model. Additionally, we design methods to analyze these models' stability, which helps us choose methods appropriate to real-world applications.
Third, reliable, fast, and portable skeleton data acquisition systems can now track multiple people simultaneously, providing full-body skeletal key points along with more detailed landmarks of the face, hands, and feet. This results in a huge amount of skeleton data to be transmitted or stored. This thesis introduces graph-based techniques to compress skeleton data nearly losslessly.
Human activity understanding involves two stages: segmenting sub-actions from a complete activity and then recognizing these sub-actions. We assess the performance of our graph-based, application-agnostic feature extraction method in both online (real-time) and offline settings, using an industrial assembling task dataset collected in a controlled USC environment. For the supervised recognition task, we consider daily activities from real-world scenarios such as kitchen, office, and social settings. Employing the proposed method, we achieve better recognition performance than the state-of-the-art in a cross-person setting, with the added benefits of reduced complexity and increased stability. For the compression task, we use a large human activity dataset, NTURGB60, where the proposed method outperforms existing techniques such as DCT by a wide margin without compromising recognition performance.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Human activity analysis with graph signal processing techniques
Graph-based models and transforms for signal/data processing with applications to video coding
Efficient graph learning: theory and performance evaluation
Neighborhood and graph constructions using non-negative kernel regression (NNK)
Compression of signal on graphs with the application to image and video coding
Magnetic induction-based wireless body area network and its application toward human motion tracking
Physics-aware graph networks for spatiotemporal physical systems
Critically sampled wavelet filterbanks on graphs
Dynamical representation learning for multiscale brain activity
Efficient transforms for graph signals with applications to video coding
Graph machine learning for hardware security and security of graph machine learning: attacks and defenses
Estimation of graph Laplacian and covariance matrices
Lifting transforms on graphs: theory and applications
Learning distributed representations from network data and human navigation
Graph embedding algorithms for attributed and temporal graphs
Hardware-software codesign for accelerating graph neural networks on FPGA
Object classification based on neural-network-inspired image transforms
Syntax-aware natural language processing techniques and their applications
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
Human appearance analysis and synthesis using deep learning
Asset Metadata
Creator
Das, Pratyusha
(author)
Core Title
Human motion data analysis and compression using graph based techniques
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2024-05
Publication Date
01/18/2024
Defense Date
10/06/2023
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
complex activity analysis, graph machine learning, graph neural networks, graph signal processing, hand activity analysis, human activity analysis, OAI-PMH Harvest, skeleton data compression
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Ortega, Antonio (committee chair), Qian, Feifei (committee member), Valero-Cuevas, Francisco (committee member)
Creator Email
daspraty@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113804858
Unique identifier
UC113804858
Identifier
etd-DasPratyus-12610.pdf (filename)
Legacy Identifier
etd-DasPratyus-12610
Document Type
Thesis
Rights
Das, Pratyusha
Internet Media Type
application/pdf
Type
texts
Source
20240118-usctheses-batch-1120 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu