INCORPORATING AGGREGATE FEATURE STATISTICS IN STRUCTURED DYNAMICAL MODELS FOR HUMAN ACTIVITY RECOGNITION

by

Prithviraj Banerjee

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2014

Copyright 2014 Prithviraj Banerjee

Acknowledgements

I would like to thank my advisor, Prof. Ram Nevatia, for his guidance during my stay at USC. I am especially thankful to him for allowing me the freedom and time to grow as a vision researcher while providing thoughtful insights and support whenever it was needed. I am also thankful to Prof. Gerard Medioni, Prof. Yan Liu, Prof. Wei-Min Shen and Prof. C-C. Jay Kuo for their comments and discussions during my qualifying and dissertation examinations. I am also grateful to Dr. SungChun Lee for his immense help in pre-processing the data, and to Rosemary Binh and Lizsl De Leon for always helping me with the administrative hurdles of Ph.D. life.

I would like to thank all the members of the USC computer vision group for countless hours of discussions and debates, which have helped me in shaping this thesis. I am immensely grateful for the discussions with Pradeep Natarajan, Vivek Singh, Pramod Sharma, Weijun Wang, Furqan Khan, Chang Huang, Remi Trichet, Ying Hao, Chen Sun, Song Cao, Cheng-Hao Kuo, Eunyoung Kim, Thang Dinh, Anustup Choudhury, Arnav Agarwal, Younghoon Lee and many more current and past members of the IRIS computer vision laboratory.

I am also grateful for the funding sources that made my Ph.D. work possible. I was funded, in part, by the Provost Fellowship from USC, the U.S. Government VIRAT and MindsEye programs, and by the Office of Naval Research. A special thanks to my friends Pramod Sharma, Kartik Audhkhasi, Maheswaran Sathiamoorthy, Harshwardhan Vathsangam, Manish Jain, Megha Gupta and Nilesh Mishra.

I am eternally thankful to my father Dr. Joy Prokash Banerjee, and my late mother Dr. Sunanda Banerjee, for their support and encouragement throughout my formative years and beyond; without them I would be a shadow of my present self. I am also grateful to my uncle Mr. Jyotirmay Neogy, and my parents-in-law Mr. Krupakar Konda and Mrs. Latha Konda, for giving me useful advice on all aspects of life. Lastly, but most importantly, I truly appreciate the love and support of my wife Shravya through all the ups and downs of my life. Completing the Ph.D. without her beside me would have been unimaginable.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Problem Statement
    1.1.1 Proposed Design Philosophy
  1.2 Challenges
    1.2.1 Observation Noise
    1.2.2 Object Interactions
    1.2.3 Variation in Activity Style Dynamics
    1.2.4 Detection in Unsegmented Videos
    1.2.5 Activity Description
    1.2.6 Manual Annotation Requirements
  1.3 Our Contributions
    1.3.1 Choosing the Correct Model
    1.3.2 LF-HCRF: Learning co-occurrence statistics of local-features
    1.3.3 Pose-MKL: Statistical pooling of multiple pose distributions
    1.3.4 PF-HCRF: Pose filter based dynamical models
    1.3.5 MSDP: Multi-state dynamic pooling using segment selection
  1.4 Thesis Outline

Chapter 2: Human Activity Recognition in Videos - An Overview
  2.1 Structured Dynamical Models (SDMs)
    2.1.1 Logic and Grammar
    2.1.2 Activity Modeling using Finite State Machines (FSM)
    2.1.3 Classification and Inference using State Machines
  2.2 Local-Feature based Statistical Models (LFSM)
    2.2.1 Interest Point Detection
    2.2.2 Feature Descriptors
    2.2.3 Feature Quantization
    2.2.4 Incorporating Structure in UAMs
  2.3 Generative vs Discriminative Models
    2.3.1 Application to SDMs and LFSMs
  2.4 Latent Variable Models

Chapter 3: LF-HCRF: Learning Neighborhood Co-occurrence Statistics of local features
  3.1 Introduction
  3.2 Related Work
  3.3 Learning Co-Occurrence Statistics of STIP Features
    3.3.1 Spatio-Temporal Interest Point features
    3.3.2 CRF model for Bag-of-Words classifier
      3.3.2.1 Hidden Variables for Learning Code-Words
      3.3.2.2 Incorporating Co-Occurrence Statistics
    3.3.3 Hidden Conditional Random Fields
      3.3.3.1 HCRF Training Algorithm
    3.3.4 Hidden Layer Connectivity
      3.3.4.1 Minimum Spanning Tree
      3.3.4.2 2-Edge-Connected Graph
  3.4 Results
    3.4.1 Weizmann
      3.4.1.1 Edge Connectivity and Code Book Size
      3.4.1.2 Classification Accuracy
    3.4.2 KTH
  3.5 Conclusion

Chapter 4: Pose based Activity Recognition using Multiple Kernel Learning
  4.1 Introduction
  4.2 Human Pose Estimation
    4.2.1 Kinematic Pose Priors
  4.3 Activity Recognition
    4.3.1 Pose Descriptors
    4.3.2 Multiple Kernel Learning
  4.4 Results and Conclusion

Chapter 5: Pose Filter based Hidden-CRF models for Activity Detection
  5.1 Introduction
  5.2 Related Work
  5.3 Model Overview
  5.4 Key-Pose Identification
    5.4.1 Pose Sequence Summarization
  5.5 Key-Pose Filter based HCRF
    5.5.1 Root Filter
    5.5.2 Key-Pose Appearance Filter
    5.5.3 Temporal Location Distribution
  5.6 Model Learning and Inference
    5.6.1 Latent Support Vector Machine
    5.6.2 Weight Initialization
  5.7 Model Inference for Multiple Detections
  5.8 Results
    5.8.1 UT-Interaction [64]
    5.8.2 USC-Gestures [49]
    5.8.3 CMU-Action [31]
    5.8.4 Rochester ADL [42]
  5.9 Conclusion

Chapter 6: Multi-State Dynamic Pooling using Segment Selection
  6.1 Introduction
  6.2 Related Work
  6.3 Pooling Interest Point Features for Event Classification
    6.3.1 Discriminative Segment Selection
    6.3.2 K-Segment Selection (KSS)
    6.3.3 Linear Time Subset Scanning
  6.4 Multistate K-Segment Selection (MKSS)
  6.5 Selecting Optimal Parameter K
    6.5.1 Regularized Multistate Segment Selection (RMSS)
  6.6 Dynamic Program for K-Segment Selection
    6.6.1 Time Complexity
  6.7 Experiments and Results
    6.7.1 Comparisons with baselines
    6.7.2 Segment Selection results
  6.8 Conclusions

Chapter 7: Conclusions and Future Work
  7.1 Future Work

Bibliography

List of Tables

3.1 Comparison with varying codebook size |H| and different edge connectivity on the Weizmann dataset.
3.2 Comparative results on the Weizmann and KTH datasets. The top half cites approaches that attempt to model the neighborhood structure of interest points. The bottom half cites approaches from general activity recognition systems, not necessarily based on interest points. The last column shows the train/test split ratio for KTH.
5.1 UT-Interaction: Classification accuracy for observing the initial 50% of the video, and the full video.
5.2 Result tables for USC-Gestures.
5.3 Result tables for Rochester-ADL.
6.1 MAP result table.

List of Figures

1.1 We evaluate our algorithms on the following challenging datasets: (a-c) CMU Action Dataset (d) Composite Cooking Dataset (e) USC-Gestures dataset (f) KTH activity dataset (g) Weizmann dataset (h) UT-Interaction dataset and (i) Rochester ADL dataset.
1.2 Flow diagram of aggregate statistical approaches. First, interest point features are detected in the spatio-temporal volume. Next, global statistics like histograms of codewords are computed corresponding to each video instance. Finally, the histogram features from training video examples are used to learn a discriminative classifier like Support Vector Machines.
1.3 Flow diagram for finite state machine based approaches. The action is represented as a sequence of state transitions, resulting in a Finite State Machine representation of the action. A temporal graphical model like a Dynamic Bayesian Network or a Dynamic Conditional Random Field is used to infer the state assignments in each frame.
1.4 Feature-Classifier Hierarchy, with our contributions spanning the hierarchy.
1.5 Relative strengths and target space of our proposed algorithms.
3.1 (a) Histogram computed over the entire video volume. (b) Histogram computed over global spatio-temporal bins. (c) The feature-centric neighborhood of codeword b is shown as a red sphere, with co-occurrence relationships transformed into edges of a graph.
3.2 Co-occurrence statistics of codewords: (a) The neighborhood of codeword b is shown in the spatio-temporal volume. Codewords b and c co-occur often, whereas a and c rarely co-occur. (b) Reduction of neighborhood relationships to edges in a graph.
3.3 CRF formulation for Bag-of-Words classifier: (a) Logistic Regression model with pre-assigned codewords as observations. (b) CRF with codeword assignment determined by hidden variables, i.e. not observable in train/test data. (c) CRF representing co-occurrence statistics of codeword assignments using hidden layer connectivity.
3.4 (a) General representation of the Hidden Conditional Random Field model (b) Factor graph representation of the Hidden Conditional Random Field (c) Evidence-reduced factor graph used for inferring P(h | y, X; θ) during training.
3.5 (a) The hidden layer with edge set E_MST (b) The evidence-reduced factor graph for E_MST has no cycles, allowing for exact inference.
3.6 (a) 2-lattice graph with every node connected to its two closest spatial neighbors (b) The evidence-reduced factor graph for E_2Lat has cycles and requires loopy belief propagation for inference.
3.7 Confusion Matrices: (a) Weizmann dataset with E_2Con connectivity and |H| = 20. Average accuracy over all classes is 98.76%. (b) KTH dataset with E_MST connectivity and |H| = 20 for s1, s2 and |H| = 10 for s3, s4 scenarios. Average accuracy over all classes is 93.98%.
4.1 Kinematic tree priors K_1 to K_8 representing distinct mean-pose configurations.
The one-σ boundary of the Gaussian distribution for relative location and orientation between part pairs is shown using blue ellipses and green lines respectively.
4.2 Sample pose estimation results on the USC-Gesture Dataset using kinematic prior trees K_1 to K_8. The results display the part-wise marginal posterior distributions p(l_i = (x, y, θ); k). The distributions have significantly lower entropy for frames closer to the mean-pose configuration of the KTPs, for example, K_1-G_6, K_2-G_4, K_6-G_3 and others. (This figure is best viewed in color.)
4.3 Confusion Matrix for the USC-Gesture dataset.
5.1 Flow diagram of our proposed algorithm.
5.2 Pose Summarization for Key-Pose detection: The bottom row shows every fourth frame from a sample video sequence. The top row shows the key poses and their respective temporal boundaries for K = 3.
5.3 Results for the pose summarization algorithm: (a) K=3, (b) K=4 and (c) K=5.
5.4 The factor graph representation of our proposed HCRF model for K = 2 key-poses.
5.5 Panel (a) shows the two key-poses identified by the pose summarization algorithm, with their corresponding anchor times a_1, a_2. Panel (c) shows the feature descriptors x_t, and their corresponding codeword assignments w_t ∈ W below. The root filter, shown in cyan, is applied between frames t_r and t_r + L, where it models the pose-codeword frequencies, as shown in panel (b). Sample results of the key-pose filters learned by the LSVM model are shown in panel (d). Their temporal location is modeled with a normal distribution about their corresponding anchor location a_k, shown in panel (c).
5.6 (a-c) Noisy and erroneous tracks.
5.7 Track fragmentations due to ID switches.
5.8 Scale alignment search.
5.9 Track Extensions: A single actor track is fragmented into blue and magenta tracks due to non-pedestrian pose changes during the course of the pickup action. Track extensions help in resolving such track fragmentation errors.
5.10 Heat maps represent the output scores of the key-pose filters, the root filters and the inferred detection confidence A(t), along with ground truth and predicted detection segments. Refer to the text for more details. (Figure is best viewed in color and magnified.)
5.11 Streaming video performance on UT-Interaction compared against Key-Framing [60], MSSC [9], Dynamic Cuboid [65], Integral Cuboid [65], Cuboid+SVM [65], Bayesian [65], Random-Chance and PF-HCRF (Our Method).
5.12 UT-Interaction: Precision-Recall curves for activity detection.
5.13 Precision-Recall curves on the CMU action dataset compared to other methods: Ke-SFP [31], Ke-SP [31], Ke-SW [31], Shechtman [71], Yuan [96].
5.14 Key-pose sequences inferred by PF-HCRF give a semantic description of the activity with high consistency.
6.1 The composite event Grating Cheese is composed of numerous primitive actions which occur with varying durations and arbitrary gaps between them, making it a challenge to learn a composite event classifier.
6.2 Flow diagram for the multistate dynamic feature pooling algorithm. A 3-state finite state machine is assumed, which specifies a temporal ordering S_A < S_B < S_C. Segment classification scores are computed w.r.t. each state, and the discriminative segments are selected by solving a linear program. The aggregate feature statistics from the selected segments are pooled to compute a global feature statistic, which is used for training a Latent-SVM classifier and hence learning the optimal activity classification weights. The weight learning and feature pooling are repeated iteratively until convergence.
6.3 MAP result table for the Composite Cooking dataset. Column TB presents the MAP scores from a temporally binned BoW classifier [36] with 3, 5 and 7 temporal partitions. MAP scores for the MKSS and RMSS algorithms for different numbers of states (3, 5 and 7) are also given for each action. The highest MAP score across states is highlighted.
6.4 (a) 3-state segment selection result for Separating an egg with representative frames of the selected segments. (b) 5-state segment selection result with comparisons of the frames selected by each state.
7.1 Summary of our contributions across the feature-classifier hierarchy.
7.2 Relative merits and drawbacks of our proposed algorithms.
7.3 Surveillance scenarios addressed by our proposed algorithms.

Abstract

Human action recognition in videos is a central problem of computer vision, with numerous applications in the fields of video surveillance, data mining and human computer interaction. There has been considerable research on classifying pre-segmented videos into a single activity class; however, there has been comparatively less progress on activity detection in un-segmented and un-aligned videos containing medium to long term complex events.

Our objective is to develop efficient algorithms to recognize human activities in monocular videos captured from static cameras in both indoor and outdoor scenarios. Our focus is on detection and classification of complex human events in un-segmented continuous videos, where the top level event is composed of primitive action components, such as human key-poses or primitive actions. We assume a weakly supervised setting, where only the top level event labels are provided for each video during training, and the primitive actions are not labeled. We require our algorithm to be robust to missing frames, temporary occlusion of body parts, background clutter, and to variations in activity styles and durations. Furthermore, our models gracefully scale to complex events containing human-human and human-object interactions, while not assuming access to perfect pedestrian or object detection results.

We have proposed and adopted the design philosophy of combining global statistics of local spatio-temporal features with the high level structure and constraints provided by dynamic probabilistic graphical models. We present four different algorithms for activity recognition, spanning the feature-classifier hierarchy in terms of their semantic and structure modeling capability.
Firstly, we present a novel Latent CRF classifier for modeling the local neighborhood structure of spatio-temporal interest point features in terms of codeword co-occurrence statistics, which captures the local temporal dynamics present in the action. In our second work, we present a multiple kernel learning framework to combine human pose estimates generated from a collection of kinematic tree priors, spanning the range of expected pose dynamics in human actions. In our third work, we present a latent CRF model for automatically identifying and inferring the temporal location of the key-poses of an activity, and show results on detecting multiple instances of actions in continuous un-segmented videos. Lastly, we propose a novel dynamic multi-state feature pooling algorithm which identifies the discriminative segments of a video, and is robust to arbitrary gaps between state transitions as well as to significant variations in state durations. We evaluate our models on short, medium and long term activity datasets and show state of the art performance on classification, detection and video streaming tasks.

Chapter 1: Introduction

Automated recognition of human activities is a central problem of computer vision with important applications in video surveillance, video retrieval, data analytics and human computer interaction. In recent years, important progress has been made in low-level computer vision tasks like object/pedestrian detection and tracking. Such algorithms aim to detect and identify the entities present in a video scene; however, they do not answer the question: "What is happening in the video?". Practical applications of computer vision technology require a higher level semantic analysis of the video, beyond a simple identification of the people and objects present in the scene.

The specification of semantic analysis varies based on the application. While surveillance scenarios require a description of human-to-human and human-to-object interactions, applications like gesture recognition and human computer interaction require a human pose level description of the action sequence. There also exist applications like video retrieval where inferring a single generic class label for the entire video is sufficient. The goal of an activity recognition system is to provide such a high level semantic analysis of a video, based on the application requirements. Real life applications require activity recognition systems to be robust to moderate occlusions, and even temporary loss of frames. Furthermore, it is desirable that the recognition algorithm be invariant to changes in actor/object shape and appearance resulting from the human pose dynamics in the activity. Activities containing human-object interactions pose a further challenge due to the difficulty in training reliable object detection algorithms for the large variety of objects encountered in the real world.

1.1 Problem Statement

Our objective is to develop efficient algorithms to recognize human activities in monocular videos captured from static cameras in both indoor and outdoor scenarios. Our focus is on detection and classification of complex human events in un-segmented continuous videos, where the top level event is composed of primitive action components, such as human key-poses or primitive actions. We assume a weakly supervised setting, where only the top level event labels are provided for each video during training, and the primitive action components are not labeled.
We require our algorithm to be robust to missing frames, temporary occlusion of body parts, background clutter, and to variations in activity style and duration. Furthermore, our model should gracefully scale to complex events containing human-human and human-object interactions, while not assuming access to perfect pedestrian or object detection results. Figure 1.1 shows sample frames from the datasets used for evaluating our algorithms.

Figure 1.1: We evaluate our algorithms on the following challenging datasets: (a-c) CMU Action Dataset (d) Composite Cooking Dataset (e) USC-Gestures dataset (f) KTH activity dataset (g) Weizmann dataset (h) UT-Interaction dataset and (i) Rochester ADL dataset.

1.1.1 Proposed Design Philosophy

Over the years, two distinct threads in current research on this topic have emerged. The first focuses on computing global histogram based statistics of local features (like spatio-temporal interest points), and training a discriminative classifier to assign a single activity label to the video. Such methods require only event level annotation for training, and use off-the-shelf, computationally efficient classifiers like SVM. Figure 1.2 shows the flow diagram of global statistical approaches. However, global statistics do not capture the temporal dynamics of the human activity, and are unable to generate a semantic description of the activity. The statistics computed over local features are agnostic to their source (such as humans, objects or scene background), and hence cannot explicitly model semantic associations like human-object interactions.

Figure 1.2: Flow diagram of aggregate statistical approaches. First, interest point features are detected in the spatio-temporal volume. Next, global statistics like histograms of codewords are computed corresponding to each video instance. Finally, the histogram features from training video examples are used to learn a discriminative classifier like Support Vector Machines.

The second category of approaches learns a dynamical model of human motion, and infers the activity based on state transitions observed in the video (Figure 1.3). The definition of a state can range from primitive action based decompositions to limb based dynamics, and the inferred state sequences provide a temporal segmentation of the video. Activity state definitions can capture semantic concepts like human pose configurations and human-object interactions, and can also deal with missing or occluded frames. However, these models are sensitive to underlying assumptions about duration statistics and variations in activity styles, and also impose significant annotation (like joint positions and primitive event labels) and modeling requirements during training.

We propose to design an activity model which combines the global statistics of local features with the high level structure and constraints provided by dynamical graphical models, in a discriminative learning framework. We focus on approaches with minimum annotation requirements, training latent variable models to learn semantic concepts (like human pose and human-object interactions), while at the same time ensuring they are flexible enough to incorporate varying styles and durations of activity dynamics.
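To make the first of these two threads concrete, the following minimal sketch (our illustration, not code from the thesis) follows the aggregate-statistics pipeline of Figure 1.2: local descriptors are quantized against a learned codebook, each video is summarized by a normalized histogram of codewords, and a linear SVM is trained on the histograms. The codebook size, the use of scikit-learn, and the function names are assumptions made for the example.

    # Hypothetical sketch of the aggregate-statistics (bag-of-words) pipeline of Figure 1.2.
    # Descriptor extraction (e.g. HoG/HoF around interest points) is assumed to happen elsewhere.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def build_codebook(train_descriptors, num_words=100):
        # train_descriptors: (N, D) stack of local descriptors pooled from all training videos
        return KMeans(n_clusters=num_words, random_state=0).fit(train_descriptors)

    def bow_histogram(descriptors, codebook):
        # Assign each descriptor to its nearest codeword and count occurrences per video.
        words = codebook.predict(descriptors)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)  # L1-normalize so video length does not dominate

    def train_classifier(video_descriptor_sets, labels, codebook):
        # video_descriptor_sets: list of (N_i, D) descriptor arrays, one per pre-segmented training video
        X = np.stack([bow_histogram(v, codebook) for v in video_descriptor_sets])
        return LinearSVC().fit(X, labels)

Note that the histogram discards where and when each codeword occurred, which is precisely the limitation that the structured models discussed in this thesis are intended to address.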
Figure 1.3: Flow diagram for finite state machine based approaches. The action is represented as a sequence of state transitions, resulting in a Finite State Machine representation of the action. A temporal graphical model like a Dynamic Bayesian Network or a Dynamic Conditional Random Field is used to infer the state assignments in each frame.

We further strive to develop models which do not require manually segmented videos, and are capable of recognizing multiple instances of the activity in an unsegmented video clip.

1.2 Challenges

Human activity recognition in monocular videos is an extremely challenging task in computer vision, due to the inherent diversity and unconstrained nature of the actions which can possibly be observed using a video camera. Furthermore, action recognition systems have to contend with the common challenges faced by any video based computer vision task, such as inadequate resolution, imaging noise, illumination changes, camera jitter and video aliasing. However, even if we consider perfect image acquisition from the sensor module, there still remain significant challenges in designing robust activity recognition algorithms, some of which we describe in the following sections.

1.2.1 Observation Noise

Actions can take place in dynamic and cluttered environments, where simple techniques like background subtraction and motion flow are insufficient to accurately disambiguate the foreground objects from the background scene. Techniques like pedestrian detection and tracking may help in spatial localization of the human figure; however, the track results may contain noise in terms of missed tracks, false positive tracks and mis-aligned tracks. Furthermore, the actor of interest undergoes a variety of articulated pose transitions, with self and external occlusions.

1.2.2 Object Interactions

Certain actions involve interactions with objects such as utensils, snack packets or refrigerators; however, current state of the art object detectors cannot reliably detect these deformable objects, resulting in a semantic gap between the observed features and the high level action label. Furthermore, it is not a scalable approach to train a separate object detector for all the possible objects that can appear in videos.

1.2.3 Variation in Activity Style Dynamics

The same activity can be performed in differing styles by different human beings. For example, to pick up an object, a person might bend down at the waist, or go down on his knees to pick the object up. The two styles involve different sets of pose dynamics, and the activity module should be able to deal with such variations.

1.2.4 Detection in Unsegmented Videos

In real world scenarios, the activity of interest occurs only for a part of a video stream, and automatic temporal segmentation of videos based on their activity content is a difficult task. The video can consist of multiple instances of the same activity, and it is desirable to recognize and segment each activity instance. Even if the video is known a priori to contain only a single action, the action boundaries are rarely aligned with the boundaries of the video, making it a challenging task to temporally localize the action in the video.
1.2.5 Activity Description

Most activity classifiers assign a single class label to a video segment; however, in real world applications like video analytics and data mining, a description of the activity is required. The description can be in terms of the key-poses identified in the activity, or in terms of the primitive events of a composite activity. However, meaningful activity descriptions require semantic state definitions in the activity model, and are in general restricted to dynamical model based approaches, requiring significant manual annotation during training.

1.2.6 Manual Annotation Requirements

Activity models which incorporate the semantic components of the activity, like transitions between key-poses or limb motion dynamics, impose significant manual annotation requirements, ranging from annotating the limb locations of a human to manually defining linear interpolation based dynamics between key-poses. There exist approaches which use motion capture data to avoid manual pose annotations; however, it is expensive and cumbersome to collect such data.

Figure 1.4: Feature-Classifier Hierarchy, with our contributions spanning the hierarchy.

1.3 Our Contributions

Our objective is to combine the global statistics of local spatio-temporal features with the high level structure and constraints provided by probabilistic graphical models. The problem can be addressed at two different stages in the activity recognition framework, namely (1) feature definition and (2) classifier model selection.

Feature definitions can be hierarchically categorized based on the level of semantics they are able to capture in a video. At the bottom of the hierarchy are unstructured interest point features such as 3D Harris Corners [36], Voxels [93] and Dense Trajectories [84], which describe the low level gradient and motion flow information present in the video. Such features are generally source agnostic, and are incapable of differentiating semantic concepts like humans and objects, or even between the foreground and the background scene. At the top of the feature hierarchy are structured, semantically meaningful features, such as human pose estimates [1], object detection [17] and pedestrian detection [23] results. The important choice to make is whether to rely on powerful semantic features upfront, which are difficult to detect and work only in restricted environments, or to extract simple but reliable features and postpone semantic analysis to later modules of an action recognition system.

Classifier models can also be hierarchically categorized based on their capability to model semantic concepts.
At one end of the hierarchy are the aggregate statistical models, such as Bag-of-Words [36] and Topic Models [52], which capture the holistic appearance of actions in videos using a frequency analysis of the visible features. At the other end of the hierarchy are dynamical classification models, which commonly represent the activity as a sequence of state transitions, and model semantic concepts corresponding to human pose transitions and object interactions.

We present four distinct action recognition approaches, each representing a unique combination of features and classification models, spanning the space of the feature/classifier hierarchy described above. First, we propose the LF-HCRF model to learn the local neighborhood structure of interest point features in an aggregate statistical model. Second, we propose the Pose-MKL model, which uses structured observation features such as human pose estimation results in an aggregate statistical model. Third, we propose the PF-HCRF model, consisting of key-pose detection filters in a dynamical graphical model framework. Lastly, we propose the MSDP model for dynamic pooling of local features while satisfying the temporal constraints imposed by a dynamical model. Figure 1.4 shows a summary of our contributions in the feature-classifier hierarchy.

1.3.1 Choosing the Correct Model

An important question to be considered is how to choose the correct algorithm from the ones proposed in this thesis. It is important to understand that a universal algorithm for solving the general activity recognition task in videos is still a worthy, but distant, goal. As a first step towards making an informed choice, engineers need to conduct an inventory of the tools available to them, while keeping in mind their relative merits and drawbacks. Second, and foremost, a deeper understanding of the target scenarios and final objectives is required to properly prioritize the requirements of the activity recognition task. Each of our proposed algorithms targets a specific class of recognition tasks and challenges (Figure 1.5). The MSDP algorithm is a suitable choice for recognizing complex un-aligned long term events in single actor scenarios, whereas PF-HCRF is suitable for detecting key-pose based activities in continuous un-segmented videos containing interactions between multiple actors. Pose-MKL is more suitable for discriminating between actions containing a similar set of key-poses, while LF-HCRF is best used in videos where high level semantic concepts are difficult to model at both the feature and the classifier level. In short, the correct choice is based on the target application and the stated requirements of the task. A brief description of our proposed algorithms follows in the succeeding sections.

1.3.2 LF-HCRF: Learning co-occurrence statistics of local-features

Local-feature based statistical models compute a single aggregate statistic of the features over the entire spatio-temporal volume, and ignore the spatial and temporal distribution of the features. There exist methods which use a variety of global spatio-temporal binning structures to capture the relative layout of interest point features.
Figure 1.5: Relative strengths and target space of our proposed algorithms.

However, aggregate statistics combined with global bin partitions ignore local neighborhood relationships, and are only able to capture global relationships between the partitions. Our objective is to model local spatio-temporal relationships using statistics which are pairwise (instead of aggregate), and feature-centric with a notion of local neighborhood associated with them (instead of global clip level partitions). We propose modeling the neighborhood relationships in terms of a count function, which measures the pairwise co-occurrence frequency of codewords. We describe a transformation to represent the count function in terms of the edge connectivity of the hidden variables of a CRF classifier, resulting in the Local Feature HCRF (LF-HCRF) model, and explicitly learn the co-occurrence statistics as a part of its maximum likelihood objective function.

1.3.3 Pose-MKL: Statistical pooling of multiple pose distributions

Local features like STIPs represent the salient locations in a video volume; however, they do not have any semantic structure or interpretation associated with the individual features. They rely on higher level model constructs (like dynamical models and discriminative linear classifiers) for semantic interpretation, if any. We propose an alternative approach using structured observation features, specifically inferred human pose distributions. Human pose estimation is more reliable for poses which are similar to the mean-pose configuration represented in the pose prior, and it is hard to design a single prior which works for all poses present in an activity. Recent work using human pose for activity recognition relies on structured dynamical models to choose the appropriate pose prior; however, such methods require manual construction of activity models, and are limited to a narrow set of motion styles and durations. We propose computing aggregate statistics over the pose detection distributions from multiple pose priors, and pooling them together in a multiple kernel learning (MKL) framework.

1.3.4 PF-HCRF: Pose filter based dynamical models

Structured observation features like human pose estimates are sensitive to deviations from the pose prior, and are computed independently of the activity sequence. We propose learning discriminative pose-filter detectors, which are trained to detect a few key-poses in an activity. Human activities can be described in terms of a sequence of transitions between key-poses, where key-poses represent the important human pose configurations in an activity. Previous key-pose based approaches use dynamical models, like HMMs, to infer the key-pose assignment in each frame, where the number of random variables to be inferred is proportional to the number of frames. To keep the inference tractable, they further require a Markovian assumption between observations from adjacent frames, and do not capture the global distribution of the features.
We argue that for activity classification it is sufficient to determine only the presence or absence of a state in an observation sequence, while ensuring that certain temporal relationships between the state detections are satisfied. We are thus motivated to propose an alternative graphical model for activity detection, where the random variables to be inferred are the temporal locations of the states, which are much fewer than the number of frames. We learn detection filters for each key-pose, which along with a bag-of-words root filter are combined in a conditional random field model (called the Pose Filter HCRF, or PF-HCRF, model), whose parameters are learned using latent-SVM optimization. In summary, we prefer a more generalizable and tractable model, at the expense of per-frame state descriptions.

1.3.5 MSDP: Multi-state dynamic pooling using segment selection

Recognizing long range complex events composed of a sequence of primitive actions is a challenging task, as videos may not be consistently aligned in time with respect to the primitive actions across video examples. Moreover, there can exist arbitrarily long intervals between, and within, the executions of primitive actions. We propose a novel multi-state segment selection algorithm, which pools features from the discriminative segments of a video. We present an efficient linear programming based solution, and introduce novel linear constraints to enforce temporal ordering between segments from different states. We also propose a regularized version of our algorithm, which automatically determines the optimum number of segments to be selected in each video. Furthermore, we present a new, provably faster O(N log N) algorithm for the single state K-segment selection problem. Our results are validated on the Composite Cooking Activity dataset, containing videos of cooking recipes.

1.4 Thesis Outline

The thesis is outlined as follows. We begin in Chapter 2 with a brief overview of activity recognition models proposed in the literature using dynamical graphical models and local-feature based statistical models. Chapter 3 presents our framework for learning neighborhood co-occurrence statistics of local features using the Local Feature HCRF. In Chapter 4, we describe our method for statistical pooling of pictorial structure based pose detections using the Pose-MKL model. Chapter 5 presents our framework for detecting key-poses in a video using the Pose Filter HCRF model, and Chapter 6 presents our dynamic feature pooling framework for multi-state models. Finally, in Chapter 7 we summarize our contributions and results, and present a discussion of possible future directions.

Chapter 2: Human Activity Recognition in Videos - An Overview

We present a brief overview of human activity recognition algorithms proposed in the community, with a focus on their semantic structure modeling capabilities.

2.1 Structured Dynamical Models (SDMs)

Human activities are a composition of interactions between basic semantic concepts like the location and motion of people in a scene, human pose dynamics, human-object interactions, etc. These interactions take place in a structured spatio-temporal sequence, and are unique to each activity. SDMs are designed to explicitly model the structure present in activities.

At the highest level, activities can be described as a set of logical rules, using either propositional or first order logic. These rules define the spatio-temporal constraints between the interacting agents composing the activity.
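As a purely illustrative example of such a rule (our notation; the predicates are hypothetical and not taken from the thesis), a first order logic rule for a pickup event could constrain the temporal ordering of two lower-level detections:

    \forall p, o:\; \text{PickUp}(p, o) \Leftarrow \text{BendDown}(p) \wedge \text{Holding}(p, o) \wedge \text{Before}\big(\text{BendDown}(p), \text{Holding}(p, o)\big)

where p ranges over actors, o over objects, and Before is an interval relation in the style of Allen's interval logic.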
To facilitate learning and inference using these rules, the logical rules can be reduced to the state definitions of a finite state machine. Probabilistic Graphical Models (PGMs) have been successfully applied to training and inference of state based machines under observation uncertainty. Both generative and conditional training algorithms have been proposed to manage the complex inter-dependencies present in the observation distribution. To account for unknown state definitions and complex state spaces, latent variable models have been introduced to learn the activity structure directly from training data.

2.1.1 Logic and Grammar

We present only a brief review of logic and grammar based algorithms, as these are not the focus of our work. The spatio-temporal structure can be encoded in terms of propositional logic [66] or first order logic [80, 44]. [8] proposed a probabilistic event logic (PEL) algorithm to incorporate uncertainty in first order logic statements and the underlying observation stream. [80] incorporate probabilistic reasoning using first order logic by proposing a Markov Logic Network (MLN) constructed from weighted rules. [44] proposed rules based on Allen's interval logic for a 2-player basketball scenario, and incorporated them into an MLN. However, such probabilistic logic inference algorithms cannot deal with the quantification aspect of first order logic, and must enumerate over all possible values of existential/universal quantifiers. This makes them unsuitable for scaling to more generic scenarios.

A set of rules representing all possible variations in activities does not scale well, and hence motivates finding the underlying grammar for generating the rules. Context Free Grammars (CFG) [66] provide a generic framework to represent the hierarchy and periodic nature of human activities. To deal with observation uncertainties, Stochastic Context Free Grammar (SCFG) based approaches have been proposed [25]. However, there is limited work on learning the grammar/logic rules directly from the data, and such techniques require a domain expert to define all possible production rules for the grammar. Such methods have been restricted to highly controlled environments like professional sports or highly restricted gesture based activities.

2.1.2 Activity Modeling using Finite State Machines (FSM)

Human activities can be described as a sequence of discrete states, where each state represents a unique spatio-temporal configuration of the human in the underlying feature space. The representation is equivalent to restricting human activities to Type-3 grammars (regular grammars) in the Chomsky hierarchy. While this simplifies the action representation and the corresponding inference algorithms, it is limited in its representation power compared to context free grammars.

Finite State Machines compactly represent the possible state transitions in an activity, and have been extensively used for recognizing human activities. Some of the earlier works, like Yamato et al [94], learn FSM models to recognize simple tennis actions. Fengjun et al [40] present an Action Net model, which is essentially an FSM with each state representing a key pose in the activity. Chan et al [10] manually define semantic primitives to model rare events in an airport sequence. [24] define states based on primitive limb motions, resulting in a limb activity model.
More recently, [49, 72] use state definitions based on linear interpolations between poses to detect complex human activity sequences in cluttered environments.

The FSM models can range from left-to-right chain transitions, defined using a diagonal transition matrix like the ones used in [10] where every state is visited only once, to densely connected machines with multiple transitions across states [24], while others are composed of multiple chain FSMs with transitions between chains occurring only at the start and end states [40]. Manually defined states generally have a semantic meaning associated with them; for example, the states in [10] correspond to distinct load and unload events in an airport scenario. State machines can also be learned directly from the data [94], and may not have any semantic meaning associated with their states.

2.1.3 Classification and Inference using State Machines

The classification task in FSM models requires first predicting the most likely sequence of states given an observation model, and second, choosing the best possible FSM explaining the given observation stream. Dynamical models, like Dynamic Bayesian Networks (DBNs) and Dynamic Conditional Random Fields [77], provide a generic framework for inferring the state transitions of FSM models under observation uncertainty. The state assignments in certain restricted graph structures (like trees and chains) can be efficiently inferred using message passing algorithms like Belief Propagation, or its Max-Product variant, the Viterbi algorithm. For generic graph structures, there exist approximate inference algorithms based on loopy belief propagation and on sampling based methods like MCMC and particle filtering. The video category can be determined by including a random variable corresponding to the activity class in the graphical model, and inferring its most likely value given the observed data. An alternative approach learns a separate graphical model for each activity class, and their likelihood probabilities over the observed data are compared to determine the final class assignment.

[Combining Multiple FSMs] Linear chain models like HMMs and chain-CRFs [48] model a single FSM sequence at a time, while other dynamical models like Parallel-HMMs and Coupled-HMMs combine multiple FSMs into a single model. [83] use Parallel HMMs for sign language recognition, with the left and right hands modeled as parallel FSM sequences. [6] model Tai chi actions using Coupled-HMMs, where each limb of the body is a separate FSM, linked together using coupling constraints. Both Parallel-HMMs and Coupled-HMMs are essentially different ways of factorizing a larger state space into smaller, tractable FSMs with a limited number of states and transitions, resulting in computationally feasible training and inference algorithms.

[Semi-Markov Models] Dynamical models like HMMs and chain-CRFs follow the first order Markov principle, where state transitions are a function of only the current state. The state duration in these models follows a geometric distribution, where the probability of a state being observed for a given duration decreases exponentially with the length of the duration. However, human activity states rarely follow a geometric distribution, and can often have long temporal durations, like a person walking or a vehicle driving steadily on a long road.
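To make the duration assumption explicit, suppose a state has self-transition probability a (notation we introduce here for illustration; the thesis does not fix a symbol for it). Under a first order Markov model, the probability of remaining in that state for exactly d frames is geometric:

    P(d) = a^{d-1}\,(1 - a), \qquad d = 1, 2, \ldots

so the probability mass assigned to long durations decays exponentially in d.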
In such cases, the probability of the FSM sequence remaining in the same state exponentially approaches zero, forcing a state transition even in the presence of contrary observations. Semi-Markov models like the HSMM [22, 47] were introduced to model the time duration statistics of a state, using a semi-Markovian process to describe the state transitions. More recently, Tang et al [78] proposed a variable duration model for conditional HMMs.

[Hierarchical Models] Human activities in general contain a natural hierarchy, where high level composite events are composed from sequences of primitive events. Recent work has explored various means of incorporating such hierarchy in a dynamical model. [24] propose a hierarchical link-up between primitive limb action models into a larger network of activity models. [49] define primitive events as basic limb motions like rotate, flex and pause, which are combined together into composite events like walk, wave, pickup, etc. A hierarchical model can share primitive events across different composite events, greatly improving the scalability of such methods. However, automatic learning of the hierarchy is difficult, and hierarchies are in general manually designed using prior domain knowledge.

2.2 Local-Feature based Statistical Models (LFSM)

Unstructured activity models classify videos based on the distribution of localized spatio-temporal features. These features are extracted directly from the video using image statistics like gradients and frequency transforms, and hence avoid the problems associated with mid level modules like object segmentation.

2.2.1 Interest Point Detection

Significant spatio-temporal locations in a video are determined by maximizing a saliency optimization function. In general, the salient points have strong variations along both the spatial and temporal dimensions, and can be assumed to be characteristic of a particular event in the video. The theory of interest points has been well validated for object detection and recognition in images, hence the motivation to extend it to the combined spatio-temporal volume. Many different saliency optimization functions have been proposed, ranging from temporal Gabor filters [12], biologically inspired hierarchical C2 feature functions [26] and Hessian based localization functions [92], to 3D Harris corner based functions [35]. Some of these features, like [12, 92], produce dense interest points, whereas others, like [35, 26], produce sparse interest points.

2.2.2 Feature Descriptors

Feature descriptors capture the pixel level image statistics present around the interest points. The most common descriptors used are the Histogram of Gradients (HoG) and the Histogram of Flow (HoF) features introduced in [36], which respectively capture the characteristic shape and motion present in the video. HoG descriptors have been shown to be very successful in the object detection community, and HoF is a natural extension of the gradient based descriptors to the temporal domain. Other descriptors reported in the community are the HOG3D descriptor [32], which is a temporal extension of the well known SIFT descriptor, and the extended SURF descriptor [92], based on weighted components of Haar wavelets. The correct choice of descriptor is based on our requirements of computational speed, available temporal window size, and invariance to rotation and affine transformations.
Wang et al [85] provide a detailed evaluation of multiple feature descriptors over multiple activity datasets, and found HoG/HoF features to give good performance in most applications. However, these features provide only a local description around the interest points, and do not carry any semantic knowledge of the objects and humans in the scene, making it difficult to infer the primitive events in the overall activity.

Alternatively, some techniques focus on the global spatio-temporal information about where the interest points are spatially located, and when they are detected on the temporal axis. These methods dispense with the need for defining image based descriptors. Sun et al [75] construct trajectories from the interest points, and build a hierarchy of features based on interest point and trajectory co-locations. Others, like [7, 20], construct descriptors based solely on the (x, y, t) locations of interest points.

2.2.3 Feature Quantization

The HoG/HoF descriptors are high dimensional feature vectors, and the number of feature points in a video is not constant. Hence a quantization technique is necessary to summarize the descriptors into a more manageable set. The most common approach is to learn a vocabulary of codewords from the feature vectors, and assign a single codeword to each vector. The video content is then summarized using a Bag of Words (BoW) based representation, most commonly a histogram of codewords, which measures the frequency of occurrence of each codeword. A variety of techniques have been proposed based on the BoW framework. In general, a suitable clustering algorithm is used to construct the vocabulary of codewords, such as KMeans [70] or Mutual Information based clustering [27]. Liu et al [38] perform PageRank based feature mining before constructing the vocabulary, to prune away the uninformative features. [34] construct a hierarchy of vocabularies, with the histogram of codewords at each level serving as the input features for the next level of vocabulary. The final classifier can range from a max-margin SVM based classifier [70, 41] to Adaboost classifiers [38]. There also exist unsupervised algorithms, like the probabilistic Latent Semantic Analysis (pLSA) model and the Latent Dirichlet Allocation (LDA) model [53], which learn intermediate topics over the feature codewords, and automatically discover the categories present in the dataset.

2.2.4 Incorporating Structure in UAMs

Recently, the emphasis has been on incorporating structural cues into the unstructured bag-of-words based approaches. Structural cues can be in the form of codeword co-occurrences, body part kinematic constraints, or even the temporal evolution of the interest points in the activity. [27] learn the spatio-temporal structure using spatial correlation of the codewords. Niebles et al [52] learn a constellation of bags-of-words which models the mutual geometric relationships among different parts, where parts are clusters of video features. [42] model the velocity of interest point trajectories using a mixture of Markov chains, whereas [75] model the dynamic properties of IP trajectories using Markov chains, subsequently combined in a Multiple Kernel Learning framework. [41] use scene context derived from movie scripts as cues; however, the framework is more suited for movie video segments.
2.2.4 Incorporating Structure in UAMs

Recently, the emphasis has been on incorporating structural cues into the unstructured bag-of-words based approaches. Structural cues can take the form of codeword co-occurrences, body part kinematic constraints, or even the temporal evolution of the interest points in the activity. [27] learn the spatio-temporal structure using spatial correlation of the codewords. Niebles et al [52] learn a constellation of bags-of-words which models the mutual geometric relationship among different parts, where parts are clusters of video features. [42] model the velocity of interest point trajectories using a mixture of Markov chains, whereas [75] model the dynamic properties of IP trajectories using Markov chains, subsequently combined in a Multiple Kernel Learning framework. [41] use scene context derived from movie scripts as cues; however, the framework is more suited to movie video segments. Moreover, the structure learned using such techniques generally lacks a semantic interpretation, and cannot describe the constituent elements of the activity such as pose, objects or salient dynamics.

2.3 Generative vs Discriminative Models

A classification (or labeling) task in computer vision is typically modeled using the joint probability distribution $P(X, Y)$, where $X$ is the set of input variables corresponding to the input video data, and $Y$ is the set of output variables corresponding to the class/state labels of the activity in the video. The classification problem is solved by estimating the best possible assignment of the label variable set $Y$, given the input data $X$ and the distribution parameters $\Theta$:

$$ y^* = \arg\max_{y \in Y} P(X, Y = y; \Theta) = \arg\max_{y \in Y} P(Y = y \mid X; \Theta) \qquad (2.1) $$

The distribution parameter $\Theta$ is estimated during training by maximizing either the joint likelihood $\prod_i P(X_i, Y_i)$, resulting in a generative model, or the conditional likelihood $\prod_i P(Y_i \mid X_i)$, resulting in a discriminative (or conditional) model. Discriminative models ignore the data distribution $P(X)$, whereas generative models incorporate it during parameter estimation. We refer the reader to [76] for further details on parameter estimation of generative and discriminative models.

Discriminative parameter learning directly optimizes the classification criterion (Equation 2.1), and hence performs better in classification tasks compared to generative models. Discriminative models also have fewer parameters to learn, leading to more scalable models. However, generative models show better performance with noisy input observations, as they also learn the data distribution $P(X)$. Furthermore, generative models are able to simulate observation data given the classification labels, $x \sim P(X \mid y)$, which is useful for semantic analysis of the learned activity model.

2.3.1 Application to SDMs and LFSMs

Both structured dynamical models and local-feature based statistical models can be trained in the generative and discriminative frameworks. SDMs like HMMs and DBNs are learned in a generative framework. Modeling the data distribution $P(X)$ is an intractable problem, as it implies learning a distribution over all possible observable features. Linear chain generative models, like HMMs, assume conditional independence between features from different frames, given their corresponding state assignments. Dynamic conditional random fields are the corresponding discriminatively trained SDMs, which also include the linear chain-CRF model [73]. In contrast to HMMs, they ignore the data distribution $P(X)$, while modeling contextual dependencies between class labels and time-separated observation features.

Discriminative classifiers like the Support Vector Machine (SVM) and Multiple Kernel Learning (MKL) are commonly used for classifying pooled statistics of local features. Alternatively, topic models like Latent Dirichlet Allocation (LDA) and probabilistic Latent Semantic Analysis (pLSA) [53] are learned in a generative framework, where the distribution of observed features is automatically categorized into topics. [52] propose a part based hierarchical model for activity classification, which learns the geometric relationships between the image features in a generative framework.
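The distinction above can be made concrete with two off-the-shelf classifiers that share the decision rule of Equation 2.1 but optimize different likelihoods during training; the sketch below uses toy data and standard scikit-learn estimators purely for illustration, not the activity features discussed in this thesis.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: fits P(X | Y) and P(Y)
from sklearn.linear_model import LogisticRegression   # discriminative: fits P(Y | X) directly

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                         # toy stand-in for pooled video features
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

generative = GaussianNB().fit(X, y)              # parameters maximize the joint likelihood
discriminative = LogisticRegression().fit(X, y)  # parameters maximize the conditional likelihood

# Both classifiers apply the same decision rule as Equation 2.1:
#   y* = argmax_y P(Y = y | X)
print(generative.predict(X[:3]), discriminative.predict(X[:3]))
```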
2.4 Latent Variable Models

Latent (or hidden) variables are not directly observable, and are instead inferred along with the distribution parameters during model training. Latent variables represent the unobservable data in an action model, and are useful in describing abstract and conceptual knowledge which cannot easily be translated into measurable quantities in the real world. In activity recognition, latent variables are used to capture complex spatio-temporal dynamic states, as in HMMs and Latent-DCRFs [45], or to represent underlying concepts (or topics) in the feature space [53]. Part based representations of activities and objects have been proposed, where the part identity is a latent variable and the action/object labels are inferred using Hidden Conditional Random Fields (HCRFs) [56, 88]. Latent variables can be viewed as a form of dimensionality reduction, where complex feature spaces are aggregated into simpler concepts; in general, however, it is difficult to associate any semantic interpretation with the inferred latent values. Parameter estimation in latent variable models requires large amounts of training data, and uses non-convex optimization resulting in convergence to local minima.

Chapter 3
LF-HCRF: Learning Neighborhood Co-occurrence Statistics of Local Features

In this chapter, we consider the problem of learning the neighborhood structure of low-level localized features lacking any semantic meaning, and propose a graphical model framework for the same. Recent methods have focused on leveraging the spatio-temporal neighborhood structure of the features, but they are generally restricted to aggregate statistics over the entire video volume, and ignore local pairwise relationships. Our objective is to capture these relations in terms of pairwise co-occurrence statistics of spatio-temporal codewords for the task of human activity recognition in videos.

3.1 Introduction

While there exist approaches which model human actions as a sequence of key states of a Probabilistic Graphical Model (PGM) [59, 73, 40], in the following work our focus is on approaches which avoid explicit modeling of the pose and dynamics of the human body, and also do not require the ability to detect and track the actor, which are inherently difficult tasks by themselves. An alternative approach is to directly model the human appearance and motion in the video, using methods based on template matching [4, 62], shape flow correlation [30] and interest point (IP) tracking [15, 42]; these approaches are sensitive to viewpoint variation and background clutter. A popular local-feature based statistical approach is to classify videos based on spatio-temporal interest points (STIPs) [70, 35, 12, 26, 27, 67]; such methods do not require person detection results and are robust to background clutter. Due to their localized and unstructured nature, they have been shown to be invariant to temporal and appearance variations across videos. Nevertheless, these methods typically ignore the spatial and temporal distribution of the features and rely only on a bag of features based model. For example, Schuldt et al [70] assume individual feature detections to be independent and construct a single histogram for the entire video (Figure 3.1.a), and hence capture a single aggregate statistic of the codewords over the entire spatio-temporal volume. To address these issues, there exist methods [36, 74, 75] which use a variety of binning structures to capture the relative layout of the STIPs (Figure 3.1.b); however, the bin partition boundaries are rigid, and hence can be sensitive to spatial or temporal shifts of the activity segment in the video volume.
Aggregate statistics combined with global bin partitions ignore local neighborhood relationships, and are only able to capture global relationships between the partitions. Consider the video volume in Figure 3.1.c, where the feature-centric neighborhood of codeword 'b' is shown as a red sphere. We observe that codewords b and c co-occur quite frequently, whereas a and c rarely co-occur in the same neighborhood. Such local relationships can be captured by statistics which are pairwise (instead of aggregate), and feature-centric with a notion of local neighborhood associated with them (instead of global clip-level partitions).

[Figure 3.1: (a) Histogram computed over the entire video volume. (b) Histogram computed over global spatio-temporal bins. (c) The feature-centric neighborhood of codeword b, shown as a red sphere, with co-occurrence relationships transformed into edges of a graph.]

We aim to address these issues and model the neighborhood relationships in terms of a count function which measures the pairwise co-occurrence frequency of codewords. We describe a transformation to represent the count function in terms of the edge connectivity of the latent variables of a Conditional Random Field (CRF) classifier, and explicitly learn the co-occurrence statistics as a part of its maximum likelihood objective function. The probabilistic nature of our method allows us to naturally incorporate codebook learning into the CRF learning process, and hence retains the discriminative power of the STIP descriptors. The resulting latent CRF shares the same parametrization as a Hidden Conditional Random Field (HCRF) [56, 88], and hence we can leverage existing training and inference algorithms. Our method is transparent to the type of STIP detector used in the observation layer. We show results using the sparse 3D Harris corner interest points [35]. We evaluate our framework on the Weizmann [4] and KTH [70] activity datasets, and compare with other existing approaches.

3.2 Related Work

The literature on human activity recognition is too large to be covered here. We focus our survey on interest point based methods and probabilistic graphical models.

Numerous interest point based methods have been proposed in the activity recognition community [35, 12, 26, 27, 7]. Recent literature has focused on learning the neighborhood distribution of the interest points in the spatio-temporal volume. Laptev et al [36] proposed using multi-channel non-linear SVMs, where each channel tries to capture a different type of constraint on the interest points. Kovashka et al [34] propose a framework for learning a hierarchy of features to capture the shape of the space-time feature neighborhood. Bregonzio et al [7] capture the global spatio-temporal distribution of interest points by extracting features from a cloud of interest points. Sun et al [75] use SIFT feature trajectories and learn their spatial co-occurrence in a Multiple Kernel Learning (MKL) framework. Niebles et al [52] propose a generative hierarchical model, characterized as a constellation of bags-of-features, which models the mutual geometric relationships between different features.

We briefly survey some of the approaches for activity recognition using Probabilistic Graphical Models (PGM).
Generative models like Bayesian Networks have been used to capture object context for actions [21], to model key pose changes [40, 49], and to learn unsupervised topic models for activities [53]. Discriminative models like a CRF network [73] have been used to model the temporal dynamics of silhouette based features like shape context and pairwise edge features. Wang et al [87] proposed a Hidden Conditional Random Field (HCRF) for gesture recognition, which introduced a latent variable layer in the CRF network. Morency et al [45] model the dynamics between gesture labels by using a Latent-Dynamic CRF model. Wang et al [88] learn a part based model using an HCRF on each frame, with optical flow patches as features, and obtain the final class label via majority voting. This was extended to a max-margin learning framework in [89].

3.3 Learning Co-Occurrence Statistics of STIP Features

We propose a structured activity model for learning the neighborhood relationships of STIPs. We define the neighborhood relationships in terms of the co-occurrence statistics of the codewords assigned to the interest points. We formally define the co-occurrence statistic as a count function $C(b, c)$ (where $b$ and $c$ are codewords), which counts the number of times codewords $b$ and $c$ co-occur in the same neighborhood. Interest point $x_j$ lies in the neighborhood of $x_k$ iff $x_j \in Nb(x_k)$. We introduce suitable neighborhood functions $Nb(\cdot)$ in Section 3.3.4. As an example, Figure 3.2(a) shows the neighborhood of interest points with codeword $b$ assigned to them. The codewords $b$ and $c$ co-occur quite frequently, whereas $a$ and $c$ rarely occur in the same neighborhood. Such statistics can be quite useful in discriminating between different class categories.

We describe the interest point detector used in Section 3.3.1. Section 3.3.2 introduces a logistic regression model for classifying histogram of codewords based features. We extend the framework to include potential functions over hidden variables in a CRF model for learning the codeword assignments. We show a transformation of the $C(b, c)$ function in terms of edge connectivity in the CRF model. Finally, we combine the potential functions in a Hidden Conditional Random Field (HCRF) [56]. Section 3.3.3 describes the HCRF classifier and its maximum likelihood training. Lastly, in Section 3.3.4, we describe examples of neighborhood functions used to compute the co-occurrence statistics.

[Figure 3.2: Co-occurrence statistics of codewords: (a) The neighborhood of codeword b is shown in the spatio-temporal volume. Codewords b and c co-occur often, whereas a and c rarely co-occur. (b) Reduction of neighborhood relationships to edges in a graph.]

3.3.1 Spatio-Temporal Interest Point Features

Interest points provide a compact representation of the motion patterns present in the video volume. Such approaches work under the assumption that the underlying motion pattern provides a sufficient description of the activity present in the video. In our proposed framework, we selected the 3D Harris corner detector [35]. It has been extensively used in the human activity recognition literature, and provides a good baseline for comparing our framework with existing systems. However, our framework is transparent to the type of interest point used in the observation layer, which can be replaced by any other interest point detector.
Laptev [35] extends the Harris corner function to the spatio-temporal domain, where the interest points correspond to the points of maxima of the following function:

$$ H = \mathrm{Det}(\mu) - k\,\mathrm{Trace}^3(\mu) = \lambda_1 \lambda_2 \lambda_3 - k\,(\lambda_1 + \lambda_2 + \lambda_3)^3 \qquad (3.1) $$

where $\mu$ is a spatio-temporal second moment matrix composed of first order spatial and temporal derivatives averaged using a Gaussian weighting function, and $\lambda_1, \lambda_2, \lambda_3$ are the eigenvalues of the matrix $\mu$. Hence for each video we obtain a set of feature vectors $X = \{x_j\}_{j=1}^{N_i}$, where $x_j$ is the $K$-dimensional feature descriptor of the $j$-th interest point. In our experiments we use a 144 dimensional combined HoG-HoF descriptor, which captures the shape and flow information in the neighborhood of the interest point.
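To make the response function of Equation 3.1 concrete, the following is a minimal sketch of how it can be evaluated for one candidate point, assuming the 3x3 second-moment matrix has already been computed; the constant k shown is a commonly quoted placeholder value, not necessarily the one used in this thesis.

```python
import numpy as np

def harris3d_response(mu, k=0.005):
    """Spatio-temporal Harris response of Equation 3.1 for one candidate point.

    mu is the 3x3 second-moment matrix of first-order spatial/temporal derivatives,
    averaged with a Gaussian weighting function."""
    lam = np.linalg.eigvalsh(mu)          # eigenvalues lambda_1, lambda_2, lambda_3
    return lam.prod() - k * lam.sum() ** 3

# Interest points are the local maxima of this response over (x, y, t) and scale.
```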
3.3.2 CRF Model for a Bag-of-Words Classifier

Let $Y$ be the set of class labels and $X = \{x_1, x_2, \ldots, x_m\}$ be the set of $m$ interest point (IP) detections in the video volume, where $x_j \in \mathbb{R}^K$ is the $K$-dimensional feature descriptor vector of the $j$-th IP. The classical approach is to learn a dictionary of codewords $H$ using K-Means clustering. The $j$-th IP is assigned a codeword $CW(x_j) \in H$ based on its nearest cluster center. A histogram of codewords $g \in \mathbb{R}^{|H|}$ is constructed over the video volume. The histogram $g = (g_1, g_2, \ldots, g_{|H|})$ can be computed using an indicator function as follows (the indicator $\mathbf{1}_{a=b}$ returns 1 if $a = b$ is true, and 0 otherwise):

$$ g_i = \sum_{x_j \in X} \mathbf{1}_{CW(x_j) = i} \qquad (3.2) $$

The codeword assignment function is represented by the vector $\mathbf{h} = (h_1, h_2, \ldots, h_m)^T$. We define each element $h_j \in H$ to denote the codeword associated with the $j$-th interest point, and hence we have $\forall x_j \in X : h_j = CW(x_j)$. The logistic regression classifier (a type of CRF classifier) is defined over the class label $y \in Y$ and the histogram feature vector $g$ as follows:

$$
\begin{aligned}
P(y \mid X; \Theta) = P(y \mid \mathbf{h}, X; \Theta) &\propto \exp\Big\{ \sum_{a \in Y} \mathbf{1}_{y=a}\, \theta_a^T g \Big\} = \exp\Big\{ \sum_{a \in Y} \mathbf{1}_{y=a} \sum_{i \in |H|} \theta_{ai}\, g_i \Big\} \\
&= \exp\Big\{ \sum_{a \in Y} \mathbf{1}_{y=a} \sum_{i \in |H|} \theta_{ai} \sum_{x_j \in X} \mathbf{1}_{CW(x_j)=i} \Big\} = \exp\Big\{ \sum_{x_j \in X} \sum_{a \in Y} \sum_{i \in |H|} \theta_{ai}\, \mathbf{1}_{y=a}\, \mathbf{1}_{h_j=i} \Big\} \\
&= \exp\Big\{ \sum_{x_j \in X} \theta^T \phi(y, h_j) \Big\}, \qquad \forall j : h_j = CW(x_j)
\end{aligned} \qquad (3.3)
$$

where $\phi(y, h_j)$ is a function which returns a $|Y| \cdot |H|$ dimensional vector whose elements are the products of the indicator functions $\mathbf{1}_{y=a} \mathbf{1}_{h_j=i}$. The weight vector $\theta$ can be learned by maximizing the conditional likelihood over the training data. Figure 3.3(a) shows the corresponding graphical model of the CRF, in this case a logistic regression model.

[Figure 3.3: CRF formulation for the Bag-of-Words classifier: (a) Logistic regression model with pre-assigned codewords as observations. (b) CRF with codeword assignments determined by hidden variables, i.e. not observable in the train/test data. (c) CRF representing co-occurrence statistics of codeword assignments using hidden layer connectivity.]

3.3.2.1 Hidden Variables for Learning Code-Words

The function $\phi(y, h_j)$ takes only the codeword assignment as input, and hence the CRF learning is independent of the actual interest point descriptor $x_j$. We generalize it to incorporate the codeword assignment into the maximum likelihood learning process. We remove the restriction $\forall x_j \in X : h_j = CW(x_j)$, and declare $\mathbf{h}$ to be a vector of random variables whose associated probability mass function $P(\mathbf{h} \mid y, X; \Theta)$ we would like to learn as a part of the training process. An interesting consequence is that each interest point has a class conditional distribution of codewords $P(h_j \mid y, x_j; \Theta)$ associated with it, instead of a single codeword per interest point. We define $P(\mathbf{h} \mid X; \Theta)$ in terms of a potential function $\phi(x_j, h_j)$ as follows:

$$ P(\mathbf{h} \mid X; \Theta) \propto \exp\Big\{ \sum_{x_j \in X} \omega^T \phi(x_j, h_j) \Big\} = \exp\Big\{ \sum_{x_j \in X} \sum_{c \in H} \mathbf{1}_{h_j=c}\, \omega_c^T x_j \Big\} \qquad (3.4) $$

The vector $\mathbf{h}$ is not observable in the data, hence it is treated as a hidden variable. Figure 3.3(b) shows the modified CRF formulation, whose class posterior is given as:

$$ P(y \mid X; \Theta) = \sum_{\mathbf{h} \in H^{|V|}} P(\mathbf{h} \mid X; \Theta)\, P(y \mid \mathbf{h}, X; \Theta) \propto \sum_{\mathbf{h} \in H^{|V|}} \exp\Big\{ \sum_{x_j \in X} \omega^T \phi(x_j, h_j) + \sum_{x_j \in X} \theta^T \phi(y, h_j) \Big\} \qquad (3.5) $$

3.3.2.2 Incorporating Co-Occurrence Statistics

We model the neighborhood structure of the interest points, i.e. how often two codewords co-occur in a spatio-temporal neighborhood. We quantify this co-occurrence statistic using a count function $C(b, c)$, which counts the number of times codewords $b$ and $c$ co-occur in the same neighborhood. We define an indicator function $\psi(x_j, x_k)$:

$$ \psi(x_j, x_k) = \begin{cases} 1 & \text{if } x_j \in Nb(x_k) \\ 0 & \text{otherwise} \end{cases} \qquad (3.6) $$

We model the indicator function in terms of a graph: let the nodes $h_j\ (j = 1, \ldots, m)$ correspond to the vertices of a graph $G = (E, V)$, where the set of edges $E$ is given as:

$$ E = \{ (j, k) : j, k \in V,\ \psi(x_j, x_k) = 1 \} \qquad (3.7) $$

i.e. there is an edge connecting $h_j$ and $h_k$ if and only if $x_j$ lies in the neighborhood of $x_k$. Figure 3.2(b) shows the graph transformation of the co-occurrence relationships present between the codewords. The count function can now be formally defined as:

$$ C(b, c) = \sum_{(x_j, x_k) \in X} \mathbf{1}_{CW(x_j)=b}\, \mathbf{1}_{CW(x_k)=c}\, \psi(x_j, x_k) = \sum_{(j,k) \in E} \mathbf{1}_{h_j=b}\, \mathbf{1}_{h_k=c} \qquad (3.8) $$

The conditional class label distribution is defined as:

$$
\begin{aligned}
P(y \mid \mathbf{h}, X; \Theta) &\propto \exp\Big\{ \sum_{a \in Y} \mathbf{1}_{y=a}\, \eta_a^T C(\cdot) \Big\} = \exp\Big\{ \sum_{a \in Y} \sum_{b \in H} \sum_{c \in H} \eta_{a,b,c}\, \mathbf{1}_{y=a}\, C(b, c) \Big\} \\
&= \exp\Big\{ \sum_{(j,k) \in E} \sum_{a \in Y} \sum_{b \in H} \sum_{c \in H} \eta_{a,b,c}\, \mathbf{1}_{y=a}\, \mathbf{1}_{h_j=b}\, \mathbf{1}_{h_k=c} \Big\} = \exp\Big\{ \sum_{(j,k) \in E} \eta^T \psi(y, h_j, h_k) \Big\}
\end{aligned} \qquad (3.9)
$$

Combining the binary and ternary functions, we get:

$$ P(y \mid \mathbf{h}, X; \Theta) \propto \exp\Big\{ \sum_{j \in V} \theta^T \phi(y, h_j) + \sum_{(j,k) \in E} \eta^T \psi(y, h_j, h_k) \Big\} \qquad (3.10) $$

The resulting CRF formulation is given as follows:

$$ P(y \mid X; \Theta) \propto \sum_{\mathbf{h} \in H^{|V|}} \exp\{ \Phi(y, \mathbf{x}, \mathbf{h}; \Theta) \}, \qquad \Phi(y, \mathbf{x}, \mathbf{h}; \Theta) = \sum_{j \in V} \omega^T \phi(x_j, h_j) + \sum_{j \in V} \theta^T \phi(y, h_j) + \sum_{(j,k) \in E} \eta^T \psi(y, h_j, h_k) \qquad (3.11) $$
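For intuition, the sketch below tabulates the count function of Equation 3.8 with hard codeword assignments and a simple Euclidean-ball neighborhood; in the model itself the codewords are latent and the neighborhoods are the MST and 2-edge-connected graphs of Section 3.3.4, so this is an illustration rather than the training procedure.

```python
import numpy as np

def cooccurrence_counts(codewords, positions, radius, vocab_size):
    """Count function C(b, c): the number of ordered interest-point pairs whose
    codewords are (b, c) and whose (x, y, t) positions lie within `radius` of
    each other (a stand-in for the neighborhood function Nb)."""
    C = np.zeros((vocab_size, vocab_size), dtype=int)
    n = len(codewords)
    for j in range(n):
        for k in range(n):
            if j != k and np.linalg.norm(positions[j] - positions[k]) < radius:
                C[codewords[j], codewords[k]] += 1
    return C
```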
3.3.3 Hidden Conditional Random Fields

The latent CRF (Equation 3.11) shares the same parametrization as that of a Hidden Conditional Random Field [56, 88]; nevertheless, there are significant differences in how we define and interpret the potential functions of our respective CRF models. We view our model from a bag-of-words perspective, with each latent variable corresponding to a codeword assignment, and interpret the edges connecting the latent variables as representing co-occurrence relationships between the interest points. In contrast, [88] treats each latent variable as a 'part' belonging to a constellation model (akin to a pictorial structure), an interpretation which does not carry over to sparse STIP features. Furthermore, they use dense optical flow features, require stabilized human detection windows, and perform classification on a per-frame basis. Hence it is not obvious how their model can be extended to incorporate sparse features like Harris 3D corners. Extended segments of video can be void of STIPs, which makes it challenging to apply frame-by-frame methods. In spite of these differences, due to the shared parametrization of the CRF equations, we use the existing HCRF learning and inference procedures. Figure 3.4 shows a generic HCRF model. The mathematical notation and equations in Section 3.3.2 were kept consistent with the ones presented in [56, 88] so that a direct correlation can be observed between the two models.

[Figure 3.4: (a) General representation of a Hidden Conditional Random Field model. (b) Factor graph representation of a Hidden Conditional Random Field. (c) Evidence reduced factor graph used for inferring $P(\mathbf{h} \mid y, X; \Theta)$ during training.]

Given the observed features $X$ of an image $I$, its corresponding class $y$, and codeword labels $\mathbf{h}$, a hidden conditional random field is defined as:

$$ P(y \mid X; \Theta) = \sum_{\mathbf{h} \in H^m} P(y, \mathbf{h} \mid X; \Theta) = \frac{Z(y \mid X)}{\sum_{\hat{y} \in Y} Z(\hat{y} \mid X)} = \frac{\sum_{\mathbf{h} \in H^m} \exp\big( \Phi(y, X, \mathbf{h}; \Theta) \big)}{\sum_{\hat{y} \in Y} \sum_{\mathbf{h} \in H^m} \exp\big( \Phi(\hat{y}, X, \mathbf{h}; \Theta) \big)} \qquad (3.12) $$

where the function $\Phi(y, X, \mathbf{h}; \Theta) \in \mathbb{R}$ is as defined in Equation 3.11.

Each observation node $x_j$ is connected to a single hidden node $h_j$ and vice versa. This is not a requirement for a general HCRF model, but is enforced here as our proposed model (Equation 3.11) has a one-to-one correspondence between the hidden variables and observation nodes. The observation features are conditionally independent given the hidden node values, and hence the dependencies between the observations are modeled through the hidden layer only. In our case, the edge set $E$ directly models the neighborhood function $Nb(\cdot)$. The HCRF model parametrization $\Theta = \{\omega, \theta, \eta\}$ is also independent of the number of observations, and hence of the number of hidden nodes, which makes it well suited for classifying videos of varying length.

3.3.3.1 HCRF Training Algorithm

The model parameters are learned by maximizing the conditional log likelihood on the training data. The presence of hidden variables makes the objective function non-concave with no global optimum; gradient ascent is used to determine a local optimum. The gradient expression can be represented as a set of expectations over the posterior probability of the hidden parts and the class labels, $P(h_j, y \mid X; \Theta)$ and $P(h_j, h_k, y \mid X; \Theta)$. Hence at each iteration of gradient ascent we need to compute the following terms, for all $y \in Y$, $(j, k) \in E$ and $a, b \in H$:

$$
\begin{aligned}
\tilde{P}(y \mid X; \Theta) = Z(y \mid X) &= \sum_{h_j \in H} \tilde{P}(h_j \mid y, X; \Theta) \\
P(h_j = a \mid y, X; \Theta) &= \sum_{\mathbf{h} : h_j = a} P(\mathbf{h} \mid y, X; \Theta) \\
P(h_j, h_k = a, b \mid y, X; \Theta) &= \sum_{\mathbf{h} : h_j, h_k = a, b} P(\mathbf{h} \mid y, X; \Theta)
\end{aligned} \qquad (3.13)
$$

where $\tilde{P}(\cdot)$ is the un-normalized probability distribution. The conditional marginals in Equation 3.13 are computed by belief propagation over the evidence reduced factor graph (Figure 3.4(c)). If $E$ forms a forest structure (no cycles), then exact belief propagation can be performed on the factor graph. The cardinality of $H$ and $Y$ and the number of edges $|E|$ determine the computational complexity of belief propagation: $O(|H|^2 \cdot |E| \cdot |Y|)$.

The HCRF classifier can be learned either in One-Against-All (OAA) or All-In-One (AIO) training mode. In OAA mode, we train a binary HCRF classifier ($Y = \{0, 1\}$) for each class label, and the final classification is determined as:

$$ y^* = \arg\max_{i \in A} \{ P(y = 1 \mid X; \Theta_i) \} \qquad (3.14) $$

where $A$ is the set of action categories. The computational complexity remains the same, as each classifier is binary with $|Y| = 2$. In AIO mode, we train a single multi-class HCRF classifier ($Y = A$), and the final classification is determined as:

$$ y^* = \arg\max_{a \in Y} \{ P(y = a \mid X; \Theta) \} \qquad (3.15) $$

AIO requires a large set of codewords $H$, as it needs to learn a single multi-way classifier. Unfortunately, HCRF model learning is quadratic in $|H|$, and hence AIO tends to be extremely slow compared to OAA training. Our experiments are based on the OAA training framework.
3.3.4 Hidden Layer Connectivity

The edge set $E$ models the neighborhood structure as captured by the neighborhood function $Nb(x_j, x_k)$. The simplest neighborhood function is a Euclidean ball of radius $r$: $Nb(x_k) = \{ x_j : \|p_j - p_k\| < r \}$, where $p_j, p_k$ are the spatio-temporal locations of the interest points. Unfortunately, such a formulation leads to a densely connected edge set $E$ with a large number of cycles, making exact belief propagation difficult. We propose simpler connectivity models, and show results validating their effectiveness in capturing the neighborhood structure.

3.3.4.1 Minimum Spanning Tree

We propose a distance function based on the Minimum Spanning Tree (MST), which allows exact belief propagation for inference during HCRF training. We define the neighborhood function as:

$$ Nb(x_k) = \{ x_j : (j, k) \in \mathrm{MST}(G_{t_k}) \} \qquad (3.16) $$

where $G_{t_k}$ is a graph whose nodes are the interest points with the same frame index as node $x_k$, and whose edge weights are their pairwise spatial Euclidean distances. The resulting Euclidean MST has a link between each node and its closest neighbor, as the nearest neighbor graph is a subgraph of the Euclidean minimum spanning tree [14]. Hence the neighborhood function captures the relationship of each interest point with respect to its closest spatial neighbor. Similar tree structures have been used for learning part based models for object detection [56] and activity recognition [88]. The hidden layer edge set $E_{MST}$ corresponding to Equation 3.16 is shown in Figure 3.5(a). The corresponding evidence reduced factor graph (Figure 3.5(b)) is a tree, and hence we can perform exact belief propagation on it.

[Figure 3.5: (a) The hidden layer with edge set $E_{MST}$. (b) The evidence reduced factor graph for $E_{MST}$ has no cycles, allowing exact inference.]

3.3.4.2 2-Edge-Connected Graph

A graph is k-edge-connected if it remains connected whenever fewer than k edges are deleted from it. We construct a 2-edge-connected edge set $E_{2Con}$ by adding edges to $E_{MST}$ such that every node is linked to its two closest spatial neighbors. Similar graphs have been used earlier for object recognition [56]. Note that $E_{MST} \subseteq E_{2Con}$, and hence $E_{2Con}$ should be more robust to perturbations in interest point locations across videos of the same class. Figure 3.6(a) shows the graphical model corresponding to an $E_{2Con}$ edge set. The evidence reduced factor graph, shown in Figure 3.6(b), contains cycles, and hence requires approximate inference methods like loopy belief propagation.

[Figure 3.6: (a) 2-lattice graph with every node connected to its two closest spatial neighbors. (b) The evidence reduced factor graph for $E_{2Lat}$ has cycles and requires loopy belief propagation for inference.]
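The per-frame MST construction of Equation 3.16 can be sketched as follows using SciPy; `positions` is assumed to hold the 2D spatial coordinates of the interest points and `frame_ids` their frame indices. Adding each node's second-closest spatial neighbor to this edge set would give the 2-edge-connected variant described above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_edge_set(positions, frame_ids):
    """Edge set E_MST: a Euclidean minimum spanning tree over the interest points
    of each frame, linking every point to close spatial neighbors."""
    edges = []
    for t in np.unique(frame_ids):
        idx = np.flatnonzero(frame_ids == t)
        if len(idx) < 2:
            continue
        dists = squareform(pdist(positions[idx]))       # pairwise spatial distances
        tree = minimum_spanning_tree(dists).tocoo()     # sparse MST of the frame graph
        edges.extend((int(idx[i]), int(idx[j])) for i, j in zip(tree.row, tree.col))
    return edges
```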
3.4 Results

We validate our framework on two widely used human activity datasets: Weizmann [4] and KTH [70]. We use the original version of the Weizmann dataset with 9 action categories: walking, running, jumping, sideways, bending, one-hand-waving, two-hands-waving, jumping in place and jumping jack. Each action was performed by 9 different actors, resulting in a total of 81 videos in the dataset. The KTH dataset contains 600 videos of 25 actors performing 6 actions: walking, jogging, running, boxing, hand-waving and hand-clapping, repeated in different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with clothing variation (s3) and indoors (s4).

We extract 3D Harris corner points and generate HoF-HoG descriptors for all videos, using the code provided by [36]. The interest points are generated using the default parameters. We learn the HCRF classifier with the interest points as observation features, using our extension of the HCRF library [56], in which we replace the inference engine with the libDAI library [43]. Our method was implemented on a quad-core 3.16 GHz Intel Xeon CPU. The average training time for the KTH dataset (48 videos with 500 frames each) was 30 to 60 hours, depending on the cardinality of $H$. The average test time was 30 seconds for a single video, given the interest point descriptors.

3.4.1 Weizmann

We validated our approach on the Weizmann dataset using Leave-One-Out Cross-Validation (LOOCV), which is the standard experimental setup also used by others. The video clips of a single actor are set aside as the test set, and the HCRF classifier is learned on the remaining 8 actors. This process is repeated such that each actor appears once in the test set. We split the 9 actions into two categories based on the net motion of the interest point detections: mobile actions (walking, running, jumping and gallop-sideways) and stationary actions (bending, one-hand-waving, two-hands-waving, jumping in place and jumping jack). Such a categorization helps the training algorithm concentrate on the more ambiguous classes.

3.4.1.1 Edge Connectivity and Codebook Size

Table 3.1 shows our results with varying codebook size $|H|$ using the $E_{MST}$ and $E_{2Con}$ connectivity matrices. We are able to obtain good results with a much smaller codebook size because each interest point can be associated with multiple codewords based on the class conditional distribution $P(\mathbf{h} \mid y, X)$. The connectivity matrix describes the neighborhood function being modeled. Almost identical accuracy was obtained for $E_{2Con}$ connectivity, with a difference of only a single misclassified video. This shows that good results can be achieved with relatively simple definitions of the neighborhood function, such as the one given by Equation 3.16.

            |H| = 10    |H| = 15    |H| = 20
E_MST       97.53%      96.30%      97.53%
E_2Con      96.34%      98.76%      97.53%

Table 3.1: Comparison with varying codebook size |H| and different edge connectivity on the Weizmann dataset.

3.4.1.2 Classification Accuracy

Figure 3.7 shows our confusion matrix on the Weizmann dataset using LOOCV; we achieve an accuracy rate of 98.76% using $E_{2Con}$ and $|H| = 20$, making only a single classification error out of 81 test samples. Table 3.2 contrasts our performance with others. We consistently outperform other existing approaches which attempt to model the neighborhood structure [7, 53, 98, 52]. There exist approaches [4, 69, 89] which have achieved perfect results on this dataset; however, they require either accurate silhouettes, fixed-size image windows centered at the person of interest, or track information to stabilize the videos. Our interest point based approach does not require such information, yet gives comparable performance.

3.4.2 KTH

Following the experimental setup of [70], we use the video clips of 16 actors as our training set, and the remaining 9 actors as our test set.
We split the actions into two categories based on the net motion of the interest point detections: mobile actions (walking, running, jogging) and stationary actions (boxing, hand-waving and hand-clapping). The experiments on KTH were run with $E_{MST}$ connectivity and $|H| = 20$ for s1 and s2, and $|H| = 10$ for s3 and s4. Figure 3.7(b) shows our average confusion matrix over all four scenarios: s1, s2, s3 and s4. We note that the majority of the confusion is between the actions jog and run, which is expected due to their similar nature. Table 3.2 compares our average accuracy rate with other methods. We achieve an average accuracy of 93.98%. A direct comparison should be made carefully, as some of the results are reported on the easier 24:1 train-test ratio. Note that our performance is significantly better than [70, 26, 36], and comparable to [34, 20] with a difference of a single misclassification. Furthermore, both [34, 20] use dense features, which are much more expensive to compute. Our results are most directly comparable to Schuldt et al [70], because they learn a discriminative classifier over histogram of codeword features and do not model any spatio-temporal relationships between the interest points. We achieve a significant improvement in performance (about 22%) over [70], which we attribute to our learning of co-occurrence relationships between interest points.

Method                  Weizmann    KTH        Train/Test
Our Approach            98.76%      93.98%     16:9
Bregonzio et al [7]     96.66%      93.17%     24:1
Laptev et al [36]       95.06%*     91.80%     16:9
Niebles et al [53]      90.00%      83.33%     24:1
Zhang et al [98]        92.89%      91.33%     24:1
Kovashka et al [34]     -           94.53%     16:9
Gilbert et al [20]      -           94.50%     16:9
Niebles et al [52]      72.80%      -          -
Wang et al [89]         100.0%      92.51%     1:1
Wang et al [88]         97.20%      87.60%     1:1
Schindler et al [69]    100.0%      92.70%     4:1
Jhuang et al [26]       98.8%       91.70%     16:9
Dollar et al [12]       85.20%      81.17%     24:1
Schuldt et al [70]      -           71.72%     16:9
Blank et al [4]         100.0%      -          -

Table 3.2: Comparative results on the Weizmann and KTH datasets. The top half cites approaches that attempt to model the neighborhood structure of interest points; the bottom half cites general activity recognition systems, not necessarily based on interest points. The last column shows the train/test split ratio for KTH. (*The Weizmann result for [36] is from our implementation, which uses Multiple Kernel Learning [34] instead of greedy kernel selection.)

3.5 Conclusion

We proposed a structured graphical model for learning a discriminative classifier over STIP features for categorizing human activity videos. We model the neighborhood relationships of interest points using co-occurrence statistics, and show a transformation to represent these relationships as edges of a Hidden Conditional Random Field. We validate our framework on two widely used human activity datasets, Weizmann and KTH, and show improvements over other existing approaches. Our method can be naturally extended to capture temporal relationships, which is a part of our ongoing work.

[Figure 3.7: Confusion matrices. (a) Weizmann dataset with $E_{2Con}$ connectivity and $|H| = 20$; the only confusion is between jump and run, and the average accuracy over all classes is 98.76%. (b) KTH dataset with $E_{MST}$ connectivity and $|H| = 20$ for s1, s2 and $|H| = 10$ for s3, s4; the main confusions are between jog and run, and the average accuracy over all classes is 93.98%.]
Chapter 4
Pose based Activity Recognition using Multiple Kernel Learning

In this chapter, we consider the problem of using structured, semantically meaningful observation features for activity recognition, while relying on statistical pooling techniques for classification, which greatly reduces the annotation and dynamic modeling requirements during training.

4.1 Introduction

A common existing approach is to compute global histogram based statistics of local spatio-temporal features in the video volume, which are used to train a discriminative classifier [34, 36]. These approaches do not require significant annotations or model construction; however, they completely ignore the structure present in human activities. A complementary approach is to train a classifier (such as a dynamic Bayesian network, hidden Markov model etc.) based on the human pose dynamics present in an action, while using pose estimates as observations [24, 72]. Such methods leverage their knowledge of the structure of the activity; however, training such models requires a semantic decomposition of the activity into key-poses, with interpolations defined for the intermediate states. This imposes significant annotation and modeling requirements on the end user during training. The classifier is also specific to a particular style and duration of the activity.

We are motivated by recent developments in 2D human pose detection [58, 18, 91], and introduce a classification method that uses the structure provided by human pose without constraining it to a specific dynamical model. Pose estimation is more reliable for poses which are similar to the mean-pose configuration represented in the prior, and it is hard to design a single prior which works for all poses present in an activity. Hence we use a collection of pose priors, which helps in detecting the distinct poses present in an activity with at least one of the pose priors. Figure 4.1 shows examples of the pose priors used in our work. The multiple pose detection responses are combined in a multiple kernel learning (MKL) framework to classify the videos into action categories. We test our framework on a human gesture dataset, and provide convincing results in support of our framework.

4.2 Human Pose Estimation

Human poses provide strong semantic cues to the underlying activity. Pictorial structure based object detection techniques have improved the quality and reliability of modern pose detection algorithms [58]. We use pose detection results of the human body as input observations to our activity recognition module. We represent the human body using a 10-part model: head, torso, and the upper/lower limbs of the arms/legs. The 2D pose model consists of nodes $l_i$ corresponding to each part $i$, with edges between the nodes enforcing spatial constraints on their arrangements. The complete human pose is represented as the configuration of the parts, given by $L = \{l_0, l_1, \ldots, l_N\}$, where the state of part $i$ is defined as $l_i = (x_i, y_i, \theta_i)$.

[Figure 4.1: Kinematic tree priors $K_1$ to $K_8$ representing distinct mean-pose configurations. The one-sigma boundary of the Gaussian distribution for relative location and orientation between part pairs is shown using blue ellipses and green lines respectively.]
Given the observed image $I$, the pose posterior distribution is defined as:

$$ p(L \mid I) \propto p(I \mid L)\, p(L) \propto \exp\Big( \sum_i \phi(l_i, I_i) + \sum_{(i,j) \in E} \psi(l_i, l_j) \Big) \qquad (4.1) $$

where $(i, j) \in E$ are the pairwise constraints between the parts. In standard implementations [58, 18] a tree graph configuration rooted at the torso is chosen, which ensures exact inference procedures. $p(L)$ represents the prior on the part configurations, and is derived from the kinematic constraints imposed on the parts; details on constructing meaningful priors for activity recognition are discussed in Section 4.2.1. $p(I \mid L)$ is the likelihood of the image observation, given a particular configuration state of the parts. The joint probability is decomposed as $p(I \mid L) \propto \prod_i p(I_i \mid l_i)$, where each term $p(I_i \mid l_i)$ is computed from independent part detectors applied over the detection window. We assume the human location in the image is known a priori. We use the boundary and region template-based part detectors provided by [58]. The marginal posterior probability for each body part is inferred using belief propagation over the graphical model defined in Equation 4.1.

[Figure 4.2: Sample pose estimation results on the USC-Gesture dataset using kinematic prior trees $K_1$ to $K_8$. The results display the part-wise marginal posterior distributions $p(l_i = (x, y, \theta) \mid I; k)$. The distributions have significantly lower entropy for frames closer to the mean-pose configuration of the KTPs, for example $K_1$-$G_6$, $K_2$-$G_4$ and $K_6$-$G_3$. This figure is best viewed in color.]

4.2.1 Kinematic Pose Priors

The pose configuration prior is encoded in the distribution $p(L)$. A variety of pose priors have been used in the literature, ranging from tree priors set to uniform probabilities within a bounded range [57], non-tree priors with occlusion reasoning built into the model [46], and fully connected graph models using absolute part orientations [3]. We choose to use the kinematic tree priors (KTPs) proposed by Andriluka et al [1], because they are easy for the end user to design, and because exact inference algorithms are possible for such priors. KTPs decompose the kinematic constraints over $L$ as

$$ p(L) = p(l_0) \prod_{(i,j) \in E} p(l_i \mid l_j) \qquad (4.2) $$

where the relative location and orientation arrangement of each child part with respect to its parent part is encoded in terms of a Gaussian distribution. A common approach is to learn a single KTP [57, 1] which is representative of the human poses present in the dataset. However, detectors based on a single prior tend to return confident estimates only for poses similar to the configuration represented in the prior. Hence we design a collection of prior trees $K_i \in \mathcal{K}$, representing a variety of mean-pose configurations, which helps in detecting the distinct poses present in an activity with at least one of the pose priors. Figure 4.1 shows examples of the KTPs defined in our current implementation, each with a distinctive mean-pose configuration. Figure 4.2 shows results with different KTPs; it is clear that the estimates are much better when the actual pose is similar to the prior pose.

The set of Gaussian parameters corresponding to the mean relative orientation between each pair of parts defines the mean-pose configuration of the KTP. The parameters are easily set via visual inspection; for example, $K_1$ is similar to a pose with both hands raised above the head, whereas $K_5$ represents a pose with bent elbows. The rest of the Gaussian parameters are set to generic values independent of the activity dataset, and are determined empirically from a standard pose estimation dataset [58].
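The following is a minimal sketch of how the kinematic tree prior of Equation 4.2 can be evaluated for a candidate pose, assuming hypothetical dictionaries for the tree structure and the per-part Gaussian parameters; only the relative-orientation terms are shown, and the relative-location Gaussians are omitted for brevity.

```python
import numpy as np
from scipy.stats import norm

def wrap_angle(a):
    """Wrap an angle difference into (-pi, pi]."""
    return (a + np.pi) % (2.0 * np.pi) - np.pi

def log_kinematic_tree_prior(pose, tree_edges, rel_orientation_prior):
    """Log of the kinematic tree prior (orientation terms only).

    pose: {part: (x, y, theta)}.
    tree_edges: [(parent, child), ...] rooted at the torso.
    rel_orientation_prior: {child: (mean, std)} Gaussian on theta_child - theta_parent."""
    logp = 0.0
    for parent, child in tree_edges:
        d_theta = wrap_angle(pose[child][2] - pose[parent][2])
        mean, std = rel_orientation_prior[child]
        logp += norm.logpdf(wrap_angle(d_theta - mean), loc=0.0, scale=std)
    return logp
```

Setting a distinctive set of means per prior tree, as described above, is what turns one generic pictorial-structure prior into the collection of pose-specific KTPs used in this chapter.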
The kinematic tree priors are extended to contain a set of flags defining which body parts are important for capturing a particular pose. Hence a KTP can have only the upper body parts visible, or even a single side (left/right) of the body parts visible (e.g. $K_2$ and $K_3$), depending on what is most representative of the pose we want to capture. To speed up the pose search algorithm, we restrict the part detectors to search in a bounded location and orientation range determined by the KTPs, similar to the position priors described in [18].

4.3 Activity Recognition

The pose detectors described above are applied to all the frames in the video. The detector corresponding to the $k$-th kinematic tree prior returns a set of part-wise posterior marginal distributions $E_k = \{E_{k,i}\}_{i=1 \ldots N}$ for each frame, where $E_{k,i} = p(l_i = (x, y, \theta) \mid I; k)$. The pose detectors return confident estimates for frames containing poses similar to their corresponding mean-pose configuration, and will likely have higher uncertainty for other configurations. The confident estimates can be identified visually; however, there exists no obvious algorithm to identify them automatically. A part-wise entropy based scoring may be employed, but we could not determine a set of consistent entropy based thresholds which would select a single confident result across all the detectors. Hence, the activity recognition problem remains quite challenging. We propose computing descriptors from the distributions and combining them in an MKL framework for activity classification.

4.3.1 Pose Descriptors

A standard approach to summarizing the distributions is to use the maximum a posteriori (MAP) estimate, but this ignores the variance of the distribution. We avoid making a hard decision regarding the final pose; instead, pose descriptor histograms are extracted from the marginal posterior distributions of the parts, as introduced by Ferrari et al [18]. We use only two of the three pose descriptors proposed in [18], namely Descriptor A and Descriptor B (described below). Descriptor C was found to be redundant for classification purposes.

[Descriptor A] The marginal distribution $E_{k,i}$ is quantized to $20 \times 16 \times 24$ bins in the $x$, $y$ and $\theta$ dimensions. The descriptor captures the global distribution of the parts in the detection window at each orientation.

[Descriptor B] The descriptor encodes the part orientations, relative locations and relative orientations in a single concatenated histogram. The following three distributions are computed from $E_{k,i}$:

$$
\begin{aligned}
P(\theta_{l_i}) &= \sum_{(x,y)} P\big(l_i = (x, y, \theta)\big) \\
P\big(r(l_i, l_j) = \theta\big) &= \sum_{(\theta_i, \theta_j)} P(\theta_{l_i})\, P(\theta_{l_j})\, \mathbf{1}_{r(\theta_i, \theta_j) = \theta} \\
P\big(l_i^{xy} - l_j^{xy} = \delta\big) &= \sum_{(x_i, y_i, x_j, y_j)} P(l_i^{xy})\, P(l_j^{xy})\, \mathbf{1}_{l_i^{xy} - l_j^{xy} = \delta}
\end{aligned} \qquad (4.3)
$$

where the marginal orientation of each part is given by $P(\theta_{l_i})$ and is quantized into 24 bins, the relative orientation marginal between pairs of parts is given by $P(r(l_i, l_j))$ and is quantized into 24 bins, and the relative location marginal is given by $P(l_i^{xy} - l_j^{xy})$ and is quantized into $7 \times 9$ bins. The final descriptor is a concatenation of all three histograms computed for each part.
4.3.2 Multiple Kernel Learning

A vocabulary of codewords is learned for each type of descriptor (A and B) corresponding to each of the kinematic tree priors $K_i \in \mathcal{K}$ by K-Means clustering; the cluster centers constitute the codewords of the vocabulary. The learned vocabulary is used to compute a set of histogram of codeword features $h \in \mathcal{H}$ for each input video, where each histogram $h$ corresponds to a particular descriptor type and KTP. Hence we have $|\mathcal{H}| = 2|\mathcal{K}|$ histograms of codewords for each video. Similarity kernel matrices $K_c$ are computed for each type of histogram using the $\chi^2$ distance function:

$$ K_c(i, j) = \exp\Big( -\frac{1}{A_c}\, \chi^2(h_i, h_j) \Big), \qquad \chi^2(h_i, h_j) = \frac{1}{2} \sum_{b=1}^{D} \frac{\big(h_i(b) - h_j(b)\big)^2}{h_i(b) + h_j(b)} \qquad (4.4) $$

$A_c$ is the kernel scaling parameter and is set to the mean distance value. Multiple kernel learning (MKL) is used to determine the most discriminative combination of kernels in a max-margin framework. MKL has been successfully used for combining feature channels in computer vision [34]. MKL determines the weights $w_c$ such that the combined kernel $K = \sum_c w_c K_c$ is the best conical combination of the individual kernels for recognizing a given action. Final classification is performed by learning an SVM classifier for each class using the combined kernel $K$. We use a publicly available implementation of the MKL algorithm proposed by Bach et al [2].
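The kernel computation and conical combination of Equation 4.4 can be sketched as follows; the channel weights are left as fixed inputs here, whereas in this chapter they are learned by the MKL algorithm of [2], which is not reproduced.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(H_a, H_b):
    """Chi-squared similarity kernel of Equation 4.4 between two sets of histograms."""
    diff = H_a[:, None, :] - H_b[None, :, :]
    summ = H_a[:, None, :] + H_b[None, :, :] + 1e-12
    dist = 0.5 * np.sum(diff ** 2 / summ, axis=2)
    A = dist.mean() if dist.mean() > 0 else 1.0      # scaling A_c set to the mean distance
    return np.exp(-dist / A)

def combined_kernel(channels, weights):
    """Conical combination K = sum_c w_c K_c over descriptor/KTP histogram channels."""
    return sum(w * chi2_kernel(H, H) for w, H in zip(weights, channels))

# The final per-class decision uses an SVM on the precomputed combined kernel, e.g.
#   clf = SVC(kernel="precomputed").fit(combined_kernel(train_channels, w), labels)
```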
4.4 Results and Conclusion

We tested our framework on the USC-Gesture dataset [49], containing multiple video clips of 11 actions performed by 8 actors. There is very little variation among the clips belonging to the same actor and action, hence we choose only two clips per actor per action, resulting in a total of 2 x 11 x 8 = 176 video clips in our dataset. We use videos from 7 actors as training data, test on the 8th actor, and repeat for all possible permutations of actors. The pose detectors are applied with the collection of 8 kinematic prior trees shown in Figure 4.1. The marginal posterior distributions of the parts returned by a subset of the pose detectors are shown in Figure 4.2, where each row corresponds to a different KTP. We observe that at least one of the pose detectors returns a confident pose estimate across all the frames.

We used K = 40 for constructing the descriptor codewords. The final classification results using the MKL algorithm are illustrated by the confusion matrix shown in Figure 4.3. We achieve an average accuracy rate of 83.33% across all folds. Note that [72] reports a 92% accuracy rate; however, that method requires manual construction of activity models by annotating 2.5D joint locations for selected key poses, and the models also contain motion styles and durations. The method presented here does not require any manual modeling effort (assuming that the set of pre-defined KTPs is sufficient) and should be insensitive to styles of motion, as dynamics is not modeled specifically. This is not to argue that motion dynamics is not useful, or even critical, for activity recognition, but that some analysis that is not dependent on precise dynamical models may be useful prior to the application of the dynamical models.

[Figure 4.3: Confusion matrix for the USC-Gesture dataset over gestures g1 to g11; most gestures are classified correctly, with the largest confusions occurring for g5, g6 and g9.]

Chapter 5
Pose Filter based Hidden-CRF Models for Activity Detection

In this chapter, we consider the problem of using key-pose detection filters to detect and classify activities in unsegmented videos. Key-pose detection results are comparatively less semantically meaningful than the pose distribution features described previously; however, they incorporate more structure and semantics than localized features like STIPs. There has been considerable research on classifying segmented videos into a single activity class, but comparatively less progress on activity detection in unsegmented videos. In real world scenarios, the activity of interest occurs only for a part of a video stream, and automatic temporal segmentation of videos based on activity content is a difficult task. We propose a novel algorithm, based on automatically identifying the key-poses in an activity, and learning a key-pose filter based HCRF model capable of detecting and classifying multiple instances of an activity in an unsegmented video. (The terms classification, detection and recognition are often used interchangeably in the vision community. We define activity classification as assigning a single class label to the entire video clip, whereas activity detection determines the temporal extent of the activity and returns a temporal segmentation of the video based on its activity content; recognition consists of classification and/or detection.)

5.1 Introduction

Activity classification algorithms can be broadly categorized based on their structure modeling capabilities. A popular class of approaches [36, 34] computes global histogram based statistics of local spatio-temporal features, and trains a discriminative classifier to assign a single activity label to the video. Such methods require only event level annotation for training, and use off-the-shelf, computationally efficient classifiers like the SVM. However, global statistics do not capture the temporal dynamics of the human activity, and are unable to generate a semantic description of the activity. To classify unsegmented videos, they are typically applied in an ad-hoc sliding window fashion, which ignores the dependencies between neighboring windows.

A complementary approach learns a dynamical model of human motion, and infers the activity based on the state transitions observed in the video [24, 73]. The definition of a state can range from primitive action based decompositions [49] to limb based dynamics [24], and the inferred state sequences provide a temporal segmentation of the video. However, the models are sensitive to underlying assumptions about duration statistics and variations in activity style, and also impose significant annotation (like joint positions and primitive event labels) and modeling requirements during training.

Dynamical models, like HMMs, generally require inference of the state assignments in every frame, where the number of random variables to be inferred is proportional to
To keep the inference tractable, they further require a Markovian assumption between observations from adjacent frames, and do not capture the global distribution of the features. We argue that for activity classification, it is only sufficient to determine the presence or absence of a state in an observation sequence, while en- suring certain temporal relationships between the state detections are satisfied. We are motivated to propose an alternative graphical model for activity detection, where the random variables to be inferred are the temporal locations of the state, which are much fewer than the number of frames. Our model also incorporates the global distribution of features in a tractable manner. In summary, we prefer a more generalizable and tractable model, at the expense of per-frame state descriptions. Our model does generate an ac- tivity description in terms of state detections. Human activities can be described in terms of a sequence of transitions between key-poses [40], where key-poses represent the important human pose configurations in an activity. Pose inference is a difficult problem in itself; instead, we compute image features that are related to pose but do not make the pose information, such as joint positions, explicit. A single key-pose corresponds to a state (or a part 2 ) in our activ- ity model. We learn a collection of key-pose detection filters, and pool their detection responses, while satisfying the temporal relationships between them. A close analogy is the pictorial structure object detection framework, where the detection responses of independent part detectors are pooled together while satisfying the geometrical relation- ships between the object-parts. Furthermore, the temporal locations of the key-pose detections correspond to the active segments in the video stream, enabling activity de- tection in unsegmented videos. 2 A sequence of states in an activity can be viewed as a decomposition of the composite activity into primitive parts. 60 Our contributions are multi-fold. Firstly, we introduce a novel pose summarization algorithm to automatically identify the key-poses in an activity. Second, we introduce a key-pose filter based hidden conditional random field (HCRF) model capable of detect- ing and classifying human activities in unsegmented videos. Lastly, we present a new extended human gesture dataset with unsegmented videos containing multiple activity instances, for evaluating the action detection performance. 5.2 Related Work We focus our survey on classification algorithms using a part based representation, and also briefly review methods for activity detection. [Part based Models] Part based methods have been extensively used in activity analy- sis, where the definition of a ’part’ varies widely. Some approaches use semantic part definitions, like human key poses [40], linear interpolations between 3D joint positions [49], and spatio-temoral volumetric shapes of human limbs [29]. These methods either require extensive manual annotations/segmentations and predefined dynamical models, or hard to collect motion capture data for training. Other approaches favor non-semantic definition of parts. Wang et al [90] define discriminative motion patches and learn a HCRF model for classification; their model performs single frame classification, and ignores inter-frame part dynamics. A variety of methods based on Dynamical Graphical Models exist [13, 73, 45], where activities are defined as a series of state transitions. 
The state definitions may lack meaningful semantics, and they ignore the global distribution of features in the activity segment. More recently, Niebles et al [51] extended the deformable part models for object detection of [17] to the temporal domain for activity classification. They initialize the sub-parts based on their correlations to a global feature distribution, and the parts need not correspond to any semantic interpretation. They classify pre-segmented videos only. A closely related model for learning discriminative key-pose sequences is proposed by Vahdat et al [82], with a focus on interactions between a pair of humans. Their model only ensures ordering constraints, ignores the uncertainty in the temporal placement of the poses, and cannot detect multiple instances of an activity in a video.

[Activity Detection] Structured models like [49, 24] perform automatic video segmentation by learning densely linked finite state machines which combine all the activities in a single model. They do not scale well with the number of activities (inference is quadratic in the number of states), and require extensive manual annotations. Spatio-temporal volumetric feature based algorithms [96, 28] rely on global statistics to detect action events, with no semantic reasoning about the underlying activity; [28] further requires enumeration of all possible sub-volumes, and resorts to sub-sampling for tractable learning. There exist techniques [71, 62] based on maximizing the volumetric correlation of 3D templates to localize single primitive actions; however, it is unclear how multiple templates for complex activities can be combined.

5.3 Model Overview

While there exist methods like [96, 28, 71, 62] which perform detection in both the space and time dimensions, we argue that spatial detection of the human is better solved by dedicated human detectors [11], which have shown impressive performance on a variety of benchmark datasets. Our focus is on the temporal detection of the composite activity, and we assume that a human trajectory $\mathbf{x} = \{x_t\}$ is provided as input to our algorithm, where $x_t$ is the human detection box in the $t$-th frame.

[Figure 5.1: Flow diagram of our proposed algorithm: key-pose identification on the training videos using the pose summarization algorithm (PSA), HCRF model training with a latent SVM (LSVM), and test-time inference of the class label and temporal positions.]

We define a human activity as a sequence of key-poses. Automatic key-pose identification is a challenging problem, and we introduce a novel Pose Summarization Algorithm (PSA) to identify the key-poses in an activity, along with their temporal anchor positions $a_k$. The temporal anchors represent the expected temporal position of each key-pose in a given pose sequence, and provide the initial feature template for learning the key-pose detectors. Key-pose detectors are composed of linear weight filters, discriminatively learned from the observed HoG/HoF features at their corresponding anchor locations. We further define probabilistic temporal position distributions for each of the key-poses, to model the uncertainty in their placement about the anchor locations. The pose features in a video are quantized to a vocabulary of pose-codewords. The global distribution of the poses present in the sequence is learned using a root filter, which is a function of the histogram of pose-codewords.
The multiple key-pose filters, the root filter and their corresponding temporal relationships are jointly modeled in a probabilistic framework, resulting in a hidden conditional random field (HCRF). The parameters of this model are learned in a discriminative max-margin framework using a latent support vector machine. Figure 5.1 shows the flow diagram of our proposed algorithm. Final classification and detection are performed by inferring the class label and the temporal positions from the HCRF model.

[Feature Descriptors] We compute HoG/HoF features from the detection window returned at trajectory location x_t. The detection window is primarily centered around the human, and hence the features capture the human pose configuration at time t. Similarly to [17], each detection window is divided into a grid of 8x8 cells. The gradient information is accumulated in a 1D orientation histogram for each cell, discretized into 18 orientation bins. We use trilinear interpolation to distribute the gradient weight among neighboring cells and orientation bins, as recommended in [11]. The histogram is normalized based on neighboring cells, resulting in a 31-dimensional feature vector containing both contrast-sensitive and contrast-insensitive features. A similar procedure is used to obtain the HoF feature vector. The concatenated HoG/HoF feature vector computed at x_t is denoted f(x_t) in R^D.

5.4 Key-Pose Identification

Automatic decomposition of a composite activity stream into its constituent key-poses is defined as the key-pose identification problem. This is a prerequisite for learning a key-pose detector, as we first need to identify the important key-poses of an activity before learning how to detect them. Algorithms for automatic key-pose identification rely on variants of change detection in the pose dynamics; however, change-point detection algorithms require accurate human limb estimates, which are difficult to achieve. Another approach is to perform hierarchical clustering of the pose features, followed by vector quantization, to learn a vocabulary of pose-based codewords. However, these codewords do not take into account the temporal structure present in the pose sequence of an activity.

Figure 5.2: Pose summarization for key-pose detection. The bottom row shows every fourth frame from a sample video sequence; the top row shows the key-poses and their respective temporal boundaries for K = 3 (b_1 = 0, tau_1 = 7, b_2 = 12, tau_2 = 15, b_3 = 21, tau_3 = 27, b_4 = 40).

Inspired by existing techniques for video summarization [39], we solve the key-pose identification problem using pose sequence summarization. Given N poses in an activity sequence, our task is to select the subset of K < N poses which best summarizes the complete pose sequence w.r.t. a cost function defined on the pose space. We next describe the algorithm in detail.

5.4.1 Pose Sequence Summarization

Let f(x_i) in R^D be a D-dimensional feature vector describing the human pose present in the window x_i, and let tau_1, ..., tau_K denote the temporal locations of the K key-poses. By definition, each key-pose f(x_{tau_j}) best summarizes the poses in the sub-sequence f(x_{b_j}), ..., f(x_{b_{j+1}-1}) between frames b_j and b_{j+1}. Hence, for a given temporal range [b_j, b_{j+1}), the optimal key-pose location tau_j is computed as tau_j = arg min_{\hat{tau}} C(\hat{tau}; b_j, b_{j+1}), where C(.) is the pose summarization error.
C(\hat{\tau}; b_j, b_{j+1}) = \sum_{i=b_j}^{b_{j+1}-1} \left\| f(x_i) - f(x_{\hat{\tau}}) \right\|_2^2

Figure 5.3: Results of the pose summarization algorithm for (a) K = 3, (b) K = 4 and (c) K = 5.

E(K, \{\tau_j\}, \{b_j\}) = \sum_{j=1}^{K} C(\tau_j; b_j, b_{j+1})    (5.1)

The total cost incurred in summarizing the entire pose sequence using just K key-poses is given by the error function E(.). The optimal assignment of key-poses {tau_j}, and their respective temporal boundaries {b_j}, is determined by minimizing E(.). A dynamic programming algorithm for video summarization was proposed in [39], which is easily adapted for our purpose. The key insight is that, given the temporal boundaries {b_j}, the corresponding key-pose locations {tau_j} can be determined in O(n^2) time. This suggests an algorithm which recursively determines the optimal temporal boundary locations. The dynamic program has a computational complexity of O(K n^3), and hence is efficient for reasonably sized segments with n < 200 frames.

We present results of key-pose identification on a sample video in Figure 5.2, and observe that the identified key-poses match closely an intuitive human definition of key-poses. Note that with increasing K (Figure 5.3), adjacent key-poses become more similar in appearance, and harder to distinguish from each other. The optimum value of K varies depending on the activity, and we determine it empirically in our experiments.

Figure 5.4: The factor graph representation of our proposed HCRF model for K = 2 key-poses.

Figure 5.5: Panel (a) shows the two key-poses identified by the pose summarization algorithm, with their corresponding anchor times a^1, a^2. Panel (c) shows the feature descriptors x_t and their corresponding codeword assignments w_t in W below. The root filter beta_R^T phi_R is shown in cyan, applied between frames t_r and t_r + L, where it models the pose-codeword frequencies shown in panel (b). Sample key-pose filters beta_{A_k}^T phi_A learned by the LSVM model are shown in panel (d); their temporal location is modeled with a normal distribution about the corresponding anchor location a^k, shown in panel (c).
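For concreteness, the following is a minimal Python sketch of the pose summarization dynamic program of Section 5.4.1. It implements the cost of equation (5.1) with a brute-force O(K n^3) recursion; the array and function names are illustrative assumptions, with `poses` an (N x D) matrix holding one pose feature f(x_i) per frame.

import numpy as np

def summarization_cost(poses, lo, hi):
    """C(tau; lo, hi): best key-pose in the frame range [lo, hi) and its summed
    squared error to all other poses in the range."""
    seg = poses[lo:hi]                                        # (m, D)
    d2 = ((seg[:, None, :] - seg[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    costs = d2.sum(axis=0)                                    # cost of each frame as key-pose
    tau = int(costs.argmin())
    return float(costs[tau]), lo + tau

def summarize_poses(poses, K):
    """Split frames [0, N) into K contiguous ranges and return one key-pose per
    range, minimizing the total summarization error E of equation (5.1)."""
    N = len(poses)
    INF = float("inf")
    cost = np.full((N + 1, N + 1), INF)      # cost[i, j]: error of range [i, j)
    key = np.zeros((N + 1, N + 1), dtype=int)
    for i in range(N):
        for j in range(i + 1, N + 1):
            cost[i, j], key[i, j] = summarization_cost(poses, i, j)
    best = np.full((K + 1, N + 1), INF)      # best[k, j]: K=k key-poses cover [0, j)
    back = np.zeros((K + 1, N + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, N + 1):
            for i in range(k - 1, j):
                if best[k - 1, i] + cost[i, j] < best[k, j]:
                    best[k, j] = best[k - 1, i] + cost[i, j]
                    back[k, j] = i
    bounds, j = [N], N                        # recover boundaries b_1..b_{K+1}
    for k in range(K, 0, -1):
        j = back[k, j]
        bounds.append(j)
    bounds = bounds[::-1]
    taus = [key[bounds[k], bounds[k + 1]] for k in range(K)]
    return taus, bounds

For example, summarize_poses(poses, 3) on the sequence of Figure 5.2 would return three key-pose indices together with the four segment boundaries.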
Figure 5.4 shows the factor graph representation of the HCRF. Solving the inference in equation 5.2 provides us with localization of the activity segment in the trajectory, along with the class label. The key-pose locations also provide us with a semantic description of the activity in terms of its key-poses. We model the probability distributionP (y;zjx) as follows: P (y +1 jz;x)/exp ( T R R (x;t r ) + X k2K T A k A (x;t k ) ) (5.3) P (zjx)/P (t r jx) Y k2K P (t k jx)/C Y k2K N t k j a k + k ; 2 k 69 whereN (xj; 2 ) is the standard normal distribution with mean and variance 2 . The filtersf R ; A g are a single dimensional template specifying the weights of the features f R ; A g appearing in a segment of the trajectory. Their dot product is the filter score when applied to the segment. 5.5.1 Root Filter The root filter T R R captures the global distribution of poses present in a given activity segment. First, a vocabulary of codewordsW is learned over the pose features f(x i ) extracted from all the videos in the training set. Then, each trajectory window x i is assigned to the closest pose-codewordw2W, the mapping being defined by a function g(x t ) :R D !W. An activity segment is said to start from timet r and has a length ofL frames. The root filter computes the histogram of pose-codewords present in the temporal window [t r ;t r +L], as shown in Figure 5.5(b,c), and is defined as follows: T R R (x;t r ) = X w2W w tr+L X t=tr 1 g(xt)=w (5.4) where 1 g(xt)=w returns 1 if g(x t )=w is true, otherwise returns 0. Parameter L is the temporal bandwidth of the root filter, and is set to the average length of an activity segment determined from training examples. 5.5.2 Key-Pose Appearance Filter Filter T A k A models the appearance of thek th key-pose. Accurate key-pose detection requires the HoG-HoF descriptors to be computed from detections centered at the hu- man figure. Misaligned detections (Fig. 5.6(c)) capture only the partial human image, and produce noisy HoG-HoF features, which in turn leads to inaccurate key-pose de- tections. We incorporate a scale-alignment search around the trajectory detection box 70 x t k = [~ c;w;h], where~ c is the center of the detection box and (w;h) are its width and height: T A k A (x;t k ) = max s2S;~ p2P T k f (x t k = [~ c +~ p;sw;sh]) (5.5) whereS is the scale pyramid andP is the alignment search grid. We learn a conditional model, and hence the weight vector k corresponds to the discriminative ability of the appearance of the k th key-pose to classify the overall activity segment. Figure 5.5(d) show examples of appearance models learned for detecting key-poses. The weight mag- nitudes show a clear visual correlation with the discriminative key-pose present in the video. 5.5.3 Temporal Location Distribution The latent variablesz =ft r ;t 1 ;t 2 t K g together define the temporal location of the activity segment, and its constituent key-pose locations. As we do not have prior knowl- edge of the global temporal location of the activity segment, we set its distribution to a constant :P (t r jx)/C. The temporal distribution of the key-posesP (t k jx) is modeled using a standard nor- mal distributionN () with mean a k + k and variance 2 k . 
Parameter a k represents the expected temporal location of thek th key-pose relative tot r , and is called the anchor po- sition of the key-pose; this is analogous to the anchor position of parts in object detection frameworks [17] It is initialized to the average of the key-pose locations computed using the pose summarization algorithm described in Section 5.4.1, and remains unchanged during model training. The optimal key-pose locations j in each video (identified by the pose summarization algorithm) need not be centered within their corresponding tem- poral boundaries [b j ;b j+1 ) (see Fig. 5.2). The parameter k accounts for this offset, and 71 measures the linear shift in the key-pose location from its anchor position a k , while parameter 2 k measures the uncertainty in the temporal location. Figure 5.5(c) shows the parameterization of the normal distribution. Both k and k are learned during model training, however it is more convenient to learn the equivalent log-probability parameters (a k ;c k ), which are defined as follows: logP (t k jx)/ T D k D (t k ;t r ) =a k 2 t c k t (5.6) t =t r + a k t k ; a k = 1= 2 k ; c k = 2 k a k 5.6 Model Learning and Inference The HCRF model defined in equation 5.3 can be learned either using Max Likelihood criteria: = arg max Y i X z i P (y i ;z i jx i ;) (5.7) resulting in the traditional HCRF learning procedure described by Quattoni et al [56], or by an alternative approach of training the model using the Max-margin criteria [90]: = arg max :8 i max z iP (y i ;z i jx i ;) 1 min z iP (y i ;z i jx i ;) > (5.8) where is the margin between the positive and negative examples. It has been argued [90] that the max-margin criteria is better suited for the classification task; we train the parameters of our model in a Max-margin framework. The optimization in equation 5.8 is equivalent to optimizing a Latent Support Vector Machine [95] in the log-probability 72 domain. Transforming the probability distributions to the log domain results in the following energy function: E(x;z) = logP (y = +1jz;x) = T (x;z) (5.9) = T R R (x;t r ) + X k2K T D k D (t k ;t r ) + T A k A (x;t k ) 5.6.1 Latent Support Vector Machine Max Margin optimization for training linear classifiers results in a support vector ma- chine (SVM). A Latent Support Vector Machine (LSVM) incorporates latent variable inference in the SVM optimization algorithm. Yu et al [95] proposed a Concave-Convex Procedure (CCCP) for efficiently solving the latent-SVM optimization. For binary de- cision problems with Zero-One Loss functions, the LSVM formulation is equivalent to the one proposed in [17] . We prefer to use the formulation by [95] because of its general framework, and superior convergence properties due to its use of cutting plane algorithm to solve the underlying SVM optimization. The LSVM optimization is stated as follows: min " 1 2 kk 2 2 +C n X i=1 max ^ y;^ z T (x i ; ^ y; ^ z) + L (y i ; ^ y; ^ z) # " C n X i=1 max ~ z T (x i ;y i ; ~ z) # (5.10) where L is the loss function, and is the class augmented feature function. The optimization is solved using the Concave-Convex Procedure (CCCP), which minimizes f()g() where bothf andg are convex. To map our activity model into the LSVM formulation while satisfying the convexity requirements off andg, the feature function is defined as: (x i ;y i ; ^ z) = (x;z) for all positive examples, and equal to zero for negative examples. 
73 The CCCP algorithm requires solving two sub-problems iteratively: (1) Latent Vari- able Completion , and (2) Loss-Augmented Inference. Latent variable completion prob- lem is equivalent to MAP inference on the HCRF model, and is defined as: max ~ z T (x; ~ z) = T R R + X k2K max t k tr =0 T D k D + T A k A where t r is set to zero as training videos are pre-segmented. The maximization over t k can be solved in O(N) time (N is length of a trajectory) using distance transform algorithms [17]. The Loss Augmented inference problem with Zero-One loss for a binary decision problem is solved as follows: max ^ y;^ z T (x i ; ^ y; ^ z) + (y i ; ^ y; ^ z) = 8 > > < > > : max 1; max ^ z T (x; ^ z) ify i = +1 max 0; 1 + max ^ z T (x; ^ z) ify i =1 (5.11) The parametera k represents the variance of a normal distribution, and should always be a positive quantity. However, the Latent SVM formulation does not allow a constraint to be placed on the sign of the weights learned, and hence there exists a possibility of learning a negative value fora k . In practice, this does not occur if the parts defined in the model are discriminative enough and not redundant. 5.6.2 Weight Initialization Latent SVM optimization is a highly non-convex problem, and converges to a local minima. Hence, in practice careful initialization of the weights has been suggested in previous work using LSVMs [17, 51]. We train standard SVMs separately on the root filter and the appearance filter features, and initialize R and A respectively to the learned weights.c k is initialized using the mean displacement of the key-pose locations 74 k (obtained from the Pose Summarization algorithm) from the anchor location a k . Pa- rametera k is initialized using the pose-boundary locations (b k ;b k+1 ), as it represents the variance in the key pose location. 5.7 Model Inference for Multiple Detections Detecting and tracking humans in cluttered and crowded environments is a challenging problem. We use a standard appearance based pedestrian tracker [23], trained indepen- dently of the datasets used here. Figure 5.6 and 5.7 show some representative results highlighting the challenges. Common inaccuracies include false positive tracks, missed tracks, misaligned tracks, and track fragmentations. The PF-HCRF detector is less sen- sitive to false positive trajectories, and treats it as a valid human track where ideally no human activity will be detected. However, missed tracks are impossible to recover from; hence we prefer tracking algorithms with higher recall at the expense of precision. Mis- aligned tracks cause noisy key-pose detections, and hence we perform scale-alignment search (eqn. 5.5) around the detection box. Figure 5.8 shows the scale-alignment search about a candidate detection (blue ellipse), with the optimal box shown in red. In our ex- periments, we use a single octave scale pyramidS with 5 levels centered at the original scale, and a 3 3 alignment search gridP with 10 pixel step width. Track fragmentations are frequently caused by human subjects undergoing non- pedestrian pose transitions, which commonly occur during actions such as pickup. To counter the effects of premature track termination, we extend the trajectories beyond their start and end positions. Figure 5.9 shows an example of track fragmentation, and our proposed track extension (shaded-dashed detections in blue and magenta). 
We note 75 (c) Misaligned Tracks (a) False Tracks (b) Missed Tracks Figure 5.6: (a-c) Noisy and erroneous tracks that the extensions may not correspond to human subjects in the video (magenta col- ored extensions in figure), in which case they are equivalent to false positive tracks, and should not adversely affect our performance. Detection and classification on a test video is performed by inferring the optimal class labels and root filter location:fy ;t r g = max y;z P (y;zjx; ). The optimal root filter location t r is the detected position of the activity segment. For pre-segmented videos,t r is fixed to zero, and only the optimal class labely is inferred. In an activity 76 Figure 5.7: Track fragmentations due to ID switches. detection task, multiple instances of the same activity class can exist in a single video. The optimumt r will return only a single detection result. To incorporate multiple de- tections, we infer the time seriesA(t) = max z=tr P (y = +1;t r =tjx; ), representing the detection confidence at each timet. Following object detection algorithms, we apply a Non-Maxima Suppression (NMS) filter toA(t), and declare the resulting maximas as our predicted activity detections. Figure 5.10 shows an example of the multiple detection inference procedure for the two-handed wave action, where the outputs of the separate key-pose filters, root filters, and the inferred time series A(t) are shown. The ground truth row shows the activity 77 Figure 5.8: Scale allignment search. T=1 T=2 T=3 T=4 T=8 T=7 T=6 T=5 Figure 5.9: Track Extensions: A single actor track is fragmented into blue and magenta tracks due to non-pedestrian pose changes during the course of the pickup action. Track extensions help in resolving such track fragmentation errors. segments for two-handed wave action with positive labels (red), negative labels (green) for other actions and segments with no activity (cyan). The NMS output is given by pink bars, with the inferred key-pose locations marked in yellow, along with the key- pose frames shown above. The sequence of detected key-poses are consistent across segments and describe the activity. The video sequence contains action segments with partial occlusions, where some of the key-poses are not visible. We observe that the individual key-pose and root filter detection confidences are not sufficient for detecting the activity segments, whereas the 78 Key Pose 3 Key Pose 1 Key Pose 2 Root Filter NMS output Ground Truth Max P(y +1 ,t r |x) Z/t r A(t)= Key Pose 4 KP-1 KP-2 KP-3 KP-4 KP-1 KP-2 KP-3 KP-4 KP-1 KP-2 KP-3 KP-4 KP-1 KP-2 KP-3 KP-4 High Low PF-HCRF Inference Result on Single Trajectory Figure 5.10: Heat maps represent the output scores of the key-pose filters, root filters and the inferred detection confidenceA(t), along with ground truth and predicted detection segments. Refer the text for more details. (Figure is best viewed in color and magnified) combined inference resultA(t) provides a clear segmentation of the video, hence vali- dating our algorithm. The NMS algorithm also detects a false positive due to the local maxima occurring in that segment; choosing an appropriate confidence-threshold for the detected maximas will remove the weakly scored false positives. We set the threshold to the confidence value corresponding to the maximum F1 score of each detector. 5.8 Results We evaluate our algorithm on 4 datasets: UT-Interaction [64], USC-Gestures [49], CMU-Action [31] and Rochester-ADL [42]. 
The model is trained using a pose-codeword vocabulary size of 500, and by selecting an appropriate number of key-poses K 2 79 Method 50%-Video Full-Video PF-HCRF 83:33% 97:50% Raptis [60] 73:30% 93:30% Ryoo [65] 70:00% 85:00% Cuboid+SVM [64] 31:70% 85:00% BP+SVM [65] 65:00% 83:30% Vahdat [82] - 93:30% Zhang [97] - 95:00% Kong [33] - 88:30% Table 5.1: UT-Interaction: Classification accuracy for observing the initial 50% of the video, and the full video. f3; 4; 5g based on action complexity. Inference on the PF-HCRF model runs at 0:05 fps on a standard PC, and at 2 fps without scale-alignment search. 5.8.1 UT-Interaction [64] The UT-Interaction Set-1 dataset was released as a part of the contest on Semantic De- scription of Human Activities (SDHA) [64]. It contains 6 types of human-human in- teractions: hand-shake, hug, kick, point, punch and push. The dataset is challenging as many actions consist of similar human poses, like “outstretched-hand” occurs in point, punch, push and shake actions. There are 10 video sequences shot in a parking-lot, with 2-5 people performing the interactive actions in random order. [Classification] SDHA contest [64] recommends using a 10-fold leave-one-out evalu- ation methodology. PF-HCRF achieves an average classification score of 97:50%, and outperforms all existing approaches (Table 5.1). We also evaluate our model on the streaming task (or activity prediction task), where only the initial fraction of the video is observable. This measures the algorithm’s performance at classifying videos of in- complete activity executions. Figure 5.11 plots the classification accuracy for different values of observation ratio. The PF-HCRF model out-performs other methods, which 80 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Video Observation Ratio: θ Accuracy(%) UT−Interaction Streaming Video Performance Key−Framing MSSC Dynamic Cuboid Integral Cuboid Cuboid+SVM Bayesian Random−Chance PF−HCRF Figure 5.11: Streaming video performance on the UT-Interaction compared against Key- Framing [60], MSSC [9], Dynamic Cuboid [65], Integral Cuboid [65], Cuboid+SVM [65], Bayesian [65], Random-Chance and PF-HCRF (Our Method). 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.7 0.75 0.8 0.85 0.9 0.95 1 Recall Precision UT−Interaction Detection Result Hand−Shake Hug Kick Point Punch Push Figure 5.12: UT-Interaction: Precision-Recall curves for activity detection. 81 can be attributed to its learning a small set of discriminative key-poses, where detecting even the first few key-poses helps in classifying the action. Moreover, the model returns the most likely position of key-poses in the unobserved section of the video, and hence is capable of “gap-filling”. [60] also uses a key-frame based algorithm, however it is unable to perform gap-filling, as they only learn the temporal order of the key-frames, whereas PF-HCRF employs a probabilistic model for the key-pose locations, which is learned in a discriminative manner. [Detection] SDHA contest [64] recommends evaluating the detection performance on the 10 videos using precision-recall curves for the 6 actions, and we present the same in Figure 5.12. None of the contest participants report detection PR curves [64], making us the first ones to do so. [60] report detection results while assuming that each video contains one and only one instance of each action type, and report an accuracy rate of 86:70% averaged over all actions, where the predicted action has a 50% temporal overlap with the groundtruth. 
Using the same metric, PF-HCRF model achieves an average detection accuracy of 90:00%. 5.8.2 USC-Gestures [49] The dataset consists of 8 video sequences of 8 different actors, each performing 5-6 in- stances of 12 actions, resulting in 493 action segments. The actions correspond to hand gestures like attention, left-turn, right-turn, flap, close-distance, mount etc. The dataset has a relatively clean background with stationary humans, however it is still challenging due to relatively small pose differences between actions, causing pose-ordering to be- come a key discriminative factor in recognizing actions. [Classification] Following [49, 72], we evaluate the classification performance using 82 Method Tr:Ts = 1 : 7 Tr:Ts = 3 : 5 PF-HCRF (Classif.) 98:00% 99:67% Root-Filter 58:81% 85:57% Singh [72] 92:00% - Natarajan [49] 79:00% 90:18% PF-HCRF (Det. MAP) 0:68 0:79 Root-Filter (Det. MAP) 0:26 0:49 Table 5.2: Result tables for USC-Gestures two different train-test ratios: 1:8, and 3:5, averaged over all folds. The PF-HCRF algo- rithm outperforms previous results (Table 5.2) in all split ratios. Furthermore, [49, 72] require manual construction of activity models using 2.5D joint locations for manually identified key-poses; the models also contain pre-defined motion styles and durations. PF-HCRF avoids cumbersome manual annotation of motion styles, while also automat- ically identifying the key-poses. [Detection] The dataset has 8 videos (12000 frames per video) containing continuous executions of 493 action segments. The action segments consist only 10% of total video frames, and are interspersed with gesture actions other than the 12 gestures used for classification, making it a challenging dataset for activity detection. For baseline comparison, we implemented a Root-filter classifier (Sec.5.5.1), where a standard SVM is trained using the histogram of pose-codewords. Table 5.2 shows the Mean Average Precision score averaged over 12 actions for the detection task. The root-filter does not capture temporal dynamics, and fails to differentiate between gestures with simi- lar key-poses, but different temporal ordering, which explains their lower performance compared to PF-HCRF. 83 5.8.3 CMU-Action [31] This dataset contain events representing real world activities such as picking up object from the ground, waving for a bus, pushing an elevator button, jumping jacks and two handed waves. The dataset consists of 20 minutes of video containing 110 events of interest, with three to six actors performing multiple instances of the actions. The videos were shot using hand held cameras in a cluttered/crowded environment, with moving people and cars composing a dynamic background. The dataset is challenging due to its poor resolution (160120), frequent occlusions, high variability in how sub- jects perform the actions, and also significant spatial and temporal scale differences in the actions. We evaluate our performance using a 1:2 train:test split. Fig. 5.13 shows the Precision-Recall curves for the 5 action classes, and for four different method variations. First, PF-HCRF model is applied to manually annotated ground truth tracks (M1). Next it is applied to tracks computed from a pedestrian tracker [23] (M2), and then reap- plied without “scale-alignment search and track extensions”(M3). Lastly, the Root-filter based SVM classifier is applied to the computed tracks (M4). We compare our perfor- mance to previously published results [31, 71, 96] on this dataset. 
Ke et al [31] show results using a flow consistency based correlation model of [71], and three variants of their super-pixel part-based method. Note, that these methods use only a single activity instance for training. Yuan et al [96] combine the two-handed-wave and jumping jack actions, and show results only on this single combined action Results on ground truth tracks (M1) provides an upper-bound on our performance in terms of reliance on tracks, and we achieve the best results using PF-HCRF across 84 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Precision Hand−Wave 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Precision Pick−Up 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Precision Pushing−Elevator−Button 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Precision Jumping−Jacks 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Precision Two−Handed−Wave Ke-SFP Ke-SP Ke-SW Schetman (M3) Without Scale- Alignment Search (M2) PF-HCRF Computed Tracks Yuan (M1) PF-HCRF Ground Truth Tracks Other Published Results Our Results (M4) Root-Filter SVM Figure 5.13: Precision-Recall curves on CMU action dataset compared to other meth- ods: Ke-SFP[31], Ke-SP[31], Ke-SW[31], Shechtman[71], Yuan[96] 85 Method Accuracy Features PF-HCRF 88:67% HoG+HoF Wang [86] 85:00% HoG+HoF Wang [86] 96:00% HoG+HoF+ContextFtrs Messing [42] 67:00% Key Point Tracks (KPT) Messing [42] 89:00% KPT+Color+FaceDets Laptev [36] 59:00% HoG+HoF Raptis [61] 82:67% KPT+HoG+HoF Satkin [68] 80:00% HoF Table 5.3: Result tables for Rochester-ADL. all actions. With computed tracks (M2), PF-HCRF still out-perform other existing tech- niques across all actions, showing our model’s tolerance to noisy tracks. Without “scale- alignment search and track extensions” (M3), the performance on hand-wave and pickup activities is poorer, which we attribute to misaligned and fragmented tracks caused by non-pedestrian poses, however, we still have good results for the other three activities. Lastly the results with the Root-filter classifier (M4) are significantly lower, which vali- dates that our performance improvement is over and above simply using human tracking results. 5.8.4 Rochester ADL [42] The Activities of Daily Living (ADL) dataset contains 150 videos performed by 5 actors in a kitchen environment, and consists of 10 complex daily-living activities, involving interaction and manipulation of hard to detect objects: answering phone, dialing phone, looking up phone directory, writing on whiteboard, drinking water, eating snacks, peel- ing banana, eating banana, chopping banana and eating using silverware. As we do not have access to an upper-body tracker, the PF-HCRF is applied to the entire frame instead of tracks. 86 1 2 3 1 2 3 4 Answering Phone Drinking Water Pick Phone Open Phone Phone to Ear Approach Fridge Pickup Bottle Pour Water Drink Water Figure 5.14: Key-pose sequences inferred by PF-HCRF gives a semantic description of activity with high consistency. Table 5.3 summarizes our results on the dataset. PF-HCRF achieves an accuracy rate of 88:67% using only HoG-HoF features. The choice of features is important for this dataset, as special features can be designed to capture elements of the kitchen scene and the various objects, like yellow-banana, white-board, phone-near-face etc. 
Messing et al [42] augment their model with color and face-detection based features, improving their accuracy from 67% to 89%. Similarly, [86] augments their HoG-HoF descriptors with contextual-interaction based features, causing their accuracy to improve from 85% to 96%; we expect that the PF-HCRF model will also benefit from using simmilar contex- tual features. Furthermore, PF-HCRF localizes the key-poses of the complex activities, and we observe high consistency in the key-pose appearance across actors (Fig.5.14), and they seem to correspond to a natural semantic interpretation. Such decompositions are not obtainable using [42, 86]. 87 5.9 Conclusion We propose a key-pose filter based HCRF model for activity detection and classification. Our model is capable of detecting multiple instances of an activity in an unsegmented video. We also present a novel pose summarization algorithm to automatically identify the key poses of an activity sequence. Our model training does not require manual annotation of key-poses, or primitive actions labels, and uses video segment level class labels only. We conduct extensive experiments on a publicly available human gesture dataset to test the classification performance. We also introduce a new extended human gesture dataset containing unsegmented videos, to test the detection performance of our model. We show large improvements over the baseline methods, and competitive performance with previous work. 88 Chapter 6 Multi-State Dynamic Pooling using Segment Selection In this chapter, our objective is to recognize long range complex events composed of a sequence of primitive actions. This is a challenging task as videos may not be con- sistently aligned in time with respect to the primitive actions across video examples. Moreover, there can exist arbitrary long intervals between, and within the execution of primitive actions. We propose a novel mutltistate segment selection algorithm, which pools features from the discriminative segments of a video. 6.1 Introduction Human activity recognition is a fundamental problem in computer vision. The difficulty of the task varies greatly based on the complexity of the actions themselves, and the imaging conditions. It is useful to characterize the activities as falling into three broad classes. Firstly, is the class of relatively short period activities consisting of primitive actions such as shake, hug and wave [67], and snippets of sport activities like diving, 89 golf-swing [51] etc. A second category of activities consist of complex events natu- rally described as a composition of simpler and shorter term primitive actions; examples include videos of cooking recipes and assembly instructions, which take place in a struc- tured environment consisting of fixed camera and consistent background structures, such as kitchen tops and shelves. Lastly, there is the category of videos in the wild, such as amateur video uploads on YouTube, where there is significant variety in camera pose and challenging background environments. The focus of this work lies on the second category of activities. Even though the three categories face common challenges caused by ambiguities inherent to images and variations in the actor clothing, style and background variations, there are also signifi- cant differences. For the shorter-term activities, local features such as STIPs or Dense Trajectory (DT) features aggregated by methods such as Bag of Words [36, 34] or Fisher Vectors [55] have been found to give good performance. 
However, such methods are not adequate for the more complex events due to the much larger variations possible in such activities. Consider the example of grating cheese illustrated in figure 6.1, where the relatively straightforward task of retrieving cheese from the refrigerator, unwrapping it, grating it and replacing the cheese back, is performed in varying styles by two different actors. Some actors retrieve a bowl (or a tray) before (or after) retrieving the grater, resulting in only a loose temporal ordering between the primitive states of the activity, making it difficult to estimate a fixed temporal distribution of the states in a video se- quence. Moreover there are significant variations in the duration of each state due to different speeds and styles of the actors, and optional actions like moving the grater may cause large gaps between the essential states of the event. The task becomes even more challenging due to gaps appearing within the execution of a state, as actors tend to take breaks during long actions, like grating cheese, and hence violating the usual 90 Takeout Bowl Takeout FlatGrater Takeout PlasticBag Remove Package Grate Cheese Package Cheese Put In PlasticBag Takeout CuttingBoard Takeout FlatGrater Takeout PlasticBag Remove Package Grate Cheese Move FlatGrater Scratch Off Cheese Package Cheese Put In PlasticBag Video 1 Video 2 Figure 6.1: Composite event Grating Cheese is composed of numerous primitive ac- tions which occur with varying durations and arbitrary gaps between them, making it a challenge to learn a composite event classifier. assumption of temporal continuity with respect to primitive action executions. How- ever, we note that certain long range temporal relationships are not violated, such as retrieving the cheese before grating it, and a composite event recognizer should learn such relationships. Furthermore, such videos may contain many intervals that are not of direct relevance to the activity of interest, for example, resting in between parts of the task. Some of these variations may be captured by spatio-temporal feature pooling strategies [36, 75], where the video is divided into spatio-temporal grids of histograms. However rigid grid quantizations are not likely to be sufficient for un-aligned and un-cropped videos [34], and are sensitive to temporal variations in action executions. There has been work representing such activities by dynamical graphical models such as Dynamic Bayesian Networks [40, 49]; however, these representations can be highly sensitive to varying time intervals and spurious intervening activities. We suggest that dynamic pooling strategies offer a suitable compromise between static pooling and the rigid dynamical models. Dynamic pooling has been used in pre- vious works, such as [51, 19, 79], which identify the discriminative video segments to pool features from, and therefore adapt the pooling strategy to the observed features in 91 S A S B S C + + + j T A f j T B f j T C f φ(x) =[ φ A (x) , φ B (x) , φ C (x) ] LSVM Optimization Finite State Machine : Figure 6.2: Flow diagram for multistate dynamic feature pooling algorithm. A 3-state finite state machine is assumed, which specifies a temporal ordering asS A <S B <S C . Segment classification scores are computed w.r.t each state, and the discriminative seg- ments are selected by solving a linear programe. The aggregate feature statistics from the selected segments are pooled to compute a global feature statistic (x). 
which is used for training a Latent-SVM classifier, and hence learn the optimal activity classi- fication weights. The weight learning and feature poolling is repeated iteratively till convergence. the video. However, such methods assume restrictive constraints on the temporal po- sitioning of primitive actions, and do not allow for arbitrary length inter-action gaps, and furthermore restrict the hypothesis space for primitive actions to continuous video segments only. Our work is closest in spirit to [37], which proposed a dynamic pooling algorithm that treats the locations of characteristic segments of the video as latent variables which are inferred for each video sequence in a LSVM framework. The possible set of segment selections is combinatorial in size, and they propose a fractional integer linear program- ming based solution to obtain the exact solution. However, [37] does not distinguish between the primitive actions in a video during pooling, and incorporates only weak 92 pair-wise temporal associations by considering an exhaustive set of segment pairs dur- ing both training and testing phase, making it unsuitable for long-term complex event recognition. We are motivated to extend the segment selection algorithm [37] to incorporate learning of temporal relationships between primitive actions, and propose a novel tem- poral segment selection algorithm for complex event recognition. We formulate the task of multi-state dynamic feature pooling as a linear programming optimization prob- lem, where the discriminative weights for feature classification are learned iteratively in a latent support vector machine (LSVM). Figure 6.2 presents an overview of our ap- proach while assuming a three state model. Aggregate statistics, such as histograms of spatio-temporal features are computed from overlapping segments of the video, and a classification score is assigned to each segment w.r.t each state. The discriminative seg- ments are identified on a per-state basis, and a global feature vector is pooled from the selected segments, which is used to train the LSVM classifier. Our algorithm is si- multaneously (1) robust to arbitrary gaps between primitive states while respecting their temporal ordering, (2) robust to gaps within primitive states caused by varying dura- tions of primitive action executions, and (3) selects the discriminative sub-segments of the video for classifying the composite event. Our algorithm eschews any manual anno- tations of the states during training, and instead automatically infers the important states from the training data. Our key contributions are multi fold: (1) We present a new provably fast solution to the K-segment selection problem, which improves the running time of the previously proposed [37] fractional linear program based solution toO(N logN), (2) we present a regularized linear programming optimization for multi-state K-segment selection for dynamic feature pooling, and finally (3) propose a novel extension to our linear program 93 for automatically determining the number of discriminative segments in a video, and hence avoiding pooling features from non-informative (w.r.t to the complex event label) video segments. We validate our algorithm on the recently introduced Composite Cooking Dataset [63], which consists of a collection of cooking recipe demonstrations performed by multiple actors while following a weakly enforced script. 
Each recipe is a long term composite event, consisting of a series of primitive actions like opening cupboard, taking out toaster, peeling vegetables etc, and we show significantly improved results on the dataset, along with an analysis of the discriminative segments selected by our algorithm. 6.2 Related Work We focus our survey on classification algorithms using different feature pooling strate- gies, and also briefly review algorithms modeling temporal structure of complex activi- ties. Static pooling algorithm model the (x,y,t) distribution of features in the 3-dimensional video volume, and fix the pooling strategy before commencing classifier training, and do not adapt dynamically to the features statistics observed during training. A popular class of pooling strategy divides the video into multiple spatio-temporal grids [36, 75], and concatenates the feature statistics computed from each of the grids into a global video- wide feature vector, while other approaches construct interest point centric histograms [16, 20, 34], where feature statistics are accumulated from static spatio-temporal grids centered around each detected interest point. However, the static nature of the feature pooling grids assumes that the underlying action is consistently aligned with the video, and makes it sensitive to time and duration shifts in action instances. 94 In real life videos, the actions of interest might occur in only small a fraction of the frames. Restricting the classifier to the important video segments [69, 68] has been shown to improve recognition accuracy, emphasizing the importance of dynamic feature pooling algorithms, which only compute feature statistics from the important segments of a video, determined dynamically during training. While some algorithms ignore the temporal structure, and select only a single continuous subsequence [69, 54] of the video, others represent the video as a sequence of atomic segments [19, 51, 79] and either learn or manually define temporal relationships between the segments. A variety of atomic segment representations have been proposed such as histogram of codewords over temporal [51] and spatio-temporal volumes [79], poselet based representation of human actions [60], and scene context based video decomposition [81]. However they either ignore temporal ordering [81], or assume well cropped videos without arbitrary gaps between the atomic segments [51, 79], while [60] selects only a small number of discrete frames, and requires a manually defined set of relevant poselets, further restrict- ing it to primitive single-human actions. A popular class of approaches represent the complex event as a sequence of state transition of a finite state machine, where the state definitions can be semantic, such as human key-poses [40] or linear interpolations between 3D joint positions [49]. Other approaches favor non semantic state definitions based on low level feature, such as dis- criminative motion patches [90] and histogram of gradient/flow features [78]. The event label is inferred using either logic models [8], or generative [13] and conditional [73, 49] probabilistic graphical models like HMM and CRF. Such methods have shown robust performance on short duration activities like walking, jumping etc., however due to their inherent Markovian nature, they are unable to handle long range dependencies between 95 primitive action segments, and are sensitive to variations in duration and style of ac- tion execution. 
Recently, [78] proposed a conditional variable duration HMM model for action classification on youtube videos, however it does not distinguish between video segments based on relevance to the composite event, and instead attempts to model ev- ery segment as a valid state, making it susceptible to spurious features from unimportant segments. 6.3 Pooling Interest Point Features for Event Classification Classical framework for event recognition using low level image features consists of a three stage process: detection, pooling and classification. The detection stage consists of computing a set of descriptors x i =fx k g from spatio-temporal interest point detec- tions in thei th video. The pooling stage involves combining the multiple local feature descriptors into a single global feature (x i ) representation of the video. Lastly in the classification stage, discriminative classifiers like support vector machines are trained using the global features as training data. Our contributions are in the feature pooling stage of the framework. We present a new, provably faster, algorithm to a previous segment selection algorithm by Li et al [37]. We further propose a novel algorithm for pooling features from the discriminative time intervals of a video, while modeling the temporal dynamics present in the activity in a joint optimization framework. We also present an extension to our algorithm which automatically determines the optimum number of segments to select. 96 6.3.1 Discriminative Segment Selection We first describe the basic framework of discriminative segment selection, where the video is divided into N equal-length temporal segments. Let f t be the locally pooled feature computed from thet th segment, where the pooling criteria can be as simple as computing a histogram of codewords detected within the segment. The global feature descriptor is computed by pooling a subset of the segments: (x; h) = P t f t h t P t e t h t ; ;x = max h T (x;h) = max h P t T f t h t P t e t h t (6.1) where h t is a binary variable indicating the selection of the t th segment, and e t is a strictly positive constant which normalizes the () with respect to the number of seg- ments selected. A variety of representations can be chosen for f t ande t , for example, setting f t as the histogram of codewords ande t as the sum of codewords appearing in thet th segment results in the classical BoW feature. The discriminative weight vector computes the score c j corresponding to each segment, where the score is proportional to the importance of the segment in classifying the given event. The weight vector is learned using a Latent-SVM optimization [95, 17]: min 1 2 kk 2 2 +C X i2P max 1; ;x i +C X i2N max 0; 1 + ;x i C X i2P ;x i (6.2) whereP andN are the set of positive and negative training examples. 6.3.2 K-Segment Selection (KSS) Inference of the latent variableh in equation 6.1 determines the important segments in a video, and the inference algorithm can be stated as aK-segment selection problem, 97 where the objective is to select theK most optimum segments from a video for clas- sification purpose, and is equivalent to solving the following fractional integer linear program: KSS(K) : Maximize P N t=1 c t h t P N t=1 e t h t s.t. X N t=1 h s t =K ; 8 t h t 2f0; 1g (6.3) where c t = T f j is the classification score value corresponding to the feature vector f t , pooled from the t th segment. 
Li et al [37] proposed a relaxed linear programming solution to solve the above integer problem, where the fractional linear program is trans- formed to an equivalent standard linear program [5]. An optimal solution to the linear program is computed using the standard simplex algorithm, which has exponential worst case complexity, and on average polynomial time complexity. We next present a prov- ably faster algorithm to solve theK-segment selection problem. 6.3.3 Linear Time Subset Scanning Our key observation is that selecting the optimumK segments in a video by optimizing equation 6.3, is equivalent to solving the Linear Time Subset Scanning (LTSS) problem [50]. We give a brief introduction to the LTSS problem, and refer the readers to [50] for the details. Let us define a subsetS =ft3 h t = 1g, and define the following two additive statistics over the subset: X(S) = P t2S c t and Y (S) = P t2S e t . We further define a subset scoring functionF (S) = F (X;Y ) = X(S) Y(S) according to which we want to select the best possible subset. For all scoring functionsF (S) satisfying the LTSS property, the optimal subsetS maximizing the score can be found by ordering the elements of the set according to some priority function G(t), and selecting the topK highest priority elements. 98 A scoring functionF (S) satisfies the LTSS property (Theorem 1 [50]) with priority function G(t) = ct et , if (1) F (S) = F (X;Y ) is a quasi-convex function of two addi- tive sufficient statistics of subsetS, (2)F (S) is monotonically increasing withX(S), and (3) all additive elements of Y (S) are positive. In our case, the scoring function F (S) = P t2S ct P t2S et = P t ctht P t etht =F (h) is a ratio of linear functions in the segment selector variable vectorh, and hence can be shown to be quasi-convex [5] using a simple analy- sis of its-sublevel sets. Monotonicity ofF (h) w.r.t.X(S) is shown due to them being linearly related, and furthermore, e t are strictly positive by design, as they represent the normalization factor for each segment. Hence, theK-Segment Selection problem satisfies the LTSS property with priority functionG(t) = ct et . To select the optimumK segments, we sort the set of segment scores: n ct et o N t=1 , and select the segments corre- sponding to the topK scores. Therefore our algorithm reduces theK-segment selection problem to a simple sorting problem with a time complexity ofO(N logN), which is an order of magnitude faster than solving a linear program. Note, that we only require an unordered list of the topK segments, and hence, one can select theK th largest ele- ment using the linear time median-of-medians selection algorithm, and select the topK segments through a linear traversal over all the segments, which solves the problem in O(N) linear time. 6.4 Multistate K-Segment Selection (MKSS) TheK-segment selection algorithm [37] does not consider the temporal relationships between the selected segments, and hence ignores the temporal ordering of primitive action sequences composing a complex event. In effect, it can be viewed as a single state segment selection. We are motivated to extend theK-segment selection algorithm to a multi-state formulation, where each state corresponds to discriminative primitive 99 actions present in the video. Let us consider a two state problem, where our objective is to select the optimum sub-segments for both state A and state B, such that state A occurs in the video before state B. 
This is equivalent to a finite state machine where state A transitions to state B. We define h A ; h B 2f0; 1g N as the segment selection indicator vectors corresponding to the two states. To ensure the temporal ordering in the selected subsegments, we construct the following constraint: Kh B t + X N k=t h A k K 1tN (6.4) The linear equation defines a mutual exclusion constraint on states A and B, such that if thet th segment is assigned to state B, then all temporal successor segments appearing aftert cannot be assigned to state A. Similar constraints can be placed on predecessor segments of state B when selecting thet th segment for state A. Using equation 6.4, we can build any left to right transition finite state machine with arbitrary number of states. Note, that self transitions are implicitly modeled, as the selected segments of a particular state can be arbitrarily separated. State duration models specify the expected time a Markovian system will spend in a particular state. Similar duration constraints can be placed on our state models by specifying a compactness constraint: Kh A t + X t k=1 h A k + X N k=t+ h A k K 1tN (6.5) The compactness constraint specifies that all the segments selected for the state A, must lie within a temporal window of length 2, by placing a mutual exclusion constraint on thet th segment and all other segments lying outside the 2 window. We combine both 100 the temporal constraints and the compactness constraints with theK-segment selection problem, and formulate it as a relaxed linear programming optimization: MKSS (K) = Maximize P s2S P N t=1 c s t h s t P s2S P N t=1 e s t h s t +a log X s2S kh s k 1 (6.6) s.t. C1 : 0h s t 1 8s2S; 1tN C2 : X N t=1 h s t =K 8s2S C3 : Kh s b t + X N k=t h sa k K 8(s a ;s b )2T; 1tN Kh sa t + X t k=1 h s b k K 8(s a ;s b )2T; 1tN C4 : X s2S h s t 1 1tN C5 : Kh s t + X t k=1 h s k + X N k=t+ h s k K 8s2S; 1tN whereS is the set of states in our model, and (s a ;s b )2T is the set of temporal order constraints where state s b appears only after state s a . C1 is the linear relaxation con- straint over the binary indicator variables. C2 specifies that exactlyK segments should be selected corresponding to each state. C3 and C5 correspond to the temporal and com- pactness constraints respectively. C4 ensures that no segment is counted twice during state assignments. The resulting optimization is a fractional linear program, where the objective function is a ratio of linear functions. Solving fractional linear programs is a well explore problem in operations research, and can be solved using a simple transfor- mation [5] to an equivalent standard linear program: Maximize c T h+d e T h+f Gh m Ah = b e T h +f > 0 () Maximize c T y +dz Gy mz 0 Ay bz = 0 e T y +fz = 1; z 0 where y = h e T h+f and z = 1 e T h+f . Note, that the number of constraints in equation 6.6, and in the above transformed problem, are polynomial in the number of states and 101 segments, and hence the linear program can be efficiently solved using off the shelf solvers 1 . The optimal h s vector obtained by solving the linear program in equation 6.6 iden- tifies the selected segments for pooling features corresponding to each states2S. The global feature descriptor (x) is defined as a concatenation of features pooled from the individual states. 
Assuming a 3-state model such as in figure 6.2, the global feature vector is computed as: (x) = [ A ; B ; C ] = P t f t h A t P t e t h A t ; P t f t h B t P t e t h B t ; P t f t h C t P t e t h C t (6.7) which is used to train an latent-SVM classifier. 6.5 Selecting Optimal Parameter K The parameterK in theK-segment selection problem specifies the number of segments to be selected. However the appropriateK value can vary from video to video, and there is little intuition on how to compute an appropriate value. One feasible criteria is to select the bestK value by iteratively solving theK-segment selection problem : K = arg max K KSS(K). It can be shown that the optimum value of such an iterative procedure will always beK = 1, i.e. only selecting the segment with the largest c i d i ratio. Consider the following inequality: a 2 b 2 a 2 +a 1 b 2 +b 1 a 1 b 1 , which can be verified for all b 1 ;b 2 0 using simple algebraic operations. The inequality shows that any combination of multiple segments will always have a lower ratio value than the single segment with the highest c i d i ratio. A similar theoretical argument cannot be made for MKSS, however 1 http://www.gnu.org/software/glpk/ 102 in our experiments we observe that the optimum solution is for each state to select a single best segment. Previously, [37] addressed the problem by adding a logarithmic regularization func- tion :a log (khk 1 ), which favors choosing a larger number of segments. However choos- ing appropriate values of the hyper-parametera is again non-trivial, and it is estimated through cross-validation for each action category. In effect, the regularization parameter a indirectly chooses the appropriateK value, and is equivalent to selectingK through cross-validation. We next present an extension to our multistate segment selection algo- rithm for automatically selecting the optimum number of segments. 6.5.1 Regularized Multistate Segment Selection (RMSS) The segment selection criteria used in KSS and MKSS are such that the negatively scored segments will never be selected. On the other hand, the segments contributing a positively weighted score corresponds to the discriminative (w.r.t classifying the com- posite event) segments in the video, and hence an appropriate segment selection criteria should maximize the number of positively weighted segments selected while satisfying the multi-state constraints. We define a vector r sa t = 0:5 (c sa t +jc sa t j) which con- tains all the positive valued scores computed from the segments for states a , while the negative valued scores are set to zero. We further define a segment selection constraint P N t=1 r sa t h sa t P N t=1 r sa t which ensures that at least a fraction of the positively weighted segments will be selected as part of the optimum solution. The linear program- ming optimization for regularized multistate segment selection takes only the positive weight fraction as input parameter, and is defined as follows: RMSS () = Maximize P s2S P N t=1 c s t h s t P s2S P N t=1 e s t h s t ; ~ K = N kSk (6.8) 103 s.t. 
Previously, [37] addressed the problem by adding a logarithmic regularization function a \log(\|h\|_1), which favors choosing a larger number of segments. However, choosing an appropriate value of the hyper-parameter a is again non-trivial, and it is estimated through cross-validation for each action category. In effect, the regularization parameter a indirectly chooses the appropriate K value, and is equivalent to selecting K through cross-validation. We next present an extension to our multistate segment selection algorithm for automatically selecting the optimum number of segments.

6.5.1 Regularized Multistate Segment Selection (RMSS)

The segment selection criteria used in KSS and MKSS are such that negatively scored segments will never be selected. On the other hand, the segments contributing a positively weighted score correspond to the segments that are discriminative (w.r.t. classifying the composite event), and hence an appropriate segment selection criterion should maximize the number of positively weighted segments selected while satisfying the multi-state constraints. We define a vector r^{s_a}_t = 0.5 (c^{s_a}_t + |c^{s_a}_t|), which contains all the positive-valued scores computed from the segments for state s_a, while the negative-valued scores are set to zero. We further define a segment selection constraint \sum_{t=1}^{N} r^{s_a}_t h^{s_a}_t \ge \eta \sum_{t=1}^{N} r^{s_a}_t, which ensures that at least a fraction \eta of the positively weighted segments will be selected as part of the optimum solution. The linear programming optimization for regularized multistate segment selection takes only the positive weight fraction \eta as its input parameter, and is defined as follows:

RMSS(\eta) = Maximize  \frac{\sum_{s \in S} \sum_{t=1}^{N} c^s_t h^s_t}{\sum_{s \in S} \sum_{t=1}^{N} e^s_t h^s_t},  \quad \tilde{K} = \frac{\eta N}{|S|}    (6.8)

s.t.
C1:  0 \le h^s_t \le 1,  \forall s \in S,  1 \le t \le N
C2b: \sum_{t=1}^{N} h^s_t \le \tilde{K},  \forall s \in S
C3:  \tilde{K} h^{s_b}_t + \sum_{k=t}^{N} h^{s_a}_k \le \tilde{K},  \forall (s_a, s_b) \in T,  1 \le t \le N
     \tilde{K} h^{s_a}_t + \sum_{k=1}^{t} h^{s_b}_k \le \tilde{K},  \forall (s_a, s_b) \in T,  1 \le t \le N
C4:  \sum_{s \in S} h^s_t \le 1,  1 \le t \le N
C5:  \tilde{K} h^s_t + \sum_{k=1}^{t-\delta} h^s_k + \sum_{k=t+\delta}^{N} h^s_k \le \tilde{K},  \forall s \in S,  1 \le t \le N
C6:  \sum_{t=1}^{N} r^{s_a}_t h^{s_a}_t \ge \eta \sum_{t=1}^{N} r^{s_a}_t,  \forall s_a \in S
     r^{s_a}_t = 0.5 (c^{s_a}_t + |c^{s_a}_t|),  \forall s_a \in S,  1 \le t \le N
C7:  \sum_{t=1}^{N} h^{s_a}_t \ge (1 - \gamma) \sum_{t=1}^{N} h^{s_b}_t,  \forall s_a, s_b \in S
     \sum_{t=1}^{N} h^{s_a}_t \le (1 + \gamma) \sum_{t=1}^{N} h^{s_b}_t,  \forall s_a, s_b \in S

where the parameter \tilde{K} is the maximum number of segments that can be selected per state in a video with N segments. The optimization does not restrict each state to a constant number \tilde{K} of segment selections; instead it relaxes the equality constraint C2 to C2b in equation (6.8), which only places an upper bound on the number of segments selected. It is also desirable that the number of segments selected for each state is balanced across states, so that a single state does not dominate the segment selection solution. The additional constraint C7 ensures a balanced selection of segments across states, within a margin of \gamma.

The parameter \eta determines the number of segments selected in the optimum solution. However, the constraints in the RMSS(\eta) optimization can be rendered infeasible for certain values of \eta; in particular, if the value is too high, it is likely that the state transition constraints cannot be satisfied by any combination of segment selections. We further observe that there exists an \eta_0 \ge 0 such that RMSS(\eta) has a feasible solution for all \eta \le \eta_0, and hence the optimization problem is monotonic in \eta with respect to its feasibility. The monotonic behavior suggests a simple binary search over \eta to find the optimal \eta_0 within an error margin of \epsilon in O(\log \frac{1}{\epsilon}) iterations.

6.6 Dynamic Program for K-Segment Selection

The multi-state K-segment selection problem can also be solved using dynamic programming under some assumptions. For the dynamic programming solution, a slightly modified objective function is solved:

h^* = \arg\max_h \sum_{s \in S} \frac{\sum_{t=1}^{N} c^s_t h^s_t}{\sum_{t=1}^{N} e^s_t h^s_t}    (6.9)

where the fractional sums for each state are separated into individual terms. Note that the optimization objectives in equations (6.6) and (6.9) are equivalent for the case when the segment histograms are normalized and the denominator terms e_t are set to 1. We next describe a dynamic programming based algorithm for solving the above optimization.

We define an array E[s, i, j], whose elements correspond to the score of selecting K segments between segments i and j for state s. Furthermore, we define an array Opt[s, j] which contains the optimal value of the optimization in equation (6.9) up to the s-th state and the j-th segment in the video. The optimal solution to equation (6.9) for N segments and |S| states is given by Opt[|S|, N]. The contents of the array Opt[s, j] can be efficiently computed using the following dynamic programming recursion:

Opt(s, j) = \max_{1 \le i \le j} \{ Opt(s-1, i-1) + E(s, i, j) \}    (6.10)

Algorithm 1 presents the procedure for efficiently computing the recursion in equation (6.10).

Algorithm 1 Dynamic Programming Algorithm for Multi-State K-Segment Selection
INPUT: \{c^s_j, e^s_j\}_{j=1}^{N}  \forall s \in S, where N is the total number of segments
Let optimal score matrix Opt \in R^{|S| \times N}
Let score matrix E \in R^{|S| \times N \times N}
Set Opt[s, j] = -\infty  \forall s \in S, 1 \le j \le N
Set E[s, i, j] = -\infty  \forall s \in S, 1 \le i < j \le N
for all i, j : i < j do
    E[s, i, j] = maximize \frac{\sum_{t=i}^{j} c_t h_t}{\sum_{t=i}^{j} e_t h_t}  s.t.  \sum_{t=i}^{j} h^s_t = K,  \forall t\; h_t \in \{0, 1\}
end for
for all s \in S do
    for j = 2 to N do
        Opt[s, j] = \max_{1 \le i < j} \{ Opt[s-1, i-1] + E[s, i, j] \}
        B[s, j] = \arg\max_{1 \le i < j} \{ Opt[s-1, i-1] + E[s, i, j] \}
    end for
end for
Backtrack on B[\cdot] : j = N \ldots 1 to recover the optimal state assignments.
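A compact sketch of this dynamic program is given below. It assumes the simplified setting of equation (6.9) with normalized histograms and e_t = 1, so that E(s, i, j) reduces to the mean of the top-K scores inside the window; the function and variable names are illustrative and not part of the thesis implementation:

```python
import numpy as np

def multistate_kss_dp(C, K):
    """C[s, t]: score of assigning segment t to state s (|S| states, N segments).
    Returns, per state, the indices of the K segments pooled from its window,
    assuming e_t = 1 so that E(s, i, j) is the top-K mean of state s in [i, j]."""
    S, N = C.shape

    def E(s, i, j):
        window = C[s, i:j + 1]
        if len(window) < K:          # window too short to hold K segments
            return -np.inf
        return float(np.sort(window)[-K:].sum() / K)

    Opt = np.full((S, N), -np.inf)   # Opt[s, j]: best score using states 0..s, segments 0..j
    B = np.zeros((S, N), dtype=int)  # backpointer: window start chosen for state s ending at j
    for j in range(N):
        Opt[0, j] = E(0, 0, j)       # first state: its window starts at the first segment
    for s in range(1, S):
        for j in range(s, N):
            cand = [(Opt[s - 1, i - 1] + E(s, i, j), i) for i in range(s, j + 1)]
            Opt[s, j], B[s, j] = max(cand)
    # Backtrack the chosen windows and, inside each, recover the top-K segments.
    selections, j = [], N - 1
    for s in range(S - 1, -1, -1):
        i = B[s, j] if s > 0 else 0
        top = np.argsort(C[s, i:j + 1])[-K:] + i
        selections.append((s, sorted(top.tolist())))
        j = i - 1
    return list(reversed(selections))
```

For example, multistate_kss_dp(C, K=3) on a |S| x N score matrix returns a list of (state, selected segment indices) pairs, with the windows of successive states appearing in temporal order, mirroring the left-to-right constraint of the LP formulation.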
6.6.1 Time Complexity

Given the score matrix E(s, i, j), the optimal solution can be computed in O(|S| N^2) time. Computing a single entry of the E(s, i, j) matrix requires finding the top K valued segments between the i-th and the j-th segment, which we showed in section 6.3.3 can be computed in O(N) time. Hence a straightforward algorithm for computing the score matrix E requires O(|S| N^3) time. The matrix E can be computed more efficiently using a min-heap data structure and some careful bookkeeping. Algorithm 2 shows the details of the procedure, where after every iteration the min-heap contains the top K valued segments in the given window (i, j). The resulting time complexity for computing the E matrix is O(|S| N^2 \log K).

Algorithm 2 Efficient computation of the E(s, i, j) matrix
INPUT: \{c^s_j, e^s_j\}_{j=1}^{N}  \forall s \in S, where N is the total number of segments
Let score matrix E \in R^{|S| \times N \times N}
Set E[s, i, j] = -\infty  \forall s \in S, 1 \le i < j \le N
for all s \in S do
    for i = 1 to N do
        Initialize an empty min-heap priority queue P of size K
        for j = i to N do
            Let b^s_j = c^s_j / e^s_j
            if P.size() < K then
                P.push(b^s_j)
            else if P.head() < b^s_j then
                Replace the head value of priority queue P with b^s_j, and iteratively exchange b^s_j with its children till the heap property is satisfied.
            end if
            E(s, i, j) = \frac{\sum_{t \in P} c^s_t}{\sum_{t \in P} e^s_t}
        end for
    end for
end for

6.7 Experiments and Results

We evaluate our algorithm on the recently introduced Composite Cooking Dataset [63]. The dataset contains 41 cooking recipe demonstrations, such as prepare ginger, separate an egg and make coffee, where the videos are recorded with a fixed elevated camera viewing the actors from the front as they prepare the dishes inside a kitchen. There are a total of 138 videos, amounting to about 16 hours of footage, containing actions performed by 17 different actors; the videos are shot at 29.4 fps with a resolution of 1624x1224 pixels. In our experiments, we use the pre-computed histogram-of-codeword features for each frame, provided with the dataset. The codewords are computed over HoG, HoF, motion boundary histogram and trajectory shape features, extracted from densely sampled interest point tracks [84] in the videos.

We divide each video into overlapping segments of 100 frames each, and sum the histograms of codewords from the frames within the j-th segment to construct a single accumulated histogram feature f_j. To set up the fractional linear programming problems MKSS and RMSS, we normalize the features from each segment using their L1 norm, and compute the score values c^s_j = w_s \cdot f_j for each state s using the current value of the weight vector from the LSVM classifier. The normalization constants e^s_j are set to 1, and in effect normalize the features based on the number of segments selected. In our experiments, we avoided any action or dataset specific tuning and set the regularization parameter to a = 5 for all events, which we empirically observed to select a larger fraction of positively scored segments in the video, and hence to contribute more towards classifying the composite event.

6.7.1 Comparisons with baselines

To establish a baseline, we implemented a bag-of-words based SVM classifier, where the codewords in the video are globally pooled into a single histogram.
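A minimal sketch of this segment-level feature construction and of the globally pooled bag-of-words baseline is given below. The per-frame codeword histograms are assumed to be given, as provided with the dataset; the segment stride and the use of sklearn's LinearSVC are illustrative assumptions, not the exact implementation:

```python
import numpy as np
from sklearn.svm import LinearSVC

SEG_LEN, STRIDE = 100, 50    # 100-frame overlapping segments; the stride is an assumption

def segment_features(frame_histograms):
    """Sum per-frame codeword histograms inside each overlapping segment and
    L1-normalize, giving one feature vector f_j per segment."""
    H = np.asarray(frame_histograms, dtype=float)    # shape (num_frames, vocab_size)
    starts = range(0, max(1, len(H) - SEG_LEN + 1), STRIDE)
    F = np.array([H[s:s + SEG_LEN].sum(axis=0) for s in starts])
    return F / np.maximum(F.sum(axis=1, keepdims=True), 1e-8)

def global_bow(frame_histograms):
    """Globally pooled bag-of-words descriptor used by the baseline classifier."""
    g = np.asarray(frame_histograms, dtype=float).sum(axis=0)
    return g / max(g.sum(), 1e-8)

def train_bow_baseline(videos, labels):
    """videos: list of per-video frame-histogram arrays; labels: action labels.
    Trains the globally pooled BoW baseline with a linear SVM."""
    G = np.stack([global_bow(h) for h in videos])
    return LinearSVC(C=1.0).fit(G, labels)
```

The per-segment features f_j produced by segment_features would then be scored as c^s_j = w_s . f_j against the current latent-SVM state weights, as described above.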
Figure 6.3 presents our results on the 41 composite cooking actions using the 6-fold cross validation suggested by [63]. As the BoW features compute only globally aggregated statistics, their performance is quite low on complex long-range activities. We also implement a temporally binned BoW classifier, as proposed by [36]. We experiment with three types of binning structures, where the video is divided into 3, 5 and 7 equal-length partitions, and the histograms of codewords computed from each partition are concatenated together. We observe considerable improvement in MAP values compared to the BoW classifier, which we attribute to the temporal pooling of features, which is important for complex event detection.
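The MAP values reported in Figure 6.3 and Table 6.1 are per-class average precisions, averaged over classes and over the cross-validation splits. A minimal sketch of this metric is shown below, assuming one classifier score per video and class; sklearn's average_precision_score is an illustrative choice, not necessarily the exact evaluation code used with the dataset:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(scores, labels):
    """scores: (num_videos, num_classes) classifier scores;
    labels:  (num_videos,) ground-truth class indices.
    Returns the MAP averaged over classes; in the evaluation protocol above this
    is further averaged over the cross-validation splits."""
    aps = []
    for c in range(scores.shape[1]):
        y_true = (labels == c).astype(int)
        if y_true.sum() == 0:        # skip classes absent from this split
            continue
        aps.append(average_precision_score(y_true, scores[:, c]))
    return float(np.mean(aps))
```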
Action Labels             BoW   TB-3  TB-5  TB-7  MKSS-3  MKSS-5  MKSS-7  RMSS-3  RMSS-5  RMSS-7
Chopping a cucumber       0.11  0.14  0.11  0.15  0.12    0.15    0.11    0.12    0.16    0.12
Prepare carrots           0.19  0.33  0.39  0.42  0.39    0.39    0.45    0.38    0.39    0.42
Prepare a peach           0.23  0.11  0.20  0.09  0.10    0.13    0.08    0.17    0.13    0.11
Slice a loaf of bread     0.58  0.61  0.73  0.66  0.75    0.71    0.87    0.77    0.72    0.85
Prepare cauliflower       0.40  0.43  0.42  0.36  0.60    0.65    0.35    0.34    0.63    0.33
Prepare an onion          0.16  0.46  0.26  0.10  0.42    0.19    0.10    0.35    0.20    0.09
Prepare an orange         0.14  0.68  0.47  0.38  0.30    0.42    0.36    0.61    0.44    0.37
Prepare fresh herbs       0.38  0.24  0.14  0.18  0.24    0.20    0.22    0.22    0.22    0.18
Prepare garlic            0.18  0.31  0.12  0.07  0.31    0.32    0.14    0.31    0.16    0.30
Prepare asparagus         0.02  0.03  0.04  0.04  0.03    0.04    0.05    0.04    0.05    0.05
Prepare fresh ginger      0.14  0.37  0.23  0.09  0.12    0.19    0.08    0.14    0.21    0.07
Prepare a plum            0.41  0.18  0.11  0.09  0.36    0.18    0.22    0.61    0.21    0.61
Zest a lemon              0.14  0.20  0.20  0.20  0.20    0.20    0.20    0.25    0.17    0.20
Prepare leeks             0.23  0.33  0.32  0.46  0.42    0.34    0.43    0.34    0.33    0.38
Extract lime juice        0.42  0.48  0.49  0.53  0.44    0.49    0.50    0.46    0.47    0.43
Prepare a pomegranate     0.39  0.81  0.94  0.56  0.53    0.65    0.78    0.48    0.70    0.78
Prepare broccoli          0.20  0.40  0.42  0.47  0.45    0.45    0.46    0.45    0.45    0.60
Prepare potatoes          0.23  0.11  0.15  0.11  0.23    0.22    0.14    0.24    0.23    0.17
Prepare a pepper          0.10  0.16  0.08  0.09  0.14    0.11    0.10    0.18    0.11    0.12
Prepare a pineapple       0.55  0.37  0.48  0.51  0.56    0.74    0.61    0.73    0.77    0.62
Prepare spinach           0.10  0.28  0.31  0.36  0.58    0.58    0.58    0.50    0.28    0.44
Prepare a fresh chilli    0.23  0.05  0.05  0.06  0.09    0.14    0.20    0.10    0.14    0.20
Cook pasta                0.26  0.53  0.45  0.54  0.38    0.49    0.64    0.53    0.47    1.00
Separate an egg           0.65  0.47  0.60  0.63  0.52    0.63    0.57    0.63    0.63    0.56
Prepare broad beans       0.14  0.68  0.18  0.29  0.68    0.51    0.68    0.47    0.52    0.52
Prepare a kiwi fruit      0.15  0.23  0.23  0.10  0.11    0.18    0.11    0.22    0.10    0.12
Prepare an avocado        0.13  0.07  0.13  0.07  0.05    0.08    0.07    0.05    0.10    0.07
Prepare a mango           0.06  0.16  0.30  0.22  0.15    0.29    0.32    0.21    0.29    0.33
Prepare figs              0.30  0.06  0.07  0.07  0.09    0.13    0.22    0.10    0.14    0.21
Use box grater            0.42  0.75  0.49  0.39  0.65    0.63    0.57    0.73    0.71    0.57
Sharpen knives            0.75  0.63  1.00  0.75  0.67    0.67    0.75    0.75    1.00    1.00
Use speed peeler          1.00  0.10  0.10  0.10  0.20    0.33    0.25    0.20    0.20    0.20
Use a toaster             0.16  0.57  0.39  0.35  0.42    0.43    0.31    0.52    0.34    0.20
Use a pestle-mortar       0.55  0.55  0.59  0.65  0.63    0.60    0.63    0.65    0.60    0.50
Use microplane grater     0.20  0.49  0.23  0.20  0.18    0.26    0.29    0.21    0.25    0.27
Make scrambled egg        0.22  0.45  0.57  0.56  0.50    0.65    0.61    0.53    0.78    0.72
Prepare orange juice      0.64  0.65  0.81  0.69  0.81    0.83    0.83    0.81    0.83    0.78
Make hot dog              0.05  0.30  0.18  0.59  0.21    0.21    0.44    0.40    0.40    0.44
Pour beer                 0.56  0.10  0.06  0.03  0.56    0.53    0.53    0.53    0.55    1.00
Make tea                  0.28  0.46  0.33  0.35  0.53    0.51    0.61    0.50    0.56    0.75
Make coffee               0.53  0.88  0.75  0.88  0.71    0.75    0.71    0.75    0.75    0.71
Splitwise Average MAP     0.30  0.40  0.36  0.35  0.39    0.41    0.41    0.42    0.41    0.41
Figure 6.3: MAP result table for the Composite Cooking dataset. The TB columns present the MAP scores from a temporally binned BoW classifier [36] with 3, 5 and 7 temporal partitions. MAP scores for the MKSS and RMSS algorithms with different numbers of states (3, 5 and 7) are also given for each action.

However, the performance starts decreasing as the number of partitions is increased, which we attribute to the static nature of the partitions, making them sensitive to variations in the temporal location of the primitive actions.

We next evaluate both the MKSS and RMSS algorithms for three different numbers of states: 3, 5 and 7. The optimal number of states is a function of the complexity of the underlying event in the video, and also of the amount of variation present across videos of the same event class. Table 6.1 shows the average MAP over all classes for the MKSS and RMSS algorithms with the different numbers of states, and also the average of the best performance.

Methodology                   MAP
Bag of Words SVM              30.19%
SVM-MeanSGD [63]              32.30%
K-Segment Selection [37]      31.30%
Temporal Binning T-3 [36]     40.30%
Temporal Binning T-5 [36]     36.60%
Temporal Binning T-7 [36]     34.70%
Temporal Binning Best [36]    41.74%
MKSS: 3 states                39.40%
MKSS: 5 states                41.20%
MKSS: 7 states                41.00%
MKSS Best                     43.57%
RMSS: 3 states                42.30%
RMSS: 5 states                41.00%
RMSS: 7 states                41.40%
RMSS Best                     47.80%
Table 6.1: MAP result table.

The MKSS and RMSS algorithms achieve average MAP scores of 41.47% and 47.80% respectively. Our method outperforms the SVM-MeanSGD [63] algorithm, which learns an SVM classifier using chi-square kernels and reports a score of 32.30%. We note that [63] also reports an MAP score of 53.9% using external textual scripts to guide the classifier training; however, our focus is on purely computer vision based approaches, and we expect our algorithm to also benefit from similar complementary modalities. We also implemented the K-segment selection algorithm [37] and evaluated it on the dataset. We observe only a minor improvement in performance over traditional BoWs, which we attribute to its lack of temporal structure modeling, which is crucial for classifying long-term composite events.

Figure 6.4: (a) 3-state segment selection result for separate an egg, with representative frames of the selected segments. (b) 5-state segment selection result, with a comparison of the frames selected by each state.

6.7.2 Segment Selection Results

Solving the MKSS and RMSS optimizations provides us with the optimum segment selection indicator vectors h, whose elements are real valued numbers between 0 and 1. To visualize the segments assigned higher selection weights, we threshold the indicator values h such that at least 40% of \|h\|_1 is retained. Figure 6.4 (a) shows the results of the multistate segment selection algorithm with 3 states on a separate an egg video. We note the clear decomposition of the selected segments into three temporal states, where state A (red) appears before state B (green), which is followed by state C (blue). The states are learned automatically from the training data, and it is difficult to associate a single primitive action with each state. However, we can discern some interesting trends through visual inspection of the results.
For example, state B seems to correspond to working at the counter station, and as the video contains extended periods at the counter, our model selects only some parts of the video. State A seems to correspond to moving to the back of the room and opening a door, and is detected twice in the video, where the person approaches the cupboard and the refrigerator. The intervening frames are not important to state A and are ignored in its score computation. State C seems to correspond to the person moving to the side to wash the dishes, or to put them away.

Figure 6.4 (b) compares the segment selection applied to two different videos of the separate an egg activity, where the algorithm assumes a 5-state model, and we see a correlation across the videos between the types of primitive actions each state corresponds to. We note that each state can represent a cluster of features, and hence may correspond to multiple primitive actions and scenes. This becomes more apparent in videos where the actor interchanges the actions of approaching the refrigerator and approaching the cupboard. As our states do not have a semantic understanding of what a refrigerator or a cupboard is, the model only recognizes the gross spatio-temporal motions occurring in the scene.

6.8 Conclusions

We presented a novel multistate segment selection algorithm for pooling features from the discriminative segments of a video. We presented a solution based on efficiently solving a linear programming optimization, and formulated linear constraints to enforce temporal ordering among the states representing the primitive actions of an event. We also presented a provably faster solution to the single state K-segment selection problem [37], improving the computation time to O(N log N). Finally, we presented a regularized version of the multistate segment selection algorithm, which automatically determines the number of segments to be selected for each state in a given video. We evaluated our algorithm on the Composite Cooking Activity dataset [63], and showed significantly improved results compared to other static and dynamic pooling algorithms. One promising approach for future work is to extend the algorithm to incorporate semantic mappings between the states and the underlying feature distributions, and to explore automated methods of determining the optimal number of states.

Chapter 7

Conclusions and Future Work

To summarize, our research focused on recognizing complex human events in videos, and we adopted the design philosophy of combining global statistics of local spatio-temporal features with the high level structure and constraints provided by dynamic probabilistic graphical models. We presented four different algorithms for activity recognition, spanning the feature-classifier hierarchy in terms of their semantic and structure modeling capability; our specific contributions are summarized in Figure 7.1. Choosing the correct algorithm depends on the salient attributes of the target surveillance scenario. In Figure 7.2, we provide a list of relative merits and drawbacks of each of our proposed algorithms, to help a video surveillance engineer make an informed choice.

In chapter 3, we presented a novel Local-Feature (LF) HCRF model for learning the local neighborhood relationships between interest point features using co-occurrence statistics, and presented a transformation to represent the relationships between neighboring codewords as edges of a Hidden Conditional Random Field.
Furthermore, the codeword dictionary is learned as part of the maximum likelihood learning process, resulting in significantly smaller vocabulary sizes, with each interest point being assigned a probability distribution over the codewords. We validated our framework on two widely cited human activity datasets, Weizmann and KTH, and showed improvements over other existing approaches.

Figure 7.1: Summary of our contributions across the feature-classifier hierarchy (finding temporally ordered discriminative segments using linear programming; a provably fast single state segment selection algorithm; discovering and detecting discriminative key-poses in a semi-supervised setting; transforming pairwise co-occurrence statistics of STIP features into edge connections of an HCRF; and combining noisy pose estimates from multiple kinematic tree priors using multiple kernel learning).

Next, we presented the Pose-MKL algorithm in chapter 4, where we described a novel method for activity recognition based on the distribution of human pose estimates in a video. We presented a multiple kernel learning framework for combining pose estimate distributions from a variety of kinematic tree priors for the task of activity classification. We evaluated our algorithm on the USC Gestures dataset, and achieved competitive performance relative to other previously reported results, while minimizing the manual modeling and annotation requirements placed on the user during training.

Both the LF-HCRF and Pose-MKL algorithms ignore the long term temporal dynamics of the action, which is especially important for complex event recognition in long term videos. In chapter 5, we presented a novel key-pose filter based HCRF model for activity detection and classification in continuous un-segmented videos. To minimize the manual annotation requirements during the training phase, we presented a pose summarization algorithm to automatically identify the key poses of an activity sequence. Our algorithm is trained in a semi-supervised fashion, where only the composite event labels are provided, and the key-pose labels are not provided during training. We extensively evaluated our model on four publicly available datasets: CMU-Action, UT-Interaction, USC-Gestures and Rochester-ADL, and showed significant improvements over the state of the art. Our evaluations include both detection and classification tasks, and also a streaming video task where only a partial video is provided during testing.

Attributes                      MSDP     PF-HCRF  LF-HCRF   Pose-MKL
Training/Testing Requirements
  Algorithm Complexity          Low      High     Very Low  Moderate
  Minimize Annotations          Yes      Yes      Yes       Yes
  Avoid Human Tracks            Yes      No       Yes       No
  Action Description            Limited  Yes      No        No
Target Action Categories
  Action Complexity             Complex  Simple   Moderate  Simple
  Cyclic/Periodic Actions       No       No       Yes       Yes
  Un-Segmented Videos           No       Yes      No        No
  Un-Aligned Videos             Yes      Yes      No        No
Model Robustness
  Gap Filling                   No       Yes      No        No
  Arbitrary Gaps in Video       Yes      No       No        No
  Ignore Spurious Primitives    Yes      No       No        No
  Intermittent Occlusions       Yes      Yes      No        No
Figure 7.2: Relative merits and drawbacks of our proposed algorithms.

Finally, we addressed the task of recognizing long term weakly scripted complex events in chapter 6, where we presented a novel dynamic feature pooling framework for multi-state models. We formulated the problem as an efficient linear programming optimization, and showed a computationally efficient extension for automatically determining the number of segments to be selected in a video.
We also presented a provably faster linear time solution to the single state dynamic feature pooling problem. We evaluated our algorithm on the challenging Composite Cooking Activity dataset, and showed significantly improved results compared to previous static and dynamic pooling algorithms.

In conclusion, we have presented a comprehensive set of algorithms spanning the feature-classifier hierarchy, with feature detectors ranging from low level interest point features to semantically rich pictorial structure pose distribution based features, and classification algorithms ranging from global statistical models like multiple kernel learning to structured dynamical models like hidden conditional random fields. We showed state of the art results on multiple datasets (Figure 7.3) containing a variety of short, medium and long term actions taking place in dynamic and cluttered scenes.

Figure 7.3: Surveillance scenarios addressed by our proposed algorithms (MSDP, PF-HCRF, Pose-MKL and LF-HCRF, arranged along the axes of aggregate statistical models versus structured dynamical models, and un-structured low level features versus structured semantic features).

7.1 Future Work

Going forward, we list some future research directions that can be explored using our proposed approaches:

[Integrating Multiple Signal Modalities] Current approaches to action recognition ignore other commonly available signal modalities such as audio recordings and script annotations (when available). Furthermore, with recent advances in range sensing, good quality 2.5D representations of human actions have become easily obtainable. Our future research will focus on how best to fuse these disparate signal modalities for the combined task of activity recognition.

[Including noisy object detection features] Object detection and identification is a challenging task in computer vision, and a reasonable activity recognition system should be able to deal with missing or false object detection results. Our key-pose based algorithms can be easily extended to include the object pose in relation to the human pose, and to infer the combined human-object pose in activities. However, our current system requires that the objects appear consistently in the key-pose frames, which in our practical experience is difficult to achieve due to limitations of the object detection module. An alternative approach is to use a super-pixel segmentation of the video volume (voxel segmentation) to infer possible object locations, while not directly recognizing the object category. We are currently exploring ways of including such a voxel segmentation in our activity recognition algorithms.

[And-Or semantics in key-pose detection] Typical key-pose sequence based activity recognition systems model the pose transitions using a finite state machine (FSM). Such models are capable of skipping a key-pose, or of incorporating an alternate sequence of key-poses for the same activity label. Our current implementation of PF-HCRF is designed to mimic a linear chain, left-to-right FSM model, and cannot incorporate And-Or semantics in key-pose transitions. In the future, we propose to incorporate such semantics to make our current algorithm more robust to variations in activity style, where the same activity can be executed using different key-pose sequences.

Bibliography

[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
[2] Francis R Bach, Gert R G Lanckriet, and Michael I Jordan. Fast Kernel Learning using Sequential Minimal Optimization. Technical report, UC-Berkeley, 2004.
[3] Martin Bergtholdt, Jörg Kappes, Stefan Schmidt, and Christoph Schnörr. A Study of Parts-Based Object Class Detection Using Complete Graphs. IJCV, 2009.
[4] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. In ICCV, December 2005.
[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization, 2009.
[6] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In CVPR, 1997.
[7] Matteo Bregonzio, Shaogang Gong, and Tao Xiang. Recognising action as clouds of space-time interest points. In CVPR, 2009.
[8] William Brendel, Alan Fern, and Sinisa Todorovic. Probabilistic event logic for interval-based event recognition. In CVPR, 2011.
[9] Yu Cao and Daniel Barrett. Recognizing Human Activities from Partially Observed Videos. In CVPR, 2013.
[10] M.T. Chan, A. Hoogs, J. Schmiederer, and M. Petersen. Detecting rare events in video using semantic primitives with HMM. In ICPR, 2004.
[11] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[12] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior Recognition via Sparse Spatio-Temporal Features. In VS-PETS, 2005.
[13] T. Duong, H. Bui, D. Phung, and S. Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-markov model. In CVPR, 2005.
[14] D. Eppstein, M. S. Paterson, and F. F. Yao. On Nearest-Neighbor Graphs. Discrete & Computational Geometry, April 1997.
[15] C. Fanti, L. Zelnik-Manor, and P. Perona. Hybrid Models for Human Motion Recognition. In CVPR, 2005.
[16] Alireza Fathi and Greg Mori. Action recognition by learning mid-level motion features. In CVPR, 2008.
[17] P. Felzenszwalb and D. McAllester. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[18] Vittorio Ferrari, M. Jimenez, and Andrew Zisserman. Pose search: retrieving people using their pose. In CVPR, 2009.
[19] Adrien Gaidon. Actom sequence models for efficient action detection. In CVPR, 2011.
[20] Andrew Gilbert, John Illingworth, and Richard Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In ICCV, September 2009.
[21] Abhinav Gupta, Aniruddha Kembhavi, and Larry S Davis. Observing human-object interactions: using spatial and functional compatibility for recognition. PAMI, October 2009.
[22] S. Hongeng and R. Nevatia. Large-scale event detection using semi-hidden markov models. In ICCV, 2003.
[23] Chang Huang, Wu, and Ram Nevatia. Robust object tracking by hierarchical association of detection responses. In ECCV, 2008.
[24] Nazli Ikizler and David Forsyth. Searching video for complex activities with finite state models. In CVPR, 2007.
[25] Y.A. Ivanov and A.F. Bobick. Recognition of visual activities and interactions by stochastic parsing. PAMI, 22(8):852-872, 2000.
[26] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A Biologically Inspired System for Action Recognition. In ICCV, October 2007.
[27] Liu Jingen and Mubarak Shah. Learning human actions via information maximization. In CVPR, 2008.
[28] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In ICCV, 2005.
[29] Y. Ke, R. Sukthankar, and M. Hebert. Event Detection in Crowded Videos. In ICCV, 2007.
[30] Yan Ke, Rahul Sukthankar, and Martial Hebert. Spatio-temporal Shape and Flow Correlation for Action Recognition. In CVPR, June 2007.
[31] Yan Ke, Rahul Sukthankar, and Martial Hebert. Volumetric Features for Video Event Detection. IJCV, 2010.
[32] Alexander Klaser, M. Marszalek, and Cordelia Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
[33] Yu Kong, Yunde Jia, and Yun Fu. Learning Human Interaction by Interactive Phrases. In ECCV, 2012.
[34] Adriana Kovashka and Kristen Grauman. Learning a Hierarchy of Discriminative Space-Time Neighborhood Features for Human Action Recognition. In CVPR, 2010.
[35] I. Laptev. On Space-Time Interest Points. IJCV, 2005.
[36] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[37] Weixin Li, Qian Yu, and Ajay Divakaran. Dynamic Pooling for Complex Event Recognition. In ICCV, 2013.
[38] J. Liu, Jiebo Luo, and Mubarak Shah. Recognizing realistic actions from videos "in the wild". In CVPR, June 2009.
[39] Tiecheng Liu and John R. Kender. Computational approaches to temporal sampling of video sequences. MCCA, 2007.
[40] F. Lv and R. Nevatia. Single view human action recognition using key pose matching & viterbi path searching. In CVPR, 2007.
[41] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, June 2009.
[42] Ross Messing, Chris Pal, and Henry Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, 2009.
[43] Joris M Mooij. libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models. JMLR, 11:2169-2173, 2010.
[44] V.I. Morariu. Multi-agent event recognition in structured scenarios. In CVPR, 2011.
[45] Louis-Philippe Morency, Ariadna Quattoni, and Trevor Darrell. Latent-Dynamic Discriminative Models for Continuous Gesture Recognition. In CVPR, 2007.
[46] Greg Mori, Xiaofeng Ren, and A.A. Efros. Recovering human body configurations: Combining segmentation and recognition. In CVPR, 2004.
[47] P. Natarajan and R. Nevatia. Coupled hidden semi markov models for activity recognition. In WMVC, 2007.
[48] P. Natarajan and R. Nevatia. View and scale invariant action recognition using multiview shape-flow models. In CVPR, 2008.
[49] Pradeep Natarajan, Vivek Singh, and Ram Nevatia. Learning 3D Action Models from a few 2D videos. In CVPR, 2010.
[50] D.B. Neill. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society, 74(2):337-360, March 2012.
[51] Juan Carlos Niebles, Chih-wei Chen, and Li Fei-fei. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. In ECCV, 2010.
[52] Juan Carlos Niebles and Li Fei-Fei. A Hierarchical Model of Shape and Appearance for Human Action Classification. In CVPR, 2007.
[53] Juan Carlos Niebles, Hongcheng Wang, and Li Fei-Fei. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. IJCV, March 2008.
[54] Sebastian Nowozin, Gokhan Bakir, and Koji Tsuda. Discriminative Subsequence Mining for Action Classification. In CVPR, 2007.
[55] D. Oneata, Jakob Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV, 2013.
[56] Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. Hidden conditional random fields. PAMI, 2007.
[57] D. Ramanan, D.A. Forsyth, and A. Zisserman. Strike a Pose: Tracking People by Finding Stylized Poses. In CVPR, 2005.
[58] Deva Ramanan. Learning to parse images of articulated bodies. In NIPS, 2007.
[59] Deva Ramanan and D.A. Forsyth. Automatic annotation of everyday movements. In NIPS, 2003.
[60] Michalis Raptis and Leonid Sigal. Poselet Key-framing: A Model for Human Activity Recognition. In CVPR, 2013.
[61] Michalis Raptis and Stefano Soatto. Tracklet Descriptors for Action Modeling and Video Analysis. In ECCV, 2008.
[62] M.D. Rodriguez, Javed Ahmed, and Mubarak Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
[63] Marcus Rohrbach, Michaela Regneri, and M. Andriluka. Script data for attribute-based recognition of composite activities. In ECCV, 2012.
[64] M. S. Ryoo, J. K. Aggarwal, Chia-chih Chen, and Amit Roy-chowdhury. An Overview of Contest on Semantic Description of Human Activities (SDHA). In ICPR contests, 2010.
[65] M.S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
[66] M.S. Ryoo and J.K. Aggarwal. Recognition of Composite Human Activities through Context-Free Grammar Based Representation. In CVPR, 2006.
[67] M.S. Ryoo and J.K. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In ICCV, September 2009.
[68] Scott Satkin and Martial Hebert. Modeling the Temporal Extent of Actions. In ECCV, 2010.
[69] Konrad Schindler and L. Van Gool. Action Snippets: How many frames does human action recognition require? In CVPR, 2008.
[70] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.
[71] Eli Shechtman and Michal Irani. Space-time behavior-based correlation, or how to tell if two underlying motion fields are similar without computing them? PAMI, 2007.
[72] Vivek Singh and Ramakant Nevatia. Action recognition in cluttered dynamic scenes using Pose-Specific Part Models. In ICCV, 2011.
[73] Cristian Sminchisescu, Atul Kanaujia, Li, and Dimitris Metaxas. Conditional models for contextual human motion recognition. In ICCV, 2005.
[74] Cristian Sminchisescu. Selection and context for action recognition. In ICCV, September 2009.
[75] Ju Sun, Xiao Wu, Shuicheng Yan, L.F. Cheong, T.S. Chua, and Jintao Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, 2009.
[76] Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning, 2006.
[77] Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. JMLR, 2007.
[78] Kevin Tang, Li Fei-Fei, and Daphne Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.
[79] Yicong Tian, Rahul Sukthankar, and Mubarak Shah. Spatiotemporal Deformable Part Models for Action Detection. In CVPR, 2013.
[80] S. Tran and L. Davis. Event modeling and recognition using markov logic networks. In ECCV, 2008.
[81] Arash Vahdat, Kevin Cannons, Greg Mori, Sangmin Oh, and Ilseo Kim. Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach. In ICCV, 2013.
[82] Arash Vahdat, B. Gao, Mani Ranjbar, and Greg Mori. A discriminative key pose sequence model for recognizing human interactions. In Workshop on Visual Surveillance, 2011.
[83] C. Vogler and D. Metaxas. Parallel hidden markov models for american sign language recognition. In ICCV, 1999.
[84] Heng Wang and A. Klaser. Action recognition by dense trajectories. In CVPR, 2011.
[85] Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, and Cordelia Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[86] Jiang Wang, Zhuoyuan Chen, and Ying Wu. Action Recognition with Multiscale Spatio-Temporal Contexts. In CVPR, 2011.
[87] S.B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell. Hidden Conditional Random Fields for Gesture Recognition. In CVPR, 2006.
[88] Yang Wang and Greg Mori. Learning a discriminative hidden part model for human action recognition. In NIPS, 2008.
[89] Yang Wang and Greg Mori. Max-margin hidden conditional random fields for human action recognition. In CVPR, June 2009.
[90] Yang Wang and Greg Mori. Hidden Part Models for Human Action Recognition: Probabilistic vs. Max-Margin. PAMI, 2010.
[91] Yang Wang, Duan Tran, and Zicheng Liao. Learning hierarchical poselets for human parsing. In CVPR, 2011.
[92] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, 2008.
[93] Chenliang Xu, Caiming Xiong, and Jason J Corso. Streaming Hierarchical Video Segmentation. In ECCV, 2012.
[94] J. Yamato, Jun Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. In CVPR, 1992.
[95] Chun-Nam John Yu and Thorsten Joachims. Learning structural SVMs with latent variables. In ICML, 2009.
[96] Junsong Yuan, Zicheng Liu, and Ying Wu. Discriminative Subvolume Search for Efficient Action Detection. In CVPR, 2009.
[97] Yimeng Zhang, Xiaoming Liu, M.C. Chang, W. Ge, and T. Chen. Spatio-Temporal phrases for activity recognition. In ECCV, 2012.
[98] Ziming Zhang, Yiqun Hu, Syin Chan, and Liang-Tien Chia. Motion Context: A New Representation for Human Action Recognition. In ECCV, 2008.