ROBUST REPRESENTATION AND RECOGNITION OF ACTIONS IN VIDEO

by

Pradeep Natarajan

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2009

Copyright 2009 Pradeep Natarajan

Acknowledgements

I would like to thank my advisor, Prof. Ramakant Nevatia, for his advice and support during the course of my PhD research. His wide knowledge and experience in computer vision and research have been invaluable in my development as a researcher and in preparing me for my future career. I also feel privileged to have had the opportunity to work with Prof. Gerard Medioni and Prof. Fei Sha. Their advice and guidance with techniques in computer vision and machine learning helped me immensely during my time at USC. I would also like to thank Prof. Antonio Ortega, Prof. Ramesh Govindan and Prof. Craig Knoblock for serving on my qualifying committee and for taking the time to review my proposal.

I am also thankful to my colleagues in the USC Computer Vision Lab, especially Fengjun Lv, Bo Wu, Qian Yu, Vivek Kumar Singh, Furqan Khan, Prithviraj Banerjee, Pramod Kumar Sharma and Sung Chun Lee, for their help and collaboration in my research projects. My work would not have been possible without their participation and time.

Finally, I would like to thank my wonderful family for their support, patience and encouragement over the years. Their enthusiasm and good wishes should continue to serve me well in the coming years.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction

Chapter 2: Previous Work
  2.1 Generative Models (HMMs)
  2.2 Discriminative Models (CRF)
  2.3 Combined Tracking and Action Recognition
Chapter 3: Hierarchical Multi-channel Hidden Semi-Markov Models
  3.1 Hidden Markov Models and Extensions
  3.2 Model Definition and Parameters
  3.3 Decoding Algorithm
    3.3.1 Recursive Viterbi Descent (RVD)
    3.3.2 Factored Recursive Viterbi Descent (FRVD)
    3.3.3 Applications of Inference Algorithm
      3.3.3.1 CHSMM Decoding
      3.3.3.2 HPaHSMM Decoding
      3.3.3.3 HSPaHMM Decoding
  3.4 Learning
    3.4.1 Embedded Viterbi Learning
  3.5 Experiments
    3.5.1 Effect of Parameter Initialization in CHSMM
    3.5.2 Comparison of duration modeling at the lower and upper layers on Synthetic Data
    3.5.3 Application to Continuous Sign Language Recognition

Chapter 4: Simultaneous Tracking and Action Recognition (STAR) using Formal Action Models
  4.1 Introduction
  4.2 HVT-HMM
    4.2.1 Model Definition and Parameters
    4.2.2 Parametrization and Learning
    4.2.3 Decoding Algorithm
  4.3 Experiments
    4.3.1 Gesture Tracking and Recognition
    4.3.2 Tracking and Recognizing Articulated Body Motion

Chapter 5: View and Scale Invariant Action Recognition Using Multiview Shape-Flow Models
  5.1 Overview of Approach
  5.2 Action Representation
  5.3 Pose Tracking and Recognition
  5.4 Shape and Flow Potentials
  5.5 Experiments

Chapter 6: Temporally-Dense Spatio-Temporal Interest Point Models for Action Recognition
  6.1 Feature Detection and Description
    6.1.1 Spatial Interest Point Detection
    6.1.2 Spatio-Temporal Interest Point Detection
    6.1.3 Temporally Dense Spatio-Temporal Interest Points (TD-STIP)
    6.1.4 Evaluation of Feature Detection
  6.2 Feature Descriptors
    6.2.1 Spatio-Temporal Codebook Generation
    6.2.2 Learning Codeword Weights
  6.3 Action Representation and Recognition
  6.4 Experimental Evaluation
    6.4.1 Segmented Recognition on KTH Dataset
    6.4.2 Shape Vs Flow Features
    6.4.3 Continuous Recognition on KTH Dataset
    6.4.4 Recognition on USC Dataset
    6.4.5 Recognition with Infra-red Videos

Chapter 7: Simultaneous Tracking and Action Recognition using Dynamic Bayesian Action Networks
  7.1 Action Representation and Recognition
    7.1.1 Graphical Model Representation
    7.1.2 Pose Tracking and Recognition
    7.1.3 Transition Potential
    7.1.4 Observation Potential
  7.2 Action Learning
    7.2.1 Model Learning
      7.2.1.1 KeyPose Annotation and 3D Lifting
      7.2.1.2 Pose Interpolation and Model Learning
    7.2.2 Feature Weight Learning
  7.3 Experiments

Chapter 8: Summary and Future Work
  8.1 Future Work

References

List of Tables

1.1 Interesting actions in various domains
3.1 Accuracy values with Uniform and Normal Duration models
3.2 Mean and Variance of log-likelihood
3.3 Model Accuracy (%) and Speed (fps)
3.4 Phoneme Transcriptions
3.5 Word Accuracy Rates (%) and Speed (fps)
4.1 Comparison of HVT-HMM, HHMM and HS-HMM on Gesture Dataset
4.2 Comparison of HVT-HMM and HHMM on Action Dataset
4.3 Robustness under occlusion, style variations and other factors
5.1 Comparison of accuracy and speed with shape, flow, shape+flow, shape+flow+duration features at different tilt angles
5.2 Overall Confusion Matrix
6.1 Accuracy on KTH Dataset
6.2 Frame-by-frame Accuracy for continuous recognition; N = total no. of frames, E = no. of errors
6.3 Recognition Accuracy on USC [44] Dataset. *Channel HOF 313 produced best results on KTH in [30]
7.1 Performance Scores on Gesture and Grocery Store dataset

List of Figures

1.1 Key problems in Action Recognition
1.2 Sample Graphical Model for Simultaneous Tracking and Recognition
1.3 Sample results on the standard KTH dataset
1.4 Sample results in cluttered indoor and outdoor environments
1.5 Sample results in grocery stores
3.1 Relationship between different models discussed
3.2 Structure of a) HMM b) HSMM c) PaHMM d) PaHSMM e) CHMM f) CHSMM g) HSPaHMM h) HPaHSMM - squares are observations, circles states, rounded rectangles are states over some duration
3.3 Recursive Viterbi Descent with 2 channels at level h
3.4 Factored Recursive Viterbi Descent with 2 channels at level h
3.5 Decoding CHSMM - rounded boxes are primitive nodes, dark squares observations. d1, d1-2, d2 are durations
3.6 Decoding HPaHSMM - circles are top-level nodes, rounded boxes lower-level nodes, dark squares observations. d1, d1-2, d2 are durations
3.7 Decoding HSPaHMM - rounded boxes are top-level nodes, circles lower-level nodes, dark squares observations. d1, d1-2, d2 are durations
3.8 Learning Time Vs Th
3.9 Structure of Event Simulator
3.10 Variation of HSPaHMM Speed (frames/sec or fps) and Accuracy (%) with a) Sigma, Beam Size=11 b) Beam Size, Sigma=1
3.11 Variation of HPaHSMM Speed (fps) and Accuracy (%) with Beam Size
4.1 Graphical structure of HVT-HMM
4.2 23D body model
4.3 Variation in values of Equation 4.2 - Line with diamonds: term 1, Line with squares: term 2
4.4 Overlay Images for Arm Gestures
4.5 Variation in Accuracy (%) and Speed (fps) with - a) p b) K
4.6 Recognizing and tracking with variation in - a) Background and Lighting b) Style c) Random external motion d) Self-Occlusion
4.7 Actions tested in Experiment 2 - a) Bend b) Jack c) Jump d) Jump-In-Place e) Run f) Gallop Sideways g) Walk h) Wave1 i) Wave2
4.8 Sample tracking and recognition results - a) jack b) jump c) run d) walk e) gallop sideways f) bend g) jump in place h) wave1 i) wave2
4.9 Tracking and recognition with - a) Swinging bag b) dog c) knees up d) moonwalk e) Occluded feet f) Occlusion by pole g) Viewpoint 20° pan h) Viewpoint 30° pan i) Viewpoint 40° pan j) Viewpoint 45° pan
4.10 Tracking and recognition with learned models on dataset from [59] with variations in viewpoint as well as jittery camera motion
5.1 Transition Constraints - a) Graph model for a single event b) 2-layer model for a simple 2-event recognizer at the first two pan angles (0°, 30°)
5.2 Pose initialization starting with an initial detection window (Green Box in I_start)
5.3 Unrolled Graphical Model of the SFD-CRF for Pose Tracking and Recognition
5.4 Matching optical flow in image with template optical flow
5.5 Sample Background, Viewpoint and Scale variations tested - (a) Indoor office environment, tilt=15°, pan=0° (b) Indoor office environment, tilt=0°, pan=30° (c) Indoor library, tilt=0°, pan=90° (d) Indoor office, tilt=0°, pan=315° (e) Indoor office, tilt=0°, pan=0° (f) Indoor library, tilt=0°, pan=270° (g) Outdoor with moving cars, tilt=0°, pan=0° (h) Outdoor, tilt=30°, pan=30° (i) Outdoor, small scale, tilt=45°, pan=0°
6.1 Comparison of Interest Points extracted by (a) Harris Corner (b) Rank Increase Measure (c) STIP (d) TD-STIP. Columns (1)-(3) A range of cluttered indoor environments (4) Monitoring a typical traffic intersection (5) Under camera motion
6.2 % of frames with at least X correct interest points for a given accuracy for different detectors
6.3 Shape and Flow Descriptors extracted at an Interest Point
6.4 TD-STIPs clustered into CodeWords for actions a) Boxing b) Hand waving c) Running
6.5 Graphical Model for Continuous Action Recognition
6.6 Observation Potential Computation: Green box - track window w_t from person detection, Yellow Circles - Interest Point Features, White box - Neighborhood of interest
6.7 Accuracy on KTH Dataset with varying training set sizes
6.8 Sample results for Action Recognition and Localization
6.9 Variation in Accuracy using Shape, Flow and Shape+Flow
6.10 Sample results for Action Recognition on USC dataset
6.11 Sample results for Action Recognition on IR dataset
7.1 Dynamic Bayesian Action Network
7.2 Computation of Observation Potentials
7.3 Model Learning Illustration for Crouch action with 3 keyposes and 2 primitives
7.4 Part lengths of ideal human model
7.5 Primitive Learning by Pose Interpolation
7.6 Results on the Gesture Dataset: inferred pose is overlaid on top of the image and illustrated by limb axes and joints
7.7 Results on the Grocery Store Dataset: A bounding box shows the actor's position and the inferred pose is overlaid on top of the image, illustrated by limb axes and joints
7.8 Results on Gesture Dataset. (a) feature weight learning using perceptron algorithm; (b) accuracy vs train:test ratio

Abstract

Recognizing actions from video and other sensory data is important for a number of applications such as surveillance and human-computer interaction. While the potential applications are compelling and have inspired extensive research on this topic, there are several difficult challenges. These challenges can be broadly classified into four key problems - 1) Action Representation 2) Feature Extraction 3) Learning 4) Inference. There are a range of possible approaches for each of these problems, and the choice depends on the application domain: whether it involves a single person or multiple actors, whether the camera is static or moving, and whether the background is static or moving.
In our work, we focus on recognizing single-person actions under a range of background and imaging conditions. We have worked on each of the four key problems in action recognition in this domain, and have made novel contributions. These include the use of hierarchical graphical models for high-level action representation, as well as efficient low-level features that are robust to background clutter, background motion and camera motion. We will describe the techniques developed during our research, and present results on a range of challenging indoor and outdoor video sequences. This work can have several potential applications including human-computer interaction, intelligent rooms, and monitoring lightly crowded areas in offices and grocery stores.

Chapter 1
Introduction

Automated semantic analysis of images and video has been a subject of active research over the years. Such technologies can enable several applications including search and retrieval, visual surveillance, assistive technologies like sign language recognition, and human interactions with computers and robots. While the applications are compelling, the difficulty of developing robust methods even for basic tasks like person detection has made progress in computer vision quite slow.

Over the years, most of the research in computer vision has been in solving basic low-level problems like object detection and tracking. While difficult and fundamental for vision processing, even perfect solutions to these problems do not directly answer the "so-what" question in terms of enabling real applications and solutions. This would require techniques for high-level reasoning and understanding, which start with the output of low-level processing. Further, despite tremendous progress in low-level detection and tracking techniques, the current state-of-the-art approaches are far from perfect.
Hence the high-level reasoning techniques must not only be robust to error and uncertainty, but must also, in some cases, drive low-level processing to reduce the search space.

In our work, we focus on recognizing actions in video starting from low-level detection and tracking as well as feature detection. We use the term action quite broadly, to refer to anything that changes the state of an actor. An actor is any object of interest, such as a person or vehicle. We chose to work on this problem since action sequences provide the most relevant information to end users - be it intelligence analysts looking at UAV videos, surveillance in groceries, web users searching for a specific video clip, or systems for human-robot interaction. Developing an action recognition system involves four key problems:

• Action Representation, which involves choosing a suitable formalism to model actions. Such representations range from a simple bag-of-features to more complicated graphical models to various logic-based formalisms.

• Feature Extraction, which involves extracting suitable low-level features from video for recognizing actions; these can range from simple foreground blobs to spatio-temporal interest points.

• Learning - The learning algorithm can involve both structure learning as well as parameter learning.

• Inference - involves recognizing the action sequence given a test video.

Figure 1.1 illustrates the relationship between these different problems. The difficulty of developing solutions for these problems depends on several factors including camera and background motion, the number of actors in the scene and the complexity of interactions between them.

Figure 1.1: Key problems in Action Recognition

In order to develop a viable research plan, we first defined a set of factors that characterize the difficulty of an application domain:

• Single Vs Multiple Actors (people, vehicles, etc.) Vs Crowded.
• Static Vs Moving Camera.
• Static Vs Moving Background.
• Segmented Vs Continuous Recognition.
The set of interesting actions depends on the characteristics of the application domain. Hence we next proceeded to develop a list of interesting actions in different domains. Table 1.1 presents a partial list of interesting actions involving people and vehicles.

In this thesis, we focused on developing methods for recognizing single-person actions under large variations in scale, viewpoint and background. We have worked on each of the key problems in action recognition and made novel contributions as well as improved upon existing approaches. Further, we have also developed methods for automatically segmenting and recognizing actions from a continuous stream. Potential applications of our work include Human-Computer Interaction, Human-Robot Interaction, Intelligent Rooms and surveillance in lightly populated regions of groceries, libraries, etc.

Single Person      gestures, sitting, standing, walking, running, pickup
Person-Person      Following, meeting, shake-hands
Person-Facility    Entering, exiting, waiting, loitering
Single Vehicle     Accelerate, decelerate, turn, stop
Vehicle-Vehicle    Follow, pass, collide
Person-Vehicle     Driving, loading, unloading

Table 1.1: Interesting actions in various domains

We have used a range of action representations, including logic-based representations, bag-of-features and graphical models. We have introduced several novel extensions to existing graphical model formalisms [42][41] that address the representational limitations of existing graphical models. Further, to address the limitations of the current action representations, we have explored combinations of logic-based [45] and bag-of-features [40] representations with graphical models. These combinations allow the use of sophisticated features for recognition in graphical models, and also minimize training requirements.

The traditional approach to activity recognition is to first detect and track the actors and then analyze the tracks to recognize the activities [63][66][12][38].
It is difficult to apply such methods to actions that also require tracking of body pose and limbs, due to the difficulty of pose tracking from a single view. Hence, we adopt an approach where tracking and action recognition are performed simultaneously. We achieve this with a multi-layer hierarchical graphical model, as illustrated in Figure 1.2. Here the topmost layer corresponds to the composite actions, the middle layer corresponds to primitive actions and the lowest layer corresponds to the pose tracks.

Figure 1.2: Sample Graphical Model for Simultaneous Tracking and Recognition

The action models in the higher layers drive the low-level tracking. This reduces the potential search in the pose space, but makes a closed-world assumption and can work only for known actions.

Extracting appropriate low-level features is crucial in any vision application. We have used several features to recognize actions, including foreground blobs, edge detections, person detections, optical flow and spatio-temporal interest points. We have also introduced a novel interest point detector in [40], which provides a dense, compact representation of the spatio-temporal structure of actions, and also does not require costly bounding box annotations for training the models.

We have used extensions of several state-of-the-art learning and inference algorithms in our work, including Expectation-Maximization (EM) based algorithms for training in the absence of annotations, as well as the discriminative Voted-Perceptron algorithm for training with partial annotations. Further, we have explored using prior knowledge in our models to minimize the training requirements. We have also extended existing inference algorithms for online, real-time recognition of actions in a range of domains including sign-language recognition, gesture recognition, and action recognition in a range of indoor and outdoor scenarios with significant variations in viewpoint, scale, background clutter and motion.
Figure 1.3: Sample results on the standard KTH dataset
Figure 1.4: Sample results in cluttered indoor and outdoor environments
Figure 1.5: Sample results in grocery stores

We have rigorously tested our methods and shown state-of-the-art results on standard datasets such as KTH and Weizmann, in a range of cluttered indoor and outdoor environments, and also for surveillance in grocery stores and libraries, as illustrated in Figures 1.3-1.5. In the rest of this thesis, we will discuss our methods in detail, along with their relative strengths and weaknesses, and also show how our results compare with the state-of-the-art.

Chapter 2
Previous Work

Before we discuss our novel work, we will first review the existing literature in action recognition. Since the field and the range of methods are quite broad, we will focus in particular on methods that apply graphical models for recognizing actions. We classify them into three broad sections: generative models, which model the joint probability P(X, Y) between the observation sequence X and state sequence Y; discriminative models, which model the conditional probability P(Y|X); and some recent work which combines object and pose tracking with action recognition using graphical models. We take the last approach in many of the methods discussed in this thesis.

2.1 Generative Models (HMMs)

[72] presents one of the earliest applications of graphical models for human action recognition in videos, using a feature-based bottom-up approach with HMMs. They first extract image features using 2D meshes in each video and assign a codeword to them, to obtain the observation sequence. Then, they recognize the action by computing the log-likelihood for each of a set of pre-learned action HMMs and selecting the action corresponding to the HMM with the highest log-likelihood. They show impressive performance in distinguishing among 6 tennis strokes.
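This score-each-model-and-take-the-argmax scheme is simple to state concretely. The sketch below is purely illustrative - two toy 2-state models over a 2-symbol codebook with made-up numbers, not the mesh features or parameters of [72] - and computes log P(obs | model) with the standard forward algorithm in log space:

```python
import numpy as np

def log_forward(obs, log_pi, log_A, log_B):
    """log P(obs | model) via the forward algorithm, in log space.

    obs    : list of discrete observation symbols (codeword indices)
    log_pi : (N,)   log initial-state probabilities
    log_A  : (N, N) log transitions, log_A[i, j] = log P(s_t = j | s_{t-1} = i)
    log_B  : (N, M) log emissions,   log_B[j, k] = log P(symbol k | state j)
    """
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # alpha'(j) = logsumexp_i(alpha(i) + log_A[i, j]) + log_B[j, o]
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

def classify(obs, models):
    """Return the action whose HMM assigns obs the highest log-likelihood."""
    return max(models, key=lambda name: log_forward(obs, *models[name]))

# Toy models (illustrative numbers only): "wave" mostly emits symbol 0,
# "point" mostly emits symbol 1.
log = lambda x: np.log(np.array(x))
models = {
    "wave":  (log([0.5, 0.5]), log([[0.9, 0.1], [0.1, 0.9]]),
              log([[0.9, 0.1], [0.8, 0.2]])),
    "point": (log([0.5, 0.5]), log([[0.9, 0.1], [0.1, 0.9]]),
              log([[0.1, 0.9], [0.2, 0.8]])),
}
```

Here `classify([0, 0, 0, 1, 0], models)` returns "wave"; in [72] the symbols come from mesh-feature codewords and the per-action models are learned from training clips.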
[58] presents another early application of HMMs, for recognizing gestures by updating a recursive filter based on extracted hand pose information.

While impressive, these early works applied HMMs for recognizing actions in a temporally segmented video sequence - i.e., the input sequences contained only one action. [62] demonstrates HMMs for recognizing sentences from American Sign Language (ASL) from hand tracks obtained by tracking colored gloves in video. This is particularly impressive since their algorithm does not require costly "datagloves" or explicit modeling of fingers, and achieves high accuracy rates on sentences from a large 40-word lexicon. Further, both the inference and training were done using unsegmented ASL sentences. [63] built on this work and showed impressive results with explicit high-level grammar constraints as well as in unrestricted experiments.

[66] introduced an extension of HMM called PaHMM for modeling processes occurring independently, in parallel. They apply it to ASL recognition by modeling the actions of each hand as parallel streams and training the parameters appropriately. The independence assumption reduces the number of parameters in the model compared to HMMs, making it potentially more scalable for large vocabularies, and their results demonstrate improved robustness for ASL recognition. [67] incorporates linguistic constraints from the Movement-Hold [32] model of ASL into the PaHMM framework and shows improved performance. [67] also demonstrates the generality of the PaHMM framework by incorporating high-level constraints for gait recognition. The explicit use of high-level models and constraints in [67] is in contrast to most other approaches using HMMs and other graphical models, which focus on bottom-up learning from image features. [5] introduces the coupled HMM (CHMM) for modeling multiple interacting processes, in contrast to PaHMM, which assumes that the processes are independent.
[4] presents an "N-heads" dynamic programming algorithm for efficient inference and learning in CHMMs. [5][4] demonstrate the advantages of CHMM over HMMs by rigorous testing on simulated data, and also for recognizing tai-chi gestures in video. The results also demonstrate that CHMM training is insensitive to parameter initialization, in contrast to HMMs. [56][50] applied CHMMs for recognizing human interactions by training them first using synthetic data and then using the trained models for recognition in real videos, with no additional training or tuning.

[20] recognizes events in video by first detecting simpler primitive events using Bayesian networks in each frame and then recognizing more complex composite events from the primitive events using a hidden semi-Markov model (HSMM). This work is one of the earliest demonstrations of using explicit duration models for event recognition. Further, [20] also presents an algorithm for linear-time inference in HSMMs when the duration model is uniform or Gaussian, and presents recognition results for monitoring vehicles in videos collected from an unmanned aerial vehicle (UAV), as well as in actions involving multiple interacting people, like stealing.

Action recognition at multiple levels has been addressed in several other works as well. [49] introduces the layered hidden Markov model (LHMM) to recognize actions at multiple temporal scales and abstraction levels. The activities at each level are classified using an HMM for that level. Further, only the lower-level actions can influence inference at the higher levels. [49] applied LHMMs to recognize common office actions from multiple real-time streams including video, audio and computer interactions. [7] introduces an extension of the hierarchical HMM (HHMM) that allows lower-level states to share parents, making the hierarchical structure a lattice instead of a tree. [46] applied this extension of HHMM for recognizing actions in a typical indoor office environment, similar to [49].
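The difference an explicit duration model makes is easiest to see in the decoding recursion: instead of extending a path one frame at a time, a semi-Markov model scores whole segments of d frames at once, with d drawn from a per-state duration distribution. The following is a minimal sketch of that idea (discrete observations, frames within a segment emitted i.i.d., a small maximum duration) - not the linear-time algorithm of [20], and all names and numbers are illustrative:

```python
import numpy as np

def hsmm_viterbi_score(obs, log_pi, log_A, log_B, log_D):
    """Best log-score over all segmentations of obs under an explicit-duration model.

    obs    : list of discrete observation symbols, length T
    log_pi : (N,) log initial-state probs;  log_A : (N, N) log transitions
    log_B  : (N, M) log emission probs (frames within a segment are i.i.d.)
    log_D  : (N, Dmax + 1) log duration probs, log_D[j, d] for d = 1..Dmax
    """
    T, N = len(obs), log_pi.shape[0]
    Dmax = log_D.shape[1] - 1
    # emit[j, t] = sum of log_B[j, obs[0..t-1]]; segment sums are differences
    emit = np.concatenate([np.zeros((N, 1)),
                           np.cumsum(log_B[:, obs], axis=1)], axis=1)
    delta = np.full((T + 1, N), -np.inf)  # delta[t, j]: best score, j-segment ends at t
    for t in range(1, T + 1):
        for d in range(1, min(Dmax, t) + 1):
            # score of spending exactly d frames in each state j, ending at frame t
            seg = emit[:, t] - emit[:, t - d] + log_D[:, d]
            if t == d:  # first segment of the sequence
                cand = log_pi + seg
            else:       # best predecessor state i, then transition i -> j
                cand = np.max(delta[t - d][:, None] + log_A, axis=0) + seg
            delta[t] = np.maximum(delta[t], cand)
    return np.max(delta[T])
```

With durations peaked at 3 frames, a sequence of two coherent 3-frame segments scores far better than one that forces many 1-frame segments, which is precisely the preference an HMM's geometric state-duration distribution cannot express.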
There has been recent work to combine the representational advantages of these different models in a single formalism. The switching hidden semi-Markov model (S-HSMM) [13] presents a two-layer extension of HSMM that recognizes actions at different abstractions and also models action durations. [13] consider two possible duration models, the multinomial and the discrete Coxian distribution, and demonstrate the S-HSMM for recognizing human actions of daily living. In our own work, [41] presents the coupled hidden semi-Markov model (CHSMM), which combines HSMM and CHMM to simultaneously model multiple interacting processes as well as action durations. [42], on the other hand, introduces the hierarchical multi-channel hidden semi-Markov models (HM-HSMM), which simultaneously model action hierarchy, state durations and multi-channel interactions. [42][41] present efficient algorithms for learning and inference in the combined models and demonstrate them for sign language recognition in a domain similar to [66][67]. They show good performance even with a small training set and simple features. Further, they incorporate high-level language constraints into their models, similar to [67].

2.2 Discriminative Models (CRF)

The various HMM extensions discussed so far model the joint probability P(X, Y) between the observations X and states Y. This assumes that the observations x_t ∈ X are independent of each other, which is unrealistic in many applications; conditional random fields (CRF) [28] were introduced to directly model the conditional probability P(Y|X). CRFs have been successfully applied in a wide range of domains including natural language processing (e.g. [28]), document processing (e.g. [64]), image classification (e.g. [27]), object segmentation (e.g. [69]) and pose tracking (e.g. [65]), among others. There have been several recent works in action recognition as well that attempt to replicate this success.

[61] presents one of the earliest applications of CRFs for action recognition, and demonstrates the effectiveness of CRFs for classifying not only distinct actions but also subtle variations within a single action class. They train their models from synthetic data generated by rendering motion capture data for various actions, and recognize actions in videos by extracting shape context features [2] from silhouettes. Their results show that CRFs outperform HMMs as well as Maximum Entropy Markov Models (MEMM) [36].

One key disadvantage of CRFs is that they need annotations of the entire state sequence for training models, which is costly or even impractical for most action recognition applications. [68] introduce a discriminative hidden-state approach for action recognition using the hidden CRF (HCRF). HCRFs were originally introduced in [53] for object recognition, and [68] adapted them for recognition of human gestures. Each action/gesture in the HCRF contains a set of unobserved hidden states, and the training data contains annotations only for the actions. They show superior performance of HCRFs over CRF and HMM for recognition of human arm gestures and head gestures. They also show that the performance improves significantly by using observations from a window around the current frame instead of only the current frame.
2.3 Combined Tracking and Action Recognition

The various applications that we discussed so far take a bottom-up approach to recognition, where suitable image features are first extracted and then graphical models are used for action recognition. Such approaches rely heavily on accurate extraction of low-level features such as silhouettes or tracks, which is unrealistic in many applications. In contrast, several top-down approaches have been proposed in recent work; these use high-level action models to guide low-level tracking and then use the feedback from the lower levels to recognize the action sequence. They are robust to low-level errors, but assume that only a known set of actions can take place in the domain.

[73] presents a tracking-as-recognition approach where a hierarchical finite state machine is constructed from 3D motion capture data for each action. Actions in video sequences are then recognized by comparing the prior models with motion templates extracted from the video. [73] shows very promising results on several difficult sequences, demonstrating the utility of the top-down approach. [15] presents a related approach where the states of an HMM are represented with a set of pose exemplars, and gestures in video are recognized by comparing the exemplars with image edge maps using the Hausdorff distance. More recently, [52] also takes a tracking-as-recognition approach for analyzing actions involving articulated motion. Here the actions are represented by a variant of HHMM known as the factored state HHMM (FS-HHMM), which represents possible transitions of the actor pose; the pose in turn is represented by a 3D body model with 29 degrees of freedom (for the joint angles). At each frame, the score for each pose is computed from the overlap between the projected pose and the silhouette; poses are tracked and actions recognized by accumulating these scores.
In our own work, [43] combines ideas from [73] and [52] to present a top-down approach that simultaneously tracks and recognizes articulated full-body human motion using learned action models. [45] extends this idea to learn the action models by lifting partially annotated 2D videos. In contrast to using action models to guide tracking, [8] presents a method for jointly recognizing events and linking fragmented tracklets into long-duration tracks. Actions are represented with a DBN, which supplies data-driven constraints for estimating the likelihood of possible tracklet matches. The event constraints are then combined with appearance and kinematic constraints, and the event model with the highest event score is recognized to have occurred if the likelihood score exceeds a threshold. [8] demonstrates the approach in a scene with airplane servicing actions and many non-event actors.

Chapter 3

Hierarchical Multi-channel Hidden Semi-Markov Models

Hidden Markov models (HMM) have come into wide use for activity recognition due to their simplicity, but the standard form suffers from three key limitations: unrealistic models for the duration of a sub-event, no direct encoding of interactions among multiple agents, and no modeling of the inherent hierarchical organization of these activities. In this chapter, we introduce a new family of HMMs that simultaneously addresses these limitations. We also present efficient algorithms for inference and learning in these models. In particular, we focus on three possible structures - the Coupled Hidden Semi-Markov Model (CHSMM), the Hierarchical Semi-Markov Parallel Hidden Markov Model (HSPaHMM) and the Hierarchical Parallel Hidden Semi-Markov Model (HPaHSMM). CHSMM introduces explicit duration models into coupled HMMs, while HSPaHMM and HPaHSMM are 2-layer structures. We demonstrate these first on synthetic time series data and then in an application for sign language recognition.
3.1 Hidden Markov Models and Extensions

We will begin by defining the basic HMM formally, and progressively build the representation to describe the various extensions, before we describe our formalism. HMMs are basically a class of Dynamic Bayesian Networks (DBN) where there is a temporal evolution of nodes. An HMM model \lambda is specified by the tuple:

\lambda = (Q, O, A, B, \pi)

where Q is the set of possible states, O is the set of observation symbols, A is the state transition probability matrix (a_{ij} = P(q_{t+1} = j \mid q_t = i)), B is the observation probability distribution (b_j(k) = P(o_t = k \mid q_t = j)) and \pi is the initial state distribution. It is straightforward to generalize this model to continuous (e.g. Gaussian) output models by parameterizing the observation and transition probabilities.

The hierarchical hidden Markov model (HHMM) was introduced in [16] to model complex multi-scale structure by including a hierarchy of hidden states. Formally, an HHMM with H levels can be represented by the tuples:

\lambda'_h = (Q_h, O_h, A_h, B_h, \pi_h), \quad h = 1..H

The parameters A_h, B_h, \pi_h can depend not only on the state at level h but also on the other levels, typically the parent level (h+1) and the child level (h-1).

In traditional HMMs, the first-order Markov assumption implies that the duration probability of a state decays exponentially. Hidden semi-Markov models (HSMM) were proposed to alleviate this problem by introducing explicit state duration models. Thus an HSMM model \lambda'' can be specified by the tuple:

\lambda'' = (Q, O, A, B, D, \pi)

Here D is the set of parameters for the duration models and the other parameters are as defined before.

In HMMs, a single variable represents the state of the system at any instant. However, many interesting processes have multiple interacting agents, and several multi-channel HMMs have been proposed to model these.
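Before turning to the multi-channel case, the basic tuple \lambda = (Q, O, A, B, \pi) can be made concrete. The sketch below is an illustrative encoding (not the thesis implementation): states and symbols are integer indices, and the only operation shown is the joint log-probability \ln P(X, Y) of a state/observation sequence pair, which is the quantity the generative models above maximize:

```python
# Illustrative encoding of the HMM tuple lambda = (Q, O, A, B, pi).
# States and symbols are integer indices; matrices are nested lists.
import math

class HMM:
    def __init__(self, A, B, pi):
        self.A = A    # A[i][j] = P(q_{t+1} = j | q_t = i)
        self.B = B    # B[j][k] = P(o_t = k | q_t = j)
        self.pi = pi  # pi[i]   = P(q_1 = i)
        for row in A + B + [pi]:  # every row must be a distribution
            assert abs(sum(row) - 1.0) < 1e-9

    def joint_log_prob(self, states, obs):
        """ln P(X, Y) for a state sequence Y and observation sequence X."""
        lp = math.log(self.pi[states[0]]) + math.log(self.B[states[0]][obs[0]])
        for t in range(1, len(states)):
            lp += math.log(self.A[states[t - 1]][states[t]])
            lp += math.log(self.B[states[t]][obs[t]])
        return lp

# Toy 2-state, 2-symbol model (numbers are hypothetical).
hmm = HMM(A=[[0.9, 0.1], [0.2, 0.8]],
          B=[[0.7, 0.3], [0.4, 0.6]],
          pi=[0.6, 0.4])
```

The extensions that follow (HHMM, HSMM, multi-channel models) each enrich one component of this tuple: the hierarchy index on (Q, A, B, \pi), the duration set D, or the channel superscript C.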
These extensions basically generalize the HMM state to be a collection of state variables (S_t = (S^1_t, .., S^C_t)) and are represented as:

\lambda''' = (Q^C, O^C, A^C, B^C, \pi^C)

where Q^c and O^c are the possible states and observations at channel c respectively, and \pi^c represents the initial probability of channel c's states. A^C contains the transition probabilities over the composite states, P([q^1_{t+1}, .., q^C_{t+1}] \mid [q^1_t, .., q^C_t]), and B^C contains the observation probabilities over the composite states, P([o^1_t, .., o^C_t] \mid [q^1_t, .., q^C_t]). In this form, the learning and inference algorithms are exponential in C, and also give poor performance due to over-fitting and the large number of parameters that must be learnt. The various multi-channel extensions introduce simplifying assumptions that help in factorizing the transition and observation probabilities.

Factorial hidden Markov models (FHMM) [17] factor the hidden state into multiple variables which are nominally coupled at the output. This allows factorizing A^C into C independent N*N matrices (N = number of states in each channel), while the observation probabilities B^C are left unfactorized:

P([q^1_{t+1}, .., q^C_{t+1}] \mid [q^1_t, .., q^C_t]) = \prod_{c=1}^{C} P(q^c_{t+1} \mid q^c_t)

Parallel hidden Markov models (PaHMM) [66] factor the HMM into multiple independent chains and hence allow factorizing both A^C and B^C. Thus we have,

P([q^1_{t+1}, .., q^C_{t+1}] \mid [q^1_t, .., q^C_t]) = \prod_{c=1}^{C} P(q^c_{t+1} \mid q^c_t)
P([o^1_t, .., o^C_t] \mid [q^1_t, .., q^C_t]) = \prod_{c=1}^{C} P(o^c_t \mid q^c_t)    (3.1)

Coupled hidden Markov models (CHMM) [6] on the other hand factor the HMM into multiple chains where the current state of a chain depends on the previous state of all the chains. In CHMMs, each channel has its own observation sequence and hence B^C can be factorized.
Like FHMM and PaHMM, they allow each channel to evolve independently, and hence we have,

P([q^1_{t+1}, .., q^C_{t+1}] \mid [q^1_t, .., q^C_t]) = \prod_{i=1}^{C} P(q^i_{t+1} \mid [q^1_t, .., q^C_t])    (3.2)

Further, they assume that the interaction between channels i and j is independent of the interaction between i and k, and replace P(q^i_{t+1} \mid [q^1_t, .., q^C_t]) with the transition potential \phi_{trans}([q^1_t, .., q^C_t], q^i_{t+1}):

\phi_{trans}([q^1_t, .., q^C_t], q^i_{t+1}) = \prod_{j=1}^{C} P(q^i_{t+1} \mid q^j_t)    (3.3)

With this simplification, we have an N*N transition matrix for each pair of channels i and j, and A^C factorizes into C^2 N*N matrices. The factorization on the RHS of equation (3.3) does not sum to 1. Hence during inference, we compute a score \Phi(X, Y) that measures the affinity between the observation sequence X and state sequence Y, rather than the joint probability P(X, Y). However, \phi_{trans}([q^1_t, .., q^C_t], q^i_{t+1}) monotonically increases with P(q^i_{t+1} \mid [q^1_t, .., q^C_t]) (see Appendix A of the Supplemental for a proof), which implies that the score \Phi(X, Y) also monotonically increases with the joint probability P(X, Y). Hence classifications based on \Phi(X, Y) are identical to those based on P(X, Y). This is demonstrated empirically in [4] through extensive experiments on simulated data, and we show similar results in our experiments as well.

Each of the above extensions addresses one of the limitations of HMMs and presents a solution that is suited to some specific application domains. In more recent work, [12] introduced the switching HSMM (S-HSMM), which simultaneously addresses duration modeling and hierarchical organization. The S-HSMM represents activities in 2 layers, with the lower layer containing an HSMM and the upper layer containing a Markov model.
Thus we have,

\lambda^{S\text{-}HSMM}_{lower} = (Q_{lower}, O_{lower}, A_{lower}, B_{lower}, D_{lower}, \pi_{lower})
\lambda^{S\text{-}HSMM}_{upper} = (Q_{upper}, O_{upper}, A_{upper}, B_{upper}, \pi_{upper})

3.2 Model Definition and Parameters

The hierarchical multi-channel hidden semi-Markov models (HM-HSMM) that we propose combine duration modeling, multi-channel interactions and hierarchical structure in a single model structure. In the most general form, they can be described by a set of parameters of the form:

\lambda''''_h = (Q^c_h, O^c_h, A^c_h, B^c_h, D^c_h, \pi^c_h), \quad h = 1..H

where h \in 1..H is the hierarchy index, c is the number of channels at level h, and the parameters have interpretations similar to before. Each channel at a higher level can be formed by a combination of channels at the lower level. Also, the duration models at each level are optional. Further, the channels at each level in the hierarchy may be factorized using any of the methods discussed above (PaHMM, CHMM etc). It can be seen that \lambda'''' presents a synthesis of \lambda', \lambda'' and \lambda'''.

The Coupled Hidden Semi-Markov Model (CHSMM) is a special case of \lambda''''_h where each channel has states with explicitly specified duration models. This can be specified by the tuple:

\lambda_{CHSMM} = (Q^C, O^C, A^C, B^C, D^C, \pi^C)

where the parameters Q^C, O^C, A^C, B^C, \pi^C are defined as before. D^C contains a set of parameters of the form P(d^c_i = k \mid q^c_t = i), i.e. the probability of state i in channel c having a duration k. Further, A^C and B^C can be factorized by any of the methods discussed before. In this chapter, we will focus on the causal coupling used in CHMMs.
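The two transition factorizations used in this chapter can be illustrated side by side. The sketch below (toy numbers, not a real model) computes the composite transition term for a two-channel model under the PaHMM factorization (3.1) and the CHMM transition potential (3.3); as noted above, the CHMM potential is a score rather than a normalized probability:

```python
# Illustrative two-channel transition factorizations (N = 2 states/channel).
def pahmm_trans(prev, nxt, chan):
    """PaHMM (eq. 3.1): independent channels, P = prod_c P(nxt[c] | prev[c])."""
    p = 1.0
    for c in range(len(prev)):
        p *= chan[c][prev[c]][nxt[c]]
    return p

def chmm_potential(prev, nxt, cross):
    """CHMM (eq. 3.3): phi = prod_c prod_j P(nxt[c] | prev[j]);
    a score, not a normalized probability."""
    phi = 1.0
    for c in range(len(nxt)):
        for j in range(len(prev)):
            phi *= cross[j][c][prev[j]][nxt[c]]
    return phi

# chan[c]: within-channel transition matrix for channel c (PaHMM case).
chan = [[[0.9, 0.1], [0.3, 0.7]],
        [[0.8, 0.2], [0.4, 0.6]]]
# cross[j][c]: P(q^c_{t+1} | q^j_t); diagonal blocks reuse chan,
# off-diagonal blocks are uniform here (hypothetical values).
cross = [[chan[0], [[0.5, 0.5], [0.5, 0.5]]],
         [[[0.5, 0.5], [0.5, 0.5]], chan[1]]]
```

Summing `pahmm_trans` over all next composite states gives exactly 1, while summing `chmm_potential` generally does not - which is why inference with the CHMM factorization returns an affinity score \Phi(X, Y) rather than P(X, Y).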
The Hierarchical Semi-Markov Parallel Hidden Markov Model (HSPaHMM) has 2 layers, with multiple HMMs at the lower layer and an HSMM at the upper layer, and has the following set of parameters:

\lambda^{HSPaHMM}_{lower} = (Q^C_{lower}, O^C_{lower}, A^C_{lower}, B^C_{lower}, \pi^C_{lower})
\lambda^{HSPaHMM}_{upper} = (Q_{upper}, O_{upper}, A_{upper}, B_{upper}, D_{upper}, \pi_{upper})

The Hierarchical Parallel Hidden Semi-Markov Model (HPaHSMM) on the other hand contains multiple HSMMs at the lower layer and a single Markov chain at the upper layer, and hence has the following set of parameters:

\lambda^{HPaHSMM}_{lower} = (Q^C_{lower}, O^C_{lower}, A^C_{lower}, B^C_{lower}, D^C_{lower}, \pi^C_{lower})
\lambda^{HPaHSMM}_{upper} = (Q_{upper}, O_{upper}, A_{upper}, B_{upper}, \pi_{upper})

It can be seen that \lambda_{CHSMM}, \lambda_{HSPaHMM} and \lambda_{HPaHSMM} are all special cases of \lambda''''. Figure 3.1 illustrates the relationship between the various HMM extensions discussed, and figure 3.2 illustrates their graphical structures. The key difference between HSPaHMM and HPaHSMM is that HSPaHMM models the duration of the entire low-level PaHMM with the top-level HSMM state, while HPaHSMM models the duration of each state in each low-level HSMM. Thus HSPaHMM requires fewer parameters, while HPaHSMM is a richer structure for modeling real events.

Figure 3.1: Relationship between different models discussed

3.3 Decoding Algorithm

Here we present a generic algorithm for decoding HM-HSMMs and then discuss simplifications of this algorithm for the specific instances of HM-HSMM that we focus on. We will do this progressively, starting with Viterbi decoding in HMMs and the algorithms used for decoding the various HMM extensions, before we present decoding in HM-HSMMs.
3.3.1 Recursive Viterbi Descent (RVD)

The Recursive Viterbi Descent (RVD) algorithm works by starting at states in the highest layer and recursively computing the Viterbi path at different time segments in the lower layers. In this algorithm, we assume that each channel in the lower level has a single parent channel in the higher level, and also that each lower-level state has a unique parent state. Thus we restrict the hierarchy to be a tree rather than a lattice.

Figure 3.2: Structure of a) HMM b) HSMM c) PaHMM d) PaHSMM e) CHMM f) CHSMM g) HSPaHMM h) HPaHSMM - squares are observations, circles states, rounded rectangles are states over some duration

In an HMM, a single variable represents the state at time t, and the best state sequence given an observation sequence o_1..o_T is computed by calculating variables \delta_{i,t} as:

\delta_{i,t} = \max_j \left( \delta_{j,t-1} + \ln P(i \mid j) \right) + \ln P(o_t \mid i)    (3.4)

where i and j are possible states. Here \delta_{i,t} denotes the log-likelihood of the maximum probability path to state i at time t. Equation (3.4) describes the Viterbi algorithm for inference in HMMs and has O(TN^2) complexity, where T = length of the observation sequence and N = number of states in the HMM.

In an HSMM, where we explicitly model durations, the \delta variables are computed over durations [ts, te]:

\delta_{i,ts,te} = \max_{j,ts'} \left( \delta_{j,ts',ts-1} + \ln P(i \mid j) \right) + \ln P(te-ts+1 \mid i) + \sum_{t=ts}^{te} \ln P(o_t \mid i)    (3.5)

This has O(N^2 T^3) complexity, though several methods exist to improve the performance. Typically, we maximize \delta_{i,ts,te} over the durations (te-ts+1) at each frame, but we will keep this form here for further discussion.

In the various multi-channel extensions we have a vector [i_1 .. i_{c'} .. i_C] of C values for the state representation.
Replacing the single state i in equation (3.4) with the vector [i_1 .. i_{c'} .. i_C] we get:

\delta_{[i_1..i_{c'}..i_C],t} = \max_{[j_1..j_{c'}..j_C]} \left\{ \delta_{[j_1..j_{c'}..j_C],t-1} + \ln P([i_1..i_{c'}..i_C] \mid [j_1..j_{c'}..j_C]) \right\} + \ln P(o_t \mid [i_1..i_{c'}..i_C])    (3.6)

In this form, the inference complexity is O(TN^{2C}), but this can be reduced significantly by using any of the factorizations discussed earlier.

In HHMMs [16], we compute variables \delta^{h,p,\tau}_{i,t} as follows:

\delta^{h,p,\tau}_{i,t} = \max_j \left\{ \delta^{h,p,\tau}_{j,t-1} + \ln P^{h,p}(i \mid j) \right\} + \ln P(o_t \mid i)    (3.7)

Here, \delta^{h,p,\tau}_{i,t} corresponds to the log-likelihood of the maximum probability path through the HMM at level (h-1) such that the state of level h-1 is i at time t, and the parent state p at level h started at time \tau. This is similar to the inference algorithm presented in [16], though we have changed notation slightly for simplicity.

Figure 3.3: Recursive Viterbi Descent with 2 channels at level h

In the HM-HSMMs that we propose, we combine HSMM, MC-HMM and HHMM. Thus, combining the \delta variables in equations (3.4)-(3.7), at each level h in the hierarchy we compute variables of the form \Delta^{h,c,p,\tau}_{[i_{c1},ts_{c1},te_{c1}]..[i_{c'},ts_{c'},te_{c'}]..[i_{c2},ts_{c2},te_{c2}]}, where c and p are the parent channel and state respectively in level (h+1), [c1..c2] is the set of channels in level h with c as parent, and [i_{c1}..i_{c2}] are possible states in channels [c1..c2] respectively with p as parent state. \tau is the time at which the parent channel c started in parent state p; for an observation sequence of length T, \tau \in [1..T].
ts_{c'} and te_{c'} are the start and end times of the interval during which channel c' at level h is in state i_{c'}. With these definitions, we can combine equations (3.4)-(3.7) to compute the \Delta's as follows:

\Delta^{h,c,p,\tau}_{[i_{c1},ts_{c1},te_{c1}]..[i_{c'},ts_{c'},te_{c'}]..[i_{c2},ts_{c2},te_{c2}]} = \max_{j_{c'},ts'_{c'}} \Big\{ \Delta^{h,c,p,\tau}_{[i_{c1},ts_{c1},te_{c1}]..[j_{c'},ts'_{c'},ts_{c'}-1]..[i_{c2},ts_{c2},te_{c2}]} + \ln P(i_{c'} \mid [i_{c1}..j_{c'}..i_{c2}]) + \ln P(te_{c'}-ts_{c'}+1 \mid i_{c'}) + \max_{[i'_{cl1}..i'_{cl2}],[ts'_{cl1}..ts'_{cl2}]} \Delta^{h-1,c',i_{c'},ts_{c'}}_{[i'_{cl1},ts'_{cl1},te_{c'}]..[i'_{cl'},ts'_{cl'},te_{c'}]..[i'_{cl2},ts'_{cl2},te_{c'}]} \Big\}    (3.8)

where ts_{c'}-1 \le te_{c''}, \forall c'' \in [c1..c2], c'' \ne c'. It can be seen that equation (3.8) is a combination of equations (3.4)-(3.7). This has a time complexity of O(HCN^{2C+1}T^{3C+1}). Figure 3.3 illustrates the RVD algorithm. It can be seen that at each step in the algorithm, we add a brick of probability for the interval [ts, te] from the lower level.

3.3.2 Factored Recursive Viterbi Descent (FRVD)

While the RVD algorithm accurately computes the highest probability path, its time complexity is exponential in the number of channels C. This can be simplified if we can factorize the \Delta's into independent channels, which is possible for the independent-channel factorization (3.1) and the coupled factorization (3.3). Let \delta^{h,c,p,\tau}_{i_{c'},ts_{c'},te_{c'}} denote the log-likelihood of the maximum probability path such that channel c' at level h is in state i_{c'} from time ts_{c'} to te_{c'}; c and p are the parent channel and state and \tau is the start time as before. Then,

\delta^{h,c,p,\tau}_{i_{c'},ts_{c'},te_{c'}} = \max_{\substack{i_{c''},ts_{c''},te_{c''} \\ \forall c'' \in [c1,c2], c'' \ne c'}} \left\{ \Delta^{h,c,p,\tau}_{[i_{c1},ts_{c1},te_{c1}]..[i_{c'},ts_{c'},te_{c'}]..[i_{c2},ts_{c2},te_{c2}]} \right\}    (3.9)

which simply maximizes over all the free variables in \Delta.
If we assume that the channels evolve independently and interact only during transitions as in equation (3.2), substitute (3.8) in (3.9) and rearrange the terms, we get:

\delta^{h,c,p,\tau}_{i_{c'},ts_{c'},te_{c'}} = \max_{\substack{j_{c'},ts'_{c'}; \; i_{c''},ts_{c''},te_{c''} \\ \forall c'' \in [c1,c2], c'' \ne c'}} \Big\{ \Delta^{h,c,p,\tau}_{[i_{c1},ts_{c1},te_{c1}]..[j_{c'},ts'_{c'},ts_{c'}-1]..[i_{c2},ts_{c2},te_{c2}]} + \ln P(i_{c'} \mid [i_{c1}..j_{c'}..i_{c2}]) + \ln P_d(te_{c'}-ts_{c'} \mid i_{c'}) + \max_{[i'_{cl1}..i'_{cl2}],[ts'_{cl1}..ts'_{cl2}]} \Delta^{h-1,c',i_{c'},ts_{c'}}_{[i'_{cl1},ts'_{cl1},te_{c'}]..[i'_{cl2},ts'_{cl2},te_{c'}]} \Big\}    (3.10)

PaHMM Factorization: If we use the independent non-interacting channel factorization as in equation (3.1), we can compute the \Delta's in terms of the \delta's as:

\Delta^{h,c,p,\tau}_{[i_{c1},ts_{c1},te_{c1}]..[i_{c'},ts_{c'},te_{c'}]..[i_{c2},ts_{c2},te_{c2}]} = \sum_{c'=c1}^{c2} \delta^{h,c,p,\tau}_{i_{c'},ts_{c'},te_{c'}}    (3.11)

Substituting (3.11) and the PaHMM factorization (3.1) in equation (3.10), we get:

\delta^{h,c,p,\tau}_{i_{c'},ts_{c'},te_{c'}} = \max_{j_{c'},ts'_{c'}} \Big\{ \delta^{h,c,p,\tau}_{j_{c'},ts'_{c'},ts_{c'}-1} + \ln P(i_{c'} \mid j_{c'}) + \ln P_d(te_{c'}-ts_{c'} \mid i_{c'}) + \sum_{c''=1}^{C_{i_{c'}}} \max_{i'_{c''},ts'_{c''}} \delta^{h-1,c',i_{c'},ts_{c'}}_{i'_{c''},ts'_{c''},te_{c'}} \Big\}    (3.12)

where C_{i_{c'}} is the number of channels at level h-1 with i_{c'} as parent. This has an overall complexity of O(HC^2N^3T^4), which is polynomial in all the variables. See Appendix C of the Supplemental for the full derivation of equation (3.12).
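The building block underneath these factored recursions is the duration-explicit Viterbi of equation (3.5). A flat, single-channel sketch is given below (illustrative only; the toy model and its numbers are invented, and for readability the loops are the naive O(N^2 T^2 D) form rather than the pruned versions discussed later):

```python
# Flat HSMM Viterbi sketch (equation (3.5), single channel, naive loops).
import math

def hsmm_viterbi(obs, A, B, dur, pi):
    """dur[i][d] = P(state i lasts exactly d frames), d >= 1.
    Returns the best log-likelihood over full segmentations of obs."""
    N, T = len(A), len(obs)
    NEG = float("-inf")
    # best[t][i]: best log-score of paths whose last segment is state i
    # and ends exactly at frame t (exclusive).
    best = [[NEG] * N for _ in range(T + 1)]
    for t in range(1, T + 1):
        for i in range(N):
            for d in range(1, min(t, len(dur[i]) - 1) + 1):
                if dur[i][d] <= 0.0:
                    continue
                ts = t - d
                emit = sum(math.log(B[i][obs[u]]) for u in range(ts, t))
                seg = math.log(dur[i][d]) + emit
                if ts == 0:
                    if pi[i] <= 0.0:
                        continue
                    cand = math.log(pi[i]) + seg
                else:
                    prev = [best[ts][j] + math.log(A[j][i])
                            for j in range(N) if A[j][i] > 0.0]
                    if not prev:
                        continue
                    cand = max(prev) + seg
                best[t][i] = max(best[t][i], cand)
    return max(best[T])

# Toy model: state 0 lasts exactly 1 frame, state 1 exactly 2 frames.
A = [[0.0, 1.0], [1.0, 0.0]]
B = [[0.9, 0.1], [0.2, 0.8]]
dur = [[0.0, 1.0], [0.0, 0.0, 1.0]]
score = hsmm_viterbi([0, 1, 1], A, B, dur, pi=[1.0, 0.0])
```

In the toy model the only feasible segmentation of the 3-frame sequence is state 0 for one frame followed by state 1 for two, so the returned score is \ln(0.9 \cdot 0.8 \cdot 0.8); the hierarchical recursions (3.12) and (3.13) wrap exactly this inner computation with parent-level and cross-channel terms.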
CHMM Factorization: If we use the causally coupled factorization (3.3) instead of (3.1), by substituting (3.3) and (3.11) in (3.10) and rearranging we get:

\delta^{h,c,p,\tau}_{i_{c'},ts_{c'},te_{c'}} = \max_{j_{c'},ts'_{c'}} \Big\{ \delta^{h,c,p,\tau}_{j_{c'},ts'_{c'},ts_{c'}-1} + \ln P(i_{c'} \mid j_{c'}) + \ln P_d(te_{c'}-ts_{c'} \mid i_{c'}) + \sum_{c''=1}^{C_{i_{c'}}} \max_{i'_{c''},ts'_{c''}} \delta^{h-1,c',i_{c'},ts_{c'}}_{i'_{c''},ts'_{c''},te_{c'}} \Big\} + \sum_{\substack{c''=1 \\ c'' \ne c'}}^{C_{p_c}} \max_{j_{c''},ts'_{c''},te'_{c''}} \Big\{ \delta^{h,c,p,\tau}_{j_{c''},ts'_{c''},te'_{c''}} + \ln P(i_{c'} \mid j_{c''}) \Big\}    (3.13)

where C_{p_c} is the number of channels at level (h-1) under state p in channel c at level h. Terms 1 and 2 in equation (3.13) correspond to influence from the same channel, term 3 corresponds to the duration probability, term 4 corresponds to the probability from the lower levels, and term 5 corresponds to influence from the other channels. The complexity of the FRVD algorithm with the CHMM factorization is O(HC^2N^3T^5). Figure 3.4 illustrates the FRVD algorithm at level h with C = 2 channels, when using the CHMM factorization.

Beam Search: The maximum probability given the observations is given by \sum_{c} \max_{i_c,ts_c} \delta^{H,\tau=1}_{i_c,ts_c,T}, and the best path can be computed by storing the indices that produce the maxima at each step in equation (3.13). While the FRVD algorithm has polynomial time complexity, the inference is still not linear in the sequence length T, making the computation expensive. If we could restrict the time spent in a particular state to some range [M-Th, M+Th], then given \tau in equation (3.13), ts_{c'}, te_{c'} \in [\tau+M-Th, \tau+M+Th]. Thus there are only
Thus there are only 27 ic’,ts’,ts−1 δ h,1 ic’,ts,te δ h−1,c’,i,ts i1,ts1,te δ h−1,c’,i,ts i2,ts2,te δ h−1,c’,i,ts i1,ts1,te δ ic,tsc,tec h,1 + Σ max { } + ln P(i|ic) δ h,1 ic’,ts’,ts−1 c=1 ic tsc tec P(i|ic) P(i|ic’) ic’ c’=2 i1 i2 ts’ ts−1 te ts ts1 te te ts2 P(d=te−ts+1|i) i = max + ln P(i|ic’) δ h,1 ic’,ts,te + ln P(d=te−ts+1|i)+ Σ max { } { δ } ic,tsc,tec h,1 δ h,1 Figure 3.4: Factored Recursive Viterbi Descent with 2 channels at level h O(Th)possiblevaluesforts c 0;te c 0. AsimilarreasoningappliesfortheRHStooandhence computing the ±'s in equation (3.14) has O(HC 2 N 3 TMTh 3 ) and in equation (3.12) has O(HC 2 N 3 TMTh 2 ). We can simplify the complexity further by storing only the top K states for each interval [ts c 0;te c 0] in each channel c 0 at each level h. Then the complexity becomes O(HC 2 NK 2 TMTh 3 ) and O(HC 2 NK 2 TMTh 2 ) for the coupled and independent chan- nel factorizations respectively. In most applications, especially with Gaussian or Uniform duration models, M;Th¿T andalsok¿N making the run time reasonable, evenreal- time under some conditions. Further, we can take advantage of additional constraints based on the speci¯c model structure as well as application domain to further reduce the complexity. 28 3.3.3 Applications of Inference Algorithm Intherestofthissection,wepresent3speci¯cinstancesofthegeneralinferencealgorithm, forthemodelinstancesthatwefocuson. Allthemodelsconsideredare2-layerstructures, with a single chain at the top layer and multiple chains at the lower level. We use the following notations - Let C be the number of channels in lower layer, T the number of frames or observations in each channel, N the number of states in each channel of the lower layer 1 and W the number of lower level HMMs. Then, the top-level HMM will have one state for every lower level HMM as well as a state for every possible transition between lower level HMMs. 
This is because in applications like sign language recognition, where each word is modeled by a multi-channel HMM/HSMM, each transition between the words in the lexicon is distinct. Each of the W HMMs can transition to any of the W HMMs, giving a total of W^2 transition states. Thus the top-level HMM has a total of W*(W+1) states.

3.3.3.1 CHSMM Decoding

For decoding the CHSMM corresponding to event w, with C channels and N states in each channel, we compute variables \delta^{w,\tau}_{i_c,ts_c,te_c}:

\delta^{w,\tau}_{i_c,ts_c,te_c} = \max_{j_c,ts'_c} \Big\{ \delta^{w,\tau}_{j_c,ts'_c,ts_c-1} + \ln P(i_c \mid j_c) \Big\} + \sum_{\substack{c'=1 \\ c' \ne c}}^{C} \max_{j'_{c'},ts'_{c'},te'_{c'}} \Big\{ \delta^{w,\tau}_{j'_{c'},ts'_{c'},te'_{c'}} + \ln P(i_c \mid j'_{c'}) \Big\} + \ln P(d=te_c-ts_c \mid i_c) + \ln P(O_{ts_c}..O_{te_c} \mid i_c)    (3.14)

(We have assumed that all the lower-level HMMs have the same number of states for simplicity, but it is easy to extend the algorithms presented to cases where the low-level HMMs have varying numbers of states.)

Figure 3.5: Decoding CHSMM - rounded boxes are primitive nodes, dark squares observations. d1, d1-2, d2 are durations.

Equation (3.14) is a special case of equation (3.13). Also note that the Coupled Hidden Markov Model (CHMM) is a special case of CHSMM where we do not explicitly model durations.

Next, in order to decode a sequence of events from an unsegmented observation stream, we simply string together the individual event CHSMMs along with additional states to represent event transitions, to form a compound CHSMM. This is possible since the inter-channel coupling restricts all channels to be in the same event. Since there are W^2 possible transitions and N states for each of the W events, the compound CHSMM has a total of W*(W+N) states in each channel. By setting \tau = 1 in equation (3.14) and dropping the w index (since we have accounted for different events in the compound CHSMM), the overall complexity of computing the \delta's is O(C^2W^2(W+N)^2T^4).
Note that this is O(T^4) instead of O(T^5) as in equation (3.13), since we flattened the hierarchy by taking advantage of the structural constraints. By restricting the number of possible state durations to [M-Th, M+Th] and storing only the top K states at each interval, the complexity becomes O(C^2W(W+N)KTMTh^2). Figure 3.5 illustrates the decoding of CHSMM.

3.3.3.2 HPaHSMM Decoding

HPaHSMM is a 2-layer structure, with a Markov chain at the top layer and multiple HSMMs at the lower layer. For decoding the HSMMs corresponding to event w, with C parallel channels and N states in each channel, we compute variables \delta^{w,\tau}_{i_c,ts_c,te_c}:

\delta^{w,\tau}_{i_c,ts_c,te_c} = \max_{j_c,ts'_c} \Big\{ \delta^{w,\tau}_{j_c,ts'_c,ts_c-1} + \ln P(i_c \mid j_c) \Big\} + \ln P(d=te_c-ts_c \mid i_c) + \ln P(O_{ts_c}..O_{te_c} \mid i_c)    (3.15)

Note that equation (3.15) is a special case of equation (3.12).

Figure 3.6: Decoding HPaHSMM - circles are top-level nodes, rounded boxes lower-level nodes, dark squares observations. d1, d1-2, d2 are durations.

Next, in order to decode a sequence of events from an unsegmented observation stream, we string together the event PaHSMMs and the transition states. Also, to ensure that all the channels are in the same event, we add a single-channel Markov chain at the top layer, each of whose states corresponds to a lower-level PaHSMM. Since there are W upper-level states with N-state lower-level PaHSMMs, and W^2 upper-level transition states with just a single lower-level state, the compound HPaHSMM has a total of W(W+N) states. By setting \tau = 1 and dropping the w index in equation (3.15), the total complexity of computing the \delta's is O(CW^2(W+N)^2T^3). Now, if we restrict the state duration to lie in [m-\sigma, m+\sigma] and store only the top-k states at each time interval in the compound HPaHSMM, the complexity becomes O(CW(W+N)kTm\sigma). Figure 3.6 illustrates the decoding of HPaHSMM.
3.3.3.3 HSPaHMM Decoding

HSPaHMM is a 2-layer structure, with a single semi-Markov chain at the top layer and a PaHMM with multiple HMM chains at the lower layer. For decoding the PaHMM corresponding to event w, we compute variables \delta^{w,\tau}_{i_c,t}:

\delta^{w,\tau}_{i_c,t} = \max_{j_c} \Big\{ \delta^{w,\tau}_{j_c,t-1} + \ln P(i_c \mid j_c) \Big\} + \ln P(o_t \mid i_c)    (3.16)

Note here that the \delta's are computed at each instant t rather than over intervals, since the PaHMM reasons only at each instant. Next, to automatically segment a continuous stream into constituent events, we add a single HSMM, each of whose states corresponds to an event or an event transition. The log-likelihoods in the top layer are computed as:

\delta^{upper,\tau=1}_{w,ts,te} = \max_{w',ts'} \Big\{ \delta^{upper,\tau=1}_{w',ts',ts-1} + \ln P(w \mid w') \Big\} + \ln P(d=te-ts \mid w) + \sum_{c=1}^{C} \max_{i_c,ts_c} \delta^{w,ts}_{i_c,ts_c,te}    (3.17)

Equations (3.16) and (3.17) are both special cases of (3.12). Now, traversing the top-level HSMM takes O(W^2(W+1)^2T^2) and each call to compute the lower-level PaHMM's best path takes O(CN^2T), giving a total complexity of O(CW^2(W+1)^2N^2T^3). If the possible state durations for the HSMM lie in [M-\Sigma, M+\Sigma], traversing the top-level HSMM takes O(W^2(W+1)^2T\Sigma) and each low-level PaHMM takes O(CN^2M), giving a total complexity of O(CW^2(W+1)^2N^2TM\Sigma). Also, each top-level state can transition only to O(W) states, since from event states we go only to one of W possible event transition states, and from each event transition state we can go only to one event state. Thus, by storing only the top K states in the upper HSMM, the overall complexity becomes O(CWKN^2TM\Sigma). Figure 3.7 illustrates HSPaHMM decoding.

Figure 3.7: Decoding HSPaHMM - rounded boxes are top-level nodes, circles lower-level nodes, dark squares observations. d1, d1-2, d2 are durations.

3.4 Learning

Here we present an Expectation-Maximization algorithm for learning the various parameters that is similar to the Baum-Welch algorithm used to train HMMs.
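For reference, the flat-HMM forward and backward recursions that Baum-Welch relies on (equations (3.18)-(3.19)) can be sketched as follows; this is an illustrative probability-domain implementation on a toy model, and real implementations scale the variables or work in log space to avoid underflow:

```python
# Flat-HMM forward-backward sketch (equations (3.18)-(3.19)).
def forward(obs, A, B, pi):
    """alpha[t][i] = P(o_1..o_t, q_t = i)."""
    N = len(A)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[-1][j] * A[j][i] for j in range(N))
                      * B[i][obs[t]] for i in range(N)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | q_t = i)."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    return beta

# Toy 2-state model (hypothetical numbers).
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.3], [0.4, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 0]
alpha, beta = forward(obs, A, B, pi), backward(obs, A, B)
```

A useful sanity check on any such implementation is that \sum_i \alpha_{i,t} \beta_{i,t} equals P(o_1..o_T) at every t; the hierarchical A and B variables derived next generalize these recursions to segments and channels.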
In the Baum-Welch algorithm for HMMs, we compute forward and backward variables \alpha_{i,t} and \beta_{i,t} as follows:

\alpha_{i,t} = \Big\{ \sum_j \alpha_{j,t-1} P(i \mid j) \Big\} P(o_t \mid i)    (3.18)

\beta_{i,t} = \sum_j P(j \mid i) P(o_{t+1} \mid j) \beta_{j,t+1}    (3.19)

where \alpha_{i,t} denotes the total probability of being in state i at time t, and \beta_{i,t} is the total probability of the remaining observations o_{t+1}..o_T, starting in state i at time t. We generalize the \alpha and \beta variables to the HM-HSMMs similarly to the way we generalized the \delta variables in section 3.3.1. See Appendix D of the Supplemental for a full derivation of the forward and backward variables.

At each level h in the hierarchy, we need to compute forward variables of the form A^{h,c,p,\tau_{start}}_{[i_{c1},ts_{c1},te_{c1}]..[i_{c'},ts_{c'},te_{c'}]..[i_{c2},ts_{c2},te_{c2}]}, where the indices have interpretations similar to those in section 3.3.1. A^{h,c,p,\tau_{start}}_{[i_{c1},ts_{c1},te_{c1}]..[i_{c'},ts_{c'},te_{c'}]..[i_{c2},ts_{c2},te_{c2}]} denotes the total probability starting from t = \tau_{start} such that channel c' is in state i_{c'} in the time interval [ts_{c'}, te_{c'}], \forall c \in 1..C_{p_c}. We can write a recursive dynamic programming equation to compute the A's as follows:

A^{h,c,p,\tau_{start}}_{[i_{c1},ts_{c1},te_{c1}]..[i_{c'},ts_{c'},te_{c'}]..[i_{c2},ts_{c2},te_{c2}]} = \sum_{j_{c'},ts'_{c'}} \Big\{ A^{h,c,p,\tau_{start}}_{[i_{c1},ts_{c1},te_{c1}]..[j_{c'},ts'_{c'},ts_{c'}-1]..[i_{c2},ts_{c2},te_{c2}]} P(i_{c'} \mid [i_{c1}..j_{c'}..i_{c2}]) \Big\} \times P_d(te_{c'}-ts_{c'} \mid i_{c'}) \times \sum_{[i'_{cl1}..i'_{cl2}],[ts'_{cl1}..ts'_{cl2}]} \Big\{ A^{h-1,c',i_{c'},ts_{c'}}_{[i'_{cl1},ts'_{cl1},te_{c'}]..[i'_{cl'},ts'_{cl'},te_{c'}]..[i'_{cl2},ts'_{cl2},te_{c'}]} \Big\}    (3.20)

where ts_{c'}-1 \le te_{c''}, \forall c'' \in [c1..c2], c'' \ne c'. Similarly, we can define backward variables of the form B^{h,c,p,\tau_{end}}_{[i_{c1},ts_{c1},te_{c1}]..[i_{c'},ts_{c'},te_{c'}]..[i_{c2},ts_{c2},te_{c2}]}, which denote the total reverse probability from the ending time t = \tau_{end}, such that channel c' is in state i_{c'} in the time interval [ts_{c'}, te_{c'}], \forall c \in 1..C_{p_c}.
The B's can be computed as:

B^{h,c,p,\tau_{end}}_{[i_{c1},ts_{c1},te_{c1}]..[i_{c'},ts_{c'},te_{c'}]..[i_{c2},ts_{c2},te_{c2}]} = P_d(te_{c'}-ts_{c'} \mid i_{c'}) \times \sum_{[i'_{cl1}..i'_{cl2}],[ts'_{cl1}..ts'_{cl2}]} \Big\{ B^{h-1,c',i_{c'},te_{c'}}_{[i'_{cl1},ts'_{cl1},te_{c'}]..[i'_{cl'},ts'_{cl'},te_{c'}]..[i'_{cl2},ts'_{cl2},te_{c'}]} \Big\} \times \sum_{j_{c'},te'_{c'}} \Big\{ P(j_{c'} \mid [i_{c1}..i_{c'}..i_{c2}]) \times B^{h,c,p,\tau_{end}}_{[i_{c1},ts_{c1},te_{c1}]..[j_{c'},te_{c'}+1,te'_{c'}]..[i_{c2},ts_{c2},te_{c2}]} \Big\}    (3.21)

where te_{c'}+1 \ge ts_{c''}, \forall c'' \in [c1..c2], c'' \ne c'.

With these forward and backward variables, we can re-estimate the parameters at each layer similarly to the standard Baum-Welch equations, but this can be time-consuming and prone to overfitting. However, if we can factorize the transition probability P(j_{c'} \mid [i_{c1}..i_{c'}..i_{c2}]) as in equation (3.3), we can factorize the forward variables. Then, we need to compute variables \alpha^{h,c,p,\tau_{start}}_{i_{c'},ts_{c'},te_{c'}}, which denote the total probability such that channel c' at level h is in state i_{c'} from time ts_{c'} to te_{c'}, starting from time t = \tau_{start}; c and p are the parent channel and state and \tau_{start} is the start time as before. This can be done as:

\alpha^{h,c,p,\tau_{start}}_{i_{c'},ts_{c'},te_{c'}} = \sum_{j_{c'},ts'_{c'}} \Big\{ \alpha^{h,c,p,\tau_{start}}_{j_{c'},ts'_{c'},ts_{c'}-1} \times P(i_{c'} \mid j_{c'}) \Big\} \times \prod_{\substack{c''=1 \\ c'' \ne c'}}^{C_{p_c}} \sum_{i_{c''},ts'_{c''},te'_{c''}} \Big\{ \alpha^{h,c,p,\tau_{start}}_{i_{c''},ts'_{c''},te'_{c''}} \times P(i_{c'} \mid i_{c''}) \Big\} \times P_d(te_{c'}-ts_{c'} \mid i_{c'}) \times \prod_{c''=1}^{C_{i_{c'}}} \sum_{i'_{c''},ts'_{c''}} \alpha^{h-1,c',i_{c'},ts_{c'}}_{i'_{c''},ts'_{c''},te_{c'}}    (3.22)

where C_{p_c} is the number of channels at level (h-1) under state p in channel c at level h. Term 1 in equation (3.22) corresponds to influence from the same channel, term 2 corresponds to influence from the other channels, term 3 corresponds to the duration probability and term 4 corresponds to the probability from the lower levels.

Similarly, the backward variables B can be factorized to compute variables \beta^{h,c,p,\tau_{end}}_{i_{c'},ts_{c'},te_{c'}}, which denote the total reverse probability such that channel c' at level h is in state i_{c'} from time ts_{c'} to te_{c'}, ending at time t = \tau_{end}:

\beta^{h,c,p,\tau_{end}}_{i_{c'},ts_{c'},te_{c'}} = \sum_{j_{c'},te'_{c'}} \Big\{ \beta^{h,c,p,\tau_{end}}_{j_{c'},te_{c'}+1,te'_{c'}} \times P(j_{c'} \mid i_{c'}) \Big\} \times \prod_{\substack{c''=1 \\ c'' \ne c'}}^{C_{p_c}} \sum_{i_{c''},ts'_{c''},te'_{c''}} \Big\{ \beta^{h,c,p,\tau_{end}}_{i_{c''},ts'_{c''},te'_{c''}} \times P(i_{c''} \mid i_{c'}) \Big\} \times P_d(te_{c'}-ts_{c'} \mid i_{c'}) \times \prod_{c''=1}^{C_{i_{c'}}} \sum_{i'_{c''},ts'_{c''}} \alpha^{h-1,c',i_{c'},ts_{c'}}_{i'_{c''},ts'_{c''},te_{c'}}    (3.23)
Similarly, the backward variables B can be factorized to compute variables \beta^{h,c,p,\tau_{end}}_{i_{c'},ts_{c'},te_{c'}}, which denote the total reverse probability that channel c' in level h is in state i_{c'} from time ts_{c'} to te_{c'}, ending at time t = \tau_{end}:

\beta^{h,c,p,\tau_{end}}_{i_{c'},ts_{c'},te_{c'}}
  = \sum_{j_{c'},\,te'_{c'}} \Big\{ \beta^{h,c,p,\tau_{end}}_{j_{c'},te_{c'}+1,te'_{c'}} \times P(j_{c'}|i_{c'}) \Big\}
  \times \prod_{c''=1,\,c'' \ne c'}^{C_{p_c}} \sum_{i_{c''},ts'_{c''},te'_{c''}} \Big\{ \beta^{h,c,p,\tau_{end}}_{i_{c''},ts'_{c''},te'_{c''}} \times P(i_{c''}|i_{c'}) \Big\}
  \times P_d(te_{c'}-ts_{c'}|i_{c'}) \times \prod_{c''=1}^{C_{i_{c'}}} \sum_{i'_{c''},\,ts'_{c''}} \alpha^{h-1,c',i_{c'},ts_{c'}}_{i'_{c''},ts'_{c''},te_{c'}}    (3.23)

Also, if we use the independent channel factorization as in equation (3.1) instead of the causally coupled factorization, we can compute the forward and backward variables as:

\alpha^{h,c,p,\tau_{start}}_{i_{c'},ts_{c'},te_{c'}} = \sum_{j_{c'},\,ts'_{c'}} \Big\{ \alpha^{h,c,p,\tau_{start}}_{j_{c'},ts'_{c'},ts_{c'}-1} \times P(i_{c'}|j_{c'}) \Big\} \times P_d(te_{c'}-ts_{c'}|i_{c'}) \times \prod_{c''=1}^{C_{i_{c'}}} \sum_{i'_{c''},\,ts'_{c''}} \alpha^{h-1,c',i_{c'},ts_{c'}}_{i'_{c''},ts'_{c''},te_{c'}}    (3.24)

\beta^{h,c,p,\tau_{end}}_{i_{c'},ts_{c'},te_{c'}} = \sum_{j_{c'},\,te'_{c'}} \Big\{ \beta^{h,c,p,\tau_{end}}_{j_{c'},te_{c'}+1,te'_{c'}} \times P(j_{c'}|i_{c'}) \Big\} \times P_d(te_{c'}-ts_{c'}|i_{c'}) \times \prod_{c''=1}^{C_{i_{c'}}} \sum_{i'_{c''},\,ts'_{c''}} \alpha^{h-1,c',i_{c'},ts_{c'}}_{i'_{c''},ts'_{c''},te_{c'}}    (3.25)

Now, once we compute the forward and backward variables at all levels, we can re-estimate the variables at level h as follows, with a suitable normalization factor:

1) Initial Probability: Let \pi^{h,c,p}_{i_{c'}} denote the probability of starting in state i_{c'} in channel c' at level h, and let the parent state and channel be p and c respectively. Then,

\pi^{h,c,p}_{i_{c'}} = \pi^{h,c,p}_{i_{c'}} \sum_{\tau_{end},\,te_{c'}} \beta^{h,c,p,\tau_{end}}_{i_{c'},1,te_{c'}}    (3.26)

2) Transition Probability: Let P^{h,c,p}(s_{t+1}=i_{c'}|s_t=j_{c''}) denote the probability of transitioning at level h, to state i_{c'} in channel c' from state j_{c''} in channel c'', with the parent state and channel being p and c respectively.
Then,

P^{h,c,p}(s_{t+1}=i_{c'}|s_t=j_{c''}) = \sum_{\tau_{start},\,ts_{c'},\,te_{c'}} \Big\{ \alpha^{h,c,p,\tau_{start}}_{i_{c'},ts_{c'},te_{c'}} \times P^{h,c,p}(s_{t+1}=i_{c'}|s_t=j_{c''}) \times \sum_{\tau_{end},\,ts_{c''},\,te_{c''}} \beta^{h,c,p,\tau_{end}}_{j_{c''},ts_{c''},te_{c''}} \Big\}    (3.27)

3) Duration Probability: Let P^{h,c,p}_{c'}(d|s_t=i_{c'}) denote the probability of spending time d in state i_{c'} in channel c' at level h, whose parent is state p in channel c at level h+1. Then,

P^{h,c,p}_{c'}(d|s_t=i_{c'}) = \sum_{\tau_{start},\,ts_{c'},\,te_{c'}} \Big\{ \alpha^{h,c,p,\tau_{start}}_{i_{c'},ts_{c'},te_{c'}} \times P^{h,c,p}_{c'}(d|s_t=i_{c'}) \times \sum_{\tau_{end},\,j_{c'},\,ts'_{c'},\,te'_{c'}} \beta^{h,c,p,\tau_{end}}_{j_{c'},ts'_{c'},te'_{c'}} \Big\}    (3.28)

4) Output Probability: The observation probability is computed only at the lowest layer, with h = 1. Let P^{1,c,p}_{c'}(o_t=O_k|s_t=i_{c'}) denote the probability of observing O_k in state i_{c'}, whose parent is state p in channel c. Then we re-estimate as,

P^{1,c,p}_{c'}(o_t=O_k|s_t=i_{c'}) = \Big\{ \sum_{i'_{c'},\,ts'_{c'}} \alpha^{1,c,p,\tau_{start}}_{i'_{c'},ts'_{c'},ts_{c'}-1} \Big\} \sum_{te_{c'},\; ts_{c'} \le t \le ts_{c'}+d} \Big\{ P^{1,c,p}_{c'}(d|i_{c'}) \prod_{t'=ts_{c'},\; o_{t'}=O_k}^{ts_{c'}+d} P^{1,c,p}_{c'}(o_{t'}|s_{t'}=i_{c'}) \sum_{\tau_{end},\,j_{c'},\,te'_{c'}} \beta^{1,c,p,\tau_{end}}_{j_{c'},ts_{c'}+d+1,te'_{c'}} \Big\}    (3.29)

We have skipped describing the precise boundary conditions of the variables in the sums above for brevity. It can be seen that each of the equations (3.26)-(3.29) involves re-estimating parameters after summing out the free variables.

Algorithm 1 Embedded Viterbi Learning
1: numTrain = number of training samples
2: P^{h,c,p}(i_{c'}) = probability of starting in state i_{c'} in channel c', at level h with parent channel c and state p.
3: P^{h,c,p}(j_{c'}|i_{c''}) = probability of transitioning from state i_{c''} in channel c'' to state j_{c'} in channel c', at level h with parent channel c and state p.
4: P^{h,c,p}(d|i_{c'}) = probability of spending a duration d in state i_{c'} in channel c', at level h with parent channel c and state p.
5: P^{1,c,p}(O_k|i_{c'}) = probability of observing symbol O_k in state i_{c'} in channel c' at the lowest level, with parent channel c and parent state p.
6: for i = 1 to numTrain do
7:   jointHMM ← string together the HMMs of the words/events forming training sequence i.
8:   repeat
9:     maxPath ← state sequence of the maximum probability path through jointHMM, obtained using the decoding algorithm.
10:    \pi^{h,c,p}_{i_{c'}} ← no. of times channel c' starts in state i_{c'} at level h with parent channel c and state p, in maxPath.
11:    n^{h,c,p}_{i_{c'}} ← no. of times channel c' is in state i_{c'} at level h with parent channel c and state p, in maxPath.
12:    n^{h,c,p}_{i_{c'},j_{c''}} ← no. of times channel c' transitions from state i_{c'} to state j_{c''} at level h with parent channel c and state p, in maxPath.
13:    n^{h,c,p}_{i_{c'},d} ← no. of times state i_{c'} in channel c' spends duration d at level h with parent channel c and state p, in maxPath.
14:    n^{1,c,p}_{i_{c'},O_k} ← no. of times O_k is observed in state i_{c'} in channel c' at the lowest level, with parent channel c and state p, in maxPath.
15:    Re-estimate the parameters using the following equations:
16:    P^{h,c,p}(i_{c'}) ← \pi^{h,c,p}_{i_{c'}} / \sum_{j_{c'}} \pi^{h,c,p}_{j_{c'}}
17:    P^{h,c,p}(j_{c''}|i_{c'}) ← n^{h,c,p}_{i_{c'},j_{c''}} / n^{h,c,p}_{i_{c'}}
18:    P^{h,c,p}(d|i_{c'}) ← n^{h,c,p}_{i_{c'},d} / n^{h,c,p}_{i_{c'}}
19:    P^{1,c,p}(O_k|i_{c'}) ← n^{1,c,p}_{i_{c'},O_k} / n^{1,c,p}_{i_{c'}}
20:  until convergence
21:  Split the joint HMM and update the corresponding word/event HMMs.
22: end for

3.4.1 Embedded Viterbi Learning

One issue with the learning algorithm presented is that computing the forward and backward variables, as well as the summations in equations (3.26)-(3.29), is very slow. Further, in many applications like sign language recognition and event recognition, the training samples typically contain a sequence of words/events which are not segmented. In such cases, we string together the individual word/event HMMs and then re-estimate the parameters of the combined HMM.
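The core of this loop, Viterbi alignment followed by histogram re-estimation, can be sketched at a single level as follows. This is a deliberately simplified Python sketch: it assumes one left-right chain with discrete symbols and uniform transition scores, and the function names and the tiny smoothing constant are our own; Algorithm 1 additionally re-estimates transition and duration counts at every level of the hierarchy.

```python
import math

def viterbi_left_right(n_states, emit, obs):
    """Viterbi decoding through a left-right chain in which each state can
    only stay or advance by one state (transition scores treated as uniform,
    a simplification). emit[s][o] = P(o|s); returns the best state path."""
    T = len(obs)
    NEG = float("-inf")
    score = [[NEG] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    score[0][0] = math.log(emit[0][obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            for prev in ([0] if s == 0 else [s - 1, s]):
                if score[t - 1][prev] > score[t][s]:
                    score[t][s], back[t][s] = score[t - 1][prev], prev
            score[t][s] += math.log(emit[s][obs[t]])
    path = [n_states - 1]  # force the path to end in the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def reestimate_emissions(n_states, n_symbols, paths, sequences):
    """Histogram re-estimation in the spirit of Algorithm 1 (lines 14, 19):
    count symbol occupancies along the Viterbi paths and normalize."""
    counts = [[1e-6] * n_symbols for _ in range(n_states)]  # tiny smoothing
    for path, obs in zip(paths, sequences):
        for s, o in zip(path, obs):
            counts[s][o] += 1.0
    return [[c / sum(row) for c in row] for row in counts]
```

Iterating decode-then-count until the path stops changing mirrors the repeat/until-convergence structure of the algorithm.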
Thus the individual events are segmented automatically during the re-estimation process. Further, in constrained HMM structures like the left-right HMM (where each state can transition only to itself or to one state higher), the re-estimation procedure can be simplified by calculating the Viterbi path through the HMM at each iteration, instead of the full forward-backward algorithm, which requires computing all possible paths. The parameters can then be re-estimated by a histogram analysis of the state occupancies. Since in our applications we are primarily interested in left-right HMMs, we adopted a Viterbi-like embedded training algorithm for HSPaHMM and HPaHSMM. Algorithm 1 presents the pseudocode for the Embedded Viterbi Learning algorithm described here.

3.5 Experiments

To evaluate learning and decoding in our models we conducted three experiments: first, since the CHSMM has a large number of parameters, we test the effect of parameter initialization in the learning algorithm using simulated data. Further, we empirically validate the CHMM approximation in equation (3.3) by generating the sample data from the joint probability P(q^i_{t+1}|[q^1_t,..,q^C_t]) and learning the parameters P(q^i_{t+1}|q^j_t). Second, since the main difference between HPaHSMM and HSPaHMM is in the level at which durations are modeled, we compare their performance with data generated from a discrete event simulator. Finally, we compare the performance of all the models on real data for sign language (ASL) recognition. In all cases, we compare our results with PaHMM and CHMM without any duration models and beam search on the states. All runtime results are on a 2GHz Pentium 4 Windows platform with 2GB RAM, running Java programs.

3.5.1 Effect of Parameter Initialization in CHSMM

We follow the methodology described in [17] and [6] to generate synthetic data. We constructed a discrete event simulator based on the CHSMM model to generate synthetic observation sequences.
Each event can have multiple (C) channels/agents; each channel can be in one of N states at any time. Cartesian product transition matrices are built from random intra- and inter-channel transition probabilities, then perturbed with uniform noise and re-normalized so that, while there is some interaction between the processes, the overall system is not strictly conformant to our coupling model (i.e. the coupling model in the test data is more general). This also provides empirical validation of the coupling model used (3.3). The states have 3-dimensional Gaussian observation models with the means uniformly distributed in [0,1] and the covariances set to I/4. Further, each state has uniform/Gaussian duration models with randomly distributed parameters. Random walks through each event model were used to generate forty sequences of five different event models containing a hundred observations each. Half of these were used for training and the other half for testing.

The training data was used to train the following multi-channel HMMs: 1) CHSMM_acc with duration models set manually to correct values; 2) CHSMM_per where the duration models were initialized after perturbing the parameters from the simulator with \approx 30% error; 3) CHSMM_ran where the duration models were learned from random initialization; 4) PaHMM; 5) CHMM. All the parameters were learned from randomly initialized values using the learning algorithm described in section 6, except for the duration models in models 1 and 2. Each model was trained until the slope of the log-likelihood curve fell below 0.01 or when 100 training iterations were completed. Since [6] has already demonstrated that CHMM outperforms FHMM, LHMM and Cartesian-product HMM in a similar setup, we did not compare our model with them. In order to estimate the recognition accuracy of an HMM model, we took the test sequence from each event and ran it on the learned models.
If the learned model corresponding to the generating event produced the maximum log-likelihood, then the sequence was counted as classified correctly. The ratio of correctly classified sequences to the total number of test sequences gives the accuracy. In the first set of tests, we initialized the event models with C=2 and N=5, such that the states had uniform duration models and the maximum possible duration (Th) for a state was 10, and conducted 50 trials using the setup described. We then repeated the experiment with Gaussian duration models. Table 3.1 shows the average accuracy and learning time for the various HMMs.

                 Uniform               Normal
             Accuracy    Time      Accuracy    Time
CHSMM_acc    97.42%      673s      90.8%       661s
CHSMM_per    96.49%      213s      89.2%       472s
CHSMM_ran    91.04%      500s      84.8%       286s
PaHMM        70.90%      53s       69.3%       51s
CHMM         70.77%      173s      61.5%       340s

Table 3.1: Accuracy values with Uniform and Normal Duration models

As can be seen from these results, all variations of CHSMM outperform CHMM and PaHMM. Further, even if there is a large error (\approx 30%) in initializing the duration parameters, there is not a huge change in performance, but there is a significant drop with randomly initialized duration models. In order to understand these results, we calculated the mean and standard deviation of the log-likelihood (LL) on training data, the mean log-likelihood on test data generated from the event that generated the training data (Class Test), and the mean log-likelihood on other test data (Cross Test), as shown in Table 3.2.

             Mean       \sigma       Class      Cross
             TrainLL    TrainLL      Test       Test
CHSMM_acc    -276.3     41.8         -292.3     -406.1
CHSMM_per    -302.1     47.4         -318.9     -424.9
CHSMM_ran    -376.2     42.8         -394.6     -504.7
PaHMM        -306.0     57.6         -337.0     -427.6
CHMM         -276.9     118.7        -262.7     -376.3

Table 3.2: Mean and Variance of log-likelihood

Although CHMM has the highest train and test likelihoods on average, it also has the highest variance. Hence there is a higher confusion between class and non-class test sequences, resulting in lower classification accuracy.
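The accuracy computation just described amounts to a nearest-model classifier under log-likelihood. A minimal sketch, in which the model representation, function names and the toy Bernoulli scorer in the test are our own assumptions:

```python
def classify(sequence, models, loglik):
    """Return the model with the maximum log-likelihood for the sequence,
    the classification rule described above. loglik(model, sequence) scores
    a sequence under a model (e.g. via the forward algorithm)."""
    return max(models, key=lambda m: loglik(m, sequence))

def accuracy(test_set, models, loglik):
    """Fraction of test sequences whose generating model is recovered.
    test_set is a list of (sequence, true_model) pairs."""
    correct = sum(1 for seq, label in test_set
                  if classify(seq, models, loglik) == label)
    return correct / len(test_set)
```

Note that, as the Table 3.2 analysis shows, a high mean likelihood alone does not guarantee a high value of this accuracy; the class/cross separation relative to the variance is what matters.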
PaHMM has a lower variance, but the class-nonClass separation is also small, resulting in lower accuracy. The CHSMM variations, on the other hand, have low variance and high class-nonClass separation, resulting in high accuracy rates. This also indicates that the mean likelihood probability in itself is not a suitable metric to measure a model/algorithm's classification capability.

We repeated these experiments by varying T, Th, N. The accuracy values were in these ranges: CHSMM_acc (90-100%), CHSMM_per (85-95%), CHSMM_ran (80-90%), PaHMM (65-80%), CHMM (60-75%). Although variation in T, N and the duration range [M-Th, M+Th] did not affect the accuracies, an increase in Th produced a significant increase in the computation time for CHSMM_acc. This is expected, as the complexity of the learning and inferencing algorithms varies with MTh^2. Also, CHMM has a surprisingly high learning time because it takes many more iterations to converge. Figure 3.8 shows these variations.

Figure 3.8: Learning Time vs Th

These results indicate that incorporating explicit duration models produces a 20-30% improvement in classification accuracy over CHMM and PaHMM while modeling multiple interacting processes. Further, the learning algorithm is not sensitive to the initialization of transition and observation probabilities, but some prior initialization of duration probabilities produces much better models. These improvements come at the cost of increased computation time, which can be partly offset by restricting the possible duration range [M-Th, M+Th] of the models.

3.5.2 Comparison of duration modeling at the lower and upper layers on Synthetic Data

In this experiment, we used a discrete event simulator to generate synthetic observation sequences. Each event can have C=2 channels/agents, and each channel can be in one of N=5 states at any time. State transitions were restricted to be left-to-right, so that each state i can transition only to state (i+1).
The states had 3-dimensional Gaussian observation models with the means uniformly distributed in [0,1] and the covariances set to I/4. Further, each state had Gaussian duration models with means in the range [10,15] and variances set to DPARAM=10. We then built a top-level event transition graph with uniform inter-event transition probabilities. Continuous observation sequences were then generated by random walks through the event transition graph and the corresponding low-level event models. Random noise was added in between the event sequences as well as at the beginning and end of the observation sequence. Figure 3.9 illustrates this procedure for a 2-event top-level transition graph. As can be seen, this setup corresponds to the HPaHSMM model structure. Thus, in effect, we are testing how well the HSPaHMM model approximates the HPaHSMM model.

Observation sequences were generated using the setup described above using a 5-event top-level transition graph, such that each sequence had 2-5 individual events and each event occurred at least 30 times in the entire set of sequences, producing in total 50-60 sequences. We then randomly chose a set of training sequences such that each word occurred at least 10 times in the training set, and used the rest as test sequences. Thus the training set contained 10-20 sequences and the test set contained 40-50 sequences.

Figure 3.9: Structure of Event Simulator (a top-level event transition graph over low-level event models, with begin/end noise and inter-event transition observations)

The data generators were then discarded and the following models were trained: 1) HPaHSMM with randomly initialized output models, left-right low-level PaHSMMs, and low-level duration models (parameters m, \sigma; see section 5.2.2) set accurately using the corresponding simulator parameters. The beam size k was set manually.
2) HSPaHMM with randomly initialized output models, left-right low-level PaHMMs, and top-level duration models whose means (M) were set by summing over the means of the corresponding low-level PaHSMMs in the simulator, with the parameters K and \Sigma (see section 5.2.3) set manually. 3) PaHMM with the output models initialized randomly and decoding performed by beam search as described in section 5.2. Each model was trained by re-estimating the output models using Embedded Viterbi learning until the slope of the log-likelihood curve fell below 0.01 or when 100 training iterations were completed. We then ran the learned models on the test sequences and obtained accuracy measures using the metric (N-D-S-I)/N, where N = number of events in the test set, D = no. of deletion errors, S = no. of substitution errors, and I = no. of insertion errors.

Since the accuracy as well as the complexity of the decoding algorithms depend on manually set parameters (K, \Sigma for HSPaHMM and k for HPaHSMM), we first investigated their effects. To do this, we varied the parameters and ran 50 iterations of the train-test setup described above for HSPaHMM and HPaHSMM for each parameter value. Figures 3.10 and 3.11 show these variations.

Figure 3.10: Variation of HSPaHMM Speed (frames/sec or fps) and Accuracy (%) with a) Sigma, Beam Size=11, b) Beam Size, Sigma=1
Figure 3.11: Variation of HPaHSMM Speed (fps) and Accuracy (%) with Beam Size

As can be seen, while increasing \Sigma in HSPaHMM produces a significant drop in frame rate, it does not affect the accuracy. On the other hand, increasing the beam size (K) produces a significant increase in accuracy at the cost of slower speed. For HPaHSMM, increasing the beam size (k) does not improve accuracy significantly. Based on these observations we ran a set of 50 tests comparing HSPaHMM (with \Sigma=1, K=11), HPaHSMM (with k=10) and PaHMM. Table 3.3 summarizes the average accuracies and speeds.
It shows that HSPaHMM provides a big improvement in performance when compared to the PaHMM without affecting the speed. While HSPaHMM's accuracy is still lower than HPaHSMM's, it is 3 times faster and thus serves as a good mean between HPaHSMM and PaHMM.

Model      Accuracy                             Speed
HPaHSMM    83.1% (N=124, D=17, S=1, I=3)        12.0
HSPaHMM    63.7% (N=124, D=3, S=36, I=6)        40.2
PaHMM      4.8%  (N=124, D=38, S=53, I=27)      39.1

Table 3.3: Model Accuracy (%) and Speed (fps)

3.5.3 Application to Continuous Sign Language Recognition

We next tested our models in an application for automatically segmenting and recognizing American Sign Language (ASL) gestures from a continuous stream of data. Sign language recognition, besides being useful in itself, provides a good domain to test hierarchical multi-agent activities; both hands go through a complex sequence of states simultaneously, each sign has distinct durations, and there is a natural hierarchical structure at the phoneme, word and sentence level.

For our experiments, we used a set of 50 test sentences from a larger dataset used in [66]; the sequences were collected using a MotionStar(TM) system at 60 frames per second, did not have any word segmentations, and also had some noise at the beginning and end. We used a 10-word vocabulary, and each sentence is 2-5 words long, for a total of 126 signs. The input contains the (x,y,z) location of the hands at each time instant; from these we calculate the instantaneous velocities, which are used as the observation vector for each time instant.

We model each word as a 2-channel CHSMM, or PaHMM (for HSPaHMM), or PaHSMM (for HPaHSMM), based on the Movement-Hold (MH) model [32], which breaks down each sign into a sequence of moves and holds. During a move some aspect of the hand is changed, while during a hold all aspects are held constant. The MH model also identifies several aspects of hand configuration like location (chest, chin, etc.), distance from body, hand shape, and kind of movement (straight, curved, round).
With these definitions, we can encode the signs for various words in terms of constituent "phonemes". For example, in the word "I", a right-handed signer would start with his hand at some distance from his chest with all but his index finger closed, and end at the chest. This can be encoded in the MH model as (H(p0CH) M(strToward) H(CH)), where p0CH indicates that the hand is within a few inches in front of the chest at the start, strToward indicates that the hand moves straight perpendicular to the body, and CH indicates that the hand ends at the chest. Similar transcriptions can be obtained for more complex 2-handed signs by considering both hands as well as hand shape. Table 3.4 shows the words in the lexicon and their corresponding phoneme transcriptions for the strong (right) hand.

Sign     Transcription
I        H-p0CH M-strToward H-CH
man      H-FH M-strDown M-strToward H-CH
woman    H-CN M-strDown M-strToward H-CH
father   H-p0FH M-strToward M-strAway M-strToward H-FH
mother   H-p0CN M-strToward M-strAway M-strToward H-CN
inform   H-iFH M-strDownRightAway H-d2AB
sit      S-m1TR M-strShortDown H-m1TR
chair    H-m1TR M-strShortDown M-strShortUp M-strShortDown H-m1TR
try      H-p1TR M-strDownRightAway H-d2AB
stupid   H-p0FH M-strToward H-FH

Table 3.4: Phoneme Transcriptions

We model the observation probabilities in the hold states as a normal distribution with \mu=0, while the move states are modeled as a signum function. Further, we set the inflection point of the signum to be the same as the Gaussian's variance. The intuition behind this choice is that during the hold states the configuration of the hand remains constant with some random noise, while we have a move whenever the hand's position changes above the noise threshold during an instant. We specified the duration models for the move states based on the distance between the starting configuration and the ending configuration, and the frame rate.
To do this we separated the possible hand locations into 3 clusters: those around the abdomen (AB), those around the chest (CH), and those around the face/forehead (FH). We approximately initialized the hold state and intra-cluster transition times by looking at a few samples, and set the inter-cluster transition time to be twice the intra-cluster transition time. We modeled the duration as a normal distribution centered around these means with variance = 2.5, so that the Th in the decoding algorithm is reasonably small. For the upper level HSMM in HSPaHMM, we set the means (M) by adding the means of the individual states, and set \Sigma=2.5. Thus, we approximately set a total of 4 parameters for the entire setup.

Table 3.5 shows the word accuracy rates. We did not include CHMM's accuracy rates, as the approximate decoding algorithm presented in [6] assumes that the probability mass at each channel is concentrated in a single state while calculating channel interactions. This assumption is not valid in our domain, as many states can have nearly equal probabilities in the initial time steps, and choosing only one of them prunes out the other words, resulting in extremely poor performance.

Model      Accuracy                             Speed
CHSMM      83.3%  (N=126, D=5, S=14, I=2)       0.4
HPaHSMM    78.6%  (N=126, D=6, S=17, I=4)       2.1
HSPaHMM    67.46% (N=126, D=11, S=22, I=8)      6.4
PaHMM      18.25% (N=126, D=7, S=23, I=73)      7.3

Table 3.5: Word Accuracy Rates (%) and Speed (fps)

These results indicate that including duration models significantly improves the results. HSPaHMM provides a good high-speed alternative to the more complex CHSMM and HPaHSMM. Further, HSPaHMM produces better results than PaHMM, because the top-level HSMM restricts the number of word transitions and hence reduces the number of insertion errors. Our results were obtained without requiring any additional training data, since the added model structure allows us to cleanly embed the domain constraints specified by the Movement-Hold model.
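The duration initialization just described, intra-cluster transition times estimated from a few samples, inter-cluster times set to twice that, and top-level word duration means obtained by summing the state means, can be sketched as follows. The function names and the intra_mean argument are our own; only the two rules come from the text.

```python
def move_duration_mean(start_cluster, end_cluster, intra_mean):
    """Mean duration (in frames) of a move state: intra-cluster moves take
    intra_mean frames; inter-cluster moves take twice that, following the
    initialization rule described above. Clusters are e.g. "AB", "CH", "FH"."""
    return intra_mean if start_cluster == end_cluster else 2 * intra_mean

def word_duration_mean(state_means):
    """Top-level HSMM duration mean (M) for a word in HSPaHMM: the sum of
    the duration means of its constituent move/hold states."""
    return sum(state_means)
```

Each state's duration would then be modeled as a normal distribution around these means with the small fixed variance (2.5) from the text.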
Further, we used only hand location as features, instead of detailed finger models. For comparison, existing algorithms for continuous sign language recognition [66][63] require training sets with 1500-2000 signs for a test set of \approx 400 signs. [54] reports good results on isolated word recognition with a 49-word lexicon that uses just one training sequence per word, but the words in both the test and train sequences are pre-segmented; this is a much easier task than the continuous sign language recognition demonstrated here.

Chapter 4

Simultaneous Tracking and Action Recognition (STAR) using Formal Action Models

The models discussed in the previous chapter recognize actions starting with fairly accurate hand tracks. Such tracks are difficult to get in most applications. To address this, we present a top-down approach to simultaneously track and recognize articulated full-body human motion using action models represented by formal rules. We map these rules to a Hierarchical Variable Transition Hidden Markov Model (HVT-HMM) that is a three-layered extension of the Variable Transition Hidden Markov Model (VTHMM). The top-most layer of the HVT-HMM represents the composite actions and contains a single Markov chain, the middle layer represents the primitive actions, which are modeled using a VTHMM whose state transition probability varies with time, and the bottom-most layer represents the body pose transitions using an HMM. We represent the pose using a 23D body model and present efficient learning and decoding algorithms for HVT-HMM. We demonstrate our methods first in a domain for recognizing two-handed gestures and then in a domain with actions involving articulated motion of the entire body. Our approach shows 90-100% action recognition in both domains and runs in real time (\approx 30 fps) with very low average latency (\approx 2 frames).
4.1 Introduction

The traditional approach to activity recognition is bottom-up, where an actor is segmented from the background (usually by "background subtraction" methods), then the body and the 3-D pose are tracked, and the trajectories are used for action/gesture recognition. Such an approach scales well with the number of possible activities and can also potentially handle previously unseen events. However, accurate tracking at the lower level is difficult since we use 23 degrees of freedom in our pose estimate, so the search space is huge. Further, variations in the style of actions, in the background and clothing of the actor, and in viewpoint and illumination can introduce errors.

Alternatively, one can take a top-down approach where higher level event models drive tracking; this could be considered a tracking-as-recognition approach similar to that of [73]. While the complexity of this approach increases linearly with the number of actions, it is also more robust to low-level errors and is especially attractive when we are interested in recognizing a finite set of predefined actions like the ones we consider here. Hence we take a top-down approach to action recognition in our work.

Our basic model is a variation of the well-known Hidden Markov Model (HMM). However, rather than a single-level model, we use a hierarchical model with three layers. The top level is for recognizing compositions of simpler events that we call primitive events; primitive events form the middle level, and the bottom level is responsible for tracking the poses. Initial segmentation of the contour is provided by a background subtraction module.

4.2 HVT-HMM

We will now present the theory of HVT-HMM in 3 steps: 1) we present a formal definition and define the parameters of HVT-HMM; 2) we then present our approach to parametrize actions with the HVT-HMM and learn the parameters; and finally 3) we develop an efficient decoding algorithm and extend it to online decoding.

Notations: C = no. of composite events, P = max no. of primitive events per composite event, D = max duration that can be spent in a primitive, N = max no. of poses possible under a primitive action, T = no. of frames in the video.

4.2.1 Model Definition and Parameters

We begin by defining a standard HMM model \lambda by the tuple (Q, O, A, B, \pi), where Q is the set of possible states, O is the set of observation symbols, A is the state transition probability matrix (a_{ij} = P(q_{t+1}=j|q_t=i)), B is the observation probability distribution (b_j(k) = P(o_t=k|q_t=j)), and \pi is the initial state distribution. It is straightforward to generalize this model to continuous (e.g. Gaussian) output models.

The hierarchical hidden Markov model (HHMM) extends this by including a hierarchy of hidden states. This can be formally specified by the tuples \lambda' = (Q^h, O^h, A^h, B^h, \pi^h), where h \in 1..H indicates the hierarchy index.

In traditional HMMs, the constant state transition probability (a_{ij}) and the first-order Markov assumption imply that the duration probability of a state decays exponentially with time. Variable transition hidden Markov models (called Inhomogeneous HMMs in [55]) instead make the transition probability A dependent on the duration:

a_{ij}(d) = P(q_{t+1}=j \mid q_t=i, d_t(i)=d),\quad 1 \le i,j \le N,\; 1 \le d \le D

Figure 4.1: Graphical structure of HVT-HMM

The HVT-HMM that we introduce has three layers, with the top-most layer containing a single HMM for composite events. Each state in the top-level HMM corresponds to a VT-HMM in the middle layer, whose states in turn correspond to multiple HMMs at the track level, one for each degree of freedom. Using the notation described before, this can be formally represented by the set of parameters \lambda^{HVT-HMM}_c (composite layer), \lambda^{HVT-HMM}_p (primitive layer) and \lambda^{HVT-HMM}_x (track layer) as follows:

\lambda^{HVT-HMM}_c = (Q_c, O_c, A_c, B_c, \pi_c)
\lambda^{HVT-HMM}_p = (Q_p, O_p, A^d_p, B_p, \pi_p)
\lambda^{HVT-HMM}_x = (Q^C_x, O^C_x, A^C_x, B^C_x, \pi^C_x)

Figure 4.1 illustrates the HVT-HMM.
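The effect of the duration-dependent transition a_{ij}(d) on decoding can be sketched by running Viterbi over (state, consecutive-duration) pairs. This is a minimal Python sketch; the function signature, the dictionary-based state bookkeeping, and the toy left-right transition rule used in the test are our own assumptions, not the thesis implementation.

```python
def vthmm_viterbi(n_states, D, a, b, obs):
    """Viterbi over (state, duration) pairs for a variable transition HMM:
    a(i, j, d) = P(q_{t+1}=j | q_t=i, d_t(i)=d); b(i, o) = P(o|i).
    Assumes the chain starts in state 0; returns the best path probability."""
    # best[(i, d)] = probability of the best path ending in state i, having
    # spent d consecutive frames there
    best = {(0, 1): b(0, obs[0])}
    for o in obs[1:]:
        nxt = {}
        for (i, d), p in best.items():
            for j in range(n_states):
                key = (j, d + 1 if j == i else 1)  # staying extends d
                if key[1] > D:
                    continue  # cap the tracked duration at D
                cand = p * a(i, j, d) * b(j, o)
                if cand > nxt.get(key, 0.0):
                    nxt[key] = cand
        best = nxt
    return max(best.values())
```

Compared with a plain HMM, the only change is that the transition score consults the duration counter d, which is exactly what lets the sigmoid transitions of the primitive layer discourage leaving a state too early or staying too long.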
Pose Representation: The pose x of a person is represented using a 19D body model for the joint angles, with three additional dimensions for the direction of translation (x,y,z) and one for scale (H), to give a total of 23 degrees of freedom (Figure 4.2). Note that we ignore the motion of the head, wrists and ankles, as our image resolutions are generally too small to capture them. Each body part is represented as a cylinder, and the pose is fit to the image by projecting it to 2D.

Figure 4.2: 23D body model

4.2.2 Parametrization and Learning

We model each composite event with a single left-right VTHMM at the primitive layer and multiple HMMs (one for each moving part) at the track layer. If we denote the pose at time t by x_t, the image by I_t, the foreground by f_t, and the projection of x_t on I_t by Proj(x_t, I_t), we define the observation probability P(I_t|x_t) to be the fraction of pixels in Proj(x_t, I_t) that fall on f_t:

P(I_t|x_t) = |Proj(x_t, I_t) \cap f_t| / |Proj(x_t, I_t)|    (4.1)

We define the state transition probability at each of the three layers as follows:

At the Composite Event Layer, transition probabilities are initialized uniformly, as the actors can perform actions in any order and it is generally difficult to obtain suitable training data to learn such high-level models.

At the Primitive Event Layer, we use a left-right VTHMM. Thus, at each instant a state can either transition to itself or to the next state in the chain, and the probability of this transition varies with the time spent in the current state. We model this using a sigmoid function as follows:

P(p_t|p_{t-1}) =
  n / (1 + e^{(d_{p_{t-1}} - \mu_{p_{t-1}} - \sigma)/\sigma})     if p_t = p_{t-1},
  n / (1 + e^{-(d_{p_{t-1}} - \mu_{p_{t-1}} + \sigma)/\sigma})    if p_t > p_{t-1},
  0                                                                otherwise    (4.2)

where,
- n is a suitable normalization constant
- p_t is the primitive event at time t
- \mu_{p_{t-1}} denotes the average time spent in executing primitive action p_{t-1} and is learned from training data.
- d_{p_{t-1}} denotes the time spent in primitive p_{t-1} in the current execution of the action.
- \sigma is a noise parameter to allow for some variation in the actual end position of the actions. We set this to 20% of \mu_{p_{t-1}} in our implementation.

The intuition behind the choice of the sigmoid function is that it allows state transitions only when the primitive event has been executed for a duration close to the mean. This is because as d_{p_{t-1}} approaches \mu_{p_{t-1}}, term 1 (for maintaining the current state) in equation 4.2 decreases and term 2 (for transition to the next state) increases, as shown in Figure 4.3.

Figure 4.3: Variation in values of Equation 4.2 - line with diamonds: term 1; line with squares: term 2

At the Track Layer we define the transition probability P(x_t|{x_{t-1}, p_t}) as follows:

P(x_t|{x_{t-1}, p_t}) =
  1/(2M_{p_t})    if 0 \le |x_t - x_{t-1}| \le 2M_{p_t},
  0               otherwise    (4.3)

where,
- M_{p_t} is the mean distance the pose changes in one frame (average speed) under primitive p_t, and is learned during training.
- x_t is obtained from x_{t-1} using a simple geometric transformation specified by p_t.
- |x_t - x_{t-1}| is the distance between x_t and x_{t-1}.

This definition basically restricts the speed of a pose transition to at most 2M_{p_t}. Based on these observation and transition probabilities, we can re-estimate the parameters \mu_i and M_i using a forward-backward algorithm similar to the Baum-Welch algorithm used in basic HMMs (see supplementary material). Further, since the actions in our experiments have typical primitive durations (\mu_i) and speeds of action (M_i) across all instances, estimates from a single sample for each action are sufficient to train the models.

4.2.3 Decoding Algorithm

Next we develop a decoding algorithm for continuous recognition of composite events from an unsegmented video stream. Let \delta_t(c, i, x_t, d) denote the probability of the maximum path such that the composite event c is in primitive event i and pose x_t at time t, and has also spent a duration d in primitive i.
By assuming a unique start primitive (indexed by 1) for each composite event, the \delta variables can be calculated using the following recursions:

\delta_{t+1}(c, j, x_{t+1}, d+1) = \max_{x_t} \delta_t(c, j, x_t, d) a^c_{jj}(d) p^c(x_{t+1} \mid x_t, j) p(I_{t+1} \mid x_{t+1}),  d > 0
\delta_{t+1}(c, j, x_{t+1}, 1) = \max_{i, \tau, x_t} \delta_t(c, i, x_t, \tau) a^c_{ij}(\tau) p^c(x_{t+1} \mid x_t, j) p(I_{t+1} \mid x_{t+1}),  j > 1
\delta_{t+1}(c, 1, x_{t+1}, 1) = \max_{c', i, \tau, x_t} \delta_t(c', i, x_t, \tau) a^{c'}_{i,exit}(\tau) p^c(x_{t+1} \mid x_t, 1) p(I_{t+1} \mid x_{t+1})
\delta_1(c, j, x_1, 1) = \pi^c_j \pi^c_{x_1,j} p(I_1 \mid x_1)    (4.4)

where,
- a^c_{ij}(\tau) = probability of transition from primitive i to j having spent time \tau in state i, under composite event c; a^{c'}_{i,exit}(\tau) is the corresponding probability of exiting composite event c' from primitive i.
- p^c(x_{t+1} | x_t, j) = probability of transition to pose x_{t+1} from x_t under primitive j.
- p(I_{t+1} | x_{t+1}) = the observation probability at the track layer.

By calculating the \delta's for each c, j, x_t, d we can get the MAP path from \max_{c,j,x_T,d} \delta_T(c, j, x_T, d). Further, by storing the previous state for each c, j, x_t, d at each instant t, we can retrace the Viterbi path through the HVT-HMM. This algorithm has a complexity of O(TCPDN^2(P+C)), which can be very slow as the number of possible poses (N) is large. Hence, as an approximation, we store only the top K configurations at each instant and also prune out configurations {c, j, x_t, d} for which

\max_{c', j', x'_t, d'} \delta_t(c', j', x'_t, d') - \delta_t(c, j, x_t, d) > p    (4.5)

where p is a sufficiently small threshold. In this case, each max on the RHS of equation 4.4 takes only O(K), and hence the whole algorithm takes O(KCPNDT).

One crucial issue with the basic Viterbi algorithm is that the entire observation sequence must be seen before the state at any instant can be recognized. Thus, even in cases where the algorithm runs in real time, the average latency (time taken to declare the state at a given instant) can be infinitely large. One simple solution is to use a fixed look-ahead window, but the performance of this algorithm is poor.
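The two pruning rules just described (keep the top K configurations, drop any configuration whose score trails the best by more than p) amount to a beam-pruned Viterbi step. A minimal log-domain sketch under simplifying assumptions: states are opaque keys standing in for the tuples {c, j, x_t, d}, and the transition and observation models are passed in as dictionaries; these interfaces are illustrative, not the thesis code.

```python
import math

def beam_viterbi_step(beam, transitions, obs_logp, K=10, prune_gap=0.5):
    """One decoding step with top-K and score-gap pruning.

    beam: dict state -> log-probability of the best path ending there.
    transitions: dict (s_prev, s_next) -> log transition probability.
    obs_logp: dict state -> log observation probability at this frame.
    Keeps only the top-K successors, then drops any state whose score
    trails the best by more than prune_gap (the threshold p).
    """
    new_beam = {}
    for s_prev, lp in beam.items():
        for (a, b), lt in transitions.items():
            if a != s_prev:
                continue
            cand = lp + lt + obs_logp[b]        # extend the best path into b
            if cand > new_beam.get(b, -math.inf):
                new_beam[b] = cand
    ranked = sorted(new_beam.items(), key=lambda kv: -kv[1])[:K]
    best = ranked[0][1]
    return {s: lp for s, lp in ranked if best - lp <= prune_gap}
```

Iterating this step over the frames, while recording back-pointers, yields the pruned Viterbi path described above.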
Instead, we build on the online variable-window algorithm proposed recently [39], which has been demonstrated to be better than other approaches. At each frame we calculate the following function:

f(t) = M(t) + \lambda (t - t_0)    (4.6)

where,
- t_0 is the last frame at which the composite, primitive and track states were output.
- M(t) can be any suitable function which is large if the probability at t is concentrated around a single state and small if it is spread out.
- The term (t - t_0) penalizes latency.
- \lambda is a weight term that also functions as a normalization factor.

With this metric, large values of \lambda penalize latency while smaller values penalize errors. In our implementation we choose M(t) to be 1 - \max_{c,j,x_t,d} \delta_t(c, j, x_t, d), as it is the simplest. Alternate choices include the entropy of the state distribution, but as our results show, the simple choice performs satisfactorily. At each instant, if M(t) < \lambda(t - t_0), we output the maximum probability state at t, and the states in the interval [t_0, t) that lead up to it, and then repeat the procedure from (t+1). [39] shows that this algorithm is 2-competitive (the cost is no more than 2 times the cost of any other choice of t) for this choice of M(t) in HMMs, and a similar reasoning applies here as well.

4.3 Experiments

We tested our method in two domains: 1) recognizing fourteen arm gestures used in military signalling, and 2) tracking and recognizing 9 different actions involving articulated motion of the entire body. All run-time results shown were obtained on a 3GHz Pentium IV running C++ programs.

Figure 4.4: Overlay Images for Arm Gestures

Figure 4.5: Variation in Accuracy (%) and Speed (fps) with - a) p b) K

4.3.1 Gesture Tracking and Recognition

In the first experiment, we demonstrate our method on the gesture datasets used in [15, 60], for examples in the domain of military signaling, which consists of the fourteen arm gestures shown in Figure 4.4.
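The commit rule derived from equation 4.6 can be sketched in a few lines. `should_commit` is an illustrative name, and the posterior over states at frame t is assumed to be available (e.g. from normalized \delta values).

```python
def should_commit(state_probs, t, t0, lam=0.05):
    """Online variable-window commit rule (after [39]).

    state_probs: posterior over states at frame t (sums to 1).
    M(t) = 1 - max probability: small when the mass concentrates on
    one state. Commit the pending frames [t0, t) once
    M(t) < lam * (t - t0), trading uncertainty against latency.
    """
    M = 1.0 - max(state_probs)
    return M < lam * (t - t0)
```

A large `lam` forces early commits (low latency, more errors); a small `lam` waits until the distribution is sharply peaked.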
The dataset consists of videos of the fourteen gestures performed by five different people, with each person repeating each gesture five times, for a total of 350 gestures. Each gesture is modeled by a 3-state VTHMM at the primitive layer, which in turn corresponds to 2 track-layer HMMs (one for each hand). We trained the model for each gesture using a randomly selected sequence and tested the variation in performance of the decoding algorithm with the pruning parameters K and p. As can be seen from Figure 4.5, while the accuracy increases slightly with larger K and p, the speed decreases significantly. Since our video is at 30fps, we choose K = 10 and p = 0.5 for real-time performance. Our system gives an overall accuracy of 90.6%.

Figure 4.6: Recognizing and tracking with variation in - a) Background and lighting b) Style c) Random external motion d) Self-occlusion

We also tested our method with large variations in background, lighting, individual style, noise and occlusion, and also with significant variation in the actual angle to the camera, and observed no drop in performance. Figure 4.6 illustrates some of these variations. Next we replaced the primitive event layer of HVT-HMM with a simple HMM (to give an HHMM) and with the semi-Markov HSMM (to give an HS-HMM) instead of the VTHMM, and tested them under the same training and pruning conditions. As can be seen from the results in Table 4.1, including duration models in HVT-HMM produces a significant improvement in performance over an HHMM, without the drop in speed seen in HS-HMMs. Further, the average latency of HVT-HMM is lower than that of HHMM because duration-dependent state transitions restrict the set of possible states, and hence M(t) in equation 4.6 tends to be large.

           Accuracy (%)   Speed (fps)   Avg. Latency (frames)   Max Latency (frames)
HVT-HMM    90.6           35.1          1.84                    9
HHMM       78.2           37.4          4.76                    14
HS-HMM     91.1           0.63          -                       -

Table 4.1: Comparison of HVT-HMM, HHMM and HS-HMM on the Gesture Dataset
Since no known online decoding algorithm exists for an HSMM, the entire sequence must be seen before the state sequence is generated for the HS-HMM. On the same dataset, [15] reports 85.3% accuracy on the first 6 gestures by testing and training on all 150 gestures (we have 97%). [60] reports 83.1% accuracy on the entire set. This approach uses edge contours instead of foreground silhouettes to allow for moving cameras, even though the data itself does not contain any such example; this may be partially responsible for its lower performance, and hence a direct comparison with our results may not be meaningful. Another key feature of our method is that it requires just one training sample per gesture, while [15, 60] require a train:test ratio of 4:1.

4.3.2 Tracking and Recognizing Articulated Body Motion

In the second experiment, we demonstrate our method on a set of videos used in [3] of actions that involve articulated motion of the whole body. The dataset contains videos (180x144, 25fps) of the 9 actions in Figure 4.7 performed by 9 different people. Each action is represented by a node in the top-level composite event graph. Each composite event node corresponds to a 2- or 4-node primitive event VTHMM, and each primitive node in turn corresponds to multiple track-level HMMs (one for each body part that moves). Thus the "walk" event has 4 primitive events (one for each half walk cycle) and 6 track HMMs (for the right and left shoulder, upper leg and lower leg), while the "bend" event has 2 primitives (for forward and backward motions) and 3 track HMMs (for the torso, knee and right shoulder) for each primitive. We learn the primitive- and track-level transition probabilities and also the average velocities for actions involving translation (walk, run, side, jump).

           Accuracy (%)   Speed (fps)   Avg. Latency (frames)   Max Latency (frames)
HVT-HMM    100.0          28.6          3.2                     13
HHMM       91.7           32.1          7.6                     25
HS-HMM     100.0          0.54          -                       -

Table 4.2: Comparison of HVT-HMM, HHMM and HS-HMM on the Action Dataset
Table 4.2 compares the performance of our approach with the others, and Figure 4.8 shows some sample results. As can be seen, while the accuracy and speed are high, the latency is also higher compared to the gesture set. This is because the actions involve more moving parts, and hence it takes longer to infer the state at any instant. Also note that even in cases where the initial joint estimates and pose tracking had errors, the recognition layer recovers. Next, we tested the robustness of our method to background, occlusions, style variations (carrying a briefcase/bag, moonwalk etc.) and also viewpoint (pan angle in the range 0°-45°). Our method is fairly robust to all these factors (Figure 4.9), and Table 4.3 summarizes these results.

Figure 4.7: Actions tested in Experiment 2 - a) Bend b) Jack c) Jump d) Jump-In-Place e) Run f) Gallop Sideways g) Walk h) Wave1 i) Wave2

Figure 4.8: Sample tracking and recognition results - a) jack b) jump c) run d) walk e) gallop sideways f) bend g) jump in place h) wave1 i) wave2

Test Sequence        1st best   2nd best
Carrying briefcase   walk       run
Swinging bag         walk       side
Walk with dog        walk       jump
Knees up             walk       run
Limp                 jump       walk
Moon walk            walk       run
Occluded feet        walk       side
Full occlusion       walk       run
Walk with skirt      walk       side
Normal walk          walk       jump

Table 4.3: Robustness under occlusion, style variations and other factors

Figure 4.9: Tracking and recognition with - a) Swinging bag b) dog c) knees up d) moonwalk e) Occluded feet f) Occlusion by pole g) Viewpoint 20° pan h) Viewpoint 30° pan i) Viewpoint 40° pan j) Viewpoint 45° pan

[3] reports a 99.36% classification rate at about 2fps on a 3GHz P4 on the same set. Their approach focuses on extracting complex features (3D space-time shapes) from the silhouettes and classifying them using a simple nearest-neighbor approach. Our work, on the other hand, uses very simple features (foreground silhouettes) and moves the complexity to the higher layers.

Next, to test the generality of our learned models, we ran our method on a few additional walking and jogging sequences from a different dataset (from [59]) without re-training. This dataset contains several challenging sequences with jittery camera motion as well as variations in tilt and pan angles. While the initial joint estimates and pose tracking had some errors in these sequences, the recognition rate was mostly unaffected under these variations. Figure 4.10 illustrates some of these results.

Figure 4.10: Tracking and recognition with learned models on the dataset from [59], with variations in viewpoint as well as jittery camera motion

Chapter 5
View and Scale Invariant Action Recognition Using Multiview Shape-Flow Models

The models discussed in the previous chapter were built by adding hand-coded rules to HMMs. This is effective for well-structured actions like hand gestures and actions like walking and running. However, it is much harder to code such rules for actions involving complicated limb motions, like sitting on the ground. Further, actions in real-world applications typically take place in cluttered environments with large variations in the orientation and scale of the actor. To address the representation issue, we first render synthetic poses from multiple viewpoints using Mocap data for known actions and represent them in a Conditional Random Field (CRF). We address background clutter and motion by using a combination of shape, optical flow and pedestrian-detection-based features, and enhance these basic potentials with terms to represent spatial and temporal constraints. We find the best sequence of actions using Viterbi search, and demonstrate our approach on videos from multiple viewpoints and in the presence of background clutter.

5.1 Overview of Approach

Approaches to activity recognition can be classified into two broad threads: the first starts with image-level features (like shape or optical flow) and recognizes events by comparing the image features to a set of event templates, while the second focuses on modeling the high-level structure of events with graphical models.
We combine ideas from the graphical-model and template-based threads, and demonstrate our approach on videos with large variations in viewpoint and scale, and also in the presence of background clutter.

Similar to [35], we first render Mocap data of various events in multiple viewpoints using Poser. We then embed these templates into a 2-layer graph model similar to [38, 13]. The nodes in the top layer correspond to events in each viewpoint, and the lower layer corresponds to each pose in the event. At each frame we compute the observation probability based on shape similarity using the scaled Hausdorff distance, and the transition probability based on flow similarity using features similar to [14]. We augment the similarity score with a duration term to account for events taking place at different speeds, and a spatial term that provides a Kalman-filter-like framework for tracking the person. We recognize events using Viterbi search on the graphical model. Our approach for simultaneously tracking and recognizing actions also builds on the tracking-as-recognition approach in [73].

5.2 Action Representation

Human actions involve both spatial (represented by the pose) and temporal (corresponding to the evolution of body pose over time) components in their representation. Further, the actual appearance of the spatio-temporal volume varies significantly with scale and viewpoint. In order to make our representation invariant to viewpoint, we first render poses of synthetic human figures from motion capture data (obtained from [21]) of various actions in multiple viewpoints using POSER (POSER 5, Curious Labs, now e frontier Inc.). We cover 90° of camera tilt angle at 15° intervals and 360° of pan at 30° intervals. We render our poses at a large resolution (900x600 pixels) and use a scale-invariant distance measure to make our approach robust to variations in scale. Further, we include body poses in all frames of the action template, instead of just the key poses as in [35].
This is because we use flow-based measures besides shape, and hence using only key poses with a large pose difference would make the flow matching very inaccurate.

We embed the poses in the 2-layer graphical model illustrated in Figure 5.1. Each node in the top layer corresponds to an action in a particular viewpoint, and the lower layer corresponds to the individual poses. We restrict transitions at the event layer based on the similarity between the low-level poses at the transition point. For example, we can transition to the standup event only after a sitdown event, since the final pose in sitdown is very similar to the start pose in standup, while we can transition to several events from the stand action (with a single stand pose at the lower layer). Further, at the lower layer we restrict pose transitions based on the expected speed of the top-layer event. Thus, for the sitdown event one cannot transition directly from the starting stand pose to the last sitting pose, but must go through the intermediate poses.

Figure 5.1: Transition Constraints - a) Graph model for a single event b) 2-layer model for a simple 2-event recognizer at the first two pan angles (0°, 30°)

Figure 5.2: Pose initialization starting with an initial detection window (green box in I_start)

To reduce the complexity of inference, we assume that the approximate tilt angle is known and we need only consider different pan angles. This is reasonable in most applications, where the tilt is fixed and only the relative pan of the actor to the camera varies.

We compute the similarity between the image sequence and the event templates by embedding shape and flow similarity scores, and also the transition constraints in Figure 5.1, into the observation and transition potentials of a CRF.
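The viewpoint sampling described above (90° of tilt at 15° intervals, 360° of pan at 30° intervals) can be enumerated as follows. This sketch assumes the tilt range includes both endpoints and that pan wraps at 360°, which yields 7 x 12 = 84 viewpoints; the exact endpoint convention is an assumption, not stated in the text.

```python
def viewpoint_grid(tilt_step=15, pan_step=30):
    """Enumerate the (tilt, pan) viewpoints used to render pose templates.

    Tilt covers 0..90 degrees inclusive; pan covers the full circle,
    with pan=360 excluded since it coincides with pan=0.
    """
    tilts = range(0, 91, tilt_step)
    pans = range(0, 360, pan_step)
    return [(t, p) for t in tilts for p in pans]
```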
Since our model includes shape and flow features and also models event durations, we call it the Shape, Flow, Duration CRF (SFD-CRF). The CRF is a generalization of the HMM that allows the observation and transition potentials to be arbitrary functions that can vary with position in the sequence. Further, the observation and transition potentials need not have a probabilistic interpretation, making the CRF ideally suited for embedding our shape and flow similarity measures. We describe the details of the SFD-CRF in the next section.

5.3 Pose Tracking and Recognition

During recognition, we start with a human detection window obtained with a state-of-the-art pedestrian detector [71] and use the edge map within the window as our basic observation. The detector does not precisely segment the human form and also does not give orientation information. So we first refine the detection by matching templates of the standing pose in multiple orientations within the detection window, at different scales (in our experiments, we looked at 3 scales below the height of the detection window at steps of 1.1). Figure 5.2 illustrates the pose initialization process described. We then track and recognize the events by traversing the graph model in Figure 5.1, starting from the initial set of poses.

Figure 5.3: Unrolled graphical model of the SFD-CRF for pose tracking and recognition

Let I = {I_1, I_2, ..., I_T} denote the sequence of frames in the video, e = {e_1, e_2, ..., e_T} denote the sequence of [event, viewpoint] tuples through the top layer of the graphical model in Figure 5.1 (since we input the tilt angle, the viewpoint corresponds to the possible pan angles), p = {p_1, p_2, ..., p_T} denote the sequence of pose templates through the lower level, and let w = {w_1, w_2, ..., w_T} denote the sequence of track windows for the actor through the video. The state \theta_t of the person at frame t is denoted by the tuple [e_t, p_t, w_t].
Then, the probability of the state sequence \theta = {\theta_1, \theta_2, ..., \theta_T} given the observation sequence I is given by the standard CRF formulation:

P(\theta \mid I) = \frac{1}{Z} \phi(\theta_1, I_1) \prod_{t=2}^{T} [\psi(\theta_{t-1}, \theta_t, I_{t-1}, I_t) \phi(\theta_t, I_t)]    (5.1)

where \phi(\theta_t, I_t) is the observation potential, \psi(\theta_t, \theta_{t-1}, I_t, I_{t-1}) is the transition potential, and Z is an observation-dependent normalization factor obtained by summing the unnormalized product over all state sequences \theta.

Equation 5.1 is very similar to the Conditional Random People (CRP) formulation presented in [65] for tracking people using a set of pose templates. Our approach differs from [65] in 3 ways. First, we simultaneously recognize and track the actions; hence the pose transitions depend not only on their similarity but also on the spatio-temporal constraints imposed by the action. Second, since we are interested in recognition, our pose tracking is coarse and does not include the grid filtering done in [65]. Third, our similarity features are scale-invariant, while they assume that the silhouette is scaled and centered a priori.

Since we start with a person detection in the first frame, p_1 corresponds to the standing pose. The observation potential \phi(\theta_t, I_t) is measured using the shape similarity measure defined later in equation (5.14), within the current track window. As the shape similarity measure depends only on the pose template p_t and the track window w_t, we have:

\phi(\theta_t, I_t) = \phi([p_t, w_t], I_t)    (5.2)

We define the transition potential \psi(\theta_t, \theta_{t-1}, I_t, I_{t-1}) as the product:

\psi(\theta_t, \theta_{t-1}, I_t, I_{t-1}) = \psi_{trans}([e_{t-1}, p_{t-1}], [e_t, p_t]) \psi_{flow}([p_{t-1}, w_{t-1}], [p_t, w_t], I_{t-1}, I_t)    (5.3)

where \psi_{trans} corresponds to the transition constraints imposed by the high-level graph model similar to the one shown in Figure 5.1, and \psi_{flow} is defined using the flow similarity (equation 5.15) and can be computed given the track windows and pose templates.
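For intuition, the unnormalized score that decoding in such a chain CRF maximizes (equation 5.1 without Z, which is constant per video segment) can be sketched as below. `path_score`, `phi` and `psi` are illustrative stand-ins for the observation and transition potentials, not the thesis interfaces.

```python
def path_score(states, frames, phi, psi):
    """Unnormalized chain-CRF score of one state sequence.

    phi(state, frame): observation potential.
    psi(prev_state, state, prev_frame, frame): transition potential.
    Decoding compares these scores across sequences, so Z is not needed.
    """
    score = phi(states[0], frames[0])
    for t in range(1, len(states)):
        score *= psi(states[t - 1], states[t], frames[t - 1], frames[t]) \
                 * phi(states[t], frames[t])
    return score
```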
There are two key issues with these basic potentials. First, since there are typically several noisy edges in the image besides the person, one can match on some background edges and stay in a specific pose. Second, in any action the actor tends to move around, and hence we must allow for some motion of the track window. However, such moves can accumulate over time, and the track window can wander off by matching on background edges.

We address the first problem by augmenting the state at each frame with a duration node and adding a temporal penalty term to the observation potential in equation (5.2) that models the speed at which an action takes place. Thus the state \theta_t corresponds to the tuple [e_t, p_t, w_t, d_t], where d_t is the duration for which the actor has been performing action e_t. We model the speed of an action with a Gaussian whose parameters can be learned from the Mocap data. If p_t is the i-th pose under event e_t, the temporal penalty \phi_{time}(e_t, p_t, d_t) is given by:

\phi_{time}(e_t, p_t, d_t) = \frac{1}{\sigma_t \sqrt{2\pi}} e^{-(i/d_t - \mu_t)^2 / 2\sigma_t^2}    (5.4)

where \mu_t is the mean speed and \sigma_t is the standard deviation for the action e_t. These can be learned by finding the mean and standard deviation of the lengths of action segments in the Mocap data. For example, if the sitdown action template that we use consists of a sequence of 74 pose templates, and the average length of the sitdown action in the Mocap data is about 60 frames, then \mu_t for sitdown is 74/60 = 1.23. \phi_{time}(e_t, p_t, d_t) effectively limits the possible poses p_t for any given [e_t, d_t], preventing the action from getting stuck at any pose. This term also plays a role similar to the "Blurry I" kernel used in [14] to allow actions to occur at different rates.

We address the second problem by scanning a region around the previous template position and choosing the location with the best shape similarity score.
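Equation (5.4) is straightforward to compute; a minimal sketch with hypothetical argument names (pose_index is the index i of p_t within the event template, duration is d_t):

```python
import math

def phi_time(pose_index, duration, mu, sigma):
    """Temporal penalty of equation (5.4): a Gaussian on the rate at
    which the actor advances through the pose template. mu and sigma
    are the mean speed and its standard deviation, learned from Mocap.
    """
    rate = pose_index / duration
    return math.exp(-((rate - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
```

With the sitdown numbers from the text (74 templates, mu = 1.23), being at pose 74 after about 60 frames scores much higher than reaching it after only 30.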
In order to prevent the track window from wandering off, we augment the shape similarity score with a multivariate Gaussian based on the distances moved in the x and y directions by the track window w_t:

\phi_{space}(e_t, w_t) = \frac{1}{2\pi \sigma_x \sigma_y} e^{-\Delta x^2 / \sigma_x^2} e^{-\Delta y^2 / \sigma_y^2}    (5.5)

In our experiments, we set \sigma_x and \sigma_y to be 20% of the scaled template width and height respectively. The use of Gaussians to define \phi_{space}(e_t, w_t) effectively provides a Kalman-filter-like mechanism for choosing the track window w_t. While we consider only actions occurring in place in this chapter, this potential can be extended to allow actions involving large translations of the actor, by estimating the difference between the expected position of the window and the actual position.

Combining equations (5.2), (5.4) and (5.5), we can define the augmented observation potential \hat{\phi}(\theta_t, I_t) as:

\hat{\phi}(\theta_t, I_t) = \phi([p_t, w_t], I_t) \phi_{time}(e_t, p_t, d_t) \phi_{space}(e_t, w_t)    (5.6)

Figure 5.3 illustrates the unrolled graphical structure of our model with the temporal and spatial nodes. Each maximal clique in Figure 5.3 corresponds to a potential defined in equations (5.2)-(5.6). With these potentials, the best state sequence can be inferred by computing the maximum probability path:

\theta^* = \arg\max_{\theta} P(\theta \mid I)    (5.7)

Equation (5.7) can be solved using a Viterbi-like search. However, since we render about 13000 poses and the track window can potentially be at any location in the image (captured at 740x480 pixels resolution), the state space is huge, making the computation time impractical. But, given the transition and spatial constraints imposed by the graphical model, only a very small number of these states have significant probability. In our experiments we considered only the top P = 10 [p_t, w_t] tuples for each e_t. Since we use templates for 6 actions rendered in 12 possible pan angles, we need to consider only 6*12*10 = 720 states at each frame to find the best state sequence.
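The spatial penalty of equation (5.5) can likewise be sketched. `phi_space` is an illustrative helper; dx, dy are the window displacements, and the sigmas would be set to 20% of the scaled template width and height as in the text.

```python
import math

def phi_space(dx, dy, sigma_x, sigma_y):
    """Spatial penalty of equation (5.5): a 2D Gaussian on how far the
    track window moved between frames, pulling the window toward its
    previous position (a Kalman-filter-like mechanism)."""
    return math.exp(-dx ** 2 / sigma_x ** 2) * math.exp(-dy ** 2 / sigma_y ** 2) \
           / (2 * math.pi * sigma_x * sigma_y)
```

The penalty is maximal for a stationary window and symmetric in the direction of motion.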
Also, since we are only interested in finding the best path through the SFD-CRF, we can ignore the normalization factor Z in equation 5.1, since it is constant for a given video segment.

5.4 Shape and Flow Potentials

We now describe the two key potentials in the SFD-CRF.

Shape Matching: The Hausdorff measure [23] has been popular for matching images due to its simplicity, and due to extensions that are invariant to translation, scale and rotation. For two sets of points A and B, the directed Hausdorff distance from A to B is:

h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|    (5.8)

where \|\cdot\| is any norm (L1 in our case). In order to make the distance robust to outliers, we typically take the partial distance as in [23]:

h(A, B) = K^{th}\max_{a \in A} \min_{b \in B} \|a - b\|    (5.9)

where K^{th}max refers to the K-th largest value of \|a - b\|. Typically, we wish to match a set of template edge points B to edges in image A, at various image locations and independent x and y scales. Let the quadruple t = (t_x, t_y, s_x, s_y) denote a model transformation t(B). Then [24] presents a scaled Hausdorff distance at t as:

h(A, t(B)) = K^{th}\max_{a \in A} \min_{b \in B} \|a - (s_x b_x + t_x, s_y b_y + t_y)\|    (5.10)

While the Hausdorff score can be used to measure the similarity of a particular set of image points to the model, it does not directly give a probability for the match between A and B. In previous work, two probabilistic formulations have been used. The first is the Hausdorff fraction, which counts the fraction of model points that are at a distance less than a threshold:

h_K(A, B) \le \delta    (5.11)

While this is a straightforward generalization of the partial Hausdorff distance in equation (5.9), it is quite sensitive to the threshold \delta.
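A brute-force sketch of the directed partial Hausdorff distance of equation (5.9), using the L1 norm as in the text. `k_frac` is a hypothetical parameter giving K as a fraction of |A|; a real implementation would use a distance transform rather than this O(|A||B|) scan.

```python
def partial_hausdorff(A, B, k_frac=0.9):
    """Directed partial Hausdorff distance h(A, B) of equation (5.9).

    A, B: lists of (x, y) points. For each a in A, take the L1 distance
    to its nearest point in B; return the k_frac quantile of these
    distances (the K-th largest is discarded to tolerate outliers).
    """
    dists = sorted(min(abs(ax - bx) + abs(ay - by) for bx, by in B)
                   for ax, ay in A)
    k = max(0, min(len(dists) - 1, int(k_frac * len(dists)) - 1))
    return dists[k]
```

With `k_frac=1.0` this reduces to the plain directed Hausdorff distance of equation (5.8); smaller fractions ignore the worst-matching points.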
[51] presents an alternative formulation based on the distance of each model point to the nearest image point, as follows:

P(A \mid t(B)) = \prod_{i=1}^{|B|} p(D_i)    (5.12)

where p(D_i) is a probability distribution function for the distance of each model point to the nearest image point, and is defined as:

p(D_i) = c_i + \frac{1}{\sigma\sqrt{2\pi}} e^{-D_i^2 / 2\sigma^2}    (5.13)

This formulation was used in [15] to match shape templates using the chamfer distance, for gesture recognition. In our case, since we render the pose templates at a high resolution, the model typically has about 1000 points. Hence the RHS of equation (5.12) tends to zero even for well-matched templates. Instead, we first compute the scaled Hausdorff distance for the entire template using equation (5.10), and then embed it in a normal distribution to define our shape similarity potential \phi([p_t, w_t], I_t):

\phi([p_t, w_t], I_t) = P(A \mid t(B)) = \frac{1}{\sigma\sqrt{2\pi}} e^{-h(A, t(B))^2 / 2\sigma^2}    (5.14)

In our experiments we set \sigma = 15, though the results are fairly robust to the actual value of \sigma.

Flow Similarity: We measure flow similarity between the event templates and the video based on pixel-wise optical flow, similar to [14]. However, our approach differs from [14] in two crucial aspects: 1) [14] assumes that the actor in the action is already tracked and stabilized, while we explore a set of possible windows at each time step and thus simultaneously do tracking and recognition; 2) there is a large difference in scale between the templates and the actor in the image, hence we cannot precompute the template flows. We discussed how we address the first problem in Section 5.3. Here we focus on computing a flow similarity given two templates T_1, T_2 and two image windows I_{t-1}, I_t in consecutive frames t-1 and t.

Figure 5.4: Matching optical flow in the image with template optical flow

In order to compute the optical flow between two templates, we first scale them based on the image windows and then compute optical flow using the Lucas-Kanade [25] algorithm. Then, similar to [14], we split the optical flow vector field F into two scalar fields F_x and F_y corresponding to the x and y components, then half-wave rectify them into four non-negative channels F_x^+, F_x^-, F_y^+, F_y^-, and finally blur and normalize them with a Gaussian to obtain the final set of features \hat{F}b_x^+, \hat{F}b_x^-, \hat{F}b_y^+, \hat{F}b_y^-. We extract a similar set of features from the image windows and compute the flow similarity as:

\psi_{flow}(T_1, T_2, I_{t-1}, I_t) = \frac{1}{4} \sum_{c=1}^{4} \frac{\sum_{x,y \in I} a_c(x,y) b_c(x,y)}{|a_c||b_c|}    (5.15)

where I refers to the spatial extent of the flow descriptor, the b_c's refer to the features extracted from the templates, and the a_c's refer to the image features. Note that equation (5.15) is normalized to be in the range [0,1]. Figure 5.4 illustrates the computation of the flow similarity described.

5.5 Experiments

We tested our approach on videos of 6 actions: sit-on-ground (SG), standup-from-ground (StG), sit-on-chair (SC), standup-from-chair (StC), pickup (PK) and point (P). We collected instances of these actions around 4 different tilt angles: 0°, 15°, 30°, 45°. We did not precisely calibrate the camera at each tilt, and typically had a tilt error of about 5°. At each tilt we collected instances of actions at 4 different pan angles (typically around 0°, 45°, 90°, 270°, 315°, though the actual pan was not measured precisely). We collected one instance of each action for each [tilt, pan] combination from four different actors, for a total of 16 instances of each action at each tilt. Further, for tilt = 0° we collected videos under 6 widely varying backgrounds, including indoors in office environments and outdoors in front of moving vehicles, for a total of 24 instances of each action at that tilt. In all, we had 400 instances of all actions across all tilts and pans. We also varied the zoom of the camera, and hence the actual size of the person varied between about 80-300 pixels in
740x480 resolution videos. The pose templates were rendered so that the standing pose is about 600 pixels tall. Figure 5.5 illustrates some of the conditions under which we tested our approach.

Figure 5.5: Sample background, viewpoint and scale variations tested - (a) Indoor office environment, tilt=15°, pan=0° (b) Indoor office environment, tilt=0°, pan=30° (c) Indoor library, tilt=0°, pan=90° (d) Indoor office, tilt=0°, pan=315° (e) Indoor office, tilt=0°, pan=0° (f) Indoor library, tilt=0°, pan=270° (g) Outdoor with moving cars, tilt=0°, pan=0° (h) Outdoor, tilt=30°, pan=30° (i) Outdoor, small scale, tilt=45°, pan=0°

To process the videos, we first apply a pedestrian detector similar to [71]. As the detector is trained only for the standing pose, it fails when the pose of the actor changes during an action. Thus the detections provide an approximate segmentation of the event boundaries in the video sequence. We tested our algorithm by running our recognizer between two detections and then comparing the highest-probability event sequence in the intervening frames to the ground truth.

To measure the relative importance of shape features, flow features and duration modeling, we compare our system (shape+flow+duration) with using only shape, only
As can be seen, combining shape with flow features produces a signi¯cant improve- 0 o 15 o 30 o 45 o Overall Speed(fps) shape+flow+duration 77:35 82:98 81:25 65:63 78:86 0:37 shape+flow 70:68 76:67 77:42 62:5 72:37 0:34 flow 63:79 56:67 61:29 53:12 59:12 0:41 shape 56:82 59:01 75:76 56:25 61:18 1:7 Table 5.1: Comparison of accuracy and speed with shape, flow, shape+flow, shape+ flow+duration features at di®erent tilt angles ment over using either of these alone. The result is further improved by modeling event durations. Also note that including duration models improves the speed too since they restrict the set of possible poses to consider. However,the cost of computing °ow features is high. Since the scale of the actor is unknown, these cannot be pre-computed apriori unlike in approaches which assume a known ¯xed scale. We alleviate this cost partly by storingtheN =1500mostrecenttemplate°owscomputed. Ateachframe,we¯rstcheck if a particular template °ow is already in the stored set and compute the °ow only if it is not present. Another observation from Table 5.1 is that the accuracy can vary with tilt angles. At higher tilts the sit-on-ground and pickup actions look very similar causing a large confusion. Variation in pan angles at a given tilt however does not signi¯cantly a®ect the 84 SG SC PK P StG StC SG 75:56 4:44 20:0 0:0 0:0 0:0 SC 11.54 76:92 11:54 0:0 0:0 0:0 PK 15:9 0:0 86:1 0:0 0:0 0:0 P 2:56 0:0 10:26 87:18 0:0 0:0 StG 0:0 0:0 20:0 0:0 75:56 4:44 StC 0:0 0:0 11:54 0:0 11:54 76:92 Table 5.2: Overall Confusion Matrix performance since the actions have very distinct signatures at di®erent pan angles. Table 5.2 shows the overall confusion matrix across all viewpoints. 85 Chapter 6 Temporally-Dense Spatio Temporal Interest Point Models for Action Recognition Theapproachdescribedinchapter5usesshapeand°owbasedfeatures,whicheliminates theneedforbackgroundsubtraction. 
However, since we cannot distinguish between background and object edges, the method can be susceptible to distraction from background motion and clutter. Further, our action models are obtained from Mocap data, which is hard to collect.

In this chapter, we use features based on spatio-temporal interest points, which are extracted only in regions with significant spatio-temporal variations. Existing methods for interest point detection provide a representation of actions that is either too sparse or too dense with several spurious detections; we extend existing interest point detectors to provide a dense action representation while minimizing spurious detections.

We learn the action models by first clustering the interest points into a set of codewords, and then combining them with pedestrian detection and tracking using a Conditional Random Field (CRF). Since we learn our models from video data, we no longer need Mocap data, but we do need a sufficiently large training set for the models to generalize well. The larger number of interest points and the high-level reasoning provided by the CRF allow us to automatically recognize action sequences from an unsegmented stream at real-time speed. We demonstrate our approach by showing results comparable to the state-of-the-art for action classification on the standard KTH action set, and also on more challenging cluttered videos.

6.1 Feature Detection and Description

6.1.1 Spatial Interest Point Detection

The Harris corner detector detects points in an image that have significant variations in both the x and y directions. Let I(x, y) denote the image and g(x, y; σ²) a Gaussian kernel:

g(x, y; \sigma^2) = \frac{1}{2\pi\sigma^2} e^{-(x^2+y^2)/2\sigma^2}    (6.1)

Let L denote the convolution

L(x, y; \sigma^2) = g(x, y; \sigma^2) * I(x, y)    (6.2)

and let L_x = \partial_x L and L_y = \partial_y L denote the partial derivatives of L in the spatial dimensions.
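As a concrete illustration of equations (6.1)-(6.2), the smoothed image L and its spatial derivatives can be computed with separable Gaussian convolutions. The sketch below is a minimal NumPy version; the toy image, kernel radius and use of central differences are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    """Sampled 1-D Gaussian, normalized to sum to 1 (separable form of
    equation (6.1))."""
    radius = int(3 * sigma) if radius is None else radius
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def smooth_and_derive(image, sigma):
    """L = g * I via separable row/column convolution (equation (6.2)),
    plus spatial derivatives L_x, L_y via central differences."""
    k = gaussian_kernel_1d(sigma)
    L = np.apply_along_axis(lambda r: np.convolve(r, k, 'same'), 1, image)
    L = np.apply_along_axis(lambda c: np.convolve(c, k, 'same'), 0, L)
    L_y, L_x = np.gradient(L)  # rows vary with y, columns with x
    return L, L_x, L_y

# Toy image (an assumption): a bright square on a dark background.
I = np.zeros((32, 32))
I[12:20, 12:20] = 1.0
L, Lx, Ly = smooth_and_derive(I, sigma=1.5)
```

The derivatives Lx and Ly are large along the edges of the square, which is where the second-moment matrix defined next accumulates energy.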
At a given scale σ², the interest points can be detected based on the eigenvalues of the second moment matrix

\mu = g(x, y; \sigma^2) * \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix}    (6.3)

Significant values of the eigenvalues λ₁, λ₂ of μ indicate significant variations in both the x and y directions. [18] detects interest points as the positive maxima of the corner function

H = \det(\mu) - k \cdot \mathrm{trace}^2(\mu) = \lambda_1\lambda_2 - k(\lambda_1+\lambda_2)^2    (6.4)

6.1.2 Spatio-Temporal Interest Point Detection

[31] presents a natural extension of the 2D Harris corner detector to 3D. Let V(x, y, t) denote the video and g(x, y, t; σ², τ²) a 3D Gaussian kernel

g(x, y, t; \sigma^2, \tau^2) = \frac{1}{\sqrt{(2\pi)^3 \sigma^4 \tau^2}} e^{-(x^2+y^2)/2\sigma^2 - t^2/2\tau^2}    (6.5)

where σ² and τ² are the spatial and temporal scales respectively. Let L_x = \partial_x L, L_y = \partial_y L and L_t = \partial_t L denote the partial derivatives of L in the spatial and temporal dimensions. We can then define a 3×3 second moment matrix

\mu = g(x, y, t; \sigma^2, \tau^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}    (6.6)

Spatio-temporal interest points are those (x, y, t) locations which have significant eigenvalues λ₁, λ₂, λ₃ of μ in all 3 dimensions. [31] extends the corner function in equation (6.4) to 3D,

H = \det(\mu) - k \cdot \mathrm{trace}^3(\mu) = \lambda_1\lambda_2\lambda_3 - k(\lambda_1+\lambda_2+\lambda_3)^3    (6.7)

and detects the interest points as the positive local maxima of H.

Figure 6.1: Comparison of interest points extracted by (a) Harris corner (b) rank increase measure (c) STIP (d) TD-STIP. Columns (1)-(3): a range of cluttered indoor environments; (4) monitoring a typical traffic intersection; (5) under camera motion

6.1.3 Temporally Dense Spatio-Temporal Interest Points (TD-STIP)

While the STIPs extracted using the method described in Section 6.1.2 provide a compact and intuitive representation of actions, they are typically too sparse to be suitable for recognizing actions involving small movements and in cluttered environments.
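To make the corner function concrete, equation (6.7) can be evaluated directly from a 3×3 second-moment matrix. The sketch below uses k = 0.005, a commonly used value that is an assumption here, not a parameter taken from the thesis; the diagonal toy matrices stand in for μ at a spatio-temporal corner and at a purely spatial edge.

```python
import numpy as np

def corner_function_3d(mu, k=0.005):
    """Equation (6.7): H = det(mu) - k * trace(mu)**3.  Large positive H
    indicates significant variation in x, y and t simultaneously
    (all three eigenvalues are significant)."""
    return np.linalg.det(mu) - k * np.trace(mu) ** 3

# Toy second-moment matrices (assumptions for illustration):
corner = np.diag([4.0, 4.0, 4.0])  # variation in x, y and t
edge = np.diag([4.0, 4.0, 0.0])    # spatial corner, no temporal variation
```

A matrix with three significant eigenvalues scores high, while a temporally flat point scores negative, which is why thresholding the positive maxima of H suppresses static corners.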
Several existing approaches like [37][11] address this issue by using one or more spatial interest point detectors similar to the Harris detector described in Section 6.1.1, or a combination of spatial and spatio-temporal detectors as in [47]. Such methods, however, can result in a large number of points that only correspond to spatial corners with no significant action. These spurious interest points can distract the recognition algorithm, and the training algorithms might require bounding box annotations for effective learning.

A simple way to increase the number of temporally significant interest points would be to compute the second moment matrix and corner function as in equations (6.6) and (6.7), and choose points that are local maxima in just the spatial dimensions instead of all 3 dimensions. This extracts points that are spatial corners but lie on temporal edges. This would still result in a large number of spurious points due to small changes in illumination and other noise conditions. Hence, in addition to the corner function in equation (6.7), we also compute the rank increase measure from [3], which measures the local motion inconsistency in a spatio-temporal patch. We compute the following two matrices in the space-time neighborhood of each interest point:

M = \begin{pmatrix} \sum L_x^2 & \sum L_x L_y & \sum L_x L_t \\ \sum L_x L_y & \sum L_y^2 & \sum L_y L_t \\ \sum L_x L_t & \sum L_y L_t & \sum L_t^2 \end{pmatrix}    (6.8)

M_\Sigma = \begin{pmatrix} \sum L_x^2 & \sum L_x L_y \\ \sum L_x L_y & \sum L_y^2 \end{pmatrix}    (6.9)

These are similar to the second moment matrices in equations (6.3) and (6.6), except that the summations are over a spatio-temporal volume. In our experiments we used a 5×5×5 neighborhood around each point. [3] argues that at spatio-temporal corners corresponding to multiple local motions, the rank increases from M_Σ to M, while M is rank-deficient at other points.
Thus we have

\Delta r = \mathrm{rank}(M) - \mathrm{rank}(M_\Sigma) = \begin{cases} 0 & \text{single motion} \\ 1 & \text{multiple motions} \end{cases}    (6.10)

While theoretically sound, the rank increase measure requires computing eigenvalues of M and M_Σ and is also susceptible to noise. To handle this, [3] introduced a continuous rank increase measure

\Delta\hat{r} = \frac{\det(M)}{\det(M_\Sigma)\, \|M\|_F}    (6.11)

where \|M\|_F = \sqrt{\sum M(i,j)^2} is the Frobenius norm of M. Spatio-temporal corners have high Δr̂, but this method can generate several interest points in uniform regions which have indeterminate flow. A similar phenomenon was also observed in [26]. Thus we choose only those points that simultaneously maximize (6.7) in the spatial dimensions and also have high Δr̂. Figure 6.1 compares the interest points produced by our approach with those of the others.

6.1.4 Evaluation of Feature Detection

We evaluated the accuracy of the different feature detectors on a set of indoor and outdoor images collected in a variety of cluttered backgrounds from the dataset used in [44]. We manually annotated bounding boxes around the actors, define a detected interest point to be correct if it falls inside the bounding box, and define accuracy as the fraction of interest points that are correct.

Figure 6.2: % of frames with at least X correct interest points for a given accuracy for different detectors

We need a high accuracy and also a good number of correct interest points in every frame as observations for recognition in a graphical model. Hence, we measure the % of frames which have at least X correct interest points at different accuracies. Figure 6.2 presents the results of our evaluation. As can be seen, TD-STIPs produce a large number of good interest points in most frames. Further, while the STIPs have good accuracy, they produce far fewer interest points. The Harris corner and rank increase measure based detectors produce a large number of interest points, but with low accuracies. Thus, our detector effectively increases correct interest point detection without affecting accuracy.
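The continuous rank-increase test of equation (6.11) is cheap to sketch. In the version below, the small eps guard (our addition, not in [3]) avoids division by zero in the uniform regions mentioned above, and the two diagonal toy matrices are assumptions standing in for M under multiple and single local motions.

```python
import numpy as np

def rank_increase(M, eps=1e-9):
    """Continuous rank-increase measure of equation (6.11):
    det(M) / (det(M_Sigma) * ||M||_F), where M_Sigma is the upper-left
    2x2 spatial block of the 3x3 summed-gradient matrix M.
    eps guards uniform regions where det(M_Sigma) ~ 0 (our addition)."""
    M_sigma = M[:2, :2]
    fro = np.linalg.norm(M, 'fro')
    return np.linalg.det(M) / (np.linalg.det(M_sigma) * fro + eps)

# Toy matrices (assumptions): multiple motions give M full rank,
# a single consistent motion leaves M rank-deficient.
multi = np.diag([3.0, 3.0, 3.0])   # rank(M)=3 > rank(M_sigma)=2
single = np.diag([3.0, 3.0, 0.0])  # rank(M)=2 = rank(M_sigma)
```

The measure is near zero when M gains no rank over its spatial block, and grows when temporal structure adds a third significant eigenvalue, which is the property the TD-STIP selection relies on.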
6.2 Feature Descriptors

We describe the interest points using SIFT-like [34] descriptors in both image gradients and optical flow. The intuition behind this is that SIFT provides a local description of image features suitable for frame-by-frame processing, while HoG and HoF are computed over larger regions or volumes. The local shape information is represented by computing a set of orientation histograms on 4×4 pixel neighborhoods in the gradient image. The orientations are assigned to one of 8 bins, and each descriptor contains a 4×4 array of 16 histograms around the interest point, giving a 4×4×8 = 128 dimensional shape feature vector. We compute a similar 128-dimensional vector for the optical flow, to give a descriptor with 256 dimensions. Figure 6.3 illustrates the shape and flow descriptors computed for a sample action.

Figure 6.3: Shape and flow descriptors extracted at an interest point

6.2.1 Spatio-Temporal Codebook Generation

During training, we extract interest points from the training videos and then learn a set of codewords as in the traditional Bag-of-Features (BoF) approaches (e.g. [48]). Further, to handle symmetric action instances, we also include descriptors obtained by flipping the interest point descriptors about the x-axis. We then cluster these descriptors using K-Means++ [1]. Let χ denote the set of points being clustered and let D(x) denote the minimum distance of point x to the cluster means already chosen. K-Means++ improves the traditional K-Means clustering by carefully seeding the initial cluster locations as follows:

1a. Choose an initial center c_1 uniformly at random from χ.
1b. Choose cluster c_i by selecting c_i = x' ∈ χ with probability D(x')² / \sum_{x \in \chi} D(x)².
1c. Repeat step 1b until all k cluster means are initialized.
2. For each x ∈ χ, compute cluster membership based on the nearest cluster mean c_i.
3.
Recompute cluster means based on cluster membership: c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x, where C_i is the set of points that belong to cluster i.
4. Repeat steps 2 and 3 until convergence.

[1] shows that K-Means++ is O(log k)-competitive after just the initial seeding step. In our experiments, we found that the clusters from just the initial step produce results comparable to the clusters from the full K-Means algorithm. Figure 6.4 illustrates the codewords produced by clustering our interest point descriptors. Note that the clusters typically contain points that are close spatially and are also of interest semantically.

Figure 6.4: TD-STIPs clustered into codewords for actions (a) boxing (b) hand waving (c) running

6.2.2 Learning Codeword Weights

Once we learn the codebook by clustering interest points in the training data, we learn a weight w_{c,a} for each codeword c and each action a. Let F^+_{c,a} denote the number of times codeword c occurs in action a, and F^+_a the total number of codewords in action a's training set. We estimate the probability to match codeword c for action a as p^+_{c,a} = F^+_{c,a} / F^+_a. We set F^+_{c,a} = 1 if no feature in action a's training samples matches codeword c. We treat the samples for all other actions as negative examples for action a and estimate the probability to match codeword c on a negative example for action a as p^-_{c,a} = F^-_{c,a} / F^-_a, where F^-_{c,a} is the number of times codeword c occurs in all other actions and F^-_a is the number of codewords in all other actions. We then estimate the weight w_{c,a} as

w_{c,a} = \log\left(\frac{p^+_{c,a}}{p^-_{c,a}}\right)    (6.12)

This definition of codeword weights is similar to the one used in [37], and assigns high weights to those codewords c that are most discriminative for action a.

6.3 Action Representation and Recognition

We automatically segment and recognize a sequence of actions in video using a Conditional Random Field (CRF), by taking advantage of the fact that actions have typical durations.
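The K-Means++ seeding of Section 6.2.1, on which the codebook construction relies, can be sketched in a few lines of NumPy. The two-blob toy data and the fixed random seed below are assumptions for illustration, not the thesis descriptor data.

```python
import numpy as np

def kmeanspp_seed(X, k, rng=np.random.default_rng(0)):
    """Steps 1a-1c of Section 6.2.1: the first center is uniform over X;
    each later center is drawn with probability proportional to D(x)^2,
    the squared distance to the nearest center chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# Toy data (assumption): two well-separated point masses, so the second
# seed is certain to land in the opposite blob (D(x)^2 = 0 on the first).
X = np.vstack([np.zeros((50, 2)), 10 + np.zeros((50, 2))])
C = kmeanspp_seed(X, k=2)
```

Because seeding already spreads the centers according to D(x)², stopping after this step can be a reasonable approximation of the full algorithm, consistent with the observation above.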
We also include additional transition constraints in the CRF where appropriate; for example, a stand action can take place only after a sit action. The state of the CRF at each frame t is represented by the tuple θ = [a, d], where a denotes the action and d denotes the duration for which action a has been occurring. Figure 6.5 illustrates the CRF.

Figure 6.5: Graphical model for continuous action recognition

Let I = {I_1, I_2, ..., I_T} denote the sequence of frames in the video. Then the probability of the state sequence θ = {θ_1, θ_2, ..., θ_T} given the observation sequence I is given by the standard CRF formulation

P(\theta \mid I) = \frac{1}{Z(I)} \phi(\theta_1, I_1) \prod_{t=2}^{T} \left[ \psi(\theta_{t-1}, \theta_t, I_{t-1}, I_t)\, \phi(\theta_t, I_t) \right]    (6.13)

where φ(θ_t, I_t) is the observation potential, ψ(θ_{t-1}, θ_t, I_{t-1}, I_t) is the transition potential, and Z(I) is a normalization factor obtained by summing the unnormalized product over all state sequences θ.

The observation potential of the CRF at each frame is computed by accumulating the codeword weights of the detected interest points:

\phi(a, d) = \exp\left\{ \sum \alpha(l)\, w^l_{n,a} \right\}    (6.14)

where α(l) is a weighting function that reduces the influence of the codeword weight w^l_{n,a} based on its location l. For actions which take place in an upright position, like walking, we can incorporate person localization from pedestrian detection and tracking into our observations. Further, since our interest point detector produces very few false alarms, the actors can be localized by spatially clustering the interest points even when pedestrian detections are not available. In our implementation, we accumulated codeword weights in a region that is twice the detection window width when the detections were available. In other cases we accumulated weights over the entire frame. Figure 6.6 illustrates the computation of the observation potential.
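A minimal sketch of the accumulation in equation (6.14), assuming a simple form for the location weighting α(l): full weight inside the widened track window and a constant down-weight outside. The constant, the window format and all toy values are assumptions for illustration.

```python
import numpy as np

def observation_potential(points, weights, window, alpha_outside=0.5):
    """Equation (6.14) sketch: accumulate codeword weights of detected
    interest points.  `window` is (x0, y0, x1, y1); points inside keep
    full weight, points outside are scaled by alpha_outside (an assumed
    form of the location weighting alpha(l))."""
    x0, y0, x1, y1 = window
    total = 0.0
    for (x, y), w in zip(points, weights):
        inside = (x0 <= x <= x1) and (y0 <= y <= y1)
        total += w if inside else alpha_outside * w
    return np.exp(total)
```

For example, one point inside the window with weight 1.0 and one outside with weight 2.0 contributes exp(1.0 + 0.5 * 2.0); points near the tracked person dominate the potential, as intended.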
Figure 6.6: Observation potential computation: green box, track window w_t from person detection; yellow circles, interest point features; white box, neighborhood of interest

The transition potential models typical action durations with a signum function as follows:

\psi([a_t, d_t], [a_{t+1}, d_{t+1}]) = \begin{cases} \dfrac{1}{1 + e^{(d_t - \mu(a_t) - \sigma(a_t))/\sigma(a_t)}} & \text{if } a_{t+1} = a_t,\ d_{t+1} = d_t + 1 \\ \dfrac{1}{1 + e^{-(d_t - \mu(a_t) + \sigma(a_t))/\sigma(a_t)}} & \forall a_{t+1},\ d_{t+1} = 1 \\ 0 & \text{otherwise} \end{cases}    (6.15)

Here μ(a_t) and σ(a_t) are the mean and variance of the duration of one action cycle, computed from the training data. The intuition behind the choice of the transition potential in equation (6.15) is that the weight of staying in the current action decreases beyond the mean duration, and the weight for transition to another action or a new cycle of the same action increases. With these potentials, the best state sequence can be inferred by computing the maximum probability path

\theta^* = \arg\max_{\theta} P(\theta \mid I)    (6.16)

Equation (6.16) can be solved efficiently using Viterbi search and automatically provides segmentation of actions in a continuous stream. From the definitions in equations (6.13)-(6.15), our inference algorithm takes O(T) computation, where T is the number of frames. Alternate methods for modeling durations, like semi-CRFs [9], have O(T²) complexity, demonstrating the efficiency of our inference.

6.4 Experimental Evaluation

We rigorously evaluated different aspects of our approach, including segmented action recognition, feature descriptors (shape vs. flow) and continuous action recognition. We describe these in detail below.

6.4.1 Segmented Recognition on KTH Dataset

We tested our approach on the KTH set from [59], which is one of the standard datasets for evaluating action recognition algorithms. The dataset consists of 25 persons performing 6 different actions, namely: boxing, handclapping, handwaving, jogging, running and walking.
The dataset was recorded under 4 different conditions: outdoors (s1), outdoors with scale variations (s2), outdoors with clothing variations (s3) and indoors under lighting variations (s4).

We trained our models in each of the four conditions separately, and also on the full dataset, with varying train:test ratios. We repeated our experiments by varying the training samples 5 times and computed the average accuracy for each train:test ratio. Figure 6.7 and Table 6.1 summarize the results of our approach. Most of the errors are due to confusion between similar actions (running vs. jogging, and handclapping vs. handwaving), as in other works.

Dataset           Ours     State-of-Art
KTH-s1            96.67%   96.0% [19]
KTH-s2            86.67%   86.1% [19]
KTH-s3            88.10%   92.1% [57]
KTH-s4            95.24%   96.7% [57]
KTH-s1+s2+s3+s4   91.08%   91.8% [30]

Table 6.1: Accuracy on KTH dataset

Figure 6.7: Accuracy on KTH dataset with varying training set sizes

Our approach produces results comparable to the state-of-the-art, though the accuracy is slightly lower than some recent results [19][57][30]. Further, even with small train:test ratios of 6:19 and 10:15, our approach produces results comparable to earlier approaches that use leave-one-out validation, as in [48][70], demonstrating the advantage of using dense representations. Figure 6.8 illustrates the results for action recognition and localization obtained by our method under the various test conditions.

Another key advantage of our approach is its real-time runtime speed of ≈30 fps on a 3GHz Pentium IV running Windows C++ programs. In contrast, [19] reports a runtime speed of ≈0.5 fps, and our implementation of [30] ran at ≈4-5 fps on the KTH videos, and even slower on videos with higher resolutions.

6.4.2 Shape vs. Flow Features

Several earlier works [19][57][44] show that the combination of shape and flow features produces better results than using either of them alone.
We did a similar test, using only the shape and only the flow descriptors described in Section 6.2. Figure 6.9 illustrates the variation in accuracy with train:test ratios when using shape, flow and shape+flow features. The combination of shape and flow produces a ≈5-10% improvement over using either alone.

Figure 6.8: Sample results for action recognition and localization

Figure 6.9: Variation in accuracy using shape, flow and shape+flow

6.4.3 Continuous Recognition on KTH Dataset

In order to test the ability of our approach to automatically segment and recognize actions from a continuous stream, we created a dataset by concatenating action videos from the KTH dataset. In order to avoid large discontinuities in the video volume at the concatenation points, we only used the actions boxing, handclapping and handwaving, which occur in place near the center of the image. We used action videos from the test set and concatenated 3-6 action segments, for a total of ≈20 videos for each of the datasets s1-s4. We then localized the person in the combined videos using low-level pedestrian detection and tracking, and then accumulated codeword weights using the CRF as described in Section 6.3. We used the same codewords that we learned during segmented action recognition. In this experiment, none of the actions that were classified correctly during segmented recognition were misclassified. Further, we also had high accuracy in detecting the action boundaries in the concatenated videos. Table 6.2 summarizes our results, where N is the total number of frames in the concatenated video dataset, and E is the number of frames with an erroneous event label from our algorithm.

Dataset   Accuracy
KTH-s1    94.16% (N = 9479, E = 554)
KTH-s2    85.05% (N = 6672, E = 997)
KTH-s3    90.34% (N = 6903, E = 667)
KTH-s4    92.87% (N = 10261, E = 732)

Table 6.2: Frame-by-frame accuracy for continuous recognition; N = total no. of frames, E = no.
of errors

6.4.4 Recognition on USC Dataset

To test the effectiveness of our approach in cluttered scenes, we evaluated it on videos of 6 actions from [44]: sit-on-ground (SG), standup-from-ground (StG), sit-on-chair (SC), standup-from-chair (StC), pickup (PK) and point (P). We used instances of these actions around 3 different tilt angles (15°, 30°, 45°) from multiple pan angles, typically around 0°, 45°, 90°, 270° and 315°. These actions were collected in typical indoor office settings and also outdoors with varying zoom. In all, we had a dataset of 265 action segments across all viewpoints and background conditions.

Figure 6.10: Sample results for action recognition on USC dataset

To train our models we used a train:test ratio of 9:1. We tested the trained models for segmented action recognition, and also for continuous recognition, where we encoded the CRF with high-level constraints such as that a stand can happen only after a sit action. However, since the low-level pedestrian detection works only in the upright pose, we cannot use it for tracking through changing poses in actions like sitting. Hence, we used the weights from all the detected interest points in our recognition. But since our interest point detector produces very few spurious detections, we could localize the actor by spatially clustering the interest points; Figure 6.10 illustrates this. In addition, we also trained SVM classifiers using different HoG and HoF channels, similar to the ones described in [30], on the same train:test separation, for comparison.

Tilt   STIP[30]:                                      TD-STIP   TD-STIP+
       HoG111   HoF111   HoF313*  HoG331   HoF113               Constraints
15°    0.875    0.875    0.875    0.875    0.75       0.875     0.9375
30°    0.75     0.5      0.625    0.875    0.75       0.75      0.875
45°    0.75     0.875    0.625    0.5      1.0        1.0       1.0

Table 6.3: Recognition accuracy on USC [44] dataset. *Channel HoF313 produced the best results on KTH in [30]

Table 6.3 summarizes the results of our experiments.
Our approach consistently produces the best performance at all tilt angles, while the STIP-based classifiers' performance depends on the channels used. [44] reports accuracies of 82.98%, 81.25% and 65.23% for the 15°, 30° and 45° tilts respectively, using Mocap models. Those models do not require additional training, but it is cumbersome to collect Mocap data. Our models, on the other hand, require large training sets to generalize well, but can be learned from videos. Also, our inference on the USC dataset runs at ≈6 fps (at 740×480 resolution), while [44] reports a speed of 0.37 fps, and our implementation of [30] ran at 0.5 fps.

6.4.5 Recognition with Infra-red Videos

We also tested the generality of our method across sensors by testing it on infra-red videos of four actions, sit-on-ground (SG), standup-from-ground (StG), pickup (PK) and point (P), collected from multiple pan angles. In all we collected 60 action segments. Our experiments produced 100% recognition with a train:test ratio of 9:1. Figure 6.11 presents some sample results on this set.

Figure 6.11: Sample results for action recognition on IR dataset

Chapter 7

Simultaneous Tracking and Action Recognition using Dynamic Bayesian Action Networks

The models discussed in Chapters 5 and 6 require Mocap data or must be learned from a large training set of videos, respectively; both are difficult to collect. In this chapter, we present a method which requires a much smaller amount of training data. Each action is modeled as a sequence of primitive actions, each of which is represented as a function which transforms the actor's state. We formulate the model learning as a curve-fitting problem, and present a novel algorithm for learning human actions by lifting 2D annotations of a few keyposes to 3D and interpolating between them.
Action models are represented in a Dynamic Bayesian Action Network (DBAN); actions are inferred by sampling the models and accumulating the feature weights, which are learned discriminatively using a modified latent-state Perceptron algorithm. Multiple features, including a new grid-of-centroids feature, are used. We show results on visual gesture recognition and on activity recognition in a cluttered grocery store environment.

7.1 Action Representation and Recognition

Our approach to action representation is based on the idea that a composite action can be decomposed into a sequence of primitive actions. Each primitive action pe modifies the state s of the actor to produce a new state s'. This can be expressed in functional form as f_pe(s, s', N), which maps the current state s to the next state s' given a set of parameters N. This representation of actions is similar to the one used in the Temporal Logic of Actions (TLA) [29], which combines first-order logic and temporal logic and represents actions with logical predicates. TLA was originally proposed in [29] for specifying and reasoning about concurrent systems; it focuses on reasoning about the properties of deterministic systems like programs, while our focus here is on reasoning under uncertainty and error.

In our work, we assume that there is a known, finite set of possible functions f to represent primitives. This is common in many domains of interest. In particular, in human motion analysis, [22] shows that there are only 3 possible movements for any limb: Rotate, Flex and Pause. Further, the Flex action can be represented as the simultaneous rotation of two parts (like thigh and knee).

7.1.1 Graphical Model Representation

Given the action models, we embed them into a Dynamic Bayesian Network (DBN), which we call the Dynamic Bayesian Action Network (DBAN), illustrated in Figure 7.1. The nodes in the topmost layer of the DBAN correspond to the composite actions like walk, pickup, etc.
The middle layer corresponds to the primitives and the lowest layer corresponds to the pose. In addition, we also include the primitive durations in the model. Thus, the state s_t of the DBAN at time t is denoted by the tuple (ce_t, pe_t, d_t, p_t).

Figure 7.1: Dynamic Bayesian Action Network

7.1.2 Pose Tracking and Recognition

During recognition, we first detect the person in the video using a state-of-the-art pedestrian detector. We then use a combined shape and foreground blob tracker to track and localize the person in each frame, through changing pose. The position and scale information available from the tracker is then used to adjust the predicted 3D pose obtained from the event model. The likelihood of the pose is then computed by matching the adjusted pose with low-level image features.

Since the pose p_t in the state s_t is continuous, an exhaustive search of the entire state space to recognize the state sequence is not possible. Instead, we use a sampling-based approach to recognize the best state sequence, storing the top K states s_t at each frame t. For each state s_t, we first sample the action transition potential φ_a(s_t, ce_{t+1}, pe_{t+1}) to choose the next actions (ce_{t+1}, pe_{t+1}). Next, we sample from the pose transition potential φ_p(p_t, pe_{t+1}, p_{t+1}) to choose the next pose p_{t+1}. Here, φ_p(p_t, pe_{t+1}, p_{t+1}) represents a distribution over the parameters N in the function f_{pe_{t+1}}(p, p', N) corresponding to primitive pe_{t+1}. Further, since the orientation of the actor w.r.t. the camera can change in actions like walk, we scan a window within 10° of the previous orientation to choose the current orientation. Next, we compute the observation potential φ_obs(s_{t+1}, o_{t+1}), and recognize the best state sequence by accumulating the weighted sum of potentials.
Algorithm 2 presents pseudocode for the algorithm described.

Algorithm 2 Inference Algorithm
• Sample the initial distribution φ_init(s) to get initial states S_0 = {⟨s_0^(i), α_0^(i)⟩ | i = 1..K}
for t = 1 to T do
    for i = 1 to K do
        Action Prediction:
        • Sample ⟨ce_{t+1}^(i), pe_{t+1}^(i)⟩ ∼ φ_a(s_t, ce_{t+1}, pe_{t+1})
        Pose Prediction:
        • Sample p_{t+1}^(i) ∼ φ_p(p_t, pe_{t+1}, p_{t+1})
        Weight Estimation:
        • α_{t+1}^(i) = α_t^(i) + Σ_k w_k φ_k(s_t, s_{t+1}, o_{t+1})
    end for
end for

The sum in the weight estimation step above computes the weighted sum of the transition and observation potentials. This is similar to the inference used in CRF-Filters [33], but differs in two crucial aspects: (1) we use a two-stage prediction, first for actions and then for poses; (2) we accumulate the weights over time for each sample, instead of resampling based on the instantaneous weights, so that our recognition algorithm does not drift due to local errors.

The representation and inference framework described so far is general and can be applied in many domains. We now describe the specific observation and transition potentials used in our implementation.

7.1.3 Transition Potential

In our implementation, we allow all possible transitions between the composite events ce. We model the primitive transitions by using the primitive event durations in a log-signum function as follows:

\phi_a(s_t, ce_{t+1}, pe_{t+1}) = \begin{cases} -\ln\left(1 + e^{(d_t - \mu(pe_t) - \sigma(pe_t))/\sigma(pe_t)}\right) & pe_{t+1} = pe_t \\ -\ln\left(1 + e^{-(d_t - \mu(pe_t) + \sigma(pe_t))/\sigma(pe_t)}\right) & pe_{t+1} \neq pe_t \end{cases}    (7.1)

Here, μ(pe_t) and σ(pe_t) are the mean and variance of primitive event pe_t's duration, learned from the training data. With this definition, the probability of staying in the same primitive pe_t decreases near the mean duration μ(pe_t), and the probability of transition to a new primitive increases.
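The duration-dependent transition weights of equation (7.1) can be sketched numerically; the toy mean and standard deviation used below are assumptions, not values learned from the thesis data.

```python
import numpy as np

def stay_potential(d, mu, sigma):
    """First case of equation (7.1): log-sigmoid weight for remaining in
    the same primitive after d frames, given its mean duration mu and
    spread sigma."""
    return -np.log(1.0 + np.exp((d - mu - sigma) / sigma))

def switch_potential(d, mu, sigma):
    """Second case of equation (7.1): weight for transitioning to a new
    primitive."""
    return -np.log(1.0 + np.exp(-(d - mu + sigma) / sigma))
```

With, say, mu = 10 frames and sigma = 2, the stay weight dominates well before the mean duration and the switch weight dominates well past it, which is exactly the crossover behaviour described above.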
We model the pose transition potential with a log-normal distribution:

\phi_p(p_t, pe_{t+1}, p_{t+1}) = -\frac{(p_{t+1} - p_t - \theta_{mean})^2}{2\theta_{var}^2}    (7.2)

Here, (p_{t+1} - p_t) denotes the amount by which pose p_{t+1} has changed from p_t; θ_mean and θ_var are the mean and variance of the amount by which primitive pe_{t+1} changes the pose at each frame, and are also learned during training.

7.1.4 Observation Potential

We compute the observation potential of a state, φ_obs(s_t, o_t), using features extracted from the video. For robustness, we use multiple features to compute the likelihood, such as foreground overlap, difference image match, and a novel grid-of-centroids based approach to match the foreground distribution. Figure 7.2 illustrates the observation potentials we extract.

Figure 7.2: Computation of observation potentials

Foreground Overlap: The foreground overlap score of the pose p is computed by accumulating the foreground pixels overlapping with the orthographic projection of pose p:

\phi_{fg}(p, I_{fg}) = |I_{fg} \cap Proj(p, I)|    (7.3)

where Proj(p, I) is the projection of pose p on image I and I_{fg} is the foreground image. This provides a simple measure of the similarity between the observed pose and the predicted pose p.

Foreground matching using grid-of-centroids: In practice, extracted foreground blobs are often fairly noisy, and the foreground overlap measure is very sensitive to pose misalignments. So, in addition to the overlap measure, we propose a grid-of-centroids to match the foreground blob distribution with the pose. To compute the grid-of-centroids, we place an m×n grid on the image I and compute the centroids of the foreground pixels within each cell. We represent the grid-of-centroids by the set of non-zero centroids ν_{[1,n_c]}(I), where n_c is the number of non-zero centroids.
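A minimal sketch of the grid-of-centroids computation just described; the cell-assignment arithmetic and the toy mask are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def grid_of_centroids(fg_mask, m, n):
    """Grid-of-centroids of Section 7.1.4: place an m x n grid on the
    foreground mask and return the centroid (row, col) of the foreground
    pixels in each non-empty cell."""
    H, W = fg_mask.shape
    ys, xs = np.nonzero(fg_mask)
    rows = np.minimum(ys * m // H, m - 1)  # grid row index of each pixel
    cols = np.minimum(xs * n // W, n - 1)  # grid column index
    centroids = []
    for i in range(m):
        for j in range(n):
            sel = (rows == i) & (cols == j)
            if sel.any():
                centroids.append((ys[sel].mean(), xs[sel].mean()))
    return centroids

# Toy mask (assumption): two blobs in opposite corners of a 2x2 grid.
mask = np.zeros((20, 20), bool)
mask[2:4, 2:4] = True      # lands in cell (0, 0)
mask[16:18, 16:18] = True  # lands in cell (1, 1)
cents = grid_of_centroids(mask, 2, 2)
```

Only the two occupied cells contribute centroids, so the representation summarizes where foreground mass lies without being dominated by pixel-level noise, which is the robustness argument made above.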
To compute the potential of pose p using the grid-of-centroids, we compare the grid-of-centroids computed over the foreground blobs within the person detection window with that computed from the projection of p on the image, Proj(p, I). We use the Hausdorff measure to compute the similarity score between the grids-of-centroids:

\phi_{cent}(p, I_{fg}) = \max_{i \in [1, n_c]} \min_{j \in [1, n_c']} \left\| \nu_i(Proj(p, I)) - \nu_j(I_{fg}) \right\|_2

where ||·|| is the Euclidean norm.

Difference Image Matching: We use the difference between the frames I_t and I_{t-1} within the person detection window as an estimate of the change in observed pose. For a state s = (ce, pe, d, p), we obtain the moving body parts of pose p during the primitive event pe from the action model. The observation potential of s for change in pose is then modeled as the overlap between the difference image and the projection of the moving body parts on the image:

\phi_{diff}(s, I_t, I_{t-1}) = |Diff(I_t, I_{t-1}) \cap Proj(p_{change}, I)|

where Proj(p_{change}, I) is the projection of the moving parts of pose p on image I and Diff(I_i, I_j) is the difference image between frames i and j. This measures the similarity between the instantaneous motion observed in the video and the pose change.

Position change matching: In addition to the observation potentials described above, we also include a motion weight, since an event can be performed while the actor is either standing or in motion, say walking or running. We compute this by matching the expected change in position with the position change observed using the person tracker. Given the observed change in position δ_pos, we define the potential function of a state s with primitive event pe as

\phi_{pos}(s, \delta_{pos}) = \begin{cases} -\dfrac{\delta_{pos}^2}{2\sigma_{stand}} & \text{if standing} \\ -\dfrac{(\delta_{pos} - \mu^{pe}_{pos})^2}{2\sigma^{pe}_{pos}} & \text{if moving} \end{cases}    (7.4)

where σ_stand is the position variance for a standing event, and μ^{pe}_{pos}, σ^{pe}_{pos} are the mean and variance of the change in person position for the primitive event pe.
The first term above models stationary events with a log-normal distribution with zero velocity, while the second models the walking action with a constant velocity model.

7.2 Action Learning

Learning action models in our framework involves two problems: (1) learning the parameters N in the primitive event definitions f_pe(p, p', N); (2) learning the weights w_k of the different potentials. We will now describe our algorithms for learning these parameters in detail.

Figure 7.3: Model Learning Illustration for Crouch action with 3 key poses and 2 primitives

7.2.1 Model Learning

Since the functional form of f_pe is known from prior knowledge, learning N involves curve-fitting from a set of (p, p') pairs. In particular, human limb motions involve rotations of the form Rotate(part, axis, degree), where the parameter axis is the axis of rotation and degree is the amount by which the part is rotated. Now, if we know the center of rotation, axis and degree can be computed from the start and end locations of part using simple geometry. We take such an approach, as illustrated in Figure 7.3.

7.2.1.1 Key Pose Annotation and 3D Lifting

Given an action, we first choose a set of key poses that best represent the action. In our work, we do this manually, but we can choose them automatically from a sample video by computing a motion energy function similar to [35]. Next, we annotate the 2D joint locations of the frames containing the key poses in a sample action video. In addition, we also annotate the relative depth ordering between the joints, the height H of the person in pixels, and the approximate pan and tilt of the camera.
Given the 2D (x, y) annotations, we lift the pose to 3D (x, y, z) by comparing them to part lengths of an idealized human model as follows:

z_j2 − z_j1 =  √(l²_{j1,j2} − Δx²_{j1,j2} − Δy²_{j1,j2})    if z_j2 > z_j1
z_j2 − z_j1 = −√(l²_{j1,j2} − Δx²_{j1,j2} − Δy²_{j1,j2})    if z_j2 < z_j1
z_j2 − z_j1 = 0                                              if z_j1 = z_j2    (7.5)

where:
- j1, j2 are possible adjacent joints.
- Δx_{j1,j2} = |x_j2 − x_j1|, Δy_{j1,j2} = |y_j2 − y_j1|.
- l_{j1,j2} is the length of the part between joints j1 and j2. Thus, if j1 = left shoulder and j2 = left elbow, l_{j1,j2} is the length of the left upper arm.
- z_j1 < z_j2 indicates that joint j1 is in front of joint j2; such relationships can be inferred from the relative depth information in the annotations.

The set of possible (j1, j2) are (Shoulder, Elbow), (Elbow, Wrist), (Hip, Knee), (Knee, Ankle) on both the left and right sides, and (Center Hip, Center Neck). We set z_{Center Hip} = 0 and the part lengths l_{j1,j2} as fractions of the person's image height H, as illustrated in Figure 7.4. We then solve the linear equations in equation (7.5) to get z_j, which gives the 3D key poses. Next, we normalize the 3D poses to a standard height and orientation, and also let the hip center be at (0, 0, 0), since different videos of an action can be shot at different viewpoints and scales.

Figure 7.4: Part lengths of ideal human model

7.2.1.2 Pose Interpolation and Model Learning

We now describe our approach to learning the model from the key pose annotations of a single action video. After we have lifted the key pose annotations to 3D and normalized them, we define the primitive actions to be the motions that interpolate between each of the key poses. Let pe denote the primitive event corresponding to the transition from key pose k_s to k_e. Also, let t_s and t_e be the frames in the action video when key poses k_s and k_e occur, respectively.
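As an illustration, the per-joint depth recovery of equation (7.5) can be sketched in a few lines of Python. This is a hypothetical sketch under the thesis' stated conventions, not the actual implementation; `depth_order` encodes the annotated relative depth of j2 with respect to j1.

```python
import math

def lift_joint(z_j1, xy_j1, xy_j2, length, depth_order):
    """Apply Eq. (7.5): recover z_j2 given z_j1, the 2D joint annotations,
    the ideal part length, and the annotated depth order
    (+1 if z_j2 > z_j1, -1 if z_j2 < z_j1, 0 if at the same depth)."""
    dx = abs(xy_j2[0] - xy_j1[0])
    dy = abs(xy_j2[1] - xy_j1[1])
    # Clamp to zero when annotation noise makes the 2D part appear
    # longer than the ideal limb length.
    dz = math.sqrt(max(length ** 2 - dx ** 2 - dy ** 2, 0.0))
    return z_j1 + depth_order * dz

# Left shoulder at depth 0; elbow annotated as being in front of it:
z_elbow = lift_joint(0.0, (100, 50), (103, 54), 13.0, -1)  # -> -12.0
```

Solving joint by joint along the kinematic chain, starting from z_{Center Hip} = 0, yields the full 3D key pose.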
Since the key poses are normalized to the same height, orientation and location, we only need to compute the transformations of the body parts between the key poses. We do this by computing the rotations of the torso, the left and right upper arms, lower arms, upper legs and lower legs, in that order. As we discussed earlier, all limb motions can be expressed in terms of rotations [22] as Rotate(part, axis, θ), where axis is the axis of rotation and θ is the angle of rotation.

Since Rotate has a fixed functional form, to learn the primitive pe we need to compute axis and θ from k_s and k_e for every part. Let (j1, j2) denote the start and end joints of part. For example, if part = left upper arm, then j1 = left shoulder and j2 = left elbow. Also, let k(j, x) denote the x-coordinate of joint j in key pose k, and let:

Δk_{s,x} = k_s(j2, x) − k_s(j1, x),  Δk_{s,y} = k_s(j2, y) − k_s(j1, y),  Δk_{s,z} = k_s(j2, z) − k_s(j1, z)
Δk_{e,x} = k_e(j2, x) − k_e(j1, x),  Δk_{e,y} = k_e(j2, y) − k_e(j1, y),  Δk_{e,z} = k_e(j2, z) − k_e(j1, z)

and let:

V_s = (Δk_{s,x}, Δk_{s,y}, Δk_{s,z}),  V_e = (Δk_{e,x}, Δk_{e,y}, Δk_{e,z})

Now, we compute axis and θ as:

axis = (V_s × V_e) / |V_s × V_e|,  θ = cos⁻¹( V_s·V_e / (|V_s||V_e|) ) / (t_e − t_s + 1)    (7.6)

The approach described so far computes the model by lifting 2D annotations of a few key poses to 3D and interpolating between them, from a single action video. When we have annotations from multiple videos, we first collect the annotations corresponding to each key pose. We lift the annotations to 3D, normalize them, and then compute the mean and variance of the joint locations. We then compute the primitive action parameter axis from the mean key poses as in equation (7.6). We compute θ for each action sample, and then compute θ_mean and θ_var as the mean and variance of the θ's. Further, we also compute the mean duration μ(pe) and variance σ(pe) from d = t_e − t_s + 1 for the duration models.
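To make equation (7.6) concrete, the following minimal sketch recovers the rotation axis and per-frame angle from the limb direction vectors of two key poses. The helper names are hypothetical, not the thesis code.

```python
import math

def cross(u, v):
    # Cross product of two 3-vectors.
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def rotation_primitive(v_s, v_e, t_s, t_e):
    """Eq. (7.6): axis = Vs x Ve / |Vs x Ve|,
    theta = acos(Vs.Ve / (|Vs||Ve|)) / (t_e - t_s + 1)."""
    c = cross(v_s, v_e)
    norm_c = math.sqrt(sum(x * x for x in c))
    axis = tuple(x / norm_c for x in c)  # unit rotation axis
    cos_angle = (sum(a * b for a, b in zip(v_s, v_e)) /
                 (math.sqrt(sum(a * a for a in v_s)) *
                  math.sqrt(sum(b * b for b in v_e))))
    # Per-frame rotation angle over the primitive's duration.
    theta = math.acos(max(-1.0, min(1.0, cos_angle))) / (t_e - t_s + 1)
    return axis, theta

# A limb pointing along x in the start pose and along y in the end
# pose, interpolated over frames 1..10:
axis, theta = rotation_primitive((1, 0, 0), (0, 1, 0), 1, 10)
```

Dividing the total rotation angle by the duration t_e − t_s + 1, as in the text, gives the per-frame increment applied by the primitive at each step.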
Figure 7.5: Primitive Learning by Pose Interpolation

7.2.2 Feature Weight Learning

We will now describe our approach for training the feature weights w = {w_f} associated with the feature potentials φ_f(·). We define the likelihood of a state sequence s_[1:n] for a feature f in image sequence I_[1:n] as:

Φ_f(I_[1:n], s_[1:n]) = Σ_i φ_f(h_i, s_i)    (7.7)

where h_i = ⟨I_[1:n], s_{i−1}, i⟩. We formulate feature weight estimation as minimization of the log-likelihood error function E(w̄) over the entire training set T. Thus, the log-likelihood error function for a given set of weights w̄ is:

Σ_{i∈T} δ(ce^o_i ≠ ce^gt_i) Σ_f w_f ( Φ_f(I_[1:n], s^gt_[1:n]) − Φ_f(I_[1:n], s^o_[1:n]) )

Algorithm 3 Discriminative Feature Weight Learning
  Randomly set the initial weight vector w
  for t = 1 to T do
    • Set Δ_f = 0
    for i = 1 to N do
      • Use Algorithm 2 to compute the most likely state sequence s^o_[1:n_i] on the i-th training sequence I_i using the current weight vector w.
      if ce^o_[1:n_i] ≠ ce^gt_[1:n_i] then
        • Given the labeled event sequence, estimate the most likely state sequence using Algorithm 2 (without the action prediction step):
            s̃_[1:n_i] = argmax_p Σ_f w_f Φ_f(I_i, ⟨ce^gt, pe^gt, d^gt, p⟩_[1:n_i])
        • Collect the feature errors: Δ_f = Δ_f + Φ_f(I_i, s̃_[1:n_i]) − Φ_f(I_i, s^o_[1:n_i])
      end if
    end for
    • Update the weight vector: w_f = w_f + Δ_f / ||Δ_f||_L1
  end for

where s^gt_[1:n] is the ground truth label sequence and s^o_[1:n] is the estimated state sequence obtained using the inference algorithm described in Algorithm 2. Due to this log-linear formulation of the likelihood error function, we can learn the weight vector using the Voted Perceptron algorithm [10]. However, that algorithm uses completely labeled training data to estimate the model parameters, while in our case the ground truth label sequence contains only event annotations. Hence, we propose an extension to the Voted Perceptron algorithm to deal with partially labeled data. In particular, we add an additional step of estimating the latent variables based on the current parameters, and use it to compute the training error.

The proposed training algorithm, summarized in Algorithm 3, takes T passes over the training set. For each training sequence, the most likely state sequence with the current weight vector is computed using Algorithm 2. If the estimated composite event is not correct, the ground truth state sequence is estimated from the labeled event sequence using Algorithm 2 without the action prediction step (since the action is known). The feature errors between the observed and the ground truth sequences are collected over the entire training set and used to update the weight vector. In our experiments, we learnt multiple randomly initialized feature weights for each action model and used the weights that achieve the highest training accuracy.

7.3 Experiments

We tested our approach in two domains: recognition of hand gestures in an indoor lab environment, and full body actions in the challenging setting of a grocery store.

Gesture Dataset: For the hand gesture dataset, we collected 5-6 instances of 12 actions from 8 different actors in an indoor lab setting. The dataset contains a total of about 500 action sequences across all actions. The videos are 852×480 resolution, and the actual height of a person is ≈200-250 pixels. This set is similar to that used in [60] but has a bigger variety. As the background is not highly cluttered, the extracted foreground is quite accurate, but the large number of actions with subtle differences still makes recognition a challenging task.

Grocery Store Dataset: This dataset was collected in the much more cluttered setting of a grocery store.
Videos were collected from a static camera mounted on top of an aisle, at a downward tilt of ≈20°. In each video, an actor enters the scene, picks up an item and leaves. The action set includes 3 full body gestures: walking, pickup from shelf, and crouch and pickup. We collected 16 videos from 8 different actors. Each video is about 400 frames long, and the observed size of the actor varies from 200 to 375 pixels in an 852×480 frame. Even though this dataset requires inference over only a few actions, it is highly challenging due to the extremely cluttered background, which results in poor foreground extraction (an example is shown in Figure 7.2 earlier), and highly articulated and ambiguous poses, such as when a person is crouching and pulling an object from the shelf. Furthermore, the orientation of the actor with respect to the camera is not the same for different actors, or even during the same action, such as walking. Lastly, the actions are not segmented a priori; rather, the temporal segmentation is part of the recognition process.

Some sample results obtained by our algorithm on these datasets are shown in Figures 7.6 and 7.7. Note that in 7.6 the limb joints appear to be highly accurate, less so in 7.7. Figure 7.7 also shows the complexity of the environment and the difficulty of some of the poses, particularly the ones on the bottom row.

Figure 7.6: Results on the Gesture Dataset: inferred pose is overlaid on top of the image and illustrated by limb axes and joints

Quantitative Evaluation: We performed a quantitative evaluation of action recognition accuracy as well as pose tracking errors.
For the gesture dataset, the videos contain only one action; it is said to be recognized correctly if its label is the same as in the ground truth.

Figure 7.7: Results on the Grocery Store Dataset: A bounding box shows the actor's position and the inferred pose is overlaid on top of the image, illustrated by limb axes and joints.

For the grocery dataset, there is a sequence of actions, each taking place over some interval. Here, we consider the events to be correctly recognized if the detected and ground truth annotations have significant overlap (say ≥50%), and we measure the degree of overlap independently.

The second column of Table 7.1 summarizes our results on both datasets. For gestures, the accuracy is similar to those reported in other work, but a direct comparison is not meaningful due to the use of different datasets. For the grocery store sequence, the recognition results are perfect, with the overlap between detected and annotated event intervals being about 90%; note that it is difficult to mark the boundaries between actions precisely to construct the ground truth. We conjecture that the results are better for the grocery dataset, in spite of the higher apparent complexity and ambiguity in a single frame, because of the smaller number of actions, which are quite distinct from each other, and the temporal models embedded in the DBAN.

We also attempted to evaluate the pose tracking errors. As we do not explicitly compute a 3D pose, nor have access to 3D ground truth, we measure the 2D joint position errors and normalize them with respect to the size of the person. We only include the joints that are involved in the action (e.g. only hands and shoulders in the gesture sequence) and only sequences where the correct action is recognized (in analogy with object detection, where the spatial accuracy of false alarms is not measured). The third column of Table 7.1 provides the results.
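The interval-overlap criterion used above for the grocery dataset can be sketched as follows. This is a hypothetical illustration, not the thesis evaluation code; intervals are (start_frame, end_frame) pairs and the overlap is measured relative to the ground-truth interval.

```python
def interval_overlap(det, gt):
    """Fraction of the ground-truth interval covered by the detection."""
    start = max(det[0], gt[0])
    end = min(det[1], gt[1])
    return max(0.0, end - start) / (gt[1] - gt[0])

def is_correct(det, gt, thresh=0.5):
    # An event detection counts as correct when its frame interval
    # overlaps the annotated interval by at least `thresh` (>= 50% here).
    return interval_overlap(det, gt) >= thresh
```

For example, a detection spanning frames 10-60 against an annotation spanning 0-100 covers exactly half of the ground truth and so just passes the 50% threshold.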
The numbers seem quite adequate for inferring the positions of limbs for activity analysis and relationships to objects in the surroundings, but possibly not for motion capture, which has not been our objective.

Dataset | Train:Test ratio | Recognition (% accuracy) | 2D Tracking (% error) | Speed (fps)
Gesture | 3:5              | 90.18                    | 5.25                  | 8
Grocery | 1:7              | 100                      | 11.88                 | 1.6

Table 7.1: Performance Scores on Gesture and Grocery Store datasets

The last column of Table 7.1 gives the computational speeds on a 3GHz Xeon CPU. Note that the numbers include all the steps. Most of the computation arises from the low-level feature extraction phase during person detection and foreground extraction. Our inference algorithm actually runs close to real time (20-30 fps). For both datasets, we learned the action models on a subset of actors and tested on the rest.

Figure 7.8(a) shows the learning behavior of the latent state perceptron algorithm on an action model trained on the gesture dataset. Observe that the model already achieves an accuracy of 85% after a few iterations. We rigorously evaluated our system with varying train:test ratios (see Figure 7.8(b)). On the Gesture Dataset, our method achieves ≈88% accuracy with models obtained from only 2 actors (1:3 train:test ratio), while for the Grocery Store Dataset, models learnt from a single actor sequence were enough to correctly recognize all the actions. This clearly demonstrates the generalizability of our method over multiple actors.

Figure 7.8: Results on Gesture Dataset. (a) feature weight learning using perceptron algorithm; (b) accuracy vs train:test ratio

Chapter 8
Summary and Future Work

To summarize, we have worked on each of the key problems in action recognition that we discussed in the introduction. We have made novel contributions as well as improved upon existing approaches. We have demonstrated our approaches in several domains, and have shown state-of-the-art results on standard benchmark datasets at real-time speeds.
We have used a range of action representations, including logic-based representations, bag-of-features and graphical models. We have introduced several novel extensions to existing graphical model formalisms [41][42] that address the representational limitations of existing graphical models. Further, to address the limitations of current action representations, we have explored combinations of logic-based [45] and bag-of-features [40] representations with graphical models. These combinations allow the use of sophisticated features for recognition in graphical models, and also minimize training requirements.

Extracting appropriate low-level features is crucial in any vision application. We have used several features to recognize actions, including foreground blobs, edge detections, person detections, optical flow and spatio-temporal interest points. We have also introduced a novel interest point detector in [40], which provides a dense, compact representation of the spatio-temporal structure of actions, and does not require costly bounding box annotations for training the models.

We have used extensions of several state-of-the-art learning and inference algorithms in our work. We have extended existing Expectation-Maximization (EM) based algorithms for training in the absence of annotations, as well as the discriminative Voted Perceptron algorithm for training with partial annotations. Further, we have explored using prior knowledge in our models to minimize the training requirements. We have also extended existing inference algorithms for online, real-time recognition of actions in a range of domains including sign-language recognition, gesture recognition, and action recognition in a range of indoor and outdoor scenarios with significant variations in viewpoint, scale, background clutter and motion.

8.1 Future Work

In future, we plan to build on our research to develop techniques for action recognition in a wider range of domains.
A successful visual action recognition system will also require simultaneous development in the lower levels of vision processing, including object detection, object tracking and pose tracking.

While progress in developing general solutions to core vision problems like segmentation, object detection, tracking and recognition has been slow, good methods have been developed in specific domains. Based on this lesson, we believe in focusing on well-defined domains with clear applications and progressively working towards a general solution. Further, we have learned the following key lessons from our research experience:

- Methods developed should produce good results on a domain, however simple, rather than on a dataset.
- Combinations of multiple features (like shape and flow) produce better performance than a single feature.
- Hybrid representations produce more realistic models of actions, objects and other semantic entities.
- Successful methods for high-level vision processing must use a combination of bottom-up feature processing and top-down models.

Based on these observations, we will now present a specific roadmap for future research in action recognition. From our research as well as the state of the art, we believe that robust techniques for recognizing single person actions using static cameras in indoor settings can be developed in the next 2-3 years. Robust feature extraction in such scenarios is reasonably well understood, and robust detection and tracking methods are maturing rapidly. Such systems can have immediate applications, such as gesture-based human-computer interaction, and monitoring lightly used areas in stores and offices.

Encouraging results have been published for recognizing single person actions with camera and background motion by us (in [44]) as well as others. Robust methods can be developed for this domain in the next 3-5 years, based on the recent rapid progress in learning-based object detection and tracking techniques.
Key applications enabled by such systems include human-robot interaction through gestural commands, and organizing personal video collections, which are typically collected with significant camera jitter.

Based on the success of our feature detector from [40] in detecting relevant interest points, we believe that mature technologies for analyzing multiple actors with low occlusions, such as those from UAVs or from street lights, can be developed in the next 5 years, and can have applications in analysis of intelligence videos and also in monitoring traffic. Robust visual features for enhancing the results of text-based image and video search are also being developed rapidly and should have practical applications in the next 5 years.

Analyzing videos from crowded locations such as train stations is harder, but recent results in person detection and tracking are promising, and we believe that good human-in-the-loop methods for action recognition in such scenarios can be developed in the next 5-10 years. Analysis of completely unstructured videos in crowded scenes with moving cameras using purely vision is much harder. Robust systems that work under such conditions are 15-20 years in the future, but we believe that progress in the simpler domains can make development of such capabilities much faster.

To conclude, while the vision problem is broad, difficult and poorly defined, we believe that we have made significant progress. We have demonstrated our approaches on several state-of-the-art datasets as well as in many challenging conditions. We believe that continued research that builds on our work will make vision-based applications a reality.

References

[1] David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In SODA, pages 1027-1035, 2007.

[2] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24:509-522, 2002.
[3] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. In ICCV, pages 1395-1402, 2005.

[4] M. Brand. Coupled hidden markov models for modeling interacting processes. Technical Report 405, MIT Media Lab Vision and Modeling, 1996.

[5] M. Brand, N. Oliver, and A. Pentland. Coupled hidden markov models for complex action recognition. CVPR, pages 994-999, 1997.

[6] M. Brand, N. Oliver, and A. Pentland. Coupled hidden markov models for complex action recognition. CVPR, pages 994-999, 1997.

[7] H. Bui, D. Phung, and S. Venkatesh. Hierarchical hidden markov models with general state hierarchy. Proceedings of the National Conference in Artificial Intelligence, pages 324-329, 2004.

[8] M.T. Chan, A. Hoogs, R. Bhotika, A. Perera, J. Schmiederer, and G. Doretto. Joint recognition of complex events and track matching. In CVPR, pages II: 1615-1622, 2006.

[9] William W. Cohen and Sunita Sarawagi. Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods. In KDD, pages 89-98, 2004.

[10] Michael Collins. Discriminative training methods for hidden markov models: theory and experiments with perceptron algorithms. In EMNLP, pages 1-8, 2002.

[11] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.

[12] T.V. Duong, H.H. Bui, D.Q. Phung, and S. Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-markov model. CVPR, 1:838-845, 2005.

[13] T.V. Duong, H.H. Bui, D.Q. Phung, and S. Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-markov model. CVPR, 1:838-845, 2005.

[14] Alexei A. Efros, Alexander C. Berg, Greg Mori, and Jitendra Malik. Recognizing action at a distance. In ICCV, pages 726-733, 2003.

[15] A. Elgammal, V. Shet, Y. Yacoob, and L. Davis. Learning dynamics for exemplar-based gesture recognition. In CVPR, pages 571-578, 2003.

[16] S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden markov model: Analysis and applications. Machine Learning, page 32, 1998.

[17] Z. Ghahramani and M.I. Jordan. Factorial hidden markov models. Advances in Neural Information Processing Systems, 8, 1996.

[18] C. Harris and M.J. Stephens. A combined corner and edge detector. In Alvey Vision Conference, pages 147-152, 1988.

[19] H. Jhuang, T. Serre, L. Wolfe, and T. Poggio. A biologically inspired system for action recognition. In ICCV, 2007.

[20] S. Hongeng and R. Nevatia. Large-scale event detection using semi-hidden markov models. ICCV, 2:1455, 2003.

[21] http://www.mocapdata.com.

[22] Ann Hutchinson and G. Balanchine. Labanotation: The System of Analyzing and Recording Movement. ISBN: 0878305270, 1987.

[23] Daniel P. Huttenlocher, Gregory A. Klanderman, and William Rucklidge. Comparing images using the hausdorff distance. PAMI, 15(9):850-863, 1993.

[24] Daniel P. Huttenlocher and William Rucklidge. Multi-resolution technique for comparing images using the hausdorff distance. In CVPR, pages 705-706, 1993.

[25] T. Kanade and B.D. Lucas. An iterative image registration technique with an application to stereo vision. In IJCAI, pages 674-679, 1981.

[26] Yan Ke, Rahul Sukthankar, and Martial Hebert. Event detection in crowded videos. In ICCV, 2007.

[27] Sanjiv Kumar and Martial Hebert. Discriminative random fields: A discriminative framework for contextual interaction in classification. In ICCV, pages 1150-1157, 2003.

[28] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282-289, 2001.

[29] Leslie Lamport. The temporal logic of actions. ACM Transactions on Programming Languages and Systems, 16(3):872-923, 1994.

[30] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.

[31] Ivan Laptev. On space-time interest points. IJCV, 64(2-3):107-123, 2005.

[32] S.K. Liddell and R.E. Johnson. American sign language: The phonological base. Sign Language Studies, 64:195-277, 1989.

[33] Benson Limketkai, Dieter Fox, and Lin Liao. CRF-filters: Discriminative particle filters for sequential state estimation. In ICRA, pages 3142-3147, 2007.

[34] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91-110, 2004.

[35] Fengjun Lv and Ramakant Nevatia. Single view human action recognition using key pose matching and viterbi path searching. In CVPR, 2007.

[36] Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum entropy markov models for information extraction and segmentation. In ICML, pages 591-598, 2000.

[37] K. Mikolajczyk and H. Uemura. Action recognition with motion-appearance vocabulary forest. In CVPR, 2008.

[38] Louis-Philippe Morency, Ariadna Quattoni, and Trevor Darrell. Latent-dynamic discriminative models for continuous gesture recognition. In CVPR, 2007.

[39] Mukund Narasimhan, Paul A. Viola, and Michael Shilman. Online decoding of markov models under latency constraints. In ICML, pages 657-664, 2006.

[40] Pradeep Natarajan, Prithviraj Banerjee, Furqan Khan, and Ramakant Nevatia. Temporally-dense spatio temporal interest point models for action recognition. Under submission, 2009.

[41] Pradeep Natarajan and Ramakant Nevatia. Coupled hidden semi markov models for activity recognition. In WMVC, 2007.

[42] Pradeep Natarajan and Ramakant Nevatia. Hierarchical multi-channel hidden semi markov models. In IJCAI, 2007.

[43] Pradeep Natarajan and Ramakant Nevatia. Online, real-time tracking and recognition of human actions. In WMVC, 2008.

[44] Pradeep Natarajan and Ramakant Nevatia. View and scale invariant action recognition using multiview shape-flow models. In CVPR, 2008.

[45] Pradeep Natarajan, Vivek Kumar Singh, and Ramakant Nevatia. STAR: Simultaneous tracking and action recognition using dynamic bayesian action networks. Under submission, 2009.

[46] Nam T. Nguyen, Dinh Q. Phung, Svetha Venkatesh, and Hung Hai Bui. Learning and detecting activities from movement trajectories using the hierarchical hidden markov models. In CVPR (2), pages 955-960, 2005.

[47] Juan Carlos Niebles and Fei-Fei Li. A hierarchical model of shape and appearance for human action classification. In CVPR, 2007.

[48] Juan Carlos Niebles, Hongcheng Wang, and Fei-Fei Li. Unsupervised learning of human action categories using spatial-temporal words. In BMVC, 2006.

[49] Nuria Oliver, Ashutosh Garg, and Eric Horvitz. Layered representations for learning and inferring office activity from multiple sensory channels. CVIU, 96(2):163-180, 2004.

[50] Nuria M. Oliver, Barbara Rosario, and Alex Pentland. Graphical models for recognizing human interactions. In Proc. of Intl. Conference on Neural Information and Processing Systems (NIPS), 1998.

[51] Clark F. Olson. A probabilistic formulation for hausdorff matching. In CVPR, pages 150-156, 1998.

[52] Patrick Peursum, Svetha Venkatesh, and Geoff A.W. West. Tracking-as-recognition for articulated full-body human motion analysis. In CVPR, 2007.

[53] Ariadna Quattoni, Michael Collins, and Trevor Darrell. Conditional random fields for object recognition. In NIPS, 2004.

[54] Bowden R, Windridge D, Kadir T, Zisserman A, and Brady M. A linguistic feature vector for the visual interpretation of sign language. volume 1, pages 91-401, 2004.

[55] Padma Ramesh and Jay G. Wilpon. Modeling state durations in hidden markov models for automatic speech recognition. ICASSP, pages 381-384, 1992.

[56] Barbara Rosario, Nuria Oliver, and Alex Pentland. A synthetic agent system for bayesian modeling human interactions. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), pages 342-343, 1999.

[57] K. Schindler and L. van Gool. Action snippets: How many frames does human action recognition require? In CVPR, 2008.

[58] J. Schlenzig, E. Hunter, and K. Ishii. Recursive identification of gesture inputs using hidden markov models. In Proc. Second Annual Conference on Applications of Computer Vision, pages 187-104, 1994.

[59] Christian Schüldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: A local SVM approach. In ICPR (3), pages 32-36, 2004.

[60] Vinay Shet, Shiv Naga Prasad, Ahmed Elgammal, Yaser Yacoob, and Larry Davis. Multi-cue exemplar-based nonparametric model for gesture recognition. In ICVGIP, 2004.

[61] Cristian Sminchisescu, Atul Kanaujia, Zhiguo Li, and Dimitris Metaxas. Conditional random fields for contextual human motion recognition. In ICCV, pages 1808-1815, 2005.

[62] Thad Starner and Alex Pentland. Real-time american sign language recognition from video using hidden markov models. In ISCV, 1995.

[63] Thad Starner, Joshua Weaver, and Alex Pentland. Real-time american sign language recognition using desk and wearable computer based video. PAMI, 20(12):1371-1375, 1998.

[64] Charles Sutton, Khashayar Rohanimanesh, and Andrew McCallum. Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data. In ICML, page 99, 2004.

[65] Leonid Taycher, David Demirdjian, Trevor Darrell, and Gregory Shakhnarovich. Conditional random people: Tracking humans with CRFs and grid filters. In CVPR (1), pages 222-229, 2006.

[66] C. Vogler and D.N. Metaxas. Parallel hidden markov models for american sign language recognition. ICCV, pages 116-122, 1999.

[67] Christian Vogler, Harold Sun, and Dimitris Metaxas. A framework for motion recognition with applications to american sign language and gait recognition. In Proc. Workshop on Human Motion, 2001.

[68] Sy Bor Wang, Ariadna Quattoni, Louis-Philippe Morency, David Demirdjian, and Trevor Darrell. Hidden conditional random fields for gesture recognition. In CVPR (2), pages 1521-1527, 2006.

[69] Yang Wang and Qiang Ji. A dynamic conditional random field model for object segmentation in image sequences. In CVPR (1), pages 264-270, 2005.

[70] S.F. Wong and R. Cipolla. Extracting spatiotemporal interest points using global information. In ICCV, pages 1-8, 2007.

[71] Bo Wu and Ramakant Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In ICCV, pages 90-97, 2005.

[72] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden markov model. In CVPR, pages 379-385, 1992.

[73] Tao Zhao and Ram Nevatia. 3D tracking of human locomotion: A tracking as recognition approach. ICPR, 1:541-556, 2002.
Asset Metadata
Creator: Natarajan, Pradeep (author)
Core Title: Robust representation and recognition of actions in video
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 07/29/2009
Defense Date: 04/28/2009
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: action recognition, computer vision, conditional random fields, hidden Markov models, OAI-PMH Harvest
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Nevatia, Ramakant (committee chair), Medioni, Gerard G. (committee member), Ortega, Antonio (committee member)
Creator Email: pnataraj@usc.edu, pradeep_nats@yahoo.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m2409
Unique Identifier: UC1133843
Identifier: etd-Natarajan-3164 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-562446 (legacy record id), usctheses-m2409 (legacy record id)
Legacy Identifier: etd-Natarajan-3164.pdf
Dmrecord: 562446
Document Type: Dissertation
Rights: Natarajan, Pradeep
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu