Analyzing Human Activities in Videos using Component Based Models

by Furqan Muhammad Khan

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2013

Copyright 2013 Furqan Muhammad Khan

To my adorable parents and beautiful wife.

Acknowledgments

I am grateful foremost to Allah for answering my prayers. It is said that it takes a village to educate a child; the same is true for my getting a doctorate degree, which would not have been possible without the help, love and guidance of so many people around me. The foremost of them all is my adviser, Professor Ramakant Nevatia, for his constant support, motivation, guidance and encouragement during this journey of obtaining the Ph.D. His thoughtful insights and allowance of freedom and time helped me grow as a computer vision researcher. I am also very thankful to Prof. Gerard Medioni, Prof. Keith Jenkins, Prof. Laurent Itti and Prof. Wei-Min Shen for their comments and discussions during and after my qualifying and dissertation exams.

I am also very grateful to all the members of the USC Computer Vision group, both academically, for helpful research discussions, and socially, for sharing cheerful moments which helped me deal with the highs and lows of getting the Ph.D. Specifically, I benefited from discussions with Vivek Kumar Singh and his contributions in the field of human pose estimation. I also want to thank Dan Parks for providing object detections and tracks which I used as input to some of my algorithms, Sung Chun Lee for his courtesy of the Natural Language Description module, and Weijun Wang for motion blob tracking. In addition, the work presented here has also benefited from discussions with Prithviraj Banerjee, Pradeep Natarajan and Pramod Sharma. I also gratefully acknowledge the Fulbright Program for providing funding, in majority, and the U.S. Government Mind's Eye program for providing funding, in part, for the completion of my program.

I would also like to pay gratitude to my friends, Adeel, Sohaib, Aaq, Hassan, Usman, Hira, Omar, Yousuf, Ali and Saba, who, even though they did not contribute directly to this research, provided support and encouragement that were essential for me to succeed. In the end, this dissertation would not have been possible without the love, support and inspiration of my family: my mother, Nasreen Jahan, my father, Sarwar Khan, my brothers, Salman and Afwan, and my sisters, Raima, Wardah and Kashaf. I am thankful to them because without their constant support I would never have survived the hardships of the Ph.D. program, and because they each were also models of hard work, patience, determination and love. Most importantly, I truly appreciate the unconditional love and support of my beautiful wife, Hala, for she had to deal with the ups and downs of my life in the most direct ways. Without her patience, encouragement and love, completing the Ph.D. would have been unimaginable.

Table of Contents

List of Tables
List of Figures
1 Introduction
  1.1 The Goal
    1.1.1 Applications of human activity recognition
  1.2 Challenges
    1.2.1 Variation in style and speed
    1.2.2 Similarity of activities
    1.2.3 Actor and Object perception
    1.2.4 Nuisances and other factors
  1.3 Overview of Approach
  1.4 Organization of Thesis
2 Related Work
  2.1 Approaches for single agent
    2.1.1 Holistic representations
      2.1.1.1 Local Feature Approaches
      2.1.1.2 Human Centric Approaches
    2.1.2 Component Based Approaches
  2.2 Approaches modeling Action-Object Context
  2.3 Approaches for multiple agents
3 Deformable Component Action Model
  3.1 Introduction
  3.2 Related Work
  3.3 Component Based Recognition of Videos
    3.3.1 Representation
    3.3.2 Volume Descriptors
    3.3.3 Composition Model
    3.3.4 Structure Model
    3.3.5 Training
    3.3.6 Recognition
  3.4 Experimental Evaluation
  3.5 Conclusion
4 Full-body activity detection in complex environments using Conditional Bayesian Networks
  4.1 Introduction
  4.2 Related Work
  4.3 Action Representation and Recognition
    4.3.1 Action Representation
    4.3.2 Recognition Model
    4.3.3 Inference
      4.3.3.1 Primitive Detection
  4.4 Experiments
    4.4.1 Recognition of Segmented Actions
    4.4.2 Recognition and Localization of Actions
  4.5 Conclusion
5 Multiple Pose Context Trees for Action Recognition in Single Frame
  5.1 Introduction
  5.2 Modeling Human Body and Object Interaction
  5.3 Human Pose Estimation using Context
    5.3.1 Pose Inference for known Object Context using Pose Context Tree
  5.4 Model Learning
    5.4.1 Prior potentials for Pose Context Tree
    5.4.2 Object Detector
  5.5 Experiments
    5.5.1 Object Detection
    5.5.2 Pose Estimation
  5.6 Summary
6 Simultaneous recognition of Action, Pose and Object in Videos
  6.1 Introduction
  6.2 Related Work
  6.3 Modeling Action Cycle
    6.3.1 Actor's States
    6.3.2 Human Object Interaction
    6.3.3 Action Model Learning
  6.4 Simultaneous Inference of Action, Pose and Object
    6.4.1 Pose Tracking
      6.4.1.1 Transition Potential
      6.4.1.2 Observation Potential
    6.4.2 Object Recognition and Tracking
    6.4.3 Pose-Object Binding
  6.5 Experiments
  6.6 Conclusion
7 Summary
Appendix A English Language Description Generation
  A.1 Action ranking
  A.2 Object blob association
  A.3 Rules for description generation
Reference List

List of Tables

3.1 Class wise average precision of our method when using different structure models.
3.2 Mean average precision (MAP) performance comparison on the HOHA dataset. Performance of our system for different structure models is given after component selection. *Best Model means we selected, for each action, the structure model which produced the best AP for that action when other parameters were fixed.
4.1 AP of CBN and BoW methods for the action recognition task (MGT: Machine Generated Tracks, HAT: Human Annotated Tracks).
4.2 Performance of our method on the validation subset of the Mind's Eye Year 2 dataset.
5.1 Object detection accuracy over the test set.
5.2 Pose accuracy over the entire dataset with a total of 65 × 10 = 650 parts. S corresponds to the accuracy over images in the soccer set; similarly B for basketball and M for misc.
5.3 Accuracy over parts involved in the object interaction; for soccer only the legs are considered and for basketball only the hands; thus, accuracy is computed over 44 × 4 = 176 parts.
6.1 Action Recognition Accuracy
6.2 Pose Recognition Accuracy
List of Figures

1.1 A few frames from some videos used as input to our algorithm
1.2 Intra-class variation for the Catch activity
1.3 Inter-class similarity between a) Catch and b) Drink actions
3.1 Bayesian Network of Action Components
3.2 Step-wise process of computing the volume descriptor
3.3 Precision-Recall curves for each action model.
3.4 Sample frames from selected videos for which human activity was successfully recognized by our system. The results shown are produced for the Impulse Model, with binarization of the output done to achieve the maximum F1 score.
4.1 Intra-class variance. Different manifestations of 'Catching' a ball, while: (a) running, (b) on a bike, (c) jumping. A person 'Colliding' with different objects: (d) another person, (e) a pole, (f) a chair.
4.2 Block diagram of our recognition and description system
4.3 FSM Model for Actions. Primitives are shown in red.
4.4 Conditional Bayesian Network for Action Recognition
4.5 Examples of variations in different subsets of the dataset.
4.6 Detection and Tracking of Humans, Objects and Motion Blobs. a) Ball is detected as a motion blob but many spurious blob tracks exist. b) Pole is not detected. c) Humans and large objects are detected reliably. d) Human and box are merged into one blob.
4.7 PR curves of CBN using MGT and BoW methods
4.8 PR curves of CBN using Human Annotated Tracks
4.9 Examples where the main action was successfully recognized.
4.10 Action recognition and localization results using MGT. Actions associated with subjects at each frame are displayed.
5.1 Effect of object context on human pose estimation. (a), (d) show sample images of players playing soccer and basketball respectively; (b), (e) show the human pose estimation using tree-structured models [72]; (c), (f) show the estimated human pose using the object context.
5.2 Pose Model: (a) Tree structured human model with observation nodes (b) Pose Context Tree for object interaction with the left lower leg; the object node is marked as double-lined.
5.3 Part Templates: (a) Edge based part templates (b) Region based part templates; dark areas correspond to a low probability of edge, and bright areas correspond to a high probability.
5.4 Inference on Pose Context Tree: (a) Sample image of a soccer player (b) Distributions obtained after applying edge templates (c) Joint part distributions of edge and region templates (d) Inferred pose and object.
5.5 Sample images from the dataset; Rows 1, 2 contain examples from the basketball class; Rows 3, 4 from soccer, and Row 5 from misc.
5.6 Sample positive examples used for training object detectors; Rows 1 and 2 show positive examples for basketball and soccer ball respectively.
5.7 Confusion matrix for interaction categorization: (a) using scene object detection, (b) using multiple pose context trees.
5.8 Results on the Pose Dataset. (a), (b), (c) are images from the basketball set and (d), (e) are from the soccer set. The posterior distributions are also shown for the Iterative Parsing approach and for PCT when the action and object position are known. Notice that even in cases where the MAP pose is similar, the pose distribution obtained using PCT is closer to the ground truth. Soccer ball responses are marked in white, and basketballs are marked in yellow. In example (c), the basketball gets detected as a soccer ball and thus results in a poor pose estimate using Multiple-PCT; however, when the context is known, the true pose is detected using PCT.
6.1 Similarity in pose sequences for two actions
6.2 Graphical Model for Actor Object Interaction
6.3 Object visibility for different keyposes. Even for a static camera, the appearance of the spray bottle and flashlight changes significantly between frames.
6.4 Confusion Matrix
6.5 Effect of using object context. Likelihoods of cup, phone, flashlight and spray bottle at each location of the interaction frames are shown in cyan, blue, green and red respectively. (a) Phone, 'calling' action and pose were correctly recognized in the presence of multiple object detections. (b) Object context helped to disambiguate the 'lighting' from the 'pouring' action. (c) The 'pouring' action was recognized in the absence of a 'cup'. (d) Detection of a 'cup' confused the 'drinking' with the 'pouring' action.
6.6 Pose Tracking and Classification results for actions that were correctly identified. The object of interaction identified automatically by our system is also highlighted.
6.7 Some examples of XML descriptions generated by our framework for the corresponding examples of Figure 6.5.
A.1 An example of description generation

List of Algorithms

1 Pseudo Code for Action Recognition
2 Pseudo Code for Inference
3 Inference mechanism for Actor States

Abstract

With cameras getting smaller, better and cheaper, the amount of video produced these days has increased exponentially. Although not a comprehensive measure by any means, the fact that about 35 hours of video are uploaded to YouTube every minute is indicative of the amount of data that is being generated. This is in addition to the videos recorded for surveillance by grocery stores and by security agencies at airports, train stations and streets. Whereas analysis of the video data is the core reason for surveillance data collection, services such as YouTube can also use video analysis to improve search and indexing tasks. However, due to the extremely large amount of data generated, human video analysis is not feasible; therefore, the development of methods which can automatically perform the intelligent task of visual understanding, specifically human activity recognition, has seen a lot of interest in the past couple of decades. Such capability is also desired to improve human computer interaction.
However, the associated problem of activity description, i.e., extracting information about the actor, location and object of interaction, has not received much attention despite its importance for surveillance and indexing tasks. In this thesis, I propose methods for automated action analysis, i.e., recognition and description of human activities in videos.

The task of activity recognition is seemingly easily performed by humans, but it is very difficult for machines. The key challenge lies in the modeling of human actions and the representation of the transformation of visual data with time. This thesis contributes possible solutions for the development of action models which facilitate action description and are general enough to capture large variations within an action class while still allowing for robust discrimination of different action classes, together with the corresponding inference mechanisms. I model actions as a composition of several primitive events and use graphical models to evaluate the consistency of action models with the video input. In the first part of the thesis, I use low-level features to capture the transformation of spatiotemporal data during each primitive event. In the second part, to facilitate description of activities, such as identification of the actor and object of interaction, I decompose actions using high-level constructs, actors and objects. Primitive components represent properties of actors and their relationships with objects of interaction. In the end, I represent actions as the transformation of an actor's limbs (human pose) over time and decompose actions using key poses. I infer the human pose, object of interaction and the action for each actor jointly using a dynamic Bayesian Network.

This thesis furthers research on the relatively ignored but more comprehensive problem of action analysis, i.e., action recognition together with the associated problem of description. To support the thesis, I evaluated the presented algorithms on publicly available datasets. The performance metrics highlight the effectiveness of my algorithms on datasets which offer large variations in execution, viewpoint, actors, illumination, etc.

Chapter 1
Introduction

To estimate the depth of impact computers have on our lives, we do not need to look beyond our work desks or pockets. Cellphones, smart cars, and the annoying automated telephone operator at the customer service desk are all now part of our lives. The notion that computers are good only for repetitive, data-intensive and computational tasks, and not for intelligent tasks such as reasoning, logical inference and analysis of visual and auditory input, is being seriously challenged. Humans perform such intelligent tasks numerous times in a typical day with ease and without realizing their complexity; however, enabling computers to do such intelligent tasks is difficult. Researchers have made remarkable strides in extending the capabilities of computers to perform intelligent tasks over the last decade or so. Consequently, computers are not just boring repetitive machines at an assembly plant or data warehouse anymore; they can also outsmart you at Jeopardy, beat you at chess and, more often than not, correctly understand what you speak. However, the wish-list of intelligent tasks that we want computers to perform reliably is long and includes a number of tasks that require visual intelligence.
Humans hold a considerable advantage over computers when performing vision related tasks such as analysis of videos or images because of their superior ability to recognize and detect objects and activities and to perform inference to obtain a coherent meaning. The goal of researchers in the computer vision community is to enhance the ability of computers to analyze and interpret visual data. Humans can not only detect objects quickly and seamlessly but also recognize complex motion patterns even when the visual stimulus is highly occluded or incomplete. Activity detection is one of the core elements of visual analysis tasks such as video surveillance, autonomous driving, video summarization, human-computer interaction and content mining. Therefore, the ability to automatically recognize activities will be useful in several tasks, including those which improve quality of life. As computer vision researchers, our goal is to enable computers to perform the intelligent task of understanding what they see, specifically, human activities.

Further pressing the need for automated video analysis is the availability of smaller and cheaper devices with large memory storage, such as phones, tablets and laptops, which have enabled users to capture their life moments not just as single snapshots but also as video snippets. This is in addition to the increased surveillance at streets, shopping malls, train stations and other places of public gathering. Given the speed at which video data is being generated (an indirect measure is that more than 35 hours of video are uploaded each minute to YouTube), it is impossible for humans to analyze all of this data for any beneficial use.

The objective of this thesis is to present a solution to the problem of automatic recognition and description of human activities captured from only one camera. This chapter formally introduces the problem of human activity recognition and description and presents some applications which require a solution to the activity recognition problem in Section 1.1. We discuss some of the challenges involved in the development of a general purpose activity recognition system in Section 1.2. A summary of our approach to solve the problem is presented in Section 1.3, and an overview of the organization of this thesis is presented in Section 1.4.

1.1 The Goal

The objective of this work is to develop an efficient method to recognize and describe actions performed by humans from visual data. In particular, we address the issue of building a computer vision system that, given a video, identifies the presence of one or more action classes that can explain the visual data. For example, Figure 1.1 shows a few frames of some video sequences to be given as input to our system and the desired action class label output. The task of automated action recognition is extremely difficult for many reasons, including variability in duration, styles of execution and viewpoints. We discuss some of the challenges associated with the action recognition task in Section 1.2.

Our secondary focus is to develop an approach that also facilitates activity description, i.e., extraction of other details about the action such as identification of the actor, object of interaction and/or sub-events. The richness of descriptions is indicative of the understanding capability of a system and provides critical information under certain circumstances.
For example, during analysis of surveillance videos, identification of a person carrying a basket and a person carrying a rifle requires different levels of human attention and response; therefore, action recognition alone loses critical information if descriptions are not provided. Most traditional approaches recognize actions as a whole by learning statistics of low-level image features, bypassing the process of recognizing the actor and/or objects. Actor and object perception in arbitrary pose are known to be very challenging for various reasons, but they hold the key to activity description and cannot be disregarded. Furthermore, in a realistic scenario, it is not only desirable to recognize actions in a video but also to localize them, i.e., determine when (temporally) and where (spatially) the action took place. Our goal is to use these unreliable components to construct a reliable activity recognition and description system.

Figure 1.1: A few frames from some videos used as input to our algorithm: (a) Catch, (b) Stand Up, (c) Haul, (d) Sit Down.

Depending on the application, the data for the activity recognition task may come from different modalities such as motion capture systems, depth, infrared or video cameras. Each of these modalities presents different challenges of its own, but they share the problems of general purpose action representation and inference. In this thesis, we focus on data coming from video cameras.

1.1.1 Applications of human activity recognition

Human activity recognition is a key component of any video analysis task. Automated video analysis is required for a variety of reasons. For example, law-enforcement agencies can benefit from the ability to automatically analyze surveillance videos on the go to proactively deal with threats to the public or to reduce response time. From the old days when surveillance was performed directly by humans, we have come to an age when many cameras watch over us at airports, bus terminals or even on the streets. Depending on the scenario, these videos are not analyzed until after the occurrence of some incident or, if they are analyzed instantaneously, usually one person is responsible for keeping an eye on more than one video feed. Online automated video analysis can help mitigate the first issue with video surveillance by providing timely detection and prediction of volatile situations, which can lead to a shorter response time by security officials to reduce the damage or avoid a dangerous situation altogether. On the flip side, offline video analysis can be used by these agencies for forensic investigation.

Video surveillance is used not only by law enforcement agencies but also by grocery stores. Videos can be analyzed automatically to monitor theft and to study customer shopping patterns to help stores arrange their stock. Video surveillance is also used to monitor human activity at old-age homes to ensure the safety of residents.

In addition to video surveillance, activity recognition is used to improve human computer interaction. Computer games are evolving to allow users to interact with game consoles using their own movements instead of game controllers. Furthermore, action recognition is also useful for video indexing and search, disability aids, and video summarization, among others.

1.2 Challenges

The problem of activity recognition is complex. The fact that each activity can be performed by a variety of actors, at different speeds, and by people who do not necessarily share similar appearance and body shape introduces high variance in the way each activity is executed.
On the other hand, limb motions for different activities may be very similar and require contextual information for their disambiguation. Viewpoint, cluttered background and apparent motion induced by illumination changes make the task of estimating limb movements even more difficult. The problem's complexity is compounded when a description of the activity is also desired as an outcome. Activity description requires perception of actors, objects and/or the actor's poses, and recognition methods which are more transparent to humans than pure machine learning based techniques. Perception of objects and actors in arbitrary pose is very challenging, which makes the task of inferring and describing the activity very difficult. Further, the granularity of the desired description also has an effect on the complexity of the task at hand. In this work, we aim to recognize human activities in semi-controlled environments and describe them at the sub-activity level.

The problem of activity recognition and description is challenging due to the existing unsolved problems of object and actor detection in different poses. Even if description is not an intended outcome, activity recognition is very difficult due to similarity of actions, variations in style and speed across actors, viewpoint, background clutter and high dimensionality. We briefly discuss some of the challenges below:

1.2.1 Variation in style and speed

An activity can be performed in multiple ways by an actor. For example, consider the activity of picking up an object from the ground. A person may just bend his back and pick up the object, while another person may squat to pick it up. Also, the execution speed may vary from person to person. For example, consider a person making a phone call. The time it takes to dial a number may vary considerably from one actor to another. Therefore, action models must be general enough to account for these style and speed variations.

Figure 1.2: Intra-class variation for the Catch activity

1.2.2 Similarity of activities

Certain activities are very similar in terms of the way they are performed. For example, the activities of picking up and putting down an object are very similar in terms of body movements. Distinguishing between these two requires either object perception or very fine human pose analysis. Using motion capture data, human joint positions can be obtained; still, the development of generic action representations that can capture variation in style and speed of actions while retaining the ability to distinguish actions with slight differences in the temporal evolution of human poses is difficult. Further, motion capture systems are usually very expensive and require a controlled environment; however, we intend to develop methods which can perform reliably in minimally constrained environments.

Figure 1.3: Inter-class similarity between a) Catch and b) Drink actions

1.2.3 Actor and Object perception

For activity description, actor and object perception are key. These problems are very complex due to variations in appearance, size, shape and viewpoint, as well as occlusions and background clutter. For indoor surveillance videos, background subtraction techniques can be used to find moving objects, but even these methods do not perform well under illumination changes, as is the case with outdoor videos. However, objects, activities and poses provide useful mutual context. This contextual information should be helpful to improve recognition of these entities; therefore, the challenge is how to model this mutual context.
1.2.4 Nuisances and other factors

Nuisances such as image noise and illumination changes, and factors such as scale and viewpoint changes, occlusion of actors and objects, background clutter and camera motion add to the complexity of not only the actor and object recognition tasks but also the activity recognition task. For the purpose of this thesis, we assume that the video is taken from a camera mounted on a static platform.

Considering the current state-of-the-art technology for actor and object perception, we address the problem of action analysis in three different contexts of reliability of these methods. At one end of the spectrum is the scenario where current human, human pose and object detection methods are not expected to perform reliably at all, as is the case with videos of movies and sports. Human, human pose and object recognition methods are usually not stable for complex scenes where humans and objects are highly occluded, the background is cluttered, or the viewpoint variations are high. At the other end, we restrict occlusion of human parts to be very low so that human pose estimation methods can work reliably, which allows us to explicitly model human-object interaction for activity analysis. In between, we have the situation where human actors are detected relatively reliably but current pose estimation techniques would fail due to occlusion from large objects of interaction.

1.3 Overview of Approach

In order to overcome the challenges listed in the previous section, activity models should be general enough to represent large variations in the style and speed of each action, but at the same time they should be able to discriminate between actions which look similar and allow for robustness against nuisances, such as viewpoint changes and occlusion. The object recognition community has found that part-based approaches are robust at handling the similar challenges offered by the task of object recognition, especially for objects with prominent structure. Similarly, we propose the use of a component based approach to model actions, because actions are also well structured entities in the spatiotemporal domain. In our approach, actions are decomposed into spatiotemporal segments, which may exhibit far less variation than the action as a whole and can be detected more reliably. An action is represented by its constituent components and their relative spatiotemporal order. The relative order of the components defines the structure of the action. Constituent components provide robust discrimination of actions in case of occlusions and viewpoint changes, when one or more components cannot be detected reliably enough. The flexibility in the spatiotemporal ordering of the components accounts for large intra-class variations of an action, as well as discrimination of actions with similar constituents. Therefore, even though the components may be simple and of low variance, complex action models can be constructed from their combination.

Since our goal is to describe actions, i.e., to identify the actor, objects of interaction, and the spatiotemporal location of the action, we decompose actions with a human centric approach. We use properties of the human actor and his relation with the object of interaction to decompose an action into temporal segments. We explicitly model actor-object interaction as part of the action model, as it improves the performance of action recognition.
This semantic decomposition is, however, not really feasible if actors are difficult to detect due to occlusion, partial visibility or other reasons, as current recognition technologies are not robust enough. Therefore, we use both automatic and manual selection of components in our approach. Once the components are identified, their appearance and the action structure are learned. Where human detection is challenging, we decompose actions into components corresponding to spatiotemporal volumes represented by low-level features such as histograms of gradients and histograms of flow. On the other hand, for videos where current technologies allow for robust human detection, we use a human centric decomposition of actions and the components usually correspond to temporal segments of features of actors and objects (speed, human pose, etc.). For recognition, we use static and dynamic graphical models to match the data with the action models because they allow for a systematic way to integrate information from different sources (action components, actor and object detectors).

We can divide the development of our method into three stages. At each stage, we evaluated our method on a publicly available dataset. For the first stage, we applied our component based method to the Hollywood Human Actions dataset (HOHA) [52]. The dataset has large human occlusions, viewpoint changes and other factors which do not allow for robust human detection with current technologies. Therefore, we use a spatiotemporal decomposition of actions and a low-level feature based representation of components. We achieved a promising mean per class average precision of 29%, which is comparable to other published results. For the second stage, we evaluated our method on subsets of the Mind's Eye [12] Year 1 and Year 2 datasets. The Mind's Eye dataset differs from the HOHA dataset in the intra-class variation of actions; however, it offers relatively simple conditions for human detection and tracking. On this multi-class set, our method achieved a mean per class average precision of 38% using machine generated tracks of humans and objects, and 61% using hand annotated tracks. These results are significantly better than the results of a self-implemented bag of features action classifier using the trajectory features of [90] for the recognition task. Our method also achieved a mean per class F1 of 0.2 on a subset of the Mind's Eye Year 2 dataset for the action detection (recognition and localization) task. We also present some qualitative examples of English language description generation using our method and of temporal localization of actions. Finally, we evaluated our most elaborate system on the dataset of [29] and achieved a mean class accuracy of 92.3%. That dataset has relatively small human occlusions and simple backgrounds, which allow for noisy but detailed human pose estimation using current technologies. We also present qualitative results for our method's ability to generate structured action descriptions.

1.4 Organization of Thesis

In Chapter 2, we briefly discuss some of the work done in the past to recognize actions automatically. We separately discuss methods for single and multiple agents; however, the focus is on single-agent methods. We coarsely divide previous approaches to single agent action recognition into two groups, holistic and component based, according to their representation of actions. In this thesis, we take a component based approach to recognize actions. We realize these component based action models as graphical networks.
We use both static and dynamic graphical networks and their combination to develop algorithms that are robust to the challenges discussed earlier.

Chapter 3 introduces our spatiotemporal component based approach to recognize actions in videos, which forms the basis of the rest of the chapters in this thesis. This approach does not rely on detection of humans or objects. Each action component corresponds to a spatiotemporal volume and is represented using low-level features which capture spatiotemporal aspects of the volume. An action is inferred through recognition of its components using a static Bayesian Network. Unlike local feature based approaches, this component based approach represents an action using only the related components of the action rather than a distribution of local features. We evaluate the approach on the Hollywood Human Actions video benchmark dataset. The approach shows good results under challenging conditions; however, the action components do not have semantic meanings.
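To make the compositional idea concrete before the model is developed, the following is a rough sketch of the kind of score a deformable-component model assigns to an action hypothesis. The notation is illustrative only and is not the exact formulation of Chapter 3:

```latex
% Illustrative sketch only, not the exact model of Chapter 3: an action hypothesis a
% with K components placed at spatiotemporal locations l_1,...,l_K relative to an
% anchor l_0 is scored by component appearance plus a structure (placement) term.
S(a) \;=\; \sum_{i=1}^{K} \phi_i\!\big(f(v_{\ell_i})\big)
      \;+\; \sum_{i=1}^{K} \psi_i\!\big(\ell_i - \ell_0 - d_i\big)
```

Here f(v_{l_i}) denotes the low-level features of the spatiotemporal volume placed for component i, phi_i scores how well that volume matches the component's appearance model, and psi_i penalizes deviation of the placement from its expected offset d_i relative to the anchor l_0. Recognition by detection then amounts to maximizing S(a) over component placements and comparing the maximum to a threshold.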
In Chapter 4, we use high level constructs such as humans and objects to decompose actions into properties of actors and their relationships with the object of interaction. Specifically, we model an action as a sequence of actor-object states. These composite states are further decomposed into primitive actor-object states which correspond to properties of a single object or a relationship between two objects. We define a Conditional Bayesian Network to embed this hierarchical structure of activities and generate bottom-up hypotheses from observation-driven primitive detection. We realize that such a system relies heavily on the performance of object detection, which does not perform well under the wide variety of conditions we are trying to address. However, even if perfect human and object tracks are provided, intra-class variations and similarities among different actions require improvement of existing systems.

Chapter 5 builds the basis for our more detailed action representation. In that chapter, we propose a method to recognize interactions from a single image based on the pose of the actor and the object of interaction. For example, observing a cup and a person's hand near his face implies that the person may be drinking. We use a detailed human pose model to extract limb positions and use the mutual context of action, pose and object to identify the type of interaction. Since no temporal information is available, emphasis is given to reliable pose and object estimation, which directly affects determination of the action. We propose a graphical framework, the Pose Context Tree (PCT), to model the mutual context of action, pose and object and to infer these entities simultaneously. We use a tree structure for articulated human pose modeling with 10 body parts [22]. Human-object interaction is modeled by adding an object node to the pose tree, such that the resulting structure is still a tree, to allow for efficient and exact inference. Multiple PCTs are created to model object interactions with different body parts and inference is done over all trees to find the best interaction and pose.

Not all activities can be recognized given only one frame, for example, making and receiving a phone call. Therefore, temporal information is also important to resolve ambiguity among activities. Chapter 6 proposes a computationally efficient component based action recognition method which retains information about the actor, object and human poses. The method leverages the mutual context provided by action, object and pose, as well as temporal information about the evolution of the human pose. Temporal continuity also simplifies the task of pose and object recognition. We present a dynamic graphical model to recognize activities involving small objects in videos. We temporally segment actions based on keyposes and use a Dynamic Bayesian Network formulation to infer activities by evaluating the sequence of keyposes. Object context is also modeled to distinguish between activities which may have similar motion patterns. We use a two-step stochastic filtering based inference algorithm to infer the most likely description of the video in terms of activity, human pose and object.

Chapter 7 summarizes our contributions to the development of action recognition methods that are more robust as well as suitable for description generation.

Chapter 2
Related Work

Over the past couple of decades, numerous approaches for human activity recognition have been proposed. Here we review previous approaches to human activity recognition in a single camera. Approaches which are more relevant to the proposed system will be discussed in the respective chapters. In the following, we group recognition methods based on their applicability to single or multiple agents and on whether they explicitly model action-object context.

2.1 Approaches for single agent

Approaches to recognize actions performed by a single agent, either with or without object interaction, can be broadly separated into two categories based on their representation of actions:

Holistic representation of actions: These methods view an action as a single high-level event in time and compute global statistics of the action volume to describe its content.

Component based representation of actions: These methods represent an action as a composition of mid-level or high-level events (components). Relative or absolute spatiotemporal locations of these components, which capture aspects of the local spatial or temporal structure in the data, define the structure of an action and are explicitly modeled.

2.1.1 Holistic representations

Approaches in this group are inspired by the success of object recognition methods which use statistics of low level image features, such as gradients, to capture the shape of an object. Similarly, in holistic approaches, an action is considered to have a spatiotemporal (ST) "shape" defined by low-level features, such as gradient and optical flow. Actions are represented by the distribution of low-level features over the spatiotemporal volume of the action. The focus of these approaches is to design either new low-level features or statistical mechanisms to improve the accuracy of action recognition. Broadly, the feature extraction is either human centric or local region based.

2.1.1.1 Local Feature Approaches

Local feature approaches blur the boundary between component based and holistic approaches because they sub-divide the video into small spatiotemporal regions and use low-level features to capture the spatiotemporal shape within. However, they are categorically different from component based approaches because, despite local feature extraction, they represent the video using a global descriptor and ignore the location information of the features. The biggest challenge for these approaches is to extract features which are discriminative and insensitive to small changes in lighting, viewpoint and scale.
Representing a video using local features involves three steps: i) detection of salient spatiotemporal regions (interest point detection), ii) extraction of features that represent these regions (feature extraction), and iii) collection of statistics (video description). Interest points are usually detected as points which are local extrema of some image function. Some of the common spatial interest point detectors used for object detection are the Harris corner [33], the Scale Invariant Feature Transform (SIFT) [57] and Maximally Stable Extremal Regions (MSER) [60]. These and their spatiotemporal extensions, such as that of SIFT by Scovanner et al. [80], have been used to detect salient regions for video representation.

Perhaps the most popular of interest point detectors, the Space-Time Interest Point (STIP), was proposed by Laptev and Lindeberg [51] as an extension of the Harris corner detector by Harris and Stephens [33]. A spatiotemporal corner is defined as a spatial corner in the image whose velocity vector is reversing direction. Mathematically, it is defined as a region where the local gradient vectors spanning the x, y and t directions are orthogonal. However, due to the sparsity of interest points produced by Laptev and Lindeberg's [51] STIP detector, Dollar et al. [16] proposed an interest point detector which is more sensitive to changes of motion direction. The detector looks for local maxima of images filtered both spatially, using a 2-D Gaussian kernel, and temporally, using a 1-D Gabor filter. Bregonzio et al. [5] propose frame differencing and the use of 2-D Gabor filters to improve the results of the method of Dollar et al. For even denser coverage of the video by interest points, Willems et al. [94] use a spatiotemporal extension of the determinant of the Hessian of an image, and Gilbert et al. [28] apply the Harris corner detector in the (x,y), (x,t) and (y,t) planes.
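As a concrete illustration of the kind of periodic-motion detector described above, the following is a minimal sketch: it smooths each frame with a 2-D Gaussian, applies a quadrature pair of 1-D temporal Gabor filters, and keeps local maxima of the combined response as interest points. All parameter values are assumed for illustration; this is not the implementation or the parameter settings of Dollar et al. [16], nor the code used in this thesis.

```python
# Minimal sketch of a periodic spatio-temporal interest point response.
# Assumptions: `video` is a (T, H, W) float array; sigma, tau and omega are
# illustrative spatial-scale, temporal-scale and frequency parameters.
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d, maximum_filter

def periodic_response(video, sigma=2.0, tau=2.5, omega=0.2):
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))  # spatial smoothing only
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    envelope = np.exp(-t ** 2 / (2 * tau ** 2))
    h_even = np.cos(2 * np.pi * omega * t) * envelope           # quadrature pair of
    h_odd = np.sin(2 * np.pi * omega * t) * envelope            # 1-D temporal Gabor filters
    r_even = convolve1d(smoothed, h_even, axis=0)
    r_odd = convolve1d(smoothed, h_odd, axis=0)
    return r_even ** 2 + r_odd ** 2   # large where intensity varies periodically in time

def interest_points(response, size=5, threshold=1e-3):
    # Keep local maxima of the response above a threshold as (t, y, x) interest points.
    peaks = (response == maximum_filter(response, size=size)) & (response > threshold)
    return np.argwhere(peaks)
```

Varying sigma and tau yields responses at different spatial and temporal scales.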
The second step in video representation is feature extraction for the region around the interest points. Laptev and Lindeberg [51] proposed spatiotemporal jets, which are high-order derivatives computed for cubes centered at the STIPs, to capture the local motion and spatial information. Later, for a more robust description of the local neighborhood of the interest point, Laptev et al. [52] suggested the use of histograms of gradients (HOG) and histograms of optical flow (HOF) aggregated over sub-regions defined by an SxSxT grid centered at the interest point. Dollar et al. [16] employed different choices of features based on brightness, gradients and optical flow to describe the region around the interest points. Their experiments suggested that vectorization of image gradients at the pixel locations of the cubic region outperformed other choices based on histogram computation, as the positional information offsets the added robustness of histograms. Another choice for feature extraction was presented by Klaser et al. [47], who suggested a spatiotemporal extension of the SIFT descriptor [57] based on histograms of 3-D gradient orientations, whereas Willems et al. [94] extended the SURF descriptor, which uses Haar wavelets, to the spatiotemporal domain.

All the above methods use a static region around the interest point for feature extraction. Another approach is to track the interest point and use a dynamic spatiotemporal region. This helps to keep the extracted feature more directly related to the interest point. Most commonly, a spatial interest point detector is applied at each frame of the video and the detected interest points are then tracked using the KLT tracker [58] or another method. For example, Sun et al. [87] track the interest points obtained using the SIFT detector and describe the spatiotemporal region around the path of each interest point by averaging the SIFT descriptor around the location of the interest point at each time instant of the track. In addition, to describe the evolution of the trajectory, they use statistics of the instantaneous velocities of the tracked points; however, they ignore local motion information of the neighborhood. In a contrasting approach, Matikainen et al. [61] tried to capture the local motion information in the neighborhood of point trajectories by clustering the trajectories using K-means [2]. Trajectories were forced to be of fixed length and their possible affine transformations were accounted for in clustering. The spatiotemporal region around clustered trajectories was represented by velocity vectors of the center of the clustered trajectory and elements of the affine transformation model. Messing et al. [62] removed the limitation of fixed length trajectories by quantizing them as time series of log-polar quantized velocities, whereas Raptis et al. [76] represent trajectories as time series and describe them by a concatenation of histogram of gradients (HoG) and histogram of flow (HoF) features computed for each timestep, as well as their temporal averages over a sliding window (AoG and AoF).

The most popular model to represent a video using feature vectors of local spatiotemporal regions is the bag of words model. In a bag of words model, the emphasis is on the distribution of local features rather than their specific locations. Each local feature is associated with a "word" in a vocabulary and a normalized histogram of word occurrences is computed for the whole video. Therefore, any information about the spatiotemporal location of local features is suppressed. A vocabulary is created by clustering local features extracted from the videos in the training set. The discriminative quality of the vocabulary directly correlates with action recognition performance, and for a good vocabulary the selection of an appropriate distance measure for the feature space is critical and non-trivial for varying length feature descriptors. Most of the methods discussed above [51, 16, 80, 47, 94, 61] use the bag-of-words model to represent actions.
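The bag-of-words pipeline just described can be summarized in a few lines. The sketch below is illustrative only: the descriptor type, vocabulary size and classifier are assumptions, not the settings of any cited method. Each video is assumed to be given as an (n_i, d) array of local descriptors, e.g. HOG/HOF vectors extracted around detected interest points.

```python
# Minimal bag-of-words sketch: k-means vocabulary, per-video word histograms,
# and a linear SVM on top. All parameter choices here are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_vocabulary(train_descriptor_sets, vocab_size=1000, seed=0):
    # Cluster all training descriptors; the cluster centers are the visual "words".
    all_descriptors = np.vstack(train_descriptor_sets)
    return KMeans(n_clusters=vocab_size, random_state=seed).fit(all_descriptors)

def bow_histogram(descriptors, vocabulary):
    words = vocabulary.predict(descriptors)                       # quantize descriptors
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                            # normalized histogram

# Usage (train_descriptor_sets, train_labels and test_descriptors are assumed inputs):
# vocabulary = build_vocabulary(train_descriptor_sets)
# X_train = np.array([bow_histogram(d, vocabulary) for d in train_descriptor_sets])
# classifier = LinearSVC().fit(X_train, train_labels)
# prediction = classifier.predict(bow_histogram(test_descriptors, vocabulary)[None, :])
```

Note how the histogram discards where and when each word occurred; the spatiotemporal-layout extensions discussed next address exactly this limitation.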
Their method however, requires dierent views of all the actions to be present in the action bank and manual selection of many action templates. These methods are dierent from component based approaches in that they use collection of statistics for video encoding instead of the notion of modeling actions using its related components. 2.1.1.2 Human Centric Approaches This category includes methods which use human detection and tracking to identify the region of interest and compute features to capture the transformation of the human body over the period of the action. The transformations of human body can be dened in terms of either human shape or motion, or both. Yamato et al. [98] modeled action as an evolution of human shape, captured through binary silhouette. Discrete observation Hidden Markov models were adopted to represent action templates. The observations of the HMM were quantized feature vectors computed from binary silhouette of a person playing tennis. Sminchisescu et al. [85] also used silhouettes of human body but suggested use of Con- ditional Random Fields (CRFs) over HMMs because HMMs cant accommodate multiple overlapping features of the observation or long-range dependencies among observations at multiple time steps, because the inference problem for such models becomes intractable. Alternatively, Bobick and Davis [4] tried to dene actions as a function of their mo- tion. The authors proposed to compute two 2-D image transformations at each time step: motion energy image, to capture the intensity of the motion, and motion history image, 20 to indicate the latency of motion at each location. Higher order moments of these trans- formed images are computed to represent the action volume and Mahalanobis distance is used as a similarity measure of an input with a template model. Efros et al. [18] used mid-level motion features computed from the optical ow eld in the local neighborhood of the human tracks to dene action templates. To match an input video with an action model, each frame was labeled independently according to its neighbor in a database of annotated video sequences. Fathi and Mori [20] also use similar representation for action templates based on optical ows, but use Adaboost [26] for classication for input sequences. Ke et al. [44] demonstrated that both human shape and motion information are impor- tant cues for action recognition. The authors proposed a method based on a combination of shape and optical ow features. Their approach is based on over segmentation of videos in space-time using Mean-Shift algorithm [9]. Human centric approaches are strongly coupled with human detection and tracking al- gorithms. Humans can be detected by either foreground background segmentation [19, 42] or by doing shape analysis [10, 36, 21]. Robustness of detection and tracking algorithms directly correlates with the ability to extract related features and thus with action recog- nition performance. However, they simplify the task of activity description by obviating the need to devise algorithms to nd person of interest for each action. 2.1.2 Component Based Approaches These methods explicitly decompose actions into components which capture the informa- tion of local spatial or temporal structures of the action. Simple temporal decomposition of actions was explored by [59, 38, 67, 7]. Lv and Nevatia [59] introduced ActionNets as a way to enforce constraints on the temporal order of action components, when each compo- nent is described by a representative human pose (keypose). 
Ikizler and Forsyth [38] used 21 a sequential model of human poses which relate to temporal components. Natarajan et al. [67], modeled action as a sequence of primitive events which cause a transformation of human pose. Temporal constraints on primitive events and spatial constraints on human pose during each event were encoded jointly in a dynamic Bayesian Network. On the other hand, Brendel and Todorovic [7] used a time series of activity codewords. At each time step, the frame is segmented using mean-shift algorithm and a candidate segment is identied as part of the action based on best matching score with activity codewords. Temporal consistency of regions is enforced with a Markov chain and Viterbi algorithm is used to optimize the candidate selection process under spatial and temporal smoothness constraints. Assumption of each video frame being part of some action and that there is only one salient spatial region are too restrictive. Wang and Mori [92] proposed a more complex model to allow for more than one salient spatial region. They adopted hidden conditional random eld (HCRF) model proposed for object recognition [71] to explicitly encode pairwise relationships of spatial regions in each frame. The authors also captured global information about the action using a bag of words model in addition to local patch features. Niebles et al. [68] decompose actions into many temporal segments of variable lengths. These segments are classied into motion segments and their best matching scores are accumulated. To ensure temporal consistency each segment is penalized for the displace- ment from the expected location of the best matched motion classier in the action model. This can be viewed as an extension of popular deformable parts model of Felzenswalb et al. [21] which only allows for temporal displacement of motion segments. Raptis et al. [75] decompose actions into spatiotemporal components and explicitly model their pairwise relationship in a graphical model. The authors do spatiotemporal segmentation of a video by tracking and clustering spatial interest points. Each segment 22 is then matched with one action component. Optimal matching of segments with action components is performed by an approximate sub-graph matching algorithm. The approaches for action recognition presented in this thesis fall under this category of component based models. Most of the above models display remarkable performance on action classication task; however, their modeling schemes are not well suited for high- level action description task. Although ActionBank constructs a high-level component model for action recognition, but describing throw as a combination of many dierent high-level actions is not intuitive. 2.2 Approaches modeling Action-Object Context Relatively little work has been done to leverage from the mutual context action and ob- jects of interaction provide to each other. Moore et al. [63] estimated a prior on HMMs for dierent activities based on object contact. Wilson and Bobick [95] and Davis et al. [13] studied the eect of object properties on human activities such as walk. Kuniyoshi and Shimoxaki [50] proposed a self-organizing neural network model for action-object context. Peursum et al. [70] and Kjellstrom et al. [46] used activity recognition as a tool to improve object detection. Peursum et al. [70] assumed human poses were tracked by an indepen- dent system to be provided as input to activity recognition system. 
All these systems assume that either the action or the object recognition task can be solved independently well enough to robustly provide context to the other. Gupta et al. [30] presented a Bayesian formulation to simultaneously estimate the action performed and the object used. However, like Peursum et al. [70], hand trajectories are obtained independently, which is challenging in a generic setting. Recently, Cinbis et al. [39] used scene knowledge with human and object centered features in a Multiple Instance Learning (MIL) framework to recognize activities "in the wild". The method, however, ignores spatio-temporal relations among the features.

2.3 Approaches for multiple agents

Most of the approaches discussed above target single actor activities. Methods have also been devised to address activities involving more than one person interacting with each other, such as playing basketball. Both graphical models [40, 11, 31, 83] and logic based [77, 84, 64, 6] systems have been proposed, whereas [89, 64, 6] perform probabilistic reasoning in a logic-based framework. Intille and Bobick [40] dynamically constructed a Bayes net hierarchy using temporal constraints to recognize football plays. Damen and Hogg [11] linked primitive events by doing Markov Chain Monte Carlo (MCMC) search over Bayes nets. Gupta et al. [31] defined AND-OR graphs for loosely labeled data to automatically learn spatio-temporal patterns of a baseball game. Ryoo and Aggarwal [77] used a hierarchical framework to recognize high-level activities from sub-events by defining a context-free grammar (CFG); although it used HMMs to detect sub-events, the CFG parsing is not probabilistic. Tran and Davis [89] allowed for uncertainty in logic formulae by assigning weights to formulae and by describing a way to derive a Markov Network from these weighted rules. They call this reasoning framework a Markov Logic Network (MLN). Their goal is to recognize person-person and person-vehicle interactions. Morariu and Davis [64] used Allen's Interval Logic to recognize high level activities from primitive events for a one-on-one basketball game; they use an MLN for inference. Siskind [84] introduced Event Logic and Brendel et al. [6] extended it with probabilistic logic to Probabilistic Event Logic (PEL). These systems are usually designed for scenarios where the domain of application has a predefined set of rules which need to be followed.

Chapter 3

Deformable Component Action Model

3.1 Introduction

Human activity recognition is extremely complex for many reasons, including its dynamic nature. The dynamic nature results in high intra-class variance because of the human factor: not only do all of us do things slightly differently, but it is also inherently difficult for humans to exactly repeat their actions without sufficient effort and practice.

Most promising approaches to recognize actions treat activities as a single high-level event in time. They transform spatiotemporal data into a representative feature vector and classify it into one of many classes. Videos are transformed either using a dense grid or using sparse salient points which implicitly capture sub-events. Different methods have been proposed to take advantage of the spatio-temporal structure of these low-level features, which is lost in a simple bag-of-words approach. These methods perform well for datasets where intra-class variability is low, the actions are pre-segmented or there is only one actor in the frame.
On the other hand, a handful of methods have been proposed which explicitly model actions as a combination of sub-events, either sequential [67] or concurrent [78, 75]. Natarajan et al. [67] focused solely on human pose, whose estimation is itself a complex and unsolved problem. Sadanand et al. [78] proposed ActionBank and demonstrated promising results on UCF50. However, for a reasonable and discriminative ActionBank, a good representation of action filters has to be selected manually.

In this chapter, we propose a component based action recognition approach similar in spirit to the Deformable Parts Model [21] for object recognition. We represent each action as a configurable composition of spatiotemporal segments. Complicated actions can be represented this way by using simple components. The number of components trades off complexity with expressiveness. Intra-class variance of actions is primarily captured as variance in the structure of the action instead of the variance of individual components. This helps reduce the data dimension of our problem domain from pixels to components. Therefore, models can be trained reasonably with limited data. Another advantage of our method is that it essentially does recognition via detection. Therefore, unlike most other methods, it is inherently adept at localization of actions in video streams. For each action, we construct a Bayesian Network of its components and efficiently perform recognition and localization tasks by using well established inference algorithms on these graphs. ActionBank also uses a component based representation for actions; however, it uses many manually selected high-level action detectors and combines detector responses using a linear SVM, which is less efficient for the localization task, despite its use of the Action Spotting framework [14] to localize individual components. We propose an alternative and more structured approach for component based action recognition which uses a mid-level representation. We show that our approach yields a mean Average Precision of 29% on the Hollywood Human Actions dataset and measures reasonably against the state-of-the-art method.

In the following, we discuss some related work in Section 3.2, then present details of our method in Section 3.3. We discuss details of the experimental evaluation of our method in Section 3.4 and conclude in Section 3.5.

3.2 Related Work

A number of methods have been proposed to recognize actions in videos. Very broadly, these methods can be classified into two categories: holistic and component based. The first type of approach views action as a single entity and uses statistics of low-level features to recognize actions. Approaches which use global spatiotemporal templates, such as motion history [4], spatiotemporal shapes [3] or other templates to capture a holistic view [14], and bag-of-words representations, where words are computed statically for each frame [79, 47, 56, 97], dynamically for salient points [51, 52, 66, 62], or from feature trajectories [87, 68, 96], usually fall under this category. These approaches are usually sensitive to partial occlusions, viewpoint or scale changes.

The second type uses decomposition of action to capture local spatial or temporal structure. Sequential models have been proposed where each component represents a spatial region [38, 7, 92] or a detailed human pose [59, 67].
Brendel and Todorovic [7] identify at each frame only one promising region (activity codeword) as part of an activity and model temporal consistency using a Markov chain. Wang and Mori [92] model pairwise relations of predefined image patches in a frame as an HCRF. However, this method relies on independent detection of image patches. On the other hand, Lv and Nevatia [59], Gupta and Davis [29], and Natarajan et al. [67] model actions as sequences of articulated human poses. However, pose estimation is a very challenging and unsolved problem. More recently, non-sequential compositional approaches have been proposed by [78, 75]. Sadanand and Corso [78] use a bank of many high level action templates and use a linear SVM to classify the vector of their matching scores at different locations. Their approach requires a good sampling of different viewpoints of each action example. Raptis et al. [75] decompose a video into spatiotemporal regions (segments) based on grouping of low-level feature trajectories. An action is represented as a fully connected graph of its components. The recognition problem is formulated as a matching problem between regions and component labels. The performance of the system depends on the video segmentation. Furthermore, performance is traded for computational efficiency, which results in lower than state-of-the-art results.

We use an approach similar to Raptis et al.; however, we use a generative instead of a discriminative approach. Our method does not depend on feature trajectories, video segmentation or matching. Instead of a component label, we model the location of each component as a latent variable and compute a posterior probability estimate for the occurrence of the action.

3.3 Component Based Recognition of Videos

We first describe our high-level model and representation of actions using its components and then outline details of component definition and recognition.

3.3.1 Representation

We represent each action as a constellation of flexible components using a Bayesian Network. Unlike most previous methods where the component definition is restricted to spatial regions, we define a component as a spatiotemporal volume. Therefore, each component can be viewed as a sub-event. Depending on the speed at which an action is performed and the viewpoint, the placement of the components relative to each other may vary. Therefore, we model the location of each component as a latent random variable L_c. To deal with variation in speed and scale, we define the size of a component relative to the size of the action volume, so it scales proportionally with the spatiotemporal scale of the action volume. Further, due to lighting, actors, objects, etc., the appearance of a component may also vary. Therefore, we model the appearance (or composition) of a component as a probability distribution over volume appearance (or video evidence e_c).

Figure 3.1: Bayesian Network of Action Components

Figure 3.1 gives an example of an action represented as a Bayesian Network of its components. We assume conditional independence among components of the action due to computational complexity; however, our method produces encouraging results despite the trade-off. The joint probability of the given network is as follows, where the Boolean random variable A indicates presence or absence of the action:

P(A, L_1, L_2, \ldots, L_n, e_1, e_2, \ldots, e_n) = P(A) \prod_{c=1}^{n} P(L_c \mid A) \, P(e_c \mid L_c)    (3.1)

We refer to the terms P(L_c \mid A) and P(e_c \mid L_c) as the structure and composition models of component c, respectively.
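Because of the factorization in Equation 3.1, an action can be scored in log space by maximizing each component's contribution independently over its latent location. The following is a minimal Python sketch of that scoring step; the `candidate_locations`, `structure`, and `composition_loglik` fields and the `descriptor_fn` callable are hypothetical hooks standing in for the models defined in the following subsections, not part of the original implementation.

```python
import numpy as np

def score_action(components, descriptor_fn, prior_action=0.5):
    """Maximize the log of Eq. (3.1) over the latent component locations.

    components    : list of dicts, one per component, with hypothetical fields
                    'candidate_locations' (iterable of (x, y, t) grid points),
                    'size', 'structure' (callable giving P(L_c | A = yes)), and
                    'composition_loglik' (callable giving log P(e_c | L_c)).
    descriptor_fn : callable (location, size) -> volume descriptor (Section 3.3.2).
    """
    total_log = np.log(prior_action)          # log P(A = yes)
    best_locations = []
    for comp in components:
        best, best_loc = -np.inf, None
        for loc in comp["candidate_locations"]:
            d = descriptor_fn(loc, comp["size"])
            ll = (np.log(comp["structure"](loc) + 1e-12)   # structure model P(L_c | A)
                  + comp["composition_loglik"](d))         # composition model, Eq. (3.2)
            if ll > best:
                best, best_loc = ll, loc
        # components are conditionally independent given A, so each component is
        # maximized separately and its best score is accumulated
        total_log += best
        best_locations.append(best_loc)
    return total_log, best_locations
```

Thresholding the returned score gives the recognition decision, and the returned locations provide the localization, as detailed in Section 3.3.6.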
3.3.2 Volume Descriptors

We use a dense representation of spatiotemporal volumes based on Histograms of Oriented Gradients (HOG) [10] and/or Histograms of Optical Flow (HOF). To compute the descriptor, we virtually crop out the volume v from the video. Then each image in the volume is divided into 2x2 non-overlapping pixel regions, or blocks. Each block is further sub-divided into 2x2 non-overlapping pixel regions, or cells. For each cell, we compute 1-D histograms of oriented gradients and oriented optical flows, as they capture local shape and motion relatively robustly. The gradients are quantized into one of nine orientation bins, whereas the flows are quantized into one of 5 bins (4 directions in addition to no motion). Each pixel votes for up to two gradient and two flow orientation bins with strength proportional to the magnitude and the distance from the center of the bin. We use grayscale images for the computation of gradients and optical flow. The histogram of each cell is normalized with respect to the energy (gradient energy for HOG, flow energy for HOF) of the block of which the cell is a part. For each block, the HOG and HOF histograms of each cell are concatenated (when using both features) to give one (9+5)x2x2 = 56 dimensional feature vector per block. The descriptors for all the blocks in the image are then concatenated to give a 56x4 = 224 dimensional feature vector for each image. Finally, we temporally segment the volume into two parts, compute the mean of the feature vectors for the frames in each segment and concatenate the mean feature vectors to give a 448-dimensional descriptor for the video volume. Figure 3.2 illustrates the computation of volume descriptors.

Figure 3.2: Step-wise process of computing volume descriptor

3.3.3 Composition Model

Each action component is defined by the size of the spatiotemporal volume it covers. For each component, we randomly fix its size as a fraction of the action volume under consideration. To learn the composition (appearance) model of component c, we select a random relative location \bar{\ell}_c for it. Then for each video v, we compute descriptors d^v_{\ell_c} for spatiotemporal volumes of size equal to the component centered at location \bar{\ell}_c. We then separately compute, for positive and negative examples of action a, the mean and variance of the descriptors d^v_{\ell_c}. We represent the mean and variance of positive examples as \mu_{cp} and \sigma_{cp}, respectively, and the mean and variance of negative examples as \mu_{cn} and \sigma_{cn}, respectively. The likelihood of the observed spatiotemporal volume around location \ell corresponding to component c is given by Equation 3.2, where Z is the normalization factor. We define the likelihood function P(e_c \mid \ell_c) as the logarithm of the ratio of Normal distributions N_{cp}(\cdot) and N_{cn}(\cdot) of the associated K-dimensional volume descriptor d_{\ell}, with means \mu_{cp} and \mu_{cn} and covariances \Sigma_{cp} = diag(\sigma_{cp}) and \Sigma_{cn} = diag(\sigma_{cn}), respectively:

\log P(e_c \mid \ell_c) = \log Z + \log\left( N_{cp}(\cdot) / N_{cn}(\cdot) \right)
                        = -\frac{1}{2} \sum_{i=1}^{K} \left[ \log \sigma^i_{cp} - \log \sigma^i_{cn} + \frac{(d^i_{\ell_c} - \mu^i_{cp})^2}{\sigma^i_{cp}} - \frac{(d^i_{\ell_c} - \mu^i_{cn})^2}{\sigma^i_{cn}} \right]    (3.2)
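A small sketch of the composition term of Equation 3.2 follows, assuming NumPy arrays hold the learned per-dimension statistics. The normalization constant log Z is dropped since it is identical for every candidate location and therefore does not affect the comparison of locations.

```python
import numpy as np

def composition_loglik(d, mu_p, var_p, mu_n, var_n):
    """Log-ratio of diagonal Gaussians from Eq. (3.2), up to the constant log Z.

    d            : K-dimensional volume descriptor d_l (HOG/HOF, Section 3.3.2)
    mu_p, var_p  : per-dimension mean and variance of positive training examples
    mu_n, var_n  : per-dimension mean and variance of negative training examples
    Variances are assumed strictly positive (e.g., floored at a small epsilon).
    """
    d, mu_p, var_p, mu_n, var_n = map(np.asarray, (d, mu_p, var_p, mu_n, var_n))
    log_det_term = np.log(var_p) - np.log(var_n)
    maha_term = (d - mu_p) ** 2 / var_p - (d - mu_n) ** 2 / var_n
    return -0.5 * float(np.sum(log_det_term + maha_term))
```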
3.3.4 Structure Model

The relative location of a component with respect to the center of the action volume is modeled as a latent random variable, L_c. The structure model encodes the likelihood of the location of a component given the presence or absence of an action, P(L_c \mid A). We use four different models to encode the structure of components: i) Unit Impulse model, ii) Uniform model, iii) Gaussian model, and iv) Non-parametric model.

Unit Impulse Model: A unit impulse function outputs 1 only for a certain input and 0 for all other inputs. We estimate the likelihood of a component's location given the occurrence of action a as a unit impulse function. This creates a non-flexible component based model which heavily emphasizes agreement between the structure of the observation and the action model. The unit impulse model is given below, where \bar{\ell}_c is the randomly initialized location for component c:

P(\ell_c \mid a = yes) = 1 if \ell_c = \bar{\ell}_c; 0 otherwise    (3.3)

Uniform Model: The unit impulse model is very restrictive, as the location of components may change due to changes in viewpoint, scale and speed. This restrictiveness may lead to better precision at the cost of recall. To make our model more flexible, we also suggest the use of a Uniform distribution as the structure model. However, using a uniform distribution over the whole action volume would make the spatiotemporal structure of the action unimportant and make the approach equivalent to bag-of-words approaches. Therefore, we restrict the non-zero support of the structure model to the neighborhood of the randomly initialized component location. The parameters of the neighborhood function Neighbor(\ell) are fixed:

P(\ell_c \mid a = yes) = 1 / |Neighbor(\bar{\ell}_c)| if \ell_c \in Neighbor(\bar{\ell}_c); 0 otherwise    (3.4)

Gaussian Model: A Gaussian function offers a compromise between the unit impulse and the uniform model. The randomly initialized location is weighted more than any other location in its neighborhood, while the variance, \sigma^2, of the Gaussian function controls the volume of the effective neighborhood:

P(\ell_c \mid a = yes) = N(\ell_c; \bar{\ell}_c, \sigma)    (3.5)

Non-parametric Model: We also propose the use of a non-parametric distribution over quantized locations as the structure model. To estimate the non-parametric structure model for the presence of an action, P(L_c \mid A = yes), we use positive and negative examples of the action in the training set. For each component, we use its size to quantize the spatiotemporal volume of the video sequence into a normalized three dimensional grid. We quantize the video locations so that along each dimension, the size of the bin is equal to half the size of the component volume along the corresponding dimension. We assign each location on the grid a score equal to the likelihood of the volume appearance if the component is centered at that location. We then accumulate scores for each location and finally normalize by the number of examples. We repeat this process for the negative examples in the training set to estimate the structure model for the absence of the action, P(L_c \mid A = no). In practice, for efficiency, we only score locations on the grid which are in the 11x11x11 neighborhood of the randomly selected location for the component.
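The three parametric structure models above can be written as one small function over the quantized location grid. The sketch below uses illustrative values for the neighborhood radius and the Gaussian spread (neither is specified here), and leaves the non-parametric model to a learned lookup table as just described.

```python
import numpy as np

def structure_prior(loc, anchor, model="gaussian", sigma=1.5, radius=2):
    """P(L_c = loc | A = yes) under the parametric structure models of Section 3.3.4.

    loc, anchor : (x, y, t) grid coordinates; `anchor` is the randomly initialized
                  component location; sigma and radius are illustrative values.
    """
    offset = np.asarray(loc, float) - np.asarray(anchor, float)
    if model == "impulse":            # Eq. (3.3): all mass at the anchor location
        return 1.0 if np.all(offset == 0) else 0.0
    if model == "uniform":            # Eq. (3.4): flat over a cubic neighborhood
        inside = bool(np.all(np.abs(offset) <= radius))
        return 1.0 / (2 * radius + 1) ** 3 if inside else 0.0
    if model == "gaussian":           # Eq. (3.5): isotropic Gaussian around the anchor
        return float(np.exp(-0.5 * offset @ offset / sigma ** 2)
                     / ((2 * np.pi) ** 1.5 * sigma ** 3))
    # the non-parametric model would be a normalized table of accumulated appearance
    # scores indexed by the quantized location bin, learned as described above
    raise ValueError("unknown structure model: " + model)
```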
3.3.5 Training

Training action models in our approach involves learning the structure and composition models of the action components. To find an optimal number of action components, we first train our models for a large number of components (200 in our experiments) and then search for components which optimize performance on the test set. First, we randomly initialize the location and size of each component, followed by learning their composition models. In case we have a skewed distribution of positive and negative examples of an action, we randomly select negative examples equal to 4 times the number of positive samples for training. For the non-parametric structure model, we then learn the likelihood distribution of component locations in positive and negative examples of the training set and use their ratio at each location as the estimate of the likelihood function.

An exhaustive search for an optimal subset of components is computationally prohibitive for this many action components. Therefore, we group action components into groups of 25 and perform an exhaustive search over the groups of components for each action to find the optimal set of component groups. Conditional independence of action components given the action allows us to compute the maximum likelihood of each component only once and store it, which makes the component search very efficient.

3.3.6 Recognition

Given a video, we compute the likelihood of each component in the action model centered at different spatiotemporal locations. For each location, we compute the volume descriptor for the spatiotemporal volume of size equal to the action component. To speed up the process, we use integral histograms of gradients and optical flows. Therefore, for each location, it takes only a few memory reads to compute the volume descriptor. The likelihood of the component at each location is then weighted by the structure model. The conditional independence assumption for action components allows us to efficiently find the most likely component configuration, i.e., the component configuration that yields the maximum log-likelihood:

\langle \ell^*_1, \ldots, \ell^*_n \rangle = \arg\max_{\langle \ell_1 \in L_1, \ldots, \ell_n \in L_n \rangle} P(a) \prod_{c=1:n} P(\ell_c \mid a) \, P(e_c \mid \ell_c)    (3.6)

The decision about the presence of an action is made by thresholding the probability associated with the most likely component configuration.

3.4 Experimental Evaluation

We evaluate our method on the benchmark Hollywood Human Actions (HOHA) dataset [52]. HOHA contains 430 videos, which are taken from different Hollywood movies. The dataset is very challenging because the videos are taken in unconstrained and realistic environments and have significant camera motion. In addition, it also contains rapid scene changes and significant clutter, which makes the problem very complex. Visibility of actors also varies among videos, and the included actions also involve interactions with other agents, such as in "kiss" or "get out of car". The relatively small size of the training set also makes it difficult to capture large variations in the manifestation of the included actions, e.g., "kiss" or "sit down". Furthermore, the length of the videos varies from a few seconds to a few minutes.

For the HOHA dataset, we used the same experimental setting as [52] for the recognition task, in which the test set has 211 videos with 217 labels and the training set has 219 videos with 231 labels. The videos are manually annotated; the annotations provide start and end frames of each action in the video and have, in some cases, more than one label assigned to one segment. Therefore, we evaluate the performance of our trained models using average precision (AP) on the precision/recall curve.

Class           Impulse  Uniform  Gaussian  Non-Parametric  Best
Answer Phone    17.1%    15.9%    16.0%     16.4%           17.1%
Get Out of Car  18.3%    19.1%    19.1%     37.6%           37.6%
Hand Shake      21.9%    21.2%    21.6%     19.1%           21.9%
Hug Person      19.8%    26.4%    26.3%     25.1%           26.4%
Kiss            50.5%    41.9%    41.6%     45.3%           50.5%
Sit Down        20.3%    17.2%    17.3%     17.5%           20.2%
Sit Up          15.4%    14.7%    14.7%     14.6%           15.4%
Stand Up        34.8%    42.2%    42.2%     40.7%           42.2%
Mean            24.8%    24.8%    24.8%     27.0%           28.9%

Table 3.1: Class-wise average precision of our method when using different structure models.

To train each model,
we use only the segment of the video that corresponds to the action, and we use a one-versus-all classification approach to train our models for HOHA, similar to Laptev et al. [52]. For each video in the test set, we take the maximum score for each model and use these scores to obtain precision-recall curves.

Table 3.1 shows the class-wise performance of our method when using different structure models. We obtain a mean per-class average precision (AP) of 27.04% when using the Non-Parametric structure model, and a mean per-class AP of 28.9% when we select the structure model with the best AP for each action class. We observed that our method gives its best performance for the actions Kiss and Stand Up, both of which have more than 45 positive examples. This dependency on the amount of training data can be attributed to the learning of the composition models.

Table 3.2 compares the performance of our method with other published results on the HOHA dataset. Our method gives promising performance on this very challenging dataset in comparison to others, although it does not improve on the state of the art. We attribute the deficiency in performance relative to the state of the art to the random decomposition of the action volume and the weakness of the features in capturing the spatiotemporal evolution of each component. In our opinion, actions like Sit Up and Sit Down have short durations and lack structure, so a component based approach is probably not well suited for them.

Method                        MAP
Matikainen et al. [61]        22.8%
Klaser et al. [47]            24.7%
Laptev et al. [52]            38.4%
Yeffet et al. [100]           36.8%
Sun et al. [87] (TTD)         30.3%
Sun et al. [87] (TTD-SIFT)    44.9%
Shandong et al. [96]          47.6%
Raptis et al. [76]            32.1%
Raptis et al. [75]            40.1%
Our method
 - Impulse Model              24.78%
 - Uniform Model              24.81%
 - Gaussian Model             24.88%
 - Non-parametric Model       27.04%
 - Best Model*                28.93%

Table 3.2: Mean average precision (MAP) performance comparison on the HOHA dataset. Performance of our system for different structure models is given after component selection. *Best Model means we selected, for each action, the structure model which produced the best AP for that action when other parameters were fixed.

Further, we noticed that the annotations do not provide tight temporal boundaries of actions. This favors loose modeling frameworks, such as [52, 87, 96].

Figure 3.4 shows sample frames from some of the videos in which our system successfully recognized the activity. It should be noted that Figure 3.4(f) shows an example video where more than one activity is performed simultaneously; our system was able to recognize both the Sit Down and Answer Phone activities in that particular video.

3.5 Conclusion

In this chapter, we introduced a component based approach to recognize actions and evaluated its performance on a benchmark dataset. We randomly decomposed actions into spatiotemporal components and represented each part as a Gaussian distribution of volume descriptors. The location of each component is modeled as a latent variable. We investigated the use of different ways to model the distribution of component locations. Our method produced promising results on the HOHA benchmark dataset in comparison to other results, without improving the state of the art. We observed that a low-level representation of action components is powerful only when sufficient training data is available.

Figure 3.3: Precision-Recall curves for each action model: (a) Answer Phone, (b) Get Out of Car, (c) Hand Shake, (d) Hug Person, (e) Kiss, (f) Sit Down, (g) Sit Up, (h) Stand Up.
Figure 3.4: Sample frames from selected videos for which the human activity was successfully recognized by our system: (a) Answer Phone, (b) Get Out of Car, (c) Hand Shake, (d) Hug Person, (e) Kiss, (f) Sit Down and Answer Phone, (g) Sit Up, (h) Stand Up. The results shown are produced for the Impulse Model, with binarization of the output done to achieve the maximum F1 score.

Chapter 4

Full-body activity detection in complex environments using Conditional Bayesian Networks

4.1 Introduction

In the previous chapter, we presented an approach to randomly decompose actions into spatiotemporal components and capture their characteristics using low-level features. The method provides the basic framework to recognize actions via detection of parts. However, the components do not bear any semantic meaning and are conditionally independent of each other.

In this chapter, we propose another method of action decomposition based on high level structures of the action and the object of interaction. We decompose actions structurally and temporally, and relax the conditional independence assumption among adjacent temporal components. The use of action and object concepts also allows us to produce action descriptions. Here, we consider the case of human-object interaction where the objects may be considerably large in size, such as bicycles or cardboard boxes, so that they provide a strong context for the human pose while occluding the actor enough to make reliable pose estimation difficult. Our goal is to generate structured descriptions for videos of multiple unsegmented actions. The actions have high intra-class variance and are performed using objects of different types and sizes, and recorded from different viewpoints. The description identifies not only the actions but also their intervals, actors and objects of interaction (if any). Structured descriptions can be useful for tasks like natural language description, video indexing and content summarization.

Among other reasons, action recognition is complex due to occlusion of actors and objects, as well as high intra-class variance arising from different manifestations and interaction with objects of different types (Fig. 4.1). Many recognition systems have been proposed for segmented actions with or without small objects [51, 29, 69, 52, 41, 78]. Using a variable length sliding window approach to localize actions with these methods is computationally expensive. Although [67] can segment actions in linear time, it allows only one action per frame. Furthermore, actor-object interaction is often not modeled because small objects are hard to detect and they do not considerably occlude actors to cause problems. Explicit modeling of actor-object interaction for small objects improves recognition results in [29, 45]; however, the performance would suffer for interaction with large objects due to the reliance on human pose estimation. Also, action variations were often restricted to the same manifestation performed by different actors. Our system does not make such assumptions regarding segmentation, object size, manifestations or multiplicity of actions.

We propose a component based approach to the recognition of actions, akin to part-based recognition of objects, to address these challenges. For object recognition, part based approaches perform well given occlusions and high intra-class and viewpoint variances. In our system, an action is recognized by detecting its components and reasoning about their consistency with the action structure.
Small and simple components with low variance are combined to represent complex actions with high variance. This also allows for hierarchical composition of actions; for example, exchange can be composed of give and take actions overlapping in time. A block diagram of our approach is given in Fig. 4.2.

Figure 4.1: Intra-class variance. Different manifestations of 'Catching' a ball, while: (a) running, (b) on a bike, (c) jumping. A person 'Colliding' with different objects: (d) another person, (e) a pole, (f) a chair.

We take an actor centered approach to action decomposition. An action component is a group of primitive actions and actor-object relationships based on a temporal order. Although generic object detection is not very robust, human detection and tracking have improved considerably in the past few years, which makes our approach viable. We use HMMs to recognize sub-components (primitive actions and relationships, e.g., moving, crouching, etc.) from noisy human and object tracks in time O(T). Then for each action, we use a Conditional Bayesian Network (CBN) to combine evidence gathered over the intervals of its components. (A traditional BN is a generative model and encodes the complete distribution, whereas a CBN does not predict the probability of the observations.)

Our approach is a complement of the one taken in [34], where the lower level inferences are made by a Bayesian Network and the higher level inferences by an HMM. The latter has an inherent difficulty in not being robust to a lack of observations (e.g., missing tracks) at the lowest levels. Instead, we use HMMs to detect and segment lower level actions and use a CBN for higher level reasoning; this approach is much more robust to missing data.

Figure 4.2: Block diagram of our recognition and description system

We evaluate our method on a subset of the very challenging Mind's Eye dataset [12]. The dataset has different viewpoints of multiple manifestations of 48 human actions performed with or without objects. We show encouraging action recognition results using our method. We also show some examples of producing English language descriptions from our structured output. In the end, we show some results for the action detection and localization task in long videos. To summarize our multi-fold contribution:

1. We realize a component based action recognition approach which uses Conditional Bayesian Networks to recognize multiple actions in a video under challenging conditions.

2. Our method performs well under high intra-class variance arising from different manifestations and interaction with objects of different sizes and types.

3. Our approach can recognize action variations (manifestations/interactions) which may not be present in the training set at all.

4. Our approach can generate structured video descriptions identifying actions, intervals, actors and objects, which can be used for natural language descriptions, anomaly detection, indexing, search, etc.

In the following, we start by briefly reviewing related work in Section 4.2. Section 4.3 gives details of our recognition method. We present the results of our experiments in Section 4.4 before concluding in Section 4.5.

4.2 Related Work

To paint an overall picture, we group approaches for single actor action recognition into feature-centered and human-centered approaches. The most popular approach towards activity recognition over the last decade has been the feature-centered approach. Statistics of low-level features are collected and classified using machine learning techniques.
Features can be extracted either around static point locations [5, 28] or dynamically around keypoints [51, 52, 62]. Although these algorithms work well, they are usually suitable only for segmented videos with known action boundaries. Further, the mapping from features to class label is 'opaque', i.e., the reasons for failure and success cannot easily be told. In comparison, our system outputs structured descriptions which include information about the actor, object, action interval and action components and sub-components. Thus, our method provides a transparent processing pipeline and errors can be easily traced from the latest to the earliest stage of the pipeline. This is a key distinguishing feature of our system from statistical feature based approaches.

Human-centered approaches usually use graphical models to model spatio-temporal constraints on actors and actions. Both generative and discriminative models have been proposed [86, 73, 65, 67]. Human centered approaches can model detailed actor-object interactions, like [70, 29, 45]. However, they depend on pose estimation, which is a challenging problem under occlusion. Therefore, these methods cannot work reliably for large objects of interaction. Also, as most of these systems try to recognize activities as a whole, more data is required to train these methods when the intra-class variance increases.

To recognize interactions involving more than one person, such as playing basketball, both graphical models [40, 11, 31, 83] and logic based [77, 84, 64, 6] systems have been proposed. [89] allows for uncertainty in logic formulae by assigning weights to them and derives a Markov Logic Network (MLN) from these weighted rules to recognize person-person and person-vehicle interactions. [64] uses Allen's Interval Logic to recognize high level activities from primitive events using an MLN for a one-on-one basketball game. [84] introduces Event Logic and [6] gives its probabilistic version, Probabilistic Event Logic (PEL). These methods work best when the logic rules are well defined; to create such rules for various domains is a major challenge in itself.

4.3 Action Representation and Recognition

Below we discuss our generic action representation, recognition model and inference.

4.3.1 Action Representation

We decompose actions into simpler actions and actor-object relationships, such as moving, crouching and close. These components are then grouped together based on their temporal order to give a temporal segmentation of the actions. We fix the number of temporal segments for each action and allow segment lengths to vary to deal with variations in the speeds of actions. Fixing the number of segments for each action can be viewed as fixing the number of hidden states in an HMM. The segments can have variable durations and our inference system can bridge gaps between the observations of these segments. Hence, the fixed number indicates a trade-off between complexity and flexibility.

We model each action as a finite state machine (FSM) and the temporal segments of the action as the states of the FSM. The transition model of the FSM enforces the relative order of these segments. Simpler actions and actor-object relationships during each segment are modeled as components of the FSM states. For brevity, from here on we refer to the primitive sub-states of the FSM as primitives and the states corresponding to temporal segments of an action as composites.

Figure 4.3: FSM Model for Actions: (a) Generic Action Model, (b) Action Model for Catch. Primitives are shown in red.
Action structures, i.e., the number of states and their definitions, can be learned from data using machine learning techniques; however, the states so learned may not have semantic meaning. Thus, we manually select semantically meaningful states to define actions, as they make the task of video description easy. We choose properties and relations of actors and objects as primitives, such as moving, crouching, close. Primitives are grouped into composites; hence, the composites are also semantically meaningful. The composition of each action (its composites and their order) is also specified by hand. Manual specification of action models is relatively easy as the elements are semantic; it allows for natural descriptions and requires only limited training data.

Figure 4.3 shows our action model. Each action a has m composites \{c^a_j\}_{j=1}^{m} and each composite c^a_j is comprised of n_j primitives \{p^{c^a_j}_k\}_{k=1}^{n_j}, with m, n_j > 0. Further, a composite definition can be shared by more than one action, and a primitive definition can be shared by multiple composites.

4.3.2 Recognition Model

Each action model is formalized as a separate Conditional Bayesian Network (CBN) (Fig. 4.4). The root node random variable, A = \langle T, S, O \rangle, represents a tuple where T is the interval of occurrence of action a performed by subject (actor) S using object O (which may be null). Nodes \{C^a_j\}_{j=1}^{m} correspond to the m composites of action a, and nodes \{P^{c^a_j}_k\}_{k=1}^{n_j} correspond to the n_j primitives of composite c^a_j. Each internal node's random variable also represents a tuple: C^a_j = \langle TC^a_j, S, O \rangle and P^{c^a_j}_k = \langle TP^{c^a_j}_k, S, O \rangle, where TC^a_j and TP^{c^a_j}_k are the variables representing the intervals of occurrence of composite c^a_j and primitive p^{c^a_j}_k, respectively. Each CBN encodes temporal constraints on composites and primitives, and their relative importance. Observables for the CBN are features computed from human and object tracks.

To simplify our notation, we consider only one action, one subject and one object at a time. For multiple actions, subjects and objects, we repeat the recognition process for each combination. Thus, after rewriting variables c^a_j and p^{c^a_j}_k as c_j and p^j_k, respectively, and dropping variables S and O from the node variables, we get A = \langle T \rangle, C_j = \langle TC_j \rangle and P^j_k = \langle TP^j_k \rangle. The conditional probability of the hidden nodes given the evidence for the CBN, using the simplified notation, is given by:

Pr\left(A, \{C_j, \{P^j_k\}_{k=1}^{n_j}\}_{j=1}^{m} \mid e\right) \propto \left( Pr(C_1 \mid A) \prod_{l=1}^{n_1} Pr(P^1_l \mid C_1) \, Pr(P^1_l \mid e) \right) \left( \prod_{j=2}^{m} \prod_{k=1}^{n_j} Pr(C_j \mid C_{j-1}, A) \, Pr(P^j_k \mid C_j) \, Pr(P^j_k \mid e) \right)    (4.1)

where the terms Pr(C_j \mid C_{j-1}, A), Pr(C_1 \mid A) and Pr(P^j_k \mid C_j) enforce temporal and structural constraints, Pr(P^j_k \mid e) evaluates the evidence for the primitives, and Pr(A) is uniform. The term Pr(C_j \mid C_{j-1}, A) is modeled as:

Pr(C_j \mid C_{j-1}, A) = Pr(C_j \mid C_{j-1}) \, Pr(C_j \mid A)    (4.2)

where Pr(C_j \mid C_{j-1}) enforces temporal order and Pr(C_j \mid A) encodes importance. We give functional forms of the terms in Eq. (4.1) in the following subsection.

Figure 4.4: Conditional Bayesian Network for Action Recognition: (a) Generic Recognition Model, (b) Model for Catch.
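Since every factor of Equation 4.1 attaches a probability to either a composite or one of its primitives, the score of a particular assignment of intervals is simply a sum of log terms. The sketch below assumes each term has already been computed (the functional forms are given in the next subsection); the dictionary field names are hypothetical.

```python
import numpy as np

def cbn_log_posterior(action_prior, composite_terms):
    """Log of Eq. (4.1) for one assignment of intervals to the CBN nodes.

    composite_terms : ordered list with one dict per composite c_j containing
      'log_order'      : log Pr(C_j | C_{j-1}, A)   (log Pr(C_1 | A) for j = 1)
      'log_structural' : list of log Pr(P_k^j | C_j) for its primitives
      'log_evidence'   : list of log Pr(P_k^j | e)  for its primitives
    """
    log_p = np.log(action_prior)          # Pr(A) is uniform in our model
    for term in composite_terms:
        log_p += term["log_order"]
        log_p += sum(term["log_structural"])
        log_p += sum(term["log_evidence"])
    return log_p
```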
4.3.3 Inference

For each subject and object pair (the object may be null), the probability of occurrence of an action a over an interval t, together with its associated composites and primitives, can be computed by adapting Eq. (4.1) for the corresponding CBN. We pose the recognition problem as finding all hidden node assignments s = \langle a, \{c_j, \{p^j_k\}_{k=1}^{n_j}\}_{j=1}^{m} \rangle = \langle \langle t \rangle, \{\langle tc_j \rangle, \{\langle tp^j_k \rangle\}_{k=1}^{n_j}\}_{j=1}^{m} \rangle for which:

1. The probability of s given e (Eq. (4.1)) is greater than some threshold th: Pr(s \mid e) > th.

2. The probability of s is greater than the probability of any other assignment s' with a = a'; i.e., instead of reporting multiple assignments with the same interval t = t', we only report the most likely configuration of components.

3. The interval t of s is not a sub-interval of the interval t* of any other optimal assignment s*; i.e., between two assignments, the one with the longer action interval is preferred.

Algorithm 1: Pseudo-code for Action Recognition
1: Apply primitive detectors to generate hypotheses HP^j_k for each primitive p^j_k. HMMs can be used to do this in O(F), where F is the number of frames: HP^j_k = \{ \arg_{p^j_k} (Pr(p^j_k \mid e) > th_p) \}.
2: Use the primitive interval hypotheses to hypothesize intervals for the composites c_j: HC_j = \{ \bigcap_{k=1}^{n_j} tp^j_k : tp^j_k \in HP^j_k, \ |\bigcap_{k=1}^{n_j} tp^j_k| > 0 \}.
3: Hypothesize the interval t for action a from the hypotheses for its composites c_j: HA = \{ \bigcup_{j=1}^{m} tc_j : tc_j \in HC_j \}, where t_1 \cup t_2 = [f^b_1, f^e_1] \cup [f^b_2, f^e_2] = [f^b_1, f^e_2].
4: Accept an assignment hypothesis if the probability in Eq. (4.1) is greater than the threshold th.

Enumerating Eq. (4.1) for all possible assignments is computationally prohibitive because each node has an interval component. Therefore, we use the bottom-up approach given in Algorithm 1 to efficiently find assignments which satisfy the above three conditions. We assume that for any interval t of action a, the likelihoods of the intervals of each of its composites are uniform over the range of all sub-intervals of t:

Pr(c_j = \langle tc_j \rangle \mid a = \langle t \rangle) = 1, \quad tc_j \sqsubseteq t    (4.3)

where tc_j \sqsubseteq t means tc_j is a sub-interval of t. Similarly, for any interval tc_j of composite c_j, the likelihoods of the intervals of each of its primitives are uniform over the range of all sub-intervals of tc_j:

Pr(p^j_k = \langle tp^j_k \rangle \mid c_j = \langle tc_j \rangle) = 1, \quad tp^j_k \sqsubseteq tc_j    (4.4)

where tp^j_k \sqsubseteq tc_j means tp^j_k is a sub-interval of tc_j.

We consider only those intervals which are implied by the observations. First, we generate primitive hypotheses p^j_k by applying primitive detectors. A primitive hypothesis is one for which Pr(p^j_k \mid e) > th_p. Any primitive detection method can be used, but an HMM based detector can segment the video in O(F) without a threshold, where F is the number of frames. For efficiency, we drop any hypothesis which is a sub-interval of another hypothesis, as it results in an assignment with a shorter interval, although possibly with higher probability. However, the threshold th can be adjusted to accept the hypothesis with the longer interval.

Pr(p^j_k \mid e) \propto \frac{\sum_{f \in tp^j_k} \log P(p^j_{kf} \mid e_f)}{|tp^j_k| + 1}    (4.5)

Once the primitives are detected, we use their intervals to hypothesize intervals tc_j for the composites c_j. The largest valid length of an interval for a composite is given by the intersection of the intervals of its primitives (because of the minimal representation). The temporal order of the composites is modeled as a function of their intervals, which penalizes the 'gap' between the observations of two composites. Let the interval tc be defined as [fc^b, fc^e]; then:

Pr(c_j = \langle tc_j \rangle \mid c_{j-1} = \langle tc_{j-1} \rangle) =
  N(0, \sigma)(fc^b_j - fc^e_{j-1})   if fc^b_j \geq fc^e_{j-1}
  1                                    if fc^b_{j-1} < fc^b_j < fc^e_{j-1}
  0                                    otherwise    (4.6)

To generate interval hypotheses for an action, we consider all possible combinations of hypotheses of its composites by choosing at least one hypothesis for each composite. For each combination, the interval of the action is hypothesized as the union of the intervals of its composites. The union of two intervals means the smallest interval that includes both of them. Due to Eq. (4.6), the joint probability is maximized when the longest composites with minimum gaps are used. Therefore, the assignment s hypothesized bottom-up from the evidence satisfies both conditions 2 and 3 above. Finally, we compute and threshold the probability of s using Eq. (4.1) to determine the occurrence of the action.
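The interval arithmetic behind Algorithm 1 and the order term of Equation 4.6 can be sketched as follows. The enumeration over all combinations is written naively with itertools and would be pruned by the thresholds th_p and th in practice; the data layout (lists of candidate intervals per primitive and per composite) is a hypothetical simplification rather than the actual implementation.

```python
import itertools
import numpy as np

def intersect(intervals):
    """Intersection of frame intervals [b, e]; None if empty (step 2 of Algorithm 1)."""
    b, e = max(iv[0] for iv in intervals), min(iv[1] for iv in intervals)
    return (b, e) if b <= e else None

def union(intervals):
    """Smallest interval containing all the given intervals (step 3 of Algorithm 1)."""
    return (min(iv[0] for iv in intervals), max(iv[1] for iv in intervals))

def order_logprob(prev, cur, sigma=15.0):
    """Log of the temporal-order term of Eq. (4.6) (unnormalized Gaussian gap penalty)."""
    (pb, pe), (cb, ce) = prev, cur
    if cb >= pe:                      # current composite starts after the previous ends
        return -0.5 * ((cb - pe) / sigma) ** 2
    if pb < cb < pe:                  # overlapping, but starts correctly ordered
        return 0.0
    return -np.inf                    # out of order

def action_hypotheses(primitive_hyps):
    """Enumerate (action_interval, composite_intervals) pairs bottom-up.

    primitive_hyps : list over composites; each entry is a list over that composite's
    primitives, holding the candidate intervals [(b, e), ...] from the detectors.
    """
    composite_hyps = []
    for prim_lists in primitive_hyps:
        hyps = []
        for combo in itertools.product(*prim_lists):
            iv = intersect(combo)
            if iv is not None:
                hyps.append(iv)
        composite_hyps.append(hyps)
    for combo in itertools.product(*composite_hyps):
        yield union(combo), list(combo)
```

Each generated hypothesis would then be scored with the accumulation of Eq. (4.1) shown earlier, using order_logprob for the Pr(C_j | C_{j-1}) factor.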
4.3.3.1 Primitive Detection

Different methods can be used to detect primitive states and their intervals. Our framework is independent of the specifics of primitive detection, provided it gives an estimate of the probability of the primitive over the interval. However, primitive detection should be robust in a bottom-up approach, as failures at the lower levels of the hierarchy are propagated upwards. We choose Hidden Markov Models (HMMs) to detect primitives because they can simultaneously detect and segment the intervals in O(F), where F is the video length, using the Viterbi algorithm, whereas a sliding window based approach for a bag-of-words like method takes O(F^2).

We define a separate HMM for each primitive. The state variable of each HMM, p_f, is binary and indicates the occurrence of primitive p at each time instant f. We use a random walk transition model for all primitives:

Pr(p_{f+1} \mid p_f) = 0.5, \quad p_{f+1} = p_f    (4.7)

Our primitive definitions are centered around the notion of actor and object. Therefore, for the observation model we compute features, such as speed, distance between centers, relative locations, etc., of tracks of humans, objects and motion blobs (entities) from the video. We assume the prior probabilities of observations and states to be uniform and replace the observation model by the likelihood term in the HMM. The posterior probability of each primitive, Pr(p_f \mid e_f), is computed by multiplying either logit or Gaussian functions over the related features. At test time, the Viterbi algorithm is applied to compute the most likely path of primitive states for each entity and/or entity pair.

During training, instead of optimizing the HMMs independently, we use parameters which optimize the performance of the recognition system as a whole. We initialize the parameters and then search their neighborhood for parameters which maximize the mean AP of the action models on the training data. The parameters can easily be initialized intuitively, as the features are simple and their number is small.
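A minimal sketch of one binary primitive detector follows, assuming per-frame log-likelihoods have already been computed from the track features. With the uniform random-walk transitions of Equation 4.7 the decoding degenerates to a per-frame argmax; the Viterbi recursion is kept explicit only to show where a non-uniform transition model would plug in.

```python
import numpy as np

def detect_primitive(frame_logliks):
    """Viterbi decoding for one binary primitive HMM (Section 4.3.3.1).

    frame_logliks : (F, 2) array of per-frame log-likelihoods for the primitive
    being absent (column 0) or present (column 1), e.g. from logit or Gaussian
    functions over track features. Returns the decoded states and the contiguous
    'present' runs as interval hypotheses.
    """
    F = frame_logliks.shape[0]
    log_trans = np.log(np.full((2, 2), 0.5))           # random-walk model of Eq. (4.7)
    delta = np.zeros((F, 2))
    back = np.zeros((F, 2), dtype=int)
    delta[0] = frame_logliks[0]
    for f in range(1, F):
        scores = delta[f - 1][:, None] + log_trans     # rows: previous state, cols: current
        back[f] = np.argmax(scores, axis=0)
        delta[f] = scores[back[f], [0, 1]] + frame_logliks[f]
    states = np.zeros(F, dtype=int)
    states[-1] = int(np.argmax(delta[-1]))
    for f in range(F - 2, -1, -1):
        states[f] = back[f + 1, states[f + 1]]
    # contiguous runs of state 1 become primitive interval hypotheses
    intervals, start = [], None
    for f, s in enumerate(states):
        if s == 1 and start is None:
            start = f
        elif s == 0 and start is not None:
            intervals.append((start, f - 1))
            start = None
    if start is not None:
        intervals.append((start, F - 1))
    return states, intervals
```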
4.4 Experiments

We apply our action recognition method to two different types of videos: i) short (segmented actions), and ii) long (unsegmented actions). Both types of videos may have multiple actions; however, since short videos are only a few seconds long, action localization is not needed. For long videos, we do both recognition and localization.

4.4.1 Recognition of Segmented Actions

For this task we used a subset of the Mind's Eye Year 1 dataset [12]. It is a very challenging dataset of multiple manifestations of 48 actions performed by different actors in different outdoor environments. Most of the actions require human interaction with some object. There can be multiple actors in the scene performing different actions, e.g., a person throws a ball to another person who catches it. Furthermore, not all actions are mutually exclusive; for example, a person may run before colliding with another person. Thus, multiple labels are assigned to each video.

Annotations for the videos are collected using Amazon Mechanical Turk. Annotators are asked to say yes or no about some action in a video. Human assigned labels are noisy because different examples of the same manifestation are labeled by different Turkers. On average, for each manifestation of an action, the mean accuracy of a human annotator with respect to the others is about 94%. Therefore, we use the majority vote of the annotators for each manifestation as ground truth for all examples of that manifestation.

Figure 4.5: Examples of variations in different subsets of the dataset: (a) training example of Catch, (b) test example of Catch, (c) training example of Push, (d) test example of Push.

We selected a subset of 8 actions, Catch, Collide, Haul, Pickup, Push, Run, Throw and Walk, for evaluation. Of the 366 videos corresponding to these actions included in our evaluation, we used 196 for training and 170 for testing. Some action manifestations in the test set are not present in the training set (Fig. 4.5). The set of primitives we use in our implementation is: Moving, Stationary, Close, Far, Moving-Towards, Moving-Away, Near-Torso, Near-Feet, Moving-Slow, Moving-Fast and Same-Y-Coordinates.

[Object Detection and Tracking] We first evaluated the performance of the object detection and tracking systems, as they provide input to the inference modules. For human detection we used an off-the-shelf pedestrian detector by Huang and Nevatia [36] and an association based tracker by Huang et al. [37]. The pedestrian detector has precision and recall of 78% and 74%, respectively, when evaluated on 360 videos (from the training and test sets). For object detection, we used [21] with some modifications to train models for 9 object classes: Bag, Ball, Bicycle, Box, Car, Chair, Garbage-Can, Motorbike and Toy-Car. (Object detection and tracking were courtesy of Dan Parks, iLab, University of Southern California.) The averages of per-class precision and recall of the object detection method are 11.96% and 37.40%, respectively. In our observation, [21] works well for large structured objects, such as chairs and motorbikes, in comparison to smaller objects with planar surfaces, such as boxes and garbage cans. In addition, we track objects as motion blobs, which are obtained using the background subtraction method in OpenCV. Fig. 4.6 shows some detection and tracking results.

Figure 4.6: Detection and Tracking of Humans, Objects and Motion Blobs. (a) The ball is detected as a motion blob but many spurious blob tracks exist. (b) The pole is not detected. (c) Humans and large objects are detected reliably. (d) The human and the box are merged into one blob.

[Recognition baseline] Due to the absence of previous results on the Mind's Eye dataset, we first obtained a baseline for the recognition task using a method similar to [79]. We use a bag-of-words representation for videos and classify them into concepts using an SVM classifier with a \chi^2 kernel. We use HOG and HOF features computed for dense trajectories [90] in our experiments. We fixed the codebook size at 4000 and did 5-fold cross validation to learn the parameters of the SVM. Each action model is learned independently in a one-against-all fashion. We measure the performance in terms of Average Precision (AP) (Table 4.1).

Method      Catch  Collide  Haul  Pickup  Push  Run   Throw  Walk  Avg
BoW         0.13   0.17     0.42  0.28    0.34  0.10  0.14   0.78  0.29
CBN (MGT)   0.22   0.26     0.60  0.19    0.59  0.19  0.11   0.86  0.38
CBN (HAT)   0.49   0.72     0.53  0.32    0.72  0.71  0.45   0.92  0.61

Table 4.1: AP of the CBN and BoW methods for the action recognition task (MGT: Machine Generated Tracks, HAT: Human Annotated Tracks).
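For completeness, a sketch of the bag-of-words baseline classifier described above is given below, assuming scikit-learn is available and that the 4000-bin histograms of quantized dense-trajectory HOG/HOF words have already been computed. The gamma value is illustrative and would be chosen together with C by cross-validation.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_bow_baseline(train_hists, labels, gamma=0.5, C=1.0):
    """One-against-all SVM with a chi-squared kernel over L1-normalized BoW histograms."""
    X = train_hists / np.maximum(train_hists.sum(axis=1, keepdims=True), 1e-8)
    K = chi2_kernel(X, gamma=gamma)              # precomputed chi^2 kernel matrix
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(K, labels)                           # labels: 1 for the action, 0 otherwise
    return clf, X

def score_bow_baseline(clf, train_X, test_hists, gamma=0.5):
    """Decision values for test videos, used to draw the precision-recall curve."""
    Xt = test_hists / np.maximum(test_hists.sum(axis=1, keepdims=True), 1e-8)
    return clf.decision_function(chi2_kernel(Xt, train_X, gamma=gamma))
```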
[Recognition CBN] We performed two experiments with the CBN. First, we used machine generated tracks (MGT) for humans, objects and blobs, obtained using the methods described above, as input to our system. Second, in order to study the effect of noisy tracks, we used hand annotated tracks (HAT) of humans, objects and blobs as input. In both cases, the action structures were fixed by hand and the minimum length of a composite was determined by evaluating performance on the training set. A comparison of the performance of the CBN in both cases with the baseline method is provided in Table 4.1.

Figure 4.7: PR curves of the CBN using MGT and the BoW method

Our method using machine generated tracks outperforms the baseline BoW method for all actions except Pickup. We believe our results are better because the compositional approach is more adept at modeling actions with high variance as compared to the BoW model. Furthermore, the evaluation set has manifestations which are not present in the training set (Fig. 4.5). On the other hand, we believe that due to strong pose changes, the BoW approach is successful in learning a better discriminative model for Pickup without object knowledge, whereas the CBN suffers from an inability to estimate pose changes from simplistic track based features.

Figure 4.8: PR curves of the CBN using Human Annotated Tracks

We get many spurious tracks (Fig. 4.6) due to noisy motion blobs and object detections. Tracks that are spatially distant from an actor usually do not affect the results due to the actor centered modeling and inference. However, for human annotated tracks, the CBN results are significantly better than for machine generated tracks. This suggests that increasing recall has a significant effect. Increasing object and human detection reliability (precision) also decreases error. We further believe that the performance is also affected by the use of simple bounding box features to detect primitives; e.g., we use the relative location of actor and object to detect the primitive Close-To-Feet. Using more complex features for shape and motion would improve primitive detection and overall performance. Noise from multiple annotators also affects the overall performance assessment when there are not enough samples to take a majority vote.

Figure 4.9: Examples where the main action was successfully recognized: (a) Catch, (b) Collide and Push, (c) Haul, (d) Push.

We used the structured output of our system to generate English language descriptions. Our description module uses posterior probabilities and the action hierarchy to rank actions. Structured descriptions for the top N actions are converted to English language using generic rules. Fig. 4.9 shows our English language description results. In common discourse, we would likely not use a phrase such as 'the approached object' but rather a more descriptive phrase. In spite of these deficiencies, the example does show the ability of the description to clarify that the object in the first, second and fourth sentences is the same, and that the human referred to in the three sentences is also the same.

4.4.2 Recognition and Localization of Actions

In order to assess the performance of our CBN in terms of localizing actions, we apply our method on videos from the Mind's Eye Year 2 dataset. We trained 7 action models (Approach, Carry, Chase, Flee, Pass, Pickup and Put Down) using Mind's Eye Year 1 videos and validated them on 26 videos with an average length of 10 minutes. We then applied these validated models on 4 videos held out for evaluation. The average length of these videos was also 10 minutes. We added camera calibration information to convert human tracks from 2D to 3D.
Also, we used human context to filter object detection results at the detection and tracking level. For quantitative evaluation, we used the overlap of two events as a measure of detection accuracy. In particular, similar to object detection, we use the temporal overlap (intersection over union of time intervals) of ground truth and predicted events as an indicator function for true positives and false positives. The function indicates a true positive if the temporal overlap is greater than some threshold. Table 4.2 shows the results of our method on the 26 videos in the validation set for a threshold of 0.5. Due to the unavailability of reliable human annotations, we present quantitative results only for the validation set; however, Fig. 4.10 shows some qualitative results of our method on the 4 evaluation videos.

Action     F1    Precision  Recall
Approach   0.24  19.0%      32.0%
Carry      0.07  32.0%      42.0%
Chase      0.26  43.0%      19.0%
Flee       0.35  32.0%      40.0%
Pass       0.15  13.0%      19.0%
Pick Up    0.15  78.0%      08.1%
Put Down   0.15  75.0%      08.6%
Average    0.20  41.7%      24.1%

Table 4.2: Performance of our method on the validation subset of the Mind's Eye Year 2 dataset.

The performance for Carry is very low because our method depends on long term object tracking, which is hard for relatively small carry-able objects as they get heavily occluded during the action. This observation reaffirms the need for mutual inference of actions and objects, as we discussed in earlier chapters.

4.5 Conclusion

In this chapter, we explored a part based representation of activities and found that the determination of temporal segments plays an important role in the feasibility of inference. We use concepts from interval logic and put them in a Bayesian framework to do probabilistic reasoning. To make inference efficient, we use bottom-up hypothesis generation to infer activity likelihoods given the evidence. Our system can find the temporal spans of multiple activities in a video; these activities may temporally overlap. We also tried to address the problem of human-object interaction modeling by defining primitives as properties of actors and objects. This requires detection and tracking of entities. We observe that object perception is very important for methods which intend to answer more than what happened in the video. We empirically show that on a dataset with noisy annotations, varying conditions and viewpoints, the better the object estimation gets, the better the performance of the activity recognition system gets.

Figure 4.10: Action recognition and localization results using MGT. Actions associated with subjects at each frame are displayed: (a) a person carrying a gun, (b) a woman carrying a water can, (c) a man fleeing, (d) a woman passing a man, (e) a man and a woman approach each other; the man puts down the can he was carrying.
Chapter 5

Multiple Pose Context Trees for Action Recognition in Single Frame

5.1 Introduction

In the previous chapter, we observed that simple features computed from object and human bounding boxes were not discriminative enough for certain primitive events. Our thesis is that estimation of a detailed human pose can help disambiguate such primitive components. Therefore, in this chapter we consider the problem of estimating 2D human pose in static images where the human is performing an action that involves interaction with scene objects. For example, a person interacts with a soccer ball with his/her leg while dribbling or kicking the ball. In such a case, when the part-object interaction is known, the object position can be used to improve the pose estimation. For instance, if we know the position of the soccer ball or the leg, it can be used to improve the estimation of the other (see Figure 5.1). However, to determine the interaction in an image, pose and object estimates are themselves needed. We propose a framework that simultaneously estimates the human pose and determines the nature of the human-object interaction. Note that the primary objective of this work is to estimate the human pose; however, in order to improve the pose estimate using object context, we also determine the interaction.

Figure 5.1: Effect of object context on human pose estimation. (a), (d) show sample images of players playing soccer and basketball respectively; (b), (e) show the human pose estimation using tree-structured models [72]; (c), (f) show the estimated human pose using the object context.

Numerous approaches have been developed for estimating human pose in static images [22, 35, 102, 81, 72], but these do not use any contextual information from the scene. Multiple attempts have been made to recognize human pose and/or action, both with and without using scene context [24, 30, 91]. These approaches first estimate the human pose and then use the estimated pose as input to determine the action. Although in recent years general human pose estimation has seen significant advances, especially using part-based models [22], these approaches still produce ambiguous results that are not accurate enough to recognize human actions. [23] used a part based model [72] to obtain pose hypotheses and used these hypotheses to obtain descriptors to query poses in unseen images. [30] used the upper body estimates obtained using [22] to obtain hand trajectories, and used these trajectories to simultaneously detect objects and recognize human-object interaction. [91] discovers action classes using the shape of humans described by shape context histograms. Recently, attempts have also been made to classify image scenes by jointly inferring over multiple object classes [55, 15], such as the co-occurrence of human and racket for recognizing tennis [55]. [15] also used the relative spatial context of the objects to improve object detection.

Previous approaches that model human-object interaction either use a coarse estimate of the human to model the interaction [30, 15] or estimate the human pose as a pre-step to simultaneous object and action classification [30]. In this work, we use an articulated human pose model with 10 body parts which allows accurate modeling of human-object interaction. More precisely, we propose a graphical model, the Pose Context Tree, to simultaneously estimate the human pose and the object. The model is obtained by adding an object node to the tree-structured part model for human pose [22, 35, 102, 81] such that the resulting structure is still a tree, and thus allows efficient and exact inference [22]. To automatically determine the interaction, we consider multiple pose context trees, one for each possible human-object interaction, based on which part may interact with the object. The best pose is inferred as the pose that corresponds to the maximum likelihood score over the set of all trees. We also consider the probability of the absence of an object, which allows us to determine if the image does not contain any of the known interactions.
Thus, our contribution is two-fold, Pose context trees to jointly estimate detailed human pose and object which allows for accurate interaction model A Bayesian framework to jointly infer human pose and human-object interaction in a single image; when the interaction in the image is not modeled, then our algorithm reports \unknown interaction" and estimate the human pose without any contextual information. 64 To evaluate our approach, we collected images from the Internet and previously re- leased datasets [10, 72, 30]. Our dataset has 65 images, out of which 44 has a person either kicking or dribbling a soccerball, or holding a basketball. We demonstrate that our approach improves the pose accuracy over the dataset by 5% for all parts and by about 8% for parts that are involved in the interaction, for example, legs for soccer. In the rest of the chapter, we rst discuss the pose context tree in section 2 and the algorithm for simultaneous pose estimation and action recognition in section 3. Next, we present the model learning in section 4, followed by the experiments in section 5 and conclusion in section 6. 5.2 Modeling Human Body and Object Interaction We use a tree structured model to represent the interaction between the human pose and a manipulable object involved in an action. We refer to this as Pose Context Tree. Our model is based on the observation that during an activity that involves human object interaction such as playing tennis or basketball, humans interact with the part extremities i.e. hands, lower legs and head; for example, basketball players hold the ball in their hand while shooting. We rst describe a tree structured model of human body, and then demonstrate how we extend it to a pose context tree that can be used to simultaneously infer the pose and the object. [Tree-structured Human Body Model] We represent the human pose using a tree pictorial structure [22] with torso at the root and the four limbs and head as its branches (see gure 5.2(a)). The human bodyX is de- noted as a joint conguration of body partsfx i g, wherex i = (p i ; i ) encodes the position 65 and orientation of parti. Given the full objectX, the model assumes the likelihood maps for parts are conditionally independent and are kinematically constrained using model parameters . Under this assumption, the posterior likelihood of the object X given the image observations Y is P (XjY; ) / P (Xj)P (YjX; ) =P (Xj) Y i P (Yjx i ) (5.1) /exp 0 @ X ij2E ij (x i ;x j j) + X i i (Yjx i ) 1 A where (V;E) is the graphical model; i (:) is the likelihood potential for parti; ij () is the kinematic prior potential between parts i and j modeled using . For eciency, priors are assumed to be Gaussian [22]. [Pose Context Tree] Pose context tree models the interaction between the human body and an object involved. Since humans often interact with scene objects with part extremities (legs, hands), pose context trees are obtained by adding an object node to a leaf node in the human tree model (see gure 5.2). We represent the pose and object jointly using X o =fx i g[z o , where z o is the position of the object O. P (X o jY; )/P (X o j)P (YjX o ; ) (5.2) =P (X o j) Y i P (Yjx i ) ! P (Yjz o ) whereP (X o j) is the joint prior for the body parts and the objectO. Since the object and interaction are known, we assume knowledge of the body part involved in the interaction with the object. 
As the graphical model with the context node is a tree, the joint kinematic priorP (X o j) can be written asP (Xj P )P (x k ;z o j a ), wherek is the part 66 interacting with the object, P is kinematic prior for body parts, a is the spatial prior for interaction modela betweenz o andx k . Thus, the joint likelihood can be now written as P (X o jY; )/P (Xj P )P (x o jx k ; a ) ( Q i P (Yjx i ))P (Yjz o ) (5.3) /P (XjY; P )exp ( a (x k ;z o j a ) + o (Yjz o )) where, o (:) is the likelihood potential of the object O, a () is the object-pose interac- tion potential between O and interacting body part k for interaction model a (given by equation 5.4). a (x k ;z o ) = 8 > < > : 1 jT ko (x k )T ok (z o )j<d a ko 0 otherwise (5.4) where T ko (:) is the relative position of the point of interaction between O and part k in the coordinate frame of the object O. o x t h x x lua lla x rua x rla x x lul x lll rul x rll x x t h x rla x rua x x lua lla x x lul x lll rul x rll x (a) (b) x Figure 5.2: Pose Model: (a) Tree structured human model with observation nodes (b) Pose Context Tree for object interaction with left lower leg; object node is marked as double-lined 67 5.3 Human Pose Estimation using Context We use Bayesian formulation to jointly infer the human pose and the part-object inter- action in a single image. Here, by inferring part-object interaction we mean estimating the object position and the interacting body part. Note that unlike [30], our model is generative and does not assume that the set of interactions forms a closed set i.e. we consider the set of interactionsA =fa i g[. The joint optimal pose conguration and interaction pair (X ;a ) is dened as (X ;a ) = arg max X;a P (X;ajY; ) = arg max X;a P (ajX;Y; )P (XjY; ) (5.5) We dene the conditional likelihood of interaction a given the pose estimate X, obser- vations Y and model parameters as the product of likelihoods of the corresponding object O(a) in the neighborhood of X and absence of objects that correspond to other interactions inA, i.e. P (ajX;Y; )/P (z o(a) jX;Y; ) Y a 0 2Anfag 1P (z o(a 0 ) jX;Y; ) (5.6) Combining equations 5.5 and 5.6, we can obtain joint optimal pose-action pair as (X ;a ) =arg max X;a P (z o(a) jX;Y; )P (XjY; ) Y a 0 2Anfag 1P (z o(a 0 ) jX;Y; ) =arg max X;a P (X o(a) jY; ) Y a 0 2Anfag 1P (z o(a 0 ) jX;Y; ) (5.7) The joint pose-interaction pair likelihood given in equation 5.7 can be represented as a graphical model, however the graph in such case will have cycles because of edges from object nodes to the left and right body parts. One may use loopy belief propagation to 68 jointly infer over all interactions [30] but in this work, we use an alternate approach by eciently and accurately solving for each interaction independently and then selecting the best pose-interaction pair. For each interactiona, we estimate the best poseX o(a) and then add penalties for other objects present close to X o(a) . The optimal pose-interaction pair is then given by (X ;a ) =arg max X;a 0 @ max X o(a) P (X o(a) jY; ) Y a 0 2Anfag 1 max z o(a 0 ) P (z o(a 0 ) jX o(a) ;Y; ) 1 A (5.8) where X o(a) can be obtained by solving the Pose Context Tree for the corresponding interaction a (described later in this section). Observe that when a = , the problem reduces to nding the best pose given the observation and adds a penalty if objects are found near the inferred pose. 
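To make this selection step concrete, the following is a minimal sketch of the search over candidate interactions implied by Eq. (5.8). The functions solve_pose_context_tree and object_likelihood_near_pose are hypothetical placeholders for the tree inference and object scoring described in this chapter; the sketch illustrates the selection logic only, not the thesis implementation.

import math

def select_pose_and_interaction(image, interactions,
                                solve_pose_context_tree,
                                object_likelihood_near_pose):
    # `interactions` should include None for the "no known interaction" case.
    best_pose, best_interaction, best_score = None, None, float("-inf")
    for a in interactions:
        # Best pose under the pose context tree for interaction `a`
        # (plain tree-structured body model when `a` is None).
        pose, log_score = solve_pose_context_tree(image, a)
        # Penalize objects of the other interactions detected near this pose.
        for a_other in interactions:
            if a_other is None or a_other == a:
                continue
            p_other = object_likelihood_near_pose(image, pose, a_other)  # in [0, 1]
            log_score += math.log(max(1.0 - p_other, 1e-6))
        if log_score > best_score:
            best_pose, best_interaction, best_score = pose, a, log_score
    return best_pose, best_interaction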
Thus our model can be applied on any image even when the interaction in the image is not modeled, thereby making our model more apt for estimating human poses in general scenarios. 5.3.1 Pose Inference for known Object Context using Pose Context Tree Given the object context i.e. object position and interaction model, pose is inferred us- ing pose context tree by maximizing the joint likelihood given by equation 5.2. Since the corresponding energy equation for pose context tree (eqn 5.2) has a similar form as that of tree structured human body model (eqn 5.1), both can be minimized using similar algorithms [22, 72, 23, 1]. These approaches apply part/object detectors over the all image positions and orientations to obtain part hypotheses and then enforce kinematic constraints on these hypotheses using belief propagation [49] over the graphical model. This is sometimes referred to as parsing. Given an image parse of the parts, the best pose is obtained from part hypotheses by sampling methods such as importance sampling [22], 69 maximum likelihood [1], data-driven MCMC [54]. [Body Part and Object Detectors] We used the boundary and region templates trained by Ramanan et al [72] for localizing human pose (see 5.3(a, b)). Each template is a weighted sum of the oriented bar l- ters where the weights are obtained by maximizing the conditional joint likelihood (refer [74] for details on training). The likelihood of a part is obtained by convolving the part boundary template with the Sobel edge map, and the part region template with part's appearance likelihood map. Since the appearance of parts is not known at the start, part estimates inferred using boundary templates are used to build the part appearance models (see iterative parsing [72]). For each part, an RGB histogram of the parth fg and its background h bg is learnt; the appearance likelihood map for the part is then simply given by the binary map p(H fg jc)>p(H bg jc). For more details, please refer to [72]. For each object class such as soccer ball, we trained a separate detector with a variant of Histogram of Gradients features [10], the mean RGB and HSV values, and the normal- ized Hue and Saturation Histograms. The detectors were trained using Gentle AdaBoost [27]. We use a sliding window approach to detect objects in the image; a window is tested for presence of object by extracting the image features and running them through boosted decision trees learned from training examples. The details on learning object detectors are described in Section 4.2. [Infer Part and Object Distributions] For each part and object, we apply the detector over all image positions and orientations to obtain a dense distribution over the entire conguration space. We then simultaneously compute the posterior distribution of all parts by locally exchanging messages about kine- matic information between parts that are connected. More precisely, the message from 70 (b) (a) Figure 5.3: Part Templates: (a) Edge based part templates (b) Region based part tem- plates, dark areas correspond to low probability of edge, and bright areas correspond to a high probability; part i to part j is the distribution of the joint connecting parts i and j, based on the observation at part i. This distribution is eciently obtained by transforming the part distribution into the coordinate system of the connecting joint and applying a zero mean Gaussian whose variance determines the stiness between the parts [22]. 
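As a rough illustration of this message computation, the snippet below implements a simplified, translation-only version for a single part at a fixed orientation: the part's score map is shifted into the coordinate frame of the connecting joint and blurred with a zero-mean Gaussian whose width plays the role of the joint stiffness. The search over orientations and scales used in the full model is omitted, so this is only a sketch of the idea.

import numpy as np
from scipy.ndimage import gaussian_filter, shift

def message_to_parent(part_score_map, joint_offset, joint_sigma):
    # part_score_map: 2D array of part likelihoods over image positions
    # joint_offset:   (dy, dx) mean location of the connecting joint relative
    #                 to the part (learned kinematic prior mean)
    # joint_sigma:    std. dev. of the Gaussian kinematic prior (stiffness)
    shifted = shift(part_score_map, joint_offset, order=1,
                    mode="constant", cval=0.0)
    return gaussian_filter(shifted, sigma=joint_sigma)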
[Selecting the Best Pose and Object] Since the tree structure model does not represent inter part occlusion between the parts that are not connected, the pose obtained by assembling maximum posterior estimates for each part [1] does not result in a kinematically consistent pose. Thus, we use a top down approach for obtaining a pose by nding the maximum likelihood torso estimate rst (root) and then nding the child part given the parent estimate. This ensures a kinematically consistent pose. 71 (b) (a) (c) (d) Figure 5.4: Inference on Pose Context Tree: (a) Sample image of a soccer player (b) Distributions obtained after applying edge templates (c) Joint part distributions of edge and region templates (d) Inferred pose and object 5.4 Model Learning The model learning includes learning the potential functions in the pose context tree i.e. the body part and the object detectors for computing the likelihood potential, and the prior potentials for the Pose Context tree. For the body part detectors, we use templates provided by Ramanan et al [72] (for learning these templates, please refer to [74]). 5.4.1 Prior potentials for Pose Context Tree Model parameters include the kinematic functions between the parts ij s and the spatial context for each manipulable object O, ko . [Human Body Kinematic Prior]: The kinematic function is modeled with Gaussians, i.e. position of the connecting joint in a coordinate system of both parts (m ij ; ij ) and (m ji ; ji ) and the relative angles of the parts at the connected joint (m ij ; ij ). Given 72 the joint annotations that is available from the training data, we learn the Gaussian parameters with a Maximum Likelihood Estimator [22], [1]. [Pose-Object Spatial Prior]: The spatial function is modeled as a binary potential with a box prior (eqn 5.4). The box prior ko is parameterized as mean and variance (m;), which spans the region [m 1 2 p ;m + 1 2 p ]. Given the pose and object annotations, we learn these parameters from the training data. 5.4.2 Object Detector For each type of object, we train a separate detector for its class using Gentle AdaBoost [27]. We use a variation of Histogram of Gradients [10] to model edge boundary distri- bution and mean RGB and HSV values for color. For eciency, we do not perform all the steps suggested by [10] for computing HOGs. We divide the image in a rectangular grid of patches and for each cell in the grid, a histogram of gradients is constructed over orientation bins. Each pixel in the cell cast a vote equal to its magnitude to the bin that corresponds to its orientation. The histograms are then sum-normalized to 1:0. For appearance model of the objects, normalized histograms over hue and saturation values are constructed for each cell. Thus our descriptor for each cell consists of mean RGB and HSV values, and normalized histograms of oriented gradients, hue and saturation. For training, we use images downloaded from Internet for each class. We collected 50 positive samples for each class and 3500 negative samples obtained by random sampling windows of xed size from the images. During training detector for one class, positive examples from other classes were also added to the negative set. For robustness, we increased the positive sample set by including small ane perturbations the positive samples (rotation and scale). 
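A minimal NumPy sketch of the per-cell descriptor described above follows. The bin counts are the ones explored in the experiments later in this chapter, and the helper assumes the RGB and HSV crops of a single grid cell are provided (RGB, hue and saturation in [0, 1]); the window scanning, grid partitioning and boosted classifier are omitted.

import numpy as np

def cell_descriptor(rgb_cell, hsv_cell, n_grad_bins=12, n_hue_bins=10, n_sat_bins=4):
    # Gradient-orientation histogram over 180 degrees, one vote per pixel
    # weighted by gradient magnitude, sum-normalized to 1.0.
    gray = rgb_cell.mean(axis=2)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ori = np.mod(np.degrees(np.arctan2(gy, gx)), 180.0)
    hog, _ = np.histogram(ori, bins=n_grad_bins, range=(0, 180), weights=mag)

    # Normalized hue and saturation histograms for the appearance model.
    hue_hist, _ = np.histogram(hsv_cell[..., 0], bins=n_hue_bins, range=(0, 1))
    sat_hist, _ = np.histogram(hsv_cell[..., 1], bins=n_sat_bins, range=(0, 1))

    def l1(h):
        # Sum-normalize to 1.0, guarding against empty cells.
        s = h.sum()
        return h / s if s > 0 else h.astype(float)

    return np.concatenate([l1(hog),
                           rgb_cell.reshape(-1, 3).mean(axis=0),   # mean RGB
                           hsv_cell.reshape(-1, 3).mean(axis=0),   # mean HSV
                           l1(hue_hist), l1(sat_hist)])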
For each object detector, the detection parameters include number of horizontal and vertical partitions of the window, number of histogram bins for gradients, hue and saturation, number of boosted trees and their depths for the classier, 73 and were selected based on the performance of the detector on the validation set. The validation set contained 15 images for each object class and 15 images containing none of them and does not overlap with the training or test sets. We select the classier that gives lowest False Positive Rate. 5.5 Experiments To validate the model, we created a dataset with images downloaded from the Internet and other datasets [30, 10, 72]. The dataset has 21 22 images for 3 interactions - legs with soccer ball, hands with basketball and miscellaneous. For ease of writing, we refer to these as soccer, basketball and misc respectively. The soccer set includes images of players kicking or dribbling the ball, basketball set has images of players shooting or holding the ball, and the misc set includes images from People dataset [72] and INRIA pedestrian dataset [10]. Similar to People dataset [72], we resize each image such that person is roughly 100 pixels high. Figure 5.5 show some sample images from our dataset. Note that we do not evaluate our algorithm on the existing dataset [30] as our system assumes that the entire person is within the image. 5.5.1 Object Detection We evaluate the object detection for each object type i.e. soccer ball and basketball. Fig- ure 5.6 show some positive examples from the training set. For evaluation, we consider a detection hypothesis to be correct, if detection bounding box overlaps the ground truth for the same class by more than 50%. We rst desribe the detection parameters used in our experiments, and then evaluate the detectors on the test set. [Detection Parameters]: As mentioned previously in the Learning section, we set 74 Figure 5.5: Sample images from the dataset; Rows 1, 2 contains examples from basketball class; Rows 3, 4 from soccer, and Row 5 from misc. the detection parameters for each object based on its performance on the validation set. We select the detection parameters by experimenting over the grid size (1 1, 3 3, 5 5), number of histogram bins for gradients (8, 10, 12 over 180 degrees), hue (8; 10; 12) and saturation (4, 5, 6), number of boosted trees (100, 150, 200) and their depths (1, 2) for the classier, and threshold on detection condence (0:2; 0:3;:::; 0:9). We use training window size of 50 50 for both soccer and basketball. For soccer ball and basketball, we select the detection parameter settings that gives lowest False Positive Rate with miss rate of at most 20%. For soccer ball, the detector trained with 12 gradient orientation, 10 hue and 4 saturation bins over a 5 5 grid gave the lowest, 2:5 10 4 , FPPW for boosting over 150 trees of depth. On the other hand, for basketball, 200 boosted trees of 75 Figure 5.6: Sample positive examples used for training object detectors; Row 1 and 2 shows positive examples for basketball and soccer ball respectively. depth 2 on a 5 5 grid gave lowest FPPW of 4:9 10 4 for 12 HOG, 8 hue, 6 saturation bins. [Evaluation]: We evaluate the detectors by applying them on test images at known scales. To compute the detection accuracy for each detector on the test set, we merge overlapping windows after rejecting responses below a threshold. The detection and false alarm rate for each detector on the entire test set is reported in Table 5.1 for threshold of 0:5. 
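The merging rule for overlapping windows is not spelled out in the text; the sketch below shows one simple realization that, after rejecting low-confidence responses, keeps the highest-scoring window from every overlapping group, which is sufficient to reproduce the evaluation protocol described above.

def merge_windows(detections, score_thresh=0.5, overlap_thresh=0.5):
    # detections: list of (x1, y1, x2, y2, score); windows below `score_thresh`
    # are dropped, the rest are merged greedily by keeping the highest-scoring
    # window of each overlapping group (IoU > `overlap_thresh`).
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter) if inter else 0.0

    kept = []
    for det in sorted((d for d in detections if d[4] >= score_thresh),
                      key=lambda d: d[4], reverse=True):
        if all(iou(det, k) < overlap_thresh for k in kept):
            kept.append(det)
    return kept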
             Detection Rate   False Positives Per Window
Soccer ball  91.7%            2 x 10^-4
Basketball   84.2%            1.5 x 10^-3
Table 5.1: Object detection accuracy over the test set.

5.5.2 Pose Estimation
For evaluation, we compute the pose estimation accuracy over the entire dataset with and without using the object context. Pose accuracy is computed as the average correctness of each body part over all the images (a total of 650 parts). An estimated body part is considered correct if its segment endpoints lie within 50% of the length of the ground-truth segment from their annotated location, as in earlier reported results [23, 1]. To demonstrate the effectiveness of each module in our algorithm, we compute pose accuracy using 3 approaches: Method A, which estimates human pose using [72] without using any contextual information; Method B, which estimates the human pose with a known object, i.e. using the pose context tree; and Method C, which jointly infers the object and estimates the human pose in the image. Note that Method B essentially corresponds to the case when all interactions are correctly recognized and the joint pose and object estimation is done using pose context trees; hence the performance of Method B gives an upper bound on the performance of Method C, which is the fully automatic approach. Figure 5.8 shows sample results obtained using all 3 methods.

  Approach              Pose Accuracy
                        S      B      M      Overall
A No context [72]       67.1   43.2   64.3   57.33
B KnownObject-PCT       72.9   63.2   64.3   66.33
C Multiple PCT          69.4   57.7   59     61.50
Table 5.2: Pose accuracy over the entire dataset with a total of 65 x 10 = 650 parts. S corresponds to the accuracy over images in the soccer set, similarly B for basketball and M for misc.

The pose accuracy obtained using the above methods is shown in Table 5.2. Notice that the use of contextual knowledge improves the pose accuracy by 9%, and using our approach, which is fully automatic, we can obtain an increase of 5%. To clearly demonstrate the strength of our model we also report the accuracy over the parts involved in the interactions in the soccer and basketball sets. As shown in Table 5.3, the methods using context (B and C) significantly outperform Method A, which does not use context. Notice that the improvement in accuracy is especially high for the basketball set. This is because the basketball set is significantly more cluttered than the soccer and misc sets, and hence, pose estimation is much harder; use of context provides additional constraints that help more accurate pose estimation.

  Approach              Pose Accuracy
                        S (legs)  B (hands)  Overall
A No context [72]       73.53     32.95      50.64
B KnownObject-PCT       80.88     52.27      64.74
C Multiple PCT          76.47     44.32      58.33
Table 5.3: Accuracy over parts involved in the object interaction; for soccer only the legs are considered and for basketball only the hands; thus, accuracy is computed over 44 x 4 = 176 parts.

Figure 5.7: Confusion matrices for interaction categorization: (a) using scene object detection (recognition rate: 80.6%), (b) using multiple pose context trees (recognition rate: 90%).

For pose accuracy to improve using contextual information, the interaction in the image must also be correctly inferred. Thus, in addition to the pose accuracy, we also compute the accuracy of interaction categorization. For comparison, we use an alternate approach to categorize an image using the joint spatial likelihood of the detected object and the human pose estimated without using context.
This is similar to the scene and object recognition approach [55]. Figure 5.7 shows the confusion matrices for both the methods, with object based approach as (a) and use of multiple pose context trees as (b). Notice that the average categorization accuracy using multiple pose context trees is much higher. 78 5.6 Summary In this chapter we proposed an approach to estimate the human pose when interacting with a scene object, and demonstrated the joint inference of the human pose and object increases pose accuracy. We propose the Pose context trees to jointly model the human pose and the object interaction such as dribbling or kicking the soccer ball. To simultane- ously infer the interaction category and estimate the human pose, our algorithm consider multiple pose context trees, one for each possible human-object interaction, and nd the tree that gives the highest joint likelihood score. We applied our approach to estimate human pose in a single image over a dataset of 65 images with 3 interactions including a category with assorted \unknown" interactions, and demonstrated that the use of con- textual information improves pose accuracy by about 5% (8% over the parts involved in the interaction such as legs for soccer). 79 KnownObject−PCT Iterative Parse Input (b) (d) (e) (c) (a) Multiple PCT Figure 5.8: Results on Pose Dataset. (a),(b),(c) are images from basketball set and (d),(e) are from soccer set. The posterior distributions are also shown for the Iterative Parsing approach and using PCT when action and object position is known. Notice that even in cases where the MAP pose is similar, the pose distribution obtained using PCT is closer to the ground truth. Soccer ball responses are marked in white, and basketballs are marked in yellow. In example (c), basketball gets detected as a soccer ball and thus results in a poor pose estimate using Multiple-PCT, however, when the context is known, true pose is detected using PCT. 80 Chapter 6 Simultaneous recognition of Action, Pose and Object in Videos 6.1 Introduction So far we have presented a two ways to decompose actions into components and presented Bayesian formulation to recognize actions through detection of their components in videos. We also presented a method which recognizes actions from single image by leveraging from the mutual context of action, human pose and object of interaction. In this chapter, we bring all the lessons learned from those systems to present our comprehensive system which decomposes action based on human keyposes representative of primitive events and jointly estimate action, human pose and object of interaction for robustness. As we have mentioned earlier that human action recognition is important for its wide range of applicability from surveillance systems to entertainment industry. Our objective is to not only assign an action label but also provide a description which includes not only \what"' happend but also \how"' it happened by breaking it into component primitive actions, \what"' object was used if any, \where"' the actor and object were and \when"' the interaction took place. This requires action, pose and object recognition. In past, one or more tasks were performed independently. Our hypothesis is that joint inference can improve accuracy of these tasks and in turn yield more meaningful descriptions. 
81 (a) Answer phone (b) Drink from a cup Figure 6.1: Similarity in pose sequences for two actions Action recognition is a challenging task because of high variations in execution style of dierent actors, view-point changes, motion blur and occlusion, to name a few. Most of the previous work for human activity recognition has been based on modeling human dynamics alone. [30] shows that action recognition can benet from object context, e.g., presence of `cup' or `water bottle' may help in detecting `drink' action. Further, human movements are not always discriminative enough and their meaning may depend on object type and/or location. For example, `drinking' vs. `smoking'. Similarly, actions can help object recognition. Therefore, modeling action-object context can improve both action and object recognition. One way to model mutual context is to recognize actions and objects independently and learn co-occurrence statistics. Such an approach is limiting because not only that independent recognition is very challenging but also that same objects can be used to perform dierent actions. For example, a `cup' for both `drinking' and `pouring'. Also, similar movements with dierent objects imply dierent actions, for instance, `drinking' and `answering a phone' have similar motion proles. In such cases, detailed human rep- resentation, e.g. human pose model, can facilitate accurate human interaction modeling. Methods have been proposed that use actions, poses and objects to help recognition of actions and/or objects [70, 30]. [70] recognize actions independently using pose obser- vations to improve object detection, whereas, [30] calculates hand trajectories from pose estimates and use them for simultaneous action and object recognition. These methods 82 estimate human pose independently; however, robust estimation of pose is a well-known dicult problem. Commonly, skin color is used to provide context when tracking hands, which limits applicability of the method. Simultaneous inference can improve estima- tion of all three entities. Indpendent object and pose detection in our experiments gave detection rates of 57.8% and 34.96% for respectively. Recently, [99] presented a method for simultaneous inference of action, pose and object for static images which does not take advantage of human dynamics. Due to absence of human dynamics, the approach cannot dierentiate between `make' and `answer' a phone call. An intuitive way to extend the approach for videos is to map their model onto an HMM. Such extension will be computationally expensive as a large number of action- pose-object hypotheses need to be maintained in every frame because actions, such as `drink', can be performed using one of many objects. Our contribution through this paper is three fold: i) We propose a novel framework to jointly model action, pose dynamics and objects for ecient inference. We represent action as a sequence of keyposes. Transition boundaries of keyposes divide actions into their primitive components. To aid video description, we map the recognition problem onto a dynamic graphical model. Inference on it yields likely label and segmentation of action into keyposes and hence into component actions. We also obtain human pose estimates in our inference to answer \where"' and \how"' when producing description. Segmentation by our method is more detailed than [30], which only segments reach and manipulation motion. 
ii) We propose a novel two step algorithm in which fewer hypotheses are required to be maintained for every frame in comparison to extending [99] by adding temporal links and running Viterbi algorithm. 83 iii) Keypose instances may vary highly among actors and even for same actor. To deal with variability in keypose instances we model keypose as a mixture of gaussian over poses and refer to it as Mixture of Poses. To validate our hypothesis, we evaluate our approach on the dataset of [30] which contains videos of actions using small objects and obtain action recognition accuracy of 92.31%. We show signicant improvement over the baseline method which does not use objects. We also observe that doing simultaneous inference of action, pose and object improves action recognition accuracy by 25% in comparison to using only action and pose. Finally, we evaluate the performance of our system for video description task. In the sections to follow, we brie y mention Related Work in Section 6.2. Section 6.3 describes our model for action, pose and object recognition. We explain our inference mechanism in Section 6.4 and show results of our experiments in Section 6.5. Conclusion is presented in Section 6.6. 6.2 Related Work Human activity recognition in video has been studied rigorously in recent years in vision community. Both Statistical [52] and graphical [17, 85, 65, 67] frameworks have been proposed in the past. However, most methods try to recognize actions based on human dynamics alone. These methods can be grouped on the basis of using 2D [43, 101, 52] or 3D [93, 59, 67] action models. 2D approaches work well only for the viewpoint they are trained on. [59, 93, 67] learn 3D action models for viewpoint invariance. [59, 67] use foreground blobs for recognition; therefore, their performance depend heavily on accurate foreground extraction. Also, human silhouettes are not discriminative enough, from certain viewpoints, for actions performed by movement of hand in front of the 84 actor, e.g. drink or call. We use 3D action models learned from 2D data and do part- based pose evaluation, which does not require silhouettes. [95, 63, 13] did some early work in modeling action and object context. [63] detects contact with objects to create a prior on HMMs for dierent actions. [70, 46] does indi- rect object detection using action recognition. [25] recognizes primitive human-object interactions, such as grasp, with small objects. The method is suited for actions that depend on the shape of the object. [32] tracks and recovers pose of hand interacting with an object. Most of these methods solve for either action or object independently to improve estimate of the other, with an assumption that either one is reliably detected. However, independent estimation is generally dicult for both. [30] uses hand tracks ob- tained from independent pose estimation for simultaneous inference of action and object; where interaction with objects is restricted to hands only. Attempts have also been made to recognize actions in a single image using object and pose context [30, 99, 82]. These methods, however, do not use human dynamics cues for recognition. 6.3 Modeling Action Cycle Graphical modeling of action, pose and object provides a natural way to impose spatial, temporal, functional and compsitional constraints on the entitites. Our model has sep- arate nodes to represent action, object and actor's state. 
Actor's state is a combination of actor's pose, duration in the same pose and micro-event, which represent transition from one keypose to another. Variations in speed and execution of action are captured through transition potentials. 85 Figure 6.2 presents the graphical model to simultaneously estimate action A, object of interaction O, and actor's state Q 1 using their mutual context. Evidences for object and actor's state are collected as e o and e q respectively. (a) Basic underlying model (b) Time Unrolled model Figure 6.2: Graphical Model for Actor Object Interaction Let, s t =ha t ;o t ;q t i denote the model's state at time t, where a t , o t and q t represent the action, the object of interaction and the actor's state at time t. Then the joint log likelihood for a state given an observation sequence of lengthT can be obtained by a sum of potentials similar to [8]. L(s [1:T] ) = T X t=1 X f w f f (s t1 ;s t ;I t ) (6.1) where, f (s t1 ;s t ;I t ) are potential functions that model interaction between nodes in our graphical model and w f are the associated weights. 1 We disambiguate states of complete model and state of an actor at an instant by calling them model's state and actor's state respectively 86 The most likely sequnce of model's states, s 1:T , can be obtained by maximizing the log likelihood function in Eq (6.1) for all states s [1:T] = arg max 8s [1:T] L(s [1:T] ) (6.2) Optimal state sequence obtained using Eq. (6.2) provides keypose labels for each frame. These labels are used with action and object estimates to generate video descip- tion. 6.3.1 Actor's States An actor goes through certain poses, called Keyposes, during an action which are repre- sentative of that action. We use Keyposes as part of our actor's state denition like [59] but represent them using Mixture of Poses to handle variation in keypose instances. A Mixture of Poses is a Gaussian mixture of K N-dimensional Gaussian distributions, in which, each component distribution corresponds to a 3D Pose with dimensions corre- sponding to 3D joint locations. We distinguish samples drawn from keypose distirbutions by referring to them as `poses'. Next, we dene transitions from one keypose to a dierent keypose as micro-events and include them in our denition of actor's state. This way, pose transitions depend not only on the current keypose but also on the last (dissimilar) keypose. Finally, since probability of staying in the same state decreases exponentially with time, we add duration of being in the same keypose to complete actor's state deni- tion. Therefore, the state of actor at time t, q t , in our model is represented by the tuple hm t ;kp t ;p t ;d t i, where m t , kp t , p t and d t represent previous micro-event, keypose, pose and duration spent in current keypose respectively. [67] presented a similar model for actor's state but uses primitive events to explicitly model intermediate poses between two 3D keyposes. This does not account for all the 87 pose variations that may occur between two keyposes, e.g., reach motion can be catego- rized by start and end pose and does not depend on variations in poses in between. We address this by not explicitly modeling the transformation of poses. Also, linear trans- formation assumption suggests that more keyposes would be required for accurate action modeling. [59] uses similar approach to ours but does not model duration of keyposes and will have problem dealing with actions in which actor stays in a keypose for long duration. 
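The state structure and the scoring of Eqs. (6.1)-(6.2) can be summarized with the following sketch. The potential functions themselves (defined in Section 6.4) are passed in as callables, and the first element of the sequence is treated as the initial state; names and signatures here are illustrative, not the thesis implementation.

from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class ActorState:            # q_t = <m_t, kp_t, p_t, d_t>
    micro_event: Any         # last micro-event m_t
    keypose: int             # current keypose label kp_t
    pose: Any                # sampled 3D pose p_t
    duration: int            # frames spent in the current keypose d_t

@dataclass
class ModelState:            # s_t = <a_t, o_t, q_t>
    action: str
    obj: Any                 # object of interaction o_t (may be None)
    actor: ActorState

def sequence_log_likelihood(states: List[ModelState], frames: List[Any],
                            potentials: Dict[str, Callable],
                            weights: Dict[str, float]) -> float:
    # Eq. (6.1): weighted sum of pairwise potentials over the state sequence.
    total = 0.0
    for t in range(1, len(states)):
        for name, psi in potentials.items():
            total += weights[name] * psi(states[t - 1], states[t], frames[t])
    return total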
We permit arbitrary connections for skipping and repeatability of actor's states within an action. Examples of transition models for actor's states making a call and spraying using a spray bottle are provided in supplemental material. 6.3.2 Human Object Interaction A scene may have a number of objects present, requiring us to identify the type and location of the object with which an actor interacts. We have an estimate of actor's joints via pose as part of actor's state. Given an action, joint locations can help us better estimate the type and the location of the object. For example, we expect a cup to be close to the hand when drink action is being performed. Conversely, action and object constraint joint locations, e.g. when a person is sitting on a chair, his/her hip is usually closer to the chair than the head. We call the joint which is closest to the object during the interaction as the joint of interaction. Objects often get highly occluded during interaction when they are small. For small objects, the pose does not depend on the location of the object but on its type and the action. This fact is also used by Gupta et al. in their work [30], where manipulation mo- tion depends only on object's type. Therefore, using location of small object to provide context to the pose during intearction may degrade action recognition performance. In 88 our experiments, objects were detected only 5% of the time after being grabbed. Con- sequently, a binding of pose and objects at each frame, as presented in [99], may not be optimal because objects don't provide useful context at every frame. On contrary, when objects are large and not occluded, they are easier to detect, however, body parts get occluded during interaction with them. In that case, objects provide useful context. Based on above observations, we model human object interaction for those portions of the action where objects are more likely to be detected and in uence the pose. To capture this notion, we use segmentation provided by micro-events. The micro-event with likelihood of observing the object around it near the joint of interaction greater than some threshold, is called the binding event. 6.3.3 Action Model Learning First, we select keyposes to represent our actions. Next, we learn keyposes from a small number of pose annotations for each video. Finally, action models and binding events are learned from keypose transition boundary labels. [Keypose Selection] Keyposes can be selected using local optima of motion energy as explained in [59] which requires MOCAP data. To avoid use of MOCAP data, poses for whole video sequences can be marked in 2D and be lifted to 3D [88, 67]. However, keypose selection using motion energy may not have semantic signicance. Since, one of our goals is to explain dierent phases of an action; we manually choose a set of keyposes based on the semantics for training. This makes our action description task straight forward after most likely keypose sequence labeling. [Keypose Learning] To learn Mixtures of Poses, we annotate the joints of the actor in only few frames of training videos and assign them to one of the Keyposes. These 89 2D poses are then lifted to 3D and normalized for scale and orientation using a method similar to [88] and [67]. We cluster instances of a keypose using K-means and use each cluster as a component in the mixture with equal weight. [Actor's State Transition Model Learning] We mark keypose transition boundaries for each video. 
Using these annotations, mean and variance of duration a person spends in a keypose for every action and a valid set of micro-events and their transitions are learned from these annotations. [Binding Event Learning] We annotate objects in training data for all frames. Then an object detector for object of interest is run and only true detections are kept. The micro-event around which an object is detected with likelihood above some threshold, is selected as binding event. We collect statistics for all the joints for both starting and ending pose of a micro-event. The joint closest to the object on average is selected as joint of interaction. [Potential Weights Learning] In principle, weights (w f ) can be learned using approach presented in [8] for improved performance. In our experiments, we uniformly set these weights to 1.0. 6.4 Simultaneous Inference of Action, Pose and Object Obtaining most likely state sequence by exhaustive search is computationally prohibitive due to large state space. A particle lter based approach can be used to do inference eciently inO(KT ), where K is the beam size and T is number of frames. 90 We propose a novel two step inference algorithm, to reduce the minimum beam size and hence the complexity of inference. We break down Eq. (6.1) into two parts according to the types of potentials: L(s [1:T] ) =w i T X t=1 i (s t1 ;s t ;I t ) + T X t=1 X q j w q j q j (s t1 ;s t ;I t ) (6.3) where the rst term involving i () models human object interaction and the last term involving q j () models transitions and observations of actor's states during the action. First, we obtain samples of object tracks and pose sequences for the video. Finding likely actor's state sequence,b s [1:T] , is equivalent to maximizing the last term of eq (6.3). This gives us a distribution of poses in every frame. In the second step, we obtain the most likely sequence of action, object and actor's state by computing interaction potentials between all possible pairs of object tracks and actor's states. Pseudo code for inference is given in Algorithm 2. Algorithm 2 Pseudo Code for Inference 1: Use particle lter based algorithm to obtain pose distribution: b S = n b s i [1:T] o K i=1 . b s i [1:T] = arg i max 8s [1:T] ( T X t=1 X q j w q j q j (s t1 ;s t ;I t )) (6.4) where, max i (:) =i th best solution of (.) 2: Obtain distribution of objects using window based detection and tracking 3: Obtain most likely sequence of states, s [1:T] by maximizing Eq. (6.1) over s [1:T] 2 b S s [1:T] = arg max s [1:T] 2 b S (L(s [1:T] )) (6.5) [Complexity Analysis] Let A be the set of actions and O a and B a be the set of objects and binding events for actiona, respectively. To do joint inference in single pass, we need 91 to maintain at leastK min = P a2A jO a j hypotheses to represent all actions at least once. Therefore, the complexity of a single pass algorithm for the smallest beam becomes: T 1 =O(T X a2A jO a j) = X a2A O(jO a jT ) For our two pass algorithm, we evaluate interaction potentials only for frames where binding events occur, i.e.jB a j<T . Since we bind objects in second pass, we only need to maintain at leastK min =jAj hyoptheses during rst step. In second step, we evaluate interaction potentialjO a jjB a j times for each action a. T 2a =O(jAjT ) X a2A O(jO a jT ) T 2b = X a2A O(jO a jjB a j)< X a2A O(jO a jT ) T 2 =O(jAjT ) + X a2A O(jO a jjB a j) X a2A O(jO a jT ) If at least one action can be performed using more than one object, the above inequality is strict. 
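A compact sketch of the two-step procedure of Algorithm 2 is shown below. The pose_tracker, object_tracker and interaction_potential arguments are hypothetical stand-ins for the particle-filter beam of Section 6.4.1, the detection-based tracker of Section 6.4.2, and the binding potential of Section 6.4.3.

def two_step_inference(frames, pose_tracker, object_tracker,
                       interaction_potential, w_i=1.0):
    # Step 1: top-K actor-state sequences, scored without the interaction term.
    candidates = pose_tracker(frames)      # [(state_sequence, partial_score), ...]
    # Step 2a: object track hypotheses from detection-based tracking.
    object_tracks = object_tracker(frames)
    # Step 2b: add the interaction potential and keep the best full sequence.
    best_seq, best_score = None, float("-inf")
    for states, partial_score in candidates:
        score = partial_score
        for t in range(1, len(states)):
            score += w_i * interaction_potential(states[t - 1], states[t],
                                                 frames[t], object_tracks)
        if score > best_score:
            best_seq, best_score = states, score
    return best_seq, best_score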
6.4.1 Pose Tracking We use a top-down approach to obtain topK likely sequences of poses, independent of the object. At any instant t, the actor state transition models are used to predict 3D poses for instant t + 1. The likelihood for these poses is then computed by evaluating them against the video evidence. For computational eciency, we use a particle lter based approach for inference similar to [67]. We keep a total of K states s i t i=1:K at each frame t. In order to have a good representation of all actions when we bind pose tracks and objects at later stage, we 92 ensure that we keep at least M = floor(K=jAj) states for each action at every frame. For each state s t , we sample the next micro-event m t given the action a t using Event Transition Potential m (s t1 ;a t ;m t ). Next, using the predicted micro-event we sample pose (p t ) for next state from the Mixture of Poses corresponding to ending keypose of the micro-event. Finally, observation potential o (p t ) is computed for the next predicted pose. The weighted sum of potentials is accumulated to nd topK state sequences. Pseudo code of our algorithm is given in Algorithm 3 Algorithm 3 Inference mechanism for Actor States Get initial states S 0 = nD s (i) 0 ; (i) 0 E ji = 1::K o 2: for t = 1 TO T do for i = 1 TO K do 4: for all Allowed m (i) t i.e. m (s t ;s t1 ;I t )6=1 do Randomly sample Mixture of Poses corresponding to end pose of m (i) t 6: Estimate weights: (i) t = (i) t1 + X q j w q j q j (s t1 ;s t ;I t ) (6.6) end for 8: end for Prune states 10: end for To maintain top-K state, we reject a higher ranked state in favor of a lower ranked state if the number of states corresponding to the action of higher ranked state does not fall belowM after rejection. In addition, we try to keep the states for an action diversied in terms of current keypose labels. We use standard denition of entropy to represent diversity in keypose labels. If needed, a state s 1 t is preferred over s 2 t , if rejecting s 2 t results in higher entropy in keypose labels compared to rejecting s 1 t . Final rejection decision of a state is based on both its likelihood and entropy. s 1 t is preferred over s 2 t , if: 93 w l l(s 1 t ) +w e ent(S a t ns 2 t )>w l l(s 2 t ) +w e ent(S a t ns 1 t ) where l(:) is likelihood, ent(:) is entropy of keypose labels, and w l , w e are weights to control relative importance and S a t is a set of actor states with a =a 1 t . Observation and transition potentials can be any function depending on the domain. We explain our choices below. 6.4.1.1 Transition Potential In our implementation we allow transition between actions only after last micro-event of the current action. Therefore, the action label is carried from previous state to the next except for the last micro-event. We want the probability of staying in the keypose to decay near mean duration. [67] models this using a log of signum function. We dene Event Transition Potential, m (:), like them as follows: if a t1 = a t : m (s t1 ;a t ;m t ) = 8 > > > < > > > : ln 1 +e d t (m t )(m t ) (m t ) m t1 = m t ln 1 +e d t (m t )+(m t ) (m t ) m t1 6= m t (6.7) if a t1 6= a t : m (s t1 ;a t ;m t ) = 8 > > > < > > > : ln 1 +e d t (m t )+(m t ) (m t ) m t1 = lm(a t ) 1 otherwise (6.8) 94 where, lm(a t ) is the last micro-event for action, and (m t ) and (m t ) are the mean and variance of duration an actor spend in the same keypose after micro-event m t . 
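The exact offsets in Eqs. (6.7)-(6.8) are difficult to recover from the extracted text, so the sketch below only reproduces the intended qualitative behavior: the score for staying in a keypose decays once the duration passes its learned mean, the score for moving to the next micro-event grows, and action changes are allowed only after the previous action's last micro-event.

import math

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def event_transition_potential(prev_event, event, duration, mu, sigma,
                               same_action=True, after_last_event=False):
    # duration: frames spent in the current keypose; mu, sigma: learned
    # duration statistics for the current micro-event.
    z = (duration - mu) / sigma
    if same_action:
        # Staying becomes unlikely past the mean duration; moving on becomes likely.
        return log_sigmoid(-z) if event == prev_event else log_sigmoid(z)
    # Across actions, a transition is only allowed once the previous action's
    # last micro-event has completed.
    return log_sigmoid(z) if after_last_event else float("-inf")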
6.4.1.2 Observation Potential We compute the observation potential of each keypose by matching its shape with the edges we extract from the video. To obtain the shape model, each keypose is scaled and rotated (in pan) appropriately based on the tracking information and then projected orthographically. For robustness, we use two dierent shape potentials - Hausdor dis- tance of the shape contour and the localization error of a 2D part based model (pictorial structure [22]). [Hausdor Distance] Given a keypose kp, we obtain its shape contour by collecting the points on the boundary of the projected pose. The shape contour is then matched with the canny edges in the image I using the Hausdor distance. H (S;I) = max p2kpcont min e2E(I) jjaejj (6.9) where kp cont is the pose shape contour, E(I) are canny edges of image I andjj:jj is any norm. We used the euclidean norm between the model points and edge points, as it can be computed eciently by computing the distance transform of the edge image. [Part Localization Error] To handle the variations in pose across dierent action instances, we use a 2D part model to accurately t the projected 3D pose to image obser- vations. The body part model used in the work is similar to the Pictorial Structures [22] which is widely used for estimating human pose in an image [72]. The model represents each part as a node in a graphical model and the edges between nodes represent the kine- matic relation between the corresponding parts. Note that unlike [22, 72], which assume the parts to be unoccluded, our body model is dened over only the observable parts; 95 note that a part may not be observable either due to inter-part occlusion or 3D-2D pro- jection (projected length may be too small). Furthermore, the estimate of the adjusted 3D pose imposes a strong constraint on the orientation and position of body parts and localization in our case does not require a dense search [22, 72]. To determine which parts are observable, we rst compute the visibility of each part by computing the fraction of overlap between that part and other body parts, and using their depth order w.r.t. the camera. In this work, we consider a part to be occluded if the fraction of part occluded is greater than 0:5. For localization, we rst apply a template for each part over the expected position and orientation of the part and a small neighborhood around it. In this work, we used the boundary templates provided by Ramanan et al [72]. Pose estimate is then obtained by maximizing the average log likelihood of the visible parts, given by ps (S;I) = X u2U u (y u jx u ) + X uv2E uv (x u ;x v ) (6.10) whereU is the set of all visible body parts,E is the set of part pairs that are kinematically connected; u () is the likelihood map of part u obtained by applying the detector; uv () is the kinematic constraints between the parts u andv. We normalize the total potential by number of visible parts to remove bias towards poses with fewer visible parts. 6.4.2 Object Recognition and Tracking We train an o-the-shelf window based object detector that does not use color for each type of object used. Our method does not depend on the specics of the object detector. We apply these detectors to obtain a set of candidates for each object of interest. We associate object candidates in each frame, by running an o-the-shelf detection based object tracker. Running an object tracker gets rid of intermittent false alarms and miss 96 Figure 6.3: Object visibility for dierent keyposes. 
Even for a static camera, the appearance of the spray bottle and the flashlight changes significantly between frames.
detections. Still, each frame may have more than one type of object and more than one candidate of each object type present.

6.4.3 Pose-Object Binding
Object detection and tracking in full generality is a challenging problem; therefore, we do not expect reliable object detections to be available throughout the action (Fig. 6.3). Instead, we bind the objects with the action when we have the best chance of detecting them. For a pick up action, it is before the object is removed from its location, while for a put down action it is after the object is put down. We learn these facts from data as explained earlier. Once we have pose and object tracks, we compute the interaction potential among poses, objects and action labels. For each pose track, given an action, we compute the interaction potential for object hypotheses (all detections across all types) and the joint of interaction of either the source or destination pose at each binding event for that action. We define two binding functions depending on object visibility before or after the occurrence of the binding event. Let j_a be the joint of interaction for action a. For frame t, let l(p_t^j) be the location of joint j for pose p and l(o_t) be the location of object o; the binding functions are defined as:

B_0(s_{t-1}, s_t) = \mathrm{dist}(l(o_{t-1}), l(p_t^{j_a})) / r
B_1(s_{t-1}, s_t) = \mathrm{dist}(l(o_t), l(p_{t-1}^{j_a})) / r    (6.11)

where r is used to normalize the distance. Now, for the binding event b_a for action a, the interaction potential is defined as:

\psi_i(s_{t-1}, s_t, I_t) = \begin{cases} \max(B_0(\cdot), B_1(\cdot), i_{\min}) & \text{if } m_t = b_a \neq m_{t-1} \\ 0 & \text{otherwise} \end{cases}    (6.12)

Due to uncertainty in the estimates of the locations of the object and the joint of interaction in a single frame, in practice we use the mean location over n frames before or after t, replacing l(x_t) with \hat{l}(x_{t:t+n}) and l(x_{t-1}) with \hat{l}(x_{t-1-n:t-1}).

Figure 6.4: Confusion matrices: (a) using the partial model, (b) using the full model.

Figure 6.5: Effect of using object context. Likelihoods of cup, phone, flashlight and spray bottle at each location of interaction frames are shown in cyan, blue, green and red respectively. (a) Phone, `calling' action and pose were correctly recognized in presence of multiple object detections. (b) Object context helped to disambiguate `lighting' from `pouring' action. (c) `Pouring' action was recognized in absence of `cup'. (d) Detection of `cup' confused `drinking' with `pouring' action.

6.5 Experiments
We evaluated the performance of our system for the video description task on the dataset of [30]. The dataset has 10 actors performing 6 different actions using 4 objects. The actions in the dataset are drinking, pouring, spraying, lighting a flashlight, making a phone call and answering a phone call. These actions are performed using one of four objects: cup, phone, spray bottle or flashlight. The videos, however, contain other objects such as a stapler and a masking tape. Only 52 of the 60 videos used for testing in [30] were made available by the authors for evaluation.
[Object Detection] To evaluate the performance of our object detector for recognition of objects before they are grabbed, we ran all the detectors at different scales. Locations for which no class had a likelihood of more than 0.5 were classified as background. Otherwise, the class with the highest likelihood was assigned to that location.
When using object detectors without action context, we achieved 63.43% recognition rate for localized objects of interest before being grabbed. Figure 6.5 demontstrates the eect of using object context. 99 We similarly evaluated our object detector for recognition of objects after being grabbed. As expected, only 5% of the objects were detected correctly, indicating that small objects cannot be reliably detected during interaction phase. [Baseline] For action recognition task, we rst established a baseline using a method which does not use object context. We used space-time interest points based method of [52]. Code provided by authors was used to obtain interest points and SVM-Light was used for classication. Our setup produced results similar to the authors on KTH dataset. On our dataset, we ran experiments for single channel with variety of congurations and obtained best accuracy of 53.8% using HOF-313. Same conguration is reported to give best performance for single channel on KTH dataset in [52]. [Action Recognition without Object] Next, we evaluated the performance of our action recognition system. We rst performed action recognition without using object knowledge. This is equivalent to using (6.3) without the i (:) term and the most likely action was reported. We used leave-one-out validation method for testing i.e. we trained our models using 9 actors and tested it on the remaining actor. Recognition rate of 67.31% was achieved using the limited model. Signicant confusion occurs between pouring from a cup and lighting a ashlight and drinking and answering actions because of similarity in limb movements. Some of call and answer actions are confused with spraying actions, because in 2D, initial few poses of a spraying action are similar to that of calling and answering actions. [Pose Classication] We setup our pose classication task as to classify the human pose at each frame into one of 20 Keypose classes. Approximately, 5300 frames were Method Accuracy Space-Time Interest Points [52] 53.8% Keypose based Action Model 67.31% Full model with context(Section 6.3) 92.31% Table 6.1: Action Recognition Accuracy 100 Pose Classication Accuracy Without action context 34.96% With action context 66.43% With action and object context 73.29% Table 6.2: Pose Recognition Accuracy evaluated. As reported in Table 6.2, recognition improves as more contextual information is provided. Accuracy is low without action context because we chose keyposes based on their semantic meanings, therefore, their joint locations appear to be very similar from certain viewpoints. Note that accuracy for no context is still better than random guess of 5%. [Action Recognition with Object]To study the impact of using object context on action recognition, we performed evaluation of our full model as explained in Section 6.3. This improved recognition accuracy from 67.31% to 92.31% and only 4 out of 52 videos were mis-classied. Our method also demonstrates signicant improvement over the baseline method. We, however, can't compare our results directly with [30] who reports accuracy of 93.34% because a) the authors used a training set separate from the test set and b) report results for 60 videos. Both the training set and unavailable videos were not provided by the authors for they got corrupt/lost. Our experiments show that use of object context in our proposed framework increases recognition rate signicantly. 
Note that for generality of application, we do not use skin color model for better alignment of hand locations, which might be useful in this case. [Video Description]Finally, we evaluated performance of our sytem for video descrip- tion task. We assign a \verb"' to every keypose in our system. The scene description is then generated using the keypose segmentation provided by our inference algorithm; we represent description as a set of tupleshverb, object, start frame, end framei, segmenting the video into component actions. Each tuple represents the verb assigned to the duration of video marked by start and end frames and the object associated with the verb. For verbs relating to non-interaction phase, like `Stand', we report `NULL' for object. An 101 (a) Making Phone Call (b) Answering Phone (c) Drinking from Cup (d) Pouring from Cup (e) Using Flashlight (f) Spraying using Spray Bottle Figure 6.6: Pose Tracking and Classication results for actions that were correctly iden- tied. Object of interaction identied by our system automatically is also highlighted. 102 example output XML le is shown in Fig 6.5 for `making a call' action. We quantitatively evaluate the accuracy of our description results by comparing the segment (component action) boundaries with the ground truth segment boundaries, which is available from keypose boundary annotations. When the actions were correctly recognized, our descrip- tion sequence exactly matched the ground truth and verb durations overlapped with 73.29% accuracy. 6.6 Conclusion Object identication plays an important role in discrimination of actions that involve similar human movements, whereas knowing the action can help resolve disambiguities between objects on the basis of their normal use. We presented a probabilistic approach to utilize the mutual context that action and objects provide to each other. We represented an action as a sequence of Mixture of Poses that captures pose variations across dierent action instances in a compact manner. By combining human pose and object information in the same probabilistic model and performing joint inference, we were able to better discriminate between actions which have similar poses. We applied our approach to a dataset of human actions that involve interaction with objects and showed that action recognition improves when pose and object are used together. 103 (a) Making Phone Call (b) Answering Phone (c) Drinking from Cup (d) Pouring from Cup (e) Using Flashlight (f) Spray using Spray Bottle Figure 6.7: Some examples of XML description generated by our framework for the corresponding examples of Figure 6.5. 104 Chapter 7 Summary Action recognition is a very important and challenging task. Numerous methods have been proposed which show remarkable performance of action recognition. However, if a picture says a thousand words, a video says a million. Therefore, just recognizing actions, albeit important, is not sucient for a complete understanding of the video and infor- mation about actors, objects of interaction and spatiotemporal location of the activity are also important. Research has shown that independent inference of these tasks is very challenging; however, limited work has been done to develop action recognition methods which are also suitable to obtain action `description." This thesis explores component based approaches for action recognition with focus on action description. In Chapter 3, we introduce low-level component based models to recognize actions through decomposition. 
6.6 Conclusion
Object identification plays an important role in the discrimination of actions that involve similar human movements, whereas knowing the action can help resolve ambiguities between objects on the basis of their normal use. We presented a probabilistic approach to utilize the mutual context that actions and objects provide to each other. We represented an action as a sequence of Mixtures of Poses that captures pose variations across different action instances in a compact manner. By combining human pose and object information in the same probabilistic model and performing joint inference, we were able to better discriminate between actions which have similar poses. We applied our approach to a dataset of human actions that involve interaction with objects and showed that action recognition improves when pose and object are used together.

[Figure 6.7: Some examples of the XML descriptions generated by our framework for the corresponding examples of Figure 6.5: (a) Making Phone Call, (b) Answering Phone, (c) Drinking from Cup, (d) Pouring from Cup, (e) Using Flashlight, (f) Spraying using Spray Bottle.]

Chapter 7
Summary
Action recognition is a very important and challenging task. Numerous methods have been proposed which show remarkable action recognition performance. However, if a picture says a thousand words, a video says a million. Therefore, just recognizing actions, albeit important, is not sufficient for a complete understanding of the video; information about the actors, the objects of interaction and the spatiotemporal location of the activity is also important. Research has shown that independent inference of these tasks is very challenging; however, limited work has been done to develop action recognition methods which are also suitable for obtaining an action 'description'. This thesis explores component based approaches for action recognition with a focus on action description.
In Chapter 3, we introduce low-level component based models to recognize actions through decomposition. Specifically, an action is randomly decomposed into spatiotemporal components, and the appearance and motion characteristics of these components are captured using statistics of low-level features which are learned from the data. The location of the components is modeled as a latent variable, and we evaluate multiple choices for the likelihood function. We have observed that the recognition performance depends on the discriminative quality of the components and not just on the number of components. Although this method does not directly identify the actors and objects of the action, it provides a framework for the structured analysis of action presented in later chapters.
Chapter 4 extends the approach presented earlier so that it also produces action descriptions as a by-product. We use primitive concepts of actor properties, e.g. moving, bending, etc., and actor-object relationships, e.g. close, approaching, etc., to decompose actions into temporal segments. To recognize an action, we first detect and track humans and other objects. Then, for each actor, we identify the temporal segments of validity of each primitive using features of the detection boxes. This information is combined using a Conditional Bayesian Network (CBN) to obtain the likelihood of occurrence of an action. Therefore, each recognized action has an associated actor and object of interaction (if any). In this method, we trade off the number of action components for the discriminative quality and semantic meaningfulness of the action components. This method also has the ability to localize actions in streaming video, and its efficacy is demonstrated on a very challenging public dataset.
We observed that the CBN based model suffered from the lack of robustness of the object detection and tracking algorithms, and from the lack of discriminative power of the features used for detection of primitives. We proposed a method for joint estimation of detailed human pose and object of interaction to address these issues and improve action recognition performance. First, we presented Multiple Pose Context Trees to recognize actions from only one representative image of the action in Chapter 5. Some actions can be recognized given only one image based on the representative pose and the object of interaction, such as smoking and shooting a basketball. We observed that the methods to solve the complex tasks of human pose estimation and object detection can benefit from joint inference of human pose and object of interaction, and that this leads to improvement in action recognition results.
Finally, in Chapter 6, we present our full action recognition and activity description model. We use a temporal decomposition of an action based on representative poses that are important for recognition of the action (keyposes). The human pose and the location and identity of the object are estimated jointly with the action label using a dynamic Bayesian Network. We propose a two-step algorithm based on particle filtering to search for the best state sequence that matches the action model. We found that our method significantly outperforms the bag-of-words model at the action recognition task on a publicly available dataset, with the added benefit of object and actor identification, and localization of human body parts.
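The state-sequence search summarized above relies on particle filtering. As a point of reference only, the sketch below shows a generic bootstrap particle filter, not the two-step algorithm of Chapter 6; init_fn, transition_fn and likelihood_fn stand in for model-specific components and are assumptions for illustration.

    import random

    def bootstrap_particle_filter(init_fn, transition_fn, likelihood_fn, observations, n=200):
        # Maintain n weighted hypotheses of the hidden state (e.g. pose/object/action in Chapter 6's setting).
        particles = [init_fn() for _ in range(n)]
        for obs in observations:
            particles = [transition_fn(p) for p in particles]         # predict the next state
            weights = [likelihood_fn(p, obs) for p in particles]      # score against the observation
            total = sum(weights)
            weights = [w / total for w in weights] if total > 0 else [1.0 / n] * n
            particles = random.choices(particles, weights=weights, k=n)  # resample
        return particles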
We feel that our work can be further improved. One of the key areas is joint modeling of human pose and object of interaction for large objects. In Chapter 6, we presented a technique for joint inference of the action label, human pose and object of interaction; our joint modeling approach assumes the object of interaction to be small. This assumption is often violated, as shown in Chapter 4, and the human pose is often occluded or influenced by the object of interaction. Therefore, reliable object detection plays an important role during action recognition, as shown in Chapter 4. However, since the current state of the art does not allow generic human pose estimation or object detection under challenging circumstances, the modeling of the mutual context of actor, object and action needs to include elements for the discovery of reliable components at run time and to direct the flow of information accordingly. In addition, the performance of the individual modules that perform human detection, object detection and human pose estimation needs to be improved.

Appendix A
English Language Description Generation
(The work described in this appendix was performed by Dr. Sung-Chun Lee.)
The action recognition module outputs video descriptions in a structured XML format. The structured output contains entries for the subject, object, interval and confidence of each detected action (Fig. A.1(a)). Mapping from the structured representation to English sentences is done by means of a number of rules. One key issue is whether the description should focus on a single key action in the video or include all the actions present. We have chosen to include multiple actions, but we filter them based on an inherent hierarchy to keep descriptions short. Our method is also able to connect the objects (or the actors) across multiple actions if they are identified to be the same.

A.1 Action ranking
The action recognition process may recognize multiple actions even for a short video containing only a single main event. Because concise descriptions are preferred in natural communication, it is desirable to rank the priority of the detected events. We exploit action hierarchy information as domain knowledge to rank actions. The action hierarchy specifies which action is more specific than others. For example, 'Haul' implies 'Move', 'Approach' implies 'Walk', etc. Therefore, Haul and Approach are more specific than Move and Walk. A fixed number of sentences is produced from the ranked event list. In our example in Fig. A.1, we choose the top four sentences.

[Figure A.1: An example of description generation. (a) Input to the Natural Language Description Module. (b) Output of the Description Module: "An object stopped. A human approached the object. The human pushed the approached object. The human carried the pushed object."]

A.2 Object blob association
Due to the unreliability of the object detector, we have not named noun categories other than human. All non-human objects are grouped into one object category, 'object'. It is necessary to discriminate humans or objects with different IDs in a video for a better description. We associate entities, both humans and objects, across sentences. We qualify entities with the definite article 'the' for all their successive appearances if they are the only entity in the video. For multiple entities, we use an identifiable expression based on the action and their role in the previous sentence in which they appeared. We give the set of rules we apply to generate descriptions in the supplemental material.
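As a rough illustration of the ranking and sentence assembly described in Sections A.1 and A.2 (the detailed rules and the full verb table follow in Section A.3), the sketch below sorts detected events by hierarchy rank and emits simple sentences. The event schema, the helper names and the article handling are assumptions for illustration, not the module's actual implementation.

    # Verb -> (descriptive past-tense expression, hierarchy rank); lower rank = more specific.
    VERBS = {
        "CATCH": ("caught", 1), "COLLIDE": ("collided with", 1), "HAUL": ("hauled", 1),
        "PICKUP": ("picked up", 1), "THROW": ("threw", 1), "PUSH": ("pushed", 2), "WALK": ("walked", 3),
    }

    def describe(events, top_k=4):
        # events: list of dicts with 'verb', 'subject' and optional 'object' keys (hypothetical schema).
        # Identifiable-expression and indefinite-article logic are omitted for brevity.
        ranked = sorted(events, key=lambda e: VERBS.get(e["verb"], ("", 99))[1])
        sentences = []
        for e in ranked[:top_k]:
            verb_phrase, _ = VERBS.get(e["verb"], (e["verb"].lower(), 99))
            obj = f" the {e['object']}" if e.get("object") else ""
            sentences.append(f"The {e['subject']} {verb_phrase}{obj}.")
        return " ".join(sentences)

    print(describe([{"verb": "PUSH", "subject": "human", "object": "object"},
                    {"verb": "WALK", "subject": "human"}]))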
A.3 Rules for description generation
We use the simple English sentence form 'S + V + O', where 'S' is the subject, 'V' is the verb, and 'O' is the object, to generate a description of the detected event. In normal discourse, humans resolve ambiguities among different types of objects by using their class identifiers as identifiable expressions, e.g. "A person caught a ball." When multiple objects of the same class are present, they are referred to by using their properties, such as color, height, etc. For example, "The tall person collided with the person in the red shirt."
Due to the unavailability of reliable class labels or property measures for the identifiable expression, we use previous event information to be more specific about the person or object of the current event. We use the table below to find the identifiable expression for humans and objects to generate sentences.

C1                  C2               C3                 C4                 C5
Action List         Descriptive      Identifiable       Identifiable       Verb
(Detected event)    Expression       Expression         Expression         Hierarchical
                                     (Verb + '-ed')     (Verb + '-ing')    Rank
CATCH               caught           caught             catching           1
COLLIDE             collided with    collided           colliding          1
HAUL                hauled           hauled             hauling            1
PICKUP              picked up        picked up          picker             1
PUSH                pushed           pushed             pushing            2
THROW               threw            thrown             throwing           1
WALK                walked           N/A                walking            3

The "Descriptive Expression" column (C2) is actually used in the description sentence, as shown in Figure A.1(a) and Figure 4.9 in the paper. We use an identifiable expression for entities (humans or objects) which appear in more than one event in the video. If the same entity previously appeared as the nominative subject, we insert the 'verb' + '-ing' expression from C4 before the entity from the previously detected event (e.g. the approaching human, the colliding object). If it appears as the object part, we insert the 'verb' + '-ed' expression from C3 (e.g. the caught object, the approached human). The last column (C5) represents the 'Action hierarchy rank', i.e. which action is more specific than others. A lower number means more specific.

Reference List
[1] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 2009.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. In Proceedings of IEEE International Conference on Computer Vision, 2005.
[4] Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[5] Matteo Bregonzio, Shaogang Gong, and Tao Xiang. Recognizing action as clouds of space-time interest points. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[6] William Brendel, Alan Fern, and Sinisa Todorovic. Probabilistic event logic for interval-based event recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[7] William Brendel and Sinisa Todorovic. Activities as time series of human postures. In Proceedings of European Conference on Computer Vision, 2010.
[8] Michael Collins. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of Conference on Empirical Methods in Natural Language Processing, 2002.
[9] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
[10] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 886-893, June 2005.
[11] Dima Damen and David Hogg. Recognizing linked events: Searching the space of feasible explanations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[12] DARPA. Dataset for Mind's Eye program, 2011.
[13] J. Davis, H. Gao, and V. Kannappan. A three-mode expressive feature model on action effort. In IEEE Workshop on Motion and Video Computing, 2002.
[14] Konstantinos G. Derpanis, Mikhail Sizintsev, Kevin Cannons, and Richard P. Wildes. Efficient action spotting based on a spacetime oriented structure representation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[15] Chaitanya Desai, Deva Ramanan, and Charless Fowlkes. Discriminative models for multi-class object layout. In Proceedings of IEEE International Conference on Computer Vision, 2009.
[16] Piotr Dollar, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. Behavior recognition via sparse spatio-temporal features. In Proceedings of Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[17] Thi Duong, Hung Bui, Dinh Phung, and Svetha Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-Markov model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[18] Alexei A. Efros, Alexander C. Berg, Greg Mori, and Jitendra Malik. Recognizing action at a distance. In Proceedings of IEEE International Conference on Computer Vision, pages 726-733, Nice, France, 2003.
[19] Ahmed Elgammal, David Harwood, and Larry Davis. Non-parametric model for background subtraction. In Proceedings of European Conference on Computer Vision, 2000.
[20] Alireza Fathi and Greg Mori. Action recognition by learning mid-level motion features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[21] Pedro Felzenszwalb, Ross Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[22] Pedro Felzenszwalb and Daniel P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55-79, 2005.
[23] Vittorio Ferrari, Manuel Marin-Jimenez, and Andrew Zisserman. Progressive search space reduction for human pose estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2008.
[24] Vittorio Ferrari, Manuel Marin-Jimenez, and Andrew Zisserman. Pose search: Retrieving people using their pose. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2009.
[25] Roman Filipovych and Eraldo Ribeiro. Recognizing primitive interactions by exploring actor-object states. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[26] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of online learning and an application to boosting. Computational Learning Theory, 1995.
[27] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, volume 38, 2000.
[28] Andrew Gilbert, John Illingworth, and Richard Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In Proceedings of IEEE International Conference on Computer Vision, 2009.
[29] Abhinav Gupta and Larry S. Davis. Objects in action: An approach for combining action understanding and object perception. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[30] Abhinav Gupta, Aniruddha Kembhavi, and Larry S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:1775-1789, 2009.
[31] Abhinav Gupta, Praveen Srinivasan, Jianbo Shi, and Larry S. Davis. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[32] Henning Hamer, Konrad Schindler, Esther Koller-Meier, and Luc Van Gool. Tracking a hand manipulating an object. In Proceedings of IEEE International Conference on Computer Vision, 2009.
[33] Christopher Harris and Mike Stephens. A combined corner and edge detector. In Proceedings of Alvey Vision Conference, 1988.
[34] Somboon Hongeng and Ramakant Nevatia. Large-scale event detection using semi-hidden Markov models. In Proceedings of IEEE International Conference on Computer Vision, 2003.
[35] Gang Hua, Ming-Hsuan Yang, and Ying Wu. Learning to estimate human pose with data driven belief propagation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 747-754, June 2005.
[36] Chang Huang and Ramakant Nevatia. High performance object detection by collaborative learning of joint ranking of granules features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[37] Chang Huang, Bo Wu, and Ramakant Nevatia. Robust object tracking by hierarchical association of detection responses. In Proceedings of European Conference on Computer Vision, 2008.
[38] N. Ikizler and D. A. Forsyth. Searching video for complex activities with finite state models. In Proceedings of IEEE International Conference on Computer Vision, 2007.
[39] Nazli Ikizler-Cinbis and Stan Sclaroff. Object, scene and actions: Combining multiple features for human action recognition. In Proceedings of European Conference on Computer Vision, 2010.
[40] Stephen S. Intille and Aaron F. Bobick. A framework for recognizing multi-agent action from visual evidence. Journal of Artificial Intelligence Research, 1999.
[41] Imran N. Junejo, Emilie Dexter, Ivan Laptev, and Patrick Perez. Cross-view action recognition from temporal self-similarities. In Proceedings of European Conference on Computer Vision, 2008.
[42] Pakorn KaewTraKulPong and Richard Bowden. An improved adaptive background mixture model for real-time tracking with shadow detection. In Proceedings of Advanced Video Based Surveillance Systems, 2001.
[43] Yan Ke, Rahul Sukthankar, and Martial Hebert. Efficient visual event detection using volumetric features. In Proceedings of IEEE International Conference on Computer Vision, 2005.
[44] Yan Ke, Rahul Sukthankar, and Martial Hebert. Event detection in crowded scenes. In Proceedings of IEEE International Conference on Computer Vision, 2007.
[45] Furqan M. Khan, Vivek K. Singh, and Ramakant Nevatia. Simultaneous inference of activity, pose and object. In Proceedings of Applications of Computer Vision, 2012.
[46] Hedvig Kjellström, Javier Romero, and Danica Kragic. Visual object-action recognition: Inferring object affordances from human demonstration. Computer Vision and Image Understanding, 115(1):81-90, 2011.
[47] Alexander Klaser, Marcin Marszalek, and Cordelia Schmid. A spatio-temporal descriptor based on 3D-gradients. In Proceedings of British Machine Vision Conference, 2008.
[48] Adriana Kovashka and Kristen Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[49] Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498-519, Feb 2001.
[50] Yasuo Kuniyoshi and Moriaki Shimozaki. A self-organizing neural model for context-based action recognition. In EMBS Conference on Neural Engineering, 2003.
[51] Ivan Laptev and Tony Lindeberg. Space-time interest points. In Proceedings of IEEE International Conference on Computer Vision, 2003.
[52] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[53] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[54] Mun Wai Lee and Isaac Cohen. Proposal maps driven MCMC for estimating human body pose in static images. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 334-341, 2004.
[55] Li-Jia Li and Li Fei-Fei. What, where and who? Classifying events by scene and object recognition. In Proceedings of IEEE International Conference on Computer Vision, 2007.
[56] Jingen Liu, Yang Yang, and Mubarak Shah. Learning semantic visual vocabularies using diffusion distance. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[57] David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of IEEE International Conference on Computer Vision, 1999.
[58] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, 1981.
[59] Fengjun Lv and Ramakant Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[60] Jiri Matas, Ondrej Chum, Martin Urban, and Tomas Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 2004.
[61] Pyry Matikainen, Martial Hebert, and Rahul Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Proceedings of IEEE International Conference on Computer Vision Workshop on Video-oriented Object and Event Classification, 2009.
[62] Ross Messing, Chris Pal, and Henry Kautz. Activity recognition using the velocity histories of tracked keypoints. In Proceedings of IEEE International Conference on Computer Vision, 2009.
[63] Darnell Moore, Irfan Essa, and Monson Hayes III. Exploiting human action and object context for recognition tasks. In Proceedings of IEEE International Conference on Computer Vision, 1999.
[64] Vlad I. Morariu and Larry S. Davis. Multi-agent event recognition in structured scenarios. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[65] Louis-Philippe Morency, Ariadna Quattoni, and Trevor Darrell. Latent-dynamic discriminative models for continuous gesture recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[66] Pradeep Natarajan and Ramakant Nevatia. View and scale invariant action recognition using multiview shape-flow models. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[67] Pradeep Natarajan, Vivek K. Singh, and Ramakant Nevatia. Learning 3D action models from a few 2D videos for view invariant action recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[68] Juan Carlos Niebles, Chih-Wei Chen, and Li Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In Proceedings of European Conference on Computer Vision, 2010.
[69] Juan Carlos Niebles and Li Fei-Fei. A hierarchical model of shape and appearance for human action classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[70] Patrick Peursum, Geoff West, and Svetha Venkatesh. Combining image regions and human activity for indirect object recognition in indoor wide-angle views. In Proceedings of IEEE International Conference on Computer Vision, 2005.
[71] Ariadna Quattoni, Sy B. Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[72] Deva Ramanan. Learning to parse images of articulated bodies. In Advances in Neural Information Processing Systems 19, pages 1129-1136. MIT Press, Cambridge, MA, 2007.
[73] Deva Ramanan and David A. Forsyth. Automatic annotation of everyday movements. In Proceedings of the Conference on Neural Information Processing Systems, 2003.
[74] Deva Ramanan and Cristian Sminchisescu. Training deformable models for localization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 206-213, June 2006.
[75] Michalis Raptis, Iasonas Kokkinos, and Stefano Soatto. Discovering discriminative action parts from mid-level video representations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[76] Michalis Raptis and Stefano Soatto. Tracklet descriptors for action modeling and video analysis. In Proceedings of European Conference on Computer Vision, 2010.
[77] Michael S. Ryoo and Jake K. Aggarwal. Recognition of composite human activities through context-free grammar based representation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[78] Sreemanananth Sadanand and Jason J. Corso. Action bank: A high-level representation of activity in video. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[79] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: A local SVM approach. In International Conference on Pattern Recognition, 2004.
[80] Paul Scovanner, Saad Ali, and Mubarak Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM Multimedia, 2007.
[81] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard. Tracking loose-limbed people. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume I, pages 421-428, June 2004.
[82] Vivek Kumar Singh, Furqan M. Khan, and Ram Nevatia. Multiple pose context trees for estimating human pose in object context. In IEEE Workshop on Structured Models in Computer Vision, in conjunction with Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[83] Michael Siracusa and John W. Fisher III. Tractable Bayesian inference of time-series dependence structure. In Proceedings of International Conference on Artificial Intelligence and Statistics, 2009.
[84] Jeffrey Siskind. Grounding lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research, 2001.
[85] Cristian Sminchisescu, Atul Kanaujia, Zhiguo Li, and Dimitris Metaxas. Conditional random fields for contextual human motion recognition. In Proceedings of IEEE International Conference on Computer Vision, 2005.
[86] Thad Starner and Alex Pentland. Real-time American Sign Language recognition from video using hidden Markov models. In ISCV, 1995.
[87] Ju Sun, Xiao Wu, Shuicheng Yan, Loong Fah Cheong, Tat-Seng Chua, and Jintao Li. Hierarchical spatio-temporal context modeling for action recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[88] Camillo J. Taylor. Reconstruction of articulated objects from point correspondences in a single uncalibrated image. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[89] Son D. Tran and Larry S. Davis. Event modeling and recognition using Markov logic networks. In Proceedings of European Conference on Computer Vision, 2008.
[90] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[91] Yang Wang, Hao Jiang, Mark S. Drew, Ze-Nian Li, and Greg Mori. Unsupervised discovery of action classes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1654-1661, 2006.
[92] Yang Wang and Greg Mori. Hidden part models for human action recognition: Probabilistic vs. max-margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[93] Daniel Weinland, Remi Ronfard, and Edmond Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 2006.
[94] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In Proceedings of European Conference on Computer Vision, 2008.
[95] Andrew Wilson and Aaron Bobick. Parametric hidden Markov models for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999.
[96] Shandong Wu, Omar Oreifej, and Mubarak Shah. Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In Proceedings of IEEE International Conference on Computer Vision, 2011.
[97] Xinxiao Wu, Dong Xu, Lixin Duan, and Jiebo Luo. Action recognition using context and appearance distribution features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[98] Junji Yamato, J. Ohya, and Kenichiro Ishii. Recognizing human actions in time-sequential images using hidden Markov model. In IEEE Conference on Computer Vision and Pattern Recognition, 1992.
[99] Bangpeng Yao and Li Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[100] Lahav Yeffet and Lior Wolf. Local trinary patterns for human action recognition. In Proceedings of IEEE International Conference on Computer Vision, 2009.
[101] Alper Yilmaz and Mubarak Shah. Actions sketch: A novel action representation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[102] Jiayong Zhang, Jiebo Luo, Robert Collins, and Yanxi Liu. Body localization in still images using hierarchical models and hybrid search. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume II, pages 1536-1543, June 2006.