A UNIFIED BAYESIAN AND LOGICAL APPROACH FOR VIDEO-BASED EVENT RECOGNITION

Copyright 2003 by Somboon Hongeng

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2003

This dissertation, written by Somboon Hongeng under the direction of his dissertation committee, and approved by all its members, has been presented to and accepted by the Director of Graduate and Professional Programs, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY. Date: May 16, 2003.

Dedication

To my parents.

Acknowledgements

I would like to extend special thanks to my advisor, Professor Ramakant Nevatia, for having given me the opportunity to work at IRIS-USC and for his guidance and support over the past five years. Thanks, Ram, for your wisdom and knowledge, and for your faith in me. You have helped me become a better scientist. I would also like to thank my defense committee, Professors Gerard Medioni and Laurent Itti, for their willingness to serve and for the valuable comments that helped improve the quality of the thesis. I wish to thank Professor Keith Price for serving on the committee of my qualifying exam. Francois Bremond also deserves special thanks for his role in my development in the area of event recognition. He has been an outstanding mentor and a great source of inspirational ideas and motivation for me. I thank Tao Zhao and Fengjun Lv for having read and commented on Chapter 5 of the thesis. Thanks, Tao, for your valuable advice. Alexandre Francois also provided useful comments on the draft of the thesis. Thanks, Alex.

I also wish to thank all members of IRIS for their support and friendship. Thanks, Jinman Kang, for being such a wonderful office mate and for putting up with my messy desk. Thanks, Elaine Kang and Philippos Mordohai, for joining me for coffee breaks and for great conversations about every little thing that helped me relax. Special thanks to Elaine, Pierre Kornprobst, and Professor Isaac Cohen for their help with moving blob tracking and image stabilization. I also thank ZuWhan Kim, Sung Chun Lee, Mircea Nicolescu, Chi-Keung Tang, Bertrand Leroy, Mi-Suen Lee, Qian Chen, Andres Huertas, Delsa Tan and everyone else whom I may be forgetting, who made graduate school fulfilling and fun, at times.
A special thanks to Chalermek Intanagonwiwat, Somsak Datthanasombat and Poonsuk Lohsoonthorn for their friendship and help in producing video data for analysis. Most importantly, I am grateful to my family back home in Thailand for their unconditional love, support and unselfish sacrifice throughout my life. I owe them the world.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Difficulties in Video-Based Event Recognition
1.2 Goal
1.2.1 Large Scale Activities
1.2.2 Goal Description
1.3 Event Recognition Methodologies
1.3.1 Low-Level Movements
1.3.2 High-Level Activities
1.4 Summary of Our Approach
1.4.1 Example: "Stealing by Blocking"
1.5 Summary of Contributions
1.6 Reader's Guide
2 Previous Work
3 Overview of the System
3.1 Video Input
3.2 Context
3.3 Camera Calibration
3.4 Detection and Tracking of Moving Objects
3.5 Event Modeling
3.5.1 Event Classification
3.6 Event Inference
4 Detection and Tracking of Moving Regions
4.1 Ground Plane Assumption for Filtering
4.2 Merging Regions Using K-S Statistics
4.3 Resolving the Discontinuity of Object Trajectories
4.4 Tracking Objects in Videos of Low-Angle Viewpoint
5 Single-Thread Event Recognition
5.1 Object Class and Simple Event Recognition
5.1.1 The Structure of Bayesian Networks
5.1.2 Parameter Learning
5.2 Complex Event Recognition
5.2.1 Probabilities of Multi-State Complex Events
5.2.2 Modeling the Probability Distributions P_i(d) of Event Durations
5.2.3 Complex Event Recognition Algorithm
5.2.4 Segmenting Complex Events from Video Streams
5.2.5 Implementation of Complex Event Recognition Algorithm
5.3 Analysis Results of Single-Thread Events
5.3.1 Recognizing Competing Events in UAV Videos
5.3.2 Event Segmentation from Continuous Video Streams
5.3.3 Recognition of "Converse" and "Taking Object"
5.4 Discussion
6 Multi-Thread Event Recognition
6.1 Event Graph
6.2 Segmenting and Managing the Probabilistic Event Instances
6.3 Evaluation and Propagation of Temporal Relations in Event Graph
6.4 Multi-Thread Event Analysis Results
6.4.1 Recognition of "Stealing by Blocking"
6.4.2 Recognition of Activities from Noisy Data
6.4.3 Computation Time
6.4.4 Application: Automatic Video Annotation
7 Performance Characterization
7.1 Levels of Noise
7.2 Variable Event Durations
7.3 Varying Execution Styles
7.3.1 Comments
8 Conclusion
8.1 Summary of Contributions
8.2 Future Work
Reference List

List of Tables

3.1 Primitive events that can be inferred directly from mobile object properties. These primitives are fundamental and can be re-used to define more abstract events in other domains.
6.1 Description of the multiple-thread event "stealing by blocking".
6.2 Computation time of video sequences.

List of Figures

1.1 An event recognition system.
1.2 The "Stealing by Blocking" sequence.
1.3 Event recognition processes.
3.1 Overview of the system.
3.2 A representation of the complex event "converse".
3.3 An event modeling schema.
4.1 Splitting of moving regions and noise.
4.2 Projection of the bottom points of moving regions on to the ground plane.
4.3 The "Stealing by Blocking" sequence. "A" approaches a reference object (the person standing in the middle with his belongings on the ground). "B" and "C" then approach and block the view of "A" and the reference person from their belongings. In the mean time, "D" comes and takes the belongings.
4.4 A graph representation of the possible tracks of object "D". (a) Without using the ground plane knowledge, several hypotheses can be made about the possible tracks of the object. (b) After filtering, regions are merged or disregarded, decreasing the ambiguity.
4.5 (a) Object 3 is considered a different object from object 2 due to the total occlusion. (b) Due to the reflection on a car surface, the moving region of object 5 is larger than that of object 6, causing the discontinuity of the trajectory.
4.6 Trajectories of the objects tracked using the ground plane knowledge, K-S statistics and event consistency.
4.7 The "Stealing by PhoneBooth" sequence.
4.8 The detection and tracking results of the "Stealing by PhoneBooth" sequence.
4.9 The trajectories of moving objects in the "Stealing by PhoneBooth" sequence projected on the ground plane.
4.10 The "Object Transfer" sequence.
4.11 The detection and tracking results of the "Object Transfer" sequence.
4.12 The "Assault" sequence.
4.13 The detection and tracking results of the "Assault" sequence.
5.1 A detailed illustration of a naive Bayesian classifier that is used to infer "approach the reference person" in figure 3.2. Given that e1, e2 and e3 are conditionally independent given H, the belief is propagated from the sub-events e1, e2 and e3 to infer the probability distribution of H (i.e., P(H | e1, e2, e3)) by applying Bayes' rule. e1, e2 and e3 can also be a parent event of other naive Bayesian classifiers as shown in figure 3.2.
5.2 A finite-state automaton that represents the complex event "approach then stop".
5.3 Segmenting the pattern of simple events in "approach then stop".
5.4 The first two states of an HMM representation of "a person gets cash from the ATM".
5.5 The exponential probability distribution of the duration of event state S_i: P_i(d) = P(S_i^t | S_i^{t-1})^(d-1) (1 - P(S_i^t | S_i^{t-1})), where d = t - t_1 + 1 and P(S_i^t | S_i^{t-1}) is 0.9933.
5.6 The processing steps performed on state S_i at time t.
5.7 The processing steps performed on state S_q at time t.
5.8 Two checkpoint sequences: (a) "a car goes through the checkpoint", (b) "a car avoids the checkpoint". The checkpoint is defined as the zone that lies between the two tanks at the intersection.
5.9 Detection and tracking of moving regions for "CheckPntA".
5.10 Event analysis results of the sequence "CheckPntA". II) and III) show the evolution of the probabilities of two competing complex event models.
5.11 Detection and tracking of moving regions for "CheckPntB".
5.12 Event analysis results of the sequence "CheckPntB". II) and III) show the evolution of the probabilities of two competing complex event models.
5.13 Analysis of single-thread events in the simulated sequence "CheckPntC".
5.14 Analysis of single-thread events in the simulated sequence "CheckPntD".
5.15 Analysis of single-thread events in the simulated sequence "CheckPntE".
5.16 Analysis of single-thread events for object "A".
5.17 Analysis of single-thread events for object "D".
6.1 A graphical description of a multi-thread event.
6.2 The likelihood of the complex event "chase after" inferred by a probabilistic finite state automaton. The event, at different times, may have different likely start times depending on the most likely transition timings between states. For example, the most likely start times of the event during the solid, dark circles are 118, the grey circles 175, and so on.
6.3 Single-thread event analysis results of the actors in the sequence "stealing by blocking".
6.4 Event analysis results of the sequence "stealing by blocking". The two plots of the most significant instantiations of "stealing by blocking" in (b) have different actor and event combinations.
6.5 A graphical description of "Object Transfer".
6.6 Analysis results of the action threads in the sequence "object transfer".
6.7 Multi-thread event analysis results of the sequence "object transfer".
6.8 A graphical description of "Stealing by PhoneBooth".
6.9 Analysis results of the action threads in the sequence "stealing by phoneBooth".
6.10 Multi-thread event analysis results of the sequence "stealing by phoneBooth". The global event "object transfer" is also recognized with the same probabilities as the "stealing by phoneBooth".
6.11 A graphical description of "assault".
6.12 (a), (b) and (c) show the analysis results of the action threads in the sequence "assault". (d) and (e) show the evaluation of the event thread relations and the probabilities of the detected global event "assault".
6.13 The "Stealing By Blocking" annotated sequence.
6.14 The "Stealing By Phone" annotated sequence.
6.15 The "Object Transfer At Bench" annotated sequence.
6.16 The "Assault" annotated sequence.
7.1 Two test patterns of walking in a parking lot: (a) "a person passes by the reference person", (b) "a person makes contact with the reference person". We examine how well our system discriminates these competing events when the trajectories are corrupted with various levels of noise.
7.2 ROC curves for a data set with various noise levels.
7.3 Results of the test data set for "Approach Then Leave" sequences.
7.4 Results of the test sequences "Approach Then Stop at the Reference Person".
7.5 ROC curves for a data set with varying execution styles.

Abstract

Automatic generation of a description of video data has been one of the most active research areas in recent years. Many of the current video analysis systems describe a video stream in terms of layers of background and foreground objects and their motion descriptions, providing a compact representation of videos. However, such a representation lacks knowledge of the video content, which is the most crucial information in many applications such as video annotation, semantic-based video summarization, video surveillance and advanced human-machine interfaces.

To understand the content of videos, a computer vision system must be capable of bridging the gap between the dynamic pixel-level information of image sequences and high-level event descriptions. First, objects in the scene must be detected and recognized from the video. Finding and recognizing objects in the clutter of image features with noise, shadows and occlusion is one of the most challenging problems in computer vision. Second, a description of actions involving individual objects and the global situation must be produced in some representation scheme. Determining an event representation suitable for machine perception is one of the key issues. An event representation should be generic enough to model a variety of event types, and accurate enough to allow similar events to be discriminated. As in other pattern recognition tasks, a pattern of interesting events (i.e., ones that match the event models) must be segmented from continuous video streams. This is a particularly difficult task because
of the uncertain nature of both the input data (i.e., detection and tracking noise) and the event models. While different events may be similar in appearance, the same events may appear differently depending on how they are executed. A large variation in the time scales of some events makes it more difficult to segment them than in many other pattern recognition tasks.

In this thesis, we show that automatic event understanding from video streams may be achieved for a large class of events based on the observation of the trajectories and shapes of objects. We propose a new formalism, in which events are described by a hierarchical representation consisting of image features, mobile object properties and event scenarios. Taking the image features of tracked moving regions from an image sequence as input, mobile object properties are first computed by specific methods while noise is suppressed by statistical methods. Events are viewed as consisting of single or multiple threads. In a single-thread event, relevant actions occur along a linear time scale; they are recognized from mobile object properties using Bayesian networks and stochastic finite automata. For multiple-thread events, several threads of events are related by logical and temporal constraints. These constraints are verified in a probabilistic framework where all possible durations are considered for an optimal estimate. An Event Recognition Language is proposed for describing these events in a natural way. This particular design is based on our intention to describe events at the symbolic level as well as to provide optimal recognition of these events from low-level visual facts in the presence of the variations in event execution style and the tracking noise that are native to the analysis of real image sequences.

Chapter 1

Introduction

One of the most fascinating aspects of human intelligence is the capability of perceiving the surrounding environment, describing it to another person, and making plans or taking actions based on that perception. Equipping a machine with the same capability would be a major step toward developing a truly intelligent system. In recent years, automatic generation of a description of video streams has been actively studied. Many of the current video analysis systems describe a video stream in terms of layers of background, foreground objects and their motion descriptions, providing a compact representation of videos but lacking an understanding of the content.

Knowledge of the events that occur in videos is crucial for a number of tasks that can augment human capability. Some examples are as follows:

• Automatic video surveillance can provide security in a home, an office, or an outdoor environment [BT97, AS01, PM01]. Monitoring of children, the elderly or people with disabilities can allow such individuals to be cared for at a high quality level and at lower cost [KGM+02].

• Computer awareness of dynamic information about the environment is important for enhanced human-computer interaction. Such capability can augment the usability of systems such as an immersive virtual environment [BID+99] or a gesture-driven video game interface [FAB+98].
• In recent years, the volume of video collections available on the internet has increased significantly, thanks to decreasing storage costs, increasing network capacity and the easy availability of software for exchanging digital videos. Effective support for intellectual access (e.g., access to the critical segments or shots that contain the events a user is looking for) is becoming necessary. High-quality video annotation and content-based video summarization can provide users with a more efficient storage and retrieval scheme [SDV00].

• Automatic robot navigation relies on machine perception of the dynamic environment [KHE98, DK02], so that navigation courses can be adapted accordingly. Such advanced navigation systems are also useful in other unmanned vehicles.

• Applications in the sports domain include automatic analysis of sports videos, either for counter-strategic planning or for self-practice, as in interactive sports tutoring [NBK00, IB01, SC02].

• Action-model-based coding of a dynamic scene, which is now becoming a standard for MPEG video coding (www.mpeg-industry.com), provides a rich representation of video streams that was previously unimaginable. Video segments of interest can be transmitted and reconstructed in a remote location using less bandwidth.

• In advanced video conferencing, the systems need to observe the actions of all participants and control the video presentation in an intelligent and responsive way.

In this dissertation, we will present approaches for detecting and segmenting events from video streams, which represent key components of a solution to these applications.

1.1 Difficulties in Video-Based Event Recognition

To understand the content of videos, a computer vision system must be capable of bridging the gap between the dynamic pixel-level information of image sequences and the high-level event description. Even though humans can perform this task with little effort, establishing such a capability in a machine is extremely hard. A common approach involves first detecting and segmenting objects in a scene. Moving objects are then tracked and their descriptions (e.g., shape, size, location and motion) are generated. The goal of this step is to transform pixel-level data into low-level features that are more appropriate for activity analysis. From these low-level features, the type of moving objects and the pattern of their movements and interactions are then analyzed [BG95, IB00, IB01, HN01]. Such interactions may also be defined with regard to scene context, such as the identities of major landmarks (e.g., "a person is approaching the building"). Finally, a description of actions involving individual objects and the global situation must be produced in some representation scheme.

There are several challenges that need to be addressed to achieve this task:

• Input data is uncertain by nature. Finding and tracking objects in real video data are often unstable due to poor video quality, shadows and occlusion. Recognizing objects in the clutter of image features with noise is one of the most challenging problems in computer vision. A single-view constraint common to many applications further complicates these problems.
• Moving objects may interact with other static objects in the scene, requiring static scene understanding, which is a well-known difficult computer vision problem.

• Determining an event representation suitable for machine perception is very challenging, as there exists a wide variety of event types: facial expressions, hand gestures, static human poses, or a "large-scale" activity that may involve physical interaction among locomotory objects moving around the scene for a long period of time. The features that are suitable for recognizing these different event types seem to be very different.

• In addition to the event type, the choice of motion features for event recognition may depend on the goal of the application and the quality of the input data. If the goal is to discriminate between "walking" and "tip-toeing", tracking the subtle articulated movement of the body parts may be necessary. However, to detect that a person is "walking" toward a building, the dynamics of the person's location in the scene is the important feature, and the way the person moves his or her legs is not important.

• Temporal characteristics of events are varied. Some events occur instantaneously; for example, "entering a room" is almost instantaneous (such events are sometimes called change-of-state events in the linguistics literature). In contrast, "parking a car" is durative and can take several minutes.

• While input data are numeric entities, events are conceptual and mostly indeterministic entities; for example, "a person is standing near the phone" is a concept that cannot be quantified precisely. Mapping functions between these entities are therefore probabilistic in nature. Events can also be defined with large latitude in how long they may last or in the details of how the activities are performed.

• The execution style of the same activity by different actors can be different, leading to variations in appearance and duration. Repeated performance by the same individual may also vary each time.

• The same activity can appear differently depending on the viewpoint. Similar motion patterns may be caused by different activities (e.g., "a person sits down" versus "a person squats down").

• The pattern of interesting events (i.e., ones that match the event models) must be segmented from continuous video streams. This is a particularly difficult task because of the uncertain nature of both the input data and the event models. Probabilities must be incorporated appropriately in the event models and computed rigorously at all processing levels.

1.2 Goal

Addressing all the issues in event detection is enormously challenging and a major undertaking. It would be an unrealistic goal, in this thesis, to develop a universal event detector. A reasonable system, nevertheless, must capture a large set of events. In this dissertation, we focus on the detection of large-scale activities observed from a distance. We assume that some knowledge of the scene (e.g., the characteristics of the environment and the objects in it) is available.

1.2.1 Large Scale Activities

One characteristic of the activities of interest to us is that they can be described by some spatio-temporal characteristics of the trajectory of whole-body motion and a rough change of object shapes.
A body, in this case, can be a single body part (e.g., a hand) or the whole body of a human that can be segmented and tracked. The scope of such shape- and trajectory-based events covers a large set of our daily activities. For example, consider the task of loading objects from one bin to another. This task may be accomplished by repeating the following four sub-actions: "pick up an object from bin A", "move from bin A to bin B", "put the object down in bin B" and "move from bin B to bin A". When viewed from a distance, each of these actions can be recognized from the pattern of the trajectory of the whole hand. In other words, the articulated motion of the hand by which an object is grabbed and released is not necessary. A more complex scenario is a group of people stealing luggage left unattended by a tourist. One particular pattern of the "stealing" event may be: a person "approaches" the tourist to "obstruct" the view of his luggage, while another person "takes" the luggage. This event exhibits three actions coordinated in a timely manner, each of which can be characterized by the trajectories and shapes of the whole bodies of the actors. In the following, the words "event" and "activity" are both used to refer to a large-scale activity.

1.2.2 Goal Description

Our goal is to develop an event analysis system capable of performing the tasks that are illustrated in figure 1.1. The system contains a library of predefined event models, which have specific internal representations suitable for event recognition tasks. We aim to design a formal event representation that describes simple trajectory- and shape-based actions as well as complicated, cooperative tasks performed by several actors. Parameter learning methodologies will be provided so that the statistical spatio-temporal properties of the event data can be encoded properly in the representation.

For a pragmatic system, we expect the representation to be in terms of motion concepts that are close to the way a human describes an event. Based on this representation, we wish to formalize a language that will allow the system to describe the content of videos in a natural way. The availability of such a formal language will also enable the user to easily define new events, which can be compiled into the appropriate internal event representations. However, we do not aim at developing a language compiler in this thesis.

Figure 1.1: An event recognition system.

In figure 1.1, the system takes a video sequence as input and searches for video segments that may match a particular set of significant event models. The general concept of an event can be roughly described as an actor performing some acts (possibly on a target object, in a certain way or direction). Finding and tracking the objects in the scene therefore constitutes the groundwork. For large-scale events, we emphasize tracking and
analyzing the moving blobs that represent the moving objects, not on the analysis of articulated motion. We would like to provide an effective motion detection and tracking system that achieves this task. The shapes and trajectories of objects computed by the detection and tracking system must then be analyzed to determine whether they match any of the event models. We aim to design effective event recognition algorithms that handle the noise of the tracking data and the uncertainty in the spatial and temporal scales of the event models appropriately, so that events can be inferred and segmented from video streams in an optimal fashion.

The output of the event recognition system is the actions performed by each actor, the global events that the actors may coordinate and the most likely segmentation (i.e., the start and end times) of these events. This information may be used further in various ways, depending on the application. In this thesis, we will demonstrate our approach on video annotation, in which the final output may be in the form of a text description of the video contents or a re-generated, annotated video with the event information overlaid on the images. The textual description of the video contents will be based on our proposed formal language.

1.3 Event Recognition Methodologies

Based on the level of abstraction at which we look at the events in the domain of interest, approaches to event recognition can be broadly classified into two groups. The first group views events as movement patterns and models the statistics of changing pixel values. The second group captures a higher-level structure such as semantics and concepts.

1.3.1 Low-Level Movements

Low-level movements are simple movements that can be recognized independently of context. Examples include walking, running, sitting down, standing up and human gestures. Recognition of low-level movements consists of finding simple invariant characteristics of the movement that constitute the signature of that event class.

One approach to recognizing low-level movements is to first segment and track the subject of motion (or its body parts) frame by frame. Some sort of feature vector (e.g., joint angles) is then extracted [Roh94, FL98, YB99]; the feature vector can then be formulated as a function of time.
Instead of considering a feature vector to be a function of time, one may remove the time factor by treating all the frames in a sequence as one long spatio-temporal vector and applying recognition techniques such as principal component analysis (PCA) or nearest-neighbor classification [AA01]. This approach requires normalizing the lengths of the videos so that the feature vectors have the same dimensionality. The computation may be very expensive, as each vector can be very long.

Another approach is to look at global appearances instead of tracking the body parts. Appearance-based approaches have a strong grounding in the analysis of single imagery, such as face recognition [SS01]. They have later been applied to model "motion", with representations that include motion blobs, motion energy, optical flow and temporal texture. Commonly used recognition techniques include temporal template matching [BD01, PN97] and eigen decomposition [EP95, BYJF97, GKRR02]. Another recognition technique is to match appearance models to the image sequence using a state-based approach, in which Hidden Markov Models (HMMs) are often employed [YOI92, SP95, SM96, BOP97].

An HMM consists of a set of states, a set of output symbols, state transition probabilities, output symbol probabilities and initial state probabilities. The states of the models are confined to be some distinct motion appearances. To apply an HMM, a set of training sequences is first used to learn the parameters of the model (i.e., transition probabilities, output symbol probabilities, etc.). To match an unknown sequence against an HMM, the probability that the model could generate that particular sequence is calculated. The HMM that gives the highest probability is the one that most likely generated the sequence.

One useful property of an HMM is its ability to handle varying temporal scales. However, there are some potential drawbacks in using these conventional HMMs for event recognition. One drawback is that the duration of an event state modeled by a discrete transition probability is of an exponential form, resulting in a bias toward shorter event durations. Another drawback is that conventional HMMs do not exploit domain-specific knowledge about physical mechanics; therefore, more observations of events would be required to train the system. It is not clear how well the HMM approach will scale when the problem domain becomes more general. Also, it is not always apparent when a complex continuous model should be broken down into a sequence of distinct and simpler actions.

1.3.2 High-Level Activities

High-level events involve considerable complexity and are defined in a more sophisticated context. They are sometimes executed with careful planning to achieve some goals. Examples include shoplifting, assaulting, playing soccer, and other task-oriented and coordinated activities involving multiple actors. Pattern matching techniques used for movement analysis (e.g., appearance-based approaches) may not apply efficiently, as these activities can take a long time, resulting in a very long spatio-temporal vector. Also, activities are less consistent in the way they are executed than short movements.

The fact that people are capable of recognizing familiar activities viewed in a novel way indicates that there may be a constructive component to the recognition of human action.
This is supported by noticing how people describe a high-level event as a structured combination of lower-level units of meaningful and purposeful motion concepts. This notion of an event is analogous to the object recognition notion of a chair as a class of objects composed of a set of related functional parts that serve the purpose of supporting human bodies. High-level events can be viewed as qualitative entities, which makes it difficult to define objective motion-parametric discrimination rules. High-level events are often analyzed based on semantic reasoning and abstract spatio-temporal logic reasoning in some way [BBC93, DGG93, PB98, HN00].

For example, in [KI93], a machine understander is equipped with a structured knowledge base of causal events. A causal event is a causal process that starts with the intention of the actor and is executed by the bodily motion that leads to a physical effect in the environment. An activity is described by a script that connects these causal events in a semantically consistent way. The recognition process consists of finding the script that best explains the observed dynamic scene.

Most previous approaches to abstract event recognition [PB98] require a discrete decision on the occurrence of events, which is of limited use in many real-world vision domains because the physical effects of an action in the environment may not always be observed reliably. Some approaches consider events only as instantaneous entities and recognize only sequences of such events, while most human actions are durative and can have a more complex temporal structure.

We have described so far two approaches to visual event recognition. Whether an event should be viewed at the movement-pattern level or at the semantic level depends, in fact, on the goal of the application and the level of detail one is interested in. An interesting approach is to combine the two in a unified way. Some experimentation in this direction includes the work by M. Brand et al. [BOP97], Y. Ivanov et al. [IB00] and D. Moore et al. [MEH99]. In [BOP97], a finite state machine is used to capture the global structure of an activity by defining across-model probabilities between a small number of HMMs. This automaton is used to represent the interactions between lower-level event processes. In [IB00, MEH99], inspired by work in speech recognition, a stochastic context-free grammar (SCFG) is used to describe and parse human gestures and multi-tasked activities from primitive units of motion, which are recognized by conventional HMMs. One weakness of these approaches is that they consider only the sequential occurrence of events, while the units of events in many activities can be related in more complex ways (e.g., during, overlapping).

1.4 Summary of Our Approach

Similar to [IB00, MEH99, BOP97], our general approach is to combine the stochastic approaches to human movement (e.g., Bayesian networks, HMMs) and the high-level event analysis approaches (i.e., spatio-temporal logic reasoning) in a unified framework. We define two roles for the objects that are involved in an event: an actor and a reference object. An actor is the subject of the motion.
A reference object can be another mobile object or a scene object (e.g., a telephone booth or a zone) that is described as a point of reference for the motion; for example, the actor entering "the zone". Our strategy for event recognition is to track moving objects and to detect potential events in their varying shapes and velocities and in the changing spatial relations among moving objects and scene objects. This becomes a problem of dynamic modeling of multiple interacting processes. In this dissertation, we take a model-based approach, in which we define a complete spatio-temporal structure of what to observe for the event of interest, and of how a set of interesting video signals maps to it.

In our approach, human users provide some guidance for the event structure in order to reduce the complexity. Our proposed structure bridges the gap between the numerical properties of moving objects and symbolic event concepts in a hierarchical fashion. We consider two groups of object properties from which an event may be defined: 1) the properties of object shapes and their evolution (e.g., outlines, sizes, etc.), and 2) the properties of trajectories and their evolution (e.g., location, direction, speed, etc.). Some properties may be defined in relation to the reference object; for example, the distance from the actor to the reference object. We do not claim that these basic properties are the absolute choices for describing events. Instead, we emphasize that they are useful and suffice for the tasks at hand.

Events are viewed as consisting of single or multiple threads. In a single-thread event, relevant actions occur along a linear time scale. In a multiple-thread event, several threads of events are related by logical and temporal constraints and are used to describe a global scenario. Single-thread events can further be classified into simple and complex events.

Simple events are short-term coherent units of movement and are described hierarchically in terms of sub-events and the properties of the moving objects. A Bayesian network is used to represent the relations between a simple event and the corresponding properties and to estimate the distributions of the parameters. Using the estimated distributions, the probability of a simple event S given the observed properties at time frame t can be computed. Even though the recognition of simple events such as "a person stays in a dangerous zone" may be useful in some applications, a more general human event consists of a series of continuous actions. We define a complex event as a linearly ordered time sequence of sub-events, which can themselves be simple or complex events. Complex events are modeled by a Hidden Markov Model (HMM), where the nodes of the HMM correspond to the sub-events. However, instead of modeling discrete state transition probabilities as in conventional HMMs, event duration distributions are modeled explicitly for each event state. The a priori duration probability distributions and the derived probabilities of the event states of an HMM (e.g., the probabilities computed
Global events or scenarios, in general, will be described by a multi-thread event, which is composed of several action threads, possibly performed by several actors. A multi-thread event is modeled by an event graph, where the nodes of the graph corre spond to event threads (which is an HMM) and the links between nodes correspond to logical and time relations such as “occurs during”, “starts before” and “overlaps”. To recognize a multi-thread event, these relations are verified and propagated in a proba bilistic framework where all possible durations of the event threads (segmented by the corresponding HMMs) are considered for an optimal estimate. As with other pattern recognition tasks, we need to address the question of what information constitute the invariants (signatures) that characterize the patterns in the working domain of interest. Our approach captures the invariants of large-scale events as follows. In simple events, the signature of an event is composed of the probabilities of consistent spatio-temporal properties that are observed during the course of the action. In complex events, they are the causal relations between a series of continuous actions and their duration probabilities. The signatures of multi-thread events are the temporal logical constraints between the event threads. 1.4.1 Example: “Stealing by Blocking” To illustrate our approach, let us revisit the “stealing by blocking ” event described in section 1.2.1. Figure 1.2 shows a video sequence of this event. First, obj A approaches and drops a briefcase near a reference object (the person standing in the middle with his belongings on the ground). Obj B and obj C then approach and block the view of obj 14 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A and the reference person from their belongings. In the mean time, obj D comes and takes the belongings. In order that a computer can detect “stealing by Blocking ”, we model the action threads of each actors as a node in an event graph and define the necessary pair-wise log ical and temporal constraints between the nodes. For example, we may define the action of obj A as “converse” and the action of obj B as “approach to block”. A temporal rela tion between these two nodes is that “converse ” must be accomplished before “approach to block” begins. By observing the video sequence shown in figure 1.2, a human observer, with little effort, can describe such temporal relations between significant events. Each action thread is modeled by an HMM to represent a sequence of sub-events. For example, “converse” is composed of three sub-events: “approach” (the reference person), “bend dow n” (to drop a briefcase) and “stop close to ” (the person). The a pri ori duration distribution of these sub-events (given the context of the event “converse ”) must be estimated appropriately to be used for the inference process. Finally, if the event states of the HMMs are simple events, we use Bayesian networks to model them from the spatio-temporal properties of the blob shapes and trajectories of the actors. For example, the event state “approaching ” (the person) is represented by a Bayesian clas sifier consisting of three primitive trajectory-based events: “heading toward” (the per son), “getting closer to ” (the person) and “slowing down”. Parameters of the Bayesian networks must be estimated to make an inference about the occurrence of these simple events. The action threads of other actors are modeled in a similar fashion. 
Once we have event models of all the actions we expect to see in a particular scene, the system can process a video and determine whether there is a segment in video that may contain the interesting events such as “stealing by blocking ”. The event recognition process is illustrated in figure 1.3, which shows the elements inside the Event Analysis 15 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 490 frame 557 Figure 1.2: The “Stealing by Blocking” sequence. 16 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. System module in figure 1.1. The system first learns the statistics of the background pixels and segment the moving regions. We then use a simple human shape template, the groundplane context and the coherency of color distributions to tracks regions that may correspond to humans and other interesting objects (figure 1.3(a)). For each tracked object, we compute, at each time frame, the probabilities of all sim ple events from the instantaneous properties of the trajectories and blob shapes based on the Bayesian inference. Figure 1.3(b) shows the probabilities of the simple events “approach”, “bend dow n”, “stand close to ”, “pick up luggage”, and “leave” in rela tion to the reference person (obj 0) for the actor obj 1 and 6. At the next level, the prob abilities of these simple events are used as input for the event states in HMM models of complex events. These probabilities are combined with the event duration distribu tions using an adapted Viterbi algorithm to compute the probabilities of HMM models (shown in figure 1.3(c)) and temporally segment the events (i.e., to obtain the start and end times). At the highest level (i.e., closest to the level of qualitative reasoning), an event graph takes the segmented event instances of complex events as input, verifies the necessary temporal and logical constraints and propagates the probabilities along the graph. Figure 1.3(d) shows the probabilities of the pair-wise relations between two event nodes in the event graph for “stealing by blocking ”. Figure 1.3(e) shows the prob ability of the recognized instance of “stealing by blocking”. We note that the steps (a) to (e) that we describe progress sequentially at each time frame, taking the entities pro cessed at the previous step as input. These entities are extracted and processed at the required level of abstraction for the appropriate analysis. The output from step (d) and (e) may be further processed according to the goal of the application. In this thesis, we transform the results into extended Markup Language (XML) for the video annotation application. 17 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1.5 Summary of Contributions In this dissertation, we develop a working video analysis system that takes a video as an input, performs a low-level motion analysis as necessary and searches for segments that may contain the pre-defined significant events. Events are modeled by a transparent, structured link from high-level event description to pixel-level image features, leverag ing the domain knowledge. Based on our proposed event ontology, pre-defined event models can be adapted easily to the goal of an application. Events are detected in a probabilistic framework based on Bayesian analysis and temporal logic reasoning. 
The major contributions of our research can be summarized as follows: • Detection and tracking objects using scene geometry. • Event ontology and hierarchical event representation that bridges the gap between the low-level visual information and the symbolic event description. • Bayesian analysis of simple trajectory- and shape-based events. • Detection and segmentation of a single-thread event from a continuous video stream using a stochastic finite automaton. • Recognition of the interaction of multiple actors using probabilistic event graphs. 1.6 Reader’s Guide The organization of the thesis is as follows. Related work is discussed in chapter 2. An overview of our event detection system is described in chapter 3, together with our proposed hierarchical event representation. Our tracking approach that leverages the knowledge of the ground plane is in chapter 4. Algorithm for recognizing scenarios is described in detail including experimental results in chapter 5 and 6. Performance 18 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. characterization of the algorithm is in chapter 7. Chapter 8 concludes the thesis with the discussion on the strengths and the limitation of the proposed techniques. 19 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Input V ideo E vent M odels A S et of Simple Event Models A S et of Complex Event Models J D L C O . . Q c p. _ q o.e C O ■ ■ “ Approach" — "Stand close" Bend down" 1 1 1 1 1 1 100 150 200 250 300 Fram e N um ber Obj 1:'"Conve 8 - 6 - 4 - 2 - rse" 1 * u ; , £ 0.8 - i o 0.6 • C O - Q 0.4 - o n" o.2 1 - "StandCloseToObj" "PickUpObj" » a "Leave" J — 100 150 200 250 F ram e N um ber A S e t of Multi-Thread Event Models Multi-thread Event Analysis 1 >■•0.8 “Converse(objl), — - — Before, Approach1(obj3)" I --------------------------------------- n o s c o _Q 0.4 o - 0 . 2 _ _ "Converse(objl), ; Before, Approach2(obj5)" ; 1 50 100 150 200 250 300 350 400 Frame Number 450 500 550 P( stealing )=0.99 (b) 350 400 450 500 550 F ra m e N um ber Obj6: Take Object □ _ 0.2 350 400 450 500 550 Fram e N um ber (d) (e) 50 100 150 200 250 300 350 400 450 500 550 Frame Number Figure 1.3: Event recognition processes. 20 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 2 Previous Work During the last decade, there has been a significant amount of event recognition research in various application domains [RFV99, CLEOO]. A review of the current approaches can be found in [AC99, Gav99]. One common characteristic of most approaches is that they are developed for a specific type of events that suit the goal in a particular domain. This is partly due to the tremendous broadness of the scope of motion events ranging from small scale movements such as facial expressions to large scale coordinated tasks. Arguably, most actions can be recognized from the observed body movement. How ever, there is little consensus about which aspects of body movement are most important for motion analysis and several kinds of event representations have been proposed. The most naive way of representing visual events is to define them directly from the evolving pixel values at different time frames. The memory storage required by such a direct rep resentation is prohibitively large, which has led to the development of a more compact representation. 
We review some previous work in event representation and recognition in the following.

Appearance-Based Approaches

Movements are often represented by time-dependent images and their processed variations, such as normalized, smoothed or warped sequences and motion fields. Pattern matching and analysis techniques that are useful for static imagery are often applied to these appearance models.

Template matching techniques have been used to determine whether the motion features derived from a sequence match the features of predefined prototypes [PN94, DB97, Bla99]. In [DB97], Davis and Bobick represent an action by a temporal template, which is a static vector-image computed from accumulative motion properties at each point of the image sequence (or Motion Recency Image (MRI)). View-based template matching techniques are developed to recognize actions by matching their MRIs with the templates of known actions. Even though impressive results are obtained for short events in a constrained environment, temporal segmentation of complex events, confusion among similar movements, and occlusion are difficult to handle. For example, it is difficult to distinguish between "sitting" and "squatting", as when viewed from the front, they have similar accumulation templates.

Motion appearances can be parameterized to obtain concise descriptions of image events. Parameterization can provide strong constraints that can be used for estimating motion in complex scenes as well as for event recognition. In [BY97], facial expressions are recognized using simple parameterized models of facial feature motion that must be detected reliably. The authors exploit the changing parameter values as visual cues for recognition. An important issue in this case is whether the model is scale dependent and how to select the appropriate spatial and temporal scales.

Eigen-shapes are one of the common methods for representing and recognizing motion appearances. In [GKRR02], Goldenberg et al. segment and track moving objects to obtain their boundaries. The sequences of segmented objects performing a particular action are normalized to a certain size and length and used to construct the eigen-shape basis of that action. An action is recognized by projecting the view of a moving body onto these bases to obtain the eigen parameters. Goldenberg et al. used these eigen parameters to classify some cyclic motions of people and animals such as "walking" and "running". Similarly, in [NDS97], Nan et al. apply the eigen decomposition of
the spatio-temporal vector extracted from image intensities to the lip-reading problem. Their method can be computationally expensive when the spatio-temporal vector is very long. All of these appearance-based approaches share one common weakness, which is the limited degree of abstraction that can be handled with an analog representation alone.

Region-Based Approaches

Regions with the same motion characteristics that persist through time may be segmented in the spatio-temporal space. These segmented volumes are analogous to segmented region representations in single-image analysis. For example, in [LP98], Liu and Picard describe a method to detect regions with similar temporal textures. First, an XYT image cube is produced by stacking each image frame (X, Y) captured from a static camera (or with little camera motion) along the time dimension. Moving regions are segmented and tracked along this spatio-temporal volume, and their paths are fit to a line. The moving regions along this path are then aligned at each frame. Frequency analysis of the temporal history of each aligned pixel can discriminate various types of cyclic motions such as a "running human", a "walking human" or a "rotating wheel". Other work that aims at detecting periodic movements by frequency analysis of patterns in the space and time dimensions includes [NA94, PN97, CD00]. The utility of all these techniques in the case of non-periodic motion is relatively limited.

Tracking Body Parts

Another traditional action recognition approach is based on the analysis of the trajectories of the whole body or of some body parts [DCR+98, CB95, MH98]. Instead of tracking regions, individual objects may be segmented, tracked and labeled with some information about their identity, shape and trajectories. Object trajectories (e.g., trajectories of some 3D joint angles or of the whole body) can then be mapped to some phases of actions such as "walking", "running" and "rollerblading" [DCR+98, CB95, MH98].

This approach has been explored extensively using variations of tracking schemes, some of which use motion models [Bre97, Hog83, ZN02, Roh94], while others do not [BH95]. In [Roh94], Rohr uses a Kalman filter to estimate the current movement state of a cylindrical human body model and also to predict its future movement state. The system is tested for tracking people and cyclists. To recognize actions, several Kalman filters may be used, each of which represents a different action. Each filter can predict what the model should look like at a particular time for that action. Based upon how well the prediction made by a particular filter matches the trajectories, we can determine whether that filter is allowed to continue. In the end, only one filter will remain, which corresponds to the most likely action that is going on. Kalman filters work relatively well in a constrained environment (fronto-parallel motion) and when the movements are not complex (e.g., constant velocity).

There are works that use more sophisticated tracking methods such as Condensation [IB98], HMMs and their variants. In [ZN02], Zhao and Nevatia use a hierarchical finite state machine (equivalent to multi-layered HMMs) to track a three-state activity composed of "run", "stand" and "walk". In [DCR+98], Davis et al. describe a system (W4) that detects a human's body parts, captures his gait and tracks both the overall movement of the person and his body parts. Simple periodic events such as "walking" are recognized by constructing dynamic models of the periodic pattern of people's movements; the recognition is highly dependent on the robustness of the tracking.

State-Space Based Approaches

Another commonly used technique for motion understanding is to consider human movements to be driven by hidden intentional states to accomplish a task (i.e., purposeful movements). Observed visual signals can then be thought of as being generated by a stochastic sequence of states, whose spatio-temporal segmentation is ambiguous and needs to be inferred.
Hidden Markov Models (HMMs), widely used in speech recognition, have been a tool of choice for encoding and segmenting the sequence of hidden states due to their computational efficiency and optimal performance [YOI92, BOP97, ORP00, Bre97, SM96]. The term "hidden" means that the state of an HMM cannot be observed directly, and the term "Markov" refers to a first-order Markov assumption that the prediction about the state at time frame t + 1 depends only on the state at time frame t and not on the state history.

In [SP95], T. Starner et al. use an HMM to represent a simple event and recognize this event by computing the probability that the model produces the visual observation sequence. Parameterized HMMs [WB97] and coupled HMMs [BOP97] were later introduced to recognize more complex events such as an interaction of two moving objects. In [BOP97], Brand et al. develop an extended HMM framework, in which HMMs are coupled with across-model probabilities to model the interaction between action processes and represent evolving spatial relations. HMM models are trained to recognize events and then assembled into finite state machines, in which human knowledge about the activity process is used to define the structure of the state automaton (i.e., the transition paths between HMM models). A modified Viterbi algorithm is used to parse video sequences of continuous action, integrating information over time to find the most probable sequence of actions. This approach is similar to that of Zhao and Nevatia [ZN02] but differs in the underlying features used as evidence. In the case that the dynamics of the human body cannot be described in motion concepts understood by humans, some physical logic may still be used as a hint for the underlying transition patterns. The authors illustrate this concept with a tai chi exercise in which some patterns of the motion of two hands are to be recognized. Coupled HMMs, with each model only seeing data from one hand, are used. Recognition results improve significantly compared to those of conventional HMMs applied to data from both hands.

The advantage of HMMs is that they are robust against various temporal segmentations of events. However, when domain-specific knowledge about physical mechanics is not exploited, their structure and probability distributions need to be learned from a large set of observations of events, using iterative methods. For complicated events, such networks and the parameter space may become prohibitively large.

Bayesian Networks

Bayesian networks have been applied in [BG95, BKRK97, MA99, RTB98, IB99] to combine evidence and infer the probabilities of simple events. In [MA99], a Bayesian network is used to model the relationship between the action "sitting" and the change of location of the human head that can be observed during the action. This assumption simplifies the network structure and parameters (i.e., fewer nodes and parameters), but the network may fail to discriminate between "sitting" and other similar actions (e.g., "bending over") in a real application. Also, the outputs of these Bayesian networks are simply fragmented event descriptions at each time frame, conveying little information about long-term, global situations.

In [IB99], Intille and Bobick recognize activities involving multiple agents in a football match using Bayesian networks.
The nodes of the networks represent 1) an event related to some spatial relations of players, or 2) the temporal relation of such events evaluated by specific functions. Their system was demonstrated with good results on a set of football plays that are tracked manually. However, generalizing this system to other event domains can be difficult, as the topology of the networks is fixed.

The main advantage of Bayesian networks is that they embed a single comprehensive model capturing both the qualitative and the quantitative nature of the problem, providing a bridge between numerical values and descriptive events. The limitations of Bayesian networks are that 1) they have a relatively static topology, where the prior and conditional probabilities for the root nodes and links have to be determined beforehand, and 2) they are only useful for mapping numerical properties gathered during one frame to some event concepts and not for segmenting temporal sequences.

Syntactic Approaches

Some efforts have been made to link image sequences with conceptual global events. In [Sri94], a survey of methods for deriving high-level descriptions of dynamic scenes in natural language is presented. In general, temporal events and interactions among objects often have structures that can be described by syntactic rules. The basic data structure at this level tends to consist of an event automaton or some sort of hierarchical network with complicated constraints on the global structure [Neu89, Cor92, Her95, God94].

For example, B. Neumann [Neu89, MN90] described car scenarios based on a pyramidal hierarchy of motion verbs, with elementary motion verbs at the base of the pyramid and complex scenarios at its top. In [Her95], G. Herzog describes a soccer match using events such as "passing the ball", "running", or "shooting the goal". Nagel [Nag88] has done extensive work in mapping the dynamics of a traffic scene into a natural language description using grammars. Car activities in a traffic scene are described and analyzed by a frame-based representation. The primitives of this representation are simple car maneuvers such as turn-left, turn-right, drive-straight and change-to-left-lane. More complex behaviors are detected by finding the appropriate sequence of primitive maneuvers. For example, a take-over might be detected by the following sequence: follow-the-front-car, get-closer-to-the-front-car, change-to-left-lane, move-ahead-of-car-on-right, and change-to-right-lane. However, Nagel did not provide a computational model that bridges the gap between the numerical image data and the event descriptions. Kollnig et al. [KNO94] later extended this system so that the basic maneuvers are detected directly from the characteristics of the optical flows of moving cars.

In [IB00], Ivanov and Bobick define a stochastic context-free grammar (SCFG) parsing algorithm (similar to the SCFGs applied in natural language understanding) that is used to compute the probability of a temporally consistent sequence of primitive actions recognized by HMMs. One limitation of this approach is that an SCFG only describes the ordering of events. Many events, however, can overlap in time. Also, as the grammar becomes more complicated (as in the case of complex activities), the framework of stochastic parsing may become difficult.
In [KI93], Kuniyoshi and Inoue use finite state automata to recognize human manipulations of blocks, in which a global event is considered to be composed of a sequence of action units. The temporal extent of each action unit is defined as the time span during which the motion features and the causality between the source (actor) and the target of the action remain invariant. Action units are segmented from the video streams when there is a change in motion features, prompting the search for the start of the next action unit. They particularly emphasize the use of high-level knowledge, such as the expectation of possible action units, to guide the search. Their system detects the temporal segmentation points using a stable event condition and extracts key information from images at the segmentation points. However, it is not always possible to detect these segmentation points precisely in natural scenes other than the block world.

Another traditional approach to recognizing the interaction of moving objects is based on spatio-temporal logic [BHK93, Gal93]. In [Gal93], A. Galton integrates temporal logic and spatial logic into a framework that can reason about the motion of rigid bodies. Complex descriptions of human actions can be generated based on a set of generic basic spatial and temporal propositions.

In [PB98], Pinhanez and Bobick simplify the interval temporal logic and algebra [AF94, DGG93] (traditionally used for temporal reasoning in situation recognition in the planning context) and apply it to event recognition. Multi-actor events are represented by a network of sub-events constrained by some simplified temporal relations. However, this system is targeted at defining activities at a symbolic level and not at establishing a link to the lower image data level. Primitive events are assumed to be detected accurately.

Semantics Approaches

The most abstract event representation that has been attempted to date is probably the conceptual event representation from the standpoint of Cognitive Science. Most event understanding systems in Cognitive Science are based on the premise that humans represent the knowledge about events in a compact and highly structured form in their memory, enabling them to infer many aspects of visual scenes (both static and dynamic) that may not be directly supported by the visual data. The goal of event understanding in Cognitive Science is to bridge the semantic connection between event concepts and to make explicit the information that has been left implicit, constituting the term "understanding".

Several representations of knowledge have been developed for understanding events that are described by language. In particular, Conceptual Dependency (CD), a theory of the representation of the meaning of sentences, was particularly designed to handle the understanding problem at the individual single-sentence level. The meaning propositions underlying the language are obtained by a conceptualization process, in which a verb is represented by a primitive element together with the explicitly stated concepts that make it unique. A set of nine primitive acts of CD (e.g., MOVE, GRASP, INGEST, etc.) was proposed by R.C. Schank [SA77], representing the elements of actions in mental (e.g., tell, read) and physical events (e.g., kick, grasp).
An example can be given for "give" and "take", whose common conceptual element is "a transfer of possession"; the difference lies in the direction of the transfer. In [SA77], R.C. Schank develops Causal Chains by extending CD. He defines causal types and formalizes the representation of causation that constitutes the principle of causal chaining, providing a way of representing connected text (i.e., a story composed of multiple sentences).

Chapter 3
Overview of the System

Figure 3.1 shows schematically our approach to recognizing the behavior of moving objects from an image sequence (obtained from an image sensor) and available context. Context consists of associated information, other than the sensed data, that is useful for activity recognition, such as a spatial map and prior activity expectation (task context). Our system is composed of two modules: 1) Motion detection and tracking (low-level processing); 2) Event Analysis (high-level processing). We explain each component of our system in the following.

Figure 3.1: Overview of the system (low-level processing: detect and track moving regions, compute mobile object properties; high-level processing: event analysis using user-provided spatial and task context and a library of event models, producing recognized scenarios).

3.1 Video Input

There are a variety of ways that image sequences can be obtained. In many applications, human activities may occur over an extended area and require the use of multiple image sensors. These image sensors may be stationary or can be attached to a mobile platform that moves around or scans through a wide area. When we look at the video sequences obtained in this fashion, we can easily segment the moving objects from the background and understand the activities that go on in the scene with little effort. Establishing the same capability in machines is, however, very difficult. All cameras must be registered under the same reference space and their movements must be compensated such that moving objects can be segmented from the static scene background. In this dissertation, we consider a specific application of monitoring a scene with a single stationary (or stabilized, if there is camera motion) video camera. Detecting events from a single-view video is more common and, in fact, more challenging, as tracking 3D locations and obtaining the shapes of an object from 2D images is inherently an ill-posed problem, which requires a robust camera calibration. Shadows, occlusion and changes of illumination in the environment all make the 2D to 3D correspondence problem in a single-view camera more difficult to solve.

3.2 Context

Interpretation of events from mobile objects and their tracks depends highly on the use and availability of context. For example, to detect a burglary in a supermarket at night when the store is closed, it is sufficient to detect whether someone is performing the action "picking up merchandise". However, during the normal operating hours of the supermarket, this action does not indicate a burglary, requiring different recognition methods.

Context has been defined and used in previous event understanding work [Nag88, Str93, BT98].
F. Bremond and M. Thonnat [BT98] define context as the accessory information, other than sensed data, that is used during the processing to help the process complete the task efficiently. In our experiments, we use two kinds of context: spatial and task context. Spatial context provides the identity and location of the major static objects in the scene. Task context provides a priori expectations about the events to be detected, which depend on the goal of the application. Such expectations help choose specific recognition methods (including the necessary parameters) among other related activities. For example, the task context for "monitoring a checkpoint" consists of a set of recognition methods for interesting behaviors related to the checkpoint, such as "avoiding the checkpoint" and "passing through the checkpoint", and a set of parameters required by the methods, such as the expected size of vehicles.

In the case where the scene is not known in advance, spatial context must be inferred from the video, requiring static object recognition, which is a topic of extended research by itself. Such recognition, in general, requires segmentation and structured 3-D descriptions of the objects in the scene. In this dissertation, we study event understanding in the context of video surveillance, where we observe a known site over extended periods of time. We assume that the spatial context of such a site is provided by the users. The ground plane where an event occurs is partitioned into zones delimited by polygons. The users define these zones directly on the 2D background image that is recovered as a by-product of the detection of moving regions. The decomposition of image space helps us organize a large and complex space into smaller and simpler sub-spaces. Each polygonal zone has a symbolic name (e.g., road, checkpoint, etc.) that links to other contextual information, such as the types of activities related to the zone that we may want to detect (i.e., task context). For example, a checkpoint may link to activity models related to "monitoring the checkpoint". When a car approaches the checkpoint, appropriate recognition methods can be triggered via this link to check whether it is passing through or avoiding the checkpoint. In addition to regions, a user may also define the locations of other important static scene objects together with their symbolic names.

3.3 Camera Calibration

Most events of interest in video surveillance applications are viewpoint independent and can be recognized based on the analysis of trajectories on a 3D ground plane. At each time frame, the locations of moving objects and scene context defined on the 2D camera viewing plane must be projected onto the ground where they stand. The camera must be calibrated accurately to obtain the projection matrix that establishes a correspondence between points on the two planes. In general, there are two ways to calibrate a camera using a single image: an algebraic approach and a geometrical approach. In an algebraic approach, the projection matrix is computed from the direct measurement of the coordinates of 3D points in the scene and their corresponding 2D image points [ref].
In a geometrical approach, geometric properties of the scene such as vanishing points, parallel lines and the orthogonality of planes are used as projective constraints to derive the camera parameters [ref]. In our experiments, we use the direct calibration method, since geometric properties are not always available. Also, these properties sometimes cannot be obtained reliably enough to produce accurate estimates of the camera parameters required for event analysis applications.

3.4 Detection and Tracking of Moving Objects

Our tracking system is augmented from the graph-based moving blob tracking system described in [MCB+01]. The color intensities of background pixels are learned statistically in real time from the input video streams. Moving regions are segmented from the background by detecting changes in intensity.

Tracking moving objects over the image sequence amounts to making hypotheses about the shapes of the objects from moving regions and establishing correspondences between them at different time frames in order to determine the trajectories. Correspondence between moving regions can be made based on object templates, color or texture. In our case, a 2D dynamic color template of the object is used. First, knowledge of the ground plane, acquired as spatial context, is used to filter moving regions. A dynamic template is then inferred from the color distribution of the filtered regions integrated over a few frames. We assume that the most common moving objects in video surveillance are humans, in which case the knowledge of human shapes (e.g., the width and height of the bounding boxes) is incorporated into these 2D templates in the initialization process.

A graph is used to represent moving regions and the way they relate to each other. Each node is a region and each edge represents a possible match between two regions in two different frames. We assign to each edge a cost, which is the likelihood that the regions correspond to the same object. The trajectory of a moving object is obtained from the optimal path along each of the graph's connected components. Features of the trajectories and shapes of moving objects are then computed by low-level image processing routines and used to infer the probability of potential events defined in a library of scenario event models.

3.5 Event Modeling

One important issue in pattern recognition is to find a set of features and a representation suitable for machine perception in a particular problem domain. In large-scale event recognition, the features of interest, in our case, are the trajectories and shapes of the moving blobs that represent the objects. Features should be chosen such that they capture the invariance over different instances of the same action. The issue here is how an event can be modeled from these features, which effectively bridges the gap between numerical pixel values and the symbolic event descriptions. Events should also be modeled such that they resolve the ambiguity of temporal scale (analogous to the scale-space problem in single-image analysis). In this thesis, events in the scenario library are modeled using a hierarchical event representation, in which a hierarchy of entities is defined to bridge the gap between the high-level event description and the pixel-level information.
Figure 3.2 shows an example of this representation for the event "converse", which is described as "a person approaches a reference person, drops a briefcase and then stops". Image features are defined at the lowest layer of the event representation. Several layers of more abstract mobile object properties and scenarios are then constructed explicitly by users to describe the more complex and abstract activity shown at the highest layer.

Figure 3.2: A representation of the complex event "converse".

Mobile object properties are general properties of a mobile object that are computed over a few frames. Some properties can be elementary, such as width, height, color histogram or texture, while others can be complex (e.g., a graph description of the shape of an object). Properties can also be defined with regard to the context (e.g., "located in the security area"). In figure 3.2, mobile object properties are defined based on spatio-temporal characteristics of the corresponding bounding boxes of moving blobs and the trajectories. The links from a mobile object property at a higher layer to a set of properties at the lower layers represent some relation between them (e.g., taking the ratio of the width and height properties to compute the aspect ratio of the shape of a mobile object). A filtering function and a mean function are applied to property values collected over time to minimize the errors caused by environmental and sensor noise.

Scenarios correspond to activities described by the classes of moving objects (e.g., human, car or suitcase) and the event in which they are involved. Mobile objects involved in an action are assigned two roles: source and target. A source mobile object performs the action, and the target object is the object that is acted upon or the reference of the action (e.g., moving toward "the building"). Both the class of an object and the event have a confidence value (or a probability distribution) attached to them based on statistical analysis. Scenarios are defined from a set of properties or a set of sub-scenarios. The structure of a scenario is thus hierarchical. In the following, we describe the classification of entities at the scenario level and how these entities are modeled from mobile object properties.

3.5.1 Event Classification

We classify scenario events into single-thread and multiple-thread events. In a single-thread event, relevant actions occur along a linear time scale (as is the case when only one actor is present). In multiple-thread events, multiple actors may be present. A single-thread event is further categorized into a simple or a complex event.

Simple, Single Thread Events (or Simple Events)

Simple events are short, coherent units of movement that can provide important clues to the type of motion, for example, a change of direction, a stop, or a pause. This type of event can be recognized without much semantics or reasoning.
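Since simple events are inferred from trajectory- and shape-based mobile object properties of the kind described above, the following is a minimal sketch (in Python, with hypothetical names chosen for illustration; not the implementation used in this thesis) of how a few such properties might be computed from ground-plane positions and smoothed over a short window of frames:

```python
import math
from collections import deque

class MobileObjectProperties:
    """Sketch: per-frame trajectory properties, smoothed over a short window."""

    def __init__(self, window=5):
        self.centers = deque(maxlen=window)   # recent (x, y) ground-plane positions

    def update(self, center, ref_center):
        self.centers.append(center)
        if len(self.centers) < 2:
            return None
        (x0, y0), (x1, y1) = self.centers[-2], self.centers[-1]
        dist = math.hypot(ref_center[0] - x1, ref_center[1] - y1)
        prev_dist = math.hypot(ref_center[0] - x0, ref_center[1] - y0)
        # "distance evolution": negative when the object gets closer to the reference
        dist_evolution = dist - prev_dist
        # heading relative to the direction of the reference object (0 = straight toward it)
        heading = math.atan2(y1 - y0, x1 - x0)
        to_ref = math.atan2(ref_center[1] - y1, ref_center[0] - x1)
        heading_error = abs((heading - to_ref + math.pi) % (2 * math.pi) - math.pi)
        # mean speed over the window, as a simple filter against sensor noise
        mean_speed = sum(
            math.hypot(b[0] - a[0], b[1] - a[1])
            for a, b in zip(list(self.centers)[:-1], list(self.centers)[1:])
        ) / (len(self.centers) - 1)
        return {"speed": mean_speed, "distance": dist,
                "distance_evolution": dist_evolution,
                "heading_error": heading_error}

# Example: a few ground-plane positions of an actor approaching a reference person at (0, 0)
props = MobileObjectProperties()
for pos in [(10.0, 8.0), (9.0, 7.2), (8.1, 6.5), (7.3, 5.9), (6.6, 5.4)]:
    out = props.update(pos, ref_center=(0.0, 0.0))
print(out)
```

Primitive events such as "getting_closer" or "direction_toward" (see table 3.1 below) can then be read off thresholded or probabilistically modeled versions of these property values.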
Simple events are described by a concurrent logical constraint on a set of sub-events (other simple events) or directly from a set of mobile object properties. For example, in figure 3.2, the simple event "a person is approaching a reference person" is described by the occurrence of three sub-events: "getting closer to the reference person", "heading toward it", and "slowing down". These sub-events are defined from the properties of the object trajectory. We call the events that can be inferred directly from the mobile object properties primitive events. Table 3.1 lists some of these primitive events.

blob_shape: standing_pose, crouching_pose, height_increasing, height_decreasing, ...
blob_speed: low_speed, high_speed, medium_speed, speed_stable, speed_increasing, speed_decreasing, ...
blob_position: close, in_moderate_proximity, far, getting_closer, getting_further, ...
blob_nature: human, small_object, noise, ...
trajectory: direction_toward, direction_away, unsettled_direction, straight, deviating, ...

Table 3.1: Primitive events that can be inferred directly from mobile object properties.

These primitives are fundamental and can be re-used to define more abstract events in other domains. The representation of a simple event can be viewed as an instantiation of a Bayesian network, which is constructed by determining the causal relations between mobile object properties and simple events.

Complex, Single Thread Events (or Complex Events)

While simple events are defined and labeled over a short period, complex events correspond to a linearly ordered time sequence of simple events (or other complex events) and are verified over a long sequence of frames. They can be as long as hundreds of frames and are typically longer than ten frames. For example, the complex event "converse" is described as a linear occurrence of three consecutive sub-events: "a person approaches the reference person", "bends down" (to drop a briefcase), then "stops near" (figure 3.2). This definition of an event suggests the importance of time, prompting some sort of spatio-temporal structure to be incorporated into the representation. We propose to use a finite state automaton to represent a complex event, as it allows a natural description [HBN00].

Multiple Thread Events

Multiple-thread events correspond to two or more single-thread events with some logical and time relations between them. Each composite thread may be performed by a different actor. In such a case, multiple-thread events can be considered to model interactions among actors.

Several other representations have been proposed for modeling interactions among actors in the past. Coupled HMMs are used in [ORP00], where each HMM represents one action process. Interactions are defined at a fine temporal granularity and are represented by probabilistic transitions from a hidden state of one HMM to the hidden states of another HMM. We consider event threads to be large-scale temporal events (i.e., with start and end times) and propose to use the pair-wise interval-to-interval relations first defined by Allen [AF94], such as "before", "meets", "during" and "overlap", to describe the temporal relations between sub-events of a multi-thread event, for example, "event A must occur before event B" or "event A or event D occurs".
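As an illustration of how such pair-wise interval relations can be checked on the (start, end) frame intervals produced by the lower layers, here is a minimal sketch in Python (a simplified illustration of a few of Allen's relations, not the constraint machinery of our event graph):

```python
def before(a, b):
    """Interval a = (start, end) ends strictly before interval b starts."""
    return a[1] < b[0]

def meets(a, b):
    """Interval a ends exactly where interval b starts."""
    return a[1] == b[0]

def overlaps(a, b):
    """a starts first, b starts before a ends, and b ends after a ends."""
    return a[0] < b[0] < a[1] < b[1]

def during(a, b):
    """Interval a lies strictly inside interval b."""
    return b[0] < a[0] and a[1] < b[1]

# Example: a segmented "converse" thread followed by a later "approach" thread
converse = (100, 320)   # (start frame, end frame)
approach = (350, 430)
assert before(converse, approach)
```

In the probabilistic setting used later in this thesis, such boolean tests are applied to candidate temporal segmentations whose probabilities are then combined along the event graph.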
A multi-thread event is represented by an event graph similar to interval algebra networks [AF94]. A node in the event graph represents a single-thread event, and the links between two nodes represent their temporal relations.

Event Modeling Schema

Figure 3.3 shows a schema of how to model an activity using our proposed formalism. Given a description of a global activity, a user examines the event description and determines the composition of conceptual action units that can be represented by event threads. An event graph is then constructed from these action units by defining appropriate pair-wise temporal and logical constraints (i.e., er1, er2, ...). Each action thread is modeled by a finite state automaton composed of a series of sub-event states. If each of these sub-event states exhibits coherent movement during its occurrence, it is defined as a simple event and modeled by a Bayesian network. Otherwise, we may further decompose it into different phases of movement, represented by another layer of an automaton. Our proposed hierarchical event representation provides a transparent structure that enables a user to naturally link a relatively complicated symbolic event description to the properties of trajectories and shapes of mobile objects.

Figure 3.3: An event modeling schema (an event graph over single-thread event threads, each modeled by an automaton of sub-events; the sub-events are simple events defined from trajectory and shape properties).

3.6 Event Inference

Event recognition begins with generating a set of single-thread action hypotheses for each moving object and all possible reference objects. A set of mobile object properties is then computed and used to evaluate these action hypotheses.

We consider the finite automaton model of a complex event to be an HMM, whose states correspond to simple events, which can be an action (e.g., "approaching a person") or a pose (e.g., "standing") that can be observed and segmented by users. These states are analogous to phonemes in speech and can be characterized by a distribution of motion or shape properties of the mobile object. Since these event states are modeled by a Bayesian network of mobile object properties, the distributions can be estimated in a simple manner. Using the estimated distributions, the probability of a simple event state S given the observed mobile object properties (e.g., mp1, mp2, mp3) at time frame t, P(S | mp1(t), mp2(t), mp3(t)), can be computed.

The probabilities of simple events are combined over the long term to recognize and segment complex events. Instead of modeling fixed state transition probabilities of the complex events as in the case of conventional HMMs, we model the duration distribution Pi(d) of an event state Si explicitly. The likelihood that a transition from state Si to the other states is made after Si has been observed for k frames is distributed according to Pi(d). These a priori event duration probability distributions and the probabilities P(Si | mp1(t), mp2(t), mp3(t)) of state Si of an HMM are used to segment the observation sequence into the corresponding states and compute the probability of the HMM.
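To make the role of the duration distributions concrete, the following is a minimal dynamic-programming sketch in Python (a simplified, hypothetical variant of the adapted Viterbi procedure, not the exact algorithm developed later in this thesis; in particular it assumes the ordered states exactly cover the observation window rather than searching over start and end times). Each candidate segmentation is scored by the per-frame state evidence and the duration prior of each segment:

```python
import numpy as np

def segment_event(state_probs, duration_priors, max_dur=60):
    """
    state_probs:     array of shape (T, N); state_probs[t, i] = P(S_i | observations at t)
    duration_priors: list of N callables; duration_priors[i](d) approximates P_i(d)
    Returns the best log-score and the (start, end) frame bounds of each state, in order.
    """
    T, N = state_probs.shape
    log_obs = np.log(state_probs + 1e-12)
    # best[i, t] = best log-score of explaining frames 0..t with states S_0..S_i,
    # where state S_i ends exactly at frame t
    best = np.full((N, T), -np.inf)
    back = np.zeros((N, T), dtype=int)
    for i in range(N):
        for t in range(T):
            for d in range(1, min(max_dur, t + 1) + 1):
                start = t - d + 1
                seg = log_obs[start:t + 1, i].sum() + np.log(duration_priors[i](d) + 1e-12)
                prev = 0.0 if i == 0 and start == 0 else (
                    best[i - 1][start - 1] if i > 0 and start > 0 else -np.inf)
                if prev + seg > best[i][t]:
                    best[i][t], back[i][t] = prev + seg, start
    # recover the segment boundaries by backtracking from the last state at the last frame
    bounds, t = [], T - 1
    for i in range(N - 1, -1, -1):
        bounds.append((back[i][t], t))
        t = back[i][t] - 1
    return best[N - 1][T - 1], list(reversed(bounds))

# Example: two states with (illustrative, unnormalized) duration priors
probs = np.array([[0.9, 0.1]] * 20 + [[0.2, 0.8]] * 15)
priors = [lambda d: np.exp(-(d - 20) ** 2 / 50.0), lambda d: np.exp(-(d - 15) ** 2 / 50.0)]
print(segment_event(probs, priors))
```

The point of the sketch is the explicit duration term: the segment boundary is chosen where both the per-frame evidence and the duration prior favor a transition, rather than by a fixed self-transition probability.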
Multiple-thread events are recognized by combining the probabilities of all the segments of complex event threads (defined in the event graph) whose temporal segmentations satisfy the logical and time constraints.

Chapter 4
Detection and Tracking of Moving Regions

Activity recognition by computer involves the analysis of the spatio-temporal interaction among the trajectories of moving objects [IB01, HBN00, BG95]. Robust detection and tracking of moving objects from an image sequence is therefore an important key to reliable activity recognition. In the case of a static camera, detection of moving regions is relatively easy to perform, often based on background modeling and foreground segmentation. However, noise, shadows and reflections often arise in real sequences, causing detection to be unstable. For example, moving regions belonging to the same object may not connect or may merge with some unrelated regions. Tracking moving objects involves making hypotheses about the shapes of the objects from such unstable moving regions and tracking them correctly in the presence of partial or total occlusions. If some knowledge about the objects being tracked or about the scene is available, tracking can be simplified [YY00]. Otherwise, correspondence between regions must be established based on pixel-level information such as shape and texture [MCB+01]. Such auxiliary knowledge other than the sensed data is called context. In many applications, a large amount of context is often available. In this chapter, we demonstrate the use of ground plane information as a constraint to achieve robust tracking.

Robust tracking often requires an object model and a sophisticated optimization process [SMF00]. In the case that a model is not available or the size of the image of an object is too small, tracking must rely on the spatial and temporal correspondence between low-level image features of moving regions. One difficulty is that the same moving regions at different times may split into several parts or merge with other objects nearby due to noise, occlusion and low contrast. Figure 4.1 illustrates one of these problems, where the moving region R_i^t at time t (a human shape) splits into two smaller regions R_j^{t+1} and R_k^{t+1}, plus noise, detected at time t + 1. The image correlation between the moving region R_i^t and R_j^{t+1} or R_k^{t+1} by itself is often low and creates an incorrect trajectory. Filtering moving regions is therefore an important step of a reliable trajectory computation.

Figure 4.1: Splitting of moving regions and noise ((a) frame 74, (b) frame 75).

Figure 4.2: Projection of the bottom points of moving regions onto the ground plane.

4.1 Ground Plane Assumption for Filtering

Let us assume that objects move along a known ground plane. An estimate of the ground plane location of the lowest point of a moving region can be used as a spatial constraint to find the best candidate region that corresponds to R_i^t. This is illustrated in figure 4.2.
The dotted line shows the projection of moving region R_i of frame 74 on the ground plane, whose correspondence is to be found at frame 75. The solid lines show the projections of the two moving regions detected at frame 75. Given that objects move on the ground, R_k^{t+1} (the head blob) is unlikely to correspond to R_i, as they would be located too far apart on the ground plane. The best candidate region R_j^{t+1} (the body blob) can then be selected as the most likely region correspondence accordingly. Other moving blobs, for which no correspondences are made, are tracked by their relationship to the body blob.

To accurately compute the world coordinates of a point on an image plane, we need the parameters of a camera model. However, if we only need to estimate the 3D locations of points of moving regions on a ground plane, we can choose the world coordinate system such that Z = 0, as we are only interested in the (X, Y) positions. This reduces the 3x4 perspective camera transformation matrix to a 3x3 projective transformation matrix (or plane homography) as follows:

\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}
\begin{pmatrix} X \\ Y \\ 1 \end{pmatrix}    (4.1)

Since the non-singular homogeneous matrix H has 8 degrees of freedom, four or more point correspondences (x, y) to (X, Y) are enough to determine it uniquely, where x = x_1/x_3 and y = x_2/x_3. For the scene shown in figure 4.3, where the distance of an object from the camera is approximately 37 meters, we collected 8 point correspondences and solved for H with an error margin of 15 cm.

4.2 Merging Regions Using K-S Statistics

The best region candidate chosen based on a ground plane location may not be of the correct shape, as shown by R_j. It may need to be merged with other sub-regions subject to some similarity measure. This merging process is iterated until no further merging occurs. In the final step, regions that are too small to represent valid objects can be disregarded.

The merging process between the best candidate sub-region R_j^{t+1} and a sub-region R_k^{t+1} is performed as follows. We first make a hypothesis that both regions belong to the same object, represented by R_{j∪k}^{t+1}, and that their pixel data are drawn from the same distribution as those of R_i^t. To test this hypothesis, we compute the spatial similarity between R_{j∪k}^{t+1} and R_i^t. We base the test on the distribution of gray intensity. The Kolmogorov-Smirnov (K-S) statistic [vM64] provides a simple measure of the overall difference between two cumulative distribution functions. Let us assume that the intensity distribution of a region R_m is modeled by a Gaussian distribution N(\mu_m, \sigma_m). The K-S statistic D of two regions R_m and R_n can be computed as follows:

D(R_m, R_n) = \max_{0 \le x \le 255} \left| \int_0^{x} \frac{1}{\sqrt{2\pi}\,\sigma_m} e^{-\frac{(u-\mu_m)^2}{2\sigma_m^2}}\, du \;-\; \int_0^{x} \frac{1}{\sqrt{2\pi}\,\sigma_n} e^{-\frac{(u-\mu_n)^2}{2\sigma_n^2}}\, du \right|

The significance level of D is then computed as Q_{KS}\bigl(\bigl[\sqrt{N_e} + 0.12 + 0.11/\sqrt{N_e}\bigr] D\bigr), where N_e is the effective area of the regions, N_e = \frac{size(R_m)\, size(R_n)}{size(R_m) + size(R_n)}, and

Q_{KS}(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 \lambda^2}    (4.2)

Regions are merged if the significance level and the K-S statistic of R_i^t and R_{j∪k}^{t+1} are lower than those of R_i^t and R_j^{t+1}. If the type of object is known (e.g., a human), a constraint on the size of the moving regions after merging can also be used as a criterion to reject incorrect hypotheses.
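To make the merging test concrete, here is a minimal sketch in Python (a simplified illustration using the standard K-S significance approximation quoted above, with made-up intensity statistics; not the thesis implementation):

```python
import numpy as np
from math import erf, sqrt, exp

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sqrt(2.0) * sigma)))

def ks_statistic(mu_m, sigma_m, mu_n, sigma_n):
    """Max difference between the two Gaussian intensity CDFs over the gray-level range."""
    xs = np.arange(0, 256)
    cdf_m = np.array([gaussian_cdf(x, mu_m, sigma_m) for x in xs])
    cdf_n = np.array([gaussian_cdf(x, mu_n, sigma_n) for x in xs])
    return float(np.max(np.abs(cdf_m - cdf_n)))

def ks_significance(d, size_m, size_n, terms=100):
    """Q_KS([sqrt(Ne) + 0.12 + 0.11/sqrt(Ne)] * D), with Ne the effective area (eq. 4.2)."""
    ne = size_m * size_n / float(size_m + size_n)
    lam = (sqrt(ne) + 0.12 + 0.11 / sqrt(ne)) * d
    return 2.0 * sum((-1.0) ** (j - 1) * exp(-2.0 * j * j * lam * lam)
                     for j in range(1, terms + 1))

if __name__ == "__main__":
    # Illustrative intensity models: tracked region R_i vs. candidate R_j and merged R_j∪k.
    # The merge hypothesis is kept when the merged region matches R_i better under the
    # criterion described above.
    d_single = ks_statistic(120.0, 25.0, 150.0, 40.0)
    d_merged = ks_statistic(120.0, 25.0, 125.0, 28.0)
    print(d_merged, ks_significance(d_merged, 800, 900))
    print(d_single, ks_significance(d_single, 800, 300))
```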
Figure 4.3: The "Stealing by Blocking" sequence. "A" approaches a reference object (the person standing in the middle with his belongings on the ground). "B" and "C" then approach and block the view of "A" and the reference person from their belongings. In the meantime, "D" comes and takes the belongings.

Figure 4.4: A graph representation of the possible tracks of object "D". (a) Without using the ground plane knowledge, several hypotheses can be made about the possible tracks of the object. (b) After filtering, regions are merged or disregarded, decreasing the ambiguity.

Figure 4.3 shows the detection of moving regions of the "stealing by blocking" sequence at different times. To track objects in the sequence, a graph representation is used [MCB+01]. Figure 4.4 shows the graphs used for tracking object "D" before and after applying our filtering process. The nodes at each layer represent the moving regions detected in one video frame. An edge indicates a hypothesized correspondence between two regions of different frames. The nodes that are linked from the same parent represent the moving regions detected within the neighborhood of the parent node. The red nodes in the figure show moving regions of another object being tracked nearby. The numbers associated with the edges indicate the similarity of the region distributions based on the K-S test, ranging from 0 to 1. In figure 4.4(a), several hypotheses can be made about the possible tracks of "D", as indicated by the numerous branching edges along the path from frame 360. However, none of these represents a correct track, since most of the edges coming from a single parent node or going to a single child node are the results of the splitting and merging of moving regions shown in figure 4.1. For example, "A" and "B" split into three and four sub-regions at frames 360 and 368, respectively. Figure 4.4(b) shows the tracking graph after filtering regions based on the ground plane locations and K-S statistics. At frames 361 and 369, moving regions are merged and produce a higher similarity level to the moving region of the previous frame. At frames 365, 366 and 367, some moving regions are discarded as noise because they contradict the ground plane assumption or because, when merged with the best candidate region, the similarity level decreases. After filtering, some tracking ambiguity may still remain in the graph. For example, the node at frame 362 can be associated with two regions in the following frame, one of which is also associated with the moving region of another object nearby. Such ambiguity can be removed based on other criteria such as event context.

4.3 Resolving the Discontinuity of Object Trajectories

Discontinuity of the tracks of moving objects in an outdoor scene often arises when moving objects are not detected for a few frames, such as from total occlusion, or when all regions fail to satisfy the ground plane assumption. This is shown in figure 4.5. In this case, hypotheses about connections between fragments of tracks need to be made.
We verify a hypothesis about the track correspondence based on the similarity of the intensity distributions of the moving blobs. Events are then reevaluated over these possible mergers, and we choose the one that gives the highest confidence value. For example, objects 2 and 3 are more similar in appearance than the human on the right. After merging, the probability of "approaching the reference object A" also becomes higher. The hypothesis that objects 5 and 6 are the same object can be verified similarly.

Figure 4.5: (a) Object 2 is totally occluded by the tree, so object 3 is considered a different object from object 2. (b) Due to the reflection on a car surface, the moving region of object 5 is larger than that of object 6, causing the discontinuity of the trajectory.

Figure 4.6 shows the final results of our tracking approach, where the correct shapes of moving objects are recovered and the confusion among possible multiple tracks is reduced.

Figure 4.6: Trajectories of the objects tracked using the ground plane knowledge, K-S statistics and event consistency.

4.4 Tracking Objects in Videos with a Low-Angle Viewpoint

In figure 4.6, we demonstrate our tracking algorithm successfully on the "stealing by blocking" sequence, which is taken from a rooftop. In some video surveillance applications, a video sequence may be taken from a much lower viewpoint. As the image plane of the camera gets closer to being perpendicular to the ground plane, the points projected on the ground plane at a distance can be very noisy. Tracking moving objects in these videos is very challenging because a misdetection of a few pixels at the bottom-most point of a moving region can project to a distance of a few meters on the ground plane. In the following, we show the detection and tracking results for three video sequences taken at a different site. The camera is set approximately three meters above the ground with a smaller tilt angle than in the "stealing by blocking" sequence.

Figure 4.7 shows the sequence "stealing by phoneBooth". First, a person appears in the scene with a luggage. He then approaches the phone booth and leaves his luggage on the ground. A thief then comes and takes the luggage. In the meantime, other people just walk by. This event is similar to "stealing by blocking", except that the object is taken while the owner is using the phone instead of being distracted by a group of people (i.e., the "obstructing the view" event). The detection and tracking results of this sequence are shown in figure 4.8. Even though objects 1 and 2 are occluded at frames 444, 486, 703 and 750, when an object closer to the camera moves past in front, their trajectories are recovered correctly. The bounding boxes of the moving objects are also filtered correctly. Such tracking in the image plane may satisfy the goals of many
However, tracking articulated objects in motion can be noisy when the distance between the objects and the camera is large. This can be seen by comparing the trajectories of object 3 (in yellow) and object 4 (in orange). Figure 4.10 shows the sequence “object transfer”, where two people exchange a luggage. The difference of em “object transfer” and “stealing” events is that there are no other indicative events (e.g., the owner is using the phone) to imply that the owner of the luggage may not notice that his belonging is taken away. Figure 4.11 shows the tracking results of this sequence. The trajectories of the person who takes the luggage are split into several segments. The tracks of obj3, 4, 5 and 6, in fact, belong to the same moving object. The discontinuity of the trajectories is caused by the inaccurate detection of the feet of the moving object. A few pixels of misdetection can results in over one meters of ground distance, causing the failure in establishing the correspondence between regions. The discontinuity of trajectories can be overcome using the method discussed in section 4.3. Figure 4.12 shows the sequence “assault”, in which a person attacks another person and chases after him. The detection and tracking results are shown in figure 4.13. The challenges of this sequence are that the objects move at a higher speed than in “stealing by phonebooth” and “object transfer” sequences, and that the moving regions merge briefly as they clash. Merging of blobs is different than occlusion, as the distribution of color templates may change. The positions on the ground plane cannot be used to establish the correspondence as in the case of occlusion. However, when the merging occurs only briefly, the trajectories of object 1 and 2 can still be re-established correctly as can be seen in figure 4.13. 52 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 750 frame 854 Figure 4.7: The “Stealing by PhoneBooth” sequence. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 108 frame 283 frame 486 psa ■ ■ ■ ■ - - frame 616 frame 750 Figure 4.8: The detection and tracking frame 215 frame 444 frame 556 frame 703 frame 854 of “Stealing by PhoneBooth ” sequence. 54 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 215 frame 283 frame 486 frame 616 frame 556 1 t n t r - ' \'0bj2 ■ n r T P # . frame 703 frame 750 frame 854 Figure 4.9: The trajectories of moving objects in “Stealing by PhoneBooth” sequence projected on the ground plane. ^ Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 333 frame 45 frame 399 frame 460 frame 437 frame 493 frame 547 frame 630 Figure 4.10: The “Object Transfer” sequence. 56 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 45 k m frame 399 frame 460 frame 333 frame 437 frame 493 frame 547 frame 630 Figure 4.11: The detection and tracking results of “Object Transfer” sequence. 57 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 220 232 Figure 4.12: The “Assault” sequence. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 220 232 Figure 4.13: The detection and tracking results of “Assault” sequence. Reproduced with permission of the copyright owner. 
Chapter 5
Single-Thread Event Recognition

Suppose we have a set of competing event scenarios E = {E_1, ..., E_N}. Let O_t represent the set of mobile object properties computed at time frame t. Given a temporal sequence of observations O^{<1,t>} = O_1 O_2 ... O_t of the moving objects, we want to make a decision as to the probability that an event E_i has occurred and when it starts and finishes. To make such a decision, the entities in the event model (figure 3.2) must be computed. Computations at each level are inherently uncertain, hence a formal uncertainty reasoning mechanism is needed. To recognize which event model matches the video images best, we want to compute P(E_i | O^{<1,t>}) for all i and find the event E_i with the maximal probability. P(E_i | O^{<1,t>}) can be computed by inferring the distribution of sub-event values at the lower layers and propagating them towards the top layer. In particular, the event recognition process starts with the segmentation and computation of the probabilities of single-thread events from the properties defined for the moving object of interest. The probabilities of these segmented single-thread events are then combined, subject to the appropriate temporal constraints, to verify a multi-thread event. In this chapter, we discuss our approach of applying Bayesian networks and HMMs to the computation of P(E_i | O^{<1,t>}), where E_i is a simple or complex single-thread event. The method for computing P(E_i | O^{<1,t>}) for a multi-thread event is described later in chapter 6.

5.1 Object Class and Simple Event Recognition

The class of an object and its simple events are both represented by a Bayesian network and are inferred from the mobile object properties computed during a time frame. In Bayesian terms, the input entities are viewed as providing the evidence variables, and the task is to compute the probability distribution of an output entity hypothesis. If the entities to be combined are statistically independent (given the scenario), a simple naive Bayes classifier can be used to compute the distribution of the combined result. When the entities are not conditionally independent, Bayesian networks offer an efficient representation that can capture dependence by decomposing the network into conditionally independent components. To use Bayesian networks effectively, we need knowledge about the network structure (i.e., which entities are directly related or linked to which entities) and the conditional probabilities associated with the links.

5.1.1 The Structure of Bayesian Networks

In our case, the structure of the network is derived from knowledge about the domain. For example, logical constraints on the sub-events that represent the recognition of a particular event indicate a direct causal link between them (i.e., the sub-events are the consequences of that event). By defining each event such that its sub-events are conditionally independent of each other given the event values, a hierarchy such as the one in figure 3.2 can be converted naturally into a Bayesian network composed of several layers of naive Bayesian classifiers (i.e., no hidden nodes). Belief propagation is performed in one direction from the bottom layer to the top layer.
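As a concrete illustration of one layer of this propagation, the following is a minimal sketch in Python of a naive Bayes classifier that combines three sub-events into a parent event such as "approach the reference person" (cf. the formulas in figure 5.1 below); the conditional probabilities used here are made-up values purely for illustration, not learned parameters:

```python
def naive_bayes_parent(prior_h, p_e_given_h, p_e_given_not_h):
    """
    P(H | e1, e2, e3) = alpha * P(e1|H) P(e2|H) P(e3|H) P(H)
    p_e_given_h[i]     : P(e_i | H)      for the observed sub-event values
    p_e_given_not_h[i] : P(e_i | not H)
    """
    like_h, like_not_h = prior_h, 1.0 - prior_h
    for ph, pnh in zip(p_e_given_h, p_e_given_not_h):
        like_h *= ph
        like_not_h *= pnh
    alpha = 1.0 / (like_h + like_not_h)   # so that P(H|...) + P(not H|...) = 1
    return alpha * like_h

# Example: sub-events "getting closer", "heading toward", "slowing down" observed,
# with illustrative conditional probabilities for H = "approach the reference person"
p = naive_bayes_parent(prior_h=0.5,
                       p_e_given_h=[0.9, 0.85, 0.7],
                       p_e_given_not_h=[0.2, 0.25, 0.4])
print(round(p, 3))
```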
In figure 3.2, at the top layer, the parent event "approach the reference person" ($H$) is linked to three child events: "getting closer to the reference person" ($e_1$), "heading toward the reference person" ($e_2$), and "slowing down" ($e_3$). These child events form another layer of three naive Bayesian classifiers. For example, $e_1$ becomes a parent of "distance evolution" (the difference of the distance to the reference object at frames $t$ and $t-1$). The probability distribution over the parent event values ($P(H \mid e_1, e_2, e_3)$) is inferred from the distribution of child event values and the conditional probabilities of the child events given the values of the parent event (i.e., $P(e_1 \mid H)$, $P(e_2 \mid H)$ and $P(e_3 \mid H)$, as shown in figure 5.1):

\[
P(H \mid e_1 e_2 e_3) = \alpha\, P(e_1 \mid H)\, P(e_2 \mid H)\, P(e_3 \mid H)\, P(H)
\]
\[
P(\neg H \mid e_1 e_2 e_3) = \alpha\, P(e_1 \mid \neg H)\, P(e_2 \mid \neg H)\, P(e_3 \mid \neg H)\, P(\neg H),
\]

where $\alpha$ is a normalizing constant such that $P(H \mid e_1 e_2 e_3) + P(\neg H \mid e_1 e_2 e_3) = 1$.

Figure 5.1: A detailed illustration of the naive Bayesian classifier that is used to infer "approach the reference person" in figure 3.2. Given that $e_1$, $e_2$ and $e_3$ are conditionally independent given $H$, the belief is propagated from the sub-events $e_1$, $e_2$ and $e_3$ to infer the probability distribution of $H$ (i.e., $P(H \mid e_1, e_2, e_3)$) by applying Bayes' rule. $e_1$, $e_2$ and $e_3$ can also be parent events of other naive Bayesian classifiers, as shown in figure 3.2.

To normalize the probability $P(H \mid e_1 e_2 e_3)$, we enumerate and compute the probabilities of the alternative parent events that make up $\neg H$. For example, the alternative events of "approach the reference person" are "leave the reference person" and "stand still with respect to the reference person".

5.1.2 Parameter Learning

The parameters to be learned in a Bayesian network are the conditional probabilities of the child events given the values of the parent event. Traditionally, these parameters are learned using the Expectation-Maximization (EM) algorithm when hidden nodes are present. In our case, where all nodes are transparent (e.g., we can observe whether the object is moving towards another object or whether it is slowing down), these conditional probabilities (i.e., $P(e_i \mid H)$ and $P(e_i \mid \neg H)$) can be learned from image sequences directly, for example by making a histogram of the observed values of the evidence variables $e_1$, $e_2$, $e_3$ given the value of a given hypothesis $H$. In the case where a single Gaussian distribution can be assumed, the Gaussian parameters ($\mu$ and $\sigma$) can be computed easily.

5.2 Complex Event Recognition

Complex events, defined as a temporal sequence of sub-events, are represented by a finite-state automaton, as shown in figure 5.2 for the event "approach then stop". The sub-events $S_1$ and $S_2$ are simple events, and $S_0$ is an initial state, which is different for each complex event automaton. The dynamics of the complex event are modeled by the distributions of state durations ($P_i(d)$ for state $S_i$, where $d$ is the state duration) and by probabilistic transitions among the event states, as shown by the arrows in the figure.

Figure 5.2: A finite-state automaton that represents the complex event "approach then stop" ($S_0$: init, $S_1$: approach a reference person, $S_2$: stop at (staying around), with duration distributions $P_0(d)$, $P_1(d)$, $P_2(d)$).
The recognition process consists of determining whether there is a segment in a video stream that contains the visual evidence supporting the corresponding event automaton. This can be done by segmenting the video frames into the pattern of the corresponding automaton states of simple events (which, in our case, are inferred from the observed motion and shape properties). As the occurrences of sub-events are uncertain, the decision on the transition between states also becomes uncertain and must be performed in a probabilistic framework. For example, suppose we have a video sequence with a length of $T$ frames, as shown in figure 5.3. The probabilities of simple events ($P(S_i \mid O_t)$) computed from the Bayesian networks during $t = 1$ to $t = T$ are used to find the video segment that is most likely to contain the event "approach then stop". In the course of doing that, the most likely transition time from state $S_1$ to $S_2$ (i.e., $t_2$) within the detected segment must be determined.

Figure 5.3: Segmenting the pattern of simple events in "approach then stop".

5.2.1 Probabilities of Multi-State Complex Events

Suppose we consider the recognition of a multi-state complex event $^jMS$ that is composed of the initial state $^jS_0$ and $N$ event states $^jS_1, {}^jS_2, \ldots, {}^jS_N$. Let $O^{<1,T>}$ be the set of observations during time frames 1 to $T$. We define $P(^jMS_N^T \mid O^{<1,T>})$ as the probability that the sequence of states $^jS_1, \ldots, {}^jS_N$ of $^jMS$ occurs given the sequence of observations $O^{<1,T>}$. In the following, for clarity, we drop the superscript $j$. $P(MS_N^T \mid O^{<1,T>})$ can be computed as follows:

\[
\begin{aligned}
P(MS_N^T \mid O^{<1,T>})
&= \frac{P(O^{<1,T>} \mid MS_N^T)\, P(MS_N^T)}{P(O^{<1,T>})} \qquad \cdots (a)\\
&= \alpha_0\, P(O^{<1,T>} \mid MS_N^T)\, P(MS_N^T) \qquad \cdots (b)\\
&= \alpha_0 \sum_{\forall(t_1,t_2,\ldots,t_N)} P\bigl(O^{<1,T>} \mid MS_0^{(t_1-1)} S_1^{<t_1,t_2-1>} S_2^{<t_2,t_3-1>} \cdots S_N^{<t_N,T]}\bigr)\,
P\bigl(MS_0^{(t_1-1)} S_1 S_2 \cdots S_N^{<t_N,T]}\bigr) \qquad \cdots (c)\\
&= \alpha_0 \sum_{\forall(t_1,t_2,\ldots,t_N)} P\bigl(O^{<1,t_1-1>} \mid MS_0^{(t_1-1)}\bigr)\, P(O_1 \mid S_1)\, P(O_2 \mid S_2) \cdots P\bigl(O_N \mid S_N^{<t_N,T]}\bigr)\,
P\bigl(MS_0^{(t_1-1)} S_1 S_2 \cdots S_N^{<t_N,T]}\bigr) \qquad \cdots (d)
\end{aligned}
\tag{5.1}
\]

where $t_i$ refers to the time at which the transition to state $S_i$ from state $S_{i-1}$ occurs, and $S_i^{<t_i,t_{i+1}-1>}$ means that $S_i$ occurs during $t_i$ and $t_{i+1}-1$. Similarly, $S_i^{(t_i,t_{i+1}-1)}$ denotes the fact that $S_i$ occurs during $t_i$ and $t_{i+1}-1$ and that the state at $t_{i+1}$ is not $S_i$. We write $S_i$ and $O_i$ as shorthand for $S_i^{<t_i,t_{i+1}-1>}$ and $O^{<t_i,t_{i+1}-1>}$.

We derive eq. 5.1 (a) by Bayes' rule, eq. 5.1 (b) by writing $P(O^{<1,T>})$ as a normalizing constant $\alpha_0$, and eq. 5.1 (c) by expanding $MS_N^T$. Eq. 5.1 (d) is derived by expanding $O^{<1,T>}$ and by making the assumption that $O_i$ is conditionally independent of $S_j$ given $S_i$.

$P(MS_0^{(t_1-1)} S_1 S_2 \cdots S_N^{<t_N,T]})$ in eq. 5.1 (d) can be further derived as a product of the event durations as follows:

\[
\begin{aligned}
&P\bigl(MS_0^{(t_1-1)} S_1^{<t_1,t_2-1>} S_2^{<t_2,t_3-1>} \cdots S_N^{<t_N,T]}\bigr)\\
&= P\bigl(MS_0^{(t_1-1)}\bigr)\, P\bigl(S_1^{t_1} \mid MS_0^{(t_1-1)}\bigr)\,
P\bigl(S_1^{t_1+1} \mid S_1^{t_1}\bigr)\, P\bigl(S_1^{t_1+2} \mid S_1^{<t_1,t_1+1>}\bigr) \cdots
P\bigl(S_1^{t_2-1} \mid S_1^{<t_1,t_2-2>}\bigr)\, P\bigl(S_2^{t_2} \mid S_1^{<t_1,t_2-1>}\bigr)\\
&\quad\; P\bigl(S_2^{t_2+1} \mid S_2^{t_2}\bigr)\, P\bigl(S_2^{t_2+2} \mid S_2^{<t_2,t_2+1>}\bigr) \cdots
P\bigl(S_2^{t_3-1} \mid S_2^{<t_2,t_3-2>}\bigr)\, P\bigl(S_3^{t_3} \mid S_2^{<t_2,t_3-1>}\bigr)\\
&\quad\; \cdots\, P\bigl(S_N^{T} \mid S_N^{<t_N,T-1>}\bigr)\,
\bigl(1 - P\bigl(S_N^{T+1} \mid S_N^{<t_N,T>}\bigr)\bigr) \qquad \cdots (a)\\
&= P\bigl(MS_0^{(t_1-1)}\bigr)\,\bigl(1 - P\bigl(S_0^{t_1} \mid MS_0^{(t_1-1)}\bigr)\bigr)\, a_{1,0}\,
\overbrace{P\bigl(S_1^{t_1+1} \mid S_1^{t_1}\bigr) \cdots P\bigl(S_1^{t_2-1} \mid S_1^{<t_1,t_2-2>}\bigr)}\,
\bigl(1 - P\bigl(S_1^{t_2} \mid S_1^{<t_1,t_2-1>}\bigr)\bigr)\, a_{2,1}\\
&\quad\; P\bigl(S_2^{t_2+1} \mid S_2^{t_2}\bigr) \cdots P\bigl(S_2^{t_3-1} \mid S_2^{<t_2,t_3-2>}\bigr)\,
\bigl(1 - P\bigl(S_2^{t_3} \mid S_2^{<t_2,t_3-1>}\bigr)\bigr)\, a_{3,2}\\
&\quad\; P\bigl(S_3^{t_3+1} \mid S_3^{t_3}\bigr) \cdots P\bigl(S_N^{T} \mid S_N^{<t_N,T-1>}\bigr)\,
\bigl(1 - P\bigl(S_N^{T+1} \mid S_N^{<t_N,T>}\bigr)\bigr) \qquad \cdots (b)\\
&= P\bigl(MS_0^{(t_1-1)}\bigr)\, a_{1,0}\, P_1(d = t_2 - t_1)\, a_{2,1}\, P_2(d = t_3 - t_2)\, a_{3,2} \cdots P_N(d = T - t_N) \qquad \cdots (c)
\end{aligned}
\tag{5.2}
\]

We derive eq. 5.2 (a) by expanding the joint probability $P(MS_0^{(t_1-1)} S_1^{<t_1,t_2-1>} \cdots S_N^{<t_N,T]})$ into a product of conditional probabilities and by assuming that the probability of $S_i$ making a transition to $S_j$ or remaining in the same state depends only on the duration of $S_i$. For example, the third term on the right of eq. 5.2 (a) is, in fact, expanded from $P(S_1^{t_1+1} \mid S_1^{t_1}, MS_0^{(t_1-1)})$. However, given $S_1^{t_1}$, the term $S_1^{t_1+1}$ is independent of $MS_0^{(t_1-1)}$, which results in $P(S_1^{t_1+1} \mid S_1^{t_1})$.

Eq. 5.2 (b) is derived by writing $P(S_j^{t_j} \mid S_i^{<t_i,t_j-1>})$ as the product of the probability that the state at time $t_j$ is not $S_i$ (i.e., $(1 - P(S_i^{t_j} \mid S_i^{<t_i,t_j-1>}))$) and the probability that the automaton takes the path from $S_i$ to $S_j$, normalized over all the possible paths from $S_i$ (i.e., $a_{j,i}$). For example, if there are two equally likely paths from $S_3$, to $S_4$ and to $S_0$, we would have $a_{4,3} = a_{0,3} = 0.5$. The expression under the over-brace is the product of the probabilities that the automaton remains in state $S_1$ from $t_1$ to $t_2 - 1$, which is equivalent to the probability that the duration of event state $S_1$ is $(t_2 - t_1)$. We define $P_i(d)$ as the probability distribution of the duration of event state $S_i$, where $d$ is the duration of the event in frames. Eq. 5.2 (c) is derived by replacing those terms that are equivalent to event duration distributions with the appropriate $P_i(d)$. The term $(1 - P(S_0^{t_1} \mid MS_0^{(t_1-1)}))$ in eq. 5.2 (b) is equal to 1, since $MS_0^{(t_1-1)}$ is defined such that the state of the automaton at time $t_1$ is not $S_0$.

By substituting eq. 5.2 (c) into eq. 5.1 (d), we get:

\[
\begin{aligned}
P(MS_N^T \mid O^{<1,T>}) = \alpha_0 \sum_{\forall(t_1,t_2,\ldots,t_N)}
& P\bigl(MS_0^{(t_1-1)}\bigr)\, P\bigl(O^{<1,t_1-1>} \mid MS_0^{(t_1-1)}\bigr)\\
& a_{1,0}\, P_1(d = t_2 - t_1)\, P\bigl(O_1 \mid S_1^{<t_1,t_2-1>}\bigr)\\
& a_{2,1}\, P_2(d = t_3 - t_2)\, P\bigl(O_2 \mid S_2^{<t_2,t_3-1>}\bigr) \cdots\\
& a_{N,N-1}\, P_N(d = T - t_N)\, P\bigl(O_N \mid S_N^{<t_N,T]}\bigr)
\end{aligned}
\tag{5.3}
\]

We can further expand $P(MS_0^{(t_1-1)})\, P(O^{<1,t_1-1>} \mid MS_0^{(t_1-1)})$ in a similar way as we expanded $P(MS_N^T)$ and $P(O^{<1,T>} \mid MS_N^T)$ in eq. 5.1 (a), as follows:

\[
\begin{aligned}
&P\bigl(MS_0^{(t_1-1)}\bigr)\, P\bigl(O^{<1,t_1-1>} \mid MS_0^{(t_1-1)}\bigr)\\
&= \sum_{\forall(t_0 < t_1-1)} \sum_{\forall(i \neq 0)}
P\bigl(MS_i^{t_0-1} S_0^{<t_0,t_1-1>}\bigr)\,
P\bigl(O^{<1,t_0-1>}, O^{<t_0,t_1-1>} \mid MS_i^{t_0-1} S_0^{<t_0,t_1-1>}\bigr) \qquad \cdots (a)\\
&= \sum_{\forall(t_0 < t_1-1)} \sum_{\forall(i \neq 0)}
P_0(d = t_1 - 1 - t_0)\, a_{0,i}\, P\bigl(MS_i^{t_0-1}\bigr)\,
P\bigl(O^{<t_0,t_1-1>} \mid S_0^{<t_0,t_1-1>}\bigr)\, P\bigl(O^{<1,t_0-1>} \mid MS_i^{t_0-1}\bigr) \qquad \cdots (b)\\
&= \sum_{\forall(t_0 < t_1-1)} \sum_{\forall(i \neq 0)}
P_0(d = t_1 - 1 - t_0)\, P\bigl(O^{<t_0,t_1-1>} \mid S_0^{<t_0,t_1-1>}\bigr)\,
a_{0,i}\, P\bigl(MS_i^{t_0-1}, O^{<1,t_0-1>}\bigr) \qquad \cdots (c)
\end{aligned}
\tag{5.4}
\]

Eq. 5.4 (a) is derived by expanding $MS_0^{(t_1-1)}$ into $S_0^{<t_0,t_1-1>}$ and $MS_i^{t_0-1}$, and $O^{<1,t_1-1>}$ into $O^{<1,t_0-1>}$ and $O^{<t_0,t_1-1>}$, where $t_0$ is the time frame in which the transition is made from $S_i$ to $S_0$. Since the transition to state $S_0$ can be made from any of the states $S_1$ to $S_N$ (as shown in figure 5.2) and can occur at any time frame before $t_1 - 1$, we take into consideration all possible $i$ and $t_0$ values (i.e., $\forall(i \neq 0)$ and $\forall(t_0 < t_1 - 1)$). Eq. 5.4 (b) is derived in a similar fashion as in equations 5.1 (d) and 5.2 (c). Eq. 5.4 (c) is derived by rewriting the terms $P(MS_i^{t_0-1})$ and $P(O^{<1,t_0-1>} \mid MS_i^{t_0-1})$ as a joint probability using Bayes' rule.
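To make the structure of eq. 5.3 concrete, the following is a minimal sketch of the contribution of one fixed choice of transition times $t_1, \ldots, t_N$: a product of duration probabilities, transition weights $a_{i,i-1}$ and per-frame beliefs standing in for the observation likelihoods. The helper names and the toy numbers are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def segmentation_score(trans_times, T, duration_pmf, trans_weight, state_given_obs):
    """Score one candidate segmentation (t_1, ..., t_N) of frames 1..T (cf. eq. 5.3).

    duration_pmf[i](d)    : P_i(d), duration distribution of state S_i
    trans_weight[i]       : a_{i,i-1}, weight of the path S_{i-1} -> S_i
    state_given_obs[i][t] : P(S_i^t | O^t) from the Bayesian network of S_i
    (the per-frame factors stand in for P(O_i | S_i) up to a common constant)
    """
    score = 1.0
    bounds = list(trans_times) + [T + 1]      # state S_i occupies frames [t_i, t_{i+1} - 1]
    for i, t_i in enumerate(trans_times, start=1):
        d = bounds[i] - t_i                    # duration of S_i
        score *= trans_weight[i] * duration_pmf[i](d)
        score *= np.prod([state_given_obs[i][t] for t in range(t_i, bounds[i])])
    return score

# Toy example: two states over T = 6 frames, transition times t_1 = 1, t_2 = 4.
uniform = lambda d: 0.1 if 1 <= d <= 10 else 0.0
obs = {1: {t: 0.9 for t in range(1, 7)}, 2: {t: 0.8 for t in range(1, 7)}}
print(segmentation_score((1, 4), 6, {1: uniform, 2: uniform}, {1: 1.0, 2: 0.5}, obs))
```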
The derivation in equations 5.3 and 5.4 (c) requires the computation of $P(O_j \mid S_j^{<t_j,t_{j+1}-1>})$, which is derived as follows:

\[
\begin{aligned}
P\bigl(O_j \mid S_j^{<t_j,t_{j+1}-1>}\bigr)
&= \prod_{t_j \le t' \le t_{j+1}-1} P\bigl(O^{t'} \mid S_j^{t'}\bigr) \qquad \cdots (a)\\
&= \prod_{t_j \le t' \le t_{j+1}-1} \frac{P\bigl(S_j^{t'} \mid O^{t'}\bigr)\, P\bigl(O^{t'}\bigr)}{P\bigl(S_j^{t'}\bigr)} \qquad \cdots (b)\\
&= \Bigl[\prod_{t_j \le t' \le t_{j+1}-1} \frac{P\bigl(O^{t'}\bigr)}{P\bigl(S_j^{t'}\bigr)}\Bigr]
\prod_{t_j \le t' \le t_{j+1}-1} P\bigl(S_j^{t'} \mid O^{t'}\bigr) \qquad \cdots (c)\\
&= \beta_{(t_j,t_{j+1}-1)} \prod_{t_j \le t' \le t_{j+1}-1} P\bigl(S_j^{t'} \mid O^{t'}\bigr) \qquad \cdots (d)
\end{aligned}
\tag{5.5}
\]

Eq. 5.5 (a) is derived by the assumption that, given $S_j^{t'}$, $O^{t'}$ is independent of $S_j^{t''}$, where $t' \neq t''$. We derive eq. 5.5 (b) by Bayes' rule and eq. 5.5 (c) by factoring out the terms in the bracket. In eq. 5.5 (d), $\beta_{(t_j,t_{j+1}-1)}$ is shorthand for the product of $\frac{P(O^{t'})}{P(S_j^{t'})}$ during $t_j \le t' \le t_{j+1}-1$. $P(O_j \mid S_j^{(t_j,t_{j+1}-1)})$ can be written in a similar fashion.

For compactness, let us define $\mathrm{Bel}(S_j^{<t_j,t_{j+1}-1>})$, which combines the per-frame beliefs with the duration probability, as follows:

\[
\mathrm{Bel}\bigl(S_j^{<t_j,t_{j+1}-1>}\bigr) = P_j(d = t_{j+1} - t_j) \prod_{t_j \le t' \le t_{j+1}-1} P\bigl(S_j^{t'} \mid O^{t'}\bigr)
\tag{5.6}
\]

We can now substitute equations 5.4 (c), 5.5 (d) and 5.6 into eq. 5.3 and get:

\[
\begin{aligned}
P(MS_N^T \mid O^{<1,T>}) = \alpha_0 \sum_{\forall(t_1,\ldots,t_N)}
& \Bigl[\sum_{\forall(t_0<t_1-1)} \sum_{\forall(i\neq 0)}
\beta_{(t_0,t_1-1)}\, \mathrm{Bel}\bigl(S_0^{<t_0,t_1-1>}\bigr)\,
a_{0,i}\, P\bigl(MS_i^{t_0-1}, O^{<1,t_0-1>}\bigr)\Bigr]\\
& a_{1,0}\, \beta_{(t_1,t_2-1)}\, \mathrm{Bel}\bigl(S_1^{<t_1,t_2-1>}\bigr)\,
a_{2,1}\, \beta_{(t_2,t_3-1)}\, \mathrm{Bel}\bigl(S_2^{<t_2,t_3-1>}\bigr) \cdots
a_{N,N-1}\, \beta_{(t_N,T)}\, \mathrm{Bel}\bigl(S_N^{<t_N,T]}\bigr)
\end{aligned}
\tag{5.7}
\]

By comparing eq. 5.7 with eq. 5.1 (b), we notice that the terms expanded after $\alpha_0$ are in fact the joint probability $P(MS_N^T, O^{<1,T>})$. We can recursively expand $P(MS_i^{t_0-1}, O^{<1,t_0-1>})$ using the result of eq. 5.7 (i.e., replacing $T$ with $t_0 - 1$) until $t_0$ is equal to 1. After the expansion, by making the assumption that the a priori probabilities of all events are uniformly distributed (i.e., $P(S_i) = P(S), \forall i$), we can combine the $\beta_{(t_m,t_{m+1}-1)}$ (i.e., $\prod_{t_m \le t' \le t_{m+1}-1} \frac{P(O^{t'})}{P(S^{t'})}$) for all $m$ and get $\beta^{<1,T>} = \prod_{1 \le t' \le T} \frac{P(O^{t'})}{P(S)}$, which can be factored out to the left of $\sum_{\forall(t_1,t_2,\ldots,t_N)}$ regardless of the choice of temporal state segmentation $t_N, t_{N-1}, \ldots, t_1$. If we let $\bar{P}(MS_N^T, O^{<1,t>})$ be $P(MS_N^T, O^{<1,t>})$ after the factorization, eq. 5.7 can be written as:

\[
\begin{aligned}
P(MS_N^T \mid O^{<1,T>}) &= \alpha_0\, \beta^{<1,T>}\, \bar{P}\bigl(MS_N^T, O^{<1,T>}\bigr) \qquad \cdots (a)\\
\bar{P}\bigl(MS_N^T, O^{<1,T>}\bigr) &= \sum_{\forall(t_1,\ldots,t_N)}
\mathrm{Bel}\bigl(S_0^{<t_0,t_1-1>}\bigr) \sum_{\forall(t_0<t_1-1)} \sum_{\forall(i\neq 0)}
a_{0,i}\, \bar{P}\bigl(MS_i^{t_0-1}, O^{<1,t_0-1>}\bigr)\\
&\qquad a_{1,0}\, \mathrm{Bel}\bigl(S_1^{<t_1,t_2-1>}\bigr)\,
a_{2,1}\, \mathrm{Bel}\bigl(S_2^{<t_2,t_3-1>}\bigr) \cdots
a_{N,N-1}\, \mathrm{Bel}\bigl(S_N^{<t_N,T]}\bigr) \qquad \cdots (b)
\end{aligned}
\tag{5.8}
\]

When we compare or normalize the probabilities $P(^mMS_i)$ and $P(^mMS_j)$ of the event model $^mMS$, or the probabilities of any two complex events $P(^mMS_x)$ and $P(^nMS_x)$, $\alpha_0$ and $\beta^{<1,T>}$ cancel out. Therefore, we drop them from eq. 5.8, which can now be written as:

\[
\begin{aligned}
\bar{P}\bigl(MS_N^T \mid O^{<1,T>}\bigr)
&= \frac{P\bigl(MS_N^T \mid O^{<1,T>}\bigr)}{\alpha_0\, \beta^{<1,T>}} \qquad \cdots (a)\\
&= \bar{P}\bigl(MS_N^T, O^{<1,T>}\bigr) \qquad \cdots (b)\\
&= \sum_{\forall(t_1,\ldots,t_N)} \mathrm{Bel}\bigl(S_0^{<t_0,t_1-1>}\bigr)
\sum_{\forall(t_0<t_1-1)} \sum_{\forall(i\neq 0)}
a_{0,i}\, \bar{P}\bigl(MS_i^{t_0-1}, O^{<1,t_0-1>}\bigr)\,
a_{1,0}\, \mathrm{Bel}\bigl(S_1^{<t_1,t_2-1>}\bigr)\,
a_{2,1}\, \mathrm{Bel}\bigl(S_2^{<t_2,t_3-1>}\bigr) \cdots
a_{N,N-1}\, \mathrm{Bel}\bigl(S_N^{<t_N,T]}\bigr) \qquad \cdots (c)\\
&= \sum_{\forall(t_1,\ldots,t_N)} \mathrm{Bel}\bigl(S_0^{<t_0,t_1-1>}\bigr)
\sum_{\forall(t_0<t_1-1)} \sum_{\forall(i\neq 0)}
a_{0,i}\, \bar{P}\bigl(MS_i^{t_0-1} \mid O^{<1,t_0-1>}\bigr)\,
a_{1,0}\, \mathrm{Bel}\bigl(S_1^{<t_1,t_2-1>}\bigr)\,
a_{2,1}\, \mathrm{Bel}\bigl(S_2^{<t_2,t_3-1>}\bigr) \cdots
a_{N,N-1}\, \mathrm{Bel}\bigl(S_N^{<t_N,T]}\bigr) \qquad \cdots (d)
\end{aligned}
\tag{5.9}
\]

In this thesis, $\bar{P}(MS_N^T \mid O^{<1,T>})$ is used interchangeably with $P(MS_N^T \mid O^{<1,T>})$.

5.2.2 Modeling the Probability Distributions $P_i(d)$ of Event Durations

The computation of $\mathrm{Bel}(S_i^{<t_i,t_{i+1}-1>})$ in eq. 5.6 requires the modeling and learning of the event duration probability distributions $P_i(d)$. From equations 5.2 (b) and (c), $P_i(d)$ is derived as:

\[
P_i(d) = \bigl(1 - P\bigl(S_i^{t_i+d+1} \mid S_i^{<t_i,t_i+d>}\bigr)\bigr)\,
P\bigl(S_i^{t_i+d} \mid S_i^{<t_i,t_i+d-1>}\bigr) \cdots
P\bigl(S_i^{t_i+2} \mid S_i^{<t_i,t_i+1>}\bigr)\, P\bigl(S_i^{t_i+1} \mid S_i^{t_i}\bigr)
\tag{5.10}
\]

Learning the probability distribution functions modeled by eq. 5.10 is very difficult and unstable due to the high dimensionality of the parameters and the limited size of the training data set.
If we make a first-order Markov assumption that $P(S_i^{t_j} \mid S_i^{<t_i,t_j-1>})$ is equal to $P(S_i^{t_j} \mid S_i^{t_j-1})$, which is time-invariant (i.e., $P(S_i^{t_j} \mid S_i^{t_j-1})$ is equal to, say, $P(S_i^t \mid S_i^{t-1})$ for all $t_j$), eq. 5.10 can be written as:

\[
P_i(d) = P\bigl(S_i^t \mid S_i^{t-1}\bigr)^{d-1}\,\bigl(1 - P\bigl(S_i^t \mid S_i^{t-1}\bigr)\bigr)
\tag{5.11}
\]

$P_i(d)$ modeled by eq. 5.11 has only one parameter, $P(S_i^t \mid S_i^{t-1})$, which can be learned more easily. However, there is one disadvantage. In eq. 5.11, $P_i(d)$ is modeled by an exponential function, which is inappropriate for many large-scale events. For example, "a person gets cash from an ATM" can be modeled by an HMM whose first two states are "the person approaches the ATM" (event $S_1$) and "the person arrives at the ATM" (event $S_2$), as shown in figure 5.4.

Figure 5.4: The first two states of an HMM representation of "a person gets cash from the ATM".

Suppose it takes approximately 59 frames for a person to walk to the ATM machine; it can be computed that $P(S_1^t \mid S_1^{t-1})$ is 0.9833 (i.e., $\frac{59}{60}$) and $P(S_2^t \mid S_1^{t-1})$ is 0.0167 (i.e., $\frac{1}{60}$). The probability that a person remains in the state $S_1$ (i.e., "approach the ATM") for $d$ frames can then be computed by $P_1(d)$, which is plotted in figure 5.5.

Figure 5.5: The exponential probability distribution of the duration of event state $S_1$: $P_1(d) = P(S_1^t \mid S_1^{t-1})^{d-1}(1 - P(S_1^t \mid S_1^{t-1}))$, where $d = t - t_1 + 1$ and $P(S_1^t \mid S_1^{t-1})$ is 0.9833.

We can conclude from figure 5.5 that, without any other information, it is almost three times more likely to observe a person "approach the ATM" for 1 frame ($P_1(1) = 0.0167$) than for 59 frames ($P_1(59) = 0.0063$), even though 59 frames is the expected duration of event $S_1$. Unlike the case of speech recognition, where the duration of a phoneme (e.g., the sound /th/ in "the" or the sound /d/ in "dog") is restrictively short and has small variance, the duration of an activity is highly variable and can last for several minutes. Therefore, the exponential model of event duration may not give accurate results.

A more accurate way of computing $\mathrm{Bel}(S_i^{<t_i,t_{i+1}-1>})$ is to model the probability distribution of event duration, $P_i(d)$, explicitly in the finite-state automaton. $P_i(d)$ can be learned using a direct method, as in the case of the parameters of the Bayesian networks. In the ATM example, if we know that ATM users are likely to follow a common walking path (e.g., a hallway or a line) when they approach the teller machine, we may assume the duration distribution of the "approach" event to be Gaussian. The mean and the variance of the Gaussian distribution can then be estimated accordingly. In many other real-world situations, the durations of the events of interest can be highly variable, making them difficult to estimate. For example, the amount of time a person spends on executing "taking object" can vary significantly depending on the scene context and the execution style of the actor. In our experiments, we assume that all these possible execution styles are equally likely. Consequently, the durations of events are uniformly distributed over a certain frame range.
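The contrast between the implicit exponential duration model of eq. 5.11 and an explicitly modeled duration distribution can be reproduced in a few lines. This is an illustrative sketch only: it uses the ATM numbers above and a hypothetical uniform distribution in which very short durations are suppressed by a sigmoid ramp (the suppression described next uses parameters that are not specified in the text, so the values here are assumptions).

```python
import numpy as np

def exponential_duration(p_stay, d):
    """Implicit HMM duration model of eq. 5.11: P_i(d) = p_stay**(d-1) * (1 - p_stay)."""
    return p_stay ** (d - 1) * (1.0 - p_stay)

def uniform_sigmoid_duration(d, d_max=300, inhibit_scale=3.0):
    """Explicit duration model: roughly uniform over [1, d_max], with very short
    durations (likely noise) suppressed by a sigmoid ramp (illustrative parameters)."""
    ramp = 1.0 / (1.0 + np.exp(-(d - 10) / inhibit_scale))   # near 0 below ~10 frames
    return ramp * (d <= d_max)

d = np.arange(1, 301)
p_stay = 59.0 / 60.0                        # expected "approach the ATM" duration ~ 60 frames
exp_pmf = exponential_duration(p_stay, d)
uni_pmf = uniform_sigmoid_duration(d)
uni_pmf = uni_pmf / uni_pmf.sum()           # normalize to a proper distribution

print(exp_pmf[0], exp_pmf[58])              # P_1(1) ~ 0.0167 vs P_1(59) ~ 0.0063
print(uni_pmf[0], uni_pmf[58])              # short durations suppressed, long ones equal
```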
In order to avoid very short event segments, which are likely due to noise, we restrain short durations (e.g., 1 to 10 frames) by a sigmoid function.

5.2.3 Complex Event Recognition Algorithm

Let us first consider the computational complexity of eq. 5.9 (d). In the case that the video stream is pre-segmented such that it contains only one instance of a complex event (i.e., $t_0$ in eq. 5.9 (d) is equal to 1 and $P(MS_0^{t_0-1} \mid O^{<1,t_0-1>}) = 1$), the direct computation of $P(MS_N^T \mid O^{<1,T>})$ at time $t = T$ involves at least an operation of $O(T^N)$ complexity, since there are $O(T^N)$ combinations of the values of $t_1, t_2, \ldots, t_N$. If one has to recalculate $\prod_{1 \le i \le N} \mathrm{Bel}(S_i^{<t_i,t_{i+1}-1>})$ at each time frame $t$ for each choice of the values of $t_1, t_2, \ldots, t_N$, the computation can be as complex as $O(T^{N+1})$. In the case of unsegmented videos, the computation becomes more intensive, since we also need to determine $t_0$ and compute $\mathrm{Bel}(S_0^{<t_0,t_1-1>})$.

In the following, we describe a more efficient recursive algorithm based on dynamic programming, in which $P(MS_i^{t'} \mid O^{<1,t'>})$ is derived from the previously computed $P(MS_j^{t''} \mid O^{<1,t''>})$ (where $t'' < t'$ and $i \neq j$). This algorithm is an adaptation of the algorithm for computing the forward probabilities used in conventional HMMs [RJ93].

Recursive Computation of $P(MS_i^t \mid O^{<1,t>})$

Suppose we consider the computation of $P(MS_i^t \mid O^{<1,t>})$ (i.e., the recognition of the sequence of event states up to the state $S_i$ at time $t$). We can rewrite $P(MS_i^t \mid O^{<1,t>})$ in terms of previously recognized event sequences ($S_1, S_2, \ldots, S_{i-1}$) at time $t_i - 1$, where $t_i \le t$, as follows. From eq. 5.9 (d), we rewrite the terms in the bracket back into $P(MS_0^{t_1-1} \mid O^{<1,t_1-1>})$ and get:

\[
\begin{aligned}
P\bigl(MS_i^t \mid O^{<1,t>}\bigr)
&= \sum_{t_i \le t} \sum_{\forall(t_1,t_2,\ldots,t_{i-1})}
P\bigl(MS_0^{t_1-1} \mid O^{<1,t_1-1>}\bigr)\,
a_{1,0}\, \mathrm{Bel}\bigl(S_1^{<t_1,t_2-1>}\bigr)\,
a_{2,1}\, \mathrm{Bel}\bigl(S_2^{<t_2,t_3-1>}\bigr) \cdots
a_{i,i-1}\, \mathrm{Bel}\bigl(S_i^{<t_i,t]}\bigr) \qquad \cdots (a)\\
&= \sum_{t_i \le t} \mathrm{Bel}\bigl(S_i^{<t_i,t]}\bigr)\, a_{i,i-1}
\sum_{\forall(t_1,\ldots,t_{i-1})}
P\bigl(MS_0^{t_1-1} \mid O^{<1,t_1-1>}\bigr)\,
a_{1,0}\, \mathrm{Bel}\bigl(S_1^{<t_1,t_2-1>}\bigr) \cdots
a_{i-1,i-2}\, \mathrm{Bel}\bigl(S_{i-1}^{<t_{i-1},t_i-1>}\bigr) \qquad \cdots (b)\\
&= \sum_{t_i \le t} \mathrm{Bel}\bigl(S_i^{<t_i,t]}\bigr)\, a_{i,i-1}\,
P\bigl(MS_{i-1}^{t_i-1} \mid O^{<1,t_i-1>}\bigr) \qquad \cdots (c)\\
&= \sum_{t_i \le t} P\bigl(MS_{t_i}^{t} \mid O^{<1,t>}\bigr) \qquad \cdots (d)
\end{aligned}
\tag{5.12}
\]

Eq. 5.12 (a) is derived from eq. 5.9 (d) by replacing $N$ and $T$ with $i$ and $t$. In eq. 5.12 (b), the terms related to state $S_i$ are factored out to the left of $\sum_{\forall(t_1,t_2,\ldots,t_{i-1})}$. In eq. 5.12 (c), the terms after $a_{i,i-1}$ are combined as $P(MS_{i-1}^{t_i-1} \mid O^{<1,t_i-1>})$. In eq. 5.12 (d), we write the terms after $\sum_{t_i \le t}$ as $P(MS_{t_i}^{t} \mid O^{<1,t>})$, which is the probability of $MS_i$ at time $t$, where the last state $S_i$ starts at $t_i$.

Given the derivation in eq. 5.12 (c), the computation of $P(MS_N^t \mid O^{<1,t>})$ proceeds as follows. At $t = 0$, the probabilities of being in state $S_i$ are initialized as follows: $P(S_0^0 \mid O^0) = P(MS_0^0 \mid O^0) = 1$ and $P(S_i^0 \mid O^0) = P(MS_i^0 \mid O^0) = 0$ for all $i \neq 0$. At time $t$, starting from $i = 1$, $P(MS_i^t \mid O^{<1,t>})$ is computed from $P(MS_{i-1}^{t_i-1} \mid O^{<1,t_i-1>})$ for all possible start times $t_i$ using eq. 5.12 (c). This computation is repeated until the final state $i = N$ is reached, where $P(MS_N^t \mid O^{<1,t>})$ is computed.

The processing steps related to the computation of $P(MS_i^t \mid O^{<1,t>})$ performed at time frame $t$ are illustrated in figure 5.6.

Figure 5.6: The processing steps performed on state $S_i$ at time $t$ (each sub-state holds: 1) $t_i$, 2) $P(S_i^{t_i} \mid O^{t_i})\, a_{i,i-1}\, P(MS_{i-1}^{t_i-1} \mid O^{<1,t_i-1>})$, and 3) $\prod_{t_i \le t' \le t-1} P(S_i^{t'} \mid O^{t'})$).

For simplicity, let us assume that the event duration modeled by $P_i(d)$ for the state $S_i$ is bounded by $d'$. Therefore, there are only $d'$ possible values of the start time $t_i$ in eq. 5.12 (c) that need to be maintained at any time.
We consider the state $S_i$ to be composed of a set of sub-states $S_{i,j}$, where $j$ indicates the duration of the event in frames ($0 \le j \le d'$). The sub-state $S_{i,j}$ is assigned the appropriate duration probability $P_i(d = j)$ based on $P_i(d)$. For example, in figure 5.6, $P_i(d)$ is illustrated by a solid curve and the state $S_{i,1}$ is assigned $P_i(d = 1)$. At each time frame $t$, we maintain and update, if necessary, the following state parameters: 1) the frame $t_i$ where the transition into state $S_i$ occurs (i.e., the start frame), 2) the probability that the event sequence $S_1, S_2, \ldots, S_{i-1}$ ends at $t_i - 1$ and the transition is made to $S_i$ at $t_i$ (i.e., $P(MS_{i-1}^{t_i-1} \mid O^{<1,t_i-1>})\, a_{i,i-1}\, P(S_i^{t_i} \mid O^{t_i})$), and 3) the product of the probabilities $P(S_i^{t'} \mid O^{t'})$ computed from the Bayesian network of $S_i$ during $t_i \le t' \le t-1$. These parameters and $P_i(d = j)$ are used to compute $P(MS_{t_i}^{t} \mid O^{<1,t>})$ for the sub-state $S_{i,j}$, where $t_i = t - j$. Finally, $P(MS_i^t \mid O^{<1,t>})$ is computed by taking the summation of $P(MS_{t_i}^{t} \mid O^{<1,t>})$ (defined in eq. 5.12 (d)) computed for $S_{i,j}$, $\forall j$. This step is indicated by $\Sigma$ in figure 5.6.

At time $t$, we update the state parameters of all sub-states ($S_{i,0}, \ldots, S_{i,d'}$) as follows. First, all the internal state parameter values of $S_{i,j}$ computed at time $t-1$ are pushed to the next sub-state $S_{i,j+1}$ at time $t$, where $0 \le j < d'$. That is, the internal parameters of $S_{i,j}$ are obtained from those of $S_{i,j-1}$. The parameters of $S_{i,d'}$ are discarded, as the duration of $S_i$ is bounded by $d'$ (i.e., $P_i(d > d') = 0$). This step is illustrated in figure 5.6 as the items in the arrow quote box being pushed downward along the dashed arrows.

For $S_{i,j}$ where $1 \le j \le d'$, the term in the white box is then updated by multiplying it with the probability value $P(S_i^t \mid O^t)$ computed from the corresponding Bayesian network. The items in the shaded box are only passed along and not updated. For $S_{i,0}$, $t_i$ is set to $t$, establishing a new hypothesis that the transition is made from $S_{i-1}$ at frame $t$. $P(MS_{i-1}^{t-1} \mid O^{<1,t-1>})$, computed for state $S_{i-1}$ at time frame $t-1$, is copied to $P(MS_{i-1}^{t_i-1} \mid O^{<1,t_i-1>})$ of the second parameter of $S_{i,0}$, which is then multiplied by the output of the Bayesian network $P(S_i^t \mid O^t)$ and by $a_{i,i-1}$. These actions are indicated in figure 5.6 by the dashed arrow pointing from $MS_{i-1}^{t-1}$ to $S_{i,0}$. The third parameter of $S_{i,0}$ (the product of the Bayesian probabilities) is reset to 1.

Our recursive algorithm based on eq. 5.12 (c) considers several possible start times $t_i$ of event $S_i$ at each frame. This process is different from the algorithm used for computing forward probabilities in a conventional HMM, in which the computation of the probabilities of a sequence of states at time $t$ is defined based only on the probabilities computed at time frame $t-1$.

There is an important issue concerned with the computation of the product of the probabilities in eq. 5.12 (c). As $t$ becomes large, the product of the probabilities becomes vanishingly small, requiring a normalization. Since the automaton, at any time, must be in one of the states $S_0, \ldots, S_N$, the summation of $P(MS_i^t \mid O^{<1,t>})$, $\forall i$, should be equal to 1.
Therefore, when $\sum_{0 \le i \le N} P(MS_i^t \mid O^{<1,t>})$ becomes smaller than a probability threshold $\delta_{MS}$, we normalize the state parameters $\prod_{t_i \le t' \le t-1} P(S_i^{t'} \mid O^{t'})$ of all sub-states $S_{i,j}$ by $\sum_{0 \le i \le N} P(MS_i^t \mid O^{<1,t>})$. This step is effectively the same as normalizing $P(MS_i^t \mid O^{<1,t>})$, as $\prod_{t_i \le t' \le t-1} P(S_i^{t'} \mid O^{t'})$ is used to compute $P(MS_{t_i}^{t} \mid O^{<1,t>})$ and $P(MS_i^t \mid O^{<1,t>})$.

$P(MS_i^t \mid O^{<1,t>})$, $\forall i$, at time frame $t$ can be compared to determine the most likely current state of the event automaton $MS$. Similarly, for any two event models $^mMS$ and $^nMS$, we can compare the normalized $P(^mMS_x^t \mid O^{<1,t>})$ and $P(^nMS_y^t \mid O^{<1,t>})$ to determine which event model is more likely.

Computation of $P(MS_0^{t_1-1} \mid O^{<1,t_1-1>})$

There is another issue that needs to be resolved, which concerns the computation of $P(MS_0^{t_1-1} \mid O^{<1,t_1-1>})$ in eq. 5.12 (c). Compared to the computation of $P(MS_i^t \mid O^{<1,t>})$, where we only consider the transition path from state $S_{i-1}$, to compute $P(MS_0^{t_1-1} \mid O^{<1,t_1-1>})$ we need to consider all possible transition paths from $S_1, S_2, \ldots, S_N$. Based on the result of eq. 5.9 (c), $P(MS_0^{t_1-1} \mid O^{<1,t_1-1>})$ can be expanded as follows:

\[
\begin{aligned}
P\bigl(MS_0^{t_1-1} \mid O^{<1,t_1-1>}\bigr)
&= \sum_{\forall(t_0<t_1-1)} \mathrm{Bel}\bigl(S_0^{<t_0,t_1-1>}\bigr)
\sum_{\forall(i\neq 0)} a_{0,i}\, P\bigl(MS_i^{t_0-1} \mid O^{<1,t_0-1>}\bigr)\\
&= \sum_{\forall(t_0<t_1-1)} P_0(d = t_1 - 1 - t_0)
\prod_{t_0 \le t' \le t_1-1} P\bigl(S_0^{t'} \mid O^{t'}\bigr)
\sum_{\forall(i\neq 0)} a_{0,i}\, P\bigl(MS_i^{t_0-1} \mid O^{<1,t_0-1>}\bigr)
\end{aligned}
\tag{5.13}
\]

This derivation involves computing the product of $P(S_0^{t'} \mid O^{t'})$. However, $S_0$ is not a specific state that is modeled by a Bayesian network. We infer $P(S_0^{t'} \mid O^{t'})$ from the likelihood that the paths to the alternative states $S_1, \ldots, S_N$ fail. This is implemented in our algorithm as follows. We modify the second parameter of the sub-states $S_{0,j}$, $\forall j$, to be $\sum_{1 \le m \le N} P(S_0^{t_0} \mid O^{t_0})\, a_{0,m}\, P(MS_m^{t_0-1} \mid O^{<1,t_0-1>})$, as shown in figure 5.7. At time frame $t$, after the state parameters of $S_{0,j}$ are transferred to those of $S_{0,j+1}$ for $0 \le j < N-1$, we set $t_0$ of sub-state $S_{0,0}$ to $t$, establishing a new hypothesis that the transition to $S_0$ is made at frame $t$. The value of the second parameter of $S_{0,0}$ is then computed, using $[1 - P(S_{m+1}^t \mid O^t)]$ as $P(S_0^t \mid O^t)$ for the transition to $S_0$ from $S_m$, as illustrated in figure 5.7 for the case of $m = i$. The third parameter of $S_{0,0}$ is set to 1. For the sub-states $S_{0,j}$, we only update the third parameter, in which we compute $P(S_0^t \mid O^t)$ as the probability that the event state $S_1$ is not recognized, i.e., $[1 - P(S_1^t \mid O^t)]$.

Figure 5.7: The processing steps performed on state $S_0$ at time $t$.

The complexity of our recursive algorithm is less than that of the brute-force method, because the computation of $P(MS_i^t \mid O^{<1,t>})$ at time $t$ uses the previously computed $P(MS_j^{t'} \mid O^{<1,t'>})$, where $i \neq j$ and $t' < t$. Since there are a total of $N$ states, and for each state $S_i$ we only need to compute the summation of the probabilities over all possible start times of the event state $S_i$ (i.e., all possible $t_i$, where $t_i \le t$ in eq. 5.12 (c)), the complexity of our recursive algorithm is $O(NT^2)$. If the longest event durations are at most $d$ frames, the complexity is then reduced to $O(NTd)$. For short video sequences (e.g., on the order of minutes) or short event durations (e.g., an upper bound of $d$ of, say, 100 frames), $O(NT^2)$ or $O(NTd)$ may be acceptable.
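Before discussing further reductions, the core recursion of eq. 5.12 (c) with bounded durations can be summarized in a short sketch. This is an illustrative outline under simplifying assumptions (a single linear chain of states, a simplified always-available initial state, and per-frame simple event probabilities already available); the function and parameter names are placeholders, not the thesis implementation.

```python
import numpy as np

def forward_complex_event(simple_event_prob, duration_pmf, trans_weight, d_max):
    """Forward computation of P(MS_i^t | O^<1,t>) for a linear chain S_1..S_N,
    following the recursion of eq. 5.12 (c) with durations bounded by d_max.

    simple_event_prob[i, t] : P(S_i^t | O^t) from the Bayesian network of S_i
    duration_pmf[i](d)      : P_i(d), assumed zero for d > d_max
    trans_weight[i]         : a_{i,i-1}
    Returns ms[i, t], proportional to P(MS_i^t | O^<1,t>) (cf. eq. 5.9).
    """
    N, T = simple_event_prob.shape[0] - 1, simple_event_prob.shape[1] - 1
    ms = np.zeros((N + 1, T + 1))
    ms[0, :] = 1.0   # simplification: S_0 stays available; eq. 5.13 handles it properly
    for t in range(1, T + 1):
        for i in range(1, N + 1):
            total = 0.0
            # sum over candidate start times t_i of the last state S_i (at most d_max of them)
            for t_i in range(max(1, t - d_max + 1), t + 1):
                # Bel(S_i^<t_i,t]): duration probability times per-frame beliefs (eq. 5.6).
                # For clarity the product is recomputed here; the thesis maintains it
                # incrementally in the sub-states S_{i,j} of figure 5.6.
                bel = duration_pmf[i](t - t_i + 1) * np.prod(simple_event_prob[i, t_i:t + 1])
                total += bel * trans_weight[i] * ms[i - 1, t_i - 1]
            ms[i, t] = total
        # In practice the per-state products are renormalized when their sum drops
        # below delta_MS, or the computation is carried out in log space, to avoid underflow.
    return ms
```

Replacing the inner summation over start times with a max, and recording the maximizing $t_i$, gives the most likely transition times of equations 5.16 and 5.17 that are used for segmentation in section 5.2.4.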
When the video sequences or the upper bound of $d$ are hours long, the computation becomes impractical to perform on most common computing resources. We describe next how to further reduce the complexity of the algorithm.

Reducing the Computational Complexity of $P(MS_i^t \mid O^{<1,t>})$

We can further reduce the computational complexity of eq. 5.12 (c) by taking a summation of only a small number of $P(MS_{t_i}^{t} \mid O^{<1,t>})$ with the most likely start times $t_i$ and discarding the unlikely ones (e.g., when $P(MS_{t_i}^{t} \mid O^{<1,t>})$ is less than an event probability threshold $\tau_{MS}$). The issue here is how we can be certain that a $t_i$ that we discarded will not become more likely again. In the case that the event duration distributions are uniform or Gaussian, there is a criterion for safely discarding the unlikely start times of an event.

• Case 1: Uniform Distribution. Suppose the event durations are uniformly distributed, and let $t_i'$ and $t_i''$ (where $t_i' < t_i''$) be two candidate start times of the event $S_i$ during the computation of $P(MS_i^t \mid O^{<1,t>})$. Let $P(MS_{t_i'}^{t} \mid O^{<1,t>})$ be the computed value of the right-hand side of eq. 5.12 (c) when $t_i = t_i'$, and similarly for $t_i = t_i''$. Suppose we have $P(MS_{t_i'}^{t} \mid O^{<1,t>}) < P(MS_{t_i''}^{t} \mid O^{<1,t>})$ and $P(MS_{t_i'}^{t} \mid O^{<1,t>})$ turns out to be less than the threshold $\tau_{MS}$, making it unlikely to be the real start time of event $S_i$. One criterion for determining whether $t_i'$ can be safely discarded is that $t - t_i'$ is longer than the spread of the sigmoid function applied to inhibit short event durations and that we keep $t_i''$. This is because $P(MS_{t_i'}^{t} \mid O^{<1,t>})$ will never become more likely than $P(MS_{t_i''}^{t} \mid O^{<1,t>})$. For example, at time frame $t+1$, we update the value on each side of the inequality with the same positive probability value $P(S_i^{t+1} \mid O^{t+1})$ and still have:

\[
P\bigl(MS_{t_i'}^{t} \mid O^{<1,t>}\bigr)\, P\bigl(S_i^{t+1} \mid O^{t+1}\bigr) <
P\bigl(MS_{t_i''}^{t} \mid O^{<1,t>}\bigr)\, P\bigl(S_i^{t+1} \mid O^{t+1}\bigr).
\tag{5.14}
\]

• Case 2: Gaussian Distribution. In the case that the event duration distribution is Gaussian, eq. 5.14 still holds if $t - t_i'$ is longer than the mean of the Gaussian distribution. For example, in figure 5.6, since $P_i(t - t_i') < P_i(t - t_i'')$, the relation in eq. 5.14 always holds. As long as we maintain the sub-state $S_{i,j}$ whose $t_i = t_i''$, the parameters of the sub-state with $t_i = t_i'$ can be safely disregarded.

In general, only a certain number, say $k$, of such $t_i$ candidates ($t_i = t_{i_1}, \ldots, t_{i_k}$) that yield the highest $P(MS_{t_i}^{t} \mid O^{<1,t>})$ need to be maintained. The value of $k$, in fact, depends on the event probability threshold $\tau_{MS}$: the larger the threshold is, the smaller the value of $k$ is (i.e., a larger number of the candidates of $t_i$ will be discarded), allowing us to control the upper bound of the computational complexity by setting either $k$ or $\tau_{MS}$. Regardless of the choice of controlling parameter, the expected value of $k$, in the case of uniform event duration distributions, should be larger than the spread of the sigmoid function used for inhibiting short event durations. In the case of a Gaussian event duration distribution, the value of $k$ should be approximately the spread (or an integer approximation of the variance) of the Gaussian distribution. In terms of complexity, if only a certain number $k$ of $t_i$ are maintained, we need to update $k$ values of $P(MS_{t_i}^{t} \mid O^{<1,t>})$ to compute $P(MS_i^t \mid O^{<1,t>})$.
Since this process is repeated for all $N$ states, the multi-state complex event recognition algorithm requires $O(NT)$ operations. In our experiments, $k$ is significantly less than the length of the video streams, even in the case of moderately noisy video sequences.

5.2.4 Segmenting Complex Events from Video Streams

By computing $P(MS_i^T \mid O^{<1,T>})$, $\forall i$, we can make a decision about the most likely event state in the event automaton $MS$ at time $T$. However, to segment the event $MS$ from a continuous video stream, we need to know where the start of the event sequence is. Also, a user may want to know when the transitions between event states have taken place. To keep track of the most likely start time of the current observation of $MS$ and of what the most likely transition timing of the sequential events might be, we compute $P(MS_N^{*T} \mid O^{<1,T>})$, which is defined as the probability that the complete automaton state sequence of $MS$ occurs with the most likely state transition timing, given the sequence of observations $O = O^{<1,T>}$. $P(MS_N^{*T} \mid O^{<1,T>})$ can be computed using an equation similar to that of $P(MS_N^T \mid O^{<1,T>})$ (eq. 5.1 (c)), but replacing the summation with a max operator:

\[
P\bigl(MS_N^{*T} \mid O^{<1,T>}\bigr) = \alpha_0 \max_{\forall(t_1,t_2,\ldots,t_N)}
P\bigl(O^{<1,T>} \mid MS_0^{(t_1-1)} S_1 S_2 \cdots S_N^{<t_N,T]}\bigr)\,
P\bigl(MS_0^{(t_1-1)} S_1 S_2 \cdots S_N^{<t_N,T]}\bigr)
\tag{5.15}
\]

The segmentation procedure consists of finding the values of $t_1, t_2, \ldots, t_N$ that maximize $P(MS_N^{*T} \mid O^{<1,T>})$ in eq. 5.15. By following the same derivation as for $P(MS_N^T \mid O^{<1,T>})$ (i.e., eq. 5.2 to eq. 5.12), we can compute, at time frame $t$, $P(MS_i^{*t} \mid O^{<1,t>})$, $\forall S_i$, as:

\[
P\bigl(MS_i^{*t} \mid O^{<1,t>}\bigr) = \max_{t_i \le t}
\mathrm{Bel}\bigl(S_i^{<t_i,t]}\bigr)\, a_{i,i-1}\,
P\bigl(MS_{i-1}^{*(t_i-1)} \mid O^{<1,t_i-1>}\bigr)
\tag{5.16}
\]

Similarly, the most likely transition time to state $S_i$ is computed as:

\[
t_{i_{best}} = \arg\max_{t_i \le t}
\mathrm{Bel}\bigl(S_i^{<t_i,t]}\bigr)\, a_{i,i-1}\,
P\bigl(MS_{i-1}^{*(t_i-1)} \mid O^{<1,t_i-1>}\bigr)
\tag{5.17}
\]

As with the case of $P(MS_N^t \mid O^{<1,t>})$, starting from $i = 1$, equations 5.16 and 5.17 are recursively processed until the final state $i = N$ is reached, where $P(MS_N^{*t} \mid O^{<1,t>})$ represents the probability of the sequence of states occurring with the optimal transition timing $t_{1_{best}}, t_{2_{best}}, \ldots, t_{N_{best}}$.

The end of an event segment of $MS$ can be detected in a video stream by setting a probability threshold $\tau_e$. Whenever $P(MS_N^t \mid O^{<1,t>})$ falls below $\tau_e$, we mark that frame as the end of the current event segment of $MS$. There are several ways to define the start of an event segment. The most naive way is to search for the most likely start time $t_{1_{best}}$ by backtracking the most likely path computed by eq. 5.17. Another way is to find the time frame $t = t_{peak}$ with the highest probability $P(MS_N^t \mid O^{<1,t>})$ between the ends of the current segment and the previous one, and to backtrack the best path at $t_{peak}$ to find $t_{1_{best}}$. In section 6.2, we describe a more sophisticated algorithm that keeps track of all possible starting times between the ends of the current segment and the previous one.

5.2.5 Implementation of the Complex Event Recognition Algorithm

To illustrate the algorithm, we consider the recognition of a complex event automaton that is composed of one initial state $S_0$ and $N$ real event states $S_1, S_2, \ldots, S_N$. We consider the case where the duration distributions of all event states ($P_i(d)$) have been estimated using, for example, a direct histogramming method.
We also assume that the weights of the transition paths $a_{i,j}$ and the other probability thresholds, such as $\tau_{MS}$, $\delta_{MS}$ and $\tau_e$, have been estimated. In the following, we describe the data structure of the automaton states and the step-by-step recognition algorithm.

Data Structure

The structure of an automaton state $S_i$ consists of the following three parts: 1) a list of sub-states $S_{i,d}$, 2) $P_i(d)$, and 3) a list of four-tuples $(t, t_{i_{best}}, P(MS_i^t \mid O^{<1,t>}), P(MS_i^{*t} \mid O^{<1,t>}))$.

A sub-state $S_{i,d}$ is created to evaluate the probability of the event sequence $S_1, \ldots, S_i$ given $O^{<1,t>}$, where the duration of the last event $S_i$ is $d$ frames (i.e., $P(MS_{t_i}^{t} \mid O^{<1,t>})$, where $t - t_i = d$). To compute this probability, the structure of $S_{i,d}$ is defined as follows:

• Case 1: For $i > 0$, the structure of $S_{i,d}$ contains the three parameters that are shown in figure 5.6. This structure is further augmented to include the probability of the most likely event segmentation described in section 5.2.4, resulting in a structure that is composed of four parts: 1) the transition time frame $t_i$, 2) the probability of all possible event sequences that enter $S_i$ at $t_i$: $P(MS_{i-1}^{t_i-1} \mid O^{<1,t_i-1>})\, a_{i,i-1}\, P(S_i^{t_i} \mid O^{t_i})$, 3) the probability of the most likely event sequence that enters $S_i$ at $t_i$: $P(MS_{i-1}^{*(t_i-1)} \mid O^{<1,t_i-1>})\, a_{i,i-1}\, P(S_i^{t_i} \mid O^{t_i})$, and 4) the probability product $\prod_{t_i \le t' \le t-1} P(S_i^{t'} \mid O^{t'})$.

• Case 2: In the case of $i = 0$, the structure of $S_{0,d}$ contains the three parameters that are shown in figure 5.7 and the probability of the most likely event segmentation, as follows: 1) the transition time frame $t_0$, 2) the probability of all possible event sequences that enter $S_0$ at $t_0$: $\sum_{1 \le m \le N} P(S_0^{t_0} \mid O^{t_0})\, a_{0,m}\, P(MS_m^{t_0-1} \mid O^{<1,t_0-1>})$, 3) the probability of the most likely event sequence that enters $S_0$ at $t_0$: $\max_{1 \le m \le N} P(S_0^{t_0} \mid O^{t_0})\, a_{0,m}\, P(MS_m^{*(t_0-1)} \mid O^{<1,t_0-1>})$, and 4) the probability product $\prod_{t_0 \le t' \le t-1} P(S_0^{t'} \mid O^{t'})$.

To keep track of the detected complex event segments, we maintain:

• A list of detected event segments ($evSeg_1$, $evSeg_2$, ...). The structure of $evSeg_k$ consists of 1) the end time $t_{end}^k$ of event segment $evSeg_k$, and 2) a list of all possible start times of the event segment and their corresponding probabilities, $(t_{start}^k, P(MS_N^{<t_{start}^k, t_{end}^k>} \mid O^{<1,t_{end}^k>}))$, where $P(MS_N^{<t_{start}^k, t_{end}^k>} \mid O^{<1,t_{end}^k>})$ is the probability of the occurrence of the event sequence that starts at $t_{start}^k$ and ends at $t_{end}^k$.

• A list of 2-tuples (start_time, event_segment_probability) for the current potential event segment. This list will be used to initialize a new $evSeg$ (if detected) to be added to the list of detected event segments.

Computational Steps

The recognition algorithm of complex events proceeds as follows:

Initialization

The list of sub-states of $S_i$ is initialized to an empty list. The pre-computed histograms of the event duration distributions are assigned to $P_i(d)$ of the appropriate event states. The list of four-tuples is initialized to contain a single item, $(0, 0, 0, 0)$, for all states $S_i$ where $i > 0$. For the initial state $S_0$, the list of four-tuples contains a single item, $(0, 0, 1, 1)$. That is, the probability that the automaton is initially in the initial state $S_0$ is 1. The list of detected event segments is initialized to an empty list. The list of 2-tuples (start_time, event_segment_probability) is also empty.
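A minimal Python sketch of the data structures and initialization just described is given below; it mirrors the sub-state fields and the per-state four-tuple list, but the field names, container choices and simplified bookkeeping are assumptions made for illustration, not the thesis code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SubState:
    """One duration hypothesis S_{i,d}: the last state S_i started at frame t_i."""
    t_i: int                   # 1) transition time frame into S_i
    p_all_enter: float         # 2) P(MS_{i-1}^{t_i-1}|O) * a_{i,i-1} * P(S_i^{t_i}|O^{t_i})
    p_best_enter: float        # 3) same as 2), but for the most likely predecessor sequence
    p_product: float = 1.0     # 4) prod_{t_i <= t' <= t-1} P(S_i^{t'} | O^{t'})

@dataclass
class AutomatonState:
    """Automaton state S_i: sub-state list, duration distribution, four-tuple history."""
    duration_pmf: List[float]                        # P_i(d), e.g. a learned histogram
    substates: List[SubState] = field(default_factory=list)
    history: List[Tuple[int, int, float, float]] = field(default_factory=list)
    # history entries: (t, t_i_best, P(MS_i^t | O^<1,t>), P(MS_i^{*t} | O^<1,t>))

def initialize(duration_pmfs: List[List[float]]) -> List[AutomatonState]:
    """Initialization: empty sub-state lists, (0,0,1,1) for S_0 and (0,0,0,0) otherwise."""
    states = [AutomatonState(pmf) for pmf in duration_pmfs]
    for i, s in enumerate(states):
        s.history.append((0, 0, 1.0, 1.0) if i == 0 else (0, 0, 0.0, 0.0))
    return states

detected_segments: List[dict] = []   # each: {"t_end": ..., "starts": [(t_start, prob), ...]}
current_segment_starts: List[Tuple[int, float]] = []
```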
Processing

At time frame $t$, the internal parameters in the state structures of all states from $S_N$ to $S_1$ and the list of detected event segments are updated by the following steps:

1. Update the parameters of the existing sub-states $S_{i,d}$ in the list. For all sub-states, the $d$ parameter (the duration of the last event of the sequence) is incremented by one frame, i.e., $S_{i,d} \rightarrow S_{i,d+1}$. The first, second and third parameters in the sub-state structure are kept unchanged, as described in section 5.2.3. The fourth parameter is updated as follows. For $S_{i,d}$ where $i > 0$, we multiply it with $P(S_i^t \mid O^t)$ derived from, for example, the Bayesian network of the simple event state $S_i$ at time $t$. For $S_{0,d}$, we multiply it with $1 - P(S_1^t \mid O^t)$.

2. Create and add a new sub-state $S_{i,0}$ to the list. The values in the structure are initialized as follows.

• Case 1: $i > 0$. 1) Set $t_i$ to $t$. 2) To compute the second parameter, take the (pre-updated) third value of the four-tuple parameter of state $S_{i-1}$ (i.e., $P(MS_{i-1}^{t-1} \mid O^{<1,t-1>})$) and multiply it with $a_{i,i-1}$ and $P(S_i^t \mid O^t)$. 3) To compute the third parameter, take the (pre-updated) fourth value of the four-tuple parameter of state $S_{i-1}$ (i.e., $P(MS_{i-1}^{*(t-1)} \mid O^{<1,t-1>})$) and multiply it with $a_{i,i-1}$ and $P(S_i^t \mid O^t)$. 4) Initialize the fourth parameter to 1.

• Case 2: $i = 0$. 1) Set $t_0$ to $t$. 2) The second parameter, $\sum_{1 \le m \le N} P(S_0^t \mid O^t)\, a_{0,m}\, P(MS_m^{t-1} \mid O^{<1,t-1>})$, is computed as follows. $P(MS_m^{t-1} \mid O^{<1,t-1>})$ is obtained from the (pre-updated) third value of the four-tuple parameter of state $S_m$. $P(S_0^t \mid O^t)$ is calculated as the probability that the next state $S_{m+1}$ is not recognized, $1 - P(S_{m+1}^t \mid O^t)$. This computation is repeated for $1 \le m \le N$, and the summation of all the probabilities is computed. 3) The third parameter, $\max_{1 \le m \le N} P(S_0^t \mid O^t)\, a_{0,m}\, P(MS_m^{*(t-1)} \mid O^{<1,t-1>})$, is computed in a similar fashion. However, $P(MS_m^{*(t-1)} \mid O^{<1,t-1>})$ is obtained from the (pre-updated) fourth value of the four-tuple parameter of state $S_m$. This computation is repeated for $1 \le m \le N$, and the maximal probability is selected. 4) The fourth parameter is initialized to 1.

3. Add a new item, $(t, t_{i_{best}}, P(MS_i^t \mid O^{<1,t>}), P(MS_i^{*t} \mid O^{<1,t>}))$, to the list of four-tuples. The third value of the new tuple, $P(MS_i^t \mid O^{<1,t>})$, is computed by taking the summation of $P(MS_{t_i}^{t} \mid O^{<1,t>})$ for all possible $t_i$, which can be computed from the parameters in the corresponding sub-state $S_{i,d_i}$ (where $d_i = t - t_i$) by taking the product of the second and fourth parameters, i.e., $P(MS_{i-1}^{t_i-1} \mid O^{<1,t_i-1>})\, a_{i,i-1}\, P(S_i^{t_i} \mid O^{t_i})$ and $\prod_{t'} P(S_i^{t'} \mid O^{t'})$. The second and fourth values of the new four-tuple (i.e., $t_{i_{best}}$ and $P(MS_i^{*t} \mid O^{<1,t>})$) are computed by finding the $t_i$ that maximizes $P(MS_{t_i}^{*t} \mid O^{<1,t>})$, which is computed from the parameters in the corresponding sub-state $S_{i,d_i}$ (where $d_i = t - t_i$) by taking the product of the third and fourth parameters, i.e., $P(MS_{i-1}^{*(t_i-1)} \mid O^{<1,t_i-1>})\, a_{i,i-1}\, P(S_i^{t_i} \mid O^{t_i})$ and $\prod_{t'} P(S_i^{t'} \mid O^{t'})$.

4. Eliminate the unlikely sub-states from the list of possible sub-states of $S_i$. For each $S_{i,d}$, we compute $P(MS_{t_i}^{t} \mid O^{<1,t>})$ as described in step 3. If $P(MS_{t_i}^{t} \mid O^{<1,t>})$ is less than $\tau_{MS}$ and there exists $S_{i,d-k}$ in the list where $k > 0$, the sub-state is removed from the list.

5. Normalize the probabilities to prevent underflow of the floating-point variables.
First, we compute the summation of the probabilities $P(MS_i^t \mid O^{<1,t>})$, which can be obtained from the third value of the four-tuple parameter of $S_i$. If $\sum_{0 \le i \le N} P(MS_i^t \mid O^{<1,t>}) < \delta_{MS}$, we normalize the fourth parameter $\prod_{t_i \le t' \le t-1} P(S_i^{t'} \mid O^{t'})$ of all sub-states of $S_i$ by $\sum_{0 \le i \le N} P(MS_i^t \mid O^{<1,t>})$.

6. Update the list of 2-tuples (start_time, event_segment_probability) of the potential event segment and generate a new event segment if one is detected. The start time of the complex event with the most likely event segmentation at frame $t$ can be detected by backtracking the parameter $t_{i_{best}}$ in the four-tuples $(t, t_{i_{best}}, P(MS_i^t \mid O^{<1,t>}), P(MS_i^{*t} \mid O^{<1,t>}))$ of state $S_i$. For example, first, $t_{N_{best}}$ is used to locate the four-tuple in the list $(t, t_{N-1_{best}}, P(MS_{N-1}^t \mid O^{<1,t>}), P(MS_{N-1}^{*t} \mid O^{<1,t>}))$ of $S_{N-1}$ such that $t = t_{N_{best}}$. Then, $t_{N-1_{best}}$ of that four-tuple is used to search for $t_{N-2_{best}}$, and so on. This operation is repeated from $i = N$ to 1, where $t_{1_{best}}$ is detected as the most likely start time of the complex event segment at time frame $t$.

If the list of 2-tuples (start_time, event_segment_probability) already contains an item with start_time $= t_{1_{best}}$, its event_segment_probability is updated with $P(MS_N^t \mid O^{<1,t>})$ if $P(MS_N^t \mid O^{<1,t>})$ is greater than event_segment_probability. If $t_{1_{best}}$ is a new start_time, a new 2-tuple $(t_{1_{best}}, P(MS_N^t \mid O^{<1,t>}))$ is added to the list.

After the list of 2-tuples (start_time, event_segment_probability) has been updated, we determine whether to generate a new event segment and add it to the list of detected event segments as follows. If $P(MS_N^t \mid O^{<1,t>}) < \tau_e$, we create a new event segment $evSeg$, initialize $t_{end}$ to $t$, and initialize the list of start times of the event segment and their corresponding probabilities to the updated list of 2-tuples (start_time, event_segment_probability). After the new event segment is generated, the list of 2-tuples (start_time, event_segment_probability) is cleared.

5.3 Analysis Results of Single-Thread Events

We constructed eighteen Bayesian networks similar to the one shown in figure 3.2 to represent simple events. These events consist of actions related to both the shape of a moving object (e.g., "stand", "crouch") and the object's trajectory (e.g., "approach", "stop at", "slow down"). Some simple events are defined with regard to a geometrical zone (e.g., a road), such as "moving along the path". Thirteen complex events are modeled by constructing an automaton of simple events or other complex events. The parameters of each network are assumed to be Gaussian distributions and are learned from a training data set composed of up to five pre-segmented sequences (approximately 600 frames). The training sequences are staged performances by different actors, taken on different days and in different weather conditions and locations. They contain about half positive and half negative examples. The a priori probabilities of all events are assumed to be equal. These event models constitute the scenario library in figure 3.1.

We tested our single-thread event recognition algorithm on video streams collected in various domains: phone booths, parking lots, street corners and airborne platforms. In the following, the algorithm is first demonstrated using videos taken from an Unmanned Airborne Vehicle (UAV), since the tracking is less noisy and not affected by perspective.
The goal is to show that our complex event recognition algorithm can discriminate competing events and correctly segment them from continuous video streams. Then, we show the results from the more challenging ground surveillance videos, including the analysis results of the actions of two objects in the "stealing by blocking" sequence (figure 4.3). We aim to show that our system can recognize a variety of events and that it works well under various input data conditions.

5.3.1 Recognizing Competing Events in UAV Videos

In this section, we demonstrate how the recognition of complex events is performed and show the results in detail. We analyze two pre-segmented videos, each of which contains a different event: "go through the checkpoint" and "avoid the checkpoint". We aim at constructing event models of these activities and using them to recognize and discriminate the events in the videos.

The surveillance videos of a ground checkpoint area are taken from a UAV. Since the UAV is a moving platform, we first compensate for the camera motion to obtain stabilized video imagery on which our tracking system can operate. We use a stabilization technique developed by E. Kang et al. [KCMOO], which is based on recovering affine motion parameters from a set of feature point correspondences between consecutive frames. Such parameters are then used to warp the video images frame by frame to a common reference frame. Figure 5.8 shows two stabilized image sequences (CheckPntA and CheckPntB), in which we observe "a car goes through the checkpoint" and "a car avoids the checkpoint", respectively. The checkpoint is defined as the zone that lies between the two tanks at the intersection. The large distance from the UAV to the ground, the small depth of field and the fact that the image plane is almost parallel to the ground plane make the projection approximately orthographic. Besides the fact that these videos are taken from a vehicle in motion and need to be stabilized, tracking objects in these near-orthographic videos is easier than in some other surveillance applications due to little occlusion and insignificant perspective problems. We filter and track the regions that match best with the average size of a passenger car and have a consistent intensity distribution, without using the geometry of the ground plane (which is the technique we describe in chapter 4 for tracking humans). The tracking results are shown in figures 5.9 and 5.11, respectively. The scene context (i.e., the checkpoint zone and the road) is defined on the image plane using polygons.

We model two competing events to be detected using our hierarchical representation. The first complex event, "go through the checkpoint", is modeled by an automaton similar to the one shown in figure 5.2 but consisting of three simple events: "approach the checkpoint", "move inside the checkpoint zone" and "leave the checkpoint". Each of these sub-events is modeled by a Bayesian network of sub-events and mobile object properties (e.g., see figure 3.2). The other competing complex event, "avoid the checkpoint", is modeled in the same way as "go through the checkpoint", but with different sub-events: "approach the checkpoint", "stop before entering the checkpoint zone" and "leave the checkpoint". The analysis results of these event models are shown in figures 5.10 and 5.12.
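The two competing automata just described can also be written down compactly as data; the following is a hypothetical declarative specification of the kind that could feed the forward computation sketched earlier. The state names come from the text, but the transition weights and duration bounds are illustrative assumptions rather than values used in the thesis.

```python
# Hypothetical specification of the two competing checkpoint event models.
CHECKPOINT_MODELS = {
    "go_through_checkpoint": {
        "states": ["approach_checkpoint", "move_inside_checkpoint_zone", "leave_checkpoint"],
        "trans_weight": [1.0, 1.0, 1.0],      # a_{i,i-1}, illustrative
        "duration_bound": [300, 300, 300],    # d' per state, in frames (assumed)
    },
    "avoid_checkpoint": {
        "states": ["approach_checkpoint", "stop_before_entering_zone", "leave_checkpoint"],
        "trans_weight": [1.0, 1.0, 1.0],
        "duration_bound": [300, 300, 300],
    },
}

def most_likely_model(per_model_prob):
    """Compare the normalized final-state probabilities of competing models."""
    return max(per_model_prob, key=per_model_prob.get)

# e.g. most_likely_model({"go_through_checkpoint": 0.96, "avoid_checkpoint": 0.02})
```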
Figure 5.8: Two checkpoint sequences: (a) "a car goes through the checkpoint", (b) "a car avoids the checkpoint". The checkpoint is defined as the zone that lies between the two tanks at the intersection.

Figure 5.9: Detection and tracking of moving regions for "CheckPntA".

In figure 5.10, the probabilities of four simple event models related to the checkpoint scenarios are shown at the top. The bottom two plots show, for "go through" and "avoid" respectively, the probabilities of the current state in the event sequence of the complex event being $S_i$ at each time frame $t$ (i.e., $P(MS_i^t \mid O^{<1,t>})$). "Go through the checkpoint" is recognized at frame 99 ($P(MS_3^{t=99} \mid O^{<1,99>})$ is 0.96), while the state of the automaton "avoid the checkpoint" remains in the initial (fail) state for most of the time.

Figure 5.10: Event analysis results of the sequence "CheckPntA": I) output of the simple event models related to the checkpoint, II) evolution of $P(MS_i^t \mid O^{<1,t>})$, $\forall S_i$, of the "go through the checkpoint" event model, and III) evolution of $P(MS_i^t \mid O^{<1,t>})$, $\forall S_i$, of the "avoid the checkpoint" event model. II) and III) show the evolution of the probabilities of the two competing complex event models.

Similarly, in figure 5.12, the probabilities of four simple event models related to the checkpoint scenarios are shown at the top. The bottom two plots show, for "go through" and "avoid" respectively, the probabilities of the current state in the event sequence of the complex event being $S_i$. The state of the automaton "go through the checkpoint" remains in the initial (fail) state or the "approach" state for most of the time, while the "avoid the checkpoint" event is recognized at frame 285 ($P(MS_3^{t=285} \mid O^{<1,285>})$ is 0.99).

Figure 5.11: Detection and tracking of moving regions for "CheckPntB".

Even though the simple event recognition results of the sequence "CheckPntB" are quite noisy (as can be noticed by the peak noise at various times during the event "stop before enter"), the most likely sequence of sub-events of "avoid the checkpoint" is tracked accurately.
This is because our complex event recognition algorithm prevents state transitions that would result in unrealistically short events.

Figure 5.12: Event analysis results of the sequence "CheckPntB": I) output of the simple event models related to the checkpoint, II) evolution of $P(MS_i^t \mid O^{<1,t>})$, $\forall S_i$, of the "go through the checkpoint" event model, and III) evolution of $P(MS_i^t \mid O^{<1,t>})$, $\forall S_i$, of the "avoid the checkpoint" event model. II) and III) show the evolution of the probabilities of the two competing complex event models.

5.3.2 Event Segmentation from Continuous Video Streams

Earlier in this section, we demonstrated our single-thread event recognition algorithm on pre-segmented videos. That is, the videos are segmented such that the first event state in the event automaton, $S_1$, is observed at the beginning of the video stream. In general, we need to detect and segment events from continuous video streams of unknown content. To test our segmentation algorithm, we create three simulated image sequences that contain more varied events derived from "CheckPntA" and "CheckPntB".

• "CheckPntC" is created from "CheckPntA" and contains the event "go through the checkpoint". However, instead of going directly to the checkpoint from the beginning, the car stops for approximately 20 frames, interrupting the recognition of "go through the checkpoint". Figure 5.13(a) shows the recognition results of the simple events. By looking at these plots alone, one incorrect interpretation would be that the event automaton "go through the checkpoint" remains in the state "approach the checkpoint" during frames 0 to 55, regardless of the fact that the car actually stops during $t = 23$ to 46. Figure 5.13(b) shows the results of matching the trajectory of the car with the "go through the checkpoint" complex event model. During the period in which the car stops ($t = 23$ to 46), $P(MS_0^t \mid O^{<1,t>})$ (in black) gives the highest probabilities, indicating that the observations evaluated so far do not match the event sequence of "go through the checkpoint". When the car starts approaching the checkpoint again at frame 47, the automaton makes a transition from state $S_0$ to $S_1$ (in red). At the end (from $t = 129$ to 139), $P(MS_{t_i}^{t} \mid O^{<1,t>})$ computed for the last state $S_3$ gives the highest probabilities (e.g., $P(MS_3^{t=129} \mid O^{<1,129>}) = 0.96$), indicating that there is a segment of the trajectory of the car that matches the event "go through the checkpoint".
Such a segment can be obtained by backtracking the most likely transitions from the current state $S_3$ to the first event state $S_1$. By backtracking the automaton "go through the checkpoint" at frame 129 ($t_{peak}$), we find that the transition was made from the initial state to $S_1$: "approach" at frame 47, then from $S_1$ to $S_2$: "move inside the checkpoint zone" at frame 57, and finally from $S_2$ to $S_3$: "leave the checkpoint" at frame 115. This segmentation, in fact, corresponds with the human observation.

Figure 5.13: Analysis of single-thread events in the simulated sequence CheckPntC: a) sub-events of the sequence CheckPntC, b) complex event "go through the checkpoint", c) complex event "avoid the checkpoint".

A mindful reader may suspect that a system with little discriminative power might also recognize the event "avoid the checkpoint" by observing the sequential events "approach the checkpoint" (in red), "stop before entering the checkpoint zone" (in black) and "leave the checkpoint" (in blue). Figure 5.13(c) indicates that this does not happen in our system. $P(MS_i^t \mid O^{<1,t>})$ (in blue) of the "avoid the checkpoint" event model remains close to 0 throughout the analysis.

• In this experiment, a simulated sequence "CheckPntD" is generated by repeating the sequence "CheckPntB" twice. Figure 5.14(a), which shows the recognition results of the simple events, indicates that a pattern of the "avoid the checkpoint" event (in red, green and blue) is repeated twice. Figure 5.14(b) shows the plot of the "go through the checkpoint" event model. One can notice that the automaton makes a transition back to the initial state (in black) at frames 29 and 93, when the expected event state $S_2$: "move inside the checkpoint zone" is not observed. Figure 5.14(c) shows the recognition results of the automaton event "avoid the checkpoint". It can be noticed that two instantiations of "avoid the checkpoint" (in blue) are detected. By backtracking the most likely state transitions at frame 71, where the first instantiation ends, we know that the transition to state $S_1$ occurs at frame 7, allowing us to segment this event instantiation from frame 7 to frame 71. Similarly, we can backtrack the most likely transitions even while the event is still going on. For example, by backtracking at frame 131, we know that the current instantiation of "avoid the checkpoint" starts from frame 82.

• The last simulated sequence, "CheckPntE", is generated by concatenating the sequences "CheckPntB" and "CheckPntA" together.
In the previous experiment, we showed that the segmentation of two instantiations of the same event can be made correctly. The goal of this experiment is to examine whether the event segmentations obtained from different event models coincide with each other. Figure 5.15(a) shows the results of the simple event analysis, which indicate that the pattern of "avoid the checkpoint" is observed before the pattern of "go through the checkpoint". Figures 5.15(b) and (c) show the analysis results of these patterns using the corresponding automaton models. By backtracking from the last state of the model "go through the checkpoint" (figure 5.15(b)) at frame 173,

[Figure 5.14: Analysis of single-thread events in the simulated sequence CheckPntD. a) Sub-events of the sequence CheckPntD ("approach zone", "stop_before_enter", "leave zone"). b) Complex event "go through the checkpoint". c) Complex event "avoid the checkpoint".]

we know that this instantiation of "go through the checkpoint" starts at frame 72. Similarly, "avoid the checkpoint" is detected to end at frame 73, where P(M_{S_3}^{73} | O^{<1,73>}) falls below 0.05. By backtracking from the last state of "avoid the checkpoint" at frame 71 (t_peak), we know that this instantiation of the event starts at frame 7. It can be noticed that the segmentation information about the activities of the same object obtained from competing event models more or less agrees.

[Figure 5.15: Analysis of single-thread events in the simulated sequence CheckPntE. a) Sub-events of the sequence CheckPntE ("approach zone", "stop_before_enter", "go through zone", "leave zone"). b) Complex event "go through the checkpoint". c) Complex event "avoid the checkpoint".]
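To make the segmentation step above concrete, the sketch below records, during the forward analysis, the frame at which each event state was most likely entered, and then backtracks those entry frames from the peak of the last state. It is only an illustration: it uses a plain Viterbi-style recursion over per-frame sub-event probabilities rather than the duration-weighted recursion actually used by the system, and the array names and the log-probability bookkeeping are assumptions made for this example.

```python
import numpy as np

def segment_event(p_sub, end_frame):
    """Simplified forward pass + backtracking for a single-thread event automaton.

    p_sub[i, t] is the probability that sub-event i (event state S_{i+1}) is
    observed at frame t, as produced by the Bayesian networks.  Returns the
    most likely entry frame of every state on the best path that ends in the
    last state at `end_frame`.
    """
    N, T = p_sub.shape
    score = np.full((N, T), -np.inf)
    entered = np.zeros((N, T), dtype=int)        # frame at which state i was entered

    score[0, 0] = np.log(p_sub[0, 0] + 1e-12)
    for t in range(1, T):
        for i in range(N):
            stay = score[i, t - 1]
            advance = score[i - 1, t - 1] if i > 0 else -np.inf
            if advance > stay:                   # best path enters state i at frame t
                score[i, t] = advance + np.log(p_sub[i, t] + 1e-12)
                entered[i, t] = t
            else:                                # best path remains in state i
                score[i, t] = stay + np.log(p_sub[i, t] + 1e-12)
                entered[i, t] = entered[i, t - 1]

    # Backtrack the entry frames of S_N, S_{N-1}, ... from the peak frame.
    entry_frames, t = [], end_frame
    for i in range(N - 1, -1, -1):
        t_in = entered[i, t]
        entry_frames.append((i, t_in))
        t = max(t_in - 1, 0)
    return list(reversed(entry_frames))
```

In the "CheckPntC" example above, such a backtrack from the peak frame 129 would return the entry frames of "approach" (47), "move inside the checkpoint zone" (57) and "leave the checkpoint" (115), which together delimit the recognized segment.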
5.3.3 Recognition of "Converse" and "Taking Object"

In this section, we demonstrate our event analysis algorithm on the recognition of complex events related to humans and show that our system is robust to moderate noise. The analysis of human events is more difficult than the analysis of vehicle events, mainly because humans are not rigid, solid objects, which makes the tracking noisier and more sensitive to the articulation of the body limbs.

[Figure 5.16: Analysis of single-thread events for object "A". a) Simple events ("approach", "bend to drop object", "stand close to"). b) Complex event "Converse" (S0: initial, S1: approach, S2: bend_to_drop_obj, S3: stand_close_to).]

We have correctly recognized the complex events of over twenty humans, including the ones that appear in the sequences shown in figures 4.7, 4.10 and 4.12, with high probabilities (mostly over 0.8). In the following, we show the analysis results of the actions of two objects in the "stealing by blocking" sequence (figure 4.3).

• The complex event "converse" is the action performed by object "A" in the "stealing" sequence. It is described as a sequential occurrence of three sub-events: "approach the reference person", "bend down" (to drop a briefcase) and "stand close to the reference person". While "approach" and "stand close to" are simple events, "bend down" is another complex event, modeled by an automaton of five shape-related event states: "stand", "height decreasing", "crouch", "height increasing" and "stand".

We process the "stealing" sequence by analyzing all simple and complex events modeled in the library. Figure 5.16(a) shows the evolution of the probabilities of the sub-events of "converse" analyzed for object "A". The probabilities of the other simple events related to the trajectory are lower than 0.2 at all time frames. These results correspond with the human observation of the video: the moving object first appears on the pedestrian path on the right, then proceeds toward the reference object, drops the briefcase and stops. It can be noticed that the probabilities of "approach the reference person" are considerably noisy compared to "stand close to the reference person". This is because the recognition of "approach the reference person" depends on the instantaneous direction and speed of the objects, whose measurements are not stable. In contrast, the "stand close to" event varies smoothly with the distance measurements; that is, once the object appears near the reference person, "stand close to" will remain high as long as the object does not appear to jump a distance. Figure 5.16(b) shows the evolution of P(M_{S_i}^t | O^{<1,t>}), for all S_i, of the complex event "converse" computed from the sub-events. It can be noticed that even though "converse" depends on the recognition of the noisy "approach the reference person", its probability is more stable.
This is because the temporal mean algorithm averages out the noise.

• The complex event "taking object" is a sequential occurrence of "approach the reference person", "pick up object" and "leave". It is modeled in a similar way to "converse". "Pick up object" is defined as the complex event "bend down" conditioned on the fact that the suitcase has disappeared. Figure 5.17(a) shows the probabilities of the sub-events of "taking object" for object "D", which match the human observation. In figure 5.17(b), "taking object" is recognized once its sub-events have sequentially occurred. We note that "pick up object" is, in fact, an articulated motion that, in an ideal case, should be recognized by tracking the movements of the human limbs. However, in many applications (in particular, the surveillance of an outdoor scene), moving objects may appear very small and tracking the body configuration is not possible. We have shown that, in such cases, the approximate movement of the shape (i.e., the change of the bounding box) may be combined with an event context (i.e., the fact that the suitcase is missing) to recognize a more sophisticated activity.

[Figure 5.17: Analysis of single-thread events for object "D". a) Sub-events of "Taking Object" ("approach", "stand close to object", "pick up object", "leave"). b) Complex event "Taking Object" (S0: initial, S1: approach, S2: stand_close, S3: pick_up_obj, S4: leave_owner).]

5.4 Discussion

In this chapter, we have introduced a hierarchical event model that bridges the gap between the numerical property values of moving objects (i.e., the observations) and their single-thread actions. In most cases, approximately three layers of mobile object properties are defined and used in a broad spectrum of applications. Several layers (typically, two to four) of simple and complex events that are more specific to the application are defined from these generic mobile object properties. Unlike previous systems that require the users to have specific knowledge (e.g., of the internal structure of HMMs), our system requires the users to have only a basic understanding of our event taxonomy. Since these event concepts are defined such that they map closely to how a human would describe events, less expertise is expected from the users.

Bayesian networks and probabilistic finite-state machines are trained and used to recognize events and to address the problems of uncertain and incomplete data and of event segmentation. Even though the structure of the Bayesian networks and of the finite-state machines is given by hand, we believe this is reasonable as we expect only a limited number of primitive events. The optimality of our recognition algorithm is guaranteed under the following assumptions.

• Given the state at time t_i, the probability of an event state at time t_i is independent of the observation at time t_j, where t_i != t_j.
• Given the state at time t_i, the distribution of the observation at time t_i is independent of the states and the observations at other time frames.

• Given S_i, S_{i+1} is independent of S_j, where j < i. Similarly, given S_i, O^{<t_{i+1},t>} is independent of S_j, where j < i.

• The transition to the state S_i depends only on the duration of the previous state S_{i-1}.

• The a priori probabilities of events are uniformly distributed.

• The parameters of the Bayesian networks are Gaussian and the duration of an event is uniformly distributed.

Chapter 6
Multi-Thread Event Recognition

In chapter 5, we introduced techniques for inferring single-thread events from pixel-level information. In this chapter, we build on the foundation provided by those techniques to address the problem of representing multi-thread events, generally used for describing multi-agent activities. A multi-thread event is composed of several such action threads, possibly performed by different actors. These action threads are related by logical and temporal relations. Since each action thread is recognized with uncertainty about its occurrence and temporal segmentation, the analysis of multi-thread events must also be based on a mechanism that handles such uncertainty for optimal recognition results. In the following, we first describe the representation of multi-thread events. We then explain how we manage the complexity of the search for the set of the most likely instances of single event threads that match the temporal relations.

6.1 Event Graph

We represent a multi-thread event by an event graph similar to interval algebra networks [AF94]. The nodes of the graph are threads of actions of some actors, which are recognized by stochastic finite state automata. A link between two nodes represents pair-wise temporal and logical constraints between the two event threads. For example, "stealing by blocking" is composed of five event threads, as described in Table 6.1. The temporal constraints among these events are that 1) "converse" occurs before "approach1" and "approach2", 2) "blocking" starts right after or some time after "approach1" or "approach2" is accomplished, and 3) "taking_object" occurs during "blocking". Figure 6.1 shows a graphical representation of these events. The symbols "b", "d" and "s" stand for the interval temporal constraints "starts before", "occurs during" and "starts", respectively.

Table 6.1: Description of the multi-thread event "stealing by blocking".

  converse        actor1 approaches his friend and drops a suitcase on the ground.
  approach1       actor2 approaches and stops between actor1 and the suitcase.
  approach2       actor3 approaches and stops between actor1's friend and the suitcase.
  blocking        actor2 or actor3 is blocking the view of the suitcase.
  taking_object   actor4 approaches and takes the suitcase away.

[Figure 6.1: A graphical description of a multi-thread event. The nodes are the threads "converse", "approach1", "approach2", "blocking" and "taking_obj"; the links carry the constraints {b, s} and {d}.]

Multi-thread events are recognized by evaluating the likelihood values of the temporal and logical relations among event threads.
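To make the representation concrete, the sketch below encodes the "stealing by blocking" graph of Table 6.1 and Figure 6.1 as a small data structure. It is only an illustration of the shape of the graph; the class names, the role labels and the relation strings are chosen for this example and are not the system's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class EventThread:
    name: str      # single-thread event recognized by a stochastic automaton
    actor: str     # role label, bound to tracked objects at analysis time

@dataclass
class EventGraph:
    threads: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)   # (thread_a, relation, thread_b)

    def add_thread(self, key, name, actor):
        self.threads[key] = EventThread(name, actor)

    def relate(self, a, relation, b):
        self.relations.append((a, relation, b))

# "Stealing by blocking" (Table 6.1 / Figure 6.1); relation labels mirror the
# textual constraints 1)-3) above.
g = EventGraph()
g.add_thread("converse",      "converse",               "actor1")
g.add_thread("approach1",     "approach to block view", "actor2")
g.add_thread("approach2",     "approach to block view", "actor3")
g.add_thread("blocking",      "blocking the view",      "actor2 or actor3")
g.add_thread("taking_object", "taking object",          "actor4")

g.relate("converse",      "starts before",           "approach1")
g.relate("converse",      "starts before",           "approach2")
g.relate("approach1",     "starts before or starts", "blocking")
g.relate("approach2",     "starts before or starts", "blocking")
g.relate("taking_object", "during",                  "blocking")
```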
Constraint satisfaction and propagation techniques have been used in the past [AF94] based on the assumption that events and their durations are deterministic. We present in this chapter an algorithm to verify and propagate temporal constraints in a probabilistic framework, in which complex event threads (recognized by stochastic finite state automata) are uncertain and may have different possible start times during their course of action, depending on the most likely transition timings between states at each moment. Our goal is to search for the segments (or instances) of event threads that satisfy the required temporal and logical constraints and that, when combined, give the highest probability. There are two main questions that need to be answered. The first is how to perform and manage the segmentation of event instances. The second is how to evaluate the event relations. We describe our solutions to these problems in the following.

6.2 Segmenting and Managing the Probabilistic Event Instances

Suppose that we have two complex events, event A and event B, each with a likelihood distribution similar to the one shown in figure 6.2. Computing the likelihood of these events satisfying a temporal constraint requires an exploration of the combinations of all possible event intervals of both events that may occur between time frame 1 and the current frame. For example, to compute "A before B", we need to find a time frame t_f such that event A started and ended before t_f and event B started after t_f and may still be occurring. The event intervals of A and B that give the maximal combined likelihood define the occurrence of the multi-thread event.

In our case, the end point of an event instance can be determined when the likelihood of the event becomes very low. For example, in figure 6.2, by assuming that an event instance ends when the likelihood becomes lower than 0.05, we have four possible event instances, the end times of which are determined to be at frames 130, 136, 168 and 191, respectively.

[Figure 6.2: The likelihood of the complex event "chase after" inferred by a probabilistic finite state automaton. The event, at different times, may have different likely start times depending on the most likely transition timings between states. For example, the most likely start time of the event during the solid, dark circles is 118, during the grey circles 175, and so on.]

The start times of these event instances can be inferred by backtracking the transitions in the event automaton. The event, at different times, may have different likely start times, as illustrated in figure 6.2, depending on the most likely transition timings at that moment. Let t_{e_i} be the end time of event instance i and t_{s_j} a possible start time j of the same instance; the corresponding event segment j of instance i starts at t_{s_j} and ends at t_{e_i}. These start times can be maintained by keeping track of the k most likely start times (as described in section 5.2) and updating the probabilities of the segments between t_{e_{i-1}} and t_{e_i}. This process may result in a total of k * (t_{e_i} - t_{e_{i-1}}) possible start times in the worst case.
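The bookkeeping described above can be sketched as follows. The sketch assumes that the per-frame event likelihood and the automaton's candidate start times are already available from the single-thread analysis, and it reuses the 0.05 termination threshold mentioned above; the function name and the dictionary-based instance representation are assumptions made for this illustration, not the system's actual interface.

```python
END_THRESHOLD = 0.05   # an instance is closed when its likelihood drops below this

def update_instances(instances, frame, likelihood, candidate_starts):
    """Maintain the probabilistic event instances of one event thread.

    instances        : list of dicts {"starts": {t_s: prob}, "end": int or None}
    likelihood       : P(event | O^{<1,frame>}) at the current frame
    candidate_starts : {t_s: prob}, the k most likely start times reported by
                       backtracking the automaton at this frame
    """
    if not instances or instances[-1]["end"] is not None:
        if likelihood >= END_THRESHOLD:
            instances.append({"starts": {}, "end": None})   # open a new instance
        else:
            return instances

    current = instances[-1]
    if likelihood < END_THRESHOLD:
        current["end"] = frame                       # close the current instance
    else:
        for t_s, p in candidate_starts.items():      # refresh the candidate segments
            current["starts"][t_s] = max(p, current["starts"].get(t_s, 0.0))
    return instances
```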
However, from our experiments, the number of most likely start times of an event instance tends to be close to k, which is relatively small compared to the length of the video sequences. This is because we assume that the event durations are uniformly distributed. The most likely transition times of a state at the current time frame are normally the same as at the previous ones, and once the likelihood that the event is currently occurring is high, the current event instance almost always has to end before the current time frame can be added to the possible start times of an event instance.

6.3 Evaluation and Propagation of Temporal Relations in the Event Graph

Suppose that the event threads are independent of each other and that we have segmented and computed the probabilities of the event instances of all event threads. A temporal event relation in the event graph can now be evaluated by verifying the corresponding temporal constraint (the link of the graph) on the intervals of two event instances and computing the product of their probabilities. The highest combined probability over all the explored event instances is chosen as the probability of the event relation. For example, "A starts before B" is evaluated as follows:

    P("A starts before B") = max_{i,j} P(A_i) P(B_j),  if start(A_i) < start(B_j),        (6.1)

where i and j index the instances of the event threads. Other temporal relations can be computed similarly. To compute "A during B", we first find all possible event intervals of A and B. We then search for a combination of these intervals that produces the maximal likelihood, subject to the constraint that the start and end times of A must lie within the interval of B.

Event threads that are combined by a logical constraint such as "and" or "or" can be handled more easily than by a temporal constraint, as we do not need to verify a temporal relation. For "A and B", we compute the product of the event likelihoods. For "A or B", we choose the event with the higher likelihood.

A multi-thread event that is described by more than two event threads can be inferred by propagating the temporal constraints and the likelihood values of the sub-events along the event graph. For example, suppose that we have two event constraints, "A before B" and "B before C". To evaluate "B before C", we need to consider the extra constraint that event B occurs after event A, which is done as follows. First, for each instance of B, we search for the instance of A that, when combined with that instance of B subject to the "before" relation, gives the highest probability. We update the probabilities of these instances of B with the corresponding "A before B" probabilities, and eliminate the instances of B whose relations with the instances of A are not satisfied. "B before C" is then evaluated using the updated set of event instances of B. This computation requires a search operation of O(k^{R+1}) complexity, where k is the average number of starting points of an event instance and R is the number of temporal relations in a series. For example, for "A before B" and "C during B", R is equal to 2.

6.4 Multi-Thread Event Analysis Results

We have constructed five multi-thread event models. In this section, we show the analysis results of four different sequences.
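Before turning to the individual sequences, the following sketch illustrates how a pairwise relation such as Eq. (6.1) can be evaluated over uncertain event instances, and how an updated instance set can be passed along a chain of relations. The list-of-triples instance representation and the function names are assumptions made for this illustration only; note that the first function uses the "starts before" condition of Eq. (6.1), while the second uses the stricter Allen-style "before" condition (A ends before B starts) used in the chained example above.

```python
def starts_before(instances_a, instances_b):
    """Eq. (6.1): P("A starts before B") over uncertain instances.

    instances_a, instances_b: lists of (start_frame, end_frame, probability).
    Returns the best combined probability and the pair of instances achieving it.
    """
    best_p, best_pair = 0.0, None
    for a in instances_a:
        for b in instances_b:
            if a[0] < b[0]:                      # start(A_i) < start(B_j)
                p = a[2] * b[2]
                if p > best_p:
                    best_p, best_pair = p, (a, b)
    return best_p, best_pair


def propagate_before(instances_a, instances_b):
    """Update the instances of B with the best "A before B" evidence and drop
    the instances of B that cannot satisfy the relation (used when evaluating
    a chain such as "A before B" and "B before C")."""
    updated = []
    for b in instances_b:
        best = 0.0
        for a in instances_a:
            if a[1] < b[0]:                      # A ended before B started
                best = max(best, a[2] * b[2])
        if best > 0.0:
            updated.append((b[0], b[1], best))
    return updated
```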
6.4.1 Recognition of "Stealing by Blocking"

We first illustrate our multi-agent event recognition algorithm using the sequence "stealing by blocking" (figure 1.2). First, obj A approaches a reference object (a person standing in the middle with his belongings on the ground). Obj B and obj C then approach and block the view of obj A and the reference person from their belongings. In the meantime, obj D comes and takes the belongings. Figure 6.3 (a), (b), (c) and (d) show the analysis results of the single-thread events evaluated for objects A, B, C and D, respectively. The blue lines show the recognition of the most significant complex event modeled by a probabilistic automaton. The red, green or magenta lines show the temporal evolution of the probabilities of their corresponding sub-events. For example, object A is recognized to be performing "converse" (blue) after it approaches (red) and stops (green) at the reference person. The black line in (a) shows a weak recognition of "walking along the pedestrian path" that object A performed for a short time. The results of such temporary and other unrecognized events are omitted for clarity in the other plots. In (b) and (c), objects B and C have approached (red) and then stopped between the reference person (green) and his belongings. The recognition of the sequential action "approach then stop between the reference objects" (which is an instantiation of "approach1" and "approach2" in figure 6.1) is shown in blue in both plots. Figure 6.3 (d) shows that object D has performed "taking objects" (blue), by recognizing the sequence of events "approach" (red), "pick up object" (green) and "leave" (blue).

Figure 6.4 (a) shows the recognition of the temporal relations defined in "stealing by blocking" together with the assignment of the actors. For example, it is verified at frame 320 that the event "converse" performed by object A is accomplished before the event "approach then stop between" performed by object C (shown in green). Finally, figure 6.4 (b) shows the probabilities of two instantiations of the event "stealing by blocking". The red and green lines show the probabilities of "stealing by blocking" where the sub-event "approach" is performed by object B and object A, respectively. These are, in fact, the two alternative paths defined by the "or" relationship in figure 6.1.

6.4.2 Recognition of Activities from Noisy Data

We have recognized other multi-thread events from videos taken at a different scene by a camera with a very low tilt angle. Low-angle cameras are very common in many real-world applications, including home videos and indoor surveillance. The videos taken by this type of camera are challenging because the foot positions of the human objects cannot be reliably recovered, especially when the objects are located far from the camera. Such tracking noise results in very noisy single-thread event recognition. We show that our system can still recognize the global activities correctly even though the segmentation and recognition of the actions of each actor are noisy.

[Figure 6.3: Single-thread event analysis results of the actors in the sequence "stealing by blocking". (a) Object A: sub-events "approach", "bend to drop object", "stand close to" and the complex event "converse". (b) Object B: "approach", "stop in view-blocking area" and the complex event "approach to block view". (c) Object C: same events as object B. (d) Object D: "approach", "stand close to object", "pick up object", "leave" and the complex event "take object".]

[Figure 6.4: Event analysis results of the sequence "stealing by blocking". (a) Evaluation of the event relations related to "converse" and "blocking" with the assigned actors (e.g., "Converse(objA) before Approach1(objC)", "Converse(objA) before Approach2(objB)", "Approach1(objC) before Blocking", "BlockingView(objB, objC)"). (b) The two most significant instantiations of "stealing by blocking", which have different actor and event combinations (P("stealing") = 0.99).]

Recognition of "Object Transfer"

The first video displays a multi-thread event called "object transfer", which can be described as someone bringing an object into the scene so that another person can pick it up. Figures 4.10 and 4.11 show the original video and the tracking results. We note that the trajectories of Obj4, Obj5, and Obj6 are merged with the trajectory of Obj3. An event graph representation of this activity is shown in figure 6.5; it is composed of four single-thread events: "Obj1 brings in Obj2", "Obj3 takes away Obj2", "Obj3 leaves Obj1" and "Obj2 is taken away from Obj1". The actors in this event graph are named to match those that appear in the video in order to clarify the idea; during the analysis, all mobile objects detected in the video are assigned the different roles in the event threads. Figures 6.6 (a), (b), (c) and (d) show the analysis results of the single-thread events evaluated for Obj1, Obj2, Obj3 and Obj4, respectively. The dotted blue lines show the recognition of the complex event. The solid red, green and black (if shown) lines indicate the temporal evolution of the probabilities of their corresponding sub-events.
[Figure 6.5: A graphical description of "Object Transfer". The threads "Obj1 bring in Obj2", "Obj3 take away Obj2", "Obj3 leave Obj1" and "Obj2 taken away from Obj1" are related by the constraints {start before}, {overlap} and {overlap}.]

Despite the fact that "Obj3 leaves Obj1" and "Obj2 is taken away from Obj1" are particularly noisy, it can be noticed that the actions of the actors in this video sequence are successfully recognized. Figure 6.7 (a) shows the evaluation of the temporal relations defined in "object transfer" together with the assignment of the actors. For example, it is detected (as shown in red) at frame 545, which is the time frame at which the event "Obj3 (human) takes Obj2 (luggage)" is recognized, that this event occurs after the event "Obj1 (human) brings in Obj2". Similarly, the green and the blue lines show the evaluation of the two "overlap" links in figure 6.5. Finally, figure 6.7 (b) shows the probabilities of two instantiations of the event "object transfer"; they represent the two alternative paths in the event graph (figure 6.5).

Recognition of "Stealing by Phonebooth"

In the previous examples, we have shown activities that do not require knowledge of scene objects to be understood. In some cases, global activities may be defined in relation to a scene context. Figures 4.7 and 4.8 show the sequence "stealing by phonebooth" and the tracking results. This event is similar to "object transfer", except that the object is taken while the owner is using the phone. We define the notion of "stealing" to be that the attention of the owner is diverted in some way when his or her belongings are removed. In "stealing by blocking", the owner's attention to his belongings is distracted by someone blocking the view. In this sequence, the owner's attention is distracted by his use of the payphone. We represent "stealing by phonebooth" using the event graph in figure 6.8. It is composed of the same event threads as in the case of "object transfer", with the addition of the "use phone" event. The potential occurrence of the "use phone" event is inferred from the fact that the location of the phone user lies within the "use phone" zone. We define this zone by drawing a polygon on the ground plane at the location where people who use the payphone would have to stand.

Figures 6.9 (a), (b), (c), (d) and (e) show the analysis results of the single-thread events evaluated for Obj1, Obj2 and Obj4. Even though we do not show the single-thread event recognition results for Obj3, the automata corresponding to these five event threads never reach their last state when they are evaluated for Obj3; that is, none of these event threads matches the trajectory of Obj3. We use the same color-coding as in the last example. Figure 6.10 (a) shows the recognition of the temporal relations defined in "stealing by phone" together with the assignment of the actors. Finally, figure 6.10 (b) shows the probability of the most significant instance of the global activity "stealing by phone", which is the result of propagating the probabilities of the event relations to the last relation defined in the event graph.

Recognition of "Assault"

The last example is concerned with two actors, one of which assaults the other. Figures 4.12 and 4.13 show the original sequence and the tracking results. The graph representation of "assault" is shown in figure 6.11. Figures 6.12 (a), (b) and (c) show the analysis results of the single-thread events evaluated for Obj0 and Obj1. The same color codes are used as before. Figures 6.12 (d) and (e) show the evaluation of the event relations in the event graph and the probability of the global event "assault", respectively. It may suffice, in some applications, to report a potential occurrence of the "assault" event based on trajectory analysis and to call for attention from human operators; a more detailed shape analysis may be required to enhance the discriminative power and robustness of the system.

6.4.3 Computation Time

As described in section 5.2.3, the complexity of our single-thread event recognition algorithm is O(NT), where N is the number of sub-events and T is the total number of frames. The complexity of the search operation in multi-thread event detection is O(k^{R+1}) (see section 6.3), where k is the average number of starting points of an event instance and R is the number of temporal relations in a series. The computation time needed to process a video, however, also depends on other free parameters, such as the number of moving objects and the number of scene contexts (e.g., checkpoint zones, phones, pedestrian paths). The computation times of five representative sequences are summarized together with these parameters in Table 6.2, where SE, CE, MT and Ctx are short for the numbers of simple events, complex events, multi-thread events and contexts, respectively.

Table 6.2: Computation time of the video sequences.

  Sequence          No. of Obj's   Frames   SE/CE/MT/Ctx   Time (sec)   fps
  CheckPntA         2              109      38/3/0/1       2.5          43.6
  CheckPntB         3              292      38/3/0/1       18           16.22
  Assault           2              240      68/8/1/0       22.5         10.67
  Obj Transfer      3              640      83/11/3/1      453          0.71
  StealByBlocking   4              460      104/15/2/3     994          0.46

We note that the computation time does not include the motion detection and tracking processes. We have processed these sequences using a PII-333MHz machine with 128 MB of RAM. To convert the computation time to today's processing power (e.g., a P4-2GHz with 256 MB of RAM), we can approximately divide it by 8. The number of start times maintained for each sub-event of a single-thread event is set to 2 (see the end of Section 5.2.3).

From Table 6.2, the average frame rate is approximately 14.33 fps. "CheckPntA" and "CheckPntB" (car traffic scenes) are relatively fast to process because a large number of events related to human actions do not apply. "Assault" is also relatively fast to process, even though the number of complex events has increased to eight. Seven of the eight complex events are in fact defined with regard to other moving objects (reference objects), which are unbound parameters. Events with an unbound parameter can significantly increase the computation time: for example, if there are three objects in the video, there are six possible combinations of (actor, reference) pairs for each complex event to be analyzed. We notice the effect of these free parameters in the "Object Transfer" and "Steal By Blocking" sequences, where the frame rates drop to 0.71 and 0.46, respectively. In cases where the number of moving objects is high (a crowd of people), some pruning of the (actor, reference) pairs may be necessary. For example, spatial constraints such as a distance threshold between the two objects may be used to remove object pairs that are too far apart to be considered for, say, the "converse" event.

6.4.4 Application: Automatic Video Annotation

The output of our event detection system can be further used in various ways. In video annotation, a text description of the video contents is provided along with the video data itself and can be used to search for video clips in a video database. Segmenting videos manually is, however, a time-consuming task. In this section, we show a video annotation application in which the results of our single-thread event analysis are used to tag moving objects and describe their trajectories and actions, and the multi-thread event analysis results are used to tag the video segments that contain the corresponding global activities.

In order to associate objects and video segments with the appropriate descriptions, we use XML (Extensible Markup Language), which provides a flexible and standard way of describing data and enables users to parse, search and compare it. Similar to HTML (Hypertext Markup Language), XML contains markup symbols to describe what the data is. For example, the word "actor_list" placed within markup tags could indicate that the data that follows is the list of actors. XML is extensible because the markup symbols are unlimited and self-defining. At the Institute for Robotics and Intelligent Systems, University of Southern California, a set of XML tags has been developed to describe information such as "bounding_box", "moving_direction", "time_range", "event_type" and "frame_range" (of events). Such information can be parsed to generate videos with the event information overlaid on the images. Figures 6.13, 6.14, 6.15 and 6.16 show the processed videos with the detected events overlaid. The most significant single-thread events are displayed for each object during the video segments in which they are performed. For example, in figure 6.13, Obj5 performs "approach to block view" at frame 315. Later, at frame 416, the most significant event becomes "blocking view". At frame 547, we also notice the detection of a ghost object (Obj7), which indicates that a small object has been removed. The global event "steal by blocking" is also displayed during the video segment in which the multi-thread event occurs. The information about the object identities, the trajectories and the events that are displayed in figures 6.13, 6.14, 6.15 and 6.16 can all be obtained from the XML tags (e.g., "bounding_box", "event_type", "frame_range").
"approach the luggage" ■ —- "stand close to the luggage" ; ?"•, ; — "move along with the luggage" ; 3=: {]" n Probability — "take awayObj2" , 100 200 300 400 500 Frame Number 100 200 300 400 500 600 Frame Number (b) Analysis of the action thread: Obj3 “take away” Obj2. Sub-Events of "Obj3 leave Objl (human)" Obj3: "leave Objl" (human) 'locate close to Objl" 'move away from Objl" 0 too 200 300 400 600 500 "leave Obi 1 " . .Q 0.6 Frame Number (c) Analysis of the action thread: Obj 3 “leave” Obj 1. 100 200 300 400 500 600 Frame Number Sub-Events of "Obj2 taken away from Objl" Obj2: "taken away from" Objl "locate close to Objl" "move away from Objl" ... I t 100 200 300 400 500 600 — "taken away from ObJ2" 0 300 100 200 400 500 600 Frame Number Frame Number (d) Analysis of the action thread: Obj2 “taken away from” Obj 1. Figure 6 .6 : Analysis results o f the action threads in the sequence “object transfer’ 122 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Evaluation of event relations in "object transfer" J J 0.6 < 0 -Q 0.4 O ..... "O bjl bringln Obj2", start before, "Obj3 tak e Obj2" - - “O bj3 tak e O bj2“, overlap, - "Obj2 leav e O b jl" - — "Obj3 tak e Obj2", overlap, "Obj3 leav e Obj 1" ----------------------- Fram e N um ber (a) Evaluation of event relations in event graph 6.5 with the assigned actors. Two m o st likely o c c u rre n c e s of "object tran sfer” n o,4 200 300 400 Fram e N um ber (b) The most significant instance of “object transfer”. Figure 6.7: Multi-thread event analysis results o f the sequence “object transfer’ /6bj1 "bring irA V Obj2 ) {start before} during} {overlap} {overlap} dbj3 "leave^ f Obi2 \ V Objl J ("taken away") yfrom Qbj1 J Figure 6 .8 : A graphical description o f “Stealing by PhoneBooth”. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. £ * 0 .1 f i Sub-Events of "Objl bring in Ob]2" — "move along with Ob]2” - - "stand close to static Obj2" O b jl: "bring in" Obj2 100 200 300 400 500 600 700 800 Frame Number — "bring in" 800 900 500 600 700 100 200 300 400 Frame Number (a) Analysis of the action thread: Objl “bring in” Obj2. Sub-Events of "Objl use phone" • — "approach zone" — "locate in zone" Q . 0 2 300 400 500 600 700 800 100 200 13” 2 ° Q_ 0.2 • Objl: "use phone" - "use phone I f 100 200 300 400 500 600 700 800 900 Frame Number Frame Number (b) Analysis of the action thread: Objl “use phone”. X ) 0 .6 • C O JD 0.4 Sub-Events of "Obj4 take away Obj2" Obj4: "take away" Obj2 "approach Obj2" "stand_close_to static Obj2" "move along with Obj2" /r m ' I 100 200 300 400 500 700 800 900 take away Obi2 Frame Number Frame Number (c) Analysis of the action thread: Obj4 “take away” Obj2. S u b -E v en ts of "Ob]4 leave O bjl (hum an)" Obj4: "leave" O bjl (hum an) ■ "locate clo se to O b jl" ■ "move aw ay from O b jl" -Q 0.4 O £ 0.2 : Av- t . JLu 100 200 300 400 500 600 700 800 Frame Number — "leave O bjl .Q 0.6 100 200 300 400 500 600 Frame Number (d) Analysis of the action thread: Obj4 “leave” Obj 1. Sub-Events of "Obj2 taken away from Objl (human)" Obj2: "taken away" from Objl (human) ^0.1 !q 0.6 < 0 ■ 9 ° - ‘ . - ^ o . s — "taken away from Objl" A aft — "locate close to Objl" ‘ J C O : f — "move away from Obj 1 ” I ft "§ 0 '4 : j .. . iliK £ 0 2 .................................................... 
100 200 300 400 500 700 800 900 100 200 300 400 500 600 800 900 Frame Number Frame Number (e) Analysis of the action thread: Obj2 is “taken away from” Obj 1. Figure 6.9: Analysis results o f the action threads in the sequence “stealing by phone- booth 124 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Evaluation of event relations of "steal by phoneBooth" Evaluation of event relations of “ steal by phoneBooth" JQ 0.6 CO € o, Q_ 0 2 ■ ■ "O b jl b rin g jn Obj2", start_before, "O b jl use_phone' — "O bj4take_aw ay Obj2", during, "O b jl use_phone" .o O f CO -9 o.‘ "Obj4 take_away O bj2H , overlap, "Obj2 taken_away from O b jl" - "Obj4 take_away Obj2", overlap, nObj4 leave O b jl" Frame Number Frame Number (a) Evaluation of event relations in event graph 6 . 8 with the assigned actors. Recognition of the global event "stealing by phoneBooth" IT 0 - 5 o. CO 2 ° 0 _ 0.2 - Note: "obj_transfer“ is also recognized with the sam e probabilities Frame Number (b) The most significant instance of “stealing by phoneBooth”. Figure 6.10: Multi-thread event analysis results o f the sequence “stealing by phone Booth”. The global event “object transfer” is also recognized with the same probabili ties as the “stealing by phoneBooth ”. f 0bj2 "clash" 'N V Objl ; {meet} /6bi1 "escaped V Obj2 ) {overlap} ' r /6bj3 "follow^ V Obil ) Figure 6.11: A graphical description of “assault”. 125 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Sub-Events of "Objl clash with ObjO" O b jl: "clash with" ObjO 3 06 C O .Q 0.4 ‘ "approach ObjO"; ;j ' — "hit ObjO" .; i ; is ili • s s “ Probability — "clash with ObjO" 1 , Frame Number Frame Number (a) Analysis of the action thread: Obj 1 “clash” with ObjO. Sub-Events of "ObjO escape from Objl" ObjO: "escape" from Objl £ 0.4 g o _ — "approach Objl" — "locate close to" — "speed away" 'sV u n ' 2 i Q_ escape Frame Number Frame Number (b) Analysis of the action thread: ObjO “escape” from Obj 1. Sub-Event of "Objl chase after ObjO" Objl: "chase" after ObjO -"follow w/ high speed" .Q 0.4 50 100 150 Frame Number 200 250 — chase O 0.6 Frame Number (c) Analysis of the action thread: Obj 1 “chase” after ObjO. Evaluation of event relations in "assault" > » 0.8 3 0.6 C D .Q 0.4 Q. 0.2 - —■ "Objl clash with O bjO", meet, "ObjO escape Objl" - — "ObjO escape Objl" , overlap, "Objl chase after O bjO " Frame Number (d) Evaluation of event relations in event graph 6.11 with the assigned actors. The most likely occurrence of "assault" Frame Number (e) The most significant instance of “assault”. Figure 6.12: (a), (b) and (c) show the analysis results o f the action threads in the sequence “assault”, (d) and (e) shows the evaluation o f the event thread relations and the probabilities o f the detected global event “assault”. 126 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 454 frame 547 Figure 6.13: The “Stealing By Blocking” annotated sequence. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. frame 90 frame 157 frame 552 “ STEAL BY PHONEBOOTH, frame 102 •2TE;-*l . BY PHONEBOOTH frame 286 V Sit -i BY P:v >iiLL'.;UlM B Y r fiO M E ioO O 'I frame 845 f:a l v ■•ionfkoo,!. (ST E A L BY'PHONEBOOTH frame 861 frame 864 Figure 6.14: The “Stealing By Phone” annotated sequence. 128 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
[Figure 6.15: The "Object Transfer At Bench" annotated sequence.]

[Figure 6.16: The "Assault" annotated sequence.]

Chapter 7
Performance Characterization

We have tested our system successfully on real video streams containing a variety of events in addition to the ones shown in sections 5.3 and 6.4. These events are mainly for surveillance purposes, such as "a person drops off a package", "a person (a car) follows another person (another car)" and "a person makes contact with another person". Evaluating the performance of an event detection system, however, requires a much more extensive collection of data. In an ideal case, the data set should contain many possible variations of any particular event, captured at various times and under different environments. Obtaining such data is possible if the action to be recognized is confined to a highly structured scene and has little variation in execution style. However, significant scenarios such as "stealing" occur rarely in natural observation. They sometimes have execution styles that are difficult to anticipate and allow for more variation in the temporal constraints. For example, the path and the duration that a person takes to steal a suitcase can vary depending on the location of the suitcase in the scene. Simulated data of such "variations" may help in these cases to provide some insight into the performance of the algorithms.

There are several variations that may affect the performance of our recognition algorithm:

• the Bayesian networks can be affected by noise and errors in motion detection and tracking;

• the probabilistic event automaton can be sensitive to the variable durations of sub-events, in addition to tracking noise;

• the performance of event graphs can be affected further by the variable timing of event relations due to different execution styles.

In the following, we characterize the performance of our recognition methods using spatio-temporal perturbations of tracking data computed from real video streams that simulate these noise sources and variations.

7.1 Levels of Noise

Suppose we are to recognize and discriminate between the complex events "pass by" and "contact1". Figures 7.1(a) and (b) show two original video sequences of these events. "Pass by" is modeled by a stochastic automaton of three states: "approaching the reference person", "being close to the reference person" and "walking straight away from the reference person". Similarly, "contact1" is modeled by a stochastic automaton of three states: "approaching the reference person", "being close to the reference person" and "turning around and leaving". To examine the robustness of our system with regard to small trajectory variations, we generate noisy sequences from sequences A and B by corrupting the tracking data with Gaussian noise. First, we compute the mean (mu = 13.01 cm/frame) and variance (sigma = 6.68 cm/frame) of the speed of the walking people detected in both sequences. The tracking data of the original sequence is then corrupted with Gaussian noise G(mu = 0, omega * sigma) to simulate a noisy sequence, where omega is the level of noise; the larger omega is, the noisier the corrupted tracking data becomes. In the following experiment, we test our algorithm on omega = 1, 3, 5 and 7.

Forty sequences of noisy "contact1" and "pass by" are evaluated for each level of noise. If the probability of the event model corresponding to the sequence is lower than the threshold value tau, we say that it produces a negative result. If the probability of the competing event model is higher than tau, we say that it produces a false alarm. In an ideal case, the system would produce zero negative results and zero false alarms.

[Figure 7.1: Two test patterns of walking in a parking lot: (a) "a person passes by the reference person", (b) "a person makes contact with the reference person". We examine how well our system discriminates these competing events when the trajectories are corrupted with various levels of noise.]

We characterize the results in terms of a missing rate and a false alarm rate. The missing rate is the ratio of negative results to all the positive sequences, and the false alarm rate is the ratio of false alarms to all the positive sequences. A trade-off between the missing rate and the false alarm rate can be exercised by varying the threshold value, and an optimal threshold can be selected to give the desired trade-off for a given application based on some criteria. To help make such a decision, it is common to plot a trade-off curve by varying the threshold value; this curve is commonly called a Receiver Operating Characteristic (ROC) curve. Figure 7.2 shows the ROC curves for each noise level for the detection of "contact1" and "pass by".

[Figure 7.2: ROC curves for a data set with various Gaussian noise levels (omega * sigma for omega = 1, 3, 5, 7, where sigma is the variance of the human ground speed, 6.68 cm/frame).]

It can be seen that when the variance of the random Gaussian noise of the tracking data is below 33.4 cm/frame (which is quite high compared to the average walking speed of 13.01 cm/frame), it is still possible to maintain both the missing rate and the false alarm rate under 10%. This is because the Gaussian noise only locally corrupts the ground location of an object frame by frame; the overall shape of the trajectory is, therefore, still preserved, and such local noise is averaged out during the computation of the mobile object properties. Our complex event recognition algorithm also avoids the erroneous short event durations that may be caused by noisy simple events. Errors occur when the noise repeatedly causes a large displacement while the person is still stopped at the reference person, causing the recognition of "stops at" to be weak for a long period of time.

7.2 Variable Event Durations

In this section, we examine how the duration of the sub-events affects the recognition of complex events. We model two competing complex events. The first event (EV1), "approach a reference person and then stop", is composed of two sub-events: "approach" and "stop". The second event (EV2), "approach a reference person and then leave", is also composed of two sub-events: "approach" and "leave". We simulated, for each complex event, a test data set of twenty sequences at each of five different noise levels (omega = 1, 3, 5, 7 and 10) using the method described in section 7.1. Half of the test data set (i.e., ten sequences) has a shorter duration of the second sub-event. For example, half of the test sequences for "approach then leave" have a long second event "leave" of 97 frames, while the event "leave" of the other half of the test sequences is 28 frames long.

[Figure 7.3: Results for the test data set of "approach then leave" sequences: (a) sequences with a 28-frame "leave", (b) sequences with a 97-frame "leave".]

[Figure 7.4: Results for the test sequences "approach then stop at the reference person": (a) sequences with a 26-frame second sub-event, (b) sequences with a 106-frame second sub-event.]

For each noise level, we analyze the test data set and compute the mean and standard deviation of the probabilities of both EV1 and EV2. Figure 7.3(a) shows the evaluation of EV1 and EV2 on "approach then leave" sequences with a shorter duration (28 frames) of the second sub-event ("leave the reference person"). Figure 7.3(b) shows the results on "approach then leave" sequences with 97 frames of the second sub-event. The average probabilities of the positive event model EV2 are shown as solid lines, with the standard deviations displayed as error bars; the dotted lines are the probabilities of the negative event model EV1. Similar to figure 7.3, figure 7.4 shows the analysis results for the test sequences "approach then stop", but with the roles of EV1 and EV2 exchanged.

It can be seen that the system discriminates between the competing event models much better when the sub-events are observed for a longer period of time; that is, a larger gap between the solid and dotted lines is observed at every noise level in figures 7.3(b) and 7.4(b). Especially in the case of the "approach then leave" sequences, the system discriminates poorly between EV1 and EV2 when the event "leave" is not observed long enough. This, in fact, corresponds with human performance, as observers become confused when an event duration is only one or two seconds long.

7.3 Varying Execution Styles

It is conceivable that the recognition of a multi-agent event may vary as a result of a change in the patterns of execution by the actors. Such variation, however, should not be significant in a robust system as long as the temporal relations among the sub-events do not change.
One of the greatest difficulties in the performance analysis of a multi-agent event detection system is finding an extensive set of real test data that includes all possible action execution styles of individuals. As in the case of single-thread events, the analysis of ROC curves on synthetic data can provide insight into the characteristics of the system.

We define two competing multi-agent events for the ROC analysis. The first event is "stealing by blocking", the event graph for which is described in section 6.1. The second event is "non-blocking object transfer", described as a person approaching to converse with another person whose luggage is left on the ground, after which another person comes and takes the luggage away. The difference between the two events is whether or not the view of the belongings is obstructed while the ownership is transferred.

We simulated a test data set composed of twenty-one "stealing by blocking" sequences and twenty-one "non-blocking object transfer" sequences. The simulated sequences are generated as follows. First, the trajectory of each object in the original sequence is extracted and smoothed to obtain a principal curve. Then, four points on the principal curve are selected such that the curve is segmented into five segments of equal length (k). The four points are then randomly perturbed with Gaussian noise G(mu = 0, sigma = 2k) and fitted with a smooth B-spline curve. Finally, tracking data (points along the B-spline curve) is assigned according to the estimated walking speed (13.01 cm/frame) and variance (6.68 cm/frame) of a human. The variable lengths of the perturbed trajectories introduce variations in the timing among the event threads; therefore, we manually classify and correctly label each of the perturbed sequences.

Figure 7.5 shows the ROC curve for the test data set. Even though the two events are very similar, by choosing an appropriate threshold we can achieve a detection rate as high as 81% while keeping the false alarm rate at 16%. The main reason for misdetection is that the critical event "blocking", which helps discriminate between "stealing by blocking" and "non-blocking object transfer", cannot be recognized when the persons performing the "blocking" action repeatedly move abruptly away during the blocking. It also fails when the blocking person walks rapidly past the reference objects and comes back to block, causing the "approach to block" events (i.e., "approach1" and "approach2") to be recognized weakly.

[Figure 7.5: ROC curves for a data set with varying execution styles.]

7.3.1 Comments

We have demonstrated that our system is robust, but its performance is dependent on tracking accuracy. We notice a decrease in recognition performance on noisy sequences and on variable execution styles of activities. The recognition of the interactions of actors, in fact, relies on the accuracy of the detection and segmentation of complex events. Currently, several assumptions are made about the probability distributions in complex event modeling. Some of these assumptions (e.g., a uniform distribution of event durations) may be relaxed to improve the accuracy of the analysis.
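For reference, the trajectory perturbation used to simulate varying execution styles (Section 7.3) can be sketched as follows. This is a simplified reading of the procedure described above, not the code used for the experiments: the function name, the resampling strategy and the 500-point spline sampling are assumptions, and the numbers (5 segments, sigma = 2 x segment length, 13.01 and 6.68 cm/frame) are taken from the text.

```python
import numpy as np
from scipy.interpolate import splev, splprep

def perturb_trajectory(points, n_segments=5, sigma_scale=2.0,
                       speed=13.01, speed_std=6.68, rng=None):
    """Simulate a new execution style from an observed ground-plane trajectory.

    points : (N, 2) array of smoothed positions (the principal curve), in cm.
    Interior control points at equal arc-length spacing are perturbed with
    Gaussian noise, refit with a B-spline, and resampled at a noisy walking
    speed (cm per frame).
    """
    rng = rng or np.random.default_rng()

    # 1. control points that cut the curve into n_segments equal-length pieces
    d = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(points, axis=0), axis=1))]
    seg_len = d[-1] / n_segments
    idx = np.clip([np.searchsorted(d, k * seg_len) for k in range(n_segments + 1)],
                  0, len(points) - 1)
    ctrl = points[idx].astype(float)

    # 2. perturb the interior control points and refit a smooth B-spline
    ctrl[1:-1] += rng.normal(0.0, sigma_scale * seg_len, size=ctrl[1:-1].shape)
    tck, _ = splprep(ctrl.T, s=0, k=3)

    # 3. resample along the new curve at a noisy human walking speed
    samples = np.array(splev(np.linspace(0.0, 1.0, 500), tck)).T
    arc = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(samples, axis=0), axis=1))]
    track, pos = [samples[0]], 0.0
    while pos < arc[-1]:
        pos += max(rng.normal(speed, speed_std), 0.0)
        track.append(samples[np.searchsorted(arc, min(pos, arc[-1]))])
    return np.array(track)
```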
Chapter 8

Conclusion

In this dissertation, we presented an approach for both single- and multiple-actor activity recognition from video. We have proposed a transparent, generic (extendible) hierarchical activity representation that provides flexibility and modularity in the event modeling scheme and simplifies the task of parameter learning. Bayesian networks and probabilistic finite state automata are used to describe single-actor activities. The interaction of multiple actors is modeled by an event graph, which is useful for recognizing plans and coordinated activities. We have developed a complete system and shown, on real-world examples, that possible events can be inferred and segmented from video data automatically by probabilistic analysis of the shape, motion and trajectory features of moving objects.

By using Bayesian networks to assess local and simple movement characteristics, which are then integrated over the long term using a probabilistic finite state automaton, we achieved an accuracy of 96.7% in discriminating predefined competing single-thread actions of 30 objects (including both humans and vehicles). We have observed that our system is robust against peak noise, thanks to the explicit modeling of event duration distributions in the event automaton. Nevertheless, the performance of our system depends on tracking accuracy. Based on the analysis of 160 simulated object trajectories corrupted with various levels of noise, some degradation in discriminating power was observed.

For the recognition of multi-thread events, we have developed an algorithm that searches a set of the most likely instances of single event threads for the ones that best match the temporal constraints defined in the event graph. We achieved a detection rate as high as 81% on 42 simulated sequences.

To conclude this dissertation, we provide some discussion of each component of our system and of future work in the following.

• We have demonstrated, at many processing levels, the use of context information for the event recognition task. Knowledge of the ground plane has been used to filter moving regions and improve the tracking results. However, this approach requires robust detection of the feet of the objects, which may not be available in some applications. Tracking a crowd of people, necessary for many surveillance applications, is also still very difficult due to self-occlusion and the occlusion of body parts by others. A robust filtering mechanism that integrates both low-level (e.g., color distribution, texture) and high-level (e.g., consistency of actions) information may be useful in such cases.

• Even though our activity representation is currently based on 2-D shape and trajectory features, it provides layers of abstraction which allow the integration of other complex features (e.g., human pose configurations, optical flow statistics) to describe more sophisticated events. An additional mechanism to derive the probability of an abstract event entity from these complex features may be required. Also, further development of the temporal and logical constraints may be required to represent a more sophisticated scenario model.
For example, numerical constraints such as "A occurs at least an hour before B" or additional logical constraints such as "no other events should happen between A and B" may be allowed to enhance the characterization of a scenario.

• In Chapter 7, we demonstrated a preliminary performance evaluation scheme that validates our system on some aspects of a real application. Analysis of ROC curves on synthetic data is useful when a sufficiently large set of real test data cannot be obtained. With an increasing number of other event recognition approaches, a more systematic performance evaluation procedure is required to make useful comparisons between algorithms.

• One concern about the scalability of our system is that the complexity of our inference method depends on the number of moving objects and the complexity of the scene (e.g., the number of reference scene objects). Currently, we analyze the scenarios of all possible combinations of moving objects and reference objects. The complexity can be decreased by using task context to process only relevant scenarios and by using heuristics (e.g., a threshold on the proximity of relevant objects) to choose only the objects that may be involved in a particular activity.

8.1 Summary of Contributions

The major contributions of our research can be summarized as follows:

• Detection and tracking of objects using scene geometry.

• An event ontology and hierarchical event representation that bridge the gap between low-level visual information and symbolic event descriptions.

• Bayesian analysis of simple trajectory- and shape-based events.

• Detection and segmentation of a single-thread event from a continuous video stream using a stochastic finite automaton.

• Recognition of the interaction of multiple actors using probabilistic event graphs.

8.2 Future Work

There are several opportunities to extend our work into new areas. In order for vision-based event recognition to enjoy commercial success and widespread use, we must provide a flexible mechanism for users to define new events of their own interest. Currently, for a particular event understanding application, users must provide the system with models for the various events to be understood. At the same time, the system needs to describe the event analysis results to human users. This requires the users to understand the computational representation of events and the internal operation of the system (e.g., Bayesian networks, stochastic finite state automata, etc.). These processes can be sped up by the development of a declarative Event Representation Language (ERL) that allows a developer to define a new scenario in a more efficient and natural way, and that also makes communication with the users easier. Based on our event representation formalism, an ERL can be developed to allow users to define new daily-life events and to describe the interaction of multiple objects using logical, spatial and temporal relationships. Such high-level descriptions may then be compiled into procedural ones (or programs) for recognition; a purely hypothetical sketch of what such a definition and compilation step could look like is given below.
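Since the ERL itself is proposed only as future work and no syntax is defined in this thesis, the following sketch is an invented illustration: every name, relation and structure is hypothetical, and the scenario is expressed as plain Python data rather than an actual language.

```python
# Hypothetical illustration of a declarative scenario: sub-events plus
# logical/temporal relations, which a compiler could translate into the
# recognition structures (event graphs and constraints) used by the system.
unattended_luggage_transfer = {
    "actors": ["owner", "taker", "luggage"],
    "sub_events": {
        "e1": ("owner", "drops", "luggage"),
        "e2": ("taker", "approaches", "luggage"),
        "e3": ("taker", "takes", "luggage"),
    },
    # Temporal relations between sub-events ("before" in Allen's sense),
    # plus an illustrative numerical bound in seconds.
    "relations": [
        ("e1", "before", "e2"),
        ("e2", "before", "e3"),
        ("e1", "within_seconds_of", "e3", 3600),
    ],
}

def compile_scenario(scenario):
    """Toy 'compiler': turn the declarative relations into pairwise checks
    over detected sub-event intervals (objects assumed to expose .start and
    .end timestamps)."""
    constraints = []
    for rel in scenario["relations"]:
        if rel[1] == "before":
            constraints.append((rel[0], rel[2], lambda a, b: a.end <= b.start))
        elif rel[1] == "within_seconds_of":
            limit = rel[3]
            constraints.append((rel[0], rel[2],
                                lambda a, b, m=limit: (b.end - a.start) <= m))
    return constraints
```

In a real ERL, the compiled constraints would be matched against the intervals of detected sub-events by the multi-thread event recognizer, in the spirit of the event graphs used in this thesis.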
In this thesis, we consider a specific application of monitoring a scene with a single video camera, in which ambiguity may arise from occlusion, resulting in noisy tracking. The use of multiple cameras may solve these view-dependent problems. Also, human activities may occur over an extended area, requiring the use of multiple sensors. In this case, we are required to combine information (possibly of a different nature) from multiple views or sites. One of the biggest challenges that must be addressed in this regard is sensor fusion.

Since vision-based event recognition is a relatively new research area, there is a need for a more extensive database of a variety of activities under natural observation on which researchers in the field can evaluate the performance of their algorithms. Collecting and constructing such a database is a difficult task, since there are several controversies regarding the collection, processing and dissemination of information collected about people, especially in the surveillance environment.