MULTIPLE VEHICLE SEGMENTATION AND TRACKING IN COMPLEX
ENVIRONMENTS
by
Xuefeng Song
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2007
Copyright 2007 Xuefeng Song
Acknowledgements
This thesis is the result of five years of hard work supported by many people. It
is a great pleasure to thank all the people who made this thesis possible.
No words can express my deepest gratitude to my PhD advisor, Prof. Ram
Nevatia. His incomparable knowledge, boundless enthusiasm, and everlasting dedication to
research guided me through many difficulties during all these years. Beyond
his scientific wisdom, his pleasant personality and excellent communication skills
impressed me most. He may not even know how much I have learned from him.
I would also like to thank other members of my PhD committee: Prof. Gerard
G. Medioni, Prof. Isaac Cohen and Prof. Zhonglin Lu. Your guidance and
contributions to the defense and this thesis are deeply appreciated.
I am indebted to many student colleagues at IRIS for providing a stimulating
and fun environment in which to learn and grow. Special thanks to Tao Zhao,
Fengjun Lv, Munwai Lee, and Tae Eun Choe.
Lastly, and most importantly, I have to thank my parents, my two sisters and
my wife. Without their unconditional love and support, I could never have made this
achievement. To them I dedicate this thesis.
Table of Contents
Acknowledgements ii
List of Figures v
List of Tables viii
Abstract ix
Chapter 1 Introduction and Motivation 1
1.1 Background 1
1.2 Our Approach 4
Chapter 2 Previous Work 8
2.1 Background modeling methods 8
2.2 Vehicle Detection 11
2.3 Detection-based Tracking Methods 12
2.4 Blob Tracking 14
2.5 Tracking multiple objects with occlusion 15
2.6 Moving shadow detection 17
Chapter 3 Robust Vehicle Blob Tracking in Low-occlusion Scenes 18
3.1 Task and Data Description 18
3.2 Vehicle Motion Blob Tracking 20
3.2.1 Scene Constraints 21
3.2.2 Background subtraction 22
3.2.3 Track vehicle blobs 23
3.2.4 Track merged vehicles with Meanshift 25
3.3 Experiments and Results 29
3.4 Evaluation and Discussion 30
3.5 Applications to tracking other objects 32
3.6 Conclusions 35
Chapter 4 Detection-based MCMC Vehicle Segmentation 37
4.1 Approach Overview 37
4.2 Bayesian Problem Formulation 41
4.2.1 Prior Probability 42
4.2.2 Multi-vehicle Joint Likelihood 43
4.2.3 Search for MAP 44
4.3 2-layer Detection for vehicle hypotheses 45
4.3.1 Rectangular Region Detection 46
4.3.2 2D Model Match 49
4.4 Markov Chain Monte Carlo 50
4.5 Experiments and Evaluation 53
Chapter 5 Vehicle Segmentation with Arbitrary Orientations 59
5.1 Bayesian Problem Formulation 60
5.1.1 Prior Probability 61
5.1.2 Multi-Vehicle Joint Likelihood 62
5.2 MCMC-based Posterior Probability Estimation 63
5.2.1 Vehicle hypotheses proposals 65
5.2.2 Orientation Sampling from Motion 67
5.2.3 Meanshift move 67
5.2.4 Greedy size estimation 69
5.3 Vehicle segmentation in presence of shadows 70
5.4 Experiments 71
5.5 Related Issues 75
5.5.1 Divide and Conquer 75
5.5.2 Scene Occlusion 76
5.6 Evaluation on vehicle blob merge 77
Chapter 6 Vehicle Tracking 83
6.1 Vehicle Association by using a Kalman Filter 84
6.1.1 Data Association 84
6.2 Tracking through Multiple Hypotheses with the Viterbi Algorithm 86
6.3 Experiments and Evaluation 90
Chapter 7 Summary and Conclusion 96
7.1 Summary of Contributions 97
7.2 Future Directions 98
References 100
Appendix A : Evaluation Criteria Definitions for Vehicle Tracking 105
List of Figures
Figure-1 Background subtraction.................................................................................3
Figure-2 Approach overview .......................................................................................6
Figure-3 Vehicle detection using wavelet features (a) and wire frame model (b).....11
Figure-4 Motion Blob Tracking.................................................................................13
Figure-5 Methods of vehicle occlusion analysis........................................................16
Figure-6 Moving shadow detection examples ...........................................................17
Figure-7 Sample frames of six cases..........................................................................19
Figure-8 Method Overview........................................................................................21
Figure-9 Common difficulties of vehicle blob tracking.............................................23
Figure-10 Track-blob association matrix...................................................................23
Figure-11 Sample result frames on training/testing sequences..................................28
Figure-12 Typical tracking error samples..................................................................30
Figure-13 ROC Curve of CLEAR-VACE Surveillance Evaluation on Vehicle
Tracking .............................................................................................................32
Figure-14 Sample results of CLEAR-VACE person tracking sequences..................34
Figure-15 Application of object tracking in other scenes. .........................................35
Figure-16 3D vehicle modal and its 2D appearance ..................................................39
Figure-17 Sketch of our method ................................................................................40
Figure-18 The match between foreground and synthesized vehicle mask. ...............43
Figure-19 Schematic depiction of 2-layer detection ..................................................45
Figure-20 Integral image and integral histogram image............................................46
Figure-21 Coarse and fine vehicle model match........................................................47
Figure-22 Procedures of vehicle segmentation..........................................................49
Figure-23 Overview of MCMC iterations .................................................................53
Figure-24 Detection results on side-view vehicle segmentation................................57
Figure-25 Tracking results on turning vehicles. ........................................................58
Figure-26 Match between vehicle box model and foreground ..................................60
Figure-27 Vehicle proposing based on foreground and motion.................................66
Figure-28 multiple vehicles with shadows.................................................................68
Figure-29 Greedy vehicle size search ........................................................................69
Figure-30 Segmentation results of multiple vehicles with shadows..........................71
Figure-31 Vehicle Segmentation Results...................................................................72
Figure-32 Examples of segmentation errors ..............................................................73
Figure-33 Examples of segmentation with big cars...................................................74
Figure-34 Divide and conquer ...................................................................................75
Figure-35 Scene occlusion and scene occluding mask ..............................................76
Figure-36 Scene occlusion error compensation .........................................................77
Figure-37 Segmentation on large size vehicles..........................................................78
Figure-38 Vehicle box match comparison .................................................................80
Figure-39 Results of vehicle box match on foreground blobs ...................................81
Figure-40 Error results of single vehicle vs. multiple vehicles classification............82
Figure-41 Vehicle Association at consecutive frames...............................................84
Figure-42 Association between tracking vehicles and detection at current frame....85
Figure-43 Graphical Model of Multiple Vehicle Tracking........................................87
Figure-44 Tracking with the Viterbi algorithm,.........................................................88
Figure-45 Result samples on traffic01.avi .................................................................94
Figure-46 Result frames on traffic02.avi ...................................................................95
List of Tables
Table-1 Evaluation scores on 50 test video sequences ..............................................31
Table-2 Evaluation scores of CLEAR-VACE person tracking on 50 test videos......33
Table-3 Evaluation on vehicle segmentation .............................................................56
Abstract
Our goal is to detect and track multiple moving vehicles observed from static
surveillance cameras, which are usually placed on poles or buildings. Background
subtraction methods are widely used under these conditions, but to extract
vehicle information from the motion foreground, common difficulties, such as foreground
noise, shadows, scene occlusion, blob merges and blob splits, have to be solved.
By using vehicle shape models, in addition to camera calibration and ground plane
knowledge, the proposed methods can detect, track and classify moving vehicles in
the presence of all these difficulties.
Two methods are proposed in this thesis to deal with these problems. The
first method uses a dynamic background model to extract the motion foreground. The
camera and vehicle models are used to reduce foreground noise. Spatial and
temporal constraints are applied to handle blob splits, and object color appearance is
used to track each vehicle when multiple vehicles are merged together. Evaluation on
a large dataset by a third party shows that this method works robustly under many
conditions.
The second method focuses on challenging tracking situations where vehicle
inter-occlusion is prevalent and persistent. In this case, each foreground blob can
contain multiple vehicles, and a simple one-to-one correspondence between the
foreground blobs and vehicles no longer applies. Segmentation of the merged
vehicles is a difficult problem. The proposed method works in the framework of a
Markov chain Monte Carlo (MCMC) approach. By sampling in the multi-vehicle
configuration space, the method searches for the set of vehicle parameters that best
explains the foreground. Several bottom-up detections are combined with top-down
analysis to guide the sampling in an effective way.
The goal of this work is to infer the trajectory of each individual vehicle.
Because of the approximation of vehicle models and the limitation of the likelihood
function, the multi-vehicle configuration with the highest probability may not always
be the correct segmentation. By exploiting the spatial and temporal constraints across
the image sequence, a tracking method is proposed to reduce the errors of single-frame
vehicle detection.
Chapter 1 Introduction and Motivation
1.1 Background
Visual surveillance is an important component of intelligent security systems.
Automatic traffic surveillance systems are highly desired, since manual observation
is a tedious job; some researchers have reported that the attention of a human
observer decreases greatly after the first few hours. There are existing sensor-
based technologies that measure information about passing vehicles, such as vehicle
count, weight and speed. Vision-based systems are attracting significant attention as an
attractive alternative, not only because they are easy to install and operate, but also
because they have the potential to provide a much richer description of traffic.
For example, a vision system can detect vehicle classes, vehicle trajectories, and the
distance between vehicles. Furthermore, a vision system can infer more complicated
vehicle behaviors, such as illegal U-turns, running red lights, unusual motion, and car
accidents.
When the observation is from a stationary video camera, comparison with a
learned background model can be used to extract the moving pixels (motion
foreground). Connected foreground pixel regions are usually called foreground
blobs. Ideally, each blob corresponds to one moving object. However, illumination
changes and waving trees or flags all create noise blobs in practice. Multiple vehicles
may merge into one blob in the presence of inter-occlusion, and one vehicle may split
into multiple blobs because of occlusion by scene objects. Figure-1 shows a typical
example. To detect, track and classify vehicles from motion foreground blobs, even
in the presence of these difficulties, is the objective of this thesis. More specifically,
the difficulties we are to overcome are listed below:
● Background modeling:
In the case of a stationary camera, we can model the background as a static
appearance model. Then a background subtraction method can be applied to extract
the moving foreground efficiently. However, this pixel-based method is not capable
of providing a perfect foreground, because shadows, reflections, illumination changes,
and non-stationary scene objects (trees, flags, etc.) can all create false foreground
blobs. Part of the object could be missing because of scene occlusion or low contrast
with the background.
● Vehicle modeling:
Vehicles are rigid objects, but their size variation is quite large: length ranges
from 175 inches for a regular sedan to more than 400 inches for a trailer
truck. Vehicles also come in several different shapes and many colors.
The appearance of a vehicle in a 2D projection may change greatly when the camera tilt
angle or the vehicle orientation changes.
● Shadow modeling:
Shadows are quite common in an outdoor environment. After background
subtraction, they create foreground just as moving objects do. How to model both the
object and its shadow in a tracking framework is a challenge.
(a) Sample input frame (b) Motion foreground
Figure-1 Background subtraction
● Vehicle Inter-Occlusion
Multiple vehicles can occlude each other when the camera is not looking top-
down. Then one blob after background subtraction can contain more than one
vehicle. How to segment the multiple vehicles from one foreground blob is a major
goal in this thesis.
● Scene Occlusion
Moving vehicles can be partially or fully occluded by scene objects such as trees,
poles, and walls. The appearance of an object is not consistent under scene
occlusion, and the object region can split into several parts. We also investigate this
problem in this thesis.
Figure-1 shows an example of a typical surveillance environment and its motion
foreground. Multiple vehicles appear in the scene. The foreground of one vehicle is
split into several parts due to the occlusion of a tree. A walking person is visible.
Some other noise in the foreground comes from waving trees. Although the
foreground image appears quite noisy, a human observer has no difficulty in locating
the position of every car.
1.2 Our Approach
We focus on videos captured by a single calibrated camera, at a height of a few
meters above the ground, which is a common setting in surveillance applications. In
this setting, occlusions among vehicles and from static objects such as trees may be
significant. We expect a medium-size car to be at least 20 pixels tall in the image.
Basically, our approach analyzes the moving foreground to extract the vehicle
information. We apply a dynamic background learning method to adapt the background
model to illumination changes. After computing the difference between the input
frame and the background model, we extract a binary moving foreground. This
moving foreground does not need to be perfect, because the vehicle segmentation
approach is robust to small errors in detection of foreground.
First, we will describe a robust vehicle blob tracking method. This method is
developed for a surveillance task on vehicle tracking. The videos are captured at
several locations under different conditions. Fast illumination change and camera
shaking happen often in these videos. By using camera calibration information and
a vehicle 3D model, our method estimates vehicle image size and removes most of the
noise foreground by using a size constraint. A Kalman filter is applied to predict
each vehicle's position in the new frame, and the association between vehicle tracks and
blob detections is based on region overlap. Multiple rules are defined to handle blob
splits, blob merges, track start and track end.
Then, we move to the more challenging case where vehicle inter-occlusion is
prevalent. To extract individual vehicles from a merged blob, we use the general
constraints from the camera model and the vehicle models. We will describe a
method that combines bottom-up vehicle detection and top-down analysis in the
framework of MCMC. In this method, a vehicle model match technique provides
proposals to MCMC process. Then the MCMC does the synthesis-analysis
procedure, and searches in the high-dimensional multi-vehicle space for the state
with maximal posterior probability. The likelihood function is based on the match
between the foreground and the synthesized multi-vehicle mask.
Furthermore, to better handle the vehicle orientation change, we propose
another vehicle segmentation method. In this method, the proposals are generated
randomly on the detected foreground. Motion information is introduced to guide the
search of vehicle orientation. Under the MCMC framework, local gradient descent
methods are designed carefully to speed up the vehicle position search.
Neither of the above two methods deals with foreground created by shadows.
Given that shadows cannot be ignored in outdoor scenarios and there is no perfect
shadow removal method at the pixel level, we incorporate a shadow model into our
vehicle segmentation method and segment merged vehicles and their shadows
simultaneously.
Figure-2 Approach overview
(flow diagram: the input image sequences pass through background-model-based foreground extraction; multi-vehicle segmentation then handles (1) vehicle inter-occlusion, (2) scene occlusion and (3) shadow, followed by tracking for vehicle trajectory estimation; the scene knowledge representation comprises models of the scene, camera, object 3D shape and shadow.)
Occlusion caused by scene objects can also be a significant problem for vehicle tracking
in some cases. The focus of this thesis is not on situations where scene occlusion
is severe, but we investigate this problem and use pre-learnt scene occlusion models
to compensate for the errors.
Figure-2 shows an overview of the approach. A dynamic background model is
applied to extract the motion foreground from image sequences. Several models are used
to explain the foreground: models of the vehicle, camera, scene occlusion and
shadow. The camera and scene occlusion models are pre-learnt, regular 3D vehicle
models are used, and the direction of the sun is given to estimate the shadow
appearance. Vehicle segmentation works on the motion foreground. The typical
difficulties to overcome in this thesis are vehicle inter-occlusion, scene occlusion and
object shadow. After segmentation is done on each frame, a tracking method is
proposed to associate the vehicle detections on single frames and generate tracks for
vehicle objects.
The rest of the thesis is organized as follows. Chapter 2 covers the related
work. Chapter 3 describes the robust blob tracking method that we developed for
vehicle tracking in low-occlusion scenes. In Chapter 4, we describe an MCMC-based
vehicle segmentation method. The vehicle hypotheses are provided by a two-layer
detection method. To deal with vehicle segmentation with arbitrary orientations, we
propose an improved MCMC-based vehicle segmentation method in Chapter 5. Based
on the segmentation result of each frame, we apply the Viterbi algorithm to associate
detections through image sequences. Chapter 6 describes this tracking method. The
conclusion is given in Chapter 7.
Chapter 2 Previous Work
The topics of motion detection, vehicle detection and vehicle tracking have
been addressed in a number of earlier research papers. This chapter provides a
survey that groups the major related works into several categories. The main ideas of
the works are summarized, and their advantages and disadvantages are discussed.
Some of these are also discussed in later chapters with more details when
appropriate.
2.1 Background modeling methods
Motion foreground extraction methods are a class of widely used techniques in
the context of videos captured from a stationary camera. By computing the
differences between the current frame and the background model at each pixel, the
foreground extraction methods can detect the moving foreground in a very efficient
way. The assumption that the color of the moving objects is different from the
background is generally valid. Usually a motion foreground extraction method has
three major parts.
(1) Training of the background model: a background model is trained at each
pixel location from some collected images.
(2) Motion detection: By computing its distance to the model in a color space
(RGB, YUV, etc.), every pixel of the new frame is classified into two
classes: background and motion foreground.
(3) Model update: the background model is updated with new frames, since the
background usually changes with illumination. The change can be gradual
or abrupt. Because most surveillance applications run for a long time,
model update is necessary.
There are several ways to design the background model. A background image
is the simplest one, where each pixel model is a color vector. A static background
image can be good enough for a video whose background does not change. Learning
the average value of the training images is a way to train a static background image
model. To tolerate some moving objects in the training process, Haritaoglu et al. [15]
used a median filter. To have a dynamic background image model, the background
model can be updated with the following equation:

B_{t+1} = (1 - λ)·B_t + λ·I_{t+1}    (1)

where B_t and B_{t+1} are the background models at time t and t+1, I_{t+1} is the image
frame at time t+1, and λ is the background update ratio, in the range [0, 1]. The
background model adapts to background changes slowly when λ is small and
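As a concrete illustration of equation (1), the following minimal Python sketch applies the running-average update together with a simple per-pixel distance test; the update ratio and threshold values are illustrative choices, not parameters reported in this thesis.

```python
import numpy as np

def update_background(background, frame, lam=0.05):
    # Running-average update of equation (1): B_{t+1} = (1 - lam) * B_t + lam * I_{t+1}
    return (1.0 - lam) * background + lam * frame

def foreground_mask(background, frame, threshold=30.0):
    # A pixel is labeled foreground when its color distance to the background model is large
    diff = np.linalg.norm(frame.astype(np.float64) - background, axis=2)
    return diff > threshold
```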
A Gaussian model is used for each pixel in Wren et al. [58] to cover the
variance of background pixels. Each pixel has a mean and variance in the YUV
space. The Mahalanobis distance is used to compute the color distance. Stauffer et al.
[53] extended the pixel model to a mixture of Gaussians to handle the cases where
the values at some pixels switch among several values. This case usually happens on
water ripples, swinging trees or flags. For the multi-Gaussian model, estimating the
number of Gaussian models is a problem. Elgammal et al. [12] used a non-
parametric technique to estimate the density of the distribution. Li et al. [32] study
the difference between changing background objects and foreground objects, and
further classify changing pixels into background or foreground.
To our knowledge, the background subtraction methods proposed so far are
all pixel-based. The environment is complicated, and based on the information of a
single pixel alone it is not possible to determine whether the pixel comes from a moving
object or from some other environmental change (illumination change, waving trees, etc.).
The detected foreground will therefore be noisy, and how to detect objects in the
presence of this noise is still a challenging research problem.
After background subtraction, usually a morphological operation is applied to
remove the isolated pixels and fill in small holes in the foreground. Then a connected
component operation organizes the foreground pixels into connected foreground
blobs (simply, blobs). In an ideal case, each blob corresponds to one object when the
tracked targets are separate from each other. However, one object could split into
multiple blobs because of scene occlusion or low contrast with the background.
Multiple objects can merge in one blob when they occlude each other. In this thesis,
we focus on the problem of blob merge, and also investigate other related problems.
Figure-3 Vehicle detection using wavelet features (a) and wire frame model (b).
2.2 Vehicle Detection
Car detection and classification has been attempted from static images. In [44],
Rajagopalan et al. present a vehicle detection algorithm based on higher order image
statistics. The performance is not very good and the method is sensitive to textured
background. Schneiderman and Kanade [47] use a combination of wavelet features to
detect vehicles and eight detectors are trained separately to cover different
viewpoints. However, it is unclear how many detectors are necessary to cover all the
possible viewpoints. Since the detectors are trained on samples without occlusion,
the performance may decrease when occlusion happens. Figure-3(a) shows an
example of the vehicle detection on frontal views. Agarwal et al [1][2] represent
objects with sparse parts and their spatial relationship, and report experiments on
images of side view cars. The method is robust to small partial occlusion.
Methods described in [54][28] utilize knowledge of camera calibration and
assume that vehicles are on a known ground plane. They project simplified 3D
wireframe vehicle models onto the 2D image to match image gradients for vehicle
detection and classification. An example is shown in Figure-3(b). These methods are
applied to un-occluded vehicles only, and do not cover the orientation change and
the variance of different types of vehicles. Methods described in [60] and [27] use
2D shape models to track vehicles in aerial videos where inter-vehicle occlusion is
not an issue. Levin et al. [31] pre-train an Adaboost vehicle detector with a small
number of training samples, and improve the performance with unlabeled data over
time. As in other detection methods, this method does not deal with occlusion.
Experiments were performed on vehicles with the same orientation. However it is
not clear if this method could extend to cover multiple orientations.
2.3 Detection-based Tracking Methods
Many classical detection-based tracking methods come from the study of radar-
based tracking. In radar-based tracking, an object appears as a bright point against
a dark background. It can be easily detected after simple operations. The basic
assumption is that one object creates at most one observation, and one observation
corresponds to at most one object. The main concern of tracking is how to optimally
estimate the state of the object. A Kalman filter[19] is sufficient to provide optimal
estimation in a linear system with Gaussian noise. False detections can appear
because of noise on sensors. A probabilistic data association filter [3] (PDAF) makes
probabilistic association of the object with the multiple detections. This technique
gives an averaged estimate:
v = Σ_{i=1}^{n} β_i · v_i    (2)

where v_i is the state of one of the multiple detections, and β_i is the association
probability. PDAF assumes false detections appear randomly around the real target.
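The averaging step of equation (2) is simple to state in code; the following sketch assumes the association probabilities β_i have already been computed (gating and the probability computation themselves are outside this example).

```python
import numpy as np

def pdaf_estimate(detections, assoc_probs):
    # Weighted estimate of equation (2): v = sum_i beta_i * v_i
    detections = np.asarray(detections, dtype=float)    # shape (n, d), candidate states
    assoc_probs = np.asarray(assoc_probs, dtype=float)  # shape (n,), association probabilities
    return (assoc_probs[:, None] * detections).sum(axis=0)
```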
So the track can drift away when there is persistent distraction.
Instead of making probabilistic data association, the Multiple Hypothesis
Tracking (MHT)[45] enumerates all possible associations and tracks each of them.
This results in an exponential increase of the possibilities. Cox et al. [10] proposed
an efficient implementation of the MHT, but it is still much more expensive than
PDAF.
PDAF and MHT are for tracking single objects. In the case of multiple objects,
a joint probabilistic data association (JPDAF) is proposed in [3]. It is similar to
PDAF except that the association probabilities between objects and detections
enforce the constraint that multiple objects should not lock on the same detection.
The MHT technique also applies to multi-object tracking. Some works [6][7][21] apply
the idea of MHT and JPDAF to vision-based object tracking.
(a) Blob Tracking (b) Tensor Voting
Figure-4 Motion Blob Tracking
2.4 Blob Tracking
Motion blob tracking is widely used to track moving objects. As mentioned in
section 2.1, background subtraction method can extract motion foreground blobs
effectively. In simple cases, each blob corresponds to one moving object. By
associating blobs in consecutive frames, blob-tracking method can generate the
object tracks. Two common problems of blob tracking are:
(1) Blob split: one object splits into several blobs due to low contrast to
the background.
(2) Blob merge: multiple objects merge into one blob when they are
close.
In W4 [15], Haritaoglu et al. detect and track walking people from motion
foreground, and analyze simple activities of people. The merge of multiple persons is
handled by detecting the peak points of foreground blobs. To deal with general blob
merge and split, Cohen et al. [8] extract tracks by representing blobs in continuous
frames as a graph. Kornprobst et al. [30] (see Figure-4(b)) associate the blobs across
all frames by enforcing global spatial and temporal smoothness under the framework
of tensor voting. Both methods can filter out short-time merges or splits. However,
since no object shape models are involved, these methods are unable to separate
merged objects for an extended period (e.g. two connected cars moving together.).
Gupte et al. [14] apply blob tracking for vehicle detection and classification. Blob
merge is a major cause for both detection and classification errors (see Figure-4(a)).
Recently, Nillius et al. [39] represent consistent tracks (without split or merge) in a
track graph, and link the identities using Bayesian network inference. Results are
shown for tracking soccer players.
2.5 Tracking multiple objects with occlusion
Koller et al. [29] track vehicles with their foreground contours. In the presence of
occlusion, the method sorts the involved vehicles according to their depth order. By
reasoning about the occlusion relationship, the method segments each vehicle out by
removing the foreground of the other involved vehicles. Oberti et al.[40] track each
object with feature points. When occlusion happens, some of the points will be
missing. So the method is robust to small partial occlusion. Kamijo et al.[20] handle
occlusion by using temporal and spatial constraints. However, all these three
methods require a specific detection zone to initialize the objects separately.
Pang et al. [41] work on foreground image to detect occlusion and separate
merged vehicles; their method is sensitive to foreground noise, and is not able to
handle change of vehicle orientation. Kanhere et al. [22] segment merged vehicles on
a freeway by grouping feature points. Results have been shown on vehicles with only
one orientation.
Particle filtering methods have also been applied to tracking multiple humans
in [18][55] with a joint likelihood formulation. These two papers both employ
explicit human shape models. However, results were only shown on up to four
people, and the method of particle filtering suffers from the curse of dimensionality. Khan et al.
[24][25] apply an MCMC-based particle filter to track multiple ants. Since the
likelihood is evaluated on each individual target instead of the joint target space, this
filter works efficiently when the targets do not interact, but it has difficulty
handling persistent interaction.
Zhao et al. [61][62] successfully apply MCMC to track multiple pedestrians
with severe occlusion. In their implementation, the human hypotheses are proposed by
head peak detection and residue image analysis. We call it “detection-based
MCMC”. We follow the same strategy for vehicle segmentation in Chapter 4.
Furthermore, we propose an MCMC vehicle segmentation method with random
vehicle hypotheses; this new method removes the strong dependence on detection.
Some other works [26][38] track multiple human objects with multiple
cameras. The setting of multiple cameras is usually applied to indoor environments.
Constraints from multiple cameras are utilized to segment the merged human blobs.
(a) Contour Model (b) Rectangular Shape (c) Feature Points
Figure-5 Methods of vehicle occlusion analysis
2.6 Moving shadow detection
Moving shadow detection is critical for accurate moving object detection in
video, since moving shadow pixels are often mis-detected as moving object pixels.
A. Prati et al. [42] summarize the existing approaches into 4 classes: statistical non-
parametric (SNP) approach [17], statistical parametric (SP) approach [37],
deterministic non-model based (DNM1) approach [11] and deterministic non-model
based (DNM2) approach [53]. Basically, these approaches classify the motion pixels
into object or shadow based on the characteristics of shadow appearance. Since most
of the approaches are pixel-based, their performance decreases when objects have a
color similar to shadows, usually grey or black. Figure-6 shows a few examples.
Figure-6 Moving shadow detection examples
Chapter 3 Robust Vehicle Blob Tracking in
Low-occlusion Scenes
As discussed in the literature review, the existing image-based vehicle
detection methods only work for a specific viewpoint; the color- or contour-based
tracking methods [7][9] have problems with object initialization and model update.
In contrast, motion-foreground-based tracking methods can detect and track moving
objects automatically. They are widely used for many applications. In this chapter,
we describe the vehicle blob tracking method we developed to track vehicles in low
occlusion scenes.
The videos are captured from several street surveillance systems. Our system
applies a dynamic background learning method to extract the moving foreground.
The constraints of scene, camera and vehicle models are exploited to overcome many
difficulties of blob tracking. Evaluation on a large dataset shows that the system
works robustly under many kinds of conditions.
3.1 Task and Data Description
The task here is to detect moving vehicles. The objective includes accurate
detection, localization and tracking while maintaining the identities of vehicles as they
travel across frames.
(a) Camera #1 at daytime (b) Camera #2 at daytime (c) Camera #3 at daytime
(d) Camera #1 at night
with lighting
(e) Camera #2 at dark
night
(f) Camera #3 at nightfall
Figure-7 Sample frames of six cases
The videos are of street scenes captured by cameras mounted at light pole
heights looking down towards the ground. There is one road running from top to
bottom of the image and another one from left to right near the top of the image.
Provided videos are from three different cameras at three different sites. They
include data captured at several different times of the day including some at night.
Some examples are shown in Figure-7.
Our basic approach is to detect moving vehicles in these videos by computing
the motion foreground blobs by comparing input frame with a learned background
model and then to track by making data associations between detected foreground
blobs. However, there are several factors that make this task highly challenging. We
summarize these in three groups below:
Camera Effects: Cameras shake and create false foreground detections.
Automatic gain control abruptly changes the intensity of the video
sometimes causing multiple false detections.
Scene Effects: Ambient illumination changes such as due to passing
clouds. Other moving objects like walking people, swinging trees or
small animals all create motion foreground. An object may not be fully
detected as foreground when its contrast against the background is low.
Object Appearance Effects: The shadow of vehicles creates foreground
on sunny days. Blobs from different vehicles may merge into one,
particularly in heavy traffic or near a stop sign.
The image sizes of objects near the top of the image are much smaller than
those near the bottom. Thus, the size of a vehicle as it travels from the top to the
bottom changes substantially. An ambiguous zone or "don't care" region near the top
is provided by the evaluation scheme to exclude hard to see, smaller vehicles as
being part of the evaluation.
The rest of this chapter is organized as follows. The details of the proposed
vehicle tracking method are presented in section 3.2. Section 3.3 describes the
experiments and results. The quantitative evaluation and analysis are in section 3.4.
3.2 Vehicle Motion Blob Tracking
Figure-8 shows an overview of our method. We compute motion foreground
blobs at each frame and then track these blobs based on appearance association.
Knowledge of the scene, camera and vehicles is pre-learnt to enforce
several general constraints.
Figure-8 Method Overview
(flow diagram: background subtraction of input frame (t) yields foreground blobs (t); object tracks (t-1) are predicted and matched to blobs via a track-blob association matrix, which drives new track initialization, track termination, vehicle blob split and vehicle blob merge handling to produce object tracks (t), using vehicle, scene and camera knowledge.)
3.2.1 Scene Constraints
We assume vehicles move on a ground plane; we use vehicle motion and
vanishing points from scene features to compute an approximate camera model [34].
This process is performed once, in a training phase. To distinguish vehicles from
walking humans and other motion, we set a minimum size for an object to be
considered a vehicle; this size is set in 3-D, and its size in the image is computed using
the camera parameters. Basically, vehicles are modeled as a rectangular box. We
assign the road orientation as the vehicle orientation, and project the 3D box onto the
2D image. The projection area (number of pixels) is taken as the standard size of
a regular vehicle at that image location. Blobs that are larger than or comparable to the
standard size are considered vehicle blobs, and the other, smaller blobs are considered
human objects or noise foreground, as sketched below.
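A minimal sketch of this size constraint follows; the blob representation, the standard-area lookup interface and the ratio threshold are assumptions for illustration, not values or structures taken from the thesis.

```python
def classify_blobs(blobs, standard_area_at, vehicle_ratio=0.5):
    # blobs: list of dicts with 'area' (pixel count) and 'center' (x, y)
    # standard_area_at: callable giving the projected pixel area of a standard
    #   vehicle at an image location (from the camera model; assumed interface)
    labels = []
    for blob in blobs:
        ratio = blob["area"] / standard_area_at(blob["center"])
        labels.append("vehicle" if ratio >= vehicle_ratio else "other")
    return labels
```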
3.2.2 Background subtraction
We learn a pixel-wise color model of the background pixels. The model is
updated adaptively with new frames to adapt to illumination changes [32]. We do not
assume that an empty background frame is available. Pixels that do not conform to
the background model are hypothesized to be due to motion and called "foreground"
pixels. These foreground pixels are grouped into connected regions; we apply a
sequence of morphological operations to remove small noise regions and fill in small
holes in the regions. In an ideal case, every blob would correspond to one vehicle
object; however, this is not always the case. Figure-9 shows three common problems
that may be present.
Blob merge: one blob could contain multiple vehicle objects;
Blob split: one vehicle object could split into several blobs when its
contrast to the background is low;
Noise blobs: other moving objects, humans, also create foreground
blobs.
(a) Blob Merge (b) Blob split (c) Other objects
Figure-9 Common difficulties of vehicle blob tracking
Figure-10 Track-blob association matrix
(binary matrix between detected blobs B_t^1, ..., B_t^4 and predicted objects Ô_t^1, ..., Ô_t^4.)
3.2.3 Track vehicle blobs
For simplicity, we model vehicle objects as rectangles:
O_t^i = { (x_t^i, y_t^i, w_t^i, h_t^i), A_t^i, D_t^i }    (3)

where t is the frame number, i is the object id, (x_t^i, y_t^i, w_t^i, h_t^i) describes the image
location of the object, and A_t^i and D_t^i are the object appearance model and motion
dynamic model. In our implementation, A_t^i is a color histogram, and D_t^i is a Kalman
filter. Similarly, the detected blob is modeled as a rectangle with a color histogram:
B_t^i = { (x_t^i, y_t^i, w_t^i, h_t^i), A_t^i }.
Our tracking method processes the frames sequentially. At each new frame t,
we first apply tracking object's dynamic model (Kalman filter) to predict object's
new position.
Ô_t^i = D_{t-1}^i ( O_{t-1}^i )    (4)
Then the predicted objects with detected blobs will generate an association
matrix, see Figure-10. The association is based on the overlap between the predicted
object rectangles and blob rectangles.
M(i, j) = 1 if |Ô_t^i ∩ B_t^j| / min(|Ô_t^i|, |B_t^j|) > τ, and 0 otherwise    (5)
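A small sketch of how the association matrix of equation (5) can be built from rectangle overlaps follows; the overlap threshold value is an illustrative choice, not one given in the thesis.

```python
def rect_overlap(a, b):
    # Intersection area of two axis-aligned rectangles given as (x, y, w, h)
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return w * h

def association_matrix(predicted_rects, blob_rects, tau=0.3):
    # M[i][j] = 1 when the overlap of predicted object i and blob j, normalized by
    # the smaller of the two areas, exceeds tau (equation (5))
    M = []
    for o in predicted_rects:
        area_o = o[2] * o[3]
        row = []
        for b in blob_rects:
            area_b = b[2] * b[3]
            score = rect_overlap(o, b) / min(area_o, area_b)
            row.append(1 if score > tau else 0)
        M.append(row)
    return M
```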
As mentioned before, we have the vehicle 3D model and camera calibration
model. At each image location, we can project the 3D model onto the image, and
estimate the image size of a standard vehicle at different image locations. For each
blob, we define its relative size ratio as its size ratio against the standard vehicle at its
location. Based on the track-blob match matrix, we define the following rules to
handle the track birth, death and update, as well as blob split and merge:
[1] If the track-blob match is one-to-one, we update the position of the track,
and the associated color histogram and Kalman model.
[2] When a blob has no match with any tracking object, and its relative size
ratio is larger than a threshold, a new track will be created.
[3] A track ends when it has no blob match for more than a threshold number
of frames.
[4] If one blob matches multiple tracks, it’s a merge case. We apply a
modified Meanshift color tracking method to track them separately.
[5] If one object matches with multiple blobs, there are two common cases.
One is that one vehicle splits into multiple blobs; the other is multiple
merged vehicles start to separate. We evaluate the relative size of the blobs.
If the number of “big” blobs is less than two, we believe it’s a split case
and combine the multiple blobs as one to match with the track. Otherwise,
we match the track with the blob with highest match score, and take the
other “big” blobs as new tracks.
Currently, our method is not able to segment merged vehicles if they appear
merged from the beginning, but they will be tracked separately after they are
detected as separate blobs.
3.2.4 Track merged vehicles with Meanshift
When multiple objects merge into one blob, the binary foreground blob doesn’t
tell the position of each involved object directly. Ideally, we prefer to segment
merged objects out, by using the object appearance information collected in previous
frames. However, there is no simple solution to this problem, and potential
segmentation methods could be computationally expensive. Instead, we modify
the Meanshift tracking method [9] to track each object individually in this case.
The well-known Meanshift tracking method computes the pixel probability
based on the object’s color histogram. Then the probability intensity gradient is
computed at the initial window, and the method follows the gradient to find the local
maxima within several steps.
y_1 = ( Σ_{i=1}^{n_h} x_i · w_i · g(||(y_0 - x_i)/h||^2) ) / ( Σ_{i=1}^{n_h} w_i · g(||(y_0 - x_i)/h||^2) ),   w_i = Σ_{u=1}^{m} √(q_u / p_u) · δ[b(x_i) - u]    (6)

Given an initial window at y_0 with size h, w_i is the weight of one pixel x_i inside the
window, q_u is the value of bin u in the target histogram, p_u is from the histogram of the
current window, and b(x_i) is the histogram index of pixel x_i. g(||(y_0 - x_i)/h||^2) is a
Gaussian kernel, which assigns higher weight to pixels close to the center and lower
weight to pixels away from the center. The new window position y_1 is computed by
equation (6). Meanshift computes the new window position iteratively. Comaniciu et al.
[9] have proven that the Meanshift process converges to the position that locally
maximizes the similarity between the histogram of the converged window and the target
histogram.
Bradski [5] proposed a face tracking method, called Camshift. The main part is
very similar to Meanshift. The way to compute the pixel weight w_i is slightly different,
and the Gaussian kernel is not applied. Based on our experience, this method works as
well as Meanshift in practice for most cases.

y_1 = ( Σ_{i=1}^{n_h} x_i · w_i ) / ( Σ_{i=1}^{n_h} w_i ),   w_i = Σ_{u=1}^{m} q_u · δ[b(x_i) - u]    (7)
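One window update of equation (7) amounts to moving the window to the weighted centroid of the pixel weights inside it. The sketch below shows that single move, assuming a back-projection image of per-pixel weights is already available; the iteration loop and convergence test are omitted.

```python
import numpy as np

def camshift_step(prob_image, window):
    # One window update following equation (7): the new center is the weighted
    # mean of the pixel weights w_i inside the current window.
    x, y, w, h = window
    patch = prob_image[y:y + h, x:x + w]
    total = patch.sum()
    if total <= 0:
        return window
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    cx = x + (xs * patch).sum() / total
    cy = y + (ys * patch).sum() / total
    return (int(round(cx - w / 2.0)), int(round(cy - h / 2.0)), w, h)
```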
As mentioned before, each tracking object has a color histogram appearance
model. Assume two vehicle objects O_1 and O_2 are merged in one blob. We compute the
pixel probability of O_1 with the following equation:

p(O_1 | pixel) = 0 if (x, y) ∈ Ô_2;  p(pixel | O_1) / ( p(pixel | O_1) + p(pixel | O_2) ) otherwise    (8)

If the pixel is inside the estimated position of Ô_2, its probability is set to zero. Otherwise, it
is computed based on the color histograms of O_1 and O_2. The pixel probability of O_2
is computed in a similar way. Then we apply the Camshift tracking method to find the
local maximal positions of both O_1 and O_2. This method is effective in handling short-
time vehicle merges, even when the colors of the two vehicles are similar.
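The per-pixel assignment of equation (8) can be sketched as follows, assuming the two color-histogram back-projections and the predicted box of O_2 are already available as image-sized arrays; the function and argument names are illustrative.

```python
import numpy as np

def pixel_prob_merged(backproj_1, backproj_2, predicted_mask_2):
    # Equation (8): probability that a pixel belongs to O1 when O1 and O2 share a blob.
    # backproj_1, backproj_2: p(pixel|O1) and p(pixel|O2) as 2-D arrays (e.g. color
    # histogram back-projections); predicted_mask_2: boolean mask of O2's predicted box.
    denom = backproj_1 + backproj_2
    prob_1 = np.divide(backproj_1, denom,
                       out=np.zeros_like(backproj_1, dtype=float),
                       where=denom > 0)
    prob_1[predicted_mask_2] = 0.0  # pixels inside O2's predicted position get zero
    return prob_1
```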
Figure-11 Sample result frames on training/testing sequences.
(Green rectangle is the ground-truth, red rectangle is the tracking output of our
system, and blue rectangle is the defined "ambiguous" zone.)
3.3 Experiments and Results
The blob tracking system is tested on the videos provided by the CLEAR-VACE
surveillance evaluation project [23]. The size of each frame is 720x480. The
experiments were run on a regular PC with an Intel Pentium 2.6 GHz CPU; the
average processing speed is 2.85 frames/second. For the training and testing stages,
the manually labeled annotations of 100 surveillance video sequences (about 165
minutes in total) were provided by a third party. Figure-11 shows some tracking examples
under different conditions. In general, our system works well for daytime videos,
though the detected vehicle size is usually larger than the ground-truth when
shadows are present. At nighttime, vehicle headlights create large change regions;
our system has difficulty in locating the vehicle positions accurately in such cases.
Figure-12 shows a few typical tracking errors. In case (a), two vehicles are not
detected because of low contrast with the background. In case (b), there is significant
camera shaking; this creates many false foreground regions. Also, a person object is
detected as a vehicle because its size is comparable to small vehicles. In case (c), four
vehicles come together due to traffic congestion; this causes one missed detection,
and the position of another detection is not accurate. Case (d) is captured at night.
Headlights create a big foreground blob, and the detection is not very accurate.
(a) Low contrast (b) Shaking camera
(c) Congestion (d) Headlight
Figure-12 Typical tracking error samples
3.4 Evaluation and Discussion
We quantitatively evaluated our system according to the requirements of the test
process. Table-1 lists the scores on 50 test video sequences. Four metrics (MODP,
MODA, MOTP, MOTA) are defined to evaluate both the detection and tracking
performance. We describe the metric definitions in Appendix A, and full details
can be found in [23].
MODP (Multiple Object Detection Precision) measures the position
precision of single frame detections;
MODA (Multiple Object Detection Accuracy) is the combination of
miss detections and false alarms;
MOTP (Multiple Object Tracking Precision) measures the position
precision at tracking level;
MOTA (Multiple Object Tracking Accuracy) is MODA at tracking
level with consideration of ID switches.
One observation from the table is that the difference between MODP and MOTP, or
between MODA and MOTA, is very small for all the test video sequences. This is mainly
because the penalty on object ID changes is quite small. There are in fact a number of
ID changes in the output of our system; however, the currently defined
MOTA and MOTP are not able to reflect this tracking error very well.
To evaluate performance trade-offs, we repeated our experiments with 5
different sets of parameters. As MODA combines the influence of missed detections
and false positives, it is not easy to see a trade-off using this metric. Instead, we plot
an ROC curve using the traditional detection and false positive rates in Figure-13.
Scene Name | Average MODP | Average MODA | Average MOTP | Average MOTA | Num of Sequences
PVTRA102   | 0.653 | 0.675 | 0.645 | 0.667 | 24
PVTRA201   | 0.540 | 0.612 | 0.539 | 0.605 | 18
PVTRN101a  | 0.665 | 0.625 | 0.664 | 0.623 | 5
PVTRN102d  | 0.691 | 0.644 | 0.684 | 0.641 | 3
Total      | 0.616 | 0.645 | 0.615 | 0.639 | 50
Table-1 Evaluation scores on 50 test video sequences
Figure-13 ROC Curve of CLEAR-VACE Surveillance Evaluation on Vehicle Tracking
3.5 Applications to tracking other objects
The blob tracking method we developed here is quite general. It can be applied to
track other moving objects by simply changing the 3D model of the objects. Besides
CLEAR-VACE surveillance vehicle tracking evaluation, we also apply our method
to CLEAR-VACE surveillance person tracking evaluation. The video dataset is the
same, but the tracking objects are walking humans. By changing the vehicle 3D
rectangular box model to a human 3D rectangular box model, we achieve quite good
tracking performance on walking persons (see Table-2). Some sample images are
shown in Figure-14. We classify vehicles and walking persons by using the heuristic
that vehicle blobs are bigger than human blobs. There are only a few classification
errors in this evaluation. As in vehicle tracking, this method is
not able to segment persistently merged persons. In [59], the tracking performance is
improved by combining the proposed method with a shape-based human detector.
We also apply this method to some other application scenes, PETS’06 event
recognition[35] and ETISEO event recognition. Figure-15 shows a few samples.
Scene Name | Average MODP | Average MODA | Average MOTP | Average MOTA | Num of Sequences
PVTRA102   | 0.581 | 0.560 | 0.584 | 0.558 | 24
PVTRA201   | 0.565 | 0.235 | 0.569 | 0.233 | 18
PVTRN101a  | 0.504 | 0.698 | 0.631 | 0.698 | 5
PVTRN102d  | 0.614 | 0.421 | 0.605 | 0.418 | 3
Total      | 0.569 | 0.432 | 0.572 | 0.430 | 50
Table-2 Evaluation scores of CLEAR-VACE person tracking on 50 test videos
(a) PVTRA102 (b) PVTRN101a
(c) PVTRA201 (d) PVTRN102d
Figure-14 Sample results of CLEAR-VACE person tracking sequences.
(a) PETS’06 Airport (b) ETISEO Building Entrance
(c) ETISEO Apron Scene (d) ETISEO Road Scene
Figure-15 Application of object tracking in other scenes.
3.6 Conclusions
We developed a blob tracking system for vehicle tracking in surveillance
environments. The evaluation was fully conducted on a relatively large dataset, and the
performance is promising. The system can also be readily applied to tracking other moving
objects in static camera scenes. The major limitation of the current method is that it
doesn’t work well when object merges are prevalent. This is understandable because
the system is based on motion foreground: when object merges are prevalent, one blob
often contains several objects, and it is not always straightforward to retrieve the object
identity information from the foreground blob. The next chapter will describe how
we use the object model information to tackle this problem.
Our system does not run in real-time; some of the needed speed-up can be
obtained by more careful coding and the use of faster commodity hardware, but it is also
likely to require algorithmic improvements.
Chapter 4 Detection-based MCMC Vehicle
Segmentation
Vehicle merge is a common problem for foreground-based vehicle tracking,
since one blob can contain several vehicles. This chapter describes how we
segment merged vehicles into individual ones.
4.1 Approach Overview
Our approach is to use generic shape models of classes of vehicles, such as for
a sedan or an SUV, to detect, classify and track vehicles from motion foreground
blobs. We assume that these blobs are generated by moving vehicles or by noise.
Shadows and reflections are assumed to be absent or removed by some pre-
processing. Given a binary foreground blob, we formulate the segmentation problem
as one of finding the number of vehicles and their associated parameters. We use a
generative approach in a Bayesian framework; we form vehicle hypotheses and
compute their posterior probabilities by combining a prior distribution on the number
of vehicles with measurement of how well the synthesized foreground matches with
the observed foreground (i.e. the likelihood).
Several issues need to be addressed to follow this approach: the nature of
models, likelihood and prior distribution functions, and efficient search through the
hypothesis space. We summarize our approach below and provide details in the
following sections.
We start with 3-D models of a typical member of a class of vehicles. For
example, we pick a specific sedan model as a representative of all sedans; this is
satisfactory because our aim is not to classify vehicles into specific models and the
intra-class differences are not significant for detection. We assume that the camera
is calibrated, and that it is approximated by an orthographic projection model for
observing vehicles, whose size is small compared to their distance from the camera.
We also assume vehicles move on a known ground plane and that the image x-axis is
parallel to this plane (i.e. zero “yaw” angle). Thus, the projected shape depends only
on the orientation and vertical position in the image of the vehicle. This vertical
position can be used to compute a vehicle-centered tilt angle (i.e. the angle between
the ground plane and the line from camera center to the vehicle center). For efficient
processing, we pre-compute 2D image profiles for a number of different viewpoints;
we quantize this space into bins, 72 bins for the 360° vehicle orientation range and 19 bins
for the 90° tilt angle range, to give a collection of 1,368 2D shape templates (Figure-16
shows a few of them); a sketch of this quantization is given after the figure caption.
(a) 3D sedan model (b) Samples of 2D template
Figure-16 3D vehicle model and its 2D appearance
(In (b), the three rows are camera tilt angle 0°, 15°, and 30° respectively; the four
columns are vehicle orientation 0°, 30°, 90°, and 120° respectively.)
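The following sketch maps a (vehicle orientation, camera tilt) pair to one of the 1,368 pre-computed templates. The bin counts come from the text above; the flat, orientation-major indexing scheme is an assumption made for illustration only.

```python
def template_index(orientation_deg, tilt_deg, n_orient=72, n_tilt=19):
    # 72 orientation bins over 360 degrees and 19 tilt bins over 90 degrees,
    # i.e. 1,368 pre-computed 2-D shape templates in total.
    o_step = 360.0 / n_orient          # 5-degree orientation bins
    t_step = 90.0 / (n_tilt - 1)       # tilt samples 0, 5, ..., 90 degrees
    o_bin = int(round((orientation_deg % 360.0) / o_step)) % n_orient
    t_bin = min(int(round(tilt_deg / t_step)), n_tilt - 1)
    return o_bin * n_tilt + t_bin      # orientation-major flat index (an assumption)
```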
We cast the vehicle detection (and classification) problem as being one of
finding the set of vehicle types and associated parameters that maximize a posterior
probability, i.e. to compute a MAP solution. It is not practical to conduct a top-down
search through the complete space of possible hypotheses to find this solution for an
observed foreground. We use a coarse-to-fine, 2-stage strategy for reducing the
search effort as illustrated in Figure-17. In the first stage, we take advantage of the
observation that all vehicles have a roughly rectangular shape in the motion
foreground. We find rectangles in the image which are likely to include a vehicle; the
image size of a rectangle is predicted by the image location given the knowledge of
the ground plane, vehicle type and camera geometry. Knowledge of static occluding
objects is also considered to reduce the occlusion effect from scene objects. The
detected rectangles indicate corresponding pre-generated vehicle masks which are
then used for a more accurate measurement. In a general case, we search through a
set of vehicle orientations to cover all directions; if the direction of motion of the
vehicles is constrained, say to be along a known road, we can limit the search for
computational efficiency and robustness.
A finer search to sample the probability distribution for hypotheses is
conducted by a data-driven MCMC (DDMCMC). Such methods have been shown to
be effective for human and image segmentation in previous work [61] and [56]. The
hypotheses generated in the first stage provide initial proposals. MCMC search
includes jump dynamics, where objects may be created, removed, replaced, and
diffusion dynamics where variations in parameter values are explored.
Figure-17 Sketch of our method
(flow diagram with blocks: input frame, foreground detection, camera calibration, vehicle models, rectangular region detection, 2D model match, MCMC joint-space search, and Kalman tracking.)
We compute posterior probability estimates for vehicles in each frame
independently and then use Kalman filtering to compute associations between the
frames. This is not optimal; inter-frame constraints should be incorporated in the
search process itself. Nonetheless, even our minimal use of inter-frame information
results in large improvements as shown in our results later. Our analysis is largely
confined to the use of the silhouettes of the foreground blobs. This is also not
sufficient in general as features in the interior of the blobs can be informative. But
we believe that our current results are already quite promising and useable for
various applications.
The following section gives the formal definition of our approach. Section
4.3 describes the details of the detection method and the way we apply MCMC.
Section 4.4 presents the tracking method we implemented. The results and evaluation
are shown in section 4.5.
4.2 Bayesian Problem Formulation
Given a binary foreground image I, assumed to originate from vehicle motion
plus noise, we want to answer the following three questions: (1) how many cars are
there? (2) Where are they? And (3) what is the type of each car? Formally, we want
to find the solution in the space defined by:
Θ = ∪_{k=0..∞} ( M_1 × M_2 × … × M_k ),   M_i = (t_i, x_i, y_i, o_i)     (9)
where k is the number of vehicles and M_i is a vehicle parameter vector, representing vehicle type (t), its center in the image plane (x, y), and its orientation in the ground plane (o). We can also include the 3D size of vehicles in the parameter vector if the variance of size is not negligible. The solution space is high dimensional and its size (which depends on the number of vehicles) is not known in advance.
We formulate the problem as computing the maximum a posteriori (MAP) solution θ*, so that
θ* = arg max_{θ ∈ Θ} P(θ | I)     (10)
Under Bayes' rule, the posterior probability is proportional to the product of the prior probability and the likelihood,
P(θ | I) ∝ P(θ) ⋅ P(I | θ)     (11)
We discuss how to calculate these two terms in the following sections.
4.2.1 Prior Probability
P(θ) = P(k | N) ⋅ ∏_{i=1..k} P(M_i)     (12)
where P(k|N) is a Poisson distribution with sample mean N = A_f / A_v, A_f is the overall area of the foreground and A_v is the average area of one vehicle. The intuition is that a larger number of vehicles creates a larger foreground area. This term penalizes both too few vehicles and too many vehicles.
P(M) = P(t) ⋅ P(x, y) ⋅ P(o)     (13)
where P(t) is the prior probability of the vehicle type. In our implementation, we have three vehicle types and set P(sedan) = P(SUV) = P(truck). In the general case, we assume the prior probabilities of position P(x,y) and orientation P(o) are uniform distributions. However, knowledge of road positions and directions, if available, can be added for better performance.
(a) Two connected cars (b) Motion foreground
(c) Synthesized mask of joint solution (d) Match between foreground and mask
Figure-18 The match between foreground and synthesized vehicle mask.
4.2.2 Multi-vehicle Joint Likelihood
When vehicles occlude each other, the image likelihood cannot be decomposed
into the product of image likelihood of individual vehicles. We compute the joint
likelihood for a given state based on the match between the motion foreground and
the synthesized solution image.
We assume that the foreground is formed by vehicles and random noise (see Figure-18). Denote F as the image foreground and V_i as the image mask of one vehicle in a solution. Then ∪_{i=1..k} V_i is the ideal foreground of the whole solution. We use the likelihood function described in [61], given by:
P(I | θ) = α ⋅ e^{−(λ_1⋅E_1 + λ_2⋅E_2)}     (14)
where E_1 = F − F ∩ (∪_{i=1..k} V_i) is the set of foreground pixels that are not covered by the mask, E_2 = (∪_{i=1..k} V_i) − F ∩ (∪_{i=1..k} V_i) is the set of mask pixels with no matching foreground pixels, α is a constant, and λ_1, λ_2 are two coefficients that weigh the penalties on the two kinds of errors.
By combining the above likelihood and the prior probability, we get the posterior probability function:
P(θ | I) ∝ P(k | N) ⋅ ( ∏_{i=1..k} P(M_i) ) ⋅ e^{−(λ_1⋅E_1 + λ_2⋅E_2)}     (15)
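The following sketch illustrates equations (14) and (15) in code, assuming the synthesized per-vehicle masks have already been rendered; the weights λ1 and λ2 are placeholders, and we work in log space for numerical convenience:

```python
import numpy as np

def joint_log_likelihood(foreground, vehicle_masks, lam1=1.0, lam2=1.0):
    """log P(I | theta) up to a constant, following equation (14).

    foreground    : HxW boolean array (motion foreground)
    vehicle_masks : list of HxW boolean arrays, one synthesized mask per vehicle
    """
    if vehicle_masks:
        union = np.logical_or.reduce(vehicle_masks)
    else:
        union = np.zeros_like(foreground, dtype=bool)
    e1 = np.count_nonzero(foreground & ~union)   # foreground pixels not explained
    e2 = np.count_nonzero(union & ~foreground)   # mask pixels with no foreground
    return -(lam1 * e1 + lam2 * e2)

def log_posterior(foreground, vehicle_masks, log_prior):
    """log P(theta | I), proportional to log prior + log likelihood (equation 15)."""
    return log_prior + joint_log_likelihood(foreground, vehicle_masks)
```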
4.2.3 Search for MAP
Given a foreground image, the number of vehicles is usually unknown, and the
possible parameter space could be very large (20 dimensions for 5 cars).
Computationally it is not feasible to conduct an exhaustive search for the MAP
solution. A simple greedy search is likely to find only a local maximum. We describe
a 2-stage method: in the first stage, coarse hypotheses for presence of vehicles (with
associated parameters) are generated; this stage is designed to have high detection
rate but also may have many false alarms, see Figure-19. These hypotheses provide
initial proposals for a second stage DDMCMC method, similar to one described in
[61], with jump and diffusion dynamics.
Figure-19 Schematic depiction of 2-layer detection
(Sub-windows are tested by the rectangular region detection; windows that pass are tested by the 2D model match, which outputs vehicle hypotheses; windows rejected by either layer are discarded.)
4.3 2-layer Detection for vehicle hypotheses
As mentioned before, we assume ground plane, orthographic projection and
zero yaw angle. We describe a vehicle in the image with an 8-parameter vector (cx, cy, w, h, rotate, type, orient, tilt). (cx, cy) is the center of the vehicle in image space; (w, h, rotate) is the rotated bounding rectangle; (type) is the vehicle type (sedan, SUV, or truck); (orient) is the vehicle orientation relative to the camera; and (tilt) is the angle between the ground plane and the line from the camera center to the object center.
These eight parameters are not independent. We can derive (w,h,rotate,tilt) from
(cx,cy,type,orient) by projecting the 3D vehicle model onto the 2D image, with the
assumption that camera parameters are known, that the center of the 3D vehicles lies
on the projection line from (cx,cy) and that the variance in size for a vehicle type is
not significant. Then (cx,cy,type,orient) suffice to describe one vehicle. To avoid
repeated computation of 3D model projections, we quantize the space
(cy,type,orient), and learn the corresponding parameters (w,h,rotate,tilt) by samples;
this also accommodates errors in camera calibration. We search in the 4D space
(cx,cy,type,orient) for vehicle hypotheses.
To do this in an efficient way, we devise a 2-layer detection method. The first
layer searches for a rectangular region (not necessarily aligned with the image axes)
likely to contain a vehicle, and the second layer uses the projected vehicle models
indicated by the detected rectangular regions to match with the foreground for a
more accurate evaluation.
4.3.1 Rectangular Region Detection
Given a vehicle (cx,cy,type,orient), we retrieve a tilted rectangle
(cx,cy,w,h,rotate), from a look up table, to approximate the image area of the vehicle.
To conduct an efficient match with the image foreground, we use the integral image
representation of an intensity image introduced in [57]. The value of a pixel (x, y) in
integral image is the sum of pixel intensities in the rectangle (0,0,x,y). The advantage
of this representation is that the intensity sum of any rectangle can be computed with
4 values of the corner pixels in the integral image (see Figure-20 (a)). This reduces
the computation cost greatly.
(a) Integral Image (b) Integral Histogram Image
Figure-20 Integral image and integral histogram image
(In (b), the value of 1 is sum of pixels on A. The value of 2 is A+B. So, the sum on B
is 2-1.)
As a special case of integral image, we create an integral histogram image to
sum the pixels along a row or a column. In the vertical integral histogram image, the
value of pixel (x,y) is the sum of the pixels from (x,0) to (x,y). With this integral
histogram image, the intensity sum of any arbitrary vertical line segment can be
computed with two array references (see Figure-20(b)). The horizontal direction
works in the same way. It takes only one operation in the integral histogram image to
compute the intensity sum on a line segment, while it takes three operations in the
normal integral image.
(a) Rectangle region match (b) Rectangle contour match
(c) 2d mask region match (d) 2d mask contour match
Figure-21 Coarse and fine vehicle model match
Given a binary foreground image, we compute an integral image Int(x,y). A distance map is generated (see Figure-21(b)) from the boundary of the foreground image. Vertical and horizontal integral histogram images are computed on the distance map, denoted as vIh(x,y) and hIh(x,y). For an arbitrary rectangle (x,y,w,h), we calculate the evaluation value V as below:
V_1 = Int(x+w, y+h) − Int(x+w, y) − Int(x, y+h) + Int(x, y)
V_2 = vIh(x, y+h) − vIh(x, y) + vIh(x+w, y+h) − vIh(x+w, y)
V_3 = hIh(x+w, y) − hIh(x, y) + hIh(x+w, y+h) − hIh(x, y+h)
V = V_1 ⋅ (V_2 + V_3) / ( 2 ⋅ (w + h) ⋅ w ⋅ h )     (16)
As a rectangle is not an accurate approximation for a 2D vehicle appearance,
we set a low threshold on V to achieve a high detection rate.
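The sketch below illustrates how the integral image of the foreground and the integral histogram images of the distance map support a constant-time evaluation of equation (16); the function names, and the assumption that the rectangle lies strictly inside the image, are ours:

```python
import numpy as np

def integral_image(img):
    """Int(x, y): sum of img over the rectangle (0,0)-(x,y), with a zero border."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def column_integral(img):
    """vIh(x, y): sum of img over the column segment from (x, 0) to (x, y)."""
    vi = np.zeros((img.shape[0] + 1, img.shape[1]), dtype=np.float64)
    vi[1:, :] = np.cumsum(img, axis=0)
    return vi

def row_integral(img):
    """hIh(x, y): sum of img over the row segment from (0, y) to (x, y)."""
    hi = np.zeros((img.shape[0], img.shape[1] + 1), dtype=np.float64)
    hi[:, 1:] = np.cumsum(img, axis=1)
    return hi

def rectangle_score(ii, vi, hi, x, y, w, h):
    """Evaluation value V of equation (16) for an axis-aligned rectangle.

    ii is the integral image of the binary foreground; vi and hi are the
    vertical/horizontal integral histograms of the distance map.  The
    rectangle (x, y, w, h) is assumed to lie strictly inside the image.
    """
    v1 = ii[y + h, x + w] - ii[y + h, x] - ii[y, x + w] + ii[y, x]      # area match
    v2 = (vi[y + h, x] - vi[y, x]) + (vi[y + h, x + w] - vi[y, x + w])  # vertical sides
    v3 = (hi[y, x + w] - hi[y, x]) + (hi[y + h, x + w] - hi[y + h, x])  # horizontal sides
    return v1 * (v2 + v3) / (2.0 * (w + h) * w * h)
```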
Figure-22(b) shows the detected rectangles from the foreground of the image in
Figure-22(a). In this example, several cars are merged in one blob, but also one car is
split into three parts because of the occlusion by scene objects (a tree and a pole). We
use knowledge of these occluding scene objects (given by a scene model or by
learning) in the following way: when a rectangle covers enough foreground pixels,
we treat the inside scene-occluded area (if any) as foreground. In this way, we detect most of the vehicles occluded by scene objects at the cost of some more false alarms. In this case, 212 redundant rectangles are found but every vehicle is captured
by at least one rectangle.
(a) Input image (b) 217 detected rectangular regions
(c) 9 vehicle proposals after refine match (d) 5 detected vehicles after MCMC
Figure-22 Procedures of vehicle segmentation
(One sedan is misclassified as a truck, because shadow creates some foreground
under the car and part of the car (rear window) is not detected on foreground.)
4.3.2 2D Model Match
In the second layer, we use 2D appearance masks to compute a more precise
match. From 3D vehicle models, we pre-generate and index the 2D orthographic
appearance masks from all viewpoints with quantized parameters (type,orient,tilt) as
described in Section 4.1 earlier. For a detected rectangle, we pick the model with the parameters (type, tilt, orient) and match the model with the foreground (see Figure-22 (c)(d)) for a more precise evaluation. As in the rectangle detection, the evaluation is based on an area match and a contour match. Denote by V_f the number of foreground pixels matched with the model, and by V_e the intensity sum of pixels on the edge energy map along the boundary of the model. We output the detection if V_f ⋅ V_e is
larger than a threshold. Finally, we take only the hypotheses with local maximum
evaluation values to reduce duplicate detections (though multiple overlapping
hypotheses are still possible). Figure-22(c) shows 9 vehicle hypotheses (for five
actual vehicles) after model match.
4.4 Markov Chain Monte Carlo
The method described above has high detection rate but also many false
alarms. Due to foreground noise and the approximations at the detection level, the
parameters of the ‘right’ detection may also not be very accurate. We use the
detected hypotheses as initial samples in a data driven MCMC method to search for
the joint MAP solution.
As described in [61][13], the basic idea of MCMC is to design a Markov chain
to sample a probability distribution P(θ|I). At each iteration t, a new state θ' is sampled from θ_{t-1} based on a proposal distribution q(θ'|θ_{t-1}). The transition probability is given as
p((k, θ_{t-1}), (k', θ')) =
    min{ 1, [ P(θ'|I) ⋅ q(θ_{t-1}|θ') ] / [ P(θ_{t-1}|I) ⋅ q(θ'|θ_{t-1}) ] },        if k = k'
    min{ 1, C ⋅ P(θ'|I) / P(θ_{t-1}|I) } ⋅ p_{k'} ⋅ P_{k'}(θ'),                      otherwise
                                                                                      (17)
where k, k' are the vehicle numbers in θ_{t-1} and θ' respectively, C is a normalization factor, p_{k'} is the prior probability of k' vehicles, and P_{k'}(θ') is the prior probability of θ' in the space of k' vehicles.
Then, the Markov chain accepts the new state with the Metropolis–Hastings algorithm:
θ_t = θ'          if β < p(θ_{t-1}, θ')
θ_t = θ_{t-1}     otherwise
                                            (18)
where β is a random number in [0, 1).
It has been proven that the Markov chain constructed in this way has a stationary distribution equal to P(θ|I), as long as the transition function q(θ'|θ_{t-1}) makes the chain irreducible and aperiodic.
Denote the state at iteration t-1 as θ_{t-1} = {k, M_1, M_2, …, M_k}. The following Markov chain dynamics apply to θ_{t-1} to generate the new state θ'. The dynamics correspond to sampling the proposal probability q(θ'|θ_{t-1}).
(1) Vehicle hypothesis addition: Randomly select a proposal M = {t, x, y, o} from the vehicle proposals provided by the detection method. A new hypothesis M' = {t, x', y', o'} is generated based on an assigned Gaussian distribution on every parameter (except type). θ' = {k+1, M_1, M_2, …, M_k, M'}.
(2) Vehicle hypothesis removal: Randomly select an existing vehicle hypothesis M_i in θ_{t-1}. θ' = {k−1, M_1, M_2, …, M_{i−1}, M_{i+1}, …, M_k}.
(3) Vehicle hypothesis replacement: Randomly select an existing vehicle hypothesis M_i, and replace M_i with M_i', which is a proposal that has some overlap with M_i. θ' = {k, M_1, M_2, …, M_{i−1}, M_i', M_{i+1}, …, M_k}.
(4) Vehicle type change: Randomly select an existing vehicle hypothesis and
randomly change the type of this vehicle. The other parameters are unchanged.
(5) Stochastic vehicle position diffusion: update the position of a vehicle
hypothesis in the direction of the gradients under some random noise.
(6) Stochastic vehicle orientation change: update the orientation of a vehicle
hypothesis under some random noise.
The first four are referred to as jump dynamics and the rest are referred to as
diffusion dynamics. The Markov chain designed in this way is irreducible and
aperiodic since all moves are stochastic. Furthermore, redundant dynamics (e.g., replace = remove + add) are added for more efficient traversal of the solution space.
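A compact sketch of one iteration of this jump and diffusion chain is given below; the proposal pool and the posterior function are assumed to be supplied, a plain random-walk diffusion stands in for the gradient-guided position move for brevity, and the acceptance step uses a symmetric-proposal approximation of equation (17):

```python
import random

def mcmc_iteration(state, proposals, posterior, sigma_pos=2.0, sigma_orient=5.0):
    """One jump/diffusion move on a state = list of hypotheses (type, x, y, o).

    `proposals` is the pool of detection-stage hypotheses; `posterior(state)`
    returns an (unnormalized) posterior value.  This is an illustrative sketch,
    not the exact implementation.
    """
    move = random.choice(["add", "remove", "replace", "type", "pos", "orient"])
    new_state = list(state)
    if move == "add" and proposals:
        t, x, y, o = random.choice(proposals)
        new_state.append((t, x + random.gauss(0, sigma_pos),
                          y + random.gauss(0, sigma_pos),
                          o + random.gauss(0, sigma_orient)))
    elif move == "remove" and new_state:
        new_state.pop(random.randrange(len(new_state)))
    elif move == "replace" and new_state and proposals:
        new_state[random.randrange(len(new_state))] = random.choice(proposals)
    elif move == "type" and new_state:
        i = random.randrange(len(new_state))
        t, x, y, o = new_state[i]
        new_state[i] = (random.choice(["sedan", "SUV", "truck"]), x, y, o)
    elif move == "pos" and new_state:
        i = random.randrange(len(new_state))
        t, x, y, o = new_state[i]
        new_state[i] = (t, x + random.gauss(0, sigma_pos),
                        y + random.gauss(0, sigma_pos), o)
    elif move == "orient" and new_state:
        i = random.randrange(len(new_state))
        t, x, y, o = new_state[i]
        new_state[i] = (t, x, y, o + random.gauss(0, sigma_orient))
    # Metropolis-Hastings acceptance (equation 18, symmetric-proposal form)
    accept_prob = min(1.0, posterior(new_state) / max(posterior(state), 1e-12))
    return new_state if random.random() < accept_prob else state
```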
Figure-24 shows the results of the MCMC process on our running examples;
these results were obtained after 1000 iterations.
Figure-23 Overview of MCMC iterations
(At each iteration, one dynamic (add, remove, replace, type change, position diffusion, or orientation diffusion) is applied to θ_{t-1} to generate θ'; the acceptance probability is computed, and θ_t is set to θ' if the move is accepted, or to θ_{t-1} otherwise.)
4.5 Experiments and Evaluation
We show results on two different traffic scene videos, (named cross.avi and
turn.avi). For both sequences, the total processing time is about 1.0 second per frame
on a Pentium-IV 2.6GHz computer.
In cross.avi, vehicles travel along a road in two known directions. Small
shadows, pedestrians, trees, and other noise affect the detected motion foreground.
Merging of vehicles in a blob is common: we often see two cars moving together in
adjacent lanes in the same direction; cars traveling in opposite directions also merge
in the foreground. Up to five vehicles merged together are observed. Merging may
persist for several frames or even the whole time that two vehicles are visible. We
used 3 common vehicle models (sedan, SUV, and truck) in our experiments. We
used an optical flow method [33] to distinguish the two orientations (moving left or
right), and the known orientations in search for vehicle models. Some graphical
results are shown in Figure-24. Note that all vehicles are detected and the image
locations appear to be accurate. Turn.avi is captured at a T-shaped intersection (see
Figure-25). Cars make a right turn to merge onto the main street. When the cars are
waiting at a stop sign, they could be merged in the detected foreground for 100
frames or more. As the cars coming from right pass by, they also merge with the
stopping cars for about 20 frames. To account for turns, we searched vehicle
orientation in a 90-degree range.
Because of the incompleteness of the foreground and the limitation of our
models, there are some detection errors and classification errors in the results. In
Figure-24, the yellow minivan in the third row is detected as a sedan, because we do not have a minivan model and the sedan model matches it best. In the fourth row, the white sedan is classified as an SUV because its shadow creates a large foreground region under it. The black sedan on the right is classified as an SUV because its trunk is invisible. Row 5 is a
typical false alarm case. A false alarm sedan is generated to cover the shadow below
the blue truck. In Figure-25, the black car in row 4 shows a case of wrong orientation
detection caused by occlusion of both the front and the back of the car, leaving little
information to distinguish its orientation from the foreground image.
Some quantitative evaluations are given in Table-3. We evaluate both the detection and the tracking results. In the detection evaluation, each frame is processed independently. We define a correct detection as one where the detected vehicle model has an overlap of more than 80% with the real vehicle. As can be seen, we achieve a high detection rate even when occlusion is prevalent (73% of vehicle instances are merged with others in cross.avi). Since we limit our detection to the motion foreground, the number of false alarms is small unless many big moving objects appear (e.g. a group of pedestrians). The classification rate is lower, mostly due to errors caused by shadows. To evaluate tracking, we define a track to be "complete" if it covers more than 90% of the frames that the vehicle is in view; otherwise we call it an "incomplete" track. Note that one vehicle could create more than one incomplete track. A "false alarm track" is one where no actual vehicle corresponds to the tracked vehicle for more than half of its duration. It can be seen that most vehicles are tracked completely, with few fragments and false alarms.
                                          cross.avi    turn.avi
Frames                                       810          543
Single Frame     Vehicle Instances          1451          980
Detection        Merged Vehicles            1059          304
                 Detection                  1346          953
                 Detection Rate             92.8%        97.2%
                 False Alarms                  7           20
                 Classification Rate        77.1%         N/A
Video Sequence   Single Vehicles              35            5
Tracking         Complete Tracks              31            5
                 Incomplete Tracks             6            0
                 False Tracks                  2            1
Table-3 Evaluation on vehicle segmentation
Figure-24 Detection results on side-view vehicle segmentation.
Figure-25 Tracking results on turning vehicles.
(Through the frames, one black car makes a right turn. The
orientation changes are displayed by drawing corresponding
contours.)
Chapter 5 Vehicle Segmentation with
Arbitrary Orientations
The previous chapter describes a method to segment merged vehicles. That
method relies on a rectangular region detection method to provide vehicle proposals.
Since only the appearances of side-view or frontal-view vehicles are rectangular, it
has difficulty in handling vehicles with arbitrary orientations. In this chapter, we use
a general rectangular 3D box model as the vehicle model, and segment merged
vehicles with arbitrary orientations.
Again, we assume vehicles move on a known ground plane. Camera and
vehicle models are used to project 3D vehicle models onto 2D images. Since the
number of vehicles is unknown in advance, the multi-vehicle solution space is a
union of several subspaces. Each subspace is the parameter space of a fixed number
of vehicles. When the number of vehicles is relatively high, the dimension of the
total space can be very high, making brute-force search or dense sampling
computationally intractable. We therefore apply the Markov chain Monte Carlo
(MCMC) method to sample the distribution of the multi-vehicle configuration. By
making use of the motion information and designing an effective position move, we
implement a system that can segment merged vehicles robustly.
The vehicle model we use is a 3D rectangular box (see Figure-26). More
complex 3D models could be applied in the same framework to improve the
matching accuracy, but this increases the required computation for projecting the 3D
models onto the 2D image. Although a rectangular box is a simple approximation to
the various shapes of vehicles, our experiments show that the tracking performance
is insensitive to the small match error between the model and vehicle appearance.
We approximate that the 2D image center of a vehicle is the projection of the 3D
center of the vehicle model. Assuming the camera calibration and ground plane are
known, we can readily transform between these two coordinates.
(a) Rectangular Model (b) Match with foreground
Figure-26 Match between vehicle box model and foreground
5.1 Bayesian Problem Formulation
Given a foreground image, we formulate the vehicle segmentation problem as searching for the multi-vehicle state θ* which maximizes the posterior probability P(θ|I), where I is the given foreground image. The space of θ is defined as:
Θ = ∪_{k=0..∞} ( M_1 × M_2 × … × M_k ),   M_i = (x_i, y_i, o_i, s_i)     (19)
where k is the number of vehicles; M_i is a vehicle parameter vector with image center position (x_i, y_i), 3D orientation o_i and size factor s_i. Vehicles have a 3D size (width, height, length); these dimensions are not fully independent, as wider vehicles are usually higher and longer. We perform a PCA analysis on the sizes of 37 typical vehicle samples; only one principal component is kept, covering around 70% of the variance.
According to Bayes' rule, the posterior probability is proportional to the product of the prior probability and the likelihood term:
P(θ | I) ∝ P(θ) ⋅ P(I | θ),   θ ∈ Θ     (20)
5.1.1 Prior Probability
We define the prior probability of a state as the product of (1) the probability of
the number of vehicles, (2) the probability of individual vehicles, and (3) the
probability of 3D position overlap between any two vehicles.
P(θ) = ∏_{i∈[1,k]} P_v(M_i) ⋅ ∏_{i<j, i,j∈[1,k]} P_o(M_i, M_j)     (21)
In general, we set the individual vehicle probability P_v(M_i) to be a uniform distribution. But if we have some scene knowledge about possible vehicle positions and orientations, we could introduce that knowledge here to penalize unrealistic states. In the real 3D world, two vehicles do not overlap. The overlap probability term P_o(M_i, M_j) penalizes the overlap of two vehicles on the ground plane. For robustness in practice, we set the probability as:
P_o(M_i, M_j) = (1 − x)^3,   x = 2 ⋅ A_overlap / ( A_i + A_j )     (22)
where A_i and A_j are the areas of the ground plane projections of the two vehicles, and A_overlap is the overlap area of these two projections. With our rectangular model, the ground plane projection of a vehicle is a rotated rectangle; the overlap area is the intersection of two rotated rectangles, which is a convex polygon in general. We can get the vertices of this polygon by applying a polygon clipping method [16], and the area can be computed from the coordinates of the vertices. In practice, we enlarge the width and the length of the vehicle to 110% to penalize the situation where vehicles are very close.
5.1.2 Multi-Vehicle Joint Likelihood
Similar to previous works [61], we compute the likelihood of a multi-vehicle state by matching its synthesized image mask with the foreground image. Let F denote the image foreground, and V_i denote the image mask of one particular vehicle. Then ∪_{i=1..k} V_i is the binary mask of a multi-vehicle state. We compute this mask image by transforming a 2D vehicle image center into a corresponding 3D center, and projecting the 3D model back onto the image. The multi-vehicle likelihood function is designed as an exponential function of the match errors:
P(I | θ) = α ⋅ e^{−(λ_1⋅E_1 + λ_2⋅E_2)}     (23)
where E_1 = ( F − F ∩ (∪_{i=1..k} V_i) ) / F is the ratio of foreground pixels which are not covered by the state mask, and E_2 = ( (∪_{i=1..k} V_i) − F ∩ (∪_{i=1..k} V_i) ) / F is the ratio of mask pixels which have no corresponding pixels on the foreground. λ_1 and λ_2 are two weights on these two types of errors; they are set by experience.
By multiplying the prior probability term and the likelihood term together, we get the posterior probability function:
P(θ | I) ∝ P(k | N) ⋅ ( ∏_{i=1..k} P_v(M_i) ) ⋅ ( ∏_{i<j≤k} P_o(M_i, M_j) ) ⋅ e^{−(λ_1⋅E_1 + λ_2⋅E_2)}     (24)
The purpose of vehicle segmentation is to estimate the distribution of this
function, and search for the multi-vehicle state with the highest posterior probability.
5.2 MCMC-based Posterior Probability Estimation
Due to the possible large multi-vehicle parameter space, it is not
computationally feasible to compute the distribution with dense sampling. MCMC
has been successfully applied in a few computer vision problems to sample a
probability distribution at a promising speed. To make the MCMC search efficient,
Chapter 4 uses a rectangular region detector to provide vehicle hypotheses. However,
the rectangular model is a good approximation only for some of vehicle orientations.
The performance will decrease when the vehicle appearance is not close to a
rectangle. One contribution of this proposed method is that it doesn't rely on a
detection method to provide vehicle hypotheses; the hypotheses are generated
directly from the foreground image in a stochastic way. By working in this way, this
new method is able to segment merged vehicles with arbitrary orientations.
As in Chapter 4, we design a standard MCMC to sample the probability distribution P(θ|I). Based on the proposal distribution q(θ'|θ_{t-1}), a new state sample θ' is proposed at iteration t. The transition probability is given as:
p(θ_{t-1}, θ') = min( 1, [ P(θ'|I) ⋅ q(θ_{t-1}|θ') ] / [ P(θ_{t-1}|I) ⋅ q(θ'|θ_{t-1}) ] )     (25)
Then, the Markov chain accepts the new state with the Metropolis–Hastings algorithm:
θ_t = θ'          if β < p(θ_{t-1}, θ')
θ_t = θ_{t-1}     otherwise
                                            (26)
where β is a random number in [0, 1).
We design the following MCMC dynamics to generate the new state θ' from the current state θ_{t-1}:
(1) Add a vehicle hypothesis. The way to propose a vehicle hypothesis is
described in section 5.2.1;
(2) Remove a vehicle hypothesis. A vehicle in the current state is picked
randomly, and removed;
(3) Replace a vehicle hypothesis. Randomly generate a vehicle hypothesis as a new vehicle, and replace an existing vehicle with it;
(4) Vehicle position diffusion: change the position of a vehicle based on the
mean-shift move, see section 5.2.3;
(5) Vehicle orientation diffusion: change the vehicle orientation based on the
motion distribution, see section 5.2.2.
(6) Vehicle size diffusion: change the vehicle size randomly.
At each iteration, one of the dynamics elements is selected randomly, and
applied to the current state.
5.2.1 Vehicle hypotheses proposals
In Chapter 4, the vehicle hypotheses are provided by a vehicle region detection
method. Then the distribution of multi-vehicle configuration is estimated by the
MCMC sampling. However, there are some limitations of vehicle detection method.
Since the rectangular model is a good approximation only for some perspective
views of a vehicle, the performance will decrease when the appearance of a vehicle
is not close to a rectangle. Especially in the scenario of a road intersection where
vehicles often change their orientations, such a detection-based MCMC vehicle segmentation method has difficulty proposing vehicle hypotheses effectively.
To better handle vehicle inter-occlusion with arbitrary vehicle orientations, we
apply a more flexible way to propose vehicle hypotheses. In the proposed method,
the center and the orientation of a vehicle are proposed separately. The center
proposal is based on a center proposal map, and the orientation proposal is based on
motion information (see Figure-27).
(a) Foreground image (b) Proposal Map
(c) Motion Orientation Map (d) Random vehicle proposals
Figure-27 Vehicle proposing based on foreground and motion
Intuitively, the local region around a vehicle center should contain a large ratio
of foreground pixels. To propose vehicle center hypotheses efficiently, we generate a
center proposal map (see Figure-27(b)), where the intensity of each pixel is
proportional to the foreground pixel ratio in its local window. Each time, one center
hypothesis is proposed according to the distribution of this proposal map. So the
pixel with a higher intensity in the proposal map will have a higher probability of
being selected. This procedure helps the center proposals to focus on the foreground
area. It saves considerable computation especially when the foreground is only a
small part of the image. We apply the integral image technique to compute the
proposal map in a highly efficient way.
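A minimal sketch of this center proposal step, assuming a binary foreground array; the window size is an illustrative parameter and the helper names are ours:

```python
import numpy as np

def center_proposal_map(foreground, win=15):
    """Local foreground ratio around each pixel, computed with an integral image."""
    fg = foreground.astype(np.float64)
    ii = np.zeros((fg.shape[0] + 1, fg.shape[1] + 1))
    ii[1:, 1:] = fg.cumsum(0).cumsum(1)
    h, w = fg.shape
    ys, xs = np.mgrid[0:h, 0:w]
    y0 = np.clip(ys - win, 0, h); y1 = np.clip(ys + win + 1, 0, h)
    x0 = np.clip(xs - win, 0, w); x1 = np.clip(xs + win + 1, 0, w)
    area = (y1 - y0) * (x1 - x0)
    local_sum = ii[y1, x1] - ii[y1, x0] - ii[y0, x1] + ii[y0, x0]
    return local_sum / area

def sample_center(proposal_map, rng=np.random):
    """Draw one (x, y) center with probability proportional to the map value."""
    p = proposal_map.ravel() + 1e-12     # small floor avoids an all-zero map
    p = p / p.sum()
    idx = rng.choice(p.size, p=p)
    y, x = np.unravel_index(idx, proposal_map.shape)
    return int(x), int(y)
```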
5.2.2 Orientation Sampling from Motion
In a vehicle-tracking scenario, most of the vehicles are moving; the motion
information provides good indication about the orientation of the vehicles. We apply
Lucas & Kanade algorithm, as implemented in OpenCV, to compute the motion
between two consecutive frames. The motion orientation result we get is informative,
but quite noisy (see Figure-27(c)). To get a robust orientation hypothesis for a vehicle,
we sample the orientation from its local window. Specifically, we create a motion
orientation histogram from the local window of the vehicle center. Then we propose
an orientation hypothesis by sampling from the orientation histogram. For a region
with no motion (e.g. a stopped car), the orientation histogram is close to uniform.
Finally, we transform the X-Y motion on the 2D image plane to a 3D orientation,
using the camera model. The size of the proposal is initialized to that of a regular vehicle.
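The sketch below illustrates this orientation sampling, assuming per-pixel motion angles and magnitudes (e.g. derived from the Lucas & Kanade flow) are already available; the bin count and magnitude threshold are illustrative choices:

```python
import numpy as np

def sample_orientation(flow_angles, flow_mags, cx, cy, win=20,
                       n_bins=36, min_mag=0.5, rng=np.random):
    """Sample a 2D motion orientation (degrees) from the window around (cx, cy).

    flow_angles, flow_mags : per-pixel motion angle (degrees) and magnitude.
    Returns a sampled angle; if there is almost no motion in the window the
    histogram is close to uniform, matching the behaviour described above.
    """
    h, w = flow_angles.shape
    y0, y1 = max(0, cy - win), min(h, cy + win + 1)
    x0, x1 = max(0, cx - win), min(w, cx + win + 1)
    ang = flow_angles[y0:y1, x0:x1]
    mag = flow_mags[y0:y1, x0:x1]
    hist = np.full(n_bins, 1e-3)          # small floor -> near-uniform if no motion
    moving = mag > min_mag
    if np.any(moving):
        bins = (ang[moving] % 360.0 / (360.0 / n_bins)).astype(int) % n_bins
        np.add.at(hist, bins, 1.0)
    hist /= hist.sum()
    b = rng.choice(n_bins, p=hist)
    return (b + 0.5) * (360.0 / n_bins)
```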
5.2.3 Meanshift move
Meanshift tracking approach [9] has been successfully applied to track a
given image region. Given a probability distribution, meanshift tracking can compute
the gradient of the probability density, and follow the gradient to find the local
maximum within several steps. We apply the idea of meanshift tracking to help in
searching for a vehicle position. A simplified meanshift equation is given as:
x = Σ_{i=1..n} x_i ⋅ P(x_i, y_i) / Σ_{i=1..n} P(x_i, y_i),   y = Σ_{i=1..n} y_i ⋅ P(x_i, y_i) / Σ_{i=1..n} P(x_i, y_i)     (27)
where (x, y) is the new center after one meanshift step, (x_i, y_i) are the 2D image coordinates of each pixel within the tracking region, and P(x_i, y_i) is the probability of the pixel at (x_i, y_i). In our case P(x_i, y_i) equals one when the pixel is on the foreground, and zero otherwise.
Given a vehicle hypothesis (x, y, o), its center position (x, y) may not be correct since it was originally proposed randomly (see section 5.2.1). To move the vehicle to the correct position efficiently, we design a meanshift move strategy to propose a new center position. First, we get the vehicle's 2D image projection by transforming the 2D center into 3D and projecting the 3D model back onto the 2D image. We use the bounding box of the 2D vehicle projection as the initial window. Then we apply one step of meanshift (equation (27)) to get the position (x, y). The new vehicle center hypothesis (x', y') is sampled from a Gaussian distribution centered at (x, y):
(x', y') = (x, y) + N(δ_x, δ_y)     (28)
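A minimal sketch of this meanshift move on the binary foreground, following equations (27) and (28); the window size and Gaussian spread are illustrative parameters:

```python
import numpy as np

def meanshift_step(foreground, x0, y0, half_w, half_h):
    """One step of equation (27): centroid of foreground pixels in the window."""
    h, w = foreground.shape
    top, left = max(0, y0 - half_h), max(0, x0 - half_w)
    window = foreground[top:min(h, y0 + half_h + 1), left:min(w, x0 + half_w + 1)]
    ys, xs = np.nonzero(window)
    if len(xs) == 0:
        return float(x0), float(y0)        # no foreground support: stay put
    return xs.mean() + left, ys.mean() + top

def propose_center(foreground, x0, y0, half_w, half_h, sigma=2.0, rng=np.random):
    """Mean-shift move of equation (28): Gaussian sample around the shifted center."""
    x, y = meanshift_step(foreground, int(x0), int(y0), half_w, half_h)
    return x + rng.normal(0, sigma), y + rng.normal(0, sigma)
```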
(a) Sample image (b) Motion foreground (c) Vehicle model
Figure-28 multiple vehicles with shadows
5.2.4 Greedy size estimation
The minimal size of the vehicle is set as a regular sedan, (W=180cm,
H=144cm, L=482cm); the maximal size is set as (W=259cm, H=299cm,
L=1199cm). We design a size search method to find the vehicle size more
efficiently. For a given vehicle proposal, we evaluate its match against foreground
within a local window. If most of the error pixels are foreground pixels, we enlarge
the size of the vehicle. On the other hand, if most of the error pixels are vehicle mask
pixels, we reduce the size of the vehicle. The new size factor is given in the equation
below.
s' = s ⋅ ( 1 + (e_1 − e_2) / A )     (29)
where e_1 is the number of foreground error pixels, which have no match with the vehicle mask, e_2 is the number of vehicle mask error pixels, which have no match with the foreground, and A is the pixel area of the vehicle mask.
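A sketch of the greedy size update of equation (29), assuming the foreground and the synthesized vehicle mask are cropped to the same local window; the clamping bounds are illustrative:

```python
import numpy as np

def update_size_factor(s, foreground, mask, s_min=1.0, s_max=2.5):
    """Greedy size update of equation (29), evaluated in the mask's local window.

    foreground, mask : boolean arrays cropped to the vehicle's local window.
    s                : current size factor; the result is clamped to an allowed
                       range (the bounds here are illustrative choices).
    """
    e1 = np.count_nonzero(foreground & ~mask)   # unexplained foreground pixels
    e2 = np.count_nonzero(mask & ~foreground)   # mask pixels without foreground
    area = max(np.count_nonzero(mask), 1)
    s_new = s * (1.0 + float(e1 - e2) / area)
    return float(np.clip(s_new, s_min, s_max))
```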
(a) (b)
Figure-29 Greedy vehicle size search
5.3 Vehicle segmentation in presence of shadows
Shadows are a common problem in foreground-based tracking methods, since they create foreground pixels just as the moving objects do. Figure-28 shows a typical case of multiple vehicle blobs merged together when shadows exist. Given the direction of the sun, we can estimate the shadow appearance of a vehicle model by projecting each face of the box onto the ground, as seen in Figure-28. By incorporating the shadow
model in the original vehicle box model, the proposed MCMC segmentation method
can be applied directly to segment merged vehicles and their shadows at the same
time.
Figure-30 shows a few segmentation samples from two scenarios. In spite of
the existence of strong shadows, our method is able to detect the vehicles (marked
with red boxes) successfully, and the shadow regions (under blue masks) are
segmented from vehicles with good accuracy.
Figure-30 Segmentation results of multiple vehicles with shadows
5.4 Experiments
We did experiments on videos captured at several scenes. Some image results
are shown below in Figure-31. Several error examples are shown in Figure-32. In
case (a), the orientations of two vehicles are not correct, because parts of the two vehicles have left the image. The orientation of the vehicle in case (b) is not correct because this vehicle is partly occluded by trees. In case (c), the black car at the bottom right is not detected; the reason is that it has low contrast to the background, and a large part of this car is not detected as foreground. A big bus appears in case (d). Since the size of the bus is much bigger than our model, three false alarms are detected on the bus.
Figure-31 Vehicle Segmentation Results
(a) Image boundary
(b) Scene Occlusion
(c) Low contrast
(d) Big bus
Figure-32 Examples of segmentation errors
(a) (b)
(c) (d)
Figure-33 Examples of segmentation with big cars
5.5 Related Issues
Besides the merge problem discussed above, there are some other issues related to tracking vehicles outdoors. We describe the problems and provide a few simple
but useful solutions below.
5.5.1 Divide and Conquer
(a) Color image (b) Foreground image (c) Independent blobs
Figure-34 Divide and conquer
Given a foreground image of moving vehicles, our goal is to detect the positions of each vehicle. Since we are using a hypothesize-and-synthesize method, we generate the mask image of each hypothesis and match it with the foreground image to get the likelihood of the synthesized image. The size of the image we process is quite big, 720x480, so synthesizing and matching such a big image is computationally expensive. Assume one vehicle does not break into two or more foreground blobs; then each blob is independent of the others. We divide the foreground into separate blobs and look for the vehicle configuration of each blob. We run the vehicle segmentation method on each blob individually and combine the detected vehicles as the final detection result. As seen in Figure-34, the separated blobs contain one, two, and four vehicles. It is much more efficient to process each blob individually than to process the whole image.
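A minimal sketch of this divide-and-conquer step using connected components (OpenCV is used here as one possible implementation; the minimum-area filter is an illustrative choice):

```python
import cv2
import numpy as np

def split_into_blobs(foreground, min_area=400):
    """Split the binary foreground into independent blobs (connected components).

    Returns a list of (bounding_box, blob_mask) pairs; each blob can then be
    passed to the vehicle segmentation method independently.  min_area filters
    out small noise blobs.
    """
    fg = (foreground > 0).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(fg, 8, cv2.CV_32S)
    blobs = []
    for i in range(1, num):                      # label 0 is the background
        x, y, w, h, area = stats[i]
        if area < min_area:
            continue
        blob_mask = labels[y:y + h, x:x + w] == i
        blobs.append(((x, y, w, h), blob_mask))
    return blobs
```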
5.5.2 Scene Occlusion
Scene occlusion is usually caused by trees and poles along the local street. Occluded by scene objects, part of a moving vehicle cannot be detected as foreground; the visible part of a vehicle could split into several blobs. How to recover a vehicle's parameters under complicated scene occlusion is not trivial. Since most of the scene objects are static, we could consider labeling the scene occluding objects as an occluding mask manually, as in Figure-35(c). Several related works, like [63], tried to learn the scene occlusion model automatically.
Figure-36 shows how we compensate the error caused by scene occlusion.
Figure-36(c) is the match error image of Figure-36(a) and (b). It’s very clear that
many of the error pixels are due to the scene occlusion. We add the scene occlusion
mask onto both the foreground image and the synthesized vehicle mask, and
compute the error image again in Figure-36(f). The error pixels from scene occlusion
are cleared up.
(a) Input image (b) Motion foreground (c) Occluding mask
Figure-35 Scene occlusion and scene occluding mask
(a) Foreground (b) Synthesized mask
(c) Match error image
before compensation
(d) Foreground
+ Occlusion Mask
(e) Synthesized mask
+ Occlusion Mask
(f) Match error image
after compensation
Figure-36 Scene occlusion error compensation
5.6 Evaluation on vehicle blob merge
Our segmentation method applies a fixed size vehicle model to explain the
motion foreground. When a big vehicle (bus or big truck) appears in the scene, the
method will explain the big vehicle as multiple regular vehicles. As seen in Figure-
37(b), a big bus is segmented into 5 regular vehicles. How to distinguish blobs of big
vehicles from the blobs of merged vehicles is a challenging problem. We extend our
method to cover the size variance of vehicles.
(a) Successful segmentation on regular vehicles
(b) Failed segmentation on a vehicle of large size
Figure-37 Segmentation on large size vehicles
We approximate all the vehicles as rectangular boxes with changing sizes.
Then we quantize the possible width of a vehicle to 3 values (180cm, 216cm,
259cm), height to 5 values (144cm, 173cm, 207cm, 249cm, 299cm), and length to 6
values (482cm, 578cm, 694cm, 833cm, 999cm, 1199cm). So in total, there are 3×5×6 = 90 vehicle types of different sizes.
Intuitively, we believe the blob of one big vehicle usually looks like a
rectangular box, and the blob of several merged vehicles does not look like a box.
We match a vehicle box model with the foreground blob, hoping that the big vehicle
blob will have a small match error and merged blob will have a big match error.
We assume the orientation of the vehicle is given. Usually the vehicle
orientation is the road orientation, or the orientation can be computed from motion
orientation. Given a foreground blob, we take the image center of the blob as the object's 3D center, and project each of the 90 box models onto the image. The match error between the blob and the projected box is computed as the ratio of mismatched pixels. The smallest match error is saved as the likelihood of the blob being one vehicle.
We collected 41 big-car cases and 47 merged-small-car cases from the CLEAR evaluation data. The experimental results on match error are given in the graph below. From the graph, we see that the merged small cars have bigger match errors than the big cars. However, there is an overlap area. The best classification rate we can achieve on the collected data is 89.8%; 9 cases among the total 88 cases are misclassified.
Figure-38 Vehicle box match comparison
(Histogram of the number of cases versus match error, binned from 0.1 to 0.6 in steps of 0.05, for the big single car cases and the small car merge cases.)
(a) score=0.12 (b) score=0.29
(c) score=0.20 (d) score=0.33
(e) score=0.15 (f) score=0.53
Figure-39 Results of vehicle box match on foreground blobs
(a) (b)
(c) (d)
Figure-40 Error results of single vehicle vs. multiple vehicles classification
Chapter 6 Vehicle Tracking
In Chapter 4 and Chapter 5, we described two vehicle segmentation methods,
which are capable of segmenting merged vehicles in crowded situation. To achieve
our goal of extracting the vehicle trajectories, we need to link the same vehicle across the image sequence. In other words, we want to assign a consistent ID to the instances of the same vehicle as it appears through the image sequence.
If vehicle segmentation works well on every single frame, we have good
detections on vehicles even in crowded situations. Then what we need to do is to link
the same vehicle in consecutive images together. This procedure is quite similar to
the data association task. In this kind of tracking scenario, a tracking method has
three functions to execute. (1) Associate the same vehicle in consecutive images. (2)
Initialize a new tracking object when a new vehicle enters the scene. (3) End the
tracking when the vehicle leaves the scene.
In this chapter, we describe two detection-based tracking methods. The first method works on the best segmentation result of each frame. Since the segmentation cannot be perfect, we develop a second tracking method, which works on
multiple segmentation results of each frame.
6.1 Vehicle Association by using a Kalman Filter
To simplify the vehicle association problem, we model each vehicle blob as a rectangle. The match between vehicles is treated as a maximal-area match of two rectangles. Although an appearance model (template or histogram) could provide a more accurate match, we find that the rectangular model is good enough for most of the cases and it is much more computationally efficient than matching a template or histogram.
To match the vehicle rectangles of the previous frame and the rectangles of the current frame, the major concern is to estimate the vehicle state from the observed history. The Kalman filter provides the optimal estimate for a linear system with Gaussian noise. In both of our methods in this chapter, we associate a Kalman filter with each vehicle object.
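A minimal constant-velocity Kalman filter over a vehicle rectangle, as a sketch of the per-vehicle filter described above; the state layout and noise levels are our assumptions, not the exact values used in the system:

```python
import numpy as np

class VehicleKalman:
    """Constant-velocity Kalman filter on a vehicle rectangle (cx, cy, w, h).

    The state is (cx, cy, w, h, vx, vy); the measurement is the detected
    rectangle (cx, cy, w, h).  Noise levels are illustrative.
    """
    def __init__(self, rect, q=1.0, r=4.0):
        cx, cy, w, h = rect
        self.x = np.array([cx, cy, w, h, 0.0, 0.0])
        self.P = np.eye(6) * 10.0
        self.F = np.eye(6); self.F[0, 4] = 1.0; self.F[1, 5] = 1.0  # pos += velocity
        self.H = np.zeros((4, 6)); self.H[:4, :4] = np.eye(4)
        self.Q = np.eye(6) * q
        self.R = np.eye(4) * r

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]                                           # predicted rectangle

    def update(self, rect):
        z = np.asarray(rect, dtype=float)
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:4]
```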
Figure-41 Vehicle Association at consecutive frames
6.1.1 Data Association
Assume we are tracking N vehicle objects and there are M vehicle detections in
the current frame. The objective of data association is to decide (1) which detection
is associated with which tracking object; (2) which tracking object disappears from
the scene; (3) which detection is a new vehicle object, which just enters the scene.
We elaborate this situation with the figure below.
Figure-42 Association between tracking vehicles and detections at the current frame
(Tracking objects X1–X5 at time T-1 are matched to detections Z1–Z4 at time T; unmatched tracking objects are marked as disappeared, and an unmatched detection initializes a new tracking object X6.)
Here, each vehicle image appearance is modeled as a rectangle, and the match
is based on rectangle overlap. Although we could model vehicles with a more
accurate contour model and appearance model (template or color histogram), we find that the rectangle model is usually good enough for the association of vehicles, and it
is much more computationally efficient.
Given N tracking objects and M detections, we first create an N × M association matrix H, where H(i, j) is the rectangle match score between tracking object i and detection j, computed as
H(i, j) = | R_1 ∩ R_2 | / min( |R_1|, |R_2| )     (30)
where R_1 and R_2 are the rectangles of tracking object i and detection j, and |⋅| denotes area.
Since one tracking object could have an intersection with more than one detection, and one detection could also have intersections with more than one tracking object, we apply a greedy approach to solve this association problem. We find the maximal score in the association matrix H, which gives one association between the corresponding row and column; we then remove that row and column. We keep finding the maximal score until it is no larger than a threshold (which could be zero or less). Finally, we check the matrix: the remaining tracking objects are treated as disappearing objects, and the remaining detections are initialized as new tracking objects.
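The sketch below illustrates this greedy association using the overlap score of equation (30); the threshold value is illustrative and the function names are ours:

```python
import numpy as np

def rect_match_score(a, b):
    """H(i, j) of equation (30): intersection area over the smaller rectangle area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return inter / float(min(aw * ah, bw * bh)) if inter > 0 else 0.0

def greedy_associate(track_rects, det_rects, threshold=0.2):
    """Greedy assignment by repeatedly taking the largest score in the matrix H."""
    H = np.array([[rect_match_score(t, d) for d in det_rects] for t in track_rects])
    pairs = []
    while H.size and H.max() > threshold:
        i, j = np.unravel_index(np.argmax(H), H.shape)
        pairs.append((int(i), int(j)))
        H[i, :] = -1.0          # remove this row and column from consideration
        H[:, j] = -1.0
    matched_t = {i for i, _ in pairs}
    matched_d = {j for _, j in pairs}
    disappeared = [i for i in range(len(track_rects)) if i not in matched_t]
    new_objects = [j for j in range(len(det_rects)) if j not in matched_d]
    return pairs, disappeared, new_objects
```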
Because of severe occlusion, one vehicle could be missing for a few frames. We therefore keep a "missing" buffer for every tracked vehicle: when a tracked vehicle has no detection to match, we take the predicted position as the detection. When the accumulated "missing" count exceeds a threshold, we treat it as a real disappearance and remove this object from the tracking list.
Like general object detection methods, our segmentation method sometimes produces false alarm detections. To avoid the effect of false alarms, we require a new tracking object to accumulate more than a threshold number of detections in its first several frames.
6.2 Tracking through Multiple Hypotheses with the Viterbi
Algorithm
After applying the MCMC vehicle segmentation method on each single frame,
we get a set of samples of multi-vehicle configuration. Each sample has a probability
score. Ideally, the sample with the highest probability is the correct segmentation.
However, because of the noise on the foreground image, the approximation of our vehicle model, and the limitations of the likelihood function, the sample with the highest
probability may not always be the best solution. By taking consideration of the
spatial and temporal constraints between consecutive frames, the tracking method
can remove some ambiguities at individual frames. Specifically, we keep some
number of hypotheses about vehicle number and their parameters at each frame, and
look for the "smoothest" track among these hypotheses.
We describe our tracking problem under the framework of a hidden Markov model (see Figure-43). In our case, the state θ_t = (K_t, M_1^t, …, M_k^t, …, M_{K_t}^t) is the multi-vehicle configuration at time t, where K_t is the number of vehicles, and M_k^t = (type, x, y, orient) is the parameter vector of one specific vehicle. The observation I_t is the binary foreground image at frame t, and it can be the color image in a general case.
Figure-43 Graphical Model of Multiple Vehicle Tracking
(A chain from Start to End over the hidden states θ_1, …, θ_t, …, θ_T, each with its observation I_1, …, I_t, …, I_T.)
Figure-44 Tracking with the Viterbi algorithm
(Each frame t keeps multiple hypotheses θ_t^1, θ_t^2, …, θ_t^n; the blue line illustrates the optimal track through the multiple hypotheses at each frame, from Start to End.)
With the Markov assumption [43], the optimal sequence for a tracking application is given by
{θ_1*, …, θ_T*} = arg max_{θ_1,…,θ_T} { ( ∏_{t=1..T} p(I_t | θ_t) ) × ( ∏_{t=1..T−1} a(θ_t, θ_{t+1}) ) }     (31)
where a(θ_t, θ_{t+1}) is the association function between two consecutive states.
As described before, the MCMC process generates a set of samples for the probability distribution p(θ_t | I_t) at each frame. The task of tracking is to find the sample θ_t* from the sample set at each frame, such that the combination of samples {θ_1*, …, θ_T*} maximizes equation (31).
Among the multi-vehicle configuration samples generated from MCMC
segmentation, some have very low probability and some others are very similar to
each other. To reduce the computation, we ignore the low probability samples. The
threshold is set by experience. For the similar samples, it is not necessary to keep all
of them. After clustering them into groups, we take only one representative sample from each group as input for tracking. So, the solution for θ_t is limited to some number of samples:
θ_t* ∈ { θ_t^1, …, θ_t^n }     (32)
We can also see this as a multiple hypotheses tracking (MHT) method. The
difference is that one hypothesis here is a multiple vehicle configuration, instead of
one object in usual MHT applications.
Next, we apply the Viterbi dynamic programming method to search for the best track (see Figure-44). Denoting by f_t(θ_t^j) the score of the best track ending in hypothesis θ_t^j at frame t, the dynamic programming recursion is
f_t(θ_t^j) = p(I_t | θ_t^j) ⋅ max_i { a(θ_{t-1}^i, θ_t^j) ⋅ f_{t-1}(θ_{t-1}^i) }
At the starting state, f_0(start) = 1.0.
The association function a(θ_{t-1}^i, θ_t^j) measures the spatial correlation between the multi-vehicle configurations θ_{t-1}^i and θ_t^j. Let
θ_{t-1}^i = (m, v_1, …, v_m),   θ_t^j = (k, u_1, …, u_k)     (33)
For each pair of vehicles (v_i, u_j), we compute their overlap score O(v_i, u_j) by
O(v_i, u_j) = 2 ⋅ A_overlap(v_i, u_j) / ( A(v_i) + A(u_j) )     (34)
where A(v_i) and A(u_j) are the image areas of vehicles v_i and u_j, and A_overlap(v_i, u_j) is the area of their overlap. To compensate for motion, the position of v_i is updated with the motion computed in section 5.2.2.
Based on the overlap score, some of the vehicles find their best matches and the rest have no match. Assume the first l vehicles of θ_{t-1}^i and θ_t^j are matched with each other, and the rest of the vehicles have no match. We design the association function as
a(θ_{t-1}^i, θ_t^j) = ( ∏_{i=1..l} O(v_i, u_i) ) ⋅ ( ∏_{i=l+1..m} p_end(v_i) ) ⋅ ( ∏_{i=l+1..k} p_new(u_i) )     (35)
where p_end(v_i) is the likelihood of vehicle v_i disappearing, and p_new(u_i) is the likelihood of a new vehicle u_i appearing. We set p_end(v_i) = p_new(u_i) = 1.0 when the position of the vehicle is close to the image boundary; otherwise they are set to a low value. The intuition is that vehicles usually enter or leave the scene at the boundary of the image. By applying such an association function, we penalize the case where vehicles appear or disappear in the middle of the image, and prefer consistent vehicle tracks.
The computational cost of the dynamic programming is O(T ⋅ n^2), where T is the number of frames and n is the number of hypotheses at each frame.
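A sketch of this Viterbi search over per-frame hypothesis sets, working in log space; the observation and association scoring functions are assumed to be supplied, and the cost is O(T·n^2) as noted above:

```python
import numpy as np

def viterbi_track(frame_hypotheses, obs_loglik, assoc_logscore):
    """Select one multi-vehicle hypothesis per frame, maximizing equation (31).

    frame_hypotheses : list over frames; each entry is a list of hypotheses.
    obs_loglik(h)        -> log p(I_t | h) for a hypothesis of that frame.
    assoc_logscore(a, b) -> log a(theta_{t-1}, theta_t) between two hypotheses.
    Returns the index of the chosen hypothesis at each frame.
    """
    T = len(frame_hypotheses)
    score = [np.array([obs_loglik(h) for h in frame_hypotheses[0]])]
    back = []
    for t in range(1, T):
        cur, bp = [], []
        for h in frame_hypotheses[t]:
            trans = score[t - 1] + np.array(
                [assoc_logscore(prev, h) for prev in frame_hypotheses[t - 1]])
            best = int(np.argmax(trans))
            bp.append(best)
            cur.append(trans[best] + obs_loglik(h))
        score.append(np.array(cur))
        back.append(bp)
    # backtrack the best path from the last frame
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 2, -1, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```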
6.3 Experiments and Evaluation
We performed our experiments on traffic videos captured at two different road
intersections, traffic01.avi in Figure-45 and traffic02.avi in Figure-46. The image size
is 720x480. The camera tilt angle is about 19 degree for traffic01.avi, and about 27
degree for traffic02.avi. For each frame, we run MCMC with 5000 iterations, and at
most 500 samples are kept for tracking. The total processing time is about 3 seconds
per frame on a Pentium-IV 2.6GHz computer.
In these two scenarios, merging of vehicle blobs is very common. One typical
case is that two vehicles in adjacent lanes move together at the same speed. They are
merged from the moment they appear in the camera view until they disappear; there is no point at which to detect and initialize them individually. Another case is that several cars can be
merged for hundreds of frames when they wait to make left/right turns. Also the
moving cars merge with the waiting cars they pass by. Up to six cars merged
together are observed. Notice that in traffic02.avi, several pedestrians also create
blobs in foreground. But since the blobs created by pedestrians are much smaller
than the blobs created by vehicles, there is no vehicle detected for the pedestrian
blobs. Some result frames are shown in Figure-45 and Figure-46. The rectangular box
model is drawn on the image. Note that all merged vehicles are found, and the
positions and orientations appear to have good accuracy.
We quantitatively evaluated both the segmentation performance and the tracking performance of the proposed method. The report is given in Table-3. To evaluate vehicle segmentation, we process each frame independently. A correct detection is defined as a detection that has an overlap of more than 80% with a ground truth vehicle. Note that the detection rate is quite high (96.8% for traffic01.avi, and 88%
for traffic02.avi), even though many of the vehicle instances (about 1/3 in
traffic01.avi and 1/2 in traffic02.avi) are merged with others. Since we work on
motion foreground, the number of false alarms is small. Missed detections are mostly caused by two reasons: 1) a large part of the object is not detected as foreground, because of low contrast to the background or scene occlusion; 2) a large part of the object is occluded by other objects. Among a total of 61 individual vehicles in
traffic01.avi, 54 are tracked completely. 7 of 9 individual vehicles in traffic02.avi
have complete tracks. Shorter tracks or broken tracks are mostly caused by persistent
missed detections.
Figure-45 Result samples on traffic01.avi
Figure-46 Result frames on traffic02.avi
Chapter 7 Summary and Conclusion
This thesis presented approaches to detect and track multiple vehicles from the
video captured by a stationary camera. Promising results are shown in quite challenging situations.
The main idea of the proposed approaches is to combine the bottom-up
detections and top-down analysis to explain the video observation. Camera and
vehicle object models are utilized to enforce the constraints of 3D world. Towards
the goal of detecting and tracking of multiple vehicles, we presented two approaches.
The first approach focuses on the vehicle tracking in real surveillance environments.
There is a lot of environmental noise present in this kind of video, such as
illumination change, waving trees, camera shaking, scene occlusion, and strong
shadows, etc. The proposed vehicle tracking method uses an adaptive background
model to extract the motion foreground, and takes advantage of vehicle size
constraint to reduce the environmental noise. A third party evaluates the approach on
a relatively large dataset. The performance is very good and robust in presence of
various difficulties. The second approach focused on segmenting multiple merged
vehicles in crowded situation. The approach follows a Bayesian formulation, and
seeks the solution by computing the maximum of the posterior probability in the
joint space of multiple vehicles. We employ a sampling method, Markov chain
Monte Carlo (MCMC) to explore the large solution space. Various bottom-up
detection techniques are incorporated to make the top-down search more efficient.
Promising results are shown on multiple vehicles with persistent occlusion.
7.1 Summary of Contributions
• The use of camera model and 3D vehicle model. A 3D vehicle shape
model is used with a camera model to estimate the vehicle appearance
on 2D image. This enables the method to explain the image foreground
under the constraint of 3D world.
• New MCMC search strategies. Different from previous MCMC-based human segmentation methods, we do not use any strong detection method for vehicle proposals. Instead the vehicle proposals are
generated randomly, and several parameter diffusion strategies are
designed to facilitate the search in the high dimensional space.
• Simultaneously segmenting vehicles and their shadows. Use of the 3D vehicle model also facilitates the estimation of the vehicle shadow
appearance. Furthermore, the approach has the ability to segment
merged vehicles and their shadows simultaneously.
• Applying a vehicle size constraint in vehicle blob tracking. Using the camera model and the ground plane assumption, we estimate the vehicle's 2D image size by projecting the 3D vehicle model onto the image. Using the size constraint largely reduces foreground noise blobs.
7.2 Future Directions
This thesis focuses on the analysis of motion foreground. However, binary
motion foreground is only part of the information of the real video sequence, and the
information of binary motion foreground alone is not enough to extract the object
description in some cases. The work of this thesis gets close to the limit of what can be extracted from the motion foreground only. Starting from this thesis, there are a few interesting directions to explore.
Precise vehicle type classification. As we all know, vehicles can be
classified into several types. The proposed method does some
preliminary work on vehicle classification based on mask matching. A
more precise vehicle type model may help to do the classification
better.
More informative likelihood function. Currently, the likelihood
between the foreground and multi-vehicle state is evaluated on the
binary foreground match error. Color and texture information are not
considered at all. We showed a few examples where human observers are not able to tell the vehicle information directly from the foreground. However, human observers have no difficulty identifying the vehicles from the corresponding color image. This shows that the foreground-based likelihood is not informative enough to segment vehicles in some very congested situations. A more precise likelihood with color/texture
could improve the capability of the proposed vehicle segmentation
method.
Integration with event recognition models. The tracking results of the
vehicle segmentation and tracking are saved in XML format, and ready
to be used for trajectory based event recognition (i.e. [35]). Now, it is a
single-direction procedure: event recognition models use the tracking results to infer high-level activities. It would be interesting to study whether event recognition could give feedback to the tracking method and help it resolve some difficulties at the tracking level.
References
[1] Shivani Agarwal and Dan Roth, Learning a sparse representation for object
detection. In Proceedings of the Seventh European Conference on Computer
Vision, Part IV, pages 113-130, Copenhagen, Denmark, 2002.
[2] Shivani Agarwal, Aatif Awan, and Dan Roth, Learning to detect objects in
images via a sparse, part-based representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 26(11): 1475-1490, 2004.
[3] Y. Bar-Shalom, Tracking and Data Association, Academic Press, 1988
[4] G. Borgefors, Distance Transformations in Digital Images, CVGIP, vol. 34, pp.
344-371, 1986
[5] G.R. Bradski. Computer vision face tracking as a component of a perceptual user
interface. In Workshop on Applications of Computer Vision, pages 214-219,
Princeton, NJ, Oct. 1998.
[6] T. J. Cham and J. M. Rehg, “A Multiple Hypothesis Approach to Figure
Tracking”, IEEE Proc. Computer Vision and Pattern Recognition, Vol. 2, pp.
239-245, Ft. Collins, CO, June 1999.
[7] Y. Chen, Y. Rui, and T. S. Huang, “JPDAF Based HMM for Real-Time Contour
Tracking”, IEEE Proc. CVPR, Vol 1. pp.543-550, Kauai, Hawaii, December 11-
13, 2001
[8] I. Cohen, and G. Medioni, Detecting and Tracking Moving Objects for Video
Surveillance. Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol.2,
pp. 2319-2326, 1999
[9] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects
using mean shift. IEEE Conf. on Computer Vision and Pattern Recognition 2001,
vol.1, pp. 511-518, 2001
[10] I. J. Cox and S. L. Hingorani, “An Efficient Implementation of Reid’s
Multiple Hypothesis Tracking Algorithm and Its Evaluation for the Purpose of
Visual Tracking”, IEEE Trans. PAMI, vol.18, no.2, 1996
[11] R. Cucchiara, C. Grana, G. Neri, M. Piccardi, and A. Prati, “The Sakbot
system for moving object detection and tracking,” in Video-based Surveillance
Systems - Computer Vision and Distributed Processing. 2001, pp. 145–157,
Kluwer Academic.
[12] A. Elgammal, D. Harwood, and L.S. Davis, “Nonparametric Background
Model for Background Subtraction”, Proc. Sixth European Conf. of Computer
Vision, vol.2, pp.751-767, 2000
[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Markov Chain Monte Carlo
in Practice, Chapter 13.
[14] S. Gupte, O. Masoud, R. Martin, and N. P. Papanikolopoulos. Detection and
Classification of Vehicles. IEEE Trans. on Intelligent Transportation Systems,
vol.3, no.1, 2002
[15] I. Haritaoglu, D. Harwood and L. Davis. W4: Real-time surveillance of
people and their activities. IEEE Trans. Pattern Analysis and Machine
Intelligence, 22(8): 809-830, August 2000
[16] Donald Hearn and M. Pauline Baker. Computer Graphics. Addison-Wesley
Publishing Company, Inc., 1994
[17] T. Horprasert, D. Harwood, and L.S. Davis, “A statistical approach for real-
time robust background subtraction and shadow detection,” in Proceedings of
IEEE ICCV’99 FRAME-RATE Workshop, 1999
[18] M. Isard and J. MacCormick, “BraMBLe: A Bayesian Multiple-Blob
Tracker”, Proc. Int’l Conf. Computer Vision, vol.2, pp.34-41, 2001
[19] R. Kalman, “A New Approach to Linear Filtering and Prediction Problems”,
J. Basic Eng., vol.82, pp.35-45, 1960
[20] S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi. Occlusion robust tracking
utilizing spatio-temporal Markov random field model. In Proc. IEEE 15th Int.
Conf. Pattern Recognition, vol. 1, 2000, pp. 140-144
[21] Jinman Kang, Isaac Cohen, and Gérard Medioni, "Continuous Tracking
Within and Across Camera Streams", Proceedings of the IEEE Computer Vision
and Pattern Recognition, Vol. 1, pp. 267-272, Madison, Wisconsin, 18-20 June,
2003.
[22] N. K. Kanhere, S. J. Pundlik, S. T. Birchfield, Vehicle Segmentation and
Tracking from a Low-Angle Off-Axis Camera, IEEE Conference on Computer
Vision and Pattern Recognition, 2005
[23] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, M. Boonstra, and V.
Korzhova. Performance Evaluation Protocol for Face, Person and Vehicle
Detection & Tracking in Video Analysis and Content Extraction (VACE-II),
CLEAR - Classification of Events, Activities and Relationships.
http://www.nist.gov/speech/tests/clear/2006/CLEAR06-R106-EvalDiscDoc/Data
and Information/ClearEval_Protocol_v5.pdf
[24] Z. Khan, T. Balch, and F. Dellaert, “An MCMC-based Particle Filter for
Tracking Multiple Interacting Targets”, European Conference on Computer
Vision (ECCV'04), 2004
[25] Z. Khan, T. Balch, and F. Dellaert, “Multi-target Tracking with Split and
Merged Measurements”, IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR'05), San Diego, CA, 2005
[26] Kyungnam Kim and Larry S. Davis, "Multi-Camera Tracking and
Segmentation of Occluded People on Ground Plane using Search-Guided Particle
Filtering", European Conference on Computer Vision (ECCV), LNCS, 2006
[27] Z. Kim, J. Malik. Fast Vehicle Detection with Probabilistic Feature Grouping
and its Application to Vehicle Tracking. In Proc. IEEE Intl. Conf. Computer
Vision, 2003
[28] D. Koller, K. Daniilidis, and H.-H. Nagel. Model-based object tracking in
monocular image sequences of road traffic scenes. International Journal of
Computer Vision, 10(3): 257-281, 1993
[29] D. Koller, J. Weber, and J. Malik. Robust multiple car tracking with
occlusion reasoning. In Proc. European Conf. On Computer Vision, pages A:
189-196, 1994
[30] P. Kornprobst, and G. Medioni. Tracking Segmented Objects using Tensor
Voting. Proc. IEEE Conf. Computer Vision and Pattern Recognition. Vol. 2, pp.
118-125, 2000.
[31] Anat Levin, Paul Viola and Yoav Freund. Unsupervised Improvement of
Visual Detectors Using Co-Training, Proceedings of the 9th IEEE International
Conference on Computer Vision, 2003
[32] Liyuan Li, Weimin Huang, Irene Y.H. Gu, and Qi Tian. “Foreground Object
Detection from Videos Containing Complex Background,” ACM MM 2003.
[33] B. Lucas, and T. Kanade. An Iterative Image Registration Technique with an
Application to Stereo Vision, Proc. of 7th International Joint Conference on
Artificial Intelligence (IJCAI), pp. 674-679, 1981
[34] Fengjun Lv, Tao Zhao and Ramakant Nevatia. ''Self-Calibration of a Camera
from Video of a Walking Human,'' 16th International Conference on Pattern
Recognition (ICPR), Quebec, Canada, 2002
[35] Fengjun Lv, Xuefeng Song, Bo Wu, Vivek Kumar Singh, Ram Nevatia,
"Left-Luggage Detection using Bayesian Inference", 9th Intl. Workshop on
Performance Evaluation of Tracking and Surveillance (PETS-CVPR'06), 2006
[36] S.D. Ma and L. Li. Ellipsoid Reconstruction from Three Perspective Views.
ICPR’96
[37] I. Mikic, P. Cosman, G. Kogut, and M.M. Trivedi, “Moving shadow and
object detection in traffic scenes,” in Proceedings of Int’l Conference on Pattern
Recognition, Sept. 2000, vol. 1, pp. 321–324.
[38] Anurag Mittal and Larry S. Davis, “M2Tracker: A Multi-View Approach to
Segmenting and Tracking People in a Cluttered Scene,” IJCV, Vol. 51 (3), 2003.
[39] P. Nillius, J. Sullivan and S. Carlsson, Multi-Target Tracking - Linking
Identities using Bayesian Network Inference, In Proc. IEEE Computer Vision
and Pattern Recognition (CVPR06), New York City, June 2006
[40] F. Oberti, S. Calcagno, M. Zara, C. Regazzoni. Robust Tracking of Humans
and Vehicles in Cluttered Scenes with Occlusion. International Conference on
Image Processing, 2002.
[41] C. C. Pang, W. L. Lam, H. C. Yung, A Novel Method for Resolving Vehicle
Occlusion in a Monocular Traffic-image Sequence. IEEE Trans. on Intelligent
Transportation Systems. vol.5, no.3, Sept. 2004.
[42] A. Prati, I. Mikic, M.M. Trivedi, and R. Cucchiara. Detecting Moving
Shadows: Algorithms and Evaluation. IEEE Transaction on Pattern Analysis and
Machine Intelligence, vol.25, no.7, 2003
[43] L. R. Rabiner and B. H. Juang, An introduction to hidden Markov models.
IEEE ASSP Mag., pages 4-15, January 1986.
[44] A.N. Rajagopalan, P. Burlina, and R. Chellappa. Higher order statistical
learning for vehicle detection in images. In Proc. IEEE Intl. Conf. Computer
Vision, vol.2, pp. 1204-1209, 1999
[45] D.B. Reid, “An Algorithm for Tracking Multiple Targets”, IEEE Trans.
Automatic Control, vol.24, no.1, pp.843-854, 1979
[46] J. Rittscher, J. Kato, S. Joga, and A. Blake, “A Probabilistic Background
Model for Tracking”, Proc. European Conf. of Computer Vision, 2000
[47] H. Schneiderman and T. Kanade. A statistical model for 3d object detection
applied to faces and cars. In Proc. IEEE Conf. Computer Vision and Pattern
Recognition, vol.1, pp.1746-1751, 2000.
[48] J. Stauder, R. Mech, and J. Ostermann, “Detection of moving cast shadows
for object segmentation,” IEEE Transactions on Multimedia, vol. 1, no. 1, pp.
65–76, Mar. 1999.
[49] Xuefeng Song, Ram Nevatia, Detection and Tracking of Moving Vehicles in
Crowded Scenes, WMVC 2007 (accepted)
[50] Xuefeng Song, Ram Nevatia, Robust Vehicle Blob Tracking with
Split/Merge Handling, CLEAR Evaluation Workshop 2006
[51] Xuefeng Song, Ram Nevatia, A Model-based Vehicle Segmentation for
Tracking, ICCV 2005
[52] Xuefeng Song, Ram Nevatia, Combined Face-body Tracking in Indoor
Environments, ICPR 2004
[53] C. Stauffer, and W.E.L. Grimson, Learning Patterns of Activity Using Real-
Time Tracking, IEEE Trans. PAMI, vol.22, no.8, 2000
[54] G. D. Sullivan. Model-based vision for traffic scenes using the ground-plane
constraint. Phil. Trans. Roy. Soc. (B), vol. 337, pp. 361-370, 1992.
[55] H. Tao, H. S. Sawhney, and R. Kumar, A Sampling Algorithm for Tracking
Multiple Objects, Proc. IEEE Workshop Vision Algorithm, in conjunction with
ICCV’99, 1999
[56] Z. W. Tu and S. C. Zhu, Image Segmentation by Data-Driven Markov Chain
Monte Carlo, IEEE Trans. On PAMI, vol.24, no.5, 2002
[57] P. Viola, and M. Jones. Rapid Object Detection using a Boosted Cascade of
Simple Features. In IEEE Conf. on Computer Vision and Pattern Recognition,
2001
[58] C.R. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland, “Pfinder: Real-
time Tracking of the Human Body”, IEEE Trans. PAMI, vol. 19, no. 7, 1997
[59] Bo Wu, Xuefeng Song, Vivek Kumar Singh, and Ram Nevatia. Evaluation of
USC Human Tracking System for Surveillance Videos. In CLEAR'06 Evaluation
Campaign and Workshop, in conjunction with FG'06, 2006
[60] T. Zhao, R. Nevatia. Car Detection in low resolution aerial image. IEEE Intl.
Conf. Computer Vision, 2001.
[61] Tao Zhao, Ram Nevatia. Bayesian Human Segmentation in Crowded
Situations. IEEE Conf. on Computer Vision and Pattern Recognition, 2003.
[62] T. Zhao, R. Nevatia, Tracking Multiple Humans in Crowded Environment,
Proc IEEE Conf on Computer Vision and Pattern Recognition (CVPR'04), 2004
[63] Yue Zhou and Hai Tao, “A Background Layer Model for Object Tracking
through Occlusion," in Proc. IEEE International Conf. on Computer Vision,
ICCV'03, pp. 1079-1085, 2003
Appendix A: Evaluation Criteria Definitions for Vehicle Tracking
Assume $G_t^i$ is a ground truth vehicle rectangle, and $D_t^j$ is a detection rectangle
at frame $t$. These two rectangles are a matched pair if their overlap ratio is larger than
a threshold:

$$ \mathit{overlap\_ratio} = \frac{|G_t^i \cap D_t^j|}{|G_t^i \cup D_t^j|} > \mathit{Threshold} \qquad (36) $$
The ground truth rectangles that have no matching detection are counted as missed
detections, and the detected rectangles that have no matching ground truth are
counted as false positives.
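As an illustration, here is a minimal sketch of the overlap-ratio test and a per-frame matching step. Rectangles are taken as (x, y, width, height) tuples; the greedy one-to-one assignment and all function names are assumptions made for the example, since the protocol does not prescribe a particular assignment algorithm.

```python
def overlap_ratio(g, d):
    """Intersection-over-union of two rectangles given as (x, y, w, h)."""
    gx, gy, gw, gh = g
    dx, dy, dw, dh = d
    ix = max(0.0, min(gx + gw, dx + dw) - max(gx, dx))
    iy = max(0.0, min(gy + gh, dy + dh) - max(gy, dy))
    inter = ix * iy
    union = gw * gh + dw * dh - inter
    return inter / union if union > 0 else 0.0

def match_frame(gts, dets, threshold=0.2):
    """Greedily match ground-truth and detected rectangles one-to-one.
    Returns matched overlap ratios, number of missed detections, and
    number of false positives for this frame."""
    pairs = sorted(((overlap_ratio(g, d), i, j)
                    for i, g in enumerate(gts)
                    for j, d in enumerate(dets)), reverse=True)
    used_g, used_d, overlaps = set(), set(), []
    for r, i, j in pairs:
        if r > threshold and i not in used_g and j not in used_d:
            used_g.add(i)
            used_d.add(j)
            overlaps.append(r)
    misses = len(gts) - len(overlaps)       # ground truth with no match
    false_pos = len(dets) - len(overlaps)   # detections with no match
    return overlaps, misses, false_pos
```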
● MODP (Multiple Object Detection Precision) measures the position
precision of single frame detections. The overlap ratios of the matched detections are
averaged over all the frames in the sequence.
$$ \mathrm{MODP} = \frac{1}{N_{frames}} \sum_{t=1}^{N_{frames}} \frac{1}{N_{mapped}^{t}} \sum_{i=1}^{N_{mapped}^{t}} \frac{|G_t^i \cap D_t^i|}{|G_t^i \cup D_t^i|} \qquad (37) $$

where $N_{mapped}^{t}$ is the number of matched object pairs in frame $t$. The threshold
on the overlap ratio is set to 0.2. Since only matched objects are considered, the range of
MODP is $[0.2, 1.0]$.
● MODA (Multiple Object Detection Accuracy) combines the influence of
missed detections and false positives.
$$ \mathrm{MODA} = 1 - \frac{\sum_{i=1}^{N_{frames}} \big( c_m(m_i) + c_f(fp_i) \big)}{\sum_{i=1}^{N_{frames}} N_G^i} \qquad (38) $$

where $m_i$ and $fp_i$ are the numbers of missed detections and false positives in frame $i$;
$c_m(\cdot)$ and $c_f(\cdot)$ are the cost functions for missed detections and false positives. In this
evaluation, $c_m(m_i) = m_i$ and $c_f(fp_i) = fp_i$; $N_G^i$ is the number of ground truth
rectangles at frame $i$. The range of MODA is $(-\infty, 1.0]$, because the number of false
positives is unbounded.
● MOTP (Multiple Object Tracking Precision) measures the position precision
at the tracking level.

$$ \mathrm{MOTP} = \frac{\sum_{i=1}^{N_{mapped}} \sum_{t=1}^{N_{frames}} \frac{|G_t^i \cap D_t^i|}{|G_t^i \cup D_t^i|}}{\sum_{j=1}^{N_{frames}} N_{mapped}^{j}} \qquad (39) $$

where $N_{mapped}$ refers to the objects mapped over the entire track, as opposed to a
single frame, and $N_{mapped}^{t}$ refers to the number of mapped objects in the $t$-th frame.
● MOTA (Multiple Object Tracking Accuracy) is MODA at the tracking level,
with ID switches taken into account.

$$ \mathrm{MOTA} = 1 - \frac{\sum_{i=1}^{N_{frames}} \big( c_m(m_i) + c_f(fp_i) + \log(id\_switches_i) \big)}{\sum_{i=1}^{N_{frames}} N_G^i} \qquad (40) $$
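To make the four definitions concrete, the sketch below accumulates them over a sequence from per-frame matching results such as those produced by the matching sketch above. The input format, the choice of base-10 logarithm, and the convention that zero ID switches contribute nothing to the log term are all assumptions for the example; equations (37)-(40) do not fix these details.

```python
import math

def clear_scores(frames):
    """frames: per-frame dicts with keys 'overlaps' (matched overlap ratios),
    'misses', 'false_pos', 'id_switches', 'num_gt'.
    Returns (MODP, MODA, MOTP, MOTA) following equations (37)-(40)."""
    modp_sum = 0.0                      # sum over frames of per-frame mean overlap
    overlap_sum, mapped_sum = 0.0, 0    # sequence-level overlap and match counts
    det_err, trk_err, gt_sum = 0.0, 0.0, 0
    for f in frames:
        if f['overlaps']:
            modp_sum += sum(f['overlaps']) / len(f['overlaps'])
        overlap_sum += sum(f['overlaps'])
        mapped_sum += len(f['overlaps'])
        # identity cost functions used in this evaluation: c_m(m) = m, c_f(fp) = fp
        det_err += f['misses'] + f['false_pos']
        sw = f['id_switches']
        trk_err += f['misses'] + f['false_pos'] + (math.log10(sw) if sw > 0 else 0.0)
        gt_sum += f['num_gt']
    n = len(frames)
    modp = modp_sum / n if n else 0.0                        # eq. (37)
    moda = 1.0 - det_err / gt_sum if gt_sum else 0.0         # eq. (38)
    motp = overlap_sum / mapped_sum if mapped_sum else 0.0   # eq. (39)
    mota = 1.0 - trk_err / gt_sum if gt_sum else 0.0         # eq. (40)
    return modp, moda, motp, mota
```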