INTELLIGENT VIDEO SURVEILLANCE USING SOFT BIOMETRICS
by
Vikash Khatri
A Thesis Presented to the
FACULTY OF THE USC VITERBI SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(ELECTRICAL ENGINEERING)
August 2010
Copyright 2010 Vikash Khatri
Table of Contents
List of Figures
Abstract
Chapter 1. Introduction
1.1. Background
1.2. Related Work
Chapter 2. Calibration
2.1. Vanishing Points
2.2. Finding Parameters of Matrix M Using Vanishing Points
2.3. Example Videos
2.3.1. Top of the Building
2.3.2. On the Tripod
2.4. Conclusion and Analysis
Chapter 3. Detecting and Tracking People in Image
3.1. Background Subtraction
3.2. Integrating a Better Tracker
3.3. Performance
3.4. Conclusion and Analysis
Chapter 4. Head Detection
4.1. Related Work
4.2. Our Approach
4.3. Results
4.4. Conclusion and Analysis
Chapter 5. Height Estimation
5.1. Results
5.2. Conclusion and Analysis
Chapter 6. Color Estimation
6.1. Results
6.2. Conclusion and Analysis
Chapter 7. Synopsis
7.1. Conclusion
7.2. Future Work
References
Alphabetized References
List of Figures
Figure 1 - Surveillance Cameras
Figure 2 - Overview of approach
Figure 3 - Three noncoplanar poles can fix V_Y and V_L
Figure 4 - Image from the Example 1 input video
Figure 5 - Calibration Lines Example 1
Figure 6 - Vanishing Points Example 1
Figure 7 - Image from Example 2 input video
Figure 8 - Calibration Lines Example 2
Figure 9 - Vanishing Points Example 2
Figure 10 - People Detection Flow
Figure 11 - RGB to Gray Conversion
Figure 12 - Threshold Difference between N and N-1 Frame
Figure 13 - Person Blob
Figure 14 - Bounding Box
Figure 15 - Background Update Model
Figure 16 - Background Subtracted Head
Figure 17 - Head Detection Result
Figure 18 - Head Detection Diagram
Figure 19 - Head Detection False Alarm
Figure 20 - Measuring Height
Figure 21 - Results of Height Estimation
Figure 22 - Divided into head, torso and legs
Figure 23 - Color Estimation Algorithm
Figure 24 - Color Estimation Results
Abstract
The increasing use of surveillance cameras has created a need for intelligent surveillance systems which can identify events of interest in long video streams. Face detection is normally used for detecting and tracking people in automated surveillance systems, but face detection is resource intensive and is not a solved problem [13]. Soft biometric features like height and color have also been used for indoor surveillance cameras. In this thesis these soft biometric features are used to automate surveillance systems in outdoor environments, where identifying and tracking a person is an added challenge due to shadows and occlusions with non-person objects. In this work, we first calibrate our fixed camera with respect to the world, which is assumed to be planar. We then detect the moving pixels and construct blobs using contour tracing. Assuming each blob is a person, the height of the blob is determined with the help of the camera calibration parameters, and the blob is divided into three parts, i.e., head, torso and legs. The color of each part is estimated by averaging the R, G and B values of all the pixels in that part. Hence we have a set of four soft biometric features: height, color of head, color of torso and color of legs. These features are helpful when a huge amount of video is available in a database and the user has to find people in those videos who are, say, six feet tall and wearing a red shirt. The research is tested on five videos in which three people of different heights, wearing different clothes, are walking.
Chapter 1. Introduction
1.1. Background
The use of CCTV cameras is common everywhere for monitoring, video surveillance or simply webcasting an important event or place. There were approximately 500,000 cameras in London and approximately 4,200,000 in the UK alone according to a 2002 estimate [10]. It is a boring and repetitive job for a human to watch hours of video from multiple cameras to find suspicious events or any other event of interest. Intelligent video surveillance [9] can be designed to use computer vision techniques to identify the required events in a video, in order to assist the human user in finding relevant events and to avoid human error. Past videos can also be indexed in a database to answer explicit queries like "Find a person with height greater than 6 ft in a white shirt" or "Find a person wearing white shoes" when past events need to be searched.
Figure 1 - Surveillance Cameras
1.2. Related Work
In order to find a person in the camera view, face recognition methods are used [11, 12]. Face recognition is not a completely solved problem, and advanced face recognition methods [13] require high-resolution face images; the required level of detail is not easily available in surveillance camera systems. The face is also not necessarily visible in all situations, as people can be walking away from the camera or the target's face might be covered.
Jain et al. [1] note that biometrics is rapidly gaining acceptance because it can automatically recognize individuals based on their physiological and behavioral characteristics. Jain also argues that, in terms of efficiency and applicability, it is better to use multiple easily obtainable biometric features instead of a single biometric feature. Therefore we are using soft biometric features for classifying people in a video. Soft biometrics are human characteristics that provide information about the individual but are insufficient to differentiate any two individuals, and thus cannot identify an individual reliably and uniquely, due to their lack of distinctiveness and permanence [14].
Some soft biometric features require 3D information about the scene in order to find the height and size of the person in the world coordinate system. Ko et al. [15] use a stereo-camera-based 3D approximation for the identification of height and stride. Multiple-camera systems are harder to deploy everywhere because they multiply the number of cameras, and there are already millions of single cameras installed [10] which can be reused if a single-camera system is implemented.
We are therefore using a calibrated single-camera system in order to find the height, head and color of the person walking in the video stream. Each video from the camera is manually calibrated using the visible information in the video, and the height is then approximated for the detected person. The head is detected for a standing person, and the color of the shirt and jeans is also identified by finding the average color.
Figure 2 – Overview of approach
Chapter 2. Calibration
In order to find height and step size from video we need to know the transformation of a point from world coordinates to camera coordinates, described by the extrinsic parameters of the camera, and the projection onto the 2D image plane, described by the intrinsic parameters of the camera. The extrinsic parameters consist of a rotation and a translation, which together determine the position of the camera in the world. The intrinsic parameters define the projection of 3D points in the camera frame into the image, including geometric distortions from the lenses [2]. The goal of camera calibration is to find the extrinsic and intrinsic parameters of the camera. For a pinhole camera these parameters can be modeled using the 3x4 projection matrix M, and the relationship between a 3D point [X, Y, Z, 1]^T and its image projection [u, v, 1]^T can be written as

[u, v, 1]^T ~ M . [X, Y, Z, 1]^T

M contains five intrinsic parameters (focal length f, principal point (u_p, v_p), aspect ratio a and skew s) and six extrinsic parameters (the translation T_c has three parameters, plus a rotation around the Y-axis by the pan angle, a rotation around the X-axis by the tilt angle and a rotation around the Z-axis by the roll angle). The corresponding rotation matrices are R_Y, R_X and R_Z respectively. Accordingly, M can be represented as the product of the following matrices [2]:

M = A . R . [ I | -T_c ], where A is the intrinsic matrix formed from f, a, s and (u_p, v_p), and R = R_Z . R_X . R_Y
The parameters in the matrix M can be determined by solving equations if enough 3D
points and corresponding image points are available.
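For example, each known correspondence between a 3D point P_i = [X_i, Y_i, Z_i, 1]^T and its image point (u_i, v_i) contributes two linear equations in the entries of M (this is the standard direct linear transform formulation; the thesis does not spell it out):

u_i (m_3 . P_i) - (m_1 . P_i) = 0
v_i (m_3 . P_i) - (m_2 . P_i) = 0

where m_1, m_2 and m_3 are the rows of M. Since M has 11 degrees of freedom up to scale, six or more points in general position are enough to solve for it.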
2.1. Vanishing Points
In the case when enough 3D points are not available, the vanishing points in the X, Y and Z directions can be used to find the parameters of the M matrix [2, 3]. The vertical vanishing point V_Y is the intersection, in the image plane, of N vertical poles which are of the same height in the 3D world. For any two poles, the line connecting their tops and the line connecting their bottoms intersect at a point lying on a line V_L, as shown in figure 3.
Figure 3 - Three noncoplanar poles can fix V_Y and V_L
As shown in the figure, three or more poles can fix V_Y and V_L. Say the poles are defined as {(h_i, f_i)}, i = 1,...,N, where h_i and f_i are the image positions of the head and foot of the i-th pole, and {(Σ_hi, Σ_fi)} are the corresponding covariance matrices. V_Y is then computed using the method described in [4]: let m_i be the midpoint of h_i and f_i; V_Y is the point v that minimizes the sum of distances from h_i and f_i to the line linking m_i and v.
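One plausible form of this minimization (a reconstruction for reference; [4] additionally weights the distances by the covariance matrices) is

V_Y = argmin_v Σ_i [ d(h_i, l_i(v))^2 + d(f_i, l_i(v))^2 ]

where l_i(v) denotes the line through m_i and v, and d(.,.) is the point-to-line distance.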
Here (w_i, b_i) parameterizes the line determined by m_i and v. Let x_i = [x_i, y_i]^T be the point on V_L obtained from two poles (h_j, f_j) and (h_k, f_k) as the intersection of the lines h_j h_k and f_j f_k. The covariance matrix Σ_i of x_i is computed through the Jacobian of this intersection with respect to the pole endpoints. The line V_L, represented by its unit normal vector, is then determined by a fit over the points x_i weighted by their covariances Σ_i. Similarly, we determine V_X and V_Z using lines which are parallel to the ground and perpendicular to each other.
2.2. Finding Parameters of Matrix M Using Vanishing Points
Under the assumption of zero skew (s = 0) and unit aspect ratio (a = 1), the orthocenter of the triangle whose vertices are the three vanishing points is the principal point. Now the equation for M given above is applied to the 3D point [1, 0, 0, 0]^T, which projects to V_X in the image plane, and to [0, 1, 0, 0]^T, which projects to V_Y. For V_X we get

[u_X, v_X, 1]^T ~ A . [R_11, R_21, R_31]^T

and for V_Y we get

[u_Y, v_Y, 1]^T ~ A . [R_12, R_22, R_32]^T

Here R_11,...,R_33 are the elements of the rotation matrix R, (u_X, v_X) are the image coordinates of V_X and (u_Y, v_Y) are the image coordinates of V_Y. Solving these equations gives expressions for the focal length f and the rotation (pan and tilt) angles.
In our example we are given the distance of the person from the camera, so we do not need the complete translation vector T_C [8], as explained in the chapter on height estimation. Fengjun et al. [2] define a method to compute the translation vector and show that T_C2 is equal to the negative of the camera height. In order to compute the camera height we need some reference height in the image. Fengjun uses the person's height itself as the reference height, which cannot be used here because we are calibrating precisely in order to measure the person's height. To find the camera height, assume B is the bottom of the reference object, C is its top and D is the intersection of the horizon line with the line passing through V_Y, B and C. H is the reference height, which is taken to be 7 meters in our case, and H_C is the camera height. The function d(P1, P2) is the Euclidean distance between points P1 and P2. H_C can then be found by solving the resulting relation among these points, as given in [2].
2.3. Example Videos
2.3.1. Top of the Building
In this example the camera is on the roof of a building, and the person walks from approximately the bottom right of the scene toward the top left and then returns to the bottom right, as shown in figure 4.
Figure 4 - Image from the Example 1 input video
The vertical poles behind the trees, and the trees themselves, are used as vertical lines for the approximation of V_Y; all the vertical lines used in this example are shown in green in figure 5. The parking lines are used as horizontal lines for the approximation of V_X, shown in red in figure 5, and the lines of the building under construction in the background are used as lines for the Z-axis, shown in blue in figure 5. The person in the image is used for the reference height and is taken to be 5 ft 6 in. The correctness of the results depends on the accuracy of this reference height.
Figure 5 - Calibration Lines Example 1
The vanishing points and the principal point are shown in figure 6.
Figure 6 - Vanishing points example 1
2.3.2. On the Tripod
In this example the camera is on a tripod on the ground, so the camera tilt approaches zero, as shown in figure 7.
Figure 7 – Image from Example 2 input video
The person walks from left to right and, on the way back from right to left, he walks towards the camera. Camera calibration is critical in this scenario because when the person moves towards the camera his height increases dramatically in image space while his actual height stays the same. The problem with this video is that the tilt approaches zero. As shown in the next section, the tilt is important in height approximation, so we need to use as many vertical lines as are visible in the scene in order to find the correct tilt. Therefore we use the trees in the background as vertical lines for the approximation of V_Y; note that the lines are tilted slightly so that they meet at a finite V_Y. We also use the person's position at different places in the image as additional vertical lines for a better approximation of V_Y. The parking lines are used for V_X and the curb is used for V_Z. The color coding is the same as in example 1 and the figure is shown below with the lines highlighted.
Figure 8 - Calibration Lines Example 2
The vanishing points and the principal point are shown in figure 9.
Figure 9 - Vanishing Points Example 2
2.4. Conclusion and Analysis
The examples shown above show that we can calibrate a static camera if we are able to find enough vertical, horizontal and orthogonal lines in the image. If there are not enough vertical lines, we can use some moving object as a reference and find the lines accordingly. It is usually harder to get stable results when the camera is parallel to the ground, as the tilt angle approaches zero and the vanishing points tend to infinity. Since there is always a discretization error in finding lines in the image, and the lines are thick, one cannot find a perfectly accurate line; it is therefore helpful to tilt the lines slightly towards each other, so that an approximate stable result is obtained instead of an unstable infinite one.
The experiments show that increasing the number of sample lines gives better results. For example, in example 1, when only three lines were selected the result was very different and the vanishing points did not form a pyramid as they do in figure 6. There are two reasons why many lines give better results. First, when lines are selected in all corners of the image, the complete plane in the image is covered and the calibrator knows how the plane is drawn in the image; second, since the lines we draw are not accurate, the covariance-weighted fitting helps find the best solution when more lines are given to the calibrator.
The calibration is done manually and the equations are solved in MATLAB. The running time and performance depend on the number of lines chosen, the size of the image and the efficiency of the human user in entering the equations into MATLAB. Nevertheless, the calibration parameters for the five example videos are provided in the code for further experiments.
Chapter 3. Detecting and Tracking People in Image
In order to find the biometric features of a person it is important to locate and track the person in the image. There are many ways in computer vision to detect the desired object, such as object modeling [6], using stereo images [5], foreground segmentation [2] and more. We are using foreground segmentation because it can be applied without setting up cameras or the background in a defined arrangement, and it is a basic method which serves the purpose of height estimation. In addition this setup is widely applicable, as most environments use one camera for video surveillance of a particular scene. The object modeling method does not work when the object is occluded, and moving people in video surveillance cameras are almost always occluded, so the background subtraction method is a better choice. The drawback of the foreground segmentation method is that it relies on a static background and on the target object looking different from the background; if the object looks similar to the background, this method will not produce satisfying results. We use foreground segmentation with adaptive background subtraction, inspired by the dissertation of Kwangsu Kim, published at the University of Southern California in October 2007 [8]. The object detection system defined by Kim is slightly modified and shown in figure 10.
Figure 10 – People Detection Flow
In figure 10, R_FD is the person blob from the frame-difference operation, R_BS is the person blob from background subtraction and R_C is the person candidate blob. The use of adaptive background subtraction makes the system robust to sudden illumination changes and to temporary moving objects which are not human. Kim reports that the system adapts to lights turning on and off over at least 8 hours of continuous operation.
3.1. Background Subtraction
In the system each frame from the video is initially converted to gray scale using the
following formula for each pixel as shown in figure 11.
I_n(x) = (11*I_n(x_R) + 16*I_n(x_G) + 5*I_n(x_B)) / 32

Here I_n is the intensity of pixel x in the nth frame, and x_R, x_G and x_B are its red, green and blue components respectively. A pixel is considered foreground in the nth frame if the difference between its intensity in the nth frame and the intensity of the corresponding pixel in the (n-1)th frame is greater than a threshold th_frameDiff; the result is stored as a binary image, say binary1, as shown in figure 12.
(a) Original Image
(b) Grey Image
Figure 11 – RGB to Gray Conversion
(a) Grey Image
(b) Edge Image
Figure 12 – Threshold Difference between N and N-1 Frame
The background frame is initially the first frame of the video sequence, and it is then periodically updated. This background frame is converted to gray scale and then to an edge image using the Canny edge detection operation with a lower threshold of 5, an upper threshold of 25 and eight-point connectivity. The nth frame is also converted to an edge image with the same parameters; the difference between the background edge image and the nth-frame edge image is compared to the background threshold (which is part of the background model) and saved as a binary image, say binary2. The pixels which are high in both binary1 and binary2 are considered person candidates; they are determined by an AND operation between binary1 and binary2. The final result is shown in figure 13.
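As a rough illustration, the steps above can be sketched with OpenCV and NumPy as follows; the value of th_frame_diff and the simplified handling of the background threshold are my assumptions, not the thesis implementation.

    import cv2
    import numpy as np

    def moving_pixel_mask(frame_bgr, prev_gray, bg_gray, th_frame_diff=20):
        # Gray conversion with the 11/16/5 weights given in the text
        b, g, r = cv2.split(frame_bgr)
        gray = ((11 * r.astype(np.uint16) + 16 * g.astype(np.uint16)
                 + 5 * b.astype(np.uint16)) // 32).astype(np.uint8)

        # binary1 (R_FD): frame difference against the previous frame
        binary1 = (cv2.absdiff(gray, prev_gray) > th_frame_diff).astype(np.uint8)

        # binary2 (R_BS): difference of edge images of the current frame and background
        edges_cur = cv2.Canny(gray, 5, 25)
        edges_bg = cv2.Canny(bg_gray, 5, 25)
        binary2 = (cv2.absdiff(edges_cur, edges_bg) > 0).astype(np.uint8)

        # R_C: pixels that are high in both binary images
        return cv2.bitwise_and(binary1, binary2), gray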
Figure 13 – Person Blob
In this example, figure 12 and figure 13 show almost the same blob, but if there is a strong shadow or illumination change in the background, the result of background subtraction is much better than the simple frame difference.
Since the image contains only one person, the white pixels belong only to that person. The blob analysis therefore finds a box which contains all the white pixels, including ones which are separated by gaps of up to 10 pixels. The result with the bounding box is shown in figure 14.
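One way to realize this grouping, assuming an OpenCV pipeline (the closing-kernel size is an illustrative way to bridge gaps of roughly 10 pixels; the box-size limits are the ones quoted in the next paragraph):

    import cv2
    import numpy as np

    def person_boxes(mask):
        # mask: binary foreground image (non-zero = moving pixel)
        fg = (mask > 0).astype(np.uint8) * 255

        # Bridge gaps of up to about 10 pixels between foreground fragments
        kernel = np.ones((11, 11), np.uint8)
        merged = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)

        # One bounding box per connected blob, kept only if it is person sized
        contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        boxes = []
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            if 20 <= h <= 300 and 10 <= w <= 100:
                boxes.append((x, y, w, h))
        return boxes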
Figure 14 – Bounding Box
These blobs are accepted as person blobs only if they fall within an experimentally set size range of 20 to 300 pixels in height and 10 to 100 pixels in width. The background model consists of B_n(x), the background intensity image, and T_n(x), the threshold for the background. The variables α and β are the coefficients controlling the speed of change of the background, as defined by Kim in his dissertation, and their values are set to 0.9 and 5 respectively. The background model is updated in the following manner.
Figure 15 – Background update model
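The exact update rules are those of figure 15; a minimal sketch, assuming the common adaptive scheme in which stationary pixels are blended into the background and moving pixels are left untouched, might look like this:

    import numpy as np

    ALPHA, BETA = 0.9, 5.0  # coefficients quoted in the text

    def update_background(bg, th, gray, moving):
        # bg (B_n), th (T_n) and gray (I_n) are float arrays; moving is a boolean foreground mask
        stationary = ~moving
        bg[stationary] = ALPHA * bg[stationary] + (1 - ALPHA) * gray[stationary]
        th[stationary] = (ALPHA * th[stationary]
                          + (1 - ALPHA) * BETA * np.abs(gray[stationary] - bg[stationary]))
        return bg, th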
3.2. Integrating a Better Tracker
The accuracy of the estimated height depends on how well the bounding box fits the person. If the bounding box does not cover the person from head to feet, or if it includes more area than the person, then the resulting height will differ from the true height. The tracker also needs to bound the person consistently, irrespective of the person's speed, background motion and random noise. The basic tracker shown above may not be robust in all situations, therefore a better tracker can definitely bolster the performance of the system. In order to integrate a tracker, simply run the video through the tracker and store the results in a text file. The format of the text file should be such that the first line contains the starting and ending frame of the video sequence, and each subsequent line contains Cx, Cy, Sx, Sy, rotation, where (Cx, Cy) is the center of the bounding box, (Sx, Sy) is the size of the box and rotation is the rotation of the box.
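For illustration, a file in this format for a track running from frame 120 to frame 450 (all values are made up) could start like this:

    120 450
    312.5 204.0 45.0 118.0 0.0
    313.1 205.2 46.0 119.0 0.0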
3.3. Performance
It takes approximately 268 ms per frame to identify one blob and update the background accordingly for the large 1920x1080 frames used in all the example videos, i.e. roughly 3.7 frames per second. The time required increases as soon as there is more than one blob of moving pixels in the frame. Clearly we cannot process 30 frames per second when we need 268 ms per frame on an Intel Core 2 Duo machine with 4 GB RAM running Windows Vista; therefore this is an offline process. In order to do real-time background subtraction we would need to use smaller image frames or a more powerful image processor, as required by the application.
3.4. Conclusion and Analysis
The background subtraction method is commonly used for detecting and tracking people because it is easy to implement and intuitive: the moving pixels are the ones which are changing. In practice, since we need to compare each pixel with the background and the previous frame, it takes much longer than expected and may not be usable in real time when the frame size is large. In cases where the person is much smaller than the frame, we may want to use a model-fitting approach or a smarter technique that avoids comparing every pixel when we know there are no moving pixels in that part of the image. Nevertheless, it gives the required result in both cases, when the camera is on top of the building and when it is lying on the tripod. This means the background subtraction method is invariant to the size and skew of the person, which is a reason to use it instead of a model-fitting approach.
The background subtraction method does not support occlusion detection and cannot distinguish two people who walk close to each other, as they will be merged and classified as a single blob/person. In our scenario there is mostly a single person walking in a parking lot, so the person is classified properly, but problems occur when the person is occluded by a car or another object. In this situation we either keep the blob from the previous frame or ignore the frame until we get a reasonably sized blob again.
The shadow of the person also has an impact on background subtraction; it is reduced by comparing each current frame both to the background and to the (n-1)th frame. If only the current frame and the background frame are compared, we get an almost perfect result, as shown in figure 14. Under the same lighting conditions and with the person at a relatively close position, the shadow is removed when only the pixels that change between consecutive frames are considered. The impact of the shadow is also reduced by choosing proper thresholds for edge detection in the current and previous frames.
Nevertheless, the background subtraction method works well for our example videos and detects a correct bounding box around each person in about 86.7% of the frames. The results are not perfect when the person stops moving or takes a turn, because the moving pixels then do not cover the complete body of the person. The option of integrating a better tracker is provided in the system, which allows tracking to be done offline and the biometric features to be computed at run time.
Chapter 4. Head Detection
4.1. Related Work
The head position is important because it can be used to extract the facial features of the person and can also help in locating the torso and limbs [16]. The problem is hard in our case because the head has to be detected in side and rear views as well. Many methods have been used for frontal face detection, such as neural networks [17], support vector machines [18], Adaboost [19, 20] and the wavelet-based method of Schneiderman [20]. Zhao and Nevatia [21] have suggested an 'Ω'-shape fitting approach for detecting heads seen from the front and the rear, but it does not work when the person is moving sideways; during sideways motion the shape of the person is more like a 'D' than an 'Ω'. Kim [8] suggests combining both methods and training a Haar classifier for detecting head and shoulders. The problem with the Haar classifier is that it takes a lot of time for processing, since it has to be tried at multiple scales, so it may not be suitable for real-time applications.
4.2. Our Approach
As shown in the figure below, when the person is separated from the background in an upright position, the topmost pixels of the blob belong to the person's head.
Figure 16 - Background Subtracted Head
We exploit this property and start the head region where the blob's pixels first appear from the top. The head region ends when the horizontal extent of a row exceeds 50% of the blob width or when the region height exceeds 20% of the height of the person. The problem with this approach is that a head is always found in the top 20% of the detected person blob, even if there is no actual head, so there is no way to tell whether the pixels belong to a person or some other object. It can be assumed that in densely populated cities the video stream mostly contains people and most of the moving objects are people. In the example shown below the head of the person is properly identified throughout the video using this technique. This head detection technique is applied only after the Haar classifier for profile face detection provided by OpenCV has failed to identify the face of the person in the frame.
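A minimal sketch of this top-down scan over the binary person mask (the function and variable names are mine, not the thesis code):

    import numpy as np

    def detect_head_rows(mask):
        # mask: 2-D boolean/0-1 array of the person blob, cropped to its bounding box
        h, w = mask.shape
        rows = np.flatnonzero(mask.any(axis=1))
        if rows.size == 0:
            return None
        top = rows[0]
        limit = min(top + int(0.2 * h), h)   # head assumed to span at most 20% of body height
        bottom = top
        for y in range(top, limit):
            if mask[y].sum() > 0.5 * w:      # shoulders reached: row covers >50% of blob width
                break
            bottom = y
        return top, bottom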
Figure 17 - Head Detection Result
Figure 18 - Head Detection Diagram
4.3. Results
As described in our approach, there is a chance of detecting a head even when no head is visible, because we assume the top 20% of the person's body is the head. A successful result is shown in figure 17 and an example false alarm is shown in figure 19: the head is occluded by the pillar and yet a box is drawn around a presumed head.
Figure 19 - Head Detection False Alarm
These false alarms occur only when the face is occluded, which happens in 3 out of 4386 frames in the walk2 video; hence the false alarm rate is less than 0.07%. When the person moves far away into the background there is no bounding box surrounding the person and no head is detected. Conditioned on the 2459 frames in which the person is detected, the false alarm rate is 0.12%.
4.4. Conclusion and Analysis
The head detection method defined above assumes that the head is the topmost part of the body, so an easy way to fool it is for the subject to raise his hands above his head. This may not be a perfect head detector, but it works in our scenario, in which pedestrians generally walk with their hands down, and the error rate is therefore less than 1%. The Haar classifier method takes a large amount of time (>300 ms per frame) because the classifier has to be evaluated at all scales over the area of interest, which is the bounding box surrounding the person; even so, its results are not satisfactory, as it detects a face in only about 12 of the 4386 frames of the example in which the camera is on the roof and in about 189 of the 2492 frames of the example in which the camera is on the tripod. There are two reasons the Haar classifier does not work well: first, it is trained on examples with ellipse-like frontal faces, while our examples contain people walking sideways, producing a 'D' or partial 'D' shape; second, the faces are very small compared to the rest of the frame. If the face is zoomed in, we see pixelated faces which are barely distinguishable even by the naked eye. Hence the quality of head detection using the Haar classifier would improve if the person were closer to the camera and moving towards or away from it rather than sideways in front of it. Alternatively, the Haar classifier could be trained on the pixelated face images present in the videos, which would make it work on our samples but would not be a general solution. Therefore we use the method described in figure 18 for head detection, and it gives satisfactory results.
Chapter 5. Height Estimation
The calibration parameters are used to compensate for the camera distortion, and different geometric analyses can be applied to estimate the height from the bounding box surrounding the person. The top and bottom center points of the box can be converted to corresponding 3D points, and the distance between them can be taken as the height of the person [7]. This method requires the exact extrinsic parameters to compute the height, and we do not have the translation parameters, as concluded in the Calibration chapter of this document. Jeges et al. [2] describe three methods of height estimation: single-view measurement, multiple-view measurement and spatial-view measurement. Again, all of these methods require all the parameters of the calibration matrix M. Kim [8] suggests a method using only angles and the camera height to determine the person's height, and this is what is used in our system.
If we assume that people stand on the same planar surface and that the whole body of the person is visible, then we can identify the height of the person if we know the angle between the center of the image and the upper end of the detected person blob [8].
Figure 20 – Measuring Height
If we assume a pinhole camera, the angles (θ_3 - θ_2) and θ_1 can be calculated from the image, where h_image is the image height and θ_h is the view angle of the camera. The view angle is assumed to be 40° for the sake of computation. If we take the camera tilt to be θ_2, then we can obtain θ_3 and calculate the height of the person as in [8].
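The exact equations are Kim's [8]; a sketch of the underlying geometry, assuming a pinhole camera with zero roll above a flat ground plane (this is my reconstruction of figure 20, not necessarily the thesis equations), is:

    import math

    def estimate_height(y_head, y_feet, image_height, cam_height, tilt_deg, view_angle_deg=40.0):
        # y_head, y_feet: image rows (pixels from the top) of the person's head and feet
        half = image_height / 2.0
        tan_half_fov = math.tan(math.radians(view_angle_deg) / 2.0)
        tilt = math.radians(tilt_deg)

        def angle_below_axis(y):
            # angle between the optical axis and the ray through image row y (positive below the axis)
            return math.atan((y - half) / half * tan_half_fov)

        # ground distance to the feet, then height of the head ray above the ground at that distance
        d = cam_height / math.tan(tilt + angle_below_axis(y_feet))
        return cam_height - d * math.tan(tilt + angle_below_axis(y_head))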
5.1. Results
The estimated height is almost constant throughout each video, and the result is similar for the same person in two different videos. The average height estimate for each example video is given below.
Figure 21 - Results of Height Estimation
(a) Walk2 video, average height 5.67 ft
(b) Walk3 video, average height 5.65 ft
(c) Walk4 video, average height 6.36 ft
(d) Walk5 video, average height 6.42 ft
(e) Walk6 video, average height 6.44 ft
5.2. Conclusion and Analysis
The height estimation takes less than 3 ms on average because it involves a single equation with a tan() computation; all other parameters come from the camera calibration. The height estimation inherits errors from calibration and tracking, since the calibration is what converts the bounding-box height to the 3D world, and if the bounding box does not cover the person fully then the height is invalid. Still, the average height turns out to be reasonable and close to the true height of the person. It is important to note that the person is the same in example 1 and example 2 as shown in figure 21, and similarly the person is the same in example 3 and example 4; in both cases the average height of the person is similar across the different views. The height in example 5 is almost the same as the height in examples 3 and 4, and in this case we can use color to distinguish the different people.
Height has helped us distinguish two of the three people, therefore it can be said that height is a valid soft biometric parameter. The results would improve if the camera calibration parameters were provided with the video and, of course, if a better tracker were integrated with the system. In the current scenario all the height values lie within one standard deviation of the mean height.
Chapter 6. Color Estimation
Jain et al. [1] suggest that multiple biometric features can improve the identification process, since a single biometric feature can be wrong due to noise or miscalculation. Other soft biometric features could be skin color, a moustache, baldness, whether the person is carrying some object, or any other characteristic that can be noticed easily by the naked eye at a glance. Since the people in our system are quite small in the image, finding facial features is not easy and requires complex computation. We therefore use the color of the shirt or upper-body clothing as the secondary biometric feature of the person.
In order to find the color, the human body is divided into head, torso and legs. It is empirically found that the top 20% of the body contains the head, the next 40% contains the torso and the remaining 40% contains the legs. These proportions are roughly consistent with the golden ratio of the body, which says that (head + torso) / legs is approximately equal to 1.618. We therefore divide the body into three parts as shown in figure 22, using the percentages given above, and the average color of the middle box is computed.
Since the color of the shirt depends on the direction of the light and varies across the block, one cannot claim a single R, G and B value to be the color of the shirt. Therefore the per-frame average color is further averaged over the complete video sequence to obtain a single R, G and B value. This value also contains pixels surrounding the torso if the person is moving sideways, and it is the average of two colors if the person is moving towards the camera while wearing multiple visible layers. Nevertheless, in the examples shown below the final average color closely matches the top garment of the person walking in the video.
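A minimal sketch of this split-and-average step (figure 23), assuming the person crop is a NumPy BGR array; the names are illustrative:

    import numpy as np

    def part_colors(person_bgr):
        # Split the bounding-box crop into head (top 20%), torso (next 40%) and legs (bottom 40%)
        h = person_bgr.shape[0]
        head = person_bgr[: int(0.2 * h)]
        torso = person_bgr[int(0.2 * h): int(0.6 * h)]
        legs = person_bgr[int(0.6 * h):]
        # Average color of every pixel in each part
        return [part.reshape(-1, 3).mean(axis=0) for part in (head, torso, legs)]

    # The per-frame averages are then averaged again over the whole video, e.g.:
    # torso_color = np.mean([part_colors(crop)[1] for crop in person_crops], axis=0)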
Figure 22 – Divided into head, torso and legs
Figure 23 – Color Estimation Algorithm
6.1. Results
Video                                  Head Color (R, G, B)   Torso Color (R, G, B)   Leg Color (R, G, B)
Walk2 (same person as in Walk3)        84, 64, 53             55, 54, 52              29, 39, 41
Walk3 (same person as in Walk2)        93, 70, 78             36, 33, 44              41, 44, 49
Walk4 (same person as in Walk5)        141, 114, 107          195, 198, 213           168, 151, 135
Walk5 (same person as in Walk4)        158, 137, 144          210, 214, 266           139, 120, 105
Walk6                                  138, 106, 107          37, 38, 33              204, 198, 198
Figure 24 – Color Estimation Results
6.2. Conclusion and Analysis
The color estimation inherits errors from the head detection algorithm and from the accuracy of the tracker's bounding box. When we determine the average color in a box, not all the pixels in the box belong to the body of the person, as shown in figure 22. Moreover, the head box contains hair, eyes and edges around the neck, so the head color cannot be used to distinguish the same person across different videos. On the other hand, the colors of the torso and legs do look like distinguishing features: they are similar for the same person in different videos and different for different people. For example, the person in walk4 and walk5 can be distinguished from the person in walk6 because the color ranges are visibly different and are also far apart numerically.
These results could be improved by improving the picture quality of the video; in the current samples the frames are large, but the person looks blocky when the image is zoomed in on the head or body. The results could also be improved by using only the moving pixels, so that only pixels from the person's body are used rather than the complete bounding box. However, the overhead of these refinements outweighs the improvement they would bring, so they are traded off for the performance of the system. Nevertheless, the torso and leg colors can distinguish the person in walk6 from the person in walk5 and walk4, which height alone could not do. Hence these can be used as valid soft biometric features.
Chapter 7. Synopsis
7.1. Conclusion
We have used soft biometric features, namely height and body color, in order to classify three people in five different videos. The videos were shot from a static camera on top of a building and from a tripod, which are common scenarios in surveillance systems. The camera calibration is performed by finding orthogonal lines in three directions and determining the three vanishing points V_X, V_Y and V_Z [21]. The principal point and tilt angle of the camera are extracted from the vanishing points and used in the height estimation of the person. In order to estimate the height we need to detect and track people in the video, which is done using foreground separation, i.e. moving pixel detection [2]. The accuracy of the feature extraction relies on the accuracy of the camera calibration and of the bounding box produced by the tracker. The height of the person is computed from the camera tilt, the assumed view angle and the extent of the bounding box, as described in the Height Estimation chapter.
The body of the person is divided into three parts, head, torso and legs, by treating the head as the topmost 20% of the blob, or the region above the point where the shoulders begin and a row occupies more than 50% of the horizontal width of the blob. The remaining part of the body is divided into two halves for the torso and the legs. The average color in each of these parts is the arithmetic mean of the R, G and B values of all the pixels, and these are called the head color, torso color and leg color respectively.
Hence it can be concluded that soft biometric features can be used to classify a set of people into categories of the user's choice; they are not meant to identify each individual. This is demonstrated in the examples above: height alone classified the three persons into two categories (say, taller than 6 ft and shorter than 6 ft), and when we add the dimension of color we can separate all three people in the five videos. In a real situation where a large number of people move in front of the camera, we can categorize people based on their color and height, but people cannot be individually identified unless face detection or some other biometric feature is implemented.
7.2. Future Work
We have designed a basic framework for using soft biometric features in intelligent surveillance, and it needs improvement to become a real-time system. Future work should focus on implementing more soft biometric features and on optimizing these features to run fast enough to process at least 30 frames per second. Stride and cadence [7], for example, are important features that can classify people when enough samples are present. Skin color could be extracted by identifying the direction of the head and, when the head faces the camera, sampling the skin pixels while excluding the eyes and hair. For color it is better to use a Gaussian probabilistic model for comparison instead of a single color, because the color varies depending on the exposure to light. A moustache, hair color, a beard and whether the person is carrying any prop are other possible soft biometric features which could be implemented in the future.
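As an illustration of the Gaussian color model suggested above, one could fit a mean and covariance to the torso pixels gathered over a video and compare colors with a Mahalanobis distance (a sketch; the function names are mine):

    import numpy as np

    def fit_color_model(pixels):
        # pixels: N x 3 array of torso pixels collected over a video
        pixels = pixels.reshape(-1, 3).astype(float)
        return pixels.mean(axis=0), np.cov(pixels, rowvar=False)

    def color_distance(color, model):
        # Mahalanobis distance of an observed average color from the fitted model
        mean, cov = model
        diff = np.asarray(color, dtype=float) - mean
        return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))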
After determining the soft biometric features, the next step is to automatically classify people based on those features. The color estimation results show that some features, like the head color, may not be useful for classification, because it differed even for the same people in different videos. Therefore the important features should first be selected using Principal Component Analysis or another feature prioritization technique, depending on the application. Once we have a set of features which can classify the subjects according to our requirements, we can use a good classifier to identify and highlight people in the video as requested by the user, or trigger an action such as turning on an alarm or a red light when a person with certain biometric features is identified.
References
[1] A. Jain “Can soft biometric traits assist user recognition?” Proceedings of SPIE 2004
[2] E. Jeges, I. Kispál, Z. Hornák, “Measuring human height using calibrated cameras” Conference
on Human System Interaction 2008
[3] R. Cipolla, T. Drummond, and D.P. Robertson, “Camera Calibration from Vanishing Points in Images of Architectural Scenes,” Proc. British Machine Vision Conf., vol. 2, pp. 382-391, 1999.
[4] D. Liebowitz, A. Criminisi, and A. Zisserman, “Creating Architectural Models from Images,” Proc. EuroGraphics, vol. 18, pp. 39-50, 1999.
[5] H. Wang, R. Lu, X. Wu, L. Zhang, J. Shen, "Pedestrian Detection and Tracking Algorithm Design in Transportation Video Monitoring System", 2009 International Conference on Information Technology and Computer Science.
[6] B. Wu, R. Nevatia, “Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors”, ICCV, Volume I, pp. 90-97, Beijing, China, October 2005.
[7] C. BenAbdelkader, R. Cutler, L. Davis, "Stride and Cadence as a Biometric in Automatic Person Identification and Verification", Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 2002.
[8] K. Kim “ROBUST REAL-TIME VISION MODULES FOR A PERSONAL SERVICE ROBOT IN A HOME
VISUAL SENSOR NETWORK” University of Southern California 2007.
[9] W. Hu, T. Tan, L.Wang, and S. Maybank. A survey on visual surveillance of object motion and
behaviors. IEEE Trans. Syst., Man, Cybern., Pt. C, 34(3):334–352, 2004.
[10] M. McCahill, C. Norris "CCTV in LONDON" Working paper no.6, UrbanEye June 2002
[11] J. Sivic, M. Everingham, and A. Zisserman. Person spotting: video shot retrieval for face sets.
In International Conference on Image and Video Retrieval, 2005.
[12] “Cognitec | The Face Recognition Company” April 30, 2010. Available:
http://www.cognitec-systems.de/. [Accessed: June 25, 2010].
[13] “Face Recognition Vendor Test” Mar 19, 2010. Available: http://www.frvt.org/. [Accessed:
June 25, 2010].
[14] M. Kruger, A. Rosiers, T. McKenna, “Automated Entity Classification in Video Using Soft Biometrics”, Navy SBIR 2008.1 - Topic N08-077.
[15] J. Ko, J. Jang, E. Kim "Intelligent person Identification system using stereo camera - based
height and stride estimation" Proc. of SPIE Vol. 5817 2005
[16] M. W. Lee, I. Cohen, "A Model-Based Approach for Estimating Human 3D Poses in Static
Images", IEEE Trans. on PAMI, vol. 28, No. 6, pp. 905-916, 2004
[17] H. Rowly, S. Baluja and T. Kanade, "Neural network based face detection", IEEE Trans. on
PAMI, vol 20. pp.23-38, 1998
[18] E. Osuna, R. Freund and F. Girosi, "Training Support Vector Machines: an Application to
Face Detection", CVPR, pp. 130-136, San Juan, Puerto Rico, 1997
[19] P. Viola and M. J. Jones, “Robust real-time face detection”, IJCV, vol. 57(2), pp. 137-154, 2004.
[20] H. Schneiderman and T. Kanade, "A Statistical Method for 3D Object Detection Applied to
Faces and Cars", CVPR, vol. 1, pp. 1746-1753, 2000
[21] T. Zhao, R. Nevatia, “Tracking Multiple Humans in Complex Situations”, IEEE Trans. on PAMI,
vol. 26. No. 9, pp. 1208-1221, 2004
Alphabetized References
[7] C. BenAbdelkader, R. Cutler, L. Davis, "Stride and Cadence as a Biometric in Automatic Person Identification and Verification", Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 2002.
[3] R. Cipolla, T. Drummond, and D.P. Robertson, “Camera Calibration from Vanishing Points in Images of Architectural Scenes,” Proc. British Machine Vision Conf., vol. 2, pp. 382-391, 1999.
[12] “Cognitec | The Face Recognition Company” April 30, 2010. Available:
http://www.cognitec-systems.de/. [Accessed: June 25, 2010].
[13] “Face Recognition Vendor Test” Mar 19, 2010. Available: http://www.frvt.org/. [Accessed:
June 25, 2010].
[9] W. Hu, T. Tan, L.Wang, and S. Maybank. A survey on visual surveillance of object motion and
behaviors. IEEE Trans. Syst., Man, Cybern., Pt. C, 34(3):334–352, 2004.
[1] A. Jain “Can soft biometric traits assist user recognition?” Proceedings of SPIE 2004
[2] E. Jeges, I. Kispál, Z. Hornák, “Measuring human height using calibrated cameras” Conference
on Human System Interaction 2008
[8] K. Kim “ROBUST REAL-TIME VISION MODULES FOR A PERSONAL SERVICE ROBOT IN A HOME
VISUAL SENSOR NETWORK” University of Southern California 2007.
[15] J. Ko, J. Jang, E. Kim "Intelligent person Identification system using stereo camera - based
height and stride estimation" Proc. of SPIE Vol. 5817 2005
[14] M. Kruger, A. Rosiers, T. McKenna “ Automated Entity Classification in Video Using Soft
Biometrics”: Navy SBIR 2008.1 - Topic N08-077
[16] M. W. Lee, I. Cohen, "A Model-Based Approach for Estimating Human 3D Poses in Static
Images", IEEE Trans. on PAMI, vol. 28, No. 6, pp. 905-916, 2004
[4] D. Liebowitz, A. Criminisi, and A. Zisserman, “Creating Architectural Models from Images,”
Proc. EuroGraphics, vol. 18, pp. 39-50, 1999.
[10] M. McCahill, C. Norris "CCTV in LONDON" Working paper no.6, UrbanEye June 2002
[18] E. Osuna, R. Freund and F. Girosi, "Training Support Vector Machines: an Application to
Face Detection", CVPR, pp. 130-136, San Juan, Puerto Rico, 1997
[17] H. Rowly, S. Baluja and T. Kanade, "Neural network based face detection", IEEE Trans. on
PAMI, vol 20. pp.23-38, 1998
[20] H. Schneiderman and T. Kanade, "A Statistical Method for 3D Object Detection Applied to
Faces and Cars", CVPR, vol. 1, pp. 1746-1753, 2000
[11] J. Sivic, M. Everingham, and A. Zisserman. Person spotting: video shot retrieval for face sets.
In International Conference on Image and Video Retrieval, 2005.
[19] P. Viola and M. J. Jones, “Robust real-time face detection”, IJCV, vol. 57(2), pp. 137-154, 2004.
[5] H. Wang, R. Lu, X. Wu, L. Zhang, J. Shen, "Pedestrian Detection and Tracking Algorithm Design in Transportation Video Monitoring System", 2009 International Conference on Information Technology and Computer Science.
[6] B. Wu, R. Nevatia, “Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors”, ICCV, Volume I, pp. 90-97, Beijing, China, October 2005.
[21] T. Zhao, R. Nevatia, “Tracking Multiple Humans in Complex Situations”, IEEE Trans. on PAMI, vol. 26, No. 9, pp. 1208-1221, 2004.