Landmark Detection for Faces in the Wild
by
KangGeon Kim
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2018
Copyright 2018 KangGeon Kim
Acknowledgement
First and foremost, I would like to thank my advisors, Professor Gérard Medioni and Professor Louis-Philippe Morency. They have been supportive and patient, providing me with insightful guidance to overcome all difficulties along the way to completing this dissertation. Their enthusiasm for research has been a great inspiration in my computer vision research. Pursuing a Ph.D. is a long journey, and their wise guidance and encouragement have helped me to keep moving forward. My Ph.D. work would not have been finished without their constructive advice and marvelous support.

I would also like to thank the members of my Ph.D. committee, Professor Laurent Itti and Professor Alexander Sawchuk, for their insightful advice. Also, I want to thank Professor Aiichiro Nakano and Professor Joseph Lim for joining my qualifying exam committee. I am really grateful for their valuable feedback. I am also extremely grateful to Professor Ram Nevatia, Dr. Jongmoo Choi, Professor Tal Hassner and Dr. Iacopo Masi for being such good mentors and friends of mine. They always provided help and constant cheerfulness whenever I was facing difficulties with a problem.

All of the past and current members of the USC Computer Vision Lab also made my time at USC special: Dr. Anh Tran Tuan, Dr. Matthias Hernandez, Dr. Sungchun Lee, Dr. Young Hoon Lee, Jungyeon Kim, Fengju Chang, Zhenheng Yang, Kan Chen, Gozde Sahin, Jatuporn Leksut, Dr. Borjeng Chen, Dr. Zhuoliang Kang, Dr. Ruizhe Wang, Dr. Tung-Sing Leung, Dr. Shay Deutsch, Dr. Prithviraj Banerjee, Dr. Pramod K. Sharma, Dr. Chen Sun, Dr. Yinghao Cai, and Rama Reddy Knv. You all have been such a blessing to me. I wish you all the best.

Also, I am thankful to my family in Korea for their endless love and support across thousands of miles. Last but not least, I would like to express special thanks to my wife, JyEun, and two sons, YeJune and SeoJune, for their unconditional love and unwavering support. This dissertation is dedicated to you.
Table of Contents

Acknowledgement
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Problem statement
  1.2 Challenges
    1.2.1 Facial landmark detection
    1.2.2 Face recognition
  1.3 Contributions
  1.4 Outline

2 Related Work
  2.1 Face detection
  2.2 Facial landmark detection
  2.3 Head pose estimation
  2.4 Landmark confidence
  2.5 Face recognition

3 Holistically Constrained Local Model
  3.1 Model
    3.1.1 Constrained Local Model
    3.1.2 Holistically Constrained Local Model
    3.1.3 Two holistic predictors: sparse landmarks and head pose
  3.2 Experiments
    3.2.1 Comparison with baseline methods
    3.2.2 Ablation experiments
    3.2.3 Datasets
    3.2.4 Methodology
    3.2.5 Error metric
  3.3 Results and discussion
    3.3.1 Facial landmark detection
    3.3.2 Ablation experiments and sparse landmark detection
    3.3.3 Head pose estimation
  3.4 Conclusions

4 Local-Global Landmark Confidences for Face Recognition
  4.1 Local-Global Landmark Confidences
    4.1.1 Constrained Local Confidence
    4.1.2 Rendering based Global Confidence
  4.2 Application on face recognition
    4.2.1 Convolutional Neural Network for face recognition
    4.2.2 Applying Landmark Confidences on face recognition
  4.3 Experiments
    4.3.1 Datasets
    4.3.2 Methodology
  4.4 Results and discussion
    4.4.1 Evaluation of Local Confidence
    4.4.2 Evaluation of Global Confidence
    4.4.3 Confidences analysis for face recognition
    4.4.4 Local and Global Confidences for face recognition
  4.5 Conclusions

5 Face and Body Association for Video-based Face Recognition
  5.1 Face and Body Association
    5.1.1 Face detection using YOLO cascade
    5.1.2 Body pose estimation
    5.1.3 Data association
  5.2 Experiments
    5.2.1 Datasets and metrics
    5.2.2 Ablation study
    5.2.3 Comparison with the state-of-the-art
  5.3 Conclusions

6 Conclusions and Future work
  6.1 Contributions
  6.2 Future work

Reference List
List of Tables

3.1 The structure of our convolutional network used for sparse landmark detection and head pose estimation.
4.1 The structure of our convolutional network used for global confidence estimation.
4.2 Average performances of the score fusion on JANUS CS2 dataset with local and global confidences.
4.3 Performance comparisons of the proposed method to the previous face recognition approaches on JANUS CS2 dataset.
4.4 Average performances of the score fusion on IJB-A dataset [28] with local and global confidences.
5.1 Ablation experiments. Average performances of Gallery 1 and 2 in Protocol 6 of the JANUS CS3 using CNN Model 1.
5.2 Comparison with the state-of-the-art. Performances on Gallery 1 in Protocol 6 of the JANUS CS3.
5.3 Comparison with the state-of-the-art. Performances on Gallery 2 in Protocol 6 of the JANUS CS3.
5.4 Comparison with the state-of-the-art. Average performances of Gallery 1 and 2 in Protocol 6 of the JANUS CS3.
List of Figures

1.1 The variety of facial appearance caused by many factors such as pose, image quality, illumination, and expression.
3.1 Illustration of CLM fitting: (i) a local search for feature locations to get the response maps and (ii) an optimization strategy to maximize the responses of the Point Distribution Model constrained landmarks. Figure taken from [59].
3.2 The overview of our framework. The HCLM model integrates sparse and dense landmark detection together in a unified framework.
3.3 Pipeline of our method. We first start with CNN head pose estimation, which allows us to choose the appropriate CNN sparse landmark detector; finally, this is followed by HCLM-based dense landmark detection.
3.4 Example detection results on 300-W and IJB-FL. Each column presents images from the subsets of 300-W (HELEN, IBUG and LFPW) and IJB-FL.
3.5 Cumulative error curves on 300-W. Measured as the mean Euclidean distance from ground truth normalized by the inter-ocular distance.
3.6 Cumulative error curves on IJB-FL. Measured as the mean Euclidean distance from ground truth normalized by the face size. Note that we use 49 points for both frontal and profile images.
3.7 Ablation experiments and sparse landmark error on IJB-FL. (a) We use 68 points for comparison. Our approach is robust in the presence of large head pose variation in profile images. (b) Cumulative error of sparse landmarks on profile images. Note that sparse landmarks from the CNN are refined by the HCLM model.
3.8 Head pose estimation results from 300-W and AFLW. (a) and (b) show the mean and median of absolute errors for pitch, yaw and roll respectively, while (c) shows predicted head pose sample images.
4.1 The structure of our convolutional network used for global confidence estimation.
4.2 The overview of our face recognition framework and the way of generating two types of landmark confidences. The probe and gallery images are from the JANUS CS2 dataset. The fusion of matching scores from 2D alignment and cropping only is achieved with the help of landmark confidences.
4.3 Correlation between local confidence and eye-distance-normalized RMS error. When the local confidence is higher, the RMS error is smaller.
4.4 Example predicted scores from the CASIA test set. Images rendered to profile view (75°). Higher numbers are better.
4.5 Correlation between confidences and matching scores of genuine pairs. Image-to-image matching score (left) and histogram of scores of both high and low confidence pairs (right).
4.6 Comparative performance (ROC) on JANUS CS2 dataset for our confidence-based fusion method for face verification. Only TARs at 0% to 1% FARs are shown.
4.7 Comparative performance (CMC) on JANUS CS2 dataset for our confidence-based fusion method for face identification.
5.1 Overview of our video-based face recognition system. The probe videos and gallery images are from the JANUS CS3 dataset. It is difficult to match the faces in the gallery with just the annotated target face from a single frame (depicted with a red box) in the probe video. Our association method provides a much richer description of the probe subject.
5.2 ROC curves of YOLO Cascade and other face detection methods benchmarked on the FDDB dataset.
5.3 Example of body pose estimation from OpenPose on challenging JANUS CS3 video frames. Note how the use of an upper-body detector is able to tackle challenging cases in which the person is partially visible or when the frame contains talking heads.
5.4 Sample frames from the JANUS CS3 dataset. The target face annotated with a red box appears in multiple shots in the video. The detected face is shown with a yellow box and the upper body is shown with a blue box. While tracking by using only detected faces can be difficult due to large scale changes and low resolutions, the upper body can be more discriminative in the association.
5.5 Some representative faces that can be found in JANUS CS3 videos: there are different cases of face appearance in videos. For frontal or near-frontal faces, it is easy to obtain the face tracks (a). For faces undergoing pose variations and appearing in different scenes, the proposed face and body association (FBA) is required (b)(c)(d).
5.6 Sample associated faces in JANUS CS3 video. (a) A large number of people appear and move to another place. (b) The target subject undergoes abrupt pose changes and the video contains severe motion blur.
Abstract
Facial landmark detection has received much attention in recent years, with two detection paradigms emerging: local approaches, where each facial landmark is modeled individually with the help of a shape model; and holistic approaches, where the face appearance and shape are modeled jointly. In recent years, both of these approaches have shown great performance gains for facial landmark detection even under real-world conditions (referred to as in-the-wild) of varying illumination, occlusion and image quality. However, their accuracy and robustness are very often reduced for profile faces, where face alignment is more challenging (e.g., no more facial symmetry, less defined features and more variable background). Meanwhile, a key to successful face recognition is an accurate and reliable face alignment using automatically detected facial landmarks. Given this strong dependency between face recognition and facial landmark detection, robust face recognition requires knowledge of when the facial landmark detection algorithm succeeds and when it fails. Facial landmark confidence represents this measure of success.

In this dissertation, we present the Holistically Constrained Local Model (HCLM), which unifies local and holistic facial landmark detection by integrating head pose estimation, sparse-holistic landmark detection and dense-local landmark detection. HCLM shows state-of-the-art performance, especially with extreme head poses. We also present two methods to measure landmark detection confidence: local confidence based on local predictors of each facial landmark, and global confidence based on a 3D rendered face model. A score fusion approach is also introduced to integrate these two confidences effectively. It shows a significant accuracy improvement when the face recognition algorithm integrates the local-global confidence metrics.

Lastly, we extend the face recognition domain from still images to videos. Videos, unlike still images, offer a myriad of data for face modeling, sampling, and recognition, but, on the other hand, contain low-quality frames and motion blur. A key component in video-based face recognition is the way in which faces are associated through the video sequence before being used for recognition. We present a video-based face recognition method taking advantage of face and body association (FBA). To track and associate subjects that appear across frames in multiple shots, we solve a data association problem using both face and body appearance. The final recovered track is then used to build a face representation for recognition. Our end-to-end system shows state-of-the-art performance on the latest challenging video-based face recognition benchmark.
Chapter 1
Introduction
1.1 Problem statement
Facial landmarks, also known as facial feature points, are mainly located around facial components such as the eyes, nose, mouth, and chin. Facial landmark detection refers to locating a certain number of points of interest in an image of a face. It usually starts from a rectangular bounding box returned by a face detector [70], which indicates the face location. Facial landmark detection is an essential preliminary step for a number of facial analysis tasks such as expression analysis, animation, 3D face modeling, facial attribute analysis [86] and face recognition. Accurate facial landmarks are required in order to align faces between images, improving the robustness of face analysis. For example, when face alignment is performed with poorly detected facial landmarks, face recognition is likely to fail. Zhang et al. [86] show the importance of alignment by proposing a pose-aligned network.

The problem of automatic facial landmark detection has seen much progress over the past years [9, 2, 73, 85, 90]. However, most of the state-of-the-art methods still struggle in the presence of extreme head pose, especially in challenging in-the-wild images. Furthermore, most of the methods [73, 55, 85] rely on a good and consistent initialization, which is often very difficult to achieve.
In this dissertation, we present the Holistically Constrained Local Model (HCLM), which unifies local and holistic facial landmark detection by integrating head pose estimation, sparse-holistic landmark detection and dense-local landmark detection. Our method's main advantage is the ability to handle very large pose variations, including profile faces. Furthermore, our model integrates local and holistic facial landmark detectors in a joint framework, with a holistic approach narrowing down the search space for the local one.

Moreover, while the main goal of this research is to develop methods for more reliable detection of facial landmarks, we also propose a new method for measuring landmark confidence, as a confidence can tell us the degree of certainty that a prediction is correct. One is the constrained local method, and the other is the rendering-based global method. The local method can measure accuracy based on the local predictors of each facial landmark, and the global method can predict the confidence from 3D rendered faces. While the local confidence measures the goodness of landmarks based on the 2D aligned images, the global confidence measures it from the 3D rendering point of view. The merit of using these two confidences is that they can help alleviate the influence of poorly aligned face images on face recognition so as to improve the recognition accuracy.
Meanwhile, we extend the face recognition domain from still images to videos. Unlike still images, videos have spatial-temporal information consisting of a series of frames. Therefore, by their very nature, videos offer a far wider amount of data that is available to describe the subject; it is in principle possible to collect samples of the desired subject under a wide range of conditions. These samples can thus be used to improve the accuracy of video-based recognition methods. Although a video can offer plenty of data, the preprocessing steps to correctly gather this information remain challenging. Usually these steps refer to three components: (1) face detection, (2) face tracking and (3) association. Beyond association, face recognition itself is difficult due to pose changes, motion blur and illumination variations, including cases such as low video quality and faces at small scales.
In general settings, subjects in a video can appear repeatedly over multiple shots and scenes. To simplify the task, many studies on video-based face recognition provide bounding boxes or perform recognition on consecutive frames in a shot. Well-known datasets, such as the YouTube Faces Database [72] and Celebrity-1000 [36], provide a bounding box for each frame and recognition is performed based on this box. A much more challenging setting is when the subject is annotated only once in a video stream and the video also contains multiple people or scene changes. Therefore, to keep track of the desired subject appearing in multiple shots or scenes, it is necessary to associate the person's position across multiple scenes and shots.

In this dissertation, we present a novel method for face and body association (FBA) in videos to improve face recognition. Our contribution is the following: a more robust face representation can be obtained by taking advantage of person association in a video, not only using face association [11], but supporting the latter with additional body information. Although this is related to person re-identification [34, 35, 33], in our case, the re-identification using body cues is performed within the same video. We support our claim by showing that the use of the body consistently improves performance on all the face recognition metrics. Different from the video-based approach in [11], our association method does not rely only on the detected faces, but also uses the body information to guide the data association within the video. This allows the tracking of targets even when facial details are not clearly visible or when the face representation is not reliable due to low resolution or small scale. Using this approach, we can obtain a wider set of facial frames representing the target subject. These associated face images are used to compute a representation for face recognition.
1.2 Challenges
1.2.1 Facial landmark detection
Facial landmark detection in the wild is still considered a challenging problem. This is mainly due to the variety of facial appearance caused by many factors such as pose, illumination, skin tone, identity, image quality (blur or low resolution), expression, occlusion and various accessories (hats, glasses, scarves, etc.).

• Head pose: Head pose has a big impact on landmark detection. In particular, if faces have extreme poses (e.g., profile faces), facial symmetry cannot be used and landmarks must be detected from less defined features. Since feature visibility is also limited, a self-occlusion problem occurs. Therefore, pose is the most challenging problem in landmark detection.

• Illumination: Facial appearance is highly affected by lighting conditions. When the illumination direction changes, the location and shape of shadows shift, highlights change, and the contrast gradient reverses. These changes due to the illumination make landmark detection challenging.

• Image quality: Head pose and illumination variations are not the only factors affecting facial landmark detection accuracy. Low quality images are one of the factors that make landmark detection difficult. These images are often captured from video frames rather than still images from a camera. Blurring is usually caused by movement of the subject or the video recorder itself, and images are sometimes converted to low resolution during the encoding process. It is difficult to extract accurate features from low quality images.

• Expression: Landmark detection accuracy also suffers in the presence of extreme variations of expression. Expression is a facial activity caused by a set of muscle movements, causing large variations in appearance mainly around facial parts. This facial deformation not only makes it difficult to find the facial landmarks, but it also makes visual-based expression analysis difficult. In fact, most of the errors occur when the lip movement cannot be reliably detected.

Figure 1.1: The variety of facial appearance caused by many factors such as pose, image quality, illumination, and expression. (a) Head pose and image quality; (b) illumination and expression.

In recent years, the state-of-the-art facial landmark detection methods have shown great performance even under the challenges mentioned above, such as varying illumination, image quality, expression, and occlusion. However, their accuracy and robustness are frequently reduced with extreme head poses. Hence, in the scope of this dissertation, we focus on handling the head pose issues.
1.2.2 Face recognition
Face recognition is also affected by many factors such as large pose variations, illumination (uncontrolled lighting conditions), facial expression, and image quality. Occlusion, hair, make-up, and different ages also make face recognition tasks challenging. Similarly, in scenarios such as visual surveillance, videos are often acquired in uncontrolled situations or from moving cameras. There are several challenges and key factors that can impact face recognition performance significantly, as well as other factors that can affect matching scores. In addition to the four challenge factors mentioned for facial landmark detection, two additional factors are as follows:

• Ageing: Ageing can severely affect the performance of face recognition. In general, the effect of age variation is not commonly considered in face recognition research. One of the main reasons for the small number of face recognition studies concerning the age factor is the absence of representative public databases with images containing individuals of different ages, as well as the low quality of old images. It is very difficult to collect a dataset of face images that contains the same person's images taken at different ages of life.

• Occlusion: Faces can be partially occluded by other objects. In an image with a group of people, some faces can be partially occluded by other faces or objects, so only a small part of the face is available in many cases. This makes face detection difficult, and even if a face is found, the recognition itself will be difficult due to the hidden part of the face.
1.3 Contributions
We propose a new landmark detection model for faces in the wild. Compared to the state-of-the-art, this dissertation provides strong contributions in terms of novelty. Moreover, we propose a novel method for face and body association (FBA) in videos to improve face recognition. Our contributions are the following:

• Holistically Constrained Local Model
  - We present a novel Holistically Constrained Local Model (HCLM), which unifies local and holistic facial landmark detection by integrating head pose estimation, sparse-holistic landmark detection and dense-local landmark detection.
  - Our method's main advantage is the ability to handle very large pose variations, including profile faces. Furthermore, our model integrates local and holistic facial landmark detectors in a joint framework, with a holistic approach narrowing down the search space for the local one.

• Local-Global Landmark Confidences for Face Recognition
  - We propose a new method for measuring landmark confidence. One is the constrained local method, and the other is the rendering-based global method. The local method can measure accuracy based on the local predictors of each facial landmark, and the global method can predict the confidence from 3D rendered faces.
  - While the local confidence measures the goodness of landmarks based on the 2D aligned images, the global confidence measures it from the 3D rendering point of view.
  - The merit of using these two confidences is that they can help alleviate the influence of poorly aligned face images on face recognition so as to improve recognition accuracy.

• Face and Body Association for Video-based Face Recognition
  - We present a novel method for face and body association (FBA) in videos to improve face recognition. A more robust face representation can be obtained by taking advantage of person association in a video, not only using face association, but supporting the latter with additional body information.
  - This allows the tracking of targets even when facial details are not clearly visible or when the face representation is not reliable due to low resolution or small scale. The upper body of the subject contains enough discriminative information that can be leveraged to aid the association within the video.
  - Using this approach, we can obtain a wider set of facial frames representing the target subject. These associated face images are used to compute a representation for face recognition.
1.4 Outline
In the following chapter, we provide a brief survey of the related work on face detection, facial landmark detection, head pose estimation, landmark confidences and face recognition. In Chapter 3, we describe our novel Holistically Constrained Local Model for landmark detection. We follow this with a description of our experiments and results demonstrating the benefits of our model. In Chapter 4, we describe our local-global confidence metrics and their application to face recognition. In Chapter 5, we present our novel method for face and body association (FBA) in videos to improve face recognition. Finally, we conclude the dissertation in Chapter 6 and propose future research directions.
Chapter 2
Related Work
In this chapter, we review the related work as well as the state-of-the-art techniques for the face detection, facial landmark detection, head pose estimation, landmark confidence and face recognition problems.
2.1 Face detection
Vast research efforts have been devoted to the face detection field and a number of off-the-shelf face detectors are available in various libraries. Modern face detection methods can be classified along four dimensions. One is the axis of the different types of features used, ranging from the seminal work by Viola and Jones [70], which used Haar-like features, to SURF features in [31] and the channel features used in the series of work by Yang et al. [75, 76]. Another dimension is the type of classifier applied. While various methods have been used for classification, cascade-structured classifiers have been popular due to the trade-off between performance and efficiency [12, 32, 80]. The third axis consists of part-based methods: this is an important category of face detection methods as well. Deformable part models (DPM) define a face as a collection of parts and model the connections through a Latent Support Vector Machine. These methods include [91, 74, 78, 53].

The final axis consists of the most recent studies leveraging the capacity of deep neural networks and the vast amount of face data [78, 79]. A subset of such works draws inspiration from the advances made in object detection, by applying similar ideas to faces. For instance, Jiang et al. [27] applied the Faster R-CNN [56] architecture, while the work in [87] is inspired by the Single Shot MultiBox Detector (SSD) [37].
2.2 Facial landmark detection
Facial landmark detection has made huge progress in the past couple of years. A large number of new approaches and techniques have been proposed, especially for landmark detection in faces from RGB images.

Modern facial landmark detection approaches can be split into two main categories: local and holistic. Local approaches often model both the appearance and shape of facial landmarks, with the latter providing a form of regularization. Holistic approaches, on the other hand, do not require an explicit shape model and landmark detection is performed directly on appearance. We provide a short overview of the recent local and holistic methods.

Holistic: Nowadays, the majority of the holistic approaches follow a cascaded regression framework, where facial landmark detection is updated in a cascaded fashion. The landmark detection is continually improved by applying a regressor on appearance given the current landmark estimate, as performed by Cao et al. in explicit shape regression [9]. Other cascaded regression approaches include the Supervised Descent Method (SDM) [73], which uses SIFT [39] features with linear regression to compute the shape update, and Coarse-to-Fine Shape Searching (CFSS) [90], which attempts to avoid local optima in cascaded regression by performing a coarse-to-fine shape search.

Recent work has also used deep learning techniques in a cascaded regression framework to extract visual features. Coarse-to-Fine Auto-encoder Networks (CFAN) [85] use visual features extracted by an auto-encoder together with linear regression. Sun et al. [64] proposed a Convolutional Neural Network (CNN) based cascaded regression approach for sparse landmark detection; however, while their approach is robust, it is not very accurate.
Local: Local approaches are often made up of two steps: extracting an appearance descriptor around certain areas of interest and computing local response maps; and fitting a shape model based on the local predictions. Such areas are often defined by the current estimate of facial landmarks. A popular local method for landmark detection is the Constrained Local Model (CLM) [60], and its various extensions such as Constrained Local Neural Fields [2] and Discriminative Response Map Fitting [1] use more advanced ways of computing local response maps and inferring the landmark locations. The CLM-based methods take into account the appearance variation around each facial landmark independently. Therefore, one response map can be calculated from the appearance variation around each facial landmark with the help of a corresponding local patch expert. Facial landmarks are predicted from these response maps refined by a shape prior which is generally learned from training shapes. Project-out Cascaded Regression (PO-CR) [68] is another example of a local approach; it uses cascaded regression to update the shape model parameters rather than predicting landmark locations directly.

Another noteworthy local approach is the mixture of trees model [91] that uses a tree-based deformable parts model to jointly perform face detection, pose estimation and facial landmark detection. A notable extension to this approach is the Gauss-Newton Deformable Part Model (GN-DPM) [69], which jointly optimizes a part-based flexible appearance model along with a global shape using Gauss-Newton optimization.

Rajamanoharan and Cootes [52] proposed a local approach that explicitly aims to be more robust in the presence of large pose variations. They use landmark detectors trained under different orientation conditions to produce more discriminative response maps and explore the best spatial splits for this task. However, they do not propose how such pose information could be acquired to initialize the model. In our work, we use similarly trained landmark detectors but also provide a way of initializing the models at extreme angles.
2.3 Head pose estimation
Head pose estimation attempts to compute the orientation of the head. It has not received the same amount of interest as facial landmark detection in recent years. Most of the recent work has concentrated on exploiting range sensors for the task [19, 45]. However, the limitation of such approaches is that they cannot work on purely RGB images. A large number of head pose estimation approaches rely on explicit prediction of head pose from the image [17]. This is often done through multivariate regression or multi-class classification [43]. It is also possible to use model-based approaches rather than discriminative approaches [17]. For example, they use the estimated facial landmarks together with a 3D face model to estimate the head pose. However, this requires an estimate of the camera calibration parameters and might not be suitable for some applications.

The model most similar to our work is the one proposed by Yang et al. [77]. In their approach, an estimate of the head pose from a Convolutional Neural Network (CNN) is used to initialize a cascaded shape regression approach for face alignment. Our work integrates the head pose in a similar manner, but also proposes the use of combined holistic and local approaches for landmark detection.
2.4 Landmark confidence

Despite the many studies on landmark detection, there are few studies that evaluate landmark confidence. Steger et al. [63] proposed a failure detection method for facial landmark detectors, and this method allows recovery from failures. Steger et al. [21] proposed landmark localization quality assessment using regression to the Area Under the Point Accuracy Curve (AUPAC), which is defined for the trade-off curve between the distance threshold and the obtained recall. Confidence is important in various respects: it allows us to improve the landmark detection; furthermore, it can be used later in the pipeline by weighting the scores.
2.5 Face recognition
Face recognition has made huge progress in the past couple of years, mainly because of the use of end-to-end deep learning methods. A large number of new approaches and techniques have been proposed for facial recognition in images [67, 41, 42, 14, 61], while less attention has been given to video-based face recognition [50, 11].

Face recognition with CNNs: Convolutional Neural Networks (CNNs) have been used for face recognition as far back as [29], but recently they have been getting more attention with improvements in hardware, massive datasets, and the development of network architecture designs. Face recognition performance is greatly enhanced by training CNNs on huge datasets. Google FaceNet [62] shows that it is possible to learn a compact embedding for faces with their learning system trained on 260 million images. DeepID [65] uses ensembles of CNN representations which are trained on different facial patches with the goal of learning pose invariance at the patch level. Along with their joint Bayesian metric, they showed remarkable performance on the Labeled Faces in the Wild (LFW) benchmark. Their work was later extended in [66] to show how the CNN learns sparse features that implicitly encode attribute information such as gender. Finally, an effort was made to produce richer and more varied data for training the CNN in [81, 89, 46].

Some previous work attempted to let the network disentangle the identity and the viewpoint by either performing multi-task learning [83] or using multi-view perceptrons [92]. The drawback of these latter methods is that they are only trained on constrained images in the Multi-PIE dataset, in which the pose is manually specified. These methods were consequently not tested on unconstrained benchmarks such as the IARPA Janus Benchmark A (IJB-A) [28].

Contrary to these, the recent work of Wang et al. [71] used CNNs to show accurate face identification on a gallery of 80 million images and on the IJB-A benchmark. Chen et al. [13, 15] demonstrated compelling results on IJB-A by using a single CNN trained from scratch on frontal views and fine-tuned to learn an effective metric on the target dataset.
Score fusion for face recognition: Face recognition is usually achieved by matching one face image to another and assigning a matching score to indicate the likelihood that these two face images are from the same person. It is often the case that we have multiple matching scores due to, for example, multiple media for a face (e.g., 2D and 3D), multiple matching methods (e.g., Euclidean distance or inner product), and/or multiple features (e.g., CNN features learned from aligned images or cropped images). Therefore, we need to obtain a single score for the considered matching pair by score fusion. The most widely used fusion methods are the average, maximum or minimum fusion [57, 6]. In [41], the ensemble SoftMax method is shown to be effective at fusing the scores from matching face images of different poses. Recently, quality-based fusion approaches have been proposed [6, 51] to integrate the quality of the score into the fusion. In [6], the qualities of individual media are directly utilized as the weights to combine five types of media: 2D face image, video, 3D face model, sketch, and demographic information. However, the media quality itself might be noisy, and a better transformation from it to the fusion weight is needed. [51] proposes a unified Bayesian framework that incorporates the quality information elegantly for multimodal biometric fusion. One of the drawbacks of this approach is that we need an accurate probability estimation of the class labels given the features and qualities.
Chapter 3
Holistically Constrained Local Model
3.1 Model
In this section, we introduce our Holistically Constrained Local Model (HCLM) for facial landmark detection. We first start by introducing the Constrained Local Model (CLM) in Section 3.1.1. Then, we present our joint framework incorporating holistic and local models (Section 3.1.2). This is done by refining the sparse landmark predictions (holistic) with a dense landmark model (local). We follow this with a description of the sparse landmark detection and head pose estimation that constrain and refine our model in Section 3.1.3.
3.1.1 Constrained Local Model
Constrained Local Model (CLM) based methods are deformable model based approaches which are commonly used for the task of landmark registration. The problem of fitting a deformable model involves finding the parameters of the model that best match a given image. CLM based methods depend on a parametrized shape model, which controls the possible shape variations of the non-rigid object. The appearance of local patches around landmarks of interest is modeled using patch experts. This means that one response map can be calculated from the appearance variation around each facial landmark with the help of a corresponding local patch expert.

Figure 3.1: Illustration of CLM fitting: (i) a local search for feature locations to get the response maps and (ii) an optimization strategy to maximize the responses of the Point Distribution Model constrained landmarks. Figure taken from [59].

Given a shape model and local patch experts, a deformable model fitting process is used to estimate the parameters that could have produced the appearance of a face in an unseen image. A common approach to fitting is illustrated in Figure 3.1. The parameters are optimized with respect to an error term which depends on how well the parameters model the appearance of a given image, or how well the current points represent an aligned model. That is, facial landmarks are predicted from the response maps refined by a shape prior which is generally learned from a training shape model.
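To make the two-step fitting loop concrete, the sketch below outlines a generic CLM iteration: evaluate the patch experts in a local search window around each current landmark estimate, then regularize the resulting peak locations with a shape model. This is a minimal illustration only, not the CLNF/HCLM implementation; the patch_experts and shape_model objects (and their project/reconstruct interface) are assumptions for the sketch.

```python
import numpy as np

def clm_fit(image, landmarks, patch_experts, shape_model, n_iters=5, search=11):
    """Minimal CLM-style fitting loop (illustrative sketch).

    landmarks     : (k, 2) current landmark estimates
    patch_experts : list of k callables mapping an image patch to a response map
    shape_model   : object with project()/reconstruct() implementing a PCA shape prior
    """
    landmarks = np.asarray(landmarks, dtype=float)
    for _ in range(n_iters):
        peaks = np.zeros_like(landmarks)
        for i, (x, y) in enumerate(landmarks):
            # (i) local search: evaluate the patch expert in a window around (x, y)
            x0, y0 = int(x) - search // 2, int(y) - search // 2
            patch = image[y0:y0 + search, x0:x0 + search]
            response = patch_experts[i](patch)            # (search, search) response map
            dy, dx = np.unravel_index(np.argmax(response), response.shape)
            peaks[i] = (x0 + dx, y0 + dy)
        # (ii) regularization: project the peak locations onto the shape model
        params = shape_model.project(peaks)               # rigid + non-rigid parameters
        landmarks = shape_model.reconstruct(params)       # shape-constrained landmark update
    return landmarks
```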
Figure 3.2: The overview of our framework. The HCLM model integrates sparse and dense landmark detection together in a unified framework.
3.1.2 Holistically Constrained Local Model
Our model integrates coarse and fine landmark detection together in a unified framework (Figure 3.2). The main goal of this approach is to improve the fine-grained dense landmark detection with the help of coarser sparse landmarks.
For a given set of k facial landmark positions x = {x_1, x_2, ..., x_k}, our HCLM model defines the likelihood of the facial landmark positions conditioned on a set of sparse landmark positions X_s = {x_s; s ∈ S} (|S| ≪ k) and image I as follows:

    p(\mathbf{x} \mid X_s, \mathcal{I}) \propto p(\mathbf{x}) \prod_{i=1}^{k} p(x_i \mid X_s, \mathcal{I})    (3.1)
In Equation 3.1, p(x) is a prior distribution over a set of landmarks x following a 3D point distribution model (PDM) with orthographic camera projection. Similar to Saragih et al. [60], we impose a Gaussian prior on the non-rigid shape parameters of the model. The probability of individual landmark alignment (response map) is modeled using the following distribution:

    p(x_i \mid X_s, \mathcal{I}) =
    \begin{cases}
    \mathcal{N}(x_i \mid \mu = X_s, \sigma^2) & i \in S \\
    \mathcal{C}(x_i \mid \mathcal{I})         & i \notin S
    \end{cases}    (3.2)
Above, C is a probabilistic patch expert that describes the probability of a landmark being aligned, while N(x_i | μ, σ²) is a bivariate Normal distribution evaluated at x_i with mean μ and variance σ² in both dimensions. Equation 3.2 allows our model to place high confidence (controlled by a small σ) on the set of sparse landmarks (detected by the holistic model from Section 3.1.3), while incorporating the response maps from a denser set. C can be any model producing probabilistic predictions of landmark alignment. In our work, we define C as a multivariate Gaussian likelihood function of a Continuous Conditional Neural Field (CCNF) [3]:

    \mathcal{C}(\mathbf{y} \mid \mathcal{I}) = p(\mathbf{y} \mid \mathcal{I}) = \frac{\exp(\Psi)}{\int_{-\infty}^{\infty} \exp(\Psi)\, d\mathbf{y}}    (3.3)

    \Psi = \sum_{y_i} \sum_{k=1}^{K_1} \alpha_k\, f_k\big(y_i, \mathcal{W}(y_i; \mathcal{I}); \theta_k\big) + \sum_{y_i, y_j} \sum_{k=1}^{K_2} \beta_k\, g_k(y_i, y_j)    (3.4)
Above, y is an m × m area of interest in an image around the current estimate of the landmark (the area we will be searching in for an updated location of the landmark), W(y_i; I) is a vectorized version of an n × n image patch centered around y_i and is called the support region (the area based on which we will make a decision about the landmark alignment, typically m > n), f_k is a logistic regressor, and g_k is a smoothness-encouraging edge potential [3]. It can be shown that Equation 3.3 is a multivariate Gaussian function [3], making exact inference possible and fast to compute. The model parameters [α; β; θ] of the CCNF are learned using Maximum Likelihood Estimation (with the BFGS optimization algorithm).

In order to optimize Equation 3.1, we use Non-Uniform Regularized Landmark Mean-Shift, which iteratively computes the patch responses and updates the landmark estimates by updating the PDM parameters [2].
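To illustrate how Equation 3.2 mixes the two sources of evidence before the mean-shift update, the sketch below builds per-landmark response maps: landmarks in the sparse set S receive a narrow Gaussian centered on the holistic prediction, while the remaining landmarks keep their patch-expert responses. This is a simplified sketch under those assumptions; the data layout and helper names are hypothetical, not the actual HCLM implementation.

```python
import numpy as np

def hclm_response_maps(landmarks, sparse_anchors, patch_responses, sigma=2.0, size=11):
    """Per-landmark response maps following Eq. 3.2 (illustrative sketch).

    landmarks      : (k, 2) current dense landmark estimates
    sparse_anchors : dict {index i in S: (x, y) holistic sparse prediction}
    patch_responses: dict {index i not in S: (size, size) patch-expert response map}
    """
    landmarks = np.asarray(landmarks, dtype=float)
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    maps = {}
    for i in range(len(landmarks)):
        if i in sparse_anchors:
            # i in S: narrow Gaussian (small sigma) centred on the sparse anchor,
            # expressed in the local search-window coordinates of landmark i
            ax, ay = np.asarray(sparse_anchors[i], float) - (landmarks[i] - size // 2)
            maps[i] = np.exp(-((xs - ax) ** 2 + (ys - ay) ** 2) / (2.0 * sigma ** 2))
        else:
            # i not in S: keep the probabilistic patch-expert response C(x_i | I)
            maps[i] = np.asarray(patch_responses[i], dtype=float)
        maps[i] /= maps[i].sum()          # normalize each map to a proper distribution
    return maps
```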
The following section describes the method of acquiring the holistic sparse landmarks, X_s, used to constrain our dense landmark detector. It also describes our head pose estimation model that allows for better initialization.
3.1.3 Two holistic predictors: sparse landmarks and head pose
Our HCLM model depends on a set of sparse landmarks from a holistic model. In this study, we use an approach similar to the CNN model proposed by Sun et al. [64] for such landmark detection. Table 3.1 (a) shows the CNN architecture used in our work. A gray-scale image of size 39 × 39 is used as input and pixel values are normalized to the range between 0 and 1. For sparse landmarks, five landmarks (two eyes, one nose, two mouth corners) are used for a frontal face and three landmarks (one eye, one nose, one mouth corner) are used for a profile face. Consequently, the number of outputs {x_1, y_1, x_2, y_2, ..., x_n, y_n} is ten when the face is frontal or six when it is profile (due to self-occlusion). The location of landmarks is shifted with respect to the image center, x and y, and normalized
Table 3.1: The structure of our convolutional network used for sparse landmark detection and head pose estimation.

(a) The CNN architecture for sparse landmark detection

Name    Type             Filter Size  Stride  Output Size
Conv1   convolution      20x4x4       1       20x36x36
Pool1   max-pooling      2x2          2       20x18x18
Conv2   convolution      40x3x3       1       40x16x16
Pool2   max-pooling      2x2          2       40x8x8
Conv3   convolution      60x3x3       1       60x6x6
Pool3   max-pooling      2x2          2       60x3x3
Conv4   convolution      80x2x2       1       80x2x2
Dense1  fully connected  -            -       120
Dense2  fully connected  -            -       10 (6)

(b) The CNN architecture for head pose estimation

Name    Type             Filter Size  Stride  Output Size
Conv1   convolution      32x3x3       1       32x94x94
Pool1   max-pooling      2x2          2       32x47x47
Conv2   convolution      64x2x2       1       64x46x46
Pool2   max-pooling      2x2          2       64x23x23
Conv3   convolution      128x2x2      1       128x22x22
Pool3   max-pooling      2x2          2       128x11x11
Dense1  fully connected  -            -       400
Dense2  fully connected  -            -       400
Dense3  fully connected  -            -       3
by width and height to be in the range between -0.5 and 0.5. We use the Euclidean distance as the network loss:

    \mathrm{loss}_{\mathrm{sparse}} = \lVert \hat{X}_i - X_i \rVert_2^2    (3.5)

where X̂_i is the given ground truth location and X_i is the predicted location for training image I_i. Note that X̂_i and X_i are normalized locations.
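For concreteness, a PyTorch-style sketch of the sparse-landmark network from Table 3.1 (a) together with the Euclidean loss of Equation 3.5 is given below. The dissertation only specifies the layer dimensions, so the activation functions and training details here are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SparseLandmarkCNN(nn.Module):
    """Sketch of the Table 3.1 (a) network: 39x39 gray input -> 2n normalized coordinates."""
    def __init__(self, n_landmarks=5):          # 5 landmarks for frontal, 3 for profile views
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, 4), nn.ReLU(), nn.MaxPool2d(2, 2),   # 20x36x36 -> 20x18x18
            nn.Conv2d(20, 40, 3), nn.ReLU(), nn.MaxPool2d(2, 2),  # 40x16x16 -> 40x8x8
            nn.Conv2d(40, 60, 3), nn.ReLU(), nn.MaxPool2d(2, 2),  # 60x6x6   -> 60x3x3
            nn.Conv2d(60, 80, 2), nn.ReLU(),                      # 80x2x2
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(80 * 2 * 2, 120), nn.ReLU(),
            nn.Linear(120, 2 * n_landmarks),     # (x1, y1, ..., xn, yn) in [-0.5, 0.5]
        )

    def forward(self, x):
        return self.regressor(self.features(x))

# Euclidean loss of Eq. 3.5 on normalized coordinates (dummy data for illustration)
model = SparseLandmarkCNN(n_landmarks=5)
criterion = nn.MSELoss(reduction="sum")
pred = model(torch.rand(8, 1, 39, 39))           # batch of 39x39 gray-scale crops
loss = criterion(pred, torch.rand(8, 10) - 0.5)  # ground truth normalized to [-0.5, 0.5]
```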
To assist the face alignment and facial landmark detection, it is helpful to know the head pose in advance. Most cases of face alignment failures come from large head pose variations, as the initial shape is often frontal and local approaches are not able to converge onto correct non-frontal landmarks. To avoid this problem, we developed a head pose estimation module which gives an estimate of the three head pose angles: pitch, yaw and roll.

Figure 3.3: Pipeline of our method. We first start with CNN head pose estimation, which allows us to choose the appropriate CNN sparse landmark detector; finally, this is followed by HCLM-based dense landmark detection.
Our implementation of the CNN for head pose estimation is based on the work of Yang et al. [77]. Table 3.1 (b) shows an overview of our CNN architecture. A gray-scale image of size 96 × 96 is used as input and pixel values are normalized to the range between 0 and 1. The output is a three-dimensional vector which represents pitch, yaw and roll. The output angles are normalized between -1 and 1. We use the Euclidean distance as the network loss:

    \mathrm{loss}_{\mathrm{headpose}} = \lVert \hat{P}_i - P_i \rVert_2^2    (3.6)

where P̂_i is the given ground truth angle and P_i is the predicted head pose for training image I_i.
The full pipeline of our HCLM model (Figure 3.3) is as follows: 1) use a CNN head pose predictor to estimate the head pose in the input image; 2) use a view-dependent CNN sparse landmark detector; 3) use the HCLM model with the detected sparse landmarks. In the following section, we extensively evaluate our model, showing its benefits over other models, especially in the presence of extreme poses.
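Before moving to the experiments, the following sketch summarizes the three stages in code form. The helper objects (head pose CNN, per-view sparse detectors, and the HCLM fitter) and their interfaces are assumptions for illustration only; the ±30° view split follows the three-view classification described in Section 3.3.3.

```python
def hclm_pipeline(image, face_box, pose_cnn, sparse_cnns, hclm):
    """Illustrative three-stage HCLM pipeline (interfaces are assumptions).

    pose_cnn    : returns (pitch, yaw, roll) angles for a face crop
    sparse_cnns : dict {"frontal", "left", "right"} of view-specific sparse detectors
    hclm        : dense HCLM fitter constrained by the sparse anchors
    """
    crop = image[face_box.top:face_box.bottom, face_box.left:face_box.right]

    # 1) head pose estimation selects the view (frontal: -30..30, left: > 30, right: < -30)
    pitch, yaw, roll = pose_cnn(crop)
    view = "frontal" if -30 <= yaw <= 30 else ("left" if yaw > 30 else "right")

    # 2) view-dependent sparse landmark detection (5 points frontal, 3 points profile)
    sparse_landmarks = sparse_cnns[view](crop)

    # 3) dense landmark detection anchored on the sparse predictions (Eqs. 3.1-3.2)
    dense_landmarks = hclm.fit(image, sparse_landmarks, init_pose=(pitch, yaw, roll))
    return dense_landmarks
```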
3.2 Experiments
We designed our experiments to study three aspects of our model. First, we compared, at a general level, our HCLM with a number of state-of-the-art baselines on two publicly available facial landmark detection datasets. Second, we performed a set of ablation experiments to see how each element of the HCLM model affects the final facial landmark detection results. Finally, we demonstrated the performance of the individual holistic models for sparse landmark detection and head pose estimation. In the following sections, the datasets we used and the experimental procedures we followed are presented.
3.2.1 Comparison with baseline methods
We compared our model with a number of recently proposed approaches for facial landmark detection (both holistic and local ones). The following models acted as our baselines: the tree-based deformable part method of Zhu et al. [91], the Gauss-Newton Deformable Part Model (GN-DPM) [69], the Discriminative Response Map Fitting (DRMF) instance of a Constrained Local Model [1], the Supervised Descent Method (SDM) model of cascaded regression [73], the Coarse-to-Fine Shape Searching (CFSS) extension of cascaded regression [90], and a multi-view version of the Constrained Local Neural Fields (CLNF) model [3]. We used multi-view initialization for the CLNF model. As the GN-DPM and SDM models we used were only trained on 49 landmarks, we only evaluated them on those landmarks. Our comparison against state-of-the-art algorithms is based on the original authors' implementations.
3.2.2 Ablation experiments
We performed three ablation experiments to see how the individual elements of our pipeline affect the final results. First, we removed the head pose estimation module and performed sparse landmark detection followed by dense landmarks (three head pose estimation models in parallel). We picked the model with the highest converged likelihood to determine the final landmarks. Second, we did not use sparse landmark detection; instead, we used the estimated head pose to initialize a CLNF dense landmark detector (Section 3.1.2). Finally, we performed just a CLNF dense landmark detection.
3.2.3 Datasets
We evaluated our work on three publicly available in-the-wild datasets:

300-W is a popular dataset which contains images from HELEN [30], LFPW [4], AFW [91] and IBUG [58]. 300-W provides the ground truth bounding boxes and manually annotated 68 landmarks.

AFLW contains 24,386 face images from Flickr. Each face is annotated with up to 21 landmarks. AFLW provides head pose estimates obtained by fitting a 3D mean face.

Figure 3.4: Example detection results on 300-W and IJB-FL. Each column presents images from the subsets of 300-W (HELEN, IBUG and LFPW) and IJB-FL.

IJB-FL is a new dataset which has a substantial proportion of non-frontal images. It is a subset of IJB-A [28], which is a face recognition benchmark and includes challenging unconstrained faces with full pose variation. We took a sample of 180 images (128 frontal and 52 profile images) from IJB-A, and manually annotated up to 68 facial landmarks in each image (depending on visibility). This is a very challenging subset containing a number of images in non-frontal pose (see Figure 3.4). Annotations of the IJB-FL dataset are available for research purposes.
3.2.4 Methodology
For the CNN training for head pose estimation and sparse landmark detection, we used the 300-W and AFLW datasets. More specifically, we used the training partitions of HELEN (2,000 images), LFPW (811 images), AFW (337 images), and AFLW (14,920 images). In addition, to avoid over-fitting, the images were augmented three times with enlargement of the face bounding box by 10% and 20%. We began training at a learning rate of 0.001 and dropped the learning rate to 1e-8 in steps of 0.1, with the momentum set to 0.9. For the CLNF patch expert training, we used Multi-PIE (Pose, Illumination, Expression) [20] and the training partitions of HELEN and LFPW. Furthermore, we used a multi-view and multi-scale approach as described in Baltrušaitis et al. [3].

For testing head pose estimation and landmark detection, we used the test sets of LFPW (224 images), HELEN (330 images), IBUG (135 images), the remaining AFLW images (4,972 images), and IJB-FL (180 images). Faces in the training set were not used in testing.

In the case of the 300-W (LFPW, HELEN, AFW, IBUG) datasets, we used the bounding boxes provided by the organizers [58], which were based on a face detector. In the case of the AFLW and IJB-FL datasets, we used bounding boxes based on the ground truth landmarks, since automatic face detection could not detect some faces in the data.
3.2.5 Error metric
Landmark detection error metric: In order to measure landmark detection accuracy, an error metric was needed. The Root Mean Square (RMS) error between the given ground truth locations and the detected landmark locations is defined as follows:

    \mathrm{RMS}_{\mathrm{error}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( (\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2 \right)}    (3.7)

where x̂_i, ŷ_i are the ground truth locations of the landmarks, x_i, y_i are the detected landmark locations, and N is the number of landmarks used.
29
We used a size normalised error so that it would be possible to compare errors across
datasets, and to avoid bias caused by face size. So, we divided the error by the average
of the width and height of the ground truth shape:
NormalisedRMS
error
=
q
1
N
P
N
i=1
(( ^ x
i
x
i
)
2
+ ( ^ y
i
y
i
)
2
)
0:5 (width +height)
; (3.8)
Note that not all of the landmarks were used for RMS error computation. For prole
images, only the visible landmarks were used for error estimation (e.g. if a persons face
has turned to the right, the right side of the face was not used).
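A small NumPy helper equivalent to Equations 3.7 and 3.8, with a visibility mask so that self-occluded landmarks on profile faces are excluded, might look as follows. It is a sketch for illustration, not the evaluation code used in the experiments.

```python
import numpy as np

def normalised_rms_error(gt, pred, visible=None):
    """Size-normalised RMS landmark error (Eqs. 3.7-3.8).

    gt, pred : (N, 2) ground truth and detected landmark locations
    visible  : optional boolean mask; invisible (self-occluded) points are ignored
    """
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    if visible is not None:
        gt, pred = gt[visible], pred[visible]
    rms = np.sqrt(np.mean(np.sum((gt - pred) ** 2, axis=1)))   # Eq. 3.7
    width = gt[:, 0].max() - gt[:, 0].min()                    # ground truth shape extent
    height = gt[:, 1].max() - gt[:, 1].min()
    return rms / (0.5 * (width + height))                      # Eq. 3.8
```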
Landmark detection error is often visualized as a cumulative error curve (see Figure 3.5 and Figure 3.6 for examples). The curve is constructed by computing the percentage of images in which the error was below a specific value. The closer the curve is to the top left corner of the graph, the better the fitting. The curve can also reveal accuracy and robustness trade-offs between approaches, and is more informative than mean or median error values.
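Constructing such a curve is straightforward once the per-image errors are available; the sketch below (illustrative only) evaluates the fraction of images whose error falls below each threshold:

```python
import numpy as np

def cumulative_error_curve(errors, thresholds):
    """Fraction of images with error below each threshold (cumulative error curve)."""
    errors = np.asarray(errors)[None, :]            # shape (1, n_images)
    thresholds = np.asarray(thresholds)[:, None]    # shape (n_thresholds, 1)
    return (errors < thresholds).mean(axis=1)       # proportion of images per threshold

# Example: proportion of images with normalised RMS error below 0.00, 0.01, ..., 0.10
curve = cumulative_error_curve(errors=[0.02, 0.04, 0.09],
                               thresholds=np.linspace(0.0, 0.1, 11))
```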
Head pose estimation error metric: Most approaches to head pose estimation use the mean of the absolute angular error for pitch, yaw and roll. However, the main problem with this measure is that the error distributions are unlikely to be normal. Therefore, the mean error would be negatively affected by outliers.
In addition to the mean error, the median error was computed as well. In head pose estimation, the median error is often more informative than the mean error, because the latter is heavily influenced by outliers. For example, if the estimated head pose has an error of 100 degrees, this result will affect the mean error but not the median error. Hence, both errors were computed in our experiments.

Figure 3.5: Cumulative error curves on 300-W, measured as the mean Euclidean distance from ground truth normalized by the inter-ocular distance. (a) 49-point cumulative errors; (b) 68-point cumulative errors.
3.3 Results and discussion
3.3.1 Facial landmark detection
The results of comparing our HCLM model to the previously mentioned baselines are
in Figure 3.5 (300-W) and Figure 3.6 (IJB-FL). Our model demonstrates competitive or
superior performance to most of the baselines on both datasets. The better performance of
our model is particularly clear at higher error rates and on profile images (see Figure 3.6b). This indicates that our model is more robust than the baselines, and that it is able to fit better on more complex images. This is due to both the better initializations and the combination of holistic and local approaches in our model (see the following sections).
Figure 3.6: Cumulative error curves on IJB-FL: (a) frontal, (b) profile. Measured as the mean Euclidean distance from ground truth normalized by the face size. Note that we use 49 points for both frontal and profile images.
3.3.2 Ablation experiments and sparse landmark detection
We performed ablation experiments to see how the individual elements of our pipeline affect the final results. Figure 3.7a shows the performance of our full pipeline compared with the following three cases: only using the dense landmark detector (CLNF), the dense landmark detector with head pose estimation (without any sparse landmark anchors), and CLNF with sparse landmark anchors (no head pose estimation). The results show that each individual module (head pose estimation and sparse landmark detection) is important for the HCLM model to improve performance.
In addition, we evaluated the performance of sparse landmark detection methods on
the IJB-FL dataset. Figure 3.7b shows the cumulative error of sparse landmarks on profile
images. Sparse landmarks were used to anchor our dense landmark detector and were
refined by our HCLM model. The results show that the HCLM model improves the accuracy of the sparse landmarks predicted by the CNN.
Figure 3.7: Ablation experiments and sparse landmark errors on IJB-FL. (a) Ablation experiments on profile images; we use 68 points for comparison. Our approach is robust in the presence of large head pose variations in profile images. (b) Cumulative error of sparse landmarks on profile images. Note that the sparse landmarks from the CNN are refined by the HCLM model.
3.3.3 Head pose estimation
We evaluated our head pose detector on the 300-W and AFLW datasets. The quanti-
tative results are in Figure 3.8a and Figure 3.8b, while Figure 3.8c shows some sample
estimations. The absolute mean errors of the pitch, yaw and roll are 4.27, 4.27 and 2.58
for 300-W and 6.74, 7.65, and 3.60 for AFLW respectively. The absolute median errors
of the pitch, yaw and roll are 3.33, 2.93 and 1.64 for 300-W and 5.35, 5.43, and 2.03 for
AFLW respectively.
Figure 3.8: Head pose estimation results on 300-W and AFLW. (a) Errors on 300-W and (b) errors on AFLW, showing the mean and median absolute errors for pitch, yaw and roll; (c) sample images with predicted head poses.
In addition, we measured the three-view classification accuracy based on the head pose estimation results. The ranges of the frontal, left and right views are from -30° to 30°, greater than 30°, and less than -30°, respectively. The classification accuracy on the 300-W and AFLW datasets is 95.2% and 91.5%, respectively. This demonstrates that our CNN head pose estimation works well and is useful for the further steps of the pipeline.
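For reference, the three-view classification rule stated above amounts to a simple thresholding of the estimated yaw; a sketch, assuming the yaw is given in degrees:

def classify_view(yaw_degrees):
    # Frontal: [-30, 30]; left: > 30; right: < -30 (as defined above).
    if -30.0 <= yaw_degrees <= 30.0:
        return "frontal"
    return "left" if yaw_degrees > 30.0 else "right"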
3.4 Conclusions
In this chapter, we learn that landmark detection is important for many facial analysis tasks.
The problem of automatic facial landmark detection has seen much progress over the
past years, however, most of the state-of-the-art methods still struggle in the presence
of extreme head pose, especially in challenging in-the-wild images. In order to solve
this problem, we present a new model, HCLM, which unifies local and holistic facial
landmark detection by integrating these three methods: head pose estimation, sparse-
holistic landmark detection and dense-local landmark detection. Our new model was
evaluated on three challenging datasets: 300-W, AFLW and IJB-FL. It shows state-of-
the-art performance and is robust, especially in the presence of large head pose variations.
Chapter 4
Local-Global Landmark Confidences for Face Recognition
4.1 Local-Global Landmark Confidences
In this section, we introduce our confidence models for facial landmark detection. We start by describing the constrained local confidence model (Section 4.1.1), followed by a description of the global confidence model in Section 4.1.2.
4.1.1 Constrained Local Confidence
Our constrained local confidence builds on the Constrained Local Neural Field (CLNF) [2] deformable model. We compute the facial landmarks using CLNF. For a given set of k facial landmark positions x = {x_1, x_2, ..., x_k}, our model defines the likelihood of the facial landmark positions conditioned on image I as follows:

p(\mathbf{x} \mid I) \propto p(\mathbf{x}) \prod_{i=1}^{k} P(x_i \mid I).   (4.1)
In Equation (4.1), p(x) is a prior distribution over the set of landmarks x following a
3D point distribution model (PDM) with orthographic camera projection. Similarly to
Saragih et al. [60], we impose a Gaussian prior on the non-rigid shape parameters of the model. P is a probabilistic patch expert that describes the probability of a landmark being aligned (a response map). Patch experts (also called local detectors) are an essential part of the model, and from these patch experts we can infer the local confidence. P can be any model producing probabilistic predictions of landmark alignment. In our work, we define P as a multivariate Gaussian likelihood function of a Continuous Conditional Neural Field (CCNF) [3]:
P(\mathbf{y} \mid I) = \frac{\exp(\Psi)}{\int_{-\infty}^{\infty} \exp(\Psi)\, d\mathbf{y}},   (4.2)

\Psi = \sum_{y_i} \sum_{k=1}^{K_1} \alpha_k\, f_k(y_i, W(y_i; I); \theta_k) + \sum_{y_i, y_j} \sum_{k=1}^{K_2} \beta_k\, g_k(y_i, y_j),   (4.3)
Above, y is an m×m area of interest in the image around the current estimate of the landmark (the area in which we will search for an updated location of the landmark), W(y_i; I) is a vectorized version of an n×n image patch centered around y_i and is called the support region (the area based on which we make a decision about the landmark alignment, typically m > n), f_k is a logistic regressor, and g_k is a smoothness-encouraging edge potential [3]. It can be shown that (4.2) is a multivariate Gaussian function [3], making exact inference possible and fast to compute. The model parameters [\alpha, \beta, \theta] of the CCNF are learned using Maximum Likelihood Estimation (with the BFGS optimization algorithm).
In order to optimize (4.1), we use Non-Uniform Regularized Landmark Mean-Shift
which iteratively computes the patch responses and updates the landmark estimates by
updating the PDM parameters [2].
After optimizing (4.1), the model likelihoods from each response map are computed.
To obtain the local confidence, we take the sum of log probabilities over all of the likelihoods and normalize it to the range between 0 and 1. We used multi-view (three-view) initialization for the CLNF model; therefore, confidences need to be normalized according to their view.
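A minimal sketch of this local-confidence computation is shown below; the per-view normalisation bounds (view_min, view_max) are hypothetical placeholders for values estimated on training data, not quantities given in this work.

import numpy as np

def local_confidence(patch_likelihoods, view_min, view_max):
    # Sum of log likelihoods over all patch-expert response maps,
    # min-max normalised to [0, 1] with bounds specific to the detected view.
    log_sum = float(np.sum(np.log(np.asarray(patch_likelihoods) + 1e-12)))
    conf = (log_sum - view_min) / (view_max - view_min + 1e-12)
    return float(np.clip(conf, 0.0, 1.0))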
4.1.2 Rendering-based Global Confidence
We have observed that various factors can affect the quality of rendered images. Not only the alignment error from poor landmarks, but also image quality, illumination or the rendering error itself can affect the rendering quality, and this has a big influence on the recognition performance. The local confidence can only estimate landmark quality; therefore, we also need to predict a global confidence for rendered images. Our method is motivated by these factors.
We use a rendering technique to generate synthetic faces [41], and the rendered faces are used for the matching. We render images not only to the frontal view (0°), but also to half-profile (40°) and profile (75°) views in order to cope with extreme yaw angles. We use three CNNs (Convolutional Neural Networks), one for each rendered view, to train our global confidence for rendered images. Table 4.1 and Figure 4.1 show the CNN architecture used in our work.
Table 4.1: The structure of our convolutional network used for global confidence estimation.
Name Type Filter Size Stride Output Size
Conv1 convolution 32x3x3 1 32x94x94
Pool1 max-pooling 2x2 2 32x47x47
Conv2 convolution 64x2x2 1 64x46x46
Pool2 max-pooling 2x2 2 64x23x23
Conv3 convolution 128x2x2 1 128x22x22
Pool3 max-pooling 2x2 2 128x11x11
Dense1 fully connected 400
Dense2 fully connected 400
Dense3 fully connected 1
Figure 4.1: The structure of our convolutional network used for global confidence estimation.
A gray-scale rendered image of size 96×96 is used as input, and pixel values are normalized to the range between 0 and 1. The output is a value which represents the global confidence. We use the Euclidean distance as the network loss:

loss_{confidence} = \| \hat{C}_i - C_i \|_2^2,   (4.4)

where \hat{C}_i is the annotated value, and C_i is the predicted confidence for training image I_i.
4.2 Application on face recognition
4.2.1 Convolutional Neural Networks for face recognition
A Convolutional Neural Network (CNN) is an efficient recognition algorithm which is widely used for image processing and pattern recognition. It has many attractive properties, such as a simple structure, fewer training parameters (shared weights) and adaptability, and it has become a popular tool in image recognition and natural language processing. Its weight-sharing network structure makes it more similar to biological neural networks, and it reduces the complexity of the network model and the number of weights.
In general, the structure of a CNN includes two kinds of layers. One is the feature extraction layer: the input of each neuron is connected to the local receptive fields of the previous layer and extracts local features. Once the local features are extracted, the positional relationships between them are also determined. The other is the feature map layer: each computing layer of the network is composed of multiple feature maps. Every feature map is a plane, and the weights of the neurons in the plane are equal. The feature map structure uses the sigmoid function as the activation function of the convolutional network, which gives the feature maps shift invariance. Besides, since the neurons in the same mapping plane share weights, the number of free parameters of the network is reduced. Each convolution layer in the CNN is followed by a computing layer which is used to calculate the local average and perform a secondary extraction. This unique two-stage feature extraction structure reduces the feature resolution.
CNNs are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling and other forms of distortion. Since the feature detection layers of a CNN learn from training data, explicit feature extraction is avoided and features are learned implicitly from the training data. Furthermore, the neurons in the same feature map plane have identical weights, so the network can learn in parallel. This is a major advantage of convolutional networks over networks in which neurons are fully connected to each other. The special structure of the CNN's locally shared weights gives it a unique advantage in image processing and natural language processing, and its layout is closer to the actual biological neural network. Shared weights reduce the complexity of the network. In particular, a multi-dimensional input image can enter the network directly, which avoids the complexity of data reconstruction during feature extraction and classification.
Face recognition is a biometric identification method based on the facial features of a person. In practical applications, such as monitoring systems, the face images collected by cameras are often of low resolution and/or exhibit large pose variations. Affected by low quality images and/or pose variations, the performance of face recognition drops dramatically. In other words, bad quality images and/or pose variations bring great challenges to face recognition, as they introduce nonlinear factors. Here, deep learning such as a CNN can approximate such complex functions through a deep nonlinear network structure.
4.2.2 Applying Landmark Confidences on face recognition
In general, face recognition relies on matching one face image to another, and a matching score is computed to evaluate the probability that the two images are from the same person, as shown in Figure 4.2. In this work, the matching score is computed by the correlation distance following [41]. We have two scores (Score 1 and 2 in Figure 4.2) from matching two types of face images: with 2D alignment or without (cropping only). The motivation for matching cropped images is to compensate for matching poorly 2D-aligned images (due to badly detected landmarks). The problem now becomes how to combine these two scores. We employ either or both of the proposed landmark confidences to determine the fusion weights.
Figure 4.2: Overview of our face recognition framework and the way of generating the two types of landmark confidences. The probe and gallery images are from the JANUS CS2 dataset. The fusion of the matching scores from 2D alignment and cropping only is achieved with the help of the landmark confidences.
While the local confidence measures the goodness of landmarks based on the 2D aligned images, the global confidence measures it from the 3D rendering point of view. Since both of them reflect the quality of the matching scores from 2D aligned images, the lower the confidence is, the lower the fusion weight of this score should be. To model the
relation between the landmark confidence and the fusion weight, we exploit a sigmoid function:

w_A = \frac{1}{1 + \exp(-m\,(l - r))},   (4.5)

where m and r are the slope and offset of the sigmoid function, and l is the local or global confidence. The average of the local and global confidences is used when we consider both, before applying the sigmoid function. w_A is the fusion weight of the matching score from the 2D alignment, and the weight of the score from the cropping only is simply 1 - w_A.
Exploiting the sigmoid function ensures that the fusion weight is always in the range [0, 1] and gives more flexibility to model the relationship between the confidences and the fusion weights, such as nonlinearity. Note that there are two landmark confidences (local or global) for a given face image pair; we take the minimum of them as the single confidence associated with this matching score. (The average and maximum values were also tried, but the minimum performed best empirically.)
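A small sketch of this fusion is shown below; the slope m = 7 and offset r = 0.3 are example values from the search ranges given in Section 4.3.2, not the selected parameters, and the per-pair handling is simplified.

import numpy as np

def fusion_weight(conf, m=7.0, r=0.3):
    # Sigmoid mapping of Equation (4.5): landmark confidence -> fusion weight w_A.
    return 1.0 / (1.0 + np.exp(-m * (conf - r)))

def fuse_scores(score_aligned, score_cropped, conf_probe, conf_gallery):
    conf = min(conf_probe, conf_gallery)          # single confidence for the image pair
    w_a = fusion_weight(conf)
    return w_a * score_aligned + (1.0 - w_a) * score_cropped

# A pair with a poorly aligned probe face relies more on the cropped-image score.
print(fuse_scores(score_aligned=0.42, score_cropped=0.55, conf_probe=0.15, conf_gallery=0.8))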
4.3 Experiments
The datasets we used and the experimental procedures we followed are presented in this
section.
4.3.1 Datasets
For the CLNF patch expert training, we used Multi-PIE (Pose, Illumination, Expres-
sion) [20] and the training partitions of HELEN [30] and LFPW [4]. Furthermore, we
used a multi-view and multi-scale approach as described in Baltrušaitis et al. [3].
The CASIA WebFace dataset [82], currently the largest publicly available face dataset with roughly 500K face images, is used to train the CNN for the global confidence.
We evaluate the local and global landmark confidences for the matching score fusion on the JANUS CS2 dataset and the IJB-A dataset [28]. IJB-A is a new publicly available challenge proposed by IARPA and made available by NIST (IJB-A is available upon request at http://www.nist.gov/itl/iad/ig/ijba_request.cfm), in order to push the frontiers of face recognition in the wild following the recent saturation of results on the existing de facto standard benchmark, LFW [18]. JANUS CS2, which is not publicly available yet, is the extended version of IJB-A. Both JANUS CS2 and IJB-A include unconstrained faces of 500 subjects with extreme poses, expressions and illuminations. JANUS CS2 contains not only the images and sampled key frames but also the original videos of a subject. Also, JANUS CS2 includes more test data for identification and verification than IJB-A.
4.3.2 Methodology
In Equation (4.5), the slope and offset of the sigmoid function are selected from {3, 5, 7, 9} and {0.1, 0.2, 0.3, 0.4, 0.5}, respectively, based on the performance on the validation set (created from the gallery set). In the following, the LandMark Confidences are shortened as LMCs.
Regarding the score fusion, the baselines we compare against can be divided into three categories: (1) No score fusion: the matching scores from 2D alignment or cropping only. (2) Fusion methods without LMCs: the average, max, min, and weighted average fusion methods; in the weighted average method, we use a grid search and pick the fusion weight of the matching score from 2D alignment in the range of 0.1 to 0.9 with interval 0.1, based on the validation set performance. (3) Fusion methods with LMCs: score replacement, which replaces the scores from the 2D alignment with the ones from cropping only when the confidence (local, global, or the average of them) is less than a threshold (the threshold is picked from {0.1, 0.2, 0.3, 0.4, 0.5} based on the validation set performance). We also compare our approach to [6] by using the landmark confidence directly as the fusion weight without any transformation (denoted as Local LMCs Only in Table 4.2).
4.4 Results and discussion
In this section, we analyze and evaluate our two confidence models. We also demonstrate the efficacy of landmark confidences on face recognition by applying them to the face matching score fusion.
4.4.1 Evaluation of Local Confidence
Figure 4.3 shows the relation between the local landmark confidence and the eye-distance-normalized Root Mean Square (RMS) error. Since JANUS CS2 is annotated only with 3 key-points on the faces (two eyes and nose base), the error is normalized by the distance between the two eyes. We use 25K images from the JANUS CS2 dataset, but only 2.5K points are plotted to avoid visual distraction. The equation of the regression line is y = 0.4237 e^{-2.897x}. The coefficient of determination R^2 is 0.422, meaning the regression explains about 42% of the variance. When the local confidence is above 0.6, most of the eye-distance-normalized RMS errors are below 0.1. This shows that a local confidence above 0.6 predicts the alignment error well.
Figure 4.3: Correlation between local confidence and eye-distance-normalized RMS error. When the local confidence is higher, the RMS error is smaller.
4.4.2 Evaluation of Global Confidence
We had two coders annotate the quality of the rendered images, using good (+1) and bad (0) annotations as labels. To measure inter-coder agreement, we use Krippendorff's alpha [23]. The Krippendorff's alpha value from the two coders is 0.875, which is relatively high since the variable is easy to code. The agreement between the two coders is considered the gold standard in our experiments. We began training at a learning rate of 0.001, dropped the learning rate in steps of 0.1 down to 1e-8, and set the momentum to 0.9.
Figure 4.4: Example predicted scores from the CASIA test set, for images rendered to the profile view (75°). Higher numbers are better.
We tested the CNN classifier on 2.5K of the CASIA synthesized images: of the 10K images, three quarters were used for training and one quarter for testing. Images in the training set were not used in testing. The accuracy of good/bad classification was 95.19% on the test set and 99.91% on the training set. Figure 4.4 shows examples of predicted scores from the CASIA test set.
Figure 4.5: Correlation between confidences and matching scores of genuine pairs: (a) global confidence, (b) local confidence. In each case, the image-to-image matching scores are shown on the left and the histograms of scores for high- and low-confidence pairs on the right.
Table 4.2: Average performances of the score fusion on the JANUS CS2 dataset with local and global confidences.
Matching method TAR (%) Identification Rate (%)
FAR@1e-2 FAR@1e-3 FAR@1e-4 Rank-1 Rank-5 Rank-10
2D Alignment [41] 86.91 59.02 24.90 83.66 92.10 94.31
Cropping Only (Unaligned) [47] 72.33 39.34 16.05 69.60 86.11 90.74
Average [57] 86.64 64.01 35.64 82.93 92.24 94.68
Max [57] 82.96 53.05 23.26 79.00 90.82 93.78
Min [57] 85.15 63.39 36.81 81.92 91.96 94.23
Weighted Average 87.56 65.51 35.78 83.76 92.58 94.85
Score Replacement 87.56 63.33 32.06 83.27 92.29 94.60
Local LMCs Only [6] 85.95 61.77 33.11 81.74 91.74 94.21
Local LMCs + Sigmoid 88.83 68.90 41.64 85.03 92.96 95.13
Global LMCs + Sigmoid 87.36 67.00 42.51 83.32 92.29 94.45
Local + Global LMCs + Sigmoid 88.76 69.24 43.19 85.03 92.86 94.97
4.4.3 Confidence analysis for face recognition
We analyze the relationship between confidences and matching scores. Only one image per template was kept to obtain image-to-image matching scores, and we collected all scores of genuine pairs. Figure 4.5 shows the correlation between confidences and matching scores. As shown in the left figures, when the confidence of a pair is higher, the score is also higher (red means higher). The right figures show the histograms of scores for both high
(confidence above 0.7) and low (confidence below 0.3) confidence pairs, and they show a significant difference in distribution (global confidence p = 3.44e-53, local confidence p = 1.59e-40, at significance level p < 0.001). Considering this analysis, it makes sense to put some weight on both confidences during the fusion process.
Table 4.3: Performance comparison of the proposed method to previous face recognition approaches on the JANUS CS2 dataset.
Matching method TAR (%) Identification Rate (%)
FAR@1e-2 FAR@1e-3 FAR@1e-4 Rank-1 Rank-5 Rank-10
COTs [28] 58.1 37 - 55.1 69.4 74.1
GOTs [28] 46.7 25 - 41.3 57.1 62.4
Fisher Vector [16] 41.1 25.0 38.1 55.9 63.7 -
J.-C. Chen et al. [14] 87.6 - - 83.8 92.4 94.9
Ours (Local + Global LMCs + Sigmoid) 88.76 69.24 43.19 85.03 92.86 94.97
4.4.4 Local and Global Confidences for face recognition
In this section, we evaluate the local and global landmark confidences for fusing the matching scores from the 2D alignment and cropping on the JANUS CS2 dataset and the IJB-A dataset [28]. Since there are multiple scores for each probe-to-gallery image pair, we use the ensemble SoftMax method [41] to obtain a single matching score measuring the similarity between a probe image set and a gallery image set. Following [28], the average verification performance (in terms of TARs at 1%, 0.1%, and 0.01% FARs) and identification performance (in terms of Rank-1, Rank-5, and Rank-10 identification rates) are used for evaluation and reported in Table 4.2 and Table 4.4. The LandMark Confidences are shortened as LMCs in both tables. The benchmark evaluation measures,
including the Receiver Operating Characteristic (ROC) curve and the Cumulative Match Characteristic (CMC) curve for the JANUS CS2 dataset, are also displayed in Figures 4.6 and 4.7. Table 4.3 compares our method to commonly used face recognition systems on the JANUS CS2 dataset.
Figure 4.6: Comparative performance (ROC) on the JANUS CS2 dataset for our confidence-based fusion method for face verification. Only TARs at FARs from 0% to 1% are shown.
As can be seen in Table 4.2, our method (Local LMCs + Sigmoid) outperforms the following: no score fusion, the fusion methods without LMCs, and using LMCs directly as weights, by 21.2%, 8.8%, and 9.1% on average for TAR at 0.01% FAR, respectively. It can also be seen that the rendering-based global confidence achieves a similar improvement, showing that the quality of the rendered images also helps to evaluate the matching score quality and determine the fusion weights properly. Considering both the local and global confidences further improves the face verification rate at low FARs and the identification rate at Rank-1, which can also be seen in Figures 4.6 and 4.7. In Table 4.4, similar improvements can be observed. Moreover, the influence of the global confidences is most significant when requiring a large TAR at a small FAR (e.g. FAR@1e-4), as shown in Table 4.2 and Table 4.4.
Figure 4.7: Comparative performance (CMC) on the JANUS CS2 dataset for our confidence-based fusion method for face identification.
Table 4.3 demonstrates the comparison of our method to commonly used face recognition systems, including COTs [28], GOTs [28], Fisher Vector [16], and J.-C. Chen et al. [14], on the JANUS CS2 dataset. As can be seen, our method performs better than the other face recognition systems.
Notice in Figures 4.6 and 4.7 that using LMCs as fusion weights directly (denoted as Local LMCs Only) performs worse than the weighted average (without LMCs) at low FARs, but performs much better after applying the sigmoid function. This implies that the relationship between the landmark confidences and the fusion weights is nonlinear, and can be better modeled by adjusting the slope and offset of the sigmoid function. To sum up, the matching score fusion in face recognition can benefit from both the proposed landmark confidences and the nonlinear transformation.
Table 4.4: Average performances of the score fusion on the IJB-A dataset [28] with local and global confidences.
Matching method TAR (%) Identification Rate (%)
FAR@1e-2 FAR@1e-3 FAR@1e-4 Rank-1 Rank-5 Rank-10
2D Alignment [41] 85.32 49.62 19.14 89.97 95.53 96.73
Cropping Only (Unaligned) [47] 79.06 35.56 8.87 85.37 94.28 96.16
Average [57] 85.96 47.47 16.72 89.28 95.51 96.94
Max [57] 84.82 47.34 16.60 89.45 95.49 96.83
Min [57] 83.24 41.19 12.19 87.97 95.21 96.62
Weighted Average 86.42 51.75 19.64 90.18 95.67 96.92
Score Replacement 86.81 52.84 20.51 90.09 95.59 96.90
Local LMCs Only [6] 85.67 49.14 18.01 89.03 95.38 96.85
Local LMCs + Sigmoid 87.89 56.11 22.91 90.39 95.66 97.05
Global LMCs + Sigmoid 87.88 55.93 24.66 90.27 95.76 97.00
Local + Global LMCs + Sigmoid 88.29 56.74 23.83 90.42 95.75 97.02
4.5 Conclusions
In this chapter, we propose two methods for measuring landmark confidence: a local confidence based on the local predictors of each facial landmark, and a global confidence based on a 3D rendered face model. While the local confidence measures the goodness of landmarks based on the 2D aligned images, the global confidence measures it from the 3D rendering point of view. Our experiments show that both confidences are beneficial to face recognition, improving accuracy by up to 9% compared to the methods without landmark confidences.
Chapter 5
Face and Body Association for Video-based Face
Recognition
5.1 Face and Body Association
In our video-based face recognition, we have the following settings. A single video is
used to query a face recognition system with two different galleries. The galleries contain
multiple sets of still images; each set describes the same subject. We address the open-set
video-based face recognition problem, that is, given a probe, we cannot assume that the
probe subject is contained in the gallery. Moreover, we have a single annotation of the
target face in the probe video. We use this annotation to associate other detected faces in
the video using both face and body cues. By doing so, we are able to greatly expand the
probe image into a large set of images of the same subject, better spanning the subject's
appearance manifold. This new probe set is used to create a face representation used
for identification. For the final face representation, we take the average of the deep feature
descriptors obtained following the method presented in [42, 40]. Figure 5.1 shows how
the proposed method is integrated in a video-based face recognition framework. In the following, we explain the details of each step.
Figure 5.1: Overview of our video-based face recognition system. The probe videos and gallery images are from the JANUS CS3 dataset. It is difficult to match the faces in the gallery using only the annotated target face from a single frame (depicted with a red box) in the probe video; our association method provides a much richer description of the probe subject.
5.1.1 Face detection using YOLO cascade
We have implemented an improved off-the-shelf face detector [80] as the detection module. The original architecture is a cascade of three stages of shallow fully convolutional neural networks (FCN). Similar to recent advances in face detection that draw inspiration from the object detection literature, we replace the "proposal stage" in [80] with a re-trained YOLO detector [54] and extend the "verification stage" to a single deep neural network using the ResNet-50 architecture [24]. The YOLO face detector is fine-tuned on the WiderFace dataset [79], which consists of 13k images with over 90k faces. For training the verification network, we used positive face samples where the face region has an intersection-over-union (IoU) larger than 0.5 with the ground truth bounding box; the negative samples come from the MIT Places dataset [88].
Figure 5.2: ROC curves (true positive rate versus number of false positives) of YOLO Cascade and other face detection methods (S3FD, Supervised STN, TinyFaces, Faceness, DP2MFD, FCN cascade, HeadHunter) benchmarked on the FDDB dataset.
The performance of the face detection is benchmarked on the FDDB dataset [26], and the improved version of [80] is referred to as "YOLO cascade" in the legend, as shown in Figure 5.2. The YOLO cascade detector does not have the best performance compared to other state-of-the-art methods, but still performs on par with other respectable baseline methods such as [25]. This shows that our face detector provides reliable and effective responses to support our association method.
5.1.2 Body pose estimation
To estimate the upper-body region corresponding to the target face, we employ Open-
Pose [10] as a body pose estimator. For each video frame, we extract 18 keypoints
(x, y, s)_i, i ∈ {1, ..., 18}, which indicate the 2D joints of the skeleton of a person plus the detection score s. From the keypoints, we obtain the bounding box corresponding to the
upper-body region through the coordinates of the extracted joints. We motivate the use
of OpenPose for two reasons: (1) its ability to extract joints, and therefore boxes, even in extreme cases where part of the body is occluded or when the videos present mostly talking heads; and (2) OpenPose also returns head joints, which are used in our approach to filter out false positives in the body pose estimation by checking the consistency of the head
joints with the face detection response from YOLO cascade, described in Section 5.1.1.
Figure 5.3 shows example body pose estimations from OpenPose supporting this. The extracted upper-body boxes are used later in the association part of our method.
Figure 5.3: Examples of body pose estimation from OpenPose on challenging JANUS CS3 video frames. Note how the use of an upper-body detector is able to tackle challenging cases in which the person is partially visible or when the frame contains talking heads.
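As a rough sketch of how an upper-body box can be derived from the OpenPose output, the snippet below takes the bounding box of a subset of confidently detected joints; the joint-index subset and the score threshold are assumptions for illustration, not values specified in this work.

import numpy as np

# Hypothetical subset of the 18 OpenPose (COCO-layout) joints covering the upper body:
# nose, neck, shoulders, elbows, wrists and hips.
UPPER_BODY_IDS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 11]

def upper_body_box(keypoints, min_score=0.1):
    # keypoints: (18, 3) array of (x, y, s); returns [x, y, w, h] or None.
    pts = np.asarray(keypoints, float)[UPPER_BODY_IDS]
    pts = pts[pts[:, 2] > min_score]       # keep confidently detected joints only
    if len(pts) < 2:
        return None                        # not enough joints to form a box
    x0, y0 = pts[:, 0].min(), pts[:, 1].min()
    x1, y1 = pts[:, 0].max(), pts[:, 1].max()
    return [float(x0), float(y0), float(x1 - x0), float(y1 - y0)]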
5.1.3 Data association
The face and body detections described in Section 5.1.1 and Section 5.1.2, respectively, are used in the scoring function of the data association step. The goal of the data association is to keep track of the target subject across different scenes or even shots, so that, at the end, we are able to track the subject's movements across scenes. The proposed association is based on our observation that, while the appearance of the face can change drastically from scene to scene, the upper-body appearance of the same subject remains similar across different scenarios within the same video.
Scoring function: A person p is described as a vector of bounding boxes from the face and body detectors as p = [d, g], where d = [x, y, w, h, s] is the detected face bounding box plus its confidence score, and g is represented in the same form as d but indicates the upper-body part. More formally, we assume that multiple people are present in both frames; that is, p_t^i ∈ D_t at frame t, where D_t is the set of detections.
The data association works between consecutive frames; without loss of generality, we can assume that the system associates detections between the frames at times t and t + 1. The data association between these two frames is performed using a greedy, threshold-based approach roughly inspired by [8]. The scoring function that connects two person detections across two consecutive frames is defined as:
\sigma(p_t^i, p_{t+1}^j) := \alpha\, \Omega(d_t^i, d_{t+1}^j) + (1 - \alpha)\,(s_t^i + s_{t+1}^j) + \phi(d_t^i, d_{t+1}^j) + \beta\, \phi(g_t^i, g_{t+1}^j).   (5.1)
The symbols involved in the above equation are as follows: \Omega(\cdot,\cdot) is the intersection-over-union (IoU) between two face detection bounding boxes; s is the face detection confidence; \phi(\cdot,\cdot) is a function that returns the cosine similarity of the VGG [48] feature encodings computed on the input boxes. The hyper-parameters control the importance of spatial information versus appearance: \alpha controls the trade-off between spatial location consistency and face detection confidence, and \beta controls the weight of the body appearance; \alpha is set to 0.2 and \beta to 0.3 empirically in our experiments.
Greedy data association: Equation (5.1) is further thresholded as follows to prevent
wrong associations due to a weak score value:
\sigma'(p_t^i, p_{t+1}^j) := \begin{cases} \sigma(p_t^i, p_{t+1}^j), & \text{if } \sigma(p_t^i, p_{t+1}^j) \ge \tau, \\ -1, & \text{otherwise,} \end{cases}   (5.2)

where \tau is the minimum score threshold.
The greedy algorithm selects the maximum using the thresholded scoring function \sigma'(\cdot,\cdot) across all the comparisons between D_t and D_{t+1}. Once the maximum is selected, the two persons are associated as the same identity between frames; note that it is possible that there is no association between two frames. Data association across multiple frames is obtained by simply iterating this process and linking until a person gets associated.
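The scoring and greedy linking of Equations (5.1) and (5.2) can be sketched as below; the detection/feature fields and the threshold value tau are assumptions for illustration (the text fixes only alpha = 0.2 and beta = 0.3), and the appearance features stand in for the VGG encodings.

import numpy as np

def iou(a, b):
    # Intersection-over-union of two [x, y, w, h] boxes (the Omega term in Eq. 5.1).
    ax1, ay1 = a[0] + a[2], a[1] + a[3]
    bx1, by1 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax1, bx1) - max(a[0], b[0]))
    ih = max(0.0, min(ay1, by1) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    # Cosine similarity of two appearance encodings (the phi term in Eq. 5.1).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def score(p, q, alpha=0.2, beta=0.3, tau=0.5):
    # Thresholded scoring function of Eqs. (5.1)-(5.2); each person is a dict with
    # face box 'd', face score 's', body box 'g', and appearance features 'fd', 'fg'.
    s = (alpha * iou(p['d'], q['d'])
         + (1 - alpha) * (p['s'] + q['s'])
         + cosine(p['fd'], q['fd'])
         + beta * cosine(p['fg'], q['fg']))
    return s if s >= tau else -1.0

def greedy_associate(dets_t, dets_t1, **kw):
    # Greedily link detections between consecutive frames in order of descending score.
    pairs = sorted(((score(a, b, **kw), i, j)
                    for i, a in enumerate(dets_t)
                    for j, b in enumerate(dets_t1)), reverse=True)
    used_t, used_t1, links = set(), set(), []
    for s, i, j in pairs:
        if s < 0 or i in used_t or j in used_t1:
            continue
        links.append((i, j))
        used_t.add(i)
        used_t1.add(j)
    return links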
Tracklet initialization and termination: Old tracklets are terminated whenever the new detection results cannot be matched to any existing tracklet for a certain number of frames, set to 15 in our experiments. New tracklets are initialized from these new detection responses.
Tracklet filtering: Finally, after the data association is completed, we filter the extracted tracklets to select face samples of good quality, thereby reducing the computational burden for the face recognition system. Filtering is based on the tracklet length and the detection box dimensions. Practically, short tracklets (less than 20 frames) are filtered out to remove some noise in the tracking. Also, tracklets containing detection responses at small scales (average bounding box size smaller than 20 pixels) are filtered out. This latter filtering step is useful as low resolution faces cannot help in recognition but only add noise.
After filtering, K frames are selected from each remaining tracklet. To keep the face pose distribution as diverse as possible, the K frames are picked uniformly over the time span of the tracklet, instead of picking the top-K most confident detections, which could all span similar, frontal faces. We set K = 15 in all our experiments.
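The filtering and uniform K-frame sampling described above could look like the following sketch, where a tracklet is assumed to be a list of (frame_index, box) pairs; the thresholds match the values stated in the text.

import numpy as np

def filter_and_sample(tracklets, min_len=20, min_box=20, k=15):
    # Keep sufficiently long, sufficiently large tracklets and pick K frames
    # uniformly over each tracklet's time span (box = [x, y, w, h]).
    kept = []
    for trk in tracklets:
        if len(trk) < min_len:
            continue                                  # too short: likely noisy
        avg_size = np.mean([(b[2] + b[3]) / 2.0 for _, b in trk])
        if avg_size < min_box:
            continue                                  # too small: low-resolution faces
        idx = np.linspace(0, len(trk) - 1, num=min(k, len(trk))).astype(int)
        kept.append([trk[i] for i in idx])
    return kept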
Parameter Settings: The hyper-parameters of the data association are initialized following rationales from the multiple object tracking (MOT) task [5], for example, the relative scales of the balancing weights. The parameters are then adjusted through visual inspection on a held-out validation set. The parameter for tracklet filtering is based on the observation that any track of less than one second is often very blurry and does not contain enough useful information for recognition, assuming videos at 20 fps. Finally, the number K for the frame selection is set by cross-validation of face recognition performance on a held-out set.
5.2 Experiments
We have conducted experiments to explore the effectiveness of each component in our FBA pipeline. Furthermore, the performance of different variants of FBA is also presented as an ablation study. Finally, we compare our FBA model with current state-of-the-art methods. The experiments are carried out on the JANUS CS3 set, following protocol 6, which is suitable for the evaluation of our video-based face recognition.
Experiment setup: We use the publicly available ResFace-101 model proposed recently
by [42]. Their system has been trained on both real and rendered face images using the
ResNet-101 architecture [24]. Following that approach, we trained on the MS-Celeb
face set [38]. We show recognition results using two different snapshots of the training (namely CNN Model 1 and CNN Model 2). Face images are aligned with the facial landmarks provided by CE-CLM [84]. The aligned face images are then used as input to the recognition system mentioned in [40].
To better compare with the previous work, we follow [11] and also use protocol 6 of
the JANUS CS3 dataset for evaluation. Protocol 6 defines an open-set video-to-image identification problem. It provides two gallery sets and one probe set. Each gallery set
contains only still images and the probe set consists of only videos. Open-set means that
it is not guaranteed that a probe subject is contained in the gallery.
5.2.1 Datasets and metrics
Datasets: The face recognition evaluation is conducted on the JANUS CS3 dataset, which is an extended version of IJB-A [28] (IJB-A is available upon request at http://www.nist.gov/itl/iad/ig/ijba_request.cfm). The JANUS CS3 dataset includes unconstrained faces of 1,871 subjects with extreme poses, expressions and illuminations, compared to the 500 subjects in IJB-A. Differently from IJB-A, which provides only images and sampled key frames, the JANUS CS3 set also provides the original videos from which the key frames were extracted. Quantitatively, JANUS CS3 provides 11,876 still images, 55,732 key frames and 7,245 videos.
Video analysis: JANUS CS3 videos vary dramatically in length, frame resolution and the scenes where they are captured. For example, the average length of the 7,195 probe videos is 49 seconds, while there are 1,338 (18.6%) short videos of less than 3 seconds. The videos come from different sources, such as news, sports broadcasts and social media.
Figure 5.4 and Figure 5.5 show some representative frames extracted from JANUS CS3 videos. In Figure 5.5 (a), the face appears consistently at a similar scale and location in the video and there is no large pose variation of the face; it is easy to obtain an effective face representation for recognition without using other special methods. In the case of Figure 5.5 (b) and Figure 5.5 (c), the face annotations on key frames contain only non-frontal and small faces, while good frontal faces are present in other parts of the video. In this case, successful association of the same entity can greatly improve the video-based recognition. In the case of Figure 5.5 (b), the body appearance difference is a reliable signal to distinguish between entities. Frequent shot changes occur in Figure 5.5
(c), but as faces are detected, face representation similarity helps the association among different subjects. In the case of Figure 5.5 (d), the faces appear non-frontal and at a small scale throughout the video, making it difficult to extract any good representation for any method.
Figure 5.4: Sample frames from the JANUS CS3 dataset. The target face, annotated with a red box, appears in multiple shots in the video. The detected face is shown with a yellow box and the upper-body with a blue box. While tracking by using only detected faces can be difficult due to the large scale changes and low resolutions, the upper-body can be more discriminative in the association.
Figure 5.5: Some representative faces that can be found in JANUS CS3 videos, illustrating different cases of face appearance: (a) good, frontal or near-frontal face; (b) non-frontal or small face annotated, but a frontal face present in subsequent frames; (c) non-frontal or small face annotated, but a frontal face present in subsequent frames, with scene changes; (d) non-frontal faces or faces of very small scale. For frontal or near-frontal faces, it is easy to obtain the face tracks (a). For faces undergoing pose variations and appearing in different scenes, the proposed face and body association (FBA) is required (b)(c)(d).
Metrics: Considering evaluation metrics, we followed [28]: the identification performance is evaluated using the identification rate at different ranks (Rank-1, Rank-5 and Rank-10), and the average verification performance is evaluated with the True Alarm Rate (TAR) at different False Alarm Rates (FAR): 1e-2, 1e-3 and 1e-4.
5.2.2 Ablation study
We have performed ablation experiments to explore how individual elements in our pipeline affect the face recognition results. Two variants have been compared with our
FBA pipeline: (1) Key frame only: extracting the annotated key frames only and feeding the annotated faces into the recognition pipeline; (2) FA (Face only): using only face detections for association; formally, in Equation (5.1), the body appearance similarity term \phi(g_t^i, g_{t+1}^j) is discarded. Finally, we present the proposed method, FBA (Face and Body), showing the impact of using the body information in the association on recognition.
Table 5.1: Ablation experiments. Average performances over Gallery 1 and 2 in Protocol 6 of the JANUS CS3 using CNN Model 1.
Method Identification Rate (%) TAR (%)
Rank-1 Rank-5 Rank-10 FAR@1e-2 FAR@1e-3 FAR@1e-4
Key frame only 54.5 68.1 72.8 73.2 51.9 25.5
FA (Face only) 63.2 74.8 78.7 79.8 60.8 37.0
FBA (Face and Body) 64.6 75.3 79.2 80.4 62.6 38.3
The evaluation results of the different variants are shown in Table 5.1. There is a consistent improvement when adding the face association alone and then incorporating upper-body cues into the association process. Compared to using key frames only, FA outperforms by a large margin of 8.7% in identification rate at Rank-1, as association helps find faces of good quality beyond the annotations in key frames. When adding upper-body information, the association process becomes more robust with the complementary cues.
Table 5.2: Comparison with the state-of-the-art. Performances on Gallery 1 in Protocol 6 of the JANUS CS3.
Method Identification Rate (%) TAR (%)
Rank-1 Rank-5 Rank-10 FAR@1e-2 FAR@1e-3 FAR@1e-4
ALIEN tracker [49] 65.3 77.7 82.1 82.9 65.5 45.1
MDNet tracker [44] 68.0 76.5 78.0 80.0 67.2 43.6
TFA [11] 66.9 78.8 82.6 - 57.0 38.9
FBA (CNN Model 1) 70.8 80.9 84.6 85.3 70.9 52.1
FBA (CNN Model 2) 72.2 81.3 84.3 84.9 68.1 36.5
5.2.3 Comparison with the state-of-the-art
To show comparison to state-of-the-art methods, we employ both target tracking ap-
proaches applied to our recognition pipeline and the Target Face Association (TFA)
method, proposed in [11]. Regarding the tracking methods, since the faces in key frames
are annotated, recent state-of-the-art tracking methods are employed to track each face
independently starting from the annotated key frame. These trackers include the method
based on oversampling local features (ALIEN tracker [49]) and Multi-Domain Network
(MDNet tracker [44]). We used publicly available implementations for ALIEN and MD-
Net, while TFA results are taken from [11], which reports identification and verification accuracy on the JANUS CS3 dataset.
Table 5.3: Comparison with the state-of-the-art. Performances on Gallery 2 in Protocol 6 of the JANUS CS3.
Method Identification Rate (%) TAR (%)
Rank-1 Rank-5 Rank-10 FAR@1e-2 FAR@1e-3 FAR@1e-4
ALIEN tracker [49] 54.4 65.3 70.2 70.9 49.3 23.1
MDNet tracker [44] 54.5 67.0 69.0 70.7 53.7 23.8
TFA [11] 55.1 68.0 73.2 - 42.5 29.3
FBA (CNN Model 1) 58.3 69.4 73.8 75.5 54.2 24.5
FBA (CNN Model 2) 59.5 69.6 74.0 75.4 54.3 22.1
Tables 5.2 and 5.3 show the identification and verification performances on Gallery 1 and Gallery 2, respectively. In line with what is reported in [11], the performance on Gallery 1 is much better than the one on Gallery 2.
Table 5.4 shows the average performances over Gallery 1 and 2. Our method outperforms the MDNet tracker by 4.6% at Rank-1 and the ALIEN tracker by 3.5% in TAR at FAR=1e-2. Our method also consistently outperforms TFA, a similar method based on data association. We observe that the performance of FBA (CNN Model 2) is 1.3% better in Rank-1 accuracy than FBA (CNN Model 1); however, FBA (CNN Model 1) shows slightly better accuracy in TAR. This is due to the different snapshots used for testing the system. It is noteworthy that even the least performing method, the ALIEN tracker, still shows improved results compared to the system using the single, annotated key frame. Finally, Figure 5.6 shows sample associated faces from JANUS CS3 videos.
Table 5.4: Comparison with the state-of-the-art. Average performances over Gallery 1 and 2 in Protocol 6 of the JANUS CS3.
Method Identification Rate (%) TAR (%)
Rank-1 Rank-5 Rank-10 FAR@1e-2 FAR@1e-3 FAR@1e-4
Key frame only 54.5 68.1 72.8 73.2 51.9 25.5
ALIEN tracker [49] 59.9 71.5 76.2 76.9 57.4 34.1
MDNet tracker [44] 61.3 71.8 73.5 75.4 60.5 33.7
TFA [11] 61.0 73.4 77.9 - 49.7 34.1
FBA (CNN Model 1) 64.6 75.3 79.2 80.4 62.6 38.3
FBA (CNN Model 2) 65.9 75.5 79.2 80.2 61.2 29.3
5.3 Conclusions
In this chapter, we propose a robust Face and Body Association (FBA) method for face recognition in videos. FBA associates faces across different frames based on cues of spatial location, detection confidence and appearance information extracted from both the face and the upper-body. Our experiments show that FBA is beneficial for video-based face recognition and achieves a performance boost of 4.6% in identification rate at Rank-1 and a margin of 3.5% in TAR at FAR=1e-2 compared to current state-of-the-art methods.
Figure 5.6: Sample associated faces in JANUS CS3 videos. (a) A large number of people appear and move to another place. (b) The target subject undergoes abrupt pose changes and the video contains severe motion blur.
Chapter 6
Conclusions and Future work
In this study, we learn that landmark detection is important for many facial analysis tasks. The problem of automatic facial landmark detection has seen much progress over the past years; however, most of the state-of-the-art methods still struggle in the presence of extreme head pose, especially in challenging in-the-wild images. In order to solve this problem, we present a new model, HCLM, which unifies local and holistic facial landmark detection by integrating three methods: head pose estimation, sparse-holistic landmark detection and dense-local landmark detection. Our new model was evaluated on three challenging datasets: 300-W, AFLW and IJB-FL. It shows state-of-the-art performance and is robust, especially in the presence of large head pose variations.
We also propose two methods for measuring landmark confidence: a local confidence based on the local predictors of each facial landmark, and a global confidence based on a 3D rendered face model. While the local confidence measures the goodness of landmarks based on the 2D aligned images, the global confidence measures it from the 3D rendering point of view. Our experiments show that both confidences are beneficial to face recognition, improving accuracy by up to 9% compared to the methods without landmark confidences.
Moreover, we propose a robust Face and Body Association (FBA) method for face recognition in videos. FBA associates faces across different frames based on cues of spatial location, detection confidence and appearance information extracted from both the face and the upper-body. Our experiments show that FBA is beneficial for video-based face recognition and achieves a performance boost of 4.6% in identification rate at Rank-1 and a margin of 3.5% in TAR at FAR=1e-2 compared to current state-of-the-art methods.
6.1 Contributions
The main goal of this dissertation was to propose a new landmark detection model for in-the-wild environments. Compared to the state-of-the-art, this dissertation provides strong contributions in terms of novelty. Moreover, we proposed a novel method for face and body association (FBA) in videos to improve face recognition. The main contributions of this dissertation are as follows:
• Holistically Constrained Local Model
- We presented a novel Holistically Constrained Local Model (HCLM) which unifies local and holistic facial landmark detection by integrating head pose estimation, sparse-holistic landmark detection and dense-local landmark detection.
- Our method's main advantage is the ability to handle very large pose variations, including profile faces. Furthermore, our model integrates local and holistic facial landmark detectors in a joint framework, with the holistic approach narrowing down the search space for the local one.
• Local-Global Landmark Confidences for Face Recognition
- We proposed a new method for measuring landmark confidence. One is a constrained local method, and the other is a rendering-based global method. The local method measures accuracy based on the local predictors of each facial landmark, and the global method predicts the confidence from 3D rendered faces.
- While the local confidence measures the goodness of landmarks based on the 2D aligned images, the global confidence measures it from the 3D rendering point of view.
- The merit of using these two confidences is that they help alleviate the influence of poorly aligned face images on face recognition, so as to improve recognition accuracy.
• Face and Body Association for Video-based Face Recognition
- We presented a novel method for face and body association (FBA) in videos to improve face recognition. A more robust face representation can be obtained by taking advantage of person association in a video, not only using face association, but supporting the latter with additional body information.
- This allows the tracking of targets even when facial details are not clearly visible or when the face representation is not reliable due to low resolution or small scale. The upper-body of the subject contains enough discriminative information that can be leveraged to aid the association within the video.
- Using this approach, we can obtain a wider set of facial frames representing the target subject. These associated face images are used to compute a representation for face recognition.
6.2 Future work
Despite having state-of-the-art results on facial landmark detection in the wild, there is
still room for future research on these topics. In particular, we are interested in exploring
the problems below:
• Data augmentation for profile images: We presented our Holistically Constrained Local Model (HCLM) for facial landmark detection, and it shows state-of-the-art performance, especially in the presence of large head pose variations. However, the performance on profile images is usually worse. Fewer profile training images were used compared to frontal training images, which may potentially lead to worse performance. We can perform data augmentation to generate synthetic (profile) images using Generative Adversarial Networks (GANs) in order to provide more training data for facial landmark detection from only a limited number of labeled images.
• Joint facial landmark detection and 3D face model framework: Does 3D modeling help to improve landmark accuracy? First, we can get the initial landmarks using the current method. Given the detected facial landmarks, the 3D pose is estimated with RANSAC-PnP [22] and the 3D face model is estimated using 3D Morphable Model fitting [7]. Then, we can obtain the facial landmarks by projecting the 3D landmark points onto the 2D facial image and fine-tune their 2D positions. We can repeat this process several times until we obtain stable 2D facial landmarks and a stable 3D face model.
• Association with target subjects in the video: Although our Face and Body Association (FBA) method outperforms baseline methods by a large margin, it is still limited, as FBA is based on the observation that the face and/or upper-body remain similar across different scenes and shots, which may not be true in extreme cases. This is a challenging problem we want to explore in our future work.
Reference List
[1] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3444-3451, 2013.
[2] T. Baltrušaitis, P. Robinson, and L.-P. Morency. Constrained local neural fields for robust facial landmark detection in the wild. In The IEEE International Conference on Computer Vision (ICCV) Workshops, June 2013.
[3] T. Baltrušaitis, P. Robinson, and L.-P. Morency. Continuous conditional neural fields for structured regression. In Computer Vision - ECCV 2014, pages 593-608. Springer, 2014.
[4] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(12):2930-2940, 2013.
[5] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008(1):246309, 2008.
[6] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. IEEE Transactions on Information Forensics and Security, 9(12):2144-2157, 2014.
[7] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1063-1074, 2003.
[8] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online multiperson tracking-by-detection from a single, uncalibrated camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1820-1833, 2011.
[9] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by Explicit Shape Regression. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2887-2894. IEEE, June 2012.
[10] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1611.08050, 2016.
[11] C.-H. Chen, J.-C. Chen, C. D. Castillo, and R. Chellappa. Video-based face association and identification. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 149-156. IEEE, 2017.
[12] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In European Conference on Computer Vision, pages 109-122. Springer, 2014.
[13] J. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep CNN features. arXiv preprint, arXiv:1508.01722v1, 2015.
[14] J.-C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep cnn features. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1-9, 2016.
[15] J. C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep cnn features. 2016.
[16] J.-C. Chen, S. Sankaranarayanan, V. M. Patel, and R. Chellappa. Unconstrained face verification using fisher vectors computed from frontalized faces. In IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1-8, 2015.
[17] B. Czupryński and A. Strupczewski. Active Media Technology: 10th International Conference, AMT 2014, Warsaw, Poland, August 11-14, 2014. Proceedings, chapter High Accuracy Head Pose Tracking Survey, pages 407-420. Springer International Publishing, Cham, 2014.
[18] C. Ding and D. Tao. Robust face recognition via multimodal deep face representation. IEEE Transactions on Multimedia, 17(11):2049-2058, 2015.
[19] G. Fanelli, J. Gall, and L. V. Gool. Real Time Head Pose Estimation with Random Regression Forests. In IEEE Conference on Computer Vision and Pattern Recognition, pages 617-624, 2011.
[20] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 1-8, 2008.
[21] Y. Gurovich, I. Kissos, and Y. Hanani. Quality scores for deep regression systems. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 3758-3762. IEEE, 2016.
[22] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[23] A. F. Hayes and K. Krippendorff. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77-89, 2007.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[25] P. Hu and D. Ramanan. Finding tiny faces. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1522-1530. IEEE, 2017.
[26] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[27] H. Jiang and E. Learned-Miller. Face detection with the Faster R-CNN. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 650–657. IEEE, 2017.
[28] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1931–1939. IEEE, 2015.
[29] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997.
[30] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In Computer Vision–ECCV 2012, pages 679–692. Springer, 2012.
[31] J. Li and Y. Zhang. Learning SURF cascade for fast and accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3468–3475, 2013.
[32] S. Liao, A. K. Jain, and S. Z. Li. A fast and accurate unconstrained face detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):211–223, 2016.
[33] G. Lisanti, S. Karaman, and I. Masi. Multichannel-kernel canonical correlation analysis for cross-view person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 13(2):13, 2017.
[34] G. Lisanti, I. Masi, A. D. Bagdanov, and A. Del Bimbo. Person re-identification by iterative re-weighted sparse ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(8):1629–1642, 2015.
[35] G. Lisanti, I. Masi, and A. Del Bimbo. Matching people across camera views using kernel canonical correlation analysis. In Proceedings of the International Conference on Distributed Smart Cameras, page 10. ACM, 2014.
[36] L. Liu, L. Zhang, H. Liu, and S. Yan. Toward large-population face identification in unconstrained videos. IEEE Transactions on Circuits and Systems for Video Technology, 24(11):1874–1884, 2014.
[37] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[38] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
[39] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[40] I. Masi, T. Hassner, A. T. Tran, and G. Medioni. Rapid synthesis of massive face
sets for improved face recognition. In IEEE International Conference on Automatic
Face and Gesture Recognition, 2017.
[41] I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4838–4846, 2016.
[42] I. Masi, A. Tran, T. Hassner, J. T. Leksut, and G. Medioni. Do We Really Need to Collect Millions of Faces for Effective Face Recognition? In European Conference on Computer Vision, 2016.
[43] E. Murphy-Chutorian and M. M. Trivedi. Head pose estimation in computer vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):607–626, 2009.
[44] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4293–4302, 2016.
[45] C. Papazov, T. K. Marks, and M. Jones. Real-time 3D Head Pose and Facial Landmark Estimation from Depth Images Using Triangular Surface Patch Features. In CVPR, 2015.
[46] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
[47] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, volume 1, page 6, 2015.
[48] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
[49] F. Pernici and A. Del Bimbo. Object tracking by oversampling local features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12):2538–2551, 2014.
[50] P. J. Phillips, M. Q. Hill, J. A. Swindle, and A. J. O'Toole. Human and algorithm performance on the PaSC face recognition challenge. In Biometrics Theory, Applications and Systems (BTAS), 2015 IEEE 7th International Conference on, pages 1–8. IEEE, 2015.
[51] N. Poh and J. Kittler. A unified framework for biometric expert fusion incorporating quality measures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):3–18, 2012.
[52] G. Rajamanoharan and T. F. Cootes. Multi-View Constrained Local Models for Large Head Angle Facial Tracking. In ICCV, 2015.
[53] R. Ranjan, V. M. Patel, and R. Chellappa. A deep pyramid deformable part model for face detection. In Biometrics Theory, Applications and Systems (BTAS), 2015 IEEE 7th International Conference on, pages 1–8. IEEE, 2015.
[54] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[55] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692, 2014.
[56] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[57] A. Ross and A. Jain. Information fusion in biometrics. Pattern Recognition Letters, 24(13):2115–2125, 2003.
[58] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 397–403, 2013.
[59] J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200–215, 2011.
[60] J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable Model Fitting by Regularized Landmark Mean-Shift. International Journal of Computer Vision, 91(2):200–215, 2011.
[61] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[62] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. 2015.
[63] A. Steger, R. Timofte, and L. V. Gool. Failure detection for facial landmark detectors, 2016.
[64] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3476–3483, 2013.
[65] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. 2014.
[66] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. 2015.
[67] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. pages 1701–1708. IEEE, 2014.
[68] G. Tzimiropoulos. Project-Out Cascaded Regression with an application to Face Alignment. In CVPR, 2015.
[69] G. Tzimiropoulos and M. Pantic. Gauss-Newton deformable part models for face alignment in-the-wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1851–1858, 2014.
[70] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[71] D. Wang, C. Otto, and A. K. Jain. Face search at scale: 80 million gallery. arXiv preprint, arXiv:1507.07242, 2015.
[72] L. Wolf, T. Hassner, and Y. Taigman. Effective unconstrained face recognition by combining multiple descriptors and learned background statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10):1978–1990, 2011.
[73] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 532–539, 2013.
[74] J. Yan, X. Zhang, Z. Lei, and S. Z. Li. Face detection by structural models. Image and Vision Computing, 32(10):790–799, 2014.
[75] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–8. IEEE, 2014.
[76] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Convolutional channel features. In Proceedings of the IEEE International Conference on Computer Vision, pages 82–90, 2015.
[77] H. Yang, W. Mou, Y. Zhang, I. Patras, H. Gunes, and P. Robinson. Face Alignment Assisted by Head Pose Estimation. In BMVC, pages 1–13, 2015.
[78] S. Yang, P. Luo, C.-C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, pages 3676–3684, 2015.
[79] S. Yang, P. Luo, C.-C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5525–5533, 2016.
[80] Z. Yang and R. Nevatia. A multi-scale cascade fully convolutional network face detector. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 633–638. IEEE, 2016.
[81] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint, arXiv:1411.7923, 2014.
[82] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint, arXiv:1411.7923, 2014.
[83] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim. Rotating your face using multi-task deep neural network. 2015.
[84] A. Zadeh, Y. C. Lim, T. Baltrušaitis, and L.-P. Morency. Convolutional experts constrained local model for 3D facial landmark detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2519–2528, 2017.
[85] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In Computer Vision–ECCV 2014, pages 1–16. Springer, 2014.
[86] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1644, 2014.
[87] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S³FD: Single shot scale-invariant face detector. arXiv preprint arXiv:1708.05237, 2017.
[88] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
[89] E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of LFW benchmark or not? arXiv preprint, arXiv:1501.04690, 2015.
[90] S. Zhu, C. Li, C. Change Loy, and X. Tang. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4998–5006, 2015.
[91] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.
[92] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. 2014.