Landmark-Free 3D Face Modeling for Facial Analysis and Synthesis
by
Feng-Ju Chang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2019
Copyright 2019 Feng-Ju Chang
Acknowledgments
I would like to express my gratitude to those who helped me complete this thesis and grow during
my Ph.D. journey. First, I thank my advisor, Prof. Ram Nevatia, who always gave me plenty of
room to pursue my own research topics. He would also give me the big picture and analyze with
me the pros and cons of ideas and solutions.
Second, I would like to thank my co-advisor, Prof. Tal Hassner, who always pushed me to be
the best version of myself. He was also extremely helpful in framing the big picture of the research
problems I worked on, and he encouraged me to keep pursuing the solutions we believed were right
even when the initial results were unexpected.
Third, I truly thank my internship mentor, Dr. Xiang Yu. Although we did not work together
for a long time, working with him brought out the best in me and helped me accomplish several
challenging facial synthesis tasks. He also taught me, from his own experience, how to manage
multiple projects at once.
I would also like to thank our post-doc, Dr. Iacopo Masi, a researcher I greatly admire. We
would sometimes discuss a research problem at length from various angles and arrive at the most
effective and efficient solution. I truly enjoyed those moments.
In addition, I thank Dr. Jongmoo Choi, who often encouraged me and other lab members to
do impactful research. This taught me to approach a common research problem in different ways
and to come up with solutions that are not overly complicated yet quite effective.
I would like to thank my labmates, Dr. Anh Tran, Dr. KangGeon Kim, Dr. Kan Chen, and
Dr. Zhenheng Yang. They were always ready to help, from OS and computer issues to research
problems. We often collaborated on projects, and I have learned a lot from them about doing
research more efficiently.
Finally, I cannot thank my husband, Prof. Wei-Lun (Harry) Chao, enough for his support
throughout this journey. He was always patient in discussing the research problems I encountered.
While brainstorming ideas with him, I learned a great deal about how to investigate a research
problem and how to conduct solid experiments to verify whether a hypothesis is right. He supported
me in choosing the right strategies to carry out a task and reminded me when I was doing something
wrong, preventing me from wasting time and resources. Thank you for all of your support,
encouragement, inspiration, and love.
Table of Contents
Acknowledgments ii
List of Tables vi
List of Figures viii
1 Introduction 1
1.1 3D Face Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Facial Attribute Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Face Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Related Work 10
2.1 Facial Landmark Detection and Its Applications . . . . . . . . . . . . . . . . . . 10
2.2 Deep Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Expression Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 3D face modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Face alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Face Frontalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Attribute Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Multi-Source Fusion: Image Set-to-Set Similarity Computation . . . . . . . . . . 14
2.8.1 Set Fusion Followed by Set Matching . . . . . . . . . . . . . . . . . . . 14
2.8.2 Set Matching Followed by Set Score Fusion . . . . . . . . . . . . . . . . 15
2.9 Discriminative Multi-Source Fusion: Metric Learning for Set-to-Set Matching . . 15
3 Deep Face Pose Net 16
3.1 Motivation: A critique of Facial Landmark Detection . . . . . . . . . . . . . . . . 16
3.1.1 Landmark Detection Accuracy Measures . . . . . . . . . . . . . . . . . 16
3.1.2 Landmark Detection Speed . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3 Effects of Facial Expression and Shape on Alignment . . . . . . . . . . . 17
3.2 Deep, Direct Head Pose Regression . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Head Pose Representation . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Training Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.3 FPN Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.4 2D and 3D Face Alignment with FPN . . . . . . . . . . . . . . . . . . . 20
3.3 Experimental Results for FPN . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 Effect of Alignment on Face Recognition . . . . . . . . . . . . . . . . . 20
3.3.2 Face Recognition Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.3 Face Verification and Identification Results . . . . . . . . . . . . . . . . 22
3.3.4 Landmark Detection Accuracy . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 Deep Expression Net 26
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Deep, 3D Expression Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.1 Representing 3D Faces and Expressions . . . . . . . . . . . . . . . . . . 27
4.2.2 Generating 3D Expression Data . . . . . . . . . . . . . . . . . . . . . . 28
4.2.3 Training ExpNet to Predict Expression Coefficients . . . . . . . . . . . . 29
4.2.4 Estimating Expressions Coefficients with ExpNet . . . . . . . . . . . . . 29
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.1 Quantitative Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Deep, Landmark-Free FAME:
Face Alignment, Modeling, and Expression Estimation 38
5.1 Motivations and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 Our proposed FAME framework . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.2 Modeling subject-specific 3D shape . . . . . . . . . . . . . . . . . . . . 41
5.2.3 Modeling face viewpoint and facial expressions . . . . . . . . . . . . . . 41
5.2.4 Discussion: Training labels from landmark detections? . . . . . . . . . . 42
5.3 From FAME to landmark detection . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3.1 Landmark projections . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3.2 Landmark refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4 A critique of existing test paradigms . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4.1 Detection accuracy measures . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4.2 Ill-defined facial locations . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.3 Viewpoint-dependent facial locations . . . . . . . . . . . . . . . . . . . 46
5.4.4 Occluded points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.5 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4.6 Landmark detection vs. face processing application . . . . . . . . . . . . 47
5.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5.1 Evaluating face alignment . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5.2 Evaluating expression estimation . . . . . . . . . . . . . . . . . . . . . . 50
5.5.3 Landmark detection accuracy . . . . . . . . . . . . . . . . . . . . . . . 51
5.5.3.1 300W results with ground truth bounding boxes . . . . . . . . 52
5.5.3.2 300W results with detected bounding boxes . . . . . . . . . . 53
5.5.3.3 Landmark detection runtime . . . . . . . . . . . . . . . . . . . 54
5.5.3.4 AFLW 2000-3D results . . . . . . . . . . . . . . . . . . . . . 54
5.5.3.5 AFLW-PIFA results . . . . . . . . . . . . . . . . . . . . . . . 54
5.5.3.6 Discussion: FAME vs. OpenFace . . . . . . . . . . . . . . . . 55
5.5.4 Qualitative results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6 Pose-Variant 3D Facial Attribute Generation 63
6.1 Motivations and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.2.1 3D Shape Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.2 UV Position and Texture Maps . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.3 UV Texture Map Completion . . . . . . . . . . . . . . . . . . . . . . . 67
6.2.4 3D Face Attribute Generation . . . . . . . . . . . . . . . . . . . . . . . 69
6.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.4.2 UV Texture Map Completion . . . . . . . . . . . . . . . . . . . . . . . 72
6.4.3 3D Attribute Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.4.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.5 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7 Multi-Source Fusion for Face Matching 89
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.2.1 Template-to-Template Similarity . . . . . . . . . . . . . . . . . . . . . . 91
7.2.2 Ensemble SoftMax Similarity Embedding via Template Triplets . . . . . 91
7.2.3 On Incorporation of Context for Sample-Specific Embedding . . . . . . . 93
7.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.3.2 Features and Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3.3 Protocols and Evaluation Measures . . . . . . . . . . . . . . . . . . . . 95
7.3.4 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3.5 Comparisons of Ensemble SoftMax Similarity to the Other Template
Based Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3.6 Comparisons of the Proposed Approaches to the Existing Image Template
Classification Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8 Conclusions 100
Bibliography 101
List of Tables
3.1 Summary of augmentation transformation parameters used to train our FPN, where
U(a, b) samples from a uniform distribution ranging from a to b and N(μ, σ²) samples
from a normal distribution with mean μ and variance σ²; width and height are the
face detection bounding box dimensions. . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Verification and identification on IJB-A and IJB-B, comparing landmark detection
based face alignment methods. Three baseline IJB-A results are also provided as
reference at the top of the table.
Numbers estimated from the ROC and CMC
in [159]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1 Expression estimation runtime. Runtime of expression fitting for recent methods.
Landmark-based methods must first extract landmarks and then perform optimization-
based fitting at test time, whereas deep methods solve the entire problem in a single
step. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1 Verification and identification on IJB-A and IJB-B, comparing landmark detection–
based face alignment methods. Three baseline IJB-A results are also provided as
reference at the top of the table.
Numbers estimated from the ROC and CMC
in [159]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 The NME (%) of 68 point detection results on 300W with ground truth bounding
boxes provided by 300W. We use the typical split: Common (HELEN and LFPW),
Challenging (iBUG), and Full (HELEN, LFPW, and iBUG). * These methods
were tested by us using code provided by their authors. . . . . . . . . . . . . . . 58
5.3 The NME (%) of 68 point detection results on 300W with the bounding box
provided by the face detector of [172]. We use the typical splits: Common
(HELEN and LFPW), Challenging (iBUG), and Full (HELEN, LFPW, and iBUG). 59
5.4 The NME (%) of 68 point detection results on AFLW2000-3D for different ranges
of yaw angles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.5 NME (%) results on the AFLW dataset under the PIFA protocol [79, 80].
Followed a non-standard training/testing setting on the AFLW dataset. . . . . . . 60
6.1 FID score comparison on the LFW dataset. We randomly select one image out of each
verification pair and render it at yaw angles of 15°, 30°, 45°, 60°, and 75°, respectively.
FID is calculated between the frontalized images and the unselected original images. . 73
6.2 Verification accuracy comparison on the LFW dataset. We apply our TC-GAN and
other face frontalization methods to the LFW images with yaw angle 15°, replacing
each original image with its frontalized version. . . . . . . . . . . . . . . . . . . 73
6.3 Quantitative comparison on attribute generation by F1 score on CelebA testing set.
The target generated attribute is evaluated by an off-line attribute classifier for F1
score (precision and recall). The higher the better. “real” means original CelebA
training set. “real-a” means original plus pose augmented images. “Ours” means
training with our proposed loss and UV texture data. *: we apply the network
structure and re-train models. SG: Sunglass, LS: Wearing Lipstick, SH: 5 o'clock
shadow, SM: Smiling, BA: Bangs. . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.4 Quantitative comparison on attribute generation by FID score [65] on CelebA
testing set. Visual quality is indicated by FID score between the target attribute
generated images and the ground truth with same attribute images. The lower the
better. “real” means original CelebA training set. “real-a” means original plus
pose augmented images. “Ours” means training with our proposed loss and UV
texture data. *: we apply the network structure and re-train models. SG: Sunglass,
LS: Wearing Lipstick, SH: 5 o'clock shadow, SM: Smiling, BA: Bangs. . . . . . 75
6.5 Identity preserving evaluation on IJBA dataset under the verification protocol,
reporting TAR@FAR0.01. *: models we retrain on our training data. SG: Sunglass,
LS: Wearing Lipstick, SH: 5 o'clock shadow, SM: Smiling, BA: Bangs. . . . . . . 75
6.6 Ablation study without the masked reconstruction loss (Eq. 12) and/or without the
attribute loss (Eq. 14). F1 scores are reported. We use the CycleGAN ResNet structure
as it achieves the best result across the experiments. SG: Sunglass, LS: Wearing
Lipstick, SH: 5 o'clock shadow, SM: Smiling, BA: Bangs. . . . . . . . . . . . . 76
6.7 Ablation study without the masked reconstruction loss (Eq. 12) and/or without the
attribute loss (Eq. 14). FID scores are reported. We use the CycleGAN ResNet structure
as it achieves the best result across the experiments. SG: Sunglass, LS: Wearing
Lipstick, SH: 5 o'clock shadow, SM: Smiling, BA: Bangs. . . . . . . . . . . . . 76
6.8 StarGAN Generator network architecture . . . . . . . . . . . . . . . . . . . . . 77
6.9 StarGAN Quality and Attribute discriminator network architecture . . . . . . . . 77
6.10 CycleGAN Generator network architecture . . . . . . . . . . . . . . . . . . . . 78
6.11 CycleGAN quality discriminator network architecture . . . . . . . . . . . . . . . 78
6.12 The Attribute discriminator network architecture we used with CycleGAN . . . . 78
7.1 Face Recognition Accuracies (in terms of TARs (%) at different FARs) on IARPA Janus
Benchmark A (Verification Protocol)[88] with different kinds of template-to-template
similarity measures. Note Paradigm (1) is “set fusion followed by set matching”, and
Paradigm (2) is “set matching followed by set score fusion”. . . . . . . . . . . . . . . 96
7.2 Average recognition rate (ARR) (%) on the YTC dataset . . . . . . . . . . . . . . . . 96
7.3 Average verification performances (%) on the YTF dataset . . . . . . . . . . . . . . . 97
7.4 Average recognition rate (ARR) (%) on the Traffic dataset . . . . . . . . . . . . . . . 97
7.5 Average verification performances (%) on the IJB-A dataset . . . . . . . . . . . . . . 98
7.6 Average verification performances (%) of the templates with at least 10 samples on the
IJB-A dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.7 Average verification performances (%) of three baselines and the proposed method on the
three datasets with the evaluation measure shown in the bracket . . . . . . . . . . . . . 99
List of Figures
1.1 An example of face alignment and rendering. . . . . . . . . . . . . . . . . . . . 2
1.2 Editing a face image with its 3D face shape by synthesizing "smiling", "sunglasses",
and "bangs" attributes under different yaw angles. . . . . . . . . . . . . . . . . . 2
1.3 Proposed framework for 3D face modeling. Given an input face photo, we process
it using three separate deep networks. These networks estimate, from top to bottom:
the 3D face shape [149], 6DoF viewpoint, and 29D expression coefficients (last
two described here). The output is an accurate 3D face model, aligned with the
input face. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Results of our FAME approach. We propose deep networks which regress 3DMM
shape, expression, and viewpoint parameters directly from image intensities. We
show this approach to be highly robust to appearance variations, including out-of-
plane head rotations (top row), scale changes (middle), and ages (bottom). . . . . 4
1.5 Facial attributes generation under head pose variations, showing results comparison
of our method to StarGAN [27] and CycleGAN [185]. Traditional frameworks
generate artifacts due to pose variations. Introducing a 3D UV representation, the
proposed TC-GAN and 3DA-GAN generates photo-realistic face attributes on
pose-variant faces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 The proposed framework of pose-variant 3D facial attribute generation. By 3D
dense shape reconstruction, a pose-variant face input is transformed into the UV
position map and incomplete UV texture map (with the black holes) due to self-
occlusion. Then, a texture completion GAN (TC-GAN) inpaints the black holes
into a completed UV texture map. Further, a 3D attribute generation GAN (3DA-
GAN) is designed to generate the target attributes on UV texture map and rendered
back to 2D images with variant head poses. . . . . . . . . . . . . . . . . . . . . 6
1.7 Different levels of fusion in the matching stage. The fusion problem can occur at the
image level, where multiple alignment methods are combined, or at the set level, where
distinct media for a subject are aggregated. . . . . . . . . . . . . . . . . . . . . . 7
1.8 The proposed template triplet embedding: (a) shows an example of a triplet. The
goal is to enlarge similarity of positive pairs while reducing that of negative ones.
Template triplet embedding, right hand side of (c), is compared to the conventional
contrastive embedding (b), and image triplet embedding, left hand side of (c).
Our approach introduces the (sub)template triplets to take the template structure
into account. The image-triplet embedding can be seen as a special case when the
subtemplate size equals 1. Best viewed in color. . . . . . . . . . . . . . . . . . . 9
2.1 Applications of facial landmarks. Illustrating the frequency of various task and
application names in paper titles citing two of the most popular landmark detec-
tors [190] and [168]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 Augmenting appearances of images from the VGG face dataset [120]. After
detecting the face bounding box and landmarks, we augment its appearance by
applying a number of simple planar transformations, including translation, scaling,
rotation, and flipping. The same transformations are applied to the landmarks,
thereby producing example landmarks for images which may be too challenging
for existing landmark detectors to process. . . . . . . . . . . . . . . . . . . . . . 18
3.2 Example augmented training images. Example images from the VGG face data
set [120] following data augmentation. Each triplet shows the original detected
bounding box (left) and its augmented versions (mirrored across the vertical
axis). Both flipped versions were used for training FPN. Note that in some cases,
detecting landmarks would be highly challenging on the augmented face, due
to severe rotations and scalings not normally handled by existing methods. Our
FPN is trained with the original landmark positions, transformed to the augmented
image coordinate frame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Verification and identification results on IJB-A and IJB-B. ROC and CMC curves
accompanying the results reported in Table 5.1. . . . . . . . . . . . . . . . . . . 22
3.4 68 point detection accuracies on 300W. (a) The percent of images with 68 landmark
detection errors lower than 5%, 10%, and 20% inter-ocular distances, or greater
than 40%, mean error rates (MER) and runtimes. Our FPN was tested using
a GPU. On the CPU, FPN runtime was 0.07 seconds. 3DDFA used the AFW
collection for training. Code provided for 3DDFA [192] did not allow testing on
the GPU; in their paper, they claim GPU runtime to be 0.076 seconds. As AFW
was included in our 300W test set, landmark detection accuracy results for 3DDFA
were excluded from this table. (b) Cumulative error curves. . . . . . . . . . . . . 23
3.5 Qualitative landmark detection examples. Landmarks detected in 300W [135] im-
ages by projecting an unmodified 3D face shape, pose aligned using our FPN (red)
vs. ground truth (green). The images marked by the red-margin are those which
had large FPN errors (> 10% inter-ocular distance). These appear perceptually
reasonable, despite these errors. The mistakes in the red-framed example on the
third row were clearly a result of our FPN not representing expressions. . . . . . . 24
4.1 Confusion matrix for expression recognition on the CK+ dataset. This reports how
the confusion is distributed across emotions at the original resolution, given (a)
CE-CLM landmarks and expression fitting, (b) directly using the deep method
of [192], and (c) our method. Our method shows the least confusion, since most
of the intensity is concentrated on the diagonal compared with the other approaches. 30
4.2 Confusion matrix for expression recognition on the EmotiW-17 dataset. This reports
how the confusion is distributed across emotions at the original resolution, given
(a) CE-CLM landmarks and expression fitting, (b) directly using the deep method
of [192], and (c) our method. Our method still shows the least confusion,
even in this highly unconstrained scenario. . . . . . . . . . . . . . . . . . . . . . 31
4.3 Expression Recognition Accuracy on CK+ dataset. Each curve corresponds to
a method. For each scale, the experiment resizes the input image accordingly.
Lower scale indicates lower resolution. Original resolution is 640×490. . . . . . 32
4.4 Expression Recognition Accuracy on EmotiW-17 dataset. Each curve corresponds
to a method. For each scale, the experiment resizes the input image accordingly.
Lower scale indicates lower resolution. Original resolution is 720×576. . . . . . 33
4.5 Qualitative Expression Estimation Comparison. Our method extends the 3DMM-
CNN method that is unable to model expression. All the methods used the same
3D shape provided by 3DMM-CNN [149]. Each labeled emotion is reported
above each result. The proposed method and 3DDFA show consistent expression
fitting across scale and appear to be robust across resolution. In particular, our
method is able to model subtle expressions better than 3DDFA. On the other hand,
the top-performing landmark detector (CE-CLM) is not able to clearly recover the
subject's expression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6 Expression estimation failure. When the input image contains an intense expression,
our method does not match the expression intensity, while other methods
exaggerate the expression (3DDFA) or are inconsistent across scales (CE-CLM). 36
4.7 Qualitative results. 48 expression estimation results on images from the challeng-
ing IJB-A [88], IJB-B [159], and 300W [135] collections. Each result displays
an input photo along with the regressed texture and shape, rendered with
the automatically determined pose and expression. These results demonstrate
our method’s capability of handling images of faces in challenging in-plane and
out-of-plane head rotations, low resolutions, occlusions, ages and more. . . . . . 37
5.1 Results of our FAME approach. We propose deep networks which regress 3DMM
shape, expression, and viewpoint parameters directly from image intensities. We
show this approach to be highly robust to appearance variations, including out-of-
plane head rotations (top row), scale changes (middle), and ages (bottom). . . . 39
5.2 From FAME to landmarks. Two examples of landmarks detected using our
FAME framework. In each example: (a) Input image, (b) generic 3D face shape
aligned using FPN (Sec. 5.2.3); reference 3D landmarks (including occluded ones)
rendered in green, (c) reference 3D landmarks projected onto image (Sec. 5.3.1),
(d) 3D shape estimated with FaceShapeNet [149], adjusted for pose with FPN,
and expression with FEN (Sec. 4.2), (e) reference 3D landmarks on adjusted 3D
shape, projected onto image (Sec. 5.3.1), finally, (f) 2D landmarks after refinement
(Sec. 5.3.2). Note refined landmarks moved from self-occluded locations (e) to
the contour landmarks (f) used by 2D landmark detection benchmarks. . . . . . 44
5.3 Visualizing potential problems with facial landmark detection benchmarks. Three
example photos along with their ground truth landmark annotations. (a) Annota-
tions on the jawline and bridge of the nose do not correspond to well-defined facial
features (from LFPW [8]). (b) Landmarks represent points on the face contour
and so represent different facial locations in different views (AFW [190]). (c) 3D
landmark annotations represent occluded facial regions (3D Menpo [180]). . . . 46
5.4 Reference landmark selection vs. detection accuracy. 68 reference points projected
from a 3D face shape to the input image, along with landmark detection accu-
racy. Two sets of 3D reference points are considered: red points were manually
annotated on the reference 3D face, green points were obtained by projecting
2D landmarks, detected by dlib [86, 82] on a frontal face image, onto the 3D
reference shape. Both sets of landmarks can be equally used for face alignment
and processing, but red points produce far lower landmark prediction errors. Does
this imply that the red points are better than the green? . . . . . . . . . . . . . . 47
5.5 Verification and identification results on IJB-A and IJB-B. ROC and CMC curves
accompanying the results reported in Table 5.1. . . . . . . . . . . . . . . . . . . 50
5.6 Emotion classification over scales on the CK+ benchmark. Curves report emotion
classification accuracy over different scales of the input images. Lower scale
indicates lower resolution. Original resolution is 640×490. (a) reports results with
a simple kNN classifier. (b) Same as (a), now using a SVM (RBF kernel) classifier. 51
5.7 Emotion classification over scales on the EmotiW-17 benchmark. Curves report
emotion classification accuracy over different scales of the input images. Lower
scale indicates lower resolution. Original resolution is 720×576. (a) reports results
with a simple kNN classifier. (b) Same as (a), now using a SVM (RBF kernel)
classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.8 Comparisons of CED curves on 300W with the face bounding boxes detected
by [172]. Results provided for the Common (HELEN and LFPW), Challenging
(iBUG), and Full (HELEN, LFPW, and iBUG) splits. . . . . . . . . . . . . . . . 52
5.9 Comparisons of CED curves on AFLW2000-3D. To balance the yaw distributions,
we randomly sample 699 faces from AFLW 2000-3D, split evenly among the 3
yaw categories and compute the CED curve. This is done 10 times and the average
of the resulting CED curves are reported. . . . . . . . . . . . . . . . . . . . . . 55
5.10 Limitations of our approach. Results obtained with the state-of-the-art 3DDFA
of [192] and our full reconstruction (Us, denoting FPN + FEN + FaceShapeNet)
for three faces in the IJB-A benchmark. See text for more details. . . . . . . . . . 61
5.11 Qualitative 3D reconstruction results. Rendered 3D reconstruction results for
IJB-A images representing a wide range of viewing settings. For each image we
provide results obtained by 3DDFA [192], the FaceShapeNet of [149] (adjusted
for viewpoint using FPN, Sec. 5.2.3), and our full approach (Us, denoting FPN
+ FEN + FaceShapeNet). These results should be considered by how well they
capture the unique 3D shape of each individual, the viewpoint, and the facial
expression. See text for more details. . . . . . . . . . . . . . . . . . . . . . . . . 62
6.1 Facial attributes generation under head pose variations, showing results comparison
of our method to StarGAN [27] and CycleGAN [185]. Traditional frameworks
generate artifacts due to pose variations. Introducing a 3D UV representation, the
proposed TC-GAN and 3DA-GAN generates photo-realistic face attributes on
pose-variant faces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2 Illustration of image coordinate space and UV space. (a) Input image. (b) 3D
dense point cloud. (c) UV position map U_p transferred from the 3D point cloud. (d)
UV texture map U_t, partially visible due to pose variation (best viewed in color). 65
6.3 Aligning a ground truth shape, or a shape estimated by an existing 3D reconstruction
method, to the trimmed BFM shape. The example image is from the 4DFE
dataset, and the landmarks L(I) can be obtained by any off-the-shelf image-based
landmark detector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.4 Given an input image, the conversion from the aligned BFM to the fixed UV coordinates,
and the UV texture map rendering based on the vertex visibility and the input
image. ⊙ denotes element-wise multiplication. . . . . . . . . . . . . . . . . . . 67
6.5 The architecture and loss design of 3DA-GAN. . . . . . . . . . . . . . . . . . . 69
6.6 Manually defined attribute-related masks based on the reference UV texture map.
(a) Reference U_t (constructed from our generated UV position map and the mean
face texture provided by the Basel Face Model), (b) eyeglasses mask, (c) lipstick and
smile mask, (d) 5 o'clock shadow mask, and (e) bangs mask. . . . . . . . . . . . 70
6.7 Visualization of TC-GAN and other face frontalization methods on LFW [69]. A
near-frontal image is randomly selected from LFW and shown as “Ground truth”.
We render the ground truth with multiple head poses as input with black background. 79
6.8 Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data. . . . . . . . . . . . . . . . . . . 80
6.9 Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data. . . . . . . . . . . . . . . . . . . 80
6.10 Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data. . . . . . . . . . . . . . . . . . . 81
6.11 Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data. . . . . . . . . . . . . . . . . . . 81
6.12 Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data. . . . . . . . . . . . . . . . . . . 82
6.13 Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data. . . . . . . . . . . . . . . . . . . 82
6.14 Visual results of applying our method to augment face images from CelebA [104]
testing set, in attributes and yaw angles. . . . . . . . . . . . . . . . . . . . . . . 83
6.15 Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles. . . . . . . . . . . . . . . . . . . . . . . . . 83
6.16 Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles. . . . . . . . . . . . . . . . . . . . . . . . . 84
6.17 Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles. . . . . . . . . . . . . . . . . . . . . . . . . 84
6.18 Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles. . . . . . . . . . . . . . . . . . . . . . . . . 85
6.19 Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles. . . . . . . . . . . . . . . . . . . . . . . . . 85
6.20 The effect of masked reconstruction loss on sunglasses, smile, and lipstick gen-
eration. From left to right: input images from CelebA dataset, using full losses,
without masked reconstruction loss (Eq. 12). The masked reconstruction loss
helps generate attributes in a specific region while preserving the non-attribute parts. 86
6.21 The effect of adversarial attribute loss on smile and bangs generation. From left
to right: input images from CelebA dataset, using full losses, without adversarial
attribute loss (Eq. 14). The adversarial attribute loss helps enhance the intensity
of generated attributes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.22 The effect of cycle consistent loss and identity loss on sunglasses generation. From
left to right: input images from CelebA dataset, using full losses, without cycle
consistent loss (Eq. 11), and without identity loss (Eq. 8). The cycle consistent
loss and identity loss help preserve the non-attribute regions. The identity loss
also makes the generated attribute regions more natural. . . . . . . . . . . . . . . 88
7.1 Global and context-aware embedding: previous work generally learns a global
embedding. We propose integrating context features to achieve a sample-specific
embedding, where f(·) consists of three factorized matrices as described
in Chapter 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.2 The distribution of the number of image samples per template on the four datasets used:
the YTC and Traffic datasets have at least 167 frames and 50 images per template,
respectively, while IJB-A and YTF templates contain fewer than 10 images or videos.
Furthermore, about 50.14% of the templates in IJB-A consist of a single image. . . 95
Chapter 1
Introduction
3D face modeling is a critical foundation for a wide range of face research due to its capability to
disentangle the intrinsic face shape from head pose and expression variations. For example,
face recognition systems usually alleviate the effect of pose changes on identity features by aligning
or rendering a face image into canonical viewpoints, as shown in Figure 1.1. The rendering process
requires full 3D face shape information in order to connect the original image coordinates and
textures to those of the rendered-view image. Moreover, 3D face modeling makes it simpler to edit a
face image by synthesizing various facial attributes, such as changing a neutral expression to smiling,
adding sunglasses, or adding bangs, under different head poses (Figure 1.2).
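To make the link between a reconstructed 3D shape and a rendered canonical view concrete, the sketch below projects 3D face vertices into image coordinates under a weak-perspective camera, a common simplification for this kind of rendering; the function name, angle convention, and toy values are illustrative assumptions rather than the exact rendering pipeline used in this thesis.

```python
import numpy as np

def project_weak_perspective(vertices, yaw, pitch, roll, scale, t2d):
    """Project Nx3 face vertices to 2D image coordinates with a
    weak-perspective camera: 3D rotation, isotropic scale, 2D translation."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll
    rotated = vertices @ (Rz @ Rx @ Ry).T                   # rotate the shape
    return scale * rotated[:, :2] + t2d                     # drop depth, scale, shift

# Example: view a (stand-in) dense face shape at a 45-degree yaw (half-profile).
shape = np.random.randn(1000, 3)
pts_2d = project_weak_perspective(shape, np.deg2rad(45), 0.0, 0.0,
                                  120.0, np.array([128.0, 128.0]))
```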
In this chapter, we give an overview of 3D face modeling, facial attribute generation,
and face matching, describing the problems, challenges, and conventional solutions in each
topic. The outline of this thesis is summarized at the end of this chapter.
1.1 3D Face Modeling
The aim of 3D face shape modeling is to reconstruct a high-fidelity face shape from an image
and/or to provide parameterizations for facial expressions in addition to the underlying 3D facial shape [13,
12, 121, 130]. A standard way to represent a face shape is to model it as a linear combination
of 3D face principal components [13, 12, 28, 66, 121], also called the 3D morphable face model
(3DMM) representation, with the coefficients controlling the intensities of the shape and expression
deformations.
Given an image with multiple human faces, we first apply a face detector [172] to obtain the
face bounding box of interest, B; an example is illustrated in Figure 1.1 (a). How do we warp a
face region B, under an arbitrary head pose, into canonical views such as in-plane frontal
(0° roll), out-of-plane frontal (0° yaw), half-profile (45° yaw), and profile (75° yaw), as shown in
Figure 1.1 (b)-(e)? We need global 2D or 3D transformations. Similarly, we need expression and
shape coefficients to model the image-specific face shape in B. Therefore, the main problem in 3D
face shape modeling is to derive a function f : B → p, where p refers to the parameters of the
2D or 3D transformation (the pose parameters) together with the shape and expression coefficients.
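As a minimal numerical sketch of the 3DMM representation described above, the snippet below decodes shape and expression coefficients into a 3D point cloud as a linear combination of principal components; using 29 expression components matches the 29D coefficients used later in this thesis, while the other dimensions and the random bases are toy placeholders rather than a real model.

```python
import numpy as np

# Toy dimensions; a real 3DMM has tens of thousands of vertices.
n_vertices, n_shape, n_expr = 500, 99, 29
mean_shape  = np.random.randn(3 * n_vertices)            # flattened (x, y, z) mean face
shape_basis = np.random.randn(3 * n_vertices, n_shape)   # identity (shape) components
expr_basis  = np.random.randn(3 * n_vertices, n_expr)    # expression components

def decode_3dmm(alpha, beta):
    """3DMM decoding: mean face plus shape and expression deformations."""
    s = mean_shape + shape_basis @ alpha + expr_basis @ beta
    return s.reshape(-1, 3)                               # (n_vertices, 3) point cloud

# p = (pose parameters, alpha, beta) is what f : B -> p must predict for a face box B.
alpha = 0.1 * np.random.randn(n_shape)
beta  = 0.1 * np.random.randn(n_expr)
S = decode_3dmm(alpha, beta)
```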
The conventional approaches [13, 12, 28, 121, 130, 147, 171] usually relied, to some extent,
on facial landmark detection performed either prior to reconstruction or concurrently, as part of
the reconstruction process. By using landmark detectors, these methods were sensitive to face
pose and, aside from a few recent exceptions (e.g., 3DDFA [192]), could not operate well on
Figure 1.1: An example of face alignment and rendering. With the help of 3D face modeling, a
face image (a) can be rendered to (b) roll-0°, (c) yaw-0°, (d) yaw-45°, and (e) yaw-75°.
[Figure 1.2 panels: Input; Smiling + Yaw 15°; Smiling + Bangs + Yaw 30°; Smiling + Bangs + Sunglasses + Yaw 45°]
Figure 1.2: Editing a face image with its 3D face shape by synthesizing "smiling", "sunglasses",
and "bangs" attributes under different yaw angles.
faces viewed in extreme out-of-plane rotations (e.g., near-profile). Scale changes and occlusions
were also problems: when landmarks were too small to be accurately localized, or were
altogether invisible due to occlusions, neither detection nor 3D reconstruction was handled well.
Finally, many methods applied iterative analysis-by-synthesis steps [7, 75, 131] to reconstruct
the 3D face shape directly or estimate the shape, expression, and pose parameters [192]. These
approaches are not only computationally expensive, but also hard to distribute and run in parallel
on dedicated hardware offered by graphics processing units (GPUs).
To address the above issues, we propose a novel, efficient, and accurate alternative based on
three deep networks for joint face alignment, modeling, and expression estimation (FAME), as
shown in Figure 1.3. Our method accurately estimates each of the following components of a 3D
morphable face model (3DMM) representation (Sec. 5.2): the 3D face shape, the six degrees of
freedom (6DoF) viewpoint, and the 3D facial expression (Figure 5.1). We also describe how we
prepare pose- and expression-labeled data to enable the supervised training of our networks. Contrary
to conventional methods, our approach does not require facial landmark detection at test time
and models faces directly from image intensities. Still, if facial landmarks are required, we show
how they can be estimated as a by-product of our modeling, rather than as part of the modeling
process (Sec. 5.3).
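Schematically, the FAME test-time flow can be sketched as three independent regressors whose outputs are combined into a posed, expressive 3D model; the stub "networks" and dimensions below are placeholders standing in for the trained CNNs, not the thesis implementation.

```python
import numpy as np

def fame_forward(image, shape_net, pose_net, expr_net, decode_3dmm):
    """Landmark-free FAME: three networks run directly on image intensities."""
    alpha = shape_net(image)             # 3DMM shape coefficients (FaceShapeNet)
    pose  = pose_net(image)              # 6DoF viewpoint (FacePoseNet)
    beta  = expr_net(image)              # 29D expression coefficients (ExpNet)
    vertices = decode_3dmm(alpha, beta)  # subject-specific, expressive shape
    return vertices, pose                # pose aligns the model with the input face

# Placeholder "networks" standing in for trained models.
shape_net = lambda img: np.zeros(99)
pose_net  = lambda img: np.zeros(6)
expr_net  = lambda img: np.zeros(29)
decode    = lambda a, b: np.zeros((500, 3))
verts, pose = fame_forward(np.zeros((224, 224, 3)), shape_net, pose_net, expr_net, decode)
```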
Figure 1.3: Proposed framework for 3D face modeling. Given an input face photo, we process it
using three separate deep networks. These networks estimate, from top to bottom: the 3D face
shape [149], 6DoF viewpoint, and 29D expression coefficients (last two described here). The
output is an accurate 3D face model, aligned with the input face.
1.2 Facial Attribute Generation
Generation of facial attributes, especially identity-unrelated ones such as sunglasses, smiling, and
bangs, has attracted a lot of attention in face research due to its potential benefits for downstream
applications such as face animation and the augmentation of under-represented classes in face
recognition.
In recent years, conditional generative models such as Variational Auto-Encoders (VAE) [87] or
Generative Adversarial Networks (GAN) [51] have achieved impressive results [170, 27, 64, 142].
However, they have largely focused on 2D near-frontal faces. In contrast, we consider the problem
of generating 3D-consistent attributes on possibly pose-variant faces.
Take the problem of adding sunglasses to a face image as an example. For a frontal input and
a desired frontal output, this involves inpainting a sunglasses texture limited to the
region around the eyes. For an input face image observed under a largely profile view, and the more
general task of generating an identity-preserving, sunglasses-augmented face under arbitrary pose,
a more complex transformation is needed, since (i) both attribute-related and unrelated regions
must be handled and (ii) the attribute must be consistent with the 3D face geometry. Technically, this
requires working with a higher-dimensional output space and generating an image conditioned
on both the head pose and the attribute code. In Figure 1.5, we show how our proposed framework
achieves these abilities, surpassing conventional frameworks such as StarGAN [27] and CycleGAN [185].
Figure 1.4: Results of our FAME approach. We propose deep networks which regress 3DMM
shape, expression, and viewpoint parameters directly from image intensities. We show this
approach to be highly robust to appearance variations, including out-of-plane head rotations (top
row), scale changes (middle), and ages (bottom).
[Figure 1.5 panels: Input, UV Maps, Shape Recon., Render, TC-GAN, 3DA-GAN, Render; attribute rows: Sunglasses, Smile, Bangs; compared methods: StarGAN, CycleGAN, Ours]
Figure 1.5: Facial attributes generation under head pose variations, showing results comparison of
our method to StarGAN [27] and CycleGAN [185]. Traditional frameworks generate artifacts due
to pose variations. Introducing a 3D UV representation, the proposed TC-GAN and 3DA-GAN
generate photo-realistic face attributes on pose-variant faces.
A first attempt would be to frontalize the pose-variant face input. Despite good visual quality,
appearance-based face frontalization methods [151, 176, 71, 143] may not preserve identity
properties. Geometric modeling methods [60, 35] faithfully inherit visible appearance but need
to guess the invisible appearance due to self-occlusion, leading to extensions like UV-GAN [35].
Further, we note that both texture completion and attribute generation are correlated with 3D
shape, that is, the hallucinated appearance should be within the shape area and the generated
attribute should comply with the shape. This motivates our framework that utilizes both 3D shape
and texture, distinguishing our method from traditional ones that deal only with appearance, and
from UV-GAN [35], which uses only the texture map.
Specifically, we propose to disentangle the attribute generation task into the following stages,
as illustrated in Figure 1.6: (1) We apply an off-the-shelf 3D shape reconstruction method, e.g.,
our FAME [24] (Figure 1.3) or PRNet [49] (the recent state of the art), with a rendering layer to
[Figure 1.6 panels: 3D Dense Shape Reconstruction (estimated 3D shape, UV-position map, UV-texture map, flipped UV-texture), UV Texture Map Completion (TC-GAN), 3D Face Attribute Generation (3DA-GAN with attribute codes, e.g., sunglasses and smile), and rendering]
Figure 1.6: The proposed framework of pose-variant 3D facial attribute generation. By 3D
dense shape reconstruction, a pose-variant face input is transformed into the UV position map
and incomplete UV texture map (with the black holes) due to self-occlusion. Then, a texture
completion GAN (TC-GAN) inpaints the black holes into a completed UV texture map. Further,
a 3D attribute generation GAN (3DA-GAN) generates the target attributes on the UV texture map,
which is then rendered back to 2D images under varying head poses.
directly estimate the 3D shape and weak-perspective matrix from a single input image, and
utilize this information to render the partial (self-occluded) texture. (2) We then apply a two-step
GAN, consisting of a texture completion GAN (TC-GAN), which utilizes the above 3D shape and
partial texture to complete the texture map, and a 3D attribute generation GAN (3DA-GAN), which
generates the target attributes on the completed 3D texture representation.
In stage (1), we apply the UV representation [53, 49] to both the estimated 3D point cloud
and the texture, termed U_p and U_t, respectively. The UV representation not only provides the dense
shape information but also builds a one-to-one correspondence from the point cloud to the texture. In
stage (2), TC-GAN and 3DA-GAN use both U_p and U_t as input to inject 3D shape insights into
both the completed texture and the generated attribute.
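The sketch below illustrates how per-vertex data can be scattered into U_p and U_t once each vertex has a fixed UV coordinate; it is a simple per-vertex scatter rather than a full triangle rasterizer, and the map size, visibility mask, and names are assumptions for illustration only.

```python
import numpy as np

def build_uv_maps(uv_coords, vertices, colors, visible, size=256):
    """Scatter per-vertex data into UV space: U_p stores 3D positions,
    U_t stores sampled texture, left zero (black) where a vertex is
    self-occluded -- the holes that TC-GAN later inpaints."""
    u = (uv_coords[:, 0] * (size - 1)).astype(int)
    v = (uv_coords[:, 1] * (size - 1)).astype(int)
    U_p = np.zeros((size, size, 3), dtype=np.float32)
    U_t = np.zeros((size, size, 3), dtype=np.float32)
    U_p[v, u] = vertices                                   # dense shape in UV space
    U_t[v[visible], u[visible]] = colors[visible]          # only visible texture is filled
    return U_p, U_t

# Toy usage: 1000 vertices with fixed UV coordinates and a visibility mask.
n = 1000
uv   = np.random.rand(n, 2)
vert = np.random.randn(n, 3)
col  = np.random.rand(n, 3)
vis  = np.random.rand(n) > 0.4
U_p, U_t = build_uv_maps(uv, vert, col, vis)
```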
1.3 Face Matching
The facial representation learning and the matching stages follow the 3D face shape modeling and
synthesis. Suppose we have obtained the facial features of two face images, f_1 and f_2. The matching
stage measures the similarity of these two feature vectors so as to determine whether the two face
images belong to the same person. Denoting the similarity score of f_1 and f_2 as sim(f_1, f_2), a
popular measure is based on the correlation distance:

$$\mathrm{sim}(\mathbf{f}_1, \mathbf{f}_2) = \frac{(\mathbf{f}_1 - \bar{f}_1) \cdot (\mathbf{f}_2 - \bar{f}_2)}{\lVert \mathbf{f}_1 - \bar{f}_1 \rVert_2 \, \lVert \mathbf{f}_2 - \bar{f}_2 \rVert_2} \tag{1.1}$$

where \bar{f}_1 and \bar{f}_2 are the means of the elements of f_1 and f_2, respectively, and the dot
denotes the dot product. Another popular measure is cosine similarity.
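A minimal implementation of Eq. (1.1), together with the cosine alternative mentioned above, is sketched below; the 512-dimensional random vectors merely stand in for real face descriptors.

```python
import numpy as np

def correlation_similarity(f1, f2):
    """Eq. (1.1): correlation between mean-centered feature vectors."""
    a, b = f1 - f1.mean(), f2 - f2.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_similarity(f1, f2):
    """The other common choice: cosine of the angle between the raw features."""
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2)))

f1, f2 = np.random.randn(512), np.random.randn(512)   # stand-ins for face descriptors
print(correlation_similarity(f1, f2), cosine_similarity(f1, f2))
```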
Notice that every face image may have several features because of augmentations of head poses
and facial attributes. To match features, we can either compute a similarity score for each feature
using (1.1) and combine the multiple scores into one, or first combine the features of each face
image and then apply (1.1) to get a single similarity score. Whichever direction we take, the problem
of fusing multiple sources, different views, or attributes is always present. Furthermore, in set-based
face recognition, where a single instance is a set of face images or a video rather than one image,
multiple media (images or frames) need to be fused as well. It is
Figure 1.7: Different levels of fusion in the matching stage. The fusion problem can occur at the image
level, where multiple alignment methods are combined, or at the set level, where distinct media for a subject
are aggregated.
thus critical to learn a fusion function in order to aggregate multiple sources effectively. Figure 1.7
illustrates the different levels of fusion that occur in face matching.
There has been an increasing number of methods recently addressing the multi-source fusion
problem. The ultimate goal is always to determine the image-to-image similarity or the set-to-set
similarity in image set classification. According to how such a similarity is computed, existing
methods can be divided into two paradigms: (1) “Set fusion followed by set matching” [74, 158,
55], which first constructs a single representation for each set and compares two sets according
to such representations. (2) “Set matching followed by set score fusion” [113, 115], which first
computes matching scores between images of two sets and then pools those scores by late fusion
schemes such as the average or max fusion.
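The two paradigms can be contrasted in a few lines; the sketch below uses average pooling of features for paradigm (1) and average (or max) pooling of pairwise cosine scores for paradigm (2), which are common but by no means the only choices.

```python
import numpy as np

def sim_matrix(set_a, set_b):
    """Pairwise cosine similarities between two sets of feature vectors."""
    a = set_a / np.linalg.norm(set_a, axis=1, keepdims=True)
    b = set_b / np.linalg.norm(set_b, axis=1, keepdims=True)
    return a @ b.T

def fuse_then_match(set_a, set_b):
    """Paradigm (1): pool each set into one representation, then compare once."""
    return float(sim_matrix(set_a.mean(0, keepdims=True),
                            set_b.mean(0, keepdims=True))[0, 0])

def match_then_fuse(set_a, set_b, pool=np.mean):
    """Paradigm (2): compare every image pair, then pool the scores
    (average here; max is the other common late-fusion choice)."""
    return float(pool(sim_matrix(set_a, set_b)))

set_a = np.random.randn(5, 512)   # a "template" of five face descriptors
set_b = np.random.randn(8, 512)
print(fuse_then_match(set_a, set_b), match_then_fuse(set_a, set_b))
```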
The methods in paradigm (1) usually assume a certain distribution on the sets and/or require
a large number of samples in a set for robust set modeling. Unfortunately, this is often not the
case in practice: a set may have few or even a single image, or contain a mixture of images
and video frames. This kind of set is called a template in the recently released IARPA Janus
Benchmark A (IJB-A) dataset for unconstrained face recognition [88]. Because of the presence of
multiple media in a template, the intra-class variation becomes much larger than in sets involving only a
single medium, such as Labeled Faces in the Wild [70] (multiple images as a set) and the YouTube Faces
dataset [160] (a video as a set).
We introduce a template-triplet-based embedding approach to optimize the template-to-template
similarity computed by the ensemble SoftMax function [113, 115], which fuses image-to-image
scores at multiple scales (i.e., paradigm (2)). Different from image-triplet embedding [137, 138],
our triplets can be created not only on entire templates, but also on subtemplates of samples of any
reasonable size. Note that a template triplet becomes an image triplet when the subtemplate size
equals 1. Therefore, the proposed approach generalizes the triplet embedding to sub-templates of
a predefined size. An illustration of our approach is presented in Figure 1.8.
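The exact ensemble SoftMax similarity and the learned embedding are developed in Chapter 7; as a rough sketch of the idea, the code below fuses pairwise scores with SoftMax weights at a few temperatures and applies a hinge-style triplet objective to (sub)templates. The temperature values, margin, and fusion form are illustrative assumptions, not the thesis formulation.

```python
import numpy as np

def softmax_score(sims, beta):
    """SoftMax-weighted fusion of pairwise scores at one 'scale' beta
    (beta -> 0 approaches the mean, large beta approaches the max)."""
    w = np.exp(beta * sims)
    return float((w * sims).sum() / w.sum())

def ensemble_softmax(template_a, template_b, betas=(0.0, 1.0, 10.0)):
    """Template-to-template similarity: average SoftMax fusion over scales."""
    a = template_a / np.linalg.norm(template_a, axis=1, keepdims=True)
    b = template_b / np.linalg.norm(template_b, axis=1, keepdims=True)
    sims = a @ b.T
    return np.mean([softmax_score(sims, beta) for beta in betas])

def template_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on (sub)template triplets: similarity to the positive
    template should exceed similarity to the negative one by a margin."""
    s_pos = ensemble_softmax(anchor, positive)
    s_neg = ensemble_softmax(anchor, negative)
    return max(0.0, margin + s_neg - s_pos)

# Subtemplates of size 1 recover ordinary image triplets.
anc, pos, neg = (np.random.randn(4, 256) for _ in range(3))
print(template_triplet_loss(anc, pos, neg))
```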
1.4 Thesis Outline
The thesis begins with a detailed survey of previous work on 3D face modeling, face
alignment, face frontalization, attribute generation, and face matching and fusion (Chapter 2).
Chapter 3 then introduces the FacePoseNet component of our FAME method. Starting with
our concerns about existing facial landmark detection, we describe in more detail our
deep neural network, Face PoseNet (FPN), which estimates the 6DoF 3D head pose (i.e., the 3D
transformation parameters) directly from face image intensities, without landmark detection as
an intermediate step. We also present how the pose-labeled data is prepared in order to
make the alignment produced by FPN more robust to head pose changes. Finally, we evaluate different
alignment methods by their bottom-line face recognition accuracies. A surprising conclusion from our
experimental results is that better landmark detection accuracy does not necessarily translate into
better face recognition.
In Chapter 4, we introduce the Face Expression Net (FEN), another component of our 3D FAME
shape estimation framework. We show how facial expressions can be modeled directly from image
intensities using the proposed Expression-Net (FaceExpNet). We provide a multitude of face
reconstruction examples, visualizing our estimated expressions on faces appearing in challenging
unconstrained conditions. Additionally, we offer quantitative comparisons of our facial expression
estimates by measuring how well different expression regression methods capture facial emotions
on the Extended Cohn-Kanade (CK+) dataset [109] and the Emotion Recognition
in the Wild Challenge (EmotiW-17) dataset¹. We show that not only does our deep approach
provide more meaningful expression representations, it is also more robust to scale changes than
methods which rely on landmarks for this purpose.
Chapter 5 combines these components into our full FAME framework for landmark-free face
alignment, modeling, and expression estimation, and shows how 2D facial landmarks can be
obtained as a by-product of the model.
Chapter 6 introduces our method for addressing the challenging problem of generating facial
attributes from a single image in an unconstrained pose. In contrast to prior work, which largely
considers generation on 2D near-frontal images, we propose a GAN-based framework to generate
attributes directly on a dense 3D representation given by UV texture and position maps, resulting
in photorealistic, geometrically consistent, and identity-preserving outputs.
Chapter 7 describes in detail our template-triplet-based metric learning approach and our
"context-aware" metric learning approach, which integrates image-specific context into metric learning
so as to achieve a distinct embedding for every image.
¹ https://sites.google.com/site/emotiwchallenge/challenge-details
Figure 1.8: The proposed template triplet embedding: (a) shows an example of a triplet. The goal
is to enlarge similarity of positive pairs while reducing that of negative ones. Template triplet
embedding, right hand side of (c), is compared to the conventional contrastive embedding (b), and
image triplet embedding, left hand side of (c). Our approach introduces the (sub)template triplets
to take the template structure into account. The image-triplet embedding can be seen as a special
case when the subtemplate size equals 1. Best viewed in color.
Chapter 2
Related Work
In this chapter, we provide a detailed survey of 3D face modeling, including face
alignment, head pose estimation, expression modeling, and face frontalization. In addition,
relevant work on facial attribute generation and on multi-source fusion for face matching is also
reviewed.
2.1 Facial Landmark Detection and Its Applications
There has been a great deal of work dedicated to accurately detecting facial landmarks, and not only
due to their role in face alignment and expression estimation. Face landmark detection is a general
problem which has applications in numerous face related systems. Landmark detectors are very
often used to align face images by applying rigid [45, 46, 160] and non-rigid transformations [58,
77, 146, 192] in 2D and 3D [112, 115, 116]. Other facial landmark applications involve estimating
3D face shape, expression, emotion, and many others as shown in Figure 2.1.
Generally speaking, landmark detectors can be divided into two broad categories: Regression
based [19, 86] and Model based [3, 4, 178, 192] techniques. Regression based methods estimate
landmark locations directly from facial appearance while model based methods explicitly model
both the shape and appearance of landmarks. Regardless of the approach, landmark estimation
can fail whenever faces are viewed in extreme out-of-plane rotations (far from frontal) or at low scale,
or when the face bounding box differs significantly from the one used to develop the landmark
detector.
To address the problem of varying 3D poses, the recent 3DDFA approach [192], related to our
own, learns the parameters of a 3DMM representation using a CNN. Unlike us, however, it
prescribes an iterative, analysis-by-synthesis technique. Also related to us is the recent CE-CLM
method [178]. CE-CLM introduces a convolutional expert network to capture the very complex
landmark appearance variations, thereby achieving state-of-the-art landmark detection accuracy.
Very recently, [9] proposed the 3D-STN approach, which shares some of our design goals. In
fact, we use a modified version of their regression sampler component to optimize facial landmark
localization. Their approach, however, learns a 3D thin-plate spline (TPS) warping matrix and
an 11DoF camera projection matrix to modify and fit a generic 3D face model to the
input. Our approach is very different: we use a 3DMM representation, estimating facial shape,
expression, and 6DoF pose directly from image intensities with three deep networks.
[96] proposed the KEPLER system, an iterative method based on three submodules: a rendering
module that stacks previously predicted key-point locations in the current image, a CNN that
Figure 2.1: Applications of facial landmarks. Illustrating the frequency of various task and
application names in paper titles citing two of the most popular landmark detectors [190] and [168].
predicts key-point location updates towards the ground-truth, and a final step which applies these
increments to generate new landmark estimates. Inspired by cascaded regression, the method
iterates over these three modules in order to get final key-points predictions along with a visibility
confidence and 3D head pose estimation. Unlike KEPLER, our method offers the speed of a direct
regression method and is self-supervised, without the need for manual landmark annotations for
training.
[18] discuss the importance of data set size, proposing the largest annotated training set
for landmark detection (LS3D-W), consisting of about 230,000 samples. They additionally provide
ablation studies of pose, landmark initialization, face resolution, and other such factors. [17]
studied how to redesign the bottleneck layer of a CNN in order to obtain substantial improvements
in localization accuracy while constraining the learned model to be lightweight, fast, and compact.
While most methods are designed to be robust to typical appearance nuisance factors such as pose and
illumination, [18] and [40] designed methods to make landmark detection systems robust to
different image styles.
Finally, [95] used a pose estimating network similar to our FacePoseNet [23] in their facial
landmark detection system. Our detection accuracy on landmark benchmarks may be lower than
theirs. Unlike them, however, our FAME approach provides a complete 3D face reconstruction,
with landmarks only being a by-product.
2.2 Deep Pose Estimation
Chapter 3 will describe a deep network trained to estimate the 6DoF 3D head pose of faces viewed in
single images. Deep learning is increasingly used for similar purposes, though typically focusing
on general object classes [5, 123, 145]. Some recent methods addressed faces in particular, though they are designed to estimate 2D landmarks along with 3D face shapes [80, 94, 192].
Unlike our proposed pose estimation, they regress poses by using iterative methods which involve
computationally costly face rendering. We regress 6DoF directly from image intensities without
such rendering steps.
In all these cases, absence of training data was cited as a major obstacle for training effective
models. In response, some turned to larger 3D object data sets [163, 164] or using synthetically
generated examples [128]. We propose a far simpler alternative and show it to result in robust and
accurate face alignment.
2.3 Expression Estimation
We first emphasize the distinction between the related, yet different tasks of emotion classification
vs. expression regression. The former seeks to classify images or videos into discrete sets of facial
emotion classes [37, 109] or action units [47, 179]. Similarly to face recognition, this problem was
also often addressed by considering the locations of facial landmarks. In recent years a growing
number of state of the art methods have instead adopted deep networks [91, 100, 183], applying
them directly to image intensities rather than estimating landmark positions as a proxy step.
Methods for expression regression attempt to extract parameters for face deformations. These
parameters are often expressed in the form of active appearance models (AAM) [109] and Blend-
shape model coefficients [128, 192, 191]. In this work we focus on estimating 3D expression
coefficients, using the same representation described by 3DDFA [192]. Unlike 3DDFA, however,
we completely decouple expression coefficient regression from facial landmark detection. Our
tests demonstrate that by doing so, we obtain a method which is more robust to changing image
scales.
Importantly, the exact locations of facial landmarks were once considered subject-specific
information which can be used for face recognition [44]. Today, however, these attempts are
all but abandoned. The reason for turning to other representations may be the real-world imaging conditions typically assumed by modern face recognition systems [88], where even state-of-the-art landmark detection accuracy is insufficient to discriminate between individuals based solely on the locations of their detected facial landmarks. In other applications, however, facial landmarks prevail. We follow recent attempts, most notably Chang et al. [23], by proposing landmark-free alternatives for face understanding tasks. This effort is intended to allow for accurate expression estimation on images which defy landmark detection techniques, in a similar spirit to the abandonment of landmarks as a means of representing identities. To our knowledge, such a direct, landmark-free, deep approach to expression modeling has never previously been attempted.
2.4 3D Face Modeling
Estimating the 3D shape of a face appearing in a single image is a problem with a history now
spanning over two decades. Some proposed example-based approaches [59, 154, 58] where the
3D shape was estimated using the shapes of similar reference faces. These methods were typically
very robust to viewing conditions, but were not designed to offer accurate 3D reconstructions. To
estimate fine facial details, others used shape from shading (SfS) for face reconstruction [83, 101].
Though SfS reconstructions were detailed, they were often limited to rather constrained viewing
settings.
Possibly the most popular methods of estimating 3D facial shapes involved 3DMM. These sta-
tistical representations were introduced by [10] and then later improved by others [14, 28, 121, 130,
147, 171]. We provide a brief overview of these representations in Sec. 4.2.1. Whereas classical
3DMM fitting methods used an analysis-by-synthesis approach which involved computationally
expensive rendering cycles, we estimate 3DMM parameters directly from image intensities with
deep networks.
Facial landmarks were also used to produce 3D face shape estimates [80, 192]. Because these
methods often focused on landmark detection accuracy, they were typically very fast, but not
necessarily accurate in the quality of their reconstructions.
Finally, deep learning methods were also recently proposed for 3D face shape estimation [15,
43, 76, 139, 127, 129, 140, 148, 149, 150]. Our work extends the method proposed by [149] by
adding deep 6DoF pose and 29D facial expression estimation to their accurate 3D faces.
2.5 Face Alignment
The term face alignment is often used in papers presenting facial landmark detection methods [2,
20, 126], implying that the two terms are used interchangeably. This reflects an interpretation of
alignment as forming correspondences between particular spatial locations in two face images. A
different interpretation of alignment, and the one used here, refers not only to establishing these
correspondences but also to warping the two face images in order to bring them into alignment,
thereby making them easier to compare and match. Such methods using 2D or 3D transformations
are well known to have a profound impact on the accuracy of face recognition systems [61, 62, 68].
We describe a deep network trained to estimate the 6DoF of 3D faces viewed in single images.
Deep learning is increasingly used for similar purposes, though typically focusing on general
object classes [5, 123, 145]. Some recent methods addressed faces in particular, though they are designed to estimate 2D landmarks along with 3D face shapes [80, 96, 95, 192]. Unlike our
proposed pose estimation, many of these regress poses by using iterative methods which involve
computationally costly face rendering. We regress 6DoF directly from image intensities without
such rendering steps.
In all these previous efforts, absence of training data was cited as a major obstacle for
training effective deep models for alignment. In response, some turned to larger 3D object
data sets [163, 164] or using synthetically generated examples [129]. We propose a far simpler
alternative and show it to result in robust and accurate face alignment.
2.6 Face Frontalization
Early works [60, 50] apply a 3D Morphable Model and search for dense point correspondences to complete the invisible face region. [189] proposes a high-fidelity pose and expression normalization approach based on a 3DMM. Sagonas et al. [31] formulate frontalization as a low-rank optimization problem. Yang et al. [81] formulate frontalization as a recurrent object rotation problem. Yim et al. [174] propose a concatenated network structure to rotate faces under an image-level reconstruction constraint. Cole et al. [48] propose using identity perception features to reconstruct normalized faces. Recently, GAN-based generative models [151, 176, 71, 143, 6, 35] have achieved high visual quality and preserve identity to a large extent. Our method belongs to the family of GAN-based methods but works on 3D UV position and texture maps rather than on 2D images.
2.7 Attribute Generation
Pixel-level graphical editing accounts for a large part of attribute generation. However, we focus on holistic, image-level attribute generation and thus only discuss the closely related works. Li et al. [102] apply an attribute perception loss to guide attribute synthesis. Upchurch et al. [152] propose target-attribute-guided feature-level interpolation for synthesis. Shen and Liu [142] introduce residual maps to add or remove specific attributes. GAN-based methods [122, 184, 165, 64, 27, 97, 124, 181, 166] aim at connecting the latent attribute code space and the target-attribute image space, e.g., by swapping attribute-related latent codes [184, 165], disentangling the attribute for an invariant representation [97], or imposing an attention network to guide attribute generation in a specific area [181]. Xiao et al. [166] worked on attribute transfer between paired images. Given low-resolution or occluded face images, both [108] and [26] attempted to generate high-resolution images which satisfy user-given attributes. Our work belongs to the GAN-based methods. To our knowledge, no prior work synthesizes attributes based on a 3D representation; ours is the first. Moreover, our newly proposed two-phase training and masked reconstruction loss enable the network to focus only on the attribute-related region and thus largely preserve identity.
2.8 Multi-Source Fusion: Image Set-to-Set Similarity Computation
2.8.1 Set Fusion Followed by Set Matching
In [106, 137], the sample mean is used to represent a set/template before matching by the inner product. Taking the average at the feature level is risky because some discriminative information may be lost. Moreover, averaging over different media may destroy the intrinsic structure within each medium. Media pooling is therefore proposed [30], which first average-pools features within each medium and then averages again over the media-pooled features. [21, 67, 186] represent a template by a convex or affine combination of the samples in the template, and [169, 85] use a subspace representation obtained by applying PCA to the templates. The templates are then matched by Euclidean distance or principal angles.
In [54, 73, 56], geometric structures such as Riemannian manifolds are exploited, which assume single or multivariate Gaussian distributions on templates [55]; this assumption can easily be violated in practice. To handle arbitrary distributions within a template, [55] proposed modeling the image template as a probability distribution function using kernel density estimation, with the Log-Euclidean distance or K-L divergence employed for matching. To capture nonlinear variations, nonlinear manifold modeling methods and the associated manifold-to-manifold distances were introduced in [157, 156, 107]. In the above approaches, a large number of samples is usually needed to model a distribution or manifold well, which prevents them from being applied to datasets whose templates are extremely small. Instead of using a single statistic to model a set, [72] represents an image set with the sample mean, the sample covariance, and a Gaussian mixture model (GMM), bridging the gap between Euclidean space and the Riemannian manifold.
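As a concrete illustration of the media-pooling idea discussed above, the following minimal Python sketch first averages per-image features within each medium and then across media, before computing an inner-product (cosine) similarity between templates. The function names and the assumption that per-image embeddings are already extracted are conveniences of this sketch, not part of the cited methods.

```python
import numpy as np

def media_pooled_template(features, media_ids):
    """Pool per-image features into a single template descriptor.

    features:  (N, D) array of per-image embeddings in one template.
    media_ids: length-N sequence of media identifiers (e.g., video or photo IDs).
    Features are first averaged within each medium, then across media, so that
    long videos do not dominate the template representation.
    """
    features = np.asarray(features, dtype=np.float64)
    media_ids = np.asarray(media_ids)
    media_means = [features[media_ids == m].mean(axis=0) for m in np.unique(media_ids)]
    return np.mean(media_means, axis=0)

def template_similarity(t1, t2):
    """Inner-product (cosine) similarity between pooled template descriptors."""
    return float(np.dot(t1, t2) / (np.linalg.norm(t1) * np.linalg.norm(t2) + 1e-12))
```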
2.8.2 Set Matching Followed by Set Score Fusion
Different from approaches that model templates before matching, in [113, 115] image-to-image matching is first performed between templates, and an ensemble softmax fusion (an ensemble of average, max, and weighted-average score fusions) is then applied to obtain the template-to-template similarity. This method does not assume any distribution for the template and can handle templates of extreme sizes. Besides, it can be more robust to noise owing to the "soft" averaging over image-level similarity scores.
2.9 Discriminative Multi-Source Fusion: Metric Learning for Set-
to-Set Matching
There are several methods for image-to-image matching, such as LDML [52], ITML [34], and [16]. LDML exploits a linear logistic discriminant model to estimate the probability that two images belong to the same object, while ITML and [16] impose prior knowledge on the metric so that the learned metric stays close to the known prior and is invariant to rigid transformations, respectively. KISSME [90], on the other hand, models the genuine and impostor pairs as multivariate Gaussians, and the distance is defined by a likelihood ratio test.
For template-to-template matching, the template-to-template distance (similarity) is first
defined by either of the above-mentioned strategies, and then the image-to-image metric learning
is employed to enhance the discriminative power of the defined measures. There are two typical
types of losses exploited in metric learning: the contrastive loss [186, 54, 73, 156, 106, 55, 52] and the triplet loss [107, 137, 153]. The goal of the contrastive loss is to minimize the intra-class distance while maximizing the inter-class distance, whereas the triplet loss ensures that the distance between an anchor and a positive sample, both of which have the same class label, is minimized, and the distance between the anchor and a negative sample of a different class is maximized. Note that the distance can be replaced by a similarity, in which case the minimization (maximization) becomes maximization (minimization).
The triplet loss was previously exploited at the image level, where a template is represented as a single image or feature vector [137]. Instead, we introduce a new type of triplet, called a template triplet, which works directly on templates rather than on images. Unlike [107], where the triplets are created only from the k nearest neighbors, we select the farthest positive template from the anchor template to form the template triplet.
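The following minimal Python sketch illustrates the template-triplet idea: the positive is chosen as the farthest same-subject template from the anchor, and a hinge-style triplet loss is computed on pooled template descriptors. The function names, the margin value, and the use of Euclidean distance are illustrative assumptions of this sketch, not the exact formulation used in this thesis.

```python
import numpy as np

def farthest_positive(anchor, positives):
    """Pick the positive (same-subject) template farthest from the anchor."""
    dists = [np.linalg.norm(anchor - p) for p in positives]
    return positives[int(np.argmax(dists))]

def template_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss computed directly on template descriptors."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)
```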
Our context-aware metric learning method is most related to [141, 29]. They learn a specific
metric for a group of images, while ours achieves a truly image-specific metric with the help of context. The proposed method for incorporating context into metric learning is based on matrix factorization, which has been shown to be effective in image captioning [78, 110] and image modeling [118, 136], especially for integrating sample-specific cues.
Chapter 3
Deep Face Pose Net
This chapter begins with the motivation of our landmark-free alignment method, deep Face
PoseNet (FPN). We first discuss several concerns about using facial landmark detectors in a face
processing system. Then, we introduce the required parametric transformations for alignment and explain how we can estimate them directly from image intensities instead of using 2D-3D landmark correspondences (note that these correspondences are only used to generate the ground-truth parameters; once FPN is trained, landmarks are no longer needed, which is why we call the method landmark-free). The face pose estimates from our FPN are then used to warp face images into canonical views. Our method is evaluated against other landmark detection methods in terms of face recognition on two of the most challenging recent face recognition benchmarks, IJB-A [88] and IJB-B [159].
3.1 Motivation: A Critique of Facial Landmark Detection
3.1.1 Landmark Detection Accuracy Measures
Facial landmark detection accuracy is typically measured by considering the distances between
estimated landmarks and ground truth (reference) landmarks, normalized by the reference inter-
ocular distance of the face [33]:
e(L, \hat{L}) = \frac{1}{m\,\|\hat{p}_l - \hat{p}_r\|_2} \sum_{i=1}^{m} \|p_i - \hat{p}_i\|_2,   (3.1)

Here, L = \{p_i\} is the set of m estimated 2D facial landmark coordinates, \hat{L} = \{\hat{p}_i\} their ground truth locations, and \hat{p}_l, \hat{p}_r the reference left and right eye outer corner positions. These errors are then
translated to a number of standard quantities, including the mean error rate (MER), the percentage
of landmarks detected under certain error thresholds (e.g., below 5% or 10% error rates) or the
area under the accumulative error curve (AUC).
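For concreteness, the following Python sketch computes the inter-ocular-normalized error of Eq. (3.1) and the mean error rate (MER) over a set of images; the array shapes and the landmark-index arguments are assumptions of this sketch.

```python
import numpy as np

def normalized_landmark_error(pred, gt, left_eye_idx, right_eye_idx):
    """Eq. (3.1): mean point-to-point error normalized by inter-ocular distance.

    pred, gt: (m, 2) arrays of predicted and ground-truth landmarks.
    left_eye_idx, right_eye_idx: indices of the outer eye-corner landmarks in gt.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    inter_ocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return np.linalg.norm(pred - gt, axis=1).mean() / inter_ocular

def mean_error_rate(all_pred, all_gt, left_eye_idx, right_eye_idx):
    """MER: Eq. (3.1) averaged over a set of images."""
    errors = [normalized_landmark_error(p, g, left_eye_idx, right_eye_idx)
              for p, g in zip(all_pred, all_gt)]
    return float(np.mean(errors))
```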
There are two key problems with this method of evaluating landmark errors. First, the ground truth compared against is manually specified, often by Mechanical Turk workers. These manual annotations can be noisy; they are ill-defined when images are of low resolution, when landmarks are occluded (in the case of large out-of-plane head rotations, facial hair, and other obstructions), or when they are located in featureless facial regions (e.g., along the jawline). Accurate facial landmark detection, as measured on these benchmarks, thus implies better matching of human labels but not necessarily better detection. These problems are demonstrated in Figure ??.
A second potential problem lies in the error measure itself: Normalizing detection errors by
inter-ocular distances biases against images of faces appearing at non-frontal views. When faces
are near profile, perspective projection of the 3D face onto the image plane shrinks the distances
between the eyes thereby naturally inflating the errors computed for such images.
3.1.2 Landmark Detection Speed
Some facial landmark detection methods emphasize impressive speeds [86, 126]. Measured on
standard landmark detection benchmarks, however, these methods do not necessarily claim state-
of-the-art accuracy, falling behind more sophisticated yet far slower detectors [178]. Moreover, aside from [192], no existing landmark detector is designed to take advantage of GPU hardware, a standard feature in commodity computer systems, and most, including [192], apply iterative optimizations which may be hard to parallelize.
3.1.3 Effects of Facial Expression and Shape on Alignment
It was recently shown that 3D alignment and warping of faces to frontal viewpoints (i.e. frontal-
ization) is effective regardless of the precise 3D face shape used for this purpose [61]. Facial
expressions and 3D shapes in particular, appear to have little impact on the warped result as evident
by the improved face recognition accuracy reported by that method. Moreover, it was recently
demonstrated that by using such a generic 3D face shape, rendering faces from new viewpoints
can be accelerated to the same speed as simple 2D image warping [116].
Interestingly, they and many others used facial landmark detectors to compute parametric
transformations – projection matrix [61] or 2D affine or similarity transforms [45, 68] – by applying
robust estimators to corresponding detected facial landmarks [58, 99]. Variations in landmark
locations due to expressions and face shapes essentially contribute noise to this estimation process.
The effects these variations have on the quality of the alignment were, as far as we know, never
truly studied.
3.2 Deep, Direct Head Pose Regression
Rather than align faces using landmark detection, we refer to alignment as a global, 6DoF 3D
face pose, and propose to infer it directly from image intensities, using a simple deep network
architecture. Figure ?? illustrates the difference between our method and conventional approaches for alignment. We next describe the network and the novel method used to train it.
3.2.1 Head Pose Representation
We define face alignment as the 3D head pose h, expressed using 6DoF: three for rotations,
r = (r_x, r_y, r_z)^T, and three for translations, t = (t_x, t_y, t_z)^T:

h = (r_x, r_y, r_z, t_x, t_y, t_z)^T   (3.2)
Figure 3.1: Augmenting appearances of images from the VGG face dataset [120]. After detecting
the face bounding box and landmarks, we augment its appearance by applying a number of
simple planar transformations, including translation, scaling, rotation, and flipping. The same
transformations are applied to the landmarks, thereby producing example landmarks for images
which may be too challenging for existing landmark detectors to process.
where (r_x, r_y, r_z) are represented as Euler angles (pitch, yaw, and roll). Given m 2D facial landmark coordinates on an input image, p \in \mathbb{R}^{2 \times m}, and their corresponding reference 3D coordinates, P \in \mathbb{R}^{3 \times m} – selected on a fixed, generic 3D face model – we can obtain a 3D to 2D projection of the 3D landmarks onto the 2D image by solving the following equation for the standard pinhole model:

[p; 1]^T = A [R, t] [P; 1]^T,   (3.3)

where A and R are the camera matrix and rotation matrix, respectively, and 1 is a constant vector of ones. We then extract a rotation vector r = (r_x, r_y, r_z)^T from R using the Rodrigues rotation formula:

R = \cos\theta\, I + (1 - \cos\theta)\, r r^T + \sin\theta \begin{pmatrix} 0 & -r_z & r_y \\ r_z & 0 & -r_x \\ -r_y & r_x & 0 \end{pmatrix},

where we define \theta = \|r\|_2.
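The following Python sketch illustrates how 6DoF pose labels of the form in Eq. (3.2) could be recovered from 2D-3D landmark correspondences under the pinhole model of Eq. (3.3), using OpenCV's PnP solver and the Rodrigues representation. The camera matrix values (focal length and principal point) are placeholders, and the function interface is an assumption of this sketch rather than the exact label-generation code.

```python
import cv2
import numpy as np

def pose_from_landmarks(pts_2d, pts_3d, image_size):
    """Recover a 6DoF pose h = (rx, ry, rz, tx, ty, tz) from 2D-3D
    landmark correspondences under the pinhole model of Eq. (3.3).

    pts_2d: (m, 2) detected landmarks; pts_3d: (m, 3) reference points on a
    generic 3D face model; image_size: (height, width) of the input image.
    """
    h, w = image_size
    A = np.array([[w, 0, w / 2.0],       # placeholder camera matrix: focal = image width,
                  [0, w, h / 2.0],       # principal point at the image center
                  [0, 0, 1.0]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(np.asarray(pts_3d, np.float64),
                                  np.asarray(pts_2d, np.float64),
                                  A, None, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP did not converge")
    # rvec is the Rodrigues rotation vector r = (rx, ry, rz); the full rotation
    # matrix R can be recovered with cv2.Rodrigues(rvec)[0] if needed.
    return np.concatenate([rvec.ravel(), tvec.ravel()])
```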
3.2.2 Training Data Augmentation
Although our network architecture is not very deep compared to deep networks used today for
other tasks, training it still requires large quantities of labeled training data. We found the number of landmark-annotated faces in standard data sets to be too small for this purpose. A key
problem is therefore obtaining a large enough training set.
We produce our training set by synthesizing 6D, ground truth pose labels by running an
existing facial landmark detector [4] on a large image set: the 2.6 million images in the VGG face
dataset [120]. The detected landmarks were then used to compute the 6DoF labels for the images
in this set. A potential danger in using an existing method to produce our training labels is that our
CNN will not improve beyond the accuracy of its training labels. As we show in our experiments,
this is not necessarily the case.
To further improve the robustness of our CNN, we apply a number of face augmentation
techniques to the images in the VGG face set, substantially enriching the appearance variations
it provides. Figure 3.1 illustrates this augmentation process. Specifically, following face detec-
tion [172] and landmark detection [4], we transform detected bounding boxes and their detected
facial landmarks using a number of simple in-plane transformations. The parameters for these
Table 3.1: Summary of augmentation transformation parameters used to train our FPN, where U(a, b) samples from a uniform distribution ranging from a to b and N(\mu, \sigma^2) samples from a normal distribution with mean \mu and variance \sigma^2. width and height are the face detection bounding box dimensions.

Transformation            Range
Horizontal translation    U(-0.1, 0.1) × width
Vertical translation      U(-0.1, 0.1) × height
Scaling                   U(0.75, 1.25)
Rotation (degrees)        30 × N(0, 1)
Figure 3.2: Example augmented training images. Example images from the VGG face data
set [120] following data augmentation. Each triplet shows the original detected bounding box (left)
and its augmented versions (mirrored across the vertical axis). Both flipped versions were used for
training FPN. Note that in some cases, detecting landmarks would be highly challenging on the
augmented face, due to severe rotations and scalings not normally handled by existing methods.
Our FPN is trained with the original landmark positions, transformed to the augmented image
coordinate frame.
transformations are selected randomly from fixed distributions (Table 3.1). The transformed faces
are then used for training, along with their horizontally mirrored versions, to provide yaw rotation
invariance. Ground truth labels are, of course, computed using the transformed landmarks.
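A minimal sketch of the augmentation sampling implied by Table 3.1 is given below; the translation sign convention and the function interface are assumptions of this sketch.

```python
import numpy as np

def sample_augmentation(width, height, rng=np.random):
    """Draw one set of in-plane augmentation parameters (cf. Table 3.1)."""
    return {
        "tx": rng.uniform(-0.1, 0.1) * width,      # horizontal translation
        "ty": rng.uniform(-0.1, 0.1) * height,     # vertical translation
        "scale": rng.uniform(0.75, 1.25),          # isotropic scaling
        "rot_deg": 30.0 * rng.normal(0.0, 1.0),    # in-plane rotation (degrees)
        "flip": bool(rng.randint(2)),              # horizontal mirroring
    }
```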
Some example augmented faces are provided in Figure 3.2. Note that augmented images
would often be too challenging for existing landmark detectors, due to extreme rotations or scaling.
This, of course, does not affect the accuracy of the ground truth labels which were obtained from
the original images. It does, however, force our CNN to learn to estimate poses even on such
challenging images.
3.2.3 FPN Training
For our FPN we use an AlexNet-like architecture [93] with its initial weights provided by [114]. The only difference is that here the output layer regresses 6D floating point values rather than predicting one-hot encoded, multi-class labels. Note that during training each dimension of the
head pose labels is normalized by the corresponding mean and standard deviation of the training
set, compensating for the large value differences among dimensions. The same normalization
parameters are used at test time.
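A minimal sketch of the label standardization described above follows; the function names are illustrative, and the same mean and standard deviation computed on the training set are reused at test time to map network outputs back to pose values.

```python
import numpy as np

def fit_label_normalizer(train_labels):
    """Compute per-dimension mean/std of the 6D pose labels on the training set."""
    train_labels = np.asarray(train_labels, dtype=np.float64)   # shape (N, 6)
    return train_labels.mean(axis=0), train_labels.std(axis=0) + 1e-8

def normalize_labels(labels, mu, sigma):
    """Standardize labels before training (used to build regression targets)."""
    return (labels - mu) / sigma

def denormalize_predictions(preds, mu, sigma):
    """Map network outputs back to pose values at test time."""
    return preds * sigma + mu
```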
3.2.4 2D and 3D Face Alignment with FPN
Given a test image, it is processed by applying the same face detector [172], cropping the face
and scaling it to the dimension of the network’s input layer. The 6D network output is then
converted to a projection matrix. Specifically, the projection matrix is produced by the camera
matrix A, rotation matrix R, and the translation vector t in Eq. (3.3). With this projection matrix
we can render new views of the face, aligning it across 3D views as was recently proposed by
others [115, 116].
For 2D alignment, we compute the 2D similarity transform to warp the 2D projected landmarks
from FPN to pre-defined landmark locations. With frontal images (absolute yaw angle ≤ 30°), we use the eye centers, the nose tip, and the mouth corners for alignment. With profile images (absolute yaw angle > 30°), however, only the visible eye center and the nose tip are used.
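The 2D alignment step can be sketched as follows, using OpenCV's partial affine (similarity) estimator to warp the face so the FPN-projected landmarks land on pre-defined reference positions. The crop size, the reference point values, and the function interface are assumptions of this sketch rather than the exact implementation.

```python
import cv2
import numpy as np

def align_2d(image, projected_pts, reference_pts, out_size=(224, 224)):
    """Warp a face into a canonical 2D view with a similarity transform.

    projected_pts: landmarks obtained by projecting the 3D reference points with
    the FPN pose (eye centers, nose tip, mouth corners for frontal faces; fewer
    visible points for profiles). reference_pts: their target positions in the
    aligned crop. Both are (k, 2) arrays; the target values are user-defined.
    """
    M, _ = cv2.estimateAffinePartial2D(np.asarray(projected_pts, np.float32),
                                       np.asarray(reference_pts, np.float32))
    return cv2.warpAffine(image, M, out_size)
```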
3.3 Experimental Results for FPN
We provide comparisons of our FPN with the following widely used, state-of-the-art, facial
landmark detection methods: Dlib [86], CLNF [3], OpenFace [4], DCLM [178], RCPR [19], and
3DDFA [192], evaluating them for their effects on face recognition vs. their landmark detection
accuracy.
3.3.1 Effect of Alignment on Face Recognition
Section 3.1 discusses the various potential problems of comparing face alignment methods by
measuring their landmark detection accuracy. As an alternative, we propose comparing methods
for face alignment and landmark detection by evaluating their effect on the bottom line accuracy of
a face processing pipeline. Since face recognition is arguably one of the most popular applications
for face alignment, we use recognition accuracy as a performance measure. To our knowledge, this
is the first time alignment methods are compared based on their effect on recognition accuracy.
Specifically, we use two of the most recent benchmarks for face recognition: IARPA Janus
Benchmark A [88] and B [159] (IJB-A and IJB-B). Importantly, these benchmarks were designed
with the specific intention of elevating the difficulty of face recognition. This heightened challenge
is reflected by, among other factors, an unprecedented amount of extreme out of plane rotated faces
including many appearing in near-profile views [115]. As a consequence, these two benchmarks
not only push the limits of face recognition systems, but also the alignment methods used by these
systems, possibly more so than the faces in standard facial landmark detection benchmarks.
3.3.2 Face Recognition Pipeline
We employ a system similar to the one recently proposed by [116, 115], building on their publicly
available ResFace101 model and related code. We chose this system, as it explicitly aligns faces
to multiple viewpoints, including rendering novel views. These steps are highly dependent on the
quality of alignment and so its recognition accuracy should reflect alignment accuracy. In practice,
Table 3.2: Verification and identification on IJB-A and IJB-B, comparing landmark detection based face alignment methods. Three baseline IJB-A results are also provided as reference at the top of the table. † Numbers estimated from the ROC and CMC in [159].

Method                      TAR@FAR                    Identification Rate (%)
                            0.01%   0.1%   1.0%        Rank-1   Rank-5   Rank-10   Rank-20
IJB-A [88]
Crosswhite et al. [30]      –       –      93.9        92.8     –        98.6      –
Ranjan et al. [125]         90.9    94.3   97.0        97.3     –        98.8      –
Masi et al. [116]           56.4    75.0   88.8        92.5     96.6     97.4      98.0
RCPR [19]                   64.9    75.4   83.5        86.6     90.9     92.2      93.7
Dlib [86]                   70.5    80.4   86.8        89.2     91.9     93.0      94.2
CLNF [3]                    68.9    75.1   82.9        86.3     90.5     91.9      93.3
OpenFace [4]                58.7    68.9   80.6        84.3     89.8     91.4      93.2
DCLM [178]                  64.5    73.8   83.7        86.3     90.7     92.2      93.7
3DDFA [192]                 74.8    82.8   89.0        90.3     92.8     93.5      94.4
Our FPN                     77.5    85.2   90.1        91.4     93.0     93.8      94.8
IJB-B [159]
GOTs [159]†                 16.0    33.0   60.0        42.0     57.0     62.0      68.0
VGG face [159]†             55.0    72.0   86.0        78.0     86.0     89.0      92.0
RCPR [19]                   71.2    83.8   93.3        83.6     90.9     93.2      95.0
Dlib [86]                   78.1    88.2   94.8        88.0     93.2     94.9      96.3
CLNF [3]                    74.1    85.2   93.4        84.5     90.9     93.0      94.8
OpenFace [4]                54.8    71.6   87.0        74.3     84.1     87.8      90.9
DCLM [178]                  67.6    81.0   92.0        81.8     89.7     92.0      94.1
3DDFA [192]                 78.5    89.1   95.6        89.0     94.1     95.5      96.9
Our FPN                     83.2    91.6   96.5        91.1     95.3     96.5      97.5
we used their 2D (similarity transform) and 3D (new view rendering) code directly, changing how
the transformations are computed: our tests compare different landmark detectors used to recover
the 6DoF head pose required by their warping and rendering method, with the 6DoF regressed
using our FPN.
Their system uses a single Convolutional Neural Network (CNN), a ResNet-101 architec-
ture [63], trained on both real face images and synthetic, rendered views. We fine-tune the ResFace101 CNN using the L2-constrained softmax loss [125] instead of the original softmax used by Masi et al. for their publicly released model. This fine-tuning is performed using the MS-Celeb
face set [105] as an example set. Aside from this change, we use the same recognition pipeline
from [116] and we refer to that paper for details.
(a) ROC IJB-A (b) CMC IJB-A
(c) ROC IJB-B (d) CMC IJB-B
Figure 3.3: Verification and identification results on IJB-A and IJB-B. ROC and CMC curves
accompanying the results reported in Table 3.2.
Face Bounding Box Detection We emphasize that an identical pipeline was used with the
different alignment methods; differences in results are due only to the method used to estimate facial pose.
The only other difference between recognition pipelines was in the facial bounding box detector.
Facial landmark detectors are sensitive to the face detector they are used with. We therefore
report results obtained when running landmark detectors with the best bounding boxes we were
able to determine. Specifically, FPN was applied to the bounding boxes returned by the detector of
Yang and Nevatia [172], following expansion of its dimensions by 25%. Most detectors performed
best when applied using the same face detector, without the 25% increase. Finally, 3DDFA [192]
was tested with the same face detector followed by the face box expansion code provided by its
authors.
3.3.3 Face Verification and Identification Results
Face verification and identification results on both IJB-A and IJB-B are provided in Table 3.2. We report multiple recognition metrics for both verification and identification: for verification, these measure the recall (True Acceptance Rate, TAR) at three cut-off points of the False Alarm Rate (FAR = 1%, 0.1%, and 0.01%). For identification, we provide recognition rates at four ranks of the CMC (Cumulative Matching Characteristic). The overall performance in terms of ROC and CMC curves is shown in Figure 3.3. The table also provides, as reference, three state-of-the-art IJB-A
Method          ≤5%        ≤10%       ≤20%       >40%      MER      Sec./im.
RCPR [19]       44.44 %    66.96 %    77.39 %    9.55 %    0.1386   0.19
Dlib [86]       60.03 %    82.65 %    90.94 %    2.83 %    0.0795   0.009
CLNF [3]        20.86 %    65.11 %    87.62 %    2.63 %    0.1106   0.38
OpenFace [4]    54.39 %    86.74 %    95.42 %    1.27 %    0.0702   0.31
DCLM [178]      64.91 %    91.91 %    96.00 %    1.17 %    0.0611   15.83
3DDFA [192]     N/A        N/A        N/A        N/A       N/A      0.6
Our FPN         1.75 %     65.40 %    93.86 %    0.97 %    0.1043   0.005

(a) Quantitative results                      (b) Accumulative error curves

Figure 3.4: 68-point detection accuracies on 300W. (a) The percentage of images with 68-landmark detection errors lower than 5%, 10%, and 20% of inter-ocular distance, or greater than 40%, along with mean error rates (MER) and runtimes. Our FPN was tested using a GPU; on the CPU, FPN runtime was 0.07 seconds. 3DDFA used the AFW collection for training. Code provided for 3DDFA [192] did not allow testing on the GPU; in their paper, they claim GPU runtime to be 0.076 seconds. As AFW was included in our 300W test set, landmark detection accuracy results for 3DDFA were excluded from this table. (b) Accumulative error curves.
results [30, 116, 125] and baseline results from [159] for IJB-B (to our knowledge, we are the first
to report verification and identification accuracies on IJB-B).
Faces aligned with our FPN offer higher recognition rates, even compared to the most recent,
state-of-the-art facial landmark detection method of [178]. In addition, our verification scores on
IJB-A outperform the scores reported for the system used here as the basis for our recognition
system [116]. These superior results are likely due to the better alignment of the faces provided by
our FPN.
3.3.4 Landmark Detection Accuracy
From 6DoF Pose to Facial Landmarks Given a 6DoF head pose estimate, facial landmarks
can then be estimated and compared with existing landmark detection methods for their accuracy
on standard benchmarks. To obtain landmark predictions, 3D reference coordinates of facial
landmarks are selected offline, once, on the same generic 3D face model used in [116]. Given a
pose estimate, we convert it to a projection matrix and project these 3D landmarks down to the
input image.
Recently, a similar process was proposed for accurate landmark detection across large
poses [192]. In their work, an iterative method was used to simultaneously estimate a 3D face
shape, including facial expression, and project its landmarks down to the input image. Unlike them,
our tests use a single generic 3D face model, unmodified. By not iterating over the face shape, our
method is simpler and faster, but of course, our predicted landmarks will not reflect different 3D
shapes and facial expressions. We next evaluate the effect this has on landmark detection accuracy.
Detection Accuracy on the 300W Benchmark We evaluate performance on the 300W data
set [135], the most challenging benchmark of its kind [161], using 68 landmarks. We note that we
Figure 3.5: Qualitative landmark detection examples. Landmarks detected in 300W [135] images
by projecting an unmodified 3D face shape, pose aligned using our FPN (red) vs. ground truth
(green). The images marked by the red-margin are those which had large FPN errors (> 10%
inter-ocular distance). These appear perceptually reasonable, despite these errors. The mistakes
in the red-framed example on the third row were clearly a result of our FPN not representing
expressions.
did not use the standard training sets used with the 300W benchmark (e.g., the HELEN [98] and
LFPW [8] training sets with their manual annotations). Instead we trained FPN with the estimated
landmarks, as explained in Section 3.2.3. As a test set, we used the standard union consisting of
the LFPW test set (224 images), the HELEN test set (330), AFW [190] (337), and IBUG [134]
(135). These 1026 images, collectively, form the 300W test set. Note that unlike others, we did
not use AFW to train our method, allowing us to use it for testing.
Figure 3.4 (a) reports five measures of accuracy for the various methods tested: the percentage of images with 68-landmark detection errors lower than 5%, 10%, and 20% of inter-ocular distance or greater than 40%, and the mean error rate (MER), averaging Eq. (3.1) over the images tested. Figure 3.4 (b) additionally provides accumulative error curves for these methods.
Not surprisingly, without accounting for face shapes and expressions, our predicted landmarks
are not as accurate as those predicted by methods which are influenced by these factors. Some
qualitative detection examples are provided in Fig. 3.5 including a few errors larger than 10%.
These show that mistakes can often be attributed to FPN not modeling facial expressions and shape.
One way to improve this would be to use a single-view 3D face shape estimation method [58, 149]
to better approximate landmark positions, though we have not tested this here.
Detection Runtime In one tested measure FPN far outperforms its alternatives: the last column of Figure 3.4 (a) reports the mean per-image runtime for landmark detection. Our FPN is an order of magnitude faster than nearly all other face alignment methods. Dlib [86] was slightly slower than our FPN, but is far less accurate in the face recognition tests (Table 3.2).
All methods were tested using an NVIDIA GeForce GTX TITAN X (12GB RAM) and an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz with 132GB RAM. The only exception was 3DDFA [192], which required a Windows system and was tested using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz (8 CPUs), 16GB RAM, running Windows 8 Pro 64-bit.
3.3.5 Discussion
Landmarks predicted using FPN in Figure 3.4 were less accurate than those estimated by other
methods. How does that agree with the better face recognition results obtained with images
aligned using FPN? As we mentioned in Section 3.1, better accuracy on a facial landmark detection benchmark reflects many things which are not necessarily important when aligning faces for recognition. These include, in particular, face shapes and expressions; the latter can actually cause misalignments when computing face pose and warping the face accordingly. FPN, on the other hand, ignores these factors, instead providing 6DoF pose estimates at breakneck speed, directly from image intensities.
An important observation is that despite being trained with labels generated by OpenFace [4],
recognition results on faces aligned with FPN are better than those aligned with OpenFace.
This can be explained in a number of ways: First, FPN was trained on appearance variations introduced by augmentation, which OpenFace was not necessarily designed to handle. Second, poses estimated by FPN were less corrupted by expressions and facial shapes, making the warped images better aligned. Third, as was recently argued by others [149], CNNs are remarkably adept at training with label noise, such as any errors in the poses predicted by OpenFace for the ground truth labels. Finally, CNNs cope well with domain shifts to new data, such as the extremely challenging views of the faces in IJB-A and IJB-B.
Chapter 4
Deep Expression Net
An important component of 3D face modeling is modeling expression variations in addition to the intrinsic face shape. A standard way to model expressions is as a linear combination of facial expression bases, with expression coefficients controlling their intensities. Traditional approaches usually required landmark detection in order to obtain the expression coefficients. However, quality cannot be guaranteed, because landmark detectors are susceptible to large pose changes and image scale variations. In this chapter, we introduce how these coefficients can be learned by the proposed Face Expression Net (FEN, also referred to below as ExpNet) without landmarks at test time. The experimental results demonstrate that our FEN achieves the best expression classification accuracy on both controlled and unconstrained emotion datasets.
4.1 Motivation
Successful methods for single view 3D face shape modeling were proposed nearly two decades
ago [13, 12, 121, 130]. These methods, and the many that followed, often claimed high fidelity
reconstructions and offered parameterizations for facial expressions besides the underlying 3D
facial shape. Despite their impressive results, they and others since [13, 12, 28, 121, 130, 147,
171] suffered from prevailing problems when it came to processing face images taken under
unconstrained viewing conditions. Many of these methods relied in one way or another on
facial landmark detection, performed either prior to reconstruction or concurrently, as part of the
reconstruction process.
By involving face landmark detection, these methods are sensitive to face pose and, aside from
a few recent exceptions (e.g., 3DDFA [192]), could not operate well on faces viewed in extreme
yaw rotations (e.g., near profile). Scale and occlusions are also problems: Whether because
landmarks are too small to accurately localize or altogether invisible due to occlusions, accurate
detection and consequent 3D reconstruction is not handled well. In addition to these problems,
many methods applied iterative steps of analysis-by-synthesis [7, 75, 131]. These methods were
not only computationally expensive, but also hard to distribute and run in parallel, e.g., using
dedicated hardware such as the now ubiquitous graphical processing units (GPU).
Very recently, some of these problems were addressed by two papers. First, Tran et al. [149]
proposed to use a deep CNN to estimate the 3D shape and texture of faces appearing in uncon-
strained images. Their CNN regressed 3D morphable face model (3DMM) parameters directly. To
test the extent to which their estimates were robust and discriminative, they then used them as face
representations in challenging, unconstrained face recognition benchmarks, including the Labeled
Faces in the Wild (LFW) [70] and the IARPA Janus Benchmark A (IJB-A) [88]. By doing so,
they showed that their estimated 3DMM parameters were nearly as discriminative as opaque deep
features extracted by deep networks trained specifically for recognition.
Chang et al. [23] extended this work by showing that 6 degrees of freedom (6DoF) pose
can also be estimated using a similar deep, landmark free approach. Their proposed Face-Pose-
Network (FPN) essentially performed face alignment in 3D, directly from image intensities and
without the need for facial landmarks which are usually used for these purposes.
Following the concepts of Tran et al. [149] and Chang et al. [23], we use similar techniques
to model 3D facial expressions. In this chapter, we will show how facial expressions can be
modeled directly from image intensities using the proposed ExpNet, a deep neural network. To our
knowledge, this is the first time that a CNN is shown to estimate expression coefficients directly,
without requiring or involving facial landmark detection. A multitude of face reconstruction
examples, visualizing our estimated expressions on faces appearing in challenging unconstrained
conditions will be provided.
Additionally, we offer quantitative comparisons of our facial expression estimates. To this end,
we propose to measure how well different expression regression methods capture facial emotions
on the Extended Cohn-Kanade (CK+) dataset [109] and the Emotion Recognition in the Wild
Challenge (EmotiW-17) dataset.¹ CK+ contains controlled images which allow us to focus on how
well emotions are captured by our method and others, in a sterile environment, while EmotiW-17
consists of highly unconstrained images. We show that not only does our deep approach provide
more meaningful expression representations, it is more robust to scale changes than methods
which rely on landmarks for this purpose.
4.2 Deep, 3D Expression Modeling
We propose to estimate facial expression coefficients using a CNN applied directly to image inten-
sities. A primary concern when training such deep networks is the availability of labeled training
data. For our purposes, training labels are 29D real-valued vectors of expression coefficients,
which do not have a natural interpretation that may easily be used by human operators to manually
collect data. We next explain how 3D shapes and their expressions are represented and how ample
data may be collected to effectively train a deep network for our purpose.
4.2.1 Representing 3D Faces and Expressions
We assume a standard 3DMM face representation [13, 12, 28, 66, 121]. Given an input face photo
I, standard methods for estimating its 3DMM representation typically detect facial feature points
and then use those as constraints when estimating the optimal 3DMM expression coefficients
(see, for example, the recent 3DDFA method [192]). Instead, we propose to estimate expression
parameters by directly regressing 3DMM expression coefficients, decoupling shape and texture
from pose and from expression.
¹ https://sites.google.com/site/emotiwchallenge/challenge-details
Specifically, we model a 3D face shape using the following, standard, linear 3DMM represen-
tation (for now, ignoring parameters representing facial texture and 6DoF pose):
S' = \hat{s} + \sum_{i=1}^{s} \alpha_i S_i + \sum_{j=1}^{m} \eta_j E_j   (4.1)

where \hat{s} \in \mathbb{R}^{3n \times 1} represents the average 3D face shape. The first summation provides shape variations as a linear combination of shape coefficients \alpha \in \mathbb{R}^{s} with S \in \mathbb{R}^{3n \times s} principal components. 3D expression deformations are provided as an additional linear combination of expression coefficients \eta \in \mathbb{R}^{m} and m expression components E \in \mathbb{R}^{3n \times m}. Here, 3n represents the 3D coordinates for the n pixels in I. The numbers of components for shape, s, and for expression, m, provide the dimensionality of the 3DMM coefficients. Our representation uses the BFM 3DMM shape components [121], where s = 99, and the expression components defined by 3DDFA [192], with m = 29.
The vectors \alpha and \eta control the intensity of deformations provided by the principal components. Given estimates for \alpha and \eta, it is therefore possible to reconstruct the 3D face shape of the face appearing in the input image (Eq. 4.1).
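As a concrete illustration, the following Python sketch evaluates the linear 3DMM of Eq. (4.1) given the mean shape, the two bases, and the coefficient vectors; the variable names and array layout are assumptions of this sketch.

```python
import numpy as np

def reconstruct_shape(s_mean, S_basis, alpha, E_basis, eta):
    """Evaluate the linear 3DMM of Eq. (4.1).

    s_mean:  (3n,)    mean face shape
    S_basis: (3n, s)  shape principal components;  alpha: (s,) shape coefficients
    E_basis: (3n, m)  expression components;       eta:   (m,) expression coefficients
    Returns the deformed shape as an (n, 3) array of vertex coordinates.
    """
    shape = s_mean + S_basis @ alpha + E_basis @ eta
    return shape.reshape(-1, 3)
```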
4.2.2 Generating 3D Expression Data
To our knowledge, there is no publicly available data set containing sufficiently many face images,
labeled with their expression coefficients. Presumably, one way of mitigating this problem is to
use face landmark detection benchmarks. That is, taking the face images in existing landmark de-
tection benchmarks and computing their expression coefficients using their ground truth landmark
annotations in order to obtain 29D ground truth expression labels. The primary reason why we
cannot use landmark detection benchmarks for this purpose is their size. The number of images in
the training and testing splits of the popular 300W landmark detection data set, for example, is
3,026. This is far too small to train a deep CNN to regress 29D real valued vectors.
Given the absence of sufficiently large and rich 3D expression training sets, we propose a
simple method for generating ample examples of faces in the wild coupled with 29D expression
coefficients labels. We begin by estimating 99D 3DMM coefficients for the 0.5 million face
images in the CASIA WebFace collection [173]. 3DMM shape parameters were estimated
following the state of the art method of [149], giving us, for every CASIA image a shape estimate,
S'' = \hat{s} + \sum_{i=1}^{s} \alpha_i S_i.
We assume that all images belonging to the same subject should have the same, single 3D
shape. We therefore apply the shape coefficients pooling method of [149] to average the 3DMM
shape estimates for all images belonging to the same subject, thereby obtaining a single 3DMM
shape estimate per subject. Poses were additionally estimated for each image using FPN [23]. We
then use standard techniques [57] to compute a projection matrix from the 6DoF provided by
that method.
Given a projection matrix \Pi that maps from the recovered 3D shape (determined by S'' and E) to the 2D points of an input image, p, we can solve the following optimization problem to obtain the expression coefficients:

\eta^{*} = \arg\min_{\eta} \| p - \Pi (S'' + E \eta) \|_2, \quad \text{subject to } |\eta_j| \le 3\sigma_{E_j},   (4.2)

where \sigma_{E_j} is the standard deviation of the j-th principal component of the 3DMM expression basis, and p is a set of landmarks detected by standard facial landmark detection methods. The optimization itself is performed by standard Gauss-Newton optimization.
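The fitting of Eq. (4.2) can be sketched as a bounded nonlinear least-squares problem, as below. This sketch uses SciPy's trust-region solver with box constraints as a stand-in for the Gauss-Newton optimization described above, and it assumes the bases have already been restricted to the landmark vertices and that the projection is available as a callable.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_expression(p_2d, S_id, E_basis, project, sigma_E):
    """Solve Eq. (4.2): find expression coefficients eta whose projected 3D
    landmarks best match the detected 2D landmarks, under |eta_j| <= 3*sigma_Ej.

    p_2d:    (k, 2) detected 2D landmarks
    S_id:    (3k,)  identity shape evaluated at the k landmark vertices
    E_basis: (3k, m) expression basis restricted to those vertices
    project: callable mapping (k, 3) 3D points to (k, 2) image points,
             defined by the estimated projection matrix
    sigma_E: (m,) standard deviations of the expression components
    """
    def residuals(eta):
        pts_3d = (S_id + E_basis @ eta).reshape(-1, 3)
        return (project(pts_3d) - p_2d).ravel()

    m = E_basis.shape[1]
    sol = least_squares(residuals, x0=np.zeros(m),
                        bounds=(-3.0 * np.asarray(sigma_E), 3.0 * np.asarray(sigma_E)))
    return sol.x
```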
4.2.3 Training ExpNet to Predict Expression Coefficients
We use the expression coefficients as ground truth labels when training our ExpNet to regress
29D expression coefficients. In practice, we use a ResNet-101 deep network architecture [63].
We did not experiment with smaller network structures, and so a more compact network may
well work just as well for our purposes. Our ExpNet is trained to regress a parametric function
f(\{W, b\}, I) \mapsto \eta, where \{W, b\} represent the parametric filters and weights of the CNN. We use a standard \ell_2 reconstruction loss between the predictions from our ExpNet and the 3DMM expression coefficients obtained in Sect. 4.2.2.
We use Stochastic Gradient Descent (SGD) with a mini-batch of size 144, momentum set to
0.9 and weight decay of 5e-4 to optimize our ExpNet. The network weights are updated with a
learning rate set to 1e-3. When the validation loss saturates, we decrease learning rates by an
order of magnitude, until the validation loss stops decreasing. No data augmentation is performed
during training: that is, we use the plain images in the CASIA set, since they are already roughly
aligned [173]. In order to make training easier, we removed the empirical mean from all the input
faces.
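A minimal PyTorch-style sketch of this training configuration is shown below. The use of torchvision's ResNet-101 with a 29-way regression head, the data loader, and the plateau-based learning-rate schedule are stand-ins for the actual implementation.

```python
import torch
import torch.nn as nn
import torchvision

# ResNet-101 trunk regressing the 29 expression coefficients with an L2 loss,
# optimized by SGD with the hyper-parameters listed above. The data loader and
# the validation-driven learning-rate drop are placeholders.
model = torchvision.models.resnet101(num_classes=29)
criterion = nn.MSELoss()                                  # standard L2 reconstruction loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

def train_one_epoch(loader):
    model.train()
    for images, eta_gt in loader:                         # mini-batches of size 144
        optimizer.zero_grad()
        loss = criterion(model(images), eta_gt)
        loss.backward()
        optimizer.step()
    # after validating, call scheduler.step(val_loss) to drop the learning
    # rate by an order of magnitude when the validation loss saturates
```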
We note that our approach is similar to the one used by Tran et al. [149], and in particular,
we use the same network architecture used in their work to regress 3DMM shape and texture
parameters. They, however, explicitly assume a unique shape representation for all images of the
same subject. This assumption allowed them to better regularize their network, by presenting it
with multiple images with varying nuisance but the same underlying label (i.e. shape coefficients
do not vary within the images of a subject). In our work, this is not the case, and expression
parameters vary from one image to the next, regardless of subject identity.
4.2.4 Estimating Expression Coefficients with ExpNet
Existing methods for expression estimation often take an analysis-by-synthesis approach to
optimizing facial landmark locations. Contrary to them, our 3DMM estimates are obtained in a
single forward pass of our CNN. To estimate an expression coefficient vector \eta_t for a test image I_t, we evaluate f(\{W, b\}, I_t). We preprocess images at test time using the face detector of Yang et al. [172], increasing its returned face bounding box by a factor of 1.25 of its size.
This scaling was manually determined to bring their bounding box size to roughly the same size as
the loose bounding boxes provided for CASIA images.
Figure 4.1: Confusion matrices for expression recognition on the CK+ dataset. These report how confusion is distributed across emotions at the original resolution, given (a) CE-CLM landmarks with expression fitting, (b) the deep method of [192] applied directly, and (c) our method. Our method shows the least confused distribution, since more of the mass lies on the diagonal than for the other approaches.
4.3 Experimental Results
4.3.1 Quantitative Tests
Benchmark Settings Aside from 3DDFA [192], we know of no previous method which directly estimates 29D expression coefficient vectors. Instead, previous work relied on facial landmark detectors and used the detected landmarks to estimate facial expressions. We therefore compare
the expressions estimated by our ExpNet to those obtained from state of the art landmark detectors.
Because no benchmark exists with ground truth expression coefficients, we compare these methods
on the related task of facial emotion classification. Our underlying assumption here is that better
expression estimation implies better emotion classification.
We use two benchmarks containing face images labeled for discrete emotion classes. For
each image we estimate its expression coefficients, either directly using our ExpNet and 3DDFA,
or using detected landmarks by solving (4.2) as described in Section 4.2.2. We then attempt to
classify the emotions for test images using the exact same classification pipeline applied to these
29D expression representations.
Our tests use the Extended Cohn-Kanade (CK+) dataset [109] and the Emotion Recognition
in the Wild Challenge (EmotiW-17) dataset.² The CK+ dataset is a constrained set, with frontal images taken in the lab, while the EmotiW-17 dataset contains highly challenging video frames
collected from 54 movie DVDs [39].
The CK+ dataset contains 327 face video clips labeled for seven emotion classes: anger (An), contempt (Co), disgust (Di), fear (Fe), happy (Ha), sadness (Sa), and surprise (Su). From each clip we take the peak frame (the end of the video), i.e., the frame assigned an emotion label, and use it for classification. Following the protocol used by [109], we ran a leave-one-subject-out test protocol to assess performance. The EmotiW-17 dataset, on the other hand, consists of 383 face video clips labeled for seven emotion classes: anger (An), disgust (Di), fear (Fe), happy (Ha), neutral (Ne), sadness (Sa),
² https://sites.google.com/site/emotiwchallenge/challenge-details
Figure 4.2: Confusion matrices for expression recognition on the EmotiW-17 dataset. These report how confusion is distributed across emotions at the original resolution, given (a) CE-CLM landmarks with expression fitting, (b) the deep method of [192] applied directly, and (c) our method. Our method still shows the least confused distribution, even in this highly unconstrained scenario.
and surprise (Su). We estimate 29D expression representations for every frame and apply average pooling over each video. We also evaluate the robustness of different methods to scale changes. Specifically, we tested all methods on multiple versions of the CK+ and EmotiW-17 benchmarks, each version with all images scaled down to 0.8, 0.6, 0.4, and 0.2 of their original sizes.
Emotion Classification Pipeline The same simple classification method was used for all tested representations. We preferred a simple classification method rather than a state-of-the-art technique in order to prevent an elaborate classifier from masking the quality of the landmark detection / expression estimation. We therefore use a simple kNN classifier with K = 5. It is important to note that the results obtained by all of the tested methods are far from the state of the art on this set; our goal is not to outperform state-of-the-art emotion classification methods, but only to compare expression coefficient estimation techniques.
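A minimal sketch of this evaluation step, using scikit-learn's kNN classifier on the 29D expression vectors, is given below; splitting the data (e.g., leave-one-subject-out) and extracting the expression vectors are assumed to be handled by the caller, and the function name is illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_emotion_accuracy(train_expr, train_labels, test_expr, test_labels, k=5):
    """Classify 29D expression coefficient vectors into emotion classes with a
    plain kNN classifier (K = 5), as in the evaluation protocol above.
    """
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(np.asarray(train_expr), np.asarray(train_labels))
    return clf.score(np.asarray(test_expr), np.asarray(test_labels))
```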
Baseline methods We compare our approach to widely used, state-of-the-art face landmark
detectors. These are Dlib [86], CLNF [3], OpenFace [4], CE-CLM [178], RCPR [19], and
3DDFA [192].
Results Figure 4.1 reports the emotion classification confusion matrix on the original (unscaled) CK+ data set for our method (Figure 4.1c), comparing it to the other two top-performing methods, 3DDFA (Figure 4.1b) and CE-CLM (Figure 4.1a). Our expression coefficients were able to capture well the emotions of surprise (Su), happy (Ha), and disgust (Di), but were less able to represent the emotions which were less clearly defined by expressions in the benchmark: anger (An), contempt (Co), fear (Fe), and sadness (Sa). These same classes were also challenging for other expression representation methods. On the whole, our representation was noticeably better at capturing all emotion classes. Figure 4.2 shows the emotion classification confusion matrix on the original (unscaled) EmotiW-17 data set. Our expression coefficients were able to capture well the emotions of neutral (Ne), happy (Ha), sad (Sa), and angry (An), but were less able to represent the emotions
Figure 4.3: Expression Recognition Accuracy on CK+ dataset. Each curve corresponds to a
method. For each scale, the experiment resizes the input image accordingly. Lower scale indicates
lower resolution. Original resolution is 640×490.
which were less clearly defined by expressions in the benchmark: disgust (Di), fear (Fe), and surprise (Su), which, from our observations, are highly similar to anger (An).
Figure 4.3 and Figure 4.4 report emotion recognition accuracy on CK+ and EmotiW-17, respectively, as the input image resolution is changed. The plots show the sensitivity of each tested method with respect to the input resolution: on the x-axis we report the downsizing factor applied to all images. A scale of one means processing the original image (640×490 for CK+ and 730×576 for EmotiW-17), while the lowest scale of 0.2 means processing much smaller images (128×98 for CK+ and 146×115 for EmotiW-17). Note that for deep methods the image resolution is in any case bounded by the network input size of 224×224. Both figures show that our approach is the most accurate in terms of emotion recognition among the methods tested here that model expression explicitly, and it is also more robust to scale changes than the methods based on landmark detection. There is also a noticeable difference in emotion recognition between the deep methods (ours and [192]) and the landmark-based ones. Interestingly, landmark detection modules such as CLNF [3], OpenFace [4], and CE-CLM [178] perform well at full resolution, substantially degrading
Figure 4.4: Expression Recognition Accuracy on EmotiW-17 dataset. Each curve corresponds to a
method. For each scale, the experiment resizes the input image accordingly. Lower scale indicates
lower resolution. Original resolution is 720×576.
their performance at lower scales. On the other hand, Dlib [86], RCPR [19], and 3DDFA [192] appear to be more invariant to scale changes, though they show lower recognition accuracy.
Table 4.1 reports runtimes for landmark-based methods along with deep, direct methods such as the proposed one and 3DDFA. Our method outperforms all the available alternatives. It is interesting to note that, in general, expression fitting methods relying on landmarks decouple the process into three parts: (i) fiducial point extraction, (ii) pose estimation, and (iii) expression fitting. As a result, the total processing time is the sum of multiple factors: although some landmark detection methods (e.g., Dlib) are very efficient at extracting landmarks (0.009s), they still need to solve the optimization problem of Eq. (4.2), leading to a runtime slower than that of the proposed method. Regarding a comparison with deep methods, the software package provided by 3DDFA [192] did not allow testing on the GPU; in their paper, they report a GPU runtime of 0.076 seconds, which is similar to ours. Our method is tested using a GPU. The other landmark detectors and the expression fitting code used to solve Eq. (4.2) are implemented on the CPU, though they could be ported to graphical units.
Time (s/img)    DLIB    CE-CLM   OpenFace   CLNF    RCPR    3DDFA    Us
Landmarks       0.009   15.83    0.31       0.38    0.19    –        –
Pose Fitting    0.29    0.29     0.29       0.29    0.29    –        –
Expr. Fitting   0.30    0.30     0.30       0.30    0.30    –        –
Total           0.599   16.42    0.90       0.97    0.78    0.6      0.088

Table 4.1: Expression estimation runtime. Runtime for expression fitting for recent methods; DLIB, CE-CLM, OpenFace, CLNF, and RCPR are landmark-based, while 3DDFA and our method are deep, direct methods. Landmark-based methods must perform landmark extraction and then optimization-based fitting at test time, whereas deep methods solve the entire problem in a single step. The pose fitting (0.29 s) and expression fitting (0.30 s) steps are shared by all landmark-based methods.
All methods are tested using an NVIDIA GeForce GTX TITAN X on an Intel Xeon CPU E5-2640 v3 @ 2.60GHz. The only exception was 3DDFA [192], which required a Windows system and was tested using an Intel Core i7-4820K CPU @ 3.70GHz with 8 CPUs.
4.3.2 Qualitative Results
In Figure 4.5 we show qualitative renderings of the 3D expression using input images from the CK+ dataset. We present each result at the original resolution (scale 1) and also at the lowest resolution (scale 0.2). Our method extends the 3DMM-CNN [149] method, which is unable to model expression at all. All the methods in this figure use the same 3D shape provided by 3DMM-CNN [149]; the rest of the variation in the 3D renderings comes from expression modeling. We compare our deep approach with the recent deep approach estimating both shape and expression [192], with the top-performing landmark detector CE-CLM [178], and with the baseline method [149]. Our method is the one that best models the expression, from a perceptual point of view, when rendering the 3D model: this is clear when observing, in Fig. 4.5, the renderings for subtle expressions such as fear and anger. This is consistent with the improvement shown in the confusion matrices in Fig. 4.1. 3DDFA is inconsistent across the same expression (happy) and tends to either exaggerate the expression or underestimate it. CE-CLM, instead, is fragile to resolution variations, modeling different expressions for the same downsized input image and so being inconsistent across scales.
In Figure 4.6 we report weak points of our deep method, which is unable to capture emotions with strong intensity such as surprise. Although 3DDFA does a good job from a perceptual point of view, its expression estimates look too exaggerated. CE-CLM performs visually well on surprise, although its predictions are inconsistent across scales. Our method is consistent across varying resolutions, but the surprise emotion is less perceivable in our renderings.
Finally, Figure 4.7 depicts qualitative rendering results using our Deep 3D Expression method, extending what is shown in Figure ??. Imagery from challenging face recognition benchmarks (IJB-A [88], IJB-B [159]) and from landmark detector benchmarks (300W [135]) is used as input to our system to predict 3D expression. In this figure we additionally show that, by coupling our method with two publicly available systems handling pose and shape, we obtain a complete, fully deep 3DMM face modeling system.

Figure 4.5: Qualitative Expression Estimation Comparison. Our method extends the 3DMM-CNN method that is unable to model expression. All the methods used the same 3D shape provided by 3DMM-CNN [149]. Each labeled emotion is reported above each result. The proposed method and 3DDFA show consistent expression fitting across scales and appear to be robust across resolutions. In particular, our method is able to model subtle expressions better than 3DDFA. On the other hand, the top-performing landmark detector (CE-CLM) is not able to clearly retrieve the subject's expression.

Renderings in the figure are computed in the following way. Given an input image $I_t$, expression coefficients $\boldsymbol{\eta}_t$ are retrieved following Section 4.2.4. We also estimate the following: (i) the 3D shape coefficients $\boldsymbol{\alpha}_t$, using the publicly available code of [149], and (ii) the 3D face pose $\boldsymbol{\theta}_t$ of the shape with respect to the input image, using the publicly available code of [23].³ We are therefore able to render a given input image, accounting for pose, expression, and shape variations, as $\boldsymbol{\theta}_t \circ \left(\hat{\mathbf{s}} + \mathbf{S}\boldsymbol{\alpha}_t + \mathbf{E}\boldsymbol{\eta}_t\right)$, i.e., the reconstructed 3D shape rendered with the estimated 6DoF pose. The figure shows the input image on the left, followed by the rendered 3D face model with and without texture; the model is also animated with the expression and rotated to fit the pose of the input image. Fig. 4.7 demonstrates how our approach can handle a large variety of challenges such as harsh in-plane and out-of-plane pose variations, covering around 40 degrees of variation from a frontal pose. Note how we capture the expression of the subject with remarkable fidelity, even though the imagery is unconstrained and often has low resolution, harsh lighting conditions, or an expression with low intensity. We believe this expression modeling is possible because we were able to effectively train a DCNN using the large and wide range of facial appearance variations provided by the CASIA set. Furthermore, it is worth mentioning that all these renderings were generated using a system which is completely landmark-free, that is, the system does not require any fiducial facial points to be detected in order to model a face.
³ These codes are respectively available at https://github.com/anhttran/3dmm_cnn and https://github.com/fengju514/Face-Pose-Net
Figure 4.6: Expression Estimation Failure When the input image contains intense expression, our
method does not match the expression intensity while other methods exaggerate the expression
(3DDFA) or are inconsistent across scales (CE-CLM).
Figure 4.7: Qualitative results. 48 expression estimation results on images from the challenging IJB-A [88], IJB-B [159], and 300W [135] collections. Each result displays an input photo along with the regressed texture and shape, rendered with the automatically determined pose and expression. These results demonstrate our method's capability of handling images of faces with challenging in-plane and out-of-plane head rotations, low resolutions, occlusions, varying ages, and more.
Chapter 5
Deep, Landmark-Free FAME:
Face Alignment, Modeling, and Expression Estimation
We present a novel method for modeling 3D face shape, viewpoint, and expression from a single,
unconstrained photo. Our method uses three deep convolutional neural networks (CNN) to
estimate each of these components separately. Importantly, unlike others, our method does not use
facial landmark detection at test time; instead, it estimates these properties directly from image
intensities. In fact, rather than using detectors, we show how accurate landmarks can be obtained
as a by-product of our modeling process. We rigorously test our proposed method. To this end,
we raise a number of concerns with existing practices used in evaluating face landmark detection
methods. In response to these concerns, we propose novel paradigms for testing the effectiveness
of rigid and non-rigid face alignment methods without relying on landmark detection benchmarks.
We evaluate rigid face alignment by measuring its effects on face recognition accuracy on the
challenging IJB-A and IJB-B benchmarks. Non-rigid, expression estimation is tested on the CK+
and EmotiW’17 benchmarks for emotion classification. We do, however, report the accuracy of
our approach as a landmark detector for 3D landmarks on AFLW2000-3D and 2D landmarks on
300W and AFLW-PIFA. A surprising conclusion of these results is that better landmark detection
accuracy does not necessarily translate to better face processing.
5.1 Motivations and Overview
Successful methods for single-view 3D face shape modeling were proposed nearly two decades
ago [13, 12, 121, 130]. These methods, and the many that followed, often claimed high fidelity
reconstructions and offered parameterizations for facial expressions besides the underlying 3D
facial shape.
Despite their impressive results, they and others since have suffered from problems when
processing images taken under unconstrained viewing conditions [13, 12, 28, 121, 130, 147,
171]. Many of these methods relied, to some extent, on facial landmark detection performed
either prior to reconstruction or concurrently, as part of the reconstruction process. By using
landmark detectors, these methods were sensitive to face pose and, aside from a few recent
exceptions (e.g., 3DDFA [192]), could not operate well on faces viewed in extreme out-of-plane
rotations (e.g., near-profile). Scale changes and occlusions were also problems: either because
landmarks were too small to be accurately localized or were altogether invisible due to occlusions,
detection and 3D reconstruction were not handled well. Finally, many methods applied iterative
analysis-by-synthesis steps [7, 75, 131]. This approach was not only computationally expensive,
Figure 5.1: Results of our FAME approach. We propose deep networks which regress 3DMM
shape, expression, and viewpoint parameters directly from image intensities. We show this
approach to be highly robust to appearance variations, including out-of-plane head rotations (top
row), scale changes (middle), and ages (bottom).
but also hard to distribute and run in parallel on dedicated hardware offered by graphical processing
units (GPU).
We offer a novel, efficient, and accurate alternative to these methods by describing a deep
learning–based approach for face alignment, modeling, and expression estimation (FAME). We
show how deep networks can separately estimate each of the following components of a 3D
morphable face model (3DMM) representation (Sec. 5.2): 3D face shape, six degrees of freedom
(6DoF) viewpoint, and 3D face expression (Fig 5.1). We see access to sufficient labeled training
data as a key concern in such an approach and explain how we mitigate this problem and obtain
massive, labeled data sets for the supervised training of our networks.
Contrary to others, our approach does not require facial landmark detection at test time and
instead models faces directly from image intensities. Still, if facial landmarks are required, we
show how they can be estimated as a by-product of our modeling, rather than as part of our
modeling process (Sec. 5.3).
In Sec. 3.1, we have discussed the shortcomings of facial landmark detection benchmarks. In
particular, we claim that manual landmark annotations used as ground truth by such benchmarks
can be arbitrary and inaccurate. Thus, better detection accuracy may actually reflect better
estimation of uncertain human labels rather than, say, better face alignment. A similar observation
was recently made by others [41] and is reflected in the design of the PIFA protocol [79].
To address these concerns, we propose a simple, alternative test paradigm which considers
the bottom line performances of the systems employing these methods. Thus, because one of
the most popular applications of facial landmark detectors is face alignment in face recognition
pipelines [23, 117], we evaluate different methods by measuring their effect on face recognition
accuracy. To this end, we use the challenging, unconstrained images of the IJB-A [88] and IJB-
B [159] benchmarks (Sec. 5.5.1). Both sets contain images with viewing conditions which are
typically far more challenging than those in landmark benchmarks.
To further compare accuracy of non-rigid face deformation estimations, we propose tests on
facial emotion classification benchmarks (Sec. 5.5.2). For this purpose, we use both the controlled
images in the Extended Cohn-Kanade (CK+) benchmark [109] and the unconstrained images in the
EmotiW-17 benchmark [38]. We use both benchmarks to test how different expression estimates,
obtained by different methods, affect the quality of descriptors used for emotion classification.
Finally, we evaluate the facial landmarks obtained by our method and others on the popu-
lar 300W benchmark [134] and AFLW-PIFA dataset [79] for 2D landmark detection and the
AFLW2000-3D benchmark for 3D landmark detection [192] (Sec. 5.5.3). A surprising conclusion
of our tests is that older facial landmark detectors often outperform newer, presumably state-of-
the-art, methods when used in face processing pipelines. Our approach is not necessarily more
accurate as a landmark detector than state-of-the-art detectors, though its accuracy is comparable
to theirs. Our method is, however, far faster and more accurate than existing landmark detectors
when evaluated for its bottom line performance on face recognition and emotion classification.
5.2 Our proposed FAME framework
We describe a deep, landmark-free 3D face modeling method. Our approach, illustrated in Fig. 1.3,
is designed to provide an alternative means of obtaining the same goals previously attained with the
help of facial landmark detectors. Namely, it allows for face alignment in 3D and 2D, expression
modeling (deformations) in 3D, and 3D face shape estimation. Our method uses three deep
networks which separately estimate subject-specific 3D face shape, viewpoint, and 3D expression
deformations directly from the input image. As we later show, rather than using landmark detectors
in this process, accurate facial landmarks can actually be obtained as a by-product of our FAME
approach.
We emphasize that although we use landmark detection methods during training, our test time
system is completely landmark free. We use landmark detectors as a cheap yet effective alternative
to the manual labor required by others for labeling training images. The term “landmark-free”
therefore refers to the absence of landmark detection at test time.
5.2.1 Preliminaries
We model a 3D face shape $\mathbf{F} \in \mathbb{R}^{3n}$ using the following standard, linear 3DMM representation (for now, ignoring parameters representing facial texture and 6DoF pose):

$$\mathbf{F} = \hat{\mathbf{s}} + \sum_{i=1}^{s} \alpha_i \mathbf{S}_i + \sum_{j=1}^{m} \eta_j \mathbf{E}_j, \qquad (5.1)$$

where $\hat{\mathbf{s}} \in \mathbb{R}^{3n}$ represents the average 3D face shape. The first summation expresses shape variation as a linear combination of shape principal components $\mathbf{S} \in \mathbb{R}^{3n \times s}$ with the coefficient vector $\boldsymbol{\alpha} \in \mathbb{R}^{s}$. Here, $\mathbf{S}_i$ indicates the direction in which to deform the face, following the variation in the data learned by principal component analysis (PCA). This deformation is regulated by the corresponding scalar $\alpha_i$. 3D expression deformation is represented as a linear combination of expression components $\mathbf{E} \in \mathbb{R}^{3n \times m}$ with the coefficient vector $\boldsymbol{\eta} \in \mathbb{R}^{m}$. Here, $3n$ represents the 3D coordinates of the $n$ vertices of the 3D face shape. The numbers of components for shape, $s$, and expression, $m$, define the dimensionality of the 3DMM coefficients.
Our representation uses the Basel Face Model (BFM) 3DMM shape components [121], where $s = 99$, and the expression components defined by 3DDFA [192], with $m = 29$. The vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\eta}$ control the intensity of the deformations provided by the principal components. Given estimates for $\boldsymbol{\alpha}$ and $\boldsymbol{\eta}$, it is thus possible to reconstruct the 3D shape of the face appearing in the input image by Eq. (5.1).
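As a concrete illustration of Eq. (5.1), the following minimal sketch reconstructs a 3D face shape from shape and expression coefficients. This is our own Python/NumPy illustration; the array names and shapes are placeholders, not the thesis code.

import numpy as np

def reconstruct_shape(s_mean, S, E, alpha, eta):
    # s_mean: (3n,) mean face shape; S: (3n, s) shape PCA basis;
    # E: (3n, m) expression basis; alpha: (s,) shape coefficients (s = 99);
    # eta: (m,) expression coefficients (m = 29).
    F = s_mean + S @ alpha + E @ eta   # Eq. (5.1)
    return F.reshape(-1, 3)            # n vertices, each with (x, y, z) coordinates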
5.2.2 Modeling subject-specific 3D shape
Given image I, we estimate $\boldsymbol{\alpha}$ with the recent deep 3DMM approach of [149], using their publicly available code and pre-trained model. They regress the 3DMM coefficients, $\boldsymbol{\alpha}$, directly from image intensities using a ResNet architecture with 101 layers [63], which we refer to here as FaceShapeNet.
The decision to use this method to estimate a subject-specific 3D face shape is not arbitrary. This method was extensively tested, and the results reported in [149] demonstrate it to be both
invariant and discriminative under the harshest viewing conditions, previously considered only by
face recognition systems. In particular, it has been shown to work well and produce invariant 3D
shapes even on photos where the face is heavily occluded. In fact, it is this property which led to
its use for occluded 3D face reconstruction by [150].
We note that FaceShapeNet was designed to infer only the 3D face shape, without estimating
its viewpoint or expression [149]; i.e., the 3D face shape is produced in fixed 3D coordinates and
is not modified to account for different viewpoints and expressions. Estimating these missing
components—viewpoint and expression—is described next.
5.2.3 Modeling face viewpoint and facial expressions
We propose to infer a global, 6DoF 3D face viewpoint directly from image intensities using a
deep neural network. For a given face photo, our FacePoseNet (FPN) regresses 6DoF viewpoint
parameters. We next describe FPN and the novel method used to produce sufficient training data,
along with the pose labels required to train it. Please refer to the details of our FacePoseNet (FPN)
in Chapter 3, and FaceExpNet (FEN) in Chapter 4.
Improved FPN architecture. We experimented with two network architectures for our FPN.
The first, originally presented in [23], is a simple AlexNet [93], initialized using weights provided
by [114]. We modify this network by defining a new output layer, FC8, which uses an $\ell_2$ loss to regress a 6D floating-point output representing the 6DoF viewpoint, rather than predict a one-hot encoded, multi-class label. We initialize this new layer with parameters drawn from a Gaussian distribution with zero mean and standard deviation 0.01. All biases are initialized with zero. During training, the batch size is set to 60 and the initial learning rate is set to 0.01. The learning rate is decreased by an order of magnitude every 5,000 iterations until the validation accuracy of the fine-tuned network saturates.
Note that during training, each dimension of the head pose labels is normalized by the
corresponding mean and standard deviation of the training set, compensating for the large value
differences among dimensions. The same normalization parameters are used at test time.
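A minimal sketch of this label standardization is given below. It is our own Python/NumPy illustration (variable names are placeholders), not the thesis training code.

import numpy as np

def standardize_pose_labels(train_poses):
    # train_poses: (N, 6) array of 6DoF viewpoint labels.
    mu = train_poses.mean(axis=0)           # per-dimension mean
    sigma = train_poses.std(axis=0) + 1e-8  # per-dimension std (avoid division by zero)
    normalized = (train_poses - mu) / sigma
    return normalized, mu, sigma            # keep (mu, sigma) to undo the scaling at test time

At test time, a regressed output y would be mapped back to a pose as y * sigma + mu, using the same training-set statistics.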
In addition to the shallow architecture described above, we tested a deeper network: a ResNet
architecture with 101 layers [63] (ResNet101). This deeper network was trained on a larger
training set — the 300W-LP dataset [192] — which offers richer appearance variations including,
in particular, a larger proportion of profile views [111]. In this improved version, instead of using the pinhole camera model detailed in Eq. (3.3), we regress the weak perspective projection for the 300W-LP dataset. All other settings were similar to the ones used for the AlexNet version.
5.2.4 Discussion: Training labels from landmark detections?
Training of both our FPN (Sec. 5.2.3) and our FEN (Sec. 4.2) followed a similar theme in which
training labels were automatically synthesized by using facial landmark detectors. This approach
raises a natural concern: Wouldn’t our trained networks be only as good as the landmark detectors
used to produce their labels? First, recall that FPN was trained on images which sometimes
underwent significant augmentations (Chapter 3). The same transformations were applied to the
landmarks, producing training examples—images and viewpoint labels—which could often be
too challenging for existing landmark detectors to process (Fig. 3.2). By training our FPN on
these examples, we obtain a network that can handle such challenging viewing conditions and may
therefore be more capable than the original landmark detector.
More generally, however, is the well-known robustness of deep networks to training label
noise (here, errors in landmark detections and consequent viewpoint or expression estimates). This
robustness is especially true in large training sets such as the 2.6 million images in the VGG Face
dataset [120] and the 0.5 million face images in the CASIA WebFace collection [173] respectively
used in training FPN and FEN. This phenomenon was demonstrated by [167], who introduced
label errors to improve training, and by [120], who reported better face recognition accuracy with
a network trained with noisy labels. A similar effect is the basis for semi-supervised methods
using pseudo-labels [42], though their approach is different from our own.
In our work, this robustness is reflected in our trained networks learning to generalize beyond
any errors in their training labels, and so beyond the capabilities of the landmark detectors used to
produce their labels. These improvements are demonstrated in Sec. 5.5.
5.3 From FAME to landmark detection
For a given face photo, the process described in Sec. 5.2 provides estimates for the 3D face
shape, its viewpoint, and any deformation of its surface due to facial expressions. For many face
processing applications, these estimates would suffice: For example, we later show that our FPN
alone provides an effective means of aligning faces in 2D and 3D for face recognition, contributing
to improved recognition accuracy. Whenever a complete 3D modeling is desired, all three networks
can be used jointly.
Nevertheless, some applications may still require 2D facial landmark detection. In such cases,
landmark coordinates can be obtained from our 3D face modeling as a by-product of our approach,
rather than as a step towards modeling.
5.3.1 Landmark projections
We estimate 2D facial landmarks from a 3D face shape and a projection matrix, relating points on
the 3D surface with coordinates in the input image. Our FPN (Sec. 5.2.3) provides an estimate for
the projection matrix. 3D reference landmarks can be specified on the surface of, say, a generic 3D
face shape [61]. 2D landmarks can then be estimated by projecting the reference 3D landmarks
using the projection matrix of Sec. 5.2.3, Eq. (3.3) and Fig. 5.2(b,c).
Projecting points from a generic shape would undoubtedly cause large errors whenever the face in the image does not share the same shape as the generic one (e.g., is wider or slimmer than the generic shape) or in cases where the facial expression in the image affects landmark positions. These errors are illustrated in Fig. 5.2(c). A more accurate estimate of landmark locations would therefore project 3D landmarks from a 3D face shape which matches the one in the image, obtained using FaceShapeNet (Sec. 5.2.2) and modified for expression using FEN (Sec. 4.2).
Specifically, given the estimated shape coefficients $\boldsymbol{\alpha}$ (Sec. 5.2.2), expression coefficients $\boldsymbol{\eta}$ (Sec. 4.2), and the projection matrix derived from the 6DoF viewpoint (Sec. 5.2.3), we can reconstruct the 3D shape, F, by Eq. (5.1). Because we use a standard 3DMM representation, all
faces naturally have corresponding 3D vertices: 3D points on the face surface are consistently
indexed and the same index always refers to the same facial feature (e.g., tip of the nose), no matter
the particular face shape or expression.
A consequence of this correspondence is that we can select reference 3D facial landmarks
of interest, P, on an ideal 3D face shape, F, once, at preprocessing. Given a novel face image
and following our 3D modeling, we can project P to obtain the corresponding 2D landmark
points by simply applying Eq. (3.3). Examples of such results are provided in Fig. 5.2(e). We test
the accuracy of landmarks predicted using FPN alone (and a generic face shape) as well as by
additionally estimating expression and face shape, in Chapter 4.
Importantly, unlike some landmark detection methods, our method of obtaining facial land-
marks can easily be modified for different landmark numbers and locations without requiring
re-training or redesign of the system. Instead, to obtain detections for different landmarks, we only
need to select different reference 3D points P on F.
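The projection step described above can be sketched as follows. This is our own Python/NumPy illustration under a generic homogeneous projection; the 3×4 camera matrix and the index array are placeholders, and the exact camera model (pinhole as in Eq. (3.3) or weak perspective) depends on which FPN variant supplies the matrix.

import numpy as np

def project_landmarks(F, landmark_idx, P_cam):
    # F: (n, 3) reconstructed 3D face vertices (Eq. 5.1, reshaped).
    # landmark_idx: indices of the reference 3D landmarks P, selected once at preprocessing.
    # P_cam: (3, 4) projection matrix derived from the regressed 6DoF viewpoint.
    pts3d = F[landmark_idx]                                  # (k, 3) reference landmarks
    pts3d_h = np.hstack([pts3d, np.ones((len(pts3d), 1))])   # homogeneous coordinates
    proj = pts3d_h @ P_cam.T                                 # (k, 3) projected points
    return proj[:, :2] / proj[:, 2:3]                        # 2D image coordinates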
Figure 5.2: From FAME to landmarks. Two examples of landmarks detected using our FAME
framework. In each example: (a) Input image, (b) generic 3D face shape aligned using FPN
(Sec. 5.2.3); reference 3D landmarks (including occluded ones) rendered in green, (c) reference
3D landmarks projected onto image (Sec. 5.3.1), (d) 3D shape estimated with FaceShapeNet [149],
adjusted for pose with FPN, and expression with FEN (Sec. 4.2), (e) reference 3D landmarks on
adjusted 3D shape, projected onto image (Sec. 5.3.1), finally, (f) 2D landmarks after refinement
(Sec. 5.3.2). Note refined landmarks moved from self-occluded locations (e) to the contour
landmarks (f) used by 2D landmark detection benchmarks.
5.3.2 Landmark refinement
Our results in Fig. 5.2 show that projecting 3D landmarks from our estimated face model already
provides reasonable landmark accuracy. As we discussed in Sec. 3.1, however, standard bench-
marks for 2D facial landmark detection often measure the accuracy of landmark prediction along
face contours—landmarks which, unlike our 3D reference points, change according to facial pose.
If such points are desired, and to better optimize our detections to other variabilities of the manual
annotations of these benchmarks, we further perform image-based, 2D landmark refinement.
To this end, we use a modified version of the regression sampler described by [9]. In our
framework, we use this component to estimate 2D offsets for our projected landmarks (Sec. 5.3.1)
and refer to it as our offset regression network. For details on the regression sampler, we refer
to [9]. We found it necessary, however, to make the following changes to their design when using
it in our framework.
Specifically, they used their shared feature extractor network to obtain local representations
at landmark coordinates. We, instead, use our FPN (Sec. 5.2.3) for the same purpose. FPN
was selected as it is trained to estimate pose, and so we expect it to capture viewpoint-related
information. We found that the FPN conv2 layer is suitable for this purpose and use it as the input
to our offset regression network.
[9] used two convolutional layers in their regression sampler, with 3×3 and 1×1 convolutions. We found that our results improved if the input to our offset regression network was a 5×5 convolutional layer, which is then followed by their 3×3 and 1×1 layers. Finally, we appended
another fully connected layer with ReLU activation to the end of our offset regression network.
This layer added additional nonlinearities which were also empirically determined to improve
performance.
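To make the layout concrete, a PyTorch-style sketch of such an offset regression head is given below. This is entirely our own illustration: the channel widths, the 512-dimensional hidden layer, the 256-channel input (standing in for FPN conv2 features), and the global pooling (used here for brevity instead of the landmark-local feature sampling of [9]) are assumptions, not values or design details reported in the thesis.

import torch
import torch.nn as nn

class OffsetRegressionHead(nn.Module):
    # Hypothetical sketch: a 5x5 conv, then 3x3 and 1x1 convs (as in [9]),
    # followed by an appended fully connected layer with ReLU and a final offset layer.
    def __init__(self, in_channels=256, num_landmarks=68):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # simplification: pool globally over the feature map
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 512), nn.ReLU(inplace=True),  # appended FC + ReLU
            nn.Linear(512, 2 * num_landmarks),          # (dx, dy) offset per landmark
        )

    def forward(self, fpn_conv2_features):
        return self.regressor(self.features(fpn_conv2_features))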
Our offset regression network was trained separately for 2D and 3D landmark detection: for 2D training, we used AFW [190], the training splits of LFPW [8], HELEN [98], and AFLW-PIFA [103]. We used 300W-LP [192] as a training set for 3D landmark refinement. In both cases, training labels for our offset regression network were obtained by subtracting the projected landmark locations (2D landmarks obtained by projecting 3D reference points; Sec. 5.3.1) from the ground truth landmark coordinates provided by each benchmark. The $\ell_1$ loss is used here to prevent our model from being affected by a few landmarks with large offsets from the ground truth keypoints.
All layers in our offset regression network are initialized with parameters drawn from a
Gaussian distribution with zero mean and standard deviation 0.01. All biases are initialized with
zero. During training, batch size is set to 30 and the initial learning rate is set to 0.001. Learning
rate is decreased by a factor of 0.8 per epoch until the validation accuracy converges. Examples of
these refined landmarks are provided in Fig. 5.2(f).
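As a small sketch of how such training targets and the $\ell_1$ objective could be formed, consider the snippet below. It is our own PyTorch illustration; the tensor names are placeholders and not the thesis code.

import torch
import torch.nn.functional as F

def offset_targets_and_loss(pred_offsets, projected_lms, gt_lms):
    # projected_lms, gt_lms: (B, K, 2) projected and ground-truth 2D landmarks.
    # pred_offsets: (B, 2K) raw network output, reshaped to per-landmark offsets.
    target = gt_lms - projected_lms        # training label: residual offsets
    pred = pred_offsets.view_as(target)
    return F.l1_loss(pred, target)         # L1 keeps a few large offsets from dominating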
5.4 A critique of existing test paradigms
Our framework can be used in face processing systems as an alternative to existing facial landmark
detectors. These methods are typically tested on benchmarks designed to measure landmark
detection accuracy. Some such popular benchmarks include AFW [190], LFPW [8], HELEN [98],
iBUG [134], and 300W [135]. Before testing our method, we pause to consider how accuracy is
evaluated on these benchmarks and raise a number of potential problems with these evaluation
paradigms.
5.4.1 Detection accuracy measures
Facial landmark detection accuracy is typically measured by considering the distances between
estimated landmarks and ground truth (reference) landmarks, normalized by the reference inter-
ocular distance of the face [33]:
$$e(L, \hat{L}) = \frac{1}{m\,\lVert \hat{p}_l - \hat{p}_r \rVert_2} \sum_{i=1}^{m} \lVert p_i - \hat{p}_i \rVert_2, \qquad (5.2)$$

Here, $L = \{p_i\}$ is the set of $m$ estimated 2D facial landmark coordinates, $\hat{L} = \{\hat{p}_i\}$ their ground truth locations, and $\hat{p}_l$, $\hat{p}_r$ the reference left and right eye outer corner positions. These errors are then translated into a number of standard quantities, including the mean error rate (MER), the percentage of landmarks detected under certain error thresholds (e.g., below 5% or 10% error rates), or the area under the cumulative error curve (AUC).
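A minimal sketch of this error measure is given below (our own Python/NumPy illustration; the landmark arrays and eye-corner indices are placeholder inputs).

import numpy as np

def normalized_landmark_error(pred, gt, left_eye_idx, right_eye_idx):
    # pred, gt: (m, 2) predicted and ground-truth 2D landmarks for one face.
    interocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])  # outer eye corners
    per_point = np.linalg.norm(pred - gt, axis=1)                       # per-landmark errors
    return per_point.sum() / (len(gt) * interocular)                    # Eq. (5.2)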
The ground truth compared against is manually specified, often by Mechanical Turk workers.
As we detail next, these manual annotations can be misleading. Moreover, Eq. (5.2) itself can also
be misleading: Normalizing detection errors by inter-ocular distances biases against images of
faces appearing at non-frontal views. When faces are near-profile, perspective projection of the 3D
face onto the image plane shrinks the distances between the eyes—thereby unnaturally inflating
the errors computed for such images.
Figure 5.3: Visualizing potential problems with facial landmark detection benchmarks. Three
example photos along with their ground truth landmark annotations. (a) Annotations on the jawline
and bridge of the nose do not correspond to well-defined facial features (from LFPW [8]). (b)
Landmarks represent points on the face contour and so represent different facial locations in
different views (AFW [190]). (c) 3D landmark annotations represent occluded facial regions (3D
Menpo [180]).
5.4.2 Ill-defined facial locations
Recent benchmarks provide annotations for 49 or 68 facial landmarks. These annotations are
presumed to offer more stability than the far fewer (typically five) landmarks originally used by
older sets such as AFW [190] and LFPW [8]. By doing so, however, these extended annotations
include many facial locations which do not correspond to well-defined facial features, such as
points along the jawline or the bridge of the nose (illustrated in Fig. 5.3(a)).
It is well known that human annotators tend to vary greatly in the positions they choose for
these ill-defined landmarks [135]. This variance, however, is not reflected in the measures used to
report accuracy (e.g., Eq. (5.2)). Thus, estimating plausible positions for jawline landmarks may
raise or lower the error depending on the ground truth annotations, despite any uncertainty of the
ground truth.
5.4.3 Viewpoint-dependent facial locations
2D landmark detection benchmarks provide landmark annotations on face contours (Fig. 5.3(b)).
These landmark locations correspond to different facial features in different views. Although there
may be applications which require detection of facial contours, the use of these landmarks for face
alignment (i.e., by matching them with corresponding points in reference views; Sec. 2.5) can
introduce alignment errors.
5.4.4 Occluded points
In response to this last problem, some have recently proposed labeling faces with 3D points, which
correspond to the same facial features regardless of the viewpoint [135, 180]. The problem here
is, of course, that because these points are occluded, it would be hard for human annotators to
Figure 5.4: Reference landmark selection vs. detection accuracy. 68 reference points projected
from a 3D face shape to the input image, along with landmark detection accuracy. Two sets of
3D reference points are considered: red points were manually annotated on the reference 3D face,
green points were obtained by projecting 2D landmarks, detected by dlib [86, 82] on a frontal
face image, onto the 3D reference shape. Both sets of landmarks can be equally used for face
alignment and processing, but red points produce far lower landmark prediction errors. Does this
imply that the red points are better than the green?
reliably localize these points, leading to increased uncertainty in their ground truth annotations
(Fig. 5.3(c)).
5.4.5 Illustrative examples
Fig. 5.4 illustrates some of these problems, using landmarks estimated with our FAME approach
and the landmark projection method of Sec. 5.3. In particular, it demonstrates the effect 3D
reference landmark selections have on landmark estimation accuracies, measured on the 300W
benchmark [135].
For each of the three example faces, we projected two different sets of 3D landmarks using the
method described in Sec. 5.3. One was manually specified (red points), whereas the other (green)
was obtained by using dlib [86, 82] to detect 2D landmarks on a reference, frontal face photo (not
shown) and projecting these landmarks onto the 3D surface.
Both sets of points are reasonable choices for reference facial landmarks and, of course,
both sets of points can be used for the same purposes. For example, consistently using the
same reference point selection (either red or green) would provide the landmark correspondences
required for 2D or 3D face alignment between face photos (Sec. 2.5). These correspondences
would be equally accurate regardless of which set of reference landmarks are used (so long as this
selection is consistent).
We provide landmark detection accuracies for these two sets of landmarks on three images
from the 300W benchmark, using its ground truth annotations to compute these errors. These
errors are clearly very different despite the fact that there is no real practical difference between
the two sets of points. These results support our concerns with landmark detection benchmarks.
They suggest that errors reported on these benchmarks may not reflect meaningful differences in
the quality of different detection methods, but rather the accuracy of these methods in predicting
manual annotations.
5.4.6 Landmark detection vs. face processing application
Consequent to this visualization, and possibly the most important concern, is that facial landmark
detectors are not used on their own but rather by face processing systems that require landmarks for
a variety of purposes. Because landmark detectors are tested independently of these face processing
systems, it is not clear if and how improving their accuracy affects the performance of the systems
that use them. In fact, as we show in our experiments (Sec. 5.5), older landmark detectors which
were outperformed on landmark detection benchmarks by newer methods, sometimes provide
better bottom-line results than their newer alternatives.
5.5 Experimental results
We extensively test our approach and its components, reporting a wide range of quantitative and
qualitative results. Following the concerns raised in Sec. 5.4, we propose a novel approach for
evaluating face alignment and expression estimation methods—that is, by measuring their effect on
bottom-line performances of face recognition (Sec. 5.5.1) and emotion classification benchmarks
(Sec. 5.5.2). Doing so also places different landmark detection techniques (with varying numbers
of detected landmarks) and direct approaches such as ours on even grounds, allowing for direct
comparison between these methods. Although landmark prediction is not a focus of this thesis, we
additionally provide extensive landmark prediction results in Sec. 5.5.3.
We note that FaceShapeNet (Sec. 5.2.2) is not a contribution of this thesis. The 3D shapes
estimated by that network have been extensively tested by [149] and shown to provide state-of-the-
art 3D reconstruction accuracy as well as robustness to extreme viewing conditions. We refer to
their paper for more details on these results. Their method, however, does not estimate viewpoint
or adjust the 3D face to account for expressions. We therefore focus on evaluating viewpoint and
expression accuracies.
5.5.1 Evaluating face alignment
Facial landmarks are predominantly used for 2D and 3D face alignment in face recognition
systems [23]. Rather than measure the accuracy of the landmark positions, we therefore propose
to test the effects different landmark detectors (or, more generally, different face alignment
techniques) have on face recognition accuracy. The rationale is that a face recognition system
which uses aligned face images would be affected by the quality of the method used for this
alignment. Better alignment should therefore translate to better face recognition results, regardless
of the specific method used for alignment.
Importantly, the purpose of these experiments is not to demonstrate state-of-the-art face
recognition accuracy. In particular, we do not optimize a face recognition pipeline in order to
set new recognition accuracy rates. Instead, we use face recognition benchmarks to compare the
effects of different alignment methods.
Face recognition for evaluating face alignment. We test our improved FPN of Sec. 5.2.3
independently of the other two networks, using it to estimate the viewpoint of a face in an input
image and then align the face in 2D and 3D. As baseline methods, we use the following popular,
state-of-the-art, facial landmark detectors: dlib [86], CLNF [3], OpenFace [4], DCLM [178],
RCPR [19], and 3DDFA [192].
We clarify that deep networks, both our FPN and some of the baseline detectors [4, 178, 192],
require large quantities of training data. Training FPN requires 2.6 million images for the AlexNet-
structure and 122K for the ResNet variant. These numbers are greater than those used to develop
older methods such as dlib [86]. These differences should be considered when comparing results
of these methods.
Face recognition benchmarks. Our tests use two of the most recent benchmarks for face
recognition: IARPA Janus Benchmark A [88] and B [159] (IJB-A and IJB-B, resp.). Importantly,
these benchmarks were designed with the specific intention of elevating the difficulty of face
recognition. This heightened challenge is reflected by, among other factors, an unprecedented
amount of extreme out-of-plane rotated faces, including many appearing in near-profile views [115].
Consequently, these two benchmarks raise the bar well above other facial landmark detection
benchmarks.
Face recognition pipeline. All face alignment methods were tested using the same face
recognition pipeline, similar to the one proposed by [116, 115]. We use this system partly
because its code and trained models are publicly available, allowing for reproduction of our results.
More importantly, however, this system explicitly aligns faces to multiple viewpoints, in
both 2D and 3D. These steps are highly dependent on viewpoint estimation quality and so
their recognition accuracy reflects viewpoint accuracy. In practice, we used their 2D (similarity
transform) and 3D (face rendering) code directly, only changing the viewpoint estimation step.
Our tests compare different landmark detectors used to recover the 6DoF head pose required by
their warping and rendering method (converting facial landmarks as described for our FPN training
data in Sec. 5.2.3), with the 6DoF directly regressed by our FPN.
Their system uses a single ResNet101 architecture [63], trained on both real face images
and synthetic, rendered views. We found that better face recognition results can be obtained by
fine-tuning their network usingL2-constrained Softmax Loss [125], rather than their original
Softmax [116, 115]. This fine-tuning is performed using the MS-Celeb face set [105] as the
training set. Aside from this change, we use the same recognition pipeline from [116], and we
refer to that paper for details.
Bounding box detection. We emphasize that we tested all methods with an identical pipeline,
only changing alignment methods; different results vary only in the method used to estimate facial
pose. The only other difference was in the facial bounding box detector.
Facial landmark detectors can be sensitive to the face detector they are used with. We therefore
report results obtained when running landmark detectors with the best bounding boxes we were
able to determine for each method. Specifically, FPN was tested with the bounding boxes returned
by the detector of [172], as described in Sec. 5.2.3, including the rescale factor of 1.25. We found
that most facial landmark detectors performed best when applied with the same face detector but
without the 25% increase. Finally, 3DDFA [192] was tested with the same face detector followed
by the face box expansion code provided by its authors.
Face verification and identification results. Face verification and identification results on
both IJB-A and IJB-B are provided in Table 5.1. We report multiple recognition metrics for both
verification and identification. For verification, these measure the recall (true acceptance rate) at
three cutoff points of the false alarm rate (TAR-{1%, 0.1%, 0.01%}). For identification we provide
recognition rates at four ranks from the CMC (cumulative matching characteristic). The overall
performances in terms of ROC and CMC curves are shown in Fig. 5.5. For clarity, the figure only
reports results for the ResNet101 version of our FPN. The table also provides, as reference, three
state-of-the-art IJB-A results [30, 116, 125] and baseline results from [159] for IJB-B. To our
knowledge, we are the first to report verification and identification accuracies on IJB-B.
[Plots: (a) ROC on IJB-A, (b) CMC on IJB-A, (c) ROC on IJB-B, (d) CMC on IJB-B. ROC axes: False Acceptance Rate vs. True Acceptance Rate; CMC axes: Rank Score vs. Recognition Rate. Curves compare RCPR, Dlib, CLNF, OpenFace, DCLM, 3DDFA, and our Improved FPN.]
Figure 5.5: Verification and identification results on IJB-A and IJB-B. ROC and CMC curves
accompanying the results reported in Table 5.1.
Faces aligned with our original FPN and the ResNet101 FPN (improved FPN) lead to better
recognition accuracy, even compared to the most recent, state-of-the-art facial landmark detection
method of [178]. Remarkably, faces aligned with FPN provide substantially better recognition
accuracy than those aligned with the facial landmark detector of [4], which is the method used
to produce the viewpoint labels for training our FPN. This result supports the claims made in
Sec. 5.2.4 on the robustness of deep networks to label noise.
5.5.2 Evaluating expression estimation
In section 4.3.1, we have evaluated our FaceExpNet on the Extended Cohn-Kanade (CK+)
dataset [109] and the Emotion Recognition in the Wild Challenge (EmotiW-17) dataset [38]
by using a kNN classifier with K = 5, without optimizing K. In this section, we additionally report results with an SVM (RBF kernel) to show the consistent improvement of our method irrespective of the classifier used. Of course, all reported results are far from the state of the art on this set. As previously noted, our goal is not to outperform state-of-the-art emotion classification, but rather to compare methods for expression coefficient estimation.
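A minimal sketch of the additional SVM evaluation is given below. As with the earlier kNN sketch, this is our own illustration with scikit-learn (assumed for convenience); the array names are placeholders, not the thesis code.

from sklearn.svm import SVC

def classify_emotions_svm(train_coeffs, train_labels, test_coeffs):
    # Same 29-D expression-coefficient features as in the kNN experiment,
    # now classified with an RBF-kernel support vector machine.
    svm = SVC(kernel="rbf")
    svm.fit(train_coeffs, train_labels)
    return svm.predict(test_coeffs)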
Baseline methods. We compare our approach to widely used, state-of-the-art face landmark
detectors: Dlib [86], CLNF [3], OpenFace [4], DCLM [178], RCPR [19], and 3DDFA [192].
Of these, 3DDFA is the only one that, similar to our FEN, directly estimates 29D expression
[Plots: emotion classification accuracy vs. image scale on CK+; (a) kNN classifier, (b) SVM (RBF kernel) classifier. Curves compare our FEN, OpenFace, DCLM, Dlib, CLNF, RCPR, and 3DDFA.]
Figure 5.6: Emotion classification over scales on the CK+ benchmark. Curves report emotion classification accuracy over different scales of the input images. Lower scale indicates lower resolution. Original resolution is 640×490. (a) reports results with a simple kNN classifier. (b) Same as (a), now using an SVM (RBF kernel) classifier.
coefficient vectors. For all other methods, following landmark detection, expression coefficients
were estimated using Eq. (4.2).
5.5.3 Landmark detection accuracy
Sec. 5.3 explains how facial landmark estimates can be obtained from our 3D models. Although
our goal is 3D face modeling and not breaking published state-of-the-art performance on face
landmark detection benchmarks, it is instructive to consider the accuracy of landmarks estimated as a by-product of our FAME approach. We therefore tested our method on landmark detection
benchmarks for 2D landmarks—the 300W [135] and AFLW-PIFA [80] benchmarks—and 3D
landmarks, the AFLW 2000-3D dataset [192].
300W. 300W [135] contains multiple face alignment sets with 68 landmark annotations: AFW,
LFPW, HELEN, and iBUG.
AFLW-PIFA. This dataset [79] offers 5200 images sampled from AFLW [92] with a balanced
distribution of yaw angles and left vs. right viewpoints. Each image is labeled with up to 21
landmarks with a visibility label for each landmark. [80] offers 13 additional landmarks for these
images, for a total of 34 landmarks per image.
AFLW2000-3D. 3D face modeling and alignment is widely evaluated on the set offered
by [192]. The data set contains ground truth 3D faces and corresponding 68 landmarks for the first
2000 AFLW samples.
Importantly, ground truth annotations in 300W and AFLW-PIFA are provided only for visible face contours. Thus, the same landmarks in different images reflect different facial details (Sec. 5.4). Landmark annotations in AFLW2000-3D [192], on the other hand, reflect the same facial features, though these features may be self-occluded in different images and their locations were therefore guessed by the annotators. Regardless, landmark annotations in all these
benchmarks are conceptually different.
[Plots: emotion classification accuracy vs. image scale on EmotiW-17; (a) kNN classifier, (b) SVM (RBF kernel) classifier. Curves compare our FEN, OpenFace, DCLM, Dlib, CLNF, RCPR, and 3DDFA.]
Figure 5.7: Emotion classification over scales on the EmotiW-17 benchmark. Curves report emotion classification accuracy over different scales of the input images. Lower scale indicates lower resolution. Original resolution is 720×576. (a) reports results with a simple kNN classifier. (b) Same as (a), now using an SVM (RBF kernel) classifier.
[Plots: CED curves on 300W using detected face bounding boxes; (a) Common, (b) Challenging, (c) Full splits. Axes: point-to-point error normalized by interocular distance vs. fraction of images. Curves compare RCPR, Dlib, CLNF, OpenFace, DCLM, 3DDFA, and our Improved FPN, FEN, FSN (+ refinement).]
Figure 5.8: Comparisons of CED curves on 300W with the face bounding boxes detected by [172].
Results provided for the Common (HELEN and LFPW), Challenging (iBUG), and Full (HELEN,
LFPW, and iBUG) splits.
Evaluation metrics. Alignment accuracy is evaluated by the normalized mean error (NME):
the average of landmark error normalized by the interocular distance on 300W [135], and by the
bounding box size on AFLW-PIFA and AFLW2000-3D datasets [79, 177].
5.5.3.1 300W results with ground truth bounding boxes
We follow the standard protocol of [187], where the training part of LFPW, HELEN and the entire
AFW are used for fine-tuning our FAME networks, and perform testing on three parts: the test
samples from LFPW and HELEN as the common subset, the 135-image iBUG as the challenging
subset, and the union of them as the full set with 689 images in total.
Table 5.2 compares our landmark detection errors with those of recent state-of-the-art methods.
We use the ground truth face bounding boxes provided by the benchmark. Note that all the baseline
results provided in Table 5.2 except for dlib, CLNF, and OpenFace were reported by their authors
in the original publications.
Our results (FPN (AlexNet) + FEN (ResNet101) + FaceShapeNet (ResNet101) + refinement)
achieve comparable results with the recent state of the art [192]. Although more accurate detections are reported by [178], their method is three orders of magnitude slower than our own.
Furthermore, as our face recognition results show in Sec. 5.5.1, better landmark detection accuracy
does not always imply better bottom-line performance of the face processing pipeline.
Finally, we note the effects of better approximating the 3D face shape, as evident in our bottom
three results. Landmarks estimated using FPN alone are not particularly accurate. By adding shape
and expression estimation (FPN + FEN + FaceShapeNet), predictions are substantially improved.
Landmark refinement (Sec. 5.3.2) provides an additional drop in landmark localization errors
compared with the manual annotations.
We report additional improvement in detection accuracy by adopting the deeper ResNet101
version of our FPN and training it with more profile faces (Sec. 5.2.3). Some of this improvement
may be due to the fact that our method estimates positions for fixed, physical facial positions,
including positions that are occluded from view, whereas 300W measures accuracy of contour
points which change depending on viewpoint (Sec. 5.4). Refinement moves our landmarks towards
the visible contour (see Fig. 5.2(e) vs. (f)) and reduces some of these errors.
5.5.3.2 300W results with detected bounding boxes
The 300W benchmark provides ground truth face bounding boxes for all of its images. In practical
scenarios, such bounding boxes would likely not be available, and a face detection method would
be used to obtain these bounding boxes. We therefore tested performances on 300W images using
detected bounding boxes. Our FAME networks are fine-tuned on the training parts of LFPW,
HELEN, and the entire AFW using the same face detector. The results are provided in Table 5.3.
Fig. 5.8 additionally offers cumulative error distribution (CED) curves on 300W with detected
bounding boxes.
The same face detection method of [172] was used with all landmark detectors. Because
landmark detectors can be sensitive to the choice of face detection method, we attempted to
optimize performances for these baseline methods by scaling the face detection bounding box,
using the best scaling for each method. Note that we report baseline results in Table 5.3 only for
methods for which we could find code available. Our results are, however, consistent with those
reported by the different authors, appearing in Table 5.2.
There are a number of noteworthy observations from our results in Table 5.3. First, the
accuracy of all methods tested here dropped somewhat compared to the precision reported using
ground truth bounding boxes (Table 5.2). All variations of our approach, however, show only
small drops in accuracy. We attribute this robustness to face bounding boxes to the scale changes
which we synthetically introduced to the data set used to train our FPN (Sec. 5.2.3). By making
our network robust to scale changes, it better handles differences in the tightness of the detected
facial bounding boxes.
Another remarkable result is that in a realistic use-case where a face detector is used to obtain
face bounding boxes, dlib [86] appears to perform very well, despite its age and despite being the
fastest landmark detector we tested. This result is consistent with those in Sec. 5.5.1 where dlib
also performed well in aligning faces for recognition.
Finally, our method (FPN + FEN + FaceShapeNet + refinement) obtains the most accurate
landmark detection results. Even without the landmark-specific refinement step, the accuracy
of our approach (FPN + FEN + FaceShapeNet) is comparable to the existing state-of-the-art,
DCLM [178], outperforming it on both the challenging and full splits despite being at least an
order of magnitude faster.
5.5.3.3 Landmark detection runtime
Both Tables 5.2 and 5.3 also report the runtimes for the various methods we tested. These runtimes
were all measured by us on the same machine (see Sec. 5.5.1); missing results in Table 5.2
represent methods for which we were unable to run the code ourselves.
It is worth noting that our full FAME approach is only slower than RCPR [19] and the very
fast dlib [86]. These two methods, however, provide less accurate rigid alignment than even our
(much faster) FPN alone (Sec. 5.5.1) and are less accurate than our full approach in landmark
localization (Tables 5.2 and 5.3).
5.5.3.4 AFLW 2000-3D results
Because we estimate 3D face shape, we can also report landmark detection accuracy with 3D
landmarks on the AFLW2000-3D benchmark of [192]. Results are provided in Table 5.4 for three categories of absolute yaw angle: [0, 30], (30, 60], and (60, 90]. Following [9], we use the bounding box associated with the 68 landmarks. The NME is computed using the bounding box size.
The cumulative error distribution (CED) curves are reported in Fig. 5.9. All baselines except for
3DSTN [9] were reported by their respective authors.
The very recent method of [9] appears to perform better than most other baselines in most of
the tests. Our approach (FPN + FEN + FaceShapeNet + refinement) is the most accurate in the ranges [0, 30] and (30, 60], coming in second in (60, 90], and it outperforms other methods designed for 3D landmark detection (e.g., 3DDFA [192]). Even without the landmark-specific refinement step, our method is the most accurate in (30, 60].
5.5.3.5 AFLW-PIFA results
We report landmark detection results on AFLW, strictly following the PIFA protocol suggested by [79].¹ PIFA provides 5,200 images where the numbers of images with absolute yaw angle viewpoints within [0, 30), [30, 60), and [60, 90] are approximately one third each. Finally, 3,901 of these images are used for training and 1,299 for testing. Note that [96] used many more images for training (23,386) and a different testing partition with 1,000 images. Their results are not directly comparable to the others.
All AFLW-PIFA images are labeled with up to 21 landmarks [79] and 34 landmarks [80].
Results based on our best method using FPN, FEN, and FaceShapeNet (all with a ResNet101
architecture) are provided in Table 5.5. For fair comparison, we report results for both 34 landmarks
and 21 landmarks.
¹ The train/test partitions of PIFA are available at http://cvlab.cse.msu.edu/project-pifa.html
[Plot: CED curves on AFLW2000-3D. Axes: point-to-point error vs. fraction of images. Curves compare CDM, RCPR, ESR, SDM, 3DDFA, 3DDFA+SDM, and our Improved FPN, FEN, FSN (+ refinement).]
Figure 5.9: Comparisons of CED curves on AFLW2000-3D. To balance the yaw distributions, we randomly sample 699 faces from AFLW2000-3D, split evenly among the three yaw categories, and compute the CED curve. This is done 10 times and the average of the resulting CED curves is reported.
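The balanced-yaw evaluation described in this caption could be sketched as follows. This is our own Python/NumPy illustration; the per-image error values, yaw-bin labels, threshold grid, and random seed are all placeholder assumptions.

import numpy as np

def balanced_ced(errors, yaw_bins, per_bin=233, repeats=10, thresholds=None):
    # errors: (N,) per-image NME values; yaw_bins: (N,) values in {0, 1, 2}
    # marking the [0,30], (30,60], (60,90] absolute-yaw categories.
    if thresholds is None:
        thresholds = np.linspace(0, 0.2, 100)
    rng = np.random.default_rng(0)
    curves = []
    for _ in range(repeats):
        picked = np.concatenate([
            rng.choice(np.where(yaw_bins == b)[0], per_bin, replace=False)
            for b in (0, 1, 2)
        ])  # 3 x 233 = 699 faces, evenly split across the yaw categories
        sampled = errors[picked]
        curves.append([(sampled <= t).mean() for t in thresholds])
    return thresholds, np.mean(curves, axis=0)  # average CED over the repeated draws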
5.5.3.6 Discussion: FAME vs. OpenFace
The face recognition results reported in Sec. 5.5.1 demonstrate that our FPN better aligns faces for
face recognition than OpenFace [4], the method used to produce pose labels for training our FPN.
In this section, Tables 5.2 and 5.3 directly compare the landmark detection accuracy of our FAME
approach with the accuracy reported by OpenFace.
Alone, the landmark accuracy of FPN is comparable with OpenFace—FPN matching the
accuracy of the method used to train it—though FPN is nearly an order of magnitude faster and,
unlike OpenFace, was not designed or optimized for landmark detection. This is not entirely
surprising, as FPN is trained to solve a 6D regression problem (six floating point numbers), whereas
landmark detection methods such as OpenFace try to solve a harder 2×49- or 2×68-dimensional
regression task (49 or 68 integer 2D image coordinates). With a simpler regression problem, we
can therefore train FPN to achieve better accuracy than the method used to produce its labels.
Importantly, even without the refinement step which optimizes for landmark localization, our
FAME approach clearly outperforms OpenFace.
5.5.4 Qualitative results
We visualize the results of our FAME approach (FPN + FEN + FaceShapeNet) by rendering 3D
face shape estimates superimposed over the original input images. We use images from the
IJB-A benchmark for this purpose.
Fig. 5.10 provides a few examples of the limitations of our approach, comparing them with the
results obtained by the state-of-the-art 3DDFA of [192]. Because profile views are underrepre-
sented in the VGG Face set used to train our FPN in Sec. 5.2.3, our network is conservative in
the poses it estimates, preferring yaw angles lower than 90° (Fig. 5.10(a)). 3DDFA was explicitly
designed to handle such profile views and so handles these images better. Both methods fail on
images with other extreme viewing conditions: head yaw rotations beyond profile (Fig. 5.10(b))
and extreme scales (Fig. 5.10(c)).
Finally, Fig. 5.11 provides a wide range of example 3D reconstructions, produced using IJB-A
images. These results were selected to represent varying viewpoints, ethnicities,
genders, facial expressions, occlusions, and image qualities. We provide baseline results
for 3DDFA [192] and the FaceShapeNet method of [149] (adjusted for viewpoint using FPN,
Sec. 5.2.3). Our estimated 3D shapes clearly capture subject-specific facial attributes (e.g., ethnicity
and gender). Expressions are also very evident on the reconstructed shapes.
Our results are especially remarkable considering the extreme conditions in some of these
images: severe occlusions in Fig. 5.11(a,f,h,q), very low-quality photos in Fig. 5.11(c,e,q), and a
wide range of scales (see Fig. 5.11(b) vs. Fig. 5.11(l,n)).
5.6 Conclusions
Over the past decade, facial landmark detection methods have played a tremendous part in
advancing the capabilities of face processing applications. Despite these contributions, landmark
detection methods and the benchmarks that measure their performances have their limits. We show
that deep learning can be leveraged to perform tasks that, until recently, required the use of these
facial landmark detectors. In particular, we show how face shape, viewpoint, and expression can
be estimated directly from image intensities, without the use of facial landmarks. Moreover, facial
landmarks can be obtained as by-products of our deep 3D face modeling process.
By proposing an alternative to facial landmark detection, we must also provide novel alterna-
tives for evaluating the effectiveness of landmark-free methods such as our own. We therefore
compare our method with facial landmark detectors by considering the effect these methods have
on the bottom-line performance of the methods that use them: face recognition for rigid 2D and
3D face alignment, and emotion classification for non-rigid expression estimation. Of course,
these tests are not meant to be exhaustive: This evaluation paradigm can potentially be extended to
other benchmarks, representing other face processing tasks.
In addition to extending our tests to other face processing applications, another potential
direction for future work is improvement of our proposed FAME framework. Specifically, notice
that our FPN is trained to estimate pose for a generic face shape, whereas in practice, the 3D face
shape that we project is subject- and expression-adjusted to the input face. This discrepancy can
lead to misalignment errors, even if small ones. These errors may be mitigated by combining the
three networks into a single, jointly learned, FAME network.
Table 5.1: Verification and identification on IJB-A and IJB-B, comparing landmark detection–
based face alignment methods. Three baseline IJB-A results are also provided for reference at the
top of the table. † Numbers estimated from the ROC and CMC in [159].
Method ↓ / Eval. →   TAR@FAR: 0.01%  0.1%  1.0%   Identification Rate (%): Rank-1  Rank-5  Rank-10  Rank-20
IJB-A [88]
Crosswhite et al. [30] – – 93.9 92.8 – 98.6 –
Ranjan et al. [125] 90.9 94.3 97.0 97.3 – 98.8 –
Masi et al. [116] 56.4 75.0 88.8 92.5 96.6 97.4 98.0
RCPR [19] 64.9 75.4 83.5 86.6 90.9 92.2 93.7
Dlib [86] 70.5 80.4 86.8 89.2 91.9 93.0 94.2
CLNF [3] 68.9 75.1 82.9 86.3 90.5 91.9 93.3
OpenFace [4] 58.7 68.9 80.6 84.3 89.8 91.4 93.2
DCLM [178] 64.5 73.8 83.7 86.3 90.7 92.2 93.7
3DDFA [192] 74.8 82.8 89.0 90.3 92.8 93.5 94.4
Our FPN 77.5 85.2 90.1 91.4 93.0 93.8 94.8
Our improved FPN 78.5 86.0 90.8 91.6 93.4 94.0 94.8
IJB-B [159]
GOTs [159]† 16.0 33.0 60.0 42.0 57.0 62.0 68.0
VGG Face [159]† 55.0 72.0 86.0 78.0 86.0 89.0 92.0
RCPR [19] 71.2 83.8 93.3 83.6 90.9 93.2 95.0
Dlib [86] 78.1 88.2 94.8 88.0 93.2 94.9 96.3
CLNF [3] 74.1 85.2 93.4 84.5 90.9 93.0 94.8
OpenFace [4] 54.8 71.6 87.0 74.3 84.1 87.8 90.9
DCLM [178] 67.6 81.0 92.0 81.8 89.7 92.0 94.1
3DDFA [192] 78.5 89.1 95.6 89.0 94.1 95.5 96.9
Our FPN 83.2 91.6 96.5 91.1 95.3 96.5 97.5
Our improved FPN 83.2 91.6 96.6 91.6 95.6 96.7 97.5
Table 5.2: The NME (%) of 68 point detection results on 300W with ground truth bounding boxes
provided by 300W. We use the typical split: Common (HELEN and LFPW), Challenging (iBUG),
and Full (HELEN, LFPW, and iBUG). * These methods were tested by us using codes provided
by their authors.
Method Comm. Chall. Full Sec./im.
TSPM [190] 8.22 18.33 10.20 -
ESR [20] 5.28 17.00 7.58 -
CFAN [182] 5.50 16.78 7.69 -
RCPR [19] 6.18 17.26 8.35 0.19
SDM [168] 5.57 15.40 7.50 -
LBF [126] 4.95 11.98 6.32 -
Dlib* [86] 5.41 20.31 8.33 0.009
CLNF* [3] 5.64 17.08 7.88 0.19
OpenFace* [4] 4.57 14.41 6.50 0.28
TCNN [162] 4.10 11.86 5.62 -
PCD-CNN [95] 3.67 7.62 4.44 -
DCLM [178] 3.42 7.66 4.25 15.83
3DDFA [192] 6.15 10.59 7.01 0.6
3DDFA+SDM [192] 5.53 9.56 6.31 -
FPN (AlexNet), FEN (ResNet101), FaceShapeNet (ResNet101)
FPN 8.34 13.78 9.40 0.005
FPN+FEN+FaceShapeNet 5.80 11.39 6.89 0.029
FPN+FEN+FaceShapeNet+ref. 5.03 10.59 6.12 0.20
FPN (ResNet101), FEN (ResNet101), FaceShapeNet (ResNet101)
FPN 5.78 9.82 6.57 0.088
FPN+FEN+FaceShapeNet 3.93 7.57 4.64 0.112
FPN+FEN+FaceShapeNet+ref. 3.34 6.56 3.97 0.283
Table 5.3: The NME (%) of 68 point detection results on 300W with the bounding box provided by
the face detector of [172]. We use the typical splits: Common (HELEN and LFPW), Challenging
(iBUG), and Full (HELEN, LFPW, and iBUG).
Method Comm. Chall. Full Sec./im.
RCPR [19] 8.58 22.33 11.27 0.19
Dlib [86] 4.50 17.23 6.99 0.009
CLNF [3] 8.19 21.09 10.72 0.38
OpenFace [4] 4.81 15.45 6.90 0.31
DCLM [178] 4.10 13.74 5.99 15.83
3DDFA [192] 10.64 12.87 11.08 0.6
FPN (AlexNet), FEN (ResNet101), FaceShapeNet (ResNet101)
FPN 8.17 13.14 9.14 0.005
FPN+FEN+FaceShapeNet 5.77 10.84 6.76 0.029
FPN+FEN+FaceShapeNet+ref. 5.51 10.33 6.45 0.20
FPN (ResNet101), FEN (ResNet101), FaceShapeNet (ResNet101)
FPN 6.35 11.38 7.33 0.088
FPN+FEN+FaceShapeNet 4.79 9.67 5.75 0.112
FPN+FEN+FaceShapeNet+ref. 3.97 8.51 4.86 0.283
Table 5.4: The NME (%) of 68 point detection results on AFLW2000-3D for different ranges of
yaw angles.
Method [0,30] (30,60] (60,90] mean std
RCPR [19] 4.26 5.96 13.18 7.80 4.74
ESR [20] 4.60 6.70 12.67 7.99 4.19
SDM [168] 3.67 4.94 9.76 6.12 3.21
3DDFA [192] 3.78 4.54 7.93 5.42 2.21
3DDFA+SDM [192] 3.43 4.24 7.17 4.94 1.97
3DSTN (AlexNet) [9] 3.71 5.33 7.19 5.41 1.74
3DSTN (VGG-16) [9] 3.15 4.33 5.98 4.49 1.42
FPN (AlexNet), FEN (ResNet101), FaceShapeNet (ResNet101)
FPN 5.38 10.31 19.08 11.59 6.94
FPN+FEN+FaceShapeNet 4.20 5.12 8.14 5.82 2.06
FPN+FEN+FaceShapeNet+refinement 3.35 4.15 7.05 4.85 1.94
FPN (ResNet101), FEN (ResNet101), FaceShapeNet (ResNet101)
FPN 3.78 4.23 7.25 5.09 1.89
FPN+FEN+FaceShapeNet 3.46 4.13 7.09 4.89 1.93
FPN+FEN+FaceShapeNet+refinement 3.11 3.84 6.60 4.52 1.84
Table 5.5: NME (%) results on the AFLW dataset under the PIFA protocol [79, 80]. *Followed a
non-standard training/testing setting on the AFLW dataset.
Method 34 land. 21 land.
RCPR [19] 6.26 7.15
CFSS [187] 6.75 -
PIFA [79] 8.04 6.52
CCL [188] 5.81 -
PAWF [80] 4.72 -
DeFA [103] 3.86 -
KEPLER [96]* - 2.98
FPN 4.79 4.75
FPN+FEN+FaceShapeNet 4.20 4.09
FPN+FEN+FaceShapeNet+ref. 4.03 3.9
Figure 5.10: Limitations of our approach. Results obtained with the state-of-the-art 3DDFA
of [192] and our full reconstruction (Us, denoting FPN + FEN + FaceShapeNet) for three faces in
the IJB-A benchmark. See text for more details.
Figure 5.11: Qualitative 3D reconstruction results. Rendered 3D reconstruction results for IJB-A
images representing a wide range of viewing settings. For each image we provide results obtained
by 3DDFA [192], the FaceShapeNet of [149] (adjusted for viewpoint using FPN, Sec. 5.2.3), and
our full approach (Us, denoting FPN + FEN + FaceShapeNet). These results should be considered
by how well they capture the unique 3D shape of each individual, the viewpoint, and the facial
expression. See text for more details.
Chapter 6
Pose-Variant 3D Facial Attribute Generation
We address the challenging problem of generating facial attributes using a single image in an
unconstrained pose. In contrast to prior works that largely consider generation on 2D near-
frontal images, we propose a GAN-based framework to generate attributes directly on a dense 3D
representation given by UV texture and position maps, resulting in photorealistic, geometrically-
consistent and identity-preserving outputs. Starting from a self-occluded UV texture map obtained
by applying an off-the-shelf 3D reconstruction method, we propose two novel components. First,
a texture completion generative adversarial network (TC-GAN) completes the partial UV texture
map. Second, a 3D attribute generation GAN (3DA-GAN) synthesizes the target attribute while
obtaining an appearance consistent with 3D face geometry and preserving identity. Extensive
experiments on CelebA, LFW and IJB-A show that our method achieves consistently better
attribute generation accuracy than prior methods, a higher degree of qualitative photorealism, and
better preservation of face identity.
6.1 Motivations and Overview
Faces are of unique interest in computer vision, whether for recognition, visualization or
animation, owing to the diversity with which their images are manifested. This is partly due to
the variety of attributes associated with faces and partly due to extrinsic variations like head
pose. Thus, generating photorealistic images of faces that address both of those aspects is a
problem of fundamental interest that also enables downstream applications, such as augmentation
of under-represented classes in face recognition.
In recent years, conditional generative models such as Variational Auto-Encoders (VAE) [87] or
Generative Adversarial Networks (GAN) [51] have achieved impressive results [170, 27, 64, 142].
However, they have largely focused on frontal faces. In contrast, we consider the problem of
generating 3D-consistent attributes on possibly pose-variant faces. As a motivating example,
consider the problem of adding sunglasses to a face image. For a frontal input and with a desired
frontal output, this involves inpainting with sunglass texture limited to the region around the
eyes. For an input face image observed under a largely profile view and a more general task
of generating an identity-preserving and sunglass-augmented face under arbitrary pose, a more
complex transformation is needed since (i) both attribute-related and unrelated regions must be
handled and (ii) the attribute must be consistent with 3D face geometry. Technically, this requires
working with a higher-dimensional output space and generating an image conditioned on both
[Figure 6.1 panels: pipeline Shape Recon. → Render → TC-GAN → 3DA-GAN → Render (UV maps); rows for Bangs, Smile, and Sunglasses comparing the Input with StarGAN, CycleGAN, and Ours.]
Figure 6.1: Facial attribute generation under head pose variations, comparing our method to
StarGAN [27] and CycleGAN [185]. Traditional frameworks generate artifacts due to pose
variations. By introducing a 3D UV representation, the proposed TC-GAN and 3DA-GAN
generate photo-realistic face attributes on pose-variant faces.
head pose and attribute code. In Figure 6.1, we show how our proposed framework achieves these
abilities, surpassing conventional methods such as StarGAN [27] and CycleGAN [185].
A first attempt would be to frontalize the pose-variant face input. Despite good visual quality,
appearance-based face frontalization methods [151, 176, 71, 143] may suffer from lack of identity-
preservation. Geometric modeling methods [60, 35] faithfully inherit visible appearance but need
to guess the invisible appearance due to self-occlusion, leading to extensions like UV-GAN [35].
Further, we note that both texture completion and attribute generation are correlated with 3D shape,
that is, the hallucinated appearance should be within the shape area and the generated attribute
should comply with the shape. This motivates our framework, which utilizes both 3D shape and
texture, distinguishing our work from traditional methods that deal only with appearance and from
UV-GAN, which uses only the texture map.
Figure 6.2: Illustration of image coordinate space and UV space. (a) Input image. (b) 3D dense
point cloud. (c) UV position map U_p transferred from the 3D point cloud. (d) UV texture map U_t,
partially visible due to pose variation (best viewed in color).
Specifically, we propose to disentangle the task into two main stages: (1) We apply an off-the-shelf
3D shape regressor, PRNet [49], with a rendering layer to directly obtain the 3D shape and weak
perspective matrix from a single input, and use this information to render the partial (self-occluded)
texture. (2) A two-step GAN, consisting of a texture completion GAN (TC-GAN) that utilizes
the above 3D shape and partial texture to complete the texture map, and a 3D attribute generation
GAN (3DA-GAN) that generates target attributes on the completed 3D texture representation. In
stage (1), we apply the UV representation [53, 49] for both the 3D point cloud and the texture, termed
U_p and U_t, respectively. The UV representation not only provides dense shape information
but also builds a one-to-one correspondence from point cloud to texture. In stage (2), TC-GAN
and 3DA-GAN use both U_p and U_t as input to inject 3D shape information into both the completed
texture and the generated attribute. Extensive experiments show the effectiveness of our method,
which generates geometrically accurate and photorealistic attributes under large pose variation,
while preserving identity.
Our contributions are summarized as the following:
• We are the first to achieve 3D facial attribute generation under unconstrained head poses, such
as profile pose. Our method works in the pose-invariant 3D UV space, while most prior methods
work in 2D image space.
• We propose a novel two-stage GAN, for UV space texture completion (TC-GAN) and texture
attribute generation (3DA-GAN). The stacked structure effectively solves the pose variation
problem, conducts face frontalization and can generate attributes for different pose angles.
• We propose a two-phase training protocol to guide the network to focus only on the area related
to the attribute, which significantly improves identity-preservation.
• Extensive experiments on several public benchmarks demonstrate consistently better results
in face frontalization, attribute generation accuracy, image visual quality, and close-to-original
identity preservation.
6.2 Method
In this section, we first introduce a dense 3D representation, the UV space, that supports
appearance generation. Then, rendering is conducted to generate the visible appearance from the
original input. Further, a texture completion GAN is presented to obtain a fully visible texture map.
[Figure 6.3 diagram: landmark-based alignment of shapes to the trimmed mean BFM. The ground-truth shape S_g and the trimmed BFM shape S_b are fitted to the detected landmarks L(I) via P_g' = argmin_{P_g} ‖P_g S_g − L(I)‖² and P_b' = argmin_{P_b} ‖P_b S_b − L(I)‖², and the pose-variant face shape in the trimmed BFM domain is obtained via (P', α') = argmin_{P,α} ‖P L(S_b + Qα) − L(S_g)‖².]
Figure 6.3: Aligning a ground-truth shape, or a shape estimated by an existing 3D reconstruction
method, to the trimmed BFM shape. The example image is from the 4DFE dataset, and the landmarks
L(I) can be obtained by any off-the-shelf image-based landmark detector.
In the end, a 3D attribute generation GAN is proposed to work on the 3D UV position and texture
representation, generating target attributes under pose-variant conditions.
6.2.1 3D Shape Alignment
We first trim the original BFM shape to one that focuses on the facial area and consists of 38K
vertices; this serves as the BFM reference shape thereafter. Given an image, we obtain its 3D shape
either from the dataset or by estimation with [49]. Since the number and definition of 3D vertices
differ, the untrimmed shape needs to be aligned to the reference trimmed BFM shape. A diagram
of this alignment is shown in Fig. 6.3. The 4DFE 3D point cloud and the reference BFM are deformed
to match the detected 2D landmarks. We then refine the alignment via a 3D-ICP-like procedure to
obtain the aligned shape.
6.2.2 UV Position and Texture Maps
To faithfully render the visible appearance, we seek a dense 3D reconstruction of shape and texture.
The 3D Morphable Model [11] sets up a parametric representation by decomposing both shape
and texture into linear subspaces. It reduces the space dimension but also drops the high frequency
information which is highly demanded for the rendering and generation tasks. Directly applying
the raw shape and texture is computationally heavy. Following [53, 49], we introduce a sphere UV
space that homographically map to the coordinate space. Note that this reference UV coordinates
is shared by all images, so every pixel corresponds to the same facial point; this is essential to
define the attribute related masks (will be introduced later) and facilitate attribute generation in a
specific facial region under arbitrary head pose variation.
Assume a 3D point cloud S ∈ R^{N×3}, where N is the number of vertices and each vertex
s = (x, y, z) consists of the three-dimensional coordinates in 3D space. The UV coordinates
(u, v) are defined as:

u = (1/π) · arccos( x / √(x² + z²) ),    v = 1 − (1/π) · arccos(y)        (6.1)

[Figure 6.4 panels: reference image (rendered from the trimmed BFM), aligned BFM shape projected to UV space, UV space after Tutte embedding; input image, UV position map, aligned pose-variant BFM, vertex visibility mask, UV texture map.]
Figure 6.4: Given an input image, the conversion from the aligned BFM to the fixed UV coordinates,
and the UV texture map rendering based on the vertex visibility and the input image. ⊙ denotes
element-wise multiplication.
Eq. 6.1 establishes a unique mapping from the dense point cloud to the UV maps. By quantizing
the UV space with different granularity, one can control the density of the UV space relative to the
image resolution. In this work, we quantize the UV maps to 256×256, which preserves about 65K
vertices. As shown in Fig. 6.2, a UV position map U_p is defined on the UV space, where each entry is
the corresponding three-dimensional coordinate (x, y, z). We apply PRNet [49] to estimate the 3D
shape and then exploit Eq. 6.1 to obtain U_p. A UV texture map U_t is also defined on the UV
space, where each entry is the corresponding coordinate's RGB color.
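To make the mapping concrete, the sketch below converts a 3D point cloud into a quantized 256×256 UV position map, following Eq. 6.1 as reconstructed above. It maps each vertex onto unit-sphere directions before applying the formula; the function name and normalization step are our own illustrative choices, not the thesis implementation.

```python
import numpy as np

def uv_position_map(points, size=256):
    """Rasterize an (N, 3) point cloud into a (size, size, 3) UV position map
    using u = arccos(x / sqrt(x^2 + z^2)) / pi, v = 1 - arccos(y) / pi (Eq. 6.1)."""
    pts = np.asarray(points, dtype=np.float64)
    pts = pts / (np.linalg.norm(pts, axis=1, keepdims=True) + 1e-8)  # unit-sphere directions
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]

    u = np.arccos(np.clip(x / (np.sqrt(x**2 + z**2) + 1e-8), -1.0, 1.0)) / np.pi
    v = 1.0 - np.arccos(np.clip(y, -1.0, 1.0)) / np.pi

    # Quantize to pixel indices and scatter the original 3D coordinates.
    ui = np.clip((u * (size - 1)).round().astype(int), 0, size - 1)
    vi = np.clip((v * (size - 1)).round().astype(int), 0, size - 1)
    uv_map = np.zeros((size, size, 3), dtype=np.float32)
    uv_map[vi, ui] = np.asarray(points, dtype=np.float32)
    return uv_map
```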
UV texture map rendering: U_t of a pose-variant face is only partially visible, as shown in Fig. 6.2
(d). The invisible region corresponds to the self-occluded region resulting from pose variation.
In the original coordinate space, we run a z-buffering algorithm [193] to label the visibility of
each 3D vertex: the vertices with the largest depth value are visible, while all others are invisible.
Let M be the visibility matrix, where an entry of 1 means visible and 0 invisible. The rendering is
a look-up operation that associates the color at a specific image coordinate with the corresponding
UV coordinate. We formulate the process in Eq. 6.2 and illustrate it in Fig. 6.4:

U_t(u, v) = I(x, y) ⊙ M(x, y, z)        (6.2)

where (u, v) is determined by Eq. 6.1 and ⊙ denotes element-wise multiplication.
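The look-up rendering of Eq. 6.2 can be sketched as below, assuming per-vertex visibility has already been computed by z-buffering and that each vertex's projected image coordinates are available. The variable names (proj_xy, visible, uv_idx) are illustrative assumptions rather than the thesis code.

```python
import numpy as np

def render_uv_texture(image, proj_xy, visible, uv_idx, size=256):
    """Build a partial UV texture map U_t (Eq. 6.2).

    image   -- (H, W, 3) input image I
    proj_xy -- (N, 2) integer image coordinates of the projected vertices
    visible -- (N,) boolean z-buffer visibility mask M for each vertex
    uv_idx  -- (N, 2) integer (ui, vi) UV-pixel indices from Eq. 6.1
    """
    uv_texture = np.zeros((size, size, 3), dtype=image.dtype)
    vis = np.asarray(visible, dtype=bool)
    x, y = proj_xy[vis, 0], proj_xy[vis, 1]
    ui, vi = uv_idx[vis, 0], uv_idx[vis, 1]
    # Look up the color of each visible vertex and place it at its UV location;
    # self-occluded vertices stay zero, producing the partially visible map.
    uv_texture[vi, ui] = image[y, x]
    return uv_texture
```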
6.2.3 UV Texture Map Completion
The incomplete U_t from the rendering is insufficient for attribute generation. We seek a texture
completion that not only recovers photo-realistic appearance but also preserves identity.
UV-GAN [35] proposes a similar framework that completes the UV texture map with an
adversarial network. However, it considers only the texture information. We argue that, for a 3D
UV representation, completing the appearance should consider both the texture and the shape
information. For example, combining the original and flipped inputs provides a good initialization
for appearance prediction, but this applies only a symmetry prior and is not sufficient to preserve
the shape information. Thus, we take U_p, U_t and the flipped texture Ũ_t as input.
Reconstruction module: To prepare the UV texture ground truth, we start with near-frontal face
images in which all pixels are visible. We then perturb the head pose of each original image
by a random angle. Note that all the pose-variant images share the same frontal ground truth,
namely the original image. By rendering with Eq. 6.2, we obtain the incomplete texture map that
serves as input. Since ground truth is available, we use a supervised reconstruction loss to guide
the completion:

L_r = ‖ G_tc(U_t, Ũ_t, U_p) − U_t^* ‖_1        (6.3)
Here G_tc(·) is the generator, consisting of an encoder and a decoder; U_t is the partial texture map,
Ũ_t the flipped input, and U_t^* the complete ground truth. Relying on reconstruction alone leads to
blurry results, so we introduce adversarial learning to improve the generation quality.
Discriminator module: Given the ground-truth images U_t^* in the positive sample set R and the
generated samples Û_t = G_tc(U_t, Ũ_t, U_p) in the negative sample set F, we train a discriminator
D with the following objective:

L_D = − E_{U_t^* ∈ R} [ log D(U_t^*) ] − E_{Û_t ∈ F} [ log(1 − D(Û_t)) ]        (6.4)
Generator module: Following adversarial training, G_tc aims to fool D, pushing the objective in
the other direction:

L_a = − E_{Û_t ∈ F} [ log D(Û_t) ]        (6.5)
Smoothness term: To remove artifacts, we apply a total variation loss that locally constrains the
smoothness of the output:

L_tv = (1 / |U_t|) Σ | ∇G_tc(U_t, Ũ_t, U_p) |        (6.6)

where ∇G_tc(U_t, Ũ_t, U_p) is the spatial gradient of the output Û_t and |U_t| is the number of entries
of the texture map. To preserve identity, one could introduce a face recognition engine and require
the recognition feature of the generated image to be close to the ground-truth feature. In practice,
we find that the reconstruction constraint of Eq. 6.3 is sufficient to preserve identity: a major part
of the facial area is visible and already largely determines the identity, and the symmetry and
reconstruction constraints preserve it well. Thus, the overall loss for TC-GAN is:
L_TC = λ_r L_r + λ_a L_a + λ_tv L_tv        (6.7)

The weights are set empirically to λ_r = 1, λ_a = 0.1, and λ_tv = 0.05, respectively.
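As a sketch of how the TC-GAN objective of Eq. 6.7 could be assembled, the PyTorch-style snippet below combines the reconstruction, adversarial, and total-variation terms with the weights listed above (1, 0.1, 0.05). The discriminator update of Eq. 6.4 is analogous and omitted; the tensor shapes and the generator/discriminator callables are assumptions for illustration, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def tc_gan_generator_loss(generator, discriminator, U_t, U_t_flip, U_p, U_t_gt,
                          w_r=1.0, w_a=0.1, w_tv=0.05):
    """TC-GAN generator objective L_TC = w_r*L_r + w_a*L_a + w_tv*L_tv (Eqs. 6.3-6.7)."""
    completed = generator(U_t, U_t_flip, U_p)                  # hat{U}_t

    # Eq. 6.3: supervised L1 reconstruction against the complete ground truth.
    loss_r = F.l1_loss(completed, U_t_gt)

    # Eq. 6.5: adversarial loss, pushing the discriminator D toward labeling fakes as real.
    d_fake = discriminator(completed)                          # assumed to output logits
    loss_a = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

    # Eq. 6.6: total variation over the completed texture map (mean absolute gradient).
    tv_h = (completed[:, :, 1:, :] - completed[:, :, :-1, :]).abs().mean()
    tv_w = (completed[:, :, :, 1:] - completed[:, :, :, :-1]).abs().mean()
    loss_tv = tv_h + tv_w

    return w_r * loss_r + w_a * loss_a + w_tv * loss_tv
```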
[Figure 6.5 diagram: the generator takes the completed UV texture and UV position map together with an attribute code; a quality discriminator Q and a target-attribute discriminator A provide the quality and attribute losses, combined with the masked reconstruction, identity, and cycle consistency losses, driven by the ground-truth attribute code p* and the target attribute code p.]
Figure 6.5: The architecture and loss design of 3DA-GAN.
6.2.4 3D Face Attribute Generation
Unlike traditional image-based attribute generation, we adopt the 3D UV representation, i.e.,
U_p and the completed Û_t, as the input. We believe that introducing 3D geometric information
helps synthesize attributes better; for example, with 3D shape information, sunglasses are generated
as a surface that follows the face. We formulate target attribute generation as a conditional GAN
framework, as shown in Fig. 6.5, by injecting an attribute code p into the data flow. We manually
select 5 out of the 40 attributes defined in CelebA [104] that do not indicate the face identity. Thus
p ∈ R^5, where each element stands for one attribute: 1 means the attribute is present and 0 that it
is absent. The attribute code p is passed through two convolutional blocks and then concatenated
to the third block of the encoder of the generator G_a.
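A minimal sketch of how the attribute code could be injected into the generator encoder, as described above. How exactly p is spatially expanded is not specified in the text, so the tiling of p into a feature map, the layer sizes, and the module name AttributeEncoder are all our assumptions.

```python
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Encode a 5-dim attribute code p and fuse it with the third encoder block."""
    def __init__(self, n_attr=5, feat_ch=256):
        super().__init__()
        # Two convolutional blocks applied to the (tiled) attribute code.
        self.attr_convs = nn.Sequential(
            nn.Conv2d(n_attr, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(feat_ch + 64, feat_ch, 1)   # after concatenation

    def forward(self, enc_feat, p):
        # enc_feat: (B, feat_ch, H, W) features from the third encoder block.
        # p: (B, n_attr) binary attribute code, tiled to a spatial map.
        b, _, h, w = enc_feat.shape
        p_map = p.view(b, -1, 1, 1).expand(b, p.shape[1], h, w)
        attr_feat = self.attr_convs(p_map)
        return self.fuse(torch.cat([enc_feat, attr_feat], dim=1))
```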
We investigate the CycleGAN [185] and StarGAN [27] network structures and find that CycleGAN
provides more stable training and better accuracy, as shown in the experiments section. Thus, we
start from the CycleGAN loss design.
Identity loss: In the conditional GAN setting, if the input attribute code p equals the original
ground truth p*, we expect the output to reconstruct the ground-truth texture, giving the identity loss:

L_id = ‖ G_a(Û_t, U_p, p*) − U_t^* ‖_1        (6.8)
Quality Discriminator: We introduce a quality discriminator Q that is in charge of image quality,
leaving the correctness of attribute generation to an independent discriminator. The positive sample
set R_g contains the ground-truth textures U_t^* and the negative sample set F_g contains the generated
UV maps U_g = G_a(Û_t, U_p, p). To update Q, we apply the following loss:

L_Q = − E_{U_t^* ∈ R_g} [ log Q(U_t^*) ] − E_{U_g ∈ F_g} [ log(1 − Q(U_g)) ]        (6.9)
Figure 6.6: Manually defined attribute-related masks based on the reference UV texture map. (a)
Reference U_t (constructed from our generated UV position map and the mean face texture provided
by the Basel Face Model), (b) eyeglasses mask, (c) lipstick and smile mask, (d) 5 o'clock shadow
mask, and (e) bangs mask.
The quality loss from Q is fed back to the generator G_a, giving the adversarial quality loss:

L_Qa = − E_{U_g ∈ F_g} [ log Q(U_g) ]        (6.10)
Cycle Consistency: Following CycleGAN, we simultaneously train an inverse generation
module G_a^{-1} that converts the generated U_g back to the original input Û_t, and expect the
converted-back UV texture to be similar to the original input:

L_cc = ‖ G_a^{-1}( G_a(Û_t, U_p, p) ) − Û_t ‖_1        (6.11)
Besides the CycleGAN losses, we propose two new losses that specifically deal with attribute
generation.
Masked Reconstruction Module: We manually define the non-attribute area, shown in Fig. 6.6,
on the reference UV texture map. The 5 attributes map to several different mask types or
combinations thereof; e.g., lipstick and smile share the mask of Fig. 6.6 (c). Together with
the fully visible mask (all entries equal to 1), we define mask_i, i = 1, 2, ..., 5, covering all the
categories. The reconstruction objective is:

L_ra = ‖ ( G_a(Û_t, U_p, p) − Û_t ) ⊙ mask_i ‖_1        (6.12)

where i is determined by the target attribute code.
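The masked reconstruction term of Eq. 6.12 reduces to an L1 penalty restricted by an attribute-specific mask. A short sketch follows; the mask lookup keyed by attribute index is an assumed data structure built from the regions in Fig. 6.6, not the thesis code.

```python
import torch.nn.functional as F

def masked_reconstruction_loss(generated, completed, masks, attr_idx):
    """Eq. 6.12: L1 difference between output and input, restricted by mask_i.

    masks[attr_idx] -- (1, 1, 256, 256) tensor that is 1 on the area to preserve
                       (the non-attribute region) and 0 where the attribute may change.
    """
    mask = masks[attr_idx].to(generated.device)
    return F.l1_loss(generated * mask, completed * mask)
```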
Target Attribute Discriminator: Separately from Q, we set up an independent discriminator A
that evaluates whether the one-bit target attribute is correctly generated. The positive sample
set R_a consists of ground-truth samples that have the target attribute; the negative sample set F_a
consists of samples generated by G_a. The target attribute discriminator is updated with:

L_A = − E_{U ∈ R_a} [ log A(U) ] − E_{U ∈ F_a} [ log(1 − A(U)) ]        (6.13)

Accordingly, the adversarial loss used to update the generator is:

L_Av = − E_{U ∈ F_a} [ log A(U) ]        (6.14)
As in TC-GAN, we find that the reconstruction loss, rather than a recognition perceptual loss, is
sufficient to preserve identity; the same holds for attribute generation. As shown in Fig. 6.6, the
attribute-related area is only a small portion of the entire facial area, so reconstructing the remaining
large portion already strongly constrains the identity. The overall training is divided into two phases.
Phase one accepts the original attribute code and is expected to output the reconstructed UV texture.
Phase two accepts the target attribute code and generates the image with the target attribute.
L_p1 = λ_id L_id + λ_Qa L_Qa + λ_cc L_cc + λ_ra L_ra        (6.15)

L_p2 = λ_id L_id + λ_Qa L_Qa + λ_cc L_cc + λ_ra L_ra + λ_Av L_Av

The hyper-parameters are set to λ_id = 5, λ_Qa = 1, λ_cc = 10, λ_ra = 5 for phase one, and
λ_id = 5, λ_Qa = 1, λ_cc = 10, λ_ra = 5, λ_Av = 1 for phase two.
6.3 Implementation Details
To prepare the TC-GAN training data, we collect near-frontal images from 4DFE and 300W-LP
(58,848 from 4DFE and 2,735 from 300W-LP) and augment them with uniformly distributed poses,
i.e., from left profile to right profile in steps of 15°. The near-frontal images are converted to the UV
representation and serve as the ground truth. The augmented pose-variant images are converted
to UV position and incomplete texture maps, serving as input. Mixing the two training sets
enhances the model's generalization ability. We use an hourglass [119] structure as the TC-GAN
backbone. We find that, inside this structure, skip links are important for preserving high-frequency
information, especially from the lower layers. We train the network using the Adam optimizer, with
batch size 120 and initial learning rate 1e-4. It converges within 10 epochs. We further fine-tune
it on the CelebA training set with initial learning rate 1e-5 for another 8 epochs.
We similarly prepare the training data for 3DA-GAN, picking 48K near-frontal images
from CelebA for each attribute and converting them to the UV representation. Images without the
target attribute serve as input. Images with the target attribute are positive samples, while generated
UV texture maps are negative samples, for the attribute discriminator. For the quality discriminator,
real UV textures are positive samples and generated ones are negative samples. We randomly select
one bit as the target attribute and leave all others unchanged. The training procedure has two
phases: (1) Reconstruction, with inputs Û_t, U_p and the original attribute code p. (2) Attribute-
perturbed generation, where we set one attribute of p to 1 at a time; the inputs are Û_t, U_p and the
perturbed code p′. The two-phase training pushes the generation to focus on the attribute-related
area while preserving the non-attribute area. We use the Adam optimizer with batch size 16 and
initial learning rate 1e-4. Training converges in around 15 epochs across the different target attributes.
6.4 Experiments
In this section, we evaluate our framework on the tasks of UV texture map completion and 3D
attribute generation. For texture completion training, we generate the UV space representation of
300W-LP [193] and 4DFE [175] to form our training set; the evaluation is conducted on LFW [69]
with both visualizations and the FID score, for a fair comparison to other methods. For attribute
generation, we generate the UV space representation of CelebA [104] and provide rendered
pose-augmented images for both training and testing.
6.4.1 Datasets
300W-LP: This dataset is generated from the 300W [133] face database by 3DDFA [193], which
fits a 3D morphable model and reconstructs the face appearance under varying head poses. It
consists of 122,430 images of 3,837 subjects; for each subject, the images cover uniformly
distributed head poses.
CelebA: It contains about 203K images, each annotated with 40 attributes. Its yaw-angle distribution
is highly long-tailed towards near-frontal poses, which motivates augmenting it for more
pose-variant attribute generation.
4DFE: This is a high-resolution 3D dynamic facial expression database. It contains 606 3D facial
expression sequences captured from 101 subjects, with a total of approximately 60,600 frame
models. Each 3D model of a 3D video sequence has approximately 35,000 vertices, and the
texture video has a resolution of about 1040×1329 pixels per frame.
6.4.2 UV Texture Map Completion
In our framework, we first apply dense 3D shape reconstruction and rendering to obtain a
partially visible UV texture map. Then we apply our TC-GAN to obtain the completed UV texture
map and render it back to an image-level appearance.
Frontalization Visual Comparison: Since our framework provides a way to perform face
frontalization, we visually compare our method with several state-of-the-art frontalization methods
in Fig. 6.7. The traditional geometric method [60] fails to complete the holes caused by self-
occlusion when the head pose is large. DR-GAN [151] works fairly well when the head pose is small,
but when the head pose is close to profile it fails to preserve the face identity, whereas our method
consistently preserves the identity across head poses. Our method also consistently
preserves the skin color, which DR-GAN does not.
Quantitative Comparison: The Fréchet Inception Distance (FID) [65] is reported in Table 6.1
to quantify the photo-realism of generated images relative to the original real images; the closer
to real images, the lower the FID score. In Table 6.1, our method achieves a significantly lower
FID score than the other methods. In addition to photo-realism, we also evaluate the identity-
preserving property of our method by comparing the identity-feature distance to the ground-truth
feature, extracted from an off-line trained face recognition engine, against the above-mentioned
frontalization methods. Specifically, we apply each method to the non-frontal images, defined as
those with yaw ≥ 15°, and replace the original images with the frontalized ones. The state-of-the-art
face recognition engine ArcFace [36] is used to provide the identity features. The verification
accuracies on the LFW dataset are shown in Table 6.2. The accuracy based on TC-GAN drops the
least relative to the original performance, indicating that our method preserves identity better than
the state of the art.
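For reference, the FID used in Tables 6.1 and 6.4 is the Fréchet distance between Gaussian fits of Inception features of the two image sets. A minimal sketch of the distance itself is given below, assuming the (N, d) feature matrices have already been extracted with an Inception network; the feature extraction step is omitted.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feat_real, feat_gen):
    """FID between two sets of Inception features (rows are images)."""
    mu1, mu2 = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma1 = np.cov(feat_real, rowvar=False)
    sigma2 = np.cov(feat_gen, rowvar=False)
    # Matrix square root of sigma1 @ sigma2; discard tiny imaginary parts.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```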
6.4.3 3D Attribute Generation
We manually select 5 out of the 40 attributes defined in CelebA that do not indicate face identity
and correlate only with the facial area: Sunglasses (SG), Wearing Lipstick (LS), 5 o'clock Shadow
(SH), Smiling (SM), and Bangs (BA). We strictly follow the CelebA training, validation and testing
split protocol.
method yaw-15 yaw-30 yaw-45 yaw-60 yaw-75
Hassner et al.[60] 30.85 53.80 174.12 208.79 203.71
DR-GAN [151] 82.39 84.88 90.82 98.68 110.11
Ours 8.06 13.17 20.29 27.39 38.92
Table 6.1: FID score comparison on the LFW dataset. We randomly select one image out of each
verification pair and render it at yaw 15°, 30°, 45°, 60°, and 75°, respectively. FID is calculated
between the frontalized images and the non-selected original images.
method Verification Accuracy
Original 99.27 ± 0.11
Hassner et al. [60] 98.91 ± 0.15
DR-GAN [151] 96.43 ± 0.55
Ours 99.17 ± 0.12
Table 6.2: Verification accuracy comparison on the LFW dataset. We apply our TC-GAN and other
face frontalization methods to LFW images with yaw angle ≥ 15° and replace each original image
with the frontalized one.
Traditional attribute generation methods, e.g., FaderNet [97] and AttGAN [64], are trained on
2D images. For a fair comparison, we apply the StarGAN and CycleGAN network structures, trained
separately on real images, real plus pose-augmented images, and our UV texture and position maps.
For the real data in CelebA, we observe a strong head pose bias towards near-frontal poses. We
compute the pose from the reconstructed 3D shape vertices and use it to split the testing data into
yaw < 45° and yaw ≥ 45°. As the yaw ≥ 45° testing data are very few (e.g., 221 images for lipstick,
whereas yaw < 45° has 9,288 images), we augment the yaw ≥ 45° data from near-frontal images,
obtaining 6,735 augmented samples to match the volume of yaw < 45°.
Attribute Generation Accuracy: We apply an off-line attribute classifier, trained on the CelebA
training set, to evaluate attribute generation performance; its average precision on the CelebA
testing set is 91.7%, close to state-of-the-art performance. The F1 score is reported since precision
and recall may vary with the threshold setting. We apply 3DA-GAN to the negative samples (without
the target attribute) to generate images with the target attribute, which then serve as positive samples.
Further, FID [65] is computed to evaluate the photo-realism of the attribute-augmented images.
In Tables 6.3 and 6.4, we compare to several state-of-the-art methods: FaderNet [97], AttGAN [64],
StarGAN [27] and CycleGAN [185]. The last two are retrained on the original CelebA real data, on
real plus pose-augmented data ("real-a"), and on our UV texture and position data. For "Ours", we
apply our proposed loss instead of the StarGAN or CycleGAN loss. The numbers in Table 6.3 clearly
show that our proposed 3DA-GAN consistently achieves a higher F1 score than the state of the art.
Moreover, our method also achieves consistently lower FID scores in Table 6.4. With the CycleGAN
(ResNet) model, our FID score is close to that of the model trained on "real-a", i.e., a tie on yaw < 45°
and slightly better on yaw ≥ 45°. However, our method achieves a much higher F1 score (precision
and recall), i.e., more than 10% higher on "SM" and "BA", compared to CycleGAN trained on
"real-a", for both yaw < 45° and yaw ≥ 45°.
Test → real (yaw < 45°) real-a (yaw ≥ 45°)
Train ↓ method ↓ SG LS SH SM BA SG LS SH SM BA
FaderNet [97] real 98.97 - - - - 96.72 - - - -
AttGAN [64] real 97.80 - - - 86.86 91.89 - - - 86.15
StarGAN [27]*
real 97.15 84.26 88.75 87.40 89.56 96.38 77.54 82.07 77.11 86.33
real-a 97.35 78.87 89.63 83.40 89.33 98.07 75.43 88.64 79.01 86.77
Ours 98.88 84.70 91.12 87.87 94.86 98.23 82.04 90.06 83.32 93.67
CycleGAN [185]*
real 97.66 84.41 84.49 86.33 70.96 90.49 74.45 79.21 76.48 69.01
real-a 98.93 91.34 85.17 84.25 82.43 97.31 69.27 84.98 75.51 80.70
Ours 99.37 94.69 91.80 94.56 93.35 99.10 93.04 90.90 91.49 91.64
Table 6.3: Quantitative comparison on attribute generation by F1 score on the CelebA testing set. The
target generated attribute is evaluated by an off-line attribute classifier for F1 score (precision and
recall); higher is better. "real" means the original CelebA training set; "real-a" means original
plus pose-augmented images; "Ours" means training with our proposed loss and UV texture data.
*: we apply the network structure and re-train the models. SG: Sunglasses, LS: Wearing Lipstick, SH:
5 o'clock Shadow, SM: Smiling, BA: Bangs.
Identity Preserving Property: We apply a state-of-the-art face recognition engine, ArcFace [36],
to provide the identity features. For each verification pair, we randomly select one image
without the target attribute, apply our method to generate the target attribute, and evaluate the
similarity between the generated image and the non-selected image. We run the experiments
independently for the 5 attributes. In Table 6.5, "Original" denotes the original verification
accuracy without any attribute generation, which serves as an upper bound for all methods. Compared
to the other methods, our 3DA-GAN achieves higher verification accuracy in almost all cases, and is
only slightly worse on lipstick. Our method achieves 89.60% average accuracy, close to the 91.38%
upper bound, indicating that the proposed attribute generation largely preserves identity information.
Visualization: We show pose-variant face attribute generation examples in Figs. 6.8, 6.9, 6.10, 6.11, 6.12, and 6.13,
and compare to StarGAN and CycleGAN. The 2D image-based methods suffer from the pose
variation; e.g., for both StarGAN and CycleGAN on Sunglasses, the left-eye region is not correctly
generated. For smile, StarGAN fails to generate the attribute while CycleGAN shows unpleasant
artifacts in the mouth area. In contrast, our method shows not only correct attribute generation
but also pleasing visual quality. It is worth noting that "lipstick" and "shadow" are in practice
related to gender or identity: for lipstick, the dataset is naturally biased towards females, and
for shadow, the training images are quite similar to those of the "beard" attribute, which leads
to a similar appearance in the generated results.
As further shown in Figures 6.14, 6.15, 6.16, 6.17, 6.18, and 6.19, given an unconstrained face image,
our method can generate the target attribute under varying head poses. This offers strong potential
for high-quality face editing of multiple attributes and can serve as face augmentation for face
recognition along both the head pose and attribute axes.
Test → real (yaw < 45°) real-a (yaw ≥ 45°)
Train ↓ method ↓ SG LS SH SM BA SG LS SH SM BA
FaderNet [97] real 52.1 - - - - 79.9 - - - -
AttGAN [64] real 87.6 - - - 135.5 99.0 - - - 172.6
StarGAN [27]*
real 85.68 78.86 96.97 92.28 82.28 139.77 135.93 172.84 150.58 144.02
real-a 72.73 68.91 42.36 58.92 59.53 114.02 85.14 89.02 82.82 105.34
Ours 38.22 34.05 26.19 33.02 21.79 36.31 35.43 30.05 30.58 19.39
CycleGAN [185]*
real 30.10 25.06 28.73 32.32 28.69 40.88 49.21 42.56 43.31 36.78
real-a 33.89 12.46 6.57 12.74 9.05 19.83 31.04 8.81 17.06 11.54
Ours 18.54 12.56 7.47 13.03 10.28 29.65 10.92 6.81 10.97 8.94
Table 6.4: Quantitative comparison on attribute generation by FID score [65] on the CelebA testing
set. Visual quality is indicated by the FID score between the target-attribute generated images and
ground-truth images with the same attribute; lower is better. "real" means the original CelebA
training set; "real-a" means original plus pose-augmented images; "Ours" means training with our
proposed loss and UV texture data. *: we apply the network structure and re-train the models. SG:
Sunglasses, LS: Wearing Lipstick, SH: 5 o'clock Shadow, SM: Smiling, BA: Bangs.
Method SG LS SH SM BA Avg.
Original - - - - - 91.38
FaderNet [97] 79.05 - - - - 79.05
AttGAN [64] 87.94 - - - 82.20 85.07
StarGAN [27]* 75.28 78.11 81.11 78.80 81.31 79.03
CycleGAN [185]* 89.79 88.09 88.48 90.00 89.20 89.11
Ours 90.40 87.11 89.76 90.68 90.06 89.60
Table 6.5: Identity-preserving evaluation on the IJB-A dataset under the verification protocol, reporting
TAR@FAR=0.01. *: models we retrain on our training data. SG: Sunglasses, LS: Wearing Lipstick,
SH: 5 o'clock Shadow, SM: Smiling, BA: Bangs.
6.4.4 Ablation Study
We investigate the contribution of each component proposed in our framework. In Tables 6.6 and 6.7,
we start with the default CycleGAN loss, i.e., without our proposed masked reconstruction
loss (Eq. 6.12) and attribute adversarial loss (Eq. 6.14). The effects of the CycleGAN loss
components, i.e., the generative adversarial loss (a.k.a. quality adversarial loss), identity loss and
cycle consistency loss, are clearly discussed in [185], so we focus on the two newly
proposed losses, Eq. 6.12 and Eq. 6.14. Overall, removing either or both of the two new components
degrades both the F1 and FID scores to a certain degree. Removing the attribute adversarial loss is
more critical, as accuracy drops significantly more than when removing the masked reconstruction loss.
More interestingly, we visualize the generated images of the ablative models to further illustrate
the effect of the proposed losses. Figures 6.20, 6.21, and 6.22 show that for "w/o Eq. 12" (the masked
reconstruction loss), some generations fail and some introduce artifacts. For "w/o Eq. 14" (the
attribute adversarial loss), the
Test → F1 score (higher is better)
real (yaw < 45°) real-a (yaw ≥ 45°)
Model Loss SG LS SH SM BA SG LS SH SM BA
CycleGAN
w/o Eq. 12,14 97.97 87.92 85.05 84.62 83.65 97.93 86.21 84.40 81.11 82.21
w/o Eq. 12 99.28 92.95 90.10 93.17 94.86 98.87 90.79 89.15 89.50 93.82
(ResNet) w/o Eq. 14 97.82 83.28 82.25 81.81 86.56 97.54 82.35 82.58 78.43 85.86
Full 99.37 94.69 91.80 94.56 93.35 99.10 93.04 90.90 91.49 91.64
Table 6.6: Ablation study without the masked reconstruction loss (Eq. 12) and/or without the attribute
loss (Eq. 14). F1 scores are reported. We use the CycleGAN ResNet structure as it achieves the best
result across the experiments. SG: Sunglasses, LS: Wearing Lipstick, SH: 5 o'clock Shadow, SM:
Smiling, BA: Bangs.
Test → FID score (lower is better)
real (yaw < 45°) real-a (yaw ≥ 45°)
Model Loss SG LS SH SM BA SG LS SH SM BA
CycleGAN
w/o Eq. 12,14 20.2 10.6 13.9 7.8 14.1 43.8 20.4 20.4 27.2 18.3
w/o Eq. 12 17.6 17.5 7.0 13.8 11.9 26.6 18.1 11.4 15.0 11.4
(ResNet) w/o Eq. 14 29.1 19.0 7.6 18.1 10.5 39.3 18.4 11.3 17.7 10.4
Full 18.5 12.6 7.5 13.0 10.3 29.7 10.9 6.8 11.0 8.9
Table 6.7: Ablation study without the masked reconstruction loss (Eq. 12) and/or without the attribute
loss (Eq. 14). FID scores are reported. We use the CycleGAN ResNet structure as it achieves the best
result across the experiments. SG: Sunglasses, LS: Wearing Lipstick, SH: 5 o'clock Shadow, SM:
Smiling, BA: Bangs.
generation mostly fails. For "w/o Eq. 11" (the cycle consistency loss), there are more artifacts
than in the full results. For "w/o Eq. 8" (the identity loss), a certain level of artifact also appears
compared to the full result.
6.5 Network Architectures
The network architectures of StarGAN and CycleGAN used in our experiments are shown in
Tables 6.8, 6.9, 6.10, 6.11, and 6.12. We use instance normalization for the generator network in all
layers except the output layer. For the quality and attribute discriminator networks, we use Leaky
ReLU with a negative slope of 0.01 in StarGAN and 0.02 in CycleGAN. The annotations in the
tables are defined as follows: C: number of output channels, K: kernel size, S: stride, P: padding,
IN: instance normalization, A_d: number of attributes to be generated, and h and w: the height and
width of the input image.
6.6 Conclusion
We propose a two-stage Texture Completion GAN (TC-GAN) and 3D Attribute GAN (3DA-GAN),
to tackle the pose-variant facial attribute generation problem. The TC-GAN inpaints the missing
Type Layer
Downsampling Conv-(C64, K7x7, S1, P3), IN, ReLU
Downsampling Conv-(C128, K4x4, S2, P1), IN, ReLU
Downsampling Conv-(C256, K4x4, S2, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Upsampling Deconv-(C128, K4x4, S2, P1), IN, ReLU
Upsampling Deconv-(C64, K4x4, S2, P1), IN, ReLU
Upsampling Deconv-(C3, K7x7, S1, P3), Tanh
Table 6.8: StarGAN Generator network architecture
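As an illustration of how Table 6.8 could be realized in code, a PyTorch sketch of the generator is given below. It follows the table row by row; note that the table lists a single convolution per residual block, whereas standard StarGAN residual blocks use two, so this should be read as a simplified interpretation under that assumption rather than the exact implementation.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block as listed in Table 6.8: Conv-(C256, K3x3, S1, P1), IN, ReLU."""
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),
            nn.InstanceNorm2d(ch, affine=True),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)

def stargan_generator(in_ch=3):
    """Generator of Table 6.8: 3 downsampling layers, 6 residual blocks, 3 upsampling layers."""
    def conv(i, o, k, s, p):
        return [nn.Conv2d(i, o, k, stride=s, padding=p),
                nn.InstanceNorm2d(o, affine=True), nn.ReLU(inplace=True)]

    def deconv(i, o, k, s, p):
        return [nn.ConvTranspose2d(i, o, k, stride=s, padding=p),
                nn.InstanceNorm2d(o, affine=True), nn.ReLU(inplace=True)]

    layers = []
    layers += conv(in_ch, 64, 7, 1, 3)
    layers += conv(64, 128, 4, 2, 1)
    layers += conv(128, 256, 4, 2, 1)
    layers += [ResBlock(256) for _ in range(6)]
    layers += deconv(256, 128, 4, 2, 1)
    layers += deconv(128, 64, 4, 2, 1)
    layers += [nn.ConvTranspose2d(64, 3, 7, stride=1, padding=3), nn.Tanh()]
    return nn.Sequential(*layers)
```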
Type Layer
Input Conv-(C64, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C128, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C256, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C512, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C1024, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C2048, K4x4, S2, P1), Leaky ReLU
Output Conv-(C1, K3x3, S1, P1)
Output Conv-(C A_d, K (h/64)×(w/64), S1, P0)
Table 6.9: StarGAN Quality and Attribute discriminator network architecture
appearance caused by self-occlusion and provides a normalized UV texture. Our 3DA-GAN works
in the UV texture space to generate target attributes while maximally preserving subject identity.
Extensive experiments show that our method achieves consistently better attribute generation
accuracy, visual quality closer to the original images, and higher identity-preserving verification
accuracy, compared to several state-of-the-art attribute generation methods. The high generation
quality also offers potential for face editing and face image augmentation along the pose and
attribute axes.
Type Layer
Input ReflectionPad2d(3)
Input Conv-(C64, K7x7, S1, P0), IN, ReLU
Downsampling Conv-(C128, K3x3, S2, P1), IN, ReLU
Downsampling Conv-(C256, K3x3, S2, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Upsampling Deconv-(C128, K3x3, S2, P1), IN, ReLU
Upsampling Deconv-(C64, K3x3, S2, P1), IN, ReLU
Upsampling ReflectionPad2d(3)
Upsampling Deconv-(C3, K7x7, S1, P0), Tanh
Table 6.10: CycleGAN Generator network architecture
Type Layer
Input Conv-(C64, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C128, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C256, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C512, K4x4, S1, P1), Leaky ReLU
Output Conv-(C1, K4x4, S1, P1)
Table 6.11: CycleGAN quality discriminator network architecture
Type Layer
Input Conv-(C64, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C128, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C256, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C512, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C1024, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C2048, K4x4, S2, P1), Leaky ReLU
Output Conv-(C A_d, K (h/64)×(w/64), S1, P0)
Table 6.12: The Attribute discriminator network architecture we used with CycleGAN
Figure 6.7: Visualization of TC-GAN and other face frontalization methods (Hassner et al. '15,
DR-GAN) on LFW [69]. A near-frontal image is randomly selected from LFW and shown as
"Ground truth". We render the ground truth under multiple head poses as input, with a black background.
[Figure panels: columns show Sunglasses, Lipstick, Shadow, Smile, and Bangs generated from the Input; rows show StarGAN, CycleGAN, and Ours.]
Figure 6.8: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data.
Figure 6.9: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data.
Figure 6.10: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data.
Figure 6.11: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data.
Figure 6.12: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data.
Figure 6.13: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [27] and
CycleGAN [185] trained on our prepared data.
[Figure panels: for the Input at its original yaw, results rendered at Yaw 45° and Yaw 60° for Sunglasses, Lipstick, Shadow, Smile, and Bangs.]
Figure 6.14: Visual results of applying our method to augment face images from CelebA [104]
testing set, in attributes and yaw angles.
Figure 6.15: Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles.
Figure 6.16: Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles.
Figure 6.17: Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles.
Figure 6.18: Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles.
Figure 6.19: Visual results of applying our method to augment face images from CelebA [104]
dataset, in attributes and yaw angles.
Figure 6.20: The effect of masked reconstruction loss on sunglasses, smile, and lipstick generation.
From left to right: input images from CelebA dataset, using full losses, without masked recon-
struction loss (Eq. 12). The masked reconstruction loss helps generate attributes in a specific
region while preserving the non-attribute parts.
Figure 6.21: The effect of adversarial attribute loss on smile and bangs generation. From left to
right: input images from CelebA dataset, using full losses, without adversarial attribute loss (Eq.
14). The adversarial attribute loss helps enhance the intensity of the generated attributes.
Figure 6.22: The effect of cycle consistent loss and identity loss on sunglasses generation. From
left to right: input images from CelebA dataset, using full losses, without cycle consistent loss (Eq.
11), and without identity loss (Eq. 8). The cycle consistency loss and identity loss help preserve
the non-attribute regions. The identity loss also makes the generated attribute regions look more natural.
Chapter 7
Multi-Source Fusion for Face Matching
In Section 1.3, we mention that the problem of multi-source fusion usually occurs in the matching
stage of a face processing system. Given an image, we may have multiple feature representations
resulting from different alignments and views; given an image set in an image set classification task,
we further have multiple media to leverage. Section 2 introduces two paradigms to address
this problem: (1) "set fusion followed by set matching" and (2) "set matching followed by set
score fusion" [113, 115]. A simple yet effective method in (1) is media pooling [30, 149],
a kind of weighted average pooling that makes the pooled features more robust and
discriminative. A representative method in (2) is the ensemble SoftMax method [113, 115],
which is an ensemble of different fusion strategies. In this chapter, we focus on paradigm
(2) and apply it to set-based face recognition. In the following, the word "template" is used
interchangeably with "set", following common usage in the literature.
7.1 Introduction
With the growth of video data and camera networks, image set classification has recently become a
common multi-source fusion problem in computer vision and pattern recognition. One
representative application is set-based face recognition, where a single training or testing "unit" is
a set of face images or a video rather than a single image. Given multiple images describing different
aspects of a person of interest, the face recognition accuracy can potentially be improved;
meanwhile, the large intra-class variation problem also becomes more serious. It is thus critical to
enhance the discriminative power of the image set-to-set similarity in order to aggregate the multiple
sources of the two considered image sets more effectively.
To obtain more discriminative similarity among templates, a commonly used technique is
distance or similarity metric learning, with either contrastive embedding [186, 54, 73, 156, 106,
55, 52] or triplet embedding [107, 137, 153] as illustrated in Figure 1.8. Triplet embedding, which
simultaneously considers a positive and negative pair of samples w.r.t. an anchor sample, has been
demonstrated to outperform contrastive embedding [137, 153, 138]. Previous methods, however,
either work on the image level or generate a single representation for each template by average
pooling (i.e., paradigm (1)) before constructing triplets. The structure within templates, which
is likely to convey significant discriminative information, is thus ignored. Besides, due to the
large intra-class variations, it is hard to enforce all the image samples in a template (of the same
class) to be close in the embedding space by a global embedding (shared for all samples) or by the
class-specific embedding [107] (shared for samples of the same class).
Figure 7.1: Global and context-aware embedding. A global embedding is generally learned in
previous work. We propose integrating the context features to achieve a sample-specific embedding,
where f(·) consists of three factorized matrices, as described later in this chapter.
In this chapter, we first introduce a template-triplet-based embedding approach to optimize the
template-to-template similarity computed by the ensemble SoftMax function [113, 115], which fuses
image-to-image scores at multiple scales (i.e., paradigm (2)) and has been shown to be effective
in [113, 115] as well as in our experiments. Different from image-triplet embedding [137, 138], our
triplets can be created not only on entire templates, but also on sub-templates of samples of any
reasonable size. Note that a template triplet becomes an image triplet when the sub-template size
equals 1; the proposed approach therefore generalizes triplet embedding to sub-templates of a
predefined size. To the best of our knowledge, we are the first to apply triplet-based metric learning
to the "set matching followed by set score fusion" paradigm for image set classification. An
illustration of this approach is presented in Figure 1.8.
Second, inspired by the scene-specific image captioning task [78, 110], we propose a "context-
aware" metric learning approach that integrates image-specific context into metric learning so as
to achieve a distinct embedding for every image, as illustrated in Figure 7.1. In the face recognition
task, the image-specific context can be defined as the capturing conditions and the measurable
attributes of a face (e.g., pose and expression). Both may be provided by dataset metadata or by
existing face attribute detectors; the metadata, however, are usually inaccessible, and while
attributes could constitute useful context, state-of-the-art attribute detectors appear too noisy, which
hinders performance improvements. Hence, we exploit the image features themselves as context.
Although these features are also used to generate matching scores, the ways in which they are used
for context (via matrix factorization) and matching (with ensemble SoftMax similarity) are quite
different; of course, all features, including attributes, derive from the image and are therefore
correlated to some extent. Furthermore, a mechanism to alleviate the influence of noisy context is
also introduced. Experiments on four benchmark datasets for face template classification
[84, 160, 88, 22] demonstrate the efficacy of the proposed approach.
7.2 Proposed Approach
In this section, we introduce a new triplet creation approach for similarity metric learning by
creating the triplets on the templates. We also propose to integrate the context features to achieve
a specific embedding for every image.
First, the similarity between templates is defined in Section 7.2.1. Then, we describe in
Section 7.2.2 and Section 7.2.3 how to learn discriminative embedding via the template triplet loss
and the context-aware metric learning respectively.
7.2.1 Template-to-Template Similarity
Given two input images with features x_i and x_j, a common similarity measure is the inner product,
defined as $s_{\mathrm{inn}}(x_i, x_j) = x_i^\top x_j$.
In template-based recognition tasks, it is often the case that two given templates, rather than two
images, are compared, since a subject or visual class usually contains more than one image, and the images
are grouped into templates (as in the Janus benchmark [88]). Therefore, how to measure the
similarity between two templates becomes very important.

Witnessing the success of the ensemble SoftMax fusion [113, 115] for summarizing the image-
level matching scores between two templates in face recognition, we exploit it as the similarity
measure in this work. Given two templates D_1 = {x_1, ..., x_M} and D_2 = {z_1, ..., z_N}, their
ensemble SoftMax similarity is defined as:

$$s(D_1, D_2) = \sum_{\alpha} s_{\alpha}(D_1, D_2), \qquad (7.1)$$

where

$$s_{\alpha}(D_1, D_2) = \frac{\sum_{x_i \in D_1} \sum_{z_j \in D_2} \exp\big(\alpha\, s_{\mathrm{inn}}(x_i, z_j)\big)\, s_{\mathrm{inn}}(x_i, z_j)}{\sum_{x_r \in D_1} \sum_{z_q \in D_2} \exp\big(\alpha\, s_{\mathrm{inn}}(x_r, z_q)\big)}. \qquad (7.2)$$
This similarity can be regarded as an ensemble of multiple fusion schemes, including min
fusion (α → −∞), average fusion (α → 0), max fusion (α → ∞), and weighted average fusion
(0 < α < ∞), of a set of image-to-image similarity scores (in this work, following [113, 115], we use
21 values of α in {0, 1, ..., 20} to combine the advantages of multiple fusion schemes).

Thus, the ensemble SoftMax similarity can not only handle varying template sizes (the number
of samples in a template) and arbitrary template distributions, but is also robust to noise via the
“soft” average. In the experiments (cf. Table 7.1), we compare it to other common types of
template-based similarities to demonstrate its effectiveness.
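To make eqs. (7.1)-(7.2) concrete, the following minimal NumPy sketch computes the ensemble SoftMax similarity between two templates stored as feature matrices. The function name, the default set of α values, and the assumption that the features are already ℓ2-normalized (so that the inner product is the image-to-image score) are illustrative choices, not part of the original implementation.

import numpy as np

def ensemble_softmax_similarity(D1, D2, alphas=range(0, 21)):
    """Ensemble SoftMax similarity s(D1, D2) of eqs. (7.1)-(7.2).
    D1: (M, d) and D2: (N, d) arrays of per-image features, assumed L2-normalized."""
    S = D1 @ D2.T                          # pairwise inner products s_inn(x_i, z_j)
    total = 0.0
    for a in alphas:                       # each alpha corresponds to one fusion scheme
        w = np.exp(a * (S - S.max()))      # shift by max(S) for numerical stability
        total += (w * S).sum() / w.sum()   # softmax-weighted average of the scores
    return total

With α = 0 the weights are uniform (average fusion), while a large α concentrates the weight on the best-matching image pair (approaching max fusion), which matches the interpretation given above.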
7.2.2 Ensemble SoftMax Similarity Embedding via Template Triplets
Inspired by the effectiveness of triplet-based metric learning on recognition tasks [137, 138],
we propose a new type of triplet for similarity metric learning, called template triplets, based on
the ensemble SoftMax similarity. A diagram of our approach is shown in Figure 1.8, where the
triplets are generated among templates instead of among image samples. The anchor template
and the positive template share the same class label, and the negative template is of a different
class label.
Our aim is to learn a discriminative embedding W, where the ensemble SoftMax similarity
of the anchor template A = {a_1, ..., a_M} to the positive template P = {p_1, ..., p_Q} is larger than
the one to the negative template N = {n_1, ..., n_R}, as described by:

$$s_W(A, P) > s_W(A, N) \qquad (7.3)$$
where

$$s_W(A, X) = \frac{\sum_{a_i \in A} \sum_{x_j \in X} \exp\big(\alpha\, (W a_i)^\top (W x_j)\big)\, (W a_i)^\top (W x_j)}{\sum_{a_r \in A} \sum_{x_q \in X} \exp\big(\alpha\, (W a_r)^\top (W x_q)\big)} \qquad (7.4)$$

and X ∈ {P, N}, x ∈ {p, n}.
Given a set of labeled templates, the optimal embedding Ŵ can be solved by the following
optimization problem:

$$\hat{W} = \arg\min_{W} \sum_{(A, P, N) \in \mathcal{T}_A} \max\big(s_W(A, N) - s_W(A, P) + \gamma,\; 0\big) \qquad (7.5)$$

Here, γ is the margin of the hinge loss, and $\mathcal{T}_A$ represents all possible template triplets (A, P, N).
We apply stochastic gradient descent (SGD) with an early-stopping technique for optimization.
In each iteration of SGD, a template triplet is generated for updating W w.r.t. the hinge loss in
eq. (7.5) (please refer to the supplementary material for derivations of the gradients). Our template-
triplet-based metric learning approach is summarized in Algorithm 1, where an epoch means a
pass through all the labeled templates, and T determines how many times we repeat this
process. Besides, the two subscripts of W correspond to the current epoch and template indices.
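As a sketch of one stochastic update (step 5 of Algorithm 1 below), the gradient of the hinge loss in eq. (7.5) with respect to W can also be obtained by automatic differentiation instead of the closed-form derivation referred to above. The PyTorch code below is a toy illustration under assumed shapes and a single fixed α; none of the variable names or hyper-parameter values come from the original implementation.

import torch

def s_W(W, A, X, alpha=1.0):
    """Ensemble SoftMax similarity of eq. (7.4) in the embedded space, for one alpha."""
    S = (A @ W.T) @ (X @ W.T).T                  # pairwise (W a_i)^T (W x_j)
    w = torch.softmax(alpha * S.flatten(), dim=0)
    return (w * S.flatten()).sum()

d, D, margin, lr = 128, 256, 0.2, 0.01           # toy sizes and hyper-parameters
W = torch.eye(d, D, requires_grad=True)          # current embedding W_{t, l-1}
A, P, N = torch.randn(5, D), torch.randn(4, D), torch.randn(6, D)   # toy anchor/positive/negative

loss = torch.clamp(s_W(W, A, N) - s_W(W, A, P) + margin, min=0.0)   # hinge loss of eq. (7.5)
loss.backward()                                  # stochastic gradient H w.r.t. W
with torch.no_grad():
    W -= lr * W.grad                             # W_{t, l} <- W_{t, l-1} - (step size) * H
    W.grad.zero_()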
Template Triplet Creation: It is infeasible to consider the exponentially large set of triplets
in (7.5). Therefore, we attempt to find the most violating (A, P, N) w.r.t. (7.5) and (7.3), so that
the total number of triplets is linear in the number of labeled templates. The creation of a
template triplet is described step by step:
1. An anchor template A is randomly selected first. (Actually, we go through all the labeled
templates in the training process in a random order, as can be seen in Algorithm 1.)
2. We then pick the positive template P that is most dissimilar to the currently considered anchor template
A, as described below:

$$P = \arg\min_{\hat{P}} s_W(A, \hat{P}) \qquad (7.6)$$
Algorithm 1: Ensemble SoftMax Similarity Embedding based on Template Triplets

Input: J labeled templates {(L_j = {x_1, ..., x_M}, x_i ∈ R^D, y_j ∈ {1, ..., C})}_{j=1}^{J} of C classes, the
dimension d ≤ D of the output embedding, the number of epochs T, the margin γ, the step size η,
and the subtemplate size m.
Output: Learned similarity metric or embedding Ŵ_{T,J} ∈ R^{d×D}.
Initialize: W_{0,J} ← I_D for d = D, where I denotes an identity matrix. Otherwise, derive W_{0,J} such
that W_{0,J}^T W_{0,J} is a low-rank approximation of I_D.

for t ← 1, 2, ..., T do
  a. W_{t,0} ← W_{t−1,J}.
  b. Order the templates {L_j}_{j=1}^{J} by a random permutation π: j → π(j).
  for ℓ ← π(1), π(2), ..., π(J) do
    1. Let ℓ be the anchor template A in eq. (7.5).
    2. Pick a template l̃, with l̃ ≠ ℓ and y(l̃) = y(ℓ), as the positive template P via eq. (7.6).
    3. Pick a template l̃, with y(l̃) ≠ y(ℓ), as the negative template N via eq. (7.7).
    4. If m ≠ ∞, subsample A, P, N to be subtemplates of size at most m.
    5. Compute the stochastic gradient H w.r.t. W, according to eq. (7.5), on the chosen (A, P, N).
       W_{t,ℓ} ← W_{t,ℓ−1} − η H.
  c. Evaluate W_{t,J} on the held-out validation set. If the performance does not improve for K
  epochs, we stop the iteration on t and return Ŵ_{T,J} = W_{t,J}. In our experiments, we set K = 5.
3. Finally, the negative template closest to A is picked:

$$N = \arg\max_{\hat{N}} s_W(A, \hat{N}) \qquad (7.7)$$
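A minimal sketch of this hard-positive/hard-negative mining (eqs. (7.6) and (7.7)) is given below, assuming the templates are stored as a list of NumPy feature matrices with one class label per template and that every class has at least two templates; the function and variable names are ours, not from the original code.

import numpy as np

def ensemble_softmax(W, T1, T2, alphas=range(0, 21)):
    """Ensemble SoftMax similarity s_W(T1, T2) under the current embedding W."""
    S = (T1 @ W.T) @ (T2 @ W.T).T          # pairwise similarities in the embedded space
    out = 0.0
    for a in alphas:
        w = np.exp(a * (S - S.max()))
        out += (w * S).sum() / w.sum()
    return out

def make_template_triplet(anchor_idx, templates, labels, W):
    """Most violating triplet for one anchor: least similar positive, most similar negative."""
    A, yA = templates[anchor_idx], labels[anchor_idx]
    pos = [j for j, y in enumerate(labels) if y == yA and j != anchor_idx]
    neg = [j for j, y in enumerate(labels) if y != yA]
    P = templates[min(pos, key=lambda j: ensemble_softmax(W, A, templates[j]))]  # eq. (7.6)
    N = templates[max(neg, key=lambda j: ensemble_softmax(W, A, templates[j]))]  # eq. (7.7)
    return A, P, N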
7.2.3 On Incorporation of Context for Sample-Specific Embedding
Learning a good embedding for template-to-template matching helps enlarge the inter-class
variations (better discrimination ability); the intra-class variations, however, may still hardly be
narrowed, since in previous work only a single embedding is generally learned and shared by all the image samples.
To better address intra-class variations, we propose to learn a specific embedding
for each image by factorizing the metric to be learned with the image's own context, which in this
work is the same as its image features.
Given the context c_i of the i-th image, to inject c_i ∈ R^{d_c} and thus adapt the template-based
metric learning to be image-specific, we factorize the embedding W_i as follows:

$$W_i = F\,\mathrm{diag}(L c_i)\, G = \sum_{k=1}^{d_c} (L c_i)_k\, F_{:,k}\, G_{k,:} \qquad (7.8)$$

where diag(·) denotes the diagonal matrix, (L c_i)_k is the k-th element of the vector, F_{:,k} as well
as G_{k,:} indicate the k-th column and row of the two (suitably sized) matrices F and G, shared for
all images, and L is another matrix that linearly transforms the context vector.
Because of the potential human annotation error or the ambiguity in how the context is defined,
the context information may be noisy and would be harmful for context-aware metric learning. To
alleviate this problem, we concatenate c_i with a value 1 as a “pseudo” non-context feature for all
images:

$$W_i = F\,\mathrm{diag}\big(L\,[c_i^\top, 1]^\top\big)\, G \qquad (7.9)$$
Note that when c_i is a zero vector for all images, or when the first d_c columns of L are all 0,
W_i degenerates to a globally shared W. With the above decomposition in eq. (7.9), the same
algorithm as Algorithm 1 can be applied. The difference is that we now learn the three matrices
F, L, and G rather than W.
Initialization of F, L, G: It is critical to initialize F, L, and G. If we start from W = I_D (D
is the input feature dimension), F and G are both initialized as I_D, and L ∈ R^{D×(d_c+1)} is set to
all zeros except for the last column, which is set to all 1's.

On the other hand, if we have Ŵ learned from Algorithm 1, then F, L, and G can be
initialized by the singular value decomposition of Ŵ. Given Ŵ = U Σ V^T, F and G can be set
as U and V^T respectively, and the singular values of Σ are placed in the last column of L, with all
other entries set to 0.
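The following NumPy sketch illustrates the factorization in eq. (7.9) and the SVD-based initialization described above; the function names, the toy dimensions, and the sanity check at the end are ours. With a zero context vector, the sample-specific embedding reduces to the global one, as noted above.

import numpy as np

def context_embedding(F, L, G, c):
    """Sample-specific embedding W_i = F diag(L [c; 1]) G of eq. (7.9)."""
    c1 = np.append(c, 1.0)                  # append the "pseudo" non-context feature
    return F @ np.diag(L @ c1) @ G

def init_from_global_W(W_hat, d_c):
    """Initialize F, L, G from a learned global embedding via its SVD."""
    U, s, Vt = np.linalg.svd(W_hat, full_matrices=False)
    F, G = U, Vt
    L = np.zeros((len(s), d_c + 1))
    L[:, -1] = s                            # singular values go in the last column of L
    return F, L, G

# sanity check: a zero context vector recovers the global embedding
D, d_c = 6, 3
W_hat = np.random.randn(D, D)
F, L, G = init_from_global_W(W_hat, d_c)
assert np.allclose(context_embedding(F, L, G, np.zeros(d_c)), W_hat)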
7.3 Experimental Results
This section presents the experiments and results of our proposed methods, (1) Template-Triplet-
based Ensemble SoftMax Similarity Embedding (denoted as TT-ESSE in the following) and (2)
Context-Aware sample-specific embedding (denoted as TT-ESSE + context in the following), on
four image set classification datasets. We describe in turn the adopted datasets, the features and
context, and the experimental protocols and commonly used evaluation measures, along with the results.
7.3.1 Datasets
To evaluate the performance of the proposed method, we conduct experiments on four pub-
licly available datasets, including YouTube Celebrity (YTC) [84], YouTube Face (YTF) [160],
and IARPA Janus Benchmark A (IJB-A) [88] for face recognition, and UCSD Traffic dataset
(Traffic) [22] for scene classification.
YouTube Celebrity (YTC) [84]: This dataset contains 1,910 video clips of 47 subjects col-
lected from YouTube. Each subject has about 40 templates on average, and each template contains
about 170 images/frames.
YouTube Face (YTF) [160]: 3,425 videos of 1,595 different people were downloaded from
YouTube. An average of 2 videos or templates are available for each subject, and the average
length of a video clip is 181 frames. We temporally downsampled every video by roughly a factor of
ten because of the large redundancies.

IARPA Janus Benchmark A (IJB-A) [88]: IJB-A contains a total of 5,712 images and 2,085
videos of 500 subjects. Each subject has, on average, 11 images and 4 videos. A template can
be a mixture of images and video frames.
Figure 7.2: The distribution of the number of image samples per template on the four datasets used:
the YTC and Traffic datasets have at least 167 frames and 50 images per template respectively, while
fewer than 10 images/videos are in the IJB-A and YTF datasets. Furthermore, about 50.14% of the
templates in IJB-A consist of a single image.
UCSD Traffic dataset (Traffic) [22]: The traffic video database was collected over two days
from highway traffic in Seattle with a single stationary traffic camera. It consists of 254 video
sequences, which are manually labeled in terms of the amount of traffic congestion: heavy, medium,
and light traffic.

The distributions of the number of images per template for the above datasets are shown in Fig. 7.2.
7.3.2 Features and Context
For the YTC dataset, we follow [55] and describe each region with a histogram of Local Binary Pat-
terns (LBP). For the YTF and IJB-A datasets, we follow [115] and use the very deep VGGNet [144]
CNN with 19 layers, trained on the large-scale image recognition benchmark (ILSVRC) [132];
this CNN is then finetuned on the CASIA WebFace dataset [173]. Different from [115], where
synthesized images of various poses, shapes, and expressions are included for CNN finetuning,
we only exploit the real images and the images rendered to the closest poses for the finetuning
(please refer to the supplementary material for more details). For the Traffic dataset, HoG features [32]
are exploited to describe each frame. Note that all the features are ℓ2-normalized before metric
learning in all the experiments. In the context-aware metric learning approach, we employ the
input features as context, since they contain not only identity but also context information such
as poses, shapes, expressions, and illuminations. Experiments using more sophisticated context or
attribute detectors are left for future work.
7.3.3 Protocols and Evaluation Measures
Following the standard practice [106], we split the YTC dataset into 5 folds, each of which
contains 3 and 6 randomly selected videos from each person as the training/gallery templates and
probe templates respectively. The average recognition rate is reported for YTC. The experiments
performed on the Traffic dataset use the 4 splits provided with the dataset [22], each of which contains
75% training/gallery templates and 25% probe templates. The average recognition rate is also
used for this dataset.

For the YTF dataset, 5,000 video pairs are randomly collected from the database, where half
of them are of the same person and half are of different people. These pairs are divided into 10
splits, and the splits are ensured to be subject-mutually exclusive (please refer to [160] for more details).
Table 7.1: Face recognition accuracies (in terms of TARs (%) at different FARs) on IARPA Janus
Benchmark A (verification protocol) [88] with different kinds of template-to-template similarity measures.
Note that Paradigm (1) is “set fusion followed by set matching”, and Paradigm (2) is “set matching followed by
set score fusion”.

Paradigm  Similarity Measure          @10%FAR  @1%FAR  @0.1%FAR  @0.01%FAR
(1)       Avg pooling + inner prod.     94.49   76.28     50.31      19.76
(1)       Avg pooling + cos sim.        93.84   79.98     58.89      29.42
(1)       KDE [55]                      92.68   81.32     61.93      23.42
(2)       Min Fusion                    44.88   17.64      8.02       3.44
(2)       Average Fusion                94.49   76.28     50.31      19.76
(2)       Max Fusion                    92.23   73.47     41.61      13.29
(2)       Ensemble SoftMax Fusion       95.49   84.30     61.15      20.45
Table 7.2: Average recognition rate (ARR) (%) on the YTC dataset

Method                     ARR
inn+ESS [113]              51.99
ITSE [137]                 63.85
TT-ESSE-3                  65.33
TT-ESSE-5                  64.87
TT-ESSE-whole              63.22
ITSE + context             67.46
TT-ESSE-3 + context        66.29
TT-ESSE-5 + context        67.14
TT-ESSE-whole + context    65.38
We consider the Unrestricted protocol, and we report, averaged over the 10 splits, the true acceptance
rates (TARs) at 10% and 1% false acceptance rates (FARs), the verification accuracy (under a threshold
selected on the validation set), and the equal error rate (EER).

In the IJB-A dataset, there are 10 random training and testing splits, where 333 subjects are
randomly sampled and placed in the training split, and the other 167 are placed in the testing
split. We follow the compare (verification) protocol for face verification as defined in [88], and
evaluate the verification performance by the average of TARs at 1%, 0.1%, and 0.01% FARs over the
10 splits.
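For completeness, a small sketch of how TAR at a fixed FAR can be computed from a set of template-pair scores and same/different labels is shown below; the thresholding convention (accept a pair if its score exceeds the (1 − FAR) quantile of the impostor scores) and the toy data are assumptions for illustration only.

import numpy as np

def tar_at_far(scores, same, far=0.01):
    """TAR at a given FAR: threshold at the (1 - far) quantile of impostor scores,
    then report the fraction of genuine pairs accepted at that threshold."""
    scores, same = np.asarray(scores, float), np.asarray(same, bool)
    impostor = np.sort(scores[~same])
    thr = impostor[int(np.ceil((1.0 - far) * len(impostor))) - 1]
    return float(np.mean(scores[same] > thr))

# toy usage: well-separated genuine and impostor score distributions
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.5, 1000), rng.normal(0.0, 0.5, 1000)])
same = np.concatenate([np.ones(1000, bool), np.zeros(1000, bool)])
print(tar_at_far(scores, same, far=0.01))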
7.3.4 Experimental Settings
In Algorithm 1, the number of epochs T is set to 10. The margin γ and the step size η are selected
from {0.1, 0.2, ..., 1} and {10^-3, 10^-2, 10^-1, 10^0} respectively on the held-out validation set.
Besides, we set d = D in all the experiments.

In addition to considering the entire template in the metric learning (denoted as TT-ESSE-
whole), we also set the subtemplate size to 3 and 5 in the experiments (denoted as TT-ESSE-3
and TT-ESSE-5 respectively).
7.3.5 Comparisons of Ensemble SoftMax Similarity to the Other Template Based
Similarity Measures
We compare the ensemble SoftMax similarity on IJB-A [88] to the other commonly used measures
under Paradigm (1) and Paradigm (2), as introduced at the beginning of this chapter. Average feature
pooling with the inner product or the cosine similarity is denoted as Avg pooling + inner prod. and
Avg pooling + cos sim. in Table 7.1. The kernel density estimation method (KDE) [55] is also compared.
These all belong to Paradigm (1). Paradigm (2) contains several special cases of the ensemble SoftMax
fusion, with the cosine similarity for image-to-image matching.
Table 7.3: Average verification performances (%) on the YTF dataset

Method                     TAR@10%FAR  TAR@1%FAR  Verification Accuracy    EER
inn+ESS [113]                   85.64      55.96                  68.18  12.12
ITSE [137]                      86.92      61.04                  72.10  11.60
TT-ESSE-3                       86.96      64.40                  72.22  11.56
TT-ESSE-5                       88.28      65.00                  71.82  10.92
TT-ESSE-whole                   87.60      62.04                  72.58  11.40
ITSE + context                  88.80      65.36                  79.78  10.64
TT-ESSE-3 + context             88.04      65.04                  72.44  11.04
TT-ESSE-5 + context             87.84      64.16                  77.42  11.24
TT-ESSE-whole + context         88.16      63.72                  77.56  10.96
Table 7.4: Average recognition rate (ARR) (%) on the Traffic dataset

Method                     ARR
inn+ESS [113]              91.36
ITSE [137]                 92.94
TT-ESSE-3                  93.34
TT-ESSE-5                  93.73
TT-ESSE-whole              94.12
ITSE + context             94.11
TT-ESSE-3 + context        93.71
TT-ESSE-5 + context        91.74
TT-ESSE-whole + context    92.94
As can be seen in Table 7.1, the ensemble SoftMax similarity (ESS) outperforms the others in most cases.
7.3.6 Comparisons of the Proposed Approaches to the Existing Image Template
Classification Methods
We compare our method to the ensemble SoftMax similarity (ESS) [113, 115] with the inner
product used to compute the image-to-image matching scores (denoted as inn+ESS in Tables 7.2-7.4).
We also compare to the image-triplet similarity embedding [137] with ESS (denoted as ITSE in
Tables 7.2-7.4). In ITSE, we consider image triplets in metric learning but apply ESS in testing
to compute the similarity (note that [137] performs average pooling + inner product in testing; here
we apply ESS because of its superior performance, as shown in Table 7.1). This method is then
equivalent to TT-ESSE-1, i.e., our approach with a subtemplate size of 1.
Tables 7.2 and 7.3 illustrate the performances of TT-ESSE with and without context on the
YTC and YTF datasets. It can be seen that on the YTC dataset ITSE significantly improves
over inn+ESS, and creating template triplets can further boost the performance (about 2% relative
improvement) with a subtemplate size of 3 (TT-ESSE-3). Adding context leads to a further relative
improvement of about 5% when the subtemplate size is 1 or 5 (TT-ESSE-1 + context and TT-ESSE-
5 + context). As for the YTF dataset, TT-ESSE-5 outperforms ITSE by about 6% in terms of
TAR@1%FAR, and the efficacy of context is shown by about 9% relative improvement in the
verification accuracy.
Table 7.5: Average verification performances (%) on the IJB-A dataset

Method                     TAR@1%FAR  TAR@0.1%FAR  TAR@0.01%FAR
GOTS [88]                       40.6         19.8             -
OpenBR [89]                     23.6         10.4             -
Wang et al. [155]               73.3         51.4             -
Deep Multi-Pose [1]             78.7            -             -
VGG-FACE [120]                  80.5         60.4             -
inn+ESS [113]†                  84.3         61.2          20.5
Chen et al. [25]                78.7            -             -
[73]                            68.8         28.6          16.8
KISSME [90]                     65.4         39.4          15.2
ITSE [137]†                     84.5         65.1          35.2
TT-ESSE-3                       84.8         65.4          35.6
TT-ESSE-5                       84.8         66.3          36.5
TT-ESSE-whole                   85.0         65.1          34.5
ITSE + context                  85.5         65.9          36.2
TT-ESSE-3 + context             85.3         66.4          36.5
TT-ESSE-5 + context             85.3         66.2          36.4
TT-ESSE-whole + context         85.4         66.5          36.0
†: Our reimplementation.
Table 7.6: Average verification performances (%) of the templates with at least 10 samples on the IJB-A
dataset

Method           TAR@1%FAR  TAR@0.1%FAR
ITSE [137]           92.97        76.09
TT-ESSE-3            93.67        79.82
TT-ESSE-5            95.16        81.38
TT-ESSE-whole        95.18        81.41
The benefit of template triplets (TT-ESSE) is also demonstrated on the Traffic dataset in
Table 7.4. Note that the context integration works well for a subtemplate size of 1 (TT-ESSE-1 +
context) but not for larger subtemplate sizes (TT-ESSE-5 + context and TT-ESSE-whole + context). We
believe that more elaborate context than HoG features is required for further improvements.

The results on the IJB-A dataset are shown in Table 7.5. The improvements of TT-ESSE are
limited, since the average number of images per template is quite small, as can be seen in Fig. 7.2.
On average, the templates contain fewer than 10 images, and half of them contain only a single image,
which makes the generation of good template triplets difficult. Based on these observations, we
hypothesize that our approach would work better if the templates contained a sufficient number of
images, e.g., at least 10. Table 7.6 presents the performances on the template pairs with
at least 10 images. As can be seen, the improvements of TT-ESSE over the baselines are much
more significant (about 6% relative improvement in terms of TAR@0.1%FAR) compared to the ones
in Table 7.5.
Table 7.7: Average verification performances (%) of three baselines and the proposed method on three
datasets, with the evaluation measure shown in parentheses

Method       YTC (ARR)  YTF (EER)  Traffic (ARR)
[73]              45.6       24.3           83.5
[55]              58.0       20.9           94.1
KISSME [90]       53.1       12.8           18.9
Ours              67.4       10.6           94.1
Finally, we compare our method to the widely used metric learning approaches [73], [55], and
KISSME [90]. The results on the YTC, YTF, and Traffic datasets are shown in Table 7.7 based on
different evaluation measures (the most significant one used in the literature for each dataset), and
the results on the IJB-A dataset are in Table 7.5. All the hyper-parameters were selected following
[73, 55, 90]. It can be seen that our approach performs better than these methods, which make
strong assumptions about a template's distribution.
In summary, “Template Triplet” and “Context-Aware” are orthogonal methods to improve on
image-based metric learning (ITSE) [137]. In all cases, we observe consistent gains from
ITSE to TT-ESSE (by template triplets), and from ITSE to ITSE+context (by context). We also see
consistent gains from TT-ESSE to TT-ESSE+context on the YTC and YTF datasets, suggesting that
the two methods can complement each other. Our best combination overall achieves about a 2.5% gain
(averaged over datasets and measures) over ITSE.
Chapter 8
Conclusions
Over the past decade, facial landmark detection methods have played a tremendous part in
advancing the capabilities of face processing applications. Despite these contributions, landmark
detection methods and the benchmarks that measure their performances have their limits. We show
that deep learning can be leveraged to perform tasks that, until recently, required the use of these
facial landmark detectors. In particular, we show how face shape, viewpoint, and expression can
be estimated directly from image intensities, without the use of facial landmarks. Moreover, facial
landmarks can be obtained as by-products of our deep 3D face modeling process.
Having proposed an alternative to facial landmark detection, we must also provide novel alterna-
tives for evaluating the effectiveness of landmark-free methods such as our own. We therefore
compare our method with facial landmark detectors by considering the effect these methods have
on the bottom line performances of the methods that use them: face recognition for rigid 2D and
3D face alignment, and emotion classification for non-rigid, expression estimation. Of course,
these tests are not meant to be exhaustive: This evaluation paradigm can potentially be extended to
other benchmarks, representing other face processing tasks.
We also demonstrate how our proposed 3D face modeling can help facial synthesis. We
introduce a two-stage approach, consisting of a Texture Completion GAN (TC-GAN) and a 3D Attribute GAN (3DA-GAN),
to tackle the pose-variant facial attribute generation problem. The TC-GAN inpaints the appearance
missing due to self-occlusion and provides a normalized UV texture. Our 3DA-GAN works
on the UV texture space to generate target attributes while maximally preserving the subject identity.
Extensive experiments show that our method achieves consistently better attribute generation
accuracy, visual quality closer to that of the original images, and higher identity-preserving verification
accuracy when compared to several state-of-the-art attribute generation methods.
In addition to extending our tests to other face processing applications, another potential
direction for future work is improvement of our proposed FAME framework. Specifically, notice
that our FPN is trained to estimate pose for a generic face shape, whereas in practice, the 3D face
shape that we project is subject- and expression-adjusted to the input face. This discrepancy can
lead to misalignment errors, even if small ones. These errors may be mitigated by combining
the three networks into a single, jointly learned, FAME network. Moreover, the good generation
quality of our pose-variant facial attribute generation framework also provides the potential for
face editing and face image augmentation along the pose and attribute axes.
Bibliography
[1] Wael AbdAlmageed, Yue Wu, Stephen Rawls, Shai Harel, Tal Hassner, Iacopo Masi,
Jongmoo Choi, Jatuporn Toy Leksut, Jungyeon Kim, Prem Natarajan, et al. Face recognition
using deep multi-pose representations. arXiv preprint arXiv:1603.07388, 2016.
[2] Akshay Asthana, Stefanos Zafeiriou, Shiyang Cheng, and Maja Pantic. Incremental face
alignment in the wild. In Proc. Conf. Comput. Vision Pattern Recognition. IEEE, 2014.
[3] Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. Constrained local neural
fields for robust facial landmark detection in the wild. In Proc. Conf. Comput. Vision
Pattern Recognition Workshops, pages 354–361. IEEE, 2013.
[4] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. Openface: an open source
facial behavior analysis toolkit. In Applications of Computer Vision (WACV), 2016 IEEE
Winter Conference on, pages 1–10, 2016.
[5] Aayush Bansal, Bryan Russell, and Abhinav Gupta. Marr revisited: 2D-3D alignment
via surface normal prediction. In Proc. Conf. Comput. Vision Pattern Recognition, pages
5965–5974, 2016.
[6] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. Towards open-set
identity preserving face synthesis. In CVPR, 2018.
[7] Anil Bas, William A. P. Smith, Timo Bolkart, and Stefanie Wuhrer. Fitting a 3D morphable
model to edges: A comparison between hard and soft correspondences. arxiv preprint,
abs/1602.01125, 2016. URL: http://arxiv.org/abs/1602.01125.
[8] Peter N Belhumeur, David W Jacobs, David J Kriegman, and Narendra Kumar. Localizing
parts of faces using a consensus of exemplars. Trans. Pattern Anal. Mach. Intell., 35(12):
2930–2940, 2013.
[9] Chandraskehar Bhagavatula, Chenchen Zhu, Khoa Luu, and Marios Savvides. Faster than
real-time facial alignment: A 3d spatial transformer network approach in unconstrained
poses. Proc. ICCV , page to appear, 2, 2017.
[10] V. Blanz and T. Vetter. Morphable model for the synthesis of 3D faces. In Proc. ACM
SIGGRAPH Conf. Comput. Graphics, 1999.
[11] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In
SIGGRAPH, 1999.
[12] Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3d morphable model.
Trans. Pattern Anal. Mach. Intell., 25(9):1063–1074, 2003.
[13] Volker Blanz, Sami Romdhani, and Thomas Vetter. Face identification across different
poses and illuminations with a 3d morphable model. In Int. Conf. on Automatic Face and
Gesture Recognition, pages 192–197, 2002.
[14] Volker Blanz, Kristina Scherbaum, Thomas Vetter, and Hans-Peter Seidel. Exchanging
faces in images. volume 23, pages 669–676. Wiley Online Library, 2004.
[15] James Booth, Epameinondas Antonakos, Stylianos Ploumpis, George Trigeorgis, Yannis
Panagakis, and Stefanos Zafeiriou. 3D face morphable models “in-the-wild”. In Proc. Conf.
Comput. Vision Pattern Recognition, 2017.
[16] Joel Bosveld, Arif Mahmood, Du Q Huynh, and Lyle Noakes. Constrained metric learning
by permutation inducing isometries. IEEE Transactions on Image Processing, 25(1):92–
103, 2016.
[17] Adrian Bulat and Georgios Tzimiropoulos. Binarized convolutional landmark localizers
for human pose estimation and face alignment with limited resources. In Proc. Int. Conf.
Comput. Vision, 2017.
[18] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d
face alignment problem?(and a dataset of 230,000 3d facial landmarks). In Proc. Int. Conf.
Comput. Vision, 2017.
[19] Xavier P Burgos-Artizzu, Pietro Perona, and Piotr Dollár. Robust face landmark estimation
under occlusion. In Proc. Int. Conf. Comput. Vision, pages 1513–1520. IEEE, 2013.
[20] Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape
regression. Int. J. Comput. Vision, 107(2):177–190, 2014.
[21] Hakan Cevikalp and Bill Triggs. Face recognition based on image sets. In CVPR, pages
2567–2573, 2010.
[22] Antoni B Chan and Nuno Vasconcelos. Probabilistic kernels for the classification of
auto-regressive visual processes. In CVPR, pages 846–851, 2005.
[23] Feng-Ju Chang, Anh Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gérard Medioni.
Faceposenet: Making a case for landmark-free face alignment. In Proc. Int. Conf. Comput.
Vision Workshops, 2017.
[24] Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gérard
Medioni. Deep, landmark-free fame: Face alignment, modeling, and expression estimation.
International Journal of Computer Vision, pages 1–27, 2019.
[25] Jun-Cheng Chen, Vishal M Patel, and Rama Chellappa. Unconstrained face verification
using deep cnn features. arXiv preprint arXiv:1508.01722, 2015.
[26] Zeyuan Chen, Shaoliang Nie, Tianfu Wu, and Christopher G Healey. High resolution face
completion with multiple controllable attributes via fully end-to-end progressive generative
adversarial networks. arXiv preprint arXiv:1801.07632, 2018.
[27] Y . Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo. Stargan: Unified generative
adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[28] Baptiste Chu, Sami Romdhani, and Liming Chen. 3D-aided face recognition robust to
expression and pose variations. In Proc. Conf. Comput. Vision Pattern Recognition, 2014.
[29] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Unsupervised metric
learning for face identification in tv video. In ICCV, pages 1559–1566, 2011.
[30] Nate Crosswhite, Jeffrey Byrne, Chris Stauffer, Omkar Parkhi, Qiong Cao, and Andrew
Zisserman. Template adaptation for face verification and identification. In Automatic Face
& Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages
1–8. IEEE, 2017.
[31] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic. Robust statistical face frontalization.
In ICCV, 2015.
[32] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In
CVPR, pages 886–893, 2005.
[33] Matthias Dantone, Juergen Gall, Gabriele Fanelli, and Luc Van Gool. Real-time facial
feature detection using conditional regression forests. In Proc. Conf. Comput. Vision
Pattern Recognition, pages 2578–2585. IEEE, 2012.
[34] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-
theoretic metric learning. In ICML, pages 209–216, 2007.
[35] J. Deng, S. Cheng, N. Xue, Y . Zhou, and S. Zafeiriou. Uv-gan: Adversarial facial uv map
completion for pose-invariant face recognition. In CVPR, 2018.
[36] Jiankang Deng, Jia Guo, Xue Niannan, and Stefanos Zafeiriou. Arcface: Additive angular
margin loss for deep face recognition. In CVPR, 2019.
[37] Abhinav Dhall, O.V . Ramana Murthy, Roland Goecke, Jyoti Joshi, and Tom Gedeon. Video
and image based emotion recognition challenges in the wild: EmotiW 2015. In Int. Conf.
on Multimodal Interaction. ACM, 2015.
[38] Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon.
From individual to group-level emotion recognition: Emotiw 5.0. In Proceedings of the 19th
ACM International Conference on Multimodal Interaction, pages 524–528. ACM, 2017.
[39] Abhinav Dhall et al. Collecting large, richly annotated facial-expression databases from
movies. 2012.
[40] Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial
landmark detection. In Proc. Conf. Comput. Vision Pattern Recognition, 2018.
[41] Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, and Yaser Sheikh.
Supervision-by-registration: An unsupervised approach to improve the precision of facial
landmark detectors. In Proc. Conf. Comput. Vision Pattern Recognition, 2018.
[42] Xuanyi Dong, Liang Zheng, Fan Ma, Yi Yang, and Deyu Meng. Few-example object
detection with model communication. Trans. Pattern Anal. Mach. Intell., 2018. doi:
10.1109/TPAMI.2018.2844853.
[43] Pengfei Dou, Shishir K. Shah, and Ioannis A. Kakadiaris. End-to-end 3D face reconstruction
with deep neural networks. In Proc. Conf. Comput. Vision Pattern Recognition, July 2017.
[44] Gareth J Edwards, Timothy F Cootes, and Christopher J Taylor. Face recognition using
active appearance models. In European Conf. Comput. Vision, pages 581–595. Springer,
1998.
[45] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of unfiltered faces.
Trans. on Inform. Forensics and Security, 9(12), 2014.
[46] M. Everingham, J. Sivic, and A. Zisserman. “Hello! My name is... Buffy” – automatic
naming of characters in TV video. In Proc. British Mach. Vision Conf., 2006.
[47] C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M Martinez. Emotionet: An
accurate, real-time algorithm for the automatic annotation of a million facial expressions in
the wild. In Proc. Conf. Comput. Vision Pattern Recognition, pages 5562–5570, 2016.
[48] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing
normalized faces from facial identity features. In CVPR, 2017.
[49] Y . Feng, F. Wu, X. Shao, Y . Wang, and X. Zhou. Joint 3d face reconstruction and dense
alignment with position map regression network. In ECCV, 2018.
[50] C. Ferrari, G. Lisanti, S. Berretti, and A. Bimbo. Effective 3d based frontalization for
unconstrained face recognition. In ICPR, 2016.
[51] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
and Y . Bengio. Generative adversarial nets. In NIPS, 2014.
[52] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Is that you? metric learning
approaches for face identification. In ICCV, pages 498–505, 2009.
[53] Rıza Alp Güler, George Trigeorgis, Epameinondas Antonakos, Patrick Snape, Stefanos
Zafeiriou, and Iasonas Kokkinos. Densereg: Fully convolutional dense shape regression
in-the-wild. In CVPR, 2017.
[54] Jihun Hamm and Daniel D Lee. Grassmann discriminant analysis: a unifying view on
subspace-based learning. In ICML, pages 376–383, 2008.
[55] Mehrtash Harandi, Mathieu Salzmann, and Mahsa Baktashmotlagh. Beyond gauss: Image-
set matching on the riemannian manifold of pdfs. In ICCV, pages 4112–4120, 2015.
[56] Mehrtash T Harandi, Mathieu Salzmann, and Richard Hartley. From manifold to manifold:
Geometry-aware dimensionality reduction for spd matrices. In European Conference on
Computer Vision, pages 17–32, 2014.
[57] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision.
Cambridge university press, 2003.
[58] Tal Hassner. Viewing real-world faces in 3D. In Proc. Int. Conf. Comput. Vision,
pages 3607–3614. IEEE, 2013. Available: www.openu.ac.il/home/hassner/projects/poses.
[59] Tal Hassner and Ronen Basri. Example based 3D reconstruction from single 2D images. In
Proc. Conf. Comput. Vision Pattern Recognition Workshops, 2006.
[60] Tal Hassner, Shai Harel, Eran Paz, and Roee Enbar. Effective face frontalization in
unconstrained images. In CVPR, 2015.
[61] Tal Hassner, Shai Harel, Eran Paz, and Roee Enbar. Effective face frontalization in
unconstrained images. In Proc. Conf. Comput. Vision Pattern Recognition, 2015.
[62] Tal Hassner, Iacopo Masi, Jungyeon Kim, Jongmoo Choi, Shai Harel, Prem Natarajan, and
Gerard Medioni. Pooling faces: Template based face recognition with pooled face images.
In Proc. Conf. Comput. Vision Pattern Recognition Workshops, June 2016.
[63] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proc. Conf. Comput. Vision Pattern Recognition, June 2016.
[64] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen. Attgan: Facial attribute editing by only
changing what you want. In arXiv:1711.10678, 2018.
[65] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, and Bernhard Nessler. Gans trained
by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
[66] Guosheng Hu, Fei Yan, Chi-Ho Chan, Weihong Deng, William Christmas, Josef Kittler,
and Neil M Robertson. Face recognition using a unified 3D morphable model. In European
Conf. Comput. Vision, pages 73–89. Springer, 2016.
[67] Yiqun Hu, Ajmal S Mian, and Robyn Owens. Sparse approximated nearest points for image
set classification. In CVPR, pages 121–128, 2011.
[68] Gary B Huang, Vidit Jain, and Erik Learned-Miller. Unsupervised joint alignment of
complex images. In Proc. Int. Conf. Comput. Vision, pages 1–8. IEEE, 2007.
[69] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the
wild: A database for studying face recognition in unconstrained environments. Technical
Report 07-49, UMass, Amherst, October 2007.
[70] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the
wild: A database for studying face recognition in unconstrained environments. Technical
report, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[71] Rui Huang, Shu Zhang, Tianyu Li, and Ran He. Beyond face rotation: Global and local
perception gan for photorealistic and identity preserving frontal view synthesis. In ICCV,
2017.
[72] Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Face recognition on large-
scale video in the wild with hybrid euclidean-and-riemannian metric learning. Pattern
Recognition, 48(10):3113–3124, 2015.
[73] Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Projection metric learning
on grassmann manifold with application to video based face recognition. In CVPR, pages
140–149, 2015.
[74] Zhiwu Huang, Ruiping Wang, Shiguang Shan, Xianqiu Li, and Xilin Chen. Log-euclidean
metric learning on symmetric positive definite manifold with application to image set
classification. In ICML, pages 720–729, 2015.
[75] P. Huber, G. Hu, R. Tena, P. Mortazavian, W. Koppen, W. Christmas, M. Rätsch, and
J. Kittler. A multiresolution 3D morphable face model and fitting framework. In Int. Conf.
on Computer Vision Theory and Applications, 2016.
[76] Aaron S Jackson, Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. Large
pose 3D face reconstruction from a single image via direct volumetric CNN regression.
Proc. Int. Conf. Comput. Vision, 2017.
[77] László A. Jeni, Jeffrey F. Cohn, and Takeo Kanade. Dense 3D face alignment from 2D
videos in real-time. In Int. Conf. on Automatic Face and Gesture Recognition, volume 1.
IEEE, 2015.
[78] Junqi Jin, Kun Fu, Runpeng Cui, Fei Sha, and Changshui Zhang. Aligning where to see
and what to tell: image caption with region-based attention and scene factorization. arXiv
preprint arXiv:1506.06272, 2015.
[79] Amin Jourabloo and Xiaoming Liu. Pose-invariant 3d face alignment. In Proc. Conf.
Comput. Vision Pattern Recognition, pages 3694–3702, 2015.
[80] Amin Jourabloo and Xiaoming Liu. Large-pose face alignment via cnn-based dense 3D
model fitting. In Proc. Conf. Comput. Vision Pattern Recognition, 2016.
[81] J. Yang, S. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent
transformations for 3d view synthesis. In NIPS, 2015.
[82] Vahdat Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble
of regression trees. In Proc. Conf. Comput. Vision Pattern Recognition. IEEE, 2014.
[83] I. Kemelmacher-Shlizerman and R. Basri. 3D face reconstruction from a single image using
a single reference face shape. Trans. Pattern Anal. Mach. Intell., 33(2):394–405, 2011.
[84] Minyoung Kim, Sanjiv Kumar, Vladimir Pavlovic, and Henry Rowley. Face tracking and
recognition with visual constraints in real-world videos. In CVPR, pages 1–8, 2008.
[85] Tae-Kyun Kim, Josef Kittler, and Roberto Cipolla. Discriminative learning and recog-
nition of image set classes using canonical correlations. Pattern Analysis and Machine
Intelligence, 29(6):1005–1018, 2007.
[86] Davis E King. Dlib-ml: A machine learning toolkit. J. Mach. Learning Research, 10(Jul):
1755–1758, 2009.
[87] D.P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
[88] Brendan F. Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen
Allen, Patrick Grother, Alan Mah, Mark Burge, and Anil K. Jain. Pushing the frontiers of
unconstrained face detection and recognition: IARPA Janus Benchmark-A. In Proc. Conf.
Comput. Vision Pattern Recognition, 2015.
[89] Joshua C Klontz, Brendan F Klare, Scott Klum, Anubhav K Jain, and Mark J Burge. Open
source biometric recognition. In BTAS, pages 1–8, 2013.
[90] Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Large
scale metric learning from equivalence constraints. In CVPR, 2012.
[91] Ronak Kosti, Jose M Alvarez, Adria Recasens, and Agata Lapedriza. Emotion recognition
in context. In Proc. Conf. Comput. Vision Pattern Recognition, 2017.
[92] Martin Köstinger, Paul Wohlhart, Peter M Roth, and Horst Bischof. Annotated facial
landmarks in the wild: A large-scale, real-world database for facial landmark localization.
In Proc. Int. Conf. Comput. Vision Workshops. IEEE, 2011.
[93] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep
convolutional neural networks. In NIPS, 2012.
[94] A. Kumar, A. Alavi, and R. Chellappa. KEPLER: keypoint and pose estimation of un-
constrained faces by learning efficient H-CNN regressors. In Automatic Face and Gesture
Recognition, pages 258–265, May 2017.
[95] Amit Kumar and Rama Chellappa. Disentangling 3D pose in a dendritic cnn for uncon-
strained 2d face alignment. In Proc. Conf. Comput. Vision Pattern Recognition, 2018.
[96] Amit Kumar, Azadeh Alavi, and Rama Chellappa. Kepler: Keypoint and pose estimation of
unconstrained faces by learning efficient h-cnn regressors. In Automatic Face and Gesture
Recognition, pages 258–265. IEEE, 2017.
[97] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer,
et al. Fader networks: Manipulating images by sliding attributes. In NIPS, 2017.
[98] Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas Huang. Interactive
facial feature localization. European Conf. Comput. Vision, pages 679–692, 2012.
[99] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o (n) solution
to the pnp problem. International journal of computer vision, 81(2):155, 2009.
[100] Gil Levi and Tal Hassner. Emotion recognition in the wild via convolutional neural networks
and mapped binary patterns. In Int. Conf. on Multimodal Interaction, pages 503–510. ACM,
2015.
[101] Chen Li, Kun Zhou, and Stephen Lin. Intrinsic face image decomposition with human face
priors. In European Conf. Comput. Vision, 2014.
[102] M. Li, W. Zuo, and D. Zhang. Convolutional network for attribute-driven and identity-
preserving human face generation. In arXiv:1608.06434, 2016.
[103] Yaojie Liu, Amin Jourabloo, William Ren, and Xiaoming Liu. Dense face alignment. In
Proc. Conf. Comput. Vision Pattern Recognition, 2017.
[104] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in
the wild. In ICCV, 2015.
[105] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in
the wild. In Proc. Int. Conf. Comput. Vision, 2015.
[106] Jiwen Lu, Gang Wang, and Pierre Moulin. Image set classification using holistic multiple
order statistics features and localized multi-kernel metric learning. In ICCV, pages 329–336,
2013.
[107] Jiwen Lu, Gang Wang, Weihong Deng, Pierre Moulin, and Jie Zhou. Multi-manifold deep
metric learning for image set classification. In CVPR, pages 1137–1145, 2015.
[108] Yongyi Lu, Yu-Wing Tai, and Chi-Keung Tang. Attribute-guided face generation using
conditional cyclegan. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 282–297, 2018.
[109] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain
Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit
and emotion-specified expression. In Proc. Conf. Comput. Vision Pattern Recognition
Workshops, pages 94–101. IEEE, 2010.
[110] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep caption-
ing with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632,
2014.
[111] I. Masi, F. J. Chang, J. Choi, S. Harel, J. Kim, K. Kim, J. Leksut, S. Rawls, Y . Wu,
T. Hassner, W. AbdAlmageed, G. Medioni, L. P. Morency, P. Natarajan, and R. Nevatia.
Learning pose-aware models for pose-invariant face recognition in the wild. Trans. Pattern
Anal. Mach. Intell., 2018.
[112] Iacopo Masi, Claudio Ferrari, Alberto Del Bimbo, and Gérard Medioni. Pose independent
face recognition by localizing local binary patterns via deformation components. In Int.
Conf. on Pattern Recognition, pages 4477–4482, 2014.
[113] Iacopo Masi, Stephen Rawls, Gérard Medioni, and Prem Natarajan. Pose-aware face
recognition in the wild. In CVPR, 2016.
[114] Iacopo Masi, Stephen Rawls, Gérard Medioni, and Prem Natarajan. Pose-aware face
recognition in the wild. In Proc. Conf. Comput. Vision Pattern Recognition, pages 4838–
4846, 2016.
[115] Iacopo Masi, Anh Tran, Tal Hassner, Jatuporn Toy Leksut, and Gérard Medioni. Do We
Really Need to Collect Millions of Faces for Effective Face Recognition? In European Conf.
Comput. Vision, 2016. Available: www.openu.ac.il/home/hassner/projects/augmented_faces.
[116] Iacopo Masi, Tal Hassner, Anh Tuân Tran, and Gérard Medioni. Rapid synthesis of massive
face sets for improved face recognition. In Int. Conf. on Automatic Face and Gesture
Recognition, pages 604–611. IEEE, 2017.
[117] Iacopo Masi, Yue Wu, Tal Hassner, and Prem Natarajan. Deep face recognition: A survey.
In Conf. on Graphics, Patterns and Images, October 2018.
[118] Roland Memisevic and Geoffrey Hinton. Unsupervised learning of image transformations.
In CVPR, pages 1–8, 2007.
[119] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose
estimation. In ECCV, 2016.
[120] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proc. British Mach.
Vision Conf., 2015.
[121] P Paysan, R Knothe, B Amberg, S Romdhani, and T Vetter. A 3D face model for pose and
illumination invariant face recognition. In Int. Conf. on Advanced Video and Signal based
Surveillance, 2009.
[122] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional gans
for image editing. In NIPS Workshops, 2016.
[123] Patrick Poirson, Phil Ammirato, Cheng-Yang Fu, Wei Liu, Jana Kosecka, and Alexander C
Berg. Fast single shot detection and pose estimation. In Int. Conf. on 3D Vision, pages
676–684. IEEE, 2016.
[124] A. Pumarola, A. Agudo, A.M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. Ganimation:
Anatomically-aware facial animation from a single image. In ECCV, 2018.
[125] Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for
discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[126] Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Face alignment at 3000 fps via
regressing local binary features. In Proc. Conf. Comput. Vision Pattern Recognition. IEEE,
2014.
[127] Elad Richardson, Matan Sela, and Ron Kimmel. 3d face reconstruction by learning from
synthetic data. In Int. Conf. on 3D Vision, 2016.
[128] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. Learning detailed face
reconstruction from a single image. arXiv preprint arXiv:1611.05053, 2016.
[129] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. Learning detailed face
reconstruction from a single image. In Proc. Conf. Comput. Vision Pattern Recognition,
July 2017.
[130] Sami Romdhani and Thomas Vetter. Efficient, robust and accurate fitting of a 3D morphable
model. In Proc. Int. Conf. Comput. Vision, 2003.
[131] Sami Romdhani and Thomas Vetter. Estimating 3D shape and texture using pixel intensity,
edges, specular highlights, texture constraints and a prior. In Proc. Conf. Comput. Vision
Pattern Recognition, volume 2, pages 986–993, 2005.
[132] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet
large scale visual recognition challenge. International Journal of Computer Vision, 115(3):
211–252, 2015.
[133] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge:
The first facial landmark localization challenge. In ICCVW, 2013.
[134] Christos Sagonas, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300
faces in-the-wild challenge: The first facial landmark localization challenge. In Proc. Conf.
Comput. Vision Pattern Recognition Workshops, 2013.
[135] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou,
and Maja Pantic. 300 faces in-the-wild challenge: Database and results. Image and Vision
Computing, 2015.
[136] Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In AISTATS,
volume 1, page 3, 2009.
[137] Swami Sankaranarayanan, Azadeh Alavi, and Rama Chellappa. Triplet similarity embed-
ding for face verification. arXiv preprint arXiv:1602.03418, 2016.
[138] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding
for face recognition and clustering. In CVPR, pages 815–823, 2015.
[139] Matan Sela, Elad Richardson, and Ron Kimmel. Unrestricted facial geometry reconstruction
using image-to-image translation. In Proc. Int. Conf. Comput. Vision, 2017.
[140] Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David Jacobs. SfSNet:
Learning shape, reflectance and illuminance of faces in the wild. In Proc. Conf. Comput.
Vision Pattern Recognition, 2018.
[141] Gaurav Sharma and Patrick Pérez. Latent max-margin metric learning for comparing video
face tubes. In CVPR Workshops, pages 65–74, 2015.
[142] W. Shen and R. Liu. Learning residual images for face attribute manipulation. In CVPR,
2017.
[143] Yujun Shen, Ping Luo, Junjie Yan, Xiaogang Wang, and Xiaoou Tang. Faceid-gan: Learning
a symmetry three-player gan for identity-preserving face synthesis. In CVPR, 2018.
[144] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[145] Hao Su, Charles R Qi, Yangyan Li, and Leonidas J Guibas. Render for CNN: Viewpoint
estimation in images using CNNs trained with rendered 3D model views. In Proc. Int. Conf.
Comput. Vision, pages 2686–2694, 2015.
[146] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the
gap to human-level performance in face verification. In CVPR, 2014.
[147] Hao Tang, Yuxiao Hu, Yun Fu, Mark Hasegawa-Johnson, and Thomas S Huang. Real-time
conversion from a single 2d face image to a 3D text-driven emotive audio-visual avatar. In
Int. Conf. on Multimedia and Expo, pages 1205–1208. IEEE, 2008.
[148] Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick
Pérez, and Christian Theobalt. Self-supervised multi-level face model learning for monocular
reconstruction at over 250 Hz. In Proc. Conf. Comput. Vision Pattern Recognition, 2018.
[149] Anh Tran, Tal Hassner, Iacopo Masi, and Gérard Medioni. Regressing robust and discrim-
inative 3D morphable models with a very deep neural network. In Proc. Conf. Comput.
Vision Pattern Recognition, 2017.
[150] Anh Tuan Tran, Tal Hassner, Iacopo Masi, Eran Paz, Yuval Nirkin, and Gérard Medioni.
Extreme 3D face reconstruction: Looking past occlusions. In Proc. Conf. Comput. Vision
Pattern Recognition, 2018.
[151] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning gan for
pose-invariant face recognition. In CVPR, 2017.
[152] Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, and
Kilian Weinberger. Deep feature interpolation for image content changes. In CVPR, 2017.
[153] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In IEEE
International Workshop on Machine Learning for Signal Processing, pages 1–6, 2012.
[154] Thomas Vetter and V olker Blanz. Estimating coloured 3D face models from single images:
An example based approach. In European Conf. Comput. Vision, 1998.
[155] Dayong Wang, Charles Otto, and Anil K Jain. Face search at scale: 80 million gallery.
arXiv preprint arXiv:1507.07242, 2015.
[156] Ruiping Wang and Xilin Chen. Manifold discriminant analysis. In CVPR, pages 429–436,
2009.
[157] Ruiping Wang, Shiguang Shan, Xilin Chen, and Wen Gao. Manifold-manifold distance
with application to face recognition based on image set. In CVPR, pages 1–8, 2008.
[158] Wen Wang, Ruiping Wang, Zhiwu Huang, Shiguang Shan, and Xilin Chen. Discriminant
analysis on riemannian manifold of gaussian distributions for face recognition with image
sets. In CVPR, pages 2048–2057, 2015.
[159] Cameron Whitelam, Emma Taborsky, Austin Blanton, Brianna Maze, Jocelyn Adams, Tim
Miller, Nathan Kalka, Anil K Jain, James A Duncan, Kristen Allen, et al. Iarpa janus
benchmark-b face dataset. In Proc. Conf. Comput. Vision Pattern Recognition Workshops,
2017.
[160] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched
background similarity. In Proc. Conf. Comput. Vision Pattern Recognition, 2011.
[161] Yue Wu and Tal Hassner. Facial landmark detection with tweaked convolutional neural
networks. arXiv preprint arXiv:1511.04031, 2015.
[162] Yue Wu, Tal Hassner, KangGeon Kim, Gerard Medioni, and Prem Natarajan. Facial
landmark detection with tweaked convolutional neural networks. Trans. Pattern Anal.
Mach. Intell., 2017.
[163] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond pascal: A benchmark for 3D
object detection in the wild. In Winter Conf. on App. of Comput. Vision, pages 75–82.
IEEE, 2014.
[164] Yu Xiang, Wonhui Kim, Wei Chen, Jingwei Ji, Christopher Choy, Hao Su, Roozbeh
Mottaghi, Leonidas Guibas, and Silvio Savarese. Objectnet3D: A large scale database for
3D object recognition. In European Conf. Comput. Vision, pages 160–176. Springer, 2016.
[165] T. Xiao, J. Hong, and J. Ma. Dna-gan: Learning disentangled representations from
multi-attribute images. In ICLR Workshops, 2018.
[166] Taihong Xiao, Jiapeng Hong, and Jinwen Ma. Elegant: Exchanging latent encodings with
gan for transferring multiple face attributes. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 168–184, 2018.
[167] Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. DisturbLabel: Regulariz-
ing CNN on the loss layer. In Proc. Conf. Comput. Vision Pattern Recognition, 2016.
[168] Xuehan Xiong and Fernando De la Torre. Supervised descent method and its applications
to face alignment. In Proc. Conf. Comput. Vision Pattern Recognition. IEEE, 2013.
[169] Osamu Yamaguchi, Kazuhiro Fukui, and Ken-ichi Maeda. Face recognition using temporal
image sequence. In FG, pages 318–323, 1998.
[170] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from
visual attributes. In ECCV, 2016.
[171] Fei Yang, Jue Wang, Eli Shechtman, Lubomir Bourdev, and Dimitri Metaxas. Expression
flow for 3D-aware face component transfer. ACM Trans. on Graphics, 30(4):60, 2011.
[172] Zhenheng Yang and Ramakant Nevatia. A multi-scale cascade fully convolutional network
face detector. In ICPR, pages 633–638, 2016.
[173] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from
scratch. arXiv preprint arXiv:1411.7923, 2014. Available: http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html.
[174] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim. Rotating your face using multi-task
deep neural network. In CVPR, 2015.
[175] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3d dynamic fa-
cial expression database. In International Conference on Automatic Face and Gesture
Recognition, 2008.
[176] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Manmohan Chandraker. Towards
large-pose face frontalization in the wild. In ICCV, 2017.
[177] Xiang Yu, Junzhou Huang, Shaoting Zhang, Wang Yan, and Dimitris N Metaxas. Pose-free
facial landmark fitting via optimized part mixtures and cascaded deformable shape model.
In Proc. Int. Conf. Comput. Vision, pages 1944–1951. IEEE, 2013.
[178] Amir Zadeh, Tadas Baltrušaitis, and Louis-Philippe Morency. Deep constrained local
models for facial landmark detection. arXiv preprint arXiv:1611.08657, 2016.
[179] Stefanos Zafeiriou, Athanasios Papaioannou, Irene Kotsia, Mihalis Nicolaou, and Guoying
Zhao. Facial affect “in-the-wild”. In Proc. Conf. Comput. Vision Pattern Recognition
Workshops, pages 36–47, 2016.
[180] Stefanos Zafeiriou, Grigorios G Chrysos, Anastasios Roussos, Evangelos Ververas, Jiankang
Deng, George Trigeorgis, Daniel Crispell, Maxim Bazik, Pengfei Xiong, Guoqing Li, et al.
The 3d menpo facial landmark tracking challenge. In ICCV 3D Menpo Facial Landmark
Tracking Challenge Workshop, volume 5, 2017.
[181] Gang Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Generative adversarial network
with spatial attention for face attribute editing. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 417–432, 2018.
[182] Jie Zhang, Shiguang Shan, Meina Kan, and Xilin Chen. Coarse-to-fine auto-encoder net-
works (CFAN) for real-time face alignment. In European Conf. Comput. Vision. Springer,
2014.
[183] Kaipeng Zhang, Lianzhi Tan, Zhifeng Li, and Yu Qiao. Gender and smile classification using
deep convolutional neural networks. In Proc. Conf. Comput. Vision Pattern Recognition
Workshops, pages 34–38, 2016.
[184] S. Zhou, T. Xiao, Y . Yang, D. Feng, and Q. He. Genegan: Learning object transfiguration
and attribute subspace from unpaired data. In BMVC, 2017.
[185] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image
translation using cycle-consistent adversarial networks. In ICCV, 2017.
[186] Pengfei Zhu, Lei Zhang, Wangmeng Zuo, and David Zhang. From point to set: Extend the
learning of distance metrics. In ICCV, pages 2664–2671, 2013.
[187] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Face alignment by coarse-to-
fine shape searching. In Proc. Conf. Comput. Vision Pattern Recognition, 2015.
113
[188] Shizhan Zhu, Cheng Li, Chen-Change Loy, and Xiaoou Tang. Unconstrained face alignment
via cascaded compositional learning. In Proc. Conf. Comput. Vision Pattern Recognition,
2016.
[189] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization
for face recognition in the wild. In CVPR, 2015.
[190] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark local-
ization in the wild. In Proc. Conf. Comput. Vision Pattern Recognition, pages 2879–2886.
IEEE, 2012.
[191] Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, and Stan Z Li. High-fidelity pose and
expression normalization for face recognition in the wild. In Proc. Conf. Comput. Vision
Pattern Recognition, pages 787–796, 2015.
[192] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Li. Face alignment across
large poses: A 3D solution. In Proc. Conf. Comput. Vision Pattern Recognition, Las Vegas,
NV , June 2016.
[193] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li. Face alignment across
large poses: A 3D solution. In CVPR, 2016.
114