USC Digital Library / University of Southern California Dissertations and Theses / Face recognition and 3D face modeling from images in the wild (USC Thesis)
FACE RECOGNITION AND 3D FACE MODELING FROM IMAGES IN THE WILD

by Anh Tran

A Dissertation Presented to the
FACULTY OF THE VITERBI SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2017

Copyright 2017 Anh Tran

Acknowledgement

First and foremost, I would like to thank my advisor, Professor Gerard Medioni. He is a profound expert who showed me the great world of computer vision. Working with him, I learned how to grow from a beginner into a confident computer vision researcher. There was a hard period when I was stuck in my research, without any progress for two years; he patiently helped me through that hardship with his wise guidance. Without his marvelous support, I would never have finished my Ph.D. program. Today, presenting this dissertation, I truly appreciate every moment of the long journey under the supervision of Professor Gerard Medioni.

Second, I want to thank Professor Tal Hassner, Iacopo Masi, and Jongmoo Choi for their tremendous help, particularly in my last two years. I could never have gone this far without their cooperation. The experience of working with them is a treasure; I have learned a lot from their expertise, creativity, inspiration, and hard work. I will never forget this valuable time and everything they have done for me.

I would like to thank Professor Ram Nevatia and Professor Sandeep Gupta for spending their precious time to attend my dissertation defense. I also want to thank Professor Aiichiro Nakano, Professor Hao Li, and Professor Alexander Sawchuk for joining my qualifying exam committee. I am really grateful for their valuable feedback. I feel blessed to have worked with the smart and skilled people of the GLAIVE team on the IARPA JANUS project at the Information Sciences Institute (ISI), including Stephen Rawls, Rex Yu, Wael AbdAlmageed, and Professor Prem Natarajan. I am thankful to have worked with Professor Luciano Silva and Professor Olga Bellon, who supported me through the hardest time of my Ph.D. life. I am grateful for the time spent with the IRIS lab members: Kanggeon Kim, Matthias Hernandez, Fengju Chang, Gozde Sahin, Zhenheng Yang, Jatuporn Toy Leksut, Zhouliang Kang, Jungyeon Kim, Ruizhe Wang, Younghoon Lee, Tung Sin Leung, Shay Deutsch, and Bor-Jeng Chen. I have had many great moments at USC thanks to Hien To, Loc Huynh, Duc Le, Phong Trinh, Quynh Nguyen, Thanh Nguyen, Hieu Nguyen, Luan Tran, and many other friends.

Finally, I would love to thank my parents and my younger sister, who give me endless love, countless pieces of advice, and immeasurable support, and who follow every step I take from the other side of the Earth. This dissertation is dedicated to you.

Table of Contents

Acknowledgement
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Problem Statement
  1.2 Issues
    1.2.1 Face Recognition
    1.2.2 3D Face Reconstruction
  1.3 Contributions
  1.4 Dissertation Outline

2 Related Work
  2.1 Face Recognition
    2.1.1 Face Recognition Systems
    2.1.2 Data Augmentation
    2.1.3 Face Synthesis for Face Recognition
  2.2 3D Face Modeling
    2.2.1 Triangulation-based Approaches
    2.2.2 Shape from X Methods
    2.2.3 Deformation-based Approaches
    2.2.4 Template-based Approaches

Part I: Face Recognition

3 2D Face Recognition Aided by 3D Face Augmentation
  3.1 Data Augmentation by Synthesizing Faces
    3.1.1 Pose Variations
    3.1.2 3D Shape Variations
    3.1.3 Expression Variations
  3.2 Face Recognition Pipeline
    3.2.1 CNN Training with 3D Augmented Data
    3.2.2 Face Recognition with Synthesized Faces
  3.3 Experiments
    3.3.1 Results on the IJB-A Benchmarks
    3.3.2 Results on Labeled Faces in the Wild
    3.3.3 Result Summary
  3.4 Conclusions

4 Enhance 2D Face Recognition System
  4.1 Accelerate Augmentation and Matching
    4.1.1 Face Rendering
    4.1.2 Rapid Rendering with Generic Faces
    4.1.3 Matching by Pooling Synthesized Faces
    4.1.4 Experimental Results
  4.2 Scaling up the training set: the COW dataset
    4.2.1 COW dataset
      a. Components
      b. Combining datasets
      c. Filtering noisy images
      d. Filtering noisy labels
    4.2.2 Experimental Results
  4.3 Landmark-Free Pipeline
    4.3.1 Do We Need Accurate Landmark Detection for Face Alignment?
    4.3.2 A Critique of Facial Landmark Detection
    4.3.3 Deep, Direct Head Pose Regression
    4.3.4 Results
    4.3.5 Effect of Alignment on Recognition
    4.3.6 Landmark Detection Accuracy
  4.4 Conclusions

Part II: 3D Face Modeling

5 3D Face Modeling with An Analysis-by-Synthesis Approach
  5.1 3D Morphable Models
  5.2 3D Morphable Model Fitting on A Single Image
  5.3 Model Fusion from Multiple Input Images
  5.4 3D Face Reconstruction from Video
    5.4.1 3D Face Tracking
      a. Initialization with 3-D Face Modeling
      b. 3-D Pose Estimation
      c. Re-acquisition
      d. Validation Experiments
      e. Experiments with Ground-truth Datasets
      f. Experiments with Recorded Videos
      g. Experiments with Real-time Tracking
    5.4.2 Frame-based 3D Face Modeling from Videos
      a. Bad Tracked Frame Removal
      b. Key-frames based Modeling
  5.5 Parameter based 3D-3D Matching
  5.6 Experiments
    5.6.1 Experiment with Ground-truth Data
    5.6.2 Experiment with 3D Face Modeling and Matching
  5.7 Conclusions

6 Deep 3D Face Modeling and Beyond
  6.1 Why Deep Learning?
  6.2 Regressing 3DMM Parameters with A CNN
    6.2.1 Generating Training Data
    6.2.2 Learning to Regress Pooled 3DMM
  6.3 Experimental Results
    6.3.1 3D Shape Reconstruction Accuracy
    6.3.2 3DMM Regression Speed
    6.3.3 Face Recognition In The Wild
    6.3.4 Qualitative Results
  6.4 2D-3D Fusion for Face Recognition
  6.5 3D Model Refinement
    6.5.1 Deep Expression Estimation
    6.5.2 Fine-detail Estimation with Shape-from-Shading
    6.5.3 Occlusion Handling
      a. Pixel-Grid Mesh vs. BFM Structured Mesh
      b. Dense and Sparse Mesh Conversion
      c. Occluded Vertices Detection
      d. Occluded Vertices Inference
      e. Dense Mesh Symmetry-based Inference
      f. Mesh Zippering
      g. Experimental Results
    6.5.4 Deep Fine-Details Regression
      a. Representation
      b. Fine-Details Regression Training
      c. Experimental Results
  6.6 Conclusions

7 Conclusions and Future Work
  7.1 Contributions
  7.2 Future Work

Reference List

List of Tables

3.1 SoftMax template fusion for score pooling vs. other standard fusion techniques on the IJB-A benchmark
3.2 Effect of each augmentation on IJB-A performance on verification (ROC) and identification (CMC), resp. Only in-plane aligned images used in these tests.
3.3 Effect of in-plane alignment and pose synthesis at test time (matching) on the IJB-A dataset, respectively for verification (ROC) and identification (CMC).
3.4 Comparative performance analysis on JANUS CS2 and IJB-A, respectively for verification (ROC) and identification (CMC)
4.1 Overview of render times and various properties of recent face rendering methods.
4.2 Comparison of various pipeline components: the impact of face context on recognition, pooling across synthesized images, and incremental training.
4.3 Performance analysis on JANUS CS2 and IJB-A, respectively for verification (ROC) and identification (CMC).
4.4 Summary of COW dataset versions and their components.
4.5 Face recognition performance on IJB-A for verification (ROC) and identification (CMC) when using different training data.
4.6 Summary of augmentation transformation parameters used to train our FPN.
4.7 Verification and identification results on IJB-A and IJB-B, compared to landmark detection based face alignment methods.
5.1 Belief table in a single evidence case
5.2 3D face tracking results on the ICT-3DHP dataset
5.3 3D face tracking results on the BU dataset
5.4 3D face tracking results on the Biwi dataset
5.5 Speed of the 3-D face tracker in real-time tracking mode
5.6 Error measure on the reconstructed 3D face models from cooperative videos in the MICC dataset [10].
5.7 3D-3D matching performance on the Multi-PIE dataset.
5.8 3D-3D matching performance on the CASIA-WebFace dataset [156].
5.9 3D-3D matching performance on the IJB-A dataset in comparison to some popular 2D systems.
6.1 3D estimation accuracy and per-image speed on the MICC dataset
6.2 LFW and YTF face verification
6.3 IJB-A face verification and recognition.
6.4 2D-3D face recognition fusion on IJB-A.
6.5 Network structure for bump map regression.

List of Figures

1.1 Two tasks in the face recognition problem.
1.2 The 3D face modeling problem.
1.3 Challenges in face recognition and 3D face reconstruction
2.1 IJB-A dataset with template-based matching protocols.
2.2 Triangulation-based 3D face modeling approaches.
2.3 Shape-from-shading 3D face modeling methods.
2.4 Deformation-based 3D face modeling approaches.
2.5 3D Morphable Model fitting [22, 109].
3.1 Adding pose variations by synthesizing novel viewpoints
3.2 Adding shape variations by rendering with reference 3D face shapes
3.3 Expression synthesis examples
3.4 Effect of 3D-based augmentation on the training dataset in size and distribution
3.5 Ablation study of our data synthesis and test-time matching methods on IJB-A.
3.6 LFW verification results
4.1 Python code snippet for 3D rendering.
4.2 Preparing a generic 3D model: head added to a generic 3D face (top) along with two planes representing the background.
4.3 Face representation: given an input image (left), it is rendered to novel viewpoints as well as in-plane aligned. These are all encoded by our CNN. The CNN features are then pooled by element-wise average, obtaining the final representation.
4.4 ROC curves on the IJB-A verification for (left) the use of context and (right) incremental training. See also Tab. 4.2.
4.5 Filtering mislabeled data on COW.
4.6 Face recognition performance on IJB-A for verification (ROC) and identification (CMC) when using different training data.
4.7 The problem with manually labeled ground-truth facial landmarks.
4.8 Augmenting appearances of images from the VGG face dataset.
4.9 Example augmented training images.
4.10 Verification and identification results on IJB-A and IJB-B.
4.11 Qualitative landmark detection examples in the 300W dataset
4.12 68-point detection accuracies on 300W.
5.1 The 3D Morphable Model used in our system
5.2 3DMM fitting on a single input image
5.3 Landmark detection with CLNF
5.4 Effect of the line-search optimization strategy
5.5 Overview of the proposed framework. In this section, we describe the main modules: initialization with 3-D face modeling, 3-D pose estimation & validation, and re-acquisition
5.6 Comparison between SURF and the dynamic random points strategy
5.7 Re-acquisition techniques
5.8 Evaluation of the proposed techniques in the validation test with the ICT-3DHP dataset
5.9 Experiments on recorded videos by GAVAM+CLM (top) and the 3-D face tracker (bottom) (1).
5.10 Experiments on recorded videos by GAVAM+CLM (top) and the 3-D face tracker (bottom) (2).
5.11 HOG-based pose quality measurement
5.12 3D reconstruction results on video input
5.13 3D face modeling results on the MICC dataset
5.14 3D-3D matching performance on the Multi-PIE dataset [50].
5.15 3D-3D matching performance on the CASIA-WebFace dataset.
5.16 3D-3D matching performance on the CASIA-WebFace dataset w.r.t. the maximum number of modeling images per template.
5.17 3D Morphable Model fitting on multiple images
5.18 3D-3D matching performance on the IJB-A dataset.
5.19 3D-3D matching performance on the IJB-A dataset w.r.t. the minimum number of images per template.
6.1 Unconstrained, single-view, 3D face shape reconstruction.
6.2 Overview of our process.
6.3 Effect of our loss function
6.4 Qualitative comparison of surface errors, visualized as heat maps with real-world mm errors on MICC face videos.
6.5 Face verification and recognition results. From left to right: verification ROC curves for LFW, YTF, and IJB-A, and the recognition CMC for IJB-A.
6.6 Qualitative results on 3DMM-CNN
6.7 2D-3D face recognition fusion on IJB-A.
6.8 Qualitative results on deep 3D expression estimation on LFW
6.9 Qualitative results on fine-detail estimation with shape-from-shading.
6.10 Occlusion handling process in 3D face refinement
6.11 Pixel-grid mesh vs. BFM-structured mesh.
6.12 3D face modeling and refinement with occlusion handling.
6.13 Bump map definition
6.14 Overview of our deep fine-detail regression system.
6.15 Qualitative results of our fine-detail regression method

Abstract

Face recognition and 3D face modeling are key problems in computer vision with many applications in biometrics, human-computer interaction, surveillance, entertainment, and more. While we have witnessed improvements over the last few years, open problems remain when images and videos in the wild are considered. In this dissertation, we discuss how to address these problems effectively, as well as the connection between them. First, face recognition must address appearance changes due to 3D factors, such as head pose, face shape, and expression. Second, 3D face modeling should recover a stable and recognizable 3D shape.

The first part of this thesis focuses on face recognition in the wild. We show that by coupling 3D face augmentation with a state-of-the-art 2D face recognition engine, we can greatly boost recognition accuracy. Our 3D face augmentation synthesizes facial images with different 3D head poses, 3D shapes, and expressions, thereby making our system robust to the facial variations introduced by these factors. Our end-to-end system shows state-of-the-art performance on the latest challenging face recognition benchmarks.
We also present additional novel techniques that enhance the proposed system, from speeding up rendering and matching to a complete landmark-free pipeline, which make our system scalable and robust to very large training data and further break in-the-wild recognition records.

Inferring the accurate 3D geometry of a face from one or more images is a challenging problem. In the second part of this thesis, we present robust methods to build 3D morphable face models (3DMMs) and validate their quality with face recognition tests. First, we define the state of the art of traditional analysis-by-synthesis 3DMM methods. In particular, we investigate the impact of multiple inputs on the 3D modeling results in both accuracy and distinctiveness. From this observation, we then generate a large amount of 3D "ground-truth" faces and train a convolutional neural network (CNN) to regress 3D shape and texture directly from any single input photo. The 3D estimates produced by our CNN surpass state-of-the-art 3D reconstruction accuracy. Our CNN also shows the first competitive results on face recognition benchmarks using 3D face shapes as representations, rather than the somewhat opaque deep features used by other systems. Finally, we introduce additional techniques that push 3D face reconstruction to the next level, estimating expression in 3D as well as fine-grained details of the face, aiming towards laser-scan quality in the wild.

Chapter 1: Introduction

1.1 Problem Statement

The human face is one of the main research targets in many computer science fields, such as computer vision, computer graphics, biometrics, and human-computer interaction (HCI). The study of the face includes face detection and tracking, face recognition, face attribute analysis, face modeling, and animation. In this work, we focus on the most crucial tasks: face recognition and face modeling.

Face recognition is often used in security systems as one of the most important biometrics, besides fingerprint and iris recognition. It is also one of the key components in video surveillance, where tracking and recognizing people are the main tasks. Nowadays, it is also widely used in human-computer interaction and entertainment. Many large systems, such as those of Facebook, Samsung, and Apple, already include it among their modules.

Face recognition consists of two different tasks: identification and verification. In the identification task, given a set of images and/or videos showing the face of a person (the probe), we need to identify that person according to a prior facial dataset (the gallery). In the verification task, given two imagery inputs (e.g., two images), we need to decide whether they show the same person or not. Even though these tasks are very different, in this dissertation we aim to build a single framework that can solve both of them effectively (Fig. 1.1).

Figure 1.1: Two tasks in the face recognition problem.

Face modeling, in contrast, focuses on recovering the underlying facial geometry. An accurate face model can be used as input for many other tasks, such as rendering and animation, visual avatars, face attribute analysis, and even face recognition. While there are some studies on 2D face modeling, 3D face modeling is much more effective due to the intrinsic 3D nature of the human face and of camera models. The input data for each human subject can be a single image, multiple images, or a video (Fig. 1.2).

Figure 1.2: The 3D face modeling problem.

Despite their different focuses, the two topics are closely connected. While most of the challenges in face recognition come from facial appearance changes due to 3D factors (e.g., 3D views, 3D shape, expressions), a 3D face model provides an important basis for solving this problem. In turn, a good 3D face modeling method needs to capture the distinctive features of each subject's face, and hence the quality of the reconstructed face models should be verified by face recognition systems. Therefore, besides investigating solutions to each problem, we also analyze and exploit their relationship.

For the scope of this dissertation, we focus only on 2D/3D face recognition and 3D face modeling from RGB inputs. Driven by real-life demand, we focus on in-the-wild data, which are captured under uncontrolled conditions and with varying quality. We assume that the detected faces are given, since face detection is not our focus.

1.2 Issues

1.2.1 Face Recognition

Face recognition in the wild is not easy. Many factors make this task challenging:

Figure 1.3: Challenges in face recognition and 3D face reconstruction. (a) Pose; (b) Illumination; (c) Expression; (d) Image quality.

• Pose: Images and videos can be taken from different views and distances, so we may get completely different 2D shapes from captured data of the same person. For instance, in Fig. 1.3(a) we have two images of the same person appearing in a frontal and a half-profile view. Without 3D inference and pose alignment, it is tough for the computer to correlate these images for the recognition task. Pose also comes with self-occlusion, making the problem even harder.

• Illumination: Images and videos can be captured under different lighting conditions. Different light colors and brightness levels change the color tone and pixel intensities, while shadows and specular spots can add noise to the image. Fig. 1.3(b) illustrates this problem: despite belonging to the same person, two images taken under different lighting configurations look very dissimilar.

• Expression: Expressions cause different deformations of the human face, especially in the mouth, jaw, and eye regions (see Fig. 1.3(c)). A strong face recognition system needs to be robust to these changes in appearance.

• Image quality: Data captured in the wild can have very low quality. We expect to handle images with various resolutions, blurriness, and noise levels (Fig. 1.3(d)).

There are other issues challenging face recognition, such as occlusions, hair, makeup, or age-related changes. However, they do not appear frequently. Hence, in the scope of this dissertation, we focus on handling the main issues above.

1.2.2 3D Face Reconstruction

The 3D face reconstruction problem is affected by the same factors:

• Pose: 3D head pose is an important factor that must be handled in the 3D modeling process. It defines the location and orientation of the face in camera coordinates. It also determines which parts of the subject's face are visible or hidden. For example, frontal images barely provide information about the face depth, while profile images lack information about the frontal face shape. When working with multiple images or video frames, 3D pose is particularly important for connecting the inputs in order to build a single, complete 3D face model.

• Illumination: Illumination is an important factor in the image formation process and cannot be ignored when investigating pixel intensities. On the one hand, it creates hurdles by varying the facial appearance across images and video frames. On the other hand, it provides extra information about the face normals, which has been widely exploited in shape-from-shading approaches.

• Expression: Expression introduces noise into facial data by deforming different face parts. We need to remove it in order to rebuild a consistent 3D face shape.

• Image quality: Data captured in the wild often have very low quality. Hence, traditional data-driven approaches (e.g., stereo, shape-from-shading) do not work on these inputs.

1.3 Contributions

In this dissertation, we propose robust and effective solutions for 2D face recognition and 3D face modeling in the wild. For 2D face recognition, this dissertation provides four key contributions relative to state-of-the-art methods:

• 3D face augmentation: To handle facial variations due to 3D factors such as head pose, shape, and expression, we propose the use of 3D face augmentation inside a state-of-the-art face recognition framework. In training, each training image is re-rendered in different 3D configurations. This enriches the training data required by a data-hungry deep neural network and strengthens the network's ability to handle these facial changes. In testing, the same facial synthesis is used to bring faces to comparable conditions for better matching. This 3D augmentation, in both training and testing, significantly boosts the recognition power of the trained face recognizer.

• Rapid face synthesis and matching: Traditional 3D face re-rendering is slow, which creates a performance bottleneck in the 3D face augmentation approach. We propose a novel 3D face synthesis method that performs 3D face augmentation at the speed and computational cost of 2D image warping. We also propose the use of pooled synthesized faces in matching, which proves to be faster and more effective than view-based matching. These techniques make our system scalable to large-scale facial data.

• A very large scale training set: To push face recognition performance further and to test the scalability of our framework, we introduce the COW dataset, the largest clean, public facial dataset. Our system trained on COW decisively breaks the matching records on a challenging in-the-wild face recognition benchmark.

• Landmark-free face alignment: We further improve our system by regressing the 3D head pose directly from input images, rather than using an expensive landmark detector. We show that accurate landmarks are not required in a face recognition system, and that our landmark-free pipeline provides even better matching performance than any landmark-based method.

We also make five contributions to the study of 3D face modeling:

• An effective 3D modeling approach for multiple images: Inspired by a recent paper [103], we investigate a simple but robust method for 3D face modeling from multiple images. It first estimates a 3D model fit on each single input, then combines the fits using a confidence measure. We go beyond the original paper by providing a series of experiments on in-the-wild datasets that prove the effectiveness of this method. The experimental results indicate an ability to rebuild accurate 3D face shape and identity, given enough input data.

• A novel 3D face tracker for video: In order to perform 3D face modeling on video, we developed a 3D face tracker that precisely estimates the 3D head pose in each video frame. This real-time face tracker shows state-of-the-art 3D head pose estimation on video in accuracy, coverage, and robustness.

• A deep 3D face modeling method: After investigating analysis-by-synthesis methods, we designed a deep-learning based approach that regresses a 3D face model directly from any single input image. Our CNN is both robust and discriminative, providing state-of-the-art 3D face modeling in accuracy and distinctiveness.

• 3D face modeling evaluation by face recognition tests: To evaluate the quality of 3D reconstruction results, traditional methods require facial datasets with 3D ground-truth models, which are very limited in number, size, and capturing conditions. Instead, starting from the observation that a good 3D face needs to be recognizable, we propose the use of face recognition tests as an effective quality measure. This extends our evaluation to any 2D facial dataset, particularly challenging in-the-wild sets. While traditional face modeling methods fail this test by producing either overly generic or unstable 3D shapes, our deep 3D face modeler provides stable and discriminative 3D models and, for the first time, shows competitive face recognition results on in-the-wild data.

• Image-dependent 3D component recovery: We also discuss additional techniques to recover image-dependent 3D components such as expression and fine details. These techniques refine the final 3D models, make them more realistic, and bring them towards 3D depth-scan quality.

Besides these theoretical contributions, we also provide practical contributions by releasing our implementations:

• ResFace-101: the CNN model of our face recognition system described in Section 4.1 [1].
• Rapid face synthesis code: our Python face-specific augmentation code [1].
• FacePoseNet: our CNN model and demo code of FacePoseNet for landmark-free face alignment, described in Section 4.3 [2].
• 3DMM-CNN: our code and CNN model of the deep 3D face modeling method described in Chapter 6 [3].
• COW dataset: the list of images and corresponding subject IDs in our CleanCOW dataset (coming soon).

[1] http://www.openu.ac.il/home/hassner/projects/augmented_faces
[2] https://github.com/fengju514/Face-Pose-Net
[3] http://www.openu.ac.il/home/hassner/projects/CNN3DMM

1.4 Dissertation Outline

The rest of this dissertation is organized as follows. In Chapter 2, we review related work for both the face recognition and 3D face reconstruction problems. In Chapter 3, we propose the use of 3D face augmentation to boost matching performance in a state-of-the-art face recognition framework. In Chapter 4, we discuss techniques that leverage this face recognition system to achieve state-of-the-art performance on in-the-wild data. In Chapter 5, we revise an existing 3D reconstruction algorithm to obtain an automatic engine that works on any imagery input. In Chapter 6, we propose a deep-learning based method for 3D face modeling that is both robust and discriminative. In Chapter 7, we conclude the dissertation with contributions and future work.
Chapter 2: Related Work

In this chapter, we review related work as well as state-of-the-art techniques for both the face recognition and 3D face reconstruction problems.

2.1 Face Recognition

In this part, we first provide an overview of state-of-the-art face recognition systems (Section 2.1.1). Then, we discuss existing work on data augmentation (Section 2.1.2) and face synthesis (Section 2.1.3), two central techniques for enhancing system performance with the use of reference 3D shapes.

2.1.1 Face Recognition Systems

Face recognition is one of the central problems in computer vision and, as such, work on this problem is extensive. Classical studies focus on extracting local descriptors, either texture-based [7, 97], edge-based [25, 91, 141], or both [133]. They also use various simple machine learning techniques [95, 49, 47]. Nonetheless, these naive approaches limit their scope to controlled datasets and thus perform poorly in the wild [143].

As with many other computer vision problems, face recognition performance skyrocketed with the introduction of deep learning techniques, in particular convolutional neural networks (CNNs). Though CNNs have been used for face recognition as far back as [74], only when massive amounts of data became available did their performance soar. This was originally demonstrated by the Facebook DeepFace system [131], which used an architecture not unlike the one of [74] but trained it on over 4 million images, obtaining far more impressive results. Since then, CNN-based recognition systems have continuously crossed performance barriers, with notable examples including the DeepID 1-3 systems [128, 126, 127]. They, and many others since, developed and trained their systems with far fewer training images, at the cost of somewhat more elaborate network architectures.

Though novel network architecture designs can lead to better performance, further improvement can be achieved by collecting more training data. This has been demonstrated by the Google FaceNet team [118], who developed and trained their system on 200 million images. Besides improving results, they also offered a fascinating analysis of the consequences of adding more data: there is apparently a significant diminishing-returns effect when training with increasing image numbers. Thus, the leap in performance obtained by going from thousands of images to millions is substantial, but increasing the numbers further provides smaller and smaller benefits. One way to explain this is that the data they and others used suffers from a long-tail phenomenon [156], where most subjects in these huge datasets have very few images available for the network to learn intra-subject appearance variations from.

These methods were all evaluated on the Labeled Faces in the Wild (LFW) dataset, which has for some time been the de facto standard for measuring face recognition performance. Many of these LFW results, however, already reach near-perfect performance, suggesting that LFW is no longer a challenging benchmark for today's systems. Another relevant benchmark, also frequently used to report performance, is the YouTube Faces (YTF) set [145]. It contains unconstrained face videos rather than images, but it too is quickly becoming saturated.

Nevertheless, face recognition in the wild is still an open problem. Recently, a newly introduced in-the-wild benchmark named Janus [68] revived the competition. It offers several novelties compared to existing sets, including template-based, rather than image-based, recognition and a mix of both images and videos. It is also tougher than previous collections, as illustrated in Fig. 2.1. Recent papers [139, 31] show that even with complex deep-learning based methods, matching performance on this benchmark drops back to 70%-80%. This implies that deep learning alone is not enough; we need more innovative ideas to make machine performance match human standards on uncontrolled data.

Figure 2.1: IJB-A dataset with template-based matching protocols.

2.1.2 Data Augmentation

Data augmentation techniques are transformations applied to the images used for training or testing without altering their labels. Such methods are well known to improve the performance of CNN-based methods and prevent overfitting [30]. These methods, however, typically involve generic image processing operations that do not exploit knowledge of the underlying problem domain to synthesize new appearance variations.

Popular augmentation methods include simple geometric transformations such as oversampling (multiple, translated versions of the input image obtained by cropping at different offsets) [71, 79], mirroring (horizontal flipping) [30, 152], and rotating the images [148], as well as various photometric transformations [71, 123, 45].

Surprisingly, despite augmentation being widely recognized as highly beneficial to the training of CNN systems, we are unaware of previous attempts to go beyond these simple transformations as we propose doing. One notable exception is the recent work of [88], which proposes to augment training data for a person re-identification network by replacing image backgrounds. We propose a far more elaborate, yet easily accessible, means of data augmentation.

Finally, we note that the recent work of [149] describes a so-called task-specific data augmentation method. They, as well as [150], do not synthesize new data as we propose to do here, but rather offer additional means of collecting images from the Internet to improve learning in fine-grained recognition tasks. This is, of course, very different from our own approach.

2.1.3 Face Synthesis for Face Recognition

The idea that face images can be synthetically generated in order to aid face recognition systems is not new. To our knowledge, it was originally proposed in [53] and then effectively used by [131] and [54]. Contrary to us, they all produced frontal faces, which are presumably better aligned and easier to compare. They did not use other transformations to generate new images (e.g., other poses or facial expressions). More importantly, their images were used to reduce appearance variability, whereas we propose the opposite: to dramatically increase it, improving both training and testing.

2.2 3D Face Modeling

There have been many studies on 3D face modeling from a single image, multiple images, or videos [8, 160, 37, 83, 135, 9, 65, 64, 53]. Roughly speaking, we can categorize them into four groups.

2.2.1 Triangulation-based Approaches

Many practical systems for capturing 3D faces are based on the triangulation principle. Structured lighting systems, laser scanners, and stereo systems rely on correspondences between pixels from different images to triangulate the corresponding 3D points. Chen and Medioni show 3D face models from a pair of stereo images [33]. Later, Medioni and Pesenti [90], and Medioni et al. [89], demonstrate that 3D face models can be reconstructed using video data taken by a single camera.

Figure 2.2: Triangulation-based 3D face modeling approaches. (a) Light Stage [8]; (b) Structure-from-Motion [83].

State-of-the-art methods that provide very high-quality 3D face models require special equipment or a studio-based capturing environment [8, 24, 136]. In [8], the user has to be scanned in a ball-shaped light stage with 156 LED lights that captures the face's geometry and reflectance (Fig. 2.2(a)). [24] uses a setup of 14 high-definition video cameras to capture small patches of the face surface, then applies an iterative binocular stereo method to reconstruct the model. [136] requires the users to wear reflective markers on their faces and uses a motion capture system to create and animate the models. Many commercial systems also require expensive sensing hardware, such as 3dMD [5].

One of the most successful 3D modeling approaches is Structure-from-Motion (SfM). SfM collects correspondences from multiple frames, then reconstructs the 3D model using global optimization techniques. Although this approach has a different formulation, we place it in this category since it shares a similar configuration with other triangulation-based approaches. SfM methods have provided very accurate results on high-resolution images. For instance, Lin et al. can reconstruct an accurate 3D face model from a set of weakly calibrated images [83], as can be seen in Fig. 2.2(b).

Nonetheless, triangulation-based approaches share the crucial task of finding point correspondences for each pair of input images. This task becomes demanding, especially on low-quality data, making these approaches impractical on in-the-wild datasets.

2.2.2 Shape from X Methods

These methods exploit specific properties of the scene, such as focus [93], symmetry [44], or shading [16, 64]. Shape-from-shading (SfS) is the most popular approach in this group; it relies on the reflectance properties of facial skin. It first estimates the illumination model, then infers the face vertex normals, and finally recovers the 3D face shape [16]. Most methods assume the human face is a Lambertian surface [16, 64], while some recent studies consider more complex skin reflectance models [81]. While SfS is designed for a single input image (Fig. 2.3(a)), some efforts expand it to work on multiple images [65] (Fig. 2.3(b)) and video [129].

Figure 2.3: Shape-from-shading 3D face modeling methods. (a) From a single image [64]; (b) From multiple images [65].

While providing realistic details, SfS methods are weak at deforming the global face shape, so a good initial 3D face model is required. SfS methods are also sensitive to background clutter, occlusions, blurriness, and noise; hence they are not applicable to low-quality input.

2.2.3 Deformation-based Approaches

Many proposed approaches for 3D face reconstruction use a generic model as prior knowledge about the structure of the human face. This model is then deformed to best fit the facial image. Diverse sources of information have been used to define the fitting criteria, including silhouettes [109], facial landmarks [9], image descriptors, and image intensity.

Figure 2.4: Deformation-based 3D face modeling approaches. (a) Landmark-based; (b) Flow-based [53].

Early work used proportions between facial landmarks as a means to reconstruct faces (Fig. 2.4(a)). In [134], Tang and Huang located 32 feature points in several frames and related them to the vertices of a generic 3D model. The deformed 3D model was reasonable but too sparse to accurately capture the facial shape.

Instead of depending purely on landmark points, later work tried to exploit all pixel intensities. One common approach is to use 2D flow to relate the imagery data to the generic reference. For instance, Hassner [53] presented a single-view face reconstruction method for near-frontal facial images, as illustrated in Fig. 2.4. First, it estimated the 3D head pose, then rendered both appearance (RGB) and depth channels using a generic 3D face model. Next, the system matched pixels in the rendered view of the generic model to pixels in the input image and used these correspondences to assign the query face depth values of its own. This process is done iteratively, using coordinate descent to jointly optimize the similarity of the generic appearance and depth to the appearance and estimated depth of the query. It shows visually good 3D modeling results, especially in frontal regions. However, the face depth is poorly estimated, since it is inferred from a generic model.

2.2.4 Template-based Approaches

While other works place loose constraints on the output shape, template-based methods enforce it to be a human face. The 3D Morphable Model (3DMM) [22] is a densely sampled statistical model learned from laser scans that includes both shape and texture. In the original paper by Blanz and Vetter (Fig. 2.5), the model is fitted to a single image using a non-linear optimization technique along with a shading-based distance measure. This approach was later refined to also exploit edge information [109]. The 3DMM became publicly available with the release of the Basel Face Model by Paysan et al. [102] and is often considered state of the art. In [9], Amberg et al. propose to fit a morphable model to two images in a stereo-vision setup; silhouette and landmark information are included in the optimization process. Similarly, Le et al. [76] used a linear morphable model on two stereo images to accurately reconstruct the 3D face shape.

Figure 2.5: 3D Morphable Model fitting [22, 109].

Compared to other approaches, template-based approaches have many advantages. The output shape is constrained to be a human face, guaranteeing that the method works on noisy input. Also, the 3D mesh is well structured, so we can easily re-use it for other applications such as 3D face matching, blending, and animation.

Inspired by previous work, in Chapter 5 we propose an extended version of the 3D Morphable Model fitting algorithm that can work on a variety of in-the-wild input. Then, in Chapter 6, we leverage this method through the use of a deep neural network, resulting in a novel, state-of-the-art technique for 3D face modeling from images in the wild.

Part I: Face Recognition

Chapter 3: 2D Face Recognition Aided by 3D Face Augmentation

[This chapter is joint work with Iacopo Masi, Prof. Tal Hassner, Jatuporn Toy Leksut, and Prof. Gerard Medioni. It was published as "Do we really need to collect millions of faces for effective face recognition?" in European Conf. Comput. Vision, pp. 579-596, Springer International Publishing, 2016.]

In this chapter, we investigate the potential of improving a state-of-the-art 2D facial matching engine by associating it with reference 3D face shapes. First, Section 3.1 introduces face synthesis, the technique we use to integrate 3D face shapes into a 2D face recognition system. Then, in Section 3.2, we present our 2D facial matching engine and show how face synthesis can help in both training and testing. Finally, the proposed methods are validated by comprehensive experiments (Section 3.3) and concluded (Section 3.4). We show that even with generic 3D models, this approach tremendously lifts system performance and achieves state-of-the-art results on in-the-wild datasets.

3.1 Data Augmentation by Synthesizing Faces

Modern face recognition systems are built on convolutional neural networks (CNNs), a deep learning technique designed for imagery data (see [77, 71]). Their power is based on a complex network design and the underlying ability to learn from massive training sets. Getting more training data is one effective way to empower these systems. In this section, we detail our approach to augmenting a generic face dataset. We use the CASIA WebFace collection [156], enriching it with substantially more per-subject appearance variations, yet without changing subject labels or losing meaningful information. Specifically, we propose to generate (synthesize) new face images by introducing the following face-specific appearance variations:

1. Pose: simulating face image appearances across unseen 3D viewpoints.
2. Shape: producing facial appearances using different 3D generic face shapes.
3. Expression: specifically, simulating closed-mouth expressions.

(1) can be considered an extension of frontalization techniques [54] to multiple views. Conceptually, they rendered new views to reduce variability for better alignment, whereas we do so to increase variability and better capture intra-subject appearance variations. To avoid over-fitting to a specific face shape, in (2) we inflate the training set using different 3D generic face shapes. Ideally, using accurate per-subject 3D models would be a better option, and we present our plan to replace (2) with such models in the following chapters.

3.1.1 Pose Variations

In order to generate unseen viewpoints of a face image I, we use a technique similar to the frontalization proposed by [54]. We begin by applying the facial landmark detector from [11]. Given these detected landmarks, we estimate the six-degrees-of-freedom pose of the face in I using correspondences between the detected landmarks $p_i \in \mathbb{R}^2$ and points $P_i := S(i) \in \mathbb{R}^3$ labeled on a 3D generic face model $S$. Here, $i$ indexes specific facial landmarks in I and on the 3D shape $S$.

Figure 3.1: Adding pose variations by synthesizing novel viewpoints. Left: original image, detected landmarks, and 3D pose estimation. Right: rendered novel views.

As mentioned earlier, we use CASIA faces for augmentation. These faces are roughly centered in their images, so detecting face bounding boxes was unnecessary; instead, we used a fixed bounding box determined once beforehand.

Given the corresponding landmarks $p_i \leftrightarrow P_i$, we use PnP [52] to estimate the extrinsic camera parameters, assuming the principal point is at the image center and then refining the focal length by minimizing the landmark re-projection errors. This process gives us a perspective camera model mapping the generic 3D shape $S$ onto the image, $p_i \approx M P_i$, where $M = K\,[R\ t]$ is the camera matrix. Given the estimated pose $M$, we decompose it to obtain a rotation matrix $R \in \mathbb{R}^{3\times3}$ containing the rotation angles of the 3D head shape with respect to the image. We then create new rotation matrices $R' \in \mathbb{R}^{3\times3}$ for unseen (novel) viewpoints by sampling different yaw angles. In particular, since CASIA images are biased towards frontal faces, given an image I we render it at the fixed yaw values $\{0^\circ, 40^\circ, 75^\circ\}$. Rendering itself is derived from [54], including soft symmetry. Fig. 3.1 shows viewpoint (pose) synthesis results for a training subject in CASIA, illustrating the 3D pose estimation process. Note that in practice, faces are rendered with a uniform black background, not shown here (the original image background is not preserved in rendering).
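The pose-estimation step above can be sketched with off-the-shelf tools. The following is a minimal illustration rather than the released implementation: it assumes a landmark detector and the annotated generic model $S$ are already available, uses OpenCV's solvePnP for the PnP step (omitting the focal-length refinement described above), and shows one simple way to form rotation matrices for the fixed target yaw angles. All function and variable names are illustrative.

```python
import numpy as np
import cv2


def estimate_pose(landmarks_2d, landmarks_3d, image_size):
    """Estimate the 6DoF pose of a generic 3D face model from 2D landmarks.

    landmarks_2d: (N, 2) detected facial landmarks p_i.
    landmarks_3d: (N, 3) corresponding points P_i = S(i) on the generic model S.
    image_size:   (height, width) of the input image.
    """
    h, w = image_size
    # Principal point assumed at the image center; this simple focal-length
    # initialization stands in for the re-projection-based refinement.
    focal = float(max(h, w))
    K = np.array([[focal, 0.0, w / 2.0],
                  [0.0, focal, h / 2.0],
                  [0.0, 0.0, 1.0]])
    ok, rvec, tvec = cv2.solvePnP(
        landmarks_3d.astype(np.float64),
        landmarks_2d.astype(np.float64),
        K, None, flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)      # rotation matrix R in R^{3x3}
    return K, R, tvec               # camera matrix M = K [R t]


def novel_view_rotations(R, yaw_degrees=(0.0, 40.0, 75.0)):
    """Build rotation matrices R' for the fixed novel yaw angles (illustrative)."""
    views = []
    for yaw in yaw_degrees:
        c, s = np.cos(np.radians(yaw)), np.sin(np.radians(yaw))
        R_yaw = np.array([[c, 0.0, s],
                          [0.0, 1.0, 0.0],
                          [-s, 0.0, c]])
        views.append(R_yaw @ R)     # compose target yaw with the estimated pose
    return views
```

Each $R'$ would then be handed to the renderer of [54] to produce the corresponding novel view of the textured generic shape.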
3.1.2 3D Shape Variations

Conceptually, to truthfully capture the appearance of a subject's face under different viewpoints, its true 3D form must be used for the rendering. Many works have therefore attempted to estimate this 3D shape from the image directly, prior to frontalization [131]. Because these reconstruction processes are not yet stable, particularly for challenging, unconstrained images, Hassner et al. [54] instead used a single generic 3D face to frontalize all face images. We adapt this idea here, with some extensions, before proposing a more stable 3D shape reconstruction method and the planned integration framework in later chapters.

Rather than using a single generic 3D shape, we extend the procedure described in Sec. 3.1.1 to multiple generic 3D faces. In particular, we add the set of generic 3D shapes $S = \{S_j\}_{j=1}^{10}$ and simply repeat the pose synthesis procedure with these ten shapes rather than a single one. We use the 3D generic shapes from the publicly available Basel 3D face set [102]. It includes ten high-quality 3D face scans captured from different people with different face shapes; the subjects vary in gender, age, and weight. The models are also well aligned to each other, so 3D landmarks need to be selected only once, on one of the 3D faces, and can then be transferred directly to the other nine models.

Figure 3.2: Top: the ten generic 3D face shapes used for rendering. Bottom: faces rendered with the generic shape appearing right above them. Different shapes induce subtle appearance variations that keep the classifier from over-fitting to a specific shape.

Fig. 3.2 shows the ten generic models used here, along with images rendered to a near-profile view using each of these shapes. Visually, the subjects in these images remain identifiable despite the different underlying 3D shapes, meeting the augmentation requirement of not changing subject labels. Yet each image is slightly but noticeably different from the rest, introducing appearance variations into this subject's image set.

3.1.3 Expression Variations

[This part was done by Jatuporn Toy Leksut.]

In addition to pose and shape, we also synthesize expression variations, specifically reducing deformations around the mouth. Given a face image I, its 2D detected landmarks $p_i$, and the estimated pose (Sec. 3.1.1), we estimate the facial expression by fitting a 3D expression blendshape, similarly to [80]. This is a linear combination of 3D generic face models with various basis expressions, including mouth-closed, mouth-open, and smile. After aligning the 3D face model and the 2D face image in both pose and expression, we perform image-based texture mapping to register the face texture onto the model. This is a quick way to assign texture to the face model given that only one image is available. To synthesize an expression, we manipulate the 3D textured face model to exhibit the new expression and render it back into the original image. This technique allows us to render a normalized expression while other image details, including hair and background, remain unchanged. In our experiments, we do this to produce images with closed mouths.

Figure 3.3: Expression synthesis examples. Top: example face images from the CASIA WebFace dataset. Bottom: synthesized images with closed mouths.

Some example synthesis results are provided in Fig. 3.3. Though this process sometimes introduces slight artifacts (some can be seen in Fig. 3.3), they typically do not alter the general facial appearance and are less pronounced than the noise often present in unconstrained images.
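The blendshape fitting of [80] is more involved than can be shown here; the sketch below only illustrates the underlying idea of explaining the observed landmarks as a neutral shape plus a weighted sum of expression basis shapes, solved by least squares. The names, the clipping range, and the way mouth-related weights are neutralized are illustrative assumptions, not the actual procedure.

```python
import numpy as np


def fit_expression_blendshapes(landmarks_3d_observed, neutral, expr_basis):
    """Fit expression blendshape weights by least squares (illustrative).

    landmarks_3d_observed: (N, 3) landmark positions after pose alignment.
    neutral:               (N, 3) the same landmarks on the neutral generic model.
    expr_basis:            (K, N, 3) landmark displacements of K basis expressions
                           (e.g. mouth-open, mouth-closed, smile).
    Returns weights w such that observed ~= neutral + sum_k w[k] * expr_basis[k].
    """
    A = expr_basis.reshape(expr_basis.shape[0], -1).T   # (3N, K) design matrix
    b = (landmarks_3d_observed - neutral).reshape(-1)   # (3N,) residual to explain
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.clip(w, 0.0, 1.0)  # keep weights in a plausible blendshape range


# To neutralize the mouth, the fitted weights of the mouth-related basis shapes
# can simply be zeroed before the textured model is re-rendered into the image.
```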
3.2 Face Recognition Pipeline

Data augmentation techniques are not restricted to training and are often also applied at test time. Our augmentations provide opportunities to modify the matching process by using different augmented versions of the input image. We next describe our recognition pipeline, including these and other novel aspects.

3.2.1 CNN Training with 3D Augmented Data

Augmented training data: Our pipeline employs a single CNN trained on both real and augmented data generated as described in Sec. 3.1. Specifically, training data is produced from the original CASIA WebFace images. It consists of the following types of images: (i) original CASIA images, aligned by a simple in-plane similarity transform to one of two coordinate systems: roughly frontal-facing faces (yaw estimates in $[-30^\circ, 30^\circ]$) are aligned using nine landmarks on an ideal frontal template, while profile images (all other yaw angles) are aligned using the visible eye and the tip of the nose; (ii) each CASIA image rendered from three novel views at yaw angles $\{0^\circ, 40^\circ, 75^\circ\}$, as described in Sec. 3.1.1; (iii) synthesized views produced by randomly selecting a 3D generic face model from $S$ as the underlying face shape (Sec. 3.1.2), thereby adding shape variations; and (iv) a mouth-neutralized version of each image (Sec. 3.1.3). This process raises the total number of images available for training from 494,414 in the original CASIA WebFace set to 2,472,070 images in our complete (pose + shape + expression) augmented dataset. Note that this process leaves the number of CASIA WebFace subjects unchanged, inflating only the number of images per subject (Fig. 3.4(b)).

Figure 3.4: (a) Comparison of our augmented dataset with other face datasets, along with the average number of images per subject; (b) our improvement, by augmentation, of the distribution of per-subject image counts, countering the long-tail effect of the CASIA set [156]. The statistics of (a) are:

Dataset                #ID       #Img       #Img/#ID
Google [118]           8M        200M       25
Facebook [131]         4,030     4.4M       1K
VGG Face [100]         2,622     2.6M       1K
MegaFace [66]          690,572   1.02M      1.5
CASIA [156]            10,575    494,414    46
Aug. pose+shape        10,575    1,977,656  187
Aug. pose+shape+expr   10,575    2,472,070  234

CNN fine-tuning: We use the very deep, 19-layer VGGNet CNN [123], trained on the large-scale image recognition benchmark (ILSVRC) [110], and fine-tune this network using our augmented data. To this end, we keep all layers $\{W_k, b_k\}_{k=1}^{19}$ of VGGNet except for the last linear layer (FC8), which we train from scratch. This layer maps the embedded feature $x \in \mathbb{R}^D$ (FC7) to the $N = 10{,}575$ subject labels of the augmented dataset. It computes $y = W_{19} x + b_{19}$, where $y \in \mathbb{R}^N$ is the linear response of FC8. Fine-tuning is performed by minimizing the soft-max loss:

$$L(\{W_k, b_k\}) = -\sum_t \ln \frac{e^{y_l}}{\sum_{g=1}^{N} e^{y_g}} \qquad (3.1)$$

where $l$ is the ground-truth index over the $N$ subjects and $t$ indexes all training images. Eq. (3.1) is optimized using stochastic gradient descent (SGD) with a standard L2 penalty over the learned weights. When performing back-propagation, we learn FC8 faster, since it is trained from scratch, while the other network weights are updated with a learning rate an order of magnitude lower than that of FC8.

Specifically, we initialize FC8 with parameters drawn from a Gaussian distribution with zero mean and standard deviation 0.01, and biases initialized to zero. The overall learning rate for the entire CNN is set to 0.001, except for FC8, which uses a learning-rate multiplier of 10. We decrease the learning rate by an order of magnitude once validation accuracy for the fine-tuned network saturates. Meanwhile, biases are learned twice as fast as the other weights. For all other parameter settings, we use the same values as originally described in [71].
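A compact way to reproduce this fine-tuning recipe in a modern framework is to place FC8 in its own optimizer parameter group with a 10x learning rate. The sketch below uses PyTorch and torchvision's VGG-19 purely for illustration (the original system was trained in a different framework); the momentum and weight-decay values and the omission of the separate, faster bias rate are assumptions.

```python
import torch
import torchvision

# Pretrained 19-layer VGG; the last classifier layer plays the role of FC8.
model = torchvision.models.vgg19(weights="IMAGENET1K_V1")
num_subjects = 10575                                    # CASIA WebFace identities
in_features = model.classifier[-1].in_features
fc8 = torch.nn.Linear(in_features, num_subjects)        # new FC8, trained from scratch
torch.nn.init.normal_(fc8.weight, mean=0.0, std=0.01)   # Gaussian init, sigma = 0.01
torch.nn.init.zeros_(fc8.bias)                          # zero bias init
model.classifier[-1] = fc8

base_lr = 0.001
optimizer = torch.optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters()
                    if not n.startswith("classifier.6")], "lr": base_lr},
        {"params": fc8.parameters(), "lr": base_lr * 10},   # FC8 learns 10x faster
    ],
    momentum=0.9, weight_decay=5e-4)    # assumed values for the L2 penalty setup
criterion = torch.nn.CrossEntropyLoss()  # the soft-max loss of Eq. (3.1)

# One SGD step on a batch of augmented face images with subject labels:
#   logits = model(images)
#   loss = criterion(logits, labels)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```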
3.2.2 Face Recognition with Synthesized Faces

General matching process: After training the CNN, we use the embedded feature vector $x = f(I; \{W_k, b_k\})$ of each image I as its face representation. Given two input images $I_p$ and $I_q$, their similarity $s(x_p, x_q)$ is simply the normalized cross-correlation (NCC) of their feature vectors. The value $s(x_p, x_q)$ is the recognition score at the image level.

In some cases a subject is represented by multiple images (e.g., a template, as in the Janus benchmark [68]). This plurality of images can be exploited to improve recognition at test time. In such cases, image sets are defined by $P = \{x_1, \ldots, x_P\}$ and $Q = \{x_1, \ldots, x_Q\}$, and a similarity score $s(P, Q)$ is defined between them. Specifically, we compute the pair-wise image-level similarity scores $s(x_p, x_q)$ for all $x_p \in P$ and $x_q \in Q$, and pool these scores using a SoftMax operator, $s_\beta(P, Q)$ (Eq. (3.2) below). Though the use of SoftMax here is inspired by the SoftMax loss often used by CNNs, its aim is a robust score regression rather than a distribution over subjects. SoftMax for set fusion can be seen as a weighted average in which the weights depend on the scores. It is interesting to note that the SoftMax hyper-parameter $\beta$ controls the trade-off between averaging the scores and taking the maximum (or minimum). That is,

$$s_\beta(\cdot) = \begin{cases} \max(\cdot) & \text{if } \beta \to \infty \\ \operatorname{avg}(\cdot) & \text{if } \beta = 0 \\ \min(\cdot) & \text{if } \beta \to -\infty \end{cases}
\quad\text{and}\quad
s_\beta(P, Q) = \frac{\sum_{p \in P,\, q \in Q} s(x_p, x_q)\, e^{\beta\, s(x_p, x_q)}}{\sum_{p \in P,\, q \in Q} e^{\beta\, s(x_p, x_q)}}. \qquad (3.2)$$

Pair-wise scores are pooled using Eq. (3.2), and we finally average the SoftMax responses over multiple values of $\beta$ to get the final similarity score:

$$s(P, Q) = \frac{1}{21} \sum_{\beta=0}^{20} s_\beta(P, Q). \qquad (3.3)$$

The use of positive values of $\beta$ is motivated by the fact that we are using a score for recognition, so the higher the value, the better. In our experiments we found that the SoftMax operator achieves a remarkable trade-off between averaging the scores and taking the maximum. The improvement given by the proposed SoftMax fusion is shown in Tab. 3.1: the proposed method largely outperforms standard fusion techniques on IJB-A, in which subjects are described by templates.

Table 3.1: SoftMax template fusion for score pooling vs. other standard fusion techniques on the IJB-A benchmark, for verification (ROC) and identification (CMC) respectively.

Fusion   | IJB-A Ver. (TAR)          | IJB-A Id. (Rec. Rate)
         | FAR=0.01   FAR=0.001      | Rank-1   Rank-5   Rank-10
Min      | 26.3       11.2           | 33.1     56.1     66.8
Max      | 77.6       46.4           | 84.8     93.3     95.6
Mean     | 79.9       53.0           | 84.6     94.7     96.6
SoftMax  | 86.6       63.6           | 87.2     94.9     96.9
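Eqs. (3.2)-(3.3) amount to a soft-max weighted average of the pair-wise scores, repeated for the integer values beta = 0..20 and then averaged. A minimal NumPy sketch follows; implementing NCC as the Pearson correlation of the two feature vectors is an assumption about the exact normalization used.

```python
import numpy as np


def ncc(x_p, x_q):
    """Image-level similarity: normalized cross-correlation of CNN features."""
    x_p = (x_p - x_p.mean()) / (x_p.std() + 1e-8)
    x_q = (x_q - x_q.mean()) / (x_q.std() + 1e-8)
    return float(np.dot(x_p, x_q) / x_p.size)


def softmax_pooled_score(P, Q, betas=range(0, 21)):
    """Template-to-template similarity of Eq. (3.2)-(3.3).

    P, Q: lists of per-image feature vectors for the two templates.
    For each beta, the pair-wise scores are pooled with a soft-max weighted
    average; the final score averages the pooled values over beta = 0..20.
    """
    scores = np.array([ncc(x_p, x_q) for x_p in P for x_q in Q])
    pooled = []
    for beta in betas:
        w = np.exp(beta * scores)          # soft-max weights on the raw scores
        pooled.append(np.sum(scores * w) / np.sum(w))
    return float(np.mean(pooled))
```

With beta = 0 this reduces to the mean of the pair-wise scores, and as beta grows it approaches their maximum, which is exactly the trade-off the text describes.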
Exploiting pose augmentation at test time: The Achilles' heel of many face recognition systems is cross-pose face matching, particularly when one of the two images is viewed at an extreme, near-profile angle [155, 82, 43]. Directly matching two images viewed from extremely different viewpoints often leads to poor accuracy, as the difference in viewpoints affects the similarity more than the subject identities do. To mitigate this problem, we suggest rendering both images from the same view, one that is close enough to the viewpoints of both images. To this end, we leverage our pose synthesis method of Sec. 3.1.1 to produce images in poses better suited for recognition and matching.

Cross-pose rendering can, however, come at a price: synthesizing novel views of faces runs the risk of producing meaningless images whenever the facial landmarks are not accurately localized and the pose estimate is wrong. Even if the pose is correctly estimated, warping images across poses involves interpolating intensities, which leads to smoothing artifacts and information loss. Though this may affect training, it is far more serious at test time, where we have few images to compare and ruining one or both can directly affect recognition accuracy.

Rather than committing to pose synthesis or to its standard alternative, simple yet robust in-plane alignment, we propose to use both: we found that pose synthesis and in-plane alignment are complementary, and combining the two alignment techniques improves recognition performance. For an image pair $(I_p, I_q)$ we compute two similarity scores: one using in-plane aligned images and the other using images rendered to a mutually convenient view. This view is determined as follows: if the two images are near frontal, we render them to the frontal view (essentially frontalizing them [54]); if they are both near profile, we render to $75^\circ$; otherwise we render both to $40^\circ$. When matching templates (image sets) $(P, Q)$, the scores computed for in-plane aligned pairs and for pose-synthesized pairs are pooled separately using Eq. (3.2). This is equivalent to comparing the two sets $P$ and $Q$ twice, once for each alignment method. These two similarities are then averaged for the final template-level score.
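The choice of the "mutually convenient" rendering view can be written as a small decision rule. The sketch below is illustrative only: the text specifies the three target views (0, 40, and 75 degrees), but the yaw thresholds used here to decide "near frontal" and "near profile" are assumptions.

```python
def target_view(yaw_p_deg, yaw_q_deg, frontal_limit=30.0, profile_limit=60.0):
    """Pick the common yaw to which both faces are rendered before matching.

    Returns 0.0 (frontalize) if both faces are near frontal, 75.0 if both are
    near profile, and 40.0 (half profile) otherwise. Thresholds are illustrative.
    """
    a, b = abs(yaw_p_deg), abs(yaw_q_deg)
    if a <= frontal_limit and b <= frontal_limit:
        return 0.0      # both near frontal: frontalize both images
    if a >= profile_limit and b >= profile_limit:
        return 75.0     # both near profile
    return 40.0         # mixed case: render both to the half-profile view
```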
The overall learning rate for the entire CNN is set to 0.001, except for FC8, which uses a learning rate ten times higher. We decrease the learning rate by an order of magnitude once the validation accuracy of the fine-tuned network saturates. Meanwhile, biases are learned twice as fast as the other weights. For all other parameter settings we use the same values as originally described in [71].

3.2.2 Face Recognition with Synthesized Faces

General matching process: After training the CNN, we use the embedded feature vector x = f(I; {W_k, b_k}) of each image I as its face representation. Given two input images I_p and I_q, their similarity s(x_p, x_q) is simply the normalized cross correlation (NCC) of their feature vectors. The value s(x_p, x_q) is the recognition score at the image level.

In some cases a subject is represented by multiple images (e.g., a template, as in the Janus benchmark [68]). This plurality of images can be exploited to improve recognition at test time. In such cases, image sets are defined by P = {x_1, ..., x_P} and Q = {x_1, ..., x_Q}, and a similarity score s(P, Q) is defined between them. Specifically, we compute the pair-wise image-level similarity scores s(x_p, x_q) for all x_p in P and x_q in Q, and pool these scores using a SoftMax operator, s_beta(P, Q) (Eq. (3.2), below). Though the use of SoftMax here is inspired by the SoftMax loss often used by CNNs, its aim is to obtain a robust score regression rather than a distribution over the subjects. SoftMax for set fusion can be seen as a weighted average in which the weight depends on the score when performing recognition. It is interesting to note that the SoftMax hyper-parameter beta controls the trade-off between averaging the scores and taking the maximum (or minimum). That is, s_beta(.) tends to max(.) as beta goes to +infinity, equals avg(.) when beta = 0, and tends to min(.) as beta goes to -infinity, with

s_\beta(P, Q) = \frac{\sum_{p \in P, q \in Q} s(x_p, x_q) e^{\beta s(x_p, x_q)}}{\sum_{p \in P, q \in Q} e^{\beta s(x_p, x_q)}}.    (3.2)

Pair-wise scores are pooled using Eq. (3.2), and we finally average the SoftMax responses over multiple values of beta = 0..20 to obtain the final similarity score:

s(P, Q) = \frac{1}{21} \sum_{\beta=0}^{20} s_\beta(P, Q).    (3.3)

The use of positive values of beta is motivated by the fact that we are using a score for recognition, so the higher the value, the better. In our experiments we found that the SoftMax operator reaches a remarkable trade-off between averaging the scores and taking the maximum. The improvement given by the proposed SoftMax fusion is shown in Tab. 3.1: the proposed method largely outperforms standard fusion techniques on IJB-A, in which subjects are described by templates.

Fusion     FAR=0.01  FAR=0.001  Rank-1  Rank-5  Rank-10
Min        26.3      11.2       33.1    56.1    66.8
Max        77.6      46.4       84.8    93.3    95.6
Mean       79.9      53.0       84.6    94.7    96.6
SoftMax    86.6      63.6       87.2    94.9    96.9

Table 3.1: SoftMax template fusion for score pooling vs. other standard fusion techniques on the IJB-A benchmark, for verification (ROC, TAR) and identification (CMC, recognition rate), resp.
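The SoftMax fusion of Eq. (3.2) and Eq. (3.3) reduces to a few lines of numpy. The sketch below assumes the pair-wise NCC scores have already been computed and is only meant to illustrate the pooling, not the full matching pipeline; the function and variable names are ours.

import numpy as np

def softmax_fusion(scores, betas=range(21)):
    """Pool a matrix of pair-wise image scores s(x_p, x_q) into one
    template-to-template similarity, following Eq. (3.2) and Eq. (3.3)."""
    s = np.asarray(scores, dtype=np.float64).ravel()
    pooled = []
    for beta in betas:
        w = np.exp(beta * s)                     # SoftMax weights
        pooled.append((s * w).sum() / w.sum())   # Eq. (3.2)
    return float(np.mean(pooled))                # Eq. (3.3): average over beta

# Toy usage: a 3x2 matrix of NCC scores between a 3-image and a 2-image template.
print(softmax_fusion([[0.2, 0.5], [0.7, 0.1], [0.4, 0.3]]))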
Exploiting pose augmentation at test time: The Achilles heel of many face recognition systems is cross-pose face matching, particularly when one of the two images is viewed at an extreme, near-profile angle [155, 82, 43]. Directly matching two images viewed from extremely different viewpoints often leads to poor accuracy, as the difference in viewpoints affects the similarity more than subject identity. To mitigate this problem, we suggest rendering both images from the same view: one that is close enough to the viewpoints of both images. To this end, we leverage our pose synthesis method of Sec. 3.1.1 to produce images in poses better suited for recognition and matching.

Cross-pose rendering can, however, come at a price: synthesizing novel views of faces runs the risk of producing meaningless images whenever facial landmarks are not accurately localized and the pose estimate is wrong. Even if the pose is correctly estimated, warping images across poses involves interpolating intensities, which leads to smoothing artifacts and information loss. Though this may affect training, it is far more serious at test time, where we have few images to compare and ruining one or both can directly affect recognition accuracy.

Rather than commit to pose synthesis or to its standard alternative, simple yet robust in-plane alignment, we propose to use both: we found that pose synthesis and in-plane alignment are complementary, and by combining the two alignment techniques recognition performance improves. For an image pair (I_p, I_q) we compute two similarity scores. One score is produced using in-plane aligned images and the other using images rendered to a mutually convenient view. This view is determined as follows: if the two images are near frontal, we render them to frontal view (essentially frontalizing them [54]); if they are both near profile, we render to 75 deg; otherwise we render both to 40 deg. When matching templates (image sets) (P, Q), scores computed for in-plane aligned image pairs and for pose-synthesized pairs are pooled separately using Eq. (3.2). This is equivalent to comparing the two sets P and Q twice, once for each alignment method. These two similarities are then averaged for the final template-level score.

3.3 Experiments

We tested our approach extensively on the recently released IARPA Janus benchmarks [68] and LFW [58]. We perform a minimum of database-specific training, using only the training images prescribed by each benchmark protocol. Specifically, we perform Principal Component Analysis (PCA) on the training images of the target dataset with the features x extracted from the CNN trained on augmented data. This did not include dimensionality reduction; we did not discard any components after the PCA projection. Following this, we apply root normalization to the projected features, i.e., x -> x^c, as previously proposed for the Fisher Vector encoding in [113]. We found that a value of c = 0.65 provides a good baseline across all the experiments. For each dataset we report the contribution of each augmentation technique, compared with state-of-the-art methods which use millions of images to train their deep models.
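As a rough illustration of this feature post-processing, the sketch below applies a full-rank PCA followed by the x -> x^c root normalization with c = 0.65. Treating the power as a signed power (so that negative PCA coefficients keep their sign) is our assumption, since the text does not spell out how negative values are handled; all names here are illustrative.

import numpy as np

def fit_pca(features):
    """Full-rank PCA (no components discarded), fit on the benchmark's training features."""
    mean = features.mean(axis=0)
    # Rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, Vt

def project_and_root(x, mean, Vt, c=0.65):
    """Project a CNN feature and apply root normalization x -> x^c (signed power assumed)."""
    z = Vt @ (x - mean)
    return np.sign(z) * np.abs(z) ** c

# Toy usage with random 'training' features and one test feature.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 32))
mean, Vt = fit_pca(train)
print(project_and_root(rng.normal(size=32), mean, Vt).shape)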
3.3.1 Results on the IJB-A Benchmarks

IJB-A is a new publicly available benchmark released by NIST (IJB-A data and splits are available upon request at http://www.nist.gov/itl/iad/ig/facechallenges.cfm) to raise the challenges of unconstrained face identification and verification methods. Both IJB-A and the Janus CS2 benchmark share the same subject identities, represented by images viewed in extreme conditions, including pose, expression and illumination variations, with IJB-A splits generally considered more difficult than those in CS2. The IJB-A benchmarks consist of face verification (1:1) and face identification (1:N) tests. Contrary to LFW, Janus subjects are described using templates containing mixtures of still images and video frames. It is important to note that the Janus set has some overlap with the images in the CASIA WebFace collection. In order to provide fair comparisons, our CNNs were fine-tuned on CASIA subjects that are not included in Janus (Sec. 3.2.1).

Face detections: Our pipeline uses the facial landmark detector of [11] for head pose estimation and alignment. Although we found this detector quite robust, it failed to detect landmarks on some of the more challenging Janus faces. Whenever the detector failed on all the images in the same template, we use the images cropped to their facial bounding boxes as provided in the Janus data.

Video pooling: We note that whenever face templates include multiple frames from a single video, we pool together the CNN features extracted from that video, by a simple element-wise average over all the features extracted from that video's frames. We emphasize that features are not pooled across videos but only within each video. Similar pooling techniques were very recently demonstrated to provide substantial performance enhancements (e.g., [125]) but, to our knowledge, never for faces or in the manner suggested here. We refer to this technique as video pooling and report its influence on our system and, whenever possible, on our baselines. In all our IJB-A and Janus CS2 results this method provided noticeable performance boosts; we compare video pooling to pair-wise single image comparisons (referred to as "without video pooling" in our results).

                         Without video pooling                         With video pooling
Augmentation             FAR=0.01  FAR=0.001  Rank-1  Rank-5  Rank-10  FAR=0.01  FAR=0.001  Rank-1  Rank-5  Rank-10
No augmentation          74.5      54.3       77.1    89.0    92.3     75.0      55.0       77.8    89.5    92.6
Pose                     84.9      62.3       86.3    94.5    96.5     86.3      67.9       88.0    94.7    96.6
Pose, shapes             86.3      62.0       87.0    94.8    96.9     87.8      69.2       88.9    95.6    97.1
Pose, shapes, expr.      86.6      63.6       87.2    94.9    96.9     88.1      71.0       89.1    95.4    97.2

Table 3.2: Effect of each augmentation on IJB-A performance for verification (ROC, TAR) and identification (CMC, recognition rate), resp. Only in-plane aligned images are used in these tests.

                         Without video pooling                         With video pooling
Image type               FAR=0.01  FAR=0.001  Rank-1  Rank-5  Rank-10  FAR=0.01  FAR=0.001  Rank-1  Rank-5  Rank-10
In-plane aligned         86.6      63.6       87.2    94.9    96.9     88.1      71.0       89.1    95.4    97.2
Rendered                 84.7      64.6       87.3    95.0    96.8     84.8      66.4       88.0    95.5    96.9
Combined                 87.8      67.4       89.5    95.8    97.4     88.6      72.5       90.6    96.2    97.7

Table 3.3: Effect of in-plane alignment and pose synthesis at test time (matching) on IJB-A, respectively for verification (ROC) and identification (CMC).

Figure 3.5: Ablation study of our data synthesis and test-time matching methods on IJB-A. (a) Contribution of each augmentation (ROC and CMC curves for no augmentation, pose, pose+shape, and pose+shape+expression). (b) Matching methods (ROC and CMC curves for in-plane aligned, rendered, and combined).

Ablation Study: We provide a detailed analysis of each augmentation technique on the challenging IJB-A dataset. Clearly, the biggest contribution is given by pose augmentation (red curve) over the baseline (blue curve) in Fig. 3.5(a).
The improvement is especially noticeable in the rank-1 recognition rate of the identification protocol. The effect of video pooling along with each data augmentation method is provided in Tab. 3.2. We next evaluate the effect of pose synthesis at test time, combined with the standard in-plane alignment (Sec. 3.2.2), in Tab. 3.3 and in Fig. 3.5(b). Evidently, these methods combined contribute to achieving state-of-the-art performance on the IJB-A benchmark. We conjecture that this is mainly due to three contributions: domain-specific augmentation when training the CNN, the combination of the SoftMax operator with video pooling, and finally pose synthesis at test time.

Comparison with the state-of-the-art: Our proposed method achieves state-of-the-art results on the IJB-A benchmark and the Janus CS2 dataset. In particular, it largely improves over the off-the-shelf commercial systems COTS and GOTS [68] and over Fisher Vector encoding using frontalization [32]. This gap can be explained by the use of deep learning alone. Even compared with deep learning based methods, however, our approach achieves superior performance, and with very wide margins. This is true even comparing our results to [139], who use seven networks and fuse their output with the COTS system. Moreover, our method improves over [139] in IJB-A verification by about 15% TAR at FAR=0.01 and about 20% TAR at FAR=0.001, also showing a better rank-1 recognition rate.

It is interesting to compare our results to those reported by [31] and [116]. Both fine-tuned their deep networks on the ten training splits of each benchmark, at substantial computational cost. Some idea of the impact this fine-tuning can have on performance is available by considering the huge performance gap between results reported before and after fine-tuning in [31]. (The results reported in [31] with fine-tuning on the training sets include system components not evaluated without fine-tuning.) Our own results, obtained by training our CNN once on augmented data, far outperform those of [116], also largely outperforming those reported by [31]. We conjecture that by training the CNN with augmented data we avoid further specializing all the parameters of the network to the target dataset. Tuning deep models on in-domain data is computationally expensive, and thus avoiding overfitting the network at training time is preferable.

Methods                        CS2 Ver. (TAR)      CS2 Id. (Rec. Rate)      IJB-A Ver. (TAR)    IJB-A Id. (Rec. Rate)
                               FAR=0.01  FAR=0.001  Rank-1  Rank-5  Rank-10  FAR=0.01  FAR=0.001  Rank-1  Rank-5  Rank-10
COTS [68]                      58.1      37         55.1    69.4    74.1     --        --         --      --      --
GOTS [68]                      46.7      25         41.3    57.1    62.4     40.6      19.8       44.3    59.5    --
OpenBR [70]                    --        --         --      --      --       23.6      10.4       24.6    37.5    --
Fisher Vector [32]             41.1      25         38.1    55.9    63.7     --        --         --      --      --
Wang et al. [139]              --        --         --      --      --       73.2      51.4       82.0    92.9    --
Chen et al. [31]               64.9      45         69.4    80.9    85.0     57.3      --         72.6    84.0    88.4
Deep Multi-Pose [6]            89.7      --         86.5    93.4    94.9     78.7      --         84.6    92.7    94.7
Chen et al. [31] (f.t.)        92.1      78         89.1    95.7    97.2     83.8      --         90.3    96.5    97.7
Swami S. et al. [116] (f.t.)   --        --         --      --      --       79        59         88      95      --
Ours                           92.6      82.4       89.8    95.6    96.9     88.6      72.5       90.6    96.2    97.7

Table 3.4: Comparative performance analysis on JANUS CS2 and IJB-A, respectively for verification (ROC) and identification (CMC). "f.t."
denotes fine-tuning a deep network multiple times, once for each training split. A network trained once with our augmented data achieves mostly superior results, without this effort.

3.3.2 Results on Labeled Faces in the Wild

For many years, LFW [58] was the standard benchmark for unconstrained face verification. Recent methods dominating LFW scores use millions of images, collected and labeled by hand, in order to obtain their remarkable performance. To test our approach, we follow the standard protocol for unrestricted, labeled outside data and report the mean classification accuracy as well as the 100% - EER (Equal Error Rate). We generally prefer 100% - EER because it does not depend on the selected classification threshold, but we still report verification accuracy to be comparable with other methods.

Improvement for each augmentation: Fig. 3.6(a) provides ROC curves for each augmentation technique used in our approach. The green curve represents our baseline, that is, the CNN trained on in-plane aligned images with respect to a frontal template. The ROC improves by a good margin when we inject unseen rendered images across poses for each subject: the 100% - EER improves by +1.67%. Moreover, by adding both shapes and expressions, performance improves even more, reaching a 100% - EER of 98.00% (red curve). See Fig. 3.6(b) for a comparison with methods trained on millions of downloaded images.

3.3.3 Result Summary

It is not easy to compare our results with those reported by others using millions of training images: their system designs and implementation details differ from our own, and it is difficult to assess how different system components contribute to their overall performance. In particular, the reluctance of commercial groups to release their code and data makes it hard to say exactly how much performance our augmentation buys in comparison to their harvesting and curating of millions of face images.

Nevertheless, the results throughout this section clearly show that synthesizing training images using domain tools and knowledge leads to a dramatic increase in recognition accuracy. This may be attributed to the potential of domain-specific augmentation to infuse training data with important intra-subject appearance variations; the very variations that seem hardest to obtain by simply downloading more images. As a bonus, it is a more accessible means of increasing training set sizes than downloading and labeling millions of additional faces.

Finally, a comparison of our results on LFW to those reported by methods trained on millions of images ([131, 132, 100] and [118]) shows that, starting from the initial set of fewer than 500K publicly available images from [156], our method surpasses those of [131] and [100] (without their metric learning, which was not applied here), falling only slightly behind the rest.

Method                              Real    Synth   Nets   Acc. (%)   100% - EER
Fisher Vector Faces [99]            --      --      --     93.0       93.1
DeepFace [131]                      4M      --      3      97.35      --
Fusion [132]                        500M    --      5      98.37      --
FaceNet [118]                       200M    --      1      98.87      --
FaceNet + Alignment [118]           200M    --      1      99.63      --
VGG Face [100]                      2.6M    --      1      --         97.27
VGG Face (triplet loss) [100]       2.6M    --      1      98.95      99.13
Us, no aug.                         495K    --      1      95.31      95.26
Us, aug. pose                       495K    2M      1      97.01      96.93
Us, aug. pose, shape, expr.         495K    2.4M    1      98.06      98.00

Figure 3.6: LFW verification results. (a) Breakdown of the influence of different training data augmentation methods (ROC curves for no augmentation, pose, and pose+shapes+expression). (b) Performance comparison with state-of-the-art methods, showing the numbers of real (original) and synthesized training images, the number of CNNs used by each system, accuracy, and 100% - EER.
3.4 Conclusions

In this chapter, we investigated how 3D face shapes can benefit a modern face recognition system. First, 3D-based data augmentation can be used to generate (synthesize) valuable additional data for training effective face recognition systems, as an alternative to expensive data collection and labeling. Second, we described a face recognition pipeline with several novel details, in particular its use of face synthesis, based on reference 3D face shapes, for matching across poses. This 3D face augmentation, in both training and testing, greatly improves face recognition on challenging in-the-wild data.

Chapter 4

Enhance 2D Face Recognition System

Despite its great power in tackling the 2D face recognition task, the system proposed in Chapter 3 still has some limitations. The current 3D face augmentation is computationally expensive, which reduces the scalability of the system. The efficiency of the two-step matching process (view-based matching + score pooling) is arguable. Reliance on landmarks is also a weak point; it hurts both running speed and system robustness. In this chapter, we discuss how to handle these issues in order to enhance system performance and robustness (Sections 4.1 and 4.3). We also verify the scalability of our method on a novel large-scale training dataset, which further breaks all in-the-wild recognition records (Section 4.2).

4.1 Accelerate Augmentation and Matching

(This section is joint work with Iacopo Masi, Prof. Tal Hassner, and Prof. Gerard Medioni. It was published as "Rapid synthesis of massive face sets for improved face recognition," in Automatic Face & Gesture Recognition, pp. 604-611, IEEE, 2017.)

In Chapter 3, we did not consider the computational cost of rendering: at training time, rendering millions of face images can be prohibitive; at test time, rendering can quickly become a bottleneck, particularly when multiple images represent a subject. Also, the matching sequence of view-based matching followed by score pooling is arguably ineffective. In this section, we accelerate the 3D-based face synthesis method, which allows rendering new 3D views of faces at a computational cost equivalent to simple 2D image warping. We also discuss an alternative matching sequence, which is simpler but also more effective. The proposed modifications are tested on the challenging IJB-A and Janus CS2 benchmarks, showing that they are not only fast, but also improve recognition accuracy.

4.1.1 Face Rendering

Existing methods used to synthesize new views of faces from single images involve two standard and well-understood steps, known in computer graphics as texture mapping and ray casting / rasterization [54, 131]. We next provide a cursory overview of these steps. More details are available in any standard computer graphics textbook [60, 146].

Overview of face rendering: Texture mapping of an image I (i.e., the input image of a face viewed in unconstrained settings) onto a 3D surface F in R^3 is the process of assigning every 3D surface position P = (X, Y, Z) on F a location q = (u, v) in the image (where image coordinates are often normalized to the range u, v in [0, 1]). The shape F is assumed to be a 3D face shape, which may have been estimated from the input face image I or is a predetermined, generic face. Following texture mapping, F is projected to the desired view, J.
To this end, the output view's camera matrix, M_J = K_J [R_J t_J], is manually specified in order to set the desired output viewpoint (e.g., a frontal view for face frontalization [54]). This includes setting the intrinsic camera parameters in K_J, and the 3D rotation and translation in R_J and t_J, respectively, both in the coordinate frame of the 3D shape. The matrix M_J is then used to intersect the rays emanating from J's center of projection, passing through each of its pixels p_i = (x_i, y_i) in J, with the surface of F. Each such intersection is a 3D point P_i = (X_i, Y_i, Z_i) on F. Following texture mapping, these 3D points are linked to locations q_i = (u_i, v_i) in the input face image I. Thus, the pixel p_i in the output image is assigned intensity values by sampling I at its determined q_i.

Texture mapping by face pose estimation: Texture mapping of the input face image I to the face shape F is performed automatically. To this end, a facial landmark detection method is used to locate k landmarks in the face image I. For each of the detected landmarks q_i in R^2 we assume corresponding landmarks P_i in R^3, specified once on the 3D surface F (the index i indicates the same facial landmark in I and on the 3D shape F). Given the corresponding landmarks q_i <-> P_i, we use PnP [52] to estimate the extrinsic camera parameters of the input image I. Unlike in Chapter 3, we assume a fixed intrinsic camera matrix K_I, estimating only the rotation and translation [R_I t_I] in the 3D model's coordinate frame. We thus obtain a perspective camera model mapping the 3D face shape F to the input image, so that q_i ~ M_I P_i, where M_I = K_I [R_I t_I] is the estimated camera matrix for the input view. Hence, the matrix M_I can be used to map any point on the 3D surface onto the input image, thereby providing the desired texture map.

4.1.2 Rapid Rendering with Generic Faces

In light of the importance of rendering as a key step in many computer graphics applications, tremendous efforts have been dedicated to expediting this process, speeding up state-of-the-art rendering engines to fractions of a second, even for complex 3D scenes. Specialized computer hardware, namely graphical processing units (GPUs), was also developed for this purpose. Our work is tangential to these efforts. Specifically, we show that by assuming a generic 3D shape and fixed desired output poses, a great deal of the effort required to render new facial views can be performed at preprocessing, thereby substantially reducing both the effort required to produce new views and the complexity of the system required for rendering.

Precomputing output projections: One of the most time-consuming steps in the process described in Sec. 4.1.1 is ray casting: computing the locations of intersections between the rays passing through each output pixel and the surface of the face. Due to potential self occlusions, this process may further involve methods such as Z-buffering or binary space partitioning in order to determine the visibility of the 3D shape at each output pixel [60]. As previously mentioned, these steps can be expedited using specialized hardware, optimized code, and various approximation methods.

We note that when a fixed generic face shape and output view are used, these steps only need to be performed once, at preprocessing. Subsequent face synthesis using the same shape and pose can skip this step, and requires the same computational effort as standard 2D image warping for texture mapping (Sec. 4.1.1).
Specifically, during preprocessing we use a standard rendering engine [146] to perform ray casting of a generic face shape F onto a desired output view J. Doing so, we store for each output 2D pixel location p_i in J the 3D coordinates P_i on F projected onto that pixel (i.e., the 3D location of the surface point P_i visible at p_i). This information is stored in a lookup table U, simply defined as:

U(p_i) = P_i    (4.1)

In practice, U is stored as an N x M x 3 matrix, where N and M are the dimensions of the output view and the last dimension indexes the X, Y and Z coordinates of the 3D point projected onto each pixel.

Rendering with precomputed projections: Given an input image I containing a face in unconstrained settings, we render it to a desired new view using U as follows (see also Fig. 4.1 for a Python code example). We first estimate the 3D pose of the face, as described in Sec. 4.1.1. This provides a camera matrix M_I associating 3D points on the surface of F with pixels in I. Let

q = M_I U_bar,    (4.2)

where U_bar is the matrix U reshaped to a 4 x (N*M) matrix whose columns are the 3D points stored in U, in homogeneous notation. The matrix q is then a 3 x (N*M) matrix whose columns are the 2D projections of these 3D points onto I, also in homogeneous coordinates.

import numpy as np
import cv2

def warpImg(I, N, M, q, idx):
    """Inverse-warp image I to an N x M output view by sampling it at the 2D points q."""
    J = np.zeros((N * M, 3))
    # Fast warping: bicubic sampling of I at the projected coordinates.
    pixels = cv2.remap(I,
                       q[0, :].astype(np.float32).reshape(1, -1),
                       q[1, :].astype(np.float32).reshape(1, -1),
                       cv2.INTER_CUBIC)
    # Copy the interpolated pixels back to the output pixels given by Eq. (4.1).
    J[idx, :] = pixels.reshape(-1, 3)
    J = J.reshape((N, M, 3), order='F')
    return J.astype(np.uint8)

# U holds the precomputed 3D point per output pixel (Eq. (4.1)), as a 3 x (N*M) matrix.
U_bar = np.vstack((U, np.ones((1, U.shape[1]))))   # homogeneous 3D coordinates
q_bar = MI.dot(U_bar)                              # project with the estimated camera matrix, Eq. (4.2)
q = q_bar[0:2, :] / np.tile(q_bar[2, :], (2, 1))   # convert to Euclidean 2D coordinates
# idx holds the indices of the output pixels whose projections fall inside the input image.
q = q[:, idx]
synth = warpImg(I, N, M, q, idx)

Figure 4.1: Python code snippet for 3D rendering with precomputed projections.

An output view J can then be produced simply by sampling image I using q (following conversion to Euclidean coordinates). Sampled intensities are mapped back to the output view J using the correspondence between columns of q, columns of U_bar, and (x, y) pixel locations in U. As for run time, this process includes precisely the same steps as standard inverse warping [130]. Compared with, for example, the 2D warping regularly performed in real time even on cellphone devices, the only difference is in applying a 3 x 4 camera matrix transformation to homogeneous 3D coordinates, rather than a 3 x 3 projective transformation.

Preparing generic 3D heads and backgrounds: In Chapter 3, ten generic 3D shapes S = {F_1, ..., F_10}, obtained from the publicly available Basel Face Set [102], were used to model variations in head shape. These face shapes are conveniently aligned with each other, and so the 3D landmarks required for pose estimation (Sec. 4.1.1) are selected only once; the selected landmarks can then be automatically transferred to all other models. Beyond simplifying the process of selecting facial landmarks for pose estimation, this also allows segmenting faces from their background and selecting eye regions to avoid cross-eyed results [54].

These models, however, only represent the frontal, face part of the head. This is presumably why we rendered only partial head views in Chapter 3, without backgrounds, and why [131] and [54] only used tight bounding boxes around the center of the face. Fortunately, rendering full heads and backgrounds can naturally be included in the process described in the previous sections.
This is performed by stitching the ten 3D 48 Figure 4.2: Preparing a generic 3D model: Head added to a generic 3D face (top) along with two planes representing the background. models to an additional, generic 3D structure containing head, ears and neck and adding a plane representing a at background. The process of combining a 3D face model and a 3D head is described in Fig. 4.2. We use the generic 3D head model from [158]. We removed its facial region in order to exchange it with the 3D faces from [102]. To allow blending with dierent 3D faces, varying in sizes and shapes, we maintain an overlap belt with radiusr = 2cm (Fig. 4.2(b)). Given an input 3D face (Fig. 4.2(a)), we merge it onto the head model using soft boundary 49 blending: We rst detect points on the overlap region of the face model. Each point X is then assigned a soft blending weight w: w = 1 2 1 2 cos d r ; (4.3) where d is the distance to the boundary. Next, X is adjusted to the new 3D position X 0 by: X 0 =wX + (1w)P X ; (4.4) where P X is the closest 3D point from the head. The result is a complete 3D face model (Fig. 4.2(c)) . To additionally preserve the background, we simply add two planes to the 3D model: one positioned just behind the head and another, perpendicular plane, on its right. This second plane is used to represent the background when the input face is rendered to a prole view, in which case the rst plane is mapped to a line. Fig. 4.2(d) shows the models we produced using one of the 3D face shapes from [102]. Fig. 4.2(e) shows the rendered view of this generic face from the prole pose used by our system. 4.1.3 Matching by Pooling Synthesized Faces Going beyond the proposed system in Chapter 3, we test also pooling of features obtained from image I and its rendered views. Specically, instead of separately matching the input image, its in-plane aligned view and its rendered views, we extract CNN features for all these views and then pool them together using element-wise average. This process is visualized in Fig. 4.3. The idea here follows similar techniques suggested by, e.g., [125]. 50 Figure 4.3: Face representation: given an input image (left), it is rendered to novel viewpoints as well as in-plane aligned. These are all encoded by our CNN. The CNN features are then pooled by element wise average, obtaining the nal representation. Unlike them, we use the pooled features directly, rather than train subsequent networks to process them. Although the idea of pooling multiple CNN responses over transformation is not new [123] and is extensively done in deep learning framework such as Ca e [62] by sampling random patches and averaging their features, we do it by exploiting the face domain and explicitly pool over synthetically generated out-of-plane transformations. The rationale for this approach is that the CNN is trained to produce features which should be invariant to confounding factors (e.g., pose, illumination, expression, etc.) instead emphasizing subject identity. In practice, these confounding factors aect the representation produced by the CNN. These eects can be considered noise, which can sometimes overpower the subject's identity (the signal) in the deep feature representa- tion. By pooling together features obtained by multiple, synthetically applied identity preserving transformations, we aim to suppress this noise and amplify the signal. As we later show, this process proved benecial especially at very low false alarm rates (FAR). 
Recognition with face templates: In case each subject is described by a template, i.e., a set of images from different sources or media files, we first perform video pooling as described in Chapter 3. However, we go beyond that by combining the features from all images and all pooled videos into a single feature representation per template. Following this, we evaluate the similarity between two templates as we did when comparing two images, by simply taking the correlation of the two features. Compared to Chapter 3, this process cuts most of the matching burden, leading to much faster matching.

4.1.4 Experimental Results

Comparing rendering methods: We compare our rendering approach to a number of recent methods proposed for rendering new views of face images and used for face recognition [54, 85, 86], as well as to the method proposed in Chapter 3. All run times are reported on an Intel Core i7-4820K CPU @ 3.70GHz (4 cores), 32GB RAM, and an nVidia Titan X GPU. We report run times as well as various other relevant and practical aspects of these rendering systems: whether they render multiple poses (rather than a single, typically frontal, pose); whether they use multiple 3D shapes for rendering; what programming language was used to implement them; whether they depend on OpenGL; and, finally, whether they are publicly available. Our report is summarized in Tab. 4.1. In order to have a fair comparison, we measure the average time (Avg. Rend. Time) required by each method for rendering alone; time spent on pose estimation (and landmark detection) was excluded. Speeds were measured by averaging over 2,000 rendered views. Our method, despite its implementation in high-level, un-optimized Python code, can generate a rendered view in 14.27 ms with a very low standard deviation.

Method           Avg. Rend. Time (ms)   Multi-Pose   Multi-Shape   Implement.       Avoid OpenGL   Pub. Avail.
CVPRW13 [85]     --                     Yes          No            Blender          No             No
CVPR15 [54]      66.33 +/- 9.51         No           No            MATLAB           Yes            Yes
CVPR16 [86]*     46.12 +/- 18.45        Yes          No            Optimized C++    No             No
Old renderer     46.12 +/- 18.45        Yes          Yes           Optimized C++    No             No
New renderer     14.27 +/- 1.98         Yes          Yes           Python           Yes            Yes

Table 4.1: Overview of render times and various properties of recent face rendering methods.

The previous renderer, which uses optimized and compiled OpenGL code, requires more time (46.12 ms) due to the repeated ray casting it performs, which we instead perform only once at preprocessing. Overall, our method offers a speed-up of about 3.7x, despite its simpler, high-level implementation. To appreciate the significance of this, synthesizing novel views using multiple 3D shapes required less than four hours on a cluster with a total of 96 cores for all the faces in the CASIA WebFace collection [156] (about 500,000 facial images). This includes the entire run time, including landmark detection and face rendering.

Analysis of effects on face recognition: In order to see the effects of our rendering, we evaluated the face recognition pipeline. Tab. 4.2 reports recognition performance under different experimental settings, showing the impact each has. In particular, we tested (1) the effect of rendering the background and entire head with the models prepared in Sec. 4.1.2 vs. using only the face and a blank background; (2) the effect of pooling across synthesized images, as proposed in Sec. 4.1.3; and (3) incremental CNN training, both with and without the background.
IJB-A results (ConvNet: VGG-19)        Ver. (TAR@FAR)             Ide. (Rec. Rate)
                                       1%     0.1%    0.01%       R1     R5     R10
Background and head vs. empty background
Us (no context)                        85.2   68.0    44.2        90.5   96.0   97.5
Us (context)                           86.1   71.1    47.0        91.5   96.5   97.7
Pooling across synthesized views
Us (real + our rend.)                  86.1   71.1    47.0        91.5   96.5   97.7
Us (pool. synth.)                      86.0   70.9    49.0        91.4   96.4   97.5
Impact of incremental training
Us, incremental                        88.8   75.0    56.4        92.5   96.6   97.4

Table 4.2: Comparison of various pipeline components: the impact of face context on recognition, pooling across synthesized images, and incremental training.

Figure 4.4: ROC curves for IJB-A verification for (left) the use of context; (right) incremental training. See also Tab. 4.2.

The importance of rendering the background: To our knowledge, all previous face recognition methods which synthesize new facial views either did not render the background [6, 86] or did not use it in their tests [54, 131]. We test the effect that the background and the head around the face have on performance by rendering our faces with and without the background plane and the head regions prepared in Sec. 4.1.2. To this end, our entire pipeline (training and testing) was executed with and without the background and head. Evident from Tab. 4.2 and Fig. 4.4 (left) is that the background improves performance: verification rates at the low FAR rates improve by 3%.

Impact of pooling rendered views: We test the impact of the pooling step proposed in Sec. 4.1.3. By doing so, we avoid having to match the rendered and real images separately, and so simplify the matching process. But does it improve accuracy? Our results in Tab. 4.2 show that, despite the simpler matching scheme, accuracy actually improves. This is particularly evident at the low FAR of 0.01%, where pooling adds 2% to the accuracy.

Impact of incremental training: Although rendering faces with their context performs better than without, we next test whether combining both can achieve even better results. Specifically, we begin with a network trained on real images (in-plane aligned) and rendered views without background. Once training saturated, we resumed training using rendered views of the same images, but this time produced with backgrounds. Matching used faces rendered along with their context, pooled across views. Surprisingly, the results in Tab. 4.2 show this network performing better than all other combinations. In particular, it improves by 7% at the very low FAR=0.01% and by 4% at FAR=0.1%. Recognition rates also improve across all ranks. This may be attributed to gradual adaptation of the network to faces: the networks trained with rendered views without backgrounds were initialized with weights obtained by training on a different domain (ImageNet). By training the network incrementally, the final network
[31] 64.9 45 69.4 80.9 85.0 57.3 { 72.6 84.0 88.4 Deep Multi-Pose [6] 89.7 { 86.5 93.4 94.9 78.7 { 84.6 92.7 94.7 Pooling Faces [55] 87.8 74.5 82.6 91.8 94.0 81.9 63.1 84.6 93.3 95.1 PAMs [86] 89.5 78.0 86.2 93.1 94.9 82.6 65.2 84.0 92.5 94.6 VGG Face [40] { { { { { 80.5 60.4 91.3 { 98.1 NAN [153] { { { { { 89.7 78.5 { { { Swami S. et al. [115] { { { { { 87.1 76.6 92.5 { 97.8 Swami S. et al. [116] { { { { { 79 59 88 95 { Chen et al. [31] 92.1 78 89.1 95.7 97.2 83.8 { 90.3 96.5 97.7 VGG Face + SVM [40] { { { { { 93.9 82 92.8 { 98.6 Swami S. et al. + TPE [115] { { { { { 90.0 81.3 93.2 { 97.7 Our best result { { { { { { { { { { Us, Chapter 3 92.6 82.4 89.8 95.6 96.9 88.6 72.5 90.6 96.2 97.7 Us, (FaceW+real+rend) 94.1 85.5 92.4 96.6 97.5 88.3 75.0 92.8 96.9 97.8 Us, (FaceW+pool.synth.) 93.9 86.1 92.3 96.4 97.3 88.8 75.0 92.5 96.6 97.4 Table 4.3: Performance analysis on JANUS CS2 and IJB-A respectively for verication (ROC) and identication (CMC). The methods of [116, 31] and [115] all perform supervised training on all of the test benchmark's training splits. 56 Comparison with state-of-the-art: We compare our complete pipeline, using of ren- dered views with background and heads, pooling across rendered views and incremental training, with published state of the art results. Tab. 4.3 reports these performances on the JANUS CS2 and IJB-A benchmarks. Methods in our report are separated accord- ing to the type of training they applied to the target benchmark: our own method used computationally cheap, unsupervised PCA whereas others used supervised techniques, repeatedly trained for each of the ten training splits. Compared to methods which, like us, did not use the training splits for supervised training and domain adaptation, our method achieves state of the art results. It moreover falls only slightly behind methods which explicitly adapt to the test domain by performing supervised training on each training split of the test benchmarks. This implies that by careful rendering, we eectively introduce many of the appearance variations encountered in the target domain. 4.2 Scaling up the training set: the COW dataset 3D Face Augmentation is proved to be an eective method to enrich small training dataset for CNN training (Chapter 3). Does our system reach the saturation point, or it can be further improved when more training data available? In this section, we will answer this question. 57 4.2.1 COW dataset Despite having half million images, CASIA-WebFace is still considered small compared to the training datasets in the famous commercial systems such as Facebook (4.4M) [131] and Google (200M) [118]. Recently, some large facial image datasets have been published to push further face recognition research, including Oxford [100] and MS-Celeb-1M [51]. In order to acquire a very-large training data, we combine CASIA-WebFace and these two sets into a single one named MSCeleb-Oxford-WebFace, or COW, dataset. a. Components COW is a combination of 3 datasets: • CASIA-WebFace dataset [156] has 0.5M images of 10k subjects. The images are already ltered, cropped, and resized. • Oxford (VGG Face) dataset [100] has 2.2M Internet images from 2.6k subjects. The authors provide the download link for each image and the corresponding face detection bounding box. The images are varying in size, image quality, and the number of faces. About 5% of the images are noise with incorrect labels. • MS-Celeb-1M dataset [51] has 10M Internet images for 100K celebrities. 
- The MS-Celeb-1M dataset [51] has 10M Internet images of 100K celebrities. Each subject has 100 images retrieved by a Bing search on his or her name, without filtering; therefore 25-30% of this dataset is noise. Again, the authors provide download links and face detection bounding boxes, similar to the Oxford dataset.

We acquired the Oxford and MS-Celeb-1M images using the provided links. A small number of these links were broken, and we skipped the missing images.

b. Combining datasets

First, we detected and merged the overlapping subjects across the three datasets. This step is essential; otherwise, a subject may have multiple labels, causing confusion during CNN training. For each dataset pair, we generated a list of overlap candidates based on the similarity of subject names. This list was then manually filtered by a human. Finally, for each overlapping subject pair, we combined the two image sets into one with a single subject ID. Second, we removed the subjects overlapping with the testing dataset (IJB-A), in order to have fair experimental results.

c. Filtering noisy images

The remaining dataset contains many non-face images, for various reasons. First, some download links were broken or outdated but still returned blank images. Second, the meta-data face bounding boxes were not perfect; we sometimes got a non-face region. To filter these noisy images, we ran the CLNF landmark detector [12] on all images and collected the landmark confidences. Images with confidence less than 0.4 are junk and were removed. Although not perfect, this process removed most of the non-face images. The remaining dataset has 6.8M images from 69k subjects.

d. Filtering noisy labels

Because they rely purely on Internet search, both Oxford and MS-Celeb-1M contain a large number of mislabeled images, i.e., images of one person assigned to another. These images may confuse CNN training, so they need to be removed. Filtering them, however, is not an easy task; it requires an internal face recognition test. To solve this problem, we propose a two-step CNN training process. First, we use the dataset remaining after step c., called the NoisyCOW dataset, to train a face recognition CNN. We then run this CNN directly on NoisyCOW and extract the probability that each image belongs to its assigned subject (using the last CNN layer). If this probability is smaller than a threshold of 0.0005, the corresponding image is removed (Fig. 4.5(a)). When a subject is left with fewer than 5 images, we remove that subject as well. We propose this method based on the fact that a CNN has great generalization ability: by learning on very large data, it is robust to a portion of noise in the training labels and can recognize that noise as well [147]. A filtering result for a sample subject in COW is shown in Fig. 4.5(b). Our algorithm detects and removes some obvious label noise, thus making the set cleaner. Some of the remaining images may still not belong to this subject, but it is hard even for a human to firmly claim them as noise. The final dataset is called CleanCOW; it has 6.8M images from 63k subjects. The COW dataset and its components are summarized in Table 4.4.

Dataset         No. of imgs. (M)   No. of subs. (k)   Noise level
CASIA           0.5                10                 -
Oxford          2.2                2.6                **
MS-Celeb-1M     10                 100                ***
NoisyCOW        6.9                69                 **
CleanCOW        6.8                63                 *

Table 4.4: Summary of COW dataset versions and their components (more marks in the last column indicate a higher noise level).

Figure 4.5: Filtering mislabeled data on COW. (a) The label filtering process; (b) a sample filtering result on COW.
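The label filtering step amounts to a simple thresholding rule followed by a per-subject count check; the sketch below is our own illustration of it, assuming the trained CNN's soft-max probabilities are already available as an array, with hypothetical names and a toy threshold in the usage example.

import numpy as np

def filter_labels(probs, labels, threshold=0.0005, min_images=5):
    """Keep an image only if the CNN's probability for its assigned subject
    exceeds the threshold; then drop subjects left with too few images.

    probs  : (T, N) soft-max outputs of the CNN trained on NoisyCOW.
    labels : (T,) assigned subject indices.
    Returns a boolean keep-mask over the T images.
    """
    keep = probs[np.arange(len(labels)), labels] >= threshold
    # Remove subjects with fewer than min_images surviving images.
    for subject in np.unique(labels):
        if ((labels == subject) & keep).sum() < min_images:
            keep &= labels != subject
    return keep

# Toy usage: 6 images, 3 subjects, relaxed parameters for the tiny example.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(3), size=6)
print(filter_labels(p, np.array([0, 0, 1, 1, 2, 2]), threshold=0.05, min_images=2))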
Training data    Ver. (TAR@FAR)             Ide. (Rec. Rate)
                 1%     0.1%    0.01%       R1     R5     R10
CASIA            88.8   75.0    56.4        92.5   96.6   97.4
NoisyCOW         94.6   87.0    73.4        95.7   97.8   98.3
CleanCOW         94.7   87.8    76.5        95.7   97.8   98.3

Table 4.5: Face recognition performance on IJB-A for verification (ROC) and identification (CMC) when using different training data.

4.2.2 Experimental Results

To evaluate the stability of our system, we repeat the training process described in Section 4.1 on the NoisyCOW and CleanCOW datasets. The trained CNNs are then evaluated on the IJB-A dataset.

Table 4.5 and Fig. 4.6 compare the CNNs trained on CASIA-WebFace and on the COW datasets. With the increased amount of training data, the new CNNs lead the previous one by a huge margin, both in ROC and CMC. This proves the scalability of our system and confirms the power of large data: the more training data we use, the more powerful the system we get. The CNN trained on CleanCOW sets a new record on IJB-A, with 94.7% TAR@1%FAR and a 95.7% rank-1 recognition rate.

Interestingly, the CNN trained on CleanCOW only beats the one trained on NoisyCOW on the ROC at very low FAR. This again affirms that CNNs are robust to label noise, as discovered by [147], and that the mislabel filtering process is not critical.

Figure 4.6: Face recognition performance on IJB-A for verification (ROC) and identification (CMC) when using different training data.

4.3 Landmark-Free Pipeline

(This section is joint work with Fengju Chang, Iacopo Masi, Prof. Tal Hassner, Prof. Ram Nevatia, and Prof. Gerard Medioni. It was published as "FacePoseNet: Making a Case for Landmark-Free Face Alignment," arXiv preprint arXiv:1708.07517, 2017.)

In order to perform in-plane alignment and 3D face synthesis, we need to estimate the 3D head pose from the input image. So far, this task has been solved using landmark detection and a PnP algorithm. This process is slow and sensitive to image conditions such as occlusions or shadows. Instead, in this section we show how a simple convolutional neural network (CNN) can be trained to accurately and robustly regress a 6 degrees of freedom (6DoF) 3D head pose directly from image intensities. We further explain how this FacePoseNet (FPN) can be used to align faces in 2D and 3D as an alternative to explicit facial landmark detection for these tasks. We claim that in many cases the standard means of measuring landmark detector accuracy can be misleading when comparing different face alignments.
Most facial landmark detectors, however, are developed without measuring their impact on these applications but rather using standard facial landmark detection benchmarks such as the popular AFW [159], LFPW [19], HELEN [75], and IBUG [112]. These benchmarks contain face images with manually labeled ground truth landmarks. Better detection accuracy on these benchmarks equals better prediction of these manual positions. This raises an important question: Does better approximation of such human labeled landmarks imply better face alignment and consequently better face understanding? Why would higher accuracy on landmark detection benchmarks not imply better alignment? The many landmark detection benchmarks used by the community to measure detection accuracy typically oer 5, 49 or 68 landmarks painstakingly labeled on hundreds 64 or thousands of unconstrained face images, re ecting wide viewpoint, resolution and noise variations. On low resolution images, however, even expert human operators can nd it hard to accurately pinpoint landmark positions. More importantly, many landmark locations are not well dened even in high resolution (e.g., points along the jawline or behind occlusions). Thus, improved landmark detection accuracy may actually re ect better estimation of uncertain human labels rather than better face alignment (Fig. 4.7). Figure 4.7: The problem with manually labeled ground truth facial landmarks. Images and annotations from the AFW [159] (left two columns) and iBug [112] benchmarks. One of each pair shows manually labeled ground truth landmarks; the other, high-error predictions of our FPN. Which is which? 1 Clearly, detection accuracy, as measured by standard benchmarks, does not necessarily re ect the quality of the landmark detection. 1 Images one, three, and ve are ground truth. 65 An additional concern relates to how landmarks are used for face alignment. To our knowledge, the eects landmark detection noise, changing expressions or face shapes have on it were never fully explored. Responding to these concerns, we oer several contributions. (1) We propose com- paring landmark detection methods by evaluating bottom line face recognition accu- racy on faces aligned with these methods. (2) As an alternative to existing facial land- mark detectors, we further present a robust and accurate, landmark-free method for face alignment: our deep FacePoseNet (FPN). We show it to excel at global, 3D face alignment even under the most challenging viewing conditions. Finally, (3), we test our FPN extensively and report that better landmark detection accuracy on the widely used 300W benchmark [111] does not imply better alignment and recognition on the highly challenging IJB-A [69] and IJB-B benchmarks [140]. In particular, recognition results on images aligned with our FPN surpass those on images aligned with state-of- the-art detectors. To support our claims, we released our trained FPN and code at http://github.com/fengju514/Face-Pose-Net. 4.3.2 A Critique of Facial Landmark Detection Before using an existing state-of-the-art facial landmark detector in a face processing system, the following points should be considered. 
Landmark detection accuracy measures: Facial landmark detection accuracy is typically measured by considering the distances between estimated landmarks and ground truth (reference) landmarks, normalized by the reference inter-ocular distance of the face [41]:

e(L, L_hat) = \frac{1}{m \lVert \hat{p}_l - \hat{p}_r \rVert_2} \sum_{i=1}^{m} \lVert p_i - \hat{p}_i \rVert_2,    (4.5)

Here, L = {p_i} is the set of m 2D facial landmark coordinates, L_hat = {p_hat_i} their ground truth locations, and p_hat_l, p_hat_r the reference left and right outer eye corner positions. These errors are then translated into a number of standard quantities, including the mean error rate (MER), the percentage of landmarks detected under certain error thresholds (e.g., below 5% or 10% error rates), or the area under the accumulative error curve (AUC).

There are two key problems with this method of evaluating landmark errors. First, the ground truth compared against is manually specified, often by Mechanical Turk workers. These manual annotations can be noisy, and they are ill-defined when images are low resolution, when landmarks are occluded (in the case of large out-of-plane head rotations, facial hair and other obstructions), or when landmarks lie in featureless facial regions (e.g., along the jawline). Accurate facial landmark detection, as measured on these benchmarks, thus implies better matching of human labels, but not necessarily better detection. These problems are demonstrated in Fig. 4.7.

A second potential problem lies in the error measure itself: normalizing detection errors by inter-ocular distances biases against images of faces appearing in non-frontal views. When faces are near profile, perspective projection of the 3D face onto the image plane shrinks the distance between the eyes, thereby naturally inflating the errors computed for such images.

Landmark detection speed: Some facial landmark detection methods emphasize impressive speeds [67, 106]. Measured on standard landmark detection benchmarks, however, these methods do not necessarily claim state-of-the-art accuracy, falling behind more sophisticated yet far slower detectors [157]. Moreover, aside from [158], no existing landmark detector is designed to take advantage of GPU hardware, a standard feature in commodity computer systems, and most, including [158], apply iterative optimizations which may be hard to convert to parallel processing.

Effects of facial expression and shape on alignment: It was shown that 3D alignment and warping of faces to frontal viewpoints (i.e., frontalization) is effective regardless of the precise 3D face shape used for this purpose [54]. Facial expressions and 3D shapes, in particular, appear to have little impact on the warped result, as evident from the improved face recognition accuracy reported by that method. Moreover, we have demonstrated that by using such a generic 3D face shape, rendering faces from new viewpoints can be accelerated to the same speed as simple 2D image warping.
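For completeness, the inter-ocular-normalized error of Eq. (4.5) discussed above amounts to only a few lines of numpy. The sketch below is ours and is only illustrative; benchmark evaluation scripts differ in details such as which eye landmarks define the normalizer.

import numpy as np

def landmark_error(pred, gt, left_eye_idx, right_eye_idx):
    """Normalized landmark error of Eq. (4.5).

    pred, gt : (m, 2) arrays of predicted and ground-truth landmark coordinates.
    left_eye_idx, right_eye_idx : indices of the outer eye-corner landmarks in gt.
    """
    m = len(gt)
    inter_ocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    per_point = np.linalg.norm(pred - gt, axis=1)  # ||p_i - p_hat_i||_2
    return per_point.sum() / (m * inter_ocular)

# Toy usage with 5 landmarks; indices 0 and 1 play the role of the eye corners.
gt = np.array([[30., 40.], [70., 40.], [50., 60.], [40., 80.], [60., 80.]])
print(landmark_error(gt + 1.0, gt, 0, 1))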
4.3.3 Deep, Direct Head Pose Regression

(This part was done by Fengju Chang.)

Rather than aligning faces using landmark detection, we treat alignment as a global, 6DoF 3D face pose and propose to infer it directly from image intensities, using a simple deep network architecture. We next describe the network and the novel method used to train it.

Head pose representation: We define face alignment as the 3D head pose h, expressed using 6DoF: three for rotation, r = (r_x, r_y, r_z)^T, and three for translation, t = (t_x, t_y, t_z)^T:

h = (r_x, r_y, r_z, t_x, t_y, t_z)^T    (4.6)

where (r_x, r_y, r_z) are represented as Euler angles (pitch, yaw, and roll). Given m 2D facial landmark coordinates p in R^{2 x m} on an input image and their corresponding reference 3D coordinates P in R^{3 x m}, selected on a fixed, generic 3D face model, we can obtain a 3D-to-2D projection of the 3D landmarks onto the 2D image by solving the following equation for the standard pinhole model:

[p; 1]^T = A [R, t] [P; 1]^T,    (4.7)

where A and R are the camera matrix and rotation matrix, respectively, and 1 is a constant vector of ones. We then extract a rotation vector r = (r_x, r_y, r_z)^T from R using the Rodrigues rotation formula.

Obtaining enough training examples: Although the network architecture we use to predict head poses is not very deep compared to deep networks used today for other tasks, training it still requires large quantities of training data. We found the number of facial-landmark-annotated faces in standard datasets to be too small for this purpose. A key problem is therefore obtaining a large enough training set. We produce our training set by generating 6D ground-truth pose labels, running OpenFace [14], the upgraded version of the CLNF landmark detector, on a large image set: the 2.6 million images in the VGG face dataset [101]. A potential danger in using an existing method to produce our training labels is that our CNN will not improve beyond the accuracy of its training labels. As we show in Sec. 6.3, this is not necessarily the case.

Figure 4.8: Augmenting appearances of images from the VGG face dataset [101]. After detecting the face bounding box and landmarks, we augment its appearance by applying a number of simple planar transformations, including translation, scaling, rotation, and flipping. The same transformations are applied to the landmarks, thereby producing example landmarks for images which may be too challenging for existing landmark detectors to process.

To further improve the robustness of our CNN, we apply a number of face augmentation techniques to the images in the VGG face set, substantially enriching the appearance variations it provides. Fig. 4.8 illustrates this augmentation process. Specifically, following face detection [154] and landmark detection [14], we transform the detected bounding boxes and their detected facial landmarks using a number of simple in-plane transformations. The parameters for these transformations are selected randomly from fixed distributions (Table 4.6). The transformed faces are then used for training, along with their horizontally mirrored versions, to provide yaw rotation invariance. Ground truth labels are, of course, computed using the transformed landmarks.
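To illustrate how a single 6DoF label of Eq. (4.6) can be derived from detected landmarks via Eq. (4.7), here is a minimal sketch using OpenCV's solvePnP and its Rodrigues rotation vector. The generic 3D landmark coordinates and the simple intrinsics are placeholders for illustration, not the values used in the dissertation.

import numpy as np
import cv2

def pose_label(landmarks_2d, landmarks_3d, image_size):
    """Return h = (rx, ry, rz, tx, ty, tz) from m 2D landmarks and their
    reference 3D positions on a generic face model (Eq. (4.6)-(4.7))."""
    h_img, w_img = image_size
    # Simple pinhole intrinsics A: focal length ~ image width, principal point at the center.
    A = np.array([[w_img, 0, w_img / 2.0],
                  [0, w_img, h_img / 2.0],
                  [0, 0, 1.0]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(landmarks_3d.astype(np.float64),
                                  landmarks_2d.astype(np.float64),
                                  A, distCoeffs=None)
    if not ok:
        raise RuntimeError('PnP failed')
    # rvec is already the Rodrigues rotation vector; cv2.Rodrigues(rvec) would give R.
    return np.concatenate([rvec.ravel(), tvec.ravel()])

# Toy usage with 6 hypothetical landmark correspondences (values are illustrative only).
pts3d = np.array([[-30, 30, 0], [30, 30, 0], [0, 0, 20],
                  [-25, -30, -5], [25, -30, -5], [0, -60, -10]], dtype=np.float64)
pts2d = np.array([[90, 100], [150, 100], [120, 130],
                  [95, 160], [145, 160], [120, 200]], dtype=np.float64)
print(pose_label(pts2d, pts3d, image_size=(240, 240)))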
Table 4.6: Summary of augmentation transformation parameters used to train our FPN, where U(a, b) samples from a uniform distribution ranging from a to b and N(μ, σ²) samples from a normal distribution with mean μ and variance σ²; width and height are the face detection bounding box dimensions.

Transformation           Range
Horizontal translation   U(-0.1, 0.1) × width
Vertical translation     U(-0.1, 0.1) × height
Scaling                  U(0.75, 1.25)
Rotation (degrees)       30 × N(0, 1)

FPN training: For our FPN we use an AlexNet architecture [72] with its weights initialized from [87]. The only difference is that here the output regresses 6D floating point values rather than predicting one-hot encoded, multi-class labels. Note that during training, each dimension of the head pose labels is normalized by the corresponding mean and standard deviation of the training set, to compensate for the large value differences among dimensions. The same normalization parameters are used at test time.

2D and 3D face alignment with FPN: Given a test image, it is processed by applying the same face detector [154], cropping the face, and scaling it to the dimensions of the network's input layer. The 6D network output is then converted to a projection matrix. Specifically, the projection matrix is produced by the camera matrix A, rotation matrix R, and the translation vector t in Eq. (4.7). With this projection matrix we can render new views of the face, aligning it across 3D views as previously described.

Figure 4.9: Example augmented training images. Example images from the VGG face data set [101] following data augmentation. Each triplet shows the original detected bounding box (left) and its augmented versions (mirrored across the vertical axis). Both flipped versions were used for training FPN. Note that in some cases, detecting landmarks would be highly challenging on the augmented face, due to severe rotations and scalings not normally handled by existing methods. Our FPN is trained with the original landmark positions, transformed to the augmented image coordinate frame.

Note that some recent papers also regress 3D head pose with deep neural networks, though their methods are designed to estimate 2D landmarks along with 3D face shapes [63, 73, 158]. Unlike our proposed pose estimation, they regress poses using iterative methods which involve computationally costly face rendering. We regress 6DoF directly from image intensities without such rendering steps.

4.3.4 Results

We provide comparisons of our FPN with the following widely used, state-of-the-art facial landmark detection methods: Dlib [67], CLNF [13], OpenFace [14], DCLM [157],
As a consequence, these two benchmarks not only push the limits of face recognition systems, but also the alignment methods used by these systems, possibly more so than the faces in standard facial landmark detection benchmarks. Networks: We repeat the same process as described in section 4.1 to train and to test the face recognition, using dierent face alignment methods. Though, we ne tune the ResFace101 CNN using L2-constrained Softmax Loss [104] instead of the original softmax. Bounding box detection.: We emphasize that an identical pipeline was used with the dierent alignment methods; dierent results vary only in the method used to estimate 73 facial pose. The only other dierence between recognition pipelines was in the facial bounding box detector. Facial landmark detectors are sensitive to the face detector they are used with. We therefore report results obtained when running landmark detectors with the best bound- ing boxes we were able to determine. Specically, FPN was applied to the bounding boxes returned by the detector of Yang and Nevatia [154], following expansion of its dimensions by 25%. Most detectors performed best when applied using the same face detector, without the 25% increase. Finally, 3DDFA [158] was tested with the same face detector followed by the face box expansion code provided by its authors. Face verication and identication results.: Face verication and identication results on both IJB-A and IJB-B are provided in Table 4.7. The overall performances in terms of ROC and CMC curves are shown in Fig. 4.10. The table also provides, as reference, our baseline from section 4.1, two other state-of-the-art IJB-A results [40, 105], and baseline results from [140] for IJB-B (to our knowledge, we are the rst to report verication and identication accuracies on IJB-B). 74 Table 4.7: Verication and identication results on IJB-A and IJB-B, comparing landmark de- tection based face alignment methods. Three baseline IJB-A results are also provided as reference at the top of the table. Our landmark-based method which uses meta data seed landmarks and face bounding boxes; all others did not. Numbers estimated from the ROC and CMC in [140]. Method# TAR@FAR Identication Rate (%) Eval.! .01% 0.1% 1.0% Rank-1 Rank-5 Rank-10 Rank-20 IJB-A [69] Crosswhite et al. [40] { { 93.9 92.8 { 98.6 { Ranjan et al. [105] { 82.3 92.2 94.7 { 98.8 { Our (CLNF w/ metadata) 56.4 75.0 88.8 92.5 96.6 97.4 98.0 RCPR [26] 64.9 75.4 83.5 86.6 90.9 92.2 93.7 Dlib [67] 70.5 80.4 86.8 89.2 91.9 93.0 94.2 CLNF [13] 68.9 75.1 82.9 86.3 90.5 91.9 93.3 OpenFace [14] 58.7 68.9 80.6 84.3 89.8 91.4 93.2 DCLM [157] 64.5 73.8 83.7 86.3 90.7 92.2 93.7 3DDFA [158] 74.8 82.8 89.0 90.3 92.8 93.5 94.4 Our FPN 77.5 85.2 90.1 91.4 93.0 93.8 94.8 IJB-B [140] GOTs [140] 16.0 33.0 60.0 42.0 57.0 62.0 68.0 VGG face [140] 55.0 72.0 86.0 78.0 86.0 89.0 92.0 RCPR [26] 71.2 83.8 93.3 83.6 90.9 93.2 95.0 Dlib [67] 78.1 88.2 94.8 88.0 93.2 94.9 96.3 CLNF [13] 74.1 85.2 93.4 84.5 90.9 93.0 94.8 OpenFace [14] 54.8 71.6 87.0 74.3 84.1 87.8 90.9 DCLM [157] 67.6 81.0 92.0 81.8 89.7 92.0 94.1 3DDFA [158] 78.5 89.1 95.6 89.0 94.1 95.5 96.9 Our FPN 83.2 91.6 96.5 91.1 95.3 96.5 97.5 75 (a) ROC IJB-A (b) CMC IJB-A (c) ROC IJB-B (d) CMC IJB-B Figure 4.10: Verication and identication results on IJB-A and IJB-B. ROC and CMC curves accompanying the results reported in Table 4.7. Faces aligned with our FPN oer higher recognition rates, even compared to the most recent, state-of-the-art facial landmark detection method of [157]. 
In fact, our verication scores on IJB-A outperform our landmark-based result (section 4.1). The landmark-based identication numbers are better, but importantly, it used ground truth annotations to initialize landmark detection search. This allowed to correctly align faces in images where face landmark detectors would normally fail, explaining its higher recognition results. These annotations were not used by any of the other methods compared. 76 Figure 4.11: Qualitative landmark detection examples. Landmarks detected in 300W [111] images by projecting an unmodied 3D face shape, pose aligned using our FPN (red) vs. ground truth (green). The images marked by the red-margin are those which had large FPN errors (> 10% inter-ocular distance). These appear perceptually reasonable, despite these errors. The mistakes in the red-framed example on the third row was clearly a result of our FPN not representing expressions. 4.3.6 Landmark Detection Accuracy From 6DoF pose to facial landmarks.: Given a 6DoF head pose estimate, facial landmarks can then be estimated and compared with existing landmark detection meth- ods for their accuracy on standard benchmarks. To obtain landmark predictions, 3D reference coordinates of facial landmarks are selected o line once on the same generic, 3D face model used in section 4.3.3. Given a pose estimate, we convert it to a projection matrix and project these 3D landmarks down to the input image. 77 Recently, a similar process was proposed for accurate landmark detection across large poses [158]. In their work, an iterative method was used to simultaneously estimate a 3D face shape, including facial expression, and project its landmarks down to the input image. Unlike them, our tests use a single generic 3D face model, unmodied. By not iterating over the face shape, our method is simpler and faster, but of course, our predicted landmarks will not re ect dierent 3D shapes and facial expressions. We next evaluate the eect this has on landmark detection accuracy. Detection accuracy on the 300W benchmark.: We evaluate performance on the 300W data set [111], the most challenging benchmark of its kind, using 68 landmarks. We note that we did not use the standard training sets used with the 300W benchmark (e.g., the HELEN [75] and LFPW [19] training sets with their manual annotations). Instead we trained FPN with the estimated landmarks, as explained in Sec. 4.3.3. As a test set, we used the standard union consisting of the LFPW test set (224 images), the HELEN test set (330), AFW [159] (337), and IBUG [112] (135). These 1026 images, collectively, form the 300W test set. Note that unlike others, we did not use AFW to train our method, allowing us to use it for testing. Fig. 4.12 (a) reports ve measures of accuracy for the various methods tested: The percent of images with 68 landmark detection errors lower than 5%, 10%, and 20% inter- ocular distances, and the mean error rate (MER), averaging Eq. (4.5) over the images tested. Fig. 4.12 (b) additionally provides accumulative error curves for these methods. Not surprisingly, without accounting for face shapes and expressions, our predicted landmarks are not as accurate as those predicted by methods which are in uenced by these factors. Some qualitative detection examples are provided in Fig. 4.11 including a 78 few errors larger than 10%. These show that mistakes can often be attributed to FPN not modeling facial expressions and shape. 
One way to improve this would be to use a 3D face shape estimation method, such as the method will be described in Chapter 6. Detection runtime.: In one tested measure FPN far outperforms its alternatives: The last column of Fig. 4.12 (a) reports the mean, per-image runtime for landmark detection. Our FPN is an order of magnitude faster than any other face alignment method. This is true even compared to the GPU runtimes reported for 3DDFA in their paper [158]. All methods were tested using an NVIDIA, GeForce GTX TITAN X, 12GB RAM, and an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz, 132GB RAM. The only exception was 3DDFA [158], which required a Windows system and was tested using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz (8 CPUs), 16GB RAM, running 8 Pro 64-bit. Discussion: Landmarks predicted using FPN in Sec. 4.3.6 were less accurate than those estimated by other methods. How does that agree with the better face recognition results obtained with images aligned using FPN? As we mentioned in Sec. 4.3.2 better accuracy on a face landmark detection benchmark re ects many things which are not necessarily important when aligning faces for recognition. These include, in particular face shapes and expressions, the latter can actually cause misalignments when computing face pose and warping the face accordingly. FPN, on the other hand, ignores these factors, instead providing a 6DoF pose estimates at breakneck speeds, directly from image intensities. An important observation is that despite being trained with labels generated by Open- Face [14], recognition results on faces aligned with FPN are better than those aligned with OpenFace. This can be explained in a number of ways: First, FPN was trained on ap- pearance variations introduced by augmentation, which OpenFace was not necessarily 79 Method 5% 10% 20% 40% MER Sec./im. RCPR [26] 44.44 % 66.96 % 77.39 % 9.55 % 0.1386 0.05 Dlib [67] 60.03 % 82.65 % 90.94 % 2.83 % 0.0795 2.26 CLNF [13] 20.86 % 65.11 % 87.62 % 2.63 % 0.1106 0.64 OpenFace [14] 54.39 % 86.74 % 95.42 % 1.27 % 0.0702 0.64 DCLM [157] 64.91 % 91.91 % 96.00 % 1.17 % 0.0611 16.2 3DDFA [158] N/A N/A N/A N/A N/A 0.6 Our FPN 1.75 % 65.40 % 93.86 % 0.97 % 0.1043 0.003 (a) Quantitative results (b) Acumulative error curves Figure 4.12: 68 point detection accuracies on 300W. (a) The percent of images with 68 landmark detection errors lower than 5%, 10%, and 20% inter-ocular distances, or greater than 40%, mean error rates (MER) and runtimes. Our FPN was tested using a GPU. On the CPU, FPN runtime was 0.07 seconds. 3DDFA used the AFW collection for training. Code provided for 3DDFA [158] did not allow testing on the GPU; in their paper, they claim GPU runtime to be 0.076 seconds. As AFW was included in our 300W test set, landmark detection accuracy results for 3DDFA were excluded from this table. (b) Accumulative error curves. designed to handle. Second, poses estimated by FPN were less corrupted by expressions and facial shapes, making the warped images better aligned. Third, CNNs are remarkably adapt at training with label noise such as any errors in the poses predicted by OpenFace for the ground truth labels. Finally, CNNs are highly capable of domain shifts to new data, such as the extremely challenging views of the faces in IJB-A and IJB-B. 4.4 Conclusions In this chapter, we investigate the additional techniques to improve the proposed face recognition system. 
First, we simplify and accelerate the 3D face augmentation and matching sequence, which not only speeds up the system but also improves recognition performance. Second, we confirm the scalability of our system by using very-large-scale training data (COW), which substantially lifts our system to surpass state-of-the-art records by a wide margin. Finally, we argue that accurate landmark detection is not essential for a face recognition system: we can obtain a similar, if not better, matching engine by removing landmark detection and regressing the 3D head pose directly from the input image.

Part II

3D Face Modeling

Chapter 5

3D Face Modeling with An Analysis-by-Synthesis Approach

In this chapter, we discuss how to reconstruct a 3D face from imagery input, using analysis-by-synthesis methods. First, in section 5.1, we review the 3D Morphable Model, the core component of template-based 3D face modeling methods. Second, in section 5.2, we design a stable 3D face modeling method for single facial images by automating and enhancing the state-of-the-art algorithm. Third, we expand the system to better infer an individual's 3D face shape when multiple inputs are available (section 5.3). Fourth, we discuss how to perform 3D face modeling on video input (section 5.4). In order to evaluate the quality of the reconstructed 3D models, we next propose the use of a 3D-based face recognition test (section 5.5). Finally, we run comprehensive experiments on various datasets (MICC, Multi-PIE, CASIA, IJB-A) (section 5.6). Our results show that we can recover facial geometry and subject identity with 3D face modeling, particularly when rich input data is available.

5.1 3D Morphable Models

The 3D Morphable Model (3DMM) is a popular way to represent prior knowledge of the human face space. In this model, a basis for face shapes and textures, consisting of a mean and M principal components, was extracted using Principal Component Analysis (PCA) on a set of laser scans. The basis provides M_s shape components, M_t texture components, and M_x expression components. Any 3D face can then be projected onto this basis to obtain a shape vector α ∈ R^{M_s}, a texture vector β ∈ R^{M_t}, and an expression vector γ ∈ R^{M_x}. Its 3D vertex coordinates S ∈ R^{3N} and colors T ∈ R^{3N} can be recovered from {α, β, γ} using the equations:

S(\alpha, \gamma) = P_s \alpha + P_x \gamma + \bar{S};   T(\beta) = P_t \beta + \bar{T},   (5.1)

where P_s ∈ R^{3N×M_s}, P_t ∈ R^{3N×M_t}, and P_x ∈ R^{3N×M_x} are the trained principal components, and \bar{S}, \bar{T} are the average shape and texture. Note that in the vectors S and \bar{S}, the 3D vertex coordinates are stacked in the form [x_1 y_1 z_1 x_2 y_2 z_2 ... x_N y_N z_N]^T. Similarly, in T and \bar{T}, the corresponding vertex colors are stacked as [r_1 g_1 b_1 r_2 g_2 b_2 ... r_N g_N b_N]^T.

With the release of the Basel Face Model (BFM) [102] and the FaceWarehouse database [29], 3DMMs became publicly available. We combined the two databases to extract a large number of principal components for shape and texture (M_s = M_t = 199) and for expression (M_x = 29). Fig. 5.1 illustrates the mean face and faces composed by changing the main principal components of either shape, texture, or expression.

Compared to other representations of 3D face knowledge, the 3DMM has many advantages. Its linear formulation is a simple yet powerful way to compose and alter the 3D output, which is guaranteed to be a human face. Also, the 3D mesh is well-aligned and well-structured; hence we can easily re-use it for other applications such as 3D face matching, blending, and animation.

Figure 5.1: The 3D Morphable Model used in our system. Left: average face shape and texture. Middle: generated 3D faces obtained by changing shape or texture components. Right: generated 3D faces obtained by changing expression components.
5.2 3D Morphable Model Fitting on a Single Image

We follow the idea of 3D Morphable Model fitting proposed by [23], considered the state-of-the-art 3D face modeling work on a single image, with some enhancements. The 3D shape and texture are extracted from the input image given its color pixels and facial landmark points. The task is formulated as an optimization problem, which is solved by a stochastic Newton method (SNO).

Unknowns: Similar to [23], we do not focus on one but on all features involved in facial image formation. More specifically, we develop a single framework to estimate everything at once: 3D face shape, texture, expression, 3D pose, illumination, and color model. Fig. 5.2(a) illustrates the unknowns to be estimated. Taking advantage of the 3D Morphable Model, the 3D face shape and texture are represented by the parameter vectors (α, β, γ). The 3D pose is represented by 6 parameters, including rotations {r_i} and translations {t_i}. As for illumination, we assume a Lambertian surface and the Phong shading model; hence, 8 parameters are used, including ambient intensity, diffuse intensity, and light direction. Finally, to handle different color modes (grayscale/color) and color tones, we use 7 color model parameters. Except for the 3DMM parameters, the others are grouped into a rendering vector ρ. [23] claimed that 99 principal components for shape and texture are enough, while [158] proposed the use of 29 expression components, giving 248 unknowns in total.

Figure 5.2: 3DMM fitting on a single input image. (a) Unknown parameters to be estimated. (b) Goal of the fitting process: minimize the error between the rendered image and the input one.

Cost function: The goal of the fitting process is to minimize the gap between the rendered image and the real one (Fig. 5.2(b)). Our cost function consists of 3 error measures:

• Intensity error (E_I): The average pixel intensity difference between the 2 images. This is the most important and the most obvious cost term.

• Feature error (E_F): Here we compare projected 2D facial landmark points to the "estimated" ones provided by a modern facial landmark detector such as CLNF [11]. This cost plays an important role in the initialization step, as well as in regularizing the optimization process.

• Edge-based error (E_E): Proposed by Romdhani and Vetter [109], this cost term computes the distance from each rendered edge to the closest detected edge in the input image. We use 2 types of edges: facial texture edges (e.g., nose base, eye contour), which are view-independent, and boundary edges, which are view-dependent. This cost alleviates the noise coming from landmark detection, while further exploiting rich image information.

These error terms are combined, alongside a regularization term, into the cost function:

E = \lambda_I E_I + \lambda_F E_F + \lambda_E E_E + \sum_i \frac{\alpha_i^2}{\sigma_{S,i}^2} + \sum_i \frac{\beta_i^2}{\sigma_{T,i}^2} + \sum_i \frac{\gamma_i^2}{\sigma_{X,i}^2} + \sum_i \frac{(\rho_i - \bar{\rho}_i)^2}{\sigma_{R,i}^2},   (5.2)

where λ_I, λ_F, and λ_E are the weights for the intensity, feature, and edge-based errors (statistically computed), σ_{S,i}, σ_{T,i}, and σ_{X,i} are the standard deviations of the shape, texture, and expression principal components, \bar{ρ}_i is the initial value of each rendering parameter, and σ_{R,i} is the standard deviation of each rendering parameter.
Figure 5.3: Landmark detection with CLNF. (a) The 68-point landmark mark-up. (b) Good detection result on a frontal image (c = 0.96). (c) Bad detection result on a pitched face (c = 0.52).

Automatic initialization: The original work [23] is a semi-automatic process, which requires manual annotation of anchor points. Instead, we use CLNF [11], a state-of-the-art landmark detection algorithm, to build a fully automatic pipeline. The algorithm detects landmark points following the Multi-PIE [50] 68-point mark-up (Fig. 5.3(a)). Moreover, it provides a confidence measure c ∈ [0, 1], which is important for the integration of multiple observations (see section 5.3). Fig. 5.3(b) and 5.3(c) show a good and a bad detection result on in-the-wild images. We can differentiate these cases easily using the confidence values.

The detected landmarks are used to initialize the camera pose (r, t) and the 3D face shape. Then, in the optimization process, we use them to compute the feature error (E_F). We also scale the weight of this error term (λ_F) according to the landmark confidence value.

Optimization process: Since the cost function is non-linear, the Newton method is applied to solve this optimization problem. The unknowns (α, β, γ, ρ) are grouped into a single vector x. From its initial value, x is gradually optimized based on the gradient vector (∇E) and the Hessian matrix (H):

x^{(n+1)} = x^{(n)} - \lambda H^{-1} \nabla E,   (5.3)

where λ is the learning rate. To avoid being trapped in a local minimum, and to speed up the computation, a stochastic version of the Newton method is used: in each iteration, E_I is computed on only 40 randomly selected visible vertices. Also, assuming a stable Hessian matrix H, we re-compute it only after every 200 iterations.

Line search strategy: In the original work [22], a fixed learning rate is used to update the vector of unknowns (x). This strategy, however, is not optimal. Instead, we apply a line-search strategy to find a reasonable learning rate in each iteration: λ is initialized with a high value, then gradually decreased until it satisfies the condition:

E(x^{(n+1)}(\lambda)) < E(x^{(n)}).   (5.4)

This strategy guarantees that in each step we actually lower the cost; hence, the optimization process is more stable. Fig. 5.4 demonstrates the effect of line search on a 3D reconstruction result.

Figure 5.4: Effect of the line-search optimization strategy. (a) Input image. (b) Without line search, the estimated 3D model is poor due to an unstable optimization process. (c) With line search, we get a much better 3D face shape.

5.3 Model Fusion from Multiple Input Images

Despite the reasonable reconstruction framework for a single image, it is not guaranteed to recover an accurate 3D face shape. Indeed, each single image presents only limited information: a frontal image poorly indicates the face depth, while a profile image barely gives any hint about the frontal face shape. Therefore, if multiple input images are available, it is preferable to combine them to get a unified 3D model.

To solve this problem, our first attempt was to build a single optimization process considering all input images. Unfortunately, this global optimization approach did not work. The cost fraction coming from the bad images (due to bad landmarks, bad pose estimation, or awkward pose) often dominated the fraction coming from the good ones. Hence, the optimization process struggled with the bad inputs and wasted the information coming from the good ones.

Instead, inspired by [103], we found that combining separate 3D fits is a better idea.
Although each fitted result on a single image is noisy, it is distributed around the accurate one. Based on the law of large numbers, by combining these estimates, we can get a more stable and accurate result.

Moreover, we can improve this integration process by taking advantage of the landmark confidence measure (c). The 3D output (α, β) is now a weighted average of the separate fits {(α_i, β_i)}:

\alpha = \frac{\sum_i c_i \alpha_i}{\sum_i c_i}, \quad \beta = \frac{\sum_i c_i \beta_i}{\sum_i c_i}.   (5.5)

5.4 3D Face Reconstruction from Video 1

Besides images, video is another important media source, widely used in everyday life. Videos of human faces can be found in video surveillance, television, social media, and social networks. Surprisingly, there are only a few studies on 3D face modeling from these available resources. Also, only recently have some facial recognition benchmarks [145, 68] started supporting video data.

Compared to images, a video has many advantages as 3D modeling input. First, its frames share similar settings, such as illumination, image quality, and color modes. The subject's appearance in a video is also consistent, since it was captured within a specific period. Hence, it is easier to associate video frames than images from different sources. Second, in terms of quantity, video is far richer than sparse images; we can drop a large number of frames if they are not reliable. Certainly, this advantage comes at a price: we need to process a massive amount of data, which is computationally heavy; hence a good frame selection is required. Lastly, and most importantly, video data is temporally coherent, meaning we can track the subject through frames, predict the future scene, or infer missing ones.

1 This section is my joint work with Jongmoo Choi, Yann Dumortier, and Prof. Gérard Medioni. The part about 3DFT was published as "Real-time 3-d face tracking and modeling framework for mid-res cam." in Winter Conf. on App. of Comput. Vision, pp. 660-667, IEEE, 2014.

Due to the trackable nature of video, we exploit it deeply in section 5.4.1 in order to get accurate camera poses for most of the frames. Then, in section 5.4.2, we discuss how to extend the image-based 3D face modeling algorithm to video, using this tracking information.

5.4.1 3D Face Tracking

In this section, we introduce 3DFT, an extension of the 3-D Face Tracker [35], which was presented in our paper [38] with some recent improvements. Given a video sequence, the desired system needs to track the three rotation angles and three translations of a face (6 degrees of freedom), with respect to the camera coordinate frame, in the presence of facial expression changes, wide pose variations, and partial occlusions. As illustrated in Fig. 5.5, our 3DFT system is composed of three main modules: initialization with 3-D face modeling, 3-D head pose estimation, and re-acquisition.

Figure 5.5: Overview of the proposed framework. In this section, we describe the main modules: initialization with 3-D face modeling, 3-D pose estimation & validation, and re-acquisition.

a. Initialization with 3-D Face Modeling

A key component of this tracker is the use of a 3D reference face reconstructed in the initialization step. To do so, we need a good frame for 3D modeling. Unlike [35], our system is fully automatic, with the ability to evaluate frames and pick a suitable one to model. A good initial frame must include the user's frontal face with small rotation in yaw, pitch, and roll.
To verify that condition, we detect the landmark points, evaluate the frontal condition and picks the frame if satised. After nding a frontal view, we employ the process of 3D face modeling from a single image, as described in Section 5.2 algorithm, to get the initial 3D face shape. For tracking purpose, we do not use the estimated texture but the mapped one. This method is more robust and accurate than the simple landmark-based 3D modeling proposed in the original paper [35]. 93 b. 3-D Pose Estimation Robust tracking using 2-D tracking points: In the tracking step, the system extracts a collection of 2-D keypoints from the input image and infers their 3-D coordinates from the 3-D model. These points are tracked over frames by Lucas-Kanade (LKT) feature tracker [84]. For each new frame, given the tracked 2D points and their reference 3D coordinates, 3D head pose is recovered with RANSAC-PnP method [48, 78]. To cope with keypoints lost by occlusion, the program extracts the new keypoints each frame and adds them into the tracking list. Also, it automatically drops some random old tracking points to make room for the new-coming ones. By this, the tracker becomes more robust, stable, and capable to work with large rotation angles. In [35], the authors use Speeded Up Robust Features (SURF) keypoints [18] because they are informative. However, this approach has many weaknesses. First, the number of SURF keypoints is small, which reduces the tracking performance. Second, they are not equally distributed on the whole face, but located in some small areas. Thus, the estimation can easily be biased. Finally, SURF is a computationally expensive method. The program has to integrate with a CUDA library [3, 2] and an external graphic hardware for real-time running. Instead, we add new points from two sets: an equal grid and a number of uniformly distributed random points (12x12 and 150 points in implementation). It is a simple way to manage the number and distribution of tracked points. Though each random point is weak for tracking, a large number of random points, coupled with a robust estimator, provide reliable results. 94 (a) SURF points strategy (b) Dynamic random points strategy Figure 5.6: Comparison between SURF and dynamic random points strategy Fig. 5.6 shows a comparison of SURF points and random points strategy. On one hand, SURF points (Fig. 5.6.a) are sparsely distributed on the face. Thus, tracking may be biased. On the other hand, random points strategy (Fig. 5.6.b) can ll a large amount of points on the whole face at every frame. Thus, the result is stable and much more accurate. Advanced RANSAC-PnP: As described before, when adding new tracking points, we rst select a number of 2-D points and then use the estimated pose to calculate their corresponding 3-D coordinates. 95 Hypothesis Mass Belief Plausibility ? 0 0 0 (T ) (:T ) 1 1 1 (T;:T ) 0 1 1 Table 5.1: Belief table in a single evidence case These correspondences are used as the input of RANSAC-PnP procedure in the next frame. If the estimated pose contains error, these correspondences corrupt the result of RANSAC-PnP and introduce some drift. To reduce this eect, in RANSAC-PnP loops, instead of using a pure tracking tech- nique, we also combine it with facial landmarks detection. We pick the estimation which has not only the most agreements from correspondences but also a small error according to the detection result. We use Dempster-Shafer theory [42, 120, 121, 119] as the framework of our fusion problem. 
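Before detailing this advanced variant, the basic track-then-PnP step described above can be sketched with OpenCV. This is a minimal illustration with hypothetical variable names; the tracker's actual implementation differs in details such as keypoint management:

    import cv2
    import numpy as np

    def track_and_estimate_pose(prev_gray, gray, prev_pts2d, pts3d, K):
        """One tracking step: LK optical flow followed by RANSAC-PnP.

        prev_pts2d: (N, 1, 2) float32 keypoints in the previous frame.
        pts3d:      (N, 3) reference 3D coordinates of the same keypoints,
                    read off the personalized 3D face model.
        K:          (3, 3) camera intrinsics.
        """
        pts2d, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts2d, None)
        ok = status.ravel() == 1
        if ok.sum() < 6:
            return None  # too few tracked points to estimate a pose
        ok_2d = pts2d[ok].reshape(-1, 2).astype(np.float32)
        ok_3d = pts3d[ok].astype(np.float32)
        success, rvec, tvec, inliers = cv2.solvePnPRansac(ok_3d, ok_2d, K, None)
        if not success:
            return None
        R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix from the rotation vector
        return R, tvec, inliers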
It is a mathematical theory to combine available \evidences" and to compute the joint degree of belief for a given hypothesis. The theory introduces 3 terms: Mass (degree of belief), Belief (amount of belief that supports the hypothesis at least in part), and Plausibility (amount of belief that does not contradict the hypothesis). For example, with only an evidenceE for an eventT , we have the belief table as in Table 5.1, where =P (TjE). Note that the hypothesis (T;:T ) is \indeterminate", meaning that T could either appear or not. 96 When more evidence is introduced, Dempster's rule of combination can be applied. Given 2 evidencesE 1 ,E 2 , the corresponding mass functionsm 1 ,m 2 , and a hypothesis A in the hypothesis space , the combination mass is: m 12 (A) = P B ;C ;B\C=A m 1 (B)m 2 (C) 1 P B ;C ;B\C=? m 1 (B)m 2 (C) : (5.6) In our system, we use the ratio of inliers (I), the detection error (e l , e r and e n - see equation (5.7)) as evidences, and the expected event T corresponding to the mean absolute rotation error being smaller than 5°. We also run a statistical test on the ICT dataset (see section d.) to estimate the probability model E (x) =P (TjE =x) for each evidence E. When evaluating each estimation, the system computes the probability E of each separate evidence E and converts it to the mass value. Then, Dempster's rule (see equation 5.6) is applied to formulate the combination mass. Finally, the estimation with the highest mass is picked up as the output of the RANSAC-PnP procedure. This algorithm has both advantages of the tracking and the detection techniques. On one hand, by using tracking information, it can follow face in critical cases such as prole views or partial occlusions. It also reduces the eects of false detection results, particularly when the rotation is large. On the other hand, by combining with facial landmarks detection, it can prevent error accumulation and we can maintain tracking. We discuss the eect of this modication in the validation test (section 5.4.d.). 97 (a) recovery mechanism (b) auto-correct mechanism Figure 5.7: Re-acquisition techniques c. Re-acquisition Recovery mechanism: After tracking failure, the system needs to re-match the current frame with the initial one to recover the track. A simple way is to estimate 3D pose coming from detected landmarks. However, landmarks are known to be not reliable at prole views. Hence, we recover the track, using landmark-based pose estimation, only at near a frontal view, when landmark condence is reasonable. The eect of this mechanism is illustrated in Fig. 5.7.a. 98 Auto-correct mechanism: Although the tracking algorithm is robust, it cannot deal with all errors. After a long time, the accumulated error becomes signicant, and the pose needs to be corrected. A simple way would consist in processing the recovery technique for near frontal faces. However, modifying all frontal views leads to an unstable result and also reduces the performance. To get a smooth output, we should modify the estimation only when the pose error is large. To estimate the error in real-time, we need a fast validation. A simple way is to compare with the result of facial landmarks detection. Although such a detector is weak when working with prole view or occluded faces, it provides quite accurate results in near frontal cases, particularly for eyes and nose-top detection. 
Thus, we compute the dierences between its output and our estimate as approximations for the error: e l =jj^ p left eye _ p left eye jj e r =jj^ p right eye _ p right eye jj e n =jj^ p nose _ p nose jj (5.7) with e l ;e r ;e n : errors for the left eye, right eye and the nose-top position respectively, ^ p : the computed position from RANSAC-PnP, _ p : the computed position from facial landmarks detector. When any error is larger than a threshold (5 for the eyes and 8 for the nose-top), a correction procedure with the same algorithm as in recovery step is run. The eect of this mechanism is illustrated in Fig. 5.7.b. 99 0 5 10 15 20 1. Basic functions 2. (1) + Recovery 3. (2) + Auto−correct 4. All Average rotation error( o ) 0 10 20 Loss rate (%) Average rotation error Loss rate Figure 5.8: Evaluation for proposed techniques in the validation test with ICT-3DHP dataset d. Validation Experiments We evaluate the proposed system on the ICT 3-D Headpose [124]. ICT 3-D Headpose consists of 10 videos (800-1600 frames per video) for 10 dierent subjects. The ground truth data of translations and rotations are provided by a magnetic sensor named \Flock of Birds" [4]. This dataset is chosen since it contains videos with a wide range of motions, distracting background, and facial expression changes. First, we re-implemented [36] with auto-initialization. Since SURF points strategy fails to work on videos with small faces, we apply directly the dynamic random point strategy. Then, each advanced feature (sections b. and c.) is sequentially added. Because the rotation is more important and dicult to estimate, we skip the translation error measure. 100 Pitch Yaw Roll Average CLM [39] 5:90 4:63 5:53 5:35 GAVAM [92] 7:69 5:65 3:59 5:64 GAVAM+CLM [15] 6:30 4:94 3:58 4:94 3DFT 6.23 4.46 3.52 4.74 Table 5.2: Experiment results on ICT-3DHP dataset. Fig. 5.8 shows the overall testing results. Each proposed technique has a positive eect on performance. First, the recovery mechanism reduces much the failure rate from 19% to4%. By dealing with many challenging frames at the end of each video, the system, however, provides a higher average error. Then, the auto-correct mechanism shows a large improvement while reducing the lost rate to 0% and the average error to 7:5°. Finally, the Advanced RANSAC-PnP technique continues lowering the error to 4:7°. Note that there is no delay in tracking. The 0% lost rate is true for this dataset only since the ICT-3DHP dataset contains head motion within a limited range. The lost frames from the basic modules (left-most result in Fig. 5.8) are due to drift of the tracker. The auto-correction reduces drift dramatically. Of course, we may observe lost frames and tracking failure in other datasets as shown in section f. To have a comprehensive view, we compare our result with the output of some state of the art techniques. As can be seen in Table 5.2, 3DFT provides the smallest errors, 101 Pitch Yaw Roll Average CLM [39] 3:33 4:32 2:49 3:38 GAVAM [92] 3:98 4:58 2:17 3:58 GAVAM+CLM [15] 2:61 3:66 1:94 2:74 3DFT 3.51 3.74 2.24 3.16 Table 5.3: Experiment results on BU dataset about 4:7°on average. Certainly, since the ICT 3-D Headpose is our training dataset, this comparison is only a reference. e. Experiments with Ground-truth Datasets After optimization, we need to test on ground truth datasets and compare to the state of the art techniques such as GAVAM, CLM, and GAVAM+CLM [92, 39, 15]. In this section, we conduct experiments on BU and Biwi dataset. 
BU dataset: BU head tracking dataset [1] is an "easy" ground truth dataset. It consists of 45 videos on 5 subjects. Each video has 200 frames with dierent head motions. The comparison result is shown in Table 5.3. Since the videos are short and simple, all programs present very good results. 3DFT and GAVAM+CLM are the best methods with only about 3°error for each rotation. 3DFT is slightly worse due to the quality of the rough 3-D model, but the dierence is not statistically signicant. Biwi dataset: Biwi head pose dataset [46] was introduced by Fanelliet et al. It is a set of 24 videos including depth information from 20 subjects. It is an extremely dicult dataset because, in each video, the user rotates the face around prole views (75 for 102 Pitch Yaw Roll Average CLM [39] 20:94 8:73 14:31 14:65 GAVAM [92] 19:00 9:74 13:45 14:06 GAVAM+CLM [15] 17:52 7:81 12:33 12:55 3DFT 13.86 8.53 8.49 10.30 Regression forests [46] (depth images) 9:2 8:5 8:0 8:6 Table 5.4: Experiment results on Biwi dataset yaw and60 for pitch) for a long time without coming back to frontal views. It is a challenge for both detection and tracking techniques. Since our purpose is to track the face with an RGB camera only, we removed the depth information and run the test on each program. To get a comprehensive view, we also include the original work of Fanelliet et al. (using depth information), named Regression forests, for comparison. The proposed method surpassed all other methods on the RGB videos only as shown in Table 5.4. Without depth data, its error still approaches the error of Regression forests, which is one of the state of the art techniques for depth images, with only1:7 bigger on average. f. Experiments with Recorded Videos In this part, we conducted tests to quantify the range of tracking coverage. We also evaluate the system in critical cases such as facial expressions, partial occlusions, and 103 (a) roll in [-180°, 180°] (b) yaw in [-90°, 90°] (c) pitch in [-60°, 90°] Figure 5.9: Experiments on recorded videos by GAVAM+CLM (top) and 3-D Face Tracker (bottom) (1). complete occlusions. Thus, some recorded videos in those conditions are used. We also repeat the tests with GAVAM+CLM to get a comparative result. Range of rotation coverage: The range of rotation coverage is one of the main ad- vantages of our system. While most of other works limit their estimation around frontal views, our program can track the face up to prole views. To verify it, we conducted the experiments of a full rotation in each direction (Fig. 5.9). As can be observed, GAVAM+CLM lose estimation when the rotation is large. On the contrary, 3DFT sup- ports full 360° roll (a), [-90°, 90°] yaw (b) and [-60°, 90°] pitch (c). Our system can track almost all visible views, except highly negative pitched ones. Experiments on facial expression changes and occlusions: Facial expression changes and partial occlusions are also challenges for the head pose estimation problem. To evaluate our system in these cases, we created some test videos and present typical results in Fig. 5.10. In video (a), the user shows dierent expressions as smile or surprise along with head motions. Overall, both GAVAM+CLM and 3DFT work well. However, while 104 (a) facial expression changes (c) complete occlusions and recovery (b) partial occlusions (d) other person Figure 5.10: Experiments on recorded videos by GAVAM+CLM (top) and 3-D Face Tracker (bottom) (2). 
GAVAM+CLM (the top figure) sometimes shows a small drift due to detection errors, 3DFT (the bottom figure) shows more persistent results.

Video (b) is more challenging since the mouth or one eye is covered (the coverage is < 50% of the face). Since many landmarks are mislocated, face detection methods fail. Hence, GAVAM+CLM showed a very poor result, with wrong estimates for both head position and head direction. Conversely, 3DFT interprets the head pose well when the face is partially occluded.

In video (c), we tested the complete occlusion case. GAVAM+CLM is misled and comes up with meaningless results. On the contrary, 3DFT promptly stops tracking when the face is fully covered. It also rapidly recovers when the occlusion is removed.

We also repeated the combined tests on different people. The last figure illustrates the result from a video of a man with glasses. Although the generated 3-D model was not perfect, 3DFT outperforms the baseline methods.

g. Experiments with Real-time Tracking

One more strength of our tracker is that it runs in real time. We ran the tests on a laptop (Lenovo Y580 with an Intel Core i7 @ 2.30GHz and 6GB of RAM). The video resolution is 640 × 480. Table 5.5 shows the performance of the program at each step. As can be seen, our tracker is real-time, with a tracking speed above 14 fps.

Table 5.5: Speed of the 3-D Face Tracker in real-time tracking mode.

Mode            Speed (fps)
Initialization  20.76 ± 1.85
Tracking        14.27 ± 1.21
Recovery        20.79 ± 8.07

5.4.2 Frame-based 3D Face Modeling from Videos

After developing a novel 3D face tracker, we take advantage of its pose estimates for 3D face reconstruction. Since video provides massive input data, we need to filter the frames first, then define the key-frames, and finally associate them in modeling.

a. Bad Tracked Frame Removal

Despite having a robust tracker, we cannot guarantee that the 3D poses in all frames are well estimated. The errors can come from fast motion, low-quality input, etc. Hence, to avoid putting noise into the modeling process, we need to detect the bad frames and remove them first.

We employ a robust confidence measurement. Given the estimated pose, either from landmarks or from tracking, our program first renders an image at the same pose. Then, the rendered image is compared to the input image to measure the pose quality. To deal with illumination issues, instead of using color intensity, we use HOG as a robust representation. When the HOG value of a pixel in the rendered image differs from that of the input, we consider that pixel as noise. The percentage of the noise region over the estimated face region (p) is used to estimate pose quality. When p is greater than a threshold (e.g., 30%), the frame is considered bad and hence removed.

Figure 5.11: HOG-based pose quality measurement. (a) A good frame (p = 13.3%). (b) A bad frame (p = 60.0%). (c) A good frame with illumination change (p = 8.0%). (d) A bad frame with occlusion (p = 69.0%).

Fig. 5.11 illustrates sample results from our method. In each sub-figure, we show 3 columns: rendered images, input images, and error maps, respectively. Evidently, when the estimated pose is good (e.g., Fig. 5.11(a) and 5.11(c)), this error remains small despite expression and illumination changes. In contrast, it is high when the estimated pose is bad (Fig. 5.11(b)) or when the face is occluded (Fig. 5.11(d)).

b. Key-frame based Modeling

After removing the bad data, we select frames with 5° gaps in yaw angle, called key-frames, for 3D modeling.
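A minimal sketch of this key-frame selection follows; the within-bin tie-breaking by landmark confidence is an illustrative choice, not necessarily the exact rule used:

    def select_keyframes(frames, bin_size=5.0):
        """Pick one key-frame per yaw bin of `bin_size` degrees.

        frames: list of (frame_id, yaw_degrees, landmark_confidence) for frames
                that survived the bad-frame removal step.
        """
        best = {}
        for frame_id, yaw, conf in frames:
            b = int(round(yaw / bin_size))
            if b not in best or conf > best[b][2]:
                best[b] = (frame_id, yaw, conf)
        return [best[b][0] for b in sorted(best)]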
The 3D reconstruction process is now similar to the multiple-image case (see section 5.3): each frame is fitted separately, and the results are then combined into a final model. Note that the estimated 3D pose from the tracker is used as the initial value when fitting each frame.

In Fig. 5.12, we show experimental results from 2 videos: one from a subject with a known face, and one from the IJB-A in-the-wild dataset. As can be seen, in both cases the final 3D models are smoother and more accurate. In the first video, the nose profile is greatly improved. In the second video, we get better quality in both 3D shape and texture.

Figure 5.12: 3D reconstruction results on video input. (a) Sample key-frames. (b) Initial 3D face model used in 3D face tracking. (c) Estimated 3D face model obtained by aggregating the 3D fits on key-frames.

5.5 Parameter-based 3D-3D Matching

The 3D shapes of faces are well known to be discriminative. Yet despite this, they are rarely used for face recognition, and always under controlled viewing conditions. We claim that this is a symptom of a serious but often overlooked problem with existing methods for single-view 3D face reconstruction: when applied "in the wild", their 3D estimates are either unstable, changing for different photos of the same subject, or over-regularized and generic. Therefore, we believe face recognition is an important test to measure the quality of reconstructed 3D face models, particularly when ground-truth data is not available. In this section, we describe how to do this specifically with 3DMM-based models.

After recovering the shape and texture parameters, we can combine them into a single identity vector v = (α, β)^T. To compare 2 models, we can simply use the cosine similarity between their identity vectors. However, as previously proposed in Chapter 3, it is better to apply signed square-root normalization to the identity vectors before matching. Also, when training images are available, we can perform PCA training and adaptation. In short, given any reconstructed 3D model with identity vector v and the change-of-basis matrix P, we apply the transformation:

v' = \mathrm{sign}(Pv)\,\sqrt{|Pv|}.   (5.8)

Then, when comparing 2 models v_1 and v_2, we compute the cosine similarity of the processed vectors:

d = \frac{v'_1 \cdot v'_2}{\|v'_1\|\,\|v'_2\|}.   (5.9)

Table 5.6: Error measures on the reconstructed 3D face models from cooperative videos in the MICC dataset [10].

Metric        Generic        Initial 3D     Combined 3D (uniform)   Combined 3D (weighted)
RMSE (mm)     1.89 ± 0.51    1.75 ± 0.42    1.62 ± 0.44             1.60 ± 0.46
Median (mm)   1.75           1.74           1.54                    1.46

5.6 Experiments

In this section, we validate the proposed 3D modeling algorithm in order to answer two questions: (1) Can we recover the 3D shape accurately, particularly when multiple inputs are available? (2) Can we recover the subject's identity with the 3D model reconstructed from still images? For the first question, we conduct an experiment with a ground-truth dataset (section 5.6.1). For the second question, we perform comprehensive 3D face modeling and matching experiments on both controlled (Multi-PIE) and in-the-wild (CASIA, IJB-A) face recognition datasets.

5.6.1 Experiment with Ground-truth Data

We conducted a test using the MICC cooperative data [10], which has videos of 53 subjects with corresponding 3D face scans. For each video, we can build the 3D model from the first frame (Single 3D model), or use the tracking-and-modeling techniques proposed in section 5.4 (Combined 3D model) with either uniform or confidence-based weights.
In the evaluation, we rst crop the face region around its nose tip with radius 95mm, then align it to the ground-truth with Iterative-Closest-Point (ICP) algorithm [34]. Next, we project both the reconstructed 3D model X and the ground truth one on a frontal view. Finally, the depth errors are recorded, and the statistics over 53 videos are shown in Table 5.6. As can be seen, by combining multiple frames instead of using one in 3D modeling, the average error drops from1.75mm to1.60mm. Also, when using condence-based weights instead of the uniform ones, we get slightly better 3D face models, especially when considering the median error. Fig. 5.13 illustrates some qualitative results, including the ground-truth 3Ds (5.13(a)), the single frame reconstructed 3D face models (5.13(b)), and the combined ones with condence-based weights (5.13(c)). For each reconstructed 3D, we show both its model and the corresponding depth error heat-map. In both cases, we get a lower error heat map after model fusion, which conrms the model enhancement gained from multiple inputs. 5.6.2 Experiment with 3D Face Modeling and Matching Multi-PIE dataset: In [23], the authors presented 3D modeling and matching experi- ments on CMU-PIE dataset [122], which had photos of 68 subjects captured in dierent settings (e.g., pose and illumination). This dataset was then extended to Multi-PIE dataset[50] with a total of 337 subjects involving from one to four dierent recording 112 (a) (b) (c) Figure 5.13: 3D Face Modeling results on MICC dataset. (a) Ground-truth 3D face model. (b) Reconstructed 3D face model from the rst frame (frontal). (c) Reconstructed 3D face model from multiple frames with condence-based weight. sessions. It is well-controlled with good image qualities and limited variants of the pose and illumination. To evaluate the matching performance when using reconstructed 3D models from multiple inputs, we dene a template-based matching protocol on Multi-PIE dataset. A template is a set of images of the same person. Instead of comparing image to image, we now compare template to template. In this experiment, we use Multi-PIE session 4, which has the most number of subjects attending (239 subjects). For each subject, all images captured under the light number 10 are put into a gallery template, while all images of each other light are put into a probe template. In total, we have 239 gallery templates and 4541 probe ones, causing1 million comparisons. We then run our 3D modeling and matching framework on this protocol without and with the proposed 3D model fusion. In the rst case, for each template-template comparison, we get a single score by combining all image-image matching scores with soft-max fusion [21]. In the 113 Figure 5.14: 3D-3D matching performance on Multi-PIE dataset [50]. second case, we fuse the 3D models in each template into a single one, and the matching result on the pooled models is reported. The experimental results are summarized in Fig. 5.14 and Table. 5.7. Without model fusion, we get 95.6% for TAR@FAR0.01 and 89.0% for Recognition Rate Rank-1 (RR1). These numbers are statistical meaningful considering the large number of subjects. When using uniform fusion, we get a signicant boost in RR1 (from 89.0% to 93.4%), but a drop in ROC. We can explain it by the fact that in Multi-PIE dataset, there are many prole images with poorly detected landmarks resulting poor 3D face reconstruction. Hence, they put a lot of noise in uniform fusion. 
Condence-based fusion is a better solution since it considers the quality of each model input. Consequently, we get much better results: the left part of ROC is lifted up (TAR@FAR0.001 increases from 66.3% to 86.8%) and RR1 rises to 94.1%. It indicates that the reconstructed 3D models can capture well the subject identities, and we can achieve distinctiveness faster using condence-based model fusion. 114 Table 5.7: 3D-3D matching performance on Multi-PIE dataset. Model fusion# Verication (TAR) Identication (Rec. Rate) Metrics! FAR0.1 FAR0.01 FAR0.001 Rank-1 Rank-5 Rank-10 No fusion 98.6 95.6 66.3 89.0 97.9 98.5 Uniform fusion 95.6 87.0 71.7 93.4 97.1 97.8 Weighted fusion 98.4 95.4 86.8 94.1 97.5 98.0 CASIA WebFace dataset: While Multi-PIE images were captured in studio settings, we also want to evaluate the proposed system with in-the-wild data. CASIA WebFace [156] is a popular dataset, with 0.5 million web images collected from 10,000 subjects. The images were captured in various conditions, including pose, expression, image quality, and illumination variations. However, in contrast to Multi-PIE dataset, most of CASIA- WebFace images are frontal or near frontal. Again, we dene a template-template matching protocol on this dataset. Since the dataset is huge, we randomly select 500 subjects for training and 500 subjects for testing. In training set, we run 3D face modeling on all images, then extract PCA basis on the reconstructed 3Ds. In testing set, we divide the images of each subject into 2 equal sets: one as a gallery template and one as a probe template. Then we run 3D face modeling and matching on these galleries and probes following the algorithm described. The matching results are presented in Table 5.8 and Fig. 5.15. This time, we get notable boosts in both ROC and CMC when changing from no fusion to uniform fusion, and from uniform fusion to weighted fusion. Despite a huge number of gallery subjects, 115 our framework shows an impressive performance with 99.2% for TAR@FAR0.01 and 98.4% for Recognition Rate Rank-1. To have a comprehensive understanding, we analyze the eect of the number of mod- eling images on matching performance. For each template size threshold N, we rerun the process but use only maximum N images per template, either randomly or selec- tively based on landmark condence. Then, we record the relationship between N and matching performance in 2 important measures (TAR@0.01FAR and Recognition Rate Rank-1 (RR1)), which is illustrated in Fig. 5.16. Note that in the plot, we use log-scale in x-axis since the change whenN is small is more notable. As can be seen, with a single image per template, we get very poor results (7.6% and 33.80% RR1 with random or condence-based image selection respectively). When the number of images per template increases, the system performance goes up quickly and gets saturated at N = 25. It implies subject identity can be well-reconstructed with 3D face modeling when enough input data is available. Table 5.8: 3D-3D matching performance on CASIA-WebFace dataset [156]. Model fusion# Verication (TAR) Identication (Rec. Rate) Metrics! FAR0.1 FAR0.01 FAR0.001 Rank-1 Rank-5 Rank-10 No fusion 97.0 88.6 72.2 92.0 98.2 99.4 Uniform fusion 97.6 91.8 76.6 95.6 98.4 98.8 Weighted fusion 99.2 95.6 84.6 98.4 99.4 99.8 116 Figure 5.15: 3D-3D matching performance on CASIA-WebFace dataset. Figure 5.16: 3D-3D matching performance on CASIA-WebFace dataset w.r.t. the maximum number of modeling images per template. 
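As an aside, the parameter-based matching of Eqs. (5.8) and (5.9) used throughout these experiments reduces to a few lines of NumPy. This is a minimal sketch, where v1 and v2 are the concatenated shape and texture coefficient vectors of two reconstructed models and P is the learned change-of-basis matrix:

    import numpy as np

    def encode(v, P):
        """Signed square root of the PCA-projected identity vector, Eq. (5.8)."""
        pv = P.dot(v)
        return np.sign(pv) * np.sqrt(np.abs(pv))

    def match_score(v1, v2, P):
        """Cosine similarity of the processed identity vectors, Eq. (5.9)."""
        a, b = encode(v1, P), encode(v2, P)
        return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))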
IJB-A dataset: Finally, we run an experiment on IJB-A [68], one of the top challenging face recognition benchmarks. This dataset supports template-based matching protocol where each template, the matching entity, is a mixture of still images and extracted video- frames. However, a large number of probe/verication templates have only one or two images, hindering our system eciency. 117 (a) (b) (c) (d) Figure 5.17: 3D Morphable Model tting on multiple images. (a) A sample input image. (b) Some sample tting results on dierent single input. (c) The combined 3D face model (shape+texture). (d) The combined 3D face model (shape only) We apply our 3D modeling and matching framework on this dataset, either using the separate 3D models or the fused ones. Since IJB-A has mixed data from images and videos, we apply the pooling method described in section 4.1.3. Our proposed fusion method provides better 3D face shapes as can be seen in Fig. 5.17. We selected here several subjects with more than 10 dierent images, did 3DMM tting separately, then combined the 3D models by subject. Despite being noisy, each in- dividual 3D tting can capture some distinctive features of the subject face (Fig. 5.17(b)). After combining the 3D models (Fig. 5.17(c) and 5.17(d)), we get much more accurate 118 Figure 5.18: 3D-3D matching performance on IJB-A dataset. Figure 5.19: 3D-3D matching performance on IJB-A dataset w.r.t the minimum number of images per template. results: the distinctive features are preserved while the noises are canceled out. The 3D improvement is transferred as a gain in matching performance, which is showed clearly in Fig. 5.18 and Table. 5.9. Since this dataset is highly challenging, it is not surprising that we get a huge drop in matching performance comparing to in Multi-PIE or CASIA WebFace dataset. The TAR@FAR0.01 is now 65.3%, and Rank-1 number is 54.3%, given condence-based model fusion. However, our system already surpasses the baseline performance coming from 119 Table 5.9: 3D-3D matching performance on IJB-A dataset in comparison to some popular 2D systems. System# IJB-A Ver. (TAR) IJB-A Id. (Rec. Rate) Metrics! FAR0.1 FAR0.01 FAR0.001 Rank-1 Rank-5 Rank-10 Ours (no fusion) 79.8 49.3 24.6 56.5 77.0 83.5 Ours (uniform fusion) 78.5 50.3 25.2 61.1 77.9 83.6 Ours (weighted fusion) 78.9 51.3 28.2 63.6 79.1 84.5 GOTS [68] 62.7 40.6 19.8 44.3 59.5 { OpenBR [70] 43.3 23.6 10.4 24.6 37.5 { Our 2D method (deep learn. - section 4.1.4) { 88.8 75.0 92.5 96.6 97.4 some dedicated 2D face recognition systems, including a commercial software (GOTS) [68] and a common open source one (OpenBR) [68] (Table. 5.9). Considering 3D modeling and matching on in-the-wild data is extremely dicult, this result is very impressive. Again, we evaluated the system performance w.r.t the number of images per template. Since the templates in IJB-A are normally small, we do not apply the method used on CASIA WebFace dataset. Instead, for each template size threshold N, we remove all the probe/verication templates those have less than N images. The relationship between N and system performance is plotted in Fig. 5.19. One more time, both TAR@0.01FAR and RR1 rmly escalate when N increases. At N = 22, TAR@0.01FAR already passes 80%, while RR1 stays above 90%. When N > 22, the number of remaining matching template pairs is small (< 300) causing instability in TAR@0.01FAR measurement, hence we stop recording system performance. 
Note that the state-of-the-art 2D matching performance on IJB-A comes from deep-learning-based techniques such as [139]. However, these systems require an expensive training process on millions of in-the-wild images. They also perform blind matching; we cannot interpret which parts are similar and which parts are different between two faces. In contrast, when doing 3D-3D matching, we can localize the differences (e.g., eye size or mouth shape).

5.7 Conclusions

In this chapter, we investigated how to rebuild 3D face shape from imagery data, from single/multiple images to video. First, we showed that a state-of-the-art 3D face modeling algorithm can be enhanced with recent advanced techniques. Second, we defined an elegant way to combine multiple inputs in order to get a much more stable and accurate 3D output. Third, we proposed a state-of-the-art 3D face tracking method, as well as a frame-based 3D modeling method for video. Finally, we conducted experiments to verify the proposed method in terms of both 3D reconstruction accuracy and identity recovery. The tests confirm the ability to build an accurate and personalized 3D face model given enough imagery inputs.

Chapter 6
Deep 3D Face Modeling and Beyond (1)

In Chapter 5, we discussed how to estimate a 3D face model from imagery input with an analysis-by-synthesis method. One key finding is that we can get much better 3D reconstruction quality given enough input images. Can we close the gap and recover the same 3D model quality given a single input image? Machine learning, and deep learning in particular, is a potential solution.

In this chapter, we try to answer this question by defining a deep-learning method for regressing discriminative 3D morphable face models (3DMM). We use a convolutional neural network (CNN) to regress 3DMM shape and texture parameters directly from an input photo. The 3D estimates produced by our CNN, even from a single input image, surprisingly surpass the multi-input reconstructed models of Chapter 5, and provide state-of-the-art accuracy on the MICC dataset. Coupled with a 3D-3D face matching pipeline, we show the first competitive face recognition results on the LFW, YTF and IJB-A benchmarks using 3D face shapes as representations, rather than the opaque deep feature vectors used by other modern systems.

(1) This chapter is my joint work with Iacopo Masi, Prof. Tal Hassner, and Prof. Gérard Medioni. The deep 3D face modeling part was published as "Regressing robust and discriminative 3D morphable models with a very deep neural network," in Proc. Conf. Comput. Vision Pattern Recognition, 2017.

6.1 Why Deep Learning?

Analysis-by-synthesis 3D modeling from a single input image is known to be unstable under unconstrained viewing conditions. This can be seen in Fig. 6.1, which presents 3D shapes estimated from three unconstrained photos by our 3DMM fitting (Chapter 5) and two other methods (Fig. 6.1 (b-d)). Clearly, though the same subject appears in all photos, shapes produced by the same method are either very different (b, c) or highly regularized and generic (d). It is therefore unsurprising that these shapes are poor representations for recognition. It also explains why we, in Chapters 3 and 4, as well as some recent methods [53, 54, 85, 131], proposed using coarse, simple 3D shape approximations only as proxies when rendering faces to new views rather than as face representations.
Contrary to previous work, we show that robust and discriminative 3D face shapes can, in fact, be estimated from single, unconstrained images (Fig. 6.1 (e)). We propose estimating 3D facial shapes using a very deep convolutional neural network (CNN) to regress 3DMM shape and texture parameters directly from single face photos. We identify the shortage of labeled training data as an obstacle to using data-hungry CNNs for this purpose. We address this problem with a novel means of generating a huge labeled training set of unconstrained faces and their 3DMM representations. Coupled with additional technical novelties, we obtain a method which is fast, robust and accurate.

Figure 6.1: Unconstrained, single-view, 3D face shape reconstruction. (a) Input images of the same subject with disruptive poses and occlusions. (b-e) 3D reconstructions using (b) single-view 3DMM (Chapter 5), (c) the flow-based method [53], (d) 3DDFA [158], (e) our deep-learning approach. (b-c) present different 3D shapes for the same subject and (d) appears generic, whereas our method (e) is robust, producing similar, discriminative 3D shapes for different views.

The accuracy of our estimated shapes is verified on the MICC dataset [10] and quantitatively shown to surpass the accuracy of other 3D reconstruction methods. We further show that our estimated shapes are robust and discriminative by presenting face recognition results on the Labeled Faces in the Wild (LFW) [58], YouTube Faces (YTF) [142] and IJB-A [69] benchmarks. To our knowledge, this is the first time single-image 3D face shapes have been successfully used to represent faces from modern, unconstrained face recognition benchmarks. Finally, to promote reproduction of our results, we publicly release our code and models. (2)

(2) Please see www.openu.ac.il/home/hassner/projects/CNN3DMM for updates.

6.2 Regressing 3DMM Parameters with a CNN

We propose to regress 3DMM face shape parameters directly from an input photo using a very deep CNN. Ostensibly, CNNs are ideal for this task: after all, they are being successfully applied to many related computer vision tasks. Despite their success, however, apart from [107] we are unaware of published reports of using CNNs for 3DMM parameter regression. We believe CNNs have not been used here because this is a regression problem in which both the input photo and the output 3DMM shape parameters are high dimensional. Solving such problems requires deep networks, and these need massive amounts of training data. Unfortunately, existing unconstrained face sets with ground-truth 3D shapes are far too small for this purpose, and obtaining large quantities of 3D face scans is labor intensive and impractical. We therefore instead leverage three key observations:

1. As discussed in Chapter 5, accurate 3D estimates can be obtained by using multiple images of the same face.
2. Unlike the limited availability of ground-truth 3D face shapes, there is certainly no shortage of challenging face sets containing multiple photos per subject.
3. Highly effective deep networks are available for the related task of extracting robust and discriminative face representations for face recognition (Chapter 4).

From (1), we have a reasonable way of producing 3D face shape estimates for training, as surrogates for ground-truth shapes: by using a robust method for multi-view 3DMM

Figure 6.2: Overview of our process. (a) Large quantities of unconstrained photos are used to fit a single 3DMM for each subject.
(b) This is done by first fitting single-image 3DMM shape and texture parameters to each image separately. Then, all 3DMM estimates for the same subject are pooled together into a single estimate per subject. (c) These pooled estimates are used in place of expensive ground-truth face scans to train a very deep CNN to regress 3DMM parameters directly.

estimation. Getting multiple photos for enough subjects is very easy (2). This abundance of examples further allows balancing any reconstruction errors with potentially limitless subjects to train on. Finally, (3), a state-of-the-art CNN for face recognition may be fine-tuned to this problem. It should already be tuned for unconstrained facial appearance variations and trained to produce similar, discriminative outputs for different images of the same face.

6.2.1 Generating Training Data

To generate training data, we use the multi-image 3DMM estimation method proposed in section 5.3. We run it on the unconstrained faces in the CASIA WebFace dataset [156]. These multi-image 3DMM estimates are then used as ground-truth 3D face shapes when training our CNN 3DMM regressor.

Multi-image 3DMM reconstruction is performed by first estimating 3DMM parameters from the 500k single images in CASIA. 3DMM estimates for images of the same subject are then aggregated into a single 3DMM per subject (10k subjects). This process is illustrated in Fig. 6.2. For each subject we thus obtain a pooled identity vector $\hat{\gamma} = [\hat{\alpha}, \hat{\beta}]$, where $\hat{\alpha}$ and $\hat{\beta}$ are the pooled shape and pooled texture parameter vectors, respectively.

6.2.2 Learning to Regress Pooled 3DMM

Following the process described in Sec. 6.2.1, each subject in our dataset is associated with a number of images and a single, pooled 3DMM. We now use this data to learn a function which, ideally, regresses the same pooled 3DMM feature vector for different photos of the same subject. To this end, we reuse the CNN for face recognition. It uses the very deep ResNet architecture [56] with 101 layers, trained for face recognition in Section 4.1. We modify its last fully-connected layer to output the 198D 3DMM feature vector $\gamma$. The network is then fine-tuned on CASIA images using the pooled 3DMM estimates as target values; different images of the same subject are presented to the CNN with the same target 3DMM shape. We note that we also tried the VGG-Face CNN of [100] with 16 layers. Its results were similar to those obtained by the ResNet architecture, though somewhat lower.

The Asymmetric Euclidean Loss: Training our network requires some care when defining its loss function. 3DMM vectors, by construction, belong to a multivariate Gaussian distribution with its mean at the origin, representing the mean face (Sec. 6.2.1). Consequently, during training, using the standard Euclidean loss to minimize distances between estimated and target 3DMM vectors will favor estimates closer to the origin: these will have a higher probability of being closer to their target values than those further away. In practice, we found that a network trained with the Euclidean loss tends to output less detailed faces (Fig. 6.3). To counter this bias towards a mean face shape, we introduce an asymmetric Euclidean loss. It is designed to encourage the network to favor estimates further away from the origin by decoupling under-estimation errors (errors on the side of the 3DMM target closer to the origin) from over-estimation errors (where the estimate is further from the origin than the target).
It is defined by:

L(\gamma_p, \gamma) = \lambda_1 \underbrace{\|\gamma^{+} - \gamma_{\max}\|_2^2}_{\text{over-estimate}} + \lambda_2 \underbrace{\|\gamma_p^{+} - \gamma_{\max}\|_2^2}_{\text{under-estimate}},     (6.1)

using the element-wise operators:

\gamma^{+} := \mathrm{abs}(\gamma) = \mathrm{sign}(\gamma) \cdot \gamma, \qquad \gamma_p^{+} := \mathrm{sign}(\gamma) \cdot \gamma_p,     (6.2)

\gamma_{\max} := \max(\gamma^{+}, \gamma_p^{+}).     (6.3)

Here, \gamma is the target pooled 3DMM value, \gamma_p is the output, regressed 3DMM, and \lambda_{1,2} control the trade-off between the over- and under-estimation errors. When both equal 1, this reduces to the traditional Euclidean loss. In practice, we set \lambda_1 = 1, \lambda_2 = 3, thus changing the behavior of the training process, allowing it to escape under-fitting faster and encouraging the network to produce more detailed, realistic 3D face models (Fig. 6.3).

Figure 6.3: Effect of our loss function: (left) input image, (a) generic model, (b) regressed shape and texture with a regular \ell_2 loss, and (c) our proposed asymmetric \ell_2 loss.

Network hyperparameters: Eq. (6.1) is solved using Stochastic Gradient Descent (SGD) with a mini-batch of size 144, momentum set to 0.9, and \ell_2 regularization over the weights with a weight decay of 0.0005. When performing back-propagation, we learn the inner-product layer (fc) after pool5 faster, setting its learning rate to 0.01, since it is trained from scratch for the regression problem. Other network weights are updated with a learning rate an order of magnitude lower. When the validation loss saturates, we decrease learning rates by an order of magnitude, until the validation loss stops decreasing.

Discussion: Render-free 3DMM estimator: It is important to note that by choosing to use a CNN to regress 3DMM parameters, we obtain a function that is render-free. That is, 3DMM parameters are regressed directly from the input image, without an optimization process which renders the face and compares it to the photo, as existing methods for 3DMM estimation do (including our method for generating training data in Sec. 6.2.1). By using a CNN, we therefore hope to gain not only improved accuracy, but also much faster 3DMM estimation speeds.

Face alignment-free: Facial landmark detection and face alignment are known to improve recognition accuracy (e.g., [144, 54]). In fact, the recent, related work of [57] manually assigned landmarks before using their 3DMM fitting method for recognition on controlled images. We, however, did not align faces beyond using the bounding boxes provided with the datasets. We found our method robust to misalignments and so spared the runtime this would require.

6.3 Experimental Results

We test our proposed method, comparing the accuracy of its estimated 3D shapes, its speed, and its ability to represent faces for recognition with the methods proposed in Chapter 5, as well as other existing methods. Importantly, we are unaware of any previous work on single-view 3D face shape estimation which reported as many quantitative tests as we do, in terms of the number of benchmarks used, the number of baseline methods compared against, and the level of difficulty of the photos used in these tests.

Specifically, we again evaluate the accuracy of our estimated 3D shapes using videos and their corresponding scanned, ground-truth 3D shapes from the MICC Florence Faces dataset [10] (Sec. 6.3.1). To test how discriminative and robust our shapes are when estimated from unconstrained images, we perform single-image and multi-image face recognition using LFW [58], YTF [142] and the new IARPA Janus Benchmark-A (IJB-A) [69] (Sec. 6.3.3). Finally, we also provide qualitative results in Sec. 6.3.4.
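As an illustration of Eqs. (6.1)-(6.3), the snippet below computes the asymmetric Euclidean loss with NumPy for a batch of predicted and target parameter vectors. It is a minimal sketch of the forward computation only (a training implementation would express it in an autodiff framework); the function and variable names are our own.

```python
import numpy as np

def asymmetric_euclidean_loss(gamma_pred, gamma_target, lambda1=1.0, lambda2=3.0):
    """Asymmetric Euclidean loss of Eqs. (6.1)-(6.3), averaged over a batch.

    gamma_pred, gamma_target : (B, 198) arrays of regressed / pooled 3DMM params.
    lambda1 weights over-estimation, lambda2 weights under-estimation.
    """
    sign = np.sign(gamma_target)
    gamma_plus = np.abs(gamma_target)        # gamma^+   = sign(gamma) * gamma
    gamma_p_plus = sign * gamma_pred         # gamma_p^+ = sign(gamma) * gamma_p
    gamma_max = np.maximum(gamma_plus, gamma_p_plus)

    over = np.sum((gamma_plus - gamma_max) ** 2, axis=1)     # nonzero when the estimate overshoots the target (away from the origin)
    under = np.sum((gamma_p_plus - gamma_max) ** 2, axis=1)  # nonzero when the estimate falls short of the target
    return np.mean(lambda1 * over + lambda2 * under)

# With lambda1 = lambda2 = 1 this equals the plain squared Euclidean loss.
pred = np.random.randn(4, 198)
target = np.random.randn(4, 198)
print(asymmetric_euclidean_loss(pred, target))
```

For each coordinate, exactly one of the two terms is active, so setting lambda2 > lambda1 penalizes shrinking toward the mean face more heavily, which is the intended effect described above.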
Compared to Chapter 5, we skip the experiment on Multi-PIE, since its recognition rate is saturated, and the experiment on CASIA-WebFace, since it is our training dataset. As baseline 3D reconstruction methods we used the analysis-by-synthesis 3DMM fitting (Chapter 5), the flow-based method of [53], the edge-based method of [17], the multi-resolution, multi-view approach of [59], and 3DDFA [158], all tested with their authors' implementations.

Figure 6.4: Qualitative comparison of surface errors, visualized as heat maps with real-world mm errors on MICC face videos and their ground-truth 3D shapes. Left to right, top to bottom: frame from input; 3D ground truth; generic face; estimates for the flow-based method [53], Huber et al. [59], 3DDFA [158], Bas et al. [17], 3DMM+pool (Chapter 5), us+pool.

6.3.1 3D Shape Reconstruction Accuracy

Again, we run experiments on the MICC dataset [10], which contains challenging face videos of 53 subjects and the corresponding ground-truth 3D face models. These videos were used for single-image and multi-frame 3D reconstructions, comparing our method to existing alternatives. In these tests, estimated shape parameters were converted to 3D using Eq. (5.1), cropped at a radius of 95mm around the tip of the nose, and globally aligned to the ground truth using the standard rigid iterative closest point (ICP) method [20, 34], obtaining the estimated and ground-truth surfaces $X, X^{*} \subset \mathbb{R}^3$, respectively. They were additionally projected to a frontal view, obtaining depth maps $D_Q$ and $D_{Q^*}$. Estimation accuracy was then computed with standard error measures [53, 117]:

- 3D Root Mean Square Error (3DRMSE): $\sqrt{\sum_i (X_i - X^{*}_i)^2 / N_v}$
- Root Mean Square Error (RMSE): $\sqrt{\sum_i (D_Q^i - D_{Q^*}^i)^2 / N_p}$
- $\log_{10}$ error: $|\log_{10}(D_Q) - \log_{10}(D_{Q^*})|$
- Relative error (Rel): $|D_Q - D_{Q^*}| / |D_{Q^*}|$

Here, $N_v$ is the number of 3D vertices and $N_p$ the number of pixels in these representations.

Table 6.1: 3D estimation accuracy and per-image speed on the MICC dataset. Top are single-view methods, bottom are multi-frame. See text for details on measures. 3DRMSE is in real-world mm. (*) denotes the method used to produce the training data in Sec. 6.2.1. Lower values are better.
Method                    | 3DRMSE   | RMSE     | log10 (x10^-4) | Rel (x10^-4) | Sec.
Generic                   | 1.88±.52 | 3.48±.76 | 28±7           | 65±16        | --
3DMM (Chapter 5)          | 1.75±.42 | 3.64±.94 | 29±8           | 68±18        | 120
Flow-based [53]           | 1.83±.39 | 3.29±.70 | 27±6           | 62±14        | 13.3
Us                        | 1.57±.33 | 3.18±.77 | 26±6           | 59±14        | .088
Generic+pool              | 1.88±.52 | 3.48±.76 | 28±7           | 65±16        | --
3DMM (Chapter 5)+pool (*) | 1.60±.46 | 3.31±.98 | 27±9           | 62±20        | 120
3DDFA [158]+pool          | 1.83±.58 | 3.45±.85 | 28±7           | 65±17        | .146
[59]                      | 1.84±.32 | 3.73±.62 | 30±5           | 68±11        | .372
[17]+pool                 | 1.84±.58 | 3.45±.85 | 28±6           | 65±13        | 52.3
Us+pool                   | 1.53±.29 | 3.14±.70 | 25±6           | 58±13        | .088

Single-view estimation was performed on the initial frontal frame. Multi-frame reconstructions were done on the frames selected by the proposed method in section 5.4. For all 3DMM fitting baselines, we found that estimating shape, texture and expression parameters, but using only shape and texture for comparisons, gave the best results. This approach was therefore used in all our tests.

Results are reported in Tab. 6.1. Error rates are averaged across all videos and reported with standard deviations. Our method is clearly the most accurate. Remarkably, both its single-view and multi-frame versions outperform the method used to produce the training-set target 3DMM labels (3DMM+pool). This may be due to our use of such a large dataset to train the CNN and the known robustness of CNNs to training label errors and noise [147]. Our estimates are more accurate than the very recent state of the art.
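For reference, the following sketch computes these four error measures with NumPy, assuming the estimated and ground-truth surfaces have already been ICP-aligned and placed in vertex/pixel correspondence (which is how the measures above are defined); the array names are assumptions.

```python
import numpy as np

def reconstruction_errors(X, X_gt, D, D_gt):
    """3DRMSE on corresponding vertices; RMSE, log10 and Rel on depth maps.

    X, X_gt : (Nv, 3) aligned estimated / ground-truth vertices.
    D, D_gt : (Np,) frontal depth values at corresponding pixels (positive).
    """
    rmse3d = np.sqrt(np.mean(np.sum((X - X_gt) ** 2, axis=1)))   # 3DRMSE
    rmse = np.sqrt(np.mean((D - D_gt) ** 2))                     # depth RMSE
    log10 = np.mean(np.abs(np.log10(D) - np.log10(D_gt)))        # log10 error
    rel = np.mean(np.abs(D - D_gt) / np.abs(D_gt))               # relative error
    return rmse3d, rmse, log10, rel
```

The log10 and Rel values are averaged per pixel here; in Table 6.1 they are additionally scaled for readability.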
This includes 3DDFA [158], which fits 3DMM parameters using a CNN to deal with large pose variations, as well as [59] and [17]. To better appreciate these numbers, note that our improvement over standard 3DMM fitting is comparable to their improvement over using an unmodified, generic Basel face shape [102].

Fig. 6.4 provides a qualitative comparison of the surface errors in mm for different methods on a subject from the MICC dataset. Our method produces visually smaller errors with respect to the ground truth. The areas around the nose and mouth in particular have very low errors, while other methods are more sensitive in these regions (e.g., 3DDFA [158]).

6.3.2 3DMM Regression Speed

Tab. 6.1 (rightmost column) also reports the average per-image runtime, in seconds, required by the various methods to predict 3D face shapes. We compared our approach with iterative methods such as classic 3DMM implementations [17, 59], the flow-based method of [53], and also with a recent CNN-based method [158]. As mentioned earlier, our method is render-free, without optimization loops which render the estimated parameters and compare them to the input photo. Unsurprisingly, at 0.088s (11Hz), our CNN predicts 3DMM parameters several orders of magnitude faster than most of the methods we tested. The second fastest method, by a wide gap, is the 3DDFA of [158], requiring 0.146s (7Hz) per prediction.

Runtime was measured on two different systems. All our baselines required MS-Windows to run and were tested on an Intel Core i7-4820K CPU @ 3.7GHz with 16GB RAM and an NVIDIA GeForce GTX 770. Our method requires Linux and so was tested on an Intel Xeon CPU @ 3.60GHz with 12 GB of RAM and a GeForce GTX 590. Importantly, the system used to measure our runtime is the slower of the two; our runtimes may therefore be overestimated.

6.3.3 Face Recognition In The Wild

We next consider the robustness of our 3DMM estimates and how discriminative they are. We aim to see if our 3DMM estimates for different unconstrained photos of the same person are more similar to each other than to those of other subjects. An effective way of doing this is by testing our 3DMM estimates on face recognition benchmarks. We emphasize that our goal is not to set new face recognition records; doing so would require competing with state-of-the-art systems designed exclusively for that problem. We provide the performance of relevant (though not necessarily state-of-the-art) recognition systems only as a reference. Nevertheless, our results below are the highest we know of that were obtained with meaningful features (here, shape and texture parameters) rather than opaque representations.

Our tests use the pipeline described in Sec. 3.2.2 and report multiple recognition metrics for verification (on LFW and YTF) and identification metrics (on IJB-A). These metrics are verification accuracy, 100%-EER (Equal Error Rate), Area Under the Curve (AUC), and recall (True Acceptance Rate) at two cut-off points of the False Alarm Rate (TAR at FAR of 10% and 1%). For identification we report the recognition rates at various ranks from the CMC (Cumulative Matching Characteristic). For each tested method we also indicate its use of estimated 3D shape and/or texture. Finally, bold values indicate the best-scoring 3D reconstruction methods.

Labeled Faces in the Wild (LFW): [58] results are provided in Tab. 6.2 (top) and Fig. 6.5 (left). Evidently, the shapes estimated by 3DDFA [158] are only slightly more robust and discriminative than the classical eigenfaces [138].
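To make the verification metrics concrete, the sketch below shows one standard way to compute TAR at a fixed FAR from raw genuine/impostor similarity scores; it is illustrative only, and the function and variable names are our own.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=0.01):
    """True Acceptance Rate at the threshold that yields the requested FAR.

    genuine_scores  : similarities of same-subject pairs.
    impostor_scores : similarities of different-subject pairs.
    """
    impostor_scores = np.sort(np.asarray(impostor_scores))
    # Threshold such that a fraction `far` of impostor scores exceed it.
    thresh = impostor_scores[int(np.ceil((1.0 - far) * len(impostor_scores))) - 1]
    return np.mean(np.asarray(genuine_scores) > thresh)

# Example with synthetic scores.
rng = np.random.default_rng(0)
gen = rng.normal(0.7, 0.1, 1000)     # same-subject pairs
imp = rng.normal(0.3, 0.1, 10000)    # different-subject pairs
print(tar_at_far(gen, imp, far=0.01))
```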
Fitting 3DMMs (Chapter 5) does better, but falls behind the Hybrid method of [144], one of the first results on LFW, now nearly a decade old. Both results suggest that the shapes estimated by these methods are unstable in unconstrained settings and/or too generic. By comparison, recognition performance with our estimated 3DMM parameters is not far behind that recently reported by Facebook, using their multi-CNN approach trained on four million images [131].

Table 6.2: LFW and YTF face verification. Comparing our 3DMM regression with others, including baseline face recognition methods. (*) denotes the same method used to produce 3DMM target values for our CNN training (Sec. 6.2.1).

Labeled Faces in the Wild
Method                    | 3D | Texture | Accuracy   | 100%-EER   | AUC        | TAR-10%    | TAR-1%
EigenFaces [138]          | -- | --      | 60.02±0.79 | --         | --         | 25         | 6.2
Hybrid Descriptor [144]   | -- | --      | 78.47±0.51 | --         | --         | 66.60      | 42.4
DeepFace-ensemble [131]   | -- | --      | 97.35±0.25 | --         | --         | 99.6       | 93.7
ResFace101 (section 4.1)  | -- | --      | 98.06±0.60 | 98.00±0.73 | --         | 99.5       | 94.2
3DMM (Chapter 5) (*)      | yes | no     | 66.13±2.79 | 65.70±2.81 | 72.24±2.75 | 35.90±3.74 | 12.37±4.81
                          | no  | yes    | 74.93±1.14 | 74.50±1.21 | 82.94±1.14 | 60.40±3.15 | 28.73±7.17
                          | yes | yes    | 75.25±2.12 | 74.73±2.56 | 83.21±1.93 | 59.4±4.64  | 29.67±4.73
3DDFA [158]               | yes | no     | 66.98±2.56 | 67.13±1.90 | 73.30±2.49 | 36.76±6.27 | 10.00±3.22
Us                        | yes | no     | 90.53±1.34 | 90.63±1.61 | 96.6±0.79  | 91.13±2.62 | 58.20±12.14
                          | no  | yes    | 90.6±1.07  | 90.70±1.17 | 96.75±0.59 | 91.23±2.42 | 52.60±8.14
                          | yes | yes    | 92.35±1.29 | 92.33±1.33 | 97.71±0.64 | 94.2±2.00  | 65.57±6.93

YouTube Faces
Method                    | 3D | Texture | Accuracy   | 100%-EER   | AUC        | TAR-10%    | TAR-1%
MBGS LBP [142]            | -- | --      | 76.4±1.8   | 74.7       | 82.6       | 60.5       | 35.8
DeepFace-ensemble [131]   | -- | --      | 91.4±1.1   | 91.4       | 96.3       | 92         | 54
3DMM (Chapter 5)+pool (*) | yes | no     | 73.26±2.51 | 73.08±2.65 | 80.41±2.60 | 51.36±5.11 | 24.04±4.56
                          | no  | yes    | 77.34±2.54 | 76.96±2.64 | 85.32±2.63 | 63.16±5.07 | 31.36±5.21
                          | yes | yes    | 79.56±2.08 | 79.20±2.07 | 87.35±1.92 | 69.08±5.00 | 34.56±6.89
3DDFA [158]+pool          | yes | no     | 68.10±2.93 | 67.96±3.12 | 74.95±3.04 | 40.52±3.65 | 12.2±2.67
Us+pool                   | yes | no     | 88.28±1.84 | 88.32±2.16 | 95.95±1.38 | 86.60±3.95 | 51.12±8.86
                          | no  | yes    | 87.56±2.56 | 87.68±2.25 | 94.44±1.38 | 84.80±4.89 | 40.92±8.26
                          | yes | yes    | 88.80±2.21 | 88.84±2.40 | 95.37±1.43 | 87.92±4.18 | 46.56±6.20

Table 6.3: IJB-A face verification and recognition. Comparing our 3DMM regression with others, including baseline face recognition methods. (*) denotes the same method used to produce 3DMM target values for our CNN training.
Method        | 3D  | Text. | TAR-10%  | TAR-1%   | Rank-1   | Rank-5   | Rank-10
ResFace101    | --  | --    | --       | 88.6±1.6 | 90.6±1.2 | 96.2±0.6 | 97.7±0.4
3DMM+pool (*) | yes | no    | 66.0±1.7 | 36.6±3.4 | 44.4±1.8 | 63.9±2.2 | 72.5±1.6
              | no  | yes   | 73.6±1.7 | 42.2±5.2 | 54.9±2.5 | 73.2±1.9 | 80.2±1.4
              | yes | yes   | 78.5±1.1 | 49.7±5.9 | 63.6±1.8 | 79.1±1.4 | 84.5±1.3
3DDFA+pool    | yes | no    | 43.3±2.5 | 12.5±1.9 | 16.7±1.9 | 38.3±2.7 | 51.3±3.0
Us+pool       | yes | no    | 86.0±1.7 | 55.9±5.5 | 72.3±1.4 | 88.0±1.4 | 91.8±1.1
              | no  | yes   | 83.5±2.2 | 50.3±5.8 | 70.9±1.5 | 87.3±1.1 | 91.5±1.0
              | yes | yes   | 87.0±1.5 | 60.0±5.6 | 76.2±1.8 | 89.7±1.0 | 92.9±1.0

YouTube Faces (YTF): [142] Accuracy on YTF videos is reported in Tab. 6.2 (bottom) and Fig. 6.5 (mid-left). Though video frames in this set are often low in quality and resolution, our method performs well. It is outperformed by the Facebook CNN ensemble system [131], explicitly designed for face recognition, by an AUC gap of only 1%. The 3DMM shapes and textures estimated by other methods perform far worse: the analysis-by-synthesis 3DMM fitting does only slightly better than the MBGS face recognition system [142], which is the oldest result on that benchmark, and [158] falls far behind.

IARPA Janus Benchmark A (IJB-A): [69] We evaluated both the face verification (1:1) and recognition (1:N) protocols and report results in Tab. 6.3 and Fig. 6.5 (mid-right, right). Here too, performance follows the same pattern as in the previous two benchmarks, with 3D shapes estimated by 3DDFA [158] performing far worse than other methods.
Here too, performances adopt the same pattern as in the previous two benchmarks, with 3D shapes estimated by 3DDFA [158] performing far worse than other 137 0 0.5 1 0 0.2 0.4 0.6 0.8 1 False Acceptance Rate True Acceptance Rate LFW EigenFaces Hybrid Descriptor DeepFace−ensemble ResFace101 3DMM (Shape) 3DMM (Texture) 3DMM (Shape+Texture) 3DDFA (Shape) Us (Shape) Us (Texture) Us (Shape+Texture) 0 0.5 1 0 0.2 0.4 0.6 0.8 1 False Acceptance Rate True Acceptance Rate YTF MBGS LBP DeepFace−ensemble 3DMM (Shape) 3DMM (Texture) 3DMM (Shape+Texture) 3DDFA Us (Shape) Us (Texture) Us (Shape+Texture) Figure 6.5: Face verication and recognition results. From left to right: Verication ROC curves for LFW, YTF, and IJB-A, and the recognition CMC for IJB-A. methods. Our CNN-based method performs quite well, though it is outperformed by a wide margin by our dedicated 2D face recognition, which was designed for this set. 6.3.4 Qualitative Results We provide qualitative 3D reconstructions of faces in the wild in Fig. 6.6, showing both rendered 3D shapes and (when available) also its estimated texture. These results show that our method generates more visually plausible 3D and texture estimates compared with those produced by other methods. Fig. 6.6 also shows a few failure cases, here due to facial hair which was missing from the original 3DMM representation and extreme out- of-plane rotation which produced a thin, unrealistic 3D shape. For more results, please see our project webpage. 3 3 www.openu.ac.il/home/hassner/projects/CNN3DMM. 138 Figure 6.6: Qualitative results on 3DMM-CNN. Top: Reconstructions for pairs of photos of the same subjects, demonstrating the discriminative and robustness of our method. Middle: Results obtained by 3DMM (Chapter 5), 3DDFA [158] and our method on still-images from LFW and single frames from YTF. Bottom: Two failure examples. 139 6.4 2D-3D Fusion for Face Recognition In the previous section, we presented face recognition results using reconstructed 3D face models on dierent datasets. These results are promising but still far from the 2-D numbers. Can we use 3D face recognition as a supplementary for 2D face matching? This question will be answered in this section. To do so, we collected 3D and 2D matching scores on IJB-A. We used both 2D face recognition CNNs that were trained on CASIA WebFace (section 4.1) and Clean-COW dataset (section 4.2). We call them 2D-CNN1 and 2D-CNN2 respectively. Note that our 3DMM-CNN was trained on CASIA only, thus the fusion between its results and the ones from 2D-CNN1 is fairer. However, we will show that 3D face recognition still helps to improve the matching results from 2D-CNN2, which was trained on a much larger training data. Assumed that we have an "oracle" selector that knows which cases the 3D matching works better than the 2D one. In Identication tests, it searches for the probes in which 3D matching provides better recognition ranks for the ground-truth galleries. In Veri- cation tests, for each False Acceptance Rate (FAR), it searches for the genuine pairs that are accepted by the 3D matcher but the 2D one. By applying this "oracle" selector , we can eectively use 3D matching results to improve the 2D ones on IJB-A, as can be seen in Fig. 6.7 and Table 6.4. Specically, with 2D-CNN1, the 3D matcher can lift Rank-1 from 92.5% to 94% and lift TAR@1%FAR from 88.8% to 89.9%. Even with the powerful 2D face recognition CNN trained on CleanCOW (2D-CNN2), the oracle fusion still helps it gain 0.6% in Rank-1 (from 95.7% to 96.3%). 
Table 6.4: 2D-3D face recognition fusion on IJB-A.
Method                   | TAR-10% | TAR-1% | TAR-0.1% | Rank-1 | Rank-5 | Rank-10
3D-CNN                   | 87.0    | 60.0   | 31.3     | 76.2   | 89.7   | 92.9
2D-CNN trained on CASIA
  2D only                | 96.9    | 88.8   | 75.0     | 92.5   | 96.6   | 97.4
  2D + 3D                | 97.3    | 89.9   | 76.4     | 94.0   | 97.7   | 98.5
2D-CNN trained on COW
  2D only                | 98.1    | 94.7   | 87.8     | 95.7   | 97.8   | 98.3
  2D + 3D                | 98.4    | 94.9   | 88.0     | 96.3   | 98.6   | 99.1

Figure 6.7: 2D-3D face recognition fusion on IJB-A.

This result shows that 3D face matching, by exploiting a different face representation, can supplement 2D methods. Designing a fusion scheme that approximates the oracle is an interesting topic for future research.

6.5 3D Model Refinement

With CNN-based 3DMM regression, we can reconstruct the identity 3D face model, including shape and texture. These components are image-independent; they depend only on the subject in the image. There are other 3D factors, such as expression and fine details, which are also important but highly image-dependent. In this part, we discuss how to recover these components, given the 3D identity shape as the foundation.

6.5.1 Deep Expression Estimation

Expression is one of the important components of facial appearance. Expression, however, is not fixed; it varies from image to image. As mentioned in section 5.1, we can represent 3D expression with a 3D morphable model (3DMM), similar to 3D shape and texture. Therefore, we can follow the same process as in section 6.2 to regress the 3D expression component from a single input image, though we have to use image-specific 3DMM expression parameters as ground-truth labels instead of the pooled ones. The process can be briefly described as follows:

- Run analysis-by-synthesis 3DMM fitting on CASIA-WebFace images. For each image, extract a 29D expression parameter vector as "ground truth".
- Fine-tune a ResNet-101 network, with the initial weights trained for face recognition (see section 4.1), using CASIA images and the corresponding "ground-truth" training labels. Modify the last fully-connected layer to regress a 29D floating-point vector of 3DMM expression parameters. Also, use an asymmetric Euclidean loss for effective training.

We now have two independent networks to regress the identity 3D model (called I-D3DMM) and the 3D expression component (called E-D3DMM). For each input image, we can run these CNNs in parallel to get an identity vector $\gamma = [\alpha, \beta]$ and an expression parameter vector. Then, we can recover the 3D face model using equation 5.1.

Fig. 6.8 presents some qualitative results on the LFW dataset, using facial images ranging from mostly neutral to highly expressive. Overall, E-D3DMM reproduces the facial expression of each input image well. In particular, in the last two columns of rows 2 and 3, E-D3DMM captures different expression levels when the input images come from the same subjects. The last row, however, shows two failure cases. The first image was taken at a near-profile view, which is challenging for the traditional analysis-by-synthesis method. Common landmark detectors often fail on this type of image, thereby misleading the 3DMM fitting process that generated our training labels. This noisy training data can explain why E-D3DMM does not do well on near-profile images. Note that we do not have this issue with I-D3DMM, since it uses the high-quality training labels generated from multiple images. The image-specific training labels also lead to a less stable CNN training process, which is the cause of failure in the second image, where the facial expression is over-estimated.
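The two regressed parameter sets are combined through the linear 3DMM of equation 5.1. A minimal sketch of that decoding step is shown below, assuming the Basel-style mean shape and bases are available as NumPy arrays; the interface and names (mean_shape, shape_basis, expr_basis, etc.) are our own assumptions, not the exact layout of the released code.

```python
import numpy as np

def decode_3dmm(alpha, eta, mean_shape, shape_basis, expr_basis):
    """Decode identity + expression parameters into a 3D face mesh.

    mean_shape  : (3N,) mean face vertices, flattened.
    shape_basis : (3N, 99) identity shape principal components.
    expr_basis  : (3N, 29) expression components.
    alpha, eta  : parameter vectors from I-D3DMM and E-D3DMM.
    Returns an (N, 3) array of vertex positions.
    """
    shape = mean_shape + shape_basis @ alpha + expr_basis @ eta
    return shape.reshape(-1, 3)
```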
Figure 6.8: Qualitative results of deep 3D expression estimation on LFW. We show side by side each input image and the corresponding 3D model reconstructed by our network. The last row shows two failure examples.

6.5.2 Fine-detail Estimation with Shape-from-Shading

Despite reconstructing subject-specific 3D models well, I-D3DMM does not provide much information about fine details, such as wrinkles, moles, or micro skin structure. There are two reasons for this. First, it is limited by the representation power of 3D Morphable Models (3DMMs), which often omit those details. Second, in order to gain robust and image-invariant 3D modeling, we had to sacrifice these image-dependent components.

These details, however, can be recovered by Shape-from-Shading (SfS) techniques [16, 64, 65, 129, 81]. Note that SfS is only good at refining local details; it cannot correct the global shape. Therefore, we still need I-D3DMM and E-D3DMM to provide the 3D shape as a basis. We re-implemented the SfS method proposed by [98, 108] and included it as the final step to refine the reconstructed 3D model. To evaluate this approach, we ran this process over a set of images from the CleanCOW dataset (section 4.2), as well as some images from the Internet, which have noticeable details such as wrinkles, beards, or freckles. Some qualitative results are illustrated in Fig. 6.9. As can be seen, by adding details, all the 3D models look much more realistic; they improve both age and expression perception. However, shape-from-shading also recovers unwanted components, such as hair, glasses, or any kind of occlusion, as shown in the last row.

Figure 6.9: Qualitative results of fine-detail estimation with Shape-from-Shading.

6.5.3 Occlusion Handling

As discussed above, Shape-from-Shading is sensitive to occlusions, which is not desirable. We enhance this technique by proposing an automatic mechanism for occlusion handling. Fig. 6.10 illustrates our pipeline. First, we generated the initial 3D model I (by I-D3DMM and E-D3DMM) and the refined dense mesh D (by SfS). Second, we extracted the occluded parts using face segmentation. Third, we converted D to a Basel Face Model (BFM) structured mesh S, with occluded vertices removed. Fourth, we filled in the occluded vertices in S with Poisson blending, using symmetry and the initial 3D model I as a reference. Fifth, we applied symmetry-based inference to recover more dense regions. Finally, we zippered the dense meshes and the sparse one into a complete 3D model.

a. Pixel-Grid Mesh vs. BFM-Structured Mesh

SfS estimates the depth at every facial pixel in the input image, which produces a very dense 3D face mesh. This mesh, however, is unstructured; there is no connection between the left and the right part, nor between the visible and the occluded regions. In contrast, the 3D face model regressed by I-D3DMM and E-D3DMM is sparse but well structured by the Basel Face Model (BFM): we can easily define vertex pairs by symmetry, and the model is complete, with inferred 3D coordinates for the occluded vertices (Fig. 6.11). Due to these properties, we mainly use the BFM-structured mesh in our process. However, we also want to bring the rich content of the dense pixel-grid mesh into the final output.
Figure 6.10: Occlusion handling process in 3D face refinement: (1) estimate an initial 3D model with I-D3DMM and E-D3DMM, (2) refine the initial 3D model with Shape-from-Shading, (3) detect obstacles with face segmentation, (4) define occluded vertices, (5) convert the dense mesh to a sparse mesh (BFM structured) with occluded vertices removed, (6) complete the sparse mesh, (7) apply symmetry-based inference on the dense mesh, (8) zipper the meshes to get the final one, which is clean, complete, and dense.

b. Dense and Sparse Mesh Conversion

In one of the early steps, we need to convert the pixel-grid mesh from SfS to a BFM-structured one, called S. This process is simple: we project each visible vertex of the initial 3D shape (from I-D3DMM and E-D3DMM) onto the image and update its coordinates based on the depth inferred by SfS. As for the self-occluded vertices, we will update them later along with the ones occluded by obstacles.

Figure 6.11: Pixel-grid mesh vs. BFM-structured mesh.

To recover the dense mesh from the sparse one, we register each facial pixel to the corresponding triangle in the BFM-structured mesh. Specifically, for each point in D with 2D-3D correspondence $(p^D, P^D)$, we find the corresponding triangle on the sparse mesh, with 2D-3D point correspondences $\{(p^S_i, P^S_i) \mid i = 1..3\}$. Then, we compute the 2D alignment parameters $u, v \in [0, 1]$ such that:

p^D = u\,p^S_1 + v\,p^S_2 + (1 - u - v)\,p^S_3     (6.4)

and the residual in 3D:

\Delta P = P^D - \big(u\,P^S_1 + v\,P^S_2 + (1 - u - v)\,P^S_3\big).     (6.5)

The parameter set $(u, v, \Delta P)$ is stored to recover $P^D$ from $\{P^S_i\}$:

P^D = \big(u\,P^S_1 + v\,P^S_2 + (1 - u - v)\,P^S_3\big) + \Delta P.     (6.6)
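A small sketch of this registration step follows (Eqs. 6.4-6.6): it solves the barycentric coordinates (u, v) of a pixel inside its sparse-mesh triangle, stores the 3D residual, and later reconstructs the dense point from the (possibly updated) sparse vertices. It is a minimal NumPy illustration; the function names are our own.

```python
import numpy as np

def register_point(p_d, P_d, p_s, P_s):
    """Compute (u, v, delta_P) of Eqs. (6.4)-(6.5).

    p_d : (2,) pixel of the dense-mesh point; P_d : (3,) its 3D position.
    p_s : (3, 2) 2D projections of the triangle vertices; P_s : (3, 3) their 3D positions.
    """
    # Solve p_d = u*p_s[0] + v*p_s[1] + (1-u-v)*p_s[2] for (u, v).
    A = np.column_stack([p_s[0] - p_s[2], p_s[1] - p_s[2]])   # 2x2 system
    u, v = np.linalg.solve(A, p_d - p_s[2])
    interp3d = u * P_s[0] + v * P_s[1] + (1 - u - v) * P_s[2]
    delta_P = P_d - interp3d                                   # Eq. (6.5)
    return u, v, delta_P

def recover_point(u, v, delta_P, P_s):
    """Eq. (6.6): rebuild the dense point from the sparse triangle plus residual."""
    return u * P_s[0] + v * P_s[1] + (1 - u - v) * P_s[2] + delta_P
```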
If any point infP Sf i g is newly recovered from Poisson blending, we can dene the re ection ofP D over the symmetry axis, calledP Df , as below: P Df = (uP Sf 1 +vP Sf 2 + (1uv)P Sf 3 ) +P f (6.8) where P f is the re ection of P over the Oy axis. f. Mesh Zippering Symmetry-based inference cannot ll all missing regions; sometimes occlusions appear on both sides. Therefore, the dense and sparse mesh are combined into the nal mesh F, which is both complete and dense. We rst remove the overlapping regions from S, then zipper the meshes using a well-known technique [137]. g. Experimental Results In this part, we evaluate the proposed method by running it on some 2D facial datasets with obstacles. To exclude the noise from face segmentation, we use here the COFW dataset [27, 151], a facial dataset with ground-truth face masks. We also use some 150 challenging images downloaded from the Internet with the corresponding face masks manually annotated. The qualitative results are presented in Fig. 6.12. For each input image, we show the corresponding input face mask and the estimated 3D model from our algorithm. The obstacles can be a pair of glasses, hair, ngers, and more. As can be seen, we no longer see any trait of these obstacles in the nal outputs. While removing occlusions, our algorithm still provides clean and complete 3D models with realistic details such as beard (the rst 3 images), wrinkles, and skin structure (in most of the images). We also include here two failure examples. In the rst image, there is a mismatch in quality between dierent reconstructed regions. The visible area is very rough due to the complex skin structure, while the inferred occluded region is too smooth. We can solve this problem, in future work, by learning skin structure from visible parts and synthesizing it on the occluded area. In the second image, the estimated 3D model has artifacts coming from shadows. To solve this problem, we can employ a shadow detector [114], then treat the detected shadows in the same way as obstacles. 6.5.4 Deep Fine-Details Regression In the previous sections, we implemented Shape-from-Shading in C++ with a traditional analysis-by-synthesis approach. From what we have learned in this chapter, a deep- learning based method may be a better solution, both in quality and speed. Therefore, we design here a potential approach to regress facial details with a deep neural network. 151 Figure 6.12: 3D face modeling and renement with occlusion handling. 152 a. Representation Unlike shape, texture, and expression, we do not have a PCA model for facial ne- details. Instead, inspired by [28], we propose to regress the depth displacement, between the rened model and the initial one, on the entire image. This information is encoded in an image called "bump-map". Specically, a bump-map is a gray-scale image with the same size as the input one, in which the intensity at each pixel p is dened as: I(p) = 8 > > > > < > > > > : f(0) if p is not on the face f(z D (p)z I (p)) if p is on the face (6.9) where z D (p) is the depth of p on the rened mesh, z I (p) is the depth of p on the initial mesh, and f() is a linear encoding function. We can see this denition in Fig. 6.13. Given a bump-map and the initial face depth, we can easily compute the rened depth at any facial point p by this equation: z D (p) =z I (p) +f 1 (I(p)) (6.10) and then recover the dense and detailed 3D mesh D. Therefore, we can convert the SfS problem into bump map regression, as can be seen in Fig. 6.14. b. 
b. Fine-Details Regression Training

Training data: As with I-D3DMM and E-D3DMM, we ran the traditional SfS code on a 2D facial dataset, encoded the outputs into bump maps, and used them as ground-truth labels for training. To have clean and high-quality labels, only high-resolution images from the CleanCOW dataset with no obstructing occlusion were used. We manually selected 4,300 images satisfying these criteria, then cropped and scaled them to a fixed resolution of 500x500. Finally, we estimated their bump maps with SfS.

Table 6.5: Network structure for bump-map regression.
Block    | Extra input | Structure                               | Output size
block1   | --          | conv3-3, padding = 7; [conv3-16] x 2    | 512x512x16
block2   | --          | conv3-16, stride = 2; [conv3-32] x 2    | 256x256x32
block3   | --          | conv3-32, stride = 2; [conv3-64] x 2    | 128x128x64
block4   | --          | conv3-64, stride = 2; [conv3-128] x 2   | 64x64x128
block5   | --          | conv3-128, stride = 2; [conv3-256] x 2  | 32x32x256
block6   | --          | conv3-256, stride = 2; [conv3-512] x 2  | 16x16x512
block7   | --          | conv3-512, stride = 2; [conv3-1024] x 2 | 8x8x1024
upblock6 | block6      | upsample+conv3-512; [conv3-512] x 2     | 16x16x512
upblock5 | block5      | upsample+conv3-256; [conv3-256] x 2     | 32x32x256
upblock4 | block4      | upsample+conv3-128; [conv3-128] x 2     | 64x64x128
upblock3 | block3      | upsample+conv3-64; [conv3-64] x 2       | 128x128x64
upblock2 | block2      | upsample+conv3-32; [conv3-32] x 2       | 256x256x32
upblock1 | block1      | upsample+conv3-16; [conv3-16] x 2       | 512x512x16
lastconv | --          | conv3-1                                 | 512x512x1

Network structure: We followed the U-net structure with skip connections proposed in a recent paper [61]. To avoid checkerboard artifacts, we replaced the transposed-convolution layers with resize+convolution ones [96]. The detailed network structure is shown in Table 6.5.

Training process: We trained the network, called SfS-CNN, using the selected images and the corresponding "ground-truth" bump maps. The training batch size was 4. We started with a learning rate of 10^-4 and reduced it by a factor of 10 every 50 epochs. The network converged after 150 epochs.

c. Experimental Results

Running time: We first ran a speed test on a Linux machine with an Intel Xeon CPU @ 3.60GHz, 12 GB of RAM, and a GeForce GTX 590. While the traditional SfS method took about 15s to process each input image, SfS-CNN requires only 1.3s; SfS-CNN is thus roughly 11 times faster than the classical method.

Qualitative results: Fig. 6.15 illustrates some qualitative results of fine-detail regression from in-the-wild images. We use images from the CleanCOW dataset without obstacles that are outside of the training data, and we include the 3D estimates of the classical SfS method for comparison. As can be seen, the network captures facial details such as wrinkles, beard, hair, and skin texture very well. In some cases it provides clearer details than SfS, e.g., in the first two images. In general, the regressed models look slightly noisier than the ones estimated by SfS. Nonetheless, this result is still very competitive, especially when running time is taken into consideration.

Figure 6.15: Qualitative results of our fine-detail regression method (SfS-CNN), in comparison to the analysis-by-synthesis SfS approach.
6.6 Conclusions

We show that existing methods for estimating 3D face shapes are either sensitive to changing viewing conditions, particularly in unconstrained settings, or too generic. Their estimated shapes thus do not capture identity very well, despite the fact that true 3D face shapes are known to be highly discriminative. We propose instead to use a very deep CNN architecture to regress 3DMM parameters directly from input images, and we provide a solution to the problem of obtaining sufficient labeled data to train this network. We show our regressed 3D shapes to be more accurate than those of alternative methods, and we further run extensive face recognition tests showing these shapes to be robust to unconstrained viewing conditions and discriminative. Our results are, furthermore, the highest recognition results we know of that were obtained with interpretable representations rather than opaque features.

Finally, we discuss additional work on 3D face modeling to recover image-dependent components, such as expression and fine details. While these tasks do not necessarily improve subject distinctiveness, they make the 3D output much more realistic, thereby approaching 3D face-scan quality.

Chapter 7
Conclusions and Future Work

In this study, we learned how to solve face recognition and face modeling in the wild effectively, given imagery inputs.

As for face recognition, we proved that 3D face shape is important, and that by including it in a state-of-the-art face recognition system, we gain much more power to handle in-the-wild input. Specifically, 3D reference face shapes allow us to enrich the training data, thereby strengthening the face classifier. They also help us to close the gap between images coming from different views through face appearance synthesis. With a lightweight 3D face augmentation and matching process, the system scales to very-large-scale training data and provides the best performance on state-of-the-art face recognition benchmarks. Furthermore, we proved that the face alignment step can be done effectively with CNN regression, rather than with an expensive landmark detection process.

As for 3D face modeling, we first leveraged traditional analysis-by-synthesis techniques to recover 3D face shape and texture in 3D Morphable Model (3DMM) space. The input data can be a single image, multiple images, or video. We learned that 3D geometry and identity can be reconstructed precisely given enough input images. From that observation, we proposed a method to generate a large amount of "ground-truth" 3DMM labels on the CASIA-WebFace dataset. A deep neural network was then trained to regress 3D face shape and texture directly from any single input. This network is superior to other 3D modeling methods, including the one used to generate its training data, and shows state-of-the-art 3D face modeling results in speed, accuracy, and distinctiveness. Coupled with a 3D-3D face matching pipeline, we showed the first competitive face recognition results on the LFW, YTF and IJB-A benchmarks using 3D face shapes as representations, rather than the opaque deep feature vectors used by other modern systems. Finally, we presented additional work on expression regression and fine-detail recovery, which can bring the reconstructed 3D faces close to face-scan quality.
7.1 Contributions

In the face recognition study, we provided four key contributions in this dissertation, compared to the state of the art:

- 3D Face Augmentation Aid: We proposed the use of 3D reference face shapes to synthesize new facial images in various configurations (pose, shape, expression). This augmentation enriches the training data, which significantly boosts the recognition power of the trained face recognizer. The synthesized facial images are also useful in matching, since they bring the faces to the same condition for comparison.

- Rapid face synthesis and matching: We designed a novel 3D face synthesis method, which can perform 3D face augmentation at the speed and computational cost of 2D image warping. We also proposed the use of pooled synthesized faces for more effective face matching. These enhancements are important; they allow our system to handle large training data and to work effectively in a large-scale face recognition system.

- A very-large-scale training set: We introduced the COW dataset, a combination of state-of-the-art facial image sets. It is the largest clean public dataset so far. Our system trained on COW decisively breaks the top matching records on a state-of-the-art in-the-wild face recognition benchmark.

- Landmark-free face alignment: We proposed to regress 3D head pose directly from input images, rather than using an expensive landmark detector. We showed that accurate landmarks are not essential for face recognition, and that our landmark-free pipeline provides matching performance similar to, if not better than, any landmark-based method.

In the 3D face modeling study, we make five contributions in terms of novelty:

- Effective 3D modeling from multiple images: We showed that a simple method of separate modeling and pooling can greatly improve 3D face reconstruction when multiple images of the same subject are available. A series of experiments on in-the-wild datasets demonstrated the ability to rebuild accurate 3D face shape and identity, given enough input data.

- A novel 3D face tracker for video: We developed a sophisticated 3D face tracking technique, which exploits temporal coherence in order to precisely estimate the 3D head pose in each video frame.

- Deep 3D face modeling method: We designed a deep-learning-based approach to regress 3D face shape and texture, in 3D Morphable Model space, directly from any single facial image. Our CNN is both robust and discriminative, providing state-of-the-art 3D face modeling in speed, accuracy, and identity recovery. It shows, for the first time, competitive face recognition in the wild using reconstructed 3D models rather than opaque deep features.

- 3D face modeling evaluation by face recognition tests: We proposed the use of face recognition as an effective way to evaluate 3D face modeling results. A well-reconstructed 3D face needs to be invariant and recognizable. While traditional face modeling methods fail this test by providing either too-generic or too-unstable 3D models, our deep 3D face modeler shows, for the first time, reliable results on in-the-wild data.

- Image-dependent 3D component recovery: Besides invariant 3D components, including 3D identity shape and texture, we also discussed additional techniques to recover image-dependent 3D components such as expression and fine details. These techniques polish the final 3D models and bring them towards 3D depth-scan quality.
Besides theoretical contributions, we also provide practical contributions by releasing our implementations:

- ResFace-101: The CNN model of our face recognition system described in section 4.1 (1).
- Rapid face synthesis code: Our Python face-specific augmentation code (1).
- FacePoseNet: Our CNN model and demo code of FacePoseNet for landmark-free face alignment, described in section 4.3 (2).
- 3DMM CNN: Our code and CNN model of the deep 3D face modeling method described in Chapter 6 (3).
- COW dataset: The list of images and corresponding subject IDs in our CleanCOW dataset (coming soon).

(1) http://www.openu.ac.il/home/hassner/projects/augmented_faces
(2) https://github.com/fengju514/Face-Pose-Net
(3) http://www.openu.ac.il/home/hassner/projects/CNN3DMM

7.2 Future Work

Despite the strong results on both 2D face recognition and 3D face modeling, there is still room for future research on these topics. In particular, we are interested in exploring the problems below:

- Occlusion-robust 3D fine-detail regression: We discussed how to use a deep neural network to regress 3D facial details in section 6.5.4. For occlusion by obstacles, we also discussed how to handle it with an analytical method in section 6.5.3. Instead, we could develop an occlusion-robust SfS-CNN that detects occlusions and directly infers the missing facial details. The implementation of this network is practical with the use of occlusion augmentation during training.

- 2D-3D integrated framework: In the 2D face recognition work, we still use generic 3D faces for augmentation. Would it be better to use the subject-specific 3D models estimated by our 3D face modeling work? This is an interesting question we want to address in future work.

Reference List

[1] ftp://csr.bu.edu/headtracking/.
[2] http://www.d2.mpi-inf.mpg.de/surf.
[3] http://www.nvidia.com.
[4] http://www.vrealities.com/flockofbirds.html.
[5] 3dMD LLC. http://www.3dmd.com/.
[6] W. AbdAlmageed, Y. Wu, S. Rawls, S. Harel, T. Hassner, I. Masi, J. Choi, J. Leksut, J. Kim, P. Natarajan, R. Nevatia, and G. Medioni. Face recognition using deep multi-pose representations. In Winter Conf. on App. of Comput. Vision, 2016.
[7] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037-2041, Dec 2006.
[8] O. Alexander, M. Rogers, W. Lambeth, J.-Y. Chiang, W.-C. Ma, C.-C. Wang, and P. Debevec. The Digital Emily project: Achieving a photorealistic digital actor. Computer Graphics and Applications, IEEE, 30(4):20-31, 2010.
[9] B. Amberg, A. Blake, A. Fitzgibbon, S. Romdhani, and T. Vetter. Reconstructing high quality face-surfaces using model based stereo. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1-8. IEEE, 2007.
[10] A. Bagdanov, A. D. Bimbo, and I. Masi. The Florence 2D/3D hybrid face dataset. In Int. Conf. Multimedia, 2011.
[11] T. Baltrusaitis, P. Robinson, and L.-P. Morency. Constrained local neural fields for robust facial landmark detection in the wild. In Proc. Int. Conf. Comput. Vision Workshops, 2013.
[12] T. Baltrusaitis, P. Robinson, and L. P. Morency. Constrained local neural fields for robust facial landmark detection in the wild. In Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, pages 354-361, Dec 2013.
[13] T. Baltrusaitis, P. Robinson, and L.-P. Morency. Constrained local neural fields for robust facial landmark detection in the wild. In Proc. Conf. Comput.