3D Inference and Registration with Application to Retinal and Facial Image Analysis

by

Matthias Hernandez

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2017

Copyright 2017 Matthias Hernandez

Acknowledgements

Even after joining the Ph.D. program, I doubted for a long time that I would ever be able to complete it. It is only thanks to the continuous support of the people mentioned here that I was able to achieve it.

Firstly, I would like to thank my advisor Prof. Gerard Medioni. I am ever so grateful for his guidance throughout the writing of this thesis. He has provided me, and all of my colleagues, with continuous motivation and support.

Secondly, I thank all the people that I collaborated with in my work. In particular, I am extremely grateful to Dr. Jongmoo Choi for his constant cheerfulness and ability to make time for endless discussions with me. I would also like to thank Prof. Tal Hassner, Dr. Zhihong Hu and Dr. SriniVas Sadda, who all dedicated time and energy to sharing their invaluable knowledge with me.

Thirdly, I am thankful for all the people I have met and friends I have made along the way. All my fellow lab mates were perfect companions with whom to share brilliant ideas, sleepless nights and warm coffees.

Last but not least, I would like to thank my parents, who provided me with all the opportunities I could ever dream of.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Problem statement
  1.2 Challenges
  1.3 Contributions
  1.4 Outline

2 Related Work
  2.1 Invariant Feature Extraction and Matching
    2.1.1 Area-based methods
    2.1.2 Feature-based methods
  2.2 2D Registration and mosaicking
  2.3 3D Registration and Structure from motion

3 Multimodal Registration of retinal images
  3.1 Introduction
    3.1.1 Significance
    3.1.2 Issues
    3.1.3 Related Work
      3.1.3.1 Multimodal Registration
      3.1.3.2 Vessel Segmentation
    3.1.4 Overview of Our Approach
  3.2 Line Structure Segmentation
    3.2.1 Review of the 2-D Tensor Voting framework
      3.2.1.1 Tensor Representation
      3.2.1.2 Voting process
    3.2.2 Tensor-voting-based line segmentation
      3.2.2.1 Ball vote
      3.2.2.2 Connected component analysis
      3.2.2.3 Stick vote
  3.3 Multimodal Registration
    3.3.1 Chamfer distance
    3.3.2 Pairwise registration
    3.3.3 Chained registration
    3.3.4 Non-linear refinement
  3.4 Data
  3.5 Results
    3.5.1 Segmentation results
    3.5.2 Registration results on Multimodal data
    3.5.3 Registration results on Longitudinal data
  3.6 Summary and discussion

4 3D Face Reconstruction
  4.1 Introduction
    4.1.1 Problem statement
    4.1.2 Challenges
    4.1.3 Contributions
  4.2 Related Work
  4.3 Our approach
    4.3.1 Initializing per-frame poses
    4.3.2 Initializing the shape prior
    4.3.3 Prior constrained SfM optimization
  4.4 GPU implementation
  4.5 Results
    4.5.1 Quantitative results
    4.5.2 Runtime comparison
    4.5.3 Analyzing the video quality over reconstruction
  4.6 Conclusion

5 Deep 3D Face Recognition
  5.1 Introduction
  5.2 Related work
  5.3 Method
    5.3.1 Preprocessing
    5.3.2 Augmentation
      5.3.2.1 Expression Generation
      5.3.2.2 Pose Variations
      5.3.2.3 Random Patches
    5.3.3 Fine-tuning
    5.3.4 Identification
  5.4 Experiments
    5.4.1 3D Face Databases
    5.4.2 Analysis of Augmentation
    5.4.3 Performances on the 3D Databases
    5.4.4 Time Complexity Analysis
  5.5 Conclusions

6 Analysis of 3D Face Reconstruction Quality for 3D Recognition
  6.1 Introduction
  6.2 Recognition rates
    6.2.1 Dataset and experimental setup
    6.2.2 3D Recognition on faces reconstructed from in-the-wild videos
  6.3 Analysis
    6.3.1 Required accuracy level
    6.3.2 Representational Power of 3DMM
    6.3.3 Reconstructed 3D faces
  6.4 Conclusion

7 Conclusion and Future Directions
  7.1 Summary of contributions
  7.2 Future Directions

Reference List

List of Tables

3.1 Comparison of precision/recall rates for our segmentation approach and other rule-based approaches.
3.2 Accuracy of our registration results per modality: average and standard deviations (in parentheses) of Root Mean Square Error (in μm). Lower values are better.
4.1 Quantitative results on the MICC dataset [9]. Four mean distance measures (and standard deviation). Our PCSfM is evaluated, optimizing for shape alone (Us, 3D) and for all 3DMM parameters (Us, 3D+pose+exp.). Lower values are better; bold is the best score.
4.2 Average runtimes on MICC videos. Single-view methods were applied to all frames separately.
5.1 Comparison of rank-1 accuracy (%) on public 3D face databases.
5.2 Comparison of computation time (s) of feature extraction and matching per probe for identification.
6.1 Recognition rates on MICC using the proposed deep matcher (in %).
6.2 Recognition rates on MICC using a cosine similarity (in %).

List of Figures

3.1 Example of different 2D modalities for one eye. From left to right: Color Fundus (CF), Auto-Fluorescence (AF), Infra-Red (IR), Red-Free (RF).
3.2 Example of OCT volume composed of a set of scans.
3.3 Example of OCT projection around the RPE layer.
3.4 Overview of our multimodal retinal image registration approach.
3.5 Graphical representation of a symmetric 2-D tensor.
3.6 Geometric representation of the saliency decay function.
3.7 Effect of the line segmentation steps on one eye. Top line: AF image. Bottom line: OCT projection. From left to right: original image, result of ball vote, result of connected component pruning, result of stick vote.
3.8 Example of the evolution of the graph for chained registration on one eye. Left: complete graph. Middle: minimal spanning tree. Right: final graph.
3.9 Overlap of the registered OCT projection image to Color Fundus, Infra-Red, Red-Free and Auto-Fluorescence images for several eyes.
3.10 Registration results on longitudinal data. From left to right: 1st visit, after 6 months, 12 months and 18 months.
4.1 Overview of our approach. (a) For an input video, (b) we track facial landmarks and use them to estimate initial camera matrices and 3DMM parameters. (c) Our iterative optimization refines the shape parameters, along with per-frame camera matrices and expressions, seeking ones which maximize photometric consistency between frames. (d) Our output.
4.2 Face reconstruction from an outdoor video in the MICC dataset [9]. Top row: three examples of the frames used for reconstruction. Low-resolution faces have an average of only 30-pixel inter-ocular distances. Bottom row: (a) close-up view of the same subject (not used for reconstruction); 3D estimates by (b) structure from motion (SfM) [69]; (c) single-view 3D Morphable Model fitting (3DMM) [5]; (d) our proposed PCSfM. SfM estimates are very poor in low resolution. 3DMM cannot guarantee correct face shapes, evident by comparing with (a). PCSfM combines both to produce an accurate 3D shape.
4.3 SfM alone is not enough. Reconstruction results on two controlled videos from the MICC dataset [9]. (a) Example input frames; (b) ground truth 3D shape; (c) SfM 3D reconstruction of [69] showing severe errors due to the ambiguity of feature matching in these scenes; (d) our result.
4.4 Non-zero structure of the Jacobian matrix for E_photo with four frames. S_i, E_{j,i}, P_{j,i} are respectively the Jacobians for the shape, the j-th frame expression and pose, estimated from the pair of frames i, i+1.
4.5 Qualitative results on web videos (left column) and old movies (right column). Left to right: example frame, our PCSfM 3D estimate overlaid on the frame and rendered separately. Note the error in the chin on the bottom row due to the beard not being modeled by the 3DMM parameters.
4.6 Qualitative results for the evolution of the estimated parameters. Top row: initial set of shape/pose/expression. Bottom row: final set of parameters.
4.7 Sample frames from three MICC videos [9]. From left to right: indoor cooperative and non-cooperative subjects, outdoor non-cooperative subject along with zoomed-in view.
4.8 Distances to ground truth per vertex on MICC indoor videos. Averaged over all videos for 3DMM [112], flow-based [55], CNN-based fitting [157], regression method [116], multi-view landmark-based fitting [64] and our PCSfM showing far fewer errors.
4.9 Example failure due to landmark tracking errors. Left to right: frame with landmarks overlaid, estimated 3D, ground truth.
4.10 PCSfM accuracy vs. motion on MICC. Mean (SD) RMSE for videos with varying head yaw angle ranges. Single-view 3DMM [112] provided as baseline. Evidently, the more pose variations, the smaller our errors.
4.11 PCSfM accuracy vs. image resolution on MICC. Mean (SD) RMSE for videos with different resolutions.
4.12 Qualitative 3D estimation results from the MICC dataset [9]. From left to right: example input frame and ground truth 3D, 3D estimation results for 3DMM [112], flow-based method [55], CNN-based fitting [157], shape regression [116], multi-view landmark-based fitting [64], our PCSfM. For each case, the first row shows heat maps visualizing the Euclidean distances between estimated and ground truth 3D shapes. The second row shows the 3D shape. Rows 1-2: cooperative indoor videos. Rows 3-4: non-cooperative indoor videos. Row 5: outdoor non-cooperative videos.
5.1 An overview of the proposed face identification system. In the training phase, we do pre-processing, augment 3D faces, and convert them to 2D depth maps. Then, we train a DCNN using augmented 2D depth maps. In the testing phase, a face representation is extracted from the fine-tuned CNN. After normalization of features and a Principal Component Analysis transform, one's identity is determined by the matching step.
5.2 An overview of the person-specific expression generation method and its examples. (a) Input: a facial 3D point cloud; output: a deformed 3D point cloud. (b) The first column represents neutral scans of input faces. The remaining columns represent generated individual-specific expressions from an input. For visualization, we show a 3D point cloud as a depth map.
5.3 Evaluations of the augmentation methods on the Bosphorus dataset. FRGC is used as a training set. The dataset was augmented using each augmentation method: expression, pose, patch, and all combined.
5.4 Evaluation results on the three databases. (a) and (b) Performances were evaluated on three cases by varying expression intensities in a probe set. (c) Performances were evaluated on the four standard experiment cases.
6.1 Example probe data generated applying a Laplacian filter on MICC laser scans for an average error of 0 mm, 1 mm and 2 mm.
6.2 Recognition rate as a function of average Euclidean distance to ground truth.
6.3 Example limitations of 3DMM for reconstructing MICC data.
6.4 Histogram of distance to ground truth for 3DMM fitting on laser scans of the MICC data. One example fitted 3DMM/laser scan pair is shown for the extreme cases.
6.5 Scatter plot for the recognition rank as a function of the distance to ground truth for 3D faces reconstructed with 3DMM [112], 3DDFA [158] and PCSfM [Chapter 4].

Abstract

Image registration is a fundamental topic in image analysis, with applications in tracking, biometrics, medical imaging and 3D reconstruction.
It consists of aligning two or more images of the same scene that are taken in different conditions, such as from different viewpoints, with different sensors or at different times. Similarly, 2D/3D registration aims at aligning captured 2D images with a 3D model. In this document, we study registration problems in challenging cases in which traditional methods do not provide satisfactory results. We show that even weak prior knowledge of the 3D structure provides reliable information that can be used for accurate registration. Specifically, we focus on two cases: 2D/3D multimodal retinal imaging and 3D face reconstruction from low-resolution videos.

For retinal image registration, we propose an integrated framework for registering an arbitrary number of images of different modalities, including a 3D volume. We propose a generic method to extract salient line structures in many image modalities, based on dense tensor voting, and a robust registration framework for multiple images. Our approach can handle large variations across modalities and is evaluated on real-world retinal images with five modalities per eye.

For 3D face modeling, we propose to constrain traditional Structure from Motion (SfM) with a face shape prior to guide the correspondence-finding process. We initialize a 3D face model on coarse facial landmarks. We perform 3D reconstruction by maximizing photometric consistency across the video over 3D shape, camera poses and facial expressions. We compare our method to several state-of-the-art methods and show that our method can generate more accurate reconstructions.

To assess the discriminability of the reconstructed models, we develop an end-to-end 3D-3D facial recognition algorithm. We leverage existing deep learning networks trained on 2D images and fine-tune them on images generated by orthogonal projection of 3D data.
We show that, despite having only small amounts of 3D data, our method provides excellent recognition results while being significantly more scalable than state-of-the-art methods.

Finally, while excellent recognition results can be achieved with laser-scan 3D data, we have observed that reconstructed facial 3D models cannot be relied on for recognition purposes. We analyze which level of accuracy is required to enable reliable 3D face recognition, and which factors impact recognition from reconstructed data.

Chapter 1

Introduction

1.1 Problem statement

For a specific scene, images can be captured with different sensors, at different times, and from different viewpoints. To analyze the scene, one needs to combine or compare the information carried by each image. It is therefore critical that the images share the same coordinate system. The task of aligning two or more images of the same scene is called image registration.

Image registration is critical for a large set of applications that require image processing. It can be used for scene change detection, image mosaicking, or biometrics. It also has applications in medical imaging, where physicians often need to correlate the information provided by multiple sensors to formulate reliable diagnoses and monitor the evolution of diseases over time. In computer vision, accurate image registration is also one major step for classical 3D reconstruction algorithms based on structure from motion.

1.2 Challenges

Classical image registration approaches rely on extracting and matching invariant point features through local descriptors. Point correspondences can be used to compute the epipolar geometry and relative camera poses between images. 2D alignment can then be performed through a transformation model, such as affine, perspective or quadratic.

Although these approaches work reasonably well in general, they need several correct correspondences to run.
As a result, they tend to fail when few reliable correspondences can be established. This can occur on several occasions. For instance, when the scene is mostly textureless, the features are very sparse; this happens when trying to register several low-resolution images of faces. Other failure cases occur when the appearance of the images is very different, as with multimodal images or images taken with a large baseline. Reliable correspondences are also hard to establish when there are repeated patterns, which occur often in retinal images.

Another issue arises when the scene has an arbitrary 3D structure: applying global transformation models results in noticeable discrepancies due to the parallax effect. As a result, both the 3D structure and the camera matrices must be estimated to achieve accurate registration.

These observations highlight the three main inter-dependent challenges that occur in classical 2D image registration:

Extracting reliable correspondences: The features must be consistent for all images and discriminative enough to provide reliable matches. This is difficult when the appearance changes due to modality, viewpoint or lighting conditions. It is also difficult when the object of interest is mostly textureless, like a face.

Camera pose estimation: Getting accurate pose estimation requires reliable point correspondences and knowledge of the 3D structure. Also, the algorithm should be robust to wrong correspondences.

3D structure estimation: Given correspondences and camera poses, sparse 3D structure can be estimated by minimizing the reprojection error through classical bundle adjustment techniques. Getting accurate 3D locations requires the camera baseline to be large enough to reduce the uncertainty. However, large baselines make the correspondence problem harder, as the image appearance varies greatly between viewpoints.
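Once correspondences are available, the 2D transformation models mentioned above (affine, perspective, quadratic) are typically estimated by least squares. As an illustration only, and not the method used in this thesis, here is a minimal NumPy sketch of fitting a 2D affine model to point correspondences:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine model: dst ~ A @ src + t.

    src, dst: (N, 2) arrays of corresponding points, N >= 3.
    Returns the 2x3 matrix [A | t].
    """
    n = src.shape[0]
    # Each correspondence contributes two linear equations.
    X = np.hstack([src, np.ones((n, 1))])              # (N, 3)
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)   # (3, 2)
    return params.T                                    # (2, 3)

def apply_affine(M, pts):
    """Apply the fitted model to an (N, 2) point array."""
    return pts @ M[:, :2].T + M[:, 2]

# Synthetic check: a 30-degree rotation plus a translation.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
src = np.random.default_rng(0).uniform(0, 100, (20, 2))
dst = src @ R.T + np.array([5.0, -3.0])
M = fit_affine(src, dst)
assert np.allclose(apply_affine(M, src), dst, atol=1e-8)
```

In practice such a fit is wrapped in a robust estimator (e.g. RANSAC) precisely because of the wrong correspondences mentioned in the second challenge above.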
When sparse 3D and camera matrices are known, dense correspondences are needed at the image level to generate a complete 3D model. This means that the registration must be accurate at a pixel level.

1.3 Contributions

In this dissertation, we study the registration problem in challenging cases where only few or no reliable point matches can be extracted from the images. We show that leveraging prior knowledge of the 3D structure enables improved registration results. To support this claim, we study two different applications. We first propose a framework to perform robust 2D/3D multimodal registration, and apply it to retinal images. The registration is used as a preprocessing step for medical applications, such as disease segmentation [59, 60]. Second, we study the case of 3D face modeling from low-resolution images, for which we propose a 2D/3D registration scheme. To assess the quality of the reconstruction results, we propose a new 3D-3D face recognition framework and analyze which factors impact expected face reconstruction accuracy. We then analyze why reconstructed 3D models cannot be relied on for recognition.

Multimodal registration of retinal images. In retinal image registration, the imagery is multimodal. Because of the repeated patterns and the large difference in appearance intrinsic to the difference in modalities, point correspondences cannot be reliably extracted through classical point descriptors, like SIFT [90]. In this study, we propose an integrated global framework to register an arbitrary number of modalities, involving 2D images and a 3D volume. We also present a generic framework for the extraction of salient lines, based on tensor voting, that is consistent across modalities. Finally, we propose a robust framework for 2D/3D registration based on line structures. The registration problem is first simplified by assuming a near-flat structure for the retina, which enables us to use a global transformation model.
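Chapter 3 scores the alignment of such line structures with a chamfer distance (Section 3.3.1). The following is a simplified, brute-force NumPy sketch of the idea only; real implementations precompute a distance transform of the target edge map instead of searching all points:

```python
import numpy as np

def chamfer_distance(pts_a, pts_b):
    """Average distance from each point of set A to its nearest point in set B.

    pts_a, pts_b: (N, 2) and (M, 2) arrays of line/edge pixel coordinates.
    Brute force O(N*M); a distance transform of B makes each query O(1).
    """
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=2)
    return d.min(axis=1).mean()

# Two identical "vessel" point sets score 0; shifting one set by one
# pixel raises the score to 1, so lower is better-aligned.
line = np.stack([np.arange(10), np.zeros(10)], axis=1).astype(float)
assert chamfer_distance(line, line) == 0.0
assert chamfer_distance(line + np.array([0.0, 1.0]), line) == 1.0
```

Minimizing this score over the parameters of the global transformation model is what drives the pairwise registration.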
This assumption is later lifted to account for the shallow but existing 3D structure. Our method is evaluated on real clinical data and is deployed in industry.

Face modeling in the wild. In 3D face reconstruction, we are given low-resolution videos of a person's face in a natural scenario. We aim at reconstructing accurate face shapes in order to perform recognition in 3D.

Because of the low resolution and the textureless nature of faces, getting accurate correspondences across the video frames is very challenging. As a result, getting accurate pose and structure with a classical structure-from-motion approach is not practical.

We show that prior knowledge about the space of 3D faces enables us to accurately register the video frames and reconstruct the facial geometry. We propose a novel structure-from-motion method constrained with a shape prior, which allows us to establish global correspondences across all low-resolution frames in a video. Since correspondences cannot be properly established at the image level due to the textureless nature of a face, we propose to warp the model in 3D and check for photometric consistency across the frames. 3D shape, per-frame pose and expression are then jointly estimated in a global optimization framework. We show that this 2D/3D registration scheme provides state-of-the-art 3D face reconstruction results on low-resolution face videos.

Face recognition. For the case of 3D faces, we develop an end-to-end 3D recognition framework based on a deep convolutional neural network.

The performance of 2D face recognition algorithms has significantly increased by leveraging the representational power of deep convolutional neural networks (CNNs) and the use of large-scale labeled training data. For 3D data, however, little data is available. We show that transfer learning from a CNN trained on 2D face images can effectively work for 3D face recognition by fine-tuning the CNN with a small number of 3D facial scans.
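The fine-tuning above operates on 2D depth maps obtained by orthogonally projecting the 3D scans (cf. Figure 5.1). A minimal sketch of such a projection, with a hypothetical grid size and no hole filling or smoothing:

```python
import numpy as np

def depth_map(points, size=32):
    """Orthographic z-buffer: project a 3D point cloud onto the x-y plane.

    points: (N, 3) array. Returns a (size, size) image where each cell
    keeps the largest z, i.e. the surface point nearest a viewer on +z.
    """
    xy = points[:, :2]
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    # Map x, y into pixel indices on a size x size grid.
    idx = ((xy - lo) / (hi - lo + 1e-9) * (size - 1)).astype(int)
    img = np.full((size, size), -np.inf)
    for (i, j), z in zip(idx, points[:, 2]):
        img[j, i] = max(img[j, i], z)   # keep the nearest surface point
    img[np.isinf(img)] = 0.0            # empty cells -> background
    return img

# A spherical cap (crude stand-in for a face) is closest to the
# viewer at its center, so depth values peak mid-image.
rng = np.random.default_rng(1)
xy = rng.uniform(-1, 1, (5000, 2))
z = np.sqrt(np.clip(2.0 - (xy ** 2).sum(axis=1), 0, None))
cloud = np.hstack([xy, z[:, None]])
d = depth_map(cloud)
assert d[16, 16] > d[0, 0]
```

The resulting single-channel images can be fed to a network pretrained on 2D photographs, which is what makes the transfer learning above possible with few 3D scans.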
Our proposed method shows state-of-the-art recognition results on several 3D face datasets, while scaling well to large databases.

Finally, while we are able to obtain high recognition results on laser scans, we acknowledge that recognition rates on reconstructed models are underwhelming. We analyze which level of accuracy is needed to enable 3D face recognition from reconstructed data.

1.4 Outline

This dissertation is organized as follows: we start with a review of related work in Chapter 2. Chapter 3 describes our 2D/3D multimodal registration framework applied to retinal images. In Chapter 4, we introduce a 2D/3D registration scheme and apply it to 3D face modeling for low-resolution videos. Chapter 5 presents a deep 3D-3D matching method, which is applied to the reconstructed models in Chapter 6 to analyze the correlation between reconstruction quality and recognition rates. Finally, Chapter 7 concludes this work and proposes possible future research directions.

Chapter 2

Related Work

Image registration and 3D reconstruction are research topics that have been studied for decades. The literature is large, and several surveys have been published over the years [20, 159]. In this section, we report important work that is relevant to the general image registration process. Studies that are specific to retinal image processing, face reconstruction and 3D face recognition will be reported in the corresponding chapters.

As previously stated, image registration has a wide range of applications, ranging from medical imaging to computer vision. Due to the diversity of images to be registered, it seems impossible to design a universal method that can be applied to every possible application. Still, the majority of the proposed approaches broadly follow the same steps, each with its own challenges [159]. First, invariant features are extracted and matched. Then, a transform model is estimated. Optionally, the 3D structure is extracted.
Finally, the images are transformed into the same coordinate system.

2.1 Invariant Feature Extraction and Matching

Image registration approaches are often categorized into two classes: area-based methods and feature-based methods.

2.1.1 Area-based methods

Area-based methods, sometimes referred to as template matching, try to optimize a distance measure over the image intensities. Several distances have been used over the years. Notable successes include normalized cross-correlation [51, 16] and mutual information [142, 137, 84]. Coming from information theory, mutual information was first applied to images in [142] as a measure of statistical dependency between two images. It is powerful as it does not make any assumption about the image content. As a result, it shows excellent results with different modalities and is widely used for medical images.

One common drawback of these approaches is their computational complexity. Also, due to their iterative nature, they are not guaranteed to converge to the desired solution if the initial position is poor.

2.1.2 Feature-based methods

Point features. Approaches based on point features are the most widely used in practice. Image matching through interest points can be traced back to the early 80s, with the work of Moravec on corner detection [98]. Harris further improved the repeatability of corner extraction [52] and proved it to be very effective for short-range motion tracking [53].

To perform matching in large-motion scenarios, it was later proposed to extract information in the neighborhood around the interest points. Zhang et al. [156] showed that using a square patch around each corner helped select likely matches. Schmid and Mohr [121] proposed to extract a rotationally invariant descriptor of the local image region, and showed that matching could be applied for image recognition in a large database.
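Both the area-based measures above and these early patch-around-corner matchers score candidate pairs by comparing local windows. As a toy illustration (not tied to any particular paper), normalized cross-correlation between two patches can be written as:

```python
import numpy as np

def ncc(p, q):
    """Normalized cross-correlation of two equally sized patches.

    Returns a score in [-1, 1]; 1 means identical up to a gain and
    offset, which is why NCC tolerates linear intensity changes.
    """
    a = p - p.mean()
    b = q - q.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom)

patch = np.random.default_rng(2).uniform(0, 255, (9, 9))
assert np.isclose(ncc(patch, patch), 1.0)
# Invariant to gain/offset: a brighter, contrast-stretched copy still scores 1.
assert np.isclose(ncc(patch, 2.0 * patch + 10.0), 1.0)
```

This invariance to linear intensity changes does not extend to the modality changes discussed later, where even gradient directions can flip.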
In his influential work, Lowe [90] proposed the SIFT (Scale-Invariant Feature Transform) feature, which is designed to be invariant to both pose and scale variations. Interest points are no longer corners, which examined the image at a single scale. Instead, the interest points are extracted as extrema of difference-of-Gaussian images at different scales. Rotation invariance is achieved by extracting the principal orientation based on local image gradients. Finally, a descriptor is extracted at the detected scale and orientation. SIFT descriptors have been shown to be stable, discriminative and robust. They have been widely used in vision tasks that include image recognition and registration.

The ideas behind the SIFT local descriptors have inspired many studies that try to improve robustness, compactness or effectiveness. Notably, Ke and Sukthankar proposed to apply Principal Component Analysis to the local gradient patch in order to make the descriptor more compact: PCA-SIFT [70]. GLOH [96] (Gradient Location and Orientation Histogram) introduces a variation of the SIFT descriptor that considers more spatial regions to increase the descriptor's distinctiveness. RIFT [76] (Rotation-Invariant Feature Transform) is a variation of SIFT in which the descriptors are extracted from concentric circular patches to enforce rotation invariance. In [14], Bay et al. introduced the SURF (Speeded-Up Robust Features) descriptor, which bases the descriptor on Haar wavelet responses and can be computed much faster than the original SIFT. Later, the DAISY descriptor [138] was introduced for efficient dense matching and provides good results for wide baselines. More recently, advancements in deep learning have led to novel types of convolutional descriptors [145].

All these descriptors work well for point matching from multiple views. However, the gradients and appearance are not consistent for multimodal images.
To address that issue, the Partial Intensity-Invariant Feature Descriptor (PIIFD) was introduced [27]. After observing that gradient orientations may be reversed across modalities, the authors proposed to sum opposite gradient directions when computing the SIFT descriptors.

Potential matches are extracted by finding nearest neighbors in the space of feature descriptors. To increase the robustness of the matching process, non-distinctive matches are usually pruned if the distance ratio between the first and the second nearest neighbor is over a threshold. The outliers are then removed by checking epipolar constraints.

It is worth noting that these methods are not scalable: the extraction is often computationally intensive and a large amount of memory is required. As a result, binary descriptors were introduced as a way to make descriptors compact and fast to match. BRIEF (Binary Robust Independent Elementary Features) [21] compares intensities of random pixel pairs around the keypoint. The resulting binary strings can be matched quickly using an XOR operation. ORB (Oriented FAST and Rotated BRIEF) [114] adds rotation invariance by estimating a patch orientation. Other examples include FREAK (Fast REtinA Keypoints) [2], BRISK (Binary Robust Invariant Scalable Keypoints) [78] and LATCH (Learned Arrangements of Three Patch Codes) [79]. Although all these methods are designed to address scalability issues, they show inferior matching performance compared to histogram-based descriptors.

High level features

Other than primitive point features, higher-level features have been used. Line features represent the underlying structure of the objects and have been used as robust features in several cases where point-feature approaches fail. For instance, in retinal image registration, the topology of the vascular structure has been used for registration [130, 30]. Choe [30] proposes to base multimodal registration on vessel junctions.
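The nearest-neighbor pruning described above is commonly implemented as Lowe's ratio test; a minimal sketch follows (the 0.8 threshold and brute-force search are illustrative choices, not the thesis implementation):

```python
def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Lowe-style ratio test: keep a match only if the nearest neighbor in
    descriptor space is sufficiently closer than the second nearest.
    Descriptors are tuples of floats; returns (index_in_a, index_in_b) pairs."""
    def dist2(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    matches = []
    for i, d in enumerate(desc_a):
        scored = sorted((dist2(d, e), j) for j, e in enumerate(desc_b))
        # Compare squared distances, so the ratio threshold is squared too.
        if len(scored) >= 2 and scored[0][0] < (ratio ** 2) * scored[1][0]:
            matches.append((i, scored[0][1]))
    return matches
```

A keypoint whose two best candidates are nearly equidistant is discarded as ambiguous, which is precisely what suppresses matches on repeated local patterns.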
Edges have also been used in the 3D face modeling scenario [112]. In this document, we show that line features can be used as a robust alternative when point descriptors fail to match.

2.2 2D Registration and mosaicking

Point correspondences between images provide the underlying epipolar geometry. If we assume a planar scene, the transformation model provided by the epipolar geometry is a homography, which can be computed robustly with a RANSAC-based method [141, 54].

A problem arises when more than 2 images of the same scene are being registered. This process of aligning a sequence of multiple images is called image mosaicking. If sequential pairwise registration is performed, small errors between pairs of images accumulate into large visible discrepancies. To address this issue, global alignment approaches need to be used. The original idea was to register each image to the mosaic instead of its closest neighbor [118]. However, this still suffers from accumulating error and increasing computational cost as more images are added. A global system was proposed by Davis [34], where the registration was performed between every pair of images independently of their order. Marzotto [91] proposed the idea of a graph representation in which the nodes are the different images and the edges represent the spatial distance of each image pair. The minimum spanning tree of the graph enables extraction of the best set of homographies for the frames. Note that these ideas only apply to offline registration problems, as they do not scale easily. Online approaches need to compensate for accumulated error through other methods. Usually, the confidence of the alignment is assessed and a recovery scheme is started if the algorithm detects that it has diverged [152].

One major issue with global 2D registration is the assumption that the scene is flat. When the scene has an arbitrary 3D structure, a parallax effect will be clearly visible.
To address this issue, local elastic alignment has been proposed for panoramas [127] or medical images [84]. These local refinements can be sufficient for near-planar surfaces, like retinal images. However, when the 3D structure is large, the parallax cannot be efficiently compensated without reconstructing the underlying 3D structure.

2.3 3D Registration and Structure from Motion

Given a set of 2D correspondences, the problem of simultaneously estimating the camera poses and the 3D locations is called Structure-from-Motion (SfM). Most SfM methods follow the same framework.

First, the relative pose between camera pairs is estimated through epipolar geometry constraints. In the presence of unknown 3D structure, the epipolar geometry relates the point correspondences of 2 images through the fundamental matrix F [110]. Various methods have been proposed to extract this pose information, using different constraints and numbers of correspondences [100, 131].

Second, the cameras are transformed into the same coordinate system, and the 3D information is extracted with bundle adjustment [54]. Given a set of 3D points and their 2D projections, bundle adjustment aims at jointly refining the positions of the 3D points and the camera poses. The process is a non-linear minimization of the total reprojection error over the 3D locations and the camera parameters. Since points are usually not visible in many of the images, the system has a sparse structure, and Sparse Bundle Adjustment versions have been proposed to speed up the optimization process [89].

Third, dense 3D structure is extracted based on the estimated camera poses. For two views, images can be rectified so that the correspondence search is limited to corresponding scan lines. Dense matches enable the computation of a depth map in which the depth is known at every pixel. However, finding dense matches is not easy due to homogeneous and occluded areas [120].
As a result, more global approaches have been developed, with dynamic programming for example [72]. To avoid having to estimate dense correspondences at the image level, space carving methods have been proposed [74]. Starting from a large initial volume, voxels are gradually removed based on photometric consistency. Similarly, Kang and Medioni [69] use a plane sweep approach in which each point is moved in 3D to select the depth that maximizes photometric consistency.

All these approaches work well provided that they can compute accurate camera matrices. As a result, SfM methods cannot work in cases where point matches are not reliable.

Chapter 3 Multimodal Registration of Retinal Images

3.1 Introduction

3.1.1 Significance

In medical image applications, every modality carries some distinct information, and using only one image might not be enough for reliable diagnosis or for analyzing the evolution of a disease over time. Retinal imaging is a particular example in which physicians typically work with multiple sensors, each providing its own information. In the past, many 2-D modalities have been used by ophthalmologists: notably color-fundus photographs (CF), infra-red (IR), red-free (RF) and auto-fluorescence (AF) images. Figure 3.1 shows an example of each of these modalities.

Figure 3.1: Example of different 2D modalities for one eye. From left to right: Color Fundus (CF), Auto-Fluorescence (AF), Infra-Red (IR), Red-Free (RF).

Figure 3.2: Example of OCT volume composed of a set of scans.

Each modality has led to a set of discoveries about retinal diseases. For instance, analysis of color photographs from large epidemiological studies has identified a number of ophthalmic risk factors for the development of advanced Age-Related Macular Degeneration (AMD) and visual loss, including the presence of large drusen, larger drusen area, pigmentary changes or geographic atrophy [7].
Certain patterns of FAF, particularly diffuse areas, have been shown to correlate with a higher rate of enlargement of atrophic lesions [122]. Infra-red images enabled the classification of different types of choroidal neovascularization (CNV) [135].

Recently, optical coherence tomography (OCT) has gained popularity as it provides 3D information on the deep layers of cells within the retina. More specifically, it represents a 3-D volume composed of several slices, called B-scans (Fig. 3.2). Access to a deep 3D understanding of the retinal structure helped doctors understand some of the effects underlying AMD. Rosenfeld and colleagues [154] identified a zone of reduced reflectivity in volume OCT scans of eyes with GA. This zone of reduced reflectivity appeared to foreshadow the zone of enlargement of atrophy over the ensuing months.

While modalities were first considered in isolation, it has been shown that combining the information from multiple modalities enhances the robustness and accuracy of disease segmentation [59]. As a result, the information must be brought into a single coordinate system, and accurate registration is a necessary step. In this study, we focus on registering multiple images from different modalities including CF, IR, AF, RF images and OCT volumes. For each modality, there may be more than one image.

3.1.2 Issues

The multimodal registration problem is challenging. Indeed, the background texture, the noise pattern, the illumination, the vessels and the disease appear very different across modalities. As a result, intensity-based similarity measures cannot be used. Besides, usual point descriptors like SIFT or SURF are not consistent across modalities and produce almost no correct matches.

Secondly, finding how the OCT 3D volume correlates with the other 2D modalities is not trivial. Inspired by what physicians do, we use a 2-D projection of the OCT volume to simplify the registration problem.
The 11 layers of the retina are segmented from the OCT 3-D volume with the graph-based search described in [61], and the layers around the Retinal Pigment Epithelium (RPE) are projected on a plane. These layers are specifically picked because they offer the best contrast for the large blood vessels. Figure 3.3 shows an example of the OCT projection around the RPE layer.

Thirdly, our dataset is composed of macular scans that are centered around the fovea. This adds another challenge, as the region covered by the OCT scans can be very small, textureless, and contain few visual biological structures. As a result, few structural features are available as a basis for registration.

Figure 3.3: Example of OCT projection around the RPE layer.

Finally, as can be seen in Figure 3.3, the OCT projection is very noisy, and motion artifacts can be mistaken for meaningful structures. As a result, the registration approach must be robust enough to handle large amounts of noise.

3.1.3 Related Work

3.1.3.1 Multimodal Registration

Similarly to the global registration problem, existing approaches for multimodal retinal image registration can be classified into two broad categories: intensity-based approaches and feature-based approaches. Intensity-based methods process the intensity values directly and rely on an appropriate similarity measure that needs to be optimized, such as mutual information [142, 109, 84, 95]. Feature-based approaches fall into two groups: point-based and vasculature-based methods. Point-based methods extract and match feature points and descriptors, such as Y-features [30, 130] or SIFT features [85, 150]. Vasculature-based methods are specific to retinal image processing and rely on the extraction of the blood vessels.

Among intensity-based methods, mutual information is the most common measure used for multimodal registration; [109] presents a survey of mutual-information-based registration in medical imaging. One issue is that increasing the number of degrees of freedom of the transformation severely impacts the computation time. Even though some approaches have been proposed for speeding up the process [84, 95], it remains cumbersome. Besides, the registration process can converge towards a local minimum if the initial transformation estimate is poor.

For the point-based approaches, several feature descriptors have been proposed. Choe and Cohen extract the vessel bifurcations, or Y-features, in [30]. These features are stable across modalities but can be scarce for some image modalities or in pathology-affected images. SIFT features have been used for unimodal registration, but their reliance on gradient information makes them unsuitable for multimodal registration. Still, some indirect approaches have been proposed with these features. In [85], the SIFT features are extracted from the image edge maps and matched with an Iterative Closest Point algorithm. The approach provides accurate results, but the edge maps are not consistent for every modality. [27] and [13] present feature descriptors that were inspired by SIFT but tweaked for multimodal registration. In [27], Chen et al. introduce the Partial-Intensity Invariant Feature Descriptors (PIIFD). The descriptors are extracted at Harris corners and are essentially SIFT descriptors in which the opposite gradient contributions are summed up to account for a possible change in gradient orientation across modalities. Ghassabi et al. use these descriptors on UR-SIFT features to improve the results in [46]. In [13], Bathina et al. use a Hessian filter to extract the curvature, extract the features from the junctions of the curvature map, and base their new descriptors on the Radon transform. In [104], the Gixel Array Descriptor (GAD) is proposed to handle multimodal images by extracting a descriptor solely based on edges.
A problem that all these methods have in common is that retinal images contain many repeated local patterns, which tends to disrupt the feature matching step. Also, the background noise and the disease appearance are not consistent across modalities and can impact the feature descriptors.

The vasculature information is consistent across modalities and has been widely used for registration. The most notable approach is the dual-bootstrap iterative closest point, presented in [130]. The registration starts from Y-junction matching and initializes a low-order transformation estimate. From one correspondence, the registration is performed in a local patch that slowly expands with higher-order transformation estimates. The approach was later modified in [150] to be generalized to all types of images, relying on SIFT features instead. The state-of-the-art commercial software i2k Align Retina [39] uses a version of this algorithm. A graph matching method was introduced in [36], where the vessel map is converted to a graph by extracting junctions and extremities. These approaches are the most widely used and the most robust overall. Their main drawback is that they can be quite sensitive to segmentation noise and can converge to local minima.

To our knowledge, only [82, 101] have worked on registering OCT projection images with other modalities. In [82], a similarity measure based on vessel overlap is introduced, and the registration is performed via brute force in the space of translations and Iterative Closest Point [28]. In [101], a similar similarity function is used but is patch-based instead of global. Note that both algorithms performed the registration of OCT projection images with color fundus images only and used a quadratic model.

3.1.3.2 Vessel Segmentation

As previously stated, blood vessels provide a robust feature for registration.
However, the segmentation of retinal vessels in photographs of the retina is itself a particularly challenging problem that has received high interest from the machine-vision community [99]. Vessel segmentation algorithms can be divided into two groups: rule-based methods and supervised methods. In the first group, we highlight methods using vessel tracking, mathematical morphology, matched filtering, model-based locally adaptive thresholding and deformable models. On the other hand, supervised methods are those based on pixel classification.

Supervised methods learn a pixel classifier to decide whether each pixel belongs to a vessel or not. Neural networks have been used for that purpose [45, 92], with detection rates over 90%. With the recent popularity of deep learning, approaches relying on convolutional neural networks [87] have been proposed and constitute the state of the art. Although supervised learning methods are the best for the task of vessel segmentation, they are quite impractical in the multimodal case. Indeed, they require large amounts of labeled training data for each modality. Also, if new sensors are developed in the future, the networks would need to be retrained.

Rule-based methods extract vessels based on prior knowledge about the linear structure of the vessels. Early approaches relied on local curvature and morphological operators [155]. Frangi [44] proposed a "vesselness" measure based on the analysis of the eigenvalues of the Hessian matrix. A similar idea is to look at the image response of filters at multiple scales and orientations; examples include Gaussian filters and Gabor wavelets [29]. Vessel tracking approaches start from a sparse set of reliable points and propagate along the vessels to perform the segmentation [113]. These approaches provide reliable segmentation but usually lack ways to discard noise. As a result, they tend to have a high recall but only average precision.
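The Hessian-eigenvalue "vesselness" idea of Frangi [44] can be sketched per pixel as follows; the parameter values and the sign convention (dark vessels on a bright background) are illustrative assumptions, not the thesis implementation:

```python
import math

def vesselness_2d(hxx, hxy, hyy, beta=0.5, c=15.0):
    """Frangi-style 2-D vesselness from the Hessian entries at one pixel.

    Eigenvalues |l1| <= |l2| of [[hxx, hxy], [hxy, hyy]]; for a dark vessel
    on a bright background, l2 > 0 across the vessel. A tube-like structure
    has |l1| << |l2|, a blob has |l1| ~ |l2|, flat areas have both small.
    """
    mean = (hxx + hyy) / 2.0
    d = math.sqrt(((hxx - hyy) / 2.0) ** 2 + hxy ** 2)
    e1, e2 = mean + d, mean - d
    l1, l2 = sorted((e1, e2), key=abs)   # order by magnitude: |l1| <= |l2|
    if l2 <= 0:
        return 0.0                        # wrong polarity: not a dark vessel
    rb = abs(l1) / abs(l2)                # blobness ratio
    s2 = l1 * l1 + l2 * l2                # second-order structureness
    return math.exp(-rb * rb / (2 * beta ** 2)) * (1 - math.exp(-s2 / (2 * c ** 2)))
```

A strongly anisotropic Hessian (one large, one small eigenvalue) scores near 1, while an isotropic blob is penalized by the ratio term, which is why the filter responds along vessels but not at spots of noise.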
3.1.4 Overview of Our Approach

In our approach, we register 5 different modalities in an integrated framework. We also propose one of the first studies on multimodal registration involving OCT volumes. Classical point-based approaches do not provide enough matches to perform registration. As a result, we base our approach on higher-level features, specifically on salient linear structures. For the retinal images, these features include the blood vessels but may also include other types of structure, such as disease boundaries. Figure 3.4 presents an overview of our approach.

Figure 3.4: Overview of our multimodal retinal image registration approach.

For the segmentation, we propose to extract the salient lines with a tensor-voting-based approach [88, 66], which is accurate and helps discard noise. A connected component analysis is also used to prune the small patches of remaining noise.

For the registration, these line features are compared with a Chamfer distance, and the pairwise rigid transformations are estimated by matching the line junctions and extremities. An Iterative Closest Point approach is used to refine the rigid transformations. A chained-registration framework is used to recover in case of wrong pairwise alignment, as proposed by [91]. Finally, an elastic registration based on thin-plate splines is used to account for the small but existing 3D structure.

Our contributions include:
- Multimodal registration of more than 2 modalities, involving challenging OCT projection maps of macular scans.
- A tensor-voting-based approach for robust line structure segmentation.
- A robust framework for registration based on line structures.

In the following sections, we present our approach. First, we introduce the tensor-voting framework and show how it can be applied to line segmentation. Then, we present our registration framework, with pairwise registration, chained registration and elastic refinement.
Our experimental data and our results are then presented. Finally, we discuss the limitations of our approach and draw conclusions.

3.2 Line Structure Segmentation

Line structures are robust and mostly consistent across modalities. Therefore, they are very desirable features for multimodal registration. However, their extraction is a challenging issue to address, as the curvilinear structures have different appearances across modalities and may be ill-defined, noisy or incomplete. To address this problem, most state-of-the-art approaches rely on machine learning techniques, but these are impractical in the multimodal case. Indeed, training would need to be performed every time a new modality is introduced. The tensor-voting framework [50, 66] was applied to medical images in [88] and proved its ability to extract poorly-defined line structures in noisy images. In 2-D, tensor voting is a method that can extract smooth salient linear structures from a point cloud. It is a fairly generic approach and can be applied to every modality with little tuning.

3.2.1 Review of the 2-D Tensor Voting framework

3.2.1.1 Tensor Representation

Let us assume we have an input of sparse 2-D points. To apply tensor voting, we represent every point as a symmetric tensor, which in 2-D is a symmetric 2 x 2 matrix. The eigen decomposition of this matrix gives useful information (see Fig. 3.5).

Figure 3.5: Graphical representation of a symmetric 2-D tensor.

The tensor voting framework makes extensive use of the tensor representation. An interesting property of tensors is that adding two tensors gives another tensor. If we sum up tensors with the same orientation $e_1$, the resulting tensor has a larger value $\lambda_1 - \lambda_2$ in the $e_1$ direction. If we add two tensors with different orientations, the new orientation is the average of the two. However, the resulting ballness increases, which means that the resulting orientation is less probable.
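This additivity is easy to verify numerically. The sketch below builds 2-D stick tensors, sums them, and reads off stickness and ballness from the closed-form eigenvalues of a symmetric 2 x 2 matrix (the encoding and function names are our assumptions for illustration):

```python
import math

def stick_tensor(theta, weight=1.0):
    """2-D stick tensor oriented along angle theta: weight * n n^T, stored
    as (a, b, c) for the symmetric matrix [[a, b], [b, c]]."""
    nx, ny = math.cos(theta), math.sin(theta)
    return (weight * nx * nx, weight * nx * ny, weight * ny * ny)

def add_tensors(t1, t2):
    return tuple(x + y for x, y in zip(t1, t2))

def stickness_ballness(t):
    """Closed-form eigenvalues l1 >= l2 of the symmetric 2x2 matrix.
    Returns (stickness, ballness) = (l1 - l2, l2); when stickness is
    small, l2 is close to l1 and the orientation is uncertain."""
    a, b, c = t
    mean = (a + c) / 2.0
    d = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return (mean + d) - (mean - d), mean - d
```

Summing two tensors with the same orientation doubles the stickness with zero ballness, while summing two perpendicular sticks yields a pure ball (identity matrix), matching the behavior described in the text.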
The tensor voting framework uses this property to infer the underlying structure in a point cloud.

3.2.1.2 Voting process

The voting process consists of making every tensor accumulate votes from its neighbors. As a result, tensors that belong to the same structure cast votes that intensify each other's stickness and orientation certainty, while tensors from distinct structures weaken the resulting orientation probability. In the end, a tensor with a large stickness $\lambda_1 - \lambda_2$ has neighbors with the same orientation and belongs to a line. A tensor with a large ballness and small stickness has unoriented neighbors and belongs to an area. A tensor with $\lambda_1 = \lambda_2 = 0$ has no voting neighbors and is noise. This desirable property makes the approach appropriate for line extraction, since the line structures can be extracted from high stickness values while the areas and noise can be discarded based on the ballness values.

The result of summing up all the tensor votes is another tensor, and an eigen decomposition yields the resulting saliency and orientation. The votes are defined under the assumption that points which belong to the same structure are connected by a smooth curve. Namely, each point votes on its neighbors with a tensor that encodes the length of the smooth path and its curvature. An oriented tensor at point $P$ with saliency $S_P$, using the notations of Fig. 3.6a, votes on a point $Q$ with:

$$V_{P \to Q} = S_P \, e^{-\frac{s^2 + c\kappa^2}{\sigma^2}} N N^T \tag{3.1}$$

with $s = \frac{\theta l}{\sin\theta}$, $\kappa = \frac{2\sin\theta}{l}$, $c$ a constant parameter, and $N = [-\sin 2\theta \;\; \cos 2\theta]^T$. The resulting voting field is shown in Fig. 3.6c. An unoriented tensor votes with $\int_\theta V_{P \to Q}$, and its voting field is shown in Fig. 3.6b.

Classical tensor voting consists of two steps: the ball vote and the stick vote. In the first step, the tensors' orientations are unknown, and every tensor votes all around itself with the field in Fig. 3.6b. In the second step, the tensors vote with the field in Fig. 3.6c.
Because of this limited voting field, tensors that do not belong to the same structure no longer communicate and do not disturb each other's resulting stickness.

Figure 3.6: Geometric representation of the saliency decay function. (a) Notations. (b) Ball voting field. (c) Stick voting field.

3.2.2 Tensor-voting-based line segmentation

3.2.2.1 Ball vote

To apply tensor voting to our image, we represent every pixel as a 2-D tensor, i.e. a symmetric 2 x 2 matrix whose eigen decomposition gives useful information (see Fig. 3.5). The eigenvector $e_1$ represents the orientation of the tensor, and the eigenvalues $\lambda_1 > \lambda_2$ give information on its ballness and stickness: the ballness quantifies the uncertainty of the tensor orientation, while the stickness $\lambda_1 - \lambda_2$ quantifies its saliency.

The voting process consists of making every tensor accumulate votes from its neighbors. As a result, tensors that belong to the same structure cast votes that intensify the stickness and the certainty of each other's orientation, while tensors from distinct structures weaken each other's orientation probability. In the end, a tensor with a large stickness value $\lambda_1 - \lambda_2$ has neighbors with close orientations and belongs to a line. This property makes the approach appropriate for line extraction, since the line structures can be extracted from high stickness values, while the large areas and noisy parts can be discarded based on high ballness values.

For the first step, called ball vote, every point $P$ with an intensity $I_P$ votes on each of its neighboring points $Q$ with the following tensor:

$$\text{vote}(P \to Q) = (I_P - I_Q) \, \exp\!\left(-\frac{\|PQ\|^2}{2\sigma^2}\right) N N^T \tag{3.2}$$

where $N$ is a vector normal to $(PQ)$. For the color fundus image, we use the green channel intensity value, as it has been reported to show the best contrast. Also, a mask is applied to remove the pixels that are not part of the retina.
Each point $Q$ sums up the votes from all of its neighbors, and the resulting tensor contains information on the surrounding structure. Indeed, the eigenvalues $|\lambda_1| > |\lambda_2|$ are signed based on the intensity distribution. We have $\lambda_1 > 0$ for a dark pixel with bright neighbors and $\lambda_1 < 0$ for a bright pixel surrounded by dark pixels. In the modalities of our dataset, the salient vessels are darker than the surrounding background, so we discard all the points with negative $\lambda_1$. For other modalities, such as fluorescein angiograms, there could be bright vessels on a dark background, and we would discard the cases with positive $\lambda_1$ instead.

The resulting saliency values depend on the intensity distribution of the pixels' neighborhood and are not normalized. Therefore, we apply a local normalization step [115] to account for possible non-uniform contrast in the image. As a result, the locally salient structures are enhanced.

3.2.2.2 Connected component analysis

At this point, we have extracted the major lines, including the vessels. However, there may be some gaps in the lines and some noise from the background. After noticing that the noise is mostly composed of small patches, we decided to use a simple pruning scheme based on connected component analysis. Namely, we arbitrarily keep the 10% largest connected components and discard the rest. This step might result in discarding some small vessels or salient lines, but our subsequent registration is more robust to false negatives than to false positives. Since the OCT projection images usually contain a lot of noise, we additionally remove all the connected components that are not connected to the border of the image. This heuristic comes from the fact that the vessels should all be connected to the side of the frame in fovea-centered images.

3.2.2.3 Stick vote

Finally, a second voting step, called stick vote, is used to close the gaps and smooth out the lines.
In this step, the voting field is restricted based on the tensors' orientation (see Fig. 3.6c), so that tensors that do not belong to the same structure no longer communicate and do not disturb each other's resulting stickness.

Figure 3.7: Effect of the line segmentation steps on one eye. Top line: AF image. Bottom line: OCT projection. From left to right: original image, result of ball vote, result of connected component pruning, result of stick vote.

With the notations of Fig. 3.6a, the vote cast from $P$ to $Q$ is:

$$\text{vote}(P \to Q) = S_P \, e^{-\frac{s^2 + c\kappa^2}{\sigma^2}} N N^T \tag{3.3}$$

with $s = \frac{\theta l}{\sin\theta}$, $\kappa = \frac{2\sin\theta}{l}$, $c$ a constant parameter, and $N = [-\sin 2\theta \;\; \cos 2\theta]^T$.

Our segmentation method has the advantage of requiring minimal tuning when changing modalities, does not rely on any type of training, and is robust to noise. Also, this approach is highly parallelizable and can be implemented on GPU for improved speed.

3.3 Multimodal Registration

After segmenting the line maps, our goal is to register all the modalities to the OCT projection map. We start with a pairwise registration in which a similarity transformation is inferred from one or two correspondences. Since the OCT projection images contain few features, this pairwise registration may fail for some modalities. In order to recover from large errors, we register all the 2-D modalities together and use a chained registration framework. Finally, the rigid constraint is relaxed to account for the non-planarity of the retinal structure, and we use an elastic refinement method based on thin-plate splines to obtain the final alignment.

3.3.1 Chamfer distance

Our line maps are mostly consistent across modalities. Still, some lines might be present in some modalities and absent in others. Therefore, to assess the quality of a transformation, we need a similarity measure that is robust to background clutter and to the presence or absence of edges.
The Chamfer distance [12, 126] has been reported to have this property, which makes it appropriate for our case. Also, computing the Chamfer distance is very fast since the distance transforms can be precomputed.

First introduced in [12], the Chamfer distance evaluates the asymmetric distance between two edge maps. For a template $T$ and a binary edge map $E$, let the distance transform for every image point $x \in T$ be:

$$DT_E(x) = \min_{x_e \in E} \|x - x_e\|_2 \tag{3.4}$$

In practice, we truncate this value to a parameter $\tau$ in order to reduce the negative impact of missing edges in $E$. The normalized Chamfer distance is then defined as:

$$d_{cham}(T, E) = \frac{1}{|T|} \sum_{x_t \in T} \min\left(DT_E(x_t), \tau\right) \tag{3.5}$$

The distance was further refined in [126], which incorporated edge orientation into the distance. If $\phi(x)$ is the orientation at point $x$, then we note:

$$ADT_E(x) = \arg\min_{x_e \in E} \|x - x_e\|_2 \tag{3.6}$$

$$d_{ori}(T, E) = \frac{2}{\pi |T|} \sum_{x_t \in T} \left|\phi(x_t) - \phi\left(ADT_E(x_t)\right)\right| \tag{3.7}$$

The resulting oriented Chamfer distance that we use to measure the closeness of two line maps is then:

$$OCD_{\lambda, \tau}(T, E) = (1 - \lambda)\, d_{cham}(T, E) + \lambda\, d_{ori}(T, E) \tag{3.8}$$

3.3.2 Pairwise registration

To perform the pairwise registration step, we first restrict the possible transformations $T$ to similarities with 4 parameters. In homogeneous coordinates, we look for a transformation $T$ of the form:

$$T = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix} \tag{3.9}$$

For each image, we skeletonize the line map obtained from the previous section so that we can extract the line junctions and extremities. These are the pixels whose connectivity is exactly 1 or greater than 3.

Most previous approaches extracted feature descriptors to speed up the matching. However, the gain in speed comes at the cost of discarding good matches if the descriptors are not consistent. Since there are few feature points, a brute-force matching on GPU is robust and not time-consuming. For the AF, IR, OCT and RF modalities, we approximately know the physical scale.
Also, because of the acquisition process, the in-plane rotation is always small for retinal images. Therefore, there are only 2 degrees of freedom $(t_x, t_y)$, and we can use a single correspondence to infer the transformation.

For the CF modality, the scale is inconsistent in our dataset. Therefore, we keep all 4 degrees of freedom and match pairs of feature points to find the $(t_x, t_y, s, \theta)$ parameters of the transformation. To speed up the process, we discard angles larger than 20 degrees and scales smaller than 1. Indeed, CF images typically have a larger resolution than the other modalities.

The quality of every possible transformation is assessed by computing the oriented Chamfer distance [126]. We keep the transformation yielding the smallest distance as an initial guess.

To refine the transformation, we use the point-to-plane Iterative Closest Point (ICP) algorithm [28]. At first, we fix the scale parameter $s$ and find the optimal $(t_x, t_y, \theta)$ parameters iteratively:

1. For every point in the source image, find the closest point in the target image. Discard the correspondences that are farther than 3 times the median distance.
2. Estimate the transformation parameters from these correspondences.
3. Go back to step 1 until convergence.

After the above algorithm converges, we relax the fixed-scale constraint and use [38] to compute the scale parameter in step 2. After convergence, we have a similarity transformation and the pairwise registration is complete.

Note that this algorithm does not guarantee convergence towards the global minimum in case of poor initialization. This leads the pairwise registration to fail in some cases where matching features are lacking between two specific modalities. We want to use the information in the other modalities as a way to recover in case of failure, and we use a chained registration for that purpose.
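The iterate-correspond-estimate structure of the ICP refinement above can be sketched with a toy translation-only variant. The thesis uses a point-to-plane ICP with rotation, outlier rejection and (later) scale; this point-to-point simplification only illustrates the loop:

```python
def icp_translation(src, dst, iters=20):
    """Toy ICP for 2-D point sets, estimating a pure translation (tx, ty).

    Each iteration: (1) match every source point to its closest target
    point under the current estimate, (2) re-estimate the translation as
    the mean residual of those correspondences. Illustrative sketch only.
    """
    tx, ty = 0.0, 0.0
    for _ in range(iters):
        # Step 1: closest-point correspondences under the current estimate.
        pairs = []
        for (sx, sy) in src:
            px, py = sx + tx, sy + ty
            cx, cy = min(dst, key=lambda d: (d[0] - px) ** 2 + (d[1] - py) ** 2)
            pairs.append(((sx, sy), (cx, cy)))
        # Step 2: closed-form translation update (mean of the residuals).
        tx = sum(c[0] - s[0] for s, c in pairs) / len(pairs)
        ty = sum(c[1] - s[1] for s, c in pairs) / len(pairs)
    return tx, ty
```

As in the full algorithm, a poor initial estimate can lock the correspondences onto the wrong targets, which is exactly the local-minimum failure mode discussed in the text.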
3.3.3 Chained registration

In order to increase the robustness of the overall approach, we perform a chained registration that enables us to recover in case of registration failure for one or several pairs. We build a complete weighted graph in which every node is a modality image. For every edge connecting two modalities, the weight is set to the Chamfer distance that corresponds to the transformation between the two images. In order to discard the poor pairwise registrations, we want to compute the closest modality for every image in terms of Chamfer distance. This is equivalent to finding the minimum spanning tree of the graph, which can be done with Kruskal's algorithm [73]. Finally, for every image modality, we find the chained transformation to the reference image. In order to do so, we find for every image the path that goes to the reference image in the graph. Note that there is only one such path since we have a minimum spanning tree. Fig. 3.8 shows an example of the evolution of this graph.

Starting with an identity transformation, we multiply by the pairwise transformation between the source image and the target image every time we follow an edge. For instance, in the particular case of Fig. 3.8, we have T^{chained}_{41} = T_{45} T_{52} T_{21}, where T_{ij} is the pairwise transformation between modality i and modality j. Note that the chained registration step makes the system more robust as more images are used. However, we cannot recover if all the pairs involving one modality fail at the same time.

Figure 3.8: Example of the evolution of the graph for chained registration on one eye. Left: complete graph. Middle: minimum spanning tree. Right: final graph.

3.3.4 Non-linear refinement

After the chained registration, we have robustly registered all the modalities to the reference frame. However, we had previously restricted the transformation to similarities.
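The graph bookkeeping behind the chained registration of Sec. 3.3.3 can be sketched as follows: Kruskal's algorithm with a union-find, then composition of pairwise 3x3 similarity matrices along the unique tree path. The function names and the composition convention are ours:

```python
import numpy as np

def kruskal_mst(n, edges):
    """Minimum spanning tree by Kruskal's algorithm with union-find.
    `edges` is a list of (weight, i, j); returns the kept (i, j) edges."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a
    mst = []
    for w, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((i, j))
    return mst

def chained_transform(mst, pairwise, src, ref):
    """Compose pairwise 3x3 matrices along the unique MST path from
    `src` to the reference modality `ref`; `pairwise[(a, b)]` holds T_ab."""
    adj = {}
    for i, j in mst:
        adj.setdefault(i, []).append(j)
        adj.setdefault(j, []).append(i)
    stack, prev = [src], {src: None}      # depth-first search for the path
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                stack.append(v)
    path = [ref]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()                        # [src, ..., ref]
    T = np.eye(3)
    for a, b in zip(path, path[1:]):
        T = T @ pairwise[(a, b)]          # e.g. T_45 T_52 T_21 as in the text
    return T
```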
Since the retinas are not completely planar objects, there are some non-linearities that should be accounted for. Most state-of-the-art papers in retinal image processing use quadratic transformations, but there is no tangible justification for this choice. Therefore, we decide to use an elastic refinement based on thin-plate splines (TPS) [33, 151]. This algorithm tries to find the mapping function f between point sets (p_a) and (q_a) of size K that minimizes:

E_{TPS}(f) = \sum_{a=1}^{K} \| q_a - f(p_a) \|^2 + \lambda \iint \left[ \left( \frac{\partial^2 f}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x \partial y} \right)^2 + \left( \frac{\partial^2 f}{\partial y^2} \right)^2 \right] dx \, dy    (3.10)

The first term is the data term for the point matching. The second term is a smoothness term that avoids unrealistic transforms and makes the system somewhat robust to bad correspondences. There exists a unique minimizer f composed of an affine part d and a non-affine coefficient w such that:

f(p_a) = p_a \cdot d + \phi(p_a) \cdot w    (3.11)

where \phi(p_a) is a 1 x K vector whose b-th entry is \phi_b(p_a) = \| p_b - p_a \|^2 \log \| p_b - p_a \|.

The forms of d and w can be derived to get the final transformation [33]. The result is an elastic transformation that accounts for the non-planarity of the retinas.

3.4 Data

Our multimodal dataset was composed of 26 eyes from 20 different patients. For each eye, 5 different modality images were acquired: Color Fundus, AutoFluorescence, InfraRed, Red-Free and Spectral-Domain Optical Coherence Tomography. We also have a longitudinal dataset of 2 eyes for 81 patients in which Fundus AutoFluorescence images were gathered after 6, 12 and 18 months to monitor the growth of the disease.

Each subject underwent volume OCT imaging using a Zeiss Cirrus (Carl Zeiss Meditec, Inc., Dublin, USA) SD-OCT in accordance with the existing standardized acquisition protocol at our institution. All scans consisted of 512 (A-scans) x 128 (B-scans) x 496 voxels and the physical dimensions are 6 x 6 x 2 mm^3. If both eyes were present for a subject, one eye was randomly chosen for subsequent analysis.
The Fundus Auto-Fluorescence, Infra-Red and Red-Free images were obtained from a Heidelberg SPECTRALIS confocal scanning laser ophthalmoscope (cSLO) with a field of view of 30 x 30 degrees. The image resolution is 768 x 768 pixels and the physical dimensions are 8.85 x 8.85 mm^2. The color fundus images were obtained from a Nidek 3Dx system. The field of view and physical dimensions are not consistent throughout the dataset.

3.5 Results

3.5.1 Segmentation results

Most reported approaches focus on segmentation of color fundus images. We wish to assess their performance on the other modalities in our dataset, namely AF, IR, RF and OCT. To evaluate our segmentation algorithm, we manually draw the vessels on the 26 patients and compute precision/recall values for different types of segmentation. We consider a pixel to be a true positive if it is less than 10 pixels away from a true vessel. A pixel is a false positive if there is no true vessel within 10 pixels around it. Similarly, a pixel is a false negative if no vessel is detected within 10 pixels of a true vessel pixel.

Table 3.1: Comparison of precision/recall rates for our segmentation approach and other rule-based approaches.

                            Precision                        Recall
Modality                    CF    AF    IR    RF    OCT      CF    AF    IR    RF    OCT
Vesselness+Pruning [82]     0.49  0.51  0.30  0.39  0.71     0.20  0.81  0.32  0.43  0.61
COSFIRE [8]                 0.66  0.31  0.43  0.39  0.34     0.45  0.77  0.67  0.46  0.86
Ball vote                   0.61  0.85  0.68  0.83  0.33     0.95  0.98  0.98  0.98  1.00
Stick vote                  0.62  0.85  0.70  0.83  0.42     0.79  0.94  0.94  0.95  1.00
Stick vote+Pruning          0.78  0.89  0.77  0.90  0.76     0.70  0.78  0.74  0.74  0.77

We compare our approach with other rule-based approaches that do not use learning, namely [82, 8]. Li et al. [82] propose an approach based on a vesselness measure and non-maximal suppression. Azzopardi et al. [8] propose a multi-scale, multi-orientation Gaussian filter called COSFIRE. Table 3.1 shows that our approach performs the best across modalities.
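The tolerant precision/recall protocol above can be sketched as follows (brute-force distances, adequate for small images; the function names are ours):

```python
import numpy as np

def tolerant_precision_recall(pred, truth, tol=10):
    """Precision/recall with a spatial tolerance: a predicted vessel pixel
    counts as a true positive if a true vessel pixel lies within `tol`
    pixels of it, and symmetrically for the recall of true vessel pixels."""
    def near(mask, pts):
        # distance from each point in `pts` to the nearest pixel of `mask`
        my, mx = np.nonzero(mask)
        m = np.stack([my, mx], 1).astype(float)
        d = np.linalg.norm(pts[:, None] - m[None, :], axis=2)
        return d.min(axis=1)
    py, px = np.nonzero(pred)
    ty, tx = np.nonzero(truth)
    p = np.stack([py, px], 1).astype(float)
    t = np.stack([ty, tx], 1).astype(float)
    tp = (near(truth, p) <= tol).sum()          # predicted pixels near truth
    precision = tp / len(p)
    recall = (near(pred, t) <= tol).mean()      # truth pixels near prediction
    return precision, recall
```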
The several iterations of tensor voting help improve precision while maintaining a high recall.

3.5.2 Registration results on multimodal data

Quantifying the accuracy of non-rigid registration approaches is not easy. Indeed, we cannot compute a ground-truth transformation that would contain non-linearities. A method that is often used for unimodal registration is to draw the vessels and quantify the overlap. However, the vessels are not completely consistent across modalities and we cannot measure what the maximal overlap would be. Therefore, we use a set of manually defined control points, as in [82]. Control points can be vessel junctions and crossovers, vessel points of high curvature or salient disease boundaries. After having labeled the control points in the OCT projection reference image, we find the corresponding control points in each registered image and measure the Root Mean-Square Error.

Table 3.2: Accuracy of our registration results per modality: average and standard deviation (in parentheses) of the Root Mean Square Error (in micrometers). Lower values are better.

Modality     AF        IR        CF        RF
Linear       108 (75)  100 (85)  96 (89)   99 (79)
Quadratic    71 (52)   75 (62)   78 (68)   76 (57)
Elastic      61 (45)   56 (43)   52 (41)   42 (37)

The manual selection ensures that the control points are real correspondences, but they are quite loosely picked, which can affect the resulting accuracy. Table 3.2 shows the average error with a similarity transformation, a quadratic transformation and our elastic approach for each modality. We see that non-rigid transformations improve the overall accuracy. Our algorithm is visually accurate and there are no large failure cases. Fig. 3.9 shows some examples of our registration results. The registration works well even when there are very few features in the OCT projection image, as can be seen in the last case of Fig. 3.9. Even though our approach was designed for multimodal registration, it is also applicable to longitudinal data, as can be seen in Fig.
3.10. We implemented most of the algorithm on GPU and it can run in less than 10 seconds for one set of 5 images on an NVIDIA GeForce GTX 580.

3.5.3 Registration results on longitudinal data

Longitudinal data is data that is gathered at several points in time. In our dataset, some AF images were captured after 6, 12 and 18 months to monitor the growth of the disease. During this period of time, the disease evolves but the vessel pattern remains only marginally changed. Thanks to the Chamfer distance as a similarity measure, our approach is robust to these small differences in the line patterns. Therefore, even though our approach was designed for multimodal registration, it is also applicable to longitudinal data.

3.6 Summary and discussion

We have presented a general framework for multimodal registration of multiple images and applied it to the case of retinal registration. The algorithm uses a tensor voting framework to extract line features. These lines serve as a tool to measure the quality of the registration thanks to the robust Oriented Chamfer Distance. The junctions and extremities of the lines are used as robust cross-modality point features for a first estimate of the registration. The transformation is then refined with ICP and scaled-ICP methods. Recovery from pairwise registration failure is performed via a chained registration. Finally, we use thin-plate splines to account for the non-planarity of the retinas. This algorithm performs well with OCT projection images, even when they show a limited number of features. Also, it is generic enough to register multiple modality images, whereas most algorithms focus on two. The module is currently deployed at the UCLA Doheny Eye Institute, where large amounts of retinal images are being processed.

A current limitation of our algorithm is that some macular OCT projection images do not have enough vessels or other line features. Also, some do not have any junctions or extremities.
For these cases, using point features to estimate an initial transformation would be beneficial. The experiments in this chapter show that salient lines are robust structural features that can be used whenever feature points are not reliable enough to produce a correct registration. We focused on a specific near-flat structure in medical imaging. In the next chapters, we will deal with structures that exhibit a strong 3D structure and suffer from the same issue of poor point correspondences.

Figure 3.9: Overlap of the registered OCT projection image to Color Fundus, Infra-Red, Red-Free and AutoFluorescence images for several eyes.

Figure 3.10: Registration results on longitudinal data. From left to right: 1st visit, after 6 months, 12 months and 18 months.

Chapter 4
3D Face Reconstruction

4.1 Introduction

4.1.1 Problem statement

The high demand for effective biometric systems has led to the emergence of face recognition in the wild as an important research topic. Although the technologies involved in this problem have improved greatly over the past few years, existing 2D-based techniques still struggle with low-resolution images and in unconstrained environments. These challenges are often due to appearance variations caused by different poses, illuminations and expressions (PIE). In the past, some have suggested that 3D sensing technologies could provide a means of mitigating pose and illumination problems, by using 3D models in the face recognition process [26]. Although range sensors and laser scanners have proven very effective in reconstructing 3D faces [57], these sensors only work at close range, indoors or in special set-ups [3]. Regular cameras, however, are easily accessible and widely used in surveillance systems. As a result, developing 3D reconstruction methods from 2D video streams is a very appealing prospect.

4.1.2 Challenges

Inferring an accurate 3D face model from a low-resolution video (see footnote 1) is a very challenging problem.
Despite the success of SfM (Structure from Motion) in many practical applications (e.g., 3D modeling from a large photo collection), it is difficult to apply SfM directly to model 3D faces from videos. First of all, estimating the camera motion from a moving face is difficult due to the non-rigidity of the face (caused by facial expressions), unstable camera estimation from small head motions, and limited corresponding points (e.g., texture-less facial areas). Even with known cameras, state-of-the-art multi-view stereo techniques (e.g., plane sweep [69]) do not provide faithful reconstructions when the videos are low resolution, due to the absence of accurate dense correspondences. Consequently, SfM-based approaches for 3D face modeling typically use high-resolution images [94, 86].

Many successful approaches for 3D face reconstruction use face models (a generic model [32, 55] or statistical representations [18]) as prior knowledge about the structure of a face. A single generic model can be deformed to best fit the facial image, or the coefficients of statistical models can be found by an optimization process. In general, the use of a face prior allows the reduction of the computational complexity or the inference of uncertain face shape, but limits the accuracy of reconstructed 3D shapes. Furthermore, applying a prior-based modeling approach (e.g., 3D Morphable Models [18]) to videos is not trivial because each camera pose is estimated independently and consistency of the texture information across frames is not fully guaranteed.

Footnote 1: In this work, we assume that the video resolution, though low, still allows for face detection by conventional methods [143], yet only few feature descriptors [90] can be matched between consecutive frames.

Figure 4.1: Overview of our approach. (a) For an input video, (b) we track facial landmarks and use them to estimate initial camera matrices and 3DMM parameters.
(c) Our iterative optimization refines the shape parameters, along with per-frame camera matrices and expressions, seeking ones which maximize photometric consistency between frames. Our output (d).

4.1.3 Contributions

To gain the best of both worlds, we combine these two approaches in our proposed prior-constrained structure from motion (PCSfM). We use a statistical shape representation, a 3DMM augmented with blendshapes, as a shape prior to disambiguate photometric matches between frames. This is done by searching around the face prior for a shape which provides photometric consistency across the input video frames. We show how this can be cast as an optimization problem over the face shape, per-frame camera matrices and facial expression coefficients of the 3DMM representation, and solved using a stochastic Gauss-Newton optimization. By using the GPU, this optimization is shown to run extremely fast. We test the accuracy of our approach on challenging videos from the MICC collection [9] and show that our estimated 3D face shapes are much better approximations of the ground-truth 3D shapes than those estimated by state-of-the-art methods. An overview of our approach is presented in Fig. 4.1. We compare our method with state-of-the-art SfM and 3DMM on the MICC database containing ground-truth 3D shapes, and show that it provides better results with reliable identity information (Fig. 4.2 and 4.12).

Figure 4.2: Face reconstruction from an outdoor video in the MICC dataset [9]. Top row: Three examples of the frames used for reconstruction. Low-resolution faces have an average of only 30 pixels inter-ocular distance. Bottom row: (a) Close-up view of the same subject (not used for reconstruction). 3D estimates by (b) Structure from Motion (SfM) [69]; (c) Single-view 3D Morphable Model fitting (3DMM) [5]; (d) our proposed PCSfM. SfM estimates are very poor in low resolution. 3DMM cannot guarantee correct face shapes, evident by comparing with (a).
PCSfM combines both to produce an accurate 3D shape.

Our contributions are as follows:
- We propose a novel joint structure-from-motion method with a shape prior, which allows us to establish global correspondences across all low-resolution frames in a video.
- We show state-of-the-art 3D face reconstruction results on low-resolution face videos.
- We leverage an existing 3D face tracking method to provide state-of-the-art tracking results on low-resolution face videos.

The remainder of the chapter is organized as follows. Section 4.2 summarizes the related work. Our approach is presented in Section 4.3, and Section 4.5 shows the experimental results. Section 4.6 concludes the chapter.

4.2 Related Work

Structure from Motion. Methods designed for SfM can be used for face shape reconstruction and have indeed been applied to this problem before (e.g., [48, 86, 43]). As we show in Fig. 4.2 (b), Fig. 4.3 and quantitatively in Sec. 4.5, these methods are prone to errors when applied to the low-resolution videos typical of unconstrained settings. Our work bears relation to classical rank-constraint SfM methods (e.g., [19, 111, 125] and many others). Like us, they assume that shapes can be modeled using low-dimensional subspaces. Contrary to us, their subspaces are not used as priors. Instead, they compute correspondences between frames and then fit them with shapes and non-rigid motion parameters under various subspace constraints. We turn this process around and use a direct method, iteratively refining a shape prior to guide the search for correspondences between frames. Hence, unlike them, we do not require accurate correspondence estimation, which is hard to ensure in our low-resolution videos.

Other multiple-image methods. Some recent works propose reconstructing faces appearing in multiple images from unconstrained, heterogeneous sources (e.g., the Internet) [83].
Although these methods produce beautiful surface reconstructions, our settings assume a single input video sequence where all frames were captured under the same (possibly very challenging) viewing conditions. Other methods track facial landmarks throughout a video and fit them with face shapes. Three such recent methods are [64, 116, 23]. Finally, the multi-view stereo approach of [5] is related to our own. They also use a 3DMM as a prior for 3D face shape estimation. Their method was designed for tightly controlled viewing conditions with high-resolution images taken instantaneously, without motion, illumination or expression changes.

Single view. Monocular 3D reconstruction is an ill-posed problem. A popular approach for solving it uses statistical shape representations such as the popular 3DMM [18, 107, 139, 149]. 3DMMs capture prior knowledge of face shapes and sometimes also texture, expressions and more. More on these representations in Sec. 4.3.2. Others instead make strong assumptions on the scene and image being reconstructed. These methods include shape-from-shading methods such as [71, 133], or methods which rely on facial symmetry [37]. Although these methods were shown to produce highly detailed surface reconstructions, in order to do so they make assumptions on the scene lighting, the texture of the face and more, and so it is hard to apply them in the unconstrained settings considered here. Example-based methods such as the work of [56, 55] modify the 3D surface of a generic face shape in order to fit it to the facial features of a face appearing in the image. Though this approach is extremely robust, these methods were primarily used for new view synthesis of faces and are not designed to provide detailed face reconstructions. Finally, deep learning was also applied to this problem in [158].
Though this method estimates a 3D surface to match the appearance of the face, it primarily focuses on estimating the 2D locations of a set of facial landmarks, and makes no claims on the accuracy of the output shape. These methods were mostly designed for plausible face reconstruction. Few were therefore used to accurately approximate the ground-truth 3D facial shapes or applied to unconstrained footage.

Figure 4.3: SfM alone is not enough. Reconstruction results on two controlled videos from the MICC dataset [9]. (a) Example input frames; (b) Ground-truth 3D shape; (c) SfM 3D reconstruction of [69] showing severe errors due to the ambiguity of feature matching in these scenes; (d) Our result.

4.3 Our approach

We assume that the input is a video showing a face changing pose over time due to multiple out-of-plane rotations of the head and/or camera motion. Reconstruction is performed using an SfM-based approach in an effort to produce 3D shapes which are true to the real face shapes. SfM methods require sufficiently high-resolution images of the face, a condition which often does not hold in real-world videos. When faces are viewed in low resolution, matching pixels across frames is error-prone and so is the output shape (see, e.g., Fig. 4.2 (b)). In fact, even in controlled settings, homogeneous face regions cause ambiguous matches which severely degrade the quality of the reconstructed 3D shapes (Fig. 4.3). To mitigate these problems, we propose to leverage knowledge of the space of faces by using a statistical shape prior, here a 3DMM face representation. This prior serves several important purposes: (1) it provides a convenient initial estimate from which to begin searching for the true shape using SfM; (2) it provides a strong global constraint on the estimated 3D shape, ensuring that it is indeed a face shape; and (3) when the video does not contain sufficient 3D head motion to perform SfM, our method naturally reduces to single-view 3DMM fitting.
Our approach is illustrated in Fig. 4.1. Given an input video (a), we begin by applying an off-the-shelf facial landmark tracker [10] (b). Unlike others (e.g., [64, 116, 23]), we only use the tracked 2D facial landmarks for initializing per-frame camera matrices (described in Sec. 4.3.1). We next fit a 3DMM prior to the tracked facial landmarks (Sec. 4.3.2). Our optimization, (c), then modifies the surface of the prior, its expressions and the camera matrices to improve photometric consistency across the entire video (Sec. 4.3.3). These steps are described below.

4.3.1 Initializing per-frame poses

The facial landmark tracker [10] provides n 2D landmark locations, u_{j,i} \in R^2, j \in 1..n, in each frame F_i. For each detected landmark u_{j,i}, we assume a corresponding landmark, X_j \in R^3, specified once on the 3D surface of a generic 3D face shape S \subset R^3. Here, j indexes the same facial landmark in F_i and S. Given the correspondences u_{j,i} \leftrightarrow X_j, we use PnP [54] to estimate the camera parameters for frame F_i. In our experiments, S is the mean shape of the 3DMM representation. Let K be the intrinsic camera matrix, fixed for all frames, and R^0_i and t^0_i the rotation and translation matrices in the generic 3D model's coordinate frame (the superscript 0 indicates the iteration number). We thus obtain a perspective camera model for frame F_i, such that for any point X_t \in S,

\tilde{g}_{t,i} \simeq [R^0_i \; t^0_i] \, \tilde{X}_t, \qquad \tilde{u}_{t,i} \simeq K \, \tilde{g}_{t,i},    (4.1)

where g_{t,i} \in R^3 is the 3D point X_t after camera rotation and translation, u_{t,i} \in R^2 its projection onto F_i, and \tilde{X}_t, \tilde{g}_{t,i}, \tilde{u}_{t,i} their homogeneous representations. An initial estimate for the camera matrix of frame i is therefore C^0_i = K [R^0_i \; t^0_i]. Based on these initial camera poses, we sample the frames to ensure that consecutive frames contain motion, i.e., that their estimated poses are different. This is performed in order to avoid processing frames which do not introduce new 3D information.
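The camera model of Eq. (4.1) is an ordinary pinhole projection; a minimal sketch (array shapes and names are ours):

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D points X (T x 3) into a frame with intrinsics K,
    rotation R and translation t, following Eq. (4.1)."""
    g = X @ R.T + t            # camera-space points g_{t,i}
    u_h = g @ K.T              # homogeneous pixel coordinates
    return u_h[:, :2] / u_h[:, 2:3]
```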
We use an empirically determined threshold to enforce yaw angle differences of at least 3 degrees. As a result, around 30-50 frames are typically selected from our videos. We note that these initial estimates can be, and often are, noisy. The source of noise is typically either landmark localization mistakes or the use of a single generic model S, which may not be suitable for estimating the pose of the face in the image. Our optimization (Sec. 4.3.3) is designed to reduce these errors by maximizing photometric consistency across frames while modifying the estimates of the 3D shape and the per-frame expression and extrinsic camera parameters.

4.3.2 Initializing the shape prior

Face representation. We represent faces using the 3DMM extension proposed in [22]. It captures knowledge of the space of face shapes with a multi-linear model which contains parameters for facial shapes and expressions. Given this knowledge, a single 3D face is represented by:

X = \bar{X} + P_s \alpha_s + P_e \alpha_e = \Phi(\alpha_s, \alpha_e),    (4.2)

where the vectors X, \bar{X} \in R^{3T} contain a stack of T vertex coordinates in the form [x_1, y_1, z_1, x_2, y_2, z_2, \ldots, x_T, y_T, z_T]^T. \alpha_s \in R^{S_s} and \alpha_e \in R^{S_e} are two vectors representing, respectively, the shape and the expression of the face, with S_s and S_e the number of parameters used to represent each one. \bar{X} \in R^{3T} represents the average face model. P_s \in R^{3T \times S_s} contains the principal components learned from aligned 3D face scans of different people with neutral faces. Similarly, P_e \in R^{3T \times S_e} contains the principal components learned from 3D face scans with different expressions. We use the Basel Face Model [107] for the shape knowledge P_s and FaceWarehouse [22] for the expression knowledge P_e. The mean face \bar{X} is taken to be the average of the two mean faces from these two sources.

Shape initialization. The prior itself is estimated by fitting the 3DMM representation to the landmarks detected in Sec. 4.3.1 across all the video frames.
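Eq. (4.2) is a plain linear model in the coefficients; a toy-dimension sketch (all names and sizes are ours):

```python
import numpy as np

def morphable_shape(mean, P_s, P_e, alpha_s, alpha_e):
    """Evaluate the multi-linear face model of Eq. (4.2):
    X = mean + P_s @ alpha_s + P_e @ alpha_e, a stacked (3T,) vector."""
    return mean + P_s @ alpha_s + P_e @ alpha_e
```

With zero coefficients the model returns the mean face; in the real system T is in the tens of thousands while S_s and S_e are a few dozen.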
We use the closed-form solution proposed by [40] for this purpose. The result is a parameterization of an initial face shape estimate: \alpha^0_s for the face shape and \alpha^0_{e,i}, the per-frame expression coefficients. Of course, other heuristics could be used to initialize the model, such as running standard 3DMM fitting [112] on a frontal view, but the one used here was sufficient for our purposes.

4.3.3 Prior-constrained SfM optimization

We use the 3DMM face prior estimated in Sec. 4.3.2 to disambiguate the correspondences between pixels in the video frames. Our method begins searching for the final 3D shape from the estimated prior and then moves its vertices in 3D, searching for positions which maximize photometric consistency. The prior acts as a global constraint on the estimated shape. In particular, our optimization seeks a single shape vector \alpha_s for the entire video, and expression vectors \alpha_{e,i} and camera matrices C_i for each of the video frames.

Energy formulation. We represent the modeling problem as the minimization of the following energy function:

E = w_{photo} E_{photo} + w_{reg} E_{reg}.    (4.3)

The objective function maximizes photometric consistency in E_photo while being constrained by E_reg to remain in the space of plausible faces. The weights w_photo and w_reg relatively scale the two objectives and are empirically set to w_photo = 1, w_reg = 0.1 in our experiments.

Photometric consistency metric. To compute photometric consistency for a 3D vertex X_t, we project it, after accounting for the per-frame expression to obtain X_{t,i}, onto every frame and compute local image agreement around these projections across different frames. Let \Gamma_{t,i} represent local information in the i-th image around the projection u_{t,i} of X_{t,i}. Note that \Gamma_{t,i} can be any local representation, including image intensity, an intensity patch or a feature descriptor. In our tests we simply used a bilinear interpolation of the 3 x 3 pixel intensity patches around u_{t,i}.
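This bilinearly interpolated patch sampling can be sketched as follows (a minimal version without image-boundary handling; the names are ours):

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly interpolated intensity of `img` at real-valued (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] +
            (1 - dy) * dx * img[y0, x0 + 1] +
            dy * (1 - dx) * img[y0 + 1, x0] +
            dy * dx * img[y0 + 1, x0 + 1])

def patch(img, u, size=3):
    """size x size intensity patch around a (sub-pixel) projection u = (x, y)."""
    r = size // 2
    x, y = u
    return np.array([[bilinear(img, y + j, x + i)
                      for i in range(-r, r + 1)]
                     for j in range(-r, r + 1)])
```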
We propose representing the problem of measuring the agreement of these projections between different video frames by the following energy function:

E_{photo} = \sum_t \sum_i v^i_t \, v^{i+1}_t \, | \Gamma_{t,i}(u_{t,i}) - \Gamma_{t,i+1}(u_{t,i+1}) |^2,    (4.4)

where v^i_t states whether the t-th vertex is visible in image i: it equals 1 if the vertex is visible and 0 otherwise. Vertex visibility is computed using standard Z-buffering [25, 65]. In our implementation, we precompute the visibility based on the initial prior model, as this process can be computationally expensive but visibility does not change much during optimization. Moreover, we compare patches from consecutive frames only. This is done so that the scene in the two frames being compared does not change much in terms of camera pose or illumination conditions.

Regularization constraints. We add a classical regularization term to the energy function to prevent degeneration of the geometry. If we assume a Gaussian distribution for the parameters \alpha_s, \alpha_{e,i}, the interval [-3\sigma, 3\sigma] should contain 99% of the variation in human faces. Therefore, we use the following regularization term on both the shape and the expressions:

E_{reg} = \sum_{q=1}^{S_s} \left( \frac{\alpha_{s,q}}{\sigma_{s,q}} \right)^2 + \sum_{i=1}^{N} \sum_{q=1}^{S_e} \left( \frac{\alpha_{e,i,q}}{\sigma_{e,q}} \right)^2.    (4.5)

Discussion: The difference between our approach and standard 3DMM fitting. At this point, it is worth noting a key difference between this formulation and the one used by standard 3DMM fitting methods. Previous methods use a learned texture representation to estimate shape, texture and illumination parameters which produce a rendered model similar to the input face image. This approach can fail whenever the input texture differs from those used for training (e.g., when the input face is wearing heavy make-up, glasses, etc., or in previously unseen illumination conditions). Our approach does not attempt to match a rendered, textured 3DMM to the input faces.
Rather, it uses the 3DMM to evaluate correspondences, relying purely on photometric consistency between frames in the same video.

Optimization process. The energy function E_photo is a non-linear least-squares objective that can be minimized using iterative methods such as Gauss-Newton optimization. Let us denote by r the vector of residuals r_{t,i}:

r_{t,i} = v^i_t \, v^{i+1}_t \left( \Gamma_{t,i}(u_{t,i}) - \Gamma_{t,i+1}(u_{t,i+1}) \right).    (4.6)

Computing the projections u_{t,i} involves obtaining the expression-compensated 3D point cloud X_i from the global shape \alpha_s and the per-frame expression parameters \alpha_{e,i} using Eq. (4.2), and then projecting each of its points onto the frames using Eq. (4.1) and the per-frame camera matrices C_i. Let \delta_i = [a_i \; b_i \; c_i \; tx_i \; ty_i \; tz_i], where [a_i \; b_i \; c_i] represents the camera angular velocity and [tx_i \; ty_i \; tz_i] its translation, obtained as an incremental transformation relative to the current matrices R^k_i and t^k_i of frame F_i (Sec. 4.3.1). Our goal can now be defined as estimating the vector:

Q = [\alpha_s, \alpha_{e,1}, \delta_1, \alpha_{e,2}, \delta_2, \ldots, \alpha_{e,N}, \delta_N]^T,    (4.7)

where N is the number of frames sampled from the video. Initial estimates for these values, Q^0, were computed as described in Sec. 4.3.1 and 4.3.2. We now iteratively refine our estimate in order to minimize Eq. (4.4). Specifically, each successive iteration k updates the parameter vector by

Q^{k+1} = Q^k + \Delta Q \quad \text{with} \quad J^T J \, \Delta Q = -J^T r.    (4.8)

At each iteration, we compute the Jacobian matrix J and the residual vector r with the new parameter estimates. We solve Eq. (4.8) with a Preconditioned Conjugate Gradient (PCG) algorithm [11] using a Jacobi preconditioner. To reduce computation time, we use a stochastic version of the Gauss-Newton solver in which we use a set of 40 vertices sampled from the model at each step. This enables us to accelerate computation while escaping local minima.

Figure 4.4: Non-zero structure of the Jacobian matrix for E_photo with four frames.
S_i, E_{j,i}, P_{j,i} are, respectively, the Jacobians for the shape, the j-th frame expression and the pose, estimated from the pair of frames i, i+1.

Derivation of the Jacobian. We compute the Jacobian matrix analytically. The derivative of r_{t,i} relative to a given parameter is given by the chain rule as follows (dropping subscripts for u_{t,i} and g_{t,i} of Eq. (4.1) to simplify notation):

\nabla r_{t,i}(\alpha_s)|_{Q=Q^k} = \nabla \Gamma_{t,i}(u) \, J_u(g_i) \, J_{g_i}(\Phi) \, J_\Phi(\alpha_s)|_{Q=Q^k} - \nabla \Gamma_{t,i+1}(u) \, J_u(g_{i+1}) \, J_{g_{i+1}}(\Phi) \, J_\Phi(\alpha_s)|_{Q=Q^k}.    (4.9)

If \Gamma_{t,i}(u) refers to the intensity around the pixel u, then \nabla \Gamma_{t,i} is the gradient of frame i, obtained by applying a Scharr kernel to the grayscale image. J_u(g_i) is the Jacobian of u, which can be derived from Eq. (4.1), and J_{g_i}(\Phi) and J_\Phi(\alpha_s) from Eq. (4.1) and (4.2). The Jacobians for the other parameters can be similarly derived, giving us:

\nabla r_{t,i}(\alpha_{e,i})|_{Q=Q^k} = \nabla \Gamma_{t,i}(u) \, J_u(g_i) \, J_{g_i}(\Phi) \, J_\Phi(\alpha_e)|_{Q=Q^k}
\nabla r_{t,i}(\delta_i)|_{Q=Q^k} = \nabla \Gamma_{t,i}(u) \, J_u(g_i) \, J_{g_i}(\delta_i)|_{Q=Q^k}.    (4.10)

If we write g_i = [g_x \; g_y \; g_z]^T, the Jacobian of the projection onto frame i is:

J_u(g_i) = \begin{bmatrix} \frac{f}{g_z} & 0 & -\frac{f g_x}{g_z^2} \\ 0 & \frac{f}{g_z} & -\frac{f g_y}{g_z^2} \end{bmatrix}    (4.11)

The other intermediate matrices can be computed as follows:

J_{g_i}(\Phi) = R_i    (4.12)
J_\Phi(\alpha_s) = P_s    (4.13)
J_\Phi(\alpha_e) = P_e    (4.14)

To simplify the derivation of J_{g_i}(\delta_i), we locally linearize the transformation updates by assuming small displacements:

M_i = \begin{bmatrix} 1 & -c_i & b_i & tx_i \\ c_i & 1 & -a_i & ty_i \\ -b_i & a_i & 1 & tz_i \end{bmatrix} M^k_i,    (4.15)

where M_i = [R_i \; t_i] is the camera extrinsic matrix (Sec. 4.3.1). J_{g_i}(\delta_i) can be computed from Eq. (4.15) and (4.1) as:

J_{g_i}(\delta_i) = \begin{bmatrix} 0 & z & -y & 1 & 0 & 0 \\ -z & 0 & x & 0 & 1 & 0 \\ y & -x & 0 & 0 & 0 & 1 \end{bmatrix}    (4.16)

Finally, the Jacobian entries corresponding to the regularization term can be computed from Eq. (4.5) directly.

4.4 GPU implementation

We implement an efficient version of our solver on the graphics processing unit (GPU) using CUDA 8.0.
Doing so, we leverage the observation that many of the computations performed by our method are independent from each other, making it well suited for GPU processing. Given a set of parameters $Q$, all the $T \cdot N$ expression-corrected 3D vertices and their 2D projections can be computed on different GPU threads. Similarly, most of the Jacobian entries $\nabla r_{t,i}$ can be computed independently on separate threads.

Gauss-Newton solver. The Jacobian is large: there are $(S_s + (S_e + 6)N)$ parameters and $(N-1)T$ residuals; but it is sparse (see Fig. 4.4), since only consecutive frames impact the energy function. To take advantage of the sparsity, we implement our own GPU version of the sparse matrix multiplications $Jh$ and $J^T h$. These two products are enough to run the PCG algorithm efficiently on the GPU, as described in [144]. As in [136], we avoid computing the large product $J^T J$ and split the computation into two successive products to save computation time.

4.5 Results

We applied our method to challenging unconstrained videos downloaded from the web and to old Hollywood movies. These qualitative results are provided in Fig. 4.5, along with our final pose estimates, illustrated by overlaying our estimated shape and pose over an example input frame. Fig. 4.6 shows an example of the evolution from the initial state to the final output. We next test our method in order to quantitatively evaluate its accuracy.

Figure 4.5: Qualitative results on web videos (left column) and old movies (right column). Left to right: example frame, our PCSfM 3D estimate overlaid on the frame and rendered separately. Note the error in the chin on the bottom row, due to the beard not being modeled by the 3DMM parameters.

4.5.1 Quantitative results

The MICC data set. We test the accuracy of our proposed method on videos from the University of Florence MICC data set [9]. It contains videos of cooperative and uncooperative subjects

Figure 4.6: Qualitative results for the evolution of the estimated parameters.
Top row: initial set of shape/pose/expression. Bottom row: final set of parameters.

filmed indoors, as well as uncooperative subjects filmed in unconstrained outdoor settings. Some example frames from these videos are provided in Fig. 4.7.

Figure 4.7: Sample frames from three MICC videos [9]. From left to right: indoor cooperative and non-cooperative subjects, outdoor non-cooperative subject along with a zoomed-in view.

Importantly, accurate structured-light 3D scans for all 53 subjects in this collection are also available. To our knowledge, it is therefore the largest collection of face videos and ground truth 3D face shapes. We use this collection to measure the accuracy of 3D face shape estimates, comparing our own method with previous work.

Reconstructing faces from the videos in this set is very challenging. Face resolution, defined by the distance between the two eyes (inter-ocular distance), is often as low as 50 pixels for the cooperative indoor videos, and 30 pixels for videos taken outdoors. Although such low-resolution conditions are successfully handled by face recognition methods (e.g., on the YouTube Faces set [146]), few 3D face estimation methods were ever applied to such challenging videos.

Evaluation criteria. We follow the work of [55, 119] and others by evaluating 3D shape similarity using multiple distance measures. In practice, we measure the distance between the 3D face shapes estimated using our method and its baselines and the ground truth 3D shapes, as follows. First, all estimated 3D shapes are globally aligned with a generic face shape using the standard iterative closest point (ICP) method [28, 17] to cancel out any differences in global pose. We consider only the facial region and not the entire head. The same facial region is extracted from all shapes by using only the 3D vertices inside a 3D sphere of radius 80mm centered on the tip of the nose.
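The facial-region extraction used for evaluation can be sketched as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def crop_face_region(vertices, nose_tip, radius=80.0):
    """Keep only the vertices inside a sphere of the given radius (in mm)
    centered on the nose tip, as done before computing the error measures."""
    d = np.linalg.norm(vertices - nose_tip, axis=1)
    return vertices[d <= radius]

# Toy check: points at 0, 50, 100, 150 mm from the nose tip
nose = np.zeros(3)
pts = np.array([[0.0, 0, 0], [50, 0, 0], [100, 0, 0], [150, 0, 0]])
assert len(crop_face_region(pts, nose)) == 2
```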
We project the 3D models $X$ and the ground truth models $X^*$ onto a frontal view, then compare the depth values $D_Q$ with the ground truth depth values $D_{Q^*}$. For each method, we then compute the following distance measures:

Relative error (REL): $|D_Q - D_{Q^*}| / |D_{Q^*}|$

Root Mean Square Error (RMSE): $\sqrt{\sum_i (D_{Q_i} - D_{Q^*_i})^2 / N}$

log10 error: $|\log_{10}(D_Q) - \log_{10}(D_{Q^*})|$

3D Root Mean Square Error (3DRMSE): $\sqrt{\sum_i (X - X^*)^2 / N}$

Baseline methods. We compare the accuracy of the following methods, selected for being, for the most part, very recent and considered state of the art: (1) the mean Basel face shape [107], unmodified, (2) classical 3DMM fitting [112], (3) the flow-based method of [55], (4) the recent CNN-based fitting of [158], (5) shape regression [116] (an extension of [23]), (6) multi-view fitting

Figure 4.8: Distances to ground truth per vertex on MICC indoor videos, averaged over all videos, for 3DMM [112], flow-based [55], CNN-based fitting [157], the regression method [116], multi-view landmark-based fitting [64], and our PCSfM, showing far fewer errors.

Approach | REL (×10⁻⁴) | RMSE | log10 (×10⁻⁴) | 3DRMSE
Mean face [107] | 109 (30) | 5.68 (1.47) | 47 (13) | 3.23 (1.03)
Classical 3DMM fitting [112] (2005) | 68 (17) | 3.75 (0.84) | 31 (7) | 2.06 (0.56)
Flow-based method [55] (2013) | 77 (21) | 4.18 (1.09) | 34 (9) | 2.16 (0.59)
CNN-based 3DMM fitting [158] (2016) | 68 (15) | 3.71 (0.83) | 30 (7) | 2.02 (0.53)
Shape Regression [116] (2016) | 76 (18) | 4.83 (0.99) | 33 (8) | 2.25 (0.36)
Multi-view fitting on landmarks [64] (2016) | 68 (23) | 3.91 (0.79) | 30 (5) | 1.99 (0.41)
Structure from Motion [123] (2016) | 917 (1012) | 27.3 (22.70) | 836 (1057) | 11.30 (8.74)
Us, 3D | 72 (21) | 3.80 (0.99) | 31 (9) | 2.09 (0.56)
Us, 3D+pose+expr. | 62 (18) | 3.66 (0.91) | 30 (5) | 1.92 (0.39)

Table 4.1: Quantitative results on the MICC dataset [9]. Four mean distance measures (and standard deviations). Our PCSfM is evaluated optimizing for shape alone (Us, 3D) and for all 3DMM parameters (Us, 3D+pose+expr.). Lower values are better; bold marks the best score.
on landmarks [64], (7) the recent structure-from-motion method of [69], applied to the regions in and around tracked facial landmarks (not the entire frames), and (8) our PCSfM. The Basel face is provided in order to show the consequence of not attempting in any way to fit the 3D shape to the video. We used our own implementation of (2); it is a simplified version of the original 3DMM method [112] in which we do not use the part-based optimization, which was excluded due to the extensive run time it requires. For all other methods, we used the code originally developed by the authors.

Figure 4.9: Example failure due to landmark tracking errors. Left to right: frame with landmarks overlaid, estimated 3D, ground truth.

Summary of 3D accuracy tests. Some qualitative examples of these results are available in Fig. 4.12. Table 4.1 additionally provides the mean distances (and standard deviations) between the estimated 3D shapes and their ground truths. MICC 3D data is provided in real-world coordinates, so all distances are in millimeters. Note that smooth areas tend to drive the scores down even when large errors exist. Fig. 4.8 provides the distances per 3D vertex from estimated shapes to ground truths, averaged over the entire MICC data set, for all our baseline methods. Clearly, our proposed method outperforms the other baselines. Note that without updating pose and expression, the proposed method wrongly estimates the 3D shape because the error in the poses is large; refining poses and expressions enables us to correctly update the 3D shape. Unsurprisingly, SfM alone produces very poor results, evident also in Fig. 4.2 and Fig. 4.3. Apart from this method, all others improve over simply taking the conservative, fixed generic shape as the 3D shape estimate. Fig. 4.12 shows the rendered 3D surfaces obtained by the various available methods, as well as example input frames and the ground truth 3D shape for the subjects in these videos.
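The distance measures used in this evaluation (REL, RMSE, log10, 3DRMSE) can be sketched as follows; a minimal NumPy sketch assuming pre-aligned shapes and valid (non-background) depth pixels:

```python
import numpy as np

def depth_errors(D_est, D_gt):
    """Depth-map error measures: REL, RMSE, and log10 error.
    D_est, D_gt: depth maps from frontal projections (same shape, valid pixels)."""
    d, g = D_est.ravel(), D_gt.ravel()
    rel = np.mean(np.abs(d - g) / np.abs(g))              # relative error
    rmse = np.sqrt(np.mean((d - g) ** 2))                 # root mean square error
    log10 = np.mean(np.abs(np.log10(d) - np.log10(g)))    # log10 error
    return rel, rmse, log10

def rmse_3d(X_est, X_gt):
    """3DRMSE between corresponding 3D vertices (after ICP alignment)."""
    return np.sqrt(np.mean(np.sum((X_est - X_gt) ** 2, axis=1)))

# Tiny sanity check on synthetic depths (values in mm, as on MICC)
D_gt = np.full((4, 4), 100.0)
D_est = D_gt + 2.0
rel, rmse, log10 = depth_errors(D_est, D_gt)
assert abs(rel - 0.02) < 1e-9 and abs(rmse - 2.0) < 1e-9
```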
The figure additionally illustrates 3D estimation errors using heat maps. Here, again, our method clearly makes substantially smaller errors. A failure due to a large landmark tracking error is further provided in Fig. 4.9.

4.5.2 Runtime comparison

We compared the runtime of all the methods we tested on the MICC data set. Runtime was measured on an Intel Core i7 at 2.60GHz with an NVIDIA GeForce GTX 960M GPU. Performance is averaged over all the cooperative MICC videos, using a constant 50 frames per video. Table 4.2 reports the measured runtimes. We ran the full optimization for our PCSfM result (optimizing for 3D, pose and expression). Our CPU version is only slightly slower than some of the fastest (yet less accurate) baselines; on the GPU, PCSfM is the fastest by a wide margin.

4.5.3 Analyzing the effect of video quality on reconstruction

Motion. By taking an SfM-based approach, we assume that we have sufficiently many views of the head to provide 3D information. That is, the more out-of-plane rotations of the head (or the more rotations of the camera around the head), the more accurate we expect our results to be. We evaluate this by comparing the 3D reconstruction accuracy with the amount of motion in the video. We quantify motion as the maximum yaw angle difference between any pair of frames in the video (the angle between the most extreme yaw angles). Fig. 4.10 provides these results, quantizing the videos into four groups according to their motion. Evidently, the larger the variation in scene viewpoints, the more accurate our estimation becomes. The figure also provides the accuracy of single-image 3DMM fitting [112], which, of course, is unaffected by pose variations. In the videos with the smallest viewpoint variations, our errors approach those of single-view 3DMM.
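The motion score used here (maximum yaw difference between any two frames) can be sketched as follows; the Euler-angle convention $R = R_z R_y R_x$ is our assumption, matching the conventions used elsewhere in this document:

```python
import numpy as np

def Rx(t): c, s = np.cos(t), np.sin(t); return np.array([[1,0,0],[0,c,-s],[0,s,c]])
def Ry(t): c, s = np.cos(t), np.sin(t); return np.array([[c,0,s],[0,1,0],[-s,0,c]])
def Rz(t): c, s = np.cos(t), np.sin(t); return np.array([[c,-s,0],[s,c,0],[0,0,1]])

def yaw_from_R(R):
    """Yaw (head rotation about the vertical axis), assuming R = Rz @ Ry @ Rx:
    the bottom-left entry of such a matrix is -sin(yaw)."""
    return np.degrees(np.arcsin(-R[2, 0]))

def motion_score(rotations):
    """Maximum yaw difference between any two frames of the video."""
    yaws = [yaw_from_R(R) for R in rotations]
    return max(yaws) - min(yaws)

# Sanity check: a pure 30-degree yaw is recovered exactly
R = Rz(np.radians(10)) @ Ry(np.radians(30)) @ Rx(np.radians(5))
assert abs(yaw_from_R(R) - 30.0) < 1e-9
```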
Approach | Runtime
3DMM [112] | 67min 11s
Flow-based [55] | 53min 15s
CNN-based fitting [158] | 31.9s
Fitting to landmarks [64] | 18.6s
Structure-from-Motion [123] | 52.6s
Our PCSfM (CPU) | 1min 13s
Our PCSfM (GPU) | 4.6s

Table 4.2: Average runtimes on MICC videos. Single-view methods were applied to all frames separately.

Resolution. Similarly, we analyze the effect of image resolution on reconstruction accuracy. Fig. 4.11 shows that the quality of the reconstruction improves as the resolution of the image increases.

4.6 Conclusion

We propose a method which constrains SfM estimation of face shapes with a statistical prior. This allows us to obtain accurate 3D shape estimates even in low-resolution, low-quality videos. Our method changes the 3D face shape in space, searching for one which maximizes photometric consistency. This search is performed by optimizing over the 3D shape, as well as its per-frame pose and expression. Our results demonstrate that this method provides improved accuracy compared to existing methods and, when implemented on the GPU, far faster processing times than recent relevant baseline methods.

Figure 4.10: PCSfM accuracy vs. motion on MICC. Mean (SD) RMSE for videos with varying head yaw angle ranges. Single-view 3DMM [112] is provided as a baseline. Evidently, the more pose variation, the smaller our errors.

Figure 4.11: PCSfM accuracy vs. image resolution on MICC. Mean (SD) RMSE for videos with different resolutions.

Figure 4.12: Qualitative 3D estimation results on the MICC dataset [9]. From left to right: example input frame and ground truth 3D; 3D estimation results for 3DMM [112], the flow-based method [55], CNN-based fitting [157], shape regression [116], multi-view landmark-based fitting [64], and our PCSfM. For each case, the first row shows heat maps visualizing the Euclidean distances between estimated and ground truth 3D shapes; the second row shows the 3D shape. Rows 1-2: cooperative indoor videos.
Rows 3-4: non-cooperative indoor videos. Row 5: outdoor non-cooperative videos.

Chapter 5

Deep 3D Face Recognition

5.1 Introduction

Face recognition has been an active research topic for many years. It is a challenging problem because the facial appearance and surface of a person can vary greatly due to changes in pose, illumination, make-up, expression, or hard occlusions. Recently, the performance of 2D face recognition systems [105, 124] was boosted significantly with the popularization of deep convolutional neural networks (CNNs). It turns out that recent methods using CNN feature extractors trained on massive datasets outperform conventional methods using hand-crafted feature extractors, such as Local Binary Patterns [1] or Fisher vectors [128]. Deep learning approaches require a large dataset to learn a face representation that is invariant to different factors, such as expressions or poses. Large-scale datasets of 2D face images can easily be obtained from the web: FaceNet [124] uses about 200M face images of 8M independent people as training data, and VGG Face [105] assembled a massive training dataset containing 2.6M face images of 2.7K identities.

With 3D modalities, recent research [77, 81, 129] has focused on finding robust feature points and descriptors based on the geometric information of a 3D face, in a hand-crafted manner. These methods achieve good recognition performance but involve relatively complex algorithmic operations to detect key feature points and descriptors, as compared to end-to-end deep learning models. While some of these methods can perform verification in real time, they often do not scale well to identification tasks, where a probe scan needs to be matched against a large-scale gallery set. Compared to publicly available 2D face databases, 3D scans are hard to acquire, and the number of scans and subjects in public 3D face databases is limited.
According to the survey in [106], the biggest 3D dataset is ND 2006 [41], which contains 13,450 scans of 888 individuals. This is small compared to publicly available labeled 2D face sets, and may not be sufficient to train a deep convolutional neural network from scratch. We propose to leverage existing networks trained for 2D face recognition and fine-tune them with a small number of 3D scans in order to perform 3D-to-3D face matching.

Another challenge intrinsic to recognition tasks comes from the need to minimize intra-class variance (e.g., differences in the same individual under expression variations) while maximizing inter-class variance (differences between persons). For faces, variations in expression impact the 3D structure and can degrade the performance of recognition systems [106]. To address this issue, we propose to augment our 3D face database with synthesized 3D face data that accounts for facial expressions. To augment the training data, we use a multi-linear 3D morphable model in which the shape comes from the Basel Face Model [107] and the expression comes from FaceWarehouse [22].

To pass our 3D data to the 2D-trained CNN, we project the point clouds onto a 2D image plane with an orthographic projection. To make our system robust to small alignment errors, each 3D shape is augmented with rigid transformations (3D rotations and translations) before the projection. Random patches are also added to the 3D data to simulate random occlusions (e.g., facial hair, covering by hands, or artifacts). We fine-tune a deep CNN trained for 2D face recognition, VGG Face [105], with the augmented data. We report performance on standard public 3D databases: Bosphorus [117], BU-3DFE [153], and 3D-TEC [140]. Our contributions are as follows:

1. To our knowledge, this work is the first to use a deep convolutional neural network for 3D face recognition.
We frontalize a 3D scan, generate a 2.5D depth map, extract deep features to represent the 3D surface, and match the feature vector to perform 3D face recognition.

2. We propose a 3D face expression augmentation method that generates a number of person-specific 3D shapes with expression changes from a single raw 3D scan, which allows us to enlarge a limited 3D dataset and improve the performance of 3D face recognition in the presence of expression variations.

3. We have validated our approach on three standard datasets. Our method shows results comparable to state-of-the-art algorithms while enabling efficient 3D matching for large-scale galleries.

An overview of our framework is presented in Figure 5.1. The rest of the chapter is organized as follows: Section 5.2 reviews related work. Section 5.3 describes our proposed method. Our augmentation and performance on the public 3D databases are evaluated in Section 5.4. Section 5.5 concludes the chapter.

Figure 5.1: An overview of the proposed face identification system. In the training phase, we pre-process and augment the input 3D point clouds (pose variations, random patches, generated expressions), convert them to 3×224×224 2D depth maps, and use them to fine-tune VGG Face. In the testing phase, a face representation is extracted from the FC7 layer of the fine-tuned CNN; after feature normalization and a Principal Component Analysis transform, a subject's identity is determined by matching against the gallery features.

5.2 Related work

We review prior work on 3D face recognition, 2D face recognition using deep convolutional neural networks (DCNNs), and the use of CNNs for 3D object recognition.
3D face recognition. An overview of 3D face recognition systems is presented in [106]. Challenges in 3D face recognition come from variations in expression, occlusions, and missing parts. Most recent methods [77, 80, 81] extract hand-crafted features from a 3D face and match them against other features using a metric for verification/identification tasks. Existing methods can be broadly classified into two categories: holistic approaches and local-region based approaches.

In holistic approaches, the whole 3D face is used for recognition. Morphable-model based approaches were introduced in [102]: after fitting a morphable model to a probe scan, the fitted 3D face is passed to the recognition module. As each probe scan needs to be fitted, these approaches are usually time-consuming.

In local-region based approaches, features are extracted from several local sub-regions and local information. Li et al. [81] detect 3D key points where local curvatures are high, and propose three key-point descriptors using the local shape information of the detected key points. A multi-task sparse representation based on the Sparse Representation based Classifier (SRC) is presented in [147]. Lei et al. [81] propose a robust local facial descriptor: after detecting key points based on the Hotelling transform, they build the descriptor by calculating four types of geometric features in a key-point area, and then use a two-phase classification framework with the extracted local descriptors for recognition. Due to the complexity of these algorithms, they suffer from slow feature extraction and/or matching processes, which limits their scalability as compared to end-to-end learning systems.

2D face recognition. Most recent work on 2D face recognition [105, 124] relies on deep learning approaches using massive datasets. In these approaches, deep neural networks learn a robust 2D face representation directly from 2D images of faces.
FaceNet [124] uses a large dataset of 200M faces over 8M identities. Its deep CNN architecture is based on inception modules [134] and uses a triplet loss which, within a set of three images, minimizes the distance between an anchor image and another image of the same identity while maximizing the distance between the anchor image and an image of a different identity. VGG Face [105] proposes a procedure to assemble a large dataset at a small cost and trains VGG-16 nets on the resulting dataset of 2.6M images. Masi et al. [93] augment face images using 3D-rendered images with varying poses, shapes, and closed-mouth expressions, using a 3D generic face. A key limitation of these approaches is that training DCNNs requires a massive and carefully designed dataset.

Convolutional neural nets for 3D objects. 3D object recognition using deep convolutional neural networks is not well developed. First, 3D object recognition still suffers from the lack of a large annotated 3D database. For example, the most commonly used 3D dataset, ModelNet [148], contains only 150K shapes; to put this in perspective, the ImageNet dataset [35], a very large 2D image database, contains tens of millions of annotated images. Second, effective representations for passing 3D objects to a CNN are yet to be determined. Representations for 3D objects can be classified into two categories: model-based methods and view-based methods.

In model-based methods, the whole 3D model is used for feature extraction. Wu et al. [148] use a volumetric representation for 3D shapes: each voxel in a 30×30×30 grid contains a binary value depending on the mesh surface. A 30×30×30 resolution may work for object classification but may not be fine enough to represent facial shapes, and using a voxel resolution high enough to capture fine facial structure variations would require massive amounts of memory. In view-based methods, a set of views of a 3D shape is used for recognition.
The advantage of this class of methods is that it can leverage existing 2D image datasets, such as ImageNet. Su et al. [132] create multiple 2D renderings of a 3D object with different camera positions and angles as training and testing data. They use a CNN pre-trained on ImageNet: a first CNN extracts features from each image and combines them with an element-wise maximum operation, and a second CNN compacts the shape descriptor.

Since there is no prior research on 3D face recognition using a deep convolutional neural network, there is no known representation for 3D facial scans. Indeed, a 30×30×30 volumetric grid [148] may be too coarse to represent a 3D face, and multiple views of a face may not be needed.

5.3 Method

Figure 5.1 shows the proposed face identification process. We represent a 3D facial point cloud with an orthographically projected 2D depth map. We use 2D depth maps to fine-tune VGG Face, which is pre-trained for the task of 2D face recognition. In the training phase, we augment our 3D data to enlarge the size of our dataset and make the CNN robust: a 3D point cloud of a facial scan is augmented with expression and pose variations, and, after converting the 3D point cloud into a 2D depth map, random patches are removed from the depth maps as a further augmentation to simulate hard occlusions. In the testing phase, the fine-tuned CNN is used as a feature extractor. We take the features extracted by the CNN as a 3D face representation and determine a subject's identity by matching this representation against a gallery set.

5.3.1 Preprocessing

3D scans can be provided in different conditions that impact the appearance of the rendered images, notably different poses or large 3D noise coming from the sensor. We wish to minimize these factors when converting the 3D models to 2D maps. First, we align all the facial 3D models together using a classical rigid ICP [24] between each 3D scan and a reference facial model. To initialize the algorithm, we find the nose tip in the 3D point cloud and crop the point cloud within an empirically set radius (100mm). This keeps only the facial region and enables better convergence of the ICP algorithm; the process is similar to [80]. The aligned 3D scan is then projected orthographically onto a 2D image to generate a depth map. The 3D points are scaled by 200/r to create a 200×200 depth map, where r is a
To initialize the algorithm, we find the nose tip in the 3D point cloud, and crop the point cloud within an empirically set radius (100mm). This process keeps only the facial region and enables better convergence of the ICP algorithm. This process is similar to [80]. The aligned 3D scan is then projected orthographically onto a 2D image to generate a depth map. 3D Points are scaled by 200=r to create a 200 200 size depth 2D map, where r is a 73 constant radius value used for face cropping. For a given 3D point [x;y;z] T , the coordinates (x;y) in a 2D depth map may not be integers. We calculate each pixel values using a mean with bilinear interpolation, as described in [58]. A 3D point cloud can contain spikes due to sensor noise. Therefore, we apply a classical median filtering to the depth images to generate the final results. 5.3.2 Augmentation We leverage existing CNNs used for 2D face recognition and fine-tune them for 3D face recog- nition. With a small number of 3D scans, we do not have a lot of variability in training data (e.g. FRGC), which might yield overfitting. To address this issue, we propose a individual-specific expression generation method which allows us to get a wide variety of expressions with a given dataset. Input 3DMM 3D shape fitting Generating expressions of 3DMM Fitted 3DMM Transferring expression Expression 3DMM Deformed input (a) An overview of expression generation method Deformed Inputs (Randomly generated expressions) Subject 1 Subject 2 Input (b) Examples of randomly generated expressions. Figure 5.2: An overview of person specific expression generation method and its examples. (a) Input: a facial 3D point cloud, Output: a deformed 3D point cloud (b) The first column represents neutral scans of input faces. Rest columns represent generated individual specific expressions from an input. For visualization, we show a 3D point cloud as a depth map. 
5.3.2.1 Expression Generation

First, we augment our 3D scans with expression variations. To be specific, we modify the expression of every 3D facial scan in the training dataset and add the resulting point cloud to the training dataset. Figure 5.2a shows an overview of the expression augmentation method. Adding expressions to a facial scan consists of three steps: (1) fitting a 3D morphable model (3DMM) to an input scan (a 3D point cloud), (2) adding expressions to the 3DMM, and (3) transferring the expressions of the 3DMM to the input. Figure 5.2b shows examples of randomly generated expressions for two subjects.

3DMM. We use a multi-linear 3D morphable model (3DMM), which contains variations in both shape and expression. The shape information comes from the Basel Face Model (BFM) [107], while the expression comes from FaceWarehouse [22]. The 3DMM represents a face by two vectors $\alpha$ and $\beta$, which represent the shape and the expression, respectively. A 3D point cloud can then be computed with a linear equation:

$$X = \bar{X} + P_s \alpha + P_e \beta, \quad (5.1)$$

where $\bar{X}$ is the average facial point cloud, $P_s$ is provided by the BFM, and $P_e$ by FaceWarehouse.

3D shape fitting. To fit a 3DMM to an input scan, we iteratively update shape, pose, and expression in an iterative closest point manner. The fitting process is similar to [6] and uses only a 3D point cloud. In our experiments, we fit the 3DMM to neutral-expression scans.

Generating expressions. We can add random expressions to the fitted 3DMM by randomly varying the values of its expression parameters. To be specific, our 3DMM has 29 expression parameters. In order to generate various expressions, we generate a set of random vectors, each using a random number of expression parameters with random values assigned each time. We limit each parameter value $\beta_i$ to the range $-0.05 < \beta_i < 0.05$ to generate natural-looking expressions.
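The random expression generation above (Eq. 5.1 with a random subset of the 29 expression parameters drawn from (-0.05, 0.05)) can be sketched as follows; the basis matrices here are synthetic stand-ins for the BFM and FaceWarehouse bases, and the function name is ours:

```python
import numpy as np

def sample_expression(mean_shape, P_s, P_e, alpha, rng, limit=0.05):
    """Generate a random expression of the fitted 3DMM (Eq. 5.1):
    X = mean + P_s @ alpha + P_e @ beta, where a random subset of the
    expression parameters beta is drawn uniformly from (-limit, limit)."""
    n_expr = P_e.shape[1]
    beta = np.zeros(n_expr)
    k = rng.integers(1, n_expr + 1)                 # random number of active params
    idx = rng.choice(n_expr, size=k, replace=False)
    beta[idx] = rng.uniform(-limit, limit, size=k)
    return mean_shape + P_s @ alpha + P_e @ beta

# Synthetic stand-ins for the BFM / FaceWarehouse bases (3V x dims)
rng = np.random.default_rng(0)
V = 100                                             # toy vertex count
mean_shape = rng.standard_normal(3 * V)
P_s = rng.standard_normal((3 * V, 40))
P_e = rng.standard_normal((3 * V, 29))              # 29 expression parameters
alpha = 0.1 * rng.standard_normal(40)
X = sample_expression(mean_shape, P_s, P_e, alpha, rng)
assert X.shape == (3 * V,)
```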
Transferring expressions. The results of the 3DMM cannot be used directly as training data. Indeed, 3DMMs are smooth by design and smooth out a lot of local information that may be important for recognition. As a result, we use the 3D morphable model only to compute the deformation field between the original point cloud and an expression-augmented point cloud. We first define a target expression by a randomly generated 3D model using the 3DMM. We then compute a displacement vector field that maps the input raw 3D scan to the target augmented 3D data. A displacement vector $d_i$ from a 3D point of the fitted 3DMM ($\phi_i$) to the corresponding point of the deformed 3DMM ($\phi'_i$) is computed as

$$d_i = \phi'_i - \phi_i, \quad (5.2)$$

where $\{d_i\}_{i=1}^N$ represents the set of displacement vectors from the fitted to the deformed 3DMM and $N$ is the number of points in the 3DMM. To apply the displacement vectors to the input, we take for every input point $X_i$ the displacement vector of the nearest point in the fitted 3DMM:

$$j^* = \arg\min_j \|X_i - \phi_j\|_2^2 \quad (5.3)$$

$$X'_i = X_i + d_{j^*}, \quad (5.4)$$

where $d_{j^*}$ is the displacement vector corresponding to $X_i$, and $\{X'_j\}_{j=1}^M$ represents the deformed input, $M$ being the number of points in the input.

5.3.2.2 Pose Variations

Secondly, we augment our data with variations in 3D pose. Since the rigid ICP registration does not guarantee optimal convergence, the 3D faces may not all be accurately registered to the reference face and may have slightly different poses. Furthermore, a CNN is not invariant to pose transformations [75]. This augmentation therefore aims at making the CNN invariant to minor pose variations. To do so, we simply apply randomly generated rigid transformation matrices ($M = [R\ t]$) to an input 3D point cloud. A rotation matrix $R \in \mathbb{R}^{3\times3}$ is generated by multiplying yaw, pitch, and roll rotations ($R = R_z(\theta_3) R_y(\theta_2) R_x(\theta_1)$) with random angles ($-10° < \theta_1, \theta_2, \theta_3 < 10°$).
A translation vector is also generated with random values ($t = [x, y, z]^T$, with $-10 < x, y, z < 10$).

5.3.2.3 Random Patches

Finally, we place eight 18×18 patches on each 2D depth map at random positions. These random patches are used to prevent overfitting to specific regions of the face; as a result, a patch-augmented CNN learns from every region of a face. In 2D face images, this kind of training data occurs naturally, for example when the subject is wearing sunglasses or is occluded by other objects. With a 3D face, we simulate occluded data by hiding random patches in the depth map. Note that other types of patches, with different sizes and shapes, are also possible.

5.3.3 Fine-tuning

To build our 3D face recognition system, we start from VGG Face [105], which is pre-trained on 2D face images, and fine-tune the network with our augmented 2D depth maps. In order to fit the input size of VGG Face, each depth map is resized to 224×224 with 3 channels. We transfer all the weights from VGG Face but replace the last fully connected layer (FC8) with a new fully connected layer followed by a softmax layer. The new last layer has the size of the number of subjects in the training data, and its weights are randomly initialized from a Gaussian distribution with zero mean and standard deviation 0.01. We use SGD with mini-batches of 32 samples and set a learning rate of 0.001 for the pre-trained layers and 0.01 for the last layer.

5.3.4 Identification

The fine-tuned CNN is used for extracting features. We take a 4096-dimensional vector from the FC7 layer as the face representation. Each feature vector is normalized by taking the square root of each of its elements. After acquiring the features of the gallery and probe sets, we perform Principal Component Analysis on the probe features, using the features from the gallery set. In the matching step, we calculate the cosine distance between a probe feature and each gallery feature; a subject's identity is determined by the closest gallery feature.
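The identification stage of Sec. 5.3.4 can be sketched as follows. The square-root normalization, gallery-side PCA, and cosine matching follow the text; the PCA dimensionality and the exact normalization order are our assumptions:

```python
import numpy as np

def identify(probe_feats, gallery_feats, n_components=64):
    """Match each probe to the closest gallery identity: square-root feature
    normalization, PCA learned on the gallery, then cosine similarity.
    The PCA dimensionality is an assumption; the thesis does not state it."""
    def norm(F):
        F = np.sqrt(np.abs(F)) * np.sign(F)        # element-wise signed sqrt
        return F / np.linalg.norm(F, axis=1, keepdims=True)
    G, P = norm(gallery_feats), norm(probe_feats)
    mu = G.mean(0)
    # PCA basis from the gallery features
    _, _, Vt = np.linalg.svd(G - mu, full_matrices=False)
    W = Vt[:min(n_components, Vt.shape[0])].T
    Gp, Pp = (G - mu) @ W, (P - mu) @ W
    # Cosine similarity: identity of the closest gallery entry
    Gn = Gp / np.linalg.norm(Gp, axis=1, keepdims=True)
    Pn = Pp / np.linalg.norm(Pp, axis=1, keepdims=True)
    return np.argmax(Pn @ Gn.T, axis=1)

# Toy check: probes that are scaled copies of the gallery match their own entry
rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1.0, (10, 128))     # 10 gallery identities, FC7-like features
P = 1.1 * G
assert (identify(P, G) == np.arange(10)).all()
```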
5.4 Experiments

5.4.1 3D Face Databases

We use the augmented FRGC [108] (FRGC v1 and FRGC v2) and CASIA 3D [103] as our training data. The gallery set of each evaluation database, except 3D-TEC [140], is also augmented and used as training data. Performance is evaluated on Bosphorus [117], BU-3DFE [153], and 3D-TEC.

FRGC v1, v2. This database, a subset of ND 2006 [41], consists of 4,950 3D facial scans of 577 subjects. Although the number of identities is relatively large compared with other databases, it contains limited expression variations. We use this database as training data.

CASIA 3D. The CASIA 3D database contains 4,624 scans of 123 subjects. Each subject was captured with different expressions and poses. We only use the scans with expressions, which amount to 3,937 scans of 123 subjects. This database is used as a training set.

Bosphorus. The Bosphorus database contains 4,666 3D facial scans of 105 subjects, with rich expression variations, poses, and occlusions; 2,902 of the scans contain expression variations from the 105 subjects. In our experiments, the first neutral scan of each of the 105 identities is used as the gallery set and the 2,797 non-neutral scans are used as the probe set.

BU-3DFE. The BU-3DFE database contains 2,500 3D facial expression models of 100 subjects. Each subject performed six expressions (e.g., happiness, disgust, fear, and so on) at four levels of intensity from low to high, plus a neutral expression. The resolution of this database is low compared to the other databases.

3D-TEC. The 3D-TEC database contains 107 pairs of twins (214 subjects in total); neutral and smiling scans were captured for each subject. Since twins are visually similar to each other, it is harder to identify a probe when the corresponding twin of the probe subject is in the gallery set; identifying twins visually is sometimes hard even for humans. The standard protocol for 3D-TEC has four scenarios (Cases I, II, III, and IV), described in [140].
5.4.2 Analysis of Augmentation

As described in Section 5.3.2, we augment 3D faces with expressions, pose variations, and random patches, and we evaluate each augmentation method separately. As we only use the FRGC dataset as a training set here, the amount of data is so limited that a DCNN may not train well from scratch; we therefore also include a shallow network in this analysis. In total, we use three different CNNs to evaluate the augmentation methods in detail: (a) a shallow net (2 convolution layers and 2 fully connected layers) with random initial weights, (b) VGG-16 with random initial weights, and (c) VGG Face, which is pre-trained on 2D face images.

Expression generation: We selected the first scan of each of the 577 identities in the FRGC database and generated 25 expressions from it, for a total of 14,425 (577 × 25) generated expression scans.

Pose variations: We applied 10 random rigid-body transformations to every scan in the FRGC dataset. The number of augmented scans is 49,500 (4,950 × 10), for a total of 54,450 training scans.

Random patches: We augmented the 3D faces in the FRGC database by placing random patches on the corresponding 2D depth maps, generating 10 images per scan. The total number of training scans is 54,450, the same as in the pose-variation experiment.

Figure 5.3 shows the ROC and CMC curves from these experiments. Each augmentation method improves performance on all three CNNs. The gains are large for (a) and (b). Although the improvements for (c) are small, they matter more because the performance without augmentation is already very high (rank-1 of 97.0%). We also find that every CNN gives its best rank-1 accuracy when trained on data combining all three augmentation methods. For (a) and (b), pose augmentation shows the biggest improvements in the CMC curves.
For (c), expression augmentation achieves the highest increase among the augmentation methods in the CMC curve.

5.4.3 Performance on the 3D Databases

We evaluate the proposed method on Bosphorus [117], BU-3DFE [153], and 3D-TEC [140]. The gallery set of each database except 3D-TEC is augmented and used as training data. We follow the evaluation protocols described in [80]. Figure 5.4 shows the ROC and CMC curves on the three databases, and Table 5.1 compares rank-1 accuracy with state-of-the-art methods that use the same protocols.

Performance on Bosphorus: We set up three experimental scenarios for detailed analysis: (1) Neutral (105 gallery scans) vs. Neutral (194 probes), (2) Neutral vs. Non-neutral (2,603 probes), and (3) Neutral vs. All (2,797 probes). The gallery set is the same in all three experiments. Our method achieves 100%, 99.2%, and 99.24% rank-1 accuracy, respectively. The 99.24% rank-1 accuracy in scenario (3) is the highest among the state-of-the-art methods listed in Table 5.1. With a training set built only from the augmented FRGC dataset, we obtain a rank-1 accuracy of 98.1%, as shown in Figure 5.3(c).

Performance on BU-3DFE: We set up three scenarios depending on the intensity of expressions: (1) Neutral (100 gallery scans) vs. Low-intensity (1,200 probes), (2) Neutral vs. High-intensity (1,200 probes), and (3) Neutral vs. All (2,400 probes). The gallery set is the same in all experiments. We obtain 97%, 95%, and 93% rank-1 accuracy in the three scenarios, giving the second-highest rank-1 accuracy in Table 5.1.

Approaches                    Training data             Bosphorus  BU-3DFE  3D-TEC
                                                                            Case I  Case II  Case III  Case IV
Lei et al. [77] (2016)        Gallery^a                 98.9       93.2     -       -        -         -
Ocegueda et al. [102] (2011)  FRGC v2                   98.6       99.3     -       -        -         -
Li et al. [80] (2014)         BU-3DFE, Gallery^a        95.4       -        93.9    96.3     90.7      91.6
                              Bosphorus, Gallery^a      -          92.2     94.4    96.7     90.7      92.5
Li et al. [81] (2015)         Gallery^a                 98.8       -        -       -        -         -
Berretti et al. [15] (2013)   BU-3DFE^b                 95.7       87.5     -       -        -         -
Huang et al. [62] (2011)      None                      -          -        91.1    93.5     77.1      78.5
Huang et al. [63] (2011)      FRGC v1                   -          -        91.6    93.9     68.7      71.0
Faltemier et al. [42] (2008)  FRGC v1                   -          -        94.4    93.5     72.4      72.9
Ours                          FRGC^c, CASIA 3D^c,       99.2       95.0     94.8    94.8     81.3      79.9
                              Gallery^a,d

^a Corresponding gallery set of the testing set. ^b A subset of BU-3DFE. ^c Entire dataset augmented with expressions, poses, and random patches. ^d Augmented gallery sets of Bosphorus and BU-3DFE. "-": not reported.

Table 5.1: Comparison of rank-1 accuracy (%) on public 3D face databases.

Performance on 3D-TEC: We evaluate our method on 3D-TEC in the four cases described in [140]. The rank-1 performance is quite low in Cases III and IV compared with Cases I and II. This is because the expression of a probe differs from that subject's gallery scan but matches the expression of the twin's gallery scan. All other methods except [80] show the same tendency in Table 5.1. However, the rank-2 performance increases by over 15% in Cases III and IV, and the rank-2 recognition rate is similar across all four cases.

From these results, our method achieves performance comparable to state-of-the-art methods on the three databases and can handle rich expression variations. However, it remains hard to identify twins with different expressions.

Approaches              Processing  Matching  Total
Spreeuwers [129]        2.5         0.04      2.54
Lei et al. [77]         6.08        2.41      8.49
Li et al. [80]          3.05        0.5       3.55
Alyuz et al. [4]        36          0.02      36.02
Kakadiaris et al. [68]  15          0.5       15.5
Li et al. [81]^a        69.5        5.5       75

Ours                    3.16        0.09      3.25
^a Computation times when the gallery size is 105.

Table 5.2: Comparison of computation time (s) for feature extraction and matching per probe for identification.

5.4.4 Time Complexity Analysis

We evaluate our method on a PC with 2.6 GHz dual processors and an NVIDIA K40 GPU for training and testing, using the Caffe implementation [67].
In a 3D face recognition system, face identification is usually slow because a probe face needs to be matched against the whole gallery set. We measured the computation time for pre-processing and matching per probe, with a gallery size of 466; here, pre-processing includes processing the raw 3D data and extracting features. Table 5.2 shows the computation time of our method and other methods. Our method takes 3.25 seconds to identify a probe. The most time-consuming part is the registration of the probe to a reference, which takes around 3 seconds using rigid ICP; the rest of the process (feature extraction and matching) can be done in less than a second. In the matching step, our method only needs to compute N cosine distances between fixed-dimensional feature vectors, where N is the size of the gallery set.

Spreeuwers [129] reports the computationally best method, but it is not carefully evaluated: the paper uses only one database for evaluation. Li et al. [80] introduced a method that takes 3.55 s to identify a probe when the gallery size is 466. Since it involves an optimization (l0 minimization) whose computation time depends on the gallery size, it is hard to estimate its time complexity as the gallery grows. Ocegueda et al. [102] did not report their time complexity; however, their pre-processing takes at least 15 s, as it uses the deformable model fitting reported in [68].

5.5 Conclusions

In this chapter, we propose the first 3D face recognition model based on a deep convolutional network (DCNN). Even with a limited dataset, we show the strong potential of a DCNN by leveraging fine-tuning and 3D augmentation methods. Our method only requires standard pre-processing, namely nose-tip detection and ICP, and does not involve complex feature extraction or matching, so it is time efficient.
Our method is evaluated on the public 3D databases [117, 153, 140] and shows performance comparable to state-of-the-art results in terms of accuracy while being more scalable.

Figure 5.3: Evaluations of the augmentation methods on the Bosphorus dataset, for (a) the shallow net, (b) VGG-16, and (c) VGG Face. FRGC is used as the training set, augmented with each method (expression, pose, patch) and with all three combined.

Figure 5.4: Evaluation results on the three databases: CMC and ROC curves on (a) Bosphorus, (b) BU-3DFE, and (c) 3D-TEC. For (a) and (b), performance is evaluated on three cases by varying the expression intensity of the probe set; for (c), on the four standard experiment cases.

Chapter 6

Analysis of 3D Face Reconstruction Quality for 3D Recognition

6.1 Introduction

3D face recognition has been studied extensively as a way to alleviate known problems with 2D face recognition, notably the PIE (Pose-Illumination-Expression) problems [49]. As shown in Chapter 5 and in much of the literature, laser-scan-quality 3D data is very discriminative and excellent recognition rates can be expected. Consequently, when the accuracy of the reconstructed models is very high, such as in studio setups [31] or with active sensors [97], high recognition rates have been achieved.

We show that recognition accuracy drops significantly as the quality of the reconstruction decreases. Notably, we demonstrate that 3D models reconstructed from in-the-wild data cannot be used for the purpose of recognition. We evaluate recognition rates on state-of-the-art reconstruction methods, including the one presented in Chapter 4, and analyze the correlation between recognition rate and reconstruction accuracy.

The rest of the chapter is organized as follows. Section 6.2 presents the recognition results from reconstructed 3D data.
Section 6.3 studies the effect of reconstruction quality on recognition results and which factors impact reconstruction results. Finally, Section 6.4 summarizes our findings.

6.2 Recognition rates

6.2.1 Dataset and experimental setup

All experiments in this chapter are performed on the MICC dataset [9], as it provides both a high-quality 3D model and videos from which we can reconstruct 3D faces for each subject. The dataset is composed of 53 different people, each with a laser-scanned 3D model and three videos.

As in Chapter 4, we compare the models reconstructed with our proposed approach against those from traditional 3DMM fitting [112] and state-of-the-art deep-learning-based 3DMM fitting [158]. Note that all three methods output a low-dimensional vector representing a 3DMM.

6.2.2 3D Recognition on faces reconstructed from in-the-wild videos

For each person in the MICC dataset, we reconstruct the 3D face using two different videos: cooperative and non-cooperative. The gallery data are the reconstructions from the cooperative videos, and the probe data are reconstructed from the non-cooperative videos.

We run experiments using two recognition systems: first, the recognition module presented in Chapter 5; second, a naive cosine-similarity metric over the 3DMM vector representations. Recognition rates are summarized in Tables 6.1 and 6.2, respectively. Several observations can be made based on the reported numbers.

          3DMM [112]  3DDFA [158]  PCSfM [Chapter 4]
Rank 1    3.8         1.9          13.2
Rank 5    13.2        9.4          24.5
Rank 10   28.3        18.9         32.1

Table 6.1: Recognition rates on MICC using the proposed deep matcher (in %).

          3DMM [112]  3DDFA [158]  PCSfM [Chapter 4]
Rank 1    13.2        60.4         26.4
Rank 5    32.1        67.9         43.4
Rank 10   41.5        75.5         45.3

Table 6.2: Recognition rates on MICC using a cosine similarity (in %).

Firstly, we can note that the CNN-based method [158] provides acceptable recognition rates when using a cosine-similarity distance.
This suggests that deep networks can produce consistent and discriminative parameter vectors. However, this does not translate into good 3D-3D recognition, as seen in Table 6.1. This phenomenon can be explained by the fact that the labels used in [158] were not generated from real-world 3D data. Secondly, the approach presented in Chapter 4 provides the best results in terms of 3D-3D recognition and performs better than traditional 3DMM fitting. Thirdly, and most importantly, the results are much lower than those presented in Chapter 5, which used laser-scan-quality 3D data; they are not high enough for 3D-3D recognition to be reliable.

6.3 Analysis

The recognition rates in Tables 6.1 and 6.2 are low, suggesting that these reconstructed models cannot be used for recognition. However, as the literature suggests, high-quality 3D face models provide excellent recognition results [31, 97]. As a result, we wish to study 3D recognition in the gray area between high-resolution 3D models and models reconstructed from unconstrained data.

In this section, we hypothesize which factors could be responsible for the low recognition rates and formulate which fundamental limitations need to be addressed. We evaluate what model quality is required to obtain high recognition rates, and we analyze the weaknesses of the reconstruction systems.

6.3.1 Required accuracy level

Here, we aim to assess the correlation between reconstruction quality and recognition rate. To do so, we artificially reduce the quality of the laser-scan models and compute recognition rates on these degraded models. For the gallery data, we use the MICC laser-scan 3D models. To generate the probe data, we iteratively apply Laplacian smoothing to the laser scans, which removes high-frequency details and thereby increases the Euclidean distance to ground truth. Example results obtained after smoothing are shown in Fig. 6.1. The deep 3D matcher presented in Chapter 5 is used to assess the recognition rank for each of the 53 faces. The recognition rate as a function of distance to ground truth is shown in Fig. 6.2.

Fig. 6.2 shows that the performance drop is very steep once the model quality falls below a certain threshold. In this experimental setting, a 0.4 mm reconstruction accuracy maintains enough detail and structure to allow excellent recognition rates. The performance suddenly drops beyond 0.5 mm error, and the recognition accuracy at 0.7 mm error is already below 50%.

Figure 6.1: Example probe data generated by applying a Laplacian filter on MICC laser scans, for an average error of 0 mm, 1 mm, and 2 mm.

Figure 6.2: Recognition rate as a function of average Euclidean distance to ground truth.

Figure 6.3: Example limitations of 3DMM for reconstructing MICC data.

Note that for this dataset and an unconstrained video setting, the best current reconstruction methods have an accuracy of 1.9 mm. As a result, the accuracy is too low to expect reliable recognition rates.

6.3.2 Representational Power of 3DMM

To increase the robustness of reconstruction frameworks, most of the literature leverages 3D Morphable Models (3DMM) as prior knowledge on the space of 3D faces [158, 116, 64, 112]. As morphable models are generated through Principal Component Analysis (PCA) over laser scans, they are smoother than laser-scan 3D models and cannot accurately represent local details. Besides, their representation is biased by the variability of the training data. We wish to assess how much this lack of representational power impacts recognition rates.

To determine the highest accuracy reachable with a given 3DMM representation, we perform non-rigid 3DMM fitting on the 3D scans using a non-rigid ICP approach [6]. Since ICP is sensitive to local minima, for each scan we start from 1,000 random initial positions and keep the best final model. As expected, local details and high-curvature areas cannot be captured well (see Fig. 6.3).
Besides, if a face is very different from the average face, the results are uncanny or even unrealistic.

Figure 6.4: Histogram of distance to ground truth for 3DMM fitting on laser scans of the MICC data. One example fitted 3DMM/laser-scan pair is shown for each extreme case.

Figure 6.5: Scatter plot of recognition rank as a function of distance to ground truth for 3D faces reconstructed with 3DMM [112], 3DDFA [158], and PCSfM [Chapter 4].

The average distance to ground truth is summarized in Fig. 6.4. Combining the results from Fig. 6.4 and Fig. 6.2, we can conclude that the maximum expected accuracy of the 3DMM representation is too low to guarantee good recognition rates over 30% of the time.

6.3.3 Reconstructed 3D faces

Comparing the results from Table 4.1 and Fig. 6.2, it is clear that the quality of the reconstructed 3D models is not sufficient to expect good recognition rates. To validate this assumption, we analyze the correlation between recognition rate and reconstruction accuracy on the models reconstructed from the MICC videos.

We run 3D reconstruction on the cooperative videos of the MICC dataset with traditional 3DMM fitting [112], CNN-3DMM fitting [158], and our SfM method [Chapter 4]. We use the laser scans as gallery data and the reconstructed 3D models as probe faces. For each face, we run the face matcher from Chapter 5 and compute the recognition rank. Fig. 6.5 shows the recognition rank as a function of reconstruction accuracy. We can note that the better the accuracy, the more the points concentrate at lower ranks, which suggests that recognition rank and model accuracy are correlated. However, the distribution of the results is too scattered to guarantee the robustness of a recognition system.

6.4 Conclusion

In this chapter, we analyzed the level of accuracy that 3D reconstruction must reach to be usable for recognition. We find that beyond a 0.4 mm error, the recognition rate decreases significantly.
We also show that the representational power of 3DMM is not sufficient to expect high recognition rates. As a result, an alternative representation to 3DMM is necessary if the goal is to achieve high recognition rates.

Chapter 7

Conclusion and Future Directions

In this work, we focused on 2D-3D registration problems in challenging cases for which no or few reliable point features are available. We investigated two applications: multimodal medical imaging and face reconstruction from video. We proposed novel approaches to address these problems and achieved superior results to other existing methods. We also proposed a novel 3D-3D face recognition system that outperforms state-of-the-art techniques in terms of speed and accuracy, and we analyzed which parameters impact the accuracy of the expected 3D reconstruction results.

7.1 Summary of contributions

The contributions of this work are summarized below:

1. Multimodal 2D-3D registration of medical images (Chapter 3): We presented a general framework for multimodal registration of multiple images and applied it to the case of retinal registration. We propose a generic framework, based on tensor voting, to extract line structures from the 2D and 3D data. The lines are then used to perform rigid, then elastic, alignment. The proposed approach is the first to focus on registering OCT data to other image modalities, and the first to register more than two modalities in a global framework. The software is currently deployed at the Doheny Eye Institute, UCLA.

2. 3D Face Reconstruction (Chapter 4): We propose a method that constrains Structure-from-Motion estimation of face shapes with a statistical prior. We show that leveraging prior knowledge on the space of faces helps guide the correspondence-finding process when point correspondences are hard to establish directly. This allows us to obtain accurate 3D shape estimates even in low-resolution and low-quality videos.
We propose to maximize photometric consistency by jointly optimizing 3D face shape, per-frame pose, and expression, which improves reconstruction results over existing approaches. Our results demonstrate that this method provides improved accuracy compared to existing methods and, when implemented on the GPU, faster processing times than recent relevant baseline methods.

3. 3D Face Recognition (Chapter 5): We propose the first end-to-end 3D face recognition system based on a deep convolutional network. On public 3D face databases, our method shows performance comparable to state-of-the-art results in terms of accuracy while being faster and more scalable. Notably, we show that even with a limited dataset, it is possible to transfer learning from a network trained on 2D images to a network catering to 3D data. Our method does not rely on any hand-crafted feature extraction, which enables it to be time efficient.

4. 3D Face Recognition from Reconstructed Data (Chapter 6): We show that the minimum accuracy necessary for 3D-3D recognition is 0.4 mm. For that reason, we show that the representational power of 3D Morphable Models is too limited for performing 3D-3D recognition and that current 3D reconstruction methods are not sufficient.

7.2 Future Directions

1. 3D Face Reconstruction: Most of the recent 3D face reconstruction literature focuses on tuning the energy function of the 3DMM fitting, adding terms related to edges, facial landmarks, image intensities, specular lighting, shadows, and diverse regularization methods. Recent advancements in deep learning, notably Generative Adversarial Networks [47], have shown that neural networks can be used to define an efficient and discriminative loss function. We believe that using a deep neural network as a loss function could enhance the fitting accuracy of 3DMM on images.

2.
3DMM Representational Power: Another issue explaining the low recognition results stems from the lack of representational power of current 3DMMs (see Chapter 6). 3DMMs are extracted through Principal Component Analysis, which is a simple, purely linear transformation. Using neural networks would allow for non-linearities, which could enhance the representational power of 3DMMs and maintain high frequencies in the recovered 3D models.

Reference List

[1] Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen. Face recognition with local binary patterns. In European Conf. Comput. Vision, pages 469–481. Springer, 2004.
[2] A. Alahi, R. Ortiz, and P. Vandergheynst. FREAK: Fast retina keypoint. Proc. Conf. Comput. Vision Pattern Recognition, 2012.
[3] Oleg Alexander, Mike Rogers, William Lambeth, Jen-Yuan Chiang, Wan-Chun Ma, Chuan-Chang Wang, and Paul Debevec. The Digital Emily project: Achieving a photorealistic digital actor. Computer Graphics and Applications, IEEE, 30(4):20–31, 2010.
[4] Neşe Alyuz, Berk Gokberk, and Lale Akarun. Regional registration for expression resistant 3-D face recognition. IEEE Transactions on Information Forensics and Security, 5(3):425–440, 2010.
[5] Brian Amberg, Andrew Blake, Andrew Fitzgibbon, Sami Romdhani, and Thomas Vetter. Reconstructing high quality face-surfaces using model based stereo. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[6] Brian Amberg, Reinhard Knothe, and Thomas Vetter. Expression invariant 3D face recognition with a morphable model. In Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on, pages 1–6. IEEE, 2008.
[7] J. Armstrong, R.E. Gangnon, L.Y. Lee, R. Klein, B.E. Klein, R.C. Milton, and F.L. Ferris. Illustration of the AMD severity scale from the Age-Related Eye Diseases Study. ARVO, 2004.
[8] G. Azzopardi, N. Strisciuglio, M. Vento, and N. Petkov.
Trainable COSFIRE filters for vessel delineation with application to retinal images. Medical Image Analysis, 19, 2015.
[9] Andrew Bagdanov, Alberto Del Bimbo, and Iacopo Masi. The Florence 2D/3D hybrid face dataset. In Int. Conf. Multimedia, 2011.
[10] T. Baltrušaitis, P. Robinson, and L. P. Morency. 3D constrained local model for rigid and non-rigid facial tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2610–2617, June 2012.
[11] Richard Barrett, Michael W Berry, Tony F Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der Vorst. Templates for the solution of linear systems: building blocks for iterative methods, volume 43. SIAM, 1994.
[12] H Barrow, J Tenenbaum, and H Wolf. Parametric correspondence and chamfer matching: Two new techniques for image matching. Proc. 5th Int. Joint Conf. Artificial Intelligence, pages 659–663, 1977.
[13] Yogesh Babu Bathina, M. V. Kartheek Medathati, and Jayanthi Sivaswamy. Robust matching of multi-modal retinal images using radon transform based local descriptor. Proceedings of the 1st ACM International Health Informatics Symposium, 2010.
[14] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Comput. Vision Image Understanding, 110(3):346–359, 2008.
[15] Stefano Berretti, Naoufel Werghi, Alberto Del Bimbo, and Pietro Pala. Matching 3D face scans using interest points and local histogram descriptors. Computers & Graphics, 37(5):509–525, 2013.
[16] R. Berthilsson. Affine correlation. Int. Conf. Pattern Recognition, 1998.
[17] Paul J. Besl and N.D. McKay. A method for registration of 3-D shapes. Trans. Pattern Anal. Mach. Intell., 14(2):239–256, 1992.
[18] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. Proc. ACM SIGGRAPH Conf. Comput. Graphics, 1999.
[19] W Brand. Morphable 3D models from video. In Proc. Conf. Comput.
Vision Pattern Recognition, 2001.
[20] Lisa Gottesfeld Brown. A survey of image registration techniques. ACM Comput. Surv., 24(4):325–376, December 1992.
[21] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. European Conf. Comput. Vision, 2010.
[22] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: a 3D facial expression database for visual computing. Trans. on Visualization and Comput. Graphics, 2014.
[23] Chen Cao, Qiming Hou, and Kun Zhou. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Trans. on Graphics, 33(4):43, 2014.
[24] Umberto Castellani and Adrien Bartoli. 3D shape registration. In 3D Imaging, Analysis and Applications, pages 221–264. Springer, 2012.
[25] Edwin Catmull. A subdivision algorithm for computer display of curved surfaces. Technical report, DTIC Document, 1974.
[26] Kyong Chang, Kevin Bowyer, and Patrick Flynn. Face recognition using 2D and 3D facial data. In Workshop on Multimodal User Authentication, pages 25–32, 2003.
[27] Jian Chen, Jie Tian, Noah Lee, Jian Zheng, R. Theodore Smith, and Andrew F. Laine. A partial intensity invariant feature descriptor for multimodal retinal image registration. Trans. Biomedical Engineering, 2010.
[28] Yang Chen and Gérard Medioni. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145–155, 1992.
[29] Zikuan Chen and Sabee Molloi. Multiresolution vessel tracking in angiographic images using valley courses. Optical Engineering, 42(6):1673–1682, 2003.
[30] Tae Eun Choe and Isaac Cohen. Registration of multimodal fluorescein images sequence of the retina. In Proceedings of IEEE International Conference on Computer Vision (ICCV) 2005, pages 106–113, 2005.
[31] Jongmoo Choi, Gerard Medioni, Yuping Lin, Luciano Silva, Olga Regina Pereira Bellon, Mauricio Pamplona Segundo, and Timothy Faltemier.
3D face reconstruction using a single or multiple views. ICPR, 2012.
[32] Jongmoo Choi, Gerard Medioni, Yuping Lin, Luciano Silva, Olga Regina, Mauricio Pamplona, and Timothy C Faltemier. 3D face reconstruction using a single or multiple views. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 3959–3962. IEEE, 2010.
[33] Haili Chui and Anand Rangarajan. A new point matching algorithm for non-rigid registration. Computer Vision and Image Understanding, 89(2-3):114–141, 2003.
[34] J. Davis. Mosaics of scenes with moving objects. In Proc. Conf. Comput. Vision Pattern Recognition, pages 354–360, 1998.
[35] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[36] K Deng, J Tie, J Zheng, X Zhang, X Dai, and M Xu. Retinal fundus image registration via vasculature structure graph matching. International Journal of Biomedical Imaging, 2010, 2010.
[37] R. Dovgard and R. Basri. Statistical symmetric shape from shading for 3D structure recovery of faces. In European Conf. Comput. Vision, 2004.
[38] Shaoyi Du, Nanning Zheng, Lei Xiong, Shihui Ying, and Jianru Xue. Scaling iterative closest point algorithm for registration of m-D point sets. Journal of Visual Communication and Image Representation, 21(5-6):442–452, 2010.
[39] DualAlign i2k Retina. http://www.dualalign.com/retinal/.
[40] N. Faggian, A. Paplinski, and J. Sherrah. 3D morphable model fitting from multiple views. In Int. Conf. on Automatic Face and Gesture Recognition, 2008.
[41] Timothy C Faltemier, Kevin W Bowyer, and Patrick J Flynn. Using a multi-instance enrollment representation to improve 3D face recognition. In Biometrics: Theory, Applications, and Systems, 2007. BTAS 2007. First IEEE International Conference on, pages 1–6. IEEE, 2007.
[42] Timothy C Faltemier, Kevin W Bowyer, and Patrick J Flynn. A region ensemble for 3-D face recognition. IEEE Transactions on Information Forensics and Security, 3(1):62–73, 2008.
[43] Douglas Fidaleo and Gérard Medioni. Model-assisted 3D face reconstruction from video. Int. Workshop on Analysis and Modeling of Faces and Gestures, 2007.
[44] A Frangi, W Niessen, K Vincken, and M Viergever. Multiscale vessel enhancement filtering. MICCAI, 1998.
[45] GG Gardner, D Keating, TH Williamson, and AT Elliott. Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool. Br J Ophthalmology, 1996.
[46] Zeinab Ghassabi, Jamshid Shanbehzadeh, Amin Sedaghat, and Emad Fatemizadeh. An efficient approach for robust multimodal retinal image registration based on UR-SIFT features and PIIFD descriptors. EURASIP Journal on Image and Video Processing, April 2013.
[47] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Neural Inform. Process. Syst., 2014.
[48] Paulo FU Gotardo, Tomas Simon, Yaser Sheikh, and Iain Matthews. Photogeometric scene flow for high-detail dynamic 3D reconstruction. In Proc. Int. Conf. Comput. Vision, 2015.
[49] R. Gross, I. Matthews, J. F. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 2009.
[50] G Guy and G Medioni. Inferring global perceptual contours from local features. International Journal of Computer Vision, 20(1-2):113–133, Oct. 1996.
[51] N. Hanaizumi and S. Fujimura. An automated method for registration of satellite remote sensing images. International Geoscience and Remote Sensing Symposium, 1993.
[52] C. Harris and M. Stephens. A combined corner and edge detector. The Fourth Alvey Vision Conference, pages 147–151, 1988.
[53] Chris Harris. Geometry from visual motion. In Andrew Blake and Alan Yuille, editors, Active Vision, pages 263–284. MIT Press, Cambridge, MA, USA, 1993.
[54] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[55] Tal Hassner. Viewing real-world faces in 3D. In Proceedings of the IEEE International Conference on Computer Vision, pages 3607–3614, 2013.
[56] Tal Hassner and Ronen Basri. Example based 3D reconstruction from single 2D images. In Proc. Conf. Comput. Vision Pattern Recognition Workshops, 2006.
[57] Matthias Hernandez, Jongmoo Choi, and Gérard Medioni. Near laser-scan quality 3-D face reconstruction from a low-quality depth stream. Image and Vision Computing, 36:61–69, 2015.
[58] Matthias Hernandez, Jongmoo Choi, and Gérard Medioni. Near laser-scan quality 3-D face reconstruction from a low-quality depth stream. Image and Vision Computing, 36:61–69, 2015.
[59] Zhihong Hu, Gerard G. Medioni, Matthias Hernandez, Amirhossein Hariri, Xiaodong Wu, and SriniVas R. Sadda. Segmentation of the geographic atrophy in spectral-domain optical coherence tomography and fundus autofluorescence images. Investigative Ophthalmology and Visual Science, 2013.
[60] Zhihong Hu, Gerard G. Medioni, Matthias Hernandez, and SriniVas R. Sadda. Automated segmentation of geographic atrophy in fundus autofluorescence images using supervised pixel classification. Journal of Medical Imaging, 2015.
[61] Zhihong Hu, Xiaodong Wu, Amirhossein Hariri, and SriniVas R. Sadda. Automated multilayer segmentation and characterization in 3D spectral-domain optical coherence tomography images. Proc. SPIE, 8567, 2013.
[62] Di Huang, Mohsen Ardabilian, Yunhong Wang, and Liming Chen. A novel geometric facial representation based on multi-scale extended local binary patterns. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 1–7. IEEE, 2011.
[63] Di Huang, Wael Ben Soltana, Mohsen Ardabilian, Yunhong Wang, and Liming Chen.
Textured 3D face recognition using biological vision-based facial representation and optimized weighted sum fusion. In CVPR 2011 Workshops, pages 1–8. IEEE, 2011.
[64] Patrik Huber, Guosheng Hu, Rafael Tena, Pouria Mortazavian, Willem P. Koppen, William Christmas, Matthias Rätsch, and Josef Kittler. A multiresolution 3D morphable face model and fitting framework. In Proc. Int. Conf. on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2016.
[65] John F. Hughes, Andries Van Dam, James D. Foley, and Steven K. Feiner. Computer graphics: principles and practice. Pearson Education, 2014.
[66] Jiaya Jia, Sai-Kit Yeung, Tai-Pang Wu, Chi-Keung Tang, and G. Medioni. A closed-form solution to tensor voting: Theory and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(8):1482–1495, 2012.
[67] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[68] Ioannis A. Kakadiaris, Georgios Passalis, George Toderici, Mohammed N. Murtuza, Yunliang Lu, Nikos Karampatziakis, and Theoharis Theoharis. Three-dimensional face recognition in the presence of facial expressions: An annotated deformable model approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4):640–649, 2007.
[69] Zhuoliang Kang and Gérard Medioni. Progressive 3D model acquisition with a commodity hand-held camera. In Winter Conf. on App. of Comput. Vision, 2015.
[70] Yan Ke and Rahul Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. Proc. Conf. Comput. Vision Pattern Recognition, pages 506–513, 2004.
[71] Ira Kemelmacher-Shlizerman and Ronen Basri. 3D face reconstruction from a single image using a single reference face shape. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(2):394–405, 2011.
[72] Jae Chul Kim, Kyoung Mu Lee, Byoung Tae Choi, and Sang Uk Lee. A dense stereo matching using two-pass dynamic programming with generalized ground control points. Proc. Conf. Comput. Vision Pattern Recognition, 2005.
[73] J. B. Kruskal. On the shortest spanning subtree of a graph and the travelling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
[74] Kiriakos N. Kutulakos and Steven M. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 2000.
[75] Dmitry Laptev, Nikolay Savinov, Joachim M. Buhmann, and Marc Pollefeys. TI-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. arXiv preprint arXiv:1604.06318, 2016.
[76] S. Lazebnik, C. Schmid, and J. Ponce. Semi-local affine parts for object recognition. Proc. British Mach. Vision Conf., 2004.
[77] Yinjie Lei, Yulan Guo, Munawar Hayat, Mohammed Bennamoun, and Xinzhi Zhou. A two-phase weighted collaborative representation for 3D partial face recognition with single sample. Pattern Recognition, 52:218–237, 2016.
[78] S. Leutenegger, M. Chli, and R. Y. Siegwart. BRISK: Binary robust invariant scalable keypoints. Proc. Int. Conf. Comput. Vision, 2011.
[79] Gil Levi and Tal Hassner. LATCH: Learned arrangements of three patch codes. Winter Conf. on App. of Comput. Vision, 2016.
[80] Huibin Li, Di Huang, Jean-Marie Morvan, Liming Chen, and Yunhong Wang. Expression-robust 3D face recognition via weighted sparse representation of multi-scale and multi-component local normal patterns. Neurocomputing, 133:179–193, 2014.
[81] Huibin Li, Di Huang, Jean-Marie Morvan, Yunhong Wang, and Liming Chen. Towards 3D face recognition in the real: a registration-free approach using fine-grained matching of 3D keypoint descriptors. International Journal of Computer Vision, 113(2):128–142, 2015.
[82] Y. Li, G. Gregori, R. W. Knighton, B. J. Lujan, and P. J. Rosenfeld.
Registration of OCT fundus images with color fundus photographs based on blood vessel ridges. Optics Express, 19(1):7–16, January 2010.
[83] Shu Liang, Linda G. Shapiro, and Ira Kemelmacher-Shlizerman. Head reconstruction from internet photos. In European Conf. Comput. Vision, 2016.
[84] Yuping Lin and G. Medioni. Mutual information computation and maximization using GPU. Computer Vision and Pattern Recognition Workshops, pages 1–6, June 2008.
[85] Yuping Lin and G. Medioni. Retinal image registration from 2D to 3D. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008.
[86] Yuping Lin, Gérard Medioni, and Jongmoo Choi. Accurate 3D face reconstruction from weakly calibrated wide baseline images with profile contours. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1490–1497. IEEE, 2010.
[87] P. Liskowski and K. Krawiec. Segmenting retinal blood vessels with deep neural networks. IEEE Transactions on Medical Imaging, PP(99):1–1, 2016.
[88] L. Loss, G. Bebis, and B. Parvin. Iterative tensor voting for perceptual grouping of ill-defined curvilinear structures: Application to adherent junctions. IEEE Trans. Med. Imaging, 30(8):1503–1513, Mar. 2011.
[89] Manolis I. A. Lourakis and Antonis A. Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software (TOMS), 2009.
[90] David G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999.
[91] R. Marzotto, A. Fusiello, and V. Murino. High resolution video mosaicing with global alignment. In Proc. Conf. Comput. Vision Pattern Recognition, pages I-692–I-698, Vol. 1, 2004.
[92] Diego Marín, Arturo Aquino, Manuel Emilio Gegúndez-Arias, and José Manuel Bravo.
A new supervised method for blood vessel segmentation in retinal images by using gray-level and moment invariants-based features. Trans. Medical Imaging, 30(1), 2011.
[93] Iacopo Masi, Anh Tuan Tran, Jatuporn Toy Leksut, Tal Hassner, and Gerard Medioni. Do we really need to collect millions of faces for effective face recognition? European Conf. Comput. Vision, 2016.
[94] Gérard Medioni, Jongmoo Choi, Cheng-Hao Kuo, and Douglas Fidaleo. Identifying non-cooperative subjects at a distance using face images and inferred three-dimensional face models. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 39(1):12–24, 2009.
[95] Xu Meihe, Rajagopalan Srinivasan, and Wieslaw L. Nowinski. A fast mutual information method for multi-modal registration. Proceedings of the 16th International Conference on Information Processing in Medical Imaging, pages 466–471, 1999.
[96] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. Trans. Pattern Anal. Mach. Intell., 27(10):1615–1630, Oct. 2005.
[97] Rui Min, Jongmoo Choi, Gerard Medioni, and Jean-Luc Dugelay. Real-time 3D face identification from a depth camera. Int. Conf. Pattern Recognition, 2012.
[98] H. Moravec. Rover visual obstacle avoidance. International Joint Conference on Artificial Intelligence, pages 785–790, 1981.
[99] M. Niemeijer, J. Staal, B. van Ginneken, M. Loog, and M. D. Abramoff. Comparative study of retinal vessel segmentation methods on a new publicly available database. Journal of Medical Imaging, 2004.
[100] D. Nister. An efficient solution to the five-point relative pose problem. Trans. Pattern Anal. Mach. Intell., 26(6), 2004.
[101] S. Niu, Q. Chen, H. Shen, L. de Sisternes, and D. L. Rubin. Registration of SD-OCT en-face images with color fundus photographs based on local patch matching. Ophthalmic Medical Image Analysis First International Workshop, 2014.
[102] Omar Ocegueda, Georgios Passalis, Theoharis Theoharis, Shishir K. Shah, and Ioannis A. Kakadiaris. UR3D-C: Linear dimensionality reduction for efficient 3D face recognition. In Biometrics (IJCB), 2011 International Joint Conference on, pages 1–6. IEEE, 2011.
[103] Chinese Academy of Sciences Institute of Automation (CASIA). CASIA-3D FaceV1, 3D face database.
[104] Guan Pang and Ulrich Neumann. The gixel array descriptor (GAD) for multimodal image matching. In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 497–504, 2013.
[105] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. British Machine Vision Conference, 1(3):6, 2015.
[106] Hemprasad Patil, Ashwin Kothari, and Kishor Bhurchandi. 3-D face recognition: features, databases, algorithms and challenges. Artificial Intelligence Review, 44(3):393–441, 2015.
[107] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. In Advanced Video and Signal Based Surveillance, 2009. AVSS'09. Sixth IEEE International Conference on, pages 296–301. IEEE, 2009.
[108] P. Jonathon Phillips, Patrick J. Flynn, Todd Scruggs, Kevin W. Bowyer, Jin Chang, Kevin Hoffman, Joe Marques, Jaesik Min, and William Worek. Overview of the face recognition grand challenge. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 947–954. IEEE, 2005.
[109] Josien P. W. Pluim, J. B. Antoine Maintz, and Max A. Viergever. Mutual-information-based registration of medical images: a survey. IEEE Transactions on Medical Imaging, pages 986–1004, 2003.
[110] Q.-T. Luong and Olivier D. Faugeras. The fundamental matrix: Theory, algorithms, and stability analysis. Int. J. Comput. Vision, 1(17):43–75, 1996.
[111] Vincent Rabaud and Serge Belongie. Re-thinking non-rigid structure from motion. In Proc. Conf. Comput. Vision Pattern Recognition, 2008.
[112] S.
Romdhani and T. Vetter. Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In Proc. Conf. Comput. Vision Pattern Recognition, 2005.
[113] Y. Rouchdy and L. D. Cohen. Retinal blood vessel segmentation using geodesic voting methods. In 2012 9th IEEE International Symposium on Biomedical Imaging (ISBI), pages 744–747, May 2012.
[114] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: an efficient alternative to SIFT or SURF. Proc. Int. Conf. Comput. Vision, 2011.
[115] Daniel Sage. Local normalization. http://bigwww.epfl.ch/sage/soft/localnormalization/, 2011.
[116] Shunsuke Saito, Tianye Li, and Hao Li. Real-time facial segmentation and performance capture from RGB input. In European Conf. Comput. Vision, 2016.
[117] Arman Savran, Neşe Alyüz, Hamdi Dibeklioğlu, Oya Çeliktutan, Berk Gökberk, Bülent Sankur, and Lale Akarun. Bosphorus database for 3D face analysis. In European Workshop on Biometrics and Identity Management, pages 47–56. Springer, 2008.
[118] Harpreet S. Sawhney and Rakesh Kumar. True multi-image alignment and its application to mosaicing and lens distortion correction. Trans. Pattern Anal. Mach. Intell., 21(3):235–243, 1999.
[119] Ashutosh Saxena, Min Sun, and Andrew Ng. Make3D: Learning 3D scene structure from a single still image. In Trans. Pattern Anal. Mach. Intell., volume 31, pages 824–840, 2008.
[120] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vision, 47, 2002.
[121] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. Trans. Pattern Anal. Mach. Intell., 19(5):530–535, 1997.
[122] S. Schmitz-Valckenberg, M. Fleckenstein, H. P. Scholl, and F. G. Holz. Fundus autofluorescence and progression of age-related macular degeneration. Surv. Ophthalmology, 54, 2009.
[123] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proc.
Conf. Comput. Vision Pattern Recognition, 2016.
[124] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proc. Conf. Comput. Vision Pattern Recognition, pages 815–823, 2015.
[125] Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Trans. on Graphics, 33, 2014.
[126] Jamie Shotton, Andrew Blake, and Roberto Cipolla. Multiscale categorical object recognition using contour fragments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7):1270–1281, 2008.
[127] Heung-Yeung Shum and Richard Szeliski. Systems and experiment paper: Construction of panoramic image mosaics with global and local alignment. Int. J. Comput. Vision, 36(2):101–130, 2000.
[128] Karen Simonyan, Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Fisher vector faces in the wild. Proc. British Mach. Vision Conf., 2(3):4, 2013.
[129] Luuk Spreeuwers. Fast and accurate 3D face recognition. International Journal of Computer Vision, 93(3):389–414, 2011.
[130] Charles V. Stewart, Chia-Ling Tsai, and B. Roysam. The dual-bootstrap iterative closest point algorithm with application to retinal image registration. Trans. Medical Imaging, 22(11):1379–1394, Nov. 2003.
[131] Henrik Stewenius, Christopher Engels, and David Nister. Recent developments on direct relative orientation. ISPRS Journal of Photogrammetry and Remote Sensing, 60(4):284–294, 2006.
[132] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.
[133] Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M. Seitz. Total moving face reconstruction. In European Conf. Comput. Vision, 2014.
[134] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[135] Thomas Theelen, Tos T. J. M. Berendschot, Carel B. Hoyng, Camiel J. F. Boon, and B. Jeroen Klevering. Near-infrared reflectance imaging of neovascular age-related macular degeneration. Graefes Arch. Clin. Exp. Ophthalmol., 2009.
[136] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt. Real-time expression transfer for facial reenactment. ACM Trans. on Graphics, 34(6), 2015.
[137] P. Thévenaz and M. Unser. An efficient mutual information optimizer for multiresolution image registration. Int. Conf. Image Processing, pages 833–837, 1998.
[138] Engin Tola, Vincent Lepetit, and Pascal Fua. A fast local descriptor for dense matching. Proc. Conf. Comput. Vision Pattern Recognition, 2008.
[139] Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gerard Medioni. Regressing robust and discriminative 3D morphable models with a very deep neural network. In Proc. Conf. Comput. Vision Pattern Recognition, 2017.
[140] Vipin Vijayan, Kevin W. Bowyer, Patrick J. Flynn, Di Huang, Liming Chen, Mark Hansen, Omar Ocegueda, Shishir K. Shah, and Ioannis A. Kakadiaris. Twins 3D face recognition challenge. In Biometrics (IJCB), 2011 International Joint Conference on, pages 1–7. IEEE, 2011.
[141] Etienne Vincent and Robert Laganiere. Detecting planar homographies. Image and Signal Processing and Analysis, 2001.
[142] P. Viola and W. M. Wells. Alignment by maximization of mutual information. Int. J. Comput. Vision, pages 137–154, 1997.
[143] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I-511.
IEEE, 2001.
[144] Daniel Weber, Jan Bender, Markus Schnoes, André Stork, and Dieter Fellner. Efficient GPU data structures and methods to solve sparse linear systems in dynamics applications. Computer Graphics Forum, 2012.
[145] Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and Cordelia Schmid. DeepFlow: Large displacement optical flow with deep matching. Proc. Int. Conf. Comput. Vision, 2013.
[146] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Proc. Conf. Comput. Vision Pattern Recognition, 2011.
[147] John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
[148] Zhirong Wu, Shuran Song, Aditya Khosla, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets for 2.5D object recognition and next-best-view prediction. ArXiv e-prints, 2, 2014.
[149] Fei Yang, Jue Wang, Eli Shechtman, Lubomir Bourdev, and Dimitri Metaxas. Expression flow for 3D-aware face component transfer. ACM Trans. on Graphics, 30(4):60, 2011.
[150] Gehua Yang, Charles V. Stewart, Michal Sofka, and Chia-Ling Tsai. Registration of challenging image pairs: Initialization, estimation, and decision. IEEE Trans. Pattern Anal. Mach. Intell., 29(11):1973–1989, November 2007.
[151] Jinzhong Yang. The thin plate spline robust point matching (TPS-RPM) algorithm: A revisit. Pattern Recognition Letters, 32(7):910–918, 2011.
[152] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Comput. Surv., 38(4), December 2006.
[153] Lijun Yin, Xiaozhou Wei, Yi Sun, Jun Wang, and Matthew J. Rosato. A 3D facial expression database for facial behavior research. In 7th International Conference on Automatic Face and Gesture Recognition (FGR06), pages 211–216. IEEE, 2006.
[154] Yehoshua Z., Rosenfeld P. J., Gregori G., Feuer W. J., Falco M., Lujan B. J., and Puliafito C.
Progression of geographic atrophy in age-related macular degeneration imaged with spectral domain optical coherence tomography. Ophthalmology, 2011.
[155] F. Zana and J. C. Klein. Segmentation of vessel-like patterns using mathematical morphology and curvature evaluation. IEEE Transactions on Image Processing, 10(7):1010–1019, Jul. 2001.
[156] Zhengyou Zhang, Rachid Deriche, Olivier Faugeras, and Quang-Tuan Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 78(1-2):87–119, 1995.
[157] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Proc. Conf. Comput. Vision Pattern Recognition, pages 2879–2886. IEEE, 2012.
[158] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z. Li. Face alignment across large poses: A 3D solution. In Proc. Conf. Comput. Vision Pattern Recognition, 2016.
[159] Barbara Zitová and Jan Flusser. Image registration methods: a survey. Image and Vision Computing, 21(11):977–1000, 2003.