Efficient template representation for face recognition: image sampling from face collections

by

Jungyeon Kim

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

December 2019

Copyright 2020 Jungyeon Kim

Acknowledgement

First of all, I would like to express my sincere gratitude to my advisor, Prof. Gérard Medioni, for his continuous support, his boundless patience, and his indispensable advice throughout this Ph.D. study. His guidance helped me carry out this research and write this dissertation. I could not have imagined having a better mentor for my Ph.D. study.

Besides my advisor, I would like to thank the rest of my thesis committee, Prof. Panayiotis Georgiou, Prof. Aiichiro Nakano, Prof. Antonio Ortega, and Prof. Alexander Sawchuk, for their encouragement. My sincere thanks also go to Prof. Ram Nevatia, Dr. Osonde A. Osoba, Prof. Yura Lee, Miss Claudia Clabo, Pastor Sungho Shin, Mrs. Nakgum Back, Dr. Lucy Kim, Dr. Hyeonju Kim, Dr. Kanggeon Kim, and Dr. Jiun Son, who counseled me through the hard times of my Ph.D. study. My sincere thanks also go to Dr. Anh T. Tran, Prof. Tal Hassner, Dr. Iacopo Masi, and Dr. Jongmoo Choi, who widened my research area. I thank SVP Junseo Lim, former SVP Ilsung Bae, and Master VP Heonsuk Ryu, who supported this Ph.D. program. Last but not least, I would like to thank my loving family: my parents, my brother, and my sister, for everything.

Table of Contents

Acknowledgement
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Challenges
    1.3.1 Face recognition
    1.3.2 Selecting representative facial images
  1.4 Contributions
  1.5 Dissertation outline
2 Related Work
  2.1 Image set based approaches
  2.2 Frame selection approaches for face recognition
I Image sampling for face recognition on deep learning approaches
3 Template representation by sampling a few representative images
  3.1 Motivation and overall approach
    3.1.1 Representative face image sets
    3.1.2 Robust feature vectors
    3.1.3 Overview of our system
  3.2 Pre-processing: Alignment
    3.2.1 Landmark detection
    3.2.2 Face normalization
    3.2.3 Face Image Analysis: Pose and Quality Estimation
    3.2.4 In-plane alignment
  3.3 Image sampling
    3.3.1 Quantization of face image space
    3.3.2 Filtering: Outlier Removal
    3.3.3 Image selection
    3.3.4 To sum up for the Image Sampling Block
  3.4 Training CNN: Fine-tuning of deep learning models
  3.5 Matching Templates
    3.5.1 Deep feature representation of an image
    3.5.2 The similarity of two images
    3.5.3 The similarity of two templates
    3.5.4 Fusion of pooled faces and random faces
    3.5.5 Overview of matching two templates
  3.6 Robust Feature Extraction: Bin equalization
    3.6.1 Bias cancellation
    3.6.2 Variance cancellation
4 Experimental results
  4.1 Database
  4.2 Performance Metrics
  4.3 System component analysis
    4.3.1 Quantization of face image space
    4.3.2 Face Selection
    4.3.3 Face representation
    4.3.4 Bin equalization
    4.3.5 Component analysis
    4.3.6 With fewer images, higher recognition accuracy
  4.4 Comparison with the state-of-the-art
II Application to videos
5 Video-based face recognition pipeline
  5.1 Motivation
  5.2 System overview
  5.3 Face collections from a video
    5.3.1 Face tracker
    5.3.2 Face detector and associator
  5.4 CNN training and feature extraction
6 Experimental results on video based face recognition
  6.1 Database and protocol
  6.2 Performance Metrics
  6.3 Recognition accuracy on two different face collections
  6.4 Performance analysis
  6.5 Comparison with the state-of-the-art
7 Conclusions
Reference List

List of Tables

3.1 Bin indices
4.1 Selection performance of a random face in quantized bins
4.2 Selection performance of a pooled face in quantized bins
4.3 Recognition results comparing several template representation on Janus CS2
4.4 Bin equalization performance of our representative face images with pair-wise matching method
4.5 Bin equalization performance of our representative face images using pooled feature matching method
4.6 Recognition performance for each component on CS2 and IJB-A Datasets with pair-wise matching method
4.7 Recognition performance for each component on CS2 and IJB-A Datasets with pooled feature matching method
4.8 Comparative performance analysis
6.1 Face recognition accuracy according to the sampling method on Janus CS3 database
6.2 Face recognition accuracy according to the sampling method on IJB-B database
6.3 Comparison of computational complexity
6.4 Comparison of face recognition accuracy according to the sampling method
6.5 Recognition comparison with state-of-the-arts

List of Figures

1.1 IJB-A benchmarks: template to template matching
1.2 Our proposed solution
1.3 Challenges for face recognition
3.1 Comparison of the statistical distribution of pose yaw and image quality in the CASIA-WebFace and IJB-A
3.2 An example of our framework
3.3 Our proposed solution for a reduced template representation
3.4 Our proposed solution for robust feature vectors
3.5 Overview of our system
3.6 Overview of the Alignment Block
3.7 Examples of face images with detected landmarks and landmark confidence level
3.8 Bins of face image space depending on the pose and quality
3.9 In IJB-A [79], examples of outlier face images
3.10 Template representation: representative face images for a template
3.11 CNN training with our template structures
3.12 Score fusion
3.13 Overview of matching pipeline
3.14 Bin equalization: the concept of bias cancellation
3.15 Bin equalization: bias cancellation
3.16 Bin equalization: variance cancellation
4.1 The importance of sample diversity on pose
4.2 The importance of sample diversity on quality
4.3 CS2, IJB-A, CMC and ROC curve using pair-wise matching method
4.4 CS2, IJB-A, CMC and ROC curve using the pooled feature matching method
5.1 System block diagram at test stages
5.2 An example of a video
5.3 System diagram for feature extraction in our experiment

Abstract

Template-based face recognition methods recognize a person's identity (1:N face identification) or validate a person's identity (1:1 verification) by comparing probe templates against gallery templates. Each template contains many real-world face images of a person with varying poses, qualities, and other unconstrained settings, captured from multiple devices. Recent template-based face recognition systems have focused more on developing advanced deep models to achieve higher recognition accuracy and less on reducing computational complexity. This dissertation aims to reduce computational complexity as well as to increase recognition accuracy by introducing an efficient template representation for deep learning approaches that samples a few face images to represent a template.

To do so, we compressed the variance of intra-personal appearances into a few samples. We sparsely quantized the face image space into disjoint bins of pose and quality and assigned each face sample to a bin according to its estimated pose yaw and image quality. Then, we averaged the face images with similar appearances in each quantized bin to obtain the central tendency. In addition, to compensate for the possible loss of facial details induced by averaging, we randomly selected a face image in each bin. Further, from the encoded deep features of our template representation, we proposed bin equalization methods to remove the pose- and quality-specific portions that are not related to a person's identity. Through extensive experiments, we showed that our method achieved higher recognition accuracy with fewer images than the original templates on the mixed-image protocols of the Janus CS2 and IJB-A benchmarks.

We then applied our method to video-based face recognition on the Janus CS3 and IJB-B benchmarks, in which gallery templates contain real-world still face images and probe templates contain a video file with an initial face bounding box of a target subject. These video protocols make the recognition problem more challenging because the probe templates contain a very large number of face images, some of which may come from different subjects or from non-face objects, depending on the performance of the face collectors. We showed that our method is superior to the state of the art in video-based face recognition in terms of both computational complexity and recognition accuracy. We also showed that our method achieved consistently high recognition accuracy even with low-performance face collectors.

This dissertation makes two contributions to efficient template-based face recognition. First, we show that computational complexity can be significantly reduced without losing recognition accuracy by intelligently sampling a few face images that represent the many face images in a template. Second, we propose robust feature vectors that are insensitive to variations in pose and quality, which boost recognition accuracy. The integration of the two contributions results in a robust face recognition system that is insensitive to variations in the quality of face collections. Finally, the proposed method achieved new state-of-the-art performance in template-based video face recognition efficiently.
Chapter 1

Introduction

1.1 Background

Face recognition is a computer vision task in which a system solves a classification problem to determine the positive pairs between probe sets and gallery sets. The face recognition problem includes identification and verification. Face identification is the task of determining a person's identity by comparing a probe set against all gallery sets (priors); the requirement is that corresponding pairs show high similarity scores between their learned facial features. Face verification is the task of deciding whether given face images belong to the same person or not; the requirement is that corresponding pairs show a similarity score higher than a given operating threshold when one probe set is compared with one gallery set.

Face recognition has become more and more important in the digital imaging era because of its increasing range of potential applications [65], such as surveillance for security and photo tagging for entertainment. However, in unconstrained real-world environments, the facial appearance of a person can differ significantly depending on head pose, image quality, and facial expression, which means that intra-personal variation in facial appearance can be greater than inter-personal variation. For this reason, face recognition with real-world face photos remains a challenging problem. In particular, as shown in Fig. 1.1, recently published face recognition databases such as the IJB-A benchmarks (IARPA Janus Benchmark-A) [79] have introduced the notion of a "template" and proposed "template to template matching", making the face recognition problem more challenging, because a template contains many unconstrained face images and/or video frames of a person captured by various devices.

Figure 1.1: IJB-A benchmarks: template to template matching

1.2 Problem statement

On the surface, using multiple face images rather than one leads to higher recognition accuracy [153] because we can use more abundant facial information. However, this is not always the case. When templates contain multiple images varying in pose, quality, and other compounding factors, matching across the templates requires considering all of those characteristics to avoid skewing the matching scores. Also, when templates contain many similar face photos that only provide overlapping or redundant facial information (e.g., consecutive video frames in which the subject is not moving), those photos might not improve recognition accuracy, but they will increase computational cost. To address these issues, we focus on reducing computational complexity while even increasing recognition accuracy by proposing an effective template representation.

Previous work on effective image set representation was mainly carried out in a set-based setting, often with the YTF benchmark [141]. Various set representations, from linear subspaces (e.g., [52, 66]) to non-linear manifolds [32, 91], were suggested, together with their own set-to-set similarity measures. More recent template-based approaches, however, have tended to use all face images with more specialized set representations, using deep learning models [3, 29, 96, 117]. They then compute the set similarity between two templates by aggregating the similarities of all image pairs across the templates into a single score. Very recently, approaches have adopted feature pooling [89, 40, 122, 93] to exploit the invariant properties of feature vectors, and thus accomplished higher recognition accuracy with fewer matching comparisons.
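To make these two matching regimes concrete, the minimal sketch below contrasts all-pairs score aggregation with pooled-feature matching for two templates of deep features. It is only an illustration: the cosine metric, the 256-dimensional features, and the softmax temperature beta are assumptions made for the example, not the exact formulation used in this dissertation (our matching methods are specified in Sec. 3.5).

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def all_pairs_score(probe_feats, gallery_feats, beta=10.0):
    """Aggregate all cross-template pair similarities into one score.

    A softmax weighting (temperature beta) emphasizes the most similar pairs;
    beta = 0 reduces to a plain mean over all M_p * N_g pairs.
    """
    sims = np.array([cosine_sim(p, g) for p in probe_feats for g in gallery_feats])
    weights = np.exp(beta * sims)
    return float(np.sum(weights * sims) / np.sum(weights))

def pooled_feature_score(probe_feats, gallery_feats):
    """Match one average-pooled feature per template instead of all pairs."""
    return cosine_sim(np.mean(probe_feats, axis=0), np.mean(gallery_feats, axis=0))

# Toy example with random 256-D "deep features" standing in for CNN outputs.
rng = np.random.default_rng(0)
probe = rng.normal(size=(5, 256))    # probe template with M_p = 5 images
gallery = rng.normal(size=(8, 256))  # gallery template with N_g = 8 images
print(all_pairs_score(probe, gallery), pooled_feature_score(probe, gallery))
```

Note that the all-pairs variant performs on the order of M_p x N_g comparisons per template pair; bounding exactly this cost is the motivation for the sampling strategy introduced next.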
As one way of reducing the computational complexity, we selected at most a constant number of representative facial images for a subject from the original template. Specifically, suppose we have $p$ probe templates and $g$ gallery templates, defined respectively as $\{P_1, P_2, \ldots, P_p\} \in \mathcal{P}$ and $\{G_1, G_2, \ldots, G_g\} \in \mathcal{G}$, where each probe and gallery template is composed of a series of images or video frames defined as $\{I_{1p}, I_{2p}, \ldots, I_{M_p p}\} \in P_p$ and $\{I_{1g}, I_{2g}, \ldots, I_{N_g g}\} \in G_g$. Then we can reduce the computational complexity of matching comparisons for each template pair from $O(M_p \cdot N_g)$ to $O(1)$. In addition, if we denote the computational complexity of extracting a single feature vector by $F$, we can reduce the per-template feature extraction complexity from $O(M_p)$ and $O(N_g)$ to $O(1)$ (in units of $F$). Thus, limiting the number of images in a template to a constant not only saves storage capacity for images, but also reduces the computational time needed to extract expensive deep features.

Traditional studies automatically selected representatives, so-called "exemplars", from the raw gallery video frames using learned statistical models or the centers of probabilistic mixture distributions [82, 15]. The representatives were used to construct probabilistic appearance manifolds such as pose manifolds [85] or intra-personal and extra-personal subspaces [44]. Alternatively, they were exploited as reference images to remove low-quality images such as non-human faces, poorly cropped faces, non-frontal faces, noisy faces, or blurry faces [50, 15, 123, 47, 16, 101, 97, 142, 7, 17]. However, these methods only worked well when the gallery databases were highly correlated with the probe databases in terms of their statistical distributions, so they are not applicable to recent template-based databases, whose probe and gallery distributions are independent.

More recent deep learning-based approaches have focused on improving recognition accuracy while ignoring computational complexity. They have attempted to develop more advanced deep models by collecting more training images and using more advanced deep networks [130, 105, 127, 125, 29, 117, 93, 59, 120], without regard to how many feature vectors they need to extract during the test stage. For example, Masi et al. [93] exploited a template that was four times bigger than the original template during the test stage to achieve higher recognition accuracy, expending more storage space, computing power, and computational time. Even though some studies have reduced computational complexity by pruning CNN weights [148, 129] or by optimizing deep feature extraction hardware [124], there is still a lack of research on reducing the template size, in terms of the number of images, for deep learning approaches.

Accordingly, in this dissertation, we propose an effective template representation that intelligently samples representative facial images from a template, as shown in Fig. 1.2.(a). To capture the essential facial features with fewer images, we sparsely and disjointly quantized the face image space with respect to pose and quality. To compress the overlapping facial information and to remove noisy nuisance factors, we averaged the face images in a pixel-wise manner in each quantized bin. In addition, because averaging itself can cause a loss of detail in facial appearance, we took a randomly selected face in each bin to preserve subtle facial appearances.
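The following minimal sketch illustrates the sampling step just described: assign aligned face images to pose-yaw/quality bins, drop low-confidence images, and keep one pixel-wise average face plus one randomly selected face per bin. The 4x4 bin layout, the bin edges, and the landmark-confidence threshold are illustrative assumptions; the actual quantization and filtering rules are defined in Sec. 3.3.

```python
import numpy as np

# Illustrative 4x4 quantization of the face image space; the dissertation's
# actual pose-yaw / quality bin boundaries and CFD threshold are given in Sec. 3.3.
YAW_EDGES = [0.0, 15.0, 45.0, 75.0, 90.0]       # |yaw| in degrees
QUALITY_EDGES = [0.0, 25.0, 50.0, 75.0, 100.0]  # image quality score (SSEQ-like)
CFD_MIN = 0.4                                   # drop less-face-like images

def bin_index(abs_yaw, quality):
    """Map an image's |yaw| and quality score to a (pose_i, quality_j) bin."""
    p = int(np.clip(np.digitize(abs_yaw, YAW_EDGES) - 1, 0, 3))
    q = int(np.clip(np.digitize(quality, QUALITY_EDGES) - 1, 0, 3))
    return (p, q)

def sample_template(images, yaws, qualities, cfds, rng=None):
    """Return {bin: (pooled_face, random_face)} for one template.

    `images` are aligned HxWx3 float arrays of identical size; `yaws`,
    `qualities`, and `cfds` are the per-image pose, quality, and landmark
    confidence estimates produced by the Alignment Block.
    """
    rng = rng or np.random.default_rng()
    bins = {}
    for img, yaw, qual, cfd in zip(images, yaws, qualities, cfds):
        if cfd < CFD_MIN:                        # filtering: outlier removal
            continue
        bins.setdefault(bin_index(abs(yaw), qual), []).append(img)
    representatives = {}
    for b, members in bins.items():
        pooled = np.mean(np.stack(members), axis=0)           # pixel-wise average
        random_face = members[rng.integers(len(members))]     # random member
        representatives[b] = (pooled, random_face)
    return representatives
```

Because the number of bins is fixed, the output of this step is bounded by a constant number of images regardless of the original template size.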
Finally, by taking both average pooled faces and randomly selected faces (PRF) as our representative face images, we were able to achieve higher recognition accuracy with maximally a constant number of face images. In addition, we propose identity preserving feature vectors that are insensitive to varia- tions in pose and quality, introducing bin equalization methods, as shown in Fig 1.2.(b). Because pose and quality are not actually related to a person’s identity, we eliminated the pose and quality specific portions from our encoded deep feature vectors and unified the feature spaces, taking advantage of our quantized pose and quality bin. Then, we pooled our encoded feature vectors within the unified feature space. Because the characteristics of pooled faces and random faces are different from each other, all the processes are separately performed, and then, we took average fusion of their similarity scores during matching comparisons. By doing so, we were able to further boost our recognition accuracy. 5 We carried out extensive experiments on the Janus CS2 and IJB-A benchmarks [79] on the mixed image protocols. In addition, we validated our pooling and random face representation (PRF) on video protocols of the Janus CS3 and IJB-B benchmarks [140], accomplishing state- of-the-art recognition accuracy requiring much less computational complexity. Moreover, our methods showed consistent high recognition accuracy even when the original template exhibited significantly lower recognition accuracy due to noisy image sets. Figure 1.2: Our proposed solution 1.3 Challenges 1.3.1 Face recognition Differentiating the positive matching pairs (genuine) and negative matching pairs (impostors) among face images is a challenging task because the facial appearances of a person can exhibit 6 large variations in pose, quality, illumination, expression and occlusion, and as a result, vary more than the facial appearances of different people. Figure 1.3: Challenges for face recognition: (a) head pose, (b) image quality, (c) expression, (d) illumination, (e) many unconstrained images for a person, our interest is in (a), (b) and (e). Head pose: While human faces have common structures with two eyes, a nose and a mouth, human facial structures captured on a 2D screen are completely different from each other depending on the head poses or camera angles (e.g. frontal face vs. profile face) as shown in Fig 1.3.(a). In other words, differences in head poses can result in intra-personal differences in facial appearances that are greater than inter-personal differences. Those facial differences are often compensated with alignment techniques [94, 55]. Image quality: The face images of a person vary extremely depending on resolution, blurriness, noise levels, compression effect, and other quality factors, as shown in Fig 1.3.(b). In addition, real-world face databases contain many low quality images which do not carry accurate facial information. For example, blurry images that lack textures de- liver featureless information. Images with extreme sharpness or compression artifacts such 7 as JPEG often include undesirable digital patterns, introducing inaccurate facial features. Consequently, face recognition systems need to be designed so that they are robust to noisy features from low quality images. Facial expression: Human faces are highly deformable depending on facial expressions such as anger and happiness. 
The variations in facial skin textures and facial features near the mouth, eyes, and jaws of a person presenting different expressions are larger than those among people with the same expression, as shown in Fig 1.3.(c). Illumination: Depending on color temperatures, light intensities, and incident direction of illumination, variations in the facial skin colors and textures of one person are larger than those between different people, as shown in Fig 1.3.(d). In addition, extremely low or high intensity light can add undesirable patterns such as dot noises and flares to the facial images, contaminating facial information. Many unconstrained images: The use of many real-world face photos with various poses, quality, expressions, illumination, and other compounding real-world factors make the face recognition problem more challenging. In addition, the use of multiple images increases computational complexity. 1.3.2 Selecting representative facial images Because face image space is influenced by many real-world circumstances in a complicated way, clarifying the face image space and reflecting all the facial characteristics with a few samples are impractical. However, when we consider practical applications, as much as possible, the goal is 8 to keep important sample variances most, and filter out noisy images, without losing recognition accuracy. Head pose: Because head pose is expressed as a 3D continuous number with pitch, yaw, and roll, it is difficult to decide which precision of head pose we need to consider. Also, it difficult to decide which angles of yaw and pitch deliver important facial features more. In addition, when the statistical distribution of head poses is skewed within face databases, then matching scores can also be skewed. For example, when a face database contains mostly near-frontal faces in both gallery and probe, selecting near-frontal faces as repre- sentative facial images in the database can increase the recognition accuracy, predicting genuine pairs well. However, what if the probe database only includes profile faces unlike the representatives (frontal faces)? Image quality: Bluriness, sharpness, noise, compression artifacts, and other quality re- lated factors distort the information in encoded facial feature vectors. However, when the databases contain many low quality images with all those factors together, it is not easy to determine which images are of low quality and should be eliminated in the databases, due to their useless information. In addition, in circumstances where we only can use the low quality images in a template, we need to extract usable information from the given low quality images as much as possible. Also, because similarity scores are pair matching issues, similarity scores among low quality images can cause false-positive pairs, skewing the recognition results. Expression: Human faces are not a rigid object and there are no standard expressive faces among people. Even when we consider a person’s smiling face, the mouth shape can be 9 deformed in many different ways. Often, methods to neutralize a facial expression are used involving 3D face reconstruction. 1.4 Contributions In this dissertation, we intelligently represent a template with a few samples and propose ro- bust feature vectors that are insensitive to pose and quality in deep learning based approaches. Due to compactly preserved essential facial appearances, our sample methods achieved higher recognition accuracy than state-of-the-art video-based face recognition systems. 
Also, due to the reduced template size, our sample methods needed much less storage capacity and computational time during extracting feature vectors. Contributions in terms of novelty are like below.: Reduced template representations in image domain by combining several image sam- pling methods: We extracted essential facial information in a fixed volume from the orig- inal template. To keep appearance diversities, we quantized face image space considering pose and quality. To obtain the central tendency among similar poses and qualities, we adopted pixel-wise average pooling of facial images in each bin. To reduce the loss of us- able information such as details of facial textures due to the average, we applied a random sampling method to the facial images in each bin. In addition, we deleted less-face-like im- ages in each bin, depending on landmark confidence level. By combining all these methods, we were able to reflect the facial characteristics of real-world face collections with their subsets, and preserve diversities and local similarities, without losing recognition accuracy. 10 Training deep models with our image sampling structure: Because the face databases what we used for training sets did not support template structures, we generated several different size templates by sampling face images from the original training images for each subject. The generated templates were used to fine-tune the conventional deep networks with our template representation. By doing so, we made our deep models robust even to pooled faces, as well as original faces. Identity preserving deep feature vectors insensitive to pose and quality: Because the pose and quality information is irrelevant to a person’s identity, we have attempted to elim- inate the pose and quality dedicated portions from our original feature vectors, introducing bin equalization methods. In the process, we assumed that a mean feature vector of many subjects in a bin captures human facial appearances with specific pose and quality unrelated to the individual facial features, while a mean feature vector of a person in a bin captures a person’s common facial appearances with specific pose and quality. By exploiting the mean feature vectors of multiple subjects in each bin, we were able to remove the pose and quality specific portions and flattened the quantized face image spaces. Further, in the flat- tened and unified face image space, we were able to provide robust pooled features which preserve invariant properties of feature vectors except for pose and quality. Application to video protocols: We applied our sampling structure to video protocols on template settings. Our representative facial images showed higher recognition accuracy than state-of-the-art as well as the original face collections (our baseline system). In partic- ular, our sample method showed consistently high recognition accuracy even when the face 11 collections were noisy due to the low performance of face collectors, unlike the original templates. Recognition accuracy: With the combination of our image sampling method in the im- age domain, we improved recognition accuracy with much less computational complexity. Especially, on video-based face recognition applications, we achieved higher performance than state-of-the-art. Computational complexity: By comparing facial characteristic parameters and averag- ing images, we reduced template size and in image domain, and computational complexity needed for feature extraction. 
Thus, we experimentally showed that our method is practi- cally usable in inexpensive systems without expensive, high performance GPUs and mass storage devices. 1.5 Dissertation outline The rest of this dissertation is organized as follows.: In Chapter 2, we review the related work on image set representations and image sampling approaches for face recognition. Then, in the first part of this dissertation, we explain our image sampling method on deep learning approaches in Chapter 3, and then, evaluate recognition accuracy and template size of our representation on mixed-image protocols of the Janus CS2 and IJB-A in Chapter 4. In the second part of this dissertation, we apply our sampling method to video-based face recognition. In Chapter 5, we explain our video-based face recognition systems. In Chapter 6, we evaluate recognition 12 accuracy, image storage capacities and feature extraction time of our representation on the video protocols of the Janus CS3 and IJB-B. Finally, in Chapter 7, we conclude this dissertation. 13 Chapter 2 Related Work In this chapter, we review the image set based approaches which provide various set representa- tions and set-to-set similarity comparisons to deal with multiple images and/or video frames of a subject. Then, we review the various frame selection approaches used to sample more informative face images while discarding less-informative face images based on their own criteria. 2.1 Image set based approaches Image set based settings contain two main tasks: one is to represent an image set which is typ- ically composed of multiple images of a person from a single video. The other is to develop a set-to-set distance or similarity metric. Possibly, the simplest approach is to store all the images or feature vectors of each set, and then, to measure the distance between two sets by aggregating the distances among all cross-set image pairs (e.g., min-dist [141]). More elaborate methods can be categorized into parametric and non-parametric approaches. 14 Parametric approaches represent an image set as parametric distribution functions, such as single Gaussian [119] or Gaussian mixture models [8], and then, compute a similarity score of two image sets with Kullback-Leibler (KL) divergence. However, these methods require accurate parameter estimation of the distribution functions of the image sets and strong correlation between the test sets and the training sets [134, 78]. Non-parametric approaches require large training samples with small sample varia- tions to approximate the real-world face photos to targeted distributions such as a linear subspace and a manifold. They are categorized into subspace methods, manifold method, affine and convex hull methods, and covariance matrix-based methods.: Subspace methods represent image sets as linear subspaces [12, 13, 52, 66, 78], assum- ing that all the elements of a set lie close to a linear subspace. The methods propose a computa- tionally efficient set to set distances/similarities by measuring mapping distances from subspaces to points [13] or angles between two subspaces (canonical angles, principal angles or canonical correlations) [52]. Real-world photos of faces, however, rarely lie on linear subspaces due to their complex variations. Using such subspaces representation may lose substantial amounts of information, and thus, can cause degradation of recognition capabilities. 
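As a brief illustration of the subspace similarity measures cited above, the sketch below computes the principal (canonical) angles between two linear subspaces via the SVD of the product of their orthonormal bases. The feature dimensionality and the subspace rank are arbitrary choices for the example, and this is a generic formulation rather than the exact procedure of any one cited method.

```python
import numpy as np

def principal_angles(A, B):
    """Principal (canonical) angles between the column spaces of A and B.

    A, B: (d, k) matrices whose columns span each image set's subspace, e.g.
    the top-k PCA components of the set's feature vectors.  The singular
    values of Qa^T Qb are the cosines of the principal angles; a common
    set-to-set similarity is the cosine of the smallest angle.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# Toy example: two image sets summarized by rank-3 subspaces of 128-D features.
rng = np.random.default_rng(1)
angles = principal_angles(rng.normal(size=(128, 3)), rng.normal(size=(128, 3)))
print(np.degrees(angles))
```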
Manifold-to-manifold methods [32, 67, 91, 134, 57, 58, 50] represent real-world images as non-linear manifolds, when image sets cannot be well approximated to a linear subspace due to large sample variations (e.g. different viewpoints). The methods calculate the manifold-to- manifold distance (MMD), typically involving manifold learning techniques which express non- linear high dimensionality subspace (a manifold) to a collection of local linear models. Still, 15 the performance of the methods may deteriorate much when the image sets do not have enough samples, or sample variations are too big. Convex hull methods [23] and Affine hull methods [62] express each image as a point in a linear or affine feature space and each image set as the affine or convex hull spanned by its feature points, proposing geometrically closest distances between affine or convex models as set-to-set dissimilarity. When many images without large sample variations are available in each set, these hulls can be well defined, making the system effective. In covariance matrix based methods [133], an image set is represented by a covariance matrix, proposing a Riemannian kernel function which maps a covariance matrix from the Rie- mannian manifold to a Euclidean space to measure the similarity between two image sets. While the method has advantages, in that the kernel function can be applied to any discriminant learning method for image set classification(e.g. [69, 22]), the disadvantages are computational costs for matrix decomposition and the curse of dimensionality given few training images. Besides these, various set representations such as the bag of features [83, 33, 90], Fisher vectors [108] and Vector of Locally Aggregated Descriptors (VLAD) [70] have been proposed. 2.2 frame selection approaches for face recognition Using many images rather than one increased recognition accuracy [97]. Using myriad video frames and their pairwise comparisons could be the simplest way for face recognition systems. However, using them all surely increases the computational complexities for feature extraction 16 and feature comparisons. In addition, when the image sets include non-face images and non- informative face images, their spurious matches with noisy features increases the possibility of degrading the recognition accuracy. Frame selection approaches [11], what we will introduce in this chapter, have been studied mainly on video-based face recognition including quality oriented selection approaches, weighted selection approaches, diversity oriented selection approaches and combination approaches [11]. Their goal is to make the recognition system efficient by discarding outliers and non-informative frames. Quality based selection algorithms removed low quality face images completely, defin- ing them with own criteria. Berrani et al. [16] eliminated outliers with robust PCA (RobPCA) [68] and discarded not-well-cropped, not-well-aligned, and non-frontal face images with their own noisy image classification methods. Best-Rowden et al. [17] got rid of low confident face images considering faceness (confidence level that the object in an image is a human face) and frontalness (how much frontal the head pose is). Park et al. [104] eliminated images with large motion blur, estimating the Discrete Cosine Transform (DCT) coefficients. Anantharajah et al. 
[7] presented quality score normalization and fusion considering a variety of cues (face symmetry, sharpness, contrast, closeness of mouth, brightness and openness of the eye) to select the highest quality im- ages. They assumed that low quality faces did not convey proper facial features (e.g. texture-less face images) or did not exist in gallery or training databases. Weighted selection approaches assumed that even a bad frame might hold usable in- formation and could contribute to increasing recognition accuracy. Thus, instead of discarding the "bad" frames, they inferred a contribution level of each image to the recognition accuracy and progressively combined the individual classification result of all frames to one score referring to 17 the level. Zhang et al. [153] assigned quality scores with the weighted probabilistic appearance- based algorithms. They assigned higher weights for more similar images to training images in terms of pose and expression. As for occlusions, they divided an image into local areas, and then, combined all the recognition results from each of the local sections of all images. Localization errors were estimated with the multivariate Gaussian distributions from the subset generated un- der all possible localization errors of each of the training images. Stallkamp et al. [121] proposed three measures (i.e. distance-to-model (DTM), distance-to-second closest (DT2ND), and their combinations) on a local appearance-based classification. They obtained the confidence levels to reduce the influence of a test sample which was dissimilar to the model, or which possibly matched multiple identities on recognition accuracy. Still, these weighted selection methods are similar to quality based methods in the sense that they regard gallery/train images as their standard of high-quality images. Diversity oriented selection approaches extracted representative face samples (exem- plars) from a video sequence or an image set. The exemplars [82] preserved significant intra-class variations by clustering or partitioning a set of images into several groups, where each group con- tained images with a particular mode of variation such as pose and expression. Hadid et al. [51] compressed high dimensional non-linear face manifolds into low-dimensional face space using the Locally Linear Embedding (LLE) [114] algorithm considering similar appearances, and then, defined their face models as the cluster centers with K-means clustering. Fan et al. [43] ap- proximated geodesic distances between face appearance manifold using local neighborhood in- formation, and then, applied a hierarchical agglomerative clustering algorithm to group similar 18 images together. However, these methods often failed when the statistical distribution between train images and test images are not well correlated. Lastly, combinations of the previous selection approaches were studied empirically. Thomas et al. [131] proposed N-frame representation considering quality and diversity, measur- ing the faceness (how similar the object is to human faces) and the projected distances on PCA spaces. They showed that using high quality and diverse images improved recognition accuracy more than using only high quality images or using only diverse images. Xiong et al. [144] incre- mentally constructed databases by adding high quality and previously unseen images considering the contribution of each image to the recognition accuracy and the data redundancy. 
High qual- ity scores were assigned when the face images were high resolution, when the head poses were close to near frontal, or when the images had not appeared in the databases previously (e.g. im- ages including previously unseen subjects). However, these approaches regarded the near frontal faces and high quality faces as their reference images, which poorly reflected the real-world face images. To sum up, previous frame selection approaches have discarded low quality images. Or, they have assigned low weights to low quality images to decrease their influences to recognition accuracy. Also, they selected various samples considering a variety of poses and expressions, using them as representative images. However, all these approaches had inherent limitations because they assumed that real-world images could be well summarized by the training sets, and that the databases had correlated statistical distribution, even though the training sets were mostly composed of near frontal, neutral expressive and high quality images. 19 Part I Image sampling for face recognition on deep learning approaches 20 Chapter 3 Template representation by sampling a few representative images We divide this chapter into six parts. In Sec. 3.1, we introduce our motivation and our overall approach. In Sec. 3.2, we explain how we prepare the face images before sampling. In Sec. 3.3, we explain how to sample representative face image sets within a fixed data volume, when given many face images of a person. In Sec. 3.4, we show how to train the conventional deep neural networks using the sampled face image sets. In Sec. 3.5, we describe how to compare two tem- plates using the deep features introducing simple matching methods. In Sec. 3.6, we explain how we make our deep features more robust to variations in pose and quality. 3.1 Motivation and overall approach 3.1.1 Representative face image sets Using multiple face images of a person to recognize the person’s identity is not efficient in terms of computational complexity even though it may increase the recognition accuracy [97]. Because 21 of the reason, we propose to use only a few representative face images of a person by sampling them from a set of face images of a person considering pose and quality. Figure 3.1: Comparison of the statistical distribution of pose yaw and image quality in the CASIA-WebFace [150] and IJB-A [79]: (a) pose yaw, (b) image quality, The distributions shows that the newly published IJB-A contains much more various face images in head poses and image quality than the previous CASIA-WebFace. Quantization of face image spaces Recent template-based Janus CS2 and IJB-A benchmarks [79] provided a much wider variety of facial appearances of a person in pose yaw and image quality than the previous CASIA- WebFace [150], as shown in Fig. 3.1. Regarding a wide distribution of head pose yaw in the new database [79], National Institute of Standards and Technology (NIST) has reported that the existing face recognition systems are influenced by the difference in pose yaw between gallery sets and probe sets [102]. In addition, previous studies [18, 45, 39] on image quality have re- ported that low quality images can also contribute to recognition accuracy depending on how the recognition systems are organized, referring to the fact that a similarity score is a pair matching 22 issue. Inspired by these factors, we propose to quantize the face image spaces in terms of pose yaw and image quality. By doing so, while previous works in Sec. 
2 [68, 17] focused on exploiting near-frontal face images or high-quality face images, the proposed method was able to use diverse facial appearances across the bins, as shown in Fig. 3.2.(b) and (c).

Figure 3.2: An example of our framework: IJB-A template [194]. (a) An average face: an average pooled face image of all face images in a template. (b) Pooled faces: average pooled face images of the face images in each pose and quality bin of a template. (c) Randomly selected faces: randomly selected face images in each pose and quality bin of a template. We used in-plane aligned real images to make these images. The quality of (a) is much lower than that of (b) and (c). In (b) and (c), we partitioned the face image space depending on head pose yaw and image quality; in each bin, only similar face images are considered together.

Pooled and randomly selected face images

Once face images are assigned to their pre-defined bins according to pose and quality, we first consider a pixel-wise average pooled face image as our representative face sample for a template, as defined in Equation (3.1) [56]. However, as shown in Fig. 3.2.(a), averaging all the images in a template discards much of the original facial information. Also, none of the previous studies have ventured so far as to propose using the 1st-order statistics alone, because higher-order statistics [92] and/or metric learning schemes [139, 49, 92, 66, 67, 91] capture the variations in facial appearance much better. To successfully replace those expensive methods, we propose to average only the locally similar face images within each pose and quality bin, separately preserving the common facial features of each combination of pose and quality, as shown in Fig. 3.2.(b).

A pooled face: Let a (gallery or probe) face template be represented by the set of its member images $\mathcal{F} = \{\mathbf{I}_1, \ldots, \mathbf{I}_N\}$, where the face images are in RGB image space, $\mathbf{I}_i \in \mathbb{R}^{n \times m \times 3}$, and each member image is aligned for translation and scale by being cropped with the bounding box centered on the face and re-scaled to a common size. Then the 1st-order statistic of this set is simply defined as:

$\bar{\mathcal{F}} := \mathrm{avg}(\mathcal{F}) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{I}_i$    (3.1)

Yet the pooled faces can still lose facial details, as shown in the bottom-right image of Fig. 3.2.(b). Inherently, averaging represents only central tendencies and averages out subtle facial appearances. In addition, the averaged images may not preserve even the common facial appearances well when the images are not well aligned with each other (e.g., with low-performance alignment techniques) [56]. In such cases, a randomly selected face image, as in Equation (3.2), can better capture facial features, as shown in the bottom-right image of Fig. 3.2.(c).

For these reasons, we took the combination of pooled faces and randomly selected faces as our representative face images. In each pose and quality bin, the pooled face obtained high-quality common facial features such as the eye, nose, and mouth areas, while a randomly selected face image compensated for the possible loss of facial detail in the pooled face image. By presenting these two images in each bin as our representative facial image set, we complemented the weaknesses of both the pooled faces and the randomly selected faces.

A randomly selected face: Let $K$ be a random variable with the discrete uniform distribution over the set $\{1, 2, \ldots, N\}$, given a face template $\mathcal{F} = \{\mathbf{I}_1, \ldots, \mathbf{I}_N\}$ with $\mathbf{I}_i \in \mathbb{R}^{n \times m \times 3}$ as in Equation (3.1).
The randomly selected face image of this set is simply defined as:

$\mathcal{F}_r := \mathbf{I}_K, \quad K \in \{1, 2, \ldots, N\}, \quad \mathrm{Prob}(K = k) = 1/N$    (3.2)

Overview of representative facial image sets

We devised pooled and random faces (PRF) as the representative facial images for a subject, as shown in Fig. 3.3. In advance, we partitioned the face image space sparsely and disjointly into bins indexed by pose and quality, i.e., (pose_i, quality_j). Then, given the face images in a template, we assigned each face image to the corresponding pose and quality bin according to its estimated pose yaw and image quality. In each bin, we discarded images that are less similar to human faces (less-human-face-like images), referring to their landmark confidence levels (CFD), which will be explained in Sec. 3.2.1. Then, we took a pooled face image and a randomly selected
In addition, pooling is already proven to be an effective way of preserving invariant informa- tion among multiple feature vectors (e.g. pose [122] and video ID [93]), achieving higher recognition accuracy. Figure 3.4: Our proposed solution for robust feature vectors: (Upper-Left): Conceptual de- scription. Our baseline of the matching method "All vs. All" calculates the similarity score of two feature vectors, regardless of which pose and quality bins they belong to. (Upper-Right): Con- ceptual description. Because all feature vectors are mapped into a pose-and-quality free space, comparing two feature vectors is fair enough, regardless of how much different pose and quality properties they have originally. (Lower-Left): We assume that the deep feature vectors contain the specific portions produced by pose and quality properties. (Lower-Right): By eliminating the specific portions from all feature vectors, we can generate insensitive feature vectors to the pose and quality variations. (Overall): By realizing this concept, we map all feature vectors into a single unified pose-and-quality free space. 28 However, because both "All vs. All" and "Matching pooled features" are not carefully designed to deal with all different facial information, as mentioned above, applying them to our sampling structures does not seem to be fair. The "All vs. All" method compares image pairs with visibly different pose and quality in our sampling structures. In the "Matching pooled features" method, pooling itself adds up all feature vectors assigned in different pose and quality bins. Both methods ignore our bin information and do not exploit already well categorized pose and quality information, requiring an effective way to use our sampling structures during comparing two templates. On the conditions, we suggest unifying the quantized face image spaces into a single face image space which is free to the variations in pose and quality properties, as shown in the up- per right image of Fig. 3.4. Inspired by the fact that the head pose and image quality are common properties of all human face images, not unique properties of a person’s appearance, we assume that it is possible to extract the specific portions related to the pose and quality information, as shown in the lower left image of Fig. 3.4. To realize the idea, we introduce two bin equalization methods, using "Bias" in each bin and "Variances" among bins, as explained in Sec. 3.6. With the methods, we successfully eliminate the disparities of pose and quality among all bins, as the concept shown in Fig. 3.4, increasing the recognition accuracy on the baseline matching methods above. As results, we generate the robust feature vectors which are insensitive to the variations in pose and quality. 3.1.3 Overview of our system In this part, we roughly introduce the overall system pipeline as shown in Fig. 3.5. 29 Overview When given face images in a template, the face images are aligned and analyzed in the Alignment Block. Then, the representative face image sets of a template are sampled in the Image Sampling Block. In the Robust Features Block, the sampled images are encoded into deep feature vectors using the fine-tuned conventional convolutional neural network (CNN). The feature vectors are then mapped into a pose and quality free space, making them more robust to the variations in pose and quality. Then, in the Matcher Block, we measure similarity of two templates using the deep features, which belong to pose and quality free space. 
Brief introduction of each block is explained below, though all the details will be explained in the later sections. Figure 3.5: Overview of our system. 30 Alignment Block In the "Alignment Block" of Fig. 3.5, When face images and their face boundary boxes of a per- son are given, the Landmark Detection Block 1 [76] provides facial landmarks on each 2D face image with their average of confidence levels, predicting how similar the detected face image is to human face. The Face Normalization Block 2 not only compensates the in-plane rotation (roll) of the face image, but also normalizes the face size depending on roughly estimated 3D head pose with an affine to orthographic approximation [46]. The Pose and Quality Estimation block 3 provides both pose and quality information of the face images. More accurate 3D head poses are estimated by the projective-n-point (PnP) method [46]. Also, objective image quality is inferred by the spatial and spectral entropy-based quality (SSEQ) index [87, 86]. Depending on the esti- mated head pose yaw, the normalized face images in a template are grouped into three distinctive views: the frontal view, the half-profile view, and the profile view. The pose variations within each group are compensated in the In-plane Alignment Block 4 . The details will be provided in Sec. 3.2. As results, the Alignment Block provides three types of pre-processed face images to the following Image Sampling Block, together with several facial information parameters (2D- landmarks, landmark confidence level, pose, quality). Image sampling The Image Sampling Block are designed to provide twenty different types of face appearances depending on pose yaw and image quality. The pre-processed images are assigned to predefined disjoint pose and quality bins in the Quantization Block 5 . As results, we keep visibly different 31 facial appearances between the bins, as well as visibly similar facial appearances within each bin. Then, in the Filtering Block 6 , less similar images to human faces are discarded, referring to the landmark confidence levels (CFD). In the Selection Block 7 , we are able to compress similar facial information by pooling facial images within each bin. We are able to compensate for the loss of facial details because of the pooling method by keeping one pre-processed face image in each bin. As results, we complete our representative facial image sets within a fixed data volume. All of the details are explained in Sec. 3.3. Robust features In the "Robust Features" Block of Fig. 3.5, the representative face images are encoded into deep feature vectors insensitive to the variations in pose and quality. In the Feature Extraction block 8 , we encode the images into deep feature vectors. The deep convolutional neural networks (CNN) are fine-tuned on the modified training sets exploiting our sampling structures. In the Bin Equal- ization Block 9 , the feature vectors which belong to different pose and quality face image space are mapped into a single unified face image space which is free to the variations in pose and quality. The details are described in Sec. 3.6. Matching In the "Matcher Block", we calculate the similarity scores of two templates with "All vs. All" and "Matching pooled features" methods, which introduced in Sec. 3.1.2. In the process, to increase 32 the recognition accuracy, we apply power normalization and PCA (principal component analysis) adaptation. 
In addition, we separately calculate the similarity scores of the pooled face images and the randomly selected face images, and then, averagely fuse the two scores, to preserve the characteristics of two different types of representative images. More details will be explained in Sec. 3.5. 33 3.2 Pre-processing: Alignment Alignment is one of the most important parts of face recognition systems in increasing the recog- nition accuracy [64, 63, 55]. As one of the methods, researchers have long exploited high- performance facial landmark detectors [9, 21, 112], expecting that accurately locating fiducial facial features may increase the recognition accuracy. The detected landmarks are used as cor- respondence points among face images to map non-canonical face images to a canonical form, which makes it easy to compare one face image to another [9, 21, 112]. The landmarks are not only used to normalize the face image sizes, but also used to normalize a variety of head poses [130, 155], decoupling the facial variances which are not related to a person’s identity. In the Alignment Block of Fig. 3.5, we also used a landmark detector and its detected facial landmarks to align face images. As shown in Fig. 3.6, instead of using computationally expensive and very accurate 3d alignment techniques [55, 56, 24], we exploited a series of 2d warping methods as pre-processing. Although the methods are computationally inexpensive and less accurate, applying them twice over two stages allowed us to get the sufficiently clear pooled face images, as shown in Fig. 3.2-(b). In addition, in face recognition, facial characteristics have been used in many different ways to increase recognition accuracy or to reduce computational complexity, as introduced in Sec. 2. While aligning face images in the Alignment Block, we also measured facial image characteristics such as landmark confidence level, head poses, and image quality. In the Image Sampling Block of Fig. 3.5, landmark confidence levels were used to remove face images that are less similar to human faces, and head pose yaw and image quality were used as indicators to classify face images. 34 The details of the Alignment Block are shown in Fig. 3.6. The main steps are com- posed of "Landmark Detection => Face Normalization => Pose and Quality Estimation => In- plane Alignment" in order. The outputs are "an aligned face image, facial landmarks, landmark confidence level, head pose yaw, and image quality score". The detailed explanations are given in the following sections. Figure 3.6: Overview of the Alignment Block: (a) four main steps in the Alignment Block, (b) the detailed functional blocks in the Alignment Block. When given a face image, the Alignment Block outputs an aligned face image and its analyzed characteristics. 35 3.2.1 Landmark detection Landmark detection (LD) Well-aligned face images have used to increase the recognition accuracy [63, 64, 105, 118], as mentioned in the beginning of this section. To align face images better, more accurate meth- ods of estimating facial landmarks [9, 112, 21, 73] has developed. In this section, we took one of the state-of-the-art facial landmark detectors, the Holistically Constrained Local Model (HCLM) [75], which was developed on Constrained Local Neural Fields (CLNF) [10]. The CLNF [10] is one of the state-of-the-art landmark detection methods, categorized as a local approach. 
The method applies a local neural field (LNF) patch expert to a Constrained Local Model (CLM) to learn non-linear, spatial local response maps between input pixels and the probability of a landmark being aligned. In addition, the CLNF optimizes the CLM using the Non-uniform Regularized Landmark Mean-Shift (NU-RLMS) technique, in light of the observation that certain response maps are noisier than others. By doing so, the CLNF achieves high performance in localizing facial features, especially in poor lighting conditions and in the wild. The HCLM [75] further improved the accuracy of the CLNF by using the sparse holistic landmarks generated by a holistic convolutional neural network (CNN) as the initial anchor points for the dense CLNF landmarks. By doing so, the HCLM [75] achieves high accuracy in landmark position estimation even under very large head pose variations, including profile poses. Because the recent real-world IJB-A [79] face database shows a large head pose distribution, we applied the HCLM, which is robust to head pose variations, to the real-world 2D face images cropped by their face bounding boxes, and were able to obtain more accurate $n$ facial points, $x_i \in \mathbb{R}^2,\ i = 1, \dots, n$, for each face image.
Landmark confidence level (CFD)
The HCLM landmark detector [75] provides not only the locations of the landmarks, but also a landmark confidence level (CFD). The CFD estimates how reliable the locations of the detected landmarks are, on average, with respect to the locations of the actual landmarks, suggesting how similar a given face image is to a human face. The CFD was estimated as follows:
$$p(x \mid X_s, I) \propto p(x) \prod_{i=1}^{n} p(x_i \mid X_s, I) \qquad (3.3)$$
The likelihoods of the landmark positions were conditioned on a set of sparse landmarks $X_s = \{x_s \mid s \in S\}$, where $|S| \ll n$, and an image $I$, as in Eq. (3.3). A Gaussian prior distribution $p(x)$ over the set of landmarks $x$ was imposed on the non-rigid shape parameters of a 3D point distribution model (PDM) with an orthographic camera projection. The probabilistic patch expert $p(x_i \mid X_s, I)$ produced a response map describing probabilistic predictions of the landmark alignment. To optimize the likelihoods in Eq. (3.3), the HCLM used the Non-Uniform Regularized Landmark Mean-Shift (NU-RLMS), which iteratively computes the patch responses and updates the point distribution model (PDM) parameters. The landmark confidence level (CFD) was then calculated by taking the sum of the log probabilities over all of the model likelihoods from each response map and normalizing them.
In Fig. 3.7, the detected landmarks are shown on each given face image. In particular, examples of face images with a high confidence level (CFD > 0.7) are displayed in the upper row (a), and examples of face images with a low confidence level (CFD < 0.4) are displayed in the bottom row (b). These examples visually show that face images with a high landmark confidence level have more reliable, more accurately positioned landmarks than face images with a low landmark confidence level. In the following sub-section, the locations of the detected landmarks will be used to align face images and to estimate head poses. In addition, the landmark confidence levels (CFD), which represent the reliability of a landmark being aligned, will be used to filter out less useful face images in the Image Sampling Block of Sec. 3.3.
Figure 3.7: Examples of face images with detected landmarks and landmark confidence levels: (a) with a high landmark confidence level (CFD > 0.7), (b) with a low landmark confidence level (CFD < 0.4).
3.2.2 Face normalization
Once we have detected facial landmarks on a given face image, we normalize the face image in the Face Normalization Block, which is composed of "Roll Compensation => Affine Pose Estimation => Face Size Normalization", in that order, as shown in Fig. 3.6. Each sub-block was designed to achieve system stability. The Roll Compensation Block was used to compensate the head pose roll, which does not change the actual facial appearance on the image plane. The Affine Pose Estimation Block and the Face Size Normalization Block were used to weakly align face images in advance of the Pose Estimation Block and the In-plane Alignment Block in Fig. 3.5. This was done to prevent the PnP pose estimation error from diverging, and thus to reduce the failure ratio of the alignment throughout the whole Alignment Block. The details are explained in the following.
Roll Compensation
The purpose of the Roll Compensation Block is to avoid the Gimbal Lock phenomenon [138], which produces unintuitive head pose values that cause system errors. The phenomenon happens when one degree of freedom of the three-dimensional rotation axes is lost under certain conditions, but it never happens when one rotation axis is fixed. In addition, because the roll of the head pose does not affect the actual facial appearance, we compensated the in-plane roll angle to zero degrees.
Given the detected 2D landmarks, including the landmarks that should not be visible because of self-occlusion, we generated the lines connecting each pair of left-right symmetric landmarks around the two eyes and the mouth on the face image. Then, we calculated the roll angle by averaging the angles between each extracted line and the horizontal line of the image plane. Finally, we compensated the estimated roll angle by making the lines between left-right symmetric landmarks parallel to the horizontal image plane, rotating the face image by the negative roll angle.
Affine Pose Estimation [46]
The affine pose estimation algorithm has the advantage that it does not require accurate intrinsic camera parameters, compared to the Perspective-n-Point (PnP) pose estimation method. Thus, it was suitable to apply the affine model to real-world face images, which were gathered on multiple unknown capturing systems. The model assumes that the object is flat, for the case in which the depth of the object is so much smaller than the distance to the camera that it can be ignored. This assumption allowed the method to estimate head poses stably, even for near-profile head poses. Even though the affine method produced larger pose estimation errors than the PnP method overall, its high success ratio in estimating head poses was a great benefit, in the sense that the face recognition system was able to use more face images, which may contain usable/useful facial information.
In the implementation, the landmarks on the face outline were not used because of their large detection errors [10]. Then, the 3D landmarks $P_i \in \mathbb{R}^3,\ i = 18, \dots, 68$, which correspond to the 2D landmarks $p_i \in \mathbb{R}^2,\ i = 18, \dots, 68$ on the 2D face image, were annotated on a generic 3D face model [42]. An affine matrix $A$ and a 2D translation $t$ were estimated by minimizing the L2 distance between the detected 2D landmarks and the projected 3D landmarks, as shown in Eq. (3.4).
The affine matrix $A$ was further factorized into the scaling matrix $s$ and the rotation matrix $R$ using QR factorization, as shown in Eq. (3.5). Then, the rotation matrix provided the 3D head pose [53], i.e., the pitch, yaw and roll angles, by calculating Euler angles [36] under the condition that the 3D human face is located in front of the camera.
$$A, t = \operatorname*{argmin}_{A,\,t} \sum_{i=18}^{68} \left\| p_i - A P_i - t \right\|^2 \qquad (3.4)$$
$$A = sR = \begin{bmatrix} s_1 & s_2 \\ s_3 & s_4 \end{bmatrix} R \qquad (3.5)$$
Face Size Normalization
The real-world face images were captured in a variety of different environments using many different camera systems. Also, large variations in head poses produced visibly different distances between fiducial facial features. Since these factors are not related to a person's identity, we sought to reduce their influence on the normalized face images. To do that, the main tasks of the Face Size Normalization Block were to standardize the scaling and skew parameters of the intrinsic camera parameters and to apply a different normalization reference to the face image depending on its view. By doing so, we were able to obtain consistent face image sizes that were less dependent on the camera parameters.
Affine-to-Orthographic approximation: We first approximated the affine camera model by the orthographic camera model to reduce the effect of different capturing environments and devices on the normalized face images. Here, the quantity of interest was the scaling matrix, which contains the camera skew parameters and the camera scaling parameters. The camera skew parameters, which are the off-diagonal elements of the scaling matrix $s$ in Eq. (3.5), include the shear distortions of a camera and even the movement of the subject. The camera scaling parameters, which are the diagonal elements of the scaling matrix $s$ in Eq. (3.5), include the influence of the camera field of view. Therefore, because the skew parameters and the scaling factors were not purely induced by the 3D shape of the human face, we approximated those parameters as follows. We set the camera skew parameters to zero, so that $s_2 \approx 0$ and $s_3 \approx 0$, and we replaced the vertical and horizontal scaling parameters with their average, so that $s_1 \approx s_4 \approx s = 0.5(s_1 + s_4)$ in Eq. (3.5). Finally, we completed the affine-to-orthographic approximation as shown in Eq. (3.6), reducing the influence of environmental factors, which are not related to the human face shape, on the normalized face images.
$$s = \begin{bmatrix} s_1 & s_2 \\ s_3 & s_4 \end{bmatrix} \approx \begin{bmatrix} s & 0 \\ 0 & s \end{bmatrix}, \quad \text{where } s = 0.5\,(s_1 + s_4) \qquad (3.6)$$
Normalizing face image size depending on the views: We then normalized the face image size depending on the view. Using a single normalization distance between the two eyes was good enough when databases were mostly populated with near-frontal face images. However, the large variations in 3D head pose in the recent real-world IJB-A [79] face database cause the on-image distance between the two eyes to vary widely. In particular, the larger the head pose yaw, the shorter the distance between the two eyes in the face image. Thus, a reference eye-to-eye distance that varies on the image plane depending on the view produces inconsistent normalized face image sizes. Therefore, we decided to use more than one normalization distance to generate a consistent face image size for all views.
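Before turning to the implementation details of the size normalization, the following is a minimal NumPy sketch of the affine pose estimation and scale approximation of Eqs. (3.4)–(3.6) above. It is an illustrative reconstruction, not the dissertation's actual code: the function name, the specific QR-based factorization, and the Euler-angle convention are assumptions.

```python
import numpy as np

def affine_pose(p2d, P3d):
    """Fit p ~= A P + t over 2D-3D landmark pairs (Eq. 3.4), factor A = s R
    (Eq. 3.5), and apply the affine-to-orthographic approximation (Eq. 3.6).
    p2d: (N, 2) detected 2D landmarks; P3d: (N, 3) generic-model 3D landmarks."""
    N = P3d.shape[0]
    X = np.hstack([P3d, np.ones((N, 1))])            # homogeneous 3D points, (N, 4)
    sol, *_ = np.linalg.lstsq(X, p2d, rcond=None)    # least-squares solution, (4, 2)
    A, t = sol[:3].T, sol[3]                         # A: (2, 3) affine, t: (2,) translation

    # Factor A = s R: with A^T = Q U (QR), A = U^T Q^T, so s = U^T and the
    # first two rotation rows are Q^T (signs may need fixing in practice).
    Q, U = np.linalg.qr(A.T)                         # Q: (3, 2), U: (2, 2)
    s, R2 = U.T, Q.T
    R = np.vstack([R2, np.cross(R2[0], R2[1])])      # complete the 3x3 rotation

    # Affine-to-orthographic approximation: zero skew, averaged isotropic scale.
    s_iso = 0.5 * (s[0, 0] + s[1, 1])

    # Yaw from R under one common Euler-angle convention (an assumption here).
    yaw = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2])))
    return A, t, s_iso, R, yaw
```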
In the implementation of the Face Size Normalization Block, we only considered the yaw angle among the three Euler angles. The roll angles were already compensated in the Roll Compensation Block. The distribution of pitch angles in the IJB-A database [79] showed a much smaller variation than that of yaw angles. Thus, apart from the yaw angle, the other Euler angles could safely be ignored while normalizing face images. We then empirically categorized the head pose yaw into two groups, a near-frontal view and a near-profile view, by setting the dividing reference yaw angle to 30°, depending on whether both eyes were visible or not. For the near-frontal view, we assigned the two visible eyes as the reference landmarks. For the near-profile view, we assigned the center of the two eyes and the nose base as the reference landmarks, assuming that the distance between them represents the nose length. Then, we matched the distance between the two reference landmarks with a pre-determined face normalization distance for each view, keeping the size of the face image consistent as if the distance between the human face and the camera were always constant. Depending on the yaw angle from the rotation matrix in Eq. (3.5), a near-frontal face image with $|yaw| \le 30°$ was normalized by setting the eye-to-eye distance to 68 pixels, and a near-profile face image with $|yaw| > 30°$ was normalized by setting the distance between the center of the two eyes and the nose base to 46 pixels. The center of the two eyes of the normalized face image was then located at (320, 400) on the image plane, assuming that the principal axis of the camera passes through the center of an image of size 500 x 500, placing the normalized face at a pre-determined location on the image plane.
3.2.3 Face Image Analysis: Pose and Quality Estimation
After normalizing the given face images, we estimated the head poses and image quality of the face images in the Pose and Quality Estimation Block, as described in this section. Because these facial image characteristics vary widely in recent face databases, as shown in Fig. 3.1, we used them as the representative facial image characteristics that influence facial appearance in the following Image Sampling Block.
Pose Estimation
We estimated the head poses using the Perspective-n-Point (PnP) pose estimation method [53], which solves the distance minimization problem over the correspondence pairs $(p_i, P_i)$, given the 2D landmarks $p_i \in \mathbb{R}^2,\ i = 18, \dots, 68$ on the 2D face image and the corresponding 3D landmarks $P_i \in \mathbb{R}^3,\ i = 18, \dots, 68$ on a generic 3D model in a world reference frame. However, using the PnP camera model $M = K\,[R \mid t]$, where $K$ is the intrinsic camera matrix, $R$ is the rotation matrix, and $t$ is the camera translation, requires calibrated intrinsic camera parameters, although the real-world face images were gathered from multiple capturing systems with unknown camera parameters. Therefore, the direct linear transformation (DLT) method [53], the traditional PnP method, frequently re-projected the corresponding 3D facial landmarks outside the image plane and did not stably estimate the extrinsic camera parameters, including the rotation matrix $R \in \mathbb{R}^{3 \times 3}$, because of the unknown and possibly varied intrinsic camera parameters. Hassner et al. [55] stably performed the PnP method [53] on real-world face databases by using face images rendered with arbitrary constant intrinsic camera parameters.
They used the corresponding pairs between the detected 2D landmarks $p_i \in \mathbb{R}^2,\ i = 1, \dots, 68$ on the input 2D image and the detected 2D landmarks $\hat{p}_i \in \mathbb{R}^2,\ i = 1, \dots, 68$ on the rendered view produced with the code of [54], and obtained the 3D coordinates $\hat{P}_i \in \mathbb{R}^3,\ i = 1, \dots, 68$ of these points on the generic face model. By doing so, they gained system stability, so that the re-projected face image was located within the image plane and the pose estimation error did not diverge. However, the method relied on an expensive rendering engine.
Instead of using rendered face images, we used the normalized face images and an approximated affine-to-orthographic intrinsic camera model, as described in Sec. 3.2.2. Assuming all subjects are at the same distance from the camera compensated for the unknown scaling parameters of the approximated affine-to-orthographic intrinsic camera model. Finally, we performed PnP head pose estimation stably without using the same intrinsic camera parameters for all scenes, unlike [55], and better reflected the variety of capturing environments in the estimated intrinsic camera parameters.
Image Quality Estimation
Face quality scores have been studied for more than a decade [60, 4, 109, 19, 2, 41, 74, 132] for face recognition systems, and examples were already introduced in Sec. 2. These methods assign higher scores to face images with more reliable facial features for face recognition systems, and use the estimated scores to increase recognition accuracy and/or to reduce computational complexity. However, they assumed that the gallery databases, which were much more heavily populated with frontal face images, had the same statistical distribution as the probe databases. For example, such face quality assessment methods rate profile face images as low-quality face images even though profile face images contain useful facial information indicating a person's identity. Therefore, we considered using an objective image quality assessment method that is independent of the statistical correlation between databases.
Objective image quality scores have been developed to quantify the degree of image distortion in any image, including face images. Objective image quality assessment (IQA) metrics are classified into three groups depending on the availability of reference images: full-reference (FR), reduced-reference (RR) and no-reference (NR). We only considered no-reference image quality assessment (NR-IQA) because there are no distortion-free ground-truth images for real-world face databases. Next, no-reference image quality assessment (NR-IQA) is divided into two groups according to how many image distortion types are considered [86]: distortion-specific quality assessment and general-purpose quality assessment. Distortion-specific quality assessment methods only quantify a specific distortion, such as JPEG distortion [137] or blurriness [35, 31].
Image distortion directly degrades the quality of facial image features, so methods to reduce the negative effects of image distortion on the performance of face recognition systems [6, 103, 38, 72] have been studied. For example, [48, 6, 37] proposed robust features that are less affected by blur. Also, [45, 37] proposed adaptive pair comparisons depending on blur. By doing so, they improved recognition accuracy on face databases that include many blurry face images.
However, the previous studies did not consider multiple image distortions at once, and therefore did not reflect real-world scenarios well. We therefore required a general-purpose image quality metric that predicts a quality score while taking various distortions into account simultaneously. Examples of such general-purpose NR-IQA methods are DIIVINE [99], BLIINDS-II [116], BRISQUE [98], SSEQ [86], and more. DIIVINE [99] performs distortion identification followed by distortion-specific quality assessment, using natural scene statistics (NSS) features in the wavelet domain to predict image quality scores. BLIINDS-II [116] uses natural scene statistics (NSS) features in the discrete cosine transform (DCT) domain and a simple Bayesian inference model to predict image quality scores. BRISQUE [98] is based on an NSS model in the spatial domain, exploiting locally normalized luminance coefficients. SSEQ [86] utilizes local spatial and spectral entropy features of distorted images. All of them were developed on natural scene statistics (NSS), consider digital image distortions, and automatically evaluate the quality of images in agreement with human subjective scores. Ultimately, we chose one of the general-purpose no-reference image quality assessment (NR-IQA) methods, the Spatial-Spectral Entropy based Quality (SSEQ) index [86], because its time complexity is relatively low and its prediction accuracy is the least dependent on the databases among the NR-IQA methods above.
SSEQ [86] utilizes local spatial and spectral entropy features for five kinds of image distortion: blur, noise, JP2K compression, JPEG compression, and fast-fading. Its two-stage framework of distortion classification and quality assessment was trained using a support vector machine (SVM), achieving competitive quality prediction performance across multiple distortion categories. Therefore, we obtained general-purpose no-reference image quality scores, SSEQ, by applying its released code [87] to the real-world face images.
3.2.4 In-plane alignment
The objective of the In-plane Alignment Block is to provide better-aligned face images for the face recognition system. Because the preceding Face Normalization Block of Sec. 3.2.2 only roughly aligned the face images so that more accurate 3D head poses could be estimated, the In-plane Alignment Block rearranges the roughly aligned face images according to three yaw head pose views (frontal face, half-profile face, profile face) to provide more discriminative facial features within the image screen. The details are described below.
We first made all face images face the left side of the image screen, to consider only one-directional face images. To do that, we flipped the face images with head pose yaw < 0°. Doing so increased the number of facial images with similar head postures and reduced the number of pose views that are irrelevant to recognizing a person's identity. We then aligned the face images into three groups by applying a 2D similarity transformation to each face image, similar to the alignment method in [96].
Next, we divided the head pose yaw into three groups, taking into account changes in face shape: a frontal view group with $0° \le |yaw| < 30°$, a half-profile view group with $30° \le |yaw| < 60°$, and a full-profile view group with $60° \le |yaw|$. Then, we prepared three corresponding target head pose views that best represent the facial appearances of each input view group, as follows.
For each of the three target pose views (a frontal face at $0°$, a half-profile face at $40°$, and a full-profile face at $75°$), the target 2D landmarks $p^t_i \in \mathbb{R}^2,\ i = 1, \dots, 7$ on the 2D image plane were produced by projecting annotated 3D landmarks $P_i \in \mathbb{R}^3,\ i = 1, \dots, 7$ of a generic 3D face model [42] using an arbitrary constant camera model. The locations of the target 2D landmarks were further modified so that the rendered face image of the 3D generic model fully filled the image screen, which was intended to completely fill the image screen with the valid facial features of the aligned face image.
Then, we obtained the 2D similarity transformation by solving a linear system of equations and aligned the real-world face images as follows, using seven pairs of corresponding landmarks (around the eyes, the nose, and the mouth) between the 2D landmarks of the one-directional normalized face image $p^{n,view}_i \in \mathbb{R}^2,\ i = 1, \dots, 7$ and the target 2D landmarks $p^{t,view}_i \in \mathbb{R}^2,\ i = 1, \dots, 7$. The face images of the frontal view group with $0° \le |yaw| < 30°$ were mapped to the target frontal view of $|yaw| = 0°$. The face images of the half-profile view group with $30° \le |yaw| < 60°$ were mapped to the target half-profile view of $|yaw| = 40°$. The face images of the full-profile view group with $60° \le |yaw|$ were mapped to the target full-profile view of $|yaw| = 75°$. Thus, we produced the in-plane aligned face images by mapping the one-directional normalized face images onto the image plane according to their corresponding view group.
In previous studies [56, 24, 96], a rendering engine was used to accurately align face images and achieve high face recognition accuracy, but Masi et al. [93] conducted an experiment using weakly aligned face images as comparison pairs and achieved state-of-the-art recognition accuracy. This suggests that high recognition accuracy may be achievable without an expensive 3D rendering engine. Therefore, although the in-plane aligned face images visually show slight differences in head pose from each other, we used the weakly aligned face images as the input images for the following Image Sampling Block.
3.3 Image sampling
The ultimate goal of the Image Sampling Block is to extract, summarize and present the essential information that represents an individual's identity. The main task is to sample representative facial images in a way that removes duplicate or irrelevant information about a person's identity, given a set of real-world face images of that person. To do that, we organized the Image Sampling Block, in order, as shown in Fig. 3.3: the Quantization Block, the Outlier Rejection Block (the Filtering Block), and the Face Selection Block.
In template-based face recognition [79], a template may contain many varied face images and video frames, captured with multiple imaging devices under uncontrolled conditions. Thus, the face images in a template vary in head pose, facial expression, image quality, scale, and occlusion. When video frames are captured consecutively, the template may include many similar facial appearances. In addition, the template may be heavily populated with low-quality images, because video frames are strongly affected by motion blur, lens defocus, and image compression. Therefore, using all of the face images in a template may degrade face recognition performance in terms of accuracy and computational complexity. One solution to avoiding this degradation is to use a set of face images that can represent the template.
However, previous studies on subset selection, which were mainly carried out for video-based face recognition as introduced in Sec. 2, have been difficult to apply to recent template-based face recognition because of the following three problems.
First, face images with high-quality facial features can be predicted to be of poor quality and thus removed from the template. Because previous studies [82, 11] assessed the quality score based on the statistical distribution of the training and/or gallery databases, face images that do not appear frequently in those databases received low ratings. Therefore, profile face images, which did not actually appear often in the databases, contained essential information about a person's identity but were treated as low-quality face images.
Second, researchers have had difficulty finding a definition of face image quality for face recognition that is sufficient for robust subset selection [11, 15, 142]. In the quality-based frame selection methods [142, 11], researchers assumed that using high-quality face images would lead to high recognition accuracy. However, because face recognition performance is influenced by multiple compounding factors (e.g. sharpness, contrast, compression artifacts, geometry, illumination, pose, alignment, occlusion, facial expression) of real-world face images, no one, as far as we know, has suggested a robust face image quality criterion for real-world face images that accurately predicts face recognition accuracy.
Lastly, reducing the size of a face image set was not easy for unconstrained real-world face images. In particular, face image sets containing consecutive video frames were highly crowded with low-quality, blurry face images with near-duplicate facial appearances. However, because each low-quality face image and each similar face image contributed to recognition accuracy to some extent, the weighted frame selection method [153] eventually used all the face images in the image set in order not to reduce recognition accuracy.
The previous methods showed the above-mentioned shortcomings, but we successfully selected the few face images needed to represent the template using the following solutions.
First of all, we used the bins of the face image subspace, instead of the centers of statistical clusters, as the indicators for representative face image sampling. To do that, we divided the face image space according to the factors that visually change the facial image appearance of a person, such as head pose yaw and image quality. We then sampled face images from these face image subspaces, taking into account differences in the facial image appearance of a person, regardless of the statistical distribution of the training or gallery databases. Doing so enabled us to capture a variety of visually different facial features regardless of how often they appear in the database.
Next, we used low-quality face images as well as high-quality face images, unlike the previous frame selection methods [142, 28], which attempted to select only high-quality face images as far as possible, because some visually low-quality face images were observed to still hold information about a person's identity. However, we distinguished low-quality face images depending on the cause and kept only low-quality face images with identity information.
To be specific, an image-specific quality score (SSEQ [86]) was used as one of the criteria for quantizing the face image space, classifying similar face images in terms of image distortion and allowing us to retain low-quality face images, whereas a face-specific quality score (CFD [75], the landmark confidence level) was used as a criterion to filter out most of the unreliable face images from the quantized face image subspaces containing similar facial appearances, allowing us to keep only reliable, high-quality face images. By doing so, we appropriately used two different quality scores: a quality score independent of identity and a quality score that reflects identity well.
In addition, the template size was effectively reduced by using two representative sample types, an average face (a pooled face) and a random face, together with the quantized face image space, which replaced the statistical cluster centers of face image characteristics. Pooling was particularly useful for compressing similar facial appearances, while random sampling was useful for maintaining detailed facial appearances. In addition, each quantized face image subspace defined by pose and quality was effective in preserving distinctive facial appearances. Thus, even for continuous video frames containing similar face images, it was possible to provide a compact subset of face images that summarizes the template well, which was difficult in previous studies [153, 121] in terms of maintaining recognition accuracy.
Finally, in the implementation, we divided the face image space into twenty mutually exclusive bins with four head pose yaw bins and five quality bins. In each template, all face images were assigned to the bins depending on their head pose yaw and image quality, and then the face images with the most unreliable facial landmarks were eliminated one by one until each bin had only a few face samples. We then selected two representative facial images from the remaining face samples in each bin: a pooled face image and a random face image. Details are described in the following sections.
3.3.1 Quantization of face image space
Real-world face image space is complex and non-linear; it is influenced by pose, quality, illumination, expression, blur, JPEG compression, noise, and more in a complicated way. Previous clustering methods for face recognition [82, 43, 11] summarized complex real-world scenarios well while drastically reducing the number of face images. They approximated the high-dimensional real-world face image space by low-dimensional subspaces, learning class centroids from the training/gallery databases. However, when face images with class centroids unseen during training appeared in the probe database, these methods lost useful facial information due to the biased class centroids. In practice, the training/gallery databases and the probe databases were statistically uncorrelated with each other. For example, CASIA-WebFace [150], which is often used as a training database, shows a statistically different distribution from the IJB-A [79] face database in terms of facial image characteristics, as shown in Fig. 3.1 of Sec. 3.1. In addition, in template-based face recognition databases such as IJB-A [79], the templates were statistically uncorrelated with each other in most cases, meaning that most comparison template pairs contained face images with statistically different facial image characteristics.
Therefore, we focused on maintaining the facial appearances that carry a person's identity information, regardless of any biased statistical cluster centers. We first chose head pose yaw and image quality, which have statistical distributions large enough to affect recognition accuracy, as shown in Fig. 3.1, and which are visually noticeable in intra-personal face images, as shown in Fig. 3.8. We then experimentally divided the face image space into four head pose groups and five quality groups to obtain the reference facial appearances for image sampling, depending on the variations in facial appearance on the image plane and regardless of how often the facial appearances of each group appear in the database. While dividing the face image space, we considered the similarity of the facial appearances of a person within each bin, to obtain clear pooled face images, and the dissimilarity of the facial appearances of a person between bins, to preserve distinctive facial features. Because facial appearance is affected jointly by head pose yaw and image quality, we ultimately used the mutual subspaces together. Fig. 3.8 and Table 3.1 summarize the total of 20 bin indices that we used.
Figure 3.8: Bins of face image space depending on pose and quality: (a) depending on head pose |yaw|, from left to right, near frontal to near profile; (b) depending on image quality (SSEQ), from left to right, low quality to high quality.
Table 3.1: Bin indices. The partitioning criteria of pose yaw and image quality used while sampling representative face images for template representation. Using four pose bins and five quality bins together, a total of 20 bins were used, but in practice most of them were empty. Top: pose |yaw| (°); bottom: image quality, normalized SSEQ [86].
|yaw|: [0°, 20°), [20°, 40°), [40°, 60°), [60°, ∞)
Quality: (−∞, 0.45), [0.45, 0.55), [0.55, 0.65), [0.65, 0.75), [0.75, ∞)
Basic concepts and notations
Let $C$ be the face image space; we quantize the face image space $C$ into a certain number of face image subspaces depending on head pose yaw and image quality. We define $c_i$ as the face image subspace for the $i$-th head pose yaw group, $\bar{c}_j$ as the face image subspace for the $j$-th quality group, and $c_{i,j}$ as the face image subspace for the mutually exclusive group of the $i$-th head pose group and the $j$-th image quality group. Then, the face image space can be expressed as $C = \cup_i c_i$, with $\emptyset = c_i \cap c_{i'}$ for any distinct $i, i'$; as $C = \cup_j \bar{c}_j$, with $\emptyset = \bar{c}_j \cap \bar{c}_{j'}$ for any distinct $j, j'$; and as $C = \cup_{i,j} c_{i,j}$, with $\emptyset = c_{i,j} \cap c_{i',j'}$ for any distinct $(i, j), (i', j')$.
After quantizing the face image space, we assign the face images of a person to a specific face image subspace depending on their facial characteristics. If we let $X_k$ denote the set of face images of a subject $k$, then the face image set of subject $k$ in the mutually exclusive face image subspace $c_{i,j}$ of the $i$-th head pose yaw group and the $j$-th quality group can be represented as $X_{k,i,j} = \{c_{i,j}(X_k) : c_{i,j} \in C\}$, where the face images of a person satisfy $X_k = \cup_{i,j} X_{k,i,j}$, with $\emptyset = X_{k,i,j} \cap X_{k,i',j'}$ for any distinct $(i, j), (i', j')$.
Binning by head pose [56]
In this section, we classify face images depending on head pose yaw, which was estimated in Sec. 3.2.3. As described in Sec. 3.2.2, the roll angles were already compensated and can be ignored.
The pitch angles did not show a statistically significant distribution in the recent face databases, such as CS2 [79], IJB-A [79], and CASIA-WebFace [150], compared to the yaw angles, with few samples having absolute pitch angles greater than 15°. Therefore, we assume that the yaw component of the 3D head pose is the main driver of changes in facial appearance on the image plane and has a much greater impact on recognition accuracy than the variations in pitch and roll. Finally, taking into account the similarity of facial appearances, we empirically partitioned the head pose yaw into four groups, as in Eq. (3.7), assigning visibly distinct face images to separate bins according to head pose yaw:
$$P(I) = \begin{cases} 0, & \text{if } 0° \le |Yaw|(I) < 20° \\ 1, & \text{if } 20° \le |Yaw|(I) < 40° \\ 2, & \text{if } 40° \le |Yaw|(I) < 60° \\ 3, & \text{if } 60° \le |Yaw|(I) < \infty \end{cases} \qquad (3.7)$$
Binning by image quality [56]
We also classified face images according to the image quality score measured by the Spatial-Spectral Entropy based Quality Assessment (SSEQ) [86] described in Sec. 3.2.3. Because SSEQ is a general-purpose no-reference objective image quality assessment method that considers five different types of image distortion (blur, noise, JP2K compression, JPEG compression, and fast-fading) at once, it does not actually predict the quality of facial features in relation to identity information. Therefore, we assumed that low-quality face images degraded by image distortions still contain distorted, but useful/usable, facial information that can identify a person. Also, we assumed that face images with similar image quality scores contain similar levels of distorted facial features in relation to a person's identity. Finally, considering the similarity of facial appearances within bins and the dissimilarity of facial appearances between bins, we experimentally divided the image quality into five groups, as in Eq. (3.8), assigning visibly distinct face images to separate bins according to image quality:
$$Q(I) = \begin{cases} 0, & \text{if } -\infty < SSEQ(I) < 0.45 \\ 1, & \text{if } 0.45 \le SSEQ(I) < 0.55 \\ 2, & \text{if } 0.55 \le SSEQ(I) < 0.65 \\ 3, & \text{if } 0.65 \le SSEQ(I) < 0.75 \\ 4, & \text{if } 0.75 \le SSEQ(I) < \infty \end{cases} \qquad (3.8)$$
3.3.2 Filtering: Outlier Removal
Once face images have been assigned to the mutually exclusive, predetermined pose and quality bins as described in the previous section, each bin should contain only similar facial appearances in terms of pose and quality. However, the unconstrained real-world face images provided by the Alignment Block of Sec. 3.2 still contain all kinds of unreliable face images, such as inaccurately cropped face images and non-human face images. Therefore, to keep only face images that represent the characteristics of each bin as much as possible, we had to remove the unreliable face images assigned to each bin.
Figure 3.9: Examples of outlier face images in IJB-A [79]: unreliable face images with landmark confidence levels of less than 0.4.
To eliminate the unreliable face images, such as the outlier face images shown in Fig. 3.9, we designed the Filtering Block, also called the Outlier Removal Block. We first defined "faceness" [16, 17, 147] as an indicator of how reliable a face image is, in other words how similar a given face is to a human face. We then assumed that faceness can be estimated with the landmark confidence level [75], which predicts how reliable the locations of the detected landmarks on the given face image are.
Then, we defined unreliable face images, which we call outliers in this dissertation, as images with much less reliable facial information about a person's identity. After observing face images in relation to their landmark confidence levels, following the definitions above, we empirically defined outliers as face images with a CFD value of less than 0.4, as shown in Eq. (3.9). In our observation, most face images in this CFD range were inaccurately aligned face images or non-human face images, and thus we eliminated the outliers with $O(I) = 0$ in Eq. (3.9), similar to the quality-based frame selection methods introduced in Sec. 2.
$$O(I) = \begin{cases} 0, & CFD(I) \le 0.4 \\ 1, & CFD(I) > 0.4 \end{cases} \qquad (3.9)$$
Candidate face images for template representation: in general
Therefore, when we have more than one reliable face image, we define the candidate face images to represent a template as follows. Let a (gallery or probe) face template be represented by the set of its member images $\mathcal{F} = \{I_1, \dots, I_N\}$, where the face images are in RGB image space, $I_n \in \mathbb{R}^{n \times m \times 3}$, and the member images are aligned with each other. Let a disjointly quantized bin of a face template, depending on pose and quality, be represented by the set of its member images $\mathcal{F}_{pq} = \{I_{pq\_1}, \dots, I_{pq\_M}\}$, where the face image space is quantized into $P$ subspaces depending on pose and $Q$ subspaces depending on quality, and thus $P \cdot Q$ subspaces altogether, where $\mathcal{P} = \{1, \dots, P\}$, $\mathcal{Q} = \{1, \dots, Q\}$, and $M \le N$. Then, given a bin whose member face images belong to $O$ or $\bar{O}$ depending on their reliability, where $O$ is the face image subspace containing reliable face images and $\bar{O}$ the subspace of unreliable face images, the set of candidate face images for template representation in each bin is defined as:
$$\mathcal{F}_{pq\_candidates} = \{\, I_{pq\_k} \in \mathcal{F}_O \mid k \in \{1, 2, \dots, L_{pq}\},\ L_{pq} \le M_{pq} \,\}, \quad \text{where } N(\mathcal{F}_O) = \sum_{pq=1}^{P \cdot Q} N(\mathcal{F}_{pq}) = \sum_{pq=1}^{P \cdot Q} M_{pq} > 0 \qquad (3.10)$$
Template representation: only with outliers
We were not always able to eliminate all outliers, because some templates in CS2 and IJB-A [79] contain only unreliable face images. Eliminating all outliers from all templates in the databases would generate empty templates containing no information at all, impairing recognition accuracy. Thus, when a template has no available face images other than outliers, we keep at least one outlier in the template, expecting some usable facial information in the given outliers. After observing outliers such as those shown in Fig. 3.9, we decided to keep only the one outlier with the highest landmark confidence level across all pose and quality bins of the template. By doing so, we were able to use the sample with the most useful information among the given image samples with poor faceness. Also, we were still able to remove most outliers, which were likely to impair recognition accuracy.
Finally, we select the candidate face image that represents such a template as follows. Let a (gallery or probe) face template be represented by the set of its member images $\mathcal{F} = \{I_1, \dots, I_N\}$, where the face images are in RGB image space, $I_n \in \mathbb{R}^{n \times m \times 3}$. Let the landmark confidence level of each face image be $cfd(I_n)$. Then, given a template in which all member face images belong to $\bar{O}$, containing only unreliable face images, the candidate face image for template representation is defined as:
$$\mathcal{F} := I_k, \ \text{ where } cfd(I_k) = \max_n cfd(I_n) \text{ for } n \in \{1, 2, \dots, N\},\ I_k \in \mathcal{F}_{\bar{O}},\ N(\mathcal{F}_O) = 0 \qquad (3.11)$$
3.3.3 Image selection
Once we discarded the outliers from each pose and quality bin, only similar face images remained within each bin.
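The quantization and filtering steps above (Eqs. 3.7–3.9) reduce to simple threshold tests. The following is a minimal illustrative Python sketch, not the dissertation's implementation; the function and attribute names are hypothetical, while the thresholds come from Table 3.1 and Eq. (3.9).

```python
POSE_EDGES = [20.0, 40.0, 60.0]                 # |yaw| bin boundaries (degrees), Table 3.1
QUALITY_EDGES = [0.45, 0.55, 0.65, 0.75]        # normalized SSEQ bin boundaries, Table 3.1
CFD_THRESHOLD = 0.4                             # outlier threshold, Eq. (3.9)

def pose_bin(abs_yaw):
    """P(I) of Eq. (3.7): number of boundaries that |yaw| meets or exceeds."""
    return sum(abs_yaw >= e for e in POSE_EDGES)

def quality_bin(sseq):
    """Q(I) of Eq. (3.8)."""
    return sum(sseq >= e for e in QUALITY_EDGES)

def quantize_and_filter(faces):
    """Assign faces to the 4x5 pose/quality bins and drop unreliable ones (Eq. 3.9).
    Each face is assumed to carry .abs_yaw, .sseq, and .cfd attributes."""
    bins = {}
    for f in faces:
        if f.cfd <= CFD_THRESHOLD:              # outlier: O(I) = 0, discarded here
            continue
        key = (pose_bin(f.abs_yaw), quality_bin(f.sseq))
        bins.setdefault(key, []).append(f)
    if not bins and faces:                      # template with only outliers (Eq. 3.11):
        best = max(faces, key=lambda f: f.cfd)  # keep the single most confident image
        bins[(pose_bin(best.abs_yaw), quality_bin(best.sseq))] = [best]
    return bins
```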
Thus, the main task of the Image Selection Block is to represent several similar face images compactly while preserving the essential facial information. To do that, we used two representative statistical sample types, a mean and a random sample, as the representative face image types in the template representation. By doing so, we completed an effective template representation with a few representative face images within a fixed data volume, as shown in Fig. 3.10.
Figure 3.10: Template representation: representative face images for a template. When given multiple reliable face images in each bin, we use two representative face image types: a pooled face and a random face. A pooled face is sampled by averaging all of the given face images in each bin. A random face is sampled by randomly selecting one face image from all of the given face images in each bin. Using the two representative face images in each bin, we summarize the template well within a fixed data volume.
Template representation
As described in Eq. (3.10), let a disjointly quantized pose $p$ and quality $q$ bin $c_{pq}$ of a face template for a subject $s_j$ be represented by the set of its member images $\mathcal{F} = \{I_1, \dots, I_K\}$, where the face images are in RGB image space, $I_k \in \mathbb{R}^{n \times m \times 3}$, $I_k$ is an in-plane aligned image selected for the bin, and the landmark confidence level of each face image satisfies $cfd(I_k) > 0.4$. Then, $\mathcal{F}$ can equivalently be represented by a confidence-sorted set of its member images, $\mathcal{F}_s = \{I_{s\_1}, \dots, I_{s\_K}\}$, where $cfd(I_{s\_1}) \ge cfd(I_{s\_2}) \ge \dots \ge cfd(I_{s\_K})$, and where up to $N$ face images are allowed for producing a pooled face and a random face. The set of face images to be used in each bin for template representation is defined as:
$$\mathcal{F}_{s\_allowed} = \{I_{s\_1}, \dots, I_{s\_K'}\}, \quad \text{where } K' = \begin{cases} N, & K \ge N \\ K, & K < N \end{cases} \qquad (3.12)$$
Finally, the proposed effective template representation of a pose $p$ and quality $q$ bin $c_{pq}$ consists of a pooled face image $F_p$ and a random face image $F_r$, as below, and can be represented by the set of its member images $\mathcal{F}_{T\_pq} = \{F_p, F_r\}$ using the face images given in Eq. (3.12).
$$\mathcal{F}_{T\_pq} = \begin{cases} F_p := \mathrm{avg}(\mathcal{F}_{s\_allowed}) = \frac{1}{K'} \sum_{i=1}^{K'} I_{s\_i} \\ F_r := I_{s\_k}, \ \ k \in \{1, 2, \dots, K'\}, \ \ \mathrm{Prob}(k) = 1/K' \end{cases} \qquad (3.13)$$
A pooled face
Previous studies that use the rich and redundant information in multiple frames have mainly been carried out for video-based face recognition systems, as introduced in Sec. 2. In particular, clustering methods and exemplar-based methods [68, 51] drastically reduced the number of face images by selecting distinctively representative frames, but did not consider the use of redundant, similar face images. Weighted frame selection methods [153, 121] used all face images, including similar faces; they applied a different weight to each sample based on its expected contribution to recognition accuracy, such as its distance to the appearance model, and thus increased recognition accuracy, but they did not succinctly represent the given multiple frames. A recent study on template representation [56] reduced the template size while keeping the recognition accuracy by introducing the pooled face image. The pooled face images, which average multiple face images in a pixel-wise manner as in Eq. (3.1), represent the central tendencies of the multiple face images in terms of a person's identity well, by using face images that are well standardized with respect to the visually noticeable appearance changes caused by different 3D head poses, through the use of a decent 3D alignment technique.
However, without canceling out the 3D head poses, averaging real-world 2D face images with extremely different facial appearances all together preserves neither the distinctive facial appearances nor the facial details, as previously shown in Fig. 3.2. Therefore, to obtain decent pooled face images using weakly 2D-aligned faces, we first needed to minimize the extreme differences in facial appearance between the given face images. To do that, as described in Sec. 3.3.1, we quantized the face image space depending on pose and quality, keeping visually similar face images together in each quantized face image subspace (bin). In addition, we filtered out unreliable face images, such as non-human face images or inaccurately cropped face images, as described in Sec. 3.3.2. As a result, we were able to pool over much more similar face images. Nevertheless, when too many similar face images were given in a bin, we could not obtain pooled face images of as high quality as in the previous work [56], because of the slight differences in head pose between the given face images. In that case, we empirically used up to N face images with the highest landmark confidence levels while pooling. In this process, we assumed that the landmark confidence level indicates alignment reliability, i.e., how well a given face image is aligned to a reference face image with a perfect landmark confidence level, because we used several fiducial landmarks to align the face images, as described in Sec. 3.2.2 and Sec. 3.2.4.
A randomly selected face
In each quantized pose and quality bin, the similar face images were compressed into a pooled face image. In doing so, we failed to preserve sufficient facial detail because of the nature of the average itself, which only keeps the central tendencies of the given samples, as shown in Fig. 3.2-(b) of Sec. 3.1.1. Thus, to compensate for the loss of facial detail in a pooled face image, we added a randomly selected face image to each bin of the template representation.
First of all, face images selected uniformly at random keep facial details intact, because they are nothing but samples of the original images, as shown in Fig. 3.2-(c). Also, the face images randomly selected from each pose and quality bin reflect the facial appearances of each bin well, because a randomly selected sample theoretically represents the statistics of the population: the randomly selected observations do not differ in any significant way from the observations that are not sampled, although the sample size should be large enough to ensure significance. For these reasons, we used one additional randomly selected face image, together with a pooled face image, to represent the face images in each bin of a template, assuming that the combination of the two samples is enough to represent the bin, as shown in Sec. 4.3.3; a minimal sketch of this per-bin sampling is given below.
In addition, we considered random sampling because it has other advantages. Random sampling has been used in many areas of computer vision [136, 135, 146, 152, 143, 93]. For example, it has been used to provide robust overall system performance for complex systems that are difficult to optimize [146, 135, 136]. It has also been used to compensate for a lack of sample information by learning general information from other classes [93, 96, 152].
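The per-bin selection of Eqs. (3.12)–(3.13) can be summarized in a few lines. Below is an illustrative Python sketch, not the dissertation's actual implementation; the cap on the number of pooled images (max_pool) and the argument names are assumptions.

```python
import random
import numpy as np

def represent_bin(bin_faces, cfd_scores, max_pool=10):
    """Eqs. (3.12)-(3.13): produce one pooled face and one random face per bin.
    bin_faces: list of aligned face images (H x W x 3 float arrays) in one bin.
    cfd_scores: landmark confidence level of each image.
    max_pool: assumed cap N on how many top-confidence images are pooled."""
    # Sort by landmark confidence and keep at most N images (Eq. 3.12).
    order = sorted(range(len(bin_faces)), key=lambda i: cfd_scores[i], reverse=True)
    allowed = [bin_faces[i] for i in order[:max_pool]]

    pooled_face = np.mean(allowed, axis=0)       # F_p: pixel-wise average (Eq. 3.13)
    random_face = random.choice(allowed)         # F_r: uniform random pick (Eq. 3.13)
    return pooled_face, random_face
```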
These advantages of random sampling were expected to facilitate template representation, because the real-world face image space is complex and the facial information in the face databases is not uniformly distributed.
3.3.4 To sum up for the Image Sampling Block
Through the Image Sampling Block, we summarized the given facial information of a template well, in terms of a person's identity, by sampling a few statistically representative face images. By quantizing the face image space depending on pose and quality, we not only preserved diverse facial appearances, but also grouped similar facial appearances. We considered only reliable face images, by filtering out non-human face images. By using only a few well-aligned face images in each bin, we were able to obtain better-quality pooled face images, which represent the central tendencies of the given face images. By adding a randomly selected sample to each bin, we were able to compensate for the facial details lost in the pooled face images. By doing all of this, we compressed a person's identity within a fixed data volume in the image domain, and the template representation can be summarized by Eq. (3.13) and Eq. (3.11).
3.4 Training CNN: Fine-tuning of deep learning models
Building on the success of recent deep CNN approaches to face recognition [127, 130, 118, 93, 24, 94], we use the VGG-19 CNN of [120] to encode our efficient template representation, which consists of the few pooled and random face images described in Sec. 3.3. This 19-layer network was originally trained on the large-scale image recognition benchmark (ILSVRC) [115]. We fine-tune the weights of this network twice, with augmented and modified large databases, as follows.
CNN Fine Tuning
The VGG-19 CNN [120] ends with the fully connected layers fc7 and fc8. The last layer fc8 outputs an $N$-way SoftMax by mapping from the embedded feature fc7 to the $N$ subject labels of the training database. When the $i$-th output of the network on a given image $I$ is denoted by $y_i(I)$, the SoftMax function provides the probability assigned to the $i$-th class as:
$$p_i(I) = \frac{e^{y_i(I)}}{\sum_{g=1}^{N} e^{y_g(I)}} \qquad (3.14)$$
Then, the network fine-tuning is performed by minimizing the SoftMax loss $\mathcal{L}$ below, where $i$ is the ground-truth index and $t$ indexes all training images:
$$\mathcal{L} = -\sum_t \ln p_i(I_t) \qquad (3.15)$$
We train fc8 from scratch: we initialize fc8 with parameters drawn from a Gaussian distribution with zero mean and standard deviation 0.01, while using the weights pre-trained on the ImageNet data set [115] for the remaining layers. The networks are then fine-tuned on the training database (augmented and modified CASIA-WebFace databases for our experiments), minimizing the SoftMax loss $\mathcal{L}$ through stochastic gradient descent (SGD) with a standard L2 penalty over the learned weights, together with standard back-propagation [84].
Once we set an initial learning rate $\lambda$, this learning rate is applied to the entire network, and a ten times larger learning rate is applied to fc8. Also, the biases are learned twice as fast as the other weights. When the validation accuracy saturates, we reduce the learning rate by an order of magnitude. For all other parameters, we use the same values as in [115]. We first fine-tune this network using our augmented data, to obtain pose-robust deep features. Then, we fine-tune the network again using a database modified to match our template representation, to adapt the model in particular to our pooled face images. We randomly use 80% of the images for training and the remaining 20% for validation.
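As an illustration of the fine-tuning setup just described, the following PyTorch sketch replaces the final fully connected layer of a VGG-19 model and applies a ten times larger learning rate to it. It is a minimal approximation, not the dissertation's training code: the momentum and weight-decay values are assumptions, while the number of subject labels and the base learning rate are taken from the text.

```python
import torch
import torch.nn as nn
import torchvision

def build_finetune_model(num_subjects=10575, base_lr=1e-3, weight_decay=5e-4):
    # Start from the ImageNet-pretrained VGG-19 and replace the last layer (fc8)
    # with a fresh N-way classifier over the training subjects.
    model = torchvision.models.vgg19(pretrained=True)
    fc8 = nn.Linear(4096, num_subjects)
    nn.init.normal_(fc8.weight, mean=0.0, std=0.01)   # Gaussian init, std 0.01
    nn.init.zeros_(fc8.bias)
    model.classifier[6] = fc8

    # Base learning rate for the pretrained layers, 10x for the new fc8.
    pretrained_params = [p for n, p in model.named_parameters()
                         if not n.startswith("classifier.6")]
    optimizer = torch.optim.SGD(
        [{"params": pretrained_params, "lr": base_lr},
         {"params": fc8.parameters(), "lr": base_lr * 10}],
        momentum=0.9, weight_decay=weight_decay)

    criterion = nn.CrossEntropyLoss()   # SoftMax loss of Eqs. (3.14)-(3.15)
    return model, optimizer, criterion
```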
The first fine-tuning with augmented CASIA-WebFace
The first fine-tuning is performed on augmented CASIA-WebFace [150] images. From the original database, we first align the images in-plane, as described in Sec. 3.2; we call these the in-plane aligned face images. Then, we use a generic 3D face model to synthesize three different canonical views (frontal 0°, half-profile 40°, full-profile 75°); we call these the out-of-plane aligned face images, as in [96]. By doing so, we enrich the facial appearances of the original training database in terms of 3D head poses [93]. We then fine-tune the VGG-19 CNN on these augmented face images, starting from the pre-trained model parameters of [115] and learning to recognize 10,575 subject labels. During training, we use an initial learning rate of $\lambda = 0.001$. Once the validation error saturates, $\lambda$ is decreased by an order of magnitude once [93].
The second fine-tuning with modified CASIA-WebFace
The second fine-tuning is performed again on CASIA-WebFace [150] images, but this time with some modification so as to learn our template representation, as shown in Fig. 3.11. Since CASIA-WebFace only provides images grouped by subject, we randomly select subsets of face images for each subject, treating each subset as a template as defined in [79]. Once we have templates, we apply our template representation to them as described in Sec. 3.3. To take advantage of the pose-augmented face database, we apply the template representation to each canonical view separately. Finally, the fine-tuning is performed on our representative face image sets containing pooled face images and randomly selected face images, as shown in Fig. 3.11. The training starts from the first fine-tuned model parameters with an initial learning rate of $\lambda = 0.0005$, learning to recognize 10,575 subject labels. During training, once the validation error saturates, $\lambda$ is decreased by an order of magnitude.
Figure 3.11: CNN training with our template structures: We randomly select subsets of face images of a person and consider each subset as a template. The face images in each template are represented by a few pooled face images and randomly selected face images, after being aligned and sampled as in Sec. 3.3. After we gather the representative face images for all templates of all subjects, we fine-tune the VGG-19 CNN with pre-trained initial weights, learning to recognize each subject label. Although not shown in this figure, while gathering the representative face images we actually apply our image sampling process to each out-of-plane aligned view separately.
3.5 Matching Templates
We introduce two general, simple template matching methods that calculate the similarity of two templates [93, 56, 94]: "All vs. All" and "Matching pooled features".
3.5.1 Deep feature representation of an image
Once we represent the template with a few representative face images as in Sec. 3.3 and obtain the fine-tuned CNN model as in Sec. 3.4, we encode each representative face image $I$ into a deep feature vector $x$.
We take the response of the fully connected layer fc7 following the non-linear ReLU activation, with weights W_fc7 and biases b_fc7, so the feature representation x of each image I is:

$$\mathbf{x}^{fc7} = f(I; \{W_{fc7}, b_{fc7}\}) \qquad (3.16)$$

3.5.2 The similarity of two images

Given a gallery image I_g in a gallery template G and a probe image I_p in a probe template P, the similarity of the two images is calculated by the normalized cross correlation (NCC) of their feature vectors:

$$s(\mathbf{x}^{fc7}_p, \mathbf{x}^{fc7}_g) = \frac{(\mathbf{x}^{fc7}_p - \bar{\mathbf{x}}^{fc7}_p)\,(\mathbf{x}^{fc7}_g - \bar{\mathbf{x}}^{fc7}_g)^T}{\|\mathbf{x}^{fc7}_p - \bar{\mathbf{x}}^{fc7}_p\|\;\|\mathbf{x}^{fc7}_g - \bar{\mathbf{x}}^{fc7}_g\|} \qquad (3.17)$$

where the bar denotes the mean feature vector used for normalization (see Sec. 3.6.1).

3.5.3 The similarity of two templates

Given two templates that may contain multiple face images, we extract feature vectors from the images using Eq. (3.16). Then, we calculate the similarity of the two templates to see how similar they are in terms of a person's identity. While matching two templates, we use two different approaches [93, 56, 94]: "All vs. All" and "Matching pooled features".

All vs. All: Pair-wise similarity comparison The simplest way to compare two templates is to compare each image in one template with every image in the other template using Eq. (3.17) and aggregate the scores of all image pairs across the two templates; we call this "pair-wise similarity comparison" or "All vs. All". We define the similarity of a probe template P and a gallery template G as s(P, G). After computing the pair-wise score of every gallery image I_g ∈ G and probe image I_p ∈ P using Eq. (3.17), we aggregate the scores with the SoftMax weighting s_β(P, G) defined in Eq. (3.18). We use SoftMax, one of the representative score normalization methods, to alleviate the influence of the uneven score distribution. Finally, we obtain the similarity score by averaging the SoftMax response over multiple values β = 0, ..., 20, as in [56, 93, 94]:

$$s_\beta(P, G) = \frac{\sum_{p \in P,\, g \in G} w_{pg}\, s(\mathbf{x}_p, \mathbf{x}_g)}{\sum_{p \in P,\, g \in G} w_{pg}}, \qquad w_{pg} := e^{\beta\, s(\mathbf{x}_p, \mathbf{x}_g)} \qquad (3.18)$$

Matching pooled features Another simple way to compare two templates is feature pooling, recently used in [93, 145] to boost recognition accuracy. Given a template, the method averages (pools) the deep features of its face images element-wise. Because the pooled feature vector represents the template, the similarity of two templates is simply the similarity of their two pooled feature vectors. Given a template T, we encode each template image I_t into its deep feature vector x_t, and the pooled feature vector is defined as:

$$\text{Pool}(T) = \frac{1}{|T|} \sum_{t \in T} \mathbf{x}_t \qquad (3.19)$$

Then, given a probe template P with deep feature vectors x_p and a gallery template G with deep feature vectors x_g, the similarity of the two templates on pooled feature vectors is defined as:

$$s(P, G) = s\!\left(\frac{1}{|P|} \sum_{p \in P} \mathbf{x}_p,\; \frac{1}{|G|} \sum_{g \in G} \mathbf{x}_g\right) \qquad (3.20)$$

3.5.4 Fusion of pooled faces and random faces

Our template representation described in Sec. 3.3 provides two types of representative face images: pooled (averaged) face images, which represent the central tendencies of the sampled face images, and randomly selected face images, which retain facial details. Because their statistical characteristics are inherently different, we compare templates twice, once per image type (pooled to pooled, random to random). Finally, we average the similarity scores obtained by matching only pooled images and by matching only randomly selected images, as in Fig. 3.12.
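Before the fusion of the two image types is formalized in Eq. (3.21) below, the following NumPy sketch illustrates the matching machinery of Eqs. (3.17)-(3.20): mean-normalized NCC between two feature vectors, the SoftMax-weighted "All vs. All" aggregation averaged over beta = 0..20, and pooled-feature matching. The feature dimensionality and the random test data are placeholders for illustration.

import numpy as np

def ncc(xp, xg, mean):
    """Eq. (3.17): normalized cross correlation of two features after
    subtracting the normalization mean feature vector."""
    a, b = xp - mean, xg - mean
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def all_vs_all(P, G, mean, betas=range(0, 21)):
    """Eqs. (3.17)-(3.18): SoftMax-weighted aggregation of all pair-wise
    scores, averaged over beta = 0..20."""
    S = np.array([[ncc(p, g, mean) for g in G] for p in P])
    scores = []
    for beta in betas:
        W = np.exp(beta * S)
        scores.append((W * S).sum() / W.sum())
    return float(np.mean(scores))

def pooled_match(P, G, mean):
    """Eqs. (3.19)-(3.20): element-wise average pooling of the template
    features, then a single NCC comparison of the two pooled vectors."""
    return ncc(P.mean(axis=0), G.mean(axis=0), mean)

# Toy example with random 256-D features (4 probe images, 6 gallery images).
rng = np.random.default_rng(0)
P, G = rng.normal(size=(4, 256)), rng.normal(size=(6, 256))
mean = rng.normal(size=256)
print(all_vs_all(P, G, mean), pooled_match(P, G, mean))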
Given a probe template P and a gallery template G, both expressed by our template representation of Sec. 3.3, let the pooled face images of the probe template be denoted P_pooled, its randomly selected face images P_random, the pooled face images of the gallery template G_pooled, and its randomly selected face images G_random. The template similarity score for our template representation, s(P, G), is then defined as:

$$s(P, G) = \frac{1}{2}\, s(P_{pooled}, G_{pooled}) + \frac{1}{2}\, s(P_{random}, G_{random}) \qquad (3.21)$$

Figure 3.12: Score fusion: Once we have the two types of representative face images for a template, we compute the similarity of two templates separately for each image type and then combine the two similarity scores.

3.5.5 Overview of matching two templates

Fig. 3.13 shows the details of the matching pipeline for the pair-wise similarity comparison of Eq. (3.18). At test time, given two templates, we sample pooled and random face images as described in Sec. 3.3. The sampled face images are then encoded into deep feature vectors by Eq. (3.16) using the fine-tuned CNN model of Sec. 3.4. After the feature vectors are transformed using principal component analysis (PCA) learned on the training images of the target database, the process compares the similarity of the two templates as described in Sec. 3.5.3, separately for pooled face images and randomly selected face images, and finally completes the comparison by fusing the two similarity scores.

Figure 3.13: Overview of the matching pipeline: template matching is carried out twice, separately for pooled face images and randomly selected face images, and the similarity scores are then fused to provide the final similarity of the two templates. This figure shows the "All vs. All" (pair-wise comparison) case.

3.6 Robust Feature Extraction: Bin equalization

To compare two templates fairly, we should take into account the differences between face images caused by complex real-world conditions such as pose, image quality, expression, and other factors. However, the simple matching methods introduced in Sec. 3.5.3 do not consider the differences in facial appearance between templates. In particular, when applied to our template representation, which consists of representative face images categorized by pose and quality, the "All vs. All" method compares all pairs of face images with each other regardless of pose and quality, and the "Matching pooled features" method compares a pair of pooled feature vectors that combine all feature vectors in a template, again regardless of pose and quality. A better approach is therefore needed to properly compare two templates expressed by our representation.

To evaluate our template representation fairly with the simple matching methods of Sec. 3.5.3, we observe that variations in pose and quality produce common variations of human facial appearance that are unrelated to a person's identity. We therefore propose to remove the pose and quality bin differences and integrate the quantized face image subspaces, as shown in the upper image of Fig. 3.4. To do so, we assume that the face image space can be approximated by a linear combination of mutually exclusive pose and quality face image subspaces, as in linear subspace methods [12, 13, 52, 66, 78].
Also, because all deep feature vectors in a bin encode the facial appearance associated with that bin's pose and quality, feature pooling, which has already proven effective at preserving invariant properties among feature vectors [89, 40, 122, 93], should be able to extract the portion of a feature related to pose and quality when sufficient samples are available in each quantized bin across subjects.

We therefore introduce "bias cancellation" and "variance cancellation" to remove the pose and quality differences between bins. With these methods, we obtain identity-preserving deep feature vectors that are insensitive to variations in pose and quality, and thus increase recognition accuracy, as shown in Sec. 4. The details are described in the next sections.

Figure 3.14: Bin equalization: the concept of bias cancellation. The face image of each pose and quality bin is already biased by the facial characteristics of that bin. A fair feature normalization across bins should therefore eliminate this bias.

3.6.1 Bias cancellation

Comparing feature vectors involves measuring the difference between the observed feature vector and some reference feature vector (e.g., the mean of the other feature vectors). The simple matching methods in Sec. 3.5 normalize each probe feature vector by subtracting the mean of all training feature vectors, which reflect the unevenly distributed facial characteristics of the face image space shown in Fig. 3.1. For example, because a training face database such as CASIA-WebFace [150] contains mostly frontal face images, the normalized feature vector of each face image is generated as shown conceptually in the left image of Fig. 3.14. However, our sampling structure evenly preserves face images with different facial appearances in bins quantized by pose and quality. We therefore propose to normalize a feature vector by the local mean feature vector of its quantized bin rather than the global mean of all feature vectors of all bins. In other words, because the face images of each pose and quality bin are already biased by the facial characteristics that the bin represents, a fair feature normalization should eliminate these biases, as shown in the right image of Fig. 3.14.

To realize this, we first assume that we can extract from each feature vector the portion related to the common facial appearance of its pose and quality bin. Let the original feature vector of a subject n in the quantized pose i and quality j bin be (f_n)_{i,j}, and let the common feature vector representing that pose and quality be f̄_{i,j}. The identity-preserving feature vector, from which the portion specific to pose and quality has been removed, is then (f_n)'_{i,j} = (f_n)_{i,j} − f̄_{i,j}. More specifically, we assume that this pose- and quality-specific portion can be obtained as the mean feature vector over the face images of all subjects in each bin of a training database, as shown in Fig. 3.15. Let the face image feature space be represented by mutually exclusive face image feature subspaces X_{i,j} depending on pose i and image quality j, and let (X_k)_{i,j} be represented by the set of member facial feature vectors of a subject k, (X_k)_{i,j} = {(f_k)_{i,j}, ..., (f_k)_{P,Q}}. The portion specific to the pose and quality is then defined by Eq. (3.22).
Therefore, the feature vector of subject n in a probe database that is robust to variations in pose and quality is defined by Eq. (3.23):

$$\bar{f}_{i,j} = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} (f_k)_{i,j} \qquad (3.22)$$

$$(f_n)'_{i,j} = (f_n)_{i,j} - \bar{f}_{i,j} \qquad (3.23)$$

Figure 3.15: Bin equalization: bias cancellation. All feature vectors are assigned to their corresponding pose and quality bins, which means that every feature vector contains a specific portion reflecting the facial characteristics represented by its bin. We assume that this portion can be extracted by averaging the feature vectors of each bin across all subjects in a training database with sufficiently many subjects.

3.6.2 Variance cancellation

The bias cancellation method in Sec. 3.6.1 assumes that each quantized face image space is linearly mapped onto its feature space and that the mapping is onto; the feature mapping g maps face images onto their feature vectors as g: (F_t)_{i,j} → (X_t)_{i,j}. In addition, it assumes that the quantized face image spaces are linearly independent, which means that each mean (bias) feature vector f̄_{i,j} is one of the basis vectors of the face image space. The bias cancellation method therefore operates independently in each bin.

However, if the quantized face image spaces are linearly dependent on each other, bias cancellation also removes portions unrelated to pose and quality and maps each feature vector into a non-integrated face feature vector space. In this case, we propose to use all bias vectors together, rather than each bias vector independently, when extracting the portions specific to pose and quality, where a bias vector f̄_{i,j} is the mean feature vector of a quantized bin across all subjects. Because the bias vectors represent the different facial characteristics of the poses and qualities, as shown in the center image of Fig. 3.16, one way is to use principal component analysis to find the principal components that capture the largest variance between the bias vectors.

Figure 3.16: Bin equalization: variance cancellation. When the quantized face image spaces are linearly dependent on each other, we extract the portions specific to pose and quality using the bias vectors, i.e., the mean feature vectors of each quantized bin across all subjects in a training database. Using all twenty bias vectors, we experimentally choose nine principal components, which explain nearly 95% of the variance between the twenty bias vectors.

Principal component analysis (PCA) is a well-known technique used in many areas; in computer vision it is mainly used to reduce the dimensionality of multivariate data while retaining as much relevant information as possible, through components that maximize the variance of the input data. We use this technique on the bias vectors f̄_{i,j} to extract the dominant portions related to pose and quality across all bins. We experimentally chose the first nine principal components, which explain almost 95% of the variance between the bias vectors, as shown in the rightmost image of Fig. 3.16. We then assume that only these nine principal components explain the portions specific to pose and quality, and we therefore replace the first nine principal components of the principal component matrix with zero vectors.
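The sketch below summarizes both bin-equalization variants described above, assuming per-image bin assignments are available: bias cancellation subtracts the bin mean of Eq. (3.22) as in Eq. (3.23), and variance cancellation removes the subspace spanned by the first nine principal components of the twenty bias vectors. The projection form used here is one plausible reading of that operation; Eq. (3.24) below states the exact formulation. All data in the example is synthetic.

import numpy as np

def bin_biases(train_features, train_bins, n_bins=20):
    """Eq. (3.22): the mean ("bias") feature vector of every pose/quality bin,
    estimated over all training samples assigned to that bin."""
    d = train_features.shape[1]
    biases = np.zeros((n_bins, d))
    for b in range(n_bins):
        members = train_features[train_bins == b]
        if len(members):
            biases[b] = members.mean(axis=0)
    return biases

def bias_cancel(f, b, biases):
    """Eq. (3.23): subtract the bias of the bin the image was assigned to."""
    return f - biases[b]

def variance_cancel(f, biases, n_components=9):
    """Sec. 3.6.2: remove the directions spanned by the first principal
    components of the bias vectors (those explaining ~95% of their variance)."""
    centered = biases - biases.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    V = Vt[:n_components]                   # top principal directions
    return f - (f @ V.T) @ V                # project those directions out

# Toy example: 200 training features in 128-D spread over 20 bins.
rng = np.random.default_rng(1)
train = rng.normal(size=(200, 128))
bins = rng.integers(0, 20, size=200)
B = bin_biases(train, bins)
probe = rng.normal(size=128)
print(bias_cancel(probe, 3, B).shape, variance_cancel(probe, B).shape)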
Therefore, the feature vector in the unified face feature vector space, f', of subject n in the pose i and quality j bin is defined as:

$$(f_n)'_{i,j} = (f_n)_{i,j} - \bar{f}_{i,j}\, P_m \qquad (3.24)$$

where P_m is the principal component matrix with its first nine eigenvectors replaced by zero vectors.

Chapter 4 Experimental results

We have conducted experiments to show the effectiveness of our representative face image samples, consisting of pooled and random face images within a fixed data volume, in terms of template size and recognition accuracy on the JANUS CS2 benchmark and the IARPA Janus Benchmark-A (IJB-A). Each experiment justifies one of our design choices (face image space quantization, reliable face image selection, face representation with pooled and random faces, and bin equalization). We then compare our performance with the state-of-the-art methods.

4.1 Database

We use the IJB-A and Janus CS2 benchmarks for our experiments. The IARPA Janus Benchmark A (IJB-A) [79] was published in 2015, introducing the concepts of "template" and "template-to-template matching". The database was designed to advance the development of face recognition algorithms by providing a more challenging real-world, unconstrained face database. It provides multiple templates for five hundred subjects, and each template contains face images captured by multiple camera systems. In particular, the database provides full pose variation and extremely poor image quality, and some templates contain more than one hundred face images. The IJB-A evaluation protocols consist of face verification and face identification. Janus CS2 also provides template structures using the same subjects and images as IJB-A, but with different evaluation protocols.

4.2 Performance Metrics

Face verification: We report the true acceptance rate (TAR) sampled at false acceptance rates (FAR) of 0.1, 0.01, and 0.001 on the Receiver Operating Characteristic (ROC). The metric compares each template against each other template (one-to-one comparison); the genuine scores from positive pairs and the impostor scores from negative pairs are then used to compute the true acceptance rate (TAR) and false acceptance rate (FAR). The ROC curve plots TAR against FAR.

Face identification: We report rank-1, rank-5, and rank-10 of the Cumulative Matching Characteristic (CMC) curve. The metric compares each probe template against all pre-defined gallery templates (one-to-many comparison); the scores are then sorted and ranked to determine the rank at which a true match occurs. The CMC curve plots the true positive identification rate, i.e., the probability of observing the true identity within the top K ranks, against rank.

Template size: This work aims at reducing the number of images used to represent a template. We therefore report the number of representative face images sampled for the template representation, giving the average ± standard deviation over probe and gallery templates while evaluating recognition performance. This measure is especially important when video frames are given, as shown later in Sec. 6, because fewer images per template mean fewer feature extractions, fewer feature comparisons, faster matching, and less storage space.
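As a concrete illustration of the verification and identification metrics just defined, the sketch below computes TAR at a fixed FAR from genuine and impostor score lists (using a simple threshold chosen on the impostor scores) and the CMC rank-K accuracy from a probe-by-gallery score matrix. This is a simplified estimator for illustration, not the official Janus evaluation code, and the toy scores are made up.

import numpy as np

def tar_at_far(genuine, impostor, far=0.01):
    """TAR at a fixed FAR: pick the threshold that accepts roughly the given
    fraction of impostor scores, then measure the accepted genuine fraction."""
    impostor = np.sort(impostor)[::-1]
    k = max(int(np.floor(far * len(impostor))) - 1, 0)
    threshold = impostor[k]
    return float((np.asarray(genuine) >= threshold).mean())

def rank_k_accuracy(score_matrix, true_gallery_ids, k=1):
    """CMC rank-K: fraction of probes whose mated gallery template appears
    within the top K ranked gallery scores."""
    hits = 0
    for scores, true_id in zip(score_matrix, true_gallery_ids):
        top_k = np.argsort(scores)[::-1][:k]
        hits += int(true_id in top_k)
    return hits / len(true_gallery_ids)

# Toy example: 3 probe templates scored against 5 gallery templates.
S = np.array([[0.9, 0.2, 0.1, 0.3, 0.0],
              [0.1, 0.4, 0.8, 0.2, 0.3],
              [0.2, 0.3, 0.1, 0.2, 0.6]])
print(rank_k_accuracy(S, [0, 2, 4], k=1))
print(tar_at_far(genuine=[0.9, 0.8, 0.6],
                 impostor=[0.3, 0.2, 0.1, 0.4, 0.05], far=0.2))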
4.3 System component analysis

In this section, we study each step of face image sampling with specially designed comparative experiments: face image space quantization, reliable face image selection (filtering), template representation (image types), and face image space integration (bin equalization). We then show that our template representation is superior to other template representations in terms of recognition accuracy and template size; in particular, it maintains recognition accuracy with far fewer face images per template than the others.

In more detail, we show why keeping a variety of facial appearances is better than keeping similar facial appearances, and why it is better to retain reliable face images that resemble a human face. We then show that our template representation offers a better trade-off between recognition accuracy and template size than the alternatives. Next, we show that recognition accuracy improves further after the feature representation of our representative face images is transformed to fit the integrated face feature space. Finally, we report that our sampling method outperforms other template representations in extensive experiments on IJB-A and Janus CS2.

4.3.1 Quantization of face image space

To reduce the number of images in an image set, traditional methods retained as many high-quality or near-frontal face images as possible to avoid compromising recognition accuracy, as introduced in Sec. 2. However, the newer benchmarks that introduce the template concept (such as IJB-A) contain many extreme poses and extremely low-quality images, as shown in Fig. 3.1, so it is necessary to evaluate whether the existing approaches really work for these challenging databases.

In this chapter, through a series of specially designed experiments, we show that keeping face images with various facial appearances is more beneficial than keeping only high-quality or only frontal face images when discarding face images that are not required for recognition performance. The results imply that, for a given sample size, it is more advantageous to maintain face images with various facial appearances than to keep frequently appearing or mutually similar face images. To reach these conclusions, we designed the following experiments to examine which samples are more valuable to keep in a template.

One effective way to reduce the number of images in a template is to follow the diversity selection methods introduced in Sec. 2, when some loss of face recognition accuracy is allowed. Their drawback is that they tend to sample face images similar to those appearing frequently in the training database. To avoid this drawback and remain independent of the training database, we quantize the face image space so as to sample diverse facial appearances from all facial appearances that can actually occur. Based on the assumption that pose and quality are the most influential factors on variations in facial appearance, we quantize the face image space by pose and by quality. Specifically, given a template with multiple face images, we quantize the face image space into five groups according to head pose (|Yaw|, explained in Sec. 3.3.1): {[0°, 20°), [20°, 40°), [40°, 60°), [60°, 75°), [75°, ∞)}. We also divide the face image space into five groups according to image quality (SSEQ, explained in Sec.
3.3.1): {(−∞, 0.45), [0.45, 0.55), [0.55, 0.65), [0.65, 0.75), [0.75, ∞)}. We then sample face images from each quantized space using three sampling methods: "Random selection (reference)", "REMOVE_ALL", and "KEEPONE".

Random selection (reference): We randomly select n% of the original samples. We use this method as our baseline, because random samples have been used to characterize the general performance of complex systems that are difficult to optimize [146, 135, 136].

REMOVE_ALL: We remove all images of each group in order. For the pose groups, we first remove all images in the largest pose group, [75°, ∞). In the next experiment, we additionally remove the second largest head pose group, [60°, 75°), and we repeat until only the frontal pose group [0°, 20°) remains. Similarly, the image quality groups are removed one by one, from the lowest-quality group until only the highest-quality group remains.

KEEPONE: The same as "REMOVE_ALL", except that one image is left in each removed group. This sampling method therefore keeps more varied facial appearances than "REMOVE_ALL" at the same template size.

We then assess recognition accuracy as a function of template size on Janus CS2, split 1, as shown in Fig. 4.1 and Fig. 4.2. In these experiments, we use AlexNet [81], trained on in-plane aligned face images of CASIA-WebFace [150], to extract feature vectors of the sampled face images as in Sec. 3.5.1, and we use the "All vs. All" matching method of Sec. 3.5.3.

Figure 4.1: The importance of sample diversity with respect to pose.

Both Fig. 4.1 and Fig. 4.2 show experimental results on the importance of sample diversity under quantization of the face image space; they plot recognition performance against template size for the three sampling methods. Fig. 4.1 shows that keeping face images with various poses (KEEPONE, blue line) is more advantageous than keeping only near-frontal face images (REMOVE_ALL NEAR PROFILE, red line) in terms of the trade-off between face recognition and template size, because "KEEPONE" achieves higher recognition accuracy than the other two methods at the same template size (average number of images in a template).

Figure 4.2: The importance of sample diversity with respect to quality.

Similarly, Fig. 4.2 shows that keeping face images with various image qualities (KEEPONE, blue line) is more advantageous than keeping only high-quality face images (REMOVE_ALL LOW QUALITY, red line) in terms of the same trade-off, because "KEEPONE" again achieves higher recognition accuracy at the same template size.

Finally, we conclude that keeping a few varied face images, in terms of head pose and image quality, is more beneficial than keeping only near-frontal or high-quality face images. In addition, by quantizing the face image space, we retain facial diversity while reducing the template size dramatically.

4.3.2 Face Selection

Given a template, our template representation selects face images using three components: (1) face image quantization, (2) filtering out non-human face images, and (3) selecting the best-aligned face images. The selected face images are then used to build the representative face images (face representation): (1') a random face and (2') a pooled face; a small sketch of how these pieces fit together follows below.
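The sketch below shows how these pieces might fit together for one template: each face is assigned to a pose/quality bin, detections below the 0.4 landmark-confidence threshold are dropped, up to five best-aligned images are kept per bin, and one pooled and one random face are emitted per bin. The dictionary keys ('img', 'yaw', 'quality', 'conf') are assumed names for illustration, not the actual data structures of our implementation.

import numpy as np
from collections import defaultdict

POSE_EDGES = [0, 20, 40, 60, 75]          # |yaw| bin lower edges (degrees)
QUALITY_EDGES = [0.45, 0.55, 0.65, 0.75]  # SSEQ-based quality bin edges

def bin_index(abs_yaw, quality):
    """Assign a face image to one of the 5 x 5 pose/quality bins of Sec. 3.3.1."""
    p = int(np.digitize(abs_yaw, POSE_EDGES)) - 1
    q = int(np.digitize(quality, QUALITY_EDGES))
    return p, q

def represent_template(faces, conf_thresh=0.4, n_best=5, rng=None):
    """Sketch of Sec. 3.3 / 4.3.2: quantize, drop non-face detections, keep the
    best-aligned images per bin, and emit one pooled and one random face per bin.
    Each face is a dict with keys 'img' (HxW array), 'yaw', 'quality', 'conf'."""
    rng = rng or np.random.default_rng()
    bins = defaultdict(list)
    for f in faces:
        if f['conf'] < conf_thresh:                    # filter non-human faces
            continue
        bins[bin_index(abs(f['yaw']), f['quality'])].append(f)

    pooled, randoms = [], []
    for members in bins.values():
        members.sort(key=lambda f: f['conf'], reverse=True)
        best = members[:n_best]                        # N best-aligned images
        pooled.append(np.mean([f['img'] for f in best], axis=0))  # pooled face
        randoms.append(best[rng.integers(len(best))]['img'])      # random face
    return pooled, randoms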
In this section, we evaluate the performance of each component of the face image selection methods (1)-(3), using each of our face representations (1')-(2'). The face selection methods and face representations are described in Sec. 3.3, and their feature representation and the "All vs. All" template matching method are explained in Sec. 3.5. The experimental items are as follows:

All Images (Baseline): All template images are used directly. The feature representation uses the VGG-19 CNN fine-tuned on the augmented CASIA-WebFace dataset (Sec. 3.4), with the original template structure.

Random_per bin: After quantizing the face image space by pose and quality as described in Sec. 3.3.1, we randomly select one of the images in each quantized bin, so the average template size across the database drops dramatically. This item represents the general performance obtained from face image quantization alone.

Pooling_all: After quantizing the face image space by pose and quality as described in Sec. 3.3.1, we average all images in each bin pixel-wise, so the average template size again drops dramatically. This representation shows the initial performance of our sampling method, which keeps facial diversity through quantization and compresses similar face images within each quantized subspace.

+wo non-face: Before selecting a random face or averaging face images in each quantized bin, we filter out non-human face images with landmark confidence below 0.4, as described in Sec. 3.3.2. The candidates for sampling a random face and a pooled face are then given by Eq. (3.10) and Eq. (3.11).

+best-aligned: Before selecting a random face or averaging face images in each quantized bin, we filter out most non-human face images and then retain up to five samples with the highest landmark confidence in each bin, following Sec. 3.3. The candidates for sampling a random face and a pooled face are then given by Eq. (3.12) and Eq. (3.11).

Chosen: We use all face images allowed in "+best-aligned", given by Eq. (3.12) and Eq. (3.11). Its feature representation is the same as that of "All Images" (the once fine-tuned VGG-19 CNN).

Table 4.1: Selection performance of a random face in quantized bins: (1) random selection per bin, (2) random selection per bin from all images except non-human face images, (3) = (2) + using up to five best-aligned face images with the highest landmark confidence in each quantized bin, (4) all images used in (3). "All Images" and (4) "Chosen" are reference sets. [tested on CS2, split 1]

Metric        | All Images  | (1) Random per bin | (2) +wo non-face | (3) +best-aligned | (4) Chosen
TAR@FAR=0.01  | 90.1        | 88.1               | 88.3             | 88.6              | 89.8
RANK@10       | 93.7        | 93.2               | 92.9             | 93.1              | 93.6
Temp.Size@G   | 23.4 ± 19.9 | 7.9 ± 3.8          | 7.7 ± 3.7        | 7.7 ± 3.7         | 16.8 ± 11.5
Temp.Size@P   | 6.9 ± 13.5  | 2.9 ± 3.3          | 2.8 ± 3.2        | 2.8 ± 3.2         | 5.3 ± 8.4

Table 4.2: Selection performance of a pooled face in quantized bins: (1) a pooled face using all images per bin, (2) a pooled face using all images except non-human face images, (3) = (2) + using up to five best-aligned face images with the highest landmark confidence in each quantized bin, (4) all images used in (3).
"All Images" and (4) "Chosen" are reference sets. [tested on CS2, split 1]

Metric        | All Images  | (1) Pooling all | (2) +wo non-face | (3) +best-aligned | (4) Chosen
TAR@FAR=0.01  | 90.1        | 86.8            | 88.6             | 89.2              | 89.8
RANK@10       | 93.7        | 92.7            | 93.6             | 93.7              | 93.6
Temp.Size@G   | 23.4 ± 19.9 | 7.9 ± 3.8       | 7.7 ± 3.7        | 7.7 ± 3.7         | 16.8 ± 11.5
Temp.Size@P   | 6.9 ± 13.5  | 2.9 ± 3.3       | 2.8 ± 3.2        | 2.8 ± 3.2         | 5.3 ± 8.4

Table 4.1 shows the face selection performance with random faces, and Table 4.2 with pooled faces. In both experiments, we select a set of face images from all images in the template according to several criteria and observe the resulting change in recognition performance and template size. "All Images" is thus the baseline for all other experiments, using all images in a template. "Chosen" shows the overall effect of all three selection blocks (quantization, filtering, selection of the N best-aligned images) applied directly to all images in each template. "Chosen" uses only 72% of the original template size on the gallery sets and 77% on the probe sets, yet nearly maintains recognition performance. (Both "All Images" and "Chosen" use all selected images in a template, so they appear in both tables regardless of face representation.) In addition, "Random per bin" and "Pooling all" use only 34% of the original gallery template size and 42% of the original probe template size, which means that selecting diverse samples through quantization of the face image space plays the major role in reducing template size.

We then select reliable face images in two steps. First, we filter out non-human face images by thresholding the landmark confidence, as explained in Sec. 3.3.2; the results with the random and pooling face representations are shown in column "(2) +wo non-face" of Table 4.1 and Table 4.2, respectively. Next, we use only the N best-aligned face images, again ranked by landmark confidence as in Eq. (3.12) of Sec. 3.3.3, to obtain better-quality pooled face images and more reliable matching pairs; these results are shown in column "(3) +best-aligned" of both tables.

In Table 4.1, the random face representation is not significantly affected by the reliability of the face images in the sampling population: the recognition performance of "Random per bin", "+wo non-face", and "+best-aligned" is not significantly different. "Random per bin" samples from all images in each quantized bin; "+wo non-face" samples from face images with landmark confidence above 0.4; and "+best-aligned" samples from up to five images with the highest landmark confidence in the bin. In other words, even after progressively removing unreliable face images from each bin before randomly sampling, recognition accuracy remains similar across the three items. Moreover, although their template sizes are much smaller than those of "All Images" and "Chosen", their recognition accuracy is only slightly lower.

On the other hand, Table 4.2 shows that pooled face images in quantized bins are directly influenced by the reliability of the face images in each bin. Recognition performance increases markedly as more reliable face images are used: the accuracy of "Pooling_all", "+wo non-face", and "+best-aligned" increases in turn. "Pooling_all" initially has lower recognition accuracy than "Random per bin" at the same template size.
However, as soon as the non-human face images are removed from the population, recognition performance improves, and when only the more reliable images remain, it improves further. In the end, the recognition accuracy is close to that of "All Images" and "Chosen", but with a much smaller template size.

To sum up: quantization of the face image space, together with the random and pooling face representations, greatly reduces template size while nearly preserving the original recognition performance; using reliable and better-aligned face images immediately improves the accuracy of the pooling face representation; and carefully designed face selection components preserve the original template performance well.

4.3.3 Face representation

In this section, we compare several template representations with respect to the trade-off between recognition accuracy and template size. We use the "Chosen" face images of the previous section as the population for the random and pooled samples. We also refer to "+best-aligned" with random samples as the "random face representation" and "+best-aligned" with pooled samples as the "pooling face representation". In Table 4.1 and Table 4.2, both representations show lower recognition accuracy than the original template.

Noting that pooled face images inherently lose facial details, as shown in Fig. 3.2, random face images are chosen as a second sample type to complement those details. We therefore perform an experiment comparing template representations while accounting for sample size. The compared representations are as follows:

All Images (reference): All template images are used directly. The feature representation uses the VGG-19 CNN fine-tuned on the augmented CASIA-WebFace dataset as described in Sec. 3.4, with the original template structure.

Random (Size=1): Random selection per bin with sample size one, as in Sec. 3.3.3. In each quantized pose and quality bin, the face images allowed for random selection are given by Eq. (3.11) and Eq. (3.12). This is one of our face representation types, intended to preserve facial details.

Random (Size=2): Random selection per bin with sample size two; otherwise identical to "Random (Size=1)". This item is a comparative baseline for our template representation at the same sample size.

Pooling: A pooled face image per bin, as in Sec. 3.3.3. In each quantized pose and quality bin, the allowed face images are given by Eq. (3.11) and Eq. (3.12). This is our face representation type for capturing the common facial appearance of a subject in each quantized bin.

Pooling & Random: The combination of a pooled face and a random face in each quantized pose and quality bin; our proposed template representation in image space.

Table 4.3: Recognition results comparing several template representations on Janus CS2: "All Images" provides the baseline using all images in a template; (1) one random face per bin, (2) two random faces per bin, (3) a pooled face per bin, (4) a pooled face and a random face per bin. Our template representation (4) shows higher recognition accuracy than the other representations while using far fewer images [tested on CS2, average over splits 1 to 10].
Metric          | All Images  | (1) Random (size=1) | (2) Random (size=2) | (3) Pooling | (4) Pooling & Random
TAR@FAR=0.01    | 90.6 ± 0.7  | 90.0 ± 0.8          | 90.5 ± 0.7          | 90.6 ± 0.7  | 90.9 ± 0.8
TAR@FAR=0.001   | 74.8 ± 2.2  | 75.2 ± 1.6          | 77.3 ± 1.9          | 77.4 ± 1.6  | 78.4 ± 1.3
TAR@FAR=0.0001  | 42.7 ± 11.2 | 54.7 ± 4.4          | 55.4 ± 4.7          | 55.2 ± 3.0  | 57.1 ± 4.6
RANK@1          | 84.9 ± 1.1  | 84.1 ± 1.4          | 84.9 ± 1.0          | 85.6 ± 0.9  | 86.3 ± 1.0
RANK@5          | 92.9 ± 0.8  | 92.9 ± 0.7          | 93.0 ± 0.6          | 93.5 ± 0.6  | 93.6 ± 0.6
RANK@10         | 94.7 ± 0.7  | 94.8 ± 0.5          | 94.7 ± 0.5          | 95.2 ± 0.6  | 95.2 ± 0.5
Temp.Size@G     | 24.3 ± 20.7 | 8.2 ± 3.9           | 12.1 ± 6.6          | 8.2 ± 3.9   | 12.1 ± 6.6
Temp.Size@P     | 7.3 ± 13.3  | 3.0 ± 3.3           | 4.1 ± 5.3           | 3.0 ± 3.3   | 4.1 ± 5.3

Table 4.3 shows the results of comparing several template representations on Janus CS2 in terms of recognition accuracy and template size. All four template representations achieve higher recognition accuracy than "All Images" on every measure except "TAR@FAR=0.01", with a much smaller template size. In more detail, "Random (size=2)" shows higher recognition accuracy than "Random (size=1)" and similar or lower accuracy than "Pooling". "Pooling" alone already matches or exceeds "All Images"; by increasing the sample size with random face images, we obtain higher accuracy still, especially in verification. Given that some templates originally contain only one image, it is not surprising that adding random samples to the pooled face representation increases one-to-one matching accuracy. Also, although "Random (size=2)" and "Pooling & Random" have the same template size, our pooled-and-random representation is superior. This series of analyses shows that our template representation offers the best trade-off between recognition accuracy and template size: it is much smaller than "All Images" yet yields higher recognition accuracy, and it is superior to the other representations.

4.3.4 Bin equalization

Our sampling method samples face images evenly across the pose and quality bins, so when two templates are matched, the compared image pairs differ noticeably in most cases. To alleviate this effect of the quantized face image space, we propose to integrate the quantized subspaces into a single space: one solution is to eliminate the bias of each quantized facial feature subspace, and the other is to remove the variance between the quantized facial feature subspaces. In this section, we compare the recognition performance of the proposed bin equalization methods. The experimental items are:

Without Bin.eq.: Once a template is represented by pooled and randomly selected face images, we map the face images to the deep feature space as in Sec. 3.5.1 and use that feature representation for template comparisons.

(1) Bias: We modify the original feature vector as explained in Sec. 3.6.1, removing the portion specific to the pose and quality of the bin. This portion is extracted per bin across all subjects of the training database, under the assumption that the pose and quality information of the twenty bins is mutually independent.

(2) Var: We modify the original feature vector as explained in Sec. 3.6.2. Apart from the bin-dependency assumption and the extraction method, everything else is the same as in (1) Bias.

(1) + (2): The face image space is complex and hard to model perfectly, so we apply (1) and (2) at the same time.
In practice, applying both methods together improves recognition accuracy slightly.

Table 4.4: Bin equalization performance of our representative face images using the pair-wise matching method [tested on CS2, average over splits 1 to 10]

Metric         | Without Bin.eq. | (1) Bias   | (2) Var    | (1)+(2)
TAR@FAR=0.01   | 91.0            | 90.8       | 91.0       | 91.1
TAR@FAR=0.001  | 78.0            | 78.3       | 78.5       | 78.7
RANK@1         | 86.2            | 86.5       | 86.7       | 86.7
RANK@5         | 93.7            | 93.9       | 93.9       | 94.0
RANK@10        | 95.3            | 95.4       | 95.6       | 95.6
Temp.Size@G    | 12.1 ± 6.6      | 12.1 ± 6.6 | 12.1 ± 6.6 | 12.1 ± 6.6
Temp.Size@P    | 4.1 ± 5.3       | 4.1 ± 5.3  | 4.1 ± 5.3  | 4.1 ± 5.3

Table 4.5: Bin equalization performance of our representative face images using the pooled feature matching method [tested on CS2, average over splits 1 to 10]

Metric         | Without Bin.eq. | (1) Bias  | (2) Var   | (1)+(2)
TAR@FAR=0.01   | 89.7            | 90.2      | 90.6      | 90.6
TAR@FAR=0.001  | 78.6            | 79.9      | 80.3      | 80.5
RANK@1         | 87.4            | 87.9      | 88.1      | 88.3
RANK@5         | 94.0            | 94.2      | 94.3      | 94.4
RANK@10        | 95.5            | 95.4      | 95.6      | 95.6
Temp.Size@G    | 2.0 ± 0.0       | 2.0 ± 0.0 | 2.0 ± 0.0 | 2.0 ± 0.0
Temp.Size@P    | 2.0 ± 0.0       | 2.0 ± 0.0 | 2.0 ± 0.0 | 2.0 ± 0.0

Table 4.4 and Table 4.5 report the recognition performance using the pair-wise matching method and the pooled feature matching method of Sec. 3.5.3, respectively. With pair-wise matching, the three bin equalization methods show slightly higher recognition accuracy than the baseline, except for "TAR@FAR=0.01, (1) Bias", and the best performance is obtained with the combined method. With pooled feature matching, all equalization methods improve recognition accuracy, and again the combination performs best. In addition, the pooled feature matching method shows higher recognition accuracy than pair-wise matching except for "TAR@FAR=0.01"; it performs especially well in identification and also reduces the number of pair comparisons. From these experiments, we conclude that unifying the quantized face image spaces of our sampling structure indeed improves recognition accuracy.

4.3.5 Component analysis

Table 4.6 and Table 4.7 report the gains contributed by each component of our sampling structure using the pair-wise matching method and the pooled feature matching method, respectively. They show the average performance on JANUS CS2 and IJB-A, reporting the TAR at FAR=0.01 (TAR1) and FAR=0.001 (TAR01) for the verification protocol and the recognition rate at Rank-1 for the identification protocol. The corresponding ROC and CMC curves are shown in Fig. 4.3 and Fig. 4.4.

In Table 4.6, "Random Face wo.bin" was created to measure the general system performance at the same template size as our sampling structure. We generated new templates by randomly sampling (12.1/24.3)% of the images from the original gallery templates and (4.1/7.3)% of the images from the probe templates. We consider this item a baseline: the gap between it and our results represents the gain obtained purely through the components of our structure (quantization, outlier removal, selection of the N best-aligned images, face representation, and bin equalization).

Table 4.6: Recognition performance of each component on the CS2 and IJB-A datasets with the pair-wise matching method. (TAR1 is reported at FAR=0.01 and TAR01 at FAR=0.001 for verification; the recognition rate at Rank-1 is reported for identification.) "Random Face wo.bin" was tested on Janus CS2, split 1.
Method                   | CS2 TAR1 | CS2 TAR01 | CS2 Rank1 | IJB-A TAR1 | IJB-A TAR01 | IJB-A Rank1 | Temp.Size Gal | Temp.Size Prob
All Images               | 90.6     | 74.8      | 84.9      | 86.1       | 67.6        | 86.2        | 24.3 ± 20.7   | 7.3 ± 13.3
Random Face wo.bin       | 83.6     | 65.5      | 88.1      | –          | –           | –           | 12.3 ± 10.9   | 3.9 ± 6.8
Random Face w.bin        | 90.0     | 75.2      | 84.1      | 85.2       | 68.0        | 85.3        | 8.2 ± 3.9     | 3.0 ± 3.3
Pooling Face             | 90.6     | 77.4      | 85.6      | 85.7       | 67.1        | 86.1        | 8.2 ± 3.9     | 3.0 ± 3.3
Pooling&Random           | 91.0     | 78.0      | 86.2      | 86.2       | 68.6        | 86.5        | 12.1 ± 6.6    | 4.1 ± 5.3
Pooling&Random + Bin.eq. | 91.1     | 78.7      | 86.7      | 86.3       | 68.7        | 87.1        | 12.1 ± 6.6    | 4.1 ± 5.3

Table 4.7: Recognition performance of each component on the CS2 and IJB-A datasets with the pooled feature matching method. (TAR1 is reported at FAR=0.01 and TAR01 at FAR=0.001 for verification; the recognition rate at Rank-1 is reported for identification.)

Method                   | CS2 TAR1 | CS2 TAR01 | CS2 Rank1 | IJB-A TAR1 | IJB-A TAR01 | IJB-A Rank1 | Temp.Size Gal | Temp.Size Prob
All Images               | 88.1     | 75.4      | 85.6      | 82.8       | 66.9        | 86.8        | 1.0 ± 0.0     | 1.0 ± 0.1
Random Face              | 88.7     | 76.6      | 85.7      | 83.5       | 69.2        | 86.7        | 1.0 ± 0.0     | 1.0 ± 0.1
Pooling Face             | 89.5     | 77.9      | 87.0      | 83.9       | 69.8        | 87.3        | 1.0 ± 0.0     | 1.0 ± 0.1
Pooling&Random           | 89.7     | 78.6      | 87.4      | 84.7       | 70.6        | 87.6        | 2.0 ± 0.0     | 2.0 ± 0.2
Pooling&Random + Bin.eq. | 90.6     | 80.5      | 88.3      | 85.5       | 72.2        | 88.0        | 2.0 ± 0.0     | 2.0 ± 0.2

In Table 4.6 and Table 4.7, "Random Face (w.bin)" represents the general performance obtained from quantization and is one of our face representation types. "Pooling Face" shows the combined performance of all components used to reduce the template size, apart from the choice of representation type. "Pooling&Random" shows the fundamental performance of our representative face images in the image domain, and "Pooling&Random + Bin.eq." additionally exploits the bin category information during matching. Together, these items show that our sampling structure improves in stages. Finally, our template representation yields a much smaller template size than the original template while improving recognition accuracy; its biggest practical advantage is that fewer face images require less processing time and less storage space, reducing the burden at test time.

4.3.6 With fewer images, higher recognition accuracy

In Fig. 4.3, we show the improvement obtained from our representative facial image structure and from its deep features fine-tuned with our template representation. Random sampling without quantization of the face image space is not helpful here, but quantization by head pose and image quality greatly improves recognition even with per-bin random sampling. Step by step, adding face selection, pooling, and our face representation on top of quantization, we raise the recognition results above our "All Images" reference.

While our focus is on constructing a representative facial image structure for a template, recent research favors feature pooling to increase recognition accuracy, such as video pooling and media pooling [93]; we further improve recognition accuracy with feature pooling, as shown in Fig. 4.4. We obtain higher identification and verification accuracy than using all images with naive pixel-wise average pooling, and we show the potential for further improvement using our well-categorized image structure (here demonstrated by removing the less identity-related pose and quality portions from the feature vectors).

Figure 4.3: CS2 and IJB-A CMC and ROC curves using the pair-wise matching method.

Figure 4.4: CS2 and IJB-A CMC and ROC curves using the pooled feature matching method.

4.4 Comparison with the state-of-the-art
Table 4.8 compares our performance to a number of existing methods on the Janus benchmarks. Our results largely outperform those of the commercial and government off-the-shelf systems (COTS, GOTS) [79], the Open Source Biometric tool (OpenBR, v0.5) [80], and Fisher Vector encoding with frontalization [27]; the difference comes from the use of deep learning.

Table 4.8: Comparative performance analysis on JANUS CS2 and IJB-A for verification (ROC) and identification (CMC). TAR0.01 means TAR@FAR=0.01 and TAR0.001 means TAR@FAR=0.001. ft denotes fine-tuning a deep network multiple times, for each training split. pw denotes naive all-sample pair-wise comparison matching. vp denotes video pooling. fp denotes a single feature pooling. Data-PSE denotes augmented training data covering pose, shape, and expression. st denotes synthesizing face images at test time.

Method                             | CS2 TAR0.01 | CS2 TAR0.001 | CS2 Rank-1 | CS2 Rank-5 | CS2 Rank-10 | IJB-A TAR0.01 | IJB-A TAR0.001 | IJB-A Rank-1 | IJB-A Rank-5 | IJB-A Rank-10
COTS [79]                          | 58.1 | 37   | 55.1 | 69.4 | 74.1 | –    | –    | –    | –    | –
GOTS [79]                          | 46.7 | 25   | 41.3 | 57.1 | 62.4 | 40.6 | 19.8 | 44.3 | 59.5 | –
OpenBR [80]                        | –    | –    | –    | –    | –    | 23.6 | 10.4 | 24.6 | 37.5 | 37.5
Fisher Vector [27]                 | 41.1 | 25   | 38.1 | 55.9 | 63.7 | –    | –    | –    | –    | –
Chen et al. [26]                   | 64.9 | 45   | 69.4 | 80.9 | 85   | 57.3 | –    | 72.6 | 84   | 88.4
Chen et al. [26], ft               | 92.1 | 78   | 89.1 | 95.7 | 97.2 | 83.8 | –    | 90.3 | 96.5 | 97.7
Swami et al. [117], ft             | –    | –    | –    | –    | –    | 79.0 | 59   | 88   | 95   | –
Pooling [56], ft, st, pw           | 87.8 | 74.5 | 82.6 | 91.8 | 94.0 | 81.9 | 63.1 | 84.6 | 93.3 | 95.1
Deep Multi-Pose [3], st, pw        | 89.7 | –    | 86.5 | 93.4 | 94.9 | 78.7 | –    | 84.6 | 92.7 | 94.7
Pose Aware [94], st, pw            | –    | –    | –    | –    | –    | 82.6 | 65.2 | 84.0 | 92.5 | 94.6
Pose Aware [94], st, vp            | –    | –    | –    | –    | –    | 84.7 | 71.1 | 86.2 | 94.3 | 95.3
Masi et al. [93], Data-PSE, st, vp | 92.6 | 82.4 | 89.8 | 95.6 | 96.9 | 88.6 | 72.5 | 90.6 | 96.2 | 97.7
Masi et al. [93], pw               | –    | –    | –    | –    | –    | 84.9 | 62.3 | 86.3 | 94.5 | 96.5
Masi et al. [93], vp               | –    | –    | –    | –    | –    | 86.3 | 67.9 | 88   | 94.7 | 96.6
OURS, pw                           | 91.1 | 78.7 | 86.7 | 94.0 | 95.6 | 86.3 | 68.7 | 87.1 | 94.1 | 95.6
OURS, fp                           | 90.6 | 80.5 | 88.3 | –    | –    | 85.5 | 72.2 | 88.0 | –    | –

Even compared with other deep learning approaches, our recognition accuracy is competitive, and a closer look shows that our approach is far more advantageous in computational complexity at the test stage. First, all methods except [56] use the original template for deep feature extraction, and the methods marked st additionally synthesize face images during testing; except for [56], they therefore produce more deep feature vectors than the original template. In particular, Masi et al. [93] produce nearly three times as many face images (frontal, half-profile, and profile) and feature vectors as the original template at test time to reach state-of-the-art accuracy, even though they use a model trained on larger databases (Data-PSE).

Considering computational complexity, Swami et al. [117] use a single pooled feature, Masi et al. [93] use a reduced number of feature vectors, and Hassner et al. [56] use a reduced number of face images; however, without resorting to a 3D rendering engine, none of them uses fewer images than the original template in the testing phase. In addition, Chen et al. [26] and Swami et al. [117] fine-tune their deep models on the target database and perform metric learning. In particular, Chen et al. [26] report two very different results depending on whether they fine-tune on the target database, which means their results are strongly affected by the characteristics of the training database.
Finally, when we compare our results with those of Masi et al. [93] obtained under similar conditions (the third and fourth rows from the bottom of Table 4.8), our template representation achieves similar or higher recognition accuracy with a much smaller template size.

Part II Application to videos

Chapter 5 Video-based face recognition pipeline

5.1 Motivation

An automatic system for unconstrained video-based face recognition suffers from computational complexity because of the large number of usable face images. In addition, the face images vary widely: facial appearance varies with head pose, including viewpoint changes due to camera and subject movement, and image quality varies greatly with the capturing system and the real-world capture environment (e.g., blurriness depending on image size, subject movement, digital image distortion, etc.). Moreover, consecutive frames provide many similar images with little displacement. Video-based face recognition systems therefore handle an enormous number of varied yet often redundant face images.

A large number of usable face images directly translates into storage space and computational time in practical applications. Using all available face images and extracting all their feature vectors is one strategy to exploit the discriminative power of deep features maximally. Rather than increasing recognition accuracy at the cost of computational complexity, however, we proposed a reduced template representation that does not lose recognition accuracy, and we expected its benefits to shine in video-based face recognition. To show these benefits over the baseline systems clearly, we implemented a complete video-based face recognition pipeline by adding a face collector to our sampling system pipeline.

5.2 System overview

Figure 5.1: System block diagram at test time: given a protocol and images, video frames, or video files, this is our sampling system block diagram. The video-based protocol requires the face collector block (red box); the other cases do not. Only two modules (boxes filled with sky blue) are added to the baseline system. The computational cost of the added blocks is very small, yet in practice these two components greatly reduce the computational complexity of the subsequent blocks.

We first collect faces for a template from all image files and videos according to the protocol. For a video, we collect face images starting from a given bounding box in a frame with two different methods, as described in Sec. 5.3: a face tracker [113] provides face areas in subsequent video frames from the given face bounding box until tracking fails, while the combination of a face detector [149] and an associator first detects multiple face areas in each frame and then provides the subsets of detected faces that are likely to belong to the same identity across frames. Once the face images for a subject are collected, the subsequent processing is the same as in Sec. 3.3: the face images are aligned using the detected landmarks, the aligned face images are filtered by our sampling method, the representative face images within a fixed volume are encoded by our deep model, and the bin equalization technique provides pose- and quality-insensitive feature vectors. Finally, we compute the matching scores and perform face recognition under the CS3/IJB-B video protocols. The block diagram of our face recognition system is shown in Figure 5.1.
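At a high level, the test-time flow of Fig. 5.1 can be summarized by the sketch below. All component names and interfaces (collector, sampler, encoder, equalizer, matcher) are placeholders standing in for the blocks described in this chapter, not an actual API.

def recognize_video_probe(video, bbox, gallery, collector, sampler,
                          encoder, equalizer, matcher):
    """Placeholder sketch of the Fig. 5.1 pipeline: collect faces for the
    annotated subject, sample a compact template, encode, bin-equalize, and
    match against every gallery template."""
    faces = collector.collect(video, bbox)        # tracker or detector+associator
    faces = [f for f in faces if min(f.width, f.height) >= 20]  # drop tiny faces (Sec. 5.3)
    template = sampler.represent(faces)           # pooled + random faces per bin
    features = [equalizer.apply(encoder.encode(img)) for img in template]
    scores = {gid: matcher.score(features, gfeat) for gid, gfeat in gallery.items()}
    best_match = max(scores, key=scores.get)      # top-ranked gallery identity
    return best_match, scores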
In this chapter, we briefly describe how we implement the system, including face collection, image sampling, feature extraction, and matching. In particular, to estimate the computational complexity experimentally, we describe how we organize the feature extraction block, including the image storage system.

5.3 Face collections from a video

To collect faces from a video given a face bounding box, we tried two methods: the ALIEN tracker, and YOLO face detection combined with a face associator. In both cases, we sample the video frames at regular intervals to reduce the burden of processing massive numbers of frames, and we remove small faces whose width or height is below 20 pixels, assuming that they provide only low-confidence facial features.

Figure 5.2: An example video: a few frames from a video file in the IJB-B database [140] (154.mp4). The video is composed of multi-shot frames, often containing multiple subjects. Given the annotated red bounding box of a target subject at the initial stage, tracklet fragments of the target subject are displayed along the white arrow line; the dashed line marks tracking failure. The yellow boxes show all face areas detected in each frame, regardless of which subject the face tracker follows.

5.3.1 Face tracker

Starting from the given face bounding box in the initial frame of a video, we use the ALIEN (Appearance Learning In Evidential Nuisance) tracker [106], because it was implemented especially for face tracking applications, adopting face re-detection to distinguish face identities, and it performs long-term tracking even under severe visibility artifacts and occlusions. This non-parametric online tracker exploits weakly aligned multiple-instance local features to gather evidence for tracking despite the nuisance factors induced by physical shape observations, determining the most useful image features that discriminate between the object and the rest of the imaged scene.

5.3.2 Face detector and associator

Tracking often fails due to large appearance variations induced by extreme environment changes, such as illumination or abrupt camera motion, even within a single-shot video. Recent video-based face recognition databases (i.e., those with video protocols), Janus CS3 and the IARPA Janus Benchmark-B (IJB-B) [140], provide even more serious real-world scenarios: videos composed of disconnected scenes, often containing multiple people, as shown in Figure 5.2. Recent work has therefore proposed detection-based face tracking, combining a high-performance face detector with face association based on overlapping face areas and strong facial feature vectors between consecutive frames. The strategy is that the larger the overlapping areas and the higher the similarity of the feature vectors, the more likely the fragmented tracklets belong to the same target subject and should be linked [30]. In [149, 77], a high-performance YOLO (You Only Look Once) cascade face detector [110], trained as a cascade of shallow fully convolutional neural networks (FCN), was used. Given the multiple face areas detected in each frame of a video, face association was performed as inspired by [20, 30].

5.4 CNN training and feature extraction

We encode our pooled and random faces into deep features with a ResFace [93] model fine-tuned on the MS-Celeb-1M [88] and CASIA-WebFace [150] face databases.

Fine-tuning of ResFace As in Sec.
3.4, we trained a deep convolutional neural network (DCNN), but this time with a deeper network and larger training databases to obtain higher discriminative power. We fine-tuned the publicly available ResFace [93] model, which uses the ResNet-101 architecture [59]. To obtain the advanced deep model, we augmented the training data by merging large face databases, CASIA-WebFace [150] and MS-Celeb-1M [88], and generating their rendered faces as in [93]. From the augmented training data, we then randomly generated several templates per subject, separately for each rendered view and for the real images, all sharing the same label. By applying our sampling method (pooled and random faces) to each template, we obtained a wide variety of pooled faces. Finally, with this augmented data we fine-tuned ResFace and obtained an advanced deep model that is additionally robust to pooled faces.

To obtain the deeply learned model for the proposed representative sampled face images, fine-tuning was performed by minimizing the cross-entropy loss L over the entire training set, using the SoftMax function and a one-hot vector c encoding the ground-truth class i, i.e., L = -Σ_i c_i log p_i(I), given the label c. We used an initial learning rate of 0.001, applied to the entire network except for the new fc8 layer, whose learning rate is an order of magnitude larger.

Feature extraction pipeline With the fine-tuned weights and biases of the 101-layer ResNet architecture, we extract the deep features of our pooled and random faces by taking the response of the pool5 layer, using the pipeline shown in Figure 5.3 and assuming limited computational resources. As the image storage system, we adopted LMDB (Lightning Memory-Mapped Database) [34], a memory-mapped key-value store. By storing a single large file in LMDB form rather than a large number of small files [14, 111], we reduce the burden of read and write operations, can retrieve the multiple image files of one template with a single key, and allow multiple applications to access the file simultaneously for parallel processing.

We employed the Caffe framework [71] to run the deep model, together with Python feature extraction code [1]. We also used a GPU to extract deep features from the LMDB file, because GPU computing achieves much higher speedups than CPU-only versions, and large graphics memory allows a big batch size that shortens extraction time; we set the batch size to 30 in our experiments. Without these, execution is almost four times slower. In the experimental section we show how much our sampling representation reduces image storage space and feature extraction time compared to the original face collections on this baseline system, as well as the recognition accuracy.

Figure 5.3: System diagram for feature extraction in our experiments: we use LMDB (Lightning Memory-Mapped Database) as the image storage system for the representative face images, and a single GPU then extracts deep features from the LMDB file.

Lightning Memory-Mapped Database (LMDB) is a software library that provides a high-performance embedded transactional database in the form of a key-value store. LMDB is written in C with API bindings for several programming languages.
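As an illustration of the storage step, the sketch below writes the face crops of one template into LMDB under a shared key prefix and reads them back for batched encoding, using the Python lmdb binding. The key scheme, map size, and image shape are assumptions for the example and need not match the exact format used by the Caffe-based pipeline.

import lmdb
import numpy as np

def write_template_images(db_path, template_id, images):
    """Store every aligned face crop of a template under keys that share the
    template id, so one prefix scan retrieves the whole template."""
    env = lmdb.open(db_path, map_size=1 << 30)          # 1 GB map size
    with env.begin(write=True) as txn:
        for idx, img in enumerate(images):
            key = f'{template_id}/{idx:04d}'.encode()
            txn.put(key, np.ascontiguousarray(img, dtype=np.uint8).tobytes())
    env.close()

def read_template_images(db_path, template_id, shape=(224, 224, 3)):
    """Read the crops of one template back as uint8 arrays for batched encoding."""
    env = lmdb.open(db_path, readonly=True)
    prefix = f'{template_id}/'.encode()
    images = []
    with env.begin() as txn:
        cursor = txn.cursor()
        if cursor.set_range(prefix):                    # jump to the first key of the template
            for key, value in cursor:
                if not key.startswith(prefix):          # stop at the next template
                    break
                images.append(np.frombuffer(value, dtype=np.uint8).reshape(shape))
    env.close()
    return images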
Chapter 6

Experimental results on video-based face recognition

In this chapter, on the video-based face recognition protocols of the Janus CS3 and IJB-B benchmarks, we conduct experiments to show the effectiveness of our template representation compared to the original template. We experimentally evaluate computational complexity, in terms of image storage space and feature extraction time, as well as recognition accuracy. Because many practical applications only have limited computational resources, our image sampling system greatly reduces the burden on the system while improving recognition accuracy. Finally, we compare our recognition accuracy with current state-of-the-art methods and show that computational complexity is reduced while recognition accuracy is improved. We use a computer system with a 2.4 GHz 16-core CPU, 125 GB RAM, and an NVIDIA Tesla K40 with 12 GB of graphics memory and 2,880 CUDA cores. CUDA v7.5, cuDNN v5.1, and the OpenBLAS library were used on Linux CentOS 7.3.1611 with an XFS file system.

6.1 Database and protocol

JANUS CS3 & IJB-B. The JANUS CS3 database is an extended version of IJB-A [79]; it contains 11,876 images, 55,372 video frames, and 7,245 video clips of 1,871 subjects. In this dissertation, we are interested in protocol 6, which provides a video-based face recognition (video-to-image) scenario containing 7,195 probe templates, 940 gallery 1 templates, and 930 gallery 2 templates. Each probe template is marked by a human-annotated face bounding box identifying the target in one frame of the video. The probe templates are evaluated against two galleries that are disjoint in terms of subjects. Each gallery template is composed of several still images with a human-annotated bounding box indicating the subject in the image. Protocol 6 is an open-set identification problem: the goal is to search the gallery for the mated template of a given probe template, but some probe templates have no mated template in the gallery. We therefore estimate the ranking accuracy only over the probe templates that do have a mated template in the gallery, i.e., we report closed-set search performance. In addition, we consider the IARPA Janus Benchmark-B (NIST IJB-B) [140] dataset, also a superset of IJB-A. IJB-B shares the CS3 data, so it also has a video-based face recognition protocol, but with different combinations of templates. It consists of 1,845 subjects with human-labeled ground-truth labels,
21,798 still images, and 55,026 frames from 7,011 videos. Protocol 6 defines an open-set video-to-image identification problem. It provides two gallery sets and one probe set; each gallery set contains only still images, and the probe set consists only of videos. Open-set means that a probe subject is not guaranteed to be contained in the gallery.

6.2 Performance metrics

The advantage of our method is that it reduces computational complexity by sampling face images from the face collections, and this gain is largest for video-based face recognition. We therefore report the image storage space and the computational time required to extract the deep features on the baseline system described in Chapter 5, focusing on the complexity induced by deep learning methods, whose high computing costs hinder practical applications. Besides these, we report face recognition accuracy. A small sketch of how the accuracy metrics are computed is given at the end of this section.

Standard Janus verification metrics: Following the evaluation metrics suggested for IJB-A [79], identification performance is reported at Rank-1, Rank-5, and Rank-10. Average verification performance is reported as the true accept rate (TAR) at false alarm rates (FAR) of 1e-2, 1e-3, and 1e-4.

Image storage space: We measure the image storage space on the storage system shown in Figure 5.3. To reduce the burden of access from other applications, we store images in LMDB files before extracting deep features, and we report the comparison of file sizes between the original templates and our template representation. We do not report the trade-off between storage space and computational time in this thesis, but by using LMDB we were able to use the computer resources efficiently (a large number of small files drastically slows down access to the storage system and drags down overall system performance, as explained in [14, 111]).

Feature extraction time: Even though LMDB reduces the I/O burden, extracting deep features still incurs high computing costs. We measure the computational time for the original templates and for our method, showing that our method reduces the extraction time dramatically.
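For reference, the sketch below shows one way the two metric families above can be computed from raw similarity scores; the score convention (higher means more similar) and the function names are illustrative, not the official Janus evaluation code.

    import numpy as np

    def tar_at_far(genuine_scores, impostor_scores, far):
        """True accept rate at a fixed false alarm rate, for verification."""
        impostor = np.sort(impostor_scores)[::-1]                 # descending scores
        k = max(int(np.floor(far * len(impostor))) - 1, 0)
        threshold = impostor[k]                                   # accepts ~far of impostors
        return float(np.mean(np.asarray(genuine_scores) >= threshold))

    def rank_k_accuracy(score_matrix, gallery_labels, probe_labels, k):
        """Closed-set identification: fraction of probes whose mate is in the top-k matches."""
        hits = 0
        for scores, label in zip(score_matrix, probe_labels):     # one row per probe
            top_k = np.argsort(scores)[::-1][:k]
            hits += int(label in np.asarray(gallery_labels)[top_k])
        return hits / float(len(probe_labels))

tar_at_far picks the decision threshold from the impostor score distribution so that roughly the requested fraction of impostor pairs is accepted, then measures how many genuine pairs pass that threshold; rank_k_accuracy counts how often the mated gallery template appears among the k highest-scoring gallery entries for a probe.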
6.3 Recognition accuracy on two different face collections

In video-based face recognition, the input face database is collected by a face detector and/or a face tracker, so recognition performance is directly influenced by the performance of the face collector. In this section, we evaluate our sampling method in terms of recognition accuracy as a function of the collected face database. To do so, we collect face images on the Janus CS3 and IJB-B [140] video-based face databases using two different face collectors: the ALIEN tracker (AT) [107] and the face associator (FA) [25, 77] with the YOLO detector [110, 149]. AT can track over long video sequences using traditional SIFT features, but it does not work well on multi-shot video frames because of its weaker tracking performance. FA, on the other hand, uses a strong face detector based on deep learning and tracks a face by comparing the feature similarity and the overlapping area between adjacent face bounding boxes, and thus successfully tracks multi-shot frames. Because Janus CS3 and IJB-B do not provide continuous single-subject footage, but rather discrete multi-shot videos containing multiple subjects, these two face collectors produce face databases of different quality for recognition.

Once we obtain the two face databases from the video face database using the two face collectors, we compare the recognition performance of the baseline face recognition system with that of the face recognition system using our sampler on both databases. Table 6.1 and Table 6.2 show that our sampling system consistently achieves higher recognition accuracy than using the original face collection, on both Janus CS3 and IJB-B. In addition, we compare the difference in recognition accuracy between the two face collections to see how insensitive the proposed method is to the quality of the collections, compared to the baseline. To do so, we compute the difference between the recognition accuracy obtained with AT and with FA, and average the results of gallery 1 and gallery 2, separately for each sampling method. These results are shown in the two rightmost columns of the tables. The baseline shows a much larger difference than ours, which means that our sampling method works relatively well even with a low-performance face collector and is less affected by the quality of the input database.

Table 6.1: Face recognition accuracy by sampling method on the Janus CS3 database. Our sampling method substantially improves recognition on both face collections.

Janus CS3           | Gallery 1 (G1)              | Gallery 2 (G2)              | Avg(G1,G2)
Face collector (FC) | Alien Tr.    | Face Ass.    | Alien Tr.    | Face Ass.    | FC diff.
Sampling method     | All  | Ours  | All  | Ours  | All  | Ours  | All  | Ours  | All  | Ours
TAR @FAR=1e-2       | 72.3 | 80.5  | 78.7 | 83.5  | 62.0 | 68.4  | 69.7 | 72.4  | 7.05 | 3.5
TAR @FAR=1e-3       | 52.0 | 65.0  | 60.4 | 68.5  | 37.8 | 50.0  | 47.2 | 53.9  | 8.9  | 3.7
TAR @FAR=1e-4       | 29.9 | 46.7  | 39.4 | 48.3  | 15.7 | 31.8  | 19.6 | 33.9  | 6.7  | 1.85
Rank=1              | 58.5 | 65.9  | 67.3 | 68.5  | 48.5 | 53.5  | 56.4 | 57.4  | 8.35 | 3.25
Rank=5              | 71.6 | 77.4  | 79.0 | 81.0  | 60.0 | 65.8  | 67.9 | 68.2  | 7.65 | 3
Rank=10             | 76.2 | 81.9  | 82.8 | 85.3  | 64.9 | 70.3  | 72.7 | 73.4  | 7.2  | 3.25

Table 6.2: Face recognition accuracy by sampling method on the IJB-B database. Our sampling method substantially improves recognition on both face collections.

IJB-B database      | Gallery 1 (G1)              | Gallery 2 (G2)              | Avg(G1,G2)
Face collector (FC) | Alien Tr.    | Face Ass.    | Alien Tr.    | Face Ass.    | FC diff.
Sampling method     | All  | Ours  | All  | Ours  | All  | Ours  | All  | Ours  | All  | Ours
TAR @FAR=1e-2       | 72.8 | 80.5  | 79.5 | 83.7  | 60.4 | 68.6  | 69.2 | 72.2  | 7.75 | 3.4
TAR @FAR=1e-3       | 52.2 | 66.0  | 61.6 | 69.5  | 35.7 | 50.6  | 46.3 | 54.3  | 10   | 3.6
TAR @FAR=1e-4       | 30.4 | 47.0  | 40.9 | 49.9  | 14.6 | 32.3  | 17.8 | 35.2  | 6.85 | 2.9
Rank=1              | 58.4 | 66.3  | 67.5 | 69.4  | 48.5 | 54.0  | 56.6 | 58.0  | 8.6  | 3.55
Rank=5              | 71.2 | 77.8  | 79.6 | 81.7  | 59.4 | 66.2  | 67.8 | 68.9  | 8.4  | 3.3
Rank=10             | 76.2 | 82.5  | 83.7 | 85.8  | 64.5 | 70.9  | 73.1 | 74.5  | 8.05 | 3.45

6.4 Performance analysis

The biggest advantage of our sampling method is that it keeps only a fixed amount of face images even when a large number of face images is collected from the video frames. Our sampling method therefore requires less storage space and less feature extraction time. In addition, it does not trade recognition accuracy for this reduction in computational complexity. In this section, we show that our sampling method reduces computational complexity while improving recognition accuracy, also comparing our system with a naive sampler. To clarify these benefits, we first compare the reference system with the system using our sampling method in terms of the storage space and computational time required during deep feature extraction. To do so, we use the ALIEN tracker to obtain the input face collection from the video frames.
Then, once images are sampled by our sampler, they are all converted to LMDB (Lightning Memory-Mapped Database) files so that feature vectors can be extracted efficiently in a short time at the cost of some storage space. We then compare the storage space and the feature extraction time between "(1) All" and "(3) ours", as shown in Table 6.3, and we compare the recognition accuracy in Table 6.4. Table 6.3 shows that our sampling method reduces the required storage space by about a factor of three compared to the baseline system and shortens the feature extraction time by a factor of three to four. Table 6.4 shows that, at the same time, our system even increases recognition accuracy over the baseline by a large margin.

However, the input database may simply contain many non-face images, or images that could easily be removed in other ways. We therefore set up another experiment with a naive sampler based on the landmark confidence level, to check the superiority of our system's sampling performance. Considering the storage space used by our sampling structure, we find the landmark confidence threshold that makes the baseline system keep a similar number of face images to our system; this threshold is >0.71, as shown in Table 6.3. We then compare the recognition accuracy between "(2) cfd" and "(3) ours" in Table 6.4. Even though Table 6.3 shows that our sampling method still uses less storage space and a shorter feature extraction time, the system using the naive sampler achieves much lower recognition accuracy than our system. In addition, Table 6.3 shows that, unlike the naive sampler, our sampling method does not reduce the gallery samples but removes a large number of probe images. This suggests that smartly selecting diverse images, rather than simply keeping the most reliable human face images, is what actually helps recognition accuracy.

To sum up, the experimental results show that our sampling method effectively represents a template, reducing the template size while even increasing recognition accuracy. With a larger video database, the system would gain even more in terms of computational complexity, because the sampling method uses a constant data volume for each template.

Table 6.3: Comparison of computational complexity, tested on CS3, protocol 6, with the feature pooling method.

                 | storage space [GByte]        | feature extraction time [s]
                 | gal1 | gal2 | probe | total  | batch=1 | batch=30
(1) All          | 0.5  | 0.5  | 19.0  | 20.0   | 11670   | 2649
(2) cfd >0.71    | 0.3  | 0.3  | 6.1   | 6.7    | 3913    | 923
(3) ours         | 0.5  | 0.5  | 5.0   | 6.0    | 3544    | 827

Table 6.4: Comparison of face recognition accuracy, tested on CS3, protocol 6, with the feature pooling method.

                 | (1) All | (2) cfd >0.71 | (3) ours
TAR @FAR=1e-2    | 67.2    | 69.6          | 74.4
TAR @FAR=1e-3    | 44.9    | 48.3          | 57.5
TAR @FAR=1e-4    | 22.8    | 26.2          | 39.2
Rank=1           | 53.5    | 56.7          | 59.7
Rank=5           | 65.8    | 68.5          | 71.6
Rank=10          | 70.53   | 73.3          | 76.1

6.5 Comparison with the state-of-the-art

Finally, we compare recognition performance with state-of-the-art methods, as shown in Table 6.5. While state-of-the-art methods have focused on collecting face images from a video and use naive interval sampling to reduce computational time, we focus on smartly sampling from the given face collections.

Table 6.5: Recognition comparison with the state-of-the-art, tested on CS3, protocol 6, average of gallery 1 and gallery 2. * and ** use exactly the same face collector but a different face recognition pipeline;
** uses another recognition system, reported in [77].

                 | Alien tracker              | Face associator                    | MD
                 | *Ours | *[107] | **[107]   | *Ours | *[25] | **[25] | [25]      | **[100]
TAR @FAR=1e-2    | 74.4  | 73.3   | 76.9      | 78.6  | 78.3  | 79.8   | -         | 75.4
TAR @FAR=1e-3    | 57.5  | 52.4   | 57.4      | 61.2  | 58.6  | 60.8   | 49.8      | 60.5
TAR @FAR=1e-4    | 39.2  | 27.7   | 34.1      | 41.1  | 33.4  | 37.0   | 34.1      | 33.7
Rank=1           | 59.7  | 56.3   | 59.9      | 63.0  | 61.4  | 63.2   | 61.0      | 61.3
Rank=5           | 71.6  | 67.7   | 71.5      | 74.6  | 73.2  | 74.8   | 73.4      | 71.8
Rank=10          | 76.1  | 72.1   | 76.2      | 79.4  | 77.4  | 78.7   | 77.9      | 73.5

In our experiments, we use two state-of-the-art face collection methods (the ALIEN tracker and the face associator), and thus two face collections. We compare our sampling method with a state-of-the-art recognition method [93] that uses more images than the original template by adding several rendered views (denoted by *). Moreover, [77] reported new state-of-the-art results using [95], which incurs much higher computational complexity by synthesizing the background as well as rendered views at test time (denoted by **). Even though both of these systems spend much more computation, their recognition performance is similar to or lower than ours. In addition, we include the results originally reported by [25] in the fourth column of the face associator block of the table. We also compare against the Multi-Domain Network (MDNet tracker [100]), which learns a shared representation of targets from multiple annotated video sequences for tracking, treating each video as a separate domain, combined with [95].

The baseline recognition accuracy can vary with the face recognition method and with the naive sampling interval of the face collector. However, the results marked by "*" used exactly the same face collections per template as system input. Our sampling method by itself shows much higher performance than using all images when synthesizing additional images at test time is not allowed, as shown in Table 6.1 and Table 6.2. In conclusion, our method is consistently comparable to or much better than the state-of-the-art in terms of recognition accuracy, while its computational complexity is much lower.

Chapter 7

Conclusions

In the first part of this dissertation, we show how to smartly sample face images from a given face collection. In the second part, we apply our sampling method directly to video-based face recognition, which suffers from a huge number of face images. In both parts, our sampling method significantly reduces the number of images in a template to a fixed data volume, saving storage space and computation time while improving recognition accuracy.

In the first part, we propose a template representation with a limited number of images that does not degrade recognition accuracy, using a hybrid frame selection approach. We smartly sample a variety of facial appearances evenly by quantizing the face image space into a few subspaces according to the actual variations in facial appearance, and we compress redundant, similar facial appearances using two statistically representative sample types: the average sample and the random sample. We further use the facial characteristics (bin information) to improve recognition accuracy. Our template representation has been tested on the demanding IJB-A and Janus CS2 benchmarks using a deep learning approach, and the experimental results show its effectiveness in terms of template size and recognition accuracy.
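To make this summary concrete, the following is a minimal sketch, under illustrative assumptions, of the bin-based sampling just described: faces are assigned to disjoint bins by pose yaw and a quality score, and each bin contributes one averaged (pooled) face and one randomly chosen face. The bin edges, the quality range, and the function name are placeholders, not the exact quantization used in this dissertation.

    import numpy as np

    def sample_template(faces, yaws, qualities,
                        yaw_edges=(-90, -30, 30, 90), quality_edges=(0.0, 0.5, 1.0), seed=0):
        """Keep an average ('pooled') face and a random face per (yaw, quality) bin.

        faces: aligned face images (HxWx3 float arrays of the same size)
        yaws / qualities: per-image pose yaw in degrees and quality score in [0, 1]
        """
        rng = np.random.RandomState(seed)
        bins = {}
        for face, yaw, q in zip(faces, yaws, qualities):
            yb = int(np.digitize(yaw, yaw_edges[1:-1]))           # which yaw subspace
            qb = int(np.digitize(q, quality_edges[1:-1]))         # which quality subspace
            bins.setdefault((yb, qb), []).append(face)

        samples = []
        for key, members in bins.items():
            pooled = np.mean(np.stack(members), axis=0)           # central tendency of the bin
            random_face = members[rng.randint(len(members))]      # keeps detail lost by averaging
            samples.append((key, pooled, random_face))
        return samples                                            # bounded by the number of bins

Because the number of kept images is bounded by the number of bins rather than by the length of the video, the template size stays constant no matter how many frames the face collector produces.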
In the second part, we apply our template representation to video-based face recognition on the Janus CS3 and IJB-B video protocols, whose face collections contain many similar as well as widely varying images. Given several face collections obtained from different face collectors and face databases, our sampling method consistently shows higher recognition accuracy than using all available images, while saving storage space and feature extraction time. Moreover, because of its strong face sampling performance, our template representation yields more stable recognition accuracy that is less affected by the performance of the face collector.

To conclude, carefully sampled representative face images actually increase recognition accuracy while reducing computational complexity, and smart use of the facial characteristics increases recognition accuracy further.

Reference List

[1] Caffe feature extraction code on python. Available: https://gist.github.com/marekrei/7adc87d2c4fde941cea6, 2015.

[2] Ayman Abaza, Mary Ann Harrison, Thirimachos Bourlai, and Arun Ross. Design and evaluation of photometric image quality measures for effective face recognition. IET Biometrics, 3(4):314–324, 2014.

[3] Wael AbdAlmageed, Yue Wu, Stephen Rawls, Shai Harel, Tal Hassner, Iacopo Masi, Jongmoo Choi, Jatuporn Lekust, Jungyeon Kim, Prem Natarajan, et al. Face recognition using deep multi-pose representations. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.

[4] Gaurav Aggarwal, Soma Biswas, Patrick J Flynn, and Kevin W Bowyer. Predicting performance of face recognition systems: An image characterization approach. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 52–59. IEEE, 2011.

[5] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with local binary patterns: Application to face recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(12):2037–2041, 2006.

[6] Timo Ahonen, Esa Rahtu, Ville Ojansivu, and Janne Heikkila. Recognition of blurred faces using local phase quantization. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4. IEEE, 2008.

[7] Kaneswaran Anantharajah, Simon Denman, Sridha Sridharan, Clinton Fookes, and Dian Tjondronegoro. Quality based frame selection for video face recognition. In Signal Processing and Communication Systems (ICSPCS), 2012 6th International Conference on, pages 1–5. IEEE, 2012.

[8] Ognjen Arandjelovic, Gregory Shakhnarovich, John Fisher, Roberto Cipolla, and Trevor Darrell. Face recognition with image sets using manifold density divergence. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 581–588. IEEE, 2005.

[9] Akshay Asthana, Stefanos Zafeiriou, Shiyang Cheng, and Maja Pantic. Incremental face alignment in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1859–1866, 2014.

[10] Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. Constrained local neural fields for robust facial landmark detection in the wild. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 354–361, 2013.

[11] Jeremiah R Barr, Kevin W Bowyer, Patrick J Flynn, and Soma Biswas. Face recognition from video: A review. International Journal of Pattern Recognition and Artificial Intelligence, 26(05):1266002, 2012.

[12] Ronen Basri, Tal Hassner, and Lihi Zelnik-Manor.
Approximate nearest subspace search with applications to pattern recognition. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. [13] Ronen Basri, Tal Hassner, and Lihi Zelnik-Manor. Approximate nearest subspace search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):266–278, 2011. [14] Doug Beaver, Sanjeev Kumar, Harry C Li, Jason Sobel, Peter Vajgel, et al. Finding a needle in haystack: Facebook’s photo storage. In OSDI, volume 10, pages 1–8, 2010. [15] S-A Berrani and Christophe Garcia. Enhancing face recognition from video sequences using robust statistics. In Advanced Video and Signal Based Surveillance, 2005. AVSS 2005. IEEE Conference on, pages 324–329. IEEE, 2005. [16] Sid-Ahmed Berrani and Christophe Garcia. Robust detection of outliers for projection- based face recognition methods. Multimedia Tools and Applications, 38(2):271–291, 2008. [17] Lacey Best-Rowden, Brendan Klare, Joshua Klontz, and Anil K Jain. Video-to-video face matching: Establishing a baseline for unconstrained face recognition. In Biometrics: Theory, Applications and Systems (BTAS), 2013 IEEE Sixth International Conference on, pages 1–8. IEEE, 2013. [18] J Ross Beveridge, P Jonathon Phillips, Geof H Givens, Bruce A Draper, Mohammad Nay- eem Teli, and David S Bolme. When high-quality face images match poorly. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Confer- ence on, pages 572–578. IEEE, 2011. [19] Samarth Bharadwaj, Mayank Vatsa, and Richa Singh. Can holistic representations be used for face biometric quality assessment? In Image Processing (ICIP), 2013 20th IEEE International Conference on, pages 2792–2796. IEEE, 2013. [20] Michael D Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and Luc Van Gool. Online multiperson tracking-by-detection from a single, uncalibrated camera. IEEE transactions on pattern analysis and machine intelligence, 33(9):1820–1833, 2011. [21] Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177–190, 2014. [22] Rui Caseiro, Pedro Martins, João F Henriques, Fatima Silva Leite, and Jorge Batista. Rolling riemannian manifolds to solve the multi-class classification problem. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 41–48, 2013. 128 [23] Hakan Cevikalp and Bill Triggs. Face recognition based on image sets. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2567–2573. IEEE, 2010. [24] Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gerard Medioni. Faceposenet: Making a case for landmark-free face alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 1599–1608, 2017. [25] Ching-Hui Chen, Jun-Cheng Chen, Carlos D Castillo, and Rama Chellappa. Video-based face association and identification. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 149–156. IEEE, 2017. [26] J. C. Chen, V . M. Patel, and R. Chellappa. Unconstrained face verification using deep cnn features. In Winter Conf. on App. of Comput. Vision, 2016. [27] J.-C. Chen, S. Sankaranarayanan, V . M. Patel, and R. Chellappa. Unconstrained face veri- fication using fisher vectors computed from frontalized faces. In Int. Conf. on Biometrics: Theory, Applications and Systems, 2015. [28] Jiansheng Chen, Yu Deng, Gaocheng Bai, and Guangda Su. 
Face image quality assessment based on learning to rank. IEEE Signal Processing Letters, 22(1):90–94, 2015. [29] Jun-Cheng Chen, Vishal M Patel, and Rama Chellappa. Unconstrained face verification using deep cnn features. In Winter Conf. on App. of Comput. Vision, 2016. [30] Jun-Cheng Chen, Rajeev Ranjan, Amit Kumar, Ching-Hui Chen, Vishal M Patel, and Rama Chellappa. An end-to-end system for unconstrained face verification with deep convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 118–126, 2015. [31] Ming-Jun Chen and Alan C Bovik. No-reference image blur assessment using multiscale gradient. EURASIP Journal on image and video processing, 2011(1):3, 2011. [32] Shaokang Chen, Conrad Sanderson, Mehrtash T Harandi, and Brian C Lovell. Improved image set classification via joint sparse approximated nearest subspaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 452–459, 2013. [33] Yi-Chen Chen, Vishal M Patel, P Jonathon Phillips, and Rama Chellappa. Dictionary- based face recognition from video. In European conference on computer vision, pages 766–779. Springer, 2012. [34] Howard Chu. Mdb: A memory-mapped database and backend for openldap. In Proceed- ings of the 3rd International Conference on LDAP , Heidelberg, Germany, page 35, 2011. [35] Frederique Crete, Thierry Dolmiere, Patricia Ladret, and Marina Nicolas. The blur ef- fect: perception and estimation with a new no-reference perceptual blur metric. In Elec- tronic Imaging 2007, pages 64920I–64920I. International Society for Optics and Photon- ics, 2007. 129 [36] James Diebel. Representing attitude: Euler angles, unit quaternions, and rotation vectors. Matrix, 58(15-16):1–35, 2006. [37] Changxing Ding and Dacheng Tao. Trunk-branch ensemble convolutional neural networks for video-based face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. [38] Samuel Dodge and Lina Karam. IQCNN understanding how image quality affects deep neural networks. Available:http://arxiv.org/pdf/1604.04004.pdf, 2016. [39] Samuel Dodge and Lina Karam. Understanding how image quality affects deep neural networks. arXiv preprint arXiv:1604.04004, 2016. [40] Jingming Dong and Stefano Soatto. Domain-size pooling in local descriptors: Dsp-sift. In Proc. Conf. Comput. Vision Pattern Recognition, pages 5097–5106, 2015. [41] Abhishek Dutta, Raymond Veldhuis, and Luuk Spreeuwers. A bayesian model for predict- ing face recognition performance using image quality. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–8. IEEE, 2014. [42] FaceGen. Facegen. Available:http://facegen.com/. [43] Wei Fan and Dit-Yan Yeung. Face recognition with image sets using hierarchically ex- tracted exemplars from appearance manifolds. In 7th International Conference on Auto- matic Face and Gesture Recognition (FGR06), pages 177–182. IEEE, 2006. [44] Wei Fan and Dit-Yan Yeung. Locally linear models on face appearance manifolds with application to dual-subspace based classification. In 2006 IEEE Computer Society Con- ference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1384– 1390. IEEE, 2006. [45] Cécile Fiche, Patricia Ladret, and Ngoc-Son Vu. Blurred face recognition algorithm guided by a no-reference blur metric. In IS&T/SPIE Electronic Imaging, pages 75380U– 75380U. International Society for Optics and Photonics, 2010. [46] David A Forsyth and Jean Ponce. Computer vision: a modern approach. 
Prentice Hall Professional Technical Reference, 2002. [47] Xiufeng Gao, Stan Z Li, Rong Liu, and Peiren Zhang. Standardization of face image sam- ple quality. In International Conference on Biometrics, pages 242–251. Springer, 2007. [48] Raghuraman Gopalan, Sima Taheri, Pavan Turaga, and Rama Chellappa. A blur-robust de- scriptor with applications to face recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(6):1220–1226, 2012. [49] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Is that you? metric learning approaches for face identification. In ICCV 2009-International Conference on Computer Vision, pages 498–505. IEEE, 2009. 130 [50] Abdenour Hadid and M Pietikainen. From still image to video-based face recognition: an experimental analysis. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pages 813–818. IEEE, 2004. [51] Abdenour Hadid and Matti Pietikainen. Selecting models from videos for appearance- based face recognition. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 1, pages 304–308. IEEE, 2004. [52] Jihun Hamm and Daniel D Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. In Proceedings of the 25th international conference on Machine learning, pages 376–383. ACM, 2008. [53] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003. [54] Tal Hassner. Viewing real-world faces in 3D. In Proc. Int. Conf. Comput. Vision, pages 3607–3614, 2013. [55] Tal Hassner, Shai Harel, Eran Paz, and Roee Enbar. Effective face frontalization in uncon- strained images. In Proc. Conf. Comput. Vision Pattern Recognition, 2015. [56] Tal Hassner, Iacopo Masi, Jungyeon Kim, Jongmoo Choi, Shai Harel, Prem Natarajan, and Gérard Medioni. Pooling faces: Template based face recognition with pooled face images. 2016. [57] Munawar Hayat, Mohammed Bennamoun, and Senjian An. Learning non-linear recon- struction models for image set classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1914, 2014. [58] Munawar Hayat, Mohammed Bennamoun, and Senjian An. Deep reconstruction models for image set classification. IEEE transactions on pattern analysis and machine intelli- gence, 37(4):713–727, 2015. [59] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. [60] Rein-Lien Vincent Hsu, Jidnya Shah, and Brian Martin. Quality assessment of facial im- ages. In Biometric Consortium Conference, 2006 Biometrics Symposium: Special Session on Research at the, pages 1–6. IEEE, 2006. [61] Junlin Hu, Jiwen Lu, and Yap-Peng Tan. Discriminative deep metric learning for face verification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1875–1882, 2014. [62] Yiqun Hu, Ajmal S Mian, and Robyn Owens. Sparse approximated nearest points for image set classification. In Proc. Conf. Comput. Vision Pattern Recognition, pages 121– 128. IEEE, 2011. [63] Gary Huang, Marwan Mattar, Honglak Lee, and Erik G Learned-Miller. Learning to align from scratch. In Advances in Neural Information Processing Systems, pages 764–772, 2012. 131 [64] Gary B Huang, Vidit Jain, and Erik Learned-Miller. Unsupervised joint alignment of complex images. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007. 
[65] Thomas Huang, Ziyou Xiong, and Zhenqiu Zhang. Face recognition applications. In Handbook of Face Recognition, pages 617–638. Springer, 2011. [66] Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Projection metric learning on grassmann manifold with application to video based face recognition. In Proc. Conf. Comput. Vision Pattern Recognition, pages 140–149, 2015. [67] Zhiwu Huang, Ruiping Wang, Shiguang Shan, Xianqiu Li, and Xilin Chen. Log-euclidean metric learning on symmetric positive definite manifold with application to image set clas- sification. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 720–729, 2015. [68] Mia Hubert, Peter J Rousseeuw, and Karlien Vanden Branden. Robpca: a new approach to robust principal component analysis. Technometrics, 47(1):64–79, 2005. [69] Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Harandi. Kernel methods on the riemannian manifold of symmetric positive definite matri- ces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 73–80, 2013. [70] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recog- nition (CVPR), 2010 IEEE Conference on, pages 3304–3311. IEEE, 2010. [71] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Gir- shick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multi- media, pages 675–678. ACM, 2014. [72] Samil Karahan, Merve Kilinc Yildirum, Kadir Kirtac, Ferhat Sukru Rende, Gultekin Bu- tun, and Hazim Kemal Ekenel. How image degradations affect deep cnn-based face recognition? In 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1–5. IEEE, 2016. [73] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1867–1874, 2014. [74] Hyung-Il Kim, Seung Ho Lee, and Yong Man Ro. Face image assessment learned with objective and relative face image qualities for improved face recognition. In Image Pro- cessing (ICIP), 2015 IEEE International Conference on, pages 4027–4031. IEEE, 2015. [75] KangGeon Kim, Tadas Baltrusaitis, AmirAli B Zadeh, Louis-Philippe Morency, and Gérard G Medioni. Holistically constrained local model: Going beyond frontal poses for facial landmark detection. In BMVC, pages 1–12, 2016. 132 [76] KangGeon Kim, Feng-Ju Chang, Jongmoo Choi, Louis-Philippe Morency, Ramakant Nevatia, and Gérard Medioni. Local-global landmark confidences for face recognition. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Con- ference on, pages 666–672. IEEE, 2017. [77] KangGeon Kim, Zhenheng Yang, Iacopo Masi, Ramakant Nevatia, and Gerard Medioni. Face and body association for video-based face recognition. In 2018 IEEE Winter Confer- ence on Applications of Computer Vision (WACV), pages 39–48. IEEE, 2018. [78] Tae-Kyun Kim, Josef Kittler, and Roberto Cipolla. Discriminative learning and recogni- tion of image set classes using canonical correlations. Trans. Pattern Anal. Mach. Intell., 29(6):1005–1018, 2007. [79] Brendan F Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah, Mark Burge, and Anil K Jain. 
Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1931–1939. IEEE, 2015. [80] J. Klontz, B. Klare, S. Klum, E. Taborsky, M. Burge, and A. K. Jain. Open source biomet- ric recognition. In Int. Conf. on Biometrics: Theory, Applications and Systems, 2013. [81] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Neural Inform. Process. Syst., pages 1097–1105, 2012. [82] V olker Krueger and Shaohua Zhou. Exemplar-based face recognition from video. In European Conference on Computer Vision, pages 732–746. Springer, 2002. [83] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. Conf. Comput. Vision Pattern Recognition, volume 2, pages 2169–2178. IEEE, 2006. [84] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998. [85] Kuang-Chih Lee, Jeffrey Ho, Ming-Hsuan Yang, and David Kriegman. Video-based face recognition using probabilistic appearance manifolds. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I–313. IEEE, 2003. [86] Lixiong Liu, Bao Liu, Hua Huang, and Alan Conrad Bovik. No-reference image quality assessment based on spatial and spectral entropies. Signal Processing: Image Communi- cation, 29(8):856–863, 2014. [87] Lixiong Liu, Bao Liu, Hua Huang, and Alan Conrad Bovik. SSEQ software release. Available: http://live.ece.utexas.edu/research/quality/ SSEQ_release.zip, 2014. 133 [88] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015. [89] D.G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004. [90] Jiwen Lu, Gang Wang, Weihong Deng, and Pierre Moulin. Simultaneous feature and dic- tionary learning for image set based face recognition. In European Conference on Com- puter Vision, pages 265–280. Springer, 2014. [91] Jiwen Lu, Gang Wang, Weihong Deng, Pierre Moulin, and Jie Zhou. Multi-manifold deep metric learning for image set classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1137–1145, 2015. [92] Jiwen Lu, Gang Wang, and Pierre Moulin. Image set classification using holistic multiple order statistics features and localized multi-kernel metric learning. In Proc. Int. Conf. Comput. Vision, pages 329–336, 2013. [93] Iacopo Masi, Anh Tu an Trãn, Tal Hassner, Jatuporn Toy Leksut, and Gérard Medioni. Do we really need to collect millions of faces for effective face recognition? 2016. [94] Iacopo Masi, Feng-Ju Chang, Jongmoo Choi, Shai Harel, Jungyeon Kim, KangGeon Kim, Jatuporn Leksut, Stephen Rawls, Yue Wu, Tal Hassner, et al. Learning pose-aware models for pose-invariant face recognition in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. [95] Iacopo Masi, Tal Hassner, Anh Tuân Tran, and Gérard Medioni. Rapid synthesis of mas- sive face sets for improved face recognition. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 604–611. IEEE, 2017. 
[96] Iacopo Masi, Stephen Rawls, Gérard Medioni, and Prem Natarajan. Pose-Aware Face Recognition in the Wild. In CVPR, 2016. [97] Sandra Mau, Shaokang Chen, Conrad Sanderson, and Brian C Lovell. Video face match- ing using subset selection and clustering of probabilistic multi-region histograms. In Im- age and Vision Computing New Zealand (IVCNZ), 2010 25th International Conference of, pages 1–8. IEEE, 2010. [98] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, 2012. [99] Anush Krishna Moorthy and Alan Conrad Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE transactions on Image Processing, 20(12):3350–3364, 2011. [100] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural net- works for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4293–4302, 2016. 134 [101] Kamal Nasrollahi and Thomas B Moeslund. Face quality assessment system in video sequences. In European Workshop on Biometrics and Identity Management, pages 10–18. Springer, 2008. [102] NIST. NIST results from the ijb-a recognition challenges, 2015-05-17. Available:http: //biometrics.nist.gov/cs_links/face/face_challenges/IJBA_ reports.zip, 2015. [103] Ville Ojansivu and Janne Heikkilä. Blur insensitive texture classification using local phase quantization. In International conference on image and signal processing, pages 236–243. Springer, 2008. [104] Unsang Park and Anil K Jain. 3d model-based face recognition in video. In International Conference on Biometrics, pages 1085–1094. Springer, 2007. [105] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference, volume 1, page 6, 2015. [106] Federico Pernici. Facehugger: The alien tracker applied to faces. In Computer Vision– ECCV 2012. Workshops and Demonstrations, pages 597–601. Springer, 2012. [107] Federico Pernici and Alberto Del Bimbo. Object tracking by oversampling local features. IEEE transactions on pattern analysis and machine intelligence, 36(12):2538–2551, 2014. [108] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. [109] P Jonathon Phillips, J Ross Beveridge, David S Bolme, Bruce A Draper, Geof H Givens, Yui Man Lui, Su Cheng, Mohammad Nayeem Teli, and Hao Zhang. On the existence of face quality measures. In Biometrics: Theory, Applications and Systems (BTAS), 2013 IEEE Sixth International Conference on, pages 1–8. IEEE, 2013. [110] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016. [111] Kai Ren and Garth A Gibson. Tablefs: Enhancing metadata efficiency in the local file system. In USENIX Annual Technical Conference, pages 145–156, 2013. [112] Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Face alignment at 3000 fps via regressing local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692, 2014. [113] Markus Roth, Martin Bäuml, Ram Nevatia, and Rainer Stiefelhagen. Robust multi-pose face tracking by multi-stage tracklet association. 
In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 1012–1016. IEEE, 2012. [114] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323–2326, 2000. 135 [115] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi- heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211– 252, 2015. [116] Michele A Saad, Alan C Bovik, and Christophe Charrier. Blind image quality assess- ment: A natural scene statistics approach in the dct domain. IEEE transactions on Image Processing, 21(8):3339–3352, 2012. [117] Swami Sankaranarayanan, Azadeh Alavi, and Rama Chellappa. Triplet similarity embed- ding for face verification. arXiv preprint arXiv:1602.03418, 2016. [118] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015. [119] Gregory Shakhnarovich, John W Fisher, and Trevor Darrell. Face recognition from long- term observations. In European Conference on Computer Vision, pages 851–865. Springer, 2002. [120] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large- scale image recognition. CoRR, abs/1409.1556, 2014. [121] Johannes Stallkamp, Hazim K Ekenel, and Rainer Stiefelhagen. Video-based face recogni- tion on real-world data. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007. [122] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proc. Int. Conf. Comput. Vi- sion, pages 945–953, 2015. [123] M Subasic, S Loncaric, T Petkovic, H Bogunovic, and V Krivec. Face image validation system. In Image and Signal Processing and Analysis, 2005. ISPA 2005. Proceedings of the 4th International Symposium on, pages 30–33. IEEE, 2005. [124] Amr Suleiman, Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Towards closing the energy gap between hog and cnn features for embedded vision. arXiv preprint arXiv:1703.05853, 2017. [125] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face represen- tation by joint identification-verification. In Advances in neural information processing systems, pages 1988–1996, 2014. [126] Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. Deepid3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015. [127] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation from pre- dicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1891–1898, 2014. 136 [128] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2892–2900, 2015. [129] Vivienne Sze, Tien-Ju Yang, and Yu-Hsin Chen. Designing energy-efficient convolutional neural networks using energy-aware pruning. 2017. [130] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014. [131] Deborah Thomas, Kevin W Bowyer, and Patrick J Flynn. 
Multi-frame approaches to im- prove face recognition. In Motion and Video Computing, 2007. WMVC’07. IEEE Work- shop on, pages 19–19. IEEE, 2007. [132] S Vignesh, KVSNL Manasa Priya, and Sumohana S Channappayya. Face image quality assessment for face selection in surveillance video using convolutional neural networks. In Signal and Information Processing (GlobalSIP), 2015 IEEE Global Conference on, pages 577–581. IEEE, 2015. [133] Ruiping Wang, Huimin Guo, Larry S Davis, and Qionghai Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2496–2503. IEEE, 2012. [134] Ruiping Wang, Shiguang Shan, Xilin Chen, and Wen Gao. Manifold-manifold distance with application to face recognition based on image set. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008. [135] Xiaogang Wang and Xiaoou Tang. Random sampling lda for face recognition. In Com- puter Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–259. IEEE, 2004. [136] Xiaogang Wang and Xiaoou Tang. Random sampling for subspace face recognition. In- ternational Journal of Computer Vision, 70(1):91–104, 2006. [137] Zhou Wang, Alan C Bovik, and Brian L Evan. Blind measurement of blocking artifacts in images. In Proceedings 2000 International Conference on Image Processing (Cat. No. 00CH37101), volume 3, pages 981–984. Ieee, 2000. [138] A Watt and Mark Watt. Advanced animation and rendering techniques theory and practice, 1994. [139] Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems, pages 1473–1480, 2006. [140] Cameron Whitelam, Emma Taborsky, Austin Blanton, Brianna Maze, Jocelyn Adams, Tim Miller, Nathan Kalka, Anil K Jain, James A Duncan, Kristen Allen, et al. Iarpa janus benchmark-b face dataset. In CVPR Workshop on Biometrics, 2017. 137 [141] Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 529–534. IEEE, 2011. [142] Yongkang Wong, Shaokang Chen, Sandra Mau, Conrad Sanderson, and Brian C Lovell. Patch-based probabilistic image quality assessment for face selection and improved video- based face recognition. In CVPR 2011 WORKSHOPS, pages 74–81. IEEE, 2011. [143] John Wright, Allen Y Yang, Arvind Ganesh, Shankar S Sastry, and Yi Ma. Robust face recognition via sparse representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(2):210–227, 2009. [144] Quanren Xiong and Christopher Jaynes. Mugshot database acquisition in video surveil- lance networks using incremental auto-clustering quality measures. In Advanced Video and Signal Based Surveillance, 2003. Proceedings. IEEE Conference on, pages 191–198. IEEE, 2003. [145] Zhongwen Xu, Yi Yang, and Alex G Hauptmann. A discriminative cnn video representa- tion for event detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1798–1807, 2015. [146] Allen Y Yang, John Wright, Yi Ma, and S Shankar Sastry. Feature selection in face recog- nition: A sparse representation perspective. submitted to IEEE Transactions Pattern Anal- ysis and Machine Intelligence, 2007. 
[147] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Faceness-net: Face detection through deep facial part responses. IEEE transactions on pattern analysis and machine intelligence, 40(8):1845–1859, 2018. [148] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016. [149] Zhenheng Yang and Ramakant Nevatia. A multi-scale cascade fully convolutional network face detector. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 633–638. IEEE, 2016. [150] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014. Available: http://www.cbsr.ia. ac.cn/english/CASIA-WebFace-Database.html. [151] Erfan Zangeneh, Mohammad Rahmati, and Yalda Mohsenzadeh. Low resolution face recognition using a two-branch deep convolutional neural network architecture. arXiv preprint arXiv:1706.06247, 2017. [152] Yizhe Zhang, Ming Shao, Edward K Wong, and Yun Fu. Random faces guided sparse many-to-one encoder for pose-invariant face recognition. In Proceedings of the IEEE In- ternational Conference on Computer Vision, pages 2416–2423, 2013. 138 [153] Yongbin Zhang and Aleix M Martínez. A weighted probabilistic approach to face recogni- tion from multiple images and video sequences. Image and Vision Computing, 24(6):626– 638, 2006. [154] Pengfei Zhu, Lei Zhang, Wangmeng Zuo, and David Zhang. From point to set: Extend the learning of distance metrics. In Proc. Int. Conf. Comput. Vision, pages 2664–2671, 2013. [155] Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, and Stan Z Li. High-fidelity pose and expression normalization for face recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 787–796, 2015. 139
Abstract
Template-based face recognition methods recognize a person's identity (1:N face identification) or validate a person's identity (1:1 verification) by comparing probe templates against gallery templates. Each template contains many real-world face images of a person with varying poses, qualities, and other unconstrained settings, captured from multiple devices. Recent template-based face recognition systems have focused more on developing advanced deep models to achieve higher recognition accuracy and less on reducing computational complexity. This dissertation aims to reduce computational complexity as well as to increase recognition accuracy by introducing an efficient template representation that samples a few face images to represent a template in deep learning approaches.

To do so, we compress the variance of intra-personal appearances with a few samples. We sparsely quantize the face image space into disjoint bins according to pose and quality, and assign each face sample to a bin based on its estimated pose yaw and image quality. We then average the face images with similar appearance in each quantized bin to obtain its central tendency and, to compensate for the possible loss of facial detail induced by averaging, we also randomly select one face image per bin. Further, from the encoded deep features of our template representation, we propose bin equalization methods that remove the pose- and quality-specific portions of the features that are not related to a person's identity. Finally, we show through extensive experiments that our method achieves higher recognition accuracy with fewer images than the original templates on the mixed-image protocols of the Janus CS2 and IJB-B benchmarks.

We then apply our method to video-based face recognition on the Janus CS3 and IJB-B benchmarks, in which gallery templates contain real-world still face images and each probe template contains a video file with an initial face bounding box of a target subject. These video protocols make the recognition problem more challenging because the probe templates contain a myriad of face images, some of which may come from different subjects or from non-face objects, depending on the performance of the face collector. We show that our method is superior to the state-of-the-art in video-based face recognition in terms of both computational complexity and recognition accuracy, and that it achieves consistently high recognition accuracy even with low-performance face collectors.

This dissertation makes two contributions to efficient template-based face recognition. First, we show that computational complexity can be significantly reduced without losing recognition accuracy by intelligently sampling a few face images that represent the many face images in a template. Second, we propose robust feature vectors that are insensitive to variations in pose and quality, which boost recognition accuracy. The integration of the two contributions results in a robust face recognition system that is insensitive to variations in the quality of the face collections. The proposed method achieves new state-of-the-art performance in template-based video face recognition efficiently.