Data-Efficient Image and Vision-and-Language Synthesis and Classification by Mozhdeh Rouhsedaghat A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) August 2022 Copyright 2022 Mozhdeh Rouhsedaghat Table of Contents List of Tables v List of Figures vii Abstract xi Chapter 1: Introduction 1 1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3.1 Image Representation Learning . . . . . . . . . . . . . . . . . . . . . . . 5 1.3.2 Face Gender Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.3 Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.4 V&L Classification in the Medical Domain . . . . . . . . . . . . . . . . . 7 1.3.5 One-shot Mask-guided Image Synthesis . . . . . . . . . . . . . . . . . . . 7 1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Chapter 2: PixelHop++: An Enhanced Successive Subspace Learning Model 9 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Background Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Proposed PixelHop++ Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1 Channel-wise (c/w) Saab Transform . . . . . . . . . . . . . . . . . . . . . 12 2.3.2 Tree-decomposed Feature Representation . . . . . . . . . . . . . . . . . . 14 2.3.3 Cross-Entropy-Guided Feature Selection . . . . . . . . . . . . . . . . . . 15 2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4.2 Effects of Hyper-Parameters in PixelHop++ . . . . . . . . . . . . . . . . . 17 2.4.3 Performance Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Chapter 3: FaceHop: A Light-Weight Low-Resolution Gender Classification Method 20 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2.1 Face Attributes Classification . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2.2 Successive Subspace Learning (SSL) . . . . . . . . . . . . . . . . . . . . 23 ii 3.3 Proposed FaceHop Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3.2 PixelHop++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3.4 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Chapter 4: Low-Resolution Face Recognition In Resource-Constrained Environments 37 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
37 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.3 Proposed Face Recognition Method . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3.1 PixelHop++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3.2 Pairwise Feature Generation . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3.3 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4 Integration with Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.5.1 Face Verification on LFW . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5.2 Face Identification on CMU Multi-PIE . . . . . . . . . . . . . . . . . . . 53 4.5.3 Model Size Computation and Time Complexity . . . . . . . . . . . . . . . 54 4.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Chapter 5: BERTHop: An Effective Vision-and-Language Model for Chest X-ray Dis- ease Diagnosis 56 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.3.1 Visual Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.3.2 In-Domain Text Pre-Training . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.4.3 Visual Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Chapter 6: MAGIC: Mask-Guided Image Synthesis by Inverting a Quasi-Robust Clas- sifier 72 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 6.2 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 6.3.1 Quasi-Robust Model as a Strong Prior for Synthesis . . . . . . . . . . . . 79 6.3.2 Shape Preservation and Manipulation Control . . . . . . . . . . . . . . . . 82 6.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.4.1 Ablation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.4.2 Comparison with the state-of-the-art . . . . . . . . . . . . . . . . . . . . . 86 iii 6.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Chapter 7: Conclusion and Future Work 91 7.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Bibliography 94 iv List of Tables 2.1 Averaged correlations of filtered AC outputs from the first to the third Pixelhop units with respect to the MNIST, Fashion MNIST and CIFAR-10 datasets. . . . . . 13 2.2 Comparison of the original and the modified LeNet-5 architectures on three bench- mark dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
17 2.3 Comparison of test accuracy (%) of LeNet-5 and PixelHop++ for MNIST, Fashion MNIST and CIFAR-10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4 Comparison of the model size (in terms of the total parameter numbers) of LeNet-5 and PixelHop++ for the MNIST, the Fashion MNIST and the CIFAR-10 datasets. . 19 3.1 Configurations of PixelHop++ for LFW and CMU Multi-PIE.. . . . . . . . . . . . 33 3.2 Feature vector dimensions for LFW and CMU Multi-PIE. . . . . . . . . . . . . . . 33 3.3 Performance comparison of each individual hop/region classifier for LFW. . . . . . 34 3.4 Performance comparison of LeNet-5, FaceHop I and FaceHop II in accuracy rates and model sizes for LFW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5 Performance comparison of each individual hop/region classifier for CMU Multi- PIE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.6 Performance comparison of LeNet-5, FaceHop I and FaceHop II in accuracy rates and model sizes for CMU Multi-PIE. . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.1 Comparison of test accuracy of C Y and C C rCb and hyper-parameter settings of M Y and M C rCb, where K 1 , K 2 , and K 3 are numbers of intermediate and leaf nodes at level-1, level-2, and level-3, P is the number of vectors at level-3 and N = 7+ 4K 1 + 2K 2 + P is the feature dimension. . . . . . . . . . . . . . . . . . . . . . 48 4.2 Face verification results on LFW for 16×16 images. . . . . . . . . . . . . . . . . . 51 4.3 Face verification results on LFW for 32×32 images. . . . . . . . . . . . . . . . . . 51 4.4 Rank-1 identification rate (%) for frontal and slightly non-frontal face images (± 15 ◦ ) in Setting-1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 v 4.5 The number of parameters of each component in LRFRHop . . . . . . . . . . . . . 54 5.1 The AUC thoracic diseases diagnosis comparison of our model with other three methods on OpenI. BERTHop significantly outperforms models trained with a sim- ilar amount of data (e.g. VB w/ BUTD). *TieNet is trained on a much larger dataset than BERTHop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.2 Comparison betwee different visual encoders (BUTD, ChexNet, and PixelHop++) under the same transformer backbone of BlueBERT. PixelHop++ outperforms BUTD and even ChexNet, which is pre-trained on a large in-domain disease diagnosis dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.1 Quantitative comparison. DEEPSIM vs MAGIC for object and scene images. (a) FID score; (b) Average preference by the users drawn from the user survey. . . . . 89 vi List of Figures 1.1 Sample inputs and outputs for an image classification and an image generation model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.1 The block diagram of the PixelHop++ method that contains three PixelHop++ Units in cascade. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Comparison of the traditional Saab transform and the proposed c/w Saab transform. 13 2.3 Illustration of the tree-decomposed feature representation. . . . . . . . . . . . . . 15 2.4 The relation between the test accuracy (%) and energy threshold T in PixelHop++ for MNIST, Fashion MNIST and CIFAR-10, where the number of model parame- ters in Module 1 is shown at each operational point. . . . . . . . . . . . . . . . . . 
16 2.5 The relation between test accuracy (%) and selected number N S of cross-entropy- guided features in PixelHop++ for MNIST, Fashion MNIST and CIFAR-10, where the number of model parameters in Module 2 is shown at each operational point. . 16 3.1 An overview of the proposed FaceHop method. . . . . . . . . . . . . . . . . . . . 24 3.2 Illustration of the proposed 3-hop FaceHop system as a tree-decomposed repre- sentation with its depth equal to three, where each depth layer corresponds to one hop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3 Collection of regional responses in hop-1 and hop-2 response maps as features in the FaceHop system: (a) four regions in hop-1 and (b) three regions in hop-2. . . . 30 4.1 The block diagram of the proposed face recognition model. . . . . . . . . . . . . . 40 4.2 Illustration of data flow in the three-level c/w Saab transform in PixelHop++, which provides a sequence of successive subspace approximations (SSAs) to the input image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3 Illustration of selected spatial regions of interest (ROIs) with respect to the input image for frequency channels at (a) level-1 and (b) level-2. . . . . . . . . . . . . . 45 4.4 The relation between the test accuracy (%) and the energy threshold in M Y , where the number of model parameters in M Y is shown at each operational point. . . . . . 49 vii 4.5 The relation between the test accuracy (%) and the energy threshold in M CrCb , where the number of model parameters in M CrCb is shown at each operational point. 49 4.6 Quality of the obtained 32×32 (b) and 16×16 (c) low-resolution face images com- pared with the original high-resolution face image (a). . . . . . . . . . . . . . . . . 51 4.7 Comparison of classification accuracy of three active learning methods as a func- tion of the number of training pairs. . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.1 An overview of BERTHop. BERTHop takes X-ray image and clinical report as input. It first encodes the image and text and extracts potential features from both modalities. Then a transformer-based model learns the associations between these two modalities. By applying appropriate vision and text extractor, the model is capable to identify the abnormality and associate it with the text labels. . . . . . . . 57 5.2 The proposed BERTHop framework for CXR disease diagnosis. A PixelHop++ model followed by a “PCA and concatenation” block is used to generate Q feature vectors. These features along with language embedding are fed to the transformer that is initialized with BlueBERT. . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.3 Data flow in a 3-level PixelHop++ model. A node represents a channel. . . . . . . 62 5.4 A sample image-text pair in the OpenI dataset. The text report from a radiologist is important for disease diagnosis but has a significantly different style compared to general-domain text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.5 OpenI label statistics: (A) Percentage of normal and abnormal cases (B) Percent- age of different diseases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.6 ROC curve of BERTHop for all 14 thoracic diseases. . . . . . . . . . . . . . . . . 69 6.1 Multiple image manipulation tasks with a single method. 
MAGIC allows a di- verse set of image synthesis tasks following the semantic of objects and scenes requiring only a single image, its segmentation mask, and a guide mask. In each pair, the left image is the input, and the right one is the manipulated image, guided by the mask shown on top. a) position control and copy/move manipulation by editing the guide mask; b) non-rigid shape control on scenes. c) non-rigid shape control on objects such as animals. Note that the guide mask is not required to segment the object perfectly with fine details; on the contrary, it can be loose, requiring less supervision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 viii 6.2 a) SINGAN [135] and IMAGINE [159] fail to capture the arrangement of parts of ob- jects. Supervision with primitives may lead to better performance—DEEPSIM and our MAGIC). b) Even when IMAGINE uses supervision—right column—the syn- thesis is limited or requires the clip-art to match the image colors. c) Our MAGIC can handle a spectrum of deformations from mild to even intense, whereas DEEP- SIM fails to generate unseen parts or to interpolate empty regions; d) on the con- trary, DEEPSIM preserves the contour of objects better though it “curves” straight lines and shows artifacts when the mask provides no direct supervision. Some figures are taken from [159]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.3 The binary mask y ′ is used as a guide; x ′ is inverted from x latent code z, constrained with y ′ . . . . 78 6.4 a) x ′ receives structured gradients fromθ to preserve the semantics of z; it receives gradients from a discriminator to match x’s patch distribution. An AE is pre-trained to map x to y, we then introduce gradients from AE to guide x ′ shape/location con- strained with y ′ . b) Gradients from ResNet-50 [59]—also used in [159]—exhibit a sparse structure with activations around the borders; using a quasi-robust model with smallℓ 2 yields gradients with structures that appear on silent features (eyes, nose, etc.) Zoom on gradients for better comparison. . . . . . . . . . . . . . . . . 79 6.5 Shape is better preserved with ours (right) compared to [159] (left). . . . . . . . . . . . . . . 82 6.6 Visualization of the gradient of the loss with respect to the input for ResNet-50 [59]. Input gradients seem noisy for the non-robust model used in IMAGINE but for the ℓ 2 quasi-robust models, they start to be aligned with edges as soon as ε slightly departs from zero. For larger ε, e.g., ε = 5.0, the model becomes more robust yet gradients are more aligned with course edges. The same holds forℓ ∞ - quasi-robust models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.7 Synthesized images by IMAGINE using models with different amount of adversar- ial robustness. a) Using a non-robust classification model for model inversion, IMAGINE synthesizes fragmented objects in the output. b) By changing the non- robust model in IMAGINE with a quasi-robust model, synthesized images look less fragmented. c) By increasing the robustness a bit more, the generated objects be- come non-fragmented and unbroken. d-e) Using strongly-robust models makes generated objects blurry and some of the object details disappear. . . . . . . . . . . 85 6.8 Synergy between the quasi-robust classifier and our discriminator. . . . . . . . . . . . . . . . 86 6.9 Qualitative comparison. DEEPSIM and MAGIC use the same guide mask y ′ . a) IMAGINE fails to perform position control and generates fragmented results. 
b) & d) DEEPSIM cannot synthesize realistic objects when y ′ is extremely different from y whereas MAGIC succeeds. c) IMAGINE generates a good result yet requires more supervision. e) IMAGINE generates samples similar to the input with no supervi- sion, while MAGIC enforces large variation using the guide mask. f) For shape control on complex scenes, MAGIC generates high-fidelity results while DEEPSIM synthesizes blurry and ‘curved’ images. Some figures are taken from [159]. . . . . 87 ix 6.10 For each input, we fix the mask and optimize starting from different random noise. While observing the boundaries specified by the guide mask y ′ and generating realistic images, MAGIC keeps specificity and generates diverse results. . . . . . . . 88 6.11 Ghost effect. The original feet of the dog are still visible in the synthesized image. . . . . . . . . 89 x Abstract Image classification and image synthesis are two fundamental yet challenging tasks in computer vision and pattern recognition and have drawn significant research attention over the last several decades. Image classification models learn to predict the probability of an image belonging to different classes, i.e., they learn the conditional probability distribution p(y|x) where x is the input image and y is a class label. On the other hand, image synthesis models learn the probability distribution of data conditioned on some specific input. With the emergence of Deep Learning (DL) techniques and availability of large annotated datasets and computational power, classification and generation models could achieve great success, however, in domains in which a large amount of annotated data is not available, such models perform poorly, and having data-efficient models remains a challenge requiring further attention. In this dissertation, we focus on learning-based data-efficient image and vision-and-language classification and image synthesis tasks. The Successive Subspace Learning (SSL) principle was developed to design an interpretable image classification model, known as the PixelHop. We propose an improved PixelHop method and call it PixelHop++. First, we decouple the joint spatial-spectral input tensor to multiple spatial tensors under the spatial-spectral separability assumption and perform the Saab transform in a channel-wise manner. Second, by performing this operation successively, we construct a channel- decomposed feature tree whose leaf nodes contain features of dimension one. Third, a subset of discriminant features is selected based on their cross-entropy values for image classification. PixelHop++ offers a flexible tradeoff between the model size and the classification performance. For low-resolution face gender classification, we propose a lightweight method, called Face- Hop which offers an interpretable machine learning solution. It has desired characteristics such as xi small model size, small training data, and low training complexity. FaceHop is also developed with the SSL principle and built upon the foundation of PixelHop++. According to our experiments, FaceHop outperforms LeNet-5 in accuracy while LeNet-5 has a 4.5x larger model size. We propose a high-performance data-efficient low-resolution face recognition model called LRFRHop for resource-constrained environments using the SSL technology. SSL offers an ex- plainable non-parametric feature extraction submodel that flexibly trades the model size for the verification performance. 
Its training complexity is significantly lower than that of DNN-based models since it is trained in a one-pass feedforward manner without backpropagation. Furthermore, active learning can be conveniently incorporated to reduce the labeling cost. We demonstrate the effectiveness of LRFRHop by conducting experiments on two well-known datasets.

Vision-and-Language (V&L) models take an image and text as input and learn to capture the associations between them. Prior studies show that pre-trained V&L models can significantly improve model performance on downstream tasks; however, they are less effective when applied in the medical domain due to the domain gap. We investigate the challenges of applying pre-trained V&L models in medical applications and propose BERTHop, a transformer-based model built on PixelHop++ and BlueBERT, to overcome these limitations and better capture the associations between the two modalities. Experiments on OpenI, a commonly used thoracic disease diagnosis benchmark, show that BERTHop outperforms the state-of-the-art while being trained on a 9x smaller dataset.

One-shot image synthesis focuses on tackling different image synthesis tasks using only a single training image. Existing models in this category either cannot generate realistic results or cannot handle all types of images, including repetitive and non-repetitive ones. We illustrate the limitations of existing models and propose MAGIC, a mask-guided one-shot image synthesis model based on quasi-robust model inversion which can achieve high-quality results for the shape and location control tasks on all types of inputs. By conducting extensive experiments, we show that MAGIC outperforms the state-of-the-art and synthesizes high-quality results for both repetitive and non-repetitive images. Furthermore, we demonstrate the benefit of quasi-robust model inversion compared with non-robust and strongly robust model inversion for image synthesis.

Chapter 1
Introduction

1.1 Significance of the Research

Image classification models learn to predict the probability of an image belonging to different classes, i.e., they learn the conditional probability distribution p(y|x), where x is the input image and y is a class label. Inherently, they tend to learn the boundary between classes in a dataset. On the other hand, image generation and synthesis models learn the probability distribution of the data. In Figure 1.1, sample inputs and outputs for a classification model and a generation model are shown.

Image classification is a fundamental task in computer vision and has been studied and explored for decades. It has a wide range of applications and is used in many fields including medicine, security, and robotics. For example, in the medical domain, reliable classification models can help accurately diagnose diseases from X-ray or MRI images and reduce human error. Vision-and-Language (V&L) classification is a task similar to image classification, with the difference that the input to the classification model is an image-text pair and the model may use information from both the image and the text to predict the class label.

Previously, the classification task could be divided into two main steps: 1) feature extraction and 2) classification. In the first step, handcrafted representations of the image were extracted to form feature vectors, and in the second step these features were fed into a classification algorithm, e.g., a Support Vector Machine (SVM), Random Forest (RF), or K-Nearest Neighbors (KNN).
Figure 1.1: Sample inputs and outputs for an image classification and an image generation model.

Nowadays, Deep Neural Networks (DNNs) are considered the state-of-the-art approach for tackling the classification task and are trained end-to-end. DNNs with more than one hundred layers and tens of millions of parameters outperform humans on large-scale classification datasets. For example, ResNet-152 [59] has achieved an error rate of 3.6% for classification on the challenging ImageNet [34] benchmark, while the human error rate on this dataset is reported to be 5.1%.

Despite the success of DNNs, they generally have several limitations. First, they are data-hungry and require a large amount of training data. For example, ResNet-152 is trained on more than 1.2 million training images of ImageNet, but in some cases a large training set is not available. To address this issue, DNN models are first pre-trained on a source dataset and then fine-tuned for a few epochs on the target dataset; however, if there is a domain gap between the dataset the model is pre-trained on and the target dataset, such a pre-train and fine-tune paradigm may not be helpful. Second, DNNs require a lot of time and computational power to be trained, meaning they are not power-efficient. For example, training ResNet-152 on the ImageNet dataset using an NVIDIA P100 GPU card would take roughly more than 3 weeks [30]. Finally, they have tens of millions of parameters, so they require a large amount of memory during training and inference. For example, ResNet-152 has about 60 million parameters.

Image generation is another important and widely studied task in computer vision which has numerous applications including image editing, data augmentation, and image super-resolution. Image synthesis is an image generation task which is conditioned on input data, e.g., an input image. Traditional image generation models could synthesize high-quality texture images but were not able to generate realistic images of complicated scenes. Deep generative models tackle this problem and generate realistic images by using massive training data, which requires a large amount of computational power. For example, StyleGAN [75], a successful model for generating real-looking face images, is trained on a dataset containing more than 70,000 high-quality images, and its training process takes about 41 days on a single Tesla V100 GPU.

In this dissertation, we first focus on developing a new classification method that is small in model size, data- and power-efficient during training and inference, mathematically transparent, and able to achieve competitive performance compared with the state-of-the-art. Then, using machine learning and deep learning techniques, we further extend our method to different image and V&L classification tasks. Finally, we propose a one-shot image synthesis model, i.e., a model that only requires a single training image for synthesizing new variations of the image.

1.2 Background

Image classification has a long history of development in the computer vision field. Before the popularity of Convolutional Neural Networks (CNNs), handcrafted features were extracted from images to form feature vectors. For example, image features could be obtained by SIFT [102], GIST [37], or HOG [32] and then fed to a classifier such as an SVM for an image classification task.
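As a concrete illustration of this two-stage paradigm, the sketch below pairs a handcrafted descriptor with an off-the-shelf classifier. It is a minimal example rather than anything taken from this dissertation: the HOG parameters, the random placeholder data, and the choice of a linear SVM are assumptions made purely for illustration.

```python
import numpy as np
from skimage.feature import hog
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Placeholder data: in practice these would be aligned gray-scale image crops
# and their class labels, not random arrays.
images = np.random.rand(100, 64, 64)
labels = np.random.randint(0, 2, size=100)

def extract_features(img):
    # Stage 1: a handcrafted descriptor (HOG here; SIFT, GIST, or LBP are alternatives).
    return hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

X = np.array([extract_features(im) for im in images])

# Stage 2: an off-the-shelf classifier trained on the extracted feature vectors.
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X, labels)
predictions = clf.predict(X)
```

The two stages are fully decoupled: the descriptor is fixed by hand, and only the classifier is learned from labeled data.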
After the success of LeNet-5 [90] for handwritten character recognition, CNNs became the dominant method for image representation learning in image classification tasks. LeNet-5 is a shallow and rather small CNN model consisting of three convolutional layers, two pooling layers, and two fully connected layers. With the growth of computational power, within a few years several breakthrough CNN-based models were proposed for image classification, including AlexNet [85], VGG [141], and ResNet [59]. These models improved performance significantly but also intensified data, memory, and power consumption, making them unsuitable for resource-constrained environments.

Successive Subspace Learning (SSL) is a recently introduced technique for automatic image representation learning. The key idea of SSL is to compute the weights of a multi-level model by a closed-form expression without using any backpropagation. Inspired by the architecture of CNNs, Kuo et al. [89] proposed the Saab transform, a multi-level SSL-based model for image representation learning in which the kernel weights are computed by Principal Component Analysis (PCA). The Saab transform extracts effective features from images in an unsupervised manner; the features extracted from an image can then be fed into a classifier for class label prediction. The Saab transform is mathematically transparent and is also considered data-, memory-, and energy-efficient compared with CNN models. Following the Saab transform, Chen et al. [23] proposed the PixelHop model as an illustrative example of SSL for image classification. PixelHop uses the Saab transform for feature extraction from images and also introduces a new module, called Label-Assisted Regression (LAG), for supervised dimension reduction of the Saab output.

V&L classification is a relatively new research area compared with image classification. In a generic V&L classification model, image features and text embeddings are extracted first. Then, an attention mechanism is employed to specify the importance of each image feature and text embedding for predicting the final class label. Up-Down [5] and VisualBERT [95] are two well-known V&L classification models. Up-Down uses Faster R-CNN [120] for feature extraction from the image of an image-text pair and a Gated Recurrent Unit (GRU) [28] for encoding the text. Then, given the output of the GRU, a normalized attention weight is generated for each image feature and used to compute a weighted sum of the image features. Finally, the attended image feature and the text embedding are used to compute a joint representation of the image and the text, which is then used to predict the class label. VisualBERT uses the BERT model [35] as the attention mechanism for capturing the alignment between the image features extracted by Faster R-CNN and the text embedding.

One-shot image generation and synthesis is a relatively new research area in computer vision which was introduced in [138] for tackling the image re-targeting task. Recently, IMAGINE [159] was proposed for one-shot image synthesis, but it has very loose control over synthesized images and cannot produce satisfactory results for position control and shape control, i.e., modifying the location or shape of the synthesized objects and scenes. DeepSIM [155] proposed using mask guidance for high-quality position and shape control and is the state-of-the-art one-shot method for this task.
It uses thin-plate-spline (TPS) warping as a data-augmentation method for generating a large amount of training data out of the single provided image, which is then used to train a large Pix2PixHD model [160].

1.3 Contributions of the Research

1.3.1 Image Representation Learning

In this work, we propose PixelHop++, which is an enhanced version of the PixelHop model. PixelHop++ is currently considered the state-of-the-art SSL-based method for representation learning from images and video frames. We show that, while being a smaller model, PixelHop++ outperforms PixelHop on various datasets. The main contributions of this work are summarized below:

1. First, we point out the weak correlation of different spectral components of the Saab transform, which is used in PixelHop for dimension reduction. Then, we exploit this property to design a channel-wise (c/w) Saab transform, which can reduce the filter size as well as the memory requirement for filter computation in PixelHop++.

2. Second, we propose a novel tree-decomposed feature representation method whose leaf nodes provide scalar (or 1D) features. By concatenating the leaf nodes' features, we obtain a feature vector of higher dimension for PixelHop++.

3. Third, we compute the cross-entropy value of each feature and order them from lowest to highest. A feature of lower cross-entropy has higher discriminant power. As a result, we can find a proper subset of features that are suitable for the classification task.

1.3.2 Face Gender Classification

In this work, we propose FaceHop, an SSL-based method for face gender classification which is built upon the foundation of PixelHop++. FaceHop has quite a few desired characteristics, including a small model size, a small training data amount, low training complexity, and low-resolution input images. The main contributions of this work are summarized below:

1. First, it offers a practical solution to the challenging face biometrics problem in a resource-constrained environment.

2. Second, it is the first effort that applies SSL to face gender classification and demonstrates its superior performance.

3. Third, FaceHop is fully interpretable, non-parametric, and non-DL-based. It offers a brand new path for research and development in biometrics.

1.3.3 Face Recognition

In this work, we propose LRFRHop, an SSL-based model for face recognition which leverages PixelHop++ for feature learning from face images. LRFRHop is a high-performance, data-efficient, low-resolution face recognition model for resource-constrained environments.

The main contribution of our work lies in the assembly of two effective tools, PixelHop++ and active learning, to address the challenge of face recognition in resource-constrained environments. Both PixelHop++ and active learning are existing tools. Yet, to the best of our knowledge, this is the first time that they are jointly applied to a face biometric problem. We will demonstrate the power of the integrated solution in the context of face recognition with extensive experiments. As the second contribution, we propose a pairwise feature generation module to extract effective joint features from the PixelHop++ output channels for each pair of face images.

1.3.4 V&L Classification in the Medical Domain

In this work, we propose BERTHop, a transformer-based V&L model designed for medical applications. We show that BERTHop outperforms the state-of-the-art for Chest X-Ray (CXR) disease diagnosis while being trained on a much smaller training set.
The main contributions of this work are listed below:

1. We propose BERTHop, a novel data-efficient V&L model for CXR disease diagnosis surpassing existing approaches.

2. Our proposed model incorporates PixelHop++ into a transformer-based model. To the best of our knowledge, this is the first study that integrates PixelHop++ with Deep Neural Network (DNN) models.

3. We conduct extensive experiments to demonstrate the effectiveness of each submodel used in BERTHop.

4. We study how transformer initialization with a model pre-trained on in-domain data (even on a single modality) is highly beneficial in the medical domain.

1.3.5 One-shot Mask-guided Image Synthesis

In this work, we propose MAGIC, a model for one-shot image synthesis. We demonstrate the strength of MAGIC for the location control and shape control tasks and show that it outperforms the state-of-the-art by conducting qualitative and quantitative experiments. The main contributions of this work are as follows:

1. We propose MAGIC, a one-shot mask-guided image synthesis model based on quasi-robust model inversion.

2. We demonstrate the benefit of using quasi-robust model inversion compared with non-robust and strongly-robust model inversion for image synthesis.

3. We conduct a detailed ablation study to demonstrate the effect of the proposed components of MAGIC in improving its results.

1.4 Organization of the Dissertation

The rest of the dissertation is organized as follows. In Chapter 2, we propose an SSL model which is built upon the Saab transform and offers a flexible tradeoff between the model size and the classification performance. In Chapter 3, we propose an explainable, efficient, and lightweight SSL-based method for low-resolution face gender classification. In Chapter 4, we propose a novel SSL-based low-resolution face recognition model, which can achieve competitive performance compared with the state-of-the-art while being highly data-, power-, and memory-efficient. In Chapter 5, we propose a V&L classification model for chest X-ray disease diagnosis which is data-efficient and achieves superior performance compared with the state-of-the-art while being trained on a small training set. In Chapter 6, we propose a one-shot image synthesis model and demonstrate that it outperforms the state-of-the-art through extensive qualitative and quantitative experiments. Finally, concluding remarks and future research directions are given in Chapter 7.

Chapter 2
PixelHop++: An Enhanced Successive Subspace Learning Model

2.1 Introduction

The design of small machine learning models has been a hot research topic in recent years since small models are essential to mobile and edge computing applications. There has been a lot of research dedicated to neural network model compression and acceleration, e.g., [26, 40, 58, 65, 119]. Techniques such as parameter pruning, quantization, binarization and sharing, low-rank factorization, transferred filters, and knowledge distillation have been applied to larger network models to achieve this goal. Another path is to design small network models from scratch. Examples include SqueezeNet [67], SquishedNets [134] and SqueezeNet-DSC [128]. Similar to the situation with large neural-network-based learning models, the underlying mechanism of small learning models remains a mystery. Inspired by studies of deep learning networks, the successive subspace learning (SSL) principle was recently proposed.
It has been used to design two interpretable machine learning models – PixelHop for image classification [23] and PointHop for point cloud classification [176]. Since no backpropagation is needed in SSL-based model training, the training can be done efficiently. In this work,¹ we focus on the model size (in terms of model parameters) of PixelHop and propose several ideas to reduce it, resulting in a new method called PixelHop++.

This work has several contributions. First, we point out the weak correlation of different spectral components of the Saab transform, which is used in PixelHop for dimension reduction. Then, we exploit this property to design a channel-wise (c/w) Saab transform, which can reduce the filter size as well as the memory requirement for filter computation in PixelHop++. Second, we propose a novel tree-decomposed feature representation method whose leaf nodes provide scalar (or 1D) features. By concatenating the leaf nodes' features, we obtain a feature vector of higher dimension for PixelHop++. Third, we compute the cross-entropy value of each feature and order them from lowest to highest. A feature of lower cross-entropy has higher discriminant power. As a result, we can find a proper subset of features that are suitable for the classification task. In PixelHop++, one can control the learning model size at fine granularity, offering a flexible tradeoff between the model size and the classification performance. We demonstrate the flexibility of PixelHop++ on three datasets: MNIST, Fashion MNIST, and CIFAR-10.

¹ This research was supported by the U.S. Army Research Laboratory's External Collaboration Initiative (ECI) of the Director's Research Initiative (DRIA) program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

2.2 Background Review

Subspace learning is one of the fundamental problems in signal/image processing and computer vision [153, 13, 82, 103, 53, 157]. The subspace method learns a projection from an input data space into a lower-dimensional subspace which serves as an approximation to the input space. Inspired by the deep learning (DL) framework and built upon the foundation in [87, 88, 86, 89], Kuo et al. proposed the successive subspace learning (SSL) principle to design interpretable machine learning models. Concrete examples include PixelHop [23] and PointHop [176], which are designed for image and point cloud classification problems, respectively. Their model parameters are determined stage-by-stage in a feedforward manner without any backpropagation (BP).

Figure 2.1: The block diagram of the PixelHop++ method that contains three PixelHop++ Units in cascade.
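Each PixelHop++ unit in Fig. 2.1 performs neighborhood construction, a Saab-based transform (detailed in Secs. 2.2 and 2.3), and max-pooling. The sketch below is a simplified, PCA-based stand-in for one such unit, not the implementation used in this dissertation: the window size, the kernel count, the handling of the DC component, and the omission of the Saab bias term are assumptions made only for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.decomposition import PCA

def pixelhop_unit(images, win=5, n_ac_kernels=15):
    """One PixelHop++-style unit: neighborhood construction, a PCA-based
    stand-in for the Saab transform, then (2x2)-to-(1x1) max-pooling.

    images: (N, H, W, C) array. Returns pooled response maps of shape
    (N, H'//2, W'//2, 1 + n_ac_kernels). The true Saab transform also adds
    a bias term to avoid sign confusion [89]; that detail is omitted here.
    """
    # 1) Neighborhood construction: all win x win x C blocks with stride 1.
    patches = sliding_window_view(images, (win, win), axis=(1, 2))   # (N, H', W', C, win, win)
    N, Hp, Wp = patches.shape[:3]
    X = patches.reshape(N * Hp * Wp, -1)                             # flatten each neighborhood

    # 2) Saab-like kernels: one DC (mean) response plus PCA-derived AC responses.
    dc = X.mean(axis=1, keepdims=True)
    pca = PCA(n_components=n_ac_kernels).fit(X - dc)
    ac = pca.transform(X - dc)
    resp = np.concatenate([dc, ac], axis=1).reshape(N, Hp, Wp, -1)

    # 3) Max-pooling to reduce spatial redundancy of the response maps.
    H2, W2 = Hp // 2, Wp // 2
    return resp[:, :H2 * 2, :W2 * 2, :].reshape(N, H2, 2, W2, 2, -1).max(axis=(2, 4))

# Example: ten 32x32 gray-scale images -> (10, 14, 14, 16) response maps.
maps = pixelhop_unit(np.random.rand(10, 32, 32, 1))
```

Cascading three such units, with the channel-wise variant of Sec. 2.3.1, produces the successively larger (near-, mid- and far-range) neighborhoods described next.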
The PixelHop method contains three main modules: 1) successive near-to-far neighborhood expansion and unsupervised dimension reduction; 2) supervised dimension reduction via label-assisted regression (LAG); and 3) feature concatenation and decision making. PixelHop extracts features successively from a pixel and its near-, mid- and far-range neighborhoods in multiple stages, where each stage corresponds to one PixelHop unit. To control the rapid growth of the output dimension of a PixelHop unit, the Saab (subspace approximation with adjusted bias) transform [89] is adopted for unsupervised dimension reduction.

2.3 Proposed PixelHop++ Method

The block diagram of the PixelHop++ method for tiny images (of size 32 × 32) is shown in Fig. 2.1. It has three PixelHop++ units in cascade. In all of them, the Saab transform kernels are of the same spatial dimension (5 × 5). We also apply the max-pooling operation to the filtered outputs of the first and the second PixelHop++ units. As compared with PixelHop, PixelHop++ has the following three modifications:

1. We replace the traditional Saab transform with the channel-wise (c/w) Saab transform.
2. A novel tree-decomposed feature representation is constructed.
3. We order leaf nodes' features based on their cross-entropy values and use them to select a feature subset.

They will be detailed below.

2.3.1 Channel-wise (c/w) Saab Transform

Module 1 in Fig. 2.1 is an unsupervised feature learning module. By arguing that the Saab transform can be approximated by the tensor product of two separable transforms, one along the spatial domain and the other along the spectral domain, we can reduce the size of the Saab transform. The Saab transform is a variant of the PCA (Principal Component Analysis) transform. Since PCA decorrelates the covariance matrix into a diagonal matrix, all channel components are decoupled. Although the Saab transform is not identical to the PCA transform, we expect Saab coefficients to be weakly correlated in the spectral domain.

To validate the spatial-spectral separability assumption, we show the average correlations of Saab coefficients at the outputs of the first, the second and the third PixelHop++ units in Table 2.1. The first two rows in the table indicate the averaged spatial correlation within a window of size 5 × 5 at the output of the first (Spatial 1) and the second (Spatial 2) PixelHop++ units for a fixed spectral component. The last three rows indicate the averaged spectral correlation at the output of the first (Spectral 1), the second (Spectral 2) and the third (Spectral 3) PixelHop++ units at the center pixel location. Only outputs of AC filters are used in the computation. We see that spectral correlations are weaker than spatial correlations. This is especially obvious for the CIFAR-10 dataset. Furthermore, these correlations become weaker as we go into deeper PixelHop++ units. The weak spectral correlation of Saab coefficients allows us to approximately decompose the joint spatial-spectral input tensor of dimension 5 × 5 × K_i, i = 1, 2, to the (i+1)th PixelHop++ unit into K_i spatial tensors of size 5 × 5 (i.e., one for each spectral component).

Table 2.1: Averaged correlations of filtered AC outputs from the first to the third PixelHop++ units with respect to the MNIST, Fashion MNIST and CIFAR-10 datasets.
Dataset      MNIST            Fashion MNIST      CIFAR-10
Spatial 1    0.48 ± 0.22      0.51 ± 0.17        0.53 ± 0.17
Spatial 2    0.22 ± 0.17      0.29 ± 0.19        0.28 ± 0.25
Spectral 1   0.002 ± 0.002    0.0003 ± 0.0002    0.0006 ± 0.001
Spectral 2   0.003 ± 0.002    0.0011 ± 0.0009    0.001 ± 0.001
Spectral 3   0.009 ± 0.007    0.008 ± 0.006      0.008 ± 0.006

Then, instead of performing the traditional Saab transform of high dimension as in PixelHop, we apply K_i channel-wise (c/w) Saab transforms, one to each of these spatial tensors, in PixelHop++.

The traditional Saab transform and the c/w Saab transform are compared in Fig. 2.2 from the whole-image viewpoint. The traditional Saab transform takes an input image of dimension S_i × S_i × K'_i and generates an output image of dimension S_{i+1} × S_{i+1} × K'_{i+1} after the max-pooling operation that pools from a grid of size S_i × S_i to a grid of size S_{i+1} × S_{i+1}. The c/w Saab transform takes K_i channel images of dimension S_i × S_i as the input and generates K_{i+1} output images of dimension S_{i+1} × S_{i+1} after max-pooling.

Figure 2.2: Comparison of the traditional Saab transform and the proposed c/w Saab transform.

We adopt the same kernel and bias design principle of the Saab transform in the design of the c/w Saab transform. Their main difference lies in the input tensor size. The dimension of the input tensors for PixelHop++ is the spatial neighborhood size (5 × 5 = 25 in the current example), while that for PixelHop is the product of the spectral dimension and the spatial dimension (25 × K'). Separable transforms, in general, facilitate computation and allow a smaller set of kernel parameters.

2.3.2 Tree-decomposed Feature Representation

In this subsection, we propose a new feature representation method, called the tree-decomposed feature representation, as illustrated in Fig. 2.3. The root node of the tree is the input image of dimension 32 × 32 × K_0, where K_0 = 1 and 3 for gray-scale and color images, respectively. We normalize the total energy of the root node to unity. The first PixelHop++ unit yields the first-level child nodes. This unit applies the c/w Saab transform with 5 × 5 transform kernels to the input images with a stride equal to one. Its output contains multiple response maps of dimension 28 × 28, where the boundary effect is taken into account. We apply the standard (2 × 2)-to-(1 × 1) max-pooling to reduce the spatial redundancy of the response maps. The final output of the first PixelHop++ unit is a set of K_1 response maps of dimension 14 × 14. Each response map is a child node of the root node. The energy of a child node is the product of its parent node's energy and its normalized energy with respect to its parent node. If the energy of a child node is smaller than the pre-set threshold T, we treat it as a leaf node and the total energy of the response map is used as the feature of the node. If the energy of a child node is larger than threshold T, its response map will be used as the input to the next stage for further processing; it is called an intermediate node. For example, there are K_i nodes at the output of the ith PixelHop++ unit, i = 1, 2. We use K_{i,1} and K_{i,2} to denote the number of leaf and intermediate nodes at the ith level in Fig. 2.1. Clearly, K_{i,1} + K_{i,2} = K_i. We conduct the same operation in each PixelHop++ unit successively until the last stage is reached. Then, we obtain a tree-decomposed feature representation as shown in Fig. 2.3, whose leaf node has an associated feature that corresponds to the energy of the response map of one spectral component. The spectral components at different tree levels have different receptive field sizes. The associated features are useful for the image classification task.

Figure 2.3: Illustration of the tree-decomposed feature representation.
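The following sketch shows how one level of the channel-wise decomposition and the energy-based leaf/intermediate split might look, again using PCA as a stand-in for the Saab transform. The window size, kernel count, threshold value, and the use of the PCA explained-variance ratio as the "normalized energy with respect to the parent node" are illustrative assumptions, not the exact procedure used in this dissertation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.decomposition import PCA

def cw_saab_level(channel_maps, channel_energy, T=1e-3, win=5, n_kernels=8):
    """One level of the channel-wise decomposition with energy-based pruning.

    channel_maps  : list of (N, H, W) arrays, one spatial tensor per input channel
    channel_energy: list of floats, the energy of each parent channel (root = 1.0)
    Returns (leaf_nodes, next_maps, next_energy).
    """
    leaf_nodes, next_maps, next_energy = [], [], []
    for maps, e_parent in zip(channel_maps, channel_energy):
        # Neighborhood construction: all win x win patches of this single channel.
        patches = sliding_window_view(maps, (win, win), axis=(1, 2))
        N, Hp, Wp = patches.shape[:3]
        X = patches.reshape(-1, win * win)
        # Per-channel PCA as a stand-in for the c/w Saab transform.
        pca = PCA(n_components=n_kernels).fit(X)
        resp = pca.transform(X).reshape(N, Hp, Wp, n_kernels)
        for k in range(n_kernels):
            # Child energy = parent energy x normalized energy w.r.t. the parent.
            e_child = e_parent * pca.explained_variance_ratio_[k]
            child = resp[..., k]
            if e_child < T:
                leaf_nodes.append((child, e_child))   # leaf node: stop processing here
            else:
                next_maps.append(child)               # intermediate node: feeds the next unit
                next_energy.append(e_child)
    return leaf_nodes, next_maps, next_energy
```

Calling this function once per level (with max-pooling in between, as in Fig. 2.1) grows the channel-decomposed tree until only leaf nodes remain at the last stage.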
2.3.3 Cross-Entropy-Guided Feature Selection

The tree-decomposed feature representation process is unsupervised; it provides task-independent features. Next, we need to find a link from features to desired labels. In this subsection, we develop the new Module 2 in Fig. 2.1 based on this representation. First, we compute the cross-entropy value for each feature at the leaf nodes via

L = \sum_{j=1}^{J} L_j, \qquad L_j = -\sum_{c=1}^{M} y_{j,c} \log(p_{j,c}),    (2.1)

where M is the class number, y_{j,c} is a binary indicator showing whether sample j is correctly classified, and p_{j,c} is the probability that sample j belongs to class c. The lower the cross-entropy, the higher the discriminant power. We order features from the smallest to the largest cross-entropy scores and select the top N_S features. This new cross-entropy-guided feature selection process can reduce the model size of the label-assisted regression (LAG) unit. Finally, we concatenate the M features from each PixelHop++ unit to form a feature vector and feed it into a classifier in Module 3, where a simple linear least-squares regressor is adopted in our experiments. There are two hyperparameters, T and N_S, which can be used to control the model size flexibly. We will study their impacts on classification accuracy in Sec. 2.4.2.
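A minimal sketch of this selection step is given below. How p_{j,c} is estimated from a single scalar feature is not specified in the excerpt above, so the histogram-binning estimate, the bin count, and the function name are assumptions made purely for illustration.

```python
import numpy as np

def cross_entropy_rank(features, labels, n_bins=32, n_selected=1000, eps=1e-12):
    """Rank 1-D leaf-node features by a cross-entropy criterion in the spirit of Eq. (2.1).

    features: (J, D) array, one row per training sample, one column per feature
    labels  : (J,) integer class labels in [0, M)
    Returns the indices of the n_selected lowest cross-entropy features.
    """
    J, D = features.shape
    M = labels.max() + 1
    ce = np.empty(D)
    for d in range(D):
        # Quantile bins over this feature's values; p(c | bin) estimated by counting.
        bins = np.quantile(features[:, d], np.linspace(0, 1, n_bins + 1))
        idx = np.clip(np.searchsorted(bins, features[:, d], side="right") - 1, 0, n_bins - 1)
        counts = np.zeros((n_bins, M))
        np.add.at(counts, (idx, labels), 1)
        p = counts / np.clip(counts.sum(axis=1, keepdims=True), 1, None)
        # L = sum_j -log p_{j, y_j}: lower cross-entropy means a more discriminant feature.
        ce[d] = -np.log(p[idx, labels] + eps).sum()
    return np.argsort(ce)[:n_selected]
```

The returned indices would then restrict the feature vector entering the LAG unit, which is how this step reduces the Module 2 model size.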
2.4 Experiments

Figure 2.4: The relation between the test accuracy (%) and the energy threshold T in PixelHop++ for MNIST, Fashion MNIST and CIFAR-10, where the number of model parameters in Module 1 is shown at each operational point.

Figure 2.5: The relation between the test accuracy (%) and the selected number N_S of cross-entropy-guided features in PixelHop++ for MNIST, Fashion MNIST and CIFAR-10, where the number of model parameters in Module 2 is shown at each operational point.

2.4.1 Experimental Setup

We test the PixelHop++ method on three popular datasets: MNIST [90], Fashion MNIST [170] and CIFAR-10 [84]. MNIST and Fashion MNIST contain gray-scale images of size 28 × 28, and zero-padding is used to enlarge the image size to 32 × 32. CIFAR-10 has 10 object classes of color images of size 32 × 32. We conduct performance benchmarking of PixelHop++ and LeNet-5 [90] in terms of classification accuracy and model complexity. We adopt the original LeNet-5 architecture for MNIST. To handle the more complicated images in Fashion MNIST and CIFAR-10, we increase the filter numbers of the convolutional layers and fully-connected layers as shown in Table 2.2. When applying PixelHop++ to MNIST, we adopt the tree-decomposed feature representation and use features of leaf nodes from the third level only, since the discriminant power of features in the first two levels is very weak. For Fashion MNIST and CIFAR-10, we use concatenated features from leaf nodes at all three levels. We set the output dimension, M, of the LAG unit to M = 10.

Table 2.2: Comparison of the original and the modified LeNet-5 architectures on the three benchmark datasets.

Dataset                 MNIST        Fashion MNIST   CIFAR-10
1st Conv. Kernel Size   5 × 5 × 1    5 × 5 × 1       5 × 5 × 3
1st Conv. Kernel No.    6            16              32
2nd Conv. Kernel Size   5 × 5 × 6    5 × 5 × 16      5 × 5 × 32
2nd Conv. Kernel No.    16           32              64
1st FC. Filter No.      120          200             200
2nd FC. Filter No.      84           100             100
Output Node No.         10           10              10

2.4.2 Effects of Hyper-Parameters in PixelHop++

We study the effect of the two hyperparameters, T and N_S, on the classification accuracy of PixelHop++. We can control the model size in Module 1 by the energy threshold T: a smaller threshold yields a larger model. As shown in Fig. 2.4, the test accuracy decreases only slightly as T increases, while the model size is reduced significantly, for all three datasets. For example, by moving T from 0.00001 to 0.0005 for MNIST, the test accuracy is decreased by 0.51% while the model parameter number becomes 4x smaller. We see clear advantages of the tree-decomposed feature representation in finding a good tradeoff.

By varying the N_S value in Module 2, we plot the test accuracy as a function of N_S in Fig. 2.5. The test accuracy decreases marginally as we decrease N_S to yield smaller models for all three datasets. Taking MNIST as an example, the test accuracy when keeping 500 features is only 0.57% lower than when keeping all features (3x more) in Module 2. This is because a smaller feature subset with more discriminant features can be selected through the cross-entropy-guided feature selection process. Overall, by adjusting T and N_S, we can control the model size at fine granularity.

2.4.3 Performance Benchmarking

We compare the classification accuracy and model complexity of LeNet-5 and PixelHop++ on all three datasets in Table 2.3 and Table 2.4, respectively. Here, we report the performance of two model settings of PixelHop++, i.e., a larger model and a smaller model. In terms of classification accuracy, LeNet-5 performs best on MNIST and CIFAR-10, yet the large PixelHop++ model outperforms LeNet-5 on Fashion MNIST with fewer parameters. On the other hand, by slightly sacrificing accuracy, the small PixelHop++ model demands about 2x, 6x and 6x fewer parameters than LeNet-5 for MNIST, Fashion MNIST and CIFAR-10, respectively. With these performance numbers, we can claim that PixelHop++ is more effective than LeNet-5.

Table 2.3: Comparison of test accuracy (%) of LeNet-5 and PixelHop++ for MNIST, Fashion MNIST and CIFAR-10.
Method               MNIST    Fashion MNIST   CIFAR-10
LeNet-5              99.04    89.74           68.72
PixelHop++ (Large)   98.49    90.17           66.81
PixelHop++ (Small)   97.98    88.84           64.75

Table 2.4: Comparison of the model size (in terms of the total parameter numbers) of LeNet-5 and PixelHop++ for the MNIST, the Fashion MNIST and the CIFAR-10 datasets.

Method               MNIST     Fashion MNIST   CIFAR-10
LeNet-5              61,706    194,558         395,006
PixelHop++ (Large)   111,981   127,186         115,623
PixelHop++ (Small)   29,514    33,017          62,150

2.5 Conclusion

An image classification method with an interpretable and small learning model design was proposed in this chapter. Extensive experiments were conducted on three benchmark datasets (MNIST, Fashion MNIST, and CIFAR-10) to demonstrate that the model size of PixelHop++ can be flexibly controlled and that PixelHop++ maintains the classification accuracy with fewer parameters compared with LeNet-5.

Chapter 3
FaceHop: A Light-Weight Low-Resolution Gender Classification Method

3.1 Introduction

Face attributes classification is an important topic in biometrics. The ancillary information of faces such as gender, age, and ethnicity is referred to as soft biometrics in forensics [71, 121, 47]. The face gender classification problem has been extensively studied for more than two decades. Before the resurgence of deep neural networks (DNNs) around 7-8 years ago, the problem was treated using the standard pattern recognition paradigm, which consists of two cascaded modules: 1) unsupervised feature extraction and 2) supervised classification via common machine learning tools such as support vector machine (SVM) and random forest (RF) classifiers.

We have seen fast progress on this topic due to the application of deep learning (DL) technology in recent years. Generally speaking, cloud-based face verification, recognition, and attribute classification technologies have become mature, and they have been used in many real-world biometric systems. Convolutional neural networks (CNNs) offer high accuracy. Yet, they rely on large learning models consisting of several hundreds of thousands or even millions of model parameters. The superior performance is contributed by factors such as higher input image resolutions, more and more training images, and abundant computational/memory resources.
It is developed with the successive subspace learning (SSL) principle [86, 87, 89] and built upon the foundation of the PixelHop++ system [24]. The effectiveness of the FaceHop method is demonstrated by experiments on two benchmarking datasets. For gray-scale face images of resolution 32×32 obtained from the LFW and the CMU Multi-PIE datasets, FaceHop achieves gender classification accuracy of 94.63% and 95.12% with model sizes of 16.9K and 17.6K parameters, respectively. FaceHop outperforms LeNet-5, while the LeNet-5 model is significantly larger and contains 75.8K parameters.

There are three main contributions of this work. First, it offers a practical solution to the challenging face biometrics problem in a resource-constrained environment. Second, it is the first effort that applies SSL to face gender classification and demonstrates its superior performance. Third, FaceHop is fully interpretable, non-parametric, and non-DL-based. It offers a brand new path for research and development in biometrics.

The rest of this Chapter is organized as follows. Related work is reviewed in Sec. 3.2. The FaceHop method is presented in Sec. 3.3. The experimental setup and results are detailed in Sec. 3.4. Finally, concluding remarks and future extensions are given in Sec. 3.5.

3.2 Related Work

3.2.1 Face Attributes Classification

We can classify face attributes classification research into two categories: non-DL-based and DL-based. DL-based solutions construct an end-to-end parametric model (i.e., a network), define a cost function, and train the network to minimize the cost function with labeled face gender images. The contribution typically arises from a novel network design. Non-DL-based solutions follow the pattern recognition paradigm, and their contributions lie in using different classifiers or extracting new features for better performance.

Non-DL-based Solutions. Researchers have studied different classifiers for gender classification. Gutta et al. [56] proposed a face-based gender and ethnic classification method using an ensemble of Radial Basis Functions (RBF) and Decision Trees (DT). SVM [110] and AdaBoost [9] have also been studied for face gender classification. Different feature extraction techniques have been explored to improve classification accuracy. A Gabor-kernel partial-least-squares discrimination (GKPLSD) method for more effective feature extraction was proposed by Štruc et al. [144]. Other handcrafted features were developed for face gender classification based on local directional patterns (LDP) [70] and shape from shading [167]. Cao et al. [16] combined Multi-order Local Binary Patterns (MOLBP) with Localized Multi-Boost Learning (LMBL) for gender classification.

Recent research has focused more on large-scale face image datasets. Li et al. [97] proposed a novel binary code learning method for large-scale face image retrieval and facial attribute prediction. Jia et al. [73] collected a large dataset of 4 million weakly labeled faces in the wild (4MWLFW). They trained the C-Pegasos classifier with multiscale Local Binary Pattern (LBP) features using the 4MWLFW dataset and achieved the highest test accuracy on the LFW dataset among non-DL-based methods up to now. The fusion of different feature descriptors and regions of interest (ROI) was examined by Castrillón-Santana et al. [18]. The mentioned methods either have weak performance, as a result of failing to extract strong features from face images, or have a large model size.

DL-based Solutions.
With the rapid advancement of DL technology, DL-based methods have become increasingly popular and achieve unprecedented accuracy in face biometrics [91]. Levi et al. [92] proposed a model to estimate age and gender using a small amount of training data. Duan et al. [38] introduced a hybrid CNN-ELM structure for age and gender classification, which uses a CNN for feature extraction from face images and an ELM for classifying the features. Taherkhani et al. [147] proposed a deep framework which predicts facial attributes and leveraged it as a soft modality to improve face identification performance. Han et al. [57] investigated the heterogeneous face attribute estimation problem with a deep multi-task learning approach. Ranjan et al. [118] proposed a multi-task learning framework for joint face detection, landmark localization, pose estimation, and gender recognition. Antipov et al. [6] investigated the relative importance of various regions of human faces for gender and age classification by blurring different parts of the faces and observing the loss in performance. ResNet50 [6], AlexNet [25], and VGG16 [91] were applied to gender classification on the LFW dataset, and decent performance was observed. However, these models have very large model sizes. Considerable amounts of computation and storage resources are required to implement these solutions.

Light-Weight CNNs. Light-weight networks are significantly smaller in size than regular networks while achieving comparable performance. They find applications in mobile/edge computing. One recent development is SqueezeNet [67], which achieves accuracy comparable to AlexNet [85] but uses 50x fewer parameters. It contains 4.8M model parameters. In the area of face recognition, Wu et al. [168] proposed a light CNN architecture that learns a compact embedding on a large-scale face dataset with massive noisy labels. Although the mentioned models are relatively small, they still require a large amount of training data.

3.2.2 Successive Subspace Learning (SSL)

Representation learning plays an important role in many applications. Most representation learning methods are built upon DL, which is a supervised approach. It is also possible to learn representations automatically (i.e., not handcrafted) in an unsupervised manner. For example, there exist correlations between image pixels, and these correlations can be removed using principal component analysis (PCA). The application of PCA to face images was introduced by Turk and Pentland [154]. The method is called the “Eigenface”. One main advantage of converting face images from the spatial domain to the spectral domain is that, when face images are well aligned, the dimension of input face images can be reduced significantly and automatically. Since we attempt to find a powerful subspace for face image representation, it is a subspace learning method. Chan et al. [19] proposed PCANet, which applies the PCA to input images in two stages. Chen et al. [23] proposed the PixelHop system, which applies cascaded Saab transforms [89] to input images in three stages, where the Saab transform is a variant of the PCA that adds a positive bias term to avoid the sign-confusion problem [86]. The main difference between Eigenface, PCANet and PixelHop is whether the PCA transform is conducted in one, two, or multiple stages.
If we apply a one-stage PCA, the face has a pure spatial-domain representation before the transform and a pure spectral-domain representation after it. Since the spatial representation is local, it cannot easily offer global contour and shape information. On the contrary, the spectral representation is global, so it fails to differentiate local variations. It is desirable to obtain multiple hybrid spatial/spectral representations. This can be achieved by multi-stage transforms. Kuo et al. developed two multi-stage transforms, called the Saak transform [88] and the Saab transform [89], respectively. Recently, the channel-wise (c/w) Saab transform was proposed in [24] to enhance the efficiency of the Saab transform.

Inspired by the function of the convolutional layers of CNNs [89], the PixelHop system [23] and the PixelHop++ system [24] were developed to serve the same function but are derived based on a completely different principle. The weights of convolutional filters in CNNs are obtained by end-to-end optimization through backpropagation. In contrast, the convolutional kernels used in PixelHop and PixelHop++ are the Saab filters. They are derived by exploiting the statistical correlations of neighboring pixels. As a result, both PixelHop and PixelHop++ are fully unsupervised. Neither labels nor backpropagation is needed in filter weight computation.

The PixelHop++ system [24] is an enhanced version of the PixelHop system [23]. The main difference between PixelHop and PixelHop++ is that the former uses the Saab transform while the latter adopts the c/w Saab transform. The c/w Saab transform requires fewer model parameters than the Saab transform since channels are decoupled in the c/w Saab transform.

3.3 Proposed FaceHop Method

An overview of the proposed FaceHop system is shown in Fig. 3.1. It consists of four modules: 1) Preprocessing, 2) PixelHop++, 3) Feature extraction, and 4) Classification. Since PixelHop++ is the core module of our proposed solution for face gender classification, the overall system is called FaceHop. The functionality of each module is explained below in detail.

Figure 3.1: An overview of the proposed FaceHop method (Preprocessing → PixelHop++ → Feature extraction → Classifier).

3.3.1 Preprocessing

Face images have to be well aligned in the preprocessing module to facilitate their processing in the following pipeline. In this work, we first use the dlib [80] tool for facial landmark localization. Based on the detected landmarks, we apply a proper 2D rotation to each face image to reduce the effect of pose variation. Then, all face images are centered and cropped to remove the background. Afterwards, we apply histogram equalization to each image to reduce the effect of different illumination conditions. Finally, all images are resized to a low resolution of 32×32 pixels.

3.3.2 PixelHop++

Both PixelHop and PixelHop++ are used to describe the local neighborhoods of a pixel efficiently and successively. The size of a neighborhood is characterized by the hop number. The one-hop neighborhood is the neighborhood of the smallest size. Its actual size depends on the filter size. For example, if we use a convolutional filter of size 5×5, then the hop-1 neighborhood is of size 5×5. The Saab filter weights are obtained by performing dimension reduction on the neighborhood of a target pixel using PCA. The Saab filters in PixelHop and PixelHop++ serve an equivalent role to convolutional filters in CNNs. For example, a neighborhood of size 5×5 has a dimension of 25 in the spatial domain. We can use the Saab transform to reduce its original dimension to a significantly lower one.
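The following is a simplified, single-stage sketch of how Saab-style kernels could be derived from flattened 5×5 neighborhoods with PCA and an added bias. The bias choice and the energy threshold here are illustrative simplifications rather than the exact procedure of [89]; variable and function names are assumptions.

```python
import numpy as np

def fit_saab_kernels(patches, energy_threshold=1e-4):
    """Derive Saab-style kernels from flattened 5x5 neighborhoods (shape: [n, 25]).

    The first kernel is the constant DC vector; the AC kernels are the PCA
    components of the DC-removed patches, truncated by an energy threshold."""
    n, d = patches.shape
    dc_kernel = np.ones(d) / np.sqrt(d)                 # local-mean (DC) filter
    dc = patches @ dc_kernel
    residual = patches - np.outer(dc, dc_kernel)        # remove the DC part
    cov = np.cov(residual, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                   # sort AC channels by descending energy
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals / eigvals.sum() > energy_threshold   # drop near-zero-energy channels
    kernels = np.vstack([dc_kernel, eigvecs[:, keep].T])
    bias = np.max(np.linalg.norm(patches, axis=1))      # large enough to make AC responses non-negative
    return kernels, bias

def saab_transform(patches, kernels, bias):
    """Project patches onto the kernels and shift the AC responses by the bias."""
    responses = patches @ kernels.T
    responses[:, 1:] += bias                            # DC stays as-is; AC responses become non-negative
    return responses
```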
We should mention that the neighborhood concept is analogous to the receptive field of a certain layer of CNNs. As we go to deeper layers, the receptive field becomes larger in CNNs. In the SSL context, we say that the neighborhood size becomes larger as the hop number increases.

The proposed 3-hop PixelHop++ system is shown in Fig. 3.2, which is a slight modification of [24] tailored to our problem. The input is a gray-scale face image of size 32×32. Each hop consists of a PixelHop++ unit followed by a (2×2)-to-(1×1) max-pooling operation. A PixelHop++ system has three ingredients: 1) successive neighborhood construction, 2) channel-wise Saab transform, and 3) tree-decomposed feature representation. They are elaborated below.

1) Successive neighborhood construction. We need to specify two parameters to build the neighborhood of the center pixel at each hop: the window size and the stride. We use a window size of 5×5 and a stride of 1 in all three hops in Fig. 3.2. The neighborhood size grows bigger as the hop number becomes larger due to the max-pooling operation. The first, second, and third hops characterize the information of the short-, mid-, and long-range neighborhoods of the center pixel, respectively. Each neighborhood has 25 degrees of freedom in the spatial domain. By collecting these neighborhood samples from different spatial locations, we can study their statistical correlations via a covariance matrix of dimension 25×25. Then, we conduct an eigenvector/eigenvalue analysis of the covariance matrix to find a more economical representation.

Figure 3.2: Illustration of the proposed 3-hop FaceHop system as a tree-decomposed representation with its depth equal to three, where each depth layer corresponds to one hop (spatial resolutions 32×32 → 28×28 → 14×14 → 10×10 → 5×5 → 1×1 across the PixelHop++ units and max-pooling layers; nodes are marked as intermediate, leaf, or discarded).

That is, we can convert pixel values from the spatial domain to the spectral domain, which leads to the PCA transform, for dimension reduction.

2) Channel-wise (c/w) Saab transform. The PCA transform has both positive and negative responses. We encounter a sign-confusion problem [86] when a convolutional operation in the (i+1)-th stage has the sum of two terms: 1) a positive response in the i-th stage multiplied by a positive outgoing link and 2) a negative response in the i-th stage multiplied by a negative outgoing link. Both terms contribute positive values to the output while their input patterns are out of phase. Similarly, there is another sign confusion when the convolutional operation in the (i+1)-th stage has the sum of a positive response multiplied by a negative filter weight and a negative response multiplied by a positive filter weight. Both contribute negative values. To resolve such confusion cases, a constant bias term is added to make all responses positive. This is called the Saab (subspace approximation via adjusted bias) transform [89]. Typically, the input of the next PixelHop++ unit is a 3D tensor of dimension N_x × N_y × k, where N_x = N_y = 5 are the spatial dimensions of a filter and k is the number of kept spectral components. The Saab transform is used in PixelHop. Since channel responses can be decorrelated by the eigen-analysis, we are able to treat each channel individually.
This results in the channel-wise (c/w) Saab transform [24]. The main difference between the standard Saab and the c/w Saab transforms is that one 3D tensor of dimension N_x × N_y × k can be decomposed into k 2D tensors of dimension N_x × N_y in the latter. Furthermore, responses in higher-frequency channels are spatially uncorrelated, so they do not have to go to the next hop. The c/w Saab transform is used in PixelHop++. It can reduce the model size significantly as compared with the Saab transform while preserving the same performance.

3) Tree-decomposed representation. Without loss of generality, we use the first hop to explain the c/w Saab transform design. The neighborhood of a center pixel contains 25 pixels. In the spectral domain, we first decompose it into the direct sum of two orthogonal subspaces - the DC (direct current) subspace and the AC (alternating current) subspace. Then, we apply the PCA to the AC subspace to derive the Saab filters. After the first-stage Saab transform, we obtain one DC coefficient and 24 AC coefficients in a grid of size 28×28. We classify the AC coefficients into three groups based on their associated eigenvalues: low-, mid-, and high-frequency AC coefficients. When the eigenvalues are extremely small, we can discard the responses in these channels without affecting the quality of the input face image. This is similar to the eigenface approach in spirit. For mid-frequency AC coefficients, the spatial correlation of their responses is too weak to offer a significant response in hop-2. Thus, we can terminate their further transform. For low-frequency AC coefficients, the spatial correlation of their responses is strong enough to offer a significant response in hop-2. Then, we conduct max-pooling and construct the hop-2 neighborhood of these frequency channels in a grid of size 14×14. It is easy to show these hop-by-hop operations using a tree in which each channel corresponds to a node. We use green, yellow and pink colors to denote low-, mid- and high-frequency AC channels in Fig. 3.2, where the DC channel is also colored in green. They are called the intermediate, leaf, and discarded nodes in a hierarchical tree of depth equal to three.

To determine which node belongs to which group, we use the energy of each node as the criterion. The energy of the root node is normalized to one. The energy of each node in the tree can be computed and normalized against the energy value of the root node. Then, we can choose two thresholds (in terms of energy percentages) at each hop to partition the nodes into three types. These energy thresholds are hyper-parameters of the PixelHop++ model.
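The partitioning rule can be summarized by the minimal sketch below. The threshold values shown are placeholders, not the ones used in our experiments, and the function name is illustrative.

```python
def partition_nodes(node_energies, forward_threshold=0.002, cutoff_threshold=0.0001):
    """Split the channels (nodes) of one hop into the three groups described above.

    node_energies: dict mapping a node id to its energy normalized against the root node."""
    intermediate, leaf, discarded = [], [], []
    for node, energy in node_energies.items():
        if energy < cutoff_threshold:
            discarded.append(node)        # too weak to keep
        elif energy >= forward_threshold:
            intermediate.append(node)     # forwarded to the next hop
        else:
            leaf.append(node)             # kept at the current hop
    return intermediate, leaf, discarded
```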
3.3.3 Feature Extraction

Responses at each of the three hops of the FaceHop system have different characteristics. As shown in Fig. 3.2, hop-1 has a response map of size 28×28, hop-2 has a response map of size 10×10, and hop-3 has a response map of size 1×1. Hop-1 responses give a spatially detailed representation of the input. Yet, it is difficult for them to offer regional and full views of the entire face unless the dimension of the hop-1 responses becomes extremely large, which is expensive and unnecessary. Hop-2 responses give a coarser view of the entire face, so a small set of them can cover a larger spatial region. Yet, they do not have the facial details given by hop-1 responses. Finally, hop-3 responses lose all spatial details but provide a single value at each frequency channel that covers the full face.

The eigenface approach can only capture responses of the full face and cannot obtain the information offered by hop-1 and hop-2 responses in the FaceHop system. We extract features based on responses in all three hops. We group pixel responses in hop-1 and hop-2 to form regional responses as shown in Fig. 3.3.

Figure 3.3: Collection of regional responses in hop-1 and hop-2 response maps as features in the FaceHop system: (a) four regions in hop-1 and (b) three regions in hop-2.

• Hop-1. We collect pixel responses in hop-1 to form four regions as shown in Fig. 3.3 (a). They cover the left eye, the right eye, the nose, and the mouth regions. Their spatial dimensions (height versus width) are 10×12, 10×12, 12×10 and 8×18, respectively. There are spatial correlations among responses of the same channel. Thus, we can apply another PCA to the responses of the same hop/region for dimension reduction. Usually, we can reduce the dimension to the range between 15 and 20. Afterwards, we concatenate the reduced-dimension vector of each region across all hop-1 channels (including both leaf and intermediate nodes) to create a hop/region feature vector and feed it to a classifier. There are four hop-1 regions, and we have four feature vectors that contain both spatial and spectral information of a face image. The dimensions of the hop-1 feature vectors in the four regions are given in Table 3.2.

• Hop-2. We collect pixel responses in hop-2 to form three regions as shown in Fig. 3.3 (b). They are: one horizontal stripe of dimension 3×10 covering the two eyes, another horizontal stripe of dimension 4×10 covering the mouth, and one vertical stripe of dimension 10×4 covering the nose as well as the central 40% region. Similarly, we perform dimension reduction via PCA and concatenate the spatially reduced dimensions of each region across all hop-2 channels to train three classifiers. The dimensions of the hop-2 feature vectors in the three regions are summarized in Table 3.2.

• Hop-3. We use all responses of hop-3 as one feature vector to train a classifier.

It is worthwhile to point out that, although some information of intermediate nodes is forwarded to the next hop, different hops capture different information contents due to varying spatial resolutions. For this reason, we include responses in both intermediate and leaf nodes at hop-1 and hop-2 as features.

3.3.4 Classifiers

As described in Sec. 3.3.3, we train four classifiers in hop-1, another three classifiers in hop-2, and one classifier in hop-3. Each classifier takes a long feature vector as the input and makes a soft decision, which is the probability for the face to be male or female. Since the two probabilities add to unity, we only need to record one of them. Then, at the next stage, we feed these eight probabilities into a meta classifier for the final decision. Possible choices of classifier include the Random Forest (RF), the Support Vector Machine (SVM), and Logistic Regression (LR). Although the SVM and RF classifiers often give higher accuracy, they have a larger number of model parameters. Since our interest lies in a smaller model size, we adopt only the LR classifier in our experiments.
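A minimal sketch of this two-stage classification is given below. For brevity, the meta classifier is trained on the training-set soft decisions, whereas in practice held-out predictions would typically be used; function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_facehop_classifiers(region_features, labels):
    """Train one LR classifier per hop/region and a meta LR classifier on their soft decisions.

    region_features: dict mapping a hop/region name to its feature matrix (n_samples x dim)."""
    classifiers, soft_decisions = {}, []
    for name, feats in region_features.items():
        clf = LogisticRegression(max_iter=1000).fit(feats, labels)
        classifiers[name] = clf
        soft_decisions.append(clf.predict_proba(feats)[:, 1])   # record one class probability only
    meta_input = np.stack(soft_decisions, axis=1)                # eight probabilities per sample
    meta = LogisticRegression(max_iter=1000).fit(meta_input, labels)
    return classifiers, meta
```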
3.4 Experiments

In this section, we evaluate the proposed FaceHop gender classification method. We compare the FaceHop solution with a variant of LeNet-5 in terms of model size and classification performance. The reason for choosing LeNet-5 for performance benchmarking is that it is a small model which has been demonstrated to achieve relatively high classification accuracy on gray-scale 32×32 images. The neuron numbers of the modified LeNet-5 model are changed to 16 (1st Conv), 40 (2nd Conv), 140 (1st FC), 60 (2nd FC) and 2 (output). The modification is needed since human faces are more complicated than the handwritten digits in the MNIST dataset. For a fair comparison, we train both models on the same training data, which is obtained by applying the preprocessing and data augmentation steps to the original face images. We use only logistic regression (LR) classifiers in FaceHop due to their small model size.

Datasets. We adopt the following two face image datasets in our experiments.
• LFW dataset [62]: The LFW dataset consists of 13,233 face images of 5,749 individuals, which were collected from the web. There are 1,680 individuals who have two or more images. A 3D-aligned version of LFW [39] is used in our experiments.
• CMU Multi-PIE dataset [52]: The CMU Multi-PIE face dataset contains more than 750,000 images of 337 subjects recorded in four sessions. We select a subset of the 01 session that contains frontal and slightly non-frontal face images (camera views 05_0, 05_1, and 14_0) with all the available expressions and illumination conditions in our experiments.

Data Augmentation. Since both datasets have significantly fewer female images, we use the following two techniques to increase the number of female faces (a brief code sketch is given after this list).
• Flipping the face images horizontally.
• Averaging a female face image with its nearest neighbor in a reduced-dimension space to generate a new female face image. To find the nearest neighbor, we project all female images to a reduced-dimension space, which is obtained by applying PCA and keeping the highest-energy components containing 90% of the total energy. Dimension reduction is conducted to eliminate noise and high-frequency components.
The quality of the augmented female images is checked to ensure that they are visually pleasant. After augmentation, the number of male images is still slightly more than the number of female images.
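The sketch below illustrates the two augmentation techniques, assuming the gray-scale 32×32 female faces are stored in a NumPy array. The 90% energy threshold follows the description above; function names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def augment_female_faces(images):
    """Augment the minority (female) class: horizontal flips plus averaging each face
    with its nearest neighbor in a PCA-reduced space.

    images: array of shape (n, 32, 32) with gray-scale faces."""
    flipped = images[:, :, ::-1]                              # horizontal flip
    flat = images.reshape(len(images), -1).astype(np.float64)
    pca = PCA(n_components=0.90).fit(flat)                    # keep 90% of the total energy
    reduced = pca.transform(flat)
    nn = NearestNeighbors(n_neighbors=2).fit(reduced)
    _, idx = nn.kneighbors(reduced)                           # idx[:, 0] is the image itself
    averaged = (images + images[idx[:, 1]]) / 2.0             # blend with the nearest neighbor
    return np.concatenate([flipped, averaged], axis=0)
```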
Configuration of PixelHop++. The configurations of the PixelHop++ module for the LFW and CMU Multi-PIE datasets are shown in Table 3.1. We list the numbers of intermediate nodes, leaf nodes and discarded nodes at each hop (see Fig. 3.2) in the experiments. In our design, we partition channels into two groups (instead of three) at each hop. That is, they are either discarded or all forwarded to the next hop. As a result, there are no leaf nodes at hop-1 and hop-2.

Table 3.1: Configurations of PixelHop++ for LFW and CMU Multi-PIE.
          | LFW                                          | CMU Multi-PIE
Hop Index | Interm. No. | Leaf No. | Discarded No.       | Interm. No. | Leaf No. | Discarded No.
Hop-1     | 18          | 0        | 7                   | 18          | 0        | 7
Hop-2     | 122         | 0        | 328                 | 117         | 0        | 333
Hop-3     | 0           | 233      | 2,817               | 0           | 186      | 2,739

Feature Vector Dimensions of Varying Hop/Region Combinations. The dimensions of feature vectors of varying hop/region combinations are summarized in Table 3.2. As discussed earlier, hop-1 has four spatial regions, hop-2 has three spatial regions, and all nodes of hop-3 form one feature vector. Thus, there are eight hop/region combinations in total. Since there are spatial correlations in the regions given in Fig. 3.3, we apply PCA to the regional responses collected from all channels and keep the leading components for dimension reduction. We keep 15 components for the LFW dataset and 20 components for the CMU Multi-PIE dataset, respectively. Then, the dimension of each feature vector at hop-1 and hop-2 is the product of 15 (or 20) and the sum of the intermediate and leaf nodes at the associated hop for the LFW (or CMU Multi-PIE) dataset.

Table 3.2: Feature vector dimensions for LFW and CMU Multi-PIE.
Hop/Region        | LFW | MPIE | Hop/Region              | LFW   | MPIE
Hop-1 (left eye)  | 270 | 360  | Hop-2 (upper stripe)    | 1,830 | 2,340
Hop-1 (right eye) | 270 | 360  | Hop-2 (lower stripe)    | 1,830 | 2,340
Hop-1 (nose)      | 270 | 360  | Hop-2 (vertical stripe) | 1,830 | 2,340
Hop-1 (mouth)     | 270 | 360  | Hop-3                   | 233   | 186

Performance and Model Size Comparison for LFW. We randomly partition the male and the original plus augmented female images in the LFW dataset into 80% (for training) and 20% (for testing) subsets separately for each gender, and then mix them again to form the final training and testing sets. This is done to ensure the same gender percentages in training and testing. We train eight individual hop/region LR classifiers and one meta LR classifier for the ensemble. Then, we apply them to the test data to evaluate their performance. We repeat the same process four times to obtain the mean testing accuracy and the standard deviation, and report the testing performance of each individual hop/region classifier in Table 3.3.

The mean testing accuracy ranges from 82.90% (hop-1/nose) to 92.42% (hop-2/vertical stripe). The standard deviation is relatively small. Furthermore, we see that the hop-2 and hop-3 classifiers perform better than the hop-1 classifiers. Based on this observation, we consider two ensemble methods. In the first scheme, called FaceHop I, we fuse the soft decisions of all eight hop/region classifiers with a meta classifier. In the second scheme, called FaceHop II, we fuse the soft decisions of only the four hop/region classifiers from hop-2 and hop-3.

Table 3.3: Performance comparison of each individual hop/region classifier for LFW.
Classifier        | Accuracy (%) | Classifier              | Accuracy (%)
Hop-1 (left eye)  | 86.70 ± 0.65 | Hop-2 (upper stripe)    | 92.25 ± 0.22
Hop-1 (right eye) | 86.14 ± 0.66 | Hop-2 (lower stripe)    | 89.70 ± 0.73
Hop-1 (nose)      | 82.90 ± 0.61 | Hop-2 (vertical stripe) | 92.42 ± 0.56
Hop-1 (mouth)     | 83.42 ± 0.74 | Hop-3                   | 91.22 ± 0.46

The testing accuracy and the model sizes of LeNet-5, FaceHop I and FaceHop II are compared in Table 3.4. FaceHop I and FaceHop II outperform LeNet-5 in terms of classification accuracy by 0.70% and 0.86%, respectively, while their model sizes are only about 33.7% and 22.2% of that of LeNet-5. Clearly, FaceHop II is the favored choice among the three for its highest testing accuracy and smallest model size.

Table 3.4: Performance comparison of LeNet-5, FaceHop I and FaceHop II in accuracy rates and model sizes for LFW.
Method                          | Accuracy (%) | Model Size
LeNet-5                         | 93.77 ± 0.43 | 75,846
FaceHop I (all three hops)      | 94.47 ± 0.54 | 25,543
FaceHop II (hop-2 & hop-3 only) | 94.63 ± 0.47 | 16,895

Performance and Model Size Comparison for CMU Multi-PIE. Next, we show the classification accuracy of each individual hop/region classifier for the CMU Multi-PIE dataset in Table 3.5. Their accuracy values range from 63.02% (hop-1/mouth) to 91.95% (hop-2/upper stripe). It appears that CMU Multi-PIE is more challenging than LFW if we focus on the performance of each individual classifier by comparing Tables 3.3 and 3.5.

We consider the two ensemble schemes as before. FaceHop I uses all eight soft decisions while FaceHop II takes only the four soft decisions from hop-2 and hop-3. The mean accuracy of LeNet-5, FaceHop I and FaceHop II is compared in Table 3.6.
It is interesting to see that FaceHop I and II achieve slightly better ensemble results on CMU Multi-PIE than on LFW. The performance of LeNet-5 also increases from 93.77% (LFW) to 95.08% (CMU Multi-PIE). As far as the model size is concerned, the model sizes of FaceHop I and FaceHop II are about 38.4% and 23.2% of that of LeNet-5, respectively. Again, FaceHop II is the most favored solution among the three for its highest testing accuracy and smallest model size.

Table 3.5: Performance comparison of each individual hop/region classifier for CMU Multi-PIE.
Classifier        | Accuracy (%) | Classifier              | Accuracy (%)
Hop-1 (left eye)  | 79.33 ± 0.33 | Hop-2 (upper stripe)    | 91.95 ± 0.18
Hop-1 (right eye) | 78.64 ± 0.25 | Hop-2 (lower stripe)    | 87.00 ± 0.15
Hop-1 (nose)      | 65.19 ± 0.36 | Hop-2 (vertical stripe) | 91.34 ± 0.22
Hop-1 (mouth)     | 63.02 ± 0.41 | Hop-3                   | 84.55 ± 0.77

Table 3.6: Performance comparison of LeNet-5, FaceHop I and FaceHop II in accuracy rates and model sizes for CMU Multi-PIE.
Method                            | Accuracy (%) | Model Size
LeNet-5                           | 95.08        | 75,846
FaceHop I (all three hops)        | 95.09 ± 0.24 | 29,156
FaceHop II (hop-2 and hop-3 only) | 95.12 ± 0.26 | 17,628

3.5 Conclusion and Future Work

A light-weight low-resolution face gender classification method, called FaceHop, was proposed. This solution finds applications in resource-constrained environments with limited networking and computing resources. FaceHop has several desired characteristics, including a small model size, a small training data amount, low training complexity, and low-resolution input images. The effectiveness of the FaceHop method for gender classification was demonstrated by experiments on two benchmarking datasets.

In this Chapter, we demonstrated the potential of the SSL principle for effective feature extraction from face images. As to future work, we would like to test more datasets for gender classification and also extend the SSL principle to identifying heterogeneous and correlated face attributes such as gender, age, and race. It is particularly interesting to develop a multi-task learning approach. Furthermore, it will be desirable to work on high-resolution face images and see whether we can obtain significant performance improvements using the SSL principle in classification accuracy, computational complexity, and memory usage.

Chapter 4
Low-Resolution Face Recognition In Resource-Constrained Environments

4.1 Introduction

Deep-learning-based face recognition has reached maturity in recent years. DNN models consisting of millions of model parameters have been developed and have made significant progress. However, their promising performance mainly relies on several factors: higher input image resolutions, an extremely large number of training images, and abundant computational/memory resources. For example, DeepFace [148] was trained on a collection of photos from Facebook that contains 4.4M images. FaceNet [130] was trained on the Google dataset that contains 500M images. SphereFace [99] was trained on the CASIA-WebFace dataset [173] that contains 0.49M images. Along this direction, many activities are centered on face image collection and the setup of the required computing and communication environment.

We may face the opposite situation in some real-world applications, i.e., edge or mobile computing in resource-constrained environments with a poor computing and communication infrastructure, as is often the case in field and operational settings.
Such environments demand a smaller model size, fewer labeled images for training, lower training and inference complexity, and lower input image resolution, partly due to the need to identify individuals at farther standoff distances. Due to these stringent requirements, DNNs may not be suitable. The goal of this work is to address these challenges by developing a transparent non-parametric model that allows a graceful tradeoff between resources and performance and is capable of being easily integrated with active learning to minimize the number of training samples while achieving relatively high accuracy. We adopt an emerging machine learning system called PixelHop++ [24] to achieve these objectives. PixelHop++ is designed based on the SSL principle and has several unique characteristics that fit our objectives well.
1. PixelHop++ is a lightweight non-parametric model whose size can be flexibly adjusted for a graceful performance tradeoff. It is trained in a feedforward one-pass manner, and the training complexity is significantly lower than that of DNNs.
2. SSL adopts a statistics-centric principle. It is a mathematically transparent approach which exploits pixel-to-pixel correlations for dimension reduction to derive image features. It also analyzes statistics between features and labels to identify discriminant features.
3. We incorporate active learning in SSL to select the most “informative” samples of a dataset for labeling and reduce the labeling cost. PixelHop++ is a lightweight model, and it can easily be integrated with active learning.

The main contribution of our work lies in the assembly of two effective tools to address the challenge of face recognition in resource-constrained environments. Both PixelHop++ and active learning are existing tools. Yet, to the best of our knowledge, this is the first time that they are jointly applied to a face biometric problem. We will demonstrate the power of the integrated solution in the context of face recognition with extensive experiments. As the second contribution, we propose a pairwise feature generation module to extract effective joint features from the PixelHop++ output channels for each pair of face images.

The rest of the Chapter is organized as follows. Related prior work is reviewed in Sec. 4.2. The proposed face recognition method and its integration with active learning are presented in Sec. 4.3 and Sec. 4.4, respectively. Experimental results are provided in Sec. 4.5. Finally, concluding remarks and future work are given in Sec. 4.6.

4.2 Related Work

Face Recognition. Face recognition has made significant progress in recent years. Most successful face recognition models use deep learning techniques [148, 130, 99, 64] and offer high accuracy on the benchmarking datasets. Although these models are powerful for high-resolution face recognition, they usually contain tens of millions of model parameters and require a large amount of training data and computation resources. Recently, several lightweight CNN models have been proposed [169, 21, 96] which are significantly smaller than regular CNN networks, but they still have a very large number of model parameters.

Low-Resolution Face Recognition. In comparison with high-resolution face recognition, less attention has been drawn to low-resolution face recognition. Generally, there are two different settings for this problem: high-resolution to low-resolution, and low-resolution to low-resolution.
In the first setting, the low-resolution probe images are compared against high-resolution gallery images [12, 112, 106], while in the second setting both the probe and gallery images are low-resolution face images [27, 44, 43, 45]. It is also possible for a proposed model to be evaluated under both settings [77]. We evaluate the effectiveness of LRFRHop under the second setting.

Successive Subspace Learning (SSL). The main technique in subspace learning is to project a high-dimensional input space to a low-dimensional output subspace, which serves as an approximation to the original one. When these dimension reduction operations are performed sequentially, this leads to successive subspace learning (SSL). PixelHop++ [24] is the latest SSL-based model proposed for unsupervised representation learning on images by applying the channel-wise (c/w) Saab transform to them. The Saab transform [89] is a multi-stage variant of Principal Component Analysis (PCA) conducted on images. In each stage, it applies PCA to pixel blocks and also uses a bias term to avoid the sign-confusion problem [86]. The performance of the Saab transform can be further enhanced by removing the spatial correlation between Saab transform outputs in the current stage so that the Saab transform can be applied to each output channel separately in the next stage. PixelHop++ with the c/w Saab transform has been demonstrated to offer an effective multi-stage representation. For example, FaceHop [124] is a recently proposed model for gender classification which leverages PixelHop++ for feature learning on gray-scale face images. In this Chapter, in addition to incorporating the chrominance channels of face images, we add a new module for effective pairwise feature generation to tackle the face recognition problem.

Active Learning. Active learning is used to select the most informative unlabeled data for labeling so that almost the same accuracy can be reached with the smallest amount of labeled data. An active learning method begins with a small amount of labeled data to train a machine learning model. The model queries the labels of some unlabeled data based on a query strategy, and the model is retrained with all labeled data. This process is repeated until we reach the labeled-sample budget. Common query strategies include the entropy method [136], the query-by-committee method [133] and the core-set method [131]. Active learning was incorporated in DNNs by Wang et al. [158]. An active annotation and learning framework is introduced in [172] for the face recognition task.

Figure 4.1: The block diagram of the proposed face recognition model.

4.3 Proposed Face Recognition Method

The block diagram of LRFRHop is depicted in Fig. 4.1. As shown in the figure, the proposed model is an ensemble of two submodels, each of which consists of three components: PixelHop++, pairwise feature generation, and a classifier. The Y channels of each face pair are fed into one designated submodel, while the CrCb channels are fed into the other one.
Each submodel generates a probability score, and finally a meta classifier ensembles their predictions. We elaborate on each component of the block diagram below.

4.3.1 PixelHop++

In each submodel, we use a three-level PixelHop++ system, similar to the system shown in Fig. 4.2. The input to each of the PixelHop++ systems is a whole face image of size 32×32×K_0, where K_0 = 1 and 2 for the Y channel and the CrCb channels, respectively. Each level of a PixelHop++ system has one “PixelHop++ unit” followed by a (2×2)-to-(1×1) max-pooling layer. In LRFRHop, the PixelHop++ unit of each level operates on blocks of 5×5 pixels with a stride of one.

Figure 4.2: Illustration of data flow in the three-level c/w Saab transform in PixelHop++, which provides a sequence of successive subspace approximations (SSAs) to the input image (spatial resolutions 32×32 → 28×28 → 14×14 → 10×10 → 5×5 → 1×1; nodes are marked as intermediate, leaf, or discarded).

In the training phase of each PixelHop++ unit, we collect sample blocks from each input channel to derive Saab kernels for that channel separately. Then, we project each block onto the derived kernels and generate a set of responses for the central pixel in the block. Since the responses can be positive or negative, we add a constant bias to all responses to ensure that they are all non-negative, which explains the name “successive approximation with adjusted bias (Saab) transform” [89]. The first Saab kernel is the unit-length constant-element vector that computes the local mean of each block, which is called the DC component. After removing the DC component, we apply PCA to the residuals. Since each block has 25 dimensions, we obtain one DC component plus 24 AC components for each channel, whose kernels are the eigenvectors of the covariance matrix of the collected blocks. In each level, the components generated by each kernel for the blocks of each channel form an output channel (shown by a node in the tree in Fig. 4.2); e.g., the DC components of the blocks extracted from an input channel form one output channel/node. We divide these nodes into three groups:
• Intermediate nodes: the DC channel and several leading low-frequency AC channels, which are forwarded to the next level for further energy compaction.
• Leaf nodes: the nodes which are kept at the current level.
• Discarded nodes: AC components with very small eigenvalues, which are discarded.

As mentioned, in each PixelHop++ unit we apply a channel-wise (c/w) transform to the pixel blocks; in other words, an individual Saab transform is applied to the pixel blocks extracted from each individual input channel. The reason we can process channels individually is that all AC channels are orthogonal to the DC channel and all AC responses are uncorrelated due to PCA. Note that the first PixelHop++ unit of M_CrCb is an exception (the Cr and Cb channels are not uncorrelated), so we apply the Saab transform to blocks of 5×5×2 pixels at this level and obtain 1 DC component and 49 AC components for each pixel block. We should emphasize that channel separability is powerful in reducing our model size and computational complexity. Unlike DNNs, PixelHop++ does not transform one large 3D (i.e., 2D spatial plus 1D spectral) tensor but multiple 2D spatial tensors.

Each level of PixelHop++ provides an approximation to the input with a different spatial-frequency tradeoff. The input is a pure spatial representation. The output of level-3 is a pure frequency representation, and the outputs of level-1 and level-2 are hybrid spatial/frequency representations. A square in Fig. 4.2 indicates a channel, which is the union of all corresponding spatial locations. Before the max-pooling layer, the dimensions of the output channels of level-1 and level-2 are 28×28 and 10×10, respectively. Clearly, level-1 has more spatial detail than level-2.
The spatial dimension of level-3 is 1×1. Each intermediate/leaf node at a level indicates a frequency channel at the corresponding level. For the channels at each level, we need two hyper-parameters to partition them into the three groups mentioned above. We use the energy level of a channel as the criterion. If its energy is less than a cutoff energy denoted by E_C, the channel is discarded. If its energy is higher than the forwarded energy threshold denoted by E_F, the channel is forwarded to the next level. We normalize the energy of the root node (i.e., the input image) to 100%. The energy level of each intermediate/leaf node is computed as follows.
• Step 1: Initial DC and AC energy computation for each PixelHop++ unit. Each eigenvalue of the covariance matrix indicates the energy of its corresponding node. We define the initial energy (E_init) of each output node as the ratio of its corresponding eigenvalue to the sum of all eigenvalues related to that PixelHop++ unit. At this step, the sum of the DC and total AC energy values for each PixelHop++ unit is 100%.
• Step 2: Normalized energy at each node. Based on the first step, we have the energy of an intermediate/leaf node relative to its siblings. By traversing the tree from the root node to a leaf node, the path includes intermediate nodes. The normalized energy of a leaf node against the root node is the product of the E_init values of all visited nodes (including itself).
Note that by lowering E_C and E_F, we can obtain a better approximation at the cost of higher model complexity.

4.3.2 Pairwise Feature Generation

To compare whether two face images are similar or not, we can examine their similarities at different spatial regions, channels, and levels. This is feasible because of the rich representations offered by PixelHop++. Note that although the content of an intermediate node is mainly forwarded to the next level, it does have different spatial/spectral representations at two adjacent levels. Thus, for feature extraction, we do not differentiate intermediate/leaf nodes at each level. There are K_1, K_2 and K_3 nodes at level-1, level-2 and level-3, respectively, as shown in Fig. 4.2. We process them individually as detailed below.
• Level-1. It has the highest spatial resolution, i.e., 28×28. We can use it to zoom in on salient regions of the face such as the eyes, nose, and mouth at each channel (see Fig. 4.3(a)) and flatten them to form vectors. Accordingly, we extract 4 feature vectors from each node at this level (4×K_1 feature vectors).
• Level-2. It has a lower spatial resolution (10×10). We can still zoom in on unions of salient regions, such as one horizontal stripe covering the two eyes and one vertical stripe covering the nose and the mouth, at each channel (see Fig. 4.3(b)) and form 2 vectors per node accordingly (2×K_2 feature vectors).
• Level-3. It has no spatial resolution. Each leaf node offers a scalar description of the whole face. In our implementation, we concatenate all K_3 nodes into a long sequence and then group every 10 nodes as one vector. Consequently, we obtain P feature vectors, where P is the floor of K_3 divided by 10.

To compare the similarity between two face images, we collect corresponding vector pairs from the same spatial region of the same node (including intermediate and leaf nodes) at each level and compute two similarity measures for them - the cosine similarity (C_k) and the length ratio (R_k) for the k-th vector pair. We define the length ratio of two vectors as the ratio of the smaller L2 norm to the larger L2 norm. If two images are similar, their cosine similarity and length ratio should be close to unity. Otherwise, they should be farther away from (and less than) one. We observe experimentally that an individual R_k value is not as discriminant as C_k. Instead, the average length ratio over each spatial region in level-1 and level-2, as shown in Fig. 4.3, and the average length ratio over the P pairs in level-3 are more robust and discriminant.
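The two measures can be computed as in the sketch below. For brevity, the sketch averages the length ratios over all vector pairs at once, whereas the model described above averages them per spatial region and per level; function names are illustrative.

```python
import numpy as np

def pair_similarity(v1, v2, eps=1e-12):
    """Cosine similarity and length ratio for one pair of corresponding vectors."""
    n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
    cosine = float(v1 @ v2) / (n1 * n2 + eps)
    ratio = min(n1, n2) / (max(n1, n2) + eps)     # ratio of the smaller to the larger L2 norm
    return cosine, ratio

def pairwise_features(vectors_a, vectors_b):
    """Build a pairwise feature vector for two images from their corresponding
    region/channel vectors: all cosine similarities plus one average length ratio."""
    cosines, ratios = zip(*(pair_similarity(a, b) for a, b in zip(vectors_a, vectors_b)))
    return np.concatenate([np.array(cosines), [np.mean(ratios)]])
```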
Figure 4.3: Illustration of the selected spatial regions of interest (ROIs) with respect to the input image for frequency channels at (a) level-1 and (b) level-2.

To summarize, the ultimate feature vector to be fed to the binary classifier is the concatenation of: 1) 7 average length ratio values (4 extracted from level-1, 2 from level-2, and 1 from level-3), 2) 4×K_1 cosine similarity values from four spatial regions and K_1 channels, 3) 2×K_2 cosine similarity values from two spatial regions and K_2 channels, and 4) P cosine similarity values from groups of channels in level-3. The feature dimension (N) of each submodel is given in Table 4.1.

4.3.3 Classifiers

As described in Sec. 4.3.2, we extract the feature vector from the Y channel and from the CrCb channels of each pair separately. For each pairwise feature vector, we train a classifier which makes a soft decision, i.e., the probability of the pair being a match or a mismatch. Then, we feed these two probabilities into a meta classifier for the final decision. We use the Logistic Regression (LR) classifier in our experiments to achieve a smaller model size, although higher accuracy might be achieved with larger binary classifiers.

4.4 Integration with Active Learning

The feature generation process in LRFRHop is unsupervised, and labels are only needed for classifier training. As a motivation for active learning, a scenario of interest is training a model on a mobile agent where the model has access to unlabeled training samples locally but has to fetch the labels of a limited number of samples from a server through unreliable low-bandwidth channels. To overcome the communication constraint, active learning can be used to retrieve the labels of the most informative samples. We consider three active learning methods, explained below; a brief code sketch of the first two query criteria follows the list.
1. Entropy method: In each iteration, the entropy of each sample in the pool of unlabeled data D_u is computed, and the samples with the highest entropy are picked and added to the labeled training data. Higher entropy means the model is more uncertain about those samples. The entropy of a sample x can be computed as
entropy(x) = -\sum_{j=1}^{J} p(y_j | x) \log p(y_j | x),   (4.1)
where p(y_j | x) is the probability that sample x belongs to class y_j.
2. Query By Committee (QBC): Instead of training one model, this method trains several models, called a committee. In each iteration, the disagreement between the committee members on each sample in D_u is measured, and the samples with the largest disagreement values are picked. One common disagreement measure is the vote entropy, which can be computed as
VE(x) = -\sum_{j=1}^{J} \frac{V(y_j)}{C} \log \frac{V(y_j)}{C},   (4.2)
where C is the number of models and V(y_j) is the number of votes that label y_j receives from the committee members.
3. Core-set method: The core-set selection problem is to choose b sample points from the pool of unlabeled data that minimize the maximum distance between each data point remaining in the pool and its nearest data point in the selected subset. This problem is NP-hard. A greedy approach for core-set selection is the k-Center-Greedy algorithm. In the i-th iteration, it selects the samples from D_u with the maximum distance to their closest sample in the labeled training data D_l of iteration i−1.
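The sketch below illustrates the entropy and vote-entropy query criteria, assuming scikit-learn-style classifiers with predict_proba/predict methods; the batch size and function names are illustrative assumptions.

```python
import numpy as np

def entropy_query(model, X_unlabeled, batch_size):
    """Pick the unlabeled samples whose predictive distribution has the highest entropy."""
    probs = model.predict_proba(X_unlabeled)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-batch_size:]

def vote_entropy_query(committee, X_unlabeled, batch_size, num_classes=2):
    """Query-by-committee: pick the samples on which the committee members disagree most."""
    votes = np.stack([m.predict(X_unlabeled) for m in committee], axis=1)   # (n, C) label votes
    C = len(committee)
    vote_entropy = np.zeros(len(X_unlabeled))
    for j in range(num_classes):
        frac = (votes == j).sum(axis=1) / C          # fraction of votes for class j
        nonzero = frac > 0
        vote_entropy[nonzero] -= frac[nonzero] * np.log(frac[nonzero])
    return np.argsort(vote_entropy)[-batch_size:]
```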
4.5 Experiments

We evaluate the performance of the proposed method by conducting experiments on two well-known datasets: Labeled Faces in the Wild (LFW) [62] and CMU Multi-PIE [52]. In all experiments, we use low-resolution face images of size 32×32 unless otherwise specified.

The LFW dataset is a widely used dataset for face verification. It consists of 13,233 face images of 5,749 individuals. For performance benchmarking of different face verification models, it provides 6,000 face pairs in 10 splits. We follow the “Image-Restricted, Label-Free Outside Data” protocol. We choose this protocol because we use a tool for facial landmark localization in the preprocessing step, but for training the model we only use face images in the LFW training set. A 3D-aligned version of LFW [39] is used in the experiments. For data augmentation, we add the pair of horizontally flipped images of each training pair to the training data.

The CMU Multi-PIE dataset contains more than 750,000 images of 337 people recorded in four sessions. For each identity, images are captured under 15 viewpoints and 19 illumination conditions with a few different facial expressions. In the experiments, we select a subset of the 01 session which contains frontal and slightly non-frontal face images (camera views 05_0, 05_1, and 14_0) with a neutral expression and under all existing illumination conditions.

Preprocessing. Several commonly used face processing techniques are adopted in the preprocessing step, namely:
1. Applying a 2D face alignment algorithm to the input face images to reduce the effect of pose variation;
2. Cropping the face images properly to eliminate the background;
3. Using histogram equalization (HE) to reduce the effect of different illumination conditions on the Y channel.

We use the dlib [80] toolkit for face detection and landmark localization. Face images are aligned/normalized so that the line connecting the eye centers is horizontal, and all faces are centered and resized to a constant size of 32×32 pixels. We convert the color space to YCrCb and then feed the Y channel (the luminance component) to one designated submodel and the Cr and Cb channels (chrominance components) to the other submodel.
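A minimal sketch of the last preprocessing steps is given below, assuming an already-aligned and cropped BGR face produced by the dlib-based alignment stage; OpenCV is used here only for illustration.

```python
import cv2
import numpy as np

def preprocess_face(aligned_bgr_face):
    """Prepare one aligned, cropped face for the two submodels: resize to 32x32,
    convert to YCrCb, equalize the Y channel, and split the channels."""
    face = cv2.resize(aligned_bgr_face, (32, 32), interpolation=cv2.INTER_AREA)
    ycrcb = cv2.cvtColor(face, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    y = cv2.equalizeHist(y)                             # histogram equalization on luminance only
    return y[..., None], np.stack([cr, cb], axis=-1)    # inputs for M_Y and M_CrCb, respectively
```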
Table 4.1: Comparison of the test accuracy of C_Y and C_CrCb and the hyper-parameter settings of M_Y and M_CrCb, where K_1, K_2, and K_3 are the numbers of intermediate and leaf nodes at level-1, level-2, and level-3, P is the number of vectors at level-3, and N = 7 + 4K_1 + 2K_2 + P is the feature dimension.
Input ch. | E_C    | K_1 | K_2 | K_3 | P  | N   | Acc. (%)
Y         | 0.0005 | 18  | 119 | 233 | 23 | 340 | 83.47
CrCb      | 0.0004 | 19  | 73  | 124 | 12 | 241 | 75.89

4.5.1 Face Verification on LFW

Experiment #1. We use the first 90% of the LFW training pairs for training the model and the rest of the pairs for testing. We set E_C = E_F so that the number of leaf nodes in the first and second hops is zero, and study the effect of changing this hyper-parameter on the test accuracy of each submodel. As shown in Figs. 4.4 and 4.5, by increasing the energy threshold, the accuracy initially increases and then decreases gracefully while the model size decreases substantially, demonstrating that this architecture offers a trade-off between accuracy and model size. For M_Y we select E_C = E_F = 0.0004, which gives the best accuracy while having a reasonable model size. For M_CrCb, although E_C = E_F = 0.0001 gives the best accuracy, we select E_C = E_F = 0.0005 because, compared with the operating point of the highest accuracy, it has a considerably smaller model size (3× smaller) and gives only slightly lower accuracy.

Figure 4.4: The relation between the test accuracy (%) and the energy threshold in M_Y, where the number of model parameters in M_Y is shown at each operational point.

Figure 4.5: The relation between the test accuracy (%) and the energy threshold in M_CrCb, where the number of model parameters in M_CrCb is shown at each operational point.

Using these hyper-parameters, the parameters of each PixelHop++ block in LRFRHop and the accuracy of the classifiers trained on their output features are given in Table 4.1. The test accuracy of the meta classifier (C_M) under this setting is 85.33%. We use the determined hyper-parameters in all experiments.

Experiment #2. To compare LRFRHop with state-of-the-art models, we compute the 10-fold cross-validation accuracy for input resolutions of 32×32 and 16×16 pixels. To obtain the 16×16 face images, we down-sample the images to 16×16 and then resize them back to 32×32. The quality of the obtained face images is compared in Fig. 4.6. To the best of our knowledge, there is no low-resolution to low-resolution face recognition model that has reported accuracy under the “Image-Restricted, Label-Free Outside Data” protocol on LFW. To this end, we compare our model with SKD (Selective Knowledge Distillation) [44] and BD (Bridge Distillation) [45], which are two state-of-the-art low-resolution face recognition models that use extensive training data (under the “Unrestricted With Labeled Outside Data” protocol) and are evaluated for input resolutions of 32×32 and 16×16 pixels. We compare LRFRHop with SKD without distillation, SKD, BD without distillation, and BD in terms of accuracy, the number of parameters, and the number of training images for input resolutions of 16×16 and 32×32 in Tables 4.2 and 4.3, respectively.

According to Table 4.2, for the case of 16×16 resolution, LRFRHop achieves an accuracy of 82.16%, which is only 3.71% and 3.72% lower than SKD and BD, respectively, while its model size is about 79× smaller than SKD and 21× smaller than BD, and it uses only 5,400 pairs of images as the training set. On the other hand, a model pre-trained on the VGGFace dataset [113] with 2.6M images is used as the teacher model of SKD, and its student model is trained and fine-tuned on the UMDFaces dataset [10] with 367,888 images. BD uses an ensemble of two models as the teacher model: one is trained on the CASIA-WebFace dataset [173] with 0.49M images and the other is trained on the VGGFace2 dataset [17] with 3.31M images. The student model of BD is trained on UMDFaces. Also, note that LRFRHop outperforms SKD without distillation and BD without distillation.
Figure 4.6: Quality of the obtained 32×32 (b) and 16×16 (c) low-resolution face images compared with the original high-resolution face image (a).

Table 4.2: Face verification results on LFW for 16×16 images.
Model                    | #Param. | Training Set (#Img.)                                          | Acc. (%)
SKD without distillation | 0.79 M  | UMDFaces (367,888)                                            | 62.82
SKD                      | 0.79 M  | VGGFace (2.6 M), UMDFaces (367,888)                           | 85.87
BD without distillation  | 0.21 M  | UMDFaces (367,888)                                            | 81.26
BD                       | 0.21 M  | VGGFace2 (3.31 M), CASIA-WebFace (0.49 M), UMDFaces (367,888) | 85.88
LRFRHop                  | 0.01 M  | LFW (5,400 pairs)                                             | 82.16

According to Table 4.3, for the case of 32×32 resolution, LRFRHop achieves an accuracy of 83.53%, which is only 6.19% and 5.69% lower than SKD and BD, respectively, while its model size is considerably smaller and it uses a much smaller training set, as mentioned before. Based on these results, LRFRHop has a competitive performance for deployment in resource-constrained environments.

FSLR (Fewer-Shots and Lower-Resolutions) [43] is another state-of-the-art low-resolution face recognition model, which is designed and optimized exclusively for 16×16 images. It has achieved state-of-the-art verification accuracy on 16×16 LFW images, but its model size is 11× larger than that of LRFRHop. Regarding the training set, a teacher model pre-trained on a massive dataset containing 3.8M face images is used to train the student of FSLR on UMDFaces.

Table 4.3: Face verification results on LFW for 32×32 images.
Model                    | #Param. | Training Set (#Img.)                                          | Acc. (%)
SKD without distillation | 0.79 M  | UMDFaces (367,888)                                            | 70.23
SKD                      | 0.79 M  | VGGFace (2.6 M), UMDFaces (367,888)                           | 89.72
BD without distillation  | 0.21 M  | UMDFaces (367,888)                                            | 85.8
BD                       | 0.21 M  | VGGFace2 (3.31 M), CASIA-WebFace (0.49 M), UMDFaces (367,888) | 89.22
LRFRHop                  | 0.01 M  | LFW (5,400 pairs)                                             | 83.53

Figure 4.7: Comparison of the classification accuracy of the three active learning methods (entropy, core-set, and QBC) as a function of the number of training pairs.

Experiment #3. In this experiment, we use the three active learning methods introduced in Sec. 4.4 to obtain the minimum required number of labeled training pairs without a significant loss in recognition accuracy. We use the first 90% of the LFW training pairs as the initial pool of unlabeled data (D_u) and the rest as the test data; then we randomly select 5% of the pool as D_0. The data augmentation step is removed. For the QBC algorithm, we use two different classifiers (LR and SVM) as the committee of classifiers. We apply each active learning algorithm to the samples and compute the test accuracy versus the number of labeled training pairs used for training the model. The result is shown in Fig. 4.7. By comparing the performance of the different algorithms, we see that QBC is the most effective one for LRFRHop in the current experimental setting, as it has the fastest convergence in test accuracy and also the most stable performance. For example, with this algorithm, LRFRHop can achieve an accuracy above 84% using only 1,465 training pairs and a stable accuracy of about 86% with only 3,000 pairs.
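For completeness, the third query strategy (the k-Center-Greedy rule of the core-set method in Sec. 4.4) can be sketched as below. This is a brute-force version operating on precomputed pairwise feature vectors, intended only to illustrate the greedy rule; names and the distance metric are illustrative.

```python
import numpy as np

def k_center_greedy(features, labeled_idx, budget):
    """Greedy core-set selection: repeatedly pick the unlabeled sample that is farthest
    from its closest already-labeled sample.

    features: (n, d) feature matrix; labeled_idx: indices of D_l; budget: number of samples to add."""
    selected = list(labeled_idx)
    # distance from every sample to its nearest labeled sample
    min_dist = np.min(
        np.linalg.norm(features[:, None, :] - features[selected][None, :, :], axis=2), axis=1)
    for _ in range(budget):
        new = int(np.argmax(min_dist))                  # farthest-from-labeled sample
        selected.append(new)
        dist_to_new = np.linalg.norm(features - features[new], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)    # update nearest-selected distances
    return selected[len(labeled_idx):]                  # the newly queried indices
```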
Table 4.4: Rank-1 identification rate (%) for frontal and slightly non-frontal face images (±15°) in Setting-1.
Method | Resolution | Acc. (%)
CPF [174] | 60×60 | 89.45
HPN [36] | 256×220 | 84.23
c-CNN Forest [171] | - | 96.97
Light CNN-29 [169] | 128×128 | 99.78
TP-GAN [63] | 128×128 | 99.78
LRFRHop | 32×32 | 89.48

4.5.2 Face Identification on CMU Multi-PIE

We evaluate the effectiveness of LRFRHop for the face identification task by conducting experiments on the Multi-PIE dataset. To the best of our knowledge, no low-resolution face recognition model has reported results on this dataset. Thus, we compare our model with high-resolution face recognition models that have used this dataset for performance benchmarking. We use the setting referred to in [174] as Setting-1. According to this setting, the first 150 identities are used for training and the remaining identities (151-250) from session 01 with neutral expression are used for testing, so that there is no overlap between training and test identities. One frontal face image with frontal illumination is used as the gallery image, and the rest of the images in the test set are selected as probes. For model training, we generate 17,888 training pairs by pairing probe images with gallery images and 21,000 training pairs by randomly pairing images with each other. Overall, 38,888 training pairs are used to train LRFRHop.

Since our model is not designed to handle extreme pose variation, we report results only for frontal and slightly non-frontal images (±15°) in the dataset. The rank-1 face identification accuracy is shown in Table 4.4. LRFRHop delivers competitive and comparable performance even though it uses low-resolution face images, a small training set, and a small model. For example, Light CNN-29 has 12.637 M parameters and uses 128×128 images, while our model is 1,149× smaller and is trained on 32×32 images. It is worth noting that TP-GAN in Table 4.4 also uses Light CNN-29 for feature extraction from images, so it is larger than Light CNN-29.

Table 4.5: The number of parameters of each component in LRFRHop.
Subsystem | Num. of Param.
M Y | 451 + 2,543 + 2,969
Pairwise Feat. Gen. - Y | 740
LR Classifier - C Y | 341
M CrCb | 476 + 1,369 + 1,348
Pairwise Feat. Gen. - CrCb | 432
LR Classifier - C CrCb | 242
Meta Classifier | 3
Total | 10,914

4.5.3 Model Size Computation and Time Complexity

We compute the size of LRFRHop based on the information provided in Table 4.1. Each PixelHop++ system in LRFRHop has three levels, and the c/w Saab transform is applied in each level. The first level of the M Y system has 18 intermediate nodes and 7 discarded nodes (25-18). Thus, it has 25×18 variables for storing PCA components and one bias parameter (451 parameters in total). The second level has 119 intermediate nodes and 25×18-119 discarded nodes. For each of the 18 output channels of level one, the Saab transform is applied separately. Initially, 25×18 nodes are generated, and 119 of them are then selected as intermediate nodes based on the energy threshold. For each Saab transform, the DC kernel is known and constant, so we have 18 repetitive kernels in the second level. As a result, we have 25×(119-18) parameters for storing PCA components and 18 bias parameters, which is 2,543 parameters in total. By the same token, there are 25×(233-119)+119 parameters in the third level of M Y. In the Pairwise Feature Generator - Y unit, the mean and the standard deviation of each PixelHop++ output node are stored. Hence, the size of the unit is 2×(18+119+233), equal to 740. The LR classifier's size equals its input feature size plus one, which is 341 for C Y.
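The level-by-level counting above follows a simple closed form: stored AC kernels times the flattened patch dimension, plus one bias per parent channel (or a single bias in the first level). A short sketch that reproduces the M Y numbers in Table 4.5 is given below; the helper name and its interface are illustrative, and the node counts are the ones quoted in the text.

def saab_level_params(kernel_dim, kept_nodes, prev_kept):
    """Parameters stored at one c/w Saab level.

    kernel_dim: flattened patch size (a 5x5 window gives 25)
    kept_nodes: intermediate nodes kept at this level
    prev_kept:  intermediate nodes of the previous level (their DC kernels are
                constant, so only AC kernels are stored, plus one bias per parent)
    """
    if prev_kept == 0:                           # first level: one Saab unit, one bias
        return kernel_dim * kept_nodes + 1
    return kernel_dim * (kept_nodes - prev_kept) + prev_kept

levels_MY = [(25, 18, 0), (25, 119, 18), (25, 233, 119)]
print([saab_level_params(*lv) for lv in levels_MY])    # -> [451, 2543, 2969]

Adding the pairwise feature generators (two statistics per kept node), the LR classifiers (input size plus one), the M CrCb levels, and the 3-parameter meta classifier reproduces the 10,914 total reported in Table 4.5.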
Likewise, the size of all subsystems is computed and summarized in Table 4.5. 54 Training LRFRHop for experiment #1 takes about 20 minutes and its inference time is 103 milliseconds per image pair. The hardware platform for training and inference time measurement was Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz. 4.6 Conclusion and Future Work A lightweight data-efficient low-resolution face recognition model for resource-constrained envi- ronments, called LRFRHop, was proposed in this Chapter. We plan to develop a similar method- ology for face ethnicity, age, and gender recognition in resource-constrained environments. Fur- thermore, occluded face recognition [42, 93] and extreme pose variation in resource-constrained environments are two challenging problems, and the generalization of LRFRHop to these chal- lenges is a valuable future research topic. 55 Chapter 5 BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis 5.1 Introduction Computer-Aided Diagnosis (CADx) [46] systems could provide valuable benefits for disease di- agnosis including but not limited to improving the quality and consistency of the predictions and reducing medical mistakes as they are not subject to human error. Although most existing studies focus on diagnosis based on medical images such as chest X-ray (CXR) images [8, 4, 2], the radi- ology reports often contain substantial information (e.g. patient history and previous studies) that are difficult to be detected from the image alone. Besides, diagnosis from both image and text is more closely aligned with disease diagnosis by human experts. Therefore, V&L models that take both images and text as input can be potentially more accurate for CADx and several attempts have been made in this direction [164, 178, 98]. However, the shortage of annotated data in the medical domain makes utilizing V&L models challenging. Annotating medical data is an expensive process as it requires human experts. Al- though a couple of recent large-scale auto-labeled datasets have been provided for some medical tasks, e.g., chest X-ray [163, 15, 74], they are often noisy (low-quality) and degrade the perfor- mance of models. Besides, such datasets are not available for most medical tasks. Therefore, training V&L models with limited annotated data remains a key challenge. 56 . . . . . . . . . The cardiac and mediastinal contours are within normal limits. The lungs are well-inflated and clear. There is an 8mm nodule in the left lower lobe, XXXX calcified granuloma … Chest X-ray image Text report Atelectasis Effusion Pneumonia Nodule Emphysema Fibrosis PT . . . Figure 5.1: An overview of BERTHop. BERTHop takes X-ray image and clinical report as input. It first encodes the image and text and extracts potential features from both modalities. Then a transformer-based model learns the associations between these two modalities. By applying appropriate vision and text extractor, the model is capable to identify the abnormality and associate it with the text labels. Recently, pre-trained V&L models have been proposed for reducing the amount of labeled data required to train an accurate downstream model [95, 149, 145, 104] in the general domain. These models are first trained on large-scale image caption data with self-supervision signals (e.g., using masked language model loss) to learn the association between objects and text tokens. Then, the pre-trained V&L models are used to initialize the downstream models and fine-tuned on the target tasks. 
In most V&L tasks, it has been reported that V&L pre-training is a major source of performance improvement. However, we identify a key problem in applying common pre-trained V&L models for the medical domain: the large domain gap between the medical (target) and the general domain (source) makes such pre-train and fine-tune paradigm considerably less effective in the medical domain. Therefore, domain-specific designs have to be applied. Notably, V&L models mainly leverage object-centric feature extraction methods such as Faster R-CNN [120] which is pre-trained on general domain to detect everyday objects, e.g., cats, and dogs. However, the abnormalities in the X-ray images do not resemble everyday objects and will likely be ignored by a general-domain object detector. 57 To overcome this challenge, we propose BERTHop, a transformer-based V&L model designed for medical applications. BERTHop resolves the domain gap issue by leveraging pre-training language encoder, BlueBERT [115], a BERT [35] variant that has been trained on biomedical and clinical datasets. Furthermore, in BERTHop, the visual encoder of the V&L architecture is redesigned leveraging PixelHop++ [24] and is fully unsupervised which significantly reduces the need for labeled data [123]. PixelHop++ can extract image representations at different frequency levels that is beneficial for abnormality detection. We evaluate BERTHop by conducting extensive experiments and analysis for CADx in chest disease diagnosis on the OpenI dataset [33]. The OpenI dataset contains thoracic diseases, includ- ing 14 common chest diseases. Compared to SOTA (TieNET [164]), BERTHop outperforms in 11 out of 14 thoracic diseases diagnoses and achieves an average AUC of 98.23% that is 1.73% higher, using significantly less training data (TieNet is trained on the ChestX-ray14 [163] dataset that is 9 times larger than OpenI). Compared to the similar transformer-based V&L model pre- trained on general domain and fine-tuned on OpenI [95, 98], BERTHop requires no expensive V&L pre-training yet outperforms it by 14.37%. We summarize our contributions as follows: (1) We propose BERTHop, a novel data-efficient V&L model for CXR disease diagnosis surpassing existing approaches. (2) Our proposed model incorporates PixelHop++ into a transformer-based model. To the best of our knowledge, this is the first study which integrates PixelHop++ and Deep Neural Network (DNN) models. (3) We conduct extensive experiments to demonstrate the effectiveness of each submodel we used in BERTHop. 5.2 Related Work Transformer-based V&L models Inspired by the success of BERT for NLP tasks, various transformer-based V&L models have been proposed [95, 22, 149]. They generally use an ob- ject detector pre-trained on Visual Genome [83] to extract visual features from an input image and then use a transformer to model the visual features and input sentence. They are pre-trained on 58 a massive amount of paired image-text data with a mask-and-predict objective similar to BERT. During pre-training, part of the input is masked and the objective is to predict the masked words or image regions based on the remaining contexts. Such models have been applied to many V&L applications [179, 105, 29] including the medical domain [98]. However, the performance of these models is not satisfactory due to the domain shift between the general domain and medical domain. V&L models in the medical domain Various CNN-RNN-based V&L models have been pro- posed for disease diagnosis on CXR. Zhang et al. 
[178] proposed TNNT (Text-guided Neural Network Training) which helps a CNN model get guidance from text report embedding for a more efficient training process on V&L data and evaluated the model on four V&L datasets including the OpenI dataset. They showed that the text report has important information that can improve the diagnosis compared with prior vision-only models, e.g., ResNet. TieNet is a CNN-RNN-based model for V&L embedding integrating multi-level attention lay- ers into an end-to-end CNN-RNN framework for disease diagnosis and radiology report generation tasks. TieNet uses a ResNet-50 pre-trained for general-domain visual feature extraction and an RNN for V&L fusion. As a result, it requires a large amount of in-domain training data (ChestX- ray14) for adapting to the medical domain, limiting its practical usage. In contrast, our method achieves higher performance with very limited in-domain data. Recently, Li et al. [98] evaluated the transferability of well-known pre-trained V&L models by fine-tuning them on MIMIC-CXR [74] and OpenI. However, the pre-trained models are designed and pre-trained for general-domain, and directly fine-tuning it with limited in-domain data leads to suboptimal performance. We refer to this method as VB w/ BUTD (section 5.4.2). PixelHop++ for visual feature learning PixelHop++ is originally proposed as an alternative to deep convolutional neural networks for feature extraction from images and video frames in resource-constrained environments. It is a multi-level model which generates output channels representing an image at different frequencies. 59 . . . The cardiac and mediastinal contours are within normal limits. The lungs are well-inflated and clear. There is an 8mm nodule in the left lower lobe, XXXX calcified granuloma … Chest X-Ray Image Text Report Atelectasis Effusion Pneumonia Nodule Emphysema Fibrosis PT . . . Transformer E 0 . . . E 1 E 2 E 3 . . . PCA and Concatenation . . . . . . . . . V 1 V 2 V Q PixelHop++ Output Channels Figure 5.2: The proposed BERTHop framework for CXR disease diagnosis. A PixelHop++ model followed by a “PCA and concatenation” block is used to generate Q feature vectors. These features along with language embedding are fed to the transformer that is initialized with BlueBERT. PixelHop++ is used in various applications and shown to be highly effective on small datasets. These applications include face gender classification [124], face recognition [125], and deep fake detection [20]. It has also been recently applied to a medical task. V oxelHop [100] leveraged this model on 3D Magnetic resonance imaging (MRI) imaging data and could achieve superior results for Amyotrophic Lateral Sclerosis (ALS) disease classification task. To the best of our knowledge, this is the first study which integrates PixelHop++ and DNN models. Our proposed model takes advantage of the attention mechanism to integrate visual fea- tures extracted from PixelHop++ and the language embedding. 5.3 Approach Inspired by the architecture of VisualBERT, our framework uses a single transformer to integrates visual features and language embeddings. The overall framework of our proposed approach is shown in Figure 5.2. We first utilize PixelHop++ to extract visual features from the X-ray image; then the text (a radiology report) is encoded into subword embeddings; a joint transformer is applied on top to model the relationship between two modalities and capture implicit alignments. 
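As a rough illustration of this single-stream design, the sketch below shows how the Q visual feature vectors and the report token embeddings could be fused by one transformer encoder with a multi-label head. The module, the dimensions, and the pooling choice are illustrative assumptions; BERTHop itself initializes this backbone from BlueBERT rather than training it from scratch.

import torch
import torch.nn as nn

class SingleStreamVL(nn.Module):
    """Toy single-stream V&L classifier: token embeddings + visual features -> transformer -> labels."""
    def __init__(self, vocab_size=30522, visual_dim=2048, hidden=768,
                 num_layers=4, num_labels=14):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.vis_proj = nn.Linear(visual_dim, hidden)      # map the Q visual vectors into the text space
        self.type_emb = nn.Embedding(2, hidden)            # 0 = text token, 1 = visual token
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.head = nn.Linear(hidden, num_labels)          # one logit per thoracic disease

    def forward(self, input_ids, visual_feats):
        # input_ids: (B, L) token ids; visual_feats: (B, Q, visual_dim)
        text = self.tok_emb(input_ids) + self.type_emb(torch.zeros_like(input_ids))
        vis_ids = torch.ones(visual_feats.shape[:2], dtype=torch.long, device=input_ids.device)
        vis = self.vis_proj(visual_feats) + self.type_emb(vis_ids)
        fused = self.encoder(torch.cat([text, vis], dim=1))
        return self.head(fused[:, 0])                      # pool on the first (CLS-like) position

model = SingleStreamVL()
logits = model(torch.randint(0, 30522, (2, 128)), torch.randn(2, 15, 2048))
probs = torch.sigmoid(logits)                              # multi-label disease probabilities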
There are two main differences between BERTHop and previous approaches:

• Visual feature encoder. Considering the lack of data in the medical domain, instead of using an object detector pre-trained on a general-domain dataset, we leverage PixelHop++, an unsupervised, data-efficient method, to extract visual features. As the PixelHop++ output channels are too large to be fed directly into the transformer, we apply Principal Component Analysis (PCA) to the output channels for dimension reduction. PCA is an orthogonal linear transformation that maps the data to a new coordinate system of lower dimension so that the variation of the data is best preserved. By applying PCA to the PixelHop++ output channels, we capture the most prominent features and prevent over-fitting. We then concatenate the results to generate the final visual feature vectors. (Section 5.3.1)

• In-domain text pre-training. Instead of resorting to computationally expensive V&L pre-training on a general-domain image-text dataset, we find in-domain text-only pre-training considerably more beneficial in our application. Thus, we use BlueBERT, a transformer pre-trained on biomedical and clinical datasets, as the backbone of our model. (Section 5.3.2)

5.3.1 Visual Encoder

We argue that extracting visual features from a general-domain object detector, i.e., the BUTD [5] approach that is dominant in most V&L tasks, is not suitable for the medical domain. BUTD¹ takes an image and employs a ResNet-based Faster R-CNN [120] for object detection and feature extraction from each object. The detector is pre-trained on Visual Genome [83] to detect objects in everyday scenes. Such an approach fails to detect medical abnormalities when applied to X-ray images. The reason is that the abnormalities in the image, which are of high importance for facilitating diagnosis, usually do not resemble the usual notion of an "object" and will likely be ignored by a general-domain object detector. Further, there exists no large-scale annotated dataset for disease abnormality detection from which to train a reliable detector [137].

¹ In the following, we use the term "BUTD" to refer to extracting visual features from a pre-trained object detector rather than the full model from [5].

Figure 5.3: Data flow in a 3-level PixelHop++ model; a node represents a channel. The spatial resolution shrinks from the 206×206 input to 204×204, 102×102, 100×100, 50×50, and 48×48 through the PixelHop++ units and max-pooling layers of the three levels, with intermediate nodes forwarded and discarded nodes pruned.

We propose to adopt PixelHop++ [24] for unsupervised visual feature learning in the medical domain, which has been shown to be highly effective when trained on small-scale datasets. The key idea of PixelHop++ is computing the parameters of its model by a closed-form expression without using back-propagation [123]. As PixelHop++ leverages PCA for computing parameters, the model is able to extract image representations at various frequencies in an unsupervised manner. Inspired by the architecture of DNN models, PixelHop++ is a multi-level model in which each level consists of one or several PixelHop++ units followed by a max-pooling layer. An illustration of the data flow in a 3-level PixelHop++ model is shown in Figure 5.3. When training a PixelHop++ model, the parameters of the PixelHop++ units (kernels and biases) are computed, and during inference they are used for feature extraction from pixel blocks.
Training phase of PixelHop++. Suppose that we have N training images of size s_1 × s_2 × d, where d is 1 for gray-scale and 3 for color images. They are all fed into a single PixelHop++ unit in the first level of the model. The goal of training a PixelHop++ unit is to compute linearly independent projection vectors (kernels) which can extract strong features from its input data. There are one or more PixelHop++ units in each level of a PixelHop++ model.

In the first step of processing data in a PixelHop++ unit, using a sliding window of size w × w × d and a stride of s, patches from each training image are extracted and flattened, i.e., x_i1, x_i2, ..., x_iM, where x_ij is the jth flattened patch of image i and M is the number of extracted patches per image. In the second step, the set of all patches extracted from the training images is used to compute the kernels of the PixelHop++ unit. The kernels are computed as follows:

• The first kernel, called the DC kernel, is the mean filter, i.e., (1/√n) × (1, 1, ..., 1), where n is the size of the input vector; it extracts the mean of each input vector.

• After computing the mean (DC component) of each vector, PCA kernels of the residuals are computed and stored as AC kernels. The first k PCA kernels are the top k orthogonal projection vectors that best capture the variation of the residuals.

Each image patch is projected on the computed kernels, and a scalar bias is added to the projection result to avoid the sign-confusion problem [89]. This transformation of the input vector (x_0, x_1, ..., x_{D-1})^T can be written as

y_k = ∑_{d=0}^{D−1} a_{kd} x_d + b_k,    (5.1)

where a_{kd} represents the kernel parameters associated with the kth kernel of a PixelHop++ unit and b_k is the kernel's corresponding bias term.

By transforming x_i1, x_i2, ..., x_iM with one kernel of a PixelHop++ unit, one output channel is generated. For example, in the first level of the model, the PixelHop++ unit generates 1 DC channel and w × w × d − 1 AC channels. Each channel is shown as a node in Figure 5.3.

In the last step, model pruning is executed to remove channels that carry very little information. The ratio of the variance explained by each kernel to the variance of the training data is called the "energy ratio" of the kernel, or of its corresponding channel, and is used as the criterion for pruning the model. An energy ratio threshold value, E, is selected and model pruning is performed using the following rule:

• If the energy ratio of a channel is less than E, it is discarded (discarded nodes/channels in Figure 5.3), as the variation of the data along the corresponding kernel is very small.

• If the energy ratio of a channel is more than E, it is forwarded to the next level for further energy compaction (intermediate nodes/channels in Figure 5.3).

Each intermediate output channel generated by a PixelHop++ unit is fed into one separate PixelHop++ unit in the next level. Hence, except for the first level of the model, the other levels contain more than one PixelHop++ unit.

Inference phase of PixelHop++. The data flow is similar to the training phase, but all parameters, including kernel weights and biases, have already been computed during the training phase. Therefore, according to Equation 5.1, feature extraction from test images is conducted in each PixelHop++ unit using the computed kernels and biases.
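A minimal NumPy/scikit-learn sketch of training a single PixelHop++ (Saab) unit along the lines described above is shown next. The patch extraction, the single-bias choice, and the pruning interface are simplified assumptions for illustration, not the reference implementation of [24].

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.decomposition import PCA

def train_saab_unit(images, w=3, stride=1, energy_threshold=5e-5):
    """Fit the DC and AC kernels of one Saab (PixelHop++) unit from image patches.

    images: (N, H, W, d) float array; returns kept kernels, the bias, and energy ratios.
    """
    d = images.shape[-1]
    # 1) extract w x w x d patches with the given stride and flatten them
    patches = sliding_window_view(images, (w, w, d), axis=(1, 2, 3))
    patches = patches[:, ::stride, ::stride].reshape(-1, w * w * d)

    # 2) DC kernel: the mean filter (1/sqrt(n)) * (1, ..., 1)
    n = patches.shape[1]
    dc_kernel = np.ones(n) / np.sqrt(n)
    dc_response = patches @ dc_kernel

    # 3) AC kernels: PCA of the residuals after removing the DC component
    residuals = patches - np.outer(dc_response, dc_kernel)
    pca = PCA(n_components=n - 1).fit(residuals)
    kernels = np.vstack([dc_kernel, pca.components_])

    # 4) energy ratio per channel, used as the keep/forward vs. discard criterion
    total_var = patches.var(axis=0).sum()
    energy = np.concatenate([[dc_response.var() / total_var],
                             pca.explained_variance_ / total_var])
    keep = energy >= energy_threshold

    # 5) a single non-negative bias large enough to avoid sign confusion (simplified choice)
    bias = np.linalg.norm(patches, axis=1).max()
    return kernels[keep], bias, energy[keep]

Channels whose energy ratio falls below the threshold are dropped here; in the full multi-level model, each kept channel would be routed to its own child unit in the next level after max-pooling.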
As shown in an example in Figure 5.4, the report is written by an expert radiologist, who lists the normal and abnormal observations in the “finding” section and other important patient information including patient history, body parts, and previous studies in the “impression” section of the report. The text style of the report is drastically different from that of the pretraining corpora of BERT (Wikipedia and BookCorpus) or V&L models (MSCOCO and Conceptual Captions). However, previous methods [98] do not take such a significant domain gap into consideration. Rather, they initialize the transformer with a model trained on general-domain image-text corpora, as in most V&L tasks. Meanwhile, pre-training with text-only corpora has been reported to how only marginal or no benefit [149]. In the medical domain, however, we find that using a transformer pre-trained on in-domain text corpora as our initialized backbone serves as a simpler yet stronger approach. 64 X-ray image Text report Findings: The lungs are clear. There is no pleural effusion or pneumothorax. The heart is not significantly enlarged. The mediastinum is normal. Arthritic changes of the skeletal structures are noted Impression: No acute pulmonary disease. No gross evidence for rib fracture Figure 5.4: A sample image-text pair in the OpenI dataset. The text report from a radiologist is important for disease diagnosis but has a significantly different style compared to general-domain text. Peng et al. [115] proposed a Biomedical Language Understanding Evaluation (BLUE) bench- mark which evaluated the performance of BERT and Elmo [116] on 5 common biomedical text- mining tasks with ten corpora and showed the superiority of BERT when is pre-trained on biomed- ical and clinical datasets (BlueBERT). They made the models and datasets with various versions publicly available. 2 . Recently, BlueBERT has been widely used in the bioNLP community for various NLP tasks [54, 41, 156] and a few V&L tasks, e.g, data labeling [72]. Thus, we leverage this pre-trained version of BERT as the backbone in BERTHop to better capture the text report information. A typical transformer-based V&L model accepts input as an image-text pair. After embedding the text and image, a single-stream transformer [95] or a two-stream transformer [149] is used to fuse the two modalities. BERTHop has a single-stream design. 5.4 Experiments In this section, we evaluate BERTHop on the OpenI dataset and compare it with other existing models. To understand the effectiveness of the model designs, we also conduct detailed studies to verify the value of the visual encoder. 2 https://github.com/ncbi-nlp/bluebert 65 5.4.1 Experiment Setup Dataset For CADx in CXR disease diagnosis, commonly used datasets include ChestX-ray14, MIMIC-CXR, and OpenI. In this Chapter, we focus on the OpenI dataset for which professional annotators labeled the data. OpenI comprises 3,996 reports and 8,121 associated images from 3,996 unique patients collected by Indiana University from multiple institutes. Its labels include 14 commonly occurring thoracic chest diseases, i.e., Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, Pleural Thickening (PT), and Hernia. OpenI is a reliable choice for both training and evaluating V&L models as it is annotated by experts (labels are not learned from text reports or images). 
The disadvantage of using OpenI for training is that it contains a small amount of training data which is a challenge for DNN models. We apply the same pre-processing as TieNet and obtain 3,684 image-text pairs. We do not consider ChestX-ray14 and MIMIC-CXR for benchmarking because their labels are generated automatically from the images and/or associated reports. Specifically, ChestX-ray14 labels are mined using text process technique from the radiology reports, and MIMIC-CXR labels are generated using ChexPert[68] and NegBio[114] auto labelers. As their labels are machine- generated, evaluating the V&L model on these datasets is not reliable. Therefore, we considered evaluation on OpenI to accurately compare the performance of BERTHop with human expert per- formance. Model and training parameters We first resize all images of OpenI to 206 × 206 and apply the unsupervised feature learner, PixelHop++. We use a three-level PixelHop++ with the following hyper-parameters: w= 3, d = 1, s= 1, and E = 0.00005. Then, we apply PCA to its output channels and concatenate the generated vectors to form a set of Q visual features of dimension D, i.e., V =[v 1 ,v 2 ,...,v Q ],v i ∈R D . In BERTHop, D is set to be 2048. In our experiments setup, Q is equal to 15 but may vary depending on the size of the output channels of the PixelHop++ model and also the number of PCA components. 66 Atelectasis 24% Cardiomegaly 26% Effusion 12% Infiltration 5% Mass 1% Nodule 8% Pneumonia 3% Pneumothorax 2% Consolidation 2% Edema 3% Emphysema 8% Fibrosis 2% PT 4% Hernia 0% Diseases 29% Normal 71% (A) (B) Figure 5.5: OpenI label statistics: (A) Percentage of normal and abnormal cases (B) Percentage of different diseases. As for the transformer backbone, we use BlueBERT-Base (Uncased, PubMed+MIMIC-III) from Huggingface [166], a transformer library. BlueBERT-Base is pre-trained on PubMed ab- stracts with more than 4 billion words in the biomedical domain and MIMIC-III with more than 500 million words in the clinical domain. Having the visual features from the visual encoder and text embedding, we train the transformer on the training set of OpenI with 2,912 image-text pairs. We use batch size = 18, learning rate = 1e− 5, max-seq-length = 128, and Stochastic Gradient Descent (SGD) as the optimizer with momentum = 0.9 and train it for 240 epochs. Evaluation metric All mentioned datasets are highly imbalanced and mostly contain normal cases. Figure 5.5 shows the percentages of different diseases compared with normal cases in OpenI. Therefore, evaluating models using metrics such as accuracy does not reflect model performance. Instead, we follow prior studies to evaluate models based on Receiver Operating Characteristic (ROC) and Area Under the ROC Curve (AUC) score. 5.4.2 Main Results We train BERTHop on the OpenI training dataset containing 2,912 image-text pairs and evaluate it on the corresponding test set comprising 772 image-text pairs. The ROC curve for each disease is plotted in Figure 5.6. 
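The numbers reported next are per-disease AUC scores; a minimal sketch of how such scores can be computed with scikit-learn (the variable names, the NaN handling, and the macro average are illustrative assumptions) is:

import numpy as np
from sklearn.metrics import roc_auc_score

DISEASES = ["Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
            "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
            "Emphysema", "Fibrosis", "PT", "Hernia"]

def per_disease_auc(y_true, y_score):
    """y_true: (N, 14) binary labels; y_score: (N, 14) predicted probabilities."""
    aucs = {}
    for i, name in enumerate(DISEASES):
        # AUC is undefined when the test set contains a single class (e.g., no positive Hernia cases)
        if y_true[:, i].min() == y_true[:, i].max():
            aucs[name] = float("nan")
            continue
        aucs[name] = roc_auc_score(y_true[:, i], y_score[:, i])
    valid = [v for v in aucs.values() if not np.isnan(v)]
    return aucs, float(np.mean(valid))       # per-disease AUCs and their macro average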
Table 5.1: AUC comparison of our model with three other methods for thoracic disease diagnosis on OpenI. BERTHop significantly outperforms models trained with a similar amount of data (e.g., VB w/ BUTD). *TieNet is trained on a much larger dataset than BERTHop.
Disease | TNNT [178] | TieNet* [164] | VB w/ BUTD [98] | BERTHop
Atelectasis | - | 0.976 | 0.9247 | 0.9838
Cardiomegaly | - | 0.962 | 0.9665 | 0.9896
Effusion | - | 0.977 | 0.9049 | 0.9432
Infiltration | - | 0.984 | 0.8867 | 0.9926
Mass | - | 0.903 | 0.6428 | 0.9900
Nodule | - | 0.960 | 0.8480 | 0.9810
Pneumonia | - | 0.994 | 0.8537 | 0.9967
Pneumothorax | - | 0.960 | 0.8931 | 1.0000
Consolidation | - | 0.989 | 0.7870 | 0.9671
Edema | - | 0.995 | 0.9500 | 0.9987
Emphysema | - | 0.868 | 0.8565 | 0.9971
Fibrosis | - | 0.960 | 0.6274 | 0.9966
PT | - | 0.953 | 0.7612 | 0.9330
Hernia | - | - | - | -
AVG | 0.854 | 0.965 | 0.8386 | 0.9823

We compare BERTHop with the following approaches:

• TNNT [178]: a Text-guided Neural Network Training method. See the details in Section 5.2.

• TieNet [164]: a CNN-RNN-based model. See the details in Section 5.2.

• VB w/ BUTD [95, 98]: fine-tuning the original VisualBERT.

We evaluate all the models using the same AUC implementation in scikit-learn [14]. Table 5.1 summarizes the performance of BERTHop compared with existing methods. The results demonstrate that BERTHop outperforms the SOTA (TieNet) in 11 out of 14 thoracic disease diagnoses and achieves an average AUC of 98.23%, which is 14.37%, 12.83%, and 1.73% higher than VB w/ BUTD, TNNT, and TieNet, respectively. Note that TieNet has been trained on a much larger annotated dataset, i.e., the ChestX-ray14 dataset containing 108,948 training examples, while BERTHop is trained on only 2,912 case examples.

Figure 5.6: ROC curves of BERTHop for all 14 thoracic diseases, with the per-disease AUC values listed in the legend.

Regarding the VB w/ BUTD results, we re-evaluate them based on the code released by the original authors (https://github.com/YIKUAN8/Transformers-VQA/blob/master/openI_VQA.ipynb). However, we cannot reproduce the originally reported results even after contacting the authors.

5.4.3 Visual Encoder

To better understand which visual encoder is suitable for medical applications, we compare three visual feature extraction methods (BUTD, ChexNet [117], and PixelHop++). In particular, we replace the visual encoder of BERTHop with different visual encoders and report their performance. BUTD extracts visual features from a Faster R-CNN pre-trained on Visual Genome, which is prevalent in recent V&L models. ChexNet is a CNN-based method proposed for pneumonia detection. It is a 121-layer DenseNet [61] trained on the ChestX-ray14 dataset for pneumonia detection, with all pneumonia cases labeled as positive examples and all other cases as negative examples. By modifying the loss function, it is also trained to classify all 14 thoracic diseases and achieves state-of-the-art performance among existing vision-only models, e.g., [163].
To augment the data, it extracts 10 crops from the image (the 4 corners and the center, plus their horizontally flipped versions) and feeds them into the network to generate a feature vector of dimension 1024 for each of them. To make these features compatible with our transformer framework, we apply a linear transformation that maps the 1024-dimensional feature vectors generated by ChexNet to 2048 dimensions. We fine-tune ChexNet and train the parameters of the linear transformation on the OpenI dataset.

Table 5.2: Comparison between different visual encoders (BUTD, ChexNet, and PixelHop++) under the same transformer backbone of BlueBERT. PixelHop++ outperforms BUTD and even ChexNet, which is pre-trained on a large in-domain disease diagnosis dataset.
Disease | BUTD | ChexNet | PixelHop++
Atelectasis | 0.8866 | 0.9787 | 0.9838
Cardiomegaly | 0.8875 | 0.9797 | 0.9896
Effusion | 0.9120 | 0.8894 | 0.9432
Mass | 0.7373 | 0.7529 | 0.9900
Consolidation | 0.8906 | 0.9000 | 0.9671
Emphysema | 0.8261 | 0.9067 | 0.9971
AVG | 0.8564 | 0.8798 | 0.9823

The results in Table 5.2 show that the visual encoder of BERTHop, PixelHop++, extracts richer features from the CXR images, as it uses a data-efficient method capable of extracting image representations at different frequencies. The transformer can then highlight the most informative features from the image-text data through its attention mechanism to make the final decision.

5.5 Conclusion and Future Work

In this Chapter, we proposed a high-performance, data-efficient V&L model, BERTHop, for CXR disease diagnosis. We showed that BERTHop outperforms the state-of-the-art while being trained on a much smaller training set. Our studies verify the effectiveness of the visual feature extractor, PixelHop++, and of initializing the transformer backbone with BlueBERT.

As a future research direction, we plan to study how anomaly detection techniques can be incorporated to further improve the performance of the model. As no large-scale annotated CXR dataset for anomaly detection is available, we may use weakly supervised techniques or knowledge transfer from similar tasks. We are also interested in how our proposed BERTHop model can help other biomedical tasks, e.g., COVID-19 disease diagnosis and radiology report generation. Another future research direction is exploring the effectiveness of our visual encoder technique for other biomedical tasks, e.g., bladder cancer diagnosis.

Chapter 6

MAGIC: Mask-Guided Image Synthesis by Inverting a Quasi-Robust Classifier

6.1 Introduction

"A picture is worth a thousand words": a famous English-language adage that is even more relevant nowadays, as multimedia data influences our daily lives through social media, web pages, and TV shows. The image synthesis, manipulation, and editing landscape is even more critical today for companies working with digital artists or using image manipulation programs. Companies such as Synthesia [1] use multimedia manipulation technology for entertainment and consumer purposes, especially as the capabilities of the field keep improving.

With the advent of image digitization, digital artists could manipulate media with content-aware rescaling [7] or automatic image retargeting [132]. In the last decade, deep generative modeling brought image synthesis to the next level.
Generative modeling made extraordinary leaps in terms of data fidelity: they unleashed the possibility of generating data that is indistinguish- able from natural images [75] and opening new fields such as “neural rendering” [151]. A game changer was an implicit density model that learns the data density with no explicit likelihood by an adversarial minimax game between a generator, and a discriminator—Generative Adversarial Networks (GANs) [48, 75]. 72 a) b) c) Position Control Copy/Move Remove Add Scene Shape Control Scene Shape Control Object Shape Control Object Shape Control Figure 6.1: Multiple image manipulation tasks with a single method. MAGIC allows a diverse set of image synthesis tasks following the semantic of objects and scenes requiring only a single image, its segmentation mask, and a guide mask. In each pair, the left image is the input, and the right one is the manipulated image, guided by the mask shown on top. a) position control and copy/move manipulation by editing the guide mask; b) non-rigid shape control on scenes. c) non- rigid shape control on objects such as animals. Note that the guide mask is not required to segment the object perfectly with fine details; on the contrary, it can be loose, requiring less supervision. While powerful supervised models learning a mapping from one domain to the other have been introduced [161], training such models concerning all possible manipulations on a large dataset in an inductive fashion is an unfeasible and ill-posed problem. Thus, lazy learning became of inter- est: image-specific GANs [138] can be trained on a single image to produce images with similar “DNA” or with repetitive patterns [135]. While PatchGANs learn patch level data distribution, they are insufficient to model complex relations between objects and their parts. To overcome these lim- itations, DEEPSIM [155] learns a mapping from image primitives (edges, semantic segmentation 73 maps) to color intensities using a pix2pixHD [161] architecture. It employs a novel data augmenta- tion method based on non-linear image warping using thin plate spline (TPS) to cope with the lack of training data. Instead, IMAGINE [159] performs image synthesis by inverting a classifier with regularization in the feature space inspired by [175]. Our work is also based on model inversion as [175] yet performs position and shape control of objects with higher fidelity than IMAGINE and requires less supervision (i.e., fewer primitives) than DEEPSIM. We entitle our method as MAGIC following “mask-guided image synthesis by inverting a quasi-robust classifier”. Leveraging on the limits of the prior art, we make the following contributions: ⋄ MAGIC accomplishes semantic image editing in “one fell swoop”, thereby performing po- sition control, shape control, and copy/move manipulation, as shown in fig. 6.1. We propose a one-shot method that jointly inverts a quasi-robust classifier and a patch-based autoencoder (AE) with superior results than IMAGINE [159]. Despite their affinities, MAGIC reaches a higher quality in position control and shape deformation—fig. 6.2 b) bottom vs. fig. 6.2 c) bottom—which is something that PatchGANs [138, 135] too cannot achieve—fig. 6.2 a). 
⋄ Like DEEPSIM [155], we perform lazy-learning, yet we can handle stronger deformations exploiting the prior of inverting a quasi-robust classifier trained on ImageNet [126] and also a PatchGAN with a specific inductive bias described in section 6.3.2; additionally, this latter char- acteristic protects our generation from interpolating “empty” regions with unrealistic hallucinated details, affording to be looser on the supervision—fig. 6.2 c). This looseness in supervision trans- lates into less effort from an artist’s perspective. ⋄ DEEPSIM produces more smoothed images yet is less robust to intense shape deformations and has a bias to bending surfaces, leading to unrealistic scene synthesis—fig. 6.2 d). MAGIC overcomes these limitations and offers an orthogonal contribution. ⋄ We offer a diverse set of excellent visual performances over various image manipulation tasks. We highlight the novel interplay between the inversion of a quasi-robust classifier and a 74 a) c) b) d) SINGAN IMAGINE DEEPSIM MAGIC Input IMAGINE w/o supervision IMAGINE w/ supervision DEEPSIM MAGIC Input DEEPSIM MAGIC Figure 6.2: a) SINGAN [135] and IMAGINE [159] fail to capture the arrangement of parts of objects. Supervision with primitives may lead to better performance—DEEPSIM and our MAGIC). b) Even when IMAGINE uses supervision—right column—the synthesis is limited or requires the clip-art to match the image colors. c) Our MAGIC can handle a spectrum of deformations from mild to even intense, whereas DEEPSIM fails to generate unseen parts or to interpolate empty regions; d) on the contrary, DEEPSIM preserves the contour of objects better though it “curves” straight lines and shows artifacts when the mask provides no direct supervision. Some figures are taken from [159]. patch-based AE for binary mask segmentation as a means for manipulation control. Our qualita- tive results are also supported with quantitative comparisons including a user study following the practice of [159, 155]. 6.2 Prior Work Our work touches on multiple aspects of image synthesis: i) classifier inversion; ii) image synthesis with a “robust” classifier, optimized with adversarial training (AT) or variants thereof; iii) the usage of a GAN to prune the space of possible inversions. The GAN considered in our work is related to PatchGANs—imposing a constraint on patch statistics [180]—rather than “holistic” GANs. We now discuss the three aspects mentioned above. Image synthesis by model inversion. Neural network (NN) inversion dates back to 1986 [165] and the nineties [79]. In [79], Kindermann and Linden used inversion as a tool for understanding 75 which arbitrary input signal can match the output code. Inversion is usually done using back- propagation of errors keeping the NN weights fixed and optimizing the input. This process is also the basic recipe for adversarial attacks [146, 11], explainable AI using activation maximiza- tion [140] or inspecting NN representation [108, 109]. Inversion implies optimizing a pre-image subject to regularizations to resemble a natural image: this process enables producing mesmerizing pictures with Google’s “DeepDream” [111]. Despite recent progress, generating high-fidelity nat- ural images by classifier inversion while controlling attributes such as the position of the objects and their shape remains a challenge. The main limitation is that NNs do not provide any explicit mechanism to control these attributes. 
Recent methods working towards the aforementioned objec- tive are “Dream to Distill” [175] and “IMAGINE” [159]. The work in [175] takes inspiration from “DeepDream” [111] and uses image synthesis as a data generation process for a teacher-student framework. Yin et al. imposes additional regularization on the pre-image: they also impose con- straints between the statistics of the feature maps of the pre-image and those internally stored in the batch normalization (BN) statistics. On the other hand, IMAGINE [159] produces variations of a guide image yet changes the feature map constraint of [175] to take into account specificity . IMAG- INE overcomes the limit of Single Image GAN—SINGAN [135]—therefore being able to generate images containing objects instead of self-similar repetitive patterns. Synthesis with a robust classifier. Santurkar et al. [129] are the first to use a robust classifier for synthesis. Robust indicates a classifier optimized with AT to be resilient to a threat model. The threat is described by bounding the magnitude of the perturbation with aℓ p norm [49, 107]. Ro- bust models [129] are shown to retain input gradients more aligned with human perception [3, 76] and better capture the global shape of objects [177]. The reason why it is so is not yet crystal clear: Terzi et al. convey that AT makes the classifier invertible [150] learning more high-level features; on the contrary, [78] conjectures that AT restricts gradients closer to the image mani- fold. Regardless, the invertibility property is undoubtedly valuable for synthesis. Interestingly, other researchers empirically show that this property emerges even when a non-robust classifier is 76 trained [76] with randomized smoothing [31], trained with AT but with low magnitude perturba- tions [3], or even when the class-conditional logit gradients are regularized to have low norm [143]. The invertibility property of robust models has been recently employed by [122] for solving inverse problems such as image denoising, example-based style transfer, or anomaly detection. Contrast- ingly, we use a “quasi-robust” model: i.e., a low max-perturbation bound quasi-robust model which retains a high classification accuracy on natural images, enabling simultaneous classifica- tion and synthesis. Another characteristic trait is that we focus on location and shape control which are applications that [122] does not cover. While [122] share similar aspects with our work, [155] does not employ a robust model yet attains some similar objectives and is the state-of-the-art solu- tion. Though [155] performs shape manipulation with a single guide sample, it uses a multi-class segmentation mask requiring much more tedious per-pixel supervision compared to binary mask, which is the only supervision our method exploits. Constraining patch-level statistics with GANs. The first to apply GAN at the patch level is [94] with the term “neural patch”, followed by [139] referring to as “local adversarial loss”. The us- age of GAN to constrain patch statistics has been widely used in pix2pix [69] under the name of Markovian discriminator. The work par excellence exploiting GAN at the patch level is SIN- GAN [135] employing a multi-scale hierarchy of fully convolutional GANs. Though SINGAN and INGAN [138] can generate realistic scenes with repetitive structures, [50] showed that a hierarchi- cal patch-based Nearest-Neighbor could replace the GAN when dealing with repetitive patterns with a remarkable speedup in the generation. 
6.3 Method Preliminaries and objective. We are given an image x∈Z H× W× 3 | [0,255] along with an aligned source binary mask y∈Z H× W× 1 | [0,1] ; where this latter supervises the pixels of the object or scene that we seek to manipulate and takes values∈[0,1]. 77 x y z x ′ y ′ segment encode invert edit Figure 6.3: The binary mask y ′ is used as a guide; x ′ is inverted from x latent code z, constrained with y ′ . Referring to the diagram in fig. 6.3, we aim at synthesizing x ′ by cheaply editing y into a guide mask y ′ , that functions as a prior for a vari- ety of tasks such as position control, non-rigid shape control, scene mod- ification, and copy/move. Manipulating y ′ from y is practical and can be done with off-the-shelf background removal tools incorporated into ma- jor image manipulation programs. This process requires only segment- ing the global part we wish to manipulate compared to [155] that requires fine-grained primitives such as pixel-wise class-conditional segmentation masks. Furthermore, the process can be further made semi-automatic by pre-estimating y with saliency detection [162], though it is not the central point of our work. For instance, for each given pair in fig. 6.1, y and y ′ are shown in the upper left part of the input and synthesized images, respectively. In the following sections, we explain how we implement the mapping x→ z→ x ′ contingent to the constraint y ′ → x ′ , while aligning the patch distributions of x—x ′ . Overview. We propose inverting two main models to achieve image synthesis, preserving the semantics of objects and scenes while achieving shape control. The first inversion implements x→ z→ x ′ by getting gradients from a quasi-robust classifier θ. This part ensures that the re- construction contains gradients with information coming from class-conditional data distribution to preserve class semantics. We also invert a patch-based autoencoder (AE)θ AE for manipulation control. Offline, we pre-train θ with a variant of adversarial training (AT) that perturbs the data with a very small ε-ball around the training point underℓ 2 norm, which is different than what is usually done in robust machine learning, where ε is set to be high to make the model resilient to attacks. Then, we also optimize θ AE to encode the mapping from x to y, so that θ AE learns the distribution of the part we seek to manipulate. Conversely, at synthesis time, we fix both θ and θ AE to get gradients from them: in particular, with θ AE , we replace y with y ′ so to force the foreground object to be deformed guided by the mask y ′ . Following [159] we require the patch dis- tribution of x ′ to be aligned with the patch data density of x with a discriminatorθ d , though, in our 78 Input [159] ℓ 2 , ε=0.01 ℓ 2 , ε=0.05 x x x ′ y z y ′ θ θ AE θ d a) b) Figure 6.4: a) x ′ receives structured gradients from θ to preserve the semantics of z; it receives gradients from a discriminator to match x’s patch distribution. An AE is pre-trained to map x to y, we then introduce gradients from AE to guide x ′ shape/location constrained with y ′ . b) Gradients from ResNet-50 [59]—also used in [159]—exhibit a sparse structure with activations around the borders; using a quasi-robust model with smallℓ 2 yields gradients with structures that appear on silent features (eyes, nose, etc.) Zoom on gradients for better comparison. case, patches are sampled to be much larger than those in [159]. This better preserves object/scene shapes. fig. 6.4 a) shows the process with gradients flowing on x ′ . 
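The offline step that fits θ_AE to map the image x to its binary mask y can be sketched as follows. The small fully convolutional encoder-decoder, the random-crop patch sampling, and the training schedule are illustrative assumptions; the actual architecture and training details of θ_AE may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAE(nn.Module):
    """Small fully convolutional encoder-decoder predicting a 1-channel mask logit map."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def fit_mask_ae(x, y, steps=2000, patch=64, lr=1e-3, seed=0):
    """Fit the AE on random patches of the single (x, y) pair; x: (1,3,H,W), y: (1,1,H,W)."""
    g = torch.Generator().manual_seed(seed)
    ae = MaskAE()
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    H, W = x.shape[-2:]
    for _ in range(steps):
        i = torch.randint(0, H - patch + 1, (1,), generator=g).item()
        j = torch.randint(0, W - patch + 1, (1,), generator=g).item()
        xp, yp = x[..., i:i + patch, j:j + patch], y[..., i:i + patch, j:j + patch]
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(ae(xp), yp)   # pixel-wise BCE averaged over the patch
        loss.backward()
        opt.step()
    return ae

Averaging the segmentation loss over many patches of the single training pair is what regularizes θ_AE without the thin-plate-spline warping used by DEEPSIM; at synthesis time the frozen θ_AE is inverted with y replaced by the edited guide mask y′.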
6.3.1 Quasi-Robust Model as a Strong Prior for Synthesis

Model Inversion. The mapping x → z → x′ defined in fig. 6.3 is formalized as inverting the latent embedding z of a classifier θ parametrized by the weights of a ConvNet trained on ImageNet, as previously done in [159]. A classifier θ : Z^{H×W×3}_{[0,255]} → R^C maps high-dimensional data x to an embedding z, where C is the number of classes; for ImageNet, C = 1,000. In particular, this process can be seen as sampling from a generative model conditioned on x: first, x → z encodes x into a latent space with the forward pass, and then z → x′ "draws a sample x′" from z with a backward pass. Inverting a classifier implies solving

x′ = argmin_{x′} L(x′, x; θ), where L(x′, x; θ) = ℓ(θ(x′), z) + ρ(x′),    (6.1)

where z ≐ θ(x) is the latent code given the source image x. The optimization is an ill-posed problem since the learned function θ is non-injective, per the requirement of building invariance in the input space with respect to the same class. Hence, given a latent code z, multiple pre-images could be generated from this code. This issue motivates the need for strong regularization ρ on the optimized pre-image x′. The loss ℓ(·,·) in eq. (6.1) can be the Kullback-Leibler (KL) divergence if the two logit terms are transformed into probabilities using a softmax operator, while fixing z as the reference distribution over classes. Alternatively, we can follow a greedy approach that assigns c = argmax_c θ_c(x) as the most likely class given x. In this case, we can solve

x′ = argmin_{x′} ℓ(θ(x′), c) + ρ(x′) + ρ_θ(x′, x),    (6.2)

where the KL divergence ℓ reduces to the cross-entropy loss and c selects the index of the most likely class according to θ's prediction. Note that for eq. (6.2) to work, the classifier has to retain good accuracy on natural images; that is, c needs to be correct most of the time, otherwise eq. (6.2) may optimize x′ for another class. The same rationale holds for a loss that performs a "soft assignment" such as the KL divergence, so switching to eq. (6.1) does not solve the issue. Importantly, we highlight that the good prediction needed by eq. (6.2) is not a property of a robust classifier, given that it exhibits low accuracy on natural images [152]; thereby we cannot naively replace θ with a robust model just to obtain structured gradients [76, 3].

Basic Regularization. Following prior work [108], we use a basic regularization in the image space by bounding its squared Euclidean norm and imposing a total variation (TV) loss that penalizes the sum of the norms of the spatial gradients of the pre-image,

ρ(x′) = α ρ_TV(x′) + β ||x′||²,

where α and β are tunable hyper-parameters. We also ask x′ to match the first- and second-order statistics of the feature maps of the source image, as suggested in [159], to enforce a mild semantic consistency with the source image x:

ρ_θ(x′, x) = ∑_{j∈θ} ||µ_j(x′) − µ_j(x)||² + ∑_{j∈θ} ||σ_j(x′) − σ_j(x)||²,

where µ and σ are the average and the standard deviation of the feature maps across the spatial dimensions, and j indicates the layer of θ at which the map is taken. Note that this formulation per se is not enough to guarantee robust synthesis and does not fully take into account the semantics of the objects, as shown in fig. 6.2 b) and in fig. 6.9, second row. It is thus essential to introduce a better, stronger prior that can induce structured gradients when solving eq. (6.2) for x′.

Quasi-Robust Model for Synthesis.
In order to synthesize a new image, we initialize the pre-image with normal random noise, i.e., x′_{t=0} ∼ N(0, 1). We then proceed by iteratively updating the pre-image following the direction provided by the gradient of the loss in eq. (6.1) with respect to the pre-image,

x′_t = x′_{t−1} − λ ∇_{x′} L(x, x′; θ),

where t indicates the iteration of gradient descent and λ is the learning rate of the synthesis. The more structured ∇_{x′} L(x, x′; θ) is, the better and faster the optimization for image synthesis will be. For instance, it is important that the model θ has some degree of "semantic" understanding of the properties of a given class. As mentioned in section 6.2, a way of getting more structured gradients relies on using AT, i.e., solving the minimax game in eq. (6.3):

θ⋆ = argmin_θ ℓ(θ(x + δ⋆), y), where δ⋆ = argmax_{||δ||_p < ε} ℓ(θ(x + δ), y),    (6.3)

which alternates between finding an additive perturbation δ with bounded ℓ_p norm using Projected Gradient Descent (PGD) [107] and updating the weights θ to lower the cost on the perturbed points. Though we could simply use eq. (6.3) with a large perturbation ball of radius ε around the data point, we instead propose using a very small ε value so that we can retain the same accuracy as a standard model (needed by eq. (6.2)) while getting the benefit of the structured gradients of a robust model. Thereby, we replace θ with a quasi-robust model trained on ImageNet with eq. (6.3) and an ℓ_2 perturbation ball centered on the input with a very small ε = 0.05. We refer to this model as "quasi-robust" since it is a good trade-off between clean accuracy and structured gradients, pointing out that the model is robust within our small ε yet is not robust from an adversarial machine learning perspective. In section 6.4, we offer ablation experiments that show how switching to this model dramatically improves the semantics of the manipulations, and we study its effect under diverse ℓ_p norms on the final results. Similarly to what is observed in [51], we conjecture that a discriminative model trained as in eq. (6.3) secretly hides a generative model of the class-conditional data distribution, with a Langevin-dynamics sampling process reminiscent of score matching [66, 142]. We leave the verification of this conjecture to future work. Quasi-robust model gradients are visualized in fig. 6.4 b), compared to those of [159], which exhibit activations outside the salient parts of the objects. We found that synthesis with a model trained with eq. (6.3) under the ℓ_2 norm works better than under ℓ_∞, and we give an explanation in section 6.4.

6.3.2 Shape Preservation and Manipulation Control

Larger receptive field in the discriminator better preserves shape. The architecture θ_d is derived from [159] as a series of 2D convolutions, each followed by Batch Normalization and LeakyReLU, and is shown in fig. 6.4 a); θ_d is a particular PatchGAN (patch-based discriminator) where the generator is the pre-image itself, and the discriminator plays an adversarial game to classify patches of x and x′ with a Wasserstein loss with gradient penalty [55]. Unlike [159], we change the inductive bias of the discriminator to adapt it to shape control. With minor modifications, we achieved a significant impact on the synthesis. These modifications include increasing the number of parameters from 447,489 to 545,217 by raising the kernel size from 3×3 to 4×4 in the first three layers and keeping the number of filters at 128 in all layers except the first, in which we have 64 filters.
We also changed the number of layers from six to five and modified the stride of the second and third layers to two. With these modifications, we improve what the last classification units 'see' in the pre-image by increasing their receptive field from 9×9 to 21×21 for a 224×224 pre-image. The enhancements can be appreciated in fig. 6.5. Note that in this ablation we did not yet use the quasi-robust model, and we employ the same location control mechanism as in [159], based on attention maps, for a fair comparison. In fig. 6.5, in some cases, ours incorrectly hallucinates two hummingbirds. In the next section, we explain how to remove those artifacts and present our final contribution in shape control.

Figure 6.5: Shape is better preserved with ours, [159] + our θ_d (right), compared to IMAGINE [159] (left).

Manipulation Control via Mask-Guided Autoencoder Inversion. Unlike DEEPSIM [155], which maps primitives to images, we work in the reverse direction by learning a mapping from the image to the object or scene of interest. Our method's last building block consists of obtaining gradients from a patch-based AE trained offline for binary pixel-wise segmentation supervised by y. By doing so, we create a bottleneck through θ_AE that incorporates spatial knowledge of the object's position along with its shape. Unlike [155], our θ_AE computes the expectation of the loss with respect to a set of patches by means of fully convolutional layers [101], thereby regularizing the training. In doing so, we avoid complex data augmentation procedures such as using non-linear deformations of the input to find new samples. In fact, DEEPSIM treats the image itself as a single sample and is required to apply strong deformations employing TPS, which heavily bias the model towards producing "curved" objects and scenes. At synthesis time, we invert θ_AE to obtain gradients on x′ by replacing y with a new guide mask y′ specified as input to the algorithm. These new gradients guide x′ to deform its shape according to y′.

Final formulation. Our final MAGIC formulation preserves object and scene semantics using gradients from a quasi-robust model, aligns patch distributions without fragmenting objects, and finally achieves manipulation control as described above. The full inversion with all the regularizers is

x′ = argmin_{x′} ℓ(θ(x′), c) + η ρ_θd(x′, x) + γ ρ_θAE(x′, x, y′) + κ ρ(x′) + ν ρ_θ(x′, x),    (6.4)

where the five terms enforce, in order: semantics via quasi-robust inversion, alignment of large-patch distributions, manipulation control, classic image regularization [108], and feature-map distribution matching [159]. Here ℓ(·,·) indicates quasi-robust model inversion, ρ_θd(x′, x) is implemented as described in section 6.3.2, and ρ_θAE(x′, x, y′) inverts the binary cross-entropy averaged across all the pixels of the mask y′. The last two regularizers are the classic regularization ρ(x′) in the image space [108] and ρ_θ(x′, x), which matches feature-map distributions between the two images, following [159]. We give technical details on how we implemented this inversion and explain the hyper-parameters h = [η, γ, κ, ν] in section 6.4. At each step of gradient descent we update x′ with a linear combination of the gradients coming from eq. (6.4), with coefficients given by h and scaled by the learning rate. The first three regularizers in eq. (6.4) represent our contributions.
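A compact PyTorch-style sketch of one synthesis step under eq. (6.4) is given below. It assumes the quasi-robust classifier theta, the frozen mask autoencoder theta_ae, the patch discriminator theta_d, and the feature-statistics regularizer are already available as callables; the function names, the TV/ℓ_2 implementation, and the loop structure are illustrative assumptions rather than the released implementation, and the separate Wasserstein-GP training of theta_d is omitted.

import torch
import torch.nn.functional as F

def magic_step(x_prime, x, y_prime, c, theta, theta_ae, theta_d, feat_stats_reg,
               opt, h=(0.05, 30.0, 1.0, 5.0), alpha=1e-4, beta=1e-5):
    """One gradient-descent update of the pre-image x' under the combined objective (6.4)."""
    eta, gamma, kappa, nu = h
    opt.zero_grad()

    # semantics via quasi-robust inversion: cross-entropy to the class predicted for x
    loss = F.cross_entropy(theta(x_prime), c)

    # align large-patch distributions: generator-side term against the patch discriminator
    loss = loss + eta * (-theta_d(x_prime).mean())

    # manipulation control: the frozen AE should segment x' into the edited guide mask y'
    loss = loss + gamma * F.binary_cross_entropy_with_logits(theta_ae(x_prime), y_prime)

    # classic image regularization: total variation + squared L2 norm of the pre-image
    tv = (x_prime[..., 1:, :] - x_prime[..., :-1, :]).abs().mean() + \
         (x_prime[..., :, 1:] - x_prime[..., :, :-1]).abs().mean()
    loss = loss + kappa * (alpha * tv + beta * x_prime.pow(2).sum())

    # feature-map statistics matching with the source image x, following [159]
    loss = loss + nu * feat_stats_reg(x_prime, x)

    loss.backward()
    opt.step()            # only x' is updated here; theta, theta_ae are held fixed
    return loss.item()

The coefficient values in h follow the settings reported in section 6.4 and are only defaults for this sketch; in practice the discriminator term is switched on after an initial warm-up phase.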
6.4 Experimental Evaluation

In this section, we investigate MAGIC's generative capabilities and the effect of the proposed components on the synthesized images. We offer an ablation study illustrating the effect of our contributions on the baseline model IMAGINE and analyze the improvements in section 6.4.1. We further compare MAGIC with the state of the art in section 6.4.2 by performing both qualitative and quantitative evaluations. We show multiple generated samples to test the repeatability of our synthesis.

Implementation and hyper-parameters. In our experiments the image size is H = W = 224, following [159]. To obtain y′ for an image, we either manipulate its corresponding y or manually draw a binary mask from scratch. We use an ℓ_2 quasi-robust ResNet-50 with ε = 0.05 as the classification model. The discriminator θ_d is trained using the Wasserstein loss, similar to what is described in [159], yet with an increased number of iterations; the weights of θ_d are the only parameters optimized along with x′—the rest of the networks are held fixed and we simply obtain gradients from them. For the quasi-robust model, we used the publicly available implementation provided by [127]. For optimizing x′, the hyper-parameters h in eq. (6.4) are initially set as follows: η = 0.0, γ = 30.0, κ = 1.0, ν = 5.0, while the parameters in ρ(x′) are α = 1e−4 and β = 1e−5. After 5,000 iterations, we start training θ_d with η = 0.05. This schedule improves the alignment of the generated image with y′ and makes the training process more stable. MAGIC uses a coefficient for the Wasserstein loss that is 200% smaller than IMAGINE's. We use the Adam optimizer [81] with a learning rate λ of 5e−4. For other unmentioned parameters, we employ the values from IMAGINE [159].

6.4.1 Ablation Study

The impact of the quasi-robust model. To give insight into the effect of the quasi-robust model in eq. (6.4), we first visualize in fig. 6.6 the input gradients for several images from ImageNet. In particular, we study the influence of the ℓ_2 and ℓ_∞ norms in eq. (6.3), with different ε values, on the input gradients that we sample from θ. For visualizing the gradients, we follow [152] by first clipping the gradient intensity to stay within ±3 standard deviations of their mean and then rescaling it to lie in [0, 1] for each example.
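The clip-and-rescale visualization just described can be written in a few lines; this is only a minimal sketch of the procedure from [152], not the exact plotting code behind fig. 6.6.

```python
import numpy as np

def visualize_gradient(grad):
    """Map a raw input gradient of shape (H, W, C) to a displayable array in [0, 1]."""
    # Clip intensities to stay within +/- 3 standard deviations of their mean.
    mean, std = grad.mean(), grad.std()
    clipped = np.clip(grad, mean - 3.0 * std, mean + 3.0 * std)
    # Rescale each example independently so that values lie in [0, 1].
    lo, hi = clipped.min(), clipped.max()
    return (clipped - lo) / (hi - lo + 1e-12)
```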
Figure 6.6: Visualization of the gradient of the loss with respect to the input for ResNet-50 [59] (rows: Monkey, Lion, Dog; columns: Input, non-robust, ℓ_2 with ε = 0.01, 0.05, 1.0, 5.0, and ℓ_∞ with ε = 0.5/255, 1.0/255). Input gradients appear noisy for the non-robust model used in IMAGINE, but for the ℓ_2 quasi-robust models they start to align with edges as soon as ε slightly departs from zero. For larger ε, e.g., ε = 5.0, the model becomes more robust yet the gradients align more with coarse edges. The same holds for ℓ_∞ quasi-robust models.

Figure 6.7: Synthesized images by IMAGINE using models with different amounts of adversarial robustness (a: baseline IMAGINE; b–e: ℓ_2-robust with ε = 0.01, 0.05, 1, 5). a) Using a non-robust classification model for model inversion, IMAGINE synthesizes fragmented objects in the output. b) Replacing the non-robust model in IMAGINE with a quasi-robust model makes the synthesized images look less fragmented. c) By increasing the robustness a bit more, the generated objects become non-fragmented and unbroken. d–e) Using strongly robust models makes the generated objects blurry and some of the object details disappear.

As illustrated in fig. 6.6, the quasi-robust models trained with the ℓ_2 norm start to pay more attention to edges in the input image as soon as ε slightly increases from zero, which makes the gradients more aligned with human perception [129] and thus more suitable for synthesis. Note that this result was known for robust models but less known for quasi-robust models, and it had not yet been applied to synthesis. However, this yields a trade-off: if the model is trained with stronger attacks, e.g., ε = 5.0, equivalent to increasing the ℓ_2-ball around the data point, then it learns to rely mostly on coarse edges rather than fine edges. We suppose that image synthesis using strongly robust models is prone to neglecting fine edges and details of the object. Our ablation in fig. 6.7 confirms this hypothesis. According to fig. 6.6, the input gradients of the model trained with the ℓ_∞ norm are also more aligned with coarse edges, which has the same disadvantage mentioned before. Fig. 6.7 shows the impact of the size of the ℓ_2-ball around the training points in eq. (6.3) when we synthesize images using the gradient from θ. Based on this evidence, we always use an ℓ_2 quasi-robust model with ε = 0.05 without optimizing ε further. Besides, per the requirement of eq. (6.2), keeping a high classification accuracy is mandatory, supporting this choice even further.

The interplay of the quasi-robust model with our discriminator. As explained in section 6.3.2, MAGIC uses a PatchGAN with a receptive field of 21×21 while IMAGINE uses one with 9×9. Though having a smaller receptive field leads IMAGINE to generate images with more variations, it is also more prone to produce artifacts and non-realistic outputs. Fig. 6.8 shows our final results after incorporating the quasi-robust model along with our discriminator θ_d. We can appreciate how artifacts still clearly visible in fig. 6.5 are removed when these two contributions are employed together.

Figure 6.8: Synergy between the quasi-robust classifier and our discriminator.

The effect of manipulation control. By using θ_d, MAGIC tends to generate images similar to the input image, yet the contribution of inverting the mask-guided AE is key in controlling the manipulation: we offer qualitative results throughout this chapter in fig. 6.1, fig. 6.2, fig. 6.9, and fig. 6.10. These are evidence of how the method enforces object and scene deformations while preserving realism.

6.4.2 Comparison with the State-of-the-Art

We evaluate MAGIC by conducting extensive experiments on images that are either randomly selected from the ImageNet validation set, collected from the web, or the same images that previous methods used. We compare the results against DEEPSIM [155], which, to the best of our knowledge, is the state of the art for one-shot image synthesis guided by primitives. For a fair comparison, we re-trained every pair shown in this chapter with the publicly available code of DEEPSIM, feeding it the guide mask used in ours. We also perform a qualitative comparison against IMAGINE [159].

Figure 6.9: Qualitative comparison (columns: Input, IMAGINE [159], DEEPSIM [155], MAGIC (Ours)). DEEPSIM and MAGIC use the same guide mask y′. a) IMAGINE fails to perform position control and generates fragmented results. b) & d) DEEPSIM cannot synthesize realistic objects when y′ is extremely different from y, whereas MAGIC succeeds. c) IMAGINE generates a good result yet requires more supervision.
e) IMAGINE generates samples similar to the input with no supervision, while MAGIC enforces large variation using the guide mask. f) For shape control on complex scenes, MAGIC generates high-fidelity results while DEEPSIM synthesizes blurry and 'curved' images. Some figures are taken from [159].

Note that IMAGINE requires a detailed color segmentation map for shape control and does not work with binary masks. We have demonstrated the strengths of MAGIC compared with IMAGINE in section 6.4.1.

Qualitative evaluation. In fig. 6.1, we present MAGIC's results for position control, shape control, and copy/move manipulation on scenes and objects. In fig. 6.2, we compare the results of MAGIC with DEEPSIM and IMAGINE. Our experiments show that, compared with DEEPSIM, MAGIC can better handle scene deformation (scene shape control) as well as extreme object deformations. The results provided in fig. 6.9 further support this statement. Though MAGIC is guided by y′, even if we fix y′, MAGIC can sample different variations and synthesize diverse images, which is highly beneficial, especially for scenes. This capability is illustrated in fig. 6.10: we repeat the synthesis from three different starting points x′_{t=0} ∼ N(0, 1) subject to the same y′. Interestingly, diversity is usually injected into the background for objects, whereas scenes vary more.

Figure 6.10: For each input, we fix the mask and optimize starting from different random noise (columns: Input, Sample 1, Sample 2, Sample 3). While observing the boundaries specified by the guide mask y′ and generating realistic images, MAGIC keeps specificity and generates diverse results.

Quantitative evaluation. Following prior work [155, 135], we use machine perception as a proxy for measuring quality by employing the Fréchet Inception Distance (FID) [60] and comparing 25 images generated by MAGIC against 25 synthesized by DEEPSIM. As shown in table 6.1a, MAGIC significantly outperformed DEEPSIM on both object and scene synthesis.

Table 6.1: Quantitative comparison of DEEPSIM vs. MAGIC for object and scene images. (a) FID score; (b) average preference by the users drawn from the user survey.

(a) FID score
Methods          Objects    Scenes
DEEPSIM [155]    78.90      128.12
MAGIC (Ours)     35.12      40.91

(b) User preference
Methods          Objects    Scenes
DEEPSIM [155]    44.58%     13.19%
MAGIC (Ours)     55.42%     86.81%

To further evaluate our method, we used human perception by conducting a subjective evaluation of the quality of images synthesized by MAGIC compared to DEEPSIM. For the subjective evaluation, we prepared a survey containing 20 questions, each of which offers a pair of synthesized images, one by DEEPSIM and the other by MAGIC, along with the corresponding input image. The survey asks participants to select the image with higher quality. In every question, each synthesized image was randomly placed at the lower left or lower right of the input image to prevent bias. Severe failure cases of DEEPSIM, e.g., fig. 6.9 b), d), and f), including three object and two scene images, were not included in the survey to further avoid biasing the evaluation. The survey was taken by 120 subjects not involved with the project. According to the survey results shown in table 6.1b, although we removed severe failure cases of DEEPSIM, MAGIC was generally preferred over DEEPSIM on objects, whereas on scenes it was preferred by a very high margin.
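For reference, FID values like those in table 6.1a can be computed as a Fréchet distance between Gaussian fits of Inception features [60]. The sketch below assumes the features (e.g., Inception-v3 pooling activations) have already been extracted for the real and synthesized sets; it is a minimal illustration rather than the evaluation script used here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake, eps=1e-6):
    """Frechet (FID) distance between two feature sets, each of shape (N, D)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)

    # A small ridge on the diagonals keeps sqrtm stable when N is small.
    offset = eps * np.eye(sigma1.shape[0])
    covmean = linalg.sqrtm((sigma1 + offset) @ (sigma2 + offset))
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts introduced by sqrtm
        covmean = covmean.real

    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```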
Figure 6.11: Ghost effect. The original feet of the dog are still visible in the synthesized image (panels: x, x′, y, y′; Input and MAGIC).

Limitations and Failure Cases. The main limitations and failure cases of MAGIC are object removal and ghost effects. The presence of the regularizer ρ_θ(x′, x), which compares statistics of the optimized image and the guide image x, does not enable object removal or extreme scale changes. Also, we sporadically observe faint residual details of the original object. A failure example is shown in fig. 6.11, where the hallucination of the dog's original feet in the synthesized image is apparent. Though the prediction θ_AE(x′) matches the guide mask y′ exactly, as shown on top of the generated image, the patch-based AE incorrectly classifies the dog's legs as background. Thus, we argue that using a robust patch-based AE for manipulation control could resolve the issue.

6.5 Conclusions and Future Work

We proposed MAGIC, an effective method for one-shot mask-guided image synthesis that can find ample applications in advanced image manipulation programs. MAGIC can perform a diverse set of image synthesis tasks, including shape and location control and intense non-rigid shape deformation, using a single training image, its binary segmentation mask, and a guide mask. MAGIC's synthesis capabilities have been judged as competitive with or superior to the state of the art by a pool of more than one hundred surveyees. To the best of our knowledge, MAGIC is the first work that shows the advantage of quasi-robust model inversion for image synthesis. As future work, we plan to investigate theoretically the relationship between a quasi-robust model and sampling from a score-matching generative model [66]. Furthermore, we would like to extend MAGIC by resolving its limitations, including object removal.

Chapter 7
Conclusion and Future Work

7.1 Summary of the Research

In this dissertation, we focused on data-efficient image and vision-and-language classification and image synthesis.

The successive subspace learning (SSL) principle was developed and used to design an interpretable learning model, known as the PixelHop method, for image classification. In Chapter 2, based on the SSL principle, we proposed an improved PixelHop method called PixelHop++. In PixelHop++, one can control the learning model size with fine granularity, offering a flexible tradeoff between the model size and the classification performance. Experimental results demonstrate the flexibility of PixelHop++ on several datasets.

In Chapter 3, we proposed a data-efficient face gender classification model called FaceHop, which is also developed with the successive subspace learning (SSL) principle and built upon the foundation of PixelHop++. This solution finds applications in resource-constrained environments with limited networking and computing. FaceHop has several desired characteristics, including a small model size, a small training data amount, low training complexity, and low-resolution input images. The effectiveness of the FaceHop method for gender classification was demonstrated by experiments on two benchmark datasets.

In Chapter 4, we proposed a high-performance data-efficient low-resolution face recognition model called LRFRHop for resource-constrained environments using the SSL technology. We show that active learning can be conveniently incorporated to reduce the labeling cost even further. We demonstrate the effectiveness of LRFRHop by conducting experiments on two well-known datasets.

In Chapter 5, we proposed a high-performance data-efficient V&L model, BERTHop, for CXR disease diagnosis.
We showed that BERTHop outperforms the state of the art while being trained on a much smaller training set. Our studies verify the effectiveness of the visual feature extractor PixelHop++ and the transformer backbone initialization BlueBERT.

In Chapter 6, we proposed a one-shot mask-guided image synthesis model, MAGIC, and showed that it outperforms the state of the art by conducting quantitative and qualitative experiments, including a subjective evaluation. Our studies and experiments demonstrate the effect of each submodel on the synthesized image. We also illustrated the benefit of quasi-robust model inversion compared with non-robust and strongly robust model inversion for image synthesis.

7.2 Future Work

In the future, we would like to extend our proposed one-shot image synthesis model, MAGIC. In particular, we are interested in improving the results of MAGIC and resolving its limitations and failure cases, including the ghost effect. Furthermore, we would like to incorporate color masks instead of binary masks to be able to edit different segments of an image simultaneously. Another potential future research direction is exploring and extending our model to handle additional tasks, including image inpainting and paint-to-image.

By examining the failure cases of MAGIC, we realized that for images in which there is a significant distribution difference between the segment of interest and the background (e.g., a significant difference in color), the AE does not train well and overfits to the training patches. This phenomenon causes the artifacts we call the ghost effect in synthesized images. To resolve this issue, we can use data augmentation methods, e.g., TPS, to train a more reliable AE.

A binary segmentation mask is used in MAGIC as loose supervision for image editing, but it does not allow multi-region editing within the image. The current MAGIC architecture, with some modifications to the patch-AE submodel, should be capable of using color masks instead of binary masks and enabling multi-region image editing. The resulting model could also be used for paint-to-image by using the training and target paintings, instead of the color masks, for training the AE submodel and synthesizing images.

Bibliography

[1] Synthesia - AI Driven Video Generation. https://www.synthesia.io. Accessed: 2020-04-03. [2] Rahib H Abiyev and Mohammad Khaleel Sallam Ma'aitah. Deep convolutional neural networks for chest diseases detection. Journal of healthcare engineering, 2018, 2018. [3] Gunjan Aggarwal, Abhishek Sinha, Nupur Kumari, and Mayank Singh. On the benefits of models with perceptually-aligned gradients. In ICLR, 2020. [4] Imane Allaouzi and Mohamed Ben Ahmed. A novel approach for multi-label chest x-ray classification of common thorax diseases. IEEE Access, 7:64279–64288, 2019. [5] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018. [6] Grigory Antipov, Moez Baccouche, Sid-Ahmed Berrani, and Jean-Luc Dugelay. Effective training of convolutional neural networks for face-based gender and age prediction. Pattern Recognition, 72:15–26, 2017. [7] Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. In ACM SIGGRAPH 2007 papers. 2007. [8] Enes Ayan and Halil Murat Ünver. Diagnosis of pneumonia from chest x-ray images using deep learning.
In 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineer- ing and Computer Science (EBBT), pages 1–5. Ieee, 2019. [9] Shumeet Baluja and Henry A Rowley. Boosting sex identification performance. Interna- tional Journal of computer vision, 71(1):111–119, 2007. [10] Ankan Bansal, Anirudh Nanduri, Carlos D Castillo, Rajeev Ranjan, and Rama Chellappa. Umdfaces: An annotated face dataset for training deep networks. In 2017 IEEE Interna- tional Joint Conference on Biometrics (IJCB), pages 464–473. IEEE, 2017. [11] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018. 94 [12] Soma Biswas, Kevin W Bowyer, and Patrick J Flynn. Multidimensional scaling for match- ing low-resolution face images. IEEE transactions on pattern analysis and machine intelli- gence, 34(10):2019–2030, 2011. [13] Thierry Bouwmans. Subspace learning for background modeling: A survey. Recent Patents on Computer Science, 2(3):223–234, 2009. [14] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Lay- ton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Ga¨ el Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013. [15] Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vay´ a. Padch- est: A large chest x-ray image dataset with multi-label annotated reports. Medical image analysis, 66:101797, 2020. [16] Dong Cao, Ran He, Man Zhang, Zhenan Sun, and Tieniu Tan. Real-world gender recogni- tion using multi-order lbp and localized multi-boost learning. In IEEE International Con- ference on Identity, Security and Behavior Analysis, pages 1–6. IEEE, 2015. [17] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international confer- ence on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE, 2018. [18] M Castrill´ on-Santana, Javier Lorenzo-Navarro, and Enrique Ram´ on-Balmaseda. Descrip- tors and regions of interest fusion for in-and cross-database gender classification in the wild. Image and Vision Computing, 57:15–24, 2017. [19] Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. Pcanet: A simple deep learning baseline for image classification? IEEE transactions on image processing, 24(12):5017–5032, 2015. [20] Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You, and C. C. Jay Kuo. Defakehop: A light-weight high-performance deepfake detector, 2021. [21] Sheng Chen, Yang Liu, Xiang Gao, and Zhen Han. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. In Chinese Conference on Biometric Recognition, pages 428–438. Springer, 2018. [22] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In Euro- pean Conference on Computer Vision, pages 104–120. Springer, 2020. [23] Yueru Chen and C-C Jay Kuo. Pixelhop: A successive subspace learning (ssl) method for object recognition. Journal of Visual Communication and Image Representation, 70:102749, 2020. 95 [24] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C-C Jay Kuo. 
Pix- elhop++: A small successive-subspace-learning-based (ssl-based) model for image classifi- cation. In 2020 IEEE International Conference on Image Processing (ICIP), pages 3294– 3298. IEEE, 2020. [25] Jingchun Cheng, Yali Li, Jilong Wang, Le Yu, and Shengjin Wang. Exploiting effective facial patches for robust gender recognition. Tsinghua Science and Technology, 24(3):333– 345, 2019. [26] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136, 2018. [27] Zhiyi Cheng, Xiatian Zhu, and Shaogang Gong. Low-resolution face recognition. In Asian Conference on Computer Vision, pages 605–621. Springer, 2018. [28] Kyunghyun Cho, Bart Van Merri¨ enboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. [29] Shih-Han Chou, Wei-Lun Chao, Wei-Sheng Lai, Min Sun, and Ming-Hsuan Yang. Visual question answering on 360deg images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1607–1616, 2020. [30] Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out for large minibatch sgd: Residual network training on imagenet-1k with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291, 2017. [31] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via ran- domized smoothing. In ICML, pages 1310–1320. PMLR, 2019. [32] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), volume 1, pages 886–893. Ieee, 2005. [33] Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Ro- driguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2):304–310, 2016. [34] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. [35] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 96 [36] Changxing Ding and Dacheng Tao. Pose-invariant face recognition with homography-based normalization. Pattern Recognition, 66:144–152, 2017. [37] Matthijs Douze, Herv´ e J´ egou, Harsimrat Sandhawalia, Laurent Amsaleg, and Cordelia Schmid. Evaluation of gist descriptors for web-scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval, pages 1–8, 2009. [38] Mingxing Duan, Kenli Li, Canqun Yang, and Keqin Li. A hybrid deep learning cnn–elm for age and gender classification. Neurocomputing, 275:448–461, 2018. [39] Claudio Ferrari, Giuseppe Lisanti, Stefano Berretti, and Alberto Del Bimbo. Effective 3d based frontalization for unconstrained face recognition. In 2016 23rd International Confer- ence on Pattern Recognition (ICPR), pages 1047–1052. IEEE, 2016. [40] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, train- able neural networks. arXiv preprint arXiv:1803.03635, 2018. 
[41] Kathleen C Fraser, Isar Nejadgholi, Berry De Bruijn, Muqun Li, Astha LaPlante, and Khal- doun Zine El Abidine. Extracting umls concepts from medical text using general and domain-specific deep learning models. arXiv preprint arXiv:1910.01274, 2019. [42] Shiming Ge, Chenyu Li, Shengwei Zhao, and Dan Zeng. Occluded face recognition in the wild by identity-diversity inpainting. IEEE Transactions on Circuits and Systems for Video Technology, 30(10):3387–3397, 2020. [43] Shiming Ge, Shengwei Zhao, Xindi Gao, and Jia Li. Fewer-shots and lower-resolutions: Towards ultrafast face recognition in the wild. In Proceedings of the 27th ACM International Conference on Multimedia, pages 229–237, 2019. [44] Shiming Ge, Shengwei Zhao, Chenyu Li, and Jia Li. Low-resolution face recognition in the wild via selective knowledge distillation. IEEE Transactions on Image Processing, 28(4):2051–2062, 2018. [45] Shiming Ge, Shengwei Zhao, Chenyu Li, Yu Zhang, and Jia Li. Efficient low-resolution face recognition via bridge distillation. IEEE Transactions on Image Processing, 29:6898–6908, 2020. [46] Maryellen L Giger and Kenji Suzuki. Computer-aided diagnosis. In Biomedical information technology, pages 359–XXII. Elsevier, 2008. [47] Ester Gonzalez-Sosa, Julian Fierrez, Ruben Vera-Rodriguez, and Fernando Alonso- Fernandez. Facial soft biometrics for recognition in the wild: Recent works, annotation, and cots evaluation. IEEE Transactions on Information Forensics and Security, 13(8):2001– 2014, 2018. [48] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. 97 [49] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adver- sarial examples. In ICLR, 2015. [50] Niv Granot, Ben Feinstein, Assaf Shocher, Shai Bagon, and Michal Irani. Drop the gan: In defense of patches nearest neighbors as single image generative models. arXiv preprint arXiv:2103.15545, 2021. [51] Will Grathwohl, Kuan-Chieh Wang, J¨ orn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. In ICLR, 2020. [52] Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-pie. Image and Vision Computing, 28(5):807–813, 2010. [53] Quanquan Gu, Zhenhui Li, and Jiawei Han. Joint feature selection and subspace learning. In Twenty-Second International Joint Conference on Artificial Intelligence , 2011. [54] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779, 2020. [55] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In NeurIPS, 2017. [56] Srinivas Gutta, Harry Wechsler, and P Jonathon Phillips. Gender and ethnic classification of face images. In Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 194–199. IEEE, 1998. [57] Hu Han, Anil K Jain, Fang Wang, Shiguang Shan, and Xilin Chen. Heterogeneous face attribute estimation: A deep multi-task learning approach. IEEE transactions on pattern analysis and machine intelligence, 40(11):2597–2609, 2017. [58] Song Han, Huizi Mao, and William J Dally. 
Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015. [59] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [60] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochre- iter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, volume 30, 2017. [61] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. 98 [62] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007. [63] Rui Huang, Shu Zhang, Tianyu Li, and Ran He. Beyond face rotation: Global and local per- ception gan for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pages 2439–2448, 2017. [64] Yuge Huang, Pengcheng Shen, Ying Tai, Shaoxin Li, Xiaoming Liu, Jilin Li, Feiyue Huang, and Rongrong Ji. Improving face recognition from hard samples via distribution distillation loss. In European Conference on Computer Vision, pages 138–154. Springer, 2020. [65] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016. [66] Aapo Hyv¨ arinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005. [67] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016. [68] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 33, pages 590–597, 2019. [69] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017. [70] Taskeed Jabid, Md Hasanul Kabir, and Oksam Chae. Gender classification using local directional pattern (ldp). In 2010 20th International Conference on Pattern Recognition, pages 2162–2165. IEEE, 2010. [71] Anil K. Jain, Sarat C. Dass, and Karthik Nandakumar. Soft biometric traits for personal recognition systems. In David Zhang and Anil K. Jain, editors, Biometric Authentication, pages 731–738, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. [72] Saahil Jain, Akshay Smit, Steven QH Truong, Chanh DT Nguyen, Minh-Thanh Huynh, Mudit Jain, Victoria A Young, Andrew Y Ng, Matthew P Lungren, and Pranav Rajpurkar. Visualchexbert: Addressing the discrepancy between radiology report labels and image la- bels. arXiv preprint arXiv:2102.11467, 2021. [73] Sen Jia and Nello Cristianini. Learning to classify gender from four million images. 
Pattern recognition letters, 58:35–41, 2015. 99 [74] Alistair EW Johnson, Tom J Pollard, Nathaniel R Greenbaum, Matthew P Lungren, Chih- ying Deng, Yifan Peng, Zhiyong Lu, Roger G Mark, Seth J Berkowitz, and Steven Horng. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019. [75] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for gener- ative adversarial networks. In CVPR, pages 4401–4410, 2019. [76] Simran Kaur, Jeremy Cohen, and Zachary C Lipton. Are perceptually-aligned gradients a general property of robust classifiers? arXiv preprint arXiv:1910.08640, 2019. [77] Syed Safwan Khalid, Muhammad Awais, Zhen-Hua Feng, Chi-Ho Chan, Ammarah Farooq, Ali Akbari, and Josef Kittler. Resolution invariant face recognition using a distillation ap- proach. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2(4):410–420, 2020. [78] Beomsu Kim, Junghoon Seo, and Taegyun Jeon. Bridging adversarial robustness and gra- dient interpretability. In ICLR Workshops, 2019. [79] Joerg Kindermann and Alexander Linden. Inversion of neural networks by gradient descent. Parallel computing, 14(3):277–286, 1990. [80] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Re- search, 10:1755–1758, 2009. [81] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014. [82] Hans-Peter Kriegel, Peer Kr¨ oger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1):1, 2009. [83] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Con- necting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73, 2017. [84] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny im- ages. Technical report, Citeseer, 2009. [85] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012. [86] C.-C. Jay Kuo. Understanding convolutional neural networks with a mathematical model. Journal of Visual Communication and Image Representation, 41:406–413, 2016. [87] C.-C. Jay Kuo. The CNN as a guided multilayer RECOS transform [lecture notes]. IEEE Signal Processing Magazine, 34(3):81–89, 2017. 100 [88] C.-C. Jay Kuo and Yueru Chen. On data-driven Saak transform. Journal of Visual Commu- nication and Image Representation, 50:237–246, 2018. [89] C-C Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen. Interpretable convolu- tional neural networks via feedforward design. Journal of Visual Communication and Image Representation, 60:346–359, 2019. [90] Yann LeCun, L´ eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [91] Brian Lee, Syed Zulqarnain Gilani, Ghulam Mubashar Hassan, and Ajmal Mian. Facial gender classification—analysis using convolutional neural networks. In 2019 Digital Image Computing: Techniques and Applications (DICTA), pages 1–8. IEEE, 2019. [92] Gil Levi and Tal Hassner. Age and gender classification using convolutional neural net- works. 
In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 34–42, 2015. [93] Chenyu Li, Shiming Ge, Daichi Zhang, and Jia Li. Look through masks: Towards masked face recognition with de-occlusion distillation. In Proceedings of the 28th ACM Interna- tional Conference on Multimedia, pages 3016–3024, 2020. [94] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, pages 702–716. Springer, 2016. [95] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019. [96] Xianyang Li, Feng Wang, Qinghao Hu, and Cong Leng. Airface: lightweight and efficient model for face recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019. [97] Yan Li, Ruiping Wang, Haomiao Liu, Huajie Jiang, Shiguang Shan, and Xilin Chen. Two birds, one stone: Jointly learning binary code for large-scale face image retrieval and at- tributes prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 3819–3827, 2015. [98] Yikuan Li, Hanyin Wang, and Yuan Luo. A comparison of pre-trained vision-and-language models for multimodal representation learning across medical images and reports. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1999– 2004. IEEE, 2020. [99] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 212–220, 2017. [100] Xiaofeng Liu, Fangxu Xing, Chao Yang, C-C Jay Kuo, Suma Babu, Georges El Fakhri, Thomas Jenkins, and Jonghye Woo. V oxelhop: Successive subspace learning for als disease classification using structural mri. arXiv preprint arXiv:2101.05131, 2021. 101 [101] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015. [102] David G Lowe. Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision, volume 2, pages 1150–1157. Ieee, 1999. [103] Haiping Lu, Konstantinos N Plataniotis, and Anastasios N Venetsanopoulos. A survey of multilinear subspace learning for tensor data. Pattern Recognition, 44(7):1540–1551, 2011. [104] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visi- olinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019. [105] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10437–10446, 2020. [106] Ze Lu, Xudong Jiang, and Alex Kot. Deep coupled resnet for low-resolution face recogni- tion. IEEE Signal Processing Letters, 25(4):526–530, 2018. [107] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018. [108] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In CVPR, pages 5188–5196, 2015. [109] Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. 
IJCV, 120(3):233–255, 2016. [110] Baback Moghaddam and Ming-Hsuan Yang. Learning gender with support faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):707–711, 2002. [111] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper into neural networks, 2015. [112] Sivaram Prasad Mudunuri and Soma Biswas. Low resolution face recognition across varia- tions in pose and illumination. IEEE transactions on pattern analysis and machine intelli- gence, 38(5):1034–1040, 2015. [113] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. 2015. [114] Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiy- ong Lu. Negbio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings, 2018:188, 2018. [115] Yifan Peng, Shankai Yan, and Zhiyong Lu. Transfer learning in biomedical natural language processing: an evaluation of bert and elmo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474, 2019. 102 [116] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018. [117] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017. [118] Rajeev Ranjan, Vishal M Patel, and Rama Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recogni- tion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):121–135, 2017. [119] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Ima- genet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016. [120] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2016. [121] Daniel Riccio, Genny Tortora, Maria De Marsico, and Harry Wechsler. Ega—ethnicity, gender and age, a pre-annotated face database. In 2012 IEEE Workshop on Biometric Mea- surements and Systems for Security and Medical Applications (BIOMS) Proceedings, pages 1–8. IEEE, 2012. [122] Renan A Rojas-Gomez, Raymond A Yeh, Minh N Do, and Anh Nguyen. Inverting adver- sarially robust networks for image synthesis. arXiv preprint arXiv:2106.06927, 2021. [123] Mozhdeh Rouhsedaghat, Masoud Monajatipoor, Zohreh Azizi, and C-C Jay Kuo. Succes- sive subspace learning: An overview. arXiv preprint arXiv:2103.00121, 2021. [124] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, S. You, and C.-C. Jay Kuo. Facehop: A light-weight low-resolution face gender classification method. In ICPR Work- shops, 2020. [125] Mozhdeh Rouhsedaghat, Yifan Wang, Shuowen Hu, Suya You, and C-C Jay Kuo. Low-resolution face recognition in resource-constrained environments. arXiv preprint arXiv:2011.11674, 2020. [126] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi- heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, pages 1–42, 2014. 
[127] Hadi Salman, Andrew Ilyas, Logan Engstrom, Ashish Kapoor, and Aleksander Madry. Do adversarially robust imagenet models transfer better? In NeurIPS, 2020. 103 [128] Aline Gondim Santos, Camila Oliveira de Souza, Cleber Zanchettin, David Macedo, Adri- ano LI Oliveira, and Teresa Ludermir. Reducing squeezenet storage size with depthwise sep- arable convolutions. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–6. IEEE, 2018. [129] Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Image synthesis with a single (robust) classifier. In NeurIPS, 2019. [130] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. [131] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017. [132] Vidya Setlur, Saeko Takagi, Ramesh Raskar, Michael Gleicher, and Bruce Gooch. Auto- matic image retargeting. In Proceedings of the 4th international conference on Mobile and ubiquitous multimedia, pages 59–68, 2005. [133] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory , pages 287– 294, 1992. [134] Mohammad Javad Shafiee, Francis Li, Brendan Chwyl, and Alexander Wong. Squishednets: Squishing squeezenet further for edge device scenarios via deep evolutionary synthesis. arXiv preprint arXiv:1711.07459, 2017. [135] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. Singan: Learning a generative model from a single natural image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4570–4580, 2019. [136] Claude E Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948. [137] Hoo-Chang Shin, Holger R Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jian- hua Yao, Daniel Mollura, and Ronald M Summers. Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging, 35(5):1285–1298, 2016. [138] Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani. Ingan: Capturing and retargeting the ”dna” of a natural image. In ICCV, 2019. [139] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, pages 2107–2116, 2017. [140] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional net- works: Visualising image classification models and saliency maps. In ICLR Workshops, 2014. 104 [141] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [142] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019. [143] Suraj Srinivas and Franc ¸ois Fleuret. Rethinking the role of gradient-based attribution meth- ods for model interpretability. In ICLR, 2021. [144] Vitomir ˇ Struc and Nikola Paveˇ si´ c. Gabor-based kernel partial-least-squares discrimination features for face recognition. Informatica, 20(1):115–138, 2009. [145] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 
Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019. [146] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014. [147] Fariborz Taherkhani, Nasser M. Nasrabadi, and Jeremy Dawson. A deep face identification network enhanced by facial attributes prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018. [148] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014. [149] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019. [150] Matteo Terzi, Alessandro Achille, Marco Maggipinto, and Gian Antonio Susto. Adversarial training reduces information and improves transferability. arXiv preprint arXiv:2007.11259, 2020. [151] Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias Nießner, et al. State of the art on neural rendering. In Computer Graphics Forum, volume 39, pages 701– 727, 2020. [152] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In ICLR, 2019. [153] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of cognitive neuro- science, 3(1):71–86, 1991. [154] Matthew Turk and Alex Pentland. Face recognition using eigenfaces. In Proceedings. 1991 IEEE computer society conference on computer vision and pattern recognition, pages 586– 587, 1991. 105 [155] Yael Vinker, Eliahu Horwitz, Nir Zabari, and Yedid Hoshen. Image shape manipulation from a single augmented training sample. In ICCV, pages 13769–13778, 2021. [156] Shoya Wada, Toshihiro Takeda, Shiro Manabe, Shozo Konishi, Jun Kamohara, and Yasushi Matsumura. A pre-training technique to localize medical bert and enhance biobert. arXiv preprint arXiv:2005.07202, 2020. [157] Kaiye Wang, Ran He, Liang Wang, Wei Wang, and Tieniu Tan. Joint feature selection and subspace learning for cross-modal retrieval. IEEE transactions on pattern analysis and machine intelligence, 38(10):2010–2023, 2015. [158] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016. [159] Pei Wang, Yijun Li, Krishna Kumar Singh, Jingwan Lu, and Nuno Vasconcelos. IMAGINE: Image synthesis by image-guided model inversion. In CVPR, pages 3681–3690, 2021. [160] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catan- zaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. [161] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catan- zaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, pages 8798–8807, 2018. [162] Wenguan Wang, Qiuxia Lai, Huazhu Fu, Jianbing Shen, Haibin Ling, and Ruigang Yang. Salient object detection in the deep learning era: An in-depth survey. TPAMI, 2021. 
[163] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly- supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017. [164] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M Summers. Tienet: Text- image embedding network for common thorax disease classification and reporting in chest x-rays. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9049–9058, 2018. [165] Ronald J Williams. Inverting a connectionist network mapping by backpropagation of error. In 8th Annual Conf. Cognitive Sci. Soc., 1986. [166] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R´ emi Louf, Morgan Funtowicz, et al. Huggingface’s trans- formers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. [167] Jing Wu, William AP Smith, and Edwin R Hancock. Facial gender classification using shape-from-shading. Image and Vision Computing, 28(6):1039–1048, 2010. 106 [168] Xiang Wu, Ran He, Zhenan Sun, and Tieniu Tan. A light CNN for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884– 2896, 2018. [169] Xiang Wu, Ran He, Zhenan Sun, and Tieniu Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884– 2896, 2018. [170] Han Xiao, Kashif Rasul, and Roland V ollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017. [171] Chao Xiong, Xiaowei Zhao, Danhang Tang, Karlekar Jayashree, Shuicheng Yan, and Tae- Kyun Kim. Conditional convolutional neural network for modality-aware face recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3667– 3675, 2015. [172] Hao Ye, Weiyuan Shao, Hong Wang, Jianqi Ma, Li Wang, Yingbin Zheng, and Xiangyang Xue. Face recognition via active annotation and learning. In Proceedings of the 24th ACM international conference on Multimedia, pages 1058–1062, 2016. [173] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014. [174] Junho Yim, Heechul Jung, ByungIn Yoo, Changkyu Choi, Dusik Park, and Junmo Kim. Rotating your face using multi-task deep neural network. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 676–684, 2015. [175] Hongxu Yin, Pavlo Molchanov, Jose M Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K Jha, and Jan Kautz. Dreaming to distill: Data-free knowledge transfer via DeepIn- version. In CVPR, pages 8715–8724, 2020. [176] Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop: An explainable machine learning method for point cloud classification. arXiv preprint arXiv:1907.12766, 2019. [177] Tianyuan Zhang and Zhanxing Zhu. Interpreting adversarially trained convolutional neural networks. In ICML, pages 7502–7511. PMLR, 2019. [178] Zizhao Zhang, Pingjun Chen, Xiaoshuang Shi, and Lin Yang. Text-guided neural network training for image recognition in natural scenes and medicine. IEEE transactions on pattern analysis and machine intelligence, 2019. [179] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 
Unified vision-language pre-training for image captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13041–13049, 2020. [180] Maria Zontak and Michal Irani. Internal statistics of a single natural image. In CVPR, pages 977–984, 2011.
Abstract
Image classification and image synthesis are two fundamental yet challenging tasks in computer vision and pattern recognition and have drawn significant research attention over the last several decades. Image classification models learn to predict the probability of an image belonging to different classes, i.e., they learn the conditional probability distribution p(y|x), where x is the input image and y is a class label. On the other hand, image synthesis models learn the probability distribution of data conditioned on some specific input. With the emergence of Deep Learning (DL) techniques and the availability of large annotated datasets and computational power, classification and generation models have achieved great success. However, in domains in which a large amount of annotated data is not available, such models perform poorly, and building data-efficient models remains a challenge requiring further attention. In this dissertation, we focus on learning-based data-efficient image and vision-and-language classification and image synthesis tasks.
The Successive Subspace Learning (SSL) principle was developed to design an interpretable image classification model, known as the PixelHop. We propose an improved PixelHop method and call it PixelHop++. First, we decouple the joint spatial-spectral input tensor to multiple spatial tensors under the spatial-spectral separability assumption and perform the Saab transform in a channel-wise manner. Second, by performing this operation successively, we construct a channel decomposed feature tree whose leaf nodes contain features of dimension one. Third, a subset of discriminant features is selected based on their cross-entropy values for image classification. PixelHop++ offers a flexible tradeoff between the model size and the classification performance.
For low-resolution face gender classification, we propose a lightweight method, called FaceHop which offers an interpretable machine learning solution. It has desired characteristics such as small model size, small training data, and low training complexity. FaceHop is also developed with the SSL principle and built upon the foundation of PixelHop++. According to our experiments, FaceHop outperforms LeNet-5 in accuracy while LeNet-5 has a 4.5x larger model size.
We propose a high-performance data-efficient low-resolution face recognition model called LRFRHop for resource-constrained environments using SSL technology. SSL offers an explainable non-parametric feature extraction submodel that flexibly trades the model size for the verification performance. Its training complexity is significantly lower than that of DNN-based models since it is trained in a one-pass feedforward manner without backpropagation. Furthermore, active learning can be conveniently incorporated to reduce the labeling cost. We demonstrate the effectiveness of LRFRHop by conducting experiments on two well-known datasets.
Vision-and-Language (V&L) models take image and text as input and learn to capture the associations between them. Prior studies show that pre-trained V&L models can significantly improve the model performance for downstream tasks; however, they are less effective when applied in the medical domain due to the domain gap. We investigate the challenges of applying pre-trained V&L models in medical applications and propose BERTHop, a transformer-based model built on PixelHop++ and BlueBERT, to overcome these limitations and better capture the associations between the two modalities. Experiments on OpenI, a commonly used thoracic disease diagnosis benchmark, show that BERTHop outperforms the state of the art while being trained on a 9x smaller dataset.
One-shot image synthesis focuses on tackling different image synthesis tasks using only a single training image. Existing models in this category either cannot generate realistic results or cannot handle all types of images, including repetitive and non-repetitive ones. We illustrate the limitations of existing models and propose MAGIC, a mask-guided one-shot image synthesis model based on quasi-robust model inversion, which can achieve high-quality results for shape and location control tasks on all types of inputs. By conducting extensive experiments, we show that MAGIC outperforms the state of the art and synthesizes high-quality results for both repetitive and non-repetitive images. Furthermore, we demonstrate the benefit of quasi-robust model inversion compared with non-robust and strongly robust model inversion for image synthesis.
Conceptually similar
Data-driven image analysis, modeling, synthesis and anomaly localization techniques
Advanced techniques for object classification: methodologies and performance evaluation
Efficient machine learning techniques for low- and high-dimensional data sources
Efficient graph learning: theory and performance evaluation
Object classification based on neural-network-inspired image transforms
Green image generation and label transfer techniques
Multimodal image retrieval and object classification using deep learning features
Green learning for 3D point cloud data processing
Advanced features and feature selection methods for vibration and audio signal classification
Labeling cost reduction techniques for deep learning: methodologies and applications
Machine learning methods for 2D/3D shape retrieval and classification
Advanced machine learning techniques for video, social and biomedical data analytics
Syntax-aware natural language processing techniques and their applications
Explainable and green solutions to point cloud classification and segmentation
Word, sentence and knowledge graph embedding techniques: theory and performance evaluation
Representation, classification and information fusion for robust and efficient multimodal human states recognition
Deep generative models for image translation
Speech recognition error modeling for robust speech processing and natural language understanding applications
Advanced knowledge graph embedding techniques: theory and applications
Advanced techniques for human action classification and text localization
Asset Metadata
Creator
Rouhsedaghat, Mozhdeh (author)
Core Title
Data-efficient image and vision-and-language synthesis and classification
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2022-08
Publication Date
08/03/2022
Defense Date
05/12/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
data-efficient,image classification,image synthesis,OAI-PMH Harvest,SSL,vision-and-language classification
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Jenkins, Keith (committee member), Nakano, Aiichiro (committee member)
Creator Email
rouhseda@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111376003
Unique identifier
UC111376003
Legacy Identifier
etd-Rouhsedagh-11084
Document Type
Dissertation
Rights
Rouhsedaghat, Mozhdeh
Type
texts
Source
20220803-usctheses-batch-968 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu