THEORY AND APPLICATIONS OF ADVERSARIAL AND STRUCTURED KNOWLEDGE LEARNING

by

Jiali Duan

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2021

Copyright 2021 Jiali Duan

To my parents, my mentor and my beloved family members.

Table of Contents

Dedication
List of Tables
List of Figures
Abstract

1: Introduction
   1.1 Significance of the Research
   1.2 Related Work
   1.3 Contributions of the Research
       1.3.1 Adversarial Knowledge Learning from Human and Data
       1.3.2 Structured Knowledge Representation via Metric Learning
       1.3.3 Practical and Technical Contributions
   1.4 Organization of the Thesis

2: Robot Learning via Human Adversarial Games
   2.1 Introduction
   2.2 Related Work
   2.3 Problem Statement
   2.4 Approach
   2.5 Learning Robust Grasps
       2.5.1 Grasping Prediction
       2.5.2 Adversarial Disturbance
       2.5.3 Network Architecture
       2.5.4 Network Training
       2.5.5 Simulation Environment
   2.6 From Theory to Users
   2.7 Results
       2.7.1 Analysis
       2.7.2 Multiple objects
   2.8 Conclusion

3: Curriculum Reinforcement Learning Guided by Human
   3.1 Introduction
   3.2 Interactive Curriculum Guided by Human
       3.2.1 Task Repository
       3.2.2 Platform Design
       3.2.3 A Simple Interactive Curriculum Framework
   3.3 Experiments
       3.3.1 Effect of Interactive Curriculum
       3.3.2 Generalization Ability
   3.4 Conclusion

4: PortraitGAN for Flexible Portrait Manipulation
   4.1 Introduction
   4.2 Related Work
   4.3 Proposed Method
       4.3.1 Base Model
       4.3.2 Multi-level Adversarial Supervision
       4.3.3 Texture consistency
       4.3.4 Bidirectional Portrait Manipulation
   4.4 Experimental Evaluation
       4.4.1 Portrait Manipulation
       4.4.2 Perceptual Quality
   4.5 Conclusion

5: Fashion Compatibility Recommendation via Unsupervised Latent Attribute Discovery
   5.1 Introduction
   5.2 Related Work
   5.3 Attribute-Aware Explainable Graph Networks
       5.3.1 Unsupervised Latent Attribute Space Learning
       5.3.2 Graph Filtering Network
       5.3.3 Multi-Task Learning
       5.3.4 Attribute Preference Inference
   5.4 Experiments
   5.5 Results
       5.5.1 Recommendation Performance
       5.5.2 Ablation Study
       5.5.3 Variants of LAEN
       5.5.4 Variants of PPA
       5.5.5 Visualization of Our Model
   5.6 Conclusion

6: SLADE: A Self-Training Framework for Distance Metric Learning
   6.1 Introduction
   6.2 Related Work
   6.3 Method
       6.3.1 Self-Supervised Pre-Training and Fine-Tuning for Teacher Network
       6.3.2 Pseudo Label Generation
       6.3.3 Optimization of Student Network and Basis Vectors
   6.4 Experiments
       6.4.1 Datasets
       6.4.2 Evaluation Criteria
       6.4.3 Implementation Details
       6.4.4 Results
       6.4.5 Ablation study
   6.5 Conclusion

7: Compositional Learning for Weakly Attribute-based Metric Learning
   7.1 Introduction
   7.2 Related Work
   7.3 Method
       7.3.1 Framework Overview
       7.3.2 Compositional Learning
       7.3.3 Attribute-set Synthesis
       7.3.4 Training & Inference
   7.4 Experiments
       7.4.1 Datasets
       7.4.2 Evaluation Criteria
       7.4.3 Baselines
       7.4.4 Implementation Details
       7.4.5 Results
   7.5 Conclusion

8: Conclusion

Bibliography

List of Tables

Table 2.1: Likert items.
Table 2.2: Grasping success rate before (left column) and after (right column) application of random disturbance. Different users interacted with different objects (between-subjects design).
Table 4.1: Quantitative evaluation for generated image. Our model is slightly slower than StarGAN but achieves the best MSE and SSIM.
Table 4.2: Subjective ranking for different models based on perceptual quality. Our model is close to CycleGAN but is much better than StarGAN.
Table 5.1: Comparisons on FITB/Compatibility task over Resampled Maryland Polyvore.
Table 5.2: Comparisons on FITB/Compatibility task over Polyvore Outfits.
Table 5.3: Ablation study of our model on Polyvore Outfits [180].
Table 5.4: Variants of T for Graph Filtering Network on Polyvore Outfits.
Table 5.5: Variants of Latent Attribute Extraction Network (LAEN) on Polyvore Outfits.
Table 5.6: Necessity of Pairwise Preference Attention: Results on the Polyvore-Outfit test set obtained by variants of Preference Selection.
Table 6.1: MAP@R, RP, P@1 (%) on the CUB-200-2011 and Cars-196 datasets. The pre-trained ImageNet model is denoted as ImageNet and the fine-tuned SwAV model on our data is denoted as SwAV. The teacher networks (T1, T2, T3 and T4) are trained with the different losses, which are then used to train the student networks (S1, S2, S3 and S4) (e.g., the teacher T1 is used to train the student S1). Note that the results may not be directly comparable as some methods (e.g., [6, 83, 174]) report the results based on their own frameworks with different settings, e.g., embedding dimensions, batch sizes, data augmentation, optimizer, etc. More detailed explanations are in Section 7.4.5.
Table 6.2: Recall@K (%) on the In-shop dataset.
Table 6.3: Comparison of different weight initialization schemes of the teacher network, where the teacher is trained with a contrastive loss. The results are the final performance of our framework.
Table 6.4: Ablation study of different components in our framework on CUB-200 and Cars-196. The teacher network is trained with a contrastive loss.
Table 6.5: Accuracy of our model in MAP@R, RP and P@1 versus different loss designs on CUB-200.
Table 6.6: Influence of using different numbers of clusters (k) on NABirds, which is used as the unlabeled data for CUB-200.
Table 7.1: Results of Compositional Zero-Shot Learning on MIT-States and UT-Zappos.
Table 7.2: Object-attribute recognition results on two benchmarks.
Table 7.3: Retrieval performance on Fashion200k.
Table 7.4: Results of ablation studies.

List of Figures

Figure 1.1: Examples showing the benefit for learning rich adversarial or structured knowledge from human and data. Left: a robot communicates its goal by generating trajectories that are intuitive to infer for human [39]; Middle: a robust reinforcement learning system that explores adversarial knowledge when training an RL agent [143]; Right: structured representation learning with triplet loss [160].
Figure 1.2: A graphical illustration of generative adversarial network, where the generative model is pitted against an adversary. We generalize this framework for our studies of adversarial knowledge learning in this thesis.
Figure 2.1: An overview of our framework for a robot learning robust grasps by interacting with a human adversary.
Figure 2.2: Selected grasp predictions before (top row) and after (bottom row) training with the human adversary. The red bars show the open gripper position and orientation, while the yellow circles show the grasping points when the gripper has closed.
Figure 2.3: ConvNet architecture for grasping.
Figure 2.4: Participants interacted with a simulated Baxter robot in the customized Mujoco simulation environment.
Figure 2.5: Success rates from Table 2.2 for all five participants and subjective metrics.
Figure 2.6: Success rates from Table 2.2 for each object with (y-axis) and without (x-axis) random disturbances for all five participants.
Figure 2.7: Actions applied by selected human adversaries over time. We plot in green adversarial actions that the robot succeeds in resisting, and in red actions that result in the human 'snatching' the object.
Figure 2.8: The user started assisting the robot in the later part of the interaction, instead of acting as an adversary.
Figure 2.9: Difference between training with user and simulated adversary for the stick object. The simulated adversary explores by applying forces in directions that fail to snatch the object.
Figure 2.10: A force in almost any direction would make the grasp (a) fail, while only a force parallel to the axis of the stick would remove grasp (b).
Figure 3.1: Given specific scenarios during curriculum training, humans can adaptively decide whether to be "friendly" or "adversarial" by observing the progress the agent is able to make.
In cases where performance degrades, a user may flexibly adjust the strategy, as opposed to an automatic assistive agent.
Figure 3.2: Our interactive platform for curriculum reinforcement learning, in which the user is allowed to manipulate the task difficulty via a unified interface (slider and buttons). All three tasks receive only sparse rewards. The manipulable variables for the three environments are respectively the number of red obstacles (GridWorld, top row), the height of the wall (Wall-Jumper, middle row) and the radius of the target (SparseCrawler, bottom row). The task difficulty gradually increases from left to right.
Figure 3.3: General design of our interactive platform and the associations between the environment container, the RL trainer and the interactive interface.
Figure 3.4: To reduce the amount of human effort, we integrate the human-interactive signal into RL parallelization via our interface.
Figure 3.5: "Inertial" problem of an auto-curriculum, which gradually grows the difficulty at a fixed interval. The performance of the auto-curriculum (orange curve) drops significantly when navigation requires jumping over the box first, but the learning inertia prevents it from adapting to the new task. Note that the testing curve is evaluated on the ultimate task unless otherwise stated.
Figure 3.6: A change of skill is required when the height of the wall changes over a certain threshold.
Figure 3.7: Effect of interactive curriculum evaluated on the ultimate task.
Figure 3.8: Generalization ability of interactive curriculum evaluated on a set of tasks. The average performance over these tasks is plotted for different time steps.
Figure 4.1: Overview of the training pipeline: in the forward cycle, the original image I_A is first translated to Î_B given target emotion L_B and modality C, and then mapped back to Î_A given the condition pair (L_A, C') encoding the original image. The backward cycle follows a similar manner starting from I_B but with opposite condition encodings, using the same generator G. Identity preservation and modality constraints are explicitly modeled in our loss design.
Figure 4.2: Multi-level adversarial supervision.
Figure 4.3: Interactive manipulation without constraints. Columns 1st-2nd: original image and auto-detected facial landmarks; 3rd: generated image from 1st-2nd; 4th: manipulated target landmark; 5th: inverse modality generation from 3rd-4th; 6th: photo-to-style generation with landmarks of 5th.
Figure 4.4: Given the leftmost and rightmost face, we first interpolate the middle one (e.g., the 4th one), then we can interpolate the 2nd (with the 1st and 4th) and the 5th (with the 4th and 7th). Lastly, we interpolate the 3rd (with the 2nd and 4th) and the 6th (with the 5th and 7th).
Figure 4.5: Failure cases: the reason could be that facial landmarks do not capture the details of micro-emotions well enough.
Figure 4.6: Comparison with StarGAN and CycleGAN. Images generated by our model exhibit closer texture proximity to the ground truth, due to the adoption of the texture consistency loss.
Figure 4.7: Effect of multi-level adversarial supervision. Left/right: without/with multi-level adversarial supervision.
Please also refer to the supplementary material for the high-resolution (512x512) version.
Figure 4.8: More results for continuous shape edits and simultaneous shape and modality manipulation by PortraitGAN.
Figure 4.9: Left: original image; Right: generated image.
Figure 4.10: Left: original image; Right: generated image.
Figure 4.11: Left: original image; Right: generated image.
Figure 4.12: Left: original image; Right: generated image.
Figure 4.13: Left: original image; Right: generated image.
Figure 4.14: Left: original image; Right: generated image.
Figure 4.15: Left: original image; Right: generated image.
Figure 4.16: Left: original image; Right: generated image.
Figure 5.1: An illustration of interactions between a product's attributes and its context: the blue and pink ellipsoids indicate two latent-attribute metric spaces in which items are compared (e.g., color, shape, style, etc.). The sunglasses in the middle are shared by two different outfits for different reasons.
Figure 5.2: Difference between the conventional (a) global/single feature space and our (b) metric-aware latent attribute space. Each latent space represents a latent attribute space that is learned automatically from data. To make comparisons between items, different weights are assigned to these metric spaces to account for different preferences over the corresponding attributes.
Figure 5.3: The architecture of the Attribute-Aware Explainable Graph Network (AAEG) for fashion compatibility recommendation. LAEN is trained unsupervised to extract metric-aware latent attribute representations, where the latent masks are K learnable embedding functions transforming global features into the corresponding latent attribute space. PPA models the user's preference by computing attention over latent attributes. AAEG accounts for interactions between the user's preference and contexts using graph filtering, trained in a multi-task setting.
Figure 5.4: Distribution of attention weights for item preference in the outfit.
Figure 5.5: An example of t-SNE visualization of our latent attribute space on Polyvore Outfits.
Figure 5.6: Grad-CAM visualization of decision making for attribute preference.
Figure 5.7: Qualitative results of our model for FITB prediction on Maryland Polyvore and Polyvore Outfits. The green box indicates the ground truth, and the scores highlighted in red are our predictions using Eqn 5.16.
Figure 6.1: A self-training framework for retrieval. In the training phase, we train the teacher and student networks using both labeled and unlabeled data. In the testing phase, we use the learned student network to extract embeddings of query images for retrieval.
Figure 6.2: An overview of our self-training framework. Given labeled and unlabeled data, our framework has three main steps.
(1) We first initialize the teacher network using self-supervised learning, and fine-tune it with a ranking loss on labeled data; (2) we use the learned teacher network to extract features, cluster them and generate pseudo labels on unlabeled data; (3) we optimize the student network and basis vectors on labeled and unlabeled data. The purpose of feature basis learning is to select high-confidence samples (e.g., positive and negative pairs) for the ranking loss, so the student network can learn better and reduce over-fitting to noisy samples.
Figure 6.3: Retrieval results on CUB-200 and Cars-196. We show some challenging cases where our self-training method improves Proxy-Anchor [83]. Our results are generated based on the student model S2 in Table 6.1. The red bounding boxes are incorrect predictions.
Figure 7.1: A compositional learning framework for retrieval. We learn a compositional module for attribute manipulation, consisting of a decoupling network and a coupling network, which are responsible for removing and adding attributes respectively. Our model is weakly supervised as it can generalize to attribute configurations unseen during training.
Figure 7.2: An overview of the proposed WAML framework, consisting of two main components. The compositional learning component takes an image and two attributes a_i, a_j sampled from the attribute set and outputs two multi-label features f_X^o, f_Y^o (with attributes i and j respectively) for the second component. The attribute-set synthesis module synthesizes feature embeddings representing similar content but with different attribute configurations through attribute-set operations. The end result is a feature embedding space capable of representing attribute manipulations.
Figure 7.3: Our composition module consists of two invertible functions, the coupling network T+ and the decoupling network T-, aimed at attribute-object factorization in the embedding space. To achieve this goal, we define a set of regularization objectives based on metric learning to enhance the discriminativeness of the model. When the attribute is not present (Case 1), the resulting manipulation through T+ should lead to an embedding farther away than the one operated on with T-. Conversely, if the attribute is present, the coupling operation should lead to an embedding closer than the one produced by the decoupling function (Case 2).
Figure 7.4: The coupling and decoupling networks share the same network structure. They take a specific attribute embedding (e.g., a_i or a_j) to manipulate, resulting in f_X^o and f_Y^o. f^o is the image embedding extracted from ResNet-18.
Figure 7.5: Image retrieval on MIT-States and UT-Zappos after conducting attribute manipulations. The red boxes are incorrect predictions.

Abstract

Deep learning has brought impressive improvements to many fields, including computer vision, robotics, reinforcement learning and recommendation systems, thanks to end-to-end data-driven optimization. However, people have little control over the system during training and limited understanding about the structure of the knowledge being learned.
In this thesis, we study the theory and applications of adversarial and structured knowledge learning: 1) learning adversarial knowledge with human interaction or by incorporating a human in the loop; 2) learning structured knowledge by modelling contexts and users' preferences via distance metric learning.

As more and more deep learning applications migrate from research labs to the real world, it becomes increasingly important to understand the influence humans could exert on these systems and to be able to teach the learning agents the desired behavior. We achieve this through adversarial learning that incorporates humans. In the first research topic, we teach a robotic arm to learn robust manipulation grasps that can withstand perturbations, through end-to-end optimization with a human adversary. Specifically, we formulate the problem as a two-player game with incomplete information, played by a human and a robot, where the human's goal is to minimize the reward the robot can get. In the second research topic, we present a portable, interactive and parallel platform for the human-agent curriculum learning experience. Users are allowed to observe, interact with and customize the learning environment. Results show that reinforcement learning leveraging human adversarial examples for curriculum design outperforms existing state-of-the-art RL methods. In the third research topic, we propose a novel framework that supports continuous edits and multi-modality portrait manipulation using adversarial learning. To ensure photo-realistic synthesis, we adopt a texture loss that enforces texture consistency and multi-level adversarial supervision that facilitates gradient flow. The three applications are unified by extracting adversarial knowledge from data and humans.

At the same time, deep learning systems need to reason about and represent the structure in data in order to better understand the user's intentions. Therefore, we propose to study structured knowledge representation via metric learning. In the fourth research topic, we focus on the need for compatible fashion recommendation, which complements a partial outfit based on the user's preference for the existing items. We propose an unsupervised approach to obtain representations of items in a metric-aware latent attribute space and develop a graph filtering network to model the contextual relationships among items. In the fifth research topic, we present the first self-training framework to improve distance metric learning. The challenge is the noise in pseudo labels, which prevents exploiting additional unlabeled data. We design a new feature basis learning component for the teacher-student network, which better measures pairwise similarity and selects high-confidence pairs. The proposed model significantly improves over the state of the art. In the sixth research topic, we address image-attribute queries, which allow a user to customize image-to-image retrieval by designating desired attributes in target images. We achieve this by adopting a composition module, which enforces object-attribute factorization, and an attribute-set synthesis module to deal with sample insufficiency.

Chapter 1
Introduction

1.1 Significance of the Research

The promise of deep learning is to be able to discover and leverage rich, structured knowledge from data or humans. We argue that it is important to exploit this knowledge during training. For example, the robot in Figure 1.1(a) is equipped with a human-intent model and is cleaning up a dining table together with a human collaborator.
As it is reaching for one of the two remaining objects, the human can infer its goal and reach for the other, because of the intent-expressive trajectory generated by the robot.

Figure 1.1: Examples showing the benefit of learning rich adversarial or structured knowledge from humans and data. Left (a, generating legible motion): a robot communicates its goal by generating trajectories that are intuitive for a human to infer [39]; Middle (b, robust adversarial RL): a robust reinforcement learning system that explores adversarial knowledge when training an RL agent [143]; Right (c, triplet loss): structured representation learning with the triplet loss [160].

In the second example, shown in Figure 1.1(b), adversarial knowledge is explored when training an RL agent, to bridge the gap between simulation and the real world, where policy learning approaches fail to transfer. The key insight behind the success of such a system is to model the differences between training and test scenarios via extra forces/disturbances. In the third example, shown in Figure 1.1(c), the class becomes more discriminative after a structured representation is obtained. The triplet loss pulls the embeddings of the anchor and the positive close together while pushing the embeddings of the anchor and the negative apart.

So far, the most striking successes in deep learning have primarily been based on black-box optimization over large amounts of data. Humans, on the other hand, have had less of an impact during system training, and limited understanding about the knowledge or structure being learned. The difficulty stems from the challenge of incorporating humans into deep learning optimization, and from other technical challenges such as implementation, interpretation, parallelization or computation. In this thesis, we overcome these challenges and present a study on the theory and applications of adversarial and structured knowledge learning. Specifically, we group six research topics into two categories.

Figure 1.2: A graphical illustration of a generative adversarial network, where the generative model is pitted against an adversary. We generalize this framework for our studies of adversarial knowledge learning in this thesis.

In the first category, three research topics are unified under the umbrella of generative adversarial learning, by extracting adversarial knowledge from data and humans. Generative adversarial learning [51] is becoming an increasingly significant field in deep learning. It is based on a game-theoretic scenario in which the generator network must compete against an adversary, as shown in Figure 1.2. In our first research topic, we incorporate the human as an incarnation of the discriminator network. Concretely, we formulate the learning procedure as a two-player game with incomplete information, played by a human and a robot, where the human's goal is to minimize the reward the robot can get by perturbing grasps that are unstable. In fact, the adversarial knowledge obtained through the human adversary enables the robot to outperform counterparts trained in a self-supervised manner or with a simulated adversarial agent. Beyond that, the interactive robotic training platform that we released serves as one of the earliest works on human-robot adversarial learning. In the second research topic, we leverage human adversarial examples in curriculum design to improve the sample efficiency of deep reinforcement learning.
The core idea behind curriculum reinforcement learning is to focus on examples of gradually increasing difficulty that are neither too hard nor too easy [11]. However, questions such as "what metric should be used to quantify task difficulty" or "how should the curriculum be designed" are hard to answer quantitatively. In this work, we demonstrate the effectiveness of adversarial knowledge learned from a human in the loop. We release a portable, parallelizable and interactive platform benchmarking state-of-the-art RL against our algorithm, making it possible for a human to train large-scale RL applications that require millions of samples on a personal computer. In the third research topic, we apply generative adversarial networks to flexible portrait manipulation. Specifically, we developed software that supports continuous facial edits and multi-modality transfer using a single network. With user-customized facial landmarks as a prior, the model is capable of generating 512x512-resolution images.

In the second category, we are motivated by the significance of learning structured knowledge from data. Structured knowledge is often more interpretable, easier to understand and more generalizable to unseen scenarios. The following works are thus unified under the theme of metric learning. In our fourth research topic, we study an interesting real-world problem in industry, where a recommender system is tasked with predicting a compatibility score for a given outfit or complementing a partial outfit in the most appealing manner. To achieve this, we model the user's preferences over different attributes by learning a latent metric space in an unsupervised manner. Then we leverage a graph filtering network to account for the interactions among items when making the compatibility decision.

In the fifth research topic, we address the scalability of metric learning by pioneering research into self-training teacher-student networks that leverage additional unlabeled data. Our solution is to learn a set of basis vectors as "references" to represent classes unseen during training. This proves to be beneficial for reducing the noise in pseudo labels via distribution separation and high-confidence sampling. We demonstrate the effectiveness of our framework by showing significant improvements over the existing state of the art. In the sixth research topic, we study a way to empower image retrieval systems by supporting attribute designations during searches. For example, a user may have access to a red dress but be interested in adding dotted patterns or replacing the red color. A typical approach is to learn a joint visual-textual embedding space using samples with different attribute configurations. However, the number of possible configurations is exponential in the size of the attribute set, making it impossible to collect or annotate in practice. Therefore, we propose a weakly-supervised compositional learning framework which learns object-attribute factorization for better generalization.

The two categories are also related in the sense that structured knowledge often helps lay a solid foundation on which adversarial knowledge can be learned more successfully. In the case of robot learning via human adversarial games, the human prior over the robustness of grasps can be seen as a source of structured knowledge, whereas a simulated adversary is often noisier at the beginning of training. Conversely, we do observe failure cases where users do not behave consistently, which ends up misleading the learning agents.
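As a concrete illustration of the triplet objective referenced in Figure 1.1(c), the sketch below shows a minimal PyTorch-style implementation. It is illustrative only: the margin value, embedding size and variable names are assumptions, not the exact training code used in the later chapters.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull the anchor-positive pair together and push the anchor-negative
    # pair apart until the distance gap exceeds the margin.
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

# Toy usage with random 128-dimensional embeddings (batch of 32).
anchor, positive, negative = (torch.randn(32, 128) for _ in range(3))
print(triplet_loss(anchor, positive, negative).item())
```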
1.2 Related Work

Generative adversarial learning [51, 52] and structured knowledge learning [49, 160, 201] are gaining more and more attention in both academia and industry. In this section, we focus on the papers that are most related to our research, highlighting the progress and limits of existing work.

Learning-based robotic framework. There has been a recent paradigm shift in robotics toward data-driven learning for planning and control. However, labelling data for training learning-based robotic grasping models is expensive. Therefore, [102, 144, 145] adopt self-supervised approaches to generate pseudo-ground-truth for supervision. On the other hand, [142] exploits an adversarial learning framework to reject weak notions of success, where an antagonist robot tries to snatch the object away. Given the success of learning adversarial knowledge, it is interesting to explore what the effect would be if an adversarial agent with domain knowledge were applied.

Curriculum reinforcement learning. A curriculum is designed to ameliorate the learning curve, so as to expedite convergence for reinforcement learning. Representative works include the teacher-student curriculum [124], where the student is an RL agent working on actual tasks while the teacher agent is a policy for selecting tasks; curricula through competitive self-play [8, 169], where the curriculum evolves as the opponent becomes stronger; and automatic goal generation [62], a promising approach that leverages a generator to approximate the distribution of goals at the "right" distance. However, it is unclear whether it generalizes to tasks whose goals are discrete or tasks that require skill transfer between curricula.

Portrait Manipulation with GANs. CycleGAN [219] is a seminal work on translation between two domains. However, each domain transfer requires training a separate generator/discriminator pair. StarGAN [35] approaches this problem by handling multiple domains with a single network, tasking the discriminator with a domain classification objective. However, the classification objective requires that the domains be discrete. We are interested in settings where this assumption does not hold.

Deep Metric Learning. Metric learning is a crucial field for the success of deep learning, which aims to learn structured representations by discovering the optimal distance metric. It lays the foundation for product search [134], image retrieval [128, 166, 200], face verification [69], person re-identification [207] and image-text cross-modal retrieval [192]. Beyond its significance in itself, there has been a growing impact in combining metric learning with self-supervised learning [30, 60], 3D representation learning [135], robotics [137], etc. In this thesis, we focus on extending metric learning to handle complex relationships, exploit unlabeled data, and compose diverse queries by learning object-attribute factorization.

1.3 Contributions of the Research

1.3.1 Adversarial Knowledge Learning from Human and Data

Generative adversarial learning has been attracting broad attention since [51], which simultaneously trains a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. We allow direct human involvement in the following three research topics and explore adversarial learning for human-robot interaction (HRI), by incorporating the human as an incarnation of the discriminative model.
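For reference, the standard adversarial objective of [51] that this formulation generalizes can be written as the following minimax game; this is the textbook form, reproduced here only for illustration and not restated elsewhere in the thesis text:

```latex
\min_{G}\max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```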
• Much work in robotics has focused on "human-in-the-loop" learning techniques that improve the efficiency of the learning process. However, these algorithms have made the strong assumption of a cooperating human supervisor that assists the robot. In reality, human observers tend to also act in an adversarial manner towards deployed robotic systems. We build on this observation to propose a general framework for human-robot adversarial learning. In a manipulation task that involves a user study with 25 subjects (not experts in robotics), we show that grasping success improves significantly when the robot trains with a human adversary as compared to training in a self-supervised manner, in Chapter 2.

• Adversarial knowledge can also be elicited from humans to improve the sample efficiency of curriculum reinforcement learning. In this case, the human is not "strictly" adversarial, but challenges the RL agent in a transitional manner. We first identify the "inertial" problem of automatic curricula and then propose a simple interactive curriculum framework that works in environments that require millions of interactions, in Chapter 3.

• Previous methods have dealt with discrete manipulation of facial attributes such as smiling, sad, angry and surprised, drawn from canonical expressions; they are not flexible and operate in a single modality. We propose a novel framework that supports continuous edits and multi-modality portrait manipulation using adversarial learning, in Chapter 4.

1.3.2 Structured Knowledge Representation via Metric Learning

Learning structured knowledge is crucial for the success of deep learning. Representations induced by metric learning have this desirable structure, where embeddings from the same class are close while embeddings from different classes are far apart [129]. In the following three research topics, we learn structured knowledge representations via metric learning to handle contextual information, exploit unlabeled data, and empower image retrieval systems by supporting attribute designations during searches.

• Recommending compatible items is a novel application in industry. We introduce an unsupervised approach to learn a latent metric space in which items are compared. To model the interactions between different fashion items, we leverage a graph filtering network and design a Pairwise Preference Attention (PPA) module to automatically match the user's preference for each attribute given contextual information, in Chapter 5.

• Most existing distance metric learning approaches use fully labeled data to learn the sample similarities in an embedding space. We present a self-training framework to improve retrieval performance by leveraging additional unlabeled data. To better deal with noisy pseudo labels generated by the teacher network, we design a new feature basis learning component, which is used to measure pairwise similarity and to select high-confidence samples for training the student network, in Chapter 6.

• Image-attribute queries offer a powerful way to express intended searches, by enabling the addition or removal of attributes present in a query image. In the real world, it is often hard to collect visually similar images with different attribute configurations as positive samples, leading to worse generalization. We address this issue by proposing a weakly-supervised compositional learning framework, where we synthesize positive samples with
We show in the zero-shot retrieval setting that our model can generalize to unseen attributes during training in Chapter 7. 1.3.3 Practical and Technical Contributions • We released an interactive robotic platform that supports human-robot adversarial train- ing. Anonymized log files of the human adversarial actions are also publicly available to facilitate future research in Chapter 2. • A portable, parallel and interactive curriculum reinforcement learning system, that sup- ports: 1) Real-time online interaction with flexibility; 2) Parallelizable for human-in-the- loop training; 3) Seamless control between reinforcement learning and human-guided cur- riculum in Chapter 3. • A web application that supports continuous facial edits and multi-modality transfer using a single network. With user customized facial landmark as prior, the model is capable of generating 512x512 resolution images in Chapter 4. • We built a static website showcasing hundreds of qualitative results for fashion compati- bility recommendation in Chapter 5. • Our self-training framework addresses an industrial need of scalability by exploiting unla- beled data. We demonstrated superior results in Chapter 6. 1.4 Organization of the Thesis In Chapter 2, we propose a general framework for robot learning via human adversarial games. In Chapter 3, we introduce our benchmark platform for interactive curriculum reinforcement learning and demonstrate the effectiveness of our algorithm for tasks that involve sparse rewards. In Chapter 4, we show an adversarial learning application that supports flexible portrait manip- ulation, including continuous edits and modality transfer. In Chapter 5, we propose a novel framework for learning structured representation that combines unsupervised metric learning 8 with graph convolution. In Chapter 6, we demonstrate a self-training which scales existing met- ric learning framework by exploiting unlabeled data. In Chapter 7, a novel compositional learn- ing model is presented to empower image-attribute queries, learned weakly-supervised. Finally, concluding remarks and future research directions are given in Chapter 8. 9 Chapter 2 Robot Learning via Human Adversarial Games 2.1 Introduction We focus on the problem of end-to-end learning for planning and control in robotics. For instance, we want a robotic arm to learn robust manipulation grasps that can withstand per- turbations using input images from an on-board camera. Learning such models is challenging, due to the large amount of samples required. For instance, in previous work [145], a robotic arm collected more than 50K examples to learn a grasping model in a self-supervised manner. Researchers at Google [105] developed an arm farm and collected hundreds of thousands of examples for grasping. This shows the power of parallelizing exploration, while it requires a large amount of resources and the system is unable to distinguish between stable and unstable grasps. To improve sample efficiency, Pinto et al. [142] showed that robust grasps can be learned using a robotic adversary: a second arm that applies disturbances to the first arm. By training jointly both the first arm and the adversary, they show that this can lead to robust grasping solutions. This configuration, however, typically requires two robotic arms placed in close proximity to each other. What if there is one robotic arm “in the wild” interacting with the environment, as well as with humans? 
One approach could be to have the human act as a teammate and assist the robot in completing the task. An increasing amount of work [88, 90, 111, 121, 151, 198] has shown the benefits of human feedback in the robot learning process.

Figure 2.1: An overview of our framework for a robot learning robust grasps by interacting with a human adversary.

At the same time, we should not always expect the human to act as a collaborator. In fact, previous studies in human-robot interaction [9, 24, 133] have shown that people, especially children, have acted in an adversarial and even abusive manner when interacting with robots. This work explores the degree to which a robotic arm could exploit such human adversarial behaviors in its learning process. Specifically, we address the following research question:

How can we leverage human adversarial actions to improve the robustness of the learned policies?

While there has been a rich amount of human-in-the-loop learning, to the best of our knowledge this is the first effort on robot learning with adversarial human users. Our key insight is:

By using their domain knowledge in applying perturbations, human adversaries can contribute to the efficiency and robustness of robot learning.

We propose a "human-adversarial" framework where a robotic arm collects data for a manipulation task, such as grasping (see Fig. 2.1). Instead of using humans in a collaborative manner, we propose to use them as adversaries. Specifically, we have the robot learner, and the human attempting to make the robot learner fail at its task. For instance, if the learner attempts to grasp an object, the human can apply forces or torques to remove it from the robot. Contrary to a robot adversary in previous work [142], the human already has domain knowledge about the best way to attempt the grasp, from observing the grasp orientation and from their prior knowledge of the object's geometry and physics. Additionally, here the robot can only observe one output, the outcome of the human action, rather than a distribution of adversarial actions.

Figure 2.2: Selected grasp predictions before (top row) and after (bottom row) training with the human adversary. The red bars show the open gripper position and orientation, while the yellow circles show the grasping points when the gripper has closed.

We implement the framework in a virtual environment, where we allow the human to apply simulated forces and torques on an object grasped by a robotic arm. In a user study we show that, compared to the robot learning in a self-supervised manner, the human user can provide supervision that rejects unstable robot grasps, leading to significantly more robust grasping solutions (see Fig. 2.2). While there are certain limitations on the human adversarial inputs because of the interface, this is an exciting first step towards leveraging human adversarial actions in robot learning.

2.2 Related Work

Self-supervised Deep Learning in Manipulation. In robotic manipulation, deep learning has been combined with self-supervision techniques to achieve end-to-end training [102, 104, 105], for instance with curriculum learning [146]. Other approaches include learning dynamics models through interaction with objects [2]. Most relevant to ours is the work by Pinto et al. [142], where a second robotic arm acts as an adversary that applies disturbances to the grasping arm; in this work, we instead follow a human-in-the-loop approach, where a robotic arm learns robust grasps by interacting with a human adversary.
Reinforcement Learning with Human Feedback. Previous work [4, 54, 87, 89, 90, 116, 121] has also focused on using human feedback to augment the learning of autonomous agents. Specifically, rather than optimizing a reward function, learning agents respond to positive and negative feedback signals provided by a human supervisor. These works have explored different ways to incorporate feedback into the learning process, either as part of the reward function of the agent, such as in the TAMER framework [87], or directly in the advantage function of the algorithm, as suggested by the COACH algorithm [121]. This allows the human to train the agent towards specific behaviors, without detailed knowledge of the agent's decision-making mechanism. Our work is related in that the human affects the agent's reward function. However, the human does not do this explicitly, but indirectly through their own actions. More importantly, the human acts in an adversarial manner, rather than as a collaborator or a supervisor.

Adversarial Methods. Generative adversarial methods [44, 51] have been used to train two models: a generative model that captures the data distribution, and a discriminative model that estimates the probability that a sample came from the training data. Researchers have also analyzed a network to generate adversarial examples, with the goal of increasing the robustness of classifiers [52]. In our case, we let a human agent generate the adversarial examples that enable adaptation of a discriminative model.

Grasping. We focus on generating robust grasps that can withstand disturbances. There is a large amount of previous work on grasping [16, 20], ranging from physics-based modeling [12, 13, 122] to data-driven techniques [105, 145]. The latter have focused on large-scale data collection. Pinto et al. [142] have shown that perturbing grasps by shaking or snatching by a robot adversary can facilitate learning. We are interested in whether this can hold when the adversary is a human user applying forces to the grasped object.

2.3 Problem Statement

We formulate the problem as a two-player game with incomplete information [100], played by a human (H) and a robot (R). We define s ∈ S to be the state of the world. The robot and the human take turns acting. A robot action results in a stochastic transition to a new state s⁺ ∈ S⁺, based on some unknown transition function T : S × A_R → Δ(S⁺). The human then acts based on a stochastic policy π_H(a_H | s⁺), also unknown to the robot. After the human's and the robot's actions, the robot observes the final state s⁺⁺ and receives a reward signal r : (s, a_R, s⁺, a_H, s⁺⁺) ↦ r.

In an adversarial setting, the robot attempts to maximize r, while the human wishes to minimize it. Specifically, we formulate r as a linear combination of two terms: the reward that the robot would receive in the absence of an adversary, and the penalty induced by the human action:

    r = R_R(s, a_R, s⁺) − λ R_H(s⁺, a_H, s⁺⁺)    (2.1)

The goal of the system is to develop a policy π_R : s ↦ a_R that maximizes this reward:

    π_R* = argmax_{π_R} E[ r(s, a_R, a_H) | π_H ]    (2.2)

Through this maximization, the robot implicitly attempts to minimize the reward of the human adversary. In Eq. (2.1), λ controls the proportion of learning from the human's adversarial actions.

2.4 Approach

Algorithm. We assume that the robot's policy π_R is parameterized by a set of parameters W, represented by a convolutional neural network.
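To make Eq. (2.1) concrete before describing the algorithm, the following is a minimal sketch of how the composite reward for one grasp-and-perturbation round could be computed. The function and variable names, and the default value of lam, are assumptions for illustration only, not the released implementation.

```python
def composite_reward(robot_grasp_succeeded: bool,
                     human_snatch_succeeded: bool,
                     lam: float = 1.0) -> float:
    # Eq. (2.1): the robot's grasp reward R_R minus the weighted penalty
    # lam * R_H induced by the human adversary's action.
    r_robot = 1.0 if robot_grasp_succeeded else 0.0   # R_R(s, a_R, s+)
    r_human = 1.0 if human_snatch_succeeded else 0.0  # R_H(s+, a_H, s++)
    return r_robot - lam * r_human

# Robot grasps and resists the perturbation -> reward 1.0
print(composite_reward(True, False))
# Robot grasps but the human snatches the object -> reward 1.0 - lam (0.0 here)
print(composite_reward(True, True))
```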
The robot uses its sensors to receive a state representation s, and samples an action a_R. It then observes a new state s⁺, and waits for the human adversary to act. Finally, it observes the final state s⁺⁺, and computes the reward r based on Eq. (2.1). A new world state is then sampled randomly, as the robot attempts to grasp a potentially different object (Algorithm 1).

Initialization. We initialize the parameters W by optimizing only for R_R(s, a_R, s⁺), that is, for the reward in the absence of the adversary. This allows the robot to choose actions that have a high probability of grasp success, which in turn enables the human to act in response. After training in a self-supervised manner, the network can be refined through interactions with the human.

Algorithm 1: Learning with a Human Adversary
    Initialize parameters W of the robot's policy π_R
    for batch = 1, ..., B do
        for episode = 1, ..., M do
            observe s
            sample action a_R ~ π_R(s)
            execute action a_R and observe s⁺
            if s⁺ is not terminal then
                observe human action a_H and state s⁺⁺
            end if
            observe r given by Eq. (2.1)
            record (s, a_R, r)
        end for
        update W based on the recorded sequence
    end for
    return W

2.5 Learning Robust Grasps

We instantiate the problem in a grasping framework. The robot attempts to grasp an object. The human observes the robot's grasp. If the grasp is successful, the human can apply a force as a disturbance to the object in the robot's hand, in six different directions. In this work, we use a simulation environment to simulate the grasps and the interactions with the human. We use this environment as a testbed for testing different grasping strategies.

2.5.1 Grasping Prediction

Following previous work [145], we formulate grasping prediction as a classification problem. Given a 2D input image I, taken by a camera with a top-down view, we sample N_g image patches. We then discretize the space of grasp angles into N_a different angles. We use the patches as input to a convolutional neural network, which predicts the probability of success for every grasping angle, with the grasp location being the center of the patch. The output of the ConvNet is an N_a-dimensional vector giving the likelihood of each angle. This results in an N_g × N_a grasp probability matrix. The policy then chooses the best patch and angle to execute the grasp. The robot's policy thus takes as input the image I, and outputs the grasp location (x_g, y_g), which is the center of the sampled patch, and the grasping angle θ_g: π_R : I ↦ (x_g, y_g, θ_g).

2.5.2 Adversarial Disturbance

After the robot grasps an object successfully, the human can attempt to pull the object away from the robot's end-effector by applying a force of fixed magnitude. The action space is discrete, with 6 different actions, one for each direction: up/down, left/right, inwards/outwards. As a result of the applied force, the object either remains in the robot's hand, or it is dropped to the ground.

2.5.3 Network Architecture

We use the same ConvNet architecture as previous work [145], modeled on AlexNet [93] and shown in Fig. 2.3. The output of the network is scaled to (0, 1) using a sigmoidal response function.

Figure 2.3: ConvNet architecture for grasping.

2.5.4 Network Training

We initialized the network with a pretrained model from Pinto et al. [145]. The model was pre-trained with completely different objects and patches. To train the model, we treat the reward r that the robot receives as a training target for the network. Specifically, we set R_R(s, a_R, s⁺) = 1 if the robot succeeds and 0 if the robot fails. Similarly, R_H(s⁺, a_H, s⁺⁺) = 1
Similarly,R H (s + ;a H ;s ++ ) = 1 16 96 11 11 conv1 256 5 5 conv2 384 5 5 conv3 384 5 5 conv4 256 5 5 conv5 1 4096 fc6 1 1094 fc7 1 N A fc8 Figure 2.3: ConvNet architecture for grasping. if the human succeeds, and 0 if the human fails. Therefore, based on Eq. (2.1), the signal received by the robot is: r = 8 > > > > < > > > > : 0 if robot fails to grasp 1 if robot succeeds and human fails 1 if human succeeds (2.3) We note that the training target is different than that of previous work [142]. There, the robot has access to the adversary’s predictions. Here, however, the robot can only see only one output, rather than a distribution of possible actions. We then define as loss function for the ConvNet, the binary cross entropy loss between the network’s prediction and the reward received. We train the network using RMSProp [176]. 2.5.5 Simulation Environment For the training, we used the Mujoco [177] simulation environment. We customized the envi- ronment to allow a human user interacting with the physics engine. 1 1 The code is publicly available at: https://github.com/davidsonic/Interactive-mujoco py 17 Figure 2.4: Participants interacted with a simulated Baxter robot in the customized Mujoco simulation environment. 2.6 From Theory to Users We conducted a user study, with participants interacting with the robot in the virtual environment. The purpose of our study is to test whether the robustness of the robot’s grasps can improve when interacting with a human adversary. We are also interested to explore how the object geometry affects the adversarial strategies of the users, as well as how users perceive robot’s performance. Study Protocol. Participants interacted with a simulated Baxter robot in the customized Mujoco simulation environment (See Fig. 2.4). The experimenter told participants that the goal of the study is to maximize robot’s failure in grasping the object. They did not tell participants that the robot was learning from their actions. Participants applied forces to the object using the keyboard. All participants first did a short training phase by attempting to snatch an object from the robot’s grasp 10 times, in order to get accustomed to the interface. The robot did not learn during that phase. Then, participants interacted with the robot executing Algorithm 1. 18 Table 2.1: Likert items. 1. The robot learned throughout the study. 2. The performance of the robot improved throughout the study. In order to keep the interactions with users short, we simplified the task, so that each user trained with the robot on one object only, presented to the robot at the same orientation. We fixed the magnitude of the forces applied to each object, so that the adversary would succeed if the grasp was unstable but fail to snatch the object otherwise. We selected a batch sizeB = 5 and a number of episodes per batchM = 9. The interaction with the robot lasted on average 10 minutes 2 . Manipulated variables. We manipulated (1) the robot’s learning framework and (2) the object that users interacted with. We had three conditions for the first independent variable: the robot interacting with a human adversary, the robot interacting with a simulated adversary [142], and the robot learning in a self-supervised manner, without an adversary. We had five different objects (See Fig. 2.2). We selected objects of varying grasping difficulty and geometry to explore the different strategies employed by the human adversary. Dependent measures. 
For testing we executed the learned policy on the object for 50 episodes, applying a random disturbance after each grasp and recording the success or failure of the grasp before and after the random disturbance was applied. To avoid overfitting, we selected for testing the earliest learned model that met a selection criterion (early-stop) [27]. The testing was done using a script after the conduction of the study, without the participants being present. We additionally asked participants to report their agreement on a seven-point Likert scale to two statements regarding the robot’s learning process (Table 2.1) and justify their answer. Hypotheses H1. We hypothesize that the robot trained with the human adversary will perform better than the robot trained in a self-supervised manner. We base this hypothesis on previous work [142] that 2 The anonymized log files of the human adversarial actions are publicly available at https://github.com/davidsonic/human adversarial grasping data 19 Table 2.2: Grasping success rate before (left column) and after (right column) application of random disturbance. Different users interacted with different objects (between-subjects design). User # Bottle T-shape Half-nut Round-nut Stick 1 64 40 56 42 40 36 58 40 90 62 2 64 40 52 28 40 36 82 48 94 64 3 66 40 56 42 40 36 82 54 92 64 4 74 40 78 60 40 36 52 40 90 62 5 68 40 78 62 40 36 84 48 100 84 Simulated-adversary 60 38 76 54 42 38 54 50 64 54 Self-trained 14 4 52 34 40 36 80 40 50 18 has shown that training with a simulated adversary improved robot’s performance, compared to training in a self-supervised manner. H2. We hypothesize that the robot trained with the human adversary will perform better than the robot trained with a simulated adversary. A human adversary has domain knowledge: they observe the object geometry and have intuition about the physics properties. Therefore, we expect the human to act as a model-based learning agent and use their model to do targeted adversarial actions. On the other hand, the simulated adversary has no such knowledge and they need to learn the outcome of different actions through interaction. Subject allocation. We recruited 25 users, 21 Male and 4 female participants. We followed a between-subjects design, where we had 5 users per object, in order to avoid confounding effects of humans learning to apply perturbations, getting tired or bored by the study. 2.7 Results 2.7.1 Analysis Objective metrics. Table 2.2 shows the success rates for different objects. Different users inter- acted with each object; for instance User 1 for Bottle is a different participant than User 1 for T-shape. We have two dependent variables, the success rate of robot grasping an object in the testing phase in the absence of any perturbations, and the success rate with random perturbations 20 Figure 2.5: Success rates from Table 2.2 for all five participants and subjective metrics. 0 100 Without disturbances 0 100 With disturbances Bottle Human-adv Sim-adv Self-trained 0 100 Without disturbances 0 100 With disturbances T-shape 0 100 Without disturbances 0 100 With disturbances Half-nut 0 100 Without disturbances 0 100 With disturbances Round-nut 0 100 Without disturbances 0 100 With disturbances Stick Figure 2.6: Success rates from Table 2.2 for each object with (y-axis) and without (x-axis) random disturbances for all five participants. being applied. 
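The testing protocol just described can be summarized in a short evaluation loop. The sketch below is schematic only; the environment methods (`reset`, `execute_grasp`, `apply_force`, `object_still_held`) are hypothetical placeholders for the customized Mujoco interface, not its actual API.

```python
import random

DIRECTIONS = ["up", "down", "left", "right", "in", "out"]   # the six disturbance directions

def evaluate(env, policy, episodes=50, seed=0):
    """Success rate of the learned policy before and after a random disturbance.
    All env methods here are hypothetical placeholders for the simulation interface."""
    rng = random.Random(seed)
    grasped = withstood = 0
    for _ in range(episodes):
        obs = env.reset()
        ok = env.execute_grasp(policy(obs))                  # True if the object is lifted
        grasped += int(ok)
        if ok:
            env.apply_force(rng.choice(DIRECTIONS))          # fixed-magnitude random push
            withstood += int(env.object_still_held())
    return grasped / episodes, withstood / episodes          # the two columns of Table 2.2
```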
A two-way multivariate ANOV A [95] with object and framework as indepen- dent variables showed a statistically significant interaction effect for both dependent measures: (F (16; 38) = 3:07;p = 0:002; Wilks’ = 0:19). In line with H1, a Post-hoc Tukey tests with Bonferroni correction showed that success rates were significantly larger for the human adversary condition than the self trained condition, both with (p < 0:001) and without random disturbances (p = 0:001). We note that the post-hoc analysis should be viewed with caution, because of the significant interaction effect. To interpret these results, we plot the mean success rates for all conditions (See Fig. 2.5). For clarity, we also contrast both success rates for each object separately in Fig. 2.6. Indeed, we see the the success rate averaged over all human adversaries was higher for three out of five objects. The difference was largest for the bottle and the stick. The reason is that it was easy for the self-trained policy to pick up these objects without a robust grasp, which resulted in 21 0 10 20 # Interactions Left Right Up Down In Out Actions Bottle (a) 0 5 10 # Interactions Left Right Up Down In Out Actions T-shape (b) 0 10 20 # Interactions Left Right Up Down In Out Actions Half-nut (c) 0 20 # Interactions Left Right Up Down In Out Actions Round-nut (d) 0 10 20 # Interactions Left Right Up Down In Out Actions Stick (e) Figure 2.7: Actions applied by selected human adversaries over time. We plot in green adver- sarial actions that the robot succeeds in resisting, and in red actions that result in the human ‘snatching’ the object. slow learning. On the other hand, the network trained with the human adversary rejected these unstable grasps, and learned quickly robust grasps for these objects. In contrast, round nut and half-nut objects could be grasped robustly at the curved areas of the object. The self-trained network thus got “lucky” finding these grasps, and the difference was negligible. In summary, these results lead to the following insight: Training with a human adversary is particularly beneficial for objects that have few robust grasp candidates that the network needs to search for. There were no significant differences between the rates in the human adversary and simulated adversary condition. Indeed, we see that the mean success rates were quite close for the two conditions. We expected the human adversary to perform better, since we hypothesized that the human adversary has a model of the environment, which the simulated adversary does not have. Therefore, we expected the human adversarial actions to be more targeted. To explain this result, which does not support H2, we look at human behaviors below. Behaviors. Fig. 2.7 shows the disturbances applied over time for different users. Observing the participants behaviors, we see that some participants used their model of the environment to apply disturbances effectively. Specifically, the user in Fig. 2.7(b) applied a force outwards in the T-shape, succeeding in ‘snatching’ the object even at the first try, which is indicated by the red dots. Gradually, the robot learned a more robust grasping policy, which resulted in the user failing to snatch the object (green dots). Similarly, the user in Fig. 2.7(a) and Fig. 2.7(c) used targeted perturbations which resulted in failed grasps from the very start of the task. 
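For reference, the statistics reported above could be reproduced with standard tooling. The sketch below uses statsmodels on an illustrative, synthetically generated table (the real analysis would use the per-model success rates behind Table 2.2), and the Tukey HSD call is shown as one common post-hoc choice.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in data: one row per trained model, with both dependent measures.
rng = np.random.default_rng(0)
objects = ["bottle", "t-shape", "half-nut", "round-nut", "stick"]
frameworks = ["human-adv", "sim-adv", "self-trained"]
rows = [(o, f, rng.integers(40, 100), rng.integers(20, 80))
        for o in objects for f in frameworks for _ in range(5)]
df = pd.DataFrame(rows, columns=["object", "framework", "rate_plain", "rate_dist"])

# Two-way MANOVA: both success rates as dependent variables, object x framework as factors.
fit = MANOVA.from_formula("rate_plain + rate_dist ~ C(object) * C(framework)", data=df)
print(fit.mv_test())                                   # reports Wilks' lambda, F and p values

# A post-hoc comparison across frameworks for one of the measures (Tukey HSD shown here).
print(pairwise_tukeyhsd(df["rate_dist"], df["framework"]))
```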
22 0 5 10 15 20 # Interactions Left Right Up Down In Out Actions Figure 2.8: The user started assisting the robot in the later part of the interaction, instead of acting as an adversary. In some cases, such as in Fig. 2.8, the user adapted their strategy as well: when the robot learned to withstand an adversarial action outwards, the user acted by applying a force to the right, until the robot learned that as well. Fig. 2.9 compares the user of Fig. 2.7(e) with the simulated adversary for the same object (stick). We observe that the simulated adversary explores different perturbations that are unsuc- cessful in snatching the object. This translates to worse performance for that object in the testing phase. However, not all grasps required an informed adversary for the grasp to fail. For instance, for the grasped bottle in Fig. 2.10(a), there were many different directions where an applied force could succeed in removing the object. Therefore, having a model of the environment did not offer a significant benefit, since almost any disturbance would succeed in dropping the object. On the contrary, only targeted disturbances in the direction parallel to the axis were effective in removing the stick (See Fig. 2.10(b)), which explains the difference in performance with the simulated adversary. Subjective metrics. We conclude our analysis with reporting the users’ subjective responses (See Fig. 2.5). A Cronbach’s = 0:86 showed good internal consistency [17]. Participants generally agreed that the robot learned throughout the study, and that its performance improved. 23 0 10 20 # Interactions Left Right Up Down In Out Actions Human 0 20 # Interactions Left Right Up Down In Out Actions Sim-adv Figure 2.9: Difference between training with user and simulated adversary for the stick object. The simulated adversary explores by applying forces in directions that fail to snatch the object. (a) (b) Figure 2.10: A force in almost any direction would make the grasp (a) fail, while only a force parallel to the axis of the stick would remove grasp (b). In their open-ended responses, participants stated that “The robot learned the grasping technique to win over me by learning from the forces that I provided and became more robust,” and that “The robot took almost 8 to 10 runs before it would start responding well. By the end of my experiment, it would grasp almost all the time.” At the same time, one participant stated that 24 the “rate of improvement seemed pretty slow,” and another that it “kept making mistakes even towards the end.” 2.7.2 Multiple objects We wish to test whether our framework can leverage human adversarial actions to learn grasping multiple objects at the same training session. Therefore, we modified the experiment setup, so that in each episode one of the five objects appeared randomly. To increase task difficulty, we additionally randomized the object’s position and orientation in every episode. The robot then trained with one of the authors of the paper for 200 episodes. We then tested the trained model for another 200 episodes with randomly selected objects of random positions and orientations, as well as randomly applied disturbances. The trained model achieved a 52% grasping success rate without disturbances, and 34% success rate with disturbances. The rates were higher than those of a simulated adversary trained in the same environment for the same number of episodes, which had 28% grasping success rate without disturbances and 22% with disturbances. 
We find this result promising, since it indicates that targeted perturbations from a human expert can improve the efficiency and robustness of robot grasping. 2.8 Conclusion Limitations. Our work is limited in many ways. Our experiment was conducted in a virtual environment, and the users’ adversarial actions were constrained by the interface. Our environ- ment provides a testbed for different human-robot interaction algorithms in manipulation tasks, but we are also interested in exploring what types of adversarial actions users apply in real-world settings. We also focused on interactions with only one human adversary; a robot “in the wild” is likely to interact with multiple users. Previous work [142] has shown that training a model with different robotic adversaries further improves performance, and it is worth exploring whether the same holds for human adversaries. 25 Implications. Humans are not always going to act cooperatively with their robotic counterparts. This work shows that from a learning perspective, this is not necessarily a bad thing. We believe that we have only scratched the surface of the potential applications of learning via adversarial human games: Humans can understand stability and robustness better than learned adversaries, and we are excited to explore human-in-the-loop adversarial learning in other tasks as well, such as obstacle avoidance for manipulators and mobile robots. 26 Chapter 3 Curriculum Reinforcement Learning Guided by Human 3.1 Introduction A curriculum organizes examples in a more meaningful order which illustrates gradually more concepts, and more complex ones so that humans and animals can learn better [11]. When com- bined with reinforcement learning, it’s been shown that a curriculum can improve convergence or performance compared to learning from the target task from scratch [46, 53, 173]. For example, [46] learns to reach goals with sparse rewards by following reverse curriculum, generated by start states that grow increasingly far from the goal. [7, 8] explores adversarial self-play to automatically evolve curriculum in playing sumo, football etc. [124, 147] formalizes teacher-student transfer learning framework, where a student agent works on actual tasks while a teacher network is tasked with selecting tasks. The most related work to ours is [61], which shows empirically how a rich environment can help to promote the learning of complex behavior without explicit reward guidance. The key to reap the advantage of a curriculum strategy, according to previous works [11, 62], is that the most beneficial examples are those that are neither “too hard” nor “too easy”. One question that naturally arises then is how to design a metric to quantify how hard a task is so that we can sort tasks accordingly? A line of work deals with this problem by proposing curriculum automatically through another RL agent, such as teacher-student framework [124, 147], self- play [7, 8, 169] or goal-gan [62]. One way of interpreting these approaches is that curriculum evolves through the adversarial nature between the two agents, similar to GAN [40, 51]. 27 Figure 3.1: Given specific scenarios during curriculum training, humans can adaptively decide whether to be “friendly” or “adversarial” by observing the progress the agent is able to make. In cases where performance degrades, a user may flexibly adjust the strategy as opposed to an automatic assistive agent. 
However, it’s not always possible to formulate a curriculum through self-play (e.g., NPC in Wall-Jumper task) or require that the parameter space be continuous for adversarial learning (e.g, GridWorld task). In these cases, automatic curriculum most commonly used transitions from easy to hard. Secondly, it’s possible that in certain cases the primary agent can learn better when another agent present is “friendly” rather than “adversarial”, as in [154]. Compared to an automatic agent, human has an innate ability to improvise and adapt when confronted with different scenarios, which we utilize to provide expalinability and flexibility for curriculum rein- forcement learning. In Figure 3.1, a user is able to intuitively understand the learning progress and dynamically manipulate the task difficulty by changing the height of the wall. As learning agents move from research labs to the real world, it becomes increasingly impor- tant for human users especially those without programming skills, to teach agents desired behav- ior. A large amount of work focuses on imitation learning [65, 145, 153, 159], where demonstra- tions from the expert act as direct supervision. Humans can also interactively shape training with only positive or negative reward signals [88] or combine manual feedback with rewards from MDP [1, 89]. A recent work formulates human-robot interaction as an adversarial game [43] and shows improvement of grasping sucess and robustness when the robot trains with a human adversary. 28 In this paper, we aim to close the loop between these two fields, by studying the effect of interactive curriculum on reinforcement learning. To achieve this, we have designed three challenging environments that are nontrivial to solve even for state-of-the-art RL method [161], which we describe in the next Section. The rest of the paper is organized as follows. In Section 3.2, Our interactive curriculum platform is introduced, with which we identify the “inertial” problem in an “easy-to-hard” auto- matic curriculum. In Section 3.3, We show results of user studies on our environments that require millions of interactions. Finally, we conclude and discuss future work in Section 7.5. Figure 3.2: Our interactive platform for curriculum reinforcement learning, in which the user is allowed to manipulate the task difficulty via a unified interface (slider and buttons). All three tasks receive only sparse rewards. The manipulable variable for the three environments are respectively the number of red obstacles (GridWorld, Top row), the height of wall (Wall-Jumper, Middle row) and the radius of the target (SparseCrawler, Bottom row). The task difficulty grad- ually increases from left to right. 3.2 Interactive Curriculum Guided by Human 3.2.1 Task Repository Figure 3.2 shows our released environments for curriculum reinforcement learning 1 , where the task difficulty can be manipulated by users. As shown in the figure, at certain training phase, the 1 Demos and interactive executable have been made publicly available at: https://github.com/davidsonic/interactive-curriculum-reinforcement-learning 29 user formulated curriculum in a way that was neither too hard nor too easy for the agent, so as to maximize the efficiency and quality trade-off. During interaction, the user can pause, play or save the current configuration. The locations of the objects in the arena are customizable with cursor and the height of the wall is tunable for difficulty transitions. 
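Since the platform builds on the Unity ML-Agents toolkit [78], a difficulty parameter of this kind can be driven from the Python side through ML-Agents' environment-parameters side channel. The snippet below is a minimal sketch; the executable name and the "wall_height" key are assumptions for illustration, not the thesis's actual interface.

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.environment_parameters_channel import (
    EnvironmentParametersChannel,
)

params = EnvironmentParametersChannel()
env = UnityEnvironment(file_name="WallJumper", side_channels=[params])

params.set_float_parameter("wall_height", 2.0)   # start with an easy setting
env.reset()
# ... train for a while, inspect the agent's progress, then adjust the difficulty:
params.set_float_parameter("wall_height", 5.5)   # takes effect on the next reset
env.reset()
env.close()
```

Because each parallel environment instance can hold its own channel, a single action in the user interface can update many instantiations at once, which is what makes human-in-the-loop training feasible for tasks needing millions of interactions.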
Our interactive interface are same for the rest environments listed below. Grid-World The agent (represented as a blue square) is tasked to reach the goal position (green plus), by navigating through obstacles (red cross, maximally 5). All objects are randomly spawn on a 2D plane. A positive reward 1 for reaching the goal, negative 1 for cross and -0.01 for each step. Movements are in cardinal directions. Wall-Jumper The goal is to navigate a wall (maximum height is 8), by jumping or (possibly) leveraging a block (white box). Positive reward of 1 for successful landing on the goal location (green mat) or negative 1 for falling outside or reaching maximum allowed time. A penalty of -0.0005 for each step taken. The observation space is 74 dimensional, corresponding to 14 ray casts each detecting 4 possible objects, plus the global position of the agent and whether or not the agent is grounded. Allowed actions include translation, rotation and jumping. Sparse-Crawler A crawler is an agent with 4 arms and 4 forearms. The aim is to reach a randomly located target on the ground (maximum radius of 40). The state is a vector of 117 variables corresponding to position, rotation, velocity, and angular velocities of each limb plus the acceleration and angular acceleration of the body. Actions space is of size 20, corresponding to target rotations for joints. Only sparse reward is provided, when the target is reached. 3.2.2 Platform Design Although a multitude of reinforcement learning environments already exist [22, 78, 172], none of them allow human-in-the-loop training, partly due to two challenges. First, environments such as Gym [22] and DM Control Suite [172] are inflexible for neither procedural content generation nor human-interaction, due to the need to hand-craft configuration files before rendering; Sec- ond, it’s not clear how training process can persist during interaction or how long it should wait before resuming. We resolve the above two challenges by building upon the popular Unity game 30 engine [78] and propose an interactive curriculum training framework for general reinforcement learning. Compared to existing RL environments [22, 78, 172], Our interactive platform is built with three goals in mind: 1) Real-time online interaction with flexibility; 2) Parallelizable for human-in-the-loop training; 3) Seamless control between reinforcement learning and human- guided curriculum. To achieve the first goal, an event-driven environment container is run separated from the training process, allowing user to send a control signal (e.g., UI control, scene layout, task difficulty) to the environment at any time during training via interactive interface. At the same time, RL trainer is able to update the network parameters for controlling different groups of agents through the communicator, accessible via a Python API (see Figure 3.3). Figure 3.3: General design of our interactive platform and associations between environment container with RL trainer as well as interactive-interface. 31 To achieve similar efficiency as automatic training, we integrate human-interactive signal into RL parallelization. On the one hand, centeralized SGD update with decentralized experi- ence collection is performed as agents of the same kind share the same network policy [127]. 
On the other hand, we enable controlling environment parameters in different instantiations simultaneously via a unified interactive interface, which makes it possible to solve tasks that require millions of interactions (See Figure 3.4). To the best of our knowledge, this is the first endeavor that makes it possible for users to interact with RL agents in a complex task setting.
Figure 3.4: To reduce the amount of human effort, we integrate the human-interactive signal into RL parallelization via our interface.
For the third goal, we display real-time instructions and allow users to inspect learning progress when designing the curriculum. In Figure 3.4, users are prompted when it is time to interact or evaluate. The agent's rewards during the last and current evaluations are shown for the user's reference.
3.2.3 A Simple Interactive Curriculum Framework
Curriculum reinforcement learning is an adaptation strategy that improves RL training by ordering a set of related tasks to be learned [11]. The most natural ordering is to gradually increase the task difficulty with an automatic curriculum. However, as shown in Figure 3.5(a), the auto-curriculum quickly mastered skills when walls are low but failed to adapt when a dramatic change of skill is required (See Figure 3.6(b)), leading to a degradation of performance on the ultimate task (See Figure 3.5(b)).
Figure 3.5: "Inertial" problem of an auto-curriculum that grows the difficulty at a fixed interval. (a) Training curve; (b) testing curve. The performance of the auto-curriculum (orange curve) drops significantly when navigation requires jumping over the box first, but the learning inertia prevents it from adapting to the new task. Note that the testing curve is evaluated on the ultimate task unless otherwise stated.
Figure 3.6: A change of skill is required when the height of the wall exceeds a certain threshold. (a) Low wall; (b) high wall.
The reason is that the agent must use the box to clear a high wall, in contrast to low-wall scenarios, where the additional steps needed to locate the box are penalized. Our results corroborate what [11] observed in their curriculum for supervised classification: the curriculum should be designed to focus on "interesting" examples. In our case, a curriculum that stayed at an easy level for the first 3M steps "overfitted" to the previous skill and prevented the agent from adapting. Although a comprehensive set of IF-ELSE rules is possible, real-world situations can be arbitrarily complex, so adaptive behavior guided by a human is desirable.
Following this spirit, we test the ability of a human interactive curriculum using a simple framework (Algorithm 2), where the human (function H) provides feedback by adjusting the task difficulty at fixed intervals in the training loop (i.e., after evaluating the agent's learning progress at the current difficulty, the user can choose to make the task easier or harder, or leave it unchanged). We show in the next section that with this simple interactive curriculum, tasks that are otherwise unsolvable can be guided towards success by a human, with the additional property of better generalization.
Algorithm 2 Human-Guided Interactive Curriculum
  Initialize difficulty
  while step <= total_steps do
    pi_new = Train(pi_old, difficulty)
    if step % interval == 0 then
      difficulty = H(pi_new, difficulty)
    end if
    pi_old = pi_new
  end while
  return pi_new
3.3 Experiments
We train the agents for three competitive tasks using the training method described previously.
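For concreteness, the training method referred to here (Algorithm 2) can be summarized as a short loop. In the sketch below, `train` and `ask_human` are hypothetical placeholders for the PPO trainer and the interactive interface, respectively.

```python
def human_guided_curriculum(train, ask_human, total_steps, interval, difficulty=0.0):
    """Sketch of Algorithm 2. `train` runs the RL trainer for `interval` steps at the
    given difficulty and returns the updated policy plus its evaluation return;
    `ask_human` displays that return and lets the user make the task easier, harder,
    or leave it unchanged. Both callables stand in for the real trainer and UI."""
    policy, step = None, 0
    while step < total_steps:
        policy, eval_return = train(policy, difficulty, interval)
        difficulty = ask_human(eval_return, difficulty)
        step += interval
    return policy
```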
Our aim is to show that human-in-the-loop interactive curriculum are capable of leveraging human prior during adaptation which allows agents to build on past experiences. For all our experiments, we fix the interaction interval (e.g, 0, 0.1, 0.2,...,0.9 of the total steps) and allow users to inspect learning progress twice before adjusting the curriculum. The user can either choose to make it easier, harder or unchanged. Our baseline is PPO with the optimized param- eters as in [78]. We train GridWorld, Wall-Jumper and SparseCrawler for 50K, 5M and 10M steps respectively. 3.3.1 Effect of Interactive Curriculum In Section 3.2.1, we introduced three challenging tasks due to sparsity of rewards. For example in Figure 3.7(a), we observed that agents which learn from scratch (green and red curves) had 34 (a) GridWorld (obstacles of 5) (b) Wall-Jumper (height of 8) (c) SparseCrawler (radius of 40) Figure 3.7: Effect of interactive curriculum evaluated on the ultimate task. (a) GridWorld (obstacles from 1 to 5) . (b) Wall-Jumper (heights from 0 to 8) (c) SparseCrawler (radius from 5 to 40) Figure 3.8: Generalization ability of interactive curriculum evaluated on a set of tasks. The average performance over these tasks is plotted for different time steps. little chance of success with obstacles scattered around the grid, thus failing to reinforce any desired behavior. On the other hand, users were able to gradually load or remove obstacles by inspecting the learning progress. Eventually, the models trained with our framework are able to solve GridWorld with 5 obstacles present. Inspired by this, we further tested our framework on SparseCrawler task (See Figure 3.7(c)), which requires 10M steps of training. Thanks to our parallel design (Section 3.2.1), we were able to reduce the training time from 10 to 3 hours during which users would interact 10 times. When trained with dynamically moved targets of increasing radius, we found that crawlers gradually learned to align themselves toward the right direction. In the Wall-Jumper task (See Figure 3.7(b)), we noticed a variance of performance given different users. One run (blue curve) outperformed learning from scratch with an obvious mar- gin while another run (orange curves) performed less well but still converged as learning from scratch. Nevertheless, both of the two trials are much better than an auto-curriculum that suffers from over-fitting as described in Section 3.2.3. 35 3.3.2 Generalization Ability Over-fitting to a particular dataset is a common problem in supervised learning. Similar prob- lems can occur in reinforcement learning when there’s no or little variation in the environment. To deal with this problem, we had considered: 1) randomness in terms of how grid is generated; layout of blocks and jumpers; locations of crawlers and targets. 2) entropy regularization in our PPO implementation, making a strong baseline. We compare models trained with our framework with ones trained from scratch in three envi- ronments with a set of tasks. For example, in GridWorld the agents were tested with the number of obstacles increasing from 1 to 5. In Wall-Jumper, the heights of the wall rise from 0 to 8 discretely during testing and in SparseCrawler, the radius of the moving target transitions from 5 to 40 with a span of 5 (See Figure 3.8). One common observation is that our model consis- tently outperforms learning from scratch. 
Secondly, there’s a large gap between the curves from curriculum model and learning from scratch (See Figure 3.8(a)), indicating that they “warm-up” more quickly with easy tasks than directly jumping into the difficult task. This is analogous to how human learns by building on past experiences. Interestingly, the curves eventually congre- gate in Wall-Jumper (See Figure 3.8(b)), for both curriculum model and scratch model. Finally, we observed that the performance of our model in SparseCrawler (See Figure 3.8(c)) continually arose and reached the target with 1 to 2 more success, as opposed to Wall-Jumper environment. The reason is because we would reset the environment in SparseCrawler only when it reaches the maximum time steps in a single round. When performing qualitative tests, our model solves the GridWorld with varying obstacles whereas scratch model fails when the number of obstacles exceeds 3. For Wall-Jumper, our model is able to reach the goal with minimum steps while the scratch model would inevitably use the block, necessary only for heights over 6.5. In the SparseCrawler environment, our model has a faster moving speed and a greater numbers of success whereas the scratch model could only reach proximal targets. 36 3.4 Conclusion In this paper, we released three new environments that are challenging to solve (sparse reward, transfer between skills and large amount of training up to 10M steps), with varying curricu- lum space (discrete/continuous). With this environment, we identified a phenomenon of over- fitting in auto-curriculum that leads to deteriorating performance during skill transfer. Then, We proposed a simple interactive curriculum framework facilitated by our unified user interface. Experiment shows the promise of a more explainable and generalizable curriculum transition by involving human-in-the-loop, on tasks that are otherwise nontrivial to solve. For future work, we would like to explore the effect of human function on the final performance and to provide more source as reference for users’ decision-making. 37 Chapter 4 PortraitGAN for Flexible Portrait Manipulation 4.1 Introduction Our digital age has witnessed a soaring demand for flexible, high-quality portrait manipulation, not only from smart-phone apps but also from photography industry, e-commerce and movie pro- duction etc. Portrait manipulation is a widely studied topic [5, 34, 47, 91, 170, 175] in computer vision and computer graphics. From another perspective, many computer vision problems can be seen as translating image from one domain (modality) to another, such as colorization [103], style transfer [48, 71, 77, 117],image inpainting [15] and visual attribute transfer [108] etc,. This cross-modality image-to-image translation has received significant attention [74, 219] in the community. In this paper, we define different styles as modalities and try to address multi- modality transfer using a single model. In terms of practical concern, transfer between each pair of modalities as opposed to SPADE [139] or GANPaint [10] whose manipulation domain is fixed. Recently, generative adversarial networks have demonstrated compelling effects in synthesis and image translation [29, 74, 81, 178, 205, 219], among which [208, 219] proposed cycle- consistency for unpaired image translation. In this paper, we extend this idea into a conditional setting by leveraging additional facial landmarks information, which is capable of capturing intri- cate expression changes. 
Benefits that arise with this simple yet effective modifications include: First, cycle mapping can effectively prevent many-to-one mapping [219, 220] also known as mode-collapse. In the context of face/pose manipulation, cycle-consistency also induces identity preserving and bidirectional manipulation, whereas previous method [5] assumes neutral face to begin with or is unidirectional [119, 148], manipulating in the same domain. Second, face 38 images of different textures or styles are considered different modalities and current landmark detector will not work on those stylized images. With our design, we can pair samples from multiple domains and translate between each pair of them, thus enabling landmark extraction indirectly on stylized portraits. Our framework can also be extended to makeups/de-makeups, aging manipulation etc. once corresponding data is collected. In this work, we leverage [77] to generate pseudo-targets i.e., stylized faces to learn simultaneous expression and modality manip- ulations, but it can be replaced with any desired target domains. However, there remain two main challenges to achieve high-quality portrait manipulation. We propose to learn a single generatorG as in [35]. But StarGAN [35] deals with discrete manip- ulation and fails on high-resolution images with irremovable artifacts. To synthesize images of photo-realistic quality (512x512), we propose multi-level adversarial supervision inspired by [193, 215] where synthesized images at different resolution are propagated and combined before being fed into multi-level discriminators. Second, to avoid texture inconsistency and arti- facts during translation between different domains, we integrate Gram matrix [48] as a measure of texture distance into our model as it is differentiable and can be trained end-to-end using back propagation. Fig. 4.8 shows the result of our model. Extensive evaluations have shown both quantitatively and qualitatively that our method is comparable or superior to state-of-the-art generative models in performing high-quality portrait manipulation. Our model is bidirectional, which circumvents the need to start from a neutral face or a fixed domain. This feature also ensures stable training, identity preservation and is easily scalable to other desired domain manipulations. In the following section, we review related works to ours and point out the differences. Details of PortraitGAN are elaborated in Section 7.3. We evaluate our approach in Section 4.4 and conclude the paper in Section 4.5. 39 4.2 Related Work Face editing Face editing or manipulation is a widely studied area in the field of computer vision and graphics, including face morphing [19], expression edits [99, 168], age progres- sion [82], facial reenactment [5, 18, 175]. However, these models are designed for a particular task and rely heavily on domain knowledge and certain assumptions. For example, [5] assumes neutral and frontal faces to begin with while [175] employs 3D model and assumes the availabil- ity of target videos with variation in both poses and expressions. Our model differs from them as it is a data-driven approach that does not require domain knowledge, designed to handle general face manipulations. Image translation Our work can be categorized into image translation with generative adver- sarial networks [29, 67, 74, 112, 193, 208], whose goal is to learn a mapping G :X ! b Y that induces an indistinguishable distribution to target domainY, through adversarial training. For example, Isola et al. 
[74] takes image as a condition for general image-to-image transla- tion trained on paired samples. Later, Zhu et.al [219] builds upon [74] by introducing cycle- consistency loss to obviate the need of matched training pairs. In addition, it alleviates many- to-one mapping during training generative adversarial networks also known as mode collapse. Inspired by this, we integrate this loss into our model for identity preservation between different domains. Another seminal work that inspired our design is StarGAN [35], where target facial attributes are encoded into a one-hot vector. In StarGAN, each attribute is treated as a different domain and an auxiliary classifier used to distinguish these attributes is essential for supervising the training process. Different from StarGAN, our goal is to perform continuous edits in the pixel space that cannot be enumerated with discrete labels. This implicitly implies a smooth and continuous latent space where each point in this space encodes meaningful axis of variation in the data. We treat different style modalities as domains in this paper and use two words interchangeably. In this sense, applications like beautification/de-beautification, aging/younger, with beard/without beard can also be included into our general framework. We compare our 40 approach against CycleGAN [219] and StarGAN [35] during experiments and illustrate in more details about our design in the next section. Landmark Guided Generation In [150], an offline interpolation process is adopted for gen- erating face boundary map, to be used for GMM clustering and as conditional prior. There are two key differences: 1) the number of new expressions depends on clustering, possibly not con- tinuous; 2) boundary heat map is estimated offline. In [194], facial landmarks are represented as V AE encoding for GAN. In contrast, the major goal of our framework is to support online, flexible, even interactive in user experience, which is why we process and leverage landmarks in a different way, as a channel map. There are also works that use pose landmarks as condition for person image generation [98, 148, 163, 189]. For example [119] concatenates one-hot pose feature maps in a channel-wise fashion to control pose generation. Different from our approach, each landmark constitutes one channel. In [152], keypoints and segmentation mask of birds are used to manipulate locations and poses of birds. To synthesize more plausible human poses, Siarohin et.al [163] develop deformable skip connections and compute a set of affine transformations to approximate joint deformations. These works share some similarity with ours as both facial landmark and human skeleton can be seen as a form of pose representation. However, the above works deal with manipulation in the original domain and does not preserve identity. Style transfer Exemplar guided neural style transfer was first proposed by Gatys et al. [48]. The idea is to preserve content from the original image and mimic “style” from a reference image. We adopt Gram matrix in our model to enforce pattern consistency. We apply a fast neural style transfer algorithm [77] to generate pseudo targets for multi-modality manipulations. Another branch of work [36, 139] try to model style distribution in another domain which is in favor of one-to-many mapping in the target domain, or collection style transfer [72]. 
41 Figure 4.1: Overview of training pipeline: In the forward cycle, original imageIA is first trans- lated to c IB given target emotionLB and modalityC and then mapped back to c IA given con- dition pair (LA,C 0 ) encoding the original image. The backward cycle follows similar manner starting fromIB but with opposite condition encodings using the same generatorG. Identity preservation and modality constraints are explicitly modeled in our loss design. 4.3 Proposed Method Problem formulation Given domainsX 1 ;X 2 ;X 3 ;:::X n of different modalities, our goal is to learn a single general mapping function G :X i !X j ;8i;j2f1; 2; 3;:::ng (4.1) that transformsI A from domain A toI B from domain B in a continuous manner. Eqn 4.1 implicitly implies thatG is bidirectional given desired conditions. We use facial landmarkL j 2 R 1HW to denote facial expression in domainj. Facial expressions are represented as a vector of 2D keypoints with N = 68, where each point u i = (x i ;y i ) is the ith pixel location in L j . We use attribute vectorc = [c 1 ;c 2 ;c 3 ;:::c n ] to represent the target domain. Formally, our input/output are tuples of the form (I A ;L B ;c B )=(I B ;L A ;c A )2R (3+1+n)HW . Model architecture The overall pipeline of our approach is straightforward, shown in Fig- ure 4.1 consisting of three main components: (1) A generator G(I;L;c), which renders an 42 input face in domainc 1 to the same person in another domainc 2 given conditional facial land- marks. G is bidirectional and reused in both forward as well as backward cycle. First map- pingI A ! c I B ! c I A and then mapping backI B ! c I A ! c I B given conditional pair (L B ;c B )=(L A ;c A ). (2) A set of discriminatorsD i at different levels of resolution that distin- guish generated samples from real ones. Instead of mappingI to a single scalar which signifies “real” or “fake” , we adopt PatchGAN [71] which uses a fully convnet that outputs a matrix where each element M i;j represents the probability of overlapping patch ij to be real. If we trace back to the original image, each output has a 70 70 receptive field. (3) Our loss func- tion takes into account identity preservation and texture consistency between different domains. In the following sections, we elaborate on each module individually and then combine them together to construct PortraitGAN. 4.3.1 Base Model To begin with, we consider manipulation of emotions in the same domain, i.e.I A andI B are of same texture and style, but with different face shapes denoted by facial landmarksL A andL B . Under this scenario, it’s sufficient to incorporate only forward cycle and conditional modality vector is not needed. The adversarial loss conditioned on facial landmarks follows Eqn 4.2. L GAN (G;D) =E I B [log(D(I B )] +E (I A ;L B ) [log(1D(G(I A ;L B )))] (4.2) A face verification loss is desired to preserve identity betweenI B and c I B =G(I A ;L B ). How- ever in our experiments, we find` 1 loss to be enough and it’s better than` 2 loss as it alleviates blurry output and acts as an additional regularization [74]. L id (G) =E (I A ;L B ;I B ) jjI B G(I A ;L B )jj 1 (4.3) The overall loss is a combination of adversarial loss and` 1 loss, weighted by. 43 L base =L GAN (G;D) +L id (G) (4.4) 4.3.2 Multi-level Adversarial Supervision Manipulation at a landmark level requires high-resolution synthesis, which is challenging [50], because it’s harder to optimize. Here we use two strategies for improving generation quality and training stability. 
First our conditional facial landmark acts as an additional constraint for generation. Second, we adopt a multi-level feature matching loss [48, 155] to explicitly requireG to match statistics of real data thatD finds most discriminative at feature level as follows. L FM (G;D k ) =E (I A ;I B ) T X i=1 1 N i kD i k (I B )D i k (G(I A ;L B ))k 1 (4.5) we denote the ith-layer feature extractor of discriminator D k as D i k , where T is the total number of layers andN i denotes the number of elements in each layer. Figure 4.2: Multi-level adversarial supervision 44 Third, we provide fine-grained guidance by propagating multi-level features for adversarial supervision (Figure 4.2). Cascaded upsampling layers inG are connected with auxiliary con- volutional branches to provide images at different scales ( d I B1 ; d I B2 ; d I B3 ::: d I Bm ), wherem is the number of upsampling blocks. These images are fed into discriminators at different scalesD k . Applying it to Eqn 4.4 we get, L multi = X k [L GAN (G;D k ) +L FM (G;D k )] + L id (G) (4.6) Compared to [215], our proposed discriminators responsible for different levels are optimized as a whole rather than individually for each level. The increased discriminative ability fromD k in turn provides further guidance when trainingG (Eqn 4.6). 4.3.3 Texture consistency When translating between different modalities in high-resolution, texture differences become easy to observe. Inspired by [48], we let k I;L be the vectorized kth extracted feature map of imageI from neural network at layerL.G I;L 2R is defined as, G I;L (k;l) =< k I;L ; l I;L >= X i k I;L (i) l I;L (i) (4.7) where is the number of feature maps at layerL and k I;L (i) isith element in the feature vector. Eqn 4.7 also known as Gram matrix can be seen as a measure of the correlation between feature mapsk andl, which only depends on the number of feature maps, not the size ofI. For image I A andI B , the texture loss at layerL is, L L texture ( c I B ;I B ) =jjG c I B ;L G I B ;L jj 2 (4.8) 45 where c I B =G(I A ;L B ). We obtain obvious improvement in quality of texture in cross-modality generation and we use pretrained VGG19 for texture feature extraction in our experiments with its parameters frozen during optimization. 4.3.4 Bidirectional Portrait Manipulation To transfer to a target domainX , an additional one-hot encoding vectorc2R n is conditioned as input. Specifically, each element is first replicated spatially into sizeHW and then concate- nated with image and landmark along the channel axis. The only change to previous equations is that instead of taking (I A ;L B ) as input, the generator G now takes (I A ;L B ;c), where c indicates the domain whereI B belongs to. To encourage bijection between mappings in different modality manifold and to prevent mode collapse, we adopt cycle-consistency structure similar to [219], which consists of a forward and a backward cycle, for both generating directions. L cyc (G) =E (I A ;L B ;c;c 0 ) [jjG(G(I A ;L B ;c);L A ;c 0 )I A jj] 1 +E (I B ;L A ;c;c 0 ) [jjG(G(I B ;L A ;c 0 );L B ;c)I B jj] 1 (4.9) wherec andc 0 encodes modality forI B andI A respectively. Note that only one set of gener- ator/discriminator is used for bidirectional manipulation. Our final optimization objective for PortraitGAN is as follows, L PortraitGAN =L multi A!B +L multi B!A +L cyc +L texture (4.10) where, controls the weight for cycle-consistency loss and texture loss respectively. 
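As a reference implementation of the texture term, the Gram matrix of Eq. (4.7) and the texture loss of Eq. (4.8) can be written in a few lines of PyTorch. The sketch below assumes the inputs are feature maps already extracted from frozen VGG19 layers, and in practice the loss would be summed over the chosen layers (conv1_1 through conv5_1, as noted in Section 4.4).

```python
import torch

def gram_matrix(feat):
    """feat: (B, C, H, W) feature maps from a fixed VGG19 layer. Returns the (B, C, C)
    Gram matrices, i.e. the inner products between pairs of feature channels (Eq. 4.7)."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2))

def texture_loss(feat_fake, feat_real):
    """L2 distance between the Gram matrices of generated and target features (Eq. 4.8)."""
    return torch.norm(gram_matrix(feat_fake) - gram_matrix(feat_real))
```

Because the Gram matrix only records correlations between feature channels, this loss is insensitive to where patterns occur spatially, which is why it acts as a texture constraint rather than a content constraint.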
46 4.4 Experimental Evaluation Our goal in this section is to test our model’s capability in 1) continuous shape editing; 2) simul- taneous modality transfer. We also created testbed for comparing our model against two closely related SOTA methods [35, 219], though they do not support either continuous shape editing and multi-modality transfer directly. The aim is to provide quantitative and qualitative analysis in terms of perceptual quality. Additionally, we also conducted ablation studies for our compo- nents. Implementation Details Each training step takes as input a tuple of four images (I A ,I B ,L A , L B ) randomly chosen from possible modalities of the same identity. Attribute conditional vector, represented as a one-hot vector, is replicated spatially before channel-wise concatenation with corresponding image and facial landmarks. Our generator uses 4 stride-2 convolution layers, followed by 9 residual blocks and 4 stride-2 transpose convolutions while auxiliary branch uses one-channel convolution for fusion of channels. We use two 3-layer PatchGAN [71] discrimina- tors for multi-level adversarial supervision and Least Square loss [123] for stable training. Layer conv1 1-conv5 1 of VGG19 [164] are used for computing texture loss. We set,, , as 2, 10, 5, 10 to ensure that loss components are at the same scale. There are four styles used in our experiment, for training a unified deep model for shape and modality manipulation. The training time for PortraitGAN takes around 50 hours on a single Nvidia 1080 GPU. Dataset We collected and combined the following three emotion dataset for experiments and performed a 7/3 split based on identity for training and testing. 1) The Radboud Faces Database [97] contains 4,824 images with 67 participants, each performing 8 canonical emo- tional expressions: anger, disgust, fear, happiness, sadness, surprise, contempt, and neutral. 2) iCV Multi-Emotion Facial Expression Dataset [118] is designed for micro-emotion recognition, which includes 31,250 facial expressions from 125 subjects performing 50 different emotions. 3) We also collected 20 videos of high-resolution from Youtube (abbreviated as HRY Dataset) containing 10 people giving speech or talk. For the above dataset, we use dlib [84] for facial 47 Figure 4.3: Interactive manipulation without constraints. Column 1st-2nd: Original image and auto-detected facial landmarks; 3rd: generated image from 1st-2nd; 4th: manipulated target landmark; 5th: inverse modality generation from 3rd-4th; 6th: photo to style generation with landmarks of 5th. landmark extraction and [77] for generating portraits of multiple styles. Extracted landmarks and stylized images correspond to groundtruthL B andI B respectively for Equation 4.5. Comparison Protocol CycleGAN [219] is considered state-of-the-art in image translation and is closely related to our work in terms of consistency-loss design. StarGAN [35] is also related because it supports multiple attribute transfer using a single generator. However, direct compar- ison is not possible since none of the two approaches support continuous shape edits. Therefore, to compare with CycleGAN, we use the following pipeline: Given image pairfI A ,I B g, which are from domain A and B, CycleGAN translatesI A to c I B , which has content fromI A and modality fromI B . This can be achieved with our approach with landmarkL A unchanged. 
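As a side note on the implementation details above, the conditional input described in Section 4.3 (image, landmark map, and spatially replicated one-hot domain code concatenated along the channel axis) can be assembled as in the following sketch; the tensor shapes follow the (3 + 1 + n) × H × W formulation, while the function name is illustrative.

```python
import torch

def build_generator_input(image, landmarks, domain, n_domains):
    """image: (B, 3, H, W); landmarks: (B, 1, H, W) landmark map; domain: (B,) LongTensor
    of target-domain indices. Returns the (B, 3 + 1 + n_domains, H, W) tensor fed to G,
    with the one-hot domain code replicated spatially before channel-wise concatenation."""
    b, _, h, w = image.shape
    onehot = torch.zeros(b, n_domains, device=image.device)
    onehot[torch.arange(b), domain] = 1.0
    code = onehot.view(b, n_domains, 1, 1).expand(b, n_domains, h, w)
    return torch.cat([image, landmarks, code], dim=1)
```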
To compare with StarGAN, we train StarGAN on discrete canonical expressions and compare it with our approach which is conditioned on facial landmarks. 4.4.1 Portrait Manipulation Our model is sensitive for edits in eyebrows, eyes, and mouth but less so for nose. The reason is because there’s little change in nose shape in our collected database. Nevertheless, our model is able to handle continuous edits because of abundant variations of expressions in data. For 48 example, in Figure 4.3 of the paper, the 1st-row achieves face-slimming as a result of pulling landmarks for left (right) cheeks inward, even though there’s no slim-face groundtruth as training data. Similarly, the 2nd-row of Figure 4.3 shows the mouth fully-closed by merging landmarks for upper and down lips. These two results were obtained with a web tool we developed for interactive portrait manipulation 1 , where users can manipulate facial landmarks manually and evaluate the model directly. Figure 4.4: Given leftmost and rightmost face, we first interpolate the middle one (e.g., the 4th one), then we can interpolate 2nd (with 1st and 4th) and 5th (with 4th and 7th). Lastly, we interpolate 3rd (with 2nd and 4th) and 6th (with 5th and 7th). Another example for continuous edits is face interpolation. Our model is capable of gener- ating new facial expressions unseen for a certain person. For example, given two two canoni- cal expressions (e.g., surprise and smile), we can interpolate 2 a neutral expression in between through interpolate their facial landmarks. The granularity of face edits depends on the gap between two facial landmarks. Here we show a more challenging case, where we interpolate five intermediate transitions given only two real faces. In this case, the quality of the 3rd face is dependent on previous generations (i.e., after the 2nd and 4th fake faces are generated). In Figure 4.4, our model can gradually transition a surprise emotion to a smile emotion, beyond canonical emotions. 1 The tool will be released at: https://github.com/davidsonic/Flexible-Portrait-Manipulation 2 Please refer to supplementary material for more details 49 Compared to discrete conditional labels, facial landmark gives full freedom for continuous shape editing. As can be seen, our model integrates two functions into a single model: shape edits (when modality is fixed) and style transfer (when landmark is fixed). Not only that, our model supports bidirectional transfer using a single generator, i.e., from natural domain to stylis- tic domain (1st column to 3rd column or from 5th to 6th) or from stylistic domain to natural domain (3rd column to 5th column). The user can manipulate in any domain and can generate edited shapes in another domain immediately. For example, the 1st row successfully performed simultaneous face-slimming and stylistic transfer. Figure 4.5: Failure cases: The reason could be that facial landmarks don’t capture well enough details of micro-emotions. However, there does exists some failure cases, which generally happen in iCV dataset. In Figure 4.5, we tried to manipulate landmark in order to change the original expression (1st column) into groundtruth (4th column) but failed. The closest generated result we can get is shown in the 3rd column. As can be seen, the generated picture fails to mimic the intricate expression displayed in groundtruth. Given that iCV is a micro-emotion dataset, our guess is that 68 landmark is not sufficient for capturing subtle expressions. 
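The landmark interpolation used in these examples amounts to blending the two 68-point sets. The sketch below shows plain linear interpolation, whereas the thesis additionally chains generations (the 3rd face is rendered only after the 2nd and 4th have been generated).

```python
import numpy as np

def interpolate_landmarks(lm_a, lm_b, n_steps):
    """lm_a, lm_b: (68, 2) arrays of facial keypoints for the two real expressions.
    Returns n_steps intermediate landmark sets; each, together with the source image,
    is fed to the generator to render the in-between expression (cf. Fig. 4.4)."""
    ts = np.linspace(0.0, 1.0, n_steps + 2)[1:-1]    # drop the two endpoints
    return [(1.0 - t) * lm_a + t * lm_b for t in ts]
```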
An overview of manipulation results are shown in Figure 4.8. Some interesting generations were observed. For example, our model seems to be capable of learning some common knowl- edge, i.e., teeth is hallucinated when mouth is open (1st row, 4th-6th column), after we manip- ulate the landmarks along the edge of mouth. It’s also surprising that our model can preserve obscure details such as earrings (5th row, 4th-6th column). We also notice some artifacts during 50 translation (3rd-4th row, 8th column), The reason is due to the challenge in handling emotion changes and multi-modal transfer with a single model. Having said that, our framework shows promising results in trying to address both simultaneously. For high-resolution (512x512) syn- thesis, please refer to Figure 4.9,4.10,4.11,4.12,4.13,4.14,4.15,4.16 As can be seen, our model is able to manipulate expression and style based on landmark prior of the target emotion with photo-realistic effect. We refer readers to the supplementary material for more qualitative results. We will also release a website showcasing more results in original resolution (512*512) on Github 3 . Ablation study Each component is crucial for the proper performance of the system, which we demonstrate through qualitative figures and quantitative numbers in Table 4.1. First multi- level adversarial loss is essential for high-resolution generation. As can be seen in Figure 4.7, face generated with this design exhibits more fine-grained details and thus more realistic. In Table 4.1, SSIM drops 1.6% without this loss. Second, texture loss is crucial for pattern sim- ilarity during modality transformation. As shown in Figure 4.6, PortraitGAN generates more consistent textures compared to StarGAN and CycleGAN. In Table 4.1, SSIM drops 3.6% if without. Last but not least,L cyc andL id help preserve identity. Method MSE# SSIM" inference time(s)# CycleGAN 0.028 0.473 0.365 StarGAN 0.029 0.483 0.263 Ours wo/L multi +L texture 0.028 0.472 0.271 Ours wo/L multi 0.011 0.639 0.277 Ours wo/L texture 0.013 0.619 0.285 Ours 0.011 0.655 0.290 Table 4.1: Quantitative evaluation for generated image. Our model is slightly slower than Star- GAN but achieves the best MSE and SSIM. 3 https://github.com/davidsonic/Flexible-Portrait-Manipulation.git 51 Figure 4.6: Comparison with StarGAN and CycleGAN. Images generated by our model exhibit closer texture proximity to groundtruth, due to adoption of texture consistency loss. 4.4.2 Perceptual Quality Quantitative Analysis We incorporated 1000 images (500 stylized and 500 natural) to con- duct quantitative analysis. For generative adversarial network, two widely used metric for image quality is MSE and SSIM, between the generated image and groundtruth. For MSE, the lower means more fidelity to groundtruth, and for SSIM the higher the better. Table 4.1 shows quan- titatively results between CycleGAN, StarGAN and our approach. As can be seen, our method achieves the best MSE and SSIM score while maintaining relatively fast speed. Subjective user study As pointed out in [74], traditional metrics should be taken with care when evaluating GAN, therefore we adopt the same evaluation protocol as in [35, 74, 193, 219] 52 Figure 4.7: Effect of multi-level adversarial supervision. Left/Right: wo/w multi-level adversar- ial supervision. Please also refer to the supplementary material for the high-resolution (512x512) version. for human subjective study generated images. 
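For completeness, the two quantitative metrics used above can be computed with standard image-quality utilities; the sketch below assumes a recent scikit-image and images normalized to [0, 1].

```python
import numpy as np
from skimage.metrics import mean_squared_error, structural_similarity

def image_quality(generated, groundtruth):
    """generated, groundtruth: float arrays in [0, 1] of shape (H, W, 3).
    Lower MSE and higher SSIM indicate closer fidelity to the ground truth."""
    mse = mean_squared_error(groundtruth, generated)
    ssim = structural_similarity(groundtruth, generated, channel_axis=-1, data_range=1.0)
    return mse, ssim
```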
We collect responses from 10 users (5 experts, 5 non-experts) based on their preferences about images displayed at each group in terms of perceptual realism and identity preservation. Each group consists of one photo input and three randomly shuffled manipulated images generated by CycleGAN [219], StarGAN [35] and our approach. We conducted two rounds of user study where the 1st round has a time limit of 5 seconds while 2nd round is unlimited. There are in total 100 images and each user is asked to rank three methods on each image twice. Our model gets the best score among three methods as shown in Table 4.2. Method (%) 1st round 2nd round Average StarGAN 31.2 32.3 31.75 CycleGAN 33.0 33.5 33.25 Ours 35.8 34.2 35.0 Table 4.2: Subjective ranking for different models based on perceptual quality. Our model is close to CycleGAN but is much better than StarGAN. 4.5 Conclusion We present a flexible portrait manipulation framework that integrates continuous shape edits and modality transfer into a single adversarial framework. To overcome the technical challenges, we 53 Figure 4.8: More results for continuous shape edits and simultaneous shape and modality manip- ulation results by PortraitGAN. proposed to condition on facial landmark as input and designed a multi-level adversarial super- vision structure for high-resolution synthesis. Beyond photo quality, our loss function also takes 54 into account identity and texture into consideration, verified by our ablation studies. Experi- mental results show the promise of our framework in generating photo-realistic and supporting flexible manipulations. For future work, we would like to improve on the stability of training. 55 Figure 4.9: Left: original image; Right: generated image. Figure 4.10: Left: original image; Right: generated image. 56 Figure 4.11: Left: original image; Right: generated image. Figure 4.12: Left: original image; Right: generated image. 57 Figure 4.13: Left: original image; Right: generated image. Figure 4.14: Left: original image; Right: generated image. 58 Figure 4.15: Left: original image; Right: generated image. Figure 4.16: Left: original image; Right: generated image. 59 Chapter 5 Fashion Compatibility Recommendation via Unsupervised Latent Attribute Discovery 5.1 Introduction The prevalence of online shopping has led to increasing demands for exploiting the user’s prefer- ences. Fashion compatibility prediction, being one form among them, requires to recommend the most “suitable” item to complete a partial outfit. For instance, an outfit with two pairs of shoes are clearly not suitable. Moreover, clothes usually consists of different attributes (e.g., color, shape, style, etc.) while the significance of each attribute usually changes as outfit changes. For example in Figure 5.1, the color of the sunglasses plays a crucial role when fitting it with the men’s wear on the left while less so when it’s paired with the women’s wear on the right. The blue and pink ellipsoid indicates two attribute metric space in which items are compared. In previous works, the performance of fashion recommendation has been enhanced via var- ious metric learning [171, 182, 183]. Veit et al. [183] adopted Siamese-network for comparing a pair of heterogeneous co-occurring instances and later proposed to leverage the triplet net- work [160] to account for different similarity conditions [182]. Vasileva et al. [180] attempted to get a better comparison between a pair of items by enumerating all possible combinations of category subspace. 
Although these models are able to perform multi-faceted pairwise compar- isons, they fail to dive deep into exploring the interactions between user’s preference and their contexts. In general, there are two challenges involved in current fashion compatibility recommenda- tion. First, when choosing the most compatible item, people usually care both the role of each 60 Figure 5.1: An illustration of interactions between a product’s attributes and its context: the blue and pink ellipsoids indicate two latent-attribute metric space in which items are compared (e.g., color, shape, style, etc.). The sunglasses in the middle is shared by two different outfits for different reasons. item in the entire outfit but also its factored attributes. For instance in Figure 5.1 we may ask ourselves questions such as “Does the sunglasses play a more important role than the bottom in outfit compatibility?” and “Given the black men’s wear, is ‘color’ of the sunglasses more informative than its ‘shape’?” Second, clothing attributes are not static and usually evolve given the user’s aesthetic tastes or change of social occasions (i.e., context). Moreover, it is difficult to collect fine-grained attribute annotations manually. These two aspects make it hard to generate explainable compatibility recommendations with current models. To address the above challenges, we propose a novel Attribute-Aware Explainable Graph Network (AAEG) for fashion compatibility recommendation. In AAEG, we introduce a factor- ized Latent Attribute Space learned unsupervised, where each latent space represents a metric- aware attribute space in which items are compared. We then project items in this factorized latent attribute space to capture the user’s fine-grained preferences and generate explainable compat- ibility recommendations. Specifically, we first develop a Latent Attribute Extraction Network (LAEN), which is used to extract latent metric-aware attribute representations. Then to model 61 the interactions between latent clothing attributes and their contexts, we leverage a graph filter- ing network and design a Pairwise Preference Attention (PPA) module to automatically match the user’s preference for each attribute given contextual information, and aggregate all attributes with different weights. Finally, we optimize AAEG by predicting item linkage and overall out- fit quality in a multi-task learning setting. Extensive experiments on two large-scale real world datasets reveal that MAEG not only outperforms existing state-of-the-art, but also provides inter- pretable insights by highlighting the role of latent attributes and contextual relationships among items. 5.2 Related Work Generally, the related work can be grouped into the following categories: metric-aware recom- mendation, explainable recommendation and graph convolutional networks. Metric-aware Recommendation. Deep metric learning aims to learn useful representations by distance comparisons [66, 185]. In [80, 183], Siamese network was leveraged to learn the notion of similarity between a pair of items. This framework allowed CNN to train in two parallel branches with the same ‘copy’ and joined by a contrastive loss function. To further explore the information contained in the clothing image, Veit et al. [182] proposed to learn different notions of similarity and developed a CSN network that encodes similarity conditions in different subspace. 
In [180], different subspace were constructed for each pair of clothing categories to perform metric learning and most recently Tan et al. [171] proposed to learn these similarity condition masks in a weak-supervised manner. Unfortunately, their representation ability is often limited by pairwise relationship and they fail to consider contextual information inherent in an outfit. Explainable Recommendation Interpretability is one of the most important aspects of deep learning [42, 94, 216] and it has also become a very important research direction in fashion 62 recommendation systems [125, 217]. McAuley et al. [125] made an interpretation about collab- orative filtering by leveraging explicit factor based matrix factorization. Singh et al. [165] devel- oped an attribute ranking module that utilizes a spatial transformer network to discover the most informative image region for the attribute. In [3, 68], a weakly supervised fashion recommenda- tion model was built by localizing attribute regions in an image through Class Activation Maps (CAM) [218]. In this paper, we provide interpretable insights by adopting Grad-CAM [162] to highlight the most informative latent attribute being attended to during feature aggregation. Graph Convolutional Network Recently, a powerful structural learning paradigm, which takes the form of graph convolution, has shown great promise in handling complex relational reasoning [86, 106, 158]. Veliˇ ckovi´ c et al. [184] presented a graph attention model that is able to discriminate neighboring nodes by computing attention weights between each pair of graph nodes. In contrast, we perform an outfit-level attention mechanism to explain the contribution of each item in the set withO(N) complexity. Most recently, graph convolution have also been applied to recommendation systems [14, 209]. In [37], an outfit was represented as an undi- rected graph to consider context information and fashion compatibility prediction was cast as a link prediction problem. However, these work comprehended the clothing image as a global content representation and were unable to model the interactions between the user’s preferences and contexts. In this paper, we take a further step to explain the user’s preferences given varying contexts. 5.3 Attribute-Aware Explainable Graph Networks In this section, we introduce our proposed AAEG for addressing the fashion compatibility rec- ommendation. Most previous work represent clothing items in a global/single feature space as shown in Figure 5.2 (a), but comparisons can be ambiguous when context is present. For example, the gold dress and earring are close in terms of color whereas the two dresses are close in terms of category in Figure 5.2 (b). As a result, the incompatible gold earring and black dress are also 63 Figure 5.2: Difference between the conventional (a) Global/Single Feature Space and our (b) Metric-Aware Latent Attribute Space. Each latent space represents a latent attribute space that is learned automatically from data. To make comparisons between items, different weights are assigned to these metric space to account for different preferences over corresponding attributes. close to each other. To solve this problem, AAEG utilizes a factorized Metric-Aware Latent Attribute Space, where each latent space represents a latent attribute space in which items are being compared. As demonstrated in Figure 5.2 (b), items are represented jointly by several latent attribute space. 
An overview of the AAEG architecture is shown in Figure 5.3. It consists of two main components, i.e., the Latent Attribute Extraction Network (LAEN) and Pairwise Preference Attention (PPA). With LAEN, we first obtain fashion item projections in the latent attribute feature space. Next, we construct a graph representation for the outfit and perform message passing to consider contextual information. Then, we jointly train the item representation via latent attribute embedding and the graph filtering network in a multi-task setting. Finally, with the permutation-invariant outfit embedding and attribute preference inference, we can generate explainable recommendations by computing item contributions and highlighting attribute attention, respectively.

Figure 5.3: The architecture of the Attribute-Aware Explainable Graph Network (AAEG) for Fashion Compatibility Recommendation. LAEN is trained unsupervised to extract metric-aware latent attribute representations, where the latent masks are K learnable embedding functions transforming global features into the corresponding latent attribute spaces. PPA models the user's preference by computing attention over latent attributes. AAEG accounts for interactions between the user's preference and contexts using graph filtering, trained in a multi-task setting.

5.3.1 Unsupervised Latent Attribute Space Learning

In this subsection, we describe how to project items into the latent attribute space to obtain metric-aware latent attributes in different subspaces.

To get item embeddings with respect to $K$ different attributes, a standard supervised-learning framework requires annotating each item with these $K$ attributes. However, most real-world e-commerce datasets lack the attribute annotations needed to learn attribute representations directly. To this end, we resort to an unsupervised approach by learning from pseudo-labels, similar to [101]. However, instead of minimizing the conditional entropy of unlabeled class probabilities, we minimize a margin-ranking loss and recover the internal metric of the latent space in a data-driven manner.

Specifically, we first construct triplets by mining the Polyvore Outfits dataset [180]. Each triplet is of the form $(a, p, n)$, where $a$ and $p$ are two compatible clothing items in outfit $I_a^+$, and $n$ is a mined negative from $I \setminus I_a^+$ that belongs to the same category as $p$. Then we use a CNN to extract global feature representations for the triplet, yielding $(x^g_a, x^g_p, x^g_n)$. Next, we assign a pseudo-label $l_k$ ($k = 1, \ldots, K$) to each triplet by projecting its elements into the corresponding attribute space:

$$x^{p_k} = \mathrm{Proj}(x^g, E_k), \quad k = 1, 2, \ldots, K \qquad (5.1)$$

where $E_k$ is a learned parameter for the $k$-th attribute space. Here, each $x^{p_k}$ is a latent attribute representation of $x$ under space $k$. Consequently, the latent attribute feature representation extracted for item $i$ by LAEN is $\hat{x} = [x^{p_1}_i, \ldots, x^{p_K}_i, x^g_i]$, which then goes through two branches, PPA (described below) and our Graph Filtering Network (Section 5.3.2). Each $x^{p_k}$ is a vector of dimension $D$ residing in its corresponding latent attribute space and captures a certain attribute of the item.
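The projection in Eqn 5.1 amounts to $K$ parallel learnable maps applied to the same global feature. Below is a minimal PyTorch-style sketch of this idea, assuming each $E_k$ is a linear layer; the class and argument names (LatentAttributeProjection, num_spaces, and so on) are illustrative placeholders, not the actual implementation.

```python
import torch
import torch.nn as nn

class LatentAttributeProjection(nn.Module):
    """Sketch of Eqn 5.1: project a global feature x^g into K latent attribute spaces.

    Each E_k is modeled here as a learnable linear map (one per latent space); this is
    an assumption about the form of Proj(., E_k), not the released implementation.
    """
    def __init__(self, global_dim: int = 64, latent_dim: int = 64, num_spaces: int = 4):
        super().__init__()
        # One embedding function E_k per latent attribute space.
        self.projections = nn.ModuleList(
            [nn.Linear(global_dim, latent_dim) for _ in range(num_spaces)]
        )

    def forward(self, x_g: torch.Tensor):
        # x_g: (batch, global_dim) global CNN feature, e.g. from a ResNet-18 backbone.
        x_p = [proj(x_g) for proj in self.projections]   # K tensors of shape (batch, latent_dim)
        # \hat{x} = [x^{p_1}, ..., x^{p_K}, x^g]: latent projections plus the global feature.
        return torch.stack(x_p, dim=1), x_g


if __name__ == "__main__":
    laen_head = LatentAttributeProjection()
    x_global = torch.randn(8, 64)                        # a batch of 8 global features
    x_latent, x_g = laen_head(x_global)
    print(x_latent.shape, x_g.shape)                     # torch.Size([8, 4, 64]) torch.Size([8, 64])
```

With the $K = 4$, $D = 64$ configuration reported later in the implementation details, the module returns a (batch, 4, 64) tensor of latent attribute features alongside the original global feature.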
However, as discussed in the introduction, a clothing item can have a variety of different attributes, and each attribute may be given a different priority when paired with different objects. Therefore, inspired by [181], we propose an attribute-level pairwise attention module, namely Pairwise Preference Attention (PPA), which outputs a preference-aware feature representation for a pair of nodes by assigning adaptive weights to features in the different latent attribute spaces. Concretely, as illustrated in the left part of Figure 5.3, given a pair of inputs $(\hat{x}_i, \hat{x}_j)$, the attention weight for each latent attribute space is computed as:

$$\alpha_{i|j} = \mathrm{softmax}\big(\mathrm{ReLU}(W^T [x^g_i \,\|\, x^g_j])\big) \qquad (5.2)$$

where the matrix $W \in \mathbb{R}^{2D \times K}$ is a learned parameter that transfers the concatenated feature to the dimension designated by the latent attribute space. The weighted preference embedding for item $i$ dependent on $j$ can thus be calculated as:

$$x^{wp}_{i|j} = \sum_{k=1}^{K} \alpha_{i|j,k} \, x^{p_k}_i \qquad (5.3)$$

Finally, to learn a useful metric for the latent attribute space, we adopt the Margin Ranking Loss [160], which has been widely used in metric learning:

$$\mathcal{L}_{rank} = \max\big(0, -(\|x^{wp}_{a|n} - x^{wp}_{n|a}\|_2 - \|x^{wp}_{a|p} - x^{wp}_{p|a}\|_2) + \mu\big) \qquad (5.4)$$

where $\mu$ is the maximum margin, and the weighted preference embeddings for the anchor, positive and negative elements of the triplet are computed pairwise, respectively.

5.3.2 Graph Filtering Network

In this subsection, we investigate how to take into account the contextual information inherent in the outfit and how to model the interactions between the user's preference over latent attributes and context.

To this end, we aim to output a context-aware representation $x^{ctx}_{i,T}$ for each object $i$, conditioned on the metric-aware latent attribute features associated with the object. This is obtained with iterative message passing over $T$ iterations with our graph filtering network, as shown in the bottom part of Figure 5.3. We use a fully connected graph over the outfit, where each node represents a clothing item in the outfit and there is a directed edge $i \rightarrow j$ between every pair of items $i$ and $j$. Each node $i$ is represented by a metric-aware attribute feature $x^{ctx}_{i,t}$ that is updated during each iteration $t$.

Latent Attribute-Conditioned Message Passing

To consider contextual information in the factorized Latent Attribute Space at iteration $t+1$, our graph filtering takes three steps:

Step 1. We update the initial features for all nodes with the updated latent masks $E_{k,t}$ at time $t$ ($k = 1, \ldots, K$) using Eqn 5.1. The resulting latent attribute representation for node $i$ is:

$$x^{ctx}_{i,t} = [x^{p_1}_{i,t}, \ldots, x^{p_K}_{i,t}, x^g_{i,t}], \quad i = 1, \ldots, N \qquad (5.5)$$

Step 2. For node $i$, we compute the message vector $m^t_{j,i}$ from its neighbor $j$ by learning a function $f_R$ parameterized by $\theta_R$:

$$m^t_{j,i} = f_R(x^{ctx}_{i,t}, x^{ctx}_{j,t}; \theta_R) \qquad (5.6)$$

$$\theta_R = \begin{cases} \theta_0 & i = j \\ \theta_1 & i \neq j \end{cases}, \quad 1 \leq i \leq N, \; j \in \mathcal{N}_i \qquad (5.7)$$

where $\theta_0$ and $\theta_1$ are learned parameters. The idea is to differentiate between the effects of self-loop connections and neighboring connections.

Step 3. For node $i$, we gather the messages propagated from its neighbors and update the node representation by learning a function $f_O$ parameterized by $\theta_O$:

$$x^{ctx}_{i,t+1} = f_O\Big(x^{ctx}_{i,t}, \sum_{j \in \mathcal{N}_i} m^t_{j,i}; \theta_O\Big) \qquad (5.8)$$

Note that $f_R$ and $f_O$ are shared among all edges and all nodes, and are therefore generalizable to unseen data. Finally, we obtain the final representation $x^{ctx}_{i,T}$, $i = 1, 2, \ldots, N$, for each node, which is used as input to the subsequent Edge Prediction module (Section 5.3.3).
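The following is a compact sketch of one iteration of this latent attribute-conditioned message passing, assuming $f_R$ and $f_O$ are single linear layers (with a ReLU after the update) and that the graph is fully connected within an outfit; all class and variable names are illustrative rather than taken from the actual code.

```python
import torch
import torch.nn as nn

class GraphFilteringLayer(nn.Module):
    """Sketch of Eqns 5.6-5.8: one message-passing iteration on a fully connected outfit graph.

    f_R builds a message m_{j,i} from the pair (x_i, x_j); separate parameters play the
    role of theta_0 (self-loop, i == j) and theta_1 (neighbor, i != j). f_O updates each
    node from the sum of incoming messages. Both functions are shared across all nodes.
    """
    def __init__(self, dim: int = 64):
        super().__init__()
        self.f_r_self = nn.Linear(2 * dim, dim)       # theta_0
        self.f_r_neighbor = nn.Linear(2 * dim, dim)   # theta_1
        self.f_o = nn.Linear(2 * dim, dim)            # node update function f_O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) features of the N items in one outfit.
        n, dim = x.shape
        pairs = torch.cat(                            # pairs[i, j] = [x_i || x_j]
            [x.unsqueeze(1).expand(n, n, dim), x.unsqueeze(0).expand(n, n, dim)], dim=-1
        )
        neighbor_msgs = self.f_r_neighbor(pairs)                      # (N, N, dim)
        self_msgs = self.f_r_self(torch.cat([x, x], dim=-1))          # (N, dim)
        is_self = torch.eye(n, dtype=torch.bool, device=x.device).unsqueeze(-1)
        messages = torch.where(is_self, self_msgs.unsqueeze(1).expand(n, n, dim), neighbor_msgs)
        aggregated = messages.sum(dim=1)              # Eqn 5.8: sum of messages arriving at node i
        return torch.relu(self.f_o(torch.cat([x, aggregated], dim=-1)))


if __name__ == "__main__":
    layer = GraphFilteringLayer(dim=64)
    outfit = torch.randn(5, 64)                       # an outfit with 5 items
    print(layer(outfit).shape)                        # torch.Size([5, 64])
```

Stacking three such layers corresponds to the $T = 3$ propagation steps reported later in the implementation details.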
Permutation-Invariant Outfit Embedding

In deciding the quality of an outfit, a user usually places different emphasis on different items in the outfit. In fact, this provides an intuitive way to interpret the behavior of graph convolution. In order to capture this behavior, we propose to build an outfit embedding that is order-agnostic. This can be achieved without extra burden thanks to our contextualized graph representation.

Specifically, we first initialize the outfit embedding as $o_{raw} = \sum_{i=1}^{N} x_i$, and then perform dot-product attention to compute the affinity between it and each item $i$ in the set:

$$\alpha_i = \mathrm{softmax}\big(\langle o_{raw}, x^g_{i,T} \rangle / \sqrt{D}\big), \quad i = 1, \ldots, N \qquad (5.9)$$

where $D$ is the dimension of $x^g$. The updated outfit embedding can thus be computed as:

$$o = \sum_{i=1}^{N} \alpha_i \, x^g_{i,T} \qquad (5.10)$$

The computational complexity of this step is $O(N)$, where $N$ is the number of objects in the outfit.

5.3.3 Multi-Task Learning

We collect the context-aware node embeddings with attribute information in the different latent attribute spaces, $x^{ctx}_i$ ($i = 1, \ldots, N$), and the permutation-invariant outfit embeddings $o_i$ ($i = 1, \ldots, M$) after $T$ graph iterations for training. $N$ is the number of nodes, $M$ is the number of outfits, and we omit the subscript $T$ for brevity.

For edge prediction, given a pair of node representations $(x^{ctx}_i, x^{ctx}_j)$, we apply the Pairwise Preference Attention (PPA) introduced in Section 5.3.1, followed by a prediction function:

$$\hat{e}_{ij} = P_e\big(\mathrm{PPA}(x^{ctx}_i, x^{ctx}_j)\big) \qquad (5.11)$$

To train $P_e$, we randomly sample an equal number of positive and negative edges based on the adjacency information and perform random edge-dropout with probability 0.15 at each iteration for robustness. Similarly, we learn an outfit grading function $P_o$ with the permutation-invariant outfit embedding as input:

$$\hat{s}_i = P_o(o_i) \qquad (5.12)$$

To train $P_o$, we sample negatives by randomly replacing the groundtruth item at the blank position with one of the negative candidates from the same categories. The final loss is a weighted sum of two binary cross-entropy terms over edges and outfits:

$$\mathcal{L} = \mathcal{L}_e + \lambda_1 \mathcal{L}_o + \lambda_2 \|\theta\|^2 \qquad (5.13)$$

where $\lambda_1$ controls the weight between the two losses and $\lambda_2$ is a regularization hyper-parameter. $\theta$ includes all model parameters.

5.3.4 Attribute Preference Inference

We project clothing items into a new latent attribute space. In this space, the user's preferences for different latent attributes can be calculated, making it possible to generate explainable recommendations. Specifically, when we calculate the preference over latent attributes (Eqn 5.2), we can further identify which region of the image is attended to for this decision and where this particular latent attribute gets activated.

Therefore, we can use the attention weight calculated in Eqn 5.2 as the classification score for the $K$ latent attribute spaces and compute the gradient of the score for class $c$ with respect to the last convolutional layer, where $c$ is the class that maximizes the classification confidence, similar to [162]:

$$\alpha^c_t = \frac{1}{Z} \sum_m \sum_n \frac{\partial y^c}{\partial F^t_{mn}} \qquad (5.14)$$

where $\alpha^c_t$ is the neuron importance weight of class $c$ for channel $t$, $F^t_{mn}$ indicates the $(m, n)$ spatial location of the $t$-th channel of the feature map $F$, and $Z$ is the normalization term. Then we perform a weighted combination of the forward activation maps, followed by a ReLU:

$$\mathrm{Mask}^c = \mathrm{ReLU}\Big(\sum_t \alpha^c_t F^t\Big) \qquad (5.15)$$

where $\mathrm{Mask}^c$ indicates latent attribute class $c$'s contribution to the activation.
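Eqns 5.14-5.15 are in effect Grad-CAM [162] applied with the selected latent attribute's preference score playing the role of the class score. A small sketch of that computation is shown below, assuming the last convolutional feature map and a scalar preference score are available; the toy example at the bottom only fabricates such a score so that the snippet runs, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def attribute_preference_mask(feature_map: torch.Tensor, score: torch.Tensor) -> torch.Tensor:
    """Sketch of Eqns 5.14-5.15 (Grad-CAM style).

    feature_map: (C, H, W) activations of the last convolutional layer (kept in the graph).
    score:       scalar preference score y^c for the selected latent attribute class c.
    Returns a (H, W) mask indicating where that latent attribute is activated.
    """
    grads = torch.autograd.grad(score, feature_map, retain_graph=True)[0]   # dy^c / dF, (C, H, W)
    alpha = grads.mean(dim=(1, 2))                      # Eqn 5.14: spatial average, one weight per channel
    mask = F.relu((alpha[:, None, None] * feature_map).sum(dim=0))          # Eqn 5.15: weighted sum + ReLU
    return mask / (mask.max() + 1e-8)                   # normalize for visualization


if __name__ == "__main__":
    # Toy stand-in for a network: the score depends on the feature map so gradients exist.
    fmap = torch.randn(64, 7, 7, requires_grad=True)
    toy_score = (fmap.mean(dim=(1, 2)) * torch.randn(64)).sum()
    print(attribute_preference_mask(fmap, toy_score).shape)                 # torch.Size([7, 7])
```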
Accordingly, the explanation consists of three parts: (1) AAEG can highlight which part of the clothing image is being attended by the network during preference selection of latent attributes; (2) AAEG provides the relative importance of each item in the outfit when making outfit compatibility prediction and compatibility score between each pair of items; (3) AAEG provides t-SNE visualization of the learned embedding. We will demonstrate these recommen- dation results in the experiment section. 70 5.4 Experiments Datasets. Maryland Polyvore [59] is a real-world dataset created based on users’ preferences of outfit configurations on an online website named polyvore.com: items within the outfits that receive high-ratings are considered compatible and vice versa. There are in total 164,379 items constituting 217,899 different outfits, where each outfit consists of 8 items at maximum and 6.5 items in average. In the original provided test set, negative candidates are sampled randomly without taking category into consideration (e.g., the “shoe” may have already appeared in the given outfit, which makes exclusion of current candidate easier). We also notice that not all items are softlines (e.g., there are occasionally “lamps” or “wardrobes” in the dataset that are hardlines). As such, we evaluate our model on resampled version of Maryland Polyvore as [37, 171]. Polyvore Outfits [180] is much larger than Maryland Polyvore with a total of 365,054 items and 68,306 outfits. Items that don’t belong to clothing are discarded. Most importantly, items in the candidate list are hard-mined to belong from the same category, giving rise to more chal- lenges. The maximum number of items per outfit is 19. We use the split provided by the authors and finally have 53,306 outfits for training, 10,000 for testing, and 5,000 for validation. Evaluation Protocol. During inference, we use the edge prediction result for Fill In the Blank (FITB) and Outfit Compatibility Prediction (Compat AUC) as in [37]. The goal of FITB is to complete a given partial outfitS =fs i g(i = 1;:::;N) with the best itemj from a candidate set C. We choose the itemj that satisfies: j = argmax j ( N X i=1 e ij ); j2C (5.16) wheree ij is the edge prediction score between nodei and nodej. The evaluation metric for this task is the prediction accuracy. 71 For evaluating outfit compatibility, a score close to 1 represents a compatible outfit while 0 an incompatible outfit. For an outfitk, we take the sum of two terms as the final prediction: 2 N(N 1) X i;j (i6=j) e ij ; i;j = 1; 2;:::;N (5.17) The evaluation metric for this task is the Area Under the Roc Curve (AUC) [21] to measure how much our model is capable of distinguishing between compatible and incompatible outfit. Implementation Details. For fair comparisons, we follow the configurations specified in [59, 171, 180] and use ResNet-18 as global feature extractor for all approaches, which yields a 64-dimensional feature. For [37], we use the default setting 1 from the author’s released code (k=0 for wo/ ctx and k=1 for w/ ctx case) and retrain it with 3 hidden layers of size 64 using ResNet-18 feature. For our AAEG framework, we use k=0 for wo/ ctx and k=1 for w/ ctx case, K = 4, = 0:3 andD = 64 for our LAEN module andT = 3 for our graph filtering network, with hidden size of 64 each. We use dropout ratio of 0:5 and batch normalization between each graph layer for regularization. We use Adam [85] with a learning rate of 0:001 with 10; 000 iterations for optimization and 1 = 0:5, 2 = 1e 5. 
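At inference time, both evaluation tasks reduce to simple aggregations of the predicted edge scores $\hat{e}_{ij}$. The sketch below illustrates the FITB selection rule (Eqn 5.16) and the outfit compatibility score (Eqn 5.17), assuming the pairwise scores have already been produced by the model; the function names are illustrative.

```python
import numpy as np

def fitb_choice(edge_scores: np.ndarray) -> int:
    """Eqn 5.16: pick the candidate whose summed edge scores to the partial outfit are largest.

    edge_scores: (num_candidates, N) array, edge_scores[j, i] = e_ij between
                 candidate j and the i-th item of the partial outfit.
    """
    return int(edge_scores.sum(axis=1).argmax())

def outfit_compatibility(pairwise: np.ndarray) -> float:
    """Average of e_ij over all off-diagonal pairs of an N-item outfit (cf. Eqn 5.17).

    pairwise: (N, N) matrix of edge scores; the diagonal is ignored.
    """
    n = pairwise.shape[0]
    off_diagonal = pairwise.sum() - np.trace(pairwise)
    return float(off_diagonal / (n * (n - 1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(fitb_choice(rng.random((4, 6))))           # index of the best of 4 candidates
    print(outfit_compatibility(rng.random((6, 6))))  # compatibility score for random pairwise scores
```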
We perform model-parallelism on two RTX 2080 for training. 5.5 Results In this section, we perform comprehensive analysis of our proposed AAEG. First, we evaluate our model on two large-scale real world datasets against state-of-the-art methods from the latest papers in Section 5.5.1. Then, we conduct extensive ablation study to investigate how each component of our framework affects the overall performance in Section 5.5.2. To do so, we dive deep into details of our two core modules LAEN and PPA, to compare them with their 1 k=1 performs on par with k=15 in [37] and is the default setting in the released repo. Following the paradigm in benchmark [59], we use Resnet18 for our method and all other baselines. 72 variants in Section 5.5.3 and Section 5.5.4 respectively. Finally, we interprete our model and show qualitative results in Section 5.5.5. Method FITB ACC Compat. AUC Siamese Net [183] 54.4% 0.85 Bi-LSTM [59] 64.9% 0.94 TA-CSN [180] 65.0% 0.93 SCE-Net [171] 60.8% 0.90 CA-GCN (wo /ctx) [37] 41.7% 0.71 CA-GCN (w /ctx) [37] 83.1% 0.99 Ours (wo/ ctx) 62.1% 0.93 Ours (w/ ctx) 87.3% 0.99 Ours + Outfit(w/ ctx) 89.3% 0.99 Table 5.1: Comparisons on FITB/Compatibility task over Resampled Maryland Polyvore. Method FITB ACC Compat. AUC Siamese Net [183] 52.9% 0.81 TA-CSN [180] 55.3% 0.86 SCE-Net [171] 61.6% 0.91 CA-GCN (wo/ ctx) [37] 43.3% 0.75 CA-GCN (w/ ctx) [37] 82.4% 0.99 Ours (wo/ ctx) 63.1% 0.93 Ours (w/ ctx) 86.7% 0.99 Ours+Outfit (w/ ctx) 88.0% 0.99 Table 5.2: Comparisons on FITB/Compatibility task over Polyvore Outfits. 5.5.1 Recommendation Performance We present the recommendation performance for Fill In The Blank (FITB) and Outfit Compat- ibility Prediction (Compat AUC) on two datasets in Table 5.1 and Table 5.2 respectively. As can be seen, our model obtains consistent improvement in both tasks across two datasets. In particular, our model outperforms the state-of-the-art graph-convolution based approach CA- GCN [37] by over 5% under with-context case in FITB task, demonstrating that it can better differentiate candidate items by utilizing our proposed latent attribute space. In fact, the signif- icance of our latent attribute representation is more obvious under the without-context case, in 73 which our model achieves a gain of over 19% against [37]. In addition, when compared with strongly-supervised metric-learning approaches such as [180, 183], which require category label information during training, our model learns metric aware representation with pseudo-label but achieves comparable performance when no context is used and much higher performance (over 24% on both datasets) when context is used. To the best of our knowledge, we made the first attempt to demonstrate the effectiveness of leveraging metric-aware latent embedding in graph neural networks. 5.5.2 Ablation Study In this subsection, we perform ablation study on our proposed AAEG. As shown in Table 5.3, we report the FITB ACC and Compat AUC on Polyvore Outfits [180] for different combinations of our components. Model Version LAEN PPA Graph Filtering Outfit Embedding ACC" AUC" 1 X 37.7 0.69 2 X X 38.8 0.70 3 X X 57.0 0.84 4 X X X 57.7 0.87 5 X X X 86.7 0.99 6 X X X X 88.0 0.99 Table 5.3: Ablation study of our model on Polyvore Outfits [180]. For model #1, we use graph filtering network, but with 4 random subspace (without LAEN). In this case, there is no guarantee that the four subspace can provide useful metric measure for feature comparison. The performance is only around 37:7% for FITB and 0:69 for Compat AUC. 
The situation is the same for model #2, with outfit-embedding as supervision. In model #3 and #4, LAEN is used to learn useful metric-aware latent attribute space while average- weighting mechanism across latent subspace is in place of PPA. In this case, compared with model #2, the performance of model #4 increases from 38:8% to 57:7% for FITB and from 0:70 74 to 0:87 for Compat AUC. Therefore, we conclude that LAEN serves as a crucial component for providing useful metric representation for downstream components such as graph filtering network or PPA. To demonstrate the effectiveness of PPA component, we compare model #3 with #5, where the differences lie at how we perform feature selection during triplet-training and graph aggregation. As can be seen, the performance gap due to PPA module is around 29% for FITB and 0:15 for Compat AUC. One possible explanation is that adaptive feature selection provides our framework with more modelling capacity whereas average aggregation of feature from different latent space may impose difficulty for graph filtering network in learning useful node representations. Finally, model #6 shows extra 1:3% improvement for FITB by leveraging outfit-embedding component. It also helps interpret the model’s emphasis on different items when making compatibility predictions. With Variants of Graph Filtering Network (wo/w ctx) Method Compat. AUC FITB ACC 1 hidden layer 0.89/0.84 59.6/57.7 2 hidden layer 0.90/0.94 61.2/72.6 3 hidden layer 0.93/0.99 63.1/88.0 , Table 5.4: Variants ofT for Graph Filtering Network on Polyvore Outfits. We further study the influence of propagation stepT on the recommendation performance. All hidden layers have size of 64 since the feature dimension from ResNet-18 extractor is 64. From Table 5.4, we observe consistent improvement of performance by increasing hidden layer number from 1 to 3. We notice that when there’s only 1 hidden layer, the modeling capacity seems limited when context is involved. Instead, two or more are better options as they increase the modelling capacity. 5.5.3 Variants of LAEN In this subsection, we discuss the necessity of using Latent Attribute Embedding extracted by LAEN. We construct 6 variants based on Siamese-Net [183] and variants of LAEN to verify the effectiveness of incorporating the metric-aware latent attribute space. As shown in Table 5.5, four 75 With Variants of LAEN (wo/w ctx) Method Compat. AUC FITB ACC Random 4-Subspace 0.52/0.70 30.4/38.8 Siamese-Net [183] 0.80/0.93 52.0/69.9 LAEN w/ (1 subspace) 0.85/0.95 54.3/72.4 LAEN w/ (3 subspace) 0.92/0.99 62.4/87.0 LAEN w/ staticM k 0.92/0.99 61.8/77.0 LAEN (Ours) 0.93/0.99 63.1/88.0 , Table 5.5: Variants of Latent Attribute Extraction Network (LAEN) on Polyvore Outfits. random subspace without metric learning can only achieve 38:8% FITB and 0:70 Compat AUC. In contrast, a pre-trained Simaeset-Net [183] embedding bring the number to 69:9% and 0:93 for FITB and Compat AUC respectively. We also trained with variants of LAEN, by experimenting with different numbers of latent attribute space or by fixing latent embedding layer (staticE k ) in LAEN. The result shows that 4 subspace yields slightly better performance than using 3 subspace (88:0% vs 87:0% in FITB) whereas much better than using 1 subspace alone (88:0% vs 72:4% in FITB Acc and 0:99 vs 0:95 in Compat AUC). We further increase the performance under w/ ctx case by keeping latent embedding layerE k updated during training. 
5.5.4 Variants of PPA In this subsection, we investigate two alternatives for performing feature preference selection. In Table 5.6, we compare PPA with average-weighting mechanism and direct concatenation. The former indiscriminatively blends features from different subspace with feature dimension unchanged while the latter also yields a single feature but with longer dimension. Interest- ingly, they give almost the same performance under without-context situation, but training with average-weighting mechanism cannot further increase the performance when context is involved during inference. Therefore, we argue that it is important to perform careful feature selection during graph aggregation step, which would then influence the graph update step. The evaluation result shows the effectiveness of using PPA, by performing adaptive feature selection based on contextual information. 76 With Variants of Preference Selection (wo/w ctx) Method Compat. AUC FITB ACC Average Weight 0.87/0.87 58.0/57.7 Concat 0.87/0.91 58.7/61.2 PPA (Ours) 0.93/0.99 63.1/88.0 , Table 5.6: Necessity of Pairwise Preference Attention: Results on the Polyvore-Outfit test set obtained by variants of Preference Selection. 5.5.5 Visualization of Our Model In addition to quantitative evaluations, we are able to provide some visual interpretations of our framework, thanks to our LAEN, PPA and permutation-invariant outfit embedding modules. We argue that a user’s preference manifests through favor over certain items in an outfit (e.g, top over auxiliary), and towards certain attributes for an item (e.g, color over shape, style over color etc.). And more importantly, these “favors” tend to change given different contexts. For the first argument, we visualized a typical user’s preference for each individual item in an outfit. As shown in Figure 5.4, the outfit consists of top, bottom and auxiliaries. We display the attention weight assigned to each item by our model when evaluated against permutation- invariant outfit embedding using Eqn 5.9. Item that are more conspicuous seem to get more credits for the final “contributions”, which to some extent coincides with human intuition. Figure 5.4: Distribution of attention weight for item preference in the outfit. 77 Second, we provide an intuitive way to visualize our proposed latent attribute space learned unsupervised. In contrast to [180], where t-SNE [120] was used to visualize the metric-space learned for a single category using supervised labels, we provide a t-SNE plot for one of our learned latent attribute space on Polyvore Outfits [180], which involves items from different categories without supervision (See Figure 5.5). Ideally, clothing items that are compatible in color, style, shape, etc will be embed close to each other. As shown in Figure 5.5, we can observe gradual transition in terms of color across categories, which demonstrates that even with “pseudo-labels” during training, our approach is able to learn visually similar attributes explicitly defined in the dataset. Figure 5.5: An example of t-SNE visualization of our Latent Attribute Space on Polyvore Outfits. Third as a post-processing step, we can use Grad-CAM [162] to understand how our network makes its preference decision over latent attributes. The idea is to use the gradient of the max- imum preference score with respect to the global average pooling layer in LAEN to visualize which parts of the clothing image are most important for the preference selection. 
Visualization examples from four representative categories are presented in Figure 5.6. 78 Figure 5.6: Grad-CAM visualization on decision making of attribute preference. Finally, a user’s favor over items or attributes tend to change given different contexts, which is captured by the interactions between users’ preferences and contexts during graph filtering. As shown in Figure 5.7, candidate items are from different categories in Maryland Polyvore whereas from the same category but with different attributes (e.g, style, shape etc.) in Polyvore Outfits. Our model not only predicts groundtruth as the best candidate, but also provides distinctive score for items that are visually similar as opposed to those seemingly incompatible. 5.6 Conclusion In this paper, we proposed AAEG to capture how users’ preferences evolve given changes of contexts and provide interpretability for fashion compatibility recommendation tasks. We first developed a Latent Attribute Extraction Network (LAEN) to project items under different latent attribute space that’s learned automatically from data. Then we introduced Pairwise Prefer- ence Attention (PPA) and Graph Filtering Network to comprehend the interactions between user’s preference and context. Finally, we provide interpretable visualizations of our frame- work. Experimental results on real-world datasets clearly demonstrated the effectiveness and explanatory power of AAEG. 79 Figure 5.7: Qualitative results of our model for FITB prediction on Maryland Polyvore and Polyvore Outfits. Green box indicates the groundtruth and scores highlighted with red color are our predictions using Eqn 5.16. 80 Chapter 6 SLADE: A Self-Training Framework for Distance Metric Learning 6.1 Introduction Existing distance metric learning methods mainly learn sample similarities and image embed- dings using labeled data [23, 83, 134, 195], which often require a large amount of data to perform well. A recent study [129] shows that most methods perform similarly when hyper-parameters are properly tuned despite employing various forms of losses. The performance gains likely come from the choice of network architecture. In this work, we explore another direction that uses unlabeled data to improve retrieval performance. Recent methods in self-supervised learning [26, 30, 60] and self-training [31, 203] have shown promising results using unlabeled data. Self-supervised learning leverages unlabeled data to learn general features in a task-agnostic manner. These features can be transferred to down- stream tasks by fine-tuning. Recent models show that the features produced by self-supervised learning achieve comparable performance to those produced by supervised learning for down- stream tasks such as detection or classification [26]. Self-training methods [31, 203] improve the performance of fully-supervised approaches by utilizing a teacher/student paradigm. How- ever, existing methods for self-supervised learning or self-training mainly focus on classification but not retrieval. We present a SeLf-trAining framework for Distance mEtric learning (SLADE) by leveraging unlabeled data. Figure 6.1 illustrates our method. We first train a teacher model on the labeled 81 Figure 6.1: A self-training framework for retrieval. In the training phase, we train the teacher and student networks using both labeled and unlabeled data. In the testing phase, we use the learned student network to extract embeddings of query images for retrieval. 
dataset and use it to generate pseudo labels for the unlabeled data. We then train a student model on both labels and pseudo labels to generate a final feature embedding. We utilize self-supervised representation learning to initialize the teacher network. Most deep metric learning approaches use models pre-trained on ImageNet ([83], [195], etc). Their extracted representations might over-fit to the pre-training objective such as classification and 82 not generalize well to different downstream tasks including distance metric learning. In contrast, self-supervised representation learning [26, 30, 31, 60] learns task-neutral features and is closer to distance metric learning. For these reasons, we initialize our models using self-supervised learning. Our experimental results (Table 6.3) provide an empirical justification for this choice. Once the teacher model is pre-trained and fine-tuned, we use it to generate pseudo labels for unlabeled data. Ideally, we would directly use these pseudo labels to generate positive and negative pairs and train the student network. However in practice, these pseudo labels are noisy, which affects the performance of the student model (cf. Table 6.4). Moreover, due to their dif- ferent sources, it is likely that the labeled and unlabeled data include different sets of categories (see Section 7.4.1 for details about labeled and unlabeled datasets). The features extracted from the embedding layer may not adequately represent samples from those unseen classes. To tackle these issues, we propose an additional representation layer after the embedding layer. This new layer is only used for unlabeled data and aims at learning basis functions for the feature rep- resentation of unlabeled data. The learning objective is contrastive, i.e. images from the same class are mapped close while images from different classes are mapped farther apart. We use the learned basis vectors to compute the feature representation of each image and measure pairwise similarity for unlabeled data. This enables us to select high-confident samples for training the student network. Once the student network is trained, we use it to extract embeddings of query images for retrieval. We evaluate our model on several standard retrieval benchmarks: CUB-200, Cars-196 and In-shop. As shown in the experimental section, our approach outperforms several state-of-the- art methods on CUB-200 and Cars-196, and is competitive on In-shop. We also provide various ablation studies in the experimental section. The main technical contributions of our work are: • A self-training framework for distance metric learning, which utilizes unlabeled data to improve retrieval performance. • A feature basis learning approach for the student network, which better deals with noisy pseudo labels generated by the teacher network on unlabeled data. 83 6.2 Related Work Distance metric learning is an active research area with numerous publications. Here we review those that are relevant to our work. While a common objective is to push similar samples closer to each other and different samples away from each other, approaches differ on their losses and sample mining methods. One can train a model using cross entropy loss [213], hinge loss [134], triplet loss [200], proxy-NCA loss [83, 128, 174], etc. [128] used the proxy-NCA loss to minimize the distance between a sample and their assigned anchor(s). These set of anchors were learnable. [83] further improved the proxy-based loss by combining it with a pair-based loss. 
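As a concrete illustration of this fine-tuning step, the sketch below gives a margin-based contrastive ranking loss of the kind used here (the exact form adopted in this work is given as Eqn 6.1 in the next subsection); the margin values and function name are placeholders.

```python
import torch

def contrastive_ranking_loss(dist: torch.Tensor, is_positive: torch.Tensor,
                             m_pos: float = 0.2, m_neg: float = 0.8) -> torch.Tensor:
    """Margin-based contrastive loss over a batch of pairs (cf. Eqn 6.1).

    dist:        (P,) pairwise embedding distances d(x_i, y_i).
    is_positive: (P,) boolean mask, True where the pair shares a label.
    m_pos/m_neg: positive / negative margins (the values here are placeholders).
    Eqn 6.1 sums the two terms; the batch mean is used here only for scale.
    """
    pos_term = torch.clamp(dist - m_pos, min=0.0)   # pull positive pairs within m_pos
    neg_term = torch.clamp(m_neg - dist, min=0.0)   # push negative pairs beyond m_neg
    return torch.where(is_positive, pos_term, neg_term).mean()


if __name__ == "__main__":
    d = torch.tensor([0.1, 0.9, 0.3, 0.5])
    pos = torch.tensor([True, False, True, False])
    print(contrastive_ranking_loss(d, pos))
```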
We also use a set of learnable vectors but do not optimize directly on their distances to samples. Rather, we use them as a basis (anchor set) to represent output features operated on by a distribution loss. Our intuition is that while individual pseudo labels can be noisy, representing features using these anchors makes them more robust to noise [33], [211]. As mentioned previously, we use self-supervised training to seed our teacher model before fine-tuning it. There has been a significant progress recently in self-supervised learning of gen- eral visual representation [26, 30, 31, 55]. [30] learns representation by minimizing the similarity between two transformed versions of the same input. Transformations include various data aug- mentation operations such as crop, distort or blur. [31] narrowed the gap between self-supervised and supervised learning further by using larger models, a deeper projection head and a weight stabilization mechanism. In [26], a clustering step was applied on the output presentation. Their algorithm maximizes the consistency of cluster assignments between different transformations of the same input image. Most of these works aimed at learning a generic visual representation that can later be used in various downstream tasks. Self-training involves knowledge distillation from larger, more complex models or from ensembles of models (teachers) to less powerful, smaller students (e.g., [64, 204, 212]). Their end purpose is often reducing model size. Recently, [203] and [221] used iterative self-training to improve classification accuracy of both teacher and student models. At a high level, our 84 Figure 6.2: An overview of our self-training framework. Given labeled and unlabeled data, our framework has three main steps. (1) We first initialize the teacher network using self-supervised learning, and fine-tune it by a ranking loss on labeled data; (2) We use the learned teacher net- work to extract features, cluster and generate pseudo labels on unlabeled data; (3) We optimize the student network and basis vectors on labeled and unlabeled data. The purpose of feature basis learning is to select high-confidence samples (e.g., positive and negative pairs) for the ranking loss, so the student network can learn better and reduce over-fitting to noisy samples. self-training is similar to these approaches, but it is designed for distance metric learning and semi-supervised learning settings. Unlabeled data has been used to improve performance in various computer vision tasks such as classification and semantic segmentation (e.g., [136]). They have been also used in self- supervised and unsupervised representation learning such as in [203] or [214]. But there is still a performance gap compared to the fully supervised setting. For distance metric learning, most algorithms that are competitive on popular benchmarks (CUB-200, Cars-196 and In-shop) used fully labeled data ([83], [174], [195], etc). Here, we additionally use external unlabeled data to push performance on these datasets further. 6.3 Method Figure 7.2 illustrates the system overview of our self-training framework. Our framework has three main components. First, we use self-supervised learning to initialize the teacher network, and then fine-tune it on labeled data. We use the pre-trained SwA V model [26] and fine-tune it on our data to initialize the teacher network. After pre-training, we fine-tune the teacher network 85 with a ranking loss (e.g., contrastive loss) on labeled data. 
The details of self-supervised pre- training and fine-tuning of the teacher network are presented in section 6.3.1. Second, we use the fine-tuned teacher network to extract features and cluster the unlabeled data using k-means clustering. We use the cluster ids as pseudo labels. In practice, these pseudo labels are noisy. Directly optimizing the student network with these pseudo labels does not improve the performance of the teacher network. Therefore, we introduce a feature basis learning approach to select high-confidence samples for training the student network. The details of pseudo label generation are presented in Section 6.3.2. Third, we optimize the student network and basis vectors using labeled and unlabeled data. The basis vectors are defined as a set of weights that map the feature embeddingf of each image into a feature representationr. We train the basis vectors such that images from the same class are mapped close and images from different classes are mapped farther apart. The basis vectors are used to select high-confidence samples for the ranking loss. The student network and basis vectors are optimized in an end-to-end manner. The details of student network optimization and feature basis learning are in Section 6.3.3. 6.3.1 Self-Supervised Pre-Training and Fine-Tuning for Teacher Network Existing deep metric learning methods often use the ImageNet pre-trained model [38] for initial- ization, which may over-fit to the pre-training objective, and not generalize well to downstream tasks. Instead, we use self-supervised learning to initialize the teacher model. We use the pre- trained SwA V model [26] and fine-tune it on our data. As shown in the experimental section, this choice leads to improvement in retrieval performance as compared to the pre-trained Ima- geNet models (see Table 6.3). We conjecture that this might be because deep metric learning and self-supervised learning are related, they both learn embeddings that preserve distances between similar and dissimilar data. Specifically, we are given a set of labeled images: D l =f(x 1 ;y 1 ); (x 2 ;y 2 );:::; (x n ;y n )g and unlabeled images:D u =f^ x 1 ; ^ x 2 ;:::; ^ x m g. We denote the parameters of the teacher network as t and the parameters of the student network as s . In the pre-training stage, we fine tune 86 the SwA V model on the union of the labeled and unlabeled images without using the label information to initialize the teacher model. Once the teacher network is pre-trained, we fine-tune the teacher network using a ranking loss (for example, a constrastive loss [57]) on the labeled data: L rank = X (x i ;y i )2P max(d(x i ;y i )m pos ; 0) + X (x i ;y i )2N max(m neg d(x i ;y i ); 0) (6.1) whereP is the positive pairs,N is the negative pairs, andm pos andm neg are the margins. 6.3.2 Pseudo Label Generation We use the teacher model to extract features, and cluster the unlabeled images using k-means. The unlabeled images are assigned to the nearest cluster centers. The assigned cluster ids are then used as pseudo labels. One can train a student network with a ranking loss that uses the positive and negative pair samples sampled from the pseudo labels. However, in practice, the pseudo labels are noisy, and unlabeled data may have unseen categories. The features extracted from the teacher model may not work well on those unseen categories (the pseudo labels are incorrectly estimated). 
This motivates us to design a new feature basis learning approach that better models the pairwise similarity on unlabeled data and can be used to select high-confidence sample pairs for training the student network. 6.3.3 Optimization of Student Network and Basis Vectors We first explain the ideas of feature basis learning, and the use of basis vectors for sample mining and describe the training for student network and basis vectors. Feature Basis Learning Basis vectors are a set of learnable weights that map a feature embedding f of an image to a feature representationr. We denote a set of basis vectors asfa 1 ;a 2 ;:::;a k g, where each basis 87 vectora i is ad 1 vector. For simplicity, we represent the basis vectors as akd matrixW a . Given an image I, we use the student network to obtain the feature embedding f of the input image. The feature representationr is computed byr =W a f, wherer i =a i T f. We train the basis vectors to optimize the feature representation such that images from the same class are mapped close while images from different classes are mapped farther apart. Specifically, we train the basis vectors using two losses, a cross-entropy loss and a similarity distribution loss. The loss function for feature basis learning is defined as: L Basis =L CE (D l ) +L SD (D u ); (6.2) where the first term is the cross entropy loss on the labeled data and the second term is the similarity distribution loss on the unlabeled data. The cross-entropy loss is applied on labeled data. The ground truth class labels can be used as a strong supervision signal to regularize the basis vectors to separate different classes. The cross entropy loss on labeled data is: L CE = n X i=1 y i log((W a f(x i ; s ))); (6.3) where is the softmax function,W a is the matrix for basis vectors, and s is the parameters for the student network. For unlabeled data, one can also train a cross-entropy loss on the pseudo labels similar to labeled data. However, we found that this leads to poor performance since the model tends to over-fit to the noisy pseudo labels. Instead, we optimize the pairwise similarity on unlabeled data. Pairwise similarity is a simpler (relaxed) form compared to multi-class entropy loss, where the objective is to make the samples from the same class close while distancing the samples with different classes. The motivation is that even though there exists noises in the pseudo labels, a certain portion of pairs are correctly estimated with the pseudo labels, which are good enough to train a class-agnostic pairwise similarity model. 88 We use the pseudo labels to sample a set of pseudo positive pairs and pseudo negative pairs, where the pseudo positive pairs are sampled from the same pseudo class and the pseudo negative pairs are sampled from different pseudo classes. We compute the similarity of each image pair by using the cosine similarity of two normalized feature representation: s(^ x i ; ^ x j ) = cos(r i ;r j ) = cos(W a f i ;W a f j ) (6.4) We model the similarity distributions as two Gaussian distributionsG + andG . The idea is to separate the two Gaussian distributions by maximizing the difference between their means and penalizing the variance of each distribution. The similarity distribution loss is defined as: L SD (G + jjG ) =max( + +m; 0) +( + + ); (6.5) where + ( ) and + ( ) are the mean and variance of the Gaussian distributions respectively, andm is the margin. 
We update the parameters in a batch-wise manner: + = (1) + b + + + = (1) + b + + ; (6.6) where + b and + b are the mean and variance in a batch. is the updating rate. The parameters ofG are updated in a similar way. Sample Mining We use the basis vectors to select high-confidence sample pairs from the unlabeled images for training the student network. Given a set of samples in a batch, we compute the pair-wise similarity for all samples using equation 6.4 and select positive and negative pairs by: P =f(^ x i ; ^ x j )js(^ x i ; ^ x j )T 1 g N =f(^ x i ; ^ x j )js(^ x i ; ^ x j )T 2 g (6.7) 89 We set the confidence thresholds ofT 1 andT 2 usingu + andu . The positive and negative pairs will be used in the ranking function (see Equation 6.9). Joint Training We train the student network and basis vectors by minimizing a functionL: min s ;Wa L( s ;W a ) (6.8) L =L rank (D l ; s ) + 1 L rank (D u ; s )+ 2 L Basis (D l ;D u ; s ;W a ); (6.9) whereL rank is a ranking loss (see Equation 6.1). We train the ranking loss on both labeled and unlabeled images. Note that for unlabeled data, we use the sample mining to obtain the positive and negative pairs. Our framework is generic and applicable to different pair-based ranking losses. We report the results of different losses, e.g., contrastive loss and multi-similarity loss in Table 6.1. We first train the basis vectors for a few iterations to get a good initialization, then train the student network and basis vectors end-to-end. After training the student network, we use the student as a new teacher and go back to the pseudo label generation step. We iterate this a few times. During testing, we discard the teacher model and only use the student model to extract the embedding of a query image for retrieval. 6.4 Experiments We first introduce the experimental setup including datasets, evaluation criteria and implemen- tation details. Then we report the results of our method on three common retrieval benchmarks 90 CUB-200, Cars-196 and In-shop ([92, 113, 188]) 1 . Finally, we conduct ablation studies to analyze different design choices of the components in our framework. 6.4.1 Datasets CUB-200/NABirds: We use CUB-200-2011 [188] as the labeled data and NABirds [179] as the unlabeled data. CUB-200-2011 contains 200 fine-grained bird species, with a total number of 11,788 images. NABirds is the largest publicly available bird dataset with 400 species and 743 categories of North America’s birds. It has 48,000 images with approximately 100 images for each species. We measure the overlaps between CUB-200 and NABirds, where there are 655/743 unseen classes in NABirds compared to CUB, showing the challenges of handling the out-of-domain images for unlabeled data. Cars-196/CompCars: We use Cars-196 [92] as the labeled data, which contains 16,185 images of 196 classes of cars. Classes are annotated at the level of make, model, and year (e.g., 2012 Tesla Model S). We use CompCars [206] as the unlabeled data. It is collected at model level, so we filter out unbalanced categories to avoid being biased towards minority classes, resulting in 16,537 images categorized into 145 classes. In-shop/Fashion200k: In-shop Clothes Retrieval Benchmark [113] includes 7,982 clothing items with 52,712 images. Different from CUB-200 and Cars196, In-shop is an instance-level retrieval task. Each article is considered as an individual category (each article has multiple views, such as front, back and side views), resulting in an average of 6.6 images per class. 
We use Fashion-200k [58] as the unlabeled data, since it has similar data organization (e.g., catalog images) as the In-shop dataset. 6.4.2 Evaluation Criteria For CUB-200 and Cars-196, we follow the settings in [83, 128, 129] that use half of the classes for training and the other half for testing, and use the evaluation protocol in [129] to fairly 1 We did not carry out experiments on the SOP dataset [134], since we could not find an unlabeled dataset that is publicly available and is similar to it in content. 91 compare different algorithms. They evaluate the retrieval performance by using MAP@R, RP and P@1. For each query, P@1 (also known as Recall@1 in previous metric learning papers) reflects whether the i-th retrieval result is correct. However, P@1 is not stable. For example, if only the first of all retrieval results is correct, P@1 is still 100%. RP measures the percentage of retrieval results that belong to the same class as the query. However, it does not take into account ranking of correct retrievals. MAP @R = 1 R P R i=1 P @i combines the idea of mean average precision with RP and is a more accurate measure. For In-shop experiment, we use the Recall@K as the evaluation metric [113]. We compare our model with full-supervised baselines that are trained with 100% of labeled data and are fine-tuned end-to-end. Our settings are different from the ones in self-supervised learning frameworks [26, 30], where they evaluate the models in a label fraction setting (e.g., using 1% or 10% labeled data from the same dataset for supervised fine-tuning or linear eval- uation). Evaluating our model in such setting is important, since in practice, we often use all available labels to fine-tune the entire model to obtain the best performance. These settings also pose several challenges. First, our fully supervised models are stronger as they are trained with 100% of labels. Second, since our labeled and unlabeled data come from different image distri- butions, the model trained on labeled data may not work well on unlabeled data - there will be noisy pseudo labels we need to deal with. 6.4.3 Implementation Details We implement our model using the framework [129]. We use 4-fold cross validation and a batch size of 32 for both labeled and unlabeled data. ResNet-50 is used as our backbone network. We use the pre-trained SwA V [26] model and fine-tune it on our data. In each fold, a student model is trained and outputs an 128-dim embedding, which will be concatenated into a 512-dim embedding for evaluation. We set the updating rate to 0.99, and 1 and 2 in Equation 6.9 to 1 and 0.25 respectively to make the magnitude of each loss in a similar scale. For iterative training, we train a teacher and a student model in each fold, then use the trained student model as the new teacher model for the next fold. This produces a better teacher model 92 more quickly compared to getting a new teacher model after the student network finishes all training folds. Methods Frwk Init Arc / Dim CUB-200-2011 Cars-196 MAP@R RP P@1 MAP@R RP P@1 Contrastive [57] [129] ImageNet BN / 512 26.53 37.24 68.13 24.89 35.11 81.78 Triplet [200] [129] ImageNet BN / 512 23.69 34.55 64.24 23.02 33.71 79.13 ProxyNCA [128] [129] ImageNet BN / 512 24.21 35.14 65.69 25.38 35.62 83.56 N. 
Softmax [213] [129] ImageNet BN / 512 25.25 35.99 65.65 26.00 36.20 83.16 CosFace [190, 191] [129] ImageNet BN / 512 26.70 37.49 67.32 27.57 37.32 85.52 FastAP [25] [129] ImageNet BN / 512 23.53 34.20 63.17 23.14 33.61 78.45 MS+Miner [195] [129] ImageNet BN / 512 26.52 37.37 67.73 27.01 37.08 83.67 Proxy-Anchor 1 [83] [83] ImageNet R50 / 512 - - 69.9 - - 87.7 Proxy-Anchor 2 [83] [129] ImageNet R50 / 512 25.56 36.38 66.04 30.70 40.52 86.84 ProxyNCA++ [174] [174] ImageNet R50 / 2048 - - 72.2 - - 90.1 Mutual-Info [6] [6] ImageNet R50 / 2048 - - 69.2 - - 89.3 Contrastive [57] (T 1 ) [129] ImageNet R50 / 512 25.02 35.83 65.28 25.97 36.40 81.22 Contrastive [57] (T 2 ) [129] SwA V R50 / 512 29.29 39.81 71.15 31.73 41.15 88.07 SLADE (Ours) (S 1 ) [129] ImageNet R50 / 512 29.38 40.16 68.92 31.38 40.96 85.8 SLADE (Ours) (S 2 ) [129] SwA V R50 / 512 33.59 44.01 73.19 36.24 44.82 91.06 MS [195] (T 3 ) [129] ImageNet R50 / 512 26.38 37.51 66.31 28.33 38.29 85.16 MS [195] (T 4 ) [129] SwA V R50 / 512 29.22 40.15 70.81 33.42 42.66 89.33 SLADE (Ours) (S 3 ) [129] ImageNet R50 / 512 30.90 41.85 69.58 32.05 41.50 87.38 SLADE (Ours) (S 4 ) [129] SwA V R50 / 512 33.90 44.36 74.09 37.98 46.92 91.53 Table 6.1: MAP@R, RP, P@1 (%) on the CUB-200-2011 and Cars-196 datasets. Pre-trained Image-Net model is denoted as ImageNet and the fine-tuned SwA V model on our data is denoted as SwA V . The teacher networks (T 1 ,T 2 ,T 3 andT 4 ) are trained with the different losses, which are then used to train the student networks (S 1 ,S 2 ,S 3 andS 4 ) (e.g., the teacherT 1 is used to train the student S 1 ). Note that the results may not be directly comparable as some methods (e.g., [6, 83, 174]) report the results based on their own frameworks with different settings, e.g., embedding dimensions, batch sizes, data augmentation, optimizer etc. More detailed explana- tions are in Section 7.4.5. 6.4.4 Results The retrieval results for CUB-200 and Cars-196 are summarized in Table 6.1. We compare our method with state-of-the-art methods reported in [129] and some recent methods [6, 83, 174]. Note that numbers may not be directly comparable, as some methods use their own settings. For example, ProxyAnchor [83] uses a larger batch size of 120 for CUB-200 and Cars-196. It also 93 Recall@K 1 10 20 40 N. Softmax [213] 88.6 97.5 98.4 - MS [195] 89.7 97.9 98.5 99.1 ProxyNCA++ [174] 90.4 98.1 98.8 99.2 Cont. w/M [197] 91.3 97.8 98.4 99.0 Proxy-Anchor [83] 91.5 98.1 98.8 99.1 SLADE (Ours) (S 4 ) 91.3 98.6 99.0 99.4 Table 6.2: Recall@K (%) on the In-shop dataset. uses the combination of global average pooling and global max pooling. Mutual-Info [6] uses a batch size of 128 and a larger embedding size of 2048. ProxyNCA++ [174] uses a different global-pooling, layer normalization and data sampling scheme. We evaluate the retrieval performance using original images for CUB-200 and Cars-196 rather than cropped images such as in [79] (CGD). For ProxyAnchor, in addition to reporting results using their own framework [83] (denoted as ProxyAnchor 1 ), we also report the results using the framework [129] (denoted as ProxyAnchor 2 ). We use the evaluation protocol of [129] for fair comparison. We use ResNet50 instead of BN-Inception as our backbone network because it is commonly used in self-supervised frame- works. We experiment with different ranking losses in our framework, e.g., contrastive loss [57] and multi-similarity loss [195]. 
We also report the results of our method using different pre-trained models: the pre-trained ImageNet model (denoted as ImageNet) and the fine-tuned SwAV model (denoted as SwAV).

We compare our method with the supervised baselines that are trained with 100% labeled data. Even in such a setting, we still observe a significant improvement using our method compared to state-of-the-art approaches that use ResNet-50 or BN-Inception. We boost the final performance to P@1 = 74.09 and P@1 = 91.53 for CUB-200 and Cars-196 respectively. The results validate the effectiveness of self-supervised pre-training for retrieval, as well as of the feature basis learning that improves the sample quality on unlabeled data.

Figure 6.3: Retrieval results on CUB-200 and Cars-196. We show some challenging cases where our self-training method improves Proxy-Anchor [83]. Our results are generated based on the student model S2 in Table 6.1. The red bounding boxes are incorrect predictions.

We also show that our method generalizes to different losses (e.g., the contrastive loss [57] and the multi-similarity (MS) loss [195]). Both losses lead to improvements using our method. The performance (P@1) of the MS loss is improved from 66.31 to 74.09 on CUB-200, and from 85.16 to 91.53 on Cars-196. We also report the performance of our method using the pre-trained ImageNet model (S3), which achieves on-par performance with state-of-the-art approaches (e.g., Proxy-Anchor) even when we use a lower baseline model as the teacher (e.g., with the MS loss).

Table 6.2 summarizes the results for In-shop. Different from CUB-200 and Cars-196, In-shop is an instance-level retrieval task, where each individual article is considered as a category. Fashion200k is used as the unlabeled data. We train the teacher and student models using the multi-similarity loss [195], similar to the settings of T4 and S4 in Table 6.1. We report the results using Recall@K [113]. We achieve competitive results against several state-of-the-art methods. We note that the images in the In-shop dataset are un-cropped while the images in the Fashion200k dataset are cropped, so there exist notable distribution differences between these two datasets. We use the un-cropped version of In-shop to fairly compare with the baseline methods.

6.4.5 Ablation Study

Initialization of Teacher Network

We first investigate the impact of using different pre-trained weights to initialize the teacher network (see Table 6.3). The results in the table are the final performance of our framework using different pre-trained weights. The teacher network is trained with a contrastive loss. We compare three different pre-trained weights: (1) a model trained on ImageNet with supervised learning; (2) a model trained on ImageNet with SwAV [26]; and (3) a SwAV model fine-tuned on our data without using label information. From the results, we can see that the weights from self-supervised learning significantly improve over the pre-trained ImageNet model, with 4.21% and 4.86% improvements on CUB-200 and Cars-196 respectively. This validates the effectiveness of self-supervised pre-training for retrieval.
We also find that the fine-tuned SwAV model further improves the performance of the pre-trained SwAV model by about 1%.

Pre-trained weight | CUB-200 MAP@R | Cars-196 MAP@R
ImageNet [38] | 29.38 | 31.38
Pre-trained SwAV [26] | 32.79 | 35.54
Fine-tuned SwAV | 33.59 | 36.24

Table 6.3: Comparison of different weight-initialization schemes of the teacher network, where the teacher is trained with a contrastive loss. The results are the final performance of our framework.

Components in Student Network

Table 6.4 analyzes the importance of each component in our self-training framework. The results are based on the teacher network trained with a contrastive loss. Training the student with the positive and negative pairs sampled from pseudo labels only improves the teacher slightly, by 1.52% and 0.26% on CUB-200 and Cars-196 respectively. The improvements are not very significant because the pseudo labels on the unlabeled data are noisy. The performance is further improved by using the feature basis learning and sample mining, which supports the proposed method for better regularizing the embedding space with the feature basis learning and selecting the high-confidence sample pairs. We boost the final performance to 33.59 and 36.24 for CUB-200 and Cars-196 respectively.

Components | CUB-200 MAP@R | Cars-196 MAP@R
Teacher (contrastive) | 29.29 | 31.73
Student (pseudo label) | 30.81 | 31.99
+ Basis | 32.45 | 35.78
+ Basis + Mining | 33.59 | 36.24

Table 6.4: Ablation study of different components in our framework on CUB-200 and Cars-196. The teacher network is trained with a contrastive loss.

Pairwise Similarity Loss

We also investigate different design choices of the loss functions (Equation 6.5) used for feature basis learning. One alternative is to assign a binary label to each constructed pair according to the pseudo labels (e.g., 1 for a pseudo-positive pair and 0 for a pseudo-negative pair) and calculate a batch-wise cross-entropy loss between the pairwise similarities and these binary labels. We denote this option as local-CE. Another alternative is to first update the global Gaussian means using Equation 6.6, and then assign a binary label to the global means, followed by a cross-entropy loss. We denote this option as global-CE. The similarity distribution (SD) loss performs better than cross-entropy, either locally or globally. The reason could be that the basis vectors need not generalize to the pseudo labels, as those could be noisy. Another reason is that SD optimizes both the means and the variances to reduce distribution overlap.

Regularization | MAP@R | RP | P@1
Local-CE | 32.69 | 43.20 | 72.64
Global-CE | 32.23 | 42.68 | 72.45
SD (Ours) | 33.59 | 44.01 | 73.19

Table 6.5: Accuracy of our model in MAP@R, RP and P@1 versus different loss designs on CUB-200.

Number of Clusters

Table 6.6 analyzes the influence of different numbers of clusters on the unlabeled data. The results are based on the teacher network trained with a contrastive loss. The best performance is obtained with k = 400, which is not surprising, as our model (student network) is trained on the NABirds dataset, which has 400 species. We can also see that our method is not sensitive to the number of clusters.

k | MAP@R | RP | P@1
100 | 31.83 | 42.25 | 72.19
200 | 32.61 | 43.02 | 72.75
300 | 32.81 | 43.18 | 72.21
400 | 33.59 | 44.01 | 73.19
500 | 33.26 | 43.69 | 73.26

Table 6.6: Influence of using different numbers of clusters (k) on NABirds, which is used as the unlabeled data for CUB-200.
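If, as this ablation suggests, the unlabeled images are grouped into k clusters to produce pseudo labels, one simple instantiation is k-means over teacher embeddings. The sketch below is an assumption for illustration only (the chapter's actual pseudo-labeling pipeline also involves the teacher network and the learned basis vectors described earlier); the function name and the use of scikit-learn are not from the original text.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_label_unlabeled(teacher_embeddings: np.ndarray, k: int = 400, seed: int = 0):
    """Cluster L2-normalized teacher embeddings of unlabeled images into k groups
    and use the cluster indices as pseudo class labels (k = 400 worked best for
    NABirds in Table 6.6)."""
    feats = teacher_embeddings / np.linalg.norm(teacher_embeddings, axis=1, keepdims=True)
    kmeans = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(feats)
    return kmeans.labels_  # one pseudo label per unlabeled image

# Example with random features standing in for teacher embeddings.
fake_feats = np.random.randn(2000, 512).astype(np.float32)
pseudo = pseudo_label_unlabeled(fake_feats, k=400)
print(pseudo.shape, pseudo.min(), pseudo.max())
```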
6.5 Conclusion

We presented a self-training framework for deep metric learning that improves retrieval performance by using unlabeled data. Self-supervised learning is used to initialize the teacher model. To deal with noisy pseudo labels, we introduced a new feature basis learning approach that learns basis functions to better model pairwise similarity. The learned basis vectors are used to select high-confidence sample pairs, which reduces the noise introduced by the teacher network and allows the student network to learn more effectively. Our results on standard retrieval benchmarks demonstrate that our method outperforms several state-of-the-art methods and significantly boosts the performance of fully-supervised approaches.

Chapter 7
Compositional Learning for Weakly Attribute-based Metric Learning

7.1 Introduction

Deep metric learning [66, 129, 134] is a widely researched area, laying the foundation for image retrieval [128, 166, 200], face verification [69], person re-identification [207], image-text cross-modal retrieval [192], compatibility retrieval [41, 59], etc. Its goal is to learn a regularized embedding space that maps semantically similar feature representations close together while embedding dissimilar features far apart. Usually, learning these representations entails a large quantity of labeled data tailored to the task, such as attribute-based metric learning [56, 187], which we tackle in this chapter.

Image-attribute queries offer an intuitive and powerful way to express intended searches, which are often visually similar in content except for certain attributes. For example, a user may have an image of a red dress but be interested in buying a similar dress with dotted patterns (see Figure 7.1). In this scenario, the ability to manipulate retrieval results based on attributes is desired.

In the real world, we identify two challenges for this task. First, it is expensive to annotate visually similar images with different attribute configurations as positive samples, which leads to poor generalization. For example, in Figure 7.1 the training set may contain a number of attributes (e.g., black, brown, fluffy, wet) describing the dog species, but at test time retrieval on the cat species is required. In fact, the number of attribute configurations grows exponentially with |A|, the size of the attribute set. Second, the correspondence between a visual image and its attributes is often ambiguous. Though it may be possible to identify round-neck vs. short-neck based on "attention", attributes such as "brown" and "fluffy" both correspond to the whole image, and thus removing one of them is likely to degrade the retrieval quality of the other.

Figure 7.1: A compositional learning framework for retrieval. We learn a compositional module for attribute manipulation, consisting of a decoupling network and a coupling network, which are responsible for removing and adding attributes respectively. Our model is weakly supervised as it can generalize to attribute configurations unseen during training.

We address the first challenge by adopting three set operations (intersection, union and subtraction) to synthesize positive samples with the desired attribute configurations on the fly, without human labor. For example, given a subset of "brown fluffy cats" and a subset of "small fluffy cats", we can obtain a super-category "fluffy cats" through intersection. With the union operation, we get "brown small fluffy cats". Similarly, subtraction can remove an attribute, e.g., removing "fluffy" to obtain "brown small cats". Note that these operations are not applied over attributes from the same category (e.g., two color attributes). Thus, by learning a mapping from the feature space to the semantic attribute space, our model is able to generalize to novel attribute configurations in a weakly-supervised manner. We show in the zero-shot retrieval setting that our model can generalize to attributes unseen during training.
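In label space, these are ordinary set operations over attribute sets. The tiny sketch below, using the cat example above with illustrative attribute names, shows the attribute configurations they produce; the feature-space realization of these operations is defined later in Section 7.3.3.

```python
# Label-space view of the three attribute-set operations (object fixed, e.g. "cat").
subset_x = {"brown", "fluffy"}    # attributes describing the first subset
subset_y = {"small", "fluffy"}    # attributes describing the second subset

intersection = subset_x & subset_y    # {"fluffy"}                   -> "fluffy cats"
union        = subset_x | subset_y    # {"brown", "small", "fluffy"} -> "brown small fluffy cats"
subtraction  = union - {"fluffy"}     # {"brown", "small"}           -> "brown small cats"

print(intersection, union, subtraction)
```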
For the second challenge, we leverage a composition module to detach the compounding effects of different attributes. Previously, compositional learning has shown promising results for zero-shot classification [28, 126, 130]. In this chapter, we extend the idea to multi-label metric learning and re-design the composition module to tailor it to the image retrieval task. The key insight is that any complex concept can be decomposed into objects and attributes. Therefore, we propose to learn two invertible functions (the coupling function T^+ and the decoupling function T^-) to model the compositional nature of an image-attribute query. As shown in Figure 7.1, we leverage these two functions to add or remove the attribute embedding in the feature space during retrieval. To enforce attribute-object factorization, we leverage a set of regularization objectives inspired by algebraic properties such as relativity, commutativity and invertibility, which account for the constraints during the decomposition. Because of the nice properties induced by these constraints, compositionality and transferability are naturally satisfied. We optimize the above objectives in a unified model in an end-to-end manner.

To summarize, our main contributions are threefold.

• We propose a novel weakly-supervised synthesis method to address the sample insufficiency in attribute-based metric learning when multi-label attributes are present.

• We leverage and re-design a composition module to detach the compounding effects of different attributes, accompanied by a set of regularization objectives tailored to metric learning.

• We show that our model can improve existing baselines under the zero-shot retrieval setting on a large-scale multi-label dataset.

7.2 Related Work

Visual attributes and compositional learning: Attribute learning shifts the goal of visual recognition from naming to describing, by inferring richer mid-level representations. It has since been applied to the recognition of objects [45, 138], human actions [167], faces [115] and pedestrians [110], to zero-shot learning [75, 96], etc. Attribute learning therefore serves as an important component of visual understanding. The typical approach to attribute learning is very similar to that of object classification, i.e., learning a discriminative multi-label classification model [114, 188, 202]. Later, some works started to exploit attribute-attribute [70, 76] and attribute-object correlations, also known as compositional learning [130, 149]. In [130], attributes are modeled as operators independent of objects to generalize across unseen compositions of object-attribute pairs. In [199], adversarial learning is employed to model the discrepancy and correlations among attributes and objects. Purushwalkam et al. [149] propose to learn a set of modular networks controlled by gating functions to model the compatibility between the input image and the concept under consideration. Although impressive results were observed in the aforementioned works, their test-bed is the single-attribute classification task [73, 210]. In contrast, we leverage a composition module to account for the compounding effects of multiple attributes for the retrieval task [58, 187].
Image retrieval and metric learning: Metric learning losses operate on the relationships between samples in a batch to learn a regularized embedding space, mapping similar data close together and dissimilar data far apart. Because of the nice properties of the learned embeddings, they are preferred when the task is some variant of information retrieval, such as verification [160], person re-identification [63] and retrieval [92, 134, 188]. There are numerous image retrieval applications, including product search [114], image-text cross-modal retrieval [192], sketch-to-image retrieval [156] and cross-view image retrieval [109]. Recently, there is a growing trend of learning image search based on textual feedback [56, 187]. For example, Nam et al. [187] learn both an image model and a language model and benchmark different multi-modal fusion schemes. Chen et al. [32] propose to learn a self-attention and a joint-attention transformer to model region-to-region and global correlations respectively for image-text matching. These works focus on multi-modality learning for modeling image-text correspondences, while we focus on learning an attribute-object composition model and benchmark image retrieval under the zero-shot setting.

Figure 7.2: An overview of the proposed WAML framework, consisting of two main components. The compositional learning component takes an image and two attributes a_i, a_j sampled from the attribute set and outputs two multi-label features f^o_X and f^o_Y (with attributes i and j respectively) for the second component. The attribute-set synthesis module synthesizes feature embeddings representing similar content but with different attribute configurations through attribute set operations. The end result is a feature embedding space capable of representing attribute manipulations.

7.3 Method

7.3.1 Framework Overview

As explained in the introduction, our goal is to learn an embedding space for image-attribute queries in which images with similar attributes are close, together with a composition model that reflects attribute manipulations from the label space in the feature space, as shown in Figure 7.1. Our approach is schematically illustrated in Figure 7.2.

We propose a Weakly-supervised Attribute-based Metric Learning framework (WAML), consisting of two components. Objects and attributes can form diverse compositions. Therefore, in Section 7.3.2 we introduce the compositional learning module, whose goal is to achieve object-attribute factorization and to provide input to the second component in the form of visually similar pairs with different attributes. To achieve this, we leverage two invertible functions (Section 7.3.2) responsible for adding and removing attributes respectively, and design a set of constraints based on metric learning objectives (Section 7.3.2). As mentioned in the introduction, the challenge for image-attribute metric learning involves the lack of positive samples with diverse attribute configurations. Thus, in Section 7.3.3 we present the attribute-set synthesis module, which generates samples with new attribute configurations on the fly during training. It serves to regularize the embedding space by enforcing consistency between the feature space and the attribute label space. We introduce training and inference with our framework in Section 7.3.4.
7.3.2 Compositional Learning

Object-attribute Factorization

Image-attribute queries are powerful in expressing intended searches, yet learning a joint image-attribute embedding space is a daunting challenge, as the number of attribute configurations is exponential in the size of the attribute set. Instead, we adopt an image-attribute factorization paradigm similar to compositional zero-shot learning [107, 130]. As shown in Figure 7.3, by learning a composition module, the attribute embedding can be detached from the image embedding. This way, the number of image-attribute configurations is linear in the size of the attribute space. However, different from the compositional zero-shot setting, which evaluates on a single attribute, we account for the multi-label scenario, where the compounding effects of attributes are present. We discuss this in more detail in Section 7.3.2.

Figure 7.3: Our composition module consists of two invertible functions: the coupling network T^+ and the decoupling network T^-, aimed at attribute-object factorization in the embedding space. To achieve this goal, we define a set of regularization objectives based on metric learning to enhance the discriminativeness of the model. When the attribute is not present (Case 1), the manipulation through T^+ should lead to an embedding farther away than the one produced by T^-. Conversely, if the attribute is present (Case 2), the coupling operation should lead to an embedding closer than the one produced by the decoupling function.

Figure 7.4: The coupling and decoupling networks share the same network structure. They take a specific attribute embedding (e.g., a_i or a_j) to manipulate, resulting in f^o_X and f^o_Y. f^o is the image embedding extracted from ResNet-18.

In Figure 7.4, we instantiate the object-attribute factorization scheme with two networks: the coupling network T^+, which "adds" an attribute, and the decoupling network T^-, which "removes" an attribute. The two networks share the same architecture but have their own parameters. Take the coupling network as an example. It takes an image embedding f^o and an attribute embedding a_i as input and outputs f^o_X, which represents the embedding augmented with the i-th attribute:

f^o_X = W_4 ReLU(W_3 (f^o + f^o_gated))    (7.1)

where W_3 and W_4 are implemented as fully-connected layers. The gated image feature f^o_gated is obtained by learning a dot-product attention as follows:

f^o_gated = σ(W_2 ReLU(W_1 [f^o; a_i])) ⊙ f^o    (7.2)

where ⊙ denotes the element-wise product and [·;·] is the concatenation operation. W_1 and W_2 are learned in concert with the σ function to perform the gating operation, which acts as an attention mask on the image embedding f^o to couple or decouple a particular attribute.
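A minimal PyTorch-style sketch of the block defined by Equations 7.1 and 7.2 is given below. The hidden dimension, the sigmoid realization of σ, and the 512/300-dimensional inputs (ResNet-18 features and GloVe attribute vectors, consistent with Section 7.4.4) are assumptions; T^- is simply a second copy of the same module with its own parameters.

```python
import torch
import torch.nn as nn

class CompositionBlock(nn.Module):
    """One coupling (T+) or decoupling (T-) network: gated attention over the
    image embedding f_o conditioned on an attribute embedding a_i (Eqs. 7.1-7.2)."""
    def __init__(self, feat_dim=512, attr_dim=300, hidden_dim=512):
        super().__init__()
        self.w1 = nn.Linear(feat_dim + attr_dim, hidden_dim)  # W1 on [f_o; a_i]
        self.w2 = nn.Linear(hidden_dim, feat_dim)              # W2 -> gate logits
        self.w3 = nn.Linear(feat_dim, hidden_dim)              # W3
        self.w4 = nn.Linear(hidden_dim, feat_dim)               # W4
        self.relu = nn.ReLU()

    def forward(self, f_o, a_i):
        # Eq. 7.2: gate = sigma(W2 ReLU(W1 [f_o; a_i])), applied element-wise to f_o
        gate = torch.sigmoid(self.w2(self.relu(self.w1(torch.cat([f_o, a_i], dim=-1)))))
        f_gated = gate * f_o
        # Eq. 7.1: output = W4 ReLU(W3 (f_o + f_gated))
        return self.w4(self.relu(self.w3(f_o + f_gated)))

# T+ and T- share the architecture but not the parameters.
t_plus, t_minus = CompositionBlock(), CompositionBlock()
f_o = torch.randn(8, 512)   # image embeddings (e.g., from ResNet-18)
a_i = torch.randn(8, 300)   # attribute embeddings (e.g., GloVe vectors)
f_with_attr = t_plus(f_o, a_i)   # "add" attribute i
f_wo_attr = t_minus(f_o, a_i)    # "remove" attribute i
print(f_with_attr.shape, f_wo_attr.shape)
```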
Regularization Objectives

To ensure a meaningful image-attribute factorization, we define a set of optimization objectives that enforce constraints inspired by algebraic principles. Because of their inherent properties, the compositionality and transferability of the composition scheme can be modeled. We study the validity of the resulting embedding space in Section 7.4.

The goal of metric learning is to map embeddings of similar semantics close together. Denote the input image embedding as f^o, and the coupling/decoupling operations as T^+ and T^- respectively. Define "∘" as function composition and d as a distance kernel (e.g., the l2 distance).

Metric Learning Objective. For embeddings representing different object categories, we enforce the distance to be larger than a margin m, irrespective of the attributes applied to the two objects:

L_metric = max(0, m − d(f^o ∘ T^+(a_i), f^{o'} ∘ T^+(a_j))),   ∀ o ≠ o', ∀ i, j ∈ A    (7.3)

Relativity. As shown in Figure 7.3 (right), the objective should take into account the compatibility of the modified attribute with the existing attributes. For example, compounding "cute cat" with "small" should lead to an embedding farther away than removing "small" from it, since the attribute "small" is not present in the original embedding (Case 1). On the contrary, compounding "cute cat" with an existing attribute "cute" should not change the embedding, and the result should therefore be closer than removing "cute" from it (Case 2). Based on the above discussion, we enforce a margin m on the relative distance d before and after the transformation:

L_rel = max(0, d(f^o, f^o ∘ T^-(a_j)) − d(f^o, f^o ∘ T^+(a_j)) + m)
      + max(0, d(f^o, f^o ∘ T^+(a_i)) − d(f^o, f^o ∘ T^-(a_i)) + m)    (7.4)

where a_i is an attribute in A present in f^o while a_j is not present.

Commutativity. The order of attribute manipulations should not matter for the composition. For example, a "brown fluffy cat" should be the same as a "fluffy brown cat". Therefore, f^o ∘ T^+(a_i) ∘ T^-(a_j) should be the same as f^o ∘ T^-(a_j) ∘ T^+(a_i), ∀ i ≠ j:

L_com = ||f^o ∘ T^+(a_i) ∘ T^-(a_j) − f^o ∘ T^-(a_j) ∘ T^+(a_i)||_2    (7.5)

Invertibility. To enforce that the decoupling operation is the inverse of the coupling operation, we require that a T^- operation should undo the effect of a T^+ operation. For example, a "small cat" after adding and then removing the attribute "brown" should remain a "small cat":

L_inv = ||f^o ∘ T^-(a_i) ∘ T^+(a_i) − f^o||_2 + ||f^o ∘ T^+(a_i) ∘ T^-(a_i) − f^o||_2    (7.6)

Classification of objects and attributes. Besides the above regularization objectives, we add a classification loss for the object category and attributes as a strong training supervision signal:

L_cls = BCE(f^o_X, L(X)) + BCE(f^o_Y, L(Y))    (7.7)

The final composition loss is a combination of the above losses, weighted to ensure loss magnitudes of a similar scale:

L_comp = λ_1 L_metric + λ_2 L_rel + λ_3 L_com + λ_4 L_inv + λ_5 L_cls    (7.8)

7.3.3 Attribute-set Synthesis

Typical metric learning approaches [57, 200] require a large amount of annotated pairs with different relationships. In the case of image-attribute queries, this means exhaustively collecting images with different attribute configurations, which is prohibitive. We instead resort to a weakly-supervised approach that synthesizes novel samples online by leveraging samples of existing attributes. As shown in Figure 7.2 (right), our attribute-set synthesis module takes as input a pair of features f^o_X and f^o_Y and performs three kinds of set operations as follows:

f^o_int = M_int([f^o_X; f^o_Y])    (7.9)
f^o_uni = M_uni([f^o_X; f^o_Y])    (7.10)
f^o_sub = M_sub([f^o_X; f^o_Y])    (7.11)

where [·;·] is the concatenation operation, and M_int, M_uni and M_sub are implemented as three identical fully-connected networks with different parameters. We train the three set-operation networks with the set losses defined as follows:

L_int = BCE(f^o_int, L(X) ∩ L(Y))    (7.12)
L_uni = BCE(f^o_uni, L(X) ∪ L(Y))    (7.13)
L_sub = BCE(f^o_sub, L(X) \ L(Y))    (7.14)

where L is the label-extraction operation. Finally, we define the set objective to optimize for this module:

L_set = μ_1 L_int + μ_2 L_uni + μ_3 L_sub    (7.15)

where the coefficients are chosen to ensure that L_int, L_uni and L_sub are at similar scales without further fine-tuning. With the binary cross-entropy, we enforce that the synthesized feature embeddings f^o_int, f^o_uni and f^o_sub are semantically consistent with their set-operated attribute labels. The gradients of this module are backpropagated all the way back to regularize the target embedding space.
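To make the relativity, commutativity and invertibility objectives concrete, here is a small sketch of how Equations 7.4-7.6 could be computed with the coupling and decoupling blocks from the previous sketch; the squared-L2 distance, the margin value and the batch averaging are assumptions, not values stated in the text.

```python
import torch
import torch.nn.functional as F

def l2_dist(x, y):
    # distance kernel d(., .); squared L2 is one common choice
    return ((x - y) ** 2).sum(dim=-1)

def relativity_loss(f_o, a_present, a_absent, t_plus, t_minus, m=1.0):
    # Eq. 7.4: adding an absent attribute should move the embedding farther
    # than removing it; adding a present attribute should move it less.
    term_absent = F.relu(l2_dist(f_o, t_minus(f_o, a_absent))
                         - l2_dist(f_o, t_plus(f_o, a_absent)) + m)
    term_present = F.relu(l2_dist(f_o, t_plus(f_o, a_present))
                          - l2_dist(f_o, t_minus(f_o, a_present)) + m)
    return (term_absent + term_present).mean()

def commutativity_loss(f_o, a_i, a_j, t_plus, t_minus):
    # Eq. 7.5: the order of adding a_i and removing a_j should not matter.
    lhs = t_minus(t_plus(f_o, a_i), a_j)
    rhs = t_plus(t_minus(f_o, a_j), a_i)
    return l2_dist(lhs, rhs).mean()

def invertibility_loss(f_o, a_i, t_plus, t_minus):
    # Eq. 7.6: removing then adding (or adding then removing) a_i recovers f_o.
    return (l2_dist(t_plus(t_minus(f_o, a_i), a_i), f_o)
            + l2_dist(t_minus(t_plus(f_o, a_i), a_i), f_o)).mean()
```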
7.3.4 Training & Inference

To ease the training difficulty and to prevent exploding or vanishing gradients, we take a two-step procedure for training our framework. First, we train the compositional learning module using L_comp for a few epochs to initialize the embedding space. Second, we chain the compositional learning component with the attribute-set synthesis module and optimize the whole framework in an end-to-end manner:

L = α_1 L_comp + α_2 L_set    (7.16)

During inference, we only need the composition module (i.e., the coupling network and the decoupling network) to perform attribute manipulations.

7.4 Experiments

7.4.1 Datasets

We evaluate our method on three public benchmarks:

- MIT-States [73]: The dataset contains 63,440 images covering 245 objects and 115 attributes. Each image is composed of a single object-attribute pair and there are 1,962 pairs in total. We follow the setting in [126, 130] for comparisons: 1,262 pairs / 34,562 images are used for training and 700 pairs / 19,191 images for testing. On average, each object is modified by one of the 9 attributes it affords.

- UT-Zappos50k [210]: This is a fine-grained dataset with 50,025 images of shoes labeled with different shoe type-material pairs. We follow the setting in [130], using 83 object-attribute pairs / 24,898 images for training and 33 pairs / 4,228 images for testing.

The training and testing pairs of both datasets are non-overlapping, i.e., the test set contains unseen object-attribute pairs consisting of seen attributes and objects. The two datasets are complementary: MIT-States involves diverse everyday objects and attributes, and most object annotations are sparse (some classes have as few as 4 images), while UT-Zappos focuses on the fine-grained domain of shoes and each object has at least 200 annotated images.

- Fashion200k [58]: In contrast to the above two datasets, Fashion200k is a multi-label dataset consisting of 200k images of fashion products. Each image is annotated with attribute descriptions of variable length (e.g., prom and formal dresses, long sleeved tops). We use the setting in [187]: 172k images for training, and we randomly sample 10 validation sets / 31,670 test queries for testing.

7.4.2 Evaluation Criteria

We evaluate our approach under the zero-shot retrieval setting, where we report Top-1, Top-2 and Top-3 accuracies on the unseen test set as evaluation metrics. In the case of MIT-States, we also report the classification results.

7.4.3 Baselines

We compare WAML with baselines following [107, 187]. We use ResNet-18 as the backbone feature extractor for all experiments unless otherwise stated.

Visual Product trains two classifiers for attribute and object predictions independently. The probability of a pair is calculated as P(a, o) = P(a)P(o). The classifier can be either a linear SVM [126] or a single-layer softmax regression model [130].

LabelEmbed (LE) follows [130] except that it uses the GloVe word embeddings [140] for object and attribute representations, which are later transformed into a transformation matrix by 3 FC layers. The classification score is the product of the transformation matrix and the visual feature, T(e_a, e_b)^T f(I). It has the following variants.

1. LabelEmbed Only Regression (LEOR) [126] minimizes the Euclidean distance between T(e_a, e_b) and the weight of the pair SVM classifier w_ab.
2. LabelEmbed With Regression (LE+R) [126] combines the losses of LE and LEOR.

3. LabelEmbed+ [130] embeds both the object pairs and the images in the same embedding space and allows optimization of the input representations.

AnalogousAttr [28] trains a linear SVM classifier for each seen pair and utilizes tensor completion to generalize to unseen pairs.

RedWine [126] differs from LabelEmbed in that it uses the SVM weights instead of word-vector embeddings.

AttrOperator [130] treats attributes as operators independent of objects and trains a linear transformation matrix M for attribute-object compositions. We use the retrieval results reported in [107] for comparison.

TAFE-Net [196] uses the word vectors of objects and attributes and generates a binary classifier for each existing composition.

GenModel [131] learns a shared latent space for images and semantic language embedding pairs. The prediction is made by comparing the distance between the image features and the candidate pair embeddings.

TMN [149] proposes a task-driven modular architecture configured via a gating function to represent the compatibility between the images and the concept under consideration.

TIRG [187] studies the task of image retrieval where the input query is specified in the form of an image and some text that describes the desired modifications to the image. It benchmarks several composition functions for image-text fusion.

7.4.4 Implementation Details

For all experiments, we use an ImageNet-pretrained ResNet-18 as the backbone feature extractor and do not fine-tune it on our datasets, following [131]. We use the 300-dim GloVe [140] word embeddings for attribute and object embeddings. During training, we sample negative pairs consisting of the same object but different attributes to compute the losses in Section 7.3.2. Our model is trained with a single NVIDIA GPU for both datasets. We use a batch size of 512 and a learning rate of 1e-4 with Adam. We determine the weighting hyper-parameters to ensure that the losses are at a similar scale, without further tuning.

7.4.5 Results

Compositional Zero-Shot Learning

Table 7.1 shows the performance of our model (WAML) against existing baselines on MIT-States and UT-Zappos. The first five rows are baselines from [126, 131] (the scores marked with * are reproduced by [131]). WAML outperforms all existing baselines on MIT-States and UT-Zappos. Note that we improve by over 5% compared to TIRG, which learns a language model for the attribute but does not model object-attribute factorization during embedding learning. This shows the promise of compositional learning, which helps generalize across objects and attributes.

Method | MIT-States Top-1 | Top-2 | Top-3 | UT-Zappos Top-1 | Top-2 | Top-3
Visual Product [126] | 9.8/13.9 | 16.1 | 20.6 | 49.9 | / | /
LabelEmbed (LE) [126] | 11.2/13.4 | 17.6 | 22.4 | 25.8 | / | /
- LEOR [126] | 4.5 | 6.2 | 11.8 | / | / | /
- LE + R [126] | 9.3 | 16.3 | 20.8 | / | / | /
- LabelEmbed+ [131] | 14.8* | / | / | 37.4* | / | /
AnalogousAttr [28] | 1.4 | / | / | 18.3 | / | /
Red Wine [126] | 13.1 | 21.2 | 27.6 | 40.3 | / | /
AttOperator [131] | 14.2 | 19.6 | 25.1 | 46.2 | 56.6 | 69.2
TAFE-Net [196] | 16.4 | 26.4 | 33.0 | 33.2 | / | /
GenModel [131] | 17.8 | / | / | 48.3 | / | /
TIRG [187] | 12.2 | / | / | / | / | /
WAML (Ours) | 18.0 | 27.2 | 33.6 | 50.6 | 67.8 | 76.6

Table 7.1: Results of compositional zero-shot learning on MIT-States and UT-Zappos.

Another interesting observation is that most previous approaches do not surpass Visual Product on UT-Zappos, but our approach achieves a 0.7% improvement. One possibility is that UT-Zappos is a fine-grained dataset while MIT-States is much sparser in object annotations.
Object-Attribute Recognition

In Table 7.2, we show the accuracy of attribute and object prediction on the two benchmarks. The results for AttrOperator [131] are reproduced with its released code. Our classification branch (Section 7.3.2) is implemented as a simple 2-FC classifier on ResNet-18 features. WAML outperforms AttrOperator by 4% on MIT-States and 5.9% on UT-Zappos.

Method | MIT-States Attribute | MIT-States Object | UT-Zappos Attribute | UT-Zappos Object
AttrOperator [131] | 14.6 | 20.5 | 29.7 | 67.5
GenModel [131] | 15.1 | 27.7 | 18.4 | 68.1
WAML | 18.6 | 28.0 | 37.6 | 66.2

Table 7.2: Object-attribute recognition results on the two benchmarks.

Multi-Label Retrieval on Fashion200k

To validate the effectiveness of our model for multi-label retrieval, we follow the setting in [187] to report recall at rank K (R@K), computed as the percentage of test queries for which at least one correctly labeled image is among the top-K retrieval results. We observe significant margins when applying our approach to the multi-label scenario.

Method | R@1 | R@10 | R@50
Image only [187] | 3.5 | 22.7 | 43.7
Text only [187] | 1.0 | 12.3 | 21.8
Concatenation [187] | 11.9 ±1.0 | 39.7 ±1.0 | 62.6 ±0.7
TIRG [187] | 14.1 ±0.6 | 42.5 ±0.7 | 63.8 ±0.8
Han et al. [58] | 6.3 | 19.9 | 38.3
Show and Tell [186] | 12.3 ±1.1 | 40.2 ±1.7 | 61.8 ±0.9
Param Hashing [132] | 12.2 ±1.1 | 40.0 ±1.1 | 61.7 ±0.8
Relationship [157] | 13.0 ±0.6 | 40.5 ±0.7 | 62.4 ±0.6
FiLM [141] | 12.9 ±0.7 | 39.5 ±2.1 | 61.9 ±1.9
WAML | 22.8 ±0.7 | 50.3 ±0.8 | 71.6 ±0.7

Table 7.3: Retrieval performance on Fashion200k.

In Table 7.3, the first four rows show the contributions of the visual and textual modalities and of different fusion schemes in TIRG. The methods in the second group are reproduced by [187] using different multi-modality fusion schemes for retrieval on Fashion200k. In contrast, we leverage a composition model (Section 7.3.2) to detach attributes from objects. We observe an improvement of over 8% over the closest competitor, which shows the effectiveness of our framework in dealing with the compounding effects of different attributes.

Image Retrieval using Image-Attribute Queries

To qualitatively evaluate WAML, we display image retrieval results after attribute manipulations. We use the trained coupling and decoupling networks to add or remove attributes (see Figure 7.1). Given an image-attribute query p = ({a}, o), we remove the existing attributes {a} with T^- and add the desired attributes with T^+. We report top-5 KNN results in Figure 7.5 for the MIT-States and UT-Zappos datasets. Note that this is more challenging than the usual attribute-object retrieval [131, 149] because of the "attribute swapping" functionality. The first and second columns show the given image-attribute pair, and the third column indicates the desired attribute manipulations. As can be seen, WAML is able to retrieve a number of correct images within the top-5 results. This suggests that WAML is able to infer and compose attributes semantically.

Figure 7.5: Image retrieval on MIT-States and UT-Zappos after conducting attribute manipulations. The red boxes are incorrect predictions.
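At inference time this follows Section 7.3.4: edit the query embedding with the trained decoupling and coupling networks, then run nearest-neighbor search over the gallery. A minimal sketch is given below; the single-query interface, the L2 distance and the hypothetical glove() helper in the usage comment are illustrative assumptions.

```python
import torch

def manipulate_and_retrieve(f_query, attrs_to_remove, attrs_to_add,
                            t_plus, t_minus, gallery_feats, topk=5):
    """Apply T- for each existing attribute and T+ for each desired attribute,
    then return the indices of the top-k nearest gallery embeddings (L2)."""
    f = f_query
    for a in attrs_to_remove:      # e.g., embedding of "red"
        f = t_minus(f, a)
    for a in attrs_to_add:         # e.g., embedding of "dotted"
        f = t_plus(f, a)
    dists = torch.cdist(f.unsqueeze(0), gallery_feats).squeeze(0)
    return torch.topk(dists, k=topk, largest=False).indices

# Usage with the CompositionBlock sketch from Section 7.3.2 (shapes illustrative):
# idx = manipulate_and_retrieve(f_query, [glove("red")], [glove("dotted")],
#                               t_plus, t_minus, gallery_feats)
```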
Ablation Study

We conduct ablation studies on the different components of our framework and report the results in Table 7.4. We verify the effectiveness of the proposed regularization objectives by removing them. The Top-1 performance drops from 18.0% to 12.1% on MIT-States and from 50.6% to 41.2% on UT-Zappos when the constraints involving the relativity, commutativity, invertibility and classification objectives are removed.

Method | MIT-States Top-1 | Top-2 | Top-3 | UT-Zappos Top-1 | Top-2 | Top-3
WAML | 18.0 | 27.2 | 33.6 | 50.6 | 67.8 | 76.6
WAML w/o L_comp | 9.8 | 17.3 | 23.0 | 20.2 | 40.0 | 51.0
WAML w/o L_set | 16.9 | 24.5 | 30.9 | 47.6 | 64.4 | 73.0
WAML L_metric only | 12.1 | 22.7 | 29.0 | 41.2 | 55.0 | 64.8
WAML Cos dist. | 17.5 | 26.1 | 31.7 | 48.6 | 66.9 | 76.2

Table 7.4: Results of the ablation studies.

We also confirm the necessity of the attribute-set synthesis module, which accounts for a 1.1% and 3.0% increase in Top-1 on MIT-States and UT-Zappos respectively. We find the influence of the distance measure to be negligible when switching from the l2 to the cosine distance for the objectives in Section 7.3.2.

7.5 Conclusion

We presented a compositional learning framework for attribute-based metric learning. Compared to previous work, our model deals with the multi-label case in a weakly-supervised manner. This moves the field a step closer to the real-world scenario. To achieve attribute manipulation during inference, we redesigned the compositional zero-shot learning module to detach the compounding effects of different attributes. For future work, we will study the interactions between different attributes and their effects on the retrieval results.

Chapter 8
Conclusion

In this thesis, we focused on the theory and applications of adversarial and structured knowledge learning.

We showed that generative adversarial learning is a powerful methodology that can be leveraged for human-robot interaction, human-in-the-loop curriculum reinforcement learning and flexible portrait manipulation. For the first two research topics, we generalized the generative adversarial network by incorporating a human as an incarnation of the discriminator network. With prior/domain knowledge, human users can judge the robustness of grasps, identify the best adversarial perturbations, or design a curriculum that matches the agent's current level to guide the reinforcement learning procedure. We found that human-robot adversarial games improve the robustness of robot models compared to self-supervised learning, and that human-guided curriculum reinforcement learning facilitates skill transfer through careful difficulty selection.

These findings may seem intuitive, but the technical work required to make them possible is a massive undertaking. For human-robot adversarial learning, we relaxed the strong assumption of a cooperating supervisor common in the field of HRI and developed an interactive robot platform that supports end-to-end adversarial learning with high-dimensional visual input. For human-in-the-loop curriculum reinforcement learning, we released a powerful interactive RL ecosystem that consists of: 1) a massively parallelized RL-training executable that runs directly on a Mac computer without any dependencies such as TensorFlow or PyTorch; and 2) three highly configurable environments of various levels of difficulty that support online interaction during training.

In the third research topic, we leveraged human priors on facial landmarks as input to a generative adversarial network and demonstrated high-resolution, photo-realistic and continuous manipulation results in terms of facial expressions and modality transfer.

Another major theme of this thesis is structured knowledge representation via distance metric learning. In the fourth research topic, we learned a structured latent representation space to model a user's preference for different attributes, without fine-grained item annotations.
Then we leveraged a graph representation to account for contextual information when making a com- patibility recommendation. In the fifth research topic, we presented a self-training framework to scale deep metric learn- ing by exploiting additional unlabeled data. We introduced a new feature basis learning approach that learns basis functions to better model pairwise similarity. The learned basis vectors are used to select high-confidence sample pairs, which reduces the noise introduced by the teacher net- work and allows the student network to learn more effectively. Results on standard retrieval benchmarks demonstrate our method outperforms several state-of-the art methods, and signifi- cantly boosts the performance of fully-supervised approaches. In the sixth research topic, a flexible compositional learning framework was proposed to allow for attribute manipulation of an image retrieval system. A user can customize searches through “adding” or “removing” certain attributes in the query to facilitate various search needs. As the attribute configuration space is huge, learning a joint attribute representation is anno- tation expensive. Instead, we leveraged a structured representation to transfer attributes across categories and instances. This helps reduce the exponential complexity to linear size in a weakly- supervised manner. 117 Bibliography [1] D. Abel, J. Salvatier, A. Stuhlm¨ uller, and O. Evans. Agent-agnostic human-in-the-loop reinforcement learning. arXiv preprint arXiv:1701.04079, 2017. [2] P. Agrawal, A. V . Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082, 2016. [3] K. E. Ak, A. A. Kassim, J. Hwee Lim, and J. Yew Tham. Learning attribute representa- tions with localization for flexible fashion search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7708–7717, 2018. [4] D. Arumugam, J. K. Lee, S. Saskin, and M. L. Littman. Deep reinforcement learning from policy-dependent human feedback. arXiv preprint arXiv:1902.04257, 2019. [5] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen. Bringing portraits to life. ACM Transactions on Graphics (TOG), 36(6):196, 2017. [6] I. B. Ayed. A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In ECCV, 2020. [7] B. Baker, I. Kanitscheider, T. Markov, Y . Wu, G. Powell, B. McGrew, and I. Mordatch. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528, 2019. [8] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748, 2017. [9] C. Bartneck and J. Hu. Exploring the abuse of robots. Interaction Studies, 9(3):415–433, 2008. [10] D. Bau, J.-Y . Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. Gan dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018. [11] Y . Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceed- ings of the 26th annual international conference on machine learning, pages 41–48, 2009. [12] D. Berenson, R. Diankov, K. Nishiwaki, S. Kagami, and J. Kuffner. Grasp planning in complex scenes. In 2007 7th IEEE-RAS International Conference on Humanoid Robots, pages 42–48. IEEE, 2007. [13] D. Berenson and S. S. Srinivasa. Grasp synthesis in cluttered environments for dexter- ous hands. 
In Humanoids 2008-8th IEEE-RAS International Conference on Humanoid Robots, pages 189–196. IEEE, 2008. 118 [14] R. v. d. Berg, T. N. Kipf, and M. Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017. [15] M. Bertalmio, G. Sapiro, V . Caselles, and C. Ballester. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424, 2000. [16] A. Bicchi and V . Kumar. Robotic grasping and contact: A review. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automa- tion. Symposia Proceedings (Cat. No. 00CH37065), volume 1, pages 348–353. IEEE, 2000. [17] J. M. Bland and D. G. Altman. Statistics notes: Cronbach’s alpha. Bmj, 314(7080):572, 1997. [18] V . Blanz, C. Basso, T. Poggio, and T. Vetter. Reanimating faces in images and video. In Computer graphics forum, volume 22, pages 641–650. Wiley Online Library, 2003. [19] V . Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194. ACM Press/Addison-Wesley Publishing Co., 1999. [20] J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics, 30(2):289–309, 2014. [21] A. P. Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997. [22] G. Brockman, V . Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym, 2016. [23] A. Brown, W. Xie, V . Kalogeiton, and A. Zisserman. Smooth-ap: Smoothing the path towards large-scale image retrieval. In ECCV, 2020. [24] D. Brsci´ c, H. Kidokoro, Y . Suehiro, and T. Kanda. Escaping from children’s abuse of social robots. In Proceedings of the tenth annual acm/ieee international conference on human-robot interaction, pages 59–66. ACM, 2015. [25] F. Cakir, K. He, X. Xia, B. Kulis, and S. Sclaroff. Deep metric learning to rank. In CVPR, 2019. [26] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020. [27] R. Caruana, S. Lawrence, and C. L. Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in neural information processing systems, pages 402–408, 2001. [28] C.-Y . Chen and K. Grauman. Inferring analogous attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 200–207, 2014. 119 [29] Q. Chen and V . Koltun. Photographic image synthesis with cascaded refinement networks. In The IEEE International Conference on Computer Vision (ICCV), volume 1, 2017. [30] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020. [31] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020. [32] Y . Chen, S. Gong, and L. Bazzani. Image search with text feedback by visiolinguistic attention learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3001–3011, 2020. [33] Y .-C. Chen, C.-T. Chou, and Y .-C. F. Wang. Learning to learn in a semi-supervised fash- ion. ECCV, 2020. [34] Y .-C. Chen, H. Lin, M. Shu, R. Li, X. Tao, X. Shen, Y . Ye, and J. Jia. 
Facelet-bank for fast portrait manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3541–3549, 2018. [35] Y . Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified genera- tive adversarial networks for multi-domain image-to-image translation. arXiv preprint arXiv:1711.09020, 2017. [36] Y . Choi, Y . Uh, J. Yoo, and J.-W. Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8188–8197, 2020. [37] G. Cucurull, P. Taslakian, and D. Vazquez. Context-aware visual compatibility prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12617–12626, 2019. [38] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. [39] A. Dragan and S. Srinivasa. Generating legible motion. 2013. [40] J. Duan, X. Guo, Y . Song, C. Yang, and C.-C. J. Kuo. Portraitgan for flexible portrait manipulation. arXiv preprint arXiv:1807.01826, 2018. [41] J. Duan, X. Guo, S. Tran, and C.-C. J. Kuo. Fashion compatibility recommendation via unsupervised metric graph learning. [42] J. Duan, J. Wan, S. Zhou, X. Guo, and S. Z. Li. A unified framework for multi-modal isolated gesture recognition. ACM Transactions on Multimedia Computing, Communica- tions, and Applications (TOMM), 14(1s):21, 2018. [43] J. Duan, Q. Wang, L. Pinto, C.-C. J. Kuo, and S. Nikolaidis. Robot learning via human adversarial games. CoRR, 2019. 120 [44] V . Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016. [45] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785. IEEE, 2009. [46] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel. Reverse curriculum gen- eration for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017. [47] O. Fried, E. Shechtman, D. B. Goldman, and A. Finkelstein. Perspective-aware manipu- lation of portrait photos. ACM Transactions on Graphics (TOG), 35(4):128, 2016. [48] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016. [49] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov. Neighbourhood components analysis. In Advances in neural information processing systems, pages 513– 520, 2005. [50] I. Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016. [51] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial nets. In Advances in neural informa- tion processing systems, pages 2672–2680, 2014. [52] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial exam- ples. arXiv preprint arXiv:1412.6572, 2014. [53] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated cur- riculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1311–1320. JMLR. org, 2017. [54] S. Griffith, K. Subramanian, J. Scholz, C. L. Isbell, and A. L. Thomaz. 
Policy shaping: Integrating human feedback with reinforcement learning. In Advances in neural informa- tion processing systems, pages 2625–2633, 2013. [55] J.-B. e. a. Grill. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020. [56] X. Guo, H. Wu, Y . Gao, S. Rennie, and R. Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. arXiv preprint arXiv:1905.12794, 2019. [57] R. Hadsell, S. Chopra, and Y . LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006. [58] X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y . Li, Y . Zhao, and L. S. Davis. Automatic spatially-aware fashion concept discovery. In ICCV, 2017. 121 [59] X. Han, Z. Wu, Y .-G. Jiang, and L. S. Davis. Learning fashion compatibility with bidi- rectional lstms. In Proceedings of the 25th ACM international conference on Multimedia, pages 1078–1086. ACM, 2017. [60] K. He, H. Fan, Y . Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. [61] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y . Tassa, T. Erez, Z. Wang, S. Eslami, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017. [62] D. Held, X. Geng, C. Florensa, and P. Abbeel. Automatic goal generation for reinforce- ment learning agents. 2018. [63] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re- identification. arXiv preprint arXiv:1703.07737, 2017. [64] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. [65] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in neural information processing systems, pages 4565–4573, 2016. [66] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015. [67] J. Hoffman, E. Tzeng, T. Park, J.-Y . Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017. [68] M. Hou, L. Wu, E. Chen, Z. Li, V . W. Zheng, and Q. Liu. Explainable fashion recommen- dation: A semantic attribute region guided approach. arXiv preprint arXiv:1905.12862, 2019. [69] J. Hu, J. Lu, and Y .-P. Tan. Discriminative deep metric learning for face verification in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1875–1882, 2014. [70] S. Huang, M. Elhoseiny, A. Elgammal, and D. Yang. Learning hypergraph-regularized attribute predictors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 409–417, 2015. [71] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. CoRR, abs/1703.06868, 2017. [72] X. Huang, M.-Y . Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to- image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018. 122 [73] P. Isola, J. J. Lim, and E. H. Adelson. Discovering states and transformations in image collections. In Proceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1383–1391, 2015. [74] P. Isola, J.-Y . Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. 
Abstract
Deep learning has brought impressive improvements to many fields, including computer vision, robotics, reinforcement learning, and recommendation systems, thanks to end-to-end data-driven optimization. However, people have little control over the system during training and a limited understanding of the structure of the knowledge being learned. In this thesis, we study the theory and applications of adversarial and structured knowledge learning: 1) learning adversarial knowledge with human interaction or by incorporating human-in-the-loop
Asset Metadata
Creator: Duan, Jiali (author)
Core Title: Theory and applications of adversarial and structured knowledge learning
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 03/15/2021
Defense Date: 03/02/2021
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: adversarial learning, human-robot adversarial games, OAI-PMH Harvest, representation learning, structured knowledge learning
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Kuo, Chung-Chieh Jay (committee chair), Chugg, Keith Michael (committee member), Nakano, Aiichiro (committee member), Nikolaidis, Stefanos (committee member)
Creator Email: jialidua@usc.edu, jli.duan@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-426007
Unique Identifier: UC11666638
Identifier: etd-DuanJiali-9316.pdf (filename), usctheses-c89-426007 (legacy record id)
Legacy Identifier: etd-DuanJiali-9316.pdf
Dmrecord: 426007
Document Type: Dissertation
Rights: Duan, Jiali
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA