STRUCTURED VISUAL UNDERSTANDING AND GENERATION WITH DEEP GENERATIVE MODELS

by Yuhang Song

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2020

Copyright 2020 Yuhang Song

Acknowledgments

I would like to express my sincere gratitude to my advisor, Professor C.-C. Jay Kuo, for his continuous support of my Ph.D. study during these past five years. He has given me the freedom to pursue various projects without objection and has provided insightful discussions about the research. In addition, his immense knowledge and research experience have greatly broadened my horizons, and his enthusiasm and persistence in research have encouraged me to conquer challenges during my Ph.D. study.

Besides my advisor, I would like to thank the rest of my committee members: Professor Alexander Sawchuk, Professor Ulrich Neumann, Professor Panayiotis Georgiou, and Professor Antonio Ortega, for their insightful comments and encouragement.

My sincere thanks also go to Professor Hao Li, Dr. Zhe Lin, Dr. Pengchuan Zhang, Dr. Lei Zhang, and Dr. Jianfeng Gao, who provided me with opportunities for collaborations and internships, as well as access to research resources. I thank my fellow labmates in MCL, particularly Qin Huang, Haiqiang Wang, Ye Wang, Yeji Shen, Harry Yang, Siyang Li, Yueru Chen, Heming Zhang, Junting Zhang, and many others. I also thank my friends from outside the lab and institution, such as Dr. Wenbo Li, Dr. Jianwei Yang, Dr. Jiahui Yu, and Dr. Yang Gao.

Last but not least, I would like to thank my family for supporting me spiritually throughout my Ph.D. study and my life. Their love and encouragement will enlighten my life forever.

Contents

Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Significance of the Research
1.2 Contributions of the Research
1.2.1 Contextual Based Image Inpainting
1.2.2 Segmentation Guided Image Inpainting
1.2.3 Novel Human-Object Interaction Detection
1.3 Organization of the Thesis

2 Research Background
2.1 Convolutional Neural Networks
2.2 Image Semantic Segmentation
2.3 Generative Adversarial Networks
2.4 Image Inpainting
2.5 Object Detection
2.6 Visual Relationship Detection

3 Contextual Based Image Inpainting: Infer, Match and Translate
3.1 Introduction
3.2 Methodology
3.2.1 Problem Description
3.2.2 System Overview
3.2.3 Training
3.2.4 Multi-scale Inference
3.3 Experiments
3.3.1 Experiment Setup
3.3.2 Results
3.3.3 Analysis
3.3.4 More Experiment Results
3.4 Conclusion

4 SPG-Net: Segmentation Prediction and Guidance Network for Image Inpainting
4.1 Introduction
4.2 Approach
4.2.1 Segmentation Prediction Network (SP-Net)
4.2.2 Segmentation Guidance Network (SG-Net)
4.3 Experiments
4.3.1 Experiment Setup
4.3.2 Comparisons
4.3.3 Analysis
4.4 Conclusion

5 Novel Human-Object Interaction Detection via Adversarial Domain Generalization
5.1 Introduction
5.2 Problem Statement
5.2.1 Problem Formulation
5.2.2 Dataset Creation
5.2.3 Evaluation Metrics
5.3 Method
5.3.1 Overview
5.3.2 Adversarial Domain Generalization (ADG)
5.3.3 Architectures
5.4 Experiments
5.4.1 Evaluations
5.4.2 Analysis
5.4.3 More Experiment Results
5.4.4 Proofs and Additional Derivations
5.4.5 Network Architectures
5.4.6 More Visualization Results
5.5 Conclusion

6 Conclusion and Future Work
6.1 Summary of the Research
6.2 Future Research
6.2.1 Scene Graph Generation
6.2.2 Downstream Tasks

Bibliography

List of Tables

3.1 Numerical comparison on 200 test images of ImageNet.
4.1 Numerical comparison on 200 test images of Cityscapes.
5.1 Statistics of the new splits.
5.2 Performance on the original HICO-DET dataset.
5.3 Performance on the new split of the HICO-DET dataset. For [52], ADG-KLD, CADG-KLD, and CADG-JSD, we measure the relative ratio with the baseline to compute the gain and loss on PredCls R@1.
5.4 Performance of the evaluation metric PredCls on the UnRel dataset. For [52], ADG-KLD, CADG-KLD, and CADG-JSD, we measure the relative ratio with the baseline to compute the gain and loss on R@1.
5.5 Per-class evaluations of PredCls R@1 on the HICO-DET test set. The second column indicates the number of instances of each predicate in the training set. We show the baseline and our proposed models for each action.
5.6 Ablation study on the HICO-DET dataset. For each model, we show its inference results using the full model and HSp branches. The relative gain of the union-box branch of each model is also calculated.
5.7 Performance on the new split of the HICO-DET dataset. Compared with Table 5.3, we add the last column of the metric on val, which shows our model selection criterion.
5.8 Performance of the early insertion of the adversarial branch on the HICO-DET dataset.
5.9 The basic blocks for architecture design. ("-" connects two consecutive layers.)
5.10 The structures for the baseline model and our proposed methods ADG-KLD, CADG-KLD, and CADG-JSD.

List of Figures

2.1 The framework of FCN [59], which combines high-level, low-resolution layers with low-level, high-resolution layers. The figure is from [59].
2.2 Atrous convolution. The figure is from [57].
2.3 GANs framework. Two players are playing against each other. The goal of the discriminator is to output the probability that the input is real, while the goal of the generator is to generate an image which fools the discriminator. The figure is from [26].
2.4 DCGAN generator. A 100-dimensional uniform distribution z is projected to a spatial convolutional representation with many feature maps. Four following fractionally-strided convolutions then convert this representation into a 64x64 output image. The figure is from [70].
2.5 Examples of translating an input image into a corresponding output image in a different domain. The figure is from [38].
2.6 An example of style transfer. The figure is from Google Images.
2.7 The RCNN framework. The figure is from [24].
2.8 The Fast-RCNN framework. The figure is from [23].
2.9 The Region Proposal Network (RPN). The figure is from [71].
2.10 Examples of visual relationship detection results. The figure is from [60].
2.11 For images that contain the same objects, the relationship could be different and determine the interpretation of the image. The figure is from [60].
2.12 Long-tail distribution of relationships in the training dataset. The figure is from the slides of [60].
2.13 Visual phrase architecture. The figure is from [75].
2.14 An overview of the visual relationship detection pipeline in [60]. The figure is from [60].
2.15 DR-Net framework. The figure is from [10].
2.16 VTransE framework. The figure is from [97].
2.17 Graph R-CNN framework. The figure is from [93].
3.1 Our result compared with GL inpainting [35]. (a) & (d) The input image with a missing hole. (b) & (e) Inpainting result given by GL inpainting [35]. (c) & (f) Final inpainting result using our approach. The size of the images is 512x512.
3.2 Overview of our network architecture. We use the Image2Feature network as coarse inference and use a VGG network to extract a feature map. Then patch-swap matches neural patches from the boundary to the hole. Finally the Feature2Image network translates to a complete, high-resolution image.
3.3 Illustration of the patch-swap operation. Each neural patch in the hole searches for the most similar neural patch on the boundary, and then swaps with that patch.
3.4 Multi-scale inference.
3.5 Left: using deconvolution (a) vs. resize-convolution (b). Middle: using l2 reconstruction loss (c) vs. using perceptual loss (d). Right: training the Feature2Image network using different input data. (e) Result when trained with the Image2Feature prediction. (f) Result when trained with ground truth. (g) Result when fine-tuned with ground truth and prediction mixtures.
3.6 Arbitrary shape inpainting of real-world photography. (a), (d): Input. (b), (e): Inpainting mask. (c), (f): Output.
3.7 Arbitrary style transfer. (a), (d): Content. (b), (e): Style. (c), (f): Result.
3.8 Failure cases. (a), (c) and (e): Input. (b), (d) and (f): Output.
3.9 Visual comparisons of ImageNet results with random hole. Each example from top to bottom: input image, GLI [35], our result. All images have size 256x256.
3.10 Visual comparisons of ImageNet and COCO results. Each example from left to right: input image, CAF [2], CE [66], NPS [91], GLI [35], our result w/o Feature2Image, our final result. All images have size 512x512.
3.11 Plot of (a) reconstruction loss and (b) discriminator loss at scale 256.
3.12 Iterative inference compared with single inference. (a) and (c): results of single inference. (b) and (d): results after five iterations of texture refinement using the Feature2Image network.
3.13 Effect of different architectures for the Feature2Image network. (a) Input. (b) Output using a decoder structure. (c) Output using an encoder-decoder structure with skip connection.
3.14 Effect of training Feature2Image with different inputs. (a) Patch-swap on the raw image. (b) Patch-swap on relu2_1's response. (c) Patch-swap on relu4_1's response. (d) Patch-swap on relu3_1's response.
3.15 Additional COCO results. Images are scaled to 512x512 with fixed holes. (a) Input. (b) CE [66]. (c) NPS [91]. (d) GLI [35]. (e) Our final result.
3.16 Additional ImageNet results. Images are scaled to 512x512 with fixed holes. (a) Input. (b) CE [66]. (c) NPS [91]. (d) GLI [35]. (e) Our final result.
3.17 Additional COCO results. Images are scaled to 256x256 with fixed holes. (a) Input. (b) GLI [35]. (c) Our final result.
3.18 Additional COCO results. Images are scaled to 512x512 with fixed holes. (a) Input. (b) GLI [35]. (c) Our final result.
4.1 Comparison of our intermediate and final result with GL inpainting [35]. (a) Input image with a missing hole. (b) Deeplabv3+ [7] output. (c) SP-Net result. (d) SG-Net result. (e) Inpainting result given by GL inpainting [35]. The size of the images is 256x256.
4.2 Visual comparisons of Cityscapes results with random hole. Each example from left to right: input image, PatchMatch [2], GL [35], SP-Net output, and SG-Net output (our final result). All images have size 256x256. Zoom in for better visual quality.
4.3 Visual comparisons of Helen Face Dataset results with random hole. Each example from left to right: input image, GFC [51], and our result. All images have size 256x256.
4.4 Ablation study. (a) Input image with a missing hole. (b) Deeplabv3+ [7] output. (c) SP-Net result. (d) SG-Net result. (e) Baseline result. The size of the images is 256x256.
4.5 Interactive editing. (a) Input image with a missing hole. (b) Ground truth. (c) First label map. (d) Inpainting result based on (c). (e) Second label map. (f) Inpainting result based on (e). The size of the images is 256x256.
5.1 Novel relationship detection. Green box: subject. Red box: object. The first two images are from the training set, while the last image contains an unseen triplet from the test set.
5.2 Number of instances in the HICO-DET dataset for each (a) HOI category, (b) predicate category of "horse".
5.3 Architectures. (a) Baseline architecture, which consists of (i) a union-box branch, (ii) a human branch, and (iii) a spatial branch. (b) The proposed ADG framework for domain generalization. (c) ADG-KLD. (d) CADG-KLD. (e) CADG-JSD.
5.4 Qualitative results on test images. Green box: human. Red box: object. Blue box: union box of object and human with a margin. Green text indicates correct predictions and red text implies wrong ones. Images of the first two columns are from the HICO-DET dataset, and images of the last column are from the UnRel dataset.
5.5 Grad-CAM visualization. First row: visualization of "ride" from the union-box features. Second row: visualization of "hold" from the backbone features before the ROI Align module (we only keep the visualization inside the union box). Green box: human. Red box: object. Blue box: union box of object and human with a margin. (a) Input images. (b) Baseline. (c) ADG-KLD. (d) CADG-KLD. (e) CADG-JSD.
5.6 Grad-CAM visualization of the predicates from the union-box features. Green box: human. Red box: object. Blue box: union box of object and human with a margin. (a) Input images. (b) Baseline. (c) ADG-KLD. (d) CADG-KLD. (e) CADG-JSD. Zoom in for better view.
5.7 Grad-CAM visualization of the predicates from the backbone features before the ROI Align module (we only keep the visualization inside the union box for simplification). Green box: human. Red box: object. Blue box: union box of object and human with a margin. (a) Input images. (b) Baseline. (c) ADG-KLD. (d) CADG-KLD. (e) CADG-JSD. Zoom in for better view.

Abstract

In recent years, deep learning has made a great impact on the computer vision community. Nowadays, deep learning models can recognize thousands of image categories using architectures that have become deeper and deeper. In complex scenes, deep neural models can localize objects, detect a large number of object categories, and then perform instance segmentation. Most recently, a number of scene graph generation and visual relationship detection methods have been developed for high-level image understanding, in order to extract a more fine-grained and structural representation from images. As a dual problem of visual understanding, visual generation has also attracted much attention in the last few years in light of deep learning techniques. Deep generative models can generate realistic images with high resolution and high quality, and can further be applied to image translation across different domains and environments. The world around us is highly structured, and so are images: they can contain multiple foreground object categories as well as various backgrounds, in both natural scenes and artificial scenarios. In this thesis, we mainly leverage structure information for visual generation and understanding in the following tasks: 1) leveraging the semantic structure to generate realistic images; 2) making use of segmentation maps for high-quality image generation; 3) applying a domain generalization approach to extract human-object interaction features for a structured representation of images.

Image inpainting is the task of reconstructing the missing region in an image with plausible contents based on its surrounding context, which is a common topic of low-level computer vision. In order to overcome the difficulty of directly learning the distribution of high-dimensional image data, we divide the task into inference and translation as two separate steps and model each step with a deep neural network. We also use simple heuristics to guide the propagation of local textures from the boundary to the hole. We show that, by using such techniques, inpainting reduces to the problem of learning two image-feature translation functions in a much smaller space and is hence easier to train. We evaluate our method on several public datasets and show that we generate results of better visual quality than previous state-of-the-art methods.

To go deeper into the image inpainting task, the second research idea is motivated by the fact that existing methods based on generative models do not exploit segmentation information to constrain the object shapes, which usually leads to blurry results on the boundary. To tackle this problem, we propose to introduce semantic segmentation information, which disentangles the inter-class difference and intra-class variation for image inpainting. This leads to a much clearer recovered boundary between semantically different regions and better texture within semantically consistent segments. Our model factorizes the image inpainting process into segmentation prediction (SP-Net) and segmentation guidance (SG-Net) as two steps, which predict the segmentation labels in the missing area first, and then generate segmentation-guided inpainting results.
Experiments on multiple public datasets show that our approach outperforms existing methods in optimizing the image inpainting quality, and the interactive segmentation guidance provides possibilities for multi-modal predictions of image inpainting.

On the visual understanding side, we study the problem of human-object interaction (HOI) detection, which is to recognize the relationship between humans and objects in images. We observe that the ground-truth label distribution of the relationship triplets is usually extremely imbalanced, which leads to an ineffective feature representation and model learning. Therefore, we focus on novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios. Most existing HOI methods heavily rely on object priors and can hardly generalize to unseen combinations. To tackle this problem, we propose a unified framework of domain generalization to learn object-invariant features for predicate prediction and instantiate three methods based on this framework. To measure the performance improvement, we create a new split of the HICO-DET dataset, where the HOIs in the test set are all unseen triplet categories in the training set. Our experiments show that the proposed framework significantly increases the performance in detecting novel HOIs on both the new split of the HICO-DET dataset and the UnRel dataset, which is used for auxiliary evaluation.

Chapter 1

Introduction

1.1 Significance of the Research

In this thesis, we mainly try to leverage structure information for visual generation and understanding tasks. On the visual generation side, we mainly work on the image inpainting task, which is to fill in the missing part of an image with visually plausible contents that can fool human perception. On the visual understanding side, we work on human-object interaction (HOI) detection to recognize the relationship between humans and objects in images.

Image inpainting is one of the most common tasks of low-level computer vision and image processing, which aims at creating semantically meaningful contents with realistic textures. Typically, the inpainting algorithm is given an input image with a missing hole of arbitrary shape and size, and it should output a complete image in which the hole is filled and the filling content is consistent with the given context of the original image. Image inpainting is a crucial task with multiple real-world applications. In earlier years, when electronic devices were not yet prevalent, people mostly kept physical photos, which are very easy to damage or corrupt. However, there are many valuable photos that people want to recover and restore. Therefore, the demand for high-quality image inpainting methods has been increasing over the years to deal with this situation. Given a photo with physical smudges or broken missing parts, the inpainting technique can help reconstruct a clean picture with all the artifacts removed.

Another application of image inpainting is object removal, which is to remove unwanted objects or object parts in an image. When people take pictures, it is common that the image contains some unwanted objects, such as passersby in the background, passing cars on the road, private objects or identities, etc. In this case, image inpainting is used to remove these kinds of objects to make the picture more focused on the key components.
Although the topic of image inpainting has been studied for many years, it remains a challenging problem and there is still a long way to go. While high-quality image inpainting requires generating extremely realistic textures that can fool human eyes, the inpainted part should also be coherent and consistent with the surrounding background; otherwise, it is very easy to notice differences and disharmony in the image. Therefore, the inpainting algorithm should complete the semantic layouts and object parts as well as generate similar textures to comply with the style of the given context.

Traditionally, image-processing-based methods were used to design inpainting algorithms. After the rapid development of Generative Adversarial Networks (GANs) and related models in the field of image generation, generative-model-based approaches have enabled high-quality image generation and inpainting that outperforms the traditional methods. However, these approaches still come with challenges. On one hand, the generative models can generate meaningful contents based on the given context, but the filled textures are still not realistic and consistent, which leads to obvious disharmony in the output images. On the other hand, when it comes to the interaction of multiple objects in the image, the inpainted textures cannot provide a sharp and clear boundary between different objects, which leads to a blurred inpainting result.

In this thesis, we first focus on making the quality of the inpainted textures more realistic based on generative models. Then we try to produce sharp boundaries between objects in the inpainting result. Specifically, we explore incorporating semantic segmentation information into the inpainting process, which can represent the precise locations of different objects.

For visual understanding, we focus on the human-object interaction (HOI) detection problem to recognize the relationship between humans and objects in images. Over the past few years, there has been rapid progress in visual recognition tasks, including object detection, segmentation, and action recognition. However, understanding a scene requires not only detecting individual object instances but also recognizing the visual relationship between object pairs. One particularly important class of visual relationship detection is detecting and recognizing how each person interacts with the surrounding objects. This task, known as human-object interaction (HOI) detection, aims to localize a person and an object, as well as identify the interaction between the person and the object. Detecting and recognizing HOI is an essential step towards a deeper understanding of the scene. Instead of "What is where?" (i.e., localizing object instances in an image), the goal of HOI detection is to answer the question "What is happening?". Studying the HOI detection problem also provides important cues for other related high-level vision tasks, such as pose estimation, image captioning, and image retrieval.

A long-standing problem in both HOI detection and visual relationship detection more broadly is the long-tail problem, where specific predicates dominate the triplet instances for most of the object categories. For example, if we collect all images with instances containing "horse" in the HICO-DET dataset, the most frequent interaction is "ride", which occurs much more often than other possible relations like "feed" or "pull".
In this case, a small number of categories dominate the training instances, allowing a learned model to rely on a frequency prior rather than learning the relationship features themselves.

Collecting a balanced dataset is a simple approach to tackle this problem. However, if we have N predicates and M objects, the possible number of triplet combinations is M × N. It is difficult to collect all those possible combinations due to the infrequency of relationships, which brings the combinatorial prediction problem. For the relationship detection task, the long-tail problem, the combinatorial problem, and the frequency prior barrier are closely related in the sense that the distribution of the triplet categories is extremely imbalanced in the large compositional space of triplet categories.

In this thesis, we focus on the novel HOI detection problem, where the predicate-object combinations in the test set are never seen in the training set. Specifically, the training and test sets share the same predicate categories, but the combinations of predicate and object categories in the test set are unseen, which we formulate as a domain generalization problem. This task is challenging because the model is required to learn object-invariant predicate features and generalize to unseen interactions, which can then be further applied to downstream tasks.

1.2 Contributions of the Research

1.2.1 Contextual Based Image Inpainting

Generally, a realistic image is composed of two parts, i.e., meaningful contents and vivid textures. While generative models are quite helpful for content generation, we propose to enhance the texture generation based on the given surrounding context. Therefore, we divide the whole inpainting process into two steps, in which the first step generates low-resolution but semantically plausible contents and the second step focuses on texture refinement.

• We design a learning-based inpainting system that is able to synthesize missing parts in a high-resolution image with high-quality contents and textures.
• In order to overcome the difficulty of directly learning the distribution of high-dimensional image data, we divide the task into inference and translation as two separate steps and model each step with a deep neural network.
• We show that, by using such techniques, inpainting reduces to the problem of learning two image-feature translation functions in a much smaller space and is hence easier to train. We evaluate our method on several public datasets and show that we generate results of better visual quality than previous state-of-the-art methods.

1.2.2 Segmentation Guided Image Inpainting

Based on our observations from the application of current generative models, direct image generation often leads to blurred output images that appear quite unrealistic, especially along the boundaries between different objects, while image translation from segmentation maps to RGB images performs quite well in generating texture details. Inspired by this observation, we propose to introduce segmentation information into the inpainting process and to guide the image generation based on the segmentation cues.

• We propose to introduce semantic segmentation information, which disentangles the inter-class difference and intra-class variation for image inpainting.
• We factorize the image inpainting process into segmentation prediction (SP-Net) and segmentation guidance (SG-Net) as two steps, which predict the segmentation labels in the missing area first, and then generate segmentation-guided inpainting results.
• Experiments on multiple public datasets show that our approach outperforms existing methods in optimizing the image inpainting quality, and the interactive segmentation guidance provides possibilities for multi-modal predictions of image inpainting.

1.2.3 Novel Human-Object Interaction Detection

Based on our observations, the ground-truth label distribution of the relationship triplets is usually extremely imbalanced, which leads to an ineffective feature representation and model learning. Therefore, we focus on novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios. Most existing HOI methods heavily rely on object priors and can hardly generalize to unseen combinations. To tackle this problem, we propose a unified framework of domain generalization to learn object-invariant features for predicate prediction and instantiate three methods based on this framework.

• We create a new benchmark dataset for the novel HOI detection task, based on the images and annotations from the HOI detection datasets, where the new benchmark dataset avoids any overlap of the triplet categories among the training set, validation set, and test set.
• We propose a unified domain generalization framework and instantiate both conditional and unconditional approaches to improve the generalization ability of models.
• Experiments on multiple datasets show that our proposed methods achieve uniformly significant improvements on all metrics.
• With this framework, we show promising results of adversarial domain generalization in conquering the combinatorial prediction problem in real-world applications.

1.3 Organization of the Thesis

The rest of the thesis is organized as follows. In Chapter 2, we review the research background of structured visual generation and understanding. In Chapter 3, we propose a contextual based image inpainting algorithm, in which we separate the image inpainting process into two steps and fine-tune the textures in the second step based on the feature maps of the surrounding context. In Chapter 4, we propose a segmentation based image inpainting method, in which we introduce segmentation information as the guidance for the image inpainting process. In Chapter 5, we formulate the human-object interaction detection task as a domain generalization problem and propose an adversarial domain generalization framework to learn object-invariant features. Finally, concluding remarks and future research directions are given in Chapter 6.

Chapter 2

Research Background

2.1 Convolutional Neural Networks

As defined in [27], feedforward neural networks are the essential deep learning models whose goal is to approximate some function f. A feedforward network defines a mapping y = f(x; \theta) and learns the values of the parameters \theta that result in the best function approximation. These models are called feedforward because the input x flows through the model, which defines the function f, to finally produce the output y. There are no cyclic connections in the network, so it is very straightforward to set up training or to make inferences.
Convolutional neural networks are a specialized kind of neural network for processing data that has a known grid-like topology [27]. Therefore, they are commonly used in computer vision tasks, where image data can be thought of as a 2-D grid of pixels. Typically, a CNN contains convolutional layers, activation layers, pooling layers, and fully connected layers.

• Convolutional layers. Convolution is a mathematical operation, typically a kind of linear operation used on 2-D data. A convolution layer holds the weights and bias for the computation. It takes a local area of the data as input and outputs the inner product of the weights and the input data. Generally, convolutional layers can be taken as a substitute for general matrix multiplication. A typical convolution function can be defined as

y = \sum_i w_i x_i + b.   (2.1)

• Activation layers. Activation layers are non-linear functions that are typically applied to the output of a convolution layer. The goal of activation layers is to introduce non-linear characteristics into the network to enable high-dimensional representations through deeper networks. There are many different kinds of activation layers, such as sigmoid, ReLU, LReLU, etc. The most commonly used one is ReLU (Rectified Linear Unit), which can be formulated as

f(x) = \max(0, x).   (2.2)

• Pooling layers. Pooling layers are used to downsample the output of previous layers to better extract feature representations from the data. The most common pooling layers are max pooling and average pooling, which take the maximum or average value of a local patch. The advantage of pooling layers is that they reduce the number of parameters and restrain overfitting during training, which improves model efficiency.

• Fully connected layers. Fully connected layers, also known as FC layers, connect to the whole input rather than focusing only on a local area. This leads to a representation of the global input rather than of local patches, which can be used for downstream tasks at the cost of losing the spatial information in the original input.

Applications in computer vision. A typical architecture in a computer vision task that requires feature extraction is a feedforward chain of different kinds of layers:

conv - pool - relu - conv - pool - relu - ... - conv - pool - relu

Among all the tasks in computer vision, image classification is the most fundamental one and has a profound influence on other computer vision tasks. For image classification, there are usually a few FC layers following the first several conv-pool-relu blocks to obtain the global image feature representation. To make the final prediction, a softmax operation is used to transform the output of the FC layers into a probability distribution:

p(y = j \mid x) = \frac{\exp(f(x^T w_j))}{\sum_i \exp(f(x^T w_i))},   (2.3)

where f(x^T w_j) represents the output of the last FC layer corresponding to category j, and p(y = j \mid x) represents the predicted probability for category j. Some of the most famous image classification models are AlexNet [79], VGG [73], and ResNet [34]. While AlexNet includes five convolutional modules followed by three FC layers, VGG designs a deeper network by stacking convolutional layers within each module, where all convolutional layers in the same module have the same kernel size.
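As a concrete illustration of the conv-pool-relu pipeline and the softmax prediction of Eq. (2.3), the following PyTorch sketch builds a tiny classifier. The layer widths, input size, and number of classes are arbitrary illustrative choices, not an architecture used later in this thesis.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """A minimal conv-pool-relu classifier in the spirit of Section 2.1."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Three conv-pool-relu blocks, analogous to the stacked blocks above.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.MaxPool2d(2), nn.ReLU(),
        )
        # An FC layer aggregates the global representation, as described above.
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)           # (B, 64, 4, 4) for 32x32 inputs
        h = h.flatten(start_dim=1)     # drop the spatial layout, keep a global feature
        logits = self.classifier(h)
        # The softmax of Eq. (2.3) turns logits into a probability distribution.
        return torch.softmax(logits, dim=1)

if __name__ == "__main__":
    model = TinyConvNet()
    probs = model(torch.randn(2, 3, 32, 32))  # two random 32x32 RGB images
    print(probs.shape, probs.sum(dim=1))      # (2, 10); each row sums to 1
```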
To solve the vanishing gradient problem in deep neural networks, ResNet proposes to add skip connections, which make backpropagation more efficient through direct connections between different layers.

2.2 Image Semantic Segmentation

Fully Convolutional Neural Networks. Different from the task of image classification, image semantic segmentation aims at predicting class labels for an image at the pixel level. Therefore, FC layers have to be replaced with convolutional layers, since FC layers cannot preserve geometric information [59]. Another advantage of this replacement is that fully convolutional neural networks can handle input images of arbitrary size. In this way, image semantic segmentation can be considered a pixel-wise classification problem, and the replaced convolutional layers can be viewed as local classifiers applied over local regions.

Because there are pooling layers in convolutional neural networks, the output is much smaller than the input image. However, for a pixel-wise prediction, it is important to produce an output that has the same size as the input. This leads to the requirement of upsampling layers, and multiple approaches have been proposed to solve this problem. The most influential work is FCN [59], where the authors propose to upsample the feature maps using the deconvolution operation in multiple steps. The framework is shown in Fig. 2.1. In this figure, the main branch (the first branch) produces an output at 1/32 of the original input size (FCN-32s), which loses detailed geometric information and cannot make an accurate pixel-wise prediction. The second branch upsamples the output feature map of the conv7 layer by 2x and combines it with the feature map from the output of pool4. This produces an output at 1/16 of the original input size (FCN-16s), which has a finer resolution compared with the first branch. Similarly, the third branch produces an output at 1/8 of the original input size (FCN-8s) by combining the pool3 output, the upsampled pool4 output, and the 4x upsampled pool5 output.

Figure 2.1: The framework of FCN [59], which combines high-level, low-resolution layers with low-level, high-resolution layers. The figure is from [59].

Atrous convolution. As FCN uses five stages of convolutional and pooling layers to extract features, it usually loses detailed information for the output prediction. To tackle this problem, DeepLab [57] removes the pool4 and pool5 layers from the architecture. To keep the receptive field the same as in the previous FCN method, DeepLab [57] proposes the atrous convolution layer (shown in Fig. 2.2), which preserves the receptive field while reducing the number of pooling layers. To be more specific, the atrous convolution takes an input stride of 2 and produces an output stride of 1, so that a 3x3 convolutional filter can have a receptive field of 5x5.

Figure 2.2: Atrous convolution. The figure is from [57].
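To make the dilation mechanism concrete, the snippet below compares a standard 3x3 convolution with an atrous (dilated) one. It is a sketch for intuition only and is not part of any DeepLab implementation; the feature map size is an arbitrary example value.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)  # a single-channel 32x32 feature map

# Standard 3x3 convolution: each output pixel sees a 3x3 input window.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Atrous (dilated) 3x3 convolution with dilation=2: the kernel taps are spread
# out, so each output pixel sees a 5x5 input window, matching the receptive
# field argument above, while the spatial resolution stays unchanged.
atrous = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape, atrous(x).shape)  # both: torch.Size([1, 1, 32, 32])
```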
Atrous convolution improves on FCN-8s segmentation by removing the pooling layers and the deconvolutional layers used for upsampling. However, the output can still lose detailed spatial information. To further improve segmentation performance, the latest work applying atrous convolution to semantic segmentation is DeepLabv3+ [7], which proposes to combine the advantages of both spatial pyramid pooling modules and the encoder-decoder structure. Specifically, DeepLabv3+ extends the previous DeepLab segmentation framework by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. On the other hand, a fully connected CRF [45] can be applied to fine-tune the segmentation output. The CRF (conditional random field) module is built on the whole image by defining both a unary potential function and a pairwise potential function. The unary potential function is defined on the prediction confidence for each pixel given by previous methods, and the pairwise potential function measures the similarity of a pair of pixels based on their positions and colors. Both methods can refine the segmentation output and improve the final prediction performance.

2.3 Generative Adversarial Networks

Machine learning models can generally be divided into two categories, discriminative models and generative models. The most common algorithms for solving image classification and image semantic segmentation problems are discriminative models, which take an image as input and make global or pixel-wise classification predictions conditioned on the input image. Generative models, in contrast, learn the joint probability distribution of the input and the output. For example, given a hidden representation, they can predict the associated features or learn to imitate the distribution of the data. Generative models have multiple advantages beyond merely generating more images, as pointed out in [26]. First, training and sampling from generative models is a test of the ability to represent and manipulate high-dimensional probability distributions. Second, generative models can be incorporated into reinforcement learning algorithms, as model-based reinforcement learning algorithms usually contain a generative model for simulating and predicting possible futures. Moreover, generative models can make predictions on inputs with missing data. Finally, generative models enable multi-modal outputs rather than predicting only one output, which is useful in scenarios where one input image corresponds to multiple possible outputs.

Generative Adversarial Networks. There are many different kinds of generative models, such as fully visible belief networks [16], Boltzmann machines [15], variational autoencoders [72], etc. Among these methods, Generative Adversarial Networks (GANs) have attracted the most attention in recent years, as GANs avoid many disadvantages of other methods and can generate better results. The basic idea of GANs is to set up a game between two players, the generator and the discriminator. The generator tries to create samples from the distribution of the training data, while the discriminator judges whether the input samples are real or fake, i.e., whether the samples come from the real data or from the output of the generator. The discriminator is a binary classifier trained in a supervised manner, and the goal of the generator is to fool the discriminator so as to learn the latent distribution of the training data. Formally, GANs are a structured probabilistic model containing latent variables z and observed variables x [28]. The generator is a differentiable function G, which takes z, sampled from some prior distribution, as input and outputs a sample of x drawn from p_model. The goal of the discriminator is to measure the distance between two distributions, p_model and p_data.
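For reference, the adversarial game described above is commonly written as the standard minimax objective from the GAN literature [28]:

\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],

where D(x) is the probability the discriminator assigns to x being real and p_z(z) is the prior over the latent variable z.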
If an ideal discriminator cannot distinguish these two distributions, it indicates that the generator can generate samples from the same distribution as the training data.

Different from previous CNN models, GANs consist of two networks, i.e., the generator and the discriminator. Therefore, a special training process is needed to train both networks successfully. On each step of the training process, two minibatches are sampled: a minibatch of x values from the dataset and a minibatch of z values drawn from the model's prior in latent space [26]. Then two gradient steps are made, one for the generator and one for the discriminator.

Figure 2.3: GANs framework. Two players are playing against each other. The goal of the discriminator is to output the probability that the input is real, while the goal of the generator is to generate an image which fools the discriminator. The figure is from [26].

As we can formulate GANs as a game between two players, the Nash equilibrium of the GAN game represents the ending state of training. However, a problem arises during training: the generator and the discriminator may learn their tasks at different speeds, which can result in unsatisfying training outcomes. With a generator that can perfectly generate samples, it would be difficult for the discriminator to judge whether input samples are real or fake. On the other hand, with a discriminator that can perfectly classify the input samples as real or fake, the generator would not know along which direction to optimize its parameters. Therefore, multiple training strategies have been proposed to stabilize the training steps. Some authors recommend running more steps on one player than on the other, while others argue that the best strategy in practice is to train the GAN with one step for each network.

Deep Convolutional Generative Adversarial Networks. Although GANs provide an attractive alternative to maximum likelihood techniques, they have been known to be unstable to train and often result in generators that produce meaningless outputs. To stabilize GAN training, DCGAN [70] proposes a class of CNNs called deep convolutional generative adversarial networks, which have certain architectural constraints and make training stable in most settings. A typical generator architecture is illustrated in Fig. 2.4, which takes a uniformly sampled latent vector z and produces an output image of size 64x64. The main guidelines for a stable DCGAN are as follows [70]:

• Replace any pooling layers with strided convolutions (discriminator) and fractionally-strided convolutions (generator).
• Use batchnorm in both the generator and the discriminator.
• Remove fully connected hidden layers for deeper architectures.
• Use ReLU activation in the generator for all layers except for the output, which uses Tanh.
• Use LReLU activation in the discriminator for all layers.

Figure 2.4: DCGAN generator. A 100-dimensional uniform distribution z is projected to a spatial convolutional representation with many feature maps. Four following fractionally-strided convolutions then convert this representation into a 64x64 output image. The figure is from [70].
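A minimal sketch of the alternating update described above (one discriminator step and one generator step per iteration, each on a freshly sampled minibatch) is given below. The generator G and discriminator D are assumed to be externally defined networks, e.g., DCGAN-style convolutional models mapping z to an image and an image to a real/fake logit; the non-saturating generator loss used here is one common choice, not the exact objective of [70].

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real_images, z_dim=100):
    """One alternating GAN update: a discriminator step, then a generator step."""
    device = real_images.device
    batch_size = real_images.size(0)

    # Discriminator step: push D(x) toward 1 for real data and D(G(z)) toward 0.
    z = torch.randn(batch_size, z_dim, device=device)   # minibatch of latent codes
    fake_images = G(z).detach()                          # do not backprop into G here
    d_real = D(real_images)
    d_fake = D(fake_images)
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator step: push D(G(z)) toward 1 (the "non-saturating" formulation).
    z = torch.randn(batch_size, z_dim, device=device)
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

    return loss_D.item(), loss_G.item()
```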
Following DCGAN, many new approaches have been proposed to stabilize the training process, including energy-based GAN [102], Wasserstein GAN (WGAN) [76, 1], WGAN-GP [30], BEGAN [4], LSGAN [61], and the more recent Progressive GANs [40]. Some of them propose a new loss function or a new measure for evaluating the distance between two distributions, while others propose new architectures to make the training steps more reliable.

Image translation. Apart from direct image generation with GANs, a typical application of conditional GANs is image-to-image translation, whose goal is to translate an input image into a corresponding output image in a different domain (as illustrated in Fig. 2.5). Here "domain" means a different expression or representation of the image, such as an RGB image, a semantic map, etc. In this sense, many problems in image processing, graphics, and vision can be formulated as an image translation problem, where the setting is always to map pixels to pixels.

Figure 2.5: Examples of translating an input image into a corresponding output image in a different domain. The figure is from [38].

To explore GANs in this conditional setting, pix2pix [38] proposes to use a "U-Net"-based architecture for the generator and a convolutional PatchGAN classifier for the discriminator, which only penalizes structure at the scale of image patches. However, this approach has a few drawbacks. First, it requires paired data in both domains in the training set, which is usually hard to acquire. To solve this problem, CycleGAN [105] proposes a cycle-consistent structure that can learn image translation in the absence of paired training data. Second, when it comes to high-resolution image translation, it usually produces very blurred results. A later work, Pix2PixHD [86], proposes a new multi-scale generator and discriminator structure to enable image translation at a resolution of 2048x1024.

2.4 Image Inpainting

Non-neural image inpainting. Traditionally, non-neural methods operating in the image space were used for image inpainting. A typical method is PatchMatch [2], which finds approximate nearest-neighbor matches between image patches in a fast manner. A follow-up work [3] generalizes PatchMatch by searching across scales and rotations and matching via arbitrary descriptors and distances. While these methods find the best matching patch at the image level and propagate textures directly, they only exploit the low-level signal of the known context to hallucinate missing regions and fall short of understanding and predicting high-level semantics. Furthermore, they are often inadequate for capturing the global structure of images, since they simply extend textures from surrounding regions. Therefore, they are usually agnostic to high-level semantic and structural information. With the development of deep generative models, neural-based image inpainting methods have shown great potential to outperform these traditional methods.

Neural style transfer. Style transfer is defined as transferring the style from one image onto another while preserving the content [19], as shown in Fig. 2.6. Gatys et al. [19] first formulate this problem as an optimization problem that combines texture synthesis with content reconstruction. As an alternative, [14, 17, 83] use neural-patch based similarity matching between the content and style images for style transfer. Li and Wand [49] optimize the output image such that each neural patch matches a similar neural patch in the style image. This enables arbitrary style transfer at the cost of expensive computation. [8] proposes a fast approximation to [49], constructing the feature map directly and using an inverse network to synthesize the image in a feed-forward manner.

Figure 2.6: An example of style transfer. The figure is from Google Images.
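As a side note, the texture (style) statistics in the optimization-based formulation of [19] are usually captured by Gram matrices of CNN feature maps. The helper below is a generic sketch of that computation, not code taken from any of the cited works.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a CNN feature map: channel-by-channel correlations,
    the texture statistic used in optimization-based style transfer."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)          # flatten the spatial dimensions
    gram = torch.bmm(f, f.transpose(1, 2))  # (b, c, c) channel correlations
    return gram / (c * h * w)               # normalize by feature map size

# Example: a style loss between two feature maps is the squared difference
# of their Gram matrices.
fa, fb = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
style_loss = torch.mean((gram_matrix(fa) - gram_matrix(fb)) ** 2)
```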
Image inpainting based on generative models. More recently, deep neural networks have shown excellent performance in various image completion tasks, such as texture synthesis and image completion. For inpainting, adversarial training has become the de facto strategy for generating sharp details and natural-looking results [66, 94, 51, 92, 35]. Pathak et al. [66] first propose to train an encoder-decoder model for inpainting using both a reconstruction loss and an adversarial loss. In [94], Yeh et al. use a pre-trained model to find the most similar encoding of a corrupted image and use the found encoding to synthesize what is missing. In [92], Yang et al. propose a multi-scale approach and optimize the hole contents such that the neural features extracted from a pre-trained CNN match the features of the surrounding context. The optimization scheme improves the inpainting quality and resolution at the cost of computational efficiency. In [35], Iizuka et al. propose a deep generative model trained with global and local adversarial losses, which achieves good inpainting performance for mid-size images and holes. However, it requires extensive training (two months, as described in the paper), and the results often contain excessive artifacts and noise patterns. Another limitation of [92] and [35] is that they are unable to handle perceptual discontinuity, making it necessary to resort to post-processing (e.g., Poisson blending).
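The combination of a reconstruction term and an adversarial term mentioned for [66] typically takes a form like the sketch below. The weighting, the L2 penalty, and the hole-only masking are illustrative assumptions rather than the exact losses of any specific paper cited above.

```python
import torch
import torch.nn.functional as F

def inpainting_generator_loss(pred, target, mask, d_logits_fake,
                              w_rec=0.999, w_adv=0.001):
    """Joint loss for an inpainting generator: pixel reconstruction inside the
    hole plus an adversarial term pushing completions toward realism."""
    # Reconstruction: penalize the prediction only inside the missing region
    # (mask == 1 marks the hole).
    rec = F.mse_loss(pred * mask, target * mask)
    # Adversarial: the generator wants the discriminator to call its output real.
    adv = F.binary_cross_entropy_with_logits(
        d_logits_fake, torch.ones_like(d_logits_fake))
    return w_rec * rec + w_adv * adv
```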
2.5 Object Detection

Convolutional neural networks demonstrate a strong ability on the image classification problem, where the network takes fixed-size images as input and predicts a probability distribution over object categories. However, they cannot be directly used for the object detection task, which takes images of arbitrary size as input and requires not only predicting the categories present in the image but also localizing the objects. To make use of the experience from image classification, object detection can be regarded as an image classification task applied to a region of an image rather than to the whole image. In this way, we can bridge the gap between object detection and image classification. One of the most important questions is how to find the potential regions in an image that are likely to contain an object. After this region proposal step, the object detection problem is transformed into a classification problem, which is to judge whether a proposed region contains a specific kind of object or only background.

RCNN [24] is one of the earliest detection frameworks based on convolutional neural networks. To transform the object detection task into an image classification task, a straightforward way is to apply a CNN to every possible region of the image, which is exhaustive and time-consuming. To solve this problem, RCNN proposes to use object proposals first, which eliminates most of the meaningless regions and leaves several thousand possible RoIs (regions of interest) for the next step of region classification. Fig. 2.7 illustrates the framework of the RCNN approach, which first applies Selective Search [82] to extract around 2k object proposals and then warps each object proposal into a fixed-dimension feature vector that represents the region features. Given this feature, it is now a classification problem, and a linear SVM determines which object category the region belongs to.

Figure 2.7: The RCNN framework. The figure is from [24].

Although RCNN [24] proposes only a few thousand object regions for classification, its main drawback is that training and inference are quite slow, as the network has to compute the features for every region, which takes a long time. We can observe that many regions overlap heavily, which causes much redundancy in the computation. To tackle this problem, Fast-RCNN [23] proposes a RoI pooling layer, which extracts the region features directly from the feature maps of a previous layer (as shown in Fig. 2.8). The output of RoI pooling is a fixed-dimension feature vector, which is then passed to classifiers and regressors for probability prediction and bounding box localization. In this way, the computation of feature maps for different regions shares the same backbone network, which largely reduces the computational redundancy.

Figure 2.8: The Fast-RCNN framework. The figure is from [23].

Fast-RCNN [23] brings a large improvement in computation speed, but the main bottleneck is the non-CNN object proposal method. The commonly used Selective Search [82] still takes a long time compared with CNN inference, given the high efficiency of GPU implementations of CNNs, and cannot be integrated into the object detection framework. In this case, Faster-RCNN [71] designs a CNN-based proposal network named the Region Proposal Network (RPN), which can be integrated into Fast-RCNN. The RPN is a neural network applied to the image feature map in a sliding-window manner, producing binary class scores for k anchor boxes (as illustrated in Fig. 2.9). The RPN is much faster than traditional algorithms and has replaced them as an important part of the object detection framework.

Figure 2.9: The Region Proposal Network (RPN). The figure is from [71].
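To illustrate the RoI pooling idea (extracting a fixed-size feature for every region directly from a shared feature map), the snippet below uses torchvision's roi_align, the interpolation-based variant of RoI pooling used in later detectors. The feature map, boxes, and spatial scale are made-up example values.

```python
import torch
from torchvision.ops import roi_align

# A shared backbone feature map for one image: 256 channels at 1/16 resolution.
features = torch.randn(1, 256, 50, 50)

# Two regions of interest in original image coordinates (x1, y1, x2, y2).
boxes = [torch.tensor([[ 32.0,  48.0, 256.0, 320.0],
                       [400.0,  80.0, 640.0, 400.0]])]

# Each region is pooled to a fixed 7x7 grid regardless of its size, so the same
# classifier/regressor head can be applied to every proposal.
region_feats = roi_align(features, boxes, output_size=(7, 7),
                         spatial_scale=1.0 / 16.0)
print(region_feats.shape)  # torch.Size([2, 256, 7, 7])
```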
Each pair of related objects is commonly represented as a relationship triplet of the form ⟨subject, predicate, object⟩. This task does not follow straightforwardly from object detection outputs. As shown in Fig. 2.11, an image with a person and a bicycle can exhibit different kinds of relations, which lead to totally different high-level understandings and interpretations of the image. Furthermore, it is not easy to predict the relation only from the object detection result. For example, in Fig. 2.11, when a person is standing near a bicycle, the relation could be either the action "pushing" or the preposition "next to". Knowing only the bounding boxes and categories of the two objects, it is hard to tell which relationship holds. Therefore, visual relationship detection is a non-trivial task that is important for better image understanding and for downstream tasks such as image captioning and visual question answering.
Figure 2.11: For images that contain the same objects, the relationship can differ and determines the interpretation of the image. The figure is from [60].
There are two main difficulties in visual relationship detection compared with object detection. First, for the same pair of objects there are multiple potential relationships, which leads to a combinatorial explosion of relation triplets. For example, with N object categories and K relationships, there are N^2 K different kinds of triplets, which makes detection significantly harder than object detection. Second, in object detection it is relatively easy to collect balanced training data, i.e., a comparable number of images for different object categories. This matters because unbalanced data biases model inference and leaves categories with fewer examples insufficiently trained. In relationship detection, however, the relationship triplets naturally follow a long-tail distribution, as shown in Fig. 2.12: some common relation triplets appear much more often than rare ones. For example, in Fig. 2.12, the relation "car on street" is a very common scene in daily life, whereas we rarely see "elephant drink milk". Yet the predicate "drink" is very common in triplets such as "person drink water", so the main challenge is to learn a relationship from other relation instances in which it appears. This is similar in spirit to zero-shot or few-shot learning, which are themselves emerging topics in computer vision.
Figure 2.12: Long-tail distribution of relationships in the training dataset. The figure is from slides of [60].
Figure 2.13: Visual phrase architecture. The figure is from [75].
A very early work on visual relationship detection is [75], which is based on traditional hand-crafted visual features (see Fig. 2.13). The authors show that relationship detection can help object detection and vice versa. With the adoption of deep learning in computer vision, deep features have shown their advantages over traditional features, and [60] shows that deep visual features perform much better not only on object detection but also on visual relationship detection. Fig.
2.14 shows that visual relationship detection relies on both visual features and language features, and taking advantage of a language prior significantly improves relationship detection performance.
Figure 2.14: An overview of the visual relationship detection pipeline in [60]. The figure is from [60].
Following [60], [10] proposes an integrated framework for visual relationship detection, as illustrated in Fig. 2.15. The authors separate the pipeline into three steps: object detection, pair filtering, and joint recognition. Pair filtering is a simple yet effective network, similar in spirit to the RPN, that proposes potentially meaningful relation pairs. In the joint recognition step, inspired by CRFs, [10] proposes a multi-layer inference structure that updates the predicted probability distribution at every step.
Figure 2.15: DR-Net framework. The figure is from [10].
To exploit visual translation embeddings for visual relation detection, [97] proposes a new embedding network that is a purely visual model without language priors. The intuition is to place objects in a low-dimensional relation space and handle relation detection in that space, i.e., to model the relationship as a simple vector translation, where the sum of the subject vector and the predicate vector approximates the object vector. In this way, the structure shown in Fig. 2.16 can be trained end-to-end, with gradients backpropagated into the object detection module.
Figure 2.16: VTransE framework. The figure is from [97].
One of the latest works on visual relationship detection is [93], which treats the whole structure of objects and relations as a graph consisting of nodes and edges. The model divides the process into two steps. First, it designs a relation proposal network (RePN) to filter out meaningless object pairs and propose potential pairs for later classification. Second, it proposes an attentional graph convolutional network (GCN) to predict and iteratively refine the triplet predictions.
Figure 2.17: Graph R-CNN framework. The figure is from [93].

Chapter 3 Contextual Based Image Inpainting: Infer, Match and Translate

3.1 Introduction

The problem of generating photo-realistic images from sampled noise or conditioned on other inputs such as images, texts or labels has been heavily investigated. In spite of recent progress in deep generative models such as PixelCNN [83], VAE [43] and GANs [28], generating high-resolution images remains a difficult task. This is mainly because modeling the distribution of pixels is difficult, and trained models easily introduce blurry components and artifacts when the dimensionality becomes high. Several approaches have been proposed to alleviate the problem, usually by leveraging multi-scale training [98, 11] or incorporating prior information [63].
Figure 3.1: Our result compared with GL inpainting [35]. (a) & (d) The input image with a missing hole. (b) & (e) Inpainting result given by GL inpainting [35]. (c) & (f) Final inpainting result using our approach. The size of the images is 512x512.
In addition to the general image synthesis problem, the task of image inpainting can be described as: given an incomplete image as input, how do we fill in the missing parts
with semantically and visually plausible contents. We are interested in this problem for several reasons. First, it is a well-motivated task for a common scenario where we may want to remove unwanted objects from pictures or restore damaged photographs. Second, while purely unsupervised learning may be challenging for large inputs, we show in this work that the problem becomes more constrained and tractable when we train in a multi-stage self-supervised manner and leverage the high-frequency information in the known region.
Context-encoder [66] is one of the first works to apply deep neural networks to image inpainting. It trains a deep generative model that maps an incomplete image to a complete image using a reconstruction loss and an adversarial loss. While the adversarial loss significantly improves the inpainting quality, the results are still quite blurry and contain notable artifacts. In addition, we found that it fails to produce reasonable results for larger inputs like 512x512 images, showing that it is unable to generalize to the high-resolution inpainting task. More recently, [35] improved the results by using dilated convolutions and an additional local discriminator. However, it is still limited to relatively small images and holes due to the spatial support of the model.
Yang et al. [91] propose to use style transfer for image inpainting. More specifically, they initialize the hole with the output of the context encoder, and then improve the texture by using style transfer techniques [49] to propagate the high-frequency textures from the boundary into the hole. This shows that matching neural features not only transfers artistic styles, but can also synthesize real-world images. The approach is optimization-based and applicable to images of arbitrary sizes. However, the computation is costly and it takes a long time to inpaint a large image.
Our approach overcomes the limitations of the aforementioned methods. Similar to [91], we decouple the inpainting process into two stages: inference and translation. In the inference stage, we train an Image2Feature network that initializes the hole with a coarse prediction and extracts its features. The prediction is blurry but contains high-level structure information in the hole. In the translation stage, we train a Feature2Image network that transforms the feature back into a complete image. It refines the contents in the hole and outputs a complete image with sharp and realistic texture. The main difference from [91] is that, instead of relying on optimization, we model texture refinement as a learning problem. Both networks can be trained end-to-end and, with the trained models, inference can be done in a single forward pass, which is much faster than iterative optimization.
To ease the difficulty of training the Feature2Image network, we design a "patch-swap" layer that propagates the high-frequency texture details from the boundary to the hole. The patch-swap layer takes the feature map as input and replaces each neural patch inside the hole with the most similar patch on the boundary. We then use the new feature map as the input to the Feature2Image network. By re-using the neural patches on the boundary, the feature map contains sufficient details, making high-resolution image reconstruction feasible.
We note that dividing the training into the two stages of Image2Feature and Feature2Image greatly reduces the dimensionality of possible mappings between input and output. Injecting prior knowledge with patch-swap further guides the training process, making it easier to find the optimal transformation.
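As a rough illustration of this decomposition, the following is a minimal, hypothetical PyTorch sketch of the inference path. The module names and the single-convolution stand-ins for G_1 and G_2 are placeholders, not the actual architectures, which are specified in Sec. 3.2.3:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

# Placeholder stand-ins for the two stages; the real G_1 and G_2 are the FCN / U-Net
# generators described in Sec. 3.2.3.
image2feature = nn.Conv2d(3, 3, 3, padding=1)     # stage 1: coarse inpainting (G_1 stand-in)
feature2image = nn.Conv2d(256, 3, 3, padding=1)   # stage 3: feature-to-image translation (G_2 stand-in)
vgg_relu3_1 = vgg19(weights="IMAGENET1K_V1").features[:12].eval()  # VGG19 up to relu3_1
patch_swap = lambda feat, mask: feat              # stage 2 placeholder; see the patch-swap sketch in Sec. 3.2.3

def inpaint(incomplete, hole_mask):
    coarse = image2feature(incomplete)            # infer coarse contents inside the hole
    feat = vgg_relu3_1(coarse)                    # lift the coarse result into VGG feature space
    swapped = patch_swap(feat, hole_mask)         # replace hole patches with boundary patches
    return feature2image(swapped)                 # translate back to a sharp, complete image
```

The sketch only emphasizes the data flow; every stage below is described in detail in Sec. 3.2.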
When compared with GL inpainting [35], we generate sharper and better inpainting results at size 256x256. Our approach also scales to higher resolutions (i.e., 512x512), which GL inpainting fails to handle. Compared with neural inpainting [91], our results have comparable or better visual quality in most examples. In particular, our synthesized contents blend with the boundary more seamlessly. Our approach is also much faster.
Figure 3.2: Overview of our network architecture. We use the Image2Feature network for coarse inference and use a VGG network to extract a feature map. Then patch-swap matches neural patches from the boundary to the hole. Finally, the Feature2Image network translates the result into a complete, high-resolution image.
The main contributions of this work are: (1) We design a learning-based inpainting system that is able to synthesize missing parts of a high-resolution image with high-quality contents and textures. (2) We propose a novel and robust training scheme that addresses the issue of feature manipulation and avoids under-fitting. (3) We show that our trained model can achieve performance comparable with the state-of-the-art and generalizes to other tasks like style transfer.

3.2 Methodology

3.2.1 Problem Description

We formalize the task of image inpainting as follows: suppose we are given an incomplete input image I_0, with R and R̄ representing the missing region (the hole) and the known region (the boundary) respectively. We would like to fill in R with plausible contents I_R and combine it with I_0 as a new, complete image I. Evaluating the quality of inpainting is mostly subject to human perception, but ideally I_R should meet the following criteria: 1. It has sharp and realistic-looking textures; 2. It contains meaningful content and is coherent with I_R̄; and 3. It looks like what appears in the ground truth image I_gt (if available). In our context, R can be either a single hole or multiple holes. It may also have arbitrary shape and be placed at a random location in the image.

3.2.2 System Overview

Our system divides the image inpainting task into three steps:
Inference: We use an Image2Feature network to fill the incomplete image with coarse contents as inference and extract a feature map from the inpainted image.
Matching: We use patch-swap on the feature map to match the neural patches from the high-resolution boundary to the hole with coarse inference.
Translation: We use a Feature2Image network to translate the feature map into a complete image.
The entire pipeline is illustrated in Fig. 3.2.

3.2.3 Training

We introduce separate steps for training the Image2Feature and Feature2Image networks. For illustration purposes, we assume the size of I_0 is 256x256x3 and the hole R has size 128x128.
Inference: Training Image2Feature Network
The goal of the Image2Feature network is to fill in the hole with a coarse prediction. During training, the input to the Image2Feature translation network is the 256x256x3 incomplete image I_0 and the output is a feature map F_1 of size 64x64x256. The network consists of an FCN-based module G_1, which comprises a down-sampling front end, multiple intermediate residual blocks and an up-sampling back end. G_1 is followed by the initial layers of the 19-layer VGG network [79]. Here we use the filter pyramid of the VGG network as a higher-level representation of images, similar to [19].
At first, I_0 is given as input to G_1, which produces a coarse prediction I_1^R of size 128x128. I_1^R is then embedded into R, forming a complete image I_1, which again passes through the VGG19 network to get the activation of relu3_1 as F_1. F_1 has size 64x64x256. We also use an additional PatchGAN discriminator D_1 to facilitate adversarial training; it takes a pair of images as input and outputs a vector of true/fake probabilities. For G_1, the down-sampling front end consists of three convolutional layers, each with stride 2. The intermediate part has 9 residual blocks stacked together. The up-sampling back end is the reverse of the front end and consists of three transposed convolutions with stride 2. Every convolutional layer is followed by batch normalization [37] and a ReLU activation, except for the last layer which outputs the image. We also use dilated convolutions in all residual blocks. A similar architecture has been used in [86] for image synthesis and in [35] for inpainting. Different from [86], we use dilated layers to increase the size of the receptive field. Compared with [35], our receptive field is also larger, given that we have more down-sampling blocks and more dilated layers in the residual blocks. During training, the overall loss function is defined as:

L_{G_1} = \lambda_1 L_{perceptual} + \lambda_2 L_{adv}.  (3.1)

The first term is the perceptual loss, which has been shown to correspond better with human perception of similarity [101] and has been widely used in many tasks [20, 39, 13, 8]:

L_{perceptual}(F_1, I_{gt}) = \| M_F \odot (F_1 - vgg(I_{gt})) \|_1.  (3.2)

Here M_F is a weighted mask ensuring that the loss is computed only on the hole region of the feature map. We also assign a higher weight to the overlapping pixels between the hole and the boundary to ensure the composite is coherent. The weights of the VGG19 network are loaded from the ImageNet pre-trained model and are fixed during training. The adversarial loss is based on Generative Adversarial Networks (GANs) and is defined as:

L_{adv} = \max_{D_1} \mathbb{E}[\log(D_1(I_0, I_{gt})) + \log(1 - D_1(I_0, I_1))].  (3.3)

We use a pair of images as input to the discriminator. Under the setting of adversarial training, the real pair is the incomplete image I_0 and the original image I_gt, while the fake pair is I_0 and the prediction I_1. To align the absolute values of the losses, we set the weights \lambda_1 = 10 and \lambda_2 = 1 respectively. We use the Adam optimizer for training. The learning rates are set to lr_G = 2e-3 and lr_D = 2e-4, and the momentum is set to 0.5.
Match: Patch-swap Operation
Patch-swap is an operation which transforms F_1 into a new feature map F'_1. The idea is that the prediction I_1^R is blurry and lacks many of the high-frequency details. Intuitively, we would like to propagate the textures from I_1^R̄ onto I_1^R while still preserving the high-level information of I_1^R. Instead of operating on I_1 directly, we use F_1 as a surrogate for texture propagation. Similarly, we use r and r̄ to denote the regions on F_1 corresponding to R and R̄ on I_1.
Figure 3.3: Illustration of the patch-swap operation. Each neural patch in the hole r searches for the most similar neural patch on the boundary r̄, and then swaps with that patch.
For each 3x3 neural patch p_i (i = 1, 2, ..., N) of F_1 overlapping with r, we find the closest-matching neural patch in r̄ based on the following cross-correlation metric:

d(p, p') = \frac{\langle p, p' \rangle}{\| p \| \cdot \| p' \|}.  (3.4)

Suppose the closest-matching patch of p_i is q_i; we then replace p_i with q_i.
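A compact implementation sketch of this matching-and-replacement step is given below. It is a simplified, assumed PyTorch version (the function name and the decision to exclude any patch touching the hole from the candidate set are our own simplifications, not the exact code of this work). It evaluates the metric of Eq. (3.4) at every hole location against every boundary patch by treating the normalized boundary patches as convolution filters:

```python
import torch
import torch.nn.functional as F

def patch_swap(feat, hole_mask, patch=3):
    """feat: 1xCxHxW feature map F_1; hole_mask: 1x1xHxW, 1 inside the hole region r."""
    pad = patch // 2
    patches = F.unfold(feat, patch, padding=pad)                    # 1 x (C*9) x (H*W)
    touched = F.unfold(hole_mask, patch, padding=pad).max(1)[0]     # 1 x (H*W), >0 if patch touches r
    boundary = patches[0, :, touched[0] == 0].t()                   # Nb x (C*9): candidate patches in r_bar
    filt = boundary / (boundary.norm(dim=1, keepdim=True) + 1e-8)   # normalize so ||p'|| = 1
    filt = filt.view(-1, feat.size(1), patch, patch)                # Nb convolution filters of size C x 3 x 3
    ones = torch.ones(1, 1, patch, patch, device=feat.device)
    norm = F.conv2d((feat ** 2).sum(1, keepdim=True), ones, padding=pad).sqrt()
    score = F.conv2d(feat, filt, padding=pad) / (norm + 1e-8)       # d(p, p') at every location (Eq. 3.4)
    best = score.argmax(dim=1).view(-1)                             # index of q_i for each location
    swapped = boundary[best].t().unsqueeze(0)                       # gather the matched patches
    out = F.fold(swapped, feat.shape[-2:], patch, padding=pad)      # overlapping patches are summed ...
    cnt = F.fold(torch.ones_like(swapped), feat.shape[-2:], patch, padding=pad)
    out = out / cnt                                                 # ... then averaged
    return torch.where(hole_mask.bool(), out, feat)                 # only the hole region is replaced
```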
After each patch in r is swapped with its most similar patch in r̄, overlapping patches are averaged and the output is a new feature map F'_1. We illustrate the process in Fig. 3.3.
Measuring the cross-correlations for all neural patch pairs between the hole and the boundary is computationally expensive. To address this issue, we follow a similar implementation to [8] and speed up the computation using parallelized convolution. We summarize the algorithm in the following steps. First, we normalize and stack the neural patches on r̄ and view the stacked vector as a convolution filter. Next, we apply the convolution filter on r. The result is that at each location of r we get a vector of values giving the cross-correlation between the neural patch centered at that location and all patches in r̄. Finally, we replace the patch in r with the patch in r̄ of maximum cross-correlation. Since the whole process can be parallelized, the amount of time is significantly reduced. In practice, it only takes about 0.1 seconds to process a 64x64x256 feature map.
Translate: Training Feature2Image Translation Network
The goal of the Feature2Image network is to learn a mapping from the swapped feature map to a complete and sharp image. It has a U-Net style generator G_2 which is similar to G_1, except that the number of hidden layers is different. The input to G_2 is a feature map of size 64x64x256. The generator has seven convolution blocks and eight deconvolution blocks, and the first six deconvolutional layers are connected with the convolutional layers using skip connections. The output is a complete 256x256x3 image. It also has a PatchGAN-based discriminator D_2 for adversarial training. However, different from the Image2Feature network, which takes a pair of images as input, the input to D_2 is a pair consisting of an image and a feature map.
A straightforward training paradigm is to use the output of the Image2Feature network F_1 as input to the patch-swap layer, and then use the swapped feature F'_1 to train the Feature2Image model. In this way, the feature map is derived from the coarse prediction I_1 and the whole system can be trained end-to-end. However, in practice, we found that this leads to poor-quality reconstructions I with notable noise and artifacts (Sec. 3.3). We further observed that using the ground truth as training input gives rise to results of significantly improved visual quality. That is, we use the feature map F_gt = vgg(I_gt) as input to the patch-swap layer, and then use the swapped feature F'_gt = patch_swap(F_gt) to train the Feature2Image model. Since I_gt is not accessible at test time, we still use F'_1 = patch_swap(F_1) as input for inference. Note that the Feature2Image model now trains and tests with different types of input, which is not a usual practice when training a machine learning model.
Here we provide some intuition for this phenomenon. Essentially, by training the Feature2Image network we are learning a mapping from the feature space to the image space. Since F_1 is the output of the Image2Feature network, it inherently contains a significant amount of noise and ambiguity. Therefore the feature space made up of F'_1 has much higher dimensionality than the feature space made up of F'_gt. The outcome is that the model easily under-fits F'_1, making it difficult to learn a good mapping. Alternatively, by using F'_gt, we are selecting a clean, compact subset of features such that the space of mappings is much smaller, making it easier to learn.
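The asymmetric choice of training and test inputs can be summarized as below, reusing the placeholder modules from the earlier sketches (this is illustrative Python with assumed helper names, not the actual training script):

```python
def f2i_training_input(I_gt, hole_mask):
    """Training: the swapped ground-truth feature F'_gt is the Feature2Image input."""
    return patch_swap(vgg_relu3_1(I_gt), hole_mask)   # F'_gt = patch_swap(vgg(I_gt))

def f2i_test_input(I_0, hole_mask):
    """Inference: I_gt is unavailable, so the swapped coarse-prediction feature F'_1 is used."""
    I_1 = image2feature(I_0)                          # coarse result from the Image2Feature stage
    return patch_swap(vgg_relu3_1(I_1), hole_mask)    # F'_1 = patch_swap(vgg(I_1))
```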
Our experiments also show that the model trained with ground truth generalizes well to the noisy input F'_1 at test time. Similar to [103], we can further improve the robustness by sampling from both the ground truth and the Image2Feature prediction. The overall loss for the Feature2Image translation network is defined as:

L_{G_2} = \lambda_1 L_{perceptual} + \lambda_2 L_{adv}.  (3.5)

The reconstruction (perceptual) loss is defined on the entire image between the final output I and the ground truth I_gt:

L_{perceptual}(I, I_{gt}) = \| vgg(I) - vgg(I_{gt}) \|_2.  (3.6)

The adversarial loss is given by the discriminator D_2 and is defined as:

L_{adv} = \max_{D_2} \mathbb{E}[\log(D_2(F'_{gt}, I_{gt})) + \log(1 - D_2(F'_{gt}, I))].  (3.7)

The real and fake pairs for adversarial training are (F'_gt, I_gt) and (F'_gt, I). When training the Feature2Image network we set \lambda_1 = 10 and \lambda_2 = 1. For the learning rates, we set lr_G = 2e-4 and lr_D = 2e-4. As with the Image2Feature network, the momentum is set to 0.5.
Figure 3.4: Multi-scale inference.

3.2.4 Multi-scale Inference

Given the trained models, inference is straightforward and can be done in a single forward pass. The input I_0 successively passes through the Image2Feature network to get I_1 and F_1 = vgg(I_1), then the patch-swap layer (F'_1), and finally the Feature2Image network (I). We then take the center of I and blend it with I_0 as the output. Our framework can be easily adapted to multi-scale inference. The key is that we directly upsample the output of the lower scale as the input to the Feature2Image network of the next scale (after using the VGG network to extract features and applying patch-swap). In this way, we only need the Image2Feature network at the smallest scale s_0 to get I_1^{s_0} and F_1^{s_0}. At higher scales s_i (i > 0) we simply set I_1^{s_i} = upsample(I^{s_{i-1}}) and let F_1^{s_i} = vgg(I_1^{s_i}) (Fig. 3.4). Training the Image2Feature network can be challenging at high resolution. However, by using the multi-scale approach we are able to initialize from lower scales instead, allowing us to handle large inputs effectively. We use multi-scale inference in all our experiments.

3.3 Experiments

3.3.1 Experiment Setup

We separately train and test on two public datasets: COCO [58] and ImageNet CLS-LOC [74]. The numbers of training images are 118,287 for COCO and 1,281,167 for ImageNet CLS-LOC. We compare with content-aware fill (CAF) [2], context encoder (CE) [66], neural patch synthesis (NPS) [91] and global-local inpainting (GLI) [35]. For CE, NPS, and GLI, we used the publicly available trained models. CE and NPS are trained to handle fixed holes, while GLI and CAF can handle arbitrary holes. To evaluate fairly, we experimented with both fixed-hole and random-hole settings. For fixed holes, we compare with CAF, CE, NPS, and GLI on images of size 512x512 from the ImageNet test set. The hole is set to be 224x224, located at the image center. For random holes, we compare with CAF and GLI, using COCO test images resized to 256x256. In the random-hole case, the hole size ranges from 32 to 128 and the hole is placed anywhere on the image. We observed that for small holes on 256x256 images, using patch-swap and the Feature2Image network to refine is optional, as our Image2Feature network already generates satisfying results most of the time. For 512x512 images, however, it is necessary to apply multi-scale inpainting, starting from size 256x256.
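The multi-scale inference of Sec. 3.2.4 can be sketched as follows, again reusing the placeholder modules from the earlier sketches (the scale handling and blending here are simplified assumptions, not the exact procedure):

```python
import torch
import torch.nn.functional as F

def multiscale_inpaint(inputs, masks):
    """inputs / masks: lists of incomplete images and hole masks, smallest scale s_0 first."""
    current = image2feature(inputs[0])                   # coarse inference is needed only at s_0
    for i, (I0, mask) in enumerate(zip(inputs, masks)):
        feat = vgg_relu3_1(current)                      # F_1 at this scale
        out = feature2image(patch_swap(feat, mask))      # refine texture at this scale
        out = torch.where(mask.bool(), out, I0)          # keep the known region from the input
        if i + 1 < len(inputs):                          # I_1 at the next scale is the upsampled output
            current = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
    return out
```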
To address both sizes and to apply multi-scale inference, we train the Image2Feature network at 256x256 and train the Feature2Image network at both 256x256 and 512x512. During training, we use early stopping, i.e., we terminate the training when the loss on the held-out validation set converges. On our NVIDIA GeForce GTX 1080Ti GPU, training typically takes one day for each model, and test time is around 400ms for a 512x512 image.

Table 3.1: Numerical comparison on 200 test images of ImageNet.
Method          Mean ℓ1 Error   SSIM    Inception Score
CE [66]         15.46%          0.45    9.80
NPS [91]        15.13%          0.52    10.85
GLI [35]        15.81%          0.55    11.18
Our approach    15.61%          0.56    11.36

3.3.2 Results

Quantitative Comparison Table 3.1 shows the numerical comparison between our approach, CE [66], GLI [35] and NPS [91]. We adopt three quality measurements: mean ℓ1 error, SSIM, and inception score [76]. Since the context encoder only inpaints 128x128 images and we failed to train the model for larger inputs, we directly use its 128x128 results and bilinearly upsample them to 512x512. Here we also compute the SSIM over the hole area only. Although our mean ℓ1 error is higher, we achieve the best SSIM and inception score among all the methods, showing that our results are closer to the ground truth in terms of human perception. Besides, mean ℓ1 error is not an optimal measure for inpainting, as it favors averaged colors and blurry results and does not directly account for the end goal of perceptual quality.
Visual Result Fig. 3.9 shows our comparison with GLI [35] in random-hole cases. Our method handles multiple situations better, such as object removal, object completion and texture generation, while GLI's results are noisier and less coherent. From Fig. 3.10, we also find that our results are better than GLI most of the time for large holes. This shows that directly training a network for large-hole inpainting is difficult, and it is where our "patch-swap" can be most helpful. In addition, our results have significantly fewer artifacts than GLI. Compared with CAF, we can better predict the global structure and fill in contents that are more coherent with the surrounding context. Compared with CE, we can handle much larger images and the synthesized contents are much sharper. Compared with NPS, whose results mostly depend on CE, we have similar or better quality most of the time, and our algorithm also runs much faster. Meanwhile, our final results improve over the intermediate output of Image2Feature. This demonstrates that using patch-swap and the Feature2Image transformation is beneficial and necessary.
User Study To better evaluate and compare with other methods, we randomly select 400 images from the COCO test set and randomly distribute these images to 20 users. Each user is given 20 images with holes together with the inpainting results of NPS, GLI, and ours. Each user is asked to rank the results in non-increasing order (meaning they can mark two results as having similar quality). We collected 399 valid votes in total and found that our results are ranked best most of the time: in 75.9% of the rankings our result receives the highest score. In particular, our results are overwhelmingly better than GLI, receiving a higher score 91.2% of the time. This is largely because GLI does not handle large holes well. Our results are also comparable with NPS, ranking higher or the same 86.2% of the time.

3.3.3 Analysis

Comparison Compared with [91], our approach is not only much faster but also has several advantages.
First, the Feature2Image network synthesizes the entire image while [91] only optimizes the hole part. By aligning the color of the boundary between the output and the input, we can slightly adjust the tone to make the hole blend with the boundary more seamlessly and naturally (Fig. 3.10). Second, our model is trained to directly model the statistics of real-world images and works well at all resolutions, while [91] is unable to produce sharp results when the image is small. Compared with other learning-based inpainting methods, our approach is more general, as we can handle larger inputs like 512x512. In contrast, [66] can only inpaint 128x128 images, while [35] is limited to 256x256 images with holes smaller than 128x128.
Figure 3.5: Left: using deconvolution (a) vs. resize-convolution (b). Middle: using ℓ2 reconstruction loss (c) vs. using perceptual loss (d). Right: training the Feature2Image network using different input data. (e) Result when trained with the Image2Feature prediction. (f) Result when trained with ground truth. (g) Result when fine-tuned with ground truth and prediction mixtures.
Ablation Study For the Feature2Image network, we observed that replacing the deconvolutional layers in the decoder with resize-convolution layers resolves the checkerboard patterns described in [65] (Fig. 3.5 left). We also tried using only the ℓ2 loss instead of the perceptual loss, which gives blurrier inpainting (Fig. 3.5 middle). Additionally, we experimented with different activation layers of VGG19 to extract features and found that relu3_1 works better than relu2_1 and relu4_1. We may also use iterative inference by running the Feature2Image network multiple times. At each iteration, the final output is used as input to VGG and patch-swap, and then again given to the Feature2Image network for inference. We found that iteratively applying Feature2Image improves the sharpness of the texture but sometimes aggregates artifacts near the boundary. For the Image2Feature network, an alternative is to use a vanilla context encoder [66] to generate the initial inference. However, we found that our model produces better results as it is much deeper, and leverages a fully convolutional network and dilated layers.
Figure 3.6: Arbitrary shape inpainting of real-world photography. (a), (d): Input. (b), (e): Inpainting mask. (c), (f): Output.
As discussed in Sec. 3.2.3, an important practice to guarantee successful training of the Feature2Image network is to use the ground truth image as input rather than the output of the Image2Feature network. Fig. 3.5 also shows that training with the prediction from the Image2Feature network gives very noisy results, while models trained with ground truth or further fine-tuned with ground-truth and prediction mixtures produce satisfying inpainting.
Model Generalization Our framework can be easily applied to real-world tasks. Fig. 3.6 shows examples of using our approach to remove unwanted objects in photography. Given that our network is fully convolutional, it is straightforward to apply it to photos of arbitrary sizes. It is also able to fill in holes of arbitrary shapes, and can handle much larger holes than [36]. The Feature2Image network essentially learns a universal function to reconstruct an image from a swapped feature map, and therefore can also be applied to other tasks.
For example, by first constructing a swapped feature map from a content image and a style image, we can use the network to reconstruct a new image for style transfer. Fig. 3.7 shows examples of using our Feature2Image network trained on COCO for arbitrary style transfer. Although the network is agnostic to the styles being transferred, it is still capable of generating satisfying results and runs in real time. This shows the strong generalization ability of our learned model, as it is only trained on the single COCO dataset, unlike other style transfer methods.
Figure 3.7: Arbitrary style transfer. (a), (d): Content. (b), (e): Style. (c), (f): Result.
Figure 3.8: Failure cases. (a), (c) and (e): Input. (b), (d) and (f): Output.
Failure Cases Our approach is very good at recovering a partially missing object like a plane or a bird (Fig. 3.10). However, it can fail if the image has overly complicated structures and patterns, or if a major part of an object is missing such that the Image2Feature network is unable to provide a good inference (Fig. 3.8).
Training Losses We visualize the training losses of the Feature2Image network at scale 256 in Fig. 3.11. The plot shows that the reconstruction loss goes down and converges quickly. On the other hand, the high discriminator loss at the late stage of training indicates that we effectively trained the generator, and it is challenging for the discriminator to differentiate the reconstructed image from the ground truth.
Figure 3.9: Visual comparisons of ImageNet results with random holes. Each example from top to bottom: input image, GLI [35], our result. All images have size 256x256.
Iterative Inference Another perspective on patch-swap and the Feature2Image network is to view them as a texture refinement module. That is, given an arbitrary image, we can use VGG to extract the feature and then use patch-swap and Feature2Image to predict a new image. Presumably the new image will have sharper texture inside the hole area compared with the input. This inspires us to use iterative inference by applying patch-swap and the Feature2Image network multiple times on the inpainting output. More specifically, after acquiring I_{iter_i}, we again apply VGG to get F_{1,iter_{i+1}} = vgg(I_{iter_i}) and F'_{1,iter_{i+1}} = patch_swap(F_{1,iter_{i+1}}), and then apply the Feature2Image network to get I_{iter_{i+1}}. As one iteration of inference is fast, it can be repeated multiple times with little overhead. However, we note in our experiments that although iterative inference increases the sharpness of the texture, it sometimes aggregates artifacts, especially in regions near the boundary. Some examples are shown in Fig. 3.12.
Architecture of Feature2Image Network Our network design is inspired by both image translation methods [38, 105] and deconvolution networks [48, 12, 42]. We experimented with different architectures for the Feature2Image network and found that the encoder-decoder structure with skip connections performs much better than a simple decoder architecture, as shown in Fig. 3.13. The intuition for this is that the patch-swap operation introduces noise into the feature map. If we directly learn a mapping from the feature map to the output, the noise would accumulate and lead to artifacts in the result.
Alternatively, when we use an encoder network to first extract a hidden code of the "feature map" from the patch-swap output, we can preserve the important information while reducing the noise through the hidden code, which is then decoded to the final result.
Choice of Activation Layer We experimented with using different activation layers' responses as inputs to patch-swap and the Feature2Image network. Fig. 3.14 shows the inpainting results when training the Feature2Image network with the raw image, relu2_1's response, relu4_1's response and relu3_1's response (used in this work), respectively. We can see that using the raw image or relu2_1 as input to patch-swap and Feature2Image training leads to blurry results, due to the large input size relative to the patch. Meanwhile, using later layers like relu4_1 gives grainy outputs whose structures and textures are inconsistent with the boundary. Empirically, training with the feature response at relu3_1 balances well between the level of texture detail and noise, generating results of the best visual quality.

3.3.4 More Experiment Results

Here we show more experiment results randomly selected from the test sets. First we show results at size 512x512 with fixed holes. Compared with CE [66], NPS [91], and GLI [35], our approach performs uniformly better than the other methods. Then we show results with random holes at sizes 256x256 and 512x512 and compare with GLI [35]. We can see that our results are more coherent and have better quality.
Additional COCO Results (images scaled to 512x512 with fixed holes) Visual comparisons of COCO results. Each example from left to right: input image, CE [66], NPS [91], GLI [35], our final result. All images have size 512x512 with a fixed hole in the center (hole size 224x224).
Additional ImageNet Results (images scaled to 512x512 with fixed holes) Visual comparisons of ImageNet results. Each example from left to right: input image, CE [66], NPS [91], GLI [35], our final result. All images have size 512x512 with a fixed hole in the center (hole size 224x224).
Additional COCO Results (images scaled to 256x256 with random holes) Visual comparisons of COCO results. Each example from left to right: input image, GLI [35], our final result. All images have size 256x256 with random holes.
Additional COCO Results (images scaled to 512x512 with random holes) Visual comparisons of COCO results. Each example from left to right: input image, GLI [35], our final result. All images have size 512x512 with random holes.

3.4 Conclusion

We propose a learning-based approach to synthesize missing contents in a high-resolution image. Our model is able to inpaint an image with realistic and sharp contents in a feed-forward manner. We show that we can simplify training by breaking down the task into multiple stages, where the mapping function in each stage has smaller dimensionality. It is worth noting that our approach is a meta-algorithm, and naturally we could explore a variety of network architectures and training techniques to improve the inference and the final result. We also expect that a similar idea of multi-stage, multi-scale training could be used to directly synthesize high-resolution images from sampling.
Figure 3.10: Visual comparisons of ImageNet and COCO results. Each example from left to right: input image, CAF [2], CE [66], NPS [91], GLI [35], our result w/o Feature2Image, our final result. All images have size 512x512.
Figure 3.11: Plot of (a) reconstruction loss and (b) discriminator loss at scale 256.
Figure 3.12: Iterative inference compared with single inference. (a) and (c): results of single inference. (b) and (d): results after five iterations of texture refinement using the Feature2Image network.
Figure 3.13: Effect of different architectures for the Feature2Image network. (a) Input. (b) Output using the decoder structure. (c) Output using the encoder-decoder structure with skip connections.
Figure 3.14: Effect of training Feature2Image with different inputs. (a) Patch-swap on the raw image. (b) Patch-swap on relu2_1's response. (c) Patch-swap on relu4_1's response. (d) Patch-swap on relu3_1's response.
Figure 3.15: Additional COCO results. Images are scaled to 512x512 with fixed holes. (a) Input. (b) CE [66]. (c) NPS [91]. (d) GLI [35]. (e) Our final result.
Figure 3.16: Additional ImageNet results. Images are scaled to 512x512 with fixed holes. (a) Input. (b) CE [66]. (c) NPS [91]. (d) GLI [35]. (e) Our final result.
Figure 3.17: Additional COCO results. Images are scaled to 256x256 with random holes. (a) Input. (b) GLI [35]. (c) Our final result.
Figure 3.18: Additional COCO results. Images are scaled to 512x512 with random holes. (a) Input. (b) GLI [35]. (c) Our final result.

Chapter 4 SPG-Net: Segmentation Prediction and Guidance Network for Image Inpainting

4.1 Introduction

Figure 4.1: Comparison of our intermediate and final results with GL inpainting [35]. (a) Input image with a missing hole. (b) Deeplabv3+ [7] output. (c) SP-Net result. (d) SG-Net result. (e) Inpainting result given by GL inpainting [35]. The size of the images is 256x256.
Image inpainting is the task of reconstructing the missing region of an image with plausible contents based on its surrounding context, which is a common topic in low-level computer vision [33, 44]. Making use of this technique, people can restore damaged images or remove unwanted objects from images or videos. In this task, our goal is not only to fill in plausible contents with realistic details but also to make the inpainted area coherent with the context as well as the boundaries.
Traditional image inpainting methods mostly use image-level features to fill in the hole. A typical method is PatchMatch [2], in which Barnes et al. propose to search for the best matching patches to reconstruct the missing area. Another example is [88], which further optimizes the search areas to find the best-fitting patches. These methods can produce realistic texture by their nature; however, they only make use of the low-level features of the given context and lack the ability to predict high-level features in the missing hole. Moreover, instead of capturing the global structure of the image, they propagate texture from outside into the hole. This often leads to semantically inconsistent inpainting results.
Recent developments in deep generative models have enabled the generation of realistic images either from noise vectors or conditioned on some prior knowledge, such as images, labels, or word embeddings. In this way, we can regard the image inpainting task as an image generation task conditioned on the given context of the image [66, 92, 36, 51, 95].
One of the earliest works to apply deep generative models to the image inpainting task is the context encoder [66], where Pathak et al. train an encoder-decoder architecture to predict the complete image directly from the input image with a hole. Adding the adversarial loss brings a large improvement in inpainting quality, but the results still lack high-frequency details and contain notable artifacts. To handle higher-resolution inpainting problems, Iizuka et al. [36] propose to add dilated convolution layers to increase the receptive field and to use joint global and local discriminators to improve the consistency of the completion result. However, their results often contain noise patterns and artifacts which need to be reduced by a post-processing step (e.g., Poisson image editing [67]). Meanwhile, the training of their method is very time-consuming, taking around 2 months in total. Another line of work in high-resolution inpainting applies style transfer methods to refine the inpainting texture. More specifically, Yang et al. [92] propose to optimize the inpainting result by finding the best matching neural patches between the inpainting area and the given context, and then apply a multi-scale structure to refine the texture iteratively to reach high resolution. It can predict photo-realistic results, but the inference takes much more time than other methods.
Another limitation of many recent approaches is that they usually predict the complete image directly and do not exploit the segmentation information in the image. We find that this limitation usually leads to blurry boundaries between different objects in the inpainted area. To address this problem, we propose to use the segmentation mask as an intermediate bridge between the incomplete image and the complete image prediction. We decouple the inpainting process into two steps: segmentation prediction (SP-Net) and segmentation guidance (SG-Net). We first use a state-of-the-art image segmentation method [7] to generate the segmentation labels for the input image. Then we predict the segmentation labels in the missing area directly, which gives us prior knowledge of the predicted object localization and shape details in the hole. Finally, we combine this complete segmentation mask with the input image and pass them into the segmentation guidance network to make the complete prediction. This constitutes the segmentation-guided inpainting process (see Fig. 4.1), and the whole system combines the strength of deep generative models with segmentation information, which guides the architecture to make a more realistic prediction, especially for boundaries between different objects. On the other hand, compared with other methods which can only make a single prediction given the input image, our method offers the possibility of interactive and multi-modal predictions. More specifically, users can edit the segmentation mask in the missing hole interactively, and the predictions differ according to the segmentation labels assigned in the hole. We evaluate the performance of the proposed framework on a variety of datasets with both qualitative and quantitative evaluations. We also provide a thorough analysis and ablation study of the different steps in our architecture.
The experimental results demonstrate that the segmentation map offers useful information for generating texture details, which leads to better image inpainting quality.
The rest of this chapter is organized as follows. The full architecture and approach are presented in Section 4.2. The experimental results and evaluations are presented in Section 4.3. Finally, the conclusion is given in Section 4.4.

4.2 Approach

Our proposed model uses segmentation labels as additional information to perform image inpainting. Suppose we are given an incomplete input image I_0; our goal is to predict the complete image I, which is composed of two parts, I_0 and I_R, where I_R is the reconstructed area of the missing hole. Here we also model the segmentation label map S as a latent variable, which is similarly composed of S_0 and S_R, where R represents the missing hole. Our whole framework contains three steps as depicted in Fig. 3.2. First, we estimate S_0 from I_0 using a state-of-the-art algorithm. Then the Segmentation Prediction Network (SP-Net) is used to predict S_R from I_0 and S_0. Lastly, S_R is passed to the Segmentation Guidance Network (SG-Net) as input to predict the final result I.

4.2.1 Segmentation Prediction Network (SP-Net)

Network architecture The goal of SP-Net is to predict the segmentation label map in the missing hole. The input to SP-Net is the 256x256xC incomplete label map S_0 as well as the 256x256x3 incomplete image I_0, where C is the number of label categories, and the output is the predicted segmentation label map S of size 256x256xC. Existing works have proposed different generator architectures, such as the encoder-decoder structure [66] and the FCN structure [36]. Similar to [36], the generator of SP-Net is based on an FCN but replaces the dilated convolution layers with residual blocks, which provide better learning capacity. Progressive dilation factors are applied to increase the receptive field and provide a wider view of the input to capture the global structure of the image. To be more specific, our generator consists of four down-sampling convolution layers, nine residual blocks, and four up-sampling convolution layers. The kernel sizes are 7 in the first and last layers, and 3 in the other layers. The dilation factors of the 9 residual blocks are 2 for the first three blocks, 4 for the next three blocks, and 8 for the last three. The output channels of the down-sampling and up-sampling layers are respectively 64, 128, 256, 512 and 512, 256, 128, 64, while all residual blocks have 512 channels. A ReLU and a batch normalization layer are used after each convolution layer except the last layer, which produces the final result. The last layer uses a softmax function to produce a probability map, which predicts the probability of each segmentation label for every pixel.
Loss Functions Adversarial losses are given by discriminator networks that judge whether an image is real or fake, and have been widely used since the emergence of GANs [28]. However, a single GAN discriminator is not sufficient to produce a clear and realistic result, as it needs to take both the global view and the local view into consideration. To address this problem, we use multi-scale discriminators similar to [86], which have the same network structure but operate at three different scales of image resolution.
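For illustration, the multi-scale setup can be sketched as below. This is a hypothetical PyTorch construction; the channel widths, kernel sizes and the average-pooling pyramid are assumptions for the sketch, not the exact configuration of this work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_patchgan(in_ch):
    """One PatchGAN-style discriminator with 4 strided down-sampling convolutions."""
    layers, ch = [], 64
    for i in range(4):
        cin = in_ch if i == 0 else ch // 2
        layers += [nn.Conv2d(cin, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2)]
        ch *= 2
    layers += [nn.Conv2d(ch // 2, 1, 3, padding=1), nn.Sigmoid()]  # per-patch real/fake map
    return nn.Sequential(*layers)

class MultiScaleDiscriminator(nn.Module):
    """Three discriminators D_1, D_2, D_3 seeing the input at full, 1/2 and 1/4 resolution."""
    def __init__(self, in_ch):
        super().__init__()
        self.discs = nn.ModuleList([make_patchgan(in_ch) for _ in range(3)])

    def forward(self, x):
        outs = []
        for d in self.discs:
            outs.append(d(x))
            x = F.avg_pool2d(x, 3, stride=2, padding=1)  # down-sample by 2 for the next scale
        return outs
```

Here in_ch would be the number of channels of the concatenated discriminator input (e.g., the label map pair), which is an assumption of this sketch.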
Each discriminator is a fully convolutional PatchGAN [38] with 4 down-sampling layers followed by a sigmoid function to produce a vector of real/fake predictions, where each value corresponds to a local patch in the original image. In the multi-scale setting, the discriminators {D_1, D_2, D_3} take inputs that are down-sampled from the original image by factors of 1, 2 and 4 respectively, and are able to classify global and local patches at different scales, which enables the generator to capture both global structure and local texture. More formally, the adversarial loss is defined as:

\min_G \max_{D_1, D_2, D_3} \sum_{k=1,2,3} L_{GAN}(G, D_k) = \sum_{k=1,2,3} \mathbb{E}[\log(D_k((S_0)_k, (S_{gt})_k)) + \log(1 - D_k((S_0)_k, G(S_0)_k))].  (4.1)

Here (S_0)_k and (S_gt)_k refer to the k-th image scale of the input label map and the ground truth respectively. Instead of the common reconstruction loss for image inpainting, we improve the loss by defining a perceptual loss, which was introduced by Gatys et al. [19] and has been widely used in many tasks aiming to improve perceptual quality [39, 13]. As the input is a label map with C channels, we cannot apply the perceptual loss with a pre-trained model, which usually takes an image as input. Therefore, a more reasonable way is to extract feature maps from multiple layers of both the generator and the discriminator to match the intermediate representations. Specifically, the perceptual loss is written as:

L_{perceptual}(G) = \sum_{l=0}^{n} \frac{1}{H_l W_l} \sum_{h,w} \| M_l \odot (D_k(S_0, S_{gt})^l_{hw} - D_k(S_0, G(S_0))^l_{hw}) \|_1.  (4.2)

Here l refers to the feature layers and ⊙ refers to pixel-wise multiplication. M_l is the mask of the missing hole at layer l. The feature matching loss was proposed for the image translation task [86]. Here we extend the design to incorporate the mask weight, which helps to emphasize the generation in the missing area. Another benefit comes from l starting from 0, where layer 0 is the input of the discriminator, which naturally contains a reconstruction loss. Our full objective is then defined to combine both the adversarial loss and the perceptual loss:

\min_G \Big( \lambda_{adv} \max_{D_1, D_2, D_3} \sum_{k=1,2,3} L_{GAN}(G, D_k) + \lambda_{perceptual} \sum_{k=1,2,3} L_{perceptual}(G) \Big),  (4.3)

where \lambda_{adv} and \lambda_{perceptual} control the ratio of the two terms. In our experiment, we set \lambda_{adv} = 1 and \lambda_{perceptual} = 10 as used in [66, 86].

4.2.2 Segmentation Guidance Network (SG-Net)

Network architecture The goal of SG-Net is to predict the image inpainting result I of size 256x256x3 in the missing hole. It takes the 256x256x3 incomplete image I_0 jointly with the segmentation label map S predicted by SP-Net as input. SG-Net shares a similar architecture with SP-Net, with four down-sampling convolution layers, nine residual blocks and four up-sampling layers. Different from SP-Net, the last convolution layer uses a tanh function to produce an image with pixel values in [-1, 1], which is then rescaled to the normal image range.
Loss Functions Besides the loss functions of SP-Net, SG-Net introduces an additional perceptual loss to stabilize the training process. Traditional perceptual losses typically use a VGG network and compute the ℓ2 distance on different feature layers. Recently, [101] proposed to train a perceptual network based on AlexNet to measure the perceptual difference between two image patches, and showed that AlexNet better reflects human perceptual judgments.
Here we extend the loss function by considering the local hole patch. The perceptual network computes the activations of the hole patches and sums up the ℓ2 distances across all feature layers, each scaled by a learned weight, which finally provides a perceptual real/fake prediction. Formally, the new perceptual loss based on AlexNet is defined as:

L_{Alex}(G) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l \odot (\phi(G(I_0)^p)^l_{hw} - \phi((I_{gt})^p)^l_{hw}) \|_2^2.  (4.4)

Here p refers to the local hole patch, and I_0, I_gt are the incomplete image and the ground truth respectively. \phi is the AlexNet and l is the feature layer. w_l is the layer-wise learned weight. With the benefit of this extra perceptual loss, the full loss function of SG-Net is defined as:

\min_G \Big( \lambda_{adv} \max_{D_1, D_2, D_3} \sum_{k=1,2,3} L_{GAN}(G, D_k) + \lambda_{perceptual} \sum_{k=1,2,3} L_{perceptual}(G) + \lambda_{Alex} L_{Alex}(G) \Big),  (4.5)

where we set \lambda_{adv} = 1, \lambda_{perceptual} = 10 and \lambda_{Alex} = 10 in our experiment.

4.3 Experiments

4.3.1 Experiment Setup

We conduct extensive comparisons on two public datasets: the Cityscapes dataset [9] and the Helen Face dataset [47, 80]. The Cityscapes dataset has 2,975 street-view images for training, and we use the validation set of 500 images for testing. The Helen Face dataset has 2,000 face images for training and 100 images for testing. Fine annotations of the segmentation labels for both datasets are also provided for training. Cityscapes and Helen Face are annotated with 35 and 11 categories respectively. To better capture the global structure of the street views, we map the 35 categories to 8 categories: road, building, sign, vegetation, sky, person, vehicle, and unlabeled otherwise. To make fair comparisons with existing methods, we train on images of size 256x256. For each image, we apply a mask with a single hole at a random location. The size of the hole is between 1/8 and 1/2 of the image size. To train the whole network, we first use the state-of-the-art semantic segmentation method Deeplabv3+ [7] and fix its model parameters. Then we train SP-Net and SG-Net separately for 200 epochs with linear learning rate decay in the last 100 epochs. Finally, we train the whole architecture in an end-to-end manner for an additional 100 epochs. We train all our models on an NVIDIA Titan X GPU. The total training time for the two steps is around 2 days, and the inference is real-time.

4.3.2 Comparisons

For Cityscapes, we compare our method with two methods: PatchMatch [2] and Globally-Locally consistent inpainting (GL) [35]. PatchMatch is the state-of-the-art non-learning-based approach, and GL is the recent work proposed by Iizuka et al. We make the comparison in a random-hole setting, and only GL applies Poisson image editing as post-processing, as stated in [35]. For the Helen Face dataset, we compare our method with Generative Face Completion (GFC) [51].
Qualitative Comparisons Fig. 4.2 shows the comparisons on the Cityscapes dataset, where images are randomly drawn from the test set. Cityscapes is a dataset of traffic-view images with highly complicated global structures. We can see that PatchMatch can generate realistic details by patch propagation, but it often fails to capture the global structure or synthesize new contents. While GL can provide plausible textures coherent with the surrounding area, it does not handle object shapes well and often predicts unreasonable structures.
As compared with GL, our SP-Net can focus on the task of shape prediction and then pass high-level semantic information as guidance to the generation step of SG-Net. This factorizes the inpainting task in a reasonable way, which enables the completion of different object shapes. For example, the masks in the second and third rows of Fig. 4.2 contain an intersection of multiple object boundaries, such as the car, building, and tree. While GL only propagates texture from the neighborhood of the holes and gives very blurry results, our SP-Net makes a more reasonable layout prediction and SG-Net recovers the boundaries of the car and building very clearly. Furthermore, our method can also infer part of a car even from a very small segmentation shape in the input and complete the wheel of the car in the first example of Fig. 4.2. Fig. 4.3 shows the comparisons on the Helen Face dataset. Since the model of [51] deals with images of size 128x128, we directly up-sample its results to 256x256 for comparison. For our results, the prediction of SP-Net is also shown in the lower left corners. It can be seen that our method generates more realistic face inpainting results than GFC [51], which is specifically designed and trained for face completion, and this indicates the strong generalization ability of our segmentation-based inpainting framework.
Figure 4.2: Visual comparisons of Cityscapes results with random holes. Each example from left to right: input image, PatchMatch [2], GL [35], SP-Net output, and SG-Net output (our final result). All images have size 256x256. Zoom in for better visual quality.
Quantitative Comparisons We make quantitative comparisons between PatchMatch, GL, and our method. We report four image quality assessment metrics: ℓ1, ℓ2, SSIM [87], and PSNR, following [66, 92]. Table 4.1 shows the comparison results. It can be seen that our method outperforms the other methods on three out of the four metrics. For ℓ2, GL has slightly smaller errors than ours, but ℓ2 error is less capable of assessing perceptual quality than SSIM and PSNR, as it tends to average pixel values and reward blurry outputs.

Table 4.1: Numerical comparison on 200 test images of Cityscapes.
Method           ℓ1 Error   ℓ2 Error   SSIM     PSNR
PatchMatch [2]   641.3      169.3      0.9419   30.34
GL [35]          598.0      94.78      0.9576   33.57
Ours             392.4      98.95      0.9591   34.26

User Study To better evaluate our method from a perceptual point of view, we conduct a user study on the Cityscapes dataset. We ask 30 users for perceptual evaluation, each with 20 subjective tests. In every test, users are shown the incomplete input image and are asked to compare the results of PatchMatch, GL, and ours. Among 600 total comparisons, the user study shows that our results receive the highest score 70.8% of the time. As compared with PatchMatch, our results are overwhelmingly better 96.2% of the time. Compared with GL, our results are perceptually better 71.3% of the time and are ranked the same 16.3% of the time.
Figure 4.3: Visual comparisons of Helen Face dataset results with random holes. Each example from left to right: input image, GFC [51], and our result. All images have size 256x256.

4.3.3 Analysis

Ablation Study Our main motivation is to introduce the segmentation label map as intermediate guidance to provide high-quality inpainting results. To justify this framework, we show the intermediate results of our method at each step and compare our result to the baseline result.
4.3.3 Analysis

Ablation Study. Our main motivation is to introduce the segmentation label map as intermediate guidance to produce high-quality inpainting results. To justify this framework, we show the intermediate results of our method at each step and compare them to a baseline. Here the baseline refers to a single SG-Net that only takes the incomplete image as input and is given no other conditions. We train both methods in the same setting for 200 epochs and show the comparison in Fig. 4.4. Deeplabv3+ provides an accurate segmentation label map, and SP-Net makes a reasonable prediction in the hole. SG-Net then inpaints the missing area based on the output of SP-Net, generating sharp and realistic details. Compared with our method, the baseline result is very blurred, especially along the boundaries of the vegetation and the car.

Figure 4.4: Ablation study. (a) Input image with missing hole. (b) Deeplabv3+ [7] output. (c) SP-Net result. (d) SG-Net result. (e) Baseline result. The size of the images is 256x256.

Interactive Editing. Our segmentation-based framework allows us to perform interactive editing on the inpainting task and to give multi-modal predictions for a single input image. Specifically, when we are given an incomplete image as input, we do not know the ground-truth layout in the missing hole. However, we can interactively edit the segmentation map inside the mask, for example by following the ground-truth labels to guide the inpainting (Fig. 4.5c) or by adding more components to the hole, e.g., a car (Fig. 4.5e). While both label maps are reasonable, SG-Net provides multi-modal outputs based on the different conditions (Fig. 4.5d,f).

Figure 4.5: Interactive editing. (a) Input image with missing hole. (b) Ground truth. (c) First label map. (d) Inpainting result based on (c). (e) Second label map. (f) Inpainting result based on (e). The size of the images is 256x256.

4.4 Conclusion

In this work, we propose a novel end-to-end learning framework for image inpainting. It is composed of two distinct networks that provide segmentation information and generate realistic and sharp details. We have observed that segmentation label maps can be predicted directly from the incomplete input image and provide important guidance for texture generation in the missing hole. Our method also allows interactive editing to manipulate the segmentation maps and predict multi-modal outputs. We expect that these contributions broaden the possibilities of the image inpainting task and can be applied to more image editing and manipulation applications.

Chapter 5
Novel Human-Object Interaction Detection via Adversarial Domain Generalization

5.1 Introduction

Over the past few years, rapid progress has been made in visual recognition tasks, but image understanding also calls for visual relationship detection, i.e., the detection of ⟨subject, predicate, object⟩ triplets in an image. While some efforts have been made to detect general relationships between objects [60, 56, 10, 89, 97, 55, 96, 93], one particularly important class of visual relationship detection requiring further study is Human-Object Interaction (HOI) detection, where only relations with human subjects are of interest [6, 31, 5, 18, 54, 69, 25].

A long-standing problem in both HOI detection and visual relationship detection is the long-tail problem, where a few predicates dominate the triplet instances of most object categories. Fig. 5.2 shows the distribution of the triplet categories and of the predicate categories given the object "horse" in the HICO-DET dataset.
In both cases, a small number of categories dominates the training instances, allowing a learned model to rely on a frequency prior rather than learning the relationship features themselves. The same conclusion is drawn on the Visual Genome dataset [46] by Zellers et al. [96], who point out that the frequency prior is a main barrier for visual relationship detection.

Annotation: ride (motorcycle) | Annotation: watch (tv) | Prediction: ride (dog)
Figure 5.1: Novel relationship detection. Green box: subject. Red box: object. The first two images are from the training set, while the last image contains an unseen triplet from the test set.

Collecting a balanced dataset is a simple approach to tackle this problem. However, if we have N predicates and M objects, the number of possible triplet combinations is M x N. It is difficult to collect all of these combinations because many relationships are infrequent. For the relationship detection task, the long-tail problem, the combinatorial problem, and the frequency-prior barrier are closely related, in the sense that the distribution of the triplet categories is extremely imbalanced in the large compositional space of triplet categories.

Motivated by these observations, we focus on the novel HOI detection problem [78], where the predicate-object combinations in the test set are never seen in the training set. As shown in Fig. 5.1, the training and test sets share the same predicate categories, but the combinations of predicate and object categories in the test set are unseen. This task is challenging because the model is required to learn object-invariant predicate features and generalize to unseen interactions; such features can further be applied to downstream tasks.

Our first contribution is to create a new benchmark dataset for the novel HOI detection task, based on the images and annotations from the HICO-DET dataset [5] and the UnRel dataset [68]. The new benchmark avoids any overlap of the triplet categories among the training, validation, and test sets, and it contains an additional evaluation set taken from the UnRel dataset [68], which highlights instances with unusual scenes.

Figure 5.2: Number of instances in the HICO-DET dataset for each (a) HOI category and (b) predicate category of "horse".

Our second contribution is to propose a unified adversarial domain generalization framework, which can serve as a plug-in module for existing models to improve their generalization ability. We instantiate both conditional and unconditional methods within the framework and relate it to previous methods. Experiments on the HICO-DET and UnRel datasets show that our proposed adversarial training obtains uniformly significant improvements on all metrics. Our work shows promising results of adversarial domain generalization in conquering the combinatorial prediction problem in real-world applications.

5.2 Problem Statement

5.2.1 Problem Formulation

Suppose the training set and test set are represented by D_train = {(I_i, (b_S)_i, (b_O)_i, S_i, O_i, P_i)} and D_test = {(I_j, (b_S)_j, (b_O)_j, S_j, O_j, P_j)}, where b_S and b_O are the bounding boxes of subjects and objects, and I, S, O, P denote the images, subject labels, object labels, and predicate labels. The novel HOI detection task is defined by the constraint that there are no overlapping combinations of P and O between the two sets. The goal of HOI detection is to learn a function F : I → {b_S, b_O, O, P}.
This contains two steps, object detection and predicate detection, which makes the problem more complex and more difficult to analyze. In this paper, we focus on the predicate prediction problem in order to learn object-invariant features, and we formulate it as F : {I, b_S, b_O} → P.

5.2.2 Dataset Creation

The most commonly used datasets for HOI detection are V-COCO [31] and HICO-DET [5]. As V-COCO is relatively small, it is insufficient for evaluating novel HOI detection. Therefore, we primarily use the HICO-DET dataset for our experiments and evaluation, with 600 HOI categories and over 150K annotated instances of human-object pairs. We extract 117 predicates and 80 object categories from the 600 HOI categories to evaluate the HOI detection performance of the model on unseen ⟨human, predicate, object⟩ triplets.

However, the original split of HICO-DET was not designed to verify the effectiveness of the proposed models when transferring to unseen predicate-object pairs. With the original train/test split, [78] use part of the predicate-object combinations in the train set and the other part of the predicate-object combinations in the test set to set up the novel HOI detection task. However, this approach discards most of the data in the original test split and results in a very small novel-HOI test set, and thus large fluctuation of the evaluation metrics. Moreover, it is not clear how the validation set is set up and how the hyperparameters are tuned in [78]. Therefore, we create a new split of HICO-DET based on its images and annotations, following the principle that none of the ⟨human, predicate, object⟩ triplet categories in the test set should exist in the training set. We collect the triplet instances in the whole dataset and then divide them into 90% training and 10% test sets without overlapping triplet categories; we also ensure that the triplet instances in the training set and the test set come from different image sets. We further divide the training set into 7/9, 1/9, and 1/9 for the training, trainval, and testval splits, where the training and trainval splits share the same triplet categories and the testval split has no overlapping triplet categories with training and trainval. The combined trainval and testval splits are used as the validation set for hyperparameter tuning. The statistics of this new split are listed in Table 5.1.

             | HICO-DET                                   | UnRel
             | train   trainval  testval  test    total   | split 1  split 2  split 3
#images      | 31873   4357      5528     5421    47179   | 196      248      494
#instances   | 106043  15149     10408    10391   141991  | 323      396      718
Table 5.1: Statistics of the new splits.

In addition to the HICO-DET dataset, we use the UnRel dataset [68] for auxiliary validation of models trained on HICO-DET. The UnRel dataset was originally used for image retrieval and was designed for the detection of rare relations. It contains 1,071 annotated images with 76 triplet categories. We choose the instances whose predicate categories are the same as those in HICO-DET to validate the generalization ability of the model. Based on the chosen predicate and object types, we create three different splits of the UnRel dataset to verify the generalization performance at different levels:

Split 1: D^1_unrel = {(I_{k1}, (b_S)_{k1}, (b_O)_{k1}, S_{k1}, O_{k1}, P_{k1})}, with S_{k1} = human, {O_{k1}} ⊆ D_train, and {P_{k1}} ⊆ D_train.
Split 2: D^2_unrel = {(I_{k2}, (b_S)_{k2}, (b_O)_{k2}, S_{k2}, O_{k2}, P_{k2})}, with S_{k2} = human and {P_{k2}} ⊆ D_train.
Split 3: D^3_unrel = {(I_{k3}, (b_S)_{k3}, (b_O)_{k3}, S_{k3}, O_{k3}, P_{k3})}, with {P_{k3}} ⊆ D_train.

The statistics of the splits are shown in Table 5.1.
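The category-level split logic described above can be sketched in a few lines. This is a simplified illustration rather than the exact script used to build the benchmark: the record format and helper names are assumptions, and the additional constraints (disjoint image sets, the 7/9-1/9-1/9 subdivision) are only indicated in comments.

```python
import random
from collections import defaultdict

def split_by_triplet_category(annotations, test_ratio=0.1, seed=0):
    """Split HOI annotations so that no <human, predicate, object>
    category of the test set ever appears in the training set.

    `annotations` is a list of dicts with at least the keys
    'image_id', 'predicate', and 'object' (the subject is always human).
    """
    rng = random.Random(seed)

    # Group instances by their predicate-object combination.
    by_category = defaultdict(list)
    for ann in annotations:
        by_category[(ann['predicate'], ann['object'])].append(ann)

    # For each predicate, send roughly 10% of its object categories to the
    # test split, so train and test share predicates but never triplets.
    by_predicate = defaultdict(list)
    for cat in by_category:
        by_predicate[cat[0]].append(cat)

    train, test = [], []
    for predicate, cats in by_predicate.items():
        rng.shuffle(cats)
        n_test = max(1, int(round(test_ratio * len(cats)))) if len(cats) > 1 else 0
        test_cats = set(cats[:n_test])
        for cat in cats:
            (test if cat in test_cats else train).extend(by_category[cat])

    # Further constraints from the thesis are not implemented here:
    #  - keep the image sets of train and test disjoint;
    #  - subdivide train into train / trainval / testval (7/9, 1/9, 1/9).
    return train, test
```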
5.2.3 Evaluation Metrics

In this paper, we focus on predicate detection and use ground-truth object boxes and labels for evaluation. Due to the ambiguity and incompleteness of HOI annotations (e.g., "ride" vs. "straddle"), we propose to use recall metrics, which are widely used in visual relationship detection and scene graph generation [60, 97, 55, 96, 99, 81], as follows: (1) Predicate classification (PredCls): for each human-predicate-object triplet in the test set, predict the predicate class given the ground-truth bounding boxes and object label. (2) Predicate detection (PredDet): for each image in the test set, detect all human-predicate-object triplets given the ground-truth bounding boxes and their associated labels.

We could replace the ground-truth object bounding boxes and labels with detected boxes and labels (from a pretrained or jointly trained object detector), in which case the PredDet metric becomes the standard SgDet metric in scene graph generation [60, 97, 55, 96, 99, 81]. However, the SgDet metric is very sensitive to object detection performance. Therefore, our metrics use ground-truth object boxes and labels to exclude the factor of object detection performance.

5.3 Method

In this section, we start with an overview of our baseline in Section 5.3.1. We then present our adversarial domain generalization framework in Section 5.3.2, instantiate three approaches within it, and discuss the relation between our framework and the previous DeepC [52]. Finally, we introduce the implementation in Section 5.3.3.

5.3.1 Overview

Following the conventions in [18, 25, 54], our baseline model has three branches that extract different types of visual features for predicate prediction, as shown in Fig. 5.3(a). We use a union-box branch rather than an object branch to extract visual features, as the union box contains more visual information for capturing the interaction between the human and the object. Our proposed domain generalization approaches are applied only to the union-box branch, while the other two branches are kept unchanged for a fair comparison.

We denote the human box and union box as b_h and b_u. From the three branches, we predict three probabilities over the predicate categories, denoted as s_h, s_u, and s_sp. As the same human-object pair may have multiple predicates as ground-truth labels, we treat predicate prediction as a multi-label classification problem based on binary sigmoid classifiers, and minimize the cross-entropy losses of the three branches for each category, denoted as L_H, L_sp, and L_U. The total loss function is then defined as

    L_baseline = L_H + L_sp + L_U.    (5.1)

At inference time, we rank the scores of each triplet based on the formula score_triplet = (s_h + s_u) · s_sp.
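A minimal sketch of the three-branch training loss (5.1) and the inference-time scoring is given below. It assumes PyTorch, per-branch logits of shape (batch, 117), and multi-hot predicate targets; the branch networks themselves are omitted, and the names are illustrative rather than those of the actual implementation.

```python
import torch
import torch.nn.functional as F

def baseline_loss(logits_h, logits_u, logits_sp, targets):
    """Eq. (5.1): sum of per-branch multi-label sigmoid cross entropies.

    Each logits tensor has shape (B, 117); `targets` is a multi-hot tensor
    of the same shape, since a human-object pair may carry several predicates.
    """
    loss_h = F.binary_cross_entropy_with_logits(logits_h, targets)
    loss_sp = F.binary_cross_entropy_with_logits(logits_sp, targets)
    loss_u = F.binary_cross_entropy_with_logits(logits_u, targets)
    return loss_h + loss_sp + loss_u

def triplet_scores(logits_h, logits_u, logits_sp):
    """Inference-time ranking score per predicate: (s_h + s_u) * s_sp."""
    s_h, s_u, s_sp = map(torch.sigmoid, (logits_h, logits_u, logits_sp))
    return (s_h + s_u) * s_sp
```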
5.3.2 Adversarial Domain Generalization (ADG)

To learn a visually grounded relationship feature that can generalize to novel ⟨predicate, object⟩ pairs, the feature should be as object-invariant as possible. In this paper, we focus on learning such a feature from the union-box branch, i.e., f_u in Fig. 5.3(b), because the spatial-branch feature and the human-branch feature are both expected to be object-invariant by design.

Figure 5.3: Architectures. (a) Baseline architecture, which consists of (i) a union-box branch, (ii) a human branch, and (iii) a spatial branch. (b) The proposed ADG framework for domain generalization. (c) ADG-KLD. (d) CADG-KLD. (e) CADG-JSD.

We view this object-invariant feature learning as a domain generalization (DG) problem, where each object category is viewed as a separate domain. We aim at learning domain (object)-invariant features for predicting the class (predicate category). For a given predicate, say "ride", we only have "ride-horse" and "ride-bicycle" in the training data, i.e., training data is only collected from the domains "horse" and "bicycle". However, we have the unseen pair "ride-dog" in the test data, i.e., we want our model to generalize well to the new domain "dog".

This DG problem is extremely challenging for three reasons. First, our feature extractor f_u = F(x) is a deep neural network, while nearly all previous DG methods [41, 21, 62, 90, 22] are only tested with linear feature extractors; very recent work [52] showed promising results on domain generalization with deep features. Second, the number of domains is large, i.e., 80 object categories in our case. Third, there are huge variations in the predicate class distribution across domains. For example, the predicate class distributions of the domain "horse" (see Fig. 5.2(b)) and the domain "cup" are largely different. [52] is only tested on problems with at most 5 domains and does not handle large variations in class distribution across domains, as mentioned in their paper and shown in our results.

In the rest of this section, we propose a general framework for adversarial domain generalization (ADG). We introduce a DG regularization L_DG into the training, i.e.,

    L_total = L_H + L_sp + L_U + L_DG,    (5.2)

which effectively injects an inductive bias into the training process toward learning domain (object)-invariant features. L_DG involves divergences between high-dimensional distributions, so we introduce discriminators to estimate it and perform alternating adversarial training to minimize the total loss function (5.2); see Fig. 5.3. We denote by M the number of domains (i.e., object categories) and by K the number of classes (i.e., predicate categories). In our task, M = 80 and K = 117.

Unconditional adversarial domain generalization (ADG-KLD). A first attempt is to enforce the invariance of the extracted feature distributions across domains, i.e., P(f_u | obj_i) = P(f_u | obj_j) for any two different object categories obj_i and obj_j. For example, the union-box feature distributions of the domains "horse" and "elephant" are expected to be similar, because they share similar interactions with humans. This shared feature space is expected to be more independent of the object categories and more applicable to an unseen domain such as "donkey". Enforcing the mutual similarity between every two domains is equivalent to enforcing, for each domain, the similarity to the pooled feature distribution:

    P(f_u | obj_i) = P(f_u)    for all i ∈ [M],    (5.3)

where P(f_u) is the pooled feature distribution

    P(f_u) = Σ_{i∈[M]} w_i P(f_u | obj_i),    Σ_{i∈[M]} w_i = 1,    (5.4)

and w_i is the relative importance of each domain. In this paper, we choose w_i = N_i / N, i.e., the fraction of data in domain i. We use adversarial training to enforce this distribution match.
Specifically, we introduce a discriminator D : f_u → [0, 1]^M that tries to classify the domain (object) based on the union-box feature, while the feature extractor F tries to confuse the discriminator. Formally, they play the following minimax game:

    min_F max_D  Σ_{i∈[M]} w_i E_{f∼P(f_u|obj_i)} [log D_i(f)].    (5.5)

In practice, we optimize this with stochastic gradient descent (SGD). For each sample x (from domain obj(x)), we update D and F with the minimax loss

    min_F max_D  log D_{obj(x)}(F(x)),    (5.6)

where w_i = N_i / N is used to reduce (5.5) to (5.6). Assuming infinite capacity of the discriminator D, the maximum of (5.5) is a weighted summation of KL divergences between distributions:

    L_DG = KLD := Σ_{i∈[M]} w_i KL( P(f_u | obj_i) || P(f_u) ).    (5.7)

We provide the proof in the appendix. Therefore, the adversarial training in (5.6) indeed adds the DG regularization (5.7) to the training objective (5.2), effectively enforcing the invariance of feature distributions in (5.3).

Conditional adversarial domain generalization. Due to the class distribution mismatch across domains, however, (5.3) may not be the invariance we want to achieve for object-invariant predicate detection. For example, the predicate class distributions of the domain "horse" and the domain "cup" are completely different, and it is therefore not reasonable to enforce a predicate feature distribution match between these two domains. Taking the class distribution mismatch into account, we can instead enforce that the conditional feature distribution is the same across domains (objects):

    P(f_u | obj_i, pred_k) = P(f_u | pred_k)    for all i ∈ [M], k ∈ [K],    (5.8)

where the pooled conditional distribution is

    P(f_u | pred_k) = Σ_{i∈[M]} w_i^{(k)} P(f_u | obj_i, pred_k),    (5.9)

and w_i^{(k)} (with Σ_{i∈[M]} w_i^{(k)} = 1) is the relative importance of the different conditional distributions. In this paper, we choose w_i^{(k)} = N_i^{(k)} / N^{(k)}, where N_i^{(k)} is the number of training samples in domain (object category) i with (predicate) class k, and N^{(k)} = Σ_{i∈[M]} N_i^{(k)} is the number of samples with (predicate) class k in the full training dataset. One can also choose other weights, such as the uniform weights w_i^{(k)} = 1/M of [52], and the methods presented below still apply with a corresponding re-weighting of training samples. However, as we show in Table 5.3, the uniform weights of [52] yield only marginal improvement, while our weights achieve significant improvement. We provide more discussion in the appendix. In the following, we present two methods to achieve (5.8), obtained by specifying two different kinds of divergence as the DG regularization L_DG in (5.2).

KL Divergence (CADG-KLD). In the first method, similar to the unconditional case, we enforce (5.8) with the following conditional KL divergence:

    L_DG = CKLD := Σ_{k∈[K]} β^{(k)} Σ_{i∈[M]} w_i^{(k)} KL( P(f_u | obj_i, pred_k) || P(f_u | pred_k) ),    (5.10)

where w_i^{(k)} are the weights defined in (5.9) and β^{(k)} are weights that balance the different classes. To achieve this, we introduce discriminators conditioned on pred_k, written as D(f, pred_k) ∈ [0, 1]^M, which try to classify the sample's domain (object category). The feature extractor F tries to confuse the discriminator:

    min_F max_D  Σ_{k∈[K]} β^{(k)} Σ_{i∈[M]} w_i^{(k)} E_{f∼P(f_u|obj_i,pred_k)} [log D_i(f, pred_k)].    (5.11)

We show in the appendix that, assuming infinite capacity of D, the maximum of (5.11) (up to a constant) is indeed the CKLD defined in (5.10). We propose to use β^{(k)} = N^{(k)} / N and w_i^{(k)} = N_i^{(k)} / N^{(k)}.
In practice, we optimize (5.11) with SGD: for each sample x (with label pred(x) and domain obj(x)), we update D and F with the following regularization:

    min_F max_D  log D_{obj(x)}(F(x), pred(x)).    (5.12)

Jensen-Shannon Divergence (CADG-JSD). In the second method, we enforce (5.8) with the following conditional JSD:

    L_DG = CJSD := Σ_{k∈[K]} β^{(k)} Σ_{i∈[M]} w_i^{(k)} JSD( P(f_u | obj_i, pred_k) || P(f_u | pred_k) ),    (5.13)

where β^{(k)} and w_i^{(k)} are user-specified weights that balance the different terms. We introduce discriminators conditioned on pred_k, written as D(f, obj_i, pred_k) ∈ [0, 1]. The objective of the discriminator is to distinguish whether a feature comes from the domain-specific distribution P(f_u | obj_i, pred_k) or from the pooled distribution P(f_u | pred_k), while the feature extractor tries to confuse the discriminator:

    min_F max_D  Σ_{k∈[K]} β^{(k)} Σ_{i∈[M]} w_i^{(k)} { E_{f∼P(f_u|obj_i,pred_k)} [log D(f, obj_i, pred_k)] + E_{f∼P(f_u|pred_k)} [log(1 − D(f, obj_i, pred_k))] }.    (5.14)

We show in the appendix that, assuming infinite capacity of D, the maximum of (5.14) (up to a constant) is indeed the CJSD defined in (5.13). We propose to use β^{(k)} = N^{(k)} / N and w_i^{(k)} = N_i^{(k)} / N^{(k)}, as in CADG-KLD. In practice, we optimize this with SGD: for each sample x (with label pred(x) and domain obj(x)), we update D and F with the following regularization:

    min_F max_D  log D(F(x), obj(x), pred(x)) + Σ_{i∈[M]} (N_i^{pred(x)} / N^{pred(x)}) log(1 − D(F(x), obj_i, pred(x))).    (5.15)

A general recipe for ADG. Finally, we summarize a general three-step recipe for ADG training. First, one chooses the invariance to be enforced, such as the unconditional feature distribution matching (5.3) or the conditional matching (5.8). Second, one chooses the distribution divergence (or distance) to be used as the DG regularization, such as the KL divergences (5.7) and (5.10) or the JS divergence (5.13); with the many GAN variants, one can freely pick other divergences and distances, see, e.g., [1, 50, 64, 100]. Third, one utilizes a GAN formulation to convert the DG regularization into an adversarial problem and performs the adversarial training with SGD, as in (5.6), (5.12), or (5.15).

Relation between the proposed framework and DeepC/CIDDG [52]. The loss function of DeepC is a special case of our general framework: when β^{(k)} = w_i^{(k)} = 1 in (5.11), our CADG-KLD reduces to DeepC. We propose to use w_i^{(k)} = N_i^{(k)} / N^{(k)}, which gives significantly better results. Thanks to our framework, the meaning of these parameters and of our choice is intuitive: domains with larger sample sizes should contribute more to the pooled distribution (5.9) and should have larger weights in the distribution-matching regularization (5.11). CIDDG reweights the conditional distributions P(f_u | obj_i, pred_k) with 1 / w_i^{(k)} to compute their class-prior-normalized minimax value. This cannot be applied in our case, because many w_i^{(k)} are zero, since many predicate-object pairs are never seen in the training set. In short, the large number of domains and the huge variation across domains (i.e., w_i^{(k)} = 0 for many (i, k)) make the off-the-shelf DeepC perform poorly on the novel HOI detection task. Our results in Tables 5.3 and 5.4 demonstrate the advantage of our methods over DeepC.
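Before turning to the concrete architectures, the per-sample regularizers (5.6) and (5.12) can be summarized as a single alternating update. The sketch below assumes PyTorch modules for the feature extractor, the predicate head, and the domain discriminator, plus a weight on the DG term; it illustrates the training scheme rather than reproducing the actual training code, and all names are assumptions.

```python
import torch
import torch.nn.functional as F_nn

def adg_step(extractor, pred_head, domain_disc, x, pred_targets, obj_labels,
             opt_main, opt_disc, dg_weight=1.0, pred_labels=None):
    """One alternating update for ADG-KLD (or CADG-KLD if `pred_labels`
    is passed to a predicate-conditioned discriminator).

    `opt_main` should hold the parameters of `extractor` and `pred_head`,
    and `opt_disc` the parameters of `domain_disc` (80 object classes).
    """
    # Discriminator step: classify the object domain from frozen features,
    # i.e. the max_D part of (5.6)/(5.12).
    with torch.no_grad():
        f_u = extractor(x)
    logits = domain_disc(f_u) if pred_labels is None else domain_disc(f_u, pred_labels)
    d_loss = F_nn.cross_entropy(logits, obj_labels)
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # Main step: predicate loss plus the DG regularizer, which pushes the
    # features to confuse the discriminator (the min_F part).
    f_u = extractor(x)
    task_loss = F_nn.binary_cross_entropy_with_logits(pred_head(f_u), pred_targets)
    logits = domain_disc(f_u) if pred_labels is None else domain_disc(f_u, pred_labels)
    dg_loss = -F_nn.cross_entropy(logits, obj_labels)
    loss = task_loss + dg_weight * dg_loss
    opt_main.zero_grad()
    loss.backward()
    opt_main.step()
    return task_loss.item(), d_loss.item()
```

CADG-JSD follows the same alternation, but with the binary discriminator and the pooled-distribution term of (5.15) in place of the multi-class cross entropy.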
5.3.3 Architectures

Fig. 5.3 shows the network structures of ADG-KLD, CADG-KLD, and CADG-JSD. As Fig. 5.3(b) shows, the adversarial branch (blue path) of ADG-KLD and the union-box branch (red path) are trained iteratively in an adversarial manner. CADG-KLD is a conditional version of ADG-KLD, where the adversarial branch takes the predicate embedding as an additional input (Fig. 5.3(d)). The goal of the adversarial branch in CADG-JSD is to distinguish whether the input feature is from the object-specific distribution or from the pooled distribution, given the predicate. Therefore, as shown in Fig. 5.3(e), it takes the feature, the object embedding, and the predicate embedding as inputs and predicts a binary output.

5.4 Experiments

Implementation details. For the feature extraction backbone, we adopt ResNet-50 [34] and follow the setting of [18, 54]. There are three branches in our baseline model, and our domain generalization framework is only applied on the union-box branch. Training the model takes around 60 hours on 4 NVIDIA P100 GPUs, and we apply the linear scaling rule according to [29].

Creating the new split of the HICO-DET dataset. We first combine the original training and test sets of HICO-DET to merge all the annotations. As the original images in HICO-DET are annotated at the HOI level, we merge identical pairs of annotated boxes with IoU ≥ 0.7. We then follow two main principles to split the full dataset: the training set and test set should not have overlapping ⟨human, predicate, object⟩ triplet categories, and the training set and test set should not have overlapping images. For each specific predicate, we split the objects related to this predicate into 90% and 10% for the training set and test set. After obtaining the training and test sets, we further create the validation sets from the training set. We first divide the training set into 8/9 and 1/9 for "train+trainval" and "testval" (for novel-HOI validation), following the same procedure used for creating the training and test sets. We then split the "train+trainval" set i.i.d. into 7/8 and 1/8 for the "train" and "trainval" sets, where the "trainval" set is used for in-domain HOI validation (HOIs seen in the train set). We perform model selection based on the combined "trainval"+"testval" set, which gives a good trade-off between in-domain and out-of-domain generalization.

Evaluation metrics. The most commonly used evaluation metric in HOI detection is the role mean average precision (role mAP) [31], which takes the image as input and predicts each instance in the format of ⟨human box, object box, HOI class⟩. If the correct HOI class is predicted and both the human and the object bounding boxes have IoU ≥ 0.5 w.r.t. the ground-truth annotations, the instance is marked as a true positive. However, we do not use this metric, for two reasons. First, it evaluates two steps, object detection and predicate detection, whereas we want to focus on the predicate prediction problem. Second, the HOI class is defined for a combination of predicate and object labels, which cannot be applied to novel HOI detection; we would like to disentangle the HOI class into a predicate class and an object class in order to generalize the metrics to unseen triplet categories.

For the HICO-DET dataset, the only official evaluation metric is the role mAP, but there are other evaluation metrics that are commonly used in visual relationship detection and scene graph generation. Scene graph generation mostly uses predicate classification, scene graph classification, and scene graph generation as its metrics, and visual relationship detection uses predicate detection, phrase detection, and relation detection.
Because of the incomplete annotations of the relationships between objects in most datasets, only recall, rather than precision, is used for evaluation, e.g., R@20, R@50, and R@100. There is a fundamental difference between the two tasks: graph constraints. In scene graph generation, each edge between an object pair can only have one relation prediction, while in visual relationship detection there are no graph constraints, which means that each edge can have multiple relation predictions. In our setting of HOI detection, there can be multiple predicate labels between the same human-object pair, so it is more suitable to use the evaluation metrics of the visual relationship detection setting. Other differences are listed as follows.

In scene graph generation, the metrics are defined as:
Predicate Classification: Given images, ground-truth boxes, and object labels, predict the predicate labels. The model should be able to judge whether two objects have a relationship or not and make the predictions under graph constraints.
Scene Graph Classification: Given images and ground-truth boxes, predict the object labels and the predicate labels; the model usually needs to predict both simultaneously.
Scene Graph Generation: Given only the images, predict the bounding boxes, the object labels, and the predicate labels under graph constraints.

In visual relationship detection, the metrics are defined as:
Predicate Detection: Given images, ground-truth boxes, object labels, and the object pairs that have relationships in the annotations, predict the predicate labels without graph constraints.
Phrase Detection: Given only the images, predict the union boxes of two interacting objects and their predicate labels. This was designed in early papers and is not used much recently.
Relation Detection: Given only the images, predict the bounding boxes, the object labels, and the predicate labels without graph constraints.

As stated in Section 5.2.1, relationship detection usually contains two steps: object detection and predicate detection. In this paper, we mainly focus on the predicate prediction problem in order to learn object-invariant features; we therefore provide the ground-truth boxes as input to the model and attend to the generalization ability in the novel HOI detection problem. Under these circumstances, where the ground-truth boxes are given and our task, like visual relationship detection, has no graph constraints, we use predicate detection (PredDet) as one of our metrics. In visual relationship detection, R@50 and R@100 are usually used for evaluation. However, unlike the Visual Genome dataset and some other datasets, the HICO-DET dataset has on average only 3.1 relationships and relatively few object instances per image, so we use R@5 and R@10 for evaluation. We also introduce another metric, inspired by classification and image retrieval: for each given human-object pair, we predict its predicate label based on the ranking of all predicates. This is defined as instance predicate detection (PredCls), and we use R@1 and R@5 for evaluation. For the UnRel dataset, most images contain only one relationship pair, so PredDet is not very meaningful compared with PredCls. Therefore, we only use PredCls, with R@1 and R@5, for evaluation on the UnRel dataset.
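A minimal sketch of the PredCls recall computation is shown below; it assumes one score vector over the 117 predicates per annotated human-object pair and a set of ground-truth predicate indices per pair. The exact accounting used in the thesis (for example, how pairs with several ground-truth predicates are counted) is an assumption.

```python
import numpy as np

def predcls_recall(scores, gt_labels, k=1):
    """Instance-level predicate recall at K (PredCls R@K).

    scores:    array of shape (num_pairs, 117), one score row per
               ground-truth human-object pair.
    gt_labels: list of sets; gt_labels[i] holds the ground-truth predicate
               indices of pair i (a pair may have several).
    A ground-truth predicate counts as recalled if it appears among the
    top-k scoring predicates of its pair.
    """
    hit, total = 0, 0
    for row, labels in zip(scores, gt_labels):
        topk = set(np.argsort(-row)[:k])
        for label in labels:
            total += 1
            hit += int(label in topk)
    return hit / max(total, 1)
```

PredDet R@K can be computed analogously, except that the ranking is taken over all candidate triplets of an image rather than within a single pair.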
5.4.1 Evaluations

Baseline. To achieve a fair comparison, we first show that our strong baseline performs better than other approaches with similar architectures [5, 25, 18, 69] on the original split of the HICO-DET dataset (Table 5.2). We then use this baseline model on the new split for comparison with our proposed framework. In our experiments, we focus on examining whether our proposed adversarial training, as a plug-in module, can robustly improve the baseline; it can easily be applied to other recent HOI detection methods [53, 32, 84, 85, 104].

(mAP columns per setting: Full, Rare, Non Rare.)
Method           | Default               | Known Object
HO-RCNN [5]      | 7.81   5.37   8.54    | 10.41  8.94   10.85
InteractNet [25] | 9.94   7.16   10.77   | -      -      -
iCAN [18]        | 14.84  10.45  16.15   | 16.26  11.33  17.73
GPNN [69]        | 13.11  9.34   14.23   | -      -      -
Baseline         | 15.27  11.82  16.31   | 16.97  13.94  17.87
Table 5.2: Performance on the original HICO-DET dataset.

HICO-DET Dataset. We conducted extensive experiments to tune the hyperparameters of the baseline method on the new split in order to achieve its best performance, as shown in Table 5.3. We observe that the baseline model performs much worse on the test set than on the trainval set, which is expected and indicates that the baseline model has very limited generalization ability. Another simple baseline is the frequency model, which gathers the statistics of the ⟨human, predicate, object⟩ triplets in the training set and predicts the predicate class based only on frequency. From Table 5.3 we observe that the frequency model can only make random predictions on the test set, as expected, because the triplets in the test set do not exist in the training set. Moreover, on the trainval set the performance of the frequency model is even a little higher than that of our baseline model, which indicates that the baseline model mostly learns the frequency bias of the training set. The same was observed on the Visual Genome dataset [46] by Zellers et al. [96]. In addition, DeepC [52] proposes a deep domain generalization approach that has only been applied to toy datasets; we extend this method to the new split as another baseline.

(Columns per split: PredCls R@1, PredCls R@5, PredDet R@5, PredDet R@10.)
Method      | trainval                              | testval                                | test
Frequency   | 43.04   95.25   54.23   69.58         | 0.00    1.93    0.25    2.49           | 0.00    0.15    0.12    2.43
Baseline    | 41.87   91.09   51.17   66.92         | 32.03   75.95   52.96   68.88          | 32.18   76.73   51.92   67.45
DeepC [52]  | 40.58 (-3.1%)   90.41   50.21   65.85 | 32.82 (+2.5%)   76.92   53.89   69.54  | 32.80 (+1.9%)   77.62   52.33   67.95
ADG-KLD     | 40.08 (-4.3%)   89.85   49.72   65.94 | 46.78 (+46.1%)  80.27   61.05   74.48  | 48.68 (+51.3%)  81.61   60.98   74.34
CADG-KLD    | 39.92 (-4.7%)   88.09   49.07   64.66 | 40.33 (+25.9%)  75.85   55.04   69.43  | 41.66 (+29.5%)  76.89   54.88   68.96
CADG-JSD    | 40.15 (-4.1%)   88.48   50.12   65.38 | 42.29 (+32.0%)  76.60   56.10   69.68  | 43.47 (+35.1%)  77.55   56.08   69.26
Table 5.3: Performance on the new split of the HICO-DET dataset. For [52], ADG-KLD, CADG-KLD, and CADG-JSD, we measure the relative ratio with respect to the baseline to compute the gain or loss on PredCls R@1.

We compare our proposed adversarial domain generalization framework with the baseline models. Our proposed methods all decrease by around 4% on PredCls R@1 of the trainval set, which is reasonable as the proposed models disentangle object and predicate representations. On the other hand, while DeepC [52] does not show much improvement over the baseline, our models gain around 26%-51% on the testval and test sets. As discussed in Section 5.3.2, DeepC is a special case of CADG-KLD with uniform weights across domains, while we propose to use natural weights; from Tables 5.3 and 5.4, this change improves CADG-KLD's performance significantly.
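The frequency baseline referred to above simply memorizes how often each predicate co-occurs with each object category in the training annotations and ranks predicates by those counts at test time. A small sketch with illustrative names:

```python
from collections import Counter, defaultdict

def build_frequency_prior(train_annotations):
    """Count, for every object category, how often each predicate occurs."""
    counts = defaultdict(Counter)
    for ann in train_annotations:
        counts[ann['object']][ann['predicate']] += 1
    return counts

def frequency_predict(counts, obj_label, k=5):
    """Rank predicates for a human-object pair purely by training frequency."""
    return [pred for pred, _ in counts[obj_label].most_common(k)]
```

On the novel-HOI test split every predicate-object combination is unseen, so the correct predicate never appears among an object's training-time predicates, which is what the near-zero Frequency rows of Table 5.3 reflect.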
UnRel Dataset. To further investigate the generalization ability of the proposed models on novel relation triplets, we evaluate the metrics on the UnRel dataset directly, using the same models trained on HICO-DET as in Table 5.3. The evaluation results are shown in Table 5.4. While we observe only a minor performance gain for DeepC [52], all of our proposed models show significant and uniform improvements on the metrics, and ADG-KLD and CADG-JSD perform even better, with increases of over 75% and 125% compared with the baseline model. Note that the UnRel dataset has triplets with unseen object classes and even non-human subjects, which indicates that our proposed models generalize better to unseen triplet categories than the baseline.

(Columns per split: R@1, R@5.)
Method      | split 1         | split 2         | split 3
Frequency   | 0.00    0.00    | -       -       | -       -
Baseline    | 13.31   63.47   | 17.68   66.41   | 14.76   61.00
DeepC [52]  | 16.72   65.94   | 21.97   67.93   | 14.90 (+1%)    63.51
ADG-KLD     | 39.01   74.30   | 41.41   76.52   | 33.15 (+125%)  69.22
CADG-KLD    | 19.20   69.35   | 26.01   71.21   | 20.19 (+37%)   60.86
CADG-JSD    | 27.24   73.99   | 34.09   75.51   | 26.04 (+75%)   64.07
Table 5.4: Performance of the evaluation metric PredCls on the UnRel dataset. For [52], ADG-KLD, CADG-KLD, and CADG-JSD, we measure the relative ratio with respect to the baseline to compute the gain or loss on R@1.

Qualitative Results. We show our human-object interaction detection results in Figure 5.4, where each subplot illustrates one ⟨human, predicate, object⟩ triplet. We choose the predicate with the top score of each instance for visualization and comparison. Images of the first three columns are from the HICO-DET dataset. We note that our models perform uniformly better than the baseline and detect the predicates even when facing unseen triplets in the test images. The last column of Figure 5.4 shows rare scenes with unseen triplets, and there is even one instance that takes a cat as the subject. This implies that our models learn better features of the predicates themselves, invariant to the objects, and thus obtain strong generalization ability.

Per-image predictions shown in Figure 5.4:
Baseline: straddle | ADG-KLD: carry | CADG-KLD: hold | CADG-JSD: carry
Baseline: watch | ADG-KLD: no interaction | CADG-KLD: feed | CADG-JSD: feed
Baseline: jump | ADG-KLD: ride | CADG-KLD: ride | CADG-JSD: ride
Baseline: wear | ADG-KLD: ride | CADG-KLD: ride | CADG-JSD: ride
Baseline: cut | ADG-KLD: hold | CADG-KLD: cut | CADG-JSD: cut
Baseline: fly | ADG-KLD: carry | CADG-KLD: hold | CADG-JSD: carry
Figure 5.4: Qualitative results on test images. Green box: human. Red box: object. Blue box: union box of object and human with a margin. Green text indicates correct predictions and red text indicates wrong ones. Images of the first two columns are from the HICO-DET dataset, and images of the last column are from the UnRel dataset.

5.4.2 Analysis

Per-class evaluations. We evaluate the PredCls R@1 for each class in Table 5.5. Among all the predicates, we choose representative verbs including both high-frequency and low-frequency ones, where the frequency reflects the number of positive instances in training. For each predicate, we report the number of training instances and evaluate the R@1 metric. We observe from the table that our proposed models improve significantly over the baseline for high-frequency predicates and obtain comparable results for low-frequency predicates.
We also evaluate the mean R@1, i.e., the average of the per-class R@1, where we still obtain an improvement of around 10%-26% over the baseline.

Predicate         | # instances | Baseline | ADG-KLD        | CADG-KLD       | CADG-JSD
hold              | 14956       | 43.50    | 61.64          | 70.19          | 72.92
ride              | 13967       | 27.32    | 88.98          | 66.84          | 70.94
sit on            | 11051       | 22.08    | 32.93          | 16.73          | 19.10
carry             | 4526        | 1.82     | 12.87          | 2.48           | 3.96
watch             | 1553        | 7.89     | 11.84          | 14.47          | 15.79
walk              | 673         | 27.27    | 1.30           | 22.08          | 35.06
feed              | 555         | 6.15     | 1.54           | 6.15           | 7.69
cut               | 367         | 5.71     | 8.57           | 11.43          | 8.57
push, exit, etc.  | <250        | 0.00     | 0.00           | 0.00           | 0.00
overall           | -           | 32.18    | 48.68 (+51.3%) | 41.66 (+29.5%) | 43.47 (+35.1%)
mean              | -           | 5.96     | 6.53 (+9.6%)   | 7.52 (+26.2%)  | 6.70 (+12.4%)
Table 5.5: Per-class evaluations of PredCls R@1 on the HICO-DET test set. The second column indicates the number of instances of each predicate in the training set. We show the baseline and our proposed models for each action.

(Columns per split: PredCls R@1, PredDet R@5.)
Method           | trainval        | testval         | test
Baseline (HSp)   | 37.25   48.05   | 28.31   51.04   | 28.35   50.68
Baseline (full)  | 41.87   51.17   | 32.03   52.96   | 32.18 (+13.5%)  51.92
ADG-KLD (HSp)    | 41.34   51.09   | 43.74   59.68   | 44.94   59.34
ADG-KLD (full)   | 40.08   49.72   | 46.78   61.05   | 48.68 (+8.3%)   60.98
CADG-KLD (HSp)   | 36.74   47.55   | 25.57   49.06   | 26.33   48.24
CADG-KLD (full)  | 39.92   49.07   | 40.33   55.04   | 41.66 (+58.2%)  54.88
CADG-JSD (HSp)   | 37.65   48.74   | 26.13   49.93   | 22.25   49.30
CADG-JSD (full)  | 40.15   50.12   | 42.29   56.10   | 43.47 (+95.4%)  56.08
Table 5.6: Ablation study on the HICO-DET dataset. For each model, we show its inference results using the full model and using only the HSp branches. The relative gain of the union-box branch of each model is also given.

Ablation study. In Table 5.6, we evaluate the contribution of the union-box branch of each full model to the results. HSp inference uses only the human-box branch (H) and the spatial branch (Sp), while the full model makes use of all three branches for prediction. Compared with the baseline model, which gains 13.5% on the test set from its union-box branch, ADG-KLD shows that the unconditional domain generalization method cannot obtain very good features on the union-box branch, and its adversarial training instead indirectly optimizes the human-box branch and the spatial branch. On the other hand, the union-box branch of CADG-KLD and CADG-JSD improves the model significantly, by around 60%-100% on the test set. This indicates that the union-box branch of these conditional domain generalization models learns much better features than the unconditional one, which is also the reason we go beyond the unconditional model even though it shows good overall performance.

Figure 5.5: Grad-CAM visualization. First row: visualization of ride from the union-box features. Second row: visualization of hold from the backbone features before the ROI Align module (we only keep the visualization inside the union box). Green box: human. Red box: object. Blue box: union box of object and human with a margin. (a) input images. (b) baseline. (c) ADG-KLD. (d) CADG-KLD. (e) CADG-JSD.

Grad-CAM visualization. Fig. 5.5 shows the visualization of intermediate features extracted from the input image, following Grad-CAM [77], which takes a weighted average of the feature map, weighted by the gradient of the ground-truth class score. The first row visualizes the feature map from the union-box branch and the second row visualizes the features before the ROI Align module.
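The weighting just described can be written in a few lines. The sketch below assumes PyTorch, a convolutional feature map that is still attached to the autograd graph, and a scalar score for the ground-truth predicate; it is a generic Grad-CAM computation, not the exact visualization code used for Fig. 5.5.

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_map, class_score):
    """Grad-CAM saliency map for a single image.

    feature_map: tensor of shape (C, H, W), kept in the autograd graph.
    class_score: scalar score of the ground-truth predicate computed
                 from these activations.
    """
    grads, = torch.autograd.grad(class_score, feature_map, retain_graph=True)
    weights = grads.mean(dim=(1, 2))                  # global-average the gradients
    cam = (weights[:, None, None] * feature_map).sum(dim=0)
    cam = F.relu(cam)                                 # keep positive evidence only
    return cam / (cam.max() + 1e-8)                   # normalize to [0, 1]
```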
We note that the baseline features attend to the wrong region, while CADG-KLD and CADG-JSD pay more attention to the region that contains the possible interaction between the human and the object. The unconditional ADG-KLD gives a reversed saliency map compared with CADG-KLD and CADG-JSD, which implies that ADG-KLD cannot learn good features on the union-box branch; this matches what we observe in the ablation study. On the other hand, as the second row shows, while the baseline feature focuses more on the eyes, all our models attend more to the interaction region between the hand and the fork, which demonstrates the effectiveness of the proposed domain generalization approaches.

To sum up, our proposed ADG framework obtains uniformly significant improvements over the baselines. Further analysis shows that, while the unconditional ADG indirectly optimizes the human and spatial branches, the conditional ADG methods improve the union-box features directly.

5.4.3 More Experiment Results

Model selection. We select the models and tune the hyper-parameters on the validation set. The experiment results are listed in Table 5.7. While DeepC drops 1.2% on the validation set, our proposed methods obtain improvements of around 10% there, and achieve more significant gains on the test set, as illustrated in the main paper.

(Columns per split: PredCls R@1, PredCls R@5, PredDet R@5, PredDet R@10; last column: PredCls R@1 on the validation set.)
Method      | trainval                              | testval                                | test                                   | val
Frequency   | 43.04   95.25   54.23   69.58         | 0.00    1.93    0.25    2.49           | 0.00    0.15    0.12    2.43           | 25.51
Baseline    | 41.87   91.09   51.17   66.92         | 32.03   75.95   52.96   68.88          | 32.18   76.73   51.92   67.45          | 37.86
DeepC [52]  | 40.58 (-3.1%)   90.41   50.21   65.85 | 32.82 (+2.5%)   76.92   53.89   69.54  | 32.80 (+1.9%)   77.62   52.33   67.95  | 37.42 (-1.2%)
ADG-KLD     | 40.08 (-4.3%)   89.85   49.72   65.94 | 46.78 (+46.1%)  80.27   61.05   74.48  | 48.68 (+51.3%)  81.61   60.98   74.34  | 42.81 (+13.1%)
CADG-KLD    | 39.92 (-4.7%)   88.09   49.07   64.66 | 40.33 (+25.9%)  75.85   55.04   69.43  | 41.66 (+29.5%)  76.89   54.88   68.96  | 40.09 (+6.9%)
CADG-JSD    | 40.15 (-4.1%)   88.48   50.12   65.38 | 42.29 (+32.0%)  76.60   56.10   69.68  | 43.47 (+35.1%)  77.55   56.08   69.26  | 41.03 (+8.4%)
Table 5.7: Performance on the new split of the HICO-DET dataset. Compared with Table 5.3, we add a last column with the metric on val, which shows our model selection criterion.

Experiments on different insertion positions of the adversarial branch. In the main paper, we introduce our proposed framework with the adversarial branch inserted before the last classification layer of the union-box branch. There may, however, be multiple positions at which to add the adversarial branch, and here we examine another possible position: we attach the adversarial branch directly before the two FC layers and the classification layer, and make the network structure of the adversarial branch symmetric with the main branch, i.e., also containing two FC layers and one classification layer. From the experiment results in Table 5.8, we note that with this design our proposed models ADG-KLD and CADG-JSD obtain only comparable performance to DeepC [52]. Although CADG-KLD has better performance on the test set, it drops 11.0% on the trainval set, which is a big sacrifice on the common triplets and does not give a good trade-off between common and novel triplets. Therefore, the proposed domain generalization models train better when the adversarial branch is inserted before the last classification layer.
(Columns per split: PredCls R@1, PredCls R@5, PredDet R@5, PredDet R@10; last column: PredCls R@1 on the validation set.)
Method      | trainval                              | testval                                | test                                   | val
Frequency   | 43.04   95.25   54.23   69.58         | 0.00    1.93    0.25    2.49           | 0.00    0.15    0.12    2.43           | 25.51
Baseline    | 41.87   91.09   51.17   66.92         | 32.03   75.95   52.96   68.88          | 32.18   76.73   51.92   67.45          | 37.86
DeepC [52]  | 40.58 (-3.1%)   90.41   50.21   65.85 | 32.82 (+2.5%)   76.92   53.89   69.54  | 32.80 (+1.9%)   77.62   52.33   67.95  | 37.42 (-1.2%)
ADG-KLD     | 40.39 (-3.5%)   89.43   50.02   65.81 | 32.18 (+0.5%)   75.47   52.42   68.51  | 32.60 (+1.3%)   76.14   51.87   66.86  | 37.05 (-2.1%)
CADG-KLD    | 37.25 (-11.0%)  86.44   46.53   62.52 | 37.62 (+17.5%)  77.49   55.04   71.89  | 37.93 (+17.9%)  78.73   54.49   70.34  | 37.40 (-1.2%)
CADG-JSD    | 41.34 (-1.1%)   90.05   50.37   66.10 | 32.57 (+1.7%)   76.73   53.65   69.31  | 33.45 (+3.9%)   77.62   53.43   68.29  | 37.77 (-0.2%)
Table 5.8: Performance of the early insertion of the adversarial branch on the HICO-DET dataset.

5.4.4 Proofs and Additional Derivations

In this section, we provide proofs for our algorithms, showing that they indeed minimize the divergences claimed in Section 5.3.2. We also provide additional derivations showing how the population loss simplifies to the corresponding minibatch loss used in stochastic gradient descent (SGD).

Theorem 1. Define the pooled feature distribution P(f_u) as

    P(f_u) = Σ_{i∈[M]} w_i P(f_u | obj_i),    Σ_{i∈[M]} w_i = 1,    (5.16)

where w_i is the relative importance of each domain. We introduce a discriminator D : f_u → [0, 1]^M that tries to classify the domain (object category) based on the union-box feature; mathematically, the discriminator tries to maximize the likelihood of the ground-truth domain:

    max_D  Σ_{i∈[M]} w_i E_{f∼P(f_u|obj_i)} [log D_i(f)].    (5.17)

Assuming infinite capacity of the discriminator D, the maximum of (5.17) is a weighted summation of KL divergences between distributions (up to a constant):

    KLD := Σ_{i∈[M]} w_i KL( P(f_u | obj_i) || P(f_u) ).    (5.18)

Proof. Let D* be an optimal discriminator, i.e.,

    D* = argmax_D Σ_{i∈[M]} w_i E_{f∼P(f_u|obj_i)} [log D_i(f)] = argmax_D Σ_{i∈[M]} w_i ∫_f P(f | obj_i) log D_i(f).

Since D is a multi-class classifier, this optimization has the implicit constraint Σ_{i∈[M]} D_i(f) = 1 for any f. Thanks to the infinite capacity of the discriminator D, we can maximize the value function pointwise (for each f) and obtain a closed-form solution:

    D*_i(f) = w_i P(f | obj_i) / Σ_{i'∈[M]} w_{i'} P(f | obj_{i'}) = w_i P(f | obj_i) / P(f).    (5.19)

The second equality is by definition (5.16). The first equality can be obtained by the method of Lagrange multipliers, i.e.,

    D* = argmax_D Σ_{i∈[M]} w_i ∫_f P(f | obj_i) log D_i(f) + λ ( Σ_{i∈[M]} D_i(f) − 1 ).

Setting the derivative of this expression w.r.t. D_i(f) to zero gives D_i(f) = w_i P(f | obj_i) / λ, and setting the derivative w.r.t. λ to zero gives λ = Σ_{i∈[M]} w_i P(f | obj_i). This proves the optimal solution D* in (5.19). Plugging (5.19) into (5.17), the maximum of (5.17) is

    Σ_{i∈[M]} w_i E_{f∼P(f_u|obj_i)} [ log ( w_i P(f | obj_i) / P(f) ) ] = Σ_{i∈[M]} [ w_i log w_i + w_i KL( P(f_u | obj_i) || P(f_u) ) ] = KLD + Σ_{i∈[M]} w_i log w_i.

Therefore, the maximum of (5.17) is the KLD plus a constant.

From population loss to minibatch loss. The empirical version of (5.17) is

    max_D  Σ_{i∈[M]} (w_i / N_i) Σ_{j=1}^{N_i} log D_i( F(x_{j|i}) ),    (5.20)

where x_{j|i} denotes the j-th sample in domain i. In practice, we optimize this with SGD: for each sample x (with domain obj(x)), we train D with the minibatch loss

    max_D  (N w_i / N_i) log D_{obj(x)}( F(x) ) = log D_{obj(x)}( F(x) ),    (5.21)

where w_i = N_i / N (the recommended weight) is used in the last equality.
Other kinds of weighting can also be applied. Finally, the feature extractor is updated by min_F log D_{obj(x)}(F(x)), as depicted in (5.6) in Section 5.3.2.

Theorem 2. Define the class-specific pooled feature distribution P(f_u | pred_k) as

    P(f_u | pred_k) = Σ_{i∈[M]} w_i^{(k)} P(f_u | obj_i, pred_k),    (5.22)

where w_i^{(k)} is the relative importance of each domain. We introduce discriminators specific to each class, i.e., D(f, pred_k) ∈ [0, 1]^M, which try to classify the domain (object category) based on the union-box feature and the predicate class; mathematically, the discriminator tries to maximize the likelihood of the ground-truth domain:

    max_D  Σ_{k∈[K]} β^{(k)} Σ_{i∈[M]} w_i^{(k)} E_{f∼P(f_u|obj_i,pred_k)} [log D_i(f, pred_k)].    (5.23)

Assuming infinite capacity of the discriminator D, the maximum of (5.23) is a weighted summation of KL divergences between distributions (up to a constant):

    CKLD := Σ_{k∈[K]} β^{(k)} Σ_{i∈[M]} w_i^{(k)} KL( P(f_u | obj_i, pred_k) || P(f_u | pred_k) ).    (5.24)

The proof is nearly the same as the proof of Theorem 1, so we omit it here.

From population loss to minibatch loss. The empirical version of (5.23) is

    max_D  Σ_{k∈[K]} Σ_{i∈[M]} ( β^{(k)} w_i^{(k)} / N_i^{(k)} ) Σ_{j=1}^{N_i^{(k)}} log D_i( f(x_{j|i,k}), pred_k ),    (5.25)

where x_{j|i,k} denotes the j-th sample in domain i with label k. In this paper, we propose to use β^{(k)} = N^{(k)} / N and w_i^{(k)} = N_i^{(k)} / N^{(k)}, so we have

    (1/N) max_D  Σ_{k∈[K]} Σ_{i∈[M]} Σ_{j=1}^{N_i^{(k)}} log D_i( f(x_{j|i,k}), pred_k ).    (5.26)

In practice, we optimize this with SGD: for each sample x (with label pred(x) and domain obj(x)), we add the regularization max_D log D_{obj(x)}( f(x), pred(x) ), as depicted in (5.12) in Section 5.3.2.

Theorem 3. Define the class-specific pooled feature distribution P(f_u | pred_k) as

    P(f_u | pred_k) = Σ_{i∈[M]} w_i^{(k)} P(f_u | obj_i, pred_k),    (5.27)

where w_i^{(k)} is the relative importance of each domain. We introduce discriminators specific to each class, i.e., D(f, obj_i, pred_k) ∈ [0, 1], which try to tell whether a feature f comes from the domain-conditioned distribution P(f_u | obj_i, pred_k) or from the pooled distribution P(f_u | pred_k). Mathematically, the discriminator (with binary output) tries to maximize the following objective:

    max_D  Σ_{k∈[K]} β^{(k)} Σ_{i∈[M]} w_i^{(k)} { E_{f∼P(f_u|obj_i,pred_k)} [log D(f, obj_i, pred_k)] + E_{f∼P(f_u|pred_k)} [log(1 − D(f, obj_i, pred_k))] }.    (5.28)

Assuming infinite capacity of the discriminator D, the maximum of (5.28) is a weighted summation of Jensen-Shannon divergences (JSD) between distributions (up to a constant):

    CJSD := Σ_{k∈[K]} β^{(k)} Σ_{i∈[M]} w_i^{(k)} JSD( P(f_u | obj_i, pred_k) || P(f_u | pred_k) ).    (5.29)

Proof. From the standard GAN argument (see, e.g., [28]), assuming infinite capacity of the discriminator D, the maximum of (5.28) is

    Σ_{k∈[K]} β^{(k)} Σ_{i∈[M]} w_i^{(k)} [ −log 4 + 2 JSD( P(f_u | obj_i, pred_k) || P(f_u | pred_k) ) ] = 2 CJSD − log 4 Σ_{k∈[K]} β^{(k)}.

From population loss to minibatch loss. The empirical version of (5.28) is

    max_D  Σ_{k∈[K]} Σ_{i∈[M]} [ ( β^{(k)} w_i^{(k)} / N_i^{(k)} ) Σ_{j=1}^{N_i^{(k)}} log D( f(x_{j|i,k}), obj_i, pred_k ) + ( β^{(k)} w_i^{(k)} / N^{(k)} ) Σ_{j=1}^{N^{(k)}} log( 1 − D( f(x_{j|k}), obj_i, pred_k ) ) ],    (5.30)

where x_{j|i,k} denotes the j-th sample in domain i with label k, and x_{j|k} denotes the j-th sample with label k.
In this paper, we propose to use β^{(k)} = N^{(k)} / N and w_i^{(k)} = N_i^{(k)} / N^{(k)}, so we have

    (1/N) max_D  Σ_{k∈[K]} Σ_{i∈[M]} [ Σ_{j=1}^{N_i^{(k)}} log D( f(x_{j|i,k}), obj_i, pred_k ) + ( N_i^{(k)} / N^{(k)} ) Σ_{j=1}^{N^{(k)}} log( 1 − D( f(x_{j|k}), obj_i, pred_k ) ) ].    (5.31)

In practice, we optimize this with SGD: for each sample x (with label pred(x) and domain obj(x)), we add the regularization

    max_D  log D( f(x), obj(x), pred(x) ) + Σ_{i∈[M]} ( N_i^{pred(x)} / N^{pred(x)} ) log( 1 − D( f(x), obj_i, pred(x) ) ),

as depicted in (5.15) in Section 5.3.2.

5.4.5 Network Architectures

We present the network architectures of the baseline model and our proposed methods in Table 5.10, where the models are built from the basic blocks defined in Table 5.9. As our models take the ResNet-50 architecture as the backbone, we only list the network structures after the backbone module. Although our models have three branches, the proposed methods are applied only to the union-box branch, so we list the detailed structure of the union-box branch for each model. As shown in Table 5.10, the baseline model consists of the blocks ROI Align, P0, P1, P2, and P3. The adversarial branch of ADG-KLD takes fc4 as input and predicts the object categories. The conditional adversarial branch of CADG-KLD takes both fc4 and the predicate embedding emb1 as inputs and predicts the object categories. The conditional adversarial branch of CADG-JSD takes fc4, the predicate embedding emb1, and the object embedding emb2 as inputs, and makes a binary prediction.

Name               | Operations / Layers
Conv 1x1, stride=1 | Convolution 1x1 - ReLU, stride=1.
Linear             | Fully connected layer.
FC                 | Fully connected layer - ReLU.
Multiply           | Multiplication of two tensors (with broadcasting).
Mean               | Take the average of the tensor along the channel dimension.
Avg Pool           | Take the average of the tensor along the spatial dimensions.
Concat             | Concatenate input tensors along the channel dimension.
ResBlock           | Standard ResNet blocks.
ROI Align          | Pooling feature maps for an ROI.
Embedding          | Word embedding of the given words.
F_backbone         | Features extracted from the backbone module with height N_h and width N_w.
Table 5.9: The basic blocks for architecture design. ("-" connects two consecutive layers.)
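The adversarial heads described above are small modules on top of fc4; a sketch of the three variants, with dimensions following Table 5.10 (everything else, such as how the embeddings are produced, is an assumption), is given below.

```python
import torch
import torch.nn as nn

class ADGKLDHead(nn.Module):
    """Unconditional domain classifier: fc4 (1024) -> 80 object logits."""
    def __init__(self, feat_dim=1024, num_objects=80):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_objects)

    def forward(self, fc4):
        return self.fc(fc4)

class CADGKLDHead(nn.Module):
    """Conditional domain classifier: [fc4, predicate embedding] -> 80 logits."""
    def __init__(self, feat_dim=1024, emb_dim=50, num_objects=80):
        super().__init__()
        self.fc = nn.Linear(feat_dim + emb_dim, num_objects)

    def forward(self, fc4, pred_emb):
        return self.fc(torch.cat([fc4, pred_emb], dim=1))

class CADGJSDHead(nn.Module):
    """Binary discriminator: [fc4, predicate emb, object emb] -> 1 logit."""
    def __init__(self, feat_dim=1024, emb_dim=50):
        super().__init__()
        self.fc = nn.Linear(feat_dim + 2 * emb_dim, 1)

    def forward(self, fc4, pred_emb, obj_emb):
        return self.fc(torch.cat([fc4, pred_emb, obj_emb], dim=1))
```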
Stage      | Name                | Input Tensors                         | Output Tensors
ROI Align  | ROI Align           | N_h x N_w x 1024 (F_backbone)         | 14 x 14 x 1024
P0         | ResBlock            | 14 x 14 x 1024                        | 7 x 7 x 1024
           | ResBlock            | 7 x 7 x 1024                          | 7 x 7 x 1024
           | ResBlock            | 7 x 7 x 1024                          | 7 x 7 x 1024
           | Avg Pool            | 7 x 7 x 1024                          | 1024 (fc1)
P1         | FC                  | 1024 (fc1)                            | 512 (fc2)
           | Conv 1x1, stride=1  | N_h x N_w x 1024 (F_backbone)         | N_h x N_w x 512 (conv1)
           | Conv 1x1, stride=1  | N_h x N_w x 1024 (F_backbone)         | N_h x N_w x 512 (conv2)
           | Multiply            | 512 (fc2), N_h x N_w x 512 (conv1)    | N_h x N_w x 512
           | Mean                | N_h x N_w x 512                       | N_h x N_w
           | Multiply            | N_h x N_w, N_h x N_w x 512 (conv2)    | N_h x N_w x 512
           | Conv 1x1, stride=1  | N_h x N_w x 512                       | N_h x N_w x 1024
           | Avg Pool            | N_h x N_w x 1024                      | 1024 (fc3)
           | Concat              | 1024 (fc1), 1024 (fc3)                | 2048
P2         | FC                  | 2048                                  | 1024
           | FC                  | 1024                                  | 1024 (fc4)
P3         | Linear              | 1024                                  | 117
Emb        | Embedding           | 1 (predicate label)                   | 50 (emb1)
           | Embedding           | 1 (object label)                      | 50 (emb2)
ADG-KLD    | Linear              | 1024 (fc4)                            | 80
CADG-KLD   | Concat              | 1024 (fc4), 50 (emb1)                 | 1074
           | Linear              | 1074                                  | 80
CADG-JSD   | Concat              | 1024 (fc4), 50 (emb1), 50 (emb2)      | 1124
           | Linear              | 1124                                  | 1
Table 5.10: The structures of the baseline model and of our proposed methods ADG-KLD, CADG-KLD, and CADG-JSD.

Implementation details. For the input images, in both training and inference, we first resize them so that the larger side is 640. During training, we clip gradients to 1 during back-propagation and use early stopping; the model is trained on 4 NVIDIA P100 GPUs, and we adjust the learning schedule according to the linear scaling rule of [29]. Because we use a sigmoid loss for each predicate category and the positive and negative samples are imbalanced, we set the ratio of positive to negative samples to 1:6. We performed a hyper-parameter search on this ratio and found that the performance does not change much for different ratios, but is much better than without setting the ratio at all. After extensive hyper-parameter search, we employ an SGD optimizer with learning rate 0.001, momentum 0.9, and weight decay 0.0005. For the proposed domain generalization framework, we alternately train the main branch and the adversarial branch, one step each.

Baseline: we decay the learning rate every 100,000 iterations with gamma = 0.96.
ADG-KLD: we decay the learning rate at the 2,000,000-th and 2,800,000-th iterations with gamma = 0.1 and set the weight of the DG regularization term to 1.
CADG-KLD: we decay the learning rate at the 2,000,000-th and 2,800,000-th iterations with gamma = 0.1 and set the weight of the DG regularization term to 100.
CADG-JSD: we decay the learning rate every 100,000 iterations with gamma = 0.96 and set the weight of the DG regularization term to 100. The learning rate for the adversarial branch is 0.01.

5.4.6 More Visualization Results

We show more Grad-CAM visualizations of intermediate features in Fig. 5.6 and Fig. 5.7. Fig. 5.6 visualizes the feature maps from the union-box branch, where the baseline attends to the wrong region while our proposed CADG-KLD and CADG-JSD focus on the region related to the human-object interaction. ADG-KLD again cannot give correct saliency maps on the union-box features, consistent with the main paper. Fig. 5.7 shows the feature maps of the backbone features, where our proposed methods show more meaningful saliency maps than the baseline; furthermore, CADG-KLD and CADG-JSD attend to the possible interactions even better than ADG-KLD. We can therefore safely draw the same conclusion as in the main paper: our proposed methods learn semantically rich predicate features with strong generalization ability, and the conditional domain generalization approaches yield better saliency on the union-box features.

Figure 5.6: Grad-CAM visualization of the predicates from the union-box features. Green box: human. Red box: object. Blue box: union box of object and human with a margin. (a) input images. (b) baseline. (c) ADG-KLD. (d) CADG-KLD. (e) CADG-JSD. Zoom in for better view.

5.5 Conclusion

In this paper, we have focused on the problem of novel human-object interaction detection, where the triplet combinations in the test set are unseen during training. To evaluate performance in this setting, we created a new split based on the HICO-DET dataset
5.4.6 More Visualization Results

We show more Grad-CAM visualizations of intermediate features in Fig. 5.6 and Fig. 5.7. Fig. 5.6 visualizes the feature maps from the union-box branch, where the baseline attends to the wrong region while our proposed CADG-KLD and CADG-JSD focus on the region related to the human-object interaction. ADG-KLD also fails to produce correct saliency maps on the union-box features, consistent with the observations earlier in this chapter. Fig. 5.7 shows the feature maps from the backbone features, where our proposed methods produce more meaningful saliency maps than the baseline. Furthermore, CADG-KLD and CADG-JSD attend to the possible interactions even better than ADG-KLD. We can therefore safely draw the same conclusion: our proposed methods learn semantically rich predicate features with strong generalization ability, and the conditional domain generalization approaches yield better saliency on the union-box features.

[Figure 5.6: Grad-CAM visualization of the predicates from the union-box features. Green box: human. Red box: object. Blue box: union box of object and human with a margin. (a) input images; (b) baseline; (c) ADG-KLD; (d) CADG-KLD; (e) CADG-JSD. Zoom in for a better view.]

[Figure 5.7: Grad-CAM visualization of the predicates from the backbone features before the ROI Align module (only the visualization inside the union box is kept for simplicity). Green box: human. Red box: object. Blue box: union box of object and human with a margin. (a) input images; (b) baseline; (c) ADG-KLD; (d) CADG-KLD; (e) CADG-JSD. Zoom in for a better view.]

5.5 Conclusion

In this chapter, we focused on the problem of novel human-object interaction detection, where the triplet combinations in the test set are unseen during training. To evaluate performance in this setting, we created a new split based on the HICO-DET dataset and an additional evaluation set from the UnRel dataset. We proposed a unified adversarial domain generalization framework to tackle this problem. Experiments showed that our framework achieves significant improvements over the baseline models, by up to 50% on the new split of the HICO-DET test set and up to 125% on the UnRel dataset. Our work shows that adversarial domain generalization is a promising way to overcome the combinatorial prediction problem in real-world applications.

Chapter 6
Conclusion and Future Work

6.1 Summary of the Research

In this thesis, we focused on leveraging structured information for visual generation and understanding. On the visual generation side, we introduced the topic of image inpainting and proposed two methods that approach the problem from different points of view: a contextual-based approach and a segmentation-guided approach. On the visual understanding side, we applied domain generalization techniques to extract human-object interaction representations for high-level image understanding.

To generate image inpainting results with high-resolution textures, we proposed a learning-based approach that produces visually coherent completions for high-resolution images with missing regions. To overcome the difficulty of directly learning the distribution of high-dimensional image data, we divided the task into inference and translation as two separate steps and modeled each step with a deep neural network. We also used simple heuristics to guide the propagation of local textures from the boundary to the hole. We showed that, with these techniques, inpainting reduces to the problem of learning two image-feature translation functions in a much smaller space and is hence easier to train. We evaluated our method on several public datasets and showed that it generates results of better visual quality than previous state-of-the-art methods.

While deep generative models enable an efficient end-to-end framework for image inpainting, existing methods based on generative models do not exploit segmentation information to constrain object shapes, which usually leads to blurry results along boundaries. To tackle this problem, we proposed to introduce semantic segmentation information, which disentangles the inter-class difference and intra-class variation for image inpainting. This leads to much clearer recovered boundaries between semantically different regions and better texture within semantically consistent segments. Our model factorizes the image inpainting process into segmentation prediction (SP-Net) and segmentation guidance (SG-Net) as two steps: it first predicts the segmentation labels in the missing area and then generates segmentation-guided inpainting results. Experiments on multiple public datasets showed that our approach outperforms existing methods in image inpainting quality, especially along the boundaries between different objects, and that the interactive segmentation guidance enables multi-modal predictions for image inpainting.
For visual understanding, we focused on the human-object interaction (HOI) detection problem, which aims to recognize the relationships between humans and objects in images. To tackle the long-tail data distribution of HOI detection datasets, we formulated the novel HOI detection task as a domain generalization problem. We first created a new benchmark for the novel HOI detection task based on the images and annotations of existing HOI detection datasets, in which the triplet categories in the training, validation, and test sets do not overlap. We then proposed a unified domain generalization framework and instantiated both conditional and unconditional approaches to improve the generalization ability of the models. Experiments on multiple datasets showed that our proposed methods achieve uniformly significant improvements on all metrics.

6.2 Future Research

In future work, we are interested in addressing the scene graph generation problem, a more general case of human-object interaction in which relationship triplets between arbitrary objects are detected in images. Another direction is to apply the structured information extracted from images to downstream tasks, such as video-based action recognition, text-based image retrieval, and image captioning.

6.2.1 Scene Graph Generation

Scene graph generation aims at generating a scene graph from a given image and is a more general task than human-object interaction detection. A scene graph is a graph representing the structure of an image, where nodes represent objects and edges represent relationships. The relationships include semantic, possessive, and spatial relationships; possessive and spatial relationships are relatively easy to predict, while semantic relationships are the most difficult. Another difference from human-object interaction is the graph constraint: since a single scene graph is extracted from the image, each edge corresponds to only one relationship, even though there can be multiple correct relations on that edge. Previously, we worked on human-object interaction, which mainly involves semantic relationships between humans and objects. We plan to approach scene graph generation by building on our experience with human-object interaction detection.

Besides generalizing our model to a larger number of relationships between objects, we are interested in solving the confusion problems in scene graph generation. These mainly fall into two types: entity instance confusion and proximal relationship ambiguity. Entity instance confusion refers to the case where the subject or object is related to one of many instances of the same class, but the model fails to distinguish the target instance from the others. Proximal relationship ambiguity refers to the case where the image contains multiple subject-object pairs interacting in the same way, and the model fails to identify the correct pairing. Our preliminary experiments suggest that different types of contrastive losses can improve performance and that corresponding sampling methods can help with relationship proposals during training.

Furthermore, we want to re-formulate the scene graph generation task so that the prediction on each edge covers all three kinds of relationships, i.e., semantic, possessive, and spatial relationships.
Existing works do not distinguish these relationships explicitly, and there are no good metrics for evaluating the generated graphs. For the ground-truth graph, there can also be different structures and annotations for the same image, and even humans cannot tell which one is better. As a result, there is no reliable way to decide which model is better, and at present even the most state-of-the-art models cannot beat the frequency-prior prediction. Therefore, a cleaner and more standard formulation of this problem is of great importance. We will formulate scene graph generation as prediction over all three kinds of relationships and compare the generated outputs by relationship type. As we can expect, spatial and possessive relationships can reach higher performance, so we will focus more on semantic relationships based on our experience with human-object interaction, especially by applying the adversarial domain generalization approaches to address the long-tail distribution problem.

6.2.2 Downstream Tasks

The reason we care about visual relationship detection and scene graph generation is that we want better image understanding and better image representations for downstream tasks, such as video-based action recognition, text-based image retrieval, and image captioning.

Video-based action recognition has been studied for a long time; existing methods usually extract features directly from video frames to make predictions, without paying much attention to the detailed image structure. With our experience in image-level visual relationship detection, we can extract structured representations from image frames that are better suited for video-based prediction. For text-based image retrieval, matching sentences and images is a challenging problem. Based on our experience with scene graph generation, scene graphs can serve as an intermediate representation for better matching: we can first translate the sentences into scene graphs, then extract scene graphs from the images, and compare the scene graphs directly. This avoids the multi-modal matching problem and improves the matching capability. Similarly, image captioning requires a deep understanding of the relationships in images; if we can extract good scene graphs from images, translating a scene graph into a caption becomes much more straightforward. These downstream tasks are all meaningful and direct applications of scene graph generation and visual relationship detection.

Bibliography

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24–1, 2009.
[3] C. Barnes, E. Shechtman, D. B. Goldman, and A. Finkelstein. The generalized patchmatch correspondence algorithm. In European Conference on Computer Vision, pages 29–43. Springer, 2010.
[4] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[5] Y.-W. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng. Learning to detect human-object interactions. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 381–389. IEEE, 2018.
[6] Y.-W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng. Hico: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1017–1025, 2015.
[7] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv:1802.02611, 2018.
[8] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.
[9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
[10] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3076–3086, 2017.
[11] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
[12] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199. Springer, 2014.
[13] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.
[14] M. Elad and P. Milanfar. Style transfer via texture synthesis. IEEE Transactions on Image Processing, 26(5):2338–2351, 2017.
[15] S. E. Fahlman, G. E. Hinton, and T. J. Sejnowski. Massively parallel architectures for AI: Netl, thistle, and boltzmann machines. In National Conference on Artificial Intelligence, AAAI, 1983.
[16] B. J. Frey, G. E. Hinton, and P. Dayan. Does the wake-sleep algorithm produce good density estimators? In Advances in neural information processing systems, pages 661–667, 1996.
[17] O. Frigo, N. Sabater, J. Delon, and P. Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 553–561, 2016.
[18] C. Gao, Y. Zou, and J.-B. Huang. ican: Instance-centric attention network for human-object interaction detection. arXiv preprint arXiv:1808.10437, 2018.
[19] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[20] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016.
[21] M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang. Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE transactions on pattern analysis and machine intelligence, 39(7):1414–1430, 2016.
[22] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision, pages 2551–2559, 2015.
[23] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[24] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
[25] G. Gkioxari, R. Girshick, P. Dollár, and K. He. Detecting and recognizing human-object interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8359–8367, 2018.
[26] I. Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[27] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
[28] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[29] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[30] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
[31] S. Gupta and J. Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
[32] T. Gupta, A. Schwing, and D. Hoiem. No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In Proceedings of the IEEE International Conference on Computer Vision, pages 9677–9685, 2019.
[33] J. Hays and A. A. Efros. Scene completion using millions of photographs. In ACM Transactions on Graphics (TOG), volume 26, page 4. ACM, 2007.
[34] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[35] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and Locally Consistent Image Completion. ACM Transactions on Graphics (Proc. of SIGGRAPH 2017), 36(4):107:1–107:14, 2017.
[36] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
[37] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[38] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[39] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[40] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[41] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba. Undoing the damage of dataset bias. In European Conference on Computer Vision, pages 158–171. Springer, 2012.
[42] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[43] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[44] N. Komodakis. Image completion using global optimization. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 442–452. IEEE, 2006.
[45] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pages 109–117, 2011.
[46] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[47] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In European Conference on Computer Vision, pages 679–692. Springer, 2012.
[48] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[49] C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
[50] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. Mmd gan: Towards deeper understanding of moment matching network. arXiv preprint arXiv:1705.08584, 2017.
[51] Y. Li, S. Liu, J. Yang, and M.-H. Yang. Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 6, 2017.
[52] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018.
[53] Y.-L. Li, S. Zhou, X. Huang, L. Xu, Z. Ma, H.-S. Fang, Y. Wang, and C. Lu. Transferable interactiveness knowledge for human-object interaction detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3585–3594, 2019.
[54] Y.-L. Li, S. Zhou, X. Huang, L. Xu, Z. Ma, H.-S. Fang, Y.-F. Wang, and C. Lu. Transferable interactiveness prior for human-object interaction detection. arXiv preprint arXiv:1811.08264, 2018.
[55] K. Liang, Y. Guo, H. Chang, and X. Chen. Visual relationship detection with deep structural ranking. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[56] X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 848–857, 2017.
[57] C. Liang-Chieh, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference on Learning Representations, 2015.
[58] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[59] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[60] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.
[61] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.
[62] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18, 2013.
[63] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
[64] S. Nowozin, B. Cseke, and R. Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[65] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[66] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[67] P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions on graphics (TOG), 22(3):313–318, 2003.
[68] J. Peyre, J. Sivic, I. Laptev, and C. Schmid. Weakly-supervised learning of visual relations. In Proceedings of the IEEE International Conference on Computer Vision, pages 5179–5188, 2017.
[69] S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 401–417, 2018.
[70] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[71] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[72] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[73] A. Romero, N. Ballas, S. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Imagenet classification with deep convolutional neural networks. In International Conference on Learning Representations, 2015.
[74] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[75] M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR 2011, pages 1745–1752. IEEE, 2011.
[76] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[77] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[78] L. Shen, S. Yeung, J. Hoffman, G. Mori, and L. Fei-Fei. Scaling human-object interaction recognition through zero-shot learning. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1568–1576. IEEE, 2018.
[79] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[80] B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang. Exemplar-based face parsing. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3484–3491. IEEE, 2013.
[81] K. Tang, Y. Niu, J. Huang, J. Shi, and H. Zhang. Unbiased scene graph generation from biased training. arXiv preprint arXiv:2002.11949, 2020.
[82] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
[83] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
[84] B. Wan, D. Zhou, Y. Liu, R. Li, and X. He. Pose-aware multi-level feature network for human object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9469–9478, 2019.
[85] T. Wang, R. M. Anwer, M. H. Khan, F. S. Khan, Y. Pang, L. Shao, and J. Laaksonen. Deep contextual attention for human-object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 5694–5702, 2019.
[86] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585, 2017.
[87] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[88] M. Wilczkowiak, G. J. Brostow, B. Tordoff, and R. Cipolla. Hole filling through photomontage. In BMVC 2005-Proceedings of the British Machine Vision Conference 2005, 2005.
[89] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5419, 2017.
[90] Z. Xu, W. Li, L. Niu, and D. Xu. Exploiting low-rank structure from latent domains for domain generalization. In European Conference on Computer Vision, pages 628–643. Springer, 2014.
[91] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[92] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
[93] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph r-cnn for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–685, 2018.
[94] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint arXiv:1607.07539, 2016.
[95] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493, 2017.
[96] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5831–5840, 2018.
[97] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5532–5540, 2017.
[98] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
[99] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro. Graphical contrastive losses for scene graph generation. arXiv preprint arXiv:1903.02728, 2019.
[100] P. Zhang, Q. Liu, D. Zhou, T. Xu, and X. He. On the discrimination-generalization tradeoff in gans. CoRR, abs/1711.02771, 2017.
[101] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924, 2018.
[102] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
[103] S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.
[104] P. Zhou and M. Chi. Relation parsing neural network for human-object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 843–851, 2019.
[105] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
Abstract

In recent years, deep learning has had a significant impact on the computer vision community. Deep learning models can now recognize thousands of image categories using increasingly deep architectures. In complex scenes, deep neural models can localize objects across many categories and perform instance segmentation. More recently, a number of scene graph generation and visual relationship detection methods have been developed for high-level image understanding, in order to extract more fine-grained and structured representations from images. As a dual problem of visual understanding, visual generation has also attracted much attention in recent years thanks to deep learning techniques. Deep generative models can generate realistic images with high resolution and high quality, and can be further applied to image translation across different domains and environments. The world around us is highly structured, and so are images: they contain not only multiple foreground object categories but also varied backgrounds, in both natural scenes and artificial scenarios. In this thesis, we mainly leverage structural information for visual generation and understanding in the following tasks: 1) leveraging the semantic structure to generate realistic images