VISUAL KNOWLEDGE TRANSFER WITH DEEP LEARNING TECHNIQUES

by

Junting Zhang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2020

Copyright 2020 Junting Zhang

Acknowledgments

How time flies! My Ph.D. journey has finally come to its very end. When I look back on the road I have traveled, it is full of colorful memories and mixed feelings. I had sleepless nights desperately searching for a solution, debugging and rewriting papers over and over again, but I also recall the exciting moments when I completed successful experiments and finally had my papers accepted. I do not know how far my achievements will extend the boundary of human knowledge, but I do know that I have become a better person with independent thinking and perseverance. The memorable experiences of my Ph.D. study will be priceless assets to carry for the rest of my life.

I would like to thank my advisor, Prof. C.-C. Jay Kuo. He opened the door to the fascinating topic of computer vision for me in 2014, later encouraged me to apply for the Ph.D. program, and guided me through the dark tunnel before dawn. He is a role model to me in many aspects, such as his enthusiasm for research, and he is an extremely organized and dependable person.

I would like to thank my committee members for my qualifying exam and defense: Prof. Keith Jenkins, Prof. Ulrich Neumann, Prof. Alexander Sawchuk, and Prof. Panayiotis Georgiou, who provided much valuable feedback and many suggestions for my research.

I would like to thank my fellow labmates Yuewei Na and Chen Liang, and the colleagues I met during my summer internships: Dr. Jie Zhang, Dr. Dawei Li, Dr. Shalini Ghosh, Dr. Serafettin Tasci, and Dr. Xiaolong Wang from Samsung Research America, and Fan Zhang, George Sung, and Dr. Chuo-ling Chang from Google. They all played an irreplaceable role in our research collaborations.

I would like to thank all of my friends, especially Dan Luo, Jingjing Qu, Siqi Chen, Dr. Heming Zhang, and Dr. Siyang Li. They are like anchors for me to hold onto, at both high tide and low. We had many meaningful conversations and fun times over the past few years.

Finally, I would like to thank my family for their everlasting love and support. I had the courage to set sail on a long voyage because I knew they had built a warm haven, waiting for me to come home.

Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Significance of the Research
  1.2 Contributions of the Research
  1.3 Organization of the Dissertation
2 Knowledge Transfer from Old to New Tasks via Incremental Learning
  2.1 Introduction
  2.2 Related Work
  2.3 Method
    2.3.1 DMC for Image Classification
    2.3.2 DMC for Object Detection
  2.4 Experiments
    2.4.1 Evaluation Protocols
    2.4.2 Incremental Learning of Image Classifiers
    2.4.3 Incremental Learning of Object Detectors
  2.5 Conclusion
3 Knowledge Transfer across Visual Domains via Domain Adaptation
  3.1 Introduction
  3.2 Related Work
  3.3 Proposed Domain Adaptation Network
    3.3.1 Fully Convolutional Tri-branch Network Architecture
    3.3.2 Encoding Explicit Spatial Information
    3.3.3 Assigning Pseudo Labels to Target Images
    3.3.4 Loss Function
    3.3.5 Training Procedure
  3.4 Experiments
  3.5 Conclusion
4 Knowledge Transfer across Applications — from Segmentation to Detection
  4.1 Introduction
  4.2 Related Work
  4.3 Methodology
    4.3.1 Text Attention Map (TAM) Generation
    4.3.2 Text-attentional Multibox Text Detection
    4.3.3 Overall Network Architecture
  4.4 Experiments
    4.4.1 Datasets and Implementation Details
    4.4.2 Experimental Results and Discussion
  4.5 Conclusion
5 Knowledge Transfer in Spatiotemporal Space — Multi-object Video Object Segmentation with Region-based Reasoning
  5.1 Introduction
  5.2 Related Work
  5.3 Method
    5.3.1 Overview
    5.3.2 Pixel-to-node Projection: Differentiable Feature Aggregation
    5.3.3 Spatiotemporal Graph Reasoning
    5.3.4 Query Node Prediction
    5.3.5 Node-to-pixel Reprojection and Segmentation Mask Prediction
    5.3.6 Training Procedure
    5.3.7 Inference Procedure
  5.4 Experiments
    5.4.1 Datasets and Evaluation Metrics
    5.4.2 Implementation Details
    5.4.3 Experimental Results and Analyses
  5.5 Conclusion
6 Conclusion and Future Work
  6.1 Summary of the Research
  6.2 Future Directions
    6.2.1 Incremental Learning of Object Detectors
    6.2.2 Domain Adaptation via Unsupervised Disentangled Representation Learning
A Appendix for Chapter 2
  A.1 Detailed Experimental Results of DMC for Object Detection
  A.2 Effect of the Amount of Auxiliary Data for Object Detection
  A.3 Implementation and Training Details
  A.4 Preliminary Experiments of Adding Exemplars
  A.5 Preliminary Experiments of Consolidating Models with Common Classes
Bibliography

List of Tables

2.1 Average incremental accuracies on CIFAR-100 when $g = 20$, for varying distance metrics used in $\mathcal{L}_{dd}$.
2.2 Accuracies on CUB-200 when incrementally learning with $g = 100$ classes per group.
2.3 VOC 2007 test per-class average precision (%) when incrementally learning 10 + 10 classes.
2.4 VOC 2007 test per-class average precision (%) when incrementally learning 19 + 1 classes.
2.5 VOC 2007 test mAP (%) when using different network architectures for the old and new model, respectively. Classes 1-19 are the old classes, and class 20 (tvmonitor) is the new one.
3.1 Adaptation from GTA to Cityscapes. All numbers are measured in %. The last three rows show our results before adaptation, and after one and two rounds of curriculum learning using the proposed FCTN, respectively.
4.1 Evaluations on the COCO-Text-Legible validation set (in %).
4.2 Evaluations on the COCO-Text-Full validation set (in %).
5.1 Quantitative results on the DAVIS 2017 validation set.
5.2 Effect of incorporating low-level LabXY features for the region clustering.
5.3 Per-sequence mIoU comparison for the model with and without the GNN.
A.1 Varying the amount of auxiliary data in the consolidation stage. VOC 2007 test mAP (%) is shown, where classes 1-10 are the old classes and classes 11-20 are the new ones.
A.2 Values of the hyperparameter used when incrementally learning $g$ classes at a time on the iCIFAR-100 benchmark.
A.3 Average incremental accuracies when adding exemplars of old classes. iCaRL [88] with the same memory budget is compared. Results of incremental learning with $g = 5, 10, 20, 50$ classes at a time on the iCIFAR-100 benchmark are reported.
A.4 Consolidation of two models with 10 common classes (classes 46-55).
A.5 VOC 2007 test per-class average precision (%) when incrementally learning 19 + 1 classes.

List of Figures

2.1 Overview of the proposed incremental learning algorithm. Given a model pre-trained on existing classes and labeled data of new classes, we first train a new model for recognizing instances of new classes; we then combine the old model and the new model using the novel deep model consolidation (DMC) module, which leverages external unlabeled auxiliary data. The final model suffers less from forgetting the old classes and achieves high recognition accuracy for the new classes.
2.2 Incremental learning with groups of $g = 5, 10, 20, 50$ classes at a time on the iCIFAR-100 benchmark.
2.3 Performance variation on the first task when trained incrementally over 20 tasks ($g = 5$) on iCIFAR-100.
2.4 Confusion matrices of methods on iCIFAR-100 when incrementally learning 10 classes in a group. The entries are transformed by $\log(1 + x)$ for better visibility. Fig. 2.4(b), 2.4(c) and 2.4(d) are from [88]. (Best viewed in color.)
2.5 Varying the datasets of auxiliary data used in the consolidation stage on the iCIFAR-100 benchmark. Note that using MNIST leads to failure (2% accuracy), so we omit those plots.
2.6 Average incremental accuracy on iCIFAR-100 with $g = 10, 20, 50$ classes per group for different amounts of auxiliary data used in the consolidation stage. Dashed horizontal lines represent the performance of the previous state of the art, i.e., LwF.
3.1 An overview of the proposed fully convolutional tri-branch network (FCTN). It has one shared base network denoted by $F$, followed by three branches of the same architecture denoted by $F_1$, $F_2$ and $F_t$. Branches $F_1$ and $F_2$ assign pseudo labels to images in the unlabeled target domain, while branch $F_t$ is trained with supervision from images in the pseudo-labeled target domain.
3.2 Illustration of pseudo labels used in the 2-round curriculum learning in the GTA-to-Cityscapes DA experiments. The first row shows the input images. The second row shows the ground-truth segmentation masks. The third and fourth rows show the pseudo labels used in the first and second rounds of curriculum learning, respectively. In the visualization of pseudo labels, white pixels indicate unlabeled pixels. Best viewed in color.
3.3 Domain adaptation results on the Cityscapes val set. The third column shows segmentation results using the model trained solely on the GTA dataset, and the fourth column shows the segmentation results after two rounds of FCTN training (best viewed in color).
4.1 Illustration of the text attention gating process with the TAM.
4.2 The overall architecture. The detailed explanation can be found in Sec. 4.3.3.
4.3 Qualitative results on the COCO-Text validation set. The left and right columns show the detections without and with the text attention gating mechanism, respectively. The middle column shows the visualized text attention maps. (Best viewed in color.)
5.1 Overview of the proposed system.
5.2 Building a graph based on the spatiotemporal 3-D positions of the nodes. For a node $V_i$, as long as the 3-D Euclidean distance to another node $V_j$ is within the predefined radius, we add an edge between them. The added edge can be a spatial edge (shown in magenta) or a temporal edge (shown in cyan).
5.3 GNN architecture.
5.4 Qualitative results on the india sequence.
5.5 Qualitative results on the gold-fish sequence.
5.6 Qualitative results on the breakdance sequence.
5.7 Qualitative results on the judo sequence.
5.8 Visualization of the extracted regions (superpixels) and the corresponding segmentation results. In each example, the first row shows the results using deep features only, and the second row shows the results using deep features and LabXY features. Note that these results are from the model without the GNN.
6.1 Digit classification datasets MNIST [58] and SVHN [78].
A.1 Confusion matrices of exemplar-based methods on iCIFAR-100 when incrementally learning 10 classes in a group. The element in the $i$-th row and $j$-th column indicates the percentage of samples with ground-truth label $i$ that are classified into class $j$. Fig. A.1(b) is from [88]. (Best viewed in color.)

Abstract

The classical machine learning paradigm rarely exploits the dependencies and relations among different tasks and domains. In the deep learning era, manually creating a labeled dataset for each task becomes prohibitively expensive. In this dissertation, we aim to develop effective techniques to retain, accumulate, and transfer knowledge gained from past learning experiences to solve new problems in new scenarios. Specifically, we consider four different types of knowledge transfer scenarios in computer vision applications: 1) incremental learning — we transfer knowledge from old task(s) to a new task as training data become available gradually over time; 2) domain adaptation — we obtain knowledge from labeled training data in one domain, and then transfer and apply it in another domain; 3) knowledge transfer across applications to improve the robustness of the target application; 4) knowledge transfer in the spatiotemporal domain — we perform pixel-wise tracking of multiple objects in a video sequence given the annotations for the first frame.

Existing incremental learning (IL) approaches tend to produce a model that is biased towards either the old classes or the new classes, unless they have the help of exemplars of the old data. To address this issue, we propose a class-incremental learning paradigm called Deep Model Consolidation (DMC), which works well even when the original training data are not available. The idea is to train a model on the new data, and then combine the two individual models trained on data of two distinct sets of classes (old classes and new classes) via a novel dual distillation training objective. The two existing models are consolidated by exploiting publicly available unlabeled auxiliary data. This overcomes the potential difficulties due to the unavailability of the original training data. Compared to state-of-the-art techniques, DMC demonstrates significantly better performance on the CIFAR-100 image classification and PASCAL VOC 2007 object detection benchmarks in the IL setting.

In the second work, a domain adaptation method for urban scene segmentation is proposed. We develop a fully convolutional tri-branch network, where two branches assign pseudo labels to images in the unlabeled target domain while the third branch is trained with supervision based on images in the pseudo-labeled target domain. The re-labeling and re-training processes alternate. With this design, the tri-branch network learns target-specific discriminative representations progressively and, as a result, the cross-domain capability of the segmenter improves.
We evaluate the proposed network in large-scale domain adaptation experiments using both synthetic (GTA) and real (Cityscapes) images. Our solution achieves state-of-the-art performance and outperforms previous methods by a significant margin.

Scene text detection is a critical prerequisite for many fascinating applications, and we choose it as an example application to explore the possibility of transferring knowledge across applications. Existing methods detect text either using local information only or by casting detection as a semantic segmentation problem; they tend to produce a large number of false alarms or cannot separate individual words accurately. In this work, we present a segmentation-aided text detection solution that predicts word-level bounding boxes using an end-to-end trainable deep convolutional neural network. It exploits the holistic view of a segmentation network to generate a text attention map (TAM); it then uses the TAM to refine the convolutional features for the MultiBox detector through a multiplicative gating process. We conduct experiments on the large-scale and challenging COCO-Text dataset and demonstrate that the proposed method outperforms state-of-the-art methods significantly.

We also study knowledge transfer in the spatiotemporal domain for video understanding. Semi-supervised video object segmentation is a pixel-wise tracking problem whose goal is to propagate the mask annotations from the first frame throughout the full video. We propose to aggregate pixel features into region features via soft superpixel clustering, and then build a spatiotemporal graph over the regions extracted from adjacent frames. A graph neural network is designed to reason in this 3-D space and refine the features associated with each node via message passing. The segmentation masks are estimated by predicting the node labels for the query frame and re-projecting the node labels back to pixel space. The proposed system is end-to-end trainable and involves only one forward pass of the network at test time. Our method achieves accuracy comparable to the state of the art on the DAVIS 2017 benchmark with much less computation and memory consumption.

Chapter 1
Introduction

1.1 Significance of the Research

In recent years, we have witnessed tremendous success in training deep neural networks to learn surprisingly accurate mappings from input signals to outputs, whether images, language, or genetic sequences, from large amounts of labeled data. A fundamental limitation of the current dominant learning paradigm is that it learns in isolation [17]: given a carefully constructed training dataset, it runs a machine learning algorithm on the dataset to produce a model that is then used in its specific intended application. It makes no attempt to exploit the dependencies and relations among different tasks and domains, nor does it offer effective techniques to retain, accumulate, and transfer knowledge gained from past learning experiences to solve new problems in new scenarios.

The learning environments are typically static and strictly constrained. For supervised learning, labeling of training data is often done manually, which is prohibitively expensive in terms of labor and time, especially when the required labels are fine-grained or require knowledge from a domain expert.
Considering that the real world is too complex, with infinitely many possible tasks, it is almost impossible to label a sufficient number of examples for every possible task or application. Furthermore, the world changes constantly: the appearance of instances or the label of the same instance may vary over time, so labeling needs to be done continually, which is a daunting task for humans.

Humans, on the contrary, learn in a different way, in which the transfer of knowledge plays a critical role. We accumulate and maintain the knowledge learned from previous tasks and use it seamlessly in learning new tasks and solving new problems. Whenever we encounter a new situation or problem, we are good at discovering the relationship between it and our past experience, recognizing aspects that we have seen before in other contexts, adapting our reusable past knowledge to deal with the new situation, and learning only the newly encountered aspects. True intelligence goes far beyond simply memorizing and repeating material we have learned: we digest old knowledge and experiences and apply them to new concepts, and we use both new and old knowledge to solve problems we may never have encountered before. To build artificial intelligence (AI) that learns like humans, the idea of knowledge transfer allows us to deal with the complex and ever-changing world.

In this dissertation, we consider four different types of knowledge transfer scenarios that could facilitate the learning process of machines: 1) incremental learning — we transfer knowledge from old task(s) to a new task as training data become available gradually over time; 2) domain adaptation — we aim to conduct the same task in different visual domains, where labeled training data in the source domain is much easier to acquire; 3) knowledge transfer across applications to improve the robustness of the target application; 4) knowledge transfer in the spatiotemporal domain — we perform pixel-wise tracking of multiple objects in a video sequence given the annotations for the first frame. Specifically, we consider these scenarios in computer vision applications, but note that many ideas in this dissertation may generalize to a broader range of machine learning applications.

Incremental learning. Incremental learning (IL) aims to learn from a sequence of different tasks, retain the knowledge learned so far, and use that knowledge to help future task learning. The different tasks here may refer to different object categories the system needs to recognize. An ideal intelligent visual object recognition system should incrementally learn about new classes when training data for them becomes available. Take vision-based self-driving cars as an illustrative example, where the road environment is highly dynamic and complicated. For the perception system to detect and recognize all kinds of objects on the road, understand the environment, and issue valid driving instructions that ensure driving safety and efficiency, it is unrealistic to train the system on a static set of labeled training data. It is very desirable for the system to perform IL to adapt to the customized local environment, identifying unseen objects and learning to recognize them in the process.

The main challenge of IL in the deep learning era is the notorious catastrophic forgetting [75, 34] effect — an abrupt degradation of performance on the original set of classes when the training objective is adapted to a newly added set of classes.
The major focus has been on incrementally learning each new task in the same deep neural network without causing the network to forget the models learned for past tasks.

Domain adaptation. Domain adaptation (DA) is a particular case of transfer learning (TL) that leverages labeled data in one or more related source domains to learn a model for unseen or unlabeled data in a target domain [22]. It is generally assumed that the task is the same, i.e., class labels are shared between domains. The key distinction between DA and the standard supervised learning problem is that the source domains are assumed to be related to the target domain, but not identical. In other words, there is a domain gap between the distributions of the labeled training and test sets. We refer to such distribution differences as domain shifts in visual applications, which are the main focus of this dissertation. Domain shift is mainly caused by changes in lighting conditions, noise level, camera shooting configurations, and background, which result in variation in the visual appearance of objects, such as pose and scale. Domain shift is a common challenge in real-life applications.

As a special area of DA, learning from simulations has drawn more and more attention. Modern computer graphics techniques enable a simulator to generate an unlimited number of high-quality realistic frames with accurate pixel-wise labels for free. If we can train a model using synthetic images and perform adaptation to transfer the knowledge to the real world, it would be highly economical. Take vision-based self-driving cars as an example again: with the success of convolutional neural networks (CNNs), numerous successful fully supervised semantic segmentation solutions have been proposed in recent years [72, 14]. To achieve satisfactory performance, these methods demand a sufficiently large dataset with pixel-level labels for training. However, creating such large datasets is prohibitively expensive, as it requires human annotators to trace segment boundaries accurately. For example, annotation and quality control take more than 1.5 hours for a single image of the Cityscapes dataset [21], the most popular traffic scene semantic segmentation dataset. Furthermore, it is difficult to collect traffic scene images with sufficient variation in lighting conditions, weather, cities, and driving routes. Learning from simulations, in this case, is very attractive, as we already have excellent urban scene simulation engines developed by the game industry and many readily available simulation datasets [92, 94]. In this setting, the semantic feature spaces of the source and target domains are the same, but the marginal probability distributions of simulation and reality are different, i.e., objects in the simulation and the target domain may look different, although this difference diminishes as simulations become more realistic.

The lack of effective training supervisory signals in the target domain is a key issue in unsupervised DA. The major challenges of DA for classification originate from shifts in lighting, appearance, and pose of the object; more challenges emerge specifically for semantic segmentation DA, such as variation in scene layout, object scale, and class distribution in the individual images.

Text detection in natural images. In this dissertation, we use scene text detection as an example application to explore how knowledge transfer across applications may help improve a model's robustness.
The ability to read text in a natural scene is highly desirable in many interesting and practical applications, such as assistance for visually impaired people, environment understanding and automatic navigation for self-driving cars and smart robots, and visual translation. Thus, research on scene text detection and recognition has drawn increasing attention in the computer vision community. Scene text detection is a crucial prerequisite for numerous subsequent visual understanding and recognition tasks. The goal of text detection is to generate word-level bounding boxes from natural images; sentence-level or letter-level bounding boxes are not the desired output.

The challenges of text detection originate from variability in languages, fonts, scales, and layout, as well as complex backgrounds and perspective distortion. Unlike generic objects with a clear boundary and rigid shape, text is a special type of structured pattern with human-readable meaning, and it is easily confused with other types of structured patterns in the world, such as the patterns of brick walls and leaves, or artificial artistic patterns. Solving text detection solely from an object detection perspective, which aims to predict word-level bounding box locations directly, usually cannot produce satisfactory results without heavy post-processing to remove false alarms. Alternatively, the text detection problem can be formulated as a semantic segmentation problem, predicting the probability of each pixel belonging to the bounding box of a text block. This type of approach is robust to complex backgrounds because it adopts a more global view. Nevertheless, such methods often perform so poorly at separating individual words that they fail to produce the desired output. It is highly desirable to exploit knowledge transfer across applications and combine the strengths of the detection-based and segmentation-based approaches to build a robust and efficient text detection system.

Multi-object Video Object Segmentation. The human vision system does not perceive the world through single images; instead, we see a stream of continuous frames all the time. Despite remarkable progress in recent years, video object segmentation (VOS) remains a challenging problem, especially when there are multiple target objects in the video. Most existing approaches still exhibit severe limitations in quality and efficiency for practical applications, e.g., real-time processing, or large-scale video post-production and editing in the visual effects industry. Semi-supervised VOS requires the mask annotation in the first frame only and aims to automatically infer masks for the rest of the video frames. It is fundamentally a label propagation problem in the spatiotemporal domain. Many methods have been proposed to segment one object at a time effectively, but few can segment multiple objects systematically in a single pass. Generic challenges of video analysis, such as deformation of object appearance over time, fast motion, and occlusion, apply here as well.
The additional challenges for efficient multi-object video segmentation include: 1) the number of objects is variable, so we cannot construct a convolutional neural network (CNN) with an output layer that has a predetermined number of output neurons; 2) instances of the same category are not easily separable in the embedding space; 3) target objects at test time may be unseen during the training phase. Making dense predictions in video is inherently a costly task. It is critical to devise an algorithm that is computation- and memory-efficient, so that the model can be trained and deployed even in a resource-limited environment.

1.2 Contributions of the Research

In the first work, we study incremental learning with deep neural networks (DNNs), which usually suffer from catastrophic forgetting on old tasks when learning a new task. This is due to the loss of access to the training data of the old tasks. To overcome this issue, we propose a novel paradigm for class-incremental learning called deep model consolidation (DMC), which first trains an individual new model for the new classes using labeled data, and then combines the new model with the old model using unlabeled auxiliary data via a novel, carefully designed training objective function.

- The external unlabeled data can be obtained at negligible cost. This paradigm offers an illuminating perspective for incremental learning, which overcomes the craving for old data by finding a cheap substitute that does not need to be stored.
- We propose a new training objective function to combine two deep models into one single compact model, which eliminates the intrinsic bias towards either the old classes or the new classes in the final model. Note that the two models can have different architectures, and they can be trained on data of distinct sets of classes. Furthermore, our method generalizes to many other use cases: it can be directly applied to combine any two arbitrary pre-trained models downloaded from the Internet for easy deployment (i.e., only one model needs to be deployed instead of two), without access to the original training data of either model.
- We not only study the image classification task but also develop an approach to extend the proposed paradigm to object detection, further demonstrating the effectiveness and broad applicability of the proposed paradigm. Notably, our method can be used to incrementally train modern one-stage object detectors [68, 70, 89], to which the existing methods are not applicable. One-stage object detectors are nearly as accurate as two-stage detectors [32, 90, 67] but run much faster than the latter.
- We conduct extensive experiments and demonstrate the substantial performance improvement of our method over existing approaches on large-scale image classification and object detection benchmarks in the incremental learning setting.

In the second work, we study unsupervised domain adaptation (DA) for semantic segmentation, which is a practical and challenging task in many real-life applications. The labeled source domain consists of synthetic images generated by a simulator, while the unlabeled target domain consists of real images captured by a dash camera. Unlike many other DA approaches in the literature, we do not aim to align the intermediate features of the two domains, which implicitly assumes the existence of a single good mapping from the domain-invariant feature space to the correct segmentation mask.
We extend the tri-training idea [132, 97] in semi-supervised learning and DA for classification tasks to DA for the semantic segmentation task.

- We propose a tri-branch fully convolutional network architecture for segmentation DA, in which asymmetric tri-training [132, 97] can be applied. Two labeler networks are designed to generate pseudo segmentation ground truth for unlabeled target samples, and the third network branch learns from these pseudo-labeled target samples.
- We introduce an alternating re-labeling and re-training mechanism to improve the DA performance in a curriculum learning fashion. A voting-based criterion is designed to select reliable training samples in each labeling-training cycle.
- Urban traffic scenes are highly structured, with a common spatial layout. For example, the sky is unlikely to appear at the bottom and the road is unlikely to appear at the top. Thus we encode spatial coordinate maps and consider them as additional network input, so that the spatial layout prior is incorporated into the training.
- The superior performance of the proposed method is demonstrated by substantial improvements over our baseline and previous methods when evaluating the proposed method on large-scale synthesized-to-real urban scene datasets.

In the third work, we study text detection in natural images. Existing methods detect text either using local information only or by casting it as a semantic segmentation problem. They tend to produce a large number of false alarms or cannot separate individual words accurately. We start from the idea of transferring knowledge across applications and combine the strengths of both approaches to build a robust and efficient text detector. We integrate the segmenter and the detector in a holistic way.

- We propose a novel segmentation-aided text detection solution, where the segmentation and detection networks are integrated into one single network. As a result, it combines a segmenter's robustness and a detector's deftness.
- We propose a deep neural network that has two modules, a segmentation module and a detection module, which share the same feature extractor. The segmentation module aims to produce a "Text Attention Map" (TAM), a dense heat map indicating the probability of text presence at each pixel. Upon obtaining the TAM, we refine the feature maps for the detection module through a multiplicative gating process.
- We design the training objective function such that we learn TAM generation and MultiBox text detection jointly. By injecting the semantic localization information into the input convolutional features of the detector, false positives can be effectively suppressed, and the quality of the finally detected word-level bounding boxes is significantly improved. Unlike other approaches that integrate segmentation and detection networks [86], we do not process each cropped text block image patch individually, which is more efficient in both computation and memory.
- We demonstrate the effectiveness of the proposed method by showing state-of-the-art performance on the COCO-Text dataset [110], which is today's largest and most challenging benchmark dataset.

In the fourth work, we study semi-supervised video object segmentation, which is a fundamental problem for video understanding and editing. Most of the existing methods focus on single-object segmentation and have to process each object independently when multiple target objects exist.
We propose a method that systematically handles multiple objects with favorable efficiency in terms of computation and memory consumption. Our key contributions in this work include:

- We propose a novel deep neural network that is end-to-end trainable and can segment multiple objects in a single network forward pass.
- We reduce the computational burden of prediction by aggregating pixel features into soft superpixel regions, reducing the number of primitives to be processed.
- We model the high-level dependencies among the regions extracted from consecutive frames with a spatiotemporal graph and perform region-level reasoning, which effectively exploits both spatial and temporal correlation of object motion in a video sequence.
- We perform the region-level reasoning by introducing a graph neural network (GNN) that updates the node features to facilitate node matching and label prediction for the query frame. This is the first work that introduces GNNs to the area of video object segmentation.
- We validate the proposed method by showing competitive segmentation accuracy on the DAVIS 2017 benchmark [85] with significantly less computation and memory consumption than existing methods.

1.3 Organization of the Dissertation

The rest of this dissertation is organized as follows. In Chapter 2, we present the deep model consolidation (DMC) framework designed for incremental learning. In Chapter 3, we elaborate on the proposed fully convolutional tri-branch network (FCTN) that tackles domain adaptation for semantic segmentation. We then describe the novel segmentation-aided text detection framework in detail in Chapter 4. In Chapter 5, we study knowledge transfer in the spatiotemporal domain by presenting a novel algorithm for multi-object video object segmentation with region-based reasoning. Finally, Chapter 6 gives the concluding remarks of the dissertation and points out future directions.

Chapter 2
Knowledge Transfer from Old to New Tasks via Incremental Learning

2.1 Introduction

Humans can accumulate and maintain the knowledge learned from previous tasks and use it seamlessly in learning new tasks and solving new problems — learning new concepts over time is a core characteristic of human learning. Therefore, it is desirable to have a computer vision system that can learn incrementally about new classes when training data for them becomes available, as this is a necessary step towards the ultimate goal of building truly intelligent machines that learn like humans.

Despite the recent success of deep learning in computer vision for a broad range of tasks [56, 42, 90, 68, 72, 14], the classical training paradigm of deep models is ill-equipped for incremental learning (IL). Most deep neural networks can only be trained in batch mode, in which the complete dataset is given and all classes are known prior to training. However, the real world is dynamic, and new categories of interest can emerge over time. Re-training a model from scratch whenever a new class is encountered is prohibitively expensive due to the huge data storage requirements and computational cost. Directly fine-tuning the existing model on only the data of new classes using stochastic gradient descent (SGD) optimization is not a better approach either, as this may lead to the notorious catastrophic forgetting [75, 34] effect, which refers to severe performance degradation on old tasks.
Figure 2.1: Overview of the proposed incremental learning algorithm. Given a model pre-trained on existing classes and labeled data of new classes, we first train a new model for recognizing instances of the new classes; we then combine the old model and the new model using the novel deep model consolidation (DMC) module, which leverages external unlabeled auxiliary data. The final model suffers less from forgetting the old classes and achieves high recognition accuracy for the new classes.

We consider a realistic, albeit strict and challenging, setting of class-incremental learning, where the system must satisfy the following constraints: 1) the original training data for old classes are no longer accessible when learning new classes — this could be due to a variety of reasons, e.g., legacy data may be unrecorded, proprietary, too large to store, or simply too difficult to use in training the model for a new task; 2) the system at any time should provide a competitive multi-class classifier for the classes observed so far; 3) the model size should remain approximately the same after learning new classes.

Several attempts have been made to enable IL for DNNs, but none of them satisfies all of these constraints. Although some recent works [88, 48, 10, 36] that rely on the storage of partial old data have made impressive progress, this direction is arguably not memory-efficient, and it may violate practical constraints such as copyright or privacy issues. The performance of the existing methods that do not store any past data [64, 101, 51, 50, 99] remains unsatisfactory. Some of these methods rely on underdeveloped incremental generative models [50, 99], while others [64, 101, 54] fine-tune the old model on the new data, with certain regularization techniques to prevent forgetting. We argue that the ineffectiveness of these regularization-based methods is mainly due to the asymmetric information between old classes and new classes in the fine-tuning step. New classes have an explicit and strong supervisory signal from the available labeled data, whereas the information about old classes is only implicitly given through knowledge distillation [39] on the new data. Image samples from the new data usually under-represent or even severely deviate from the true distribution of the old data, which further aggravates the information asymmetry. Nevertheless, if we over-regularize the model, it fails to adapt to the new task, which is referred to as intransigence [12] in IL. As a result, these methods have an intrinsic bias towards either the old classes or the new classes in the final model, and it is extremely difficult to find a sweet spot, considering that we do not have a validation dataset for the old classes.

To eliminate such intrinsic bias caused by the information asymmetry or over-regularization in training, we propose a dual distillation training objective function, such that a student model can learn from two teacher models simultaneously. To overcome the difficulty introduced by the loss of access to legacy data, we present a novel method that leverages publicly available unlabeled auxiliary data, in which abundant transferable representations are mined to facilitate IL.
As depicted in Fig. 2.1, we propose a novel paradigm for class-incremental learning called deep model consolidation (DMC), which first trains an individual new model for the new classes using labeled data, and then combines the new model with the old model using unlabeled auxiliary data via dual distillation training.

Crucially, we do not assume that the auxiliary data share the class labels or the generative distribution of the target data. Using such unlabeled data incurs no additional dataset construction and maintenance cost, since it can be crawled from the web effortlessly when needed and discarded once the IL of the new classes is complete. Furthermore, the symmetric role of the two teacher models in DMC brings a valuable extra benefit in terms of generality: our method can be directly applied to combine any two arbitrary pre-trained models that can be downloaded from the Internet for easy deployment (i.e., only one model needs to be deployed instead of two), without access to the original training data of either model.

To summarize, our main contributions include:

- A novel paradigm for incremental learning which exploits external unlabeled data that can be obtained at negligible cost. This is an illuminating perspective for IL, which overcomes the craving for old data by finding a cheap substitute that does not need to be stored.
- A new training objective function to combine two deep models into one single compact model and promote symmetric knowledge transfer. The two models can have different architectures, and they can be trained on data of distinct sets of classes.
- An approach to extend the proposed paradigm to incrementally train modern one-stage object detectors, to which the existing methods are not applicable.
- Extensive experiments that demonstrate the substantial performance improvement of our method over existing approaches on large-scale image classification and object detection benchmarks in the IL setting.

The rest of the chapter is organized as follows. We briefly review related work in Section 2.2. The details of the proposed approach, including training and inference procedures, are presented in Section 2.3. Section 2.4 presents experimental results and analyses on several benchmark datasets. Last, we conclude the chapter and highlight possible directions for future work in Section 2.5.

2.2 Related Work

In the late 1980s, McCloskey et al. [75] first identified the catastrophic forgetting effect in early connectionist models, where the memory of old data is overwritten when a neural network is retrained with new data. Recently, researchers have been actively developing methods to alleviate this effect.

Regularization methods. Regularization methods enforce additional constraints on the weight updates, so that new concepts are learned while retaining prior memories. Goodfellow et al. empirically studied multi-layer perceptrons (MLPs) with different activation functions and found that dropout [104] could sometimes reduce forgetting. Li and Hoiem [64] proposed the Learning without Forgetting (LwF) method, which fine-tunes the network on the images of the new classes with a knowledge distillation [39] loss that encourages the output probabilities of the old classes for each image to stay close to the original network outputs. However, this strategy requires the distribution of new-class images to be similar to that of the old classes, which does not always hold in practice.
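For concreteness, the distillation term used in this line of work is typically a temperature-scaled cross-entropy between the old model's soft outputs (recorded before fine-tuning) and the current model's predictions for the old classes on the new-class images. The following is a minimal PyTorch-style sketch of such a loss; the function name and the temperature T are illustrative assumptions rather than the exact LwF implementation.

```python
import torch.nn.functional as F

def lwf_style_distillation(old_logits, cur_old_logits, T=2.0):
    """Temperature-scaled distillation on the old-class outputs.

    old_logits:     logits of the frozen old model on a batch of new-class images
    cur_old_logits: logits of the model being fine-tuned, restricted to the old classes
    T:              distillation temperature (hypothetical value)
    """
    soft_targets = F.softmax(old_logits / T, dim=1)       # soften the recorded outputs
    log_probs = F.log_softmax(cur_old_logits / T, dim=1)  # soften the current outputs
    # Cross-entropy between soft targets and current predictions, scaled by T^2
    return -(soft_targets * log_probs).sum(dim=1).mean() * (T ** 2)
```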
Instead, our method assigns two teacher models to one student network to guarantee a symmetric flow of information about the old classes and the new classes into the final model. EWC [54] was originally proposed for continual learning, which is a topically related problem. The idea of EWC is to constrain the weight parameters that are important to the old tasks to stay close to their old values while looking for a solution to a new task in the neighborhood of the old one. Some prior work [51, 101, 12] attempted to adopt the idea of EWC for the class-incremental learning task but achieved only mediocre performance.

Rehearsal methods. In rehearsal methods, past information is periodically replayed to the model to strengthen connections for memories it has already learned. This is usually done by interleaving data from earlier sessions with the current session data [93]. However, the storage of past data is not resource-efficient and may violate practical constraints such as copyright or privacy issues. In iCaRL [88] and its variants [48, 10], although a fixed-size memory is allocated to store exemplars, the number of exemplars per class shrinks drastically as the number of learned classes grows, and the performance deteriorates quickly.

Pseudorehearsal methods. These methods attempt to alleviate the memory cost of rehearsal methods by using generative models to generate pseudo patterns [93] that are combined with the current samples. However, this requires training a generative model in a class-incremental fashion, which is an even more challenging task, especially for the high-dimensional distribution of real images. Existing end-to-end trainable pseudorehearsal methods do not produce competitive results [99] unless supported by real exemplars [36]. Alternatively, FearNet [50] generates samples in a lower-dimensional feature space instead of the image space, but this also prevents task-specific discriminative features from being learned, as it relies on a fixed feature embedding.

Incremental learning of object detectors. Shmelkov et al. [101] adapted LwF for the object detection task. However, their framework can only be applied to object detectors in which proposals are computed externally, e.g., Fast R-CNN [32]. In our experiments on the object detection task, we show that our method is applicable to more efficient modern single-shot object detection architectures, e.g., RetinaNet [68].

Exploiting external data. In computer vision, the idea of employing external data to improve the performance of a target task has been explored in many contexts. Domain adaptation or inductive transfer learning [22] aims to transfer and reuse knowledge from labeled out-of-domain instances. Semi-supervised learning [135, 11] attempts to exploit the usefulness of unlabeled in-domain instances. Our work shares a similar spirit with self-taught learning [87]: we use unlabeled auxiliary data but do not require the auxiliary data to have the same class labels or generative distribution as the target data. Such unlabeled data is significantly easier to obtain than in typical semi-supervised or transfer learning settings.

2.3 Method

Let us first formally define the class-incremental learning setting. Given a labeled data stream of sample sets $X^1, X^2, \ldots$, where $X^y = \{x^y_1, \ldots, x^y_{n_y}\}$ denotes the samples of class $y \in \mathbb{N}^+$, we learn one class or group of classes at a time.
During each learning session, we only have the training data $\mathcal{D}_{new} = \{X^{s+1}, \ldots, X^t\}$ of the newly available classes $s+1, \ldots, t$, while the training data of the previously learned classes $\{X^1, \ldots, X^s\}$ are no longer accessible. However, we do have the model obtained in the previous session, which is an $s$-class classifier $f_{old}(x; \theta_{old})$. Our goal is to train a $t$-class classifier $f(x; \theta)$ without catastrophic forgetting on the old classes or significant underperformance on the new classes. We assume that all models are implemented as DNNs, where $x$ and $\theta$ denote the input and the parameters of the network, respectively.

We perform IL in two steps: the first step is to train a $(t-s)$-class classifier using the training data $\mathcal{D}_{new}$, which we refer to as the new model $f_{new}(x; \theta_{new})$; the second step is to consolidate the old model and the new model. The new-class learning step is a regular supervised learning problem and can be solved by standard back-propagation. The model consolidation step is the major contribution of our work, where we propose a method called Deep Model Consolidation (DMC) for image classification, which we further extend to another classical computer vision task, object detection.

2.3.1 DMC for Image Classification

We start by training a new CNN model $f_{new}$ on the new classes using the available training data $\mathcal{D}_{new}$ with the standard softmax cross-entropy loss. Once the new model is trained, we have two CNN models, each specialized in classifying either the old classes or the new classes. After that, the goal of the consolidation is to obtain a single compact model that can perform the tasks of both the old model and the new model simultaneously. Ideally, we have the following objective:

$$
f(x;\theta)[j] =
\begin{cases}
f_{old}(x;\theta_{old})[j], & 1 \le j \le s \\
f_{new}(x;\theta_{new})[j], & s < j \le t
\end{cases}
\quad \forall x \in \mathcal{I},
\qquad (2.1)
$$

where $j$ denotes the index of the classification score associated with the $j$-th class, and $\mathcal{I}$ denotes the joint distribution from which samples of classes $1, \ldots, t$ are drawn. We want the output of the consolidated model to approximate the combination of the network outputs of the old model and the new model. To achieve this, the network responses of the old model and the new model are employed as supervisory signals in the joint training of the consolidated model.

Knowledge distillation (KD) [39] is a popular technique to transfer knowledge from one network to another. Originally, KD was proposed to transfer knowledge from a cumbersome network to a lightweight network performing the same task, and no novel class was introduced. We generalize the basic idea of KD and propose a double distillation loss to enable class-incremental learning. Here, we define the logits as the inputs to the final softmax layer. We run a feed-forward pass of $f_{old}$ and $f_{new}$ for each training image, and collect the logits of the two models, $\hat{y}_{old} = [\hat{y}^1, \ldots, \hat{y}^s]$ and $\hat{y}_{new} = [\hat{y}^{s+1}, \ldots, \hat{y}^t]$, respectively, where the superscript is the class label associated with the neuron in the model. Then we minimize the difference between the logits produced by the consolidated model and the combination of logits generated by the two existing specialist models, according to some distance metric. We choose the $L_2$ loss [4] as the distance metric because it demonstrates stable and good performance; see Section 2.4.2 for a discussion.

Due to the absence of the legacy data, we cannot consolidate the two models using the old data, so some auxiliary data has to be used. If we assume that natural images lie on an ideal low-dimensional manifold, we can approximate the distribution of our target data by sampling from readily available unlabeled data from a similar domain.
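To make this consolidation step concrete, the following is a minimal PyTorch-style sketch of one training iteration on a mini-batch of unlabeled auxiliary images, assuming frozen teacher networks f_old and f_new and a randomly initialized student f_consolidated; the per-model mean subtraction corresponds to the bias calibration formalized in Eq. 2.4 below. All names are illustrative rather than taken from a released implementation.

```python
import torch
import torch.nn.functional as F

def consolidation_step(f_old, f_new, f_consolidated, optimizer, aux_images):
    """One DMC update on a mini-batch of unlabeled auxiliary images."""
    with torch.no_grad():                            # the two teachers stay frozen
        y_old = f_old(aux_images)                    # logits for the s old classes
        y_new = f_new(aux_images)                    # logits for the t - s new classes
        # Bias calibration: subtract each teacher's mean logit over its own classes
        y_old = y_old - y_old.mean(dim=1, keepdim=True)
        y_new = y_new - y_new.mean(dim=1, keepdim=True)
        target = torch.cat([y_old, y_new], dim=1)    # regression target over all t classes

    logits = f_consolidated(aux_images)              # student logits for all t classes
    loss = F.mse_loss(logits, target)                # L2 double distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```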
If we assume that natural images lie on an ideal low-dimensional manifold, we can approximate the distribution of our target data via sampling from readily available unlabeled data from a similar domain. Note that the auxiliary data do not have to be stored persistently; they can be crawled and fed in mini-batches on-the-fly in this stage, and discarded thereafter. Specifically, the training objective for consolidation is min 1 jUj X x i 2U L dd (y i ; y i ); (2.2) whereU denotes the unlabeled auxiliary training data, and the double distillation loss L dd is defined as: L dd (y; y) = 1 t t X j=1 y j y j 2 ; (2.3) in whichy j is the logit produced by the consolidated model for thej-th class, and y j = 8 > > < > > : ^ y j 1 s P s k=1 ^ y k ; 1js ^ y j 1 ts P t k=s+1 ^ y k ; s<jt (2.4) where ^ y is the concatenation of ^ y old and ^ y new . The regression target y is the concatenation of normalized logits of the two specialist models. We normalize ^ y by subtracting its mean over the class dimension (Eq. 2.4). This serves as a step of bias calibration for the two sets of classes. It unifies the scale of 20 logits produced by the two models but retains the relative magnitude among the classes so that the symmetric information flow can be enforced. Notably, to avoid the intrinsic bias toward either old or new classes, should not be initialized from old or new ; we should also avoid the usage of training data for the new classesD new in the model consolidation stage. 2.3.2 DMC for Object Detection We extend the IL approach given in Section 2.3.1 for modern one-stage object detectors, which are nearly as accurate as two-stage detectors but run much faster than the later ones. A single-stage object detector divides the input image into a fixed-resolution 2D grid (the resolution of the grid can be multi-level), where higher resolution means that the area corresponding to the image region (i.e., receptive field) of each cell in the grid is smaller. There are a set of bounding-box templates with fixed sizes and aspect ratios, called anchor boxes, which are associated with each spatial cell in the grid. Anchor boxes serve as references for the subsequent prediction. The class label and the bound- ing box location offset relative to the anchor boxes are predicted by the classification subnet and bounding boxes regression subnet, respectively, which are shared across all the feature pyramid levels [67]. In order to apply DMC to incrementally train an object detector, we have to con- solidate the classification subnet and bounding boxes regression subnet simultaneously. Similar to the image classification task, we instantiate a new detector whenever we have training dataD new for new object classes. After the new detector is properly trained, we then use the outputs of the two specialist models to supervise the training of the final model. Anchor boxes selection. In one-stage object detectors, a huge number of anchor boxes have to be used to achieve decent performance. For example, in RetinaNet [68],100k 21 anchor boxes are used for an image of resolution 800 600. Selecting a smaller number of anchor boxes speeds up forward-backward pass in training significantly. The naive approach of randomly sampling some anchor boxes doesn’t consider the fact that the ratio of positive anchor boxes and negative ones is highly imbalanced, and negative boxes that correspond to background carry little information for knowledge distillation. 
To efficiently and effectively distill the knowledge of the two teacher detectors in the DMC stage, we propose a novel anchor boxes selection method to selectively enforce the constraint for a small set of anchor boxes. For each image sampled from the auxiliary data, we first rank the anchor boxes by the objectness scores. The objectness score (os) for an anchor box is defined as: os, maxfp 1 ; ;p s ;p s+1 ; ;p t g; (2.5) where p 1 ; ;p s are classification probabilities produced by the old-class model, and p s+1 ; ;p t are from the new-class model. Intuitively, a high objectness score for a box implies a higher probability of containing a foreground object. The predicted classifi- cation probabilities of the old classes are produced by the old model and new classes by the new model. We use the subset of anchor boxes that have the highest objectness scores and ignore the others. DMC for classification subnet. Similar to the image classification case in Sec. 2.3.1, for each selected anchor box, we calculate the double distillation loss between the logits produced by the classification subnet of the consolidated modely and the normalized logits generated by the two existing specialist models y. The loss term of DMC for the classification subnetL cls (y; y) is identical to Eq. 2.3. DMC for bounding box regression subnet. The output of the bounding box regression subnet is a tuple of spatial offsetst = (t x ;t y ;t h ;t w ), which specifies a scale-invariant 22 translation and log-space height/width shift relative to an anchor box. For each anchor box selected, we need to set its regression target properly. If the class that has the highest predicted class probability is one of the old classes, we choose the old model’s output as the regression target, otherwise, the new model’s output is chosen. In this way, we encourage the predicted bounding box of the consolidated model to be closer to the predicted bounding box of the most probable object category. SmoothL 1 loss [32] is used to measure the closeness of the parameterized bounding box locations. The loss term of DMC for the bounding box regression subnet is as follows: L loc (t; ^ t) = X k=x;y;h;w smooth L 1 (t k ^ t k ); (2.6) in which ^ t = 8 > > < > > : ^ t old ; max 1js ^ y j > max s<jt ^ y j ^ t new ; otherwise ; (2.7) Overall training objective. The overall DMC objective function for the object detec- tion is defined as min 1 jUj X x i 2U L cls (y i ; y i ) +L loc (t i ; ^ t i ) (2.8) where is a hyper-parameter to balance the two loss terms. 2.4 Experiments 2.4.1 Evaluation Protocols There are two evaluation protocols for incremental learning. In one setting, the network has different classification layers (multiple heads) for each task, where each head can differentiate the classes learned only in this task; it relies on an oracle to decide on the 23 task at test time, which would result in a misleading high test accuracy [12, 71]. In this work, we adopt a practical yet challenging setting, namely “single-head” evaluation, where the output space consists of all thet classes learned so far, and the model has to learn to resolve the confusion among the classes from different tasks when task identities are not available at test time. 2.4.2 Incremental Learning of Image Classifiers Experimental Setup We evaluate our method on iCIFAR-100 benchmark as done in iCaRL [88], which uses CIFAR-100 [55] data and learn all 100 classes in groups ofg = 5; 10; 20 or 50 classes at a time. 
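Before moving on with the experimental setup, the anchor selection rule of Eq. (2.5) and the regression-target choice of Eq. (2.7) can be summarized with a short sketch. This is an illustrative NumPy sketch with assumed array shapes (A anchors, with class scores and box offsets already produced by the two frozen detectors), not the RetinaNet-based implementation used in the experiments.

import numpy as np

def select_anchors(p_old, p_new, k):
    """Keep the k anchors with the highest objectness score (Eq. 2.5).

    p_old: (A, s)   per-anchor class scores from the old-class model
    p_new: (A, t-s) per-anchor class scores from the new-class model
    """
    objectness = np.maximum(p_old.max(axis=1), p_new.max(axis=1))
    return np.argsort(-objectness)[:k]

def regression_targets(p_old, p_new, t_old, t_new):
    """Pick the box-regression target per anchor (Eq. 2.7).

    t_old, t_new: (A, 4) predicted offsets (tx, ty, th, tw) of the two models.
    The old model's offsets are used when its best class score beats the
    new model's best class score, and vice versa.
    """
    use_old = p_old.max(axis=1) > p_new.max(axis=1)          # (A,)
    return np.where(use_old[:, None], t_old, t_new)

def smooth_l1(x):
    """Smooth L1 (Eq. 2.6), applied elementwise to t - t_hat."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

# Toy example: A = 6 anchors, s = 3 old classes, t - s = 2 new classes.
rng = np.random.default_rng(1)
p_old, p_new = rng.random((6, 3)), rng.random((6, 2))
t_old, t_new = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
keep = select_anchors(p_old, p_new, k=4)
t_hat = regression_targets(p_old, p_new, t_old, t_new)[keep]
t_pred = rng.normal(size=(len(keep), 4))                     # consolidated model
loc_loss = smooth_l1(t_pred - t_hat).sum(axis=1).mean()
print(keep, loc_loss)

The classification term over the selected anchors is the double distillation loss of Eq. (2.3), and the two terms are combined as in Eq. (2.8).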
The evaluation metric is the standard top-1 multi-class classification accuracy on the test set. For each experiment, we run this benchmark 5 times with different class orderings and then report the averages and standard deviations of the results. We use ImageNet3232 dataset [19] as the source of auxiliary data in the model consolidation stage. The images are down-sampled versions of images from ImageNet ILSVRC [24, 95] training set. We exclude the images that belong to the CIFAR-100 classes, which results in 1,082,340 images. Following iCaRL [88], we use a 32-layer ResNet [37] for all experiments and the model weights are randomly initialized. Experimental Results and Discussions We compare our method against the state-of-the-art exemplar-free incremental learn- ing methods EWC++ [12, 54], LwF [64], SI [123], MAS [3], RWalk [12] and some baselines with g = 5; 10; 20; 50. Finetuning denotes the case where we directly fine- tune the model trained on the old classes with the labeled images of new classes, with- out any special treatment for catastrophic forgetting. Fixed Representation denotes the 24 approach where we freeze the network weights except for the classification layer (the last fully connected layer) after the first group of classes has been learned, and we freeze the classification weight vector after the corresponding classes have been learned, and only fine-tune the classification weight vectors of new classes using the new data. This approach usually underfits for the new classes due to the limited degree of freedom and incompatible feature representations of the frozen base network. Oracle denotes the upper bound results via joint training with all the training data of the classes learned so far. 20 40 60 80 100 Number of classes 0 10 20 30 40 50 60 70 80 90 100 Accuracy in % Incrementally learning 5 classes at a time 20 40 60 80 100 Number of classes 0 10 20 30 40 50 60 70 80 90 100 Accuracy in % Incrementally learning 10 classes at a time 20 40 60 80 100 Number of classes 0 10 20 30 40 50 60 70 80 90 100 Accuracy in % Incrementally learning 20 classes at a time 50 60 70 80 90 100 Number of classes 0 10 20 30 40 50 60 70 80 90 100 Accuracy in % Incrementally learning 50 classes at a time Fixed Rep. Finetuning EWC MAS SI Rwalk LwF Oracle DMC (Ours) Figure 2.2: Incremental learning with group of g = 5; 10; 20; 50 classes at a time on iCIFAR-100 benchmark. 25 0 10 20 30 40 50 60 70 80 90 100 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Accuracy on the 1st Task in % Task EWC ++ MAS SI RWalk LwF DMC (Ours) Oracle Figure 2.3: Performance variation on the first task when trained incrementally over 20 tasks (g = 5) on iCIFAR-100. (a) DMC (b) LwF (c) Finetuning (d) Fixed Repr. Figure 2.4: Confusion matrices of methods on iCIFAR-100 when incrementally learning 10 classes in a group. The entries transformed bylog(1 +x) for better visibility. Fig. 2.4(b), 2.4(c) and 2.4(d) are from [88]. (Best viewed in color.) The results are shown in Fig. 2.2. Our method outperforms all the methods by a significant margin across all the settings consistently. We used the official code 1 for [12] to get the results for EWC++ [12, 54], SI [123], MAS [3] and RWalk [12]. 
We found they are highly sensitive to the hyperparameter that controls the strength of regularization due to the asymmetric information between old classes and new classes, so we tune the hyperparameter using a held-out validation set for each setting separately and report the 1 https://github.com/facebookresearch/agem 26 best result for each case. The results of LwF [64] are from iCaRL [88] and they are the second-best in all the settings. It can be also observed that DMC demonstrates a stable performance across different g, in contrast to other regularization-based methods, where the disadvantages of inherent asymmetric information flow reveal more, as we incrementally learn more sessions. They struggle in finding a good trade-off between forgetting and intransigence. Fig. 2.3 illustrates how the accuracy of the first group of classes changes as we learn more and more classes over time. While the previous methods [3, 12, 54, 64, 123] all suffer from catastrophic forgetting on the first task, DMC shows considerably more gentle slope of the forgetting curve. Though the standard deviations seems high, which is due to the random class ordering in each run, the relative standard deviations (RSD) are at a reasonable scale for all methods. We visualize the confusion matrices of some of the methods in Fig. 2.4. Finetuning forgets old classes and makes predictions based only on the last learned group. Fixed Representation is strongly inclined to predict the classes learned in the first group, on which its feature representation is optimized. The previous best performing method LwF does a better job but still has many more non-zero entries on the recently learned classes, which shows strong evidence of information asymmetric between old classes and new classes. On the contrary, the proposed DMC shows a more homogeneous confusion matrix pattern and thus has visibly less intrinsic bias towards or against the classes that it encounters early or late during learning. Impact of the distribution of auxiliary data. Fig. 2.5 shows our empirical study on the impact of the distribution of the auxiliary data by using images from datasets of hand- written digits (MNIST [59]), house number digits (SVHN [78]), texture (DTD [20]), and scenes (Places365 [130]) as the sources of the auxiliary data. Intuitively, the more 27 diversified and more similar to the target data the auxiliary data is, the better perfor- mance we can achieve. Experiments show that usage of overparticular datasets like MNIST and SVHN fails to produce competitive results, but using generic and easily accessible datasets like DTD and Places365 can already outperform the previous state- of-the-art methods. In the applied scenario, one may use the prior knowledge about the target data to obtain the desired auxiliary data from a related domain to boost the performance. 20 40 60 80 100 Number of classes 10 20 30 40 50 60 70 80 90 Accuracy in % Incrementally learning 10 classes at a time DMC-ImageNet DMC-Places365 LwF DMC-DTD SI DMC-SVHN 20 40 60 80 100 Number of classes 10 20 30 40 50 60 70 80 90 Accuracy in % Incrementally learning 20 classes at a time DMC-ImageNet DMC-Places365 LwF DMC-DTD Rwalk DMC-SVHN Figure 2.5: Varying the datasets of auxiliary data used in the consolidation stage on iCIFAR-100 benchmark. Note that using MNIST leads to failure (2% acc.) so we omit the plots. Choices of loss function. We compare some common distance metrics used in knowl- edge distillation in Table 2.1 . 
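For reference, the two strongest candidates among these metrics — the L2 loss on logits used in Eq. (2.3) and the temperature-scaled KD loss of [39] — can be written down as follows. This is an illustrative NumPy sketch, not the experiment code; the target logits stand in for the normalized ȳ of Eq. (2.4), and the toy values are arbitrary.

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def l2_on_logits(y_cons, y_bar):
    """L2 distance between logits, as in Eq. (2.3)."""
    return np.mean((y_cons - y_bar) ** 2)

def kd_loss(y_cons, y_bar, T=2.0):
    """KD loss [39]: cross-entropy between temperature-softened distributions
    (scaling conventions, e.g. multiplying by T^2, vary and are omitted here)."""
    p_teacher = softmax(y_bar, T)
    log_q_student = np.log(softmax(y_cons, T))
    return -np.sum(p_teacher * log_q_student)

y_bar  = np.array([2.0, -1.0, 0.5, -0.5])   # distillation target (normalized logits)
y_cons = np.array([1.5, -0.8, 0.7, -0.9])   # consolidated model logits
print(l2_on_logits(y_cons, y_bar), kd_loss(y_cons, y_bar, T=2.0))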
We observe DMC is generally not sensitive to the loss function chosen, while L 2 loss and KD loss [39] with T = 2 performs slightly better than others. As stated in [39], both formulations should be equivalent in the limit of a high temperatureT , so we useL 2 loss throughout our experiments for its simplicity and stability over various training schedules. Effect of the amount of auxiliary data. Fig. 2.6 illustrates the effect of the amount of auxiliary data used in the consolidation stage. We randomly subsampled 2 k 10 3 images for k = 0; ; 9 from ImageNet3232 [19]. We report the average of the 28 Table 2.1: Average incremental accuracies on CIFAR-100 when g = 20 and varying distance metrics used inL dd . KD (T = 1) KD (T = 2) L 1 L 2 46:95 2:01 58:01 1:17 57:86 1:16 58:06 1:15 classification accuracies over all steps of the IL (as in [10], the accuracy of the first group is not considered in this average). Overall, our method is robust against the reduction of auxiliary data to a large extent. We can outperform the previous state-of-the-art by just using 8,000, 16,000 and 32,000 unlabeled images (< 3% of full auxiliary data) for g = 10; 20; 50, respectively. Note that it also takes less training time for the consolidated model to converge when we use less auxiliary data. 10 3 10 4 10 5 10 6 Amount of auxiliary data (in log scale) 10 20 30 40 50 60 Ave. incremental acc. in % g=10 g=20 g=50 Figure 2.6: Average incremental accuracy on iCIFAR-100 withg = 10; 20; 50 classes per group for different the amount of auxiliary data used in the consolidation stage. Dashed horizontal lines represent the performance of the previous state-of-the-art, i.e., , LwF. Experiments with larger images. We additionally evaluate our method on CUB-200 [114] dataset in IL setting with g = 100. The network architecture (VGG-16 [103]) and data preprocessing are identical with REWC [71]. We use BirdSnap [6] as the auxiliary data source where we excluded the CUB categories. As shown in Table 2.2, DMC outperforms the previous state-of-the-art [71] by a considerable margin. This demonstrates that DMC generalizes well to various image resolutions and domains. 29 Table 2.2: Accuracies on CUB-200 when incrementally learning withg = 100 classes per group. Methods Old Classes New Classes Average Accuracy EWC [54] 42.3 48.6 45.3 REWC [71] 53.3 45.2 48.4 Ours 54:70 57:56 55:89 Table 2.3: VOC 2007 test per-class average precision (%) when incrementally learning 10 + 10 classes. Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP Class 1-10 76.8 78.1 74.3 58.9 58.7 68.6 84.5 81.1 52.3 61.4 - - - - - - - - - - - Class 11-20 - - - - - - - - - - 66.3 71.5 75.2 67.7 76.4 38.6 66.6 66.6 71.1 74.5 - Oracle 77.8 85.0 82.9 62.1 64.4 74.7 86.9 87.0 56.0 76.5 71.2 79.2 79.1 76.2 83.8 53.9 73.2 67.4 77.7 78.7 74.7 Adapted Shmelkov et al. [101] 67.1 64.1 45.7 40.9 52.2 66.5 83.4 75.3 46.4 59.4 64.1 74.8 77.1 67.1 63.3 32.7 61.3 56.8 73.7 67.3 62.0 DMC- exclusive aux. 
data 68.6 71.2 73.1 48.1 56.0 64.4 81.9 77.8 49.4 67.8 61.5 67.7 67.5 52.2 74.0 37.8 63.0 55.5 65.3 72.4 63.8 Inference twice 76.9 77.7 74.4 58.5 58.7 67.8 84.9 77.8 52.0 65.0 67.3 69.5 70.4 61.2 76.4 39.2 63.2 62.1 72.9 74.6 67.5 DMC 73.9 81.7 72.7 54.6 59.2 73.7 85.2 83.3 52.9 68.1 62.6 75.0 69.0 63.4 80.3 42.4 60.3 61.5 72.6 74.5 68.3 2.4.3 Incremental Learning of Object Detectors Experimental Setup Following [101], we evaluate DMC for incremental object detection on PASCAL VOC 2007 [28] in the IL setting: there are 20 object categories in the dataset, and we incre- mentally learn 10 + 10 classes and 19 + 1 classes. The evaluation metric is the standard mean average precision (mAP) on the test set. We use training images from Microsoft COCO [69] dataset as the source of auxiliary data for the model consolidation stage. Out of 80 object categories in the COCO dataset, we use the 98,495 images that contain objects from the 60 non-PASCAL categories. Table 2.4: VOC 2007 test per-class average precision (%) when incrementally learning 19 + 1 classes. Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP Class 1-19 70.6 79.4 76.6 55.6 61.7 78.3 85.2 80.3 50.6 76.1 62.8 78.0 78.0 74.9 77.4 44.3 69.1 70.5 75.6 - - Class 20 - - - - - - - - - - - - - - - - - - - 68.9 - Oracle 77.8 85.0 82.9 62.1 64.4 74.7 86.9 87.0 56.0 76.5 71.2 79.2 79.1 76.2 83.8 53.9 73.2 67.4 77.7 78.7 74.7 Adapted Shmelkov et al. [101] 61.9 78.5 62.5 39.2 60.9 53.2 79.3 84.5 52.3 52.6 62.8 71.5 51.8 61.5 76.8 43.8 43.8 69.7 52.9 44.6 60.2 DMC- exclusive aux. data 65.3 65.8 73.2 43.8 57.1 73.3 83.1 79.3 45.4 74.3 55.1 82.0 68.7 62.6 74.9 42.3 65.2 67.5 67.8 64.0 65.5 Inference twice 70.6 79.1 76.6 52.8 61.5 77.6 85.1 80.3 50.6 76.0 62.7 78.0 76.5 74.7 77.0 43.7 69.1 70.3 70.0 69.5 70.1 DMC 75.4 77.4 76.4 52.6 65.5 76.7 85.9 80.5 51.2 76.1 63.1 83.3 74.6 73.7 80.1 44.6 67.5 68.1 74.4 69.0 70.8 30 We perform all experiments using RetinaNet [68], but the proposed method is appli- cable to other one-stage detectors [57, 70, 89, 125] with minor modifications. In the 10 + 10 experiment, we use ResNet-50 [37] as the backbone network for both 10-class models and the final consolidated 20-class model. In 19+1 experiment, we use ResNet- 50 as the backbone network for the 19-class model as well as the final consolidated 20- class model, and ResNet-34 for the 1-class new model. In all experiments, the backbone networks were pre-trained on ImageNet dataset [37]. Experimental Results and Discussions We compare our method with a baseline method and with the state-of-the-art IL method for object detection by Shmelkov et al. [101]. In the baseline method, denoted by Infer- ence twice, we directly run inference for each test image using two specialist models separately and then aggregate the predictions by taking the class that has the highest classification probability among all classes, and use the bounding box prediction of the associated model. The method proposed by Shmelkov et al. [101] is compatible only with object detectors that use pre-computed class-agnostic object proposals (e.g. , Fast R-CNN [32]), so we adapt their method for RetinaNet by using our novel anchor boxes selection scheme to determine where to apply the distillation, denoted by Adapted Shmelkov et al. [101]. Learning 10 + 10 classes. The results are given in Table 2.3. 
Compared to Inference twice, our method is more time- and space-efficient since Inference twice scales badly with respect to the number of IL sessions, as we need to store all the individual models and run inference using each one at test time. The accuracy gain of our method over the Inference twice method might seem surprising, but we believe this can be attributed to the better representations that were inductively learned with the help of the unlabeled auxiliary data, which is exploited also by many semi-supervised learning algorithms. 31 Compared to Adapted Shmelkov et al. [101], our method exhibits remarkable perfor- mance improvement in detecting all classes. Learning 19 + 1 classes. The results are given in Table 2.4. We observe an mAP pat- tern similar to the 10 + 10 experiment. Adapted Shmelkov et al. suffers from degraded accuracy in old classes. Moreover, it cannot achieve good AP on the “tvmonitor” class. Heavily regularized on 19 old classes, the model may have difficulty learning a single new class with insufficient training data. Our DMC achieves state-of-the-art mAP of all the classes learned, with only half of the model complexity and inference time of Infer- ence twice. We also performed the addition of one class experiment with each of the VOC categories being the new class. The behavior for each class is very similar to the “tvmonitor” case described above. The mAP varies from 64.88% (for new class “aero- plane”) to 71.47% (for new class “person”) with mean 68.47% and standard deviation of 1.75%. Detailed results are in the Appendix A.1. Impact of the distribution of auxiliary data. The auxiliary data selection strategy that was described in Sec. 2.4.3 would potentially include images that contain objects from target categories. To see the effect of data distribution, we also experimented with a more strict data in which we exclude all the MS COCO images that contain any object instance of 20 PASCAL categories, denoted by DMC- exclusive aux. data in Table 2.3 and 2.4. This setting can be considered as the lower bound of our method regarding the distribution of auxiliary data. We see that even in such a strict setting, our method outperforms the previous state-of-the-art [101]. This study also implies that our method can benefit from auxiliary data from a similar domain. Consolidating models with different base networks. As mentioned in Sec. 2.4.3, originally we used different base network architectures for the two specialist models in 19+1 classes experiment. As shown in Table 2.5, we also compare the case when using ResNet-50 backbone for both the 19-class model and the 1-class model. We observed 32 that ResNet-50 backbone does not work as well as ResNet-34 backbone, which could result from overfitting of the deeper model to the training data of the new class and thus it fails to produce meaningful distillation targets in the model consolidation stage. However, since our method is architecture-independent, it offers the flexibility to use any network architecture that fits best with the current training data. Table 2.5: VOC 2007 test mAP (%) when using different network architectures for the old and new model, respectively. Classes 1-19 are the old classes, and class 20 (tvmonitor) is the new one. 
Model Old Classes New Class All Classes Class 1-19 (ResNet-50) 70.8 - - Class 20 (ResNet-34) - 68.9 - Consolidated 70.9 69 70.8 Class 20 (ResNet-50) - 59.0 - Consolidated 70.2 57.9 69.9 2.5 Conclusion In this chapter, we present a novel class-incremental learning paradigm called DMC. With the help of a novel double distillation training objective, DMC does not require storage of any legacy data; it exploits readily available unlabeled auxiliary data to con- solidate two independently trained models instead. DMC outperforms existing non- exemplar-based methods for incremental learning on large-scale image classification and object detection benchmarks by a significant margin. DMC is independent of net- work architectures and thus it is applicable in many tasks. Future directions worth exploring include: theoretically characterize how the “simi- larity” between the unlabeled auxiliary data and target data affects the IL performance; 2) continue the study on using of exemplars of old data with DMC (preliminary results in Appendix A.4), in terms of exemplar selection scheme and rehearsal strategies; 3) generalize DMC to consolidate multiple models at one time; 4) extend DMC to other 33 applications where consolidation of deep models is beneficial, e.g. taking ensemble of models trained with the same or partially overlapped sets of classes (preliminary results in Appendix A.5). Acknowledgments. This work was started during my internship at Samsung Research America and later continued at USC. We also acknowledge the support of NVIDIA Corporation with the donation of a Titan X Pascal GPU. 34 Chapter 3 Knowledge Transfer across Visual Domains via Domain Adaptation 3.1 Introduction Semantic segmentation for urban scenes is an important yet challenging task for a vari- ety of vision-based applications, including autonomous driving cars, smart surveillance systems, etc. With the success of convolutional neural networks (CNNs), numerous suc- cessful fully-supervised semantic segmentation solutions have been proposed in recent years [72, 14]. To achieve satisfactory performance, these methods demand a suffi- ciently large dataset with pixel-level labels for training. However, creating such large datasets is prohibitively expensive as it requires human annotators to accurately trace segment boundaries. Furthermore, it is difficult to collect traffic scene images with suf- ficient variations in terms of lighting conditions, weather, city, and driving routes. To overcome the above-mentioned limitations, one can utilize the modern urban scene simulators to automatically generate a large number of synthetic images with pixel-level labels. However, this introduces another problem, i.e. distributions mismatch between the source domain (synthesized data) and the target domain (real data). Even if we synthesize images with the state-of-the-art simulators [92, 94], there still exists visible appearance discrepancy between these two domains. The testing performance in the target domain using the network trained solely by the source domain images is severely degraded. The domain adaptation (DA) technique is developed to bridge this 35 gap. It is a special example of transfer learning that leverages labeled data in the source domain to learn a robust classifier for unlabeled data in the target domain. DA methods for object classification have several challenges such as shifts in lighting and variations in the object’s appearance and pose. 
There are even more challenges in DA methods for semantic segmentation because of variations in the scene layout, object scales, and class distributions in images. Many successful domain-alignment-based methods work for DA-based classification but not for DA-based segmentation. Since it is not clear what comprises data instances in a deep segmenter [126], DA-based segmentation is still far from its maturity. In this work, we propose a novel fully convolutional tri-branch network (FCTN) to solve the DA-based segmentation problem. In the FCTN, two labeling branches are used to generate pseudo segmentation ground-truth for unlabeled target samples while the third branch learns from these pseudo-labeled target samples. An alternating re-labeling and re-training mechanism is designed to improve the DA performance in a curriculum learning fashion. We evaluate the proposed method using large-scale synthesized-to- real urban scene datasets and demonstrate substantial improvement over the baseline network and other benchmarking methods. 3.2 Related Work The current literature on visual domain adaptation mainly focuses on image classifi- cation [23]. Being inspired by shallow DA methods, one common intuition of deep DA methods is that adaptation can be achieved by matching the distribution of features in different domains. Most deep DA methods follow a siamese architecture with two streams, representing the source and target models. They aim to obtain domain-invariant 36 F 1 F t Concatenate Coordinate Maps Source Image TargetImage Source Label Target Pseudo Label F: Shared base network F 1 , F 2 : Labeling branches F t : Target-specific branch F 2 F Voting Decision of Pseudo Label : Forwarding of labeled source samples : Forwarding of pseudo-labeled target samples : Source domain supervision : Target domain supervision Figure 3.1: An overview of the proposed fully convolutional tri-branch network (FCTN). It has one shared base network denoted by F followed by three branches of the same architecture denoted by F 1 , F 2 and F t . Branches F 1 and F 2 assign pseudo labels to images in the unlabeled target domain, while branchF t is trained with super- vision from images in the pseudo-labeled target domain. features by minimizing the divergence of features in the two domains and a classifica- tion loss [73, 105, 108, 109], where the classification loss is evaluated in the source domain with labeled data only. However, these methods assume the existence of a uni- versal classifier that can perform well on samples drawn from whichever domain. This assumption tends to fail since the class correspondence constraint is rarely imposed in the domain alignment process. Without such an assumption, feature distribution match- ing may not lead to classification improvement in the target domain. The ATDA method proposed in [97] avoids this assumption by employing the asymmetric tri-training. It can assign pseudo labels to unlabeled target samples progressively and learn from them using a curriculum learning paradigm. This paradigm has been proven effective in the weakly-supervised learning tasks [61] as well. 37 Previous work on segmentation-based DA is much less. Hoffman et. 
al [40] con- sider each spatial unit in an activation map of a fully convolutional network (FCN) as an instance, and extend the idea in [108] to achieve two objectives: 1) minimizing the global domain distance between two domains using a fully convolutional adversar- ial training and 2) enhancing category-wise adaptation capability via multiple instance learning. The adversarial training aims to align intermediate features from two domains. It implies the existence of a single good mapping from the domain-invariant feature space to the correct segmentation mask. To avoid this condition, Zhang et. al [126] proposed to predict the class distribution over the entire image and some representative superpixels in the target domain first. Then, they use the predicted distribution to regu- larize network training. In this work, we avoid the single good mapping assumption and rely on the remarkable success of the ATDA method [97]. In particular, we develop a curriculum-style method that improves the cross-domain generalization ability for better performance in DA-based segmentation. 3.3 Proposed Domain Adaptation Network The proposed fully convolutional tri-branch network (FCTN) model for cross-domain semantic segmentation is detailed in this section. The labeled source domain training set is denoted byS =f(x s i ;y s i )g ns i=1 while the unlabeled target domain training set is denoted byT =fx t i g nt i=1 , wherex is an image,y is the ground truth segmentation mask andn s andn t are the sizes of training sets of two domains, respectively. 3.3.1 Fully Convolutional Tri-branch Network Architecture An overview of the proposed FCTN architecture is illustrated in Fig. 3.1. It is a fully convolutional network that consists of a shared base network (F ) followed by three 38 branch networks (F 1 , F 2 and F t ). Branches F 1 and F 2 are labeling branches. They accept deep features extracted by the shared base net, F , as the input and predict the semantic label of each pixel in the input image. Although the architectures of the three branches are the same, their roles and functions are not identical. F 1 andF 2 generate pseudo labels for the target images based on prediction. F 1 and F 2 learn from both labeled source images and pseudo-labeled target images. In contrast, F t is a target- specific branch that learns from pseudo-labeled target images only. We use the DeepLab-LargeFOV (also known as the DeepLab v1) [13] as the refer- ence model due to its simplicity and superior performance in the semantic segmentation task. The DeepLab-LargeFOV is a re-purposed VGG-16 [103] network with dilated convolutional kernels. The shared base network F contains 13 convolutional layers while the three branches are formed by three convolutional layers that are converted from fully connected layers in the original VGG-16 network. Although the DeepLab- LargeFOV is adopted here, any effective FCN-based semantic segmentation framework can be used in the proposed FCTN architecture as well. 3.3.2 Encoding Explicit Spatial Information Being inspired by PFN [65], we attach the pixel coordinates as the additional feature map to the last layer of F . The intuition is that the urban traffic scene images have a structured layout and certain classes usually appear in a similar location in images. However, CNN is translation-invariant by nature. That is, it makes predictions based on patch-based features regardless of the patch location in the original image. 
Assume that the last layer in F has a feature map of size HWD, where H, W , and D are the height, width, and depth of the feature map, respectively. We generate two spatial coordinate maps X and Y of size H W , where values of X(p x ;p y ) and Y (p x ;p y ) are set to be p x =W and p y =H for pixel p at location (p x ;p y ), respectively. 39 We concatenate spatial coordinate mapsX andY to the original feature maps along the depth dimension. Thus, the output feature maps are of dimensionHW (D + 2). By incorporating the spatial coordinate maps, the FCTN can learn more location-aware representations. 3.3.3 Assigning Pseudo Labels to Target Images Being inspired by the ATDA method [97], we generate pseudo labels by feeding images in the target domain training set to the FCTN and collect predictions from both labeling branches. For each input image, we assign the pseudo-label to a pixel if the following two conditions are satisfied: 1) the classifiers associated with labeling branches, F 1 andF 2 , agree in their predicted labels on this pixel; 2) the higher confidence score of these two predictions exceeds a certain threshold. In practice, the confidence threshold is set very high (say, 0.95 in our implementation) because the use of many inaccurate pseudo labels tends to mislead the subsequent network training. In this way, high-quality pseudo labels for target images are used to guide the network to learn target-specific discriminative features. The pseudo-labeled target domain training set is denoted by T l =f(x t i ; ^ y t i )g nt i=1 , where ^ y is the partially pseudo-labeled segmentation mask. Some sample pseudo-labeled segmentation masks are shown in Fig. 3.2. In the subsequent training, the not-yet-labeled pixels are simply ignored in the loss computation. 3.3.4 Loss Function Weight-Contrained Loss. As suggested in the standard tri-training algorithm [132], the three classifiers inF 1 ,F 2 andF t must be diverse. Otherwise, the training degenerates to self-training. In our case, one crucial requirement to obtain high-quality pseudo-labels from two labeling branchesF 1 andF 2 is that they should have different views on one sample and make decisions on their own. 40 Figure 3.2: Illustration of pseudo labels used in the 2-round curriculum learning in the GTA-to-Cityscapes DA experiments. The first row shows the input images. The second row shows the ground truth segmentation masks. The third and fourth row shows the pseudo labels used in the first and second round of curriculum learning, respectively. Note in the visualization of pseudo labels, white pixels indicate the unlabeled pixels. Best viewed in color. Unlike the case in the co-training algorithm [7], where one can explicitly partition features into different sufficient and redundant views, it is not clear how to partition deep features effectively in our case. Here, we enforce divergence of the weights of the convolutional layers of two labeling branches by minimizing their cosine similarity. Then, we have the following filter weight-constrained loss term: L w = ~ w 1 ~ w 2 k~ w 1 kk~ w 2 k (3.1) where ~ w 1 and ~ w 2 are obtained by the flattening and concatenating the weights of convo- lutional filters in convolutional layers ofF 1 andF 2 , respectively. 41 Weighted Pixel-wise Cross-entropy Loss. In the curriculum learning stage, we take a minibatch of samples with one half fromS and the other half fromT l at each step. We calculate the segmentation losses separately for each half of the samples. 
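Before detailing these loss terms, the two ingredients introduced above — the spatial coordinate maps of Sec. 3.3.2 and the pseudo-label assignment rule of Sec. 3.3.3 — are summarized in the following illustrative NumPy sketch. The array shapes, the ignore-label value, and the use of per-pixel probability maps as inputs are assumptions made for illustration; the 0.95 confidence threshold follows the text.

import numpy as np

def append_coordinate_maps(features):
    """Append normalized x/y coordinate maps to a feature blob (Sec. 3.3.2).

    features: (H, W, D) -> returns (H, W, D + 2)
    """
    H, W, _ = features.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    X = (xs / W)[..., None]              # X(p_x, p_y) = p_x / W
    Y = (ys / H)[..., None]              # Y(p_x, p_y) = p_y / H
    return np.concatenate([features, X, Y], axis=-1)

def assign_pseudo_labels(prob1, prob2, threshold=0.95, ignore_label=255):
    """Pseudo-label rule of Sec. 3.3.3 for one target-domain image.

    prob1, prob2: (H, W, C) class probabilities from branches F1 and F2.
    A pixel is labeled only when the two branches agree and the higher of
    their confidences exceeds the threshold; other pixels are ignored
    (ignore_label is a placeholder value assumed here).
    """
    pred1, pred2 = prob1.argmax(-1), prob2.argmax(-1)
    conf = np.maximum(prob1.max(-1), prob2.max(-1))
    mask = (pred1 == pred2) & (conf > threshold)
    return np.where(mask, pred1, ignore_label)

# Toy usage with assumed shapes.
feat = np.zeros((4, 4, 8))
print(append_coordinate_maps(feat).shape)        # (4, 4, 10)
p1 = p2 = np.dstack([np.full((4, 4), 0.97), np.full((4, 4), 0.03)])
print(assign_pseudo_labels(p1, p2))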
For the source domain images samples, we use the vanilla pixel-wise softmax cross-entropy loss, denoted byL S , as the segmentation loss function. Furthermore, as mentioned in Sec. 3.3.3, we assign pseudo labels to target domain pixels based on predictions of two labeling branches. This mechanism tends to assign pseudo labels to the prevalent and easy-to-predict classes, such as the road, building, etc., especially in the early stage (this can be seen in Fig. 3.2). Thus, the pseudo labels can be highly imbalanced in classes. If we treat all classes equally, the gradients from challenging and relatively rare classes will be insignificant and the training will be biased toward prevalent classes. To remedy this, we use a weighted cross-entropy loss for target domain samples, denoted by L T l . We calculate weights using the median frequency balancing scheme [26], where the weight assigned to classc in the loss function becomes c = median freq freq(c) ; (3.2) wherefreq(c) is the number of pixels of classc divided by the total number of pixels in the source domain images whenever c is present and median freq is the median of these frequenciesffreq(c)g C c=1 , and where C is the total number of classes. This scheme works well under the assumption that the global class distributions of the source domain and the target domain are similar. Total Loss Function. There are two stages in our training procedure. We first pre-train the entire network using minibatches fromS so as to minimize the following objective function: L =L w +L S (3.3) 42 Once the curriculum learning starts, the overall objective function becomes L =L w +L S +L T l (3.4) whereL S is evaluated onS and averaged over predictions ofF 1 andF 2 branches,L T l is evaluated onT l and averaged over predictions of all three top branches, and and are hyper-parameters determined by the validation split. Model per-class IoU mIoU road sidewlk bldg. wall fence pole t. light t. sign veg. terr. sky person rider car truck bus train mbike bike No Adapt 31.9 18.9 47.7 7.4 3.1 16.0 10.4 1.0 76.5 13.0 58.9 36.0 1.0 67.1 9.5 3.7 0.0 0.0 0.0 21.1 FCN [40] 70.4 32.4 62.1 14.9 5.4 10.9 14.2 2.7 79.2 21.3 64.6 44.1 4.2 70.4 8.0 7.3 0.0 3.5 0.0 27.1 No Adapt 18.1 6.8 64.1 7.3 8.7 21.0 14.9 16.8 45.9 2.4 64.4 41.6 17.5 55.3 8.4 5.0 6.9 4.3 13.8 22.3 CDA [126] 26.4 22.0 74.7 6.0 11.9 8.4 16.3 11.1 75.7 13.3 66.5 38.0 9.3 55.2 18.8 18.9 0.0 16.8 14.6 27.8 No Adapt 59.7 24.8 66.8 12.8 7.9 11.9 14.2 4.2 78.7 22.3 65.2 44.1 2.0 67.8 9.6 2.4 0.6 2.2 0.0 26.2 Round 1 66.9 25.6 74.7 17.5 10.3 17.1 18.4 8.0 79.7 34.8 59.7 46.7 0.0 77.1 10.0 1.8 0.0 0.0 0.0 28.9 Round 2 72.2 28.4 74.9 18.3 10.8 24.0 25.3 17.9 80.1 36.7 61.1 44.7 0.0 74.5 8.9 1.5 0.0 0.0 0.0 30.5 Table 3.1: Adaptation from GTA to Cityscapes. All numbers are measured in %. The last three rows show our results before adaptation, after one and two rounds of curricu- lum learning using the proposed FCTN, respectively. Input Image Ground Truth No Adapt Ours Figure 3.3: Domain adaptation results from the CityscapesVal set. The third column shows segmentation results using the model trained solely by the GTA dataset, and the fourth column shows the segmentation results after two rounds of the FCTN training (best viewed in color). 43 3.3.5 Training Procedure The training process is illustrated in Algorithm 1. We first pre-train the entire FCTN on the labeled source domain training setS for iters iterations, optimizing the loss function in Eq. (3.3). 
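As an implementation note before completing the training procedure, the class weights of Eq. (3.2) are computed once from the source-domain annotations. A minimal NumPy sketch follows; the handling of classes that never appear is an assumption for illustration.

import numpy as np

def median_frequency_weights(label_maps, num_classes):
    """Class weights alpha_c = median_freq / freq(c) from Eq. (3.2).

    label_maps: list of (H, W) integer ground-truth masks from the source domain.
    freq(c) is the pixel count of class c divided by the total number of
    pixels in the images where c is present.
    """
    pixel_count = np.zeros(num_classes)
    present_pixels = np.zeros(num_classes)   # total pixels of images containing c
    for mask in label_maps:
        counts = np.bincount(mask.ravel(), minlength=num_classes)
        pixel_count += counts
        present_pixels += (counts > 0) * mask.size
    freq = np.where(present_pixels > 0,
                    pixel_count / np.maximum(present_pixels, 1), 0.0)
    median_freq = np.median(freq[freq > 0])
    return np.where(freq > 0, median_freq / np.maximum(freq, 1e-12), 0.0)

# Toy usage: two 4x4 source masks over 3 classes.
m1 = np.array([[0] * 4] * 3 + [[1] * 4])
m2 = np.array([[0] * 4] * 2 + [[2] * 4] * 2)
print(median_frequency_weights([m1, m2], num_classes=3))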
We then use the pre-trained model to generate the initial pseudo labels for the target domain training setT , using the method described in Sec. 3.3.3. We re-train the network usingS andT l for several steps. At each step, we take a minibatch of samples with half fromS and a half fromT l , optimizing the terms in Eq. (3.4) jointly. We repeat the re-labeling ofT and the re-training of the network for several rounds until the model converges. Algorithm 1 Training procedure for our fully convolutional tri-branch network (FCTN). Input: labeled source domain training setS = f(x s i ;y s i )g ns i=1 and unlabeled target domain training setT =fx t i g nt i=1 Pretraining onS : fori = 1 toiters do trainF;F 1 ;F 2 ;F t with minibatches fromS end for Curriculum Learning withS andT : fori = 1 torounds do T l LABELING(F;F 1 ;F 2 ;T ) . See Sec. 3.3.3 fork = 1 tosteps do trainF;F 1 ;F 2 with samples fromS trainF;F 1 ;F 2 ;F t with samples fromT l end for end for returnF;F t 44 3.4 Experiments We validate the proposed method by experimenting with the adaptation from the recently built synthetic urban scene dataset GTA [92] to the commonly used urban scene semantic segmentation dataset Cityscapes [21]. Cityscapes [21] is a large-scale urban scene semantic segmentation dataset. It pro- vides over 5,000 finely labeled images (train/validation/test: 2,993/503/1,531), which are labeled with per-pixel category labels. They are with high resolution of 10242048. There are 34 distinct semantic classes in the dataset, but only 19 classes are considered in the official evaluation protocol. GTA [92] contains 24,966 high-resolution labeled frames extracted from realistic open-world computer games, Grand Theft Auto V (GTA5). All the frames are vehicle- egocentric and the class labels are fully compatible with Cityscapes. We implemented our method using Tensorflow[1] and trained our model using a single NVIDIA TITAN X GPU. We initialized the weights of the shared base net F using the weights of the VGG-16 model pre-trained on ImageNet. The hyper-parameter settings were = 10 3 ; = 100. We used a constant learning rate of 10 5 in the training. We trained the model for 70k, 13k and 20k iterations in the pre-training and two rounds of curriculum learning, respectively. We use synthetic data as source labeled training data and Cityscapestrain as an unlabeled target domain, while evaluating our adaptation algorithm on Cityscapesval using the predictions from the target-specific branchF t . Following the Cityscapes offi- cial evaluation protocol, we evaluate our segmentation domain adaptation results using the per-class intersection over union (IoU) and mean IoU over the 19 classes. The detailed results are listed in Table. 3.1 and some qualitative results are shown in Fig. 3.3. We achieve state-of-the-art domain adaptation performance. Our two rounds of curriculum learning boost the mean IoU over our non-adapted baseline by 2.7% and 45 4.3%, respectively. Especially, the IoU improvement for the small objects (e.g. pole, traffic light, traffic sign, etc.) are significant (over 10%). 3.5 Conclusion A systematic way to address the unsupervised semantic segmentation domain adaptation problem for urban scene images was presented in this work. The FCTN architecture was proposed to generate high-quality pseudo labels for the unlabeled target domain images and learn from pseudo labels in a curriculum learning fashion. 
It was demonstrated by the DA experiments from the large-scale synthetic dataset to the real image dataset that our method outperforms previous benchmarking methods by a significant margin. There are several possible future directions worth exploring. First, it is interesting to develop a better weight constraint for the two labeling branches so that even better pseudo labels can be generated. Second, we may impose the class distribution constraint on each individual image [126] to alleviate the confusion between some visually similar classes, e.g. road and sidewalk, vegetation and terrain, etc. Third, we can extend the proposed method to other tasks, e.g. instance-aware semantic segmentation. 46 Chapter 4 Knowledge Transfer across Applications — from Segmentation to Detection 4.1 Introduction The ability to read texts in a natural scene is a highly desirable capability in many inter- esting and practical applications, such as assistance to visually impaired people, environ- ment understanding and automatic navigation for smart robots, and visual translation, etc. Thus, research on scene text detection and recognition has drawn increasing atten- tion in the computer vision community. Scene text detection is a crucial prerequisite for numerous subsequent recognition tasks. Generally speaking, text detection is a chal- lenging task, which should provide word-level bounding boxes as the desired output. Its challenges originate from variabilities in fonts, scales, and layout, as well as complex backgrounds and perspective distortion. Conventionally, hand-crafted features were adopted in a text detection system [79, 27]. Recently, deep-learning-based methods [44, 128, 106] are dominating in this field. They have achieved remarkable performance on several well-known benchmark datasets [98, 49]. They can be divided into two categories. The first category follows the path of generic object detection and predicts word-level bounding boxes directly. 47 Examples include [107, 66]. However, they cannot produce satisfactory results with- out the use of heavy post-processing techniques for the removal of false alarms. This is because predictions are made based on local regions with limited access to the con- textual cues. The second category [129, 38, 121] formulates the text detection problem as a semantic segmentation problem. These methods predict the probability of each pixel belonging to a bounding box of a text block using the fully convolutional networks (FCNs). They are robust to complex backgrounds because of the adoption of a more global view. Nevertheless, they often perform poorly in separating individual words. In this work, we propose a novel segmentation-aided text detection solution, where the segmentation and detection networks are integrated into one single network. As a result, it combines a segmenter’s robustness and a detector’s deftness. The segmentation module aims to produce a “Text Attention Map” (TAM), which is a dense heat map indicating the probability of text’s presence at each pixel. Upon obtaining the TAM, we refine the feature maps for the detection module through a multiplicative gating process. The training process is to learn the TAM generation and MultiBox text detection jointly. By injecting the semantic localization information into the input convolutional features of detectors, the false positives can be effectively suppressed. Thus, the accuracy of the finally detected word-level bounding boxes are significantly improved. 
The proposed method achieves state-of-the-art performance on the COCO-Text dataset [110], which is today’s largest and most challenging benchmark dataset. The overall system is end- to-end trainable and highly efficient. 48 4.2 Related Work Traditionally, scene text detection heavily relied on hand-crafted features. For example, the Maximally Stable Extremal Regions (MSER) [79] and the Stroke Width Transform (SWT) [27] were used to extract text-specific low-level features. With the great success of convolutional neural networks (CNNs) in many computer vision tasks, CNNs have become popular in text detection. Recent text detection meth- ods [46, 107, 66] have been inspired by the generic object detection frameworks, e.g. [33, 90, 89, 70]. However, due to a large variation in text fonts, scales, and aspect ratios, etc, a direct application of general object detection algorithms without domain-specific modification fails to produce satisfactory results. Jaderberg et al. [46] followed the R-CNN [33] framework yet employed the combination of two means in word proposal generation. Based on the Faster R-CNN[90] framework, CPTN [107] adopted a ver- tical anchor mechanism to generate text-component (rather than word-level) proposals and constructed a joint CNN-RNN model to predict the horizontal span of text lines. TextBoxes [66] finetuned the SSD framework [70] and re-designed the aspect ratios and the tiling scheme of pre-defined default boxes. On one hand, as compared with traditional methods, these methods achieved better performance on relatively simple benchmark datasets [98, 49]. On the other hand, the detection results of these methods are made based on local regions in the image without enough contextual cues. When the environment becomes more challenging, they are not robust against text-like patterns such as fences, windows, leaves, etc. Post-processing is usually needed to remove false positives. Text detection can be cast as a semantic segmentation problem. Recent advances in semantic segmentation offer new tools for text detection. Fully convolutional networks (FCNs) [72, 14] preserve the spatial information in the feature map and enable robust pixel-wise prediction for input images of arbitrary size. Both Zhang et al. [129] and 49 CCTN [38] adopted an FCN to find coarse text block regions and cascaded another FCN to refine segmentation results for each text block. Yao et al. [121] trained an FCN to predict the binary mask for the text line block and individual characters simultaneously. Since FCNs have a large receptive field and incorporate more context information, they are robust against complex backgrounds and produce very few false positives. At the same time, it is difficult to detect fine-scale individual text lines or words due to the use of a large receptive field. Extra character-level bounding box annotation can be used to mitigate this shortcoming [129, 121]. However, tremendous post-processing efforts are still required to split text lines, group characters, and partition words for satisfactory performance. It is apparent that the detection network and the semantic segmentation network are complementary to each other in terms of recall and precision scores. It is natural to combine them into one single system. Qin et al. [86] adopted a straightforward cascade approach. That is, text block regions produced by an FCN are cropped out of the image and resized to a square shape of a fixed size. 
Then, a YOLO-like [89] detection network is applied to each cropped patch to detect word-level bounding boxes. In this work, we take an unprecedented perspective by integrating the segmenter and the detector in a holistic way. Our method does not process each cropped text block image patch individually. It is more efficient in both computation and memory. Thus, it offers a more favorable solution in the resource-limited applications such as intelligent robots. 4.3 Methodology In the proposed segmentation-aided text detection system, Deeplab V2 [14] and SSD [70] are integrated seamlessly into one CNN in the form of two modules: text attention 50 map generator and text-attentional MultiBox detector. The details of the two modules and the overall architecture are explained in this section. 4.3.1 Text Attention Map (TAM) Generation A text attention map (TAM) is a two-dimensional heat map with values ranging in [0; 1], which represent the probability of the existence of text. We adopt the recently developed segmentation network DeepLab V2 [14] to generate the TAM. On the top of several fully convolutional layers, we introduce the Atrous Spatial Pyramid Pooling (ASPP) layer as in DeepLab V2 [14]. The ASPP layer produces multi-scale representations by combining feature responses from parallel atrous convolution layers with different sampling rates. The softmax function is used to generate the TAM. This process is illustrated in the upper part of Fig. 4.1. The ground-truth attention mask is converted from the bounding boxes annotations. We labeled class “1” for the pixels inside the text bounding boxes and class “0” else- where. The loss function L seg is the mean of the cross-entropy terms for each spatial position in the predicted mask. 4.3.2 Text-attentional Multibox Text Detection This module aims to predict the class label and location of word-level bounding boxes. The architecture of this module is illustrated in the lower part of Fig. 4.1. Text Attention Gating. Inspired by the gating mechanism of long short-term mem- ory (LSTM), we design the text attention gating layer. The generated TAM in Sec. 4.3.1 now serves as a soft attention gating signal map, where the value at (i;j) describes how much attention the detector should pay to the location (i;j) in the input feature map. Zero value indicates no attention necessary while one means full attention needs to be 51 Figure 4.1: Illustration of the text attention gating process with the TAM. drawn. Given the input convolutional feature blob of dimensionW f H f C f , the atten- tion gating layer first resizes the TAM toW f H f 1 using bilinear interpolation. Then for each channel in the input convolutional feature blob, we perform an element-wise multiplication operation between the feature map and the resized TAM. The output is a refined attention-aware feature map, which is the key component for the false positive suppression in the subsequent text detection. MultiBox Text Detector. Given the refined attention-aware feature maps, we follow SSD [70] to attach several convolutional layers to predict class confidence scores and spatial offsets for some predefined default bounding boxes. At each feature map cell location (i;j), which associates with a default boxb 0 = (x 0 ;y 0 ;w 0 ;h 0 ), we apply two 3 3 convolutional filters to either predict a confidence scoresc for categories (text or 52 non-text) of the associated box, or regress the spatial offsetst = (x; y; w; h) relative to the associated box. 
The predicted box isb = (x;y;w;h), where x =x 0 + x;y =y 0 + y;w =w 0 exp(w);h =h 0 exp(h): Training Objective. We define and associate a set of default boxes with various scales and aspect ratios to each feature map cell, for several feature maps extracted from base CNN at different depths. In the training stage, the ground truth boxes are matched to the default boxes according to the Jaccard overlap, following the matching strategy described in [70]. The training loss functionL det of the detector is a weighted sum of the average classification loss (L cls ) and localization loss (L loc ) for each matched default box: L det = 1 N (L cls +L loc ); (4.1) whereN is the number of matched default boxes, and is empirically set to 1. Specif- ically,L cls is the softmax loss over 2-class confidence scoresc, andL loc is the smooth L1 loss [32] between the predicted box and the corresponding ground truth box loca- tion parameters. Note that we have to insert a Stop Gradient layer before the attention gating operation to isolate the two modules in the backward pass, preventing gradients back-propagation from the detector to the TAM generator. 4.3.3 Overall Network Architecture. The overall end-to-end trainable architecture of the segmentation-aided text detection network is illustrated in Fig. 4.2. Inspired by DSSD [30], we use the fully convolutional part of ResNet-101 [37] as the base network, removing the last global average pooling layer and the 1000-way fully connected layer. In order to increase the resolution of the feature map, we change the 53 Figure 4.2: The overall architecture. The detailed explanation can be found in Sec. 4.3.3. effective output stride of the conv5 block to 16 pixels instead of 32 pixels, which is ben- eficial for the dense prediction tasks like segmentation and detection [14]. We append four more residual blocks (namely, conv6 to conv9) after the conv5 block so that we can detect text at various scales. The input feature map of the TAM generation module is pulled from the conv5 block. We employ the attention gating for the feature maps from conv3, and conv5 to conv9, respectively, and perform MultiBox text detection based on the refined feature maps. The overall loss function is a weighted sum of the segmentation loss (L seg ) and the detection loss (L det ): L =L seg +L det ; (4.2) where is set to 10 by cross-validation. At the very end of the network, the non-maximum suppression (NMS) technique is adopted to aggregate and post-process the predictions from different layers to get final detection results. 54 4.4 Experiments To verify the effectiveness of the proposed segmentation-aided text detector and com- pare it with existing methods, we conducted experiments on the most challenging bench- mark to date, i.e. COCO-Text [110]. 4.4.1 Datasets and Implementation Details The COCO-Text is the largest publicly available dataset for text detection and recogni- tion in natural images. The images are harvested from the Microsoft COCO dataset [69]. The full set of COCO-Text v1.4 annotated a total of 63,686 images with 145,859 text instances (training: 43,686/118,309, validation: 10,000/27,550, test: 10,000/no public annotations). The annotated text regions are classified into two categories according to legibility, which indicates whether a text can be read. Illegible texts are usually too blurry or far away from the viewpoint. In the smart robot applications, detecting legible texts are more of interest. 
Therefore, we selected a subset of the images that have at least one legible text instance and filtered the ground truth annotations to keep legible text instances only. The resulting training set has 14,324 images, and the validation set has 3,346 images. We will refer to this subset as “COCO-Text-Legible” while the original dataset is “COCO-Text-Full” in the rest of the chapter. We implemented the proposed segmentation-aided text detector in Tensorflow [1] version 1.0. We reused some code from a re-implementation of the original VGG-16- based SSD in Tensorflow 1 . We substituted the VGG base network with the ResNet-101 model 2 . Both the VGG-16 and the ResNet-101 model were pre-trained on the ImageNet 1 https://github.com/balancap/SSD-Tensorflow 2 https://github.com/tensorflow/models/tree/master/slim 55 dataset [25]. We also conducted experiments with the default boxes that have text- specific aspect ratios and vertically denser tiling strategy proposed in TextBoxes [66]. The training process has two stages. In the first stage, we trained the text attention map generation module for 50k iterations, using Adam [52] optimizer with a fixed learn- ing rate 10 5 . The second stage initializes the network from the trained network in the first stage. We added the text-attentional MultiBox detectors and randomly initialized their associated filter weights. We applied the polynomial decay with power 0.5 to the learning rate, and it decays from 10 5 to 10 7 in 100 epochs. We trained the overall network end-to-end for 270k iterations. In the training stage, data augmentation and hard negative example mining tech- niques as in [70] were used. At inference time, we resized the input images to 513513. The NMS threshold was 0.45, and the top 30 detections were kept. The total inference time is 0:3s per image on average. The experiments were conducted using a workstation with the following configura- tions: a single NVIDIA TitanX GPU, 3.5 GHz 6-Core CPU, 32GB RAM, and Linux 64-bit OS. 4.4.2 Experimental Results and Discussion Quantitative Results on COCO-Text-Legible Dataset. The performance of the pro- posed method on the COCO-Text-Legible dataset is shown in Table 4.1. By just using ResNet-101 in place of VGG-16 as the base network, there is already 4.17% improve- ment in the F-Score. We denote this network as ResNet-SSD as our baseline. By adding the text attention gating module, the recall of the proposed approach is signif- icantly improved (13.57% gain) over the baseline. Overall, the proposed text detector achieved 4.16% F-score gain over the baseline. 56 In ResNet-TextBoxes, default boxes with more text-specific aspect ratios and vertically denser tiling scheme as in TextBoxes [66] are used. With more default boxes per cell in the feature maps, both the recall and false detection rate increase, so the overall F-score is slightly higher thanResNet-SSD. Adding the proposed text attention gating mechanism boosts the performance by a remarkable margin, 3.67% in the recall, 8.09% in the precision, and 5.98% in the F-score. Quantitative Results on COCO-Text-Full Dataset. The performance of the pro- posed method on the COCO-Text-Full dataset is depicted in Table 4.2. Comparing with our baseline ResNet-SSD, which only gets 27.17% in the F-Score, adding the text attention gating mechanism makes it slightly better than the state-of-the-art method [121]. 
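The gating operation responsible for this gain (Sec. 4.3.2) is lightweight: the TAM is resized to the spatial size of a feature map and every channel is re-weighted element-wise. The following NumPy sketch illustrates the operation; the hand-rolled bilinear resize and the toy shapes are illustrative stand-ins, not the TensorFlow implementation described in Sec. 4.4.1.

import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Simple bilinear resize of a 2-D map (stand-in for the TAM resize)."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def text_attention_gate(features, tam):
    """Sec. 4.3.2: resize the TAM to (Hf, Wf) and multiply each channel by it."""
    Hf, Wf, _ = features.shape
    gate = resize_bilinear(tam, Hf, Wf)[..., None]   # (Hf, Wf, 1), values in [0, 1]
    return features * gate

# Toy usage with assumed sizes.
feat = np.ones((8, 8, 16))
tam = np.random.default_rng(2).random((32, 32))
print(text_attention_gate(feat, tam).shape)          # (8, 8, 16)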
The F-score ofResNet-TextBoxes is just comparable with the state-of-the- art method [121], while with the help of the proposed text attention gating mechanism, our model substantially outperforms it. Specifically, the recall is improved by 14.7%, and the overall F-score gain is 4.22%. This confirms the effectiveness of the proposed method. We also list the baseline methods from [110], but they are not directly compa- rable since they were used in the data annotation pipeline when the dataset was built. Table 4.1: Evaluations on COCO-Text-Legible validation set (in %) Models Recall Precision F-Score VGG-SSD 30.38 42.01 35.26 ResNet-SSD 34.42 46.14 39.43 ResNet-SSD + Proposed 47.99 39.93 43.59 ResNet-TextBoxes [66] 41.53 38.83 40.14 ResNet-TextBoxes [66] + Proposed 45.20 46.92 46.12 Qualitative Results. Some sample detection results of the proposed method are shown in Fig. 4.3. TAMs clearly highlight the text regions and thus yield higher detec- tion accuracy with reduced false positives. 57 Table 4.2: Evaluations on COCO-Text-Full validation set (in %) Models Recall Precision F-Score Yao et al. [121] 23.1 43.23 33.31 ResNet-SSD 35.4 31.03 27.17 ResNet-SSD + Proposed 40.7 28.59 33.57 ResNet-TextBoxes [66] 35.9 30.89 33.22 ResNet-TextBoxes [66] + Proposed 37.8 37.26 37.53 Baselines from [110] A 23.3 83.78 36.48 B 10.7 89.73 19.14 C 4.7 18.56 7.47 4.5 Conclusion We have presented a robust and efficient scene text detection system. Our proposed system combines the strengths of the segmentation network and the detection network in a unified end-to-end trainable network. A text attention map (TAM) is produced to robustly determine the probability of the presence of text in each spatial locations in the images. The input convolutional features of the detection network are re-weighted adaptively according to the response values in the TAM. The MultiBox detectors are applied onto the semantically refined features to predict and localize the bounding boxes for individual words. The experiments on the most challenging benchmark COCO-Text confirm that the proposed segmentation-aided text detector substantially outperforms previous methods. Without any post-processing and repeated computations, our system is more efficient and practical for intelligent robotics than previous methods. Possible future directions worthy of exploring include: 1) combining the current work with a text recognition module to build an end-to-end scene text reading system; 2) further optimizing the proposed method for the embedded systems; 3) extending the proposed idea to other detection problems, such as pedestrian detection. 58 ResNet-SSD ResNet-Textboxes TAM TAM Proposed Proposed Figure 4.3: Qualitative results on the COCO-Text validation set. The left and right columns show the detections without and with the text attention gating mechanism, respectively. The middle column shows the visualized text attention maps. (Best viewed in color.) 59 Chapter 5 Knowledge Transfer in Spatiotemporal Space — Multi-object Video Object Segmentation with Region-based Reasoning 5.1 Introduction Semi-supervised video object segmentation (VOS) is a pixel-wise tracking problem, where the mask annotations of the objects of interest in the first frame are given. It is a fundamental problem in video understanding and a core step for efficient video editing, which attracts emerging attention in the entertainment industry and computer vision research community. Existing literature on this topic focus on single object segmentation [9, 83, 81]. 
When there are multiple objects to segment, they have to execute the algorithm mul- tiple times to process and segment each object independently, which is lacks efficiency and scalability. In this work, we are interested in efficiently segmenting multiple objects in one pass. The major challenges include: 1) the number of objects is variable so that we can not construct a convolutional neural network (CNN) with an output layer with a predetermined number of output neurons; 2) instances of the same category are not 60 easily separable in the embedding space; 3) target objects at test time may be unseen during the training phase. The multi-object semi-supervised VOS is essentially a knowledge transfer problem in the spatiotemporal space and thus can be formulated as a meta-learning problem [5, 116]. The meta-task is label propagation over time, where we aim to learn propagating the masks from the first frame to the rest of the frames in a video, regardless of the number of objects or semantic category of the target objects. One popular formulation is to model the semi-supervised VOS problem as a pixel- wise retrieval problem [100, 113, 5, 127], where embedding of video frames are extracted and pixel-wise matching between query frame and reference frames are per- formed in the embedding space. The matching-based methods are simple and intuitive, however, dense matching is computationally expensive and incurs a large memory foot- print when computing the pairwise affinities. To reduce the computation complexity for the pixel-wise matching VOS algorithms, we propose to perform region-level matching instead of pixel-level matching by group- ing pixels first. Traditional superpixel segmentation algorithms [2] assign hard member- ship for each pixel, and thus they are not differentiable inherently. We design a system to group pixels in each frame into a set of regions represented as soft superpixels, and then we build a graph connecting the nodes representing these regions within a spatiotempo- ral neighborhood and perform region-level relation reasoning across frames via a graph neural network. The output node features are used to compute affinities between query nodes and reference nodes. The segmentation label of nodes in the query frame can be obtained via weighted voting based on the affinities. We then reuse the soft assignment scores in the pixel grouping step to distribute node labels to pixels to produce the final segmentation mask. 61 Graph-based methods are well exploited in the classical label propagation problem for the semi-supervised learning [131]. Recently, graph neural networks (GNNs) [53] become popular in graph-structured data modeling due to its compatibility with the model deep learning systems and expressiveness in learning complex representation. We hereby adopt GNNs to extract features for the node matching and facilitate the label propagation across frames in the video. The proposed system is fully differentiable and end-to-end trainable. At inference time, it involves only a feed-forward pass to produce segmentation masks for all the objects in the query frame. 5.2 Related Work Semi-supervised video object segmentation. In the past few years, many semi- supervised VOS algorithms proposed to rely on first frame fine-tuning at test time [9, 83, 62, 74]. Although they can be very accurate on the mask prediction, the runtime is impractical for many applications. 
Developing methods that do not require test time finetuning has been a recent trend, and meta-learning the object-agnostic embedding matching is one popular formulation. For example, PML [100] uses a pixel-wise embed- ding learned with a triplet loss together with the nearest neighbor classifier. Video- Match [41] uses a soft matching layer and searches for thek nearest neighbors in pixels from the first frame for each pixel in the current frame in a learned embedding space. Our method also falls into the category of embedding matching, but it differs from the previous works by extracting embeddings from regions instead of pixels. Please refer to [122] for a more comprehensive survey on this topic. Multi-object video object segmentation. Many VOS algorithms are designed only for single object segmentation. Recent methods that have tested on the multi-object VOS 62 benchmarks [85, 117] process each object independently by running the model multiple times and heuristically merge the predicted masks at the end [41, 81]. FEELVOS [113] managed to share the heavy feature extraction process across objects but still have to run the segmentation head for each object separately. [5, 127] can systematically segment multiple objects in one forward pass. When segmenting the query frame, TVOS [127] performs expensive pairwise matching inference over query pixels and all the pixels from a large number of reference frames; MetaVOS [5] performs matching between query pixels and cluster centroids extracted from the reference frames. Our method effectively performs matching for soft superpixels in the query frame and reference frames, which greatly reduces the complexity of the matching process. Superpixels as object representation. Superpixels are an over-segmentation of an image by grouping pixels based on low-level image properties [91]. It can drasti- cally reduce the number of primitives for later processing steps and thus becomes an established mid-level image representation which is widely used in computer vision algorithms. For VOS, some methods [31, 119, 60] use superpixel as the object represen- tation, where they generate a coarse segmentation via superpixel algorithms, infer about the motion by analyzing the superpixel segmentation changes in consecutive frames, and refine the initial segmentation by optimizing an energy function. However, they do not use deep learning frameworks and thus have inferior performance. As Jampani et al. [47] pointed out, the reasons why superpixels are seldom used in conjunction with deep learning framework are two-fold. First, the standard convolution operation is defined over regular grid lattices and becomes inefficient when operating over irregular superpixel lattices. Second, classical superpixel algorithms are non-differentiable and thus prohibits end-to-end optimization for optimal performance. SSN [47] was recently proposed as a differentiable replacement of the classical SLIC [2] superpixel algorithm to address the second issue. In this work, we not only extend the application of SSN [47] 63 to the VOS task to improve the efficiency of the system, but also propose to use graph convolutional networks to effectively learn over the resulting irregular superpixel lat- tices. Graph neural networks. Graph neural networks (GNNs) are deep-learning-based methods that operate on graph data, which are usually defined in a non-euclidean space. GNNs can capture the dependence of graphs via message passing between the nodes of graphs. 
GNNs can be seen as a generalization of convolutional neural networks (CNNs). The general idea of each layer of GNNs is to collectively aggregate information from neighbor nodes, transform the aggregated information, and update the node embeddings. In recent years, variants of graph neural networks such as graph convolutional network (GCN) [53], graph attention network (GAT) [111], and GraphSAGE [35] etc. have demonstrated ground-breaking performance on standard graph learning tasks such as graph classification, node classification, and link prediction. GNNs for pixel-wise prediction. Recently, GNNs have been recently adopted to exploit long-range contextual information for pixel-wise prediction tasks[115, 63, 16, 124]. Non-local networks [115] compute the response at a spatial position as a weighted sum of the features at all positions; the cost to compute the affinity matrix grows quadratically as the number of pixels in the image. [63, 16, 124] propose to model the dependencies between regions of the image rather than individual pixels. They aggregate and project features from the original coordinate space to a node space with lower-dimensional intermediate representation, perform message passing via graph convolutions in this space, and then reproject the resulting features back onto the origi- nal coordinate space. They are only designed for and tested on a single image semantic segmentation task. To the best of our knowledge, GNNs have not been applied to the VOS task in the literature. We draw inspiration from these image segmentation works 64 and design a novel GNN-based framework for VOS. We treat soft superpixels as graph nodes and perform graph convolutions in the spatiotemporal space. 5.3 Method 5.3.1 Overview References Query Feature Extraction + Pixel to Node Projection Spatio-temporal Graph Query Node Label Prediction Affinity Computation reference nodes query nodes GNN Update Soft Pixel-to-node Assignment Map Node to pixel Re-projection t t-1 Query Segmentation Soft SuperpixelRegions Figure 5.1: Overview of the proposed system. We present our video object segmentation algorithm by formally defining the semi- supervised video object segmentation problem first. Similar to the set-up in the MetaVOS [5], we formulate the semi-supervised video object segmentation as a pixel- wise few-shot learning problem. The meta-learning objective is to learn model param- eters on a number of tasks (videos), which are sampled from the distribution ofp(T ) of meta-training tasks, such that the learned model can generalize and perform well on an unseen task (test video). In each task, fort-th frameI t in the video, we aim to pre- dict the instance label for all the pixels (query set), given the ground truth labels of the pixels (support set) in the first frameI 0 . To take the temporal appearance variation into 65 consideration, we may consider the pixels with inferred labels in the previous frames fI 1 ; ;I t1 g in the support set as well. During the meta-training phase, we sample a video sequence from the training set and we sampleM consecutive framesI tM+1 ;I tM+2 I t . The support set consists of labeled pixels infI tM+1 ;I t1 g, where the query set consists of pixels in the current frameI t . Suppose the spatial resolution of the input video frames is HW , we first feed the input frames to a convolutional backbone network to obtain the 2D feature maps F 2 R MH 0 W 0 C where C is the dimension of the feature vector, H 0 W 0 is the spatial dimension of feature maps. 
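As a point of reference for the feature maps F described above, the snippet below extracts per-frame feature maps for M frames with an off-the-shelf ResNet trunk. Note that the backbone actually used later in this chapter is a modified ResNet-50 with an output stride of 4, so the unmodified torchvision trunk here is only an illustrative stand-in.

```python
import torch
import torchvision

# Fully convolutional ResNet-50 trunk (global pooling and classifier removed).
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2])

frames = torch.randn(4, 3, 256, 256)    # M = 4 frames of a clip
with torch.no_grad():
    feature_maps = backbone(frames)     # F with shape (M, C, H', W') = (4, 2048, 8, 8)
```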
Suppose there areN objects to be segmented in the video. For theM 1 support frames, we apply one-hot encoding to the segmentation masks to obtain reference label mapsy ref 2 [0; 1] (M1)H 0 W 0 (N+1) , where N + 1 represents N object instances and the background. As Fig. 5.1 illustrated, the full pipeline of predicting segmentation mask for the query frame consists of four modules: Pixel-to-node projection. For each frame, we groupH 0 W 0 “pixels” in the input feature maps into K nodes via iteratively computing a pixel-node association matrix Q. The assignment is soft and the assignment process is designed to encourage pixels from a coherent region to be assigned to the same node. Pixel featuresF m 2R H 0 W 0 C are aggregated within each node and form the node fea- turesZ m 2R KC 0 for the frameI m , wherem =ftM+1; ;tg. Additionally, for the reference frames (m<t), pixel labelsy ref are also aggregated within each node based onQ to obtain reference node labelsY ref 2 [0; 1] (M1)K(N+1) , which is a probability vector representing the likelihood of object instance for the underlying image region associated with the node. Spatiotemporal graph convolution. We construct a graphG = (V;E) by con- necting nodes within the spatiotemporal volume of input frames, and perform 66 graph convolutions onG by propagating the node featuresZ over the edges of the graphG. We stack multiple convolution layers with nonlinear activation func- tions to improve the expressiveness of the model. The output of this module is the transformed node features ^ Z. Query node prediction. For each query node, we compute an affinity score between it and all nodes from the reference frames based on the transformed node features ^ Z, and then we propagate instance labelsY ref from reference nodes to the query nodes via weighted voting based on the affinity scores. The output of this module is the query node labelY q 2 [0; 1] K(N+1) . Node-to-pixel reprojection. In this module, we reuse the pixel-node assign- ment matrix Q to reproject the query node labels Y q back to the pixel space via linear interpolation, so that we can produce segmentation probability map ^ y q 2 [0; 1] H 0 W 0 (N+1) for the query frame. We then subsequently upsample the segmentation probability map to the original image resolutionHW and finally produce the segmentation mask for the query frameI t . We will describe each module in detail in the following subsections. 5.3.2 Pixel-to-node Projection: Differentiable Feature Aggregation In the pixel-to-node projection module, we aim to design a differentiable module that is able to aggregate pixel features into region features to reduce the number of primitives in the subsequent process. Object motion in the consecutive frames usually have strong spatial correlations, so we would like to retain the spatial locality in the node. We adapted the differentiable SLIC [47] to iteratively grouping the pixels into regions in a bottom-up fashion. 67 The original SLIC [2] algorithm performs ak-means clustering on image pixels in a five-dimensional position and color space (XYLab). Given the backbone CNN feature maps F 2 R MH 0 W 0 C , we convert the input RGB frames to the Lab color space and downsample them to match the spatial dimension ofF , and build a 2-dimensional position feature map consisting of the x and y coordinates of the cells on the feature map. We concatenate theF with the scaled color feature and position feature along the channel dimension to form the input to the projection module, denoted by ^ F . 
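A sketch of assembling the clustering input described above is shown below; the scale factors applied to the colour and position channels are hyperparameters that are not specified in the text, so the values here are placeholders, and the RGB-to-Lab conversion is assumed to happen beforehand.

```python
import torch
import torch.nn.functional as F

def build_clustering_input(feat, frames_lab, color_scale=0.25, pos_scale=0.25):
    """Concatenate backbone features with scaled Lab colour and xy-position maps.
    feat: (M, C, H', W') deep features; frames_lab: (M, 3, H, W) frames in Lab space."""
    M, _, Hp, Wp = feat.shape
    # Downsample the colour channels to the feature-map resolution H' x W'.
    lab = F.interpolate(frames_lab, size=(Hp, Wp), mode="bilinear",
                        align_corners=False)
    # Per-cell x and y coordinates as a 2-channel position map.
    ys, xs = torch.meshgrid(torch.arange(Hp), torch.arange(Wp), indexing="ij")
    pos = torch.stack([xs, ys]).float().expand(M, 2, Hp, Wp)
    return torch.cat([feat, color_scale * lab, pos_scale * pos], dim=1)
```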
Color feature and position features serve as the low-level cues for the clustering process. The detailed steps to compute the pixel-to-node association matrixQ is described as follows: We first evenly divided the input feature maps into a rectangular grid ofK cells. We initialize the superpixel centers S 0 2 R K(C+5) with the average features within the grid cells. We then iteratively compute the pixel to superpixel association and update the superpixel centers. In thei-th iteration, we calculate the pixel to superpixel associationQ2 R H 0 W 0 K by computing the L 2 distance between pixels features and superpixel centers: Q pk =e k ^ FpS i1 k k 2 2 : (5.1) We then update the superpixel centers as the weighted sum of pixel features: S i k = 1 P p Q pk H 0 W 0 X p=1 Q pk ^ F p : (5.2) Following [47], we constrain the distance computations in Eq. 5.1 from each pixel to only 9 surrounding superpixels. This brings down the computations in Eq. 5.1 from H 0 W 0 K toH 0 W 0 9, making it efficient in terms of both computation and memory. This enforces a local spatial constraint on which pixels can be assigned to the same node, which may prevent the spatial information of the resulting nodes from diluting. 68 After v iterations, the final superpixel centersS v consists of aggregated deep fea- tures, color features and position features. We store the projection matrix Q so that we can find the reverse mapping from nodes to pixels later. We define the node fea- turesZ as theC-dimensional deep features inS v , while the 2-dimensional aggregated potion feature inS v are considered as the spatial coordinates of the nodes. The sptial coordinates of the nodes weighted sum of pixel coordinates: For support frames, we additionally compute the node labelsY ref 2 [0; 1] (M1)K(N+1) with Y k ref = 1 P p Q pk H 0 W 0 X p=1 Q pk y p ref : (5.3) Note that our projection module is different from the related work GCU [63] or GloRe [16], which parameterize the projection module with some globally learnable parameters. The intent is to globally consistent semantics for each node, which is defined by the training data. However, this is not suitable for the multi-object VOS tasks, as the test objects may be unseen from the training set, so there is no such thing as a global dictionary of node features that can generalize well. 5.3.3 Spatiotemporal Graph Reasoning Upon obtaining the aggregated region features in each frame independently, we build a graph connecting the regions across frames so that we can perform holistic reasoning in the spatiotemporal space. We build a weighted undirected graphG = (V;E) for the set of consecutive frames, whereV denotes the set of vertices andE denotes the set of edges connecting vertices. V consists ofMK nodes, representingK regions on each of theM consecutive frames. We assign a 3D coordinatep = ( m m;x;y) to node with spatial position (x;y) inm-th frame, wherem = [0; 1; 2; ;M1],x;y2R and 0xW 0 ; 0yH 0 , and m 69 t-1 t t+1 Figure 5.2: Building a graph based on spatial-temporal 3-D postions of the nodes. Con- sider a nodeV i , as long as the 3D euclidean distance to the other nodeV j is within the predefined radius , we add an edge between them. The added edge can be a spatial edge (shown in magenta) or a temporal edge (shown in cyan). is a scale factor to control the proximity scale in the temporal dimension. We compute the Euclidean distance between nodeV i andV j asd ij =kp i p j k 2 . As illustrated in Fig. 
5.2, we create edges based on the node position to all other nodes within a given distance radius. The weightw ij for edgee ij 2E is defined by applying a Gaussian kernel on thed ij such that the distant neighbors have a smaller impact on the center node in the message passing process. Formally, we have w ij = 8 > > < > > : exp(d 2 ij =); d ij 0; d ij > : (5.4) Once the graphG is built, we feedG into a graph neural network. The architecture of the GNN is depicted in Fig. 5.3. There are 3 layers of graph convolution layers followed 70 GraphSAGE Conv ReLU GraphSAGE Conv ReLU GraphSAGE Conv ReLU + + Linear 256-d 256-d 768-d concat Figure 5.3: GNN architecture. by ReLU [77] activation function. We use a modified GraphSAGE [35] operator in the graph convolution layer, denoted by GConv(), where the transformation function is defined as ^ X i =W 1 X i +W 2 1 P j2N (i) w ij X j2N (i) w ij X j ; (5.5) where W 1 and W 2 are learnable parameters, X and ^ X are input and output node features, respectively, andN (i) denotes the set of immediate neighbors of nodei. Each layer ofGConv() can be considered as a smoothing or diffusion process over a local neighborhood in the spatiotemporal space, therefore applying a GNN on the spatiotemporal graph can encourage spatially and temporally consistent feature repre- sentation for the regions from the same object instance. We add skip connections to each of GConv-ReLU blocks to improve gradients flow in the back-propagation. We collect and concatenate the intermediate responses from each of the GConv-ReLU blocks and apply a learnable linear transformation to merge the intermediate responses. The final output is then transformed node features ^ Z. 5.3.4 Query Node Prediction After the node features are transformed by the GNN, we expect the node features from the same object to be temporally stable. We compute pairwise affinity scoresA between query nodes and reference nodes such thatA ij = exp( ^ Z T i ^ Z j ), wherei;j are the node index in support set and query set, respectively. 71 The query node label can be thus predicted as a weighted average of reference node labels: Y i q = X j A ij P k A ik Y j ref (5.6) 5.3.5 Node-to-pixel Reprojection and Segmentation Mask Predic- tion The final step to produce segmentation mask for the query frame is to reproject the query node labelsY q back to the pixel space. As suggested in [47, 63, 16], this can be done by reusing the projection matrixQ obtained in Sec. 5.3.2: ^ y p q = 1 P k Q T kp K X k=1 Q T kp Y k q ; (5.7) where ^ y p q 2 [0; 1] N denotes predicted label the query pixel p. We linearly interpolate pixel labels based on their region assignments and does not have any parameters. To obtain the final segmentation mask, we upsample they q to the original image resolution via bilinear interpolation and assign the object label to each pixel by taking a arg max over the predicted probability vector. 5.3.6 Training Procedure We train the proposed model with multiple losses. The main task loss is pixel-wise cross-entropy loss for pixels in the query frame: L seg = X i logP ^ y i q =y i q jx i q : (5.8) In order to encourage the superpixel clustering process respect the semantic bound- ary, we impose a reconstruction loss for the reference frames. Specifically, we reproject 72 the reference node labelsY ref back to the pixel space using the projection matrixQ via Eq. 
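Before using these affinities for voting, the graph construction of Eq. 5.4 and the modified GraphSAGE update of Eq. 5.5 can be sketched as follows. The dense adjacency representation is an illustrative simplification; the actual implementation relies on PyTorch Geometric.

```python
import torch
import torch.nn as nn

def gaussian_edge_weights(pos, radius, sigma):
    """Eq. 5.4: edges between nodes whose 3-D positions (gamma*m, x, y) lie
    within `radius`, weighted by a Gaussian kernel; pos has shape (M*K, 3)."""
    d = torch.cdist(pos, pos)
    w = torch.exp(-d.pow(2) / sigma)
    w[d > radius] = 0.0          # no edge beyond the distance radius
    w.fill_diagonal_(0.0)        # the centre node is handled by the W1 term below
    return w

class WeightedSAGELayer(nn.Module):
    """One modified GraphSAGE layer (Eq. 5.5):
    X_hat_i = W1 X_i + W2 (sum_j w_ij X_j) / (sum_j w_ij)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)
        self.w2 = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, w):
        # Weighted average over each node's neighbourhood, then linear transforms.
        neigh = (w @ x) / w.sum(dim=1, keepdim=True).clamp(min=1e-8)
        return self.w1(x) + self.w2(neigh)
```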
5.7, and compute the the pixelwise cross-entropy loss between the reprojected pixel label and the groundtruth reference pixel labels: L recon = X i logP ^ y i ref =y i ref jx i ref : (5.9) The intuition here is that the network will fail to recover the pixel labels after the projection-reprojection cycle if purity of the superpixel cluster is low. Following [47], we also use a compactness loss to encourage the superpixels to be spatially compact i.e., to have lower spatial variance inside each superpixel region. Let I xy denote the pixel position feature, and S xy denote the aggregated superpixel position feature, we reproject theS xy back to the pixel space by using a hard assignment matrix derived from Q. Specifically, we determine the cluster membership of each pixel by taking the superpixel with the highest association score; we then assign the superpixel positional feature to all the pixels belonging to that superpixel, i.e., ^ I p xy = S k xy j arg max k Q pk = k. The compactness loss is a L 2 loss between original pixel position feature and the reprojected one: L compact = I xy ^ I xy 2 : (5.10) This loss would regularize the superpixel generation process to improve the chance of having region correspondences across the frames, which can reduce the difficulty in node matching. The total loss to optimize during training is the sum of all the losses defined above: L =L seg +L recon +L compact (5.11) 73 5.3.7 Inference Procedure At inference time, we are given videos with only the first frame annotated. We pro- gressively adding frames we have processed with predicted labels to the support set to facilitate the prediction on the current frame. During the online processing, We retain the node features a sliding window ofM 1 previous frames. When we are processing t-th frame, we run the backbone network, pixel-to-node projection module, and then construct the spatiotemporal graph with frametM + 1; ;t on-the-fly, the GNN- transformed feature ^ Z t for the framet is cached for transductive inference [127]. The transductive inference means when predicting the node labels for the frame t, we go beyond the temporal window of M frames and consider sampling reference frames from all the frames we have processed so far. As [127] suggests, the transductive inference can take accounts of long-term object appearance change over time and boost the segmentation performance. We adopt the sampling strategy from [127] and sample a total of 9 frames from the preceding 40 frames: the 4 consecutive frames before the target frame, and 5 more frames sparsely sampled from the remaining 36 frames. All the nodes with cached node features and inferred node labels from the sampled frames are considered as reference nodes when predicting the node labels for the current frame. The spatial smoothness term in TVOS [127] is also adopted to improve the segmen- tation quality, and affinity matrixA defined in Sec. 5.3.4 is scaled by a spatial term such thatA ij =A ij exp kpos(i)pos(j)k 2 2 s , wherepos(i) is the spatial position of nodei, s is a locality parameter. The node label for the current frame is predicted by Eq. 5.6 with A replacingA. Following TVOS, we use a smaller s for adjacent reference frames and larger s for the distant reference frames to incorporate the motion prior that nodes that are more distant in the temporal dimension have weaker spatial dependencies. 
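To summarise the label propagation pipeline used in both training and inference, the snippet below implements the weighted voting of Eq. 5.6 followed by the node-to-pixel reprojection of Eq. 5.7; the spatial smoothness scaling and the transductive reference sampling described above are omitted from this sketch.

```python
import torch

def propagate_labels(z_query, z_ref, y_ref, q_query):
    """z_query: (Kq, C') query node features, z_ref: (Kr, C') reference node
    features, y_ref: (Kr, N+1) reference node labels, q_query: (P, Kq) soft
    pixel-to-node assignments of the query frame, with P = H' * W'."""
    # Eq. 5.6: A_ij = exp(z_i . z_j), normalised over reference nodes
    # (written as a softmax for numerical stability).
    affinity = torch.softmax(z_query @ z_ref.t(), dim=1)        # (Kq, Kr)
    y_query_nodes = affinity @ y_ref                            # (Kq, N+1)
    # Eq. 5.7: linear interpolation of node labels back to pixels via Q.
    weights = q_query / q_query.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return weights @ y_query_nodes                              # (P, N+1)
```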
Note that our method has a more memory-efficient advantage over TVOS as our system performs transductive inference in the node space rather than pixel space. During 74 the online inference, TVOS has to store the pixel feature history for all the frames in the past, which is incurs aO(HW ) order of memory requirement, but our method brings it down to O(K). Our method thus has greater scalability in processing longer video sequences. 5.4 Experiments In this section, we present the experimental results and analyze the components of the proposed VOS system. 5.4.1 Datasets and Evaluation Metrics We test and validate the proposed method on the challenging multi-object VOS bench- mark DA VIS 2017 [85]. DA VIS 2017 contains 150 video sequences and it involves multiple objects (ranging from 1 to 5, with an average of 2) with drastic deformations, heavy and prolonged occlusions, and very fast motions. Object annotations are available for all frames. Our model is trained on the training set (60 videos) and evaluated on the validation set (30 videos) using the frames at 480P resolution. The evaluation metrics are standard evaluation metrics defined by DA VIS bench- mark [84], including the mean intersection-over-union (mIoU) between the predicted and the ground truth segmentation masks, denoted byJ , the contour accuracyF, and the average ofJ andF, denoted byJ &F. 5.4.2 Implementation Details Our backbone feature extractor network is a modified ResNet-50 [37] and we extract the feature maps with stride 4 by replacing the pooling operation with dilated convolution 75 in the last 3 residual blocks to maintain a high-resolution output. The ResNet-50 is pre- trained on ImageNet [24] and other parameters in the model are randomly initialized. The GNN layers all have a hidden dimension of 256. In the pixel-to-node projection module, we apply 5 soft superpixel clustering iterations. We train our model on 2 NVIDIA TITAN X GPUs with a total batch size of 4 and evaluate using a single GPU. Due to GPU memory constraint, we first train the backbone network without GNN layers by directly predicting the query node labels based on aggregated node featuresZ, optimizing the full loss L in Eq. 5.11 for 200 epochs. We then freeze the weights in the backbone network and add the GNN layers and only optimize the GNN layers for 20 epochs, optimizing the primary task lossL seg only. Following TVOS [127], we apply random flipping and random cropping of size 256 256 on the input images for data augmentation in the training. We add another form of augmentation by randomly pick the number of frames (M is less than or equal to the number of input frames in a batch) to build the graph. We use an SGD solver with an initial learning rate of 0.02 and a cosine annealing scheduler. The system is implemented with PyTorch [82] and PyTorch Geometric [29] libraries. 5.4.3 Experimental Results and Analyses Quantitative results. We summarize the experimental results in Table 5.1. We compare the representative methods that train and report results on DA VIS 2017 [85] from the past 3 years. The “one-shot” column denotes whether the method supports segmentating multiple objects in one forward pass. FEELVOS [113] and TVOS [127] are the leading methods that do not require extra training data. In the original paper, their models are trained with dozens of GPUs on an industrial computation cluster with a large batch size to stabilize training. 
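The local training setup described in Sec. 5.4.2 can be summarised in a few lines; the momentum value and the exact composition of the transforms are assumptions, and in practice the random flip and crop must be applied consistently to frames and masks.

```python
import torch
import torchvision.transforms as T

def training_setup(model, epochs):
    """SGD with initial learning rate 0.02 and cosine annealing, plus random
    horizontal flipping and 256 x 256 random cropping for augmentation."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    augment = T.Compose([T.RandomHorizontalFlip(), T.RandomCrop(256)])
    return optimizer, scheduler, augment
```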
For a fair comparison, we retrained their models using the official implementation with 2 GPUs on a local desktop, the results are denoted by the mark. 76 Table 5.1: Quantitative results on the DA VIS 2017 validation set. Methods Base Net One-shot J F J &F Modulation [120] VGG 7 52.5 57.1 54.8 VideoMatch [41] ResNet-101 7 56.5 68.2 62.4 RVOS [112] ResNet-101 3 48.0 52.6 50.3 FEELVOS [113] Xception65+ 7 65.9 72.3 69.1 TVOS [127] ResNet-50 3 69.9 74.4 72.3 FEELVOS Xception65+ 7 46.5 50.5 48.5 TVOS ResNet-50 3 63.7 68.9 66.3 MetaVOS [5] ResNet-101 3 63.9 70.7 67.3 Ours - w/o GNN ResNet-50 3 63.1 67.2 65.2 Ours ResNet-50 3 63.5 67.7 65.6 FEELVOS and MetaVOS [5] finetune from the models that has been trained on image segmentation datasets, but TVOS and ours simply use the generic ImageNet pretrained model. Computation complexity analysis. Our method aggregates the pixels features into regions’ features and performs the reasoning and matching on the region level, which greatly reduces the computation required during the inference. In the step of comput- ing the affinity scores between queries and reference, TVOS computes 9H 2 W 2 affinity scores, whereHW is the spatial resolution of the feature map the base network pro- duces, 9 is the number of reference frames sampled in the transductive inference step; our method computes 16 times smaller number of such scores. The size of this affin- ity matrix also sets the peak of memory consumption in the inference process, so our method is memory efficient as well. We may further reduce the number of regions to extract to increase the saving with bearable performance loss. Our proposed method achieves comparable accuracy with the state-of-the-art meth- ods with considerable less computation and memory requirement, which is a practical solution for resource limited environment. 77 Groundtruth Ours TVOS FEELVOS Figure 5.4: Qualitative results on the india sequence. Qualitative results. We show the qualitative results from a few sequences from the DA VIS 2017 validation set in Fig. 5.4 - Fig. 5.7. We see that the proposed method can track the masks of multiple objects robustly in various challenging scenarios. Its performance is on-par with the state-of-the-art method with the same hardware assets. Discussion on the pixel-to-node projection module. The pixel-to-node projec- Table 5.2: Effect of incorporating low level LABXY feature for the region clustering. Features for Clustering J F J &F Deep features 59.7 63.4 61.55 Deep features + LabXY 63.1 67.2 65.2 tion operates on the deep feature maps generated by the backbone network, which is 78 Groundtruth Ours TVOS FEELVOS Figure 5.5: Qualitative results on the gold-fish sequence. essentially a fuzzyk-means clustering process. It is critical to enforce the extracted soft region boundary to respect the semantic contour of the objects so that we can recover the boundary information in the reprojection stage. We found it is difficult to achieve this by using the deep features only in the clustering process due to the coarse resolution of the feature map. Thus we propose to use the low-level color feature and position feature in conjunction with the deep features. Table 5.2 shows that LabXY features considerably improve the final segmentation quality in terms of both region accuracyJ and contour accuracyF. We visualize the hard superpixelfor both cases by marking the change hard superpixel assignment for adjacent pixels in Fig. 
5.8, and we can see the superpixels 79 Groundtruth Ours TVOS FEELVOS Figure 5.6: Qualitative results on the breakdance sequence. obtained by both methods smoothly follow the semantic boundary, but the superpix- els obtained using deep features and LabXY are more coherent for the homogeneous regions, which leads to more robust matching in the later stage. Discussion on the GNN module. Though theoretically appealing, the overall improvement of having GNN to perform message passing over spatiotemporal space to refine the region features is marginal. We compare the per-sequence mIOU to further investigate this. The results are shown in Table 5.3. We notice that the GNN module generally helps differentiate the different instances of the same category of objects, such as in the sequence breakdance, the model with GNN is less confused about differenti- ating the foreground dancer and the background audience people; wherein the sequence 80 Groundtruth Ours TVOS FEELVOS Figure 5.7: Qualitative results on the judo sequence. gold-fish, lab-coat and pigs, the foreground objects are of the same semantic category. This is due to the smoothing effect of graph convolution, which encourages consistency of intra-instance feature representation spatially and temporally, and increases inter- instance feature variance accordingly. We attribute limited performance gain by the GNN module to the following two reasons: 1) the DA VIS-2017 contains 60 training videos only, which is considered as a small dataset for the VOS task now [117]. In the stage of training without GNN module, the model has fitted the training data well. We observe an average segmentation loss as low as 2 10 4 at the end of first stage training. When we later freeze the backbone network and plug in the GNN module for training, such training loss can not provide meaningful gradients to optimize the GNN module. The extra parameters the GNN module introduces may increase the risk of model overfitting on the training data as well; 2) the fact that we do not perform end-to-end training of the system due to the computational resource limitation makes the backbone layers and GNN layers not 81 optimally coordinated for the end task. We believe that training the full model end-to- end with more training data may reveal the full potential of the proposed method. 5.5 Conclusion We have presented a semi-supervised video object segmentation system that achieves comparable performance with state-of-the-art methods when trained locally but with considerably fewer computations and less peak memory usage. We reduce the number of units to be processed by aggregating the pixel features into soft superpixel regions, which enables us to apply the graph neural networks to the video object segmentation task for the first time. We perform region-level reasoning in the spatiotemporal graph, which effectively exploits both spatial and temporal correlation for the object motion in a video sequence. Our preliminary results show the potentials of the application of GNNs in the dense prediction task and encourage future research in this direction. 
The future direction of improvement of this work include: 1) optimizing the region feature extraction module by using some non-iterative but differentiable superpixel algo- rithms, such as [118]; 2) taking the uncertainty of inferred reference masks at the test time into account, and trying modeling it during training to reduce test dime error accu- mulation; 3) investigating the effect of using different variants of GNNs in the video reasoning task and develop a task-specific GNN to further improve the segmentation quality. 82 (a) sample frames from the dogs-jump sequence (b) sample frames from the pigs sequence Figure 5.8: Visualization of the extracted regions (superpixels) and the corresponding segmentation results. In each example, the first row shows the results by using deep fea- tures only, the second row shows the results by using deep features and LabXY features. Note these results are from the model without the GNN. 83 Table 5.3: Per-sequence mIoU comparison for the model with and without GNN. Sequence Name mIoU with GNN mIoU without GNN mIoU Gain bike-packing 60.03 63.30 -3.27 blackswan 93.94 94.27 -0.33 bmx-trees 44.84 46.54 -1.70 breakdance 70.45 64.27 6.18 camel 72.32 71.83 0.49 car-roundabout 80.75 79.89 0.86 car-shadow 62.95 66.08 -3.13 cows 90.78 90.66 0.12 dance-twirl 62.64 62.15 0.49 dog 84.56 83.95 0.61 dogs-jump 69.48 67.57 1.91 drift-chicane 32.13 29.95 2.17 drift-straight 60.41 58.23 2.17 goat 81.47 79.56 1.90 gold-fish 78.34 74.30 4.04 horsejump-high 71.95 77.23 -5.28 india 66.29 66.79 -0.50 judo 76.61 76.37 0.23 kite-surf 41.58 46.62 -5.04 lab-coat 47.61 44.40 3.21 libby 81.52 83.56 -2.04 loading 63.64 66.84 -3.20 mbike-trick 58.95 57.52 1.43 motocross-jump 46.81 47.86 -1.05 paragliding-launch 51.14 51.41 -0.27 parkour 90.05 89.70 0.35 pigs 77.21 75.46 1.76 scooter-black 53.11 56.13 -3.02 shooting 57.99 52.46 5.53 soapbox 53.95 52.25 1.71 84 Chapter 6 Conclusion and Future work 6.1 Summary of the Research In this dissertation, we studied three important aspects of visual knowledge transfer with deep learning techniques: 1) incremental learning (IL) for image classification and object detection; 2) unsupervised domain adaptation (DA) for semantic segmentation; 3) segmentation-aided text detection; 4) semi-supervised multi-object video object seg- mentation. To overcome catastrophic forgetting in the IL, we proposed a novel paradigm called Deep Model Consolidation (DMC). The idea is to train a model on the new data, and then combine the two individual models trained on data of two distinct set of classes (old classes and new classes) via a novel dual distillation training objective. The two existing models are consolidated by exploiting publicly available unlabeled auxiliary data. This overcomes the potential difficulties due to the unavailability of original training data. DMC demonstrated the state-of-the-art performance in image classification and object detection benchmarks in the IL setting. To solve the domain adaptation for urban scene segmentation, we developed a fully convolutional tri-branch network, where two branches assign pseudo labels to images in the unlabeled target domain while the third branch is trained with supervision based on images in the pseudo-labeled target domain. The re-labeling and re-training processes alternate. With this design, the tri-branch network learns target-specific discrimina- tive representations progressively and, as a result, the cross-domain capability of the 85 segmenter improves. 
We evaluated the proposed network with large-scale experiments using synthetic-to-real datasets and our method outperformed previous methods by a significant margin. To build a robust and efficient text detector, we presented an elegant segmentation- aided text detection solution that predicts the word-level bounding boxes using an end- to-end trainable deep convolutional neural network. It exploits the holistic view of a segmentation network in generating the text attention map (TAM) and uses the TAM to refine the convolutional features for the MultiBox detector through a multiplicative gat- ing process. Verified by experimental results, our method improved detection robustness and efficiency over the prior arts substantially. To efficiently exploit spatiotemporal dependencies in video frames to perform a multi-object video object segmentation task, we proposed an efficient and fully- differentiable deep neural network to perform region-based reasoning. It aggregates the pixel features into soft superpixel region features and models the high-level region interaction by building a spatiotemporal graph, and performing region-level reasoning and feature refinement via a graph neural network. The proposed system is evaluated on a challenging benchmark and we managed to improve the computation and memory efficiency over the prior methods without loss of accuracy. 6.2 Future Directions 6.2.1 Incremental Learning of Object Detectors Object detection is a crucial task when machines want to understand an image beyond what the image contains, as it helps to understand how multiple objects in images inter- act with each other. Being capable of incrementally training an object detector has many interesting and useful applications. Although the method proposed in Chapter 2 shows 86 some preliminary but promising results on the object detection benchmark, we did not consider the unique characteristics and challenges of this task too much. There is still a huge performance gap between the incrementally trained detector and jointly trained oracle detector (more than 6% mAP as shown in Table 2.3). Object detection is inherently a compound task with two subtasks: object recogni- tion and object localization. Object recognition is to classify the object within a partic- ular region of interest on the image. Object localization is a regression task that aims to find the exact locations of existing objects and produce coordinates of the desired bounding box. Due to the efficiency consideration, the modules responsible for these two subtasks are usually decoupled in a deep neural network, and the object localization module is usually designed to be class-agnostic. In the incremental learning setting, we have some images of novel objects with bounding box annotations. It is a classi- cal class-incremental-learning problem for object recognition module; however, it is not an incremental learning problem for object localization module, as the regression task remains the same, but the input distribution may change. In other words, incrementally training an object detector means performing incremental learning of object recognition module and domain adaptation of object localization module jointly. Two important research problems arise from incremental learning of object detectors that we would like to explore: 1. How can we perform domain adaptation without forgetting for the object local- ization module? 
Domain adaptation without forgetting in the source domain in general has been rarely studied in the literature. Many ideas for overcoming the catastrophic forgetting in incremental learning and online learning can be adopted and further developed. 2. How can we jointly optimize for the two goals: incremental learning of object recognition module and domain adaptation of object localization module? 87 6.2.2 Domain Adaptation via Unsupervised Disentangled Repre- sentation Learning In Chapter 3, we study the unsupervised domain adaptation problem by leveraging tri- training and curriculum learning. One limitation of this approach is that the performance saturates soon when we are not able to mine meaningful pseudo-labeled samples for re- training. We would like to explore another research direction of domain adaptation — disentangled representation learning. The motivation of disentangled representation learning is the growing popularity of the image-to-image (I2I) translation techniques [45, 133, 134] in the domain adaptation community [8, 102, 96, 76, 43]. The general framework for I2I-based domain adaptation has two equivalent directions: 1. (a) Transfer the images from a source domain to the target domain, such that the semantic content is preserved while the style is target-like (b) Using the generated target-like images together with their class label as the training data, train a classifier that performs well in the target domain 2. (a) Transfer images from the target domain to the source domain, such that the semantic content is preserved while the style is source-like (b) Train a classifier on source domain images in the supervised learning fashion and test it on the transferred target domain images. The key challenge of I2I-based domain adaptation is to perform content-preserving translation. Disentangled feature representation is the prerequisite for successful trans- lation with the class association. For example, in the digit classification task, illustrated in Fig. 6.1, it is highly desirable if we could disentangle the content and style. The con- tent factor is digit class, and the style factors may represent the stroke width, foreground 88 (a) samples from MNIST (b) samples from SVHN Figure 6.1: Digit classification datasets MNIST [58] and SVHN [78]. and background color, etc. If we can disentangle the style factor from the content, we may learn the style mapping from one domain to the other domain, while leaving the content unchanged. In the source domain, label information can be exploited to guided the feature dis- entanglement, e.g by training an auxiliary classifier during feature learning [80, 18]. In the target domain, unsupervised disentanglement has to be employed. Very few attempts [15] have been made towards this direction, especially for real-world images with enough complexities. The preliminary idea is to try performing dimension reduc- tion and clustering analysis before representation learning. Cluster membership infor- mation may be noisy yet still useful for representation learning. And a more elegant solution would be to directly incorporate a clustering objective during disentangled rep- resentation learning. 89 Appendix A Appendix for Chapter 2 In this chapter, we provide additional detailed experimental results and analyses of the proposed method in Chapter 2, Deep Model Consolidation (DMC), for class- incremental learning. 
A.1 Detailed Experimental Results of DMC for Object Detection In the experiments of DMC for incremental learning of object detectors, we incremen- tally learn 19 + 1 classes using RetinaNet [68]. In Chapter 2, we presented the results of adding “tvmonitor” class as the new class. Here, we show the results of addition of one class experiment with each of the VOC categories being the new class in Table A.5, where Old Model denotes the 19-class detector trained on the old 19 classes, New Model denotes the 1-class detector trained on the new class and DMC denotes the final consol- idated model that is capable of detecting all the 20 classes. Per-class average precisions on the entire test set of PASCAL VOC 2007 [28] are reported. 90 A.2 Effect of the Amount of Auxiliary Data for Object Detection We studied the effect of the amount of auxiliary data for DMC for image classification task. To see how the amount of auxiliary data affects the final performance in the incre- mental learning of object detection, we performed additional experiments on PASCAL VOC 2007 with the 10 + 10 classes setting. We randomly sampled 1=2, 1=4 and 1=8 of the full auxiliary data from Microsoft COCO dataset [110] for consolidation. As shown in Table A.1, with just 1=8 of full data, i.e., 12.3k images, DMC can still outperform the state-of-the-art, which demonstrates its robustness and efficiency in the detection task as well. Table A.1: Varying the amount of auxiliary data in the consolidation stage. VOC 2007 test mAP (%) are shown, where classes 1-10 are the old classes, and classes 11-20 are the new ones. Model Old Classes New Classes All Classes All auxiliary data 70.53 66.16 68.35 1=2 of auxiliary data 69.79 66.06 67.93 1=4 of auxiliary data 70.2 64.67 67.44 1=8 of auxiliary data 66.77 62.71 64.74 A.3 Implementation and Training Details We implement DMC with PyTorch [82] library. Training details for the image classification experiments. Following iCaRL [88], we use a 32-layers ResNet [37] for all experiments and the model weights are randomly initialized. When training the individual specialist models, we use SGD optimizer with momentum for 200 epochs. In the consolidation stage, we train the network for 50 epochs. The learning rate schedule for all the experiments is same, i.e., it starts with 91 0.1 and reduced by 0.1 at 7/10 and 9/10 of all epochs. For all experiments, we train the network using mini-batches of size 128 and a weight decay factor of 1 10 4 and momentum of 0.9. We apply the simple data augmentation for training: 4 pixels are padded on each side, and a 32 32 crop is randomly sampled from the padded image or its horizontal flip. Training details for the object detection experiments. We resize each image so that the smaller side has 640 pixels, keeping the aspect ratio unchanged. We train each model for 100 epochs and use Adam [52] optimizer with learning rate 1 10 3 on two NVIDIA Tesla M40 GPUs simultaneously, with batch size of 12. Random horizon- tal flipping is used for data augmentation. Standard non-maximum suppression (NMS) with threshold 0.5 is applied for post-processing at test time to remove the duplicate pre- dictions. For each image, we select 64 anchor boxes for DMC training. Empirically we found selecting more anchor boxes (128, 256 etc.) did not provide further performance gain. The is set to 1.0 for all experiments. Hyperparameters used for the baseline methods. We report results of EWC++ [12], SI [123], MAS [3] and RWalk [12] on iCIFAR-100 benchmark in Chapter 2. 
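To make the classification training schedule of Sec. A.3 explicit, the step schedule can be written as a small epoch-based helper that mirrors the description above (start at 0.1, multiply by 0.1 at 7/10 and again at 9/10 of training); the hyperparameters of the regularization-based baselines are summarised next.

```python
def classification_lr(epoch, total_epochs, base_lr=0.1):
    """Learning rate starts at 0.1 and is reduced by a factor of 10 at 7/10
    and again at 9/10 of the total number of epochs."""
    lr = base_lr
    if epoch >= 0.7 * total_epochs:
        lr *= 0.1
    if epoch >= 0.9 * total_epochs:
        lr *= 0.1
    return lr
```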
Table A.2 summarizes the hyperparamter ^ that controls the strength of regularization used in each experiment, and they are picked based on a held-out validation set. Method g = 5 g = 10 g = 20 g = 50 EWC++ [12] 10 10 1 0.1 SI [123] 0.01 0.05 0.01 0.01 MAS [3] 0.1 0.1 0.001 0.0001 RWalk [12] 5 1 1 0.1 Table A.2: ^ used in when incrementally learning g classes at a time on iCIFAR-100 benchmark. 92 A.4 Preliminary Experiments of Adding Exemplars While DMC is realistic in applied scenarios due to its scalablity and immunity to copy- right and privacy issues, we additionally tested our method in the scenario where we are allowed to store some exemplars from the old data with a fixed budget when learn- ing the new classes. Suppose we are incrementally learning a group of g classes at a time, With the same total memory budget K = 2000 as in iCaRL [88], we fill the exemplar set by randomly samplingb K g c training images from each class when we learn the first group of classes; then every time we learn g more classes with training data D new = [X gi ; ;X g(i+1)1 ] in thei-th incremental learning session, we augment the exemplar set byb K gi c randomly sampled training images of the new classes, and we fine- tune the consolidated model using these exemplars for 15 epochs with a small learning rate of 1 10 3 . After fine-tuning, we reduce the size of the exemplar set by keeping b K g(i+1) c exemplars for each class. We refer to this variant of our method as DMC+. We validate the effectiveness of DMC+ on the iCIFAR-100 benchmark, and Table A.3 summarizes the results as the average of the classification accuracies over all steps of the incremental training (as in [10], the accuracy of the first group is not considered in this average). We can get comparable performance to iCaRL in all settings. Note that we also tried the herding algorithm to select the exemplars as in iCaRL, but we did not observe any notable improvement. The confusion matrices comparison between DMC+ and iCaRL [88] is shown in Fig. A.1, and we find: 1) fine-tuning with exemplars can indeed further reduce the intrinsic bias in the training; 2) our DMC+ is on a par with iCaRL, even though we use naive random sampling rather than the more expensive herding [88] approach to select exemplars. 93 These preliminary results demonstrate that DMC may also hold promise for exemplar-based incremental learning, and we would like to further study the poten- tial improvement of DMC+, e.g. in terms of exemplar selection scheme and rehearsal strategies. Table A.3: Average incremental accuracies when adding the exemplars of old classes. iCaRL [88] with the same memory budget is compared. Results of incremental learning withg = 5; 10; 20; 50 classes at a time on iCIFAR-100 benchmark are reported. g 5 10 20 50 iCaRL 57:8 2:6 60:5 1:6 62:0 1:2 61:8 0:4 DMC+ 56:78 0:86 59:1 1:4 63:2 1:3 63:1 0:54 (a) DMC+ (b) iCaRL Figure A.1: Confusion matrices of exemplar-based methods on iCIFAR-100 when incre- mentally learning 10 classes in a group. The element in thei-th row andj-th column indicates the percentage of samples with ground-truth labeli that are classified into class j. Fig. A.1(b) is from [88]. (Best viewed in color.) 94 A.5 Preliminary Experiments of Consolidating Models with Common Classes The original DMC assumes the two models to be consolidated are trained with distinct sets of classes, but it can be easily extended to the case where we have two models that are trained with partially overlapped set of classes. 
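Before continuing with that extension, the exemplar bookkeeping of DMC+ described in Sec. A.4 can be sketched as follows. The dictionary-based storage is an illustrative assumption, and the ordering is simplified so that new classes are sampled directly at the final per-class share rather than being trimmed after fine-tuning.

```python
import random

def update_exemplar_set(exemplars, new_class_images, budget=2000):
    """exemplars: dict mapping class id -> list of stored images.
    new_class_images: dict mapping each newly learned class id -> its training
    images. New classes receive randomly sampled exemplars, then every class is
    trimmed to an equal share of the fixed memory budget K."""
    n_classes = len(exemplars) + len(new_class_images)
    per_class = budget // n_classes
    for cls, images in new_class_images.items():
        exemplars[cls] = random.sample(images, min(per_class, len(images)))
    for cls in exemplars:
        # Keep floor(K / #classes) exemplars per class (which ones to keep is
        # unspecified in the text; a prefix is used here for simplicity).
        exemplars[cls] = exemplars[cls][:per_class]
    return exemplars
```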
We first normalize the logits produced by the two models as in Eq. 2.4. We then set the double distillation regression target as follows: for the common classes, we take the mean of the normalized logits from the two models; for each of the remaining classes, we take the normalized logit from the specialist model that was trained on that class. (A short code sketch of this target construction is given after Table A.5 below.) Below we present a preliminary experiment on the CIFAR-100 dataset in this setting, where we separately train two 55-class classifiers, one for Class 1-55 and one for Class 46-100, so that 10 classes (Class 46-55) are shared. The results are shown in Table A.4. For the common classes, DMC can be viewed as an ensemble method that preserves at least the accuracy of the weaker model; for the remaining classes, it exhibits neither catastrophic forgetting nor intransigence. This shows that DMC extends promisingly to the special case of incremental learning with partially overlapping categories.

Table A.4: Consolidation of two models with 10 common classes (Class 46-55).

Model          Class 1-45   Class 46-55   Class 56-100   Class 1-100
Model 1        73.73        80.5          -              -
Model 2        -            71.6          66.47          -
Consolidated   60.76        71.7          58.09          60.65

Table A.5: VOC 2007 test per-class average precision (%) when incrementally learning 19 + 1 classes.

Method     aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP

Old Model  - 78.8 77.4 56.5 60.1 76.4 85.0 80.0 50.0 78.0 69.9 78.3 79.2 74.3 77.3 39.5 66.4 65.7 76.9 74.4 -
New Model  15.8 - - - - - - - - - - - - - - - - - - - -
DMC        16.3 75.9 75.8 52.9 59.5 74.2 84.2 79.1 49.3 73.0 59.9 70.0 75.4 64.8 79.9 40.2 64.1 58.8 69.9 74.3 64.9

Old Model  69.6 - 76.3 60.1 59.8 76.7 85.4 79.6 54.6 75.9 63.7 78.6 79.5 71.5 77.7 44.9 68.0 57.6 77.3 75.5 -
New Model  - 70.2 - - - - - - - - - - - - - - - - - - -
DMC        75.8 62.4 75.5 59.6 59.0 76.0 85.6 79.5 53.8 77.0 62.3 78.6 77.6 67.5 80.7 43.5 70.6 57.6 77.3 76.3 69.8

Old Model  68.9 78.8 - 55.9 61.4 70.7 79.9 79.8 50.9 73.6 65.0 77.7 79.3 76.0 77.0 43.1 66.2 66.8 77.3 75.5 -
New Model  - - 35 - - - - - - - - - - - - - - - - - -
DMC        69.0 77.9 43.8 54.7 60.1 75.5 84.1 77.5 51.0 71.4 65.4 69.4 69.7 73.5 76.5 40.8 59.9 66.9 77.0 76.2 67.0

Old Model  76.9 78.3 77.1 - 57.9 76.2 85.2 79.8 48.7 76.5 65.6 82.9 76.9 75.1 77.7 40.6 67.7 67.6 76.9 69.5 -
New Model  - - - 18.6 - - - - - - - - - - - - - - - - -
DMC        76.6 77.2 75.8 23.4 58.2 77.2 84.4 80.0 48.7 78.5 63.3 82.8 70.3 76.1 80.7 40.8 66.7 64.9 75.5 68.6 68.5

Old Model  70.5 77.9 77.5 53.5 - 76.1 85.6 78.8 51.0 76.2 62.5 77.2 79.1 73.2 77.6 42.5 68.6 68.1 76.6 74.5 -
New Model  - - - - 47.7 - - - - - - - - - - - - - - - -
DMC        74.7 76.2 76.4 51.0 37.6 76.9 85.4 79.4 53.0 76.7 64.2 77.9 77.9 73.6 80.4 43.0 68.2 68.4 76.5 75.3 69.6

Old Model  70.8 77.8 75.2 57.2 60.0 - 84.7 79.6 48.3 75.3 68.4 78.8 78.6 75.6 77.3 41.8 69.0 68.0 75.0 73.9 -
New Model  - - - - - 46 - - - - - - - - - - - - - - -
DMC        68.7 79.7 73.9 55.6 61.3 53.5 84.9 79.3 49.4 75.8 66.8 78.9 75.6 75.4 80.6 41.6 67.4 66.7 70.0 73.8 68.9

Old Model  77.5 78.8 74.5 58.1 60.3 74.5 - 80.7 49.0 76.0 64.4 77.3 78.7 66.8 77.1 39.0 67.9 67.0 77.1 75.3 -
New Model  - - - - - - 76.2 - - - - - - - - - - - - - -
DMC        70.3 76.3 74.0 51.3 60.2 68.2 77.5 80.0 47.1 76.5 61.0 77.4 77.3 59.5 79.9 41.5 66.5 65.0 77.2 74.8 68.1

Old Model  76.5 79.4 78.1 54.7 60.8 77.2 85.4 - 49.6 74.9 65.1 78.5 78.5 74.3 77.8 44.2 67.3 65.1 76.0 74.5 -
New Model  - - - - - - - 60.5 - - - - - - - - - - - - -
DMC        75.7 81.0 76.6 51.4 61.9 76.7 84.5 69.8 51.5 74.6 63.6 76.9 69.4 74.6 81.2 43.3 67.0 67.1 77.1 74.2 69.9

Old Model  78.7 79.6 76.9 57.3 62.2 77.4 80.0 79.5 - 75.9 66.6 77.6 79.6 76.5 77.3 43.4 67.3 66.7 77.8 69.3 -
New Model  - - - - - - - - 41.9 - - - - - - - - - - - -
DMC        75.8 76.2 76.9 56.7 62.9 76.9 85.4 78.9 38.1 75.1 64.1 78.6 76.0 74.4 80.4 43.0 66.8 62.6 77.8 74.0 70.0

Old Model  70.8 77.8 76.0 58.1 60.7 78.1 85.0 80.1 47.2 - 64.4 77.4 75.3 74.9 80.3 41.7 66.8 64.9 77.1 72.1 -
New Model  - - - - - - - - - 30.3 - - - - - - - - - - -
DMC        69.9 75.6 68.1 56.4 60.7 77.2 85.5 79.4 46.4 37.0 65.2 70.0 68.0 74.6 80.4 41.7 59.6 62.8 76.5 72.9 66.4

Old Model  75.5 80.1 77.1 57.8 61.4 76.6 85.5 80.6 51.1 79.0 - 78.6 80.2 75.4 77.1 44.7 68.4 66.7 77.4 74.6 -
New Model  - - - - - - - - - - 43.6 - - - - - - - - - -
DMC        75.0 80.8 75.3 54.1 62.6 76.8 85.3 80.6 50.2 77.4 53.9 83.8 77.6 73.4 81.0 45.9 66.8 65.7 75.4 74.8 70.8

Old Model  76.6 77.8 77.2 57.4 60.6 76.3 84.8 80.9 49.9 77.4 64.5 - 77.4 69.3 77.5 43.2 73.7 67.4 76.7 74.7 -
New Model  - - - - - - - - - - - 40.3 - - - - - - - - -
DMC        75.2 76.6 74.5 57.1 62.0 74.9 85.4 70.0 51.0 57.8 63.6 52.5 59.2 73.5 79.9 43.1 65.4 66.9 75.1 74.5 66.9

Old Model  77.3 78.2 77.7 59.4 60.5 77.5 85.2 85.9 48.8 76.6 70.7 76.5 - 74.1 77.3 42.1 67.8 68.0 78.7 72.4 -
New Model  - - - - - - - - - - - - 52.4 - - - - - - - -
DMC        77.4 76.2 72.1 54.9 63.0 77.5 84.7 79.1 48.0 73.3 68.0 61.5 40.5 71.9 79.5 40.4 65.9 63.0 77.4 73.0 67.4

Old Model  76.5 77.3 75.7 56.8 60.8 70.5 85.4 79.7 48.6 74.2 62.7 79.3 77.3 - 76.9 43.9 68.4 63.3 77.2 76.0 -
New Model  - - - - - - - - - - - - - 59.2 - - - - - - -
DMC        68.9 74.2 75.8 55.0 60.1 77.1 84.2 86.0 50.7 75.2 61.3 78.8 70.0 68.2 79.6 46.1 68.1 61.4 75.5 76.1 69.6

Old Model  77.1 79.7 76.9 59.1 62.3 77.3 85.7 80.2 52.0 77.6 65.0 78.5 80.4 78.2 - 44.1 67.5 71.9 78.0 74.1 -
New Model  - - - - - - - - - - - - - - 76.4 - - - - - -
DMC        75.8 78.8 75.0 59.7 62.1 77.1 85.5 80.1 51.1 77.0 63.3 77.9 78.1 76.8 78.0 44.7 66.0 69.2 78.1 75.1 71.5

Old Model  75.5 77.1 75.5 58.9 62.1 77.8 85.8 87.7 44.4 76.6 64.7 78.3 78.7 75.5 77.4 - 68.3 67.7 76.6 73.4 -
New Model  - - - - - - - - - - - - - - - 35.8 - - - - -
DMC        75.5 74.3 74.4 56.1 61.6 77.9 86.7 87.2 48.0 77.1 64.6 78.0 77.4 73.7 80.3 31.0 67.3 65.8 75.4 73.8 70.3

Old Model  78.0 77.8 75.3 57.5 61.0 69.6 86.5 79.8 48.5 67.9 62.8 76.8 79.6 74.8 77.5 43.5 - 68.0 76.8 74.8 -
New Model  - - - - - - - - - - - - - - - - 26 - - - -
DMC        76.3 76.9 73.6 54.3 62.0 73.8 86.1 79.7 48.5 67.0 63.9 75.6 76.3 75.2 80.8 44.4 20.2 68.2 76.4 74.4 67.7

Old Model  77.6 78.2 76.3 55.0 59.3 70.7 85.8 80.4 50.5 75.4 67.2 83.5 78.7 69.0 77.6 44.4 67.7 - 70.1 75.1 -
New Model  - - - - - - - - - - - - - - - - - 33.1 - - -
DMC        77.4 82.5 68.5 58.3 61.8 75.2 85.6 78.7 47.1 74.9 63.5 75.6 69.8 73.5 79.4 42.6 65.8 26.1 69.8 74.1 67.5

Old Model  70.4 79.5 77.1 57.6 60.2 73.6 84.6 80.2 51.0 75.5 65.4 78.7 78.0 75.3 77.7 42.8 69.3 63.6 - 73.9 -
New Model  - - - - - - - - - - - - - - - - - - 34.9 - -
DMC        73.9 77.4 76.7 56.0 60.8 62.1 83.7 79.7 49.6 76.3 65.7 79.3 73.8 73.2 80.6 40.3 73.3 65.9 37.7 74.5 68.0
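To make the target construction for partially overlapping class sets concrete, the following is a minimal PyTorch sketch. It is only an illustration: the function name is ours, and the per-sample zero-mean normalization is a placeholder standing in for the normalization of Eq. 2.4, not the exact implementation used in the experiments.

```python
import torch

def consolidation_targets(logits_a, logits_b, classes_a, classes_b):
    """Assemble double-distillation regression targets for two specialist
    models whose class sets partially overlap (e.g., Class 1-55 and
    Class 46-100 as in Table A.4). Sketch only; the normalization below
    is a placeholder for Eq. 2.4."""
    # Normalize each specialist's logits per sample (placeholder for Eq. 2.4).
    za = logits_a - logits_a.mean(dim=1, keepdim=True)   # shape [N, |A|]
    zb = logits_b - logits_b.mean(dim=1, keepdim=True)   # shape [N, |B|]

    pos_a = {c: i for i, c in enumerate(classes_a)}
    pos_b = {c: i for i, c in enumerate(classes_b)}
    union = sorted(set(classes_a) | set(classes_b))

    target = logits_a.new_zeros(logits_a.size(0), len(union))
    for j, c in enumerate(union):
        if c in pos_a and c in pos_b:
            # Common class: average the two specialists' normalized logits.
            target[:, j] = 0.5 * (za[:, pos_a[c]] + zb[:, pos_b[c]])
        elif c in pos_a:
            # Class seen only by specialist A: copy its normalized logit.
            target[:, j] = za[:, pos_a[c]]
        else:
            # Class seen only by specialist B.
            target[:, j] = zb[:, pos_b[c]]
    return target
```

For the Table A.4 experiment, classes_a would be Class 1-55 and classes_b Class 46-100, so the ten shared classes receive averaged targets while every other class keeps its own specialist's normalized logit; the consolidated student would then be trained to regress onto these targets over the auxiliary data, following the double distillation objective of Chapter 2.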
Abstract
The classical machine learning paradigm rarely exploits the dependencies and relations among different tasks and domains. In the deep learning era, manually creating a labeled dataset for each task becomes prohibitively expensive. In this dissertation, we aim to develop effective techniques to retain, accumulate, and transfer knowledge gained from past learning experiences to solve new problems in new scenarios. Specifically, we consider four different types of knowledge transfer scenarios in computer vision applications: 1) incremental learning—we transfer knowledge from old task(s) to a new task as training data become available gradually over time