VIDEO OBJECT SEGMENTATION AND TRACKING WITH DEEP LEARNING TECHNIQUES

by Ye Wang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2020

Copyright 2020 Ye Wang

Acknowledgments

I am very fortunate to have joined the Media Communications Lab (MCL) led by Professor C.-C. Jay Kuo in Fall 2015. Professor Kuo gave me the chance and inspiration to explore a new direction and guided me through my whole PhD study. His immense knowledge has greatly broadened my vision, and his insistence on theoretical understanding of deep learning encouraged me to never give up in my bad times. In addition, Professor Kuo provided prompt help and care not only in the academic area but also in daily life; I am indeed lucky to be a member of MCL. I also sincerely appreciate the scientific advice and insightful instructions from Dr. Jongmoo Choi during and after the PWICE project.

Besides my advisor, I would like to thank the rest of my committee members, Professor Joseph Lim, Professor Shrikanth Narayanan, Professor Antonio Ortega and Professor Alexander Sawchuk, for providing many valuable comments and suggestions.

I feel very lucky to know many research fellows from MCL, particularly Qin Huang, Haiqiang Wang, Weihao Gan, Yuhang Song, Yeji Shen, Wenchao Zheng, Siyang Li, Yueru Chen, Kaitai Zhang, and many others. I also received much help from friends and research fellows outside of the lab, such as Guoxian Dai, Xiaosong Zhou, Jiefu Zhai and Ke Zhang.

Finally, I really appreciate the unconditional love and care from my family; I particularly thank my wife, Wei Zhang, for supporting and encouraging me at all times.

Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Significance of the Research
  1.2 Background of the Research
    1.2.1 Instance segmentation
    1.2.2 Semi-supervised video object segmentation
    1.2.3 Unsupervised video object segmentation
    1.2.4 Hard example mining
    1.2.5 Object Detection
    1.2.6 Object Tracking
    1.2.7 Generative Adversarial Network
  1.3 Contributions of the Research
    1.3.1 Video Object Segmentation
    1.3.2 Drone Detection and Tracking
  1.4 Organization of the Thesis
2 Design Pseudo Ground Truth with Motion Cues for Unsupervised Video Object Segmentation
  2.1 Introduction
  2.2 Proposed Method
    2.2.1 Learning to tag the foreground object
    2.2.2 Unsupervised video object segmentation
  2.3 Experiments
    2.3.1 Datasets
    2.3.2 Implementation details
    2.3.3 Comparison with state-of-the-art methods
    2.3.4 Ablation studies
  2.4 Conclusion
3 Unsupervised Video Object Segmentation with Distractor-aware Online Adaptation
  3.1 Introduction
  3.2 Proposed Method
    3.2.1 Generate pseudo ground truth
    3.2.2 Online hard negative/negative/positive example selection
    3.2.3 Distractor-aware online adaptation
  3.3 Experiments
    3.3.1 Datasets and evaluation metrics
    3.3.2 Implementation details
    3.3.3 Performance comparison with state-of-the-art
    3.3.4 Ablation studies
  3.4 Conclusion
4 Drone Monitoring with Convolutional Neural Networks
  4.1 Introduction
  4.2 Data Collection and Augmentation
    4.2.1 Data Collection
    4.2.2 Data Augmentation
    4.2.3 Thermal Data Augmentation
  4.3 Drone Monitoring System
    4.3.1 Drone Detection
    4.3.2 Drone Tracking
    4.3.3 Integrated Detection and Tracking System
  4.4 Experimental Results
    4.4.1 Drone Detection
    4.4.2 Drone Tracking
    4.4.3 Fully Integrated System
  4.5 Conclusion
5 Video Object Tracking and Segmentation with Box Annotation
  5.1 Introduction
  5.2 Proposed Method
    5.2.1 Track and then Segment
    5.2.2 Reverse Optimization from Segmentation to Tracking
    5.2.3 Implementation Details
  5.3 Experimental Evaluation
    5.3.1 Evaluation Metrics
    5.3.2 Evaluations for Video Object Segmentation
    5.3.3 Ablation Studies
  5.4 Conclusion
6 Conclusion and Future Work
  6.1 Summary of the Research
  6.2 Future Research
    6.2.1 Video-based panoptic segmentation
    6.2.2 3D model reconstruction from videos
Bibliography

List of Tables

2.1 Comparison of the mIoU scores (%) of different unsupervised VOS approaches on the DAVIS 2016 validation dataset. Our method achieves the highest mIoU compared with state-of-the-art methods.
2.2 Comparison of the mIoU scores (%) per video of several methods for the DAVIS 2016 validation dataset. (1) blackswan, (2) bmx-trees, (3) breakdance, (4) camel, (5) car-roundabout, (6) car-shadow, (7) cows, (8) dance-twirl, (9) dog, (10) drift-chicane, (11) drift-straight, (12) goat, (13) horsejump-high, (14) kite-surf, (15) libby, (16) motocross-jump, (17) paragliding-launch, (18) parkour, (19) scooter-black, (20) soapbox.
2.3 Comparison of mIoU scores (%) of different finetuning times on the first frames of the DAVIS validation set.
2.4 Comparison of the F-score and mIoU scores (%) of different unsupervised VOS approaches on the FBMS test dataset. Our method achieves the highest compared with state-of-the-art methods.
2.5 Ablation study of our approach on the DAVIS 2016 validation set. TTDA denotes test-time data augmentation and CRF denotes conditional random field.
2.6 Error analysis for the entire DAVIS 2016 validation set and two difficult videos (bmx-trees and kite-surf).
3.1 Comparison of the results of several methods for the DAVIS 2016 validation dataset. The proposed method outperforms state-of-the-art unsupervised VOS methods and is even better than some supervised VOS approaches in terms of J Mean and F Mean (%).
3.2 Comparison of the J Mean and F Mean scores (%) of different unsupervised VOS approaches on the FBMS test dataset. Our method achieves the highest compared with state-of-the-art methods.
3.3 Ablation study of the three modules in distractor-aware online adaptation: (1) negative example addition (+N), (2) hard negative example addition (+HN), and (3) fusion of the positive mask with the motion mask (+MP), assessed on the DAVIS 2016 validation set.
3.4 Comparison of the first frame influence for the DAVIS 2016 validation dataset. We finetune on the first frame and perform inference for the remaining frames without online adaptation. We compare the performances (%) from the pseudo ground truth generated from Mask R-CNN (PGT_M) and jointly from Mask R-CNN and Deeplabv3+ (PGT_MD), the erosion and dilation masks from PGT_M, and the ground truth mask (GT).
5.1 Quantitative results of different semi-supervised VOS approaches on the DAVIS 2016 validation set. FT and M denote fine-tuning on the first frame and exploiting the first frame ground truth mask, respectively. Time is measured in seconds per frame.
5.2 Quantitative results of different semi-supervised VOS approaches on the DAVIS 2017 validation set. FT and M denote fine-tuning on the first frame and exploiting the first frame ground truth mask, respectively. Time is measured in seconds per frame.
5.3 Ablation studies on the DAVIS 2016 validation set. R w/ OSVOS and R w/ OnAVOS denote replacing the segmentation branch with OSVOS [6] and OnAVOS [100], respectively. ReRO denotes removing the reverse optimization from the segmentation cues to the tracker.
5.4 Ablation studies of different components in the segmentation branch on the DAVIS 2016 validation set. GM denotes global matching with the first frame, and LM represents local matching with the previous frame. PP denotes adding the previous prediction to the input of the segmentation branch.
5.5 Per-sequence quantitative results on the DAVIS 2017 validation set.

List of Figures

2.1 Overview of tagging the main object. We use an instance segmentation algorithm to segment objects in the static image. We then utilize optical flow to select and group the segments into one foreground object.
2.2 Example results of our method, where the pseudo ground truth of the first frame is in yellow, and the other seven images in green are sample segmentations of the rest of the video clip. Best viewed in color.
2.3 Overview of the proposed method. We trained the appearance model on the DAVIS training set with a wider-ResNet pre-trained on ImageNet, COCO and PASCAL VOC. We then finetuned the model on the first frame "pseudo ground truth", and online adaptation is applied afterwards. The pixels in yellow and green are selected positive and negative examples, respectively, in the online branch.
2.4 Qualitative results on the DAVIS validation set: The first column in yellow is the "pseudo ground truth" of the first frame of each video. The other four columns are the output segmentation masks of our proposed approach. Our algorithm performs well on videos with fast motion (first and fourth rows), gesture changes (second row), an unseen category (third row) and complex background (fifth row). Best viewed in color.
2.5 Qualitative results on the FBMS dataset: The first column in yellow is the "pseudo ground truth" of the first frame of each video. The other four columns are the output segmentation masks of our proposed approach. Best viewed in color.
2.6 Qualitative results on the SegTrack-v2 dataset: The first column in yellow is the "pseudo ground truth" of the first frame of each video. The other four columns are the output segmentation masks of our proposed approach. Best viewed in color.
2.7 Comparison of qualitative results on two sequences, camel and car-roundabout, on the DAVIS validation set: The first column in yellow is the "pseudo ground truth" of the first frame of each video. The other two columns are the output segmentation masks of our oneshot and online approaches, respectively. Best viewed in color.
3.1 Example results of ground truth, ARP [47], PDB [89] and the proposed method. Distractors in the background lead to incorrect segmentations (second and third columns) due to the similar features between the foreground and background objects. We exploit a distractor-aware online adaptation approach to learn from the hard negatives to avoid misclassifying background objects as foreground. Best viewed in color with 4x zoom.
3.2 Overview of the proposed method. Instead of directly applying static image segmentation to video object segmentation, an online adaptation approach is proposed by detecting distractors (negatives and hard negatives). Both the appearance and motion cues are utilized to generate positives, negatives, and hard negatives. Besides, the first frame pseudo ground truth is utilized to supervise the finetuning process to make accurate inferences.
3.3 Objectness masks from Mask R-CNN (yellow) and Deeplabv3+ (green). Best viewed in color with 3x zoom.
3.4 Illustration of positive examples (yellow), negative examples (red) and hard negative examples (green).
3.5 Visual results of the proposed DOA on DAVIS 2016. The pseudo ground truths (in yellow) are illustrated in the first column, and the other columns are the segmentation results (in red) by DOA. The five sequences include an unseen object (first row), strong occlusions (second row), appearance variance (third row), and similar static objects in the background (fourth and fifth rows). Best viewed in color with 3x zoom.
3.6 Comparison of qualitative results on the key components of online adaptation. The first row presents the differences of w/ HN (left) and w/o HN (right), and the second row presents the differences of w/ MP (left) and w/o MP (right). Best viewed in color.
4.1 Overview of the proposed approach. We integrate the tracking module and detector module to set up an integrated system. The integrated system can monitor the drone during day and night by exploiting our proposed data augmentation techniques.
4.2 Sampled frames from two collected drone datasets.
4.3 Illustration of the data augmentation idea, where augmented training images can be generated by merging foreground drone images and background images.
4.4 Sampled frames from the collected USC thermal drone dataset.
4.5 Comparison of generated thermal drone images of different methods: 3D rendering (first row), Cycle-GAN (second row), proposed method (third row).
4.6 Illustration of augmented visible and thermal drone models. The left three columns show the augmented visible drone models using different augmentation techniques. The right three columns show the augmented thermal drone models, with the first row exploiting the 3D rendering technique and the second row utilizing Generative Adversarial Networks.
4.7 Synthesized visible and thermal images by incorporating various illumination conditions, image qualities, and complex backgrounds.
4.8 The architecture of the proposed thermal drone generator.
4.9 Qualitative results on the USC drone datasets. Our algorithm performs well on small object tracking and long sequences (first and second rows), complex background (third row), and occlusion (fourth row). The bounding boxes in red are integrated system results and the bounding boxes in green are tracking-only results.
4.10 Failure cases. Our algorithm fails to track the drone with strong motion blur and complex background (top row), and fails to re-identify the drone when it goes out of view and back (bottom row). The bounding boxes in red are integrated system results and the bounding boxes in green are tracking-only results.
4.11 Comparison of qualitative results of detection, tracking and the integrated system on the drone garden sequence. The detection results are shown in the first row. The corresponding tracking and integrated system results are shown in the second row, with tracking bounding boxes in green and integrated system bounding boxes in red, respectively.
4.12 Comparison of qualitative results of detection (first row) and the integrated system (second row) on the thermal drone sequence.
4.13 Comparison of three raw input images (first row) and their corresponding residual images (second row).
4.14 A flow chart of the drone monitoring system.
4.15 Comparison of the visible drone detection performance on (a) the synthetic and (b) the real-world datasets, where the baseline method uses geometric transformations only to generate training data, while the All method exploits geometric transformations, illumination conditions and image quality simulation for data augmentation.
4.16 Comparison of the visible and thermal drone detection performance on (a) the synthetic and (b) the real-world datasets. (c) shows the comparison of different thermal data augmentation techniques, where the baseline method uses geometric transformations, illumination conditions and image quality simulation for data augmentation, and image transfer utilizes the proposed modified Cycle-GAN data augmentation technique.
4.17 Comparison of the MDNet tracking performance using the raw and the residual frames as the input.
4.18 Detection only (Faster R-CNN) vs. tracking only (MDNet tracker) vs. our integrated system: The performance increases when we fuse the detection and tracking results.
5.1 Segmentation results of the proposed RevVOS vs. SiamMask [102] on three sequences of DAVIS 2016 and DAVIS 2017. We propose a two-step approach that deeply leverages the optimization between tracking and segmentation, producing better segmentation than the state-of-the-art jointly trained approach SiamMask.
5.2 An overview of the proposed RevVOS. The framework includes a tracking branch (top), a segmentation branch (bottom), and a Box2Seg prediction (left). We propose a two-stage approach, track-then-segment, to generate segmentation masks from bounding box initialization. Then, the reverse optimization from the segmentation cues to the tracker is applied to refine the tracker.
5.3 Benchmark of J&F mean and time per frame (in log scale) on the DAVIS 2016 dataset. SiamMask [102] and RevVOS in red only use bounding box labels. The methods in blue utilize fine-tuning on the first frame.
5.4 Qualitative results of the proposed method on the DAVIS 2016 and DAVIS 2017 validation sets: drift-straight and cows in the first two rows belong to DAVIS 2016; judo and horsejump-high in the last two rows are in DAVIS 2017.
5.5 More qualitative results of the proposed method on the DAVIS 2016 validation set.
5.6 More qualitative results of the proposed method on the DAVIS 2016 validation set.
5.7 More qualitative results of the proposed method on the DAVIS 2017 validation set.
Abstract

Unsupervised video object segmentation is a crucial application in video analysis when no prior information about the objects is known. It becomes tremendously challenging when multiple objects occur and interact in a given video clip. In this thesis, a novel unsupervised video object segmentation approach via distractor-aware online adaptation (DOA) is proposed. DOA models spatial-temporal consistency in video sequences by capturing background dependencies from adjacent frames. Instance proposals are generated by the instance segmentation network for each frame and then selected, using motion information, as positives or, if present, hard negatives. To obtain high-quality hard negatives, a block matching algorithm is then applied to preceding frames to track the associated hard negatives. General negatives are also introduced in case there are no hard negatives in the sequence, and experiments demonstrate that both kinds of negatives (distractors) are complementary. Finally, we conduct DOA using the positive, negative, and hard negative masks to update the foreground/background segmentation. The proposed approach achieves state-of-the-art results on two benchmark datasets, the DAVIS 2016 and FBMS-59 datasets.

In addition, this thesis reports a visible and thermal drone monitoring system that integrates deep-learning-based detection and tracking modules. The biggest challenge in adopting deep learning methods for drone detection is the paucity of training drone images, especially thermal drone images. To address this issue, we develop two data augmentation techniques. One is a model-based drone augmentation technique that automatically generates visible drone images with a bounding box label on the drone's location. The other exploits an adversarial data augmentation methodology to create thermal drone images. To track a small flying drone, we utilize the residual information between consecutive image frames. Finally, we present an integrated detection and tracking system that outperforms each individual detection-only or tracking-only module. The experiments show that, even when trained on synthetic data, the proposed system performs well on real-world drone images with complex backgrounds. The USC drone detection and tracking dataset with user-labeled bounding boxes is available to the public.

Finally, this thesis presents a two-stage approach, track and then segment, to perform semi-supervised video object segmentation (VOS) with only bounding box annotations. We present reverse optimization for VOS (RevVOS), which leverages a fully convolutional Siamese network to perform the tracking and then segments the objects in the tracker. The segmentation cues are able to reversely optimize the localization of the tracker. The proposed two-branch system is performed online to produce object segmentation masks. We demonstrate significant improvements over state-of-the-art methods on two video object segmentation datasets: DAVIS 2016 and DAVIS 2017.

Chapter 1
Introduction

1.1 Significance of the Research

The key problem of this thesis is how we can build an intelligent system to perform video object segmentation and tracking.
We discuss the challenges in this field and then propose deep learning algorithms that significantly improve the performance.

Video object segmentation is a challenging problem because of the complex nature of videos: occlusions, motion blur, deforming shapes, truncations, etc. In addition, small foreground objects and high similarity between the foreground object and the background also make segmentation hard. Finally, interactions among multiple objects bring further challenges to video object segmentation, and we need to identify and segment the main object in a video clip.

Video object segmentation (VOS) aims to segment foreground objects from complex background scenes in video sequences. There are two main categories of existing VOS methods: semi-supervised and unsupervised. Semi-supervised VOS algorithms [65, 97, 100, 6, 74, 34] require manually annotated object regions in the first frame and then automatically segment the specified object in the remaining frames throughout the video sequence. Unsupervised VOS algorithms [47, 94, 39, 55, 56, 33] segment the most conspicuous and eye-attracting objects without prior knowledge of these objects in the video.

Unsupervised video object segmentation has many applications, such as object identification and content-adaptive video compression. It is, however, more challenging, since the algorithm must automatically discover the foreground regions in the video clip. In order to extract the objects that people are interested in, motion information is a key factor in unsupervised video object segmentation. Since manual annotation is expensive, it is desirable to develop the more challenging unsupervised VOS solution. This is feasible due to the following observation. Inspired by vision studies [79], moving objects can attract the attention of infants and young animals, who can group things properly without knowing what kinds of objects they are. Furthermore, we tend to group moving objects and separate them from the background and other static objects. In other words, semantic grouping is acquired after motion-based grouping in the VOS task. In addition, segmentation becomes tremendously challenging when multiple objects occur and interact in a given video clip. To tackle this problem, distractor-aware online adaptation is proposed by utilizing positive, negative, and hard negative examples.

However, it is possible that the primary object is static in the first several frames or during intermediate portions of the video clip. In that case, motion cues cannot find the salient regions, and although image-based segmentation algorithms can be applied to find objectness regions, they cannot determine what the primary object is. This thesis therefore also presents a two-stage approach, track and then segment, to perform semi-supervised video object segmentation (VOS) with only bounding box annotations. We present reverse optimization for VOS (RevVOS), which leverages a fully convolutional Siamese network to perform the tracking and then segments the objects in the tracker. The segmentation cues are able to reversely optimize the localization of the tracker. The proposed two-branch system is performed online to produce object segmentation masks. We demonstrate significant improvements over state-of-the-art methods on two video object segmentation datasets: DAVIS 2016 and DAVIS 2017.

In addition to video object segmentation, I will also introduce my efforts to build an integrated drone monitoring system, which includes drone detection and tracking, in this thesis.
Generally speaking, techniques for localizing drones can be categorized into two types: acoustic and optical sensing techniques. The acoustic sensing approach achieves target localization and recognition by using a miniature acoustic array system. The optical sensing approach processes images or videos to estimate the position and identity of a target object. In this thesis, we employ the optical sensing approach by leveraging the recent breakthroughs in the computer vision field. In particular, our proposed model integrates the detector module and tracker module to set up a drone monitoring system. The proposed system can monitor drones during both day and night. Due to the lack of drone data and the paucity of thermal drone diversity, we propose model-based augmentation for visible drone data and design a modified Cycle-GAN-based generation approach for thermal drone data augmentation. Furthermore, a residual tracker module is presented to deal with fast motion and occlusions.

1.2 Background of the Research

1.2.1 Instance segmentation

Many video object segmentation methods [100, 39, 74, 6] are based on semantic segmentation networks [105] for static images. State-of-the-art semantic segmentation techniques are dominated by fully convolutional networks [61, 7]. Semantic segmentation segments all objects of the same category with one mask, while instance segmentation [30] provides a segmentation mask independently for each instance. One key reason that these deep learning based methods for instance segmentation have developed very rapidly is that there are large datasets with instance mask annotations, such as COCO [58]. It is difficult to annotate all categories of objects and apply supervised training. It is even more difficult to extend image instance segmentation to video instance segmentation due to the lack of large-scale manually labeled video instance segmentation datasets. In contrast, we focus on generic object segmentation in the video, and we do not care whether the object category is in the training dataset or not. We propose a method to transfer image instance segmentation results to enable finetuning of the pretrained fully convolutional network.

1.2.2 Semi-supervised video object segmentation

Given the manual foreground/background annotation for the first frame in a video clip, semi-supervised VOS methods segment the foreground object throughout the remaining frames. Deep learning based methods have achieved excellent performance [100, 11, 40, 112, 107, 110], and static image segmentation [6, 74, 64] is utilized to perform video object segmentation without any temporal information. MaskTrack [74] considers the output of the previous frame as a guidance in the next frame to refine the mask. OSVOS [6] processes each frame independently by finetuning on the first frame, and OSVOS-S [64] further transfers instance-level semantic information learned on ImageNet [17] to produce more accurate results. OnAVOS [100] proposes online finetuning with the predicted frames to further optimize the inference network. To fully exploit the motion cues, MoNet [107] introduces a distance transform layer to separate motion-inconstant objects and refine the segmentation results. However, when the object is occluded or its movement changes abruptly, significant performance deterioration occurs. Our approach aims to tackle this challenge using distractor-aware online adaptation.
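To make the mask-guidance idea concrete, the sketch below shows one minimal way such propagation can be wired up: the previous frame's predicted mask is stacked with the current RGB frame as a fourth input channel, and the prediction is fed forward frame by frame. This is an illustrative toy network in PyTorch, not the architecture of MaskTrack or of the methods proposed in this thesis.

```python
# Illustrative sketch only: a generic mask-propagation step in the spirit of
# MaskTrack-style guidance, not the exact architecture of any cited method.
import torch
import torch.nn as nn

class MaskGuidedSegNet(nn.Module):
    """Toy segmentation net whose input is RGB + previous-frame mask (4 channels)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),            # one foreground logit per pixel
        )

    def forward(self, frame, prev_mask):
        # frame: (B, 3, H, W), prev_mask: (B, 1, H, W) with values in [0, 1]
        x = torch.cat([frame, prev_mask], dim=1)
        return self.body(x)

def propagate(net, frames, first_mask, threshold=0.5):
    """Run through a clip, feeding each predicted mask as guidance for the next frame."""
    mask, outputs = first_mask, []
    for frame in frames:                    # frames: iterable of (B, 3, H, W) tensors
        logits = net(frame, mask)
        mask = (torch.sigmoid(logits) > threshold).float()
        outputs.append(mask)
    return outputs
```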
Fine-tuning-based approaches achieve high performance, but the runtime is long. Other methods [110, 10, 99, 9] avoid fine-tuning to balance the trade-off between runtime and performance and to meet the requirements of real-world applications. OSMN [110] adapts the segmentation model with a modulator that manipulates the intermediate layers of the segmentation network without using fine-tuning. FAVOS [10] produces a segmentation for each region of interest after part-based tracking, and then refines the segmentation using a similarity-based scoring function. FEELVOS [99] utilizes semantic embedding with global matching with the first frame, and local matching with the previous frame, to generate segmentation masks.

1.2.3 Unsupervised video object segmentation

Unsupervised VOS algorithms [41, 63, 72, 103, 113, 55, 56, 33] attempt to extract the primary object segmentation with no manual annotations. Several unsupervised VOS algorithms [28, 108] cluster the boundary pixels hierarchically to generate mid-level video segmentations. ARP [47] utilizes the recurrent primary object to initialize the segmentation and then refines the initial mask by iteratively augmenting it with missing parts or reducing it by excluding noisy parts. Recently, deep learning based methods [94, 39, 89, 55, 56] have been proposed that utilize both motion boundaries and saliency maps to identify the primary object. Two-stream FCNs [61], such as LVO [95] and FSEG [39], are proposed to jointly exploit appearance and motion features. FSEG further boosts the performance by utilizing weakly annotated videos, while LVO forwards the concatenated features to a bidirectional convolutional GRU. MBN [56] combines the background estimated by a motion-based bilateral network with instance embeddings to boost the performance.

1.2.4 Hard example mining

There is an enormous imbalance between the foreground and background regions, since far more regions can be sampled from the background than from the foreground. In addition, the overwhelming number of easy negative examples from background regions contributes little to training the detector; thus, hard negative mining approaches are proposed to tackle this imbalance. Bootstrapping is exploited in optimizing Support Vector Machines (SVMs) [23] through several rounds of training SVMs to converge on a working set, modifying the working set by removing easy examples and adding hard examples. Hard example mining is also used in boosted decision trees [18] by training the model with positive examples and a random set of negative examples. The model is then applied to the remaining negative examples to generate false positives for retraining the pre-trained model.

Hard negative mining has also been exploited in deep learning models to improve the performance. OHEM [87] trains region-based object detectors using automatically selected hard examples, and yields significant boosts in detection performance on both the PASCAL [20] and MS COCO [58] datasets. The focal loss [57] is designed to down-weight the loss assigned to well-classified examples and focus training on hard examples. Effective bootstrapping of hard examples is also applied in face detection [101], pedestrian detection [2], and tracking [69]. Both trackers and static image object detectors are applied to select hard examples by finding the inconsistency between the tracklets and object detections from unlabeled videos [91]. In [42], a trained detector is utilized to find an isolated detection, which is marked as a hard negative, based on the preceding and following detections. In the proposed approach, we focus on developing an online hard example selection strategy for video object segmentation.
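As a concrete example of the loss-based route to hard example emphasis, the focal loss mentioned above can be sketched as follows; the formulation is the standard one, and the default values of alpha and gamma are the commonly used ones, not values taken from this thesis.

```python
# A minimal sketch of the focal-loss idea for binary classification,
# FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits and targets have the same shape; targets are 0/1 labels."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified examples,
    # so the total loss is dominated by hard examples.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```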
1.2.5 Object Detection

Current state-of-the-art CNN object detection approaches fall into two main streams: two-step and one-step object detection.

Two-step object detection approaches are based on the R-CNN [27] framework: the first step generates candidate object bounding boxes and the second step classifies each candidate bounding box as foreground or background using a convolutional neural network. The R-CNN method [27] trains CNNs end-to-end to classify the proposal regions into object categories or background. SPPnet [31] develops spatial pyramid pooling on shared convolutional feature maps for efficient object detection and semantic segmentation. Inspired by SPPnet, Fast R-CNN [26] enables shared computation on the entire image, after which the detector network evaluates the individual regions, which dramatically improves the speed. Faster R-CNN [82] proposes a Region Proposal Network (RPN) to generate candidate bounding boxes, followed by a second-step classifier that is the same as that of Fast R-CNN. The two-step framework consistently achieves top accuracy on the challenging COCO benchmark [58].

One-step object detection approaches predict bounding boxes and confidence scores for multiple categories directly, without the proposal generation step, in order to reduce training and testing time. OverFeat [86], a deep multiscale and sliding window method, presents an integrated framework to implement classification, localization and detection simultaneously by utilizing a single shared network. YOLO [81] exploits the whole topmost feature map to predict bounding boxes and class probabilities directly from full images in one evaluation. SSD [60] utilizes default boxes of different aspect ratios and scales at each feature map location. At prediction time, the classification scores are determined for each default box and the bounding box coordinates are adjusted to match the object shape. To handle objects of various sizes, multiple feature maps with different resolutions are combined to perform better predictions.

1.2.6 Object Tracking

Object tracking is one of the fundamental problems in computer vision, and CNN-based tracking algorithms have advanced rapidly with the development of deep learning. The trackers can be divided into three main streams: correlation filter based trackers [24, 16], Siamese network based trackers [92, 5] and detection based trackers [70, 43].

Correlation filter based trackers can detect objects very quickly in the frequency domain. Recent techniques incorporate representations from convolutional neural networks with discriminative correlation filters to improve the performance. BACF [24] designs a new correlation filter by learning from negative examples densely extracted from the background region. ECO [16] proposes an efficient discriminative correlation filter for visual tracking by reducing the number of parameters and designing a compact generative model.

A Siamese network is composed of two CNN branches with tied parameters; it takes image pairs as input and predicts their similarity. Siamese network based trackers learn a matching function offline on image pairs. In the online tracking step, the matching function is exploited to find the object region most similar to the object in the first frame.
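The matching function itself can be sketched as a cross-correlation between embeddings of a template patch and a larger search region, in the spirit of SiamFC; the tiny embedding network below is a placeholder, not the backbone used by any of the cited trackers.

```python
# Illustrative SiamFC-style matching: cross-correlate template and search embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny stand-in for the shared (tied-weight) embedding branch phi.
        self.phi = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(inplace=True),
        )

    def forward(self, template, search):
        # template: (B, 3, h, w) exemplar around the first-frame object
        # search:   (B, 3, H, W) larger search region in the current frame
        z = self.phi(template)                 # (B, C, h', w')
        x = self.phi(search)                   # (B, C, H', W')
        b, c, hk, wk = z.shape
        # Grouped-convolution trick: correlate each search map with its own template.
        resp = F.conv2d(x.reshape(1, b * c, *x.shape[-2:]),
                        z.reshape(b * c, 1, hk, wk), groups=b * c)
        resp = resp.reshape(b, c, *resp.shape[-2:]).sum(dim=1, keepdim=True)
        return resp                            # response map; its peak locates the target
```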
SiamFC [5] trains a fully convolutional Siamese network directly without online adaptation; it achieves 86 fps with a GPU, but its tracking accuracy is not state-of-the-art. Subsequent Siamese approaches improve the performance by utilizing a region proposal network [22, 53], a deeper neural network [115, 52], and learning from the distractors in the background [117]. SINT [92] utilizes optical flow for candidate sampling and achieves higher tracking accuracy but lower speed. CFNet [98] interprets the correlation filter as a differentiable layer in a deep neural network to compute the similarity between the two input patches; the experimental results show comparable tracking accuracy at high frame rates. We adopt SiamRPN [53] as the backbone of the tracking branch of the proposed method.

Tracking-by-detection approaches train a classifier to distinguish positive image patches from negative ones. MDNet [70] finetunes a classification network to learn class-agnostic representations appropriate for the visual tracking task. It proposes a multi-domain learning framework to separate the domain-independent information from the domain-specific one. Although MDNet demonstrates state-of-the-art tracking accuracy on two benchmark datasets, the tracking speed is about 1 fps. RT-MDNet [43] utilizes an improved RoIAlign [30] technique to increase the tracking speed by extracting representations from the feature map instead of the image. This approach achieves accuracy similar to MDNet at real-time tracking speed.

1.2.7 Generative Adversarial Network

The paucity of thermal drone training data forms a major bottleneck in training deep neural networks for drone monitoring. To address the problem, we propose an adversarial data generation approach to augment the existing thermal drone data.

Generative Adversarial Networks (GANs) simultaneously train two models: a generator and a discriminator. The generator tries to generate data from some distribution to maximize the probability of the discriminator making a mistake, while the discriminator distinguishes whether a sample comes from the training data or from the generator. GANs have shown impressive results in a wide range of tasks, such as generating high-quality images [80], semi-supervised learning [85], image inpainting [111], video prediction and generation [67], and image translation [38]. Image-to-image translation approaches have drawn more and more attention due to the development of GANs. Pix2Pix [38] exploits a regression loss to guide the GAN to learn paired image-to-image translation. Due to the lack of paired data, Cycle-GAN [116] utilizes a combination of adversarial and cycle-consistency losses to handle unpaired image-to-image translation. Taigman et al. [90] exploit cycle-consistency in the feature space together with an adversarial loss to transfer a sample from one domain to an analogous sample in another domain. In this thesis, a new unpaired image-to-image translation algorithm is proposed to augment the thermal drone images.
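The unpaired translation objective described above combines adversarial and cycle-consistency terms. A schematic version is sketched below, where G, F_inv, D_X and D_Y are placeholder generator and discriminator modules; the least-squares adversarial form and the weight lambda_cyc follow common Cycle-GAN practice rather than this thesis's exact formulation.

```python
# Schematic Cycle-GAN-style losses for unpaired translation between domains X and Y
# (e.g., visible -> thermal). G: X->Y, F_inv: Y->X, D_X/D_Y: domain discriminators.
import torch
import torch.nn.functional as F

def cyclegan_losses(G, F_inv, D_X, D_Y, real_x, real_y, lambda_cyc=10.0):
    fake_y = G(real_x)
    fake_x = F_inv(real_y)

    # Least-squares adversarial terms for the generators: try to fool D_X and D_Y.
    pred_fake_y, pred_fake_x = D_Y(fake_y), D_X(fake_x)
    loss_gan = F.mse_loss(pred_fake_y, torch.ones_like(pred_fake_y)) + \
               F.mse_loss(pred_fake_x, torch.ones_like(pred_fake_x))

    # Cycle-consistency: translating there and back should reproduce the input.
    loss_cyc = F.l1_loss(F_inv(fake_y), real_x) + F.l1_loss(G(fake_x), real_y)

    # Discriminator terms: real -> 1, generated -> 0 (detach so only D is updated here).
    pr_y, pf_y = D_Y(real_y), D_Y(fake_y.detach())
    pr_x, pf_x = D_X(real_x), D_X(fake_x.detach())
    loss_d = F.mse_loss(pr_y, torch.ones_like(pr_y)) + F.mse_loss(pf_y, torch.zeros_like(pf_y)) + \
             F.mse_loss(pr_x, torch.ones_like(pr_x)) + F.mse_loss(pf_x, torch.zeros_like(pf_x))

    return loss_gan + lambda_cyc * loss_cyc, loss_d
```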
1.3 Contributions of the Research

1.3.1 Video Object Segmentation

Video object segmentation is the task of segmenting foreground objects from the background across all frames in a video clip. VOS methods can be classified into two categories: semi-supervised and unsupervised. Semi-supervised VOS methods require the ground truth segmentation mask in the first frame as input and then segment the annotated object in the remaining frames. Unsupervised VOS methods identify and segment the main object in the video automatically. The main contributions of this thesis are summarized below.

First, we introduce a novel unsupervised video object segmentation method by combining instance segmentation and motion information.

Second, we transfer a recent semi-supervised network architecture to the unsupervised context.

Third, a novel hard negative example selection method is proposed by incorporating instance proposals, block matching tracklets and motion saliency masks.

Fourth, we propose a distractor-aware approach to perform online adaptation, generating video object segmentations with better temporal consistency and avoiding the propagation of motion errors.

Fifth, we propose a two-stage framework, track and then segment, to perform video object segmentation given the first frame box annotation, and demonstrate that this two-stage approach reduces the runtime by a large margin. The bounding box annotation alone is enough to automatically segment the object in the remaining frames.

Finally, the proposed methods achieve state-of-the-art results on multiple datasets: DAVIS 2016, DAVIS 2017, SegTrack-v2 and FBMS-59.

1.3.2 Drone Detection and Tracking

A video-based drone monitoring system is proposed in this thesis to detect and track drones during day and night. The system consists of a drone detection module and a drone tracking module, both designed based on deep learning networks. The contributions of our work are summarized below.

To the best of our knowledge, this is the first work to use deep learning technology to solve the challenging drone detection and tracking problem.

We propose to exploit a large number of synthetic drone images, generated by conventional image processing and 3D rendering algorithms, along with a small amount of real 2D and 3D data, to train the CNN.

We develop an adversarial data augmentation technique, a modified Cycle-GAN-based generation approach, to create more thermal drone images to train the thermal drone detector.

We propose to utilize the residual information from an image sequence to train and test a CNN-based object tracker. It allows us to track a small flying object in cluttered environments.

We present an integrated drone monitoring system that consists of a drone detector and a generic object tracker. The integrated system outperforms the detection-only and the tracking-only sub-systems. We have validated the proposed system on the USC drone dataset.

1.4 Organization of the Thesis

The rest of the thesis is organized as follows. In Chapter 2, we design pseudo ground truth with motion cues to tackle unsupervised video object segmentation. In Chapter 3, we propose a distractor-aware online adaptation algorithm based on hard example mining to perform unsupervised video object segmentation. In Chapter 4, we propose an integrated drone monitoring system to detect and track drones. In Chapter 5, we present a two-stage approach, track and then segment, to perform semi-supervised video object segmentation (VOS) with only bounding box annotations. Finally, concluding remarks and future research directions are given in Chapter 6.
Chapter 2
Design Pseudo Ground Truth with Motion Cues for Unsupervised Video Object Segmentation

2.1 Introduction

Video object segmentation (VOS) is the task of segmenting foreground objects from the background across all frames in a video clip. VOS methods can be classified into two categories: semi-supervised and unsupervised. Semi-supervised VOS methods [65, 97, 100, 6, 74] require the ground truth segmentation mask of the first frame as input and then segment the annotated object in the remaining frames. Unsupervised VOS methods [47, 94, 39, 11, 55, 56] identify and segment the main object in the video automatically.

Recent image-based semantic and instance segmentation tasks [30, 37, 36] have achieved great success due to the emergence of deep neural networks such as the fully convolutional network (FCN) [61]. The one-shot video object segmentation (OSVOS) method [6] uses large classification datasets for pretraining and applies the foreground/background segmentation information obtained from the first frame to object segmentation in the remaining frames of the video clip. It converts an image-based segmentation method into a semi-supervised video-based segmentation method by processing each frame independently, without using temporal information.

However, since manual annotation is expensive, it is desirable to develop the more challenging unsupervised VOS solution. This is feasible due to the following observation. Inspired by vision studies [79], moving objects can attract the attention of infants and young animals, who can group things properly without knowing what kinds of objects they are. Furthermore, we tend to group moving objects and separate them from the background and other static objects. In other words, semantic grouping is acquired after motion-based grouping in the VOS task.

In this chapter, we propose to tag the main object in a video clip by combining motion information and the instance segmentation result. We use optical flow to group segmented pixels into a single object as the pseudo ground truth and then take it as the first-frame mask to perform OSVOS. The pseudo ground truth is the estimated object mask for the first frame, replacing the true ground truth used by semi-supervised VOS methods. The main idea is sketched below. We apply a powerful instance segmentation algorithm, Mask R-CNN [30], to the first frame of a video clip, as shown in Figure 2.1, where different objects have different labels. Then, we extract optical flow from the first two frames and select and group different instance segmentations to estimate the foreground object. Next, we finetune a pretrained CNN using the estimated foreground object from the first frame as the pseudo ground truth and propagate the foreground/background segmentation to the remaining frames of the video one frame at a time. Finally, we achieve state-of-the-art performance on the benchmark datasets by incorporating online adaptation [100]. Example results are shown in Figure 2.2.

Our goal is to segment the primary video object without manual annotations. The proposed method does not use the temporal information of the whole video clip at once but one frame at a time. Errors in individual frames therefore do not propagate over time. As a result, the proposed method has higher tolerance against occlusion and fast motion.
Figure 2.1: Overview of tagging the main object. We use an instance segmentation algorithm to segment objects in the static image. We then utilize optical flow to select and group the segments into one foreground object.

We evaluate the proposed method extensively on the DAVIS dataset [75] and the FBMS dataset [71]. Our method gives state-of-the-art performance on both datasets, with a mean intersection-over-union (IoU) of 79.3% on DAVIS and 77.9% on FBMS.

The main contributions of this work are summarized below. First, we introduce a novel unsupervised video object segmentation method by combining instance segmentation and motion information. Second, we transfer a recent semi-supervised network architecture to the unsupervised context. Finally, the proposed method outperforms state-of-the-art unsupervised methods on several benchmark datasets.

The rest of this chapter is organized as follows. Related work is reviewed in Sec. 1.2. Our novel unsupervised video object segmentation method is proposed in Sec. 2.2. Experimental results are shown in Sec. 2.3. Finally, concluding remarks are given in Sec. 2.4.

Figure 2.2: Example results of our method, where the pseudo ground truth of the first frame is in yellow, and the other seven images in green are sample segmentations of the rest of the video clip. Best viewed in color.

2.2 Proposed Method

Our goal is to segment the generic object in a video in an unsupervised manner. In semi-supervised VOS, the first frame ground truth label is needed. Inspired by the semi-supervised approach, we propose a method to tag a "pseudo ground truth", take it as input to the pretrained network, and then output segmentation masks for the rest of the video. To the best of our knowledge, this is the first attempt to transfer a semi-supervised VOS approach to the unsupervised setting by utilizing a "pseudo ground truth". Figure 2.3 shows the overview of the proposed method, which includes three key components: the criterion to tag the primary object, the appearance model, and online adaptation.

Figure 2.3: Overview of the proposed method. We trained the appearance model on the DAVIS training set with a wider-ResNet pre-trained on ImageNet, COCO and PASCAL VOC. We then finetuned the model on the first frame "pseudo ground truth", and online adaptation is applied afterwards. The pixels in yellow and green are selected positive and negative examples, respectively, in the online branch.

2.2.1 Learning to tag the foreground object

Image instance segmentation.

We apply an image-based instance segmentation algorithm to the first frame of the given video. Specifically, we choose Mask R-CNN [30] as our instance segmentation framework and generate instance masks. Our error analysis further demonstrates that better initial instance segmentations improve the performance by a large margin, which suggests that our proposed method has the potential to improve further with more advanced instance segmentation methods.

Mask R-CNN is a simple yet high-performance instance segmentation model. Specifically, Mask R-CNN adds an additional FCN mask branch to the original Faster R-CNN [83] model. The mask branch and the bounding box branch are trained simultaneously during training, while at inference time the instance masks are generated from the detection results. The box prediction branch generates bounding boxes based on the proposals, followed by non-maximum suppression. The mask branch is then applied to predict segmentation masks from the 100 detection boxes with the highest scores. This step speeds up inference and improves accuracy, in contrast to the training stage, where the branches are computed in parallel. For each region of interest (RoI), the mask branch predicts n masks, where n is the number of classes in the training set, and only the k-th mask, corresponding to the class predicted by the classification branch, is used.
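A minimal way to obtain such first-frame instance proposals is sketched below, using torchvision's COCO-pretrained Mask R-CNN as a stand-in for the instance segmentation framework; the score and mask thresholds are illustrative choices, not values prescribed by this thesis.

```python
# Sketch: extract binary instance masks for one frame with an off-the-shelf Mask R-CNN.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

def first_frame_proposals(image_path, score_thresh=0.5, mask_thresh=0.5):
    """Return a list of binary instance masks (H x W boolean arrays) for one frame."""
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]                        # dict with boxes, labels, scores, masks
    keep = out["scores"] > score_thresh
    masks = out["masks"][keep, 0] > mask_thresh      # (N, H, W) soft masks -> binary
    return [m.numpy() for m in masks]
```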
We note that the mask branch generates class-specific instance segmentation masks for the given image, whereas VOS focuses on class-agnostic object segmentation. Our experiments show that even though Mask R-CNN can only generate a limited set of class labels, determined by COCO [58] and PASCAL [20], we can still output instance segmentation masks with the closest class label. Our algorithm further merges all classes into one foreground class, and thus misclassification has little influence on the VOS performance.

Optical flow thresholding.

There are two important cues in video object segmentation: appearance and motion. To use information from both the spatial and temporal domains, we incorporate optical flow with instance segmentation to learn to segment the primary object. Instance segmentation can generate precise class-specific segmentation masks; however, it cannot determine the primary object in the video. Optical flow can separate moving objects from the background, but optical flow estimation is still far from perfect. Motivated by the observation that moving objects attract people's attention [73], we use motion information to select and group the static-image instance segmentation proposals, which combines the merits of optical flow and instance segmentation.

We apply the Coarse2Fine optical flow algorithm [59] to extract the optical flow between the first and second frames of a given video clip. To combine it with the instance segmentation proposals, we normalize the flow magnitude and then threshold it, motivated by the observation that faster motions are more likely to attract attention. We select the instance segmentation proposals that have more than 80% overlap with the optical flow mask. We further group the selected proposal masks with different class labels into one foreground class without any class label. In image-based instance segmentation, the same object may be separated into different parts due to differences in color and texture or the influence of occlusions. We can efficiently group the different parts into one primary object without knowing the categories of the objects. We name this foreground object the "pseudo ground truth" and forward it to the pretrained appearance model. Sample "pseudo ground truths" are shown in Figure 2.1.
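The selection rule above can be sketched as follows, under one plausible reading of the 80% overlap criterion (the fraction of a proposal covered by the thresholded flow-magnitude mask); the flow threshold is illustrative, and the optical flow and instance masks are assumed to be precomputed.

```python
# Sketch: build the "pseudo ground truth" by keeping instance proposals that overlap
# the thresholded flow-magnitude mask and merging them into one foreground mask.
import numpy as np

def motion_mask(flow, ratio=0.5):
    """flow: (H, W, 2) optical flow. Keep pixels whose normalized magnitude is large."""
    mag = np.linalg.norm(flow, axis=2)
    mag = mag / (mag.max() + 1e-8)
    return mag > ratio

def pseudo_ground_truth(instance_masks, flow, overlap_thresh=0.8):
    """instance_masks: list of (H, W) boolean masks from the first frame."""
    moving = motion_mask(flow)
    fg = np.zeros(moving.shape, dtype=bool)
    for m in instance_masks:
        # Fraction of the proposal that falls inside the moving region.
        overlap = (m & moving).sum() / (m.sum() + 1e-8)
        if overlap > overlap_thresh:
            fg |= m            # group selected proposals into a single class-agnostic object
    return fg
```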
2.2.2 Unsupervised video object segmentation

Our proposed method is built on one-shot video object segmentation (OSVOS) [6], which finetunes the pretrained appearance model on the first annotated frame. We replace the first annotated frame with our estimated "pseudo ground truth" so that the semi-supervised network architecture can be used in our proposed approach. Our goal is to train a ConvNet to segment a generic object in a video.

Network overview.

We adopt a more recent ResNet [105] architecture pretrained on ImageNet [17] and MS COCO [58] to learn powerful features. In more detail, the network uses model A of the wider ResNet with 38 hidden layers as the backbone. The data in the DAVIS training set is very scarce, so we further pretrain the network using PASCAL [20] by mapping all 20 class labels to one foreground label and keeping the background unchanged. As demonstrated in OnAVOS [100], the two steps of finetuning on DAVIS and PASCAL are complementary. Hence, we finetune the network on the DAVIS training set and obtain the final pretrained network. The training steps above are all offline training that constructs a model to identify the foreground object. At inference time, we finetune the network on the "pseudo ground truth" of the first frame to tell the network which object is to be segmented. However, the first frame does not provide all the information throughout the whole video, and thus online adaptation is needed at test time.

Online adaptation.

The major difficulty in video object segmentation is that the appearance may change dramatically throughout the video. A model learned only from the first frame cannot address severe appearance changes. Therefore, online adaptation of the model is needed to exploit the information from the remaining frames during inference.

We adopt the test-time data augmentation method from Lucid Data Dreaming [45] and the online adaptation approach from OnAVOS [100] to perform our online finetuning. We generate augmentations of the first frame using the Lucid Data Dreaming approach. As each frame is segmented, foreground pixels with high-confidence predictions are taken as further positive training examples, while pixels far away from the last assumed object position are taken as negative examples. Then an additional round of fine-tuning is performed on the newly acquired data.
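A sketch of this online example selection is given below; the confidence and distance thresholds are illustrative placeholders, not the values used by OnAVOS or in our experiments.

```python
# Sketch of OnAVOS-style online example selection: confident foreground pixels become
# new positives, pixels far from the last predicted object become negatives,
# and everything else is ignored during the extra fine-tuning round.
import numpy as np
from scipy.ndimage import distance_transform_edt

def select_online_examples(fg_prob, last_mask, pos_thresh=0.97, dist_thresh=150):
    """
    fg_prob:   (H, W) predicted foreground probability for the current frame.
    last_mask: (H, W) boolean mask of the last assumed object position.
    Returns a label map: 1 = positive, 0 = negative, 255 = ignored.
    """
    labels = np.full(fg_prob.shape, 255, dtype=np.uint8)
    # Distance (in pixels) from every location to the nearest object pixel.
    dist_to_object = distance_transform_edt(~last_mask)
    labels[dist_to_object > dist_thresh] = 0       # far from the object -> negative
    labels[fg_prob > pos_thresh] = 1               # high-confidence foreground -> positive
    return labels
```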
The other four columns are the output segmentation masks of our proposed approach. Our algorithm performs well on videos with fast motion (first and forth row), gesture changes (second row), unseen category (third row) and complex background (fifth row). Best viewed in color. Table 2.1: Comparison of the mIoU scores (%) of different unsupervised VOS approaches in the DA VIS 2016 validation dataset. Our method achieves the highest mIoU compared with state-of-the-art methods NLC [21] FST [72] LMP [94] FSEG [39] LVO [95] ARP [47] Ours mIoU 55.1 55.8 70.0 70.7 75.9 76.2 79.3 2.3.2 Implementation details We jointly use optical flow and semantic instance segmentation to group foreground objects that move together into a single object. We use the optical flow from a re- implementation of Coarse2Fine optical flow [59]. We implemented the objectness net- work using Tensorflow [1] library and set wider ResNet [105] with 38 hidden layers as the backbone. The segmentation network is simple without using upsampling, skip con- nections or multi-scale structures. In some convolution layers, increasing the dilation 22 rates and removing the down-sampling operations accordingly are applied to generate score maps at 1/8 resolution. Large field-of-view setting in Deeplabv2 [7] is used to replace the top linear classifier and global pooling layer which exist in the classification network. Besides, the batch normalization layers are freezed during finetuning. We adopted the initial network weights provided by the repository which were pre- trained on the ImageNet and COCO dataset. We further finetune the objectness network based on the augmented PASCAL VOC ground truth from [29] with a total of 12, 051 training images. Note that we force all the foreground objects in a certain image to one single foreground object and keep background the same. For the DA VIS dataset evaluation, we further train the network on DA VIS training set and then apply a one-shot finetuning on the first frame with “pseudo ground truth”. The segmentation network is trained on the first frame image/“pseudo ground truth” pair, by Adam with learning rate 310 6 . We set the number of finetuning n f on the first frame as 100, we found that a relatively small n f can improve the accuracy which is opposite with semi-supervised VOS. For the online part, we used the default parameters in OnA VOS [100] by setting the number of finetuning as 15, finetuning interval as 5 frames, and learning rate as110 5 and adopted the CRF parameters from DeepLab [7]. For completeness, we also conduct experiments on FBMS and SegTrack-v2 datasets, we conduct the same procedures for FBMS as DA VIS. To check the effectiveness of the “pseudo ground truth” we only perform one-shot branch for SegTrack-v2 without online adaption. 23 2.3.3 Comparison with state-of-the-art methods DA VIS. We compare our proposed approach with state-of-the-art unsupervised techniques, NLC [21], LMP [94], FSEG [39], LVO [95], and ARP [47] in Table 2.1. We achieve the best performance for unsupervised video object segmentation: 3.1% higher than the second best ARP. Besides, we achieve mIoU of 71.2% on the DA VIS validation set by extracting the pseudo ground-truth on each frame of a given video. When we break down the per- formance on each DA VIS sequence, we outperform the majority of the videos shown in Table 3.1, and especially for drift-straight, libby and scooter-black, our results are more than 10% higher than the second-best results. 
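The finetuning schedule from the implementation details above can be summarized as the following schematic. Only the step counts and learning rates are taken from the text; train_step, select_online_samples, and model.predict are hypothetical placeholders, and details such as first-frame data augmentation, CRF post-processing, and the exact OnAVOS sampling scheme are omitted.

def segment_video(model, frames, pseudo_gt):
    # One-shot branch: 100 Adam steps (lr 3e-6) on the first frame paired
    # with its pseudo ground truth.
    for _ in range(100):
        train_step(model, frames[0], pseudo_gt, lr=3e-6)

    masks = [model.predict(frames[0])]
    for t, frame in enumerate(frames[1:], start=1):
        mask = model.predict(frame)
        if t % 5 == 0:
            # Online branch: every 5 frames, 15 extra steps (lr 1e-5) on samples
            # taken from the current prediction (confident foreground pixels as
            # positives, pixels far from the object as negatives).
            pos, neg = select_online_samples(mask)
            for _ in range(15):
                train_step(model, frame, (pos, neg), lr=1e-5)
            mask = model.predict(frame)
        masks.append(mask)
    return masks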
Our approach could segment unknown object classes which do not need to be in the PASCAL/COCO vocabulary. The goat in the third row is an unseen category in the training data, the closest semantic category horse is matched instead. Note that our algorithm only needs the foreground mask with- out knowing the specific category, and performs better than state-of-the-art methods. Our method performs even better when the object classes are in the MS COCO, the top two rows show a single instance segmentation with large appearance changes (first row) and viewing angle and gesture changes (second row). The bottom two rows show that our algorithm works well when merging multiple object masks to one single mask with viewing angle changes (forth row) and messy background (fifth row). To verify where the improvements come from, we utilize similar backbone with previous method. We test OSVOS [6] by replacing the first frame annotations with pseudo ground truths. OSVOS uses the VGG architecture, and we set the number of first-frame fine-tuning to 500 without applying boundary snapping. The mIoUs of our approach and the original OSVOS are 72.3% and 75.7%, respectively. Our approach in 24 Table 2.2: Comparison of the mIoU scores (%) per video of several methods for the DA VIS 2016 validation dataset. (1) blackswan, (2) bmx-trees, (3) breakdance, (4) camel, (5) car-roundabout, (6) car-shadow, (7) cows, (8) dance-twirl, (9) dog, (10) drift- chicane, (11) drift-straight, (12) goat, (13) horsejump-high, (14) kite-surf, (15) libby, (16) motocross-jump, (17) paragliding-launch, (18) parkour, (19) scooter-black, (20) soapbox Method No.1 No.2 No.3 No.4 No.5 No.6 No.7 No.8 No.9 No.10 No.11 No.12 No.13 No.14 No.15 No.16 No.17 No.18 No.19 No.20 meanIoU FSEG [39] 81.2 43.3 51.2 83.6 90.7 89.6 86.9 70.4 88.9 59.6 81.1 83.0 65.2 39.2 58.4 77.5 57.1 76.0 68.8 62.4 70.7 ARP [47] 88.1 49.9 76.2 90.3 81.6 73.6 90.8 79.8 71.8 79.7 71.5 77.6 83.8 59.1 65.4 82.3 60.1 82.8 74.6 84.6 76.2 Ours-oneshot 83.3 39.8 50.3 76.1 82.9 91.9 87.5 77.3 90.1 86.1 85.4 85.1 74.1 60.1 75.5 75.5 57.3 89.9 72.7 74.9 75.8 Ours-online 82.0 46.0 60.7 75.5 93.0 94.6 87.6 78.9 89.3 82.9 91.7 85.7 76.8 60.3 76.0 84.7 56.9 89.9 88.1 85.1 79.3 Table 2.3: Comparison of mIoU scores (%) of different finetuning times on the first frames of the DA VIS validation set Finetuning times Semi-supervised oneshot Unsupervised oneshot Unsupervised online 50 80.4 73.1 77.6 100 80.7 75.8 79.3 500 81.4 74.4 77.7 2000 82.1 74.8 77.9 the VGG architecture still outperforms FSEG (70.7%) without online adaptation, CRF, test time data augmentation. We further analyze the finetuning times on the first frames for both semi-supervised and unsupervised approaches in Table 2.3. In the table, the second column shows that the performance improves with the increasing finetuning times for semi-supervised approach in terms of mIoU, which indicates more finetuning times with image/ground truth pairs can predict better results. The right two columns show the different rela- tionships between the performance in mIoU and finetuning times on the first frames for unsupervised approach. They both achieve the highest performance by setting the number of finetuning as 100, which indicates the model learns better with an appropri- ate number of finetuning since the pseudo ground truth is not as accurate as the ground truth. 25 Table 2.4: Comparison of the F-score and mIoU scores (%) of different unsupervised VOS approaches on the FBMS test dataset. 
Our method achieves the highest compared with state-of-the-art methods NLC [21] FST [72] CVOS [93] MP-Net-V [94] LVO [95] ARP [47] Ours mIoU 44.5 55.5 - - - 59.8 77.9 F-score - 69.2 74.9 77.5 77.8 - 85.1 FBMS. We evaluate the proposed approach on the test set, with 30 sequences in total. The results are shown in Table 3.2. Our method is outperformed in both evaluation metrics, with an F-score of 85.1% which is 7.3% higher than the second-best method LVO [95], and the mIoU of 77.9% which is 18.1% better than ARP [47], which performs the second-best on DA VIS. Figure 5.4 shows qualitative results of our method, our algorithm performs well for most of the sequences. The last row shows the failure case for rabbits04 since there are severe occlusions in this video and the rabbit is also an unseen category in the MS COCO. To recover a better prediction mask, further motion information should be used to address this problem. SegTrack-v2. Our method achieves mIoU of 58.7% on this dataset, which is higher than other methods that do well on DA VIS, CUT [44] (47.8%), FST [72] (54.3%), and LVO [95] (57.3%). Note that we did not apply online adaptation on this dataset which could further improve the performance. Our method performs worse than NLC [21] (67.2%) due to low reso- lution of SegTrack-v2 and the fact that NLC is designed and evaluated on this dataset. We outperform NLC on both FBMS and DA VIS datasets by a large margin. Figure 2.6 shows qualitative results of the proposed method on the SegTrack-v2. All these visual results demonstrate the effectiveness of our approach where the category of the object is not existed in MS COCO [58] or PASCAL VOC 2012 [20]. The accurate category is 26 Figure 2.5: Qualitative results on FBMS dataset: The first column in yellow is the “pseudo ground truth” of the first frame of each video. The other four columns are the output segmentation masks of our proposed approach. Best viewed in color. not needed in our approach, as long as the foreground object is consistent in the whole video. The objectness of the worm sequence in the third row cannot be detected using instance segmentation algorithm, in this case the thresholded flow magnitude is used as the pseudo ground truth mask instead. 2.3.4 Ablation studies Table 3.3 presents our ablation study on DA VIS 2016 validation set on the three major components: online adaptation, CRF [7] and test time data augmentation. The baseline 27 Figure 2.6: Qualitative results on SegTrack-v2 dataset: The first column in yellow is the “pseudo ground truth” of the first frame of each video. The other four columns are the output segmentation masks of our proposed approach. Best viewed in color. ours-oneshot in Table 3.3 is the wider-ResNet trained on the PASCAL VOC 2012 dataset and the DA VIS 2016 training set. Online adaptation provides 1.4% improvement over the baseline in terms of mIoU. Additional CRF post processing brings further 1.1% boost in terms of mIoU. Combining with test time data augmentation (TTDA) gives the best performance of 79.3% in mIoU which is 3.5% higher than the baseline without any post processing. Figure 2.7 shows qualitative comparisons for oneshot and online approaches on the video sequences camel and car-roundabout. Our online approach outperforms our oneshot approach for the sequence car-roundabout in the second row, which is due to the right bottom pixels are considered as negative training examples from the previous frames. 
The additional round of finetuning is performed on the newly acquired data to remove the false positive masks. The first row shows the failure case for the two approaches, the two branches both wrongly predict the foreground mask when the mov- ing camel is walking across the static camel. This example shows the weakness of the 28 Table 2.5: Ablation study of our approach on DA VIS 2016 validation set. TTDA denotes the test time data augmentation and CRF denotes conditional random field Ours-oneshot +Online +Online +CRF +Online +CRF +TTDA mIoU(%) 75.8 77.2 78.3 79.3 “pseudo ground truth” Ours-oneshot Ours-online Figure 2.7: Comparison of qualitative results on two sequences, camel and car- roundabout, on DA VIS validation set: The first column in yellow is the “pseudo ground truth” of the first frame of each video. The other two columns are the output segmenta- tion masks of our oneshot and online approaches respectively. Best viewed in color. oneshot approaches by propagating throughout the whole video without using motion information. Error analysis. To analyze the effect of the first frame tagging, we apply OSVOS to the entire DA VIS validation set using the pseudo ground truth and the ground truth of the first frame respectively, the mIoUs of the entire dataset and two difficult sequences are shown in Table 2.6. The mIoUs of the entire DA VIS validation set is 5.5% lower when using pseudo ground truth of the first frame. This demonstrates that more accurate mask prediction for the first frame can generate better segmentation masks for the remaining 29 Table 2.6: Error analysis for the entire DA VIS 2016 validation set and two difficult videos (bmx-trees and kite-surf) Sequences Erosion Dilation Pseudo Ground Truth Ground Truth bmx-trees 33.9 42.4 46.0 52.5 kite-surf 51.0 56.4 60.3 66.6 mIoU 68.4 76.1 79.3 84.8 frames of the whole video, which shows the potential performance improvement when using more advanced tagging technique. We also erode and dilate the pseudo ground truth by 5 pixels respectively and use the erosion and dilation masks as the new pseudo ground truths to apply OSVOS approach to the videos. The performances have largely degraded from 3.2% to 10.9% compared with those of the original pseudo ground truth. This demonstrates accurate tagging is the key component for our tagging and segmenting approach. 2.4 Conclusion In this chapter, we present a simple yet intuitive approach for unsupervised video object segmentation. Specifically, instead of manually annotating the first frame like exist- ing semi-supervised methods, we proposed to automatically generate the approximate annotation, pseudo ground truth, by jointly employing instance segmentation and optical flow. Experimental results on the DA VIS, FBMS and SegTrack-v2 demonstrate that our approach enables effective transfer from semi-supervised VOS to unsupervised VOS and improves the mask prediction performance by a large margin. Our error analysis shows that using better instance segmentation has a dramatic performance boost which shows great potential for further improvement. Our approach is able to extend from sin- gle object tracking to multiple arbitrary object tracking based on the category-agnostic ground truths or pseudo ground truths. 30 Chapter 3 Unsupervised Video Object Segmentation with Distractor-aware Online Adaptation 3.1 Introduction Video object segmentation (VOS) aims to segment foreground objects from complex background scenes in video sequences. 
There are two main categories in existing VOS methods: semi-supervised and unsupervised. Semi-supervised VOS algorithms [65, 97, 100, 6, 74, 34] require manually annotated object regions in the first frame and then automatically segment the specified object in the remaining frames throughout the video sequence. Unsupervised VOS algorithms [47, 94, 39, 55, 56, 33] segment the most conspicuous and eye-attracting objects without prior knowledge of these objects in the video. Both settings have ground truth labels for the training set. However, in the test video clip, semi-supervised VOS has the first frame annotation while unsupervised VOS has no annotations. Because the unsupervised VOS is more challenging and widely applicable, we focus on developing a VOS algorithm in this group. In unsupervised video object segmentation, motion information is a key factor because motion attracts people’s and animals’ attention. [51] initializes the segments from the motion saliency and then propagates it to the remaining frames. However, the 31 Ground truth ARP PDB Ours Figure 3.1: Example results of ground truth, ARP [47], PDB [89] and the proposed method. Distractors in the background lead to incorrect segmentations (second and third column) due to the similar features between the foreground and background objects. We exploit distractor-aware online adaptation approach to learn from the hard negatives to avoid mis-classifying background objects as foreground. Best viewed in color with 4 zoom. initialized regions from motion saliency might not be accurate especially when multi- ple objects are moving in the neighborhood. Recently, deep learning models have also been applied to automatically segment moving objects with motion cues. FSEG [39] trains a dual-branch fully convolutional neural network, which consists of an appear- ance network and a motion network, to jointly learn the object segmentation and optical flow. However, direct fusion of the optical flow and object segmentation cannot find the correspondence between the foreground and the motion patterns successfully. In this chapter, we aim at generating an accurate initialized foreground mask by leveraging optical flow and instance segmentation and the distractor-aware online adap- tation (DOA) is proposed to generate consistent foreground object regions. Specifically, 32 the instance segmentation algorithm is applied to roughly segment the objectiveness masks, and then the foreground object of the first frame is further grouped and selected by optical flow. We call this predicted foreground mask “pseudo-ground truth.” Finally, one-shot VOS [6] is performed to propagate the prediction to the remaining frames of the given video. However, the main disadvantages for one-shot VOS are the unsta- ble boundaries and treating static and moving objects the same. To this end, DOA is proposed to utilize the motion information throughout the whole video to improve the inter-frame consistency of the consecutive masks and avoid motion error propagation by exploiting the fusion of motion and appearance. Inspired by DOA, we erode the prediction mask from the previous frame and mark it as a positive example for the current frame. It is known that consistently paying attention to a previously seen background object, when it meets with the target foreground object, segmentations should be easily distinguishable. The instance proposals that are not covered by motion masks in the first frame are regarded as hard negatives. 
We apply the block matching algorithm to find corresponding blocks from adjacent frames. If the intersection-over-union (IoU) of the block and instance proposals is larger than a certain threshold, the instance proposals are considered hard negative examples. Thus, hard negative attention-based adaptation is applied to update the network. Sample results are illustrated in Figure 3.1. We evaluate the proposed method on two benchmark datasets, DA VIS 2016 and FBMS-59 datasets. The experimental results demonstrate the state-of-the-art perfor- mances on both the two datasets. The main contributions are summarized as follows: First, we introduce a novel hard negative example selection method by incorpo- rating instance proposals, block matching tracklets and motion saliency masks. 33 Figure 3.2: Overview of the proposed method. Instead of directly applying static image segmentation to video object segmentation, an online adaptation approach is proposed by detecting distractors (negatives and hard negatives). Both the appearance and motion cues are utlized to generate positives, negatives, and hard negatives. Besides, first frame pseudo ground truth is utilized to supervise the finetuning process to make accurate inferences. Second, we propose a distractor-aware approach to perform the online adaptation to generate video object segmentation with better temporal consistency and avoid the motion error propagation. Finally, the proposed method achieves the state-of-the-art results on DA VIS 2016 and FBMS-59 datasets with mean intersection-over-union (IoU) scores of 81.6% and 79.1%. 3.2 Proposed Method An overview of the proposed method is illustrated in Figure 5.2. The proposed method mainly consists three components. 1) The pseudo ground truth generation of the first frame with both the continuous motion and visual appearance cues. 2) A novel online hard negative example selection approach is proposed to acquire high-quality training data. 3) The distractor-aware online adaptation (DOA) approach facilitates unsupervised video object segmentation. 34 3.2.1 Generate pseudo ground truth In VOS problems, an accurate ground truth of the first frame leads to better segmentation results for the rest of the given video. However, the requirement of manually labeling the ground truth limits the flexibility and applicability of the VOS method. Thus, generate a pseudo-ground truth of the first frame automatically is our first target in the proposed method. Using only one type of objectness masks or motion saliency masks cannot gen- erate accurate and reliable initialized masks. Image-based semantic segmentation [8] and instance segmentation [30] techniques have been well developed in recent years. Instead of utilizing semantic segmentation, we apply an instance segmentation algo- rithm, Mask R-CNN [30], to the first frame of the given video without any further fine- tuning. Mask R-CNN outputs class-specific segmentation masks whereas video object segmentation aims at generating binary class-agnostic segmentation masks. The exper- iments demonstrate that the instance segmentation network produces the segmentation mask with the closest class label to the limited class labels of MS COCO. With further mapping all the classes to one foreground class, the influence of mis-classification is alleviated in the inference process of VOS. 
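The mapping from class-specific Mask R-CNN outputs to a single class-agnostic foreground label can be sketched as follows. The helper name and the use of a 0.8 confidence threshold at this stage are illustrative assumptions; the kept proposals are subsequently selected and grouped with the motion mask as described below.

import numpy as np

def class_agnostic_proposals(masks, scores, score_thresh=0.8):
    """Collapse Mask R-CNN's class-specific outputs into class-agnostic proposals.

    `masks` is an (N, H, W) boolean array of instance masks and `scores` holds
    their detection confidences. Class labels are simply ignored, since VOS
    only needs a binary foreground/background decision."""
    kept = [m for m, s in zip(masks, scores) if s >= score_thresh]
    # Union of all kept proposals: one generic "foreground" mask.
    foreground = np.zeros(masks.shape[1:], dtype=bool)
    for m in kept:
        foreground |= m
    return kept, foreground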
Although instance segmentation provides accurate object regions, the same object may be separated into different parts due to variations in texture and color and the effect of occlusions. In addition, an image-based segmentation algorithm alone is incapable of determining the primary object in certain cases. Therefore, to tackle unsupervised video object segmentation, incorporating motion cues is essential. We utilize motion information to select and group instance proposals and then map all the selected proposals to one foreground mask without knowing the specific category of the object. Specifically, Coarse2Fine [59] is exploited to extract the optical flow between the first and second frames of a given video sequence. To avoid the effect of camera motion, we adopt a flow-saliency transformation by applying a saliency detection method [94] to the optical flow to segment moving regions from the background, instead of thresholding the flow magnitude. Alternative approaches [3, 114, 107] can be applied to detect saliency. Instance proposals whose overlapping regions with the motion regions exceed a certain threshold are selected and grouped into one foreground mask called the pseudo-ground truth:

P_{GT} = \bigcup_{i=1}^{n} I_i, \quad \forall I_i \ \text{s.t.} \ \frac{I_i \cap M}{I_i} > T,   (3.1)

where P_{GT} represents the pseudo-ground truth mask, M is the motion mask, I_i denotes the i-th instance proposal, T is the threshold, and n is the total number of instance proposals.

Semantic segmentation labels all objects of the same category with one mask, while instance segmentation provides a segmentation mask for each instance individually. The proposed approach exploits instance segmentation instead of semantic segmentation because the latter cannot distinguish objects of the same category. However, we observed that the pixel-wise labels in the instance segmentation dataset, MS COCO, are very coarse, while the labels of the semantic segmentation datasets, PASCAL and Cityscapes [13], are very accurate. Therefore, segmentation networks pretrained on finely labeled datasets should predict object regions more accurately. To this end, we adopt the semantic segmentation network Deeplabv3+ [8], pretrained on the PASCAL dataset without further finetuning, to generate object masks when each object category contains at most one instance; the number of objects is determined by the instance segmentation stage. Moreover, as the resolution of the semantic segmentation outputs is lower than that of the VOS datasets, bilinear interpolation and a dense conditional random field (CRF) are utilized to upsample the mask and refine the boundaries, respectively, to generate the objectness mask.

Figure 3.3: Objectness mask from Mask R-CNN (yellow) and Deeplabv3+ (green). Best viewed in color with 3x zoom.

Sample objectness masks generated from Mask R-CNN and Deeplabv3+ are presented in Figure 3.3, which demonstrates the advantage of exploiting semantic segmentation for object region prediction.

3.2.2 Online hard negative/negative/positive example selection

Although treating video object segmentation as image-based foreground/background segmentation without using temporal information is straightforward, obtaining consistent object regions and handling truncations, occlusions, and abrupt motion is difficult, which usually leads to inaccurate foreground segmentation across frames.
To address this challenge, we propose a novel online process that automatically selects hard negative samples using motion cues, together with the acquisition of negative and positive examples, which we use in the online adaptation.

Hard negatives have features similar to those of positive examples, so confusion usually occurs between these two types, and the errors are further propagated during the prediction process if the false positives are not effectively suppressed. Therefore, detecting hard negative samples is essential for improving the segmentation accuracy. In Section 3.2.1, we fuse the motion mask and the instance proposals to obtain a pseudo-ground truth. Similarly, we initialize the hard negatives from the first frame by selecting the detections that are not covered by the motion mask. From the observation of a sequence in which a car drives past a stop sign, we found that the stop sign, which belongs to the background, usually merges into the foreground car even though they are supposed to be classified into different categories. Therefore, only the detections whose confidence score is higher than or equal to 0.8 are kept as positives, regardless of whether they are in the same category or not. Then, Mask R-CNN is applied to the remaining frames of the same video so that we have both segmentation masks and detection bounding boxes during this process. For each detection in frame t, we perform the block matching algorithm [14] against the previous k frames. The search window becomes larger when the reference frame is farther away from frame t; the window expands by 20 pixels in each direction per frame, and k is set to 3 empirically. For these k previous frames, if the minimum IoU between the instance proposals and the matching blocks is larger than 0.7, we denote the detections in the current frame as consistent object bounding boxes and the segmentations inside the corresponding boxes as consistent object segmentations. Subsequently, if the overlap between a consistent object mask and the motion mask in frame t is below 0.2, we denote this consistent object mask as a hard negative example in frame t, which is represented as follows:

HN_S = \bigcup_{i=1}^{n} I_i, \quad \{\forall I_i : (I_i \cap M < T_1) \ \text{and} \ (\min_{k \in [1,3]} (f(I_i^{bb}) \cap P_{i-k}) \geq T_2)\},   (3.2)

where HN_S denotes the hard negatives, M is the motion mask, I_i and I_i^{bb} represent the instance segmentation and its corresponding bounding box, respectively, T_1 and T_2 denote two thresholds, f is the block matching function, and P_{i-k} is the corresponding bounding box in the (i-k)-th frame.

Consistent detections from adjacent frames indicate that these objects have already been seen and treated as hard negatives before the current frame. When the target object moves across these hard negatives, it is more likely to be distinguished from them. The experiments show that significant improvements can be achieved by feeding these hard negative examples to the network.
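To make the selection rule of Eq. (3.2) concrete, the following Python sketch gives one possible rendering. The IoU helper is standard; match_fn stands in for the block matching step of [14] with its per-frame window expansion, and interpreting the motion-overlap test as a coverage ratio is our reading of the text, so the sketch is illustrative rather than the exact implementation.

import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter + 1e-8)

def is_hard_negative(det_mask, det_box, motion_mask, prev_boxes_per_frame,
                     match_fn, t1=0.2, t2=0.7):
    """Decide whether a detection in frame t is a hard negative (cf. Eq. 3.2).

    `prev_boxes_per_frame` holds the proposal boxes of the previous k frames;
    `match_fn(det_box, k)` returns the block-matched box in the k-th previous
    frame (search window enlarged by 20 pixels per frame of temporal distance)."""
    # Temporal consistency: the detection must be re-found in every previous frame.
    consistent = all(
        max((iou(match_fn(det_box, k), b) for b in boxes), default=0.0) >= t2
        for k, boxes in enumerate(prev_boxes_per_frame, start=1)
    )
    # Motion test: a temporally consistent but (almost) static object is a distractor.
    moving_ratio = np.logical_and(det_mask, motion_mask).sum() / (det_mask.sum() + 1e-8)
    return consistent and moving_ratio < t1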
For videos in which new instances enter the scene in the middle of the sequence, these instances might be treated as foreground with high probability, since they have not previously been trained as either foreground or background. In this case, we adopt the settings used in [100], where each pixel is assigned the Euclidean distance to the closest foreground pixel of the mask. The pixels with a distance larger than the threshold are denoted as negative examples:

N_S = \{ p_i : \min_i E(p_i, \text{Pos}) > d \},   (3.3)

where N_S represents the negative examples, p_i represents the pixels in the frame, d denotes the threshold distance, Pos is the positive mask, and E is the Euclidean distance.

The basic idea of gathering positive samples is to select foreground pixels with a high confidence score. Considering the motion between consecutive frames, we erode the foreground mask of frame t-1 to initialize the positive examples of frame t. However, objects occluded by the moving object at the beginning of the video have a higher probability of being segmented as foreground once they are no longer occluded, because they were never trained as background; the positive pixels then spread onto these new objects, which results in false positives. To tackle this challenge, we regard the intersection of the motion mask and the eroded foreground mask as the positive example. For frames where there is no intersection between these two masks, online adaptation may output unsatisfactory segmentation, so we perform the one-shot VOS approach instead, such that errors from the motion estimate do not propagate to the remaining frames. Positive examples with motion cues are obtained as follows:

P_t = M \cap g(P_{t-1}),   (3.4)

where P_t and P_{t-1} denote the positive masks for frames t and t-1, respectively, and g represents the erosion function.

Figure 3.4: Illustration of positive examples (yellow), negative examples (red) and hard negative examples (green).

Figure 3.4 presents positive, negative, and hard negative examples.

3.2.3 Distractor-aware online adaptation

In Section 3.2.2, we proposed a process for carefully selecting positive, negative, and hard negative examples for each video sequence. To accommodate the variations in the appearance of the foreground object across a given video, we perform online adaptation based on the positive, negative, and hard negative examples, and we name this procedure DOA. With DOA, the model is updated on the current frame and is therefore better adapted to make inferences on that same frame.

To increase the discriminative power for distinguishing objects with similar attributes, higher attention weights are assigned to hard negative examples than to other negative examples, since they are more easily confused with positive samples. The DOA approach can suppress the distractors and provides superior performance compared with methods without adaptation. Moreover, negative examples are also considered in the DOA approach to deal with videos in which new instances enter the scene and with frames where hard negatives do not exist. Thus, we combine both negative and hard negative examples to finetune the model to suppress the distractors, with the loss function defined as:

L_{curr} = \lambda L_{hn} + (1 - \lambda) L_n,   (3.5)

where L_{hn} and L_n represent the pixel-wise segmentation losses for hard negatives and negatives, respectively, and L_{curr} is the loss for the current frame. Here, \lambda is the coefficient that controls the contribution of the two losses. When \lambda = 0, the loss is equivalent to the loss without hard negatives; as \lambda increases, the loss on easy negatives gets discounted. We set \lambda = 0.8 in the experiments when hard negatives exist. Note that the positive region in frame t is at most the same size as the erosion mask from frame t-1, and thus it plays the role of foreground attention. However, additional training iterations on the current frame cause inaccurate predictions.
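A minimal TensorFlow sketch of the current-frame loss of Eq. (3.5) is given below. Eq. (3.5) only specifies how the hard-negative and easy-negative terms are mixed; the added positive-pixel term, the sigmoid cross-entropy form of the pixel losses, and the mask-based averaging are our assumptions, while lam = 0.8 follows the text.

import tensorflow as tf

def distractor_aware_loss(logits, pos_mask, neg_mask, hard_neg_mask, lam=0.8):
    """Per-frame loss combining positives, negatives and hard negatives.

    `logits` are per-pixel foreground scores; the three masks are float tensors
    of the same spatial size marking the selected pixels. Unselected pixels are
    simply ignored."""
    def masked_bce(mask, target):
        per_pixel = tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.ones_like(logits) * target, logits=logits)
        return tf.reduce_sum(per_pixel * mask) / (tf.reduce_sum(mask) + 1e-8)

    loss_pos = masked_bce(pos_mask, 1.0)       # positives from Eq. (3.4)
    loss_hn = masked_bce(hard_neg_mask, 0.0)   # hard negatives, weight lam
    loss_n = masked_bce(neg_mask, 0.0)         # easy negatives, weight (1 - lam)
    return loss_pos + lam * loss_hn + (1.0 - lam) * loss_n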
We consider incor- porating the first frame into the finetuning process because the pseudo-ground truth of the first frame can supply high-quality training data. According to the settings in [100], the first frame is sampled more compared with the current frame, and the weight of the 41 loss for the current frame is reduced to improve the performance. Subsequently, the joint loss function is as follows: L total =L ff +(1)L curr (3.6) whereL ff denotes the loss for the first frame, andL total denotes the total loss. is the balance coefficient which is set to0:95 in the experiments. The distractor-aware approach completely utilizes the spatial and temporal informa- tion from the current frames, the previous several frames, and the first frame to set up long-range consistency over the given video. The finetuning step utilizes high-quality training data to train and thus the inference enables high-quality foreground segmen- tation results. After the distractor-aware online adaptation, a dense CRF is applied to refine the segmentation to obtain the final results. 3.3 Experiments This section is divided into four parts: datasets and evaluation metrics, implementation details, performance comparison with state-of-the-art, and an ablation study. To evaluate the effectiveness of the proposed method, we conduct experiments on two challenging video object segmentation datasets: DA VIS 2016 [75] and FBMS-59 [71]. 3.3.1 Datasets and evaluation metrics The DA VIS 2016 and FBMS-59 datasets are adopted as two evaluation benchmarks and introduced below. 42 Figure 3.5: Visual results of the proposed DOA on DA VIS 2016. The pseudo ground truths (in yellow) are illustrated in the first column, and the other columns are the seg- mentation results (in red) by DOA. The five sequences include the unseen object (first row), strong occlusions (second row), appearance variance (third row), and similar static objects in the background (fourth and fifth row). Best viewed in color with 3 zoom. Table 3.1: Comparison of the results of several methods for the DA VIS 2016 validation dataset. The proposed method outperforms state-of-the-art unsupervised VOS methods and is even better than some supervised VOS approaches in terms ofJ Mean andF Mean (%) Methods Supervised Unsupervised OnA VOS OSVOS MSK LVO LMP FSEG ARP IET MBN Ours J Mean 86.1 79.8 79.7 75.9 70.0 70.7 76.2 78.6 80.4 81.6 F Mean 84.9 80.6 75.4 72.1 65.9 65.3 70.6 76.1 78.5 79.7 DA VIS The DA VIS dataset is composed of 50 high-definition video sequences, 30 in the training set and 20 in the validation set. There are 3,455 densely annotated frames with pixel accuracy. The videos contain challenges such as occlusions, motion blur, and appearance changes. Only the primary moving objects are labeled in the ground truth. 43 FBMS The Freiburg-Berkeley motion segmentation dataset is composed of 59 video sequences with 720 frames annotated. In contrast to the DA VIS dataset, the FBMS- 59 dataset has several videos containing multiple moving objects with instance-level annotations. The performance is evaluated in terms of region similarityJ and the F- score protocol from [71] without considering FBMS-59 dataset in the training of the proposed method. We also convert the instance-level annotations to binary ones by merging all foreground annotations as mentioned in [94]. To evaluate the performance, we adopt two conventional evaluation metrics, region similarityJ and contour similarityF. Region similarityJ . 
The Jaccard indexJ is defined as the IoU between the ground truth mask and the predicted mask to measure the region-based segmentation similarity. Specifically, given a predicted maskP and a corresponding ground truth maskG,J is defined asJ = P T G P S G . Contour similarityF. The contour similarityF is defined as the F-measure between the contour points of the predicted segmentation and the ground truth as proposed in [66]. Given contour-based precisionP and recallR,F is defined asF = 2PR P+R . 3.3.2 Implementation details Motion saliency segmentation and image semantic/instance segmentation are jointly uti- lized to predict the pseudo-ground truth for the first frame. We employ the Coarse2Fine [59] optical flow algorithm followed by a flow-saliency transformation approach [94] to avoid the effect of camera motion. For instance segmentation, Mask RCNN [?] is adopted without further finetuning to generate instance proposals in general cases. We use the semantic segmentation approach Deeplabv3+ [8] to replace the Mask R-CNN when the number of objects in the category is at most one. Considering the trade-off between inference speed and accuracy, we utilize the Xception [12] model pretrained on 44 Table 3.2: Comparison of theJ Mean andF Mean scores (%) of different unsupervised VOS approaches on the FBMS test dataset. Our method achieves the highest compared with state-of-the-art methods NLC [21] FST [72] CVOS [93] MP-Net-V [94] LVO [95] ARP [47] IET [55] MBN [56] Ours J Mean 44.5 55.5 - - - 59.8 71.9 73.9 79.1 F-score - 69.2 74.9 77.5 77.8 - 82.8 83.2 85.8 MS COCO and PASCAL with the following settings: output stride is 16, eval scale is 1.0, and no left-right flip. We adopt a ResNet [105] with 38 hidden layers as the backbone. All implemen- tation and training are based on Tensorflow [1], and ADAM [46] is utilized for opti- mization. Similar settings with Deeplabv2 [7] are exploited in the proposed network: A large field-of-view replaces the top linear classifier and the global pooling layer, and we use dilations to replace the down-sampling operations in certain layers. The ResNet is trained on MS COCO and then finetuned on the augmented PASCAL VOC ground truth from [29] with a total of 12,051 training images. Note that all 20 object classes in PASCAL map to one foreground mask, and the background is kept unchanged. To evaluate the approach on the DA VIS 2016 dataset, we further train the network on the DA VIS training set and perform two experiments: one-shot finetuning on the first frame with pseudo-ground truth and distractor-aware online adaptation with negatives and hard negatives. For both of them, a dense CRF [7] is applied to refine the seg- mentation afterwards. For completeness, we also conduct experiments on the FBMS-59 dataset. However, we use the PASCAL pretrained model with upsampling instead. 45 Table 3.3: Ablation study of the three modules in distractor-aware online adaptation: (1) negative example addition (+N), (2) hard negative example addition (+HN), and (3) fusion of positive mask with motion mask (+MP), assessed on the DA VIS 2016 validation set +N +HN +MP CRF J Mean - - - - 76.7 X - - - 78.9 +2.2 X X - - 80.1 +1.2 X X X - 80.6 +0.5 X X X X 81.6 +1.0 Table 3.4: Comparison of the first frame influence for the DA VIS 2016 validation dataset. We finetune on the first frame and perform inference for the remaining frames without online adaptation. 
We compare the performances (%) from the pseudo ground truth generated from Mask R-CNN (PGT M ) and jointly from Mask R-CNN and Deeplabv3+ (PGT MD ), the erosion and dilation masks fromPGT M , and ground truth mask (GT) Erosion Dilation PGT M PGT MD GT First frameJ Mean 67.9 74.9 79.0 81.3 100 The whole val setJ Mean 65.0 73.8 75.8 76.7 80.4 3.3.3 Performance comparison with state-of-the-art DA VIS 2016. The performances on the DA VIS 2016 dataset are summarized in Table 3.1. The proposed DOA approach outperforms state-of-the-art unsupervised VOS tech- niques, e.g., LVO [95], and ARP [47], FSEG [39], LMP [94], and PDB [89]. Specif- ically, the superior gaps to the second best PDB are 4.4% and 5.2% in terms of the J mean and theF mean, respectively. Moreover, the proposed method provides a convincing performance when compared to several recent semi-supervised VOS tech- niques, such as OSVOS [6] and MSK [74]. The qualitative results of the proposed DOA approach are presented in Figure 3.5. The first column is the first frame with pseudo- ground truth annotation and the other columns show the segmentation results of the fol- lowing frames by DOA. The approach yields encouraging results in those challenging 46 sequences. The blackswan in the first row belongs to an unseen object category in the training data. The video in the second row contains strong occlusions while the object in the third row has non-rigid deformations. The bottom two rows demonstrate the cases with multiple distractors and messy background. All these images prove that the pro- posed method produces accurate and robust segmentation masks for various challenging cases by exploiting distractor-aware online adaptation. We also evaluate the influence of the pseudo-ground truth by utilizing the pseudo- ground truth generated from Mask R-CNN (PGT M ) and jointly from Mask R-CNN and Deeplabv3+ (PGT MD ). The first frame is finetuned and inferred for the remaining frames without online adaptation. The ground truth and the erosion and dilation masks of (PGT M ) are also compared in Table 3.4. We observe that the performance for a video clip is highly correlated with the overlap ratio in the first frame of this video. FBMS-59. The proposed method is evaluated on the FBMS-59 test set with 30 sequences in total. The results are presented in Table 3.2. The proposed method outper- forms the second best method in both evaluation metrics, with aJ mean of 79.1% and aF mean of 85.8% which are 5.1% and 8.0% higher, respectively, than the second best PDB [56] and LVO [95]. 3.3.4 Ablation studies We study the four major components of the proposed methods and then summarize the effects of the components, including negative examples, hard negative examples, and the fusion of the motion mask and the positive mask, CRF, in Table 3.3. The baseline without online adaptation is a ResNet trained on the PASCAL dataset and the DA VIS 2016 training set. The negative examples provide 2.2% enhancement over the baseline in terms of theJ mean. The hard negative examples combined with the negative exam- ples further improve the performance by 1.2%, which demonstrates the proposed online 47 Figure 3.6: Comparison of qualitative results on the key components of online adapta- tion. The first row presents the differences of w/ HN (left) and w/o HN (right), and the second row presents the differences of w/ MP (left) and w/o MP (right). Best viewed in color. adaptation approach is effective when dealing with confusing distractors. 
Subsequently, the fusion of the motion mask and the positive mask brings a further 0.5% boost because the motion information helps select the positive mask to avoid the effects of distractors occluded by the positive mask at the beginning. Finally, additional CRF post-processing is combined to improve the performance by 1.0%. Figure 3.6 shows the qualitative performances with different components. We com- pare the effect of the hard negative examples for the car-roundabout sequence in the first row, and the positive examples influences for the camel sequence in the second row. The segmentation with hard negatives training ignores the arrow directional sign while the training without hard negatives merges the arrow directional sign segmentation into the car, which indicates the importance of the finetuning with hard negatives. To investigate the effectiveness of the fusion of the motion mask and the positive mask eroded from 48 the previous frame (MP), we compare the qualitative results with MP and without MP of the camel sequence where the walking camel is easily distinguished from the static camel with MP. 3.4 Conclusion Based on the insight that video object segmentation becomes tremendously challeng- ing when multiple objects occur and interact in a video clip, a distractor-aware online adaptation for unsupervised video object segmentation is proposed in this paper. The motion between adjacent frames and image segmentation are combined to generate the approximate annotation, the pseudo-ground truth, to replace the ground truth of the first frame. Motion-based hard negative example mining and a block matching algorithm are integrated to produce distractors that are then incorporated in the online adaptation. In addition, motion-based positive examples selection is combined with the hard negatives during online updating. We also conduct an ablation study to demonstrate the effec- tiveness of each component in the proposed approach. Experimental results show that the proposed method achieves state-of-the-art performance on two benchmark datasets and is able to extend from single object segmentation to multiple object segmentation in videos. 49 Chapter 4 Drone Monitoring with Convolutional Neural Networks 4.1 Introduction There is a growing interest in the commercial and recreational use of drones. This in turn imposes a threat to public safety. The Federal Aviation Administration (FAA) and NASA have reported numerous cases of drones disturbing the airline flight operations, leading to near collisions. It is therefore important to develop a robust drone monitoring system that can identify and track illegal drones. Drone monitoring is however a difficult task because of diversified and complex background in the real-world environment and numerous drone types in the market. Generally speaking, techniques for localizing drones can be categorized into two types: acoustic and optical sensing techniques. The acoustic sensing approach achieves target localization and recognition by using a miniature acoustic array system. The optical sensing approach processes images or videos to estimate the position and identity of a target object. In this work, we employ the optical sensing approach by leveraging the recent breakthrough in the computer vision field. The objective of video-based object detection and tracking is to detect and track instances of a target object from image sequences. 
In earlier days, this task was accom- plished by extracting discriminant features such as the scale-invariant feature transform (SIFT) [62] and the histograms of oriented gradients (HOG) [15]. The SIFT feature 50 Visual Drone Video Detection Tracking Online Integrated System Data Augmentation Thermal Drone Video Offline Training Fast R-CNN MDNet Figure 4.1: Overview of proposed approach. We integrate the tracking module and detector module to set up an integrated system. The integrated system can monitor the drone during day and night with exploiting our proposed data augmentation techniques. vector is attractive since it is invariant to object’s translation, orientation and uniform scaling. Besides, it is not too sensitive to projective distortions and illumination changes since one can transform an image into a large collection of local feature vectors. The HOG feature vector is obtained by computing normalized local histograms of image gradient directions or edge orientations in a dense grid. It provides another powerful feature set for object recognition. In 2012, Krizhevsky et al. [49] demonstrated the power of the convolutional neural network (CNN) in the ImageNet grand challenge, which is a large-scale object clas- sification task, successfully. This work has inspired a lot of follow-up work on the developments and applications of deep learning methods. A CNN consists of multiple convolutional and fully connected layers, where each layer is followed by a non-linear activation function. These networks can be trained end-to-end by back-propagation. There are several variants in CNNs such as the R-CNN [27], SPPNet [31] and Faster- RCNN [82]. Since these networks can generate highly discriminant features, they out- perform traditional object detection techniques by a large margin. The Faster-RCNN includes a Region Proposal Network (RPN) to find object proposals, and it can reach nearly real-time object detection. 51 In particular, our proposed model integrates the detector module and tracker module to set up a drone monitoring system as illustrated in Fig. 4.1. The proposed system can monitor drones during both day and night. Due to the lack of the drone data and paucity of thermal drone diversities, we propose model-based augmentation for visible drone data augmentation and design a modified Cycle-GAN-based generation approach for thermal drone data augmentation. Furthermore, a residual tracker module is presented to deal with fast motion and occlusions. Finally, we demonstrate the effectiveness of the proposed integrated model on USC drone dataset and attain an AUC score of 43.8% on the test set. The contributions of our work are summarized below. To the best of our knowledge, this is the first one to use the deep learning technol- ogy to solve the challenging drone detection and tracking problem. We propose to exploit a large number of synthetic drone images, which are gener- ated by conventional image processing and 3D rendering algorithms, along with a few real 2D and 3D data to train the CNN. We develop an adversarial data augmentation technique, a modified Cycle-GAN- based generation approach, to create more thermal drone images to train the ther- mal drone detector. We propose to utilize the residue information from an image sequence to train and test an CNN-based object tracker. It allows us to track a small flying object in the cluttered environment. We present an integrated drone monitoring system that consists of a drone detector and a generic object tracker. 
The integrated system outperforms the detection-only and the tracking-only sub-systems. 52 We have validated the proposed system on USC drone dataset. The rest of this chapter is organized as follows. Related work is reviewed in Sec. ??. The collected drone datasets are introduced in Sec. 4.2. The proposed drone detection and tracking system is described in Sec. 4.3. Experimental results are presented in Sec. 4.4. Concluding remarks are given in Sec. 4.5. 4.2 Data Collection and Augmentation (a) Public-Domain Drone Dataset (b) USC Drone Dataset Figure 4.2: Sampled frames from two collected drone datasets. 4.2.1 Data Collection The first step in developing the drone monitoring system is to collect drone flying images and videos for the purpose of training and testing. We collect two drone datasets as shown in Fig. 4.2. They are explained below. Public-Domain drone dataset. It consists of 30 YouTube video sequences captured in an indoor or outdoor envi- ronment with different drone models. Some samples in this dataset are shown in Fig. 4.2a. These video clips have a frame resolution of 1280 x 720 and their 53 duration is about one minute. Some video clips contain more than one drone. Furthermore, some shoots are not continuous. USC drone dataset. It contains 30 visible video clips shot at the USC campus. All of them were shot with a single drone model. Several examples of the same drone in different appearance are shown in Fig. 4.2b. To shoot these video clips, we consider a wide range of background scenes, shooting camera angles, different drone shapes and weather conditions. They are designed to capture drone’s attributes in the real- world such as fast motion, extreme illumination, occlusion, etc. The duration of each video is approximately one minute and the frame resolution is 1920 x 1080. The frame rate is 30 frames per second. USC thermal drone dataset. It contains 10 thermal video clips shot at the USC campus and sample images are demonstrated in Fig. 4.4. All of them were shot with the same drone model as that for USC drone dataset. Each video clip is approximately one minute and the frame resolution is 1920 x 1080. The frame rate is 30 frames per second. We annotate each drone sequence with a tight bounding box around the drone. The ground truth can be used in CNN training. It can also be used to check the CNN perfor- mance when we apply it to the testing data. 4.2.2 Data Augmentation The preparation of a wide variety of training data is one of the main challenges in the CNN-based solution. For the drone monitoring task, the number of static drone images is very limited and the labeling of drone locations is a labor intensive job. The latter also suffers from human errors. All of these factors impose an additional barrier in 54 Figure 4.3: Illustration of the data augmentation idea, where augmented training images can be generated by merging foreground drone images and background images. developing a robust CNN-based drone monitoring system. To address this difficulty, we develop a model-based data augmentation technique that generates training images and annotates the drone location at each frame automatically. The basic idea is to cut foreground drone images and paste them on top of back- ground images as shown in Fig. 4.3. To accommodate the background complexity, we select related classes such as aircrafts and cars in the PASCAL VOC 2012 [19]. As to the diversity of drone models, we collect 2D drone images and 3D drone meshes of many drone models. 
For the 3D drone meshes, we can render their corresponding images by changing the view-distance, viewing-angle, and lighting conditions of the camera. As a result, we can generate many different drone images flexibly. Our goal is to gen- erate a large number of augmented images to simulate the complexity of background images and foreground drone models in a real world environment. Some examples of the augmented drone images of various appearances are shown in Fig. 4.3. Specific drone augmentation techniques are described below. Geometric transformations We apply geometric transformations such as image translation, rotation and scal- ing. We randomly select the angle of rotation from the range (-30 , 30 ). Further- more, we conduct uniform scaling on the original foreground drone images along 55 Figure 4.4: Sampled frames from collected USC thermal drone dataset. the horizontal and the vertical direction. Finally, we randomly select the drone location in the background images. Illumination variation To simulate drones in the shadows, we generate regular shadow maps by using random lines and irregular shadow maps via Perlin noise [77]. In the extreme lighting environments, we observe that drones tend to be in monochrome (i.e. the gray-scale) so that we change drone images to gray level ones. Image quality This augmentation technique is used to simulate blurred drones caused by cam- era’s motion and out-of-focus. We use some blur filters (e.g. the Gaussian filter, the motion Blur filter) to create the blur effects on foreground drone images. We use the model-based augmentation technique to acquire more training images with the ground-truth labels and show several exemplary synthesized drone images in Fig. 4.7, where augmented drone models are shown in Fig. 4.6. 56 Figure 4.5: Comparison of generated thermal drone images of different methods: 3D rendering (first row), Cycle-GAN (second row), proposed method (third row). 4.2.3 Thermal Data Augmentation In real-life applications, our systems are required to work in both daytime and night- time. To monitor the drones efficiently during the nighttime, we train our CNN-based thermal drone detector using infrared thermal images. It is more difficult to acquire enough training data for training the thermal drone detector. We can therefore apply data augmentation methods as mentioned in the previous section to generate thermal 57 Figure 4.6: Illustration of augmented visible and thermal drone models. The left three columns show the augmented visible drone models using different augmentation tech- niques. The right three columns show the augmented thermal drone models with the first row exploiting 3D rendering technique and the second row utilizing Generative Adversarial Networks. drone images with drone bounding box annotations. As illustrated in Fig. 4.3, we col- lect thermal images as the background from public thermal datasets and the Internet. However, it is difficult to directly apply the visible data augmentation techniques since the thermal drone models are very limited and we cannot collect enough fore- ground thermal drone models with large diversity. This problem can be solved if we can successfully translate a visible drone image to a corresponding thermal drone image, where we face an unsupervised image-to-image translation problem. To address this issue, we provide two approaches for generating thermal foreground drone images. 
One approach specifically targets translating visible drone images into thermal drone images using traditional image processing techniques, and the other is the proposed image translation approach based on GANs.

Traditional image processing techniques. From the observation of the USC thermal drone dataset in Fig. 4.4, thermal drones have a nearly uniform gray color in most cases. Therefore, a post-processing step is added to convert visible drones to monochrome drones.

Figure 4.7: Synthesized visible and thermal images by incorporating various illumination conditions, image qualities, and complex backgrounds.

Modified Cycle-GAN. Our goal is to learn mapping functions between two domains X and Y given unbalanced training samples. In our case, there are enough samples in the visible domain X but very few samples in the thermal domain Y, and applying the learned mapping function helps to generate a large diversity of samples in domain Y.

Figure 4.8: The architecture of the proposed thermal drone generator.

Cycle-GAN provides a good baseline for the unpaired image-to-image translation problem; however, the training images in the two domains are heavily imbalanced in our case. As demonstrated in the second row of Fig. 4.5, Cycle-GAN cannot increase the diversity of drone foreground images. Its cycle consistency loss is, however, necessary for our problem: we utilize the cycle consistency loss to constrain the object shape to be consistent across the two domains. The objective also contains a perceptual texture loss for learning only the texture translation between the two domains, which helps to address the failures of Cycle-GAN. As illustrated in Fig. 4.8, our method learns two generators G_A: X -> Y and G_B: Y -> X with corresponding discriminators D_A and D_B. An image x in X is translated to domain Y as G_A(x) and translated back as G_B(G_A(x)), which should be a reconstruction of x. Similarly, an image y in Y is translated to domain X as G_B(y) and translated back as G_A(G_B(y)). The cycle consistency loss is defined as the sum of the two reconstruction losses:

L_{cycle}(G_A, G_B, X, Y) = \mathbb{E}_{x \sim P_x}[\|G_B(G_A(x)) - x\|] + \mathbb{E}_{y \sim P_y}[\|G_A(G_B(y)) - y\|].   (4.1)

We extract texture features of the images as inputs to the discriminators, which aim to distinguish the texture styles between images y and translated images G_A(x), and between images x and translated images G_B(y). We exploit the texture features proposed by Gatys et al. [25], namely the Gram matrix \mathcal{G}(\phi_j(\cdot)) of the network feature map \phi_j at layer j. The perceptual texture GAN loss is defined as:

L_{tex}(G_A, D_B, X, Y) = \mathbb{E}_{y \sim P_y}[\log D_B(\mathcal{G}(\phi_j(y)))] + \mathbb{E}_{x \sim P_x}[\log(1 - D_B(\mathcal{G}(\phi_j(G_A(x)))))].   (4.2)

The full loss function is:

L_{loss}(G_A, D_A, G_B, D_B) = \lambda L_{cycle}(G_A, G_B, X, Y) + L_{tex}(G_A, D_B, X, Y) + L_{tex}(G_B, D_A, Y, X),   (4.3)

where \lambda controls the relative importance of the cycle consistency loss and the perceptual texture GAN losses.

4.3 Drone Monitoring System

To achieve high performance, the system consists of two modules, namely the drone detection module and the drone tracking module. Both are built with deep learning technology. The two modules complement each other, and they are used jointly to provide accurate drone locations for a given video input.

4.3.1 Drone Detection

The goal of drone detection is to detect and localize the drone in static images. Our approach is built on the Faster-RCNN [82], which is one of the state-of-the-art object detection methods for real-time applications. The Faster-RCNN utilizes deep convolutional networks to efficiently classify object proposals.
To achieve real time detec- tion, the Faster-RCNN replaces the usage of external object proposals with the Region Proposal Networks (RPNs) that share convolutional feature maps with the detection network. The RPN is constructed on the top of convolutional layers. It consists of two convolutional layers, one encodes conv feature maps for each proposal to a lower- dimensional vector and the other provides the classification scores and regressed bounds. The Faster-RCNN achieves nearly cost-free region proposals and it can be trained end- to-end by back-propagation. We use the Faster-RCNN to build the drone detector by training it with synthetic drone images generated by the proposed data augmentation technique for the daytime case and by the proposed thermal data augmentation method for the nighttime case as described in Sec. 4.2. 4.3.2 Drone Tracking The drone tracker attempts to locate the drone in the next frame based on its location at the current frame. It searches around the neighborhood of the current drone’s position. 62 Figure 4.9: Qualitative results on USC drone datasets. Our algorithm performs well on small object tracking and long sequence (first and second row), complex background (third row), and occlusion (fourth row). The bounding boxes in red are integrated system results and the bounding boxes in green are tracking-only results. This helps track a drone in a certain region instead of the entire frame. To achieve this objective, we use the state-of-the-art object tracker called the Multi-Domain Network (MDNet) [70] as the backbone. Due to the fact that learning a unified representation across different video sequences is challenging and the same object class can be con- sidered not only as a foreground but also background object. The MDNet is able to separate the domain specific information from the domain independent information in network training. The network architecture includes three convolution layers, two general fully con- nected layers and aN-branch fully connected layer, whereN is the number of training sequences. To distinguish the foreground and background object in the tracking proce- dure, each of the last branches includes a binary softmax classifier with cross-entropy 63 Figure 4.10: Failure cases. Our algorithm fails to track the drone with strong motion blur and complex background (top row), and fails to re-identify the drone when it goes out-of-view and back (bottom row). The bounding boxes in red are integrated system results and the bounding boxes in green are tracking-only results. Figure 4.11: Comparison of qualitative results of detection, tracking and integrated sys- tem on the drone garden sequence. The detection results are shown in the first row. The corresponding tracking and integrated system results are shown in the second row with tracking bounding boxes in green and integrated system bounding boxes in red respectively. loss. As compared with other CNN-based trackers, the MDNet has fewer layers, which lowers the complexity of an online testing procedure and has a more precise localization prediction. During online tracking, theN-branch fully connected layer is replaced by a single-branch layer. Besides, the weights in the first five layers are pretrained during the multi-domain learning, and random initialization is exploited to the weights in the new single-branch layer. 
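A simplified PyTorch sketch of this multi-domain design is given below: shared convolutional and fully connected layers followed by one binary (foreground/background) branch per training sequence. The layer sizes are illustrative and only loosely follow the description above; this is not the exact MDNet configuration.

import torch.nn as nn

class MultiDomainNet(nn.Module):
    def __init__(self, num_domains):
        super().__init__()
        # Three convolutional layers and two shared fully connected layers.
        self.shared = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 512, kernel_size=3), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        # One domain-specific branch per training sequence, each a binary
        # foreground/background classifier. At test time these are replaced by a
        # single, randomly initialized branch that is updated online.
        self.branches = nn.ModuleList([nn.Linear(512, 2) for _ in range(num_domains)])

    def forward(self, x, domain):
        return self.branches[domain](self.shared(x))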
The weights in the fully connected layer are updated during 64 Figure 4.12: Comparison of qualitative results of detection (first row), integrated system (second row) on the thermal drone sequence. online tracking whereas the weights in the convolutional layers are frozen. Both the gen- eral and domain-specific features are preserved in this strategy and the tracking speed is improved as well. To control the tracking procedure and weights update, long-term and short-term updates are conducted respectively based on length of the consistent positive exam- ples intervals. Besides, hard negative example mining [87] is performed to reduce the positive/negative example ratio and improve the binary classification difficulty to make the network more discriminative. Finally, bounding box regression is exploited to adjust the accurate target location. To improve the tracking performance furthermore, we propose a video pre- processing step. That is, we subtract the current frame from the previous frame and take the absolute values pixelwise to obtain the residual image of the current frame. Note that we do the same for the R,G,B three channels of a color image frame to get a color residual image. Three color image frames and their corresponding color residual images are shown in Fig. 4.13 for comparison. If there is a panning movement of the camera, we need to compensate the global motion of the whole frame before the frame subtraction operation. 65 Figure 4.13: Comparison of three raw input images (first row) and their corresponding residual images (second row). Since there exists strong correlation between two consecutive images, most back- ground of raw images will cancel out and only the fast moving object will remain in residual images. This is especially true when the drone is at a distance from the camera and its size is relatively small. The observed movement can be well approximated by a rigid body motion. We feed the residual sequences to the MDNet for drone tracking after the above pre-processing step. It does help the MDNet to track the drone more accurately. Furthermore, if the tracker loses the drone for a short while, there is still a good probability for the tracker to pick up the drone in a faster rate. This is because the tracker does not get distracted by other static objects that may have their shape and color similar to a drone in residual images. Those objects do not appear in residual images. 4.3.3 Integrated Detection and Tracking System There are limitations in detection-only or tracking-only modules. The detection-only module does not exploit the temporal information, leading to huge computational waste. The tracking-only module attempts to track the drone by leveraging the temporal rela- tionships between video frames without knowing the object information, but it cannot 66 initialize the drone tracker when failed to track for a certain time interval. To build a complete system, we need to integrate these two modules into one. The flow chart of the proposed drone monitoring system is shown in Fig. 5.2. Figure 4.14: A flow chart of the drone monitoring system. Generally speaking, the drone detector has two tasks – finding the drone and initial- izing the tracker. Typically, the drone tracker is used to track the detected drone after the initialization. However, the drone tracker can also play the role of a detector when an object is too far away to be robustly detected as a drone due to its small size. 
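Since the tracker in this case operates on the residual images described above, their pre-processing can be sketched as below: a per-channel absolute difference of consecutive frames, assuming a static camera so that no global motion compensation is needed.

import numpy as np

def residual_frame(prev_bgr, curr_bgr):
    # Per-channel absolute difference; cast to a wider type to avoid uint8 wrap-around.
    diff = np.abs(curr_bgr.astype(np.int16) - prev_bgr.astype(np.int16))
    return diff.astype(np.uint8)

def residual_sequence(frames):
    # The first frame has no predecessor, so the residual sequence is one frame shorter.
    return [residual_frame(f0, f1) for f0, f1 in zip(frames[:-1], frames[1:])]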
Then, we can use the tracker to follow the object before it can be detected, with the residual images as the input. Once the object is near, we can use the drone detector to confirm whether it is a drone or not. An illegal drone can be detected once it is within the field of view and of a reasonable size. The detector reports the drone location to the tracker as the start position. Then, the tracker starts to work. During the tracking process, the detector keeps providing the confidence score of a drone at the tracked location as a reference to the tracker. The final updated location is acquired by fusing the confidence scores of the tracking and the detection modules as follows. For a candidate bounding box, we compute the confidence scores of this location via

S'_d = 1 / (1 + e^{-\alpha_1 (S_d - \beta_1)}),   (4.4)
S'_t = 1 / (1 + e^{-\alpha_2 (S_t - \beta_2)}),   (4.5)
S' = \max(S'_d, S'_t),   (4.6)

where S_d and S_t denote the confidence scores obtained by the detector and the tracker, respectively, S' is the confidence score of this candidate location, and the parameters \alpha_1, \alpha_2, \beta_1, \beta_2 control the acceptance threshold. We compute the confidence scores of a set of bounding box candidates, where S'_i denotes the confidence score and BB_i the bounding box position of candidate i \in C, and C denotes the set of candidate indices. Then, we select the one with the highest score:

i^* = \arg\max_{i \in C} S'_i,   (4.7)
S_f = \max_{i \in C} S'_i,   (4.8)
BB^* = BB_{i^*},   (4.9)

where BB^* is the finally selected bounding box and S_f is its confidence score. If S_f = 0, the system reports a rejection message.

4.4 Experimental Results

4.4.1 Drone Detection

We test the visible and thermal drone detectors on both the real-world and the synthetic visible or thermal datasets. Each of them contains 1000 images. The images in the real-world dataset are sampled from videos in the USC Drone dataset and the USC Thermal Drone dataset. The images in the synthetic dataset are generated using foreground and background images that differ from those in the training dataset. The detector can take images of any size as the input. These images are then re-scaled such that their shorter side has 600 pixels [82].

To evaluate the drone detector, we compute the precision-recall curve. Precision is the fraction of the total number of detections that are true positives. Recall is the fraction of the total number of labeled positive samples that are true positives. The area under the precision-recall curve (AUC) [35] is also reported.

As for the visible detector, the effectiveness of the proposed data augmentation technique is illustrated in Fig. 4.15. In this figure, we compare the performance of the baseline method that uses simple geometric transformations only with that of the method that uses all of the mentioned data augmentation techniques, including geometric transformations, illumination conditions, and image quality simulation. Clearly, better detection performance is achieved with more augmented data. We see around 11% and 16% improvements in the AUC measure on the real-world and the synthetic datasets, respectively.

The experimental results of the thermal drone detector are presented in Fig. 4.16. On both the real-world and synthetic thermal datasets, the thermal detector outperforms the visible detector by a large margin. We further compare the proposed modified Cycle-GAN data augmentation approach with the traditional data augmentation techniques to demonstrate the necessity of the proposed approach.
The comparison between baseline methods which uses the 3D rendering techniques and image trans- fer methods which exploits the modified Cycle-GAN model for generating foreground images are demonstrated in Fig. 4.16. We can observe there is8% performance gain in the AUC measure on the real-world datasets. 4.4.2 Drone Tracking The MDNet is adopted as the object tracker. We take 3 video sequences from the USC drone dataset as testing ones. They cover several challenges, including scale variation, out-of-view, similar objects in background, and fast motion. Each video sequence has a duration of 30 to 40 seconds with 30 frames per second. Thus, each sequence contains 900 to 1200 frames. Since all video sequences in the USC drone dataset have relatively slow camera motion, we can also evaluate the advantages of feeding residual frames (instead of raw images) to the MDNet. The performance of the tracker is measured with the area-under-the-curve (AUC) measure. We first measure the intersection over union(IoU) for all frames in all video sequences as IoU = Areaof Overlap Areaof Union ; (4.10) where the “Area of Overlap” is the common area covered by the predicted and the ground truth bounding boxes and the “Area of Union” is the union of the predicted and the ground truth bounding boxes. The IoU value is computed at each frame. If it is higher than a threshold, the success rate is set to 1; otherwise, 0. Thus, the success rate value is either 1 or 0 for a given frame. Once we have the success rate values for all frames in all video sequences for a particular threshold, we can divide the total success 70 (a) Synthetic Dataset (b) Real-World Dataset Figure 4.15: Comparison of the visible drone detection performance on (a) the synthetic and (b) the real-world datasets, where the baseline method refers to that using geomet- ric transformations to generate training data only while the All method indicates that exploiting geometric transformations, illumination conditions and image quality simu- lation for data augmentation. rate by the total frame number. Then, we can obtain a success rate curve as a function of the threshold. Finally, we measure the area under the curve (AUC) which gives the desired performance measure. 71 (a) Synthetic Dataset (b) Real-World Dataset (c) Real-World Dataset Figure 4.16: Comparison of the visible and thermal drone detection performance on (a) the synthetic, (b) the real-world datasets. (c) shows the comparison of different thermal data augmentation techniques, where the baseline method refers to that using geometric transformations, illumination conditions and image quality simulation for data augmen- tation, and image transfer refers to that utilizing the proposed modified Cycle-GAN data augmentation technique. 72 We compare the success rate curves of the MDNet using the original images and the residual images in Fig. 4.17. As compared to the raw frames, the AUC value increases by around 10% using the residual frames as the input. It collaborates the intuition that removing background from frames helps the tracker identify the drones more accurately. Although residual frames help improve the performance of the tracker for certain condi- tions, it still fails to give good results in two scenarios: 1) movement with fast changing directions and 2) co-existence of many moving objects near the target drone. To over- come these challenges, we have the drone detector operating in parallel with the drone tracker to get more robust results. 
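In this parallel operation, the two modules are combined by the score-level fusion of Eqs. (4.4)-(4.9). A minimal sketch of that selection rule is given below; the sigmoid parameters are tuning constants of the system, and the default values here are placeholders rather than the values used in this work.

import numpy as np

def rescale(score, alpha, beta):
    # Logistic rescaling of a raw confidence score, as in Eqs. (4.4)-(4.5).
    return 1.0 / (1.0 + np.exp(-alpha * (score - beta)))

def fuse_candidates(candidates, a1=10.0, b1=0.5, a2=10.0, b2=0.5):
    # candidates: list of (bounding_box, detector_score, tracker_score).
    best_box, best_score = None, 0.0
    for box, s_d, s_t in candidates:
        s = max(rescale(s_d, a1, b1), rescale(s_t, a2, b2))   # Eq. (4.6)
        if s > best_score:                                    # Eqs. (4.7)-(4.9)
            best_box, best_score = box, s
    # An empty candidate set yields (None, 0.0), corresponding to the rejection case.
    return best_box, best_score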
Our algorithm performs well for most of the sequences on USC drone datasets as shown in Fig. 4.9. The qualitative results show that our algorithm performs well for tracking drones in distances and a long video (first and second row). The third row shows that both tracking algorithm (in green) and integrated system (in red) works well when meeting with complex background. The fourth row shows our approach performs well with motion blur and occlusions. Especially the integrated system predicts more accurate and tight bounding boxes in red compared with those of the tracking-only algo- rithm in green in the fourth column. We present failure cases on USC drone datasets in Fig. 4.10. The first row shows both tracking algorithm and integrated system fail when dealing with strong motion blur, complex background in the fourth column. The second row shows the failure case when the drone is out-of-view for a long time where the drone is going out of view in the frame 189 and fully comes back in the frame 600. Both the two approaches cannot track the drone when the drone reappears in the video. Long term memory should be applied in this case and optical flow could be exploited to re-identify the drone to initialize the tracker. 73 Figure 4.17: Comparison of the MDNet tracking performance using the raw and the residual frames as the input. 4.4.3 Fully Integrated System The fully integrated system contains both the detection and the tracking modules. We use the USC drone dataset to evaluate the performance of the fully integrated system. The performance comparison (in terms of the AUC measure) of the fully integrated system, the conventional MDNet (the tracker-only module) and the Faster-RCNN (the detector-only module) is shown in Fig. 4.18. Note that the tracker-only module needs human annotation for the first frame to perform the drone tracking, while the integrated system is an autonomous system without relying on the first frame annotation. The fully integrated system outperforms the other benchmarking methods by substantial margins. 74 Figure 4.18: Detection only (Faster RCNN) vs. tracking only (MDNet tracker) vs. our integrated system: The performance increases when we fuse the detection and tracking results. This is because the fully integrated system can use detection as the means to re-initialize its tracking bounding box when it loses the object. We show the comparison of qualitative results of detection, tracking and integrated system in Fig. 4.11. The drone detection-only results in the first row demonstrates that detection-only performs bad when there are strong illumination changes. The drone tracker in green in the second row fails to track the drones back when it loses tracking at the very beginning. The proposed integrated system has higher tolerance against the illuminance and scale changes which are in the red bounding boxes in the second row. It outperforms the detection-only results and tracking-only results since it learns from both 75 of them. Furthermore, we present the comparison of qualitative results of the thermal detection and integrated system in Fig. 4.12. Both the two approaches perform well to monitor the drone at night. 4.5 Conclusion A video-based drone monitoring system was proposed in this work to detect and track drones during day and night. The system consists of the drone detection module and the drone tracking module. Both of them were designed based on deep learning networks. 
We developed a model-based data augmentation technique for visible drone monitoring to enrich the training data. Besides, we presented an adversarial data augmentation methodology to create more thermal drone images due to the lack of the thermal drone data. We also exploited residue images as the input to the drone tracking module. The fully integrated monitoring system takes advantage of both modules to achieve high performance monitoring. Extensive experiments were conducted to demonstrate the superior performance of the proposed drone monitoring system. 76 Chapter 5 Video Object Tracking and Segmentation with Box Annotation 5.1 Introduction Video object segmentation (VOS) is a challenging problem in computer vision, which has great potential in the video editing, unmanned vehicle navigation, automatic data annotation and video surveillance. With the success of deep networks [50, 88, 32] and the appearance of high-resolution VOS datasets [76, 78, 109], recent breakthroughs in VOS have mainly focused on the semi-supervised setting in which the manually anno- tated segmentation mask of the first frame is provided in the given test video sequence. Then, the objects in the remaining frames of the video clip are automatically segmented. However, pixel-level annotation is very expensive and slow, and therefore, manually annotating the initial mask for the first frame of each new video clip is not realis- tic. There is another VOS setting, dubbed unsupervised VOS, in which the first frame ground truth mask is not provided. Unsupervised algorithms aim to automatically seg- ment the primary objects without prior knowledge of these objects in the video. Motion cues and/or image-based segmentations are utilized to guess the moving object regions. However, it is possible that the primary object is static in the first several frames or dur- ing the middle fractions in the video clip. Therefore, motion cues cannot find the salient regions, and although image-based segmentation algorithms can be applied to find the objectness regions, the algorithms cannot determine what the primary object is. 77 SiamMask Ours SiamMask Ours SiamMask Ours Figure 5.1: Segmentation results of the proposed RevVOS vs. SiamMask [102] on three sequences on DA VIS 2016 and DA VIS 2017. We propose a two-step approach that deeply leverages the optimization between tracking and segmentation, producing better segmentation than the state-of-the-art joint-trained approach SiamMask. To make VOS practical, and to guide the network to segment the primary object, we use very weak supervision (a bounding box) as the input, to perform object segmenta- tions throughout the remaining frames in the video. The object bounding box can be obtained in a much easier way than the pixel-wise segmentation, either through user interaction or from an object detector, which also provides a class label as additional information. Producing pixel-level estimates requires more computational resources than a simple bounding box. The recent fine-tuning-based approaches, OSVOS [6] and OnA VOS [100] achieve high performance, but suffer from high computational complex- ity in terms of time and space, especially when online adaptation is used. Only a few methods [10, 110, 99, 9, 106] achieve two potentially conflicting goals, accuracy and speed, at the same time. Different from the settings above, SiamMask [102] relies solely on bounding box annotation for the first frame. 
SiamMask extends existing Siamese trackers with an extra branch and loss to jointly train the classification branch, box branch, and mask branch by leveraging the shared representation. The proposed RevVOS method adopts the same 78 setting as SiamMask. We propose a track-then-segment approach, and use reverse opti- mization from the segmentation cues to update the tracker. Recent CNN-based work completely ignores complementary cues or jointly trains a multi-task network using the same representation. The proposed approach can take advantage of reverse optimization to produce stronger results than the previous best method, SiamMask. This paper aims at fast segmenting video objects with the bounding box annotation. The contributions are threefold, as dicussed below: First, we propose a two-stage framework, track and then segment, to perform video object segmentation, given the first frame box annotation, and demonstrate that this two-stage approach reduces the runtime by a large margin. The use of only the bounding box annotation is enough to automatically segment the object in the remaining frames. Second, contrary to the conventional approach of multi-task learning, where two or more similar tasks are jointly learned using a shared representation, we propose a novel paradigm to leverage the segmentation cues, to perform reverse optimiza- tion to localize the objects. Third, we show with the DA VIS 2016 and DA VIS 2017 datasets that this approach outperforms state-of-the-art results, with mean intersection-over-union (IoU) scores of 78.5% and 61.0%, respectively. Figure 5.1 shows the comparison between RevVOS and the state-of-the-art method SiamMask [102], which has the same setting as the proposed approach. The bounding box tracking in SiamMask is not consistent, and the segmentation along the boundary is very coarse. In contrast, RevVOS generates stronger segmentation results for these challenging sequences on the DA VIS 2016 and DA VIS 2017 datasets. 79 5.2 Proposed Method In this section, we introduce the proposed method in detail. For a given test video clip, first the bounding box annotation is provided. To perform video object segmentation for the given video sequence, we propose a different two-branch method, exploiting the SiamRPN [53] and FEELVOS [99] architectures, as illustrated in Figure 5.2, to jointly utilize the box cues and segmentation cues to balance the trade-off between speed and accuracy. 5.2.1 Track and then Segment Tracking with a Siamese Network. Visual object tracking has grown rapidly in recent years, due to the emergence of benchmark datasets [104, 48, 68] and the development of deep learning techniques [50, 88, 32]. Siamese network-based trackers train a two- branch network on pairs of video frames to learn the cross-correlations and output a similarity function. The Siamese network-based trackers are able to balance the trade- off between accuracy and efficiency. SiamFC [5] proposed an end-to-end fully convolu- tional Siamese network to estimate the similarity between two regions of interest. Then, additional Siamese approaches improve the performance by utilizing a region proposal network [22, 53], a deeper neural network[115, 52], and learning from the distractors in the background [117]. We adopt SiamRPN [53] as the backbone of the tracking branch of the proposed method. We adopt a real-time tracking architecture, SiamRPN [53], by utilizing a Siamese network to extract features, and a region proposal network for proposal generation. 
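At the heart of such trackers is a cross-correlation between template features and search-region features, detailed next. A minimal PyTorch sketch, with the feature backbone left abstract, is the following; it illustrates the operation only and is not the SiamRPN implementation itself.

import torch.nn.functional as F

def siamese_response(backbone, exemplar, search):
    z = backbone(exemplar)      # (1, C, Hz, Wz) template features
    x = backbone(search)        # (1, C, Hx, Wx) search-region features
    # Slide the template features over the search features; high responses mark
    # locations in the search region that resemble the exemplar.
    return F.conv2d(x, z)       # (1, 1, Hx - Hz + 1, Wx - Wz + 1) response map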
The fully convolutional Siamese network compares a previous exemplar image in the tem- plate branch [53] with a larger search region in the current frame in the detection branch, to obtain dense response maps. Then, the response maps are forwarded into the region 80 SiamRPN FG/BG Segmentation … … Updated tracker Initial search region … … Dense embedding Segmentation head Matching Tracking branch Segmentation branch Initial search region Frame 1 Frame t-1 Frame t Cropped frame 1 Cropped frame t-1 Cropped frame t Box2Seg Figure 5.2: An overview of the proposed RevVOS. The framework includes a tracking branch (top), a segmentation branch (bottom), and a Box2Seg prediction (left). We pro- pose a two-stage approach, track-then-segment, to generate segmentation masks from bounding box initialization. Then, the reverse optimization from the segmentation cues to the tracker is applied to refine the tracker. proposal network [83], which also consists of two branches, one for foreground classi- fication and one for proposal regression. The pair-wise correlation is computed on the classification branch and the regression branch, and thus, the network outputs bounding boxes in parallel with the classification scores. During the training process, the cross- entropy loss and theL 1 loss are jointly exploited for the classification and regression, respectively. Foreground-Background Segmentation in the Trackers. To make the object segmen- tation consistent in the entire video, we select the first frame as the reference frame. We apply Mask-RCNN [30] pretrained on ImageNet [17] and MS COCO [58] datasets to the first frame of the given video without any further fine-tuning. Mask-RCNN outputs class-specific segmentation masks, whereas video object segmentation aims at generat- ing binary class-agnostic segmentation masks. Mask-RCNN produces the segmentation mask with the closet class label, if the foreground objects do not belong to the 80 cat- egories in MS COCO. Then, all the foreground objects are mapped to one foreground object as the pseudo-ground truth of the first frame, and thus, the misclassification has limited influence on the inference process in VOS. 81 Video object segmentation methods are generally divided into two groups: semi- supervised VOS and unsupervised VOS. Unsupervised VOS methods [55, 56, 96] aim to segment the primary object without the first frame annotation, by jointly exploit- ing the motion cues and image-based segmentations. Semi-supervised VOS approaches [6, 100, 74, 96, 4] need a mask annotation for the first frame, and then segment the object in the remaining frames for the given video sequence. Semi-supervised VOS approaches are further classified into two categories: approaches that use fine-tuning and approaches that do not use fine-tuning on the first frame. Fine-tuning-based approaches achieve high performance, but the runtime is long. The other methods [110, 10, 99, 9] avoid fine-tuning, to balance the trade-off between runtime and performance to meet the requirements of real-world applications. OSMN [110] adapts the segmentation model with a modulator to manipulate the intermediate layers of the segmentation network without using fine-tuning. FA VOS [10] produces segmentation for each region of inter- est after part-based tracking, and then refines the segmentation using a similarity-based scoring function. FEELVOS [99] utilizes semantic embedding with global matching with the first frame, and local matching with the previous frame, to generate segmenta- tion masks. 
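Returning to the first-frame pseudo-ground truth described above, its construction can be sketched as follows: every instance mask that Mask-RCNN predicts mostly inside the initial bounding box is merged into one class-agnostic foreground mask, and the (possibly wrong) class labels are discarded. The 0.5 overlap ratio below is an illustrative choice, not a value taken from this work.

import numpy as np

def pseudo_ground_truth(instance_masks, box):
    # instance_masks: list of HxW boolean arrays from Mask-RCNN; box: (x1, y1, x2, y2).
    h, w = instance_masks[0].shape
    x1, y1, x2, y2 = box
    inside = np.zeros((h, w), dtype=bool)
    inside[y1:y2, x1:x2] = True
    fg = np.zeros((h, w), dtype=bool)
    for m in instance_masks:
        # Keep an instance if most of its area falls inside the annotated box.
        if m.sum() > 0 and (m & inside).sum() / m.sum() > 0.5:
            fg |= m
    return fg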
With the pseudo-ground truth, the task changes to a semi-supervised setting. To perform fast video object segmentation, we apply the state-of-the-art VOS architecture FEELVOS [99] to the tracking boxes. FEELVOS first exploits Deeplabv3+ [8] as the backbone to extract features, and then adds a semantic pixel-wise embedding layer to extract the embedding feature vectors. Afterward, global matching with the first frame and local matching with the previous frame are applied to calculate the distance maps. Finally, the backbone features, the predictions from the previous frame, and the global and local matching distance maps are forwarded to a segmentation head, which is trained with the cross-entropy loss. The goal is to learn an embedding such that the pixels of the same object have similar embeddings.

Unlike FEELVOS, which extracts backbone features and computes the instance embedding distances on the entire frame, we argue that producing the binary segmentation inside the tracker box improves the performance in terms of both speed and accuracy. In a smaller bounding box, the primary object becomes more dominant, because most of the confusing background is ignored, and the embedding between the current tracker and the initial bounding box in the first frame is more effective. For each pixel, the embedding vector is extracted in the learned embedding space. Pixels belonging to the same object have smaller Euclidean distances than pixels from different objects. The similarity between two instance embedding vectors e_i and e_j is measured as a function of the Euclidean distance by

s(i, j) = 2 / (1 + \exp(\|e_i - e_j\|_2^2)).   (5.1)

We learn the embedding from only the bounding box, instead of the entire frame, and thus the global matching with the box in the first frame and the local matching with the tracker in the previous frame are more focused. The background pixels sampled for the embedding lie closer to the objects and are more easily confused with the foreground. This procedure can be considered as using hard pixels for calculating the embedding, since computing the distances for all pairs of pixels would be too expensive. Although FEELVOS samples at most 1024 pixels per object to save computation cost in global matching, a smaller bounding box usually contains fewer objects, which further reduces the calculation.

The track-then-segment approach generates segmentations inside the tracker, and thus it automatically removes the outliers that generate confusing predictions outside the tracker, without utilizing online adaptation. OnAVOS [100] fine-tunes the network online, using an eroded version of the previous prediction as the positive mask and the pixels whose distance to the positive mask exceeds a threshold as the negative mask. Its inference time is about 13 seconds per frame, where the online adaptation contributes nearly 90% of the total runtime. The proposed approach avoids online adaptation and uses only one forward pass per frame. For global matching with the entire first frame, the global distance map is noisy and contains false positives. However, when matching with the initial bounding box region, the false positives outside the bounding box are automatically removed, and thus the encoded embeddings are more representative.

5.2.2 Reverse Optimization from Segmentation to Tracking

The intuition behind reverse optimization from segmentation to tracking is twofold.
First, SiamMask [102] uses a unifying approach for object tracking and segmentation, by augmenting their loss with a binary segmentation mask in the original fully convo- lutional Siamese tracking network. Although the method’s inference time is very fast, 35 frames per second, the bounding box tracking is not consistent, and the segmentation along the boundary is very coarse, as shown in Figure 5.1. Second, the tracking algo- rithms have trouble when dealing with severe aspect ratio changes and long-term object tracking in which the object is often occluded and appears again. In contrast to the conventional approach of multi-task learning of training the net- work using multiple losses simultaneously using shared representation, we build an inte- grated system to reversely optimize the trackers from the segmentation cues, to claim that the segmentation cues improve the tracking performance. We first perform the 84 track-then-segment approach, and then compare the minimum axis-aligned bounding box that fits the segmentation mask with the tracking box. If the IoU between the two boxes is smaller than the threshold T low , in which the pixels in the bounding box are more likely to be background due to the occlusions, or out-of-view, we enlarge the search region to the entire frame, to use the segmentation branch to perform the predic- tion. If no segmentation mask is predicted, we pause the tracking branch, and perform the video object segmentation for the remaining frames until the object reappears. The tracker is re-initialized based on the minimum axis-aligned bounding box covering the segmentation mask. If the IoU between these two boxes is larger than the threshold T high , we keep the track-then-segment approach for the rest of the frames. Otherwise, we use the minimum axis-aligned bounding box to replace the tracking box to initialize the tracker in the current frame. This approach copes well with the case that the bound- ing box aspect ratio is not able to follow the change in the box shape when the object changes shapes dramatically. We build an integrated system, and show in experiments that the proposed reverse optimization is effective. 5.2.3 Implementation Details For the tracking branch, we adopt SiamRPN, which utilizes modified AlexNet [50] pre- trained from ImageNet [17], with the first three convolutional layers fixed and only the last two convolutional layers updated. SGD [84] is utilized for optimization, and the learning rate is decreased in log space from 10 2 to 10 6 during the total 50 epochs. Similar to SiamFC [5], the exemplar patch is 127 127, and the corresponding search image patch is 255 255. Online tracking is considered a one-shot detection task, and thus, no online adaptation is followed. For the segmentation branch, we first produce the segmentation masks in the ini- tial bounding box using Mask-RCNN [30] pretrained on ImageNet [17] and MS COCO 85 [58]. For the DA VIS 2016 evaluation, we map all the instance masks in the box to one foreground mask; for the DA VIS 2017 evaluation, we keep each instance with different labels. To produce the segmentation masks for the remaining frames in the given video, we apply FEELVOS [99] to the initial bounding box and the trackers. FEELVOS uses Deeplabv3+ [8] with Xception-65 as the backbone, including depth-wise separable con- volutions, batch normalization, atrous spatial pyramid pooling, and a decoder module. An additional embedding layer is added on top of the feature extraction network. 
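The pixel embeddings produced by this extra layer are compared with the similarity of Eq. (5.1). A minimal sketch of that similarity and of a global matching map over sampled reference pixels follows; the sampling of reference pixels itself is omitted, and this is an illustration rather than the FEELVOS implementation.

import numpy as np

def similarity(e_i, e_j):
    # s(i, j) = 2 / (1 + exp(||e_i - e_j||_2^2)), Eq. (5.1).
    return 2.0 / (1.0 + np.exp(np.sum((e_i - e_j) ** 2)))

def global_matching_map(curr_emb, ref_emb):
    # curr_emb: (H, W, D) embeddings of the current crop; ref_emb: (N, D) embeddings
    # sampled from the first-frame box. Each pixel is scored by its nearest reference.
    h, w, d = curr_emb.shape
    dist2 = np.sum((curr_emb.reshape(-1, 1, d) - ref_emb[None]) ** 2, axis=-1)  # (H*W, N)
    return (2.0 / (1.0 + np.exp(dist2.min(axis=1)))).reshape(h, w)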
During the training procedure, one reference frame and two consecutive frames are randomly selected in each video in a mini-batch. They correspond to the first frame, the previ- ous frame, and the current frame in the inference procedure. During the inference, the cropped image region with the estimated segmentation mask in the initial bounding box is considered the first frame with ground truth. Following [99], we apply global match- ing with the first frame cropped region and local matching with the previous frame, and then forward them with the extracted features and the previous frame prediction to the segmentation head, to produce the segmentation mask. To reinitialize the tracker, we set the IoU threshold for initializing trackers, as described in Section 5.2.2, to T high = 0:8, and enlarge the search region with an IoU lower thanT low = 0:1. We pause the tracking branch, and allow the segmentation branch to make inference to the next frames, if there are no objects in the enlarged search region, until the object reappears in the sequence. 5.3 Experimental Evaluation In this section, the experiments are divided into three parts: evaluation metrics, evalua- tions for video object segmentation and ablation studies. To evaluate the effectiveness 86 OnA VOS [100] OSVOS-S [64] OSVOS [6] MSK [74] FA VOS [10] FEELVOS [99] RGMP [106] PML [9] OSMN [110] SiamMask [102] Ours FT 3 3 3 3 7 7 7 7 7 7 7 M 3 3 3 3 3 3 3 3 3 7 7 J 86.1 85.6 79.8 79.7 82.4 81.1 81.5 75.5 74.0 71.7 78.5 F 84.9 87.5 80.6 75.4 79.5 82.2 82.0 79.3 72.9 67.8 80.2 J&F 85.5 86.5 80.2 77.6 81.0 81.7 81.8 77.4 73.5 69.8 79.4 time 13 4.5 9 10 0.6 0.45 0.14 0.28 0.14 0.03 0.40 Table 5.1: Quantitative results of different semi-supervised VOS approaches on the DA VIS 2016 validation set.FT andM denote fine-tuning on the first frame and exploit- ing the first frame ground truth mask, respectively. Time is measured in seconds per frame. Figure 5.3: Benchmark ofJ&F mean and time per frame (in log scale) on DA VIS 2016 dataset. SiamMask [102] and RevVOS in red only use bounding box labels. The methods in blue utilize fine-tuning on the first frame. of the proposed method, we conduct experiments on two video object segmentation datasets: DA VIS 2016 [75] and DA VIS 2017 [78]. 5.3.1 Evaluation Metrics Region SimilarityJ . The Jaccard indexJ is defined as the IoU between the ground truth mask and the predicted mask, to measure the region-based segmentation similarity. Specifically, given the predicted maskP and the corresponding ground truth maskG, J is defined asJ = P T G P S G . 87 0% 25% 50% 75% 100% Figure 5.4: Qualitative results of the proposed method on DA VIS 2016 and DA VIS 2017 validation sets: drift-straight, andcows in the first two rows belong to DA VIS 2016; judo andhorsejump-high in the last two rows are in DA VIS 2017. Contour SimilarityF. The contour similarityF is defined as the F-measure between the contour points of the predicted segmentation and the ground truth, as proposed in [66]. Given contour-based precisionP and recallR,F is defined asF = 2PR P+R . 5.3.2 Evaluations for Video Object Segmentation The model is trained on the DA VIS 2017 dataset, and then evaluated on the DA VIS 2017 and DA VIS 2016 validation sets. The network is not further trained on YouTube- VOS, which is a large-scale and challenging video object segmentation dataset, for fair comparison with most of state-of-the-art methods. 
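For reference, the two measures can be computed as in the following simplified sketch. The boundary is approximated here by a one-pixel erosion difference, whereas the official DAVIS benchmark uses a more tolerant boundary-matching procedure.

import numpy as np
from scipy.ndimage import binary_erosion

def region_similarity(pred, gt):
    # J = |P intersect G| / |P union G| for boolean masks.
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def boundary(mask):
    return mask & ~binary_erosion(mask)

def contour_similarity(pred, gt):
    # F = 2PR / (P + R) over boundary pixels.
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 or bg.sum() == 0:
        return 0.0
    p = (bp & bg).sum() / bp.sum()
    r = (bp & bg).sum() / bg.sum()
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)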
To initialize the bounding box for the first frame, we use the minimum axis-aligned box that fits the ground truth mask. The experimental results are split into two sections: DA VIS 2016 and DA VIS 2017. DA VIS 2016. The DA VIS 2016 dataset is composed of 50 high-definition(HD) video sequences, 30 in the training set and the remaining 20 in the validation set. There are a 88 0% 25% 50% 75% 100% Figure 5.5: More qualitative results of the proposed method on DA VIS 2016 validation set. total of 3, 455 densely annotated, pixel-accurate frames. The videos contain challenges, such as occlusions, motion blur, and appearance changes. Only the primary moving objects are annotated in the ground truth. Table 5.1 shows the comparison of theJ mean,F mean,J&F mean, and runtime with other state-of-the-art semi-supervised VOS methods. These approaches are divided into three major groups. The first group includes the approaches that utilize first frame 89 0% 25% 50% 75% 100% Figure 5.6: More qualitative results of the proposed method on DA VIS 2016 validation set. fine-tuning. These methods achieve better performance in terms ofJ&F, but the run- time is too long, due to some of the following aspects: first frame fine-tuning [6], online adaptation [100], optical flow extraction [59], conditional random field post-processing [7], and test data augmentation [100]. The second group consists of the methods that do not leverage fine-tuning on the first frame, but need first frame segmentation anno- tations. SiamMask [102] and the proposed approach RevVOS are listed in the third 90 OnA VOS [100] OSVOS-S [64] OSVOS [6] FA VOS [10] FEELVOS [99] RGMP [106] OSMN [110] SiamMask [102] Ours FT 3 3 3 7 7 7 7 7 7 M 3 3 3 3 3 3 3 7 7 J 61.6 64.7 56.6 54.6 69.1 64.8 52.5 51.1 61.0 F 69.1 71.3 63.9 61.8 74.0 68.6 57.1 55.0 66.2 J&F 65.4 68.0 60.3 58.2 71.5 66.7 54.8 53.1 63.6 time 26 9 18 1.2 0.51 0.28 0.28 0.06 0.81 Table 5.2: Quantitative results of different semi-supervised VOS approaches on the DA VIS 2017 validation set.FT andM denote fine-tuning on the first frame and exploit- ing the first frame ground truth mask, respectively. Time is measured in seconds per frame. Rw=OSVOS Rw=OnAVOS ReRO Ours J 72.3 77.3 76.7 78.5 Table 5.3: Ablation studies on DA VIS 2016 validation set. R w=OSVOS and Rw=OnAVOS denote replacing the segmentation branch to OSVOS [6] and OnA VOS [100], respectively.ReRO denotes removing the reverse optimization from the segmen- tation cues to the tracker. group, which uses a bounding box as the initialization without fine-tuning on the first frame. The proposed approach outperforms the previous best method in both evaluation metrics, with aJ mean of 78.5% and aF mean of 80.2%, which are 6.8% and 12.4%, respectively, higher than the previous best SiamMask. Figure 5.3 provides a benchmark of the state-of-the-art methods and the proposed approach betweenJ&F mean and runtime. In terms ofJ&F mean, the proposed approach outperforms MSK [74] in the first group, PML [9] and OSMN [110] in the sec- ond group, and SiamMask [102] in the third group, although we use only the bounding box as the initialization. In terms of runtime, the proposed approach is comparable with most of the methods in the second group, and much shorter than those in the first group. The proposed approach is much faster than the methods in the second group, when the first frame annotation time is considered in the total runtime, because bounding boxes are easier and less expensive to annotate, in contrast to pixel-wise segmentations. 
Qualitative results of the proposed approach for DA VIS 2016 are shown in the first two rows in Figure 5.4. The first column is the first frame with the pseudo-ground truth 91 GM LM PP J&F 3 65.2 3 3 73.9 3 3 3 79.4 Table 5.4: Ablation studies of different components in the segmentation branch on DA VIS 2016 validation set. GM denotes global matching with the first frame, and LM represents local matching with the previous frame. PP denotes adding previous predition to the input of the segmentation branch. annotation generated by applying Mask-RCNN [30] to the bounding box. The first row shows that the proposed method produces accurate segmentation for fast motion and appearance changes. The cows in the second row is an unseen object category in the training dataset. More qualitative results are shown in Figure 5.5 and Figure 5.6. The proposed approach is shown to cope well, and generates accurate segmentation masks for multiple videos with different challenges. DA VIS 2017. The DA VIS 2017 dataset extends the DA VIS 2016 dataset, and contains 90 HD video sequences in total. Multiple (one to five) instances are annotated in the 60 training sequences, and we perform evaluations on the 30 validation sequences. Table 5.2 compares the proposed method with recent semi-supervised approaches on the DA VIS 2017 validation set. With the same setting, the proposed method achieves 63.6%, which outperforms the second-best method SiamMask by 10.5% in terms of the J&F mean. In addition, the proposed method outperforms recent approaches, OSMN [110], FA VOS [10], and OSVOS [6], which use the ground truth segmentation mask as the initialization. Qualitative results of the proposed method are shown in the third and fourth rows in Figure 5.4. The first column is the first frame with the pseudo- ground truth annotation generated by applying Mask-RCNN [30] to each bounding box. Note that one or multiple bounding boxes exist in each frame in DA VIS 2017 datasets. The third row shows that the proposed method produces accurate segmentation for fast 92 0% 25% 50% 75% 100% Figure 5.7: More qualitative results of the proposed method on DA VIS 2017 validation set. motion and appearance changes. Thehorsejumphigh in the fourth row shows excel- lent results when multiple objects interact each other. More results are shown in Fig- ure 5.7 to demonstrate our approach works well with different challenges. The exam- ples include typical video challenges: scale variation, appearance change, etc. The per sequence quantitative results of DA VIS 2017 validation set are also shown in Table 5.5. 93 5.3.3 Ablation Studies Table 5.3 shows the ablation studies in the model on the DA VIS 2016 validation set. We disable or replace parts of the algorithm to see the contribution of each module. The pro- posed method achieves aJ mean of 78.5%. We replace the segmentation branch with OSVOS [6] and OnA VOS [100], using the default settings, as fine-tuning 500 times and 50 times the performance results degrade to 72.3% and 77.3%, respectively. The results for both methods are better than those for SiamMask [102], in terms of theJ mean, which demonstrates the effectiveness of the proposed two-stage approach. The results in the table also show that when the reverse optimization is degraded from the segmen- tation cues to the tracker, the result is degraded by 1.8%, which shows the importance of reverse optimization from segmentation to the bounding box. In addition, Table 5.4 analyzes the effect of each components in the segmentation branch. 
Jointly utilizing global matching with the first frame, local matching with the previous frame, and previous frame prediction achievesJ&F mean of 79.4%. When the previous frame prediction is disabled from the inputs, the performance is degraded by 5.5%. Additional removal of local matching with the previous frame leads to fur- ther 8.7% performance degration. Only applying global matching with the first frame achieves aJ&F score of 65.2%. This shows that global matching, local matching and previous frame prediction all contribute to the final prediction and they are complemen- tary to make better segmentation masks. 5.4 Conclusion We propose a two-stage method for video object segmentation, track and then segment, without annotating the segments in the first frame. Instead of leveraging multi-task learning simultaneously using a shared representation, we build an integrated system 94 that tracks the box of the object first, and then segments the objects in the tracker. After- ward, the segmentation cues are inversely optimized to update the tracker. Experimental results show that this simple idea achieves state-of-the-art performance on two bench- mark datasets. 95 Sequence J F Sequence J F bike-packing 1 0.442 0.583 kite-surf 1 0.29 0.411 bike-packing 2 0.729 0.772 kite-surf 2 0.267 0.305 blackswan 1 0.925 0.954 kite-surf 3 0.705 0.896 bmx-trees 1 0.29 0.705 lab-coat 1 0 0 bmx-trees 1 0.29 0.705 lab-coat 2 0 0 bmx-trees 2 0.586 0.789 lab-coat 3 0.889 0.834 breakdance 1 0.317 0.392 lab-coat 4 0.71 0.579 camel 1 0.825 0.906 lab-coat 5 0.785 0.724 car-roundabout 1 0.887 0.854 libby 1 0.712 0.883 car-shadow 1 0.954 0.997 loading 1 0.93 0.897 cows 1 0.939 0.971 loading 2 0.289 0.404 dance-twirl 1 0.73 0.739 loading 3 0.818 0.871 dog 1 0.938 0.967 mbike-trick 1 0.618 0.68 dogs-jump 1 0.128 0.234 mbike-trick 2 0.658 0.775 dogs-jump 2 0.545 0.64 motocross-jump 1 0.697 0.695 dogs-jump 3 0.744 0.841 motocross-jump 2 0.791 0.761 drift-chicane 1 0.841 0.933 paragliding-launch 1 0.769 0.816 drift-straight 1 0.916 0.888 paragliding-launch 2 0.587 0.868 goat 1 0.856 0.865 paragliding-launch 3 0 0 gold-fish 1 0.604 0.576 parkour 1 0.909 0.969 gold-fish 2 0.596 0.642 pigs 1 0.836 0.813 gold-fish 3 0.609 0.673 pigs 2 0.568 0.721 gold-fish 4 0.792 0.822 pigs 3 0.908 0.826 gold-fish 5 0.666 0.427 scooter-black 1 0.192 0.423 horsejump-high 1 0.778 0.915 scooter-black 2 0.734 0.699 horsejump-high 2 0.733 0.953 shooting 1 0 0 india 1 0.662 0.63 shooting 2 0.815 0.788 india 2 0.177 0.239 shooting 3 0 0 india 3 0.116 0.178 soapbox 1 0.7 0.734 judo 1 0.797 0.822 soapbox 2 0.67 0.73 judo 2 0.68 0.725 soapbox 3 0.553 0.678 Table 5.5: Per sequence quantitative results of DA VIS 2017 validation set. 96 Chapter 6 Conclusion and Future Work 6.1 Summary of the Research In this thesis, we systematically introduced our work on unsupervised and semi- supervised video object segmentation and drone monitoring system using deep learning techniques. To tackle unsupervised video object segmentation, we presented a simple yet intu- itive approach for unsupervised video object segmentation. Specifically, instead of man- ually annotating the first frame like existing semi-supervised methods, we proposed to automatically generate the approximate annotation, pseudo ground truth, by jointly employing instance segmentation and optical flow. 
Experimental results on the DA VIS, FBMS and SegTrack-v2 demonstrate that our approach enables effective transfer from semi-supervised VOS to unsupervised VOS and improves the mask prediction perfor- mance by a large margin. Our error analysis shows that using better instance segmenta- tion has a dramatic performance boost which shows great potential for further improve- ment. Our approach is able to extend from single object tracking to multiple arbitrary object tracking based on the category-agnostic ground truths or pseudo ground truths. To deal with multiple objects interaction, a distractor-aware online adaptation for unsupervised video object segmentation was proposed. The motion between adjacent frames and image segmentation are combined to generate the approximate annotation, pseudo ground truth, to replace the ground truth of the first frame. Motion-based hard example mining and block matching algorithm were integrated to produce distractors 97 which were further incorporated into online adaptation. In addition, motion-based pos- itive examples selection was combined with the hard negatives during online updating. Besides, we conducted an ablation study to demonstrate the effectiveness of each com- ponent in our proposed approach. Experimental results show that the proposed method achieves state-of-the-art performance on two benchmark datasets. In addition, a video-based drone monitoring system was proposed in this thesis to detect and track drones during day and night. The system consisted of the drone detec- tion module and the drone tracking module. Both of them were designed based on deep learning networks. We developed a model-based data augmentation technique for vis- ible drone monitoring to enrich the training data. Besides, we presented an adversarial data augmentation methodology to create more thermal drone images due to the lack of the thermal drone data. We also exploited residue images as the input to the drone tracking module. The fully integrated monitoring system takes advantage of both mod- ules to achieve high performance monitoring. Extensive experiments were conducted to demonstrate the superior performance of the proposed drone monitoring system. Finally, we propose a two-stage method for video object segmentation, track and then segment, without annotating the segments in the first frame. Instead of leveraging multi-task learning simultaneously using a shared representation, we build an integrated system that tracks the box of the object first, and then segments the objects in the tracker. Afterward, the segmentation cues are inversely optimized to update the tracker. Exper- imental results show that this simple idea achieves state-of-the-art performance on two benchmark datasets. 98 6.2 Future Research In the future work, we are interested in addressing the following two problems, video- based panoptic segmentation and 3D model reconstruction from videos. 6.2.1 Video-based panoptic segmentation Semantic segmentation is defined simply assigning a class label to each pixel in an image, and different objects in one class are assigned the same label. While instance segmentation is defined assigning an instance label to each pixel in the object bounding box. Instance segmentation provides different labels for different objects even if they are in the same class. The panoptic segmentation is proposed to encompass the instance segmentation and semantic segmentation, so that all the pixels in the images are identi- fied and all the instances are identified as well. 
This means each pixel of an image has a semantic meaning and also an instance id. The panoptic segmentation task combines both instance segmentation and semantic segmentation and therefore brings new challenges. Different from semantic segmen- tation, it distinguishes each object instance which brings a challenge to fully convo- lution network. The panoptic segmentation also leads to a challenge for region-based approaches that deal with each instance independently which is different from instance segmentation. Resolving the inconsistencies between the instance segmentation and semantic segmentation are the key challenges in panoptic segmentation and is the core step towards real-world applications such as autonomous driving. Previously, we did video object segmentation based on instance segmentation and semantic segmentation. We plan to focus on video-based panoptic segmentation by utlizing image-based panoptic segmentation experiences. We are interested in adding 99 scene understanding to the video object segmentation, and combining the instance seg- mentation and semantic segmentation to predict the video-based panoptic segmentation. 6.2.2 3D model reconstruction from videos In this direction, we plan to focus on synthesize 3D data by deep learning approaches. Specifically, we are going to address the problem of generating the 3D geometry and 2D silhouette of a generic object from multiple images. The 3D geometry could be either 3D point cloud or graph-based model. A point cloud representation may not be as efficient in representing the underlying continuous 3D geometry as compared to a CAD model using geometric primitives or even a simple mesh, but it also has many advantages. A point cloud is a simple, uniform structure that is easier to learn, as it does not have to encode multiple primitives or com- binatorial connectivity patterns. In addition, a point cloud allows simple manipulation when it comes to geometric transformations and deformations, as connectivity does not have to be updated. The graph-based model is also a good representation since it provides the structural information which the point cloud cannot provide. We attempt to solve this ill-posed problem of 3D structure recovery from several projection using certain learned priors. We plan to train a network to predict the silhouette and the depth of the surface given a variable number of images where the silhouette is predicted at a different viewpoint from the input while the depth is predicted at the viewpoint of the input images. The pro- posed method has two benefits. First, using a view-dependent representation improves the network’s generalizability to unseen objects. Second, the network learns about 3D using the proxy tasks of predicting depth and silhouette images, it is not limited by the resolution of the 3D representation. 100 Our final goal is to predict 3D models from a video clip, and utilize the 3D model constraint to update the 2D silhouettes to improve the performance of video object seg- mentation. 101 Bibliography [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghe- mawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. InOSDI, volume 16, pages 265–283, 2016. [2] R. Appel, T. Fuchs, P. Doll´ ar, and P. Perona. Quickly boosting decision trees– pruning underachieving features early. In International conference on machine learning, pages 594–602, 2013. [3] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski. 
Abstract
Unsupervised video object segmentation is a crucial task in video analysis when no prior information about the objects is available. It becomes tremendously challenging when multiple objects occur and interact in a given video clip. In this thesis, a novel unsupervised video object segmentation approach via distractor-aware online adaptation (DOA) is proposed. DOA models spatial-temporal consistency in video sequences by capturing background dependencies from adjacent frames. Instance proposals are generated for each frame by an instance segmentation network and then, based on motion information, selected as positives or, when present, hard negatives. To obtain high-quality hard negatives, a block matching algorithm is applied to preceding frames to track the associated hard negatives. General negatives are also introduced in case there are no hard negatives in a sequence, and experiments demonstrate that the two kinds of negatives (distractors) are complementary. Finally, we conduct DOA using the positive, negative, and hard negative masks to update the foreground/background segmentation. The proposed approach achieves state-of-the-art results on two benchmark datasets, DAVIS 2016 and FBMS-59.

In addition, this thesis reports a visible and thermal drone monitoring system that integrates deep-learning-based detection and tracking modules. The biggest challenge in adopting deep learning methods for drone detection is the paucity of training drone images, especially thermal drone images. To address this issue, we develop two data augmentation techniques. One is a model-based drone augmentation technique that automatically generates visible drone images with a bounding box label on the drone's location. The other exploits an adversarial data augmentation methodology to create thermal drone images. To track a small flying drone, we utilize the residual information between consecutive image frames. Finally, we present an integrated detection and tracking system that outperforms each individual detection-only or tracking-only module. The experiments show that, even though it is trained on synthetic data, the proposed system performs well on real-world drone images with complex backgrounds. The USC drone detection and tracking dataset with user-labeled bounding boxes is available to the public.

Finally, this thesis presents a two-stage approach, track and then segment, to perform semi-supervised video object segmentation (VOS) with only bounding box annotations. We present reverse optimization for VOS (RevVOS), which leverages a fully convolutional Siamese network to perform the tracking and then segments the objects within the tracked regions. The segmentation cues, in turn, reversely optimize the localization of the tracker. The proposed two-branch system runs online to produce object segmentation masks. We demonstrate significant improvements over state-of-the-art methods on two video object segmentation datasets: DAVIS 2016 and DAVIS 2017.