OBJECT LOCALIZATION WITH DEEP LEARNING TECHNIQUES

by Siyang Li

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2018

Copyright 2018 Siyang Li

Acknowledgments

Now I am standing at the end of the Ph.D. adventure. At this very moment, I recall the day when I made the decision to pursue a Ph.D. degree years ago. It was in the spring of 2014 and I was about to finish my undergraduate study in Hong Kong. To be honest, it was a hard decision - the Ph.D. journey seemed extremely long and was full of uncertainty. However, when I look back in time, the Ph.D. years actually elapsed like a rapid river running into the giant sea. The first day that I arrived at USC seems like yesterday to me. The whole adventure was indeed full of uncertainty. There was disappointment, doubt and exhaustion. However, more importantly, I also enjoyed the achievement of successful experiments, the appreciation from the research community and the pride of pushing forward the boundary of human technology. If I could travel back in time, I would love to say to my younger self, "Life is always uncertain. Just be brave and believe in yourself."

A lot of people have supported me in the past years. I would first love to express my gratitude to my Ph.D. supervisor, Prof. C.-C. Jay Kuo. In the past four years, Prof. Kuo has provided lasting guidance, care and tolerance. Beyond his help with my research, his enthusiasm, self-discipline and sense of responsibility will always enlighten my future life. I am honored to have him as my supervisor and to have established a life-long friendship with him.

My mother, Banmei Yu, played an important role in my Ph.D. years. It was her unconditional love that empowered me whenever I felt upset. I was also greatly encouraged by her braveness, which enabled her to finish her graduate studies in extremely hard years. My mother is the most talented and courageous lady I have ever seen, and she will always be my role model.

My boyfriend, Yaguang Li, has supported me throughout the Ph.D. journey, not only with love and encouragement, but also with practical suggestions. Being a Ph.D. student at USC as well, with more research experience, he could feel exactly what I felt and knew exactly what I needed. He provided guidance when I was conducting experiments, comforted me when I failed, and shared the fulfillment when I succeeded. I could not have achieved what I have now without him.

Many labmates helped me in both research and daily life, especially my officemates, Junting Zhang and Heming Zhang. We had research discussions together, encouraged each other, and shared a lot of fun time. I wish them all the best on the rest of their Ph.D. journeys.

During my Ph.D. years, I did three research internships at Google AI Perception. These opportunities were extremely important and beneficial. I would like to thank my hosts and co-authors who helped my research and publications: Henry Rowley, Xiangxin Zhu, Bryan Seybold, Alexey Vorobyov and Alireza Fathi. It was a great honor to collaborate with them.

Finally, I would love to thank my qualifying and defense exam committee members: Prof. Alexander Sawchuk, Prof. Justin Haldar, Prof. Keith Chugg and Prof. Joseph Lim. Their feedback was extremely helpful for my research.
Contents

Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Significance of the Research
1.2 Contributions of the Research
1.2.1 Object proposal enhancement
1.2.2 Weakly supervised object detection
1.2.3 Unsupervised video object segmentation
1.3 Organization of the Dissertation

2 Research Background
2.1 Convolutional Neural Networks
2.2 Object Proposal
2.3 Object Detection
2.4 Image Segmentation
2.5 Video Object Segmentation

3 Box Refinement: Object Proposal Enhancement and Pruning
3.1 Introduction
3.2 Methodology
3.2.1 Cost Map
3.2.2 Anchor Selection
3.2.3 Contour Search
3.3 Experiments
3.3.1 Best Candidate Deviation
3.3.2 Improvement in Recall-versus-IoU Performance
3.3.3 Overall Performance Benchmarking
3.3.4 Impact of Proposal Budgets
3.3.5 Combination with Detectors
3.3.6 Impact of Original Proposal Quality
3.3.7 Impact of Object Sizes
3.4 Conclusions

4 Multiple Instance Curriculum Learning for Weakly Supervised Object Detection
4.1 Introduction
4.2 Proposed MICL Method
4.2.1 Detector Initialization
4.2.2 Segmentation-based Seed Growing
4.2.3 Multiple Instance Curriculum Learning
4.3 Implementation Details
4.3.1 Detector Initialization
4.3.2 Segmentation-based Seed Growing
4.3.3 Multiple Instance Curriculum Learning
4.4 Experiments
4.4.1 Datasets and Evaluation
4.4.2 Experimental Results
4.4.3 Analyses
4.5 Conclusions

5 Unsupervised Video Object Segmentation
5.1 Introduction
5.2 Method
5.2.1 Instance embeddings and objectness
5.2.2 Motion-based bilateral networks for background estimation
5.2.3 Embedding graph cut
5.2.4 Label propagation
5.3 Experiments
5.3.1 Datasets and evaluation metrics
5.3.2 Implementation
5.3.3 Performance comparison
5.3.4 Analysis of module contributions
5.4 Conclusions

6 Conclusions and Future Work
6.1 Summary of the Research
6.2 Future Research Directions
6.2.1 Weakly Supervised Image/Video Segmentation
6.2.2 Domain Adaptation for Image/Video Segmentation

Bibliography

List of Tables

3.1 Comparison of the average recall (AR), 0.5-recall and 0.8-recall of 15 methods with 100, 300 and 1000 proposals. Results improved by our box refinement method have the "BR-" prefix. The best results are highlighted in bold.
3.2 Comparison of the detection mean average precision (mAP) on RPN [92] and BR-RPN.
4.1 Comparison of mAP values on the VOC07 test set. Note that the gaps between the previous approaches and our method are particularly large on categories such as "cat", "dog" and "horse". The improvements are from the SSG network, which grows the bounding box to cover the full objects from the discriminative parts (i.e., faces).
4.2 Comparison of the CorLoc on the VOC07 trainval set.
4.3 Results on the VOC12 dataset, where the AP and the CorLoc are measured on the test and trainval set, respectively.
4.4 The precision of the maximum saliency point on the VOC07 trainval set.
4.5 The precision and recall of segmentation seeds on the VOC07 trainval set.
4.6 The CorLoc of the bounding boxes from the segmenter on the VOC07 trainval set.
4.7 Performance comparison of the three Fast R-CNN baselines and the one with the proposed multiple instance curriculum learning paradigm.
4.8 Comparison of the CorLoc on the selected training subset versus on the whole set and mAP on the test set achieved by the correspondingly trained detectors.
5.1 The results on the val split of the DAVIS 2016 dataset [88]. The proposed method outperforms other unsupervised methods in terms of J/F Mean, and is even better than some semi-supervised methods. For the temporal stability (T), our method is the second best.
5.2 Comparison of the J mean and the F-score on the test split of the FBMS-59 dataset [83].
5.3 The runtime of each module of the proposed method. The computation bottleneck is embedding extraction and dense CRF.
5.4 Performance comparison between results of the motion-based BNN and other dual-branch methods on the DAVIS 2016 val split (without CRF refinement).
5.5 Performance comparison of different pairwise costs and seed linking schemes. Motion similarity and dense temporal edges help to achieve better performance.
5.6 Building the embedding graph with different sets of consecutive frames for online and offline processing. Under the online scenario, we consider a temporal window of length (W + 1) ending at frame t. For offline processing, a window of length (2W + 1) centered at t is used. For label propagation, using seeds from the previous, the current and the following frames gives the optimal results. This group of variants is evaluated on the DAVIS 2016 val set with J Mean (without CRF) as the metric.

List of Figures

1.1 An object detector typically takes in a 2-D image and outputs bounding boxes with scores of object categories. The output results are from the Faster RCNN detector [92].
1.2 An object segmenter takes in an image and outputs a mask of the object present in the image.
1.3 An example showing that merely knowing what exists in an image may give wrong results for queries. If the location of the sofa is unknown, the image on the right can be wrongly provided as the result for the query "red sofa".
1.4 An example showing that object location is important to understand the interaction between objects. Left: a person riding a horse; Right: a person standing by a horse.
1.5 The diversity of object appearance brings challenges to the object localization problem. Top row: cats with intrinsically different appearance. Middle row: cars in different viewing angles. Bottom row: people with occlusion, blur or luminance change.
2.1 An illustration of 2-D convolution. The figure is from [38].
2.2 The RCNN framework. First, an object proposal algorithm is applied to extract potential regions. Then each region is warped to a fixed size and sent to the classification CNN for feature extraction. Finally, a linear class-specific SVM is applied to each region feature. The figure is from [36].
2.3 The Fast RCNN framework. It computes a feature map for the whole image and projects object proposals onto the feature map. RoI pooling is applied to extract a feature vector for every proposal. Finally a classifier and a bounding box regressor take in the RoI feature and produce the object class confidence and the adjusted box. The figure is from [35].
2.4 The Region Proposal Network (RPN). The figure is from [92].
2.5 The architecture of FCN [76], which combines high-level, low-resolution layers with low-level, high-resolution layers (figure taken from [76]).
2.6 Convolution with the "atrous algorithm". The figure is from [17].
3.1 Illustrative examples for the proposed box refinement (BR) method, where the original image, the edge map, the optimal contour (in yellow) and the final refined proposal are shown in the first, second, third and fourth rows, respectively. In the bottom right figure, the box in magenta is the initial proposed box obtained by [92] while the box in blue is the refined box. Edge maps are enhanced and the optimal contour is widened for better visualization (the enhancement for visualization is applied to other figures as well).
3.2 The Box Refinement (BR) pipeline. Given an image, we obtain its edge map with the Structured Edge detector [26] and then sticky SLIC superpixels [2]. Next, a cost map is computed and we construct a directed graph whose edge weight is obtained from the cost map. Keypoints (marked by green dots) are placed based on the configuration of sticky superpixels. Anchors (marked by stars in green) are selected for the tightness constraint. The Cartesian coordinates are established at the centroid of each initial bounding box. We search for the optimal contour segment in all four quadrants using the shortest path algorithm, and obtain the closed optimal contour by post-processing. Finally, we refine the proposal based on the optimal contour location.
3.3 Two searched optimal contours with (left figure) and without (middle figure) penalty. The recall-versus-IoU performance with (in cyan) and without (in blue) penalty on 500 randomly selected images from the test set is shown in the right figure. Experiments are done with 300 proposals from [92].
3.4 Illustration of the optimal contour search process, where the initial bounding box and the four optimal contour segments are shown in the left figure. The optimal contour segment in quadrant I overlaps with those in quadrants II and IV. These overlapping parts are removed from the final optimal contour shown in the right figure, since they indicate poor localization. Furthermore, we allow the optimal contour segment to go beyond the initial bounding box region, such as that in quadrant III.
3.5 Top left: A bounding box touching the left border (in magenta) is given. Top right: V_left is added to G. The black dots represent the vertices corresponding to the left border pixels. Other details of G irrelevant to V_left are omitted here. All edges shown in green have a constant weight. Bottom left: The contour segments are found by assuming that the object has a complete visible contour. This set of contour segments gives a score of 0.379. Bottom middle: Assuming that the object is cropped, the optimal contour segments in quadrants II and III can start with any pixel on the left border, equivalent to starting with V_left in G. A score of 0.417 is obtained. Bottom right: The final optimal contour is obtained by choosing the one with the higher score.
3.6 Comparison of the recall-versus-IoU curves of six proposal methods with and without box refinement (BR), where results of 100, 300 and 1000 proposals are shown in the first, second and third column. Results of the six methods are split into two groups for clarity (three in the top row while the other three in the bottom row). Results without and with BR are represented by dashed and solid lines, respectively.
3.7 The deviation of the four sides from the ground truth bounding boxes, with 1000 proposals. The original methods are represented by dashed lines and their improved versions are in solid lines.
3.8 Visual comparison of initial RPN proposals (in magenta) and the improved proposals using the proposed box refinement process (in blue). Ground truth bounding boxes are in green. Irrelevant ground truth boxes are omitted here.
3.9 Captured and missed ground truth boxes with IoU threshold equal to 0.8 are shown in green and red, respectively. Methods without and with our box refinement method are in yellow and blue, respectively.
3.10 The illustration of contour searching. Edge maps are enhanced and contours are widened for better visualization.
3.11 Comparison of the recall-versus-IoU performance for 13 representative methods, where the methods with our proposed box refinement process and MTSE [19] are shown in solid lines and dotted lines, respectively. The remaining seven methods without any improvement are shown in dashed lines.
3.12 Comparison of the average recall (AR), 0.8-recall and average best overlapping (ABO) as a function of the proposal budget for 13 representative methods, where the methods improved by our box refinement process are shown in solid lines, the methods improved by MTSE [19] are shown in dotted lines and the methods without any refinement are shown in dashed lines.
3.13 The relationship between the original RPN [92] proposals and the corresponding refined proposals.
3.14 The impact of object size (with 100% representing the largest object) on our box refinement method.
4.1 The training diagram of our MICL method. It iterates over re-localization and re-training. During re-localization, saliency maps are generated from the current detector. Segmentation seeds are obtained from the saliency maps, which later grow to segmentation masks. We use the segmentation masks to guide the detector to avoid being trapped in the discriminative parts. A curriculum is designed based on the segmentation masks and the current top scoring detections. With the curriculum, the multiple instance learning process can be organized in an easy-to-hard manner for the detector re-training.
4.2 The detector with a saliency branch to find the most salient candidate (MSC) for image classification.
4.3 Compute the saliency map in a RoI.
4.4 Aggregate RoI saliency maps to the image saliency maps.
4.5 Visualized easy examples selected by measuring the consistency between boxes from the segmenter (in red) and the detector (in green). Those examples are used for the next round of detector training.
4.6 Visualized hard examples selected by measuring the consistency between boxes from the segmenter (in red) and the detector (in green). Those examples are NOT used for the next round of detector training.
4.7 Qualitative detection results, where the correctly detected ground truth objects are in green boxes and blue boxes represent the correspondingly predicted locations. Objects that the model fails to detect are in yellow boxes and false positive detections are in red.
4.8 Comparison of the CorLoc performance of the Fast R-CNN detector with and without curriculum learning.
4.9 Percentages of three error types among the mis-localized objects from the SSG and the MSC.
5.1 An example of the changing segmentation target (foreground) in videos depending on motion. A car is the foreground in the top video while a car is the background in the bottom video. To address this issue, our method obtains embeddings for object instances and identifies representative embeddings for foreground/background and then segments the frame based on the representative embeddings. Left: the ground truth. Middle: A visualization of the embeddings projected into RGB space via PCA, along with representative points for the foreground (magenta) and background (blue). Right: the segmentation masks produced by the proposed method. Best viewed in color.
5.2 An overview of the proposed method that consists of four modules. The instance embeddings and objectness score are extracted in Module 1, highlighted by green. The background is estimated from the inverse objectness map and the optical flow through a bilateral network (BNN) in Module 2, which is enclosed in light blue. Then, an embedding graph that contains sampled pixels (marked by dots) from a set of consecutive frames as vertices is constructed. The unary cost is defined based on the objectness and the estimated background from the BNN. The pairwise cost is from instance embedding and optical flow similarity. All vertices are classified in Module 3 by optimizing the total cost, where magenta and yellow dots denote the foreground and background vertices, respectively. Finally, the graph vertex labels are propagated to all remaining pixels based on embedding similarity in Module 4. Best viewed in color.
5.3 Illustration of a fast bilateral filtering pipeline with a bilateral space of dimension d = 2: 1) splatting (left): the available features from input samples (orange squares) are accumulated on the bilateral grid; 2) convolving (center): the accumulated features on vertices are filtered and propagated to neighbors; and 3) slicing (right): the feature of any sample with a known position in the bilateral space can be obtained by interpolation from its surrounding vertices.
5.4 Top: An image (left) and the embedding edge map (right). Bottom: the candidate set C (left) and the seed set S (right). Best viewed in color.
5.5 An illustration of embedding graph construction. We find the local minima in the embedding discrepancy map and select them as graph nodes. Edges connect nodes that are spatial neighbors or from consecutive frames (see text for the definition of spatial neighbors). Best viewed in color.
5.6 Given an arbitrary pixel (marked by the diamond), its surrounding nodes (in red circles) are identified from the current frame, the previous frame (omitted here) and the following frame. The label of the node with the shortest embedding distance is assigned to the pixel. Best viewed in color.
5.7 Qualitative segmentation results from the DAVIS 2016 val split. The first four sequences feature motion blur, occlusion, large object appearance change, and static objects in the background, respectively.
5.8 Visual examples from the test split of the FBMS-59 dataset [83]. Best viewed in color.
6.1 A failed example of weakly supervised segmentation. From left to right: the image, the saliency map, and the expanded masks.
6.2 A pair of consecutive frames containing a boat and the optical flow computed by [49].
6.3 A synthesized image (left) and a real-world image (right).

Abstract

Object localization is a crucial step for computers to understand an image. An object localizer typically takes in an image and outputs the bounding boxes of objects. Some applications require finer localization - delineating the shape of objects - which is called "object segmentation". In this dissertation, three object localization related problems are studied: 1) improving the accuracy of object proposals, 2) reducing the labeling effort for object detector training, and 3) segmenting the moving objects in videos.

Object proposal generation has been an important pre-processing step for object detectors in general and convolutional neural network (CNN) detectors in particular. However, some object proposal methods suffer from the "localization bias" problem: the recall drops rapidly as the localization accuracy requirement increases. Since contours offer a powerful cue for accurate localization, we propose a box refinement method that searches for the optimal contour of each initial bounding box, i.e., the contour that minimizes the contour cost. The box is then aligned with the contour. Experiments on the PASCAL VOC2007 test dataset show that our box refinement method can significantly improve the object recall at a high overlapping threshold while maintaining a similar recall at a loose one. Given 1000 proposals, the average recall of multiple existing methods is increased by more than 5% with our box refinement process integrated.

The second research problem is motivated by the fact that convolutional neural network based object detectors usually require a large number of accurately annotated object bounding boxes. In contrast, image-level labels are much cheaper to obtain. Thus, we supervise the detectors with image-level labels only. A common drawback of such a training setting is that the detector tends to output bounding boxes of discriminative object parts (e.g., a box around a cat's face). To address this challenge, we incorporate object segmentation into the detector training, which guides the model to correctly localize the full objects. We propose the multiple instance curriculum learning (MICL) method, which injects curriculum learning (CL) into the multiple instance learning (MIL) framework. The MICL method starts by automatically picking the easy training examples, where the extent of the segmentation mask agrees with the detection bounding boxes. The training set is gradually expanded to include harder examples so as to train strong detectors that handle complex images. The proposed MICL method with segmentation in the loop outperforms the state-of-the-art weakly supervised object detectors by a substantial margin on the PASCAL VOC datasets.

In the third part, we propose a method for unsupervised video object segmentation that transfers the knowledge encapsulated in image-based instance embedding networks. The instance embedding network produces an embedding vector for each pixel that enables identifying all pixels belonging to the same object. Though trained on static images, the instance embeddings are stable over consecutive video frames.
To reduce the false positives from static objects, a motion-based bilateral network is trained to estimate the background, which is later integrated with the instance embeddings into a graph. We classify graph nodes by defining and minimizing a cost function, and segment the video frames based on the node labels. The proposed method outperforms previous state-of-the-art unsupervised video object segmentation methods on several benchmark datasets.

Chapter 1 Introduction

1.1 Significance of the Research

Objects are defined as "standalone things with a well-defined boundary and center, such as cows, cars, and telephones, as opposed to amorphous background stuff, such as sky, grass, and road" [4]. Object localization aims at finding the locations of existing objects given an image or a video. Typically, an object localizer (also called a "detector") should output a bounding box for each object belonging to the categories of interest (which are defined by the dataset), as shown in Fig. 1.1. An even finer object localizer would give the accurate object mask instead of a bounding box, as shown in Fig. 1.2. Such localizers are usually called "object segmenters". In this dissertation, we first focus on bounding-box-level object localization on static images and then study object segmentation in videos.

Figure 1.1: An object detector typically takes in a 2-D image and outputs bounding boxes with scores of object categories. The output results are from the Faster RCNN detector [92].

Figure 1.2: An object segmenter takes in an image and outputs a mask of the object present in the image.

Object localization serves as a crucial task when machines/computers want to understand an image beyond what the image contains. There are many cases where knowing merely what is in the image becomes insufficient. For example, in the image search problem with the query "red sofa" shown in Fig. 1.3, without knowing the location of the sofa, the wrongly queried image on the right would be returned to users, since it contains something "red" (the wall) and a "sofa". To realize that the sofa is not red, its location is necessary.

Figure 1.3: An example showing that merely knowing what exists in an image may give wrong results for queries. If the location of the sofa is unknown, the image on the right can be wrongly provided as the result for the query "red sofa".

The location information of objects also helps machines understand how multiple objects in images interact with each other. As shown in Fig. 1.4, both images contain a horse and a person. Only with known locations of those two objects can the machine understand the difference between them: in the left image, the person is riding a horse, while in the right image, the person is standing beside a horse.

Figure 1.4: An example showing that object location is important to understand the interaction between objects. Left: a person riding a horse; Right: a person standing by a horse.

Though important, object localization is an extremely challenging problem. Besides the intrinsic object diversity, such as different types of cats or people in various clothes, extrinsic diversity such as different viewing angles, lighting conditions, occlusions and blurring (including motion blur and out-of-focus blur) makes the problem even more challenging, as shown in Fig. 1.5.
Recently, convolutional neural network (CNN) based object detectors and segmenters have largely improved the performance of object localization despite the challenges from the diversity of object appearance. However, other challenges come with those detectors. First, many CNN based detectors take in object proposals as the input and then classify each proposal, but the object proposals are not accurate enough. Second, training those detectors requires densely labeled images, where human raters need to provide tight bounding boxes for all objects. The time-consuming labeling process prevents the construction of large-scale training datasets and thus hinders further progress in object detection. The object segmentation problem also faces the challenge of labeling, especially in videos. Building a fully labeled and diverse video segmentation dataset is even more expensive.

In this dissertation, we first focus on improving the accuracy of object proposals. Then we switch our goal to reducing the supervision level in object detection training. Specifically, we explore weakly supervised object detection, where only image-level labels are available for training. Finally, we focus on video object segmentation, which aims at segmenting the moving objects. We leverage diverse image datasets to learn useful embeddings for pixels and transfer them to videos, so as to reduce the amount of required labels for videos.

Figure 1.5: The diversity of object appearance brings challenges to the object localization problem. Top row: cats with intrinsically different appearance. Middle row: cars in different viewing angles. Bottom row: people with occlusion, blur or luminance change.

1.2 Contributions of the Research

1.2.1 Object proposal enhancement

Object proposal generation is a crucial pre-processing procedure for many object detectors. We propose to enhance the object proposal boxes in terms of location accuracy, so that the output bounding boxes of object detectors are tighter.

The proposed enhancement algorithm is based on edge information. Given a bounding box, we search for the optimal contour for this box and require two properties of the optimal contour. 1) Edgeness: along the contour, the edge response should be high. 2) Tightness: since the initial proposal box already provides a rough location of the object, the optimal contour should not deviate too far away.

We obtain the edge map and define its inversion as a cost map (a minimal sketch of this step is given below). By minimizing the cost along the contour, we guarantee its edgeness. We place anchors around the initial bounding boxes and force the optimal contour to pass through them, so that the tightness requirement is satisfied. We solve the optimization problem by building a graph from the cost map and then searching for the shortest path that connects the anchors. The optimization process is efficient.

Our method can be used as a "plug-in" post-processing module for any object proposal approach that produces bounding boxes. In the experiments, we show that our method successfully improves the average recall of various proposals, including RPN [92], Edge Boxes [122], Deep Proposal [34], BING [21] and Objectness [5].
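The cost-map construction mentioned above can be sketched in a few lines of Python. The snippet below is illustrative only: a plain gradient-magnitude edge detector stands in for the Structured Edge detector used in the dissertation, and the normalization and the box-expansion margin are assumptions for illustration, not the exact BR settings.

```python
import numpy as np

def edge_to_cost_map(gray_image: np.ndarray) -> np.ndarray:
    """Build a cost map as the inversion of a normalized edge map.

    Gradient magnitude stands in for the Structured Edge detector here;
    the real pipeline would plug in a stronger edge response.
    """
    gy, gx = np.gradient(gray_image.astype(np.float32))
    edge = np.sqrt(gx ** 2 + gy ** 2)
    edge /= edge.max() + 1e-8          # normalize edge response to [0, 1]
    return 1.0 - edge                  # strong edges -> low traversal cost

def crop_search_region(cost: np.ndarray, box, margin: int = 16) -> np.ndarray:
    """Restrict the contour search to an initial box (x1, y1, x2, y2) plus a margin."""
    x1, y1, x2, y2 = box
    h, w = cost.shape
    return cost[max(0, y1 - margin):min(h, y2 + margin),
                max(0, x1 - margin):min(w, x2 + margin)]
```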
1.2.2 Weakly supervised object detection

Weakly supervised object detection trains a detector with image-level labels only. A common challenge is that such a detector often produces a bounding box of the most discriminative object part (e.g., the cat face) instead of a tight box of the whole object (the entire cat). To address this challenge, we propose to guide the detector training with expanded object masks from an object segmentation network, combining curriculum learning and multiple instance learning.

We take advantage of the saliency maps from a trained image classification network to identify the location of discriminative object parts and train a segmentation network to expand masks from those parts to the whole object. We initialize a detector to find, among the object proposals, the most salient object for the image classification task, and then guide the detector with the expanded masks.

We propose an easy-example selection criterion that analyzes the consistency between the results from the detector and the expanded masks, and then train the detector with a curriculum (from easy examples to hard ones); a brief sketch of this rule is given at the end of Sec. 1.2. No additional human supervision is involved in example selection.

The training with a curriculum is achieved by adopting the multiple instance learning framework, which alternates between re-localization and re-training. In re-training, harder examples are gradually added to the training set. In re-localization, we identify the highest-scoring box for each existing object and also update the saliency map based on the trained detector. In order to update the saliency maps, we generalize the saliency computation from classification networks to detection networks.

The superior performance of the proposed method is demonstrated by surpassing the state-of-the-art weakly supervised detectors, especially for detecting animals (e.g., cats, dogs, and horses), whose faces are a highly discriminative part.

1.2.3 Unsupervised video object segmentation

Video object segmentation (VOS) aims at segmenting moving objects in videos. In this problem, "unsupervised" refers to the scenario where the target object is not given for the query video (i.e., the testing videos), not the case where the training data is unlabeled. For the unsupervised VOS problem, training can utilize as much labeled data as possible. A common challenge for VOS is semantically similar objects in the background (static objects). To solve this problem, we propose a background estimation network based on motion information that includes the static objects in the estimated background, and we leverage instance embeddings, which encode pixel similarity, to provide an accurate mask for moving objects.

We propose a new strategy for adapting instance segmentation models trained on static images to videos. Notably, this strategy performs well on video datasets without requiring any video object segmentation annotations.

A trainable bilateral network is trained to estimate the background based on motion cues and objectness. By first localizing a small set of pixels in non-object regions from the objectness scores, the bilateral network can expand the background from this small set to regions with similar motion patterns. The expansion effectively reduces the false positives from semantically similar but static objects.

Efficient multi-frame reasoning is conducted via graph cut of a reduced set of points seeded across objects, where the motion similarity and instance similarity are modeled.

We achieve state-of-the-art results on the DAVIS 2016 and FBMS-59 datasets, with intersection-over-union (IoU) scores of 80.4% and 73.9%, outperforming the previous best results by 1.9% and 2.0%, respectively.
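The easy-example selection rule of Sec. 1.2.2 can be made concrete with a small sketch. The snippet below is schematic rather than the dissertation's exact implementation: the IoU threshold of 0.5 and the mask-to-box conversion are assumptions chosen for illustration.

```python
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Tight bounding box (x1, y1, x2, y2) around a binary segmentation mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def box_iou(a, b) -> float:
    """IoU of two boxes given as inclusive pixel coordinates (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1 + 1) * max(0, iy2 - iy1 + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / float(area(a) + area(b) - inter)

def is_easy_example(detector_box, expanded_mask, iou_thresh: float = 0.5) -> bool:
    """An image counts as 'easy' when the detector's top box agrees with the
    box implied by the expanded segmentation mask."""
    return box_iou(detector_box, mask_to_box(expanded_mask)) >= iou_thresh
```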
1.3 Organization of the Dissertation

The rest of the dissertation is organized as follows. In Chapter 2, we review the research background, including convolutional neural networks, object proposals, detection and segmentation. In Chapter 3, we propose a bounding box enhancement algorithm based on the edge map, where we search for the optimal contour in each box and align the box with that contour. In Chapter 4, we address the weakly supervised object detection problem, where no human-labeled bounding boxes are involved in training. In Chapter 5, we solve the unsupervised video object segmentation problem with instance embeddings learned from images and a motion-based bilateral network. Finally, concluding remarks and future research directions are given in Chapter 6.

Chapter 2 Research Background

2.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) are a type of feed-forward neural network commonly used in vision-related tasks. A CNN usually contains convolutional layers, non-linear activation layers, pooling layers and fully connected layers [38].

Convolutional layer. Convolutional layers contain a group of neurons that are connected to a local region (usually square) of the input. The output of each neuron is the inner product of the neuron's weights and the input local region, as illustrated in Fig. 2.1. If the neuron contains a bias term, the final output is the inner product plus the bias. The weights and bias of this neuron group are shared across all regions.

Figure 2.1: An illustration of 2-D convolution. The figure is from [38].

Non-linear activation layer. This layer is essentially a non-linear activation function, which is typically applied to the output of a neuron. The most commonly used non-linear activation function is the Rectified Linear Unit (ReLU), defined as

f(x) = max(0, x). (2.1)

Pooling layer. Pooling layers downsample an output matrix or tensor along some spatial dimensions. The most common downsampling computation takes the maximum or the average of a local region, known as a max pooling layer or an average pooling layer, respectively. Pooling layers reduce the number of parameters as well as the computation, so as to control overfitting.

Fully connected layer. Fully connected (FC) layers are a group of neurons connected to the whole input region. The resulting output can be viewed as a vector that describes the whole input. Unlike the convolutional layers, which preserve spatial information, the output of an FC layer is a global feature with no spatial information.

Applications in image classification. A typical architecture of an image classification network is a linear connection

conv-pool-conv-pool- ... -conv-pool-fc- ... -fc,

where the non-linear activation layers that follow every convolutional layer and FC layer are omitted. The neurons of the last FC layer correspond to the image categories, with an output range of (-inf, +inf). To convert the output to a probability distribution representing the probability of the input image belonging to each category, a softmax operation is applied,

p(c) = exp(f(c)) / sum_i exp(f(i)), (2.2)

where f(c) represents the output from the neuron corresponding to category c in the last FC layer.

Some well-known image classification CNNs include AlexNet [63], VGG [97] and ResNet [43]. AlexNet is composed of five conv-pool modules followed by three FC layers. VGG deepens the network by stacking convolutional layers in the conv-pool module, forming a connection of conv-conv-conv-pool. All convolutional layers in the same module have identical kernel size and neuron numbers. Compared with AlexNet, the kernel size of the convolutional layers in VGG is much smaller, in order to control the number of parameters while deepening the network. ResNet is no longer a linear architecture. For an ultra-deep neural network, the vanishing gradient problem [45] makes a linear architecture extremely hard to train via the backpropagation algorithm [38]. As an ultra-deep CNN, ResNet [43] contains skip connections which allow the gradient to be efficiently backpropagated to shallow layers during training.
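To make the building blocks above concrete, here is a minimal PyTorch sketch of the conv-pool-...-fc classification pipeline, including a ReLU activation (Eq. 2.1), a softmax output (Eq. 2.2), and a ResNet-style skip connection. The layer sizes and the 10-class output are arbitrary illustrative choices, not the architecture of any network used in the dissertation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyClassifier(nn.Module):
    """A toy conv-pool-conv-pool-fc classifier for 3x32x32 inputs."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, padding=1)   # same shape, so a skip is possible
        self.pool = nn.MaxPool2d(2)                                 # downsample by 2, no parameters
        self.fc = nn.Linear(16 * 8 * 8, num_classes)                # last FC layer: one neuron per category

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))         # conv -> ReLU (Eq. 2.1) -> pool
        x = self.pool(F.relu(self.conv2(x)) + x)     # residual-style skip connection eases gradient flow
        logits = self.fc(x.flatten(1))               # unnormalized scores in (-inf, +inf)
        return F.softmax(logits, dim=1)              # Eq. 2.2: probabilities over categories

# Usage: class probabilities for a batch of two random 32x32 RGB images.
probs = TinyClassifier()(torch.randn(2, 3, 32, 32))
print(probs.shape, probs.sum(dim=1))                 # torch.Size([2, 10]); each row sums to ~1
```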
2.2 Object Proposal

Object proposals are a group of regions (represented as bounding boxes or masks) that are likely to be objects. They serve as the input to many region-based detectors to avoid checking every possible location. Object proposal generation methods can be roughly classified into two categories: objectness-based and similarity-based methods. Recently, CNN-involved methods have become increasingly popular in the former category. Researchers have also begun to improve existing methods to obtain higher localization accuracy while maintaining a high recall at a loose localization requirement.

Objectness-based Methods. Typically, objectness-based methods start from densely sampled sliding windows. For each window, different cues are extracted to determine how likely the window is to contain an object. Multiple cues such as saliency and color contrast are considered in [5]. Edge Boxes (EB) [122] generates proposals based on edge maps. BING [21] relies on binarized normed gradients but suffers from severe localization bias. The Canny edge map [14] is later exploited in [119] to improve BING: the initial box is moved iteratively according to the Canny edge map and MTSE [19] is then applied to further increase the localization accuracy.

A recent trend in objectness-based methods is the involvement of CNNs. DeepBox [65] modifies AlexNet [63] and uses it to re-rank proposals from EB. The network is specifically trained with challenging false proposals generated by EB and manages to reduce the number of proposals required for the same recall level. Deep Proposal (DP) [34] extracts deep features from the convolutional layers of AlexNet pre-trained on ImageNet [22] and builds an inverse cascade of the convolutional layers. Starting from conv5, several thousand promising proposals are generated and then refined in lower layers, as lower convolutional layers have higher resolution. CSSB [81] combines deep features and contour features to build proposals from EB and Selective Search [110]. A deep neural network is trained to produce object proposals in [89].

Faster R-CNN [92] offers an end-to-end CNN that detects objects in a given image. There is a relatively simple region proposal network (RPN) inside Faster R-CNN, which shares convolutional layers with a more sophisticated classifier. In the testing phase, features are extracted from the shared convolutional layers, and the 300 most promising proposals generated by the RPN are used as the input to the detection classifier. The RPN successfully reduces the number of proposals from 2000 to 300 while achieving higher mean average precision (mAP). However, most CNN-involved methods suffer from low recall when the localization accuracy requirement is high, i.e., the localization bias problem. This remains an issue even when more accurate localization information from lower convolutional layers is considered, as done in [34].
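The localization bias discussed above is usually quantified by recall as a function of the intersection-over-union (IoU) threshold. The sketch below computes recall at a given IoU threshold for a set of proposals against ground truth boxes; it is a simplified illustration of the metric, not the exact evaluation protocol used in the experiments.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter + 1e-8)

def recall_at_iou(gt_boxes, proposals, thresh=0.8):
    """Fraction of ground truth boxes covered by at least one proposal with IoU >= thresh."""
    hits = [iou(gt, proposals).max() >= thresh for gt in gt_boxes]
    return float(np.mean(hits))

# A method with localization bias keeps recall high at thresh=0.5 but drops
# sharply at thresh=0.8; sweeping thresh traces the recall-versus-IoU curve.
```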
Similarity-based Methods. Similarity-based methods adopt a bottom-up framework. Several seed regions are created and then expanded, either by merging their adjacent regions to produce proposals as in [110] or by generating multiple object-background segmentation masks as in [15], [61]. Multiple cues are used to guide the merging and segmentation. For example, Selective Search (SS) [110] starts from the FH superpixels [33]. The similarity of each adjacent superpixel pair is measured using color, texture and other local cues, and a greedy merge algorithm is adopted to generate proposals (a schematic sketch of this loop is given below). Alternatively, a randomized merging strategy is proposed in [78]. Xiao et al. [115] modify SS by introducing a complexity-adaptive similarity measure to guide the superpixel merge, while [113] uses multiple segmentation branches for merging. Contour Box [77] uses incomplete contours as a non-object cue by tracing the contour from an edge map, and refines the proposal set obtained from SS by removing those non-object proposals. However, Contour Box tends to remove some proposals that have a high overlap with ground truth boxes. Object-background masks are computed in [15], [61], [28], [91], [47], [48], [66], [62] to generate proposals with higher computational complexity. MCG [6] generates proposals by grouping segments as well, but it starts with the fast normalized cut [94] instead of FH superpixels.

Improving Proposals. To address the localization bias of existing methods, especially objectness-based ones [21], [5], [122], MTSE [19] generates the FH superpixels and uses the Superpixel Tightness (ST) metric to indicate the tightness of a bounding box. Boxes containing objects and boxes containing background show different ST index distributions, and the proposals generated by objectness-based methods tend to have a distribution similar to that of background boxes. For improvement, MTSE focuses on straddling superpixels and expands the initial box to cover them if their in-box part is larger than a threshold. Multiple thresholds are used and several sets of expanded boxes are generated. Finally, randomness is added to rank the newly expanded as well as the initial boxes. Due to the added randomness, MTSE is less effective when the proposal budget is small. For example, it is less suitable for improving CNN-involved methods, especially RPN [92].
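The greedy merge loop behind Selective Search can be summarized schematically as follows. This is a simplified sketch under strong assumptions: regions are given as integer-keyed bounding boxes with opaque descriptors, and the `sim` and `merge_feats` callables stand in for the combined color/texture/size cues of the actual algorithm.

```python
def greedy_merge_proposals(boxes, feats, adjacency, sim, merge_feats):
    """Schematic Selective-Search-style greedy merging (not the exact algorithm).

    boxes:       dict of integer region id -> (x1, y1, x2, y2)
    feats:       dict of integer region id -> region descriptor (e.g. color/texture histograms)
    adjacency:   set of frozenset({a, b}) for spatially neighboring regions
    sim:         callable (feat_a, feat_b) -> float, symmetric, higher = more similar
    merge_feats: callable (feat_a, feat_b) -> descriptor of the merged region
    Returns the bounding boxes of every region created during merging.
    """
    boxes, feats, adjacency = dict(boxes), dict(feats), set(adjacency)
    proposals, new_id = list(boxes.values()), max(boxes) + 1
    while adjacency:
        # 1. Pick the most similar adjacent pair and merge it.
        a, b = max(adjacency, key=lambda pair: sim(*(feats[i] for i in pair)))
        ax1, ay1, ax2, ay2 = boxes.pop(a)
        bx1, by1, bx2, by2 = boxes.pop(b)
        boxes[new_id] = (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))
        feats[new_id] = merge_feats(feats.pop(a), feats.pop(b))
        proposals.append(boxes[new_id])            # every merged region becomes a proposal
        # 2. Rewire adjacency: neighbors of a or b now neighbor the merged region.
        rewired = set()
        for pair in adjacency:
            if pair == frozenset((a, b)):
                continue                           # the merged pair itself disappears
            if pair & {a, b}:
                other = next(iter(pair - {a, b}))
                rewired.add(frozenset((other, new_id)))
            else:
                rewired.add(pair)
        adjacency, new_id = rewired, new_id + 1
    return proposals
```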
2.3 Object Detection

The region-based convolutional neural networks. Convolutional neural networks demonstrate their power on the image classification problem, where the network takes in fixed-size images and outputs the probability of each category. To take advantage of the CNN's power on image classification, object detection is treated as a region classification problem. Specifically, an image contains many potential regions (represented by boxes), and the detection problem is essentially to classify whether a region is some object or background.

RCNN [36]. This is one of the earliest detection systems using CNNs. The bottleneck of using a CNN to solve the detection problem by classifying regions is the huge number of possible regions, considering the diversity of location, size and aspect ratio of objects. It is not feasible to apply a computationally heavy classifier to all regions. RCNN solves this problem by using object proposals. Object proposal algorithms quickly eliminate a large number of regions and keep only several thousand regions of interest for further classification. The framework is shown in Fig. 2.2. The system takes in a 2-D image and applies Selective Search [110] to extract about 2k object proposals. Each proposal is warped to a fixed size so that the trained classification CNN can produce a feature vector to "describe" the region. Then a class-specific linear SVM takes in the feature and determines whether this region belongs to some object class.

Figure 2.2: The RCNN framework. First, an object proposal algorithm is applied to extract potential regions. Then each region is warped to a fixed size and sent to the classification CNN for feature extraction. Finally, a linear class-specific SVM is applied to each region feature. The figure is from [36].

During training, the ground truth object (represented by a box) that each proposal overlaps the most is identified. If the overlapping ratio is too small, the proposal is assigned to "background"; otherwise it belongs to the corresponding class of the ground truth object. To improve the localization accuracy, RCNN also introduces a bounding box regression module, which is commonly used in later CNN-based detectors.

Fast RCNN [35]. RCNN [36] suffers from extremely slow training and inference, because computing the CNN feature for thousands of regions takes a long time. In fact, because many regions overlap, the feature computation process is redundant to some degree. The idea of computing a feature map for the whole image and then extracting region features was first proposed in SPP-Net [42]. The challenge in extracting a region feature and then classifying it is the dilemma between the varying region sizes of different objects and the fixed-size input required by the classifier. SPP-Net [42] solves this challenge by applying spatial pyramid pooling. The corresponding region of the feature map is cropped and divided into bins. The division is hierarchical and different levels of division form a "pyramid". Max pooling or average pooling is applied in each bin. Because the division size is fixed, the variable-size cropped feature region is converted to a fixed-length vector.

Figure 2.3: The Fast RCNN framework. It computes a feature map for the whole image and projects object proposals onto the feature map. RoI pooling is applied to extract a feature vector for every proposal. Finally a classifier and a bounding box regressor take in the RoI feature and produce the object class confidence and the adjusted box. The figure is from [35].

The framework of Fast RCNN is shown in Fig. 2.3. Fast RCNN simplifies the SPP process by building a pyramid of only one level, which is named "RoI pooling". The region feature is extracted by RoI pooling and sent to the classifier and the regressor. The proposal receives a class probability and a correspondingly adjusted box. Fast RCNN integrates the classifier, regressor and feature extractor so that the whole network is end-to-end trainable given precomputed object proposals. Most importantly, the feature computation process for regions is shared, which dramatically reduces the training and inference time.
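As a rough illustration of RoI pooling, the snippet below crops each proposal's region out of a feature map and max-pools it to a fixed grid. The 7x7 output size and the spatial-scale handling are assumptions made for illustration; production code would normally rely on an optimized operator (e.g., the one shipped with torchvision) rather than this naive loop.

```python
import torch
import torch.nn.functional as F

def roi_pool_naive(feature_map, rois, output_size=7, spatial_scale=1.0 / 16):
    """Naive RoI pooling sketch.

    feature_map:   tensor of shape (C, H, W) for a single image
    rois:          list of (x1, y1, x2, y2) boxes in image coordinates
    spatial_scale: maps image coordinates onto the downsampled feature map
    Returns a tensor of shape (num_rois, C, output_size, output_size).
    """
    _, H, W = feature_map.shape
    pooled = []
    for x1, y1, x2, y2 in rois:
        # Project the box onto the feature map and clamp it to the map bounds.
        fx1 = max(0, int(x1 * spatial_scale)); fy1 = max(0, int(y1 * spatial_scale))
        fx2 = min(W, int(x2 * spatial_scale) + 1); fy2 = min(H, int(y2 * spatial_scale) + 1)
        region = feature_map[:, fy1:fy2, fx1:fx2]
        # One-level "pyramid": max-pool the crop to a fixed grid.
        pooled.append(F.adaptive_max_pool2d(region, output_size))
    return torch.stack(pooled)

# Usage: two proposals on a 512-channel conv feature map of a 512x512 image.
feats = torch.randn(512, 32, 32)
out = roi_pool_naive(feats, [(48, 64, 200, 220), (10, 10, 500, 400)])
print(out.shape)  # torch.Size([2, 512, 7, 7])
```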
The small network produces 16 a binary class scores for k anchor boxes centered at a cell of the feature map, as well as adjusts the anchor boxes to better localize the objects. RPN is much faster than tradi- tional object proposal algorithm implemented on CPU, such as Selective Search [110]. Figure 2.4: The Region Proposal Network (RPN). The figure is from [92]. It is also shown that the object proposals provided by RPN reaches higher recall than Selective Search [110] given smaller object proposal budget. Thus, the classifier no longer needs to look at 2k regions. Instead, 300 proposals are sufficient. By reducing the number of potential objects, the classification process also takes less time and the inference is further accelerated. Since Faster RCNN does not rely on precomputed proposals for specific datasets, it is easily applied to detection datasets in any domain, as long as the labeled bounding boxes are available. Weakly supervised object detection. Weakly supervised object detection refers to training detectors with only image-level labels, with no human-labeled bounding boxes involved. The weakly supervised object detection problem is oftentimes treated as a multiple instance learning (MIL) task [10, 100, 101, 102, 95, 37, 11, 98, 99]. Each image is considered as a bag of instances. An image is labeled as positive for one category if it contains at least one instance from that category, and an image is labeled as negative if no instance from that category appears in it. As a non-convex optimization problem, MIL is sensitive to model initialization and many prior arts focus on good initialization 17 strategies [101, 23]. Recently, people start to exploit the powerful CNN to solve the weakly supervised object detection problem. Oquabetal. [84] convert the Alexnet [63] to a fully convolutional network (FCN) to obtain an object score map, but it gives only a rough location of objects. WSDNN [12] constructs a two-branch detection network with a classification branch and a localization branch in parallel. ContextLoc [54] is built on WSDNN and adds an additional context branch. DA [68] proposes a domain adaptation approach to identify proposals corresponding to objects and uses them as pseudo loca- tion labels for detector training. Size Estimate (SE) [95] trains a separate size estimator with additional object size annotation and the detector is trained by feeding images with a decreasing estimated size. Singh etal. [64] take the advantage of videos and transfer the object appearance to images for object localization. Most of the aforementioned approaches suffer from trapping in discriminative regions as no specific supervision is provided to learn the extent of full objects. 2.4 Image Segmentation Fully Convolutional Neural Networks. Image semantic segmentation can be cast to a pixel-level classification problem. To solve this problem with deep convolutional neural networks, people convert the fully connected layers to convolutional layers (“convolu- tionalize”) [76]. This conversion can be viewed as building a local classifier and apply- ing it to every pixel in a sliding window manner. The weights of the classifier are shared as every layer is convolutional. A crucial challenge to solve the segmentation problem by CNNs is keeping an rea- sonable output resolution (pooling layers reduce the resolution, as mentioned above). Multiple approaches to upsampling the feature map are proposed. In FCN [76], the authors use the deconvolution operation to upsample step by step. 
2.4 Image Segmentation

Fully Convolutional Neural Networks. Image semantic segmentation can be cast as a pixel-level classification problem. To solve this problem with deep convolutional neural networks, the fully connected layers are converted to convolutional layers ("convolutionalized") [76]. This conversion can be viewed as building a local classifier and applying it to every pixel in a sliding window manner. The weights of the classifier are shared since every layer is convolutional.

A crucial challenge in solving the segmentation problem with CNNs is keeping a reasonable output resolution (pooling layers reduce the resolution, as mentioned above). Multiple approaches to upsampling the feature map have been proposed. In FCN [76], the authors use the deconvolution operation to upsample step by step. The architecture is shown in Fig. 2.5. The single stream (without upsampling, FCN-32s) produces an output at 1/32 of the original resolution. Thus, it cannot capture the detailed boundary of objects. In the second row of Fig. 2.5 (FCN-16s), the output from conv7 is 2x upsampled and then combined with the feature map from pool4 to produce an output of finer resolution. The structure that produces an even higher resolution output, FCN-8s, is illustrated in the third row of Fig. 2.5. It combines the 4x upsampled conv7 output, the 2x upsampled pool4 output and the pool3 output. The final resolution is 1/8 of the original one.

Figure 2.5: The architecture of FCN [76], which combines high-level, low-resolution layers with low-level, high-resolution layers (figure taken from [76]).

In another well-known architecture for semantic segmentation, DeepLab [17], the pool4 and pool5 layers are removed to avoid the loss of resolution. However, simply removing them reduces the size of the receptive field, which harms the performance. To keep the resolution while still being able to grow the receptive field, DeepLab [17] uses convolution with the "atrous algorithm", illustrated in Fig. 2.6. With input stride 2 and output stride 1, a 3x3 conv filter sees a region of 5x5 (normally 3x3 without "atrous"); a minimal sketch of this dilation effect is given at the end of this section. Convolution with the "atrous algorithm" allows DeepLab to keep the same resolution as FCN-8s without deconvolution layers for upsampling. However, the output would still miss some details. To further refine the segmentation output, a fully connected CRF [59] is constructed on the original image. The unary potential is based on the confidence of each pixel from DeepLab, while the pairwise potential comes from the similarity of pixel pairs in the bilateral position and color space.

Figure 2.6: Convolution with the "atrous algorithm". The figure is from [17].

Weakly supervised image semantic segmentation. Similar to the bounding box labels for detector training, the pixel-wise labels for semantic segmentation are laborious to obtain. Consequently, many researchers focus on weakly supervised image semantic segmentation [8, 72, 58, 85], where only image-level labels or scribbles on objects are available for training. PointSup [8] trains a segmentation network with only one point annotation per category (or object) and uses an objectness prior to improve the results. Similarly, ScribbleSup [72] trains the segmentation network with scribbles by alternating between the GraphCut [93] algorithm and FCN training [76]. SEC [58] generates coarse location cues from image classification networks as "pseudo scribbles" and designs a loss that encourages segmentation mask expansion while constraining the training process with smoothness.

Image Instance Segmentation. The target of image instance segmentation is to obtain accurate masks of different object instances. The output of such segmentation models should be an instance label map. A common solution is to detect objects first to obtain bounding boxes of individual instances and then conduct binary segmentation in each box to obtain the mask of each object instance [41]. Recently, approaches based on instance embeddings have been proposed in [31, 80], where pixel-wise instance embeddings are trained to encode the probability of two pixels belonging to the same instance. This group of methods does not rely on object detection. The proposed method for video object segmentation is developed based on the instance embeddings in [31]. Relevant details are given in Sec. 5.2.1.
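The receptive-field effect of the atrous (dilated) convolution mentioned above can be checked with a few lines of PyTorch: a 3x3 kernel with dilation 2 responds to a 5x5 neighborhood while the output resolution is preserved. The all-ones kernel and impulse input below are purely illustrative choices for this check.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation 2: the same nine weights,
# but the taps are spread over a 5x5 neighborhood.
atrous = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)
nn.init.constant_(atrous.weight, 1.0)          # all-ones kernel, just for the check

impulse = torch.zeros(1, 1, 9, 9)
impulse[0, 0, 4, 4] = 1.0                      # single active input pixel at the center

with torch.no_grad():
    response = atrous(impulse)[0, 0]

ys, xs = torch.nonzero(response, as_tuple=True)
print(response.shape)                          # (9, 9): padding=2 keeps the resolution
print(ys.min().item(), ys.max().item())        # 2 and 6: the footprint spans 5 rows
print(xs.min().item(), xs.max().item())        # 2 and 6: ... and 5 columns
```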
Relevant details are given in Sec. 5.2.1.

2.5 Video Object Segmentation

Unsupervised video object segmentation. Unsupervised video object segmentation discovers the most salient, or primary, objects that move against a video's background or display different color statistics. One set of methods for this task builds hierarchical clusters of pixels that may delineate objects [40]. Another set of methods performs binary segmentation of foreground and background. Early foreground segmentation methods often used Gaussian Mixture Models and Graph Cut [79, 108], but more recent work uses convolutional neural networks (CNN) to identify foreground pixels based on saliency, edges, and/or motion [105, 106, 50]. For example, LMP [105] trains a network that takes optical flow as an input to separate moving and non-moving regions and then combines the results with objectness cues from SharpMask [90] to generate the moving object segmentation. LVO [106] trains a two-stream network, using RGB appearance features and optical flow motion features that feed into a ConvGRU [116] layer to generate the final prediction. FSEG [50] also proposes a two-stream network trained with mined supplemental data. SfMNet [111] uses differentiable rendering to learn object masks and motion models without mask annotations. Despite the risk of focusing on the wrong object, unsupervised methods can be deployed in more places because they do not require user interaction to specify an object to segment. Since we are interested in methods requiring no user interaction, we choose to focus on unsupervised segmentation.

Semi-supervised video object segmentation. Semi-supervised video object segmentation utilizes human annotations on the first frame of a video (or more) indicating which object the system should track. Importantly, the annotation provides a very good appearance model initialization that unsupervised methods lack. The problem can be formulated as either a binary segmentation task conditioned on the annotated frame or a mask propagation task between frames. Non-CNN methods typically rely on Graph Cut [79, 108], but CNN-based methods offer better accuracy [56, 13, 112, 20, 53, 117]. Mask propagation CNNs take in the previous mask prediction and a new frame to propose a segmentation mask in the new frame. VPN [51] trains a bilateral network to propagate to new frames. MSK [56] trains a propagation network with synthetic transformations of still images and applies the same technique for online fine-tuning. SegFlow [20] finds that jointly learning moving object masks and optical flow helps to boost the segmentation performance. Binary segmentation CNNs typically utilize the first frame for fine-tuning the network to a specific sequence. The exact method for fine-tuning varies: OSVOS [13] simply fine-tunes on the first frame. OnAVOS [112] fine-tunes on the first frame and a subset of predictions from future frames. Fine-tuning can take seconds to minutes, and longer fine-tuning typically results in better segmentation. Avoiding the time cost of fine-tuning is a further inducement to focus on unsupervised methods.

Chapter 3

Box Refinement: Object Proposal Enhancement and Pruning

3.1 Introduction

In the object detection problem, classifiers are applied to certain regions extracted from the input image. A sliding window scheme used to be the most common region search strategy, where the detector has to handle as many as several million windows.
This search strategy prevents the use of complicated detectors due to the computational load. To speed up the sliding window scheme, a small set of candidates, called object proposals, is generated with relatively simple classifiers. This problem has received much attention recently since it has become an important preprocessing step in the application of powerful detectors such as the convolutional neural network (CNN) [36], [35]. It has been shown in [92] that object proposal generation helps achieve better detection performance than the sliding window search scheme.

Multiple object proposal methods have been developed. However, most objectness-based methods, such as [21], [5], [122], suffer from localization bias. That is, the recall drops rapidly if the intersection over union (IoU) ratio is larger than a certain threshold, as discussed in [19]. Some recently developed CNN-involved proposal generation methods also have a similar problem, especially the region proposal network (RPN) in Faster R-CNN [92], although the number of proposals needed for detection is substantially reduced.

Figure 3.1: Illustrative examples for the proposed box refinement (BR) method, where the original image, the edge map, the optimal contour (in yellow) and the final refined proposal are shown in the first, second, third and fourth rows, respectively. In the bottom right figure, the box in magenta is the initial proposed box obtained by [92] while the box in blue is the refined box. Edge maps are enhanced and the optimal contour is widened for better visualization (the enhancement for visualization is applied to other figures as well).

Motivated by the observed localization bias problem, we focus on the refinement of proposals obtained by existing methods and call our solution the "Box Refinement" (BR) method. The BR method addresses the localization issue via contour alignment and proposal pruning. Some illustrative examples of the box refinement process are shown in Figure 3.1. We first construct a cost map based on the edge response of the whole image. For a proposed box, we find the optimal contour by searching for the shortest path in the cost map, and assign a score to the optimal contour according to its average edge response. The box is replaced with the bounding box of the optimal contour if its score is above a threshold; otherwise, it is removed from the candidate set. The same process is repeated for all proposed boxes. Unlike the idea in [19], where multiple proposals are generated from one initial proposed box, we directly process each original box (either keep and refine it, or simply discard it). As a result, the number of remaining proposals is always equal to or smaller than that of the original proposal set. Clearly, the BR method can be used as a post-processing step for any objectness-based method, including CNN-involved methods [92], [34], and non-CNN ones [21], [5], [122].

We evaluate the proposed BR method on the PASCAL VOC2007 dataset [29]. Experimental results show that the BR method can significantly improve existing CNN-involved methods. Given a budget of 300 proposals, we achieve a gain of 8.7% on average recall (AR) over RPN [92] and a 3.6% gain on top of Deep Proposal (DP) [34]. We also apply our method to non-CNN objectness-based methods and obtain AR gains of 2.9%, 16.3% and 14.7% on top of Edge Boxes [122], BING [21] and Objectness [5] with 1000 proposals, respectively.
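Since the evaluation throughout this chapter is built on the intersection over union (IoU) between proposals and ground-truth boxes, a minimal sketch of the IoU and recall-at-threshold computation is given below; the [x1, y1, x2, y2] box format and the function names are assumptions, not the exact evaluation code:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format (a common convention;
    the exact format used in the evaluation code is an assumption)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def recall_at_iou(gt_boxes, proposals, threshold=0.7):
    """Fraction of ground-truth boxes whose best-matching proposal
    reaches the given IoU threshold (e.g., the 0.7-recall used below)."""
    hits = 0
    for gt in gt_boxes:
        best = max((iou(gt, p) for p in proposals), default=0.0)
        hits += best >= threshold
    return hits / max(len(gt_boxes), 1)
```

Averaging such recall values over a range of IoU thresholds gives the average recall (AR) reported in the experiments.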
3.2 Methodology

Objects have a closed contour, and we try to trace that contour in Cartesian coordinates and then use it to refine proposed bounding boxes. Given an initial proposed box, the contour to be traced should have the following three properties. (a) Tightness: the contour should not deviate too much from the original proposed box, because the box is generated by an existing object proposal method, which already provides coarse information about the object location. (b) Edgeness: the contour should have a reasonably high average edge response. (c) Completeness: the contour should be closed.

We construct a graph for the image, representing each pixel by a vertex and connecting adjacent pixels with edges. The contour-tracing problem is then converted to path searching in the graph. For tightness, we select a set of fixed pixels, called anchors, which are close to the boundary of the initial bounding box, and we force the path to cover those anchors. As a result, the path does not deviate too much from the initial box. For edgeness, an edge cost is assigned to each pixel, which has a negative correlation with its edge response. Then the path with a reasonably high edge response that connects a pair of anchors can be found by minimizing the edge cost. Completeness is achieved by connecting anchors in clockwise order with those low-cost paths. The overall pipeline is illustrated in Figure 3.2. Given an initial bounding box, we first find the contour segments with the minimum cost to connect pre-selected anchors in clockwise order and then combine the segments by post-processing to obtain the optimal closed contour. Finally, the box is refined accordingly.

Figure 3.2: The Box Refinement (BR) pipeline. Given an image, we obtain its edge map with the Structured Edge detector [26] and then sticky SLIC superpixels [2]. Next, a cost map is computed and we construct a directed graph whose edge weights are obtained from the cost map. Keypoints (marked by green dots) are placed based on the configuration of sticky superpixels. Anchors (marked by green stars) are selected for the tightness constraint. The Cartesian coordinates are established at the centroid of each initial bounding box. We search for the optimal contour segment in all four quadrants using the shortest path algorithm, and obtain the closed optimal contour by post-processing. Finally, we refine the proposal based on the optimal contour location.

It is worthwhile to point out that, although an initial bounding box is given, the optimal contour search is not limited to the interior of the box, unlike in [77]. In this way we avoid a good candidate being wrongly "refined", because a box that has a high overlap with the object may still not fully enclose its contour. Meanwhile, our cost map can be shared across all boxes, so that we do not have to construct a graph for each bounding box as in [77].

3.2.1 Cost Map

We use the Structured Edge detector [26] to obtain an edge map. As mentioned above, the cost needs to have a negative correlation with the edge response. In our model, we define the cost of a pixel located at (x, y) as

\alpha(x, y) = \begin{cases} 1 - e(x, y), & e(x, y) \ge T, \\ p, & e(x, y) < T, \end{cases}   (3.1)

where e(x, y) is an edge response value lying in [0, 1] and p is a constant penalty for pixels with an edge response smaller than the threshold T.
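A minimal NumPy sketch of this cost map is given below, assuming the edge map is a float array in [0, 1] (e.g., from the Structured Edge detector); the parameter values match those reported later in Sec. 3.3, and the function name is an illustrative assumption:

```python
import numpy as np

def cost_map(edge_map, penalty=10.0, threshold=0.01):
    """Per-pixel cost of Eq. (3.1): low cost on strong edges, and a constant
    penalty wherever the edge response falls below the threshold T."""
    cost = 1.0 - edge_map                     # 1 - e(x, y) for strong edges
    cost[edge_map < threshold] = penalty      # constant penalty p elsewhere
    return cost
```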
We penalize pixels with small edge response so that the optimal contour search prefers a detour of high edge response over a shortcut of low edge response, which better handles non-convex objects. On the other hand, the penalty should not be too expensive, due to imperfections in the edge map. Figure 3.3 compares two searched optimal contours with and without the penalty term; the difference in recall is also shown in Figure 3.3.

With the cost map, a directed graph G is constructed, where each pixel is represented by a vertex. Edges are built between adjacent pixels using the 8-connection. The weight of an edge is defined as the cost of its destination vertex.

Figure 3.3: Two searched optimal contours with (left figure) and without (middle figure) penalty. The recall-versus-IoU performance with (in cyan) and without (in blue) penalty on 500 randomly selected images from the test set is shown in the right figure. Experiments are done with 300 proposals from [92].

3.2.2 Anchor Selection

Anchors are selected from a set of keypoints, which are aligned with the edge map and independent of the initial bounding box. To obtain the keypoints, the edge map and superpixels are combined. That is, we first obtain sticky SLIC superpixels [2] (provided by the Edge Boxes toolbox [122]) that are well aligned with the edge map. Then, each pair of adjacent superpixels defines an edge segment, called an edgelet, and keypoints are the pixels where different edgelets intersect. As seen in Figure 3.2, the keypoints have a roughly uniform distribution over the whole image because of the way SLIC superpixels are generated.

To choose anchors from the keypoint set given an initial box b, we have two concerns: (a) anchors should impose a tightness constraint based on the position of b, but the constraint should not be too strict, due to the diversity of object shapes; (b) anchors should be close to the true contour. For (a), we choose only four anchors near the boundary of b, imposing constraints w.r.t. the top, right, bottom and left sides of b while allowing maximum flexibility in contour searching. For (b), anchors are aligned with the edge map since they are chosen from the keypoints. To determine the detailed positions, we analyze the shortest Euclidean distance from the object contour to the three quartiles together with the two terminals of each side of the ground truth bounding box. The experiment is conducted on 6,934 annotated objects from the PASCAL VOC2012 segmentation trainval dataset [29]. Among the five points, the midpoint of each side gives the smallest average distance. In other words, the midpoints of the bounding box sides are the closest to the true object contour on average. Based on this analysis, we set anchors to be the four keypoints with the shortest Euclidean distance to the side midpoints of b. In case errors occur due to anchors that are not sufficiently close to the true contour, especially when b is poorly positioned, a post-processing step is introduced. Keypoints and anchors make it possible to share computation across boxes. Details are discussed at the end of Section 3.2.3.

3.2.3 Contour Search

Optimal Contour Segment and Complete Contour. A contour segment s is essentially a set of connected pixels.
Given a proposed bounding box b and the anchor set A^b = {A^b_1, A^b_2, ..., A^b_N}, with A^b_{N+1} = A^b_1, the closed contour is decomposed into a sequence of contour segments s^b_{i,i+1}, i = 1, ..., N, that connect anchors A^b_i and A^b_{i+1}. For each pair of consecutive anchors, we choose the optimal segment, denoted by \hat{s}^b_{i,i+1}, from all possible segments terminating at this anchor pair. \hat{s}^b_{i,i+1} is defined such that the sum of its cost is minimized,

\hat{s}^b_{i,i+1} = \arg\min_{s \in S^b_{i,i+1}} \sum_{(x,y) \in s} \alpha(x, y),   (3.2)

where S^b_{i,i+1} denotes the set of contour segments that start at A^b_i and end at A^b_{i+1}. Then the complete optimal contour for the initial box b is found by concatenating all contour segments except the overlapping parts. Mathematically, we have

c^b = \bigcup_{i=1}^{N} \hat{s}^b_{i,i+1} \setminus \bigcup_{\substack{i,j=1 \\ j \neq i}}^{N} O^b_{i,j},   (3.3)

where

O^b_{i,j} = \hat{s}^b_{i,i+1} \cap \hat{s}^b_{j,j+1}, \quad j \neq i,   (3.4)

is the overlapping part of \hat{s}^b_{i,i+1} and \hat{s}^b_{j,j+1}. One reason for contour segments to overlap is that one side of the initial bounding box is not tight against the target object, as shown in Figure 3.4. The overlap is ignored since it is an indication of poor localization. In this way, the error caused by improperly selected anchors is eliminated.

Dijkstra's algorithm is applied to find the optimal contour segments. Because we do not limit the search to the interior of the initial box, the final optimal contour is not necessarily fully enclosed in the given box. Therefore, we may shrink or expand the initial box, which gives the box refinement process more flexibility. This point is clearly demonstrated in the example given in Figure 3.4.

Contour Scoring, Pruning and Adjustment. After combining the contour segments from all four quadrants using Eq. (3.3), we compute the average edge response as the score assigned to the complete optimal contour c^b,

\phi(c^b) = \frac{\sum_{(x,y) \in c^b} e(x, y)}{|c^b|},   (3.5)

where |c^b| is the length of c^b. Since a low average edge response is an indicator of non-objects (the edgeness property), b is removed from the proposal set if \phi(c^b) < \phi_{\min}, where \phi_{\min} is a pre-selected threshold. Otherwise, we replace the initial bounding box b with b^*, the bounding box of the optimal contour c^b in Eq. (3.3).

Figure 3.4: Illustration of the optimal contour search process, where the initial bounding box and the four optimal contour segments are shown in the left figure. The optimal contour segment in quadrant I overlaps with those in quadrants II and IV. These overlapping parts are removed from the final optimal contour shown in the right figure, since they indicate poor localization. Furthermore, we allow the optimal contour segment to go beyond the initial bounding box region, such as that in quadrant III.

Truncated Objects. Objects lying on image borders do not have a complete visible contour (objects occluded by other objects still have a closed contour for their visible region), so it is no longer appropriate to trace a closed contour. To handle those objects, we have an additional strategy for the optimal contour search. Without loss of generality, we consider a bounding box that touches the left border of the image, as shown on the top left of Figure 3.5. When searching for the optimal contour segments in quadrant II and quadrant III, the starting pixel can be any one lying on the left image border, rather than the original left anchor.
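Before describing how this border-start strategy is realized in the graph, the basic pixel graph of Sec. 3.2.1 and the per-anchor-pair shortest-path search of Eq. (3.2) can be sketched with SciPy's Dijkstra routine; the function names are illustrative assumptions, and the dummy border vertices introduced next are omitted for brevity:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def pixel_graph(cost):
    """Directed 8-connected pixel graph; the weight of an edge is the cost
    of its destination pixel, as in Sec. 3.2.1."""
    h, w = cost.shape
    idx = np.arange(h * w).reshape(h, w)
    rows, cols, data = [], [], []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            src = idx[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
            dst = idx[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
            rows.append(src.ravel())
            cols.append(dst.ravel())
            # keep weights strictly positive so zero-cost edges are not
            # dropped by the sparse format
            data.append(np.maximum(cost.ravel()[dst.ravel()], 1e-6))
    return csr_matrix((np.concatenate(data),
                       (np.concatenate(rows), np.concatenate(cols))),
                      shape=(h * w, h * w))

def optimal_segment(graph, shape, anchor_a, anchor_b):
    """Minimum-cost path of Eq. (3.2) between two anchors given as (y, x)."""
    h, w = shape
    start = anchor_a[0] * w + anchor_a[1]
    goal = anchor_b[0] * w + anchor_b[1]
    _, pred = dijkstra(graph, indices=start, return_predecessors=True)
    path, node = [], goal
    while node != start and node >= 0:   # -9999 marks unreachable vertices
        path.append((node // w, node % w))
        node = pred[node]
    path.append((start // w, start % w))
    return path[::-1]
```

In this sketch the graph is built once per image and reused for every box, which mirrors the shared-cost-map design discussed in Sec. 3.2.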
In terms of searching in graph G, this strategy is equivalent to adding a dummy vertex V_left to G, which has edges of identical weight pointing to all vertices that represent the left border pixels, as demonstrated on the top right of Figure 3.5, and the shortest path search starts at V_left for quadrants II and III. Similarly, V_top, V_right and V_bottom are added to G to handle bounding boxes touching the other three borders.

Figure 3.5: Top left: a bounding box touching the left border (in magenta) is given. Top right: V_left is added to G. The black dots represent the vertices corresponding to the left border pixels. Other details of G irrelevant to V_left are omitted here. All edges shown in green have a constant weight. Bottom left: the contour segments found by assuming that the object has a complete visible contour. This set of contour segments gives a score of 0.379. Bottom middle: assuming that the object is cropped, the optimal contour segments in quadrants II and III can start at any pixel on the left border, equivalent to starting at V_left in G. A score of 0.417 is obtained. Bottom right: the final optimal contour is obtained by choosing the one with the higher score.

However, when the initial bounding box b touches image borders, we do not know whether the object is indeed cropped by the borders. Thus, we search for two sets of optimal contour segments. The first set assumes that the object has a complete visible contour (bottom left of Figure 3.5), while the second one assumes that the object is cropped (bottom middle of Figure 3.5). Then we have two optimal contours under different assumptions, and the one with the higher score, as defined in Eq. (3.5), is chosen. According to the final chosen contour and its score, we refine (or remove) the bounding box.

Figure 3.6: Comparison of the recall-versus-IoU curves of six proposal methods with and without box refinement (BR), where results for 100, 300 and 1000 proposals are shown in the first, second and third columns. Results of the six methods are split into two groups for clarity (three in the top row and the other three in the bottom row). Results without and with BR are represented by dashed and solid lines, respectively.

3.3 Experiments

We apply the box refinement method to proposals generated by several state-of-the-art objectness-based methods provided by [46]. They include Edge Boxes (EB) [122], Objectness (OBJ) [5] and BING [21], as well as the recently proposed CNN-involved methods RPN [92] and Deep Proposal (DP) [34]. All experiments are conducted on the PASCAL VOC2007 dataset [29], which is the most common benchmark for object proposal methods.
It contains 9,963 images and has bounding box annotations for 20 object categories. The performance is reported on the test set, which contains 4,952 images and 14,976 annotated object instances. Parameters are validated on the training and validation sets. Specifically, we set p = 10, T = 0.01 and \phi_{\min} = 0.1.

3.3.1 Best Candidate Deviation

We adopt the metric proposed in [119] to evaluate the alignment accuracy of proposals. By comparing each side of the ground truth bounding box g = (g_l, g_t, g_r, g_b) with its best matched proposed box d = (d_l, d_t, d_r, d_b) (i.e., the one that has the highest IoU with the ground truth), the horizontal and vertical deviations are defined as

\Delta_i = \frac{d_i - g_i}{g_w}, \quad i \in \{l, r\},   (3.6)

\Delta_j = \frac{d_j - g_j}{g_h}, \quad j \in \{t, b\},   (3.7)

where g_w = g_r - g_l and g_h = g_b - g_t are the width and height of the ground truth box, respectively. Considering the wide range of object sizes and aspect ratios, these deviations are normalized by the object's width or height as shown above. We calculate the deviation only for matched pairs whose IoU value is larger than 0.5 and show the result in Fig. 3.7. Clearly, with our box refinement, the best candidates deviate less from the ground truth boxes.

Figure 3.7: The deviation of the four sides from the ground truth bounding boxes, with 1000 proposals. The original methods are represented by dashed lines and their improved versions are in solid lines.

3.3.2 Improvement in Recall-versus-IoU Performance

To show the effectiveness of our method in improving the recall-versus-IoU performance, we evaluate with three proposal budgets: 100, 300 and 1000 proposals (a set of 300 proposals is fed into the detector in [92]). We apply the proposed BR method to six objectness-based methods: RPN [92], DP [34], Edge Boxes [122] (including EB and EB50, fine-tuned at IoU = 0.7 and IoU = 0.5, respectively), BING [21] and Objectness (OBJ) [5]. We plot the recall-versus-IoU curves in Figure 3.6.

For RPN, which is the state-of-the-art CNN-involved proposal generator, the average recall (AR) is improved by a large margin, from 45.9% to 54.6%, with a budget of 300 proposals. By comparing the two orange curves in the first row of Figure 3.6, we see that the recall of RPN is improved when the IoU is greater than 0.7. Specifically, the 0.7-recall (namely, the recall at an IoU threshold of 0.7), 0.8-recall and 0.9-recall are increased from 68.1%, 29.6% and 3.1% to 72.2%, 50.3% and 21.2%, respectively. Meanwhile, the high recall in the lower IoU range is maintained. Some qualitative results obtained by improving RPN are shown in Figure 3.8.

For EB, the state-of-the-art non-CNN objectness-based method, the AR is improved from 50.2% to 53.1% with a budget of 1000 proposals. The 0.8-recall jumps from 44.5% to 50.4% after box refinement.
There is a small drop in 0.7-recall, since EB is fine-tuned at IoU = 0.7. For the other objectness-based methods, which are fine-tuned at IoU = 0.5, the improvement is more obvious since they suffer from severe localization bias. For OBJ, the AR is improved from 30.9% to 43.6%, and similar gains are observed for BING and EB50. Furthermore, by comparing different columns of Figure 3.6 and the results in Table 3.1, we see that our box refinement method achieves consistent performance improvement across different proposal budgets.

3.3.3 Overall Performance Benchmarking

We show the effectiveness of our box refinement scheme by comparing ten proposal methods and five BR-improved methods in Table 3.1. The ten methods include the five presented earlier as well as MCG [6] and SS [110] (which are similarity-based methods), and M-EB, M-DP and M-RPN (the versions of EB, DP and RPN improved by MTSE [19], which also focuses on improving proposal localization accuracy).

BR-RPN achieves the best performance in AR and 0.8-recall. For 0.5-recall, BR-RPN is very close to the best performance, achieved by RPN. Compared with the state-of-the-art similarity-based methods, SS and MCG, with a budget of 1000 proposals, BR-RPN achieves higher AR and 0.8-recall. Specifically, its 0.8-recall is about 6% higher than MCG and 8% higher than SS. In contrast, the original RPN is lower than those two methods by more than 10%. Furthermore, we select 13 out of the 15 methods in Table 3.1 and show their recall-versus-IoU curves in Figure 3.11 for performance comparison.

Figure 3.8: Visual comparison of initial RPN proposals (in magenta) and the improved proposals obtained with the proposed box refinement process (in blue). Ground truth bounding boxes are in green. Irrelevant ground truth boxes are omitted here.

Figure 3.9 (panels: (a) RPN, (b) BR-RPN, (c) EB, (d) BR-EB, (e) DP, (f) BR-DP): Captured and missed ground truth boxes with the IoU threshold equal to 0.8 are shown in green and red, respectively. Methods without and with our box refinement method are in yellow and blue, respectively.

                  #prop = 100                #prop = 300                #prop = 1000
Method        AR    0.5-rec 0.8-rec     AR    0.5-rec 0.8-rec     AR    0.5-rec 0.8-rec
BING [21]     .195  .584    .068        .236  .730    .074        .273  .860    .079
OBJ [5]       .239  .604    .090        .281  .718    .098        .309  .810    .102
SS [110]      .290  .560    .237        .399  .707    .348        .519  .835    .485
MCG [6]       .377  .688    .326        .456  .781    .412        .535  .861    .505
EB [122]      .322  .580    .285        .414  .722    .369        .502  .856    .445
M-EB [19]     .303  .613    .243        .408  .760    .358        .515  .877    .493
DP [34]       .330  .607    .282        .406  .707    .363        .483  .816    .444
M-DP [19]     .339  .649    .290        .435  .758    .410        .519  .844    .528
RPN [92]      .404  .829    .251        .459  .906    .296        .489  .940    .321
M-RPN [19]    .400  .791    .306        .499  .893    .430        .566  .939    .534
BR-RPN        .470  .820    .414        .546  .902    .503        .592  .936    .561
BR-DP         .357  .609    .326        .442  .713    .424        .535  .821    .523
BR-BING       .292  .606    .212        .366  .757    .285        .436  .875    .313
BR-OBJ        .324  .612    .254        .395  .729    .315        .456  .831    .374
BR-EB         .331  .576    .298        .432  .719    .401        .531  .851    .504

Table 3.1: Comparison of the average recall (AR), 0.5-recall and 0.8-recall of 15 methods with 100, 300 and 1000 proposals. Results improved by our box refinement method carry the "BR-" prefix. The best results are highlighted in bold.

3.3.4 Impact of Proposal Budgets

It is interesting to study the relationship between the AR and the proposal budget. We show the results of the 13 methods examined earlier in Figure 3.12. From this figure we see that our box refinement scheme improves existing methods consistently across all proposal budgets.
In contrast, MTSE is less consistent, as shown by the dotted lines in Figure 3.12. For example, the AR of M-RPN is 0.4% lower than that of RPN with 100 proposals, while our method, BR-RPN, achieves more than 7% improvement in AR with 100, 300 and 1000 proposals, as reported in Table 3.1.

3.3.5 Combination with Detectors

We test our improved RPN proposals (BR-RPN) on the Faster R-CNN [92] detector and the results are shown in Table 3.2. The proposal budget is kept the same, i.e., 300 proposals. The detection experiment is conducted directly on the original Faster R-CNN model and we do not fine-tune it (especially the box regression process) with our improved proposals. With minimum IoU = 0.5, we achieve a mean average precision (mAP) of 69.6% without the box regression process, a gain of 1.3%. With the box regression process, the mAPs are 71.9% and 71.6% for BR-RPN and RPN [92], respectively. We also conduct a stricter evaluation with minimum IoU = 0.7, as we focus on improving the localization accuracy. We choose IoU = 0.7 for this stricter evaluation because proposals with IoU >= 0.7 are considered object samples during training. In this experiment, the gain in mAP is 11.8% without box regression and 2.4% with box regression.

Figure 3.10 (panels: (a) the original image, (b) the initial box and the edge map, (c) the optimal contour segments, (d) the optimal complete contour, (e) the refined box): Illustration of contour searching. Edge maps are enhanced and contours are widened for better visualization.

Figure 3.11: Comparison of the recall-versus-IoU performance for 13 representative methods, where the methods with our proposed box refinement process and with MTSE [19] are shown in solid lines and dotted lines, respectively. The remaining seven methods without any improvement are shown in dashed lines.

Figure 3.12: Comparison of the average recall (AR), 0.8-recall and average best overlapping (ABO) as a function of the proposal budget for 13 representative methods, where the methods improved by our box refinement process are shown in solid lines, the methods improved by MTSE [19] are shown in dotted lines and the methods without any refinement are shown in dashed lines.

              w/o box regression       w/ box regression
Method        IoU-0.5   IoU-0.7        IoU-0.5   IoU-0.7
RPN [92]      .683      .274           .716      .462
BR-RPN        .696      .392           .719      .486

Table 3.2: Comparison of the detection mean average precision (mAP) of RPN [92] and BR-RPN.
3.3.6 Impact of Original Proposal Quality

For each annotated object box in the VOC2007 dataset, we select the best proposal (the one with the highest IoU) obtained by BR-RPN and the corresponding original RPN [92] proposal to form a pair. Thus, we obtain 14,976 pairs of initial IoU versus achieved IoU. These pairs are partitioned into 10 bins based on the original IoU (pairs with initial IoU in [0.0, 0.1) go to Bin 1, [0.1, 0.2) to Bin 2, and so on). Then the average original IoU and the average achieved IoU are calculated for each bin. The resulting 10 pairs of average initial IoU versus average achieved IoU are plotted in Figure 3.13. According to this figure, our method is able to improve the localization accuracy of the original proposals across different initial IoU values.

Figure 3.13: The relationship between the original RPN [92] proposals and the corresponding refined proposals.

3.3.7 Impact of Object Sizes

Similar to the analysis above, we obtain pairs of object size and the corresponding best IoU. This analysis is done on RPN [92] and BR-RPN. Those pairs are partitioned into 50 bins based on the ranked object size (in ascending order) and the average IoU is calculated in each bin. As shown in Figure 3.14, our method works better on large objects and cannot obtain much improvement for smaller objects. One explanation is that large objects often have clearer contours than small objects.

Figure 3.14: The impact of object size (with 100% representing the largest object) on our box refinement method.

3.4 Conclusions

A box refinement method is proposed and applied to existing object proposal methods in this work. By conducting a shortest path search with anchors in a cost map obtained from the edge response, we are able to align the proposed bounding boxes with contours and, thus, improve their localization accuracy. An interesting extension of our work is to integrate the contour information into the box regression process of CNN detectors. Box regression needs accurate spatial information, which is unavailable from the top convolutional layer after several pooling layers. Contours, on the other hand, are informative for accurate localization; they complement deep features in this respect. Combining them to achieve box regression with higher accuracy is a research topic worth further exploration, since localization accuracy can be an important concern in some specific applications.

Chapter 4

Multiple Instance Curriculum Learning for Weakly Supervised Object Detection

4.1 Introduction

Object detection is an important problem in computer vision. In recent years, a set of detectors based on convolutional neural networks (CNN) has been proposed [35, 92, 75], which perform significantly better than traditional methods (e.g., [32]). Those detectors need to be supervised with fully labeled data, where both the object category and location (bounding boxes) are provided. However, we argue that such data are expensive in terms of labeling effort and thus are not scalable. With the increase of dataset size, it becomes extremely difficult to label the locations of all object instances. In this work, we focus on object detection with weakly labeled data, where only image-level category labels are provided, while the object locations are unknown.
This type of method is attractive since image-level labels are usually much cheaper to obtain. For each image in a weakly labeled dataset, the image-level label tells both the present and the absent object categories. Thus, for each category, we have positive examples where at least one instance of that category is present, as well as negative ones where no objects of that category exist.

Figure 4.1: The training diagram of our MICL method. It iterates over re-localization and re-training. During re-localization, saliency maps are generated from the current detector. Segmentation seeds are obtained from the saliency maps, which later grow into segmentation masks. We use the segmentation masks to guide the detector to avoid being trapped in the discriminative parts. A curriculum is designed based on the segmentation masks and the current top scoring detections. With the curriculum, the multiple instance learning process can be organized in an easy-to-hard manner for the detector re-training.

Some previous methods [68, 54, 12] extract a set of object proposals (also called object candidates) with unsupervised methods [110, 122, 71, 16] and then identify the best proposals that lead to high image classification scores for existing categories. However, the best proposals for image classification do not necessarily cover the full objects. (For example, to classify an image as "cat", seeing the face is already sufficient and even more robust than seeing the whole body, as the fluffy fur can be confused with other animals.) Specifically, the best proposals usually focus on discriminative object parts, which oftentimes do not overlap enough with the extent of the full objects, and thus become false positives for detection.

To reduce the false positives caused by trapping in the discriminative parts, we use segmentation masks to guide the weakly supervised detector in the typical relocalization-and-retraining loop [95, 37]. The segmentation process starts with a few seeds from the object saliency maps generated by the current detector. Then segmentation masks are obtained by expanding those seeds using the "Seed, Expand and Constrain" (SEC) method [58]. One may use all generated masks to directly supervise the detector. However, the detector would then be misled by the hard and noisy examples where the segmentation network fails to produce reasonably good object masks. To overcome this challenge, we propose the multiple instance curriculum learning (MICL) approach, which combines the commonly used multiple instance learning (MIL) framework with the easy-to-hard learning paradigm, curriculum learning (CL) [9]. The workflow of the proposed MICL system is shown in Fig. 4.1. It learns from the "easy" examples only in the re-training step of MIL. The re-trained detector is later used to re-localize the segmentation seeds and object boxes. While this process iterates, the training set is gradually expanded from easy to hard examples so the detector learns to handle more complex examples. We identify the easiness of an example by examining the consistency between the results from the detector and the segmenter, without the additional supervision on "easiness" required by traditional CL methods.
Once the proposed MICL process is finished, the detector is applied to test images directly.

The contributions of this work are summarized as follows. First, we incorporate a semantic segmentation network to guide the detector to learn the extent of full objects and avoid being stuck at discriminative object parts. Second, we propose an MICL method by combining MIL with CL so that the detector is not misled by hard and unreliable examples, and our CL process does not require any additional supervision on the "easiness" of the training examples. Third, we demonstrate the superior performance of the proposed MICL method as compared with state-of-the-art weakly supervised detectors.

4.2 Proposed MICL Method

The proposed MICL approach starts with initializing a detector to find the most salient candidate (Sec. 4.2.1). Meanwhile, we obtain saliency maps from a trained classifier/detector. The saliency maps are thresholded to obtain segmentation seeds. Then a segmentation network is trained to grow object masks from those seeds (Sec. 4.2.2). After that, we inject the curriculum learning (CL) paradigm into the commonly used re-localization/re-training multiple instance learning (MIL) framework [37] for the weakly supervised object detection problem, leading to the multiple instance curriculum learning (MICL) approach. With MICL, we further train the detector to learn the extent of objects under the guidance of the segmentation network (Sec. 4.2.3).

4.2.1 Detector Initialization

To start the MIL process detailed in Sec. 4.2.3, we need an initial detector with a certain localization capability. We achieve this by first training a whole-image classifier and then tuning it into a detector that identifies the most salient candidate. As in the methods described in [12, 54, 104], we first extract SS [110] object proposals. In order to find the most salient candidate (MSC) among the proposals, we then add a "saliency" branch composed of an FC layer in parallel with the classification branch, as shown in Fig. 4.2. This branch takes the region of interest (RoI, denoted by r) features from the second last layer of the VGG16 network [97] as the input and computes its saliency. Then the softmax operation is applied, so that the saliency scores of all RoIs for category c, denoted by h(c, r), sum to one, i.e., \sum_r h(c, r) = 1.

Figure 4.2: The detector with a saliency branch to find the most salient candidate (MSC) for image classification.

The saliency score is used to aggregate the RoI classification scores p(c, r) into image-level scores:

p(c) = \frac{\sum_r h(c, r)\, p(c, r)}{\sum_r h(c, r)} = \sum_r h(c, r)\, p(c, r).   (4.1)

Then the whole network can be trained with image-level labels using the multi-label cross entropy loss:

L = -\sum_{c \in C} \big[ y(c) \ln p(c) + (1 - y(c)) \ln (1 - p(c)) \big],   (4.2)

where y(c) is the ground truth image-level label, with 1 and 0 representing existing and non-existing categories, respectively, and p(c) stands for the aggregated classification score for category c. We rank the RoIs by the combined scores h(c, r) p(c, r) and record the top scoring RoI for the curriculum learning process detailed in Sec. 4.2.3.
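A minimal TensorFlow sketch of the saliency-weighted aggregation in Eq. (4.1) and the multi-label cross entropy loss in Eq. (4.2) follows; the tensor shapes, the per-RoI softmax used for p(c, r) and the function names are illustrative assumptions rather than the exact implementation:

```python
import tensorflow as tf

def aggregate_image_scores(roi_cls_logits, roi_saliency_logits):
    """Eq. (4.1): RoI classification scores p(c, r) are combined into
    image-level scores p(c), weighted by the saliency h(c, r), a softmax
    over RoIs so that sum_r h(c, r) = 1 for every category.
    Both inputs have shape [num_rois, num_classes]."""
    p_cr = tf.nn.softmax(roi_cls_logits, axis=1)       # p(c, r): per-RoI class scores
    h_cr = tf.nn.softmax(roi_saliency_logits, axis=0)  # h(c, r): sums to 1 over RoIs
    return tf.reduce_sum(h_cr * p_cr, axis=0)          # p(c), shape [num_classes]

def multilabel_cross_entropy(y, p_c, eps=1e-7):
    """Eq. (4.2): multi-label cross entropy on the aggregated image scores,
    with y(c) in {0, 1} marking existing / non-existing categories."""
    p_c = tf.clip_by_value(p_c, eps, 1.0 - eps)
    return -tf.reduce_sum(y * tf.math.log(p_c)
                          + (1.0 - y) * tf.math.log(1.0 - p_c))
```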
4.2.2 Segmentation-based Seed Growing

In this module, we first obtain saliency maps from a classification network (a single region of interest) or a detection network (multiple regions of interest) and then train a segmentation network to expand object masks from the salient regions.

Saliency map for a single region. Several methods [120, 96, 118] have been proposed to automatically identify the saliency map from a trained image classification network. The saliency map for category c, denoted by A(x, y, c), describes the discriminative power of the cell (x, y) on the feature map f(x, y, k) generated by the trained classification network. Here, we use CAM [120] to obtain the saliency maps for existing categories and Grad [96] for the background, as elaborated below. Applied to a GAP classification network (GAP refers to global average pooling; a GAP network has only its top layer fully connected (FC), preceded by a GAP layer that averages the 3-D feature map into a feature vector), the CAM saliency map is defined as

A_{CAM}(x, y, c) = \sum_k f(x, y, k)\, w(k, c),   (4.3)

where (x, y) are the cell coordinates of the feature map from the last convolutional layer, f(x, y, k) is the response of the k-th unit at this layer and w(k, c) is the weight of the fully connected (FC) layer corresponding to category c. The Grad background saliency map is defined as

A_{BG}^{Grad}(x, y) = 1 - \max_{c \in C^+} \frac{1}{Z(c)} \max_k \frac{\partial p(c)}{\partial f(x, y, k)},   (4.4)

where p(c) is the output of the classification network, indicating the probability that the input image belongs to category c, and Z(c) is a normalization factor so that the maximum saliency for all existing categories, denoted by C^+, is normalized to one. More specifically,

Z(c) = \max_{x, y, k} \frac{\partial p(c)}{\partial f(x, y, k)}.   (4.5)

Aggregated saliency map from multiple regions. The outputs of a detector are essentially a group of bounding boxes with classification scores. We propose a generalized saliency map mechanism that can be applied to detectors with a RoI pooling layer (e.g., Fast R-CNN [35], as used in this work). As illustrated in Fig. 4.3, given a RoI r and its feature map f(x, y, k; r) (which has a fixed size because the classifier is composed of FC layers), obtained by the RoI pooling operator, one can compute the saliency map within r as

A(x, y, c; r) = \sum_k \frac{\partial p(c, r)}{\partial f(x, y, k; r)} \odot f(x, y, k; r),   (4.6)

where \odot denotes element-wise multiplication and p(c, r) is the classification score for r. To obtain the saliency maps for the entire image, the RoI saliency maps are aggregated via

A(x, y, c) = \sum_{r=1}^{N} \hat{A}(x, y, c; r)\, p(c, r),   (4.7)

where \hat{A}(x, y, c; r) is obtained by resizing A(x, y, c; r) to the RoI size using bilinear interpolation and then padding it to the image size, as shown in Fig. 4.4.

Figure 4.3: Computing the saliency map within a RoI.

Figure 4.4: Aggregating RoI saliency maps into the image saliency maps.

The relationship between multi-region and single-region saliency maps. It is worthwhile to point out that Eq. (4.6) reduces to the single-region, or whole-image, saliency map for a classification network with no RoI inputs (i.e., the entire image as one RoI). When the image classification network is a GAP network, it can be further reduced to Eq.
(4.3), as derived in the original CAM [120]. The image classification network can be viewed as a detector with only one region of interest (RoI), where the RoI is the whole image, denoted as R. The RoI feature map f(x, y, k; R) is then equivalent to the image feature map f(x, y, k). Also, the RoI classification score p(c, R) is simply the image classification score p(c). Thus, we have

A(x, y, c; R) = \sum_k \frac{\partial p(c)}{\partial f(x, y, k)} \odot f(x, y, k).   (4.8)

Since there is only one RoI, the aggregation in Eq. (4.7) reduces to

A(x, y, c) = \hat{A}(x, y, c; R)\, p(c),   (4.9)

where \hat{A}(x, y, c; R) is obtained by resizing A(x, y, c; R) to the image size via bilinear interpolation. Padding is no longer needed, as the RoI is the whole image. If we consider the saliency map before resizing (i.e., the saliency map that matches the size of the image feature map instead of the image size), denoted by A_{feat}(x, y, c), we have

A_{feat}(x, y, c) = A(x, y, c; R)\, p(c) = p(c) \sum_k \frac{\partial p(c)}{\partial f(x, y, k)} \odot f(x, y, k).   (4.10)

If the image classification network is a GAP network, then p(c) is obtained via

p(c) = \sum_k w(k, c)\, \bar{f}(k) + b(c),   (4.11)

where w(k, c) is the weight of the fully connected (FC) layer corresponding to category c and b(c) is the bias parameter. \bar{f}(k) denotes the feature vector obtained by the global average pooling operation on the feature map f(x, y, k). Namely,

\bar{f}(k) = \frac{1}{HW} \sum_{x, y} f(x, y, k),   (4.12)

where H and W are the height and width of f(x, y, k), respectively. By substituting Eq. (4.12) into Eq. (4.11), we have

p(c) = \frac{1}{HW} \sum_k w(k, c) \Big[ \sum_{x, y} f(x, y, k) \Big] + b(c).   (4.13)

Thus, the partial derivative in Eq. (4.10) is simplified to

\frac{\partial p(c)}{\partial f(x, y, k)} = \frac{1}{HW}\, w(k, c),   (4.14)

and Eq. (4.10) is reduced to

A_{feat}(x, y, c) = \frac{p(c)}{HW} \sum_k w(k, c)\, f(x, y, k).   (4.15)

As the saliency map is normalized so that the maximum saliency is one, the factor p(c)/(HW) can be ignored, and Eq. (4.15) is equivalent to Eq. (4.3), the original CAM saliency map derived in [120].

Object mask growing from discriminative regions. Segmentation seeds are obtained by thresholding these saliency maps, and the seeds from CAM and Grad are simply pooled together. We adopt the SEC method [58] to train the DeepLab [17] network to expand masks from those seeds. In the first round of training, seeds come from a classification network, while in later rounds they come from the currently trained detector. The trained segmentation network is applied to all training images and a bounding box is drawn around the largest connected component in the mask. Thus one instance of each existing category is localized. The location information guides the detector training process described in Sec. 4.2.3.

4.2.3 Multiple Instance Curriculum Learning

The commonly used MIL framework usually starts with an initial detector and then alternates between updating the boxes (re-localization) and updating the model (re-training). In re-localization, the current detector is applied to the training images and the highest-scoring box is saved. In re-training, the detector is re-trained on the saved boxes. To re-train the initialized detector from Sec. 4.2.1, one may use the highest-scoring boxes produced by the detector itself, but it then gets stuck easily at the same box. Alternatively, one can use the bounding boxes from the SSG network, but those boxes may not be reliable due to inaccurate segmentation seeds. In other words, relying solely on the initial detector or the segmenter leads to sub-optimal results.
To avoid misleading the detector with unreliable boxes, we guide it by organizing the MIL process on a curriculum that requires the detection boxes and segmentation masks to agree with each other. Details are elaborated below.

SSG-Guided Detector Training. The easy-to-hard learning principle proposed in [9] has proved helpful in training weakly supervised object detectors [95, 109]. However, most previous methods require additional human supervision or additional object size information to determine the hardness of a training example, which is expensive to acquire. Instead of seeking additional supervision, we determine the hardness of an example by measuring the consistency between the outputs of the detector and the SSG network. The consistency is defined as the intersection over union (IoU) of the boxes from the SSG network and the detector:

S(c, z) = \mathrm{IoU}\big(B_{DET}(c, z),\, B_{SSG}(c, z)\big),   (4.16)

where z represents a positive example for category c, and B_{DET}(c, z) and B_{SSG}(c, z) stand for the bounding boxes of the object predicted by the detector and the SSG network, respectively. As shown in Fig. 4.1, an example is considered easy if

S(c, z) \ge T,   (4.17)

where T is a threshold that controls the hardness of the selected examples. We argue for the validity of this criterion from two perspectives. First, those examples are easier because the goal of the detector is to mimic the mask expansion ability of SSG, and an example appears easier if the detector already produces something similar (i.e., the gap between the achieved result and the learning target is small). Second, the object localization on those examples is confirmed by both the detector and SSG, meaning that the predicted locations are more reliable. In other words, if B_{SSG}(c, z) deviates significantly from B_{DET}(c, z), it tends to be unreliable. We verify the reliability of B_{SSG}(c, z) on the selected examples in Sec. 4.4.3.

For an existing category c, one instance is localized by taking the average of B_{DET}(c, z) and B_{SSG}(c, z) on the selected examples. The instances localized on easy examples are used for further detector training in a fully supervised manner. In this work, we use the popular Fast R-CNN detector [35] with Selective Search [110] as the object proposal generator.

Re-localization and re-training. The detector trained on easy examples lacks the ability to handle hard examples because it focuses on easy ones in the aforementioned training round. Thus, we gradually include more training examples by adopting the re-localization/re-training iterations of the MIL framework [37], illustrated in Fig. 4.1. In the re-localization step, the trained detector is applied to the whole training set and the highest scoring boxes for existing categories are recorded as the new B_{DET}(c, z). Meanwhile, the outputs of the detector are used to re-localize the segmentation seeds, based on which the SSG network is re-trained to generate new B_{SSG}(c, z). By applying the same example selection criterion, another training subset, containing more examples because the two sets of results become more similar after learning from each other, is identified and the Fast R-CNN detector is re-trained. The MIL process alternates between re-training and re-localization until all training examples are included. After training is finished, the detector is applied directly to test images.

4.3 Implementation Details

The backbone architecture for all modules is the VGG16 [97] network, for fair benchmarking with previous methods [12, 68, 104, 54, 95, 25].
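Before the per-module details, the example-selection rule of Eqs. (4.16) and (4.17) in Sec. 4.2.3 can be illustrated with a minimal sketch; the box format, the dictionary layout keyed by (image, category) pairs and the function names are assumptions, not the exact implementation:

```python
def box_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format (format is an assumption)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def select_easy_examples(det_boxes, ssg_boxes, threshold=0.5):
    """Curriculum selection of Sec. 4.2.3: an (image, category) example is
    'easy' when the detector box and the SSG box agree, i.e. their IoU
    (Eq. (4.16)) reaches the threshold T of Eq. (4.17).  The pseudo location
    label is the average of the two boxes, as described above."""
    pseudo_labels = {}
    for key, det in det_boxes.items():
        ssg = ssg_boxes.get(key)
        if ssg is not None and box_iou(det, ssg) >= threshold:
            pseudo_labels[key] = [(d + s) / 2.0 for d, s in zip(det, ssg)]
    return pseudo_labels
```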
The network is pretrained on the ImageNet dataset [22]. The whole system is implemented with the TensorFlow [1] library. The ImageNet-pretrained model is converted from the publicly released Caffe model in the model zoo.

4.3.1 Detector Initialization

Classification fine-tuning. We first fine-tune the VGG16 network to perform the image classification task on the PASCAL VOC dataset by removing the 1000-way FC8 layer and adding a 20-way FC layer as the new FC8. This network is trained on a subset of ImageNet where images from the PASCAL classes are selected. We argue that this fine-tuning process does not require additional supervision, as the pretrained model from the model zoo has already seen all the images in the training set. We do not use images from the PASCAL dataset for classification fine-tuning because the detector assumes each bounding box contains only one object, but PASCAL images are multi-labeled. The fine-tuning process takes 10,000 iterations with a constant learning rate of 10^{-4} and a batch size of 16.

Category-specific object proposals. In order to find the most salient candidate, we first find a small set of category-specific proposals for each existing category, due to the limited GPU memory. We extract 1,000 category-independent proposals for each image via Selective Search (SS) [110]. To assign a classification confidence score to each proposal, we adopt the Fast R-CNN [35] architecture but drop the bounding box regression layers. This network is initialized from the model fine-tuned for the whole-image classification task mentioned above. The classification head, composed of FC layers (FC6-8) followed by a softmax operation, is applied to each RoI feature obtained through the RoI pooling layer, so every proposal receives a classification score vector. Then the 1,000 RoI score vectors are aggregated into an image-level score vector by global maximum pooling,

p(c) = \max_r p(c, r),   (4.18)

where p(c, r) denotes the probability of the r-th RoI being an instance of category c and p(c) denotes the probability that the input image contains instances of category c. The network can then be trained by minimizing the multi-label cross entropy loss. The training process takes 10,000 iterations with a constant learning rate of 10^{-4} and a batch size of 4. The trained network is applied to all training images. We select the top 100 proposals for each existing category, and the most salient object will be identified from them. The category-specific proposals are also useful in curriculum learning.

The most salient candidate detector. As described in Sec. 4.2.1, we add a saliency branch in parallel with the classification branch. More specifically, the saliency branch is added in parallel with the FC8 layer and takes the feature from the FC7 layer. The RoI feature for each category-specific proposal is computed by feeding the correspondingly cropped image region into the network. RoI features are computed on the fly and the network is trained end-to-end. We use the network fine-tuned for whole-image classification to initialize the detector. The newly added saliency branch is randomly initialized. The training process takes 10,000 iterations with a constant learning rate of 10^{-4} and a batch size of 4.

4.3.2 Segmentation-based Seed Growing

We use the DeepLab-LargeFOV [18] segmentation network in this module. The backbone is the VGG16 network.
The training parameters are set identically to those of SEC [58]: the batch size is 15; the base learning rate is 10^{-3} with a decay rate of 0.1 and a decay step of 2,000; the total number of training iterations is 8,000. The network is initialized from the model pretrained on ImageNet (http://www.cs.jhu.edu/~alanlab/ccvl/init models).

Segmentation seeds. The seeds used for the first round of the segmentation training come from a classification network applied to the entire image. Since CAM [120] does not apply to the standard VGG-16 network [97], we follow the modification in SEC [58] to make CAM compatible with VGG-16. The network is fine-tuned on training images from the PASCAL VOC datasets with image-level labels. To generate seeds for existing categories, the saliency maps are computed via Eq. (4.3) and the maximum saliency is normalized to 1. The threshold is set to 0.2. The background saliency maps are computed via Grad [96], following Eq. (4.4). They are then normalized such that the maximum and minimum background saliency are 1 and 0, respectively. The threshold is set to 0.9 to generate seeds, or we use the top 10% most background-salient regions; we take the option that yields more background seeds. In the later training rounds of the segmentation network, the background seeds are kept identical while the foreground seeds are updated based on the trained detector, following Eq. (4.6). The thresholding parameter is kept identical.

Post-processing of segmentation results. Since the trained segmentation network is applied to training images, where the existing and absent object categories are known, the category label assigned to each pixel is obtained by

M(x, y) = \arg\max_{c \in C^+} p(c, x, y),   (4.19)

where p(c, x, y) is the probability that pixel (x, y) belongs to category c and C^+ is the set of existing categories, including background.

4.3.3 Multiple Instance Curriculum Learning

Example selection. The threshold T for the intersection over union (IoU) is set to 0.5 in our experiments, except in the final training round, where all examples are selected. Some selected easy examples are shown in Fig. 4.5, while some hard examples not used for training are shown in Fig. 4.6.

Figure 4.5 (one panel per PASCAL VOC category): Visualized easy examples selected by measuring the consistency between boxes from the segmenter (in red) and the detector (in green). These examples are used for the next round of detector training.

Figure 4.6 (one panel per PASCAL VOC category): Visualized hard examples selected by measuring the consistency between boxes from the segmenter (in red) and the detector (in green). These examples are NOT used for the next round of detector training.

Training. The Fast RCNN network is initialized from the MSC detector with the saliency branch dropped. The bounding box regression layers are initialized randomly. Because we use a subset of training examples, we do not take all proposals from SS [110]. Instead, when only a part of the examples are selected from the entire training set, we use the category-specific proposals mentioned in Sec. 4.3.1, as one image containing both "person" and "chair" may be selected as an easy example for "person" but not for "chair" by the easiness measurement.
By taking category-specific proposals only, we avoid the detector seeing proposals that overlap largely with existing but not selected 60 categories. Using category-specific proposals also reduces the training time. Only in the last training round where all examples are involved, the category-independent propos- als are fed into the detector during training, so that the trained detector can be directly applied to test images, where the existing categories are unknown. The multiple instance curriculum learning (MICL) takes two rounds. In the first round, the simple examples are selected by measuring the consistency between the results from MSC detector and the segmenter. About 2300 examples are selected for the first MICL training round. Segmentation seeds are updated based on the detector from training round 1 and the segmenter is retrained (initialized from the ImageNet- pretrained model). Then, objects are re-localized by the detector and the segmenter. We take the average of the bounding boxes from those two and use those boxes as the pseudo location labels. In the second training round, all examples are used and the detector is initialized by the detector from training round one. For all training rounds, we use the SGD optimizer with a momentum of 0.9 for 40000 iterations. The learning rate is set to 10 4 and the batch size is 2. Proposals with IoU larger than 0.5 are treated as positive samples; those with IoU between 0.1 and 0.5 are negative samples, by following the setting in [35]. 4.4 Experiments 4.4.1 Datasets and Evaluation Datasets. To evaluate the performance of the weakly supervised MICL detector, we conduct experiments on the PASCAL VOC 2007 and 2012 datasets (abbreviated as VOC07 and VOC12 below) [29], where 20 object categories are labeled. For the MICL detector training, we only use image-level labels, with no human labeled bounding boxes involved. 61 Evaluation metrics. Following the evaluation of fully supervised detectors, we use the average precision (AP) for each category on the test set as the performance metric. Besides, we choose another metric, the Correct Location (CorLoc) [24], to evaluate weakly supervised detectors, which is usually applied to training images. The CorLoc is the percentage of the true positives among the most confident predicted boxes for existing categories. A predicted box is a true positive if it overlaps sufficiently with one of the ground truth object boxes. The IoU threshold is set to 50% for both metrics. 4.4.2 Experimental Results Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike prsn plant sheep sofa train tv mAP WSDNN-Ens[12] 46.4 58.3 35.5 25.9 14.0 66.7 53.0 39.2 8.9 41.8 26.6 38.6 44.7 59.0 10.8 17.3 40.7 49.6 56.9 50.8 39.3 WSDNN[12] 43.6 50.4 32.2 26.0 9.8 58.5 50.4 30.9 7.9 36.1 18.2 31.7 41.4 52.6 8.8 14.0 37.8 46.9 53.4 47.9 34.9 DA[68] 54.5 47.4 41.3 20.8 17.7 51.9 63.5 46.1 21.8 57.1 22.1 34.4 50.5 61.8 16.2 29.9 40.7 15.9 55.3 40.2 39.5 ConLoc[54] 57.1 52.0 31.5 7.6 11.5 55.0 53.1 34.1 1.7 33.1 49.2 42.0 47.3 56.6 15.3 12.8 24.8 48.9 44.4 47.8 36.3 Attn[104] 48.8 45.9 37.4 26.9 9.2 50.7 43.4 43.6 10.6 35.9 27.0 38.6 48.5 43.8 24.7 12.1 29.0 23.2 48.8 41.9 34.5 SE[95] - - - - - - - - - - - - - - - - - - - - 37.2 MICL 61.2 51.9 47.1 13.5 10.1 52.1 56.9 71.0 7.6 36.4 49.7 64.5 63.0 57.8 27.9 16.6 30.4 53.8 41.1 40.3 42.6 Table 4.1: Comparison of mAP values on the VOC07 test set. 
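Both the easiness measurement described above and the AP/CorLoc numbers reported in these tables rest on bounding-box IoU with a 0.5 threshold. The sketch below is a minimal illustration of that computation, of the consistency test between a segmenter box and a detector box, and of the averaged pseudo location label; the corner box format ([x1, y1, x2, y2]) and all function names are our own assumptions rather than code from the released implementation.

def box_iou(box_a, box_b):
    # Intersection over union of two boxes given as [x1, y1, x2, y2].
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_easy_example(box_ssg, box_det, iou_threshold=0.5):
    # An image/category pair is treated as "easy" when the segmenter box and
    # the detector box are consistent, i.e. their IoU reaches the threshold T.
    return box_iou(box_ssg, box_det) >= iou_threshold

def pseudo_label_box(box_ssg, box_det):
    # Pseudo location label used for re-localization: the coordinate-wise
    # average of the segmenter box and the detector box.
    return [(a + b) / 2.0 for a, b in zip(box_ssg, box_det)]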
Note that the gaps between the previous approaches and our method are particularly large on categories such as “cat”, “dog” and “horse”. The improvements are from the SSG network, which grows the bounding box to cover the full objects from the discriminative parts (i.e., faces). The average precision (AP) on the VOC07 test set is shown in Tab. 4.1. An mAP of 42.6% is achieved by our MICL method, with 3.1% higher than the second best method, DA [68]. Note that DA [68] needs to cache the RoI features from pretrained CNN models for MIL and demands highly on disk space. Our proposed method avoids feature caching and thus is more scalable. The third best method is from ensembles [12] (WSDNN-Ens in Tab. 4.1). If compared directly to the results without ensembles (WSDNN in Tab. 4.1), our method is 7.7% superior. Some visualized detection results are shown in Fig. 4.7. 62 Figure 4.7: Qualitative detection results, where the correctly detected ground truth objects are in green boxes and blue boxes represent correspondingly predicted locations. Objects that the model fails to detect are in yellow boxes and false positive detections are in red. Tab. 4.2 shows the CorLoc evaluation on the VOC07 trainval set. We achieve 2.5% higher in the CorLoc than WSDNN-Ens [12]. Also, as compared with that of DA [68], which ranks the second in the mAP, our MICL method is superior by 8.1%. Note that 63 in terms of both AP and CorLoc, our MICL detector performs much better on certain categories such as “cat”, “dog” and “horse”. Objects in these categories usually have very discriminative parts (i.e. faces). Improvements on those categories are from the SSG network, which grows the bounding boxes from the discriminative regions. Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike prsn plant sheep sofa train tv Avg. WSDNN-Ens[12] 68.9 68.7 65.2 42.5 40.6 72.6 75.2 53.7 29.7 68.1 33.5 45.6 65.9 86.1 27.5 44.9 76.0 62.4 66.3 66.8 58.0 WSDNN[12] 65.1 63.4 59.7 45.9 38.5 69.4 77.0 50.7 30.1 68.8 34.0 37.3 61.0 82.9 25.1 42.9 79.2 59.4 68.2 64.1 56.1 DA[68] 78.2 67.1 61.8 38.1 36.1 61.8 78.8 55.2 28.5 68.8 18.5 49.2 64.1 73.5 21.4 47.4 64.6 22.3 60.9 52.3 52.4 ConLoc[54] 83.3 68.6 54.7 23.4 18.3 73.6 74.1 54.1 8.6 65.1 47.1 59.5 67.0 83.5 35.3 39.9 67.0 49.7 63.5 65.2 55.1 WSC[25] 83.9 72.8 64.5 44.1 40.1 65.7 82.5 58.9 33.7 72.5 25.6 53.7 67.4 77.4 26.8 49.1 68.1 27.9 64.5 55.7 56.7 MICL 85.3 58.4 68.5 30.4 20.9 67.2 77.1 84.6 24.3 69.5 51.5 80.0 79.8 85.3 46.2 44.9 52.1 64.6 61.3 60.0 60.5 Table 4.2: Comparison of the CorLoc on the VOC07trainval set. To our best knowledge, only [54] and [68] reported results on the VOC12 dataset. We follow the training/testing split in [54] and show the results in Table 4.3. We improve the mAP and the CorLoc by 3.6% and 8.4%, respectively. For performance benchmark- ing with [68], we estimate the mAP of our MICL method on the VOC12 val set by applying the detector trained on the VOC07 trainval set. The obtained mAP is 37.8%, which is 8% higher than [68]. We should point out that the VOC07trainval set does not overlap with the VOC12 val set and its size is about the same as the VOC12 train set used in [68]. Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike prsn plant sheep sofa train tv Avg. 
AP      ConLoc [54]   64.0 54.9 36.4  8.1 12.6 53.1 40.5 28.4  6.6 35.3 34.4 49.1 42.6 62.4 19.8 15.2 27.0 33.1 33.0 50.0   35.3
AP      MICL          65.5 57.3 53.4  5.4 11.5 48.8 45.4 80.5  7.6 35.2 25.3 75.8 59.5 68.8 18.0 17.0 24.7 37.7 25.8 14.1   38.9
CorLoc  ConLoc [54]   78.3 70.8 52.5 34.7 36.6 80.0 58.7 38.6 27.7 71.2 32.3 48.7 76.2 77.4 16.0 48.4 69.9 47.5 66.9 62.9   54.8
CorLoc  MICL          84.9 78.6 76.9 30.1 32.6 80.0 69.7 90.6 32.1 67.7 47.4 85.5 85.3 85.9 41.4 50.7 62.8 62.7 57.9 41.7   63.2
Table 4.3: Results on the VOC12 dataset, where the AP and the CorLoc are measured on the test and trainval sets, respectively.

4.4.3 Analyses

On saliency maps. To evaluate the saliency maps, we adopt the "pointing game" metric proposed in ExciteBack [118]. If the location of the maximum saliency for an existing category c, which is a point on the map, falls inside one of the ground truth object boxes for c, it is counted as a "hit"; otherwise it is a "miss". (Even if the point falls inside an object box, it may still fall on background. However, since not all images from the VOC07 trainval set have segmentation mask labels, checking the point against bounding boxes is the best available estimate of the point precision of the saliency maps; the same holds for the segmentation seed evaluation below.) To evaluate the precision of the maximum saliency point, the percentage of hits is computed. Differently from ExciteBack, the precision evaluation is conducted on the training set, as the saliency maps are used to grow segmentation masks on the training images. The precision is calculated for each category separately and the mean of those precision values is shown in Tab. 4.4. The saliency maps from the detector trained on the easy images achieve 2.5% higher precision than the maps from the classification network.

Saliency maps                               Precision (%)
Classification network (single-region)      88.6
Detection network (multi-region)            91.1
Table 4.4: The precision of the maximum saliency point on the VOC07 trainval set.

Similarly, the precision of the segmentation seeds generated by thresholding the saliency maps is evaluated by checking whether the seeds fall into object bounding boxes. The same thresholding parameters are applied to the saliency maps from the classifier and the detector. We also evaluate the recall of the segmentation seeds as the ratio between the number of hit seeds and the area of the object bounding boxes. The results are shown in Tab. 4.5. By generating saliency maps from the detector trained on easy examples, we improve the precision and recall by 1.7% and 14.5%, respectively.

Seeds                                       Precision (%)   Recall (%)
Classification network (single-region)      82.6            25.3
Detection network (multi-region)            84.3            39.8
Table 4.5: The precision and recall of segmentation seeds on the VOC07 trainval set.
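The pointing-game and seed evaluations above can be summarized in a few lines. The sketch below assumes single-channel saliency maps, binary seed masks and ground-truth boxes in [x1, y1, x2, y2] pixel coordinates, and its function names are ours; real evaluation code would additionally group the results per category before averaging, as described in the text.

import numpy as np

def point_in_boxes(x, y, boxes):
    # True if the (x, y) point falls inside any ground-truth box [x1, y1, x2, y2].
    return any(b[0] <= x <= b[2] and b[1] <= y <= b[3] for b in boxes)

def pointing_game_hit(saliency_map, gt_boxes):
    # One "pointing game" trial: the arg-max location of the saliency map for an
    # existing category counts as a hit if it lies inside a box of that category.
    y, x = np.unravel_index(np.argmax(saliency_map), saliency_map.shape)
    return point_in_boxes(x, y, gt_boxes)

def seed_precision_recall(seed_mask, gt_boxes):
    # Precision: fraction of seed pixels that fall inside the boxes.
    # Recall: number of hit seed pixels divided by the total box area.
    ys, xs = np.nonzero(seed_mask)
    hits = sum(point_in_boxes(x, y, gt_boxes) for x, y in zip(xs, ys))
    box_area = sum((b[2] - b[0]) * (b[3] - b[1]) for b in gt_boxes)
    return hits / max(len(xs), 1), hits / max(box_area, 1)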
On segmentation results. The CorLoc metric is applied to the bounding boxes generated from the trained segmentation network based on classification or detection saliency maps. Results are given in Tab. 4.6. After updating the segmentation seeds in the re-localization step, the re-trained segmenter improves the CorLoc by 1.3% on the VOC07 trainval set.

Seeds for training                          CorLoc (%)
Classification network (single-region)      53.5
Detection network (multi-region)            54.8
Table 4.6: The CorLoc of the bounding boxes from the segmenter on the VOC07 trainval set.

On the effectiveness of curriculum learning. To prove the effectiveness of curriculum learning, we set up three baseline Fast R-CNN detectors: (1) trained with pseudo location labels from the initial MSC detector, with no segmentation cue used, (2) trained with pseudo location labels from the SSG network, and (3) based on MIL yet without curriculum (where the subset of training examples is selected randomly in each round and the average of B_SSG(c, z) and B_DET(c, z) is used as the pseudo location label). The ablation study is conducted on the VOC07 dataset. The comparison of the MICL approach and the three baselines is given in Tab. 4.7. Adding the segmentation cue to the detector training boosts the CorLoc by 10.9% and the mAP by 4.4% over the initial MSC detector. In parallel, MIL without curriculum gives an even slightly larger improvement of 11.9% CorLoc and 6.3% mAP. By introducing curriculum learning (CL), we obtain a total improvement of 18.0% in the CorLoc and 10.6% in the mAP. We also compare the CorLoc change during training with and without CL in Fig. 4.8.

           MSC     SSG     MIL     MICL
CorLoc     42.5    53.4    54.4    60.5
mAP        32.0    36.4    38.3    42.6
Table 4.7: Performance comparison of the three Fast R-CNN baselines and the one with the proposed multiple instance curriculum learning paradigm.

           Subset   All (SSG)   All (MSC)
CorLoc     72.9     53.2        42.4
mAP        38.0     36.3        32.0
Table 4.8: Comparison of the CorLoc on the selected training subset versus on the whole set, and the mAP on the test set achieved by the correspondingly trained detectors.

On the easy example selection criterion. One reason that curriculum learning is effective is that the pseudo location labels on the selected subset are more reliable than the average, as shown in Tab. 4.8. This can be explained by the different preferences of the SSG network and the MSC detector when they localize objects: the SSG tends to group close instances from the same category due to the lack of instance information, resulting in boxes larger than the objects; the MSC detector, however, may focus on the most discriminative regions, which are usually smaller than the true objects. This intuition is confirmed by Fig. 4.9, where we analyze the localization errors from the SSG and the MSC. Among objects that are mis-localized, errors can be classified into three categories: too large, too small and others. A box is too small if the following conditions hold:

\frac{Area(B^{*} \cap B_{GT})}{Area(B^{*})} \geq T,   (4.20)
Area(B^{*}) < Area(B_{GT}),   (4.21)

where B^{*} stands for the bounding box from the detector (B_DET) or the segmenter (B_SSG) and B_GT denotes the ground truth box. Similarly, a box is too large if

\frac{Area(B^{*} \cap B_{GT})}{Area(B_{GT})} \geq T,   (4.22)
Area(B^{*}) > Area(B_{GT}).   (4.23)

If a box fits neither of the two groups of conditions, it is categorized as others. The threshold T is set to 0.5 in our experiments.

Figure 4.8: Comparison of the CorLoc performance of the Fast R-CNN detector with and without curriculum learning.

Figure 4.9: Percentages of three error types among the mis-localized objects from the SSG and the MSC.

It is clear that the SSG tends to generate bounding boxes larger than the target while the MSC prefers smaller ones. Thus, if the results from the SSG and the MSC are consistent, the bounding box is neither too small nor too large, indicating reliable locations.
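The three error types can be implemented directly from Eqs. (4.20)-(4.23). The sketch below assumes corner-format boxes; the comparison direction (the overlap ratio must reach the threshold T) is our reading of the definitions above, and the function names are ours.

def intersection_area(box_a, box_b):
    # Intersection area of two boxes given as [x1, y1, x2, y2].
    w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(0.0, w) * max(0.0, h)

def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def localization_error_type(box, gt_box, t=0.5):
    # Classify a mis-localized box as "too small", "too large" or "others",
    # following Eqs. (4.20)-(4.23): a box is too small when it lies mostly
    # inside the ground truth and is smaller than it, and too large when it
    # mostly covers the ground truth and is bigger than it.
    inter = intersection_area(box, gt_box)
    a, a_gt = box_area(box), box_area(gt_box)
    if a > 0 and inter / a >= t and a < a_gt:
        return "too small"
    if a_gt > 0 and inter / a_gt >= t and a > a_gt:
        return "too large"
    return "others"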
We also train a Fast R-CNN detector on the easy subset only and compare it with the detectors trained on the whole set with pseudo locations from the SSG and the MSC, respectively. As shown in Tab. 4.8, the one trained with easy examples achieves 1.7% higher mAP than the second best, even though it sees fewer examples, which indicates the importance of the quality of the pseudo location labels.

4.5 Conclusions

In this work, we proposed an MICL method with a segmentation network injected to overcome the challenge that detectors often focus on the most discriminative regions when trained without manually labeled tight object bounding boxes. In the proposed MICL approach, where the segmentation-guided MIL is organized on a curriculum, the detector is trained to learn the extent of full objects from easy to hard examples, and the easiness is determined automatically by measuring the consistency between the results from the current detector and the segmenter. The benefits from the segmentation network and the power of the easy-to-hard curriculum learning paradigm are demonstrated by extensive experimental results.

Chapter 5

Unsupervised Video Object Segmentation

5.1 Introduction

One important task in video understanding is object localization in time and space. Ideally, it should be able to localize familiar or novel objects consistently over time with a sharp object mask, which is known as video object segmentation (VOS). If no indication of which object to segment is given, the task is known as unsupervised video object segmentation or primary object segmentation. Once an object is segmented, visual effects and video understanding tools can leverage that information [73].

Related object segmentation tasks in static images are currently dominated by methods based on the fully convolutional neural network (FCN) [18, 76]. These neural networks require large datasets of segmented object images such as PASCAL [29] and COCO [74]. Video segmentation datasets are smaller because they are more expensive to annotate [69, 83, 88]. As a result, it is more difficult to train a neural network explicitly for video segmentation. Classic work in video segmentation produced results using optical flow and shallow appearance models [40, 57, 67, 82, 86, 114], while more recent methods typically pretrain the network on image segmentation datasets and later adapt the network to the video domain, sometimes combined with optical flow [13, 20, 50, 105, 106, 112].
Instead of training an FCN that directly classifies each pixel as foreground/background as in [20, 50, 105, 106], we train an FCN that jointly learns object instance embeddings and semantic cat- egories from images [31]. The distance between the learned embeddings encodes the similarity between pixels. We argue that the instance embedding is a more useful fea- ture to transfer from images to videos than a foreground/background prediction. As shown in Fig. 5.1, cars appear in both videos but belong to different categories (fore- ground in the first video and background in the second video). If the network is trained to directly classify cars as foreground on the first video, it tends to classify the cars as foreground in the second video as well. As a result, the network needs to be fine-tuned for each sequence [13]. In contrast, the instance embedding network can produce unique 71 embeddings for the car in both sequences without interfering with other predictions or requiring fine-tuning. The task then becomes selecting the correct embeddings to use as an appearance model. In order to select the correct embeddings, we adopt a bilateral network to estimate the background region. More specifically, the bilateral network is trained to model the motion patterns based on optical flow in the non-object region (inferred by a objectness map) and identify regions with similar motion patterns. Then we build a graph on a set of sampled pixels from consecutive frames to jointly model the estimated background, objectness and pixel similarity measured by the distance between their embedding vec- tors. Representative embeddings for foreground and background can be selected by con- ducting a cut on the graph, which is obtained by minimizing a cost function. After select- ing the embeddings, we propagate their labels to remaining pixels (Fig. 5.2). According to the experimental results, incorporating information from multiple frames leads to better segmentation results. We tested the proposed method on the DA VIS 2016 dataset [88] and the FBMS-59 dataset [83]. The experimental results demonstrate superior performance over previous state-of-the-art methods. The contribution is summarized as follows Proposing a new strategy for adapting instance segmentation models trained on static images to videos, with insights into the stability of instance segmentation embeddings over time; Proposing a trainable bilateral network to estimate background based on motion cues and objectness; Efficient multi-frame reasoning via graph cut of a reduced set of points seeded across objects; and 72 Achieving state-of-the-art results on the DA VIS 2016 and the FBMS-59 datasets with intersection-over-union (IoU) scores of 80.4% and 73.9%, and outperforms previous best results by 1.9% and 2.0%, respectively. 5.2 Method The proposed method is explained in detail in this section. An overview is depicted in Fig. 5.2. It contains four modules: 1) instance embedding extraction and objectness prediction 2) background estimation with a motion-based bilateral network, 3) classi- fication of sampled pixels with graph cut, and 4) frame segmentation by propagating labels of sampled pixels. 5.2.1 Instance embeddings and objectness We train a network to output instance embeddings and semantic categories on the image instance segmentation task as in [31]. Briefly, the instance embedding network is a dense-output convolutional neural network with two output heads trained on static images from an instance segmentation dataset. 
Figure 5.2: An overview of the proposed method that consists of four modules. The instance embeddings and objectness score are extracted in Module 1, highlighted by green. The background is estimated from the inverse objectness map and the optical flow through a bilateral network (BNN) in Module 2, which is enclosed in light blue. Then, an embedding graph that contains sampled pixels (marked by dots) from a set of consecutive frames as vertices is constructed. The unary cost is defined based on the objectness and the estimated background from the BNN. The pairwise cost is from instance embedding and optical flow similarity. All vertices are classified in Module 3 by optimizing the total cost, where magenta and yellow dots denote the foreground and background vertices, respectively. Finally, the graph vertex labels are propagated to all remaining pixels based on embedding similarity in Module 4. Best viewed in color.

The first head outputs an embedding for each pixel, where pixels from the same object instance have smaller Euclidean distances between them than pixels belonging to separate objects. The similarity R between two pixels i and j is measured as a function of the Euclidean distance in the E-dimensional embedding space e,

R(i, j) = \frac{2}{1 + \exp(||e(i) - e(j)||_2^2)}.   (5.1)

This head is trained by minimizing the cross entropy between the similarity and the ground truth matching indicator g(i, j). For locations i and j, the ground truth matching indicator g(i, j) = 1 if the pixels belong to the same instance and g(i, j) = 0 otherwise. The loss is given by

L_s = -\frac{1}{|A|} \sum_{i,j \in A} w_{ij} [g(i, j) \log(R(i, j)) + (1 - g(i, j)) \log(1 - R(i, j))],   (5.2)

where A is a set of pixel pairs, R(i, j) is the similarity between pixels i and j in the embedding space and w_{ij} is inversely proportional to the instance size to balance training.

The second head outputs an objectness score derived from semantic segmentation. We minimize a semantic segmentation log loss to train the second head to output a semantic category probability for each pixel. The objectness map is derived from the semantic prediction as

O(i) = 1 - P_{BG}(i),   (5.3)

where P_{BG}(i) is the probability that pixel i belongs to the semantic category "background". (Here, in semantic segmentation, "background" refers to the region that does not belong to any category of interest, as opposed to video object segmentation, where the "background" is the region other than the target object. We use "background" in the video object segmentation sense for the rest of the paper.) We do not use the scores for any class other than the background in our work.
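A minimal numpy sketch of the pairwise similarity of Eq. (5.1) and the pair-sampling loss of Eq. (5.2) is given below. It assumes the pixel pairs, their binary matching indicators and the balancing weights have already been sampled; the function names and array shapes are our own and are not those of the implementation in [31].

import numpy as np

def pair_similarity(e_i, e_j):
    # Eq. (5.1): R = 2 / (1 + exp(||e_i - e_j||^2)), broadcast over rows.
    return 2.0 / (1.0 + np.exp(np.sum((e_i - e_j) ** 2, axis=-1)))

def embedding_loss(emb, pairs, same_instance, weights, eps=1e-7):
    # Eq. (5.2): weighted cross entropy over a set A of sampled pixel pairs.
    # emb: (num_pixels, E); pairs: (|A|, 2) index pairs; same_instance: g(i, j);
    # weights: w_ij, inversely proportional to instance size.
    r = np.clip(pair_similarity(emb[pairs[:, 0]], emb[pairs[:, 1]]), eps, 1 - eps)
    g = same_instance.astype(np.float64)
    return -np.mean(weights * (g * np.log(r) + (1.0 - g) * np.log(1.0 - r)))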
5.2.2 Motion-based bilateral networks for background estimation

Motion cues are essential to the solution of the unsupervised VOS problem. In MP [105], a binary segmentation neural network uses the optical flow vector as the input and produces a motion saliency map. Due to camera motion, flow patterns do not always correspond to moving objects. Consider the following two scenarios: 1) an object with leftward motion and a static camera, and 2) an object with rightward motion and a camera following the object. The latter is viewed as left-moving background with a static object. The flow patterns are flipped between the two scenarios, yet both expect high motion saliency on the objects. In other words, it is confusing for the network to find motion-salient objects relying solely on optical flow.

To address this problem, we integrate optical flow from FlowNet2.0 [49] with objectness scores to estimate the background, which includes static objects and non-objects (also known as stuff [44]). Given an objectness map output by a segmentation network [31], we can locate stuff regions by thresholding the objectness map. The optical flow in those regions can model the motion patterns of the background (due to camera motion). By identifying regions of similar motion, we can associate static objects with the background. In other words, the background is expanded from stuff regions to include static objects. Inspired by temporal propagation in semi-supervised VOS with the bilateral space [51, 79], we solve this background expansion problem with a bilateral network (BNN) [52], i.e., generalized bilateral filtering in which the default Gaussian filters [3, 107] are replaced with learnable ones.

Bilateral filtering. We briefly review bilateral filtering below and refer to [87, 107, 3] for more details. Fast bilateral filtering with approximation is implemented in four steps [87]: 1) constructing a bilateral grid, 2) splatting the features of input samples onto the high-dimensional bilateral grid, 3) convolving on the high-dimensional grid, and 4) slicing the filtered features on the grid back to the samples of interest. Let d and f_q \in R^d denote the dimension of the bilateral space and the position vector of sample q in the bilateral space, respectively. In addition, let s_I(i), s_V(v) and s_O(o) denote the feature of an input sample i, a vertex v of the bilateral grid, and an output sample o. We explain the case where the feature is a scalar, but the process can be generalized to vector features (e.g., for image denoising, the feature is the 3-D color components) by repeating it for each feature channel. A bilateral filtering process with d = 2 is illustrated in Fig. 5.3. The 2-D bilateral grid partitions the space into rectangles, and the position of each vertex v can be described by f_v \in R^d. The feature on a vertex is obtained by accumulating the features of the input samples in its neighboring rectangles, in the form of

s_V(v) = \sum_{i \in \Omega(v)} w(f_v, f_i) s_I(i),   (5.4)

where w(f_v, f_i) is the weight function that defines the influence of sample i on vertex v and \Omega(v) stands for the neighborhood of v. Commonly used weight functions w(\cdot, \cdot) include multilinear interpolation and the nearest-neighbor indicator. Afterwards, filters c(\cdot) are applied to the splatted features on the grid, as shown in the center of Fig. 5.3:

s'_V(v) = \sum_{m} s_V(v - m) c(m),   (5.5)

where s'_V(v) is the filtered feature on vertex v. Finally, the filtered features on the vertices are sliced back to an output sample o, given its position f_o, as shown on the right of Fig. 5.3. Mathematically, the feature of sample o is obtained by

s_O(o) = \sum_{v \in \Omega(o)} w(f_o, f_v) s'_V(v),   (5.6)

where \Omega(o) represents the set of surrounding vertices, with a set size of 2^d. The weight function w(\cdot, \cdot) is identical to the one in Eq. (5.4).

Figure 5.3: Illustration of a fast bilateral filtering pipeline with a bilateral space of dimension d = 2: 1) splatting (left): the available features from input samples (orange squares) are accumulated on the bilateral grid; 2) convolving (center): the accumulated features on vertices are filtered and propagated to neighbors; and 3) slicing (right): the feature of any sample with a known position in the bilateral space can be obtained by interpolation from its surrounding vertices.
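The splat-convolve-slice pipeline of Eqs. (5.4)-(5.6) can be sketched as below. For brevity, the weight function w(.,.) is the nearest-neighbor indicator rather than multilinear interpolation, positions are assumed to be pre-scaled to grid coordinates, and the grid filter is a separable 1-D kernel applied along every dimension; none of this code is taken from the actual implementation.

import numpy as np

def splat(features, positions, grid_shape):
    # Eq. (5.4): accumulate scalar input features on the bilateral-grid vertices.
    grid = np.zeros(grid_shape)
    idx = np.clip(np.round(positions).astype(int), 0, np.array(grid_shape) - 1)
    for p, f in zip(idx, features):
        grid[tuple(p)] += f
    return grid

def convolve_grid(grid, kernel_1d):
    # Eq. (5.5): filter the splatted grid; here one 1-D kernel per dimension
    # (the kernel is assumed to be shorter than every grid dimension).
    out = grid
    for axis in range(grid.ndim):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel_1d, mode="same"), axis, out)
    return out

def slice_grid(grid, positions):
    # Eq. (5.6): read the filtered features back at arbitrary positions.
    idx = np.clip(np.round(positions).astype(int), 0, np.array(grid.shape) - 1)
    return np.array([grid[tuple(p)] for p in idx])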
The filter c(\cdot) in Eq. (5.5) is Gaussian in traditional bilateral filtering. It is generalized to learnable filters in [52], which can be trained by minimizing a loss defined between s_O(o) and the target feature of o. The learnable filters compose the bilateral network (BNN) [51].

Motion-based bilateral networks. A commonly used bilateral space is composed of color components (e.g., RGB) and location indices (x, y), so the 5-D position vector can be written as f = (r, g, b, x, y)^T. For videos, the timestep t is often taken as an additional dimension, yielding f = (r, g, b, x, y, t)^T [51, 79]. In the proposed method, we expand static regions spatially based on motion. Therefore, we have f = (dx, dy, x, y)^T, where (dx, dy)^T denotes the optical flow vector. We do not expand the static region temporally because optical flows on consecutive frames are not necessarily aligned. We build a regular grid of size (G_flow, G_flow, G_loc, G_loc) in this 4-D bilateral space. To obtain a set of input samples for splatting, we locate the stuff pixels by thresholding the objectness map. This set of stuff pixels is the initial background, denoted by B_init. We use the inverted objectness score as the feature to be splatted from an input pixel, i.e.,

s_I(i) = 1 - p_{Obj}(i),  i \in B_{init}.   (5.7)

The inverted objectness score can be viewed as a soft vote for the background, and the splatting process of Eq. (5.4) accumulates the background votes on the vertices of the bilateral grid. After that, a 4-D filter is applied to propagate the votes to neighboring vertices. Finally, by slicing, the propagated votes are forwarded to the remaining pixels on the same frame, based on their optical flow and spatial locations.

To train the BNN, we clip the negative background votes by ReLU and apply the tanh function to convert the clipped votes to a background probability,

p_{bnn}^{BG}(j) = \frac{1 - \exp\{-2\,\mathrm{ReLU}[s_O(j)]\}}{1 + \exp\{-2\,\mathrm{ReLU}[s_O(j)]\}},   (5.8)

and the training loss is the cross entropy between p_{bnn}^{BG} and the inverted ground truth mask,

L = -\sum_{j} \{[1 - y(j)] \ln p_{bnn}^{BG}(j) + y(j) \ln[1 - p_{bnn}^{BG}(j)]\},   (5.9)

where y(j) \in \{0, 1\} is the ground truth, with 0 and 1 representing background and foreground, respectively.
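Putting Eqs. (5.7)-(5.9) together, the BNN forward pass and its loss reduce to the sketch below. Note that the fraction in Eq. (5.8) is exactly tanh applied to the clipped vote, which is what the code uses; the 0.001 objectness threshold is taken from the implementation details later in this chapter, and the function names are our own assumptions.

import numpy as np

def background_votes(p_obj, objectness_threshold=0.001):
    # Eq. (5.7): stuff pixels (low objectness) cast soft background votes 1 - p_Obj.
    b_init = p_obj < objectness_threshold
    return (1.0 - p_obj)[b_init], b_init

def background_probability(sliced_votes):
    # Eq. (5.8): clip negative votes with ReLU, then squash to [0, 1) with tanh,
    # since (1 - exp(-2x)) / (1 + exp(-2x)) = tanh(x).
    return np.tanh(np.maximum(sliced_votes, 0.0))

def bnn_loss(p_bg, fg_mask, eps=1e-7):
    # Eq. (5.9): cross entropy against the inverted ground-truth mask
    # (fg_mask is 1 on the annotated foreground object).
    p = np.clip(p_bg, eps, 1.0 - eps)
    y = fg_mask.astype(np.float64)
    return -np.sum((1.0 - y) * np.log(p) + y * np.log(1.0 - p))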
In BVS [79], the graph cut was conducted on a 6-D bilateral grid (composed by color, spatial and temporal positions). However, it is not realistic to build a grid graph 79 in the bilateral space with high dimensional instance embeddings and locations since the splatting/slicing process would be time-consuming and the resulting grid would be sparse. In the proposed method, we still construct a graph with pixels as vertices. To save computation, the graph is built from a small but diverse subset of pixels. These sampled pixels are called “seeds” and labeled via cost minimization. Then, their labels are propagated to all pixels, which will be explained in Sec. 5.2.4. Seed selection. The instance embeddings are in fact redundant, namely, a lot of pix- els have similar embedding vectors, especially in homogeneous regions (e.g. sky). By selecting a set of pixels with diverse embeddings, we can remove the redundancy to save computation while keep the capability to build a good appearance model for both fore- ground and background from representative embeddings. The first step of seed selec- tion is abandoning the pixels close to object boundaries, as they are usually not similar to either background or foreground in the embedding space and thus not representative ones. To avoid object boundaries, we only select seeds from candidate points where the instance embeddings are locally consistent. (An alternative method to identify the boundaries to avoid would be to use an edge detector such as [27].) We construct a map of embedding edges by mapping discontinuities in the embedding space. The embed- ding edge map is defined as the “inverse” similarity in the embedding space within the neighbors around each pixel, c(p) = 1 1 jN(p)j X q2N(p) R(p;q); (5.10) wherep andq are pixel locations,N(p) contains the four neighbors ofp, andR(r;q) is the similarity measure given in Eq. 5.1. Then in the edge map, we identify the pixels 80 Original image Embedding edge map Candidate setC Seed setS Figure 5.4: Top: An image (left) and the embedding edge map (right). Bottom: the candidate setC (left) and the seed setS (right). Best viewed in color. which are the minimum within a window ofnn centered at itself. These pixels from the candidate setC. Mathematically, C =fpjc(p) = min q2W (p) c(q)g; (5.11) whereW (p) denotes the local window. These candidate points, C, are diverse, but still redundant with one another, espe- cially when a frame contains a large homogeneous region. We take a diverse subset of these candidates as seeds by adopting the sampling procedure from KMeans++ initial- ization [7]. We only need diverse sampling rather than cluster assignments, so we do not perform the time-consuming KMeans step afterwards. The sampling procedure begins by adding the candidate point with the largest objectness score,O(i), to the seed set,S. Sampling continues by iteratively adding the candidate,s n+1 , with smallest maximum similarity to all previously selected seeds and stops when we reachN S seeds, s n+1 = arg min i2C max j2S R(i;j): (5.12) 81 Seeds from neighboring frames are densely connected Embedding discrepancy map Seed Selection A seed as a graph node Embedding graph Figure 5.5: An illustration of embedding graph construction. We find the local minima in the embedding discrepancy map and select them as graph nodes. Edges connect nodes that are spatial neighbors or from consecutive frames (see texts for the definition of spatial neighbors). Best viewed in color. 
Building an embedding graph. The seed selection process is repeated for every frame. To classify the seeds on frame t, we build a graph based on seeds from frames (t - 1) to (t + 1), i.e., a temporal window of length 3 centered at t. Using a different temporal window that covers frame t is also possible, as studied in Sec. 5.3.4. We denote the seed set by V. The next step is to link seeds with edges in the graph. Given the seeds on a frame, we identify the closest seed to every pixel, and we link two seeds with a graph edge if two neighboring pixels are closest to different seeds. These edges are called spatial edges. Seeds from consecutive frames are densely linked to yield temporal edges. Other seed pairs are not connected. The graph edges are denoted by E. An illustration of the embedding graph of seeds is displayed in Fig. 5.5.

Figure 5.5: An illustration of embedding graph construction. We find the local minima in the embedding discrepancy map and select them as graph nodes. Edges connect nodes that are spatial neighbors or from consecutive frames (see the text for the definition of spatial neighbors). Best viewed in color.

Graph cut. A cut of the embedding graph is obtained by minimizing the following cost function:

L = \sum_{i \in V} \psi(i) + \sum_{(i,j) \in E} [l(i) \neq l(j)] \phi(i, j),   (5.13)

where l(i) \in \{0, 1\} is the label assigned to node i, \psi(\cdot) is the unary cost, and \phi(\cdot, \cdot) is the pairwise cost. The unary cost is given by

\psi(i) = [1 - l(i)] \psi_{BG}(i) + l(i) \psi_{FG}(i),   (5.14)

where \psi_{BG}(i) and \psi_{FG}(i) are the costs for node i to be labeled as background and foreground, respectively. For the background cost, we utilize the background probability from Eq. (5.8). The foreground cost is defined by the objectness score p_{Obj}(i) obtained by the segmentation network [31]:

\psi_{BG}(i) = -\ln p_{bnn}^{BG}(i),   (5.15)
\psi_{FG}(i) = -\ln p_{Obj}(i).   (5.16)

The pairwise cost encourages similar nodes to be assigned to the same category, reducing errors from static parts of non-rigid objects. We consider both the instance similarity via the embeddings e_i and the motion similarity via the optical flow m_i = [dx_i, dy_i]^T. Specifically, \phi(i, j) is given by

\phi(i, j) = \exp\left(-\frac{||e_i - e_j||_2^2}{\sigma_e^2}\right) + \lambda\, \delta(t_i, t_j) \exp\left(-\frac{||m_i - m_j||_2^2}{\sigma_m^2}\right),   (5.17)

where \sigma_e and \sigma_m are the standard deviations of the Gaussian kernels, \lambda is the importance of the motion similarity relative to the embedding similarity and t_i is the frame index. If seeds i and j are from different frames, the motion similarity term is ignored, as reflected by the Dirac delta term \delta(t_i, t_j), since their optical flows may not be aligned. Although the instance embeddings are trained on images, they are shown to be stable over consecutive frames in [70]. Thus, they are applicable to measuring the similarity across frames. Our studies in Sec. 5.3.4 show that considering this inter-frame similarity is beneficial and that ignoring the temporal edges leads to inferior performance.
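The costs entering the cut of Eq. (5.13) can be assembled as below. The variances and the motion weight are the values reported in the implementation section (sigma_e^2 = 1, sigma_m^2 = 10, lambda = 1); the function names are ours, and in practice an off-the-shelf s-t min-cut / max-flow solver finds the labeling that minimizes the total cost rather than the brute-force evaluation shown here.

import numpy as np

def unary_costs(p_bg_bnn, p_obj, eps=1e-7):
    # Eqs. (5.15)-(5.16): negative log background/foreground evidence per seed.
    return -np.log(np.clip(p_bg_bnn, eps, 1.0)), -np.log(np.clip(p_obj, eps, 1.0))

def pairwise_cost(e_i, e_j, m_i, m_j, same_frame,
                  sigma_e2=1.0, sigma_m2=10.0, lam=1.0):
    # Eq. (5.17): embedding similarity plus, for intra-frame edges only, motion similarity.
    cost = np.exp(-np.sum((e_i - e_j) ** 2) / sigma_e2)
    if same_frame:
        cost += lam * np.exp(-np.sum((m_i - m_j) ** 2) / sigma_m2)
    return cost

def total_cost(labels, cost_bg, cost_fg, edges, edge_costs):
    # Eq. (5.13) for one candidate labeling (1 = foreground, 0 = background).
    unary = np.where(labels == 1, cost_fg, cost_bg).sum()
    pairwise = sum(c for (i, j), c in zip(edges, edge_costs) if labels[i] != labels[j])
    return unary + pairwise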
5.2.4 Label propagation

After the graph cut, the final step is to propagate the labels from the seeds to the remaining pixels. Given an arbitrary pixel i with its temporal-spatial location denoted by (x_i, y_i, t_i)^T, we identify its neighboring seeds on frame t_i by finding its spatially closest seed and the spatial neighbors of that seed in the graph. Besides seeds on frame t_i, we also include the neighboring seeds of the pixels located at (x_i, y_i, t_i - 1)^T and (x_i, y_i, t_i + 1)^T, i.e., the pixels with the same spatial location in the previous and following frames, as shown in Fig. 5.6. The neighboring seed set for pixel i is denoted by N(i). Among the seeds in N(i), the one with the shortest embedding distance to i is found via

n = \arg\min_{m \in N(i)} ||e_i - e_m||_2^2.   (5.18)

The label of seed n is assigned to pixel i. We estimate the probability that pixel i is foreground from the shortest embedding distances to the nodes labeled as foreground and background in N(i), denoted by d_{FG}(i) and d_{BG}(i), respectively. The foreground probability is defined by

p_{FG}(i) = \frac{\exp[-d_{FG}^2(i)]}{\exp[-d_{FG}^2(i)] + \exp[-d_{BG}^2(i)]}.   (5.19)

Note that if the nodes in N(i) are all foreground (or all background), then p_{FG}(i) is defined to be 1 (or 0). Because the resolution of the dense embedding map is lower than that of the original video, we upsample the probability map to the original resolution using multi-linear interpolation and further refine it with a dense conditional random field (CRF) [60].

Figure 5.6: Given an arbitrary pixel (marked by the diamond), its surrounding nodes (in red circles) are identified from the current frame, the previous frame (omitted here) and the following frame. The label of the node with the shortest embedding distance is assigned to the pixel. Best viewed in color.
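Label propagation per Eqs. (5.18)-(5.19) is a nearest-seed lookup in the embedding space. The sketch below takes the already-gathered neighboring seeds N(i) of one pixel as input; gathering them from the graph and the three frames is omitted, and the function name is ours.

import numpy as np

def propagate_label(pixel_emb, neighbor_embs, neighbor_labels):
    # neighbor_embs: (K, E) embeddings of the seeds in N(i);
    # neighbor_labels: (K,) with 1 = foreground, 0 = background.
    d2 = np.sum((neighbor_embs - pixel_emb) ** 2, axis=1)   # squared distances
    labels = np.asarray(neighbor_labels)
    hard_label = int(labels[np.argmin(d2)])                 # Eq. (5.18)
    if labels.min() == labels.max():                        # all FG or all BG
        return hard_label, float(labels[0])
    d_fg, d_bg = np.min(d2[labels == 1]), np.min(d2[labels == 0])
    p_fg = np.exp(-d_fg) / (np.exp(-d_fg) + np.exp(-d_bg))  # Eq. (5.19)
    return hard_label, p_fg

The soft probability returned here is what is subsequently upsampled and refined with the dense CRF.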
5.3 Experiments

5.3.1 Datasets and evaluation metrics

The proposed method is evaluated on the DAVIS 2016 dataset [88] and the Freiburg-Berkeley Motion Segmentation 59 (FBMS-59) dataset [83]. The latter has multiple moving objects labeled separately. Following [70, 105], we convert the annotations to binary masks by grouping the individual object masks.

DAVIS 2016. The DAVIS 2016 dataset [88] contains 50 densely annotated, high-resolution video sequences. It is partitioned into two splits, train and val, with 30 and 20 sequences, respectively. Some videos from this dataset are challenging due to motion blur, occlusion and object deformation. The evaluation metrics include region similarity, boundary accuracy and temporal stability, as proposed in [88]. The region similarity, denoted by J, is defined as the intersection over union (IoU) between the annotation and the predicted mask. To measure the boundary accuracy, denoted by F, the annotation boundary and the predicted mask boundary are compared and the F-score (the harmonic mean of precision and recall) is computed. The temporal stability, T, measures the deformation needed to transform one frame into its succeeding frame; higher deformation means less smooth masks over time. This metric is applied to a subset of sequences in DAVIS 2016, as described in [88].

FBMS-59. 59 video sequences are collected in the FBMS-59 dataset [83]. In contrast to DAVIS, this dataset is sparsely annotated, with masks provided for 720 frames. We test the proposed method on the test split, containing 30 sequences. Apart from the aforementioned region IoU (J), we also use the F-score protocol from [83] for this dataset, to be consistent with previous methods.

5.3.2 Implementation

We train the bilateral network on the DAVIS 2016 train split. For each frame, the 4-D bilateral-space position vector (dx, dy, x, y) of the pixels is normalized within the frame and then input to the BNN. The regular grid size in the bilateral space is set to (40, 40, 18, 18), i.e., G_flow = 40 and G_loc = 18. G_loc is set by referring to [79]. Theoretically, the learnable filters can be any 4-D tensor, but practically, to reduce the number of parameters of the network, the BNN is composed of four cascaded 1-D filters, one for each dimension, initialized as uniform filters. The kernel size of the filters is set to (2 G_loc + 1) for the spatial dimensions and 15 for the optical flow dimensions. The kernel size along the spatial dimensions is large so that the background scores can be propagated to regions far from the stuff region (for example, the background score from the sky can be propagated to a parked car on the ground with similar motion patterns). To train the BNN, we set the batch size to 64 and use a learning rate of 0.0001, with a total of 10k steps. Data augmentation is done by randomly sampling M = 50k pixels in the low-objectness region (p_Obj < 0.001) for splatting. During inference, we pick the M pixels with the lowest objectness scores. The optical flow is computed by a re-implementation of FlowNet2.0 [49].

In addition, we discard frames with unstable optical flow during training by calculating the spatial variance of the optical flow edge. To compute the optical flow edge, we first define the motion discrepancy map D_m(j) as the average distance to the four neighbors, in the form of

D_m(j) = \frac{1}{|\Omega(j)|} \sum_{k \in \Omega(j)} ||m_j - m_k||_2,   (5.20)

where \Omega(j) stands for the four neighbors of pixel j and m_j is the optical flow vector. Then the optical flow edge E(j) is defined by

E(j) = 1 - \exp\{-D_m^2(j)\}.   (5.21)

Finally, the spatial variance of the flow edge is given, as in ARP [57], by

v = \frac{\sum_j E(j)\, ||l(j) - c||_2^2}{\sum_j E(j)},   (5.22)

where l(j) = (x_j, y_j)^T denotes the spatial coordinates of pixel j, and c is the centroid of the flow edge map, given by

c = \frac{\sum_j E(j)\, l(j)}{\sum_j E(j)}.   (5.23)

We exclude the frames with v > 0.15 from training.

For the embedding graph cut, we use the instance embeddings from [31], where the training is conducted on the PASCAL dataset [29]. We do not further fine-tune on any video dataset. The hyper-parameters are determined by cross validation: the pairwise cost weight is 0.1; the variances for instance embeddings and optical flow in Eq. (5.17) are \sigma_e^2 = 1 and \sigma_m^2 = 10. The weight of the motion similarity relative to the embedding similarity, \lambda, is set to 1. For the dense CRF refinement, we use the parameters used for the PASCAL dataset [29] in DeepLab [18].

5.3.3 Performance comparison

DAVIS 2016. The results on DAVIS 2016 are displayed in Tab. 5.1. The proposed method outperforms other methods under the unsupervised scenario in terms of J Mean and F Mean, with improvements of 1.9% and 3.0% over the second best method, IET [70]. Note that in terms of J Mean, our method even achieves slightly better results than some semi-supervised methods on the leaderboard (http://davischallenge.org), OSVOS [13] and MSK [56]. In terms of temporal stability, our method is the second best in the unsupervised category: 1.3% worse than the most stable method, LVO [106]. We provide some visualized results in Fig. 5.7 and more can be found in the supplementary material.

Figure 5.7: Qualitative segmentation results from the DAVIS 2016 val split. The first four sequences feature motion blur, occlusion, large object appearance change, and static objects in the background, respectively.

FBMS-59. The proposed method is evaluated on the test split of the FBMS-59 dataset, which has 30 sequences. The results are listed in Table 5.2. Our method outperforms
89 Figure 5.8: Visual examples from the test split of the FBMS-59 dataset [83]. Best viewed in color. 90 Semi-supervised Unsupervised OA VOS[112] OSVOS[13] MSK[56] SFL[20] LVO[106] MP[105] FSEG[50] ARP[57] IET[70] Ours J Mean" 86.1 79.8 79.7 67.4 75.9 70.0 70.7 76.2 78.5 80.4 J Recall" 96.1 93.6 93.1 81.4 89.1 85.0 83.5 91.1 - 93.2 J Decay# 5.2 14.9 8.9 6.2 0.0 1.3 1.5 7.0 - 4.8 F Mean" 84.9 80.6 75.4 66.7 72.1 65.9 65.3 70.6 75.5 78.5 F Recall" 89.7 92.6 87.1 77.1 83.4 79.2 73.8 83.5 - 88.6 F Decay# 5.8 15.0 9.0 5.1 1.3 2.5 1.8 7.9 - 4.4 T Mean# 19.0 37.8 21.8 28.2 26.5 57.2 32.8 39.3 - 27.8 Table 5.1: The results on theval split of DA VIS 2016 dataset [88]. The proposed method outperforms other unsupervised methods in terms ofJ /F Mean, and is even better than some semi-supervised methods. For the temporal stability (T ), our method is the second best. the second best method, IET [70], in theJ Mean and the F-score by 2% and 0.4%, respectively. We provide visualized segmentation results for the FBMS-59 dataset in Fig. 5.8. Method NLC [30] CUT [55] FST [86] CVOS [103] LVO [106] MP [105] ARP [57] IET [70] Ours J Mean 44.5 - 55.5 - - - 59.8 71.9 73.9 F-score - 76.8 69.2 74.9 77.8 77.5 - 82.8 83.2 Table 5.2: Comparison of theJ mean and the F-score on thetest split of the FBMS-59 dataset [83]. Runtime. The runtime of each module is reported in Tab. 5.3. On average, our method takes about 2.5 seconds to process one frame and we use an NVIDIA GTX Titan X GPU with 12 GB memory. The majority of computation is spent on embedding extraction and dense CRF. In terms of comparison with previous methods, SegFlow [20] is much faster, with 0.3 s/frame, but yields 10+% worse performance. Other methods that produce better results, including [50], [105], [106], [70], [57], do not report the runtime. 5.3.4 Analysis of module contributions Motion-based BNNs. A video clip can be segmented by directly thresholding the back- ground probabilityp BG bnn in Eq. (5.8). That is, a pixel is foreground ifp BG bnn <T bnn and we 91 Module Runtime (s/frame) Device Embedding + objectness 1.2 GPU Optical Flow 0.2 [49] GPU BNN 0.03 GPU Embedding Graph Cut 0.05 CPU Label Propagation 0.09 CPU Dense CRF 0.9 [18] CPU Total 2.47 - Table 5.3: The runtime of each module of the proposed method. The computation bottleneck is embedding extraction and dense CRF. setT bnn = 0:5. This serves as the first baseline and is denoted by “BNN”. Since optical flow is error-prone, raw results from the motion-based BNN are not satisfactory, espe- cially when there are unstable stuff regions, e.g., waves. The second baseline is obtained by adaptively thresholdingp BG bnn by the objectness scorep Obj . Namely, a pixel belongs to foreground ifp BG bnn < p Obj , which effectively eliminates false positives in unstable stuff regions. This baseline is referred as “Obj-BNN”. It combines the motion and objectness signals without utilizing the instance embedding or graph cut (also equivalent to assign- ing label to pixels based on the unary potentials only). Adding objectness boosts the performance of “BNN” by 20.9%, as shown in Tab. 5.4. The motion-based BNN with objectness achieves better results than previous methods using dual-branch CNNs [106], [50], [20] 3 , in terms ofJ Mean on theval split of DA VIS 2016. 
Method    SFL [20]   LVO [106]   FSEG [50]   BNN    Obj-Gaussian-BNN   Obj-BNN
J Mean    67.4       70.1        70.7        53.8   72.3               74.7
F Mean    66.7       -           65.9        50.1   69.6               70.9
Table 5.4: Performance comparison between results of the motion-based BNN and other dual-branch methods on the DAVIS 2016 val split (without CRF refinement).
(Footnote 3: [20] is not as comparable as [106] and [50]. Its motion branch does not take in explicitly computed optical flow (which, on the contrary, is its output), but two consecutive frames instead.)

As mentioned above, the bilateral filters are Gaussian by default. Therefore, we also build a BNN composed of a multivariate Gaussian filter with a diagonal covariance matrix expressed by two parameters, \sigma_{flow} and \sigma_{loc}, the former for the flow dimensions and the latter for the location dimensions. Mathematically, the covariance matrix \Sigma is written as

\Sigma = diag[\sigma_{flow}^2, \sigma_{flow}^2, \sigma_{loc}^2, \sigma_{loc}^2].   (5.24)

We use the frames in the train split of the DAVIS 2016 dataset [88] to find the optimal covariance matrix. More specifically, we apply the BNN with the Gaussian filters (i.e., without training by backpropagation) and obtain the background probability p_{bnn}^{BG} given by Eq. (5.8). We then adaptively threshold it by p_Obj in the same way as the "Obj-BNN" baseline in Tab. 5.4 is obtained. We evaluate the masks obtained by this adaptive thresholding with the region similarity J as the metric on the train split, and record the covariance matrix that gives the best results. This Gaussian BNN is then applied to the val split and the results are again adaptively thresholded by p_Obj, yielding the "Obj-Gaussian-BNN" baseline in Tab. 5.4. After tuning, the optimal J is 79.4 on the train split, and this tuned covariance matrix results in J = 72.3 on the val split of DAVIS 2016, as shown in Tab. 5.4.

The embedding graph. The embedding graph can be constructed in multiple ways depending on how the pairwise cost is defined and how the graph nodes are linked, as summarized in Tab. 5.5. Without the graph cut, the results match the "Obj-BNN" baseline in Tab. 5.4; we present J Mean results with (77.6) and without (74.7) the CRF refinement. We then constructed the embedding graph without temporal edges (i.e., each frame is independent). Three options for the pairwise cost in Eq. (5.17) were tested: considering the similarity in the embedding space only (row 2), the similarity in motion only (row 3), and both (row 4). We then explored adding different temporal dependencies to the full intra-frame model. We connected seeds in consecutive frames sparsely (row 5): for a seed pair from consecutive frames, we check the seed regions formed by the pixels closest to each seed; if their corresponding seed regions spatially overlap by at least one pixel (ignoring the frame index), they are connected by a temporal edge. We also connected seeds in consecutive frames densely (row 6). The variants of the embedding graph cut are evaluated by the J Mean of the final segmentation with seed labels propagated to all pixels. The best performance is observed with both embedding and motion similarities considered and dense temporal edges.

Variant                    Embed.   Motion   Sparse   Dense   J Mean   J Mean (+CRF)
Obj-BNN                    -        -        -        -       74.7     77.6
Similarity features        X        -        -        -       74.3     78.0
for pairwise cost          -        X        -        -       74.8     77.5
                           X        X        -        -       75.7     78.9
Inter-frame seed linking   X        X        X        -       76.2     79.8
                           X        X        -        X       77.3     80.4
Table 5.5: Performance comparison of different pairwise costs and seed linking schemes. Motion similarity and dense temporal edges help to achieve better performance.

Online processing.
The capability to process videos online, where results are generated for frames within a fixed latency, is a desirable feature. Using only preceding frames 4 produces the shortest latency and is causal. To process thet-th frame online, the embed- ding graph is built within a frame windowW , using seeds from frames (tW ) tot for the causal case and (tW ) to (t+W ) for the acausal case. As shown in Tab. 5.6, building the embedding graph with only the current and previous frames does not affect the per- formance much. Note thatW = 0 eliminates temporal edges and gives the appropriately 4 We allow accessing frame (t+1) for optical flow computation for framet. 94 lower results matching Tab. 5.5. We also explore which frames are used for propagating labels from seeds to pixels in Eq. (5.18): in the top row, only the current frame is used to propagate labels; in the middle row, labels are propagated to pixels from seeds in the current and previous frames; in the bottom row, labels are propagated from the current, previous, and following frames. In the acausal case, we found thatW = 1 gave the best performance, with seeds from the previous and the following frames included for label propagation. Causal graph window Acausal graph window Frames for label prop. W = 0 W = 1 W = 5 W = 10 W = 1 W = 5 W = 10 Current 75.3* 76.8 76.9 76.9 77.0 76.9 76.8 +Previous 75.4 77.0 77.0 77.1 77.2 77.0 77.0 +Following 75.7 77.3 77.2 77.3 77.3 77.2 77.2 *Grey cells mark the variants that are causal (i.e., for online processing). Table 5.6: Building the embedding graph with different sets of consecutive frames for online and offline processing. Under the online scenario, we consider a temporal win- dow of length (W + 1) ending at framet. For offline processing, a window of length (2W + 1) centered att is used. For label propagation, using seeds from the previous, the current and the following frames gives the optimal results. This group of variants is evaluated on DA VIS 2016val set withJ Mean (without CRF) as the metric. 5.4 Conclusions A motion-based bilateral network (BNN) is proposed to reduce the false positives from static but semantically similar objects for the VOS problem in this paper. Based on opti- cal flow and objectness scores, a BNN identifies regions with motion patterns similar to those of non-object regions, which help classify static objects as background. The estimated background obtained by the BNN is further integrated with instance embed- dings for multi-frame reasoning. The integration is done by graph cut, and to improve its efficiency, we build a graph consisting of a set of sampled pixels called seeds. Finally, 95 frames are segmented by propagating the label of seeds to remaining pixels. It is shown by experiments that the proposed method achieves the state-of-the-art performance in several benchmarking datasets. 96 Chapter 6 Conclusions and Future Work 6.1 Summary of the Research In this dissertation, we focus on problems related to object localization, including object detection and segmentation in images and videos: 1) object proposal enhancement, which aims at obtaining tight bounding boxes of object candidates; 2) weakly super- vised object detection, where training data is weakly annotated but an accurate object detector is desired; 3) unsupervised video object segmentation, which targets at seg- menting moving objects in videos. 
To enhance the object proposals, we first define an optimal contour for each proposal box, which satisfies two requirements: the edge intensity along the contour should be high and the contour should be close to the initial object proposal. Then we solve the optimal contour searching by applying the shortest path search algorithm. By aligning the initial proposal bounding box with the contour, we manage to improve the recall at higher localization requirement by a large margin. In the weakly supervised object detection problem, we specifically focus on reducing the error from localizing discriminative object parts. We first identify the discriminative regions from a trained classification network and later expand object masks from those regions by training an image segmentation network with a loss that encourages expan- sion. Because the segmentation masks are weakly supervised as well, they may be unreliable and misleading. To address this issue, we combine multiple instance learn- ing and curriculum learning to supervise the detectors with easier images with more 97 reliable segmentation masks first and gradually add harder examples. The proposed weakly supervised detector outperforms the state-of-the-art approaches by more than 3% in mAP. To solve the unsupervised video object segmentation problem, a bilateral network which takes in the optical flow vectors is trained to generate an initial background esti- mation. Since the optical flow computation is not perfect, the estimated background is combined with the instance embeddings and objectness scores to improve the segmen- tation quality. The similarity between pixels is taken into consideration by building a graph of selected pixels and the similarity is measured by the distance between instance embeddings. The effectiveness of the proposed method is demonstrated on the superior performance on several benchmark datasets. 98 6.2 Future Research Directions Recent breakthroughs with deep convolutional neural networks help to reach good per- formance in object localization and segmentation, when sufficient fully annotated train- ing data is available. However, to train the deep CNNs, the amount of required training data is usually huge. For different problems, the labeling effort varies. In the disserta- tion, we have explored weakly supervised object detection to reduce the labeling effort spent on object bounding boxes. A closely related problem is weakly supervised image segmentation. Comparing with bounding box drawing, labeling all pixels obviously takes more time. If we consider the video segmentation problem, annotating every sin- gle frame with an accurate mask is extremely challenging. There are several research directions for training data reduction. The first one is leverage weakly annotated data, such as meta-data, image tags from the internet; the second direction can be domain adaptation, where sufficient labeled training data is available in one domain (e.g. synthesized data) while no or a small amount labeled data are provided for the target domain (e.g. real data), and the goal is to perform well on the target domain. We bring up the following specific problems for future research. Weakly supervised image/video segmentation: leverage the weakly annotated images, which can be image/video tags, bounding boxes of objects or scribbles on objects, to segment images and videos. 
Domain adaptation for image/video segmentation: utilizing synthesized images with pixel-level labels (source domain) and unlabeled real images (target domain) to obtain a model that can segment the latter.

6.2.1 Weakly Supervised Image/Video Segmentation

In the proposed method for weakly supervised object detection (Chapter 4), we use a weakly supervised image segmentation network to grow the object masks from the most discriminative regions; during training, only image-level labels are available. We observe some failure cases, one of which is shown in Fig. 6.1: the saliency map is acceptable (though part of the water region is very salient), but the mask grows too aggressively and covers the water region. This failure is challenging to resolve because "water" and "boat" usually co-occur. Intuitively, SEC [58] encourages mask growing, but growing into weakly correlated regions (e.g., sky) is penalized by negative examples (i.e., images with no boats). As a result, growing into highly correlated regions is preferred, meaning that the mask of "boat" tends to expand into the co-occurring "water".

Figure 6.1: A failed example of weakly supervised segmentation. From left to right: the image, the saliency map, and the expanded masks.

Without additional signals, it is almost impossible to resolve errors caused by highly correlated patterns. In fact, the co-occurring context "water" helps to classify the image as "boat", so the classification network should not be blamed for the high saliency on water. Thus, we need to seek other cheap signals to address this problem. It is observed that optical flow (Chapter 5) provides signals for moving objects. A boat sequence and its optical flow are illustrated in Fig. 6.2. Although the reflection in the water has optical flow very similar to that of the boat, the flow should at least prevent the mask from growing into water regions without reflections.

Figure 6.2: A pair of consecutive frames containing a boat and the optical flow computed by [49].

Optical flow is learned from synthesized videos [49], which means that training data can be obtained cheaply. Since objects move as a whole, the optical flow can usually provide useful information about the extent of a whole object. Therefore, we bring up the following research direction for weakly supervised image/video segmentation: transfer the knowledge of motion patterns learned from synthesized videos to static images, and improve the segmentation performance when only image-level labels are available.

6.2.2 Domain Adaptation for Image/Video Segmentation

With computer graphics techniques, labeled images and videos can be synthesized and thus obtained cheaply. However, due to the gap between synthesized images and real ones (a synthesized image and a real-world image are shown in Fig. 6.3), a model trained on the former usually does not work well on the latter. Therefore, adapting the knowledge learned from synthesized data for segmentation remains a challenging problem.

Figure 6.3: A synthesized image (left) and a real-world image (right).

A possible direction for domain adaptation is curriculum learning [9]. More specifically, we can adapt from the source domain to the target domain gradually by first solving the segmentation of easy images in the target domain. Here, easy images can be the ones that appear more similar to the source domain. Once the model performs well on easy images, pseudo labels can be generated and a model can be trained directly on the target domain.
Harder images with pseudo labels can be gradually added to the training data.

Another direction is image translation, leveraging the generative adversarial network [39] to change the style of the source images and make them appear more similar to target images (or vice versa). The challenge in this direction is that there is no paired data for image translation, because we do not have labels for the target-domain images. CycleGAN [121] provides a solution for unpaired image translation, which may help with the domain adaptation problem for segmentation.

Bibliography

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(11):2274–2282, 2012.
[3] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, volume 29, pages 753–762. Wiley Online Library, 2010.
[4] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 73–80. IEEE, 2010.
[5] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(11):2189–2202, 2012.
[6] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 328–335, 2014.
[7] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[8] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What's the point: Semantic segmentation with point supervision. In European Conference on Computer Vision, pages 549–565. Springer, 2016.
[9] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
[10] H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with posterior regularization. In British Machine Vision Conference, volume 3, 2014.
[11] H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with convex clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1081–1089, 2015.
[12] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[13] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In CVPR 2017. IEEE, 2017.
[14] J. Canny. A computational approach to edge detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):679–698, 1986.
[15] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3241–3248. IEEE, 2010.
[16] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation.
In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3241–3248. IEEE, 2010.
[17] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[18] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[19] X. Chen, H. Ma, X. Wang, and Z. Zhao. Improving object proposals with multi-thresholding straddling expansion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[20] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. SegFlow: Joint learning for video object segmentation and optical flow. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[21] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3286–3293. IEEE, 2014.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
[23] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In European Conference on Computer Vision, pages 452–466. Springer, 2010.
[24] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. International Journal of Computer Vision, 100(3):275–293, 2012.
[25] A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, and L. Van Gool. Weakly supervised cascaded convolutional networks. arXiv preprint arXiv:1611.08258, 2016.
[26] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In The IEEE International Conference on Computer Vision (ICCV), 2013.
[27] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1841–1848, 2013.
[28] I. Endres and D. Hoiem. Category-independent object proposals with diverse ranking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(2):222–234, 2014.
[29] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
[30] A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In BMVC, volume 2, page 8, 2014.
[31] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277, 2017.
[32] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[33] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision (IJCV), 59(2):167–181, 2004.
[34] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and L. Van Gool. DeepProposal: Hunting objects by cascading deep convolutional layers.
In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
[35] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
[36] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
[37] R. Gokberk Cinbis, J. Verbeek, and C. Schmid. Multi-fold MIL training for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2409–2416, 2014.
[38] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[39] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[40] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph based video segmentation. IEEE CVPR, 2010.
[41] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
[42] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
[43] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[44] G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In European Conference on Computer Vision, pages 30–43. Springer, 2008.
[45] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[46] J. Hosang, R. Benenson, and B. Schiele. How good are detection proposals, really? In Proceedings of the British Machine Vision Conference (BMVC), 2014.
[47] A. Humayun, F. Li, and J. M. Rehg. RIGOR: Reusing inference in graph cuts for generating object regions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 336–343. IEEE, 2014.
[48] A. Humayun, F. Li, and J. M. Rehg. The middle child problem: Revisiting parametric min-cut and seeds for object proposals. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1600–1608, 2015.
[49] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
[50] S. D. Jain, B. Xiong, and K. Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[51] V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.
[52] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4452–4461, 2016.
[53] W.-D. Jang and C.-S. Kim. Online video object segmentation via convolutional trident network.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5849–5858, 2017.
[54] V. Kantorov, M. Oquab, M. Cho, and I. Laptev. ContextLocNet: Context-aware deep network models for weakly supervised localization. In European Conference on Computer Vision, pages 350–365. Springer, 2016.
[55] M. Keuper, B. Andres, and T. Brox. Motion trajectory segmentation via minimum cost multicuts. In Proceedings of the IEEE International Conference on Computer Vision, pages 3271–3279, 2015.
[56] A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[57] Y. J. Koh and C.-S. Kim. Primary object segmentation in videos based on region augmentation and reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[58] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, pages 695–711. Springer, 2016.
[59] V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. Adv. Neural Inf. Process. Syst., 2(3):4, 2011.
[60] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pages 109–117, 2011.
[61] P. Krähenbühl and V. Koltun. Geodesic object proposals. In Proceedings of the European Conference on Computer Vision (ECCV), pages 725–739. Springer, 2014.
[62] P. Krähenbühl and V. Koltun. Learning to propose objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1574–1582. IEEE, 2015.
[63] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[64] K. Kumar Singh, F. Xiao, and Y. Jae Lee. Track and transfer: Watching videos to simulate strong human supervision for weakly-supervised object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[65] W. Kuo, B. Hariharan, and J. Malik. DeepBox: Learning objectness with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2479–2487, 2015.
[66] T. Lee, S. Fidler, and S. Dickinson. Learning to combine mid-level cues for object proposal generation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1680–1688, 2015.
[67] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1995–2002. IEEE, 2011.
[68] D. Li, J.-B. Huang, Y. Li, S. Wang, and M.-H. Yang. Weakly supervised object localization with progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3512–3520, 2016.
[69] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013.
[70] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C. J. Kuo. Instance embedding transfer to unsupervised video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[71] S. Li, H. Zhang, J. Zhang, Y. Ren, and C.-C. J. Kuo.
Box refinement: Object proposal enhancement and pruning. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 979–988. IEEE, 2017.
[72] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.
[73] H.-D. Lin and D. G. Messerschmitt. Video composition methods and their semantics. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on, pages 2833–2836. IEEE, 1991.
[74] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[75] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[76] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[77] C. Lu, S. Liu, J. Jia, and C.-K. Tang. Contour box: Rejecting object proposals without explicit closed contours. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2021–2029, 2015.
[78] S. Manen, M. Guillaumin, and L. Gool. Prime object proposals with randomized Prim's algorithm. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2536–2543, 2013.
[79] N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 743–751, 2016.
[80] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pages 2274–2284, 2017.
[81] D. Novotny and J. Matas. Cascaded sparse spatial bins for efficient and effective generic object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1152–1160, 2015.
[82] P. Ochs and T. Brox. Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1583–1590. IEEE, 2011.
[83] P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1187–1200, 2014.
[84] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
[85] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1742–1750, 2015.
[86] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision, pages 1777–1784, 2013.
[87] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. International Journal of Computer Vision, 81(1):24–52, 2009.
[88] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M.
Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[89] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems (NIPS), pages 1981–1989, 2015.
[90] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016.
[91] P. Rantalankila, J. Kannala, and E. Rahtu. Generating object segmentation proposals using global and local search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2417–2424, 2014.
[92] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
[93] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG), volume 23, pages 309–314. ACM, 2004.
[94] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, 2000.
[95] M. Shi and V. Ferrari. Weakly supervised object localization using size estimates. In European Conference on Computer Vision, pages 105–121. Springer, 2016.
[96] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[97] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[98] P. Siva, C. Russell, and T. Xiang. In defence of negative mining for annotating weakly labelled data. In European Conference on Computer Vision, pages 594–608. Springer, 2012.
[99] P. Siva, C. Russell, T. Xiang, and L. Agapito. Looking beyond the image: Unsupervised learning for object saliency and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3238–3245, 2013.
[100] P. Siva and T. Xiang. Weakly supervised object detector learning with model drift detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 343–350. IEEE, 2011.
[101] H. O. Song, R. B. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, T. Darrell, et al. On learning to localize objects with minimal supervision. In ICML, pages 1611–1619, 2014.
[102] H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell. Weakly-supervised discovery of visual pattern configurations. In Advances in Neural Information Processing Systems, pages 1637–1645, 2014.
[103] B. Taylor, V. Karasev, and S. Soatto. Causal video object segmentation from persistence of occlusions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4268–4276, 2015.
[104] E. Teh, M. Rochan, and Y. Wang. Attention networks for weakly supervised object localization. BMVC, 2016.
[105] P. Tokmakov, K. Alahari, and C. Schmid. Learning motion patterns in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[106] P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[107] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images.
In Computer Vision, 1998. Sixth International Conference on, pages 839–846. IEEE, 1998.
[108] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3899–3908, 2016.
[109] R. Tudor Ionescu, B. Alexe, M. Leordeanu, M. Popescu, D. P. Papadopoulos, and V. Ferrari. How hard can it be? Estimating the difficulty of visual search in an image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[110] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, 2013.
[111] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
[112] P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In British Machine Vision Conference, 2017.
[113] C. Wang, L. Zhao, S. Liang, L. Zhang, J. Jia, and Y. Wei. Object proposal by multi-branch hierarchical segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3873–3881, 2015.
[114] W. Wang, J. Shen, and F. Porikli. Saliency-aware geodesic video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3395–3402, 2015.
[115] Y. Xiao, C. Lu, E. Tsougenis, Y. Lu, and C.-K. Tang. Complexity-adaptive distance metric for object proposals generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 778–786, 2015.
[116] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[117] J. S. Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. S. Kweon. Pixel-level matching for video object segmentation using convolutional neural networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2186–2195. IEEE, 2017.
[118] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. In European Conference on Computer Vision, pages 543–559. Springer, 2016.
[119] Z. Zhang, Y. Liu, T. Bolukbasi, M.-M. Cheng, and V. Saligrama. BING++: A fast high quality object proposal generator at 100fps. arXiv preprint arXiv:1511.04511, 2015.
[120] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
[121] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[122] C. L. Zitnick and P. Dollár. Edge Boxes: Locating object proposals from edges. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
Abstract
Object localization is a crucial step for computers to understand an image. An object localizer typically takes in an image and outputs the bounding boxes of objects. Some applications require finer localization - delineating the shape of objects - which is called "object segmentation". In this dissertation, three problems related to object localization have been studied: 1) improving the accuracy of object proposals, 2) reducing the labeling effort for object detector training, and 3) segmenting the moving objects in videos.

Object proposal generation has been an important pre-processing step for object detectors in general and convolutional neural network (CNN) detectors in particular. However, some object proposal methods suffer from the "localization bias" problem: the recall drops rapidly as the localization accuracy requirement increases. Since contours offer a powerful cue for accurate localization, we propose a box refinement method that searches for the optimal contour for each initial bounding box, i.e., the contour that minimizes the contour cost. The box is then aligned with the contour. Experiments on the PASCAL VOC 2007 test dataset show that our box refinement method can significantly improve the object recall at a high overlapping threshold while maintaining a similar recall at a loose one. Given 1000 proposals, the average recall of multiple existing methods is increased by more than 5% with our box refinement process integrated.

The second research problem is motivated by the fact that convolutional neural network based object detectors usually require a large number of accurately annotated object bounding boxes. In contrast, image-level labels are much cheaper to obtain. Thus, we supervise the detectors with image-level labels only. A common drawback of such a training setting is that the detector often outputs bounding boxes of discriminative object parts (e.g., a box around a cat's face). To address this challenge, we incorporate object segmentation into the detector training, which guides the model to correctly localize the full objects. We propose the multiple instance curriculum learning (MICL) method, which injects curriculum learning (CL) into the multiple instance learning (MIL) framework. The MICL method starts by automatically picking the easy training examples, where the extent of the segmentation mask agrees with the detection bounding boxes. The training set is gradually expanded to include harder examples to train strong detectors that handle complex images. The proposed MICL method with segmentation in the loop outperforms the state-of-the-art weakly supervised object detectors by a substantial margin on the PASCAL VOC datasets.

In the third part, we propose a method for unsupervised video object segmentation by transferring the knowledge encapsulated in image-based instance embedding networks. The instance embedding network produces an embedding vector for each pixel that enables identifying all pixels belonging to the same object. Though trained on static images, the instance embeddings are stable over consecutive video frames. To reduce the false positives from static objects, a motion-based bilateral network is trained to estimate the background, which is later integrated with instance embeddings into a graph. We classify graph nodes by defining and minimizing a cost function, and segment the video frames based on the node labels.
The proposed method outperforms previous state-of-the-art unsupervised video object segmentation methods on several benchmark datasets.