ADVANCED VISUAL SEGMENTATION TECHNIQUES: ALGORITHM DESIGN AND PERFORMANCE ANALYSIS

by

Xiang Fu

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(MING HSIEH DEPARTMENT OF ELECTRICAL ENGINEERING)

December 2015

Copyright 2015 Xiang Fu

Dedication

This dissertation and my research work are dedicated to my girlfriend, my family, and the many friends who have supported me throughout my PhD program. I owe a special debt of gratitude to my girlfriend, Becky. Without her words of encouragement, assistance, and consideration in daily life, I simply would not have graduated and finished this thesis. I also appreciate the constant support from my parents, Yongfu Fu and Yaping Pan, who gave me the opportunity to study new science and technology at a first-class university in the United States. Furthermore, I am particularly grateful to my friend, Jian, for his encouragement, help, cooperation, and companionship in various courses and projects, co-op projects, screening exams, research topics, social activities, and more.

Acknowledgments

First, I will be forever grateful to my PhD advisor, Prof. C.-C. Jay Kuo, for his appreciation, his countless hours of reflection and meetings, and his motivation and guidance throughout the entire PhD process. He is not only my academic advisor but also my life mentor. It was a memorable experience and my great honor to pursue the PhD program at the Media Communications Lab of the University of Southern California. Second, I wish to acknowledge my mentor from Microsoft Research Asia in China, Dr. Changhu Wang, for his patience, suggestions, and encouragement in my research and paper writing. For around three years we held a one-hour conference call every two weeks. Finally, I would like to thank my committee members for my qualifying exam and my defense, who were generous with their precious time and profound knowledge. They are Prof. C.-C. Jay Kuo (chair), Prof. Ram Nevatia, Prof. Alexander Sawchuk, Prof. Panayiotis Georgiou, Prof. Yan Liu, and Prof. Justin Haldar. Their feedback is tremendously valuable for my research and this dissertation.

Contents

Dedication ii Acknowledgments iii List of Tables vii List of Figures viii Abstract xv 1 Introduction 1 1.1 Significance of the Research . . . . . . 1 1.2 Contributions of the Research . . . . . . 8 1.3 Organization of the Dissertation . . . . . . 11 2 Basic Segmentation Tools 13 2.1 Color Space . . . . . . 13 2.2 SLIC Superpixels . . . . . . 15 2.3 FH Merger . . . . . . 17 2.4 Hierarchical Graph-Based Video Segmentation . . . . . . 19 2.5 Paint Selection . . . . . . 20 2.6 Phase Congruency . . . . . . 20 2.7 Contrast Limited Adaptive Histogram Equalization . . . . . . 23 2.8 Hu and Haan's Blur Estimator . . . . . . 23 2.9 Guided Filter . . . . . . 26 2.10 Matting Laplacian . . . . . . 27 2.11 Mean Shift Segmentation . . . . . .
28 2.12 Texture Indicator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.13 Spectral Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3 Interactive Video Object Segmentation 35 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2 Hierarchical Supervoxel Graph Generation . . . . . . . . . . . . . . . . 39 iv 3.2.1 Hierarchical Supervoxel Graph . . . . . . . . . . . . . . . . . . 39 3.2.2 Bottom-Level Supervoxels . . . . . . . . . . . . . . . . . . . . 41 3.2.3 Higher-Level Supervoxels . . . . . . . . . . . . . . . . . . . . 43 3.3 Interactive Hierarchical Supervoxel Representation . . . . . . . . . . . 45 3.4 Video Object Segmentation Framework . . . . . . . . . . . . . . . . . 48 3.4.1 Hierarchical Supervoxel Graph Generation . . . . . . . . . . . 50 3.4.2 Interactive Object Representation . . . . . . . . . . . . . . . . 50 3.4.3 Representation Propagation and Refinement . . . . . . . . . . . 50 3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.5.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 52 3.5.3 Performance of Hierarchical Supervoxel Graph . . . . . . . . . 53 3.5.4 Performance of Video Object Segmentation . . . . . . . . . . . 55 3.5.5 Time Cost Analysis . . . . . . . . . . . . . . . . . . . . . . . . 57 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4 Image Segmentation with Fused Contour, Surface and Depth Cues 60 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.2 Three Elementary Cues . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2.1 Color Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2.2 1D Contour Cue . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.2.3 2D Surface Cue . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.2.4 3D Depth Cue . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3 Region-Dependent Spectral Segmentation . . . . . . . . . . . . . . . . 79 4.3.1 Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . 79 4.3.2 Contour-Guided Surface Merger . . . . . . . . . . . . . . . . . 80 4.3.3 Region-Dependent Spectral Graph . . . . . . . . . . . . . . . . 82 4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.4.2 Improvement from Contour-Guided Surface Merger . . . . . . 86 4.4.3 Performance of Region-Dependent Spectral Segmentation . . . 86 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5 Robust Image Segmentation Using Contour-guided Color Palettes 93 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.3 Contour-guided Color Palette Method . . . . . . . . . . . . . . . . . . 96 5.3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.3.2 Color Palette Generation . . . . . . . . . . . . . . . . . . . . . 97 5.3.3 Segment Post-Processing . . . . . . . . . . . . . . . . . . . . . 100 5.4 Comparison of MS and CCP . . . . . . . . . . . . . . . . . . . . . . . 102 5.5 Layered Affinity Models using CCP . . . . . . . . . . . . . . . . . . . 106 v 5.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.7 Conclusion . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . 110 6 Conclusion and Future Work 119 6.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 119 6.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 120 Bibliography 123 vi List of Tables 3.1 Comparison on video object segmentation . . . . . . . . . . . . . . . . 57 4.1 Quantitative comparisons before and after the contour-guided surface merger for Mean Shift on the BSDS300 database. . . . . . . . . . . . . 86 4.2 Quantitative comparisons of our approaches with other segmentation methods on the BSDS300 Dataset, where the best two results are high- lighted in red (best) and blue (second best). . . . . . . . . . . . . . . . 87 5.1 Comparisons of the numbers of representative colors (upper) and the boundary F-measures (lower) by MS and CCP under three BW parame- ters for the three typical images. We also provide the average results of the entire BSDS300 dataset. . . . . . . . . . . . . . . . . . . . . . . . 104 5.2 Performance comparison of several segmentation methods on the BSDS300 Dataset, where the best two results are highlighted in red (best) and blue (second best). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.3 Performance comparison of three variants of our solution: 1) CCP-3 (Canny) to replace structured edge detection with the Canny detector, 2) CCP-3 (MS) to replace color palette generation with Mean Shift in full color space, 3) CCP-3 (w/o post) to exclude post-processing. . . . . . . 109 vii List of Figures 1.1 Role of Visual Segmentation is bridging the semantic gap. . . . . . . . 1 1.2 How to define “Coherent”? . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Percept tree is used to segment the image hierarchically. This image can be separated into the left woman, the right woman, and the background. Each region can be partitioned in more detailed regions. . . . . . . . . . 4 1.4 The first row shows the original image and the percept tree. The later two rows display two ground truth subjects and their corresponding sub- set of the percept tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.5 Integration of multiple ground truth subjects. The darker pixels indicate the common boundaries, while the brighter pixels indicate the uncom- mon ones. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.6 Two Challenges for Image Segmentation. The challenge area for each image is shown in red circle. (a) Expect to Separate: Objects and Envi- ronment are Quite Involved; (b) Expect to Merge: Textured Regions for One Object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 CIEL ? a ? b ? color space. The vertical L ? axis represents luminance, rang- ing from 0 to 100. The horizontal a ? and b ? axes represent the red-green and yellow-blue opponent channels, both ranging from -128 to 127 in practice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Relationship between CIEL ? c ? h ? and CIEL ? a ? b ? color space. The ver- tical L ? axis is the same. The c ? axis represents chroma, which is the distance to the vertical axis, ranging from 0 to 100. The h ? axis repre- sents hue, which is the angle determined by the arctan of the ratio of a ? and b ? ranging from 0 to 360 . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Hue space in terms of color. . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 SLIC generates “uniform” superpixels by “local” clustering. 
The green stars illustrate the initial sample points are uniformly put in the spatial domain to cover the regions as many as possible. The blue bounding box represents the local regions for SLIC searching. . . . . . . . . . . . 16 2.5 SLIC superpixels with different number of superpixels or levels. All the pixels inside one connected boundary belong to one superpixel. . . . . . 16 viii 2.6 FH Merger segmentation results, where each random color represents one segment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.7 Illustration for paint selection tool. Left three: the user makes a selection by painting the object of interest with a brush (black-white circle) on a large image. Instant feedback (selection boundary) can be provided to the user during mouse dragging. Rightmost: composition and effect (sepia tone). Note that the blur scribbles are invisible to the user. . . . . 20 2.8 Swapping two Fourier components of two images in BSDS300. Top- left: original image1. Top-right: original image2. Bottom-left: IFT of the combination of image1 magnitude and image2 phase. Bottom-right: IFT of the combination of image2 magnitude and image1 phase. . . . . 21 2.9 The appearance of image spectrum is quite similar. Left: image spec- trum of image1. Right: average image spectrum of BSDS300. . . . . . 22 2.10 Key of CLAHE. (a) the key of “AHE” is to apply histogram equalization to overlapped local windows. (b) the key of “CL” is to reshape the noise to avoid noise enhancement. . . . . . . . . . . . . . . . . . . . . . . . 23 2.11 A Thin Lens Model. D is the absolute depth estimation, is the blur size, F is the focal length, v 0 is the distance between lens and focal plane, andf is the aperture number of the lens. . . . . . . . . . . . . . 24 2.12 Block diagram of Hu and Haan’s blur estimator. . . . . . . . . . . . . . 25 2.13 Sparse defocus maps in BSDS300. Black pixels indicate the in-focus regions, and white pixels indicate the blur regions. . . . . . . . . . . . . 26 2.14 Key idea of the guided filter. A guidance image is considered in the filtering output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.15 Matting laplacian result of an image example. (a) An image with sparse constraints: white scribbles indicate foreground, black scribbles indicate background. (b) An accurate hand-drawn trimap that is used to produce a reasonable matte (c). . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.16 Comparisons between Mean Shift and FH Merger. (a) original image; (b) Mean Shift; (c) FH Merger. We can see Mean Shift has good bound- ary adherence to the object, but may over-segment the smooth region like the large sky, while FH Merger can merge larger ranges, but it gen- erates uncontrolled small regions around the boundaries. . . . . . . . . 29 2.17 Overview of Spectral Clustering . . . . . . . . . . . . . . . . . . . . . 31 2.18 Overview of Multiple Layer Spectral Segmentation (MLSS) . . . . . . 32 2.19 Bipartite Graph Construction of Segmentation by Aggregating Super- pixels (SAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1 The girl in red in the famous black-and-white movie “Schindler’s List” (best view in color). If the target object in the movie is able to be auto- matically segmented and tracked, it becomes possible to further edit and analyze the extracted object. . . . . . . . . . . . . . . . . . . . . . . . 36 ix 3.2 Supervoxel comparison between (b) GBH and (c) the proposed method for (a) the original video. 
We use one color to indicate the same super- voxel in the same level. For low-level supervoxels of GBH, the super- voxels are scattered with uncontrollable shape and size (the number of the supervoxels is 1024). In contrast, the supervoxels of our proposed supervoxel graph in the same level have compact shape with similar size along the time axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3 Example of hierarchical superovxel graph and its representation of the face, the hands and the legs of the runner. For display purpose, we ignore the time axis. The hierarchical supervoxel graph is shown in the middle three columns. One color indicates the same supervoxel in the same level. For middle and high levels, we use color averaging instead of random color to denote each supervoxel. Based on the hierarchical supervoxel graph, different parts of the runner can be represented by supervoxels at different levels, as shown in the rightmost column. . . . . 39 3.4 Supervoxel comparison between (b) SLIC (using color averaging with- out motion as feature space) and (c) the proposed method (using color histogram with motion as feature space) for (a) the original video. We can see that, for SLIC, some internal regions of the fish are merged to the background, because the average color of this part is quite close to that of the background. Similarly, the underpant is also merged unless the motion is leveraged. But in the proposed method, both cases work quite well, owning to the leverage of color histogram (the fish case) and the motion cue (the underpant case). . . . . . . . . . . . . . . . . . . . 44 3.5 For the video of a man (row 1), the hierarchical supervoxel graph (row 2- 5) shows the connectivity among different supervoxels in terms of two dimensions: time and level. For each frame, disjoint regions that are spatially approximate and coherent in appearance with similar motions will be merged into larger regions; for each level, the regions with the same color or ids indicate the same supervoxel. This graph can be abstracted in row 6. When a user labels the target object in the first frame (row 7 col 1, red sketches are used for display purpose), hierar- chical supervoxel representation for this man is calculated, where each supervoxel can be from different levels in the graph. The corresponding segmentation results are shown in row 7. . . . . . . . . . . . . . . . . . 46 3.6 Illustration of the weights between two supervoxels. When merging supervoxels from lower level to higher level, the weight between two neighbor supervoxels is measured by the combination of distances in CIELAB color histogram, motion direction, and gradient of saliency map. 47 x 3.7 The video object segmentation framework contains three stages. The offline supervoxel representation stage builds hierarchical supervoxel graph based on local clustering and region merging. In the interactive object representation stage, the users label interested objects via Paint Selection tool [LSS09], which is further represented by a compact set of supervoxels obtained by the proposed hierarchical supervoxel selection algorithm. Finally, the representation propagation and refinement stage will reduce segmentation errors to achieve final video object segmentation. 49 3.8 Comparison of hierarchical supervoxel construction. (a) Average bound- ary recall. (b) Under-segmentation error. . . . . . . . . . . . . . . . . . 
54 3.9 Examples of hierarchical supervoxel graphs produced by various meth- ods: (b) GBH, (c) SLIC and (d) the proposed method. (a) original frames; (b) GBH with 64 supervoxels, a closeup of GBH (64); (c) SLIC with 64 supervoxels, a closeup of SLIC (64); (d) the proposed method with 64 supervoxels, and a closeup of our method (64). For the “girl” video, the right arm is occluded for several frames. Our method can suc- cessfully detect it after its reappearance, and group it to the same super- voxel before its disappearance, while other methods failed to detect it or consider it as a new part. Similarly in the “referee” video, the reappeared right hand can only be recaptured by our method. . . . . . . . . . . . . 55 3.10 Visual comparison of video segmentation produced by (b) Adobe After Effects (AAE) and (c) our method for (a) the original video. Every four rows and three columns represent a video example. The first row of each example is labeled frame. These cases are challenging because of large motion, motion blur or ambiguous colors. We can see that the proposed solution outperforms AAE in above cases. . . . . . . . . . . . . . . . . 58 3.11 Our method can segment multiple objects at the same time. . . . . . . . 59 4.1 Original images and their corresponding cue maps. The cue is useful for segmentation of the left image, but is harmful for segmentation of the right image for each row. (a) Contour cue. (b) Surface cue. (c) Depth cue. 62 4.2 Hue is more reliable than luminance alone for many color images. For each column, (a) is the original image, (b) is the luminance map, and (c) is the hue map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Hue is not reliable for gray regions in many color images. For each column, (a) is the original image, (b) is the luminance map, and (c) is the hue map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.4 Weight function for hue in modified CIELCH color space. The horizon- tal axis is the normalized chroma in CIELCH space from 0 to 1, and the vertical axis is the weight for hue. . . . . . . . . . . . . . . . . . . . . 66 xi 4.5 Our 1D contour cue is fused by phase congruency (PC) and Canny detector that are complementary for better contour detection. (a) Orig- inal image. (b) PC edge detection. (c) Canny detection. (d) Our 1D contour cue. PC is better for accurate local variance and true edges for the mountain or the building regions, but it is sensitive to texture and noise inside those regions. Canny detector is better to capture global variance that removes the noise and the detailed textures. . . . . . . . . 69 4.6 Linear enhancement function. The horizontal axis is the original edge strength normalized between 0 and 1, and the vertical axis is the target edge strength. The threshold of the enhancement function is adaptively selected by keeping most of the energies for edges (99.5%). . . . . . . . 69 4.7 The standard deviation of the high frequency components for each image in BSDS300. The left figure is for luminance, and the right figure is for hue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.8 Contour detection results before and after edge link. (a) original color image. (b) float edge map before edge link. (c) contour map after edge link. (d) the binary contour map. For the maps in (c), the connected pixels with the same color represent one edge list. The maps in (d) are the binary version of the maps in (c), which is our 1D contour cue. . . . 
72 4.9 Comparisons between Mean Shift and FH Merger in the CIELAB color space. (a) is the original image, (b) is Mean Shift, and (c) is FH Merger. For each segmentation, the nearby pixels with the same color indicate the same cluster. The number at the bottom-right corner of each result in red is the number of clusters. We can see that Mean Shift provides accu- rate region boundaries but leading to over-segmented regions. In con- trast, FH Merger offers connected regions but with inaccurate bound- aries. Both methods are useful for our surface cue. . . . . . . . . . . . 73 4.10 FH Merger in difference color spaces for comparisons. (a) is the original image, (b) is in the CIELAB color space, (c) is in the CIELCH color space, and (d) is in the modified CIELCH color space. We can see FH Merger in the modified CIELCH space can deal with the gray regions much better than FH Merger in the CIELCH space. . . . . . . . . . . . 74 4.11 Block diagram of our depth estimation. . . . . . . . . . . . . . . . . . 75 4.12 Depth estimation with and without CLAHE. (a) is the original grayscale image, (b) is depth estimation without CLAHE, (c) is enhanced image after CLAHE, and (d) is depth estimation with CLAHE. In the depth maps, the brighter pixels indicate the in-focus regions or the foreground, while the darker pixels indicate the out-of-focus regions or the back- ground. We can see after CLAHE, the details in the gray regions are enhanced, which makes it easier to differentiate the object from the background. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 xii 4.13 Depth estimation with and without guided filter. (a) is the original grayscale image, (b) is depth estimation without guided filter, and (c) is depth estimation with guided filter. We can see after guided filter, the depth variance inside one region becomes much smaller, which helps to group the object regions in the same depth layer. . . . . . . . . . . . . . 77 4.14 Our 3D depth cue generation. (a) Original image. (b) Sparse defocus map by Hu and Haan [HdH06]. (c) Our depth estimation where darker pixels indicate regions out-of-focus, and brighter pixels suggest regions in-focus. (d) FH Merger for depth estimation as our 3D depth cue. The number at the bottom-right corner is the number of clusters. We can see the branches regions and the starfish regions can be merged well using our depth cue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.15 Our fusion framework for image segmentation. . . . . . . . . . . . . . 79 4.16 Some results for the contour-guided surface merger. (a) is the original image, (b) is our contour cue for guidance, (c) is our surface cue by Mean Shift, and (d) is our contour-guided surface map by Mean Shift. For each merger result, the nearby pixels with the same color indicate the same cluster. The number at the bottom-right corner in red is the number of clusters. We can see the further mergers from our surface- guided merger can remove many regions with fake boundaries. The number of clusters for the surface map can be reduced by around half in average. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.17 Region-Dependent Spectral Graph. . . . . . . . . . . . . . . . . . . . . 83 4.18 Visual comparisons of RDS segmentation results against two state-of- the-art methods MLSS and SAS. . . . . . . . . . . . . . . . . . . . . . 89 4.19 Visual comparisons of RDS segmentation results against two state-of- the-art methods MLSS and SAS. . . . . . . . . . . . . . . . 
. . . . . . 90 4.20 Visual comparisons of RDS segmentation results against two state-of- the-art methods MLSS and SAS. . . . . . . . . . . . . . . . . . . . . . 91 4.21 Visual comparisons of RDS segmentation results against two state-of- the-art methods MLSS and SAS. . . . . . . . . . . . . . . . . . . . . . 92 5.1 The block diagram of the proposed contour-guided color palette (CCP) method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.2 Illustration of banded region of interest (B-ROI) for color palette gen- eration. (a) shows a B-ROI with a bandwidth of 2. Pixels are sampled from both sides of the contour with a uniform stepsize . (b) is an example image overlaid with detected long contours. Long contours are indicated by different colors. (c) and (d) are the zoom-in of two local B- ROI’s of (b). In (c), the colors of the pixels labeled in red along one side of the B-ROI look similar without an obvious jump. In (d), the colors of the pixels labeled in red change with one large jump. . . . . . . . . . . 98 xiii 5.3 Comparison of CCP segmentation results before and after post-processing. Please focus on the squared regions in red: (a) shows a long contour straddled by two regions with similar colors; (c) shows the fake bound- ary in the sky due to gradual color transition; (e) shows small seg- ments in the background building region. (b), (d), and (f) are the post- processed results of (a), (c), and (e), respectively. . . . . . . . . . . . . 99 5.4 Comparisons of segmentation results by MS and CCP for three typical images, with different spectral BW parameters. . . . . . . . . . . . . . 102 5.5 Plots of the cumulative histogram versus representative color indices for MS and CCP on three typical images, where the blue, green and red curves are obtained using large, medium and small spectral BW parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.6 Visual comparisons of segmentation results of CCP-LAM and CCP- LAS against two state-of-the-art methods MLSS and SAS. . . . . . . . 110 5.7 Visual comparisons of segmentation results of CCP-LAM and CCP- LAS against two state-of-the-art methods MLSS and SAS. . . . . . . . 111 5.8 Visual comparisons of segmentation results of CCP-LAM and CCP- LAS against two state-of-the-art methods MLSS and SAS. . . . . . . . 112 5.9 Visual comparisons of segmentation results of CCP-LAM and CCP- LAS against two state-of-the-art methods MLSS and SAS. . . . . . . . 113 5.10 Visual comparisons of segmentation results of CCP-LAM and CCP- LAS against two state-of-the-art methods MLSS and SAS. . . . . . . . 114 5.11 Visual comparisons of segmentation results of CCP-LAM and CCP- LAS against two state-of-the-art methods MLSS and SAS. . . . . . . . 115 5.12 Visual comparisons of segmentation results of CCP-LAM and CCP- LAS against two state-of-the-art methods MLSS and SAS. . . . . . . . 116 5.13 Visual comparisons of segmentation results of CCP-LAM and CCP- LAS against two state-of-the-art methods MLSS and SAS. . . . . . . . 117 5.14 Visual comparisons of segmentation results of CCP-LAM and CCP- LAS against two state-of-the-art methods MLSS and SAS. . . . . . . . 118 xiv Abstract In this dissertation, we study two research topics: 1) how to interactively represent and segment the object(s) in a video; and 2) how to reliably generate automatic image segmentation. 
To handle the temporal dynamics of the multiple components of objects in an image sequence, we present an interactive hierarchical supervoxel representation for video object segmentation. First, a hierarchical supervoxel graph with various granularities is built based on local clustering and region merging to represent the video, in which both color histograms and motion information are leveraged in the feature space, and visual saliency is also taken into account as merging guidance to build the graph. Then, a supervoxel selection algorithm is conducted to select supervoxels with diverse granularities to represent the object(s) labeled by the user. Finally, based on the above tools, an interactive video object segmentation framework is proposed to handle complex and diverse scenes with large motion and occlusions. The effectiveness of the proposed algorithms in supervoxel graph construction and video object segmentation is shown by the experimental results.

Although the proposed interactive video object segmentation in the first part of this dissertation can extract regions of interest based on the user's input in the key frames, it is not a fully automatic algorithm. To address this problem, we target the problem of automatic image segmentation in the following two parts of the dissertation.

The 1D contour and 2D surface cues have been widely utilized in existing work. However, the 3D depth information of an image, a necessary cue according to human visual perception, is overlooked in image segmentation. In this dissertation, we study how to fully utilize the 1D contour, 2D surface, and 3D depth cues for image segmentation. First, three elementary segmentation modules are developed for these three cues, respectively. Then, we propose a fusion framework with two strategies to segment an image: the contour-guided surface merger (CGS) and the region-dependent spectral graph (RDS). CGS is adopted to further merge two adjacent regions with weak boundaries, guided by the contour cue. RDS is designed to organize region-based layers (surface and depth) in the spectral graph with initial weights set according to the type of region. For non-textured regions, the algorithm relies on the contour cue, while for textured regions, the depth cue plays a more important role. Extensive experiments not only show the superior performance of the proposed algorithm over state-of-the-art algorithms, but also verify the necessity of all three cues in image segmentation.

To further improve automatic image segmentation, the contour-guided color palette (CCP) is proposed in the third part of the dissertation, which efficiently integrates the contour and color cues of an image. To find representative colors of an image, we collect color samples from both sides of long contours and conduct the mean-shift (MS) algorithm in the sampled color space to define an image-dependent color palette. This color palette provides a preliminary segmentation in the spatial domain, which is further fine-tuned by post-processing techniques such as leakage avoidance, fake boundary removal, and small region mergence. The segmentation performances of CCP and MS are compared and analyzed. While CCP offers an acceptable standalone segmentation result, it can be further integrated into the framework of layered spectral segmentation to produce a more robust segmentation. The superior performance of the CCP-based segmentation algorithm is demonstrated by experiments on the Berkeley Segmentation Data Set.
Chapter 1

Introduction

1.1 Significance of the Research

Machines deal with visual data in pixels or voxels. Humans, however, care about semantic categories, such as the sky, mountain, and tree regions in Fig. 1.1, rather than low-level representations. To this end, we have to separate the visual data into several coherent segments. In our perspective, the role of visual segmentation is to bridge the semantic gap: from knowledge of human perception and image characteristics, we can design segmentation algorithms, and from segment descriptors and statistical techniques, we can relate the segments and scenes to semantic categories.

Figure 1.1: The role of visual segmentation is to bridge the semantic gap, mapping images to coherent segments and then to semantic categories such as sky, mountain, building, tree, forest, people, water, garden, and auto.

Why is visual segmentation so significant? It can be regarded as a vital preprocessing step for various advanced computer vision applications, such as object representation and recognition [FPX+12], 3D modeling, scene parsing and categorization, image/video retrieval, and visual content analysis. For example, if we could achieve a good segmentation, then the shapes, internal features, and geometric constraints of the segments would lead to a better object detector. It would help scene parsing and visual retrieval as well.

The purpose of visual segmentation is to simplify the representation of an image and/or a video. The desired image segmentation divides the pixels of an image into several disjoint coherent groups, each of which represents an object or a part of an object in a scene. All the pixels or voxels inside one segment share coherent properties, while those in different segments have distinct properties. A video can be regarded as 2.5-dimensional visual data: besides the spatial information of the image sequence, it carries abundant motion and interactions between objects in the temporal domain, which lead to more redundant information. Similarly, video segmentation groups coherent voxels both spatially and temporally.

The key problem we have to answer is how to define "coherent" for the visual data. From lower level to higher level, we may consider coherent luminance, spatial distance, color, texture or blurriness, motion, shapes/edges, material type, and even the semantic object. From a subset of these attributes, we can define the specific notion of "coherent" for a particular application.

Figure 1.2: How to define "coherent"? The figure arranges candidate cues from lower level to higher level, including single color or monochrome, coherent luminance, spatial closeness, coherent texture or blurriness, similar motions, similar shapes and edges, coherent material type, and the semantic object.

Formally, a segmentation result can be represented by a binary pixel or voxel affinity function $F$, where $F(i,j) = 1$ when pixels or voxels $i$ and $j$ belong together and $F(i,j) = 0$ otherwise. The number of segments depends on the visual content. Generally, between 2 and 30 segments is likely to be appropriate for an image, but it is important that all the segments have approximately the same importance [MFTM01].

However, it is challenging to judge the performance of a segmentation result. According to the human understanding of an image, people catch the salient objects and cut the image level by level. One example is illustrated in Fig. 1.3 using a percept tree, where different colors denote different segments. Globally, this image can be separated into the left woman, the right woman, and the background.
Each region can be further partitioned into more detailed regions. Nevertheless, humans always segment the image at some level that is not too detailed; when we focus on a specific region, more details are exposed for further segmentation. Consequently, the ground-truth segmentation of an image is quite diverse, depending on the level of interest. If we ask different people to segment the same image independently, we will get several ground-truth subjects. If the percept tree is considered as the full structure of the image content, each subject covers only a subset of the percept tree, as illustrated in Fig. 1.4 [MFTM01].

Figure 1.3: A percept tree is used to segment the image hierarchically. This image can be separated into the left woman, the right woman, and the background. Each region can be partitioned into more detailed regions.

If we integrate the boundaries from different subjects, we can see that the common boundaries are relatively more significant and correspond to high-level segments, while the uncommon boundaries indicate more detailed segments. Fig. 1.5 illustrates some examples from the Berkeley Segmentation Data Set (BSDS300), where the darker pixels indicate the common boundaries and the brighter pixels indicate the uncommon ones. BSDS300 is a dataset containing multiple "ground truth" subjects produced by humans for images of a wide variety of natural scenes [MFTM01].

Figure 1.4: The first row shows the original image and the percept tree. The latter two rows display two ground-truth subjects and their corresponding subsets of the percept tree.

Figure 1.5: Integration of multiple ground-truth subjects. The darker pixels indicate the common boundaries, while the brighter pixels indicate the uncommon ones.

There are at least two challenges for visual segmentation, arising from the conflicts between low-level representations and semantic understanding. One is the area that people expect to separate: the object and the environment are so involved that they cannot be separated by low-level features such as luminance or color. This leads to a severe leakage problem, and it often happens with the protective coloring of animals or with blurred boundaries, as shown in Fig. 1.6(a). The other is the area that people expect to merge: the edges in textured regions are so sharp that it is hard to merge them together. This leads to an over-segmentation problem. Some examples are illustrated in Fig. 1.6(b). The challenging area of each image is marked with a red circle.

Figure 1.6: Two challenges for image segmentation. The challenging area of each image is marked with a red circle. (a) Expect to separate (similar colors): objects and environment are quite involved. (b) Expect to merge (textured regions): textured regions belong to one object.

According to the purpose and variety of segmentation, there are two objectives: interactive segmentation and automatic segmentation. Interactive segmentation divides an image or a video into a foreground region and a background region with the help of human annotations. It requires human annotations on foreground and background samples as seeds to segment the entire visual data, which leverages semantic understanding from the human. On the other hand, automatic segmentation partitions an image or a video into several regions (usually more than two segments) according to low-level features alone, which carry no semantic meaning. That is why automatic segmentation is even more challenging.
We start with research on interactive video object segmentation. The purpose is to divide a video into a foreground region and a background region according to the regions of interest specified by the user. The user needs to label the foreground region(s) in the key frames of the video, and the system propagates the segmentation to the following frames. We propose to use supervoxel lists from different levels to represent the video object(s) labeled by the user.

Although our interactive video object segmentation can extract the user's regions of interest well, it cannot segment the visual data into multiple meaningful groups automatically. To represent the main video object(s), we have to use around five to twenty supervoxels, which is still far from enough. In addition, the hierarchical supervoxel graph is actually too heavy for video representation.

To segment the visual data into several groups more intelligently for further content analysis in the future, we study automatic image segmentation. Although a great number of segmentation algorithms have been proposed in the last few decades, no single set of segmentation rules applies well to all images. The desired segmentation methods are supposed to be image-driven or region-driven.

In this thesis, we attempt to solve the automatic image segmentation problem from a new angle, which fuses three elementary cues: the 1D contour cue, the 2D surface cue, and the 3D depth cue. Unlike conventional methods that need a predefined number of segments, this approach has the advantage of determining the number of segments automatically.

To further improve automatic image segmentation, the contour-guided color palette (CCP) is proposed to efficiently integrate the contour and color cues of an image. To reduce the complexity of an image in the color domain, we collect color samples from both sides of long contours and conduct the Mean Shift (MS) algorithm in the sampled color space to define an image-dependent color palette. While CCP offers an acceptable standalone segmentation result, it can be further integrated into the framework of layered spectral segmentation to produce a more robust segmentation.

All in all, visual segmentation plays a significant role in computer vision applications. The diversity and complexity of visual content make it challenging and difficult.

1.2 Contributions of the Research

In this research, we propose an interactive video object segmentation framework, an effective region-dependent spectral graph using depth information, and the contour-guided color palette (CCP) scheme for automatic image segmentation, which involve the following points:

- To simplify the representation of a video, a hierarchical supervoxel graph with various granularities is built based on local clustering and region merging, in which both color histograms and motion information are leveraged in the feature space, and visual saliency is also taken into account as merging guidance to build the graph.

- A supervoxel selection algorithm is introduced to choose supervoxels with diverse granularities to represent the object(s) labeled by the user.

- We study the problem of how to represent and segment the objects in a video shot. To handle the motion and variations of the internal regions of objects, we present an interactive video object segmentation framework to handle complex and diverse scenes with large motion and occlusions.

- We design a region-dependent spectral approach for automatic image segmentation using three elementary cues.
The 1D contour cue finds the discontinuity between two regions. The 2D surface cue ensures similarity inside one region. The 3D depth cue combines complicated texture regions into several groups. These three cues work together for a scalable and robust image segmentation solution.

- For color image analysis, to make use of the luminance invariance of the hue space and to reduce the influence of hue in gray regions, we modify the CIELCH color space, which is the cyclic version of the uniform CIELAB color space. We set a small weight for the hue channel in gray areas.

- We propose a contour detection scheme using phase congruency [Kov03] and the Canny detector [Can86] in the CIELAB color space, which are complementary to each other and achieve better performance together. Phase congruency is better at capturing accurate local variance and true edges but is sensitive to noise and textures, while the Canny detector is better at catching global variance, which helps to remove noise and detailed textures. The contour detection is further cleaned by the edge link algorithm [Kov00]. The contour detection is the 1D cue of the image for our fusion approach.

- We present depth estimation using Hu and Haan's blur estimator [HdH06]. We apply CLAHE [Zui94] as a preprocessing step to improve the details in dark regions, followed by the guided filter [HST13] to remove the textures and noise inside regions, and the matting Laplacian [LLW08] to propagate the sparse defocus map to the entire image. Depth estimation achieves better region merging for textures since they share similar blurriness. Depth estimation is the 3D cue of the image for our fusion approach.

- To simplify the surface maps and the depth map into a reasonably small number of segments, we apply Mean Shift to the CIELAB space and the modified CIELCH space, and apply the FH Merger [FH04] to the CIELAB space, the modified CIELCH space, and the depth estimation.

- Surface maps by Mean Shift are further simplified by our proposed contour-guided surface merger (CGS). In Mean Shift results, we can see many fake boundaries between two regions. The key idea is that if there are no strong contours between two regions from Mean Shift, then they are merged.

- We propose a fusion framework with two strategies to segment an image: the contour-guided surface merger (CGS) and the region-dependent spectral graph (RDS). CGS is adopted to further merge two adjacent regions with weak boundaries, guided by the contour cue. RDS is designed to organize region-based layers (surface and depth) in the spectral graph with initial weights set according to the type of region. For non-textured regions, the algorithm relies on the contour cue, while for textured regions, the depth cue plays a more important role.

- The contour-guided color palette (CCP) is proposed to efficiently integrate the contour and color cues of an image. To find representative colors of an image, we collect color samples from both sides of long contours and conduct the Mean Shift (MS) algorithm in the sampled color space to define an image-dependent color palette. This color palette provides a preliminary segmentation in the spatial domain, which is further fine-tuned by post-processing techniques such as leakage avoidance, fake boundary removal, and small region mergence. The segmentation performances of CCP and MS are compared and analyzed. While CCP offers an acceptable standalone segmentation result, it can be further integrated into the framework of layered spectral segmentation to produce a more robust segmentation.
1.3 Organization of the Dissertation

The rest of this dissertation is organized as follows. All the basic segmentation tools used in our research are introduced in Chapter 2, including a discussion of color spaces, SLIC superpixels, the FH Merger, hierarchical graph-based video segmentation, paint selection, phase congruency, contrast limited adaptive histogram equalization, Hu and Haan's blur estimator, the guided filter, the matting Laplacian, Mean Shift (MS) segmentation, a texture indicator, and spectral segmentation. Then, we propose a hierarchical supervoxel graph construction algorithm for video representation and design an interactive supervoxel representation to model target objects labeled by users in Chapter 3. In Chapter 4, a region-dependent segmentation approach is designed to fuse three cues for automatic image segmentation. It includes the development of the three elementary cues and the proposed region-dependent fusion approach. To derive the three elementary cues, we describe several new tools, including a modified color space, a contour detection scheme, simplified surface maps, and a depth estimation method. The proposed fusion approach contains a contour-guided surface merger (CGS) and a region-dependent spectral graph (RDS). In Chapter 5, the contour-guided color palette (CCP) is proposed to efficiently integrate the contour and color cues of an image. It involves a system overview, color palette generation, and segment post-processing steps, including leakage avoidance by contours, fake boundary removal, and small region mergence. The comparisons between CCP and MS are analyzed on some typical instances. CCP can be further integrated into the framework of layered spectral segmentation to produce a more robust segmentation. Experimental results are given in Chapters 3, 4, and 5 to demonstrate the superiority of the proposed algorithms on the well-known datasets, respectively. Finally, concluding remarks and future research directions are given in Chapter 6.

Chapter 2

Basic Segmentation Tools

In this chapter, we recall the basic segmentation tools that we apply in our research on interactive video object segmentation and automatic image segmentation.

2.1 Color Space

CIE L*a*b* (CIELAB) is the popular uniform color space specified by the International Commission on Illumination, which describes all the colors visible to human eyes and serves as a device-independent reference model [Wik14b]. It is widely used in image processing and computer vision for its uniformity with respect to the human visual system (HVS).

Fig. 2.1 illustrates the CIE L*a*b* color space. The vertical L* axis represents luminance, ranging from 0 to 100. The horizontal a* and b* axes represent the red-green and yellow-blue opponent channels, both ranging from -128 to 127 in practice [Con14].

Figure 2.1: CIE L*a*b* color space. The vertical L* axis represents luminance, ranging from 0 to 100. The horizontal a* and b* axes represent the red-green and yellow-blue opponent channels, both ranging from -128 to 127 in practice.

CIE L*c*h* (CIELCH), the cyclic version of the CIE L*a*b* color space, has the advantages of uniformity and explicit hue information. Fig. 2.2 shows the relationship between CIE L*c*h* and CIE L*a*b*. The vertical L* axis is the same. The c* axis represents chroma, which is the distance to the vertical axis, ranging from 0 to 100. The h* axis represents hue, which is the angle determined by the arctangent of the ratio of b* to a*, ranging from 0 to 360 degrees.
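The chroma and hue relations just described are straightforward to compute. As an illustration only (a sketch of ours, not code from the dissertation, and assuming scikit-image as the conversion tool), the snippet below converts an RGB image to CIELAB and derives the CIELCH chroma and hue channels:

```python
import numpy as np
from skimage import data, color

rgb = data.astronaut()                  # any RGB image
lab = color.rgb2lab(rgb)                # L* in [0, 100], a*/b* roughly in [-128, 127]
L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]

# CIELCH from CIELAB: chroma is the distance to the L* axis,
# hue is the angle of (a*, b*) mapped to [0, 360) degrees.
chroma = np.hypot(a, b)
hue = np.degrees(np.arctan2(b, a)) % 360.0

# Low-chroma pixels behave as gray areas (the threshold here is purely illustrative).
gray_mask = chroma < 10
print(L.shape, chroma.max(), hue.min(), hue.max(), gray_mask.mean())
```

Pixels with small chroma are exactly the gray areas discussed next, which is why later chapters assign hue a small weight in those regions.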
Hue is a cyclic dimension, an angle that represents the color information, as shown in Fig. 2.3. We can see that the areas with small chroma in the CIELCH color space are gray areas [Con14], while the areas far from the vertical axis represent colorful areas.

Figure 2.2: Relationship between the CIE L*c*h* and CIE L*a*b* color spaces. The vertical L* axis is the same. The c* axis represents chroma, which is the distance to the vertical axis, ranging from 0 to 100. The h* axis represents hue, which is the angle determined by the arctangent of the ratio of b* to a*, ranging from 0 to 360 degrees.

Figure 2.3: Hue space in terms of color.

2.2 SLIC Superpixels

SLIC [ASS+12] stands for simple linear iterative clustering, which tries to produce uniform superpixels for an image by local K-means clustering. By default, the only parameter of the algorithm is $k$, the desired number of approximately equally-sized superpixels. Five dimensions are used to measure the distance $d_{ij}$ between two superpixels, as shown in Eqn. 2.1. Three of them describe the color difference $d_c$ in the CIELAB color space discussed in Sec. 2.1, and the other two describe the spatial distance $d_s$:

$d_{ij} = \sqrt{d_c^2 + \left(\frac{m}{S}\right)^2 d_s^2}$, with $d_c^2 = (l_i - l_j)^2 + (a_i - a_j)^2 + (b_i - b_j)^2$ and $d_s^2 = (x_i - x_j)^2 + (y_i - y_j)^2$,   (2.1)

where $S = \sqrt{N/k}$, $N$ is the number of pixels of the color image, and $m$ is a weight that balances the importance of color similarity and spatial proximity.

This algorithm can be summarized in one sentence with three key words: "compact" and "uniform" superpixels by "local" clustering.

For "compact", the weight $m$ in Eqn. 2.1 determines the compactness of the superpixel result. If $m$ is larger, the spatial distance has more weight, and the superpixels tend to stay in place without large movement. If $m$ is smaller, the color distance has more weight, and the superpixels tend to track the appearance of the object, possibly with strange shapes and some errors. Setting this factor is a trade-off.

For "uniform", the $k$ initial sample points for superpixels are placed uniformly in the spatial domain. In this way, the superpixels are distributed uniformly to cover the image as fully as possible. The green stars in Fig. 2.4 illustrate the "uniform" property.

For "local", pixels are grouped in the neighborhood of the initial positions. Any pixel inside a bounding box of side length $S$ is compared with all the superpixel centroids inside a bounding box of side length $2S$ for local clustering. This is done iteratively until convergence to a stable clustering result. The blue bounding box in Fig. 2.4 represents the "local" region that SLIC searches.

Figure 2.4: SLIC generates "uniform" superpixels by "local" clustering. The green stars illustrate that the initial sample points are placed uniformly in the spatial domain to cover the regions as fully as possible. The blue bounding box represents the local region for SLIC searching.

Fig. 2.5 shows several results of SLIC superpixels with different numbers of superpixels, or levels. All the pixels inside one connected boundary belong to one superpixel. Desirable superpixels should adhere well to object boundaries, and over-segmentation is not a big issue for superpixels.

Figure 2.5: SLIC superpixels with different numbers of superpixels or levels. All the pixels inside one connected boundary belong to one superpixel.
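As a usage note rather than part of the dissertation, scikit-image ships an implementation of SLIC that exposes the two parameters of Eqn. 2.1 directly; the minimal sketch below assumes scikit-image is available, with `n_segments` playing the role of $k$ and `compactness` the role of $m$:

```python
import numpy as np
from skimage import data, segmentation, color

img = data.astronaut()

# n_segments ~ k, compactness ~ m in Eqn. 2.1: a larger compactness weights
# spatial proximity more heavily and yields more regular, compact superpixels.
labels = segmentation.slic(img, n_segments=256, compactness=10)

# Display each superpixel by its average color (a Fig. 2.5-style visualization).
avg = color.label2rgb(labels, img, kind='avg', bg_label=-1)
print(len(np.unique(labels)), "superpixels")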
2.3 FH Merger

The FH Merger, proposed by Felzenszwalb and Huttenlocher in 2004 [FH04], is a bottom-up merging algorithm from pixels to regions that reduces the number of clusters needed to represent an image. The main steps are as follows.

Firstly, they generate a graph $G = (V, E)$ for the image according to the adjacent-neighborhood model or the k-nearest-neighborhood model for each pixel.

Secondly, starting from each pixel as a node, they measure the distance between two nodes $\mathrm{Diff}(C_1, C_2)$ as the smallest weight among all the edges between the two nodes,

$\mathrm{Diff}(C_1, C_2) = \min_{v_i \in C_1,\ v_j \in C_2,\ (v_i, v_j) \in E} \omega(v_i, v_j)$,   (2.2)

and the internal difference of each node $\mathrm{Int}(C)$ as the maximum weight among all the minimum spanning tree (MST) edges of the node,

$\mathrm{Int}(C) = \max_{e \in \mathrm{MST}(C, E)} \omega(e)$.   (2.3)

Thirdly, the edge weights between neighboring nodes are sorted in ascending order.

Fourthly, the following test is applied iteratively to all the edges, from small to large weight, in the order of step 3. If the edge weight $\omega(e)$ between two nodes is less than or equal to the minimum internal difference of the two nodes $\mathrm{MInt}(C_1, C_2)$, these two nodes are merged:

$\omega(e) \le \mathrm{MInt}(C_1, C_2)$, where $\mathrm{MInt}(C_1, C_2) = \min\!\left(\mathrm{Int}(C_1) + \frac{\tau}{|C_1|},\ \mathrm{Int}(C_2) + \frac{\tau}{|C_2|}\right)$,   (2.4)

$|C|$ is the number of pixels in the node $C$, and $\tau$ is used to control the size of the desired regions. Increasing $\tau$ leads to larger regions, while a small $\tau$ leads to smaller regions. The minimum internal difference of the merged node is then updated accordingly.

Fifthly, after checking all the edges of step 3, the algorithm returns a label for each pixel as the segmentation result. Pixels sharing the same label belong to the same cluster.

The key concept of the FH Merger is that two neighboring nodes are merged if the difference between them is less than each node's internal variation. The number of clusters from the FH Merger is obtained automatically for a given tightness parameter. Some segmentation results are shown in Fig. 2.6, where each random color represents one segment.

Figure 2.6: FH Merger segmentation results, where each random color represents one segment.

2.4 Hierarchical Graph-Based Video Segmentation

Hierarchical graph-based video segmentation [GKHE10] is built upon the FH Merger for image segmentation discussed in Sec. 2.3. They extend it to video by constructing a graph over the spatio-temporal video volume with edges based on a 26-neighborhood in 3D space-time. Both appearance coherence and motion coherence are considered when grouping voxels into spatio-temporal regions. They also make several modifications, as follows.

Firstly, a fixed $\tau$ in Eqn. 2.4 does not control the desired region size well. Increasing $\tau$ leads to larger regions but with inconsistent and unstable boundaries, while a small $\tau$ leads to a consistent over-segmentation whose region size is too small for many applications. To overcome this drawback, they begin with a small $\tau$ and scale the minimal region size as well as $\tau$ by a factor $s = 1.1$ at each step of the hierarchy. This tree-structured region hierarchy allows selection of the desired segmentation at any desired granularity level.

Secondly, the internal variation $\mathrm{Int}(C)$ in Eqn. 2.3 becomes increasingly unreliable for discriminating between homogeneous and textured regions as their sizes grow. They use a region-based descriptor in the form of CIELAB histograms with 20 bins to overcome this limitation.
Thirdly, they also propose a parallel out-of-core algorithm and a clip-based processing algorithm that divides the video into overlapping clips in time to deal with memory issues.

The key concept of hierarchical graph-based video segmentation is still that two neighboring regions are merged if their color difference is less than or equal to each region's internal variation.

2.5 Paint Selection

Paint Selection, proposed in [LSS09], is a progressive painting-based tool for local selection in images that is able to provide instant feedback during selection on multi-megapixel images. Unlike conventional painting operations, users need not paint over the whole object. Instead, the selection is automatically expanded from the user's paint brush and aligned with the object boundary, as shown in Fig. 2.7. The efficiency comes from a progressive selection algorithm and two optimization techniques: multi-core graph cut and adaptive band upsampling. It can be applied to interactive image segmentation.

Figure 2.7: Illustration of the Paint Selection tool. Left three: the user makes a selection by painting the object of interest with a brush (black-white circle) on a large image. Instant feedback (the selection boundary) is provided to the user during mouse dragging. Rightmost: composition and effect (sepia tone). Note that the blur scribbles are invisible to the user.

2.6 Phase Congruency

In this section, we start with a discussion of the importance of phase, followed by phase congruency for edge detection [Kov03].

Phase, an important signal component, is often ignored in favor of magnitude. Phase is highly immune to noise and contrast distortion, which is desirable for edge detection, image segmentation, image representation, and related tasks [Ska10].

Qualitatively, phase carries more visual information than magnitude, which can be demonstrated by swapping the two Fourier components, phase and magnitude, of two images from BSDS300, as in Fig. 2.8. The reconstructed image appears more similar to the image whose Fourier phase was used in the reconstruction [Ska10].

Figure 2.8: Swapping two Fourier components of two images in BSDS300. Top-left: original image 1. Top-right: original image 2. Bottom-left: inverse Fourier transform of the combination of image 1's magnitude and image 2's phase. Bottom-right: inverse Fourier transform of the combination of image 2's magnitude and image 1's phase.

Globally, phase is almost equally distributed across the entire range of frequencies, while the magnitude decays in an exponential manner with increasing frequency. Statistically, each image has far more low-frequency components than high-frequency components, as shown in Fig. 2.9; the differences lie only in the exponential shape. That is why the statistical average spectra of natural images are so similar. Therefore, the information that differentiates between images is not encoded in the Fourier magnitude spectra, but in the phase [Ska10].

Figure 2.9: The appearance of image spectra is quite similar. Left: image spectrum of image 1. Right: average image spectrum of BSDS300.

Phase congruency (PC), a phase-based edge detector, was introduced by Peter Kovesi [Kov03]. It is able to deal effectively with noise and changes in contrast, which cannot be resolved by traditional gradient-based edge detection. The basic idea is that Fourier components are maximally in phase at significant signal features. It also makes use of the advantages of local phase.
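Before moving on to local phase, the magnitude/phase swap of Fig. 2.8 is easy to reproduce. The sketch below is our own illustration (not code from the dissertation), using two stand-in grayscale images cropped to a common size rather than the BSDS300 pair:

```python
import numpy as np
from skimage import data, img_as_float

# Two grayscale images standing in for the BSDS300 pair of Fig. 2.8.
img1 = img_as_float(data.camera())
img2 = img_as_float(data.coins())
h = min(img1.shape[0], img2.shape[0])
w = min(img1.shape[1], img2.shape[1])
img1, img2 = img1[:h, :w], img2[:h, :w]

F1, F2 = np.fft.fft2(img1), np.fft.fft2(img2)

# Combine the magnitude of one image with the phase of the other (Fig. 2.8, bottom row).
mag1_phase2 = np.real(np.fft.ifft2(np.abs(F1) * np.exp(1j * np.angle(F2))))
mag2_phase1 = np.real(np.fft.ifft2(np.abs(F2) * np.exp(1j * np.angle(F1))))

# Each reconstruction resembles the image that contributed the phase, not the magnitude.
print(mag1_phase2.shape, mag2_phase1.shape)
```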
Local phase has several desirable qualities: immunity to illumination change, immunity to zero-phase types of noise, and a value bounded in [-π, π]. According to the definition in [MO87], PC identifies points where the spread of the phases of the various Fourier components of a signal is locally minimal, and it can be calculated by Eqn. 2.5,

PC = \frac{\left|\sum_{\omega} X(\omega)\right|}{\sum_{\omega} \left|X(\omega)\right|}, (2.5)

where the numerator is the magnitude of the sum of the frequency components, and the denominator is the sum of the magnitudes of the frequency components. PC equals 1 if and only if all the phase components are equal, which indicates that the phase spread is zero. Phase congruency describes the non-homogeneity of an image in terms of edge strength.

2.7 Contrast Limited Adaptive Histogram Equalization

Contrast limited adaptive histogram equalization (CLAHE) [Zui94] is suitable for improving the local contrast of an image and bringing out more details. At the same time, it restrains noise amplification in relatively homogeneous regions. The key of AHE, shown in Fig. 2.10(a), is to apply histogram equalization in overlapping local windows, but this also enhances noise. The contrast-limiting step of CLAHE overcomes this problem by reshaping (clipping and redistributing) the histogram spikes caused by noise, as shown in Fig. 2.10(b) [Wik14a]. The overall process brings out more details across the image to different degrees.

Figure 2.10: Key of CLAHE. (a) The key of "AHE" is to apply histogram equalization to overlapping local windows. (b) The key of "CL" is to reshape the noise spikes in the histogram to avoid noise enhancement.

2.8 Hu and Haan's Blur Estimator

In this section, we first analyze the foundation of depth estimation from blurriness, followed by an introduction to the low-cost robust blur estimator of Hu and Haan [HdH06].

When an image is captured with a finite depth of field, the regions away from the focal plane are out of focus and appear blurred. Thus an image that includes objects both in focus and out of focus can be segmented in terms of depth. If we are able to measure the degree of blur, we can also produce a depth map of the image [Zam12].

According to [Zam12], the amount of blur in an image increases with depth. Thus, if we can estimate the amount of blur, we can estimate the relative depth. If all the camera parameters were known, it would even be possible to measure absolute depth using Eqn. 2.6, derived by Pentland (1987) for depth-from-defocus [Pen87] according to the thin lens model shown in Fig. 2.11 [ZS11]:

D = \frac{F v_0}{v_0 - F - \sigma f}, (2.6)

where D is the absolute depth estimate, σ is the blur size, F is the focal length, v_0 is the distance between the lens and the focal plane, and f is the aperture number of the lens.

Figure 2.11: A thin lens model. D is the absolute depth estimate, σ is the blur size, F is the focal length, v_0 is the distance between the lens and the focal plane, and f is the aperture number of the lens.

A method for blur estimation based on the grayscale image was proposed by Hu and Haan in 2006 [HdH06]. In their method, they re-blur the signal with different Gaussian kernels and determine the local blur of the signal, which is illustrated by the block diagram in Fig. 2.12. The signal s(x) is convolved with Gaussian kernels of standard deviations σ_a and σ_b, leading to s_a(x) and s_b(x). The ratio r(x) is computed in Eqn. 2.7 to make the blur estimate independent of amplitude and offset.
r(x) = \frac{s(x) - s_a(x)}{s_a(x) - s_b(x)}, (2.7)

Figure 2.12: Block diagram of Hu and Haan's blur estimator. The grayscale input s(x) is re-blurred with Gaussian kernels of standard deviations σ_a and σ_b; the difference ratio r(x) is formed from the resulting differences, a maximum filter produces r_max(x), and the blur map σ is computed from it.

The difference ratio highlights the points where the blurring has the most impact. They assume that the blur is locally constant in such an area, and thus apply a maximum filter with a certain window, which results in r_max(x). Assuming σ_a < σ_b, the final blur estimate can be calculated by Eqn. 2.8:

\sigma = \frac{\sigma_a \sigma_b}{(\sigma_b - \sigma_a)\, r_{max}(x) + \sigma_b}. (2.8)

Fig. 2.13 shows some sparse defocus maps in BSDS300, where black pixels indicate the in-focus regions and white pixels indicate the blurred regions. We can see that the in-focus regions always appear on the boundaries of the objects.

Figure 2.13: Sparse defocus maps in BSDS300. Black pixels indicate the in-focus regions, and white pixels indicate the blurred regions.

2.9 Guided Filter

The guided filter [HST13] is an edge-preserving smoothing operator, like the bilateral filter [TM98], but it is faster and avoids the staircase and gradient-reversal artifacts of the bilateral filter. The key idea of the guided filter is shown in Fig. 2.14: a guidance image is incorporated into the filtering output. The guidance image can be the input image itself (as in the bilateral filter) or a different image. Furthermore, the guided filter has an exact linear-time algorithm, regardless of the kernel size and the intensity range, which makes it fast.

Figure 2.14: Key idea of the guided filter. A guidance image is considered in the filtering output.

2.10 Matting Laplacian

The matting Laplacian [LLW08] was proposed in the image matting field to produce an alpha matte that separates the foreground from the background of a given image with limited user input. An image is a composite of foreground and background weighted by the alpha matte, as illustrated in Eqn. 2.9:

I = \alpha F + (1 - \alpha) B. (2.9)

To solve this ill-posed problem, instead of restricting the estimation to a small part of the image or performing iterative nonlinear estimation, they derive a cost function from local smoothness assumptions on the foreground and background colors, obtaining a cost that is quadratic in alpha. Local windows are used to enable propagation of information between neighboring pixels. The globally optimal alpha matte is obtained by solving a sparse linear system of equations. Fig. 2.15 illustrates the matting Laplacian result for an example image.

Figure 2.15: Matting Laplacian result for an example image. (a) An image with sparse constraints: white scribbles indicate foreground, black scribbles indicate background. (b) An accurate hand-drawn trimap that is used to produce a reasonable matte (c).

In this thesis, we apply the matting Laplacian to propagate the sparse defocus map to a full depth map.

2.11 Mean Shift Segmentation

Mean Shift segmentation is an advanced clustering-based segmentation technique, which finds local maxima of the probability density estimated from samples in the feature space. Compared with K-Means segmentation, Mean Shift is a non-parametric iterative algorithm that follows the data distribution and does not assume anything about the number of clusters. However, it is sensitive to the selection of the bandwidth in the feature space.

Neither Mean Shift nor the FH Merger (Sec. 2.3) behaves well when a large bandwidth or threshold is set for aggressive merging: regions are merged unpredictably. However, both are widely used to produce over-segmentation or superpixel results.
The difference is that Mean Shift has good boundary adherence to the object, but may over-segment 28 the smooth region like the large sky, while FH Merger can merge larger ranges, but it generates uncontrolled small regions around the boundaries. The comparisons between Mean Shift and FH Merger are shown in Fig. 2.16. (a) (b) (c) Figure 2.16: Comparisons between Mean Shift and FH Merger. (a) original image; (b) Mean Shift; (c) FH Merger. We can see Mean Shift has good boundary adherence to the object, but may over-segment the smooth region like the large sky, while FH Merger can merge larger ranges, but it generates uncontrolled small regions around the boundaries. 2.12 Texture Indicator HP Lab presented a textured area detection in images using a disorganization indicator based on component counts [BNR08]. The key issue is to differentiate textured regions from homogeneous regions and edge regions. They proposed a block-based solution, and indicate the texture level for each pixel using the weighted sum of indexes for dif- ferent blocks that contain that pixel. Since texture information is always located in the luminance channel, the input of the texture indicator is grayscale image. For each block with fixed size, they separate it into two groups, either using the mean color of the block for thresholding, or taking any segmentation method (like K-Means with K = 2). The component count for each groupp blockSize (number of connected com- ponents with 8-connected neighborhoods) and the contrast between these two groups 29 c blockSize (difference between the mean color) are calculated with normalization in the range [0, 1]. The multiplication of the sum of the component counts and the contrast indicates the texture level for the block. For textured block, both the component counts for both groups and the contrast between these two groups will be high; for homogenous block, even if the component counts will be high, the contrast between the two groups will be very small; for edge block, even if the contrast between the two groups is high, the component counts will be really small. The texture indicator is the weighted sum of the multiplication for different blocks, as shown in Equ. 2.10. For the final texture map, 0 indicates the pure homogeneous pixel, and 1 indicates the fully textured pixel. TextureIndicator =! 25 c 25 p 25 +! 10 c 10 p 10 +! 5 c 5 p 5 (2.10) Where weight! 25 +! 10 +! 5 = 1. Compared with the traditional variance for different blocks, the texture indicator has two advantages. For edge block with high contrast, the variance will be relatively high because both groups are very spread out around the mean and from each other. However, the texture indicator depending on the component counts and the contrast will overcome this problem because the component counts will be sufficiently small, leading to small texture likelihood. For textured block with small contrast, the variance will be relatively small because both groups are very close to the mean. In contrast, the component counts will be large enough to make the texture likelihood large. Therefore, compared with the variance, the texture indicator has better performance to differentiate textured regions from homogeneous or edge regions. 30 2.13 Spectral Segmentation Segmentation using spectral clustering technique is called spectral segmentation. Fig. 2.17 shows the overview of spectral clustering [Azr08]. 
After preparing all the data in the graph, we connect them refer to the rules and measure the affinity between each pair of nodes to get the affinity matrix. From affinity matrix, we can solve graph laplacian, and calculate the smallest K eigenvectors to form the K-dimensional feature space for each node. After clustering using K-means and projecting back on the original data, the final clustering result is achieved. The main difference between algorithms is the definition of graph laplacian from W to L. Data to cluster Affinity Matrix W Smallest K Eigenvectors of Graph Laplacian Clustering (K-Means) Projecting back on the original data Figure 2.17: Overview of Spectral Clustering Spectral clustering is a way to solve relaxed version of normalized cuts [SM00], which is a NP-hard problem. The method integrates the global image information into the grouping process. The key issue for spectral clustering is that the affinity matrix should be sparse to avoid NP-hard problem. Thus traditionally, each pixel just connect to the neighbor pixels within a fixed radius. This solution will neglect all the relations between two pixels far away. To overcome this problem and cover more connections, the advanced spectral graph using superpixels come out, such as Multi-Layer Spectral Segmentation (MLSS) [KLL13] and Segmentation by Aggregating Superpixels (SAS) [LWC12]. Both meth- ods include the superpixels as nodes, and enlarge the connection between two pixels 31 far away through the adjacent superpixels. Compared with the traditional region-based methods, advanced spectral graph using superpixels can merge large texture regions and reduce the over-segmentation problem. The key contribution of Multiple Layer Spectral Segmentation (MLSS) is to con- struct multi-layer graph and borrow the idea of semi-supervised learning to learn graph affinities, which is illustrated in Fig. 2.18. Figure 2.18: Overview of Multiple Layer Spectral Segmentation (MLSS) For multi-layer graph construction, they took all the pixels in the image and all the superpixels from different methods or parameters as nodes, like Mean Shift superpix- els. There are three kinds of edges to connect nodes. First is adjacent pixels in the original image; second is neighbor superpixels in the same layer; third is pixel and its corresponding superpixel. For these three kinds of connections, the connection between pixels or between superpixels are intra-layer edges, which encoded color information. The connection between pixel and superpixel are inter-layer edges, which added implicit boundary information. 32 For affinity measure, they learned the full affinities using semi-supervised learning technique. Besides the smoothness term in the left of Equ. 2.11, they added another fitting term in the right to force the affinities to be more close to the initial labels. Dif- ferentiating is applied to obtain closed form of iteration, shown in Equ. 2.12. " n m = N X i;j=1 ! ij n im n jm 2 + N X i=1 d i n im b im d i 2 (2.11) b n m =c(D (1c) W ) 1 b b m (2.12) In fact, this learned affinities can be represented by the inverse of a sparse matrix. Thus instead of taking the largest eigenvectors, they took the smallest eigenvector for the inverse matrix, which helps to largely reduce the complexity. The advantages of MLSS include the sparseness of graph. There are only three kinds of connections between nodes. Also they used multiple superpixel results to reduce the uncertainty about region shapes. 
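Before turning to the next method, the generic pipeline of Fig. 2.17 can be made concrete with a small dense sketch in Python (our own toy implementation with a Gaussian affinity; practical spectral segmentation methods such as MLSS and SAS use sparse, superpixel-based affinities instead):

import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(points, k, sigma=1.0):
    # 1. Affinity matrix W from pairwise squared distances.
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # 2. Normalized graph Laplacian L = I - D^(-1/2) W D^(-1/2).
    d = W.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(points)) - inv_sqrt_d[:, None] * W * inv_sqrt_d[None, :]
    # 3. The smallest k eigenvectors of L give a k-dimensional embedding.
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    embedding = vecs[:, :k]
    # 4. Cluster in the embedded space and project labels back to the data.
    _, labels = kmeans2(embedding, k, minit='++')
    return labels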
Another method is called Segmentation by Aggregating Superpixels (SAS) using bipartite graph partitioning. In their bipartite graph construction, shown in Fig. 2.19, they put all the pixels and superpixels from different methods on one side, and put the same superpixels on the other side. There are just two kinds of connections. First is pixel and its correspond- ing superpixel; second is neighbor superpixels that are closer in the feature space. The advantage is that there is no connection between pixels, which largely reduce the com- plexity. The key idea is that they could transfer the eigenspace for the side of superpixels to the other side with all the pixels. This transfer cuts technique helps to save lots of time. 33 Figure 2.19: Bipartite Graph Construction of Segmentation by Aggregating Superpixels (SAS) 34 Chapter 3 Interactive Video Object Segmentation 3.1 Introduction Interactive video object segmentation plays a significant role in region of interest extrac- tion in diverse video-related applications, such as movie post-production, object bound- ary tracking, pose estimation, content based information retrieval, security/surveillance, event detection, and video summarization. Decades ago, we were impressed by the girl in red in the famous black-and-white movie “Schindler’s List”, as shown in Fig. 3.1. If the target object in the movie is able to be automatically segmented and tracked, it becomes possible to further edit the extracted object or put it to other backgrounds. It can also help further analyze the motion, pose, or activity of that object. However, how to effectively represent and segment moving objects from a video remains a big chal- lenge, due to the diversity of scenes with temporal incoherence, motion blur, occlusion, etc. In the last decade, several classic interactive video segmentation systems were devel- oped. In 2005, Li et al.[LSS05] extended the traditional graph cut algorithm [BJ01] to a 3D version for videos, but they only considered the color information and neglected other information cues. Wang et al.[WBC + 05] designed an interactive hierarchical ver- sion of Mean Shift [CM02] and min-cut [SM00] scheme for video object segmentation. In spite of smoothness preservation across both space and time, its user interface is not natural for common users and thus limits its applications. In 2009, Price et al.[PMC09] applied a graph-cut optimization framework and propagated the selection forward frame 35 Figure 3.1: The girl in red in the famous black-and-white movie “Schindler’s List” (best view in color). If the target object in the movie is able to be automatically segmented and tracked, it becomes possible to further edit and analyze the extracted object. by frame using many cues. Bai et al.[BWSS09] proposed the Video SnapCut system based on localized classifiers, which was further transferred to Adobe After Effect CS5 in 2010 as a state-of-the-art video object segmentation system. However, this system can only process no more than twenty frames one time, and cannot extract objects with large motion and occlusions. In addition, most of existing methods on video segmen- tation were restricted to object-level segmentation, which neglect abundant information inside the regions of moving objects. To make a complete representation for each patch of a moving object in the video, we base the video object segmentation framework on the supervoxel representation of a video. 
Supervoxel is a 3D extension of superpixel in image representation, and describes the motion status of a patch along the time axis. Typical superpixel and supervoxel algo- rithms were summarized in [ASS + 12] and [XC12], including Normalized Cut [SM00], Mean Shift [CM02], graph-based [FH04], Quick Shift [VS08], TurboPixels [LSK + 09], Linear Spectral Clustering (LSC) [LC15], etc. However, most of the existing super- voxel algorithms only use one-level supervoxels to represent a video for further analysis, which cannot support arbitrary object representation interactively labeled by a user. As 36 shown in Fig. 3.3, if the supervoxel granularity is small (low level), the object will con- sist of too many trivial regions which is hard to model or track; while if the granularity is large (high level), the supervoxels cannot cover small or narrow parts of objects such as hands. Thus, a hierarchical supervoxel representation is needed. Recently Grundmann et al. [GKHE10] extended the image segmentation approach [FH04] (GB, dicussed in Sec. 2.3) and presented a hierarchical supervoxel structure with different granularities (GBH, dicussed in Sec. 2.4) for video representation. However, it is intrinsically a local approach focusing on the color variance of local regions, and thus suffers from the problems of occluded objects and scattered regions with uncon- trollable shape and size for low-level supervoxels (Fig. 3.2). Some latest work on super- voxel such as trajectory binary partition tree [PS13], temporal superpixels [CWI13], and temporally consistent superpixels [RJRO13] also showed improvements on supervoxel generation. But they did not further study how to leverage supervoxel structure in video object representation and segmentation. In this chapter, we propose a novel hierarchical supervoxel graph construction approach in a bottom-up manner to represent a video. First, the bottom-level super- voxels are obtained by local clustering using color and motion cues. Then, higher-level supervoxels are composed of lower-level ones based on a region merging algorithm by leveraging color, motion, and visual saliency of objects. The proposed graph has poten- tial to overcome some challenges suffered in existing work such as occlusion, motion blurring, and inaccurate merging (Fig. 3.9 and 3.10). Given the constructed supervoxel graph, we propose a hierarchical supervoxel selec- tion algorithm to represent a target object using supervoxels with different granularities. The target object can be labeled using the efficient Paint Selection tool [LSS09] (dis- cussed in Sec. 2.5), and thus provides an easy-to-use interface. The supervoxel selec- tion algorithm tries to use the least number of supervoxels to fit the labeled regions. As 37 (a) (b) (c) Figure 3.2: Supervoxel comparison between (b) GBH and (c) the proposed method for (a) the original video. We use one color to indicate the same supervoxel in the same level. For low-level supervoxels of GBH, the supervoxels are scattered with uncontrol- lable shape and size (the number of the supervoxels is 1024). In contrast, the supervoxels of our proposed supervoxel graph in the same level have compact shape with similar size along the time axis. shown in Fig. 3.3, the hands and the legs can be represented and segmented at the same time even if they obviously belong to different levels. Based on the proposed video and object representations, an interactive video object segmentation framework is proposed. 
Extensive experiments show the effectiveness of the proposed algorithms compared with state of the arts in supervoxel graph construction and video object segmentation. The rest of the chapter is organized as follows. The supervoxel graph construc- tion algorithm for videos and interactive supervoxel representation algorithm for target 38 Original Low Level Middle Level High Level Representation Figure 3.3: Example of hierarchical superovxel graph and its representation of the face, the hands and the legs of the runner. For display purpose, we ignore the time axis. The hierarchical supervoxel graph is shown in the middle three columns. One color indicates the same supervoxel in the same level. For middle and high levels, we use color averaging instead of random color to denote each supervoxel. Based on the hierarchical supervoxel graph, different parts of the runner can be represented by supervoxels at different levels, as shown in the rightmost column. objects are introduced in Sec. 3.2 and 3.3, followed by the video object segmentation framework in Sec. 3.4. In Sec. 3.5 and 3.6, we show the experimental results and conclude the chapter. 3.2 Hierarchical Supervoxel Graph Generation In this section, we present how to build the hierarchical supervoxel graph for video representation. First, the hierarchical supervoxel graph is briefly introduced. Then the detailed graph construction algorithm is presented, including bottom-level supervoxel generation followed by higher-level generation. 3.2.1 Hierarchical Supervoxel Graph As a 3D extension of superpixel for video representation [RM03], supervoxel tries to capture internal motions and variations of the video, and simplify it into disjoint 3D 39 regions in some level. Different algorithm and parameter configurations will result in supervoxels with different granularities. Smaller granularity means lower level and finer representation, while bigger granularity corresponds to higher level and coarser repre- sentation. As shown in Fig. 3.3, low-level supervoxels can illustrate the detail parts like hair, neck, and hands; while high-level ones can illustrate the large parts like arms, clothes, legs, or even the entire human body. We try to build a tree structured supervoxels with different granularities, so that each (3D) region can have multiple representations, e.g. a small number of big-granularity supervoxels, or a large number of small-granularity supervoxels, all combinations of different granuarities. This is the foundation of interactive video object representation to be introduced in Sec. 3.3. In mathematics, hierarchical supervoxel graph can be described as follows. For a given videoV , we consider it as a spatio-temporal lattice = T, where denotes the 2D pixel lattice for each frame and T is the time space. Each supervoxel can be represented as s l r , where level l 2 f1; 2; ;Lg and region r 2 f1; 2; ;R (l)g. L is the largest available level of the hierarchical supervoxel graph, and R (l) is the largest region id at levell. The hierarchical supervoxel graph results inL levels of indi- vidual supervoxelsS= S 1 ;S 2 ; ;S L , where each level S l is a set of supervoxels n s l 1 ;s l 2 ; ;s l R(l) o , having s l r ;[ r s l r = , and s l i \s l j = ; for every (i;j) pair. In some frames, some region idr might be missing due to occlusion or large motion, which means region id does not necessarily include all the regions from 1 toR (l). 
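In implementation terms, one simple way to hold such a hierarchy (a sketch under our own naming, not the exact data structure of the system) is one integer label volume per level over the spatio-temporal lattice, plus a child-to-parent map between consecutive levels:

import numpy as np

class SupervoxelHierarchy:
    # labels[l] assigns every pixel of the (T, H, W) lattice to a region id
    # at level l; parent[l] maps a level-l region id to the id of the
    # level-(l+1) supervoxel that contains it.
    def __init__(self, video_shape, num_levels):
        self.labels = {l: np.zeros(video_shape, dtype=np.int32)
                       for l in range(1, num_levels + 1)}
        self.parent = {l: {} for l in range(1, num_levels)}

    def region_ids(self, level):
        # Ids need not run contiguously from 1 to R(l): a region can be
        # absent from some frames because of occlusion or large motion.
        return np.unique(self.labels[level])

    def mask(self, level, region_id):
        # Boolean spatio-temporal support of supervoxel s^level_region.
        return self.labels[level] == region_id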
We first introduce how to generate the bottom-level supervoxelsS 1 , followed by the construction of other levels based on the bottom level. 40 3.2.2 Bottom-Level Supervoxels We first extend the SLIC algorithm [ASS + 12] (dicussed in Sec. 2.2) to video analysis, and then proposed a modified version for robust supervoxel generation. Extension of SLIC to Videos SLIC [ASS + 12] tries to produce uniform superpixels for the image by local K-means clustering. It was claimed to yield state-of-the-art adherence to image boundaries and outperform existing methods when used for segmentation with high efficiency. In this work, we extend SLIC to the video scenario. Each video shot is considered as a spatio-temporal 3D cube and each pixel is one node in the graph. Local clustering is applied for the whole video space to build the bottom-level supervoxels. Starting from initial uniform-distributed seeds as the centroids of the supervoxels, each pixel in the space-time neighborhood of the supervoxel seed will be reassigned to the closest supervoxel in the space-time neighborhood. The distanced ij between the pixeli and the supervoxel centerj, shown in Eqn. 3.1, is measured by CIELAB color distanced c and space-time distanced s in [ASS + 12]. d ij = p d 2 c + d 2 s ; d 2 c = (l i l j ) 2 + (a i a j ) 2 + (b i b j ) 2 ; d 2 s = (x i x j ) 2 + (y i y j ) 2 + (t i t j ) 2 ; (3.1) where is the compact factor to balance between color distance and space-time distance. The supervoxels will be iteratively updated until no update occurs. This convergent local-clustering solution will result in the bottom-level supervoxels. 41 Modified SLIC (MSLIC) To better encode the motion and temporal information when conducting local clustering in videos, we modify the SLIC algorithm in two ways: encode motion feature, and add a spatio-temporal factor when measuring the distance between each pixel and the supervoxel center. First, we add motion vectors to the feature space for pixel reassignment. Since the motion orientation for each pixel in the same bottom-level supervoxel is supposed to be similar, besides the pixel’s color in CIELAB color space [l;a;b] T and the position [x;y;t] T , pixel-based motion vector [v x ;v y ] T is also applied to measure the distance between the pixel and the supervoxel, as shown in Eqn. 3.2. We obtain the motion vectors from advanced optical flow [Liu09]. ^ d ij = p d 2 c + d 2 s +d 2 v ; d 2 v = (v xi v xj ) 2 + (v yi v yj ) 2 ; (3.2) where is the balance factor between motion and the other two spaces: color and space- time. The second modification is about spatio-temporal distance measurement. Since the spatial domain and temporal domain have different measurements, it is not reasonable to use the same weight as SLIC to leverage the two distances. We design a spatio-temporal factorST to balance the spatial and temporal distances. The spatio-temporal distance between the pixeli and the supervoxel centerj is: ^ d 2 s = (x i x j ) 2 + (y i y j ) 2 +ST (t i t j ) 2 : (3.3) 42 In fact, the spatio-temporal factorST depends on the motion of the video, including the movement from both camera and objects. If the video tends to be static,ST could be set large; otherwise, it could be relatively small. 3.2.3 Higher-Level Supervoxels Based on the bottom-level supervoxels, appearance and motion are leveraged to merge neighbor supervoxels from lower level to higher level. 
All the bottom-level supervox- els are considered as nodes in the graph, and the spatio-temporal neighbor supervoxels are connected by weighted edges. The weight of each edge D ij is determined by the distances of CIELAB color histogramsD c and average motion directionsD v between supervoxelsi andj, as follows: D ij = p D 2 c +D 2 v ; D 2 c =kL i L j k 2 +kA i A j k 2 +kB i B j k 2 ; D 2 v = (v xi v xj ) 2 + (v yi v yj ) 2 ; (3.4) where is a parameter to balance color and motion space. (L i ;A i ;B i ) is the quantified color histogram of supervoxeli, weighted by the reciprocal of its size. Fig. 3.4 shows the comparisons between the output of SLIC (using average color without motion as feature space) and our method (using color histogram with motion as feature space). For SLIC, some internal regions of the fish are merged to the background, because the average color of this part is quite close to that of the background. Similarly, the underpant is also merged unless the motion cue is leveraged. But in the proposed method, both cases work quite well, owning to the use of color histogram (the fish case) and the motion cue (the underpant case). 43 (a) (b) (c) Figure 3.4: Supervoxel comparison between (b) SLIC (using color averaging without motion as feature space) and (c) the proposed method (using color histogram with motion as feature space) for (a) the original video. We can see that, for SLIC, some internal regions of the fish are merged to the background, because the average color of this part is quite close to that of the background. Similarly, the underpant is also merged unless the motion is leveraged. But in the proposed method, both cases work quite well, owning to the leverage of color histogram (the fish case) and the motion cue (the underpant case). In addition, we also consider saliency maps as guidance for supervoxel merging, as shown in Fig. 3.6. Based on the observation, supervoxels inside or outside the salient regions have higher priorities to merge together compared with the boundary ones. Thus, the derivative of the saliency map is combined to Eqn. 3.4 to calculate the edge weight, as shown in Eqn. 3.5. ^ D ij = q D 2 c +D 2 v +D 2 rGrad D 2 rGrad = (g i g j ) 2 ; (3.5) where is a factor to balance gradient of saliency and the other two spaces: color and motion. g i is the gradient averaging of saliency for supervoxel i. We adopt context- based saliency and shape prior (CBS) [JWY + 11], which integrates bottom-up salient stimuli and object-level shape prior, leading to clear boundaries. 44 We merge two supervoxels with least weighted edge one by one, until the number of the remaining supervoxels arrives the expectation. In this tree structure, each low-level supervoxel is only covered by one high-level supervoxel, but each high-level supervoxel might correspond to multiple low-level sueprvoxels. Fig. 3.5 shows an example of a video shot (row 1). The hierarchical supervoxel graph for level #6, #5, #4 and #1 are illustrated in Fig. 3.5 (row 2-5). We use the same color to denote one supervoxel in different frames at the same level. For each frame, supervoxels are merged from lower level to higher level. At the same level, supervoxels are tracked along the time. This process is illustrated in Fig. 3.5 (row 6). 3.3 Interactive Hierarchical Supervoxel Representation In this section, we introduce how to represent (an) arbitrary user-labeled object(s) in a frame using a compact set of supervoxels from different levels. There are two constraints we need to follow. 
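As a concrete illustration of Eqn. 3.5, the weight of the edge between two neighboring supervoxels might be assembled as in the sketch below (the per-supervoxel color histograms, mean motion vectors, and mean saliency gradients are assumed to be precomputed; the variable names and the mapping of the two balance constants of Sec. 3.5.2 to the motion and saliency terms are our own assumptions):

import numpy as np

def supervoxel_edge_weight(hist_i, hist_j, motion_i, motion_j,
                           grad_i, grad_j, beta=0.0004, gamma=0.0025):
    # Color term: squared distance between the concatenated (L, A, B)
    # histograms of the two supervoxels.
    D_c2 = np.sum((np.asarray(hist_i) - np.asarray(hist_j)) ** 2)
    # Motion term: squared distance between average motion directions.
    D_v2 = np.sum((np.asarray(motion_i) - np.asarray(motion_j)) ** 2)
    # Saliency term: squared difference of average saliency gradients.
    D_g2 = (grad_i - grad_j) ** 2
    # Combined edge weight of Eqn. 3.5, with balance factors for the
    # motion and saliency terms (assumed assignment of 0.0004 / 0.0025).
    return np.sqrt(D_c2 + beta * D_v2 + gamma * D_g2)

Smaller weights correspond to more similar supervoxels, which are merged first.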
On the one hand, the union set of the resulting supervoxels should accurately match the labeled regions. On the other hand, the number of supervoxels should be as few as possible. This will avoid the trivial supervoxels in lower regions, and make further modeling and tracking easier. Let’s denote the labeled regions asSeg. Then the optimization formula is given by (see notations in Sec. 3.2.1): arg minE (l;r) = 0 @ 1 jRj [ i=1 s l r(i) \Seg = jRj [ i=1 s l r(i) [Seg 1 A 2 +jRj 2 ; (3.6) where is a weight to balance the matching term and the number of supervoxels. Notice that here we only consider the regions on the labeled frame. There arejRj supervoxels s l r(i) that are all not overlapped with each other. Whens m r =s n r form>n, we select the 45 Frame #1 Frame #16 Frame #31 Frame #46 Original time Level #1 Level #4 Level #5 Level #6 level Hierarchical Supervoxel Graph Segmentation level Figure 3.5: For the video of a man (row 1), the hierarchical supervoxel graph (row 2-5) shows the connectivity among different supervoxels in terms of two dimensions: time and level. For each frame, disjoint regions that are spatially approximate and coherent in appearance with similar motions will be merged into larger regions; for each level, the regions with the same color or ids indicate the same supervoxel. This graph can be abstracted in row 6. When a user labels the target object in the first frame (row 7 col 1, red sketches are used for display purpose), hierarchical supervoxel representation for this man is calculated, where each supervoxel can be from different levels in the graph. The corresponding segmentation results are shown in row 7. 46 Distance L A B Gradient Weight Vx Vy V Figure 3.6: Illustration of the weights between two supervoxels. When merging super- voxels from lower level to higher level, the weight between two neighbor supervoxels is measured by the combination of distances in CIELAB color histogram, motion direc- tion, and gradient of saliency map. larger levelm. Thus a list of (l;r) non-overlapped supervoxels will be obtained through Eqn. 3.6. To speed up the object model construction, we propose a greedy solution, called hierarchical supervoxel selection algorithm. We start from scanning all the supervoxels s l r from the highest level to see whether there is any supervoxel can be covered by the labeled regionSeg. Once there is a good cover, this region of this supervoxel on this frame will be subtracted from the frame until there is almost no area left. Since the size of the supervoxels becomes smaller from higher level to lower level, we can tolerate more covering errors between s l r and the remaining segmentation at lower levels than higher levels. After this stage, the obtained supervoxel list can represent the objects on the labeled frame, where each supervoxel can be from different levels of the hierarchical supervoxel graph. The hierarchical supervoxel selection algorithm is summarized in Algorithm 1, where mine and maxe are the acceptance error rates at the lowest and highest lev- els respectively." is a very small number to guarantee that the majority of the object can be covered by the given supervoxels. 
47 Algorithm 1 Hierarchical Supervoxel Selection Require: Hierarchical supervoxel graph s l r , where l 2 f1; 2; ;Lg, r 2 f1; 2; ;R (l)g; Ground truth segmentation for the key frameSeg; Ensure: Supervoxel listSV List; 1: SetSV List =NULL 2: SetReg =Seg 3: for each level from high to lowl =L;L 1;:::; 1 do 4: for each regionr = 1; 2;:::;R (l) do 5: if js l r \Regj js l rj mine + (maxe mine) l L then 6: SV List =SV List + (l;r) 7: Reg =Regs l r 8: ifjRegj<"jSegj then 9: Break out all the loops 10: end if 11: end if 12: end for 13: end for 14: return SV List; This hierarchical supervoxel representation is capable of characterizing any object(s) labeled by users even if they are not in the same level (like in Fig. 3.3), or at different locations. The supervoxel list can be easily propagated to subsequent frames. The objects along the time can be represented by the supervoxel list sequences, which makes the real-time object segmentation available. 3.4 Video Object Segmentation Framework A block diagram of the proposed video object segmentation framework is shown in Fig. 3.7. It contains three stages: offline hierarchical supervoxel graph generation, interactive object representation for the key frame, and representation propagation and refinement. 48 Local Clustering Region Merging Hierarchical Supervoxel Graph Hierarchical Supervoxel Selection Supervoxel List Propagation Offline Hierarchical Supervoxel Representation Interactive Object Representation for Key Frame Representation Propagation and Refinement Labeled by Paint Selection Tool Video Object Segmentation Figure 3.7: The video object segmentation framework contains three stages. The offline supervoxel representation stage builds hierarchical supervoxel graph based on local clus- tering and region merging. In the interactive object representation stage, the users label interested objects via Paint Selection tool [LSS09], which is further represented by a compact set of supervoxels obtained by the proposed hierarchical supervoxel selection algorithm. Finally, the representation propagation and refinement stage will reduce seg- mentation errors to achieve final video object segmentation. 49 3.4.1 Hierarchical Supervoxel Graph Generation For an input video, the hierarchical supervoxel graph is firstly constructed to represent the video using the algorithms introduced in Sec. 3.2. Some tiny refinement on super- voxel representation results are also conducted to meet the basic hypothesis of hierar- chical supervoxel graphs such as supervoxel connectivity and smoothness. Boundaries of all the supervoxels are further smoothed to avoid roughness. This stage is conducted offline, and is prepared for real-time interactive object segmentation. 3.4.2 Interactive Object Representation Our goal is to enable users to label (an) arbitrary object(s) in a video shot in an easy way, then the system can automatically segment and track the labeled object in the shot. Thus, an easy-to-use label tool is necessary for users to label objects. We embedded the Paint Selection [LSS09] tool (discussed in Sec. 2.5) in our system, which is an interactive image segmentation tool. It shows the instant feedback to users as they drag the mouse/pen, which is quite efficient to extract the objects. Based on the labeled objects, the proposed hierarchical supervoxel selection algo- rithm in Sec. 3.3 is applied to model and represent the objects by a set of supervoxels. 
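For readers who prefer code, the greedy selection of Algorithm 1 might be sketched in Python as below (the mask dictionaries and all names are our illustrative assumptions; the masks and Seg are boolean arrays on the labeled frame):

import numpy as np

def select_supervoxels(masks, seg, min_e=0.6, max_e=0.9, eps=0.05):
    # masks[l][r] is the boolean support of supervoxel s^l_r on the
    # labeled frame; seg is the boolean mask of the labeled object.
    sv_list = []
    remaining = seg.copy()
    levels = sorted(masks.keys(), reverse=True)   # scan high level -> low
    L = max(levels)
    for l in levels:
        # The coverage threshold grows with the level: large supervoxels
        # must lie almost entirely inside the labeled region, while small
        # low-level supervoxels are accepted with a looser ratio.
        threshold = min_e + (max_e - min_e) * l / L
        for r, mask in masks[l].items():
            area = mask.sum()
            if area and np.logical_and(mask, remaining).sum() / area >= threshold:
                sv_list.append((l, r))
                remaining &= ~mask                # subtract the covered area
                if remaining.sum() < eps * seg.sum():
                    return sv_list                # object essentially covered
    return sv_list

The returned (level, region) list is the compact representation that is then propagated to subsequent frames.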
3.4.3 Representation Propagation and Refinement We assume that the input video shot has no shot cut and the labeled objects nearly appear all the time. We propagate the supervoxel set from the labeled frame to the adjacent frames, followed by some post-processing techniques to refine the boundary of salient objects. Eventually we could achieve the video object segmentation results for all the frames in this shot. 50 If the representation propagation does not work well for some subsequent frames, user can continue to label that frame for the segmentation of next frames. The disadvan- tage is that this will cost more human efforts. 3.5 Experimental Results In this section, we first describe the experimental data set and the generation of ground truth labels. Then, we compare the proposed algorithms with the state-of-the-art meth- ods in the two aspects, i.e. hierarchical supervoxel graph construction, and interactive video object segmentation. 3.5.1 Dataset Three public video datasets were used in our experiments. One is the dataset used in Video SnapCut [BWSS09], which is a state-of-the-art video object segmentation algo- rithm, and has successfully applied to Adobe After Effect CS5. Video SnapCut dataset has seven available video shots with around 100 frames for each shot at a resolution from 320x240 to 720x480. To make a larger dataset for a more objective evaluation, SegTrack database [TFNR12], and UCF Sports Action Data Set [RAS08] are also lever- aged. SegTrack database has five available video shots ranging in length from 21 to 70 frames for each shot at a resolution from 320x240 to 414x352. UCF Sports Action Data Set consists of a set of actions collected from various sports, which contains 40 video sequences at a resolution of 720x480. Except SegTrack, the other two datasets do not provide the ground-truth segmenta- tions. To quantitatively evaluate the segmentation performance, we develop an interac- tive interface to generate the ground truth frame by frame for the entire database using Paint Selection tool [LSS09]. 51 The three datasets were combined together as our experimental dataset, one fifth of which were randomly selected as the training set for parameter tuning, and the other videos compose the testing set. In bottom-level supervoxel generation, we set different spatial factors for different levels but the same motion factor . We first tuned to achieve the best visual results, and then. In higher-level generation, we first tuned the other motion factor and then the saliency factor. In supervoxel selection, we tried different combinations of mine, maxe, and, and chose the best one in the training set. After getting all parameters, we evaluated on the testing set. 3.5.2 Implementation Details For bottom-level supervoxel generation, we apply MSLIC for distance measure between pixel and supervoxel, as shown in Eqn. 3.2, where = 400 and = 25. In spatio- temporal distance shown in Eqn. 3.3, the spatio-temporal factor is determined by the category of the video. For sports videos with fast motion, ST = 1; for other general videos, we takeST = 2. For higher-level sueprvoxels merging, we calculate the weight of neighbor super- voxels using color histogram, motion, and gradient of saliency in Eqn. 3.5, where = 0.0004 and = 0.0025. For hierarchical supervoxel graph generation, we consider L = 5 levels, i.e. the number of supervoxels is 64, 128, 256, 512, and 1024. 
For supervoxel selection in Algorithm 1, the acceptance error rates at the lowest level and the highest level are 0.6 and 0.9, respectively. We take the small number ε as 0.05.

3.5.3 Performance of Hierarchical Supervoxel Graph

Two state-of-the-art algorithms are compared with the proposed hierarchical supervoxel graph generation algorithm. One is the hierarchical graph-based approach (GBH) [GKHE10] (discussed in Sec. 2.4), and the other is the SLIC algorithm [ASS+12] (discussed in Sec. 2.2). We compare these methods at different granularities of the graph, i.e., with the number of supervoxels set to 64, 128, 256, 512, and 1024, respectively.

Evaluation Measurement

Boundary recall and under-segmentation error are typical measures of boundary adherence used to evaluate superpixels [ASS+12]. Boundary recall measures the fraction of ground-truth edges falling within a certain distance d of at least one detected superpixel boundary. We set d = 3 in the experiment. Under-segmentation error measures to what extent superpixels flood over the ground-truth segment boundaries. A ground-truth segment can divide a superpixel P into an in part and an out part, P_in and P_out. The under-segmentation error can be calculated by

U = \frac{1}{N}\left[\sum_{S \in GT}\left(\sum_{P: P \cap S \neq \emptyset} \min\left(P_{in}, P_{out}\right)\right)\right], (3.7)

where N is the number of pixels in the image and GT is the set of segments in the ground-truth image. Finally, we used the average performance over the frames of all videos, weighted by the reciprocal of the number of frames in each video shot.

Results

The average boundary recall and under-segmentation error of the compared algorithms with an increasing number of supervoxels are shown in Fig. 3.8. We can see that, for both measurements, the proposed method consistently outperforms the compared methods. Specifically, GBH is worse than the proposed algorithm at all granularities, while the performance of SLIC drops quickly when the granularity of the supervoxels becomes larger.

Figure 3.8: Comparison of hierarchical supervoxel construction for GBH, SLIC, and the proposed method as the number of supervoxels grows from 64 to 1024. (a) Average boundary recall. (b) Under-segmentation error.

Fig. 3.9 shows the supervoxel graphs produced by the three algorithms for two video shots. We can see that, for the "girl" video with large motion, the right arm is occluded in several frames. When this arm reappears in the frame, GBH and SLIC consider it a new part or fail to detect it. In contrast, our method can detect that arm both before its disappearance and after its reappearance, and group the two into one supervoxel. Similarly, in the "referee" video, the reappeared right hand can only be recaptured by our method.

Figure 3.9: Examples of hierarchical supervoxel graphs produced by various methods: (b) GBH, (c) SLIC, and (d) the proposed method. (a) Original frames; (b) GBH with 64 supervoxels and a closeup of GBH (64); (c) SLIC with 64 supervoxels and a closeup of SLIC (64); (d) the proposed method with 64 supervoxels and a closeup of our method (64). For the "girl" video, the right arm is occluded for several frames. Our method can successfully detect it after its reappearance and group it into the same supervoxel as before its disappearance, while the other methods fail to detect it or consider it a new part.
Similarly, in the "referee" video, the reappeared right hand can only be recaptured by our method.

3.5.4 Performance of Video Object Segmentation

We compare our solution for video object segmentation with the state-of-the-art Video SnapCut [BWSS09], which has been transferred to Adobe After Effects CS5 (AAE); we directly use Adobe After Effects CS5 for video object segmentation in the experiments. Object accuracy A_O and boundary accuracy A_B [MO10] were used to measure the segmentation performance frame by frame. A_O is given by

A_O = \frac{|G_O \cap S_O|}{|G_O \cup S_O|}, (3.8)

where G_O is the set of all pixels inside the ground-truth object, and S_O is the set of all pixels in the object segmented by the algorithm. A_B is given by

A_B = \frac{\sum_x \min\left(\tilde{G}_B(x), \tilde{S}_B(x)\right)}{\sum_x \max\left(\tilde{G}_B(x), \tilde{S}_B(x)\right)}, (3.9)

where

\tilde{G}_B(x) = \exp\left(-\frac{\|x - \hat{x}\|^2}{2\sigma^2}\right), \quad \tilde{S}_B(x) = \exp\left(-\frac{\|x - \hat{x}\|^2}{2\sigma^2}\right), \quad \hat{x} = \arg\min_{y \in G_B \text{ or } S_B} \|x - y\|. (3.10)

Here x is the location of each pixel in the segmentation map, and G_B is the set of border pixels of the ground-truth object. \tilde{G}_B and \tilde{S}_B are the fuzzy sets for the border of the ground-truth object and for that of the object segmented by the algorithm, respectively. We set the bandwidth parameter σ to 2.

For fairness, we manually segment the main object in the first frame of each video shot, and check the segmentation performance on the other frames. The results in Table 3.1 show that our segmentation method produces more than 8% improvement over AAE in both average object accuracy and average boundary accuracy.

Table 3.1: Comparison on video object segmentation
Algorithm               Object Accuracy   Boundary Accuracy
Adobe After Effects     80.00 %           53.06 %
Our Method              88.07 %           63.26 %

Fig. 3.10 is a visual comparison of four video segmentation cases for the two methods. For each case, the first row is the labeled frame, and the other rows are testing frames. These four videos have large motions. For the "diver" video, when the motion is large enough, AAE cannot capture the arms and disconnects the body and the legs in some frames. For the "soccer player" video, the moving, blurred leg is ignored by AAE. For the "monkey" video, AAE cannot capture the monkey under large motions. For the "kicker" video, the occluded leg is ignored or disconnected from the body in some frames. In contrast, our method is able to deal with these challenging cases. It is worth noting that our method is capable of segmenting disconnected objects at the same time, as shown in Fig. 3.11, whereas AAE can only extract one connected object along the time axis and could only propagate the segmentation for twenty frames.

3.5.5 Time Cost Analysis

The system contains offline graph generation and online interactive object segmentation. The online part is fast, around 0.01 second/frame on average at a resolution of 320x240 using ten threads on a common machine, which achieves real-time response. The offline part is slower, about 0.8 second/frame under the same setting, one tenth slower than SLIC. The additional time cost mainly comes from optical flow and saliency detection.

Figure 3.10: Visual comparison of video segmentation produced by (b) Adobe After Effects (AAE) and (c) our method for (a) the original video. Every four rows and three columns represent one video example. The first row of each example is the labeled frame. These cases are challenging because of large motion, motion blur, or ambiguous colors. We can see that the proposed solution outperforms AAE in the above cases.
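The two frame-level scores defined in Sec. 3.5.4 are straightforward to compute; a small Python/NumPy sketch (our own helper names, with the masks as boolean arrays and the borders as arrays of pixel coordinates) is:

import numpy as np
from scipy.spatial import cKDTree

def object_accuracy(gt_mask, seg_mask):
    # A_O of Eqn. 3.8: intersection over union of the two object masks.
    inter = np.logical_and(gt_mask, seg_mask).sum()
    union = np.logical_or(gt_mask, seg_mask).sum()
    return inter / union if union else 1.0

def boundary_accuracy(gt_border, seg_border, pixels, sigma=2.0):
    # A_B of Eqns. 3.9-3.10: fuzzy border maps built from the distance of
    # every pixel to the nearest border point, compared pixel by pixel.
    d_gt, _ = cKDTree(gt_border).query(pixels)
    d_seg, _ = cKDTree(seg_border).query(pixels)
    g = np.exp(-d_gt ** 2 / (2 * sigma ** 2))
    s = np.exp(-d_seg ** 2 / (2 * sigma ** 2))
    return np.minimum(g, s).sum() / np.maximum(g, s).sum()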
3.6 Conclusion In this chapter, we propose a hierarchical supervoxel graph construction algorithm for video representation, as well as an interactive supervoxel representation to model tar- get objects. Based on these representations, a video object segmentation framework is 58 Figure 3.11: Our method can segment multiple objects at the same time. developed, which has potential to handle challenging scenarios such as large motion, motion blur or ambiguous colors in videos. We have compared the proposed algo- rithms with the state-of-the-art methods in hierarchical supervoxel graph construction and video object segmentation. Experiments have shown obvious superiority of the proposed algorithms. 59 Chapter 4 Image Segmentation with Fused Contour, Surface and Depth Cues 4.1 Introduction Automatic image segmentation is a fundamental vision problem in diverse image-related applications, such as shape analysis, object recognition, image retrieval, and image com- pression. It automatically partitions an image into several disjoint coherent groups with- out any prior knowledge. In spite that a variety of segmentation algorithms were pro- posed in the last few decades, no good segmentation rules apply to all the images. Most of the existing methods are not image-driven or region-driven. It remains challenging for any single method to segment an image like human due to the diversity and ambiguity of visual textures in the natural image. Most of existing image segmentation algorithms can be roughly classified into two categories, i.e., region-based and contour-based methods, depending on whether the 2D surface cue or the 1D contour cue plays a key role. Region-based methods find the simi- larity among spatially connected pixels and group them together using the surface prop- erties like luminance and color. Typical algorithms include Watershed [RM00], Normal- ized cuts (NCut) [SM00], Mean Shift [CM02], Felzenszwalb and Huttenlocher’s graph- based method (FH) [FH04], Multi-Layer Spectral Segmentation (MLSS) [KLL13], and Segmentation by Aggregating Superpixels (SAS) [LWC12]. However, these methods neglect the obvious discontinuities between two regions that lead to the segmentation 60 boundaries. To resolve this limitation, contour-based approaches, like gPb-owt-ucm [AMFM11], were presented to find connected regions blocked by the detected contours. However, it is still very challenging to detect contours in low contrast or blurred regions. To overcome the limitations of region-based and contour-based methods, researchers started to combine surface properties and contour cues for better segmentation. They either took the contour cue as a post-processing step to correct the region segmenta- tion results or adopted the contours as barriers in the graph for region segmentation [MFCM03]. In spite of the combination of 1D contour and 2D surface cues, sometimes the algorithms are still failed in some challenging cases, discussed in Sec. 1.1. This motivates us to leverage the 3D depth cue of an image to alleviate these two problems in image segmentation. According to the study of human visual perception [Gib50] [BGG03], people are more concerned with dissimilarities or high contrast between two regions, and group similar regions in appearance and depth layer, corresponding to 1D contour, 2D surface, and 3D depth cues. The contour cue describes discontinuities between two regions. The surface cue illustrates similarities inside one region. 
The depth cue makes use of blur estimation to indicate the image layout even if some regions are apparently similar. On one hand, contour detection and image segmentation are quite related but not identical. Contour-based image segmentation applies contour detection results to find the cues of discontinuity between two neighbor segments. However, because of the low contrast regions or blurred regions, contour detection offers no guarantee that it will generate closed contours, which cannot lead to the segmentation regions. On the other hand, depth estimation from a still image separates the scenery into sev- eral layers globally, which helps image segmentation to clean the regions, especially the textures. The basic idea for depth estimation from a still image is using blur estimation. 61 In this chapter, we study how to fully utilize the 1D contour, the 2D surface, and the 3D depth cues for image segmentation. It is clear that although each cue might be sufficient to segment some images, each of them has its scope and limitations in that the results are not consistently reliable. As shown in Fig. 4.1, the corresponding cue will be helpful for the images in the left to achieve good segmentation, while it will lead to problems for the images in the right. We found that the contour cue is more reliable if the contour is longer and more closed, but it cannot find the discontinuities between two regions if the boundary is blurred, in low contrast, or in smooth transition. The surface cue is unable to simplify the representation of the regions with complex textures and large variance, which may lead to over-segmentation. The depth cue is extremely helpful to clean the regions, especially the textured ones, but it is unreliable if the regions have no edges. (a) Contour Cue (b) Surface Cue (c) Depth Cue Figure 4.1: Original images and their corresponding cue maps. The cue is useful for segmentation of the left image, but is harmful for segmentation of the right image for each row. (a) Contour cue. (b) Surface cue. (c) Depth cue. 62 To maximize the positive effects but minimize the negative parts of these cues, we propose a novel image segmentation framework. First, three elementary segmentation modules are developed for these three cues respectively. For the 1D contour cue, we make use of accurate local variance of phase congruency detector [Kov03] and global variance of Canny detector [Can86] to build the segmentation module. For the 2D sur- face cue, Mean Shift [CM02] and FH Merger [FH04] in CIELAB color space and mod- ified CIELCH color space are applied. For the 3D depth cue, we propose a depth esti- mation solution based on low cost robust blur estimator [HdH06]. The proposed depth estimation approach can handle low contrast regions and reduce the noises and textures inside the regions. Then, to aggregate contour, surface, and depth cues, we propose a region-dependent spectral segmentation framework, which consists of two stages. The first stage is contour-guided surface merger (CGS). After processed by the three elementary mod- ules, the surface regions are further merged to remove the fake boundaries according to the contour cue by Mean Shift. In the second stage, we design a region-dependent spec- tral graph (RDS), which has multiple pixel and superpixel layers with edges weighted by texture indicator. For non-textured areas, the contour cue and the surface cue are considered; for textured areas, only the depth cue is referenced. 
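Purely for intuition (this is not the actual graph construction, which is detailed in Sec. 4.3.3), the region-dependent idea can be pictured as a per-pixel blend in which a texture indicator t(x) in [0, 1] gates how much each cue contributes to the affinity between neighboring elements; the equal weighting of the contour and surface cues below is our own arbitrary choice.

import numpy as np

def blended_affinity(contour_aff, surface_aff, depth_aff, texture):
    # texture ~ 0: trust the contour and surface cues;
    # texture ~ 1: trust only the depth cue inside textured areas.
    # All inputs are arrays of the same shape with values in [0, 1].
    t = np.clip(texture, 0.0, 1.0)
    return (1.0 - t) * 0.5 * (contour_aff + surface_aff) + t * depth_aff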
We conduct extensive experiments on the Berkeley Segmentation Database [MFTM01], which not only show the superior performance of the proposed algorithm over state-of-the-art algorithms, but also verify the necessities of the three cues in image segmentation. The rest of the chapter is organized as follows. In Sec. 4.2, we introduce the three elementary segmentation modules for the 1D contour, the 2D surface, and the 3D depth cues in details. Then we present the proposed region-dependent spectral segmentation framework in Sec. 4.3, including the framework (Sec. 4.3.1), contour-guided surface 63 merger (Sec. 4.3.2), and region-dependent spectral graph (Sec. 4.3.3). In Sec. 4.4 and 4.5, we show the experimental results and conclude the chapter. 4.2 Three Elementary Cues In this section, we present how to generate our three elementary cues. First, the modified color space is discussed for later usage. Then the 1D contour cue, the 2D surface cue, and the 3D depth cue are developed one by one in detail, which are prepared for our region-dependent spectral segmentation. Each of the elementary cues is indispensable for better segmentation since none of them are reliable in all scenarios. 4.2.1 Color Space For color image segmentation, we have to discuss the color space issues for the mid-level layer, such as contour detection and surface maps. Generally speaking, the luminance of the image alone is far from sufficient for color image segmentation. For examples of Fig. 4.2, the hue map is more useful than the luminance map alone for segmentation. Both the luminance maps and the hue maps are normalized for display purpose. The luminance may vary for the same object for shadowing or shading while the hue always keeps. That is to say, hue is more reliable to differentiate the different objects. We will take hue space as one of the key color attributes in our analysis because of its invariance to luminance. However, the gray regions in the color image have no color information, which can- not be represented by hue space accurately. Fig. 4.3 show that hue is not reliable for gray regions in color image. Thus we have to consider high dimensional space to include both luminance and hue. 64 (a) (b) (c) Figure 4.2: Hue is more reliable than luminance alone for many color images. For each column, (a) is the original image, (b) is the luminance map, and (c) is the hue map. (a) (b) (c) Figure 4.3: Hue is not reliable for gray regions in many color images. For each column, (a) is the original image, (b) is the luminance map, and (c) is the hue map. 65 As discussed in Sec. 2.1, CIELAB is one of the best solutions to deal with both gray regions and color regions, which is widely applied for tons of computer vision algorithms for its uniformity to human visual system (HVS). The only problem is that it is not luminance invariant for color regions, but hue is. To make use of hue’s characteristic of luminance invariance in color regions and reduce the influence of hue in gray regions, we present the weighted CIELCH color space h (2w)L C wH i T . The weight function w for hue is a cumulative dis- tribution function of Gaussian determined by the chroma, which is shown in Fig. 4.4. The horizontal axis is the normalized chroma in the CIELCH space from 0 to 1, and the vertical axis is the weight for hue. We assign less weight on hue and more weight on luminance when the chroma is small or the gray regions. 
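The luminance and hue maps of Figs. 4.2-4.3 can be produced with a few lines of Python (a sketch assuming the hue channel is the CIELCH hue angle; scikit-image is used for the color conversions):

import numpy as np
from skimage import color

def luminance_and_hue(rgb):
    # rgb: float image in [0, 1].  Returns the CIELAB luminance L,
    # normalized to [0, 1], and the CIELCH hue angle, normalized to [0, 1).
    lab = color.rgb2lab(rgb)
    lch = color.lab2lch(lab)
    luminance = lab[..., 0] / 100.0
    hue = lch[..., 2] / (2 * np.pi)
    return luminance, hue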
The visual comparisons of the surface cue in the CIELAB space, the CIELCH space, and our modified CIELCH space are illustrated in Sec. 4.2.3.

Figure 4.4: Weight function for hue in the modified CIELCH color space. The horizontal axis is the normalized chroma in the CIELCH space from 0 to 1, and the vertical axis is the weight for hue.

In addition, since hue is a non-Euclidean space, we have to apply different definitions for the mean and the distance measure. Hue, as an angle, is a circular dimension. To find the mean of the angles, we convert polar coordinates to Cartesian coordinates. Assume we have $N$ samples of hue $\theta_1, \theta_2, \ldots, \theta_N$; their mean is calculated in Eqn. 4.1:

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N}\sin\theta_i, \qquad \bar{x} = \frac{1}{N}\sum_{i=1}^{N}\cos\theta_i, \qquad \bar{\theta} = \tan^{-1}\!\left(\frac{\bar{y}}{\bar{x}+\varepsilon}\right) + \pi\, u(-\bar{x}) + 2\pi\,\bigl(1-u(-\bar{x})\bigr)\, u(-\bar{y}), \quad (4.1)$$

where $u(x)$ is the unit step function

$$u(x) = \begin{cases} 1 & x \geq 0, \\ 0 & x < 0, \end{cases} \quad (4.2)$$

and $\varepsilon$ is a small number that prevents the denominator from being zero; we take $\varepsilon = 10^{-5}$. This mean of hues ranges between 0 and $2\pi$; dividing it by $2\pi$ gives the normalized mean of hues. The distance between two normalized hues $a, b \in [0, 1]$ is defined in Eqn. 4.3:

$$d_{hue}(a, b) = \min\{|a - b|,\; 1 - |a - b|\}. \quad (4.3)$$

We will use all of the color space analysis above for the 1D contour cue and the 2D surface cue in Sec. 4.2.2 and 4.2.3.

4.2.2 1D Contour Cue

In this section, we present our contour detection solution, which fuses phase congruency (discussed in Sec. 2.6) with the Canny detector. Contour detection indicates the boundaries between adjacent regions for image segmentation using luminance and color discontinuities. Existing gradient-based contour detectors such as the Canny detector [Can86] are capable of detecting global variance, but they fail to obtain accurate boundaries around low-contrast or blurred edges. To improve the accuracy of contour detection, our contour cue fuses phase congruency (PC) and the Canny detector, followed by edge linking, for color images. PC is a powerful edge measure that is invariant to illumination and contrast because it uses the phase information of the image [Kov03]. Although PC detects local variance accurately, it is overly sensitive to local textures and noise. As shown in Fig. 4.5, PC detects the boundaries of the mountain and building regions better than Canny, but it adds excessive contours inside the grass and building regions because of textures and noise. This limitation is complemented by the Canny detector, which captures global variance. Our contour detection scheme combines the PC and Canny detectors on both the luminance $L$ and hue $H$ discontinuities, as represented in Eqn. 4.4:

$$C = T\bigl(\max\{\mathrm{nms}(PC_L \cdot Cy_L),\; \mathrm{nms}(PC_H \cdot Cy_H)\}\bigr), \quad (4.4)$$

where $PC$ and $Cy$ are the edge maps of the PC detector and the Canny detector respectively, $\mathrm{nms}$ denotes the non-maxima suppression that thins the edge map, and $T$ is the linear enhancement function shown in Fig. 4.6. The threshold of the enhancement function is adaptively selected by keeping most of the edge energy (99.5%).

Figure 4.5: Our 1D contour cue is fused from phase congruency (PC) and the Canny detector, which are complementary for better contour detection. (a) Original image. (b) PC edge detection. (c) Canny detection. (d) Our 1D contour cue. PC is better at capturing accurate local variance and true edges in the mountain and building regions, but it is sensitive to texture and noise inside those regions. The Canny detector is better at capturing global variance, which removes the noise and the detailed textures.
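As a compact sketch of the fusion in Eqn. 4.4, the code below assumes precomputed PC and Canny edge maps in [0, 1], reads the combination of the two maps as a pixel-wise product, interprets the linear enhancement T of Fig. 4.6 as a clipped ramp, and treats the 99.5% edge energy as a fraction of the summed edge strengths; the non-maxima suppression is left as a pluggable callable (identity by default) rather than reimplemented here.

```python
import numpy as np

def adaptive_threshold(edge, keep=0.995):
    """Pick Th so that the strongest responses above Th retain `keep`
    of the total edge energy (read here as the sum of strengths)."""
    v = np.sort(edge.ravel())[::-1]          # strongest responses first
    csum = np.cumsum(v)
    idx = np.searchsorted(csum, keep * csum[-1])
    return v[min(idx, v.size - 1)]

def linear_enhance(edge, keep=0.995):
    """Linear enhancement T of Fig. 4.6, read as a clipped ramp:
    values >= Th map to 1, values below scale linearly (an assumption)."""
    th = adaptive_threshold(edge, keep)
    return np.clip(edge / max(th, 1e-12), 0.0, 1.0)

def fuse_contour(pc_L, cy_L, pc_H, cy_H, nms=lambda e: e):
    """Contour cue of Eqn. 4.4: C = T(max{nms(PC_L * Cy_L), nms(PC_H * Cy_H)}).

    pc_*: phase-congruency edge maps, cy_*: Canny edge maps, all in [0, 1].
    `nms` is a non-maxima-suppression callable; the identity default is a
    placeholder so that the sketch runs on its own.
    """
    fused = np.maximum(nms(pc_L * cy_L), nms(pc_H * cy_H))
    return linear_enhance(fused)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pc_L, cy_L, pc_H, cy_H = (rng.random((64, 64)) for _ in range(4))
    C = fuse_contour(pc_L, cy_L, pc_H, cy_H)
    print(C.shape, float(C.min()), float(C.max()))
```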
Figure 4.6: Linear enhancement function. The horizontal axis is the original edge strength normalized between 0 and 1, and the vertical axis is the target edge strength. The threshold Th of the enhancement function is adaptively selected by keeping most of the edge energy (99.5%).

For our contour detection, there are three further concerns. The first is the standard deviation of noise for phase congruency in luminance and hue. The second is the cyclic property of phase congruency on the hue space. The third is the normalization between luminance and hue.

The standard deviation of noise controls the sensitivity of phase congruency to noise. To make it adaptive, we replace it with a scaled standard deviation of the high-frequency components of the image, because the high-frequency content indicates the noise level of the channel. We apply wavelets to denoise the image and take the difference between the original image and the denoised image as the high-frequency components. The standard deviation of the high-frequency components for the whole BSDS300 dataset is shown in Fig. 4.7, for luminance and hue respectively. This standard deviation ranges over [4.45e-11, 9.82] for luminance and over [4.37e-11, 10.07] for hue. According to the author of the original phase congruency paper [Kov03], the suggested parameter for noise sensitivity lies between 0 and 20. Thus, for phase congruency, we set the standard deviation of the noise to twice the measured standard deviation of the high-frequency components.

Figure 4.7: The standard deviation of the high-frequency components for each image in BSDS300. The left figure is for luminance, and the right figure is for hue.

Because of the cyclic property of the hue space, instead of taking only the original hue discontinuity of H (0°-360°), we also consider the three shifted hue discontinuities mod(H + 90°, 360°), mod(H + 180°, 360°), and mod(H + 270°, 360°). The median of these four is then taken as the reliable hue discontinuity. The median operation removes the large distance errors caused by the cyclic wrap-around. For example, the naive hue distance between 5° and 300° is 295°; the updated result is instead the median of {295°, 65°, 65°, 65°} = 65°, which is what we expect. In addition, we exclude the gray regions of the image from the hue discontinuity. As discussed in Sec. 4.2.1, gray regions have no color information, which always results in incorrect hue distances. The reliable gray region is calculated by eroding the regions whose normalized chroma is less than 0.05. Furthermore, the gradient measure for luminance is normalized. In the CIELCH space, the luminance difference ranges from 0 to 100, while the hue difference ranges from 0 to 180. These two dimensions should be normalized to the same scale, so the luminance gradient is scaled by a factor of 1.8 relative to the hue gradient. After obtaining the float edge map, the binary contour detection is formed by linking the points into lists of coordinate pairs that are as long as possible [Kov00].
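The edge-linking step itself is handled by Kovesi's edgelink routines [Kov00]; purely as an illustration of the idea, the simplified tracer below walks a thinned binary edge map into ordered coordinate lists, without the junction handling and gap filling of the original implementation.

```python
import numpy as np

NBRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def _neighbors(p, edge, visited):
    r, c = p
    h, w = edge.shape
    out = []
    for dr, dc in NBRS:
        rr, cc = r + dr, c + dc
        if 0 <= rr < h and 0 <= cc < w and edge[rr, cc] and not visited[rr, cc]:
            out.append((rr, cc))
    return out

def link_edges(edge_map, min_len=10):
    """Greedy edge linking: walk along 8-connected pixels of a thinned binary
    edge map and collect them into coordinate lists, keeping only lists with
    at least `min_len` pixels. This is a simplified stand-in for [Kov00]."""
    edge = np.asarray(edge_map, dtype=bool)
    visited = np.zeros_like(edge)
    chains = []
    for r, c in zip(*np.nonzero(edge)):
        if visited[r, c]:
            continue
        chain = [(r, c)]                 # grow a chain from this seed pixel
        visited[r, c] = True
        for _ in range(2):               # extend one end, then the other
            while True:
                nxt = _neighbors(chain[-1], edge, visited)
                if not nxt:
                    break
                chain.append(nxt[0])
                visited[nxt[0]] = True
            chain.reverse()
        if len(chain) >= min_len:
            chains.append(chain)
    return chains

if __name__ == "__main__":
    edge = np.zeros((20, 20), dtype=bool)
    edge[5, 2:18] = True                 # a horizontal edge segment
    lists = link_edges(edge, min_len=5)
    print(len(lists), len(lists[0]))     # 1 chain of 16 pixels
```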
Edge detection is a point-based result to point out the edge strength of the image, which is more focused on the local information. However, contour detection is a curve-based result to show the closed curves of the objects, which is more focused on the global information. Contours are often achieved by linking edges along some orientations to close the boundary of the objects. Because of the diversity of images, it’s really hard to achieve totally closed contours from edges. But it tries to get as many as long edge lists to form contours. Some results before and after the edge link are shown in Fig. 4.8, where the maps in (d) are our 1D contour cue. We will apply them to the contour-guided surface merger in 71 Sec. 4.3.2. In the contour map, 1 indicates contour pixels, and 0 indicates non-contour pixels. (a) (b) (c) (d) Figure 4.8: Contour detection results before and after edge link. (a) original color image. (b) float edge map before edge link. (c) contour map after edge link. (d) the binary contour map. For the maps in (c), the connected pixels with the same color represent one edge list. The maps in (d) are the binary version of the maps in (c), which is our 1D contour cue. 4.2.3 2D Surface Cue Region-based segmentation, which utilizes surface properties to group similar pixels in appearance together, is quite successful in the last decade. We cannot ignore it for better segmentation. 72 Among all of the region-based segmentation algorithms, Mean Shift [CM02] pro- vides accurate region boundaries but leading to over-segmented regions. In contrast, FH Merger [FH04] offers connected regions but with inaccurate boundaries. The compar- isons between these two region-based methods are shown in Fig. 4.9, where the nearby pixels with the same color belong to one cluster, and the number at the bottom-right corner in red is the number of clusters for each method. These two complementary superpixels are both useful for our surface cue. 148 87 217 137 (a) (b) (c) Figure 4.9: Comparisons between Mean Shift and FH Merger in the CIELAB color space. (a) is the original image, (b) is Mean Shift, and (c) is FH Merger. For each segmentation, the nearby pixels with the same color indicate the same cluster. The number at the bottom-right corner of each result in red is the number of clusters. We can see that Mean Shift provides accurate region boundaries but leading to over-segmented regions. In contrast, FH Merger offers connected regions but with inaccurate boundaries. Both methods are useful for our surface cue. The comparisons of FH Merger in difference color space are shown in Fig. 4.10. We can see FH Merger in the modified CIELCH space can deal with the gray regions much better than FH Merger in the CIELCH space. Our 2D surface cue includes four maps for preparation, which are generated from Mean Shift and FH Merger in both CIELAB color space[Wik14b] and modified 73 30 148 (a) (b) (c) (d) 43 24 83 49 Figure 4.10: FH Merger in difference color spaces for comparisons. (a) is the original image, (b) is in the CIELAB color space, (c) is in the CIELCH color space, and (d) is in the modified CIELCH color space. We can see FH Merger in the modified CIELCH space can deal with the gray regions much better than FH Merger in the CIELCH space. CIELCH color space, respectively. Mean Shift in these two color spaces will be applied to our fusion approach to image segmentation in Sec. 4.3.2. 
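Both the hue discontinuity used by the contour cue and the clustering in the modified CIELCH space rely on the circular treatment of hue from Sec. 4.2.1. The sketch below implements the circular mean (Eqn. 4.1), the circular distance (Eqn. 4.3), and the median-of-four-shifts discontinuity, reproducing the 5°/300° example from Sec. 4.2.2.

```python
import numpy as np

def circular_mean_hue(theta):
    """Mean of hue angles (radians) following Eqn. 4.1: average the sines and
    cosines, then map atan back to [0, 2*pi) with unit-step corrections."""
    theta = np.asarray(theta, dtype=float)
    y = np.mean(np.sin(theta))
    x = np.mean(np.cos(theta))
    eps = 1e-5
    u = lambda v: 1.0 if v >= 0 else 0.0         # unit step of Eqn. 4.2
    mean = np.arctan(y / (x + eps)) + np.pi * u(-x) + 2 * np.pi * (1 - u(-x)) * u(-y)
    return mean % (2 * np.pi)

def hue_distance(a, b):
    """Distance between two normalized hues in [0, 1] (Eqn. 4.3)."""
    d = abs(a - b)
    return min(d, 1.0 - d)

def cyclic_hue_discontinuity(h_deg, grad):
    """Robust hue discontinuity (Sec. 4.2.2): compute the gradient of the hue
    map under shifts of 0/90/180/270 degrees and keep the per-pixel median,
    which removes the large wrap-around errors of the naive difference.

    h_deg: hue map in degrees; grad: callable returning a gradient-magnitude map.
    """
    shifted = [np.mod(h_deg + s, 360.0) for s in (0.0, 90.0, 180.0, 270.0)]
    grads = np.stack([grad(h) for h in shifted], axis=0)
    return np.median(grads, axis=0)

if __name__ == "__main__":
    # Wrap-around example from the text: neighboring hues of 5 and 300 degrees.
    h = np.array([[5.0, 300.0]])
    naive = lambda m: np.abs(np.diff(m, axis=1))     # simple horizontal difference
    print(naive(h).item())                           # 295 (ignores the wrap)
    print(cyclic_hue_discontinuity(h, naive).item()) # 65 = median of 295, 65, 65, 65
```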
4.2.4 3D Depth Cue Nevertheless, the surface cue fails to group the textured regions where high contrast appears, and fails to find the accurate boundaries of those regions. These limitations will be alleviated by our depth cue. Depth estimation from a still image separates the scenery into several layers glob- ally, which helps to simplify the representation of the regions for image segmentation, especially the textures. The intuitive idea for depth estimation is that when taking a photo, the objects close to the focal plane are in focus, while the objects far from the focal plane are out of focus [ZS11]. According to Sec. 2.8, Hu and Haan’s blur estimator [HdH06] is a useful tool to generate a sparse defocus map around the object boundaries, shown in Fig. 4.14(b), 74 where darker pixels indicate regions in-focus and brighter pixels suggest regions out-of- focus. However, the map is not dense and smooth enough, which cannot be adopted for the depth estimation directly. In addition, for some dark regions, the contrast is so low that it is challenging to estimate the blur. To overcome these problems, we propose our depth estimation in Fig. 4.11, followed by FH Merger [FH04] to the 3D depth cue. The depth estimation has four components, including CLAHE [Zui94], Hu and Haan’s blur estimator, guided filter [HST13], and matting Laplacian [LLW08]. Input Grayscale Image CLAHE Hu and Haan ’s Blur Estimator Depth Estimation Guided Filter Matting Laplacian Figure 4.11: Block diagram of our depth estimation. We apply contrast limited adaptive histogram equalization (CLAHE), discussed in Sec. 2.7, as a preprocessing step to improve the local contrast, and bring out more details for the dark regions. At the same time, it restrains the noise in the relatively homogeneous regions. Fig. 4.12 show the final depth estimation results with and without CLAHE for comparisons. We can see after CLAHE, the details in the gray regions are enhanced, which makes it easier to differentiate the object from the background. After the sparse blur estimation, some black pixels appear inside some regions, which come from textured regions or noise. To attenuate the depth variance inside the regions, we adopt guided filter (Sec. 2.9) to clean the textures and noises but also keep the edges. The sparse defocus map is the filtering input and the original color image is the guidance. Fig. 4.13 show the final depth estimation results with and without guided filter for comparisons. We can see after guided filter, the depth variance inside the same 75 (a) (b) (c) (d) Figure 4.12: Depth estimation with and without CLAHE. (a) is the original grayscale image, (b) is depth estimation without CLAHE, (c) is enhanced image after CLAHE, and (d) is depth estimation with CLAHE. In the depth maps, the brighter pixels indicate the in-focus regions or the foreground, while the darker pixels indicate the out-of-focus regions or the background. We can see after CLAHE, the details in the gray regions are enhanced, which makes it easier to differentiate the object from the background. object becomes much smaller, which helps to group the object regions in the same depth layer. Finally, we utilize the matting Laplacian (Sec. 2.10) to propagate the sparse defocus map to the entire image to make it smooth. The optimization function is formulated in Eqn. 4.5. E (d) =d T Ld + d ^ d T D d ^ d (4.5) 76 (a) (b) (c) Figure 4.13: Depth estimation with and without guided filter. 
(a) is the original grayscale image, (b) is depth estimation without guided filter, and (c) is depth estimation with guided filter. We can see after guided filter, the depth variance inside one region becomes much smaller, which helps to group the object regions in the same depth layer. where ^ d andd are the vector forms of the sparse defocus map and the target full defocus map. L is the matting Laplacian matrix and D is a diagonal matrix. is the scalar to balance between fidelity to the spares depth map and smoothness of interpolation. There are two assumptions for the function. One is that the pixels nearby with similar appearance should have similar depth estimation. The other is that the depth estimation for the sparse pixels should be close to the sparse defocus map as much as possible. Fig. 4.14 show some results of depth estimation and its FH Merger, which is our 3D depth cue. We can see the depth cue can group the textured regions well, like the 77 branches regions and the starfish regions, which cannot be realized by the contour or the surface cue. 50 41 (a) (b) (c) (d) Figure 4.14: Our 3D depth cue generation. (a) Original image. (b) Sparse defocus map by Hu and Haan [HdH06]. (c) Our depth estimation where darker pixels indicate regions out-of-focus, and brighter pixels suggest regions in-focus. (d) FH Merger for depth estimation as our 3D depth cue. The number at the bottom-right corner is the number of clusters. We can see the branches regions and the starfish regions can be merged well using our depth cue. However, when there are no edges, the blur estimation on those regions will be meaningless. For this kind of cases, we have to ask the surface cue for help to save the unreliable depth cue. That is the reason we utilize all of the three cues for our segmentation solution. The FH Merger for the depth map will be applied to our region-dependent approach in Sec. 4.3.3 to handle the segmentation in textured regions. 78 4.3 Region-Dependent Spectral Segmentation Based on the three elementary and complementary cues, we propose our framework with two strategies to achieve better segmentation. The first strategy is the contour- guided surface merger (CGS), and the second is the region-dependent spectral graph (RDS). 4.3.1 Proposed Framework We construct the fusion framework for automatic image segmentation, shown in Fig. 4.15, where mid-level layer, including the three elementary cues (contour, surface, depth, and segment size), are obtained from low-level layer (luminance, hue, edge, and blur). For high-level layer, the contour-guided surface merger (CGS) and the region- dependent spectral graph (RDS) will achieve fine-scale segmentation. This structure realizes progressive segmentation, and all the cues in the mid-level layer will be lever- aged for the final decision. Luminance Hue Edge Blur Contour Surface Depth Size Contour-guided Surface Merger Low-level Layer Mid-level Layer High-level Layer Region-dependent Spectral Graph Image Segmentation Figure 4.15: Our fusion framework for image segmentation. 79 4.3.2 Contour-Guided Surface Merger The contour-guided surface merger is inspired by FH Merger [FH04]. The difference is that all the graph properties are based on the boundary proportion instead of the region distance. According to the observations from the surface cue, especially Mean Shift, many fake boundaries appear between two neighbor regions, which can be corrected guided by the contour map. 
When the contours on the common boundary of two neighboring regions are weak, these two regions are supposed to be merged. The contour-guided surface merger can work on any region-based map, but we only apply it to the surface cue produced by Mean Shift, which has good boundary adherence. For the other cues it has unavoidable drawbacks: for the surface cue by FH Merger, the poor boundary adherence would remove true boundaries in many cases, and the depth cue is mainly reliable for textured regions, whose boundaries are not accurate enough to separate the textured regions. Our target is to further merge the surface map by Mean Shift from Sec. 4.2.3 using the contour map from Sec. 4.2.2.

Starting from the surface map by Mean Shift, we generate one node $i$ for each cluster in the graph $G = (V, E)$. Each node $i \in V$ has the properties of perimeter $P_i$, mean color $I_i$, and size $S_i$. For each edge between two neighboring nodes $e_{ij} \in E$, we calculate the length of the common boundary $l_{ij}$ and the contour length $c_{ij}$ on this common boundary. The edge weight (difference between two neighboring nodes) $\omega_{ij}$ is measured by the proportion of strong contours on the common boundary between the two nodes, which lies between 0 and 1:

$$\omega_{ij} = \frac{|c_{ij}|}{|l_{ij}|}. \quad (4.6)$$

When the difference between two neighboring nodes is smaller than or equal to the proportion of the common boundary in the total boundary of either node, and the distance between the mean colors of the two nodes is not large, the two nodes are merged. This merge rule is represented in Eqn. 4.7:

$$\omega_{ij} \leq \frac{|l_{ij}|}{\min\{P_i, P_j\}}, \qquad \|I_i - I_j\| < T_I. \quad (4.7)$$

Then the perimeter, the mean color, and the size of the merged node are updated accordingly:

$$P_{i+j} = P_i + P_j - 2|l_{ij}|, \qquad I_{i+j} = \frac{I_i S_i + I_j S_j}{S_i + S_j}, \qquad S_{i+j} = S_i + S_j. \quad (4.8)$$

The merger typically happens when a small cluster is surrounded by a large one with similar appearance. In this case, regardless of the outside cluster, the proportion of the common boundary in the total boundary of the inside cluster is close to 1. If there are no strong contours between them, the small inside cluster is merged into the large outside cluster. We check the merge rule iteratively from the edge with the smallest weight to the edge with the largest weight until all edges between neighboring nodes have been scanned. This method merges small regions and simplifies the representation.

Fig. 4.16 shows some merge results for the surface cue by Mean Shift, which indicate that the further merging of the surface map, guided by the contour map, removes many regions with fake boundaries. The number of clusters for each map is reduced by around half on average.

Figure 4.16: Some results of the contour-guided surface merger. (a) is the original image, (b) is our contour cue for guidance, (c) is our surface cue by Mean Shift, and (d) is our contour-guided surface map by Mean Shift. For each result, nearby pixels with the same color indicate the same cluster, and the number at the bottom-right corner in red is the number of clusters. The further merging by our contour-guided merger removes many regions with fake boundaries, and the number of clusters for the surface map is reduced by around half on average.
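A minimal sketch of this merger on a precomputed region-adjacency graph follows. It assumes the cluster statistics (perimeters, mean colors, sizes) and the per-edge boundary and contour lengths are already measured, and it does not re-aggregate boundary statistics of edges that become parallel after a merge, so it illustrates the rule of Eqns. 4.6-4.8 rather than reproducing the full implementation.

```python
import numpy as np

class ContourGuidedMerger:
    """Sketch of the contour-guided surface merger (Sec. 4.3.2, Eqns. 4.6-4.8).

    Nodes are Mean-Shift clusters with perimeter P, mean color I, and size S.
    Each edge between adjacent clusters carries the common-boundary length l_ij
    and the length of strong contours on that boundary c_ij. Edges are scanned
    from the weakest contour proportion to the strongest.
    """

    def __init__(self, P, I, S):
        self.P = np.asarray(P, dtype=float)      # perimeters
        self.I = np.asarray(I, dtype=float)      # mean colors (n x 3)
        self.S = np.asarray(S, dtype=float)      # sizes (pixel counts)
        self.parent = list(range(len(P)))        # union-find labels

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def merge(self, edges, T_I=0.2):
        """edges: list of (i, j, l_ij, c_ij). Returns the final label per node."""
        # Eqn. 4.6: edge weight = proportion of strong contour on the boundary.
        edges = sorted(edges, key=lambda e: e[3] / max(e[2], 1e-12))
        for i, j, l_ij, c_ij in edges:
            ri, rj = self.find(i), self.find(j)
            if ri == rj:
                continue
            w = c_ij / max(l_ij, 1e-12)
            # Eqn. 4.7: weak contour on a significant shared boundary,
            # and similar mean color -> merge.
            if w <= l_ij / min(self.P[ri], self.P[rj]) and \
               np.linalg.norm(self.I[ri] - self.I[rj]) < T_I:
                # Eqn. 4.8: update perimeter, mean color, and size of the union.
                self.P[ri] = self.P[ri] + self.P[rj] - 2.0 * l_ij
                self.I[ri] = (self.I[ri] * self.S[ri] + self.I[rj] * self.S[rj]) \
                             / (self.S[ri] + self.S[rj])
                self.S[ri] = self.S[ri] + self.S[rj]
                self.parent[rj] = ri
        return [self.find(k) for k in range(len(self.parent))]

if __name__ == "__main__":
    # Two small clusters inside a large background cluster: one of similar
    # color with a weak contour, one of different color with a strong contour.
    P = [40.0, 12.0, 10.0]
    I = [[0.5, 0.5, 0.5], [0.52, 0.5, 0.48], [0.1, 0.9, 0.1]]
    S = [900.0, 36.0, 25.0]
    edges = [(0, 1, 12.0, 1.0), (0, 2, 10.0, 9.0)]   # (i, j, l_ij, c_ij)
    print(ContourGuidedMerger(P, I, S).merge(edges))  # [0, 0, 2]
```

In the toy example, the low-contrast inner cluster is absorbed by the background, while the strongly contoured one is kept as a separate region.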
4.3.3 Region-Dependent Spectral Graph

We adopt multi-layer spectral segmentation [KLL13] (Sec. 2.13) to learn full affinities using different structures and initial weights. Unlike [KLL13], we organize five region-based layers (surface and depth) with initial weights chosen according to the type of the regions, which we call the region-dependent spectral graph (RDS). For non-textured regions, we trust the contour cue and the surface cue more; for textured regions, we rely on the depth cue only. In our design, a disorganization indicator based on component counts is applied for pixel-based texture detection $t_i$ [BNR08] (Sec. 2.12), ranging from 0 to 1, where 1 is the highest texture likelihood and 0 is the lowest.

As shown in Fig. 4.17, for our multi-layer graph $G = (V, E)$, the nodes $V = X \cup Y$ are composed of the pixels of the original image $X$ and the regions $Y = Y_1 \cup Y_2 \cup Y_3 \cup Y_4 \cup Y_5$ from the surface cue by FH Merger (CIELAB and modified CIELCH color spaces), the contour-guided surface map by Mean Shift (CIELAB and modified CIELCH color spaces), and the depth cue by FH Merger. The edges in green and in violet show two kinds of edge examples connected to one pixel in the pixel-based layer and one region in a surface layer. All edge weights are controlled by the region property through the texture indicator. In detail, the undirected edges $e_{ij} \in E$ are linked by four different criteria according to the node types, as follows.

Figure 4.17: Region-dependent spectral graph, consisting of the pixel-based layer $X$, the surface layers $Y_1$-$Y_4$, the depth layer $Y_5$, and the texture indicator.

1) If two pixels $i, j \in X$ are adjacent under 8-connectivity, an edge $e_{ij}$ is linked with the weight

$$\omega_{ij} = \bigl(1 - \min(t_i, t_j)\bigr)\, \min\{\mathrm{sim}(c_i, c_j),\, \mathrm{sim}(s_i, s_j)\} + \min(t_i, t_j)\, \mathrm{sim}(d_i, d_j), \qquad \mathrm{sim}(\mathrm{cue}_i, \mathrm{cue}_j) = \exp\!\left(-\frac{\|\mathrm{cue}_i - \mathrm{cue}_j\|}{\sigma_{\mathrm{cue}}}\right), \quad (4.9)$$

where $\mathrm{sim}$ is the similarity on the three cues, $\mathrm{cue} \in \{c, s, d\}$, and the constant $\sigma_{\mathrm{cue}}$ controls the strength of the weight for each cue.

2) If two adjacent regions share a common boundary in the same surface layer ($i, j \in Y_1, Y_2, Y_3$, or $Y_4$), an edge $e_{ij}$ is linked with the weight

$$\omega_{ij} = \bigl(1 - \min(\bar{t}_i, \bar{t}_j)\bigr)\, \min\{\mathrm{sim}(\bar{c}_i, \bar{c}_j),\, \mathrm{sim}(\bar{s}_i, \bar{s}_j)\}, \qquad \mathrm{sim}(\bar{c}_i, \bar{c}_j) = \exp\!\left(-\frac{c_{ij}/l_{ij}}{\sigma_c}\right), \qquad \mathrm{sim}(\bar{s}_i, \bar{s}_j) = \exp\!\left(-\frac{\|\bar{s}_i - \bar{s}_j\|}{\sigma_s}\right), \quad (4.10)$$

where $\bar{t}_i$ and $\bar{s}_i$ denote the mean texture and mean color of the inner pixels of region $i$, and $c_{ij}/l_{ij}$ is the proportion of strong contours on the common boundary between the adjacent regions $i$ and $j$.

3) If two adjacent regions share a common boundary in the depth layer ($i, j \in Y_5$), an edge $e_{ij}$ is linked with the weight

$$\omega_{ij} = \min(\bar{t}_i, \bar{t}_j)\, \mathrm{sim}(\bar{d}_i, \bar{d}_j), \qquad \mathrm{sim}(\bar{d}_i, \bar{d}_j) = \exp\!\left(-\frac{|\bar{d}_i - \bar{d}_j|}{\sigma_d}\right), \quad (4.11)$$

where $\bar{d}_i$ denotes the mean depth of the inner pixels of region $i$.

4) If a pixel $i \in X$ is contained in its corresponding region $j \in Y_l$, an edge $e_{ij}$ is linked with the constant weight

$$\omega_{ij} = \alpha, \quad (4.12)$$

where $\alpha$ is a parameter that controls the correlation between the pixel- and region-based layers.

The intra-layer weights in Eqns. 4.9 and 4.10 take the minimum of the contour and surface similarities because the contour cue is more reliable when a discontinuity is detected; otherwise the surface cue is considered.

4.4 Experimental Results

In this section, we evaluate the improvement brought by our contour-guided surface merger (CGS) and the performance of our region-dependent spectral segmentation (RDS) on a benchmark image database, and compare them with state-of-the-art methods. The surface cue of RDS is generated by the contour-guided surface merger applied to Mean Shift [CM02] in the CIELAB and the modified CIELCH color spaces, and by FH Merger [FH04] in the CIELAB and the modified CIELCH color spaces. The depth cue of RDS is generated by FH Merger with the same parameters.
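A sketch of these linking rules follows, using the reconstructed weights above. The σ defaults follow the σ_cue = 60 reported later in this section; the cue values in the example are purely illustrative, and the constant pixel-to-region weight of Eqn. 4.12 is omitted since it is a single parameter.

```python
import numpy as np

def sim(a, b, sigma):
    """Cue similarity: exp(-||a - b|| / sigma)."""
    return float(np.exp(-np.linalg.norm(np.atleast_1d(a) - np.atleast_1d(b)) / sigma))

def pixel_weight(t_i, t_j, c_i, c_j, s_i, s_j, d_i, d_j, sigma_cue=60.0):
    """Eqn. 4.9: weight between two 8-connected pixels. Non-textured pixels
    rely on the smaller of the contour and surface similarities; textured
    pixels rely on the depth similarity."""
    t = min(t_i, t_j)
    return (1.0 - t) * min(sim(c_i, c_j, sigma_cue), sim(s_i, s_j, sigma_cue)) \
           + t * sim(d_i, d_j, sigma_cue)

def surface_region_weight(t_i, t_j, contour_ratio, s_i, s_j,
                          sigma_c=60.0, sigma_s=60.0):
    """Eqn. 4.10: weight between adjacent regions of a surface layer.
    `contour_ratio` is c_ij / l_ij, the fraction of the common boundary
    covered by strong contours; t_i, s_i are mean texture and mean color."""
    return (1.0 - min(t_i, t_j)) * min(np.exp(-contour_ratio / sigma_c),
                                       sim(s_i, s_j, sigma_s))

def depth_region_weight(t_i, t_j, d_i, d_j, sigma_d=60.0):
    """Eqn. 4.11: weight between adjacent regions of the depth layer,
    which only matters where the texture indicator is high."""
    return min(t_i, t_j) * sim(d_i, d_j, sigma_d)

if __name__ == "__main__":
    # A smooth (t ~ 0) pixel pair with similar color, and a textured (t ~ 1)
    # pair whose colors differ but whose depths agree: both stay strongly
    # connected, for different reasons.
    print(pixel_weight(0.1, 0.1, 20.0, 25.0, [50, 5, 5], [52, 6, 4], 0.4, 0.45))
    print(pixel_weight(0.9, 0.9, 20.0, 80.0, [50, 5, 5], [20, 40, 60], 0.4, 0.42))
```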
For Mean Shift, the parame- ter (h s ;h r ;M) = (9; 7; 200), where h s and h r are bandwidth in the spatial and color domain, andM is the minimum size of each segment. For FH Merger, the parameter (;k;M) = (1:0; 0:6; 200), where is the smoothing factor, k is the constant for the threshold function, andM is the minimum size of each segment.The parameter of the contour-guided surface merger (T I ;M) = (0:2; 200), whereT I is the threshold of mean color distance. For the RDS graph, the smoothness factorc in MLSS, cue and are set empirically toc = 10 5 , cue = 60, and = 10 2 . 4.4.1 Dataset The Berkeley Segmentation Dataset and Benchmark provides an empirical basis for research on image segmentation and boundary detection [MFTM01]. The public bench- mark consists of all the color segmentation for 300 images. The images have a size of 481x321 pixels, either horizontally or vertically. For each image, they collect several hand-labeled segmentations from different human subjects in average. For quantitative evaluation, four common criteria (e.g. [KLL13], [LWC12]) are used: 1) Probabilistic Rand Index (PRI) [UPH07], which counts the likelihood of pixel pairs whose labels are consistent between the segmentation and the ground truth; 2) 85 Table 4.1: Quantitative comparisons before and after the contour-guided surface merger for Mean Shift on the BSDS300 database. Algorithm PRI" VoI# GCE# BDE# MS 0.7592 3.3571 0.0827 13.87 CGS MS 0.7854 1.8642 0.1728 13.33 Variation of Information (V oI) [Mei05], which measures the amount of randomness in one segmentation that cannot be contained in the other; 3) Global Consistency Error (GCE) [MFTM01], which measures the extent to which one segmentation can be viewed as a refinement of the other; 4) Boundary Displacement Error (BDE) [FnR + 02], which measures the average displacement error of the boundary pixels between two segmented images. The segmentation is viewed better if PRI is larger or the other three are smaller. 4.4.2 Improvement from Contour-Guided Surface Merger In Table 4.1, we report the average scores over the Berkeley Database before and after the contour-guided surface merger for Mean Shift: Mean Shift (MS) and contour-guided surface merger (CGS MS). The region number for each image is obtained automatically using the parameter set discussed above. This comparison indicates that the contour- guided surface merger largely improves the segment performance with respect to the ground truth. For GCE, MS can be considered as a refinement of CGS MS, thus MS has better GCE performance. 4.4.3 Performance of Region-Dependent Spectral Segmentation In Table 4.2, we report the average scores over the Berkeley Database for eight other seg- mentation methods: NCut [SM00], MeanShift [CM02], FH [FH04], JSEG [DbsM01], MNCut [CBS05], NTP [WJHZ08], MLSS [KLL13], and SAS [LWC12]. The best and the second best results for each metric are highlighted in red and blue, respectively. In 86 Table 4.2: Quantitative comparisons of our approaches with other segmentation methods on the BSDS300 Dataset, where the best two results are highlighted in red (best) and blue (second best). 
Algorithm PRI" VoI# GCE# BDE# NCut [SM00] 0.7242 2.9061 0.2232 17.15 MeanShift [CM02] 0.7958 1.9725 0.1888 14.41 FH [FH04] 0.7139 3.3949 0.1746 16.67 JSEG [DbsM01] 0.7756 2.3217 0.1989 14.40 MNCut [CBS05] 0.7559 2.4701 0.1925 15.10 NTP [WJHZ08] 0.7521 2.4954 0.2373 16.30 MLSS [KLL13] 0.8146 1.8545 0.1809 12.21 SAS [LWC12] 0.8319 1.6849 0.1779 11.29 RDS 0.8321 1.7502 0.1738 11.19 RDS(w/o CGS) 0.8197 1.8642 0.1924 13.20 RDS(w/o depth) 0.8043 1.9645 0.1828 12.31 comparison, we list our region-dependent spectral method (RDS) with all the three ele- mentary cues, without CGS in the surface cue, and without depth layer, respectively. To achieve the optimal performance of our method, we manually select the different segment number for each image, as what MLSS [KLL13] did. According to the results in Table 4.2, except V oI, our RDS with all the three elemen- tary cues can achieve the best among all the unsupervised segmentation methods. Our V oI is second only to SAS. Compared with contour-guided surface merger, the depth cue is more significant to boost the segmentation performance, especially the textured regions. This indicates that these three cues are three indispensable and complementary cues for better image segmentation. Fig. 4.18 - 4.21 show some examples of our RDS segmentation with all the three cues compared with two reference methods: MLSS and SAS for visual comparison. As shown in the results, compared with the state-of-the-art approaches, our RDS segmen- tation can always capture the key structure of the image (reduce the leakage problem 87 due to the contour cue), and merge the textured regions (reduce the oversegmentation problem due to the depth cue). 4.5 Conclusion In this chapter, we have proved that the 1D contour cue, the 2D surface cue, and the 3D depth cue are three indispensable and complementary cues for a robust image seg- mentation solution. The contour cue describes discontinuities between two regions. The surface cue illustrates similarities inside one region. The depth cue makes use of blur estimation to indicate the image layout even if some regions are apparently similar. Nei- ther one of these three cues is feasible for all the images with diversity and ambiguity of visual scenery. Based on the proposed preparation of three elementary cues, we propose a fusion framework with two strategies. The first is the contour-guided surface merger (CGS), which further merges the adjacent regions of the surface map by Mean Shift with weak boundaries guided by the contour cue. The second is the region-dependent spectral graph (RDS), which organizes region-based layers in the spectral graph with initial weights according to the type of regions. For non-textured regions, we trust more in the contour cue, while for textured regions, we just rely on the depth cue. Extensive experimental results on the Berkeley Segmentation Dataset have shown all the three ele- mentary cues are required for better segmentation and the superiority of the proposed method in terms of quantitative and perceptual criteria. 88 (a) Original (c) MLSS (d) SAS (e) RDS (b) Ground Truths Figure 4.18: Visual comparisons of RDS segmentation results against two state-of-the- art methods MLSS and SAS. 89 (a) Original (c) MLSS (d) SAS (e) RDS (b) Ground Truths Figure 4.19: Visual comparisons of RDS segmentation results against two state-of-the- art methods MLSS and SAS. 
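For reference, the Probabilistic Rand Index reported in Tables 4.1 and 4.2 can be computed from a contingency table. The sketch below relies on the fact that averaging the Rand index over the available human ground truths is consistent with the pairwise formulation of [UPH07]; the toy labelings are illustrative.

```python
import numpy as np

def rand_index(seg, gt):
    """Rand index between two label maps: the fraction of pixel pairs whose
    same-segment / different-segment relation agrees in both labelings."""
    seg = np.asarray(seg).ravel()
    gt = np.asarray(gt).ravel()
    n = seg.size
    # Contingency table between the two labelings.
    code = seg.astype(np.int64) * (gt.max() + 1) + gt
    n_ij = np.bincount(code).astype(np.float64)
    a_i = np.bincount(seg).astype(np.float64)     # segment sizes in `seg`
    b_j = np.bincount(gt).astype(np.float64)      # segment sizes in `gt`
    comb = lambda x: (x * (x - 1.0) / 2.0).sum()  # sum of C(x, 2)
    total = n * (n - 1.0) / 2.0
    return 1.0 + (2.0 * comb(n_ij) - comb(a_i) - comb(b_j)) / total

def probabilistic_rand_index(seg, ground_truths):
    """PRI of Sec. 4.4.1: average the Rand index over all human ground truths."""
    return float(np.mean([rand_index(seg, g) for g in ground_truths]))

if __name__ == "__main__":
    seg = np.array([[0, 0, 1, 1],
                    [0, 0, 1, 1]])
    gts = [np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1]]),     # identical -> RI = 1
           np.array([[0, 0, 0, 1],
                     [0, 0, 0, 1]])]     # one column disagrees
    print(probabilistic_rand_index(seg, gts))
```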
90 (a) Original (c) MLSS (d) SAS (e) RDS (b) Ground Truths Figure 4.20: Visual comparisons of RDS segmentation results against two state-of-the- art methods MLSS and SAS. 91 (a) Original (c) MLSS (d) SAS (e) RDS (b) Ground Truths Figure 4.21: Visual comparisons of RDS segmentation results against two state-of-the- art methods MLSS and SAS. 92 Chapter 5 Robust Image Segmentation Using Contour-guided Color Palettes 5.1 Introduction Automatic image segmentation is a fundamental problem in computer vision. It plays an important role in diverse applications, such as object detection, scene parsing, and image retrieval. It partitions an image into a small number of disjointed coherent regions with low-level features, with the goal of minimizing intra-variance and maximizing inter- variance among regions. It is desired that the segmentation result is close to human semantic understanding and not sensitive to parameter setting and/or image content. To segment an image, pixel (or superpixel) grouping in the spatial and spectral domains were performed in the literature. Typically, spatial-domain pixel grouping is guided by contours [AMFM11, DUHB09] while spectral-domain pixel grouping is achieved by clustering in a color space [RM00, CM02, CBS05, FH04, KLL13, LWC12, SM00]. Thus, contours and colors are two widely used features in image segmentation, yet each of them has its own limitations. For example, contours are not reliable if they are short and fragmented. They might fail to separate two regions if parts of their com- mon boundaries are blurred and/or with a low contrast. The color feature is not effective to handle regions with textures or gradual color transition, leading to over-segmentation. One common challenge in these methods is the selection of proper parameters, such as 93 the color clustering bandwidth. In general, these optimal parameters are image depen- dent and difficult to determine. In this work, we integrate contour and color cues under one unified framework, and propose the contour-guided color palette (CCP) for robust image segmentation. 1 [FWC + 15] That is, it has only one key parameter and its perfor- mance is stable when the parameter lies in a suitable range. The basic idea of CCP is described as follows. To find representative colors of a given image, we collect color samples from both sides of long contours, and conduct Mean Shift (MS) algorithm [CM02] in the sampled color space to define an image- specific color palette. This scheme reduces color complexity of the original image, yet keeps a sufficient number of representative colors to separate distinctive regions and yield a preliminary segmentation. This result is further refined by post-processing tech- niques in the spatial domain, which leads to a robust standalone segmentation. The CCP result can be applied to any superpixel-based segmentation algorithm by replacing the over-segmentation layer, such as Mean Shift (MS) [CM02], Felzenszwalb and Hutten- locher’s graph-based (FH) [FH04], and SLIC [ASS + 12] superpixels. Furthermore, it can be integrated into the layered spectral segmentation framework, such as multi-layer spectral segmentation (MLSS) [KLL13] and segmentation by aggregating superpixels (SAS) [LWC12], and used as a coarse layer in this context for a more robust segmen- tation. The superior performance of CCP-based segmentation algorithms are demon- strated in the experiments on the Berkeley Segmentation Dataset (BSDS) [MFTM01]. The rest of this chapter is organized as follows. 
Related work is reviewed in Sec. 5.2. The CCP method is described in detail in Sec. 5.3. The advantages of CCP over MS are analyzed in Sec. 5.4. Then, the integration of CCP with layered spectral segmentation is 1 The MATLAB code of CCP Segmentation can be downloaded at https://github.com/ fuxiang87/MCL\_CCP. 94 introduced in Sec. 5.5. Experimental results are shown in Sec. 5.6. Finally, concluding remarks are given in Sec. 5.7. 5.2 Related Work According to the studies on human visual perception [Gib50, BGG03], people pay more attention to dissimilarities between two regions and lean to group similar regions in appearance, which correspond to the contour (1D) and the regional (2D) cues, respec- tively. Both of the two cues are needed for a better image segmentation. Regional cues are contributed by color and texture. Most image segmentation algorithms can be classified into two categories, i.e., region-based and contour-based methods. Region-based methods find the similarity among spatially connected pixels and group them together using surface properties such as luminance and color. Repre- sentative approaches include watershed [RM00], k-means, Mean Shift (MS) [CM02], normalized cuts (NCut) [CBS05, SM00], Felzenszwalb and Huttenlocher’s graph-based (FH) [FH04], multi-layer spectral segmentation (MLSS) [KLL13], and segmentation by aggregating superpixels (SAS) [LWC12]. However, these methods might neglect obvious discontinuities between two regions. To overcome this limitation, contour- based methods, such as gPb-OWT-UCM [AMFM11], and saliency driven total varia- tion (SDTV) [DUHB09] were developed to find connected regions blocked by detected contours. However, it is still challenging to detect closed contours in low-contrast or blurred regions for segmentation. One can combine region and contour cues to overcome their individual limitations, and several ideas were introduced in [FnR + 02, MFCM03]. For example, one can take the contour cue as a post-processing step to correct region-based segmentation results or treat the contour as a barrier in an affinity measure. 95 In this work, a new method called CCP is proposed to effectively integrate the con- tour and the color cues. Unlike the other methods, we take the contour cue as guidance to form an image-dependent color palette. It reduces color complexity of the origi- nal image, yet keeps a sufficient number of representative colors to separate distinctive regions. The CCP method is detailed in the following section. 5.3 Contour-guided Color Palette Method 5.3.1 System Overview The basic idea of the CCP method can be simply stated as follows. Long contours play an important role in image segmentation since they provide useful spatial-domain information in region partition. However, they may not form a closed region due to weak boundaries in some parts, leading to the leakage problem. To assist contour-guided segmentation, we use the color information as an auxiliary cue. That is, we collect color samples along both sides of each long contour and perform clustering in the sampled color space for color quantization. Once color is quantized, we get a number of closed regions with long contours as their boundaries. This initial segmentation can be further refined by post-processing techniques in the spatial domain. Fig. 5.1 shows the block diagram of the CCP method, which mainly consists of three modules: (1) image pre- processing, (2) contour-guided color palette generation for an initial segmentation, and (3) segment post-processing. 
The image pre-processing module includes denoising and contour extraction. There are many standard algorithms to select for this module. In our implementation, we adopt the bilateral filtering scheme [TM98] for denoising. And, we apply the structured edge detection [DZ13] method to the original input image to obtain a contour map with pixel value indicating the probability of being a contour point. Then, long contours are 96 Original Long Contours Denoised Color Palette Generation Leakage Avoidance Fake Boundary Removal Small Region Mergence CCP Contour Map Image Pre-Processing Initial Segmentation Segment Post-Processing Figure 5.1: The block diagram of the proposed contour-guided color palette (CCP) method. selected from the contour map. The contour information will be used to generate the desired color palette, based on which the initial segmentation result is directly obtained (module 2), followed by an effective post-processing step (module 3). Both modules 2 and 3 are guided by the contour information. 5.3.2 Color Palette Generation The well-known color clustering algorithms, such as k-means and Mean Shift (MS) [CM02] clustering, consider the color distribution of all pixels or superpixels in an image. However, not all pixels and their associated colors are equally important for the segmentation purpose as illustrated by the following two examples. First, the strong color-varying pixels inside a texture region (e.g., a large number of flowers in a gar- den in Fig. 5.3(a)), where the complexity of the color representation increases, are actually of less importance. Second, the pixels of similar colors inside a homogeneous region (e.g., a large near-white building in Fig. 5.3(e)), that gives many redundant color samples in the color space, are also not that important. A relatively minor variation in these images (e.g., the flower density and the wall size, respectively) will affect the color-based segmentation. Generally speaking, these algorithms are sensitive to their 97 (a) (b) (c) (d) δ δ Δ Figure 5.2: Illustration of banded region of interest (B-ROI) for color palette generation. (a) shows a B-ROI with a bandwidth of 2. Pixels are sampled from both sides of the contour with a uniform stepsize . (b) is an example image overlaid with detected long contours. Long contours are indicated by different colors. (c) and (d) are the zoom-in of two local B-ROI’s of (b). In (c), the colors of the pixels labeled in red along one side of the B-ROI look similar without an obvious jump. In (d), the colors of the pixels labeled in red change with one large jump. parameter settings, and it is challenging to automatically find a good parameter set for an arbitrary image. To develop a robust segmentation algorithm, we attempt to reduce the influence of color variations in an image by selecting a set of representative colors. To achieve this, we focus on key regions and obtain color samples accordingly. For image segmentation, one would like to have large segments and ignore small ones. Since large segments are enclosed by long contours, we can define a banded region of interest (B-ROI) for each long contour and its neighborhood. The B-ROI is centered at the contour location with a bandwidth of 2 as shown in Fig. 5.2(a). After obtaining the B-ROI, we sample pixels from both sides of the contour with a uniform stepsize and have their colors in the CIELAB color space [Wik14b] to form a set of representative colors. 
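A sketch of the B-ROI sampling follows, assuming one ordered long contour is given, reading the bandwidth and stepsize as δ = 2 and Δ = 1 pixels as in Sec. 5.3.2, and estimating the contour normal from neighboring contour points. The Mean Shift clustering of the samples into the palette is not reimplemented here; the toy example simply deduplicates the sampled colors to stand in for it.

```python
import numpy as np

def sample_broi_colors(image_lab, contour, delta=2, step=1):
    """Collect CIELAB color samples from the banded region of interest (B-ROI)
    around one long contour (Sec. 5.3.2).

    image_lab: H x W x 3 CIELAB image.
    contour:   ordered list/array of (row, col) points of one long contour.
    delta:     half-bandwidth of the B-ROI (pixels on each side).
    step:      uniform sampling stepsize along the contour.
    """
    h, w, _ = image_lab.shape
    pts = np.asarray(contour, dtype=float)
    samples = []
    for k in range(0, len(pts), step):
        # Local tangent from neighboring contour points, then its normal.
        p_prev = pts[max(k - 1, 0)]
        p_next = pts[min(k + 1, len(pts) - 1)]
        tangent = p_next - p_prev
        norm = np.linalg.norm(tangent)
        if norm < 1e-9:
            continue
        normal = np.array([-tangent[1], tangent[0]]) / norm
        for side in (+1, -1):                       # both sides of the contour
            r, c = np.rint(pts[k] + side * delta * normal).astype(int)
            if 0 <= r < h and 0 <= c < w:
                samples.append(image_lab[r, c])
    return np.array(samples)

def quantize_to_palette(image_lab, palette):
    """Replace every pixel by its nearest palette color (initial segmentation)."""
    flat = image_lab.reshape(-1, 3)
    d = np.linalg.norm(flat[:, None, :] - palette[None, :, :], axis=2)
    return d.argmin(axis=1).reshape(image_lab.shape[:2])

if __name__ == "__main__":
    img = np.zeros((20, 20, 3))
    img[:, 10:] = [60.0, 20.0, -10.0]               # two flat color regions
    contour = [(r, 10) for r in range(20)]           # vertical boundary
    colors = sample_broi_colors(img, contour, delta=2, step=1)
    palette = np.unique(np.round(colors, 1), axis=0)  # stand-in for MS clustering
    print(palette.shape, np.unique(quantize_to_palette(img, palette)).size)
```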
In the implementation, we used the structured edge detection algorithm [DZ13] to extract the contour and set = 2 and = 1 pixels, respectively. 98 (a) (b) (c) (d) (e) (f) Figure 5.3: Comparison of CCP segmentation results before and after post-processing. Please focus on the squared regions in red: (a) shows a long contour straddled by two regions with similar colors; (c) shows the fake boundary in the sky due to gradual color transition; (e) shows small segments in the background building region. (b), (d), and (f) are the post-processed results of (a), (c), and (e), respectively. We observe two typical cases for pixels along one side of the B-ROI for different images. First, the color remains about the same or changes gradually without an obvious jump. Second, the color changes with one or several large jumps, yet each interval between two jumps does have a similar color. These two cases are shown in Figs. 5.2(b)-(d), where Fig. 5.2(b) is an illustrative image overlaid with long contours while Figs. 5.2(c) and (d) provide the zoom-in images of two local regions of Fig. 5.2(b) and correspond to the two cases, respectively. For Case 1, as shown in Fig. 5.2(c), color samples can be further reduced to their average color. For Case 2, we need to select multiple color samples, each of which represents the color in one interval. As shown in Fig. 5.2(d), the B-ROI goes through the deer body at one side and two background 99 regions at the other side. In this case, we need to split the color samples into two groups and each group is represented by its average color. Further color simplification required by Case 1 and 2 can be achieved by MS clus- tering with bandwidth parameter h r in the spectral domain. Another is to adopt MS clustering with different bandwidth parameters for sampled colors located in different regions of the image. Since the object of interest is usually in the central region while the background is in the boundary region of an image, we adopt two bandwidth param- eters, i.e., a smaller one and a larger one (h rc ; h rb ), for sampled colors in the central and boundary regions, respectively. The final representative color set is called the color palette of the input image. Then, a color-quantized image can be obtained by replacing the color of each pixel with its most similar color in the color palette. In this way, an initial segmentation result is obtained. 5.3.3 Segment Post-Processing Three post-processing techniques are proposed to better the segmentation result: 1) leak- age avoidance by contours, 2) fake boundary removal, and 3) small region mergence, as illustrated in Fig. 5.3. The first problem arises when there is a long contour straddled by two regions with similar colors. One such example is given in Fig. 5.3(a), where the white fence and the white collar are close in color but separated by a long contour. After color quantization, the fence is mingled with the collar to yield complicated patterns and, as a result, these two regions are blurred. This is known as the leakage problem. To avoid this, we check the regions along each side of the contour in the B-ROI. After color quantization, even if both sides of the contour are quantized into the same color, they are still separated by the long contour. 100 The second problem occurs when there is a smooth color transition over a large region. For example, the sky color in Fig. 5.3(c) changes smoothly and it is split into multiple regions due to color quantization. 
This fake boundary can be removed by checking the common boundary of adjacent regions. We consider the ratio of the length of the common boundary and the minimum perimeter of the two regions, which indicates the relative significance of the common boundary. If the common boundary is significant and not overlaid much with detected long contours, these two regions will be merged. By this criterion, isolated regions have a high priority to be merged when there is not a long contour around them. The third problem occurs in the textured area such as the background building with small windows in Fig. 5.3(e). They are merged to the closest “effective neighbors” for simplicity. This can be implemented by merging a small region to its neighbor region of a similar color but without a contour in between. Since region aggregation is irreversible, we need to pay special attention to the order of fake boundary removal and small region mergence. In the beginning, region sizes are relatively small. The small region mergence process might merge two similar regions in the dark or blurred area, leading to the leakage problem. However, the fake boundary removal process does not have this side effect. For this reason, we conduct fake bound- ary removal before small region removal. Fake boundary removal and small region mergence can be conducted iteratively to achieve better performance. We conduct the iteration twice in the implementation. The post-processed results of Figs. 5.3(a), (c) and (e) are shown in Figs. 5.3(b), (d) and (f), respectively. 101 5.4 Comparison of MS and CCP In Fig. 5.4, we compare in detail the segmentation results of Mean Shift (MS) method and our CCP method for three typical images, denoted as #1, #2 and #3 from the left to the right. For MS, we select three best spectral bandwidth (BW) parameters from all the odd numbers between 5 and 25, resulting in 7 (small), 13 (medium) and 19 (large); for CCP, we select the spectral BW parameters as h r = 5 (small), (h rc ; h rb ) = (5; 7) (medium) and h r = 7 (large) in the color palette generation process. The spatial BW parameter is set to 7 in all MS results while no spatial BW parameter is required by CCP. We can see that CCP provides simplified segmentation results, which are more consistent with human perception and can serve as standalone solutions. In contrast, MS gives highly over-segmented images that are not acceptable to human eyes even if the spectral BW parameter is large enough. Similar conclusions were drawn from all the 300 images in the Berkeley Segmentation Dataset (BSDS) [MFTM01]. It is no doubt that CCP visually outperforms MS by a significant margin. BW MS CCP small medium large original MS CCP MS CCP Image #1 Image #2 Image #3 Figure 5.4: Comparisons of segmentation results by MS and CCP for three typical images, with different spectral BW parameters. 102 Selection of proper spectral and spatial BW parameters for MS is actually a chal- lenging task. The quality of the MS segmentation result is sensitive to these two param- eters. They are not only image dependent but also region dependent. To the best of our knowledge, there is no automatic mechanism to select good BW parameters. For com- parison, CCP only demands one BW parameter, and its results are stable over a range of BW values as illustrated in Fig. 5.4. This avoids the huge burden of performance fine-tuning. 
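The two merging tests of Sec. 5.3.3 can be sketched as simple predicates on region statistics. The thresholds below (boundary-significance ratio, contour-overlap ratio, minimum size, color distance) are illustrative assumptions, since the text describes the criteria but not their exact values.

```python
def should_remove_fake_boundary(common_len, perim_a, perim_b, contour_overlap,
                                sig_thresh=0.3, overlap_thresh=0.2):
    """Fake-boundary test: the shared boundary is significant relative to the
    smaller region, yet barely covered by detected long contours.
    The two thresholds are illustrative assumptions, not values from the text."""
    significance = common_len / max(min(perim_a, perim_b), 1e-9)
    contour_ratio = contour_overlap / max(common_len, 1e-9)
    return significance >= sig_thresh and contour_ratio <= overlap_thresh

def should_merge_small_region(size, neighbor_color_dist, has_contour_between,
                              min_size=200, color_thresh=10.0):
    """Small-region test: a region below the minimum size is merged into a
    similarly colored neighbor when no long contour separates them.
    `min_size` and `color_thresh` are again assumed values."""
    return size < min_size and neighbor_color_dist < color_thresh \
           and not has_contour_between

if __name__ == "__main__":
    # A long, contour-free boundary inside a smoothly shaded sky region.
    print(should_remove_fake_boundary(common_len=120, perim_a=300, perim_b=260,
                                      contour_overlap=5))
    # A tiny window-like segment next to a building wall of similar color.
    print(should_merge_small_region(size=40, neighbor_color_dist=3.5,
                                    has_contour_between=False))
```

As noted above, the fake-boundary test is applied before the small-region test, and the two are iterated (twice in the implementation).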
To explain the superior performance of CCP over MS, we list the representative color numbers and the boundary F-measures (harmonic mean of precision and recall, defined in [MFM04]) under three spectral BW parameters for the three images in Table 5.1. We also provide the average results of the entire BSDS300 dataset. A smaller spectral BW parameter usually generates more representative colors as illustrated by the numbers in the same column in the upper half of the table. The corresponding number of representative colors of CCP is significantly less than those of MS, although the three BW parameters of CCP are smaller than their counterparts of MS. In addition, CCP can achieve much better boundary F-measures than MS, as shown in the lower half of the table. This indicates a better boundary adherence with respect to human ground- truth boundaries. These two comparisons show the power of color sampling along the contours adopted by CCP. Through color sampling, we can eliminate color samples in regions of little significance and merge these regions with other important ones, as shown in the mountain, branch and tree regions of image #1, #2 and #3, respectively. Furthermore, because CCP yields fewer color samples in the color space, we can adopt a smaller BW parameter without increasing the number of representative colors too much. In this way, CCP can reduce the risk to make two colors along a significant contour get mixed, and thus avoid a severe leakage problem which usually occurs in the MS method. 103 For comparisons, please look at the boundaries between the bridge and the sky, those between the bird and the sky, and those between the face and the building in Fig. 5.4. Table 5.1: Comparisons of the numbers of representative colors (upper) and the bound- ary F-measures (lower) by MS and CCP under three BW parameters for the three typical images. We also provide the average results of the entire BSDS300 dataset. BW MS CCP #1 #2 #3 All #1 #2 #3 All s. 116 116 217 192 66 83 118 81 m. 100 117 218 178 68 85 122 77 l. 98 113 215 173 54 75 107 63 s. 0.70 0.67 0.72 0.59 0.75 0.75 0.78 0.68 m. 0.69 0.64 0.75 0.60 0.75 0.75 0.80 0.68 l. 0.68 0.60 0.76 0.60 0.74 0.74 0.78 0.68 For a segmented image, we count the number of pixels for a specific representa- tive color and sort the color index according to the number of associated pixels in a descending order. Then, we plot the cumulative normalized histogram as a function of the representative color index, as shown in Fig. 5.5. The curve reaches 100% when all representative colors are used. Let us use Image #1 as an example. The blue, green and red curves are obtained using large, medium, and small spectral BW parameters. The three curves of CCP reach 100% at color index #54, #68 and #66 while those of MS reach 100% at color index #98, #100 and #116, respectively, as indicated by the data in Table 5.1. Meanwhile, CCP can achieve around 10% higher boundary F-measures than MS. There are a few dominant colors in simple images such as Images #1 and #2, which can be caught by both CCP and MS. CCP reaches a higher percentage than MS with these dominant colors. Image #3 is more complicated in its content and more represen- tative colors are needed. In all three cases, along with better boundary adherence, the CCP curves are closer to the upper-left corner of the figure than MS. 
This indicates that CCP can use fewer colors to represent a larger region of an image and provide a more 104 (a) MS, Image #1 (b) CCP, Image #1 (c) MS, Image #2 (d) CCP, Image #2 (e) MS, Image #3 (f) CCP, Image #3 0 50 100 0 0.5 1 Representative Color Index Cumulative Histogram 0 100 200 0 0.5 1 Representative Color Index Cumulative Histogram 0 50 100 0 0.5 1 Representative Color Index Cumulative Histogram 0 50 100 0 0.5 1 Representative Color Index Cumulative Histogram 0 50 100 0 0.5 1 Representative Color Index Cumulative Histogram 0 100 200 0 0.5 1 Representative Color Index Cumulative Histogram Figure 5.5: Plots of the cumulative histogram versus representative color indices for MS and CCP on three typical images, where the blue, green and red curves are obtained using large, medium and small spectral BW parameters. simplified result. Similar conclusions were drawn from all 300 images in the BSDS dataset. 105 5.5 Layered Affinity Models using CCP Spectral segmentation has received a lot of attention in recent years due to its impressive performance [CBS05, SM00]. It begins with a graph representation of a given image, where each pixel is a node. Then, a sparse affinity matrix is created to measure the similarity between nearby nodes, while ignoring the connection among distant nodes even if they are in the same homogeneous region; say, two distant nodes in the same sky region. The simplification of sparse affinity matrices often leads to over-segmentation. To overcome this problem, a layered affinity model was introduced to allow more con- nections, such as the full pairwise affinity in MLSS [KLL13] and the bipartite graph partitioning in SAS [LWC12]. These methods share one common idea, namely, build- ing a graph model consisting of multiple layers. The finest one is the pixel layer as constructed by the standard spectral segmentation method. Then, one can add a couple of coarse layers on top of the pixel layer, where each coarse layer uses superpixels as its nodes and defines an affinity matrix accordingly. Typically, these superpixel layers are constructed using the MS [CM02] and the FH graph-based [FH04] methods. Finally, nodes between different layers are connected by an across-affinity matrix. Although these methods share the same basic idea, they differ in the details of the layered affinity matrix implementation. To further improve the segmentation result of CCP, we can leverage the two lay- ered affinity models proposed in MLSS and SAS. The integrated methods are called CCP-LAM (where LAM denotes “layer-affinity by MLSS”) and CCP-LAS (where LAS denotes “layer-affinity by SAS”), respectively. CCP-LAM and CCP-LAS can be easily obtained by replacing the superpixel layers in MLSS and SAS, respectively, by the CCP segmentations as described in Sec. 5.3, with the pixel layer kept as the finest layer. It was observed in [KLL13] and [LWC12] that the final image segmentation result can benefit from the diversity of multiple coarse layers. Following this line of thought, 106 we create multiple CCP segmentations by varying bandwidth parameter h r or (h rc ; h rb ) of the MS algorithm in the color palette generation process, which has been discussed in Sec. 5.3.2. 5.6 Experimental Results In this section, we evaluate the performance of three CCP segmentation results by consider three parameter settings: 1) CCP-1, h r = 5; 2) CCP-2, (h rc ; h rb ) = (5; 7); and 3) CCP-3, h r = 7. 
Furthermore, we take CCP-1, CCP-2 and CCP-3 as three coarse layers in the context of spectral segmentation with two layered affinity mod- els (i.e., LAM and LAS) to result in CCP-LAM and CCP-LAS methods. To achieve the optimal performance of CCP-LAM and CCP-LAS, we follow the procedure stated in [KLL13, LWC12] to manually select the best segment number using the LAM or LAS graph. We compare the performance of CCP-1, CCP-2, CCP-3, CCP-LAM and CCP-LAS with several benchmarking methods on the Berkeley Segmentation Dataset (BSDS) [MFTM01] in Table 5.2. The BSDS benchmark consists of 300 color images of size 481 321 pixels displayed either horizontally or vertically, and several hand-labeled segmentations were collected from different human subjects for each image. The bench- marking methods include NCut [SM00], MNCut [CBS05], MS [CM02], FH [FH04], SDTV [DUHB09], RIS-HL [WZT14], MLSS [KLL13], and SAS [LWC12]. Their num- bers are taken from [DUHB09, KLL13, LWC12, WZT14]. As shown in Table 5.2, five performance metrics (e.g., [DUHB09, KLL13, LWC12, WZT14]) are used for quantitative evaluation. They are: 1) Segmentation Covering (Cov) [AMFM11], which measures the region-wise covering of the ground truth by a segmentation; 2) Probabilistic Rand Index (PRI) [UPH07], which counts the likelihood 107 of pixel pairs whose labels are consistent between a segmentation and the ground truth; 3) Variation of Information (V oI) [Mei05], which measures the amount of randomness in one segmentation that cannot be contained by the other; 4) Global Consistency Error (GCE) [MFTM01], which measures the extent to which one segmentation can be viewed as a refinement of the other; 5) Boundary Displacement Error (BDE) [FnR + 02], which measures the average displacement error of boundary pixels between two segmented images. The segmentation result is better if Cov and PRI are larger while the other three criteria (V oI, GCE and BDE) are smaller. The best and the second best results in Table 5.2 are highlighted in red and blue, respectively. We draw the following conclusions from Table 5.2. First, CCP-LAM and CCP-LAS achieved the best performance in terms of all five metrics by a large margin. Second, all the three CCP methods had outstanding performance in the GCE and BDE metrics. This means that CCP yields an excellent segmentation with better boundary adherence and less displacement error with respect to the ground truth. It is worthwhile to emphasize that no image-dependent parameter was used in CCP-1, CCP-2 and CCP-3. The same parameter setting is applied to all the images. In contrast, a set of experiments were run in all other benchmarking methods, and the best result for each image was selected and used in performance computation. To evaluate each component of our CCP segmentation, we implement three vari- ants of our solution: 1) replace structured edge detection with the Canny detector, 2) replace color palette generation with Mean Shift in full color space, and 3) exclude post-processing. Some of the performance are shown in Table 5.3. The performance drops of the three variants are 11.8%, 32.3%, and 26.6% for Cov, and 2.8%, 7.5%, and 5.2% for PRI, respectively. We can see that color palette generation is the most signifi- cant part, because the performance of CCP when replacing color palette generation with Mean Shift drops the most among these three variants. 
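For completeness, the Segmentation Covering score used in Table 5.2 can be computed from the contingency table between a segmentation and one ground truth (and averaged over the available ground truths). The sketch below follows the standard region-wise best-overlap definition of [AMFM11].

```python
import numpy as np

def segmentation_covering(seg, gt):
    """Covering of the ground truth `gt` by the segmentation `seg`: each
    ground-truth region contributes its best intersection-over-union with
    any segment, weighted by the region's size."""
    seg = np.asarray(seg).ravel()
    gt = np.asarray(gt).ravel()
    n = seg.size
    # Contingency counts between ground-truth regions and segments.
    code = gt.astype(np.int64) * (seg.max() + 1) + seg
    counts = np.bincount(code, minlength=(gt.max() + 1) * (seg.max() + 1))
    inter = counts.reshape(gt.max() + 1, seg.max() + 1).astype(np.float64)
    gt_sizes = inter.sum(axis=1)                   # |R| for each GT region
    seg_sizes = inter.sum(axis=0)                  # |R'| for each segment
    union = gt_sizes[:, None] + seg_sizes[None, :] - inter
    iou = np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)
    best = iou.max(axis=1)                         # best overlap per GT region
    return float((gt_sizes * best).sum() / n)

if __name__ == "__main__":
    gt = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
    seg = np.array([[0, 0, 1, 2],
                    [0, 0, 1, 2]])                 # right region split in two
    print(segmentation_covering(seg, gt))          # 0.75
```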
Table 5.2: Performance comparison of several segmentation methods on the BSDS300 dataset, where the best two results in each column are highlighted in red (best) and blue (second best) in the original table.

Algorithm        Cov↑   PRI↑    VoI↓    GCE↓    BDE↓
NCut [SM00]      0.44   0.7242  2.9061  0.2232  17.15
MNCut [CBS05]    0.44   0.7559  2.4701  0.1925  15.10
MS [CM02]        0.54   0.7958  1.9725  0.1888  14.41
FH [FH04]        0.51   0.7139  3.3949  0.1746  16.67
SDTV [DUHB09]    0.57   0.7758  1.8165  0.1768  16.24
RIS-HL [WZT14]   0.59   0.8137  1.8232  0.1805  13.07
MLSS [KLL13]     0.53   0.8146  1.8545  0.1809  12.21
SAS [LWC12]      0.62   0.8319  1.6849  0.1779  11.29
CCP-1            0.47   0.7900  2.8502  0.1046  11.26
CCP-2            0.48   0.7932  2.7835  0.1077  11.17
CCP-3            0.53   0.8014  2.4723  0.1270  11.29
CCP-LAM          0.68   0.8404  1.5715  0.1635  10.20
CCP-LAS          0.68   0.8442  1.5871  0.1582  10.46

Table 5.3: Performance comparison of three variants of our solution: 1) CCP-3 (Canny) replaces structured edge detection with the Canny detector, 2) CCP-3 (MS) replaces color palette generation with Mean Shift in the full color space, and 3) CCP-3 (w/o post) excludes post-processing.

Variants          Cov↑   PRI↑
CCP-3             0.53   0.8014
CCP-3 (Canny)     0.47   0.7790
CCP-3 (MS)        0.36   0.7413
CCP-3 (w/o post)  0.39   0.7597

Furthermore, Figs. 5.6–5.14 show the segmentation results of some images produced by MLSS, SAS, CCP-LAM and CCP-LAS for visual comparison. Again, CCP-LAM and CCP-LAS produced noticeably better and more meaningful segmentation results than MLSS and SAS in terms of visual appearance.

Figures 5.6–5.14: Visual comparisons of segmentation results of CCP-LAM and CCP-LAS against two state-of-the-art methods, MLSS and SAS; each figure shows (a) the original image, (b) the ground truths, (c) MLSS, (d) SAS, (e) CCP-LAM, and (f) CCP-LAS.

5.7 Conclusion

The contour-guided color palette (CCP) was proposed for robust image segmentation. This method effectively integrated the contour and color cues of an image, reduced its color complexity, and kept a sufficient number of distinctive colors to achieve the desired segmentation task. Based on the image-specific color palette, a preliminary segmentation was obtained and further fine-tuned by post-processing techniques. The CCP method produced an acceptable standalone segmentation result, which could be further integrated with layered affinity models for spectral segmentation. The superior performance of the proposed CCP-LAM and CCP-LAS methods over existing state-of-the-art methods was demonstrated by extensive experimental results.
Chapter 6 Conclusion and Future Work

6.1 Summary of the Research

In this dissertation, we study two research topics on visual data segmentation. One is interactive video object segmentation. The other is automatic image segmentation.

For video object segmentation in Chapter 3, we developed an interactive hierarchical supervoxel representation to handle the temporal dynamics of multiple components of objects in a video. Firstly, a hierarchical supervoxel graph with various levels is generated offline to represent a video based on local clustering and region merging. Besides the color histogram and the motion vectors, visual saliency is also leveraged in the feature space to build the graph. Then, an interactive supervoxel selection algorithm is presented to select supervoxels with diverse granularities to represent the object(s) labeled by the user. Finally, an interactive video object segmentation framework based on the above tools is proposed to handle complex and diverse scenes with large motion and occlusions. The clear superiority of the proposed algorithms in both supervoxel graph generation and interactive video object representation has been shown by the experimental results.

To represent objects in visual data automatically, we design a region-dependent spectral framework for automatic image segmentation in Chapter 4. Firstly, three elementary cues are developed, including the 1D contour cue, the 2D surface cue, and the 3D depth cue. Our contour detection can capture accurate local variations with reduced sensitivity to textures and noise. Our simplified surface maps reduce the number of clusters for the CIELAB map and the modified CIELCH map. Our depth estimation can handle dark regions with low contrast, and is not sensitive to textures and noise. Based on these three cues, we propose a fusion approach to make use of the different cues. The contour-guided surface merger (CGS) removes fake boundaries between regions and reduces the number of clusters for the surface maps. The region-dependent spectral graph (RDS) integrates the surface cue and the depth cue according to the region type. We have shown visual results for our three elementary cues and compared our RDS segmentation with state-of-the-art approaches. Experiments have shown promising results for the proposed fusion framework.

To further improve automatic image segmentation, the contour-guided color palette (CCP) is proposed in Chapter 5. It efficiently integrates the contour and color cues of an image. To find the representative colors of an image, we collect color samples from both sides of long contours and conduct the mean-shift (MS) algorithm in the sampled color space to define an image-dependent color palette. This color palette provides a preliminary segmentation in the spatial domain, which is further fine-tuned by post-processing techniques such as leakage avoidance, fake boundary removal, and small region merging. The segmentation performances of CCP and MS are compared and analyzed. While CCP offers an acceptable standalone segmentation result, it can be further integrated into the framework of layered spectral segmentation to produce a more robust segmentation. The superior performance of the CCP-based segmentation algorithm is demonstrated by experiments on the Berkeley Segmentation Dataset.
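The following is a minimal sketch of this palette idea, assuming Python with NumPy, SciPy, scikit-image, and scikit-learn. It is not the exact CCP procedure of Sec. 5.3: the contour map is taken as given, the offset d and the bandwidth are illustrative values, sampling is axis-aligned rather than along contour normals, and the post-processing steps (leakage avoidance, fake boundary removal, small region merging) are omitted.

```python
import numpy as np
from scipy.spatial.distance import cdist
from skimage import color
from sklearn.cluster import MeanShift

def contour_color_palette(rgb, contour_mask, d=3, bandwidth=7.0):
    """Sample Lab colors from both sides of detected contours, run mean shift on
    the samples to obtain a small image-dependent palette, and quantize every
    pixel to its nearest palette color to get a preliminary segmentation.

    rgb:          (h, w, 3) float image in [0, 1]
    contour_mask: (h, w) boolean map of (long) contour pixels
    d, bandwidth: sampling offset (pixels) and mean-shift bandwidth (illustrative)
    """
    lab = color.rgb2lab(rgb)
    h, w = contour_mask.shape
    ys, xs = np.nonzero(contour_mask)
    samples = []
    # Take samples offset by d on both sides of each contour pixel.  CCP samples
    # along the contour normal; the axis-aligned offsets here are a crude proxy.
    for dy, dx in [(-d, 0), (d, 0), (0, -d), (0, d)]:
        yy = np.clip(ys + dy, 0, h - 1)
        xx = np.clip(xs + dx, 0, w - 1)
        samples.append(lab[yy, xx])
    samples = np.concatenate(samples, axis=0)
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(samples)
    palette = ms.cluster_centers_                        # representative Lab colors
    labels = cdist(lab.reshape(-1, 3), palette).argmin(axis=1).reshape(h, w)
    return palette, labels
```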
6.2 Future Research Directions

To extend our research, we identify the following directions for further improving our approaches to visual data segmentation.

Complexity reduction for hierarchical supervoxel graph generation. According to Sec. 3.5 in Chapter 3, the time cost of the offline hierarchical supervoxel graph generation is still too high. To apply it in a practical system, we may replace the current saliency detection and motion estimation modules with more efficient ones.

Evaluation of depth estimation. The current evaluation of the depth maps in Chapter 4 is based only on visual perception, with no quantitative comparison against state-of-the-art approaches. We need to develop a more reliable and objective evaluation measure for relative depth estimation.

Other models using CCP. Instead of the layered affinity models in Chapter 5, we may investigate other advanced superpixel-based models to obtain the final segmentation results.

Image segmentation evaluation based on different ground-truth subjects. In the current literature, an image segmentation is evaluated by averaging the performance between the given segmentation and each of the ground-truth subjects independently. As mentioned in the motivation of this dissertation, the boundaries common to different ground-truth subjects are more significant than the uncommon ones. We will make use of this information for adaptive segmentation evaluation.

Extension to automatic video segmentation. All the techniques mentioned in Chapters 4 and 5 can be extended to automatic video segmentation. One of the significant issues is to ensure the temporal consistency of the spatial segmentation.

Segmentation for other computer vision applications. According to Chapter 1, visual segmentation is a vital preprocessing step for various advanced computer vision applications. We may apply our segmentation results to object detection, scene understanding, etc.

Bibliography

[AMFM11] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):898–916, May 2011.
[ASS+12] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurélien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(11):2274–2282, November 2012.
[Azr08] Arik Azran. A tutorial on spectral clustering, 2008. <http://videolectures.net/mlcued08 azran mcl/>.
[BGG03] Vicki Bruce, Mark A. Georgeson, and Patrick R. Green. Visual Perception: Physiology, Psychology and Ecology. Psychology Press, 4th edition, 2003.
[BJ01] Yuri Boykov and Marie-Pierre Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In Computer Vision (ICCV), 2001 IEEE International Conference on, pages 105–112, 2001.
[BNR08] Ruth Bergman, Hila Nachlieli, and Gitit Ruckenstein. Detection of textured areas in natural images using an indicator based on component counts. Journal of Electronic Imaging, 17(4):043003, January 2008.
[BWSS09] Xue Bai, Jue Wang, David Simons, and Guillermo Sapiro. Video SnapCut: robust video object cutout using localized classifiers. ACM Transactions on Graphics (TOG), 28(3):70:1–70:11, July 2009.
[Can86] John Canny. A computational approach to edge detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 8(6):679–698, June 1986.
[CBS05] Timothee Cour, Florence Benezit, and Jianbo Shi. Spectral segmentation with multiscale graph decomposition. In Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on, pages 1124–1131, 2005.
[CM02] Dorin Comaniciu and Peter Meer. Mean Shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603–619, May 2002.
[Con14] Colour Management Consultancy. Introduction to colour spaces, 2014. <http://www.colourphil.co.uk/lab lch colour space.shtml>.
[CWI13] Jason Chang, Donglai Wei, and John W. Fisher III. A video representation using temporal superpixels. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2051–2058, 2013.
[DbsM01] Yining Deng and B. S. Manjunath. Unsupervised segmentation of color-texture regions in images and video. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(8):800–810, August 2001.
[DUHB09] M. Donoser, M. Urschler, M. Hirzer, and H. Bischof. Saliency driven total variation segmentation. In Computer Vision (ICCV), 2009 IEEE International Conference on, pages 817–824, 2009.
[DZ13] Piotr Dollár and C. Lawrence Zitnick. Structured forests for fast edge detection. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1841–1848, 2013.
[FH04] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, September 2004.
[FnR+02] Jordi Freixenet, Xavier Muñoz, David Raba, Joan Martí, and Xavier Cufí. Yet another survey on image segmentation: Region and boundary information integration. In Computer Vision – ECCV 2002, pages 408–422, 2002.
[FPX+12] Xiang Fu, Sanjay Purushotham, Daru Xu, Jian Li, and C.-C. Jay Kuo. Hierarchical bag-of-words model for joint multi-view object representation and classification. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, pages 1–6. IEEE, 2012.
[FWC+15] Xiang Fu, Chien-Yi Wang, Chen Chen, Changhu Wang, and C.-C. Jay Kuo. Robust image segmentation using contour-guided color palettes. In Computer Vision (ICCV), 2015 IEEE International Conference on, 2015.
[Gib50] James J. Gibson. The Perception of the Visual World. Houghton Mifflin, 1950.
[GKHE10] Matthias Grundmann, Vivek Kwatra, Mei Han, and Irfan A. Essa. Efficient hierarchical graph-based video segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2141–2148, 2010.
[HdH06] Hao Hu and Gerard de Haan. Low cost robust blur estimator. In Image Processing, 2006 IEEE International Conference on, pages 617–620, 2006.
[HST13] Kaiming He, Jian Sun, and Xiaoou Tang. Guided image filtering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(6):1397–1409, June 2013.
[JWY+11] Huaizu Jiang, Jingdong Wang, Zejian Yuan, Tie Liu, and Nanning Zheng. Automatic salient object segmentation based on context and shape prior. In Proceedings of the British Machine Vision Conference, pages 110.1–110.12, 2011.
[KLL13] Tae Hoon Kim, Kyoung Mu Lee, and Sang Uk Lee. Learning full pairwise affinities for spectral segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(7):1690–1703, July 2013.
[Kov00] Peter Kovesi. MATLAB and Octave functions for computer vision and image processing. Centre for Exploration Targeting, School of Earth and Environment, The University of Western Australia, 2000. Available from: <http://www.csse.uwa.edu.au/pk/research/matlabfns/>.
[Kov03] Peter Kovesi. Phase congruency detects corners and edges. In Digital Image Computing: Techniques and Applications (DICTA), 2003 International Conference on, pages 309–318, 2003.
[LC15] Zhengqin Li and Jiansheng Chen. Superpixel segmentation using linear spectral clustering. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1356–1363, 2015.
[Liu09] Ce Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2009.
[LLW08] Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(2):228–242, February 2008.
[LSK+09] Alex Levinshtein, Adrian Stere, Kiriakos N. Kutulakos, David J. Fleet, Sven J. Dickinson, and Kaleem Siddiqi. TurboPixels: Fast superpixels using geometric flows. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(12):2290–2297, December 2009.
[LSS05] Yin Li, Jian Sun, and Heung-Yeung Shum. Video object cut and paste. ACM Transactions on Graphics (TOG), 24(3):595–600, July 2005.
[LSS09] Jiangyu Liu, Jian Sun, and Heung-Yeung Shum. Paint selection. ACM Transactions on Graphics (TOG), 28(3):69:1–69:7, July 2009.
[LWC12] Zhenguo Li, Xiao-Ming Wu, and Shih-Fu Chang. Segmentation Using Superpixels: A bipartite graph partitioning approach. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 789–796, 2012.
[Mei05] Marina Meilă. Comparing Clusterings: An axiomatic view. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 577–584, 2005.
[MFCM03] Xavier Muñoz, Jordi Freixenet, Xavier Cufí, and Joan Martí. Strategies for image segmentation combining region and boundary information. Pattern Recognition Letters, 24(1-3):375–392, January 2003.
[MFM04] David R. Martin, Charless C. Fowlkes, and Jitendra Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(5):530–549, May 2004.
[MFTM01] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision (ICCV), 2001 IEEE International Conference on, pages 416–423, 2001.
[MO87] M. C. Morrone and R. A. Owens. Feature detection from local energy. Pattern Recognition Letters, 6(5):303–313, December 1987.
[MO10] Kevin McGuinness and Noel E. O'Connor. A comparative evaluation of interactive segmentation algorithms. Pattern Recognition, 43(2):434–444, February 2010.
[Pen87] Alex P. Pentland. A new sense for depth of field. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 9(4):523–531, April 1987.
[PMC09] Brian L. Price, Bryan S. Morse, and Scott Cohen. LIVEcut: Learning-based interactive video segmentation by evaluation of multiple propagated cues. In Computer Vision (ICCV), 2009 IEEE International Conference on, pages 779–786, 2009.
[PS13] Guillem Palou and Philippe Salembier. Hierarchical video representation with trajectory binary partition tree. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2099–2106, 2013.
[RAS08] Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In Computer Vision and Pattern Recognition (CVPR), 2008 IEEE Conference on, pages 1–8, 2008.
[RJRO13] Matthias Reso, Jörn Jachalsky, Bodo Rosenhahn, and Jörn Ostermann. Temporally consistent superpixels. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 385–392, 2013.
[RM00] Jos B. T. M. Roerdink and Arnold Meijster. The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta Informaticae, 41(1-2):187–228, April 2000.
[RM03] Xiaofeng Ren and Jitendra Malik. Learning a classification model for segmentation. In Computer Vision (ICCV), 2003 IEEE International Conference on, pages 10–17, 2003.
[Ska10] Nikolay Skarbnik. The importance of phase in image processing. Master's thesis, Israel Institute of Technology, Haifa, Israel, 2010.
[SM00] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, August 2000.
[TFNR12] David Tsai, Matthew Flagg, Atsushi Nakazawa, and James M. Rehg. Motion coherent tracking using multi-label MRF optimization. International Journal of Computer Vision, 100(2):190–202, November 2012.
[TM98] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Computer Vision (ICCV), 1998 IEEE International Conference on, pages 839–846, 1998.
[UPH07] Ranjith Unnikrishnan, Caroline Pantofaru, and Martial Hebert. Toward objective evaluation of image segmentation algorithms. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(6):929–944, June 2007.
[VS08] Andrea Vedaldi and Stefano Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision (ECCV), pages 705–718, 2008.
[WBC+05] Jue Wang, Pravin Bhat, Alex Colburn, Maneesh Agrawala, and Michael F. Cohen. Interactive video cutout. ACM Transactions on Graphics (TOG), 24(3):585–594, July 2005.
[Wik14a] Wikipedia. Adaptive histogram equalization — Wikipedia, The Free Encyclopedia, 2014. <http://en.wikipedia.org/wiki/Adaptive_histogram_equalization>.
[Wik14b] Wikipedia. Lab color space — Wikipedia, The Free Encyclopedia, 2014. <http://en.wikipedia.org/wiki/Lab_color_space>.
[WJHZ08] Jingdong Wang, Yangqing Jia, Xian-Sheng Hua, and Changshui Zhang. Normalized tree partitioning for image segmentation. In Computer Vision and Pattern Recognition (CVPR), 2008 IEEE Conference on, pages 1–8, 2008.
[WZT14] Jiajun Wu, Junyan Zhu, and Zhuowen Tu. Reverse image segmentation: A high-level solution to a low-level task. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[XC12] Chenliang Xu and Jason J. Corso. Evaluation of super-voxel methods for early video processing. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1202–1209, 2012.
[Zam12] Tim Zaman. Depth estimation from blur estimation. Technical report, Delft University of Technology, Delft, Netherlands, 2012.
[ZS11] Shaojie Zhuo and Terence Sim. Defocus map estimation from a single image. Pattern Recognition, 44(9):1852–1858, September 2011.
[Zui94] Karel Zuiderveld. Graphics Gems IV. Academic Press Professional, Inc., San Diego, CA, USA, 1994.
Abstract
In this dissertation, we study two research topics: 1) how to interactively represent and segment the object(s) in a video, and 2) how to automatically segment an image.