Depth Inference and Visual Saliency Detection from 2D Images
by
Jingwei Wang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2013
Copyright 2013 Jingwei Wang
Acknowledgements
I would like to acknowledge many people for helping me during my doctoral work. Without their persistent help and support, I would not have been able to complete this work.
First of all, I would like to express my deepest gratitude to my advisor, Professor C.-C. Jay Kuo. He has spent a lot of effort and valuable time on me. In fact, many of his brilliant ideas serve as the basis of this thesis. I admire his endless energy and enthusiasm in research. Also, I would like to thank all my committee members, Prof. Keith Jenkins, Prof. Antonio Ortega, Prof. Richard Leahy, Prof. Yan Liu and Prof. Laurent Itti for their valuable time and precious comments.
I would like to thank my parents, who gave me life, spent countless time and effort raising me, and taught me basic principles to live by in this world. Without them, I could not have become who I am. I would like to thank all my friends for their encouragement and support when I was going through tough times during my Ph.D. studies. I cannot imagine my life without them.
Finally, I want to thank my Yoga teachers at USC; their gentle and consistent instruction helped me find inner peace through meditation, which always comforts my restless heart.
Table of Contents
Acknowledgements ii
List of Tables vi
List of Figures vii
Abstract xi
Chapter 1: Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Review of Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 2: Background 10
2.1 Depth Perception in Vision . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Depth Clues for Stereoscopic View . . . . . . . . . . . . . . . . . . 12
2.1.2 Depth Clues for Monocular View . . . . . . . . . . . . . . . . . . . 14
2.1.3 Comparison of Stereoscopic View and Monocular View . . . . . . . 19
2.2 Depth Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Depth Map Generation from Stereoscopic View . . . . . . . . . . . 20
2.2.2 Depth Map Generation from Monocular View . . . . . . . . . . . . 21
2.2.2.1 Vanishing Point Based Method . . . . . . . . . . . . . . . 21
2.2.2.2 Occlusion Based Method . . . . . . . . . . . . . . . . . . 22
2.2.2.3 Shading Based Method . . . . . . . . . . . . . . . . . . . 22
2.2.2.4 In-focus Degree Based Method . . . . . . . . . . . . . . . 23
2.2.2.5 Texture Gradient and Size Based Method . . . . . . . . . 24
Chapter 3: Depth Inference from 2D Image/Video: Theory and Algo-
rithms 25
3.1 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Image Depth Map Generation . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 In-Focus Region Detection . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2.1 Salient Map Computation . . . . . . . . . . . . . . . . . . 33
3.2.2.2 Color Based Grab Cut Algorithm . . . . . . . . . . . . . 34
3.2.3 Monocular Depth Cue Integration . . . . . . . . . . . . . . . . . . 36
3.2.3.1 Background Depth Map Generation . . . . . . . . . . . . 36
3.2.3.2 Foreground Depth Assignment . . . . . . . . . . . . . . . 40
3.3 Video Depth Map Generation . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 Depth Map Propagation . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.2 Post Process Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Experimental Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 4: Depth Inference from 2D Image/Video: Chip Implementation
Considerations 49
4.1 Improved Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Still Image Depth Map Generation . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 Fast In Focus Region Detection . . . . . . . . . . . . . . . . . . . . 53
4.2.2 Color Based Mean Shift Segmentation . . . . . . . . . . . . . . . . 56
4.3 Single Video Depth Map Generation . . . . . . . . . . . . . . . . . . . . . 59
4.3.1 Motion based Depth Map Propagation . . . . . . . . . . . . . . . . 60
4.4 Experimental Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Chapter 5: Salient Object Detection 67
5.1 Related Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Proposed Score Fusion Strategy . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 Visual Saliency Map Features . . . . . . . . . . . . . . . . . . . . . 74
5.2.2 Score Level Fusion Strategies . . . . . . . . . . . . . . . . . . . . . 76
5.2.2.1 Transformation based Fusion . . . . . . . . . . . . . . . . 76
5.2.2.2 Classification based Fusion . . . . . . . . . . . . . . . . . 77
5.2.2.3 Density based Fusion . . . . . . . . . . . . . . . . . . . . 79
5.3 Experimental Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.2 Evaluation Scores. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.3 Model Comparison and Results . . . . . . . . . . . . . . . . . . . . 83
5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chapter 6: Salient Object Segmentation 91
6.1 Proposed CRF-based Segmentation Model . . . . . . . . . . . . . . . . . . 93
6.1.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1.2 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.2 Evaluation Scores. . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Chapter 7: Conclusion and Future Work 108
7.1 Summary of The Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Reference List 112
List of Tables
2.1 Depth clues used for depth inferring algorithms.. . . . . . . . . . . . . . . 12
5.1 Result of prediction estimation in mean average precision(mAP) . . . . . 81
5.2 AUC ranking of example images in Fig. 5.9 (LSVM: linear SVM, MBoost:
Modest adaboost ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1 Model rankings over different salient object datasets. . . . . . . . . . . . . 99
List of Figures
1.1 Depth map generation from stereo-vision: (a) Left View, (b) Right View, (c) Depth Map (figure source: http://vision.middlebury.edu/stereo) . . . . . . . . 2
1.2 Depth map generation from a single image: (a) the original image and (b) the depth map (figure source: www.extra.research.philips.com/euprojects/attest) 3
2.1 Example of how human eyes perceive depth. . . . . . . . . . . . . . . . . . 11
2.2 Stereoscopic view illustration . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Focus Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Occlusion Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Texture Gradient Illustration . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Shading and Shadows Illustration . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Relative Size Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.8 Vanishing Point Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.9 Modified Camera Aperture: (a) Conventional Camera Aperture, (b) Modified Camera Aperture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 The block-diagram of the proposed depth map inference algorithm. . . . . 28
3.2 (a) The original image, (b) the in-focus region, (c) the salient focus region
and (d) the detected foreground region. . . . . . . . . . . . . . . . . . . . 30
3.3 The model for the distance between second derivative extrema. . . . . . . 31
3.4 The resulting depth map for the image given in Fig.(3.2(a)). . . . . . . . . 37
3.5 Flowchart of background depth map generation. . . . . . . . . . . . . . . 38
3.6 Background depth map generation result (two rows for two test images). . 39
3.7 Intersection point in (a) Left vanishing point case, (b) Right vanishing point case, (c) Up vanishing point case, (d) Down vanishing point case, and (e) Inner vanishing point case. . . . . . . . . . . . . . . . . . . . . . . 41
3.8 The resulting depth map (upper row) and ground truth (bottom row). . . 44
3.9 The resulting confidence map of a consecutive video sequence. . . . . . . . 46
3.10 Examples of depth map generation: (a) the mountain image, (b) the depth map of the mountain image, (c) the airplane image, and (d) the depth map of the airplane image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.11 Performance comparison of the vanishing point detection method [65] and the proposed method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1 The block-diagram of the improved depth map inference algorithm. . . . . 51
4.2 Still image depth generation flowchart. . . . . . . . . . . . . . . . . . . . . 52
4.3 Relation between minimum reliable scale and blurring scale. . . . . . . . . 54
4.4 Mean shift segmentation result: (a) original image, (b) mean shift segmentation result. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 The depth map propagation along video sequence using motion vectors. . 61
4.6 Depth map results of the human being image group: (a) original image, (b) generated depth map. . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.7 Depth map results of the animal image group: (a) original image, (b) generated depth map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.8 Depth map results of the environmental object image group: (a) original image, (b) generated depth map. . . . . . . . . . . . . . . . . . . . . . . . 64
4.9 Video depth map sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1 Illustration of cases where state-of-the-art saliency detection models are inaccurate: (a) original images; (b) ground truth human eye fixation maps; (c) failed salient object detection models, from left to right (Itti [41], AIM [49], Judd [60]); (d) score-level fusion results. Compared to individual models, the score-level fusion results are closer to human eyes' visual attention. . . . 72
5.2 Illustration of our fusion framework. . . . . . . . . . . . . . . . . . . . . . 73
5.3 average precision-recall and ROC curves of all saliency fusion strategies . 80
5.4 Sample images from MIT Dataset . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Performance comparison of different choices of model selection . . . . . . 85
5.6 Saliency map results of individual models. . . . . . . . . . . . . . . . . . . 85
5.7 Score-level fusion results. For each image, the first row shows the original image and fusion results of the top 11 models, and the second row shows the ground truth and fusion results of the top 3 models. . . . . . . . . . . 87
5.8 Performance consistency over different image cases: accuracy of salient object detection models over the least and most consistent images . . . . 89
5.9 Examples of salient object detection results by top 3 models are shown
in columns 3 through 5, while the salient objects detected by proposed
fusion strategies are given in column 6 through 8. Their ROC and AUC
performances are shown in column 9 and 10. (GT denotes ground-truth
eye fixation map.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.1 An example saliency map with a high region overlap score but a poor boundary. Another saliency map with a slightly lower (or equal) region score but a better boundary score (lower CM) resembles human perception of objects better [48]. This means that the F-measure (or any region scoring) alone is not enough to evaluate the accuracy of a model. . . . . . . . . . . 93
6.2 The diagram of our proposed CRF-based salient object detection framework. 95
6.3 Illustration of salient object boundary detection (GT denotes Ground Truth). Our model captures both object regions and boundaries better than existing models. Corresponding F-measure and CM scores are shown under each image, in order. A good model should be high on F-measure and low on CM score. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4 CM and F-measure scores by sweeping saliency map threshold over ASD,
SED, and SOD datasets averaged over all images. . . . . . . . . . . . . . . 103
6.5 Worst-case analysis. Sample images where models fail in predicting ob-
ject boundary. Comparing to different models, our model has a better
degradation performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.6 CM vs. F-measure results over ASD, SED, and SOD datasets. Each point
shows the average score over all thresholds and images.. . . . . . . . . . . 104
6.7 Sample images with salient objects in- and off-center from three datasets. 106
6.8 Center-bias analysis. Our model is better able to detect in- and off-center salient objects, in terms of both region and boundary. . . . . . . . . . . . 106
Abstract
With the rapid development of 3D vision technology, it is an active research topic to
recover the depth information from 2D images. Current solutions heavily depend on the
structure assumption of the 2D image and their applications are limited. It is now still
technically challenging to develop an efficient yet general solution to generate the depth
map from a single image. Furthermore, psychological studies indicate that human eyes are particularly sensitive to the salient object region within an image. Thus, it is critical to detect the salient object accurately and segment its boundary well, since a small depth error in these areas leads to intolerable visual distortion. Briefly speaking, the research in this work falls into two categories: depth map inference system design, and salient object detection and segmentation algorithm development.
For depth map inference system design, we propose a novel depth inference system
for 2D images and videos. Specifically, we first adopt the in-focus region detection and
salient map computation techniques to separate the foreground objects from the remain-
ing background region. After that, a color-based grab-cut algorithm is used to remove
the background from obtained foreground objects by modeling the background. As a
result, the depth map of the background can be generated by a modified vanishing point
detection method. Then, key frame depth maps can be propagated to the remaining
xi
frames. Finally, to meet the stringent requirements of VLSI chip implementation such as
limited on-chip memory size and real-time processing, we modify some building modules
with simplified versions of the in-focus region detection and the mean-shift algorithm.
Experimental results show that the proposed solution can provide accurate depth maps for 83% of images, while other state-of-the-art methods can only achieve accuracy for 34% of these test images. This simplified solution, targeting the VLSI chip implementation,
has been validated for its high accuracy as well as high efficiency on several test video
clips.
For salient object detection, inspired by the success of late fusion in semantic analysis and multi-modal biometrics, we model saliency detection as late fusion at the confidence score level. In fact, we propose to fuse state-of-the-art saliency models at the score level in a para-boosting learning fashion. Firstly, saliency maps generated from these models are used as confidence scores. Then, these scores are fed into our para-boosting learner (i.e., Support Vector Machine (SVM), Adaptive Boosting (AdaBoost), or Probability Density Estimator (PDE)) to predict the final saliency map. In order to explore the strength of para-boosting learners, traditional transformation-based fusion strategies such as Sum, Min, and Max are also applied for comparison purposes. In our application scenario, salient object segmentation is our final goal. So, we further propose a novel salient object segmentation scheme using a Conditional Random Field (CRF) graph model. In this seg-
mentation model, we first extract local low-level features, such as output maps of several
saliency models, gradient histogram and position of each image pixel. We then train a
random forest classifier to fuse saliency maps into a single high-level feature map using
ground-truth annotations. Finally, both low- and high-level features are fed into our CRF
and parameters are learned. The segmentation results are evaluated from two different
perspectives: region and contour accuracy. Extensive experimental comparison shows
that both our salient object detection and segmentation models outperform state-of-the-art saliency models and are, so far, the closest to the ground truth labeled by human eyes.
Chapter 1
Introduction
1.1 Significance of the Research
Three-dimensional (3D) display devices such as 3D-TV have experienced rapid advancement in the past few years, which brings 3D technology closer to our daily life than ever before. These devices have expedited the development of image capturing, coding and display technologies. On the other hand, there are not sufficient 3D video contents to play, and most visual data are captured and stored in the form of 2D monocular
videos. It is desirable to convert 2D visual data into its corresponding 3D format, which
is known as the 2D-to-3D video conversion problem. It is a hot research topic in both
academia and industry nowadays.
To reconstruct 3D vision, one approach is to infer the depth information from stereo
images (e.g., two images of the same object from two different perspectives), which is
the basis for existing 3D reconstruction techniques using “stereo-vision”. In particular,
Wheatstone [68] first proposed to calculate the binocular disparity (e.g., the difference
between the projections of the same object on two different images) from stereo images
and further recover the 3D depth information. For example, we illustrate the stereo
images and the binocular disparity in Fig. (1.1), where the left image and the middle
image have the same objects but different perspectives (e.g., left and right). With the
stereo images, the binocular disparity can be calculated by comparing the projections of
the same object in each of these two images. As such, the depth map for each object can
be generated as shown in the right image.
Figure 1.1: Depth map generation from stereo-vision: (a) Left View, (b) Right View, (c) Depth Map (figure source: http://vision.middlebury.edu/stereo)
This approach has two-fold disadvantages: first, it is usually more expensive to obtain the stereo images by fusing multi-modal images of the same object since multiple image capturing devices are needed. Second, in some applications (e.g., when visual data cannot be re-recorded or are too expensive to re-record), there is only one single 2D still image available.
To mitigate these issues, it is desired to accurately infer the depth map from one
single 2D still image (“monocular image”). For example, the left image in Fig.(1.2) is
the original 2D image and its corresponding depth map can be recovered as shown in the
right image. As such, this 2D image coupling with its depth map can render a vivid 3D
visual effect.
Figure 1.2: Depth map generation from a single image: (a) the original image and (b) the depth map (figure source: www.extra.research.philips.com/euprojects/attest)
Also, it is worthwhile to detect and segment the salient object for a monocular depth map inference system. The reasons lie in two aspects.
First, as human eyes tend to pay more attention to the salient object area, it is critical to generate a consistent depth map in this region. An intuitive strategy would be segmenting the salient object region out and then assigning this region a suitable depth value to separate it from the background depth model. Second, many existing segmentation algorithms either require user input or need extra manual labels to mark where the "foreground" is. This kind of user interaction cannot be obtained in a fully automatic system. Hence, salient object detection plays an important role in automatically indicating where the foreground (i.e., salient) object is. In addition, high quality salient object detection and segmentation algorithms are required in our depth inference scenario, as a small depth map error within the salient object region will lead to intolerable visual distortion. Thus, it is also desirable to
study salient object detection and segmentation algorithms for high quality monocular
depth map system design purpose.
1.2 Review of Related Research
In the past few years, many depth-map generation methodologies have been proposed based on monocular images. For example, Nagai et al. [62] proposed to reconstruct the 3D surface of fixed and specific objects (e.g., hands, faces, etc.) from one single image, which can handle only a small set of images. In addition, [7] recovered the depth information according to the shading of objects by assuming that the shading and/or texture only exist on simple and smooth surfaces. This assumption does not hold true in most cases and thus this approach performs poorly on those images with complex shading and textures. Also, Hertzmann et al. [29] assume there exist some assistant objects with clear and known shapes, which can serve as the reference for target objects. As such, this method can recover 3D models with high quality but is restricted to specific images. Moreover, the method proposed by Delage, Lee and Ng [21] can generate depth maps for indoor scenes by assuming that a single 2D image contains only vertical walls and horizontal floors, while Michels, Saxena and Ng [34] used supervised learning to estimate the depth map for 2D images.
From these aforementioned methods, it can be observed that most existing works deploy only partial depth clues (i.e., information related to the depth map such as texture, shading, color, etc.) to infer the depth map of objects, thereby making them suitable for only a small set of images with specific contents. Therefore, a general depth-map generation algorithm is still missing from the existing literature and is urgently needed. This is the significance of our research topic and the motivation behind our depth inference as well.
As we mentioned before, salient object detection and segmentation play an important role in a depth inference system for general image applications, yet many existing saliency models are constrained by their assumptions and, thus, are only suitable for certain image types. Assuming salient objects are conspicuous either in color, intensity or orientation, Itti et al. [41] derived bottom-up visual saliency using center-surround differences across multi-scale image features. Believing salient regions have discriminative values in the luminance channel, Zhai and Shah [73] defined pixel-level saliency by contrast to all other pixels. Assuming local color differences between salient and non-salient regions, Achanta et al. [53] proposed a frequency-tuned method that directly defines pixel saliency using the color differences from the average image color. These methods focus on local dissimilarities but fall short in detecting the global visual attention. On the other hand, supposing human eyes are interested in high-level features such as faces, cars and text, Judd et al. ([60], [10]) developed salient object detection models which incorporate several high-level semantic features such as face, text, and human body. However, such models could not capture the local dissimilarity very well.
Furthermore, the majority of existing saliency models have focused on detecting image dissimilarities and producing a real-valued saliency map [31, 57]. Harel et al. [31] used Markov chains and a measure of dissimilarity to achieve efficient saliency computation with their Graph Based Visual Saliency (GBVS) model. Goferman et al. [57] modeled local low-level clues and visual organization rules to highlight salient objects along with their contexts. However, due to the lack of shape/contour information, these models and their variants cannot provide accurate information of salient object boundaries.
Thus, it is also necessary to develop salient object detection and segmentation algo-
rithms that can suit more general image content cases, while defining object boundaries
very well.
1.3 Contributions of the Research
In this research work, we first propose an effective algorithm to recover the depth map
from one 2D still image and propagate this depth map to the remaining frame images of
2D videos. Moreover, to ensure the depth map's boundary accuracy in human eyes' visual attention regions (i.e., the salient object region), we further propose novel salient object de-
tection strategies to fuse multiple saliency models, and design an efficient salient object
segmentation algorithm to generate clear salient object boundaries.
The main contributions of this work are listed below:
• We propose an effective foreground segmentation strategy, which provides a rough foreground estimation using in-focus detection [59] as well as salient map computation [32], and further refines this estimation by the grab cut algorithm [13]. This strategy provides accurate segmentation results for depth map generation and significantly improves the accuracy of the overall algorithm.
• The traditional methods apply the vanishing point method [65] to detect the line structure of the 2D images and further recover the depth map for the background regions. Hence, these methods become ineffective for those images without line structure. In this work, we propose a modified vanishing point method by detecting the in-focus degree, which demonstrates improved robustness on a broad range of images.
• Most of the previous works [21, 65, 19] focus on the depth map generation from one still 2D image rather than 2D video. In this research work, we developed a complete and efficient framework to infer the depth map for an entire 2D video sequence, which divides the video sequence into several correlated scene units (e.g., images within the same unit have similar objects and background) so that the depth map for a key frame image can be used to generate depth maps for the remaining frame images in the same unit.
• To generate the depth map for a 2D video sequence, we propose an efficient depth map propagation strategy which matches the pixels of the same objects in neighboring frame images and then propagates the depth values of the key frame image to the corresponding objects in the neighboring images.
• In order to monitor the accumulated error during propagation process, we propose
a confidence map (i.e., each pixel of the confidence map describes the difference
of corresponding pixel between the key frame image and current frame image un-
der study) so that the propagation can be restarted for high accuracy when the
accumulated error exceeds a certain upper bound.
• For chip implementation purpose, we improve our original proposed algorithm by
expediting those time-consuming steps including:
– We use a “fast” in-focus region detection algorithm [22] to reduce the memory
requirement and runtime cost of the original in-focus detection method [59].
– The mean shift algorithm [20] is used to replace the grab cut algorithm [13]
for better efficiency.
– In order to propagate the depth map among neighboring frame images, we propose to make use of the "motion vectors" in standard compressed video formats (e.g., MPEG and H.264), which describe the locations of the same objects in neighboring frame images. This approach can significantly speed up the propagation process while offering accurate depth maps for the entire video sequence.
• To predict the salient object location in one image, we propose several score level fusion strategies, including non-learning based and learning based fusion schemes, which combine current salient object detection models. Through exhaustive testing over a benchmark dataset [60], we show that our score level fusion strategies further boost the performance of state-of-the-art saliency detection models ([60, 10, 1]), and are closer to human eyes' observation agreement [1] in visual attention.
• To further investigate the role of each individual salient object model, we explore
the possibility of fusing fewer models while still keeping good performance. A com-
parison is made between fusing all top models and several selective top models.
Experimental results show that the integration of a few of the best models can outper-
form fusing all models.
• For salient object segmentation, we train a Conditional Random Field (CRF) to combine the output maps of several state-of-the-art saliency models and generate a binary segmentation result. Since these models are based on different, uncorrelated measures, combining them helps better locate the region of the salient object and define accurate object boundaries.
• Almost all researchers have employed region measurement (i.e., F-measure) as the standard performance score, but detecting a salient object is not just about segmenting its region; it is also important to capture the object boundary. We evaluate segmentation results from different salient models from two different perspectives: region and contour measurement. Extensive experimental results indicate that our proposed CRF based segmentation outperforms state-of-the-art salient object segmentation models.
1.4 Organization of the Dissertation
The rest of this dissertation is organized as follows. The background of this research is described in Chapter 2, which briefly revisits the depth clues for stereoscopic and monocular views and the related depth-map generation methods. Then, the proposed algorithm for depth map generation from one 2D still image and 2D videos is presented in Chapter 3. The improved algorithm with chip implementation considerations is proposed in Chapter 4. In Chapters 5 and 6, we further propose methods for detection and segmentation of salient object regions. Finally, this dissertation is concluded in Chapter 7, where further research directions are also provided.
Chapter 2
Background
2.1 Depth Perception in Vision
Human beings can perceive depth information using their eyes to obtain vivid 3D vision, which can be illustrated with Fig. (2.1) as an example. In this figure, most viewers can identify the objects and further observe the different orders of their depth levels. In detail, this image contains three different objects including a little girl (the nearest to the viewer), a bridge (in the middle) and the sky (the farthest from the viewer), which have ascending depth levels. In other words, human eyes infer the depth information according to the "relative" locations of different objects in the same scene [39]. On the other hand, the images captured by cameras are typically 2D images without any depth information. Therefore, it is desired to reproduce the depth information from a 2D image in order to reproduce the object perception of human beings. This is the motivation behind our work.
In fact, the depth information of a 2D image can be represented by a depth map which consists of depth values for all pixels in the 2D image. These depth values can describe
Figure 2.1: Example of how human eyes perceive depth.
the distances of different objects from the viewer and reflect their relative locations. As such, the 2D image coupled with its corresponding depth map can generate the 3D visual effects.
To this end, many depth clues (e.g., information related with depth perception) can
be used to generate the depth map as shown in Table (2.1). According to the number
of viewing images, these depth clues can be divided into two groups: Stereoscopic View
with two different viewing images and Monocular View where only one single viewing
image is available. We will introduce these depth clues briefly in the following sections:
Table 2.1: Depth clues used for depth inferring algorithms.

Category            Depth Clue
Stereoscopic View   binocular disparity, parallax motion
Monocular View      in-focus degree, occlusion, texture gradient, shading,
                    relative familiar size, vanishing point
2.1.1 Depth Clues for Stereoscopic View
The most straightforward method is to recover the depth map from a stereoscopic view (e.g., two images of the same object from two different perspectives). One of the most important sources of stereoscopic view is provided by "binocular disparity", which perceives the depth information by observing the positional difference between the two retinal projections of the same object. In [68], Wheatstone demonstrated the depth perception caused by retinal disparity, which is known as "stereopsis" and can be illustrated in Fig. (2.2). In this figure, an object is captured by two cameras from two different perspectives (e.g., left and right). As such, there exist two projection points (i.e., one is the camera projection point and the other is the projection point without camera) and thus one disparity at each perspective. Thereby, these two disparities can be used to further compute the object depth according to geometry information. Based on this depth clue, a large number of research works such as [23, 63, 17] have focused on the computation of disparity in order to provide depth perception.
[Figure: stereo geometry showing the left and right view points, camera projection points, left and right disparities, focal lengths, and the object depth.]
Figure 2.2: Stereoscopic view illustration
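To make the geometry of Fig. 2.2 concrete, the following small Python sketch applies the standard triangulation relation (depth = focal length × baseline / disparity) that underlies the figure; the function name, parameter values, and units are illustrative assumptions, not part of the original text.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) into a depth map (meters) for a
    rectified stereo pair: depth = f * B / d. Pixels with zero or negative
    disparity (no reliable match) are assigned an infinite depth."""
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(d, np.inf)
    valid = d > 0
    depth[valid] = focal_length_px * baseline_m / d[valid]
    return depth

# Example: a 35-pixel disparity with a 700-pixel focal length and a 12 cm
# baseline corresponds to a depth of about 2.4 m.
print(depth_from_disparity(np.array([[35.0]]), focal_length_px=700.0, baseline_m=0.12))
```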
Another important depth clue of stereoscopic vision is "parallax motion". Motion parallax is a depth cue that results from our motion. In fact, Helmholtz [28] first noticed an important observation: objects at different distances from the observer appear to move at various velocities to human eyes [28]. For example, consider an observer who moves horizontally and observes the relative movement of still objects in front of him/her. As a result, the closer objects tend to have a faster velocity than those objects distant from the observer. Thereby, the depth perception of 3D vision can be inferred according to the relative motion speed of different objects, which is known as "motion parallax". Note that motion parallax cues result from the movement of projected points on the eye retina over time, while stereoscopic disparity cues depend on the projections on the retina at one time. Also,
both of them need to compare the perspective projections of the same scene from different viewing positions (e.g., retinal projections on two eyes).
2.1.2 Depth Clues for Monocular View
As shown in Table (2.1), there are several depth clues for the monocular view, where only one viewing image is available, and we present these depth clues below.
The first depth clue for the monocular view is the in-focus degree, which is based on an important observation: different regions within the same image are often blurred by different amounts due to the focus limitations of cameras [51]. For example, we can observe the image in Fig. (2.3), where the focus is on the cat in the foreground and the remaining parts are severely blurred. As such, the depth perception can be obtained according to the in-focus degree.
Figure 2.3: Focus Illustration
Another important monocular clue is occlusion, which was first introduced by Chapanis [15]. Typically, occlusion happens when an object obscures another object behind it. We can consider the image in Fig. (2.4) as an example, where the near surface occludes the further rock. In particular, as demonstrated by Kellman and Shipley [40], the nearer object partially hides the object behind it and thus creates "T-shaped junctions" as shown in Fig. (2.4). As such, these T-junctions contain relative depth information of the objects in occlusion. For example, a T-junction's stem is defined by the occluded object (further object) while the roof corresponds to the occluding object (near object). Therefore, the T-junction structures resulting from occlusion can be used to recover the depth information of occluded objects [38].
Figure 2.4: Occlusion Illustration
Gibson [25] discovered that when a large surface recedes in distance, the projected size of a unit area decreases isotropically with the distance. More importantly, especially when the surface is covered by repeated patterns, the projected size of these patterns can be described as a function of their distance from the viewer and the orientation of the surface. This observation reveals another depth clue for the monocular view: the "texture gradient", as shown in Fig. (2.5). Clearly, the square patterns that are near to the viewer are much larger in size than the further patterns. On the basis of this observation, Gibson suggested that depth perception can be obtained from the density of surface texture.
Figure 2.5: Texture Gradient Illustration
Next, we will introduce some monocular clues of depth (e.g., shading, aerial perspec-
tive and relative size) that heavily depend on object properties (i.e., color, size) and
illumination conditions.
One of these important monocular cues relying on the illumination condition is "shading/shadow". In fact, Yonas [9] has provided an accurate description of shading and shadow: shading is the luminance distribution on a surface from a single light source, and shadow is defined as the luminance attenuation on a surface as a result of the occlusion of the lighting source. Let us consider the sculptured bust shown in Fig. (2.6), where the luminance changes dramatically on surfaces with a steep slope (e.g., the surface of the nose area) and the 3D model of this nose area can be recovered by analyzing the luminance changes. Usually the shading can be used to infer the 3D shape of a surface because the amount of reflected light can be used to describe the surface's orientation. In addition, the shadow can be used as a complementary description of the 3D surface. Therefore, shading and shadows can both be utilized for depth inferring of 2D images.
Figure 2.6: Shading and Shadows Illustration
The last monocular clue is "relative familiar size": an object has a smaller camera projection when it moves away from the observer. In this way, if the actual relative sizes of
similar objects are known, their relative depths can be inferred from their projection size
difference. This depth clue has been studied by Hittelson [30] and is especially useful
when similar objects with various depths are presented. As shown in Fig.(2.7), there are
two similar boats in this image and the further boat seems to be much smaller than the
closer one. Thereby, the relative depth orders of these boats can be determined using
their relative sizes. Furthermore, Lehmann [69] suggested that the relative familiar size
tends to be led by other depth cues in the perception procedure.
Figure 2.7: Relative Size Illustration
Note that the above-mentioned depth clues can only infer the depth order of objects rather than their distance from the viewer. To provide depth information about distance, the "vanishing point" was first proposed by Brunelleschi [66]. In general, this depth clue adopts a linear perspective, where parallel lines in the 3D space converge towards one single point (known as the "vanishing point") as these lines recede in distance. This phenomenon is illustrated in Fig. (2.8), where the parallel rails in the 3D space actually converge to one single point that is far away from the observer. This effect can provide much meaningful information about the objects in 3D space, such as depth, size, orientation, etc. Hence, the knowledge of the vanishing point plays an essential role in many depth inferring and 3D reconstruction algorithms.
Figure 2.8: Vanishing Point Illustration
2.1.3 Comparison of Stereoscopic View and Monocular View
When the above-mentioned depth clues are used for depth inferring, they display various performances in different cases. In general, when the objects are close to the viewer, the disparity between two perspectives becomes much more evident, thereby making stereo disparity and parallax motion more suitable in these cases. However, the disparity becomes less distinguishable as the objects move far away from the viewer, where monocular clues become more effective.
Within the monocular clues, the vanishing point and texture gradient are more suitable for man-made environments, which include vertical surfaces along with horizontal floors. In addition, the shading clue is useful for single lighting-source environments where object surfaces have a uniform lighting reflection ratio. Moreover, the occlusion clue is more effective for environments with overlapping objects, while relative size is better for similar objects with different sizes. Therefore, suitable depth clues should be carefully chosen for specific environments in order to recover the depth information.
2.2 Depth Estimation
Based on the various depth clues in the previous section, many depth map generation methods have been proposed in the past few years [23, 63, 17, 62, 7, 29, 64, 21, 34] and can be categorized into two groups: "stereoscopic view" methods and "monocular view" methods. Note that each depth estimation method usually deploys a combination of various depth clues to achieve better performance.
2.2.1 Depth Map Generation from Stereoscopic View
We first present the depth-map generation methods based on “stereoscopic view” [17,
23, 63] which attempt to recover the depth information from the stereo images (e.g.,
two images of the same object from two different perspectives). For example, Frueh and
Zakhor [23] have successfully constructed a 3D city model by combining its ground view and aerial view. These methods usually include three major stages (a toy sketch follows the list):
• The features of the same object on two images should be matched with each other.
• The relative displacements of each object (known as “disparity”) can be calculated
with two images.
• Lastly, the depth map of each object can be determined with these object disparities.
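The toy Python sketch below walks through these three stages for a rectified image pair with a naive sum-of-absolute-differences block matcher. It is an illustration of the general idea only, not the method of [23] or of any specific stereo algorithm, and all parameter choices are assumptions.

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=32, block=5):
    """Naive SAD block matching on rectified grayscale images (H x W arrays).

    Stage 1: a (block x block) patch around each left-image pixel plays the
             role of the matched feature.
    Stage 2: the horizontal shift (0..max_disp) minimizing the sum of absolute
             differences against the right image is taken as the disparity.
    Stage 3: depth then follows from depth = f * B / disparity (Sec. 2.1.1).
    """
    h, w = left.shape
    half = block // 2
    disparity = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            ref = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = [np.abs(ref - right[y - half:y + half + 1,
                                        x - d - half:x - d + half + 1]).sum()
                     for d in range(max_disp)]
            disparity[y, x] = np.argmin(costs)
    return disparity
```

In practice the first two stages are carried out with more robust feature matching and regularized disparity estimation, but the overall flow remains the same.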
Note that the disadvantages of these methods are three-fold: first, the stereoscopic view based methods are effective only if the features of the same objects from the two images can be matched with each other. Otherwise, an image matching failure could happen and thus the estimation of object disparity is less reliable. Second, the stereoscopic view based
methods are inaccurate when the object disparities become indiscernible. For example,
the disparity becomes imperceptible when the objects are far from the viewers. Third, in
some applications (e.g., historical visual data), it is extremely expensive or impossible to
provide two images from different perspectives and thus only one single image is available.
Hence, it is necessary to utilize the monocular clues for depth estimation.
2.2.2 Depth Map Generation from Monocular View
In this section, we introduce some approaches using some monocular clues as shown in
Table (2.1).
2.2.2.1 Vanishing Point Based Method
Horst first used the Hough transform to detect the lines in images and proposed the concept of "vanishing point" in [65]. In addition, Delage [21] presented an effective algorithm to predict the depth information of indoor images using the vanishing point. These approaches usually include three major steps (a rough sketch of the first step follows the list):
• The vanishing point should be located with standard techniques [65].
• The location of the floor boundary should be further estimated.
• With perspective geometry, the depth map of the image can be completely recovered and 3D vision can be further reconstructed.
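As a rough, hypothetical sketch of the first step only (not the exact procedure of [65] or [21]), the snippet below detects line segments with the Hough transform and takes their least-squares intersection as a crude vanishing-point estimate; all thresholds are arbitrary choices for illustration.

```python
import cv2
import numpy as np

def estimate_vanishing_point(image_bgr):
    """Crude vanishing-point estimate from Hough line segments.

    Each segment (x1,y1)-(x2,y2) defines a line a*x + b*y = c with normal
    (a, b) = (y2-y1, x1-x2). Stacking all lines gives an overdetermined
    system A p = c; its least-squares solution p is the point closest to
    all detected lines, used here as the vanishing-point estimate.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    segments = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                               minLineLength=40, maxLineGap=5)
    if segments is None:
        return None
    A, c = [], []
    for x1, y1, x2, y2 in segments[:, 0]:
        a, b = float(y2 - y1), float(x1 - x2)
        A.append([a, b])
        c.append(a * x1 + b * y1)
    p, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(c, dtype=float), rcond=None)
    return tuple(p)  # (x, y) in image coordinates
```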
However, these approaches make some strict assumptions: the scene must include both ground/horizontal planes and vertical surfaces; the position and the calibration matrix of the camera must be known; and the images must contain vanishing points. These assumptions actually limit the applications of these vanishing point based methods.
2.2.2.2 Occlusion Based Method
Dimiccoli [19] made use of “occlusion clue” to effectively detect the T-junctions shown
in Fig. (2.4) in one single image and further infer the depth information using these
T-junctions. In detail, this method first applies a bilateral filter to a 2D image, which removes the texture details and retains the main line structure. Then, this image is partitioned into several small regions using a segmentation method and the T-junctions can be further detected. As such, the depth map can be generated with the T-junctions and segmentation results.
These occlusion clue based methods can work very well in cases where occluded objects can be easily distinguished from each other. However, when the occluded objects have the same color or texture, it becomes extremely difficult to locate the T-junctions using the occlusion clue.
2.2.2.3 Shading Based Method
Recently, Kang [24] proposed to infer the depth map using the "shading clue", which makes use of the relationship between the reflectance map and the depth map. In general, the 2D image is first divided into small triangular patches and the reflectance value on each patch can be modeled as a linear function of the depth and surface angle of the same patch. As such, the depth map can be generated by solving a set of linear equations iteratively until the solution converges.
Note that the shading clue is usually suitable for a single lighting-source environment where object surfaces have a uniform reflection ratio due to the linear model. Therefore, its performance can severely degrade on complex and irregularly textured images.
2.2.2.4 In-focus Degree Based Method
To make use of “in-focus degree”, Levin [6] proposed to modify the conventional cameras
by inserting a patterned occluder into the aperture of the camera lens. As such, a coded
aperture shown in Fig. (2.9) can be created so that the in-focus pattern is different from
the natural images and can be easily detected. Thereby, the depth information can be
recovered by analyzing the in-focus degree.
Figure 2.9: Modified Camera Aperture: (a) Conventional Camera Aperture, (b) Modified Camera Aperture
It is obvious that the modified camera needs a special coded aperture designed by
experts which prevents these methods from more extensive applications.
2.2.2.5 Texture Gradient and Size Based Method
Saxena [34] proposed a framework to estimate the 3D vision based on the observation that the environment is reasonably structured. In particular, this approach partitions the environment into many small non-overlapping patches where each patch has a homogeneous texture (known as a "superpixel"). As such, the texture gradient and relative size can be computed among these superpixels and the depth map of this environment can be completely recovered.
Note that this approach heavily depends on the texture gradient; therefore, it becomes ineffective for some special cases (e.g., the sky) where the texture gradient is almost the same over the entire image. In addition, this approach is suitable for outdoor images with uniform illumination because a complex illumination could significantly change the texture gradient and lead to an incorrect depth map.
It can be observed that most of these approaches are designed for images with specific environments, thereby making them unsuitable for general applications. Therefore, it is urgent to develop a general 2D-to-3D image conversion algorithm that can be applied to a broad range of images.
Chapter 3
Depth Inference from 2D Image/Video: Theory and
Algorithms
With the rapid development of 3D display devices (i.e., auto-stereoscopic displays), three-dimensional (3D) visualization has become increasingly popular in recent years, as it can provide vivid depth cues as we perceive in our daily lives. However, these modern 3D devices have insufficient 3D content for display because most recorded visual data are stored in the form of 2D monocular videos. Therefore, it is desirable to convert 2D visual data into its corresponding 3D form. This is known as the 2D-to-3D video conversion problem, which has become a hot topic in both academia and industry.
The most straightforward approach is to recover the 3D depth clues from 2D still key
frame images. To this end, a few 3D reconstruction techniques based on “stereo-vision”
have been proposed [23, 63, 17] and attempt to recover the depth information from the
stereo input images or a series of 2D still images on the same object. For example,
Frueh and Zakhor [23] have constructed a 3D city model by merging its ground view and airborne view. The disadvantages of these approaches are two-fold: first, it is usually more expensive to obtain the stereo visions by fusing multi-modal images of the same object since multiple image capturing devices are needed. Second, in some applications (e.g., when visual data cannot be re-recorded or are too expensive to re-record), there is only one single still image available.
To resolve this issue, we propose an effective algorithm in this chapter to recover the
depth information from one single still image for 2D-to-3D video conversion purpose. In
particular, we first focus on the problem of inferring the depth map, which provides the
depth information of each pixel in the image, from one single key frame image using
several monocular cues. Then, the depth map for one single key frame can be propagated
into its neighboring frames within the entire video sequence.
Clearly, the key problem is how to recover the depth information for one single key
frame image. To this end, we first adopt the in-focus-region detection method [59] to esti-
mate the location of foreground objects since in-focus regions usually contain foreground
objects. Then the boundary separating the foreground objects from the background re-
gion can be determined using salient computation [32] and the grab cut algorithm [13].
As such, foreground objects can be segmented out from the entire image and their depth
values can be assigned based on their relative sizes and occlusion information. Lastly,
the vanishing point detection method [65] has been modified to generate the depth-map
for the background region. Therefore, the depth map for the entire key frame image has
been recovered completely and is ready to be propagated into its neighboring frames.
The proposed depth-inference algorithm has been validated with a large set of 2D test video sequences from the industry; it is very generic and can be easily applied to a broad range of circumstances.
26
The rest of this chapter is organized as follows. The overview of the proposed depth inference algorithm is described in Sec. 3.1. The key frame depth map generation module is presented in Sec. 3.2 and the depth-map generation for the entire video is detailed in Sec. 3.3. Experimental results are provided in Sec. 3.4 to validate the proposed algorithm. Sec. 3.5 concludes the chapter.
3.1 Algorithm Overview
The block diagram of the proposed depth inference algorithm is shown in Fig. (3.1), which
consists of three major stages:
• Stage I: Image Segmentation
First of all, we aim to separate the foreground objects from the background us-
ing image segmentation. As such, we can perform depth-map assignments for the
foreground and the background separately, which can eliminate the potential in-
terference and improve the accuracy of generated depth map. For this purpose,
many advanced techniques have been employed in this stage: in-focus detection
[59], salient computing [32] and the grab cut algorithm [13].
• Stage II: Image Depth-map Generation
In this stage, the depth values are assigned to the foreground objects based on
geometric cues such as the relative sizes and occlusion information. In addition, the
depth-map for the background can be generated using a modified vanishing point
detection method [65].
• Stage III: Video Depth-map Generation
To generate the depth maps for neighboring frame images, a general belief propa-
gation method [72] is used to interpolate them with the known depth map of the
key frame image.
These techniques are detailed in the following sections; a skeleton of the overall pipeline is sketched after Fig. 3.1.
[Flowchart: the input image passes through in-focus region detection and salient map computation, followed by the grab cut algorithm to separate foreground and background regions; depth assignment and vanishing point detection produce the still image depth map, and a belief propagation method generates the neighboring depth maps.]
Figure 3.1: The block-diagram of the proposed depth map inference algorithm.
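To make the flow of Fig. 3.1 explicit, the skeleton below strings the three stages together. It is only an organizational sketch: the stage functions are supplied by the caller and stand in for the modules described in Sections 3.2 and 3.3; none of the names correspond to actual implementations from this work.

```python
import numpy as np

def infer_video_depth(frames, stages, key_frame_index=0):
    """Organizational skeleton of the proposed pipeline (Fig. 3.1).

    `stages` is a dictionary of caller-supplied callables standing in for:
      'in_focus', 'saliency', 'grab_cut'       -> Stage I  (Sec. 3.2.1-3.2.2)
      'background_depth', 'foreground_depth'   -> Stage II (Sec. 3.2.3)
      'propagate'                              -> Stage III (Sec. 3.3)
    """
    key = frames[key_frame_index]

    # Stage I: separate foreground objects from the background.
    focus_mask = stages['in_focus'](key)
    salient_map = stages['saliency'](key, focus_mask)
    foreground = stages['grab_cut'](key, salient_map)        # boolean mask

    # Stage II: still-image depth map generation.
    bg_depth = stages['background_depth'](key, foreground)   # vanishing point based
    fg_depth = stages['foreground_depth'](key, foreground)   # relative size / occlusion
    key_depth = np.where(foreground, fg_depth, bg_depth)

    # Stage III: propagate the key-frame depth map to neighboring frames.
    return stages['propagate'](frames, key_frame_index, key_depth)
```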
3.2 Image Depth Map Generation
In this section, we will introduce the still image depth map generation in detail, as illustrated in Fig. (3.1).
3.2.1 In-Focus Region Detection
The first step is to detect the in-focus region, which tends to contain the foreground objects. Moreover, the objects which are in focus during the image capturing stage usually have higher sharpness. Therefore, the in-focus detection method [22], which measures the sharpness of a specific region, can be used to locate the in-focus region and further identify the foreground objects in the focus region.
For example, let us consider the original image in Fig. 3.2(a); its in-focus region is shown in Fig. 3.2(b), which provides a rough idea of the foreground region (shown in Fig. 3.2(d)) for the image segmentation.
We will now present the detailed techniques for in-focus region detection. First, we have to define the "blurring degree" of each pixel, which measures the sharpness of specific regions. Particularly, the larger blurring degree a region has, the less sharp this region appears. In detail, we need to estimate the blurring degree at edges (e.g., boundaries/contours separating different regions or objects) and then propagate the blurring degree to the neighboring regions that have similar intensity and color. This approach assumes that blurriness should vary smoothly over the image except for the regions where the color is discontinuous.
Figure 3.2: (a) The original image, (b) the in-focus region, (c) the salient focus region
and (d) the detected foreground region.
As illustrated in [59], an edge can be modeled as a step function in the intensity
domain and the edge focal blur is caused by its blurring kernel that is modeled as a
Gaussian function g(x, y, σ_b):

g(x,y,\sigma_b) = \frac{1}{2\pi\sigma_b^2}\,\exp\!\left(\frac{-(x^2+y^2)}{2\sigma_b^2}\right)    (3.1)

where σ_b denotes the blurring degree, and it is extremely difficult to estimate σ_b directly.
Instead of direct estimation, we can evaluate it indirectly from the width of a blurred edge. Assume a blurred edge along the y axis with amplitude A and blur parameter σ_b. The expected edge response to the second derivative filter can be modeled by:

r_{x^2}(x,y,\sigma_2) = A\,u(x) * g_{x^2}\!\left(x,y,\sqrt{\sigma_b^2+\sigma_2^2}\right)    (3.2)
                      = \frac{-Ax}{\sqrt{2\pi}\,(\sigma_b^2+\sigma_2^2)^{3/2}}\,\exp\!\left(\frac{-x^2}{2(\sigma_b^2+\sigma_2^2)}\right)    (3.3)
                      = \frac{-Ax}{\sqrt{2\pi}\,(d/2)^3}\,\exp\!\left(\frac{-x^2}{2(d/2)^2}\right)    (3.4)

with

(d/2)^2 = \sigma_b^2 + \sigma_2^2    (3.5)

where u(x) is a step function, σ_2 is the scale of the second derivative operator, and A can be obtained by calculating the local extrema around an edge pixel.
[Figure: a blurred edge and its second derivative response model; the extrema of opposite sign are separated by the distance d.]
Figure 3.3: The model for the distance between second derivative extrema.
From the formula, it is indicated that σ_b can be estimated by measuring the distance d (edge width) in Fig. (3.3) between the second derivative extrema of opposite sign in the edge's gradient direction [22]. In order to measure d, we use the pixel response model [59] in equation (3.4) and find the edge width d (Fig. (3.3)) using a least-squares fitting error model; then we can compute the size of the blur kernel σ_b using equation (3.5). This provides us with an indirect blur measure only at edge pixels within one image.
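Equation (3.5) turns a measured edge width directly into a blur estimate; the short sketch below (an illustrative helper, with parameter names of our own choosing) performs that step once d and the operator scale σ_2 are known.

```python
import numpy as np

def blur_from_edge_width(d, sigma_2):
    """Estimate the blur kernel size sigma_b from Eq. (3.5):
    (d/2)^2 = sigma_b^2 + sigma_2^2.

    d is the measured distance between the second-derivative extrema of
    opposite sign along the edge gradient direction; sigma_2 is the scale of
    the second derivative operator. Widths narrower than the operator would
    give a negative radicand, so they are clipped to zero (a sharp edge).
    """
    radicand = (np.asarray(d, dtype=float) / 2.0) ** 2 - float(sigma_2) ** 2
    return np.sqrt(np.clip(radicand, 0.0, None))

# Example: an edge width of 6 pixels with a second-derivative scale of 1.5
# gives sigma_b = sqrt(9 - 2.25), roughly 2.6 pixels.
print(blur_from_edge_width(6.0, 1.5))
```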
The above blur estimation only provides the blurring degree at edge pixels; we also need to propagate blurring degrees from edge pixels to the whole image. To do so, we adopt an optimization procedure [5] and assume that the blurring degree distribution is smooth in those regions where intensity and color are smooth. Following the basic idea of [5], neighboring pixels p, q should have similar blurriness if they have similar intensities and colors. As such, the optimal propagation result can be obtained by solving the following sparse linear system with the local smoothness constraint described before:
(L+λU)t=λ
˜
t (3.6)
where
˜
t is our edge blurring degree map, U is an identity matrix, and t is our target.
In this way, we can generate the blurring degree map for the entire image by solving this
equation.
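As an illustration only (not the implementation used in this work), the sketch below solves a system of the form (3.6) with an off-the-shelf sparse solver. The graph Laplacian here is a simplified color-similarity Laplacian rather than the exact affinity of [5], U is taken as a diagonal indicator over the edge pixels (a common variant of the identity weighting), and the names build_laplacian, propagate_blur and all parameter values are illustrative assumptions.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def build_laplacian(img, sigma=0.1):
    """Sparse graph Laplacian over 4-connected pixels; edge weights decay with
    color difference so blur values propagate within color-smooth regions
    (a simplified stand-in for the affinity construction of [5])."""
    h, w, _ = img.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    rows, cols, vals = [], [], []
    for dy, dx in ((0, 1), (1, 0)):                      # right and down neighbors
        p = idx[:h - dy, :w - dx].ravel()
        q = idx[dy:, dx:].ravel()
        diff = img[:h - dy, :w - dx] - img[dy:, dx:]
        wpq = np.exp(-np.sum(diff ** 2, axis=2).ravel() / (2 * sigma ** 2))
        rows += [p, q]; cols += [q, p]; vals += [wpq, wpq]
    W = sp.coo_matrix((np.concatenate(vals),
                       (np.concatenate(rows), np.concatenate(cols))),
                      shape=(n, n)).tocsr()
    return sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W

def propagate_blur(img, edge_blur, edge_mask, lam=0.05):
    """Solve (L + lambda*U) t = lambda * t_tilde as in Eq. (3.6); U is a diagonal
    indicator over edge pixels so only the measured blur values act as constraints."""
    h, w = edge_blur.shape
    L = build_laplacian(img.astype(np.float64) / 255.0)
    U = sp.diags(edge_mask.ravel().astype(np.float64))
    A = (L + lam * U + 1e-8 * sp.eye(h * w)).tocsc()     # tiny ridge for solvability
    t = spla.spsolve(A, lam * edge_blur.ravel())
    return t.reshape(h, w)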
3.2.2 Image Segmentation
With conventional approaches to image segmentation, user interaction is required
to obtain a rough interface region between the foreground objects and the background,
which is not fully automatic. To make this process automatic and efficient, we propose
to adopt the salient computation [32] and the grab cut algorithm [13] to refine the results
from in-focus detection so as to avoid any user interaction. In general, a salient region
can be extracted using the salient computation, which contains the foreground objects
and a partial background region. Then, the grab cut algorithm can further remove the
background region from the salient region to provide a more accurate foreground region,
which can significantly improve the performance of image segmentation. We will detail
these two techniques in the following.
3.2.2.1 Salient Map Computation
The salient computation method was proposed in [32] to detect the "stand-out" (or salient)
objects within an image. To be specific, this approach computes the degree of "saliency"
of each pixel to measure its level of "standing out" from its neighborhood, and further
identifies the salient region. In our proposed method, salient computing is applied to the
in-focus region shown in Fig.(3.2(b)), which retains the foreground and removes
most of the background region; thereby, the potential interference from the background of the
entire image can be avoided in order to improve the accuracy of image segmentation. For
example, we show the results of salient computing in Fig.(3.2(c)), where the most salient
objects (i.e., penguins) have been detected and the in-focus region has been further refined.

Salient computation usually involves the computation of a feature map and a salient map.
Here we use the feature maps described in [32]. Following the concept in [32], we calculate
our salient map as follows.
The salient map S can be computed by measuring the dissimilarity between neighboring
pixels in the feature map. In fact, the dissimilarity of F(i,j) and F(p,q) can be
defined as:

d((i,j)\|(p,q)) = \left|\log\|F(i,j)-F(p,q)\|_2\right|   (3.7)

The dissimilarity is measured on a logarithmic scale. In addition, we can further consider the
feature map as a fully-connected directed graph G, and the directed edge from node (i,j)
to node (p,q) can be assigned a weight value w((i,j),(p,q)) as:

w((i,j),(p,q)) = \exp\big(d((i,j)\|(p,q))\big)\cdot\exp\!\left(-\frac{(i-p)^2+(j-q)^2}{2\sigma^2}\right)   (3.8)
Here σ is a constant. In this way, the edge cost from node (i,j) to node (p,q) is
related to both their dissimilarity and their local closeness. To locate those "stand-out"
regions, we can define a Markov network on G, where the distribution is proportional to
the time a random walker would spend at each node (pixel). Applying the Markov random walk,
this process results in large values at those nodes that have high dissimilarity with their
surrounding nodes, as the transition probability into other subgraphs is high. In this way,
the regions that "stand out" and have high dissimilarity with their neighborhood are marked
with high values in the final salient map S.
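The random-walk reading of (3.7) and (3.8) can be sketched as follows. This is a minimal illustration assuming a small (downsampled) scalar feature map F, with the stationary distribution obtained by power iteration; the function name and parameter values are illustrative, not from the original implementation.

import numpy as np

def saliency_random_walk(F, sigma=5.0, iters=200):
    """F: 2-D feature map (downsample first; the graph is fully connected).
    Returns a saliency map proportional to the stationary distribution of a
    random walk whose edge weights follow Eqs. (3.7)-(3.8)."""
    h, w = F.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)
    f = F.ravel().astype(np.float64)

    # Eq. (3.7): dissimilarity measured on a logarithmic scale.
    d = np.abs(np.log(np.abs(f[:, None] - f[None, :]) + 1e-6))
    # Eq. (3.8): weight = dissimilarity term * spatial closeness term.
    dist2 = np.sum((pos[:, None, :] - pos[None, :, :]) ** 2, axis=2)
    W = np.exp(d) * np.exp(-dist2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Row-stochastic Markov transition matrix and its stationary distribution.
    P = W / W.sum(axis=1, keepdims=True)
    pi = np.full(h * w, 1.0 / (h * w))
    for _ in range(iters):
        pi = pi @ P
    return pi.reshape(h, w)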
3.2.2.2 Color Based Grab Cut Algorithm
It can be observed that the salient region in Fig.(3.2(c)) still contains partial background
information, so it cannot be used as the foreground directly. To further remove the background
from the salient region, the grab cut algorithm should be used. In fact, the grab
cut algorithm was proposed in [13] to segment a foreground object from its background
under the assumption that the background region can be modeled by a Gaussian mixture
model (GMM), which can be expressed as:
h(x) = \sum_{k=1}^{K} w_k\cdot\frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(x-\mu_k)^2}{2\sigma_k^2}}   (3.9)
where K is the number of Gaussian mixtures, w_k > 0 are mixture weights whose total
sum equals one, and μ_k and σ_k are the mean and the standard deviation of the
k-th Gaussian distribution component, respectively.
Specifically, the grab cut algorithm [13] models an image in color space by two
GMMs: one background model and one foreground model. Each of these two models is a
Gaussian mixture with K components. Here k = {k_1,...,k_N} is used to assign a GMM
component to each pixel value x, taken either from the background GMM or from the
foreground GMM. Thus, the grab cut segmentation can be modeled as an energy minimization
problem, with an energy function E defined as

E(\alpha,k,\theta,x) = U(\alpha,k,\theta,x)+V(\alpha,x)   (3.10)

where U estimates the data cost of a certain segmentation, and V is a smoothness term
that enforces local continuity.
Moreover, using the color GMM models, the data term U can be defined as:

U(\alpha,k,\theta,x) = \sum_{n=1}^{N} D(\alpha_n,k_n,\theta,x_n)   (3.11)
where D(\alpha_n,k_n,\theta,x_n) = -\log p(x_n|\alpha_n,k_n,\theta) - \log\pi(\alpha_n,k_n), p(\cdot) is a Gaussian
pdf, and \pi(\cdot) are the mixture weighting coefficients of the corresponding Gaussian pdf.
The smoothness term V is used to guarantee image spatial smoothness:

V(\alpha,x) = \sum_{(m_1,m_2)\in C} \frac{1}{\|x_{m_1}-x_{m_2}\|^2}\,[\alpha_{m_1}\neq\alpha_{m_2}]\, e^{-\beta\|x_{m_1}-x_{m_2}\|^2}
where [\cdot] denotes the indicator function taking the value 0 or 1, (m_1,m_2) is a pair of
neighboring pixels, C is the set of pairs of neighboring pixels, and β is a free parameter.
The smoothness term V adds coherence in regions of similarity.
As such, using the above definition of U and V, we can then minimize the energy
function E, which results in an optimal segmentation.
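As a usage sketch rather than the exact pipeline of this chapter, OpenCV's grabCut routine implements this kind of GMM-based energy minimization. Here a hypothetical binary map salient_mask is assumed to mark the salient region and is used to seed the probable foreground/background labels; pinning the image border as definite background is an added heuristic, not part of the original method.

import cv2
import numpy as np

def segment_foreground(image_bgr, salient_mask, iters=5):
    """Refine a salient-region estimate into a foreground mask with GrabCut.
    `salient_mask` is a hypothetical uint8 map (nonzero inside the salient region)."""
    mask = np.where(salient_mask > 0, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
    # Heuristic: pin the image border as definite background to stabilize the GMMs.
    mask[0, :] = mask[-1, :] = mask[:, 0] = mask[:, -1] = cv2.GC_BGD
    bgd_model = np.zeros((1, 65), np.float64)    # internal GMM parameter buffers
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model, iters,
                cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)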
3.2.3 Monocular Depth Cue Integration
After segmentation, the foreground and background regions are separated from each
other, and we can further derive an image depth map using the following procedure.
3.2.3.1 Background Depth Map Generation
The depth map for the background region can be generated using a modified version
of vanishing point detection [65]. The traditional vanishing point detection method [65]
involves the detection of dominant lines using the Hough transform, which is not applicable
to images that have no obvious line structure. In this work, we propose a new
vanishing point detection algorithm, which finds the focus order of objects in the background
region.
This idea is inspired by the observation that farther objects are more blurred than
nearer ones, which means that farther objects appear deeper in the background. Furthermore,
we extend the vanishing point concept to describe the near/far relationship of objects in
the background. That is, we determine a "vanishing point" location by comparing the
locations of the in-focus region and the background in the image. Then, each object in
the background region can be assigned a depth value according to its relative location
along the vanishing line. The resulting depth map for the foreground and the background
is shown in Fig.(3.4). Note that we use the gray level of the pixels to denote the depth
values in the image; the darker the pixel is, the farther it is located. The result is clearly
consistent with human perception.
Figure 3.4: The resulting depth map for the image given in Fig.(3.2(a)).
In detail, if we exclude the foreground objects from an image, we can calculate the
background depth map using the blurring degree variation in the background region. To do so,
we first divide the whole background region into four different regions (i.e., up-left, up-right,
bottom-left and bottom-right) according to their locations in the image. Then
we further divide these regions into smaller blocks, and a background depth map can be
obtained using a straightforward strategy:
• step 1: Set the blurring degree of each block by majority voting over the blurring degrees
of the pixels within the block. In our implementation, the block size is set to 8×8 pixels.
• step 2: For the four given regions, assign each region its own depth value by
averaging the blurring degrees of all blocks within this region.
The decision rule for the vanishing point location is described in Fig.(3.5).

Figure 3.5: flowchart of background depth map generation.

First of all, we check for the existence of a vanishing point by picking the two regions
containing the maximum and minimum depth values. If the difference between them is less than a
pre-defined value, we assume there is no vanishing point (VP). In addition, if the sum of the two
bottom regions' depth values is sufficiently greater than that of the two upper regions, it means
that the VP is located below the image. However, that case barely exists, and we can assume that
there is no VP in that case. Furthermore, if the absolute difference between the sum of the left
two regions' depth values and the sum of the right two regions' depth values is greater than the
absolute difference between the sum of the upper two regions' depth values and the sum of the
bottom two regions' depth values, then Left or Right is the candidate for the VP location;
otherwise, Up or Down is the candidate. Finally, we obtain one of six candidates as the final VP
location according to the assigned depth values. Once the vanishing point has been located, we can
further generate the background depth map [65] accordingly. One example of our background depth
map generation is shown in Fig.(3.6), where the original images are on the left and the
corresponding depth maps are on the right (the images in the middle are intermediate results from
in-focus detection and the grab cut algorithm).
Figure 3.6: background depth map generation result (two rows for two test images).
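The region-comparison rule above can be condensed into a small decision function. The sketch below assumes the four average block blurring degrees are already computed, interprets a larger value as "more blurred, hence farther" (an assumption about the convention used here), uses placeholder thresholds, and omits the "inner" vanishing point case.

def locate_vanishing_point(d_ul, d_ur, d_bl, d_br, flat_thresh=0.1):
    """Coarse vanishing-point location from the average blurring-degree values of
    the up-left / up-right / bottom-left / bottom-right background regions.
    Returns one of: None, 'left', 'right', 'up', 'down'."""
    depths = (d_ul, d_ur, d_bl, d_br)
    # No VP if the most- and least-blurred regions are too similar.
    if max(depths) - min(depths) < flat_thresh:
        return None
    # A VP below the image barely occurs in practice; treat it as no VP.
    if (d_bl + d_br) - (d_ul + d_ur) > flat_thresh:
        return None
    lr_gap = abs((d_ul + d_bl) - (d_ur + d_br))
    ud_gap = abs((d_ul + d_ur) - (d_bl + d_br))
    if lr_gap > ud_gap:                       # left/right candidate
        return 'left' if (d_ul + d_bl) > (d_ur + d_br) else 'right'
    return 'up' if (d_ul + d_ur) > (d_bl + d_br) else 'down'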
3.2.3.2 Foreground Depth Assignment
When the depth map of the background is available, the next step is to generate the depth
map for the foreground region. Let us first consider an observation: when we take a picture,
the foreground plane is usually parallel to our camera plane, while the background objects
usually lie on another plane that is at a certain angle to the foreground plane. Take the
image in Fig.(3.2(a)) as an example: while the penguin is parallel to the camera plane, the
ocean is perpendicular to it.

In this way, we can categorize our images into two cases based on the relative sizes
of the foreground objects compared with the background: 1) large foreground object
area; 2) small foreground object area. In case 2, the foreground object can be viewed
as a dot in the background plane, which leads to a known depth by referring to the background
depth map. For example, we can use any point on the foreground boundary to denote its
depth value. However, this method does not work for case 1.
Next, the question becomes how to get an accurate depth map for the foreground objects
in case 1 when the background depth map is available. To answer this question, we first
need to define the "intersection points" between the foreground object plane and the
background object plane. In fact, the intersection points fall into five cases,
as shown in Fig.(3.7), once the furthest points or the vanishing point (VP) is located. In
Fig.(3.7), the foreground object plane is marked as a white oval and its corresponding
intersection points with the background are denoted as red dots. In addition, the vanishing
point is marked as a blue dot. Note that the locations of the intersection points can vary
a lot for different background regions.
Another absolute depth reference clue is the blurring degree of the foreground area
which is an important clue if the foreground is divided into many small pieces. In this
circumstance, those foreground pieces which share the same blurring degree values are
considered to be at the same depth plane. We have tested the effectiveness of these two
clues and the performance can be significantly improved when we consider both blurring
degree and intersection points together.
Figure 3.7: intersection points in (a) the Left vanishing point case, (b) the Right vanishing point
case, (c) the Up vanishing point case, (d) the Down vanishing point case, and (e) the Inner
vanishing point case.
As such, the foreground depth assignment approach can be summarized as follows:
• step 1: Locate the intersection points between the foreground object plane and the background
object plane based on the background VP location.
• step 2: Use the absolute background depth information of those intersection points
to infer the foreground depth.
• step 3: Compare the blurring degrees of the foreground pieces.
• step 4: Assign the absolute foreground depth value to each foreground object based
on the results of steps 2 and 3.
This method can provide an exact depth map if the segmentation result is accurate. However,
the segmentation from the grab cut algorithm may be misleading in some special cases. For
example, a dark night scene can make the disparity between the foreground objects'
and the background's color components too small. In this case, we apply a color histogram
equalization preprocess to the original image to enlarge the differences between the
foreground and background color components.
3.3 Video Depth Map Generation
The depth map generation for a single image is highly time-consuming, thereby making
it prohibitive to repeat this procedure for every frame in a video sequence. Instead,
we propose to use loopy belief propagation [72] in order to generate the depth maps
for an entire video. In brief, we first divide the video sequence into different scene units
so that each unit contains several "highly-correlated" (e.g., similar foreground and background
information) image sequences. Therefore, the depth map of one key frame can be
propagated to the rest of the frames within the same unit.
3.3.1 Depth Map Propagation
The loopy belief propagation method [72] can propagate the depth map of a key frame to its
successive frames; it considers both the data cost and the discontinuity cost to assign a
label to a pixel. Specifically, it considers all pixels in an image as grids in a graph, and each
grid is connected to the 4-connected neighbors of the pixel. As such, the belief propagation
method can propagate the depth values of a pixel towards its four neighboring grids. We
can define the confidence of passing a depth value to a neighbor through
a sum of two cost parts: the data cost and the discontinuity cost.
As for the discontinuity cost, it can be modeled by three popular models, (3.12)-(3.14).
The first one is the Potts model, which treats the cost as a nonzero constant when the two
neighboring pixels' depth values are different, and zero otherwise. The second one is the
truncated linear model, which treats the cost as a linear function of the depth value
difference between two neighboring pixels; when the cost exceeds a preset
maximum, it is truncated to a constant. The third one is the truncated
quadratic model, which is similar to the truncated linear model but uses a quadratic
polynomial.
Potts model: V(f_p,f_q) = \begin{cases} 0 & \text{if } f_p = f_q \\ d & \text{otherwise} \end{cases}   (3.12)

Linear model: V(f_p,f_q) = \min(s\|f_p-f_q\|,\, d)   (3.13)

Quadratic model: m^{t}_{p\rightarrow q}(f_q) = \min\big(c(f_p-f_q)^2+h(f_p)\big)   (3.14)
Moreover, the data cost is the major component that needs to be defined here. In our
implementation, we measure it by comparing the likelihood of two grids between the key
frame and its successive frame. If the two grids are considerably similar, the depth value
of a grid in the key frame can be propagated to its successive frame. In particular, by
computing a gradient map feature in the 4×4 pixel neighborhood around the pixel, we
can use the Euclidean distance between two grids as the data cost and apply the linear model
(3.13) as the discontinuity cost.
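A minimal sketch of the two cost terms, assuming grayscale frames and placeholder parameter values: the data cost compares 4×4 gradient-magnitude patches between co-located grids of the key frame and its successive frame, and the discontinuity cost follows the truncated linear model (3.13). In use, gradient_magnitude would be evaluated once per frame and data_cost at each grid.

import numpy as np

def gradient_magnitude(gray):
    """Per-pixel gradient magnitude used as the matching feature."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.hypot(gy, gx)

def data_cost(key_mag, next_mag, y, x, half=2):
    """Euclidean distance between the 4x4 gradient-feature patches of the
    co-located grids in the key frame and its successive frame."""
    a = key_mag[y - half:y + half, x - half:x + half].ravel()
    b = next_mag[y - half:y + half, x - half:x + half].ravel()
    return float(np.linalg.norm(a - b))

def discontinuity_cost(fp, fq, s=1.0, d=10.0):
    """Truncated linear model of Eq. (3.13): min(s*|fp - fq|, d)."""
    return min(s * abs(fp - fq), d)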
Figure 3.8: The resulting depth map (upper row) and ground truth (bottom row).
3.3.2 Post Process Step
The details around edges may be lost after the depth map inference. To fix this problem,
we propose to filter the depth map with a joint bilateral filter. For this filter, the
spatial component is a Gaussian kernel centered on the pixel, and the range component is an
exponential function whose exponent is the intensity difference between the current pixel
and the testing pixel. The filter is shown in equations (3.15) and (3.16).
I_p = \frac{\sum_{q\in S} w_{p,q} I_q}{\sum_{q\in S} w_{p,q}}   (3.15)

w_{p,q} = g(\|p-q\|)\, r(I_p-I_q)   (3.16)

where g(\|p-q\|) and r(I_p-I_q) denote the spatial and range components, respectively. In
general, the post-processing step aims to smooth the depth map while keeping the details
of edges and fixing some prediction errors in the depth map. We show an example in
Fig.(3.8) where the details of edges are retained after filtering.
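The following is a direct, unoptimized sketch of the filter in (3.15)-(3.16), assuming a Gaussian form for both the spatial term g(·) and the range term r(·), with illustrative window size and sigma values; a practical implementation would vectorize or approximate this.

import numpy as np

def joint_bilateral_filter(depth, guide, radius=5, sigma_s=3.0, sigma_r=10.0):
    """Smooth `depth` while preserving the edges of the guidance image `guide`
    (both 2-D arrays of the same size), following Eqs. (3.15)-(3.16)."""
    h, w = depth.shape
    depth_p = np.pad(depth.astype(np.float64), radius, mode='edge')
    guide_p = np.pad(guide.astype(np.float64), radius, mode='edge')
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys ** 2 + xs ** 2) / (2 * sigma_s ** 2))     # g(||p-q||)
    out = np.empty((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            win_d = depth_p[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            win_g = guide_p[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            rng = np.exp(-(win_g - guide_p[i + radius, j + radius]) ** 2
                         / (2 * sigma_r ** 2))                       # r(I_p - I_q)
            wgt = spatial * rng                                      # w_{p,q}
            out[i, j] = np.sum(wgt * win_d) / np.sum(wgt)
    return out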
Moreover, the depth map error can accumulate during the propagation process, and we build
a confidence map to measure the accumulated error. In fact, the confidence map is defined as
the total cost of propagating the depth values to a pixel. In other words, the value of the
confidence map corresponding to each pixel is the accumulated cost along the propagation
process, and thus a smaller value in the confidence map means "less error". As shown in
Fig.(3.9), the depth map error is actually accumulating over time. Thereby, we need to stop
the propagation temporarily when the accumulated depth map error exceeds a pre-defined
threshold value and evaluate the depth map of another key frame in order to restart the
propagation process.
Figure 3.9: The resulting confidence map of a consecutive video sequence.
3.4 Experimental Results
We use two test images as examples to validate the proposed algorithm for depth-map
generation in this section. These two test images are shown in Fig.(3.10(a)) (i.e., a snow
mountain) and Fig.(3.10(c)) (i.e., an airplane). Their depth maps are plotted in Fig.(3.10(b))
and Fig.(3.10(d)), respectively. It can be observed that the depth maps can be recovered
from the input 2D still images with satisfactory accuracy.
Figure 3.10: Examples of depth map generation: (a) the mountain image, (b) the depth
map of the mountain image, (c) the airplane image, and (d) the depth map of the airplane
image.
Figure 3.11: Performance comparison of the vanishing point detection method [65] and the
proposed method: (a) original image, (b) vanishing point detection, (c) proposed method.
In addition, we compare the vanishing point detection method [65] and the proposed
method on the same image in Fig.(3.11(a)). The result from the vanishing point detection
method is shown in Fig.(3.11(b)), which fails to provide an accurate depth map because
the vanishing point detection is applied to the entire image and one horizontal line is
identified as the "vanishing line" by mistake. On the contrary, the proposed method
offers a more accurate depth map in Fig.(3.11(c)) because the foreground can be separated
from the background before the depth map generation and the interference from the
background can be avoided.
Moreover, we further test the proposed method on an image data set from industry,
which contains a total of 778 images belonging to 50 different categories. The proposed
method provides accurate depth maps for 83% of these images, while the vanishing
point detection method achieves accurate results for only 34% of them.
3.5 Conclusion
In this chapter, we proposed an effective algorithm to recover the depth information
from a single still 2D image and propagate the depth information to the entire video
sequence. This approach utilizes several advanced techniques, such as in-focus detection
[59], salient computation [32], the grab cut algorithm [13] and the belief propagation method
[72]. In addition, extensive experiments have validated the effectiveness of the proposed
algorithm on a large number of test images from the industry.
Chapter 4
Depth Inference from 2D Image/Video: Chip
Implementation Considerations
We have proposed a depth inference algorithm in the previous chapter, which works well in
a software simulation environment. However, the 3D depth map generation technique usually
needs to run in real time on 3D display devices, such as 3D TV sets, mobile devices
and computers. Thereby, our previously proposed depth inference algorithm has
to be customized to fit the limited memory and real-time runtime requirements of these
devices.
In this chapter, we improve our previously proposed algorithm in order to implement it
on a single chip, where the depth information can be generated in real time using a small
amount of memory. In fact, the in-focus detection and grab cut algorithm in the
previous chapter are too time-consuming to be afforded in the chip implementation, and
we made the following improvements for efficiency purposes:
• Fast In-focus Detection Method: The computation of the blurring degree in the previous
in-focus detection algorithm is highly time-consuming; therefore, we use a "fast"
in-focus detection method [22] to speed up the computation, which side-steps the
expensive blurring degree calculation by using a minimum reliable scale.
• Mean-shift algorithm: To replace the expensive grab cut algorithm, we use the mean-shift
algorithm [20] to perform a fast segmentation, which avoids solving a
complicated optimization problem.
4.1 Improved Algorithm Overview
The overall improved algorithm for chip implementation is summarized in Fig.(4.1),
which generates 3D visual outputs from 2D video inputs.

In general, the 2D video inputs are first divided into many scene units, and each
unit consists of several highly-correlated frames. Let us consider the first scene unit,
which contains the first K frames in Fig.(4.1), as an example; the same procedure can be
repeated for the remaining units. Without loss of generality, assuming the first frame is the key
frame for this unit, the depth map for the first frame can be generated using the "still image
depth generation" in Fig.(4.1) and propagated to the remaining frames in the same unit
(i.e., the 2-nd to K-th frames).

We will present these techniques in detail in the following.
4.2 Still Image Depth Map Generation
The key problem in Fig.(4.1) is the still image depth map generation for a single frame,
which consists of three major stages, as illustrated in Fig.(4.2):

stage I: the foreground segmentation
Figure 4.1: The block-diagram of the improved depth map inference algorithm.
The first task is to segment out the foreground region. In brief, we first extract the
in-focus region using fast in-focus detection and calculate the corresponding salient map. Then,
the mean-shift algorithm (instead of the grab cut algorithm in Fig.(3.1)) is performed on
the obtained salient map to segment out the foreground region. Note that the background
region is obtained once the foreground is separated from the entire image.
stage II: background depth map generation
Similar to the algorithm in chapter 3, we generate the depth map for the background
region using a modified vanishing point method (Sec. 3.2.3.1).
stage III: foreground depth map generation
When the background depth map is available, the depth map for the foreground
objects can be determined according to their relative sizes and occlusion information.
Note that the depth map generation methods for both stage II and stage III are the
same as in the previously proposed algorithm of chapter 3, and thus we only present the new
techniques for foreground segmentation in this section.
Figure 4.2: still image depth generation flowchart.
4.2.1 Fast In Focus Region Detection
To avoid the expensive calculation of the blurring degree, we use the "fast" in-focus region
detection method [22] proposed by Elder and Zucker to obtain a rough estimation of the
foreground region.

From the previous chapter, we know that a blurred edge can be modeled as the convolution
between a step function and a Gaussian blurring kernel g(x,y,σ_b):

g(x,y,\sigma_b) = \frac{1}{2\pi\sigma_b^2}\, e^{-(x^2+y^2)/2\sigma_b^2}   (4.1)
By measuring the width of blurred edges, we can first precisely calculate the blurring
degree at edge pixels, and then propagate the blurring degree from edge pixels to regions
through an optimization process.

However, this process is highly time-consuming and not affordable in a chip implementation.
Therefore, the fast in-focus region detection uses the "minimum reliable scale" as an
alternative measure of the blurring degree, because the minimum reliable scale
is proportional to the blurring degree. In the following, we introduce the definition
and the calculation of the minimum reliable scale.
Given a blurred step edge along the y axis with amplitude A and blur parameter σ_b,
the gradient magnitude can be calculated by

r_{x_1}(x,y,\sigma_1) = A\,u(x) * g_{x_1}\!\left(x,y,\sqrt{\sigma_b^2+\sigma_1^2}\right) = \frac{A}{\sqrt{2\pi(\sigma_b^2+\sigma_1^2)}}\, e^{-x^2/2(\sigma_b^2+\sigma_1^2)}   (4.2)
According to [22], the maximum gradient response to a blurred edge decreases as the estimation
blurring degree increases. Thus, there exists a "minimum reliable scale" \hat{\sigma}_1 at which
the luminance gradient of a region can be reliably detected. The minimum reliable scale
for estimating the gradient of the edge is defined as the scale at which the edge response
just exceeds a threshold.

The relationship between the minimum reliable scale and the blurring degree of the edge
response model is demonstrated in Fig.(4.3). According to [22], the 2-nd minimum reliable
blurring scale is proportional to the blurring degree; thereby, we can use the minimum reliable
scale map to estimate the image blurring degree. In order to compute the 2-nd minimum
reliable scale, we need to compute the second derivative pixel response.

Figure 4.3: relation between minimum reliable scale and blurring scale.

The second derivative of the intensity function can be estimated with the second derivative of
Gaussian operators as:
g_{x_2}(x,y,\sigma_2) = \frac{1}{2\pi\sigma_2^4}\left(\frac{x^2}{\sigma_2^2}-1\right) e^{-(x^2+y^2)/2\sigma_2^2}

g_{y_2}(x,y,\sigma_2) = \frac{1}{2\pi\sigma_2^4}\left(\frac{y^2}{\sigma_2^2}-1\right) e^{-(x^2+y^2)/2\sigma_2^2}

g_{xy_2}(x,y,\sigma_2) = \frac{xy}{2\pi\sigma_2^6}\, e^{-(x^2+y^2)/2\sigma_2^2}   (4.3)
The expected output of the second derivative operator applied to the local edge model along x is:

r_{x_2}(x,y,\sigma_2) = A\,u(x) * g_{x_2}\!\left(x,y,\sqrt{\sigma_b^2+\sigma_2^2}\right)   (4.4)

= \frac{-Ax}{\sqrt{2\pi}\,(\sigma_b^2+\sigma_2^2)^{3/2}}\, e^{-x^2/2(\sigma_b^2+\sigma_2^2)}   (4.5)
Similarly, according to [22], the minimum reliable second derivative scale \hat{\sigma}_2 along the
maximum direction \theta_M can be further derived into a simple form:

r^{\theta_M}_{2}(x) = \frac{1.8\,s}{\hat{\sigma}_2^{3}}   (4.6)
where s is a variable depending on the sensor noise. We can now solve for the minimum
reliable scale \hat{\sigma}_2 and use it as a "fast" estimate of the blurring degree. We then simply
propagate the estimated minimum reliable scale to neighboring pixels with similar color
values. This algorithm is fast and requires only 16.32 KB to store the image,
since each image is down-sampled to 120×68 resolution. In our experiment, the computation
time, on a 32-bit Windows 7 system with an Intel Core 2 P8700 2.53 GHz CPU and
2 GB RAM, is less than 1 second.
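A rough sketch of how a per-pixel minimum reliable scale map could be computed from Eq. (4.6): evaluate second-derivative-of-Gaussian responses over a small set of scales and keep the first scale whose response exceeds the critical value 1.8s/σ₂³. The directional maximum is approximated here by the Hessian eigenvalue of largest magnitude, and the noise parameter s and the scale set are placeholders, not the values used on the chip.

import numpy as np
from scipy.ndimage import gaussian_filter

def minimum_reliable_scale(gray, s=2.0, scales=(0.5, 1.0, 2.0, 4.0, 8.0, 16.0)):
    """Per-pixel minimum reliable second-derivative scale: the smallest sigma_2
    whose response magnitude exceeds 1.8*s / sigma_2**3 (Eq. (4.6)). Pixels that
    are never reliable keep the largest scale; a larger minimum reliable scale
    indicates a larger blurring degree."""
    g = gray.astype(np.float64)
    out = np.full(g.shape, scales[-1])
    found = np.zeros(g.shape, dtype=bool)
    for sigma in scales:                                  # fine -> coarse
        rxx = gaussian_filter(g, sigma, order=(0, 2))     # d^2/dx^2
        ryy = gaussian_filter(g, sigma, order=(2, 0))     # d^2/dy^2
        rxy = gaussian_filter(g, sigma, order=(1, 1))     # d^2/dxdy
        # Directional maximum approximated by the Hessian eigenvalue of largest magnitude.
        mean = 0.5 * (rxx + ryy)
        root = np.sqrt(0.25 * (rxx - ryy) ** 2 + rxy ** 2)
        r_max = np.maximum(np.abs(mean + root), np.abs(mean - root))
        reliable = (r_max > 1.8 * s / sigma ** 3) & ~found
        out[reliable] = sigma
        found |= reliable
    return out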
4.2.2 Color Based Mean Shift Segmentation
As we mentioned before, the conventional grab cut algorithm has a very high computational
complexity and is not suitable for chip implementation. Instead, we use the mean shift
algorithm [20] to perform the foreground segmentation.

The mean shift algorithm is a procedure for locating the maximum of a density function
using its discrete samples. To find the local maximum, this algorithm defines the "mean
shift vector", which guarantees convergence from any point to its local maximum in the sample
space.
Given a set of n sample data points x_i, i = 1,...,n in the d-dimensional space, we can
obtain an estimate of the density at x using the multivariate kernel density estimator with
kernel function K(·):

\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} K(x-x_i)   (4.7)
For a special class of radially symmetric kernels, the kernel density estimate can be
rewritten as:

\hat{f}_{h,K}(x) = \frac{c_{k,d}}{nh^d}\sum_{i=1}^{n} k\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)   (4.8)

where h is a parameter of the kernel function k(·) indicating the kernel width, and c_{k,d} is the
normalization constant, which makes k(·) integrate to one.
To estimate the local maximum in the density space, the gradient estimator can be obtained
from (4.8):

\hat{\nabla} f_{h,K}(x) \equiv \nabla\hat{f}_{h,K}(x) = \frac{2c_{k,d}}{nh^{d+2}}\sum_{i=1}^{n}(x-x_i)\, k'\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)   (4.9)
Define g(x) = -k'(x); then we can rewrite this as:

\hat{\nabla} f_{h,K}(x) = \frac{2c_{k,d}}{nh^{d+2}}\sum_{i=1}^{n}(x_i-x)\, g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)   (4.10)

= \frac{2c_{k,d}}{nh^{d+2}}\left[\sum_{i=1}^{n} g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)\right]\left[\frac{\sum_{i=1}^{n} x_i\, g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}-x\right].
The second term is the "mean shift vector":

m_{h,G}(x) = \frac{\sum_{i=1}^{n} x_i\, g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}-x   (4.11)
From the equation above, it is indicated that the mean shift vector always points towards
the direction of maximum increase of the density. Given a point x, using the "mean shift
vector" to update its location, we have:

x_{\text{update}} = \frac{\sum_{i=1}^{n} x_i\, g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}, \quad j = 1,2,\dots   (4.12)
It can be proved that this iterative update converges to the density maximum [20]. To
apply the mean shift algorithm in our segmentation scenario, our procedure is as follows
(a sketch is given after the list):
• step 1: Place a mean shift window over each image pixel.
• step 2: Shift the window towards the density maximum using (4.12) until convergence.
• step 3: Track the windows that have been traversed, and merge them along the path.
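A compact sketch of this procedure under a Gaussian profile for g(·); the feature vectors (e.g., color plus position per pixel of a downsampled image), the bandwidth h, and the merging tolerance are illustrative assumptions.

import numpy as np

def mean_shift_mode(x0, samples, h=8.0, max_iters=20, tol=1e-3):
    """Iterate Eq. (4.12) from a starting point x0 toward the local density
    maximum over `samples` (n x d array), using a Gaussian profile for g(.)."""
    x = np.asarray(x0, dtype=np.float64)
    for _ in range(max_iters):
        g = np.exp(-np.sum((samples - x) ** 2, axis=1) / (2 * h ** 2))
        x_new = (g[:, None] * samples).sum(axis=0) / g.sum()
        if np.linalg.norm(x_new - x) < tol:        # converged to the mode
            return x_new
        x = x_new
    return x

def mean_shift_segment(features, h=8.0, merge_tol=1.0):
    """Run the window over every sample and merge windows whose modes coincide
    (step 3); returns an integer cluster label per sample."""
    modes, labels = [], np.empty(len(features), dtype=int)
    for i, f in enumerate(features):
        m = mean_shift_mode(f, features, h)
        for j, ref in enumerate(modes):
            if np.linalg.norm(m - ref) < merge_tol:
                labels[i] = j
                break
        else:
            modes.append(m)
            labels[i] = len(modes) - 1
    return labels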
After performing mean shift segmentation on the salient regions, we discard segment
pieces that expand across the salient object boundaries and keep the rest as foreground
segmentation pieces. We have tested the mean-shift algorithm on different image cases in
Fig.(4.4), and the homogeneous regions are grouped together into single regions.
Moreover, it usually takes fewer than 5 iterations for the mean-shift algorithm to converge.
As such, the computation time, on a 32-bit Windows 7 system with an Intel Core 2
P8700 2.53 GHz CPU and 2 GB RAM, is only 1.5 seconds on average.
Figure 4.4: mean shift segmentation result: (a) original image, (b) mean shift segmentation
result.
When the foreground is separated from the background, the vanishing point method
(see Sec.3.2.3.1) can be used to generate the depth map for the background and the
depth values for foreground objects can be assigned according to their relative sizes and
occlusion information as described in chapter 3.
4.3 Single Video Depth Map Generation
The depth map of the key frame cannot be copied directly to the remaining frames
in the same scene unit, because such a depth map does not consider the temporal
correlation or consistency over time, thereby leading to severe problems. For example,
when the depth values of the same object change dramatically along the frame sequence,
the copied depth map from the key frame can cause flickering of this object, which results
in an extremely unpleasant viewing experience.

To address this issue, the depth map of the key frame should be "propagated"
rather than copied to the other frames in the same unit according to the temporal
correlation of the video sequence. In fact, there are two methods to propagate the depth
map:
The first approach is similar to the algorithm proposed in chapter 3, which
propagates the depth map of the key frame and smooths the depth maps over time
using an additional post-processing step. This approach can solve the temporal inconsistency
problem of the depth maps, but requires expensive computation for video in
practice, thereby making it infeasible for chip implementation.

The second approach can fix this inconsistency problem with negligible computational
effort, by making use of the "motion vectors" in the compressed video format to generate
the depth maps for the non-key frames. We detail this approach below.
4.3.1 Motion based Depth Map Propagation
Typically, videos are compressed in a certain format to reduce the costs of data transmission
and data storage. In addition, compressed videos usually divide each frame
into many small blocks and maintain a "motion vector" for each block, which describes
the moving direction of the block in the next frame. As such, compressed
videos do not need to transmit all frames within the same scene unit, but recover
the entire scene unit from one key frame and many motion vectors.

Inspired by this observation, it is natural to make use of the "motion vectors" to propagate
the depth map of the key frame to the remaining frames within the same scene
unit, because these motion vectors exactly describe the temporal correlation between
neighboring frames. In this way, once the depth map of the key frame is obtained from the
"still image depth map generation" method, this depth map can be modified according to the
motion vectors to interpolate the depth maps for the other images along the frame sequence.
This procedure is illustrated in Fig.(4.5) and can be further elaborated as follows. Without
loss of generality, we can assume each scene unit of the compressed video stream contains
9 frames and take the first scene unit as an example. If the first frame
of this scene unit is the key frame, its depth map can be obtained from the still
image depth map generation as described before. Then, this depth map can be divided
into many small blocks, where each block has its corresponding motion vector. Since
one motion vector determines the location of a specific block in the next frame,
it is a natural choice to copy the depth map of one block in the key frame to its
destination in the next frame. As such, the depth maps for the remaining frames
can be generated with the aid of motion vectors.
Figure 4.5: The depth map propagation along video sequence using motion vectors.
The most important advantage of this approach is its extremely high efficiency, because
the motion vectors can be obtained directly from compressed videos (e.g., TV program
transmission, DVD, Blu-ray, etc.) and the propagation needs only simple arithmetic
operations. Therefore, this approach is very suitable for our chip implementation.
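A minimal sketch of the block-copy propagation, assuming a hypothetical motion_vectors array that stores one (dx, dy) displacement per block of the previous (key) frame; real codecs store motion vectors in their own layouts, so this only illustrates the arithmetic involved.

import numpy as np

def propagate_depth_by_motion(prev_depth, motion_vectors, block=16):
    """Copy each block of the previous frame's depth map to the location
    indicated by its motion vector in the next frame. `motion_vectors[by, bx]`
    is assumed to hold the (dx, dy) displacement of block (by, bx)."""
    h, w = prev_depth.shape
    next_depth = np.copy(prev_depth)             # fallback for uncovered areas
    for by in range(h // block):
        for bx in range(w // block):
            dx, dy = motion_vectors[by, bx]
            y0, x0 = by * block, bx * block
            y1 = int(np.clip(y0 + dy, 0, h - block))
            x1 = int(np.clip(x0 + dx, 0, w - block))
            next_depth[y1:y1 + block, x1:x1 + block] = \
                prev_depth[y0:y0 + block, x0:x0 + block]
    return next_depth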
4.4 Experimental Results
We have validated the improved algorithm in this chapter using a large set of test videos
from the industry. For illustration purposes, we first show several test images and their
depth map results; these images have different contents, such as human beings, animals
and environmental objects.
The comparison for the human being group is shown in Fig.(4.6), which includes one
outdoor mid shot (e.g., images showing the complete body of a human being) and one indoor
close-up shot of a human face (e.g., images showing part of a human being).

Figure 4.6: depth map results of human being image group: (a) original image, (b) generated
depth map.

It can be observed that the proposed algorithm can accurately recover the depth maps for both
cases. In particular, the two walking persons on the left are nearest to the audience and thus
have the largest depth value, because the in-focus region detection can segment these two
persons from the rest of the image. In addition, the trees are accurately separated from the
background, which is attributed to the depth map from vanishing point detection. On
the contrary, the close-up shot on the right has a very simple background that can be easily
separated from the foreground (human face), and thus there exists no vanishing point.
Therefore, the entire face has the same depth values.
The second group contains animal images of two different types: the left image
is more colorful and irregular in shape, while the right image has monotonous color
and a regular shape. The comparison results are shown in Fig.(4.7).

Figure 4.7: depth map results of animal image group: (a) original image, (b) generated depth
map.

On the left, many tentacles of the sea anemone are in the foreground region and are successfully
separated from the rest of the background. Hence, these tentacles are assigned large depth values,
which means they are closer to the viewers. On the right, the bird is captured by
the camera from a long distance and is in the foreground region. Clearly, this bird has
been segmented from the rest of the image and assigned a large depth value. These
images demonstrate that the image segmentation, including in-focus detection, salient
detection and the mean-shift algorithm, can effectively separate the foreground objects from
the background, which leads to accurate depth maps.
The third group includes some environmental objects, such as trees and airplanes, as
shown in Fig.(4.8). Similarly, we choose different types of images: the left image contains
a more complex background while the right image has a simple one.

Figure 4.8: depth map results of environmental object image group: (a) original image,
(b) generated depth map.

As can be observed, the left image has a vanishing point at the top and the depth values of
objects gradually increase from this vanishing point towards the audience. The corresponding
recovered depth map shown in the bottom-left of Fig.(4.8) is consistent with this observation.
In addition, the right image actually has no vanishing point due to the monotonous background;
therefore, the airplane in the foreground is the nearest to the audience and lies on a single
depth plane.
Moreover, in order to show the performance of the improved algorithm on 2D videos,
we show the extracted depth maps for neighboring frames of a temporal video
sequence in Fig.(4.9), where the first one is the depth map of the key frame and
is generated using the "still image depth map generation" method. The rest of the depth maps
are generated using the motion vector based propagation method. This figure shows
that, with the aid of motion vectors from the compressed videos, the depth maps from the
propagation method suffer from no flickering issue and are temporally smooth, which
validates the effectiveness of our proposed algorithm.
4.5 Conclusions
In this chapter we improved our previously proposed algorithm to produce the depth maps
for video sequences with chip implementation considerations. To avoid time-consuming
algorithms (e.g., in-focus region detection [59] and the grab cut algorithm [13]), we use a
fast in-focus region detection algorithm [22] and the mean shift algorithm [20] to provide
a significant complexity reduction while offering satisfactory depth maps. Moreover, a
novel depth map propagation method is also proposed to efficiently generate the depth
maps for entire video sequences according to the motion vectors in the compressed
videos. The improved algorithm has been verified on a large number of test videos from
the industry and provides satisfactory depth maps for most of them.
Figure 4.9: video depth map sequence.
Chapter 5
Salient Object Detection
In the previous chapters, we proposed two 2D/3D conversion systems designed for different
purposes. One interesting fact we found is that human eyes are particularly sensitive to
the salient object region of an input image. Thus, it is critical for us to detect the salient
object accurately and segment its boundary very well, as a small depth map error within the
salient object region will lead to intolerable visual distortion. Therefore, we further study
state-of-the-art salient object detection and segmentation algorithms in chapter 5 and chapter 6.
Recently, saliency detection, which aims to predict where humans look in an image, has
attracted a lot of interest. The accuracy of saliency detection is essential for many
human-vision based applications such as robotics, image segmentation and so on. In the
literature, many saliency detection models have been proposed based on bottom-up image cues,
top-down semantic cues and, naturally, the combination of these two categories of visual
cues. These models use various visual features including low-, middle- and high-level
features.
However, every single model has its own hypothesis and methodology focusing on
different aspects of human visual attention. Bottom-up models using low-level features are
biologically plausible and based on computational models [60]. Assuming that salient regions
are conspicuous either in color, intensity or orientation, Itti et al. [41] derived bottom-up
visual saliency using center-surround differences across multi-scale image features. Believing
that local patches of visual attention are highly dissimilar to their surroundings, Harel
et al. [32] built a Graph Based Visual Saliency (GBVS) model and measured the dissimilarity
among local patches using a Markov chain. Supposing that human eyes are attracted by local
image patches which appear less frequently in a natural image patch codebook, Bruce et al. [49]
proposed a saliency detection model by applying sparse representation to local image patches
(AIM). These models perform well qualitatively but with an apparent limitation, namely the
lack of high-level semantic cues.
On the other hand, several researchers shed light on saliency detection from a top-down
point of view. Assuming that human eyes are more attracted by objects, Chang et al. [37]
proposed an objectness saliency model based on object detection. However, the overall
performance of these models is low, especially at locating salient regions when no objects
are present [60]. As a result of their various assumptions, each model has its own most
suitable image category [1]. For example, by focusing on local dissimilarities, the Itti [41]
and GBVS [32] models fall short in detecting global visual attention, while AIM [49] shows
its weakness in catching local dissimilarities. On the other hand, top-down models highlight
the importance of high-level features (face, car, and text) for human visual attention, yet
fail to detect salient objects that have not been trained. Considering that no individual
model is able to fit all images, the fusion of different models has the potential to improve
the overall performance on large-scale image sets. To improve both bottom-up and top-down
models, several researchers proposed to combine low-level and high-level features. Cerf
et al. [43] showed improvement by adding a high-level factor, face detection, to Itti's model.
By adding more high-level features including faces, people and text, Judd et al. [60] developed
a saliency detection model which uses these high-level features in conjunction with other low-
and middle-level features to learn the best weights for all combined features using a Support
Vector Machine (SVM). However, these feature-level fusing models directly use features to
predict the salient object, which attenuates the power of the computational model.
Therefore, this methodology still has a large performance gap compared to human eyes.
To fill the performance gap between early feature-level fusion and human eyes and to broaden
the applicable image range, we propose to fuse state-of-the-art saliency models at the
score level. Here we treat the saliency detection problem from a higher-level angle: fusing
several expert models, which perform well on different aspects, at the confidence score level.
Specifically, para-boosting strategies, which fuse the outputs of several expert models at the
same time, are applied here. Furthermore, it has been proven that a similar object in different
conditions, such as size, will lead to a location shift of human eye fixations [44]. Thus, to
predict such complicated eye behavior, supervised learning using human eye fixation data can be
an effective methodology, and we learn to boost the prediction performance using eye fixation
ground truth. To study the influence of different learning choices, we further investigate
various learning techniques (SVM, AdaBoosting, PDE) and evaluate their prediction performance
in this scenario.
We claim three major contributions. Firstly, we propose to fuse the outputs of several
expert saliency models at the same time (named para-boosting). To our knowledge, few papers
have addressed score-level fusion for saliency detection models. Through exhaustive evaluation
over a benchmark database [60], we show that para-boosting at the score level outperforms the
state-of-the-art individual saliency detection models ([60],[10],[1]) and is closer to the human
inter-observer agreement [1] in visual attention. Secondly, several fusing strategies for
para-boosting, including transformation based and learning based fusion schemes (joint density
estimation based and classifier based), are proposed and compared in this chapter to shed light
on possible fusing directions; in particular, our proposed learning based schemes perform the
best among the different fusion schemes in our experiments. Thirdly, we investigate the role of
each individual model in the fusion. Experimental results show that individual models do not
play equal roles in the final decision, which makes it possible to fuse fewer models while
keeping similar performance. The corresponding experimental results show that the integration
of a few of the best models can outperform fusing all models.
5.1 Related Theory
Generally speaking, late fusion has several advantages. Firstly, as each model's output is a
probability value between 0 and 1, it provides a lower dimensional input to advanced
learning-based fusion methods compared to raw image features. This relatively low dimensional
feature vector can reduce the possibility of over-fitting [1]. Secondly, the accuracy of
current salient object detection models mostly depends on the training data and classification
models. As different models learn their predictions based on different feature aspects and
training data, fusing them at a late stage is able to improve the prediction accuracy over a
broader range of images [1].
Making use of the detection results of different recognition systems, score level fusion
strategies have already been broadly applied in a variety of biometric systems, and they can
further improve the detection/recognition performance compared to a single recognition
system [4]. Here we present a brief overview of current score level fusion methods in
biometric systems. In general, there are three different types of score level fusion strategies.

Transformation based Approaches: In the transformation score fusion approach,
each single score is first normalized to a common range for further combination. The choice
of the normalization scheme highly depends on the input data itself ([4],[54]). Kittler
et al. [33] discussed a fusion framework by evaluating the sum rule, product rule, minimum
rule, maximum rule, median rule, and majority voting rule in their work. In their proposed
scheme, scores are first converted into posterior probabilities through normalization. It has
been discovered that the sum rule outperforms the other rules in biometric system applications.
Classification based Approaches: In this type of scheme, the scores from individual
models are considered as feature vectors of a classifier, which is constructed to further
improve the detection accuracy ([27],[47]). Chen et al. [16] used a neural network classifier to
combine the scores from the face and iris recognition systems. Wang et al. [67] further
proposed to apply a classification-based algorithm based on SVM to fuse the scores.

Probability Density based Approaches: Probability density based score fusion
methods highly depend on the accuracy of the scores' probability density estimation. Well-known
probability models, such as naive Bayesian [52] and the GMM [36], are broadly applied.
Nandakumar et al. [36] proposed a score combination framework based on likelihood ratio
estimation. The input score vectors are modeled as a finite Gaussian mixture model. They show
that this density estimation method achieved good performance on biometric datasets with face,
fingerprint, iris and speech modalities.

Figure 5.1: illustration of state-of-the-art saliency detection model inaccurate cases: a)
original images; b) ground truth human eye fixation map; c) failed saliency object detection
models, from left to right (Itti [41], AIM [49], Judd [60]); d) score-level fusion results.
Compared to individual models, score-level fusion results are closer to human eyes' visual
attention.

Figure 5.2: Illustration of our fusion framework.
5.2 Proposed Score Fusion Strategy
In this chapter, we propose to employ state-of-the-art saliency detection models to improve
the overall performance using para-boosting. Since different models show their corresponding
capability in handling certain types of images, para-boosting will reduce the performance gap
between individual models and the Human Inter-Observer (IO) model. In the saliency detection
scenario, the saliency probability map output by each model is regarded as a single score
result, and a score level fusion schema is then applied to boost the saliency detection
performance by taking different saliency detection models' outputs into consideration.
Fig. 5.2 shows an illustration of our proposed para-boosting learning framework.
Specifically, we first retrieve the saliency maps of 11 state-of-the-art approaches, and
these saliency maps are considered as the input of our para-boosting system. Then, a final
saliency map is predicted based on the score-level fusion schema. To investigate the
effectiveness of this framework, two categories of score level fusion, including transformation
based and learning based schemes, are proposed here. It is worthwhile to note that the input to
the para-boosting system consists of only 11 saliency probability maps, which is a relatively
low-dimensional input vector compared to other combining methods ([1]).
5.2.1 Visual Saliency Map Features
Para-boosting by fusion at the score level has several advantages: firstly, the strength of
each individual model and the flexibility among models can be kept, since we fuse each model's
saliency map as a confidence score; secondly, the dependence among different models is
maintained well. For example, different saliency models may be dependent on each other if they
partially use the same features; no independence assumption is made when fusing scores.
Thirdly, the final prediction is generated in score space rather than feature space. In score
space, saliency maps are used to predict the final saliency map, which is more direct than
using features. Besides, it provides a lower dimensional input, which reduces the possibility
of over-fitting during learning. Therefore, fusing the models at the score level is able to
improve the prediction accuracy over a broader range of images.

Without loss of generality, following the discussions in [1], we choose 11 state-of-the-art
models (low-level feature based: Itti [41], GBVS [31], AIM [49], HouNips [70], HouCVPR
[71], AWS [3], SUN [42], CBS [26], SalientLiu [61]; high-level feature based: SVO [37],
Judd [60]) based on two principles: 1) the selected models must achieve good prediction
performance; 2) the features adopted by the selected models must cover a large range from
bottom-up to top-down features so that our final fusion strategy can adjust to different
image scenarios.
Low Level Saliency Maps. For features in the spatial domain, the Itti [41] model calculates
a saliency map using colors, intensity, and orientations at different scale levels; in
total there are 42 feature maps used in this model, while GBVS [31] models a saliency
map as an activation map which uses up to 12 feature maps as input. With regard to
features in the transform domain, the HouCVPR [71] and AWS [3] models compute the
saliency maps in Fast Fourier Transform (FFT) channels, and the input features are either
a single (gray image) map or 3 (RGB) maps. On the other hand, the AIM [49], HouNips
[70] and SUN [42] models train a natural image dictionary in either RGB or DoG channels,
then use the RGB maps (3 feature maps) or DoG maps (12 feature maps) as inputs to
the models. Furthermore, SalientLiu [61] employs a broad range of features from local
multi-scale contrast to global center-surround distributions. Similar to SalientLiu [61],
CBS [26] also applies many features, such as color superpixels, closed shape, etc.
Both of these two models use at least 3 feature maps as their input.

High Level Saliency Maps. SVO [37] tries to build a model that describes
objectness, and this model uses semantic object detection results as well as a saliency
prior as input. Furthermore, Judd [60] employs as many as 33 feature maps ranging
from low-level to high-level features to learn a saliency map.

These 11 models are able to cover a broad range of features and describe their
relationship to human eyes' visual saliency from different points of view. By using the
saliency map outputs of these models, we avoid high-dimensional feature vectors
while keeping similar prediction performance. In our proposed framework, the saliency
maps computed by the different models are eventually augmented into 11-D features and
fed to the fusion system explained in the next section.
5.2.2 Score Level Fusion Strategies
Here we propose three types of score-level fusion strategies for the para-boosting of
saliency detection: non-learning based approaches such as transformation based
fusion, and learning based approaches such as classification based and density based
approaches. These approaches fuse multiple score inputs from different points of view;
hence it is worthwhile to test them and compare their power in score fusion.
5.2.2.1 Transformation based Fusion
Score Normalization Rule. As indicated in [4], the transformation based rule involves
two steps: normalization and fusing rules. As shown in Fig. 5.9, each individual saliency
result we take as input is a probability map. Therefore, we adopt the normalization method
described in Eq. 5.1 to normalize each individual saliency map:

s = \frac{s'-\mu}{\sigma}   (5.1)

where μ and σ are the mean and standard deviation (STD) of the input s'. Without loss
of generality, here we adopt three different fusion rules that have been reported to perform
well in biometric recognition systems (a combined sketch follows the three rules below).
1) Sum rule: In this rule, the final score output is computed as the mean of the input score
sequence:

S = \mathrm{mean}(s_1,\dots,s_n)   (5.2)

where s_k denotes each individual score and S is the final score output.
2) Min rule: the final score is the minimum value among all input scores:

S = \min(s_1,\dots,s_n)   (5.3)

where s_k denotes each individual score and S is the final score output.
3) Max rule: contrary to the min rule, the final score output here is the maximum value among
all individual scores:

S = \max(s_1,\dots,s_n)   (5.4)

where s_k denotes each individual score and S is the final score output.
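A small sketch combining the normalization of Eq. (5.1) with the three rules (5.2)-(5.4); the rescaling of the fused result back to [0,1] is an added convenience, not part of the rules themselves.

import numpy as np

def zscore(s):
    """Eq. (5.1): normalize one saliency map to zero mean, unit std."""
    return (s - s.mean()) / (s.std() + 1e-8)

def fuse_scores(saliency_maps, rule='sum'):
    """Transformation-based para-boosting of N saliency maps (all HxW)."""
    stack = np.stack([zscore(m.astype(np.float64)) for m in saliency_maps])
    if rule == 'sum':       # Eq. (5.2)
        fused = stack.mean(axis=0)
    elif rule == 'min':     # Eq. (5.3)
        fused = stack.min(axis=0)
    elif rule == 'max':     # Eq. (5.4)
        fused = stack.max(axis=0)
    else:
        raise ValueError(rule)
    # Rescale to [0, 1] so the output is again a saliency probability map.
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)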
5.2.2.2 Classification based Fusion
For general testing purposes, here we apply two different kinds of classifiers: linear and
non-linear classifiers. Linear classifiers are usually faster in computation, while non-linear
classifiers are usually slower but more powerful. We built the training set by sampling
images at the eye fixation ground truth. Each sample contains the 11 individual saliency
probabilities at one pixel together with a 0/+1 label. Positive samples are taken from the top
p percent salient pixels of the human fixation map and negative samples are taken from
the bottom q percent. Following [60], we chose samples from the top 5% and bottom 30%
in order to have samples that were strongly positive and strongly negative. Each training
vector was normalized to have zero mean and unit standard deviation, and the same
parameters are used to normalize the testing data.
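A sketch of this sampling step, assuming the 11 model saliency maps and the human fixation density map are available as arrays of the same size; the function name and the returned normalization statistics are illustrative.

import numpy as np

def build_training_set(saliency_maps, fixation_map, top_p=0.05, bottom_q=0.30):
    """Sample strongly positive / strongly negative pixels from the human
    fixation map and collect the model scores at those pixels; returns (X, y)
    plus the normalization statistics to reuse at test time."""
    stack = np.stack([m.ravel().astype(np.float64) for m in saliency_maps], axis=1)
    fix = fixation_map.ravel().astype(np.float64)
    order = np.argsort(fix)                   # ascending fixation density
    n = len(fix)
    neg_idx = order[: int(bottom_q * n)]      # bottom 30% -> negatives
    pos_idx = order[-int(top_p * n):]         # top 5%     -> positives
    X = np.vstack([stack[pos_idx], stack[neg_idx]])
    y = np.concatenate([np.ones(len(pos_idx)), np.zeros(len(neg_idx))])
    # Normalize features to zero mean / unit std.
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-8
    return (X - mu) / sd, y, (mu, sd)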
SVM Regression. Here we trained two Support Vector Machine (SVM) classifiers
using both the liblinear and libsvm (non-linear) libraries, which are publicly available
Matlab versions of SVM. We adopted both the linear kernel (Eq. 5.5) and the non-linear
Radial Basis Function (RBF) kernel (Eq. 5.6), as they perform well in a broad range of image
applications. During testing, instead of predicting binary labels, we generate a label
whose value is within the range [0,1] so that the final output is a saliency probability
map.
K(x_i,x_j) = \alpha\cdot x_i^{T} x_j   (5.5)

K(x_i,x_j) = \exp\!\left(-\gamma\|x_i-x_j\|_2^2\right), \quad \gamma > 0   (5.6)

where α and γ are the parameters of the kernel functions.
AdaBoosting. To further investigate the capability of non-linear classifiers in fusion, we used the AdaBoost algorithm [10], which has been broadly applied in scene classification and object recognition. AdaBoost combines a number of weak classifiers h_t to learn a strong classifier H(x) = sign(f(x)), where

\[ f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \]

and α_t is the weight of the t-th classifier. Here, the number of weak classifiers T is set to 10 to balance speed and accuracy. Similarly, we use the real value f(x), rather than the sign of H(x), to create a saliency map. We used the publicly available software for Gentle AdaBoost and Modest AdaBoost.
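A minimal sketch of this boosting based fusion is given below. It uses scikit-learn's AdaBoostClassifier with decision stumps in place of the Gentle/Modest AdaBoost packages cited above, and reuses the hypothetical X_train, y_train, and X_test arrays from the SVM example.

```python
# Sketch of AdaBoost score fusion: T = 10 weak classifiers, and the real-valued
# score f(x) (not its sign) is kept as the fused saliency value.
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=10)   # 10 decision stumps as weak learners
ada.fit(X_train, y_train)

scores = ada.decision_function(X_test)      # signed confidence, i.e. f(x)
saliency = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)  # map to [0, 1]
```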
5.2.2.3 Density based Fusion
Naive Bayesian. Assuming independence among the different saliency models M_k, a Naive Bayesian accumulation model [1] can be built as in Eq. 5.7:

\[ p(x \mid M_1, M_2, \dots, M_K) \propto \frac{1}{Z} \prod_{k=1}^{K} p(x \mid M_k) \tag{5.7} \]

Here p(x|M_1, M_2, ..., M_K) is the final fusion probability for each pixel, and p(x|M_k) is the probability of each pixel observation from each individual model. K is the number of models and Z is a normalization factor. Since a very small value from a single model will suppress all other models, in practice we apply a modified Bayesian accumulation (Eq. 5.8) to damp the attenuation caused by very small values such as 0:

\[ p(x_f \mid M_1, M_2, \dots, M_K) \propto \frac{1}{Z} \prod_{k=1}^{K} \left( p(x_f \mid M_k) + 1 \right) \tag{5.8} \]
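As an illustrative sketch (assuming `maps` is a list of K equally sized probability maps in [0,1]), the damped accumulation of Eq. 5.8 can be written as follows; the "+1" keeps a single near-zero model response from suppressing all the others.

```python
# Sketch of (damped) Naive Bayesian accumulation over K individual saliency maps.
import numpy as np

def naive_bayes_fusion(maps, damped=True):
    stack = np.stack(maps, axis=0)            # shape (K, H, W)
    if damped:
        fused = np.prod(stack + 1.0, axis=0)  # Eq. 5.8: damp very small values
    else:
        fused = np.prod(stack, axis=0)        # Eq. 5.7: plain independence product
    return fused / fused.sum()                # 1/Z normalization of the fused map
```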
General Density Estimation. Without assuming independence among the salient object detection models, we propose to fuse the models based on the joint density estimation of their confidence outputs for the final saliency map [58]. First, we classify the training samples into two classes: non-salient (c_0) and salient (c_1). Each sample has a d-dimensional feature vector if there are d different models. Then, for both the non-salient class c_0 and the salient class c_1, we estimate the corresponding density function using Parzen-window density estimation:

\[ P(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right) \tag{5.9} \]

where n is the number of observations, h is the corresponding window width, and K(x) is a non-negative window (kernel) function in the d-dimensional space such that

\[ \int_{\mathbb{R}^d} K(x)\, dx = 1 \tag{5.10} \]

Finally, the likelihood ratio L = P(x|c_1)/P(x|c_0) is employed as the final confidence fusion score.

In this chapter, we use PRTools [55] to perform the Parzen-window density estimation. The toolbox uses a Gaussian kernel with an optimal smoothing parameter h estimated from the observations.

Figure 5.3: Average precision-recall and ROC curves of all saliency fusion strategies.
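A minimal sketch of this density based fusion is shown below. It substitutes scikit-learn's KernelDensity for the PRTools Parzen estimator, fixes the window width h instead of optimizing it, and assumes hypothetical arrays X_sal and X_nonsal holding the d-dimensional confidence vectors sampled from salient and non-salient pixels.

```python
# Sketch of Parzen-window density fusion: estimate p(x|c1) and p(x|c0) with
# Gaussian kernels (Eq. 5.9) and fuse by the likelihood ratio L = P(x|c1)/P(x|c0).
import numpy as np
from sklearn.neighbors import KernelDensity

h = 0.1  # window width (PRTools chooses an optimal h from the data; fixed here)
kde_sal = KernelDensity(kernel="gaussian", bandwidth=h).fit(X_sal)
kde_nonsal = KernelDensity(kernel="gaussian", bandwidth=h).fit(X_nonsal)

def likelihood_ratio(X):
    """Fused confidence: larger values mean the pixel is more likely salient."""
    log_ratio = kde_sal.score_samples(X) - kde_nonsal.score_samples(X)
    return np.exp(np.clip(log_ratio, -50.0, 50.0))   # clip to avoid overflow
```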
Table 5.1: Results of prediction performance in mean average precision (mAP) and AUC

fusion model          No. of models   mAP    AUC
Sum rule                   11         0.63   0.87
Sum rule                    3         0.62   0.88
Min rule                   11         0.71   0.73
Min rule                    3         0.68   0.87
Max rule                   11         0.47   0.81
Max rule                    3         0.53   0.86
linear SVM                 11         0.78   0.88
linear SVM                  3         0.78   0.87
non-linear SVM             11         0.60   0.85
non-linear SVM              3         0.61   0.85
Gentle adaboost            11         0.61   0.87
Gentle adaboost             3         0.63   0.88
Modest adaboost            11         0.65   0.88
Modest adaboost             3         0.66   0.88
Naive Bayesian             11         0.79   0.81
Naive Bayesian              3         0.69   0.88
density estimation         11         0.84   0.85
density estimation          3         0.79   0.87
Figure 5.4: Sample images from MIT Dataset
5.3 Experimental Results
A thorough evaluation of the different score fusion strategies is presented in this section. Here, we implemented the Human Inter-Observer (human IO) model for comparison purposes [10]. In this model, we estimate the quality of each subject's saliency map using the ground truth saliency map generated by all other subjects, and then average the individual measurement scores over all subjects.
5.3.1 Dataset
Here we use a benchmark dataset, the MIT dataset [60], with a broad range of images for the purpose of fair model comparison. It contains 1003 images collected from the Flickr and LabelMe datasets. The ground truth saliency maps are generated using eye fixation data collected from fifteen different human subjects. Several sample images from the MIT dataset [60] are shown in Fig. 5.4. This dataset covers many image scenarios, ranging from street views, human faces and various objects to synthesized patterns. Because of this, salient object models that perform well on this dataset can be expected to behave well over a large range of image types.
5.3.2 Evaluation Scores
For performance comparison, we report three different scores. A good fusion model should perform well on all of them.

- Precision-recall: The precision-recall (PR) curve is reported in our experiments. The final fused saliency map is a probability map with values in [0,1]. Thus, to compare with the ground truth eye fixation map, we generate a binary saliency map by comparing each value with a threshold. By varying the threshold within [0 : 0.1 : 1], different binary saliency maps can be produced. To avoid any bias toward a particular threshold, we calculate the average precision-recall curve over all 10 threshold values as our final score.

- ROC and AUC: We calculate the Receiver Operating Characteristic (ROC) and Area Under the ROC Curve (AUC) from the true positive rates and false positive rates obtained during the calculation of precision-recall. For illustration purposes, the ROC curve is drawn as false positive rate vs. true positive rate, and the area under this curve (AUC) indicates how well the saliency map matches human eye observations.
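For reference, a minimal evaluation sketch (not the original evaluation code) that computes these scores with scikit-learn, assuming a fused map `fused` in [0,1] and a binary ground-truth fixation map `fixation_gt`, could look as follows.

```python
# Sketch: average precision/recall over 10 thresholds plus the ROC AUC score.
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = fixation_gt.ravel().astype(int)
y_score = fused.ravel()

precisions, recalls = [], []
for t in np.arange(0.1, 1.01, 0.1):            # 10 thresholds: 0.1, 0.2, ..., 1.0
    y_pred = (y_score >= t).astype(int)        # binarize the fused saliency map
    precisions.append(precision_score(y_true, y_pred, zero_division=0))
    recalls.append(recall_score(y_true, y_pred, zero_division=0))

auc = roc_auc_score(y_true, y_score)           # area under the ROC curve
print("mean precision %.3f, mean recall %.3f, AUC %.3f"
      % (np.mean(precisions), np.mean(recalls), auc))
```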
5.3.3 Model Comparison and Results
Average Performance on MIT Dataset[60].
For fusion methods involving training, such as the classification based and general density based approaches, we followed a cross-validation procedure: the dataset [60] was divided into 5 parts, each with 200 images; each time we trained the model on 4 parts and tested it on the remaining part, and results were averaged over all partitions. We first compared the performance of each model, including the individual models and the fusion models.
Fig. 5.3 shows the Precision-Recall (PR) and ROC curves of different models over the MIT dataset [60]. The proposed fusion strategies, such as gentle adaboost (Gboost), modest adaboost (Mboost), linear SVM (LSVM), nonlinear SVM (NLSVM), Naive Bayesian (Bayes), sum, and general density (density), either outperform or strongly compete with any individual saliency model. In particular, linear SVM and modest adaboost (Mboost) proved to be the top performers among all the fusion strategies, and are the closest models to the Human IO model. This was expected, as we combine the confidence scores of different models to fit a broader set of images. Hence, the effectiveness of our score-level fusion is demonstrated here.
Among all score-level fusion models, the learning based methods such as Gboost, Mboost, LSVM, NLSVM and density outperform Bayes and Sum, which are non-learning fusion strategies. The reason is that both the Bayes and Sum fusion rules assume independence among the individual models. In practice, the independence assumption is hard to satisfy, as many saliency models share similar feature sets. Thus, a general integration strategy without the independence assumption, such as density, outperforms Bayes and Sum.

Note that not all fusion approaches outperform the individual models. For example, the min and max fusion strategies do not perform as well as several individual models. This is mainly because these fusion strategies introduce bias with respect to the different models' outputs. Specifically, max fusion has a strong bias toward the model with the maximum confidence score while neglecting the scores from other models. Moreover, the density fusion approach requires sufficient training data to obtain an accurate or reasonable density estimate, hence its performance is not as stable as that of the other learning based fusion approaches.
Model Choice and Comparison.
To explore how much each model contributes to the final score decision, we choose the 3 best-performing models among our 11 testing models: GBVS [31], SVO [37], and Judd [60].
Figure 5.5: Performance comparison of different model selections
Figure 5.6: Saliency map results of the individual models (Itti, GBVS, AIM, HouNips, HouCVPR, AWS, SUN, CBS, SVO, SalientLiu, Judd).
The PR and ROC curves comparing the fusion of all models with the fusion of these 3 top individual models are shown in Fig. 5.5. Generally speaking, non-learning based fusion is less consistent than learning based methods. For non-learning based methods such as Bayes, Min, Max and Sum, using the 3 best models outperforms fusing all models, because selecting the best 3 models excludes the dominant influence brought in by relatively poorly performing models. Unlike non-learning based fusion methods, learning based fusion techniques such as Gboost, Mboost, LSVM, NLSVM and density keep similar performance when fusing the top 3 models and all the models, as the learning procedure can block the influence of inferior individual models. Furthermore, this comparison indicates that a dimension reduction operation is possible while retaining the performance.

The mean average precision (mAP) and AUC of the 11-model and 3-model choices over the MIT dataset [60] are reported in Table 5.1. For the non-learning based approaches, significant AUC improvement (from 0.73 to 0.87 for the Min rule, from 0.81 to 0.86 for the Max rule, and from 0.81 to 0.88 for Naive Bayesian) is observed, while the learning based methods show no obvious difference. Besides, linear SVM (LSVM-11: mAP 78%, AUC 0.88; LSVM-3: mAP 78%, AUC 0.87) and Modest AdaBoost (MBoost-11: mAP 65%, AUC 0.88; MBoost-3: mAP 66%, AUC 0.88) are more stable than the other fusion methods in the sense of mAP and AUC variance.
For more direct illustration, fused saliency map examples of the 3-model and 11-model settings are shown in Fig. 5.7, and the corresponding individual model results are shown in Fig. 5.6 for comparison. The saliency map results further echo our previous analysis: learning-based fusion approaches, such as linear SVM (LSVM), non-linear SVM (NLSVM), Gentle AdaBoost (Gboost) and Modest AdaBoost (Mboost), are more stable in their saliency map output, while non-learning based fusion approaches show great differences in their saliency maps.

Figure 5.7: Score-level fusion results (columns: Gentle AdaBoost, Modest AdaBoost, linear SVM, non-linear SVM, Bayes, Min, Max, Sum, general density). For each image, the first row shows the original image and the fusion results of the top 11 models, and the second row shows the ground truth and the fusion results of the top 3 models.
Model Consistency on Image Samples.
To further investigate the fusion models' strengths and weaknesses, in Fig. 5.8 we illustrate the images on which our fusion models agree most and least with the Human IO model. It can be seen that the fusion models work well when there is a clear salient object, and fall short on images that do not contain a well defined visual attention center. This is reasonable, as different models may report different observations in the latter case, hence fusing them may not produce a more focused salient object area in the final saliency maps.
Table 5.2: AUC ranking of the example images in Fig. 5.9 (LSVM: linear SVM, MBoost: Modest adaboost)

image number (top to bottom)   img 1   img 2   img 3   img 4   img 5    img 6   img 7
rank 1                         Bayes   Bayes   Judd    Bayes   MBoost   GBVS    LSVM
rank 2                         SVO     LSVM    MBoost  SVO     Judd     LSVM    Judd
rank 3                         GBVS    Judd    LSVM    LSVM    LSVM     Bayes   MBoost
rank 4                         LSVM    GBVS    GBVS    MBoost  GBVS     Judd    GBVS
rank 5                         Judd    SVO     Bayes   GBVS    Bayes    MBoost  Bayes
rank 6                         MBoost  MBoost  SVO     Judd    SVO      SVO     SVO
Besides, sample saliency maps from the 3 best individual models and the 3 top fusion methods are shown in Fig. 5.9, and the corresponding model rankings are shown in Table 5.2. Without loss of generality, the sample images cover image types ranging from street scenes and people to simple patterns. It can be seen that our proposed approach dominates the other models across different image types. Saliency maps resulting from the fusion models are more concentrated and focused compared with the results from each individual top model. According to the AUC and ROC curves reported for each individual image, the fusion results are closer to the ground truth, which indicates that our score-level fusion models most reduce the gap between currently proposed models and the Human IO model.
5.4 Conclusions
Here we proposed different score-level fusion strategies, including transformation based, classification based, and probability density based fusion schemes, applied to state-of-the-art saliency detection models. Experimental results indicated that our score-level fusion strategies exceed state-of-the-art models and are, so far, the closest to the ground truth labeled by human eyes. Furthermore, through extensive comparison, we showed that our proposed fusion schemes keep good performance over a broad range of images, and enrich the range of applicable images by fusing different individual saliency models.

In this chapter, we have discussed the performance of fusing the best individual saliency models. For future work, we would like to explore the possibility of adaptively selecting individual models for better performance. Furthermore, given one fusion strategy, it would be valuable to analyze its weaknesses and how its performance can be improved by fusing in new individual saliency models.

Figure 5.8: Performance consistency over different image cases: accuracy of the fusion models (linear SVM, Modest AdaBoost, Bayes) over the least and most consistent images, with the top-2 and bottom-2 images shown together with their AUC scores.
Figure 5.9: Examples of salient object detection results. For each image, columns 3 through 5 show the results of the top 3 individual models (GBVS, Judd, SVO), columns 6 through 8 show the salient objects detected by the proposed fusion strategies (Bayes, LSVM, MBoost), and columns 9 and 10 show their ROC curves and AUC scores. (GT denotes the ground-truth eye fixation map.)
Chapter 6
Salient Object Segmentation
Recently, salient object detection has attracted a lot of interest in computer vision, since it can serve as an important pre-processing step in many complex problems such as object recognition, image retargeting, and image compression [1]. In these applications, it is often necessary to locate and segment the most salient object in the scene. Several object detection and image segmentation methods have been proposed independently. Salient object segmentation models combine the best of both worlds by first finding the most interesting scene object and then segmenting its area.
The majority of existing saliency models have focused on detecting local image dissimilarities and producing a real-valued saliency map. Itti et al. [41] derived bottom-up visual saliency using center-surround differences across multi-scale image features. Harel et al. [31] used Markov chains and a measure of dissimilarity to achieve efficient saliency computation with their Graph Based Visual Saliency (GBVS) model. Goferman et al. [57] modeled local low-level clues and visual organization rules to highlight salient objects along with their contexts. However, due to the lack of shape/contour information, these models and their variants cannot provide accurate information about salient object boundaries.
Other researchers have tried to incorporate shape prior information into saliency detection frameworks and generate saliency maps with improved salient object boundaries. Assuming a shape prior for visual attention, Jiang et al. [26] detected simple salient regions using contour energy computation. Other researchers went further in predicting salient object boundaries by applying different graph theories in their models. Meharani et al. [46] applied Graph Cut post-processing to further refine salient object boundaries. However, as a generative prediction model, Graph Cut struggles to provide high-accuracy segmentation results. Discriminative graph learning models, such as CRF, are able to remove outliers and provide higher accuracy [14]. Liu et al. [61] used a Conditional Random Field (CRF) to learn Regions of Interest (ROI) using three features: multi-scale contrast, center-surround histogram, and a color spatial-distribution feature. Since Liu et al.'s model employs high-dimensional input features and does not make full use of high-level scene information, it generates boundaries that suffer greatly from local variance in color and illumination. This leads to inferior performance in providing correct object boundaries.
We introduce two major contributions in this chapter. First, we train a CRF to combine the output maps of several state-of-the-art saliency models and build a strong model. Since these models are based on different, uncorrelated measures, combining them helps to better locate the region of the salient object. Second, we evaluate models from two different perspectives: region and contour measurement. Almost all researchers have employed region measurement (i.e., F-measure) as the standard performance score. But detecting a salient object is not just about segmenting its region; it is also important to capture the object boundary. A model could achieve high region overlap accuracy while having poor boundary detection accuracy. We illustrate this in Fig. 6.1 for a sample image and two segmentation maps. While the two segmentations have about the same area overlap with the ground truth annotation (~0.9 F-measure), the middle one resembles the human annotation much better and captures the boundary better. Therefore, it is necessary to consider boundary accuracy in modeling saliency.

Figure 6.1: An example saliency map with a high region overlap score but a poor boundary (F-measure 0.92, CM 106.42), and another saliency map with a slightly lower (or equal) region score but a better boundary score (lower CM; F-measure 0.91, CM 2.07) that resembles human perception of objects better [48]. This means that F-measure (or any region score) alone is not enough to evaluate the accuracy of a model.
The overall framework of our model is shown in Fig. 6.2. Low-level and high-level feature maps extracted from several training images, together with pixel ground-truth labels, are used to train a CRF. For a testing image, these features are calculated, passed to the CRF, and finally a real-valued saliency map is computed. In Sec. 6.1, we describe our CRF model in detail. In Sec. 6.2 we exhaustively compare our model against state-of-the-art models using both region overlap (error rate and F-measure) and boundary accuracy (Contour Mapping [48]) scores. Sec. 6.3 concludes this chapter.
6.1 Proposed CRF-based Segmentation Model
We formulate the salient object segmentation problem as a binary labeling problem in
a graph. Different graph models can be applied to address this labeling problem. Here,
we exploit the CRF since it has several advantages over other graph models such as Markov Random Fields (MRF) [45] and Graph Cuts (GC) [11]. Firstly, it can use arbitrary low-level or high-level features as its input, because there is no independence assumption on the input features. Secondly, as the CRF applies effective discriminative learning, it reduces classification outliers [8], thus resulting in smoother boundaries compared with MRF and GC.
Fig. 6.2 shows an illustration of our proposed framework. Firstly, we extract local features, including bottom-up saliency (derived from other saliency models), gradient histograms, and pixel position. Then, to make use of global image information, we further train a regression map using a Random Forest (RF) [12] and use it as another input feature of our CRF graph. In fact, we use the RF to map the above-mentioned low-level features to binary labels, and the output is a real-valued saliency map (see Sec. 6.1.1). Through the training procedure, the CRF can optimally combine these features to minimize the segmentation/labeling cost on the training data. Finally, we predict the segmentation labels of test images using our trained graph, where the label with the maximum prediction probability is chosen for each pixel.
In this CRF graph, each node represents an image pixel, and is connected to its 4-
connected neighbors. We then associate a class label with each node (1 is the salient
object, 0 is the background). The energy function of our CRF is defined in Eq. 6.1:
\[ E(A) = \sum_{x} \sum_{k=1}^{K} \lambda_k F_k(a_x) + \sum_{x, x'} S(a_x, a_{x'}) \tag{6.1} \]
Figure 6.2: The diagram of our proposed CRF-based salient object detection framework. Low-level features (saliency models 1-N, HoG, position) are extracted from the training images, a Random Forest regression provides a high-level feature, and a two-class segmentation CRF is learned from the training images' binary labels; for a testing image the same features are extracted and evaluated by the trained CRF to predict the result.
where λ_k is the weight of the k-th feature, and x, x' are two adjacent pixels. F_k(a_x) indicates the probability of assigning label a_x to pixel x.

The goal of learning is to minimize the energy (cost) E(A) of the label assignment a_x to each pixel x. This energy function has two terms: the first term measures the cost of each labeling assignment; the second term measures the total distance between nearby pixels with different labels in feature space, which imposes homogeneity and smoothness on the local labeling. The first term, the salient object indication function F_k(a_x), is defined as follows:

\[ F_k(a_x) = \begin{cases} f_k(x) & a_x = 0 \\ 1 - f_k(x) & a_x = 1 \end{cases} \tag{6.2} \]
To avoid the dominance of any feature type in the optimization, each input feature f_k(x) is linearly normalized into [0,1]. When pixel x is located within the salient object region, its saliency feature response f_k(x) will be high, so F_k will have a lower value when a_x = 1, which leads to a higher probability of being labeled as 1 through the optimization process. The second term, the pairwise smoothness cost S(a_x, a_{x'}), models the spatial relationship between two adjacent pixels:
\[ S(a_x, a_{x'}) = |a_x - a_{x'}| \cdot \exp\!\left(-\beta\, d_{x,x'}\right) \tag{6.3} \]

where

\[ d_{x,x'} = \left( \sum_{k=1}^{K} \big(f_k(x) - f_k(x')\big)^2 \right)^{1/2} \]

is the L2 norm of the input feature difference.
This smoothness cost is a penalty term applied when adjacent pixels are assigned different labels. The more similar the features of two pixels, the less likely they are to be assigned different labels. With this pairwise constraint, the homogeneous interior area of the salient object is grouped into the same region.
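To make the energy function concrete, the sketch below evaluates Eqs. 6.1-6.3 for a given labeling; it assumes a hypothetical (K, H, W) array of feature maps already normalized to [0,1] and an (H, W) array of 0/1 labels, and does not include the inference step (the TRW message passing described in Sec. 6.1.2).

```python
# Sketch: CRF energy of Eqs. 6.1-6.3 for a candidate 0/1 labeling.
import numpy as np

def crf_energy(features, labels, lam, beta=1.0):
    """features: (K, H, W) in [0, 1]; labels: (H, W) of 0/1; lam: (K,) weights."""
    labels = labels.astype(int)
    # unary term, Eq. 6.2: F_k = f_k(x) when a_x = 0, and 1 - f_k(x) when a_x = 1
    F = np.where(labels[None] == 0, features, 1.0 - features)     # (K, H, W)
    unary = np.tensordot(lam, F, axes=1).sum()                    # sum_x sum_k lambda_k F_k

    # pairwise term, Eq. 6.3, over 4-connected neighbors (down and right)
    pairwise = 0.0
    for axis in (0, 1):
        la, lb = labels, np.roll(labels, -1, axis=axis)
        fa, fb = features, np.roll(features, -1, axis=axis + 1)
        d = np.sqrt(((fa - fb) ** 2).sum(axis=0))                 # L2 distance d_{x,x'}
        s = np.abs(la - lb) * np.exp(-beta * d)
        s = s[:-1, :] if axis == 0 else s[:, :-1]                 # drop wrap-around pairs
        pairwise += s.sum()
    return unary + pairwise

# usage: E = crf_energy(feats, candidate_labels, lam=np.ones(feats.shape[0]))
```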
6.1.1 Features
Low Level Features. Here we mainly use the saliency map outputs from different bottom-up saliency models as our low-level features. Using saliency maps as features has several advantages. Firstly, as each model's output is a saliency probability value between 0 and 1, it provides a low-dimensional input for advanced learning-based methods. The relatively low-dimensional feature vector reduces the possibility of over-fitting. Secondly, the accuracy of current salient object detection models mostly depends on the training data and classification models. As different models are based on different feature aspects and characteristics of the training data, using them as input features helps improve the prediction accuracy over a broader range of images.
We chose 10 state-of-the-art saliency models: ITTI [41], GBVS [31], AIM [49], HouNips [70], HouCVPR [71], AWS [3], SUN [42], CBS [26], SalientLiu [61], and SVO [37]. Our selection of models was based on two principles: 1) a model must achieve good prediction performance, and 2) the features adopted by the selected models must cover a large range of possible features so that our strategy can adapt to different image scenarios (i.e., feature diversity). Due to space limitations, we do not explain the details of these models; the interested reader is referred to [1].

Position (the x and y coordinates of each pixel) is another surprisingly good feature. Indeed, salient objects are usually located close to the central part of an image (a phenomenon called center-bias or photographer-bias), and many models have taken advantage of this for saliency detection. Hence we use the position coordinates as another low-level feature indicating local saliency. Furthermore, we use the Histogram of Gradients (HoG) [18] feature to capture pixel-wise saliency information. It has been shown [71] that salient regions tend to have a large variance in local pixel patches, making HoG a useful feature for saliency description. The HoG descriptor has a few key advantages over other descriptors. Firstly, HoG is a dense descriptor, so each pixel can have its own descriptor. Secondly, since the HoG descriptor operates on localized patches, it is invariant to geometric transformations. Thus, we adopt HoG as a robust low-level feature to describe local salient objects.
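As a minimal sketch of how these low-level features might be assembled into per-pixel vectors (not the original feature extraction code; the saliency maps are assumed to be resized to the image resolution and a per-pixel HoG response map `hog_map` is assumed to be precomputed):

```python
# Sketch: stack bottom-up saliency maps, normalized pixel position, and a
# precomputed per-pixel HoG response into a (K, H, W) low-level feature array.
import numpy as np

def build_low_level_features(saliency_maps, hog_map):
    H, W = saliency_maps[0].shape
    ys, xs = np.mgrid[0:H, 0:W]
    pos_x = xs / float(W - 1)          # normalize position to [0, 1] so it is
    pos_y = ys / float(H - 1)          # comparable with the saliency scores
    return np.stack(list(saliency_maps) + [pos_x, pos_y, hog_map], axis=0)
```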
High Level Features. One of the most important keys to good segmentation is global image information. Indeed, several image classification techniques use global image context as an essential ingredient for good classification [50]. Here, to boost the segmentation accuracy, we feed a single "high-level" feature into our CRF model. This feature is derived from the low-level features described above.

Specifically, we use the output of an advanced classifier to estimate the location of the salient object. Several classifiers are potentially good choices. We chose RF here as it is fast and its classification performance is as good as that of state-of-the-art learning methods [12]. A typical random forest is a collection of weak classifiers h(x; θ_k), k = 1, ..., K, where x represents the observed input feature vector and θ_k is the parameter of tree k, independent of the others. The classification result is obtained by majority voting among all weak classifiers given the same input feature x. As each classifier's input features are the same, the RF classification procedure can be done in parallel and is hence faster than cascade classifiers such as AdaBoost [50]. We have tested using an AdaBoost classifier instead of RF in our CRF framework and observed no evident improvement in performance; hence we use RF here for the above reasons. Specifically, to avoid the binary prediction errors of the RF classifier accumulating in our final segmentation result, we use the prediction probability value instead of 0/1 labels (i.e., a saliency map).
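A minimal sketch of this high-level feature is given below, assuming the feature stack `feats` from the previous sketch and a binary ground-truth mask `gt` of one training image (in practice the forest would be trained over all images of the training fold):

```python
# Sketch: random forest mapping per-pixel low-level features to a single
# high-level saliency probability map (probabilities, not hard 0/1 labels).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

K, H, W = feats.shape
X = feats.reshape(K, -1).T            # (H*W, K): one feature row per pixel
y = gt.ravel().astype(int)            # 0/1 ground-truth label per pixel

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf.fit(X, y)

high_level_map = rf.predict_proba(X)[:, 1].reshape(H, W)   # salient-class probability
```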
6.1.2 Learning
To obtain an optimal linear combination of features, the goal of CRF learning is to estimate the linear weights under the Maximum Likelihood (ML) criterion. Given N training samples {x, a}_{n=1}^{N}, the optimal λ* is chosen as follows:

\[ \lambda^{*} = \arg\max_{\lambda} \sum_{n} \log P(a \mid x; \lambda) \tag{6.4} \]
Table 6.1: Model rankings over different salient object datasets.

Dataset / Score   ITTI   GBVS   AIM    HouNIPS  HouCVPR  AWS    SUN     CBS    SVO    SalientLiu  Meharani  Ours
ASD Error         64.38  52.91  28.33  82.35    99.30    67.82  81.28   22.76   6.10    7.76        7.58     4.32
ASD F              0.45   0.56   0.47   0.28     0.03     0.44   0.27    0.79   0.69    0.79        0.71     0.87
ASD CM            50.35  39.58  72.20  78.92   123.51    59.23  76.53   21.46  37.06   20.82       23.77    11.38
SED Error         50.12  46.19  19.47  50.46    64.22    50.95  52.39   37.12  26.51   31.79       15.56    12.09
SED F              0.31   0.32   0.33   0.30     0.23     0.30   0.29    0.34   0.35    0.35        0.36     0.58
SED CM            70.71  68.90  69.60  73.48    82.08    73.31  74.49   67.92  67.69   66.87       49.52    26.64
SOD Error         69.27  67.25  32.23  85.84    99.48    76.81  86.14   62.81  26.80   54.58       20.12    18.95
SOD F              0.32   0.32   0.33   0.25     0.21     0.29   0.21    0.44   0.55    0.49        0.58     0.61
SOD CM            74.74  72.20  76.51  88.15   103.67    82.52 121.92   79.21  59.83   61.47       61.38    50.21
Here, exact computation of the marginal distribution P(a|x; λ) is intractable. However, the pseudo-marginal (belief) computed by belief propagation can be used as a good approximation [35]. By applying tree-reweighted belief propagation (TRW), an upper bound of the marginal distribution becomes tractable, and the optimization is finally achieved through iterative message updates within our CRF graph [61].
6.2 Experimental Setup
Compared Models: We compared our approach with 11 state-of-the-art saliency models: ITTI [41], GBVS [31], AIM [49], HouNips [70], HouCVPR [71], AWS [3], SUN [42], CBS [26], SalientLiu [61], SVO [37], and Meharani [46]. Previous benchmarks [2, 1] have shown the superior fixation prediction and saliency detection accuracy of these models.
Cross Validation: We follow a cross-validation procedure for training CRF. In our
experiments, each dataset is split into 5 folds. The CRF learned over 4 folds is applied
to the remaining one, and results are averaged over all 5 folds.
6.2.1 Datasets
We employ three benchmark salient object datasets to evaluate our model.
Figure 6.3: Illustration of salient object boundary detection on the ASD, SED, and SOD datasets (columns: image, GT, AWS, GBVS, CBS, SalientLiu, SVO, Meharani, proposed method; GT marks the ground truth). Our model captures both object regions and boundaries better than existing models. The corresponding F-measure and CM scores are shown under each image, in that order. A good model should be high on F-measure and low on CM score.
ASD: This dataset contains 1,000 images from the MSRA dataset [53]. Salient objects and boundaries have been manually annotated within the user-drawn rectangles (from the original dataset) to obtain binary masks.
SED: This dataset contains two parts [56]. The first one, the single object dataset (SED1), has 100 images containing only one salient object, similar to ASD. In the second one, the two objects dataset (SED2), there are two salient objects in each image (100 images). Our purpose in employing this dataset is to evaluate the accuracy of models over more complex stimuli (i.e., multiple salient objects). Here, we have merged the two parts into one dataset.
SOD: This dataset is a collection of salient object boundaries based on the Berkeley Segmentation Dataset (BSD) [48]. Seven subjects were asked to choose the salient object(s) in 300 images. This dataset contains many images with several objects, which makes it challenging for saliency models. Because there are several annotators, we can compute the Inter-Observer model on this dataset.
6.2.2 Evaluation Scores
We use three scores to measure accuracy. The first two scores (error rate and F-measure) evaluate accuracy in predicting salient regions (i.e., the overlapping area between predicted salient regions and the ground truth). The third score, Contour Mapping (CM), evaluates accuracy in predicting the object boundary.
Error Rate: the fraction of pixels at which the binary saliency map differs from the ground-truth annotation. We use this score to make our results directly comparable with Meharani et al. [46].
F-Measure: For each model, we first calculate precision-recall (PR) by thresholding the saliency map and generating a binary saliency map. Note that the outputs of both our method and Meharani [46] are binary (either 0 or 1), so the corresponding precision-recall over different thresholds is constant. We report the classic F-measure defined as

\[ F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
Contour Mapping: Third, for the first time, we compare models on their ability to predict boundaries. We use the score designed in [48] for comparing two boundaries. According to [48], the contour mapping measure (CM) is the normalized mapping distance between two boundaries A and B (model vs. human):

\[ CM(A, B) = \frac{1}{|T|} \, \delta([A], [B]) \]

where T is the trace corresponding to the optimal mapping sequence and |T|, the size of the trace, is the number of mapped point pairs. Moreover, δ([A],[B]) is defined as the minimum accumulated mapping distance from contour points on A to contour points on B through cyclic shifts. This can be computed by standard dynamic programming [48]. Note that a smaller CM distance means the two contours are more similar to each other.
6.2.3 Results
Model Comparison: We first compare the overall performance (average score over all images of a dataset) of our model and the other compared models in Table 6.1. Note that the scores can be calculated directly for the Meharani model and our model, as their outputs are binary (i.e., no threshold is needed). For the other models, with real-valued maps, we use a naive threshold (0.5) to binarize the saliency map and compute the scores. It is reasonable to use a constant threshold since, in real applications, optimal threshold selection usually cannot be done because the ground truth is unknown. An ideal model should have a low error rate, low CM, and high F-measure. From Table 6.1, it is clear that our proposed model consistently outperforms the other models over all 3 datasets on all three scores. Therefore, our model is able to predict both salient regions and boundaries, and is closest to human performance. Sample segmentation results from the 7 top-performing models (as indicated in Table 6.1) are shown in Fig. 6.3. Our model produces the most accurate boundaries among all models. The Meharani, CBS, and SVO models also perform well on the CM score because they explicitly take shape prior information into account in saliency detection. To compare models in more detail, in Fig. 6.5 we show images where models fail. It can be observed that our model degrades more gracefully than the other models: for these images, our model achieves higher F-measure and lower CM scores. Thus, our model is consistent and robust under different scenarios.

Figure 6.4: CM and F-measure scores obtained by sweeping the saliency map threshold over the ASD, SED, and SOD datasets, averaged over all images.
Figure 6.5: Worst-case analysis on the ASD, SED, and SOD datasets (columns: image, GT, AWS, GBVS, CBS, SalientLiu, SVO, Meharani, proposed method). Sample images where models fail in predicting the object boundary; compared with the other models, our model degrades more gracefully.
Figure 6.6: CM vs. F-measure results over the ASD, SED, and SOD datasets (models: itti, gbvs, AIM, HouNips, HouCVPR, AWS, SUN, CBS, SVO, SalientLiu, Meharani, the proposed method, and human IO on SOD). Each point shows the average score over all thresholds and images.
Analysis of Map Thresholding: To further examine the effect of map thresholding on overall performance, we compare models by thresholding saliency maps from 0 to 1. The corresponding CM and F-measure results are shown in Fig. 6.4. Note that since both our model and Meharani [46] generate binary maps, their performance over different thresholds is a straight line. It can be observed that although the scores vary with the threshold, our proposed method keeps its leading performance over the other models at all possible thresholds. One interesting observation is that each model has its own best threshold value and this value depends on the dataset used. This means that it is usually difficult to choose a threshold that suits all images. On the other hand, through effective learning, our CRF model succeeds in maintaining good performance without the need for threshold adjustment.
It is now clear that CM and F-measure show model performance from two different perspectives. In Fig. 6.6, we plot the performance of each model in the two-score plane. Each point in this plot is the average score over all thresholds and images. It can be seen that our model keeps good performance in both CM and F-measure, followed by the Meharani, CBS, and SVO models. Among all the models, HouCVPR shows the worst performance, probably because it applies an FFT transform, so the resulting saliency regions and boundaries are usually not continuous. Furthermore, as the SOD dataset contains ground-truth labels from different human subjects, we can compute the human IO score (i.e., human agreement in labeling object boundaries) as described in [1]. The Human IO model achieves the best results on both scores. It can be seen that our model has the score closest to humans on the SOD dataset.
Analysis of Center-bias: To make a fair model comparison, here we compare models while directly accounting for center-bias. Because objects are usually framed at the center of the image (a.k.a. photographer bias), the center-bias phenomenon exists in current datasets. As a result, a trivial central Gaussian blob outperforms almost all models [2, 1]. A good model should be able to locate salient objects both when they appear near image boundaries (off-center case) and when they appear at the image center (center case). For this analysis, we manually chose 50 images with the object at the center and another 50 images with objects off-center from the three datasets.
Figure 6.7: Sample images with salient objects in-center and off-center from the three datasets (columns: image, GT, AWS, CBS, GBVS, SalientLiu, SVO, Meharani, proposed method).

Figure 6.8: Center-bias analysis. Our model is better able to detect in-center and off-center salient objects, in terms of both region and boundary.
The testing results of the models are shown in Fig. 6.8. It can be seen that for some models, such as CBS, HouCVPR, and AWS, performance differs considerably between the center and off-center cases. These models are thus less stable across cases. In contrast, our model works well in both the center and off-center cases. Model maps for some sample images with salient objects in-center and off-center are shown in Fig. 6.7.
6.3 Conclusions
The CRF has been proven to be a useful tool for many vision problems such as semantic object classification, but few works have considered it for salient object detection. Here we employ a CRF for the purpose of salient object segmentation. Furthermore, we propose to use a contour measurement to estimate the quality of salient object boundaries and to compare models. Extensive experimental results show that our proposed method has a strong capability in locating salient object regions and capturing object boundaries through discriminative learning with our CRF model. Compared with 11 state-of-the-art saliency models, our proposed model is shown to be, so far, the closest to human annotation performance.
Chapter 7
Conclusion and Future Work
7.1 Summary of The Research
In Chapter 3 of this dissertation, we propose an effective algorithm to recover depth information from a single still 2D image and propagate the depth information to the entire video sequence. This approach utilizes several advanced techniques such as in-focus detection [59], saliency computation [32], the grab cut algorithm [13] and the belief propagation method [72]. Extensive experiments on a large set of industrial test cases have validated the accuracy and efficiency of the proposed algorithm.
In Chapter 4, we improve our previously proposed algorithm to produce depth maps for 2D video sequences under chip implementation constraints (e.g., limited memory storage space and real-time requirements). In particular, to avoid time-consuming algorithms (e.g., in-focus region detection [59] and the grab cut algorithm [13]), we use a fast in-focus region detection algorithm [22] and the mean shift algorithm [20] to provide a significant complexity reduction while still offering satisfactory depth maps. Moreover, a novel depth map propagation method is proposed to efficiently generate the depth maps for entire video sequences using the "motion vectors" in standard compressed video formats (e.g., MPEG and H.264). The improved algorithm has been verified on a large number of test videos from industry and provides satisfactory depth maps for most of them.
To detect salient objects more accurately, in Chapter 5 we proposed several score-level fusion strategies, including transformation based, classification based, and probability density based fusion schemes, to fuse several state-of-the-art saliency detection models. Experimental results indicate that some of our score-level fusion strategies outperform state-of-the-art models and are, so far, the closest to the ground truth labeled by human eyes. Furthermore, a quantitative comparison among the different fusion schemes is performed. We find that some fusion strategies perform better than others, and that integrating a few of the best models outperforms fusing all models.
In Chapter 6, we propose a novel Conditional Random Field (CRF) graph model for the purpose of salient object segmentation. In our proposed model, we first extract local low-level features, such as the output maps of several saliency models, gradient histograms, and the position of each image pixel. We then train a random forest classifier to fuse the saliency maps into a single high-level feature map using ground-truth annotations. Both the low- and high-level features are fed into our CRF and its parameters are learned. Furthermore, the segmentation results are evaluated from two different perspectives: region and contour accuracy. Extensive experimental comparison shows that our proposed model outperforms the state-of-the-art on both scores. It works well over a broad range of image types and degrades gracefully in the presence of detection errors.
7.2 Future Research Directions
To extend our research, we have the following directions to further improve the work proposed in this dissertation:
• Evaluation of generated depth maps
In our existing experiments, the evaluation of generated depth maps is based on human perception: a depth map is considered correct only if it matches the observation of human eyes. This evaluation method is quite unreliable and needs human interaction. Therefore, we need to develop a more reliable evaluation method without human interaction by establishing evaluation metrics for measurement purposes.
• Image segmentation improvement
Our current image segmentation (e.g., the grab cut algorithm [13] and mean shift algorithm [20]) heavily depends on color information, which can provide satisfactory depth maps for most test cases. However, this approach can fail in some cases, particularly when foreground objects contain too many colors. In our future research, we will consider other features, such as texture and motion information, to improve the reliability of the image segmentation method.
• Fine-grained depth map for the foreground
The generated depth map for the foreground is coarse-grained: all the pixels of each foreground object are assigned the same depth value. For example, the pixels of a face close-up share the same depth value, which cannot provide high resolution for the depth map. We will investigate more depth clues, such as shading and shadow, to provide a fine-grained depth map for the foreground.
• Temporal stability improvement
The proposed depth map propagation method terminates the propagation process when the accumulated error exceeds an upper error bound, which results in temporal inconsistency of the depth maps and leads to a visually unpleasant experience for viewers. To solve this problem, a spatial smoothing filter will be used to avoid sudden depth jumps between neighboring frames so as to improve the temporal stability of the depth maps.
Reference List
[1] D. N. S. A. Borji and L. Itti, "Salient object detection: A benchmark," European Conference on Computer Vision (ECCV).
[2] L. I. A. Borji, "State-of-the-art in visual attention modeling," IEEE Trans. Pattern Analysis, vol. 35, no. 1, pp. 185–207.
[3] X. A. Garcia-Diaz, X. R. Fdez-Vidal and R. Dosil, "Decorrelation and distinctiveness provide with human-like saliency," Advanced Concepts for Intelligent Vision Systems (ACIVS), vol. 5807.
[4] K. N. A. K. Jain and A. Ross, "Score normalization in multimodal biometric systems," Pattern Recognition, vol. 38, no. 12.
[5] D. L. A. Levin and Y. Weiss, "A closed form solution to natural image matting," IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] F. D. A. Levin, R. Fergus and W. T. Freeman, "Image and depth from a conventional camera with a coded aperture," ACM Trans. Graphics, vol. 26, pp. 70–78, 2007.
[7] M. A. Maki and C. Wiles, "Geotensity: Combining motion and lighting for 3d surface reconstruction," International Journal of Computer Vision (IJCV), vol. 48, no. 1, pp. 75–90, 2002.
[8] M. C. A. Quattoni and T. Darrell, "Conditional random fields for object recognition," Conference on Neural Information Processing Systems.
[9] L. G. A. Yonas and J. Hallstrom, "Development of sensitivity to information provided by cast shadows in pictures," Perception and Pictorial Representation, vol. 7, pp. 100–109, 1978.
[10] A. Borji, "Boosting bottom-up and top-down visual features for saliency estimation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Y. Boykov and G. Funka-Lea, "Graph cuts and efficient n-d image segmentation," International Journal of Computer Vision, vol. 70, no. 2, pp. 109–131.
[12] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32.
[13] A. B. C. Rother, V. Kolmogorov, "Grabcut: Interactive foreground extraction using iterated graph cuts," Computer Vision and Pattern Recognition (CVPR), vol. 23, no. 3, pp. 309–314, 2004.
[14] A. M. C. Sutton, "An introduction to conditional random field," Foundations and Trends in Machine Learning, vol. 4, no. 4, pp. 267–373.
[15] A. Chapanis and R. McCleary, "Interposition as a cue for the perception of relative distance," Journal of General Psychology, vol. 48, pp. 113–132, 1953.
[16] T. Chen and R. Rao, "Audio-visual integration in multimodal communications," Proc. IEEE, vol. 86, no. 5, pp. 837–852.
[17] D. K.-S. T. J. R. D. Anguelov, P. Srinivasan and J. Davis, "Scape: shape completion and animation of people," ACM Trans. Graph., vol. 24, no. 1, pp. 408–416, 2005.
[18] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] M. Dimiccoli and P. Salembier, "Exploiting t-junctions for depth segregation in single images," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, no. 1, pp. 1229–1232, 2009.
[20] C. Dorin and M. Peter, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 24, no. 5, pp. 603–619, 2002.
[21] H. L. E. Delage and A. Ng, "A dynamic bayesian network model for autonomous 3d reconstruction from a single indoor image," Computer Vision and Pattern Recognition (CVPR), vol. 2, no. 1, pp. 2418–2428, 2006.
[22] J. H. Elder and S. W. Zucker, "Local scale control for edge detection and blur estimation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 699–716, 1998.
[23] C. Frueh and A. Zakhor, "Constructing 3d city models by merging ground-based and airborne views," Computer Vision and Pattern Recognition (CVPR), vol. 2, no. 1, pp. 52–61, 2003.
[24] C. G. G. Kang and W. Ren, "Shape from shading based on finite-element," Proc. Intl. Conf. on Machine Learning and Cybernetics (ICMLC), vol. 8, pp. 5165–5169, 2005.
[25] J. Gibson, The Perception of the Visual World. Oxford, England: Houghton Mifflin, 1950.
[26] Z. Y.-T. L. N. Z. H. Jiang, J. Wang and S. Li, "Automatic salient object segmentation based on context and shape prior," British Machine Vision Conference (BMVC).
[27] A. Hagen, "Robust speech recognition based on multi-stream processing," Ph.D. thesis.
[28] H. Helmholtz, Treatise on Physiological Optics. James P. C. Southall, 1925.
[29] A. Hertzmann and S. Seitz, "Example-based photometric stereo: Shape reconstruction with general varying brdfs," IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 27, no. 1, pp. 1254–1264, 2005.
[30] W. Hittelson, "Size as a cue to distance: static localization," American Journal of Psychology, vol. 64, pp. 54–67, 1950.
[31] C. K. J. Harel and P. Perona, "Graph-based visual saliency," Conference on Neural Information Processing Systems (NIPS).
[32] ——, "Graph-based visual saliency," Proceedings of Neural Information Processing Systems (NIPS), vol. 19, no. 1, pp. 545–552, 2006.
[33] R. P. W. D. J. Kittler, M. Hatef and J. Matas, "On combining classifiers," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 3.
[34] A. S. J. Michels and A. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," 22nd International Conference on Machine Learning (ICML), vol. 27, no. 1, pp. 593–600, 2005.
[35] W. T. F. J. S. Yedidia and Y. Weiss, "Generalized belief propagation," pp. 689–695, 2000.
[36] S. C. D. K. Nandakumar, Y. Chen and A. K. Jain, "Likelihood ratio-based biometric score fusion," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 342–347.
[37] H. T. C. S. H. L. K. Y. Chang, T. L. Liu, "Fusing generic objectness and visual saliency for salient object detection," International Conference on Computer Vision (ICCV).
[38] G. Kanizsa, Organization in Vision: Essays on Gestalt Perception. New York: Praeger, 1979.
[39] P. Kellman and M. Arterberry, The Cradle of Knowledge. MA: MIT Press, 1998.
[40] P. Kellman and T. Shipley, "Visual interpolation in object perception," Current Directions in Psychological Science, vol. 1, pp. 193–199, 1991.
[41] C. K. L. Itti and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 20, no. 11, pp. 1254–1259.
[42] T. K. M. H. S. G. W. C. L. Zhang, M. H. Tong, "SUN: A Bayesian framework for saliency using natural statistics," Journal of Vision (JOV).
[43] E. F. M. Cerf and C. Koch, "Faces and text attract gaze independent of the task: Experimental data and computer model," Journal of Vision (JOV), vol. 9, no. 12, pp. 1–15.
[44] A. N. M. Pajak, "Object-based saccadic selection during scene perception: Evidence from viewing position effects," Journal of Vision (JOV), vol. 13, no. 5, pp. 1–21.
[45] B. S. Manjunath, "Unsupervised texture segmentation using markov random field models," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 13, no. 5, pp. 478–482.
[46] P. Mehrani and O. Veksler, "Saliency segmentation based on learning and graph cut refinement."
[47] J. Ming and F. J. Smith, "Speech recognition with unknown partial feature corruption - a review of the union model," Computer Speech and Language, vol. 17.
[48] V. Movahedi and J. H. Elder, "Design and perceptual validation of performance measures for salient object segmentation," 7th IEEE Computer Society Workshop on Perceptual Organization in Computer Vision (POCV).
[49] J. N. D. B. Bruce, "Saliency based on information maximization," Conference on Neural Information Processing Systems (NIPS).
[50] M. T. N. Plath and S. Nakajima, "Multi-class image segmentation using conditional random fields and global classification," Proc. of the 26th Annual International Conference on Machine Learning.
[51] A. Pentland, "A new sense for depth of field," Proc. of International Conference on Computer Vision (ICCV), vol. 4, pp. 839–846, 1985.
[52] N. Poh and S. Bengio, "Eer of fixed and trainable fusion classifiers: A theoretical study with application to biometric authentication tasks," Multiple Classifier Systems.
[53] F. E. R. Achanta, S. Hemami and S. Susstrunk, "Frequency-tuned salient region detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[54] A. M. M. I. R. Snelick, U. Uludag and A. K. Jain, "Large scale evaluation of multimodal biometric authentication using state-of-the-art systems," IEEE Trans. on Pattern Analysis and Machine Recognition, vol. 27, no. 3.
[55] D. Robert and D. Tax, "Experiments with classifier combining rules," Multiple Classifier Systems.
[56] R. B. S. Alpert, M. Galun and A. Brandt, "Image segmentation by probabilistic bottom-up aggregation and cue integration," IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[57] L. Z.-M. S. Goferman and A. Tal, "Context-aware saliency detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[58] P. Salil and A. K. Jain, "Decision-level fusion in fingerprint verification," Pattern Recognition, vol. 35, no. 4, pp. 861–874.
[59] B. Soonmin and D. Fredo, "Defocus magnification," Computer Graphics Forum, vol. 26, no. 3, pp. 571–579, 2007.
[60] F. D. T. Judd, K. Ehinger, "Learning to predict where humans look," International Conference on Computer Vision (ICCV).
[61] N. Z. X. T. T. Liu, J. Sun and H. Shum, "Learning to detect a salient object," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[62] M. I. T. Nagai, T. Naruse and A. Kurematsu, "Hmm-based surface reconstruction from single images," IEEE Intl. Conf. Image Processing (ICIP), vol. 2, no. 1, pp. 561–564, 2003.
[63] S. Thrun and B. Wegbreit, "Shape from symmetry," International Conf. on Computer Vision (ICCV), vol. 2, no. 1, pp. 1824–1831, 2005.
[64] L. Torresani and A. Hertzmann, "Automatic non-rigid 3d modeling from video," European Conference on Computer Vision (ECCV), vol. 2, no. 1, pp. 299–312, 2004.
[65] M. P. N. S. V. Cantoni, L. Lombardi, "Vanishing point detection: Representation analysis and new approaches," Dip. di Informatica e Sistemistica, vol. 20, no. 1, pp. 26–38, 2001.
[66] G. Vasari, Life of the Artists. Oxford University Press, 1998.
[67] F. Wang and J. Han, "Multimodal biometric authentication based on score level fusion using support vector machine," Opto-Electronics Review, vol. 17, no. 1.
[68] C. Wheatstone, "Contributions to the physiology of vision, part i: On some remarkable and hitherto unobserved, phenomena of binocular vision," Philosophical Transactions of the Royal Society of London, vol. 128, pp. 371–394, 1838.
[69] M. Wolfgang, "Consciousness, perception, and action," Handbook of Perception, vol. 1, pp. 109–122, 1974.
[70] L. Z. X. Hou, "Dynamic attention: Searching for coding length increments," Conference on Neural Information Processing Systems (NIPS).
[71] ——, "Saliency detection: A spectral residual approach," IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[72] J. Yedidia and W. Freeman, "Understanding belief propagation and its generalizations," Exploring Artificial Intelligence in the New Millennium, vol. 1, pp. 236–239, 2003.
[73] Y. Zhai and M. Shan, "Visual attention detection in video sequences using spatiotemporal cues," ACM Multimedia.
Abstract

With the rapid development of 3D vision technology, recovering depth information from 2D images has become an active research topic. Current solutions depend heavily on structural assumptions about the 2D image, so their applications are limited. It is still technically challenging to develop an efficient yet general solution to generate the depth map from a single image. Furthermore, psychological studies indicate that human eyes are particularly sensitive to salient object regions within an image. Thus, it is critical to detect salient objects accurately and to segment their boundaries well, as small depth errors in these areas lead to intolerable visual distortion. Briefly speaking, the research work in this dissertation can be categorized into two parts: depth map inference system design, and salient object detection and segmentation algorithm development.

For depth map inference system design, we propose a novel depth inference system for 2D images and videos. Specifically, we first adopt in-focus region detection and saliency map computation techniques to separate the foreground objects from the remaining background region. After that, a color-based grab-cut algorithm is used to remove the background from the obtained foreground objects by modeling the background. The depth map of the background is then generated by a modified vanishing point detection method, and key frame depth maps are propagated to the remaining frames. Finally, to meet the stringent requirements of VLSI chip implementation, such as limited on-chip memory size and real-time processing, we replace some building modules with simplified versions of the in-focus region detection and the mean-shift algorithm. Experimental results show that the proposed solution can provide accurate depth maps for 83% of the test images, while other state-of-the-art methods achieve accurate results for only 34% of them. This simplified solution targeting VLSI chip implementation has been validated for its high accuracy as well as high efficiency on several test video clips.

For salient object detection, inspired by the success of late fusion in semantic analysis and multi-modal biometrics, we model saliency detection as late fusion at the confidence score level. In fact, we propose to fuse state-of-the-art saliency models at the score level in a para-boosting learning fashion. Firstly, the saliency maps generated by these models are used as confidence scores. Then, these scores are fed into our para-boosting learner (i.e., Support Vector Machine (SVM), Adaptive Boosting (AdaBoost), or Probability Density Estimator (PDE)) to predict the final saliency map. In order to explore the strength of para-boosting learners, traditional transformation based fusion strategies such as Sum, Min, and Max are also applied for comparison purposes. In our application scenario, salient object segmentation is the final goal, so we further propose a novel salient object segmentation scheme using a Conditional Random Field (CRF) graph model. In this segmentation model, we first extract local low-level features, such as the output maps of several saliency models, gradient histograms and the position of each image pixel. We then train a random forest classifier to fuse the saliency maps into a single high-level feature map using ground-truth annotations. Finally, both low- and high-level features are fed into our CRF and its parameters are learned. The segmentation results are evaluated from two different perspectives: region and contour accuracy. Extensive experimental comparison shows that both our salient object detection and segmentation models outperform state-of-the-art saliency models and are, so far, the closest to the ground truth labeled by human eyes.