MACHINE LEARNING TECHNIQUES FOR
OUTDOOR AND INDOOR LAYOUT ESTIMATION
by
Yuzhuo Ren
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2017
Copyright 2017 Yuzhuo Ren
Dedicated to my parents for their endless love and support
Acknowledgments
I would like to acknowledge the many people who helped me through my PhD study.
First of all, I would like to express my deepest respect and thanks to my advisor, Professor C.-C. Jay Kuo. He spent a lot of time and effort guiding my research. Without his guidance and encouragement, I could not have gone through difficult times and achieved several milestones in my PhD study. I admire his diligent working spirit. His endless energy and enthusiasm for research inspire me to work hard and deliver high-quality research. His attitude toward research has had a significant impact on my own.
I would like to thank Professor Alexander A. (Sandy) Sawchuk, Professor Antonio Ortega, Professor Panayiotis (Panos) G. Georgiou, and Professor Aiichiro Nakano for serving on my qualifying exam committee and giving me valuable suggestions on my research.
I would like to thank my parents, who raised me, support me and encourage me all the time. They always stand behind me and help me as much as possible. They taught me how to handle all difficulties positively and how to be a good person.
Finally, last but by no means least, I would like to thank my fellow doctoral students for their feedback, cooperation and friendship. In addition, I would like to express my gratitude to the department staff for their help, and to thank my friends at USC who supported me during my PhD study and made my life colorful.
Thanks for all your encouragement!
Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Significance of the Research
1.2 Contributions of the Research
1.2.1 Global-Attributes Assisted Labeling
1.2.2 Coarse-to-Fine Indoor Layout Estimation (CFILE)
1.2.3 Context-Assisted 3D (C3D) Object Detection from RGB-D Images
1.3 Organization of the Dissertation
Chapter 2: Related Work
2.1 Outdoor Geometric Labeling
2.1.1 Local-Patch-based Labeling
2.1.2 Blocks World Modeling
2.1.3 Grammar-based Parsing and Mergence
2.1.4 3D Building Layout Inference
2.2 Indoor Layout Estimation
2.2.1 Structured Learning
2.2.2 Classical Methods for Indoor Layout Estimation
2.2.3 3D- and Video-based Indoor Layout Estimation
2.2.4 CNN- and FCN-based Indoor Layout Estimation
2.3 3D Object Detection
2.3.1 Methods Based on Hand-Crafted Features or CAD Models
2.3.2 CNN-based Methods
2.3.3 3D Object Proposal
2.3.4 Context Information in Object Detection
Chapter 3: GAL: A Global-Attributes Assisted Labeling System for Outdoor Scenes
3.1 Introduction
3.2 Proposed GAL System
3.2.1 System Overview
3.2.2 Initial Pixel Labeling (IPL)
3.2.3 Global Attributes Extraction (GAE)
3.2.4 Layout Reasoning and Label Refinement (LR2)
3.3 Experimental Results
3.4 Error Analysis
3.5 Conclusion
Chapter 4: A Coarse-to-Fine Indoor Layout Estimation (CFILE) Method
4.1 Introduction
4.2 Coarse-to-Fine Indoor Layout Estimation (CFILE)
4.2.1 System Overview
4.2.2 Coarse Layout Estimation via MFCN
4.2.3 Layout Refinement
4.2.3.1 Generation of High-Quality Layout Hypotheses
4.2.3.2 Layout Ranking
4.3 Experiments
4.3.1 Experimental Setup
4.3.2 Experimental Results and Discussion
4.4 Error Analysis
4.5 Conclusion
Chapter 5: Context-Assisted 3D (C3D) Object Detection from RGB-D Images
5.1 Introduction
5.2 Proposed Context-Assisted 3D (C3D) Method
5.2.1 Network System Design
5.2.2 Small Object Detection
5.2.3 Graphical Model Optimization
5.3 Experimental Results
5.4 Conclusion
Chapter 6: Conclusion and Future Work
6.1 Summary of the Research
6.2 Future Research
Bibliography
List of Tables

3.1 Comparison of the averaged labeling accuracy (%) of six methods with respect to the building subset (B), the full set (F) and the full set with relabeled ground truth (F/R), where 7 and 5 mean all seven classes and the five classes belonging to the vertical category, respectively. Results are updated.
3.2 The performance gain from each individual attribute.
4.1 Performance comparison of coarse layout results for Hedau's test dataset, where the performance metrics are the fixed contour threshold (ODS) and the per-image best threshold (OIS) [AMFM11]. We use FCN to indicate the informative edge method in [ML15]. Both MFCN_1 and MFCN_2 are proposed in our work. They correspond to the two settings where the layout and semantic surfaces are jointly trained on the original image size (MFCN_1) and the downsampled image size 404 × 404 (MFCN_2).
4.2 Performance benchmarking for Hedau's dataset.
4.3 Performance benchmarking for the LSUN dataset.
5.1 Evaluation for 3D large object detection on the SUN RGB-D test set.
5.2 Comparison of indoor object detection accuracy (measured in average precision (AP) %).
5.3 Evaluation for 3D small object detection on the SUN RGB-D test set.
List of Figures

1.1 Outdoor geometric labeling problem definition. Given an input outdoor image, the desired output is a pixel-wise segmentation of seven geometric labels.
1.2 Outdoor layout estimation applications.
1.3 Outdoor layout estimation challenges.
1.4 Indoor layout problem definition. Given an input indoor image, the desired output is either a corner-representation layout or a segmentation-representation layout.
1.5 Indoor layout estimation applications.
1.6 Indoor layout estimation challenges.
1.7 "Manhattan World" assumption. The scene is composed of three main directions orthogonal to each other.
1.8 3D amodal object detection problem definition. Given an input RGB image with a depth map, the desired output is a 3D cuboid detected for each object and its object category.
1.9 3D object detection application. 3D object detection makes it possible for a robot to recognize and localize objects.
1.10 3D amodal object detection challenges. First column: original RGB image. Second column: depth map. Third column: point cloud.
2.1 To obtain useful statistics for modeling geometric classes, [HEH07a] slowly builds structural knowledge of the image: from pixels (a), to super-pixels (b), to multiple potential groupings of super-pixels (c), to the final geometric labels (d).
2.2 Examples of multiple segmentations [HEH07a].
2.3 Catalog of the possible block view classes and associated 2D projections. The 3D blocks are shown as cuboids although this representation imposes no such constraints on the 3D shape of the block. The arrow represents the camera viewpoint.
2.4 Example output of the automatic scene understanding system of [GEH10]. The 3D parse graph summarizes the inferred object properties (physical boundaries, geometric type, and mechanical properties) and relationships between objects within the scene.
2.5 Parsing images using grammar rules [LZZ14].
2.6 Illustration of the five grammar rules [LZZ14].
2.7 [PHK15] detects building facades in a single 2D image, decomposes them into distinctive planes of different 3D orientations, and infers their optimal depth in 3D space based on cues from individual planes and 3D geometric constraints among them. Left: detected facade regions are covered by shades of different colors, each color representing a distinctive facade plane. Middle/Right: ground contact lines of building facades on the ground plane before/after considering inter-planar geometric constraints. The coarser grid spacing is 10 m.
2.8 Illustration of the structured learning framework in indoor layout estimation [HHF09].
3.1 The flow chart of the proposed GAL system.
3.2 Fusion of 3-class labeling algorithms. Stage 1: Three individual random forest classifiers trained by segments from SLIC [ASS+12], FH [FH04] and CCP [FWC+15], respectively, where the gray images show the probability output of each individual classifier under different segmentation methods and different geometric classes. Stage 2: The three probability outputs from Stage 1 are cascaded into one long feature vector for each intersected segmentation unit and an SVM classifier is trained to get the final decision.
3.3 Comparison of initial pixel labeling results (from left to right): the original image, the 7-class labeling result from [HEH08a], the proposed 3-class labeling result, and the ground truth 7-class labeling. Our scheme offers better "support" and "sky" labels.
3.4 The process of validating the existence of the sky/ground lines and their location inference.
3.5 Horizon detection and its application to layout reasoning and label refinement.
3.6 Trapezoidal shape fitting with the sky line in the top and the ground line in the bottom.
3.7 An example of vertical line detection and correction.
3.8 Examples of surface orientation refinement using the shape and the vanishing line cues.
3.9 Examples of obtaining the object mask using the person and the car object detectors and the grab cut segmentation method: (a) the bounding box result of the person detector, (b) the person segmentation result, (c) the bounding box result of the car detector, and (d) the car segmentation result.
3.10 Two examples of initially labeled porous regions where the top (trees) is correctly labeled while the bottom (mountain) is wrongly labeled.
3.11 A proposed framework for geometric layout reasoning: (a) the simplest case, (b) the background scene only, and (c) a general scene, where SL, GL, H, O denote the sky line, the ground line, the horizon and the occluder, respectively.
3.12 Qualitative comparisons of three geometric layout algorithms (from left to right): the original image, the CNN method [BHC15], Hoiem et al. [HEH08a], Gupta et al. [GEH10], the GAL and the ground truth. The surface layout color codes are magenta (planar left), dark blue (planar center), red (planar right), green (non-planar porous), gray (non-planar solid), light blue (sky), black (support).
3.13 Comparison of 3D rendered views based on geometric labels from the H-method (left) and GAL (right).
3.14 Comparison of labeling accuracy between the CNN methods, the H-method, the G-method and GAL with respect to seven individual labels.
3.15 Error analysis of the proposed GAL system with three exemplary images (one example per row and from left to right): the original image, the labeled result of GAL and the ground truth.
4.1 The pipeline of the proposed coarse-to-fine indoor layout estimation (CFILE) method. For an input indoor image, a coarse layout estimate that contains large surfaces and their boundaries is obtained by a multi-task fully convolutional neural network (MFCN) in the first stage. Then, occluded lines and missing lines are filled in and possible layout choices are ranked according to a pre-defined score function in the second stage. The one with the highest score is chosen as the final output.
4.2 Illustration of a layout model Layout = (l_1, l_2, l_3, l_4, v) that is parameterized by four lines and a vanishing point: (a) an easy setting where all five surfaces are present; (b) a setting where some surfaces are outside the image; (c) a setting where key boundaries are occluded.
4.3 Illustration of the FCN-VGG16 with two output branches. We use one branch for the coarse layout learning and the other branch for semantic surface learning. The input image is re-sized to 404 × 404 to match the receptive field size of the filter at the fully connected layer.
4.4 Illustration of critical lines detection for better layout hypothesis generation. For a given input image, the coarse layout offers a mask C that guides vanishing lines selection and critical lines inference. The solid lines indicate detected vanishing lines. The dashed wall lines indicate those wall lines that are not detected but inferred inside mask C from ceiling corners. The dashed floor lines indicate those floor lines that are not detected but inferred inside mask C.
4.5 Example of layout ranking using the proposed score function.
4.6 Illustration of ground truth relabeling for the LSUN dataset. 1 Frontal wall, 2 Left wall, 3 Right wall, 4 Floor, 5 Ceiling.
4.7 Comparison of coarse layout results (from left to right): the input image, the coarse layout result of the FCN in [ML15], the coarse layout result of the proposed MFCN_2 and the ground truth. The results of the MFCN_2 are more robust. Besides, it provides clearer contours in occluded regions. The first two examples are from the Hedau dataset and the last two examples are from the LSUN dataset.
4.8 Visualization of the six best results of the CFILE method in Hedau's test dataset (from top to bottom): original images, the coarse layout estimates from MFCN, and our results with pixel-wise accuracy (where the ground truth is shown in green and our result is shown in red).
4.9 Visualization of the three worst results of the CFILE method in Hedau's test dataset (from top to bottom): original images, the coarse layout estimates from MFCN, and our results with pixel-wise accuracy (where the ground truth is shown in green and our result is shown in red).
4.10 Visualization of layout results of the CFILE method in the LSUN validation set. Ground truth is shown in green and our result is shown in red.
4.11 Visualization of the scores of different layout hypotheses. The red line is the layout hypothesis generated by our proposed method and the green line is the ground truth layout. Images are from Hedau's dataset.
4.12 Training image statistics.
4.13 Worse examples. The wall boundary is not accurately detected.
4.14 Worse examples. One wall boundary is missing.
4.15 Worse examples: bird-eye view images.
4.16 Worse examples: close shot images.
4.17 Other worse examples.
5.1 Illustration of how a scene context helps improve 3D object detection, where 3D object detection with or without the assistance of the context information is compared. For ease of visualization, only a few object proposals are drawn. In the top row, the figure shows 3D object proposals. In the bottom row, we compare the confidence of detected objects. The probability of the "bed" increases while the probability of the "sofa" decreases in our method due to the use of the "bedroom" scene classification result.
5.2 The block diagram of the context-3D system. In the first stage, we use the scene-CNN and the object-CNN (enclosed by blue and red dotted boxes, respectively) to obtain scene classification and object detection results. For the top branch, the RGB image and the HHA image [GGAM14] serve as the input to the scene-CNN. For the bottom branch, the RGB-D image serves as the input to another CNN for 3D region proposals. Then, the scene category classification vector and the 3D region proposal results are concatenated and fed into the third CNN for 3D object detection so as to provide accurate object detection results. In the second stage, we jointly optimize a cost function associated with scene classification and object detection under the Conditional Random Field (CRF) framework. The cost function includes the scene potential and the object potential obtained from the first stage as well as the scene/object context, the object/object context and the room geometry information.
5.3 The co-occurrence probabilities of scene categories (along the x-axis) and object classes (along the y-axis), where a higher value is indicated by a brighter color.
5.4 Illustration of the system architecture of three CNNs which serves as the first stage of the proposed C3D CNN for large object detection. It consists of three branches. The top row is a 2D scene CNN, the middle row is a 2D object CNN, and the bottom row is a 3D object CNN. Their features are concatenated to form an end-to-end 3D object detection system. The output of the network is a cuboid with its object class label.
5.5 Two examples used to illustrate the mapping from a 2D bounding box to a 3D cuboid (from left to right): 2D object detection results, the depth maps, and the 3D cuboids generated by enclosing all 3D points in the detected 2D bounding boxes.
5.6 3D small object detection and localization, where the left, middle and right columns show the input images, the frontal view and the top view of results, respectively. The ground truth bounding box, the initial cuboid obtained by directly mapping 2D pixels to the 3D point cloud using the depth map, and the refined cuboid obtained by our proposed method are indicated in green, blue and red, respectively.
5.7 A sample image and its corresponding CRF model. The relationship between the scene and an object (S/O) and between two objects (O/O) is modeled by the edge between nodes. When the S/O and O/O relationships are considered, the confidence of "bed", "night stand" and "pillow" increases while the confidence of "sofa" decreases.
5.8 The co-occurrence probabilities of one object class (along the x-axis) and another object class (along the y-axis), where a higher value is indicated by a brighter color.
5.9 Example of overlapping cuboids with different object categories; namely, the bed cuboids and the chair cuboids. The numbers on top of each cuboid are detection confidences.
5.10 Detections with confidence scores larger than 0.5 for each algorithm, where we see that the contextual information helps preserve the true positives and reduce the false detections.
5.11 The ground truth and our detection results are shown in green and yellow, respectively. The reasons for erroneous detections include the following (from top to bottom): (a) the TV monitor is missed because its size is too small, (b) the missing object is heavily occluded, (c) the major part of the missing object is outside the camera view, and (d) there exists large intra-class variation.
Abstract
In this dissertation, we study three research problems: 1) outdoor geometric labeling, 2) indoor layout estimation, and 3) 3D object detection.
A novel method that extracts global attributes from outdoor images to facilitate geometric layout labeling is proposed in Chapter 3. The proposed Global-Attributes Assisted Labeling (GAL) system exploits both local features and global attributes. First, by following a classical method, we use local features to provide initial labels for all super-pixels. Then, we develop a set of techniques to extract global attributes from 2D outdoor images. They include sky lines, ground lines, vanishing lines, etc. Finally, we propose the GAL system that integrates global attributes in the conditional random field (CRF) framework to improve the initial labels so as to offer a more robust labeling result. The performance of the proposed GAL system is demonstrated and benchmarked against several state-of-the-art algorithms on a popular outdoor scene layout dataset.
The task of estimating the spatial layout of cluttered indoor scenes from a single RGB image is addressed in Chapter 4. Existing solutions to this problem largely rely on hand-crafted features and vanishing lines. They often fail in highly cluttered indoor scenes. The proposed coarse-to-fine indoor layout estimation (CFILE) method consists of two stages: 1) coarse layout estimation and 2) fine layout localization. In the first stage, we adopt a fully convolutional neural network (FCN) to obtain a coarse-scale room layout estimate that is close to the ground truth globally. The proposed FCN considers the combination of the layout contour property and the surface property so as to provide a robust estimate in the presence of cluttered objects. In the second stage, we formulate an optimization framework that enforces several constraints, such as layout contour straightness, surface smoothness and geometric constraints, for layout detail refinement. The proposed CFILE system offers state-of-the-art performance on two common benchmark datasets.
The problem of 3D object detection from RGB-D images, which aims to achieve localization (i.e., producing a bounding box around the object) and classification (i.e., determining the object category) simultaneously, is studied in Chapter 5. Its challenges arise from high intra-class variability, illumination changes, background clutter and occlusion. To solve this problem, we propose a novel solution that integrates the 2D information (RGB images), the 3D information (RGB-D images) and the object/scene context information together, and call it the Context-Assisted 3D (C3D) method. In the proposed C3D method, we first use a convolutional neural network (CNN) to jointly detect a 3D object in a scene and the scene category. Then, we further improve the detection result with a Conditional Random Field (CRF) model that incorporates the object potential, the scene potential, the scene/object context, the object/object context, and the room geometry. Extensive experiments are conducted to demonstrate that the proposed C3D method achieves state-of-the-art performance for 3D object detection on the SUN RGB-D benchmark dataset.
Chapter 1
Introduction
1.1 Significance of the Research
Scene understanding is an important and challenging topic in the computer vision field. It consists of two main subtopics, outdoor scene understanding and indoor scene understanding. This thesis research addresses both of them. The significance of our study is elaborated below.
There are many research issues in outdoor scene understanding, including scene classification [ZLCD16], [ZWW+15], outdoor scene parsing [LYT11], [FCNL13], [SZWW15], 3D reconstruction [CCPS13a], [HEH05b], etc. The outdoor layout estimation problem is one of the key outdoor scene understanding problems. It identifies the 3D structure of an outdoor scene by estimating seven geometric labels: support, sky, planar facing left, planar facing center, planar facing right, porous and object. Figure 1.1 shows the definition of the outdoor geometric labeling problem. Different from outdoor scene parsing, where segments of several semantic labels (i.e., tree, building, sand, road, ocean, fields, etc.) are estimated, the geometric labeling problem tries to segment the functional surfaces which can be used to reconstruct the 3D world. For example, road, ocean and fields are grouped into one geometric class called "support", which plays a supporting role in the 3D world. Different facades of the same building can be labeled as different planar surfaces, facing left or facing right. The planar orientation indicates the depth in the 3D world. The 3D scene can be rendered based on geometric rules once the geometric labels are available.
Figure 1.1: Outdoor geometric labeling problem definition. Given an input outdoor image, the desired output is a pixel-wise segmentation of seven geometric labels.

The applications of outdoor geometric labeling include outdoor robotics [UAB+08], autonomous driving [CSKX15], [MGLW16], 3D reconstruction [SRCF15], [MVL+15], depth estimation [LSL15], etc., as shown in Figure 1.2. An outdoor robotics system is expected to navigate through extreme environments and is mostly used in military applications. It should have the ability to identify the navigable road region and avoid obstacles such as trees, buildings, walls, etc. Road, car and pedestrian detection are key technologies in an autonomous driving system so that the vision system can help the car avoid any collision.

Figure 1.2: Outdoor layout estimation applications: (a) outdoor robotics, (b) autonomous driving, (c) 3D reconstruction.
The problem is challenging due to large appearance variations of outdoor scenes. For example, the "support" class can be road or grass, which have large variations in color and texture features. The cliff and the building facade are both planar surfaces facing right; however, their appearances are very different. The surface orientation (left/center/right) estimation is thus extremely difficult. The challenges are illustrated in Figure 1.3.

Figure 1.3: Outdoor layout estimation challenges: (a) large intra-class variations; (b) different semantic classes with the same geometric label.
Indoor layout estimation is an important problem for indoor scene understanding. A general description of the problem is to identify the boundaries between the ceiling and walls, between walls, and between walls and the floor; illustrative images are shown in Figure 1.4. The desired output is either a corner-representation layout, where boundaries are labeled, or a segmentation-representation layout, where surfaces are labeled. Together with other computer vision techniques, including scene classification, semantic segmentation, 3D object detection and object orientation estimation, a complete indoor scene understanding can be achieved.

Figure 1.4: Indoor layout problem definition. Given an input indoor image, the desired output is either a corner-representation layout or a segmentation-representation layout.

There are many applications of indoor layout estimation, including indoor robotics, real estate and virtual interior design; illustrative images are shown in Figure 1.5. In a robotics system, robots navigate the free space, and their vision system should have the ability to recognize different surface boundaries and objects. In real estate applications, many 2D indoor images are posted on real estate websites; it would be more attractive to customers if 3D models of the rooms could be rendered from the 2D images. With indoor layout estimation, the boundaries of the surfaces become available and, as a result, the relative depth among different surfaces is known, so a 3D model of the room can be easily built. With indoor room layout estimation, virtual interior design also becomes possible, where people are able to design their own house and see its 3D model.

Figure 1.5: Indoor layout estimation applications: (a) indoor robotics, (b) real estate, (c) virtual interior design.
Indoor layout estimation is an important yet challenging problem. The challenges include numerous objects occluding the surface boundaries, viewpoint variations resulting in totally different layouts, and poor illumination conditions that make the wall boundaries hard to detect. The challenges are illustrated in Figure 1.6.

Figure 1.6: Indoor layout estimation challenges: (a) lots of objects, (b) poor illumination, (c) the same room with different viewpoints.

Indoor scene understanding from a single image is generally based on the so-called "Manhattan World" assumption: the scene is composed of three main orthogonal directions, as illustrated in Figure 1.7. This assumption is quite common when working with images of indoor scenes. Although it is a strong restriction, it is usually satisfied (at least partially) in most man-made structures. Outdoor images with man-made buildings usually satisfy the "Manhattan World" assumption as well.
Indoor 3D object detection is an important computer vision problem and has many applications. It is helpful for indoor scene understanding, indoor scene classification, indoor robotics, etc. The goal of indoor 3D object detection is to detect objects in a 3D indoor scene. Given an original image and its depth map, an indoor 3D object detection system recognizes the objects and detects their locations as 3D cuboids. In other words, the occluded part of an object in 3D space should be recovered, which is called amodal detection [KTCM15] in the literature. One example is shown in Figure 1.8. With the development of depth cameras, such as the Kinect, more and more images with depth are available for researchers to develop data-driven algorithms to detect 3D objects.

Figure 1.7: The "Manhattan World" assumption. The scene is composed of three main directions orthogonal to each other.
3D object detection makes it possible for a robotic system to recognize and localize objects in the 3D world. It is one of the most important components in making a robotic system recognize objects so that more complicated interactions with objects, such as moving and grasping them, become possible. One example is shown in Figure 1.9. In the interior design application, 3D object detection can help understand the room layout and analyze different room designs.
3D object detection is very challenging. First, the large intra-class variance of 3D objects makes it difficult for computer vision and machine learning algorithms. Second, the 3D data size is huge compared to a single 2D image because there is one more dimension after converting a 2D image into a 3D point cloud. The large input data increases the computational cost. In order to reduce the computation, researchers have to reduce the resolution in 3D, and the resulting information loss badly affects the accuracy of the algorithm. Third, errors in depth make the 3D point cloud inaccurate, which adds difficulty for detection algorithms. Figure 1.10 shows example images, their depth maps and point clouds. The first example shows that the depth inaccuracy makes the shape of the bed distorted in 3D space. The second example shows that the point cloud information of the table in 3D is very sparse because of heavy occlusion.
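For concreteness, the standard pinhole back-projection that converts a depth map into a 3D point cloud can be sketched as follows. This is generic camera geometry rather than code from this dissertation, and the intrinsic parameters (fx, fy, cx, cy) are assumed to be known:

    import numpy as np

    def depth_to_point_cloud(depth, fx, fy, cx, cy):
        """Back-project a depth map (H x W, metric depth) into an N x 3 point cloud
        using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
        return points[points[:, 2] > 0]                  # drop pixels with missing depth

Noisy or missing depth values propagate directly into this point cloud, which is one source of the distortion and sparsity illustrated in Figure 1.10.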
Figure 1.8: 3D amodal object detection problem definition. Given an input RGB image with a depth map, the desired output is a 3D cuboid detected for each object (e.g., bed, nightstand, lamp, pillow) and its object category.

Figure 1.9: 3D object detection application. 3D object detection makes it possible for a robot to recognize and localize objects.

Figure 1.10: 3D amodal object detection challenges. First column: original RGB image. Second column: depth map. Third column: point cloud.
1.2 Contributions of the Research
The contributions of this thesis research can be summarized below.
1.2.1 Global-Attributes Assisted Labeling
We study existing solutions to the outdoor layout estimation problem, including traditional methods and recently proposed convolutional-neural-network-based methods. The strengths and weaknesses of existing solutions are analyzed and discussed. We point out that integrating the local information and the global information can achieve better performance than considering only one of them. Our solution to outdoor layout estimation focuses on combining the local information and the global information in a systematic way so as to offer a robust layout estimation for various scenes, including natural and urban scenes. The performance of our proposed system is benchmarked on a popular dataset. It outperforms the second best method by a large margin.
We design a two-stage main-class (sky, ground and others) classifier. In the first stage, individual classifiers are trained on different segmentations. We also study the strengths and weaknesses of different super-pixel segmentation algorithms, including SLIC [ASS+12], FH [FH04] and CCP [FWC+15]. All three segmentation algorithms are applied, and a random forest classifier is trained to learn the probability of each super-pixel being classified into one of the three main classes. The super-pixel-based features include the color feature, the position feature and the texture feature. In the second stage, the three probability outputs from the first stage are cascaded into one long feature vector for each intersected segmentation unit, and an SVM classifier is trained to make the final decision. This fusion SVM classifier is trained on finer-scale super-pixel segments. (A minimal sketch of this two-stage fusion is given below.)
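The following is a minimal sketch of the two-stage fusion using scikit-learn. The feature arrays, the three segmentation routines and their shapes are hypothetical placeholders for illustration, not the exact implementation described in Chapter 3:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC

    def train_stage1(segment_features, segment_labels):
        """Stage 1: train one random forest per segmentation method (SLIC, FH, CCP)
        on per-segment color/position/texture features with 3-class labels
        (0 = sky, 1 = ground, 2 = others)."""
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(segment_features, segment_labels)
        return clf

    def train_stage2(stage1_classifiers, unit_features, unit_labels):
        """Stage 2: for each intersected segmentation unit, concatenate the three
        class-probability vectors and train an SVM to make the final decision.
        unit_features[k] holds the features of every unit under the k-th method."""
        probs = [clf.predict_proba(f) for clf, f in zip(stage1_classifiers, unit_features)]
        fused = np.hstack(probs)                 # one long feature vector per unit
        svm = SVC(kernel='rbf', probability=True)
        svm.fit(fused, unit_labels)
        return svm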
We propose seven global attributes, including sky/ground lines, horizon, left- or right-oriented surfaces, vertical lines, vanishing lines, solid and object, which prove to be powerful global attributes for geometric labeling. The seven global attributes describe the properties of a variety of outdoor scenes, including both urban and natural scenes.
We propose a graphical model to optimize the outdoor layout estimation. Specifically, we propose a Conditional Random Field (CRF) context optimization framework to integrate local information from the initial super-pixel labels with global information from the global attributes in order to offer a robust outdoor layout estimation. The unary potential term includes the probability determined by the initial label from local information together with global attributes including the horizon, vertical lines, and the solid and porous properties. The pairwise potential is defined by the sky/ground lines, the vanishing lines and the left- or right-oriented surfaces, which describe the relationships among adjacent segments. (A schematic form of this energy is given below.)
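In schematic form (with generic notation used only for illustration, not the exact potentials defined in Chapter 3), such a CRF assigns labels by minimizing an energy of the type

    E(\mathbf{y}) = \sum_{i} \psi_u(y_i \mid p_i, a_i) + \lambda \sum_{(i,j) \in \mathcal{E}} \psi_p(y_i, y_j \mid g_{ij}),

where y_i is the geometric label of segment i, p_i is its initial label probability from local features, a_i collects the unary global attributes (horizon, vertical lines, solid/porous cues), g_{ij} collects the pairwise cues (sky/ground lines, vanishing lines, surface orientation) for adjacent segments i and j, and \lambda balances the two terms.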
1.2.2 Coarse-to-Fine Indoor Layout Estimation (CFILE)
We propose a novel coarse-to-fine indoor layout estimation (CFILE) system. The proposed system combines bottom-up knowledge from deep learning with top-down prior knowledge. It is shown by experimental results that the proposed CFILE method offers state-of-the-art performance. It outperforms the second best method by a large margin on two popular datasets, Hedau's dataset and the LSUN dataset.
We propose to adopt a fully convolutional neural network (FCN) to learn the labels of the coarse layout and the main surfaces jointly. Specifically, we adopt a multi-task fully convolutional neural network (MFCN) to learn the coarse layout in one branch and the main surfaces in another branch. It is shown that the coarse-scale layout estimate obtained by the MFCN is robust and close to the ground truth. Based on the contour measure metrics, the coarse layout outperforms those of previous methods. (A minimal sketch of such a two-branch network is given below.)
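A minimal sketch of such a shared-backbone, two-head fully convolutional network is given below in PyTorch. It follows the spirit of the FCN-VGG16 with two output branches in Figure 4.3, but the layer choices, the simple bilinear upsampling and the class counts here are illustrative assumptions rather than the trained MFCN:

    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision

    class MultiTaskFCN(nn.Module):
        """Shared VGG16 convolutional features with two 1x1 prediction heads:
        one for coarse layout contours and one for semantic surfaces."""
        def __init__(self, num_layout_classes=2, num_surface_classes=5):
            super().__init__()
            self.backbone = torchvision.models.vgg16().features   # shared conv layers
            self.layout_head = nn.Conv2d(512, num_layout_classes, kernel_size=1)
            self.surface_head = nn.Conv2d(512, num_surface_classes, kernel_size=1)

        def forward(self, x):
            feat = self.backbone(x)                                # (N, 512, H/32, W/32)
            size = x.shape[2:]
            layout = F.interpolate(self.layout_head(feat), size=size,
                                   mode='bilinear', align_corners=False)
            surface = F.interpolate(self.surface_head(feat), size=size,
                                    mode='bilinear', align_corners=False)
            return layout, surface

    # Joint training sums a per-pixel loss on each branch, e.g.
    # loss = ce(layout, layout_gt) + ce(surface, surface_gt)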
We formulate an optimization framework that enforces three constraints (i.e., surface smoothness, contour straightness and proper geometric structure) to refine the coarse-scale layout estimate. A set of high-quality candidate layouts is generated by sampling vanishing lines along the coarse layout and automatically filling in occluded vanishing lines and occluded lines. A score function based on the coarse layout probability is proposed to select the best layout from the layout hypotheses. The proposed score function measures how well a layout proposal is aligned with the coarse layout. The experimental results show that the proposed score function measures the quality of a layout proposal very well. (A sketch of such a score function follows.)
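As an illustration only (the exact score function is defined in Chapter 4), a coarse-layout-probability-based score can be sketched as accumulating the coarse-layout heat map along the edges of each hypothesis; rasterize() is a hypothetical helper that draws a candidate layout's lines into a binary mask:

    import numpy as np

    def layout_score(coarse_prob, hypothesis_mask):
        """coarse_prob: H x W per-pixel probability of being a layout contour
        (e.g., the softmax output of the MFCN layout branch).
        hypothesis_mask: H x W binary mask of the candidate layout's lines."""
        edge_pixels = hypothesis_mask > 0
        if not np.any(edge_pixels):
            return 0.0
        return float(coarse_prob[edge_pixels].mean())

    # best_layout = max(hypotheses, key=lambda h: layout_score(coarse_prob, rasterize(h)))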
1.2.3 Context-Assisted 3D (C3D) Object Detection from RGB-D Images
We propose a joint scene classification and object recognition framework that exploits the discriminative power of CNNs and the inter-dependency between objects and the scene. To achieve this goal, we first build a 2D scene classification CNN and a 3D object proposal CNN. The 2D scene classification CNN provides the scene information to guide the 3D object detection task in the next stage. Afterwards, we feed the results from these two CNNs into a third CNN called the 3D object detection CNN. Both object and scene features are concatenated to provide a more accurate object category under a scene context. (A minimal sketch of this feature concatenation is given below.)
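The concatenation step can be sketched in PyTorch as follows; the feature dimensions and the small two-layer classifier are illustrative assumptions and not the exact network of Figure 5.4:

    import torch
    import torch.nn as nn

    class ContextFusionHead(nn.Module):
        """Concatenates a per-proposal 3D object feature with a global scene
        feature and predicts the object category under that scene context."""
        def __init__(self, obj_dim=4096, scene_dim=512, num_classes=19):
            super().__init__()
            self.classifier = nn.Sequential(
                nn.Linear(obj_dim + scene_dim, 1024),
                nn.ReLU(inplace=True),
                nn.Linear(1024, num_classes),
            )

        def forward(self, obj_feat, scene_feat):
            fused = torch.cat([obj_feat, scene_feat], dim=1)   # per-proposal fusion
            return self.classifier(fused)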
We propose a method to detect 3D small objects, which achieves the state-of-the-
art performance. The proposed method leverages the high resolution of the 2D
RGB image. A cuboid alignment method is proposed to put 2D objects in the 3D
space.
We adopt a hierarchical graphical model to exploit the relationship between objects and the scene as well as the relationships between objects to further improve the 3D object detection performance. Experiments are conducted to demonstrate that the proposed C3D method achieves state-of-the-art performance on a popular benchmark dataset.
1.3 Organization of the Dissertation
The rest of this dissertation is organized as follows.
Background and related work are reviewed in Chapter 2. The outdoor layout esti-
mation work includes the local-patch-based method, the blocks world modeling method,
the grammar-based parsing and mergence method, and the 3D building layout infer-
ence method. The indoor layout estimation work includes the classic structured learning
method, the 3D- and video-based indoor layout estimation method and the convolu-
tional neural network based method. The 3D object detection work includes the tradi-
tional hand-crafted feature approach, the CNN-based approach and the context-based
approach.
In Chapter 3, we propose a Global-Attributes Assisted Labeling (GAL) system which integrates the local features and the global attributes to tackle the outdoor layout estimation problem. In Chapter 4, we propose a novel coarse-to-fine indoor layout estimation method. A fully convolutional neural network (FCN) is adopted to estimate a coarse layout, which is then followed by a layout refinement step to find the exact layout location. In Chapter 5, we propose a context-assisted 3D (C3D) object detection method for RGB-D images. A convolutional neural network is leveraged to jointly detect 3D objects, and a hierarchical Conditional Random Field (CRF) model is adopted to optimize the object detection task. Experimental results are given in Chapters 3, 4 and 5 to demonstrate the superiority of the proposed algorithms. Finally, concluding remarks and future research directions are given in Chapter 6.
Chapter 2
Related Work
2.1 Outdoor Geometric Labeling
2.1.1 Local-Patch-based Labeling
Hoiem et al. [HEH07a] designed super-pixel-level features and used boosted decision tree classifiers to find the most likely label for each super-pixel. Features such as color, position, texture pattern and segment shape were used to describe local visual properties. Super-pixel segmentation has two limitations in the geometric labeling problem. First, regions with weak boundaries are subject to under-segmentation while textured regions are subject to over-segmentation. Second, since one segment is only annotated with a single semantic label, if a wrong decision is made, the loss is huge.
A single segmentation result cannot be perfect in geometric labeling, as pointed out in [HEH07a]. To overcome this problem, an algorithm using the weighted sum of decisions from multiple segmentation results was proposed in [HEH07a]. The flowchart of the proposed algorithm in [HEH07a] is shown in Figure 2.1. Specifically, given an image, a popular graph-based segmentation algorithm [FH04] was applied using several different parameter settings, leading to multiple super-pixels. Then, an algorithm was proposed to merge them into different numbers of segments. Super-pixel merge examples are shown in Figure 2.2. The labeling accuracy can be improved by considering the weighted sum of decisions from the different segmentations. (A schematic form of this weighted combination is given below.)
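In schematic form (with notation introduced here only for illustration), this weighted combination marginalizes the per-segment decisions over the segmentation hypotheses,

    P(y_p = k \mid x) \approx \sum_{j} P(h_j \mid x)\, P\big(y_{s_j(p)} = k \mid x, h_j\big),

where h_j is the j-th segmentation hypothesis, s_j(p) is the segment containing pixel p under h_j, and the per-segment posteriors come from the learned classifiers.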
The super-pixel learning method attempts to establish the relation between local
super-pixel appearance and the desired label. However, due to the lack of global infor-
mation and global physical constraints, the super-pixel learning algorithm may not give
meaningful results.
Figure 2.1: To obtain useful statistics for modeling geometric classes, [HEH07a] slowly builds structural knowledge of the image: from pixels (a), to super-pixels (b), to multiple potential groupings of super-pixels (c), to the final geometric labels (d).
2.1.2 Blocks World Modeling
To overcome the limitation of local-patch-based labeling, researchers have incorporated global constraints or context rules in recent years. Gupta et al. [GEH10, GHKB10a] proposed a qualitative physical model for outdoor scene images. In the inference process, each segment was fitted into one of eight block-view classes. Figure 2.3 shows the eight block-view classes. Other constraints included geometric constraints, contact constraints, intra-class and stability constraints, and depth constraints. The geometric constraints were obtained from the initial labeling result in [HEH07a]. The contact constraints were employed to measure the agreement of geometric properties with ground and sky contact points. The intra-class and stability constraints were introduced to measure physical stability within a single block and against other blocks, respectively. The depth constraints were used to measure the agreement of the projection of blocks onto the 2D image plane with the estimated depth ordering. Given a candidate block B_i, its associated geometric properties and its relationship with other blocks were estimated by minimizing the cost function defined by these constraints, which is formulated in Eq. (2.1). Figure 2.4 shows one example of a parse graph. By fitting blocks into a physical world, global information can be added into the geometric labeling so that local-patch labeling errors can be reduced. However, these algorithms suffer from the limited number of block models, which fail to cover all possibilities in the real world. In addition, if a segment is fitted to a wrong block model, its label could be totally wrong.

Figure 2.2: Examples of multiple segmentations [HEH07a].
    C(B_i) = F_{geometry}(G_i) + \sum_{S \in \{ground,\, sky\}} F_{contacts}(G_i, S) + F_{intra}(S_i, G_i, d)
             + \sum_{j \in blocks} F_{stability}(G_i, S_{i,j}, B_j) + F_{depth}(G_i, S_{i,j}, D)        (2.1)
2.1.3 Grammar-based Parsing and Mergence
Liu et al. [LZZ14] proposed a Bayesian framework and five merge rules to merge super-pixels in a bottom-up fashion for geometric labeling of urban scenes. Figures 2.5 and 2.6 show the parsing graph and the five grammar rules, respectively. The algorithm found straight lines, estimated the vanishing points in the image, and partitioned the image into super-pixels. The inference was done based on Composite Cluster Sampling (CCS). The initial covering was obtained using the K-means algorithm. Then, at each iteration, it made proposals based on five grammar rules: layering (rule 1), siding (rule 2), supporting (rule 3), affinity (rule 4) and mesh (rule 5). The five grammar rules are used to maximize the posterior probability in the Bayesian formulation. The layering rule includes the focal length and camera height and describes the connection between the scene node and other super-pixels. The siding rule describes the spatial connection of two super-pixels and their contact line. The supporting rule states that one super-pixel supports another if their surface normals are orthogonal. The affinity rule states that two super-pixels are likely to belong to the same surface if they have similar appearance. The mesh rule states that multiple super-pixels are arranged in a mesh structure described by two orthogonal vanishing points. The Bayesian inference consisted of two stages. In the first stage, proposals were made using rule 4 and rule 5. In the second stage, proposals were made based on all five grammar rules. Since this algorithm heavily relies on the Manhattan world assumption as well as accurate vanishing point detection results, it cannot handle most natural scenes well.

Figure 2.3: Catalog of the possible block view classes and associated 2D projections. The 3D blocks are shown as cuboids although this representation imposes no such constraints on the 3D shape of the block. The arrow represents the camera viewpoint.
Figure 2.4: Example output of the automatic scene understanding system of [GEH10]. The 3D parse graph summarizes the inferred object properties (physical boundaries, geometric type, and mechanical properties) and relationships between objects within the scene.
2.1.4 3D Building Layout Inference
Pan et al. [PHK15] focused on images containing building facades. Given an urban scene image, they first detected a set of distinctive facade planes and estimated their 3D orientations and locations. Different from previous methods that provided coarse orientation labels or qualitative block approximations, their algorithm reconstructed building facades in 3D space quantitatively using a set of planes mutually related via 3D geometric constraints. Each plane was characterized by a continuous orientation vector and a depth distribution, and an optimal solution was searched through inter-planar interaction. The method inferred the optimal 3D layout of building facades by maximizing a defined objective function. The data term was the product of two scores, indicating the image feature compatibility and the geometric compatibility, respectively. The former measured the agreement between the 2D location of a facade plane and image features, while the latter measured the probability of a ground contact line position. The smoothness term included the convex-corner constraint, the occlusion constraint and the alignment constraint. By exploiting quantitative plane-based geometric reasoning, this solution is more expressive and informative than other methods. However, it does not provide suitable models for the labeling of the ground, sky, porous and solid classes in general scenes.

Figure 2.5: Parsing images using grammar rules [LZZ14].
Figure 2.6: Illustration of the five grammar rules [LZZ14].
2.2 Indoor Layout Estimation
2.2.1 Structured Learning
Figure 2.7: [PHK15] detects building facades in a single 2D image, decomposes them into distinctive planes of different 3D orientations, and infers their optimal depth in 3D space based on cues from individual planes and 3D geometric constraints among them. Left: detected facade regions are covered by shades of different colors, each color representing a distinctive facade plane. Middle/Right: ground contact lines of building facades on the ground plane before/after considering inter-planar geometric constraints. The coarser grid spacing is 10 m.

The structured learning methodology [NL11] has been widely used in the context of indoor room layout estimation. It targets learning the structure of an environment in the presence of imperfect low-level features. It consists of two stages [NL11]. First, a set of structure hypotheses is generated. Second, a score function is defined to evaluate the structures in the hypothesis set. The first stage is guided by low-level features such as vanishing lines under the Manhattan assumption. The number of layout hypotheses in the first stage is usually large, while most of them are of low accuracy due to the presence of clutter. If the quality of the hypotheses is low in the first stage, there is no easy way to fix it in the second stage. In the second stage of layout ranking, the score function contains various features such as the line membership [HHF09], [ML15], the geometric context [HHF09], [ML15], the object location [GHKB10b], etc. The score function cannot handle objects well since they overlap with more than one surface (e.g., between the floor and walls). The occluding objects in turn make the surface appearance quite similar along their boundaries. (A schematic form of this hypothesize-and-rank pipeline is sketched below.)
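The two-stage pipeline can be summarized by the following Python sketch; generate_hypotheses and score stand in for the vanishing-line-based hypothesis generation and the learned score function, respectively, and are assumptions for illustration:

    def estimate_layout(image, generate_hypotheses, score):
        """Generic structured-learning layout estimation:
        1) propose candidate layouts from low-level cues (e.g., vanishing lines);
        2) rank them with a score function and return the best one."""
        hypotheses = generate_hypotheses(image)                   # stage 1
        if not hypotheses:
            raise ValueError("no layout hypotheses were generated")
        return max(hypotheses, key=lambda h: score(image, h))     # stage 2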
2.2.2 Classical Methods for Indoor Layout Estimation
Research on indoor room layout estimation has been active in recent years. Hedau et al. [HHF09] formulated it as a structured learning problem. The pipeline is shown in Figure 2.8. Layout hypotheses are generated from vanishing lines, and the best layout is selected based on the line membership feature and the geometric context feature. There have been many follow-up efforts after this milestone work. They focus on either developing new criteria to reject invalid layout hypotheses or introducing new features to improve the score function in layout ranking. Different hypothesis evaluation methods were considered in [GZH15], [HHF09], [GHKB10b], [SFPU13], [ZZ13], [PBF+12], [RPJT13]. Hedau et al. [HHF09] reduced noisy lines by removing clutter first. Specifically, they used the line membership together with semantic labeling to evaluate hypotheses. Gupta et al. [GHKB10b] proposed an orientation map that labels three orthogonal surface directions based on line segments and then used the orientation map to re-evaluate layout proposals. Besides, they detected objects and fit them into 3D boxes. Since an object cannot penetrate a wall, they used the box locations as a constraint to reject invalid layout proposals. The work in [HHF10], [WGR13] attempted to model objects and the spatial layout simultaneously. Hedau et al. [HHF12] improved their earlier work in [HHF10], [HHF09] by localizing the box more precisely using several cues such as edge- and corner-based features. Ramalingam et al. [RPJT13] proposed an algorithm to detect Manhattan junctions and selected the best layout by optimizing a conditional random field whose corners are well aligned with the pre-detected Manhattan junctions. Pero et al. [PBF+12] integrated the camera model, an enclosing room box, frames (windows, doors, pictures), and objects (beds, tables, couches, cabinets) to generate layout hypotheses. Lampert et al. [LBH09] improved object detection by maximizing a score function through the branch and bound algorithm.

Figure 2.8: Illustration of the structured learning framework in indoor layout estimation [HHF09]: line segment detection, vanishing point estimation, layout generation, box layout evaluation, and selection of the highest-scoring box layout.
2.2.3 3D- and Video-based Indoor Layout Estimation
Zhao and Zhu [ZZ13] exploited the location information and 3D spatial rules to obtain
as many 3D boxes as possible. For example, if a bed is detected, the algorithm will
21
search its neighborhood to look for a side table. Then, they rejected impossible layout
hypothesis. Choi et al. [CCPS13a] trained several 3D scene graph models to learn the
relation among the scene type, the object type, the object location and layout jointly.
Guo et al. [GZH15] recovered 3D model from a single RGBD image by transferring the
exemplar layout in the training set to the test image. Fidler et al. [FDU12] and Xiang
et al. [XS12] represented objects by a deformable 3D cuboid model for improved object
detection and then used in layout estimation. Fouhey et al. [FDG
+
14] exploited human
action and location in time-lapse video to infer functional room geometry. With depth
information available, a lot of work have been done in the eld of 3D object detection
which help improve the indoor layout estimation. Gupta [GGAM14] netuned the R-
CNN pipeline on a 3D depth encoding. CNN features are extracted from geometric
encoding of depth, including horizontal disparity, height above ground, and the angle
between pixel's normal and the inferred gravity direction. Gupta [GAGM15] further
improved 3D object detection and segmentation by matching CAD object models to the
coarse pose estimated from a deep neural network. Song et al. [SX14] proposed to slide a 3D
detection window in 3D space for 3D object detection. Depth maps are synthesized from
a collection of 3D CAD models rendered from hundreds of viewpoints. Features are
extracted from the 3D point cloud and an Exemplar-SVM classifier is trained. Song et al.
[SX15] proposed Deep Sliding Shapes, where a 3D ConvNet takes a 3D volumetric scene
from an RGB-D image as input and outputs 3D object bounding boxes. It introduces the first 3D
Region Proposal Network (RPN), which learns objectness from geometric shapes, and the first
joint Object Recognition Network (ORN), which extracts geometric features in 3D and color
features in 2D. Jiang et al. [JX13] proposed a novel linear method to match cuboids
in indoor scenes using RGB-D images. The cuboid matching problem is formulated as
a mixed integer linear program and the optimization is solved efficiently with a branch
and bound method. Khan et al. [KHB+15] improved the cuboid representation by
generating two types of cuboid hypotheses, one of which corresponds to regular objects
inside a scene and the other to the main structures of a scene, such as the floor and walls.
2.2.4 CNN- and FCN-based Indoor Layout Estimation
The convolutional neural network (CNN) has had a great impact on various computer vision
research topics, such as object detection, scene classification, semantic segmentation,
etc. Mallya and Lazebnik [ML15] used the FCN to learn informative edges from an
RGB image to provide a rough layout. The FCN shares features in the convolutional layers and
optimizes edge detection and geometric context labeling [HHF09], [HEH+05a], [HEH07b]
jointly. The learned contours are used as a new feature in sampling vanishing lines for
layout hypothesis generation. As the learned contours are not robust, other hand-crafted
features, such as the line membership feature and geometric context features, are combined in a
structured learning process to estimate the final layout. Dasgupta et al. [DFCS16]
used the FCN to learn semantic surface labels. Instead of learning edges, their solution
adopted the heat map of semantic surfaces obtained by the FCN as the belief map and
further optimized it using vanishing lines. Generally speaking, a good layout should
satisfy several constraints such as boundary straightness, surface smoothness and proper
geometrical structure. However, the CNN is weak in imposing spatial constraints and
performing spatial inference. As a result, an inference model was appended in both
[ML15] and [DFCS16] to refine the layout result obtained by the CNN.
2.3 3D Object Detection
2.3.1 Methods Based on Hand-Crafted Features or CAD Models
Hand-crafted 3D features can be extracted for 3D object detection. For example, Song
et al. [SX14] ran a detector using hand-crafted 3D features to scan a 3D scene with a
3D sliding window. Ren et al. [RS16] extended the 2D Histogram-of-Oriented-Gradients
(HOG) feature to the 3D Clouds-of-Oriented-Gradients (COG) feature for 3D object
detection. Unsupervised feature learning was proposed in [BRF13, LBF14] for object
recognition with RGB-D inputs. They adopted sparse coding to learn a hierarchical
feature representation from raw RGB-D data in an unsupervised way. Furthermore,
CNNs have been used to extract the depth feature based on computer-aided-design
(CAD) models rather than RGB-D images in [WSK+15, HKM15, SMKLM15, MS15,
XFZW15, SBZB15] for 3D object classification and retrieval. Our C3D method is learned
directly from annotated RGB-D images without any 3D CAD model.
2.3.2 CNN-based Methods
CNNs have been widely used for object detection in 2D RGB images, e.g., RCNN
[GDDM14], fast RCNN [Gir15], faster RCNN [RHGS15], the SPP-net [HZRS14], etc.
These solutions can be generalized for the 3D object detection problem with the RGB-D
inputs. One example is the depth-RCNN [GGAM14, GAGM15]. To exploit the depth
information, the depth-RCNN converts the depth channel to three HHA channels and
feeds them to the RCNN to obtain HHA features. Then, the image features and HHA fea-
tures obtained by two RCNNs [GDDM14] were concatenated to train an SVM classifier for
object recognition. Furthermore, 3D bounding boxes were estimated by aligning them
with 3D CAD models [GGAM14, GAGM15]. Song et al. [SX16] proposed a 3D CNN
architecture to detect 3D objects directly from 3D voxels. For each 3D proposal, the 3D
voxel is fed into a 3D CNN while the 2D color patch (the 2D projection of the 3D proposal)
is fed into a 2D CNN to obtain the object category and the 3D box regression jointly.
2.3.3 3D Object Proposal
Obtaining high quality object proposals in 2D images plays an important role in the RCNN.
The object proposal methods can be classified into two categories: the objectness
approach [ZD14, CZLT14, ADF12] and the similarity approach [UvdSGS13, APTB+14].
The objectness approach suffers from localization bias. That is, the recall rate drops
rapidly when the intersection over union (IoU) ratio is larger than a certain threshold,
as discussed in [CMWZ15, LZZ+17]. The region proposal network (RPN) developed
recently in [RHGS15] encounters the same problem.
Getting a good 3D object proposal is even more challenging for the following reasons.
First, the 3D point cloud represents only the visible part of the 3D space, and a
large portion of free space remains. The range of the physical depth of a 3D object,
which is visible only with a larger perspective, is difficult to estimate from the 3D point
cloud accurately. Second, the search space for the 3D proposal is larger than that of the 2D
proposal because of the addition of one more dimension. This in turn demands a higher
computational cost. A straightforward extension of 2D proposal methods to 3D does
not work well due to the depth error and the occlusion problem, as discussed in [SX16].
Recently, research on 3D region proposals has become active. Chen et al. [CKZ+15]
studied the 3D region proposal for the autonomous driving application. They formulated
it under the Conditional Random Field (CRF) framework and defined an energy func-
tion that consists of object size priors, the ground plane and several depth features for
minimization. They solved this optimization problem to infer the free space, point cloud
densities and the distance to the ground. Song et al. [SX16] designed a multi-scale 3D
region proposal network (RPN) to learn 3D indoor object proposals of different sizes. It
takes a 3D volume as the input, encodes it using the Truncated Signed Distance Func-
tion (TSDF) and adopts a two-scale fully convolutional network to learn the 3D object
region. However, the 3D RPN is sensitive to localization bias.
2.3.4 Context Information in Object Detection
The relationship between objects and their surrounding scenes was exploited for 2D
object detection and scene classification with a graphical model in [MTF+03, YFU12].
Murphy et al. [MTF+03] proposed a CRF method to solve the object detection and
scene classification tasks jointly. Yao et al. [YFU12] proposed a graphical model that
involves segmentation, object detection and scene classification to delineate the context
information. Lin et al. [LFU13] proposed a holistic approach by incorporating the
2D segmentation, the 3D geometry and the context relation between scenes and objects.
However, the context-based graphical model is highly dependent on the initial prediction
of the scene and the object labels for the ultimate prediction performance. Zhang et al.
[ZBK+16] used the depth information to train four pre-defined scene templates and the
object detection network jointly. However, the four scene templates are too limited in
practical applications.
Generally speaking, prior knowledge on an object's physical size, location, object/scene
interdependency and object/object interdependency offers valuable cues for robust 3D
object detection. In this work, we consider a scene-CNN, an object-CNN and their
interaction for robust 3D object detection. The scene classification and the object detec-
tion results obtained from these two CNNs, respectively, serve as potentials in a graphical
model.
Chapter 3
GAL: A Global-Attributes
Assisted Labeling System for
Outdoor Scenes
3.1 Introduction
Automatic 3D geometric labeling or layout reasoning from a single scene image is one
of the most important and challenging problems in scene understanding. It offers
mid-level information for other high-level scene understanding tasks such as 3D world
reconstruction [CCPS13a], [HEH05b], [HHF09], [GSEH11], [MZYM11], [PLLY15], depth
map estimation [LGK10], [EPF14], [LSP14], scene classification [ZLCD16], [ZWW+15],
[CRK16b], [CRK14], [CRK16a] and content-based image retrieval [GJT16], [JKS+15],
[PDH+15], [LPC+17].
Recovering the 3D structure from a single image using a local visual pattern recogni-
tion approach was studied in early geometric labeling research, including [HEH05b],
[LGK10], [HEH08a], [HEH07a], [KST+09], [LHK09], [SSN09], [CCPS13b]. To give
an example, Hoiem et al. [HEH07a] defined seven labels (i.e., sky, support, planar
left/right/center, porous and solid) and classified super-pixels into one of these labels
according to their local visual appearances. Features such as color, position, texture
pattern and segment shape were used to describe local visual properties. Since the same
surface (e.g., a building facade) may take a different geometric role in different images,
the performance of all local-patch-based labeling methods is limited.
To improve the performance of local-patch-based methods, researchers have incorporated
global constraints or context rules in recent years. Gupta et al. [GEH10], [GHKB10a]
proposed a qualitative physical model for outdoor scenes by assuming that objects are
composed of blocks of volume and mass. If a scene image fits the underlying 3D model,
better surface layout estimation can be achieved. However, their model is not generic
enough to cover a wide range of scenes. Liu et al. [LZZ14] and Pan et al. [PHK15]
focused on images containing building facades and developed multiple global 3D context
rules using their distinctive geometric cues such as vanishing lines. Furthermore, recov-
ering the 3D structure from depth estimation algorithms was examined in [EPF14] and
[LSP14].
Recently, researchers have applied convolutional neural networks (CNNs) to the seman-
tic segmentation task [LSD15], [BHC15], [ZJRP+15], [FCNL13], [PC13], [IKJM16],
[NHH15] and [DHS15]. Some improvement over traditional machine-learning-based
methods is observed for object-centric images and road images. However, we have
so far not seen a robust performance of the CNN-based solution to semantic outdoor
scene labeling. This is probably due to the fact that it demands a large number of labeled
scene images for training and such a dataset is still not available.
Being motivated by recent trends, we exploit both local and global attributes for
layout estimation and propose a Global-attributes Assisted Labeling (GAL) system in
this work. GAL uses local visual patterns to provide initial labels for all pixels and extracts
global attributes such as sky lines, ground lines, the horizon, vanishing lines, etc. Then, it
uses the global attributes to improve the initial labels. Our work contributes to this field in two
ways. First, it provides a new framework to address the challenging geometric layout
labeling problem, and this framework is supported by encouraging results. Second, as
compared with previous work, GAL can handle images of more diversified contents using
inference from global attributes. The performance of the GAL system is benchmarked
against several state-of-the-art algorithms on a popular outdoor scene layout dataset,
and significant performance improvement is observed.
The rest of this chapter is organized as follows. The GAL system is described in
Sec. 3.2. Experimental results are shown in Sec. 3.3. Sec. 3.4 includes analysis of several
poor results. Finally, concluding remarks are given in Sec. 3.5.
3.2 Proposed GAL System
3.2.1 System Overview
The flow chart of the GAL system is given in Figure 3.1. It consists of three stages:
Stage 1: Initial Pixel Labeling (IPL);
Stage 2: Global Attributes Extraction (GAE);
Stage 3: Layout Reasoning and Label Refinement (LR2).
Figure 3.1: The flow chart of the proposed GAL system, which consists of the IPL stage, the GAE stage (global attributes #1-#7: sky and ground lines, horizon, planar surfaces, vertical lines, vanishing lines, solid, porous) and the LR2 stage.
For a given outdoor scene image, we obtain initial pixel labeling results using a
pixel-wise labeling method in the first stage. Here, we trained a seven-class labeler using the
SegNet architecture [BHC15]. The reason that we use SegNet over other CNN-based
segmentation approaches [LSD15, ZJRP+15] is its balance of segmentation accu-
racy and efficiency [BHC15]. The labeling performance of the IPL stage is, however, not
satisfactory due to the lack of global scene information. To address this issue, we pose
the following seven questions for each scene image and would like to answer them based
on all possible visual cues (e.g., color, edge contour, defocus degree, etc.) in the second
stage:
1. Is there sky in the image? If yes, where?
2. Is there ground in the image? If yes, where?
3. Does the image contain a horizon? If yes, where?
4. Are there planar surfaces in the image? If yes, where and what are their orienta-
tions?
5. Is there any building in the image? If yes, where and what is its orientation?
6. Is there solid in the image? If yes, where is it?
7. Is there porous in the image? If yes, where is it?
The answers to the first part of each question lead to a 7D global attribute vector
(GAV) with binary values (YES or NO), where we set "YES" and "NO" to "1" and
"0", respectively. If the value for an entry is "1", we need to provide a more detailed
description of the corresponding global attribute. The knowledge of the GAV is helpful
in providing a robust labeling result. Based on the extracted global attributes, we conduct
layout reasoning and label refinement in the third stage. Layout reasoning can be greatly
simplified based on global attributes. Then, the label of each pixel can be either con-
firmed or adjusted based on inference. The design and extraction of global attributes in
the GAE stage and the layout reasoning and label refinement in the LR2 stage are two
novel contributions. They will be elaborated in Sec. 3.2.3 and Sec. 3.2.4, respectively.
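To make the GAV concrete, the following minimal Python sketch shows one possible way to store the seven binary flags together with the extra detail attached to each positive entry; the class and field names are illustrative assumptions, not part of the actual GAL implementation.

# A minimal sketch of the 7D global attribute vector (GAV): one binary
# presence flag per attribute, plus optional detail (e.g., a horizon row or a
# surface orientation) for entries set to 1. Names are illustrative only.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

ATTRIBUTES = ["sky", "ground", "horizon", "planar_surface",
              "building", "solid", "porous"]

@dataclass
class GlobalAttributeVector:
    presence: Dict[str, int] = field(
        default_factory=lambda: {a: 0 for a in ATTRIBUTES})
    detail: Dict[str, Any] = field(default_factory=dict)

    def set(self, name: str, info: Optional[Any] = None) -> None:
        # Mark the attribute as present and record its detailed description.
        self.presence[name] = 1
        if info is not None:
            self.detail[name] = info

# Example: gav = GlobalAttributeVector(); gav.set("horizon", info=120)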
3.2.2 Initial Pixel Labeling (IPL)
The method proposed by Hoiem et al. in [HEH08a] offers an excellent candidate in the first
stage to provide initial pixel-level labels of seven geometric classes (namely, sky, sup-
port, planar left/right/center, porous and solid). This method extracts color, texture
and location features from super-pixels, and uses a learning-based boosted decision tree
classifier. We also tried the CNN approach [LSD15], [BHC15], [ZJRP+15] to initialize
pixel labels. However, due to the lack of sufficient training data, the CNN solution provides
results much worse than those in [HEH08a].
To enhance the prediction accuracy for sky and support in [HEH08a], we develop a
3-class labeling scheme that classifies pixels into three major classes; namely, "support",
"vertical" and "sky", where planar left/right/center, porous and solid are merged into
one "vertical" mega-class. This 3-class classifier is achieved by integrating segmentation
results from SLIC [ASS+12], FH [FH04], and CCP [FWC+15] with a random forest
classifier [LW02] in a two-stage classification system. Figure 3.2 shows the proposed two-
stage system. In the first stage, we train individual classifiers for "support", "vertical"
and "sky" and get their probability maps using the SLIC, FH and CCP segmentations.
In the second stage, since the segmentation units are not the same, we transfer all
segments into smaller segmentation units using the FH method to obtain fine-scale
segmentation boundaries. Then, we fuse the probability outputs from the first stage to
get the final decision.
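As a rough illustration of the two-stage fusion just described, the Python sketch below assumes hypothetical helpers (segment_slic, segment_fh, segment_ccp, segment_features, majority_label_per_segment) for the three segmentation methods, per-segment features and per-segment ground-truth labels; it conveys the structure of the fusion rather than the exact implementation.

# A minimal sketch of the two-stage 3-class fusion (support / vertical / sky).
# Segment ids are assumed to be contiguous integers starting from 0.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def stage1_probability_map(image, seg_map, labels):
    """Train one random forest on a given segmentation and return an
    H x W x 3 per-pixel class probability map."""
    feats = segment_features(image, seg_map)                  # hypothetical helper
    seg_labels = majority_label_per_segment(labels, seg_map)  # hypothetical helper
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(feats, seg_labels)
    prob = rf.predict_proba(feats)                            # (num_segments, 3)
    return prob[seg_map]                                      # broadcast to pixels

def fuse_three_class(image, labels):
    # Stage 1: one classifier per segmentation method (SLIC, FH, CCP).
    prob_maps = [stage1_probability_map(image, seg_fn(image), labels)
                 for seg_fn in (segment_slic, segment_fh, segment_ccp)]
    # Stage 2: cascade the three 3-class outputs (9-D feature) on fine FH units
    # and train an SVM to make the final decision.
    fine_units = segment_fh(image)
    pixel_feat = np.concatenate(prob_maps, axis=-1)           # H x W x 9
    unit_ids = np.unique(fine_units)
    unit_feats = np.stack([pixel_feat[fine_units == u].mean(axis=0) for u in unit_ids])
    unit_labels = majority_label_per_segment(labels, fine_units)
    svm = SVC()
    svm.fit(unit_feats, unit_labels)
    return svm.predict(unit_feats)[fine_units]                # per-pixel 3-class labels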
The reason for combining different segmentation schemes is that different segmenta-
tion methods capture different levels of information. For example, the SLIC method
[ASS+12] provides more segments in the "support" region than the FH and CCP meth-
ods in the first stage. Thus, the SLIC segmentation has better location information and
a higher chance of getting the correct result for "support". The accuracy of our 3-class
labeling scheme is 88.7%, which is better than that of [HEH08a] by 2%. Several visual
comparisons are shown in Figure 3.3. We see that our 3-class labeling system can seg-
ment low-contrast "support" and "sky" regions well. After the 3-class labeling, initial labels
of the five classes inside the vertical region come directly from [HEH08a].
The accuracy of the IPL stage is strongly affected by three factors: 1) a small number
of training samples, 2) the weak discriminant power of local features, and 3) the lack
of a global scene structure. These are common challenges encountered by all machine
learning methods relying on local features with a discriminative model. To overcome
these challenges, we design global attribute vectors and integrate them into a graphical
model, as elaborated in the next subsection.
Figure 3.2: Fusion of 3-class labeling algorithms. Stage 1: Three individual random for-
est classifiers trained by segments from SLIC [ASS+12], FH [FH04] and CCP [FWC+15],
respectively, where the gray images show the probability output of each individual classi-
fier under different segmentation methods and different geometric classes. Stage 2: The
three probability outputs from Stage 1 are cascaded into one long feature vector for each
intersected segmentation unit and an SVM classifier is trained to get the final decision.
3.2.3 Global Attributes Extraction (GAE)
In the second GAE stage, we attempt to fill out the 7D binary-valued GAV and find
the related information associated with an existing element. The 7 global attributes
are: 1) the sky/ground line, 2) the horizon, 3) the planar surface, 4) the vertical line, 5)
the vanishing line, 6) the solid object, and 7) the porous material. Take image $I$ with
dimension $H \times W \times 3$ as an example. In the GAE stage, its 7D GAV will be extracted
to generate 7 probability maps denoted by $P_k$, $k = 1, \ldots, 7$, where the dimension of $P_k$
is $H \times W$ and $k$ represents one of the 7 global attributes. Furthermore, we use $P_k(s_i, l_j)$
to denote the probability for segment $s_i$ to be labeled as $l_j$, where $j$ denotes one of the
7 classes (support, left, center, right, porous, solid and sky) based on global attribute $k$.
Figure 3.3: Comparison of initial pixel labeling results (from left to right): the original
image, the 7-class labeling result from [HEH08a], the proposed 3-class labeling result,
the ground truth 7-class labeling. Our scheme offers better "support" and "sky" labels.
Sky and Ground Lines Detection. Sky and ground regions are important ingre-
dients of the geometrical layout of both natural and urban scenes. To infer their existence
and correct locations is critical to the task of scene understanding. We develop a robust
procedure to achieve this goal as illustrated in Figure 3.4. Based on the initial pixel labels
obtained in the first stage, we obtain initial sky and ground lines, which may not be
correct due to erroneous initial labels.
To fine-tune the initial sky and ground lines, we exploit the following three cues from the
input scene image for sky and ground line validation:
the line segment map denoted by $P_{\text{LS}}$ [vGJMR08], where $P_{\text{LS}} = 1$ for line pixels and $P_{\text{LS}} = 0$ for non-line pixels;
the probability edge map of structured edges denoted by $P_{\text{SE}}$ [DZ14], [ZD14], [DZ13]; and
the probability edge map of the defocus map denoted by $P_{\text{DF}}$ [ZS11].
The final probability is defined as
$$P_{\text{sky/ground line}}(s_i, l_{\text{sky/ground}}) = P_{\text{LS}} \cdot P_{\text{SE}} \cdot P_{\text{DF}}. \qquad (3.1)$$
An example is given in Figure 3.4, where all three maps have higher probability for the
sky line but lower probability scores for the ground line. As a result, the erroneous
ground line is removed.
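A minimal sketch of Eq. (3.1) is given below, assuming the three maps are precomputed H x W arrays and a candidate sky/ground line is given as a list of boundary pixels; the acceptance threshold value is an illustrative assumption.

# Combine the line-segment, structured-edge and defocus maps (Eq. 3.1) and
# keep a candidate sky/ground line only if its averaged support is high.
import numpy as np

def boundary_probability(p_ls, p_se, p_df, boundary_pixels):
    """boundary_pixels: list of (row, col) tuples along a candidate line."""
    p = p_ls * p_se * p_df                    # elementwise product, H x W
    rows, cols = zip(*boundary_pixels)
    return float(p[list(rows), list(cols)].mean())

def validate_lines(p_ls, p_se, p_df, candidates, threshold=0.3):
    """Return candidate lines whose averaged probability exceeds the
    (illustrative) threshold; the rest, e.g. an erroneous ground line, are dropped."""
    return [c for c in candidates
            if boundary_probability(p_ls, p_se, p_df, c) > threshold]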
After obtaining the sky and ground lines, we check whether there is any vertical
region above the sky line or below the ground line using the 3-class labeling scheme,
where the vertical region is either solid or porous. The new 3-class labeling scheme can
capture small vertical regions well and, after that, we will zoom into each vertical region
to refine its subclass label.
Horizon Detection. When the ground plane and the sky plane meet, we see a
horizon. This occurs in ocean scenes or scenes with a flat ground. The horizon location
helps reason about the ground plane, the vertical plane and the sky plane. For example, the
ground should not be above the horizon while the sky should not be below the horizon.
Generally speaking, 3D layout estimation accuracy can be significantly enhanced if the
ground truth horizon is available for layout reasoning [HEH08a]. Research on horizon
estimation from building images was done before, e.g., [HZ09], [BLTK10]. That is, it
can be inferred by connecting horizontal vanishing points. However, the same technique
does not apply to natural scene images where the vanishing point information is lacking.
Figure 3.4: The process of validating the existence of the sky/ground lines and their
location inference.
In our implementation, we use two different methods to estimate the horizon in
two different types of outdoor scenes. For images containing buildings, as evidenced by strong
vertical line segments, the horizon can be estimated by fitting the horizontal vanishing
points [BLTK10], [Ren13]. For natural scene images that do not have obvious vanishing
points, we propose a horizon estimation algorithm as shown in Figure 3.5. First, we
extract multiple horizontal line segments from the input image using the LSD algorithm
[vGJMR08]. Besides, we obtain the edge probability map based on [DZ14], [ZD14],
[DZ13] and use it, as well as a location prior, to assign a probability to each pixel in the
line segments. Then, we build a histogram to indicate the probability of the horizon
location. Finally, we select the most likely horizontal line to be the horizon.
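The histogram-based horizon estimator for natural scenes can be sketched as follows; the Gaussian location prior centered at the image center and the horizontal-slope test are illustrative assumptions, and the line segments are assumed to come from an LSD-style detector.

# Accumulate near-horizontal line segments, weighted by edge probability and
# a location prior, into a row histogram and pick the strongest row.
import numpy as np

def estimate_horizon(edge_prob, line_segments, image_height, sigma=0.15):
    """line_segments: list of ((x1, y1), (x2, y2)) endpoints in pixels."""
    hist = np.zeros(image_height)
    center = image_height / 2.0
    for (x1, y1), (x2, y2) in line_segments:
        if abs(y2 - y1) > 0.05 * (abs(x2 - x1) + 1e-6):
            continue                                    # keep near-horizontal segments only
        y = int(round((y1 + y2) / 2.0))
        if not (0 <= y < image_height):
            continue
        length = np.hypot(x2 - x1, y2 - y1)
        edge_w = edge_prob[y, int((x1 + x2) / 2.0)]     # edge support at the midpoint
        loc_prior = np.exp(-((y - center) / (sigma * image_height)) ** 2)
        hist[y] += length * edge_w * loc_prior
    return int(np.argmax(hist))                         # row index of the horizon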
After detecting the horizon, we perform layout reasoning to determine the sky and
support regions. One illustrative example is given in Figure 3.5. We can divide the
initially labeled segments into two regions. If a segment above (or below) the horizon is
Figure 3.5: Horizon detection and its application to layout reasoning and label refinement.
labeled as sky (or support), it belongs to the confident region. On the other hand, if a
segment above (or below) the horizon is labeled as support (or sky), it belongs to the
unconfident region. The green circled region in Figure 3.5 is labeled as sky due to its
white color. Thus, it lies in the unconfident region. There exists a conflict between the
local and global decisions. To resolve the conflict, we use the Gaussian Mixture Model
(GMM) to represent the color distributions in the confident regions above and below the
horizon. Then, we conclude that the white color under the horizon can actually be the
support (or ground) so that its label can be corrected accordingly. We use $P_{\text{horizon}}(s_i, l_j)$
to denote the probability output from the GMM.
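A minimal sketch of the GMM-based relabeling is shown below with scikit-learn; the number of mixture components and the use of the mean log-likelihood as the decision score are illustrative assumptions.

# Fit one color GMM to the confident sky region above the horizon and one to
# the confident support region below it, then re-score unconfident segments.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_gmm(pixels_rgb, n_components=5):
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(pixels_rgb.reshape(-1, 3))
    return gmm

def relabel_segment(segment_rgb, gmm_sky, gmm_support):
    """Return the label whose color model explains the segment better."""
    pixels = segment_rgb.reshape(-1, 3).astype(np.float64)
    sky_score = gmm_sky.score_samples(pixels).mean()        # mean log-likelihood
    support_score = gmm_support.score_samples(pixels).mean()
    return "sky" if sky_score > support_score else "support"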
Planar Surfaces Detection. An important by-product of sky/ground line local-
ization is the determination of planar surface orientation in the vertical region. This
is feasible since the shapes of sky and ground lines provide useful cues for planar sur-
face orientation inference in natural or urban scenes. To be more specific, we check the
Figure 3.6: Trapezoidal shape fitting with the sky line at the top and the ground line at
the bottom.
trapezoidal shape fitting scheme (including triangles and rectangles as special cases) for
the vertical region, where the top and the bottom of the trapezoidal shape are bounded
by the sky and ground lines while its left and right sides are bounded by two parallel ver-
tical lines or extend to the image boundary. We set $P_{\text{planar surface}}(s_i, l_j) = 1$, where
$j \in \{\text{left}, \text{center}, \text{right}\}$, if the corresponding surface orientation is detected. Otherwise,
we set $P_{\text{planar surface}}(s_i, l_j) = 0$.
Three trapezoidal region shape fitting examples are shown in Figure 3.6. Clearly,
different fitting shapes indicate different planar surface orientations. For example, two
trapezoidal regions with narrow farther sides indicate an alley scene as shown in the top
example. A rectangular shape indicates a frontal shot of a building as shown in the middle
example. Two trapezoidal regions with one common long near side indicate two
building facades with different orientations as given in the bottom example. Thus, the
shapes of sky and ground lines offer important cues to planar surface orientations.
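The shape cue can be reduced to a simple slope comparison; the sketch below is only an illustrative approximation of the trapezoid fitting (image coordinates with y growing downward are assumed), and the mapping from the receding direction to the planar left/right labels follows the dataset convention rather than anything specified here.

# Infer a coarse surface orientation from the slopes (dy/dx) of the fitted sky
# and ground lines over a vertical region. Near-parallel horizontal lines
# suggest a frontal (planar center) surface; converging lines suggest a
# surface receding toward the narrower side.
def surface_orientation(sky_slope, ground_slope, tol=0.05):
    if abs(sky_slope) < tol and abs(ground_slope) < tol:
        return "center"                 # rectangle-like region: frontal facade
    if sky_slope > tol and ground_slope < -tol:
        return "recedes_right"          # region narrows toward the right
    if sky_slope < -tol and ground_slope > tol:
        return "recedes_left"           # region narrows toward the left
    return "center"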
Figure 3.7: An example of vertical line detection and correction.
Vertical Line Detection. A group of parallel vertical line segments provides a
strong indicator of a building structure in scene images. It offers a valuable cue in cor-
recting wrongly labeled regions in building scene images. The probability of surface orien-
tation from the vertical line cue is denoted by $P_{\text{vertical line}}(s_i, l_j)$, where $j \in \{\text{left}, \text{center}, \text{right}\}$.
In our implementation, we use the vertical line percentage in a region as the attribute
to generate the probability map for the building region. An example of using vertical
lines to correct wrongly labeled regions is illustrated in Figure 3.7. The top region of
the building is wrongly labeled as "sky" because of the strong location and color cues in
the initial labeling result. However, the same region has a strong vertical line structure.
Since the "sky" region should not have this structure, we can correct its wrong label.
Vanishing Line Detection. A group of vanishing lines offers another global
attribute to indicate the surface orientation. In our implementation, we first use the
vertical line detection technique to obtain a rough estimate of the left and right bound-
aries of a building region and then obtain its top and bottom boundaries using the sky
line and the ground line. After that, we apply the vanishing line detection algorithm in
[HHF09] and use the obtained vanishing line result to adjust the orientation of a building
facade. We set $P_{\text{vanishing line}}(s_i, l_j) = 1$, where $j \in \{\text{left}, \text{center}, \text{right}\}$, if the corresponding
surface orientation is detected. Otherwise, we set $P_{\text{vanishing line}}(s_i, l_j) = 0$. Note that the
surface orientation of a planar surface can be obtained from its shape and/or vanishing
lines. If it is not a building, we cannot observe vanishing lines and the shape is the only
cue. If it is a building, we can observe both its shape and vanishing lines. The two cues
Figure 3.8: Examples of surface orientation refinement using the shape and the vanishing
line cues.
are consistent with each other based on our experience. Two examples of using the shape
and vanishing lines of a building region to correct erroneous initial labels are given in
Figure 3.8. The initial surface orientations provided by the IPL stage contain a large
number of errors due to the lack of a global view. They are corrected using the global
attributes.
Solid Detection. The non-planar solid class is typically composed of foreground
objects (such as people, cars, etc.) rather than the background scene. The bottom of
an object either touches the ground or extends to the image bottom boundary. For
example, objects (say, pedestrians and cars) may stand on the ground in front of the
building facade in an urban scene while objects are surrounded by the ground plane in
a typical natural scene. Object detection is an active research field by itself. Actually,
the object size has an influence on the scene content, i.e., whether it is an object-centric
or a scene-centric image. The object size is big in an object-centric image. It occupies a
large portion of the image, and most of the background is occluded. The scene layout
problem is of less interest since the focus is on the object rather than the background.
In contrast, there is no dominant object in a scene-centric image and the scene layout
problem is more significant. On one hand, it is better to treat the object detection
problem independently from scene analysis. On the other hand, a scene image may still
contain some objects that occlude the background scene. For example,
objects may occlude parts of the sky and ground lines, thus increasing
the complexity of scene layout estimation. In order to simplify the scene understanding
problem, it is desirable to identify and remove objects first.
In our implementation, we apply two object detectors (namely, the person detector
and the car detector) from [GFM], [FGMR10]. We first obtain the bounding boxes
of detected objects and, then, adopt the grab cut method [RKB04] to get their exact
contours. The grab cut method uses a binary value to indicate whether a pixel belongs to
an object or not. Mathematically, we have $P_{\text{solid}}(s_i, l_{\text{solid}}) = 1$ at locations where a solid object is
detected, and $P_{\text{solid}} = 0$ otherwise. Two examples are shown in Figure 3.9. To achieve
a better layout estimation, detected and segmented objects are removed from the scene
image to allow more reliable sky and ground line detection.
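Only the GrabCut refinement step is shown in the sketch below (using OpenCV); the person/car detector that supplies the bounding box is assumed, and the iteration count is an illustrative choice.

# Refine a detector bounding box into a binary solid-object mask with GrabCut.
import cv2
import numpy as np

def object_mask_from_box(image_bgr, box, iters=5):
    """box = (x, y, w, h) from a person or car detector."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, box, bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_RECT)
    # Pixels marked as definite or probable foreground form the object mask.
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return fg.astype(np.uint8)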
Figure 3.9: Examples of obtaining the object mask using the person and the car object
detectors and the grab cut segmentation method: (a) the bounding box result of the
person detector, (b) the person segmentation result, (c) the bounding box result of the
car detector, and (d) the car segmentation result.
Porous Detection. As shown in the bottom subfigure of Figure 3.10, the mountain
region, which belongs to the solid class, is labeled as porous by mistake. This is due to the
fact that the super-pixel segmentation method often merges the porous region with its
background. Thus, it is difficult to split them in the classification stage. To overcome this
Figure 3.10: Two examples of initially labeled porous regions where the top (trees) is
correctly labeled while the bottom (mountain) is wrongly labeled.
difficulty, we add the contour randomness feature from the structured edge result [DZ14],
[ZD14], [DZ13] to separate the porous and the solid regions. Note that $P_{\text{porous}}(s_i, l_{\text{porous}})$
is proportional to the contour randomness. By comparing the two examples in Figure 3.10,
we see that there exist irregular contours inside the true porous region (trees) but regular
contours inside the solid region (mountain). In our implementation, we double check
regions initially labeled as solid or porous and use the contour smoothness/randomness
to separate solid/porous regions.
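The thesis only states that the porous probability is proportional to contour randomness; the sketch below uses the entropy of local edge orientations inside a region as one possible proxy for that randomness.

# A proxy for contour randomness: entropy of gradient orientations of strong
# structured-edge pixels inside a candidate region (higher = more irregular).
import numpy as np

def contour_randomness(edge_prob, region_mask, edge_threshold=0.2, nbins=16):
    gy, gx = np.gradient(edge_prob)
    strong = (edge_prob > edge_threshold) & region_mask
    if not strong.any():
        return 0.0
    angles = np.arctan2(gy[strong], gx[strong])
    counts, _ = np.histogram(angles, bins=nbins, range=(-np.pi, np.pi))
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())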
3.2.4 Layout Reasoning and Label Refinement (LR2)
The 7D GAV can characterize a wide range of scene images. In this section, we focus on
layout reasoning using the extracted global attribute vector. We first analyze the outdoor
scenes from the simplest setting to the general setting as shown in Figure 3.11. Then, we
propose a Conditional Random Field (CRF) minimization framework for outdoor layout
reasoning.
The simplest case of interest is a scene with a horizon as shown in Figure 3.11(a).
This leads to two regions, sky and support, where the support could be the ocean or the ground.
The existence of occluders makes this case slightly more complicated. The sky may
have two kinds of occluders - solid (e.g., balloon, bird, airplane, kite, parachute) and porous
(e.g., tree leaves near the camera). The ocean/ground may also have occluders - solid
(e.g., a boat in the ocean, a fence/wall near the camera) and porous (e.g., bushes near the
camera). Although there are rich scene varieties, our layout reasoning framework can
confidently remove segments labeled with planar center/left/right in this simplest case.
The horizon will be split into individual sky and ground lines if there exist large planar
surfaces that completely or partially block the horizon as shown in Figure 3.11(b). It is a
background-only scene if there is no occluder. The large planar surfaces can be buildings
in urban scenes and mountains in natural scenes. Generally speaking, any region between
the sky and ground lines is a strong candidate for planar surfaces. We can further classify
them into planar left/right/center based on vanishing lines or the outer contour of the
surface region. A general scene may have all kinds of solid/porous occluders (e.g., people,
cars, trees, etc.) in front of a background scene as shown in Figure 3.11(c). The occluders
may block the ground line, the planar surfaces and other solid/porous objects, making
layout inference very challenging. In the following, we exploit the initial pixel labeling and
all global attributes and formulate layout reasoning as a Conditional Random Field (CRF)
[FVS+09], [SLZ+15] optimization problem.
Problem Formulation: We model each image as a graph denoted by $G(S, E)$,
where each node is a super-pixel segment and a member of $S$, and two adjacent vertices
are connected by an edge which is a member of $E$. Our goal is to minimize the following
energy function given the adjacency graph $G(S, E)$ and a weight $\lambda$:
$$E(L) = \sum_{s_i \in S} \Phi(l_i \mid s_i) + \lambda \sum_{(s_i, s_j) \in E} \Psi(l_i, l_j \mid s_i, s_j), \qquad (3.2)$$
where $S$ is the segmentation under consideration, $l_i$ is the class label assigned to the $i$-th
segment $s_i$, $\Phi$ is the unary potential function, $\Psi$ is the pairwise potential function, and
$\lambda$ is a tradeoff between the unary potential and the pairwise potential. Empirically, we
set $\lambda = 0.1$.
We classify the seven proposed global attributes into two major groups. The first
group consists of the horizon, the vertical line, the solid and the porous. Their probability
maps are determined by the appearance properties of segments rather than the context
relationship among adjacent segments. It is proper to adopt the unary potential for their
probability outputs. The second group consists of the sky/ground line, the vanishing
line, and the left- or right-oriented surfaces, which describe the relationship among adjacent
segments. Thus, we consider the pairwise potential for these attributes. Specifically, we
have the following definitions of the unary potential and the pairwise potential.
We define the unary potential for each segment $s_i$ as
$$\Phi(l_i \mid s_i) = -\log(P(l_i \mid s_i)), \qquad (3.3)$$
where
$$P(l_i \mid s_i) = \mathbf{w}^{T}
\begin{pmatrix}
P_{\text{initial}}(l_i \mid s_i) \\
P_{\text{porous}}(l_i \mid s_i) \\
P_{\text{solid}}(l_i \mid s_i) \\
P_{\text{horizon}}(l_i \mid s_i) \\
P_{\text{vertical line}}(l_i \mid s_i)
\end{pmatrix} \qquad (3.4)$$
is the probability that segment $s_i$ is predicted with label $l_i$. In words, the probability $P(l_i \mid s_i)$
is the weighted average of five components: 1) the initial label probability $P_{\text{initial}}(l_i \mid s_i)$, which
comes from the label probabilities of our 3-class classifier and the vertical-class classifier from
[HEH+05a]; 2) the porous detector $P_{\text{porous}}(l_i \mid s_i)$; 3) the solid detector $P_{\text{solid}}(l_i \mid s_i)$; 4) the horizon line
inference $P_{\text{horizon}}(l_i \mid s_i)$; and 5) the vertical line inference $P_{\text{vertical line}}(l_i \mid s_i)$. The weight vector
$\mathbf{w}$ is a 5D vector that determines the contributions of the five components in the unary
potential.
We define the pairwise potential for a pair of adjacent segments $s_i$ and $s_j$ as
$$\Psi(l_i, l_j \mid s_i, s_j) = (\phi_s + \phi_v + \phi_p)\,[l_i \neq l_j], \qquad (3.5)$$
where $[\cdot]$ is a zero-one indicator function and
$$\phi_s = -\log\big(P^{b}_{\text{sky/ground line}}(s_i, s_j)\big), \qquad (3.6)$$
$$\phi_v = -\log\big(\|P_{\text{vanishing line}}(s_i, l_i) - P_{\text{vanishing line}}(s_j, l_j)\|\big), \qquad (3.7)$$
$$\phi_p = -\log\big(\|P_{\text{planar surface}}(s_i, l_i) - P_{\text{planar surface}}(s_j, l_j)\|\big), \qquad (3.8)$$
where $P^{b}_{\text{sky/ground line}}(s_i, s_j)$, $P_{\text{vanishing line}}(s_i, l_i)$, and $P_{\text{planar surface}}(s_i, l_i)$ denote the aver-
aged probability along the shared boundary of $s_i$ and $s_j$, the probability that $s_i$ is labeled as
$l_i$ from the vanishing line information, and the probability that $s_i$ is labeled as $l_i$ from the planar sur-
face information, respectively, and $\|\cdot\|$ is the norm of the probability difference between
segments $s_i$ and $s_j$.
In Eq. 3.6, when there is neither a sky nor a ground line across $s_i$ and $s_j$,
$P^{b}_{\text{sky/ground line}}(s_i, s_j)$ should be small. However, $\phi_s$ would be large if $s_i$ and $s_j$ have
different labels. To minimize the energy function in Eq. 3.2, $\phi_s$ forces $s_i$ and $s_j$ to take
the same label. This explains how the ground line is removed in the example given
in Figure 3.4. Furthermore, $\phi_v$ and $\phi_p$ are the pairwise potentials computed from the
probability map of vanishing lines and the probability map of planar surface orientation,
as shown in Eqs. 3.7 and 3.8, respectively. A smoothness constraint is imposed by $\phi_v$ (or
$\phi_p$). $\phi_v$ (or $\phi_p$) penalizes more if $s_i$ and $s_j$ have different labels yet their vanishing
line (or planar surface orientation) difference at segments $s_i$ and $s_j$ is small. This explains
how the surface orientation smoothness can be achieved in the examples given in Figure
3.8.
We first learn the parameters, $\mathbf{w}$ and $\lambda$, of the CRF model by maximizing the
conditional likelihood through cross-validation on the training data. Then, the multi-
label graph cut optimization method [FVS+09] is used to minimize the energy function
in Eq. 3.2.
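To make the CRF construction concrete, the Python sketch below assembles the unary term of Eqs. (3.3)-(3.4) and the pairwise term of Eqs. (3.5)-(3.8) from precomputed probability maps; the alpha_expansion call is a placeholder for a multi-label graph cut solver and is not a real API.

# Assemble the CRF potentials of Eq. (3.2) from per-segment probability maps.
import numpy as np

EPS = 1e-6

def unary_potential(p_initial, p_porous, p_solid, p_horizon, p_vertical, w):
    """Eqs. (3.3)-(3.4): -log of the weighted average of five cues.
    Each p_* has shape (num_segments, num_labels); w is the learned 5-D weight."""
    stacked = np.stack([p_initial, p_porous, p_solid, p_horizon, p_vertical])
    p = np.tensordot(w, stacked, axes=1)          # (num_segments, num_labels)
    return -np.log(np.clip(p, EPS, 1.0))

def pairwise_cost(p_boundary, p_vanish_i, p_vanish_j, p_planar_i, p_planar_j):
    """Eqs. (3.6)-(3.8): cost charged only when the two neighboring segments
    take different labels (the indicator of Eq. (3.5) is applied by the solver)."""
    phi_s = -np.log(np.clip(p_boundary, EPS, 1.0))
    phi_v = -np.log(np.clip(np.linalg.norm(p_vanish_i - p_vanish_j), EPS, None))
    phi_p = -np.log(np.clip(np.linalg.norm(p_planar_i - p_planar_j), EPS, None))
    return phi_s + phi_v + phi_p

def infer_labels(unary, edges, edge_costs, lam=0.1):
    """Minimize Eq. (3.2) with a multi-label graph cut (alpha expansion);
    alpha_expansion() is a hypothetical stand-in for a graph-cut library."""
    return alpha_expansion(unary, edges, lam * np.asarray(edge_costs))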
3.3 Experimental Results
In contrast to the object detection problem, very few datasets are available for the evaluation of geomet-
ric layout algorithms. The dataset in [HEH08a] is the largest benchmarking
dataset in the geometric layout research community. There are larger geometric layout
datasets, but they contain indoor images which do not fit our problem [CCPS13a], [HHF09]. The
dataset in [HEH08a] consists of 300 images of outdoor scenes with labeled ground truth.
The first 50 images are used for training the surface segmentation algorithm as done
in previous work [HEH08a], [GEH10]. The remaining 250 images are used for evalua-
tion. We follow the same procedure in our experiments. To test our generalized global
attributes and reasoning system, we use the same 250 testing images in our experiment.
It is not a straightforward task to label the ground truth for outdoor scene images
in this dataset. It demands that subjects reason about the functionalities of different surfaces
in the 3D real world (rather than recognizing objects in images only), e.g., where is the
occlusion boundary? What is the surface orientation? Is there a depth difference?
etc. To get a consistent ground truth, subjects should be very well trained. In this
section, we will present both qualitative and quantitative evaluation results.
Qualitative Analysis. Figure 3.12 shows eight representative images (the first col-
umn) with their labeled ground truths (the last column). We show results obtained by
the CNN method [BHC15] (the second column), Hoiem et al. [HEH08a] (the third column),
Gupta et al. [GEH10] (the fourth column), and the proposed GAL (the fifth column). To
apply the CNN to this problem, we retrained the CNN using the training data with weights initial-
ized from the VGG network [BHC15]. We call the methods proposed by Hoiem et
al. [HEH08a] and Gupta et al. [GEH10] the H-method and the G-method, respectively,
in the following discussion for convenience.
In Figure 3.12 (a), the building is tilted and its facade contains multiple surfaces with
different orientations. Because of the similarity of the local visual patterns (color and
texture), the CNN method got confused between surface and porous. The H-method
is confused by the upper and lower parts of the planar-left facade, and assigns them to
planar left and planar right, two oppositely oriented surfaces. This error can be corrected by the
global attribute (i.e., the building's vertical lines). The G-method loses the detail of surface
orientation in the upper left corner so that the sky and the building surface are combined
into one single planar left surface. It also loses the ground support region. These two
errors can be corrected by sky-line and ground-line detection. GAL infers rich semantic
information about the image from the 7D GAV. Both the sky and the ground exist in the
image. Furthermore, it is a building facade due to the existence of parallel vertical lines.
It has two strong left and right vanishing points so that it has two oriented surfaces - one
planar left and the other planar right. There is a problem remaining in the intersection
region of multiple buildings in the right part of the image. Even humans may not have
an agreement on the number of buildings in that area (two, three, or four?) The ground
truth also looks strange. It demands more unambiguous global information for scene
layout inference.
For Figure 3.12 (b), the ground truth has two problems. One is in the left part of
the image. It appears to be more reasonable to label the building facade as "planar left"
rather than "planar center". Another is in the truck region. The whole truck should
have the same label; namely, gray (non-planar solid). This example demonstrates the
challenging nature of the geometric layout problem. Even humans can make mistakes
easily. Other observed human-labeling mistakes in the ground truth are reported in the
supplemental file.
Figure 3.12 (c)-(f) show the benefit of getting an accurate sky line and/or an accurate
ground line. One powerful technique to find the sky line and the ground line is to use
defocus estimation. Once these two lines are detected, the inference becomes easier so
that GAL yields better results.
Being constrained by its limited model hypotheses, the G-method chooses to have
the ground (rather than the porous) support the physical world. However, this model
does not apply to the image in (c), images with bushes at the bottom as an occluder, etc.
Also, the ground truth in (c) is not consistent in the lower region. The result obtained
by GAL appears to be more reasonable.
Figure 3.12 (d) was discussed earlier. Both the H-method and the G-method labeled
part of the ground as the sky by mistake due to color similarity. GAL can fix this
problem using horizon line detection. All three methods work well for Figure 3.12 (e).
GAL performs slightly better due to the use of the horizon detection and label refinement
stage. The right part of the image is too blurred to be of interest. GAL outperforms the
CNN-method, the H-method and the G-method due to an accurate horizon detection in
Figure 3.12 (f). Figure 3.12 (g) demonstrates the power of the sky line detection and the
object (car) detector. Figure 3.12 (h) demonstrates the importance of an accurate sky
line detection.
We also compare the "pop-up" views of the H-method and GAL in Figure 3.13 using
a 3D view rendering technique given in [HEH05b], [HEH07a]. It is clear that GAL offers
a more meaningful result.
Based on the above discussion, we can draw several concluding points. First, there
are not enough examples for the CNN to learn the visual patterns. The CNN is a data-
driven approach that extracts visual patterns from training data. For the geometric
labeling problem, labels cannot be fully determined by visual patterns. For example, the
left and right surfaces of a building have similar local color and texture patterns,
but their orientations are different. In contrast, the orientation can be captured by
global information such as shapes and lines. With very little training data, the CNN's power
is somewhat limited. Second, the local-patch-based machine learning algorithm (e.g., the
H-method) lacks the global information needed for 3D layout reasoning. It is difficult to develop
complete outdoor scene models to cover a wide diversity of outdoor images (e.g., the
G-method). GAL attempts to find key global attributes from each outdoor image. As
long as these global attributes can be determined accurately, we can combine the local
properties (in the form of initial labels) and global attributes for layout reasoning and label
refinement. Third, the proposed methodology can handle a wider range of image content.
As illustrated above, it can provide more satisfactory results for quite a few difficult cases
that confuse prior art such as the H-method and the G-method.
Quantitative Analysis. For quantitative performance evaluation, we use the metric
used in [HEH07a, GEH10] by computing the percentage of overlap between labeled pixels
and ground truth pixels (called the pixel-wise labeling accuracy). We use the original
ground truth given in [HEH+05a] for a fair comparison between all experimental results
reported below (with only one exception, which will be clearly stated).
We first compare the labeling accuracy of the CNN-method, the H-method, the G-
method and GAL for the seven geometric classes individually in Figure 3.14. We see clearly
that GAL outperforms the H-method in all seven classes by a significant margin. The
gain ranges from 1.27% (support) to more than 25% (planar left and right). GAL also
outperforms the G-method in 6 classes. The gain ranges from 0.81% (sky) to 14.9%
(solid). The G-method and GAL have comparable performance with respect to the
"planar center" class. All methods do well in sky and ground labeling. The solid class
is the most challenging one among all seven classes since it has few common geometric
structures to exploit.
The complete performance benchmarking (in terms of pixel-wise labeling accuracy)
of six different methods is shown in Table 3.1. The L-method in [LZZ14] and the
P-method in [PHK15] were dedicated to building facade labeling and their performance
was only evaluated on subsets of the full dataset in [HEH+05a]. The L-method in
[LZZ14] uses the subset which contains 100 images where the ground truth of
both occlusion boundaries and surface orientation is provided [HEH08a]. The P-method in
[PHK15] uses the subset which contains 55 building images.
We use B1, B2, F and F/R to denote the subset used in [LZZ14], the subset used
in [PHK15], the full dataset and the full dataset with relabeled ground truth in the
table, respectively. The two numbers within the parentheses (7 and 5) denote results for
all "seven" classes and for the "five" vertical classes (i.e., excluding sky and support),
respectively. For the F(7) column in Table 3.1, we compare the performance for all seven
classes in the full set. Accurate labeling for this dataset is actually very challenging,
as pointed out in [LR09], [RKAT08]. This is also evidenced by the slow performance
improvement over the last seven years - a gain of 2.35% from the H-method to the
G-method. GAL offers another gain of 4.95% over the G-method [GEH10], which is
significant. For the B1(5) column, we compare the performance for five classes in the
building subset which contains 100 images. GAL outperforms the CNN-method and the
L-method by 21.24% and 3.39%, respectively. For the B2(7) column, we compare the
performance on the building subset which contains 55 images. GAL outperforms
the CNN-method, the H-method, the G-method and the P-method by 20.4%, 8.8%,
8.08% and 6.85%, respectively. For the F(5) column, we compare the performance for
five classes in the full set. GAL outperforms the CNN-method, the H-method and the G-
method by 19.64%, 7.14% and 2.22%, respectively. Finally, for the F/R(7) column, we show the
labeling accuracy of GAL against the modified ground truth. The labeling accuracy can
go up to 80.26%.
To further analyze the performance gain of the proposed GAL system, we add the global
attributes one by one and analyze the performance gain from each
individual global attribute, which is shown in Table 3.2. The initial result for the 7 classes tested
on the full set is 74.88%, which is obtained from our 3-class labeling result and the labels
in the vertical region from [HEH+05a]. When computing the performance gain from the global
attribute horizon, $P_{\text{horizon}}$ is first computed and included in the unary term of the
CRF model. As no global attribute contributes to the pairwise term in this case, we simply apply
the same pairwise edge potentials defined in [FVS+09].
Table 3.1: Comparison of the averaged labeling accuracy (%) of six methods with respect
to the building subsets (B1, B2), the full set (F) and the full set with relabeled ground truth
(F/R), where 7 and 5 mean all seven classes and the five classes belonging to the vertical
category, respectively.

                        Dataset (Class No.)
Method                  B1(5)    B2(7)    F(5)     F(7)     F/R(7)
CNN-method [BHC15]      58.33    61.27    56.33    68.21    N/A
H-method [HEH08a]       N/A      72.87    68.80    72.41    N/A
G-method [GEH10]        N/A      73.59    73.72    74.76    N/A
L-method [LZZ14]        76.34    N/A      N/A      N/A      N/A
P-method [PHK15]        N/A      74.82    N/A      N/A      N/A
Proposed GAL            79.73    81.67    75.94    79.71    80.26
As both the planar surface attribute
and the vanishing line attribute contribute to the surface orientation accuracy improvement, we
include both of them together to measure their joint performance gain. Table 3.2 shows that all
global attributes contribute significantly to the performance gain.
Table 3.2: The performance gain from each individual attribute.

Attribute           porous    solid     horizon   vertical line   sky/ground line   vanishing line & planar surface
Performance Gain    +0.12%    +0.80%    +1.08%    +1.00%          +0.48%            +0.95%
3.4 Error Analysis
Three exemplary images that have large labeling errors are shown in Figure 3.15. Figure
3.15 (a) is difficult due to the complexity in the middle region of multiple houses. We
feel that the labeled "planar center" result by GAL for this region is still reasonable,
although it is not as refined as that offered by the ground truth. Figure 3.15 (b) is one
of the most challenging scenes in the dataset since even humans may disagree.
The current ground truth appears to be too complicated to be useful. GAL can find the
ground line but not the sky line due to the tree texture in the top part of the image.
GAL made a labeling mistake on porous in Figure 3.15 (c) due to the dominant texture in
the corresponding region. We need a better porous detector to fix this problem. The
global attributes will not be able to help much. There is a planar right region in the
ground truth. Although this label is accurate to human eyes by looking at the local
surface, the transition from support (or ground) to planar right is not natural. It could
be an alternative to treat the whole region in the bottom part as support. As shown in
this image, it is extremely difficult to find an exhaustive set of scene models to fit all
different situations.
3.5 Conclusion
A novel GAL geometric layout labeling system for outdoor scene images was proposed in
this work. GAL exploits both local and global attributes to achieve higher accuracy and
it offers a major advancement in solving this challenging problem. Its performance was
analyzed both qualitatively and quantitatively. Besides, several error cases were studied
to reveal the limitations of the proposed GAL system.
Clearly, there are still many interesting problems remaining, including the develop-
ment of better global attribute extraction tools, integration with CNN detectors,
and the design of more powerful and more general inference rules. Furthermore, due to
major differences between indoor and outdoor scene images, the key global attributes
of indoor scene images will be different from those of outdoor scene images. So, more
research needs to be done to develop a GAL system for indoor scenes. Future research
is planned to determine important global attributes of indoor scene images and to study
their geometric layout.
Figure 3.11: A proposed framework for geometric layout reasoning: (a) the simplest
case, (b) the background scene only, and (c) a general scene, where SL, GL, H, O denote
the sky line, the ground line, the horizon and the occluder, respectively.
Figure 3.12: Qualitative comparisons of four geometric layout algorithms (from left to
right): the original image, the CNN-method [BHC15], Hoiem et al. [HEH08a], Gupta et al.
[GEH10], the GAL and the ground truth. The surface layout color codes are magenta
(planar left), dark blue (planar center), red (planar right), green (non-planar porous),
gray (non-planar solid), light blue (sky), black (support).
Figure 3.13: Comparison of 3D rendered views based on geometric labels from the H-
method (left) and GAL (right).
Figure 3.14: Comparison of labeling accuracy between the CNN methods, the H-method,
the G-method and GAL with respect to seven individual labels.
Figure 3.15: Error analysis of the proposed GAL system with three exemplary images
(one example per row and from left to right): the original image, the labeled result of
GAL and the ground truth.
Chapter 4
A Coarse-to-Fine Indoor Layout
Estimation (CFILE) Method
4.1 Introduction
The task of spatial layout estimation of indoor scenes is to locate the boundaries of the
floor, the walls and the ceiling. It is equivalent to the problem of semantic surface labeling.
The segmented boundaries and surfaces are valuable for a wide range of computer vision
applications such as indoor navigation [KHFH11], object detection [HHF10] and aug-
mented reality [KHFH11], [LSK+15], [XF14], [MBHRS14]. Estimating the room layout
from a single RGB image is a challenging task. This is especially true in highly clut-
tered rooms since the ground and wall boundaries are often occluded by various objects.
Besides, indoor scene images can be shot from different viewpoints with large intra-class
variation. As a result, high-level reasoning is often required to avoid confusion and
uncertainty. For example, the global room model and its associated geometric reasoning
can be exploited for this purpose. Some researchers approach this layout problem by
adding the depth information [ZKSU13, GZH15].
The indoor room layout estimation problem has been actively studied in recent years.
Hedau et al. [HHF09] formulated it as a structured learning problem. It first generates
hundreds of layout proposals based on inference from vanishing lines. Then, it uses
the line membership features and the geometric context features to rank the obtained
proposals and chooses the one with the highest score as the desired final result.
In this work, we propose a coarse-to-fine indoor layout estimation (CFILE) method.
Its pipeline is shown in Figure 4.1. The system uses an RGB image as its input and
Figure 4.1: The pipeline of the proposed coarse-to-fine indoor layout estimation (CFILE)
method. For an input indoor image, a coarse layout estimate that contains large surfaces
and their boundaries is obtained by a multi-task fully convolutional neural network
(MFCN) in the first stage. Then, occluded lines and missing lines are filled in and
possible layout choices are ranked according to a pre-defined score function in the second
stage. The one with the highest score is chosen as the final output.
provides a box layout as its output. The CFILE method consists of two stages: 1)
coarse layout estimation; and 2) fine layout localization. In the first stage, we adopt
a multi-task fully convolutional neural network (MFCN) [DHS15], [SRK17] to obtain
a coarse-scale room layout estimate. This is motivated by the strength of the FCN in
semantic segmentation [LSD15] and contour detection [XT15]. The FCN has a strong
discriminant power in handling a large variety of indoor scenes using the surface property
and the layout contour property. It can provide a robust estimate in the presence of
cluttered objects, which is close to the ground truth globally. In the second stage, being
motivated by structured learning, we formulate an optimization framework that enforces
several constraints such as layout contour straightness, surface smoothness and geometric
constraints for layout detail refinement.
It is worthwhile to emphasize that the spatial layout estimation problem is different
from the semantic object segmentation problem in two aspects. First, the spatial layout
problem targets the labeling of the semantic surfaces of an indoor room rather than the objects
in the room. Second, we have to label occluded surfaces while semantic segmentation
does not deal with the occlusion problem at all. It is also different from the contour
detection problem since occluded layout contours have to be detected.
The rest of this chapter is organized as follows. The proposed CFILE method is
described in detail in Sec. 4.2. Experimental results are shown in Sec. 4.3. Error
analysis is illustrated in Sec. 4.4. Concluding remarks are drawn in Sec. 4.5.
4.2 Coarse-to-Fine Indoor Layout Estimation (CFILE)
4.2.1 System Overview
Most research on indoor layout estimation [GZH15], [HHF09], [GHKB10b], [SFPU13],
[ZZ13], [PBF+12], [RPJT13] is based on the "Manhattan World" assumption. That is,
a room contains three orthogonal directions indicated by three groups of vanishing lines.
Hedau et al. [HHF09] presented a layout model based on 4 rays and a vanishing point.
The model can be written as
$$\text{Layout} = (l_1, l_2, l_3, l_4, v), \qquad (4.1)$$
where $l_i$ is the $i$-th line and $v$ is the vanishing point. If $(l_1, l_2, l_3, l_4, v)$ can be easily
detected without any ambiguity, the layout problem is straightforward. One example is
given in Figure 4.2 (a), where five surfaces are visible in the image without occlusion.
However, more challenging cases exist. Vertices $p_i$ and $e_i$ in Figure 4.2 (a) may lie
outside the image. One example is shown in Figure 4.2 (b). Furthermore, vertices $p_2$
and $p_3$ are floor corners and they are likely to be occluded by objects. Furthermore, line
$l_2$ may be entirely or partially occluded as shown in Figure 4.2 (c). Lines $l_3$ and $l_4$ are
wall boundaries, and they can be partially occluded but not fully occluded. Line $l_1$ is
the ceiling boundary, which is likely to be visible.
The proposed CFILE system consists of two stages as illustrated in Figure 4.1. In
the first stage, we propose a multi-task fully convolutional neural network (MFCN) to
offer a coarse yet robust layout estimation. Since the CNN is weak in imposing spatial
smoothness and conducting geometric reasoning, it cannot provide a fine-scale layout
result. In the second stage, we first use the coarse layout from the MFCN as the guidance
to detect a set of critical lines. Then, we generate a small set of high quality layout
Figure 4.2: Illustration of a layout model $\text{Layout} = (l_1, l_2, l_3, l_4, v)$ that is parameterized
by four lines and a vanishing point: (a) an easy setting where all five surfaces are
present; (b) a setting where some surfaces are outside the image; (c) a setting where key
boundaries are occluded.
hypotheses based on these critical lines. Finally, we dene a score function to select
the best layout as the desired output. Detailed tasks in these two stages are elaborated
below.
4.2.2 Coarse Layout Estimation via MFCN
We adopt a multi-task fully convolutional neural network (MFCN) [LSD15, ML15,
DHS15] to learn the coarse layout of indoor scenes. The MFCN [DHS15] shares features
in the convolutional layers with those in the fully connected layers and builds
different branches for multi-task learning. The total loss of the MFCN is the sum of the
losses of the individual tasks. The proposed two-task network structure is shown in
Figure 4.3. We use the VGG-16 architecture for the fully convolutional layers and train
the MFCN for two tasks jointly, i.e., one for layout learning and the other for semantic
surface learning (including the floor, left-, right- and center-walls and the ceiling). Our
work is different from that in [ML15], where the layout is trained together with geometric
context labels [HEH+05a], [HEH07b], which contain object labels. Here, we train the
layout and semantic surface labels jointly. By removing objects from consideration, the
boundaries of semantic surfaces and layout contours can be matched even in occluded
regions, leading to a clearer layout. As compared to the work in [DFCS16], which adopts
a fully convolutional neural network to learn semantic surfaces with a single-task network,
our network has two branches, and their learned results can help each other.
Figure 4.3: Illustration of the FCN-VGG16 with two output branches (the VGG-16 trunk
consists of convolutional blocks with 64, 128, 256, 512 and 512 channels with max pooling,
followed by a 4096-dimensional fully connected layer and deconvolutional output layers).
We use one branch for coarse layout learning and the other branch for semantic surface
learning. The input image is resized to 404 × 404 to match the receptive field size of the
filter at the fully connected layer.
The receptive field of the filter at the fully connected layer of the FCN-VGG16 is
404 × 404, which is independent of the input image size [LSD15], [XVR+15]. Xu et al.
[XVR+15] attempted to vary the FCN training image size so as to capture different levels
of detail in the image content. If the input image is larger than the receptive field, the
filter of the fully connected layer looks at only a part of the image. If the input image
is smaller than the receptive field, it is padded with zeros and spatial resolution is lost.
Since the layout describes the global structure of the whole image, we resize the input
image to 404 × 404 so that the filter examines the whole image.
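For concreteness, the following Python (PyTorch) sketch illustrates the two-branch idea: a shared VGG-16 convolutional trunk feeding two heads, one for the layout contour map and one for the semantic surfaces, with the total loss taken as the sum of the two task losses. The head design (a 1 × 1 convolution plus bilinear upsampling) and the unweighted loss sum are simplifying assumptions, not the exact architecture used in this work.

import torch
import torch.nn as nn
import torchvision

class TwoBranchMFCN(nn.Module):
    """Shared VGG-16 trunk with two heads: layout contours and semantic surfaces."""
    def __init__(self, num_surface_classes=5):
        super().__init__()
        self.trunk = torchvision.models.vgg16().features          # convolution/pooling layers
        self.layout_head = nn.Conv2d(512, 2, kernel_size=1)       # background vs. layout contour
        self.surface_head = nn.Conv2d(512, num_surface_classes, kernel_size=1)
        self.upsample = nn.Upsample(size=(404, 404), mode='bilinear', align_corners=False)

    def forward(self, x):
        feat = self.trunk(x)                                       # (B, 512, H/32, W/32)
        return self.upsample(self.layout_head(feat)), self.upsample(self.surface_head(feat))

def multitask_loss(layout_logits, surface_logits, layout_gt, surface_gt):
    """Total loss is the sum of the per-task cross-entropy losses."""
    ce = nn.CrossEntropyLoss()
    return ce(layout_logits, layout_gt) + ce(surface_logits, surface_gt)

if __name__ == "__main__":
    model = TwoBranchMFCN()
    layout, surface = model(torch.randn(1, 3, 404, 404))
    print(layout.shape, surface.shape)   # (1, 2, 404, 404) and (1, 5, 404, 404)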
4.2.3 Layout Refinement
There are two steps in structured learning: 1) generating a hypothesis set; and 2)
defining a score function and searching for the structure in the hypothesis set that
maximizes the score function. We attempt to improve both steps.
Given an input image I of size w × h × 3, the output of the coarse layout from the
proposed MFCN in Figure 4.3 is a probability function of the form

P^(k)_{ij} = Pr(L_{ij} = k | I),   ∀ k ∈ {0, 1}, i ∈ [1, ..., h], j ∈ [1, ..., w],   (4.2)

where L is an image of size w × h that maps each pixel of the original image, I_{ij}, to a
label L_{ij} ∈ {0, 1}, where 0 denotes a background pixel and 1 denotes a layout pixel.
One way to estimate the final layout from the MFCN output is to select the label with
the highest score, namely

L̂_{ij} = argmax_k P^(k)_{ij},   ∀ i ∈ [1, ..., h], j ∈ [1, ..., w].   (4.3)
It is worthwhile to point out that L̂_{ij} generated from the MFCN output is noisy
for two reasons. First, the contour from the MFCN is thick and not straight, since the
convolution and pooling operations gradually lose spatial resolution across stages. Second,
the occluded floor boundary (e.g., the l_2 line in Figure 4.2) is more difficult to detect
since it is less visible than the other contours (e.g., the l_1, l_3 and l_4 lines in Figure 4.2).
We need to address these two challenges in defining a score function.
The optimal solution of Eq. (4.3) is difficult to obtain directly. Instead, we first generate
layout hypotheses that are close to the globally optimal layout, denoted by L*, in the
layout refinement algorithm. Then, we define a novel score function to rank the layout
hypotheses and select the one with the highest score as the final result.
4.2.3.1 Generation of High-Quality Layout Hypotheses
Our objective is to find a set of layout hypotheses that contains fewer yet more robust
proposals in the presence of occluders. Then, the best layout, i.e., the one with the
smallest error, can be selected.
Vanishing Line Sampling. We first threshold the layout contour obtained by the
MFCN, convert it into a binary mask, and dilate it by 4 pixels to obtain a binary mask
image denoted by C. Then, we apply the vanishing-line detection algorithm [GHKB10b]
to the original image and select the lines inside the binary mask as critical lines
l_i(original), shown as solid lines in Figure 4.4 (c), (d) and (e) for the ceiling, wall and
floor, respectively. Candidate vanishing points v are generated by a grid search around
the initial v from [GHKB10b].
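These masking and selection steps can be sketched in Python as follows; the line representation (endpoint pairs), the point-sampling density and the inside-mask ratio threshold are illustrative assumptions.

import numpy as np
from scipy.ndimage import binary_dilation

def critical_line_mask(layout_prob, thresh=0.5, dilate_px=4):
    """Threshold the MFCN contour map and dilate it to form the binary mask C."""
    return binary_dilation(layout_prob > thresh, iterations=dilate_px)

def select_lines_inside_mask(lines, mask, min_inside_ratio=0.8):
    """Keep detected vanishing lines whose sampled points mostly fall inside mask C."""
    kept = []
    h, w = mask.shape
    for (x0, y0, x1, y1) in lines:
        ts = np.linspace(0.0, 1.0, 50)
        xs = np.clip((x0 + ts * (x1 - x0)).astype(int), 0, w - 1)
        ys = np.clip((y0 + ts * (y1 - y0)).astype(int), 0, h - 1)
        if mask[ys, xs].mean() >= min_inside_ratio:
            kept.append((x0, y0, x1, y1))
    return kept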
Handling Undetected Lines. There are cases where no vanishing lines are detected
inside C because of low contrast, e.g., for the wall boundaries l_3 (or l_4). If ceiling
corners are available, l_3 (or l_4) is filled in by connecting the ceiling corners to the
vertical vanishing point. If no ceiling corners are present in the image, the missing l_3
(or l_4) is estimated by logistic regression using the layout points in L.
Handling Occluded Lines. As discussed earlier, the floor line l_2 can be entirely or
partially occluded. One illustrative example is shown in Figure 4.4, where l_2 is partially
occluded. If l_2 is partially occluded, the occluded part of l_2 can be recovered by line
extension. For an entirely occluded l_2, if we simply search lines inside C or uniformly
sample lines [ML15], the layout proposal will not be accurate since the occluded boundary
line cannot be recovered. Instead, we automatically fill in occluded lines based on
geometric rules. If p_2 (or p_3) is detectable by connecting the detected l_3 (or l_4) to
e_2 v (or e_3 v), l_2 is computed as the line passing through the available p_2 or p_3 and
the vanishing point associated with l_2. If neither p_2 nor p_3 is detectable, l_2 is
estimated by logistic regression using the layout points in L.
In summary, the final l_critical used in generating layout hypotheses is the union of
three parts:

l_critical = l_i(original) ∪ l_i(occluded) ∪ l_i(undetected),   (4.4)
where l_i(original) denotes the detected vanishing lines inside C, l_i(occluded) denotes the
recovered occluded boundaries, and l_i(undetected) denotes vanishing lines that were not
detected because of low contrast but were recovered by geometric reasoning. These three
types of lines are shown in Figure 4.4.

Figure 4.4: Illustration of critical line detection for better layout hypothesis generation:
(a) coarse layout, (b) vanishing lines, and critical lines for (c) ceiling, (d) wall and (e)
floor. For a given input image, the coarse layout offers a mask that guides vanishing line
selection and critical line inference. The solid lines indicate detected vanishing lines
inside C. The dashed wall lines indicate wall lines that are not detected but inferred
inside mask C from ceiling corners. The dashed floor lines indicate floor lines that are
not detected but inferred inside mask C.

With the critical lines l_critical and the candidate vanishing points v, we generate all
possible layouts L using the layout model described in Sec. 4.2.1.
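This hypothesis generation can be sketched as an enumeration over the critical-line pools, as in the following Python snippet; the grouping of critical lines into per-boundary pools and the validity check are assumptions made for illustration.

from itertools import product

def generate_layout_hypotheses(ceiling_lines, floor_lines, left_wall_lines,
                               right_wall_lines, vanishing_points, is_valid):
    """Enumerate layout hypotheses (l1, l2, l3, l4, v) from critical-line pools."""
    hypotheses = []
    for l1, l2, l3, l4, v in product(ceiling_lines, floor_lines,
                                     left_wall_lines, right_wall_lines,
                                     vanishing_points):
        layout = (l1, l2, l3, l4, v)
        if is_valid(layout):           # e.g., reject self-intersecting configurations
            hypotheses.append(layout)
    return hypotheses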
4.2.3.2 Layout Ranking
We use the coarse layout probability map P as a weight mask to evaluate a layout.
The score function is defined as

S(L | P) = (1/N) Σ_{(i,j): L_{i,j} = 1} P_{i,j},   (4.5)

where P is the output of the MFCN, L is a layout from the hypothesis set, and N is a
normalization factor equal to the total number of layout pixels in L. Then, the optimal
layout is selected by

L* = argmax_L S(L | P).   (4.6)

The score function favors layouts that are well aligned with the coarse layout.
Figure 4.5 shows one example where the layout hypotheses are ranked using the score
function in Eq. (4.6). The layout with the highest score is chosen as the final result.
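A direct implementation of Eqs. (4.5) and (4.6) is sketched below in Python; the layouts are assumed to be given as binary contour masks of the same size as P.

import numpy as np

def layout_score(layout_mask, prob_map):
    """S(L | P): mean coarse-layout probability over the layout's contour pixels."""
    n_pixels = layout_mask.sum()
    if n_pixels == 0:
        return 0.0
    return float(prob_map[layout_mask > 0].sum() / n_pixels)

def select_best_layout(layout_masks, prob_map):
    """Return the hypothesis maximizing S(L | P), together with its score."""
    scores = [layout_score(m, prob_map) for m in layout_masks]
    best = int(np.argmax(scores))
    return layout_masks[best], scores[best]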
Figure 4.5: Example of layout ranking using the proposed score function (hypothesis
scores, from best to worst: 0.242, 0.221, 0.201, 0.191, 0.184, 0.140).
4.3 Experiments
4.3.1 Experimental Setup
We evaluate the proposed CFILE method on two popular datasets; namely, Hedau's
dataset [HHF09] and the LSUN dataset [ML15]. Hedau's dataset contains 209 training
images, 53 validation images and 105 test images. Mallya et al. [ML15] expanded
Hedau's dataset by adding 75 new images to the training set while keeping the validation
and test sets unchanged; the result is referred to as the Hedau+ dataset. We conduct
data augmentation for the Hedau+ dataset as done in [ML15] by cropping, rotation,
scaling and luminance adjustment in the training of the MFCN. The LSUN dataset
[ML15] contains 4000 training images, 394 validation images and 1000 test images.
Since no ground truth is released for the 1000 test images, we evaluate the proposed
method on the 394 validation images only. We resize all images to 404 × 404 by bicubic
interpolation in the MFCN training, and train two coarse layout models for the two
datasets separately.
The Hedau+ dataset provides both the layout and the geometric context labels, but it
does not provide semantic surface labels. Thus, we use the layout polygons provided in
the dataset to generate semantic surface labels. The LSUN dataset provides semantic
surface labels but not the layout. We detect edges on the semantic surface labels and
dilate them to a width of 7 pixels for the MFCN training. The semantic surface labels
provided in the LSUN dataset are not consistent; the same semantic surface is labeled
differently among images. We relabel all the training data to make the labels consistent,
as illustrated in Figure 4.6. Following [ML15], we use the NYUDv2 RGBD dataset
[GAM13] for semantic segmentation to initialize the MFCN. We set the base learning
rate to 10^-4 with momentum 0.99.
Figure 4.6: Illustration of ground truth relabeling for the LSUN dataset (columns:
original image, original label, new label). Surface labels: 1 frontal wall, 2 left wall,
3 right wall, 4 floor, 5 ceiling.
We adopt two performance metrics: the pixel-wise error and the corner error. To
compute the pixel-wise error, the obtained layout segmentation is mapped to the ground
truth layout segmentation, and the pixel-wise error is the percentage of pixels whose
labels disagree. To compute the corner error, we sum up the Euclidean distances between
the obtained corners and their associated ground truth corners.
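The two metrics can be computed as in the following Python sketch; the label-matching step (mapping predicted surfaces to ground truth surfaces) is assumed to have been done already, and the corner error is returned as a raw pixel sum as described above (the percentage normalization used in the benchmark tables is omitted here).

import numpy as np

def pixel_error(pred_labels, gt_labels):
    """Fraction of pixels whose predicted surface label disagrees with the ground truth."""
    return float((pred_labels != gt_labels).mean())

def corner_error(pred_corners, gt_corners):
    """Sum of Euclidean distances between predicted corners and matched ground truth corners."""
    pred = np.asarray(pred_corners, dtype=float)
    gt = np.asarray(gt_corners, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=1).sum())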
4.3.2 Experimental Results and Discussion
The coarse layout scheme described in Sec. 4.2.2 is first evaluated using the methodology
in [AMFM11]. We compare our results, denoted by MFCN_1 and MFCN_2, against the
informative edge method [ML15], denoted by FCN, in Table 4.1. Both of our proposed
coarse layout schemes have higher ODS (fixed contour threshold) and OIS (per-image
best threshold) scores. This indicates that they provide more accurate regions for
vanishing line sampling in layout hypothesis generation.
We use several exemplary images to demonstrate that the proposed coarse layout
results are robust and close to the ground truth. That is, we compare visual results of
the FCN in [ML15] and the proposed MFCN_2 in Figure 4.7.
Table 4.1: Performance comparison of coarse layout results for Hedau's test dataset,
where the performance metrics are the fixed contour threshold (ODS) and the per-image
best threshold (OIS) [AMFM11]. FCN indicates the informative edge method in [ML15].
Both MFCN_1 and MFCN_2 are proposed in our work; they correspond to joint training
of the layout and semantic surfaces on the original image size (MFCN_1) and on the
downsampled image size 404 × 404 (MFCN_2).

Method           | ODS   | OIS
FCN [ML15]       | 0.255 | 0.263
MFCN_1 (ours)    | 0.265 | 0.284
MFCN_2 (ours)    | 0.265 | 0.291
Table 4.2: Performance benchmarking for Hedau's dataset.

Method                              | Pixel Error (%)
Hedau et al. (2009) [HHF09]         | 21.20
Del Pero et al. (2012) [PBF+12]     | 16.30
Gupta et al. (2010) [GHKB10b]       | 16.20
Zhao et al. (2013) [ZZ13]           | 14.50
Ramalingam et al. (2013) [RPJT13]   | 13.34
Mallya et al. (2015) [ML15]         | 12.83
Schwing et al. (2012) [SU12]        | 12.80
Del Pero et al. (2013) [PBK+13]     | 12.70
Dasgupta et al. (2016) [DFCS16]     | 9.73
Proposed CFILE                      | 8.67
As compared to the layout results of the FCN in [ML15], the proposed MFCN_2 method
provides more robust and clearer layout results in occluded regions, which are not much
affected by object boundaries.
Next, we evaluate the performance of the proposed full layout algorithm, CFILE,
including the coarse layout estimation and the layout optimization and ranking. The
performance of several methods on Hedau's dataset and the LSUN dataset is compared
in Table 4.2 and Table 4.3, respectively. The proposed CFILE method achieves state-of-the-art
performance. It outperforms the second best algorithm by 1.16% on Hedau's
dataset and by 1.32% on the LSUN dataset.
The best six results of the proposed CFILE method for Hedau's test images are
visualized in Figure 4.8. We see from these six examples that the coarse layout estimation
algorithm is robust in highly cluttered rooms (see the second row and the fourth).
Figure 4.7: Comparison of coarse layout results (from left to right): the input image,
the coarse layout result of the FCN in [ML15], the coarse layout result of the proposed
MFCN_2, and the ground truth. The results of the MFCN_2 are more robust and provide
clearer contours in occluded regions. The first two examples are from Hedau's dataset
and the last two are from the LSUN dataset.
Table 4.3: Performance benchmarking for the LSUN dataset.

Method                            | Corner Error (%) | Pixel Error (%)
Hedau et al. (2009) [HHF09]       | 15.48            | 24.23
Mallya et al. (2015) [ML15]       | 11.02            | 16.71
Dasgupta et al. (2016) [DFCS16]   | 8.20             | 10.63
Proposed CFILE                    | 7.95             | 9.31
The layout refinement algorithm can recover occluded boundaries accurately in Figure 4.8
(a), (b), (d) and (e). It can also select the best layout among several possible layouts.
The worst three results of the proposed CFILE method for Hedau's test images are
visualized in Figure 4.9. Figure 4.9 (a) shows one example where the fine layout result
is misled by a wrong coarse layout estimate. Figure 4.9 (b) is a difficult case: the left
wall and the right wall have the same appearance and there are several confusing wall
boundaries. Figure 4.9 (c) gives the worst example of the CFILE method, with accuracy
79.4%. However, this is still higher than the worst example reported in [ML15], which
has accuracy 61.05%. The ceiling boundary is confusing in Figure 4.9 (f), and the
proposed CFILE method selects the ceiling line overlapping with the coarse layout.
More visual results from the LSUN dataset are shown in Figure 4.10.
Figure 4.11 visualizes different layout hypotheses scored using our proposed score
function in Sec. 4.2.3. The higher the score, the better the layout hypothesis is aligned
with the coarse layout.
We also test our proposed algorithm on the LSUN 2016 dataset, achieving a pixel-wise
error of 0.0757 and a corner error of 0.0523.
4.4 Error Analysis
We pick the 20 worst examples among the 1000 test images in the LSUN 2016 dataset.
In each example, the coarse layout and the original image are shown side by side with
our layout result overlaid. There are several error types. Most of the error cases belong
to layout types without a ceiling. In the coarse layout estimation from the FCN, it is
difficult to estimate the wall boundary accurately in such cases, because ceiling corners
help accurately locate the wall boundaries. If the estimated coarse layout deviates
largely from the ground truth layout, especially when the coarse layout gives a wrong
layout type, it is difficult to recover in the layout optimization. Some of the error cases
are related to a lack of similar training images, and others stem from the ambiguity
of wall boundaries. There are also very challenging cases, such as bird's-eye-view and
close-shot images. The statistics of the training images' layout types are shown in
Figure 4.12.
In Figure 4.13, all four images are of layout type 4, which has the second largest number
of training images. We correctly label all of them as layout type 4. However, we fail
to accurately detect the wall boundary. The beds in (a) and (b) are special because
they have vertical beams around them. Since the vertical beams have a higher edge
response, they are detected as wall boundaries. (c) and (d) also show the wall boundary
detection problem. Images of type 4 do not have ceilings, which makes it harder for
the CNN to learn the wall boundary. Because the discriminative features for wall
boundaries are mainly the vertical edge response and the image's global structure, if
the vertical edge response from a vertical beam or another object is stronger than that
of the wall boundary, the wall boundary will be wrongly detected. This type of error
is not due to the training set size: even though there are many more images of type 4
than of type 0, the layout accuracy of type 4 is much lower than that of type 0 because
of the absence of ceiling corners.
Figure 4.14 shows a case where one wall boundary is missing. This case is hard to
differentiate from the one-wall-boundary layout type, especially when there is a lot of
occlusion on the floor.
Figure 4.15 shows several examples with a bird's-eye view. Most of the training images
are frontal views; there are very few bird's-eye-view training images, so the coarse
layout result is not very good for this type of image. In the layout optimization step,
the coarse layout is used as a weight mask to select the best layout. If the coarse layout
has a very large error, the layout optimization is not able to recover a good layout.
Figure 4.16 shows close-shot images. The errors in this category are also related to
the training images: these images belong to types 8 and 9 in Figure 4.12, for which
there are not enough training images. Example (f) is unusual; it appears to be the
reflection of a room in a mirror.
Figure 4.17 shows other error cases. (a) shows an error where the boundary of the
bed is wrongly detected as the floor boundary because the bed is very low and occupies
the whole floor region. (b)-(d) show other boundary detection problems.
4.5 Conclusion
A coarse-to-fine indoor layout estimation (CFILE) method was proposed to estimate the
room layout from an RGB image. We adopted a multi-task fully convolutional neural
network (MFCN) to offer a robust coarse layout estimate for a variety of indoor scenes
with joint layout and semantic surface training. However, the CNN is weak in enforcing
spatial constraints. To address this problem, we formulated an optimization framework
that enforces several constraints, such as layout contour straightness, surface smoothness
and geometric constraints, for layout detail refinement. It was demonstrated by
experimental results that the proposed CFILE system offers the best performance on
two commonly used benchmark datasets.
Figure 4.8: Visualization of the six best results of the CFILE method on Hedau's test
dataset, with pixel-wise accuracies (a) 98.8%, (b) 98.4%, (c) 98.4%, (d) 97.4%, (e) 94.1%
and (f) 93.7%. From top to bottom: original images, the coarse layout estimates from
the MFCN, and our results (the ground truth is shown in green and our result in red).
Figure 4.9: Visualization of the three worst results of the CFILE method on Hedau's test
dataset, with pixel-wise accuracies (a) 81.8%, (b) 81.3% and (c) 79.4%. From top to
bottom: original images, the coarse layout estimates from the MFCN, and our results
(the ground truth is shown in green and our result in red).
Figure 4.10: Visualization of layout results of the CFILE method on the LSUN validation
set, panels (a)-(f). The ground truth is shown in green and our result in red.
Figure 4.11: Visualization of the scores of different layout hypotheses (scores per row,
from left to right: 0.209, 0.156, 0.132; 0.188, 0.168, 0.148; 0.259, 0.208, 0.187; each row
shows the input image, the coarse layout, and three scored hypotheses). The red lines
are layout hypotheses generated by our proposed method and the green lines are the
ground truth layout. Images are from Hedau's dataset.
Figure 4.12: Training image statistics: the number of training examples for each of the
layout types 0-10.
Figure 4.13: Worst examples (a)-(d): the wall boundary is not accurately detected.
Figure 4.14: Worst examples: one wall boundary is missing.
Figure 4.15: Worst examples (a)-(e): bird's-eye-view images.
Figure 4.16: Worst examples (a)-(f): close-shot images.
Figure 4.17: Other worst examples (a)-(d).
Chapter 5
Context-Assisted 3D (C3D)
Object Detection from RGB-D
Images
5.1 Introduction
3D object detection is an important yet challenging problem in computer vision, especially
for highly cluttered indoor scenes. It finds applications in many fields such as
robotic navigation, real-estate advertisement, indoor interior design and holistic scene
understanding. Recent advances in depth sensors make it easy to capture depth information.
The availability of RGB-D images of indoor environments facilitates indoor scene
understanding tasks such as indoor object recognition [SLX15], room layout estimation
[RLCK16], surface orientation identification [GGAM14], etc. A growing number of
annotated RGB-D datasets have been built, and they are of great value to researchers
in the training of deep learning systems [SLX15], [XOT13], [JKJ+13], [SHKF12].
As compared with 2D object detection, research on 3D object detection using convolutional
neural networks (CNNs) is much scarcer. Furthermore, none of the existing
methods exploit the scene and object interdependency or integrate scene classification
and 3D object detection in one CNN. However, the scene information can serve as a
valuable cue to resolve ambiguities in 3D object detection. One example is shown in
Figure 5.1. For a bedroom scene, the probability of a bed is boosted from 0.9 to 0.95
while the probability of a sofa is reduced from 0.75 to 0.1 using our proposed solution.
CNNs offer good solutions to scene classification and object detection independently,
as evidenced in the recent literature. Since these two types of information are often
correlated, the 3D object detection performance can be improved by exploiting this
correlation.

Figure 5.1: Illustration of how scene context helps improve 3D object detection, where
3D object detection with and without the assistance of the context information is
compared (without context: bed 0.9, night stand 0.7, lamp 0.6, pillow 0.5, sofa 0.75;
with context: bed 0.95, night stand 0.75, lamp 0.7, pillow 0.6, sofa 0.1). For ease of
visualization, only a few object proposals are drawn. The top row shows 3D object
proposals; the bottom row compares the confidence of detected objects. The probability
of the "bed" increases while the probability of the "sofa" decreases in our method due
to the use of the "bedroom" scene classification result.
CNNs targeting object classification and detection often integrate multi-scale information
by pooling and concatenating the responses of several convolutional layers for decision
making [SZ14]. Furthermore, encoding the context information using a probabilistic
graphical model can achieve remarkable performance in scene understanding [Tor03],
[HEH08b], [YFU12], [BRC], [BPRC16], [LR09], [HGSK09], [LFU13]. Here, we propose
a context-assisted 3D (C3D) method that infers the object's location/class, the presence
of a class in an image and the scene type from RGB-D images. This is achieved by
jointly optimizing the context information, scene classification and object detection
results under the Conditional Random Field (CRF) framework. The proposed C3D
method exploits the strong discriminative power of CNNs to handle a diversified set of
RGB-D images and adopts the CRF graphical model to represent the context information.

Figure 5.2: The block diagram of the C3D system, consisting of two stages: initial 3D
object detection and joint optimization. In the first stage, we use the scene-CNN and
the object-CNN (enclosed by blue and red dotted boxes, respectively) to obtain scene
classification and object detection results. For the top branch, the RGB image and the
HHA image [GGAM14] serve as the input to the scene-CNN, which produces a
classification score vector. For the bottom branch, the RGB-D image serves as the input
to another CNN for 3D region proposals. Then, the scene classification score vector and
the 3D region proposal results are concatenated and fed into a third CNN for 3D object
detection so as to provide accurate object detection results. In the second stage, we
jointly optimize a cost function associated with scene classification and object detection
under the Conditional Random Field (CRF) framework. The cost function includes the
scene potential and the object potential obtained from the first stage as well as the
scene/object context, the object/object context and the room geometry information.
The main idea of the proposed C3D method is sketched below. First, we propose a
joint scene classification and object recognition system that exploits the discriminant
power of CNNs and the interdependency between objects and the scene. To achieve this
goal, we build a 2D scene classification CNN and a 3D object proposal CNN. The 2D
scene classification CNN provides the scene information to guide the 3D object detection
task in the second stage. Afterwards, we feed the results from these two CNNs into a
third CNN, called the 3D object detection CNN. Both object and scene features are
concatenated to provide a more accurate object category under a scene context. Finally,
we adopt a graphical model to exploit the relationship between objects and the scene
to improve the 3D object detection performance further. It is demonstrated by
extensive experiments that the C3D method achieves state-of-the-art performance on
the SUN RGB-D dataset.
The rest of this chapter is organized as follows. The proposed C3D method is described
in Sec. 5.2. Experimental results are shown in Sec. 5.3. Finally, concluding remarks are
given in Sec. 5.4.
5.2 Proposed Context-Assisted 3D (C3D) Method
The proposed C3D method takes advantage of the discriminant power of the CNN
and incorporates context priors in a graphical model. It consists of two stages as
shown in Figure 5.2. In the first stage, CNNs are used to obtain the initial scene type
and the 3D object class. The blue and red dotted boxes in the figure indicate the
scene-CNN and the object-CNN, respectively. Scene category features from the scene-CNN
are concatenated with those of the large-object-CNN to provide accurate detection results
for large objects. In the second stage, we formulate a CRF optimization problem and
define a cost function that contains the scene potential and the object potential based
on the information obtained from the first stage and the context information. Then,
the joint cost function is optimized. Small and planar objects are more challenging than
large objects; to provide more accurate detection results, we treat them separately as
discussed in Sec. 5.2.2. In the following, we elaborate on the three most innovative
modules of the proposed C3D system: 1) network system design, 2) small object
detection, and 3) graphical model optimization.
5.2.1 Network System Design
For the object detection CNN, we exploit the scene information to improve 3D object
detection. The indoor object category is highly dependent on the scene. For example,
we expect to see a bed in a bedroom. Furthermore, the object categories inside a room
often characterize the functionality of the room; for instance, sofas are typically observed
in a living room. We show the co-occurrence probabilities of scene categories and object
classes based on the SUN RGB-D training dataset in Figure 5.3. In the bedroom scene,
the probability of a "bed" is higher than that of a "sofa". This scene/object relationship
serves as a valuable prior. Sometimes, an object detector cannot differentiate the "bed"
and the "sofa" well because of their appearance similarity. Under this circumstance, the
scene/object co-occurrence statistics can be used to improve the detection performance.
Here, we use the 2D scene CNN for scene classification and then integrate the scene
category result and the 3D object detection result.

Figure 5.3: The co-occurrence probabilities of scene categories (along the x-axis) and
object classes (along the y-axis), where a higher value is indicated by a brighter color.

A system of three CNNs, which forms the first stage of the proposed C3D CNN, is shown
in Figure 5.4. As shown in the figure, the system consists of three input branches as
detailed below.
Figure 5.4: Illustration of the system architecture of three CNNs, which serves as the
first stage of the proposed C3D CNN for large object detection. It consists of three
branches. The top row is a 2D scene CNN, the middle row is a 2D object CNN, and
the bottom row is a 3D object CNN. Their features are concatenated to form an
end-to-end 3D object detection system. The output of the network is a cuboid with its
object class label.

1. The 2D scene classification branch (in the top row).
The FC_1 layer in the top branch offers the probability of the scene type fine-tuned
by our indoor scene classification task. That is, we concatenate the feature vectors
trained on the RGB image and on the encoded depth image (HHA) to form the FC_1
scene feature vector.

2. The 2D object detection branch (in the middle row).
The FC_2 layer in the middle branch generates features for the 2D object detection
task, where the 2D bounding box is obtained by projecting the 3D bounding box onto
the 2D image plane.

3. The 3D object detection branch (in the bottom row).
This branch is a shallower network with five convolutional layers to prevent overfitting.
It takes a 3D object proposal encoded by the Truncated Signed Distance Function
(TSDF) as the input. The FC_3 layer in the bottom branch provides features for the
3D object detection task.
Features from FC_1, FC_2 and FC_3 are concatenated to form a long feature vector
for joint 3D bounding box regression (i.e., localization in the 3D space) and 3D object
classification. The loss function is a multi-task loss that consists of the classification
error loss and the 3D cuboid regression loss. Song et al. [SX16] showed that a network
can offer a better result by combining the 2D projected object features corresponding
to the 3D proposal. We generalize this idea further and develop the above system for
the proposed C3D method.
The 3D object detection and the 2D scene classification can be estimated from their
respective CNNs. However, the interdependency among objects and the room geometry
is not yet explored in the designed network. The context information can be conveniently
formulated by a graphical model. Then, the 3D object detection result, the scene
classification result and the room geometry constraint are fed into a graphical model
for optimization in the second stage. This is elaborated in Sec. 5.2.3.
5.2.2 Small Object Detection
Detection of small (or planar) objects is challenging since their size (or one dimension
of their size) in the input is small. After converting 2D objects into the 3D space using
the depth map, they are even more difficult to recognize due to the lower resolution.
This is attributed to the following facts. First, since 3D convolution requires a higher
computation cost, the 3D resolution is often restricted to a voxel grid of dimension
30 nowadays. Second, depth inaccuracy introduces shape distortion of the object in the
3D space. This effect is more obvious for smaller objects.
2D Object Detection. To detect small 3D objects, we leverage the higher resolution
of 2D images and the context information. Specifically, we adopt a 2D proposal-free
detector called the single shot detector (SSD) [LAE+16] to detect small indoor objects.
The SSD contains several layers used to generate feature maps of multiple scales. The
default bounding boxes do not necessarily correspond to the actual receptive field of
each layer. The box sizes are designed so that specific feature map locations are
responsive to specific areas of the image and particular scales of objects.
Mapping from 2D Bounding Box to 3D Cuboid. Once 2D objects are detected,
the depth information can be used to map them back to the 3D space. That is, one
can identify object pixels in a 2D bounding box and map them to points in a 3D point
cloud using the depth information. Finally, the cuboid that encloses all 3D points is
identified as the 3D object. Two examples of converting a 2D bounding box to a 3D
cuboid using this method are shown in Figure 5.5. The object detection results, the
depth information used to generate the 3D point cloud view, and the generated 3D
cuboid enclosing all 3D points in the 2D bounding boxes are shown in the first, second
and third columns of the figure, respectively. The top example gives an accurate 3D
cuboid. The bottom example shows that, even when the 2D object detection is correct,
we still see errors in the 3D view. This is because the 2D bounding box includes part
of another object (a bed in this example) that has a large depth difference: some part
of the bed is included in the detected 2D bounding box, and the depth of the bed is
much smaller than that of the dresser. As a result, the dresser's cuboid straddles
forward. The empty space in the 3D view contains no information, which tends to lead
to 3D object proposals of lower accuracy. To mitigate this problem, after mapping a
2D bounding box into a 3D cuboid, we refine the cuboid size and location using three
pieces of information: 1) 2D segmentation results, 2) prior knowledge of the object's
size, and 3) the room geometry.
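The basic mapping step can be sketched in Python with a standard pinhole back-projection; the intrinsic parameters (fx, fy, cx, cy) and the axis-aligned enclosing cuboid are illustrative simplifications.

import numpy as np

def box_to_cuboid(depth, box, fx, fy, cx, cy):
    """Back-project the pixels of a 2D box (x0, y0, x1, y1) into 3D and return the
    axis-aligned cuboid (min corner, max corner) enclosing the resulting points."""
    x0, y0, x1, y1 = box
    patch = depth[y0:y1, x0:x1]
    vs, us = np.nonzero(patch > 0)            # keep pixels with valid depth only
    z = patch[vs, us]
    u = us + x0
    v = vs + y0
    x = (u - cx) * z / fx                     # pinhole camera back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    return points.min(axis=0), points.max(axis=0)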
Figure 5.5: Two examples used to illustrate the mapping from a 2D bounding box to a
3D cuboid (from left to right): 2D object detection results, the depth maps, and the 3D
cuboids generated by enclosing all 3D points in the detected 2D bounding boxes.

Depth Processing. It is difficult to capture the depth information of an object accurately
due to occlusion and the limitations of depth sensing. Thus, we need to pay special
attention to depth processing in our method. Unlike 2D segmentation, where pixels are
connected inside a segment, points inside a 3D point cloud may be isolated inside a
cuboid due to depth errors, which often occur along occlusion boundaries. This is
caused by inaccurate depth measurements of 3D objects due to the limitation of depth
sensors. For example, when the camera view is orthogonal to the frontal surface of an
object, the depth dimension of the object is lost. In this case, the alignment method
will return a 2D bounding box rather than a 3D cuboid. Then, we need to move the
cuboid center along the depth direction of the initial bounding box to look for the most
reasonable depth under other constraints to infer the depth dimension.
Generally speaking, if the camera view is not orthogonal to the frontal surface of an
object, the depth of the frontal surface is more accurate since it is closer to the depth
sensor and most of the region is visible. In contrast, the back of the object is invisible
and its depth is not measurable. Thus, segments lying on the frontal surface are more
reliable for 3D object location estimation. After the initial alignment that aligns the
cuboid with the segmentation boundary, we conduct a second alignment based on the
depth information and the room geometry. It consists of two steps. First, the bounding
box is adjusted along the depth dimension so that the object's frontal surface and the
cuboid boundary are aligned. Second, if the aligned cuboid goes beyond the room
boundary, it is truncated accordingly. We use the Manhattan Box method [SLX15] to
find the room boundary for 3D object proposal adjustment.
Cuboid Refinement. We propose a cuboid refinement algorithm that exploits 2D
segmentation, point cloud densities and room geometry to fine-tune the initial 3D cuboid
of an object. It aims to adjust the initial cuboid so that the refined cuboid boundary is
better aligned with the object boundary. We use the 2D RGB-D segmentation method
in [SLX15], [RBF12] to obtain a set of 2D segments from the image, and map 2D
segmented pixels to a cloud of 3D points using the depth information so as to generate
the initial 3D bounding box b_0. We use S to denote the enclosing set of 2D segments
of the target object. Segments in S can be divided into two groups: 1) the set of
segments completely enclosed by b, which is called the inner set and denoted by S_in,
and 2) the set of straddling segments that are partially enclosed by b, which is called
the straddling set and denoted by S_st.
The cuboid refinement algorithm expands the cuboid from the bounding box b(S_in),
which is the box enclosing the inner set S_in, toward the one closest to the initial box,
b_0. The overlap between bounding boxes b_i and b_j, denoted by O(b_i, b_j), is measured
by their intersection-over-union (IoU) value. We sort the bounding boxes in S_st in
decreasing order of IoU value. The cuboid algorithm adds boxes from the sorted
straddling set in a greedy manner. The pseudo-code of the proposed cuboid refinement
method is given below, where we use b(S, Δ_m), with m ∈ {x, y, z}, to denote a bounding
box that initially encloses set S and is shifted by Δ_m. If the cuboid is shifted along the
depth (z) direction, it is written as b(S, Δ_z).
Two examples are given in Figure 5.6 to illustrate the effectiveness of the cuboid
refinement algorithm described in this subsection for small (or planar) object localization.
The left, middle and right columns of the figure show the input images, the frontal view
and the top view of the results, respectively. The ground truth bounding box, the initial
cuboid obtained by directly mapping 2D pixels to a 3D point cloud using the depth
map, and the refined cuboid obtained by our proposed method are indicated in green,
blue and red, respectively. We see that the proposed cuboid refinement algorithm offers
significant performance improvement in identifying the cuboid location and size.
Algorithm 1: Cuboid Refinement
Input: initial cuboid b_0, superpixel segments S
Output: refined cuboid b*
 1: compute inner set S_in and sorted straddling set {s_1, ..., s_K}
 2: S ← S_in, k ← 1
 3: o ← O(b(S), b_0)
 4: ô ← O(b(S ∪ s_k), b_0)
 5: while ô > o do
 6:   o ← ô
 7:   S ← S ∪ s_k
 8:   k ← k + 1
 9:   ô ← O(b(S ∪ s_k, Δ_z), b_0)
10: end while
11: b* ← b(S, Δ_z)
5.2.3 Graphical Model Optimization
CRF Graphical Model. The interdependency between objects and the scene, and among
the objects themselves, in an image defines a structured prediction problem [NL11]. It
can be mathematically formulated using the CRF model [LFU13]. Formally, an image
is represented by a graph G(V, E) that consists of a set of vertices, V, and a set of
edges, E. The vertex set can be written as V = V_o ∪ {v_s}, where V_o = {v_o^i : i = 1, ..., B}
is the set of object nodes, B is the number of cuboids, v_s ∈ {1, ..., S} is the scene node,
and v_o^i ∈ {0, 1, ..., C} is an object node. Here, S and C denote the numbers of scene
types and object classes, respectively. The value of v_o^i can equal 0, which means that
the cuboid is a false positive. The edge set can be written as E = E_oo ∪ E_so, where
E_oo = {(v_o^i, v_o^j) : v_o^i, v_o^j ∈ V_o} is the edge set for object/object (O/O) pairs and
E_so = {(v_s, v_i) : v_i ∈ V_o} is the edge set for scene/object (S/O) pairs. An exemplary
CRF model for a sample image with scene category "bedroom" is shown in Figure 5.7.
Each node has its own confidence, while each edge enforces the output to be compatible
with the context information among the scene and objects.

Figure 5.7: A sample image and its corresponding CRF model, with scene node S and
object nodes O1 (bed), O2 (pillow), O3 (lamp) and O4 (nightstand) connected by
scene/object and object/object potential edges. The relationship between the scene and
an object (S/O) and between two objects (O/O) is modeled by the edges between nodes.
When the S/O and O/O relationships are considered, the confidence of "bed", "night
stand" and "pillow" increases while the confidence of "sofa" decreases.
Figure 5.6: 3D small object detection and localization, where the left, middle and right
columns show the input images, the frontal view and the top view of results, respectively.
The ground truth bounding box, the initial cuboid obtained by directly mapping 2D
pixels to a 3D point cloud using the depth map, and the refined cuboid obtained by our
proposed method are indicated in green, blue and red, respectively.
The potential function to be minimized under the CRF model can be written as

p(x_s, X_o | G, w) = Σ_{v_i ∈ V_o} w_o^T φ_o(x_i) + Σ_{(v_i, v_j) ∈ E_oo} w_oo^T φ_oo(v_i, v_j)
                   + w_s^T φ_s(x_s) + Σ_{(v_i, v_j) ∈ E_so} w_so^T φ_so(v_i, v_j),   (5.1)

where the four terms are the object, object/object relation, scene and scene/object
relation potentials, x_s is the scene class probability vector, and x_i is the class probability
of the i-th object. Note that x_i is also the i-th element of the object class probability
vector X_o. There are four types of potentials in Eq. (5.1):

φ_o: the unary potential defined for objects;
φ_s: the unary potential associated with the scene;
φ_oo: the pairwise potential that captures the context relation between objects;
φ_so: the pairwise potential that captures the context relation between the scene and objects.

We associate each potential with a weight to control the relative contribution of each
term to the overall objective function. These weights can be learned from the training
data.
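To make Eq. (5.1) concrete, the following Python sketch evaluates the weighted sum of the four potential types for one candidate labeling; the dictionary-based representation of the unary scores and pairwise compatibility tables is an illustrative assumption.

def crf_score(scene_label, object_labels, unary_obj, unary_scene,
              pair_oo, pair_so, w_o=1.0, w_s=1.0, w_oo=1.0, w_so=1.0):
    """Weighted sum of the four potential types in Eq. (5.1) for one labeling.

    unary_obj[i][c]   : unary score of object i taking class c
    unary_scene[s]    : unary score of scene type s
    pair_oo[(ci, cj)] : object/object compatibility of classes ci and cj
    pair_so[(s, c)]   : scene/object compatibility of scene s and class c
    """
    score = w_s * unary_scene[scene_label]
    for i, ci in enumerate(object_labels):
        score += w_o * unary_obj[i][ci]
        score += w_so * pair_so.get((scene_label, ci), 0.0)
        for cj in object_labels[i + 1:]:
            score += w_oo * pair_oo.get((ci, cj), 0.0)
    return score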
Unary Potential. The object unary potential φ_o is defined as the classifier score after
the softmax operation on the output of the 3D CNN. The scene unary potential φ_s is
defined as the classifier score obtained by the 2D VGG-based scene classifier.
Scene/Object Potential. We use the scene/object co-occurrence to model the scene/object
relationship. For example, in the "conference room" scene, the probabilities of "chair"
or "table" are higher than those of "bed" or "sink". The co-occurrence probabilities of
scene types and object classes obtained from the SUN RGB-D training dataset are shown
in Figure 5.3. Mathematically, for each scene/object pair (v_s, v_o), the edge potential is
defined as
potential is dened as
'
so
(v
s
=s;v
o
=c) =
1
N
N
lim
i=1
m
i
lim
j=1
1(v
s
i =s)1(v
(i)
o
j
=c);
where N is the total number of training images, m
i
is the total number of detected
objects in the i
th
training image, v
(i)
o
j
is the j
th
object in the i
th
training image, and 1()
equals to 1 when the enclosed condition holds; otherwise, it is set to 0.
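This scene/object co-occurrence table can be estimated from the training annotations as in the Python sketch below; the annotation format (a scene label plus a list of object class labels per image) is assumed for illustration.

from collections import defaultdict

def scene_object_cooccurrence(annotations):
    """Co-occurrence count of (scene type, object class), normalized by the number of images.

    annotations: list of (scene_label, [object_class, ...]) pairs, one per training image.
    """
    counts = defaultdict(float)
    n_images = len(annotations)
    for scene, objects in annotations:
        for obj in objects:
            counts[(scene, obj)] += 1.0
    return {key: value / n_images for key, value in counts.items()}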
Object/Object Potential. We define the object/object potential by considering three
factors: 1) the object/object co-occurrence potential, denoted by φ_co; 2) the object/object
closeness potential, denoted by φ_c; and 3) the object/object relative location potential,
denoted by φ_l.
The object/object co-occurrence probabilities obtained from the SUN RGB-D training
dataset are shown in Figure 5.8. For example, the "chair" and "table" object pair
frequently occurs, while the "sink" and "bed" object pair rarely occurs. Based on these
statistics, the object/object co-occurrence potential is defined as

φ_co(v_o^p = c_l, v_o^q = c_l') = (1/N) Σ_{i=1}^{N} 1(∃ j, j' : v_{o,j}^(i) = c_l, v_{o,j'}^(i) = c_l'),   (5.2)

where v_{o,j}^(i) is the j-th object in the i-th training image.
Furthermore, we consider the closeness and the relative location of the two objects
under consideration. For example, a night stand is typically near a bed, while a monitor
is usually on top of a table. The closeness potential, φ_c(v_o^i, v_o^j), is defined as the
normalized cuboid center distance between objects v_o^i and v_o^j. The relative location
potential, φ_l(v_o^i, v_o^j), is the normalized probability that object v_o^i is on top of v_o^j.
These two terms help remove false positive cuboids, especially when two cuboids overlap.
One example with overlapping cuboids is shown in Figure 5.9.
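A simple instantiation of these two terms is sketched below in Python; cuboids are assumed to be given as (center, size) pairs with a vertical z-axis, and the hard on-top test and normalization constants are illustrative simplifications of the normalized potentials described above.

import numpy as np

def closeness_potential(center_i, center_j, room_diagonal):
    """Cuboid center distance normalized by the room diagonal (smaller means closer)."""
    return float(np.linalg.norm(np.asarray(center_i) - np.asarray(center_j)) / room_diagonal)

def on_top_potential(center_i, size_i, center_j, size_j, margin=0.05):
    """1.0 if cuboid i rests roughly on top of cuboid j, else 0.0."""
    bottom_i = center_i[2] - size_i[2] / 2.0
    top_j = center_j[2] + size_j[2] / 2.0
    horizontally_aligned = (abs(center_i[0] - center_j[0]) < size_j[0] / 2.0 and
                            abs(center_i[1] - center_j[1]) < size_j[1] / 2.0)
    return 1.0 if horizontally_aligned and abs(bottom_i - top_j) < margin else 0.0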
Figure 5.8: The co-occurrence probabilities of one object class (along the x-axis) and
another object class (along the y-axis), where a higher value is indicated by a brighter
color.
Parameter Learning and Inference. We use the maximum likelihood method to
determine the CRF model parameters. We adopt the loopy belief propagation algorithm
[LN08], which offers a good approximation to the marginal distribution, to compute the
marginal distribution of each node. The energy function to be minimized is in log-linear
form, and we use the CRF optimization tool provided in [HHWI11] to minimize the
objective function.
When the learned CRF model is applied to a new image, we run the trained CRF model
to obtain the final object detection results.
5.3 Experimental Results
We implement our network architecture in Marvin [XSSY], which is a deep learning
framework developed to support N-dimensional CNNs. The performance evaluation is
conducted on the SUN RGB-D test set [SLX15], which has ground-truth amodal 3D
bounding boxes. We follow the evaluation metric in [SLX15]: we compute the 3D
volume intersection-over-union (IoU) between the ground truth and predicted boxes and
use a threshold value of 0.25 to calculate the average recall for proposal generation and
the average precision for detection.
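For axis-aligned cuboids parameterized by their minimum and maximum corners, the 3D IoU used in this evaluation protocol can be computed as in the following Python sketch (the axis-aligned simplification is an assumption; the benchmark boxes may be rotated).

import numpy as np

def iou_3d(box_a, box_b):
    """3D IoU of two axis-aligned cuboids given as (min_corner, max_corner) arrays."""
    min_a, max_a = (np.asarray(c, dtype=float) for c in box_a)
    min_b, max_b = (np.asarray(c, dtype=float) for c in box_b)
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter = overlap.prod()
    vol_a = (max_a - min_a).prod()
    vol_b = (max_b - min_b).prod()
    return float(inter / (vol_a + vol_b - inter))

# A detection is counted as correct when iou_3d(pred, gt) >= 0.25.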
Table 5.1: Evaluation of 3D large object detection on the SUN RGB-D test set
(per-category average precision in %, followed by the mAP).

Sliding Shapes [SX14]        |   -  | 42.1 | 33.4 |   -  |   -  | 23.3 | 25.8 | 61.9 | mAP   -
Deep Sliding Shapes [SX16]   | 44.2 | 78.8 | 61.2 | 20.5 | 32.3 | 53.5 | 50.3 | 78.9 | mAP 53.3
Proposed C3D w/o CRF         | 59.5 | 81.5 | 62.6 | 21.7 | 34.7 | 55.0 | 51.2 | 84.8 | mAP 56.4
Proposed C3D                 | 60.3 | 82.9 | 63.7 | 22.1 | 35.0 | 56.5 | 57.3 | 85.7 | mAP 57.9
Table 5.2: Comparison of indoor object detection accuracy (measured in average precision
(AP), %) for the small and planar object categories.

DSS (3D)   | 11.9 |  1.5 |  4.1 |  0.0 |  6.4 | 20.4 | 18.4 |  0.2 | 15.4 | 13.3 |  0.5
SSD (RGB)  | 40.9 | 12.2 | 13.7 | 40.6 | 33.3 | 48.3 | 41.7 | 36.7 | 50.3 | 43.5 | 34.4
SSD (HHA)  | 23.0 | 10.2 | 11.8 | 15.9 | 23.3 | 33.1 | 39.2 | 18.1 | 44.9 | 45.9 | 22.1
Table 5.3: Evaluation of 3D small object detection on the SUN RGB-D test set
(per-category average precision in %, followed by the mAP).

Deep Sliding Shapes [SX16] | 11.9 |  1.5 |  4.1 | 0.0 |  6.4 | 20.4 | 18.4 | 0.2 | 15.4 | 13.3 | 0.5 | mAP  8.5
Proposed C3D w/o CRF       | 32.7 |  6.5 |  9.7 | 5.0 | 18.1 | 25.2 | 22.3 | 3.2 | 28.6 | 14.5 | 3.2 | mAP 15.1
Proposed C3D               | 33.9 |  8.7 | 10.1 | 5.0 | 18.5 | 25.7 | 25.3 | 4.7 | 30.6 | 16.1 | 4.8 | mAP 16.7

3D Large Object Detection Evaluation. For the scene-CNN, we take the FC_7 feature
from VGG-16 with the weights of the pretrained PlacesCNN [ZLX+14]. Then, a linear
SVM is trained to obtain the scene classification probability score vector for each image.
The top-1 classification rate is 47.0% for 45 scene categories. When we also concatenate
the FC_7 features from the RGB image with those from the HHA image, the top-1
scene classification rate increases to 51.4%. For the 2D object detection CNN, we adopt
VGG-16 trained on ImageNet; its FC_7 feature vector is extracted and combined with
that of the 3D object detection network. For the 3D object detection network, the 2D
scene classification score vector, the 2D object detection feature vector and the 3D
object detection feature vector are concatenated to train the 3D bounding box regression
and object classification. The 3D object detection network is based on [SX16].
We have also tried using the scene feature vector FC_7 to replace the 2D scene
classification score vector described above. However, the result is not as good. This
shows that the scene classification score vector, which is fine-tuned on the indoor scene
classification task, is more discriminative than the FC_7 feature. Another idea is to
arrange the scene classification network and the object detection network as a cascaded
system, in which scene classification is conducted first and then object category detection
is performed under the predicted scene type. However, the scene classification error
hurts the object detection performance; error accumulation is the main weakness of
such a cascaded system.
We compare the detection performance of several benchmarking methods on the SUN
RGB-D dataset in Table 5.1 and Table 5.3. As described in the first paragraph of
Sec. 5.2, the proposed C3D method consists of two stages, and the CRF optimization
task is conducted in the second stage. To show the contribution of the CRF optimization,
we also list the results of the C3D method without CRF optimization in these two tables.
We see that all object categories gain performance with the help of the scene context
information. The C3D method has a 4.6% mAP improvement over its baseline (i.e., the
Deep Sliding Shapes method in [SX16]) for large object detection. In particular, we see
significant performance improvements in the "bathtub" (16.1%), "toilet" (6.8%) and
"bed" (4.1%) classes by leveraging the scene/object relationship. The adoption of the
CRF optimization module yields an improvement of 1.5% mAP over the variant without
the CRF module. Some 3D object detection examples are shown in Figure 5.10.
3D Small Object Detection Evaluation. We retrain the SSD network on the SUN RGB-D
dataset for the RGB and the HHA images to obtain the 2D object detection results.
The HHA image is derived by a depth encoding method that converts the one-channel
depth map to three channels: horizontal disparity, height above ground and angle with
gravity. SSD(RGB) and SSD(HHA) are 2D object detectors with VGG as the base
network. In the training of the RGB and HHA networks, the weights of the CNN
pre-trained on ImageNet are used as the filter weight initialization.
The 2D object detection results using the RGB image and the HHA image are compared
with those of the 3D object detection method [SX16] in Table 5.2. Their performance
is measured by the 2D IoU. The SSD(HHA) performs worse than the SSD(RGB). This
is reasonable due to the lower depth resolution; it may also not be proper to use the
CNN weights pre-trained on ImageNet as an initialization, since depth and color images
have very different characteristics. As shown in Table 5.2, the detection accuracy of
small objects (e.g., bookshelf, box, counter, dresser, garbage bin, lamp, night stand)
and planar objects (door, monitor, pillow, TV) is significantly higher than that of their
counterparts in the 3D object detection method [SX16]. We clearly see the advantage
of the 2D object detector in detecting small and planar objects. Finally, we compare
the detection performance of several benchmarking methods on the SUN RGB-D dataset
in Table 5.3 for small and planar objects. Our method has an 8.2% mAP improvement.
The merits of the proposed C3D method for small and planar object detection are very
obvious.
Error Analysis. We provide four examples to show the error sources of the proposed
C3D method in Figure 5.11. The ground truth and our detection results are shown in
green and yellow, respectively. Erroneous detections are attributed to the following
reasons.
1. The TV monitor is missed in the first image because its size is too small. If a small
object is not detected in the 2D image in the first place, we cannot recover it later.
2. The missing object is heavily occluded. The bathtub is not detected in the second
image because of occlusion.
3. The major part of the missing object is outside the camera view. In the first image,
most of the bed is outside of the image, so our method cannot detect the bed. In the
third image, although the bed is detected, the overlap score is not high enough to count
it as a correct detection, because the ground truth of the bed includes a region that is
outside of the view. There are quite a few such cases in this dataset.
4. There exists large intra-class variation. The reason the bathtub is missed in the
second image is that there are not enough training examples of this bathtub type. The
large intra-class diversity makes the detection problem difficult.
5.4 Conclusion
A context-assisted 3D (C3D) object localization and detection method for RGB-D input
using several CNNs and a graphical model was proposed in this work. The context
information, which includes the scene category, the scene/object co-occurrence, the
object/object co-occurrence and the room geometry, was exploited to improve the 3D
object detection accuracy in the proposed C3D method. The superior performance of
the proposed C3D method was demonstrated on the SUN RGB-D dataset.
Figure 5.10: Detections with confidence scores larger than 0.5 for each algorithm
(columns: ground truth, Deep Sliding Shapes, C3D; the color legend covers the
categories bathtub, bed, table, toilet, chair, sofa, night stand, desk, bookshelf, box,
counter, door, dresser, garbage bin, lamp, TV monitor, pillow and sink). The contextual
information helps preserve the true positives and reduce false detections.
Figure 5.11: The ground truth and our detection results are shown in green and yellow,
respectively. The reasons for erroneous detections include the following (from top to
bottom): (a) the TV monitor is missed because its size is too small, (b) the missing
object is heavily occluded, (c) the major part of the missing object is outside the camera
view, and (d) there exists large intra-class variation.
Chapter 6
Conclusion and Future Work
6.1 Summary of the Research
In this dissertation, we studied three research topics related to layout estimation: 1)
outdoor layout estimation by geometric labeling, 2) indoor layout estimation by a
coarse-to-fine system, and 3) 3D object detection by incorporating context information.
For outdoor layout estimation, we proposed the Global-Attributes Assisted Labeling
(GAL) system that combines a local superpixel-based classification method and seven
global attributes using a conditional random field framework. It takes advantage of
global visual reasoning procedures in the human visual system and explores the global
rules that the human brain uses to construct rough structures from a single view. It
enables reasoning on a broad range of scene image types, not restricted to the city
landscapes on which previous works perform accurately. As a general framework, the
GAL system can be easily extended to indoor scene geometric labeling or even more
challenging scene understanding problems.
A simple coarse-to-fine indoor layout estimation (CFILE) method was proposed in
Chapter 4. The CFILE system integrates a deep neural network and geometric knowledge.
The effectiveness of the multi-task FCN for coarse layout learning was demonstrated,
especially the joint learning of the coarse layout and the semantic surfaces. A score
function based on the coarse layout probability was adopted to score different layout
hypotheses. The CFILE system can serve as a building module for the task of total
indoor scene understanding.
A context-assisted 3D (C3D) object detection method was proposed in Chapter 5.
The C3D system processes the RGB-D input using several CNNs and a graphical model.
The context information, such as the scene category, the scene/object co-occurrence,
the object/object co-occurrence and the room geometry, was exploited to improve 3D
object detection accuracy in the C3D method. It achieves state-of-the-art performance
on the SUN RGB-D dataset.
6.2 Future Research
There are several research directions to extend our current research for further improve-
ment in outdoor and indoor layout estimation and 3D object detection. They are elab-
orated below.
More advanced scoring functions.
The layout ranking process in the CFILE method is based solely on the coarse layout
estimated from the MFCN in the first step. The layout will be wrong if the coarse
layout has a large error. A confidence parameter in the MFCN learning could be used
to measure how confident the learned coarse layout is. Based on our observation, for
most images (i.e., more than 95% of the test images), the coarse layout from the MFCN
provides a coarse but accurate result. For the remaining 5% of images, where the MFCN
cannot provide good results, this is largely due to the big difference between the viewing
angles of training and test images. These poorly performing images include
bird's-eye-view images, images without a ceiling corner, etc. If the learned parameters
can identify these images by assigning them a low confidence score, the coarse layout
result cannot be trusted. For those images, we can ignore the MFCN's coarse layout
result and take other features or information into account in layout hypothesis generation
and layout scoring.
Domain Adaptation.
The construction of an RGB-D dataset is expensive. The largest RGB-D dataset, the
SUN RGB-D dataset, contains 10,335 images. Compared with RGB datasets for scene
classification such as Places, the SUN RGB-D dataset is fairly small. The original
images and depth maps in the SUN RGB-D dataset were captured by depth cameras in
the real world. There is a new large synthesized indoor dataset, the SUNCG dataset
[ZSY+17], that contains millions of indoor images with their depth maps. However, we
may not obtain better results simply by training on this large dataset. For example,
experiments in [SYZ+16] showed that training a CNN model on the SUNCG dataset and
testing it on the NYU V2 test set does not outperform training on the much smaller
NYU V2 training set, which contains only 1,449 images. The synthesized dataset and
the SUN RGB-D dataset have different data properties. For example, the synthesized
indoor scenes are very tidy: they contain only furniture, without many real-world items
such as clothes on the bed or books and pens on the desk. Images captured in real life tend
to contain many different objects, while those in the synthesized dataset contain only a
common set of daily objects. Domain adaptation techniques are therefore needed to adapt a
model trained on one dataset to another dataset with different data properties.
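One common family of adaptation techniques aligns the feature distributions of the two domains, for example by adding a maximum mean discrepancy (MMD) penalty between synthetic and real feature batches during training. The sketch below is a generic illustration of estimating such an MMD term with an RBF kernel; it is not the dissertation's method, and the feature batches are random placeholders.

import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Pairwise RBF kernel between rows of x (n x d) and y (m x d).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(feats_src, feats_tgt, gamma=1.0):
    # Squared MMD between two feature batches; smaller means better-aligned domains.
    k_ss = rbf_kernel(feats_src, feats_src, gamma).mean()
    k_tt = rbf_kernel(feats_tgt, feats_tgt, gamma).mean()
    k_st = rbf_kernel(feats_src, feats_tgt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(2)
synthetic = rng.normal(0.0, 1.0, size=(64, 128))   # stand-in for features from synthetic images
real = rng.normal(0.5, 1.0, size=(64, 128))        # stand-in for features from real images
print(mmd2(synthetic, real, gamma=0.1))

During training, such a term would be added to the task loss so that the feature extractor learns representations that transfer from the synthesized images to the real SUN RGB-D images.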
Bibliography
[ADF12] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
[AMFM11] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[APTB+14] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 328–335, 2014.
[ASS+12] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[BHC15] Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293, 2015.
[BLTK10] Olga Barinova, Victor Lempitsky, Elena Tretiak, and Pushmeet Kohli. Geometric image parsing in man-made environments. In Computer Vision–ECCV 2010, pages 57–70. Springer, 2010.
[BPRC16] Jawadul H Bappy, Sujoy Paul, and Amit K Roy-Chowdhury. Online adaptation for joint scene and object classification. In European Conference on Computer Vision, pages 227–243. Springer, 2016.
[BRC] Jawadul Hasan Bappy and Amit K Roy-Chowdhury. Inter-dependent CNNs for joint scene and object recognition.
[BRF13] Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Unsupervised feature learning for RGB-D based object recognition. In Experimental Robotics, pages 387–402. Springer, 2013.
[CCPS13a] Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. Understanding indoor scenes using 3D geometric phrases. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 33–40. IEEE, 2013.
[CCPS13b] Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. Understanding indoor scenes using 3D geometric phrases. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 33–40. IEEE, 2013.
[CKZ+15] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3D object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, pages 424–432, 2015.
[CMWZ15] Xiaozhi Chen, Huimin Ma, Xiang Wang, and Zhichen Zhao. Improving object proposals with multi-thresholding straddling expansion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2587–2595, 2015.
[CRK14] Chen Chen, Yuzhuo Ren, and C-C Jay Kuo. Large-scale indoor/outdoor image classification via expert decision fusion (EDF). In Asian Conference on Computer Vision, pages 426–442. Springer, 2014.
[CRK16a] Chen Chen, Yuzhuo Ren, and C-C Jay Kuo. Big Visual Data Analysis: Scene Classification and Geometric Labeling. Springer, 2016.
[CRK16b] Chen Chen, Yuzhuo Ren, and C-C Jay Kuo. Outdoor scene classification using labeled segments. In Big Visual Data Analysis, pages 65–92. Springer, 2016.
[CSKX15] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
[CZLT14] Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, and Philip Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3286–3293, 2014.
[DFCS16] Saumitro Dasgupta, Kuan Fang, Kevin Chen, and Silvio Savarese. DeLay: Robust spatial layout estimation for cluttered indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 616–624, 2016.
[DHS15] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. arXiv preprint arXiv:1512.04412, 2015.
[DZ13] Piotr Dollár and C Lawrence Zitnick. Structured forests for fast edge detection. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1841–1848. IEEE, 2013.
[DZ14] Piotr Dollár and C Lawrence Zitnick. Fast edge detection using structured forests. 2014.
[EPF14] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[FCNL13] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.
[FDG+14] David F Fouhey, Vincent Delaitre, Abhinav Gupta, Alexei A Efros, Ivan Laptev, and Josef Sivic. People watching: Human actions as a cue for single view geometry. International Journal of Computer Vision, 110(3):259–274, 2014.
[FDU12] Sanja Fidler, Sven Dickinson, and Raquel Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In Advances in Neural Information Processing Systems, pages 611–619, 2012.
[FGMR10] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[FH04] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
[FVS+09] Brian Fulkerson, Andrea Vedaldi, Stefano Soatto, et al. Class segmentation and object localization with superpixel neighborhoods. In ICCV, volume 9, pages 670–677. Citeseer, 2009.
[FWC+15] Xiang Fu, Chien-Yi Wang, Chen Chen, Changhu Wang, and C-C Jay Kuo. Robust image segmentation using contour-guided color palettes. In Proceedings of the IEEE International Conference on Computer Vision, pages 1618–1625, 2015.
[GAGM15] Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Aligning 3D models to RGB-D images of cluttered scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4731–4740, 2015.
[GAM13] Saurabh Gupta, Pablo Arbeláez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571, 2013.
[GDDM14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[GEH10] Abhinav Gupta, Alexei A. Efros, and Martial Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In Computer Vision–ECCV 2010, pages 482–496. Springer, 2010.
[GFM] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.
[GGAM14] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer, 2014.
[GHKB10a] Abhinav Gupta, Martial Hebert, Takeo Kanade, and David M. Blei. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In Advances in Neural Information Processing Systems, pages 1288–1296, 2010.
[GHKB10b] Abhinav Gupta, Martial Hebert, Takeo Kanade, and David M Blei. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In Advances in Neural Information Processing Systems, pages 1288–1296, 2010.
[Gir15] Ross Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
[GJT16] C Gururaj, D Jayadevappa, and Satish Tunga. An effective implementation of exudate extraction from fundus images of the eye for a content based image retrieval system through hardware description language. In Emerging Research in Computing, Information, Communication and Applications, pages 279–290. Springer, 2016.
[GSEH11] Abhinav Gupta, Scott Satkin, Alexei Efros, and Martial Hebert. From 3D scene geometry to human workspace. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1961–1968. IEEE, 2011.
[GZH15] Ruiqi Guo, Chuhang Zou, and Derek Hoiem. Predicting complete 3D models of indoor scenes. arXiv preprint arXiv:1504.02437, 2015.
[HEH+05a] Derek Hoiem, Alexei Efros, Martial Hebert, et al. Geometric context from a single image. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 654–661. IEEE, 2005.
[HEH05b] Derek Hoiem, Alexei A. Efros, and Martial Hebert. Automatic photo pop-up. ACM Transactions on Graphics (TOG), 24(3):577–584, 2005.
[HEH07a] Derek Hoiem, Alexei A. Efros, and Martial Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, 2007.
[HEH07b] Derek Hoiem, Alexei A Efros, and Martial Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, 2007.
[HEH08a] Derek Hoiem, Alexei Efros, and Martial Hebert. Closing the loop in scene interpretation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[HEH08b] Derek Hoiem, Alexei A Efros, and Martial Hebert. Putting objects in perspective. International Journal of Computer Vision, 80(1):3–15, 2008.
[HGSK09] Geremy Heitz, Stephen Gould, Ashutosh Saxena, and Daphne Koller. Cascaded classification models: Combining models for holistic scene understanding. In Advances in Neural Information Processing Systems, pages 641–648, 2009.
[HHF09] Varsha Hedau, Derek Hoiem, and David Forsyth. Recovering the spatial layout of cluttered rooms. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1849–1856. IEEE, 2009.
[HHF10] Varsha Hedau, Derek Hoiem, and David Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In Computer Vision–ECCV 2010, pages 224–237. Springer, 2010.
[HHF12] Varsha Hedau, Derek Hoiem, and David Forsyth. Recovering free space of indoor scenes from a single image. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2807–2814. IEEE, 2012.
[HHWI11] Qixing Huang, Mei Han, Bo Wu, and Sergey Ioffe. A hierarchical conditional random field model for labeling and segmenting images of street scenes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1953–1960. IEEE, 2011.
[HKM15] Haibin Huang, Evangelos Kalogerakis, and Benjamin Marlin. Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. In Computer Graphics Forum, volume 34, pages 25–38. Wiley Online Library, 2015.
[HZ09] Feng Han and Song-Chun Zhu. Bottom-up/top-down image parsing with attribute grammar. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):59–73, 2009.
[HZRS14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
[IKJM16] Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110, 2016.
[JKJ+13] Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T Barron, Mario Fritz, Kate Saenko, and Trevor Darrell. A category-level 3D object dataset: Putting the Kinect to work. In Consumer Depth Cameras for Computer Vision, pages 141–165. Springer, 2013.
[JKS+15] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A Shamma, Michael S Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3668–3678. IEEE, 2015.
[JX13] Hao Jiang and Jianxiong Xiao. A linear approach to matching cuboids in RGBD images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2171–2178, 2013.
[KHB+15] Salman H Khan, Xuming He, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri. Separating objects and clutter in indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4603–4611, 2015.
[KHFH11] Kevin Karsch, Varsha Hedau, David Forsyth, and Derek Hoiem. Rendering synthetic objects into legacy photographs. In ACM Transactions on Graphics (TOG), volume 30, page 157. ACM, 2011.
[KST+09] Panagiotis Koutsourakis, Loic Simon, Olivier Teboul, Georgios Tziritas, and Nikos Paragios. Single view reconstruction using shape grammars for urban environments. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1795–1802. IEEE, 2009.
[KTCM15] Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Amodal completion and size constancy in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 127–135, 2015.
[LAE+16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[LBF14] Kevin Lai, Liefeng Bo, and Dieter Fox. Unsupervised feature learning for 3D scene labeling. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 3050–3057. IEEE, 2014.
[LBH09] Christoph H Lampert, Matthew B Blaschko, and Thomas Hofmann. Efficient subwindow search: A branch and bound framework for object localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2129–2142, 2009.
[LFU13] Dahua Lin, Sanja Fidler, and Raquel Urtasun. Holistic scene understanding for 3D object detection with RGBD cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages 1417–1424, 2013.
[LGK10] Beyang Liu, Stephen Gould, and Daphne Koller. Single image depth estimation from predicted semantic labels. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1253–1260. IEEE, 2010.
[LHK09] Daniel C. Lee, Martial Hebert, and Takeo Kanade. Geometric reasoning for single image structure recovery. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2136–2143. IEEE, 2009.
[LN08] Yuan Li and Ram Nevatia. Key object driven multi-category object recognition, localization and tracking using spatio-temporal context. In ECCV (4), pages 409–422, 2008.
[LPC+17] Shangwen Li, Sanjay Purushotham, Chen Chen, Yuzhuo Ren, and C-C Jay Kuo. Measuring and predicting tag importance for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[LR09] Svetlana Lazebnik and Maxim Raginsky. An empirical Bayes approach to contextual region classification. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2380–2387. IEEE, 2009.
[LSD15] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[LSK+15] Chenxi Liu, Alexander G Schwing, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Rent3D: Floor-plan priors for monocular layout estimation. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3413–3421. IEEE, 2015.
[LSL15] Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5162–5170, 2015.
[LSP14] Lubor Ladicky, Jianbo Shi, and Marc Pollefeys. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 89–96, 2014.
[LW02] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
[LYT11] Ce Liu, Jenny Yuen, and Antonio Torralba. Nonparametric scene parsing via label transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2368–2382, 2011.
[LZZ14] Xiaobai Liu, Yibiao Zhao, and Song-Chun Zhu. Single-view 3D scene parsing by attributed grammar. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 684–691. IEEE, 2014.
[LZZ+17] Siyang Li, Heming Zhang, Junting Zhang, Yuzhuo Ren, and C-C Jay Kuo. Box refinement: Object proposal enhancement and pruning. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 979–988. IEEE, 2017.
[MBHRS14] Ricardo Martin-Brualla, Yanling He, Bryan C Russell, and Steven M Seitz. The 3D jigsaw puzzle: Mapping large indoor spaces. In Computer Vision–ECCV 2014, pages 1–16. Springer, 2014.
[MGLW16] Markus Maurer, J Christian Gerdes, Barbara Lenz, and Hermann Winner. Autonomous driving: technical, legal and social aspects. 2016.
[ML15] Arun Mallya and Svetlana Lazebnik. Learning informative edge maps for indoor scene layout prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 936–944, 2015.
[MS15] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
[MTF+03] Kevin Murphy, Antonio Torralba, William Freeman, et al. Using the forest to see the trees: a graphical model relating features, objects and scenes. Advances in Neural Information Processing Systems, 16:1499–1506, 2003.
[MVL+15] Ondrej Miksik, Vibhav Vineet, Morten Lidegaard, Ram Prasaath, Matthias Nießner, Stuart Golodetz, Stephen L Hicks, Patrick Pérez, Shahram Izadi, and Philip HS Torr. The semantic paintbrush: Interactive 3D mapping and recognition in large outdoor spaces. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3317–3326. ACM, 2015.
[MZYM11] Hossein Mobahi, Zihan Zhou, Allen Y. Yang, and Yi Ma. Holistic 3D reconstruction of urban structures from low-rank textures. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 593–600. IEEE, 2011.
[NHH15] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
[NL11] Sebastian Nowozin and Christoph H Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4):185–365, 2011.
[PBF+12] Luca Del Pero, Joshua Bowdish, Daniel Fried, Bonnie Kermgard, Emily Hartley, and Kobus Barnard. Bayesian geometric modeling of indoor scenes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2719–2726. IEEE, 2012.
[PBK+13] Luca Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, and Kobus Barnard. Understanding Bayesian rooms using composite 3D object models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 153–160, 2013.
[PC13] Pedro HO Pinheiro and Ronan Collobert. Recurrent convolutional neural networks for scene parsing. arXiv preprint arXiv:1306.2795, 2013.
[PDH+15] Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronin, and Cordelia Schmid. Local convolutional features with unsupervised training for image retrieval. In Proceedings of the IEEE International Conference on Computer Vision, pages 91–99, 2015.
[PHK15] Jiyan Pan, Martial Hebert, and Takeo Kanade. Inferring 3D layout of building facades from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2918–2926, 2015.
[PLLY15] Sungheon Park, Hyeopwoo Lee, Suwon Lee, and Hyun S Yang. Line-based single view 3D reconstruction in Manhattan world for augmented reality. In Proceedings of the 14th ACM SIGGRAPH International Conference on Virtual Reality Continuum and its Applications in Industry, pages 89–92. ACM, 2015.
[RBF12] Xiaofeng Ren, Liefeng Bo, and Dieter Fox. RGB-(D) scene labeling: Features and algorithms. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2759–2766. IEEE, 2012.
[Ren13] Yuzhuo Ren. Techniques for vanishing point detection. University of Southern California, 2013.
[RHGS15] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[RKAT08] Srikumar Ramalingam, Pushmeet Kohli, Karteek Alahari, and Philip HS Torr. Exact inference in multi-label CRFs with higher order cliques. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[RKB04] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–314, 2004.
[RLCK16] Yuzhuo Ren, Shangwen Li, Chen Chen, and C-C Jay Kuo. A coarse-to-fine indoor layout estimation (CFILE) method. In Asian Conference on Computer Vision, pages 36–51. Springer, 2016.
[RPJT13] Srikumar Ramalingam, Jaishanker Pillai, Arpit Jain, and Yuichi Taguchi. Manhattan junction catalogue for spatial reasoning of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3065–3072, 2013.
[RS16] Zhile Ren and Erik B. Sudderth. Three-dimensional object detection and layout prediction using clouds of oriented gradients. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[SBZB15] Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
[SFPU13] Alexander Schwing, Sanja Fidler, Marc Pollefeys, and Raquel Urtasun. Box in the box: Joint 3D layout and object reasoning from single images. In Proceedings of the IEEE International Conference on Computer Vision, pages 353–360, 2013.
[SHKF12] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
[SLX15] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
[SLZ+15] Dongjin Song, Wei Liu, Tianyi Zhou, Dacheng Tao, and David A Meyer. Efficient robust conditional random fields. IEEE Transactions on Image Processing, 24(10):3124–3136, 2015.
[SMKLM15] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.
[SRCF15] Johannes L Schönberger, Filip Radenović, Ondrej Chum, and Jan-Michael Frahm. From single image query to detailed 3D reconstruction. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5126–5134. IEEE, 2015.
[SRK17] Ronald Salloum, Yuzhuo Ren, and C. C. Jay Kuo. Image splicing localization using a multi-task fully convolutional network (MFCN), 2017.
[SSN09] Ashutosh Saxena, Min Sun, and Andrew Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009.
[SU12] Alexander G Schwing and Raquel Urtasun. Efficient exact inference for 3D indoor scene understanding. In Computer Vision–ECCV 2012, pages 299–313. Springer, 2012.
[SX14] Shuran Song and Jianxiong Xiao. Sliding shapes for 3D object detection in depth images. In European Conference on Computer Vision, pages 634–651. Springer, 2014.
[SX15] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. arXiv preprint arXiv:1511.02300, 2015.
[SX16] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[SYZ+16] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. arXiv preprint arXiv:1611.08974, 2016.
[SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[SZWW15] Bing Shuai, Zhen Zuo, Gang Wang, and Bing Wang. DAG-recurrent neural networks for scene labeling. arXiv preprint arXiv:1509.00552, 2015.
[Tor03] Antonio Torralba. Contextual priming for object detection. International Journal of Computer Vision, 53(2):169–191, 2003.
[UAB+08] Chris Urmson, Joshua Anhalt, Drew Bagnell, Christopher Baker, Robert Bittner, MN Clark, John Dolan, Dave Duggins, Tugrul Galatali, Chris Geyer, et al. Autonomous driving in urban environments: Boss and the urban challenge. Journal of Field Robotics, 25(8):425–466, 2008.
[UvdSGS13] Jasper RR Uijlings, Koen EA van de Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[vGJMR08] Rafael Grompone von Gioi, Jeremie Jakubowicz, Jean-Michel Morel, and Gregory Randall. LSD: A fast line segment detector with a false detection control. IEEE Transactions on Pattern Analysis & Machine Intelligence, (4):722–732, 2008.
[WGR13] Huayan Wang, Stephen Gould, and Daphne Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. Communications of the ACM, 56(4):92–99, 2013.
[WSK+15] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
[XF14] Jianxiong Xiao and Yasutaka Furukawa. Reconstructing the world's museums. International Journal of Computer Vision, 110(3):243–258, 2014.
[XFZW15] Jin Xie, Yi Fang, Fan Zhu, and Edward Wong. DeepShape: Deep learned shape descriptor for 3D shape matching and retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1275–1283, 2015.
[XOT13] Jianxiong Xiao, Andrew Owens, and Antonio Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 1625–1632, 2013.
[XS12] Yu Xiang and Silvio Savarese. Estimating the aspect layout of object categories. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3410–3417. IEEE, 2012.
[XSSY] Jianxiong Xiao, Shuran Song, Daniel Suo, and Fisher Yu. Marvin: A minimalist GPU-only N-dimensional ConvNet framework. http://marvin.is. Accessed: 2015-11-10.
[XT15] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1395–1403, 2015.
[XVR+15] Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, and Kate Saenko. A multi-scale multiple instance video description network. arXiv preprint arXiv:1505.05914, 2015.
[YFU12] Jian Yao, Sanja Fidler, and Raquel Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 702–709. IEEE, 2012.
[ZBK+16] Yinda Zhang, Mingru Bai, Pushmeet Kohli, Shahram Izadi, and Jianxiong Xiao. DeepContext: Context-encoding neural pathways for 3D holistic scene understanding. CoRR, abs/1603.04922, 2016.
[ZD14] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer, 2014.
[ZJRP+15] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.
[ZKSU13] Jian Zhang, Chen Kan, Alexander Schwing, and Raquel Urtasun. Estimating the 3D layout of indoor scenes and its clutter from depth sensors. In Proceedings of the IEEE International Conference on Computer Vision, pages 1273–1280, 2013.
[ZLCD16] Jinyi Zou, Wei Li, Chen Chen, and Qian Du. Scene classification using local and global features with collaborative representation fusion. Information Sciences, 348:209–226, 2016.
[ZLX+14] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
[ZS11] Shaojie Zhuo and Terence Sim. Defocus map estimation from a single image. Pattern Recognition, 44(9):1852–1858, September 2011.
[ZSY+17] Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[ZWW+15] Mujun Zang, Dunwei Wen, Ke Wang, Tong Liu, and Weiwei Song. A novel topic feature for image scene classification. Neurocomputing, 148:467–476, 2015.
[ZZ13] Yibiao Zhao and Song-Chun Zhu. Scene parsing by integrating function, geometry and appearance models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3119–3126, 2013.