DERIVING 3-D SHAPE DESCRIPTIONS
FROM STEREO
USING HIERARCHICAL FEATURES
by
Chi-Kit Ronald Chung
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)
November 1992
Copyright 1992 Chi-Kit Ronald Chung
UMI Number: DP22843
All rights reserved
INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if material had to be removed,
a note will indicate the deletion.
UMI Dissertation Publishing
UMI DP22843
Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author.
Microform Edition © ProQuest LLC.
All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code.
ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007
This dissertation, written by

Chi-Kit Ronald Chung

under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of

DOCTOR OF PHILOSOPHY

Dean of Graduate Studies

Date: November 24, 1992

DISSERTATION COMMITTEE

Chairperson
To my parents, Ping-Kwai and Hoi-Kuen
Acknowledgments

It is only with the direction, support, patience and encouragement of my thesis advisor, Prof. Ramakant Nevatia, that this thesis was possible. He not only taught me a lot about what science is and how to approach it systematically, but also gave me hints on tackling various problems in an academic life. His door was always open when I needed his advice. My deepest thanks also go to Prof. Gerard Medioni and Prof. Michael Arbib. They are all very busy persons and yet they found time to serve on my thesis committee and gave me a lot of invaluable suggestions about the work.

I thank everybody at the Institute for Robotics and Intelligent Systems during the last few years: especially Andres Huertas for his help with the machines, Dr. Steve Cochran for his help with the utility software, and Prof. Keith Price and Prof. Ken Goldberg for giving me feedback about the work. I also thank Dorothy Steele and Delsa Castello for their help and support over the years.

My deepest gratitude goes to my parents, whom I can never thank enough.

This research was supported in part by the Advanced Research Projects Agency of the Department of Defense and was monitored by the Air Force Office of Scientific Research under Contract No. F49620-90-C-0078, and supported in part by a subcontract from the Hughes Aircraft Company.
Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract

1 Introduction
  1.1 Motivation
  1.2 Objective of the Study
  1.3 Our Approach
  1.4 Original Contributions
  1.5 Organization of this Dissertation

2 Basic Concepts and Related Research Survey
  2.1 Fundamentals of Stereo Analysis
  2.2 Stereo Correspondence
    2.2.1 Area-based
    2.2.2 Edgel-based
    2.2.3 Extended-Feature-based
    2.2.4 Homogeneous-region-based
    2.2.5 Surface-based
    2.2.6 Combined Methods
  2.3 Shape Representation
  2.4 Shape Reconstruction
    2.4.1 Shape Recovery from a Single Image
    2.4.2 Shape Recovery from Stereo Images
    2.4.3 Recovering Architectural Structures from Stereo
  2.5 Summary

3 Stereo Correspondence: Use of Monocular Groupings and Occlusion Analysis
  3.1 Introduction
  3.2 Occlusion Effects in Stereo
    3.2.1 A Basic Property of Occlusions in Stereo Views
    3.2.2 A Constraint for the NVFOV Region
    3.2.3 Some Observations about Junctions
  3.3 A Hierarchical Features Based Approach
    3.3.1 Monocular Perceptual Grouping
    3.3.2 Binocular Stereo Correspondence
    3.3.3 Feedback from Stereo Correspondence to Monocular Groupings
    3.3.4 Stereo Correspondence of Surface Markings
  3.4 Experimental Results
  3.5 Computational Complexity and Run-time
  3.6 Conclusion

4 Shape Description: Recovering LSHGCs and SHGCs
  4.1 Introduction
  4.2 Shape Recovery of LSHGCs
    4.2.1 The Method
    4.2.2 Experimental Results
  4.3 Shape Recovery of SHGCs
    4.3.1 Hypothesizing SHGCs
    4.3.2 Setting Up Correspondences across Stereo Images
    4.3.3 Fitting Cross-Sections to the Correspondence Quadruples
    4.3.4 Experimental Results
  4.4 Computational Complexity and Run-time
    4.4.1 Recovering an LSHGC
    4.4.2 Recovering an SHGC
  4.5 Conclusion

5 Application: Extracting Building Structures from a Stereo Pair of Aerial Images
  5.1 Introduction
  5.2 A Building Extraction System
    5.2.1 Structural Description and Matching Module
    5.2.2 Figure Extraction and Ground Extraction Modules
  5.3 Experimental Results
  5.4 Computational Complexity and Run-time
  5.5 Conclusion

6 Conclusion
  6.1 Summary
  6.2 Future Research

Appendix A: A Relaxation Network for Constrained Optimization
Appendix B: Preservation of Coplanarity from 3-D Space to Disparity Space
Appendix C: A Measure of Coplanarity

References
List of Figures

1.1 The traditional approach to stereo processing
1.2 Difficulty of recovering depth discontinuities by observing local depth estimates
1.3 Stereo projections of a curved surface
1.4 A stereo correspondence approach using hierarchical features
1.5 Our approach to deriving shape descriptions from stereo
2.1 The pinhole camera projection model
2.2 The epipolar plane and the epipolar line of an image point
2.3 The parallel-axis stereo geometry
2.4 Lim-Binford's method for curved surface reconstruction
2.5 Properties of the epipolar slices
3.1 Types of occlusions
3.2 Effects of NVFOV regions to stereo correspondence
3.3 Effects of NVFOV regions on the junctions along occlusion boundaries
3.4 Asymmetry of the occurrence of NVFOV region with respect to stereo views
3.5 Formation of T-junction
3.6 Classification of T-junctions with respect to stereo views
3.7 Direction of epipolar displacement of T-junctions
3.8 Formation of limb-junction
3.9 Classification of limb-junctions with respect to stereo views
3.10 Direction of epipolar displacement of limb-junctions
3.11 Formation of NVFOV surface at orientation discontinuity occlusion
3.12 Projection of the surfaces composing an orientation discontinuity
3.13 Examples of permissible change of the junction type at orientation discontinuity occlusions for trihedral objects
3.14 Overview of our stereo system
3.15 Some constraints for ribbon-selection
3.16 Hierarchical Matching
3.17 Some constraints for ribbon matching
3.18 Type and Uniqueness constraints for matching junctions
3.19 Figural Continuity constraint for matching junctions
3.20 Some constraints for branch matching
3.21 Consistent and Inconsistent ribbon-pairs
3.22 Hierarchical features of a scene with multiple occlusions (scene MO)
3.23 Disparity output for the scene MO
3.24 Results of a scene of curved objects (scene CO)
3.25 Results of a scene of textured objects
4.1 Overview of our approach
4.2 Sample GC shapes
4.3 The axis, apex, and meridians of an LSHGC
4.4 Contour generator of an LSHGC is a meridian
4.5 Stereo correspondence of LSHGC contours
4.6 Results for a synthetic scene of an LSHGC
4.7 Results for a scene of a cone
4.8 The recovered volumetric description overlaid on the left image of the cone
4.9 Tangents to the image contours of an SHGC meet at the projection of its axis
4.10 Volumetric shape recovery of an SHGC
4.11 Corresponding points on the image contours of an SHGC which belong to the same cross-section, when the contour generators are creases
4.12 The apex of a given cross-section of an SHGC
4.13 Corresponding points on the image contours of an SHGC which belong to the same cross-section, when the contour generators are limbs
4.14 Geometry of recovering a cross-section of an SHGC
4.15 Results for a synthetic scene of an SHGC
4.16 Results of hierarchical stereo matching and volumetric shape recovery for the scene of a lamp
4.17 The recovered volumetric description overlaid on the left image of the lamp
5.1 A typical aerial image
5.2 Overview of our stereo system
5.3 Hypothesis of an L-junction from two line segments
5.4 Epipolar constraint for matching junctions
5.5 Figural Continuity constraint for matching junctions
5.6 Definition of 2-D collinearity in our system
5.7 Ordering constraint for matching junctions
5.8 Epipolar constraint for matching branches of junctions
5.9 Surface-orientation constraint for matching branches of junctions
5.10 Uniqueness constraint for matching line segments
5.11 Hypothesis of a link-match from two junction-matches
5.12 The outside and inside zones of an extended feature relative to a center point in a 2-D space
5.13 The outsidemost and insidemost extended features with respect to a center point outside the desired boundary
5.14 Hypothesis of surface boundaries from a cluster of coplanar links
5.15 False boundaries across collinear edges of coplanar surfaces
5.16 Solid-formation constraint for selecting surface boundaries
5.17 Results for a scene (B10)
5.18 Results for another scene (B11)
5.19 Results for the Pentagon building
5.20 Results of recovering the hole on the roof of the Pentagon building
5.21 Results for the oblique views of a hotel (scene OBL)
List of Tables

3.1 Time complexity and run-time of matching hierarchical features
4.1 Time complexity and run-time of recovering an LSHGC
4.2 Time complexity and run-time of recovering an NLSHGC
5.1 Time complexity and run-time of recovering the building in the scene B10
5.2 Time complexity and run-time of recovering the building in the scene B11
5.3 Time complexity and run-time of recovering the building in the Pentagon scene
5.4 Time complexity and run-time of recovering the building in the scene OBL
Abstract

Humans show no difficulty in inferring the three-dimensional (3-D) shape of many objects presented to them, even though they have not seen the objects before. We address the problem of how the 3-D shape of objects in a scene can be inferred from a stereo pair of two-dimensional (2-D) intensity images. Related work done by others is reviewed, specifically the work regarding stereo correspondence and shape reconstruction from intensity image data.

We propose a stereo correspondence system that computes a hierarchy of descriptions up to the surface level from each view using a perceptual grouping technique, and matches these features at the different levels. With the description and correspondence processes going hand in hand, we allow high-level abstract features to help reduce correspondence ambiguities, and we exploit the multiple views to help confirm the different levels of descriptions in the scene. Occlusion is a major problem in stereo analysis and is often not treated explicitly. We present a basic property of occlusions in stereo views, and show how this property can be used together with the structural descriptions to identify different types of occlusions in the scene. In particular, we identify crease and limb boundaries, and infer properties of surfaces that are visible in only one of the stereo views. We give some experimental results on scenes with curved objects and with multiple occlusions.

Such a stereo correspondence system is capable of producing segmented surfaces and of distinguishing between creases and limbs, yet no precise 3-D information is extracted about the limb boundaries. We argue that intermediate 2½-D (visible) descriptions may not always be directly available from stereo, especially when there are curved surfaces in the scene. We show that 3-D volumetric descriptions of objects can be inferred if we assume that the objects consist of parts which are instances of some general shape primitives. Our methods are based on the invariant properties of the shape primitives in their monocular and stereo projections. Experimental results on both synthetic and real images of objects with curved surfaces are also given.

We apply some of the proposed ideas to an important application domain of computer vision, namely the reconstruction of polyhedral building structures from a stereo pair of aerial images. We have designed a system that computes a hierarchy of descriptions such as segments, junctions, and links between junctions from each view, and matches these features at the different levels. Such features not only help reduce correspondence ambiguity during stereo matching, but also allow us to infer surface boundaries even though the boundaries may be broken because of noise and weak contrast. We hypothesize surface boundaries by examining global information such as continuity and coplanarity of linked edges in 3-D, rather than by merely looking at local depth information. We avoid the errors introduced by inferring 3-D information from triangulation by translating 3-D collinearity into 2-D collinearity in the two views and coplanarity from 3-D space into the disparity space, and by working merely in the 2-D and disparity domains. When the walls of the buildings are visible, we also exploit the relationships among adjacent surfaces of an object to help confirm the different levels of descriptions. Experimental results for various aerial scenes are also shown.
Chapter 1

Introduction
Humans are able to recover with ease the shape of many of the objects presented to them. This capability is all the more impressive since the world surrounding us is three-dimensional (3-D) and the images projected onto the retinae of human eyes are essentially two-dimensional (2-D). In theory, a single 2-D image can be the projection of any of many 3-D scenes, as all the points along a given line of sight project to the same point on the image. Such a shape inference capability would have a tremendous number of applications if it could be duplicated in computer vision. The subject of this thesis is the process of deriving 3-D shape descriptions from 2-D images using one of the visual cues: stereo vision.
1.1 Motivation
While there are other visual cues that allow 3-D information to be recovered from 2-D images, stereo vision is among the most direct and reliable ones. There are other methods to extract depth information in machine vision, such as those using range sensors, but they generally require matte surfaces in the scene and their speed is limited, although the technology has been improving. Nonetheless, an existing system, the human visual system, has demonstrated that there is a way to recover 3-D information efficiently and reliably from a stereo pair of images without sending out signals to interfere with the surrounding environment.

The idea behind stereo vision is this: if two images from different viewpoints can be placed in correspondence, in the sense that it is known which points in the two images are projected by the same point in space, the intersection of the lines of sight from two matching image points determines the object point.
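This intersection has an especially simple form under the parallel-axis stereo geometry of Figure 2.3. The sketch below illustrates it with the pinhole model; the focal length, baseline, and coordinate conventions are illustrative assumptions, not values or code from this dissertation.

```python
def triangulate(x_left, x_right, y, f, b):
    """Recover the 3-D point projecting to (x_left, y) in the left image
    and (x_right, y) in the right image, where both images share the same
    scanline (parallel-axis epipolar geometry), coordinates are measured
    from each image center, f is the focal length in pixels, and b is the
    baseline between the two camera centers."""
    d = x_left - x_right           # disparity: shift between matching points
    if d <= 0:
        raise ValueError("a visible point must have positive disparity")
    Z = f * b / d                  # depth, by similar triangles
    X = x_left * Z / f             # back-project through the left camera
    Y = y * Z / f
    return X, Y, Z

# Example: matching points 10 pixels of disparity apart,
# f = 500 pixels, baseline b = 0.2 (same unit as the result):
X, Y, Z = triangulate(120.0, 110.0, 40.0, f=500.0, b=0.2)
print(X, Y, Z)   # depth Z = 500 * 0.2 / 10 = 10.0
```

Note the reciprocal relation between disparity and depth: the same one-pixel matching error costs far more depth accuracy for distant points than for near ones.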
Most related previous work, however, has focused on recovering depth, not shape, from stereo. Even if the issue of whether depth can always be recovered without recovering shape is put aside, such uncorrelated depth estimates may be limited to applications like robot navigation and obstacle avoidance. For tasks like grasping and object recognition, explicit shape descriptions are generally necessary. To pick up an object, we have to visually separate the object from the background, and estimate the 3-D location of its center of gravity so that the manipulator can grasp around the center of gravity to assure a stable grip. To recognize an object, we have to come up with a description of it to compare with the descriptions of the object models in the database, so that hypotheses about the identity of the object as one of the models can be suggested. It is our belief that the recognition problem will be easier if the description contains more global information about an object, as fewer false hypotheses would be made. Better yet, if the description can decompose the shape of an object into parts, each of which is an instance of a general shape primitive, then objects can be classified according to the general shapes of their parts and how their parts are glued together. Such a scheme of shape decomposition would allow a class-subclass hierarchy of objects to be constructed, and recognition to be achieved based on symbolic¹ descriptions rather than precise quantitative data. All these suggest that the ability to derive global shape descriptions from 2-D images, possibly through stereo, can be very useful.
1.2 Objective of the Study
We study the problem of deriving shape descriptions from stereo. It can be considered as consisting of two parts:

1. Stereo Correspondence Problem: For each feature in one view, we have to locate the feature in the other view which is the projection of the same physical entity in 3-D.

¹For example, we would probably label an object consisting of a large flat circular plate supported by three sticks a three-legged table. The exact dimensions of the component parts do not have a direct relationship with the identity of the object.
2. Shape Description Problem: We have to build up symbolic descriptions of the imaged scene so that we can distinguish objects in the scene from the background (Segmentation), and represent the interaction of the surfaces or parts that compose an object (Shape Decomposition). Such descriptions would require the depth and orientation discontinuities among the different surface patches to be extracted, and the nature of the visible surface boundaries as being crease boundaries (true surface boundaries) or limb boundaries (visible boundaries of curved surfaces slowly turning away from the viewer) to be identified.

Traditionally, these two problems have been regarded as two separate, independent processes. It is the view of many that stereo matching precedes any high-level structural descriptions (Figure 1.1), and that the only goal of stereo is to come up with a depth map like the range data measured from direct range sensors. The dense depth map, as well as the depth and orientation discontinuities in the scene, can be obtained by matching some primitive features in the images plus a subsequent surface interpolation process. If necessary, shape information can be derived from such range data. The shape derivation process is hence independent of how the range data is acquired.

Such a view may have been influenced by the Random Dot Stereogram (RDS) experiments of Julesz [65], where structural descriptions in monocular images are absent, and the only clue for locating depth and orientation discontinuities and recovering structural descriptions in the scene is the 3-D depth measurements acquired by matching intensity windows or edges in the images.

Here we take a different point of view. We believe that while for some scenes, such as RDS, the stereo correspondence process must come before the description process, in general structural descriptions can aid the correspondence process, and there are many advantages to the two processes working together cooperatively. There are three reasons for that:

1. It is well known that correspondence ambiguity can be a problem in matching primitive features, especially when there are repetitive patterns in the scene. By extracting a hierarchy of structural descriptions in each view and matching these features at the different levels, we gain the following:

(a) high-level structural features help reduce the ambiguity of multiple matches in the correspondence process;
Figure 1.1: The traditional approach to stereo processing.
(b) correspondences help confirm the different levels of abstract features in the two views and maintain a consistent interpretation of the scene.

2. It is usually assumed that depth and orientation discontinuities in the scene can be located either by setting some thresholds during a surface interpolation step subsequent to the correspondence process, or by looking at the disparity differences between adjacent edge elements (e.g., line processes [106]). This may be true if the scene is textured enough so that stereo matching returns dense depth estimates. However, if the scene is not densely textured, even if we can assume the initial depth estimates are perfect, surface interpolation without knowing which side
of an edge is the occluding surface would lead to erroneous results. In addition, the sparsity of features would present difficulties in adjusting the thresholds to detect discontinuities (points on a 1-D step function will appear more "collinear" when they are farther apart, as shown in Figure 1.2). We conjecture that this is where localizing discontinuities from multiple intensity images is different from, and more difficult than, doing so from direct range data. We believe only by looking at global properties like continuity of surface boundaries can we locate the discontinuities in such scenes.
Figure 1.2: Difficulty of recovering depth discontinuities by observing local
depth estimates: data “look” smoother when they are more sparse, even across
depth discontinuities.
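The effect in Figure 1.2 can be made concrete numerically: the deviation from collinearity of three samples straddling a fixed depth step shrinks as the samples move apart, so any fixed threshold eventually misses the discontinuity. The measure below (the angle subtended by the step at the given sample spacing) is an illustrative stand-in, not the system's actual criterion.

```python
import math

def step_angle(spacing, jump=1.0):
    """Angle (in degrees) by which the third of three evenly spaced samples
    straddling a depth step of size `jump` deviates from the line through
    the first two.  Samples lie at (-spacing, 0), (0, 0), (spacing, jump)."""
    return math.degrees(math.atan2(jump, spacing))

for spacing in (1.0, 4.0, 16.0):
    print(spacing, round(step_angle(spacing), 1))
# The same physical step looks progressively "flatter" as the samples
# surrounding it become more sparse.
```

A threshold tuned to catch the step at dense sampling (45° here at unit spacing) sails past the identical step once the nearest features are 16 samples apart (under 4°), which is why the text argues for global boundary continuity instead of local thresholds.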
3. Unlike range data acquired from direct range sensors, depth information regarding curved surfaces slowly turning away from the viewer is generally not directly available from stereo. The reason is that, at a different angle of view, the apparent surface edge is projected by a different contour on the curved surface (Figure 1.3). We call those boundaries in 3-D limb boundaries, and the projected edges limb edges. It would be inaccurate to match the limb edges in the two views directly; the error increases monotonically with the image resolution and the stereo angle. Lim and Binford [69] have shown that the error can be significant. An even more important issue is whether we will mistake a limb boundary for a real surface boundary in deriving the shape description. When there are curved surfaces in the scene, depth information is simply not available unless global shape description is taken into consideration.

We believe the correspondence process and the shape description process can help each other in deriving shape descriptions from stereo. In this thesis we present some ideas on how this can be achieved.
Figure 1.3: Stereo projections of a curved surface.
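The limb-matching error of Figure 1.3 can be checked numerically. In the 2-D sketch below, a circular cross-section sits in front of two cameras; each camera sees a slightly different tangent (limb) point, so triangulating the two limb edges as if they matched yields a point off the true surface. All values (camera positions, circle size) are illustrative assumptions, not the dissertation's experiment.

```python
import math

def tangent_point(p, m, r, side=-1):
    """Limb (tangent) point on the circle with center m and radius r, as
    seen from external point p; side=-1 picks the leftmost candidate."""
    dx, dy = m[0] - p[0], m[1] - p[1]
    L2 = dx * dx + dy * dy                      # squared distance to center
    a = r * r / L2
    h = r * math.sqrt(L2 - r * r) / L2
    # the two tangent points are m - a*d +/- h*perp(d), perp(d) = (-dy, dx)
    cands = [(m[0] - a * dx - s * h * dy, m[1] - a * dy + s * h * dx)
             for s in (-1.0, 1.0)]
    return min(cands) if side < 0 else max(cands)

def intersect_rays(p, tp, q, tq):
    """Intersection of the line of sight p->tp with q->tq (2-D)."""
    ux, uy = tp[0] - p[0], tp[1] - p[1]
    vx, vy = tq[0] - q[0], tq[1] - q[1]
    s = ((q[0] - p[0]) * vy - (q[1] - p[1]) * vx) / (ux * vy - uy * vx)
    return (p[0] + s * ux, p[1] + s * uy)

center, radius = (0.0, 10.0), 1.0               # cross-section at depth 10
cam_left, cam_right = (-2.0, 0.0), (2.0, 0.0)   # baseline 4
tl = tangent_point(cam_left, center, radius)    # left limb, left view
tr = tangent_point(cam_right, center, radius)   # left limb, right view
pt = intersect_rays(cam_left, tl, cam_right, tr)
err = math.hypot(pt[0] - center[0], pt[1] - center[1]) - radius
print(round(err, 3))  # the "matched" limb point floats ~0.02 off the surface
```

Shrinking the baseline shrinks the error, consistent with the text's remark that the error grows with the stereo angle; but a small baseline also degrades depth resolution, so naive limb matching cannot be fixed by geometry alone.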
1.3 Our Approach
We propose to compute a hierarchy of descriptions from each view and to match these features at the different levels. With the description and correspondence processes going hand in hand, we allow high-level abstract features to help reduce correspondence ambiguity, and we exploit the multiple views to help confirm the different levels of descriptions in the scene. One of the difficulties in pursuing this approach is that it requires the ability to make structural descriptions from monocular views, a traditionally difficult problem.

The Gestalt school of psychology in the early twentieth century provided a number of useful insights into human perception. They proposed that the ability to detect feature relationships such as collinearity, parallelism, connectivity, symmetry, and repetitive patterns among image elements is very important in human vision. The phenomenon is called perceptual organization. Some studies [115, 116] have also suggested its importance in computer vision. Recently some researchers [1, 82, 83] have investigated the use of cues from perceptual organization to perform segmentation from a single intensity image.
Our system extracts hierarchical features from each view using a perceptual
grouping technique developed by Mohan and Nevatia [83], and matches these
features across stereo. A block diagram of the approach is shown in Figure 1.4.
We also present a basic property of occlusions in stereo views, and show how
we can use this property with the structural descriptions to identify different
types of occlusions in the scene. In particular, we identify creases and limb
boundaries, and make inferences about properties of surfaces that are visible
only in one of the stereo views.
Figure 1.4: A stereo correspondence approach using hierarchical features.
Such a stereo correspondence system is capable of producing segmented
surfaces and distinguishing between creases and limbs, yet no precise 3-D
information is extracted about the limb boundaries. We propose to infer
volumetric shape directly from stereo correspondences, by viewing the objects as
composed of some general shape primitives rather than via some intermediate
2½-D depth measurements. This is shown in Figure 1.5. Precise 3-D information
then comes naturally together with the shape descriptions. Our methods
are based on some invariant properties of some general shape primitives in
their monocular and stereo projections.
We also apply the ideas to an important application domain of computer
vision: the recovery of architectural structures from a stereo pair of aerial
images. Since aerial images are typically highly cluttered, and many features in
Figure 1.5: Our approach to deriving shape descriptions from stereo.
the scene such as the surface boundaries, the pavement edges, and the shadow
boundaries are parallel to one another, monocular grouping from aerial images
is particularly difficult; 3-D information can be very helpful in the grouping
process. We have designed a stereo system that extracts structural features
bottom-up hierarchically from each image, and at the same time matches
such features across stereo to recover 3-D information to help in the feature
extraction process. Such a system has shown good results on a variety of
different scenes.
We impose the following restrictions on the scene:
1. all surfaces are opaque;
2. the images are taken under general points of view, i.e., in each view the
observed structures such as surfaces and junctions are stable under slight
perturbations of the viewpoint;
3. the image resolution is high enough relative to the sizes of the objects in
the scene so that errors in detecting edges around junctions are negligible;
4. objects consist of or can be approximated by instances of the shape
primitives that we assume in inferring their shapes from the image data.
We also assume the objects we are working with are not necessarily densely
textured, so that a complete 2½-D depth map of the scene, like the output
from range sensors, may not be directly available from stereo. If a complete
2½-D depth map can be recovered from stereo by matching dense texture
plus a surface interpolation process, an example of which is the random dot
stereogram, the shape description process will be somewhat independent of
the correspondence process.
1.4 Original Contributions
The major contributions presented in this thesis are:
— the design of a stereo system that combines the stereo matching and the
structural description processes;
— an analysis of occlusion effects in stereo images;
— the identification of visible surface boundaries as being crease boundaries
or limb boundaries from stereo;
— an invariant property of the stereo projections of an LSHGC (Linear
Straight Homogeneous Generalized Cylinder) and a method to recover
an LSHGC from its stereo images using this property;
— an invariant property of the stereo projections of an SHGC (Straight
Homogeneous Generalized Cylinder) and a method to recover an SHGC
from its stereo images using this property;
— the design of a stereo system that recovers building structures from a
stereo pair of aerial images.
1.5 Organization of this Dissertation
We start in chapter 2 by introducing the ideas behind stereo processing,
and present a survey of previous research related to our problem.
We describe the ideas and show some results on building a hierarchical stereo
system in chapter 3, and we do the same for deriving shape descriptions in
chapter 4. As an application, we apply some of the proposed ideas to an
important application domain of computer vision, namely the reconstruction
of polyhedral building structures from a stereo pair of aerial images. We
describe the system and show the results in chapter 5. Finally we give some
conclusions and offer suggestions for future research in chapter 6.
Chapter 2
Basic Concepts and Related Research Survey
This chapter outlines some of the fundamental concepts of stereo analysis, and
gives a brief survey of previous work related to our problem, specifically
work regarding stereo correspondence and recovering three-dimensional shape
from intensity images. The survey is brief, as the number of related topics
is large and extensive surveys can be found elsewhere [10, 35, 24]. The
intention here is to outline some of the shortcomings of previous work and to
gain some insight into possible solutions to the problem of shape from stereo.
2.1 Fundamentals of Stereo Analysis
The projection of a 3-D scene onto a 2-D image can be approximately modeled
by the pinhole camera projection as shown in Figure 2.1. For convenience,
here we put the image plane between the focal point and the 3-D scene to
compensate for the inversion effect in the image.
Using the focal point as the origin of the Cartesian coordinate system and
taking the optical axis as the Z-axis, the image coordinates (u, v) and the
object coordinates (x, y, z) are related by:
u = f x / z,
v = f y / z,
where f is the focal length of the camera. Note that the direction of the Z-axis
is taken to be pointing away from the scene so that a right-handed coordinate
system is maintained.
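The projection relations can be sketched directly in code. The function name and sample points below are illustrative only, and the sign convention for z is simplified to positive depths:

```python
def project(x, y, z, f):
    """Project the 3-D point (x, y, z) through a pinhole camera with
    focal length f at the origin, using u = f*x/z and v = f*y/z."""
    if z == 0:
        raise ValueError("point lies in the focal plane; projection undefined")
    return f * x / z, f * y / z

# Doubling the distance along the optical axis halves the projected
# coordinates, which is the loss of depth information discussed below:
assert project(2.0, 4.0, 10.0, 1.0) == (0.2, 0.4)
assert project(2.0, 4.0, 20.0, 1.0) == (0.1, 0.2)
```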
As shown in Figure 2.1, as a 3-D object is projected to a 2-D image, the
depth information along the lines of sight is lost during the projection process.
Figure 2.1: The pinhole camera projection model.
The image point p can be the projection of the point P in space or of any other
point P' along the line of sight. How to recover the lost depth information
from intensity images has been a basic problem in vision.
The idea of stereo is very simple: we take one more picture of the scene
from another angle. Provided that the relative transformation between the two
views is known, once we know which features in the two images are projections
of the same entity in 3-D, the position of the entity in space can be recovered
as the point of intersection of the corresponding lines of sight. This process
is called triangulation. A paradigm of the stereo vision process
therefore generally involves:
1. camera calibration,
2. feature extraction,
3. feature matching, and
4. 3-D information computation.
While stereo provides one possible solution to depth determination from
2-D images, the existence of multiple views brings another problem. For each
feature in one view, we have to locate the feature in the other view that
is the projection of the same physical entity in 3-D. This problem, known as
the correspondence problem, has historically been regarded as the most
important problem in stereo analysis.
As the relative transformation of the two views is known, for any image
point p in one of the images, say image L, a plane in 3-D can be
defined by p and the focal points of the two views. The plane is called the
epipolar plane of the image point p. As shown in Figure 2.2, the epipolar
plane cuts the other image plane R along a line. It can be proved that this
line, called the epipolar line of p, always contains the image point in image R
corresponding to p. As a result, given any image point in one of the stereo
views, we can uniquely define the epipolar line in the other view, along which
alone the corresponding image point in that view can be found. This reduces
the originally two-dimensional search for correspondence to a one-dimensional
search. This property of stereo, called the epipolar constraint, has been widely
used in stereo analysis.
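For the rectified (parallel-axis) geometry introduced later in this chapter, the constraint can be made concrete with the standard fundamental-matrix form for that configuration; the point coordinates below are made up:

```python
# For rectified stereo the fundamental matrix is, up to scale,
# F = [[0,0,0],[0,0,-1],[0,1,0]], and the epipolar line of a point
# p = (u, v, 1) in the other image is l = F p (a, b, c coefficients
# of the line a*x + b*y + c = 0).

def epipolar_line(u, v):
    """Return the epipolar line (a, b, c) of image point (u, v)."""
    F = [[0, 0, 0],
         [0, 0, -1],
         [0, 1, 0]]
    p = (u, v, 1)
    return tuple(sum(F[i][j] * p[j] for j in range(3)) for i in range(3))

a, b, c = epipolar_line(12.0, 5.0)
# The line is 0*x - 1*y + 5 = 0, i.e. y = 5: the same scan line as p.
# The 2-D search for a match thus collapses to a 1-D search along row v.
assert (a, b, c) == (0.0, -1.0, 5.0)
```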
However, multiple matchable features are often found along the same
epipolar line; the stereo system has to distinguish the correct matches from the
false matches. Such ambiguity in stereo matching has been a major concern
in previous work, and various constraints have been proposed to overcome
it. Two popular ones are the surface-continuity and the ordering constraints.
The surface-continuity constraint assumes that depth is continuous in space,
while the ordering constraint requires the order of features along epipolar lines
to be preserved. These are in fact heuristic constraints rather than physical
constraints: although they are valid only within a single surface, most previous
stereo systems apply them over the entire scene.
Even with such constraints, ambiguity in stereo matching may still exist.
Imagine that two sets of dots along the same corresponding epipolar lines in the
two views are to be matched with one another. More than one correspondence
possibility can be obtained by sliding the features in the right view over those
in the left view from left to right during matching, while keeping the surface-
continuity and ordering constraints satisfied. To put it in the terminology of
optimization, these possibilities are multiple local extrema in the solution
space. As surface-continuity and ordering constraints alone usually cannot
prevent the correspondence process from being trapped in such local extrema,
Figure 2.2: The epipolar plane and the epipolar line of an image point.
people have resorted to using more abstract features for correspondence, so
that the solution at the global extremum can be obtained.
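The interaction of the ordering constraint with matching along an epipolar line can be sketched as a small dynamic program. This is an illustrative toy, not a system surveyed here; the feature values and the skip cost are invented:

```python
def dp_match(left, right, skip_cost=1.0):
    """Align two 1-D feature sequences so that matched pairs keep their
    left-to-right order (the ordering constraint), minimizing the total
    dissimilarity plus a penalty for unmatched (occluded) features."""
    n, m = len(left), len(right)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # match left[i] with right[j]
                c = abs(left[i] - right[j])
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1], cost[i][j] + c)
            if i < n:            # leave left[i] unmatched (occlusion)
                cost[i + 1][j] = min(cost[i + 1][j], cost[i][j] + skip_cost)
            if j < m:            # leave right[j] unmatched
                cost[i][j + 1] = min(cost[i][j + 1], cost[i][j] + skip_cost)
    return cost[n][m]

# Two near-identical dot patterns admit several orderly alignments
# (the local extrema discussed above); the DP finds the global
# optimum, here with cost about 0.4.
print(dp_match([1.0, 3.0, 5.0], [1.2, 3.1, 5.1]))
```

The point of the sketch is that the ordering constraint alone only restricts alignments to be monotonic; distinguishing the global optimum from the other monotonic alignments still requires a cost comparison.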
While the idea of stereo is valid for any camera configuration, the parallel-
axis stereo geometry, in which the optical axes of the cameras are parallel
(Figure 2.3), is most widely used because of its computational convenience.
Under such a configuration, the scan-lines in the images are the epipolar
lines in the two views. It can be proved that stereo images taken under other
Figure 2.3: The parallel-axis stereo geometry.
camera configurations can also be transformed² into images as if they were
taken under the parallel-axis stereo geometry. In this dissertation it is always
assumed that the images are taken under such a camera configuration, unless
otherwise stated.
²The idea is simple. Knowing the position of the focal point of an image, we can always
reconstruct all the lines of projection going through the image, each equipped with
a corresponding intensity value. Say we have two images L and R. We can take the
plane in 3-D that contains the image L, and find all the points of intersection between this
plane and the lines of projection going through the image R. All these points of intersection,
with the intensity values inserted into the corresponding places, then form with the image
L a stereo pair of images which are coplanar.
Under the parallel-axis stereo geometry, the disparity of any object point
(x, y, z) in 3-D, whose image coordinates in the left and right views are (ul, v)
and (ur, v) respectively, can be defined as
d = ul - ur.
In stereo analysis, the disparity d is often used in place of depth z for
intermediate 3-D computations. It is directly computable from the image
coordinates without knowing the camera parameters, and it can be easily
converted to depth using the simple relation
z = f B / d,
where f is the common focal length of the two cameras, and B is the distance,
called the stereo baseline, between the focal points of the two views. Disparity
and depth are therefore inversely proportional to each other.
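The two relations transcribe directly; the values for f, B, and the image coordinates below are made up for illustration:

```python
def disparity(ul, ur):
    """Disparity d = ul - ur under the parallel-axis geometry."""
    return ul - ur

def depth_from_disparity(d, f, B):
    """Depth z = f * B / d under the parallel-axis geometry."""
    return f * B / d

f, B = 50.0, 0.5            # focal length and baseline (arbitrary units)
d = disparity(ul=12.0, ur=10.0)
z = depth_from_disparity(d, f, B)
assert d == 2.0 and z == 12.5
# Halving the disparity doubles the depth: inversely proportional.
assert depth_from_disparity(d / 2, f, B) == 2 * z
```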
2.2 Stereo Correspondence
There has been a great deal of work on how to correspond features in a
stereo pair of views so that a 2½-D depth map of the scene can be recovered
through triangulation. Stereo correspondence is complicated by three inherent
problems:
1. ambiguity among matches;
2. photometric distortions and camera noise; and
3. occurrence of occlusions among surfaces.
To deal with these problems, features at various levels of abstraction, in
combination with various sets of constraints, have been proposed for
correspondence. A good survey of recent literature can be found in [10, 35].
Two principal approaches are generally used: area-based matching and
feature-based matching. They are characterized by the types of information
used for correspondence.
Area-based matching approaches attempt to match small windows from
the left and right views by looking at how correlated their intensities, or some
continuous dense functions of their intensities, are. Such an approach has the
advantages that the intensity windows being matched are simple to extract,
and it delivers a dense depth map if the scene is densely textured enough. On
the other hand, it requires the presence of significant texture in the scene. As
the information being matched corresponds directly to intensity variations,
area-based matching is sensitive to photometric distortions and camera
noise. The presence of an occluding boundary in the correlation window will
also confuse a correlation-based matcher, giving an erroneous depth estimate.
This is because partially occluded surfaces are generally hidden to
different extents at different angles of view, and as a result intensities around
an occlusion boundary in the stereo views do not generally correspond.
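As a concrete illustration of area-based matching (a toy sketch, not any specific system surveyed here), the following slides a window along one scan line and picks the shift with the highest normalized cross-correlation; the rows and window size are invented:

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length windows."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def best_shift(left_row, right_row, x, w):
    """Disparity of the window of half-width w centered at x in the
    left row, by exhaustive correlation along the right row."""
    patch = left_row[x - w: x + w + 1]
    scores = {}
    for xr in range(w, len(right_row) - w):
        scores[x - xr] = ncc(patch, right_row[xr - w: xr + w + 1])
    return max(scores, key=scores.get)

left  = [10, 10, 40, 90, 40, 10, 10, 10]
right = [10, 40, 90, 40, 10, 10, 10, 10]  # same pattern shifted by one pixel
print(best_shift(left, right, x=3, w=1))
```

On this toy pair the matcher recovers disparity 1, the one-pixel shift of the bright blob; with repeated texture or an occluding boundary inside the window, the correlation peak becomes ambiguous or wrong, as the text explains.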
Feature-based matching approaches use structural features rather than
image intensities for stereo matching. Since such features are abstractions of the
underlying intensity changes rather than the actual intensities, they are less
sensitive to photometric variations and are less subject to mismatches across
depth discontinuities. Scenes with little texture can also be handled. It has
also been shown [71, 60] that features such as edges can be estimated to
subpixel accuracy, thus increasing the resolution of depth estimation. Features
proposed for stereo matching include edgels (edge elements), line segments,
junctions, homogeneous intensity regions, and surface-level features.
2.2.1 Area-based
Representative examples of area-based stereo systems are [85, 46-48, 89, 84,
94, 114]. They have shown impressive performance on images such as those
of aerial terrain, where the surface can be assumed to vary smoothly and
continuously. However, as explained above, they are likely to fail for structured
scenes with little texture and significant occlusions.
2.2.2 Edgel-based
The simplest features to use are edgels [33, 88, 74, 75, 45, 11, 77, 76, 73, 36,
79, 108, 20]. Since edgels carry relatively little information for correspondence,
edgel-based stereo systems generally experience large correspondence
ambiguity.
2.2.3 Extended-Feature-based
Some have resorted to more abstract features. Medioni and Nevatia [78],
followed by others [3, 4, 92, 5, 56, 81], have suggested using line segments.
Line segments are higher level features and incorporate some figural continuity
properties directly. In early work, Ganapathy [43] also suggested using
junctions for correspondence.
However, features like segments and curves are still not distinct enough to
reduce all the correspondence ambiguity, especially when there are repetitive
patterns in the images and when the depth range in the scene is not narrow.
To reduce correspondence ambiguity, many of the above systems incorporate
a coarse-to-fine strategy in stereo matching, in which correspondence results
at lower resolutions are propagated to guide processing at higher resolutions.
However, it has been argued [97] that high spatial frequencies and low
spatial frequencies may sometimes be informationally orthogonal to each other
(imagine seeing through the narrow slits of a fence to a tree far away). In
addition, occlusions are not explicitly identified in such an approach. Use of the
surface-continuity assumption over the entire scene leads to inaccuracy
along the surface boundaries, which are precisely the places where depth
must be estimated most accurately.
2.2.4 Homogeneous-region-based
Some [98, 72] have proposed using regions of homogeneous intensity for
correspondence, yet homogeneous intensity regions do not always have a direct
relationship with actual surface boundaries, and they are sensitive to
photometric variations and camera noise.
2.2.5 Surface-based
To reduce the correspondence ambiguity while not making the surface-
continuity and ordering assumptions over the entire scene, Lim and Binford
[68] have suggested building a hierarchy of descriptions up to the surface
level in each image, so that features of surface-level abstraction are matched
directly and used as guidance for matching lower-level features in the
hierarchy. They have proposed using edgels, junctions, curves, surfaces, and
even bodies. Matching features of higher abstraction can reduce the matching
complexity and avoid local mismatches, providing globally consistent
correspondence results. Surface-continuity and ordering constraints can also be
applied to the places where they are likely to hold. The question is how to
segment those surfaces and identify the occlusion boundaries. Their system
has been demonstrated on relatively simple scenes, and its major weakness
was in the modules that infer surface-level descriptions from monocular images
by simply following along contours.
As surfaces may have markings and their boundaries may be fragmented
and occasionally missing, extracting surfaces by merely doing contour-tracing
will not be reliable. Mohan and Nevatia [82, 83] have suggested using a
technique called perceptual grouping to extract surfaces from an image, in which
relationships of co-curvilinearity and symmetry are employed to extract
hierarchical features up to the surface level. They have also shown results on how
such descriptions can be used for stereo matching. Yet monocular reasoning
and stereo matching are regarded as two separate processes, and no coupling
between monocular grouping and stereo correspondence was suggested.
2.2.6 Combined Methods
Notice that there have also been attempts to combine feature matching and
surface interpolation processes, or area-based and feature-based matching
approaches. Examples are [6, 52, 32]. They have shown encouraging results on
a variety of scenes. However, the features they use are generally of a local
nature, and they again apply the surface-continuity assumption over the entire
scene to reduce the correspondence ambiguity. Surface boundaries are therefore
not accurately localized, and they have problems with scenes that are not
significantly textured.
2.3 Shape Representation
Explicit shape descriptions of objects, rather than a set of uncorrelated depth
estimates, are usually required for various applications of vision. Here we
present some of the different shape representations, and in the next section
their derivation from the scene data as proposed in the literature.
Shape descriptions can be either segmented or non-segmented. Non-segmented
descriptions represent an entire object as a set of distinct features without taking
into account its component parts and the relationships among the parts. The
features are usually easier to extract but do not carry a lot of information.
Examples of such representations include Extended Gaussian Images [57, 58,
62], the Generalized Hough Transform [8, 9], Fourier Transform and moments
[37], and oct-trees [64].
On the other hand, segmented descriptions describe an object in terms of its
component parts and express the structural relationships among the parts. They
are therefore rich in that they have strong discriminative power to distinguish
between objects, stable in that they can tolerate slight variations in the data,
and capable of describing objects with articulating parts.
Yet within segmented descriptions there is still the issue of how abstract
the descriptions are. Lower-level descriptions are less distinct and thus induce
higher computational complexity for tasks like object recognition. Moreover,
lower-level descriptions are usually viewer-centered descriptions, which have the
problem of being different when an object is viewed from a different angle. An
example is surface-based descriptions [63, 40, 93, 19, 14, 2, 39]. This renders
correspondence with object models during recognition more expensive, as each
object model has to be stored as a number of different aspects (an aspect is
an equivalence class of views of an object which have the same set of visible
surfaces and the same topological relationships among them). There is also the
question of how many distinct aspects [38, 44, 49, 63] a given object can possibly
have. On the contrary, volumetric descriptions represent the complete shape
of an object and are invariant to the angle of view; grasping and object
recognition via them are thus more robust and accurate.
For these reasons, volumetric descriptions have received a lot of attention
in computer vision research. There have been attempts [86] to extend the
2-D shape descriptors Symmetric Axis Transform (SAT), introduced by Blum
[18], and Smooth Local Symmetries (SLS), proposed by Brady and Asada
[23] as an improvement over SAT, to describe 3-D shape. Yet the axis of
symmetry generalizes to a surface in 3-D, which makes these descriptions
undesirable. Superquadrics have also been suggested [21, 22, 95] for representing
3-D objects. However, such a representation is not natural, as humans have
difficulty visualizing an object from the representation. It has also been argued
that some of the parameters are exponents, so surface fitting may be unstable
with respect to them, especially if the exponents are small [21]. Superquadrics
also need dense range data for fitting and extracting the representation.
The volumetric representation with the longest history is probably
Generalized Cylinders, introduced by Binford in [16]. Various attempts
[91, 25, 41] have applied the generalized cylinder representation to recognition.
A Generalized Cylinder (GC) is defined as an arbitrary planar
shape, called the cross-section, swept along an arbitrary curve in 3-D, called the
axis. In general, the size and even the shape of the cross-section may change
along the axis; the rule describing the change along the axis is called the sweep
rule. It has been argued that this representation may be too general to be
useful, and specific types of GCs have been proposed to describe 3-D objects,
such as the Linear Straight Homogeneous GC (LSHGC), the Straight
Homogeneous GC (SHGC), the Constant GC (CGC), etc. The GC description is
rich, and is naturally amenable to a hierarchical representation so that a
complex object can be represented as an assembly of simpler objects. It is
especially good at describing object shape at the global level of part-subpart
articulations, and fine details of the individual parts can be augmented to the
descriptions in a hierarchical manner.
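The definition can be illustrated with a small sampling sketch. The circular cross-section and the particular sweep rules below are arbitrary choices for illustration, not taken from the text:

```python
import math

def straight_gc_points(cross_section, sweep, n_slices=5):
    """Sample surface points of a straight-axis GC: at height t
    (0 <= t <= 1) the fixed planar cross-section is scaled by the
    sweep rule sweep(t) in the plane z = t."""
    pts = []
    for k in range(n_slices):
        t = k / (n_slices - 1)
        s = sweep(t)
        pts.extend((s * x, s * y, t) for x, y in cross_section)
    return pts

# A 12-point circular cross-section.
circle = [(math.cos(a), math.sin(a))
          for a in (2 * math.pi * i / 12 for i in range(12))]

cylinder = straight_gc_points(circle, sweep=lambda t: 1.0)      # constant rule
cone     = straight_gc_points(circle, sweep=lambda t: 1.0 - t)  # linear (LSHGC)
vase     = straight_gc_points(circle, sweep=lambda t: 1.0 + 0.3 * math.sin(3 * t))
```

Under the linear sweep rule the top slice collapses to the apex, giving a cone, an LSHGC; the sinusoidal rule gives a vase-like SHGC whose cross-section shape stays fixed while its size varies along the axis.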
2.4 Shape Reconstruction
In the following we outline some of the previous work on how the shape
representations listed above can be extracted from a single intensity image or
from stereo images. Our survey here is focused on recovering GC shape
descriptions, as most other shape representations are extracted from dense range
data in the literature. We will also present the work done on an important
application of vision, namely the reconstruction of building structures from
stereo images. There are vision systems that recover object shapes from
correspondences of the scene data with exact models in a database. However,
as mentioned above, they are not of our primary interest here and will be
skipped. Extensive surveys on shape description and recognition can be found
in [13, 24].
2.4.1 Shape Recovery from a Single Image
Humans are able to perceive 3-D objects from 2-D line drawings with ease.
According to experiments of Barrow and Tenenbaum [12], shape from boundary
contours seems to dominate information from other cues such as shading.
In particular, Biederman [15] claims that in experiments with humans the
recognition of a fully colored image of an object is no faster than the
recognition of a line drawing of the object. As we assume dense texture and shading
may not be present in our problem, our survey here covers only work concerning
shape recovery from boundary contours.
Early 3-D shape reconstruction work was focused on the analysis of line
drawings of polyhedra [61, 31, 70, 66]. There have also been some primitive
attempts to handle curved surfaces, such as [12, 105, 113, 55].
Ponce et al.: Ponce et al. [96] have studied the problem of recovering 3-D
shape from a single image more analytically. While the problem is
underconstrained, they show that it can be simplified in the case where the objects
viewed are generalized cylinders. They investigated some invariant properties
of SHGCs, and in particular they showed that the image of the axis of an SHGC
is recoverable from its contour in the image.
Dhome et al.: Dhome et al. [34] proposed a method to localize a polyhedral
object from its perspective image. Its principle is based on the interpretation
of three image lines as the perspective projection of three linear ridges of
the object model, and on the search for a model attitude consistent with these
projections. In another paper authored by Richetin et al. [101], they extend
the method to localize SHGCs. A line joining two zero-curvature points of an
SHGC in the image, together with the tangents at those points, is used as
the straight-line triplet in the above method. The problem they tackled is
actually object localization rather than recognition: the whole process is to
physically transform a known object model to fit the scene data so that the
transformation parameters can be recovered.
Ulupinar and Nevatia: Recently, Ulupinar and Nevatia [109, 110]
proposed a general technique to recover the shape of Zero-Gaussian-Curvature
surfaces (ZGCs), SHGCs, and CGCs, based on the analysis of symmetries in their
images. A net is constructed on the target surface based on the symmetries
of the contours, each little window in the net being approximated by a plane
whose orientation is to be computed. Constraints regarding how the
orientations of the little windows are related to one another are derived and used
to compute the orientations. However, this has been shown to be still
underconstrained in general, and an additional perceptual assumption, namely that
the rulings of the net are orthogonal, has to be made to compute the
orientations of the small windows. Moreover, it assumes the imaging model to be
an orthographic projection, which is not always appropriate, especially if the
object is close to the visual sensor. However, in another paper [111] they have
provided some insights into how to interpret line drawings under perspective
projection. The most inspiring is the definition of convergent symmetry, which
allows the orientation of a planar object to be determined uniquely from its
perspective image.
2.4.2 Shape Recovery from Stereo Images
It is difficult to derive 3-D shape descriptions from one view, as the problem is
indeed highly underconstrained; no wonder assumptions about the imaging
model and other perceptual assumptions are usually made in the shape
inference process. Moreover, object shapes derived from one view can at best be
localized in 3-D up to a scaling factor. If we are given multiple images of the
scene from different known viewpoints, which are essentially stereo images, the
problem is potentially simplified because of the additional information.
Surface Interpolation Work: However, recovering shape from stereo has
generally been regarded as merely recovering a dense depth map of the scene.
The depth map is usually acquired by doing surface interpolation over the
sparse depth estimates acquired from stereo matching. Examples are [45, 107,
17, 104]. Some of them have addressed the problem of recovering depth and
orientation discontinuities in the scene. However, such methods require looking
at local depth differences, which may not be reliable when the scene is not
densely textured.
Moreover, as mentioned in chapter 1, depth information regarding curved
surfaces slowly turning away from the viewer is generally not directly available
from stereo. The reason is that at different angles of view the limb edges are
projected from different contours, which we call contour generators, on the
curved surface. This is an intrinsic problem in stereo analysis but has been
ignored in most stereo work. Only surface orientations, not depth measurements,
are available along those image contours.
Rao and Nevatia: Rao and Nevatia [100, 99] described a technique to derive
volumetric descriptions from stereo. Their work concentrates on LSHGCs and
piecewise LSHGCs. They employ a hypothesis-and-verification strategy, using
two properties of LSHGCs originally given in [103]:
1. The contour generators of an LSHGC are two straight line segments;
2. The contour generators of an LSHGC are coplanar.
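Property 2 suggests a simple numerical check on recovered 3-D segments (a hedged sketch, not Rao and Nevatia's implementation): two line segments are coplanar iff the scalar triple product of their direction vectors and a vector connecting them vanishes. The tolerance and sample segments are illustrative.

```python
def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coplanar(seg1, seg2, tol=1e-9):
    """Test whether two 3-D segments (each a pair of endpoints) lie in
    a common plane, via the scalar triple product (d1 x d2) . w."""
    d1, d2 = sub(seg1[1], seg1[0]), sub(seg2[1], seg2[0])
    w = sub(seg2[0], seg1[0])
    return abs(dot(cross(d1, d2), w)) < tol

# Two straight generators through a common apex lie in the plane y = 0:
assert coplanar(((0, 0, 0), (1, 0, 1)), ((0, 0, 0), (-1, 0, 1)))
# A skewed pair fails the test:
assert not coplanar(((0, 0, 0), (1, 0, 1)), ((0, 1, 0), (1, 1, 2)))
```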
The system takes as input possibly sparse and imperfect 2½-D depth
measurements from stereo, and two methods have been proposed to reconstruct
LSHGCs from the data. In one method an LSHGC is hypothesized from its
contour generators exhibiting the above properties, and verified by the cuts,
which they call terminators, at the two ends of the LSHGC. In the other method
it is just the opposite: an LSHGC is hypothesized from the terminators, which
are assumed to be planar, and verified by the contour generators.
Figure 2.4: Lim and Binford's method for curved surface reconstruction.
The proposed methods, however, make the assumption that 2½-D depth
measurements along the contour generators are available, which is clearly not
the case along the limbs of curved surfaces, as described earlier.
Lim and Binford: Lim and Binford [69] have explicitly addressed the
problem caused by limb edges in the process of reconstructing curved surfaces from
stereo. They did not indicate how limb edges can be identified, but they
proposed a method to reconstruct the curved surfaces from stereo. The object in
the scene is assumed to be composed of a number of parallel cross-sections,
each of which can be described by a conic function involving five parameters.
The curved object is first cut into a number of slices such that each slice
lies in an epipolar plane, as shown in figure 2.4. Within each epipolar plane,
the four lines of sight can be recovered from the stereo images, and the
cross-section is a conic tangential to the four lines of sight. This leaves only
one free parameter for the conic in each epipolar plane.
Two constraints, the extremum constraint and the terminator constraint,
have been proposed, and either of them can be used to determine the free
parameter. The extremum constraint chooses the most compact shape among
all possible conics in each epipolar plane. The terminator constraint chooses
the conic which has the same eccentricity as that of the boundary of the
terminator surface.
There are still problems with Lim and Binford’s method:
1. Epipolar planes are parallel to each other only when the projection geometry is orthographic. However, the limb problem is significant only when the curved surface is at close range, where the orthographic approximation is not a good one. This means epipolar slices, being non-parallel to one another, generally do not possess similar characteristics, and neither the terminator constraint nor the extremum constraint can be applied to them as a whole.
2. Epipolar planes are in general not parallel to the terminator surface, as shown in figure 2.5. The difference in orientation depends entirely upon the orientation of the object with respect to the cameras. This implies the epipolar slices have no direct relationship with the terminator surface, and thus the terminator constraint is generally not applicable. On the other hand, applying the extremum constraint will give a stack of conics that are most compact only in the orientations of the epipolar slices. The recovered shape will therefore differ with the angle of view of the object.
2.4.3 Recovering Architectural Structures from Stereo
One major application of computer vision is to detect architectural structures from aerial images. Although the application is rather specific, it is a good yet not out-of-reach platform to evaluate how well state-of-the-art vision systems can perform on cluttered and more realistic scenes. Some stereo systems have been developed specifically for aerial image understanding. Here we list some of the more important ones. Task-specific knowledge is generally employed in such systems.
Liebes, Baker et al., Herman and Kanade: For example, Liebes, Baker et al. [67, 7] have suggested using orthogonal trihedral vertices (OTVs) for stereo correspondence. Herman and Kanade [50] have developed a system called MOSAIC to incrementally update the 3-D model of a scene from successive views using stereo, in which they assume the vanishing point of the vertical lines in the images is known a priori.
Mohan and Nevatia: Mohan and Nevatia [82] have built a system to recover buildings from aerial images. Their system extracts a hierarchy of structural features from an image using a technique called perceptual organization, and
Figure 2.5: Properties of the epipolar slices: epipolar planes in general are neither parallel to one another nor parallel to the terminator surface of an object. (The figure shows the focal points of the stereo cameras, the terminator surface, lines on different epipolar planes, and the epipolar slices.)
selects the more probable features by checking perceptual constraints among them. They have also shown how such descriptions could be used for stereo matching. Yet the system extracts structural features all the way from edges to surfaces in each view independently, without exploiting the existence of stereo views during the process, and it matches the structural features only at the highest level. It is also restricted to buildings with rectangular structures.
Fua and Hanson: Fua and Hanson [42] have also developed a system to reconstruct building structures from stereo. They model the roofs of buildings as rectilinear objects which have planar intensity distributions and are planar in 3-D. An objective function is thus defined for any roof hypothesis to represent how close the hypothesis is to the predefined building model. An initial roof hypothesis can therefore be gradually deformed to optimize
the objective function using energy-minimizing techniques. The initial hypothesis can be given by a human operator, or be extracted bottom-up from edges in the images. The system does extract surfaces in the scene, yet in semi-automatic mode it requires operator-guided cueing, and in automatic mode the hypotheses are generated from intensity regions or orthogonal edges, which do not always have direct relationships with actual surface boundaries.
Hsieh et al.: Hsieh et al. [59] have also built a system combining edgel-based and area-based approaches to stereo matching, yet the output still falls short of explicitly recovering surface boundaries in the scene.
2.5 Summary
To conclude, the problem of recovering shape using monocular cues is highly underconstrained, and additional assumptions are generally necessary to arrive at a solution. Moreover, from a single view, shape can at best be recovered up to a scaling factor. However, we believe the problem is still a very interesting and important one, and more research needs to be pursued. Monocular cues are the only clue that can be relied on when the objects in the scene are relatively far away from the viewer, as in that case there is no significant difference between single-eye and multiple-eye vision.
On the other hand, stereo vision does provide a more direct method to extract precise 3-D information about a scene if the objects are at close range. However, most related previous work has focused on recovering depth, not shape, from stereo, by matching primitive features in the images. Such systems generally encounter large correspondence ambiguity. Moreover, few have addressed the problem of curved surfaces in the scene. The ideas proposed by [100, 99, 69] are still inadequate for the problem.
We believe that exploiting both the ideas developed for monocular cues and the ease of recovering 3-D information from stereo is a promising direction. Whether a more competent stereo system can be developed largely depends upon understanding how the different processes (stereo correspondence, structural description, locating depth and orientation discontinuities, distinguishing between crease boundaries and limb boundaries, and deriving shape descriptions) can be integrated in a coherent manner.
Chapter 3
Stereo Correspondence: Use of Monocular Groupings and Occlusion Analysis
In this chapter, we describe a hierarchical stereo system [27] that computes a hierarchy of descriptions up to the surface level from each view using a perceptual grouping technique, and matches these features at the different levels. Occlusion is a major problem in stereo analysis and is often not treated explicitly. We present a basic property of occlusions in stereo views, and show how this property, combined with the structural descriptions, can be used to identify different types of occlusions in the scene. In particular, we identify depth discontinuities and limb boundaries, and infer properties of surfaces that are visible in only one of the stereo views. We give some experimental results on scenes with curved objects and with multiple occlusions.
3.1 Introduction
While most take the view that stereo matching precedes structural description, here we propose a hierarchical system that computes a hierarchy of descriptions such as edges, curves, symmetries, and ribbons from each view, and matches these features at the different levels. With the description and correspondence processes going hand in hand, we gain the following advantages:
1. high-level abstract features help reduce the ambiguity of multiple matches in the correspondence process;
2. correspondences help confirm the different levels of abstract features in the two views and maintain a consistent description of the scene;
3. output is a set of segmented surfaces.
The difficulty in pursuing this approach is that it requires the ability to make structural descriptions from monocular views. As surfaces may have surface markings, and their boundaries may be fragmented and occasionally missing, extracting surfaces by contour-tracing alone would not be reliable. Mohan and Nevatia [82, 83] have suggested using a technique called perceptual grouping to extract surfaces from an image, in which relationships of co-curvilinearity and symmetry are employed to extract hierarchical features up to the ribbon level.
We believe that even though the monocular description problem remains a difficult one, enough progress has been made to give useful descriptions for aiding the stereo process. We use the perceptual grouping technique developed by Mohan and Nevatia, though we have made important additions to the grouping technique itself. We use a hierarchy of descriptions as advocated by Lim and Binford [68, 83], but we believe that we have developed a more competent system, in part because of the better descriptions we are able to generate, and in part because of the more sophisticated occlusion analysis that we employ. Our matching technique is also quite different from that of Lim and Binford.
Yet merely going to higher-level features does not solve the entire problem. Occlusions still present difficulties to stereo analysis. Without occlusions in the scene, the stereo correspondence problem can be cast as a one-to-one mapping problem between two sets of features, in which the topology among the features is always preserved in the absence of noise. The traditional stereo constraints of surface continuity and ordering are in fact valid only in the absence of occlusions. However, if there are multiple objects occluding one another in the scene, some features in the scene may be visible in one view but hidden in the other, and they destroy the invariance of the topology among the features even in the absence of noise. Moreover, a major consequence of occlusion in stereo views is that a closer surface hides a more distant surface to different extents in the two views. This implies that even if two patches in the two views are in fact projected from the same surface, their apparent boundaries do not fully correspond if the physical surface in space is occluded. As a result, in addition to looking at the properties of the high-level features alone, we need to study how occlusions affect the interaction among the features in order to match them across stereo views.
Despite these observations, occlusions are seldom treated explicitly; it is usually assumed that they can be identified either by setting thresholds during the surface interpolation step or by looking at the disparity differences of neighboring edge elements (e.g., line processes [106]). However, as explained previously, surface interpolation without knowing which side of an edge belongs to the occluded surface would lead to erroneous results, and the sparsity of features would make it difficult to adjust the thresholds to detect discontinuities. It is our belief that occlusions deserve much more attention and have to be treated explicitly, not only because they are the major cue that makes the stereo correspondence problem a much harder mapping problem, but also because the depth discontinuities that come with occlusions capture some of the most important information in the depth map of a scene. As we shall see in later sections, even the problem of limb boundaries [69] can be cast as a problem of occlusions, and curved surfaces can be identified by examining some effects of occlusions in stereo.
In this chapter we describe how perceptual grouping can be used to extract features up to the ribbon level monocularly from the stereo views, as proposed by Mohan and Nevatia [82, 83], and what the physical constraints are for matching the features hierarchically. Specifically, our system applies the surface-continuity and ordering constraints within a surface only, and not across the occlusion boundaries. We also describe a basic property of occlusions in stereo views, and how we can use this property to identify different types of occlusions. In particular, we use the property to identify depth discontinuities and limb boundaries, as well as to infer properties of surfaces that are visible in only one of the stereo views.
The organization of the chapter is as follows. In section 3.2 we describe a basic property of occlusions in stereo views, and how different types of occlusions can be identified using this property. Section 3.3 describes our hierarchical stereo system, which consists of three subsystems: the monocular groupings, the binocular stereo correspondence, and the feedback from stereo correspondence to monocular groupings. Finally, some experimental results are given in section 3.4.
3.2 Occlusion Effects in Stereo
We live in a 3-D world full of opaque objects and surfaces. This means that from any point of view in space, closer surfaces inevitably occlude or hide more distant surfaces to various extents. Occlusion is therefore a fundamental phenomenon that must be overcome by any stereo system.
Occlusions have often been thought of as occurring along depth discontinuities only. In fact occlusions can also occur along orientation discontinuities and within smooth convex surfaces, as long as surfaces are hidden either partially or entirely. They all present similar problems to stereo correspondence, as explained in later sections. We therefore classify occlusions into three types according to the nature of the boundary between the occluding and the occluded surfaces:
1. Depth Discontinuity Occlusion
A surface occludes another surface of the same or another object along a depth discontinuity. Accompanying the occlusion are T-junctions along the depth discontinuity (Figure 3.1 (a)).
2. Orientation Discontinuity Occlusion
A surface of a 3-D object occludes an adjacent surface of the same object along an orientation discontinuity. Accompanying the occlusion are viewpoint-independent junctions such as L, W (Arrow), or Y junctions along the orientation discontinuity (Figure 3.1 (b)). We call these junctions edge-junctions, as they are formed from real edges.
3. Limb Occlusion
A closer portion of a smooth convex surface occludes a more distant portion of the same surface along a limb edge. Accompanying the occlusion are limb-W or limb-L junctions along the limb edge, if the limb terminates at terminator surfaces (Figure 3.1 (c)). We call these junctions limb-junctions, as they are formed on a limb.
Note that more than one type of occlusion can occur simultaneously along
the same edge in an image, e.g., a surface can occlude an adjacent surface along
an orientation discontinuity and another surface along a depth discontinuity
on the same edge. We define an occlusion as one occurring between two
specific surfaces along a specific continuous occlusion boundary, and multiple
occlusions may have their occlusion boundaries projecting to the same edge.
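This taxonomy can be summarized as a small lookup from the accompanying junction type to the occlusion type it signals. A minimal sketch; the string labels are ours, purely for illustration, not an interface from the original system:

```python
# Illustrative mapping, per the taxonomy above, from the junction type found
# along an occlusion boundary to the occlusion type it signals.
OCCLUSION_TYPE = {
    'T': 'depth discontinuity',        # T-junctions along depth discontinuities
    'L': 'orientation discontinuity',  # edge-junctions: L, W (Arrow), Y
    'W': 'orientation discontinuity',
    'Y': 'orientation discontinuity',
    'limb-L': 'limb',                  # limb-junctions at the ends of a limb edge
    'limb-W': 'limb',
}

def occlusion_type(junction):
    """Return the occlusion type signaled by a junction label."""
    return OCCLUSION_TYPE[junction]

print(occlusion_type('T'))       # depth discontinuity
print(occlusion_type('limb-W'))  # limb
```

Note that, as the text observes, a single image edge may carry more than one occlusion, so in practice the lookup applies per occlusion boundary, not per edge.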
3.2.1 A Basic Property of Occlusions in Stereo Views
A basic property of occlusions in stereo views is the occurrence of Not-Visible-
From-Other-View (NVFOV) regions, which are projections of surfaces that are
Figure 3.1: Types of occlusions: (a) Depth discontinuity occlusion; (b) Orientation discontinuity occlusion; (c) Limb occlusion. (Shaded areas are occluded regions.)
visible only in one of the stereo views. The reason is that as the viewpoint moves from the left view to the right view, surfaces at different depths shift in their images by different amounts, thereby generating NVFOV regions in the stereo images. (The formation of NVFOV regions closely resembles the formation of shadows, but we do not call them shadow regions, to keep them distinct from real shadows.) With respect to each occlusion boundary between any two surfaces, we will call the view containing the NVFOV region the more-exposure view, as it shows more of the occluded surface, and the other view the less-exposure view. We will also call the junctions along the occlusion boundary in the more-exposure view the more-exposure junctions, and the corresponding junctions in the other view the less-exposure junctions.
It is the occurrence of these NVFOV regions, rather than the occlusions themselves, that presents difficulties to stereo correspondence. They hinder stereo correspondence in two ways (Figure 3.2):
- Inter-surface Disparity Discontinuity: The NVFOV regions disturb the adjacency of points across occlusion boundaries and destroy the surface-continuity assumption commonly used in stereo correspondence methods. In fact it is these NVFOV regions that create disparity discontinuities across different surfaces.
- Intra-surface Stereo Non-correspondence: A closer surface hides a more distant surface to different extents in the two views. This means that even if two surfaces in the two views truly correspond, their apparent boundaries, and also points within the surfaces, may not all be
Figure 3.2: Effects of NVFOV regions on stereo correspondence: (a) significant NVFOV region: disparity is discontinuous across the occlusion boundary and images of the occluded surface are not totally corresponding; (b) negligible NVFOV region: all problems due to occlusion disappear.
corresponding if the physical surface in space is occluded. The points that lack correspondence are effectively the ones bounding and within the NVFOV region caused by the occlusion.
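Both effects can be made concrete with the standard disparity relation for rectified stereo, d = x_left - x_right = fB/Z (focal length f, baseline B, depth Z): a nearer occluding surface has a strictly larger disparity, so the disparity function jumps exactly where the NVFOV region sits. A sketch using this textbook relation (the numbers are illustrative, not from the original experiments):

```python
def disparity(f, baseline, depth):
    """Rectified-stereo disparity d = f * B / Z; nearer surfaces have larger d."""
    return f * baseline / depth

f, B = 700.0, 0.1  # focal length (pixels) and baseline (metres), illustrative
d_occluding = disparity(f, B, 2.0)  # closer, occluding surface
d_occluded = disparity(f, B, 5.0)   # farther, occluded surface
print(d_occluding, d_occluded)      # 35.0 14.0 -> disparity jumps across the boundary
```

The 21-pixel jump between adjacent image points across the occlusion boundary is what breaks any smoothness assumption there.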
These NVFOV regions can occur with all three types of occlusions, and their significance directly determines how much interference occlusions will present to stereo correspondence (Figure 3.3). From this perspective, problems in stereo vision such as the inapplicability of the surface continuity constraint across occlusion boundaries, the non-correspondence of limb edges, and the occurrence of NVFOV surfaces along orientation discontinuities are merely different aspects of the occlusion problem. Although occlusion is such an important issue in stereo vision, little attempt has been made to handle it explicitly in previous stereo systems. Two previous systems [68, 83] do extract surfaces and then T-junctions from each view so that real physical boundaries of the surfaces can be identified for stereo matching. This handles depth discontinuity occlusions but not limb occlusions or orientation discontinuity occlusions.
An important observation is that, as occlusions generate NVFOV regions, the occurrence of these NVFOV regions in turn moves the accompanying junctions across the epipolar lines or changes their junction types (Figure 3.3). Different types of junctions exhibit different behaviors, and whether they change their epipolar positions across stereo views agrees with the fact that
Figure 3.3: Effects of NVFOV regions on the junctions along occlusion boundaries: (a) at a depth discontinuity occlusion; (b) at an orientation discontinuity occlusion; (c) at a limb occlusion. (Shaded areas are the NVFOV regions.)
T-junctions and limb-junctions are viewpoint-dependent junctions, whereas
edge-junctions are viewpoint-independent junctions. This phenomenon can
be used as a means to identify different types of occlusions. We will present
some observations of how the NVFOV regions modify the junctions and how
occlusions can be identified.
3.2.2 A Constraint for the NVFOV Region
If an NVFOV region is to appear along with an occlusion, in which view does it appear? We mentioned that the occurrence of an NVFOV region creates a disparity discontinuity between the occluded and occluding surfaces along the occlusion boundary. Whether, around the occlusion boundary, the disparities of the occluded surface are larger or smaller than those of the occluding surface depends upon two factors (Figure 3.4): (1) in which direction the NVFOV region shifts the image of the occluded surface relative to the image of the occluding surface, to the left or to the right; (2) in which view the image of the occluded surface is shifted, the left view or the right view, which is effectively the view where the NVFOV region appears.
Note that the only place the NVFOV region can appear is between the occluded and occluding surfaces in one of the stereo views. This means that if the occluded surface is toward the left of the occluding surface, the NVFOV region will always shift the image of the occluded surface to the left relative to the image of the occluding surface, and to the right otherwise. Therefore it can be concluded that, given the position of the occluded surface relative to the
Figure 3.4: Asymmetry of the occurrence of the NVFOV region with respect to stereo views: (a) possible NVFOV regions; (b) impossible NVFOV regions. (Arrows indicate the direction in which the image of the occluded surface is shifted relative to that of the occluding surface.)
occluding surface, whether the NVFOV region appears in the left view or the
right view directly determines whether the disparities of the occluded surface
are larger or smaller than those of the occluding surface around the occlusion
boundary.
The fact is that, in the vicinity of the occlusion boundary, the occluded surface must be more distant than the occluding surface. This renders the occurrence of the NVFOV region asymmetric with respect to the stereo views, which can be formulated as:
NVFOV Region Occurrence Constraint: Around any occlusion boundary between two surfaces, the NVFOV region has to appear in the left view if the occluded surface is toward the left of the occluding surface, and in the right view if the occluded surface is toward the right of the occluding surface.
In other words, once we know the relative positions of the occluding and occluded surfaces, we can easily tell whether the left view or the right view will be the more-exposure view. Likewise, once we know which branches of a junction come from the occluding surface and which from the occluded surface, we can tell whether the junction in a specific view is a more-exposure junction or a less-exposure junction by looking at the relative positions of the branches. We elaborate on this for the different types of junctions as we describe how NVFOV regions modify them.
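The constraint reduces to a one-line rule: the NVFOV region appears on the same side as the occluded surface. A hypothetical sketch (the function name and string arguments are ours, for illustration only):

```python
def more_exposure_view(occluded_side):
    """NVFOV Region Occurrence Constraint: the NVFOV region, and hence the
    more-exposure view, is on the same side as the occluded surface relative
    to the occluding surface ('left' or 'right')."""
    if occluded_side not in ('left', 'right'):
        raise ValueError("occluded_side must be 'left' or 'right'")
    return occluded_side

print(more_exposure_view('left'))   # the left view shows more of the occluded surface
print(more_exposure_view('right'))  # the right view does
```

Conversely, observing in which view the NVFOV region appears reveals on which side of the occluding surface the occluded surface lies, which is how the rule is used to classify junctions below.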
3.2.3 Some Observations about Junctions
As mentioned earlier, the occurrence of NVFOV regions generally modifies the junctions along the occlusion boundaries across stereo views. More precisely, it moves the T-junctions and limb-junctions across epipolar lines and changes the junction types of the edge-junctions. We present, one by one, some observations of how the NVFOV regions modify the three types of junctions:
1. T-Junctions
A T-junction is formed along a depth discontinuity (Figure 3.5). It is composed of two branches, the stem and the cap, which are boundaries of the occluded and occluding surfaces respectively. As the stem and the cap can be easily identified, we can tell whether a T-junction in a specific view is a less-exposure junction or a more-exposure junction by checking whether the stem is toward the left or toward the right of the cap (Figure 3.6).
Let t1 and t2 be the lines of sight through the corresponding less-exposure and more-exposure T-junctions, and P1 and P2 be their points of contact with the stem. In general, unless t1 and t2 are accidentally coplanar, the T-junctions will not be on corresponding epipolar lines. The magnitude of the displacement across the epipolar lines depends upon the difference in the orientations of the epipolar planes E1 and E2 containing t1 and t2.
As the camera moves from the position of the less-exposure view to that of the more-exposure view, the NVFOV region will gradually appear and move the T-junction along the stem from the projection of P1 to the projection of P2. Let V be the tangential vector at the less-exposure T-junction in the direction toward the occluded portion of the stem. Then whether the T-junction moves up or down across the epipolar lines at the less-exposure view depends on the orientation of V (Figure 3.7). We thus have the following constraint:
T-junction Epipolar Displacement Constraint:
The immediate displacement of a T-junction across the epipolar lines from the less-exposure view to the more-exposure view is in the same direction as that of V.
This constraint may be of only limited practical use. The occluding and
occluded surfaces, being surfaces of possibly different objects, may have
Figure 3.5: Formation of a T-junction: the cap and stem are boundaries of the occluding and occluded surfaces (more-exposure and less-exposure views).
Figure 3.6: Classification of T-junctions with respect to stereo views: (a) stem toward the right of the cap; (b) stem toward the left of the cap.
Figure 3.7: Direction of epipolar displacement of T-junctions.
vastly different depths. As a result the portion of the stem from P1 to P2 may elongate quickly with increasing stereo angle. Fortunately, T-junctions can be readily identified from a monocular view once surfaces are identified. The above analysis serves to bring some insight into the epipolar displacement of limb-junctions, and to complete the concept.
2. Limb-Junctions
Lim and Binford [69] have made use of limbs but have presented no method to identify them. Identifying limbs is in fact a first step toward the identification and description of curved surfaces. The appearance of a limb edge itself is no different from that of a real edge; rather, it is the junctions formed at the two ends of the limb edge that make the difference. Nalwa [87] has shown that the branches of a limb junction will all be co-tangent at the junction, but identifying limb junctions in real pictures from this cue alone may be difficult.
A limb-junction is formed when the line of sight from the camera is tangential to the contour C at the intersection of the curved surface and its terminator surface (Figure 3.8). It is composed of two edges of the terminator surface and a limb edge. We will call the terminator edge closer to the camera the front edge, and the other terminator edge the back edge. Note that in a limb-W junction all branches are visible, and in a limb-L junction the back edge is occluded. The branches can be easily identified by checking the rough disparities. The edges of the terminator
surface are the branches closest to and farthest away from the camera,
and the limb edge is in the middle.
From the perspective of occlusion, the front edge and the limb edge belong to the occluding portion of the curved surface, and the back edge to the occluded portion. We can thus tell whether a limb-junction in a specific view is a less-exposure junction or a more-exposure junction by checking whether the back edge is physically toward the left or toward the right of the limb edge (Figure 3.9). Note that due to the nature of the limb-junction, the back edge is physically toward the right of the limb edge if it appears on the left of the limb edge in the image. The back edge is invisible in a limb-L junction, but its position relative to the limb edge is revealed by that of the front edge.
Let t1 and t2 be the lines of sight through the corresponding less-exposure and more-exposure limb-junctions, and P1 and P2 be their points of contact with C. In general, as in depth discontinuity occlusion, unless t1 and t2 are accidentally coplanar, the limb-junctions will not be on corresponding epipolar lines.
As the camera moves from the position of the less-exposure view to that of the more-exposure view, the NVFOV region will gradually appear and move the limb-junction along C from the projection of P1 to the projection of P2. Let V be the tangential vector at the less-exposure limb-junction in the direction toward the back edge. The analysis of the epipolar displacement of the limb-junction is exactly the same as that for the T-junction, except that the terminator boundary C plays the role of the stem. We thus have a similar constraint:
Limb-junction Epipolar Displacement Constraint:
The immediate displacement of a limb-junction across the epipolar lines from the less-exposure view to the more-exposure view is in the same direction as that of V.
A limb-junction is therefore one that has the shape of a W or L junction monocularly, and exhibits epipolar displacement across stereo views according to the above constraint (Figure 3.10). This presents an additional cue to identify limbs. The requirement is that at least one end of the limb has to terminate at a terminator surface. But we conjecture that without other cues such as shading and texture, limbs
Figure 3.8: Formation of limb-junction. (The figure shows the terminator surface and the curved surface in the less-exposure and more-exposure views.)
not terminating at terminator surfaces are not identifiable from binocular stereo alone; e.g., the stereo images of a sphere are not distinguishable from those of a circular plate.
Note that the occlusion involves only adjacent portions of a single surface, which span a limited disparity range. As a result, unlike in depth discontinuity occlusion, the portion of the terminator contour C from P1 to P2 will elongate much more slowly with increasing stereo angle.
Figure 3.9: Classification of limb-junctions with respect to stereo views: (a) back edge physically toward the right of the limb edge; (b) back edge physically toward the left of the limb edge.
Figure 3.10: Direction of epipolar displacement of limb-junctions: (a), (b) limb-W junctions; (c), (d) limb-L junctions (less-exposure and more-exposure views).
Note also that as long as the epipolar planes containing the lines of sight through the corresponding limb-junctions are not accidentally coplanar, there will be an epipolar displacement of the limb-junction. In some cases the epipolar displacement may be too small to detect because the image resolution is too low, but in those cases matching the limb boundaries directly will also cause negligible error in the estimated disparity.
3. Edge-Junctions
NVFOV surfaces (Figure 3.11) along orientation discontinuities are usually ignored in stereo processing systems. As the surfaces composing an edge-junction must belong to the same object, each of these NVFOV surfaces shares the same object with the rest of the surfaces around the edge-junction. They are therefore important for object-level description and recognition. Although a few attempts [67, 50] have been made to infer properties of these NVFOV surfaces, they apply only to specific domains such as architectural scenes. Domain-specific knowledge, such as orthogonal trihedral vertices [67] and restricted vanishing points of all lines in the scene [50], is employed. We seek to extend the concept to cover generic objects.
To detect these NVFOV surfaces, we can check for any change in the junction type of the corresponding edge-junctions across stereo views. In addition, as surfaces are occluded in one of the views, the junctions will be surrounded by different numbers of surfaces. But are there any more constraints on their occurrence?
Figure 3.12 shows the lines of sight from the camera to a convex (convex in the direction of sight from the camera) and a concave orientation discontinuity. It can be shown that no matter what the angle of sight is, neither of the surfaces composing a concave orientation discontinuity can occlude the other surface along the orientation discontinuity. This phenomenon gives rise to an important constraint for orientation discontinuity occlusions:
C onvexity Constraint: Orientation discontinuity occlusions only hap
pen around convex edges. Since the convexity of a trihedral vertex is
uniquely defined by the convexity of any one of the composing edges, the
correspondence of the edge-junctions (projected from a trihedral vertex)
around an orientation discontinuity occlusion has to recover a convex
vertex, but not a concave one.
The change in junction type is thus asymmetric across the stereo views.
If a W-junction in the left view is matchable with a Y-junction in the
right view, meaning their matching returns a convex vertex, they
Figure 3.11: Formation of NVFOV surface at orientation discontinuity occlusion.
Figure 3.12: Projection of the surfaces composing an orientation discontinuity:
(a) a convex orientation discontinuity; (b) a concave orientation discontinuity.
Figure 3.13: Examples of permissible change of the junction type at orientation
discontinuity occlusions for trihedral objects.
are not matchable if the two pictures are swapped, as that would return a
concave vertex. Examples of permissible changes in junction type are
given in figure 3.13.
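The Convexity Constraint reduces to a simple admissibility test on edge-junction matches. A minimal sketch follows; the junction-type strings and the boolean convexity flag are illustrative assumptions, not the actual representation used in the system.

```python
# Sketch of the Convexity Constraint: a match between edge-junctions of
# different types (an orientation discontinuity occlusion) is admissible
# only if the correspondence recovers a convex vertex.

def junction_match_admissible(left_type: str, right_type: str,
                              recovered_vertex_convex: bool) -> bool:
    if left_type == right_type:
        return True  # no type change: no occlusion implied
    # A type change implies an orientation discontinuity occlusion,
    # which can only happen around convex edges.
    return recovered_vertex_convex

# W in the left view vs. Y in the right view is permissible only when the
# recovered vertex is convex; swapping the images would recover a concave
# vertex and the match is rejected.
assert junction_match_admissible("W", "Y", True)
assert not junction_match_admissible("W", "Y", False)
```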
3.3 A Hierarchical Features Based Approach
Most stereo systems match very local features (such as intensities or edgels) or
match extended features such as line segments. To resolve global ambiguity,
they employ constraints of surface continuity of disparity and preservation of
the order of features along epipolar lines. These constraints, however, are valid
only in the interior of a smooth surface and not across the occlusion boundaries.
The traditional approach is to infer the occlusion boundaries from the stereo
disparity data, but we argue that this approach already incorporates errors
that cannot be corrected later.
Instead, we propose to pursue an approach that computes descriptions at
a hierarchy of levels with matching also done at various levels. This approach
has many advantages. Higher level features are more distinct and easier to
match (the complexity is lower and the errors are fewer). Furthermore, identifying
surfaces and occlusion boundaries allows us to apply the smoothing
and ordering constraints only to those regions where they are likely to hold.
The figural continuity constraint is built in. Lower level features, matched in
the context of higher level matches, can help provide higher accuracy and also
allow us to use texture and other surface markings that do not form any higher
level structures.
In our system we use the following hierarchy of features: edges, curves,
ribbons and junctions. Note that junctions are not only highly localized and
distinct features, but matching them also gives us a means of identifying the
different occlusions described in section 3.2.3.
The first problem in using the described feature hierarchy is the difficulty
of computing them from monocular images of real scenes. Boundaries found
by edge detection techniques tend to be fragmented, and several edges that
do not correspond to object boundaries, but are caused by surface markings,
highlights and noise, are likely to be present. To infer surfaces from such features
is a difficult task. We use a perceptual grouping technique described by
Mohan and Nevatia [82, 83]. In this approach relationships of co-curvilinearity
and symmetry are employed to extract collated features of various abstractions
from the image. This method, along with some of our modifications, is
described briefly in section 3.3.1.
The monocular grouping algorithm is highly effective but is not expected to
be perfect. We allow feedback between the stereo matching process and the
ribbon-selection process, so that we can obtain better groupings through the
use of the two views than we could from each taken separately.
A block diagram of our system is shown in figure 3.14. Note that there are
effectively three subsystems:
1. Monocular Groupings of the left and right views,
2. Stereo Correspondence,
3. Feedback from Stereo Correspondence to Monocular Groupings.
We will describe each subsystem one by one.
Notice that various levels of our stereo system require matching structural
features of different levels in the two views, and constraints for the matching
process can be formulated as unary and binary constraints among the possible
matches. Unary constraints come from individual merits of an entity,
whereas binary constraints relate a pair of entities. The constraints can be
excitatory (positive) or inhibitory (negative), and can even be absolute such as
enforcing mutual exclusion among some entities. We use a relaxation network
described in appendix A to accomplish the tasks. In the rest of the thesis,
when we describe the constraints for matching certain structural features, we
will also indicate the nature of the constraints as being excitatory, inhibitory,
or absolute.
Figure 3.14: Overview of our stereo system.
3.3.1 Monocular Perceptual Grouping
Mohan and Nevatia [83] have proposed using relationships of co-curvilinearity
and symmetry to extract collated features of various abstractions from each
image (Figure 3.14). Edges are detected from each image using Canny's edge
detector [26], and are linked into edge-contours using eight-neighbor connectivity.
Edge-contours are segmented into curves [102] at curvature extrema
so that every curve is smooth in itself, and curves are grouped into contours
based on co-curvilinearity. Symmetries are then detected from each pair of
approximately symmetrical contours, and they form ribbons if they have proper
end closures at both ends of the symmetries. Closure at the end of a symmetry
can be composed of a contour, a set of multiple contours, or the ends of other
symmetries. A number of conflicting ribbons are computed and selection of
ribbons has to be done to resolve conflicts.
We believe ribbons should not be selected merely on their individual
merits. If the objects in the scene were all composed of thin plates, that would
be the only option, as the surfaces would indeed be truly independent. But
surfaces within a 3-D object do exhibit certain relationships with one another.
Note that most objects in real life are three-dimensional, e.g., a desk, a cup,
and a TV set, and they are the objects whose 3-D descriptions humans are
most interested in recovering.
The basic relationship between two adjacent smooth surface patches within
a 3-D object is that they share a smooth contour which is bounded by
non-T-junctions only (Figure 3.15 (b)). We implement this relationship as a strong
excitatory constraint in the ribbon-selection process.
We thus have the following constraints for selection of ribbons:
1. Unary Constraints
— Symmetry-smoothness (excitatory): A ribbon with smoother symmetry
axes is more likely to be a real ribbon. Smoothness is measured
in terms of the number of corners in our implementation.
— Boundary-evidence (excitatory): A ribbon with a less fragmented
boundary is more likely to be a real ribbon.
2. Binary Constraints
— Uniqueness constraint (mutually exclusive): As opaque surfaces are
assumed, each point visible in an image can belong to at most one
surface. Two ribbons having overlap in their contained points are
therefore in absolute conflict (Figure 3.15 (a)). However, we do
allow two ribbons with one totally embedded inside the other to
exist together, as we regard them as one on top of the other.
— Solid-formation constraint (excitatory): Two ribbons which are not
in conflict and share a smooth contour bounded by non-T-junctions
support each other, as they are likely to be projected from adjacent
surfaces of the same 3-D object (Figure 3.15 (b)).
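The Uniqueness constraint can be sketched as a conflict test between two ribbons. Representing each ribbon simply as the set of pixels it contains is an illustrative assumption:

```python
# Sketch of the Uniqueness constraint for ribbon selection: two ribbons
# conflict if their pixel sets partially overlap; total embedding is
# allowed (interpreted as one surface on top of the other).

def ribbons_conflict(pixels_a: set, pixels_b: set) -> bool:
    if not (pixels_a & pixels_b):
        return False                 # disjoint: no conflict
    if pixels_a <= pixels_b or pixels_b <= pixels_a:
        return False                 # total embedding: may coexist
    return True                      # partial overlap: absolute conflict

a = {(x, y) for x in range(4) for y in range(4)}
b = {(x, y) for x in range(2, 6) for y in range(4)}  # partial overlap with a
c = {(1, 1), (2, 2)}                                 # embedded inside a
assert ribbons_conflict(a, b)
assert not ribbons_conflict(a, c)
```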
Figure 3.15: Some constraints for ribbon-selection: (a) Uniqueness constraint:
ribbons 3, 4 are in absolute conflict; (b) Solid-formation constraint: ribbons
1, 2, 3, and ribbons 5, 6 form two 3-D objects respectively.
As there are altogether three compromisable constraints, two predefined
weights in the weighted sum of the constraints are necessary for the cost
function to be optimized in the relaxation. The cost function is therefore:

E(V) = -\sum_i V_i (BEC_i + w_{SSC}\, SSC_i) - \sum_{i,j} V_i V_j\, w_{SFC}\, SFC_{ij}

where w_{SSC} and w_{SFC} are kept constant at 1 and 3 respectively in our system.
The values are designed according to the relative importance of the constraints.
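Written out, the ribbon-selection energy can be sketched as below. The vector/matrix layout (per-ribbon unary merits BEC and SSC, a symmetric pairwise support matrix SFC, node outputs V in [0, 1]) is an illustrative assumption about the data layout, with the weights w_SSC = 1 and w_SFC = 3 as stated in the text.

```python
# Sketch of the ribbon-selection cost function
#   E(V) = -sum_i V_i (BEC_i + w_SSC SSC_i) - sum_ij V_i V_j w_SFC SFC_ij
# V: node outputs; BEC, SSC: unary merits; SFC: pairwise support matrix.
import numpy as np

def ribbon_selection_energy(V, BEC, SSC, SFC, w_ssc=1.0, w_sfc=3.0):
    unary = -np.sum(V * (BEC + w_ssc * SSC))
    pairwise = -w_sfc * (V @ SFC @ V)   # double sum over pairs i, j
    return unary + pairwise

# Selecting a ribbon with positive merits lowers the energy.
V = np.array([1.0, 0.0])
BEC = np.array([0.5, 0.2])
SSC = np.array([0.25, 0.1])
SFC = np.zeros((2, 2))
assert ribbon_selection_energy(V, BEC, SSC, SFC) == -0.75
```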
Once surfaces have been segmented and decomposed into component ribbons,
extraction of junctions from each image becomes an easy task. Owing to
the assumption that vertices are at most trihedral, we only extract L, W, Y,
and T junctions. L and W junctions might turn out to be limb-L and limb-W
junctions in stereo matching.
Ribbon-matching → Junction-matching → Branch-matching
Figure 3.16: Hierarchical Matching.
3.3.2 Binocular Stereo Correspondence
Matching of features follows a hierarchical order. Two ribbons are potentially
matchable only if all their junctions are matchable. Two junctions are
potentially matchable only if all their branches that belong to the ribbons being
matched are matchable (Figure 3.16). We present the physical constraints
for matching the features at the various levels, and conclude with how the
different types of occlusions are identified.
Constraints for Ribbon Matching
Ribbons in the left and right views are matched using the following constraints:
1. Unary Constraints
— Epipolar constraint (absolute): Two ribbons are potentially matchable
only if they have corresponding bounds of vertical extent.
— Shape-correspondence constraint (absolute): Two ribbons are potentially
matchable only if all their junctions other than those between
the T-junctions are matchable (Figure 3.17 (a)). To take into account
possible discrepancies in detecting junctions from the two
views, highly obtuse L-junctions can be exceptions.
2. Binary Constraints
— Uniqueness constraint (mutually exclusive): Two ribbon-matches
are in conflict if there is any conflict between their proposed
junction-matches.
— Figural Continuity constraint (excitatory): Two ribbon-matches
support each other if they are not in conflict and there is overlap
in their proposed junction-matches (Figure 3.17 (b)).
As there is only one compromisable constraint, no weight in the weighted
sum of the constraints is necessary for the cost function to be optimized in the
relaxation.
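For rectified images, the epipolar constraint on ribbons can be sketched as a tolerance test on the two ribbons' row ranges. The tolerance value is an illustrative stand-in for "corresponding bounds of vertical extent":

```python
# Sketch of the epipolar constraint for ribbon matching: in rectified
# images a ribbon and its counterpart should span corresponding image
# rows, up to a small tolerance (illustrative value, in pixels).

def vertical_extents_correspond(top_a, bottom_a, top_b, bottom_b, tol=3):
    return abs(top_a - top_b) <= tol and abs(bottom_a - bottom_b) <= tol

assert vertical_extents_correspond(10, 50, 12, 49)        # roughly aligned
assert not vertical_extents_correspond(10, 50, 30, 90)    # extents disagree
```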
Constraints for Junction Matching
In our system junction-matching is initiated by ribbon matching. To consider
a possible match between two junctions, only the branches which belong to
the two ribbons being matched are involved. Junctions are matched using the
following constraints:
1. Unary Constraints
— Type constraint (absolute): It is obvious that a non-T-junction can
only be matched with a non-T-junction, and a more-exposure T-junction
can only be matched with a less-exposure T-junction. To take into
account possible disocclusion of a T-junction, we also allow a less-exposure
T-junction to be matched with a non-T-junction (Figure 3.18).
Figure 3.17: Some constraints for ribbon matching: (a) Shape-correspondence
constraint: ribbons ls and rs have similar axes of symmetry and similar vertical
extents, yet they are not matchable as not all their junctions are matchable; (b)
Figural-continuity constraint: ribbon-matches (ls1, rs1) and (ls2, rs2) support
each other as they propose the same junction-matches (lja, rja) and (ljb, rjb).
— Epipolar constraint (absolute): Two junctions are potentially
matchable only if they are on corresponding epipolar lines. This
constraint is relaxed for T-junctions and possible limb-junctions, as
their counterparts generally do not fall on corresponding epipolar
lines.
— Shape-correspondence constraint (absolute): Two junctions are potentially
matchable only if all their branches that belong to the
ribbons being matched are matchable.
2. Binary Constraints
— Uniqueness constraint (mutually exclusive): A non-T-junction can
be matched with at most one non-T-junction, but it can be matched
with more than one less-exposure T-junction if it is occluded in the
other view (Figure 3.18).
— Figural-continuity constraint (excitatory): If junction l is matched
with junction r, it is preferred that junctions which share branches
with junction l and junction r in the two views are also matched,
if those junctions are indeed matchable (Figure 3.19).
As there is only one compromisable constraint, no weight in the weighted
sum of the constraints is necessary for the cost function to be optimized in the
relaxation.
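The Type constraint above can be sketched as a compatibility predicate. Encoding a junction as a (kind, exposure) pair is an illustrative assumption, not the actual data structure of the system:

```python
# Sketch of the Type constraint for junction matching: non-T with non-T,
# more-exposure T only with less-exposure T, and (allowing disocclusion)
# less-exposure T with non-T.

NON_T = {"L", "W", "Y"}

def junction_types_compatible(left, right):
    lk, lexp = left            # (kind, exposure); exposure is None for non-T
    rk, rexp = right
    if lk in NON_T and rk in NON_T:
        return True
    if lk == "T" and rk == "T":
        return lexp != rexp    # one more-exposure, one less-exposure
    # mixed T / non-T: only a less-exposure T may pair with a non-T
    t_junction = left if lk == "T" else right
    return t_junction[1] == "less"

assert junction_types_compatible(("W", None), ("Y", None))
assert junction_types_compatible(("T", "less"), ("L", None))
assert not junction_types_compatible(("T", "more"), ("L", None))
```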
Figure 3.18: Type and Uniqueness constraints for matching junctions: (a)
Type constraint: A less-exposure T-junction can be matched with a non-T-
junction; (b) Uniqueness constraint: more than one less-exposure T-junction
can be matched with a non-T-junction.
Figure 3.19: Figural Continuity constraint for matching junctions: matches
(lj1, rj1) and (lj2, rj2) support each other.
Constraints for Branch Matching
Branches around matchable junctions are matched using the following
constraints:
1. Unary Constraints
Figure 3.20: Some constraints for branch matching: (a) Epipolar Constraint:
the two junctions are not matchable as neither la nor lb is matchable with rb;
(b) Surface-orientation Constraint: la has to be matched with ra and lb has to be
matched with rb, so that la × lb and ra × rb point in the same direction (out
of the paper).
— Epipolar constraint (absolute): Two branches are potentially
matchable only if the tangential vectors along them are either both
pointing up or both pointing down across the epipolar lines
(Figure 3.20 (a)). If the tangential vectors are approximately parallel
to the epipolar lines, they have to be both pointing toward the left
or both toward the right.
2. Binary Constraints
— Uniqueness constraint (mutually exclusive): One branch can be
matched with at most one branch in the other view.
— Surface-orientation constraint (mutually exclusive): The branch-matches
associated with a junction-match have to be such that the
surface normals of the corresponding surfaces have their z components
nz (the components perpendicular to the reference image plane)
in the same direction. This implies that branches all pointing up or all
pointing down across the epipolar lines have to be matched in
order from left to right (Figure 3.20 (b)).
As there is no compromisable constraint, no weight in the weighted sum
of the constraints is necessary for the cost function to be optimized in the
relaxation.
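With horizontal epipolar lines (rectified images), the epipolar constraint on branch tangents can be sketched as follows; the threshold for "approximately parallel to the epipolar lines" is an illustrative assumption:

```python
# Sketch of the epipolar constraint for branch matching: two branch
# tangents are compatible only if both cross the epipolar lines upward or
# both downward; near-horizontal tangents must agree in left/right sense.

EPS = 0.1  # illustrative threshold for "approximately parallel"

def branch_tangents_compatible(va, vb):
    ya, yb = va[1], vb[1]
    if abs(ya) > EPS and abs(yb) > EPS:
        return ya * yb > 0          # both up or both down
    if abs(ya) <= EPS and abs(yb) <= EPS:
        return va[0] * vb[0] > 0    # both left or both right
    return False                    # one vertical-ish, one horizontal-ish

assert branch_tangents_compatible((0.2, 1.0), (-0.3, 0.8))     # both up
assert not branch_tangents_compatible((0.2, 1.0), (0.2, -1.0)) # up vs. down
assert branch_tangents_compatible((1.0, 0.0), (1.0, 0.05))     # both right
```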
Identification of Occlusions
Matching surface-level features automatically removes the inter-surface
disparity discontinuity problem in stereo correspondence, but the intra-surface
stereo non-correspondence problem still has to be resolved by identifying the
occlusions.
The different types of occlusions can be identified before or during stereo
correspondence using the concepts presented in section 3.2.3. Their identification
is based on the identification of junctions:
1. T-junctions are identified monocularly from the selected ribbons in each
view. False boundaries of the occluded surfaces, which are portions of
their apparent boundaries terminating at T-junctions, will be excluded
from stereo matching.
2. Limb-junctions are W or L junctions whose branches are approximately
co-tangent and which exhibit epipolar displacements across the stereo views.
Limb edges, which are edges terminating at limb-junctions, will be excluded
from stereo matching as well.
3. Edge-junctions that change their junction type across the stereo views and
recover convex vertices indicate the occurrence of NVFOV surfaces from
orientation discontinuity occlusions. The NVFOV surfaces are the surfaces
around the edge-junctions that remain unmatched in ribbon-matching.
3.3.3 Feedback from Stereo Correspondence to Monocular Groupings
The previous section described how stereo matching is carried out for the
hierarchical features selected from the left and right views. As the interpretations
from the separate views may not totally agree, there are ribbons which are
selected in one view whose corresponding ribbons are not selected in the
other view (Figure 3.21). Such ribbons would not be matched in the stereo
correspondence stage.
The ribbons which are matched are first put into a group which we call the
consistent ribbon-pairs, as they are the ones for which the current interpretations
of the separate views are consistent. Note that NVFOV surfaces along orientation
discontinuity occlusions will not and should not be matched. They are put
into the same group as the matched ribbons.
Figure 3.21: Consistent and Inconsistent ribbon-pairs.
For the rest of the selected ribbons in each view, we look at the pool of
all possible ribbons in the other view and search for potentially matchable
ribbons which are not in conflict with any of the consistent ribbon-pairs. If
there is no such ribbon in the other view, the unmatched ribbon is
rejected. If there is, the two ribbons form a potential match. We call
all these potential matches inconsistent ribbon-pairs.
Inconsistent ribbon-pairs undergo the process of ribbon-selection again.
This time each node in the relaxation network is not a single ribbon, but
instead a ribbon-pair from the left and right views respectively. The relaxation
process is similar to that of the first ribbon-selection, except that the merits of
the two ribbons in a node due to the various constraints are averaged to become
the merits of the node. Output values of the nodes representing the consistent
ribbon-pairs are always 1, i.e., they are always selected, but they will be
involved in imposing binary constraints on the nodes representing the inconsistent
ribbon-pairs. The outputs of the relaxation are then the ultimate pairs of selected
ribbons from the left and right views of the scene.
As the relaxation network always gives the maximal allowable number of
nodes which are not in conflict, two cycles of ribbon-selection can therefore
average the groupings from the left and right views and give a consistent
interpretation of the scene.
3.3.4 Stereo Correspondence of Surface Markings
As ribbons, junctions, and branches of junctions are hierarchically matched,
the boundary contours of each ribbon, which terminate at junctions, are also
matched automatically. In particular, depth measures along boundary edges
parallel to the epipolar lines, which are ignored in most stereo work, can also be
interpolated from the depth measures of the junctions at which they terminate.
In our approach, surfaces in the scene are segmented in the process of stereo
computation. This may be sufficient for many description and recognition
purposes. A denser disparity map can also be obtained by matching the surface
markings on the surfaces. We choose to match curves rather than edgels
because of their higher abstraction. Two curves are potentially matchable only
if they are within corresponding ribbons. As the disparities along the boundary
of each ribbon have been obtained, by assuming that ribbons are planar we can
estimate the disparity of a curve at any epipolar line from its position inside
its own ribbon. We can also employ all the constraints generally used in
stereo correspondence which are applicable to features within a surface. The
constraints for curve-matching are given below:
1. Unary Constraints
— Epipolar constraint (absolute): Curves are potentially matchable
only if they have overlap in their vertical extents across the epipolar
lines.
— Vergence constraint (excitatory): A match between two curves is
preferred if their disparity is closer to the estimated disparity on
the corresponding ribbons.
2. Binary Constraints
— Uniqueness constraint (mutually exclusive): A curve cannot be
simultaneously matched with two curves in the other view which have
overlap in their vertical extents.
— Ordering constraint (mutually exclusive): Curve-matches have to
follow the order from left to right.
— Surface Continuity constraint (excitatory): Two curve-matches
with their curves next to each other in the same order support
each other.
— Figural Continuity constraint (excitatory): Two curve-matches
with their curves being connected to each other support each other.
As there are altogether three compromisable constraints, two predefined
weights in the weighted sum of the constraints are necessary for the cost
function to be optimized in the relaxation. The cost function is therefore:

E(V) = -\sum_i V_i\, VC_i - \sum_{i,j} V_i V_j\, (w_{SCC}\, SCC_{ij} + w_{FCC}\, FCC_{ij})

where w_{SCC} and w_{FCC} are kept constant at 0.6 and 1 respectively in our system.
The values are designed according to the relative importance of the constraints.
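The vergence constraint needs an estimated disparity for a curve inside a ribbon. Under the planar-ribbon assumption this can be sketched by fitting a plane in disparity space to the junction disparities on the ribbon boundary; least-squares fitting is an illustrative choice, not necessarily the system's implementation.

```python
# Sketch of the planar-ribbon disparity estimate: fit d(x, y) = a*x + b*y + c
# to known junction disparities and predict the disparity of an interior
# curve point from its image position.
import numpy as np

def fit_disparity_plane(points, disparities):
    # points: (N, 2) array of (x, y) positions; disparities: (N,) array
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(A, disparities, rcond=None)
    return coeffs                    # (a, b, c)

def predict_disparity(coeffs, x, y):
    a, b, c = coeffs
    return a * x + b * y + c

pts = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
d = np.array([5.0, 7.0, 5.0, 7.0])   # disparity grows with x
coef = fit_disparity_plane(pts, d)
assert abs(predict_disparity(coef, 5, 5) - 6.0) < 1e-6
```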
3.4 Experimental Results
We show results on different kinds of scenes to illustrate the capability of our
approach. The simple relaxation network described in appendix A is applied
throughout our stereo system in the selection and stereo matching of features
at various levels. The relaxation process is found to converge within twenty
iterations in all test images.
Figure 3.22 shows the images, the hierarchical features and the stereo
matching results of a scene with multiple occlusions. Note that there are
many boundary edges which are parallel to the epipolar lines; in our system
their disparities are linearly interpolated from those of the junctions. If
the surface boundaries were matched blindly without identifying their natures,
the depth map would be that shown in figure 3.23 (a). Limb edges in stereo
views in fact do not correspond to each other. Our stereo system is capable
of identifying the limb edges from the limb-junctions at which they terminate,
and computing their depths at those junctions. Quantitative depth measures
along the limb edges are still underconstrained, and we are investigating how
they can be computed from the reconstruction of the curved surfaces. For
display purposes here we assume disparity along a limb edge is linear between
the limb-junctions and do the interpolation as shown in figure 3.23 (b). In
the same figure we also show the NVFOV surfaces identified along the
orientation discontinuity occlusions, with their disparities interpolated for display
purposes as if the surfaces were planar.
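The display-time interpolation just described can be sketched as straightforward linear interpolation in disparity between the two terminating limb-junctions; parameterizing the edge by cumulative arc length is an illustrative choice.

```python
# Sketch of display-time disparity interpolation along a limb edge:
# disparity is assumed to vary linearly with arc length between the
# disparities of the two limb-junctions at which the edge terminates.

def interpolate_limb_disparity(arc_lengths, d_start, d_end):
    """arc_lengths: cumulative arc length at each edge point, starting at 0.0
    and ending at the total edge length."""
    total = arc_lengths[-1]
    return [d_start + (d_end - d_start) * s / total for s in arc_lengths]

ds = interpolate_limb_disparity([0.0, 1.0, 2.0, 4.0], 10.0, 18.0)
assert ds == [10.0, 12.0, 14.0, 18.0]
```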
The second example is a scene with curved objects. Processing results
for the scene are shown in figure 3.24. Note that the ribbons selected from
the monocular views are not consistent (Figure 3.24 (g) and (h)); this
is fixed by the feedback from stereo correspondence to the monocular groupings
(Figure 3.24 (i) and (j)). Figure 3.24 (k) shows the complete disparity output,
including the non-correspondence of the limb edges of the tape dispenser and the
disparities of the surface markings. Here the limbs of the cylindrical object at
the back are not identified because of lack of resolution. Our stereo system
requires higher-resolution images for more accurate depth estimation.
The last example is a scene with significant texture. The images, the edges,
and the disparity output are shown in figure 3.25.
3.5 Computational Complexity and Run-time
We present an informal computational complexity analysis of our system. The
hierarchical features used in our system are extracted based on the idea of
perceptual grouping proposed by Mohan and Nevatia [83, 80], and hence the
complexity analysis is also similar to theirs.
Notice that as features are extracted higher up in the hierarchy of descriptions,
there is also a large reduction in the number of features from one level
to the next. As a result the number of features is small in real terms for our
stereo matching process, which is strongly guided by correspondences of higher
level features.
Preprocessing: We have used existing systems for the tasks of edge detection,
edge linking and corner detection as preprocessing in our system.
We do not discuss the computational complexities of these processes here.
(a) left image (b) right image (c) left edges (d) right edges
(e) left symmetries (f) right symmetries
(i) matched ribbons, junctions (left) (j) matched ribbons, junctions (right)
Figure 3.22: Hierarchical features of a scene with multiple occlusions (scene MO).
Co-curvilinearity: Suppose there are n curves in the image. Since we consider
two curves at a time to check for continuity, there are only C(n, 2)
combinations to consider, so the whole process of checking
co-curvilinearity is of time complexity O(n²).
(g) left selected ribbons (h) right selected ribbons
(a) apparent disparity output (b) identified limbs and NVFOV surfaces
Figure 3.23: Disparity Output for the scene MO.
Hypothesizing Symmetries: Since there are O(n²) contour-pairs to be considered
for symmetrical relationships among the O(n) contours, the
whole process of hypothesizing symmetries is of time complexity O(n²).
Hypothesizing Ribbons: For each of the m symmetric contour-pairs hypothesized,
we have to search for possible closures at both of its two
ends to hypothesize a ribbon. The process examines an area of fixed
width between the curve ends, extracts all the curves inside the area,
and groups them together with respect to continuity. Since the number
of curves found in such an area is much less than the total number of
curves in the image, the search process for one end of a contour-pair
can be assumed to take constant time, in the sense that it is independent
of the total number of curves. As a result the whole process of
hypothesizing ribbons takes O(m) time.
Selecting Ribbons: Suppose there are altogether r hypothesized ribbons. It
takes O(r²) time to set up the relaxation network, as binary relationships
among the nodes need to be extracted. Suppose on average each one of
them has p competitors for some p ≪ r. During each iteration of the
relaxation each ribbon is compared with each one of its competitors. The
relaxation therefore takes O(rp) time, assuming that the total number
of iterations in the relaxation is constant.
(a) left image (b) right image (c) left edges (d) right edges
(e) left symmetries (f) right symmetries (g) left selected ribbons (h) right selected ribbons
(i) matched ribbons, junctions (left) (j) matched ribbons, junctions (right)
(k) complete disparity output
Figure 3.24: Results of a scene of curved objects (scene CO).
Matching Structures across Stereo Images: Suppose there are f features
in each of the left and right views for stereo correspondence.
Since each feature in the left view can form at most f possible matches
with the features in the right view, there can be at most O(f²) possible
matches in total. These possible matches will go through
(a) left image (b) right image (c) left edges (d) right edges
(e) complete disparity output
Figure 3.25: Results of a scene of textured objects.
a selection process to remove conflicts. It takes O(f⁴) time³ to set up
the relaxation network, as binary relationships are involved. Say
on average each match has q competitors. During each iteration of the
relaxation process for selection, each match is compared to each one of
its competitors. The relaxation therefore takes O(f²q) time, if we assume
the total number of iterations is constant. This analysis applies to
stereo correspondence of features at any level. Since our stereo matching
proceeds in a top-down manner, the total time it takes is bounded by O(F⁴),
where F is the number of features at the most global level in each view.
For Random Dot Stereograms F is the number of edgels. For highly
structural scenes F is the number of selected ribbons.
³This sounds expensive. However, we have to point out that any system which attempts
to match features across stereo views using multiple-order relationships among the matches
requires a similar or higher order of magnitude of computation. The difference is in fact in
the number of features to be matched and how distinct the features are.
Run-times on a Symbolics 3620 for some of the experimental examples are
shown in Table 3.1.
Process                     Complexity   Run-Time (sec)   Entities
Scene MO:
Co-curvilinearity           O(n²)        2156.6
Hypothesizing Symmetries    O(n²)        573.83           320 symmetries
Hypothesizing Ribbons       O(m)         3995.34          m = 320; 29 ribbons
Selecting Ribbons           O(r²)        32.09            r = 29; 14 ribbons selected
Stereo Matching             O(R⁴)        96.49+27.25      R = 14; 11 matched ribbons
Scene CO:
Co-curvilinearity           O(n²)        1173.10
Hypothesizing Symmetries    O(n²)        160.18           127 symmetries
Hypothesizing Ribbons       O(m)         1056.59          m = 127; 20 ribbons
Selecting Ribbons           O(r²)        280.11           r = 20; 10 ribbons selected
Stereo Matching             O(R⁴)        45.79+19.22      R = 10; 9 matched ribbons
Table 3.1: Time complexity and run-time of matching hierarchical features.
3.6 Conclusion
We have presented a stereo system that computes hierarchical descriptions of
a scene from each view and combines the information from the two views to
give a 3-D description of the scene. This system utilizes bilateral communication
between monocular groupings and stereo correspondence. The hierarchy
of descriptions helps reduce the computational complexity without sacrificing
accuracy, and helps avoid the errors caused by improper application of
commonly used stereo correspondence constraints of surface-continuity and
ordering. Occlusions are specifically identified and interpreted. Visible surfaces
are segmented and we can infer some properties of the surfaces visible in
only one view.
We have not implemented surface interpolation procedures. However, we
expect this to be a relatively easy task, as the surfaces have already been
segmented into smooth regions.
Chapter 4
Shape Description: Recovering LSHGCs and SHGCs
In chapter 3 we outlined a stereo system which is capable of delivering
segmented surfaces in the scene, as well as identifying the nature of the
visible surface boundaries as being creases or limbs. However, precise depth
information along the limb boundaries is still not available. This chapter
describes how such 3-D information can be inferred from the stereo
correspondences by taking into account the global shape of the objects.
We present methods [29, 30] to infer volumetric shape from stereo, assuming
the objects consist of instances of some generic shape primitives such as
LSHGCs and SHGCs. Our methods are based on some invariant properties of
the shape primitives in their monocular and stereo projections. Precise depth
information comes naturally with the shape descriptions. Experimental results
on both synthetic and real images of objects with curved surfaces are given.
Our technique allows dense surface descriptions to be recovered for objects
with or without texture, and it is not restricted to narrow stereo angles
or low-resolution images. Our technique can also handle objects in close range,
where perspective distortion in the images can be significant.
4.1 Introduction
If dense 2½-D depth measurements are not always directly available from stereo
correspondences, especially when there are curved surfaces in the scene, how,
then, can we derive shape descriptions? We believe the key to the problem
is to reconstruct shape not always via intermediate depth measurements, but
rather directly from stereo correspondences by taking into account the global
shape of objects. A block diagram of our approach is shown in Figure 4.1.
Figure 4.1: Overview of our approach. [Block diagram: the left and right
images each yield 2-D descriptions, which feed stereo correspondences; from
these, triangulation gives 2.5-D descriptions (sometimes not directly
available), and the invariant projective properties of general shape
primitives give 3-D descriptions.]
Recovering global descriptions, be they surfaces or volumes, from sparse
scene data without making any assumptions about the shape is obviously impossible,
as there is in general an infinite number of shapes that can give rise to
the same 2½-D scene data. To infer surfaces from sparse 3-D data, the general
idea is to assume the scene is smooth everywhere except across the surface
boundaries. To infer volumes, we suggest looking for regularities or symmetries
in the scene which are unlikely to happen by accident and which are properties
of some primitives of shape. Strictly speaking, such an approach is still model-
based, yet the models being used are not shapes of specific objects, but shape
primitives that are common and that exhibit properties unlikely to happen
by accident. We propose to use Generalized Cylinders (GCs), introduced by
Binford [16], as the shape primitives for the reconstruction. The motivation for
using GC shapes is three-fold. First, a GC description is itself a volumetric
description. Second, GCs are an important class of shapes that can represent many
objects. Third, many important projective properties of specific instances of
GCs have been discovered recently that can help in the reconstruction process.
This chapter concentrates on how Linear Straight Homogeneous Generalized
Cylinders (LSHGCs) and Straight Homogeneous Generalized Cylinders
(SHGCs) can be reconstructed from stereo images. They are both common
classes of shape, and a large variety of objects can be described by them or
composites of them. Samples of LSHGCs and SHGCs are shown in figure 4.2. We
propose to reconstruct volumetric descriptions based on some invariant properties
of these shape models in their 2-D projections, especially those along
the occluding contours, since these are where features can be extracted reliably
from images of sparsely textured objects. Since we have two views of the
scene between which the transformation is known, we can also take advantage
of the epipolar geometry by making use of some invariant properties of the
shape models in their stereo projections during the reconstruction process.
Our technique can be summarized as follows:

1. We use the hierarchical stereo system described in chapter 3 to extract
hierarchical features in each image for stereo correspondence; such a
system not only allows surfaces in the scene to be segmented based on
global properties like co-curvilinearity and symmetry rather than local
depth differences, but also allows visible surface boundaries to be identified
as either creases or limb boundaries from the stereo properties of the
junctions.

2. We look for invariant projective properties of LSHGCs and SHGCs from
the boundaries of the segmented surfaces, and if such properties exist, we
hypothesize the involved surfaces to be surfaces of LSHGCs or SHGCs
accordingly.

3. For the boundary of each hypothesized surface, we establish pointwise
correspondences across the stereo images between the projected contours,
in the sense that the corresponding image contour points in the
stereo images are projected by the same cross-section of the object; such
correspondences are based on some invariant properties of LSHGCs and
SHGCs in their stereo projections that we describe in this chapter.

4. From such pointwise correspondences we reconstruct the volumetric
description of the object using a method based on geometry.
Figure 4.2: Sample GC shapes. [(a) LSHGCs; (b) SHGCs.]
4.2 Shape Recovery of LSHGCs
To reconstruct LSHGCs from stereo, we first need to look at a number of
properties of LSHGCs in the images. We present the properties and propose
a method for the reconstruction in section 4.2.1. Experimental results then
follow in section 4.2.2.
4.2.1 The Method
An LSHGC is a volume defined by sweeping a given cross-section function
along a straight line called the axis, such that the cross-section is scaled linearly
along the axis (see figure 4.3). Linking the points on the surface which
correspond to the same unscaled arc length s along the boundaries of the
cross-sections, we obtain the meridians. Because of the linearity of the sweeping
function, all meridians are straight and intersect at a point on the axis, which
we call the apex, at which the scaling is zero.

Figure 4.3: The axis, apex, and meridians of an LSHGC.
Lemma 1 The image contour of an LSHGC is the projection of one of its
meridians under orthographic or perspective projection.
Proof Shafer and Kanade [103] have shown algebraically that the contour
generator of an LSHGC is a straight line under orthographic projection. Here
we give a more intuitive proof that the contour generator is in fact one of the
meridians regardless of the projection geometry. Because of the linearity of
the sweeping function, the surface normals to the LSHGC surface at points
along a meridian are all parallel and perpendicular to the meridian. Say a line
of sight from the optical center C of a camera touches the LSHGC surface at
point P as shown in figure 4.4. Let mP be the meridian passing through P.
Then both CP and mP are perpendicular to the surface normal at point P,
and they define the tangent plane π to the surface at point P. Since all surface
normals along a meridian are parallel, the tangent plane π is orthogonal to all
the surface normals along the meridian mP, and so is any line on the plane π.
As a result any point Q on the meridian is a point on the contour generator,
since the line of projection CQ lies on the plane π. □

Figure 4.4: Contour generator of an LSHGC is a meridian.
Theorem 1 Given four image contours of an LSHGC in a stereo pair of
images, the points of intersection among their extensions in the two images
are projections of the apex of the LSHGC and fall on corresponding epipolar
lines under orthographic or perspective projection. (An example illustrating the
theorem is shown in figure 4.5.)
Proof Since all meridians intersect at the apex in 3-D, the projections of the
meridians on any image will also intersect at the projection of the apex onto that
image. By lemma 1 the image contours in the stereo images are projections of
meridians; their extensions therefore intersect at the images of the same point
in 3-D, the apex. As a result the points of intersection fall on corresponding
epipolar lines. □
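The hypothesis test implied by Theorem 1 can be sketched as follows. This is
an illustrative sketch only, assuming rectified cameras (so corresponding
epipolar lines are horizontal scanlines); the contour-line representation,
function names, and the tolerance `eps` are our own, not from the method as
implemented in the dissertation.

```python
import numpy as np

def intersect_lines(p1, p2, q1, q2):
    """Intersection of the infinite 2-D lines through (p1, p2) and (q1, q2)."""
    d1, d2 = np.subtract(p2, p1), np.subtract(q2, q1)
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        return None  # parallel lines: no unique image apex
    t = ((q1[0] - p1[0]) * d2[1] - (q1[1] - p1[1]) * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

def hypothesize_lshgc(left_lines, right_lines, eps=2.0):
    """Each argument holds the two contour lines of one view as point
    pairs ((p1, p2), (q1, q2)).  Extend them; if the two intersections
    lie on corresponding epipolar lines (here: the same scanline),
    return the two image apices, else None."""
    al = intersect_lines(*left_lines[0], *left_lines[1])
    ar = intersect_lines(*right_lines[0], *right_lines[1])
    if al is None or ar is None:
        return None
    return (al, ar) if abs(al[1] - ar[1]) < eps else None
```

Matching the two image apices returned here then recovers the apex in 3-D by
ordinary triangulation, as described in the text.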
Theorem 1 not only allows us to hypothesize an LSHGC from stereo images,
but also to recover the apex in 3-D simply by matching the image apices at
the intersection of the image contours. Notice that a contour in 3-D on the
Figure 4.5: Stereo correspondence of LSHGC contours.
surface of the LSHGC can be recovered by matching the terminator contours
in stereo. The apex and the 3-D contour therefore uniquely define an LSHGC
which projects to the four image contours, regardless of how the cylinder is
cut at the two ends. If the cuts are important, we can first recover their
partial descriptions in 3-D by matching their images, and the cuts are where
the partial descriptions intersect with the cylinder in 3-D.
Since we have not specified any particular cut on the LSHGC, the method
works even for cuts being non-planar or non-orthogonal to the axis.
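Given the matched image apices and a matched cut, the reconstruction itself is
simple: triangulate the apex and the cut points, and join the apex to each cut
point to get a meridian. The sketch below assumes rectified cameras with focal
length f and baseline b (illustrative values; the dissertation's camera model
is more general), using the standard depth-from-disparity relation Z = f·b/(xl − xr).

```python
import numpy as np

def triangulate(pl, pr, f=500.0, b=0.25):
    """Rectified-stereo triangulation: pl = (xl, y), pr = (xr, y)."""
    xl, y = pl
    xr, _ = pr
    z = f * b / (xl - xr)          # depth from disparity
    return np.array([xl * z / f, y * z / f, z])

def lshgc_from_stereo(apex_l, apex_r, cut_l, cut_r):
    """Triangulate the apex and the matched cut points; each meridian
    is then the 3-D segment joining the apex to one cut point."""
    apex = triangulate(apex_l, apex_r)
    cut = [triangulate(pl, pr) for pl, pr in zip(cut_l, cut_r)]
    return apex, [(apex, p) for p in cut]
```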
4.2.2 Experimental Results
Results on a synthetic stereo pair of images of an LSHGC are shown in
figure 4.6. We extract a hierarchy of structural descriptions from each image
using a perceptual grouping technique and match those descriptions in stereo
using the stereo system described in [27]. Edges are detected from each image
using Canny’s edge-detector [26], and are linked into edge-contours using
eight-neighbor connectivity. Edge-contours are segmented into curves at
curvature extrema so that every curve is smooth in itself, and curves are grouped
into contours based on co-curvilinearity. Symmetries are then detected from
each pair of approximately symmetrical contours, and they form ribbons if
they have proper end closures at both ends of the symmetries. Closure at the
end of a symmetry can be composed of a contour, a set of multiple contours, or
the ends of other symmetries. A number of conflicting ribbons are computed
and selection of ribbons is done to resolve conflicts.
The output of the hierarchical stereo system is segmented and matched
surfaces in stereo. Junctions have also been labelled as either junctions along
limb boundaries or junctions along creases, from which limb edges have also
been identified. We then extend the limb edges to see if their points of
intersection in the stereo views fall on corresponding epipolar lines. If they do,
we hypothesize the involved surface to be the surface of an LSHGC, and the
points of intersection in the stereo images are the projections of the apex. Using
the apex and a cut of the object in 3-D, recovered by matching the image
apices and the image terminators respectively, we can recover the volumetric
description of the object.
In figure 4.7 we show another set of results on a stereo pair of real images
of a cone. Notice that although surface markings have not been used in the
process, their presence can be exploited by checking whether the recovered
volumetric description is consistent with depth measurements along those surface
markings; if not, the volumetric description can be deformed to fit the depth
data. In figure 4.8 we overlay the recovered descriptions on the left image to
illustrate the performance of our method. Note also that the perspective
distortion in the images is significant; the eccentricities of the projected ellipses
change gradually along the axis from one end to the other.
4.3 Shape Recovery of SHGCs
Ulupinar and Nevatia [110] have recently demonstrated that the shape of an
SHGC can be recovered merely from a single line drawing. As the problem
is highly underconstrained with one view, the method has to make use of
some perceptual assumptions which are believed to give results consistent with
human perception. It is also restricted to using orthographic projection as its
projection model. In the following we show that if we have a second “eye”,
there is a more direct method to compute the shape as well as the pose of an
SHGC without making those perceptual assumptions. In addition, our method
can also handle images projected under perspective projection. We do assume
that the cross-section function is visible from at least one of the cuts of the
SHGC. However, we do not require the cross-sections to be orthogonal to the
axis, i.e., oblique SHGCs are allowed.
Figure 4.6: Results for a synthetic scene of an LSHGC. [Panels: (a) left
image, (b) right image, (c) left edges, (d) right edges, (e)-(f) matched
ribbons and junctions (left/right), (g)-(h) image apex (left/right),
(i) volumetric description projected on the left view.]
The technique consists of three steps:
1. we hypothesize surfaces of SHGCs from some invariant projective prop
erties of SHGCs along their image contours;
2. we set up pointwise correspondences across the stereo images so that image
contour points projected by the same cross-section of an SHGC are identified;
3. from such pointwise correspondences we reconstruct the volumetric
description of the object using a method based on geometry.

Figure 4.7: Results for a scene of a cone. [Panels: (a) left image, (b) right
image, (c) left edges, (d) right edges, (e)-(f) matched ribbons (left/right),
(g)-(h) image apex (left/right), (i) volumetric description projected on the
left view.]
Figure 4.8: The recovered volumetric description overlaid on the left image of
the cone.
4.3.1 Hypothesizing SHGCs
To reconstruct SHGCs in a scene, we first have to find evidence for their
existence in the images. Ponce et al. [96] have derived an important theorem
regarding the tangents to the image contours of an SHGC under orthographic
projection. Ulupinar and Nevatia [110] extended it to perspective projection
and gave a more intuitive proof of the theorem. The theorem can be stated
as follows: if two image contour points of an SHGC are from the same
cross-section, the tangents to the contours at these points, when extended,
intersect on the projection of the axis. An example is given in figure 4.9.

Figure 4.9: Tangents to the image contours of an SHGC meet at the projection
of its axis.

Using this property of SHGCs, we can hypothesize the existence of an SHGC by
establishing pairwise correspondences between points on the image contours
in each image such that their tangents intersect on the same straight line, with
the straight line so derived being the projection of its axis.
The pairwise correspondences and the projected axis in each image can
be estimated using the Hough Transform as in [96], and confirmed by checking
whether all corresponding pairs of points follow the same order and are
continuous along the contours. A cheaper method is to first hypothesize the axis
from two initial known correspondence pairs, if they are available, and infer
the rest of the correspondences from the hypothesized axis. The initial
correspondence pairs can come from junctions at the ends of the image contours,
as they are from the same cross-sections, which are the terminator surfaces.
Zero-curvature points on the image contours are another possibility, as shown
by Ponce et al. [96].
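The cheaper seeded variant can be sketched as follows. This is a simplified
stand-in for the Hough Transform step of [96], assuming contour points come
with estimated tangent directions; the data format, function names, and the
tolerance `eps` are illustrative assumptions, not the dissertation's
implementation.

```python
import numpy as np

def tangent_intersection(p, tp, q, tq):
    """Intersect the tangent rays p + s*tp and q + u*tq in the image."""
    A = np.array([[tp[0], -tq[0]], [tp[1], -tq[1]]], dtype=float)
    rhs = np.subtract(q, p).astype(float)
    try:
        s, _ = np.linalg.solve(A, rhs)
    except np.linalg.LinAlgError:
        return None  # parallel tangents: no intersection
    return np.asarray(p, float) + s * np.asarray(tp, float)

def point_line_dist(pt, a, b):
    """Perpendicular distance from pt to the line through a and b."""
    d = np.subtract(b, a)
    n = np.array([-d[1], d[0]]) / np.linalg.norm(d)
    return abs(np.dot(np.subtract(pt, a), n))

def axis_consistent(seed1, seed2, pairs, eps=2.0):
    """seed1/seed2 are (p, tp, q, tq) tuples whose tangent intersections
    hypothesize the image axis; keep the candidate pairs whose tangent
    intersections fall near that axis line."""
    a = tangent_intersection(*seed1)
    b = tangent_intersection(*seed2)
    return [pr for pr in pairs
            if (x := tangent_intersection(*pr)) is not None
            and point_line_dist(x, a, b) < eps]
```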
There is further evidence if the cuts of an SHGC are both along the cross-
sections: their boundaries then exhibit parallel symmetry in 3-D as defined
by Ulupinar and Nevatia [110], i.e., there exists a linear correspondence function
between points on the boundaries of the cuts such that the tangents to the
boundaries at the corresponding points are parallel. Since parallel symmetry
is preserved under orthographic projection, Ulupinar and Nevatia have used
this property to confirm an SHGC in an orthographically projected image. As
we have stereo images of the cuts here, we can check the parallel symmetry
in 3-D to confirm the existence of an SHGC, as well as whether the cuts are
along the cross-sections. This has not been implemented in our system.
In our current system we hypothesize a surface to be the surface of an SHGC
if pairwise correspondences can be established along its image contours in
both stereo views as described above. The next step will be to see if we can
construct an SHGC with stereo projections consistent with the scene data.
The idea is simple. Matching the projections of the axis in the stereo images
will automatically recover the 3-D position of the axis. Similarly, matching
the projections of the terminator boundaries will recover the 3-D shape of the
cross-section function. What is left is to scale the cross-section function in 3-D
so that it touches the lines of projection from the same cross-section of the
object, as illustrated in figure 4.10.

The problem is that we have to first set up correspondences across the
stereo images so that we know which pair of points in the left image and
which pair of points in the right image are projected from the same cross-
section. This is nontrivial because, if the contour generators are limbs, the
corresponding points in the images are in fact projections of four different
points on the surface of the SHGC. Let us call the corresponding points
in the stereo images projected from any given cross-section pl1, pl2 and pr1,
pr2 respectively, as shown in figure 4.10.
4.3.2 Setting Up Correspondences across Stereo Images
The correspondence problem is simpler when the contour generators are creases
instead of limbs, in which case pl1 and pr1 are projections of the same point in
3-D and will fall on corresponding epipolar lines, and so will pl2 and pr2 (see
figure 4.11). This shows the importance of identifying the nature of the visible
surface boundaries as being creases or limbs.
Figure 4.10: Volumetric shape recovery of an SHGC.
To solve the correspondence problem for the limb edges is more involved.
We will first state a few previously proved properties of SHGCs, from which we
will derive a new theorem useful for establishing the correspondences across
stereo views. Here let us first define the tangent to a surface at a point P
in the direction of a line L in 3-D to be the tangent which lies on the plane
containing the point P and the line L.
The first property we would like to bring up here is:

Lemma 2 (Shafer and Kanade [103]) Given points on the surface of an
SHGC that belong to the same cross-section, the tangents to the surface at
these points in the direction of the axis, when extended, intersect at a common
point on the axis (see figure 4.12).

Figure 4.11: Corresponding points on the image contours of an SHGC which
belong to the same cross-section, when the contour generators are creases.
Following Shafer and Kanade, we call the common point of intersection
on the axis the apex, and the tangents in the direction of the axis the apex
tangents of the given cross-section. We also call the 2-D projections of the
apex and the apex tangents on any image plane the image apex and image
apex tangents. Notice that different cross-sections of an SHGC generally have
different sets of apex tangents and different apices on the axis.
Another property is:
Lemma 3 (Ulupinar and Nevatia [110]) All the tangent lines to a surface
at a point, say P, which is on a limb edge of the surface under any given
projection geometry, project as the same line on the image plane.
This property, in combination with lemma 2, implies that the tangents to the
limb edges at points which belong to the same cross-section, say pl1 and pl2,
are in fact the 2-D projections of the apex tangents at those
points. As a result their point of intersection in the image is also the
2-D projection of the point of intersection of the apex tangents. This
observation is implicit in [96] and [110], where it is proved that tangents to the
limb edges intersect on the projection of the axis. We rephrase it as follows:
Figure 4.12: The apex of a given cross-section of an SHGC.
Lemma 4 (Ponce et al. [96], Ulupinar and Nevatia [110]) Given two
points on the limb edges of an SHGC that belong to the same cross-section, the
point of intersection between the tangents at those points is the image apex of
that particular cross-section.
Combining lemmas 2 and 4, we get the following theorem:
Theorem 2 If four points on the limb edges of an SHGC in a stereo pair of
images belong to the same cross-section, the points of intersection among the
tangents at those points in the two images fall on corresponding epipolar lines.
(An example illustrating the theorem is given in figure 4.13.)
Figure 4.13: Corresponding points on the image contours of an SHGC which
belong to the same cross-section, when the contour generators are limbs.
Proof By lemma 4, the tangents at the image points intersect at the image apex
of that cross-section in each image. Since the four points are from the same
cross-section, and by lemma 2 the apex is uniquely defined for each cross-section,
the two image apices in the stereo images are in fact projections of the same
point. As a result the two image apices fall on corresponding epipolar lines. □
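The matching step that Theorem 2 licenses can be sketched as follows, again
assuming rectified stereo so that corresponding epipolar lines are scanlines.
The input format (each candidate tangent pair summarized by its
tangent-intersection image apex) and the tolerance are illustrative
assumptions.

```python
def match_quadruples(left_apices, right_apices, eps=1.5):
    """left_apices/right_apices: lists of (pair_id, (x, y)), where
    (x, y) is the intersection of that pair's contour tangents (the
    image apex of its cross-section).  A left pair and a right pair
    belong to the same cross-section when their image apices share a
    scanline; return the matched (left_id, right_id) pairs."""
    matches = []
    for lid, (_, yl) in left_apices:
        # nearest right apex in the epipolar (row) direction
        best = min(right_apices, key=lambda r: abs(r[1][1] - yl))
        if abs(best[1][1] - yl) < eps:
            matches.append((lid, best[0]))
    return matches
```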
Theorem 2 gives one way to establish correspondences among points on
the limb edges which belong to the same cross-section. What remains is
to scale the cross-section function in 3-D to project to the four corresponding
points. This is still nontrivial, as the four points correspond to four lines of
projection which in general are not coplanar. The following describes a
non-iterative method for computing the cross-section in 3-D for each set
of correspondences. Remember that the cross-section function in 3-D can be
computed by matching the terminator contours in stereo. Similarly, the axis
of the SHGC in 3-D can be recovered by matching the image axes.
4.3.3 Fitting Cross-Sections to the Correspondence Quadruples
We have obtained, for every cross-section slice of the SHGC, two corresponding
image apices and four image apex tangents in the stereo images. Here we
treat the recovery problem as one of recovering a virtual LSHGC whose apex is
the apex of the cross-section, whose meridians are the apex tangents of that
cross-section, and whose cut is the cross-section itself. Following the scheme
in section 4.2, we first derive the LSHGC and then determine the proper cut
that is consistent with the scene data.
By matching the image apices we can recover the apex A of the virtual
cylinder (see figure 4.14). We then move the cross-section function down the
axis in 3-D to some distance, say t, from the apex. We call the new axis point
C(t). Pick one of the four image apex tangents, say the one at point pl1. The
image apex tangent and the optical center of the corresponding camera form a
plane of projection which we call π, which the virtual cylinder should touch.
For each point s along the boundary of the cross-section function, we compute
two measures: the distance r(s) from C(t) to the boundary point s, and the
distance R(s) from C(t) through the point s to the plane π. The fraction
R(s)/r(s) is the scaling of the cross-section such that the point s touches the
tangent plane π. The proper scaling of the cross-section function to touch the
plane π, regardless of whether the edges are limb edges or real edges, can
therefore be computed as:

scale(t) = min_s R(s)/r(s)

The apex A, the axis, and the scale function scale(t) at distance t along the axis
uniquely define a virtual LSHGC which gives rise to the image apex tangents.
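The scale(t) = min_s R(s)/r(s) computation can be sketched as follows,
assuming the cross-section boundary is available as sampled 3-D points in the
plane of C(t), and the projection plane pi is given by a point and a unit
normal. The representation and names are illustrative assumptions.

```python
import numpy as np

def section_scale(C_t, boundary, plane_pt, plane_n):
    """Scaling that makes the cross-section, centred at axis point
    C(t), just touch the projection plane pi."""
    ratios = []
    for s in boundary:
        d = np.subtract(s, C_t)
        r = np.linalg.norm(d)                  # r(s): C(t) to boundary point
        denom = np.dot(d / r, plane_n)
        if abs(denom) < 1e-12:
            continue                           # ray parallel to the plane
        # R(s): distance from C(t) through s to the plane pi
        R = np.dot(np.subtract(plane_pt, C_t), plane_n) / denom
        if R > 0:
            ratios.append(R / r)
    return min(ratios)                         # the first point to touch pi
```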
The next step is to recover the proper cut of the cylinder to project to
the correspondence quadruple. From the above process we can easily recover
the point of contact P(t) between the projection plane π and the scaled cross-
section of the cylinder at distance t from the apex. The line AP(t) then defines
the contour generator on the cylinder which projects to the given image apex
tangent. Notice that the contour generator of an LSHGC has to be a straight
line. Let us say that the proper cut is at a distance t1 from the apex along
the axis, and the proper scaling of the cross-section function on that plane is
scale(t1). The point P(t1) on the surface of the cylinder which projects to
point pl1 is then given by the intersection of two lines: the line of projection
Figure 4.14: Geometry of recovering a cross-section of an SHGC.
through point pl1, and the contour generator AP(t). Finally, using similar
triangles we get:

t1 = (AP(t1)/AP(t)) · t
scale(t1) = (AP(t1)/AP(t)) · scale(t)
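This last step can be sketched as follows. Since the projection ray through
pl1 and the contour generator A-P(t) may not intersect exactly under noise, we
take P(t1) as the point on the generator nearest the projection ray; the
line-as-(point, direction) representation and names are our own illustrative
assumptions.

```python
import numpy as np

def nearest_on_line(a, u, c, v):
    """Point on the line a + s*u nearest to the line c + w*v."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    uu, vv, uv = u @ u, v @ v, u @ v
    denom = uu * vv - uv * uv
    if abs(denom) < 1e-12:
        raise ValueError("parallel lines")
    rhs = np.subtract(c, a)
    s = (vv * (u @ rhs) - uv * (v @ rhs)) / denom
    return np.asarray(a, float) + s * u

def recover_cut(A, P_t, t, scale_t, proj_pt, proj_dir):
    """Intersect the projection ray through pl1 with the contour
    generator A->P(t); the similar-triangle ratio then gives t1 and
    scale(t1)."""
    P_t1 = nearest_on_line(A, np.subtract(P_t, A), proj_pt, proj_dir)
    ratio = np.linalg.norm(P_t1 - np.asarray(A, float)) \
        / np.linalg.norm(np.subtract(P_t, A))
    return ratio * t, ratio * scale_t          # t1, scale(t1)
```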
The above process of recovering the virtual cylinder and the cut can be
applied to any one of the four image apex tangents. In principle they should
all return the same cross-section, unless the object is in fact not an SHGC but
merely “looks” like an SHGC in monocular views, or the cut being used as
the cross-section function is actually not along one of the cross-sections. This
serves as an additional verification in the reconstruction process. Our system
computes t1 and scale(t1) from each of the image apex tangents separately,
and uses their averages for shape reconstruction if the values are consistent
with one another.
4.3.4 Experimental Results
Results on a synthetic stereo pair of images of an SHGC are shown in
figure 4.15. We compute the hierarchical descriptions from each image, and
match those descriptions as in the scene of the LSHGC. We hypothesize the
image axes of the SHGC from the tangents at the identified limb-junctions,
and determine the correspondence pairs between the image contour points
from the hypothesized axis. From the positions where the tangents at the
corresponding image contour points intersect the image axes, correspondences
among the image contour points can be set up across the stereo images. From
these correspondences we are able to recover the cross-sections of the SHGC
in 3-D, as shown in figure 4.15.
In figure 4.16 we show another set of results on a stereo pair of real images
of a typical desk lamp. The cameras were configured such that the optical
axes were parallel, with a baseline approximately 25 cm long. The lamp was
about 75 cm away from the cameras. Both cameras have a spatial resolution
of 512 by 480 with 8 bits of grey scale. Here we show more details of the
intermediate steps used in the hierarchical stereo matching system proposed
in [27]. Edges are detected from each image using Canny’s edge-detector [26],
and are linked into edge-contours based on eight-neighbor connectivity. Edge-
contours are segmented into curves at curvature extrema so that every curve
is smooth in itself, and curves are grouped into contours based on continuity.
Symmetries are then detected from each pair of approximately symmetrical
contours, and they form ribbons if they have proper closures at both ends of
the symmetries. The closure at the end of a symmetry can be composed of a
curve, a set of multiple curves, or the ends of other symmetries. Very small
symmetries are ignored to save computation time. Still, a large number of
symmetries are left, and they form many conflicting ribbons. The ribbons then
go through a selection process based on a number of constraints among the
ribbons. The selected ribbons and the hierarchies of descriptions in the two
Figure 4.15: Results for a synthetic scene of an SHGC. [Panels: (a) left
image, (b) right image, (c) left edges, (d) right edges, (e)-(f) matched
ribbons and junctions (left/right), (g)-(h) image axis (left/right),
(i) volumetric description projected on the left view.]
images are then used for stereo correspondence. Junctions are extracted from
the matched ribbons and labelled as limb-junctions or real junctions from their
behavior across the stereo images. Limb edges are also identified during this
step. The details of these perceptual grouping and stereo matching processes
are given in [27].
We then group neighboring ribbons which share smooth boundaries into
objects. Notice that the lamp object consists of two neighboring sections
of curved surfaces. Using the Hough Transform method mentioned in
section 4.3.1, we are able to derive the SHGC axes of both curved sections and
identify that they both share the same axis. We then treat the two curved
sections as one single SHGC and derive the volumetric descriptions as in the
previous example. In figure 4.17 we overlay the recovered descriptions on
the left image to illustrate the performance of our method. Notice that the
perspective distortion in the images is significant; the eccentricities of the
projected ellipses change gradually along the axis from one end to the other.
4.4 Computational Complexity and Run-time
We present an informal computational complexity analysis of our system. The
hierarchical features are extracted using the stereo correspondence system
described in chapter 3, where the computational complexity of the feature
extraction process is also presented.
4.4.1 Recovering an LSHGC
Hypothesizing the Image Apex from one view: Only opposite contour
pairs of a ribbon, which have to be lines, are considered to be the surface
boundary of an LSHGC. The lines are extended, and the point of intersection
is hypothesized to be the image apex. This only takes constant
time for each ribbon.

Matching Image Apices across Stereo Views: An opposite contour pair
of a ribbon and their correspondences in the other view are checked to see
whether they hypothesize image apices on corresponding epipolar lines. This also
takes constant time.

Constructing the Volumetric Description: When a ribbon is hypothesized
to be the projected boundary of an LSHGC, the neighboring ribbon(s)
is also hypothesized to be the image of the terminator surface of
the object. A cut of the object can then be recovered by matching the
Figure 4.16: Results of hierarchical stereo matching and volumetric shape
recovery for the scene of a lamp. [Panels: (a) left image, (b) right image,
(c) left edges, (d) right edges, (e)-(f) left/right symmetries, (g)-(h)
matched ribbons and junctions (left/right), (i)-(j) image axis and contour
point correspondences (left/right), (k) volumetric description projected on
the left view.]
image terminator contours across stereo, which takes time proportional
to the perimeter of the terminator surface. Meridians of the LSHGC
can then be generated directly from the apex and the cut, which also
takes time proportional to the perimeter of the terminator surface. The
construction process therefore takes O(t) time, where t is the average
perimeter of the terminator surfaces.

Figure 4.17: The recovered volumetric description overlaid on the left image
of the lamp.

On the whole, our system takes O(1) + O(1) + O(t) time, or O(t) time, to
recover an LSHGC from a ribbon. Since the time complexity is driven by the
number of pixels along the boundaries of the terminator surfaces, a parametric
representation of the contours in the images may significantly reduce the
complexity.
Run-times on a Symbolics 3620 for the experimental examples are shown
in Table 4.1.
Process                 Complexity   Run-Time (sec)   Entities
LSHGC example:
  Locate LSHGC Apex     O(1)         0.24
  Reconstruct Slices    O(t)         30.00            t = 121; 16 slices
Cone example:
  Locate LSHGC Apex     O(1)         0.77
  Reconstruct Slices    O(t)         64.99            t = 172; 16 slices

Table 4.1: Time complexity and run-time of recovering an LSHGC.
4.4.2 Recovering an SHGC
Hypothesizing the Image Axis from one view: As presented in [96], the
complexity of the Hough Transform algorithm is O(p²d), where p is the
average number of points on an image contour and d² is the dimension of
the Hough space.

Matching Image Contour Points across Stereo Views: For each pair
of corresponding image contour points in the left view, we check if there
is any pair of corresponding image contour points in the right view such
that the tangents to the image contours at the four points intersect on
corresponding epipolar lines. This takes O(p²) time.

Constructing the Volumetric Description: When a ribbon is hypothesized
to be the projected boundary of an SHGC, the neighboring ribbon(s)
is also hypothesized to be the image of the terminator surface of
the object. A cut of the object can then be recovered by matching the image
terminator contours across stereo, which takes time proportional to
the perimeter of the terminator surface. It also takes time proportional
to the perimeter of the terminator surface to recover a cross-section of
the SHGC from the axis, the cut, and the four image contour points
projected by the cross-section. The construction process therefore takes
O(tp) time, where t is the average perimeter of the terminator surfaces.

On the whole, our system takes O(p²d) + O(p²) + O(tp) time, or O(p²d)
time, to recover an SHGC from a ribbon. Since the time complexity is driven
by the number of pixels along the image contours and the boundaries of the
terminator surface, a parametric representation of the contours in the images
may significantly reduce the complexity.
Run-times on a Symbolics 3620 for the experimental examples are shown
in Table 4.2.
Process                       Complexity   Run-Time (sec)   Entities
NLSHGC example:
  Locate Image Axis             O(p²d)       136.20          p = 145; k = 5; d = 480
  Correspondence Quadruples     O(p²)          2.30
  Reconstruct Slices            O(tp)        142.78          t = 120; 16 slices
Lamp example:
  Locate Image Axis             O(p²d)       354.25          p = 283; k = 5; d = 480
  Correspondence Quadruples     O(p²)          6.03
  Reconstruct Slices            O(tp)        211.61          t = 174; 16 slices

Table 4.2: Time complexity and run-time of recovering an NLSHGC.
4.5 Conclusion
In this chapter we have examined the problem of deriving volumetric shape
descriptions from stereo images. We emphasize that intermediate 2½-D dense
depth measurements may not always be directly available from stereo, which
is basically why shape from stereo cannot be treated as merely a sequence of
two modules: depth from stereo, and shape from range data. As a result, the
volumetric reconstruction may have to be computed directly from stereo
correspondences. We have described how volumetric shape can be reconstructed
from stereo using some primitives of shape such as LSHGCs and SHGCs. The
methods are based on some invariant properties of the shape models in their
2-D projections. Such properties are not all monocular; we have proposed
some properties in stereo which further help in confirming and reconstructing
LSHGCs and SHGCs from stereo images.
Our technique allows dense surface descriptions to be recovered for objects
with or without texture, and it is not restricted to narrow stereo angles
or low-resolution images. Our technique can also handle objects at close range,
which is in fact where stereo is most effective, without being affected by any
possible perspective distortion in the projected images. We have shown results
for objects with circular cross-sections, but our method is not limited to these.
Our method can even allow LSHGC objects with arbitrary cuts across the
cylinders, as well as SHGC objects with oblique cross-sections.
Chapter 5
Application: Extracting Building Structures from a Stereo Pair of Aerial Images
In this chapter, we apply some of the proposed ideas to address the problem
of extracting polyhedral building structures from a stereo pair of aerial
images [28]. We describe a system that computes a hierarchy of descriptions
such as segments, junctions, and links between junctions from each view, and
matches these features at the different levels. Such features not only help
reduce correspondence ambiguity during stereo matching, but also allow surface
boundaries to be inferred even though the boundaries may be broken because
of noise and weak contrast. We hypothesize surface boundaries by examining
global information such as continuity and coplanarity of linked edges in 3-D,
rather than by merely looking at local depth information. We avoid the error
made in inferring 3-D information from triangulation by translating 3-D
collinearity into 2-D collinearity in the two views, and coplanarity from the
3-D space into the disparity space, and working merely in the 2-D and the
disparity domains. When the walls of the buildings are visible, we also exploit
the relationships among adjacent surfaces of an object to help confirm the
different levels of descriptions. Experimental results for various aerial scenes
are also shown.
5.1 Introduction
We have presented some ideas about extracting structural descriptions of
objects in an imaged scene. An important application of such a capability is to
detect architectural structures from aerial images. With stereo images, not
only can the problem be potentially made easier because of the additional
information, but three-dimensional (3-D) depth information about the scene
can also be estimated quantitatively. In this chapter we address the structural
description problem in the context of extracting building structures from a
stereo pair of aerial intensity images. We make the assumption that building
structures are polyhedral.

Figure 5.1: A typical aerial image.
A typical aerial image is shown in Figure 5.1. General characteristics of an
aerial image are:
— a large number of the buildings are polyhedral structures;
— very often, only the roofs of the buildings are visible;
— occlusions among buildings are rare;
— the ground is textured with parallel markings such as lane markings,
pavement edges, and even shadows cast by the buildings, with cars, trees,
and pedestrians scattered somewhat randomly all over the place; yet the
texture is seldom dense enough to allow complete surface boundaries to
be extracted merely by looking at local depth measurements.
As a result, we can describe the simplest version of the building recovery
problem as a figure-ground problem in 3-D, in which the figures are the roofs of
the buildings to be extracted. An exaggerated analogy of the scenario would be
the stereo images of a white square hanging in space over a dark background,
in which the only stereoscopically matchable features are the occluding edges
between the square and the background. Since the best we can get from
stereo matching is a closed rectangular contour in 3-D, a traditional approach
would recover a single plane for the whole scene, as no depth discontinuity
is identifiable based on local depth measurements. Yet the existence of a
closed contour does strongly suggest the existence of a white square hanging
in space (based on the physical evidence alone it is also possible that the white
square is indeed a surface marking on a piece of black paper, but several sparse
scratches on the background would be enough to make our point). Structural
descriptions are therefore very important in inferring surface boundaries in
this case.
Our approach is to exploit the structural features to recover discontinuity
information, without solely relying on local depth differences between adjacent
features in the scene. As suggested by the above example, one important
observation is that a closed contour rarely occurs in 3-D, and if it does, it has
to be one of three possibilities:
1. the boundary of a surface patch;
2. the boundary of a set of connected surface patches;
3. the boundary of a closed surface marking.
Moreover, there are some intrinsic relationships among these possibilities. For
example, a surface marking, as its name implies, is always contained in a
surface patch (which can be the background) or a set of connected surface
patches, and a set of connected surface patches usually composes a solid object.
The key is how we can capture such relationships to infer where the actual
surface boundaries are.
We therefore propose to hypothesize closed contours in 3-D as the surface
boundaries, and when there are conflicts, to select among the hypotheses based
on their individual merits and the relationships displayed among them.
However, edges projected by surface boundaries are usually far from perfect
and they may be broken because of noise and weak contrast. A grouping
process is generally necessary to extract structural features for stereo
correspondence and for inferring the continuity of the surface boundaries. Let us
point out that such a grouping process is indeed unavoidable. Even if depth
and orientation discontinuities can be identified by matching some primitive
features and by examining the local depth measurements, discontinuities so
identified are also likely to be broken and sparse.
Since aerial images are usually highly cluttered, and many features in the
scene such as the surface boundaries, the pavement edges, and the shadow
boundaries are parallel to one another, the idea of monocular grouping from
the edge level to the surface level, as used in the system described in chapter 3,
will encounter difficulties for aerial scenes. 3-D information of the features
can be very helpful in the grouping process. Here we design a stereo system
differently from the previously proposed systems so that it extracts structural
features bottom-up hierarchically from each image, and at the same time it
matches such features across stereo to recover 3-D information to help in the
grouping process.
For generic surfaces, the only property we can make use of in capturing the
continuity of surface boundaries is co-curvilinearity. It is easier if polyhedral
structures can be assumed, as surface boundaries can then be broken down
into merely junctions and links, where a link is the collection of collinear
edges between two junctions. Assuming the building structures are polyhedral,
the natural structural descriptions that should be made use of in reducing
correspondence ambiguity and inferring surface boundaries are therefore line
segments, junctions, and links between junctions whose branches are collinear.
We conjecture that junctions, in particular, are even more distinct than line
segments and can resolve the ambiguity in matching parallel structures because
they capture the relationships among different line segments.
For polyhedral structures, we can further constrain the number of surface
boundary hypotheses by enforcing that the boundary of a hypothesized surface
patch has to be planar in 3-D. This removes the possibility of extracting the
boundary of a set of connected surfaces as in the case of generic structures.
Our approach can therefore be summarized as follows:
1. we propose a stereo system that computes a hierarchy of descriptions
such as edges, line segments, junctions, and links between junctions
from each view, and matches these features at the different levels;
2. we hypothesize surface boundaries from such matched and confirmed
structural descriptions based on continuity and coplanarity, and select
among these hypotheses based on the individual merits of the
hypotheses and the relationships displayed among them.
Such an approach requires checking collinearity and coplanarity of linked
edges in 3-D whose depth information is estimated using the triangulation
method, a process known to be error-prone. We will show that we can reduce
the error made in inferring 3-D information from stereo correspondences at this
stage by translating 3-D collinearity into 2-D collinearity in the two views, and
coplanarity from the 3-D space into the disparity space, and working merely
in the 2-D and the disparity domains.
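The coplanarity translation rests on a known property of rectified parallel-camera geometry: a 3-D plane projects to a plane in (x, y, disparity) space, i.e. disparity is a linear function d = ax + by + c of image position. Coplanarity of matched features can therefore be tested by a linear fit on disparities, with no triangulation. A minimal sketch, with function name and tolerance ours:

```python
import numpy as np

def coplanar_in_disparity(xs, ys, ds, tol=0.5):
    """Fit d = a*x + b*y + c by least squares; the features are coplanar in
    3-D iff their disparities fit a plane in (x, y, d) within tolerance."""
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    coef, *_ = np.linalg.lstsq(A, np.asarray(ds, float), rcond=None)
    residuals = A @ coef - np.asarray(ds, float)
    return bool(np.max(np.abs(residuals)) <= tol)
```

Working with residuals in the disparity domain avoids amplifying matching error through the depth-from-disparity division.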
5.2 A Building Extraction System
As mentioned above, our approach is to first compute a hierarchy of descriptions
such as edges, line segments, junctions, and links between junctions from
each view, and match these features across the stereo views at the different
levels. We then hypothesize surface boundaries from such matched and
confirmed structural descriptions based on continuity and coplanarity, and select
among these hypotheses based on the individual merits of the hypotheses
and the relationships displayed among them.
We conjecture that junctions, in particular, can resolve ambiguity in matching
parallel structures in the scene because of the following properties of junctions:
— point features: The probability that two junctions in the left and right
views fall on corresponding epipolar lines by accident is much smaller
than that for any other extended features; in fact the probability tends
to zero as junctions are point features.
— distinct: Junctions are formed from intersections of line segments, and
thus capture information both about the segments themselves and about
the relationships among the segments. The essence of matching junctions
is, if two line segments in space do not actually intersect in 3-D, the
extensions of their stereo projections will not intersect to give points on
the corresponding epipolar lines either.
— help identify occlusions: As described in an earlier work [27], one effect
of occlusions in stereo is that they generate regions which appear in one
view but do not have correspondence in the other. Such regions in turn
modify the properties of the junctions accompanying the occlusion
boundaries across the stereo views, either by moving the junctions across
the epipolar lines or by changing the junction-types of the junctions.
Such behavior of the junctions allows us to identify different types of
occlusions and to locate surfaces visible in only one of the stereo views.
— easy to extract: Junctions are easy to extract as well as to match. We
can always hypothesize junctions by extending contours in each view to
see if they intersect at any points, and then check if there are matchable
junctions hypothesized in the other view which fall on corresponding
epipolar lines.
Aerial images can be taken from overhead views or from oblique views,
where walls of buildings are visible in the latter but not in the former. If
the input images are taken from overhead views, the relationships among
adjacent surfaces of the same solid objects are usually unavailable. In case of
conflicts among surface boundary hypotheses, we have to rely largely on the
individual merits of the hypotheses and the depth information extracted from
stereo matching to locate the actual surface boundaries. We will first assume
the images are taken from overhead views and present how buildings can be
recovered from stereo. We will also show how the relationship between the
roof and the wall of a building can be made use of for images taken from
oblique views.
An overview diagram of our stereo system is shown in Figure 5.2. It is
basically composed of three interrelated modules:
1. Structural Descriptions & Matching: edges, line segments, junctions,
and links are extracted and matched across stereo views;
2. Figure Extraction: surface boundaries are hypothesized from the
confirmed links and junctions, and conflicts are resolved;
3. Ground Extraction: the ground level, assumed to be planar, is
recovered as the plane that contains most of the matched features outside
the recovered surface boundaries; surface boundaries extracted from the
figure extraction module which are not above the ground level are in
turn regarded as markings on the ground and discarded.
As in chapter 3, various levels of our stereo system require matching structural
features of different levels in the two views, and constraints for the matching
process can be formulated as unary and binary constraints among the possible
matches. We use a relaxation network described in appendix A to accom
plish the tasks. When we describe the constraints for matching the structural
97
Hypothesize*
f Surfaces
Connectivity,
C opianarity
fHypothesizei
I Links
Junctions
Junctions
Eximm* .
in m m m
Extension &
Intersection
Cbllineai
Segments
Lmkittg^
Line fit (in;
Matched |
Junctions J
Matched
Segments E d jg p
Detection
FIGURE GROUND
:88i:88;88i88ll
Selected
Surfaces
| Disparity
Ground
1
“Inside”
Links
‘Outside’
Links
Guidance
Coplanar
Clusters
LEFT RIGHT
STRUCTURAL
DESCRIPTIONS & MATCHING
Figure 5.2: Overview of our stereo system.
features, we also indicate the nature of the constraints as being excitatory,
inhibitory, or absolute.
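The relaxation network itself is given in appendix A and is not reproduced here. As a rough sketch of the kind of computation involved, a greedy discrete relaxation over 0/1 match variables, with unary merits and excitatory/inhibitory pairwise terms, can be written as follows. This is a simplification of ours, not the network of appendix A:

```python
import numpy as np

def relax(unary, binary, iters=100):
    """Greedy discrete relaxation over binary match hypotheses V_i in {0, 1}.
    `unary[i]` is the merit of match i; `binary[i][j]` is positive for
    excitatory pairs and strongly negative for inhibitory (e.g. mutually
    exclusive) pairs; the diagonal is assumed zero.  Each pass turns V_i on
    exactly when doing so lowers the cost
        E(V) = -sum_i V_i*unary_i - sum_{i<j} V_i*V_j*binary_ij."""
    unary = np.asarray(unary, float)
    B = np.asarray(binary, float)
    V = np.zeros(len(unary))
    for _ in range(iters):
        changed = False
        for i in range(len(V)):
            gain = unary[i] + B[i] @ V      # change in -E if V_i is set to 1
            new = 1.0 if gain > 0 else 0.0
            if new != V[i]:
                V[i], changed = new, True
        if not changed:
            break
    return V
```

With a strongly negative pairwise term, two mutually exclusive matches cannot stay on together, and the one with the larger unary merit survives.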
5.2.1 Structural Description and Matching Module
Edges are extracted from each image using Canny's edge detector [26], and
they are linked and fit to become chains of line segments using the techniques
of [90]. To make later processing easier, we extend each line segment from its
two end-points for a limited length, and break any other line segments that
cross over to the extensions. We then hypothesize an L-junction from each
pair of line segments using the following criteria (Figure 5.3):
1. the two line segments are not parallel;
2. neither of the line segments contains the point of intersection between them;
3. the paths from the end-points of the line segments to the point of
intersection have to be free of other edges;
4. the distances from the end-points of the line segments to the point of
intersection are within a threshold (60 pixels).
We then say an L-junction is formed at the intersection of the two segments,
and the segments are called branches of the junction.

Figure 5.3: Hypothesis of an L-junction from two line segments.
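The criteria can be sketched as follows, as a simplified illustration of ours; the edge-free test of criterion 3 is abstracted as a caller-supplied predicate, since it depends on the edge-map representation:

```python
import math

def _intersect(a, b):
    """Parameters (t, s) and point of intersection of the infinite lines
    through segments a and b, or None if the segments are parallel."""
    (p1, p2), (q1, q2) = a, b
    d1 = (p2[0] - p1[0], p2[1] - p1[1])
    d2 = (q2[0] - q1[0], q2[1] - q1[1])
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None                                  # criterion 1 fails
    r = (q1[0] - p1[0], q1[1] - p1[1])
    t = (r[0] * d2[1] - r[1] * d2[0]) / denom
    s = (r[0] * d1[1] - r[1] * d1[0]) / denom
    return t, s, (p1[0] + t * d1[0], p1[1] + t * d1[1])

def hypothesize_L_junction(a, b, edge_free=lambda p, q: True, max_dist=60.0):
    """Return the junction point for segments a, b (each a pair of end-points)
    if the four criteria hold, else None."""
    res = _intersect(a, b)
    if res is None:
        return None
    t, s, x = res
    if 0.0 <= t <= 1.0 or 0.0 <= s <= 1.0:
        return None                                  # criterion 2: x lies on a segment
    for seg, u in ((a, t), (b, s)):
        end = seg[1] if u > 1.0 else seg[0]          # end-point nearer x
        if math.dist(end, x) > max_dist:             # criterion 4
            return None
        if not edge_free(end, x):                    # criterion 3
            return None
    return x
```

The returned point becomes the junction location, and the two input segments its branches.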
Notice that if the images are taken from oblique views, at the projection
of a multihedral vertex, several L-junctions may be hypothesized at the same
location by different combinations of the branches. Since the junction-type
and the number of branches of a junction may change across stereo views
(e.g., from "W" in the left view to "L" in the right view; see section 3.2.3),
here we treat each of these L-junctions separately in the matching phase. Such
L-junctions will eventually hypothesize different faces of the vertex, and such
information will be made use of in the surface boundary selection process
described in section 5.2.2.
Such descriptions are extracted monocularly, and we match them across
stereo views to confirm them and to maintain a consistent interpretation of
the scene. Since junctions are more distinct and they capture information
among the line segments, junctions, as well as the branches composing them,
are first matched in a hierarchical manner using the following constraints.
Constraints for Matching Junctions:
1. Unary Constraints
— Epipolar constraint (absolute): Two junctions are matchable only
if they fall on corresponding epipolar lines (Figure 5.4).
— Hierarchical constraint (absolute): Two junctions are matchable
only if there exists at least one way to match all their component
branches (as explained below in matching branches).
2. Binary Constraints
— Uniqueness constraint (mutually exclusive): An L-junction can be
matched with at most one L-junction. Junction-matches which
share the same junction either in the left or in the right view are
therefore mutually exclusive with one another.
— Figural Continuity constraint (excitatory): If junction l is matched
with junction r, it is preferred that junctions with branches collinear
with the branches of junction l and junction r in the two views are
also matched, if those junctions are indeed matchable (Figure 5.5).
The definition of collinearity used in our system is illustrated in
Figure 5.6. Two line segments a and b are collinear if the angles
θ_ac, θ_bc, and θ_ab subtended by the segments and the line segment
joining them are within the range 180° ± 20°.
— Ordering constraint (inhibitory): If the component branch matches
of two junction matches have order reversal along the epipolar lines,
the junction matches inhibit each other (Figure 5.7).
As there are altogether two compromisable constraints, one predefined
weight in the weighted sum of the constraints is necessary for the cost function
to be optimized in the relaxation. The cost function is therefore:

E(V) = Σ_ij V_i V_j (FCC_ij + w_OC OC_ij)

where w_OC is kept constant at −2 in our system. The value is designed
according to the relative importance of the constraints.

Figure 5.4: Epipolar constraint for matching junctions: junctions are
matchable only if they fall on corresponding epipolar lines.

Constraints for Matching Branches of Junctions:
1. Unary Constraints
— Epipolar constraint (absolute): Two branches are matchable only if
there is overlap between the epipolar bands containing them
(Figure 5.8). If the branches are almost parallel to the epipolar
lines, they have to be pointing both toward the left or both toward
the right; otherwise they have to be either both pointing up or both
pointing down across the epipolar lines.
2. Binary Constraints
— Uniqueness constraint (mutually exclusive): One branch can be
matched with at most one branch in the other view.
— Surface-Orientation constraint (mutually exclusive): The branch-
matches associated with a junction-match have to be such that
the cross-product of the two branches in a view must point in the
same direction (either into or out of the image plane) as that of the
corresponding branches in the other view (Figure 5.9), so as to
ensure the same "face" of any physical surface in space is matched.
This implies branches in the same epipolar band have to be matched
in the order from left to right.

As there is no compromisable constraint, no weight in the weighted sum
of the constraints is necessary for the cost function to be optimized in the
relaxation.

Figure 5.5: Figural Continuity constraint for matching junctions: matches
(l_j1, r_j1) and (l_j2, r_j2) support each other.

Figure 5.6: Definition of 2-D collinearity in our system: angles θ_ac, θ_bc, and
θ_ab are within the range 180° ± 20°.
Figure 5.7: Ordering constraint for matching junctions: the two junction
matches inhibit each other as their component branch matches have order
reversal along epipolar lines.

The matched junctions and the corresponding branches are then used to
guide the matching of the rest of the line segments. Line segments which are
collinear with the matched branches in both views are extracted and regarded
as correct matches. All these "correct" branch-matches and segment-matches
are then used to lock on the segment matching process, in the sense that
any other possible segment-matches which are mutually exclusive with them
are discarded, and the rest of the segment-matches go through the relaxation
process by taking into account the information of the "correct" matches. The
following shows the constraints used for matching line segments.
Constraints for Matching Line Segments:
1. Unary Constraints
— Epipolar constraint (absolute): Two line segments are matchable
only if there is overlap between the epipolar bands containing them.
2. Binary Constraints
— Uniqueness constraint (mutually exclusive): A line segment cannot
be matched with more than one line segment in the other view
which have overlap in their epipolar extents (Figure 5.10).
— Ordering constraint (excitatory): If line segment l is matched with
line segment r, it is preferred that line segments on the left (right)
of line segment l are matched with line segments on the left (right)
of line segment r.
— Surface Continuity constraint (excitatory): If line segment l is
matched with line segment r, it is preferred that an immediate
neighbor on the left (right) of line segment l is also matched with
an immediate neighbor on the left (right) of line segment r, if they
are indeed matchable.
— Figural Continuity constraint (excitatory): If line segment l is
matched with line segment r, it is preferred that line segments
collinear with line segment l are also matched with line segments
collinear with line segment r.

As there are altogether three compromisable constraints, two predefined
weights in the weighted sum of the constraints are necessary for the cost
function to be optimized in the relaxation. The cost function is therefore:

E(V) = Σ_ij V_i V_j (FCC_ij + w_SCC SCC_ij + w_OC OC_ij)

where w_SCC and w_OC are kept constant at 1 and 2 respectively in our system.
The values are designed according to the relative importance of the constraints.

Figure 5.8: Epipolar constraint for matching branches of junctions: the two
junctions are not matchable as neither l_a nor l_b is matchable with r_b.

Figure 5.9: Surface-orientation constraint for matching branches of junctions:
l_1 has to be matched with r_1, and l_2 with r_2, so that l_1 × l_2 and
r_1 × r_2 point in the same direction (out of the paper).

Figure 5.10: Uniqueness constraint for matching line segments: matches (l, r_1)
and (l, r_2) are mutually exclusive with each other.

Such junctions and line segments confirmed in both views are then used
for subsequent processes.
5.2.2 Figure Extraction and Ground Extraction Modules
We do not wish to rely on observing local depth measurements to recover
surface boundaries, as we expect the scene may not be densely textured enough.
Instead, we use the coplanarity and continuity properties of the surface
boundaries to hypothesize them. Junctions and line segments extracted and
confirmed in the stereo views should allow us to capture such properties in 3-D.
Since each matched junction already defines a plane in 3-D itself, we can
start with the matched junctions and cluster all matched junctions and
segments into different sets such that entities in each cluster are all coplanar,
and extract possible surface boundaries from each cluster in turn. However,
branches of junctions are usually short, making their 3-D information
unreliable.
As a surface boundary can be broken down into junctions and links, we
propose to first extract links between matched junctions to capture the
collinearity information of the edges. A link is created between two junctions if the
following criteria are satisfied (Figure 5.11):
1. there exist branches, one from each junction, that are collinear;
2. the corresponding branches of the corresponding junctions in the other
view are also collinear;
3. there exists enough edgel evidence (50% of the distance between the
junctions) to support the formation of the link in both views, i.e., there
are enough segments that lie between the junctions in both views.
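The collinearity test shared by criteria 1 and 2 (the 180° ± 20° definition of Figure 5.6) can be sketched as follows. The helper names are ours, and taking the joining segment c between the facing end-points is an assumption:

```python
import math

def _angle(u, v):
    """Angle in degrees between vectors u and v."""
    dot = u[0] * v[0] + u[1] * v[1]
    n = math.hypot(*u) * math.hypot(*v)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / n))))

def collinear(a, b, tol_deg=20.0):
    """2-D collinearity of segments a and b (each a pair of end-points): with
    c the segment joining their facing end-points, the angles subtended at the
    joints and between the two segments must lie within 180 deg +/- tol."""
    i, j = min(((i, j) for i in (0, 1) for j in (0, 1)),
               key=lambda ij: math.dist(a[ij[0]], b[ij[1]]))
    a_near, a_far = a[i], a[1 - i]
    b_near, b_far = b[j], b[1 - j]
    va = (a_far[0] - a_near[0], a_far[1] - a_near[1])   # along a, away from b
    vb = (b_far[0] - b_near[0], b_far[1] - b_near[1])   # along b, away from a
    c = (b_near[0] - a_near[0], b_near[1] - a_near[1])  # joining segment
    lo = 180.0 - tol_deg
    if _angle(va, vb) < lo:
        return False
    if math.hypot(*c) > 1e-9:
        if _angle(va, c) < lo or _angle(vb, (-c[0], -c[1])) < lo:
            return False
    return True
```

Including the joining segment in the test rejects pairs of parallel but laterally offset segments, which a direction-only test would accept.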
Since links are much longer and they capture information between connected
junctions, their 3-D information is more reliable. Two links joining at
a junction, together with their correspondences in the other view, define a plane
in the 3-D space. We start with pairs of connected links to define planes in
3-D, and cluster all the links into different sets such that links in each cluster
are all coplanar.
Figure 5.11: Hypothesis of a link-match from two junction-matches.

Surface boundaries can then be hypothesized by extracting sets of connected
links in each cluster. However, computing all possible combinations of
connected links can be exhaustive. In fact what we need from a set of coplanar
links is not any inside closed contours, as those are formed from surface
markings, but the contours composed from the "outsidemost" links which embed
the rest of the links. To describe how the outsidemost links are extracted from
a cluster, we first need to make a few definitions.
If we pick a point which we call the center point in a 2-D space, the two
end points of any extended feature in the space, as shown in Figure 5.12, when
coupled with the center point, define a sector in the 2-D space, which in turn
consists of an outside zone and an inside zone. We then call an extended feature
an outsidemost feature of a cluster of features in the 2-D space if there are no
other features in the cluster that cross into its outside zone. Similarly, we call
an extended feature an insidemost feature of a cluster of features if there are
no other features in the cluster that cross into its inside zone.
Figure 5.12: The outside and inside zones of an extended feature relative to a
center point in a 2-D space.

We first take the 2-D coordinates of the end points of all the links and
compute the centroid of such 2-D points. Using the centroid as the center
point, the outsidemost links can therefore be extracted from the cluster of
coplanar links. This algorithm will extract all the outsidemost links for any
center point inside the desired outsidemost boundary. The drawback of using
the centroid as the center point is that it may be outside the desired
outsidemost boundary, e.g., the centroid of an L-shaped roof, and in this case some
of the outsidemost links may be missed. Yet it will not hurt our system, as
the missed links will be recovered when we hypothesize surface boundaries by
tracing along the identified outsidemost links. Another alternative is to
extract all the outsidemost and insidemost links with respect to the centroid (see
Figure 5.13), which will definitely include all the possible outsidemost links of
the cluster.
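The extraction can be sketched as follows. This is a minimal version of ours that tests only the end-points of the other links against each link's outside zone, and approximates the zone boundary by the chord through the link's end-points; the actual system may test entire features:

```python
import math

def _ang(c, p):
    return math.atan2(p[1] - c[1], p[0] - c[0])

def _in_sector(c, a, b, p):
    """True if p lies in the (narrower) angular sector at centre c spanned by a, b."""
    ta, tb, tp = _ang(c, a), _ang(c, b), _ang(c, p)
    span = (tb - ta) % (2 * math.pi)
    if span > math.pi:
        ta, span = tb, 2 * math.pi - span
    return (tp - ta) % (2 * math.pi) <= span

def _in_outside_zone(c, link, p, eps=1e-6):
    """True if p crosses into the outside zone of `link`: it falls in the
    link's sector and lies farther from c than the chord through the link."""
    a, b = link
    if not _in_sector(c, a, b, p):
        return False
    th = _ang(c, p)
    d = (math.cos(th), math.sin(th))                 # unit ray from c through p
    ab = (b[0] - a[0], b[1] - a[1])
    denom = d[0] * ab[1] - d[1] * ab[0]
    if abs(denom) < 1e-9:
        return False
    t = ((a[0] - c[0]) * ab[1] - (a[1] - c[1]) * ab[0]) / denom
    return math.dist(c, p) > t + eps                 # beyond the chord

def outsidemost_links(links):
    """Outsidemost links of a coplanar cluster, using the centroid of all
    end-points as the centre point."""
    pts = [p for l in links for p in l]
    c = (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
    return [l for l in links
            if not any(_in_outside_zone(c, l, p)
                       for m in links if m is not l for p in m)]
```

On a square of boundary links with an interior marking link, the four sides survive and the marking is rejected, since the sides' end-points fall outside the marking's chord.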
We therefore first extract the outsidemost links in each cluster, and merely
start with them to extract surface boundary hypotheses based on continuity.
When there are multiple links attached to the same end of a link, we trace
through all possibilities to increase the robustness of the process. Each surface
boundary hypothesis therefore contains at least one of the outsidemost links
in the cluster.
There may be separate roofs in the scene that are coplanar, and we execute
the above process recursively (Figure 5.14): for each cluster, we extract the
outsidemost links, recover chains of links containing them, remove all links
enclosed by the chains, and then we extract the outsidemost links from the
rest of the links and execute the process again, until no more links are left in
the cluster.
Figure 5.13: The outsidemost and insidemost extended features with respect
to a center point outside the desired boundary.
Edges are usually broken and sometimes totally missing in real images.
Because of this, we do not expect all of the surface boundaries hypothesized
from the above process to be closed. We use the following simple criteria to
close an open surface:
1. If the branches of the junctions at the two open ends of the boundary
are collinear with each other, and the edgel support across the opening
exceeds a certain threshold (50% of the gap length), the opening is closed.
2. If the branches of the junctions at the two open ends are not collinear
with each other but their extensions intersect on corresponding epipolar
lines in the two views, then we look at how much edgel support there is
along the paths from the two open ends to the point of intersection in
each view. If the edgel support exceeds a certain threshold (50% of the
total path length), the opening is also closed.
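The edgel-support test used by both criteria can be sketched as follows, assuming a set-of-pixels edge map; the system's actual edge structures may differ:

```python
import math

def edgel_support(edge_map, p, q):
    """Fraction of sample points along the segment from p to q (image (x, y)
    coordinates) that land on edge pixels.  `edge_map` is a set of
    (row, col) edge pixels."""
    n = max(2, int(math.dist(p, q)))
    hits = sum((round(p[1] + k / n * (q[1] - p[1])),
                round(p[0] + k / n * (q[0] - p[0]))) in edge_map
               for k in range(n + 1))
    return hits / (n + 1)

def close_opening(edge_map, p, q, threshold=0.5):
    """Criterion 1: close the gap between open ends p and q when the edgel
    support across it reaches 50% of the gap length."""
    return edgel_support(edge_map, p, q) >= threshold
```

For criterion 2 the same function would be applied to the two paths from the open ends to the hypothesized intersection point, in each view.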
Figure 5.14: Hypothesis of surface boundaries from a cluster of coplanar links.

Such surface boundary hypotheses are likely to have conflicts among them,
and a selection needs to be made. Ideally, the hypotheses should only be
either actual surface boundaries or surface markings, since the possibility of the
boundaries of a set of connected surfaces has been ruled out by the planarity
criterion. However, as we do allow gaps in the boundaries, some hypotheses
may be constructed across collinear edges of different surfaces which happen
to be coplanar (Figure 5.15).
To separate surface markings from actual surface boundaries is simple:
surface markings are contained in larger surface boundaries which are coplanar
with them. As a result, among a set of coplanar surface hypotheses, the ones
that enclose more area and are not contained by others are more likely to be
actual surface boundaries. This is formulated as a weak constraint called the
"Outsidemost-boundary constraint" outlined later in this section. It is more
involved to distinguish between individual roofs and the false boundaries across
different coplanar roofs. We basically rely on edgel support along the boundary
of the hypotheses to make the decision.

As a summary, the constraints used to resolve conflicts among the
hypothesized surface boundaries are given below.
Figure 5.15: False boundaries across collinear edges of coplanar surfaces.
Constraints for Selecting Surface Boundaries:
1. Unary Constraints
— Boundary-Evidence constraint (absolute): The fraction of edges
detectable along the surface boundary has to exceed a certain threshold.
— Regularity constraint (excitatory): A surface boundary hypothesis
is considered more likely to be an actual surface boundary if: (1) it
is made from two sets of parallel links (skew symmetry); or (2)
there are more than three junctions on the boundary, and there exists
a circle containing all the junctions such that all the junctions
are evenly distributed along the circumference of the circle (rotational
symmetry). To check this, we use the centroid of the 2-D
positions of the junctions to hypothesize the center of the circle.
— Outsidemost-boundary constraint (excitatory): A surface boundary
hypothesis enclosing more area is considered more likely to be an
actual surface boundary.
2. Binary Constraints
— Uniqueness constraint (mutually exclusive): Surface boundary hypotheses are
mutually exclusive with one another if: (1) there is overlap in their enclosed
areas, as we assume all surfaces are opaque; or (2) they share part of their
boundaries and are coplanar with one another, as they should then be absorbed
into one single surface patch.
Figure 5.16: Solid-formation constraint for selecting surface boundaries: sur
face hypotheses s1, s2, and s3 support each other pairwise.
If the images are taken from oblique views, i.e., the walls of the buildings
are also visible, we have one more constraint: selected surfaces should compose
feasible solid objects. This is formulated simply as a strong binary constraint
added to the surface-selection process:
— Solid-Formation constraint (excitatory): Two surface boundary hypotheses
support each other if (Figure 5.16): (1) they share one and only one link; and
(2) they are not coplanar with each other.
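The rotational-symmetry test in the Regularity constraint above can be sketched as follows. The function name, the tolerances, and the exact notion of "evenly distributed" are our assumptions for illustration, not the thesis's implementation.

```python
import math

def rotationally_symmetric(junctions, radius_tol=0.1, angle_tol=0.2):
    """Regularity constraint, rotational part: more than three junctions,
    all roughly on a circle about their centroid, evenly spaced.
    Tolerances are illustrative assumptions."""
    n = len(junctions)
    if n <= 3:
        return False
    # Hypothesize the circle center as the centroid of the junction positions.
    cx = sum(x for x, y in junctions) / n
    cy = sum(y for x, y in junctions) / n
    radii = [math.hypot(x - cx, y - cy) for x, y in junctions]
    mean_r = sum(radii) / n
    # All junctions must lie near the hypothesized circle.
    if any(abs(r - mean_r) > radius_tol * mean_r for r in radii):
        return False
    # Consecutive angular gaps must all be close to 2*pi/n.
    angles = sorted(math.atan2(y - cy, x - cx) for x, y in junctions)
    gaps = [angles[i + 1] - angles[i] for i in range(n - 1)]
    gaps.append(2 * math.pi - (angles[-1] - angles[0]))
    return all(abs(g - 2 * math.pi / n) < angle_tol for g in gaps)
```

For example, the five corners of a regular pentagon pass the test, while four collinear junctions fail the circle check.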
As there are altogether three compromisable constraints, two predefined
weights in the weighted sum of the constraints are necessary for the cost
function to be optimized in the relaxation. The cost function is therefore:

E(V) = -Σ_i V_i BEC_i - (1/2) Σ_ij V_i V_j (w_RC RC_ij + w_SFC SFC_ij)

where w_RC and w_SFC are kept constant at 1 and 3 respectively in our system.
The values are designed according to the relative importance of the constraints.
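As a concrete reading of this cost function, the sketch below evaluates E(V) for a 0/1 assignment. The array names are ours; `bec[i]` stands in for whatever unary boundary-evidence score the system assigns, and `rc`/`sfc` for the pairwise regularity and solid-formation support terms.

```python
def cost(V, bec, rc, sfc, w_rc=1.0, w_sfc=3.0):
    """E(V) = -sum_i V_i*BEC_i - 1/2 sum_ij V_i*V_j*(w_RC*RC_ij + w_SFC*SFC_ij).
    V is a 0/1 selection vector; rc and sfc are symmetric matrices."""
    n = len(V)
    unary = sum(V[i] * bec[i] for i in range(n))
    pair = sum(V[i] * V[j] * (w_rc * rc[i][j] + w_sfc * sfc[i][j])
               for i in range(n) for j in range(n))
    return -unary - 0.5 * pair
```

Selecting a hypothesis with strong boundary evidence and mutual support lowers E, which is what the relaxation minimizes.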
There are buildings whose roofs are not a single connected surface patch, but
have "holes" in them. An example is the Pentagon building shown in
Figure 5.19. Evidence for the existence of a hole is that there are features
inside the recovered surface boundary that are not coplanar with it; the
boundary of the hole is then formed by the "insidemost" links coplanar with
the recovered surface boundary.
For each recovered surface boundary, we therefore check whether there are
segment matches inside the boundary that are not coplanar with it. If there
are, we first use the centroid of the end points of those segment matches as
the center point to extract the insidemost links coplanar with the surface
boundary. Hole boundaries are then hypothesized based on the continuity of
the insidemost links, and the one with the smallest area is taken as the hole
boundary.
Such an approach requires checking collinearity and coplanarity of linked
edges in 3-D, whose depth information is estimated using the triangulation
method, a process known to be error-prone. It is well known that 3-D
collinearity translates into 2-D collinearity in the projection through any
viewpoint. We also show in appendix B that coplanarity in 3-D space is
preserved in the disparity space. These properties allow us to avoid the
errors made in inferring 3-D information from stereo correspondences by
working in the 2-D and disparity domains instead. A method for checking
coplanarity is described in appendix C.
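The least-squares coplanarity measure of appendix C can be sketched directly from its normal equations. The function below is our illustration (using Cramer's rule for the small 3x3 solve), not the thesis code; a near-zero residual indicates the points are coplanar.

```python
def fit_plane(points):
    """Least-squares fit of a*x + b*y + c*z + 1 = 0 (appendix C measure).
    Returns (a, b, c, residual); residual near zero means coplanar."""
    # Normal equations: A . (a, b, c) = rhs, with A_ij = sum p_i p_j
    # and rhs_i = -sum p_i, from setting dE/da = dE/db = dE/dc = 0.
    A = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for p in points:
        for i in range(3):
            rhs[i] -= p[i]
            for j in range(3):
                A[i][j] += p[i] * p[j]
    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det3(A)
    abc = []
    for k in range(3):  # Cramer's rule, one column swap per unknown
        Ak = [row[:] for row in A]
        for i in range(3):
            Ak[i][k] = rhs[i]
        abc.append(det3(Ak) / d)
    a, b, c = abc
    res = sum((a * x + b * y + c * z + 1) ** 2 for x, y, z in points)
    return a, b, c, res
```

Note the parameterization ax + by + cz + 1 = 0 cannot represent planes through the origin; the thesis's formulation shares this property.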
We assume the ground level is planar, and we recover it as the plane that
contains most of the matched features outside the recovered roof boundaries,
using the method described in appendix C. Surface boundaries extracted from
the figure extraction module which are coplanar with the ground level are in
turn regarded as markings on the ground and discarded.
5.3 Experimental Results
Results for images taken from overhead views are shown in Figures 5.17, 5.18,
and 5.19 respectively. We extract edges and line segments from the stereo
images, and hypothesize L-junctions from the line segments. The junctions
are then matched across stereo views, and the results are used to guide the
matching of the line segments in the entire scene. The junctions also allow us
to hypothesize links from pairs of junctions whose branches are collinear. We
then take the 3-D information of any two connected links together with their
correspondences in the other view to cluster the links into different sets, such
that links in a cluster are all coplanar. Outsidemost links are extracted from
each cluster and surface boundaries are hypothesized based on the continuity
of links. Because of weak contrast, noise, and other factors, not all the
boundary hypotheses are closed. Boundary hypotheses that display the skew or
rotational symmetries defined in section 5.2.2 but are missing one link to be
closed are more likely to be real surface boundaries, so our system closes
them automatically. For the rest of the open boundary
hypotheses, the system will look at how much edgel support there is along the
gaps to decide whether they should be closed. Surface boundaries are then
selected from the hypotheses based on the individual merits of the hypotheses
and the relationships displayed among them.
The Pentagon building shown in Figure 5.19 has been a popular example
used by many stereo systems. However, most stereo systems merely recover a
coarse depth map, not the surface boundaries in the scene. The existence of a
hole on the top of the building, in particular, has mostly been ignored. An
exception is the system of [32], which shows very impressive results in
locating the surface boundaries. Yet some of the recovered boundaries are
broken and not very accurate, a drawback of area-based matching and of
localizing depth discontinuities by examining local depth differences. To
recover the roof boundary of the Pentagon building is a particularly difficult
task, as there are a lot of coplanar features inside the roof. Hypothesizing
surface boundaries from every possible combination of connected links would
lead to a combinatorial explosion. Our system is capable of using the 3-D
information from stereo to group all the links on the roof into one single
cluster, and the outsidemost links of the cluster are extracted from it. Our
system also recognizes that the outsidemost links form a closed boundary
which also displays rotational symmetry. The boundary is therefore taken as
the boundary of a surface patch in the scene. In Figure 5.20 we also show the
intermediate steps of how the hole on the roof of the building can be recovered.
Lacking real image data taken from oblique views, we have taken some
stereo pictures ourselves in our laboratory for testing purposes. We put a
toy building on top of a blow-up of a typical aerial picture and took the
images from oblique angles with a camera mounted on a linear table. An
example is shown in Figure 5.21. Such images are not completely realistic,
yet they capture some of the basic characteristics of real images taken from
oblique views. As in the above examples, we show the intermediate steps and the
performance of our stereo system. A major characteristic of oblique views is
that multiple neighboring surfaces composing the same solid object may be
visible, and their relationships among one another are exploited in our system
in selecting among the surface boundary hypotheses.
[Panels (a)-(n): left/right images, left/right segments, left/right junctions,
matched junctions, matched segments, links, closed surfaces, selected
surfaces, and a 3-D rendered view.]

Figure 5.17: Results for a scene (B10).
[Panels (a)-(m): left/right images, left/right segments, left/right junctions,
matched junctions, matched segments, links, selected surfaces, and a 3-D
rendered view.]

Figure 5.18: Results for another scene (B11).
[Panels (a)-(n): left/right images, left/right segments, left/right junctions,
matched junctions, matched segments, links, one coplanar cluster of links,
outsidemost closed surfaces for the cluster, and selected surfaces.]

Figure 5.19: Results for the Pentagon building.
[Panels (a)-(f): noncoplanar segments, coplanar links, insidemost coplanar
links, insidemost surface boundary, selected surfaces and the holes, and a 3-D
rendered view.]

Figure 5.20: Results of recovering the hole on the roof of the Pentagon building.
5.4 Computational Complexity and Run-time
We present an informal computational complexity analysis of our system. The
analysis is similar to the one presented in chapter 3.
Preprocessing: We have used existing systems for the tasks of edge detection,
edge linking, and line fitting as preprocessing in our system. We do not
discuss the computational complexities of these processes here.
[Panels (a)-(n): left/right images, left/right segments, left/right junctions,
matched junctions, matched segments, links, outsidemost closed surfaces,
selected surfaces, and a recovered depth map.]

Figure 5.21: Results for the oblique views of a hotel (scene OBL).
Hypothesizing Junctions: Suppose there are n line segments in the image.
Since we consider two line segments at a time to hypothesize a possible
junction, the whole process of hypothesizing junctions is of time complexity
O(n²). However, the actual number of junctions hypothesized is usually much
less than O(n²), as two line segments can form a junction only if they are
close enough and at an angle with each other. For practical purposes we can
assume on average each line has k neighboring lines with which it can form
junctions. The total number of junctions hypothesized is therefore O(nk).
Matching Junctions: Suppose there are j junctions in each image, and suppose
the dimension of each image is r × r. If the allowed disparity range is D
pixels, each junction in the left image can form matches with junctions in
the right image only within D pixels along the corresponding epipolar line.
Since the average number of junctions at each pixel is j/r², the hypothesis
process generally takes O(j · D · j/r²) or O(j²D/r²) time. For practical
purposes, we can assume the number of junctions per unit area in each image,
i.e., j/r², is constant. In that case the hypothesis process is of time
complexity O(jD). All these possible matches then go through a selection
process to remove conflicts. It takes O(j⁴D²/r⁴) time⁴ in general to set up
the relaxation network, as binary relationships among the nodes need to be
extracted. This becomes O(j²D²) if the number of junctions per unit area in
an image is constant. Let us say on average each match has ρ competitors.
During each iteration of the relaxation process, each match is compared to
each one of its competitors. The relaxation therefore takes O(j²Dρ/r²) time,
if we assume the total number of iterations is constant.
Matching Line Segments: Suppose a line segment has an average length of p
pixels. If the allowed disparity range is D pixels, each segment in the left
image can form matches with segments in the right image only within D · p
pixels within the corresponding epipolar band. Since the average number of
segments at each pixel is n/r², the hypothesis process generally takes
O(n · Dp · n/r²) or O(n²Dp/r²) time. For practical purposes, we can assume
the number of segments per unit area in each image, i.e., n/r², is constant.
In that case the hypothesis process is of time complexity O(nDp). All these
possible matches then go through a selection process to remove conflicts. It
takes O(n⁴D²p²/r⁴) time in general to set up the relaxation network, as
binary relationships among the nodes need to be extracted. This becomes
O(n²D²p²) if the number of segments per unit area in an image is constant.
Let us say on average each match has ρ competitors. During each iteration of
the relaxation process, each match is compared to each one of its
competitors. The relaxation therefore takes O(n²Dpρ/r²) time, if we assume
the total number of iterations is constant.

⁴This sounds expensive. However, we have to point out that any system which
attempts to match features across stereo views using multiple order
relationships among the matches requires a similar or higher order of
magnitude of computation. The difference is in fact in the number of features
to be matched and how distinct the features are.
we can assume the number of segments per unit area in each image,
i.e., n/r2, is constant. In that case the hypothesis process is then of
time complexity 0(nDp). All these possible matches then go through a
selection process to remove conflicts. It takes 0 ( ^ D 2p2) time in gen
eral to set up the relaxation network, as binary relationships among the
nodes need to be extracted. This becomes 0(n2D2p2) if the number of
segments per unit area in an image is constant. Let us say on average
each match has p competitors. During each iteration of the relaxation
process, each match is compared to each one of its competitors. The re-
laxation therefore takes O(^-Dpp) time, if we assume the total number
of iterations is constant.
Hypothesizing Links: Suppose there are J matched junctions in the stereo
images. Since we consider two junctions at a time to hypothesize a possible
link, the whole process of hypothesizing links is of time complexity less
than O(J²). Again, the actual number of links hypothesized is usually O(J),
if we assume on average each junction has a constant number of junctions
whose branches are collinear with its own. Notice that a link is hypothesized
together with its correspondence in the other view. Therefore, in the
following, when we say a link we actually refer to a link match.
Clustering Links based on Coplanarity: Suppose there are altogether l links
hypothesized, and they belong to g different coplanar clusters. We can find
all the sets of two connected links in O(l) time if we first write the label
of each link into a 2-D array at the 2-D coordinates of its end points; for
each link we then just look for other link labels around its end points in
the array. What the clustering process does is to first pick two connected
links at a time, which defines a plane in the disparity space, and then to
check which of the remaining links lie on the specified plane. However, once
a cluster is formed, the process will not consider any two links in the
cluster to generate the same plane again. As a result only g planes are
generated. Since all the links are checked against each of the planes, the
clustering process takes O(lg) time.
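The clustering step above can be sketched as follows, under a simplified reading that is our assumption: links are 3-D segments given as endpoint pairs, two links are "connected" when the first's second endpoint equals the second's first, and coplanarity is an exact-coordinate tolerance test. The thesis works in disparity space and uses the 2-D endpoint array for connectivity; this sketch only illustrates the plane-generation logic.

```python
def cluster_coplanar(links, tol=1e-6):
    """Greedy coplanar clustering: two connected links fix a plane (via
    three of their endpoints); every link with both endpoints on that
    plane joins the cluster.  A cluster's links never seed a new plane."""
    def plane_from(p, q, r):
        u = [q[i] - p[i] for i in range(3)]
        v = [r[i] - p[i] for i in range(3)]
        n = (u[1] * v[2] - u[2] * v[1],      # normal = (q-p) x (r-p)
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
        return n, sum(n[i] * p[i] for i in range(3))
    def on_plane(n, d, pt):
        return abs(sum(n[i] * pt[i] for i in range(3)) - d) < tol
    clusters, used = [], set()
    for i, (a, b) in enumerate(links):
        for j, (c, e) in enumerate(links):
            if j <= i or b != c:          # connected = shared endpoint b == c
                continue
            if i in used and j in used:   # this plane was generated already
                continue
            n, d = plane_from(a, b, e)
            if not any(n):                # collinear links fix no plane
                continue
            cluster = [k for k, (p, q) in enumerate(links)
                       if on_plane(n, d, p) and on_plane(n, d, q)]
            clusters.append(cluster)
            used.update(cluster)
    return clusters
```

Only g planes get generated (one per cluster), so the pass over all links against all planes is the O(lg) term.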
Hypothesizing Surfaces: From each of the g link clusters the outsidemost
links are extracted and surfaces are hypothesized from them. To extract the
outsidemost links, a center point is first picked, which can be the end point
of one of the links in the cluster, and then the outside zone of every link
can be defined with respect to the center point. A link is an outsidemost
link if no other link crosses into its outside zone. The process of
extracting outsidemost links from the clusters, each of which on average
contains l/g links, therefore takes O(g(l/g)²) or O(l²/g) time. Suppose there
are L outsidemost links in the different clusters, where L ≪ l. As mentioned
above, using a 2-D array, we can establish all the binary connectivity
relationships among all the links in each cluster in linear time. Since
during the process of hypothesizing surfaces each outsidemost link is allowed
to be traced only once in each direction along the link, the contour-tracing
process for all the clusters takes approximately O(L) time. The whole process
of hypothesizing surfaces therefore takes O(l²/g) + O(L) or O(l²/g) time.
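The text does not spell out the geometry of the "outside zone", so the sketch below adopts one plausible reading that is our assumption: the zone of a link is the angular sector it spans about the center point, beyond the link itself, and another link blocks it by placing an endpoint in that sector at a larger distance. Angular wraparound at ±π is ignored for brevity.

```python
import math

def outsidemost(links, center):
    """Return indices of links (pairs of 2-D endpoints) whose assumed
    'outside zone' about `center` contains no endpoint of another link."""
    def polar(p):
        return (math.atan2(p[1] - center[1], p[0] - center[0]),
                math.hypot(p[0] - center[0], p[1] - center[1]))
    result = []
    for i, (a, b) in enumerate(links):
        (ta, ra), (tb, rb) = polar(a), polar(b)
        lo, hi = min(ta, tb), max(ta, tb)
        rmax = max(ra, rb)
        blocked = False
        for j, seg in enumerate(links):
            if j == i:
                continue
            for p in seg:
                t, r = polar(p)
                if lo < t < hi and r > rmax:   # farther out, same sector
                    blocked = True
        if not blocked:
            result.append(i)
    return result
```

With a cluster of l/g links this all-pairs test is quadratic per cluster, matching the O(g(l/g)²) = O(l²/g) term above.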
Selecting Surfaces: Suppose there are altogether r hypothesized surfaces.
Setting up the relaxation network takes O(r²) time, as binary relationships
are involved. Suppose on average each one of them has q competitors, for some
q ≪ r. During each iteration of the relaxation each hypothesis is compared
with each one of its competitors. The relaxation therefore takes O(rq) time,
assuming that the total number of iterations in the relaxation is constant.
Run-times on a Symbolics 3620 for some of the experimental examples are shown
in Tables 5.1, 5.2, 5.3, and 5.4.
5.5 Conclusion
We have designed a stereo system that recovers surfaces from a stereo pair of
intensity images. The system is geared toward recovering surfaces of building
structures, which display a high degree of regularity. Experimental results
for aerial images taken from overhead views and oblique views are also shown.
Since urban scenes are often highly cluttered and aerial images are usually
not taken under the best imaging conditions, the system is designed in such a
way that imperfections in extracting different levels of structure are
allowed and resolved based on the relationships among different levels of
features. This aspect is most prominent when links are hypothesized from the
collection of possibly broken edges between junctions, and when highly
regular yet open surface boundary hypotheses are closed with artificial links.
Process                  Complexity   Run-Time (sec)   Entities
Hypothesizing Jcns       O(n²)        169.33           n = 242; 291 jcns formed
Matching Jcns            O(j²D²)      919.11           j = 291; D = 44; 204 jms formed; 56 jms selected
Matching Lines           O(n²D²)      1086.64          300 lms formed; 161 lms selected
Hypothesizing Links      O(J²)        277.87           J = 56; l = 23
Clustering Links         O(lg)        40.49            g = 5
Hypothesizing Surfaces   O(l²/g)      77.98            6 surfaces formed
Selecting Surfaces       O(r²)        1.41             1 surface selected

Table 5.1: Time complexity and run-time of recovering the building in the
scene B10 (jcn = junction, jm = junction-match, lm = segment-match).
Process                  Complexity   Run-Time (sec)   Entities
Hypothesizing Jcns       O(n²)        306.92           n = 296; 409 jcns formed
Matching Jcns            O(j²D²)      1124.37          j = 409; D = 81; 254 jms formed; 88 jms selected
Matching Lines           O(n²D²)      1553.73          418 lms formed; 250 lms selected
Hypothesizing Links      O(J²)        692.39           J = 88; l = 34
Clustering Links         O(lg)        73.59            g = 7
Hypothesizing Surfaces   O(l²/g)      159.63           10 surfaces formed
Selecting Surfaces       O(r²)        48.00            1 surface selected

Table 5.2: Time complexity and run-time of recovering the building in the
scene B11 (jcn = junction, jm = junction-match, lm = segment-match).
Another distinct feature of our system is that it aims at exploiting 3-D
information as much as possible without relying totally on it. We take
neither the extreme of matching some bottom-level features and inferring
high-level features such as surfaces merely from the depth map, nor the other
extreme of extracting high-level features all the way from bottom to top in
each view
Process                  Complexity   Run-Time (sec)   Entities
Hypothesizing Jcns       O(n²)        683.11           n = 588; 485 jcns formed
Matching Jcns            O(j²D²)      1030.94          j = 485; D = 16; 261 jms formed; 131 jms selected
Matching Lines           O(n²D²)      2258.99          590 lms formed; 416 lms selected
Hypothesizing Links      O(J²)        1365.69          J = 131; l = 52
Clustering Links         O(lg)        176.90           g = 6
Hypothesizing Surfaces   O(l²/g)      374.64           11 surfaces formed
Selecting Surfaces       O(r²)        33.08            1 surface selected

Table 5.3: Time complexity and run-time of recovering the building in the
Pentagon scene (jcn = junction, jm = junction-match, lm = segment-match).
Process                  Complexity   Run-Time (sec)   Entities
Hypothesizing Jcns       O(n²)        33.75            n = 121; 260 jcns formed
Matching Jcns            O(j²D²)      821.84           j = 260; D = 85; 179 jms formed; 57 jms selected
Matching Lines           O(n²D²)      118.68           103 lms formed; 86 lms selected
Hypothesizing Links      O(J²)        165.78           J = 57
Clustering Links         O(lg)        94.16            g = 9
Hypothesizing Surfaces   O(l²/g)      115.83           12 surfaces formed
Selecting Surfaces       O(r²)        268.32           3 surfaces selected

Table 5.4: Time complexity and run-time of recovering the building in the
scene OBL (jcn = junction, jm = junction-match, lm = segment-match).
and using the high-level features for stereo matching. We extract features
from each view step by step from bottom to top, during which we also use the
stereo views to confirm such features and recover their 3-D information using
stereo matching. We make use of the extracted structural features, together
with their 3-D information, to hypothesize and select surface boundaries in
the scene.
Our system relies heavily on examining information such as collinearity and
coplanarity of linked edges in 3-D, whose depth information is extracted from
the triangulation process, a known error-prone process. We avoid this error
by examining collinearity in 2-D and coplanarity in the disparity space
instead.
Chapter 6

Conclusion

6.1 Summary
In this dissertation we have addressed the problem of deriving shape
description from stereo in the following domains:
1. Stereo Correspondence: We have designed a stereo system that computes
hierarchical descriptions of a scene from each view and combines the
information from the two views to give a 3-D description of the scene.
Structural descriptions help reduce correspondence ambiguity during stereo
matching, whereas stereo correspondences help confirm the different levels of
abstract features in the two views and maintain a consistent interpretation
of the scene. The hierarchy of descriptions also avoids the errors caused by
improper application of the commonly used surface-continuity and ordering
constraints for stereo correspondence. Another benefit of matching structural
features is that occlusions can be specifically identified and interpreted.
Surfaces are segmented and their visible boundaries are identified as either
crease boundaries or limb boundaries. We can also infer some properties of
the surfaces visible in only one view.
2. Shape Description: As 2½-D dense depth measurements may not always be
directly available from stereo, especially when there are curved surfaces in
the scene, we have proposed some ideas of how volumetric descriptions of
objects can be computed directly from stereo correspondences. We have shown
how volumetric shape can be inferred, even though some of the visible surface
boundaries may be limb boundaries, using some primitives of shape such as
LSHGCs and SHGCs as
the shape models. The methods are based on some invariant properties
of the shape models in their 2-D projections. Such properties are not
all monocular; we have proposed some properties in stereo which further
help in confirming and reconstructing LSHGCs and SHGCs from stereo
images. The technique can handle objects in close range, which is in fact
where stereo is most effective, without being affected by any possible
perspective distortion in the projected images.
3. We have applied some of the ideas to build a system specifically for
extracting building structures from a stereo pair of aerial images. Although
the application is rather specific, it is both a major application of
computer vision and a good platform for evaluating how well some of the ideas
perform in uncontrolled environments. The promising experimental results show
that the direction we pursue is indeed a valid one. Since urban scenes are
often highly cluttered and aerial images are usually not taken under the best
imaging conditions, the system is designed in such a way that imperfections
in extracting different levels of structure are allowed and resolved based on
the relationships among different levels of features. Another distinct
feature of the system is that it aims at exploiting 3-D information as much
as possible without relying totally on it. We take neither the extreme of
matching some bottom-level features and inferring high-level features such as
surfaces merely from the depth map, nor the other extreme of extracting
high-level features all the way from bottom to top in each view and using the
high-level features for stereo matching. We extract features from each view
step by step from bottom to top, during which we also use the stereo views to
confirm such features and recover their 3-D information using stereo
matching. We make use of the extracted structural features, together with
their 3-D information, to locate surface boundaries in the scene.
6.2 Future Research
Our goal in stereo development has not been just to obtain a depth or needle
map, but to compute abstract descriptions of the surfaces and objects visible
in the scene. We believe that we have made major progress toward this goal,
and we attribute the success to the approach of combining descriptions and
stereo matching, rather than viewing them as separate processes in a linear
chain.
However, the chapter of stereo analysis is still far from being closed, and
much remains to be done. We can see at least a couple of issues that need to
be addressed for the shape-from-stereo problem:
— Our system does require relatively high resolution images and probably
would not be effective in highly textured and unstructured scenes where
the monocular grouping processes fail (we can still match curves and
get partial descriptions). To handle such scenes well, we would need to
add yet another level of matching similar to what is used in area-based
approaches to our system.
— We have not made use of surface markings, if there are any, during the
shape reconstruction process. In fact they can be used either to confirm the
recovered volumetric descriptions, or to deform the descriptions to fit the
depth measurements along the markings. The same arguments apply if other
shape recovery cues, such as shading, are available. Our system can provide a
first estimate of the volumetric description, based on the boundary contour
information alone, for further confirmation or modification by other sources
of information.
— We have proposed some ideas of how some general primitives of shape, namely
LSHGCs and SHGCs, can be recovered from stereo. These cover a large number of
objects commonly occurring in our daily life. However, objects are generally
composites of such shape primitives. How the shape of composite objects can
be recovered from stereo needs to be studied in detail. Another direction is
to study how the proposed techniques can be extended to reconstruct other
classes of shape primitives.
Appendix A

A Relaxation Network for Constrained Optimization
A typical problem is: given a set of nodes {N_i} whose values {V_i} have some
known interactions among one another, what are the values of the nodes that
achieve the best compromise according to the constraints among them? The
problem is especially difficult if each node can only take the value 0 or 1,
which is equivalent to the problem of making the best selection among the
nodes.
If all the interactions involve at most two nodes at a time, we can model the
constraints as either unary or binary. Unary constraints come from individual
merits of a node, while binary constraints relate a pair of nodes. We can further
subclassify unary and binary constraints according to whether the constraint
is absolute or compromisable:
1. Unary Constraints:

(a) Unary Absolute Constraint: a unary constraint that has to be satisfied.

(b) Unary Excitatory or Inhibitory Constraint: a unary constraint that
represents how good or how bad a node is in a certain aspect. It can be
positive (excitatory) or negative (inhibitory).

2. Binary Constraints:

(a) Binary Mutually Exclusive Constraint: two nodes are mutually exclusive if
at most one of them can be selected at the same time.

(b) Binary Excitatory or Inhibitory Constraint: a binary constraint that
represents whether two nodes support (excitatory) or desupport (inhibitory)
each other. It is positive if excitatory, negative if inhibitory.
Such a problem can be formulated as an optimization problem: find {V_i} such
that the cost function

E(V) = -Σ_i I_i V_i - (1/2) Σ_ij T_ij V_i V_j

is minimized with respect to {V_i}, where I_i is the total individual merit
of the node N_i, and T_ij is the total binary constraint between the values
of the nodes N_i and N_j.
Since E is a quadratic function with respect to {V_i}, it is in general
strictly convex, i.e., there generally exists a unique extremal solution for
{V_i}. The optimal solution can be obtained by setting, for all i,

∂E/∂V_i = 0,

which returns a linear system of equations to solve for the optimal solution
of {V_i}.
A typical application is surface reconstruction from some sparse depth
estimates, in which the constraints are to have the output surface conform as
closely as possible to the input depth estimates and be as smooth as possible.
However, such a system will return values {V_i} of any magnitude, depending
upon the unary and binary constraints among the nodes. If the problem is to
achieve binary decisions among the nodes, say each V_i has to be either 0 or
1, the idea is inapplicable. It is the contribution of Hopfield and Tank
[54, 53] to introduce a variable u_i for each node and a sigmoid function
g(u) such that for all i,

V_i = g(u_i)

to constrain V_i between the two specified values 0 and 1. However, the
addition of the nonlinear function g(u) renders the cost function no longer
quadratic, and a linear system of equations for the optimal solution of
{V_i} is not available. To get to the optimal solution from a starting point,
we can use the gradient descent method to update {V_i} under the time
function dV_i/dt = -∂E/∂V_i, so that dE/dt = Σ_i (∂E/∂V_i)(dV_i/dt) =
-Σ_i (∂E/∂V_i)² ≤ 0 for all t, i.e., the cost function E is guaranteed to be
lowered at each step. The nodes are then iterated to equilibrium using the
following dynamics: for all i,

du_i/dt = -u_i/τ + I_i + Σ_j T_ij V_j,
V_i = g(u_i).

Note that g(u) (typically g(u) = (1/2)(1 + tanh(u/u_0))) is sigmoid, so that
V_i is close to 0 or 1 at equilibrium, and monotonically increasing, so that
du_i/dt and dV_i/dt always have the same sign.
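The dynamics above can be integrated with simple Euler steps. The step size, the gain u_0, and the toy two-node problem in the test are our choices for illustration, not the thesis's settings.

```python
import math

def hopfield_relax(I, T, steps=2000, dt=0.01, tau=1.0, u0=0.02):
    """Euler-integrate the Hopfield/Tank dynamics of appendix A:
      du_i/dt = -u_i/tau + I_i + sum_j T_ij V_j,   V_i = g(u_i),
    with the sigmoid g(u) = (1 + tanh(u/u0)) / 2."""
    n = len(I)
    u = [0.0] * n
    V = [0.5] * n          # g(0) = 0.5: start undecided
    for _ in range(steps):
        du = [-u[i] / tau + I[i] + sum(T[i][j] * V[j] for j in range(n))
              for i in range(n)]
        u = [u[i] + dt * du[i] for i in range(n)]
        V = [(1 + math.tanh(x / u0)) / 2 for x in u]
    return V
```

With two mutually inhibitory nodes (T_01 = T_10 = -2) and merits 1.0 and 0.2, the network settles with the stronger node near 1 and the weaker near 0.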
Since all the Hopfield network does is make the best compromise among all the
"compromisable" constraints, it is not suitable for constrained optimization
problems where absolute constraints among the nodes, such as some nodes being
mutually exclusive with one another, are present. This is exemplified by
experiments done by others [112] on the traveling salesman problem. However,
in our problem there are many instances in which mutually exclusive
constraints have to be enforced among some entities. We have tried the
winner-takes-all method, in the sense that during each cycle of iteration,
the nodes are first sorted in descending order of their values of
-∂E/∂V_i = I_i + Σ_j T_ij V_j, and the nodes are picked in that order one by
one to have their V values set to 1, while the nodes mutually exclusive with
them have their V values reset to 0. In this way we can guarantee that the
mutually exclusive constraints among the nodes are always satisfied. The
process is carried out until equilibrium is obtained or the total number of
iterations exceeds a certain limit. This is in fact similar to a tree search
among the nodes with the evaluation function of each node N_i being
-∂E/∂V_i, in which the goal is to get to the state where all nodes set to 1
are not mutually exclusive with one another, and the cost function E(V) is
minimized.
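The winner-takes-all cycle can be sketched as follows. We take the text's description literally (every node not zeroed by an earlier selection gets set to 1); the function name and the iteration cap are ours.

```python
def winner_takes_all(I, T, exclusive, iters=10):
    """Each cycle sorts nodes by -dE/dV_i = I_i + sum_j T_ij V_j and sets
    them to 1 in that order, zeroing the nodes mutually exclusive with each
    selection.  `exclusive` is a 0/1 matrix with zero diagonal."""
    n = len(I)
    V = [0] * n
    for _ in range(iters):
        merit = [I[i] + sum(T[i][j] * V[j] for j in range(n)) for i in range(n)]
        newV = [None] * n
        for i in sorted(range(n), key=lambda k: -merit[k]):
            if newV[i] is None:           # not already excluded by a winner
                newV[i] = 1
                for j in range(n):
                    if exclusive[i][j]:
                        newV[j] = 0
        if newV == V:                     # equilibrium reached
            break
        V = newV
    return V
```

Unlike the pure Hopfield dynamics, every returned assignment satisfies the mutual-exclusion constraints by construction.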
We have also tried adding Lagrange multipliers [51] to the relaxation network
to enforce mutually exclusive constraints. We add one Lagrange multiplier
λ_k for each mutually exclusive constraint L_k(V) = 0 (e.g., L_k(V) can be
(V_2 + V_5 + V_9 - 1)² to enforce mutual exclusion among the nodes N_2, N_5,
and N_9) to the relaxation network, and iterate {V_i} and {λ_k} to minimize
the new cost function

E_new(V) = E(V) + Σ_k λ_k L_k(V)

using the updating rules

du_i/dt = -u_i/τ - ∂E_new/∂V_i = -u_i/τ - ∂E/∂V_i - Σ_k λ_k ∂L_k(V)/∂V_i,
V_i = g(u_i),
dλ_k/dt = ∂E_new/∂λ_k = L_k(V).

The idea is that when L_k(V) gets big, so will λ_k, which will in turn hold
back the values of all the nodes involved in L_k(V) in the appropriate
directions. Moreover, at equilibrium dλ_k/dt = 0, which implies L_k(V) = 0
for all k, i.e., all mutually exclusive constraints are satisfied.
Both methods give satisfactory results, but the winner-takes-all method gets
to the solution much faster. We attribute this to the hierarchical nature of
our system, which reduces most of the ambiguities in the relaxations and
renders even simple methods such as winner-takes-all adequate.
Appendix B

Preservation of Coplanarity from 3-D Space
to Disparity Space
The problem is to find out whether coplanarity in 3-D is preserved in the
disparity space under perspective projection into stereo images. We assume
a pinhole camera projection model, and that the image planes of the stereo
images are coplanar, i.e., a parallel-axis epipolar geometry.
Taking the focal point of the left camera as the origin of the world coordinate
system, for any point (x, y, z) in 3-D space, we have

    u = (fx)/z,
    v = (fy)/z,
    D = (fB)/z

where (u, v) is its image coordinates on the left view, D is the disparity, f is
the common focal length of both cameras, and B is the baseline width.
Suppose a set of points {(x_i, y_i, z_i)} in space are all coplanar, and suppose
the plane that contains all the points is ax + by + cz + d = 0 for some a, b, c, d.
Then for all i,

    a x_i + b y_i + c z_i + d = 0.
It can be shown that for all i,

    a u_i + b v_i + (d/B) D_i + cf = 0,

i.e., all the points {(u_i, v_i, D_i)} in the disparity space are also coplanar and
they lie on the plane ax + by + (d/B)z + cf = 0 in the disparity space.
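This can be verified numerically. The sketch below samples points on a hypothetical plane with hypothetical camera parameters f and B, and checks that the corresponding disparity-space points satisfy a u + b v + (d/B) D + cf = 0:

```python
import numpy as np

# Hypothetical camera parameters and plane coefficients for the check.
f, B = 1.0, 0.2                      # focal length and baseline width
a, b, c, d = 1.0, 2.0, -0.5, 3.0    # plane ax + by + cz + d = 0

rng = np.random.default_rng(0)
xy = rng.uniform(1.0, 2.0, size=(5, 2))
x, y = xy[:, 0], xy[:, 1]
# Points on the plane: given x and y, z = -(a x + b y + d)/c.
z = -(a * x + b * y + d) / c         # all positive, i.e., in front of the camera

u = f * x / z                        # left-image coordinates
v = f * y / z
D = f * B / z                        # disparity

# Every disparity-space point (u, v, D) should satisfy a u + b v + (d/B) D + c f = 0.
residual = a * u + b * v + (d / B) * D + c * f
```

Substituting the projection equations shows why: a u + b v + (d/B) D + cf = (f/z)(ax + by + d) + cf = (f/z)(-cz) + cf = 0.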
Appendix C

A Measure of Coplanarity
Given a set of 3-D points {(x_i, y_i, z_i)}, the problem is to find the best plane
that contains all the points, or to compute a measure of how coplanar the
points are. It can be formulated as an optimization problem: find the plane

    ax + by + cz + 1 = 0

such that the cost function

    E(a, b, c) = Σ_i (a x_i + b y_i + c z_i + 1)^2

is minimized with respect to the plane parameters a, b, c.
E is a quadratic function with respect to a, b, c and is therefore in general
strictly convex, i.e., there generally exists a unique extremal solution for a, b,
c.
The optimal solution can be obtained by setting the partial derivatives of
E(a, b, c) with respect to a, b, c respectively to zero, which yields three
linear equations for solving the optimal a, b, c. The solution can be
outlined as:

    A p = k

where

        [ Σ_i x_i x_i   Σ_i x_i y_i   Σ_i x_i z_i ]
    A = [ Σ_i y_i x_i   Σ_i y_i y_i   Σ_i y_i z_i ]
        [ Σ_i z_i x_i   Σ_i z_i y_i   Σ_i z_i z_i ]

    p = (a, b, c)^T

        [ -Σ_i x_i ]
    k = [ -Σ_i y_i ]
        [ -Σ_i z_i ]
A coplanarity measure of the given 3-D points can also be obtained by
looking at the minimum value of E(a, b, c) at the optimal solution. A good
measure is the RMS error of the fit of all points to the optimal plane:

    sqrt( E(a, b, c)_optimal / (N (a^2 + b^2 + c^2)) )

where N is the number of points. The same idea can also be applied to check
for collinearity in 2-D.
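The normal equations and the RMS measure above can be implemented directly; a minimal NumPy sketch (function name and test data hypothetical):

```python
import numpy as np

def plane_fit_rms(points):
    """Fit ax + by + cz + 1 = 0 to the points by least squares and return
    the plane parameters (a, b, c) and the RMS coplanarity measure."""
    P = np.asarray(points, dtype=float)   # N x 3 array of (x_i, y_i, z_i)
    A = P.T @ P                           # 3x3 matrix of sums of products
    k = -P.sum(axis=0)                    # right-hand side (-sum x, -sum y, -sum z)
    p = np.linalg.solve(A, k)             # optimal (a, b, c)
    E = np.sum((P @ p + 1.0) ** 2)        # minimized cost E(a, b, c)
    rms = np.sqrt(E / (len(P) * (p @ p))) # RMS point-to-plane distance
    return p, rms

# Four points on the plane x + y + z - 2 = 0, i.e. -x/2 - y/2 - z/2 + 1 = 0.
pts = [(2, 0, 0), (0, 2, 0), (0, 0, 2), (1, 1, 0)]
p, rms = plane_fit_rms(pts)
```

For exactly coplanar input points the RMS measure is (numerically) zero; the larger it is, the less coplanar the point set.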
References
[1] N. Ahuja and M. Tuceryan. Extraction of Early Perceptual Structure in
Dot Patterns: Integrating Region, Boundary, and Component Gestalt.
Computer Vision, Graphics, and Image Processing, 48:304-356, 1989.
[2] N. Ayache. A model-based vision system to identify and locate partially-
visible industrial parts. In Proceedings of the Conference on Computer
Vision and Pattern Recognition, pages 492-494, Washington, DC, 1983.
[3] N. Ayache and B. Faverjon. Fast stereo matching of edge segments
using prediction and verification of hypotheses. In Proceedings of the
Conference on Computer Vision and Pattern Recognition, pages 662-
664, San Francisco, California, June 19-23 1985.
[4] N. Ayache and B. Faverjon. A fast stereovision matcher based on prediction
and recursive verification of hypothesis. In Proceedings of the IEEE
Workshop on Computer Vision: Representation and Control, pages 27-
37, Bellaire, Michigan, October 1985.
[5] N. Ayache and F. Lustman. Fast and reliable passive trinocular stereovision.
In Proceedings of the IEEE International Conference on Computer
Vision, pages 422-427, London, England, June 1987.
[6] H. H. Baker. Depth from edge and intensity based stereo. Technical
Report AIM-347 and STAN-CS-82-930, Stanford University, Computer
Science Department, Stanford, California, September 1982. Based on
the author’s thesis (Ph.D. - Illinois).
[7] H. H. Baker, T. O. Binford, J. Malik, and J. Meller. Progress in stereo
mapping. In Proceedings of the DARPA Image Understanding Workshop,
pages 327-335, Arlington, Virginia, June 23 1983.
[8] D. H. Ballard. Generalizing the Hough Transform to detect arbitrary
shapes. Pattern Recognition, 13(2):111-122, 1981.
[9] D. H. Ballard and C. Brown. Computer Vision. Prentice Hall, 1982.
[10] S. Barnard and M. Fischler. Computational stereo. ACM Computing
Surveys, 14(4):553-572, December 1982.
[11] S. T. Barnard and W. B. Thompson. Disparity analysis of images. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2(4):333-340,
July 1980.
[12] H. G. Barrow and J. M. Tenenbaum. Interpreting line drawings as
three-dimensional surfaces. Artificial Intelligence, 17:75-116, 1981.
[13] P. J. Besl and R. C. Jain. Three-Dimensional Object Recognition. ACM
Computing Surveys, 17(1):75-145, 1985.
[14] B. Bhanu. Representation and shape matching of 3-D objects. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 6(3):340-350,
1984.
[15] I. Biederman. Recognition by components: a theory of human image
understanding. Psychological Review, 94(2):115-147, 1987.
[16] T. O. Binford. Visual perception by computer. In IEEE Conference on
Systems and Controls, Miami, Florida, December 1971.
[17] A. Blake and A. Zisserman. Visual Reconstruction. Artificial Intelligence.
MIT Press, Cambridge, Massachusetts, 1987.
[18] H. Blum. Biological shape and visual science (part 1). Journal of
Theoretical Biology, 38:205-287, 1973.
[19] R. C. Bolles and R. A. Cain. Recognizing and locating partially visible
objects: The local-feature-focus method. International Journal of
Robotics Research, 1(3):637-643, 1982.
[20] T. E. Boult and L.-H. Chen. Synergistic smooth surface stereo. In
Proceedings of the IEEE International Conference on Computer Vision,
pages 118-122, Tampa, Florida, December 1988.
[21] T. E. Boult and A. D. Gross. Recovery of superquadrics from depth
information. In Proceedings of the Workshop on Spatial Reasoning and
Multi-Sensor Fusion, pages 128-137, Chicago, Illinois, October 1987.
[22] T. E. Boult and A. D. Gross. On the recovery of superellipsoids. In
Proceedings of the DARPA Image Understanding Workshop, pages 1052-
1063, 1988.
[23] J. M. Brady and H. Asada. Smoothed local symmetries and their
implementation. International Journal of Robotics Research, 3(3):36-61,
Fall 1984.
[24] M. Brady. Computational approaches to image understanding. ACM
Computing Surveys, 14:3-71, 1982.
[25] R. A. Brooks. Model-based three-dimensional interpretations of two-
dimensional images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 5(2):140-150, 1983.
[26] J. F. Canny. A computational approach to edge detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698,
November 1986.
[27] C.-K. R. Chung and R. Nevatia. Use of monocular groupings and
occlusion analysis in a hierarchical stereo system. In Proceedings of the
Conference on Computer Vision and Pattern Recognition, pages 50-56,
Maui, Hawaii, June 1991.
[28] C.-K. R. Chung and R. Nevatia. Recovering Building Structures from
Stereo. In IEEE Workshop on Applications of Computer Vision, Palm
Springs, California, November 1992. To appear.
[29] C.-K. R. Chung and R. Nevatia. Recovering LSHGCs and SHGCs from
stereo. In Proceedings of the DARPA Image Understanding Workshop,
pages 401-407, San Diego, CA, January 1992.
[30] C.-K. R. Chung and R. Nevatia. Recovering LSHGCs and SHGCs from
stereo. In Proceedings of the Conference on Computer Vision and Pattern
Recognition, pages 42-48, Champaign, Illinois, June 1992.
[31] M. B. Clowes. On seeing things. Artificial Intelligence, 2(1):79-116,
1971.
[32] S. D. Cochran and G. Medioni. 3-D Surface Description from Binocular
Stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence,
14(10):981-994, October 1992.
[33] P. Dev. Perception of depth surfaces in random-dot stereograms: A
neural model. International Journal of Man-Machine Studies, 7:511-528,
1975.
[34] M. Dhome, M. Richetin, J.-T. Lapreste, and G. Rives. Determination of
the attitude of 3-D objects from a single perspective view. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 11(12):1265-1278,
December 1989.
[35] U. R. Dhond and J. K. Aggarwal. Structure from stereo: A review.
IEEE Transactions on Systems, Man and Cybernetics, 19(6):1489-1510,
November/December 1989.
[36] M. Drumheller and T. Poggio. On parallel stereo. In Proceedings of the
IEEE Conference on Robotics and Automation, pages 1439-1448, San
Francisco, California, April 1986.
[37] S. A. Dudani, K. J. Breeding, and R. B. McGhee. Aircraft identification
by moment invariants. IEEE Transactions on Computers, 26(1):39-46,
1977.
[38] D. Eggert and K. Bowyer. Computing the orthographic projection
aspect graph of solids of revolution. In Proceedings of the Workshop on
Interpretation of 3D Scenes, pages 102-108, Austin, Texas, November 27-
29 1989.
[39] T.-J. Fan, G. Medioni, and R. Nevatia. Segmented descriptions of 3-D
surfaces. IEEE Journal of Robotics and Automation, RA-3(6):527-538,
December 1987.
[40] O. D. Faugeras and M. Hebert. The representation, recognition, and
locating of 3-D objects. International Journal of Robotics Research,
5(3):27-52, 1986.
[41] F.P. Ferrie and M.D. Levine. Integrating information from multiple
views. In Proceedings of the IEEE Workshop on Computer Vision, pages
117-122, December 1987.
[42] P. Fua and A.J. Hanson. Objective functions for feature discrimination:
Applications to semiautomated and automated feature extraction. In
Proceedings of the DARPA Image Understanding Workshop, pages 676-
694, May 1989.
[43] S. Ganapathy. Reconstruction of scenes containing polyhedra from stereo
pair of views. PhD thesis, Stanford University, Stanford, California,
1976.
[44] Z. Gigus, J. Canny, and R. Seidel. Efficiently computing and representing
aspect graphs of polyhedral objects. In Proceedings of the IEEE
International Conference on Computer Vision, pages 30-39, Tampa,
Florida, December 1988.
[45] W. E. L. Grimson. From Images to Surfaces: A Computational Study of
the Human Early Visual System. MIT Press, Cambridge, Massachusetts,
1981.
[46] M. Hannah. Computer Matching of Areas in Stereo Images. PhD thesis,
Stanford University Computer Science Department, Stanford, California,
July 1974. Technical Report STAN-CS-74-438.
[47] M. Hannah. Bootstrap stereo. In Proceedings of the DARPA Image
Understanding Workshop, pages 201-208, College Park, Maryland, April
1980.
[48] M. Hannah. SRI’s baseline stereo system. In Proceedings of the DARPA
Image Understanding Workshop, pages 149-155, Miami Beach, Florida,
December 1985.
[49] M. Hebert and T. Kanade. The 3D profile method for object recognition.
In Proceedings of the Conference on Computer Vision and Pattern
Recognition, San Francisco, California, June 1985.
[50] M. Herman and T. Kanade. The 3D MOSAIC scene understanding
system: Incremental reconstruction of 3D scenes from complex images.
Technical Report CMU-CS-84-102, Carnegie-Mellon University, Pittsburgh,
PA, February 1984.
[51] M. Hestenes. Optimization Theory. Wiley & Sons, New York, 1975.
[52] W. Hoff and N. Ahuja. Surfaces from stereo: Integrating feature matching,
disparity estimation, and contour detection. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 11(2):121-136, February
1989.
[53] J. J. Hopfield. Neurons with graded response have collective computational
properties like those of two-state neurons. Proceedings of the National
Academy of Sciences, USA, 81:3088-3092, May 1984.
[54] J. J. Hopfield. Neural networks and physical systems with emergent
collective computational abilities. Proceedings of the National Academy
of Sciences, USA, 79:2554-2558, April 1982.
[55] R. Horaud and M. Brady. On the geometric interpretation of image
contours. Artificial Intelligence, 37:333-353, 1988.
[56] R. Horaud and T. Skordas. Stereo correspondence through feature
grouping and maximal cliques. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 11(11):1168-1180, November 1989.
[57] B. K. P. Horn. Extended Gaussian Images. Technical Report AI Memo
740, Massachusetts Institute of Technology, July 1983.
[58] B. K. P. Horn and K. Ikeuchi. The mechanical manipulation of randomly
oriented parts. Scientific American, 251(2):100-111, August 1984.
[59] Y. C. Hsieh, D. M. McKeown, and F. P. Perlant. Performance evaluation
of scene representation and stereo matching for cartographic feature
extraction. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 14(2):214-238, February 1992.
[60] A. Huertas and G. Medioni. Detection of intensity changes with subpixel
accuracy using Laplacian-of-Gaussian masks. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 8(5):651-664, September
1986.
[61] D. Huffman. Impossible objects as nonsense sentences. In B. Meltzer and
D. Michie, editors, Machine Intelligence 6, pages 295-323. Edinburgh
University Press, Edinburgh, 1971.
[62] K. Ikeuchi. Recognition of 3-D objects using the Extended Gaussian
Image. In Proceedings of the International Joint Conference on Artificial
Intelligence, pages 595-600, August 1981.
[63] K. Ikeuchi. Precompiling a geometrical model into an interpretation for
object recognition in bin-picking tasks. In Proceedings of the DARPA
Image Understanding Workshop, pages 321-339, Los Angeles, California,
February 1987. Morgan Kaufmann Publishers, Inc.
[64] C. L. Jackins and S. L. Tanimoto. Oct-trees and their use in representing
three-dimensional objects. Computer Graphics and Image Processing,
14(4):249-270, November 1980.
[65] B. Julesz. Foundations of Cyclopean Perception. The University of
Chicago Press, Chicago, 1971.
[66] T. Kanade. Recovery of the three-dimensional shape of an object from
a single view. Artificial Intelligence, 17:409-460, 1981.
[67] S. Liebes. Geometric constraints for interpreting images of common
structural elements: Orthogonal trihedral vertices. In Proceedings of the
DARPA Image Understanding Workshop, April 1981.
[68] H. S. Lim and T. O. Binford. Structural correspondence in stereo vision.
In Proceedings of the DARPA Image Understanding Workshop, pages
794-808, Los Angeles, California, 1987. Morgan Kaufmann Publishers,
Inc.
[69] H. S. Lim and T. O. Binford. Curved surface reconstruction using stereo
correspondence. In Proceedings of the DARPA Image Understanding
Workshop, pages 809-819, Cambridge, Massachusetts, April 1988.
[70] A. Mackworth. Interpreting pictures of polyhedral scenes. Artificial
Intelligence, 4:121-137, 1973.
[71] P. MacVicar-Whelan and T. Binford. Curve finding with subpixel
accuracy. In Proceedings of the DARPA Image Understanding Workshop,
April 1981.
[72] S. B. Marapane and M. M. Trivedi. Region-based stereo analysis for
robotic applications. IEEE Transactions on Systems, Man & Cybernetics,
19(6):1447-1464, November/December 1989.
[73] D. Marr. Vision. W. H. Freeman and Company, 1982.
[74] D. Marr and T. Poggio. Cooperative computation of stereo disparity.
Science, 194(4262):283-287, October 15 1976.
[75] D. Marr and T. Poggio. A computational theory of human stereo vision.
Proceedings of the Royal Society of London, B(204):301-328, 1979.
[76] J. E. W. Mayhew. Stereopsis. In O. Braddick and A. Sleigh, editors,
Physical and Biological Processing of Images, pages 204-216. Springer-
Verlag, New York, NY, September 27-29 1982. From the Proceedings
of an International Symposium Organized by the Rank Prize Funds,
London, England.
[77] J. E. W. Mayhew and J. P. Frisby. Psychophysical and computational
studies towards a theory of human stereopsis. Artificial Intelligence,
pages 349-385, 1981.
[78] G. Medioni and R. Nevatia. Segment-based stereo matching. Computer
Graphics and Image Processing, 31(1):2-18, July 1985.
[79] V. Milenkovic and T. Kanade. Trinocular vision using photometric and
edge orientation constraints. In Proceedings of the DARPA Image
Understanding Workshop, pages 163-175, Miami Beach, Florida, December
1985.
[80] R. Mohan. Perceptual Organization for Computer Vision. PhD thesis,
University of Southern California, August 1989. IRIS Technical Report
254.
[81] R. Mohan, G. Medioni, and R. Nevatia. Stereo error detection, correction,
and evaluation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 11(2):113-120, February 1989.
[82] R. Mohan and R. Nevatia. Using Perceptual Organization to Extract
3-D Structures. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 11(11):1121-1139, November 1989.
[83] R. Mohan and R. Nevatia. Perceptual Organization for Scene Segmentation
and Description. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 14(6):616-635, June 1992.
[84] H. P. Moravec. Obstacle Avoidance and Navigation in the Real World
by a Seeing Robot Rover. PhD thesis, Stanford University, Stanford,
California, September 1980. Technical Report AIM-340 and STAN-CS-
80-813.
[85] K. Mori, M. Kidode, and H. Asada. An iterative prediction and correction
method for automatic stereocomparison. Computer Graphics and
Image Processing, 2:393-401, 1973.
[86] L. R. Nackman and S. M. Pizer. Three-dimensional shape description
using the symmetric axis transform 1: Theory. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 7(2):187-201, March 1985.
[87] V. Nalwa. Line-drawing interpretation: A mathematical framework. In
Proceedings of the Conference on Computer Vision and Pattern
Recognition, pages 18-31, 1988.
[88] J. I. Nelson. Globality and stereoscopic fusion in binocular vision. J.
Theor. Biol., 49:1-88, 1975.
[89] R. Nevatia. Depth measurement by motion stereo. Computer Graphics
and Image Processing, 5:203-214, 1976.
[90] R. Nevatia and K. R. Babu. Linear feature extraction and description.
Computer Graphics and Image Processing, 13(3):257-269, July 1980.
[91] R. Nevatia and T. 0 . Binford. Description and recognition of complex-
curved objects. Artificial Intelligence, 8:77-98, 1977.
[92] Y. Ohta and T. Kanade. Stereo by two-level dynamic programming.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
7(2):139-154, April 1985.
[93] M. Oshima and Y. Shirai. A scene description method using three-
dimensional information. Pattern Recognition, pages 9-17, 1979.
[94] D. Panton, C. Grosch, D. DeGryse, J. Ozils, A. LaBonte, S. Kaufmann,
and L. Kirvida. Geometric reference studies. Final Technical Report
RADC-TR-81-182, December 1981. Volume 44, Number 12.
[95] A. P. Pentland. Recognition by parts. In Proceedings of the IEEE
International Conference on Computer Vision, pages 612-620, June 1987.
[96] J. Ponce, D. Chelberg, and W. B. Mann. Invariant properties of straight
homogeneous generalized cylinders and their contours. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 11(9):951-966,
September 1989.
[97] K. Prazdny. On the coarse-to-fine strategy in stereomatching. Bulletin
of the Psychonomic Society, 25:92-94, 1987.
[98] K. Price. Hierarchical matching using relaxation. Computer Vision,
Graphics, and Image Processing, 34(1):66-75, April 1986.
[99] K. Rao. Shape Description from Sparse and Imperfect Data. PhD thesis,
University of Southern California, December 1988. IRIS Technical
Report 250.
[100] K. Rao and R. Nevatia. Computing volume descriptions from sparse 3-D
data. International Journal of Computer Vision, 2(1):33-50, June 1987.
[101] M. Richetin, M. Dhome, J. T. Lapreste, and G. Rives. Inverse perspective
transform using zero-curvature contour points: application to
the localization of some generalized cylinders from a single view. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 13(2):185-
191, February 1991.
[102] P. Saint-Marc and G. Medioni. Adaptive smoothing for feature extraction.
In Proceedings of the DARPA Image Understanding Workshop,
pages 1100-1113, Boston, Massachusetts, April 1988. Morgan Kaufmann
Publishers, Inc.
[103] S. A. Shafer and T. Kanade. The theory of straight homogeneous
generalized cylinders. Technical Report CMU-CS-83-105, Carnegie-Mellon
University, 1983.
[104] S. S. Sinha and B. G. Schunck. Discontinuity preserving surface
reconstruction. In Proceedings of the Conference on Computer Vision and
Pattern Recognition, pages 229-234, San Diego, California, June 1989.
[105] K. A. Stevens. The visual interpretations of surface contours. Artificial
Intelligence, 17:47-73, 1981.
[106] D. Terzopoulos. Regularization of inverse visual problems involving
discontinuities. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 8:413-424, 1986.
[107] D. Terzopoulos. The computation of visible-surface representations.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
10(4):417-438, July 1988.
[108] S. Tsuji, J. Zheng, and M. Asada. Stereo vision of a mobile robot: World
constraints for image matching and interpretation. In Proceedings of the
IEEE Conference on Robotics and Automation, volume 3, pages 1594-
1599, San Francisco, California, April 7-10 1986.
[109] F. Ulupinar and R. Nevatia. Inferring shape from contour for curved
surfaces. In Proceedings of the International Conference on Pattern
Recognition, volume 1, pages 147-154, Atlantic City, New Jersey, June 1990.
[110] F. Ulupinar and R. Nevatia. Recovering shape from contour for SHGCs
and CGCs. In Proceedings of the DARPA Image Understanding Workshop,
pages 544-556, Pittsburgh, Pennsylvania, September 1990.
[111] F. Ulupinar and R. Nevatia. Constraints for interpretation of perspective
images. Computer Vision, Graphics, and Image Processing, 53(1):88-96,
January 1991.
[112] G. V. Wilson and G. S. Pawley. On the stability of the travelling salesman
problem algorithm of Hopfield and Tank. Biological Cybernetics, 58:63-70,
1988.
[113] G. Xu and S. Tsuji. Inferring surfaces from boundaries. In Proceedings of
the IEEE International Conference on Computer Vision, pages 716-720,
London, England, 1987.
[114] Y. T. Zhou and R. Chellappa. Stereo matching using a neural network.
In Proceedings of the International Conference on Acoustics, Speech, and
Signal Processing, pages 940-943, New York, N.Y., April 11-14 1988.
IEEE.
[115] S. W. Zucker. Computational and psychophysical experiments in grouping:
Early orientation selection. In J. Beck, B. Hope, and A. Rosenfeld,
editors, Human and Machine Vision, pages 545-567. Academic Press,
New York, NY, 1983.
[116] S.W. Zucker. The diversity of perceptual grouping. In M.A. Arbib
and A.R. Hanson, editors, Vision, Brain, and Cooperative Computation,
pages 231-262. M.I.T. Press, Cambridge, Massachusetts, 1987.