3-D Building Detection And Description From Multiple
Intensity Images Using Hierarchical Grouping And Matching
of Features
by
Sanjay Noronha
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
Doctor of Philosophy
(Computer Science)
May 2013
Copyright 2013 Sanjay Noronha
Acknowledgements
For my Dad who did not have the opportunity to go to college, and my
Mom, who though no longer here, is a lifelong inspiration: this thesis is dedicated
to you.
I thank my advisor Ram Nevatia for his friendship, insight, help and advice
on matters academic and non-academic. Thanks to Keith Price and Gerard
Medioni for numerous discussions and a range of matters that broadened my
horizons. Special thanks to Andres Huertas for his friendship and help. To all my
friends at IRIS including Parag Havaldar, Mourad Zerroug, Gideon Guy, Hillel
Rom, Misuen Lee, Andy Lin, Chia-Wei Liao, Alexandre Francois and others:
thank you. Special thanks to Gaurav Sukhatme for his friendship and even-keeled
advice.
Thanks are due in no small measure to my wife, Komal, for pushing me to
make this dream of mine a reality. My kids Aarav, Asheeka and Aadit provided
the unspoken motivation to make it to the finish line.
Table of Contents

Acknowledgements
Abstract
Chapter 1 Introduction
    1.1 Issues in Object Detection and Description from intensity images
    1.2 Description of the fundamental approach
    1.3 Contributions of this Dissertation
    1.4 Outline of the thesis
Chapter 2 Previous Work
    2.1 Perception of 3-D Objects from 2-D Data
        2.1.1 Methods Using Specific Models
        2.1.2 Methods Using Generic Models
        2.1.3 Perceptual Grouping
    2.2 Building detection and description from Intensity Images
        2.2.1 Detection and description from monocular images
        2.2.2 Detection and description from multiple images
Chapter 3 Overview of an application of the fundamental approach
    3.1 Basic Assumptions
    3.2 Basic Approach
    3.3 Distinctive Features of the System
        3.3.1 Use of a generic model
        3.3.2 Use of domain knowledge
        3.3.3 Use of hierarchical perceptual grouping and matching
        3.3.4 Use of geometric and projective properties
        3.3.5 Non-preferential use of multiple views
        3.3.6 Use of wall and shadow information
    3.4 Geometric and Projective Properties
        3.4.1 Use of camera models for epipolar constraints
        3.4.2 Projection of 3D Rectangular Roofs
        3.4.3 Shadow Casting Sides
        3.4.4 Visible Walls
Chapter 4 Generation and Selection of Building Hypotheses
    4.1 Features in the hierarchy
    4.2 Formation of flat-roof rectangular hypotheses
        4.2.1 Generation of hypotheses from multiple images
        4.2.2 Generation of hypotheses from each image independently
        4.2.3 Complexity of the hypotheses generation process
    4.3 Selection of flat-roof rectangular hypotheses
        4.3.1 Accumulating Selection Evidence
        4.3.2 The selection mechanism
        4.3.3 Complexity of the hypothesis selection process
    4.4 Formation of gable-roof rectangular hypotheses
        4.4.1 Generation of symmetric gable-roof hypotheses
        4.4.2 Complexity of the hypotheses generation process
    4.5 Selection of gable-roof hypotheses
        4.5.1 Selection evidence accumulated
        4.5.2 The selection mechanism
        4.5.3 Complexity of the hypothesis selection process
Chapter 5 Verification of Hypotheses
    5.1 Overview of the verification process for flat-roof hypotheses
        5.1.1 Roof Evidence
        5.1.2 Wall Evidence
        5.1.3 Shadow Evidence
        5.1.4 Combination of Roof, Wall and Shadow Evidence
        5.1.5 Complexity of the hypothesis verification process
    5.2 Overview of the verification process for gable-roof hypotheses
        5.2.1 Roof Evidence
        5.2.2 Wall Evidence
        5.2.3 Shadow Evidence
        5.2.4 Combination of Roof, Wall and Shadow Evidence
    5.3 Overlap Analysis
    5.4 3-D Description of the Scene
Chapter 6 Results and Performance Analysis
    6.1 Analysis of results of the automatic system
        6.1.1 Results on Fort Hood, Texas
        6.1.2 Results for Fort Benning, Georgia
    6.2 Evaluation of the automatic system
        6.2.1 Run-time Evaluation
        6.2.2 Detection Evaluation
        6.2.3 Effect of Multiple Views
Chapter 7 Assisted model construction
    7.1 Real-time hypothesis generation based on user input
        7.1.1 Generation of flat-roof rectangular hypotheses from a single view
        7.1.2 Generation of gable-roof rectangular hypotheses
    7.2 Real-time Assisted Results
    7.3 Conclusion
Chapter 8 Conclusions and Future Research
    8.1 Summary
    8.2 Future Research
        8.2.1 Building Extraction in more complex environments
        8.2.2 Detection and Description of Non-Rectilinear Shaped Objects
    8.3 Conclusion
Appendix A Best Matches for Sets of Lines Across Views
    A.1 Best Matches for a Parallelogram Across Views
    A.2 Best Matches for a Triple Across Views
Bibliography
Abstract

A method for detection and description of rectangular buildings with flat and with gable roofs from two or more registered aerial intensity images is proposed. The output is a 3D description of the buildings, with an associated confidence measure for each building. Hierarchical perceptual grouping and matching across views aids building hypothesis formation and increases the robustness of the system.

The system is exclusively feature-based. Perceptual grouping in each view and matching across views is performed in a hierarchical manner, utilizing primitives of increasing complexity, starting with line segments and junctions, and proceeding to higher level features, namely parallels, U-contours and parallelograms (as 3D rectangles appear as parallelograms when projected under the conditions employed in aerial imagery). Binocular and trinocular epipolar constraints are used to reduce the search space for matching features. A selection mechanism retains hypotheses with sufficient roof evidence to warrant further examination. A verification procedure employs roof, wall and shadow evidence to decide whether a hypothesis should be considered a building or not. Overlapping verified hypotheses are disambiguated depending on the type of overlap (partial overlap or containment) and the 3D heights of the hypotheses.

A user-assisted mode enables building model construction utilizing user inputs. By using the (computed) feature hierarchy in the model construction process, this mode reduces the number of user inputs required to construct building models, when compared with generic 3D modeling tools.
Chapter 1
Introduction
3D building detection from one or more intensity images has been a research topic in
Computer Vision for many years. This thesis presents a system that attempts to solve the
problem of building detection and description from two or more aerial intensity images.
This problem is a constrained version of the generic problem of 3D object recognition.
Though the following discussions in this chapter focus on the specific problem of building
detection and description, the issues apply to generic object recognition as well.
There are a number of challenges that must be met in order to solve the problem of
building detection and description. The first challenge is the formulation of the
representation and description of an object. In order to extract an object from one or more
images, a system must possess some model of the object(s) it is searching for. The work
presented here uses shape as the primary description of an object. The second challenge is
to separate the object (“figure”) from the background (“ground”). An object is perceived
when parts of the image are extracted, grouped, and matched with the description of the
object. This process of figure-ground separation may be done independently of the
description of the object, or, as is the case in this work, arise out of the perception of the
object, as a natural consequence. The third challenge arises from the loss of direct 3-D
information in the process of projection into an image. Reconstruction of a 3-D scene from
a single image is ambiguous. Some of this ambiguity may be removed by using two or more
views of the scene. Each of these challenges is elaborated upon in the following paragraphs.
• Choice of Model: Models may be specific, such as the model of a tooth, or generic,
such as a cone, which is capable of describing a class of objects. The system pre-
sented in this thesis utilizes the generic model of a rectangular block to model build-
ings. The use of a generic model gives the system the capability to detect and
describe a large number of objects (buildings). The flexibility and generality of the generic model, however, are traded off for increased search complexity and fewer applicable constraints to aid object detection, compared with a specific model.
• Figure-Ground Separation (Object Segmentation): Figure-ground separation is
a natural task in the object perception problem. Separation of figure and ground may
be an independent process, or may be a consequence of object recognition. As an
independent process, several techniques have been proposed that enable separation
of an object using one or more images. Some segmentation methods use image inten-
sities as a basis for segmenting regions. Objects are extracted by analyzing the re-
lationships among those regions. Other systems detect object boundaries by
grouping the intensity edges of the image. The criteria for grouping these elements
are broadly classified as perceptual grouping. The system proposed in this disserta-
tion performs simultaneous object perception and figure-ground separation. In other
words, the goal of the system is to coalesce the available elements (edges) based on
a generic model, and the detection of an instance of the model in the image auto-
matically classifies it as “figure”, labeling the remaining areas as “ground”. The ad-
vantage of this approach is that there is no intermediate object segmentation process
before the object recognition and description, which may have been a source of er-
ror.
• 3-D Reconstruction: Using a single image to reconstruct a 3-D scene is an ambig-
uous problem. The availability of multiple views provides additional constraints to
aid image interpretation. However, the presence of multiple images presents a basic
problem of finding correspondences in other images for each point in a given image.
This is referred to as the correspondence problem. A solution to this problem is pro-
posed in this thesis in order to effectively use the constraints arising from more than
one view of a scene.
The fundamental approach of this dissertation uses simple geometric models as the
component shape descriptors of objects. In particular, this approach is applied to the task
of perception of buildings, from two or more aerial images. The generic model used to
model buildings is a rectangular block, with either a flat top (roof) or a gable top. The
geometric invariant properties of these models are utilized by a perceptual grouping process
to generate hypotheses starting with extracted edges from the available views. Matching of
the hypotheses and their constituents is subject to the constraints arising from the use of multiple views. Good hypotheses with sufficient supporting evidence are selected, and then
verified in 3-D.
Promising results have been obtained on real imagery taken from Fort Hood, Texas
and Fort Benning, Georgia. Analysis of some results and several evaluations of the system
are provided.
The rest of this chapter is organized as follows: in Section 1.1 some of the issues of
object recognition and description from intensity images are described; the fundamental
approach of the system is introduced in Section 1.2; the contributions of this dissertation
are listed in Section 1.3; the outline of this dissertation is in Section 1.4.
1.1 Issues in Object Detection and Description from intensity images
There are a number of issues that need to be addressed in object detection and
description from intensity images. They are
• Fragmented low-level descriptions: As an example consider Figure 1.1, which de-
picts two images of an aerial scene. Figure 1.2 shows the edges detected in the two im-
ages. The fragmentation of the edges along the boundaries of the objects, and other
markings, is clearly visible. In the case of such outdoor scenes, segmenting the
scene into object and ground solely on the basis of low-level descriptions, such as extracted edges, is usually not feasible. Studies on indoor scenes with fairly high resolution [57] have demonstrated that obtaining useful view-independent de-
scriptions of common objects is non-trivial as well, for the same reason. Fragmen-
tation of low-level descriptions (such as edges extracted from an image) is a
problem common to all imagery. Compounding this difficulty is the presence of
background clutter. Apart from increasing the computational complexity of the pro-
cess (as an increased number of primitives need to be processed) there exists the
possibility of accidental configurations in the background appearing as objects.
• Occlusion of objects: When an object is occluded, it becomes necessary to describe
the occluded object based on a specific or a generic model. For example, in the case
of aerial images, part of a road may be occluded by a tall building from a certain
viewpoint. In such cases it may be necessary to make a hypothesis based on contex-
tual rather than actual evidence. There are several variations on the theme of occlusion, such as transparent objects, and multiple images where parts of objects visible
in some views may not be visible in others.
1.2 Description of the fundamental approach
A fundamental approach is proposed to achieve the goal of detecting 3-D objects
from multiple images. Figure 1.3 outlines the major steps involved in the approach. The
following paragraphs describe the motivation for each step and give a brief summary of each step
as it applies to the specific problem of this dissertation.
• Choice of model: A generic model is chosen to allow the system to handle a class
of objects. The rectangular block, which is the model used to describe buildings, is
general enough to describe a majority of the buildings found in typical suburban
scenes. At the same time it is a simple model that can be constructed from a small
number of primitives.
• Derivation of invariant properties: Invariant properties can have great utility in ob-
ject detection and description, when the properties exist. The most useful properties
for object detection and description are those that are invariant after projection, and
that can be computed easily. A simple example that is implicitly used in the system,
is that a straight line in 3-D projects to a straight line in any image.
• Generation of 3-D hypotheses: The hypothesize and verify paradigm is the basis of
the fundamental approach. The advantages of using this paradigm include a simple
division of tasks (arguably allowing easier implementation of each task than with-
out splitting the tasks), and the ability to better localize strengths and weaknesses in
the system. In this system, 3-D hypotheses are constructed from the edges in the im-
ages, using a strategy of grouping features in each view and matching them across
views. By grouping and matching detected (or constructed) features, the constraints
associated with single views and the constraints associated with matching across
views may be exploited.
• Selection of 3-D hypotheses: A selection process is applied to filter out 3-D hypoth-
eses that have weak supporting evidence. A good selection process greatly reduces
the number of unnecessary 3-D hypotheses.
Figure 1.1 Two views of an aerial scene
Figure 1.2 Edges extracted from the views in Figure 1.1
• Verification of 3-D hypotheses: 3-D hypotheses are verified using domain-specific
evidence. The strategy used in the fundamental approach is one of least commit-
ment. In other words decisions are enforced later rather than earlier. In practise this
allows the system to hypothesize more hypotheses than if decisions were made ear-
lier. This implies that the selection and verification procedures need to be more dis-
criminatory, as decisions to retain or eliminate hypotheses from contention are
made in these processes. The verification process is presented in Section 5.1 and
Section 5.2.
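To make this division of tasks concrete, a minimal Python sketch of the hypothesize-select-verify flow of Figure 1.3 follows. All names here (generate_hypotheses, select, verify, and the dictionary-based hypothesis records) are hypothetical stand-ins for the processes described in Chapters 4 and 5, not the system's actual implementation.

```python
def detect_edges(view):
    # Stand-in for a real edge detector; a view is assumed to be a dict
    # carrying its extracted line segments.
    return view.get("edges", [])

def generate_hypotheses(views):
    # Group features in each view and match them across views (Chapter 4);
    # here, tuples of matched edges simply become candidate hypotheses.
    per_view_edges = [detect_edges(v) for v in views]
    return [{"roof": match, "evidence": 1.0} for match in zip(*per_view_edges)]

def select(hypotheses, threshold=0.5):
    # Retain only hypotheses whose roof evidence warrants verification.
    return [h for h in hypotheses if h["evidence"] >= threshold]

def verify(hypotheses):
    # Confirm surviving hypotheses using roof, wall and shadow evidence
    # (Chapter 5); trivially accepting here.
    return [h for h in hypotheses if h["evidence"] > 0.0]

views = [{"edges": ["e1", "e2"]}, {"edges": ["f1", "f2"]}]
print(verify(select(generate_hypotheses(views))))
```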
1.3 Contributions of this Dissertation
The specific original contributions of this dissertation are:
• Design and development of a new perceptual grouping and matching process to
construct 3-D hypotheses from two or more images corresponding to the 2-D pro-
jection of a generic model. (Section 4.2 and Section 4.4)
Figure 1.3 Block diagram of the fundamental approach: choose geometric models, derive invariant properties, generate 3-D hypotheses, select 3-D hypotheses, and verify 3-D hypotheses.
• Development of a selection mechanism to differentiate “good” hypotheses from
“bad” ones. (Section 4.3 and Section 4.5)
• Development of a verification mechanism for 3-D building hypotheses by evaluat-
ing the 3-D evidence for each hypothesis and analyzing the 3-D spatial conflicts
among them. (Section 5.1 and Section 5.2)
• Development of a system to detect and describe buildings from two or more inten-
sity images with general viewpoints, and to demonstrate the applicability of image
understanding techniques to real world problems.
• Demonstration of a non-trivial extension (gable roofs) of the generic model within
the general framework for object description.
• Application of perceptual grouping techniques to interactive object detection and
description.
1.4 Outline of the thesis
In this chapter, the problem proposed to be solved in this dissertation is formulated,
defining, in the process, a clear goal, and outlining some of the issues and challenges. The
fundamental approach to solving the problem is briefly discussed here. In Chapter 2,
previous work related to this research, such as perceptual grouping, extraction of 3-D
objects from a 2-D image, and building extraction from single and multiple images is
studied. The fundamental approach is applied to solving the problem of building detection
and description, as a restricted instance of the general problem of object detection and
description. In Chapter 3, an overview of the specific approach used in the extraction of
buildings is introduced. The assumptions and some special features of the system, such as
the use of geometric and projective properties, are also presented in Chapter 3.
The details of all major processes, namely the hypothesis generation process, the
hypothesis selection process, and the hypothesis verification process, are covered in
Chapter 4 and Chapter 5. These chapters also contain complexity analyses of each of these processes.
The hypothesis generation and hypothesis selection processes that use roof evidence alone
are presented in Chapter 4. Perceptual grouping techniques are discussed in the section on
hypothesis generation. The evaluation function used in hypothesis selection is also described in Chapter 4. The verification process, which involves use of evidence from roofs, walls and shadows, is presented in Chapter 5, along with the evaluation functions for roof, wall and shadow evidence. The analysis of some results of the system, and an evaluation of its performance, are presented in Chapter 6. Chapter 7 discusses two distinct user-assisted methods that aid the user in constructing a model; their primary goal is to reduce the work (time) needed by the user to construct a 3-D model of the scene. This dissertation is concluded in Chapter 8 with ideas for future
research in this area.
Chapter 2
Previous Work
Various methods have been proposed to infer buildings in particular, and 3-D objects
in general, from 2-D data. Work using perceptual grouping techniques is discussed in some
detail as it is an important technique used by many building detection and generic object
recognition systems. With regard to choice of suitable model(s), most building detection
systems use simple generic models such as rectangular parallelepipeds. In generic 3D
object recognition some methods use specific models, while others use generic models.
Some methods use the techniques of shading [32] or texture [89] to recover the 3-D
information. However, a number of arguably unreasonable assumptions have to be made to
enable these techniques to be applicable and effective. There are many methods that
generate 3-D shape description from 3-D range data [87]; these methods are not directly
related to the problem addressed here.
In this chapter, work related to this dissertation is discussed. The chapter briefly
covers work on the generic problem of perception of 3-D objects from one or more intensity
images in Section 2.1. Section 2.2 details the work done in building detection.
2.1 Perception of 3-D Objects from 2-D Data
Perception of 3-D objects from intensity images is a hard problem. A number of
general techniques have been developed. These are detailed in the following subsections.
However, these techniques are either not applicable to building detection (such as
methods used on perfect contours, or those using specific models), or are too general to take
advantage of domain-specific knowledge, which would result in poor performance when
compared to similar systems designed to look specifically for buildings. Thus, while work
in generic 3-D object recognition provides many general insights into the problem of
building detection, it is not sufficient to solve the problem.
2.1.1 Methods Using Specific Models
It is hard to detect 3-D objects from the lines extracted from one or more real intensity
images, owing to fragmented low-level descriptions and background clutter. The use of
geometric models of the objects helps to solve the segmentation problem as well as the 3-D inference problem. These may be specific models or generic models. A specific model
represents a particular object, while a generic model represents a class of objects. This
section discusses methods using specific geometric models.
Roberts [78] uses a polyhedral-model matching technique to do the segmentation of
the scene, which consists of polyhedral objects with homogeneous surfaces and uniform
backgrounds. The ACRONYM system developed by Brooks uses specific models, coupled with a matching process, to recognize 3-D objects from a 2-D image. Lowe’s
SCERPO system [60] uses a specific-model-based matching technique to recognize objects
from a single 2-D image. An indexing technique is used to recognize objects in a database
of specific models handled by the system. Lamdan, Schwartz and Wolfson [51] suggest a
model based method using an indexing technique to recognize 3-D objects from 2-D
images. A different indexing method, called structural indexing, is used by Stein and
Medioni [87] to create an index of structural tokens for the recognition of 3-D objects from
2-D images. The systems that use specific models are limited to a pre-selected set of
objects. Generalizing these systems to handle larger sets of objects is non-trivial or not
possible without major re-architecture.
2.1.2 Methods Using Generic Models
Systems have been developed to detect classes of objects from single or multiple
intensity images using generic models. Usually, only one generic model is used by these
systems. This is the category that this dissertation falls under. Methods using a single
intensity image are described in the following paragraph, and those using more than one
image are described after that.
• Object detection from a single image:
Sato and Binford [82] use an SHGC (Straight Homogeneous Generalized Cylinder)
as the generic model to detect 3-D objects that can be modeled by SHGCs. However,
it has been applied to single object scenes and does not address the problem of
occlusion. More general work has been done by Zerroug and Nevatia [100][101] to
detect 3-D objects that may be modeled as different types of GCs (Generalized
Cylinders). They solve the problem of detecting complex objects and automatically
solve the segmentation problem. However, the system is not able to handle simple
objects, e.g. rectangular boxes, because the detection of GCs relies on several properties that simple objects do not exhibit. Therefore, a sufficient number of constraints do not exist for their system
to detect simple objects. In addition, the simplicity of the objects implies that
accidental object-like background configurations exist with much higher probability.
• Object detection from multiple images:
Instead of using a single intensity image, some systems use multiple intensity images
to detect 3-D objects. Several techniques, such as stereo [10][16][69] and motion
[19], may be used to recover 3-D information from multiple intensity images.
However, to take advantage of constraints across views, the non-trivial
correspondence problem needs to be solved.
Mohan and Nevatia [69] extract ribbons which represent object surfaces from stereo
images. A stereo matching process is used to aid the task of object segmentation.
Chung and Nevatia [10] developed a hierarchical stereo system to recover LSHGCs
(Linear Straight Homogeneous GCs) and SHGCs by matching grouped features at
different levels. The use of matched high level features helps reduce the ambiguity of
multiple matches of low level features.
These systems have the advantage of being able to extract 3-D information from
multiple images using triangulation. Stereo matching of features provides additional
constraints for the system to effect object segmentation.
2.1.3 Perceptual Grouping
A general approach to solving the problem of detecting objects from one or more
intensity images is to make hypotheses of the objects first and then to verify them. The
perceptual grouping technique has shown more successful results than other techniques in
dealing with this problem. The importance of perceptual grouping in the processing of
images by the human visual system has been demonstrated by many psychophysical
experiments [102][79].
The role of perceptual organization in computer vision was emphasized and
demonstrated by Witkin, Tenenbaum [98], Lowe, and Binford [61][62] some time ago.
Dolan and Weiss [17] use perceptual grouping to group curved lines based on proximity
and good continuation in a local neighborhood. The GROPER system developed by Jacobs
[46] uses a perceptual grouping process to locate collections of edges likely to have been
produced by a single object. The grouping process helps the GROPER system do a good
job of perceiving objects.
Mohan and Nevatia [68][69] use a grouping process to make hypotheses of objects
using a priori knowledge about the scene. Stein and Medioni [87] use a hierarchical
perceptual grouping process with an indexing scheme to recognize 3-D objects. Guy and
Medioni [27] use a Global Extension Field to enhance the saliency of the grouping tokens.
Specialized fields have been introduced for different grouping tokens. Sha’ashua and
Ullman [83] suggest a method of computing the saliency by an iterative scheme, using a
uniform network of locally connected processing elements to guide the grouping process.
This method groups line segments with a bias for long curves with low total curvature.
Other techniques, such as the Hough transform [18] and graph-based techniques [76], are also used to address the problem of fragmented low-level segmentation, but their computational complexity is a major drawback.
2.2 Building detection and description from Intensity Images
Building detection and description from one or more intensity images has been
researched intensely for some years. While it is a non-trivial problem, it provides a
reasonable number of constraints for a vision system to be able to solve or to finesse some
of the major problems in generic object detection.
Section 2.2.1 describes previous work in the area using a single intensity image. Lin
and Nevatia have obtained good results using a single image [57]. Systems using a single
image do not need to solve the correspondence problem (the problem of determining the
corresponding point in other views given a point in a chosen view) that systems using more than one image have to, and consequently do not have to disambiguate multiple matches; in other words, they do not face the possible combinatoric explosion caused by multiple matches in systems using more than one image. However, systems using a single image have no direct method of
making 3-D inferences. This forces reliance on domain-specific cues, such as shadows and
walls, to provide 3-D information. Without control over the imaging process, it is not
possible to guarantee that these cues will always be present.
Section 2.2.2 describes previous work using more than one image. The existence
of multiple views makes direct 3-D inferences possible through triangulation. However,
perfect matching assumes perfect camera models (parameters which specify the camera’s
internal imaging parameters, as well as its orientation with respect to a world coordinate system), which is often not the case. Given the low resolution of some of the images (1 m/
pixel), and that the distance of the buildings to the camera imaging planes is large compared
to the size of the buildings, small errors in camera models often cause 3-D errors of the
order of building dimensions. In many systems, matching is typically done for a particular
feature, such as junctions hypothesized from detected lines, and 3-D inferences drawn at
that point. By matching different features of different complexities, more robust
performance may be obtained.
2.2.1 Detection and description from monocular images
There have been many methods proposed to solve the problem of building detection
and description from aerial images [35][36][69][39][23][44][58][94][66][67][81].
Building detection requires robust segmentation techniques and methods to infer the 3-D
structure. The segmentation techniques usually rely on regions or edges extracted from the
image. Region-based techniques [23][58] construct closed curves that often do not
correspond to the objects of interest. Simple edge-based techniques such as contour tracing
[35][36][94] encounter the problem of a rapidly growing search space. A more robust edge-
based technique is the perceptual grouping technique[47][54][68]. For the reconstruction
of the 3-D information, most of the monocular systems use the corresponding shadow
evidence of a building to infer the building height [36][44][54][58].
Huertas and Nevatia [36] use a contour-tracing process to detect rectilinear buildings
from a nadir view aerial image. Although some constraints have been introduced to restrict
the search space, the system has problems handling fragmented edge input. Shadow
evidence is used to infer 3-D information.
The original BABE system [44] of CMU uses contour-tracing to make hypotheses
from a single nadir view image. The system analyzes the relationships of lines and corners
of a gable roof model and uses these relationships to guide decisions in the contour-tracing
process. Shadow evidence is used to verify the hypotheses.
McGlone and Shufelt [65][66] use a special projection property, namely the
vanishing point, to classify lines and group the lines according to the building model used
in their system. The use of vanishing points requires an accurate camera model. The line classification process is not reliable because the perspective effect of a building in aerial images is usually very small and cannot be detected.
Lin and Nevatia [57] use a feature-based approach to detect and describe buildings.
Their system hypothesizes 2-D parallelograms corresponding to the projection of a 3-D
building. Verification and height estimation of the hypotheses is done using auxiliary
evidence such as walls and shadows, because no direct 3-D information is available from
one view. The system presented in this dissertation is similar in approach to that of Lin and
Nevatia. However, the use of multiple views allows better hypothesis selection and
verification. Further, Lin and Nevatia do not handle gable-roof buildings, which the system presented in this dissertation does.
2.2.2 Detection and description from multiple images
There has been some previous work done on building detection from multiple aerial
images. In [40] it is shown how buildings may be inferred from two views, using grouping
and matching simultaneously.
The MOSAIC system [29][30] uses stereo matching techniques to incrementally
construct the buildings in the scene. The system requires accurate stereo matching over
several images. An important issue in using this approach is the decision process in merging
data from different views and making hypotheses from them.
Mohan and Nevatia [68][69] developed a system to extract buildings from stereo
images. A hierarchical perceptual grouping process is used to generate hypotheses from
segmented low-level intensity edge features. Their system uses a rectangular block model, but works only on images taken from restricted viewpoints and relies on the stereo
matching technique to extract 3-D information.
Fua and Hanson [23][24] also developed a system to reconstruct buildings from
stereo images. They also assumed a rectilinear shape for the buildings. Their system
reconstructed a building from an initial roof hypothesis by optimizing an objective
function. The initial hypothesis must be given by a user or extracted from the line segments
of the image. However, the initial hypotheses automatically extracted by their system were
not always related to the actual roof surfaces.
In [6] a system that used two views with parallel-axis geometry is presented, and in
[47] a system that used multiple views is described. The three systems [6], [40] and [47]
assume true stereo views where the only difference is the viewpoint. In particular, they
assume that the time-lag between the views, if any, is negligible. The system described in [8]
uses a single view to determine roof outlines. These are used to locate buildings in the other
available views through a search procedure that is limited by stereo constraints.
Roux and McKeown [81] designed a system to use matched corners of multiple views
in making building hypotheses by a graph-based technique. However, a precise camera
model is required and sometimes the matched corners are unreliable.
Lin and Nevatia [57] generalize their system to handle more than one image.
However, it operates, for the most part, as a multi-monocular system with minimal
matching between views, as opposed to a true multi-view system.
Chapter 3
Overview of an application of the fundamental
approach
In this chapter, an application of the fundamental approach discussed in Chapter 1 is presented. A system has been developed to detect and describe 3-D man-made
structures, such as buildings, from two or more registered aerial views (images). Cultural
features such as buildings represent structures that have specific geometric properties. The
approach uses simple geometric models as the elements to describe the shapes of objects
and is particularly useful in solving the building detection and description problem. In this
application, the shape of a building is restricted to be a single rectangular block, or a
composition of rectangular blocks. In theory this framework allows the system to detect and
describe all rectilinear shapes. The detected buildings may have flat roofs or gable roofs.
The projection of a 3-D object into a 2-D view results in information loss. Depth of
a 3-D scene is not uniquely recoverable from a 2-D view. In other words, a 2-D view could
be the projection of an infinite number of possible 3-D scenes. With the use of multiple
views, it is possible to recover 3-D information. However, the presence of multiple views
does not trivialize the problem of 3-D object recognition and description. The fundamental
problem that must be solved when using multiple views is that of finding an area or a feature
in one view that corresponds to an area or a feature in another view, commonly referred to
as the correspondence problem. There is no guarantee that areas or features in one view will
have unambiguous correspondences in other views. Multiple matches for areas or features
are usually the rule rather than the exception. This suggests that some reasonable
assumptions about the 3-D scenes and the 2-D views should be made by a vision system to
confine and simplify the problem of 3-D object perception from multiple views. The
general approach adopted in this system is to form hypotheses by matching and grouping a
hierarchy of features across the available views, beginning with lines detected by an edge-
detector, and proceeding through junctions, U-contours, parallels and parallelograms. A
selection process selects “good” hypotheses for verification. A verification process filters
the selected hypotheses using wall and shadow evidence of the buildings. Triangulating the
hypotheses from two or more views gives a 3-D representation of the hypothesized
buildings, which is returned as the 3-D reconstruction of the scene.
The domain-specific assumptions made by the system are discussed in Section 3.1.
An overview of the basic approach of the system is presented in Section 3.2. The distinctive
features of the method applied here, are presented in Section 3.3. To help understand the
system in more detail, some of the geometric and projective properties used by the system
are discussed in Section 3.4.
3.1 Basic Assumptions
When multiple views are used there are two sources of constraints that aid the
detection and description of 3-D objects: monocular constraints (which may be domain-
specific) such as the expected shape of an object in a view given its viewpoint; and
constraints arising from the views taken together. The constraints from views considered
two (or more) at a time are primarily geometric constraints resulting from the relative
viewpoints of different views. Some domain-specific assumptions, as well as some assumptions about the acquired views, are made to simplify the problem. Justification for
each assumption is provided:
• Projection is weak perspective locally. The views (images) are acquired from a
considerable height above the ground, which causes the size of an object to be small
compared to the distance between the camera and the object. It is therefore reason-
able to assume that the projection of an object has the properties of weak perspec-
tive. Weak perspective approximates full perspective projection when the depth variation within an object is small compared to the distance from the camera to the object. An important property of weak perspective projection is that parallel lines in 3-D project to parallel lines in each view. Note that the weak perspective assumption applies only in small neighborhoods (typically the size of buildings); it is possible to have a very large perspective effect over the whole view. This assumption provides us with useful projective properties, which are utilized in the processes that construct and validate hypotheses (a small numerical sketch of the parallelism property follows this list).
• Camera models and sun angles are known. This is an assumption about the gen-
eral knowledge of the camera orientation and the illumination direction. Camera
models for views of a 3-D scene may be acquired off-line in mapping applications.
The sun angles (the direction of illumination in 2-D and the sun incident angle) can
be computed from the time that a view is taken and the longitude and latitude of the
scene. If these data are not available, they can be computed directly in each view by
an interactive process. The camera models provide constraints on the projection of
objects in a single view as well as epipolar constraints across views, while the sun
angles aid the system with contrast and shadow information.
• Roofs are rectilinear and flat, or rectangular and symmetric gabled, buildings
lie on flat ground, walls are vertical, and shadows fall on flat ground. These are
assumptions about the general situations of the object and the environment in the
scene. The buildings are assumed to be rectilinear and flat, or rectangular and sym-
metric gabled. This assumption provides the system with geometric constraints on
the projection of buildings. Shadows of buildings are assumed to fall on flat ground
though in some cases the shadows fall on nearby vehicles, trees, or buildings. How-
ever, the system can tolerate this situation to a certain degree depending on the
availability of other evidence. The system can detect a building without any shadow
evidence.
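A small numerical sketch of the parallelism property promised above follows. The scaled-orthographic camera used here is an assumed toy model, chosen only to illustrate the weak perspective property, and is unrelated to the camera models used by the system.

```python
import numpy as np

def weak_perspective(points_3d, scale=0.01):
    # Toy weak perspective camera: uniform scaling followed by an
    # orthographic drop of the depth coordinate.
    return scale * points_3d[:, :2]

# Two parallel 3-D edges, e.g. opposite roof edges of a building.
edge_a = np.array([[0.0, 0.0, 30.0], [20.0, 10.0, 30.0]])
edge_b = np.array([[5.0, 40.0, 30.0], [25.0, 50.0, 30.0]])

da = np.diff(weak_perspective(edge_a), axis=0)[0]
db = np.diff(weak_perspective(edge_b), axis=0)[0]

# The 2-D cross product of the projected directions vanishes, so the
# projections remain parallel.
print(da[0] * db[1] - da[1] * db[0])  # -> 0.0
```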
3.2 Basic Approach
Geometric and projective constraints are used to hypothesize buildings from low-
level features, namely edges, that are extracted from the input intensity images (views). The
hypotheses are formed based on edge-information obtained from the available views.
Starting with lines, a hierarchy of features is constructed. The goal of this hierarchical
feature construction is to hypothesize buildings. Parallelogram matches across the available
views that triangulate (into the world) to rectangular parallelepipeds in 3D, are used to
model buildings. Detected lines are used to hypothesize junctions, and parallels; lines and
junctions are used to hypothesize U-contours; parallels and lines are used to hypothesize
parallelograms. Lines may be thought of as single sides of parallelograms; junctions as the
20
corners of parallelograms; parallels as two sides of parallelograms; and U-contours as three
sides of parallelograms. Perceptual grouping techniques are applied at different
complexities in the feature hierarchy (viz. lines, junctions, parallels, U-contours and
parallelograms). Matching of detected features takes place across views at every level in the
hierarchy, except for U-contours. Matching of features at different levels in the hierarchy
makes the system more robust as it allows comparison of the views at different levels of
feature complexity thus eliminating many accidental features that appear in a single view
from further consideration and boosting the confidence in the 3-D features that are
supported in more than one view. In addition, this allows the system to draw 3-D inferences
during the hypothesis process, rather than waiting until after the hypotheses have been formed. It is instructive to note that grouping and matching take place at each level in the
hierarchy before features of higher complexity are constructed. Triangulation of 2-D
features into the 3-D world results in 3-D hypotheses.
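As an illustration of this hierarchical organization, the sketch below models each level as a container holding pointers to its constituent features, so that the constituents of a hypothesis can be retrieved by following the pointers. The class and field names are hypothetical, not the system's data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class Line:                      # detected edge segment
    p0: Point
    p1: Point

@dataclass
class Junction:                  # two lines meeting at a corner
    lines: Tuple[Line, Line]

@dataclass
class Parallel:                  # a pair of roughly parallel lines
    lines: Tuple[Line, Line]

@dataclass
class UContour:                  # three sides of a parallelogram
    parallel: Parallel
    closing_line: Line

@dataclass
class Parallelogram:             # full roof outline in one view
    sides: Tuple[Line, Line, Line, Line]
    matches: List["Parallelogram"] = field(default_factory=list)  # across views

l1 = Line((0.0, 0.0), (10.0, 0.0))
l2 = Line((0.0, 5.0), (10.0, 5.0))
pair = Parallel((l1, l2))        # higher level feature points to its parts
print(pair.lines[0].p0)          # constituents reachable through pointers
```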
A selection process filters hypotheses for verification based on the roof evidence
supporting and evidence negating the existence of the hypotheses. The selection process
eliminates a large percentage of “bad” hypotheses, and therefore reduces the run time of the
expensive verification process. This process uses only roof (outline of the object) evidence.
Hence no domain-specific knowledge is used.
Roof, wall and shadow evidence is used to verify the selected hypotheses. The use of
both wall and shadow evidence allows the verification process to generate more reliable
results than if only one of these cues were used. The verification process encapsulates the
use of domain-specific knowledge to further filter the hypotheses that meet the geometric
constraints on the formation of building hypotheses. The 3-D scene is reconstructed from
the verified hypotheses. Figure 3.1 shows the block diagram of the system.
The system design philosophy has been to make only those decisions that can be
made confidently at each level; in other words, a strategy of least commitment. Thus, the
hypothesis generation process creates as many hypotheses as are feasible, given the
constraints available at that time. The selection process favors keeping hypotheses that may
be viable. The verification process has the most global information and therefore can make
decisions based on a larger and more complete set of geometric and projective constraints.
Figure 3.1 Block Diagram of Hypothesis Formation: in each view (View 1 through View n), line detection is followed by junction, parallel, U-contour and parallelogram detection, with matching across views at each level; all building hypotheses then pass through the selection process and the verification process to yield the 3-D building hypotheses.
22
3.3 Distinctive Features of the System
The system has several distinctive features, including the use of generic models,
domain knowledge, hierarchical grouping and matching, geometric and projective
properties, non-preferential use of multiple views, and wall and shadow information,
embedded in the approach. They are discussed in the following sub-sections.
3.3.1 Use of a generic model
The system uses generic models. The shape of many buildings is rectilinear, that is,
a composition of rectangular parallelepipeds. By using this generic model, the system is
able to detect a wide variety of buildings with different sizes and profiles. The use of a
generic model affords a fairly general system at the cost of increased complexity in the
hypothesis, selection and verification processes. The flexibility of the model increases the generality of the system, but also its difficulty and complexity.
Gable-roof buildings are handled using the same conceptual structure as that for flat-
roof buildings. The differences are that a gable must be detected on the roof, and that the
expected shape of the shadow evidence changes. In handling gable-roof buildings
successfully within the same structure, the system demonstrates that it is general enough to
accommodate reasonable extensions to the generic model with relatively minor additions.
3.3.2 Use of domain knowledge
The use of domain knowledge helps the system in many ways. For example, to detect
rectilinear buildings, it may be expected that roofs are flat, or gabled, and walls are vertical.
For views taken of a suburban area, it may be expected that shadows fall on flat ground.
Knowledge of the light source (the Sun, for outdoor views) enables prediction of the
existence and the clarity of shadows. The knowledge of minimum and maximum building
heights is used to restrict the search space, exclude unreasonable hypotheses, and accelerate
the verification process. This knowledge provides us with constraints which are used in
several stages of the system.
3.3.3 Use of hierarchical perceptual grouping and matching
Perceptual grouping and matching is an important tool in the hypothesis generation
process and is used to overcome the problem of fragmented low-level segmentation. The
grouping and matching process creates a hierarchical structure of matched features for each
hypothesis. Low level features are grouped and matched into higher level matched features
of increasing complexity, resulting terminally in building hypotheses. The feature hierarchy
is important in that the system can use features from different levels when necessary. High
level features include pointers to all the related low level features. Therefore, the constituent
features and evidence of a hypothesis can be retrieved easily through those pointers. The
advantage of grouping features hierarchically is that the formation of high level features
can help the formation of low level features.
A choice is made to match features (while grouping them) at all stages in the
hierarchy to increase robustness of the hypotheses. In addition, following a strategy of least
commitment, matching at each level allows application of constraints at that level, thereby
reducing the number of hypotheses created. Least commitment without some pruning at
these levels would cause an unnecessary combinatoric explosion, causing avoidable
complexity for the selection and verification processes.
3.3.4 Use of geometric and projective properties
The most important matching constraint used is the well-known epipolar constraint.
The epipolar constraint is derived from the relationships of the camera geometries and is
used extensively by the system to limit the matching space of different features which
include lines, junctions, parallels and parallelograms.
By assuming that the projection is weak perspective locally (at the scale of an object
in the scene), several properties may be utilized. The most important property used in the
system is that parallel lines in an object project into parallel lines in any view. Other
projective properties such as the expected projection of a right-angle in 3-D, given the
camera geometries, are also used. Based on the geometric and projective constraints, the
perceptual grouping and matching processes organize low level features into high level
matched features.
3.3.5 Non-preferential use of multiple views
This system uses multiple views for hypothesis, selection and verification. The
design and implementation of the algorithms used by the system ensures that these methods
are order-independent. In other words, the results should be the same irrespective of what
order the system receives the views. This is a non-trivial task as there are several geometric
operations that are asymmetric. Extra processing was added to ensure that these operations
are symmetric.
3.3.6 Use of wall and shadow information
An important indication of the existence of a 3-D structure in a view is the presence
of walls and shadows. Walls are visible in an oblique view and shadows are visible when
the illumination is good. Most of the other building extraction systems use shadow
information alone. However, sometimes shadow information is not available or not reliable.
The system verifies a 3-D hypothesis by its supporting evidence of walls and shadows in all
the available views. The acquisition of this evidence relies on the use of geometric and
projective properties to predict the expected wall and shadow boundaries. It is instructive
to note that shadow evidence is considered a monocular cue as no assumption is made that
the views were acquired at the same time. Hence no attempt is made to match shadow lines.
The presence of widely differing shadows also precludes the use of simple area-based
stereo-matching systems. The features caused by the vertical lines of the walls in the views are usually small, which may cause large errors if matching is attempted across views. In
addition visible horizontal wall lines in one view are often self-occluded in other views due
to viewpoint or shadow-casting. For these reasons wall evidence is also considered as
monocular evidence, with no attempt to match it.
3.4 Geometric and Projective Properties
In this section some of the geometric and projective properties used by the system are
introduced. The most important constraint is the epipolar constraint. This constraint is
derived from the camera models and is used for matching at each level in the feature
hierarchy. The second property describes why a rectangular roof in 3D projects to a
parallelogram in 2D. This property forms the basis for the building hypothesis process, as
it allows 3D building hypotheses to be formed from properly constrained 2D detected parallelograms.
The third and fourth properties help the verification process outline the predicted visible
wall and shadow boundaries. Therefore, self-occlusion of a wall, and shadow evidence can
be handled correctly by the system. The relationships between the building height and the
projected wall height and shadow width are illustrated in the last property. The inference of
3-D building height from 2-D wall height and shadow width is accomplished by using this
property.
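As a simple illustration of the shadow half of this last property, on flat ground the relation between building height and shadow length reduces to a single trigonometric identity. The sketch below assumes the sun incident angle is expressed as an elevation above the horizon; it illustrates the relation only and is not the system's evaluation function.

```python
import math

def height_from_shadow(shadow_length, sun_elevation_deg):
    # On flat ground, a structure of height h casts a shadow of length
    # h / tan(elevation); inverting gives h = shadow_length * tan(elevation).
    return shadow_length * math.tan(math.radians(sun_elevation_deg))

print(height_from_shadow(10.0, 45.0))  # -> ~10.0 (in the units of the input)
```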
3.4.1 Use of camera models for epipolar constraints
A camera model for a view consists of the relative orientation of the camera with
respect to a chosen world coordinate system (external parameters) and a description of the
internal camera parameters (internal parameters). One way of representing the external
parameters is a general rotation (3 parameters) and a translation (3 parameters) to the
camera center. The cameras used are assumed to have an imaging plane normal to the
camera axis. This assumption implies that the only other parameter needed is the focal
length of the camera. A camera model may be used to determine the projection of a 3-D
point in an image by intersecting the ray from the camera center to that point with the
imaging plane. This principle applies to every point in the scene, and hence to the scene
itself. If the goal is to recover the 3-D scene the camera parameters enable back-projection
into the scene. However each point in the image restricts the corresponding 3-D point to lie
on a ray, leaving the position on that ray to be resolved through other means.
Given camera models for any two views it is simple to derive epipolar lines for these
two views. All computations proceed through a common 3-D world coordinate system. In
Figure 3.2, let the camera centers for the two views be C1 and C2 respectively, and let P be a point in the 3-D world whose projection in view 1 is p1. If Π is the plane determined by C1, p1 and C2, and I2 is the imaging plane of view 2, then the projection of P in view 2 is constrained to lie on the line l determined by the intersection of planes Π and I2.

Figure 3.2 Construction of epipolar lines
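A small numerical sketch of this construction follows. The two 3x4 camera matrices are assumed toy values (with the first camera at the world origin, so a point on the ray through p1 can be written directly as (p1, 1) scaled by depth); the sketch verifies that the projection of P in view 2 lands on the epipolar line obtained by projecting two samples of that ray.

```python
import numpy as np

def project(P_mat, X):
    # Project a homogeneous 3-D point X with a 3x4 camera matrix.
    x = P_mat @ X
    return x[:2] / x[2]

# Toy camera matrices for views 1 and 2 (assumed values, illustration only).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
R = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0]])
t = np.array([[-5.0], [0.0], [5.0]])
P2 = np.hstack([R, t])

# A 3-D point P and its projection p1 in view 1.
P = np.array([1.0, 2.0, 10.0, 1.0])
p1 = project(P1, P)

# Back-project p1: every point (p1, 1) * depth lies on the ray through C1.
ray_points = [np.append(np.append(p1, 1.0) * d, 1.0) for d in (5.0, 20.0)]
a, b = (project(P2, X) for X in ray_points)

# The epipolar line l in view 2 passes through a and b, and the true
# projection p2 of P must lie on it (the 2-D cross product is ~0).
p2 = project(P2, P)
u, v = b - a, p2 - a
print(u[0] * v[1] - u[1] * v[0])  # -> ~0.0
```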
3.4.2 Projection of 3D Rectangular Roofs
Under weak perspective projection locally (which is assumed for the aerial images
considered in this thesis) “local” (of the order of building dimensions) parallel lines in 3D
project to parallel lines in any image. A rectangular parallelepiped, used to model buildings,
may be thought of as two pairs of local parallel lines in 3D. Each pair of 3D parallel lines
will project to a pair of parallel lines in an image. In general, two pairs of intersecting
parallel lines will form a parallelogram. It is instructive to note that given the camera model
for an image, any rectangular parallelepiped that is parallel to the ground in 3D may be
projected to obtain a corresponding parallelogram for that image. Conversely, given a
parallelogram in an image and the camera model for that image, it is possible to predict
whether that parallelogram corresponds to a rectangular parallelepiped that is parallel to the
ground in 3D. This property is used extensively by the system to constrain the hypothesis
formation process.
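The following sketch illustrates this property with an assumed toy weak perspective camera (a 3-D rotation followed by scaled orthographic projection). Opposite sides of a ground-parallel 3-D rectangle project to equal 2-D vectors, so the projected outline is a parallelogram.

```python
import numpy as np

# Assumed toy weak perspective camera: rotate, keep x-y, scale uniformly.
theta = np.radians(30.0)
R = np.array([[1.0, 0.0, 0.0],
              [0.0, np.cos(theta), -np.sin(theta)],
              [0.0, np.sin(theta), np.cos(theta)]])
scale = 0.05

def project(X):
    return scale * (R @ X)[:2]

# Corners of a 3-D rectangular roof parallel to the ground (height 12).
roof = [np.array([0.0, 0.0, 12.0]), np.array([30.0, 0.0, 12.0]),
        np.array([30.0, 15.0, 12.0]), np.array([0.0, 15.0, 12.0])]
c = [project(X) for X in roof]

# Opposite sides project to equal vectors: the outline is a parallelogram.
print(np.allclose(c[1] - c[0], c[2] - c[3]))  # -> True
```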
3.4.3 Shadow Casting Sides
To use the shadow information, the system must know which sides of a building will
cast shadows given the orientation of the hypothesis, the camera viewpoint, and the sun
angles. The four vertices of a parallelogram hypothesis are stored in clockwise order. The
four sides of the parallelogram are ordered clockwise. The orientations of the four sides are
shown in Figure 3.3. The condition for a side of the roof hypothesis to cast a shadow is that
the counter-clockwise angle from the side to the 2-D direction of shadow cast by vertical
line (ψ) must be less than 180°. In Figure 3.3, the two sides of the roof casting shadows are
drawn in black lines. The part of the shadow occluded by the building itself can also be
computed, and appropriate procedures are used to take care of self-occlusion situations.
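The shadow-casting condition reduces to a signed-angle test, as the sketch below shows. It is a direct transcription of the stated rule; the roof coordinates and shadow direction are hypothetical values chosen for illustration.

```python
import math

def ccw_angle(u, v):
    # Counter-clockwise angle from vector u to vector v, in degrees [0, 360).
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return math.degrees(math.atan2(cross, dot)) % 360.0

def shadow_casting_sides(vertices, psi):
    # vertices: roof corners in clockwise order; psi: 2-D direction of the
    # shadow cast by a vertical line. A side casts a shadow when the
    # counter-clockwise angle from the side to psi is less than 180 degrees.
    flags = []
    for i, p in enumerate(vertices):
        q = vertices[(i + 1) % len(vertices)]
        side = (q[0] - p[0], q[1] - p[1])
        flags.append(ccw_angle(side, psi) < 180.0)
    return flags

roof = [(0, 0), (0, 10), (10, 10), (10, 0)]       # clockwise corners
print(shadow_casting_sides(roof, (1.0, 1.0)))     # -> [False, True, True, False]
```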
3.4.4 Visible Walls
Visible walls are utilized by the system to verify hypotheses. The system has to figure
out which sides of a roof hypothesis are adjacent to visible walls. The orientation of a side
of the roof hypothesis is shown in Figure 3.4. The condition for a side of the roof
hypothesis to be adjacent to a visible wall is that the counter-clockwise angle from the
vertical line to the side is less than 180°. Knowing the shadow casting sides and the visible
wall sides, the walls inside the shadow are identified. Usually, the walls inside shadow are
less visible, and thus special attentions must be paid to manage the different illumination
situations of the visible walls.
Figure 3.3 Shadow casting sides
Figure 3.4 Sides Adjacent to Visible Walls
Chapter 4
Generation and Selection of Building Hypotheses
The system uses the hypothesize-and-verify paradigm to build a 3D model of a scene
from multiple aerial views of the scene. The building hypotheses that form the model are
constructed from detected image features in a hierarchical fashion, matching features
across views at each level in the feature hierarchy. Perceptual grouping techniques are
applied to the detected and constructed features to aid the separation of figure from ground.
The system relies heavily on the availability of multiple views to enable feature matching
across views, thereby increasing the confidence of features with matches and helping filter
out features that have no reasonable matches in other views. All available views are used
non-preferentially when forming hypotheses. A selection mechanism makes a value
judgement of the "goodness" of the evidence supporting each hypothesis, filtering out
hypotheses without sufficient evidence. The remaining hypotheses are passed to a
verification procedure. Verification of the selected hypotheses is performed based on the
supporting evidence, in the form of roof evidence, visible vertical walls, and visible
shadows, of the hypothesized building.
Section 4.1 describes the feature hierarchy, as well as the constraints applied at each
stage in grouping and matching the features. This description is complemented by an
example for which features at various levels in the hierarchy are displayed. Section 4.2
describes the method for hypothesizing flat-roof rectangular structures. Section 4.3
describes the selection mechanism for flat-roof hypotheses, including the evidence
gathered for selection and the method for using this evidence to filter out hypotheses
without sufficient evidence. Section 4.4 describes the hypothesis mechanism for gable-roof
rectangular hypotheses. Section 4.5 outlines how gable-roof hypotheses are selected.
4.1 Features in the hierarchy
The system described in this dissertation is feature-based. It uses edge descriptions
as abstractions of input images. Figure 4.1 shows an example of two views of an aerial
scene. These edges are the low-level descriptions of the input images. Through a process of
grouping features in each image (using perceptual grouping techniques) and matching these
features across views, the system constructs a feature hierarchy with features of increasing
complexity. By matching grouped features at every stage in the hierarchy, the system is able
to apply constraints that come from the relative geometries of the views (such as the
epipolar constraint) as well as 3-D constraints such as the 3D Orthogonality Constraint
used in junction matching. This enables more informed decisions at each stage.
Multiple-level matching and grouping ensures that the available feature information is
exploited more fully, which leads to higher confidence in the
hypotheses. This increases the robustness of the system by allowing 3D building
hypotheses to arise at different levels in the hierarchy. The hierarchy in order of complexity
consists of lines, junctions, parallels, U-contours, parallelograms and triples. Parallelogram
matches are used to model flat-roof buildings while triple matches are used to model gable-
roof buildings.
Lines
Figure 4.1 Two views of an aerial scene
Colinear line segments are grouped. Segments are considered colinear if there is a
free path from the end of one segment to the other, i.e. no other segment blocks the line
joining the two closest endpoints of the colinear segments, and if the angle between the
segments is less than 10°. A search for colinear edge segments is launched away from each
end of each segment to a distance of 1/4th of its length. Colinearity is applicable to a set of
more than two segments as well; the above criterion must be met between every pair of
neighboring segments. The results of grouping line segments detected in the images shown
in Figure 4.1 are shown in Figure 4.2.
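A minimal sketch of the pairwise angle test follows (Python); the free-path condition, which needs the neighboring segments, is noted but omitted here.

    import numpy as np

    def colinear_angle_ok(seg_a, seg_b, max_angle_deg=10.0):
        # seg_a, seg_b: 2x2 arrays of endpoints. The free-path condition
        # (no blocking segment between the closest endpoints) is omitted;
        # it would consult the spatial hash of segments.
        da = seg_a[1] - seg_a[0]
        db = seg_b[1] - seg_b[0]
        cosang = abs(np.dot(da, db)) / (np.linalg.norm(da) * np.linalg.norm(db))
        return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))) < max_angle_deg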
After colinear grouping, the lines are tested for matches across the views by using the
following quadrilateral constraint:
• Epipolar constraint: The match for a line segment in one view must lie at least
partially within a quadrilateral defined by the epipolar and 3D height constraints.
Consider Figure 4.3. Let line l in view_1 have endpoints p_1 and p_2. By the epipolar
constraint for points, the match for p_1 in view_2 must lie on an epipolar line defined
by p_1, say l_1′. Similarly, a match for p_2 must lie on l_2′. A particular height in the
world coordinate system corresponds to a particular point on the epipolar line.
Hence, knowing the local ground height in the world coordinate system, and the
maximum height of a building in 3-D, the search space may be reduced to the
segment on each epipolar line defined by these z-coordinate values. In general, there
are 4 distinct points, two for each epipolar line. This limits the search for matching
segments to a quadrilateral defined by these four points. In Figure 4.3 these points
are denoted by p_11′, p_12′, p_21′ and p_22′, corresponding to line l. Each pair of lines
(l, l′) that satisfies the epipolar (quadrilateral) constraint in any pair of views is
determined to form a line match, is included in the set of line matches that we will
call S_lm, and is passed to the higher levels for further processing. A sketch of the
computation of the search quadrilateral appears below.
Figure 4.2 Grouped lines in each image from Figure 4.1
Note that a line in one view may match multiple lines in other views, causing
multiple matches. A line in Figure 4.4b) matches 4 distinct lines in Figure 4.4a).
The results of matching the grouped lines detected in Figure 4.2 are shown in
Figure 4.5.
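The following minimal sketch computes the corners of the search quadrilateral, assuming 3x4 projection matrices in a world frame whose z axis is height; the function names are illustrative.

    import numpy as np

    def search_quadrilateral(P1, P2, p1, p2, z_min, z_max):
        # Corners p11', p12', p21', p22' of the search quadrilateral in
        # view 2 for a line with endpoints p1, p2 in view 1, given the
        # allowed 3-D height range [z_min, z_max].
        def ray_point_at_z(P, p, z):
            # 3-D point on the viewing ray of image point p at height z:
            # two linear equations in the unknowns (X, Y).
            A = np.array([[P[0, 0] - p[0] * P[2, 0], P[0, 1] - p[0] * P[2, 1]],
                          [P[1, 0] - p[1] * P[2, 0], P[1, 1] - p[1] * P[2, 1]]])
            b = np.array([p[0] * (P[2, 2] * z + P[2, 3]) - (P[0, 2] * z + P[0, 3]),
                          p[1] * (P[2, 2] * z + P[2, 3]) - (P[1, 2] * z + P[1, 3])])
            X, Y = np.linalg.solve(A, b)
            return np.array([X, Y, z, 1.0])

        def project(P, X):
            x = P @ X
            return x[:2] / x[2]

        return [project(P2, ray_point_at_z(P1, p, z))
                for p in (p1, p2) for z in (z_min, z_max)]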
Junctions
Figure 4.3 Quadrilateral (epipolar) constraint
Figure 4.4 Example of multiple line matches
Junctions are formed using the lines that participate in at least one line match.
Consider a pair of lines L_ik (k = m, n) in view_i, with endpoints P_ikl (l = 1, 2). Junction
J_ij is formed at the intersection of L_im and L_in iff the angle between L_im and L_in is
greater than 30° and min(distance(J_ij, P_ik1), distance(J_ij, P_ik2)) ≤ length(L_ik) for
(k = m, n). Denote the set of junctions formed in view_i by S_Ji. Junctions in the sets S_Ji
(i = 1..n_views) are then matched across the views when the following constraints are
satisfied:
• Epipolar constraint: Given a junction J_ij in view_i, its match in another view_l must
lie within a certain segment (depending on the height range in 3D) of the epipolar
line corresponding to J_ij in view_l. This is a special case, for a single point, of the
quadrilateral epipolar constraint for lines discussed in the previous section.
• Line Match Constraint: If junction J_ij, formed by lines L_im and L_in, matches
junction J_kl, formed by lines L_kp and L_kq, then exactly one of the following must
hold: either there exist line matches (L_im, L_kp) and (L_in, L_kq) in S_lm, or there
exist line matches (L_im, L_kq) and (L_in, L_kp) in S_lm.
• 3D Orthogonality Constraint: Given a junction match, we can compute the 3D
angle between the lines forming it (from the knowledge of the matching lines). The
angle is required to be between 80° and 100° in 3D.
• Trinocular Constraint: When there are more than 2 views available, the well-known
trinocular constraint may be applied to the locations of the junctions. Functionally,
given a point p_1 in view_1, choosing a point p_2 in view_2 that lies on the epipolar
line of p_1 in view_2 unambiguously determines a point p_3 in view_3. This point,
p_3, is the intersection of the epipolar lines of p_1 and p_2 in view_3. Details of the
trinocular constraint may be obtained in [45]; a sketch of this prediction appears
after the note below.
Figure 4.5 Matched lines for the example in Figure 4.1
Note that a junction in one view may match multiple junctions in other views, causing
multiple junction matches for the same junction. A junction in Figure 4.6a) matches 2
distinct junctions in Figure 4.6b). The results of detecting and matching junctions in the
images of Figure 4.1 are shown in Figure 4.7.
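A minimal sketch of the trinocular prediction, assuming fundamental matrices F13 and F23 (precomputed from the camera models) that map points in views 1 and 2 to epipolar lines in view 3:

    import numpy as np

    def trinocular_point(F13, F23, p1, p2):
        # Returns the predicted junction location p3 in view 3 as the
        # intersection of the epipolar lines of p1 and p2 in view 3.
        l1 = F13 @ np.array([p1[0], p1[1], 1.0])   # epipolar line of p1
        l2 = F23 @ np.array([p2[0], p2[1], 1.0])   # epipolar line of p2
        x = np.cross(l1, l2)                       # homogeneous intersection
        return x[:2] / x[2]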
Parallels
Figure 4.6 Example of multiple junction matches
Figure 4.7 Junction matches detected in images from Figure 4.1
Starting with the lines in S_lm, parallel pairs of lines are detected in each view and
matched across views. Parallels are formed between pairs of lines, L_ij and L_ik in the same
view_i, when the following constraints are satisfied:
• The perpendicular distance between L_ij and L_ik is less than the maximum
projected width of a building.
• The acute angle between L_ij and L_ik is less than 10°.
• At least 50% of L_ij overlaps with L_ik, or at least 50% of L_ik overlaps with L_ij.
The pair of lines (L_ij, L_ik) forms a parallel. Along with the lines L_ij and L_ik, an
abstract representation in the form of two line segments is stored. Denote these line
segments by L_ij′ and L_ik′. The angle of L_ij′ and L_ik′ is the average of the angles of L_ij
and L_ik, weighted by the lengths of L_ij and L_ik. If necessary, L_ij and L_ik are extended
to form L_ij′ and L_ik′ so that they overlap completely. Figure 4.8 illustrates the extension
process. The representation (L_ij′, L_ik′) is used in the computation of parallel matches.
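A minimal sketch of the three parallel-formation tests follows (Python); the parameter max_width stands in for the maximum projected building width, and all names are illustrative.

    import numpy as np

    def forms_parallel(seg_a, seg_b, max_width, max_angle_deg=10.0):
        # seg_a, seg_b: 2x2 arrays of segment endpoints.
        da, db = seg_a[1] - seg_a[0], seg_b[1] - seg_b[0]
        len_a, len_b = np.linalg.norm(da), np.linalg.norm(db)
        # Acute angle between the segments must be under max_angle_deg.
        cosang = abs(np.dot(da, db)) / (len_a * len_b)
        if np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))) >= max_angle_deg:
            return False
        # Perpendicular distance of seg_b's midpoint from the line of seg_a.
        normal = np.array([-da[1], da[0]]) / len_a
        if abs(np.dot((seg_b[0] + seg_b[1]) / 2.0 - seg_a[0], normal)) >= max_width:
            return False
        # Overlap test: project seg_b onto seg_a's axis and require that
        # at least 50% of one of the segments overlaps the other.
        u = da / len_a
        ta = (0.0, len_a)
        tb = sorted((np.dot(seg_b[0] - seg_a[0], u), np.dot(seg_b[1] - seg_a[0], u)))
        overlap = max(0.0, min(ta[1], tb[1]) - max(ta[0], tb[0]))
        return overlap >= 0.5 * min(len_a, tb[1] - tb[0])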
While the task domain causes a large number of parallels in each view (two to three
times the number of lines in that view), because of the alignment of buildings, roads,
parking lots and shadows, the number of parallel matches is typically lower than the
number of lines in any view. A parallel match is hypothesized if there is evidence in at least
two views. When there are matching parallels in more than two views, a single parallel
match (called a maximal parallel match) is formed across all the views. The constraint used
in matching is the parallel match constraint described below:
• Parallel match constraint: Consider parallels P_ik, with component segments L_ik1
and L_ik2 in view_i, and P_jl, with component segments L_jl1 and L_jl2 in view_j.
The parallel match constraint is satisfied for this pair of parallels iff exactly one of
the following criteria is met:
• (L_ik1, L_jl1) and (L_ik2, L_jl2) are elements of S_lm
• (L_ik2, L_jl1) and (L_ik1, L_jl2) are elements of S_lm
Figure 4.8 Extension of lines forming a parallel
In the case of a (maximal) parallel match with constituent parallels from more than
two views, the parallel match constraint must be satisfied by parallels in every pair
of views that contributes constituent parallels. A parallel is not hypothesized in a
view that does not contribute a parallel to a maximal parallel match. Maximal
parallel matches are generated in order to ensure that duplicate parallel matches do
not occur. If there are a total of n views, parallel matches over the views are
represented as n-tuples. The set of parallel matches is denoted by S_pm. Thus if there
are 4 views and (P_11, P_21, nil, nil) and (P_11, nil, P_31, nil) are detected as parallel
matches, they are replaced by the maximal parallel match (P_11, P_21, P_31, nil) in
S_pm.
Note that a parallel in one view may match multiple parallels in other views, causing
multiple parallel matches for the same parallel. A parallel in Figure 4.9b) matches
4 distinct parallels in Figure 4.9a). The results of detecting and matching parallels in
the images of Figure 4.1 are shown in Figure 4.10.
U-contours
U-contours are used in one of the methods to hypothesize parallelogram matches.
Suppose a parallel in view_i is denoted by lines (l_i1, l_i3). If there exists a junction j in S_Ji
such that j may be denoted as either (l_i1, l_i2) or (l_i3, l_i2), then (l_i1, l_i2, l_i3) is a
potential U-contour, say U. U is a valid U-contour iff a major part of l_i2 lies between l_i1
and l_i3.
Figure 4.9 Example of multiple parallel matches
Parallelograms
Formation of parallelogram matches is the basis for hypothesizing building roofs.
The existence of evidence to form a parallelogram match is a strong indication that a
rectangular 3D structure exists. There are two distinct processes for hypothesizing
parallelogram matches. The first process, detailed in Section 4.2.1, needs a minimal set of
matching features in at least 2 images to be able to hypothesize a parallelogram match. An
independent process initiates a 3-D hypothesis if there is compelling evidence in a single
image. This process is described in Section 4.2.2, and is considered a backup to the first
process.
Figure 4.10 Parallel matches detected in images from Figure 4.1
4.2 Formation of flat-roof rectangular hypotheses
The system handles two or more views non-preferentially. All the features in the
hierarchy described in Section 4.1 are used in the formation of rectangular hypotheses.
Starting from the most primitive, they are lines, junctions, parallels, U-contours and
parallelograms (the projection of a rectangle is, in general, a parallelogram, assuming weak
perspective locally), in each image. Grouping and matching of features is performed at each
stage in the hierarchy to take advantage of the constraints that exist at each level. With
increasing complexity, the primitives become increasingly distinct given the additional
constraints available at each stage (from the previous stage), and have fewer
matches across the views. The hypothesis generation process for flat-roof buildings is
shown in Figure 4.11 and the results of the hypothesis generation process on the images in
Figure 4.1 are shown in Figure 4.16.
4.2.1 Generation of hypotheses from multiple images
Let S_pmi denote a parallel match in the set of parallel matches S_pm. Denote the
constituent parallels of S_pmi by P_ij, where j is the image number. For each P_ij, a search
is launched to determine the best closure of the parallel match as follows:
• Start from the center lines (shown in Figure 4.12) of the parallels in the parallel
match and search for possible closure lines from the center to each of the ends. A
closure line is one that lies between the lines forming the parallel, and which forms
an acute angle of less than 10° with the projection of the line that is orthogonal to
the parallel in 3D and parallel to the ground plane in 3D.
• The search is extended to a quarter of the length of the parallel beyond the ends of
the parallel. The lines being searched for are elements of the set of matched lines,
S_lm, and are thus guaranteed to have matches in at least one other view. Figure 4.12
depicts this search procedure in a single view.
• Detect possible closures by scoring the coverage of the gap between the parallel
lines. All closures covering greater than half the gap between the parallel lines are
considered. The score of a closure is the ratio of the sum of the lengths of the lines
forming the closure to the distance between the ends of the parallels. Figure 4.13
illustrates this concept. Note that the search is performed simultaneously on all the
images and hence the closures are matched as they are detected.
• Put qualifying closures into two sets: one for each half of the parallel match.
• Generate all combinations of closures from the two sets to hypothesize buildings.
Figure 4.14 depicts formation of hypotheses in one view.
It may be noted that the hypotheses generated by this process are 3-D hypotheses.
This is because the method starts with a parallel match. Given camera models, the
constituent parallels in a parallel match may be triangulated to a single 3D parallel. Further,
the method determines closures using lines with matches (elements of S_lm). Hence each of
the closures may be triangulated to form a 3D closure. It follows that the resulting
parallelogram match that is hypothesized may be triangulated to form a 3D rectangle. This
3D rectangle comprises the roof of a building. Thus the output of this method consists of
parallelogram matches (or building hypotheses).
Figure 4.11 Block Diagram of Flat-roof Hypothesis Formation
4.2.2 Generation of hypotheses from each image independently
Given the cluttered nature of aerial images, it is probable that some features detected
in one view may not have matches, or may not match sufficiently to initiate a hypothesis.
In order that these hypotheses are not ignored, the system has a hypothesis-generation
mechanism that is essentially monocular. It generates 2D parallelograms in each view
based on the U-contours in the view being considered. For each 2D parallelogram
hypothesis a search determines if there is (possibly fragmented) evidence in the other views
to support this hypothesis. This search is performed iteratively through the space
determined by epipolar geometry and by the 3-D height constraints on a hypothesis. When
a sufficient match is found a 3-D hypothesis is created. The following paragraphs describe
the process in detail.
Figure 4.12 Search for parallel closures
Figure 4.13 Score computation for parallel closures: Score = (l_1 + l_2) / l
Figure 4.14 Permutations of closures to form hypotheses
• For view_i let the set of U-contours be S_ui. Computation of U-contours is described
in Section 4.1.
• For each U-contour U (say (l_1, l_2, l_3)) in S_ui, search for line evidence to support
the side that could complete a parallelogram. The direction of the search is the same
as the orientation of l_1 (or l_3). The search starts at a distance of 1/4th of the length
of l_1 (or l_3) from the ends of l_1 (or l_3) and proceeds the same distance beyond l_1
(or l_3). Line segments that are approximately parallel to l_2, i.e. which form an
acute angle of less than 10° with l_2, and which lie in the search space, are collected.
• Group clusters of these line segments into colinear lines if possible. Each of these
clustered groups of lines causes a 2-D parallelogram hypothesis to be formed.
Denote a representative hypothesis by p_u. Figure 4.15 clarifies the search
procedure.
• For each 2-D parallelogram hypothesis, search the other views for matching
information. The aim of this search is to determine the best 3-D hypothesis by
considering the hypothesis as a whole, rather than as a sum of its parts. In other
words, it is advantageous to use a primitive of higher complexity (a parallelogram)
to match, rather than a multitude of lower-level primitives (U-contours, parallels,
lines), as such matches are more distinctive when they occur. The search procedure,
which is described in detail in Appendix A, varies the height parameter from the
ground height to the maximum height of a building in small increments and scores
the line evidence for the hypothesis in all the views at each height. Local maxima of
the scores are flagged as 3D hypotheses; a sketch of this sweep appears after this
list. For each 2-D hypothesis the following holds: if the search produced no maxima,
no 3-D hypothesis is formed; if the search produced maxima, the number of 3-D
hypotheses formed equals the number of maxima.
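A minimal sketch of the height sweep in the last step follows; it assumes a caller-supplied score_at(z) that totals the line evidence over all views with the hypothesis placed at height z.

    import numpy as np

    def height_sweep_maxima(score_at, z_ground, z_max, step=0.5):
        # Sweep the height parameter in small increments and record the
        # heights at which the evidence score is locally maximal.
        heights = np.arange(z_ground, z_max + step, step)
        scores = [score_at(z) for z in heights]
        maxima = []
        for k in range(1, len(scores) - 1):
            # Each local maximum of the evidence score flags a 3-D hypothesis.
            if scores[k] > scores[k - 1] and scores[k] >= scores[k + 1]:
                maxima.append(heights[k])
        return maxima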
4.2.3 Complexity of the hypotheses generation process
The hypothesis generation process requires the system to assimilate parallelogram
matches starting with edges detected by the Canny edge detector. The input to the system
is a set of edge segments. As we are looking for straight lines, the first operation of the
system is to colinearize these edge segments into lines. These lines are matched across
every pair of views to extract line matches. The matched lines take part in junction
detection and matching, parallel formation and parallel match formation, U-contour
formation, and parallelogram match hypothesizing. The following paragraphs
illustrate the salient features of each of these processes.
Figure 4.15 Search for U-contour completions
Figure 4.16 Hypotheses detected in images from Figure 4.1
• Segment colinearization: Let n_seg be the number of edge segments detected by the
edge detector. If s_len is the average length of a segment, a search for colinear edge
segments is launched away from each end of each segment to a distance of 1/4th of
its length, as noted in Section 4.1. The search is spatially hashed, i.e. it is aided by a
precomputed segment spatial hash that stores all segments in adjacent blocks of
pixels; a sketch of such a hash appears below. The size of the block is chosen to be
5 pixels × 5 pixels. This hash allows constant time access to features in a spatial
neighborhood, after which further filters may be applied as necessary. The check for
colinearity is a constant time operation. Hence the search takes O(s_len) time. Let m
be the average number of edge segments which are grouped into a line. There will be
O(n_seg / m) groups formed and each group takes O(s_len) time to do the search. The
process of grouping m edge segments into a line takes O(m) time. Therefore, the
segment colinearization process takes O(s_len · n_seg) time. Usually, s_len « n_seg.
This is performed in all views being considered.
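A minimal sketch of such a spatial hash (Python); the 5-pixel cell size follows the text, while the class and method names are illustrative.

    from collections import defaultdict

    class SpatialHash:
        # Buckets features by the grid cells their bounding boxes touch,
        # giving constant-time access to a spatial neighborhood.
        def __init__(self, cell=5):
            self.cell = cell
            self.grid = defaultdict(list)

        def insert(self, feature, x0, y0, x1, y1):
            # Register the feature in every cell its bounding box covers.
            for gx in range(int(x0) // self.cell, int(x1) // self.cell + 1):
                for gy in range(int(y0) // self.cell, int(y1) // self.cell + 1):
                    self.grid[(gx, gy)].append(feature)

        def near(self, x, y):
            # Candidates in the cell containing (x, y); further filters
            # (angle, distance) are applied by the caller.
            return self.grid[(int(x) // self.cell, int(y) // self.cell)]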
• Line matching: Let n_li be the number of lines formed in any view_i, for
i = 1..n_views, where n_views is the number of views. A line match is generated by
matching a line l_i, in view_i, with a line l_j, in view_j, where j ≠ i, subject to the
quadrilateral epipolar constraint. This matching involves a quadrilateral search. One
dimension of the quadrilateral, determined by the allowed 3-D height range, may be
considered a constant, say c_ij, between any two views view_i and view_j, while the
other dimension is directly proportional to the length of l_i, say len_li, i.e. O(len_li).
The search for matches in view_j for l_i is O(len_li · c_ij). The search for matches for
all lines in view_i is thus O(n_li · len_li · c_ij). As all ordered pairs of views are
considered, the complexity of line matching is

$$O\left(\sum_{i}\sum_{j \ne i} n_{li} \cdot len_{li} \cdot c_{ij}\right)$$

The product (len_li · c_ij) may be thought of as representing the area of the quadrilateral
being searched. This product is effectively reduced by using spatial hashing. Consider the
following typical example: if len_li = 30 and c_ij = 20, then by using a 5×5 spatial hash, the
effective search is limited to (len_li/5 · c_ij/5). Though the spatial hash does not reduce the
worst case complexity, the product (len_li · c_ij) can be safely approximated by a small
constant C. The average complexity of the line matching process is

$$O\left(\sum_{i} n_{li}\right)$$

• Junction Detection: Let n_li be the number of matched lines in view_i. For each line,
the junction detection process searches for other lines in view_i that could form
junctions with it. Thus junction formation is a quadratic process with a complexity
computed as follows:

$$O\left(\sum_{i} n_{li}^{2}\right)$$
• Junction matching: Let n_ji be the number of junctions formed in any view_i, for
i = 1..n_views, where n_views is the number of views. A junction match is generated
by matching a junction j_i, in view_i, with a junction j_j, in view_j, where j ≠ i,
subject to the epipolar constraint. Matching involves a search along the epipolar line
in view_j. The length of the segment searched is determined by the allowed 3-D
height range and may be considered a constant, say c_ij. The search for matches in
view_j for j_i is O(c_ij). The search for matches for all junctions in view_i is thus
O(n_ji · c_ij). As all ordered pairs of views are considered, the complexity of junction
matching is

$$O\left(\sum_{i}\sum_{j \ne i} n_{ji} \cdot c_{ij}\right)$$

• Parallel formation: If n_li is the number of matched lines in view_i, the parallel
formation process tries to form parallels from all pairs of lines. Though the
complexity of this operation is O(n_li²), the constraints that the lines be fairly
parallel, and that they be a reasonable distance apart, are inexpensive to verify, and
help eliminate most of the line-pairs before they are verified for overlap. The worst
case complexity of parallel formation is

$$O\left(\sum_{i} n_{li}^{2}\right)$$

• Parallel matching: Let n_pi be the number of parallels in any view_i. Each parallel,
p_i, comprises matched lines. By examining the lines matched in other views by
each line comprising p_i, it is possible to form parallel matches based on the
applicable constraints. If each line comprising p_i matches an average of m lines in
another view, the complexity of determining if p_i has matches in those views is
O(m²). Thus the complexity of parallel matching may be denoted by

$$O\left(\sum_{i} n_{pi} \cdot m^{2}\right)$$

It is instructive to note that the application of constraints for parallel matches is not
expensive, given the line match information already available, and hence this step
is not a time-consuming step when compared to some other processes.
• U-Contour formation: If n_pi is the number of parallels in view_i, and len_pi is the
length of a parallel p_i in view_i, the complexity of U-contour formation for p_i is
denoted by O(len_pi), as it involves a search for a U-completion, which is linearly
dependent on len_pi. Thus the complexity of U-contour formation is given by

$$O\left(\sum_{i} len_{pi} \cdot n_{pi}\right)$$

As is the case with line matching, spatial hashing reduces the average complexity
of the process, though it does not reduce the worst case complexity.
• Parallelogram formation and matching from parallel matches: In this process,
if n_pm is the number of parallel matches, p_m is any parallel match, and len_pm is
the maximum length of the constituent parallels forming p_m, then the complexity
of parallelogram formation and matching from parallel matches is

$$O(n_{pm} \cdot len_{pm})$$

• Parallelogram formation and matching from U-contours: If n_ui is the number
of U-contours formed in view_i, u_i is any U-contour in view_i with parallel p_i as
its basis, and the length of p_i is len_pi, then the complexity of parallelogram
formation from U-contours is given by

$$O\left(\sum_{i} len_{pi} \cdot n_{pi}\right)$$

The matching process for parallelograms is equivalent to matching the four sides of
the parallelogram. The complexity of matching each side is the same as that of line
matching. Thus the complexity of parallelogram matching is O(n_ui).
Summing this up, the complexity of the hypothesis formation process is

$$O\left(\sum_{i}\left(n_{li} + n_{li}^{2} + n_{pi} \cdot m^{2} + n_{pi} \cdot len_{pi} + n_{pm} \cdot len_{pm} + n_{ui} \cdot len_{pi}\right) + \sum_{i}\sum_{j \ne i} n_{ji} \cdot c_{ij}\right)$$

Assuming m is a constant and len_pi, len_pm, and c_ij are bounded by some constant, this
reduces to

$$O\left(\sum_{i}\left(n_{li}^{2} + n_{pi} + n_{pm} + n_{ui} + n_{ji}^{2}\right)\right)$$
Assuming that n_li > n_pi, n_pm, n_ui, n_ji, and considering that n_views is usually 2 or 3,
the summation over i may be ignored to yield a complexity of

$$O(n_{li}^{2})$$

Run times on a Sparc Ultra 2 for the different processes for the example in Figure 4.1 are
included below:

Process                   Feature               Run Time (seconds)   Percentage of Time
Edge Extraction           Edge                  7.3                  8.77%
Line Formation            Line                  6.8                  8.17%
Line Matching             Line match            5.4                  6.49%
Junction Detection        Junction              4.7                  5.65%
Junction Matching         Junction match        3.0                  3.61%
Parallel Formation        Parallel              8.1                  9.74%
Parallel Matching         Parallel match        4.5                  5.41%
U-Contour Formation       U-Contour             3.5                  4.21%
Hypothesis Selection      Parallelogram match   25.6                 30.77%
Hypothesis Verification   Parallelogram match   14.3                 17.19%
TOTAL                                           83.2                 100.0%

4.3 Selection of flat-roof rectangular hypotheses
The generation mechanisms described in Section 4.2.1 and Section 4.2.2 are rather
liberal in the application of the available constraints. Hence a fair number of hypotheses are
usually formed. Typically, the number of 3-D hypotheses is O(number of line matches). A
selection process filters the hypothesized 3-D buildings based on evidence that is collected.
Though it is possible to verify 3-D hypotheses using all the available cues, i.e. roofs, walls
and shadows, the system uses only the roof evidence at this stage, as complete verification
is an expensive process. A less expensive selection process, based on easily derivable
evidence, is used to reduce the number of hypotheses fairly significantly, while eliminating
few “good” hypotheses. The results of hypothesis selection for the images shown in Figure 4.1
are shown in Figure 4.17. The nature of the evidence used is discussed in Section 4.3.1 and
the method for using this evidence in Section 4.3.2.
4.3.1 Accumulating Selection Evidence
• Positive roof evidence:
At the time of running the selection process, the hypothesized building is
represented as a 3-D rectangular block, i.e. a 3-D model is available. Hence, its
projection in each view is available as well (as camera models are known). Positive
roof evidence consists of line segments in each view that support the hypothesis.
Consider an arbitrary view, say view_i. Let the projected parallelogram of the 3-D
hypothesis be (p_i1, p_i2, p_i3, p_i4). A spatially-indexed search is used to detect line
segments, such that each line segment l_i satisfies the following criteria:
• l_i must form an acute angle of less than 10° with one of the four sides of the
parallelogram
• the perpendicular distance of the midpoint of l_i from any one of the sides of the
parallelogram is less than 4 pixels
• a major part (> 50%) of l_i overlaps with the side of the parallelogram it is closest
to
Figure 4.17 Selected hypotheses from the images in Figure 4.1
The roof positive score is initialized to zero. The contribution of a line l in the set
of positive evidence lines to the positive score is the ratio of the length over which l
overlaps with the nearest side of the roof hypothesis to the perimeter. This
automatically weights longer sides (and evidence supporting them) more than
shorter sides. More rigorously, consider a roof hypothesis H, with projection in
view_i denoted by H_i. Denote the number of views by n_views. Let the perimeter of
H_i be perim_i. Denote the positive roof score by posRoofScore, and its contribution
from view_i by posRoofScore_i. If {l_ij} is the set of lines contributing to
posRoofScore_i and {l_ij^ovlp} is the set of parts of each element of {l_ij} that
overlap with the nearest side of H_i, then

$$posRoofScore_i = \frac{\sum_{l \in \{l_{ij}^{ovlp}\}} \mathrm{length}(l)}{perim_i}$$

and

$$posRoofScore = \sum_{i=1}^{n_{views}} posRoofScore_i$$

• Negative roof evidence:
Consider an arbitrary view, view_i. Let the projected parallelogram of the 3-D
hypothesis be (p_i1, p_i2, p_i3, p_i4). A spatially-indexed search is used to detect line
segments, such that each line segment l_i satisfies the following criteria:
• l_i must intersect at least one of the four sides of the parallelogram
• l_i must form an acute angle of greater than 30° with a side of the parallelogram
that it intersects
• a major part (> 50%) of l_i overlaps with a side of the parallelogram it intersects
The roof negative score is initialized to zero. The contribution of a line l in the set
of negative evidence lines to the negative score is the ratio of the length of l to the
perimeter. Consider a roof hypothesis H, with projection in view_i denoted by H_i.
Denote the number of views by n_views. Let the perimeter of H_i be perim_i. Denote
the negative roof score by negRoofScore, and its contribution from view_i by
negRoofScore_i. If {l_ij} is the set of lines contributing to negRoofScore_i, then

$$negRoofScore_i = \frac{\sum_{l \in \{l_{ij}\}} \mathrm{length}(l)}{perim_i}$$

and

$$negRoofScore = \sum_{i=1}^{n_{views}} negRoofScore_i$$

This operation is performed on all the views available, and the evidence, i.e. the line
segments, is grouped by view for later evaluation. Figure 4.18 illustrates the
concepts of positive and negative line evidence in one view for a flat-roof
hypothesis.
• Height inference:
Having access to both the ground height and a 3-D model of the hypothesis, it is
simple to derive the 3-D height of the hypothesis above the surrounding surface
(assumed to be a flat plane).
Figure 4.18 Positive and negative line evidence
4.3.2 The selection mechanism
Hypotheses have a user-determined range of heights that they must lie within;
typically the hypotheses must be taller than 2m and shorter than 20m. The height evidence
is applied at this stage to filter hypotheses that do not fall within these parameters. It is
helpful to note that even though the height criterion was applied in the line-matching stage,
a tolerance is allowed on both the minimum and the maximum height; this is an instance
of delaying decisions till later. Hypotheses that satisfy the height criterion are examined for
roof evidence. This process is detailed below.
The "goodness" of the roof is measured by weighing the amount of positive evidence
it has against the negative evidence. This evidence is encapsulated into a "roof score",
which allows the hypotheses to be thresholded. The roof score is computed using the
positive and negative roof evidence.
The maximum score per view is 1.0 and occurs when the hypothesis has perfect line
support. The roof score is computed as the difference between the positive and negative
scores. This is done over all the views. The encapsulated roof score is denoted by roofScore
and is computed as follows:

$$roofScore = posRoofScore - negRoofScore$$

A hypothesis is selected iff the following holds (a sketch of this thresholding appears
below):
• posRoofScore / n_views > 0.5, and
• (posRoofScore − negRoofScore) / n_views > 0.3
4.3.3 Complexity of the hypothesis selection process
The selection process consists of scoring the evidence that supports the roofs of
hypotheses in every available view. The collection of the evidence for each hypothesis is
proportional to the sum of the perimeters of the 2-D projections that comprise a 3-D
hypothesis. The perimeter of a hypothesis is bounded by the maximum length of any side
of a building (which is a parameter in the system). If it is assumed that the average perimeter
of a 2-D projection of a 3-D hypothesis is P, there are n_views available views, and the
number of hypotheses is n_all, the complexity of the selection process is
O(P · n_all · n_views). P is a constant, and n_views is small (usually 2 or 3) and may be
assumed to be a constant. Hence the worst case complexity is O(n_all).
4.4 Formation of gable-roof rectangular hypotheses
For gable-roof rectangular hypotheses, the hierarchy of features used is the same up
to and including parallel formation. The parallels formed in each view are used to
hypothesize triples. The reason for hypothesizing triples is to model the "spine" as well as
the sides of the gable roof. The output of the process consists of the 3D gable-roof
hypotheses in a world coordinate system. Figure 4.19 shows two views of an aerial scene
containing symmetric gable-roof buildings.
Triples
Triples are derived from the set of parallels already constructed, as follows: suppose
parallel P_ik in view_i comprises line segments L_ik1 and L_ik2. A search is launched for
line segments L_ik3 that lie either between L_ik1 and L_ik2, or on the side of L_ik1, or on
the side of L_ik2. L_ik3 must have an orientation similar to L_ik1 and L_ik2, i.e. the acute
angle between L_ik3 and L_ik1 (or L_ik2) is less than 10°. Figure 4.20 shows 3 possible
cases for triple formation from a single parallel. Figure 4.21 shows the triples formed in the
images shown in Figure 4.19.
Figure 4.19 Two views of an aerial scene with symmetric gable-roof buildings
Figure 4.20 Cases of triple formation from a parallel
Figure 4.21 Triples formed in the images from Figure 4.19
4.4.1 Generation of symmetric gable-roof hypotheses
The generation of a gable-roof hypothesis is initiated by a triple in any image. Section
4.1 describes the method and the constraints used in the formation of triples. Triples are
examined to see if they could give rise to valid 3-D hypotheses as follows. Given a camera
model and the ground height, the relative height in 3-D of the "spine" of the gable roof with
respect to the sides in that view may be derived as follows. Assuming that the gable is
symmetrical in 3-D, if the spine and the sides were at the same height, they would project
to equidistant parallel lines in the image (as the projection is locally weak perspective).
However, as the spine is higher than the sides, it must be displaced in the direction of the
projection of the vertical (in 3-D) in that image. The extent of displacement from the center,
in the image, affords an estimate of the relative height of the spine with respect to the sides.
This constraint is extremely important in filtering out triples that could not give rise to a
valid gable hypothesis, as it does not rely on matching information from other views.
Figure 4.23 depicts this process.
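A minimal sketch of this estimate (Python); the image-to-height scale factor is assumed to come from the camera model, and all names are illustrative.

    import numpy as np

    def spine_relative_height(side1_pt, side2_pt, spine_pt, vert_dir, scale):
        # side1_pt, side2_pt, spine_pt: 2-D points on the two sides and the
        # spine of a triple (e.g. midpoints). vert_dir: unit 2-D direction
        # of the projected 3-D vertical. scale: assumed pixels-per-meter
        # factor for height displacement, derived from the camera model.
        midpoint = (np.asarray(side1_pt) + np.asarray(side2_pt)) / 2.0
        displacement = np.asarray(spine_pt) - midpoint
        # Component of the spine's displacement along the projected
        # vertical; a non-positive value flags an invalid triple.
        return float(np.dot(displacement, vert_dir)) / scale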
For each triple in this set a search is launched for matching sets of lines in the other
views. This search is performed iteratively through the epipolar space defined by the
camera geometry and the maximum and minimum allowed heights of a building. In
essence, at each iteration (height) the search scores the goodness of the supporting line
features for a hypothesis over all the views. Maxima of these scores are used to hypothesize
gable-roof hypotheses at the heights at which the maxima occur. Appendix A.2 outlines this
process for gable-roof hypotheses. At this point the system possesses 3-D information for
the spine and the sides, but no information about the extent of the hypothesis, i.e. closures
for the hypothesis still need to be determined.
Closures for gable-roof hypotheses are determined by searching for terminating
junctions on the lines forming the spine and the sides of the roof in all the views. The system
uses binary junctions. If a junction j is to be a terminator, the lines that form j, say l_1 and
l_2, must meet the following constraints in order to qualify as a terminator for the
hypothesis:
• either l_1 or l_2 should be a component of exactly one of the lines forming the triple
• if l_1 is a component of one of the lines in the triple, l_2 is constrained to be the
projection of a 3D line, say L_2, that is perpendicular in 3D to the 3D spine and to
the 3D side closest to L_2. The system has computed the 3D orientations of the sides
and the spine. The location of the 3D junction J, corresponding to its projection j,
must lie on the 3D spine or on one of the sides of the gable in 3D. The terminator of
the gable is uniquely defined by these constraints. This terminator must correspond
to L_2, and its projection to l_2. If the acute angle between the projected terminator
and l_2 is less than 10°, l_2 is a valid terminator. Figure 4.22 illustrates this concept.
In general we find more than one termination at each end of the roof. The junctions
with the highest scores at each end that are valid terminators are selected as the terminators
for this hypothesis. If no terminators are found, or if terminators for only one side are found,
the hypothesis is rejected. Figure 4.24 illustrates the formation of gable-roof hypotheses.
Figure 4.25 shows the results of gable-roof formation on the images in Figure 4.19.
4.4.2 Complexity of the hypotheses generation process
The hypothesis generation process requires the system to construct triple matches
(that model the gable-roofs of the buildings) starting with edges detected by the Canny edge
detector. The system uses parallels as a basis for hypothesizing triples. As parallels have
been precomputed (to handle the case of flat-roof buildings) the complexity of generating
gable-roof hypotheses is reduced to that of hypothesizing triple matches from parallels.
Figure 4.22 Search for gable-roof terminators
Figure 4.23 Displacement of the spine of a gable in the direction of the projection of the vertical
Figure 4.24 Block Diagram of Gable-roof Hypothesis Formation
• Triple formation from parallels: If n_pi is the number of parallels formed in view_i,
and each parallel forms an average of m triples, the number of triples formed is
n_pi · m. Let n_li be the number of lines in view_i. The search for a line that could
form a triple with any parallel has complexity n_li. Thus the complexity of triple
formation is m · n_pi · n_li. In practice m is usually less than 1. This reduces the worst
case complexity to O(Σ_i n_pi · n_li) when summed over all the views. Though this is
a quadratic form, the parallel-checking operation is inexpensive. In practice this step
does not consume much time when compared to the selection and verification
processes.
• Triple matching: The triple matching process is very similar to the parallelogram
matching process outlined in Section 4.2.3. The triple matching process is linear in
the number of triples present in all the views. If n_triple denotes the number of triples
in all views, the complexity of this operation is O(n_triple).
Figure 4.25 Gable-roof hypotheses for the images in Figure 4.19
4.5 Selection of gable-roof hypotheses
The generation mechanism described in Section 4.4.1 is more constrained than the
mechanism for flat-roof hypotheses. The ratio of triples to good gable-roof hypotheses is
much higher than the ratio of parallels (or even parallel matches) to flat-roof hypotheses. A
selection process filters the hypothesized 3-D gable-roof buildings based on evidence that
is collected. This process applies constraints less stringently than flat-roof building
selection does, because more components are needed to form a gable-roof hypothesis,
consequently increasing the possibility of missing or insufficient evidence for individual
components. Selected gable-roof hypotheses for the images in Figure 4.19 are shown in
Figure 4.26.
4.5.1 Selection evidence accumulated
• Positive roof evidence:
At the time of applying the selection process, a 3-D model of the hypothesized
gable-roof building is available. Hence, its projection in each view is available as
well (as camera models are known). Positive roof evidence consists of line segments
in each view that support the hypothesis.
Consider an arbitrary view, view_i. Let the projected gable-roof of the 3-D
hypothesis be (p_i1, p_i2, p_i3, p_i4, p_i5, p_i6). A spatially-indexed search is used to
detect line segments, such that each line segment l_i satisfies the following criteria:
• l_i must form an acute angle of less than 10° with one of the six sides of the
gable-roof, or with the spine (p_i2, p_i5)
• the perpendicular distance of the midpoint of l_i from any one of the sides of the
gable-roof, or from the spine, is less than 4 pixels
• a major part (> 50%) of l_i overlaps with the side of the gable-roof (or the spine)
it is closest to
The roof positive score is initialized to zero. The contribution of a line l in the set
of positive evidence lines to the positive score is the ratio of the length over which l
overlaps with the nearest side of the roof hypothesis to the sum of the perimeter and
the length of the spine. This automatically weights longer sides (and evidence
supporting them) more than shorter sides. More rigorously, consider a roof
hypothesis H, with projection in view_i denoted by H_i. Denote the number of views
by n_views. Let the sum of the perimeter and the length of the spine of H_i be
perim_i. Denote the positive roof score by posRoofScore, and its contribution from
view_i by posRoofScore_i. If {l_ij} is the set of lines contributing to posRoofScore_i
and {l_ij^ovlp} is the set of parts of each element of {l_ij} that overlap with the
nearest side of H_i, then

$$posRoofScore_i = \frac{\sum_{l \in \{l_{ij}^{ovlp}\}} \mathrm{length}(l)}{perim_i}$$

and

$$posRoofScore = \sum_{i=1}^{n_{views}} posRoofScore_i$$

• Negative roof evidence:
Consider an arbitrary view, view_i. Let the projected gable-roof of the 3-D
hypothesis be (p_i1, p_i2, p_i3, p_i4, p_i5, p_i6). A spatially-indexed search is used to
detect line segments, such that each line segment l_i satisfies the following criteria:
• l_i must intersect at least one of the six sides of the gable-roof, or intersect the
spine (p_i2, p_i5)
• l_i must form an acute angle of greater than 30° with a side of the gable-roof (or
the spine) that it intersects
• a major part (> 50%) of l_i overlaps with a side of the gable-roof (or the spine) it
intersects
The roof negative score is initialized to zero. The contribution of a line l in the set
of negative evidence lines to the negative score is the ratio of the length of l to the
sum of the perimeter and the length of the spine. Consider a roof hypothesis H, with
projection in view_i denoted by H_i. Denote the number of views by n_views. Let the
sum of the perimeter and the length of the spine of H_i be perim_i. Denote the
negative roof score by negRoofScore, and its contribution from view_i by
negRoofScore_i. If {l_ij} is the set of lines contributing to negRoofScore_i, then

$$negRoofScore_i = \frac{\sum_{l \in \{l_{ij}\}} \mathrm{length}(l)}{perim_i}$$

and

$$negRoofScore = \sum_{i=1}^{n_{views}} negRoofScore_i$$

This operation is performed on all the views available, and the evidence, i.e. the line
segments, is grouped by view for later evaluation. Figure 4.27 illustrates the
concepts of positive and negative line evidence in one view for a gable-roof
hypothesis.
• Height inference:
Having access to both the local ground height and a 3-D model of the hypothesis, it
is simple to derive the 3-D height of the hypothesis above the local ground height.
This is done for the sides of the gable-roof and for the spine.
Figure 4.26 Selected gable-roof hypotheses for the images in Figure 4.19
4.5.2 The selection mechanism
Gable-roof hypotheses have more constraints than flat-roof hypotheses, which allows
for a more effective selection process. The system restricts the height of the spine to be
between 10% and 100% of the maximum height specified by the user. The height of the
sides (both sides are assumed to be at the same height) must not be lower than the spine by
more than the minimum of 10% of the maximum height of a building and 2m. The height
evidence is applied at this stage to filter hypotheses that do not fall within these parameters.
This process is detailed below.
The "goodness" of the roof is measured by weighing the amount of positive evidence
it has against the negative evidence. This evidence is encapsulated into a "roof score",
which allows the hypotheses to be thresholded. The roof score is computed using the
positive and negative roof evidence, including the positive and negative contributions of
the spine of the gable-roof.
The maximum score per view is 1.0 and occurs when the hypothesis has perfect line
support. The roof score is computed as the difference between the positive and negative
scores. This is done over all the views. The encapsulated roof score is denoted by roofScore
and is computed as follows:

$$roofScore = posRoofScore - negRoofScore$$

A hypothesis is selected iff the following holds:
• posRoofScore / n_views > 0.4, and
• (posRoofScore − negRoofScore) / n_views > 0.25
Figure 4.27 Positive and negative line evidence for gable-roof hypotheses
4.5.3 Complexity of the hypothesis selection process
The selection process consists of scoring the evidence that supports the roofs of the
(gable-roof) hypotheses in every available view. The collection of the evidence for each
hypothesis is proportional to the sum of the perimeters (including the spine of the
gable-roof) of the 2-D projections that comprise a 3-D hypothesis. The perimeter of a
hypothesis is bounded by the maximum length of any side of a building (which is a
parameter in the system). If it is assumed that the average perimeter of a 2-D projection of
a 3-D hypothesis is P, and the number of triples formed in all the views is n_triple, the
worst-case complexity of the selection process is O(P · n_triple). P is usually bounded.
Hence the average complexity is O(n_triple).
Chapter 5
Verification of Hypotheses
In Chapter 4, detected lines are used to build up a feature hierarchy of increasing
complexity towards the goal of hypothesizing buildings. The building hypothesis process
adopts a strategy of low commitment, which allows a large number of hypotheses to be
formed. The selection mechanism, covered in Chapter 4, demonstrates that the use of roof
evidence and the predicted 3D height from stereo are fairly good filters for removing
hypotheses that satisfy the geometric constraints necessary for being declared a building
but do not possess sufficient domain-specific support to qualify as valid buildings.
However, use of roof evidence alone is insufficient, in general, to effectively filter out all
the hypotheses that are not buildings. The system searches for evidence of walls and
shadows cast by a hypothesized building. In addition to the evidence of features supporting
or negating a roof hypothesis, statistical properties of the regions of the hypothesized roof
and the shadows cast are factored in.
In Section 5.1 the verification process for flat-roof building hypotheses is described.
Section 5.2 outlines the verification process for gable-roof hypotheses. At this stage
mutually exclusive overlapping or contained hypotheses (both flat-roof and gable-roof)
may exist. Section 5.3 outlines the overlap disambiguation process. Section 5.4 defines
what the system returns as a 3-D description of the scene.
5.1 Overview of the verification process for flat-roof hypotheses
Verified flat-roof hypotheses are filtered from the set of selected flat-roof building
hypotheses. The selection process involved positive and negative evidence for roofs alone.
This allowed many hypotheses without sufficient roof evidence to be filtered out. In the
verification process, the system uses the available geometric, photometric and domain-
specific constraints such as expected shadow and wall lines, to determine whether a
selected hypothesis is a building or not. The exact methods for quantifying the
contributions of roof evidence, wall evidence and shadow evidence are described in Section
5.1.1, Section 5.1.2 and Section 5.1.3 respectively. The combination of these factors is
described in Section 5.1.4.
5.1.1 Roof Evidence
Roof evidence for a hypothesis from the selection process, computed in Section
4.3.1, is used in the verification process as well. However, it is not the only criterion. It is
combined with wall and shadow evidence (as described in Section 5.1.4) to determine
whether a hypothesis is a 3-D flat-roof building.
5.1.2 Wall Evidence
In each view that is not nadir, at least one and not more than two of the side walls
of a building will be visible. Figure 5.1a depicts the case when one wall is visible while
Figure 5.1b depicts the case when two walls are visible. The walls are assumed to be
vertical in 3-D. The verification for walls involves looking for the projections of the
horizontal bottom of the wall (the interface of the vertical wall and the ground). It may be
noted that the system possesses a 3D model of each building being verified, as was
demonstrated in Section 4.2; thus the 3D height of each hypothesized building is known.
Using the camera models, the projection of the vertical direction in 3D is computed. From
the top of the wall to the bottom, a search for line evidence parallel to the side of the
hypothesized building is performed in incremental steps. Figure 5.2 illustrates this
concept. Wall evidence is deemed to be found if there is evidence of parallel lines at the
distance from the top of the building that is predicted by its height in 3D. The score
associated with this evidence is computed as follows:
In view_i, let the projection of a 3-D hypothesis H be H_i, and let the expected length
of the visible wall of H_i be wallexpected_i. The incremental search returns the maximum
wall evidence for this hypothesis in view_i. In computing the actual wall evidence, the
search scores wall evidence lines that overlap with a side of the building once for the length
over which the wall evidence lines overlap each other. In other words, wall evidence is the
total length of overlap regardless of how many (possibly overlapping) wall evidence lines
cover that length. Suppose the length over which the evidence overlaps with the sides of
the roof is wallactual_i. Then the wall evidence wallev_i, for H_i, is computed by

$$wallev_i = \frac{wallactual_i}{wallexpected_i}$$

and the total wall evidence, wallev, for H over n_views views is

$$wallev = \sum_{i=1}^{n_{views}} wallev_i$$

The maximum wall score per view is 1.0 and corresponds to the case when actual wall line
evidence covers the entire length of the sides where wall evidence is expected.
Figure 5.1 Visible walls
Figure 5.2 Incremental search for wall evidence
5.1.3 Shadow Evidence
A 3-D building structure should cast shadows under suitable imaging conditions. The
system possesses knowledge of the direction of illumination from the sun, which in turn
allows it to predict the location and orientation of shadows (on flat ground) from the 3-D
hypotheses. Shadows have previously been used in monocular detection of buildings. In the
case of the system presented in this dissertation, the analysis is made easier as it has an
estimate of the height of the building before it searches for shadows. Even though the 3D
height of the building is known, a search for shadows is launched within a 5 pixel window
about the predicted shadow. This is because 3D heights are obtained from triangulation and
small changes in 2D position (in the images) cause large changes in 3D height estimates.
The method used for shadow detection in this thesis is similar to that used in [57] in that
there is a search for evidence of shadows, and this evidence consists of the shadow lines
that should be cast by the building. In the system presented here an estimate of where the
shadow should exist is available given the 3D height of the hypothesis and the sun angles.
In [57] the system does not have any 3D height information and hence has to do a much
more exhaustive search. Figure 5.3 shows the search for shadows.
The search for shadows is carried out in an incremental manner similar to that for
wall evidence. Knowing the direction of illumination, a search is performed to detect
evidence of the predicted projection of the shadow. This includes the shadow cast by the
horizontal roof lines, and the shadow cast by the vertical walls of the building. The evidence
is in the form of detected lines at or near the outline of the predicted shadow. The
illumination
direction
shadow lines
cast by roof
shadow lines
cast by vertical lines
shadow region
statistics
Figure 5.3 Search for shadow evidence
66
“goodness” of the shadow evidence is captured in a shadow score. This score is the fraction
of the visible shadow outline that has supporting line evidence. When shadow evidence
lines overlap, the contribution of the part of the lines that overlap is counted once towards
the shadow score. Occlusion of shadows by the building itself is taken into consideration
when searching for shadows. In other words, the predicted shadow is the set of shadow lines
that should be visible from the given viewpoint, after accounting for what part of these lines
will be occluded by the building itself, and assuming that no other occlusion of shadow
lines takes place. Shadow evidence is computed as follows:
In view_i, let the projection of a 3-D hypothesis H be H_i, and let the expected length
of the visible shadow from the horizontal roof as well as the vertical walls of H_i be
shadowexpected_i. The incremental search returns the maximum shadow evidence for H_i
in view_i. Suppose the length over which the actual evidence overlaps with the predicted
shadow is shadowactual_i. Then the shadow evidence shadowev_i, for H_i, is computed by

$$shadowev_i = \frac{shadowactual_i}{shadowexpected_i}$$

and the total shadow evidence, shadowev, for H over n_views views is

$$shadowev = \sum_{i=1}^{n_{views}} shadowev_i$$

The maximum shadow score per view is 1.0 and corresponds to the case when actual
shadow evidence covers the entire predicted shadow.
5.1.4 Combination of Roof, Wall and Shadow Evidence
The method for determining whether a selected flat-roof hypothesis should be
verified or not is dependent on the evidence detected to support or negate the hypothesis.
This evidence, namely roof, shadow and wall evidence is used in different ways to verify a
hypothesis and to disambiguate overlapping hypotheses. The verification process uses a
decision tree involving the evidence available.
• Use of Roof, Wall and Shadow Evidence in Verification:
The algorithm used to verify a hypothesis is outlined below:
Suppose that a 3-D hypothesis has a roof score of r (as computed in Section 4.3.1),
a shadow score equal to s and a wall score that equals w. If there are n_views views
being considered then
• if r ≥ 0.75 * n_views the hypothesis is verified and the verification process exits.
Experimental evidence indicates that not more than 30% of the (eventually
verified) buildings satisfy this criterion.
• if r < 0.75 * n_views, shadow evidence is considered.
• if s ≥ 0.25 * n_views, or if the shadow evidence s_i ≥ 0.5 in some view_i, the
hypothesis is verified and the verification process exits.
• if s < 0.25 * n_views and s_i < 0.5 for all i=1..n_views and r ≥ 0.5 * n_views and if
w ≥ 0.5 * n_views, then the hypothesis is verified and the verification process
exits.
• if s < 0.25 * n_views and s_i < 0.5 for all i=1..n_views and r < 0.5 * n_views, the
hypothesis is not verified and the verification process exits.
• if none of the conditions enumerated above are satisfied the hypothesis is not
verified and the verification process exits.
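This decision tree transcribes directly into code. The following is a minimal Python
sketch, assuming the total and per-view scores have already been computed; the function
name, the s_per_view calling convention and the threshold parameters are illustrative
rather than the system's actual interface. Parameterizing the thresholds also lets the same
function serve the gable-roof case of Section 5.2.4.

    def verify_hypothesis(r, s, w, s_per_view, n_views,
                          t_roof=0.75, t_shad=0.25, t_shad_i=0.5,
                          t_roof_lo=0.5, t_wall=0.5):
        """Decision tree for hypothesis verification (flat-roof defaults).
        r, s, w    -- total roof, shadow and wall scores over all views
        s_per_view -- the per-view shadow scores s_i
        Returns True if the hypothesis is verified."""
        if r >= t_roof * n_views:
            return True     # strong roof evidence alone suffices
        if s >= t_shad * n_views or any(si >= t_shad_i for si in s_per_view):
            return True     # strong shadow evidence
        if r >= t_roof_lo * n_views and w >= t_wall * n_views:
            return True     # moderate roof evidence backed by wall evidence
        return False        # insufficient evidence; hypothesis discarded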
Figure 5.4 depicts this process as a decision tree. The thresholds
stated above were arrived at through testing on different sets of data with widely differing
image and object characteristics. These criteria were then fixed, and the tests conducted on two
large sites, Fort Hood and Fort Benning. Optimal combination of these criteria is itself an
interesting subject of study. The results of verification of hypotheses generated from the
pair of images in Figure 4.1 are shown in Figure 5.5.
The roof, shadow and wall scores are combined using a weighted sum to
yield a confidence measure. If r is the roof score, s is the shadow score, and w is the wall
score of a building hypothesis, the equation used to obtain a verification score (or
confidence measure), v, for a verified hypothesis is:

    v = (r_wt * r) + (s_wt * s) + (w_wt * w)     (5.1)

where r_wt = 0.6, s_wt = 0.3 and w_wt = 0.1. Also r_wt + s_wt + w_wt = 1. The reason for
weighting wall evidence lower than roof and shadow evidence is that it is usually less
reliable. This confidence measure is used in the overlap disambiguation covered in
Section 5.3.
5.1.5 Complexity of the hypothesis verification process
Let n_sel be the number of parallelograms selected by the selection process. For each
selected parallelogram, four procedures have to be executed by the verification
process:
• Roof verification process: The roof evidence used has already been computed.
Hence this process need not be considered for complexity analysis of the verification
process.

Figure 5.4 Decision Tree for verification
• Wall verification process: The wall verification process searches for and evaluates
evidence collected from the nearby area, and its cost is directly proportional to the
perimeter of the hypothesis in each view. If P is the average perimeter of a hypothesis
in any view and n_views is the number of views, then the complexity of this process is
O(P.n_sel.n_views). P is bounded by the maximum side of any hypothesis. Hence the
worst-case complexity of this process is O(n_sel.n_views). Note that for most runs the
system uses n_views = 2. Thus the process has an average complexity of O(n_sel).
• Shadow verification process: The shadow verification process searches for the
best shadow evidence over a given range. This process is similar to the wall
verification process, and the complexities of the two processes are identical. Hence,
on average, the shadow verification process runs in O(n_sel) time. The worst-case
complexity is O(n_sel.n_views).
• Containment and Overlap Analysis: Since a spatial index is used to check the
containment or overlap situations between hypotheses, it takes only constant time
for a hypothesis to find the overlapping hypotheses. For each pair of overlapping
hypotheses, constant time is required to check all the evidence and make a decision.
Therefore, only O(n_sel) time, instead of O(n_sel^2) time, is necessary in the worst
case to finish this analysis.
Figure 5.5 Verified hypotheses for the images in Figure 4.1
Therefore, the verification process takes O(n_sel + n_sel + n_sel) = O(n_sel) time on
average. Thus the verification process has a time complexity that is linear in the number
of selected hypotheses n_sel.
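The constant-time overlap lookup above rests on spatial hashing of hypotheses in the
(x, y) domain. A minimal sketch of such a grid-based spatial index is given below; it
assumes axis-aligned ground-plane bounding boxes, and the class and method names are
illustrative rather than the system's actual interface.

    class SpatialHash:
        """Grid-based spatial index over ground-plane bounding boxes.
        A hypothesis is registered in every grid cell its box touches,
        so overlap candidates are found by probing a few cells rather
        than scanning all n_sel hypotheses."""

        def __init__(self, cell_size):
            self.cell_size = cell_size
            self.cells = {}  # (ix, iy) -> list of (hyp_id, bbox)

        def _cells_for(self, bbox):
            xmin, ymin, xmax, ymax = bbox
            c = self.cell_size
            for ix in range(int(xmin // c), int(xmax // c) + 1):
                for iy in range(int(ymin // c), int(ymax // c) + 1):
                    yield (ix, iy)

        def insert(self, hyp_id, bbox):
            for cell in self._cells_for(bbox):
                self.cells.setdefault(cell, []).append((hyp_id, bbox))

        def candidates(self, bbox):
            """Ids of hypotheses whose boxes intersect the given box."""
            found = set()
            for cell in self._cells_for(bbox):
                for hyp_id, other in self.cells.get(cell, []):
                    if (other[0] <= bbox[2] and bbox[0] <= other[2] and
                            other[1] <= bbox[3] and bbox[1] <= other[3]):
                        found.add(hyp_id)
            return found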
5.2 Overview of the verification process for gable-roof hypotheses
Verified gable-roof hypotheses are filtered out from the set of selected gable-roof
building hypotheses. The selection process involved positive and negative evidence for
roofs alone. In the verification process, the system uses the available geometric and domain
knowledge to determine whether a selected gable-roof hypothesis is a building or not.
These methods are very similar to those applied to flat-roof hypotheses. The exact methods
for quantifying the contributions of roof evidence, wall evidence and shadow evidence are
described in Section 5.2.1, Section 5.2.2 and Section 5.2.3 respectively. The combination
of these factors is described in Section 5.2.4.
5.2.1 Roof Evidence
Roof evidence for a hypothesis from the selection process for gable-roof hypotheses,
described in Section 4.5.1, is used in the verification process as well.
5.2.2 Wall Evidence
The wall evidence for a gable-roof hypothesis is similar to that of an equivalent flat-
roof hypothesis because the only difference in the generic model used for the flat-roof
hypothesis and that used for the gable-roof hypothesis is the shape of the roof. As the wall
evidence is independent of the shape of the roof, it is collected by treating a gable-roof
hypothesis as a flat-roof hypothesis (by ignoring the “spine”). The same process described
for flat-roof hypotheses in Section 5.1.2 is applied to compute the wall evidence of gable-
roof hypotheses.
5.2.3 Shadow Evidence
The search for shadows for gable-roof hypotheses differs from the search for shadows
for flat-roof hypotheses. The shape of the gable-roof, coupled with the differing height of
the “spine” of the gable, as compared to the sides of the gable, causes shadow lines that are
not parallel to the sides of the building causing them. However, knowing the heights of the
sides and the “spine” of the gable in 3D, and the direction of illumination, the system
predicts the shape before launching a search for supporting line evidence. This search
includes the shadow cast by the roof, and the shadow cast by the vertical walls of the
building. The formula used for computing the shadow score is the same as that for flat-roof
hypotheses, and is detailed in Section 5.1.3. Following is an algorithmic description of the
search for shadows of gable-roof hypotheses:
• Figure 5.6 shows a typical shadow cast by a gable-roof hypothesis. Suppose the 3D
extremities of the gable-roof G are denoted by P_1 through P_6. Denote the projection
of P_1 through P_6 in view_i by p_1i through p_6i (refer to Figure 5.6).
• The extremities of the visible shadow are computed using the sun angles to be ps_1i
through ps_4i.
• Search for line evidence that supports the outline of the shadow in a window of 5
pixels on either side of the predicted shadow outline.
• The shadow score, s, is computed using the same method as that used for flat-roofs
in Section 5.1.3.
Figure 5.6 Shadow cast by a gable-roof hypothesis
5.2.4 Combination of Roof, Wall and Shadow Evidence
The method for determining whether a selected gable-roof hypothesis should be
verified or not is very similar to that for flat-roof hypotheses described in Section 5.1.4. This
evidence, namely roof, shadow and wall evidence is computed in the manner described in
Section 5.2.1, Section 5.2.2 and Section 5.2.3. The verification process uses a decision tree
identical to the one for flat-roof hypotheses described in Figure 5.4. The exact algorithm is
detailed below:
Suppose that a 3-D gable-roof hypothesis has a roof score of r (as computed in
Section 4.5.1), a shadow score equal to s and a wall score that equals w. If there are
n_views views being considered then
• if r ≥ 0.5 * n_views the hypothesis is verified and the verification process exits.
• if r < 0.5 * n_views, shadow evidence is considered.
• if s ≥ 0.1 * n_views, or if the shadow evidence s_i ≥ 0.2 in some view_i, the
hypothesis is verified and the verification process exits.
• if s < 0.1 * n_views and s_i < 0.2 for all i=1..n_views and r ≥ 0.25 * n_views and if
w ≥ 0.1 * n_views, then the hypothesis is verified and the verification process
exits.
• if s < 0.1 * n_views and s_i < 0.2 for all i=1..n_views and r < 0.25 * n_views, the
hypothesis is not verified and the verification process exits.
• if none of the conditions enumerated above are satisfied the hypothesis is not
verified and the verification process exits.
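The decision structure is identical to the flat-roof case; with the parameterized sketch
from Section 5.1.4, the gable-roof test amounts to substituting the lower thresholds (the
parameter names are, again, illustrative):

    verified = verify_hypothesis(r, s, w, s_per_view, n_views,
                                 t_roof=0.5, t_shad=0.1, t_shad_i=0.2,
                                 t_roof_lo=0.25, t_wall=0.1)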
When compared to the algorithm for flat-roof hypotheses, the thresholds used for
gable-roof hypotheses are low. The reason for this is that a gable-roof hypothesis needs a
greater number of lines to form its roof and its shadows, thus raising the possibility that
some of these components will either not be found, or be found with less feature support,
and hence lower confidence. This is compensated for by the fact that there are a greater
number of applicable constraints to gable-roof hypotheses in the hypothesis formation
stage than there are to flat-roof hypotheses, and hence fewer gable-roof hypotheses than
flat-roof hypotheses are usually formed for identical numbers of basic features (lines). The
results of verification on the selected gable-roof hypotheses shown in Figure 4.19 are
depicted in Figure 5.7.
The roof, shadow and wall scores are combined using a weighted sum to
yield a confidence measure. If r is the roof score, s is the shadow score, and w is the wall
score of a building hypothesis, the equation used to obtain a verification score (or
confidence measure), v, for a verified hypothesis is:

    v = (r_wt * r) + (s_wt * s) + (w_wt * w)     (5.2)

where r_wt = 0.85, s_wt = 0.05 and w_wt = 0.05. Also r_wt + s_wt + w_wt = 1. The reason
for weighting wall evidence and shadow evidence lower than roof evidence is that their
expected shapes are complex, and their evidence is usually less reliable than roof evidence.
This confidence measure is used in the overlap disambiguation covered in Section 5.3.
5.3 Overlap analysis
The system verifies selected flat-roof and gable-roof hypotheses one at a time. Hence
it is possible that verified hypotheses may overlap in 3-D. Testing for overlap is done by
dropping the buildings models to the ground plane, and testing whether their projections
overlap on the ground plane. This effectively reduces overlap testing to a 2-D problem.
Figure 5.7 Verified gable-roof hypotheses for the images in Figure 4.19
Note that more than two hypotheses may overlap. In these cases hypotheses are handled in
pairs, recursively, till there are no conflicts. The use of verification scores as a discriminator
ensures that the process is order-independent, as it orders a set of overlapping hypotheses
in an unambiguous way. There are three distinct cases where overlap occurs. The first two
cases and Figure 5.8 are described in [57].
• Complete overlap or containment:
If verified hypothesis H_2 is completely contained in verified hypothesis H_1 (in 3-D),
and the difference in 3-D heights of H_1 and H_2 is less than 1m, then H_2 is removed
from the set of verified 3-D hypotheses. If the difference in height is greater than
1m, then both H_1 and H_2 are included in the verified hypotheses. This case implies
that H_2 is a superstructure of H_1. Figure 5.8 explains these cases.
• Partial overlap with no shared evidence:
When hypotheses H_1 and H_2 partially overlap, neither is completely contained by
the other, and they share no common roof evidence, then only one of H_1 and H_2 can
survive. The one that stays is the one with the higher verification score, as computed
in Section 5.1.4 if the hypothesis is flat-roofed, or Section 5.2.4 if it is gable-roofed.
• Partial overlap with shared evidence:
When hypotheses H_1 and H_2 partially overlap, neither is completely contained by
the other, and they share common roof evidence, a differential process determines
which of H_1 and H_2 survives. The rationale for using a differential process in this
case is that the hypotheses will have very similar scores, as they share evidence, and
that the differences must be highlighted in order to make a finer judgement as to
which hypothesis is better.

Figure 5.8 Cases in containment
Suppose in view_i, hypothesis H_1 is projected to h_1 and H_2 is projected to h_2.
Computation of the differential evidence is done thus:
• without loss of generality, say side l_11 of h_1 shares evidence with side l_21 of h_2.
• search for line segments for the section of l_11 that does not overlap with l_21, and
for l_21 that does not overlap with l_11.
• score this roof evidence as the fraction of the searched segment that is covered by
actual line evidence, using the method for scoring roof evidence described in
Section 5.1.1.
• repeat the process for all sides of h_1 and h_2 that share evidence in view_i.
• repeat the steps described above on all views.
• compute the line evidence for the part of the predicted wall lines of h_1 that is not
shared with h_2, and for the part of the predicted wall lines of h_2 that is not shared
by h_1 (differential wall evidence)
Figure 5.9 Partial overlap of hypotheses with no shared evidence
• compute the line evidence for the part of the predicted shadow lines of h_1 that is not
shared with h_2, and for the part of the predicted shadow lines of h_2 that is not shared
by h_1 (differential shadow evidence)
The differential score for H_1 is the verification score computed using Equation (5.1)
in Section 5.1.4 if H_1 is a flat-roof hypothesis, or using Equation (5.2) in Section
5.2.4 if H_1 is a gable-roof hypothesis, using the differential roof, shadow and wall
scores as input. The differential score for H_2 is computed by the same method. The
hypothesis with a higher differential score survives, while the one with a lower
differential score is eliminated from the set of verified hypotheses. It is interesting
to note that the architecture of the system allows different models (like flat-roof and
gable-roof buildings) to be disambiguated using the same methodology. Figure 5.10
depicts the computation of differential roof evidence in a single view. Figure 5.11
is an example of the use of differential evidence to disambiguate overlapping
flat-roof hypotheses that share evidence.
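Schematically, the three cases combine into a pairwise, order-independent resolution
loop. The sketch below is a simplification under stated assumptions: footprints are
approximated by axis-aligned ground-plane bounding boxes, and the plain verification
score discriminates in both partial-overlap cases, whereas the shared-evidence case
actually uses the differential score computed above; all names are illustrative.

    from dataclasses import dataclass

    @dataclass(eq=False)  # identity semantics, so list.remove() is exact
    class Hyp:
        bbox: tuple    # ground-plane bounding box (xmin, ymin, xmax, ymax)
        height: float  # 3-D height in meters
        score: float   # verification score from Equation (5.1) or (5.2)

    def contains(a, b):
        """True if b's footprint lies entirely inside a's."""
        return (a.bbox[0] <= b.bbox[0] and a.bbox[1] <= b.bbox[1] and
                a.bbox[2] >= b.bbox[2] and a.bbox[3] >= b.bbox[3])

    def overlaps(a, b):
        """True if the two footprints intersect on the ground plane."""
        return (a.bbox[0] < b.bbox[2] and b.bbox[0] < a.bbox[2] and
                a.bbox[1] < b.bbox[3] and b.bbox[1] < a.bbox[3])

    def resolve(hyps):
        """Pairwise disambiguation: a contained hypothesis survives only
        as a superstructure (height difference of at least 1 m); partial
        overlap keeps the hypothesis with the higher score."""
        survivors = list(hyps)
        changed = True
        while changed:
            changed = False
            for a in survivors:
                for b in survivors:
                    if a is b:
                        continue
                    if contains(a, b) and abs(a.height - b.height) < 1.0:
                        survivors.remove(b)
                        changed = True
                    elif (not contains(a, b) and not contains(b, a)
                          and overlaps(a, b)):
                        survivors.remove(min(a, b, key=lambda h: h.score))
                        changed = True
                    if changed:
                        break
                if changed:
                    break
        return survivors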
Figure 5.10 Computation of differential roof evidence in one view
5.4 3-D Description of the Scene
Verified buildings, both flat-roofed and gable-roofed, are 3-D structures. The 3-D
information of the verified buildings coupled with the camera model and the terrain model
of the scene can be used to generate the 3-D wire frame model of the scene. At this point,
texture-mapping may be applied to reconstruct the scene from viewpoints other than those
captured in the views. This ability to generate realistic views from arbitrary viewpoints has
interesting applications such as fly-by simulations. Figure 5.12 shows the 2 images used to
construct a model and 3 views of a 3D wire frame model constructed by the system.
Figure 5.11 Example showing the use of differential evidence for disambiguation
Figure 5.12 Three views of a model constructed by the system
Chapter 6
Results and Performance Analysis
The system to detect and describe buildings from multiple aerial images that is
presented in this dissertation comprises several independent processes. There are
usually several ways of implementing each process. The overriding requirement of
the system is that it produce good results consistently. Given this requirement, the primary
criterion used for optimization is speed of execution. Space (memory usage) optimization
is not a guiding constraint. Results on large sections of the Fort Hood site, and the entire
Fort Benning site are presented in Section 6.1. These results are presented with a view to
characterizing the system’s performance, pointing out its strengths, as well as some
systematic or recurring weaknesses, that may be better handled in future work. The
performance of the automatic system is evaluated both quantitatively and qualitatively in
Section 6.2.
6.1 Analysis of results of the automatic system
The automatic system presented in this dissertation has been run on two large sites,
namely Fort Hood, Texas and Fort Benning, Georgia. The sites have distinct image
properties. This has led to the creation of a more robust system than if only one of these
sites were considered. These sites have a sufficient variety of building types that a system
handling any one of them individually would need to be fairly general to work effectively.
6.1.1 Results on Fort Hood, Texas
Fort Hood, Texas is a site with over 100 buildings. The dataset contains camera
information for all the 18 views that are provided. 7 views are nadir and 11 are oblique. The
18 views are acquired in a manner that spans the site. At least 1, and not more than 6, images
will include any given area of the site. Hence the views do not all cover the same objects. A
digital terrain model (DTM) exists for large sections of the site. In places that do not contain
a DTM, the system allows the user to approximate the DTM. The site is characterized by
low buildings of varying shape, size, orientation and roof intensity. The site has foliage and
trees that clutter the background, as well as a number of man-made structures such as car
park areas, vehicles and vehicle ports that create accidental alignment of features that
sometimes qualify as buildings. The views are acquired from nadir as well as oblique
angles, and at resolutions varying from 0.3 m/pixel to 1.3 m/pixel. They are acquired at
different times of the day (at different days in the year) leading to widely varying image
characteristics in the different views for the same area on the ground. Taking into account
the points made above, Fort Hood is considered a challenging site for a building detection
system, with sufficient diversity to test a system under widely varying conditions.
Figure 6.1 shows some steps in the detection of three gable-roof buildings and the
final verified result. The spines of the gables are extremely low and hence the buildings are
described as flat-roof buildings. One half of the building labeled A is included in the set of
verified hypotheses. The whole building (detected as a flat-roof) exists in the set of selected
hypotheses. However, that hypothesis is eliminated by the one labeled A, as this hypothesis
has much stronger roof evidence, and the differential evidence in favor of the other
hypothesis is not sufficient to raise its confidence beyond that of the hypothesis labeled A.
Note that the shadow evidence is common, and the wall evidence is almost negligible.
Figure 6.2 shows lines detected in two oblique views of a part of the site containing
complex multi-level buildings. Figure 6.3 shows the verified buildings overlaid on the
views. This example illustrates some of the difficulties of the problem domain. The detected
parts of the buildings cast shadows on the lower, smaller parts of the buildings. Some areas
of these smaller buildings are included in the verified hypotheses. The use of two oblique
views confirms the claim that the system allows general viewpoints. The images have been
acquired at different times of the day, as evidenced by the differing orientation of the
shadows in the two views. This precludes a simpler stereo-matching algorithm, as the views
are effectively of differing scenes. The background clutter in the form of trees, markings
and small man-made structures or vehicles is evident from Figure 6.2. It is instructive to
note that the accidental configurations causing most of the ambiguity for this system are
man-made structures, which tend to have a number of straight lines oriented at right angles,
rather than naturally occurring foliage.
Figure 6.4 shows two views of low (under 5m in height) L-shaped buildings. This
example illustrates how a rectilinear structure may be recognized from its rectangular
components. The recognition process is complicated by the need to identify low buildings,
as is the case here. Allowing for heights of under 5m with tolerances of 2m implies that
configurations of parked vehicles could be classified as buildings. There exist some parked
vehicles in rectangular configurations. In this particular case, the parked vehicles are too
low (all of them are cars) to be verified.

Figure 6.1 Gable-roof buildings detected as flat-roof buildings (detected lines,
selected hypotheses, verified hypotheses)
Figure 6.5 shows two views of a set of buildings. This example shows the system
recognizing multi-level buildings. The first view is a nadir view, while the second view is
an oblique view. The system has no preferences built in for either nadir or oblique views,
and is able to use them without special instructions. The roofs in this example, are of
varying intensities within each view. In addition, the intensities vary significantly across
views, for corresponding areas in each view. This characteristic is usually a major problem
for a stereo-matching program that relies on area-matching alone. It is not a problem for
this system, where the hypothesizing mechanism is entirely feature-based. Examining the
building labeled B in both images shows that the system is able to detect buildings even if
they have small protrusions, if a sufficiently large segment of the roof exists to enable it
to be pieced together using perceptual grouping. Additionally, the use of multiple images
implies that the evidence may be bad in one view and still lead to a good hypothesis.

Figure 6.2 Lines detected in two images with multi-level buildings
Figure 6.3 Two oblique views of multi-part buildings
Figure 6.4 Two views of L-shaped buildings
Figure 6.5 Example of multi-level buildings

Figure 6.6 shows two views of complex buildings in a relatively uncluttered
background. The part of the building labeled C is an example of a building that exists but
is not detected (a true negative). The building exists in the set of all hypothesized buildings,
but is not selected. The reason it is not selected is that the other part of the building, labeled
D, occludes different sections of C because it is taller and the viewpoint causes occlusion,
and also because it casts a shadow on C. This causes a mismatch along an entire side of C,
and its consequent elimination in the selection process.
Figure 6.7, Figure 6.8 and Figure 6.9 show representative results in different areas of
the site. Figure 6.10 shows 3 views of a model constructed by the automatic system for large
parts of the Fort Hood site. These results were obtained by running mutually exclusive
sections of the site and collating the results. The results were produced by running all
sections with two views only. Different pairs of views were used in each example as no
single pair overlapped completely in the area shown in each of the examples.
Figure 6.6 Complex buildings in a relatively uncluttered background
Figure 6.7 Example of variations in shape and height
Figure 6.8 Example of buildings with many parts
Figure 6.9 Example of a true negative and a false positive
Figure 6.10 Three views of the model constructed for the Motor Pool area
6.1.2 Results for Fort Benning, Georgia
The Fort Benning dataset contains camera information for the 3 views. A digital
terrain map (DTM) exists for the site. The site is characterized by high buildings (in
comparison to the non-building structures) of varying shape, size, orientation and roof
intensity. The buildings have very distinct markings on the roofs that complicate the task of
delineating them. These markings would almost certainly cause major problems for a
building detection system that is area-based only. There are a number of gable-roof
buildings that increase the complexity of the site. The site has as a small number of
prominent roads that create distinct features in the images. The views are nadir and at fairly
high resolutions of between 0.2 m/pixel and 0.16 m/pixel. Fort Benning provides a
validation that the current system is general enough to handle fairly different sets of data,
and that matching information may be used to advantage.
Figure 6.11 shows the edges extracted from two views. These images have been
included to illustrate the density of edges formed, as the images themselves look
deceptively simple. The result of searching for flat-roof buildings in two views is shown in
Figure 6.12. Figure 6.13 shows the result of searching for gable-roof buildings. These
results are combined and examined for overlap, with the final results shown in Figure 6.14.
Figure 6.11 Edges extracted from views of the Fort Benning site
Figure 6.12 Verified flat-roof buildings in the Benning site
Figure 6.13 Verified gable-roof buildings in the Benning site
6.2 Evaluation of the automatic system
The run time of the system and the number of features generated at every major step
of the system are presented in the section on quantitative evaluation below. In the section on
detection evaluation, we analyze the building detection rate of the system. The distribution
of confidence values of the results is discussed in the confidence evaluation section.
The system uses several parameters in the generation, selection, and verification of
hypotheses. Some parameters, such as the search range of wall and shadow evidence, can
be set as a function of the image resolution. Some parameters, such as the weights used in
the wall and shadow evaluation functions, are chosen based on our experiences on several
examples. A learning program could also be used to find the best parameters over a set of
training examples. All the results shown here use the same parameters.
Figure 6.14 Final hypotheses (flat-roof and gable-roof buildings)
6.2.1 Run-time Evaluation
Quantitative data of the results from a typical pair of Fort Hood views and from a pair
of Fort Benning views are summarized in Table 6.1 and Table 6.2 respectively. In these
tables, a number of intermediate steps and their run times are shown.
It is observed that grouping and matching at each feature level viz. lines, junctions,
and parallels, take comparable times to execute. Both grouping and matching, at all levels
in the feature hierarchy, involve searching for features that satisfy some constraints. The
brute force method of searching through the list of all features has been replaced in most
cases, by a more efficient spatial search, that utilizes the inherent constraint that the features
must lie within some spatial boundaries. Spatial-hashing of the features in the (x, y) domain
has been used to reduce the actual run-times. It may be noted that much of the time is taken
in the selection and verification processes. Though the number of primitives (parallelogram
matches) in these processes is usually smaller than the number of primitives in the earlier
grouping and matching processes, the tests performed on these features are more expensive
in terms of time, causing them to take longer to execute.
Table 6.1
Fort Hood                Feature              Run Time (seconds)  Percentage of Time
Edge Extraction          Edge                 23.3                10.33%
Line Formation           Line                 20.2                 8.96%
Line Matching            Line match           15.5                 6.87%
Junction Detection       Junction             13.2                 5.85%
Junction Matching        Junction match        7.0                 3.10%
Parallel Formation       Parallel             21.7                 9.62%
Parallel Matching        Parallel match       13.1                 5.81%
U-Contour Formation      U-Contour            10.0                 4.43%
Hypothesis Selection     Parallelogram match  64.0                28.38%
Hypothesis Verification  Parallelogram match  37.5                16.63%
TOTAL                                         225.5               100.0%
6.2.2 Detection Evaluation
Shufelt and McKeown proposed measurements in [85] to evaluate their results by
counting the percentage of correct or incorrect pixels. The three measurements introduced
by them, as well as two more introduced by Lin and Nevatia [57], are used to characterize
the performance of this system. The measures introduced in [57] provide important high-level
indices of performance.
The following measures proposed in [57] are computed by making a comparison of
the automated results with a reference model constructed manually using generic modeling
tools such as those available under the Radius Common Development Environment
(RCDE).
T_p (True Positive) is a building in the reference model and detected by the program.
F_p (False Positive) is a building detected by the program but not present in the
reference model.
T_n (True Negative) is a building in the reference model but not detected by the
program.

Table 6.2
Fort Benning             Feature              Run Time (seconds)  Percentage of Time
Edge Extraction          Edge                 35.3                11.10%
Line Formation           Line                 27.2                 8.55%
Line Matching            Line match           22.0                 6.92%
Junction Detection       Junction             17.8                 5.60%
Junction Matching        Junction match        8.5                 2.67%
Parallel Formation       Parallel             33.4                10.50%
Parallel Matching        Parallel match       19.0                 5.97%
U-Contour Formation      U-Contour            16.7                 5.25%
Hypothesis Selection     Parallelogram match  87.1                27.39%
Hypothesis Verification  Parallelogram match  51.0                16.04%
TOTAL                                         318.0               100.0%
A building is considered to be detected, if a part of the building is detected by the
system. The description of the detected building may not necessarily be correct.
The other three measures viz. correct building pixel, incorrect building pixel and
correct non-building pixel percentages, are calculated by labeling every pixel in the image
as either a building pixel or a non-building pixel as proposed in [85]. The percentage of the
number of pixels correctly labeled as building pixels over the number of building pixels in
the image, the percentage of the number of pixels incorrectly labeled as building pixels over
the number of pixels labeled as building pixels, and the percentage of the number of pixels
correctly labeled as non-building pixels over the number of non-building pixels in the
image are computed. The evaluation of counting correct building and non-building pixels
gives an approximate idea of how accurate the description is.
The following are the five measures used to evaluate the “goodness” of the results:
• Detection Percentage = 100 x T_p / (T_p + T_n)
• Branch Factor = 100 x F_p / (T_p + F_p)
• Correct Building Pixel Percentage.
• Incorrect Building Pixel Percentage.
• Correct Non-Building Pixel Percentage.
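The first two measures are simple ratios of these counts; as a direct transcription:

    def detection_percentage(t_p, t_n):
        """Percentage of reference-model buildings that were detected."""
        return 100.0 * t_p / (t_p + t_n)

    def branch_factor(t_p, f_p):
        """Percentage of detected buildings that are false positives."""
        return 100.0 * f_p / (t_p + f_p)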
A large section of the Motor Pool area of the Fort Hood site is selected for evaluation
using the parameters described above. The area was run in sections with between 3 and 14
buildings in each section. Each section was run using two overlapping views. Figure 6.10
shows the model depicted in 3 different views. The buildings vary considerably in size,
shape and image characteristics. Many rectilinear buildings are composed of rectangular
parts. In order to characterize the performance of the system, the parameters T
p
,T
n
and F
p
are computed for complete (possibly multi-part) buildings, as well as for individual
rectangular building fragments. The derived measures, namely the detection percentage
and the branch factor, are computed independently in each case. The results are
summarized in Table 6.3. The pixel-based measures are presented in Table 6.4 for the same
area. It may be noted that the pixel-based measures indicate better performance than the
measures for individual buildings or building fragments because the large buildings are
detected, with false positives and true negatives being usually small.
6.2.3 Effect of Multiple Views
The system presented here handles two or more views non-preferentially.
Comparison of results using three views and using two views demonstrates that the
additional view sometimes aids the building detection and description process with respect
to detecting buildings (or parts of buildings) that were not detected using two views (i.e. T_p
is increased, T_n is decreased). More importantly, the additional view causes the number of
false positives (F_p) to decrease. Table 6.5 shows T_p, T_n and F_p using three views, for the
sections of Fort Hood that have at least three overlapping views. Comparison of Table 6.5
with Table 6.3 shows that the third view increases detection percentage and reduces the
branch factor. Table 6.6 shows the correct building pixel, incorrect building pixel and
correct non-building pixel measures for the same areas, and using three views. Tests run
with four views do not increase T_p, or decrease T_n, in the areas of Fort Hood that have at
least four overlapping views.
Table 6.3
                        T_p   T_n   F_p   Detection Percentage   Branch Factor
Complete (rectilinear)
buildings                84     4    14   95.45%                 14.2%
Rectangular building
fragments               134    11    14   92.41%                  9.46%

Table 6.4
Correct Building Pixel %   Incorrect Building Pixel %   Correct Non-building Pixel %
97.1%                      4.3%                         99.8%
Table 6.5
                        T_p   T_n   F_p   Detection Percentage   Branch Factor
Complete (rectilinear)
buildings                27     1     0   96.43%                 0.00%
Rectangular building
fragments                38     3     0   92.68%                 0.00%

Table 6.6
Correct Building Pixel %   Incorrect Building Pixel %   Correct Non-building Pixel %
97.4%                      3.9%                         99.8%
Chapter 7
Assisted model construction
While the results produced by the automatic system described in the previous
chapters are good, they are not perfect. In cases where the automatic system does not
produce perfect results, such as a missed building (true negative), it still retains the feature
hierarchy for the area covered by the true negative in each view. The domain knowledge
possessed by the system, possibly in conjunction with the features detected in the area of
the true negative, may be leveraged to produce a hypothesis corresponding to this building,
with “hints” from a human user. When the system runs in the mode where it incorporates
“hints”, or inputs, from the user, possibly using the precomputed feature hierarchy and
domain knowledge, it is said to be in assisted mode.
Several approaches to user assisted modeling are possible. The conventional
approach is to provide a set of generic models which are then fit to the image data by
changing model and viewing parameters as in the basic CME system [90]. In this approach,
the system provides geometric computations but substantial time and effort are required
from the user. Newer approaches have attempted to combine user input with varying
amounts of automatic processing [33] [34]. As many urban areas contain buildings that are
identical or very similar to others, tools for replicating them can also increase the user
productivity as in [33].
Basic modeling tasks are still performed by the automatic system described earlier,
but the system receives simple, yet critical, assistance from a user. The assisted system’s
capabilities are limited by those of the underlying system. In this case, the shapes of the
buildings are restricted to be rectilinear; the roofs may be either flat or symmetric gables.
We describe two approaches that attempt to provide the human user with tools that augment
the automatic system. The goals to be met are to significantly reduce user effort in the
construction of models and to maintain or improve the quality of the results when compared
with the “dumb” manual systems i.e. systems that do not take advantage of geometric,
photometric or domain-specific constraints.
A user interaction typically consists of the user pointing to a point or line feature; the
pointing need not be precise as precise features are automatically selected by the system.
Such interaction is called an “input”. The system requires two (or more) views of a scene
with associated camera geometry; however, most user interactions take place in one view
only. Other views may be displayed but the user is not asked to view the images
stereoscopically. We believe that confining most of the interactions to one view can
significantly reduce the effort required by the user.
Section 7.1 describes the “smart” real-time system which uses many of the geometric
and domain-specific constraints available, and proceeds incrementally based on user inputs.
Section 7.2 shows results obtained using the real-time user-assisted system and a
performance comparison with a generic modeling tool. Section 7.3 concludes the chapter.
7.1 Real-time hypothesis generation based on user input
The approach detailed here constructs a building hypothesis in real-time based on
user inputs. This approach uses the user’s input(s) as well as the context information,
specifically lines and junctions, from the view, to form a plausible hypothesis. The presence
of multiple views enables a 3-D hypothesis to be generated immediately. The process is
geared to make its best guess based on the detected features and the available user inputs.
This implies that as the user provides incremental input, the system should converge better
on the desired output.
To determine a flat-roof rectangular hypothesis, a maximum of 3 positional inputs is
required to determine the roof, with a possible additional input to correct the estimated 3-
D height. This process is described in Section 7.1.1. Determining a symmetrical gable-roof
hypothesis also requires 3 positional inputs, with 2 possible additional inputs to correct
the heights of the sides and the spine of the gable-roof. The method described in this thesis
for gable-roof hypotheses requires 5 positional inputs in 2 views, which accurately
determine the gable-roof with the correct 3-D heights. This process is outlined in Section
7.1.2.
7.1.1 Generation of flat-roof rectangular hypotheses from a single view
Flat-roof rectangular hypothesis formation requires a maximum of 3 positional user
inputs in a single view, with a possible 4th input to correct the 3-D height of the generated
3-D hypothesis. Tests on a number of buildings reveal that fewer than 30% of the
hypotheses needed 3 positional inputs, and none needed a height correction. The actions
performed by the system after each input are outlined in the following paragraphs.
Figure 7.1 is a flow-chart that diagrams the process. Figure 7.2 shows a view of a building
that is used as an example to illustrate the user-assisted process.
Actions after single input
Figure 7.3 depicts the situation after the first positional input from the user. The
system executes the following algorithm:
Figure 7.1 Construction of flat-roof hypotheses in real-time
• from the set of all junctions S_Ji for i=1..n_views, defined in Section 4.1, locate all
hypothesized junctions near (within a radius of 5 pixels of) the positional input.
• if no junctions are detected report failure and exit.
• for each junction found, attempt to construct a parallelogram as follows: use lines
forming the junction to derive the parallelogram (2-D roof hypothesis). Closures for
the two other sides of the parallelogram are sought from the set of matched lines
S_lm, defined in Section 4.1, using the procedure for finding closures for parallels in
Section 4.2.1. The best closures (as defined in Section 4.2.1) determine the two oth-
er sides of the parallelogram. If no closure is found on one side, the side is hypoth-
esized to begin where the junction leg adjacent to it ends, and is parallel to the side
opposite it. If neither closure is found the junction is discarded and no hypothesis is
generated for this junction.
• match the 2-D parallelogram across the available views, and select the best match
(as defined in Appendix A.) across all views, to yield a parallelogram match (build-
ing hypothesis).
• compute verification scores for this building hypothesis using Equation (5.1), as is
done during unassisted (automatic) operation.
• select the best 3-D hypothesis (the one with the highest verification score) from the
set of all 3-D hypotheses generated by junctions in the neighborhood of the user in-
put and present it to the user.
Figure 7.2 Building used to illustrate user-assistance
• if the user is not satisfied with the hypothesis, or if there is no hypothesis generated,
allow the user to provide an additional positional cue to the system. Figure 7.4 il-
lustrates the results on the example in Figure 7.2 after the first user input.
Actions after second input
Figure 7.5 illustrates the two cases that arise after the second input. The second input
is used to generate new hypotheses in the same manner as with the first input. However,
hypotheses whose projections have corners near the first input as well, are considered first.
If no such hypotheses are found, hypotheses that are formed exclusively from the second
input i.e. which do not have corners near the first input, are examined to retrieve the best
hypothesis from that set. If the user is not satisfied with the hypothesis, or if no hypothesis
is generated, the user may choose to provide a 3rd and final positional cue. Figure 7.6 shows
the results after the second user input on the example in Figure 7.2.
Figure 7.3 Actions after the first positional input
Figure 7.4 Results after first input on building in Figure 7.2
Actions after third input
On receiving the third input, the following algorithm is executed:
• use the three points to form three possible parallelograms to represent roof hypoth-
eses, as shown in Figure 7.7 and in the sketch after this list. The adjacent sides of the parallelogram must be pro-
jections of lines that are perpendicular in 3D and parallel to the ground. This
constraint is applied after each of the parallelograms is matched in all the views.
• find the best parallelogram matches (building hypotheses) across all available views
for each of the three hypotheses, using the process detailed in Appendix A..
• calculate the 3D orientation of the planes, for each of the three building hypotheses
Figure 7.5 Actions after the second positional input
Figure 7.6 Results after second input on building in Figure 7.2
• select the hypothesis with least inclination to the ground plane.
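Geometrically, three points admit exactly three completing parallelograms, one for each
choice of the point opposite the missing corner. A short sketch of the candidate fourth
corners (an illustrative helper, not the system's code):

    import numpy as np

    def fourth_corners(a, b, c):
        """The three possible fourth corners of a parallelogram whose
        other three vertices are the 2-D points a, b and c."""
        a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
        return [a + b - c, a + c - b, b + c - a]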
The results after the third input on the example in Figure 7.2 are shown in Figure 7.8.
After the third input, a result is always formed. The result is correct in the particular image
where the 3 inputs have been specified. However, the 3D height of the result may not be
estimated correctly. To handle a wrong 3D height estimation, the user may specify the
height by pointing to one of the corners of the building on the ground. This additional
height input specifies the building unambiguously. Figure 7.9 illustrates the points a user
might specify as corners of the building on the ground. In tests carried out, this height
correction was not required.
Figure 7.7 Three 2-D hypotheses possible from 3 positional inputs
Figure 7.8 Results after third input on building in Figure 7.2
Figure 7.9 Possible inputs for 3D height correction.
7.1.2 Generation of gable-roof rectangular hypotheses
Gable-roof hypotheses are generated by providing enough input for the system to
accurately determine a 3-D gable-roof hypothesis, using knowledge of the camera models, and
without relying on the underlying evidence. The following process, which requires the user
to input 3 positions in one image and 2 in another, completely specifies a gable-roof
hypothesis. The required inputs are as follows:
• specify one endpoint, p_i1, of a side of the gable in one view, say view_i
• specify an endpoint of the spine, p_i2, nearer to the input in the previous step
• specify the other end of the spine, p_i5
• specify the point matching p_i1, say p_j1, in another view, say view_j
• specify the point matching p_i5, say p_j5, in view_j
Given these inputs the system executes the following algorithm to construct the gable-roof
hypothesis:
• from p_i1 and p_j1, derive the 3-D point P_1 that projects to p_i1 in view_i and p_j1
in view_j, using the camera models provided.
• similarly derive P_5, the 3-D point corresponding to p_i5 and p_j5 in their respective
views.
• The spine of the gable-roof is defined by P_2 and P_5, and is assumed to be parallel to
the ground in 3D. Thus P_2 is at the same world z-coordinate as P_5. A ray from the
camera center of view_i through p_i2 is extended into the 3D world to intersect the
plane parallel to the ground and at the same z-coordinate as P_5. This unambiguously
defines P_2.
• given P_1, P_2, P_5, and the assumption that the gable-roof is symmetrical about the
spine, in 3-D, reflect P_1 about line (P_2, P_5) to determine P_3.
• the side of the gable-roof (P_1, P_6) is parallel to (P_2, P_5) and has the same length.
This determines P_6.
• similarly compute P_4, using P_3.
This specifies the gable-roof hypothesis completely and unambiguously, without the need
for height correction. The entities described are diagrammed in Figure 7.10. Figure 7.11
depicts the process. Figure 7.12 shows a real example, with the user inputs and the
constructed model overlaid in each view.
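The only non-trivial geometric step above is the symmetry construction for P_3, P_4 and
P_6. A minimal sketch of it is given below, assuming the spine (P_2, P_5) is parallel to the
ground and interpreting the reflection about the spine as mirroring across the vertical plane
through it; the function name is illustrative.

    import numpy as np

    def complete_gable(P1, P2, P5):
        """Complete a symmetric gable roof from three 3-D corners:
        P1 is one eave corner; P2 and P5 are the spine endpoints,
        assumed to be at equal height. Returns P3, P4 and P6."""
        P1, P2, P5 = (np.asarray(p, dtype=float) for p in (P1, P2, P5))
        spine = P5 - P2
        s = spine / np.linalg.norm(spine)
        # Horizontal normal of the vertical plane through the spine.
        n = np.cross(s, [0.0, 0.0, 1.0])
        n /= np.linalg.norm(n)
        # Mirror P1 across that plane to get the opposite eave corner.
        P3 = P1 - 2.0 * np.dot(P1 - P2, n) * n
        # Eave edges are parallel to the spine and of equal length.
        P6 = P1 + spine
        P4 = P3 + spine
        return P3, P4, P6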
The advantage of this method over the method used for flat-roof hypotheses is that it
does not need any feature information, or geometric constraints of the model. The reader is
referred to [53] for a method for gable-roof hypotheses that takes advantage of the detected
features, often reducing the need for precise input from the user.
7.2 Real-time Assisted Results
The real-time system for assisting the user relies on many of the techniques
developed for the automatic system. As a result, it is able to generate good results with
a relatively small number of simple inputs. Figure 7.13 shows the model
constructed using the real-time assisted system. As this system allows the user to override
its hypotheses at each stage, the user is able to construct a model which can be as accurate
as one constructed by a system that does not possess the mechanisms for utilizing detected
features in the hypothesis formation process. Hence accuracy of the model is not an issue
in the evaluation of this system. One measure of the utility of the real-time assisted system
is the time it could save over constructing the model using standard model-construction
tools such as those provided within the Radius Common Development Environment
(RCDE). The other measure is the number of inputs required in each system to achieve the
same result. These comparisons are provided in the following paragraphs and summarized
in Table 6.9.
• Comparison of time taken: Using the real-time assisted system, the model for Fort
Benning (in Figure 7.13) was constructed in 375 seconds, or 6 minutes 15 seconds.
The same model, when constructed in RCDE by a user familiar with the construc-
tion tools, took in excess of 20 minutes.

Table 6.9
                                 Real-time assisted system   RCDE Modeling tools
Time to build Model              375s                        > 1800s
Number of Inputs to build Model  109                         187

Figure 7.10 Features used in gable-roof hypothesis in assisted mode

Some interesting differences were the
computation of the height of the “spine” of the gable-roof buildings and the accura-
cy of the inputs required. Using the real-time assisted system, providing the three
inputs in one view and two more in another view automatically determined the
“spine” of the gable-roof building in 3-D. No height adjustment was necessary. Us-
ing the RCDE modeling tools, each vertex was automatically created at the same
3D height. This necessitated adjustment of the height of the two ends of the “spine”
of the gable as an extra step. This step required delicate manipulation, and was the
most time-consuming. In the detection of flat-roof buildings, the real-time assisted
system allowed greater error in user input as it searched for features in a
neighborhood of the input.

Figure 7.11 Generation of gable-roof hypotheses in assisted mode

Further, in the cases that the system did require three user inputs
to determine a flat-roof hypothesis, it utilized the camera geometry and the assump-
tion that the buildings were rectilinear to constrain the shape of the projection of the
hypothesis, and find the best fit for the three inputs.
• Comparison of number of inputs: The real-time assisted system required an av-
erage of 2 inputs for the 7 flat-roof buildings and 5 inputs for each of the 19 gable-
roof buildings for a total of 109 inputs. Using the RCDE system, flat-roof buildings
needed 4 inputs for the corners and 1 input to adjust the height. Gable-roof buildings
required 6 inputs to identify the corners and 2 inputs to adjust the height of the
spine. The total using this method was 187 inputs.

Figure 7.12 Example of assisted gable-roof construction
Figure 7.13 Real-time assisted results on Fort Benning
7.3 Conclusion
The assisted mode of operation allows the user to rapidly build, or correct, the model
generated by the automatic system. The goal of reducing the effort expended in constructing
a model, even when done entirely in the assisted mode, is easily met when compared
with the time taken to construct a model without these tools. A detailed comparison of this
process with the “dumb” approach (dependent entirely on user input, and not utilizing
geometric, photometric or domain-specific constraints) was provided in Section 7.2.
Chapter 8
Conclusions and Future Research
The task of building detection and description is non-trivial even with the use of two
or more views. The system, presented as part of this thesis, to automatically detect and
describe buildings has achieved good detection rates on a consistent basis under widely
varying imaging conditions. This validates many of the salient features of the approach
such as the hypothesize and verify paradigm, the reliability of a feature-based approach, the
use of a hierarchical approach, and the simultaneous use of perceptual grouping and
matching at different levels of feature complexity to increase robustness.
It is interesting to analyze building detection and description as a restriction of the
general problem of object recognition and description. In the attempt to solve the problem
of building detection and description from multiple views, the characteristics of general
object recognition and description that were restricted were constraints on pose (the object
is assumed to be on the ground), shape and large occlusions. The other major characteristics
of segmented descriptions, background clutter, and choice of level (high-level or low-level)
and type (2D or 3D) of primitives used to describe the objects, are present.
8.1 Summary
From this study of shape description, using building detection and description as a
case study, it is fair to generalize that the use of hierarchical descriptions of an object has
significant advantages over non-hierarchical descriptions. By obtaining evidence from a
hierarchy, a correct hypothesis is more likely to be identified than without a hierarchy. With
the fragmented low-level descriptions that are available, it becomes important to derive
confidence in a hypothesis from a hierarchy, rather than from a single level of primitives.
I contend that the task of object recognition and description must rely heavily on the
features detected. The variance in the photometric properties of identical scenes under
different illumination causes the primary use of region segmentation to be unreliable, in
general. In the case of building detection from aerial images, the illumination cannot be
controlled.
On the issue of background clutter, in the case of rectilinear building detection, the
presence of trees and vegetation does not cause a significant problem. The features caused
by these objects can be filtered out on the basis of shape alone. The major problem is the
presence of structures on the ground, such as parking lots and clearings, which give rise to
low-level features that are indistinguishable from low-level building features. Other cues,
such as 3D height, and domain-specific information like walls and shadows have to be used.
8.2 Future Research
While good progress has been demonstrated in building detection and description
from multiple aerial views, much remains to be done. The current system works well on
rectilinear buildings of a “reasonable” (this varies with the resolution, but with a resolution
of 1 pixel per meter, buildings with sides greater than 20 meters are “reasonable”) size, in
a fairly cluttered neighborhood. Some of the possible extensions to the work are proposed,
with notes on the problems that are likely to arise in each case.
8.2.1 Building Extraction in more complex environments
Currently, the system has been tested on sites with buildings in largely suburban
settings. One of the important areas of application is in an urban setting, where the buildings
are much more closely spaced, and the buildings are smaller in size. Though the current
approach has no assumptions that preclude detection and description of the buildings
described above, it requires strong cues (such as prominent junctions, or parallel sides) that
may not be present in the types of settings described here. While handling smaller (than
“reasonable”-sized, as described above), closely packed buildings is not simply a
matter of changing thresholds on size, much of the methodology of the current system is
applicable. In addition, the system will need mechanisms that make it more sensitive to
smaller cues (like line segment matches, or similar simultaneous changes in contrast in the
available views), for accurate delineation.
8.2.2 Detection and Description of Non-Rectilinear Shaped Objects
A natural extension to the building detection and description system is to relax the
constraint of rectilinearity, and allow buildings of arbitrary polygonal shape to be detected.
Most buildings are rectilinear, which implies that the current system is equipped to handle
a large subset of the buildings that may be present in an arbitrary scene. The system
essentially uses lines to identify the extremities of the roof, and the junctions to accurately
define the extent of the sides. Parallelism is employed to bolster the confidence in the
generated hypotheses. To allow for detection and description of arbitrary polygonal
buildings two actions are needed: the first is to apply parallelism only if it is detected, and
the second is to generalize the current “graph search” (in the constructed graph with
junction-matches as nodes, and line-matches as edges in the graph) for 4-cycles to n-cycles,
where n may be limited by the user.
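As a rough sketch of what such a generalized search might look like (the adjacency
representation, function name, and node ordering are assumptions for illustration, not the
system's actual data structures), enumerating simple n-cycles in the junction-match graph
could be done as follows:

```python
def find_n_cycles(adj, n):
    """Enumerate simple n-cycles in an undirected graph.

    adj maps each junction-match (node) to the set of junction-matches
    reachable from it through a line-match (edge). Assumes n >= 3 and
    comparable node ids. Each cycle is reported exactly once.
    """
    cycles = []

    def extend(path):
        head, tail = path[0], path[-1]
        if len(path) == n:
            # Accept only closed paths, and only one of the two traversal
            # directions (path[1] < path[-1]) so each cycle appears once.
            if head in adj[tail] and path[1] < path[-1]:
                cycles.append(tuple(path))
            return
        for nxt in adj[tail]:
            # Grow only toward larger ids than the start node, so every
            # cycle is generated from its smallest node (no rotations).
            if nxt > head and nxt not in path:
                extend(path + [nxt])

    for start in sorted(adj):
        extend([start])
    return cycles


# Four junction-matches forming a roof outline, plus one distractor node:
adj = {0: {1, 3, 4}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}, 4: {0}}
print(find_n_cycles(adj, 4))   # [(0, 1, 2, 3)]
```

Exhaustive enumeration is exponential in the worst case, which is one reason the bound n
supplied by the user matters in practice.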
Extending the system to include special non-polygonal shapes such as circles
(occurring in hemispherical domes, for instance) could utilize many of the higher-level
matching techniques used in rectilinear building formation, though there may be no need
for a hierarchy of features. Detecting and describing such structures might actually be a
simpler task than detecting rectilinear buildings, as the probability of finding accidental
parallel curves (which cause many of the problems in rectilinear building detection and
description, owing to the high probability of finding parallel lines in scenes) is low. The
issue here is whether more than one view would provide a significant advantage over a
single view. Scenes with buildings with real circular edges (as opposed to "limb" circular
edges), such as storage tanks, would certainly benefit from more than one view, as the
possibility of high-confidence matching exists. Making use of multiple views with
buildings that cause "limb" boundaries, such as reactors, is harder, but multiple views
should still help. Extending the system to include general non-polygonal buildings will
require the development of new techniques or concepts: the system uses detected lines as
the basic low-level description of the scene being analyzed, and it is assumed that the
system is looking for real edge matches and not the "limb" edge matches that may occur
with curved buildings.
8.3 Conclusion
Building detection and description is a non-trivial problem even when two or more
views are available. This dissertation uses a hypothesize-and-verify approach, coupled
with hierarchical grouping and matching of features, to detect and describe rectilinear
flat-roof buildings and rectangular gable-roof buildings. Good results have been obtained
using these methods. Ideas for future work have been discussed, along with the issues
each is likely to raise.
Appendix
A. Best Matches for Sets of Lines Across Views
Matching of features across views is used extensively by the system presented in this
dissertation. Often, matches for aggregate features such as parallels and parallelograms
are required. The following paragraphs present the methodology used to obtain the best
matches of a set of lines in one view with lines in another view. This technique is used to
match parallelograms and triples.
Assume a set of n_lines lines {l_ik}, with k = 1..n_lines, in view i. The lines in the set
{l_ik} are matched to lines {l_jl} in view j, subject to the quadrilateral constraint
described in Section 4.1. As the 3-D height of the 3-D line that projects into l_ik is varied
between the minimum and maximum heights, the projection of this line in view j, l_ik′,
traverses the quadrilateral defined in the quadrilateral constraint. Parameterize this
traversal with a parameter s, where s = 0 corresponds to the minimum height (possibly the
ground plane) and s = 1 corresponds to the maximum height. Figure A.1 illustrates this
concept.

As the parameter s is varied from 0 through 1, l_ik′ will match different lines, {l_jl}, in
the quadrilateral. This gives rise to multiple matches at different parameter values (and
hence different 3-D heights).

[Figure A.1: Parameterized search for line matches, showing l_ik in view i and its
projections l_ik′(s = 0) and l_ik′(s = 1) in view j.]

Define a proximity measure for line l_ik, d_ijkl, as the perpendicular distance from the
midpoint of l_ik′ to a line l_jl in {l_jl}. Define a measure of the goodness of the match
for a set of lines {l_ik} in view i, m_s, at parameter value s, over all n_views views, by

    m_s = \sum_{j \neq i} \sum_{k,l} e^{-\sigma^2 \cdot d_{ijkl}^2} \cdot \mathrm{len}(l_{jl})

where σ is chosen to be 0.5 (hence σ² = 0.25) and len(l_jl) is the length of l_jl. The
physical significance of this is to weight the lines in {l_jl} by their distance from the
projection of l_ik in view j (l_ik′), and to have that weight drop off exponentially. This
may be done with an arbitrary set of lines {l_ik} over any number of views from 1 to
n_views. The values of s at local maxima of the function m_s are used as the best matches
of the set of lines {l_ik}. It is important to note that the function maximizes the support
at a certain parameter s, which corresponds to a 3D height. Thus the function is biased
towards planar sets of lines at the same 3D height, i.e., planar sets of lines parallel to
the ground. In this thesis, this mechanism has been used to derive excellent matches for
sets of lines.
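To make this concrete, here is a minimal sketch of the sweep, not the dissertation's
implementation: all names are hypothetical, the projection l_ik′(s) is approximated by
linearly interpolating its midpoint between the s = 0 and s = 1 positions, and the
candidate lines are assumed to be pre-filtered by the quadrilateral constraint of
Section 4.1.

```python
import math

def point_to_line_distance(p, line):
    """Perpendicular distance from point p to the infinite line through
    the endpoints of `line`."""
    (px, py), ((ax, ay), (bx, by)) = p, line
    dx, dy = bx - ax, by - ay
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)

def line_length(line):
    (ax, ay), (bx, by) = line
    return math.hypot(bx - ax, by - ay)

def match_score(midpoints, candidates, s, sigma=0.5):
    """m_s for one hypothesized set of lines {l_ik} at parameter value s.

    midpoints: one dict per line l_ik, mapping each other view j to the
        midpoints of l_ik' at s = 0 and s = 1; the midpoint at s is
        linearly interpolated between them (an approximation).
    candidates: per view j, the candidate lines {l_jl}.
    """
    score = 0.0
    for per_view in midpoints:                    # one entry per line l_ik
        for j, (m0, m1) in per_view.items():
            mid = ((1 - s) * m0[0] + s * m1[0],   # midpoint of l_ik'(s)
                   (1 - s) * m0[1] + s * m1[1])
            for ljl in candidates[j]:
                d = point_to_line_distance(mid, ljl)
                score += math.exp(-sigma ** 2 * d * d) * line_length(ljl)
    return score

def best_matches(midpoints, candidates, steps=100):
    """Values of s at local maxima of m_s, on a uniform grid over [0, 1]."""
    m = [match_score(midpoints, candidates, i / steps) for i in range(steps + 1)]
    return [i / steps for i in range(1, steps)
            if m[i] > m[i - 1] and m[i] > m[i + 1]]
```

Each returned value of s corresponds to a 3D height at which the set of lines, taken as a
whole, finds maximal support across the other views.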
A.1 Best Matches for a Parallelogram Across Views
In the case of a parallelogram, the set of lines {l_ik} defined above comprises the four
bounding lines that define the parallelogram. The maxima of the function m_s yield the 3D
heights at which projections of the building in the available views have maximal line
support (in a local neighborhood defined by σ). It may be noted that the function is
biased to look for buildings with roofs that are parallel to the ground (as it uses a
single parameter s, which represents 3D height).
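Continuing the sketch above, with purely illustrative coordinates, a parallelogram
hypothesis would supply its four bounding lines and read candidate roof heights off the
local maxima:

```python
# Projected midpoints of the four bounding lines in view 1, at s = 0
# (ground) and s = 1 (maximum height); values are purely illustrative.
midpoints = [
    {1: ((10.0, 40.0), (10.0, 20.0))},
    {1: ((50.0, 40.0), (50.0, 20.0))},
    {1: ((30.0, 35.0), (30.0, 15.0))},
    {1: ((30.0, 45.0), (30.0, 25.0))},
]
# Detected lines of view 1, ideally pre-filtered by the quadrilateral
# constraint of Section 4.1.
candidates = {1: [((0.0, 30.0), (60.0, 30.0)),
                  ((0.0, 25.0), (60.0, 25.0))]}
for s in best_matches(midpoints, candidates):
    print("candidate roof height at s =", s)
```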
A.2 Best Matches for a Triple Across Views
In the case of a triple, the formulation has to be generalized to allow for lines at
different 3D heights (or equivalently, different values of s). The triple models a gable
roof, which has a spine at a different 3D height from the sides of the gable roof.
Section 4.4.1 shows how the difference in height may be estimated. Assume the sides of the
gable roof, l_i1 and l_i3, are at a height h in 3D, which corresponds to parameter value s
(s = 0 corresponds to the ground and s = 1 corresponds to the maximum allowed height of a
building). If the height of the spine, l_i2, above the sides is d, and the vertical height
d corresponds to a parameter value of δ, then the parameter value of the spine is s + δ.
The epipolar line for l_i2 at each value of s + δ in the other views will account for the
height difference corresponding to d, and change the value of d_ijkl (k = 2) appropriately.
This allows the system to search in a 1-dimensional space defined by the parameter s, as
opposed to a 2-dimensional space defined by s and δ.
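Under the same assumptions as the earlier sketch, and taking δ as already estimated
(Section 4.4.1), the extension amounts to evaluating the spine at s + δ while the sides
stay at s, keeping the search one-dimensional:

```python
def gable_score(side_midpoints, spine_midpoints, candidates, s, delta, sigma=0.5):
    """Goodness of match for a triple: sides l_i1, l_i3 at parameter s,
    spine l_i2 at s + delta (delta is the estimated spine offset in
    parameter units). Conventions follow match_score above."""
    score = match_score(side_midpoints, candidates, s, sigma)
    # Clamp so the spine parameter stays within the allowed sweep range.
    score += match_score(spine_midpoints, candidates, min(s + delta, 1.0), sigma)
    return score
```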
Abstract

A method for detection and description of rectangular buildings with flat and with gable
roofs from two or more registered aerial intensity images is proposed. The output is a 3D
description of the buildings, with an associated confidence measure for each building.
Hierarchical perceptual grouping and matching across views aids building hypothesis
formation and increases the robustness of the system.

The system is exclusively feature-based. Perceptual grouping in each view and matching
across views is performed in a hierarchical manner, utilizing primitives of increasing
complexity, starting with line segments and junctions, and proceeding to higher-level
features, namely parallels, U-contours and parallelograms (as 3D rectangles appear as
parallelograms when projected under the conditions employed in aerial imagery). Binocular
and trinocular epipolar constraints are used to reduce the search space for matching
features. A selection mechanism retains hypotheses with sufficient roof evidence to
warrant further examination. A verification procedure employs roof, wall and shadow
evidence to decide whether a hypothesis should be considered a building or not.
Overlapping verified hypotheses are disambiguated depending on the type of overlap
(partial overlap or containment) and the 3D heights of the hypotheses.

A user-assisted mode enables building model construction utilizing user inputs. By using
the (computed) feature hierarchy in the model construction process, this mode reduces the
number of user inputs required to construct building models, when compared with generic
3D modeling tools.