Accurate 3D Model Acquisition from Imagery Data
by
Zhuoliang Kang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2015
Copyright 2015 Zhuoliang Kang
Acknowledgement
First and foremost, I would like to thank my advisor, Professor Gérard Medioni. I am
truly fortunate to have had him as my Ph.D. advisor. He is an excellent advisor not
only from the academic perspective but also from the personal perspective. I started
my research without any experience in 3D computer vision. Professor Gérard Medioni
guided me through all those hard times with his patience and expertise. I will always
remember the stacks of papers full of marks which he revised again and again. I will
always remember those discussions when I got stuck in research. I will always remember
those conversations when he encouraged me to make decisions in the best interest of my own
career. Pursuing a Ph.D. is a long journey, but I have truly enjoyed every minute doing
research under the supervision of Professor Gérard Medioni.
I would like to thank Professor Hao Li, Professor Alexander Sawchuk, Professor Ram
Nevatia and Professor Ulrich Neumann for serving on my dissertation and qualifying exam
committees and for their valuable feedback. I would also like to thank Professor Fei Sha,
who nominated me for the Provost's fellowship and provided me the opportunity to begin
my studies at USC. Looking back at the road I have traveled, I always feel grateful for
having met all those great people in my career: Professor Junping Zhang from Fudan
University, Jon Snoddy from Walt Disney Imagineering, and Sheila Vaidya, Michael
Goldman and Holger Jones from Lawrence Livermore National Laboratory.
I have benefited a lot from the former and current IRISers: Jongmoo Choi, Yinghao Cai, Dian
Gong, Xuemei Zhao, Jan Prokaj, Ruizhe Wang, Younghoon Lee, Tung Sin Leung, Shay
Deutsch, Matthias Hernandez, Chen Sun, Song Cao, Bor-Jeng Chen, Anh Tran, Jungyeon
Kim and Jatuporn Toy Leksut. Five years at USC has been a long journey full of great
memories, thanks to all my friends: Meihong Wang, Yili Zhao, Jing Huang, Hui Zhang,
Yuan Shi, Wei Guan, Zhenzhen Gao, Guan Pang and Rongqi Qiu.
I would like to thank my parents, my elder brother's family and all my old friends spread
around the world for their endless support and love across thousands of miles: Yi Zhou,
Xiaoping Huang, Xiaoyang Peng, Cheng Li, Minhua Xie, Gamer, Junbo Zhang and Bo Yuan.
I might have missed many moments with you during these five years, but love doesn't fade
away and distance makes it stronger. Last but not least, special thanks to my wife, Zhili Cao.
I am who I am because of your love. This dissertation is dedicated to you.
Table of Contents
Acknowledgement
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Problem Statement
1.2 Issues associated with Online Methods
1.3 Contributions
1.4 Dissertation Outline

2 Related Work
2.1 Algorithms for Camera Pose Estimation
2.2 Algorithms for Dense 3D Reconstruction
2.3 Structure-from-Motion
2.4 Real-time Monocular SLAM
2.5 Offline 3D Model Acquisition Systems
2.6 Online 3D Model Acquisition Systems

3 Online City 3D Reconstruction from Aerial Video
3.1 Introduction
3.2 Related Work
3.3 Overview
3.4 Initialization
3.5 Multi-View Stereo and Model Update
3.5.1 Multi-view stereo
3.5.2 Model update
3.6 Camera Pose Estimation
3.6.1 Optical flow guided feature matching
3.6.2 Occlusion handling
3.6.3 Stabilized optical flow
3.7 Urban Modeling
3.8 Experiments
3.8.1 Camera pose estimation results
3.8.2 Reconstruction results
3.9 Conclusion

4 Progressive 3D Model Acquisition with a Hand-held Camera
4.1 Introduction
4.2 Related Work
4.3 Overview
4.4 Sparse Point-based Tracker
4.4.1 Camera pose estimation
4.4.2 Point model update
4.4.3 Optical flow guided feature matching
4.4.4 Bundle adjustment
4.5 Dense Patch-based Tracker
4.5.1 Patch tracklets
4.5.2 Patch model update
4.5.3 Scanning for structural details
4.6 Model Refinement
4.6.1 Depth map refinement
4.6.2 Patch filtering
4.7 Experiments
4.7.1 Implementation and datasets
4.7.2 Running-time analysis
4.7.3 Camera pose estimation results
4.7.4 Reconstruction results
4.7.5 Problematic cases
4.8 Conclusion

5 Camera Pose Estimation for Multi-Camera Aerial Imagery
5.1 Introduction
5.2 Challenges of Multi-Camera Aerial Imagery
5.2.1 Non-hovering viewing pattern
5.2.2 Pose consistency of multiple cameras
5.2.3 Efficiency
5.3 Approach
5.3.1 Structure-from-Motion using sparse features
5.3.2 Multi-camera alignment
5.3.3 Parallelization using multiple GPUs
5.4 Experiments
5.4.1 Implementation
5.4.2 Datasets
5.4.3 Results
5.5 Conclusion

6 MeshRecon: a Mesh-Based Offline 3D Model Acquisition System
6.1 Introduction
6.2 Pipeline Overview
6.3 Camera Pose Configuration
6.4 Dense Point Cloud Generation
6.4.1 Multi-resolution plane sweep
6.4.2 Point-Camera-Ray 3D map
6.5 Visibility-consistent Surface Reconstruction
6.5.1 Delaunay triangulation
6.5.2 Surface reconstruction with graph-cut
6.6 Variational Mesh Refinement
6.6.1 Mesh-based energy function
6.6.2 GPU parallelization
6.7 Implementation and Results
6.7.1 System implementation
6.7.2 Datasets and results
6.7.2.1 Aerial imagery
6.7.2.2 Indoor object
6.7.2.3 Outdoor scenes
6.7.3 Discussions
6.7.3.1 Stationary camera and moving object
6.7.3.2 Failure cases
6.8 Conclusion

7 City-Scale Geometry Change Detection from Aerial Imagery
7.1 Introduction
7.2 Related Work
7.3 Pipeline Overview
7.4 City-Scale Geometry Inference
7.5 Change Detection from 3D Comparison
7.5.1 Alignment of city 3D models
7.5.2 Double-sided geometric distance map
7.5.3 Geometry changes with physical scale
7.6 Experiments
7.6.1 Real-world urban scenarios
7.6.2 Change detection results
7.6.3 Discussion on complementary methods
7.7 Conclusion

8 Conclusion and Future Work
8.1 Contributions
8.2 Future Work

Reference List
List of Tables
3.1 Running times. Init.: time cost in the initialization step. Pose.: average time cost on each frame for camera pose estimation. Model update: average time cost on each key frame for multi-view stereo and model update.
3.2 Quantitative evaluation of the error comparing against the "gold-standard" camera poses obtained from Global SfM. Trans. error: average distance difference of camera positions. Due to the lack of physical scale, the distance is measured in the same coordinate units. Rot. error: average angular difference of viewing directions.
4.1 Dataset details and running-time analysis. # Frame: number of total frames. # Key Frame: number of key frames. Scanning: total time cost for live scanning. Refinement: total time cost for model refinement including depth map refinement and patch filtering. Total: total time cost including both scanning and model refinement.
6.1 Dataset details and running-time for aerial imagery. # Frame: number of images used for reconstruction. Cam Pose.: running time of camera pose estimation. MeshRecon: running time of mesh reconstruction given the estimated camera poses.
6.2 Dataset details and running-time for indoor environment. # Frame: number of images used for reconstruction. VisualSFM: running time of camera pose estimation using VisualSFM. MeshRecon: running time of mesh reconstruction given the estimated camera poses. Total: total running time including camera pose estimation and dense reconstruction.
6.3 Dataset details and running-time for outdoor environment. # Frame: number of images used for reconstruction. VisualSFM: running time of camera pose estimation using VisualSFM. MeshRecon: running time of mesh reconstruction given the estimated camera poses. Total: total running time including camera pose estimation and dense reconstruction.
6.4 Dataset details and running-time for challenging cases. # Frame: number of images used for reconstruction. VisualSFM: running time of camera pose estimation using VisualSFM. MeshRecon: running time of mesh reconstruction given the estimated camera poses. Total: total running time including camera pose estimation and dense reconstruction.
List of Figures
1.1 Image-based geometric 3D model acquisition
1.2 Two sub-problems for image-based geometric acquisition: camera pose estimation of input images; dense reconstruction of the 3D model given camera pose
1.3 Illustration of offline methods and online methods for 3D model acquisition
1.4 City-scale 3D reconstruction from aerial imagery. The input is multi-camera aerial imagery captured using a high-resolution multi-camera system, and the output is the reconstructed geometric city 3D model
1.5 General 3D model acquisition with a commodity camera. The input is images captured from different views using a commodity camera, and the output is a full 3D mesh model with structural details
2.1 Illustration of different camera pose estimation algorithms
2.2 Dense reconstruction methods categorized based on the model representation
2.3 Structure-from-motion (SfM) methods take a discrete image set as input, and solve the camera pose estimation problem in a global setting. Image from Bundler [4] by Noah Snavely
2.4 Real-time monocular simultaneous localization and mapping (SLAM) methods take a video sequence as input and generate camera poses in real-time
2.5 Patch-based multi-view Stereo (PMVS) [40] by Furukawa and Ponce
2.6 High-resolution large-scale reconstruction [50, 108] by Hoang Hiep Vu et al.
2.7 Dense tracking and mapping (DTAM) [77] by Richard A. Newcombe et al.
2.8 Dense reconstruction on Mobile [102] by Marc Pollefeys et al.
3.1 An example of the 3D urban scene and the camera motion path in WAAS
3.2 Overview of the system pipeline
3.3 From left to right: key frame I_k, height map H_k and confidence map W_k
3.4 Illustration of feature matching between a new frame I_k and the maintained sparse 3D model M_s
3.5 Illustration of occlusion handling when back-projecting a 3D feature point X from M_S
3.6 Illustration of frame pairs and flow fields. Consecutive frame pairs are shown as red-green anaglyph images. (a) original frame pair. (b) stabilized frame pair. (c) flow field between original frame pair. (d) flow field between stabilized frame pair
3.7 Display of epipolar lines: blue lines from our approach; red lines from Global SfM; green lines from Sequential SfM. Epipolar lines from our approach and Global SfM are consistent and overlapped. Large errors exist in the epipolar lines from Sequential SfM
3.8 Display of camera paths on different datasets. The camera paths from our approach and Global SfM are overlapped
3.9 (a) The synthetic model spans horizontally in a 2.5 km × 2.3 km region with a height range of 230 m. (b) The signed distance error is unbiased with a mean of -0.01 m. The standard deviation is the RMS error, which is 0.72 m
3.10 Reconstructed full 3D model on Rochester video
3.11 Closeup views of the results on Rochester video (shaded model)
4.1 Overall pipeline of the proposed system
4.2 Illustration of the optical flow guided feature matching between a new frame I_n and the sparse model M_s
4.3 Each key frame I_k is divided into pixel image cells associated with patch tracklets to continuously update the patch model M_D
4.4 Illustration of scanning for structural details
4.5 Illustration of depth map refinement. Mesh models are generated through Poisson surface reconstruction under the same parameters
4.6 Illustration of patch filtering. Mesh models are generated through Poisson surface reconstruction under the same parameters
4.7 Illustration of the estimated camera motion as well as the reconstructed sparse model composed of SIFT feature points. Top row from left to right: results on shoe and stone. Middle row from left to right: results on Bach and statue. Bottom row from left to right: results on Cleopatra and gargoyl
4.8 Reconstruction results including a sample input image, reconstructed dense patches and the detailed mesh model from 2 different views. From top to bottom: shoe, stone, Bach, statue, Cleopatra and gargoyl
4.9 Comparison of the reconstruction results on birdhouse
5.1 Examples of the multi-camera aerial imagery
5.2 Illustration of the complicated viewing pattern in multi-camera aerial imagery
5.3 The 2D-2D feature matches are tracked across multiple frames
5.4 Illustration of the merge-and-track algorithm
5.5 Illustration of the quantitative results on multi-camera aerial imagery dataset
5.6 Running-time results on multi-camera aerial imagery dataset. Init: running time of initialization using relative pose estimation method; Single: running time of separate camera pose estimation for each individual camera; Multi: running time of multi-camera pose estimation after obtaining a global sparse 3D model with multi-camera alignment; Total: total running time
6.1 The input of MeshRecon is a set of sparse images captured from different views. The output is a full 3D mesh model of the target object
6.2 The pipeline of MeshRecon
6.3 Dense depth maps estimated using the multi-resolution plane sweep algorithm. Top: the input image and estimated depth map. Bottom: the dense point cloud is full of outliers
6.4 The pipeline of visibility-consistent surface reconstruction
6.5 Examples of degenerate triangle surfaces that are not two-manifold surfaces [24]
6.6 Illustration of the mesh refinement. (1) initial surface model. (2) surface model refined from 0.25 down-sampled images. (3) surface model refined from 0.5 down-sampled images. (4) surface model refined from full-size images
6.7 Example frame of multi-camera aerial datasets. From top to bottom: Rochester, CLIF 06 and CLIF 07
6.8 Results on multi-camera aerial datasets. From top to bottom: results on Rochester, CLIF 06 and CLIF 07
6.9 Close-up views of the reconstructed model on Rochester. Geometry details of building structures as well as vegetation are well captured
6.10 Results on Rochester. Top: the reconstructed city 3D model on Rochester, NY, USA using aerial imagery. Bottom: the city 3D model provided by Google
6.11 Results on indoor datasets. From top to bottom: results on woodland squirrel, squirrel, gnome, gargoyle and shoe
6.12 Results on outdoor datasets. From top to bottom: results on herz-jesu, entry, fountain, castle and lion fountain
6.13 Comparison of results reconstructed under different conditions: camera moving vs. face moving; diffused lighting vs. direct lighting
6.14 Results on objects with transparent and non-lambertian surfaces. From top to bottom: results on hand sanitizer and juice bottle
7.1 Two 6-camera WAAI captured in 2006 (left) and 2007 (right) separately, with close-up views covering the ground area where a building was demolished in 2007. It is difficult to identify the geometry changes over the entire city by comparing the high-resolution multi-camera aerial imagery, even for human analysts
7.2 Pipeline of our approach. City-scale 3D models, M_0 and M_1, are produced from data recorded at time t_0 and t_1. After alignment, geometry changes can be identified based on the geometric distance between M_0 and M_1
7.3 Facet subdividing process favors textured surfaces. Texture-less regions are represented with fewer triangle facets
7.4 The point-to-surface geometric distance is not symmetric. D_{M_0->M_1} has a larger value than D_{M_1->M_0} for the extra trees that are added in M_0
7.5 Illustration of the region of interest on Rochester and Columbus datasets
7.6 The 3D model from Google covers the urban area in Rochester, NY, USA. Texture is also provided but not used in our approach
7.7 City 3D models on two real-world urban scenarios
7.8 Detailed geometry changes at different scales are detected, ranging from an entire building cluster over tens of meters to an individual tree within several meters. Geometry changes corresponding to the extra structures in M_0 are colorized according to the geometric distance and displayed in M_1 for better visualization
7.9 Quantitative result of the detected geometry changes
7.10 False positive examples. Left: the sloped stadium roof is simplified as a plane in the Google model. Right: limited by the ground sampling distance, the aerial imagery fails to capture the netted antenna, which is treated as a solid tower in the Google model
7.11 Examples of the detected geometry changes. From left to right: geometric distance map D_{M_0->M_1}, model M_1 with texture, geometry changes corresponding to the extra structures in M_0 are displayed in M_1 and colorized based on the geometric distance for visualization
7.12 An example of the imagery inconsistency from geometry changes in aerial imagery
Abstract
Acquisition of geometric 3D models from 2D imagery has been essential for various appli-
cations. In particular, this dissertation investigates two important application scenarios:
city-scale 3D reconstruction from aerial imagery and general 3D model acquisition with
a commodity camera.
The first part of this dissertation explores an online solution to the problem. We
propose an approach to solve camera pose estimation and dense reconstruction from
Wide Area Aerial Surveillance (WAAS) videos captured by an airborne platform. Our
approach solves them in an online fashion: it incrementally updates a sparse 3D map and
estimates the camera pose as each new frame arrives; depth maps of selected key frames
are computed using a variational method and integrated to produce a full 3D model via
volumetric reconstruction. In practice, aerial imagery is usually captured using a multi-camera
system. We propose an approach for camera pose estimation of multi-camera aerial imagery
which is parallelized on multiple GPUs for efficiency. The approach is also extended for
progressive 3D model acquisition with a hand-held camera.
In many scenarios, an online approach is not a necessity and accuracy has higher priority
over efficiency. In the second part, we present MeshRecon, a mesh-based offline system
composed of three modules: a dense point cloud is generated using a multi-resolution
plane sweep method; an initial mesh model is extracted from the point cloud via a global
optimization considering visibility information of all images; the mesh model is then
iteratively refined to capture structural details by optimizing photometric consistency and
spatial regularization. The major processes are parallelized on GPU for efficiency. For
the aerial imagery case, we evaluate our system on several real-world multi-camera aerial
imagery datasets, each covering an urban scenario of several square kilometers. Quantitative
results show that the reconstructed geometric 3D model is highly accurate, with error
smaller than 1 meter over the entire city. Besides aerial imagery, we also evaluate its
performance on general geometric 3D model acquisition of real-world objects. Results show
that the system is robust and flexible for various types of objects at different scales in both
indoor and outdoor environments. Based on city 3D models reconstructed at different
times, we present a system for city-scale geometry change detection by performing
comparisons at the 3D geometry level. Our system is able to detect geometry changes at
different scales, ranging from a building cluster to small-scale vegetation changes, with
high accuracy. In the end, we conclude the dissertation with contributions and future
work.
Chapter 1
Introduction
1.1 Problem Statement
Geometric 3D model acquisition is the process of capturing the three-dimensional geom-
etry of real-world objects and scenes, which has been essential for applications in various
fields. Many approaches have been proposed for 3D model acquisition for different scenarios
using different technologies. Active scanners, such as structured-light scanners and
laser scanners, provide high-quality range maps which can be merged into a single 3D
model. The reconstructed models are highly accurate, but the setup is cumbersome.
As illustrated in Figure 1.1, image-based approaches attempt to extract the geometric
3D model from a set of images or video sequences based on photometric information.
Recent advances in computer vision enable accurate 3D model acquisition from images
with reconstruction error on the order of millimeters without particular hardware, except
a digital camera. In this dissertation, we investigate image-based geometric 3D model
acquisition, i.e., generating geometric 3D models from imagery. The texture information
is out of the scope of this dissertation. In particular, we consider the case where the target
object/scene is stationary and the camera is moving and capturing images from different
views. We also assume that the lighting is not changing or changing slowly, such that the
photometric condition of the target object/scene is consistent among neighboring views.
Note that this assumption holds in most scenarios for 3D model acquisition, and we also
explore the impact when this does not hold, i.e., the camera is stationary and the object is
moving under undiffused lighting.
(a) Input is images captured from different views
(b) Output is a geometric 3D model
Figure 1.1: Image-based geometric 3D model acquisition
Most image-based 3D model acquisition approaches share the following two core problems
during the process (Figure 1.2):
• camera pose estimation: in order to estimate the 3D geometry from a set of
images from dierent views, we need to obtain the 6D camera pose (camera 3D
position and camera 3D viewing direction in the world space) for each input image.
In computer vision, Structure-from-Motion (SfM) is the standard framework used
to estimate the camera poses as well as the 3D positions of a sparse set of feature
points. In Chapter 2, we review various SfM approaches in detail.
• dense reconstruction: the sparse point cloud from SfM is not sufficient to fully
describe the 3D geometry of the target object. We need to retrieve dense 3D
geometry information in order to fully capture the target 3D shape. Given the
estimated camera poses, Multi-View Stereo (MVS) algorithms are used to produce
a dense 3D model satisfying photometric consistency among the input images. A
detailed review of the related works on dense reconstruction is also presented in
Chapter 2.
Figure 1.2: Two sub-problems for image-based geometric acquisition: camera pose esti-
mation of input images; dense reconstruction of the 3D model given camera pose
(a) Offline methods solve camera pose estimation and dense reconstruction sequentially after obtaining all input images
(b) Online methods estimate the camera pose as each new frame arrives, and continuously update the reconstructed 3D model during the process
Figure 1.3: Illustration of offline methods and online methods for 3D model acquisition
In terms of the processing pipeline, geometric 3D model acquisition methods can be
categorized into offline methods and online methods. As shown in Figure 1.3 (a),
offline methods solve the two problems sequentially: the pipeline starts with collecting
images covering the target object from different views. After obtaining all the images,
the camera poses are estimated using a standard SfM framework. Given the camera poses
of all images, a dense 3D reconstruction algorithm is then used to produce the 3D model.
Offline methods produce accurate 3D models through incorporating information from all
collected data. However, if the user is not satisfied with the reconstructed model, e.g., the
object is not fully covered, the user needs to capture additional images and repeat the
reconstruction pipeline. Thus, the acquisition of a visually pleasing 3D model can be
time-consuming, with several iterations of the data collection and reconstruction loop. In terms of
storage requirement, offline methods need to store all collected data for processing, which
causes problems for large-scale high-resolution imagery datasets.
Online methods have been proposed to produce geometric 3D models for applications
where immediate feedback of the current camera pose and the reconstructed model is
needed. As shown in Figure 1.3 (b), online methods solve camera pose estimation and
dense reconstruction in parallel. The input data is collected in the form of video streams
or ordered image sequences. As each new image arrives, its camera pose is estimated
immediately and the reconstructed 3D model is continuously updated. In terms of storage
requirement, online methods are more flexible because the processed video frames are not
required to be stored in the system.
In this dissertation, we explore both online and offline 3D model acquisition solutions
using imagery data as input. In particular, we concentrate on two practical applications:
• City-Scale 3D Reconstruction from Aerial Imagery: as shown in Figure 1.4,
in this case, the input is Wide Area Aerial Imagery (WAAI) which is captured
by an airborne platform hovering around the urban scenes. In many scenarios, a
multi-camera system is mounted on the airborne platform providing high-resolution
multi-camera aerial imagery with a wide viewing angle. We aim to produce an
accurate city-scale 3D mesh model describing the geometry information of the city
using the imagery data. In this dissertation, we investigate both online and offline
solutions for this problem.
• General 3D Model Acquisition with a Commodity Camera: as shown in
Figure 1.5, we are also interested in a simple-to-use solution for general 3D model
acquisition, where a commodity camera is turned into a 3D scanner. Our goal is to
provide a 3D model acquisition system that is both accurate and robust for different
Figure 1.4: City-scale 3D reconstruction from aerial imagery. The input is multi-camera
aerial imagery captured using a high-resolution multi-camera system, and the output is
the reconstructed geometric city 3D model
Figure 1.5: General 3D model acquisition with a commodity camera. The input is
images captured from different views using a commodity camera, and the output is a full
3D mesh model with structural details
types of objects at different scales in both indoor and outdoor environments. In
this dissertation, we propose an online solution using a video sequence as input as
well as an offline solution using a set of sparse images as input.
1.2 Issues associated with Online Methods
Compared with offline methods, online image-based 3D model acquisition faces the fol-
lowing challenges:
• Pose drift: pose drift is a well-known problem in incremental camera pose esti-
mation due to the scale and translation errors accumulated along the path. For
3D model acquisition applications where accurate and consistent camera poses are
essential, pose drift is a problem which leads to inaccurate 3D geometry and mis-
registration in the reconstructed model.
In the standard offline method, a nonlinear optimization process, called Bundle Adjustment
(BA), is used to refine the estimated camera poses utilizing the point
correspondences across all images. For online applications, a careful design of the
pipeline is needed to achieve live performance.
• Lack of global visibility information: occlusion is a critical problem in both
camera pose estimation and dense 3D reconstruction, especially in regions with
repeated textures, such as building side walls. In these areas, the local feature
descriptor is not sufficient to provide accurate point correspondences. For example,
feature points on a building side-wall can be incorrectly matched with points with
similar texture on the opposite side of the building without global visibility
information. The incorrect point matches generate inaccurate camera poses and
artifacts in the reconstructed model.
In offline methods, this problem is easier to handle with the complete view of the
global picture as all input data is available before processing. For online applica-
tions, we need to handle the occlusion problem with the partial 3D model incre-
mentally reconstructed during the process.
• Efficiency: efficiency is essential for an online application, where immediate feedback
of the reconstructed model and camera poses is needed. We consider
efficiency in the aspects of both time and space. On one hand, the processing time
for each frame should be short for online performance. On the other hand, the
running time of each frame should remain stable after a long sequence of images
when the 3D model covers a large area.
In the proposed online works, we favor parallelizable algorithms and speed up our
approaches by taking advantage of the computing power available in multicore pro-
cessors and Graphics Processing Units (GPUs). We also take special care of the
previously visited areas in order to keep the system scalable for large scenes.
1.3 Contributions
In this dissertation, we investigate the problem of accurate 3D model acquisition from
imagery data. Compared with the state of the art, this dissertation provides three key
contributions in terms of novelty:
• City-Scale 3D Reconstruction using Multi-Camera Aerial Imagery. Re-
cently, high-resolution multi-camera aerial imagery is becoming the mainstream
thanks to the progress of imagery sensor techniques. Compared with traditional
single-camera aerial imagery, multi-camera aerial imagery poses several challenges
for both camera pose estimation and dense reconstruction. Firstly, the complicated
viewing pattern of multi-camera imagery requires a carefully designed pose estima-
tion system to avoid pose drift. Secondly, it is challenging to maintain the pose
consistency of different cameras because the neighboring cameras share a limited
overlap. It turns out that a fixed rigid transformation of the camera rig is not
reliable due to the camera vibration during recording. Finally, efficiency is obvi-
ously a problem due to the high-resolution and multi-camera property of the input
data. In this dissertation, we propose an end-to-end system for generating accurate
city 3D models using multi-camera aerial imagery. To the best of our knowledge,
this is the rst practical system proposed in literature for exploiting 3D geometry
information from multi-camera aerial imagery data.
• Accurate City-Scale Geometry Change Detection. The second key novel
contribution of this dissertation is a novel framework for city-scale geometry change
detection using high-resolution multi-camera aerial imagery. This problem is chal-
lenging to solve at the 2D imagery level because there is too much information in
the input data. It turns out to be a difficult task to identify the geometry changes
by visually comparing the high-resolution multi-camera aerial imagery, even for a
human expert. Furthermore, appearance changes that are irrelevant to geometry
changes, e.g., viewing angle, camera setting, seasonal change and weather
change, make the problem even more difficult. In this dissertation, we propose a
novel framework for geometry change detection which solves the problem by
comparing at the 3D geometry level. Results show that our system is able to detect
geometry changes at different scales, ranging from the scale of a building cluster to
the scale of individual vegetation.
• Progressive 3D Acquisition Integrating Online and Offline Techniques.
Existing 3D model acquisition systems can be categorized as either online or offline
systems. In this dissertation, we propose a novel framework for progressive 3D
model acquisition integrating both online and offline techniques. The hybrid system
works with a hand-held camera and starts with a live scanning stage where the 3D
model is incrementally updated. This provides immediate visual feedback to the
user to guide the next movement. After live scanning, an offline refinement stage
is integrated to refine the scanned 3D model, which ensures accuracy.
Besides the contributions in terms of novelty, we also claim two key contributions in terms
of the implementation of practical systems:
• City 3D Reconstruction System for Multi-Camera Aerial Imagery. We
presented an end-to-end 3D reconstruction system to produce accurate city geomet-
ric 3D models using aerial imagery. Ecient camera pose estimation algorithm is
proposed for high-resolution multi-camera aerial imagery using multiple GPUs. Our
system has been transferred to Lawrence Livermore National Laboratory (LLNL)
successfully and integrated into practical usage.
• Release of a Mesh-based Offline 3D Model Acquisition System. We implemented
MeshRecon, a mesh-based offline system for 3D model acquisition using a
set of sparse images as input. The system is flexible for various objects at different
scales in both indoor and outdoor environments and is efficient with parallelization
on GPU. We have released our system as well as a set of indoor datasets on the
website [11] to the computer vision research community for better evaluation and
comparison of different 3D model acquisition approaches.
1.4 Dissertation Outline
The rest of this dissertation is organized as follows: in Chapter 2, we review the related
work for camera pose estimation and dense reconstruction as well as the state-of-the-art
solutions for geometric 3D model acquisition, including both offline methods and online
methods; in Chapter 3, an online city 3D reconstruction framework using WAAS videos is
proposed combining variational depth map estimation and volumetric reconstruction; in
Chapter 4, we present a framework for progressive 3D model acquisition with a commod-
ity hand-held camera; in Chapter 5, we investigate the problem of camera pose estimation
for high-resolution multi-camera aerial imagery; in Chapter 6, we propose MeshRecon,
a mesh-based offline system for geometric 3D model acquisition using a set of sparse
images as input; in Chapter 7, we explore an important application based on geometric
city 3D models reconstructed at different times, i.e., city-scale geometry change detec-
tion from multi-camera aerial imagery; in Chapter 8, we conclude the dissertation with
contributions and future work.
Chapter 2
Related Work
In this chapter, we review the related work as well as the state-of-the-art systems for
image-based 3D model acquisition. Structure-from-Motion (SfM) and Simultaneous Lo-
calization and Mapping (SLAM) are the terms traditionally used in the computer vision
and robotics communities, denoting the process of estimating camera motion
and the 3D scene structure from images or video sequences at the same time. In this
chapter, we use SfM to refer to offline camera pose estimation of images and SLAM to
refer to the real-time camera pose estimation on video sequences. In section 2.1, we intro-
duce the fundamental algorithms for camera pose estimation which are commonly used in
SfM and SLAM. In section 2.2, we introduce and categorize the fundamental algorithms
for dense 3D reconstruction. After describing the fundamental algorithms, we summarize
various SfM works in section 2.3 and real-time monocular SLAM works in section 2.4.
In the end, we review the state-of-the-art 3D model acquisition systems, including both
offline solutions in section 2.5 and online solutions in section 2.6. More discussions on
specific topics are given individually in each corresponding chapter.
2.1 Algorithms for Camera Pose Estimation
Estimation of accurate camera poses is essential for 3D model acquisition systems. Many
camera pose estimation methods have been proposed to estimate the 6D camera pose
(position and orientation) of input images, given either 2D-2D point correspondences
between images or 2D-3D point correspondences between an image and a given 3D model.
Relative pose estimation. As illustrated in Figure 2.1 (a), relative pose estimation
methods aim to estimate the relative pose of two calibrated cameras given 2D-2D
image point correspondences. The relative pose between two cameras is represented using
a rotation matrix R and a translation vector T. Various methods have been proposed
to solve this problem under different constraints, with different numbers of required point
correspondences [67, 80, 99].
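As a concrete illustration (not part of the methods surveyed above), the following minimal Python sketch estimates the relative pose of two calibrated views with OpenCV; the matched pixel arrays pts0/pts1 and the intrinsic matrix K are assumed to come from an earlier feature-matching stage.

# Sketch: relative pose of two calibrated cameras from 2D-2D matches.
# pts0, pts1: Nx2 float arrays of matched pixel coordinates (assumed inputs);
# K: 3x3 camera intrinsic matrix.
import numpy as np
import cv2

def relative_pose(pts0, pts1, K):
    # Essential matrix with RANSAC to reject outlier matches.
    E, inliers = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # Decompose E, keeping the (R, T) pair that places points in front of both cameras.
    _, R, T, _ = cv2.recoverPose(E, pts0, pts1, K, mask=inliers)
    return R, T   # the translation T is recovered only up to scale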
Perspective-n-Point (PnP) method. As illustrated in Figure 2.1 (b), PnP meth-
ods are designed to estimate the pose of a calibrated camera from 2D-3D point corre-
spondences between the image and a given 3D model. The output is the absolute camera
pose, which shares the same coordinate system as the 3D model. Both direct
methods [65, 43] and iterative methods [73] have been proposed to address this problem,
requiring different numbers of point correspondences.
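A corresponding sketch for the PnP case, again using OpenCV only for illustration; the 2D-3D correspondences are assumed inputs, e.g. matches between a new image and an existing sparse 3D model.

# Sketch: absolute camera pose from 2D-3D correspondences (PnP with RANSAC).
# object_points: Nx3 model points; image_points: Nx2 pixel observations;
# both are assumed inputs, K is the 3x3 intrinsic matrix.
import numpy as np
import cv2

def absolute_pose(object_points, image_points, K):
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points.astype(np.float64), image_points.astype(np.float64),
        K, distCoeffs=None, reprojectionError=2.0)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec               # pose expressed in the 3D model's coordinate frame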
Bundle Adjustment. Given a set of observed points and their projections in mul-
tiple images from different views, Bundle Adjustment [48] (Figure 2.1 (c)) aims to
simultaneously refine the 3D positions of the observed points as well as the camera poses of all
(a) Relative pose estimation methods estimate the relative rotation and translation between two cameras based on 2D-2D image point correspondences
(b) Perspective-n-Point methods estimate the absolute pose given the 2D-3D point correspondences between the image and the 3D model
(c) Bundle Adjustment refines both the camera parameters and the 3D positions of observed points given their projections in multiple images
Figure 2.1: Illustration of different camera pose estimation algorithms
the images. This is achieved through a nonlinear optimization process minimizing the
total reprojection error with respect to all 3D point and camera parameters:
E(\Theta_i, X_j) = \sum_i \sum_j d\big(P(\Theta_i, X_j),\, x_{ij}\big)^2 \qquad (2.1)

where \Theta_i represents the camera pose of the i-th image, X_j is the j-th observed 3D point, and d(P(\Theta_i, X_j), x_{ij}) denotes the geometric error between the predicted projection of X_j on the i-th image and its actual projection x_{ij}
. As most points are only observed in a small part of the images, the normal equations
arising in the optimization process share a sparse block structure. A variety of Sparse
Bundle Adjustment (SBA) algorithms [104, 71] have been proposed to speed up the process
utilizing this sparse structure. Recently, Multicore Bundle Adjustment (MBA) [112, 12] has
been proposed to further speed up the process for large-scale reconstruction through
parallelization of the algorithm on multicore CPUs as well as GPUs.
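The reprojection error of Eq. (2.1) can be minimized with any nonlinear least-squares solver; the sketch below uses SciPy with an axis-angle camera parameterization, an illustrative choice rather than the implementation used in this dissertation, and it omits the sparsity exploitation (e.g., a Jacobian sparsity pattern) that SBA/MBA rely on.

# Sketch: bundle adjustment as nonlinear least squares on Eq. (2.1).
# cam_params: (n_cams, 6) axis-angle rotation + translation; points: (n_pts, 3);
# cam_idx, pt_idx, obs_xy list which camera observed which point and where.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(cams, pts, K):
    R = Rotation.from_rotvec(cams[:, :3])            # one rotation per observation
    cam_pts = R.apply(pts) + cams[:, 3:6]            # world -> camera coordinates
    uv = cam_pts[:, :2] / cam_pts[:, 2:3]            # perspective division
    return uv * [K[0, 0], K[1, 1]] + [K[0, 2], K[1, 2]]

def residuals(x, n_cams, n_pts, cam_idx, pt_idx, obs_xy, K):
    cams = x[:n_cams * 6].reshape(n_cams, 6)
    pts = x[n_cams * 6:].reshape(n_pts, 3)
    pred = project(cams[cam_idx], pts[pt_idx], K)
    return (pred - obs_xy).ravel()                   # stacked reprojection errors

def bundle_adjust(cam_params, points, cam_idx, pt_idx, obs_xy, K):
    x0 = np.hstack([cam_params.ravel(), points.ravel()])
    res = least_squares(residuals, x0, method="trf", loss="huber",
                        args=(len(cam_params), len(points), cam_idx, pt_idx, obs_xy, K))
    n = len(cam_params) * 6
    return res.x[:n].reshape(-1, 6), res.x[n:].reshape(-1, 3)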
Rotation averaging. Multiple rotation averaging has been the fundamental algorithm
for several recently proposed global SfM frameworks. Given the relative rotation
R_{ij} between each camera pair indexed by \{i, j\} in a set N, rotation averaging methods
aim at estimating an absolute rotation (with respect to a global world coordinate system)
for each individual camera, satisfying the rotation constraint R_j = R_{ij} R_i. In practice,
the optimal solution is achieved by minimizing the following energy function:

\sum_{\{i,j\} \in N} d\big(R_{ij},\, R_j R_i^{-1}\big) \qquad (2.2)

where d(\cdot, \cdot) denotes some cost function measuring the metric difference between two
rotation matrices. A recent survey of rotation averaging can be found in [47].
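One simple iterative scheme for this problem, given here only as an illustration of the chordal-metric case, repeatedly re-estimates each absolute rotation from its neighbors' current estimates and projects the average back onto SO(3); the pair set and relative measurements are assumed inputs.

# Sketch: Gauss-Seidel-style rotation averaging under the chordal metric.
# rel_rot maps a pair (i, j) to the measured relative rotation R_ij (R_j ~ R_ij R_i).
import numpy as np

def project_to_so3(M):
    # Closest rotation matrix in the Frobenius sense, via SVD.
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:                       # repair an improper rotation
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R

def average_rotations(n_cams, rel_rot, n_iters=50):
    R_abs = [np.eye(3) for _ in range(n_cams)]     # start all cameras at identity
    for _ in range(n_iters):
        for k in range(1, n_cams):                 # camera 0 fixes the global gauge
            acc = np.zeros((3, 3))
            for (i, j), R_ij in rel_rot.items():
                if j == k:
                    acc += R_ij @ R_abs[i]         # neighbor i predicts R_k
                elif i == k:
                    acc += R_ij.T @ R_abs[j]       # neighbor j predicts R_k
            if np.any(acc):
                R_abs[k] = project_to_so3(acc)     # chordal mean of the predictions
    return R_abs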
2.2 Algorithms for Dense 3D Reconstruction
Various multi-view stereo (MVS) algorithms have been proposed for dense 3D reconstruc-
tion, incorporating photometric information of multiple images from different views. As
summarized in [40], they can be categorized into four classes based on the underlying
representation of the 3D model (Figure 2.2):
Volume-based methods. Volume-based methods [35, 92, 117, 116, 30, 84, 107, 95,
92] split the 3D space into a grid of discrete voxels and estimate a binary or probability
occupancy status for each voxel based on checking the photometric consistency from
different views. They require a bounding box containing the target object/scene, and the
reconstructed structural detail is controlled by the volume size.
Depth map fusion methods. Methods based on the fusion of depth maps [44, 77,
102, 76, 68, 86, 78] estimate a depth map for each individual image and merge them to
produce a single 3D model. Depth maps are estimated using algorithms such as plane
sweeping [41], and further smoothed via dynamic programming, graph cuts or variational
methods.
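A minimal two-view plane-sweep sketch follows, to make the depth-map estimation step concrete; it is an illustrative simplification (single source view, fronto-parallel planes, a crude box-filtered photometric cost) rather than any of the published pipelines, and the depth range and plane count are assumed tuning choices.

# Sketch: plane-sweep depth estimation for one reference/source image pair.
# ref_img, src_img: grayscale float32 images; K: shared 3x3 intrinsics;
# (R, t): rigid transform from reference-camera to source-camera coordinates.
import numpy as np
import cv2

def plane_sweep_depth(ref_img, src_img, K, R, t, d_min, d_max, n_planes=64):
    h, w = ref_img.shape
    n = np.array([[0.0, 0.0, 1.0]])                  # fronto-parallel plane normal
    K_inv = np.linalg.inv(K)
    best_cost = np.full((h, w), np.inf, dtype=np.float32)
    best_depth = np.zeros((h, w), dtype=np.float32)
    for d in np.linspace(d_min, d_max, n_planes):
        # Homography induced by the plane n^T X = d, mapping reference pixels to source pixels.
        H = K @ (R + (t.reshape(3, 1) @ n) / d) @ K_inv
        warped = cv2.warpPerspective(src_img, H, (w, h),
                                     flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
        # Photometric cost: locally aggregated absolute difference.
        cost = cv2.blur(np.abs(ref_img - warped), (7, 7))
        better = cost < best_cost
        best_cost[better] = cost[better]
        best_depth[better] = d
    return best_depth, best_cost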
Patch-based methods. Patch-based methods [66, 40] attempt to obtain a 3D model
composed of local planar patches densely covering the object surface, which makes it
flexible for objects with complicated shapes.
Figure 2.2: Dense reconstruction methods categorized based on the model representation (volume-based, depth map fusion, patch-based and mesh-based)
Mesh-based methods. Deformable polygon mesh-based methods [49, 39, 50, 31]
start with an initial polygon model and deform it to fit the photometric information
from different views. Occlusion change due to mesh deformation is also handled in these
approaches. However, they require a good initial model in order to converge to the
globally optimal solution.
In practical applications, these methods are not used in isolation. Instead, they are
combined utilizing the good elements from each method. For example, depth maps are
estimated from a plane sweeping algorithm and then merged via volumetric integration
using Truncated Signed Distance Function (TSDF) in [116]. In [40], dense patches are
reconstructed from images through iterations of a matching, expansion and filtering
process. The detailed mesh model is then refined from the coarse mesh model through
mesh-based optimization. In [50], a coarse model is generated from dense point clouds
through graph cuts [61] and further refined into a detailed model through variational
mesh deformation [31].
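As a small illustration of the volumetric integration idea referenced above, the following sketch fuses one depth map into a TSDF voxel grid with a running weighted average; the grid layout, truncation distance and camera pose convention (R, t mapping world to camera) are assumptions made for the example.

# Sketch: fusing a single depth map into a Truncated Signed Distance Function grid.
# tsdf, weight: (nx, ny, nz) float arrays holding the fused values and weights;
# origin, voxel_size define the grid in world coordinates; depth_map: (h, w) depths.
import numpy as np

def integrate_depth(tsdf, weight, origin, voxel_size, depth_map, K, R, t, trunc=0.05):
    nx, ny, nz = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij")
    pts = origin + voxel_size * np.stack([ii, jj, kk], axis=-1).reshape(-1, 3)
    cam = pts @ R.T + t                              # world -> camera coordinates
    z = cam[:, 2]
    z_safe = np.where(np.abs(z) < 1e-9, 1e-9, z)     # avoid division by zero
    u = np.round(K[0, 0] * cam[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * cam[:, 1] / z_safe + K[1, 2]).astype(int)
    h, w = depth_map.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sdf = np.zeros(len(pts), dtype=np.float32)
    sdf[valid] = depth_map[v[valid], u[valid]] - z[valid]   # signed distance along the ray
    valid &= sdf > -trunc                            # drop voxels far behind the surface
    sdf = np.clip(sdf / trunc, -1.0, 1.0)            # truncate and normalize
    tsdf_f, w_f = tsdf.reshape(-1), weight.reshape(-1)
    idx = np.where(valid)[0]
    # Standard running weighted average used in TSDF fusion.
    tsdf_f[idx] = (tsdf_f[idx] * w_f[idx] + sdf[idx]) / (w_f[idx] + 1.0)
    w_f[idx] += 1.0
    return tsdf, weight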
Figure 2.3: Structure-from-motion (SfM) methods take a discrete image set as input,
solve the camera pose estimation problem in a global setting. Image from Bundler [4] by
Noah Snavely
2.3 Structure-from-Motion
In computer vision, Structure-from-Motion (SfM) (Figure 2.3) represents the process of
simultaneously estimating the camera motion and 3D positions of a sparse set of scene
points from multiple images. In this section, we review various SfM methods which have
been proposed for different scenarios in the past decade.
Most SfM frameworks share the following three estimation procedures. First, relative
rotations and translations between camera pairs are estimated using relative pose methods
as summarized in section 2.1. In the second step, all the camera poses are transformed and
registered within a global world coordinate system. In the end, Bundle Adjustment is used
to refine the obtained camera poses by minimizing the total reprojection error. Proposed
SfM frameworks can be categorized into different classes according to the strategy used
in the second step, i.e., the registration of all cameras within a global coordinate system.
Incremental SfM. The most standard SfM framework is incremental SfM [87, 17,
97], where cameras are registered within the same coordinate system incrementally through
adding new cameras one by one. First, the camera poses as well as the 3D positions
of a set of feature points are estimated as initialization from an initial camera pair. The
initial camera pair is chosen such that the two cameras share a large number of matched
points and have a large baseline at the same time. After initialization, a new camera is
added by matching its feature points with the recovered 3D points. Its camera pose is
estimated using the Perspective-n-Point (PnP) method as summarized in section 2.1.
Additional 3D points are also added from the matches obtained from the new camera to
incrementally enlarge the reconstructed point cloud. This process is repeated until all
cameras are added. Drift is a well-known problem for incremental SfM: the camera poses
of newly added cameras tend to drift away from the correct positions as the model is
updated, due to the errors accumulated during the process. Although incremental SfM is
generally successful for scenarios at different scales, it suffers from several drawbacks.
Bundle Adjustment needs to be used frequently during the process to refine the model,
which makes the framework time-consuming. Another drawback of incremental SfM is
that it is sensitive to the initialization and the order of new camera additions.
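To ground the initialization step just described, the sketch below builds the seed reconstruction from an initial image pair; it reuses the relative_pose helper from the earlier section 2.1 sketch, and the SIFT-plus-brute-force matching is simply one illustrative choice. Later cameras would then be registered against these points with PnP and the model grown incrementally, as described above.

# Sketch: two-view initialization of incremental SfM (seed pair -> poses + sparse points).
# img0, img1: grayscale images of the chosen initial pair; K: shared intrinsics.
import numpy as np
import cv2

def initialize_two_view(img0, img1, K):
    # Match SIFT features between the seed pair.
    sift = cv2.SIFT_create()
    kp0, des0 = sift.detectAndCompute(img0, None)
    kp1, des1 = sift.detectAndCompute(img1, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des0, des1)
    pts0 = np.float32([kp0[m.queryIdx].pt for m in matches])
    pts1 = np.float32([kp1[m.trainIdx].pt for m in matches])
    # Relative pose of the second camera (up to scale); the first camera is the origin.
    R, T = relative_pose(pts0, pts1, K)
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, T.reshape(3, 1)])
    # Triangulate matched points to seed the sparse 3D model.
    X_h = cv2.triangulatePoints(P0, P1, pts0.T, pts1.T)
    points3d = (X_h[:3] / X_h[3]).T
    poses = {0: (np.eye(3), np.zeros(3)), 1: (R, T.ravel())}
    return poses, points3d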
Hierarchical/Graph-based SfM. As an extension of incremental SfM, various SfM
works have been proposed to make the problem tractable and scalable utilizing hierarchi-
cal and graph structure of the camera poses. In [36], the observed 3D points and camera
pose of consecutive image triplets are estimated and merged into a larger model in a
hierarchical fashion. In [98], the system starts with processing a small subset of images
called the skeletal set, and adds the remaining images later using pose estimation. A graph
algorithm is used to select the subset of the reconstructed images. In [17], vocabulary-tree-based
whole-image similarity is used to provide the candidate image pairs, followed by the
skeletal sets algorithm [98] to find the minimal set of images for large-scale reconstruction.
Global SfM. Recently, several SfM works have been proposed to address the drift
and initialization problem of traditional incremental SfM works through a global opti-
mization involving all the camera poses. In the first group of works [45, 96, 18, 29, 52],
camera rotations and translations are solved sequentially. The absolute camera rotations
of all cameras are computed through multiple rotation averaging methods as described
in the previous section. Then, the absolute translations are computed utilizing the geometry
constraint involving all camera poses. In the second group of works [75, 53, 82], all camera
poses and 3D points are solved within a single global optimization procedure.
Other works. There are other works that are very different from traditional SfM
works. An interesting work is proposed by Hongdong Li [42], where geometric structure
is estimated based on graph rigidity theory without explicit estimation of camera motion.
However, this algorithm cannot scale to problems with a large number of feature points.
2.4 Real-time Monocular SLAM
Given a video input, real-time simultaneous localization and mapping (SLAM) (Fig-
ure 2.4) estimates the camera pose and updates the 3D scene structures in real-time as
each new frame arrives. Many real-time SLAM algorithms have been proposed using different
types of input, such as single-camera video, stereo video or a depth map stream from an
RGB-D sensor. In this dissertation, we concentrate on monocular SLAM works using a
video stream from a single moving camera as input.
(a) Parallel Tracking and Mapping [13] by Georg Klein and David Murray
(b) LSD-SLAM [34] by Jakob Engel et al.
Figure 2.4: Real-time monocular simultaneous localization and mapping (SLAM) methods
take a video sequence as input and generate camera poses in real-time
The current state-of-the-art monocular SLAM system is Parallel Tracking and Map-
ping (PTAM) [58] proposed by Georg Klein and David Murray. In PTAM, camera track-
ing and mapping are processed separately in two parallel threads. The estimation of cam-
era pose is achieved through the Perspective-n-Point method using the 2D-3D point matches
between the input frame and the maintained 3D map. The 3D map is updated from
triangulation of point tracklets in key frames. Both global and local Bundle Adjustment
are used to refine the camera poses and the resulting 3D map at each key frame. PTAM
is able to work at the video frame rate (30 Hz) with an input resolution of 640×480 and
remain accurate across a long sequence. The extension of this work to a mobile platform [59]
has also been proposed, with reduced accuracy and robustness compared to PTAM on a PC
platform.
2.5 Offline 3D Model Acquisition Systems
Several systems for offline 3D model acquisition have been proposed, including open-source
academic works as well as commercial solutions. In this section, we review two
state-of-the-art works for offline 3D model acquisition.
Patch-based Multi-View Stereo (PMVS). The state-of-the-art work on dense
reconstruction was proposed by Furukawa and Ponce [40]: a patch-based stereo
algorithm that generates dense models through iterative image patch matching and expansion
(Figure 2.5). The input is a set of images with accurate camera poses estimated from stan-
dard SfM algorithms, and the output is a model composed of local planar patches densely
covering the visible surface. It works iteratively over three steps: matching, expansion and filtering.
Figure 2.5: Patch-based multi-view Stereo (PMVS) [40] by Furukawa and Ponce
In the matching step, feature points between images are matched, which
leads to oriented local patches through a photometric-based optimization. In the ex-
pansion step, denser patches are reconstructed by propagating the geometry information
from reconstructed patches. Erroneous patches violating visibility constraint are removed
in the ltering step. After obtaining the dense patch model, a polygonal mesh model is
produced through Poisson surface reconstruction [57] and rened through an iterative de-
formation algorithm proposed in [39]. Recently, Clustering Views for Multi-view Stereo
(CMVS) [38] is proposed to extend the framework for large-scale dense reconstruction by
clustering the input images into a set of image clusters for parallelization.
High-resolution Large-scale Reconstruction. Hiep et al. [50, 108] proposed a pipeline for large-scale, detailed reconstruction in which a photometrically consistent mesh is generated from an initial dense point cloud with variational refinement (Figure 2.6). This work also serves as the engine behind Acute3D [1] and Autodesk's 123D Catch [2], which are commercial solutions for image-based 3D model acquisition. The pipeline starts by obtaining an extremely dense point cloud from SIFT features, HOG features as well as pixel grids using the Normalized Cross Correlation (NCC) score. Delaunay triangulation is performed on this dense point cloud to generate a polygonal model composed of many false facets. A globally optimal, visibility-consistent triangular mesh model is then extracted from the model through polygonal carving using graph-cuts [61]. In the end, the obtained mesh model is refined to capture small details through a variational refinement procedure as proposed in [31].

Figure 2.6: High-resolution large-scale reconstruction [50, 108] by Hoang Hiep Vu et al.
2.6 Online 3D Model Acquisition Systems

Recently, several online systems for image-based 3D model acquisition have been proposed. In this section, we review three state-of-the-art works in this group.
Live Dense Reconstruction. In [78], accurate camera poses as well as a coarse point cloud model are obtained using PTAM [58]. An initial continuous surface is estimated from the coarse point cloud using the method proposed in [81]. Given a camera bundle composed of several neighboring frames, a detailed depth map of the center frame is estimated using constrained scene flow incorporating photometric information from different views. Specifically, the deformation of the initial model is estimated through optical flow between the synthetic views and the actual imagery. The ray-casted triangular mesh from each camera bundle is integrated directly to produce the global mesh model.

Figure 2.7: Dense tracking and mapping (DTAM) [77] by Richard A. Newcombe et al.
Dense Tracking and Mapping (DTAM). Recently, a system for 3D model acquisition was proposed in [77], where dense camera tracking and reconstruction are performed in real time (Figure 2.7). In DTAM, robust camera tracking is achieved through dense image alignment between the synthetic views rendered from the continuously-updated dense model and the actual imagery. For dense reconstruction, the depth map of each key frame is estimated using a variational method with discrete-continuous optimization minimizing a photometric data term and a smoothness regularization term.
Live Reconstruction on Mobile Device. A recent work [102] has demonstrated the potential for real-time dense reconstruction on a mobile platform with the aid of on-device inertial sensors (Figure 2.8). In this system, an initial camera pose for each frame is estimated from the on-device inertial information. The feature points in a maintained sparse 3D model are back-projected onto the current frame and matched with local feature points through a local search procedure. The camera pose is then refined based on the obtained 2D-3D feature matches. During processing, new points are added into the sparse model as each selected key frame arrives. For dense reconstruction, dense depth maps are estimated from binocular stereo with a multi-resolution scheme to speed up the process. Finally, invalid depth values are filtered out by checking the depth consistency across neighboring depth maps.

Figure 2.8: Dense reconstruction on mobile [102] by Marc Pollefeys et al.
Chapter 3
Online City 3D Reconstruction from Aerial Video
3.1 Introduction
3D urban reconstruction from imagery is essential for various applications, such as urban planning and virtual city tours. For wide area aerial surveillance (WAAS) specifically, an up-to-date 3D model is helpful to solve occlusion problems in many tasks, such as vehicle tracking [89] and traffic inference [118]. The complete 3D model can be further used to remove redundancy in the video stream and compress the data dramatically, which is very important given the limited power and computation capability on the airborne platform. To achieve these goals, both accuracy and efficiency of the reconstruction process are required.

Two core problems need to be addressed in the process: camera pose estimation, i.e., the estimation of 6D camera poses (position and orientation) for each image; and dense reconstruction, i.e., the establishment of a dense 3D model. In state-of-the-art works, most approaches solve these two problems sequentially: feature-based Structure-from-Motion (SfM) estimates camera poses based on a sparse set of matched 2D feature points, such as SIFT [72]. After obtaining all camera poses, a multi-view stereo algorithm is used to generate a dense 3D model. In feature-based SfM, camera pose drift is a critical issue due to the errors accumulated along the path. To reduce drift, a global Bundle Adjustment [104] is required. It involves a non-linear optimization over many or all camera poses and feature point positions, which makes it time-consuming when the number of frames is large.

Figure 3.1: An example of the 3D urban scene and the camera motion path in WAAS
WAAS videos are captured in high resolution by an airborne platform hovering around the target scene following a simple camera motion path (Figure 3.1). They thus come with large overlap between all frames, which allows for detailed 3D urban reconstruction that is both accurate and efficient. In this chapter, we propose an online approach to solve camera pose estimation and dense reconstruction from WAAS videos. A sparse 3D model and a dense 2.5D Digital Surface Model (DSM) are incrementally updated during the process. Our approach estimates the camera pose at each frame while updating these two models as each new frame arrives. Different from traditional SfM solely based on matched 2D points, the camera pose of each new frame is estimated using the Perspective-n-Point (PnP) method based on 2D-3D Image-Model feature matches. Dense optical flow between successive frames, computed after a step of 2-D stabilization, is used to guide the feature matching between each new frame and the maintained sparse 3D model. We also produce a highly-detailed full 3D model via volumetric integration. Experiments on both synthetic and real-world datasets validate the performance of our approach. In summary, we offer the following contributions:

• Online solution: camera pose estimation and dense reconstruction are solved within an online pipeline, which estimates camera poses at each frame while updating the models during the process.

• Efficient framework: without global Bundle Adjustment, our online approach is significantly faster than the latest batch methods.

• Accurate camera pose estimation and detailed urban modeling: camera poses are estimated as accurately as with global Bundle Adjustment without drift along the path. We also produce a highly-detailed full 3D urban model.
3.2 Related Work
Standard SfM frameworks are usually used to estimate camera poses for urban aerial images offline. Wendel et al. [110] also proposed a work for live dense reconstruction which is designed specifically for micro aerial vehicles, as an extension of Parallel Tracking And Mapping (PTAM) [58]. Various dense reconstruction methods have been proposed for urban reconstruction. Volumetric methods split the space into voxels and estimate a binary or probabilistic occupancy status for each voxel. In [85], each voxel is associated with an occupancy probability and an appearance model. In [25], probabilistic volumetric modeling is combined with smooth signed distance surface estimation to generate surfaces for urban scenes. Point-based methods generate a dense point cloud by estimating a 3D position for pixels in the input images. Structural properties of urban scenes are often used to constrain the problem. In [37], scenes are assumed to be composed of piecewise-planar surfaces with dominant orientations. There are also works [68][86] that constrain the problem with a piecewise-planar prior without dominant orientations. Graph cuts [37] and dynamic programming [68] are usually used to find the stereo correspondences. Patch/mesh-based methods have also been proposed with the capability to do urban reconstruction. In [40], a patch-based stereo algorithm is proposed to generate dense models via image patch matching and expansion. Hiep et al. [50] proposed a pipeline for large-scale and detailed reconstruction, in which a photometrically consistent mesh is generated from the initial dense point cloud with variational optimization.
3.3 Overview
Our approach uses a WAAS video sequence as input. The pipeline is composed of four parts: 1. Initialization (Figure 3.2 (a)): the camera poses of the first N + 1 frames as well as a virtual ground plane are estimated as initialization. 2. Multi-view stereo and model update (Figure 3.2 (b)): two 3D models are maintained and incrementally updated during the process: a sparse 3D model composed of SIFT feature points, which is used for camera pose estimation, and a dense 2.5D Digital Surface Model (DSM), which is used for occlusion handling. When new frames arrive, our approach estimates the depth maps using multi-view stereo and updates the sparse 3D model as well as the DSM. 3. Camera pose estimation (Figure 3.2 (b)): the camera pose is estimated at each frame using the Perspective-n-Point (PnP) method with 2D-3D Image-Model feature matching. 4. Urban modeling (Figure 3.2 (c)): a full 3D model is generated using a volumetric integration method. In the following four sections, we explain each part in detail.

Figure 3.2: Overview of the system pipeline. (a) Initialization and the estimation of a virtual ground plane. (b) Online camera pose estimation and model update. (c) Urban modeling by fusing dense depth maps from different views using a volumetric method.
3.4 Initialization
We estimate the camera poses of the first N + 1 frames using a standard Structure-from-Motion (SfM) algorithm as initialization. N is the number of neighboring frames used in the multi-view stereo algorithm, which is described in section 3.5.1. The choice of a proper N relies on the baseline between successive frames. In all experiments presented in this chapter, we use N = 20. Based on the sparse point cloud generated from the initial SfM algorithm, we also aim to find a virtual ground plane in terms of which the elevation values in the DSM are defined. Our approach estimates a virtual ground plane G using Principal Component Analysis (PCA) on the obtained sparse point cloud (Figure 3.2 (a)). The normal direction is determined as the direction with minimum variance. For WAAS videos of urban scenes, a reasonable plane G can be found using the above strategy since the height range is very small compared with the range spanned in the ground plane. Our approach still works well when the plane is virtual, e.g., in the mountainous terrain case, as shown later in section 3.8.2.
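For illustration, the sketch below shows one way to carry out this PCA-based plane estimation; the helper names and the normal-orientation convention are assumptions for the example, not the exact implementation used in our system.

import numpy as np

def estimate_virtual_ground_plane(points):
    """PCA-based virtual ground plane from an (N, 3) sparse point cloud.

    Returns the plane centroid, its unit normal (direction of minimum
    variance), and two in-plane axes spanning the X-Y directions of G.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    # Eigen-decomposition of the 3x3 covariance matrix; eigh returns
    # eigenvalues in ascending order, so the first eigenvector is the
    # minimum-variance direction, i.e. the plane normal.
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]
    axis_x, axis_y = eigvecs[:, 2], eigvecs[:, 1]   # dominant in-plane axes
    # Orient the normal consistently (assumed convention: pointing upwards).
    if normal[2] < 0:
        normal = -normal
    return centroid, normal, axis_x, axis_y

def height_above_plane(X, centroid, normal):
    """Signed elevation of a 3D point X with respect to the plane G."""
    return float(np.dot(X - centroid, normal))

All elevation values used by the DSM in the following sections can then be measured with height_above_plane.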
3.5 Multi-View Stereo and Model Update
As shown in Figure 3.2 (b), our approach maintains two models and updates them when new frames arrive: a sparse 3D model M_S composed of SIFT feature points and a 2.5D Digital Surface Model M_D. When a new frame and its neighboring frames arrive, we compute its depth map using multi-view stereo as described in section 3.5.1. Then, we update M_S and M_D as described in section 3.5.2. The camera pose of each new frame is estimated as described in section 3.6, except for the first N + 1 frames, whose camera poses are estimated in the initialization step. It is not necessary to update the models at each frame due to the small variance between successive frames in video sequences. In practice, we update the models on key frames only. A new frame is labeled as a key frame when the number of feature matches between this frame and the maintained sparse 3D model M_S is lower than a threshold. Details of the feature matching are described in section 3.6.1. In any case, we also add a key frame every 50 frames to keep the models up-to-date.
3.5.1 Multi-view stereo
Given camera poses of a key frame I_k and its neighboring frames I_i \in N(k), we compute the height map H_k for I_k using multi-view stereo. Finding stereo correspondences for urban imagery is a difficult problem due to the homogeneous textures and massive occlusions. In our approach, we use the multi-view stereo algorithm proposed in DTAM [77].
Photometric evidence from multiple views as well as epipolar geometry are used to constrain the problem and reduce ambiguities. We minimize an energy function including a non-convex photometric error term and a convex regularization term:

E_h = \int_{\Omega} C(x, h(x)) \, dx + \int_{\Omega} \| \nabla h(x) \|_{\epsilon} \, dx    (3.1)

where h(x) represents the height of pixel x above the virtual ground plane G, and \Omega is the 2D image domain of I_k. C(x, h(x)) represents the photometric error term measuring the average intensity error across the neighboring views:

C(x, h) = \frac{1}{N} \sum_{i \in N(k)} \left| I_k(x) - I_i(\pi_i(\pi_k^{-1}(x, h, G))) \right|    (3.2)

where \pi_k^{-1}(x, h, G) is the operator that computes the position of the 3D point projected from pixel x on I_k when assigned height h above the virtual ground plane G, and \pi_i(\pi_k^{-1}(x, h, G)) is the operator that computes the pixel location on I_i back-projected from this 3D point. \| \nabla h(x) \|_{\epsilon} is a Huber norm regularization term used to smooth the generated height map while preserving boundary discontinuities at the same time. Different from a pure total-variation regularization term, the Huber norm allows smooth small-scale variation to avoid the stair-casing effect.
To optimize Equation 3.1, we couple the photometric error term and regularization term with an auxiliary variable h' and optimize h and h' alternately:

E_{h, h'} = \int_{\Omega} \left\{ C(x, h(x)) + \| \nabla h'(x) \|_{\epsilon} \right\} dx + \int_{\Omega} \frac{1}{2\theta} (h(x) - h'(x))^2 \, dx    (3.3)

Figure 3.3: From left to right: key frame I_k, height map H_k and confidence map W_k

With h' fixed, we search exhaustively over a finite range of sampled height values to optimize the non-convex photometric error term of variable h. Fixing h, the optimization of h' can be achieved using a primal-dual approach. More details can be found in [77][26].
Our implementation is slightly different from the one used in DTAM, which optimizes over the inverse depth of each pixel. DTAM is mainly designed for indoor virtual reality applications, where the depth range is limited and easy to estimate for the exhaustive search when optimizing the photometric error term. Instead, we optimize over the height above the virtual ground plane, since the depth range varies significantly but the height range is limited in urban scenes. The height range as well as the virtual ground plane G are estimated in the initialization step. A confidence map W_k is also computed, representing the confidence of the estimated height values in H_k:

W_k(x) = \min\left( \cos(\alpha(x)), \; 1 - C(x, H_k(x)) \right)    (3.4)

where \alpha(x) is the angle between the estimated surface normal and the optical ray, used to avoid grazing ramps. The intensity values of input frames are scaled to [0, 1] such that the photometric error term C(x, H_k(x)) lies in the same range. An example is illustrated in Figure 3.3.
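For concreteness, the following simplified, single-threaded sketch evaluates the photometric data term over a discrete set of sampled heights and derives the confidence map of Equation 3.4. The helper project_to_view is a hypothetical stand-in for the \pi operators above, and the Huber-regularized smoothing of Equation 3.3 is deliberately omitted.

import numpy as np

def photometric_cost_volume(I_k, neighbors, heights, project_to_view):
    """Brute-force evaluation of C(x, h) for every pixel and sampled height.

    I_k           : reference key frame, float image scaled to [0, 1]
    neighbors     : list of (image, camera) tuples for I_i, i in N(k)
    heights       : 1D array of sampled heights above the ground plane G
    project_to_view(camera, x, y, h) -> (u, v) pixel in the neighboring view
    """
    H, W = I_k.shape
    cost = np.zeros((len(heights), H, W), dtype=np.float32)
    for hi, h in enumerate(heights):
        for (I_i, cam_i) in neighbors:
            for y in range(H):
                for x in range(W):
                    u, v = project_to_view(cam_i, x, y, h)
                    if 0 <= int(v) < I_i.shape[0] and 0 <= int(u) < I_i.shape[1]:
                        cost[hi, y, x] += abs(I_k[y, x] - I_i[int(v), int(u)])
        cost[hi] /= len(neighbors)
    return cost

def height_and_confidence(cost, heights, cos_angle):
    """Winner-take-all height map plus the confidence map of Equation 3.4."""
    best = cost.argmin(axis=0)
    H_k = heights[best]
    C_best = np.take_along_axis(cost, best[None], axis=0)[0]
    W_k = np.minimum(cos_angle, 1.0 - C_best)     # min(cos(alpha), 1 - C)
    return H_k, W_k

In the actual system this cost evaluation is the part parallelized on the GPU, since every pixel and every sampled height can be processed independently.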
3.5.2 Model update
M_S is a sparse 3D point cloud composed of SIFT feature points used for camera pose estimation. Our approach updates M_S after obtaining the height map H_k and confidence map W_k of a key frame I_k. For each SIFT feature point x detected on I_k, we compute the corresponding 3D point X according to H_k(x) if its confidence value W_k(x) is larger than a threshold. The 3D position of X as well as the feature descriptor of x are then added into M_S. The confidence threshold is set as 0.7 in our experiments.
A Digital Surface Model (DSM) is a proper representation to store the geometric information of urban scenes, with values representing elevations above the virtual ground plane. Our approach also maintains a DSM M_D as well as a corresponding confidence map W_D, and incrementally updates them at each key frame. For a 3D point X projected from pixel x, we use \bar{X} to represent its projection on the virtual ground plane. Our approach updates the DSM using the following rules:

M_D(\bar{X}) = \frac{W'_D(\bar{X}) \, M'_D(\bar{X}) + W_k(x) \, H_k(x)}{W'_D(\bar{X}) + W_k(x)}    (3.5)

W_D(\bar{X}) = W'_D(\bar{X}) + W_k(x)    (3.6)
where M'_D and W'_D are the DSM and the confidence map before the update at key frame I_k. In areas where multiple inconsistent height values are projected to the same location \bar{X}, e.g., building side-walls, we reset M_D(\bar{X}) using the maximum height. In the process of updating M_S and M_D, we bilinearly interpolate H_k(x) and W_k(x) to achieve sub-pixel accuracy.
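A minimal sketch of this weighted update (Equations 3.5 and 3.6) is given below, assuming the DSM and its confidence map are stored as 2D arrays indexed by the ground-plane cell of each projected point; the inconsistency tolerance is an assumed value for the example.

import numpy as np

def update_dsm(M_D, W_D, cells, heights, weights, height_tol=2.0):
    """Confidence-weighted DSM update (Equations 3.5 and 3.6).

    M_D, W_D : 2D arrays holding the current elevation and confidence maps
    cells    : (N, 2) integer array, ground-plane cell (row, col) of each point
    heights  : (N,) heights H_k(x) sampled from the height map
    weights  : (N,) confidences W_k(x) sampled from the confidence map
    """
    for (r, c), h, w in zip(cells, heights, weights):
        if w <= 0:
            continue
        w_old, m_old = W_D[r, c], M_D[r, c]
        # Inconsistent heights projected onto the same cell (e.g. building
        # side-walls): keep the maximum elevation instead of averaging.
        if w_old > 0 and abs(h - m_old) > height_tol:
            M_D[r, c] = max(m_old, h)
        else:
            M_D[r, c] = (w_old * m_old + w * h) / (w_old + w)
        W_D[r, c] = w_old + w
    return M_D, W_D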
3.6 Camera Pose Estimation
We aim to estimate the 6D camera pose whenever a new frame arrives. Camera pose drift is a common problem when estimating camera poses sequentially, as error accumulates along the path. In standard SfM works, 2D-2D feature matches between images are used, with a global Bundle Adjustment involving all cameras implemented to address the drift problem. Our approach utilizes a different strategy and estimates the camera pose of each new frame using 2D-3D Image-Model feature matches. When a new frame I_k arrives, 2D SIFT feature points in I_k are matched with the 3D feature points stored in M_S. The camera pose of I_k is then estimated using the PnP method with a RANSAC scheme. Using these 2D-3D Image-Model feature matches, the camera pose estimation of each frame is bundled with the maintained 3D model M_S, which effectively reduces drift.
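As an illustration, a minimal sketch of this step with OpenCV's PnP solver is shown below (the descriptor-matching front end of section 3.6.1 is abstracted away as pts3d/pts2d correspondences, the RANSAC parameters are assumed values, and solvePnPRefineLM requires a reasonably recent OpenCV release).

import cv2
import numpy as np

def estimate_camera_pose(pts3d, pts2d, K):
    """Estimate the 6D pose of a new frame from 2D-3D Image-Model matches.

    pts3d : (N, 3) 3D SIFT points from the sparse model M_S
    pts2d : (N, 2) matched 2D SIFT detections on the new frame
    K     : 3x3 camera intrinsic matrix
    """
    obj = pts3d.astype(np.float64)
    img = pts2d.astype(np.float64)
    dist = np.zeros((4, 1))                       # assume undistorted imagery
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj, img, K, dist,
        reprojectionError=2.0, iterationsCount=200,
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok or inliers is None:
        return None
    # Non-linear refinement (Levenberg-Marquardt) on the inlier set only.
    rvec, tvec = cv2.solvePnPRefineLM(
        obj[inliers[:, 0]], img[inliers[:, 0]], K, dist, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec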
3.6.1 Optical flow guided feature matching

Feature matching is one of the most time-consuming steps of standard SfM [111]. To make use of the small variance between successive frames in video sequences, we use dense optical flow to guide the feature matching between I_k and M_S. The feature matching process is composed of 3 steps as illustrated in Figure 3.4.
Figure 3.4: Illustration of feature matching between a new frame I_k and the maintained sparse 3D model M_S
1. Back-projection: for each 3D feature point X stored in M_S, we back-project it onto the previous frame I_{k-1} and obtain its projection x_{k-1}. Occlusion is handled in this step as described in section 3.6.2.

2. Dense optical flow: we compute the optical flow between I_{k-1} and I_k after a step of 2-D stabilization, as detailed in section 3.6.3. Then, the projection of X on I_k can be evaluated by adding the displacement to its projection on I_{k-1}:

x_k = x_{k-1} + u_{k-1}^{k}(x_{k-1})    (3.7)

where u_{k-1}^{k}(x_{k-1}) represents the flow vector from I_{k-1} to I_k at point x_{k-1}. In all experiments, the optical flow between the frames down-sampled to half size is accurate enough to provide a good approximation.

3. Local search: we now search in a small region around x_k on I_k to find the 2D SIFT feature match of X, which is very efficient. A feature match is found if the similarity, defined as the dot product of normalized feature descriptors, is larger than 0.7. In all experiments, the search region is defined as a circle with radius of 3 pixels centered around x_k. If multiple features on I_k are found to match with X, we drop them all to avoid false matches with ambiguous features, e.g., in regions with homogeneous texture. A condensed sketch of these three steps is given below.
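The sketch assumes the back-projection, the visibility test of section 3.6.2 and the dense flow field are already available; project_to_frame, is_visible and flow are hypothetical inputs standing in for those components.

import numpy as np

def flow_guided_matches(model_points, model_descs, kp_xy, kp_descs,
                        project_to_frame, is_visible, flow,
                        radius=3.0, sim_thresh=0.7):
    """Match 3D points of M_S against SIFT detections on the new frame I_k.

    model_points / model_descs : 3D points in M_S and their SIFT descriptors
    kp_xy / kp_descs           : SIFT keypoint locations and descriptors on I_k
    project_to_frame(X) -> (x, y) projection of X onto the previous frame
    is_visible(X)              : occlusion test against the DSM (section 3.6.2)
    flow                       : dense flow field from I_{k-1} to I_k, (H, W, 2)
    """
    matches = []
    for idx, (X, d) in enumerate(zip(model_points, model_descs)):
        if not is_visible(X):
            continue
        x_prev = np.asarray(project_to_frame(X), dtype=float)    # step 1
        u = flow[int(round(x_prev[1])), int(round(x_prev[0]))]   # step 2
        x_pred = x_prev + u
        # Step 3: local search among SIFT detections within `radius` pixels.
        near = np.where(np.linalg.norm(kp_xy - x_pred, axis=1) < radius)[0]
        sims = [float(np.dot(d, kp_descs[j])) for j in near]     # normalized descriptors
        good = [j for j, s in zip(near, sims) if s > sim_thresh]
        if len(good) == 1:                                       # drop ambiguous matches
            matches.append((idx, good[0]))
    return matches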
3.6.2 Occlusion handling
Occlusion needs to be handled when back-projecting a 3D feature point X from M_S onto I_{k-1}. We make the assumption that no objects are floating, which is reasonable in urban environments. As illustrated in Figure 3.5, occlusion is handled by checking the occluding status along the optical ray (X, X_cam), where X_cam is the camera center of I_{k-1} with its projection \bar{X}_cam on the virtual ground plane. To check the visibility of X, whose projection on the virtual ground plane is \bar{X}, we only need to check the occluding status of the line segment (\bar{X}, \bar{X}_cam) on M_D: X is visible if all height values on this line segment are known and do not exceed the occluding height at that location. Furthermore, with a maximum height prior h_max of the target scene, we can find the highest possible occluding point X_max on the optical ray with height h_max and its projection \bar{X}_max on the virtual ground plane. To check the visibility of X, we only need to check the occluding status of the line segment (\bar{X}, \bar{X}_max) on M_D instead.
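A minimal sketch of this visibility test is shown below; sampling step, cell-size convention and the use of NaN for unknown DSM cells are assumptions made for the example.

import numpy as np

def is_visible(M_D, p_ground, p_max_ground, h_point, h_max, cell_size=1.0):
    """Check the occluding status of the segment (X_bar, X_bar_max) on the DSM.

    p_ground     : 2D ground-plane projection of the 3D point X
    p_max_ground : 2D projection of the highest possible occluder X_max
    h_point      : height of X above the virtual ground plane
    h_max        : maximum height prior of the scene
    """
    p0, p1 = np.asarray(p_ground, float), np.asarray(p_max_ground, float)
    n_steps = max(2, int(np.ceil(np.linalg.norm(p1 - p0) / cell_size)))
    for t in np.linspace(0.0, 1.0, n_steps):
        q = p0 + t * (p1 - p0)
        r, c = int(round(q[1] / cell_size)), int(round(q[0] / cell_size))
        if not (0 <= r < M_D.shape[0] and 0 <= c < M_D.shape[1]):
            return False
        elevation = M_D[r, c]
        if np.isnan(elevation):                  # unknown cell: be conservative
            return False
        # Height of the optical ray above the ground plane at parameter t.
        ray_height = h_point + t * (h_max - h_point)
        if elevation > ray_height:               # the DSM occludes the ray
            return False
    return True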
3.6.3 Stabilized optical flow

Figure 3.5: Illustration of occlusion handling when back-projecting a 3D feature point X from M_S

Figure 3.6: Illustration of frame pairs and flow fields. Consecutive frame pairs are shown as red-green anaglyph images. (a) original frame pair. (b) stabilized frame pair. (c) flow field between original frame pair. (d) flow field between stabilized frame pair.

The state-of-the-art method for efficient optical flow computation is the Total-Variation L1 (TV-L1) optical flow described in [115], which minimizes an L1 norm data term and a total-variation regularization term:

E_u = \int \left( \left| I_i(x) - I_j(x + u_i^j(x)) \right| + \left\| \nabla u_i^j(x) \right\| \right) dx    (3.8)

However, applying it directly to the original frames does not give very good results because the flow is due to both camera motion and parallax (Figure 3.6 (c)), which does not necessarily fit the small-motion assumption and piecewise-linear constraint in Equation 3.8. Instead, we first stabilize successive frames in terms of a dominant plane using the homography estimated from SIFT features with a RANSAC scheme, and compute the optical flow between the stabilized frames. After stabilization, the displacement of an image point is directly proportional to its height above the dominant plane, and inversely proportional to its distance from the camera [60]. For aerial images, the displacement can be considered proportional to the height since the distance to the camera is very large. As shown in Figure 3.6 (d), the flow field between the stabilized frames is piecewise linear with small displacements proportional to the height, which fits the optical flow formulation well. Then, we compute the flow field between the original frames by compensating for the homography used for stabilization. The estimated dominant plane for stabilization between each pair of successive frames is not required to be stable, or the same as the virtual ground plane G. In cases where a real dominant plane does not exist, e.g., in the mountainous terrain case, our approach still works as well.
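A minimal sketch of this stabilize-then-flow idea with OpenCV is given below. Farneback flow is used here only as a stand-in for the GPU TV-L1 solver of [115], and the SIFT matching between the two frames is abstracted as pts_prev/pts_next; both substitutions are assumptions made for the example.

import cv2
import numpy as np

def stabilized_flow(I_prev, I_next, pts_prev, pts_next):
    """Compute dense flow between two frames after 2-D homography stabilization.

    I_prev, I_next     : consecutive grayscale frames (uint8)
    pts_prev, pts_next : (N, 2) matched SIFT locations between the frames
    Returns the flow field in the stabilized (I_prev-aligned) coordinates and
    the homography H used for stabilization, needed to compensate afterwards.
    """
    # 1. Dominant-plane homography (next -> prev) from SIFT matches with RANSAC.
    H, _ = cv2.findHomography(pts_next, pts_prev, cv2.RANSAC, 3.0)
    h, w = I_prev.shape[:2]
    I_next_stab = cv2.warpPerspective(I_next, H, (w, h))
    # 2. Dense flow between the stabilized pair; after stabilization the
    #    residual displacement is small and roughly proportional to height.
    flow = cv2.calcOpticalFlowFarneback(I_prev, I_next_stab, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow, H

def compensate_flow(x_prev, flow, H):
    """Map a point of I_prev to I_next by composing the residual flow with H^-1."""
    u = flow[int(round(x_prev[1])), int(round(x_prev[0]))]
    x_stab = np.array([x_prev[0] + u[0], x_prev[1] + u[1], 1.0])
    x_next = np.linalg.inv(H) @ x_stab
    return x_next[:2] / x_next[2]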
3.7 Urban Modeling
In the maintained 2.5D DSM, the geometric information of terrain and building roofs is well represented, while the structures of building side-walls and regions with multiple heights are discarded. The last step of our approach is to build a full 3D model by integrating the depth maps of key frames from different views. We use the volumetric integration method with a truncated signed distance function (TSDF) proposed in [30]: depth maps as well as the confidence maps are fused into a weighted discrete voxel grid, with the value of each voxel representing its distance to the closest surface. The final mesh model is extracted from the voxel grid via marching cubes [70]. Due to the limitation of memory size, we process the 3D model block by block. The size of the voxel grid for each block is adjusted automatically according to the height values in the DSM M_D.
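A minimal, unoptimized sketch of this TSDF fusion over one voxel block is shown below; the per-frame projection helper and the truncation distance are assumptions made for the example, and mesh extraction via marching cubes would follow as a separate step.

import numpy as np

def fuse_depth_map(tsdf, weight, voxel_centers, depth_map, conf_map, project,
                   trunc=1.0):
    """Integrate one key-frame depth map into a TSDF voxel block.

    tsdf, weight  : flat arrays, one entry per voxel in the block
    voxel_centers : (N, 3) world coordinates of the voxel centers
    depth_map     : per-pixel depth of the key frame (np.nan where unknown)
    conf_map      : per-pixel confidence W_k used as the fusion weight
    project(X) -> (u, v, z) pixel coordinates and depth of X in the key frame
    """
    h, w = depth_map.shape
    for i, X in enumerate(voxel_centers):
        u, v, z = project(X)
        u, v = int(round(u)), int(round(v))
        if not (0 <= u < w and 0 <= v < h) or z <= 0:
            continue
        d = depth_map[v, u]
        if np.isnan(d):
            continue
        sdf = d - z                        # signed distance along the viewing ray
        if sdf < -trunc:                   # voxel far behind the surface: skip
            continue
        w_obs = conf_map[v, u]
        if w_obs <= 0:
            continue
        tsdf_obs = min(1.0, sdf / trunc)   # truncated signed distance
        # Weighted running average of the truncated signed distances.
        tsdf[i] = (weight[i] * tsdf[i] + w_obs * tsdf_obs) / (weight[i] + w_obs)
        weight[i] += w_obs
    return tsdf, weight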
3.8 Experiments
We evaluate our approach on several real-world datasets: Rochester video contains an aerial video of Rochester, NY, USA; WPAFB 2009 [16] contains aerial imagery captured flying over Wright-Patterson Air Force Base, OH, USA; Providence (site22) [92] is a smaller-scale aerial video of Providence, RI, USA. More dataset details are listed in Table 3.1. Experiments are performed on a machine with an Intel Xeon 3.6 GHz quad-core processor and a GTX 590 graphics card. We use FlowLib [9] and SiftGPU [14] to compute the TV-L1 optical flow and detect SIFT features, which are both parallelized on GPU. The multi-view stereo algorithm is also parallelized on GPU with a CUDA implementation. In multi-view stereo, we process the high-resolution image tile by tile in order to fit into the limited GPU memory. Other modules are parallelized on the multicore processor.
Dataset             | Image Size  | Key/All frames | Init. | Pose.     | Model update  | Total   | Global SfM | Speedup
Rochester           | 4872 × 3248 | 8 / 400        | 20s   | 2.0 s/fr. | 75s /key fr.  | 43m 33s | 4h 13m 11s | 5.8×
WPAFB 2009          | 4872 × 3248 | 8 / 300        | 20s   | 2.4 s/fr. | 77s /key fr.  | 32m 47s | 1h 23m 02s | 2.5×
Providence (site22) | 1280 × 720  | 10 / 180       | 8s    | 0.8 s/fr. | 8s /key fr.   | 5m 47s  | 23m 57s    | 4.1×

Table 3.1: Running times (Init. through Total refer to our approach; Global SfM and Speedup give the comparison). Init.: time cost in the initialization step. Pose.: average time cost per frame for camera pose estimation. Model update: average time cost per key frame for multi-view stereo and model update.
Dataset             | Sequential SfM (Trans. / Rot. error) | Our approach (Trans. / Rot. error)
Rochester video     | 0.071 / 0.56°                        | 0.020 / 0.08°
WPAFB 2009          | 0.065 / 0.36°                        | 0.009 / 0.09°
Providence (site22) | 0.273 / 0.19°                        | 0.027 / 0.15°

Table 3.2: Quantitative evaluation of the error compared against the "gold-standard" camera poses obtained from Global SfM. Trans. error: average distance difference of camera positions. Due to the lack of physical scale, the distance is measured in the same coordinate units. Rot. error: average angular difference of viewing directions.
3.8.1 Camera pose estimation results
For comparison, we run the datasets using a standard SfM algorithm with Bundle Adjustment. Due to the lack of ground-truth, we obtain the "gold-standard" camera poses using pairwise feature matching between all frames (Global SfM). The estimated camera poses are accurate with no drift along the path. We also compare with camera poses estimated using sequential feature matching between successive frames (Sequential SfM).
Accuracy. We use several different methods to evaluate the accuracy of the estimated camera poses. As shown in Figure 3.7, we use epipolar lines to check the consistency of camera poses. The epipolar lines are consistent between frame pairs for the "gold-standard" camera poses from Global SfM as well as the poses estimated using our approach. Large errors exist for the camera poses estimated from Sequential SfM. As shown more clearly in Figure 3.8, the camera path estimated from our approach aligns well with the "gold-standard" camera path, while error accumulates in the camera path estimated from Sequential SfM. Quantitative evaluation of the error against Global SfM also validates the performance of our approach, as shown in Table 3.2.

Figure 3.7: Display of epipolar lines on (a) Rochester video, (b) WPAFB 2009 and (c) Providence (site 22): blue lines from our approach; red lines from Global SfM; green lines from Sequential SfM. Epipolar lines from our approach and Global SfM are consistent and overlapped. Large errors exist in the epipolar lines from Sequential SfM.

Figure 3.8: Display of camera paths (our approach, Global SfM, Sequential SfM) on (a) Rochester video, (b) WPAFB 2009 and (c) Providence (site22). The camera paths from our approach and Global SfM are overlapped.
Running time analysis. The running times are shown in Table 3.1. Since our approach is parallelized on GPU and a multicore processor, in order to evaluate the efficiency fairly, Global SfM and Sequential SfM are implemented using VisualSFM [15], which is also highly parallelized on GPU with the Multicore Bundle Adjustment implementation [112] and SiftGPU. Results demonstrate that our approach is significantly faster than Global SfM. Our approach works in an online fashion with running time increasing proportionally to the number of input images. As a batch method, Global SfM does not utilize the sequential property of video, and it is time-consuming mainly due to the pairwise feature matching between all frames. Sequential SfM can be accomplished with much reduced running time using sequential feature matching between successive frames. We do not compare with Sequential SfM here since it comes with a large sacrifice in accuracy as described above. Additionally, the required storage space of our approach is constant because only the key frame and its neighboring frames need to be stored in the system during the process. On the contrary, Global SfM requires a storage space which grows linearly as new frames arrive.

Our approach estimates camera poses as accurately as Global SfM without drift along the path. At the same time, it works efficiently in an online fashion. Moreover, our approach provides not only accurate camera poses but also dense 3D geometric information during the process, i.e., the DSM and the dense height maps of all key frames. By contrast, standard SfM outputs the camera poses and a sparse point cloud. In order to get the dense 3D information as accomplished by our approach, a separate dense reconstruction procedure is still needed to complete the pipeline.
3.8.2 Reconstruction results
Synthetic terrain data. Due to the lack of ground truth for real-world datasets, we generate a synthetic dataset of mountainous terrain in order to quantitatively evaluate the reconstruction performance and the flexibility of our approach for terrain. The 3D ground-truth (Figure 3.9 (a)) is built from a geo-tagged textured terrain DSM, which has 8.4-meter lateral spacing and 1-meter elevation precision. We render 200 frames with resolution 2048 × 2048 following a camera motion path of a flyover around the target scene. For evaluation, we measure the signed distance of sampled points on the reconstructed full 3D model to the ground-truth model. As the results in Figure 3.9 (b) show, the reconstructed 3D model fits the ground-truth model very well.

Real-world dataset. We also illustrate the reconstruction results on real-world datasets. The reconstructed full 3D model on the Rochester video is shown in Figure 3.10. More close-up views are shown in Figure 3.11. As shown in the close-up views, geometric details, e.g., chimneys and bridges, are well captured in the reconstructed full 3D model. To quantitatively evaluate the accuracy of the reconstructed model, we use the latest city 3D model from Google as ground-truth. Results show that the reconstructed city 3D model using our system is very accurate, with a mean error of 2.0 meters over the entire city.
Figure 3.9: (a) Synthetic terrain model. The synthetic model spans horizontally a 2.5 km × 2.3 km region with a height range of 230 m. (b) Histogram of signed distance. The signed distance error is unbiased with a mean of -0.01 m. The standard deviation is the RMS error, which is 0.72 m.
Figure 3.10: Reconstructed full 3D model on Rochester video
Figure 3.11: Closeup views of the results on Rochester video (shaded model)
3.9 Conclusion
We presented an approach to solve camera pose estimation and dense reconstruction from WAAS videos. As an online approach, it is significantly faster than the latest batch methods. At the same time, it estimates camera poses as accurately as a standard SfM algorithm with global Bundle Adjustment. A sparse 3D model and a 2.5D DSM are incrementally updated during the process. Our approach also produces a highly-detailed full 3D model.
Chapter 4
Progressive 3D Model Acquisition with a Hand-held
Camera
4.1 Introduction
Acquisition of 3D models from images of real-world objects is essential for applications in various fields. The standard offline 3D modeling method starts with a data collection stage where images covering different views of the target are collected. The camera poses of the collected images are estimated using a Structure-from-Motion (SfM) framework [48]. Multi-view stereo algorithms are then used to generate the 3D model. Offline methods are able to generate an accurate 3D model by incorporating information from all collected data. If the user is not satisfied with the reconstructed model, e.g., the object is not fully covered, the user needs to collect additional images and repeat the reconstruction pipeline. Thus, the acquisition of a visually pleasing 3D model can be time-consuming, with several iterations of the data collection and reconstruction loop. Real-time approaches have been proposed to overcome this problem. During scanning, the reconstructed model is continuously updated and displayed as feedback to guide the user to the next viewpoint. However, the accuracy of the reconstructed model is not guaranteed due to the lack of global information.
As 3D printing technology becomes popular and affordable, we are interested in providing a simple-to-use 3D model acquisition solution incorporating the good elements of both offline methods and real-time methods. In this chapter, we propose a system with a commodity hand-held camera combining several state-of-the-art methods in computer vision and graphics, which offers the following features:

• Real-time scanning: data are collected through a live scanning stage, where immediate feedback of a continuously-updated model allows the user to determine if the region of interest is reconstructed in the desired detail and choose new viewpoints based on the state of the current model.

• Offline model refinement: after live scanning is completed, the obtained reconstruction is refined into a more accurate model through an offline model refinement procedure utilizing information from all collected data.

• Detailed reconstruction with a commodity device: the resolution of the reconstructed structural details is controlled by the image resolution. This enables our system to capture highly-detailed structures with a commodity device by scanning the object in close-up views.

• No need for inertial information: no inertial information from GPS or an inertial measurement unit (IMU) is needed.
The rest of this chapter is organized as follows: related works are listed in section 4.2; we introduce the system overview in section 4.3, and explain the point tracker in section 4.4 and the patch tracker in section 4.5; the model refinement procedure is described in section 4.6; we provide experimental results in section 4.7 and conclude this chapter in section 4.8.
4.2 Related Work
A variety of real-time methods based on active scanners have been proposed for 3D model acquisition. Rusinkiewicz et al. [93] proposed a system estimating range scans based on structured light projected on the object surface. The range scans are registered using a fast Iterative Closest Point (ICP) method [94] and merged through VRIP [30]. KinectFusion [79] represents the 3D space with a discrete 3D volume and updates it in real time with depth maps obtained from a hand-held depth sensor.

Image-based real-time 3D modeling systems have also been proposed to extract the 3D model from images in real time. ProFORMA [83] updates a triangle mesh model while the user rotates the object in front of a stationary camera. The model is produced through Delaunay tetrahedralisation of SfM points followed by a triangle carving step, which makes it not suitable for objects with non-planar structural details. In [86], Pollefeys et al. proposed a real-time dense reconstruction system for urban scenes using a multi-camera rig. Depth maps are estimated using a plane sweep algorithm and integrated to build the urban model through visibility-based depth map fusion [76]. Recently, several works have been proposed for real-time dense reconstruction with a single hand-held camera. In [78], accurate camera poses are obtained using PTAM [58]. Constrained scene flow updates the depth map for each camera bundle, incorporating photometric information from different views. In [77], the depth map of each key frame is estimated using a variational method with discrete-continuous optimization. Robust camera tracking is achieved through dense image alignment with the synthetic views rendered from the continuously-updated dense model. These works have shown great performance for scene reconstruction in a workspace scenario, but their capabilities to acquire a complete 3D model have not been evaluated. A recent work [102] has demonstrated the potential to do dense reconstruction on a mobile platform with the aid of on-device inertial sensors. By contrast, our system aims for 3D model acquisition using a commodity camera without inertial information from other sensors.
4.3 Overview
The overall pipeline of our system is illustrated in Figure 4.1. It starts with a live scanning stage where the user "paints" the target object, obtaining an input video with a commodity hand-held camera. This stage is accomplished using a point-based tracker for camera pose estimation and a patch-based tracker for dense reconstruction. The point-based tracker maintains tracklets of SIFT points and updates a model composed of a sparse set of 3D SIFT points for camera tracking. At the same time, the patch-based tracker maintains tracklets of image patches and updates a patch model composed of local planar patches densely covering the surface. During scanning, reconstructed patches are continuously updated and displayed to the user as immediate feedback. After live scanning is completed, denser and more accurate patches are reconstructed through an offline model refinement procedure. Erroneous patches are also filtered out by checking the visibility information in this step. The detailed mesh model is generated from the refined dense patches through Poisson surface reconstruction [57].

Figure 4.1: Overall pipeline of the proposed system
4.4 Sparse Point-based Tracker
Accurate and efficient camera pose estimation at each input frame is essential for live 3D model acquisition. This section describes the online point-based tracker used to estimate the 6DOF camera pose (position and viewing direction).
4.4.1 Camera pose estimation
For camera pose estimation, our system maintains and continuously updates a model M_s composed of a sparse set of 3D SIFT points. When a new frame I_n arrives, SIFT points on I_n are detected and matched with the 3D points stored in M_s. The camera pose of I_n is then estimated using the Perspective-n-Point (PnP) method followed by an iterative method minimizing the re-projection error based on Levenberg-Marquardt optimization. Using the 2D-3D Image-Model feature matches, the camera pose of each new frame I_n is aligned with the maintained 3D model M_s, which effectively reduces the pose drift during scanning. The update of M_s is described in the next section, and the feature matching between I_n and M_s is explained in section 4.4.3.
4.4.2 Point model update
As each new frame I_n arrives, SIFT points between successive frames are matched and chained into point tracklets across multiple frames. The matching of features between successive frames is described in the next section. When the baseline of a tracklet is large enough, we generate a 3D SIFT point through triangulation and insert it into M_s. In practice, we triangulate a tracklet into a 3D point when the angle between the viewing rays cast from the initial detection and the detection on the current frame I_n is larger than 10°. To avoid inserting duplicate points in M_s associated with the same SIFT feature, a point tracklet is deleted if its detection on the current frame is matched with a SIFT point already stored in M_s. The SIFT points on I_n that are associated with neither existing point tracklets nor points in M_s are added as initial detections to start new tracklets. It is not necessary to update M_s and add new tracklets at each frame due to the small variance between successive frames in the input video sequence. In practice, we attempt to update M_s and add new tracklets at key frames only. A key frame is added when more than ten frames have passed and the camera has moved more than a minimum distance since the last key frame. The camera poses of the first two key frames are estimated using the 5-point method [80] to initialize M_s.

Figure 4.2: Illustration of the optical flow guided feature matching between a new frame I_n and the sparse model M_s
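For illustration, a small sketch of this baseline-angle test is shown below, assuming the viewing rays of the first and current detections (directions from each camera center through the detected pixel) have already been computed.

import numpy as np

def ready_to_triangulate(ray_first, ray_curr, min_angle_deg=10.0):
    """Check whether the viewing rays of the first and current detections of a
    tracklet subtend a large enough angle for a well-conditioned triangulation."""
    r1 = ray_first / np.linalg.norm(ray_first)
    r2 = ray_curr / np.linalg.norm(ray_curr)
    angle = np.degrees(np.arccos(np.clip(np.dot(r1, r2), -1.0, 1.0)))
    return angle > min_angle_deg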
4.4.3 Optical flow guided feature matching

Feature matching is one of the most time-consuming steps of a Structure-from-Motion (SfM) framework. To make use of the small variance in video sequences, we compute dense optical flow between successive frames to guide the matching of SIFT feature points. Feature matching between a new frame I_n and the sparse model M_s is accomplished through 3 steps as illustrated in Figure 4.2:
1. Back-projection: for each 3D SIFT point X stored in M_s, we back-project it onto the previous frame I_{n-1} and obtain its projection x_{n-1}. Visibility is checked based on the point normal at X, which is estimated as described in section 4.5.2.

2. Dense optical flow: we compute the dense optical flow between the previous frame I_{n-1} and the current frame I_n using Total-Variation L1 (TV-L1) optical flow [115], which minimizes an L1 norm data term and a total-variation regularization term:

E_u = \int \left( \left| I_{n-1}(x) - I_n(x + u_{n-1}^{n}(x)) \right| + \left\| \nabla u_{n-1}^{n}(x) \right\| \right) dx    (4.1)

where u_{n-1}^{n}(x) represents the displacement from I_{n-1} to I_n at point x. Then, the projection of X on I_n can be evaluated by adding the displacement to its projection on I_{n-1}:

x_n = x_{n-1} + u_{n-1}^{n}(x_{n-1})    (4.2)

3. Local search: we now search in a small region around x_n on I_n to find the matching SIFT point for X. In all experiments, a match is found if the dot product of the normalized feature descriptors is larger than 0.8. The search region is defined as a circle with radius of 3 pixels centered around x_n. If multiple SIFT points on I_n are found to match with X, we drop them all to avoid false matches with ambiguous features, e.g., in regions with homogeneous texture. Feature matching between successive frames I_{n-1} and I_n is achieved in the same way without the back-projection step.
4.4.4 Bundle adjustment
Figure 4.3: Each key frame I_k is divided into α × α pixel image cells associated with patch tracklets to continuously update the patch model M_D

Whenever a new key frame is inserted, we run a full Bundle Adjustment involving all key frames by minimizing the total re-projection error. Running time is an essential concern for a live scanning system. In our implementation, dense optical flow guided feature matching provides accurate initial pose estimation. We are able to achieve sub-pixel accuracy after a few iterations of bundle adjustment. We also use the recent progress in parallelized bundle adjustment [112] to further speed up the process. More details of the implementation and running time analysis are described in section 4.7.1.
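For illustration, the following compact sketch shows the underlying optimization as a dense, unstructured re-projection-error refinement with SciPy; it is only a conceptual stand-in for the sparse, GPU-parallelized solver of [112], and the parameterization (axis-angle plus translation per camera) is an assumption made for the example.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, n_cams, n_pts, K, cam_idx, pt_idx, obs):
    """Residuals for all observations; params stacks [rvec|tvec] per camera
    followed by the 3D point coordinates."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    R = Rotation.from_rotvec(cams[cam_idx, :3]).as_matrix()      # (M, 3, 3)
    Xc = np.einsum('mij,mj->mi', R, pts[pt_idx]) + cams[cam_idx, 3:]
    proj = (K @ Xc.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return (proj - obs).ravel()

def bundle_adjust(cam_params, points3d, K, cam_idx, pt_idx, obs, max_nfev=10):
    """Minimize the total re-projection error over all key-frame poses and points."""
    x0 = np.hstack([cam_params.ravel(), points3d.ravel()])
    res = least_squares(reprojection_residuals, x0, max_nfev=max_nfev,
                        args=(len(cam_params), len(points3d), K,
                              cam_idx, pt_idx, obs))
    n = len(cam_params) * 6
    return res.x[:n].reshape(-1, 6), res.x[n:].reshape(-1, 3)

Because this sketch builds a dense Jacobian by finite differences, it is only practical for small problems; exploiting the sparsity of the camera-point structure is what makes bundle adjustment fast enough for live scanning.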
4.5 Dense Patch-based Tracker
The point-based tracker provides the camera pose and updates the 3D model M_s composed of sparse SIFT feature points. To fully capture the geometry of the target object, our system generates another 3D model M_D composed of local planar patches densely covering the visible surface. Inspired by [40], each patch p represents a local planar surface with center position c(p) and unit normal n(p). During scanning, M_D is continuously updated and displayed as immediate feedback to the user. We describe the patch tracklets in the next section, and the model update in section 4.5.2.
4.5.1 Patch tracklets
As shown in Figure 4.3, our system maintains a set of patch tracklets and updates them during live scanning. Whenever a key frame I_k is inserted as described in section 4.4.2, it is divided into a regular grid of α × α pixel image cells (α is set as 7 in all experiments). Each image cell C_k(i, j) is then associated with a patch tracklet T_k(i, j) comprising point trajectories for all pixels in C_k(i, j). When a new frame I_n arrives, the point trajectories in T_k(i, j) are updated by adding the point displacements obtained from the dense optical flow between the last frame I_{n-1} and the current frame I_n. Sub-pixel accuracy is achieved through bilinear interpolation. The point trajectories are refined using their projections on the corresponding epipolar lines on I_n.
4.5.2 Patch model update
Whenever a key frame arrives, our system updates the patch tracklets as described above and attempts to add new patches into the patch model M_D. If the angle between the viewing rays cast from the initial projection and the projection on the current frame is larger than 10° for all point trajectories in patch tracklet T_k(i, j), we generate a candidate patch p from T_k(i, j) by triangulating the point trajectories in T_k(i, j) into a set of 3D points. The depth value of each pixel in C_k(i, j) is saved for later refinement as described in section 4.6. The patch center c(p) is set as the 3D point associated with the center pixel of C_k(i, j). To obtain the patch normal, we eigen-decompose the covariance matrix of the 3D points into eigenvectors e_i and eigenvalues λ_i (i = 1, 2, 3). The patch normal n(p) is set as e_3, which corresponds to the minimal eigenvalue λ_3. Patch p is considered a valid patch if λ_1/λ_2 < 2 and λ_2/λ_3 > 10. Non-planar patches and patches that are stretched along a single direction are filtered out as invalid patches.
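A minimal sketch of this patch fitting from the triangulated 3D points of one image cell is shown below; the ratio thresholds follow the text, while the function name and interface are hypothetical.

import numpy as np

def make_patch(points3d, center_index):
    """Fit a local planar patch to the 3D points triangulated from one image cell.

    Returns (center, normal) for a valid patch, or None if the points are
    non-planar or stretched along a single direction.
    """
    pts = np.asarray(points3d)
    cov = np.cov(pts.T)                        # 3x3 covariance of the cell's points
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending: lam3 <= lam2 <= lam1
    lam3, lam2, lam1 = eigvals
    if lam3 <= 0 or lam2 <= 0:
        return None
    # Validity tests from section 4.5.2: roughly isotropic in-plane spread,
    # and much smaller spread along the normal direction.
    if lam1 / lam2 >= 2.0 or lam2 / lam3 <= 10.0:
        return None
    center = pts[center_index]                 # 3D point of the cell's center pixel
    normal = eigvecs[:, 0]                     # eigenvector of the minimal eigenvalue
    return center, normal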
Figure 4.4: Illustration of scanning for structural details. (a) A sample input frame captured in a distant view and the patches reconstructed from distant views, which are sparse. (b) A sample input frame captured in a close-up view of the head region and the reconstructed patches from close-up views, which are denser and in smaller scales.
4.5.3 Scanning for structural details
The density and the scale of the reconstructed patches are controlled by the resolution of the source image. Patches that are generated from images captured in close-up views have a smaller scale. Reconstructed patches are also denser in these regions, as each patch is associated with an image cell of a fixed size. As shown in Figure 4.4, the reconstructed patches generated from input frames in distant views are sparser, with structural details lost. On the other hand, frames captured in close-up views generate denser patches capturing structural details in smaller scales. The updated patch model M_D is displayed to the user in the course of the live scanning process. This immediate feedback of the current reconstruction allows the user to choose the next movement and achieve reconstruction at the desired resolution by browsing the regions of interest in close-up views.
4.6 Model Refinement
The reconstructed patches from live scanning are not quite accurate due to the error in the patch tracklets, which are computed based on the accumulated dense optical flow. Also, structural details in small scales are lost as we use image cells with a large size (7 × 7 pixels) for robust patch estimation. There are also erroneous patches in the model, which are mainly associated with texture-less regions where the optical flow displacements are not reliable. To overcome these problems, we generate a patch model that is denser and more accurate through a depth map refinement procedure as described in the next section. The erroneous patches are filtered out based on the visibility information as described in section 4.6.2. In the end, a detailed mesh model is generated from the refined patches through Poisson surface reconstruction.
4.6.1 Depth map refinement
As shown in Figure 4.5 (a), the patch model obtained from live scanning is composed of sparse patches in large scales, which builds a coarse mesh model that is over-smooth with structural details lost. For each key frame I_k, an initial inverse depth map h_k is obtained based on the depth information generated from the dense patch tracker. We refine h_k by incorporating photometric information from the neighboring key frames of I_k through a variational multi-view stereo method. The neighboring key frames I_i \in N(k) (we use 10 neighboring key frames in all experiments) are selected as the key frames obtained around I_k in time order. In particular, we estimate an inverse depth offset map dh_k representing the deformation of the initial model. dh_k is computed by minimizing an energy function including a non-convex photometric error term and a convex regularization term:

E_{dh} = \int C_k(x, h_k(x) + dh_k(x)) \, dx + \int \| \nabla dh_k(x) \|_{\epsilon} \, dx    (4.3)

where \| \nabla dh_k(x) \|_{\epsilon} is a Huber norm regularization term used to smooth the offset map. C_k(x, h) measures the photometric error across the neighboring views, incorporating occlusions:

C_k(x, h) = \sum_{i \in N(k)} \sigma_k^i(x) \left| I_k(x) - I_i(\pi_i(\pi_k^{-1}(x, h))) \right|    (4.4)

where \pi_k^{-1}(x, h) is the operator that computes the position of the 3D point projected from pixel x on I_k when assigned inverse depth value h, and \pi_i(\pi_k^{-1}(x, h)) is the operator that computes the pixel location on I_i back-projected from this 3D point. \sigma_k^i(x) \in \{0, 1\} represents the visibility of pixel x \in I_k observed from I_i, which is computed through a depth test based on the initial inverse depth maps. By minimizing the energy function in Equation 4.3, we deform the initial patches to ensure photometric consistency across neighboring frames and force the deformation to be smooth. Details of the optimization can be found in [77][26]. We also explored a coarse-to-fine extension in [54].

We ray-cast the refined inverse depth maps to generate denser and more accurate patches. The patch center and its normal are computed in the same way as described in section 4.5.2. We associate each patch with an image cell of a smaller size (3 × 3 pixels in our implementation) to generate denser patches capturing structural details in small scales. As shown in Figure 4.5 (b), the refined patches are denser and more accurate, which leads to a detailed mesh model with structural details well captured.

Figure 4.5: Illustration of depth map refinement. Mesh models are generated through Poisson surface reconstruction under the same parameters. (a) The reconstructed coarse patches after live scanning and the mesh model generated from the coarse patches. (b) The denser and more accurate patches after depth map refinement and the mesh model generated from the dense patches.
4.6.2 Patch filtering
We incorporate two types of visibility information to filter out invalid patches. Inspired by [102], we filter out erroneous patches that are not consistent among neighboring frames within a small baseline. Specifically, we warp the reconstructed patches onto N_d = 4 neighboring frames to check the consistency based on the refined depth maps. If the depth difference is larger than a tolerance value on more than N_d/2 = 2 neighboring frames, the patch is filtered out as an erroneous patch. We also filter out erroneous patches that violate occlusion consistency across large baselines: we warp each patch onto non-neighboring frames to check if it occludes patches in the target frame. In our implementation, patches that violate occlusion consistency for more than N_v = 3 frames are also filtered out as erroneous patches. Figure 4.6 illustrates an example of the reconstructed patches before and after patch filtering.
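A simplified sketch of the small-baseline consistency test is shown below; the per-frame projection helper is hypothetical, and the equivalent "keep if consistent on at least N_d/2 neighbors" formulation is used for brevity.

import numpy as np

def depth_consistent(patch_center, neighbor_frames, tol, n_required=2):
    """Small-baseline filter: keep a patch only if its depth agrees with the
    refined depth maps of enough neighboring frames (N_d/2 = 2 out of N_d = 4).

    neighbor_frames : list of (depth_map, project) pairs, where
                      project(X) -> (u, v, z) gives the pixel and depth of X.
    """
    consistent = 0
    for depth_map, project in neighbor_frames:
        u, v, z = project(patch_center)
        u, v = int(round(u)), int(round(v))
        if not (0 <= v < depth_map.shape[0] and 0 <= u < depth_map.shape[1]):
            continue
        d = depth_map[v, u]
        if not np.isnan(d) and abs(d - z) <= tol:
            consistent += 1
    return consistent >= n_required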
4.7 Experiments
4.7.1 Implementation and datasets
We implement our system on a machine with an Intel Xeon 3.6 GHz quad-core processor and a GTX 590 graphics card. For real-time performance, we use SiftGPU [14] for feature detection, Multicore Bundle Adjustment [12] for Bundle Adjustment and FlowLib [9] for dense optical flow computation, which are all parallelized on GPU. The depth map refinement algorithm is also parallelized on GPU with a CUDA implementation.

We test our system on three different datasets of real-world objects. stone, shoe and Bach are collected in a lab environment without a particular setup for lighting and background. For data collection, we use a Logitech C520, which is a commodity webcam working at a standard resolution of 640 × 480. Besides the experiments conducted in the lab, we also evaluate our system on videos from two publicly available datasets. statue, Cleopatra and gargoyl are three videos of objects placed on a table captured in a lab environment from the CLAM [20] dataset. birdhouse from the Image Sequence dataset [106] is captured outdoors. Both datasets are collected using a commodity hand-held digital camera. Details of the used datasets are listed in Table 4.1, including the total number of frames, number of key frames, and image size.

Figure 4.6: Illustration of patch filtering. Mesh models are generated through Poisson surface reconstruction under the same parameters. (a) Patch model before patch filtering, which contains erroneous patches, and the generated mesh model, which includes several artifacts. (b) Patch model after filtering out invalid patches and the generated mesh model, which is clean and accurate.

Name      | Size       | # Frames | # Key Frames | Scanning | Refinement | Total
stone     | 640 × 480  | 770      | 69           | 1m 16s   | 2m 44s     | 4m 00s
shoe      | 640 × 480  | 690      | 61           | 1m 08s   | 2m 30s     | 3m 38s
Bach      | 640 × 480  | 680      | 60           | 1m 06s   | 2m 28s     | 3m 34s
statue    | 1280 × 720 | 630      | 52           | 2m 09s   | 5m 10s     | 7m 19s
Cleopatra | 1280 × 720 | 588      | 47           | 2m 04s   | 4m 44s     | 6m 48s
gargoyl   | 1280 × 720 | 687      | 58           | 2m 21s   | 5m 41s     | 8m 05s
birdhouse | 1280 × 720 | 270      | 18           | 1m 08s   | 2m 13s     | 3m 21s

Table 4.1: Dataset details and running-time analysis. # Frames: number of total frames. # Key Frames: number of key frames. Scanning: total time cost for live scanning. Refinement: total time cost for model refinement including depth map refinement and patch filtering. Total: total time cost including both scanning and model refinement.
4.7.2 Running-time analysis
Running time is important for the live scanning stage. Though parallelized using the GPU, SIFT feature detection and the computation of dense optical flow are the most time-consuming parts of our system. In practice, we process them in parallel using two separate threads since they are independent of each other for each new frame. A few iterations of Bundle Adjustment are sufficient to provide accurate camera pose refinement. In our implementation, we use a value of 10 as the maximum number of iterations allowed. For live scanning, our system works at a frame rate of 10 Hz with an input resolution of 640 × 480. Although our system is not working at the full frame rate of the input stream (30 Hz for the webcam), it still provides a smooth scanning experience. The detailed running statistics for each dataset are listed in Table 4.1.
4.7.3 Camera pose estimation results
We illustrate the estimated camera poses of the obtained key frames as well as the reconstructed sparse model M_s composed of SIFT feature points in Figure 4.7. For all experiments, our system is able to match the feature points inserted at the beginning and close the loop when scanning the target object back to the original viewpoints.
4.7.4 Reconstruction results
The reconstructed models are illustrated in Figure 4.8. For the objects placed on a
round table, we detect and remove the patches on the table through RANSAC plane
tting. Results demonstrate that our system is
exible for various types of objects.
Note that the structural details, such as shoe ties, holes in the stone, are well captured
in the reconstructed model. We also compare our results on birdhouse with the model
reconstructed from [106] and Autodesk's 123D Catch [50], which is a mature commerical
solution for image-based 3D modeling. As shown in Figure 4.9, our system is able to deal
with the complicated geometric shapes, such as the holes within the birdhouse, with the
patch-based representation.
72
Figure 4.7: Illustration of the estimated camera motion as well as the reconstructed sparse
model composed of SIFT feature points. Top row from left to right: results on shoe and
stone. Middle row from lef to right: results on Bach and statue. Bottom row from left
to right: results on Cleopatra and gargoyl
73
Figure 4.8: Reconstruction results including a sample input image, reconstructed dense
patches and the detailed mesh model from 2 dierent views. From top to bottom: shoe,
stone, Bach, statue, Cleopatra and gargoyl
(a) A sample input image (b) Reconstructed mesh model from our system
(c) Model generated from [106] (d) Model generated from Autodesk's 123D Catch
Figure 4.9: Comparison of the reconstruction results on birdhouse
4.7.5 Problematic cases
Texture-less objects, such as Bach, remain a challenge for our system, as the dense optical flow is not reliable there. We filter out erroneous patches through the visibility consistency test, but some false patches still produce artifacts in the reconstructed models. In the future, we plan to address this problem by combining a photometric-based global optimization, as done in [50][106].
4.8 Conclusion
In this chapter, we proposed an end-to-end practical system for 3D model acquisition using a commodity hand-held camera. The pipeline starts with a live scanning stage where immediate feedback on the reconstructed model allows the user to verify the current state of the model before the next movement. During scanning, structural details can be well captured by scanning the region of interest from close-up viewpoints. Our system is able to handle objects with complicated shapes using a patch-based model representation. After live scanning is completed, an off-line model refinement procedure refines the reconstructed model and filters out erroneous patches.
Chapter 5
Camera Pose Estimation for Multi-Camera Aerial Imagery
5.1 Introduction
In Chapter 3, we proposed an online city 3D reconstruction framework for Wide Area Aerial Surveillance (WAAS) videos captured using a single-camera system. An important assumption we make about these videos is that there is large overlap between all frames, as the videos are captured hovering around the same region. This is also the key to the success of our drift-free online approach using 2D-3D Image-Model feature matches without Bundle Adjustment. However, this is not the case in real-world applications: in practice, the aerial imagery is captured using a camera array system composed of multiple sensors in order to cover a large area and maintain high resolution at the same time (Figure 5.1). Each individual camera covers a relatively small area. The combination of images from multiple cameras satisfies the property that there is large overlap between all frames. However, this is not the case for the imagery from each individual camera. For applications such as vehicle tracking, the mosaicked image is sufficient [90]. For accurate 3D reconstruction, the mosaicked image causes problems due to the approximation of the relative camera poses in the mosaicking process.
(a) Multi-camera aerial imagery on Columbus, OH, USA, captured using a 6-camera system. The resolution of each single camera is 4008 x 2672.
(b) Multi-camera aerial imagery on Rochester, NY, USA, captured using a 5-camera system. The resolution of each single camera is 4872 x 3248.
Figure 5.1: Examples of the multi-camera aerial imagery
In this chapter, we investigate the problem of camera pose estimation for multi-camera aerial imagery. In the remainder of this chapter, we analyze the challenges of multi-camera aerial imagery compared with traditional single-camera aerial imagery in section 5.2 and propose a multi-camera pose estimation approach in section 5.3. Implementation details and experimental results are presented in section 5.4. We conclude this chapter in section 5.5.
5.2 Challenges of Multi-Camera Aerial Imagery
For camera pose estimation, multi-camera aerial imagery is much more challenging than traditional single-camera aerial imagery. In particular, there are three major challenges: a complicated viewing pattern, the consistency of multiple cameras, and efficiency.
5.2.1 Non-hovering viewing pattern
The first challenge is due to the viewing pattern, as illustrated in Figure 5.2. The aerial imagery data is captured on an airborne platform hovering around the target urban area. The observed area in the center camera is indeed always preserved because of the hovering pattern. This is also the key to the success of our online approach without Bundle Adjustment proposed in Chapter 3. However, this hovering pattern does not always hold in multi-camera aerial imagery, especially for the cameras in the corner of the camera array. Instead, the observed area changes rapidly for the corner cameras (Figure 5.2 (b)). This rapid observation change causes the camera pose drift problem as errors accumulate along the path.
(a) The 1st, 10th and 20th frames on Rochester, NY, USA from the top-middle camera. The observed area is almost fixed because of the hovering pattern.
(b) The 1st, 10th and 20th frames on Rochester, NY, USA from the bottom-right camera. The observed area changes rapidly, which causes the problem of camera pose drift.
Figure 5.2: Illustration of the complicated viewing pattern in multi-camera aerial imagery
5.2.2 Pose consistency of multiple cameras
The second challenge of multi-camera aerial imagery is the pose consistency of multiple cameras. The multi-camera system is designed to provide coverage of a large ground area and a wide viewing angle. As a result, there is limited coverage overlap between neighboring cameras in the camera array, which makes the alignment of different cameras challenging. One intuitive solution is to deploy the pre-calibrated rigid transformation of the camera rig. However, we found that a fixed rigid transformation between different cameras is not accurate enough for geometric 3D acquisition due to vibration during recording.
5.2.3 Efficiency
Thanks to progress in imaging sensor technology, the resolution of recorded aerial imagery keeps growing. Efficiency becomes a critical issue for high-resolution aerial imagery, as the image resolution of each single camera becomes larger and larger, with millions of pixels per frame. Furthermore, the multi-camera property makes the problem even more challenging. In this chapter, we investigate solving this problem with parallelization using multiple GPUs.
5.3 Approach
As mentioned in the previous section, efficiency is a critical issue for high-resolution multi-camera aerial imagery. The dense multi-view stereo algorithm in the coupled framework proposed in Chapter 3 becomes the bottleneck of the process, which makes it unsuitable for the multi-camera case. For this purpose, we present an approach for camera pose estimation based on the Structure-from-Motion (SfM) framework using sparse feature points. There is little overlap between neighboring cameras in the multi-camera array, which makes the problem challenging. Experiments show that a fixed relative pose between different cameras is unsuitable for accurate dense reconstruction. We propose a merge-and-track strategy to ensure the pose consistency of multiple cameras. Major computations of our approach are parallelized on the GPU to accelerate the process. Additionally, we also parallelize the processing of different cameras using multiple GPUs to further improve the efficiency.
5.3.1 Structure-from-Motion using sparse features
Most incremental SfM systems [15, 4] share the same pipeline, where cameras are registered within the same coordinate system incrementally by adding new cameras one by one. In the beginning, the camera poses as well as the 3D positions of a set of feature points are estimated as initialization from an initial camera pair. The initial camera pair is chosen such that the two cameras share a large number of matched points as well as a large baseline. After initialization, a new camera is added with 2D feature points matched to the recovered 3D points. Its camera pose is estimated using the Perspective-n-Point (PnP) method. Additional 3D points are also added from the matches obtained in the new camera to incrementally enlarge the reconstructed point cloud. This process is repeated until all cameras are added. Drift is a well-known problem for incremental SfM: the poses of newly added cameras tend to drift away from the correct positions due to errors accumulated during the process.
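As an illustration of the camera registration step, the sketch below estimates the pose of a newly added camera from 2D-3D matches with a RANSAC-based PnP solver. It assumes OpenCV is available and uses cv::solvePnPRansac; this is a generic sketch rather than the exact code of our system, and the inlier thresholds are illustrative.

    #include <opencv2/calib3d.hpp>
    #include <opencv2/core.hpp>
    #include <vector>

    // Estimate the pose of a new camera from 2D-3D feature matches.
    // points3d: recovered 3D feature points already in the sparse map.
    // points2d: their 2D detections in the new frame.
    // K: 3x3 intrinsic matrix of the camera.
    bool registerNewCamera(const std::vector<cv::Point3f>& points3d,
                           const std::vector<cv::Point2f>& points2d,
                           const cv::Mat& K,
                           cv::Mat& rvec, cv::Mat& tvec)
    {
        if (points3d.size() < 6) return false;            // not enough matches
        std::vector<int> inliers;
        // RANSAC PnP rejects outlier 2D-3D matches while estimating R (as a
        // Rodrigues vector) and T of the new camera in the map coordinate system.
        bool ok = cv::solvePnPRansac(points3d, points2d, K, cv::noArray(),
                                     rvec, tvec,
                                     /*useExtrinsicGuess=*/false,
                                     /*iterationsCount=*/200,
                                     /*reprojectionError=*/3.0f,
                                     /*confidence=*/0.99, inliers);
        return ok && inliers.size() > 20;                  // require enough inliers
    }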
The key difference between various SfM systems is how they estimate the 2D-2D and 2D-3D feature matches. For our aerial imagery case, we use the optical flow guided feature matching algorithm proposed in Chapter 3 to find the 2D-3D feature matches between a new frame and the sparse 3D map. The 2D-2D feature matching between successive frames can also be accomplished using the same pipeline. In our approach, we merge the 2D-2D feature matches across multiple images into 2D feature tracks (Figure 5.3). Given the camera poses of new frames, we update the sparse 3D map with new 3D feature points added by triangulating the 2D feature tracks. Additionally, we use different settings for triangulation at different stages of the SfM pipeline: triangulation is processed with a large threshold on the viewing angle to provide a large baseline and ensure accuracy in the initialization stage (Figure 5.3 (a)); the threshold on the viewing angle is smaller in the camera pose estimation stage to provide a quick update of the sparse 3D map (Figure 5.3 (b)). We also filter out the 2D feature tracks associated with existing 3D feature points to avoid duplicate points in the sparse 3D map.
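The viewing-angle threshold mentioned above can be checked before accepting a triangulated point by measuring the angle between the two rays from the camera centers to the candidate 3D point. The following is a minimal, self-contained sketch of that test; the threshold values are illustrative assumptions, not the ones used in our system.

    #include <cmath>

    struct Vec3 { double x, y, z; };

    static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
    static double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
    static double norm(const Vec3& a) { return std::sqrt(dot(a, a)); }

    // Angle (in degrees) between the two viewing rays of a triangulated point.
    double triangulationAngleDeg(const Vec3& camCenter0, const Vec3& camCenter1, const Vec3& point)
    {
        const double kPi = 3.14159265358979323846;
        Vec3 r0 = sub(point, camCenter0);
        Vec3 r1 = sub(point, camCenter1);
        double c = dot(r0, r1) / (norm(r0) * norm(r1) + 1e-12);
        if (c > 1.0) c = 1.0;
        if (c < -1.0) c = -1.0;
        return std::acos(c) * 180.0 / kPi;
    }

    // Accept the new 3D point only if its viewing angle exceeds the stage-dependent
    // threshold (larger during initialization, smaller during tracking).
    bool acceptTriangulation(const Vec3& c0, const Vec3& c1, const Vec3& p, bool initializationStage)
    {
        double minAngle = initializationStage ? 5.0 : 1.0;   // illustrative values
        return triangulationAngleDeg(c0, c1, p) > minAngle;
    }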
5.3.2 Multi-camera alignment
The feature-based SfM framework described in section 5.3.1 works well for single-camera aerial imagery. However, it cannot be directly deployed for multi-camera imagery due to the inconsistency of different cameras. The alignment of multiple cameras is challenging in our case because there is limited overlap between neighboring cameras (Figure 5.1). For this purpose, we propose a merge-and-track algorithm to ensure the consistency of multiple cameras. The algorithm is composed of three steps:
(a) Long 2D feature tracks provide a large baseline and ensure the accuracy in initialization
(b) Short 2D feature tracks provide a quick update of the 3D map
Figure 5.3: The 2D-2D feature matches are tracked across multiple frames
Initialization: the framework starts with camera pose estimation for each individual camera using the standard SfM algorithm based on sparse feature points (Figure 5.4 (a)). As a result, each single camera generates a separate sparse 3D map, and each 3D map is located within a different coordinate system. As illustrated in Figure 5.4 (b), there are five unaligned sparse 3D maps generated after initialization.
Camera alignment: the 3D feature points of the different SfM point clouds are matched until enough overlap is detected between different cameras. The pose of each individual camera is then transformed into a global coordinate system (Figure 5.4 (c)) using a similarity transformation (rotation, translation and scale) estimated from the matched SfM points between different sparse 3D maps [105]; a minimal sketch of this estimation is given after the three steps.
Global track and update: after alignment, a global SfM point cloud is updated and used to estimate the poses of all cameras, ensuring pose consistency. During the process, Bundle Adjustment (BA) [104, 112] is used to refine the estimation and avoid drift. We deploy local Bundle Adjustment over the latest several frames frequently and global Bundle Adjustment over all frames less frequently. At the same time, Bundle Adjustment within a single camera is processed more frequently than Bundle Adjustment involving all cameras, seeking a balance between accuracy and efficiency.
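For the camera alignment step, the similarity transformation between two sparse 3D maps can be estimated in closed form from the matched SfM points. Below is a minimal sketch using Eigen's umeyama routine; it assumes the Eigen library is available and that matched points are arranged column-wise, and in practice one would wrap such an estimate in an outlier-robust loop.

    #include <Eigen/Dense>
    #include <Eigen/Geometry>

    // srcPoints and dstPoints are 3 x N matrices of matched SfM points, where
    // column i of srcPoints (map A) corresponds to column i of dstPoints (map B).
    // Returns a 4 x 4 similarity transform (rotation, translation and scale)
    // mapping points of map A into the coordinate system of map B.
    Eigen::Matrix4d estimateSimilarity(const Eigen::Matrix3Xd& srcPoints,
                                       const Eigen::Matrix3Xd& dstPoints)
    {
        // Umeyama's closed-form least-squares solution with scaling enabled.
        return Eigen::umeyama(srcPoints, dstPoints, /*with_scaling=*/true);
    }

    // Apply the estimated similarity to a point (e.g. a camera center) to move
    // it into the global coordinate system.
    Eigen::Vector3d transformPoint(const Eigen::Matrix4d& T, const Eigen::Vector3d& p)
    {
        return (T * p.homogeneous()).head<3>();
    }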
5.3.3 Parallelization using multiple GPUs
In our approach, feature detection, optical flow and Bundle Adjustment are parallelized on the GPU to handle the high-resolution imagery. However, efficiency is still a problem for multi-camera aerial imagery even with GPU parallelization. For this purpose, we investigate parallelizing the processing of different cameras using multiple GPUs to further improve the efficiency for multi-camera imagery.
(a) Each individual camera is tracked independently as initialization, resulting in unaligned SfM point clouds from different cameras
(b) The unaligned SfM point clouds from different cameras
(c) The SfM point clouds are aligned into a consistent global sparse 3D map, which is used for later tracking and update using all cameras
Figure 5.4: Illustration of the merge-and-track algorithm
The processes of feature detection, 2D-2D and 2D-3D feature matching, optical flow between successive frames and Bundle Adjustment for each individual camera are distributed over different GPUs. The triangulated 3D feature points are then added to update the global sparse 3D map, which is shared by all cameras.
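This per-camera distribution can be realized by giving each camera its own worker thread and binding that thread to one GPU. The sketch below illustrates the idea with the CUDA runtime call cudaSetDevice and a placeholder processCameraStream function; it is a simplified illustration, not the system's actual scheduler, and synchronization around the shared 3D map is omitted.

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    // Placeholder for the per-camera pipeline (feature detection, matching,
    // optical flow and single-camera Bundle Adjustment) running on one GPU.
    void processCameraStream(int cameraIndex) { /* ... */ }

    int main() {
        const int numCameras = 5;
        int numGpus = 0;
        cudaGetDeviceCount(&numGpus);
        if (numGpus == 0) return 0;

        std::vector<std::thread> workers;
        for (int cam = 0; cam < numCameras; ++cam) {
            workers.emplace_back([cam, numGpus] {
                // Bind this worker thread to a GPU; cameras share GPUs in a
                // round-robin fashion when there are fewer GPUs than cameras.
                cudaSetDevice(cam % numGpus);
                processCameraStream(cam);
            });
        }
        for (auto& w : workers) w.join();
        // Triangulated points from all workers are merged into the shared
        // global sparse 3D map afterwards.
        return 0;
    }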
5.4 Experiments
5.4.1 Implementation
In our implementation, we use SiftGPU [14] for SIFT feature detection and Parallel Bundle Adjustment (PBA) [112] for the Bundle Adjustment implementation, both of which are parallelized on the GPU to improve efficiency. FlowLib [9] provides the GPU-parallelized optical flow implementation for the optical flow guided feature matching process. For the multi-GPU implementation, we use the multi-threading functions in the Boost library [3] to distribute the processing of each camera onto a different GPU. In our experiments, we use a PC with an Intel Xeon 3.6GHz quad-core processor and two NVIDIA GeForce GTX760 graphics cards.
5.4.2 Datasets
We evaluate the proposed approach on two high-resolution multi-camera aerial imagery datasets. 6pack (Figure 5.5 (a)) is a synthetic dataset covering a textured mountainous terrain, provided by Lawrence Livermore National Laboratory. The aerial imagery is rendered using a 6-camera system with an image resolution of 4096 x 4096 for each camera.
(a) Example frame of 6pack rendered using a 6-camera system
(b) Example frame of Rochester captured using a 5-camera system
Figure 5.5: Example frames of the multi-camera aerial imagery datasets
For the experiments, we use 125 frames for each of the 6 cameras, which results in 750 frames in total. Rochester (Figure 5.5 (b)) contains an aerial video covering the urban area in Rochester, NY, USA. The aerial imagery is captured in 2010 using a 5-camera system with an image resolution of 4872 x 3248 for each camera. For the experiments, we use 440 frames for each of the 5 cameras, which results in 2200 frames in total.
5.4.3 Results
As shown in Figure 5.6, the running time using multiple GPUs is largely reduced compared to the running time using a single GPU. Using 2 GPUs, the running time is approximately 0.5 seconds per frame on 6pack and 0.6 seconds per frame on Rochester, which is a large improvement over the coupled online framework proposed in Chapter 3. This provides a camera pose estimation system with close to real-time performance, because the high-resolution aerial imagery is usually captured at a low frame rate, e.g., 2 frames per second.
However, the running time is not exactly halved, because there are processes that cannot be parallelized onto multiple GPUs, such as the global Bundle Adjustment over all cameras. The running time can be further reduced by incorporating more GPUs into the system and assigning each single camera a unique GPU. Refined using global Bundle Adjustment involving all cameras and frames, the estimated camera poses are very accurate, with an average projection error within a few pixels on both datasets.
(a) Running time statistics (seconds) on 6pack
(b) Running time statistics (seconds) on Rochester
Figure 5.6: Running-time results on the multi-camera aerial imagery datasets. Init: running time of initialization using the relative pose estimation method; Single: running time of separate camera pose estimation for each individual camera; Multi: running time of multi-camera pose estimation after obtaining a global sparse 3D model with multi-camera alignment; Total: total running time. Each quantity is reported for 1 GPU and 2 GPUs.
5.5 Conclusion
In this chapter, we investigated a very challenging problem: camera pose estimation for high-resolution multi-camera aerial imagery. For this purpose, we proposed a novel framework based on the Structure-from-Motion (SfM) algorithm using sparse feature points. The system handles multi-camera aerial imagery using a merge-and-track algorithm composed of three steps. For efficiency, major processes including feature detection, optical flow and Bundle Adjustment are parallelized on the GPU. To further handle the multi-camera imagery, we distribute the computations of different cameras onto multiple GPUs. The results on two multi-camera aerial imagery datasets show that our system is very efficient with high accuracy.
Chapter 6
MeshRecon: a Mesh-Based Offline 3D Model Acquisition
System
6.1 Introduction
In the previous chapter, we presented an approach to generate accurate camera poses for high-resolution multi-camera aerial imagery. Given the camera poses, the coupled online 3D reconstruction framework proposed in Chapter 3 is not suitable for generating accurate city 3D models from high-resolution multi-camera imagery due to the high-resolution multi-camera property. In this chapter, we investigate the problem of accurate 3D model acquisition using offline techniques with the camera poses given.
Online 3D model acquisition methods, such as depth map fusion methods [76, 54, 55, 56], provide immediate visual feedback to the user during the reconstruction process. However, this advantage comes at the cost of sacrificing accuracy, because the dense geometry information is estimated considering only a partial set of views (e.g., neighboring views of a selected key frame). In many cases, immediate visual feedback is not a necessity and accuracy is more important. Batch or offline methods are used for more accurate 3D model reconstruction, which is composed of three sequential steps.
Figure 6.1: The input of MeshRecon is a set of sparse images captured from different views. The output is a full 3D mesh model of the target object
The pipeline starts with a data collection step where images of the target object are captured from different views. After that, we estimate the camera poses of the collected images. In this step, Bundle Adjustment [104] involving all cameras is used to refine the camera poses. The resulting camera poses are more accurate than those estimated using an online pipeline, such as PTAM [58], because the problem is solved using information from a global point of view. As the last step, a dense 3D model is reconstructed by fitting the photometric information from all views, which makes it more accurate than a model estimated using an online method.
For this purpose, we propose MeshRecon [11], a mesh-based offline 3D model acquisition system integrating several state-of-the-art computer vision techniques [114, 61, 31, 50]. It is a system designed for dense mesh reconstruction using a set of sparse images as input (Figure 6.1). We use a specific file format, called the SFM file, as the data interface to
load camera pose information, which makes the system compatible with camera poses generated using various third-party software. The whole pipeline is composed of three major steps: dense point cloud generation, visibility-consistent mesh surface reconstruction and variational mesh refinement. MeshRecon has the following features:
• Flexibility for various objects: we integrate the visibility-consistent surface reconstruction technique based on Delaunay triangulation and graph-cut optimization proposed in [61], which allows for efficient reconstruction of objects at different scales. Results show that our system is flexible for city 3D reconstruction as well as general 3D model acquisition in both indoor and outdoor environments.
• Accurate reconstruction with structural details: the reconstructed 3D mesh model is iteratively refined to fit the global photometric information using variational methods [31, 50]. As a result, structural details can be well captured from high-resolution imagery.
• Efficient framework using GPU: the multi-view stereo and variational mesh refinement processes are implemented with GPU parallelization, which makes the system very efficient. A detailed geometric 3D model can be reconstructed efficiently using a standard GPU-enabled PC.
6.2 Pipeline Overview
As illustrated in Figure 6.2, the system pipeline of MeshRecon is composed of four components: (1). After obtaining the images captured from different views, the 6DOF camera poses of the input images are estimated (Figure 6.2 (a)).
(a) Camera pose estimation from input images (b) Dense point cloud generated using plane sweep
(c) Initial mesh model extracted considering visibility information
(d) Accurate mesh model refined using variational method
Figure 6.2: The pipeline of MeshRecon
In this step, the bounding box of the 3D space of interest as well as the image neighboring information are also computed; (2). Given the estimated camera poses, a point-camera-ray 3D map is generated using a multi-resolution plane sweep algorithm (Figure 6.2 (b)). The point-camera-ray 3D map is essentially a dense point cloud including 2D-3D point-camera correspondences, which provides the dense geometry information as well as the visibility information from different views. (3). An initial 3D mesh model is extracted from the dense point-camera-ray map using Delaunay triangulation and graph-cut optimization, by minimizing visibility violations plus a spatial regularization energy (Figure 6.2 (c)). (4). The initial model generated in the previous step is usually topologically correct but noisy. This model is then refined to capture the structural details by fitting the photometric information of the input imagery data using a variational method (Figure 6.2 (d)). Each step is described in detail in the following sections.
6.3 Camera Pose Configuration
Accurate camera pose information is essential for successful dense 3D reconstruction. In the computer vision community, the camera pose is usually estimated using a Structure-from-Motion (SfM) framework for discrete imagery data or a Simultaneous Localization And Mapping (SLAM) framework for continuous imagery sequences. Instead of integrating a self-implemented SfM or SLAM algorithm into the system, MeshRecon reads a specific file format, called the SFM file, as the interface to load camera poses from different sources. An SFM file specifies the 6DOF camera poses of all input images, the bounding
box of the 3D space of interest as well as the image neighboring information. This strategy ensures the flexibility of MeshRecon such that camera poses estimated using various third-party software can be easily utilized.
MeshRecon deploys a standard pin-hole camera model to describe the intrinsic camera configuration. A 3 x 4 projection matrix P projecting a point in 3D space onto the 2D image plane with homogeneous coordinates is represented as:

P = K [R | T]                                                          (6.1)

where K is the 3 x 3 intrinsic camera parameter matrix, R is the 3 x 3 rotation matrix and T is the 3 x 1 translation vector in terms of the world coordinate system. For simplicity, we assume the intrinsic camera model to have square pixels and that lens distortion is handled in pre-processing. K is then represented as:
K = [ f   0   cx
      0   f   cy
      0   0   1  ]                                                     (6.2)

where f is the focal length in pixel units and (cx, cy) is the 2D image plane center (usually the center of the image).
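As a small illustration of this camera model, the sketch below builds K from (f, cx, cy) and projects a 3D point with P = K [R | T]; it is a generic example assuming the Eigen library, not code from MeshRecon itself.

    #include <Eigen/Dense>

    // Project a 3D world point into the image using the pin-hole model P = K [R | T].
    Eigen::Vector2d projectPoint(double f, double cx, double cy,
                                 const Eigen::Matrix3d& R, const Eigen::Vector3d& T,
                                 const Eigen::Vector3d& X)
    {
        Eigen::Matrix3d K;
        K << f, 0, cx,
             0, f, cy,
             0, 0, 1;                        // square pixels, no skew (Eq. 6.2)

        Eigen::Vector3d x = K * (R * X + T); // homogeneous image coordinates
        return Eigen::Vector2d(x(0) / x(2), x(1) / x(2));
    }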
Most SfM systems, e.g., VisualSFM [15] and Bundler [4], and SLAM systems, e.g., PTAM [13], generate the camera poses of the input images as well as a point cloud of sparse 3D feature points. Besides the camera pose parameters, we also compute two additional pieces of
information and load them in the SFM file format, based on the output point cloud and the 2D-3D correspondences.
An accurate bounding box limits the depth sampling range in the multi-view stereo process, which is important for an efficient and accurate reconstruction. MeshRecon provides two options to automatically generate the bounding box based on the point cloud: (1) The bounding box is computed as the 3D volume covering all points in the point cloud, which is suitable for general reconstruction. (2) In many cases, the images are captured by hovering around the target object, and the bounding box is computed as the 3D volume visible to all cameras. User input can be easily added by changing the values in the SFM file. Besides the bounding box, image neighboring information is also important for stereo-based dense reconstruction methods. On the one hand, image neighbors should have a large overlap and similar points of view to ensure photometric consistency. On the other hand, a large baseline is needed to provide useful stereo evidence. Our system estimates the image neighboring information according to the number of 2D-3D feature matches sharing the same 3D point between image pairs.
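Option (1) above amounts to an axis-aligned box over the sparse SfM point cloud. A minimal, self-contained sketch is given below; in practice one would also reject outlier points before taking the extrema, which is omitted here.

    #include <algorithm>
    #include <vector>

    struct Point3 { double x, y, z; };
    struct BoundingBox { Point3 minCorner, maxCorner; };

    // Compute the axis-aligned bounding box covering all sparse SfM points.
    BoundingBox computeBoundingBox(const std::vector<Point3>& cloud)
    {
        if (cloud.empty()) return BoundingBox{};
        BoundingBox box{cloud.front(), cloud.front()};
        for (const Point3& p : cloud) {
            box.minCorner.x = std::min(box.minCorner.x, p.x);
            box.minCorner.y = std::min(box.minCorner.y, p.y);
            box.minCorner.z = std::min(box.minCorner.z, p.z);
            box.maxCorner.x = std::max(box.maxCorner.x, p.x);
            box.maxCorner.y = std::max(box.maxCorner.y, p.y);
            box.maxCorner.z = std::max(box.maxCorner.z, p.z);
        }
        return box;   // written to the SFM file to limit the depth sampling range
    }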
6.4 Dense Point Cloud Generation
Given the estimated camera poses of the input images, as the first step we extract dense geometry information from the images using a multi-view stereo algorithm. Specifically, we use the multi-resolution plane sweep algorithm [114] to generate a dense depth map for each image. Following the procedure in [61], we then compute a point-camera-ray 3D map containing point-camera visibility information based on the estimated depth maps.
6.4.1 Multi-resolution plane sweep
Various multi-view stereo methods have been proposed to estimate dense depth maps from imagery. In MeshRecon, we use a standard multi-resolution plane sweep algorithm parallelized on the GPU [114] (Figure 6.3). Given a selected key frame I_k as well as its neighboring images I_i in N(I_k), the plane sweep algorithm generates a dense depth map D_k whose values represent the depth to the camera image plane. We warp the neighboring images to the key frame I_k by sweeping through a series of virtual planes at different depths. Specifically, a neighboring image can be warped using a 3 x 3 homography matrix H [74]:

H = R + (1/d) T N^T                                                    (6.3)

where N is the normal vector of the image plane of I_k, R and T are the rotation and translation between the key frame and the neighboring frame, and d is the depth of the virtual image plane. For a specific depth d, we compute the photometric consistency score at pixel x as the sum of the Normalized Cross-Correlation (NCC) between the warped images and the key frame. For added flexibility, we combine the NCC computed at different resolutions [114]. The output value in the depth map is selected using a simple winner-take-all strategy, i.e., the depth value with minimum error is selected. The depth sampling range is computed using the bounding box provided in the input SFM file, which ensures accuracy with a limited number of samples. In MeshRecon, we parallelize the process for each pixel on the GPU for efficiency.
Figure 6.3: Dense depth maps estimated using multi-resolution plane sweep algorithm.
Top: the input image and estimated depth map. Bottom: the dense point cloud is full of
outliers
For our aerial imagery case, the sampling projection plane is swept along the normal direction of a virtual ground plane, which is estimated from the global SfM point cloud using Principal Component Analysis (PCA). This allows for efficient sampling since the height range of an urban scenario is usually limited.
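For illustration, the sketch below composes the plane-induced homography of Equation 6.3 for a given sweep depth, assuming the Eigen library and normalized (intrinsics-removed) image coordinates; it is a simplified example, not the GPU kernel used in MeshRecon.

    #include <Eigen/Dense>

    // Plane-induced homography H = R + (1/d) * T * N^T  (Eq. 6.3), defined by the
    // relative pose (R, T), the key-frame image plane normal N and the sweep depth d.
    Eigen::Matrix3d planeSweepHomography(const Eigen::Matrix3d& R,
                                         const Eigen::Vector3d& T,
                                         const Eigen::Vector3d& N,
                                         double d)
    {
        return R + (1.0 / d) * T * N.transpose();
    }

    // Warp one normalized pixel (x, y) through the homography.
    Eigen::Vector2d warpPixel(const Eigen::Matrix3d& H, double x, double y)
    {
        Eigen::Vector3d p = H * Eigen::Vector3d(x, y, 1.0);
        return Eigen::Vector2d(p(0) / p(2), p(1) / p(2));
    }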
6.4.2 Point-Camera-Ray 3D map
Following the pipeline of [61], we generate a dense point-camera-ray map describing the dense geometry as well as the visibility information. Essentially, the point-camera-ray map is a dense 3D point cloud with 2D-3D point-camera visibility information, i.e., whether a 3D point is visible from a specific camera.
As the first step, we generate an initial point-camera-ray 3D map incrementally. Given an estimated depth map D_k, we project each existing 3D point X in the dense point cloud onto D_k, with its projection at pixel x. If the depth value in D_k is consistent at pixel x for this 3D point, we mark X as visible from the camera of I_k and set pixel x in depth map D_k as "merged". By doing that, we avoid duplicate 3D points in the map. The "un-merged" pixels of D_k are then projected into 3D space and added into the point-camera-ray map. During this process, we filter out invalid depth estimations by cross-checking the depth consistency between neighboring images. Additionally, we filter out the points in texture-less regions due to the lack of photometric evidence supporting the geometry estimation. After obtaining the initial point-camera-ray 3D map, we further merge the closest neighboring 3D points with distance below a certain threshold to reduce the model complexity for later processing. As illustrated in Figure 6.2 (b), the generated point cloud is noisy, with many outliers, because the depth maps (Figure 6.3) estimated using plane sweep are inaccurate without spatial regularization.
6.5 Visibility-consistent Surface Reconstruction
Our goal is to generate a full 3D mesh model instead of a noisy dense point cloud. For this purpose, we implement the algorithm proposed in [61] to generate an initial 3D mesh model from the point-camera-ray 3D map. The reconstructed 3D mesh model is consistent with the visibility information included in the point-camera-ray 3D map. The pipeline is illustrated in Figure 6.4.
6.5.1 Delaunay triangulation
As the first step, we compute a Delaunay triangulation [23] of the dense point-camera-ray map; a minimal sketch of this step using CGAL is given at the end of this subsection. A Delaunay triangulation splits the 3D space inside the point cloud convex hull into a set of connected tetrahedrons. The vertices of the tetrahedrons are the 3D points in the point cloud. A Delaunay triangulation also satisfies the property that no vertex is inside the circum-hypersphere of any tetrahedron. Besides Delaunay triangulation, there are many other ways to provide a convenient representation of a 3D space partitioning, such as a uniform grid volume or an octree representation. Compared with other methods partitioning the 3D space, Delaunay triangulation provides the following advantages for our application:
• Adaptive resolution for objects at different scales: the 3D space inside the convex hull is partitioned into a set of tetrahedrons whose vertices are the 3D points in the point cloud. The space partition is invariant with respect to the coordinate scale, which ensures its generalization to objects at different scales.
(a) A dense point-camera-ray map generated using plane
sweep algorithm
(b) Delaunay triangulation partitions the 3D space inside
the convex hull into a set of tetrahedrons
(c) A visibility-consistent triangle mesh surface is produced
Figure 6.4: The pipeline of visibility-consistent surface reconstruction
• Geometry-sensitive 3D space partition: the 3D points in texture-less regions are filtered out when generating the point-camera-ray 3D map. As a result, we have a dense distribution of tetrahedrons in the textured areas where structural details are more likely to appear. Compared with a volume-based representation, texture-less regions in our system are described using fewer tetrahedrons, which makes the reconstruction process efficient.
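The sketch below builds a 3D Delaunay triangulation of a handful of points with CGAL, the library used in our implementation; the points are toy values, and extracting the tetrahedron adjacency needed for the later graph-cut is not shown, so this only illustrates the triangulation step itself.

    #include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
    #include <CGAL/Delaunay_triangulation_3.h>
    #include <iostream>
    #include <vector>

    typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
    typedef CGAL::Delaunay_triangulation_3<Kernel>              Delaunay;
    typedef Kernel::Point_3                                     Point;

    int main() {
        // Toy stand-in for the 3D points of the point-camera-ray map.
        std::vector<Point> points = {
            Point(0, 0, 0), Point(1, 0, 0), Point(0, 1, 0),
            Point(0, 0, 1), Point(1, 1, 1), Point(0.3, 0.4, 0.5)
        };

        Delaunay dt;
        dt.insert(points.begin(), points.end());   // incremental Delaunay insertion

        // Each finite cell is a tetrahedron; these become the nodes of the
        // graph used later for the inside/outside labeling.
        std::cout << "vertices: "   << dt.number_of_vertices()     << "\n"
                  << "tetrahedra: " << dt.number_of_finite_cells() << "\n";
        return 0;
    }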
6.5.2 Surface reconstruction with graph-cut
Given the Delaunay triangulation of the point-camera-ray map, we estimate the mesh surface using the method proposed in [61]. Specifically, we transform the surface reconstruction problem into a binary labeling problem. The algorithm labels each tetrahedron as either "inside" or "outside" with respect to the target object. The mesh surface can then be easily extracted as the union of the triangle facets between pairs of "inside" and "outside" tetrahedrons.
To solve this binary labeling problem, we build a finite directed graph G = (V, E) and solve a global optimization problem using graph-cut [61]. Each vertex v_i in V(G) represents a tetrahedron in the Delaunay triangulation, and each edge E(G) = (v_i, v_j) linking the vertices v_i and v_j represents the triangle facet between the adjacent tetrahedrons associated with v_i and v_j. This graph corresponds to the Voronoi diagram associated with the Delaunay triangulation. Additionally, two special vertices, v_out and v_in, are added into the graph with edges linked to each vertex.
Figure 6.5: Examples of degenerate triangle surfaces that are not two-manifold surfaces [24]
Given the constructed graph G, we solve the binary labeling problem by minimizing an energy function:

E(S) = E_vis(S) + E_qual(S)                                            (6.4)

which includes a visibility consistency term E_vis(S) and a regularization term E_qual(S). S represents the reconstructed surface between "inside" and "outside" tetrahedrons. The visibility consistency term E_vis(S) measures the visibility violations given the estimated surface S. Instead of using the triangle-circumsphere angle as proposed in [50], we use the normalized circumsphere diameter as the regularization term to handle reconstructions at different scales. By doing that, no parameter tuning is needed for different datasets. More details of the graph-cut optimization can be found in [61]. After labeling each tetrahedron as "inside" or "outside", an initial triangle mesh surface can be extracted as the facets between tetrahedrons with different labels. There are outlier points in the point-camera-ray 3D map, which leads to many reconstructed mesh surface components
that are noise. We filter out those noisy mesh surface components if the number of connected tetrahedrons is below a certain threshold. Additionally, we also separate degenerate points (Figure 6.5) to make the output surface model a 2-manifold surface.
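For illustration, the sketch below sets up a tiny binary labeling problem in the style described above with the maxflow library [10]; the node and edge weights are toy values, and the mapping from tetrahedrons and visibility rays to capacities is omitted, so this is only a schematic example of the interface.

    #include "graph.h"   // maxflow library by Boykov and Kolmogorov

    typedef Graph<double, double, double> GraphType;

    int main() {
        const int numTetrahedrons = 4;                     // toy example
        GraphType g(/*node_num_max=*/numTetrahedrons, /*edge_num_max=*/3);
        g.add_node(numTetrahedrons);

        // Terminal edges: capacities toward v_out (source) and v_in (sink),
        // derived in practice from the visibility term E_vis.
        g.add_tweights(0, 5.0, 0.0);
        g.add_tweights(3, 0.0, 5.0);

        // Edges between adjacent tetrahedrons, weighted in practice by the
        // regularization term E_qual on the shared triangle facet.
        g.add_edge(0, 1, 1.0, 1.0);
        g.add_edge(1, 2, 1.0, 1.0);
        g.add_edge(2, 3, 1.0, 1.0);

        double flow = g.maxflow();
        (void)flow;

        // In this toy convention, SOURCE-labeled tetrahedrons are "outside" and
        // SINK-labeled ones are "inside"; the surface is the set of facets
        // between differently labeled neighbors.
        for (int i = 0; i < numTetrahedrons; ++i) {
            bool outside = (g.what_segment(i) == GraphType::SOURCE);
            (void)outside;
        }
        return 0;
    }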
6.6 Variational Mesh Refinement
The surface reconstruction method described in the previous section produces an initial mesh surface model which fits the visibility information well. However, this initial model is inaccurate due to the lack of spatial regularization and the limitations of the plane sweep algorithm (Figure 6.6 (1)). As the last step, we refine the initial mesh model to capture structural details by fitting the dense photometric information from different views. Specifically, we use a variational method [50] to iteratively refine the vertex positions of the mesh model in a coarse-to-fine strategy, with its topology fixed (Figure 6.6 (2)(3)(4)).
6.6.1 Mesh-based energy function
In an image-based dense 3D reconstruction framework, an accurate model should fit the photometric information from different views, i.e., the texture information should be consistent when rendered to the same view point using different input images. Additionally, we include a spatial smoothness constraint to regularize the problem for texture-less regions. Following the pipeline of [50], we refine the mesh model by minimizing an energy function:

E_refine(S) = E_photo(S) + E_smooth(S)                                 (6.5)
Figure 6.6: Illustration of the mesh refinement. (1) Initial surface model. (2) Surface model refined from 0.25 down-sampled images. (3) Surface model refined from 0.5 down-sampled images. (4) Surface model refined from full size images
where S is the refined mesh surface model. E_photo(S) is the photometric data term measuring the photometric back-projection error given S:

E_photo(S) = Σ_{i,j} ∫_{Ω^S_ij} h(I_i, I^S_ij) dx_i                     (6.6)

which accumulates over all neighboring image pairs I_i and I_j as provided in the SFM file. I^S_ij is the warped image of I_j rendered at the view point of I_i through S, i.e., we texture the surface model S using image I_j, and render the warped image I^S_ij using the camera pose of I_i. Ω^S_ij is the valid image domain of the warped image I^S_ij, which represents the area visible to both I_i and I_j. h(., .) is a function measuring the consistency of the local image patches of I_i and I^S_ij at pixel x_i. We use Normalized Cross-Correlation (NCC) in MeshRecon to ensure robustness against illumination changes. An advantage of using the back-projection error measurement is that we do not need to explicitly handle perspective distortion from different view points, as we warp the neighboring images using the surface model directly. In the original implementation of [50], the smoothness regularization term is a discrete approximation of the thin-plate energy:

E_smooth(S) = ∫_S (k_1^2 + k_2^2) dS                                   (6.7)

where k_1 and k_2 are the principal curvatures. Compared with the membrane energy, which penalizes the surface area, the thin-plate energy penalizes strong surface bending instead,
without shrinkage bias. In MeshRecon, we found that a combination of the membrane energy and the thin-plate energy leads to better performance and avoids degenerate facets caused by noise in the point-camera-ray map. As proposed in [31], we approximate the energy function defined on a continuous surface domain by an energy function defined on the discrete triangle mesh surface using barycentric coordinates. More details of the variational optimization can be found in [88, 31, 50].
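Since h(., .) is realized with NCC, a small self-contained sketch of the patch consistency measure is given below; patch extraction and the warping through S are omitted, so it shows only the correlation itself.

    #include <cmath>
    #include <vector>

    // Normalized Cross-Correlation between two image patches of equal size,
    // e.g. a patch of I_i and the corresponding patch of the warped image I^S_ij.
    // Returns a value in [-1, 1]; higher means more photometrically consistent.
    double ncc(const std::vector<double>& a, const std::vector<double>& b)
    {
        const size_t n = a.size();
        double meanA = 0.0, meanB = 0.0;
        for (size_t i = 0; i < n; ++i) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;

        double num = 0.0, varA = 0.0, varB = 0.0;
        for (size_t i = 0; i < n; ++i) {
            double da = a[i] - meanA, db = b[i] - meanB;
            num  += da * db;
            varA += da * da;
            varB += db * db;
        }
        double denom = std::sqrt(varA * varB) + 1e-12;   // guards texture-less patches
        return num / denom;
    }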
6.6.2 GPU parallelization
The variational method generates highly accurate 3D models. To ensure a fast convergence rate, we implement the optimization process in a coarse-to-fine manner, i.e., we start the optimization using down-sampled images and move to higher-resolution images later (Figure 6.6 (2)(3)(4)). However, the computational cost is still a problem in this case. For this purpose, we parallelize the optimization process on the GPU using CUDA to ensure efficiency.
To parallelize the gradient computation for each vertex, we compute the weighted gradient associated with each pixel of the neighboring image pairs by allocating the processing of each pixel to a unique CUDA thread. To parallelize the warping of neighboring images, the depth map of each image is computed with a Z-Buffer. Notice that we need to use atomic operations to update the gradient values and the depth values in the Z-Buffer to avoid race conditions.
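The per-pixel accumulation described above can be written as a CUDA kernel in which each thread adds its pixel's contribution to a vertex gradient with atomicAdd, so that threads touching the same vertex do not race. The kernel below is a schematic sketch with hypothetical buffer names (in practice each pixel contributes to the three vertices of its facet via barycentric weights), not the actual MeshRecon kernel.

    #include <cuda_runtime.h>

    // One thread per pixel: each pixel contributes a weighted gradient to the
    // vertex it is associated with. atomicAdd avoids race conditions when many
    // pixels map to the same vertex.
    __global__ void accumulateVertexGradients(const float3* pixelGradients,
                                              const float*  pixelWeights,
                                              const int*    pixelToVertex,
                                              int numPixels,
                                              float3* vertexGradients)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= numPixels) return;

        int v = pixelToVertex[p];
        float w = pixelWeights[p];
        atomicAdd(&vertexGradients[v].x, w * pixelGradients[p].x);
        atomicAdd(&vertexGradients[v].y, w * pixelGradients[p].y);
        atomicAdd(&vertexGradients[v].z, w * pixelGradients[p].z);
    }

    // Launch example: one thread per pixel.
    // accumulateVertexGradients<<<(numPixels + 255) / 256, 256>>>(
    //     d_pixelGradients, d_pixelWeights, d_pixelToVertex, numPixels, d_vertexGradients);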
6.7 Implementation and Results
6.7.1 System implementation
MeshRecon was developed in C++ and tested on a PC with an Intel(R) Core(TM) i5-4200M dual-core CPU and 8GB of memory. Plane sweep and mesh refinement are parallelized on an NVIDIA GeForce GTX 765M GPU using CUDA. For Delaunay triangulation, we use the 3D triangulation function in the Computational Geometry Algorithms Library (CGAL) [7]. We use the maxflow library [10] developed by Vladimir Kolmogorov for graph-cut optimization.
6.7.2 Datasets and results
We evaluate our system on several real-world urban scenarios (Figure 6.7). For camera pose estimation of the multi-camera aerial imagery, we use the approach proposed in Chapter 5. Besides the aerial imagery case, we also test the system on various general imagery datasets to evaluate its performance and flexibility on different types of indoor and outdoor objects at different scales. For these datasets, we deploy the camera poses estimated using VisualSFM [15] and convert them to the SFM file format. For all experiments, we use the same set of parameters without tuning. An image pyramid of three levels with a scale factor of 0.5 is used for the coarse-to-fine implementation.
6.7.2.1 Aerial imagery
Rochester contains an aerial video covering the urban area in Rochester, NY, USA, focusing on the ground area with latitude from N 43°08'42" to N 43°09'54" and longitude from W 77°36'42" to W 77°37'12".
Figure 6.7: Example frame of multi-camera aerial datasets. From top to bottom:
Rochester, CLIF 06 and CLIF 07
Dataset                                  Running-time
Name        Image Size    # Frames   Cam Pose.   MeshRecon
Rochester   4872 x 3248   1375       48 min      6 hours
CLIF 06     4008 x 2672   890        25 min      4 hours
CLIF 07     4008 x 2672   900        28 min      4 hours
Table 6.1: Dataset details and running time for the aerial imagery. # Frames: number of images used for reconstruction. Cam Pose.: running time of camera pose estimation. MeshRecon: running time of mesh reconstruction given the estimated camera poses.
The aerial video (Figure 6.7 (top)) is captured in 2010 using a 5-camera system with an image resolution of 4872 x 3248 for each camera.
Columbus Large Image Format (CLIF) 06 / 07 [5, 6] are two publicly available datasets containing aerial videos captured in 2006 and 2007, respectively, covering the urban area in Columbus, OH, USA, focusing on the ground area with latitude from N 39°59'6" to N 40°00'54" and longitude from W 82°59'42" to W 83°02'6". CLIF06 (Figure 6.7 (center)) is captured using a 6-camera system with an image resolution of 4008 x 2672 for each camera, and CLIF07 (Figure 6.7 (bottom)) is captured using a 6-camera system with an image resolution of 4016 x 2672 for each camera. We produce the city 3D models using images from the center-bottom camera of CLIF06 and the 3 top-row cameras of CLIF07. More details of the datasets and running time statistics are shown in Table 6.1.
The reconstructed models, illustrated in Figure 6.8, are complete, detailed 3D mesh models over the entire city scenarios. As shown in Figure 6.9, geometric details of both building structures and vegetation at different scales are well captured in the reconstructed 3D model. To quantitatively evaluate the accuracy, we use the 3D model of Rochester, NY, USA provided by Google as the ground truth for comparison (Figure 6.10). The result shows that our model is very accurate, with a geometric distance error of less than 1.0 meter over the entire city, which is a large improvement compared with the 2.0 meter error of the online framework proposed in Chapter 3.
Figure 6.8: Results on multi-camera aerial datasets. From top to bottom: results on
Rochester, CLIF 06 and CLIF 07
Figure 6.9: Close-up views of the reconstructed model on Rochester. Geometric details of building structures as well as vegetation are well captured
Figure 6.10: Results on Rochester. Top: the reconstructed city 3D model on Rochester,
NY, USA using aerial imagery. Bottom: the city 3D model provided by Google
Dataset                                             Running-time
Name                Image Size    # Frames   VisualSFM   MeshRecon   Total
woodland squirrel   2048 x 1152   20         31s         1m 54s      2m 25s
squirrel            2048 x 1152   23         26s         2m 06s      2m 32s
gnome               2048 x 1152   27         31s         2m 36s      3m 07s
gargoyle            2048 x 1152   23         1m 41s      3m 04s      4m 45s
shoe                1280 x 720    34         4m 00s      1m 45s      5m 45s
Table 6.2: Dataset details and running time for the indoor environment. # Frames: number of images used for reconstruction. VisualSFM: running time of camera pose estimation using VisualSFM. MeshRecon: running time of mesh reconstruction given the estimated camera poses. Total: total running time including camera pose estimation and dense reconstruction.
6.7.2.2 Indoor object
A common usage case of MeshRecon is geometric 3D model acquisition of a target object in an indoor environment. In this case, the target object, e.g., a sculpture, is placed on a table and the user captures images from different views by circling around the object. We validate the performance of MeshRecon on several different datasets of real-world objects, including self-collected data and publicly released data.
For data collection, woodland squirrel, squirrel, gnome and gargoyle are collected in a lab environment using the camera of a Samsung Galaxy S4 Active mobile phone. There is no particular setup for lighting or background for this set of data. The collected datasets are also released at [11]. shoe is included in the Image Sequence dataset [106], which is a video collected using a commodity hand-held digital camera. We selected a subset of the video frames to evaluate our system since it is designed for discrete image sets.
Figure 6.11: Results on indoor datasets. From top to bottom: results on
woodland squirrel, squirrel, gnome, gargoyle and shoe
Dataset                                          Running-time
Name            Image Size    # Frames   VisualSFM   MeshRecon   Total
herz-jesu       3072 x 2048   8          27s         3m 44s      4m 11s
entry           3072 x 2048   10         46s         4m 16s      5m 02s
fountain        3072 x 2048   11         50s         5m 12s      6m 02s
castle          3072 x 2048   19         2m 21s      4m 31s      6m 52s
lion fountain   4104 x 2736   11         34s         6m 55s      7m 29s
Table 6.3: Dataset details and running time for the outdoor environment. # Frames: number of images used for reconstruction. VisualSFM: running time of camera pose estimation using VisualSFM. MeshRecon: running time of mesh reconstruction given the estimated camera poses. Total: total running time including camera pose estimation and dense reconstruction.
Details of the datasets and running-time analysis are listed in Table 6.2, including the number of selected frames and image size. The reconstruction results are illustrated in Figure 6.11, with close-up views showing that the structural details are well captured.
6.7.2.3 Outdoor scenes
Another common usage case is geometric 3D model acquisition of objects/scenes in an outdoor environment, such as building structures and statues. We also evaluate MeshRecon on various types of objects/scenes for this case. herz-jesu, entry, fountain and castle are four dense multi-view stereo datasets of outdoor building structures included in the CVLab dataset [8]. lion fountain is a dataset from Smart3DCapture by Acute3D [1]. Details of the used datasets are listed in Table 6.3. The reconstruction results are illustrated in Figure 6.12, showing that MeshRecon is flexible and robust for outdoor objects of different types at different scales.
Figure 6.12: Results on outdoor datasets. From top to bottom: results on herz-jesu,
entry, fountain, castle and lion fountain
6.7.3 Discussions
6.7.3.1 Stationary camera and moving object
As mentioned in Chapter 1, we make the assumption in this dissertation that the target object is stationary while the camera moves and captures images from different views. In this section, we investigate another way of capturing imagery data: the camera is stationary and the target object is moving. In particular, we study this problem with experiments on geometric 3D model acquisition of human faces. In our experiments, we compare the results generated with the camera moving, i.e., the target face is stationary and the camera is moving, with the results generated with the face moving, i.e., the target face is moving and the camera is stationary. Additionally, the lighting condition is an important factor for maintaining photometric consistency. We also adjust the lighting condition and compare the results generated under diffused lighting, which gives isotropic lighting from different views, with results generated under direct lighting, which gives anisotropic lighting with shadows. We use the camera of a Samsung Galaxy S4 Active mobile phone to capture the imagery data. About 20 images are used, with an image resolution of 2048 x 1152, in all experiments.
The reconstruction results are illustrated in Figure 6.13. As expected, the geometric 3D models are well reconstructed with the camera moving under both diffused lighting (Figure 6.13 (a)) and direct lighting (Figure 6.13 (b)), because the photometric information is consistent from different views when the object is stationary. When the face is moving under diffused lighting (Figure 6.13 (c)), the photometric information in
(a) Results with camera moving and diffused lighting
(b) Results with camera moving and direct lighting
(c) Results with face moving and diffused lighting
(d) Results with face moving and direct lighting
Figure 6.13: Comparison of results reconstructed under different conditions: camera moving vs. face moving; diffused lighting vs. direct lighting
Dataset                                           Running-time
Name             Image Size    # Frames   VisualSFM   MeshRecon   Total
hand sanitizer   3264 x 2448   18         1m 06s      4m 11s      5m 17s
juice bottle     3264 x 2448   22         2m 02s      5m 33s      7m 35s
Table 6.4: Dataset details and running time for the challenging cases. # Frames: number of images used for reconstruction. VisualSFM: running time of camera pose estimation using VisualSFM. MeshRecon: running time of mesh reconstruction given the estimated camera poses. Total: total running time including camera pose estimation and dense reconstruction.
most regions is consistent because the lighting is isotropic from different views. The reconstruction result is satisfactory except for some small artifacts on the cheek. However, the system fails when the face is moving under direct lighting (Figure 6.13 (d)), where the photometric information changes largely from view to view, e.g., shadows on the face. In summary, the system is robust under different lighting conditions when the object is stationary and the camera is moving. When the object is moving, the reconstruction performance becomes sensitive to the lighting conditions.
6.7.3.2 Failure cases
Multi-view stereo techniques tend to fail on objects with transparent and non-Lambertian surfaces due to the inconsistency of photometric information. We test MeshRecon on two datasets of transparent and non-Lambertian objects, hand sanitizer and juice bottle, to evaluate its performance on these challenging cases. The dataset details and running time are listed in Table 6.4. The reconstruction results are illustrated in Figure 6.14. The results show that MeshRecon fails to provide an accurate reconstruction of the transparent regions due to the inconsistency of photometric information. Object prior information could be utilized to handle these challenging cases.
Figure 6.14: Results on objects with transparent and non-Lambertian surfaces. From top to bottom: results on hand sanitizer and juice bottle
6.8 Conclusion
In this chapter, we presented MeshRecon, a mesh-based offline 3D model acquisition system. It takes images from different views as input and produces a 3D mesh model of the target object/scene. The system is implemented by integrating several state-of-the-art techniques in the computer vision field. Results on several real-world multi-camera aerial imagery datasets show that our system is very accurate and is able to generate city 3D models with an error of less than 1 meter using imagery data only. Furthermore, experimental results on various object datasets show that MeshRecon is flexible and robust for different types of objects at different scales in both indoor and outdoor environments. With parallelization on the GPU, it is able to generate detailed 3D models very efficiently using a commodity PC. The executable version of MeshRecon as well as the indoor datasets used for evaluation are released to the computer vision research community at [11].
Chapter 7
City-Scale Geometry Change Detection from Aerial
Imagery
7.1 Introduction
Cities change over time in terms of both appearance and geometry as new projects occur to meet the capricious expectations of citizens. Change detection has a long history [69, 22, 51, 109, 28, 91, 100, 101] in the computer vision community, and many methods have been proposed to detect appearance changes and geometry changes of urban scenarios. In this chapter, we focus on detecting geometry changes, which are beneficial for many applications, such as city 3D map maintenance and urban planning.
Geometry change detection from imagery is challenging because of the large appearance changes over time due to factors irrelevant to geometry changes, such as changes in illumination, camera condition and climate. We address this problem in the context of high-resolution Wide Area Aerial Imagery (WAAI), which is captured using a multi-camera system mounted on an airborne platform hovering over the target area. For
change detection, high-resolution WAAI is advantageous in terms of efficiency and level of detail:
• Efficient city-scale coverage: the multi-camera system comes with a wide virtual angle of view, enabling coverage of an entire city of several square kilometers in a single run.
• Detailed geometry information: thanks to the growing resolution of the multi-camera imagery, detailed geometry information can be inferred, which allows for the detection of small-scale geometry changes.
Geometry changes occur at different scales: buildings are constructed and demolished, resulting in changes over tens of meters; trees are planted and cut down, resulting in changes within several meters. The properties of WAAI in turn make geometry change detection more challenging at the 2D image level (Figure 7.1), even for a human analyst. For this purpose, we instead propose to solve the problem by performing comparisons at the 3D geometry level: city 3D models are inferred from multi-camera WAAI captured at different times; the models are aligned and scaled to match physical scale; geometry changes are then identified based on the model-to-model geometric distance. It is also possible to perform the comparison when pre-made 3D models are available, such as Google map or LiDAR data. We evaluate our approach on two real-world urban scenarios, each covering a ground area of several square kilometers. Our results show that our approach is able to detect detailed geometric changes, ranging from an entire building cluster to vegetation changes, over the entire city.
Figure 7.1: Two 6-camera WAAI frames captured in 2006 (left) and 2007 (right), with close-up views covering the ground area where a building is demolished in 2007. It is difficult to identify the geometry changes over the entire city by comparing the high-resolution multi-camera aerial imagery, even for human analysts
7.2 Related Work
Change detection is not a new problem in the computer vision community. Various methods have been proposed for general urban change detection using the appearance inconsistency between images captured at different times [91]. In [85], a 3D voxel-based probabilistic model is maintained, storing occupancy and appearance information to handle occlusion. [33] focuses on robust change detection from noisy images using 3D line segments. In this case, the appearance changes could result from general factors other than geometry changes, such as changes in illumination and camera condition.
There are also many works focusing on geometry changes. [100] addresses the problem using the appearance inconsistency between images warped based on an original 3D model. The warped images are captured in the same epoch, which avoids most appearance inconsistency irrelevant to geometry changes. [101] extends this work to city-scale change detection using a cadastral 3D model and panorama images captured from a vehicle driving around the city. However, its capability for small-scale changes is limited in our case due to the small parallax in aerial imagery. Another group of works performs 3D comparison using geometry information inferred from different sources. In [32], stereo images and a given GIS database are used to obtain the geometry information. [103], instead, uses satellite images captured at different epochs as input. Limited by the level of detail, they represent the urban scenario using a 2.5D Digital Surface Model (DSM). Detection of full 3D geometry changes is beyond their capability. To the best of our knowledge, our work is the first to detect geometry changes, including details of individual vegetation, over an entire city.
Figure 7.2: Pipeline of our approach. City-scale 3D models, M_0 and M_1, are produced from data recorded at time t_0 and t_1. After alignment, geometry changes can be identified based on the geometric distance between M_0 and M_1
7.3 Pipeline Overview
The pipeline of our approach is composed of two steps (Figure 7.2). In the geometry inference step, two city 3D models, M_0 and M_1, are produced from data recorded at different times, t_0 and t_1. Our goal is to identify the geometry changes between time t_0 and t_1 by analyzing the 3D inconsistency between M_0 and M_1. In the change detection step, M_0 and M_1 are aligned into the same coordinate system, and the geometry changes are detected based on the geometric distance between M_0 and M_1.
The key element in our approach is the availability of the city 3D models, M_0 and M_1, describing the geometry of the area of interest at different times. For this purpose, we
propose to infer city 3D models containing detailed geometry information from high-resolution multi-camera WAAI using dense reconstruction techniques. If available, the use of accurate pre-made 3D models from other sources is also an option. In section 7.4, we describe the inference of geometry from aerial imagery using dense reconstruction techniques. In section 7.5, we explain the geometry change detection step. In section 7.6, we present experimental results on two real-world urban scenarios. In section 7.7, we conclude this chapter with future work.
7.4 City-Scale Geometry Inference
Given the high-resolution aerial imagery captured using a multi-camera system, we utilize a dense 3D reconstruction technique to produce a city-scale 3D model containing detailed geometry information. This process is accurate, using a variational mesh refinement approach, and efficient, with parallelization on the GPU. More details of the reconstruction process are described in Chapter 6.
Additionally, the mesh resolution is adaptively adjusted to the image resolution, as proposed in [50]. To better handle texture-less surfaces in urban scenarios, the facet subdividing process is modified to favor textured surfaces (Figure 7.3): a triangle facet is subdivided if it covers more than 16 pixels in at least one image pair and the total intensity variance of the projected pixels is larger than a certain threshold.
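This subdivision test can be expressed as a small predicate over the pixels that a facet projects to. The sketch below is a simplified, self-contained version of that decision; the variance threshold and the way projected intensities are gathered are illustrative assumptions, not the exact settings of our system.

    #include <vector>

    // Decide whether a triangle facet should be subdivided, following the
    // texture-favoring criterion: enough projected pixels in at least one image
    // pair, and enough intensity variance among those pixels.
    bool shouldSubdivideFacet(const std::vector<std::vector<double>>& projectedIntensitiesPerPair,
                              int minPixels = 16, double minVariance = 25.0)
    {
        for (const std::vector<double>& intensities : projectedIntensitiesPerPair) {
            if ((int)intensities.size() <= minPixels) continue;   // facet too small in this pair

            double mean = 0.0;
            for (double v : intensities) mean += v;
            mean /= intensities.size();

            double variance = 0.0;
            for (double v : intensities) variance += (v - mean) * (v - mean);
            variance /= intensities.size();

            if (variance > minVariance) return true;   // textured and large enough
        }
        return false;   // keep texture-less or tiny facets coarse
    }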
Figure 7.3: Facet subdividing process favors textured surfaces. Texture-less regions are
represented with fewer triangle facets
7.5 Change Detection from 3D Comparison
After aligning the city 3D models describing the geometry at different times, we measure the difference between them using point-to-mesh geometric distance in physical scale. Geometry changes are then identified and categorized based on the geometric distance.
7.5.1 Alignment of city 3D models
One possible way to align the 3D models is to perform image registration between the input imagery captured at different times, which implicitly aligns the models. In our case, the appearance changes significantly between the images, which makes image registration difficult. Additionally, the 3D model could be obtained from techniques like laser scanning, where texture information is missing. For this reason, we choose to align the 3D models directly. We first perform a coarse alignment of the two 3D models by estimating a similarity transformation using anchor points. The alignment is then refined using the Iterative Closest Point (ICP) method [27], where outliers are iteratively removed to make it robust to possible geometry changes.
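As an illustration, the following is a minimal sketch of this two-stage alignment, assuming both models are given as NumPy vertex arrays. The closed-form similarity estimate follows Umeyama [105]; the trimming ratio and iteration count are assumptions, and a production version would follow the ICP variant of [27] rather than this simplified loop.

import numpy as np
from scipy.spatial import cKDTree

def similarity_from_correspondences(src, dst):
    """Estimate scale s, rotation R, translation t mapping src points onto dst [105]."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    cs, cd = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(cd.T @ cs / len(src))   # cross-covariance SVD
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                               # avoid a reflection
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / cs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def trimmed_icp(src, dst, iters=30, keep=0.8):
    """Refine the alignment, dropping the worst residuals each iteration so that
    genuine geometry changes do not bias the estimate."""
    tree = cKDTree(dst)
    cur = src.copy()
    for _ in range(iters):
        dist, idx = tree.query(cur)                          # closest-point matches
        order = np.argsort(dist)[: int(keep * len(cur))]     # discard likely changes/outliers
        s, R, t = similarity_from_correspondences(cur[order], dst[idx[order]])
        cur = (s * (R @ cur.T)).T + t
    return cur

Coarse alignment uses similarity_from_correspondences on the anchor points directly; trimmed_icp then refines the result against the full vertex sets.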
7.5.2 Double-sided geometric distance map
Most geometry change detection methods using aerial data represent the urban scenario
using a 2.5D Digital Surface Model (DSM) and handle noise using image processing
operations. Given the accurate 3D models from high-resolution aerial imagery, we found
that a simple model-to-model comparison at the 3D geometry level allows for accurate
detection of detailed geometry changes without special noise handling operations.
Voxel-based measurement [85, 100] has been a popular choice for measuring geometric difference between different 3D models. However, the scalability is a problem for detection of detailed geometry changes over an entire city. For this purpose, we instead use the geometric distance from model M_0 to M_1 to evaluate the difference and produce a distance map D_{M_0→M_1}, where each vertex p in M_0 is assigned the point-to-surface Euclidean distance to M_1 [19]:

D_{M_0 \rightarrow M_1}(p) = \min_{p' \in M_1} \| p - p' \|_2    (7.1)
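A minimal sketch of how such a distance map can be computed is given below. For brevity it approximates the point-to-surface distance of Eq. (7.1) by the distance to the nearest vertex of M_1, which is reasonable when M_1 is densely triangulated; an exact implementation (as in the MESH tool [19]) would measure the distance to the closest triangle.

import numpy as np
from scipy.spatial import cKDTree

def distance_map(verts_m0, verts_m1):
    """Approximate D_{M0->M1}(p) for every vertex p of M0 via nearest-vertex queries."""
    tree = cKDTree(verts_m1)
    d, _ = tree.query(verts_m0)
    return d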
Figure 7.4: The point-to-surface geometric distance is not symmetric. D_{M_0→M_1} has a larger value than D_{M_1→M_0} for the extra trees that are present in M_0

However, the model-to-model geometric distance is not symmetric, and the values of D_{M_0→M_1} depend on which model is set as M_0 and which as M_1. As shown in Figure 7.4, the distance in D_{M_0→M_1} is much larger than in D_{M_1→M_0} for the extra structures that are present in M_0 and demolished in M_1. For this reason, we estimate the distance maps from both sides to handle the asymmetry, and use only the geometric distance assigned to the extra structures to reduce duplicated responses. In experiments, we define the extra structures as structures with a large distance to the local ground plane.
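The double-sided comparison and the extra-structure filtering can be sketched as follows, reusing the distance_map sketch above. The grid cell size, the height margin over the local ground and the z-up convention are assumptions, not the exact values used in our experiments.

import numpy as np

def extra_structure_mask(verts, cell=50.0, margin=4.0):
    """Mark vertices lying well above the lowest point of their ground cell (z axis up)."""
    keys = np.floor(verts[:, :2] / cell).astype(np.int64)
    mask = np.zeros(len(verts), dtype=bool)
    for key in {tuple(k) for k in keys}:
        sel = np.all(keys == np.array(key), axis=1)
        ground = verts[sel, 2].min()              # crude local ground estimate per cell
        mask[sel] = verts[sel, 2] > ground + margin
    return mask

def double_sided_changes(v0, v1, threshold):
    """Changed vertices in each model: large distance to the other model AND extra structure."""
    d01 = distance_map(v0, v1)
    d10 = distance_map(v1, v0)
    changes_in_m0 = (d01 > threshold) & extra_structure_mask(v0)
    changes_in_m1 = (d10 > threshold) & extra_structure_mask(v1)
    return changes_in_m0, changes_in_m1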
The uncertainty in texture-less structures, such as river surfaces, is the major source of error in the reconstructed models. Instead of trying to solve this difficult problem, we choose to remove those uncertain structures. The texture-less surfaces can be easily removed as they are represented using a few large triangles thanks to the texture-favoring subdividing method described in section 7.4.
7.5.3 Geometry changes with physical scale
For change detection, we colorize the distance maps according to the geometric distance, which provides a straightforward visualization of the geometry changes (Figure 7.4). The scale of the reconstructed model is determined from the camera pose estimation process and no physical meaning is guaranteed. In experiments, we scale up the 3D models to physical scale using geo-tag information if available. For datasets without such information, we scale the models by measuring the physical scale from the Google Earth service. As shown in Figure 7.8, the structural changes can be easily categorized based on the physical scale: building-scale changes usually come with a scale of tens of meters, whereas sub-building and afforestation changes usually come at a much smaller scale. A more detailed quantitative analysis of the sources of the detected geometry changes at different scales is presented in section 7.6.2.

(a) ROI on Rochester, NY, USA (latitude from N 43°08′42″ to N 43°09′54″ and longitude from W 77°36′42″ to W 77°37′12″)
(b) ROI on Columbus, OH, USA (latitude from N 39°59′6″ to N 40°00′54″ and longitude from W 82°59′42″ to W 83°02′6″)
Figure 7.5: Illustration of the region of interest on Rochester and Columbus datasets
7.6 Experiments
7.6.1 Real-world urban scenarios
We evaluate our system on two real-world urban scenarios (Figure 7.5), each covering a ground area of several square kilometers. The first urban scenario is Rochester, NY, USA, with a city 3D model of 2010 generated from the aerial imagery dataset Rochester, and a city 3D model of 2014 provided by Google (Figure 7.6). The other urban scenario is Columbus, OH, USA, with a city 3D model of 2006 and a city 3D model of 2007, which are reconstructed using the aerial imagery datasets CLIF06 and CLIF07. The regions of interest are illustrated in Figure 7.5, and the reconstructed 3D models are illustrated in Figure 7.7.

Figure 7.6: The 3D model from Google covers the urban area in Rochester, NY, USA. Texture is also provided but not used in our approach

(a) City 3D model of Rochester 2010 generated from aerial imagery
(b) City 3D model of Rochester 2014 provided by Google
(c) City 3D model of Columbus 2006 generated using aerial imagery
(d) City 3D model of Columbus 2007 generated using aerial imagery
Figure 7.7: City 3D models on two real-world urban scenarios
7.6.2 Change detection results
To distinguish the geometry changes from noise, we label the structures with a geometric distance larger than a certain threshold as structures with geometry changes. For our experiments, we set the threshold to 3 meters for Rochester and 10 meters for CLIF according to the ground sampling distance of the aerial imagery. As the examples in Figure 7.8 show, our approach is able to identify geometry changes at different scales: large scale geometry changes over tens of meters correspond to an entire building or building cluster, whereas small scale changes within several meters are usually related to afforestation changes and small building modifications.
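As a small illustration, the thresholding and scale binning can be written as below. The 3 m and 10 m thresholds follow the text, and the bins match those reported in Figure 7.9; everything else is a sketch, not the exact experimental script.

import numpy as np

def categorize_changes(distances_m, dataset="Rochester"):
    """Count detected changes per scale range for one dataset."""
    threshold = 3.0 if dataset == "Rochester" else 10.0
    edges = [3.0, 10.0, 20.0, np.inf] if dataset == "Rochester" else [10.0, 20.0, np.inf]
    labels = ["3m-10m", "10m-20m", ">20m"] if dataset == "Rochester" else ["10m-20m", ">20m"]
    changed = distances_m[distances_m > threshold]     # keep only labeled changes
    hist, _ = np.histogram(changed, bins=edges)
    return dict(zip(labels, hist))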
For a quantitative evaluation of the performance and a better understanding of the source of the geometry changes at different scales, we manually label all detected geometry changes into building changes, afforestation changes and false positives. The quantitative evaluations are categorized based on the geometric distance and illustrated in Figure 7.9. The results show that our approach is able to detect both building changes and afforestation changes at different scales. The detection of large scale geometry changes is accurate with few false positives, whereas there are more false positives in the detected small scale changes. Additionally, the detection performance for geometry changes at different scales depends on the ground sampling distance of the input imagery, e.g., there are few false positives for geometry changes ranging from 10 meters to 20 meters on Rochester, whereas more false positives are found on CLIF in the same range because the CLIF06 imagery has a larger ground sampling distance. We also analyze the error source of the large scale false positives. As the examples in Figure 7.10 show, large scale false positives mainly result from the divergence of the level of detail in the different models. More examples of the detected geometry changes are shown in Figure 7.11.

Figure 7.8: Detailed geometry changes at different scales are detected, ranging from an entire building cluster over tens of meters to an individual tree within several meters. Geometry changes corresponding to the extra structures in M_0 are colorized according to the geometric distance (0 m to 37 m) and displayed in M_1 for better visualization

Figure 7.9: Quantitative result of the detected geometry changes, grouped by geometric distance (3 m-10 m, 10 m-20 m and over 20 m for Rochester; 10 m-20 m and over 20 m for CLIF)
7.6.3 Discussion on complementary methods
For applications like 3D model update, it is not necessary to re-compute the entire city 3D model when most of the scenario remains unchanged. Instead, a coarse-to-fine strategy can be used: regions with possible geometry changes are located first; dense reconstruction and detailed change detection are performed in these regions only. For this purpose, complementary methods like [100] could be used to provide the changed regions at a coarse level by performing comparisons at the 2D image level.

Figure 7.10: False positive examples. Left: the sloped stadium roof is simplified as a plane in the Google model. Right: limited by the ground sampling distance, the aerial imagery fails to capture the netted antenna, which is treated as a solid tower in the Google model
WAAI usually comes with a small baseline as it is captured by an airborne platform flying at a high altitude (several thousand feet). The pixel displacement caused by geometry changes is therefore small (e.g., 25 pixels for a building-scale geometry change in the example shown in Figure 7.12 (a)). Additionally, texture-less and homogeneously-textured surfaces cause problems. For this reason, pixel-wise measurements, such as the absolute intensity difference (Figure 7.12 (b)) used in the original pipeline of [100], are unreliable and sensitive to the accuracy of the original 3D model. Instead, robust measurements with a strong spatial regularization, such as the NCC score with a large window size and optical flow (Figure 7.12 (c)(d)), have shown potential to provide the changed regions at a coarse level. However, the capability of image-based complementary methods for small scale geometry changes, such as afforestation changes, is still questionable in the context of high-resolution multi-camera WAAI.
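For reference, a minimal sketch of such a windowed NCC measurement between the original image and the image warped through the reference model is given below. The default window size follows Figure 7.12 (c), while the decision threshold for candidate regions is an assumption.

import numpy as np
from scipy.ndimage import uniform_filter

def ncc_map(img_a, img_b, win=11):
    """Per-pixel normalized cross-correlation over a win x win window.
    Low values indicate candidate changed regions."""
    a = img_a.astype(np.float64)
    b = img_b.astype(np.float64)
    mu_a, mu_b = uniform_filter(a, win), uniform_filter(b, win)
    var_a = uniform_filter(a * a, win) - mu_a ** 2
    var_b = uniform_filter(b * b, win) - mu_b ** 2
    cov = uniform_filter(a * b, win) - mu_a * mu_b
    return cov / np.sqrt(np.maximum(var_a * var_b, 1e-12))

# candidate_regions = ncc_map(original, warped) < 0.5   # assumed threshold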
Figure 7.11: Examples of the detected geometry changes. From left to right: geometric distance map D_{M_0→M_1}, model M_1 with texture, geometry changes corresponding to the extra structures in M_0 are displayed in M_1 and colorized based on the geometric distance for visualization
(a) Original image where the center building structure (marked in red) is demolished
(b) Map of absolute intensity difference between the original image and the warped image
(c) Map of NCC scores with window size of 11 between the original image and the warped image
(d) Map of optical flow [115] between the original image and the warped image
Figure 7.12: An example of the imagery inconsistency from geometry changes in aerial imagery
7.7 Conclusion
In this chapter, we addressed the problem of detecting geometry changes in the context of Wide Area Aerial Imagery (WAAI) captured using a multi-camera system mounted on an airborne platform hovering over the target area. Appearance changes over time and the properties of WAAI make the problem difficult to solve at the 2D image level. We addressed this problem by performing comparisons at the 3D geometry level between city 3D models generated at different times. Accurate 3D models are produced from high-resolution multi-camera WAAI using a dense reconstruction technique, allowing for detection of detailed geometry changes over an entire city.

We categorized the detected geometry changes based on the geometric distance, where no semantic meaning is guaranteed. Our future work is to incorporate both the geometry and appearance information for a semantic categorization of the detected geometry changes.
Chapter 8
Conclusion and Future Work
In this dissertation, we investigated the problem of accurate 3D model acquisition from imagery data. The input are images of the target object/scene captured from different views, and the output is a full 3D mesh model. We explored both online and offline techniques on two important application scenarios: city-scale 3D reconstruction from multi-camera aerial imagery and general 3D model acquisition with a commodity camera. For the aerial case, we started with a novel online city 3D reconstruction framework using aerial video, integrating a multi-view stereo method and a volumetric method. For camera pose estimation, we proposed an optical flow guided feature matching algorithm for efficient 2D-2D and 2D-3D feature matching. Furthermore, a novel track-then-merge algorithm is presented to handle high-resolution multi-camera aerial imagery using multiple GPUs. For accurate dense reconstruction, we presented MeshRecon, an offline system using sparse images as input, which is flexible and robust for various objects in both indoor and outdoor environments at different scales. The reconstructed city 3D model is accurate with error less than 1 meter over the entire city. Based on the reconstructed city models, a novel geometry change detection system is proposed with the capability to identify geometry changes at different scales reliably. For general 3D model acquisition, a hybrid system with a hand-held camera is proposed integrating the good elements of both online and offline techniques: the online stage provides an immediate visual feedback; the offline stage ensures the accuracy.
8.1 Contributions
Compared with the state of the art, we provided three key contributions in terms of novelty in this dissertation:

• City-Scale 3D Reconstruction using Multi-Camera Aerial Imagery. The first key contribution is the novel framework of city 3D reconstruction using high-resolution multi-camera aerial imagery. To the best of our knowledge, this is the first 3D reconstruction system proposed to handle multi-camera aerial imagery. For camera pose estimation, we found that a rigid transformation of the camera rig is not reliable due to the vibrations during recording. In our system, we proposed an estimate-then-merge strategy to ensure the pose consistency of multiple cameras. Parallelization of the major processes makes the system efficient. For dense reconstruction, both online and offline solutions are explored for different application scenarios.
• Accurate City-Scale Geometry Change Detection. The second key contribution is the novel framework for city-scale geometry change detection. After aligning city 3D models captured at different times, we solve the geometry change detection problem by comparing at the 3D geometry level directly. To the best of our knowledge, this is the first system that provides efficient detection of geometry changes at different scales, ranging from a building cluster down to individual vegetation over the entire city.
• 3D Acquisition Integrating Online and Offline Techniques. The third key contribution is the novel framework for 3D model acquisition with a hand-held camera integrating both offline and online techniques. The hybrid system shares the good elements of both sides: in the live scanning stage, the 3D model is incrementally updated, providing an immediate visual feedback to guide the user's next movement; in the offline stage, the 3D model is refined to ensure the accuracy.
Besides the novelty contributions presented above, we also provided two key contributions
in terms of the implementation of practical systems:
• Accurate City 3D Reconstruction System for Multi-Camera Aerial Imagery. We presented an end-to-end 3D reconstruction system to produce accurate city geometric 3D models using aerial imagery. An efficient track-and-merge algorithm is proposed for camera pose estimation of high-resolution multi-camera aerial imagery using multiple GPUs. Our system has been successfully transferred to Lawrence Livermore National Laboratory (LLNL) and integrated into practical usage.
• Release of a Mesh-based Offline 3D Model Acquisition System. We proposed MeshRecon, a mesh-based offline system for 3D model acquisition using a set of sparse images as input. The system is flexible for various objects at different scales in both indoor and outdoor environments, and efficient with parallelization on the GPU using a standard PC. We have released our system as well as an indoor imagery dataset on the website [11] to the computer vision research community for better evaluation and comparison of different geometric 3D model acquisition approaches.
8.2 Future Work
The results from our research point to a number of future directions. In particular, we are
interested in exploring two topics as future work: primitive-based 3D model acquisition
and semantic 3D model acquisition.
• Primitive-based 3D model acquisition. A compact representation of the geometric 3D model is essential for efficient rendering, manipulation and storage, which makes it important for various applications. Most 3D model acquisition systems generate a 3D mesh model describing the geometry with a set of polygonal facets. This representation has the capability to describe complicated geometry details accurately with a large number of facets. However, real-world objects, especially man-made structures, are usually built from a limited set of geometric primitive shapes, e.g., plane, cube, cylinder, sphere, cone, pyramid and torus. A primitive-based [42, 113] or hybrid [62, 63] method reconstructs the 3D model as a combination of geometric primitives, which largely reduces the model complexity.
• Semantic 3D model acquisition. The output 3D mesh model contains only the geometry information. The use case is limited to applications where no other information is needed, e.g., visualization and 3D printing. Semantic information is essential for many important applications, such as virtual reality and augmented reality. For example, the semantic information of buildings and terrains is important to provide an immersive experience in virtual/augmented reality games. The semantic information also saves the effort of human labeling in geography products. On the other hand, the semantic information can be used to improve the reconstruction process [46, 21] and handle challenging cases, such as objects with transparent and reflective surfaces.
Furthermore, as an analogy to the pixel-feature-object [64] representation of 2D imagery data for various applications such as object recognition, machine learning techniques may also be integrated to build a hierarchical geometry representation describing the 3D model using a facet-primitive-object structure.
Reference List
[1] Acute3D. http://www.acute3d.com.
[2] Autodesk 123D Catch. http://www.123dapp.com/catch.
[3] boost. http://www.boost.org/.
[4] Bundler. http://www.cs.cornell.edu/~snavely/bundler/.
[5] CLIF 2006. https://www.sdms.afrl.af.mil/index.php?collection=clif2006.
[6] CLIF 2007. https://www.sdms.afrl.af.mil/index.php?collection=clif2007.
[7] Computational Geometry Algorithms Library. http://www.cgal.org/.
[8] CVLab Multi-View Stereo Dataset. http://cvlabwww.epfl.ch/data/multiview/denseMVS.html.
[9] FlowLib. http://www.gpu4vision.org.
[10] Max-flow Library. http://vision.csd.uwo.ca/code/.
[11] MeshRecon. http://www-scf.usc.edu/~zkang/software.html.
[12] Multicore Bundle Adjustment. http://grail.cs.washington.edu/projects/mcba/.
[13] PTAM. http://www.robots.ox.ac.uk/~gk/PTAM/.
[14] SiftGPU. http://cs.unc.edu/~ccwu/siftgpu.
[15] VisualSFM. http://ccwu.me/vsfm/.
[16] WPAFB 2009. https://www.sdms.afrl.af.mil/index.php?collection=wpafb2009.
[17] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105-112, 2011.
[18] Mica Arie-Nachimson, Shahar Z Kovalsky, Ira Kemelmacher-Shlizerman, Amit Singer, and Ronen Basri. Global motion estimation from point matches. In 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012 Second International Conference on, pages 81-88. IEEE, 2012.
[19] Nicolas Aspert, Diego Santa Cruz, and Touradj Ebrahimi. Mesh: measuring errors between surfaces using the Hausdorff distance. In ICME (1), 2002.
[20] Jonathan Balzer and Stefano Soatto. Clam: Coupled localization and mapping with efficient outlier handling. In CVPR, 2013.
[21] Sid Yingze Bao, Manmohan Chandraker, Yuanqing Lin, and Silvio Savarese. Dense object reconstruction with semantic priors. In CVPR, 2013.
[22] Mathias Bejanin, Andres Huertas, G Medioni, and Ramakant Nevatia. Model validation for change detection [machine vision]. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994.
[23] Jean-Daniel Boissonnat, Olivier Devillers, Monique Teillaud, and Mariette Yvinec. Triangulations in CGAL. In Proceedings of the sixteenth annual symposium on Computational geometry, pages 11-18. ACM, 2000.
[24] Mario Botsch, Leif Kobbelt, Mark Pauly, Pierre Alliez, and Bruno Lévy. Polygon mesh processing. CRC press, 2010.
[25] F. Calakli, Ali O. Ulusoy, Maria I Restrepo, Joseph L Mundy, and Gabriel Taubin. High Resolution Surface Reconstruction from Multi-view Aerial Imagery. In 3DIMPVT, 2012.
[26] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.
[27] Yang Chen and Gérard Medioni. Object modelling by registration of multiple range images. Image and vision computing, 10(3):145-155, 1992.
[28] Chris Clifton. Change detection in overhead imagery using neural networks. Applied Intelligence, 18(2):215-234, 2003.
[29] David Crandall, Andrew Owens, Noah Snavely, and Dan Huttenlocher. Discrete-continuous optimization for large-scale structure from motion. In CVPR, 2011.
[30] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 303-312. ACM, 1996.
[31] Amael Delaunoy, Emmanuel Prados, Pau Gargallo I Piracés, Jean-Philippe Pons, Peter Sturm, et al. Minimizing the multi-view stereo reprojection error for triangular surface meshes. In British machine vision conference, 2008.
[32] GR Dini, K Jacobsen, F Rottensteiner, M Al Rajhi, and C Heipke. 3D building change detection using high resolution stereo images and a GIS database. ISPRS, 2012.
[33] Ibrahim Eden and David B Cooper. Using 3D line segments for robust and efficient change detection from multiple noisy images. In ECCV, 2008.
[34] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.
[35] Olivier Faugeras and Renaud Keriven. Variational principles, surface evolution, PDE's, level set methods and the stereo problem. IEEE, 2002.
[36] Andrew W Fitzgibbon and Andrew Zisserman. Automatic camera recovery for closed or open image sequences. In Computer Vision - ECCV'98, pages 311-326. Springer, 1998.
[37] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Manhattan-world stereo. In CVPR, 2009.
[38] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Towards Internet-scale Multi-view Stereo. In CVPR, 2010.
[39] Yasutaka Furukawa and Jean Ponce. Carved visual hulls for image-based modeling. In ECCV.
[40] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. PAMI, 32(8):1362-1376, 2010.
[41] David Gallup, J-M Frahm, Philippos Mordohai, Qingxiong Yang, and Marc Pollefeys. Real-time plane-sweeping stereo with multiple sweeping directions. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1-8. IEEE, 2007.
[42] David Gallup, J-M Frahm, and Marc Pollefeys. Piecewise planar and non-planar stereo for urban scene reconstruction. In CVPR, 2010.
[43] Xiao-Shan Gao, Xiao-Rong Hou, Jianliang Tang, and Hang-Fei Cheng. Complete solution classification for the perspective-three-point problem. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(8):930-943, 2003.
[44] Michael Goesele, Brian Curless, and Steven M Seitz. Multi-view stereo revisited. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2402-2409. IEEE, 2006.
[45] Venu Madhav Govindu. Lie-algebraic averaging for globally consistent motion estimation. In CVPR.
[46] Christian Hane, Christopher Zach, Andrea Cohen, Roland Angst, and Marc Pollefeys. Joint 3D scene reconstruction and class segmentation. In CVPR, 2013.
[47] Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hongdong Li. Rotation averaging. International journal of computer vision, 103(3):267-305, 2013.
[48] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision, volume 2. Cambridge Univ Press, 2000.
[49] Carlos Hernández Esteban and Francis Schmitt. Silhouette and stereo fusion for 3D object modeling. Computer Vision and Image Understanding, 96(3):367-392, 2004.
[50] Vu Hoang Hiep, Renaud Keriven, Patrick Labatut, and J-P Pons. Towards high-resolution large-scale multi-view stereo. In CVPR, 2009.
[51] Andres Huertas and Ramakant Nevatia. Detecting changes in aerial views of man-made structures. In ICCV, 1998.
[52] Nianjuan Jiang, Zhaopeng Cui, and Ping Tan. A global linear method for camera pose registration. In ICCV, 2013.
[53] Fredrik Kahl. Multiple view geometry and the L∞-norm. In ICCV, 2005.
[54] Zhuoliang Kang and Gérard Medioni. Fast dense 3D reconstruction using an adaptive multi-scale discrete-continuous variational method. In WACV, 2014.
[55] Zhuoliang Kang and Gérard Medioni. 3D urban reconstruction from wide area aerial surveillance video. In Workshop on Applications for Aerial Video Exploitation (WAVE), 2015.
[56] Zhuoliang Kang and Gérard Medioni. Progressive 3D model acquisition with a commodity hand-held camera. In WACV, 2015.
[57] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, 2006.
[58] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In ISMAR, 2007.
[59] Georg Klein and David Murray. Parallel tracking and mapping on a camera phone. In ISMAR, 2009.
[60] R. Kumar, P. Anandan, and K. Hanna. Direct recovery of shape from multiple views: A parallax based approach. In ICPR, 1994.
[61] Patrick Labatut, J-P Pons, and Renaud Keriven. Efficient multi-view reconstruction of large-scale scenes using interest points, Delaunay triangulation and graph cuts. In ICCV, 2007.
[62] Florent Lafarge, Renaud Keriven, Mathieu Brédif, and Vu Hoang Hiep. Hybrid multi-view reconstruction by jump-diffusion. In CVPR, 2010.
[63] Florent Lafarge and Clément Mallet. Creating large-scale city models from 3D-point clouds: a robust approach with hybrid representation. International journal of computer vision, 99(1):69-85, 2012.
[64] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[65] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International journal of computer vision, 81(2):155-166, 2009.
[66] Maxime Lhuillier and Long Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(3):418-433, 2005.
[67] Hongdong Li and Richard Hartley. Five-point motion estimation made easy. In ICPR.
[68] H.H. Liao, Y. Lin, and Gérard Medioni. Aerial 3D reconstruction with line-constrained dynamic programming. In ICCV, 2011.
[69] Robert L Lillestrand. Techniques for change detection. IEEE Transactions on Computers, 100(7):654-659, 1972.
[70] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3D surface construction algorithm. In ACM Siggraph Computer Graphics, volume 21, pages 163-169. ACM, 1987.
[71] Manolis IA Lourakis and Antonis A Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software (TOMS), 36(1):2, 2009.
[72] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[73] C-P Lu, Gregory D Hager, and Eric Mjolsness. Fast and globally convergent pose estimation from video images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(6):610-622, 2000.
[74] Yi Ma. An invitation to 3-D vision: from images to geometric models, volume 26. Springer Science & Business Media, 2004.
[75] Daniel Martinec and Tomas Pajdla. Robust rotation and translation estimation in multiview reconstruction. In CVPR, 2007.
[76] Paul Merrell, Amir Akbarzadeh, Liang Wang, Philippos Mordohai, J-M Frahm, Ruigang Yang, David Nistér, and Marc Pollefeys. Real-time visibility-based fusion of depth maps. In ICCV, 2007.
[77] R.A. Newcombe, S.J. Lovegrove, and A.J. Davison. DTAM: Dense tracking and mapping in real-time. In ICCV, 2011.
[78] Richard A Newcombe and Andrew J Davison. Live dense reconstruction with a single moving camera. In CVPR, 2010.
[79] Richard A Newcombe, Andrew J Davison, Shahram Izadi, Pushmeet Kohli, Otmar Hilliges, Jamie Shotton, David Molyneaux, Steve Hodges, David Kim, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, 2011.
[80] D. Nistér. An efficient solution to the five-point relative pose problem. PAMI, 26(6):756-770, 2004.
[81] Yutaka Ohtake, Alexander Belyaev, and H-P Seidel. A multi-scale approach to 3D scattered data interpolation with compactly supported basis functions. In Shape Modeling International, 2003, pages 153-161. IEEE, 2003.
[82] Carl Olsson, Anders P Eriksson, and Fredrik Kahl. Efficient optimization for L∞-problems using pseudoconvexity. In ICCV, 2007.
[83] Qi Pan, Gerhard Reitmayr, and Tom Drummond. ProFORMA: Probabilistic feature-based on-line rapid model acquisition. In BMVC, 2009.
[84] Sylvain Paris, François X Sillion, and Long Quan. A surface reconstruction method using global graph cut optimization. International Journal of Computer Vision, 66(2):141-161, 2006.
[85] Thomas Pollard and Joseph L Mundy. Change detection in a 3-D world. In CVPR, 2007.
[86] M. Pollefeys, D. Nistér, J.M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.J. Kim, P. Merrell, et al. Detailed real-time urban 3D reconstruction from video. IJCV, 78(2):143-167, 2008.
[87] Marc Pollefeys, Luc Van Gool, Maarten Vergauwen, Frank Verbiest, Kurt Cornelis, Jan Tops, and Reinhard Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision, 59(3):207-232, 2004.
[88] Jean-Philippe Pons, Renaud Keriven, and Olivier Faugeras. Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision, 72(2):179-193, 2007.
[89] Jan Prokaj and Gérard Medioni. Using 3D scene structure to improve tracking. In CVPR, 2011.
[90] Jan Prokaj and Gérard Medioni. Accurate efficient mosaicking for wide area aerial surveillance. In Applications of Computer Vision (WACV), 2012 IEEE Workshop on, pages 273-280. IEEE, 2012.
[91] Richard J Radke, Srinivas Andra, Omar Al-Kofahi, and Badrinath Roysam. Image change detection algorithms: a systematic survey. IEEE Transactions on Image Processing, 2005.
[92] Maria I Restrepo, Brandon A Mayer, Ali O Ulusoy, and Joseph L Mundy. Characterization of 3-D volumetric probabilistic scenes for object recognition. IEEE Journal of Selected Topics in Signal Processing, 6(5):522-537, 2012.
[93] Szymon Rusinkiewicz, Olaf Hall-Holt, and Marc Levoy. Real-time 3D model acquisition. In ACM Transactions on Graphics (TOG), volume 21, pages 438-446. ACM, 2002.
[94] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the ICP algorithm. In 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, pages 145-152. IEEE, 2001.
[95] Sudipta N Sinha, Philippos Mordohai, and Marc Pollefeys. Multi-view stereo via graph cuts on the dual of an adaptive tetrahedral mesh. In ICCV, 2007.
[96] Sudipta N Sinha, Drew Steedly, and Richard Szeliski. A multi-stage linear approach to structure from motion. In Trends and Topics in Computer Vision, pages 267-281. Springer, 2012.
[97] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3D. ACM transactions on graphics (TOG), 25(3):835-846, 2006.
[98] Noah Snavely, Steven M Seitz, and Richard Szeliski. Skeletal graphs for efficient structure from motion. In CVPR, 2008.
[99] Henrik Stewénius, Christopher Engels, and David Nistér. Recent developments on direct relative orientation. ISPRS Journal of Photogrammetry and Remote Sensing, 60(4):284-294, 2006.
[100] Aparna Taneja, Luca Ballan, and Marc Pollefeys. Image based detection of geometric changes in urban environments. In ICCV, 2011.
[101] Aparna Taneja, Luca Ballan, and Marc Pollefeys. City-scale change detection in cadastral 3D models using images. In CVPR, 2013.
[102] Petri Tanskanen, Kalin Kolev, Lorenz Meier, Federico Camposeco, Olivier Saurer, and Marc Pollefeys. Live metric 3D reconstruction on mobile phones. In ICCV, 2013.
[103] Jiaojiao Tian, Houda Chaabouni-Chouayakh, Peter Reinartz, Thomas Krauß, and Pablo d'Angelo. Automatic 3D change detection based on optical satellite stereo imagery. ISPRS TC VII Symposium, 2010.
[104] Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon. Bundle adjustment - a modern synthesis. Vision algorithms: theory and practice, pages 153-177, 2000.
[105] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. PAMI, 1991.
[106] Benjamin Ummenhofer and Thomas Brox. Dense 3D reconstruction with a hand-held camera. In DAGM, 2012.
[107] George Vogiatzis, Philip HS Torr, and Roberto Cipolla. Multi-view stereo via volumetric graph-cuts. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 391-398. IEEE, 2005.
[108] H-H Vu, Patrick Labatut, J-P Pons, and Renaud Keriven. High accuracy and visibility-consistent dense multiview stereo. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(5):889-901, 2012.
[109] Shintaro Watanabe, Koji Miyajima, and Naoki Mukawa. Detecting changes of buildings from aerial images using shadow and shading model. In ICPR, 1998.
[110] Andreas Wendel, Michael Maurer, Gottfried Graber, Thomas Pock, and Horst Bischof. Dense reconstruction on-the-fly. In CVPR, 2012.
[111] Changchang Wu. Towards linear-time incremental structure from motion. In 3DV, 2013.
[112] Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M Seitz. Multicore bundle adjustment. In CVPR, 2011.
[113] Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M Seitz. Schematic surface reconstruction. In CVPR, 2012.
[114] Ruigang Yang and Marc Pollefeys. Multi-resolution real-time stereo on commodity graphics hardware. In CVPR, 2003.
[115] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In DAGM, 2007.
[116] Christopher Zach. Fast and high quality fusion of depth maps. In 3DPVT, 2008.
[117] Christopher Zach, Thomas Pock, and Horst Bischof. A globally optimal algorithm for robust TV-L1 range image integration. In ICCV, 2007.
[118] Xuemei Zhao and Gerard Medioni. Robust unsupervised motion pattern inference from video and applications. In ICCV, 2011.