Semantic Modeling of Outdoor Scenes for The Creation of Virtual Environments and Simulations
By
Meida Chen
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
Submitted in Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(CIVIL ENGINEERING)
May 2020
Contents
Chapter 1: Introduction and Motivations .............................................................................................. 7
Chapter 2: Backgrounds and Related Literature ................................................................................ 12
2.1. Photogrammetry ........................................................................................................................... 12
2.2. Point cloud segmentation ............................................................................................................. 13
2.2.1. Point cloud segmentation with handcrafted feature descriptors............................................... 13
2.2.2. Point cloud segmentation with deep learning techniques ......................................................... 14
2.2.3. Research gaps on point cloud segmentation ............................................................................. 17
2.3. Vegetation segmentation and identification of individual tree locations ................................. 18
2.4. Building footprints extraction...................................................................................................... 19
2.5. Surface material classification ..................................................................................................... 21
Chapter 3: Photogrammetric Point Clouds Segmentation and Objects Information Extraction
Framework .............................................................................................................................................. 23
3.1. Research Objective ...................................................................................................................... 23
3.2. Proposed Framework ................................................................................................................... 24
3.2.1. Top-level objects segmentation workflow design ..................................................................... 25
3.2.2. Point descriptors ranking ........................................................................................................... 28
3.2.3. Selection of classifiers ............................................................................................................... 33
3.3. Object information extraction ..................................................................................................... 35
3.3.1. Identify individual tree locations............................................................................................... 35
3.3.2. Building footprints extraction ................................................................................................... 37
3.3.3. Terrain surface materials classification ..................................................................................... 40
3.4. Experimental Results .................................................................................................................... 41
3.4.1. Testbeds selection and data collection ...................................................................................... 41
3.4.2. Prototype development and mesh segmentation ....................................................................... 43
3.4.3. Quantitative and qualitative analysis of the designed framework ............................................. 44
3.4.4. Point descriptor ranking ............................................................................................................ 45
3.4.5. Selection of classifiers ............................................................................................................... 50
3.4.6. Identify individual tree locations............................................................................................... 53
3.4.7. Building footprints extraction and roof styles classification ..................................................... 55
3.4.8. Ground surface material classification ...................................................................................... 58
3.4.9. Prototype development and mesh segmentation ....................................................................... 62
3.5. Conclusions .................................................................................................................................... 63
Chapter 4: Fully Automated Top-level Terrain Elements Segmentation using a Model Ensembling
Framework .............................................................................................................................................. 66
4.1. Research Objective ....................................................................................................................... 66
4.2. UAV-based Photogrammetric Database ..................................................................................... 67
4.3. Proposed Model Ensembling Framework .................................................................................. 69
4.3.1. Data Preprocessing .................................................................................................................... 70
4.3.2. Segmentation network ............................................................................................................... 72
4.3.2.1. 3D U-net with volumetric representation ............................................................................ 72
4.3.2.2. Data augmentation ............................................................................................................... 73
4.3.2.3. Cross-validation for single U-net model.............................................................................. 74
4.3.3. Post-processing ......................................................................................................................... 77
4.3.3.1. Ground post-processing ....................................................................................................... 77
4.3.3.2. Conditional Random Field (CRF) post-processing ............................................................. 77
4.3.3.3. Building post-processing ..................................................................................................... 78
4.4. Validation ...................................................................................................................................... 79
4.4.1. Quantitative analysis for point cloud segmentation .................................................................. 79
4.4.2. Qualitative results for creating virtual environments ................................................................ 81
4.5. Discussion and Conclusions ......................................................................................................... 83
Chapter 5: Training Deep Learning-Based 3D Point Clouds Segmentation Model Using Synthetic
Photogrammetric Data........................................................................................................................... 86
5.1. Research Objective ....................................................................................................................... 86
5.2. The Framework for Generating Annotated Synthetic Photogrammetric Data ...................... 86
5.2.1. The 3D scene generation process .............................................................................................. 88
5.2.2. 2D image rendering and 3D point cloud reconstruction ........................................................... 90
5.3. Experiments and results ............................................................................................................... 92
5.3.1. Question 1. How much does it help to add details to the synthetic scene? ............................... 94
5.3.2. Question 2. Is it necessary to use photogrammetric reconstructed point clouds instead of the
depth map-generated point clouds for training purposes?................................................................... 95
5.3.3. Question 3. Is it necessary to create synthetic scenes with realistic contextual relationships
between objects? ................................................................................................................................. 96
5.3.4. Question 4. Can synthetic data be used for training deep learning models and replace the need
for creating real-world training data? .................................................................................................. 97
5.4. Conclusions .................................................................................................................................. 100
Chapter 6: Intellectual Merit and Broader Impacts ......................................................................... 101
List of Tables
Table 1. Color-based descriptors. ................................................................................................ 45
Table 2. Density-based descriptors............................................................................................... 45
Table 3. Local surface-based descriptors. .................................................................................... 45
Table 4. Open source data-based descriptors. ............................................................................. 45
Table 5. Texture-based descriptors............................................................................................... 46
Table 6. Excluding texture descriptors. ........................................................................................ 46
Table 7. Excluding open source descriptors. ................................................................................ 46
Table 8. Excluding local surface descriptors. .............................................................................. 47
Table 9. Excluding density descriptors. ........................................................................................ 47
Table 10. Excluding color descriptors. ......................................................................................... 47
Table 11. Using all descriptors. .................................................................................................... 47
Table 12. Confusion Matrixes of using RF classifier. .................................................................. 50
Table 13. Confusion Matrixes of using SVM classifier................................................................. 50
Table 14. Confusion Matrixes of using RF classifier. .................................................................. 53
Table 15. Confusion Matrixes of using SVM classifier................................................................. 53
Table 16. Confusion matrixes of tree locations identification. ..................................................... 55
Table 17. Confusion matrixes of roof style classification. ............................................................ 58
Table 18. Confusion matrixes of Ground Material Classification without Fine tuning. .............. 61
Table 19. Confusion matrixes of Ground Material Classification with Fine tuning. ................... 62
Table 20. UAV-based Photogrammetric Database. ..................................................................... 68
Table 21. Point-cloud Segmentation Comparisons ...................................................................... 80
Table 22. Segmentation results with the first synthetic training data set. .................................... 98
Table 23. Segmentation results with the second synthetic training data set. ............................... 98
Table 24. Segmentation results with the third synthetic training data set. ................................... 99
Table 25. Segmentation results with the fourth synthetic training data set. ................................. 99
Table 26. Segmentation results with the real-world training data set. ......................................... 99
List of Figures
Figure 1. Semantic terrain points labeling framework. .............................................................................. 25
Figure 2. Workflow of top-level point cloud segmentation. ........................................................................ 25
Figure 3. Incorrect shortest path in ATLAS simulation. ............................................................................. 35
Figure 4. Workflow of individual tree locations identification. .................................................................. 37
Figure 5. Workflow of building footprints extraction. ................................................................................ 38
Figure 6. Workflow of ground material classification. ............................................................................... 40
Figure 7. Photogrammetry generated point clouds: (a) USC; and (b) MUTC. .......................................... 43
Figure 8. Mismatched footprint covered area. ........................................................................................... 44
Figure 9. Excluding texture descriptors. .................................................................................................... 46
Figure 10. Excluding open source descriptors. .......................................................................................... 46
Figure 11. Excluding local surface descriptors. ......................................................................................... 47
Figure 12. Excluding density descriptors. .................................................................................................. 47
Figure 13. Excluding color descriptors. ..................................................................................................... 47
Figure 14. Using all descriptors. ................................................................................................................ 47
Figure 15. The top 20 descriptor rankings using Random Forest. ............................................................. 48
Figure 16. The top 40 descriptor rankings using Random Forest. ............................................................. 49
Figure 17. The top 80 descriptor rankings using Random Forest. ............................................................. 49
Figure 18. Classification results of USC data set: (a) ground truth; (b) classified with SVM; and (c)
classified with Random Forest. ................................................................................................................... 51
Figure 19. Classification results of MUTC data set: (a) ground truth; (b) classified with SVM; and (c)
classified with Random Forest. ................................................................................................................... 52
Figure 20. Tree location identification: (a) clustered points; (b) individual tree locations; (c) MUTC data
set in simulation environment; and (d) MUTC data set in simulation environment with tree replaced. .... 54
Figure 21. Result of tree locations identification. ...................................................................................... 55
Figure 22. Building footprints extraction: (a) classified buildings; (b) extracted roofs; (c) extruded
building footprints; and (d) textured model. ............................................................................................... 56
Figure 23. Roof style classification for USC. ............................................................................................. 57
Figure 24. Roof style classification for USC. ............................................................................................. 58
Figure 25. Mesh rendered orthophoto: (a) USC; (b) Fort Drum Army Base; (c) The Camp Pendleton
Infantry Immersion Trainer; and (d) 29 Palms Range 400. ....................................................................... 59
Figure 26. The process of creating ground material database. .................................................................. 60
Figure 27. Ground Material Classification Result: (a) MUTC Dataset; (b) ground material vector map
without fine tuning; and (c) ground material vector map with fine tuning. ................................................ 61
Figure 28. Path finding in ATLAS: (a) without ground material classification; and (b) with ground
material classification result. ...................................................................................................................... 62
Figure 29. Designed user interface. ............................................................................................................ 63
Figure 30. Mesh segmentation for USC data set: (a) segmented buildings; and (b) segmented ground. .. 63
Figure 31. Model ensembling framework. .................................................................................................. 69
Figure 32. Photogrammetric generated point cloud and DSM: (a) Point cloud with noises; (b) DSM; and
(c) Cleaned point cloud. .............................................................................................................................. 71
Figure 33. Selecting data within AOI: (a) Point clouds outside of the AOI; (b) Point clouds and camera
positions; and (c) Point cloud inside the AOI. ............................................................................................ 72
Figure 34. 3D U-Net architecture............................................................................................................... 73
Figure 35. F1 scores of the cross-validation results................................................................................... 75
Figure 36. Mis-segmentation cases: a) Ground mis-segmentation case; b) Building mis-segmentation
case; and c) Segmentation noises. .............................................................................................................. 76
Figure 37. Point-cloud segmentation result and the created virtual environment for data set #20: (a)
Point cloud segmentation result; and (b) The created virtual environment. .............................................. 81
Figure 38. Point-cloud segmentation result and the created virtual environment for data set #7: (a) Point-
cloud segmentation result; and (b) The created virtual environment. ........................................................ 81
Figure 39. Segmentation result after the ground post-processing. ............................................................. 82
Figure 40. Segmentation result after the building refinement process. ...................................................... 83
Figure 41. The designed synthetic data generation framework. ................................................................. 87
Figure 42. DSM: (a) the original DSM from the NED, and (b) modified DSM. ........................................ 88
Figure 43. Procedurally generated 3D building models with the same building footprint. ....................... 89
Figure 44. Generated 3D scenes: (a) forests (b) city clutter, and (c) trees and vehicles. .......................... 90
Figure 45. Outputs from the simulator (i.e., AirSim): (a) rendered image; (b) annotation; and (c) depth
map. ............................................................................................................................................................. 91
Figure 46. Photogrammetric point cloud annotation using ray casting: (a) raw photogrammetric point
cloud and (b) extracted ground points. ....................................................................................................... 91
Figure 47. Photogrammetric point cloud annotation using a k-nearest neighbor algorithm: (a) depth map
generated point cloud with annotation and (b) extracted ground points from an annotated
photogrammetric point cloud. ..................................................................................................................... 92
Figure 48. Generated synthetic training data sets: (a) The first synthetic training data set; (b) forests and
vehicles in the second synthetic training data set; (c) the third synthetic training data set; and (d) the
fourth synthetic training data set. ............................................................................................................... 93
Figure 49. Tree point cloud: (a) depth map-generated tree point cloud, and (b) photogrammetric-
reconstructed tree point cloud. ................................................................................................................... 96
Chapter 1: Introduction and Motivations
With recent advances in sensing technologies and computer vision algorithms, photogrammetric techniques have been studied extensively over the past few years and now allow the creation of geo-specific 3D point clouds that are highly detailed and accurate to the centimeter level. Traditional survey tools such as sonic measuring devices and total stations collect very sparse data points on a building facade or a survey area, whereas a 3D point cloud generated using photogrammetric techniques consists of millions of points representing the spatial information and surface textures of the target objects. Such highly accurate data has garnered increasing attention from both academia and industry. Many existing studies and applications have utilized photogrammetric techniques to create as-is 3D models of outdoor scenes for different purposes such as urban planning, building energy simulation, virtual environment creation, historical building information storage, construction quality and schedule control, facility management, and so forth [1]-[5].
Several researchers have also pointed out that creating 3D models that can accurately represent the physical
condition of a large area of interest is a very important component in a cybercity implementation [6]-[8].
With the rapid advancement of unmanned aerial vehicle (UAV) technology, the data collection process for creating 3D point clouds of an outdoor scene using photogrammetric techniques can be conducted with few resources (people and equipment) in a short time. The USC Institute for Creative Technologies (ICT) research team previously developed a UAV path-planning tool with which imagery data covering a 1 km² area can be collected within two hours, and the 3D point clouds can be reconstructed within a few hours [9]. Note that, in order to create virtual environments for simulations and to enable 3D interaction and collision detection in modern game engines, triangular meshes are needed instead of point clouds. Meshes can be generated either within the photogrammetric process using the raw image data or with cloud-to-mesh triangulation algorithms, such as the one proposed by Lin et al. [10].
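To make the cloud-to-mesh step concrete, the following Python sketch shows one possible route using the open-source Open3D library's ball-pivoting surface reconstruction; the file names and radius choices are illustrative assumptions and do not reproduce the specific pipeline of [9] or [10].

```python
# Hedged sketch: converting a photogrammetric point cloud to a triangular mesh
# with Open3D's ball-pivoting reconstruction. File names are placeholders.
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("site_point_cloud.ply")   # photogrammetric points
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=1.0, max_nn=30))

# Ball-pivoting needs radii on the order of the average point spacing.
distances = pcd.compute_nearest_neighbor_distance()
avg_dist = np.mean(distances)
radii = o3d.utility.DoubleVector([1.5 * avg_dist, 3.0 * avg_dist])

mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_ball_pivoting(pcd, radii)
o3d.io.write_triangle_mesh("site_mesh.ply", mesh)
```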
Such a rapid 3D modeling process for reconstructing high fidelity and geo-specific simulation environments
has caught the attention of the U.S. Army and motivated the One World Terrain (OWT) Project. One of the
objectives of the OWT project is to provide small units with the organic capability to create geo-specific
virtual environments to support military operations. The ability of the intelligence community, mission
commanders, and front-line soldiers to understand their deployed physical environment in advance is
critical in the planning and rehearsal phases of any military operation. Unfortunately, even at the highest
levels of simulation fidelity, synthetic training environments (STE) currently have limitations that can
adversely affect training protocols and value. One of the main deficiencies is the lack of scene-semantic
segmentation to allow both user- and system-level interaction. Photogrammetric-generated 3D point
clouds/meshes enable simple analytics such as distance or traversal time, as well as additional exported data
manipulation. Although these features are immensely helpful and purposeful, they cannot enable a real-
time assessment of potential changes in the terrain or present structures, nor can they account for the
variability in material and object composition. The generated meshes simply contain the polygons and
textures—i.e., they do not contain semantic information for distinguishing between objects such as the
ground, buildings, and trees. Segmenting, classifying, and recognizing distinct types of objects, together with identifying and extracting their associated features (e.g., individual tree locations, building footprints, and ground materials) in the generated meshes, are essential tasks in creating realistic virtual simulations.
Rendering different objects in a virtual environment and assigning actual physical properties to each will
not only enhance the visual quality but also allow various user interactions with a terrain model.
For instance, it is undeniably helpful to be able to plan the explosive size needed for a destruction operation; however, the ability to simultaneously calculate the residual effects and considerations, such as the volume and spread of debris, creates a much fuller picture of the operational outcome. A blast effect on a building will be different from a blast on the ground. Concrete walls would fracture and, depending on the size of the blast, may not deform or depress the 3D meshes beyond the walls, whereas the depression caused by deformation of bare earth would be more significant. Additionally, consider the case of training soldiers in a virtual environment in which 3D meshes represent the scene. The task is to identify the shortest path from location A to location B along which the individual is visible from a given vantage point. With artificial intelligence (AI) search algorithms, such as A*, the shortest path could be computed, and penalties could
be assigned to a route based on the number of obstructions blocking the enemies’ line-of-sight. However,
in reality, line-of-sight that is blocked by concrete walls, glass windows, and trees should be assigned
different penalties when considering a route, since some materials, such as glass used in windows, cannot
protect soldiers from sniper gunshots. Identifying surface materials is another problem for which information should be provided to compute the trafficability of an entity in the virtual environment. Terrain surfaces must be classified properly based on their material composition, since different surface materials such as bare soil, grass, rock, and mud could affect off-road vehicle performance (e.g., the speed of a vehicle driving on grass differs from that on bare soil) [11]. Though these examples are oversimplifications, they emphasize the point that, without semantic segmentation of the mesh data, realistic virtual simulations cannot be achieved.
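To illustrate the path-planning argument above, the following minimal Python sketch runs an A* search over a 2D grid in which each cell's traversal cost depends on a hypothetical surface-material label; the material classes and cost values are invented for demonstration and are not part of ATLAS or this study.

```python
# Hedged illustration: A* on a grid where per-cell cost depends on a made-up
# material label. An impassable cell could be given a cost of float("inf").
import heapq

MATERIAL_COST = {"paved": 1.0, "grass": 1.5, "bare_soil": 2.0, "mud": 4.0}

def a_star(grid, start, goal):
    """grid[r][c] is a material label; returns total cost of the cheapest path."""
    rows, cols = len(grid), len(grid[0])
    heuristic = lambda r, c: abs(r - goal[0]) + abs(c - goal[1])
    open_set = [(heuristic(*start), 0.0, start)]
    best = {start: 0.0}
    while open_set:
        _, cost, (r, c) = heapq.heappop(open_set)
        if (r, c) == goal:
            return cost
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + MATERIAL_COST[grid[nr][nc]]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(open_set,
                                   (new_cost + heuristic(nr, nc), new_cost, (nr, nc)))
    return float("inf")

grid = [["paved", "grass", "mud"],
        ["paved", "mud",   "grass"],
        ["paved", "paved", "paved"]]
print(a_star(grid, (0, 0), (2, 2)))  # cheapest route follows the paved cells
```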
Furthermore, beyond the examples above for supporting military operations, the data segmentation and object information extraction capabilities give the photogrammetric technique many other potential uses in smart cities and digital twins, such as building energy conservation and the development of crowd management strategies. Conducting proper and accurate pedestrian and crowd movement simulations is important for establishing crowd control strategies, especially during extreme events and natural disasters. The photogrammetrically reconstructed terrain model can provide the basic 3D information for such simulations, such as ground elevation and slope. However, an accurate simulation cannot be achieved without knowing the ground material composition (e.g., paved roads, bare soil, grass, and mud). It is also well acknowledged that improving building energy efficiency is an essential step toward worldwide sustainability. Building energy simulation tools provide a better understanding of building energy performance when designing new buildings and renovating existing ones. Extracting accurate building footprints and related features (i.e., building height, window-to-wall ratio, and building type) will provide the basic building geometry for conducting building energy simulations for urban areas [12]. Building models can also be created based on the extracted footprints, and their attributes could be changed on the fly within the virtual digital city (e.g., changing building height limits or roof styles for a district) to support the visualization or validation of city planning decisions. Similarly, extracting information about vegetated areas and tree features (e.g., location, height, and crown width) will serve as a foundation for urban planning that considers ecosystems and greening interventions [13], [14].
Semantic segmentation of 3D data has been a major challenge in the field of computer vision and remote
sensing. In addition to the abovementioned needs of scene-semantic segmentation from training and
simulation communities, it is also being used in a variety of applications in numerous fields such as
autonomous vehicles, forest structure assessment, urban planning, and scan-to-BIM process, among others
[15]-[18]. As such, considerable research efforts have been made to design and develop segmentation
algorithms to label the derived 3D data [19]-[28]. However, most existing research has focused on outdoor
data collected using Light Detection and Ranging (LIDAR) sensors and indoor data captured with RGB-D
sensors. Only a few studies have focused on segmenting UAV-based photogrammetric data that covers a
large area of interest. In addition, there exist several LIDAR and RGB-D captured benchmark data sets for
the comparison between various point-cloud segmentation algorithms/approaches [29], but a thorough
search of the relevant literature has not yielded any large UAV-based photogrammetric benchmark data
sets.
The first part of this study focused on designing a point cloud/mesh segmentation and information-
extraction framework to support next-generation modeling, simulation & training. Both supervised and
unsupervised machine-learning algorithms were utilized. A framework was designed with photogrammetric
data characteristics taken into consideration. A 3D point cloud is first segmented into top-level terrain
elements (i.e., ground, buildings, and trees). Individual tree locations and building footprints are then
extracted from the segmented point clouds. Finally, 3D meshes are segmented based on the point-cloud
segmentation result. Experiments were conducted to compare various point descriptors and classification
algorithms to identify strengths and limitations. The designed framework was also validated using the
selected data sets.
The second part of the study focused on improving the performance of the top-level terrain elements
segmentation process in the designed framework; a model ensembling sub-framework was designed for this purpose. Unlike previous work on point cloud segmentation, in which one generically trained model was used to segment all points into the desired categories, the designed sub-framework ensembles segmentation models sequentially, and the points are segmented in a hierarchical manner. With this hierarchical design, several data pre-processing and post-processing approaches were integrated to compensate for the limitations of using a single segmentation model, not only improving segmentation performance but also producing a more visually pleasing 3D terrain model for simulations. A large, UAV-based photogrammetric database with 22
data sets was created for validation purposes. The data were collected from different geographic locations
in the U.S. with different coverage areas, architectural styles, vegetation types, and terrain shapes.
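A minimal sketch of the hierarchical idea, assuming scikit-learn-style binary classifiers for each stage (this is not the actual sub-framework, which is detailed in Chapter 4), is shown below.

```python
# Hedged sketch: models are applied sequentially, each separating a single
# top-level class from the remaining points.
import numpy as np

def hierarchical_segment(points, ground_model, building_model):
    """points: (N, D) feature array; models expose a scikit-learn-style predict()."""
    labels = np.full(len(points), "tree", dtype=object)        # remainder class
    is_ground = ground_model.predict(points) == 1               # stage 1: ground vs. rest
    labels[is_ground] = "ground"
    rest = ~is_ground
    is_building = building_model.predict(points[rest]) == 1     # stage 2: building vs. rest
    rest_idx = np.where(rest)[0]
    labels[rest_idx[is_building]] = "building"
    return labels
```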
It is well known that deep learning algorithms are data-hungry, especially in the 3D domain. Acquiring and
annotating real-world 3D data is a labor-intensive and time-consuming process. Although the UAV-based
photogrammetric database created in this study is considerably larger than any other existing photogrammetric point cloud database, whether we have truly realized the full potential of deep learning techniques remains unknown. Furthermore, for some tasks, data for specific objects cannot be easily obtained. For instance, there are limited numbers of military-related objects (e.g., tanks and fighter aircraft) even in the OWT data repository, which contains data for several military bases. To this end, the last part of this study focused on investigating the possibility of using synthetic photogrammetric data as a substitute for real-world data when training deep learning algorithms. A workflow was designed to exploit synthetic photogrammetric data to train a point cloud segmentation model without the effort of real-world data collection and manual annotation. To create the annotated 3D terrain, synthetic images are rendered based on simulated drone paths over the virtual environment. The rendered images are then used to produce synthetic point clouds with fidelity and quality similar to real-world UAV-based photogrammetric data.
Ground-truth annotation is automatically obtained via a ray-casting and nearest neighbor search process.
Experiments were conducted, and the results showed that a model trained with the generated synthetic data
was able to produce accurate segmentation on real-world UAV captured point clouds.
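As an illustrative sketch of the nearest-neighbor portion of this annotation step, the following Python snippet transfers labels from an annotated (depth-map-generated) point cloud to the photogrammetric reconstruction of the same synthetic scene using a k-d tree; the array names and stand-in data are assumptions for demonstration only.

```python
# Hedged sketch of nearest-neighbor label transfer between two point clouds of
# the same synthetic scene. Data below are random stand-ins.
import numpy as np
from scipy.spatial import cKDTree

def transfer_labels(annotated_xyz, annotated_labels, photogrammetric_xyz):
    """Assign each photogrammetric point the label of its nearest annotated point."""
    tree = cKDTree(annotated_xyz)
    _, nn_idx = tree.query(photogrammetric_xyz, k=1)
    return annotated_labels[nn_idx]

src = np.random.rand(1000, 3)                                  # annotated cloud
src_labels = np.random.randint(0, 3, size=1000)                # 0=ground, 1=building, 2=tree
dst = src + np.random.normal(scale=0.01, size=src.shape)       # noisy reconstruction
labels = transfer_labels(src, src_labels, dst)
```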
Chapter 2: Backgrounds and Related Literature
In this chapter, the basic concepts of photogrammetry are briefly introduced. Following that, previous
studies that focused on semantic labeling of point clouds and object information extraction are reviewed,
and research gaps are highlighted.
2.1. Photogrammetry
The photogrammetric technique is a reverse-engineering process that generates dense 3D point clouds from 2D images. Since 2D images do not contain the depth information of a scene, the process recovers depth from pairs of images, and sufficient overlap between images is required to do so. In each image, distinctive points, also known as key points, are first detected. Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) are the two most widely used key-point features; they are invariant to rotation, scale, and distortion.
Camera orientations are then approximated based on the matched key points in different images. Finally,
the triangulation process is used to reconstruct a dense point cloud [30].
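The key-point detection and matching step can be illustrated with the following Python sketch based on OpenCV's SIFT implementation; the image paths are placeholders, and a full photogrammetric pipeline would continue with camera pose estimation and triangulation.

```python
# Hedged sketch of key-point detection and ratio-test matching with OpenCV SIFT.
import cv2

img1 = cv2.imread("aerial_view_1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("aerial_view_2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # key points + 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Lowe's ratio test keeps only distinctive correspondences.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative correspondences between the two overlapping images")
```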
Although the reconstructed point clouds contain both spatial and color information of a scene, they do not
contain semantic information for distinguishing between objects. Furthermore, in order to use the data in
modern game engines for simulation, 3D meshes are needed instead of point clouds. Unlike a point cloud, a mesh is a data format that not only contains per-point information in 3D space but also contains surface information, which allows user interaction with the environment (e.g., collision detection) in a virtual world.
By connecting points in a point cloud to form triangular surfaces, a 3D triangular mesh can be generated
[9]. The mesh data can then be used to create a virtual environment and simulation. Han and Golparvar-
Fard used unordered site photos to reconstruct 3D models for construction progress monitoring [31].
Several other studies have also used the photogrammetric technique to capture the as-is condition of outdoor
scenes [32]-[36].
2.2. Point cloud segmentation
2.2.1. Point cloud segmentation with handcrafted feature descriptors
There is a long history of investigation into 3D point cloud segmentation, classification, and object
recognition. These tasks are the foundation of many cutting-edge technologies used in autonomous vehicles,
forest structure assessment, and scan-to-BIM processes, among others [16], [18], [37], [38]. Nevertheless,
segmenting a large 3D point cloud with millions of points into different categories is still a challenging
task. Existing work, which has made valuable contributions, has mostly focused on LIDAR and RGB-D-
sensed data. Many studies have investigated designing and developing handcrafted feature descriptors and
using machine learning algorithms to perform point-wise and segment-wise classification [24], [27], [39]-
[47]. Since each 3D point in a point cloud is only represented by x, y, and z values—which do not offer
enough information for classification—the designed handcrafted feature descriptors are usually computed
with the support of local neighboring points. Frome et al. proposed computing 3D shape context and harmonic shape context regional descriptors for each point in a point cloud and comparing them against a database using nearest neighbor search methods for object recognition [39]. Other effective local shape descriptors, such as eigenentropy, anisotropy, planarity, sphericity, linearity, curvature, and verticality, have also been proposed in the past decade [22], [24], [41], [44], [45], [48]. These feature descriptors can be derived via singular value decomposition of the covariance matrix formed from neighboring points. Chehata et al.
combined local shape descriptors with echo-based and waveform-based airborne LIDAR features for a
LIDAR point-cloud segmentation task [41]. Several other researchers proposed to perform bare earth
extraction (i.e., segmenting ground points from everything else) as a pre-processing step for segmenting other
objects such as human-made structures, buildings, and cars [49], [50]. In addition to the 3D shape
descriptors, Weinmann et al. proposed to extract 2D feature descriptors by projecting all 3D points onto a
2D plane [45]. Son et al. proposed curvature feature descriptors to extract pipelines from laser-scanned
industrial plants for reconstructing as-built 3D models [51]. J. Chen et al. researched point cloud
classification for construction equipment and proposed a principal axes descriptor that considered the fact that most construction equipment has a rectangular structure and is line-symmetric [17]. Becker et al. used
both geometric and color descriptors for segmenting photogrammetric-generated point clouds [22].
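As an illustrative sketch of the eigenvalue-based local shape descriptors mentioned above, the following Python snippet computes linearity, planarity, sphericity, and eigenentropy from the covariance matrix of each point's k nearest neighbors; the neighborhood size is an arbitrary assumption, and this follows the general recipe of the cited literature rather than the exact descriptor set used later in this dissertation.

```python
# Hedged sketch: eigenvalue-based local shape descriptors from a k-NN neighborhood.
import numpy as np
from scipy.spatial import cKDTree

def local_shape_descriptors(xyz, k=30):
    tree = cKDTree(xyz)
    _, nn = tree.query(xyz, k=k)                        # k nearest neighbors per point
    feats = []
    for idx in nn:
        cov = np.cov(xyz[idx].T)                        # 3x3 neighborhood covariance
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]    # eigenvalues, descending
        l1, l2, l3 = np.maximum(lam, 1e-12)
        p = lam / max(lam.sum(), 1e-12)                 # normalized eigenvalues
        feats.append([
            (l1 - l2) / l1,                             # linearity
            (l2 - l3) / l1,                             # planarity
            l3 / l1,                                    # sphericity
            -np.sum(p * np.log(p + 1e-12)),             # eigenentropy
        ])
    return np.asarray(feats)
```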
Several supervised machine learning algorithms such as Support Vector Machine (SVM), random forest,
k-nearest neighbors (KNN), Naive Bayes classifier, and Neural Network (NN) have been adapted for point
cloud segmentation using handcrafted feature descriptors [15], [22], [24], [27], [41], [43]-[47], [52].
Weinmann et al. reported that the SVM classifier outperformed KNN and naïve Bayesian classifiers for urban scene segmentation of LIDAR point clouds [45]. Yang and Dong used an SVM classifier to segment
mobile laser-scanned point clouds into linear, planar, and spherical categories and refined the segmentation
result using their proposed similarity measurements [15]. Zhang et al. first segmented an airborne LIDAR
point cloud using a surface-growing algorithm and used SVM to classify each segment into the desired
category [46]. Hackel et al. proposed a random forest classifier to segment both mobile and terrestrial
LIDAR-scanned point clouds with strongly varying point density. As the authors indicated, providing the classifier with point descriptors computed in a multi-scale fashion can overcome the segmentation challenges posed by varying point density [44]. The same research group then used the random forest algorithm in a more recent study to predict class probabilities for each point, generate contour candidates, and design a three-stage approach for extracting object contours [24]. The work of Bassier et al. showed that random forest could achieve higher accuracy than SVM for indoor LIDAR data segmentation [27].
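In the spirit of these studies, the following hedged Python sketch trains a random forest on per-point descriptors using scikit-learn; synthetic stand-in features are generated here, whereas in practice the feature matrix would come from a descriptor-extraction step such as the one sketched above.

```python
# Hedged sketch: point-wise classification with handcrafted descriptors and a
# random forest. X and y are synthetic stand-ins for per-point features/labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Per-descriptor importances, useful for descriptor-ranking analyses.
print(clf.feature_importances_)
```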
2.2.2. Point cloud segmentation with deep learning techniques
As discussed, several research tasks that required segmenting 3D point clouds were tackled by carefully
designing the handcrafted feature descriptors. However, in general, finding the optimal feature-descriptor
combination has remained a challenge [23]. Over the past few years, we have witnessed a shift of research
focus from investigating handcrafted feature descriptors to adapting and developing novel deep-learning
architectures to address the segmentation problem in the 3D domain. In contrast to traditional machine
learning algorithms, deep learning techniques are capable of extracting feature descriptors during the
training process [29]. Since the original deep learning success (i.e., AlexNet) was achieved on classification problems in the 2D domain [53], designing an efficient and effective 3D representation for
deep learning algorithms has been and remains an active field of research. There are two main ways of
representing 3D point clouds that are then fed to deep-learning segmentation algorithms: (1) volumetric
representation and (2) an unordered point set (i.e., the original form of point clouds).
Maturana and Scherer [54] first proposed using volumetric representation for 3D object recognition in a
point cloud. Their designed VoxNet contains two convolutional layers for feature extraction, a pooling layer
for dimensionality reduction, and two fully connected layers for classification. Wu et al. designed 3D
ShapeNets to recognize object categories and reconstruct their full 3D shapes using an RGB-D-sensed depth
image in a volumetric representation [55]. Qi et al. argued that overfitting is a key issue while training a
volumetric Convolutional Neural Network (CNN) model [56]. To overcome such a challenge and force a
CNN model to exploit local features, their proposed CNN architecture adds a prediction loss computed from a partial object to the prediction loss computed from the whole object. A pre-segmented point cloud was required for the abovementioned approaches to perform classification. Song and Xiao [57] proposed a Deep Sliding Shapes model that takes a whole 3D scene as input and outputs both bounding boxes and object classifications. To predict a class label for each point in a point cloud, Huang and You [21], Hackel et al. [58], and Tchapmi et al. [59] proposed training a CNN model to take a local 3D voxel grid as the input and then classify its center point. Consequently, to segment the entire 3D scene, these methods construct local 3D voxel grids using each and every point in the point cloud as the center point. Hackel et al. [58] extended the multi-scale handcrafted feature extraction concept and constructed the local 3D voxel grids in a multi-scale fashion. Tchapmi et al. [59] proposed refining the coarse voxel predictions by adding trilinear interpolation and a fully connected Conditional Random Field to their architecture.
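The volumetric representation discussed above can be illustrated with the following Python sketch, which converts a local point neighborhood into a fixed-size binary occupancy grid suitable as input to a 3D CNN; the grid size and voxel resolution are arbitrary choices, not values from the cited works.

```python
# Hedged sketch: voxelizing a point neighborhood into a binary occupancy grid.
import numpy as np

def voxelize(xyz, grid_size=32, voxel_size=0.5):
    """Return a (grid_size, grid_size, grid_size) occupancy grid centered on the points."""
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    center = xyz.mean(axis=0)
    idx = np.floor((xyz - center) / voxel_size).astype(int) + grid_size // 2
    keep = np.all((idx >= 0) & (idx < grid_size), axis=1)   # drop points outside the grid
    grid[idx[keep, 0], idx[keep, 1], idx[keep, 2]] = 1.0
    return grid

occupancy = voxelize(np.random.rand(10000, 3) * 10.0)       # stand-in point cloud
print(occupancy.shape, occupancy.sum())
```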
Despite the success of volumetric representation for various point cloud segmentation and classification
tasks, the voxelization process introduces a computational overhead to the entire workflow. Qi et al. [23]
introduced PointNet, the first deep learning architecture that takes a point cloud in its original form (i.e.,
unordered point set) as the input. The authors designed two sub-architectures to demonstrate the capabilities
of using PointNet for both point-wise segmentation and segment-wise classification tasks. PointNet++ was
then designed to overcome PointNet's limitation of not extracting local features [60]. With
similar concepts in multi-scale handcrafted feature extraction, PointNet++ captures local features by
recursively applying PointNet on a nested multi-resolution partitioning of the input point cloud. With the
success of PointNet and PointNet++ on classifying and segmenting a point cloud in its original form, several
novel CNN-based, point-cloud segmentation architectures have been designed that operate on an unordered
point set directly [28], [61]-[70]. Jiang et al. proposed a SIFT-like module that can be integrated into
PointNet architecture [66]. The proposed module extracts multi-scale features encoded in eight crucial
orientations. Engelmann et al. [63] proposed to use KNN and k-means algorithms to learn features that
considered point-neighbor relations in both the feature space and world space. Hermosilla et al. [61]
reported that most real-world point clouds are non-uniformly sampled, which can negatively affect convolutional neural networks, and suggested Monte Carlo integration to compute
convolutions. Thomas et al. [64] addressed the challenge of non-uniformly distributed point spacing by
using a regular subsampling strategy. For the purposes of segmenting large-scale LIDAR-sensed point
clouds, Landrieu and Simonovsky [71] proposed to represent the unordered point cloud in an ordered
manner by using a superpoint graph (SPG) representation. With SPG representation, the entire scene can
be considered while segmenting the parts that are far apart.
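To make the PointNet idea concrete, the following minimal PyTorch sketch applies a shared per-point MLP followed by an order-invariant max pooling, as in Qi et al. [23]; the layer sizes are illustrative, and the input/feature transform networks of the original architecture are omitted.

```python
# Hedged, minimal PointNet-style classifier: shared per-point MLP + max pooling.
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.point_mlp = nn.Sequential(          # shared MLP applied to every point
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, points):                   # points: (batch, 3, num_points)
        features = self.point_mlp(points)        # (batch, 1024, num_points)
        global_feat = features.max(dim=2).values # order-invariant max pooling
        return self.head(global_feat)            # per-cloud class logits

logits = TinyPointNet()(torch.randn(4, 3, 2048))  # e.g., 4 clouds of 2048 points
print(logits.shape)                               # torch.Size([4, 3])
```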
2.2.3. Research gaps on point cloud segmentation
In order to provide common data sets for comparative purposes [29] and truly realize the full potential of
deep learning techniques on 3D data segmentation and classification tasks [58], considerable efforts have
been devoted to establishing both indoor and outdoor 3D benchmark data sets [58], [72]-[76]. The two
primary sources for creating benchmarks are LIDAR sensors and RGB-D cameras. Since LIDAR sensors,
in general, have a longer sensing range and are more robust to strong infrared noises than RGB-D cameras,
most outdoor benchmarks have been created with stationary terrestrial-, mobile terrestrial-, and aerial-
LIDAR [58], [75]-[80]. Consequently, most existing works reviewed in this section on 3D segmentation of
outdoor scenes were designed for and validated with LIDAR data sets. However, segmenting
photogrammetric-generated point clouds is much more challenging than segmenting LIDAR data for two reasons. First, several point features that are available in LIDAR data do not exist in photogrammetric-generated point clouds (e.g., echo-based and waveform-based features) [45]. Second, photogrammetric-generated point clouds tend to be noisy, and in some cases, the ground cannot be captured due to dense canopy [16]. In addition, when information is to be extracted from the segmented point clouds/meshes in a later process, higher accuracy is desired when segmenting particular classes of objects (e.g., trees). For instance, high segmentation accuracy is desired for vegetation if the goal is to extract individual tree locations, and vice versa when the goal is to extract building footprints, window locations, and building facade materials. Previous studies have shown that point features may contribute differently to segmenting different objects in LIDAR-generated point clouds [41], [45]. Thus, in this study, the author investigated several point features available in photogrammetric-generated point clouds and their
contributions to the segmentation process. In addition, since the selection of the segmentation/classification
algorithms is essential for producing a high-quality result [81], several classification algorithms have been
evaluated based on their performance. Furthermore, linking point cloud segmentation (especially of photogrammetric-generated point clouds) with the extraction of detailed information such as individual tree locations, building footprints, and ground surface elements for generating synthetic training environments and enabling artificial-intelligence path planning remains a challenge. Works related to object information extraction are reviewed in the following three sub-sections.
2.3. Vegetation segmentation and identification of individual tree locations
Several studies have focused on individual tree segmentation in order to identify tree locations and other related features from LIDAR-collected point clouds [82]-[84]. Most proposed approaches contain two steps:
(1) segmenting tree points from everything else; and (2) identifying individual tree locations and related
features from the segmented tree points [85]-[87]. Huang et al. [85] and Zhang et al. [82] have proposed to
segment tree points by combining an airborne LIDAR-generated point cloud with near-infrared images. Vegetation regions were extracted using the Normalized Difference Vegetation Index (NDVI) derived from the near-infrared images, and the extracted regions were then projected onto the point cloud to segment the tree points. Individual tree locations were extracted in a region-growing fashion, with treetops (points having locally maximum height values) set as seed points. Persson et al. [86] focused on
identifying individual tree locations in dense forest areas. Trees were first segmented from the ground by
using an active contours algorithm. Following that, individual tree locations were identified by fitting a
parabolic surface to the top of the segmented tree canopies. Ritter et al. [87] also focused on forest datasets,
and the point clouds were collected with terrestrial laser scanners. The authors proposed a two-step
clustering algorithm, which exploited the ability of terrestrial laser scanners to collect data points on the
leaves inside of crowns. In the first step, tree points were stratified into horizontal layers, and cluster centers
in each horizontal layer were computed based on the point density. These centers from different layers were
then clustered again in the second step of the algorithm for computing the individual tree locations. Monnier
et al. [84] focused on detecting individual trees from a point cloud that was collected with a mobile laser
scanner for dense urban areas. In their study, trees were segmented using local geometrical features of
individual points. Trunks of trees were assumed to exist in the point cloud and were approximated by
vertical cylinders to generate a “cylindrical descriptor.” Individual trees were detected by combining the
information from both the cylindrical descriptors and the segmented tree points.
However, these methods suffer from various problems when used to identify individual tree locations and
extract related features from photogrammetric-generated point clouds. For example, data for tree trunks may not exist in the generated point clouds due to dense canopy, and leaves inside crowns cannot be captured, since photogrammetric techniques do not have the penetration capability that LIDAR does. In addition, treetop surfaces may not always form a regular shape, such as a parabolic surface, due to
data noise and the lack of 3D reconstruction accuracy. Thus, this study aims to 1) formulate the tree location
identification problem considering these limitations of the photogrammetric technique and 2) investigate
and compare different approaches for tackling the problem.
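As an illustrative sketch of the treetop-as-local-maximum idea that recurs in this literature, the following Python snippet rasterizes segmented tree points into a canopy height model and detects treetop candidates as local height maxima; the cell size, window size, and minimum height are assumptions, and point heights are assumed to be measured above ground.

```python
# Hedged sketch: treetop candidates as local maxima of a rasterized canopy
# height model (CHM). Parameters are illustrative.
import numpy as np
from scipy import ndimage

def treetop_candidates(tree_xyz, cell=0.5, window=7, min_height=2.0):
    xy = tree_xyz[:, :2]
    col = ((xy - xy.min(axis=0)) / cell).astype(int)
    chm = np.zeros(col.max(axis=0) + 1)
    np.maximum.at(chm, (col[:, 0], col[:, 1]), tree_xyz[:, 2])   # highest point per cell
    local_max = ndimage.maximum_filter(chm, size=window)
    peaks = (chm == local_max) & (chm > min_height)
    rows, cols = np.nonzero(peaks)
    return np.column_stack([rows, cols]) * cell + xy.min(axis=0)  # approx. XY locations

# tops = treetop_candidates(segmented_tree_points)   # (M, 2) candidate tree locations
```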
2.4. Building footprints extraction
Ideally, building footprints should be extracted based on the data collected for the exterior walls. However,
the photogrammetric technique can only capture data within the line-of-sight, and parts of the exterior walls are often missing due to occlusion caused by vegetation, parked vehicles, and so forth. Thus, reconstructing building footprints with only the data of exterior walls may not be feasible in practice. It is worth pointing out that point clouds generated using the photogrammetric technique have a lower quality than those generated with LIDAR [88]. Since this study focused on creating 3D terrain models of large areas, the imagery data were collected differently than in other research studies (images were captured at high altitude, i.e., > 70 meters in most cases, and with a low overlap ratio, i.e., < 70%). As a result, the point clouds used in this research contain more noise than those in studies focused on data collection for one or a few buildings. Such low-quality data adds another layer of difficulty when extracting building footprints (e.g., trees are connected or fused to
buildings in the point cloud).
Several studies have suggested extracting roof boundaries and projecting them onto a 2D plane as the building
footprints [89]-[94]. Similar to the task of identifying individual tree locations, most proposed building
footprint extraction approaches include two steps: (1) segmenting building/roof points from everything else; and (2) extracting building footprints. Wang et al. [90] used the Adaboost machine learning algorithm
to first segment the raw point cloud into buildings, trees, and ground. A shortest path algorithm (i.e., the
Floyd-Warshall algorithm) was then utilized for computing initial rough building footprints based on the
segmented building points. Following that, the Bayesian maximum a posteriori estimation was adapted to
process the initial building footprints further and preserve straight lines and 90-degree angles. Zhou and Neumann [89] researched extracting building footprints from airborne LIDAR-collected point
clouds. In their study, only a few points were collected on the vertical surfaces. Thus, non-ground points
were detected through a connected component labeling algorithm. Following that, roof and tree points were
separated by using local point features (i.e., regularity, horizontality, flatness, and normal distribution).
With their proposed algorithm, roof boundaries were then extracted as building footprints and further
smoothed and aligned with the identified principal directions. A similar approach that also considered the
dominant directions of a building orientation while extracting building footprints was proposed in [93]. Sun
and Salvaggio [91] proposed to use a graph cuts-based optimization algorithm to segment vegetation and
non-vegetation points. A region-growing algorithm was then utilized to extract roof points. Following that,
roof points were projected on a 2D grid, and building footprints were extracted from the 2D grid with
rectilinear constraints. Awrangjeb and Lu [92] proposed to segment roofs using rule-based algorithms.
Following that, a corner-detection algorithm and the Douglas-Peucker algorithm were combined to extract and
regularize building footprints.
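The "project roof points onto a 2D grid, then extract and simplify the boundary" recipe described above can be sketched in Python with OpenCV as follows; the grid resolution and simplification tolerance are illustrative assumptions rather than values from the cited studies.

```python
# Hedged sketch: footprint polygon from segmented roof points via a 2D occupancy
# grid, external contour extraction, and Douglas-Peucker simplification.
import numpy as np
import cv2

def footprint_from_roof_points(roof_xyz, cell=0.25, simplify_tol=1.0):
    xy = roof_xyz[:, :2]
    origin = xy.min(axis=0)
    idx = ((xy - origin) / cell).astype(int)
    grid = np.zeros(idx.max(axis=0)[::-1] + 1, dtype=np.uint8)   # rows = y, cols = x
    grid[idx[:, 1], idx[:, 0]] = 255
    grid = cv2.morphologyEx(grid, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(grid, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)
    poly = cv2.approxPolyDP(largest, simplify_tol / cell, True)  # Douglas-Peucker
    return poly.reshape(-1, 2) * cell + origin                   # back to world XY

# footprint = footprint_from_roof_points(segmented_roof_points)  # (K, 2) polygon vertices
```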
Previous studies have relied heavily on the assumption that roofs have flat surfaces and are above a certain
height. Based on this assumption, data points that fall on roof areas can be separated from those that fall on
exterior walls and vegetation. However, in reality, ductwork, most notably, is housed on roofs, especially on commercial buildings. Such ductwork can create shadows and noise in the reconstructed models, especially when using the photogrammetric technique, as it is sensitive to reflective materials and inaccurate when modeling thin, tube-shaped objects. As a result, roofs in
photogrammetric-generated point clouds cannot be accurately extracted using the previously proposed
approaches.
2.5. Surface material classification
Recognizing material categories from 2D images is a fundamental problem in computer vision and has
drawn attention from both academia and industry in the past two decades [95]. The problem is usually
viewed as a texture classification problem and has been studied side by side with image classification and
object detection problems [96]. Earlier works focused on investigating and developing statistical approaches to quantify handcrafted texture features extracted using different texture filters (e.g., the Sobel filter, color histograms, filter banks, and Gabor filters) for material classification [97]-[102]. Some of the proposed material classification approaches were benchmarked using a texture database created in a controlled environment (i.e., the CUReT dataset [103]) and could achieve classification accuracy above 95% [98], [101], [102]. However, M. Varma and A. Zisserman's approach [98] could only achieve a classification accuracy of 23.8% when applied to a database containing common materials with real-world appearances (i.e., the Flickr Material Database, FMD [104]). C. Liu et al. improved the classification accuracy on FMD in 2010 (achieving 44.6% accuracy) with a proposed Bayesian learning framework and a set of new features [105]. D. Hu et al. improved the classification accuracy on FMD again in 2011 (achieving 54% accuracy) with a set of proposed texture features based on variances of gradient orientation and magnitude.
With the recent advancement in deep learning, researchers started to investigate and develop different
Convolutional Neural Network (CNN) architectures for solving the material classification tasks. Several
CNN architectures that achieved record-breaking performance on image classification tasks are reviewed here. AlexNet, which contains five convolutional layers and three fully connected layers, was proposed in 2012 [106]. VGG was then proposed in 2014; the main advances of VGG over AlexNet concern the size of the convolutional filters and the depth of the network [107]. VGG suggested using smaller convolutional filters (i.e., 3*3) and more convolutional layers. Such a strategy resulted in a 7.3% top-5 error rate on the ImageNet challenge. GoogleNet [108] was also proposed in 2014 and slightly outperformed VGG with a 6.7% top-5 error rate on the ImageNet challenge. The Inception module was proposed with GoogleNet to reduce the computational complexity. It is worth noting here that GoogleNet does not contain fully connected layers, and the number of parameters was reduced by a factor of 12 compared to AlexNet. ResNet was then proposed in 2015 and achieved a 3.57% top-5 error rate on the ImageNet challenge [109]. The residual block was introduced with ResNet, which allows the identity output of one layer to be added to the output of a subsequent layer. DenseNet further extended the concept from ResNet and introduced the dense block, which allows mapping the identity output of one layer to several later layers [110]. The authors argued that connecting each layer to every other layer in a feedforward fashion within a dense block can alleviate the vanishing gradient problem and strengthen feature propagation. G. Kalliatakis et al. compared three CNN architectures (i.e., AlexNet [106], OverFeat [111], and the CNN architecture introduced in [112]) for the ground material classification problem and benchmarked them using FMD [113]. The results indicated that material classification using CNN models on FMD could achieve over 60% accuracy. M. Cimpoi et al. extracted texture features using the VGG model and performed the classification using an SVM; the classification accuracy on FMD reached 82.4% [114].
Previous studies have focused on and made valuable contributions to material classification for real-world images. Since this study focuses on ground material classification rather than materials of other objects, the generated orthophoto that covers the entire area of interest is used instead of individual real-world images. However, an orthophoto rendered from photogrammetric-generated meshes has a lower quality than real-world images (i.e., the orthophoto can be distorted and blurred), and a thorough search of the relevant literature did not yield any existing ground material orthophoto database. Thus, this study first focused on creating a mesh-rendered material database. Following that, the author evaluated the strengths and limitations of existing CNN architectures for classifying mesh-rendered images.
Chapter 3: Photogrammetric Point Clouds Segmentation and Objects Information Extraction
Framework
3.1. Research Objective
With the research gaps identified in the literature review section, the objective of this study is to create a
framework for semantic labeling of 3D point clouds/meshes generated with photogrammetric techniques
and extract the necessary object information for the creation of virtual environments and simulations. The
segmentation process was first performed on the generated 3D point clouds. Following that, the generated
meshes were segmented accordingly. In this research project, alternative machine learning algorithms to
classify point clouds into top-level terrain elements and different ground surface materials were explored.
Previous studies have focused on and made valuable contributions to segmenting Light Detection and Ranging (LIDAR) generated point clouds, but not data created using the photogrammetric technique. In this study, the author considered the fact that several point features that are available in LIDAR data do not exist in photogrammetric-generated point clouds (e.g., multi-return pulses, echo-based and waveform-based features) [11]. The effectiveness of using different point features that are available in the generated point cloud (via photogrammetry) for segmenting different objects was analyzed. Furthermore,
the proposed information extraction process was designed to overcome the data quality issues in the
generated point clouds (i.e., noise in the data, challenges in capturing the ground plane and trunk of a tree
due to dense canopy [16] or highly heterogeneous (not flat) surface).
The specific research questions that were answered include the following.
1. How should photogrammetric-generated point clouds be classified into top-level terrain elements
(i.e., ground, buildings, and vegetation) considering the data quality issues and lack of point
features compared with LIDAR-generated point clouds?
a. Which point features that are available in photogrammetric generated data can be used to
describe the object characteristics and classify top-level terrain elements?
b. What is the best classification method for classifying top-level terrain elements?
2. How to extract object information such as tree locations and building footprints in a
photogrammetric generated point cloud to support the creation of geo-specific 3D models in a
virtual environment?
a. How should individual tree locations be identified using the classified tree points and
taking possible missing tree trunk data into consideration?
b. How should building footprints be accurately extracted considering potentially missing
data for the walls (due to occlusions) and noises in the reconstructed roofs?
3. How to recognize and label materials on ground surfaces (i.e., dirt, grass, and road) from
photogrammetric data?
3.2. Proposed Framework
This research aims to investigate and develop a framework for the segmentation/classification of photogrammetric-generated point clouds into predefined categories and the extraction of object information for the creation of virtual environments and simulations. The methodology combines concepts from the areas
of computer vision and machine learning. Figure 1 presents the designed framework that illustrates the
workflow, emphasizing the main elements and steps involved in the process. The framework is designed
based on the review of the literature as stated in Section 2 where top-level terrain elements (i.e., ground,
trees, and buildings) are segmented before the detailed information extraction processes can take place.
Since the input of a virtual environment and simulation needs to be in a mesh format instead of point cloud
format, mesh segmentation is a necessary step. Details of each step in the framework are discussed in the
following subsections.
Figure 1. Semantic terrain points labeling framework.
3.2.1. Top-level objects segmentation workflow design
In order to understand and evaluate the effectiveness and performance of different classifiers and
handcrafted point attributes/descriptors in the context of top-level terrain elements segmentation, a
workflow was designed following previous studies that focused on LIDAR point cloud segmentation [40]-
[42]. The workflow utilized both supervised and unsupervised machine learning processes as shown in
Figure 2.
Figure 2. Workflow of top-level point cloud segmentation.
One difference between photogrammetric and LIDAR point clouds is that the point cloud obtained via the
photogrammetric technique is very dense. Since this work aims to perform point cloud classification for large areas, the raw photogrammetric point clouds contain huge numbers of points (e.g., more than hundreds of millions of points in one dataset), and processing such an amount would be impractical even for a high-
performance computer. Moreover, one assumption that most of the supervised machine learning algorithms
are making is that the testing data should have the same distribution as the training data [115]. However,
due to the different parameters used in the data collection process, the raw point cloud could have varying
point densities across different capture sessions. For instance, one flight session could take more photos
around a building while the next session may take fewer photos of a nearby building of similar size. If both
sessions were combined into the same data set, this would result in non-uniform point density. Such non-uniformity in point spacing would cause objects with similar geometry to have different values for their local point descriptors. To alleviate this problem, a voxelization algorithm was used to uniformly downsample the raw point clouds and prevent uneven point spacing. The voxelization process was performed by discretizing the whole 3D space into 3D grids with a predefined spacing (e.g., 0.5 meters). The grid cells with points inside were then extracted to form the downsampled point cloud, using the centroid of each cell as the point position. It is worth noting that the predefined point spacing cannot be too large, such that the downsampled point cloud loses detail on objects, nor too small, which would make the subsequent processing impractical. In this study, the point spacing was set to 0.5 meters so that the downsampled point clouds have only a few million points while the objects of interest (i.e., buildings, trees, and ground) retain enough detail to be differentiated.
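As an illustration of this step, the following minimal Python/NumPy sketch downsamples a point cloud on a 0.5-meter voxel grid; it is not the PCL-based implementation used in the prototype, and the use of the per-cell point centroid as the representative position is an assumption made for illustration.

import numpy as np

def voxel_downsample(points, spacing=0.5):
    # Assign each point to a cubic cell of the given spacing.
    cells = np.floor(points / spacing).astype(np.int64)
    # Group points that share a cell and average their coordinates.
    _, inverse, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]

# Example: a synthetic cloud reduced to roughly uniform 0.5 m spacing.
raw = np.random.rand(1_000_000, 3) * 100.0
print(raw.shape, "->", voxel_downsample(raw).shape)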
After the downsampling process, the next step was to extract the ground points. Several researchers have
suggested performing a bare earth extraction as a pre-process for segmenting other objects [49], [50]. The
reason that ground points should not be classified with supervised learning algorithms in this case is that
point clouds generated with the photogrammetric technique do not have the echo-based and full-waveform-based features that LIDAR-generated point clouds do, and it is a challenging task to classify ground points
with only local point features (local point features/descriptors will be discussed in Section 4.1.1). For
instance, roof points may have very similar features/descriptors to ground points such as planarity and point
density ratio. Previous work by Sithole and Vosselman [116] compared eight different ground-
segmentation filters and found that most filters perform better in rural landscapes with smooth terrain than
in urban areas with complicated human-made structures or in rough areas with vegetation that results in the
terrain of uneven height. Thus, in the proposed framework, the authors chose to combine the use of a region
growing and a progressive morphological filtering algorithm to avoid the issues of previous work when
applied in complicated urban areas. The region-growing algorithm was first applied, and the progressive
morphological filtering algorithm was then used to process the “non-ground” points that were extracted
from the region-growing algorithm. The region-growing algorithm recursively grows a cluster of points by
examining nearby points within a predefined radius. If the nearby points have similar normal vectors to the
current cluster, they are combined, and the process repeats until no additional valid points can be added to
the current cluster. Once all the points in the data set are examined, the algorithm picks the largest cluster as the ground points. Note that the predefined radius cannot be smaller than the point spacing that was used
for the downsampling process; otherwise, no neighbor points can be selected. It also needs to be large
enough so that small elevation changes on the ground (e.g., sidewalk and planting beds) will not affect the
result. A radius of 3 meters was used in this study since it allows the algorithm to select enough neighbor
points to overcome the challenges from small elevation changes. Although this method works effectively
on extracting relatively flat ground, it has two main limitations. First, if the ground is separated by a large
building or a wall, the method would only be able to extract one part. Secondly, if the terrain is sloped, the
method may fail to grow the ground cluster over a high, sloped hill. To alleviate these limitations, a
progressive morphological-filtering algorithm proposed by Zhang et al. [49] was utilized to handle the isolated or sloped ground points. The original method was designed for segmenting airborne LIDAR data using mathematical morphology, and the core idea was to apply morphological operations, including "dilation," "erosion," "opening," and "closing," to filter out non-ground points as noise [117]. As such,
any remaining points unaffected by the operations mentioned above would be the ground points. Naively
applying the morphological operations with a fixed window size would result in inaccurate ground points
when noises are larger than the predefined window size. A progressive morphological-filter algorithm
solves this problem by iteratively increasing the window size during the operations to remove all vegetation
and building points as noise.
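The sketch below illustrates the core loop of a progressive morphological filter in the spirit of Zhang et al. [49]: a minimum-elevation raster is repeatedly opened with a growing window, and cells whose elevation drops by more than a height threshold are marked as non-ground. The gridding step, parameter names, and default values are illustrative assumptions rather than the exact implementation used in this framework.

import numpy as np
from scipy.ndimage import grey_opening

def progressive_morphological_ground(points, cell=0.5, max_window=33,
                                     slope=0.3, init_thresh=0.25, max_thresh=3.0):
    """Return a boolean mask marking the points kept as ground."""
    # Rasterize the minimum elevation per grid cell.
    xy = np.floor((points[:, :2] - points[:, :2].min(axis=0)) / cell).astype(int)
    surface = np.full(xy.max(axis=0) + 1, points[:, 2].max())
    np.minimum.at(surface, (xy[:, 0], xy[:, 1]), points[:, 2])

    ground = np.ones(len(points), dtype=bool)
    window, thresh = 3, init_thresh
    while window <= max_window:
        opened = grey_opening(surface, size=(window, window))
        # Cells whose elevation drops by more than the threshold hold non-ground objects.
        nonground = (surface - opened) > thresh
        ground &= ~nonground[xy[:, 0], xy[:, 1]]
        surface = opened
        thresh = min(slope * window * cell + init_thresh, max_thresh)
        window = 2 * window + 1  # progressively enlarge the window
    return ground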
The next two steps in the designed workflow follow a common supervised machine learning process. Since the segmentation needs to be performed at the point level, point features/descriptors need to be
computed for each individual point. Following that, with the defined point descriptors, the supervised
classifier is used to classify the points into different categories. A training process is needed to train a
classifier with a set of manually classified points; the trained classifier can then be used to classify the
unlabeled points.
3.2.2. Point descriptors ranking
The use of effective point descriptors is the foundation of an accurate segmentation result. Different point descriptors may have different effects when classifying different objects. For instance, open source data-based features are crucial for identifying building roofs, while color information is more effective for classifying vegetation. As one can imagine, when individual tree locations need to be identified, the accuracy of the result for classifying tree points is important while the accuracy of the result for classifying building points is irrelevant. To extract building footprints, on the other hand, the accuracy of the result for classifying building points is crucial. Thus, it is necessary to understand and be able to select effective point descriptors for segmentation tasks with different purposes. Previous studies have demonstrated the effectiveness of using local point descriptors for segmenting LIDAR-generated point clouds [41], [45], [118]. This study first adapts and evaluates the effectiveness of these local point descriptors (i.e., color-based, density-based, and local surface-based descriptors) for photogrammetric-generated point clouds. Following that, an additional set of proposed point descriptors that can be derived from open source data and other feature extractors (i.e., open source data-based and texture-based descriptors) is explored.
Instead of computing descriptors globally, local point descriptors limit the computation area to a predefined
radius. For each point, only nearby points within the radius are selected, and the point descriptors are computed from these points. Examples of such descriptors include local planarity and curvature, which can be obtained by
computing a principal-component analysis on nearby points. As one can imagine, these descriptors can be
very helpful for identifying whether a point falls on a flat surface or edge. To provide more detailed
information of the nearby regions on different levels of details, multi-scale descriptors are computed for
each category by selecting nearby points with a varying radius. It is important to compute descriptors in
such a multi-scale fashion since it allows for the extraction of detailed surface information, and the produced
results are robust to noise [119]. As an example of the importance of using multi-scale descriptors, if a
descriptor is extracted on a 10-cm scale, a point falling on the edge of a window frame will have low planarity.
However, if the descriptor is extracted with a 1-m scale, such a point would be considered to be in a flat
area. Therefore, information from different scales needs to be combined to provide the best point description.
Details of the local point descriptors are discussed below:
Color-based descriptors: Point clouds from photogrammetric reconstruction contain color information in addition to each point's (x, y, z) position, stored as red, green, and blue (RGB) channels. To improve
segmentation quality, color values were transformed from RGB to HSV (hue, saturation, value) color space
since previous work has shown that HSV space works better in color image-segmentation tasks [120]. For
each color channel in the HSV space, the average and standard deviation were computed as color descriptors.
The color descriptors also include the original color of the point.
Point density-based descriptors: As previously discussed, the point clouds would first be downsampled
to obtain uniform point spacing. Therefore, by measuring density in different directions, nearby shape
profiles are implicitly obtained for a point. Three different point density descriptors at each scale were
computed as follows: (1) to measure if the nearby points were uniformly distributed, the number of points
n in a sphere with a predefined radius r was computed; (2) to measure whether the nearby points are
distributed vertically (e.g., trees, poles, etc.), the number of points m in a cylinder with the same radius r
and a fixed height h was computed; and (3) the ratio between n and m was also computed. Since these
descriptors were computed in a multi-scale fashion, different values were assigned to r on each scale. In
this study, the smallest r was set to the point spacing that was used for the point downsampling process. The point spacing was also used to increase the value of r at each scale. h is a constant value for all scales. In order to ensure the computed ratio is between 0 and 1, h needs to be larger than the largest r. For simplicity, h was set to twice the largest r in this study.
Local surface-based descriptors: The local surface-based features of each point data are computed using
the Eigenvalues that are derived from the covariance matrix of its n local surrounding points in the sphere.
The covariance matrix is calculated with
∑_p = (1/n) ∑_{i=1}^{n} (p_i − p̄)(p_i − p̄)^T (1)

where p is the point represented by its x, y, and z coordinates; p_i is one of its n surrounding points; and p̄ is the mean/center of its surrounding points. The eigenvalues λ1 > λ2 > λ3 are then computed with a principal component analysis based on the covariance matrix. Please note that the eigenvalues need to be normalized between 0 and 1 with respect to λ1. The local surface-based features include the following:

Omnivariance = (λ1 λ2 λ3)^(1/3) (2)

Eigenentropy = − ∑_{i=1}^{3} λi ln(λi) (3)

Anisotropy = (λ1 − λ3) / λ1 (4)

Planarity = (λ2 − λ3) / λ1 (5)

Sphericity = λ3 / λ1 (6)

Linearity = (λ1 − λ2) / λ1 (7)

Curvature = λ3 / (λ1 + λ2 + λ3) (8)

Verticality = 1 − |⟨[0, 0, 1], e3⟩| (9)
The eigenvalues represent the magnitude of the directions along which p's neighboring points extend. Different local surface point descriptors can be computed from combinations of the three eigenvalues as shown above. For instance, if a point lies on a planar surface, its planarity is expected to be close to 1 according to equation (5), since its λ1 and λ2 will have similar magnitudes but λ3 will be much smaller than λ1 and λ2. These features can provide useful information for classifying buildings and trees. Wall points are expected to have large planarity and verticality values. Roof points have a large planarity value but small verticality. Tree points have a large eigenentropy value but a small planarity value. It is worth noting here that thresholds are not used for the classification process; instead, a supervised machine learning process is adopted.
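The following Python sketch computes the descriptors of equations (2)-(9) for a single point from its radius neighborhood; the KD-tree query, the eigenvalue clipping, and the small epsilon guarding the logarithm are illustrative choices rather than part of the original implementation.

import numpy as np
from scipy.spatial import cKDTree

def local_surface_descriptors(points, index, radius, tree=None):
    # In practice the KD-tree would be built once for the whole cloud.
    if tree is None:
        tree = cKDTree(points)
    neighbors = points[tree.query_ball_point(points[index], r=radius)]
    cov = np.cov(neighbors.T)                      # covariance matrix, equation (1)
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    eps = 1e-12
    eigvals = np.clip(eigvals, eps, None)
    l3, l2, l1 = eigvals / eigvals.max()           # normalized so that λ1 = 1
    e3 = eigvecs[:, 0]                             # eigenvector of the smallest eigenvalue
    return {
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
        "eigenentropy": -sum(l * np.log(l) for l in (l1, l2, l3)),
        "anisotropy": (l1 - l3) / l1,
        "planarity": (l2 - l3) / l1,
        "sphericity": l3 / l1,
        "linearity": (l1 - l2) / l1,
        "curvature": l3 / (l1 + l2 + l3),
        "verticality": 1.0 - abs(np.dot([0.0, 0.0, 1.0], e3)),
    }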
Open source data-based descriptors: Since photogrammetric techniques can generate geo-specific point clouds, publicly available global road and building datasets such as Open Street Map should not be neglected when computing point descriptors. Such open data are often built through crowdsourced volunteered geographic information. The accuracy of such data is low and inconsistent across locations [121]; thus, the data cannot be used directly to segment the point cloud. However, such information can be utilized to compute point descriptors used during the classification process. For instance, when a point cloud is overlaid with open-source map data, a new point descriptor could be the distance from a point to its closest major road or closest building in the map. The proposed open source data-based descriptors include:
D building, the distance from a point to its closest building footprint.
D road, the distance from a point to its closest OSM road vector.
In building, a binary descriptor that represents if a point falls within a building footprint.
In road, a binary descriptor that represents if a point is on the OSM road vector with a defined road
width.
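A small sketch of how these four descriptors could be computed with Shapely is given below; the building footprint, road centerline, and road width are hypothetical placeholders rather than actual OSM data.

from shapely.geometry import LineString, Point, Polygon

# Hypothetical OSM geometries for one building footprint and one road centerline.
building = Polygon([(0, 0), (30, 0), (30, 20), (0, 20)])
road = LineString([(0, -10), (100, -10)])
road_width = 6.0                                   # assumed road width in meters

def osm_descriptors(x, y):
    p = Point(x, y)
    return {
        "D_building": p.distance(building.exterior),   # distance to the footprint boundary
        "D_road": p.distance(road),
        "In_building": int(building.contains(p)),
        "In_road": int(p.distance(road) <= road_width / 2.0),
    }

print(osm_descriptors(15.0, 10.0))                 # a point that falls inside the footprint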
Texture-based descriptors: Photogrammetric techniques can generate 3D meshes with highly detailed
textures. Such information should also be utilized for the point-cloud classification process since different
objects may have similar features, such as color, but different textures. For instance, although a building
with a green roof and walls has a similar color to trees, the textures of the walls and roof are different from those of trees. Texture descriptors are extracted from orthophotos generated using photogrammetric techniques.
convolutional neural network (CNN) was utilized to extract texture features in this research since previous
studies have shown that such a network is a powerful feature-extraction algorithm [122]. Orthophotos are
first cropped into small images that cover small areas (i.e., 3 m by 3 m). Each image patch is then fed
into a pre-trained GoogLeNet model. Texture features are extracted from the layer that is connected to the
fully connected layer in the network. The texture map has a size of 9 pixels by 9 pixels, and each pixel
contains 2,048 features. A dimensionality-reduction process is performed to transform the texture map to a
lower dimension (i.e., 80 features). Principal Component Analysis (PCA) was used for reducing the feature
dimension. The dimensionality-reduction process serves three purposes. First, each pixel has 2,048 texture
features, which is much more than the number of features in any other category, and this may affect the
trained classifier to be biased on the texture features. Secondly, many of the texture features are correlated,
and PCA has the capability to convert them into a set of linearly uncorrelated features (i.e., principal
components). Thirdly, reducing the number of features also reduces the computing power needed for
training and testing a classifier. The texture maps are then projected onto the 3D point cloud by assigning
the features from the texture map to its closest point in the 3D point cloud.
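The sketch below illustrates this texture pipeline with a pre-trained Inception-style backbone followed by PCA; InceptionV3 is used here as a readily available stand-in for GoogLeNet, the input patches are random placeholders, and the spatial size of the resulting feature map depends on the patch resolution, so the 9-by-9 map described above is not reproduced exactly.

import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA

# Convolutional backbone only; the classification head is dropped.
backbone = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

def extract_texture_features(patches):
    """patches: (N, H, W, 3) orthophoto crops covering roughly 3 m x 3 m each."""
    x = tf.image.resize(tf.cast(patches, tf.float32), (299, 299))
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    fmap = backbone(x, training=False)          # (N, h, w, 2048) feature map
    n, h, w, c = fmap.shape
    return tf.reshape(fmap, (n * h * w, c)).numpy()

# Reduce the 2048-dimensional per-pixel features to 80 principal components.
features = extract_texture_features(np.random.randint(0, 255, (4, 256, 256, 3)))
reduced = PCA(n_components=80).fit_transform(features)
print(reduced.shape)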
To understand the contribution of each point descriptor in the classification process, three experiments were conducted to analyze the abovementioned point descriptors. The point descriptors were ranked with the proposed ranking methodology as follows. First, only one category of point descriptors (e.g., color-based descriptors) is used to classify the points in a point cloud with the SVM algorithm, and the results are compared. This experiment provides insights into the contribution of each descriptor category to each classified object. Secondly, one category of point descriptors is excluded at a time from the full set (e.g., using the local surface-based, density-based, open source data-based, and texture-based descriptors during the training and classification process while excluding the color-based descriptors) with the SVM algorithm, and the classification results are compared. The purpose of conducting this experiment is to check the consistency of the patterns that were found in the first experiment. Finally, to complement the first two experiments, the last experiment ranks point descriptors through a feature-selection algorithm (i.e., Random Forest) and identifies the descriptor importance to the overall classification process instead of to each classified object.
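A minimal sketch of this third experiment is shown below; the descriptor matrix, labels, and descriptor names are random placeholders standing in for the assembled per-point descriptors and the manually labeled classes.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(5000, 172)                     # placeholder per-point descriptors
y = np.random.randint(0, 3, 5000)                 # 0 = ground, 1 = building, 2 = tree
names = [f"descriptor_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]

# Keep descriptors whose importance exceeds the mean importance of all descriptors.
mean_importance = rf.feature_importances_.mean()
selected = [names[i] for i in order if rf.feature_importances_[i] > mean_importance]
print(len(selected), "descriptors above the mean importance; top 5:", selected[:5])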
3.2.3. Selection of classifiers
Many supervised and unsupervised machine-learning algorithms have been extensively studied for the classification of LIDAR point clouds, in applications including urban scene classification, contour detection, building feature extraction, the scan-to-BIM process, and so forth. Since photogrammetric-generated point clouds are different from LIDAR point clouds, as discussed earlier, the performance of different classifiers needs to be tested. In this study, different supervised machine learning algorithms (i.e., the Support Vector Machine-SVM and Random Forest algorithms) that have been used to solve classification problems of a similar nature were evaluated.
SVM: Vapnik et al. investigated pattern recognition using the generalized portrait algorithm, which is the basis for SVM [123]. The authors further explored statistical learning theory and developed the SVM algorithm with kernel methods and soft-margin hyperplanes in 1995 [124]. As pointed out by Hsu et al., SVM parameter tuning is an essential step [125]; the accuracy of the result highly depends on the selection of the SVM parameters. Thus, the SVM parameter tuning guide provided by Hsu et al. [125] was followed in this research. It consists of four steps: (1) transform the data into the SVM package format; (2) normalize each attribute in the data; (3) use the RBF kernel and compute the best parameters C and γ through cross-validation; and (4) use the best parameters C and γ to train on the training set.
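Steps (2)-(4) of this tuning guide can be sketched as follows with scikit-learn; the parameter grid and the placeholder data are illustrative assumptions, not the values used in the experiments.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(2000, 172)                     # placeholder descriptor matrix
y = np.random.randint(0, 3, 2000)                 # placeholder class labels

# Step (2): normalize each attribute; steps (3)-(4): cross-validate C and gamma for an RBF kernel.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipeline,
    param_grid={"svc__C": [1, 10, 100, 1000], "svc__gamma": [1e-3, 1e-2, 1e-1, 1]},
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_)   # best C and gamma; retrain on the full training set with these values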
Random Forest: Iterative Dichotomiser 3 (ID3) was originally proposed by Quinlan to generate decision tree classifiers [126]. C4.5 was then proposed to overcome some of the limitations of ID3, such as the inability to classify objects with continuous attributes and to deal with missing attributes [127]. However, even with C4.5, the generated decision tree can still easily overfit. Ho introduced the Random Forest method, which constructs a set of decision trees for a classification task to improve accuracy and prevent overfitting [128]. Furthermore, Random Forest has the ability to rank the importance of each attribute and make feature selections [41], [129]. One important advantage of an RF is that it is non-parametric and scale invariant; therefore, the algorithm can handle input data as-is, without feature normalization.
SVM and Random Forest algorithms are selected to be compared since previous studies demonstrated that
these two algorithms could outperform several other classifiers on classifying LIDAR collected point clouds.
SVM has been widely used for segmenting LIDAR collected point clouds of both indoor and outdoor
environments [15], [43], [45], [46], [52]. As pointed out in the previous section, SVM could achieve better accuracy than Nearest Neighbor (NN), k-Nearest Neighbor (KNN), and Naïve Bayesian classifiers in the case of urban scene classification of LIDAR point clouds [45], and the Random Forest classifier could outperform KNN, Neural Network, Boosted Trees, and SVM classifiers on indoor point cloud classification tasks [27]. The accuracy of the classification results was compared to that of a manual classification.
Computational time is a critical factor considering the practical use of the point-cloud labeling system for
the creation of virtual environments and simulations. Thus, the algorithms were also evaluated on their
computational time.
3.3. Object information extraction
3.3.1. Identify individual tree locations
Figure 3. Incorrect shortest path in ATLAS simulation.
Vegetation tends to contain extremely complex geometries and is therefore often difficult to reconstruct
accurately using photogrammetric techniques [9]. This limitation is especially noticeable in data sets that
are reconstructed using aerial photos captured at high altitudes. Such a limitation not only causes the
vegetation visual appearance to be poor in a virtual environment, but it also limits the simulation
functionalities such as computing the shortest path from a start point to a destination. For instance, when
computing the shortest path going through a group of trees in a photogrammetric-generated virtual
environment, the path cannot be accurately computed since the reconstructed tree models appear as a big
solid blob instead of individual trees [9]. The path will be computed to either go over or around the trees
and, as such, both cases are incorrect. Figure 3 demonstrates such a scenario with the abovementioned issues in the previously developed USC-ICT simulation tool, the Aerial Terrain Line-of-sight Analysis System (ATLAS). The blue icon represents a pedestrian starting point and the pin icon indicates his/her
destination. ATLAS then uses the A* algorithm to compute the shortest path between the two points and
visualize the result in green line segments. However, the path shown here is based on inaccurate 3D
geometries, and an A* algorithm would assume that a unit cannot pass the 3D tree meshes, as the 3D mesh
tree canopies are connected to the ground. Thus, the computed path (i.e., green line) is not ideal compared
to the true optimal shortest path (i.e., red line), which would go under trees as most units could in reality.
To address this issue, the author proposed a solution to replace the reconstructed tree geometries with geo-
typical 3D tree models. The 3D tree models needed to be placed at locations where an individual tree is
identified from the reconstructed point clouds. A similar problem has been investigated in previous work
using LIDAR-generated point clouds. Based on the related works, the tree location-identification problem
can be considered as either a model fitting problem [84] or a point cloud-clustering problem [82], [85],
[87]. However, in some cases, a tree trunk would be missing since the captured photos only contain limited
views, while in other cases, the treetop surface may not form a regular parabolic shape with
photogrammetric data. These irregularities make it difficult to fit a model into a tree point cloud accurately.
Thus, the author decided to approach the problem of identifying tree locations by clustering tree points. The
main advantage of using this clustering approach is that it does not assume that tree trunks or treetops need
to exist or need to be of a particular shape. Instead, the clustering algorithm assigns point groups based on
the fact that points belonging to one tree are close to each other.
A two-step clustering process consisting of a connected component labeling and a K-means clustering algorithm is proposed in this study, as shown in Figure 4. In the first step, the tree points are grouped into different connected components, with the constraint that all points in the same component are within a pre-defined Euclidean distance of each other. Since the point cloud was downsampled to ensure uniform point spacing in the previous process, the distance constraint can be set to the downsampled point spacing to compute the connected components. This step tends to produce two types of segmented groups. Group (1) contains only a single tree since no other nearby tree points exist within the distance constraint. On the other hand, group (2) contains multiple nearby trees that form a forest. While group (1) can be identified as an individual tree directly, further clustering is needed to identify each tree inside the forest for group (2).
A K-means clustering algorithm was used to further extract clusters from group (2), using the cluster centers as tree locations. Compared to a traditional k-means clustering algorithm, the proposed clustering
method has a constraint that requires the cluster center to be a point in the tree point cloud. One essential
step for k-means clustering is to determine the k value (i.e., the number of clusters). This parameter needs
to be estimated since no prior information exists on how many trees exist in a group. The authors proposed
to find the ideal k value iteratively based on a pre-defined distance threshold. Starting from an initial k
value, the proposed method would iteratively adjust the k value using a bisection search until all points in
every cluster are within the pre-defined threshold of point-to-center distance. An intuitive way to understand
this distance threshold is to constrain the maximum tree width for each tree in the group. Once the tree
clusters are found, additional tree features such as color, tree width, or tree height could be computed from
points in each cluster.
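The sketch below approximates this two-step process: DBSCAN with a minimum cluster size of one stands in for Euclidean connected component labeling, and a bisection-style search adjusts k until every point lies within a maximum tree radius of its cluster center. The parameter values are illustrative, and the constraint that a cluster center must coincide with an actual tree point is not reproduced here.

import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def tree_locations(tree_points, point_spacing=0.5, max_tree_radius=3.5, min_points=15):
    locations = []
    # Step 1: connected components via an epsilon-reachability clustering.
    labels = DBSCAN(eps=point_spacing * 1.5, min_samples=1).fit_predict(tree_points[:, :2])
    for label in np.unique(labels):
        cluster = tree_points[labels == label][:, :2]
        if len(cluster) < min_points:
            continue                                  # too few points to be a tree
        # Step 2: bisection-style search for the smallest k satisfying the radius constraint.
        lo, hi, best = 1, max(1, len(cluster) // min_points), None
        while lo <= hi:
            k = (lo + hi) // 2
            km = KMeans(n_clusters=k, n_init=10).fit(cluster)
            dists = np.linalg.norm(cluster - km.cluster_centers_[km.labels_], axis=1)
            if dists.max() <= max_tree_radius:
                best, hi = km.cluster_centers_, k - 1  # constraint met: try fewer clusters
            else:
                lo = k + 1                             # clusters too wide: need more
        if best is None:
            best = [cluster.mean(axis=0)]              # fall back to a single center
        locations.extend(np.asarray(best))
    return np.array(locations)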
Figure 4. Workflow of individual tree locations identification.
3.3.2. Building footprints extraction
Photogrammetric-generated point clouds/meshes are often too dense for real-time use in simulations [130].
In a simulation, thousands of triangular surfaces are usually generated for a flat building exterior wall, which requires enormous amounts of computing power to process. Thus, in many cases, simplified building
models are preferred for creating a virtual environment. One way of reconstructing such simplified building
models is to extract building footprints and related features. In this research, a building-footprint extraction process is designed as shown in Figure 5. The process mainly consists of three steps: (1) a roof-extracting process, (2) a noise-filtering process, and (3) a boundary-extracting process. Note that since roofs are
extracted and the ground points were classified in the previous process, the height information of each
footprint can be easily derived.
Figure 5. Workflow of building footprints extraction.
This study considers the problem of segmenting roofs from walls in a classified building point cloud as a
ground extraction problem. The roofs can be considered as ground when the building point cloud is rotated 180 degrees (turned upside down). The author proposes to adopt the progressive morphological filtering algorithm for extracting roofs since it has the capability to segment ground isolated by walls or buildings (e.g., courtyards). Note that a noise-reduction process is recommended before applying the progressive morphological filtering algorithm; the noise (e.g., isolated points) can be eliminated by thresholding the point density.
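Conceptually, this roof-extraction step can be sketched as flipping the classified building points and reusing a point-cloud ground filter such as the progressive morphological filter sketched earlier; the function below is an illustration of that idea, not the exact implementation.

import numpy as np

def extract_roof(building_points, ground_filter):
    """Flip the building points upside down so that roofs play the role of ground,
    then reuse any ground filter (e.g., progressive_morphological_ground) as a roof mask."""
    flipped = building_points.copy()
    flipped[:, 2] = -flipped[:, 2]   # equivalent to rotating 180 degrees about a horizontal axis
    return building_points[ground_filter(flipped)]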
To further process the segmented roofs and extract roof boundaries, a combination of a color classification and a connected component labeling algorithm is recommended to eliminate the misclassified tree points and to further segment individual components (e.g., a rooftop mechanical penthouse) from the main roof. Points whose color has an angular cosine distance near a predefined color (e.g., green) are first removed. The connected component labeling algorithm is then used with the following constraint: a "roof component" is removed if it contains fewer points than a threshold value. This process removes points that belong to a tree but do not have a color similar to the predefined tree color and therefore cannot be removed through color classification. Please note that the color classification process should not be performed in the case where roofs and the surrounding trees have very similar colors, since roof points can also be accidentally removed. The points in each roof component are then projected onto a 2D grid. Following that, points on the boundary are detected using a flood fill algorithm.
Even with the abovementioned filters, the detected boundary may still contain noise. The shortest-path
algorithm, a principal-direction projection algorithm, and an iterative end-point fit algorithm were
compared for further smoothing the extracted roof boundaries. The results indicated that the iterative end-
point fit algorithm might eliminate critical corner points that define a building shape. The shortest-path
algorithm may keep unnecessary points in the computed roof boundaries, and the principal-direction projection algorithm may project a boundary to a principal direction that is far from the actual
boundary location. Thus, the proposed smoothing process combined the use of the shortest path and the
iterative end-point fit algorithms. The shortest-path algorithm was used for the graph that was constructed
using the boundary points. The graph and edge costs were defined as follows. Two points are connected if
the distance between these points is in a pre-defined range (e.g., two meters). The cost for traveling from
one point to its adjacent neighbor point is computed in two parts: (1) computing their Euclidean distance,
and (2) computing the angle between the line segment formed by these two points and its adjacent line
segment. Since most building footprints are polygons with straight lines and 90-degree angles, 90- and 0-
degree angles are preserved while computing the shortest path by assigning them a low-cost value. Finally,
to simplify the extracted footprints and remove points on straight lines, the iterative end-point fit algorithm
was used.
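For reference, a compact version of the iterative end-point fit (Douglas-Peucker) simplification used in this final step is sketched below; the tolerance value is an illustrative assumption.

import numpy as np

def point_line_distance(p, a, b):
    """Perpendicular distance from p to the line through a and b (2D)."""
    if np.allclose(a, b):
        return float(np.linalg.norm(p - a))
    d = b - a
    return abs(d[0] * (p - a)[1] - d[1] * (p - a)[0]) / float(np.linalg.norm(d))

def simplify(boundary, tolerance=0.5):
    """Iterative end-point fit: keep only points deviating more than `tolerance`."""
    boundary = np.asarray(boundary, dtype=float)
    if len(boundary) < 3:
        return boundary
    a, b = boundary[0], boundary[-1]
    dists = np.array([point_line_distance(p, a, b) for p in boundary[1:-1]])
    split = dists.argmax() + 1
    if dists[split - 1] <= tolerance:
        return np.array([a, b])                       # the whole span is nearly straight
    left = simplify(boundary[: split + 1], tolerance)
    right = simplify(boundary[split:], tolerance)
    return np.vstack([left[:-1], right])              # avoid duplicating the split point

print(simplify([(0, 0), (5, 0.1), (10, 0), (10, 7), (0, 7)], tolerance=0.3))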
Aside from extracting building footprints, the roof style is also necessary information that needs to be identified for reconstructing synthetic building models. A process that combines both supervised and unsupervised machine learning techniques has been designed and developed to accomplish this task. The individual roof points segmented as described above are given as the input. The descriptors of each roof are computed using a K-means algorithm, which clusters all points in a roof into different clusters. The descriptor of a roof is a vector that contains the percentage of points that fall in each cluster. Intuitively, the descriptor represents the percentage of points that fall on the edges, the leaders, etc. Finally, the Random Forest algorithm was utilized for the classification process.
3.3.3. Terrain surface materials classification
As discussed earlier, identifying terrain surface materials in a virtual environment is crucial for creating a realistic simulation. The hypothesis behind this study is that materials can be identified based on the 2D textures/patterns rather than the 3D geometric shapes of an object. The designed framework is shown in Figure 6. 3D meshes are procedurally/algorithmically projected/rendered onto different 2D planes (images). Following that, the ground material classification process is performed on the rendered images. The expected outcome of this process is a vector map that contains the ground material information and can be used in a simulation.
Figure 6. Workflow of ground material classification.
This study is intended to adapt the state-of-the-art deep learning technique (i.e., the Convolutional Neural Network-CNN) for image classification, since previous works have shown that CNNs outperform other techniques on many image classification related tasks [106]-[110]. The main challenge in adopting a CNN for the ground material classification task is that the training process is time-consuming due to its high computational complexity. For instance, training a CNN classifier can take a few hours or even days depending on the size of the training database and the selected CNN architecture. One way to overcome such a challenge is to fine-tune a pre-trained classifier using a small subset of the newly collected data. This study first focused on creating a suitable training database to improve the generalizability of the selected classifier. The classifier is first trained using the database. Following that, the performance of the classifier with and without the fine-tuning process was evaluated.
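A hedged sketch of this fine-tuning strategy with tf.keras is shown below: a pre-trained backbone is frozen, a new three-class head (dirt, grass, road) is attached, and only the head is retrained on a small set of mesh-rendered patches. The backbone choice (ResNet50), directory name, image size, and hyperparameters are illustrative assumptions, not the configuration used in this study.

import tensorflow as tf

num_classes = 3                                   # dirt, grass, and road
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(224, 224, 3))
base.trainable = False                            # freeze the pre-trained weights

model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Lambda(tf.keras.applications.resnet50.preprocess_input),  # backbone-specific scaling
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Fine-tune the new head on a small set of mesh-rendered material patches (hypothetical directory).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "rendered_material_patches/", image_size=(224, 224), batch_size=32)
model.fit(train_ds, epochs=5)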
3.4. Experimental Results
Experiments were conducted to (1) analyze the performance of different point descriptors and supervised
machine-learning algorithms for segmenting photogrammetric-generated point clouds; (2) validate the
proposed individual tree-location identification and building-footprint extraction processes; and (3) evaluate the material classification process with and without a fine-tuning process.
3.4.1. Testbeds selection and data collection
University Park Campus of University of Southern California (USC) and Muscatatuck Urban Training
Center (MUTC) were selected as the testbeds. USC is located two miles south of downtown Los Angeles.
The campus is surrounded by Jefferson Boulevard, Figueroa Street, Exposition Boulevard, and Vermont
Avenue. The campus covers 308 acres comprising buildings, trees and grassland, and paved ground. Most buildings on campus are classrooms, research labs, auditoriums, and residential buildings. Buildings have an average height of 5-6 floors, with various appearances, colors, and shapes, making the point cloud/mesh classification and segmentation process challenging. According to the Los Angeles Region Tree Canopy Map (data updated in 2011), captured by the LAR-IAC program, approximately 20% of the USC campus is covered by tree canopy. MUTC is a real city that includes built physical infrastructure, a well-integrated
cyber-physical environment, an electromagnetic effects system and human elements. Imagery data were
collected for 480 acres that contain 68 major buildings covering 850,000 square feet of floor space (including school structures, religious structures, a hospital, light industrial structures, single-family and dormitory-type dwellings, and administrative buildings) and an extensive, searchable/maneuverable, instrumented utility tunnel system. In addition, the training center has more than 9 miles of roads and streets and some engineered rubble piles. Note that MUTC is larger than USC but fewer images were collected, which adds another layer of difficulty to the point cloud classification process.
The images were captured with a small unmanned aerial vehicle (UAV). In both cases, a Phantom 4 Pro
that is manufactured by DJI was used to carry out the data collection operation. Note that the flight planner, i.e., the Rapid Aerial PhoTogrammetric Reconstruction System (RAPTRS), was used for the flight path planning. RAPTRS was designed under the OWT project for imaging large areas across multiple flights.
Information that is necessary for computing a flight path includes: (1) a bounding box for the area of interest;
(2) camera orientation; (3) flight altitude; and (4) desired overlap between images. Details of RAPTRS can
be found at [9]. Camera orientation and overlap between images were set to 45 degrees forward and 70%
respectively for both cases. Flight altitude for USC was set to 70 meters, and the flight altitude for MUTC
was set to 100 meters. The point clouds were generated with ContextCapture (i.e., a photogrammetry
software). The point clouds were down-sampled to 3.8 million points (0.5-meter point spacing) and 3.4
million points (0.5-meter point spacing) for USC and MUTC, respectively. Figure 7 shows the
photogrammetric-generated point clouds for USC and MUTC.
(a)
(b)
Figure 7. Photogrammetry generated point clouds: (a) USC; and (b) MUTC.
3.4.2. Prototype development and mesh segmentation
To validate the proposed semantic terrain points labeling framework, a prototype was designed and
implemented. The framework for point cloud classification was implemented using C++ and Python. To efficiently process the point clouds, the Point Cloud Library (PCL) was used to perform point cloud downsampling and feature extraction. For the classification process, the SVM and RF algorithms were implemented using the Scikit-learn library. Both classification algorithms were implemented with parallel processing; thus, a multi-core CPU was used to accelerate both the training and testing processes. The proposed tree-location identification and building-footprint extraction methods were implemented in Python 2.7. The CNN model for ground material classification was implemented using TensorFlow.
As previously mentioned, 3D meshes are needed for creating a virtual environment and simulations instead
of point clouds. In addition to producing a point cloud, the photogrammetric technique can also produce a
corresponding mesh in the same coordinate system. With this feature, 3D meshes can be easily segmented
based on point-cloud segmentation results. The desired label is assigned to a mesh vertex if its closest point in the point cloud is within a distance threshold. Following that, based on the vertex labels, a mesh edge is kept if both of its vertices have been assigned a label; otherwise, it is eliminated. The distance threshold
is based on the point spacing used in the point-cloud downsampling process. This distance threshold is
further discussed in the experimental results section.
3.4.3. Quantitative and qualitative analysis of the designed framework
To evaluate the performance of the top-level object and ground surface material classification process, the
precision and recall of the results were compared to those from a manual classification. Four straightforward metrics were designed to quantify the results of the proposed tree location identification and building footprints extraction approaches. The two metrics used for quantifying the resulting tree locations are (1) precision and recall of the recovered trees and (2) the average error distance of the identified tree locations. Tree locations are manually identified based on the generated point clouds. Average error distances are computed only on the recovered trees. The resulting building footprints are analyzed similarly. First, precision and recall of the recovered buildings are analyzed. Following that, mismatched footprint covered areas are quantified as shown in
Figure 8 where green lines represent the actual footprint of a building, red lines represent the detected
footprint, blue areas are covered by the detected footprint but not covered by the actual footprint (CD), and
yellow areas are covered by the actual footprint but not covered by the detected footprint (CA). The average
of CD and CA are used to analyze the proposed building footprint extraction process.
Figure 8. Mismatched footprint covered area.
3.4.4. Point descriptor ranking
The proposed point descriptors introduced in Section 4.1.2. were first compared based on the point-cloud
classification results using the SVM algorithm. The USC data set was selected for the comparison study.
Twenty percent of the points were used for the training process, and 80% of the data was used for the testing
process. Classified classes include (1) ground; (2) buildings; and (3) trees. The confusion matrixes of the
classification results are shown in Tables 1, 2, 3, 4, and 5 for color-based, density-based, local surface-
based, open source data-based, and texture-based descriptors, respectively. Please note that points belonging to small objects such as light poles, fences, cars, and so forth were excluded from the data set.
These small objects were used for the classifier-selection experiment, and they are further discussed in
Section 4.3.5. The results indicate that tree and building classification results could achieve the highest
accuracy (i.e., 0.83 and 0.89 respectively) with the color-based descriptors, and the ground classification
result could achieve the highest accuracy (i.e., 0.92) with the density-based descriptors. The tree
classification result shows the lowest accuracy (i.e., 0.3) using the open-source data-based descriptors,
because no information currently exists on tree locations in the Open Street Map (OSM) data, and all the
descriptors are related to the location of buildings and roads. It is worth pointing out that points that belong
to buildings and trees tend to be misclassified as ground when using color-based and texture-based
descriptors, and only 2% to 4% of the points were misclassified between building and tree classes.
Table 1. Color-based descriptors. Table 2. Density-based descriptors.
Table 3. Local surface-based descriptors. Table 4. Open source data-based descriptors.
Table 5. Texture-based descriptors.
The classification result using all the point descriptors was then compared with the classification results obtained by excluding one type of point descriptors at a time during the training and classification process. The confusion matrix of the classification results using all point descriptors is shown in Table 11, and the confusion matrixes of the classification results excluding one descriptor type are shown in Tables 6 to 10. A similar pattern was found in the results of this experiment compared to the previous one. Building classification accuracy dropped from 93% to 90%, and tree classification accuracy dropped from 92% to 89%, when the color-based point descriptors were excluded from the process. Ground classification accuracy dropped from 96% to 89% when the density-based point descriptors were excluded. In other words, the color-based point descriptors contributed the most to
building and tree classification, while the density-based point descriptor contributed the most to ground
classification.
Table 6. Excluding texture descriptors. Figure 9. Excluding texture descriptors.
Table 7. Excluding open source descriptors. Figure 10. Excluding open source descriptors.
Table 8. Excluding local surface descriptors. Figure 11. Excluding local surface descriptors.
Table 9. Excluding density descriptors. Figure 12. Excluding density descriptors.
Table 10. Excluding color descriptors. Figure 13. Excluding color descriptors.
Table 11. Using all descriptors. Figure 14. Using all descriptors.
To better understand the contribution of the proposed point descriptors in the classification process and
how they may affect the later process of creating a virtual environment and simulation, the results were
then compared visually. A section of the USC data set classification results is shown in Figures 9 to 14.
Ground, building, and tree points are shown in blue, green and red, respectively. Figure 13 shows the
classification result without using the color-based point descriptors. Most of the building points
misclassified as tree points are located on building edges, and most of the tree points misclassified as
building points are located on the sides of the trees. This is because the point descriptors, with the exception of the color-based descriptors, have similar values for points on building edges and trees, and for walls and the sides of trees. For instance, a local surface descriptor such as curvature has a similar value for points that
are on the edge of a building and on trees. Thus, these points were difficult to differentiate without color
descriptors. Figure 10 shows the classification result without using the open source data-based descriptors.
Most of the misclassified ground points are located near the center of the roofs. This is because the open
source data-based descriptors contain information on the closeness of a point to its nearest building and
whether a point is inside of the OSM building footprints. Such information is useful to differentiate roof
points from ground points regardless of similar descriptors such as planarity, curvature, and color etc.
Using the Random Forest algorithm, the descriptors were ranked and sorted based on the information gain that was computed when training the classifier. A total of 172 point descriptors were computed, including 33 color descriptors, 15 density descriptors, 40 local surface descriptors, 4 open source data descriptors, and 80 texture descriptors. The top 20, 40, and 80 descriptors were analyzed, which constitute 11.6%, 23.2%, and 46.4% of the total descriptors, respectively. The top 80 descriptors were selected since their feature importance values computed from RF are greater than the mean importance of all the descriptors. The top 20 and 40 descriptors are also reported here to show more details on how the descriptors were selected from each category. Percentage of point descriptors in the top 20, 40, and 80
descriptors are shown on the left in Figures 15 to 17. The percentage of point descriptors selected from
each category for the top 20, 40, and 80 descriptors are shown on the right in Figures 15 to 17. The results
show that color, density, and the open source data-based descriptors include more information than the local
surface and texture-based descriptors for classifying a point cloud.
Figure 15. The top 20 descriptor rankings using Random Forest.
Figure 16. The top 40 descriptor rankings using Random Forest.
Figure 17. The top 80 descriptor rankings using Random Forest.
Texture descriptors cannot provide as much information as the other descriptors, mainly for the following two reasons. First, textures were extracted from the orthophoto and projected onto the 3D point cloud.
Points on walls and sides of trees received the texture descriptors from the edges of roofs and boundaries
of treetops. Second, texture descriptors have low resolution compared to the original orthophoto, since they
were extracted from the last layer before the fully connected layer in the CNN model. The texture map, which has a size of 9 by 9 pixels, was generated from an orthophoto that covers a 3 m by 3 m area. Further research on generating a higher-resolution texture map and on rendering images from different sides of objects for texture extraction is still needed. It is worth noting that the segmented point clouds will be used as a
reference to segment the photogrammetric-generated 3D meshes for creating a virtual environment.
Misclassified roof points shown in Figure 14 appear as holes on the segmented building meshes. Thus, the
proposed ground extraction process was evaluated through the classifier selection experiment and will be
discussed in Section 5.2.
3.4.5. Selection of classifiers
In this experiment, classes that were classified include (1) ground; (2) buildings; (3) trees; and (4) others,
which include points that belong to light poles, fences, cars, and so forth. The ground was classified using
the proposed ground extraction approach previously discussed. Classification results for the USC data set
are shown in Figure 18. The ground, buildings, trees, and others are marked with blue, green, yellow, and
red, respectively. The confusion matrixes of the classification results are shown in Tables 12 and 13. The
accuracy of the ground extraction using the proposed approach (92%) was lower than using the SVM
algorithm (96%) as shown in Tables 12 and 13. However, no roof points were misclassified as ground as
shown in Figure 18 (b) and (c). The points misclassified as ground are located at the bottom of buildings
and trees. The ground extraction approach outperformed the SVM classification approach when considering
the later process of mesh segmentation and the creation of a virtual environment. The proposed approach creates fewer artifacts (i.e., holes in the roofs), which are less noticeable in a virtual environment and simulation.
Table 12. Confusion Matrixes of using RF classifier.
Table 13. Confusion Matrixes of using SVM classifier.
(a)
(b)
(c)
Figure 18. Classification results of USC data set: (a) ground truth; (b) classified with SVM; and (c)
classified with Random Forest.
The SVM and Random Forest algorithms produced very similar results with an overall accuracy of 0.92
and 0.91, respectively. The SVM algorithm outperformed the Random Forest algorithm for classifying
“buildings” and “others.” Note that the accuracy for classifying “others” with both the SVM and Random Forest algorithms is quite low (i.e., 0.57 when using SVM and 0.5 when using Random Forest). This was
due to the following two reasons: (1) not enough point data exists in the training data set for “others”; thus, the number of points was not balanced in the training set, which means that “others” contains far fewer data points than other categories such as buildings; and (2) the different objects contained in “others,” such as cars and light poles, are not similar in shape, color, or texture. The computation time for
training an SVM classifier and a Random Forest classifier was 446 s and 238 s, respectively. The
computation time for using the SVM classifier and the Random Forest classifier to classify unseen data was
1447 s and 186 s, respectively.
Figure 19. Classification results of MUTC data set: (a) ground truth; (b) classified with SVM; and (c)
classified with Random Forest.
Classification results for the MUTC data set are shown in Figure 19, and the confusion matrixes of the
classification results are shown in Tables 14 and 15. The overall accuracies of the SVM and Random Forest algorithms are 0.90 and 0.89, respectively. The SVM algorithm outperformed the Random Forest algorithm
for classifying “buildings” but underperformed the latter for classifying “others.” The computation time for
training an SVM classifier and a Random Forest classifier is 199 s and 65 s, respectively. The computation
time for using the SVM classifier and the Random Forest classifier to classify unseen data is 346 s and 60
s respectively. In both the USC and MUTC experiments, the results indicate that the SVM classifier slightly
outperforms the Random Forest classifier on classification accuracy. However, the running time for the
Random Forest algorithm was much shorter than SVM. Thus, the SVM is recommended for an autonomous
application in which training data preexist and cannot be altered. The Random Forest classifier, on the other
hand, is recommended for an interactive application in which users can correct some of the misclassified points and perform the classification process again to achieve better accuracy.
Table 14. Confusion Matrixes of using RF classifier.
Table 15. Confusion Matrixes of using SVM classifier.
3.4.6. Identify individual tree locations
The proposed tree-location identification process was performed on the MUTC data set. The manually
segmented ground truth tree points were used as the input. Each step of the proposed individual tree-location
identification process is shown in Figure 20. The clusters generated using the connected component
algorithm are shown in Figure 20 (a), in which each yellow bounding box represents one cluster. The
identified tree locations are shown in Figure 20 (b) in which each white point represents a tree location.
Figure 20 (c) and (d) show the simulation environment generated using the segmented MUTC meshes without and with tree replacement, respectively. In Figure 20 (d), the mesh trees are replaced with geo-
typical 3D tree models using the identified tree locations and related features. The green lines in Figure 20
(c) and (d) comprise the shortest paths from a blue unit to its destination. The shortest path shown in Figure
20 (c) is incorrect since a path cannot penetrate 3D mesh trees. With the proposed tree replacement process,
the shortest path can be correctly found in Figure 20 (d). The average tree width was set to 7 meters for the
k-means algorithm, and the minimum number of points was set to 15 for the connected component
algorithm.
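The following minimal Python sketch illustrates this two-stage clustering, assuming the segmented tree points are available as an N x 3 NumPy array. A Euclidean clustering step (DBSCAN here, standing in for the connected component algorithm) forms clusters, and the number of trees per cluster is estimated from the cluster extent divided by the assumed 7 m tree width; the eps value and other parameter names are illustrative rather than the exact settings used in the prototype.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def identify_tree_locations(tree_points, tree_width=7.0, min_points=15):
    """Approximate two-stage tree-location identification.

    tree_points: (N, 3) array of points labeled as vegetation.
    Returns an (M, 3) array of estimated tree locations."""
    # Stage 1: Euclidean clustering (stand-in for the connected component step).
    labels = DBSCAN(eps=2.0, min_samples=1).fit_predict(tree_points[:, :2])
    locations = []
    for label in np.unique(labels):
        cluster = tree_points[labels == label]
        if len(cluster) < min_points:                 # drop small artifacts
            continue
        # Stage 2: estimate the tree count from the cluster footprint and split with k-means.
        extent = cluster[:, :2].max(axis=0) - cluster[:, :2].min(axis=0)
        k = min(len(cluster), max(1, int(round(extent.max() / tree_width))))
        centers = KMeans(n_clusters=k, n_init=10).fit(cluster[:, :2]).cluster_centers_
        for cx, cy in centers:
            # Use the lowest cluster elevation as an approximate trunk-base height.
            locations.append([cx, cy, cluster[:, 2].min()])
    return np.array(locations)
```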
Figure 20. Tree location identification: (a) clustered points; (b) individual tree locations; (c) MUTC
data set in simulation environment; and (d) MUTC data set in simulation environment with tree
replaced.
The identified tree locations using the proposed approach are compared with the manually identified tree
locations from the tree point cloud. The precision, recall, and average error distance were procedurally
computed to evaluate the results. Figure 21 shows the correctly found tree locations in blue (true positive),
incorrectly found tree locations in green (false positive), and missed trees in red (false negative). A true positive is a tree location in the results whose nearest ground-truth tree location is within 7 meters. A false positive is a tree location in the results with no ground-truth tree location within 7 meters. A false negative is a ground-truth tree location with no result tree location within 7 meters. Seven meters was used here because it is the assumed tree
width in the tree-location identification process. The confusion matrix of the tree-locations identification
result is shown in Table 16. The results have a 70% precision and 84% recall. The average error distance
of the recovered tree locations was 1.91 m. It is worth noting that the point cloud classification accuracy
also affects the tree location identification result. When the connected component algorithm clusters the points, any cluster that contains fewer points than a threshold (15 in this case) is removed. This removes small artifacts (misclassified building points) from the classified tree points. However, when the number of misclassified building points in a cluster is greater than the threshold, an incorrect tree location will be computed. Therefore, improving the point cloud classification accuracy can also improve the tree location identification result.
Figure 21. Result of tree locations identification.
Table 16. Confusion matrixes of tree locations identification.
3.4.7. Building footprints extraction and roof styles classification
The proposed building-footprint extraction process was performed on the USC data set. Figure 22 (a)
shows the segmented building points of this data set after removal of isolated points (noise). Roof
patches were extracted from the segmented building points using progressive morphological filtering and
connected component algorithms. The extracted roof patches are shown in Figure 22 (b). Roof boundaries
were detected from the 2D projection of the building points, and the building heights were computed by
taking the difference between the elevation of the surrounding classified ground points and the elevation of
the highest point in a roof patch. The extracted roof boundaries were extruded with the height information
as shown in Figure 22 (c). Figure 22 (d) shows the textured model generated based on the extracted roof
boundaries and identified tree locations. The model was created with Conform, a 3D geospatial software
for fusing, visualizing, editing, and exporting 3D environments for urban planning, simulations, and games.
Figure 22. Building footprints extraction: (a) classified buildings; (b) extracted roofs; (c) extruded
building footprints; and (d) textured model.
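As a minimal sketch of the height computation described above, assuming the roof patch and the classified ground points are available as NumPy arrays, the building height can be taken as the difference between the highest roof point and the elevation of the surrounding ground; the 10 m search radius and the use of the median ground elevation are illustrative assumptions rather than the exact settings used in this work.

```python
import numpy as np
from scipy.spatial import cKDTree

def building_height(roof_points, ground_points, search_radius=10.0):
    """Estimate a building height as the difference between the highest roof point
    and the elevation of the classified ground points surrounding the roof patch."""
    roof_xy = cKDTree(roof_points[:, :2])
    # Distance (in plan view) from every ground point to its nearest roof point.
    dist, _ = roof_xy.query(ground_points[:, :2], k=1)
    nearby_ground = ground_points[dist <= search_radius]
    if len(nearby_ground) == 0:
        raise ValueError("no classified ground points found near the roof patch")
    ground_elevation = np.median(nearby_ground[:, 2])
    return roof_points[:, 2].max() - ground_elevation
```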
The extracted footprints were projected onto the XY plane for quantitative analysis. One hundred and
twenty-three buildings were correctly found, five buildings were incorrectly found, and fourteen buildings
were missed. Most of the small buildings were missed (e.g., mobile office trailers and one-story buildings)
because a height threshold (i.e., 3 meters) was used to eliminate tree points that were misclassified as
buildings. Please note that low point cloud classification accuracy will increase the number of buildings
that were incorrectly found. The algorithm will consider misclassified tree points as buildings if they cannot be eliminated using the height threshold. As a result, an incorrect building footprint will be computed around the misclassified tree points. Figure 23 shows the extracted footprints overlaid with the
manually created footprints for a section of the USC data set. The manually created footprints are shown in
green lines, with the extracted footprints in red lines. The areas covered by the detected footprints but not
covered by the manually created footprint (CD) are shown in blue. The areas that are covered by the actual
footprints but not covered by the extracted footprints (CA) are shown in yellow. It is notable that buildings
A, B, C, D, E, F, and G have gable roofs. The extracted footprints for these buildings are slightly larger
than the actual buildings because of the roof style. The building parts not covered by the extracted footprints
were due to a different elevation of roof patches than the main roof. These roof patches were relatively
small and eliminated during the noise-reduction process. The average area of CD was 56 m², and the average area of CA was 26 m².
Figure 23. Extracted footprints overlaid with manually created footprints for a section of the USC data set.
Roof styles were then identified for each of the roof patches that were extracted from the previous process.
Roof styles include flat roof, flat roof with AC duct installed on top, gable roof, and hip roof. Color-based,
density-based, and local surface-based point descriptors were computed for each point in all roof patches.
Following that, the points in each roof patch were clustered into 40 clusters using a k-means algorithm with
the computed point descriptors. Roof descriptors were created as the percentage of points that were in each
of the 40 clusters. Roofs were then classified into one of the predefined styles using the Random Forest
algorithm. Figure 24 shows the roof-style classification result in which flat roof, flat roof with AC duct,
gable roof, and hip roof are represented with blue, green, yellow, and red, respectively. Table 17 shows the
confusion matrix of the roof styles classification result. The result had 92.4% precision and 81.2% recall.
Most misclassifications were related to (1) the roof size and (2) the data quality of the roof: many small roofs and roofs with missing data were misclassified.
Figure 24. Roof style classification for USC.
Table 17. Confusion matrixes of roof style classification.
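A minimal sketch of the roof-descriptor construction and classification described above, assuming the per-point descriptors of each roof patch are stacked in NumPy arrays: a shared k-means codebook with 40 clusters is fit once, each roof is represented by the percentage of its points in each cluster, and a Random Forest assigns the style. The hyperparameter values and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def fit_codebook(roof_patches, n_clusters=40):
    """Fit a shared k-means codebook over the per-point descriptors of all roofs.
    roof_patches: list of (N_i, D) arrays, one per roof patch."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit(np.vstack(roof_patches))

def roof_descriptor(patch, codebook):
    """Represent one roof as the percentage of its points in each codebook cluster."""
    labels = codebook.predict(patch)
    return np.bincount(labels, minlength=codebook.n_clusters) / len(patch)

# Usage sketch, assuming labeled training roofs (train_patches, train_styles) and
# unlabeled test roofs (test_patches) are available:
# codebook = fit_codebook(train_patches)
# X_train = np.array([roof_descriptor(p, codebook) for p in train_patches])
# X_test = np.array([roof_descriptor(p, codebook) for p in test_patches])
# clf = RandomForestClassifier(n_estimators=100).fit(X_train, train_styles)
# predicted_styles = clf.predict(X_test)
```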
3.4.8. Ground surface material classification
The proposed ground material classification framework was tested using the MUTC data set. A ground material database was created using the photogrammetric-generated meshes from USC, Fort Drum Army Base, the Camp Pendleton Infantry Immersion Trainer (IIT), and 29 Palms Range 400. The classes used for ground material classification include (1) bare soil; (2) road; and (3) vegetation. Orthophotos were first rendered from the generated 3D meshes. Figure 25 shows the rendered orthophotos for the above-mentioned data
sets. The process for creating the ground material database is shown in Figure 26. The rendered orthophotos
were manually segmented into the predefined classes. Following that, the segmented orthophotos were
cropped into small image patches. A class label was then assigned to each of the cropped image patches based on the ground materials the image patch contained: the assigned label was the ground material to which the majority of the pixels belonged. The created ground material database contained
18,000 image patches (i.e., 6,000 image patches in each of the predefined classes).
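A minimal sketch of the cropping and majority-vote labeling step, assuming the manually segmented orthophoto is available as a 2D array of per-pixel class indices; patch_size and stride would correspond to the 5 m coverage and 3 m spacing converted to pixels at the orthophoto resolution, which is an assumption for illustration.

```python
import numpy as np

def crop_and_label(segmented_ortho, patch_size, stride, class_names):
    """Crop a segmented orthophoto into patches and give each patch the majority class.

    segmented_ortho: (H, W) array of per-pixel class indices.
    Returns a list of ((row, col), label) pairs, one per patch origin."""
    patches = []
    height, width = segmented_ortho.shape
    for r in range(0, height - patch_size + 1, stride):
        for c in range(0, width - patch_size + 1, stride):
            window = segmented_ortho[r:r + patch_size, c:c + patch_size]
            counts = np.bincount(window.ravel(), minlength=len(class_names))
            patches.append(((r, c), class_names[int(np.argmax(counts))]))
    return patches
```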
Figure 25. Mesh rendered orthophoto: (a) USC; (b) Fort Drum Army Base; (c) The Camp Pendleton
Infantry Immersion Trainer; and (d) 29 Palms Range 400.
Figure 26. The process of creating ground material database.
The GoogLeNet architecture [108] was selected for the convolutional neural network (CNN) in this study. The
time for training the CNN classifier with the created database was 12 hours. The trained CNN classifier
was then used to classify the rendered and cropped image patches from the MUTC 3D meshes. The
classification result was stored on a vector map in which each point represents the center of each cropped
image patch. It is worth pointing out that the resolution of a computed ground material vector map depends on the distance between two adjacent cropped image patches, not on the size of the area covered by each image patch. In this study, the distance between two adjacent cropped image patches was set to 3 meters, and each cropped image patch covered a 5 m by 5 m area.
Figure 27. Ground Material Classification Result: (a) MUTC Dataset; (b) ground material vector map
without fine tuning; and (c) ground material vector map with fine tuning.
An experiment was conducted to evaluate the ground material classification approach with and without a fine-tuning process. Fine-tuning re-trains the fully connected layers of the CNN model using a small set of manually classified image patches from the data set that will be classified. Five percent
of the rendered orthophotos were manually segmented and used for the fine-tuning process. The generated
ground material vector map of the MUTC without and with the fine-tuning process is shown in Figure 27
(b) and (c), respectively. Bare soil, road, and vegetation are shown as blue, green and red, respectively. The
confusion matrixes of the classification result without and with the fine-tuning process are shown in Tables
18 and 19, respectively. It is clear that the fine-tuning process can effectively improve the classifier
performance for classifying each ground material (i.e., it improved the classification accuracy by 42% for bare soil, 1% for road, and 10% for vegetation). The fine-tuning process took 13 minutes, and
the time for the classification process was 15 minutes in both cases.
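An illustrative PyTorch sketch of this fine-tuning step (the training framework actually used in this work may differ), assuming the manually classified patches are organized in an ImageFolder-style directory; the pretrained GoogLeNet backbone is frozen and only the final fully connected layer is re-trained. The directory path, batch size, learning rate, and epoch count are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models, datasets, transforms

# Load an ImageNet-pretrained GoogLeNet and freeze the convolutional backbone.
model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT)
model.aux_logits = False                      # use only the main classification head
model.aux1, model.aux2 = None, None
for param in model.parameters():
    param.requires_grad = False
# Replace the final fully connected layer for the three ground material classes.
model.fc = nn.Linear(model.fc.in_features, 3)

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
data = datasets.ImageFolder("fine_tune_patches/", transform=transform)  # hypothetical path
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
model.train()
for epoch in range(5):                        # a few epochs suffice for fine-tuning
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```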
Table 18. Confusion matrixes of Ground Material Classification without Fine tuning.
Table 19. Confusion matrixes of Ground Material Classification with Fine tuning.
Figure 28 (a) and (b) show the shortest path computed from the blue unit to its destination without and with using the ground material vector map. The A* algorithm was used to compute the shortest path. The edge weights of the navigation mesh were adjusted based on the ground material vector map: edge weights for roads were set to a low value (0.2), and the edge weights for bare soil and vegetation were set to a high value (1.0). This example illustrates that with the ground material classification process,
the vehicle trafficability can be considered during the path finding simulation in a created virtual
environment.
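A minimal sketch of this weighting scheme, assuming the ground material vector map has been rasterized into a 2D grid of material names; NetworkX is used purely for illustration, with one node per cell and edge weights derived from the material costs described above.

```python
import networkx as nx

# Per-material traversal cost, matching the edge weights described above.
MATERIAL_COST = {"road": 0.2, "bare_soil": 1.0, "vegetation": 1.0}

def build_weighted_graph(material_grid):
    """material_grid: 2D list of material names, one per navigation cell."""
    graph = nx.Graph()
    rows, cols = len(material_grid), len(material_grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (0, 1)):           # 4-connected neighbors
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    weight = (MATERIAL_COST[material_grid[r][c]]
                              + MATERIAL_COST[material_grid[nr][nc]]) / 2.0
                    graph.add_edge((r, c), (nr, nc), weight=weight)
    return graph

# A* over the weighted grid with an admissible Manhattan-distance heuristic
# (0.2 is the minimum possible edge weight):
# path = nx.astar_path(graph, start, goal,
#                      heuristic=lambda a, b: 0.2 * (abs(a[0] - b[0]) + abs(a[1] - b[1])),
#                      weight="weight")
```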
Figure 28. Path finding in ATLAS: (a) without ground material classification; and (b) with ground
material classification result.
3.4.9. Prototype development and mesh segmentation
A prototype named the “Semantic Terrain Points Labeling System (STPLS)” was built based on the
proposed framework to validate each of the proposed tasks. The ground extraction, top-level object
classification process, tree locations identification process, roof boundaries extraction, roof style
classification process, and mesh segmentation process were implemented. The user interface was
implemented with the Python library Tkinter, as shown in Figure 29.
Figure 29. Designed user interface.
The designed mesh segmentation process was validated using the USC data set. The results are shown in
Figure 30. USC meshes were segmented based on the point cloud segmentation result. The
distance/closeness for selecting mesh vertices was set to 1 meter during the mesh segmentation process. Note that this distance needs to be larger than the down-sampling point spacing (i.e., 0.5 meters) since some misclassified points can fall sparsely on an object (e.g., a few points on building roofs are
misclassified as tree points). If down-sampled point spacing is used for the distance/closeness, the
segmented meshes will contain small holes due to these misclassified points. Furthermore, segmenting trees
from the ground can create holes in the ground meshes in some cases. For instance, some of the tree
canopies are directly connected to the ground in the generated meshes due to the large size of the tree
canopy and the limitation of the photogrammetric technique. Thus, instead of segmenting tree meshes, they
are flattened to the elevation of their closest ground mesh.
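A minimal sketch of the distance/closeness rule for selecting mesh vertices, assuming the mesh vertices and the segmented point cloud are available as NumPy arrays; the actual prototype handles multiple classes and operates on the mesh faces as well, so this single-class version is only illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def building_vertex_mask(mesh_vertices, points, point_labels, building_label, closeness=1.0):
    """Mark mesh vertices that lie within `closeness` meters of any point labeled
    as building; the remaining vertices are treated as ground/other.

    mesh_vertices: (V, 3) vertex positions; points/point_labels: segmented point cloud."""
    building_points = points[point_labels == building_label]
    tree = cKDTree(building_points)
    dist, _ = tree.query(mesh_vertices, k=1)
    return dist <= closeness              # boolean mask of building vertices
```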
Figure 30. Mesh segmentation for USC data set: (a) segmented buildings; and (b) segmented ground.
3.5. Conclusions
A point clouds/meshes segmentation and object information-extraction framework is introduced in this
study. The framework was designed to work with photogrammetric data considering the data
characteristics. Experiments were first conducted to rank different point descriptors. The results showed
that color-based and density-based descriptors contribute the most to the point cloud classification process and that open-source-based descriptors provide useful information. However, only a few open-source-based
descriptors have been designed in this study. More descriptors in this category should be designed and
tested in future research. Low-resolution, texture-based descriptors were extracted from orthophotos using
the GoogLeNet model, and future research should be focused on generating a higher-resolution texture map
using images rendered from different sides of objects to create texture-based descriptors. The performance
of the SVM and RF algorithms for classifying photogrammetric-generated point clouds was then compared.
Although the results showed that both algorithms produced similar results, the running time for both
training and testing an RF model is much shorter than for an SVM model.
The proposed tree-location identification, building-footprint extraction, and mesh-segmentation processes were validated using data sets from the MUTC and USC. The results showed that the proposed processes
could be integrated into an existing virtual environment and simulation creation workflow and enhance
visual quality. However, the proposed object information-extraction processes have the following
limitations that need to be addressed in future research. First, tree locations could be better identified in an
urban area, where trees are planted farther apart than in forest areas. Approaches that can better identify tree
locations in a forest that includes trees of different widths and heights would be an improvement. Second,
tree types should also be identified so that the created virtual environments can better represent the real
world. Third, extracting roof boundaries as building footprints is not accurate in some cases, such as with
hipped or gabled roofs. Often, the overhangs of roof edges extend past the exterior walls for water drainage
and design reasons. Thus, future research should also be focused on extracting building footprints
considering the inconsistency between roof shapes and building exterior walls.
This study also presented a process to classify ground materials using orthophotos. A ground material
database was created by annotating and cropping orthophotos from previous data collections, and a classification network based on GoogLeNet was trained using the database and tested on the MUTC data set.
The results showed that fine-tuning the pre-trained model improves its performance compared to applying the pre-trained model directly, while requiring far less computation than training a new model from scratch.
Chapter 4: Fully Automated Top-level Terrain Elements Segmentation using a Model Ensembling
Framework
4.1. Research Objective
The study discussed in the previous section has demonstrated the capability of using handcrafted point
features (i.e., color-based, point density-based, and local surface-based features) with supervised machine
learning algorithms (i.e., Support Vector Machine and Random Forest) for segmenting top-level terrain
elements. One limitation of using handcrafted point features is that a pre-trained segmentation model using
existing datasets cannot be applied/reused on a newly collected dataset. This is because point clouds with
different quality may yield different values for the handcrafted point features and the quality of a
photogrammetric-generated point cloud highly depends on the parameters used for aerial photo collection
(i.e., flight altitude and overlap between images). In practice, these parameter settings are defined based on
the maximum allowable time for data collection, the workforce talent available, and the equipment available.
Consequently, the previously designed top-level terrain elements segmentation workflow can only be used in
a semi-automatic fashion where training data needs to be manually created every time new point clouds are
generated. To overcome such a limitation, more robust features need to be developed and a generalizable
segmentation model needs to be trained with existing datasets. State-of-the-art deep learning techniques, i.e., Deep Neural Network (DNN) architectures, provide a suitable foundation for building such an automatic pipeline. Previous works have successfully applied DNNs to segmenting 3D data such as outdoor LiDAR point clouds. However, no prior work has applied DNNs to the semantic segmentation of photogrammetric-generated terrain.
Since our datasets contain large-scale point clouds across different geographic regions, training a model that generalizes well to different maps poses additional challenges. Furthermore, photogrammetric-generated point clouds tend to be noisy—in some cases the ground cannot be captured due to dense tree canopies, and vegetation tends to appear as solid blobs instead of individual trees with well-formed branches
[9]—which makes the segmentation task even more challenging than working with LiDAR data. In addition, considering the end use of the segmentation results in a simulation environment, special
handling is required for the artifacts caused by mis-segmentation. As such, the objective of this study is to
examine the potential of using a DNN architecture for top-level terrain elements segmentation and create
an end-to-end photogrammetric data segmentation framework while considering the data characteristics.
To ensure the generalizability of the proposed framework, a large UAV-based photogrammetric database
was annotated and used for validation.
4.2. UAV-based Photogrammetric Database
A large UAV-based, photogrammetric point-cloud database was annotated for creating the ground truth.
The author’s previously developed autonomous UAV-path planning and imagery collection system (i.e.,
RAPTRS) was used for collecting aerial images. RAPTRS provides a user-friendly interface that encodes
photogrammetry best practices. Unlike other commercially available UAV remote-control software,
RAPTRS was designed for collecting aerial images that cover a large area of interest with multiple flights.
Parameters required for operating RAPTRS include a bounding box of the area of interest, flight altitude,
the desired overlap between images, and camera orientation. An optimized flight path can be computed
with these parameters, and the imaging task can be automatically accomplished. For detailed information
on RAPTRS, please refer to [9]. DJI Phantom 3, 4, and 4 Pro were used to acquire all the images, and 3D
point clouds were reconstructed using commercial photogrammetry software (i.e., Bentley ContextCapture).
The database contains a total of 22 data sets that cover approximately 10 square kilometers of area with
about 100 million points. These data sets were collected from different U.S. states for various operational
military purposes. The total number of collected images varied across data sets, ranging from a few hundred to about ten thousand, due to data acquisition time constraints. It is worth noting that since this
study focused on segmenting a large point cloud into top-level terrain elements, a high point density may
not provide the necessary information for differentiating large objects such as buildings and trees. In
addition, point clouds with higher density require a longer processing time. Thus, all point clouds were
downsampled with 0.3-meter point spacing. Furthermore, some desert areas were manually removed since
no man-made objects or vegetation existed. The point clouds were manually labeled with the following
three labels: (1) ground, (2) man-made objects, and (3) vegetation. Note that the ground contains points
consisting of paved roads, bare earth, grass, rocks, and so forth, while man-made objects include points
consisting of buildings, cars, light poles, fences, and so forth. Detailed information on the database is
summarized in Table 20.
ID | Number of points | Ground | Man-made | Vegetation | Area size (sq km) | Number of images | Location (State) | Data source
1 | 620,335 | 68% | 28% | 4% | 0.060 | 918 | Virginia | Marines
2 | 834,092 | 37% | 42% | 21% | 0.061 | 1,970 | California | Civilian
3 | 2,500,757 | 54% | 5% | 41% | 0.188 | 1,876 | California | Navy/Marines
4 | 2,835,042 | 92% | 3% | 5% | 0.340 | 3,494 | California | Navy/Marines
5 | 934,174 | 77% | 19% | 4% | 0.108 | 1,282 | California | Marines
6 | 1,983,098 | 36% | 5% | 59% | 0.124 | 1,268 | North Carolina | Army
7 | 45,916,551 | 51% | 11% | 38% | 4.637 | 6,657 | New York | Army
8 | 1,136,756 | 55% | 5% | 40% | 0.092 | 521 | Virginia | Army
9 | 558,954 | 70% | 11% | 19% | 0.053 | 1,353 | Virginia | Army
10 | 557,873 | 89% | 10% | 1% | 0.069 | 354 | Virginia | Army
11 | 696,695 | 65% | 13% | 22% | 0.062 | 1,247 | Virginia | Army
12 | 3,539,117 | 90% | 0% | 10% | 0.417 | 707 | Virginia | Army
13 | 476,713 | 23% | 65% | 12% | 0.029 | 1,851 | California | Civilian
14 | 1,014,131 | 75% | 7% | 19% | 0.104 | 1,541 | California | Navy/Marines
15 | 8,528,814 | 69% | 17% | 14% | 1.086 | 2,843 | Indiana | Army
16 | 4,778,877 | 42% | 41% | 17% | 0.430 | 1,824 | Florida | Civilian
17 | 4,011,934 | 78% | 22% | 0% | 0.460 | 3,111 | California | Marines
18 | 814,998 | 83% | 17% | 0% | 0.093 | 1,451 | California | Marines
19 | 1,725,449 | 52% | 3% | 44% | 0.145 | 1,230 | Colorado | Air Force
20 | 8,439,512 | 25% | 47% | 28% | 0.671 | 10,588 | California | Civilian
21 | 2,076,160 | 82% | 17% | 0% | 0.241 | 795 | California | Army
22 | 1,451,388 | 77% | 23% | 0% | 0.157 | 5,776 | Arizona | Marines
Table 20. UAV-based Photogrammetric Database.
4.3. Proposed Model Ensembling Framework
Taking the photogrammetric data characteristics into consideration, a simple yet effective model
ensembling framework was designed, as shown in Figure 31. The core concept of this design decision was
to segment different terrain elements into a hierarchical manner by ensembling different segmentation
models sequentially. In the first step, a terrain point cloud was segmented into bare ground and non-ground.
Following that, man-made objects and vegetation could be extracted from the non-ground points. By using
separate models responsible for segmenting different objects, carefully designed post-processing
approaches could be easily integrated. One post-processing approach was designed for improving the
segmentation result after applying the bare ground and non-ground segmentation model. Two post-
processing approaches were designed to not only improve the segmentation accuracy of man-made objects
and vegetation, but to also improve the visual appearance for transferring the result in the 3D mesh format.
In addition, a data preprocessing approach was designed to eliminate the artifacts under the ground, as well
as remove areas that are not within the area of interest but unintentionally reconstructed. Details of each
designed component in the framework are discussed in the following sections.
Figure 31. Model ensembling framework.
4.3.1. Data Preprocessing
Data cleaning: Noises and artifacts often exist in a photogrammetric-reconstructed 3D point cloud due to
the generation of mismatches [131]. The generated mismatches usually appear under the terrain surface
when reconstructing large 3D terrain point clouds, as shown in Figure 32 (a). A noise filter, such as a
connected component algorithm that has been previously used to remove the noise points in the LIDAR-
collected point clouds [132], cannot be directly applied since, in some cases, the undersurface artifacts are
very close to the terrain surface. Training a learning-based model to identify and filter out these
undersurface artifacts is also challenging since they occur at random locations with random shapes, colors,
and sizes. These artifacts can dramatically affect the ground segmentation process, as a ground
segmentation model tries to detect ground points based on their relative height compared to surrounding
points. Thus, a data-cleaning process was designed that utilized the nearest neighbor-filter algorithm and a
generated digital-surface model (DSM) in the point cloud form, as shown in Figure 32 (b). For each point
in the point cloud p_PC, the process first extracted its neighbor points P_DSM in the DSM within a predefined radius r centered at p_PC, using only the x and y values. Following that, p_PC was considered a noise point and was removed from the point cloud if its z value was smaller than any point’s z value in P_DSM. It was
important to define an appropriate r value to remove only the noise points and keep any points on vertical
surfaces of an upside-down, cone-shaped object (e.g., trees) and points on horizontal surfaces below other
objects (e.g., points on the ground under a protruding balcony). We used five meters for the r-value in this
study. A point cloud cleaned with the designed data-cleaning process is shown in Figure 32 (c).
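A minimal sketch of this data-cleaning step, assuming the point cloud and the DSM are available as NumPy arrays; a 2D k-d tree over the DSM gathers neighbors within the radius r, and a point is dropped when it lies below all nearby DSM points (our reading of the rule above).

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_undersurface_noise(points, dsm_points, r=5.0):
    """Remove points lying below every DSM point within radius r (in x-y only).

    points, dsm_points: (N, 3) and (M, 3) arrays. Returns the cleaned point cloud."""
    dsm_tree = cKDTree(dsm_points[:, :2])
    keep = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        idx = dsm_tree.query_ball_point(p[:2], r=r)
        # Below the lowest nearby surface point -> undersurface artifact.
        if idx and p[2] < dsm_points[idx, 2].min():
            keep[i] = False
    return points[keep]
```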
Figure 32. Photogrammetric generated point cloud and DSM: (a) Point cloud with noises; (b) DSM;
and (c) Cleaned point cloud.
Area of interest (AOI) selection: As previous works have suggested, collecting oblique images is
necessary to reconstruct 3D point clouds for mapping terrains with vertical objects (e.g., buildings) [130].
However, since the oblique images capture areas outside of the predefined area of interest (AOI), the
reconstructed 3D point clouds also contain data in unwanted areas. It is worth noting that since images are
captured far away from unwanted areas, sufficient resolution and overlap between images cannot be ensured
with the computed sensor network (i.e., UAV paths). Consequently, 3D data in the unwanted areas are
usually of low quality. Figure 33 (a) shows an example of a portion of a point cloud that was reconstructed
outside of the AOI. Data with such low quality do not meet the virtual simulation requirements and are
considered as noise. Thus, removing the data outside of the AOI is a necessary step. The designed data-
selection approach utilized the fact that data within the AOI are below the positions of the camera where
images were collected. Figure 33 (b) shows a top view of a data set with all camera positions, and Figure
33 (c) shows the selected points within the AOI.
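One way to implement this idea is sketched below, assuming the camera positions recorded during image collection are available: points are kept only if their x-y location falls inside the 2D convex hull of the camera positions. Interpreting "below the camera positions" as this convex-hull test is our assumption for illustration.

```python
import numpy as np
from scipy.spatial import Delaunay

def select_points_within_aoi(points, camera_positions):
    """Keep only the points whose x-y location lies inside the 2D convex hull of
    the camera positions.

    points: (N, 3) reconstructed point cloud; camera_positions: (M, 3) from the flight log."""
    hull = Delaunay(camera_positions[:, :2])          # triangulation of the camera footprint
    inside = hull.find_simplex(points[:, :2]) >= 0    # -1 means outside the hull
    return points[inside]
```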
Figure 33. Selecting data within AOI: (a) Point clouds outside of the AOI; (b) Point clouds and camera
positions; and (c) Point cloud inside the AOI.
4.3.2. Segmentation network
4.3.2.1. 3D U-net with volumetric representation
Ronneberger et al. [133] originally designed U-Net for biomedical image segmentation, and it was
successfully applied to many other 2D image segmentation tasks. The major innovation of U-Net compared
to earlier image segmentation networks is the designed encoder-decoder architecture, which can perform
image segmentation in its original resolution. As discussed in chapter 2, several existing studies have
successfully adapted 3D voxel grids to represent a point cloud and use different CNN architectures for
classification and segmentation tasks. In this study, a simple yet effective 3D encoding and decoding
network architecture similar to 2D U-Net was designed as shown in Figure 34. To fit the 3D data into the
designed 3D U-Net, point clouds need to be converted to voxel grids with the same dimension. A 3D scene
is first divided into equally sized large voxels, where width (W_L), depth (D_L), and height (H_L) are the voxel sizes along the X, Y, and Z axes. Following that, each large voxel is subdivided into a width (W_S) * depth (D_S) * height (H_S) grid of small voxels. Note that large voxels that do not contain any points are eliminated during the voxelization process. For convenience, W_L, D_L, and H_L were defined to be divisible by W_S, D_S, and H_S, respectively, in this study. Note that RGB colors are not used since they vary across datasets that were collected in different seasons under different weather and lighting conditions. The segmentation model therefore relies solely on 3D geometric information: an occupancy value is computed for each cell, as suggested by Huang and You [21], in which a cell that contains points is assigned 1 and 0 otherwise.
Figure 34. 3D U-Net architecture.
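A minimal sketch of the occupancy voxelization described above, assuming the points falling inside one large voxel have already been gathered; the 40 m large voxel and 0.5 m cells mirror the sizes used later in the experiments.

```python
import numpy as np

def occupancy_grid(points, origin, large_size=(40.0, 40.0, 40.0), cell=0.5):
    """Convert the points inside one large voxel into a binary occupancy grid.

    points: (N, 3) points falling inside the large voxel; origin: its minimum corner.
    Returns an array of shape (W_S, D_S, H_S) with 1.0 for occupied cells."""
    dims = tuple(int(round(s / cell)) for s in large_size)     # e.g., (80, 80, 80)
    grid = np.zeros(dims, dtype=np.float32)
    idx = np.floor((points - np.asarray(origin)) / cell).astype(int)
    idx = np.clip(idx, 0, np.array(dims) - 1)                  # guard points on the border
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```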
As illustrated in the architecture in Figure 34, 3D voxels are first fed through several encoding units that
consist of convolution layers and max-pooling layers to extract the feature maps. Low-resolution feature
maps are then upsampled to full resolution through a set of decoding units. Finally, the label is assigned to
each cell within the large voxel through a 1 * 1 * 1 convolution layer. To allow final segmentation to take
place with awareness of the high-level features, short- and long-skip connections are also used in the
designed architecture between the encoding and decoding units. Batch-normalization layers are added
between each convolution layer to improve training stability and efficiency. Since using the model to make
predictions on new unseen data is the ultimate goal, a dropout layer with a 50% drop-out rate is applied
after each max-pooling layer to avoid overfitting.
4.3.2.2. Data augmentation
A large amount of training data is usually required to train a deep neural network, and data augmentation
is a powerful way of creating more data to overcome the limited data problem. Data augmentation, in
general, will improve model performance and prevent overfitting. Commonly used data augmentation
strategies for 2D images are flipping, rotating, scaling, cropping, and translating. However, not all of these
data augmentation strategies can be applied in the 3D domain. Unlike 2D image data in which objects do
not have real dimensions, generated 3D point clouds using aerial images with GPS information contain
objects with actual dimensions. Thus, scaling will produce 3D data with inappropriate dimensions that may
negatively affect the training process. Furthermore, since the 3D point clouds we considered are roughly
aligned upwards (i.e., the direction of physical gravity is roughly downwards), flipping should not be used.
In this study, a straightforward 3D data augmentation strategy was designed that simultaneously performs rotation, translation, and cropping. Intuitively, we first fixed the orientation and location of the 3D
voxel grids that were constructed with the original point cloud. The point cloud is then rotated horizontally
around its center axis by an angle θ between 0 and 360 degrees. In this way, the 3D voxels are occupied
differently with different θ.
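A minimal sketch of the rotation part of this augmentation, assuming the point cloud is an N x 3 NumPy array; the voxel grid itself stays fixed, and each rotated copy is re-voxelized.

```python
import numpy as np

def rotate_point_cloud(points, theta_deg):
    """Rotate a point cloud horizontally around its center axis by theta_deg degrees.

    points: (N, 3) array; the rotation is applied in the x-y plane only."""
    theta = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    center = points[:, :2].mean(axis=0)
    rotated = points.copy()
    rotated[:, :2] = (points[:, :2] - center) @ rot.T + center
    return rotated

# Augmented copies every 60 degrees (the step used in the experiments); each copy
# is then re-voxelized against the fixed voxel grid.
# augmented = [rotate_point_cloud(point_cloud, angle) for angle in range(0, 360, 60)]
```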
4.3.2.3. Cross-validation for single U-net model
An experiment was conducted using 20 data sets from the UAV-based photogrammetric database
introduced in Section 4.2 to test the abovementioned 3D U-net point cloud segmentation workflow with
volumetric representation, and to identify its strengths and weaknesses. Note that data sets #7 and #20 were
not included in this experiment since data set #7 covers an area that is significantly larger than any other
data sets. Data set #20 is a university located in downtown Los Angeles with the highest building densities
and different architectural styles compared to other data sets. Data sets #7 and #20 are considered
challenging cases for segmentation and were used for validating the entire model ensembling framework,
which we will discuss in Section 4.4. The experiment was conducted in a cross-validation manner. Data
sets #3, #9, #14, and #15 were selected as the testing cases, and four segmentation models were trained
separately using the other 19 data sets. The desired point labels are ground, man-made objects, and
vegetation. The large voxel size was set to 40 m * 40 m * 40 m, and the small voxel size was set to 0.5 m
* 0.5 m * 0.5 m. The voxel sizes were selected to ensure the data size does not exceed the GPU memory
limit. The data augmentation process introduced in the previous section was used to enlarge the training
data, and θ was set to 60 degrees. All models were trained for 60 epochs with a minibatch size of 6. Note
that the θ value and epoch size were selected so that each model can be trained within a reasonable amount
of time (i.e., two days). A widely used optimization algorithm, Adam [134], was used in this study to
update network weights during training. The commonly used harmonic mean of precision and recall (i.e.,
F1 score) was used to evaluate the segmentation results. Results are summarized in Figure 35. The weighted
average F1 score for data sets #3, #9, #14, and #15 are 0.96, 0.96, 0.98, and 0.95, respectively.
Figure 35. F1 scores of the cross-validation results.
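For reference, the per-class and weighted-average F1 scores reported here can be computed directly with scikit-learn from per-point ground-truth and predicted labels; the small label arrays below are placeholders.

```python
import numpy as np
from sklearn.metrics import f1_score

# Per-point labels: 0 = ground, 1 = man-made objects, 2 = vegetation (placeholder arrays).
y_true = np.array([0, 0, 1, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 1, 1])
per_class_f1 = f1_score(y_true, y_pred, average=None)        # one score per class
weighted_f1 = f1_score(y_true, y_pred, average="weighted")   # weighted by class support
```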
The results show that pre-trained models perform reasonably well across different data sets that were
collected at different geographic locations. However, when considering the later process of creating a
virtual environment for simulations, three issues remain. Figure 36 illustrates the issues that were found
from the segmentation results of data set #15. Figure 36 (a) shows the first case in which the middle portion
of a flat roof is mis-segmented as ground; Figure 36 (b) shows the second case where a part of a building
that has complex 3D geometry is mis-segmented as trees; and Figure 36 (c) shows the third case in which
a few points that are close to each other are mis-segmented, as such mistakes usually occur between man-
made objects and vegetation. These issues not only affect the visual appearance of the created virtual
environment, but also introduce errors during a simulation. For instance, when creating a virtual
environment using the segmented point cloud, the second and third mis-segmentation cases introduce holes
on the resulting building meshes. When using the segmented data to simulate a destruction operation, the
ground and buildings should respond differently. However, in the first mis-segmentation case, mis-
segmented roofs react the same way as the ground. Thus, three post-processing techniques are introduced
in the following subsections to overcome these challenges.
Figure 36. Mis-segmentation cases: a) Ground mis-segmentation case; b) Building mis-segmentation
case; and c) Segmentation noises.
4.3.3. Post-processing
4.3.3.1. Ground post-processing
A ground post-processing technique was designed to improve the ground segmentation result. The
assumption behind this design is based on the fact that ground points should not be floating in the air, but
instead close to each other to form large, flat, and hilly components. A connected component algorithm was
first performed on the segmented ground-point cloud. The points were grouped into different connected
components, with the constraint that all points in the same component were within a predefined Euclidean
distance. The largest component was then considered as the ground. Note that in cases where the ground
could be split into several large connected components by forests or large buildings, selecting only the
largest component may yield an incorrect result. Thus, the rest of the connected components are sorted in
descending order based on their number of points. The connected components are then iteratively selected
as the ground if the component’s point count does not drop by more than a certain percentage compared
to the component above it on the list.
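A minimal sketch of this selection rule, assuming the connected components have already been computed by the clustering step and are given as a list of point-index arrays; the drop-ratio threshold is illustrative.

```python
def select_ground_components(components, max_drop_ratio=0.5):
    """Select ground components from the connected components of the predicted ground points.

    components: list of arrays of point indices, one per connected component.
    Starting from the largest component, keep adding components as ground until the
    point count drops by more than max_drop_ratio relative to the previous one."""
    ordered = sorted(components, key=len, reverse=True)
    ground = [ordered[0]]
    for previous, current in zip(ordered, ordered[1:]):
        if len(current) < (1.0 - max_drop_ratio) * len(previous):
            break            # sharp drop in size: remaining components are not ground
        ground.append(current)
    return ground
```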
4.3.3.2. Conditional Random Field (CRF) post-processing
The Conditional Random Field (CRF) method was used as a post-processing step to overcome the third case
discussed in Section 4.3.2.3 in which the predicted segmentation results are not always accurate and could
contain errors and noises. CRF is a well-established method to improve segmentation results and has been
applied successfully in both 2D image and 3D mesh segmentations [135]. The core idea is to maximize
label agreements between nearby points by encouraging those in close proximity to have similar labels if
they share similar colors. This creates an adjacency structure between points by defining unary potentials
at each individual point and pairwise potentials at nearby point pairs. The refined labels are calculated by
running inference on the CRF to approximate the maximum posterior. In our framework, we apply a fully
connected CRF to refine the labels in segmented point clouds. The unary energy is defined as the soft-max
probability of predicted labels at each point. The pairwise energy is defined both for positions and colors
using Gaussian kernels. Setting the kernel size to 10.0 for positions and 1.0 for colors produced the best results in our experiments. This is because the kernel size for positions affects the influence radius of CRF smoothing: a kernel size of 10.0 for positions means that the smoothing range is about 10 meters spatially. Empirically, we found that most mis-segmented regions are smaller than 10 meters, so setting the kernel size to 10 allows the CRF to remove most mis-segmentations without over-smoothing the labels of smaller man-made objects or trees. Similarly, we chose a kernel size of 1.0 for colors to cover the [0, 1] interval of RGB colors.
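A minimal sketch of this refinement using the pydensecrf package (one common fully connected CRF implementation; the implementation used in this work may differ). The softmax probabilities from the segmentation network supply the unary term, and positions and colors scaled by the kernel sizes above supply the pairwise features.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(xyz, rgb, probs, n_iters=5):
    """xyz: (N, 3) positions in meters, rgb: (N, 3) colors in [0, 1],
    probs: (N, C) softmax probabilities from the segmentation model.
    Returns refined per-point labels."""
    n_points, n_classes = probs.shape
    crf = dcrf.DenseCRF(n_points, n_classes)
    crf.setUnaryEnergy(unary_from_softmax(probs.T.copy()))     # negative log-probabilities
    # Pairwise features: positions and colors scaled by their kernel sizes (10.0 and 1.0).
    features = np.concatenate([xyz / 10.0, rgb / 1.0], axis=1).T
    crf.addPairwiseEnergy(np.ascontiguousarray(features, dtype=np.float32), compat=10)
    q = crf.inference(n_iters)
    return np.argmax(q, axis=0)
```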
4.3.3.3. Building post-processing
In the authors’ previously developed virtual environment-creation pipeline, the photogrammetry tree data
was replaced by geo-typical 3D models to improve run-time rendering appearances after building and tree
segmentations. Therefore, incorrectly classifying building points as trees produces larger artifacts in the
final rendering since parts of buildings could be removed. It is therefore important to have a high recall rate when classifying building points. We proposed a post-processing method based on the CRF to refine the building
points by generating a proxy mesh for each building point cloud. The key idea was to use the proxy shape
defined by the building point cloud to collect nearby points into the building. As such, a local CRF was
used to collect only nearby points with similar colors to ensure consistent appearances. This prevented
nearby tree points, which tend to have very different colors from building points, from also being assigned to
buildings. A new proxy shape was then iteratively generated and the process was repeated until all nearby
building points were collected. To produce a proxy shape that approximated the overall building shape, we
computed the 2D concave hull based on the projected building points on the ground plane. To solve for the
CRF, we set the unary potentials for existing building points to 1.0, to 0.5 for points that fall inside or within a threshold distance of the concave hull, and to 0.0 for the remainder of the points. This setup ensures that only points falling near the concave hull can potentially flip their labels to buildings, while points outside this range retain their current labels. The unary potentials are set
to 1.0 for points that are certain to be buildings and 0.0 for points that are certain to be trees in each iteration. We chose 0.5 for undecided points to allow an equal possibility of changing the labels to either trees or
buildings. The final labels for these points will depend on the proximity and color similarity to the nearby
tree and building points after CRF smoothing.
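A minimal sketch of the unary-potential assignment for this building refinement, assuming the 2D concave hull of the projected building points has already been computed (e.g., with an alpha-shape routine) and is available as a Shapely polygon; the threshold distance is an illustrative assumption.

```python
import numpy as np
from shapely.geometry import Point

def building_unary_potentials(points, labels, hull_polygon, building_label=1, threshold=2.0):
    """Assign per-point unary potentials for the building-refinement CRF.

    points: (N, 3) point cloud; labels: current per-point labels;
    hull_polygon: Shapely polygon of the building's 2D concave hull.
    Returns an (N,) array: 1.0 for current building points, 0.5 for undecided points
    near the hull, 0.0 for everything else."""
    expanded = hull_polygon.buffer(threshold)        # hull plus the threshold distance
    unary = np.zeros(len(points))
    for i, (x, y, _) in enumerate(points):
        if labels[i] == building_label:
            unary[i] = 1.0
        elif expanded.contains(Point(x, y)):
            unary[i] = 0.5
    return unary
```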
4.4. Validation
4.4.1. Quantitative analysis for point cloud segmentation
To validate the entire point-cloud segmentation framework, experiments were conducted to compare the
designed model ensembling approach, single U-net approach, and PointNet++. All models were trained on
the 20 data sets from the UAV-based photogrammetric database and tested on the two challenging cases
(i.e., data set #7 and data set #20). For the data augmentation process, we set θ to 60 degrees again, and all
models were trained on the augmented data. Hyperparameter values discussed in Section 4.3.2.3 for the
cross-validation were used for both the model ensembling approach and the single U-net approach. It is
also worth noting that since the designed model ensembling approach contained two U-net models—one
for differentiating ground and non-ground and the other for differentiating man-made objects and
vegetation—two different large voxel sizes were used. The large voxel size for the ground vs. non-ground
model and the man-made objects vs. vegetation was set to 40 m * 40 m * 20 m and 40 m * 40 m * 40 m,
respectively. The large voxel size for ground vs. non-ground model was smaller (i.e., 20 m on the z-axis)
because points at high altitudes do not contribute to segmenting ground points. For PointNet++, we
downsampled the original point clouds into 0.5 m spacing to ensure similar point density between the two
methods. As a 3D scene could contain millions of points and will not fit into the network in a single iteration,
we cut the 3D scene into chunks to reduce the number of points in one batch. In our experiment, we set a
chunk to contain approximately 16,000 points to reduce memory burden on GPUs, and fed each chunk into
the network as a single batch during training.
Data set | Approach | Ground | Man-made objects | Vegetation | Weighted average
Data set #20 | Model ensembling approach | 0.920 | 0.933 | 0.904 | 0.922
Data set #20 | Single U-net | 0.905 | 0.904 | 0.888 | 0.900
Data set #20 | PointNet++ | 0.909 | 0.912 | 0.887 | 0.904
Data set #7 | Model ensembling approach | 0.959 | 0.831 | 0.922 | 0.929
Data set #7 | Single U-net | 0.951 | 0.811 | 0.900 | 0.916
Data set #7 | PointNet++ | 0.950 | 0.826 | 0.879 | 0.911
Table 21. Point-cloud Segmentation Comparisons (F1 scores).
Table 21 summarizes the results of the comparison. The F1 score is again used to evaluate the performance
of the different approaches. The results show that the F1 scores are very similar for all three predefined classes when using the single U-net approach as well as the state-of-the-art PointNet++, which consumes the point
cloud in its original form. Our designed model ensembling approach using two U-net models and post-
processing approach outperforms both the single U-net approach and the PointNet++. Note that the F1
scores of the man-made objects for all three approaches are quite low (i.e., 0.831 when using the model
ensembling approach, 0.811 when using the single U-net approach, and 0.826 when using the PointNet++).
This is because, with such a large area of interest (4.6 km²), aerial images for data set #7 were collected
with high flight altitude and low overlaps compared to all the other data sets in the database. The quality of
the reconstructed 3D point cloud is low compared to the training data sets. The reconstructed tree crowns
lack detail and are connected to each other in forest areas. Small objects, such as cars, cannot be
reconstructed properly and have similar shapes as bumps on the road. Consequently, mislabeling the forest
as a man-made object caused lower precision, and mislabeling small objects (e.g., cars) as ground caused
the lower recall rate.
4.4.2. Qualitative results for creating virtual environments
The ultimate goal of segmenting a photogrammetric-generated point cloud is to create realistic virtual
environments and provide the necessary information for simulation. Results for creating virtual
environments using segmented data are discussed in this section. Photogrammetric-generated meshes are segmented based on the point-cloud segmentation result using the mesh segmentation approach discussed in Section 3.4.9. The segmented meshes were then imported into the simulation tool
for visualization. Figures 37 and 38 show the point-cloud segmentation results and the created virtual
environments for data sets #20 and #7, respectively. For point-cloud segmentation results, the ground, man-
made objects, and vegetation are marked with blue, green, and red, respectively.
Figure 37. Point-cloud segmentation result and the created virtual environment for data set #20: (a)
Point cloud segmentation result; and (b) The created virtual environment.
Figure 38. Point-cloud segmentation result and the created virtual environment for data set #7: (a)
Point-cloud segmentation result; and (b) The created virtual environment.
Figure 39 shows the created virtual environments with and without ground post-processing for simulating
blast damage. Note that for demonstration purposes, blasts over buildings would destroy any affected
meshes, and blasts over the ground would deform the affected meshes in ATLAS. In cases in which roofs
were mislabeled as ground, damaged state parameters were not properly assigned to their corresponding
meshes, and roof meshes were deformed instead of being destroyed. The virtual environment created using
the point-cloud segmentation results with ground post-processing behaves correctly in such cases, since correctly labeled roofs can be destroyed. It is worth noting that the designed ground post-processing
approach has limitations and does not make corrections if the mis-segmented buildings are directly
connected to the ground (i.e., walls and roofs that are connected are mislabeled as ground). Such cases
usually occur on small-building point clouds that were not properly reconstructed due to the high flight
altitude, which lacks sufficient image resolution on small buildings.
Figure 39. Segmentation result after the ground post-processing.
The original U-net segmentation results before post-processing include building misclassifications, as discussed in Section 4.3.2.3. Figure 40 (a) shows the virtual environment created with such mis-segmented
data in which a hole appears on the building since the meshes labeled as trees are flattened to ground
elevation for tree replacement purposes. Figures 40 (b) and (c) show the segmentation results after the
building-refinement process using concave hull and CRF filtering. Note that in our refinement process, it
is only possible to reclassify tree points to buildings, but the opposite is not possible, as mislabeling
buildings into trees will produce worse artifacts at run-time than mislabeling trees into buildings. For our
purposes, this heuristic helps produce better-quality results and makes the whole processing pipeline less
vulnerable to prediction errors caused by point cloud noise and the segmentation model.
Figure 40. Segmentation result after the building refinement process.
4.5. Discussion and Conclusions
Advances in photogrammetry and UAV technologies have enabled the rapid reconstruction of geo-specific
3D point clouds and meshes for creating virtual simulation environments. However, generated 3D data do
not contain semantic information for distinguishing between objects. Consequently, actual physical
properties cannot be properly assigned in the generated virtual environments for enabling various analyses.
This study contributes to filling this research gap by introducing a model ensembling framework for
segmenting top-level terrain elements from photogrammetric point clouds. The proposed framework
ensembles two point cloud segmentation models (i.e., 3D U-net) in a hierarchical manner in which each
model is responsible for segmenting specific objects. In addition, the framework also consists of both data
preprocessing and post-processing approaches that are designed to overcome the data segmentation
challenges posed by photogrammetric data quality-related issues.
Twenty-two data sets were manually segmented to create the ground truth for validation purposes. The data
was segmented into top-level terrain elements (i.e., ground, man-made objects, and vegetation). Cross-
validation was initially conducted to test the generalizability of the designed 3D U-net. The results showed
that the trained segmentation models perform reasonably well and achieve F1 scores above 0.95 on data
sets that were not included in the training sets. The entire model ensembling framework was then tested
through a quantitative analysis. Two challenging data sets were selected for comparing the proposed
framework with the single 3D U-net segmentation model and the PointNet++ model. The results showed
that the designed framework achieved a higher F1 score than the other networks for every object needing
segmentation. Results of the segmented data for the creation of a virtual simulation environment are also
discussed. The designed model ensembling framework and post-processing approaches will not only
improve the visual appearance, but also avoid issues during simulation caused by mis-segmentation.
In addition to segmenting top-level terrain elements, recognizing and detecting small objects such as
windows, doors, cars, light poles, and street signs could also provide useful information for creating a
virtual simulation environment. However, it is difficult for photogrammetric techniques, in general, to
reconstruct small and thin objects, particularly when captured source images are far away from the targets.
Consequently, small objects generated using UAV-based photogrammetric techniques have low geometric
accuracy, and detecting such small objects directly from a reconstructed 3D point cloud is challenging. In
some of the collected data sets where the aerial images were collected at very high altitudes, thin objects
such as light poles and parking meters could not be reconstructed. Therefore, in addition to operating on
3D point clouds, future work should also investigate methods that utilize 2D-UAV imagery to augment the
object detection. This would allow researchers to utilize more mature research on neural network
architectures (i.e., CNN based networks) in the 2D domain to help address the more difficult 3D object-
detection problem.
Chapter 5: Training Deep Learning-Based 3D Point Clouds Segmentation Model Using Synthetic
Photogrammetric Data
5.1. Research Objective
In Chapter 4, the author analyzes the generalizability of different algorithms across a variety of landscapes
(i.e., various buildings styles, types of vegetation, and urban density) with different data qualities (i.e., flight
altitudes and overlap between images). Although the newly introduced database is considerably larger than
existing databases, it remains unknown whether deep learning algorithms have truly achieved their full
potential in terms of accuracy, as sizable data sets for training and validation are currently lacking.
Obtaining a large, annotated 3D point cloud database is time-consuming and labor-intensive not only from
a data annotation perspective in which the data must be manually labeled by well-trained personnel, but
also from a raw data collection and processing perspective. Furthermore, it is generally difficult for
segmentation models to differentiate objects, such as buildings and tree masses, and these types of scenarios
do not always exist in the collected data set. Thus, the objective of this study is to investigate the possibility
of using synthetic photogrammetric data to substitute for real-world data in training deep learning
algorithms. The author has investigated methods for generating synthetic photogrammetric data to provide
a sufficiently sized database for training a deep learning algorithm with the ability to enlarge the data size
for scenarios in which segmentation algorithms have difficulties.
5.2. The Framework for Generating Annotated Synthetic Photogrammetric Data
The designed synthetic data-generation framework is illustrated in Figure 41, which emphasizes the main
elements and steps involved in the process. The designed framework can be used to generate synthetic
training data in which labels can be automatically created during the data generation process. Please note
that this research is not intended to generate synthetic data with realistic appearances to human beings.
However, the generated data should have features similar enough that deep learning models trained using the synthetic data achieve performance similar to models trained on real-world data.
Figure 41. The designed synthetic data generation framework.
First, 3D scenes are generated. Digital surface models (DSM) were gathered from publicly available GIS
data sources. Building models are procedurally generated with predefined rules and open-source building
footprints as inputs. To complete the 3D scene, artists created geotypical 3D models of clutter and small
objects (e.g., trees, light poles, parking meters, street signs, cars, and so forth) that are procedurally placed
in the scene. Next, the generated 3D terrain models are imported into game engines for the rendering process.
AirSim (i.e., an open-source simulator for autonomous vehicles) is used to render images and generate
associated ground truth labels. The rendered images are further processed using commercial
photogrammetry software (i.e., Bentley ContextCapture). Finally, labels from 2D images are projected to a
3D point cloud using a ray casting method and the nearest neighbor algorithm.
5.2.1. The 3D scene generation process
Procedurally generating 3D scenes with the desired randomness is the key element in the designed
framework. The main objects that need placement in the scene include the terrain surface model, buildings,
vegetation, vehicles, and city clutter. Since the author intended the framework to generate a large database of 3D scenes, scalability also plays a key role in the design process. Thus, user intervention during the data generation process should be minimized, and all required input data should be easily obtainable. DSMs
can be obtained from the National Elevation Dataset (NED), which is the terrain surface elevation data
produced and distributed by the USGS [136]. It is worth pointing out that the highest resolution data in the
NED is 1 meter. Although one-meter DSM resolution can already satisfy many different applications, it is
still considered low when creating a synthetic 3D scene for image rendering and 3D reconstructions. Thus,
the DSM first needs to be upsampled. Fine details are then added to the terrain by raising or lowering the
terrain elevation to create 3D geometry that is similar to ditches, street gutters, deceleration strips, and so
forth. Figure 42 shows an example of a DSM gathered from the NED and the modified DSM.
Figure 42. DSM: (a) the original DSM from the NED, and (b) the modified DSM.
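A minimal Python sketch of the DSM preparation step described above is given below: the 1-meter NED tile is upsampled and then perturbed with small-scale details. This is an illustrative sketch rather than the implementation used in this work; the upsampling factor, noise amplitude, and ditch geometry are assumed values, and the DSM is assumed to already be loaded as a NumPy array (e.g., via rasterio).

import numpy as np
from scipy.ndimage import zoom, gaussian_filter

def upsample_dsm(dsm_1m, factor=4):
    # Cubic-spline upsampling of the 1 m DSM to a finer grid (assumed factor).
    return zoom(dsm_1m, factor, order=3)

def add_fine_details(dsm, seed=0):
    # Raise/lower the terrain locally to mimic small features such as ditches.
    rng = np.random.default_rng(seed)
    detailed = dsm.astype(float, copy=True)
    # Low-amplitude smoothed noise adds gentle undulations to the surface.
    detailed += gaussian_filter(rng.normal(0.0, 0.05, dsm.shape), sigma=3)
    # Carve one ditch-like depression along a random row band (assumed geometry).
    row = rng.integers(10, dsm.shape[0] - 10)
    detailed[row - 1:row + 2, :] -= 0.3   # roughly 30 cm deep, 3 cells wide
    return detailed

# Example usage: dsm_hr = add_fine_details(upsample_dsm(dsm_1m))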
To create 3D building models, building footprints can be gathered from OpenStreetMap (OSM). Rather than using the original building heights recorded in OSM, heights are randomly assigned to each building footprint to ensure randomness. Footprints are extruded to form the basic 3D building models. Windows, doors, and other architectural elements are then procedurally added to the building façades using a set of parametric rules. As a simple example of how this works, one can imagine a building façade represented by a rectangle that is split vertically into floors and horizontally into tiles of windows and doors using predefined values. Roofs are randomly selected for each building from predefined roof styles (i.e., flat roof, flat roof with parapet, gable roof, and hip roof). Roof elements (e.g., chimneys and exhaust vents) and architectural features (e.g., dormers, cones, and turrets) are randomly placed on
top of the roof. Figure 43 shows examples of 3D models that are procedurally generated with the same
building footprint.
Figure 43. Procedurally generated 3D building models with the same building footprint.
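As an illustration of the split-based parametric rules described above, the following sketch extrudes a footprint to a random height and splits each façade rectangle into floors and tiles, tagging each tile as a window or a door. The floor height, tile width, door-placement rule, and height range are assumed values for demonstration only and are simpler than the rule set used in this work.

import random

def split_facade(width, height, floor_h=3.0, tile_w=2.5):
    # Split a facade rectangle into floors (vertically) and tiles (horizontally);
    # the centre tile of the ground floor is tagged as a door, the rest as windows.
    tiles = []
    n_floors = max(1, int(height // floor_h))
    n_tiles = max(1, int(width // tile_w))
    for f in range(n_floors):
        for t in range(n_tiles):
            tag = "door" if f == 0 and t == n_tiles // 2 else "window"
            tiles.append((t * tile_w, f * floor_h, tile_w, floor_h, tag))
    return tiles

def random_building(facade_widths):
    # facade_widths: lengths of the footprint edges to be extruded into facades.
    height = random.uniform(6.0, 30.0)                         # randomly assigned height
    roof = random.choice(["flat", "flat_parapet", "gable", "hip"])
    return {"height": height,
            "roof": roof,
            "facades": [split_facade(w, height) for w in facade_widths]}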
Within the scene, vegetation, vehicles, and city clutter are placed at randomly generated positions with randomly selected scale values within predefined ranges. To ensure that objects do not intersect (e.g., cars do not physically intersect with trees), a minimum distance constraint is enforced while generating object positions. In addition, any generated object positions that fall inside the building footprints are removed. To create forests rather than individual trees, polygons are randomly created as forest boundaries, and dense tree positions are generated within each boundary.
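The placement logic described above can be sketched as rejection sampling: candidate positions are drawn uniformly, positions that fall inside a building footprint or violate the minimum-distance constraint are discarded, and a random scale is attached to each accepted position. The distance threshold and scale range below are illustrative assumptions, and shapely is assumed to be available for the point-in-polygon test.

import numpy as np
from shapely.geometry import Point

def place_objects(bounds, footprints, n_objects, min_dist=2.0, seed=0):
    # bounds = (xmin, ymin, xmax, ymax); footprints = list of shapely Polygons.
    rng = np.random.default_rng(seed)
    placed = []
    attempts = 0
    while len(placed) < n_objects and attempts < 100 * n_objects:
        attempts += 1
        x = rng.uniform(bounds[0], bounds[2])
        y = rng.uniform(bounds[1], bounds[3])
        # Reject positions that fall inside any building footprint.
        if any(poly.contains(Point(x, y)) for poly in footprints):
            continue
        # Reject positions that violate the minimum-distance constraint.
        if any((x - px) ** 2 + (y - py) ** 2 < min_dist ** 2 for px, py, _ in placed):
            continue
        placed.append((x, y, rng.uniform(0.8, 1.2)))   # (x, y, random scale)
    return placed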
5.2.2. 2D image rendering and 3D point cloud reconstruction
To render 2D images for the photogrammetric reconstruction of 3D terrains, the generated 3D building
models and DSM must be imported into a game engine, and 3D objects (e.g., trees and vehicles) have to be
placed at the generated positions. In this study, the author used 3D models that were easily obtained from
the 3D model marketplace. To ensure the realism of the generated scene, both gravity and physical collisions are enabled so that vehicles are correctly oriented to follow the terrain slope. Figure 44
shows the generated 3D scenes using the created object positions and scales.
Figure 44. Generated 3D scenes: (a) forests, (b) city clutter, and (c) trees and vehicles.
The high-fidelity visual and physical simulator (i.e., AirSim [137]) was utilized for 2D image-rendering
purposes. AirSim was originally designed for developing and testing autonomous vehicle
algorithms. Since then, it has been widely used for solving other computer vision problems such as semantic
segmentation of 2D images [138], real-time monocular depth estimation [139], and aerial path planning
optimization [140]. The three main outputs of AirSim are photorealistic 2D images, the associated annotations, and depth maps. Figure 45 shows the outputs of AirSim with the generated 3D scene imported. The same UAV path-planning algorithm used in real-world UAV captures is used for rendering 2D images in the virtual environment. Crosshatch UAV paths were used in this study, and both front and side overlap between images were maintained during the rendering process to ensure the quality of the 3D reconstruction.
Figure 45. Outputs from the simulator (i.e., AirSim): (a) rendered image, (b) annotation, and (c) depth map.
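To make the crosshatch path planning concrete, the following sketch generates a grid of camera waypoints for two perpendicular passes, deriving the spacing along and between flight lines from the desired front and side overlap and a simplified square image footprint on the ground. The altitude, field of view, and overlap values are illustrative assumptions, not the parameters used in this study.

import math

def crosshatch_waypoints(area_w, area_h, altitude=80.0, fov_deg=84.0,
                         front_overlap=0.8, side_overlap=0.7):
    # Ground footprint of one image (a square footprint is assumed for simplicity).
    footprint = 2.0 * altitude * math.tan(math.radians(fov_deg) / 2.0)
    d_front = footprint * (1.0 - front_overlap)   # spacing along a flight line
    d_side = footprint * (1.0 - side_overlap)     # spacing between flight lines
    waypoints = []
    # First pass: flight lines parallel to the x-axis.
    y = 0.0
    while y <= area_h:
        x = 0.0
        while x <= area_w:
            waypoints.append((x, y, altitude))
            x += d_front
        y += d_side
    # Second, perpendicular pass forms the crosshatch pattern.
    x = 0.0
    while x <= area_w:
        y = 0.0
        while y <= area_h:
            waypoints.append((x, y, altitude))
            y += d_front
        x += d_side
    return waypoints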
The commercial photogrammetry software Bentley ContextCapture was again used for the photogrammetric reconstruction process. Annotating the generated point clouds is an important task. As discussed in Section 4.3.1, noise is introduced during the photogrammetric reconstruction process due to mismatches [131]. Consequently, the 3D geometry of the reconstructed point cloud is not accurate, and simply using a ray casting approach to project the 2D ground truth labels onto the photogrammetrically generated 3D point cloud does not provide accurate annotations. Figure 46 illustrates this issue: the ground points are labeled and extracted by projecting 2D labels through the ray casting method using the intrinsic and extrinsic camera parameters produced by the bundle adjustment process. The extracted ground points include noisy points floating in the air, as shown in Figure 46 (b).
Figure 46. Photogrammetric point cloud annotation using ray casting: (a) raw photogrammetric point cloud, and (b) extracted ground points.
To overcome this challenge, a ground truth 3D point cloud is first created using the ray casting method with depth maps generated directly from the simulator. A k-nearest neighbor algorithm is then used to transfer the point labels from the depth map-created point cloud to the photogrammetric point cloud. Figure 47 shows the annotated point cloud created from the depth maps and the ground points extracted from the annotated photogrammetric point cloud using the k-nearest neighbor algorithm. Comparing the results in Figure 46 (b) and Figure 47 (b) shows that the annotation noise is eliminated by the proposed k-nearest neighbor approach.
Figure 47. Photogrammetric point cloud annotation using a k-nearest neighbor algorithm: (a) depth map-generated point cloud with annotation, and (b) extracted ground points from the annotated photogrammetric point cloud.
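The two-step annotation described above can be sketched as follows: each depth map is back-projected into a labeled ground truth point cloud using a pinhole camera model, and each photogrammetric point then receives the label of its nearest neighbor in that cloud. This is an illustrative sketch under simplifying assumptions (planar depth values, and an intrinsic matrix K and camera-to-world transform exported by the simulator), not the exact implementation used in this work.

import numpy as np
from scipy.spatial import cKDTree

def depth_to_points(depth, labels, K, cam_to_world):
    # Back-project a depth map (H x W) and its per-pixel labels into world space.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, labels.ravel()

def transfer_labels(gt_points, gt_labels, photo_points):
    # Give each photogrammetric point the label of its nearest ground truth point.
    tree = cKDTree(gt_points)
    _, idx = tree.query(photo_points, k=1)
    return gt_labels[idx]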
5.3. Experiments and results
In this section, the author provides an evaluation of the proposed method for generating annotated data for
training deep learning algorithms to segment 3D photogrammetric point clouds. The experiments were
conducted to answer a set of fundamental questions about how synthetic data should be generated. The 3D U-net introduced in Section 4.3.2.1 was used to test the synthetic training data: the 3D U-net model was trained using synthetic data and tested on real-world data. The results presented in Section 4.3.2.3 and Section 4.4.1, in which the models were trained on real-world data, are used as the baseline for comparison. The hyperparameter values discussed in Section 4.3.2.3 were used again when training the 3D U-net model. The data augmentation strategies introduced in Section 4.3.2.2 were used to enlarge the generated synthetic training data, with θ set to 60 degrees.
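As an illustration of the rotation parameter θ, the following sketch duplicates a training point cloud at successive rotations of θ = 60 degrees about the vertical axis. This is only one plausible reading of the augmentation strategy of Section 4.3.2.2, which may include additional operations; the sketch is not the exact augmentation pipeline used in this work.

import numpy as np

def rotate_z(points, angle_rad):
    # Rotate an (N, 3) point array about the vertical (z) axis.
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

def augment_by_rotation(points, theta_deg=60.0):
    # Return copies of the cloud rotated by 0, θ, 2θ, ... below 360 degrees.
    theta = np.radians(theta_deg)
    n = int(round(2.0 * np.pi / theta))
    return [rotate_z(points, k * theta) for k in range(n)]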
Four sets of synthetic training data were generated for this study. In the first synthetic training data set, the
DSM was obtained from the NED without any modifications (i.e., DSM with smooth surfaces and 1-meter
resolution). 3D building models were created with OSM footprints and basic parametric rules for adding
windows and doors. The generated 3D building models do not contain any complex architectural elements (e.g., protruding balconies). To create realistic contextual relationships between objects, city clutter (e.g.,
street signs, traffic lights, light poles, bus stops, and so forth) was placed in the scene along the road vectors
obtained from OSM, and individual trees were placed around the buildings. Figure 48 (a) shows an
example of the generated scene in the first synthetic training data set. In the second synthetic training data
set, the DSM was modified to add fine details as discussed in Section 5.2.1. Forests and vehicles were added
as separate scenes to the training data sets to increase the scene complexity. Figure 48 (b) shows the added
forest and vehicle scenes.
Figure 48. Generated synthetic training data sets: (a) the first synthetic training data set; (b) forests and vehicles in the second synthetic training data set; (c) the third synthetic training data set; and (d) the fourth synthetic training data set.
In the third and fourth synthetic training data sets, the 3D scenes were created following the procedures
introduced in Sections 5.2.1 and 5.2.2. Complex architectural elements were added while creating 3D
building models. Unlike the first and second training data sets, in which city clutter was placed along the roads, in the third and fourth data sets the clutter, individual trees, and vehicles were placed randomly in the scenes, so the objects do not have realistic contextual relationships with one another. The difference between the third and fourth data sets is that the point clouds in the third data set were created directly from the rendered depth maps, whereas the point clouds in the fourth data set were created through the photogrammetric reconstruction and k-nearest neighbor labeling processes. Figure 48 (c) and (d) show example scenes
in the third and fourth synthetic training data sets.
Four 3D U-net models were trained using the four generated synthetic training data sets with the same
hyperparameter values. The models were then applied to the Fort Drum (#7 from Table 20), MUTC (#15 from Table 20), and USC (#20 from Table 20) data sets to compare performances. In addition to the metrics used in Chapter 4 (i.e., precision, recall, and F1-score), intersection over union (IOU), also known as the Jaccard index, was used for evaluation. IOU is computed as the number of points in the intersection of the predicted segmentation and the ground truth divided by the number of points in their union. Segmentation results are summarized in Tables 22–25. For completeness, the segmentation
results from Section 4.3.2.3 and Section 4.4.1, in which the models were trained using the real-world
training data, are also shown in this chapter (Table 26).
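For reference, the per-class metrics reported in Tables 22–26 can be computed from the predicted and ground truth point labels as in the following sketch; the integer class encoding is assumed for illustration.

import numpy as np

def per_class_metrics(pred, gt, class_id):
    # pred and gt are equal-length integer label arrays over the same points.
    tp = np.sum((pred == class_id) & (gt == class_id))
    fp = np.sum((pred == class_id) & (gt != class_id))
    fn = np.sum((pred != class_id) & (gt == class_id))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0   # |P ∩ G| / |P ∪ G|
    return precision, recall, f1, iou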
5.3.1. Question 1. How much does it help to add details to the synthetic scene?
Tables 22 and 23 show a direct comparison between adding and not adding fine details to the DSM of the
synthetic training data. Ground segmentation performance has improved for all three testing data sets (i.e.,
F1-score improves from 0.901 to 0.956 in the Fort Drum data set, from 0.873 to 0.964 in the MUTC data
set and from 0.874 to 0.912 in the USC data set). In addition, since the forest scene was added to the second
synthetic training data set, the model trained with the second set is more robust for segmenting forests from
buildings. As such, the performance of segmenting both forests and buildings improved, and performance
improvements can be found in the Fort Drum and the MUTC segmentation results. The F1-score of building
segmentation improves from 0.555 to 0.802 in the Fort Drum data set and from 0.609 to 0.826 in the MUTC
data set. The F1-score of vegetation segmentation improves from 0.728 to 0.910 in the Fort Drum data set
and from 0.795 to 0.865 in the MUTC data set. However, in the USC data set, the performance of segmenting buildings and trees decreased for two reasons. First, no forests exist on the USC campus, and most of the trees are planted at predetermined intervals along streets. Second, buildings on the USC campus include more complex architectural elements (e.g., protruding balconies and church towers) than the generated synthetic buildings, which caused the model trained on the synthetic data to confuse complex buildings with forests. Therefore, adding details to synthetic buildings is
also a necessary step. The segmentation performance using synthetic buildings with complex architectural
elements will be discussed later in this section in the answer to Question #3.
5.3.2. Question 2. Is it necessary to use photogrammetric reconstructed point clouds instead of the
depth map-generated point clouds for training purposes?
From the proposed synthetic data-generation workflow introduced in Section 5.2, readers can see that
annotated 3D point clouds can be directly generated from the rendered depth maps and 2D annotations.
Figure 48 (c) and (d) show that the point clouds generated from the depth maps and from the photogrammetric reconstruction process have almost no visual differences. This observation prompted the author to examine
the performance differences between training a model with the depth map-generated point clouds and the
photogrammetric reconstructed point clouds. Tables 24 and 25 show a direct comparison between the
segmentation results using the two types of training data for the models. Overall, we can clearly see the
benefits of using the photogrammetric-reconstructed point clouds as the training data. The macro average
of the F1-score improves from 0.505 to 0.904 in the Fort Drum data set, from 0.598 to 0.914 in the MUTC
data set, and from 0.791 to 0.909 in the USC data set. Note that the vegetation IOU for the model trained on the depth map-generated point clouds is very low in all three data sets (i.e., 0.049 in Fort Drum, 0.080 in MUTC, and 0.469 in USC), because the quality of the depth map-generated point clouds differs from the quality of the photogrammetry-reconstructed point clouds. Figure 49 illustrates the tree point clouds generated using depth maps and photogrammetric reconstruction. The photogrammetry-reconstructed tree point cloud appears as a solid blob, with no points generated inside the crown. The depth map-generated tree point cloud has a quality similar to LiDAR-collected data, in which points on the leaves inside the crown are also generated. Consequently, the segmentation model trained on the depth map-generated point clouds learned to predict as tree points only points that do not lie on hollow-shaped objects, and it therefore misses the hollow tree crowns produced by photogrammetry. Therefore, rendering 2D RGB images and creating the training point clouds through photogrammetric reconstruction is a necessary step to introduce photogrammetric noise into the data and improve the segmentation performance.
Figure 49. Tree point cloud: (a) depth map-generated tree point cloud, and (b) photogrammetric-reconstructed tree point cloud.
5.3.3. Question 3. Is it necessary to create synthetic scenes with realistic contextual relationships
between objects?
An interesting question is whether the generated synthetic scene should contain contextual relationships
between objects similar to those in the real world. On the one hand, one could argue that it makes more sense to train the model with synthetic data that contains contextual relationships between objects so that the trained model is well adapted for real-world predictions (e.g., the model understands that traffic lights usually appear near an intersection). On the other hand, it might make more sense to randomize the synthetic scene as much as possible
so that the trained model is more robust for predicting point labels in a new environment.
To this end, the segmentation results using the fourth synthetic training data set (i.e., city clutter placed randomly), shown in Table 25, are compared with the segmentation results using the second synthetic training data set (i.e., city clutter placed along the streets), shown in Table 23. The model trained using the fourth synthetic training data set outperforms the other model in all three testing cases. The F1-score of building segmentation improves from 0.802 to 0.830 in the Fort Drum data set and
from 0.826 to 0.854 in the MUTC data set. Therefore, adding contextual relationships between objects into
the synthetic scenes did not improve the model performance. In contrast, placing the city clutter randomly
could increase the geometric complexity of the scene and thus improve the model performance. Note that
a large improvement in the building F1-score is shown in the USC case (i.e., an improvement from 0.831
to 0.919). This is because complex architectural elements were also added to the fourth synthetic data set, and the trained model correctly predicted buildings with complex architectural elements on the USC
campus.
5.3.4. Question 4. Can synthetic data be used for training deep learning models and replace the
need for creating real-world training data?
Finally, Tables 25 and 26 provide a direct comparison between using synthetic training data and real-world
training data for training point cloud segmentation models. Note that the synthetic training data was created
to cover a similar area size as the real-world training data to ensure an equivalent comparison. The models
trained with the synthetic training data outperform the models trained using the real-world training data in
two out of three testing cases. Using the synthetic training data, the F1-score macro average improves from
0.887 to 0.904 in the Fort Drum case and from 0.898 to 0.909 in the USC case. Note that the model trained
using the synthetic training data underperformed compared with the model trained using the real-world training data in the MUTC case. This is because the MUTC data set includes small bushes, which are present in the real-world training data but not in the synthetic training data. Overall, the comparison results validated that the proposed synthetic data-generation workflow can be used to create training data for deep learning models and replace the need for creating real-world training data.
Fort Drum
precision recall f1-score IOU
ground 0.944 0.863 0.901 0.820
building 0.409 0.861 0.555 0.384
tree 0.953 0.589 0.728 0.572
macro avg 0.769 0.771 0.728 0.592
weighted avg 0.868 0.781 0.799 0.683
MUTC
ground 0.969 0.794 0.873 0.774
building 0.469 0.867 0.609 0.438
tree 0.876 0.728 0.795 0.660
macro avg 0.771 0.796 0.759 0.624
weighted avg 0.869 0.798 0.816 0.700
USC
ground 0.877 0.871 0.874 0.776
building 0.888 0.844 0.865 0.763
tree 0.809 0.880 0.843 0.728
macro avg 0.858 0.865 0.861 0.756
weighted avg 0.863 0.861 0.861 0.756
Table 22. Segmentation results with the first synthetic training data set.
Fort Drum
precision recall f1-score IOU
ground 0.942 0.970 0.956 0.915
building 0.808 0.795 0.802 0.669
tree 0.933 0.888 0.910 0.835
macro avg 0.894 0.884 0.889 0.806
weighted avg 0.920 0.920 0.920 0.855
MUTC
ground 0.967 0.961 0.964 0.930
building 0.863 0.793 0.826 0.704
tree 0.814 0.923 0.865 0.763
macro avg 0.881 0.892 0.885 0.799
weighted avg 0.928 0.926 0.926 0.868
USC
ground 0.886 0.940 0.912 0.839
building 0.938 0.746 0.831 0.711
tree 0.712 0.915 0.801 0.668
macro avg 0.845 0.867 0.848 0.739
weighted avg 0.861 0.842 0.843 0.730
Table 23. Segmentation results with the second synthetic training data set.
Fort Drum
precision recall f1-score IOU
ground 0.871 0.975 0.920 0.852
building 0.351 0.872 0.501 0.334
tree 0.984 0.049 0.093 0.049
macro avg 0.736 0.632 0.505 0.412
weighted avg 0.829 0.684 0.613 0.537
MUTC
ground 0.953 0.976 0.965 0.932
building 0.551 0.895 0.682 0.518
tree 0.967 0.080 0.148 0.080
macro avg 0.824 0.650 0.598 0.510
weighted avg 0.885 0.840 0.804 0.744
USC
ground 0.866 0.932 0.898 0.814
building 0.747 0.949 0.836 0.718
tree 0.978 0.474 0.639 0.469
macro avg 0.863 0.785 0.791 0.667
weighted avg 0.842 0.810 0.795 0.671
Table 24. Segmentation results with the third synthetic
training data set.
Fort Drum
precision recall f1-score IOU
ground 0.926 0.980 0.952 0.908
building 0.837 0.824 0.830 0.710
tree 0.982 0.884 0.930 0.870
macro avg 0.915 0.896 0.904 0.829
weighted avg 0.930 0.928 0.928 0.868
MUTC
ground 0.956 0.972 0.964 0.930
building 0.863 0.845 0.854 0.745
tree 0.953 0.895 0.923 0.858
macro avg 0.924 0.904 0.914 0.844
weighted avg 0.939 0.940 0.939 0.888
USC
ground 0.854 0.963 0.905 0.827
building 0.936 0.902 0.919 0.850
tree 0.926 0.879 0.902 0.821
macro avg 0.905 0.915 0.909 0.833
weighted avg 0.913 0.911 0.911 0.836
Table 25. Segmentation results with the fourth synthetic
training data set.
Fort Drum
precision recall f1-score IOU
ground 0.928 0.978 0.952 0.908
building 0.786 0.833 0.809 0.679
tree 0.967 0.840 0.899 0.817
macro avg 0.893 0.884 0.887 0.801
weighted avg 0.918 0.916 0.915 0.848
MUTC
ground 0.965 0.974 0.969 0.940
building 0.878 0.878 0.878 0.782
tree 0.933 0.888 0.910 0.834
macro avg 0.925 0.913 0.919 0.852
weighted avg 0.945 0.945 0.945 0.898
USC
ground 0.866 0.944 0.903 0.823
building 0.941 0.869 0.904 0.824
tree 0.869 0.909 0.888 0.799
macro avg 0.892 0.907 0.898 0.815
weighted avg 0.902 0.899 0.899 0.817
Table 26. Segmentation results with the real-world
training data set.
5.4. Conclusions
Deep learning algorithms are data-hungry, especially in the 3D domain. Acquiring and annotating 3D data
is a labor-intensive and time-consuming process. To this end, this study designed and developed a synthetic
3D data generation workflow and investigated the potential of using synthetic photogrammetric data to
substitute real-world data for training deep learning algorithms. The designed synthetic data-generation framework takes full advantage of an off-the-shelf 3D scene-generation engine, an autonomous vehicle and UAV simulator, and photogrammetry software. The key elements in the designed framework include randomness in the generated scenes and the addition of photogrammetric noise to the synthetic point clouds. The
designed framework was validated through a comparison of 3D U-nets trained on synthetic data and on
real-world training data. The experiment results were analyzed to answer four fundamental questions on
how synthetic data should be generated. Adding detail to the ground surfaces and building facades while
generating the synthetic scenes is a necessary step in boosting the model’s performance. Using synthetic
data without producing a point cloud through the photogrammetric-reconstruction process (i.e., eliminating
the photogrammetric noises) decreases the model’s performance dramatically, particularly in segmenting
vegetation. Furthermore, the results show that adding realistic contextual relationships between objects to
the synthetic data does not improve the model’s performance. Finally, by comparing synthetic data and
real-world data as the training data, the results show that the designed synthetic data-generation workflow can be used to create training data for a 3D U-net and achieve a performance similar to that obtained with training data collected from the real world. However, this study has several limitations. First, the generated synthetic data does not include small bushes, and the trained models cannot correctly differentiate small bushes from city clutter. Second, realistic surface materials were not generated in the synthetic scene, so point cloud segmentation architectures such as 3DMV that take advantage of 2D texture features were not tested in the experiments. Third, the experiment was conducted to test a deep learning segmentation algorithm with a
volumetric representation only, and other state-of-the-art algorithms (e.g., PointNet, SPG) that operate on
an unordered point set were not tested.
Chapter 6: Intellectual Merit and Broader Impacts
This research provides a framework for the segmentation/classification of a photogrammetric-generated
point cloud for the creation of virtual environments and simulations. The designed framework provides a
novel way of extracting object information while considering the data quality issues presented in a
photogrammetric-generated point cloud. In addition, the project leads to the development of: (1) a fully automated top-level terrain elements segmentation approach; (2) a building-footprints extraction approach; (3) a novel individual tree location and information-extraction approach that considers the data characteristics; and (4) a ground-material classification framework that uses mesh-rendered images.
Furthermore, it also leads to an understanding of the effectiveness of using different point descriptors and
machine-learning algorithms for segmenting a photogrammetric-generated point cloud into top-level terrain
elements, as well as the potential for using synthetic training data to train deep learning models and replace
the need for real-world training data. The results not only benefit the creation of geo-specific visual environments, but also benefit fields such as building-energy simulation and facility management by providing accurate, geo-specific, and complete building and surrounding-environment information.
The ability to create semantic models of outdoor scenes is valuable and applicable to urban planning,
historic-building information storage, building renovation, facility management, and building and district
energy simulation, among others. This research develops a framework that provides an improved point-
cloud classification process to support the creation of semantic 3D models. The designed framework and the generated semantic 3D models can be applied to create “as-built” documentation to enhance the way we manage and interact with our existing built environment. The results of this study can also be extrapolated
to different types of building environments that require continuous condition assessments, leading to more
efficient operations and maintenance of facilities.
References
[1] J. Irizarry and D. B. Costa, "Exploratory study of potential applications of unmanned aerial systems
for construction management tasks," J. Manage. Eng., vol. 32, (3), pp. 05016001, 2016.
[2] A. Khaloo and D. Lattanzi, "Hierarchical dense structure-from-motion reconstructions for
infrastructure condition assessment," J. Comput. Civ. Eng., vol. 31, (1), pp. 04016047, 2016.
[3] H. Aljumaily, D. F. Laefer and D. Cuadra, "Urban point cloud mining based on density clustering and
MapReduce," J. Comput. Civ. Eng., vol. 31, (5), pp. 04017021, 2017.
[4] H. Son, C. Kim and Y. Kwon Cho, "Automated schedule updates using as-built data and a 4D
building information model," J. Manage. Eng., vol. 33, (4), pp. 04017012, 2017.
[5] A. Rashidi and E. Karan, "Video to BrIM: Automated 3D As-Built Documentation of Bridges," J.
Perform. Constr. Facil., vol. 32, (3), pp. 04018026, 2018.
[6] M. Breunig et al, "Collaborative multi-scale 3D city and infrastructure modeling and simulation," in
Tehran's Joint ISPRS Conferences of GI Research, SMPR and EOEC 2017, 2017, .
[7] S. Cousins, "3D mapping Helsinki: How mega digital models can help city planners," Construction
Research and Innovation, vol. 8, (4), pp. 102-106, 2017.
[8] T. Ruohomäki et al, "Smart city platform enabling digital twin," in 2018 International Conference on
Intelligent Systems (IS), 2018, .
[9] R. Spicer, R. McAlinden and D. Conover, "Producing usable simulation terrain data from UAS-
collected imagery," in Interservice/Industry Training, Simulation and Education Conference (I/ITSEC),
2016, .
[10] H. Lin, C. Tai and G. Wang, "A mesh reconstruction algorithm driven by an intrinsic property of a
point cloud," Comput. -Aided Des., vol. 36, (1), pp. 1-9, 2004.
[11] S. A. Shoop, Terrain Characterization for Trafficability, 1993.
[12] Y. Chen, T. Hong and M. A. Piette, "Automatic generation and simulation of urban building energy
models based on city datasets for city-scale building retrofit analysis," Appl. Energy, vol. 205, pp. 323-
335, 2017.
[13] J. L. Savard, P. Clergeau and G. Mennechez, "Biodiversity concepts and urban ecosystems,"
Landscape Urban Plann., vol. 48, (3-4), pp. 131-142, 2000.
[14] A. Ossola et al, "Greening in style: Urban form, architecture and the structure of front and backyard
vegetation," Landscape Urban Plann., vol. 185, pp. 141-157, 2019.
[15] B. Yang and Z. Dong, "A shape-based segmentation method for mobile laser scanning point clouds,"
ISPRS Journal of Photogrammetry and Remote Sensing, vol. 81, pp. 19-30, 2013.
[16] L. Wallace et al, "Assessment of Forest Structure Using Two UAV Techniques: A Comparison of
Airborne Laser Scanning and Structure from Motion (SfM) Point Clouds," Forests, vol. 7, (3), pp. 62,
2016.
[17] J. Chen et al, "Principal axes descriptor for automated construction-equipment classification from
point clouds," J. Comput. Civ. Eng., vol. 31, (2), pp. 04016058, 2016.
[18] Y. Perez-Perez, M. G. Fard and K. A. El-Rayes, "Semantic-rich 3D CAD models for built
environments from point clouds: An end-to-end procedure," in 2017 ASCE International Workshop on
Computing in Civil Engineering, IWCCE 2017, 2017, .
[19] M. Volpi and V. Ferrari, "Semantic segmentation of urban scenes by learning local class
interactions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 2015, .
[20] X. Hu and Y. Yuan, "Deep-Learning-Based Classification for DTM Extraction from ALS Point
Cloud," Remote Sensing, vol. 8, (9), pp. 730, 2016.
[21] J. Huang and S. You, "Point cloud labeling using 3d convolutional neural network," in Pattern
Recognition (ICPR), 2016 23rd International Conference On, 2016, .
[22] C. Becker et al, "Classification of Aerial Photogrammetric 3D Point Clouds," arXiv Preprint
arXiv:1705.08374, 2017.
[23] C. R. Qi et al, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, .
[24] T. Hackel, J. D. Wegner and K. Schindler, "Joint classification and contour extraction of large 3D
point clouds," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 130, pp. 231-245, 2017.
[25] M. Rouhani, F. Lafarge and P. Alliez, "Semantic segmentation of 3D textured meshes for urban
scene analysis," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 123, pp. 124-139, 2017.
[26] A. Boulch, B. Le Saux and N. Audebert, "Unstructured Point Cloud Semantic Labeling Using Deep
Segmentation Networks," 2017.
[27] M. Bassier, B. Van Genechten and M. Vergauwen, "Classification of sensor independent point cloud
data of building objects using random forests," Journal of Building Engineering, 2018.
[28] Y. Li et al, "Pointcnn: Convolution on x-transformed points," in Advances in Neural Information
Processing Systems, 2018, .
[29] D. Griffiths and J. Boehm, "A review on deep learning techniques for 3D sensed data classification,"
Remote Sensing, vol. 11, (12), pp. 1499, 2019.
[30] D. Forsyth and J. Ponce, Computer Vision: A Modern Approach. 2011.
[31] M. Golparvar-Fard, F. Peña-Mora and S. Savarese, "D4AR–a 4-dimensional augmented reality
model for automating construction progress monitoring data collection, processing and communication,"
Journal of Information Technology in Construction, vol. 14, (13), pp. 129-153, 2009.
[32] J. Armesto et al, "FEM modeling of structures based on close range digital photogrammetry," Autom.
Constr., vol. 18, (5), pp. 559-569, 2009.
[33] I. Brilakis, H. Fathi and A. Rashidi, "Progressive 3D reconstruction of infrastructure with
videogrammetry," Autom. Constr., vol. 20, (7), pp. 884-895, 2011.
[34] L. Klein, N. Li and B. Becerik-Gerber, "Imaged-based verification of as-built documentation of
operational buildings," Autom. Constr., vol. 21, pp. 161-171, 2012.
[35] A. Koutsoudis et al, "Multi-image 3D reconstruction data evaluation," Journal of Cultural Heritage,
vol. 15, (1), pp. 73-79, 2014.
[36] A. Rashidi, "Improved Monocular Videogrammetry for Generating 3D Dense Point Clouds of Built
Infrastructure." , Georgia Institute of Technology, 2014.
[37] M. Himmelsbach, T. Luettel and H. Wuensche, "Real-time object classification in 3D point clouds
using point feature histograms," in Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ
International Conference On, 2009, .
[38] F. Bosché et al, "The value of integrating Scan-to-BIM and Scan-vs-BIM techniques for construction
monitoring using laser scanning and BIM: The case of cylindrical MEP components," Autom. Constr.,
vol. 49, pp. 201-213, 2015.
[39] A. Frome et al, "Recognizing objects in range data using regional point descriptors," in Computer
Vision-ECCV 2004, 2004, .
[40] M. Rutzinger et al, "Object-based point cloud analysis of full-waveform airborne laser scanning data
for urban vegetation classification," Sensors, vol. 8, (8), pp. 4505-4528, 2008.
[41] N. Chehata, L. Guo and C. Mallet, "Airborne lidar feature selection for urban classification using
random forests," International Archives of Photogrammetry, Remote Sensing and Spatial Information
Sciences, vol. 38, (Part 3), pp. W8, 2009.
[42] B. Höfle and M. Hollaus, Urban Vegetation Detection using High Density Full-Waveform Airborne
Lidar Data-Combination of Object-Based Image and Point Cloud Analysis. 2010.
[43] C. Mallet, F. Bretar and U. Soergel, "Analysis of full-waveform lidar data for classification of urban
areas," Photogrammetrie Fernerkundung Geoinformation, vol. 5, pp. 337-349, 2008.
[44] T. Hackel, J. D. Wegner and K. Schindler, "FAST SEMANTIC SEGMENTATION OF 3D POINT
CLOUDS WITH STRONGLY VARYING DENSITY." ISPRS Annals of Photogrammetry, Remote
Sensing & Spatial Information Sciences, vol. 3, (3), 2016.
[45] M. Weinmann, B. Jutzi and C. Mallet, "Feature relevance assessment for the semantic interpretation
of 3D point cloud data," ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information
Sciences, vol. 5, pp. W2, 2013.
[46] J. Zhang, X. Lin and X. Ning, "SVM-based classification of segmented airborne LiDAR point clouds
in urban areas," Remote Sensing, vol. 5, (8), pp. 3749-3775, 2013.
[47] B. Jutzi and H. Gross, "Nearest neighbour classification on laser point clouds to gain object
structures from buildings," The International Archives of the Photogrammetry, Remote Sensing and
Spatial Information Sciences, vol. 38, (Part 1), pp. 4-7, 2009.
[48] M. Chen et al, "Semantic modeling of outdoor scenes for the creation of virtual environments and
simulations," in Proceedings of the 52nd Hawaii International Conference on System Sciences, 2019, .
[49] K. Zhang et al, "A progressive morphological filter for removing nonground measurements from
airborne LIDAR data," IEEE Trans. Geosci. Remote Sens., vol. 41, (4), pp. 872-882, 2003.
[50] T. H. Stevenson et al, "Automated bare earth extraction technique for complex topography in light
detection and ranging surveys," Journal of Applied Remote Sensing, vol. 7, (1), pp. 073560-073560, 2013.
[51] H. Son, C. Kim and C. Kim, "Fully automated as-built 3D pipeline extraction method from laser-
scanned data based on curvature computation," J. Comput. Civ. Eng., vol. 29, (4), pp. B4014003, 2014.
[52] A. Adan and D. Huber, "3D reconstruction of interior wall surfaces under occlusion and clutter," in
2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission,
2011, .
[53] A. Krizhevsky, I. Sutskever and G. E. Hinton, "Imagenet classification with deep convolutional
neural networks," in Advances in Neural Information Processing Systems, 2012, .
[54] D. Maturana and S. Scherer, "Voxnet: A 3d convolutional neural network for real-time object
recognition," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference On,
2015, .
[55] Z. Wu et al, "3d shapenets: A deep representation for volumetric shapes," in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2015, .
[56] C. R. Qi et al, "Volumetric and multi-view cnns for object classification on 3d data," in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, .
[57] S. Song and J. Xiao, "Deep sliding shapes for amodal 3d object detection in rgb-d images," in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, .
[58] T. Hackel et al, "Semantic3d. net: A new large-scale point cloud classification benchmark," arXiv
Preprint arXiv:1704.03847, 2017.
[59] L. Tchapmi et al, "Segcloud: Semantic segmentation of 3d point clouds," in 2017 International
Conference on 3D Vision (3DV), 2017, .
[60] C. R. Qi et al, "Pointnet : Deep hierarchical feature learning on point sets in a metric space," in
Advances in Neural Information Processing Systems, 2017, .
[61] P. Hermosilla et al, "Monte carlo convolution for learning on non-uniformly sampled point clouds,"
in SIGGRAPH Asia 2018 Technical Papers, 2018, .
[62] C. R. Qi et al, "Frustum pointnets for 3d object detection from rgb-d data," in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018, .
[63] F. Engelmann et al, "Know what your neighbors do: 3d semantic segmentation of point clouds," in
Proceedings of the European Conference on Computer Vision (ECCV), 2018, .
[64] H. Thomas et al, "KPConv: Flexible and Deformable Convolution for Point Clouds," arXiv Preprint
arXiv:1904.08889, 2019.
[65] B. Yang et al, "Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds,"
arXiv Preprint arXiv:1906.01140, 2019.
[66] M. Jiang et al, "Pointsift: A sift-like network module for 3d point cloud semantic segmentation,"
arXiv Preprint arXiv:1807.00652, 2018.
[67] H. Su et al, "Splatnet: Sparse lattice networks for point cloud processing," in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018, .
[68] B. Wu et al, "Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object
segmentation from 3d lidar point cloud," in 2018 IEEE International Conference on Robotics and
Automation (ICRA), 2018, .
[69] B. Wu et al, "Squeezesegv2: Improved model structure and unsupervised domain adaptation for
road-object segmentation from a lidar point cloud," in 2019 International Conference on Robotics and
Automation (ICRA), 2019, .
[70] M. Engelcke et al, "Vote3deep: Fast object detection in 3d point clouds using efficient convolutional
neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, .
[71] L. Landrieu and M. Simonovsky, "Large-scale point cloud semantic segmentation with superpoint
graphs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, .
[72] I. Armeni et al, "Joint 2d-3d-semantic data for indoor scene understanding," arXiv Preprint
arXiv:1702.01105, 2017.
[73] A. Dai et al, "Scannet: Richly-annotated 3d reconstructions of indoor scenes," in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2017, .
[74] A. Dai et al, "Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface
reintegration," ACM Transactions on Graphics (ToG), vol. 36, (3), pp. 24, 2017.
[75] X. Roynard, J. Deschaud and F. Goulette, "Paris-Lille-3D: A large and high-quality ground-truth
urban point cloud dataset for automatic segmentation and classification," The International Journal of
Robotics Research, vol. 37, (6), pp. 545-557, 2018.
[76] J. Gehrung et al, "An approach to extract moving objects from MLS data using a volumetric
background representation," ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial
Information Sciences, vol. 4, pp. 107, 2017.
[77] D. Munoz et al, "Contextual classification with functional max-margin markov networks," in 2009
IEEE Conference on Computer Vision and Pattern Recognition, 2009, .
[78] A. Quadros, J. P. Underwood and B. Douillard, "An occlusion-aware feature for range images," in
2012 IEEE International Conference on Robotics and Automation, 2012, .
[79] A. Serna et al, "Paris-rue-madame database: A 3D mobile laser scanner dataset for benchmarking
urban detection, segmentation and classification methods," in 2014, .
[80] B. Vallet et al, "TerraMobilita/iQmulus urban point cloud analysis benchmark," Comput. Graph.,
vol. 49, pp. 126-133, 2015.
[81] M. Pal, "Random forest classifier for remote sensing classification," Int. J. Remote Sens., vol. 26, (1),
pp. 217-222, 2005.
[82] C. Zhang, Y. Zhou and F. Qiu, "Individual tree segmentation from LiDAR point clouds for urban
forest inventory," Remote Sensing, vol. 7, (6), pp. 7892-7913, 2015.
[83] B. Yang et al, "Automatic forest mapping at individual tree levels from terrestrial laser scanning
point clouds with a hierarchical minimum cut method," Remote Sensing, vol. 8, (5), pp. 372, 2016.
[84] F. Monnier, B. Vallet and B. Soheilian, "Trees detection from laser point clouds acquired in dense
urban areas by a mobile mapping system," Proceedings of the ISPRS Annals of the Photogrammetry,
Remote Sensing and Spatial Information Sciences (ISPRS Annals), Melbourne, Australia, vol. 25, pp.
245-250, 2012.
[85] Y. Huang et al, "Toward automatic estimation of urban green volume using airborne LiDAR data
and high resolution Remote Sensing images," Frontiers of Earth Science, vol. 7, (1), pp. 43-54, 2013.
[86] A. Persson, J. Holmgren and U. Soderman, "Detecting and measuring individual trees using an
airborne laser scanner," Photogramm. Eng. Remote Sensing, vol. 68, (9), pp. 925-932, 2002.
[87] T. Ritter et al, "Automatic Mapping of Forest Stands Based on Three-Dimensional Point Clouds
Derived from Terrestrial Laser-Scanning," Forests, vol. 8, (8), pp. 265, 2017.
[88] F. Fassi et al, "Comparison between laser scanning and automated 3d modelling techniques to
reconstruct complex and extensive cultural heritage areas," International Archives of the
Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 5, pp. W1, 2013.
[89] Q. Zhou and U. Neumann, "Fast and extensible building modeling from airborne LiDAR data," in
Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic
Information Systems, 2008, .
[90] O. Wang, S. K. Lodha and D. P. Helmbold, "A bayesian approach to building footprint extraction
from aerial lidar data," in 3D Data Processing, Visualization, and Transmission, Third International
Symposium On, 2006, .
[91] S. Sun and C. Salvaggio, "Aerial 3D building detection and modeling from airborne LiDAR point
clouds," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6, (3),
pp. 1440-1449, 2013.
[92] M. Awrangjeb and G. Lu, "Automatic building footprint extraction and regularisation from lidar
point cloud data," in Digital Lmage Computing: Techniques and Applications (DlCTA), 2014
International Conference On, 2014, .
[93] K. Zhang, J. Yan and S. Chen, "Automatic construction of building footprints from airborne LIDAR
data," IEEE Trans. Geosci. Remote Sens., vol. 44, (9), pp. 2523-2533, 2006.
[94] A. Vetrivel et al, "Identification of damage in buildings based on gaps in 3D point clouds from very
high resolution oblique airborne images," ISPRS Journal of Photogrammetry and Remote Sensing, vol.
105, pp. 61-78, 2015.
[95] E. H. Adelson, "On seeing stuff: The perception of materials by humans and machines," in Human
Vision and Electronic Imaging VI, 2001, .
[96] D. Hu, L. Bo and X. Ren, "Toward robust material recognition for everyday objects." in Bmvc, 2011,
.
[97] I. Brilakis, L. Soibelman and Y. Shinagawa, "Material-based construction site image retrieval," J.
Comput. Civ. Eng., vol. 19, (4), pp. 341-355, 2005.
[98] M. Varma and A. Zisserman, "A statistical approach to material classification using image patch
exemplars," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, (11), pp. 2032-2047, 2009.
[99] E. Hayman et al, "On the significance of real-world conditions for material classification," Computer
Vision-ECCV 2004, pp. 253-266, 2004.
[100] B. Caputo et al, "Classifying materials in the real world," Image Vision Comput., vol. 28, (1), pp.
150-163, 2010.
[101] J. Zhang et al, "Local features and kernels for classification of texture and object categories: A
comprehensive study," International Journal of Computer Vision, vol. 73, (2), pp. 213-238, 2007.
[102] L. Liu and P. Fieguth, "Texture classification from random features," IEEE Trans. Pattern Anal.
Mach. Intell., vol. 34, (3), pp. 574-586, 2012.
[103] K. J. Dana et al, "Reflectance and texture of real-world surfaces," ACM Transactions on Graphics
(TOG), vol. 18, (1), pp. 1-34, 1999.
[104] L. Sharan, R. Rosenholtz and E. Adelson, "Material perception: What can you see in a brief
glance?" Journal of Vision, vol. 9, (8), pp. 784-784, 2009.
[105] C. Liu et al, "Exploring features in a bayesian framework for material recognition," in Computer
Vision and Pattern Recognition (CVPR), 2010 IEEE Conference On, 2010, .
[106] A. Krizhevsky, I. Sutskever and G. E. Hinton, "Imagenet classification with deep convolutional
neural networks," in Advances in Neural Information Processing Systems, 2012, .
[107] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image
recognition," arXiv Preprint arXiv:1409.1556, 2014.
[108] C. Szegedy et al, "Going deeper with convolutions," in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2015, .
[109] K. He et al, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2016, .
[110] G. Huang et al, "Densely connected convolutional networks," arXiv Preprint arXiv:1608.06993,
2016.
[111] P. Sermanet et al, "Overfeat: Integrated recognition, localization and detection using convolutional
networks," arXiv Preprint arXiv:1312.6229, 2013.
[112] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European
Conference on Computer Vision, 2014, .
[113] G. Kalliatakis et al, "Evaluating deep convolutional neural networks for material classification,"
arXiv Preprint arXiv:1703.04101, 2017.
[114] M. Cimpoi, S. Maji and A. Vedaldi, "Deep filter banks for texture recognition and segmentation,"
in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, .
[115] F. Provost, "Machine learning from imbalanced data sets 101," in Proceedings of the AAAI’2000
Workshop on Imbalanced Data Sets, 2000, .
[116] G. Sithole and G. Vosselman, "Experimental comparison of filter algorithms for bare-Earth
extraction from airborne laser scanning point clouds," ISPRS Journal of Photogrammetry and Remote
Sensing, vol. 59, (1), pp. 85-101, 2004.
[117] R. Honsberger, Mathematical Gems II Dolcani Mathematical Expositions 2. 1976.
[118] T. Hackel, J. D. Wegner and K. Schindler, "Joint classification and contour extraction of large 3D
point clouds," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 130, pp. 231-245, 2017.
[119] M. Pauly, R. Keiser and M. Gross, "Multi‐scale feature extraction on Point‐Sampled surfaces," in
Computer Graphics Forum, 2003, .
[120] D. J. Bora, A. K. Gupta and F. A. Khan, "Comparing the performance of L* A* B* and HSV color
spaces with respect to color image segmentation," arXiv Preprint arXiv:1506.01472, 2015.
[121] B. Ciepłuch et al, "Comparison of the accuracy of OpenStreetMap for Ireland with Google Maps and Bing Maps," in Proceedings of the Ninth International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences, 20-23rd July 2010, 2010, .
[122] A. Sharif Razavian et al, "CNN features off-the-shelf: An astounding baseline for recognition," in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, .
[123] V. Vapnik, "Pattern recognition using generalized portrait method," Automation and Remote
Control, vol. 24, pp. 774-780, 1963.
[124] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learning, vol. 20, (3), pp. 273-297,
1995.
[125] C. Hsu, C. Chang and C. Lin, "A practical guide to support vector classification," 2003.
[126] J. R. Quinlan, "Induction of decision trees," Mach. Learning, vol. 1, (1), pp. 81-106, 1986.
[127] J. R. Quinlan, C4.5: Programs for Machine Learning. 1993.
[128] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Trans. Pattern
Anal. Mach. Intell., vol. 20, (8), pp. 832-844, 1998.
[129] T. K. Ho, "Random decision forests," in Document Analysis and Recognition, 1995., Proceedings
of the Third International Conference On, 1995, .
[130] R. McAlinden et al, "Procedural reconstruction of simulation terrain using drones," in Proc. of
Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), 2015, .
[131] E. Rupnik, F. Nex and F. Remondino, "Oblique multi-camera systems-orientation and dense
matching issues," The International Archives of Photogrammetry, Remote Sensing and Spatial
Information Sciences, vol. 40, (3), pp. 107, 2014.
[132] S. Ural et al, "Road and roadside feature extraction using imagery and LiDAR data for
transportation operation," ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information
Sciences, vol. 2, (3), pp. 239, 2015.
[133] O. Ronneberger, P. Fischer and T. Brox, "U-net: Convolutional networks for biomedical image
segmentation," in International Conference on Medical Image Computing and Computer-Assisted
Intervention, 2015, .
[134] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv Preprint
arXiv:1412.6980, 2014.
[135] A. Arnab et al, "Conditional random fields meet deep neural networks for semantic segmentation:
Combining probabilistic graphical models with deep learning for structured prediction," IEEE Signal
Process. Mag., vol. 35, (1), pp. 37-52, 2018.
[136] D. Gesch et al, "The national map—Elevation," US Geological Survey Fact Sheet, vol. 3053, (4),
2009.
[137] S. Shah et al, "Airsim: High-fidelity visual and physical simulation for autonomous vehicles," in
Field and Service Robotics, 2018, .
[138] Y. Zhang, P. David and B. Gong, "Curriculum domain adaptation for semantic segmentation of
urban scenes," in Proceedings of the IEEE International Conference on Computer Vision, 2017, .
[139] A. Atapour-Abarghouei and T. P. Breckon, "Real-time monocular depth estimation using synthetic
data with domain adaptation via image style transfer," in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2018, .
[140] N. Smith et al, "Aerial path planning for urban scene reconstruction: A continuous optimization
method and benchmark," ACM Transactions on Graphics (TOG), vol. 37, (6), pp. 183, 2019.